diff mbox series

phb4: Avoid complete reset without finishing slot polling

Message ID 20170914072840.11373-1-ruscur@russell.cc
State Changes Requested
Headers show
Series phb4: Avoid complete reset without finishing slot polling | expand

Commit Message

Russell Currey Sept. 14, 2017, 7:28 a.m. UTC
In rare circumstances, a complete reset can be requested on a slot while
it is still going through the link polling process.  This results in the
link never coming up.  In complete reset, check to see if the link needs
to complete its state transitions, and if so poll to completion.

Signed-off-by: Russell Currey <ruscur@russell.cc>
---
 hw/phb4.c | 14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

Comments

Stewart Smith Sept. 15, 2017, 7:02 a.m. UTC | #1
Russell Currey <ruscur@russell.cc> writes:
> In rare circumstances, a complete reset can be requested on a slot while
> it is still going through the link polling process.  This results in the
> link never coming up.  In complete reset, check to see if the link needs
> to complete its state transitions, and if so poll to completion.

I was having a bit of a problem with this on my (DD1) zz. I naturally
saved all the logs and spent a whole bunch of time debugging. No,
wait, that would have been all useful or something.

Instead, I went "err... stopped in PCI in kernel, let's randomly pull this
patch out to see if it fixes it"... and then it booted.

Any ideas? Want to borrow the system to tell me I just saw something
random and rather that your code is perfect?
Russell Currey Sept. 15, 2017, 7:15 a.m. UTC | #2
On Fri, 2017-09-15 at 17:02 +1000, Stewart Smith wrote:
> Russell Currey <ruscur@russell.cc> writes:
> > In rare circumstances, a complete reset can be requested on a slot while
> > it is still going through the link polling process.  This results in the
> > link never coming up.  In complete reset, check to see if the link needs
> > to complete its state transitions, and if so poll to completion.
> 
> I was having a bit of a problem with this on my (DD1) zz. I naturally
> saved all the logs and spent a whole bunch of time debugging. No,
> wait, that would have been all useful or something.
> 
> Instead, I went "err... stopped in PCI in kernel, let's randomly pull this
> patch out to see if it fixes it"... and then it booted.
> 
> Any ideas? Want to borrow the system to tell me I just saw something
> random and rather that your code is perfect?
> 

Sounds good.  This patch has been tested quite a bit and seems to be doing more
fixing than breaking, maybe there's some weird case on DD1
Stewart Smith Sept. 15, 2017, 7:34 a.m. UTC | #3
Russell Currey <ruscur@russell.cc> writes:
> On Fri, 2017-09-15 at 17:02 +1000, Stewart Smith wrote:
>> Russell Currey <ruscur@russell.cc> writes:
>> > In rare circumstances, a complete reset can be requested on a slot while
>> > it is still going through the link polling process.  This results in the
>> > link never coming up.  In complete reset, check to see if the link needs
>> > to complete its state transitions, and if so poll to completion.
>> 
>> I was having a bit of a problem with this on my (DD1) zz. I naturally
>> saved all the logs and spent a whole bunch of time debugging. No,
>> wait, that would have been all useful or something.
>> 
>> Instead, I went "err... stopped in PCI in kernel, let's randomly pull this
>> patch out to see if it fixes it"... and then it booted.
>> 
>> Any ideas? Want to borrow the system to tell me I just saw something
>> random and rather that your code is perfect?
>> 
>
> Sounds good.  This patch has been tested quite a bit and seems to be doing more
> fixing than breaking, maybe there's some weird case on DD1

or zz. or my zz. Or the random old FSP build that's on my zz. All the
things that could be wrong with prerelease hardware could well be wrong here.
Russell Currey Sept. 21, 2017, 5:25 a.m. UTC | #4
On Fri, 2017-09-15 at 17:34 +1000, Stewart Smith wrote:
> Russell Currey <ruscur@russell.cc> writes:
> > On Fri, 2017-09-15 at 17:02 +1000, Stewart Smith wrote:
> > > Russell Currey <ruscur@russell.cc> writes:
> > > > In rare circumstances, a complete reset can be requested on a slot while
> > > > it is still going through the link polling process.  This results in the
> > > > link never coming up.  In complete reset, check to see if the link needs
> > > > to complete its state transitions, and if so poll to completion.
> > > 
> > > I was having a bit of a problem with this on my (DD1) zz. I naturally
> > > saved all the logs and spent a whole bunch of time debugging. No,
> > > wait, that would have been all useful or something.
> > > 
> > > Instead, I went "err... stopped in PCI in kernel, let's randomly pull this
> > > patch out to see if it fixes it"... and then it booted.
> > > 
> > > Any ideas? Want to borrow the system to tell me I just saw something
> > > random and rather that your code is perfect?
> > > 
> > 
> > Sounds good.  This patch has been tested quite a bit and seems to be doing
> > more
> > fixing than breaking, maybe there's some weird case on DD1
> 
> or zz. or my zz. Or the random old FSP build that's on my zz. All the
> things that could be wrong with prerelease hardware could well be wrong here.
> 

Not sure if it's your issue or not, but there are some known problems with this
patch so hold off on merging, I'll respin once I figure it out.

Or merge it anyway.  The overall level of broken probably won't change much.
Stewart Smith Sept. 21, 2017, 9:26 a.m. UTC | #5
Russell Currey <ruscur@russell.cc> writes:
> On Fri, 2017-09-15 at 17:34 +1000, Stewart Smith wrote:
>> Russell Currey <ruscur@russell.cc> writes:
>> > On Fri, 2017-09-15 at 17:02 +1000, Stewart Smith wrote:
>> > > Russell Currey <ruscur@russell.cc> writes:
>> > > > In rare circumstances, a complete reset can be requested on a slot while
>> > > > it is still going through the link polling process.  This results in the
>> > > > link never coming up.  In complete reset, check to see if the link needs
>> > > > to complete its state transitions, and if so poll to completion.
>> > > 
>> > > I was having a bit of a problem with this on my (DD1) zz. I naturally
>> > > saved all the logs and spent a whole bunch of time debugging. No,
>> > > wait, that would have been all useful or something.
>> > > 
>> > > Instead, I went "err... stopped in PCI in kernel, let's randomly pull this
>> > > patch out to see if it fixes it"... and then it booted.
>> > > 
>> > > Any ideas? Want to borrow the system to tell me I just saw something
>> > > random and rather that your code is perfect?
>> > > 
>> > 
>> > Sounds good.  This patch has been tested quite a bit and seems to be doing
>> > more
>> > fixing than breaking, maybe there's some weird case on DD1
>> 
>> or zz. or my zz. Or the random old FSP build that's on my zz. All the
>> things that could be wrong with prerelease hardware could well be wrong here.
>> 
>
> Not sure if it's your issue or not, but there are some known problems with this
> patch so hold off on merging, I'll respin once I figure it out.
>
> Or merge it anyway.  The overall level of broken probably won't change much.

I bumped my FSP bulid and I'm *certainly* suffering an attack of the
killer IPMI on that system now (if I wasn't already).

Swearing at IPMI continues at a (slightly) increased rate.
Stewart Smith Sept. 28, 2017, 4:53 a.m. UTC | #6
Russell Currey <ruscur@russell.cc> writes:
> On Fri, 2017-09-15 at 17:34 +1000, Stewart Smith wrote:
>> Russell Currey <ruscur@russell.cc> writes:
>> > On Fri, 2017-09-15 at 17:02 +1000, Stewart Smith wrote:
>> > > Russell Currey <ruscur@russell.cc> writes:
>> > > > In rare circumstances, a complete reset can be requested on a slot while
>> > > > it is still going through the link polling process.  This results in the
>> > > > link never coming up.  In complete reset, check to see if the link needs
>> > > > to complete its state transitions, and if so poll to completion.
>> > > 
>> > > I was having a bit of a problem with this on my (DD1) zz. I naturally
>> > > saved all the logs and spent a whole bunch of time debugging. No,
>> > > wait, that would have been all useful or something.
>> > > 
>> > > Instead, I went "err... stopped in PCI in kernel, let's randomly pull this
>> > > patch out to see if it fixes it"... and then it booted.
>> > > 
>> > > Any ideas? Want to borrow the system to tell me I just saw something
>> > > random and rather that your code is perfect?
>> > > 
>> > 
>> > Sounds good.  This patch has been tested quite a bit and seems to be doing
>> > more
>> > fixing than breaking, maybe there's some weird case on DD1
>> 
>> or zz. or my zz. Or the random old FSP build that's on my zz. All the
>> things that could be wrong with prerelease hardware could well be wrong here.
>> 
>
> Not sure if it's your issue or not, but there are some known problems with this
> patch so hold off on merging, I'll respin once I figure it out.
>
> Or merge it anyway.  The overall level of broken probably won't change much.

Any further thoughts?
Russell Currey Sept. 28, 2017, 4:56 a.m. UTC | #7
On Thu, 2017-09-28 at 14:53 +1000, Stewart Smith wrote:
> Russell Currey <ruscur@russell.cc> writes:
> > On Fri, 2017-09-15 at 17:34 +1000, Stewart Smith wrote:
> > > Russell Currey <ruscur@russell.cc> writes:
> > > > On Fri, 2017-09-15 at 17:02 +1000, Stewart Smith wrote:
> > > > > Russell Currey <ruscur@russell.cc> writes:
> > > > > > In rare circumstances, a complete reset can be requested on a slot
> > > > > > while
> > > > > > it is still going through the link polling process.  This results in
> > > > > > the
> > > > > > link never coming up.  In complete reset, check to see if the link
> > > > > > needs
> > > > > > to complete its state transitions, and if so poll to completion.
> > > > > 
> > > > > I was having a bit of a problem with this on my (DD1) zz. I naturally
> > > > > saved all the logs and spent a whole bunch of time debugging. No,
> > > > > wait, that would have been all useful or something.
> > > > > 
> > > > > Instead, I went "err... stopped in PCI in kernel, let's randomly pull
> > > > > this
> > > > > patch out to see if it fixes it"... and then it booted.
> > > > > 
> > > > > Any ideas? Want to borrow the system to tell me I just saw something
> > > > > random and rather that your code is perfect?
> > > > > 
> > > > 
> > > > Sounds good.  This patch has been tested quite a bit and seems to be
> > > > doing
> > > > more
> > > > fixing than breaking, maybe there's some weird case on DD1
> > > 
> > > or zz. or my zz. Or the random old FSP build that's on my zz. All the
> > > things that could be wrong with prerelease hardware could well be wrong
> > > here.
> > > 
> > 
> > Not sure if it's your issue or not, but there are some known problems with
> > this
> > patch so hold off on merging, I'll respin once I figure it out.
> > 
> > Or merge it anyway.  The overall level of broken probably won't change much.
> 
> Any further thoughts?
> 
I haven't had time to look back into it yet.  It still needs to be reworked.
diff mbox series

Patch

diff --git a/hw/phb4.c b/hw/phb4.c
index 1e148e0b..8dd16b69 100644
--- a/hw/phb4.c
+++ b/hw/phb4.c
@@ -2744,11 +2744,15 @@  static int64_t phb4_creset(struct pci_slot *slot)
 		pci_slot_set_state(slot, PHB4_SLOT_NORMAL);
 		return slot->ops.freset(slot);
 	default:
-		PHBERR(p, "CRESET: Unexpected slot state %08x, resetting...\n",
-		       slot->state);
-		pci_slot_set_state(slot, PHB4_SLOT_NORMAL);
-		return slot->ops.creset(slot);
-
+		if (slot->state & PCI_SLOT_STATE_LINK) {
+			PHBINF(p, "CRESET: Slot needs link polling\n");
+			return slot->ops.poll_link(slot);
+		} else {
+			PHBERR(p, "CRESET: Unexpected slot state %08x\n",
+			       slot->state);
+			pci_slot_set_state(slot, PHB4_SLOT_NORMAL);
+			return slot->ops.creset(slot);
+		}
 	}
 
 error: