Message ID | 1251169150.4023.11.camel@obelisk.thedillows.org |
---|---|
State | RFC, archived |
Delegated to: | David Miller |
Headers | show |
David Dillow <dave@thedillows.org> writes: > On Mon, 2009-08-24 at 17:51 -0700, Eric W. Biederman wrote: >> When I decode the bits in status they are TxOK, RxOK and TxDescUnavail so it looks >> there is some bidirectional communication going on. >> >> Do we really want to loop when those bits are set? > > Maybe not when only those bits are set, but I worry that we would trade > one race for another where we stop getting interrupts from the card. > >> Perhaps we want to remove them from rtl_cfg_infos for the part? > > Then you'd never get an interrupt for them in the first place, I think. > > I'm not real happy with the interrupt handling in the driver; it makes a > certain amount of sense to split the MSI vs non-MSI interrupt cases out. > It also means another pass through re-auditing things against the vendor > driver. That's more work than I'm able to commit to at the moment. > > I've not been able to reproduce it locally on my r8169d, running for ~30 > minutes straight at full speed. I've not tried running it in UP, though. > Perhaps I can do that tomorrow. > > Here's a possible patch to mask the NAPI events while we're running in > NAPI mode. I'm not sure it is going to help, since the intr_mask was > 0xffff when you hit the loop guard, so I left it in for now. Interesting. If I understand this correctly the situation is that we have on the chip there is correct logic for a level triggered interrupt and that the msi logic sits on it and sends an event when the interrupt signal goes high, but when we acknowledge some bits but not all it does not send another interrupt. Baring playing games with what version of the card has working logic and which does not we seem to have to simple choices (if we don't want to loop possibly forever). - Don't use the msi logic on this card. - Move all of the logic into rtl8169_poll and only come out of NAPI mode when we have caught up with all of the interrupt work. Is that how you understand the hardware issue you are trying to work around? Eric -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, 2009-08-25 at 13:22 -0700, Eric W. Biederman wrote: > David Dillow <dave@thedillows.org> writes: > > I'm not real happy with the interrupt handling in the driver; it makes a > > certain amount of sense to split the MSI vs non-MSI interrupt cases out. > > It also means another pass through re-auditing things against the vendor > > driver. That's more work than I'm able to commit to at the moment. > > > > I've not been able to reproduce it locally on my r8169d, running for ~30 > > minutes straight at full speed. I've not tried running it in UP, though. > > Perhaps I can do that tomorrow. > > > > Here's a possible patch to mask the NAPI events while we're running in > > NAPI mode. I'm not sure it is going to help, since the intr_mask was > > 0xffff when you hit the loop guard, so I left it in for now. > > Interesting. > > If I understand this correctly the situation is that we have on the > chip there is correct logic for a level triggered interrupt and that > the msi logic sits on it and sends an event when the interrupt signal > goes high, but when we acknowledge some bits but not all it does not > send another interrupt. Correct, we have to acknowledge all current outstanding event sources before we get another MSI interrupt. It looks like the MSI interrupt is triggered on the edge transition of a logical OR of all irq sources. > Baring playing games with what version of the card has working logic > and which does not we seem to have to simple choices (if we don't want > to loop possibly forever). > - Don't use the msi logic on this card. > - Move all of the logic into rtl8169_poll and only come out of NAPI > mode when we have caught up with all of the interrupt work. > > Is that how you understand the hardware issue you are trying to work > around? That's how I understood the issue I was working around with the problematic patch, but I thought I had covered both issues fairly well without having to split the handling any further -- we ACK all existing sources each pass through the loop, so we'll get a new interrupt on the unmasked events, but not on ones we've masked out for NAPI until NAPI completes and unmasks them. I'm curious how you managed to receive an packet between us clearing the all current sources and reading the current source list continuously for 60+ seconds -- the loop is basically status = get IRQ events from chip while (status) { /* process events, start NAPI if needed */ clear current events from chip status = get IRQ events from chip } That seems like a very small race window to consistently hit -- especially for long enough to trigger soft lockups. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
David Dillow <dave@thedillows.org> writes: > On Tue, 2009-08-25 at 13:22 -0700, Eric W. Biederman wrote: >> David Dillow <dave@thedillows.org> writes: >> > I'm not real happy with the interrupt handling in the driver; it makes a >> > certain amount of sense to split the MSI vs non-MSI interrupt cases out. >> > It also means another pass through re-auditing things against the vendor >> > driver. That's more work than I'm able to commit to at the moment. >> > >> > I've not been able to reproduce it locally on my r8169d, running for ~30 >> > minutes straight at full speed. I've not tried running it in UP, though. >> > Perhaps I can do that tomorrow. >> > >> > Here's a possible patch to mask the NAPI events while we're running in >> > NAPI mode. I'm not sure it is going to help, since the intr_mask was >> > 0xffff when you hit the loop guard, so I left it in for now. >> >> Interesting. >> >> If I understand this correctly the situation is that we have on the >> chip there is correct logic for a level triggered interrupt and that >> the msi logic sits on it and sends an event when the interrupt signal >> goes high, but when we acknowledge some bits but not all it does not >> send another interrupt. > > Correct, we have to acknowledge all current outstanding event sources > before we get another MSI interrupt. It looks like the MSI interrupt is > triggered on the edge transition of a logical OR of all irq sources. > >> Baring playing games with what version of the card has working logic >> and which does not we seem to have to simple choices (if we don't want >> to loop possibly forever). >> - Don't use the msi logic on this card. >> - Move all of the logic into rtl8169_poll and only come out of NAPI >> mode when we have caught up with all of the interrupt work. >> >> Is that how you understand the hardware issue you are trying to work >> around? > > That's how I understood the issue I was working around with the > problematic patch, but I thought I had covered both issues fairly well > without having to split the handling any further -- we ACK all existing > sources each pass through the loop, so we'll get a new interrupt on the > unmasked events, but not on ones we've masked out for NAPI until NAPI > completes and unmasks them. > I'm curious how you managed to receive an packet between us clearing the > all current sources and reading the current source list continuously for > 60+ seconds -- the loop is basically > status = get IRQ events from chip > while (status) { > /* process events, start NAPI if needed */ > clear current events from chip > status = get IRQ events from chip > } > > That seems like a very small race window to consistently hit -- > especially for long enough to trigger soft lockups. Interesting indeed. When I hit the guard we had popped out of NAPI mode while we were in the loop. The only way to do that is if poll and interrupt were running on different cpus. I am a bit curious about TxDescUnavail. Perhaps we had a temporary memory shortage and that is what was screaming? I don't think we do anything at all with that state. Perhaps the flaw here is simply not masking TxDescUnavail while we are in NAPI mode? Eric -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
David Dillow <dave@thedillows.org> writes: > On Mon, 2009-08-24 at 17:51 -0700, Eric W. Biederman wrote: >> When I decode the bits in status they are TxOK, RxOK and TxDescUnavail so it looks >> there is some bidirectional communication going on. >> >> Do we really want to loop when those bits are set? > > Maybe not when only those bits are set, but I worry that we would trade > one race for another where we stop getting interrupts from the card. > >> Perhaps we want to remove them from rtl_cfg_infos for the part? > > Then you'd never get an interrupt for them in the first place, I think. > > I'm not real happy with the interrupt handling in the driver; it makes a > certain amount of sense to split the MSI vs non-MSI interrupt cases out. > It also means another pass through re-auditing things against the vendor > driver. That's more work than I'm able to commit to at the moment. > > I've not been able to reproduce it locally on my r8169d, running for ~30 > minutes straight at full speed. I've not tried running it in UP, though. > Perhaps I can do that tomorrow. > > Here's a possible patch to mask the NAPI events while we're running in > NAPI mode. I'm not sure it is going to help, since the intr_mask was > 0xffff when you hit the loop guard, so I left it in for now. Ok. I now get what your patch was trying to do and that does look like a reasonable test. I prefer: while ((status != 0xffff) && (status & tp->intr_mask)) The presence of TxDescUnavail suggests as is usually the case that the interrupt status bits are independent of which interrupts are actually enabled to fire. I will take a moment and give that a try. I still like the idea of masking everything having poll do all of the work and then unmasking everything. That seems a little less fragile to me. Eric > diff --git a/drivers/net/r8169.c b/drivers/net/r8169.c > index b82780d..12755b7 100644 > --- a/drivers/net/r8169.c > +++ b/drivers/net/r8169.c > @@ -3556,6 +3556,7 @@ static irqreturn_t rtl8169_interrupt(int irq, void *dev_instance) > void __iomem *ioaddr = tp->mmio_addr; > int handled = 0; > int status; > + int count = 0; > > /* loop handling interrupts until we have no new ones or > * we hit a invalid/hotplug case. > @@ -3564,6 +3565,15 @@ static irqreturn_t rtl8169_interrupt(int irq, void *dev_instance) > while (status && status != 0xffff) { > handled = 1; > > + if (count++ > 100) { > + printk_once("r8169 screaming irq status %08x " > + "mask %08x event %08x napi %08x\n", > + status, tp->intr_mask, tp->intr_event, > + tp->napi_event); > + break; > + } > + > + > /* Handle all of the error cases first. These will reset > * the chip, so just exit the loop. > */ > @@ -3613,6 +3623,7 @@ static irqreturn_t rtl8169_interrupt(int irq, void *dev_instance) > RTL_W16(IntrStatus, > (status & RxFIFOOver) ? (status | RxOverflow) : status); > status = RTL_R16(IntrStatus); > + status &= tp->intr_mask; > } > > return IRQ_RETVAL(handled); -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, 2009-08-25 at 14:24 -0700, Eric W. Biederman wrote: > David Dillow <dave@thedillows.org> writes: > > I'm curious how you managed to receive an packet between us clearing the > > all current sources and reading the current source list continuously for > > 60+ seconds -- the loop is basically > > > > status = get IRQ events from chip > > while (status) { > > /* process events, start NAPI if needed */ > > clear current events from chip > > status = get IRQ events from chip > > } > > > > That seems like a very small race window to consistently hit -- > > especially for long enough to trigger soft lockups. > > Interesting indeed. When I hit the guard we had popped out of NAPI > mode while we were in the loop. The only way to do that is if > poll and interrupt were running on different cpus. That is the normal case on an SMP machine, but again that race window should be fairly small as well -- from the __napi_schedule() to the acking of the interrupt source is only a few lines of code, most of which is in an error case that is skipped. Granted there may be a fair number of instructions there if debugging or tracing is on -- I've not checked -- but even then hitting that race consistently for 60+ seconds doesn't seem likely. Being out of NAPI in the guard may be a red herring -- it doesn't tell us how long you were out of NAPI when you hit it. If there's a stuck bit somewhere, then you could have been out of NAPI after the first cycle and we'd have no way to tell. You could add some variables to keep track of the status and mask values, and how long ago they changed to see. > I am a bit curious about TxDescUnavail. Perhaps we had a temporary > memory shortage and that is what was screaming? I don't think we do > anything at all with that state. TxDescUnavail is normal -- it means the chip finished sending everything we asked it to. > Perhaps the flaw here is simply not masking TxDescUnavail while we are > in NAPI mode? No, we never enable it on the chip, and it gets masked out when we decide if we want to go to NAPI mode -- it is not set in tp->napi_event: if (status & tp->intr_mask & tp->napi_event) { -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, 2009-08-25 at 14:37 -0700, Eric W. Biederman wrote: > David Dillow <dave@thedillows.org> writes: > > Here's a possible patch to mask the NAPI events while we're running in > > NAPI mode. I'm not sure it is going to help, since the intr_mask was > > 0xffff when you hit the loop guard, so I left it in for now. > > Ok. I now get what your patch was trying to do and that does look like > a reasonable test. > > I prefer: > while ((status != 0xffff) && (status & tp->intr_mask)) I had thought of going that route first, but if you have any interrupt event sources set, you want to enter the loop at least once to clear them, otherwise you never see another MSI interrupt. If the masking is the way things play out, then I'd put it where I had it and put in a comment as to why it is there. > The presence of TxDescUnavail suggests as is usually the case > that the interrupt status bits are independent of which interrupts > are actually enabled to fire. Yes, but I seem to recall the MSI's edge detection being especially oddly done -- I did tests with various masks and using the ability to have it generate an interrupt on user demand, and IIRC it was handled before the mask was applied, so we really did care about the events that were active -- but I may misremember. > I will take a moment and give that a try. > > I still like the idea of masking everything having poll do all > of the work and then unmasking everything. That seems a little less > fragile to me. I wouldn't object if you did it, but I don't have time for it right now. And it may make Francois's life harder when he does his periodic sweep of the vendor driver, looking for differences. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Eric W. Biederman <ebiederm@xmission.com> : [...] > I am a bit curious about TxDescUnavail. Perhaps we had a temporary > memory shortage and that is what was screaming? I don't think we do > anything at all with that state. You are not alone, the driver completely ignores this bit. As far as I remember, the TxDescUnavail event mostly pops up when the driver makes an excessive use of TxPoll requests. > Perhaps the flaw here is simply not masking TxDescUnavail while we are > in NAPI mode ? Yes, it is worth trying.
David Dillow <dave@thedillows.org> : [...] > I wouldn't object if you did it, but I don't have time for it right now. > And it may make Francois's life harder when he does his periodic sweep > of the vendor driver, looking for differences. This part of Realtek's driver(s) is not too tricky (I wonder if some code is there by design or accident but it is a different story). I do not feel safe with the TxDescUnavail bit : the driver does not explicitely do anything to handle it but the behavior of the driver can change depending on it. :o/
diff --git a/drivers/net/r8169.c b/drivers/net/r8169.c index b82780d..12755b7 100644 --- a/drivers/net/r8169.c +++ b/drivers/net/r8169.c @@ -3556,6 +3556,7 @@ static irqreturn_t rtl8169_interrupt(int irq, void *dev_instance) void __iomem *ioaddr = tp->mmio_addr; int handled = 0; int status; + int count = 0; /* loop handling interrupts until we have no new ones or * we hit a invalid/hotplug case. @@ -3564,6 +3565,15 @@ static irqreturn_t rtl8169_interrupt(int irq, void *dev_instance) while (status && status != 0xffff) { handled = 1; + if (count++ > 100) { + printk_once("r8169 screaming irq status %08x " + "mask %08x event %08x napi %08x\n", + status, tp->intr_mask, tp->intr_event, + tp->napi_event); + break; + } + + /* Handle all of the error cases first. These will reset * the chip, so just exit the loop. */ @@ -3613,6 +3623,7 @@ static irqreturn_t rtl8169_interrupt(int irq, void *dev_instance) RTL_W16(IntrStatus, (status & RxFIFOOver) ? (status | RxOverflow) : status); status = RTL_R16(IntrStatus); + status &= tp->intr_mask; } return IRQ_RETVAL(handled);