[2.6.30-rc4] r8169: avoid losing MSI interrupts

Message ID	1251169150.4023.11.camel@obelisk.thedillows.org
State	RFC, archived
Delegated to:	David Miller
Headers	show Return-Path: <netdev-owner@vger.kernel.org> Subject: Re: [PATCH 2.6.30-rc4] r8169: avoid losing MSI interrupts From: David Dillow <dave@thedillows.org> To: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Michael Riepe <michael.riepe@googlemail.com>, Michael Buesch <mb@bu3sch.de>, Francois Romieu <romieu@fr.zoreil.com>, Rui Santos <rsantos@grupopie.com>, Michael =?ISO-8859-1?Q?B=FCker?= <m.bueker@berlin.de>, linux-kernel@vger.kernel.org, netdev@vger.kernel.org In-Reply-To: <m1r5v0ogvw.fsf@fess.ebiederm.org> References: <200903041828.49972.m.bueker@berlin.de> <1242001754.4093.12.camel@obelisk.thedillows.org> <200905112248.44868.mb@bu3sch.de> <200905112310.08534.mb@bu3sch.de> <1242077392.3716.15.camel@lap75545.ornl.gov> <4A09DC3E.2080807@googlemail.com> <1242268709.4979.7.camel@obelisk.thedillows.org> <4A0C6504.8000704@googlemail.com> <1242328457.32579.12.camel@lap75545.ornl.gov> <4A0C7443.1010000@googlemail.com> <1243042174.3580.23.camel@obelisk.thedillows.org> <m1skfkrik2.fsf@fess.ebiederm.org> <1250895567.23419.1.camel@obelisk.thedillows.org> <1250897657.23419.5.camel@obelisk.thedillows.org> <m1iqggf4rf.fsf@fess.ebiederm.org> <m18whcf3w2.fsf@fess.ebiederm.org> <1250973787.3582.14.camel@obelisk.thedillows.org> <m1r5v0ogvw.fsf@fess.ebiederm.org> Content-Type: text/plain Date: Mon, 24 Aug 2009 22:59:10 -0400 Message-Id: <1251169150.4023.11.camel@obelisk.thedillows.org> Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Sender: netdev-owner@vger.kernel.org Precedence: bulk

David Dillow Aug. 25, 2009, 2:59 a.m. UTC

On Mon, 2009-08-24 at 17:51 -0700, Eric W. Biederman wrote:
> When I decode the bits in status they are TxOK, RxOK and TxDescUnavail so it looks
> there is some bidirectional communication going on.
> 
> Do we really want to loop when those bits are set?

Maybe not when only those bits are set, but I worry that we would trade
one race for another where we stop getting interrupts from the card.

> Perhaps we want to remove them from rtl_cfg_infos for the part?

Then you'd never get an interrupt for them in the first place, I think.

I'm not real happy with the interrupt handling in the driver; it makes a
certain amount of sense to split the MSI vs non-MSI interrupt cases out.
It also means another pass through re-auditing things against the vendor
driver. That's more work than I'm able to commit to at the moment.

I've not been able to reproduce it locally on my r8169d, running for ~30
minutes straight at full speed. I've not tried running it in UP, though.
Perhaps I can do that tomorrow.

Here's a possible patch to mask the NAPI events while we're running in
NAPI mode. I'm not sure it is going to help, since the intr_mask was
0xffff when you hit the loop guard, so I left it in for now.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Eric W. Biederman Aug. 25, 2009, 8:22 p.m. UTC | #1

David Dillow <dave@thedillows.org> writes:

> On Mon, 2009-08-24 at 17:51 -0700, Eric W. Biederman wrote:
>> When I decode the bits in status they are TxOK, RxOK and TxDescUnavail so it looks
>> there is some bidirectional communication going on.
>> 
>> Do we really want to loop when those bits are set?
>
> Maybe not when only those bits are set, but I worry that we would trade
> one race for another where we stop getting interrupts from the card.
>
>> Perhaps we want to remove them from rtl_cfg_infos for the part?
>
> Then you'd never get an interrupt for them in the first place, I think.
>
> I'm not real happy with the interrupt handling in the driver; it makes a
> certain amount of sense to split the MSI vs non-MSI interrupt cases out.
> It also means another pass through re-auditing things against the vendor
> driver. That's more work than I'm able to commit to at the moment.
>
> I've not been able to reproduce it locally on my r8169d, running for ~30
> minutes straight at full speed. I've not tried running it in UP, though.
> Perhaps I can do that tomorrow.
>
> Here's a possible patch to mask the NAPI events while we're running in
> NAPI mode. I'm not sure it is going to help, since the intr_mask was
> 0xffff when you hit the loop guard, so I left it in for now.

Interesting.

If I understand this correctly the situation is that we have on the
chip there is correct logic for a level triggered interrupt and that
the msi logic sits on it and sends an event when the interrupt signal
goes high, but when we acknowledge some bits but not all it does not
send another interrupt.

Baring playing games with what version of the card has working logic
and which does not we seem to have to simple choices (if we don't want
to loop possibly forever).
- Don't use the msi logic on this card.
- Move all of the logic into rtl8169_poll and only come out of NAPI
  mode when we have caught up with all of the interrupt work.

Is that how you understand the hardware issue you are trying to work
around?

Eric

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

David Dillow Aug. 25, 2009, 8:40 p.m. UTC | #2

On Tue, 2009-08-25 at 13:22 -0700, Eric W. Biederman wrote:
> David Dillow <dave@thedillows.org> writes:
> > I'm not real happy with the interrupt handling in the driver; it makes a
> > certain amount of sense to split the MSI vs non-MSI interrupt cases out.
> > It also means another pass through re-auditing things against the vendor
> > driver. That's more work than I'm able to commit to at the moment.
> >
> > I've not been able to reproduce it locally on my r8169d, running for ~30
> > minutes straight at full speed. I've not tried running it in UP, though.
> > Perhaps I can do that tomorrow.
> >
> > Here's a possible patch to mask the NAPI events while we're running in
> > NAPI mode. I'm not sure it is going to help, since the intr_mask was
> > 0xffff when you hit the loop guard, so I left it in for now.
> 
> Interesting.
> 
> If I understand this correctly the situation is that we have on the
> chip there is correct logic for a level triggered interrupt and that
> the msi logic sits on it and sends an event when the interrupt signal
> goes high, but when we acknowledge some bits but not all it does not
> send another interrupt.

Correct, we have to acknowledge all current outstanding event sources
before we get another MSI interrupt. It looks like the MSI interrupt is
triggered on the edge transition of a logical OR of all irq sources.

> Baring playing games with what version of the card has working logic
> and which does not we seem to have to simple choices (if we don't want
> to loop possibly forever).
> - Don't use the msi logic on this card.
> - Move all of the logic into rtl8169_poll and only come out of NAPI
>   mode when we have caught up with all of the interrupt work.
> 
> Is that how you understand the hardware issue you are trying to work
> around?

That's how I understood the issue I was working around with the
problematic patch, but I thought I had covered both issues fairly well
without having to split the handling any further -- we ACK all existing
sources each pass through the loop, so we'll get a new interrupt on the
unmasked events, but not on ones we've masked out for NAPI until NAPI
completes and unmasks them.

I'm curious how you managed to receive an packet between us clearing the
all current sources and reading the current source list continuously for
60+ seconds -- the loop is basically

status = get IRQ events from chip
while (status) {
	/* process events, start NAPI if needed */
	clear current events from chip
	status = get IRQ events from chip
}

That seems like a very small race window to consistently hit --
especially for long enough to trigger soft lockups.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Eric W. Biederman Aug. 25, 2009, 9:24 p.m. UTC | #3

David Dillow <dave@thedillows.org> writes:

> On Tue, 2009-08-25 at 13:22 -0700, Eric W. Biederman wrote:
>> David Dillow <dave@thedillows.org> writes:
>> > I'm not real happy with the interrupt handling in the driver; it makes a
>> > certain amount of sense to split the MSI vs non-MSI interrupt cases out.
>> > It also means another pass through re-auditing things against the vendor
>> > driver. That's more work than I'm able to commit to at the moment.
>> >
>> > I've not been able to reproduce it locally on my r8169d, running for ~30
>> > minutes straight at full speed. I've not tried running it in UP, though.
>> > Perhaps I can do that tomorrow.
>> >
>> > Here's a possible patch to mask the NAPI events while we're running in
>> > NAPI mode. I'm not sure it is going to help, since the intr_mask was
>> > 0xffff when you hit the loop guard, so I left it in for now.
>> 
>> Interesting.
>> 
>> If I understand this correctly the situation is that we have on the
>> chip there is correct logic for a level triggered interrupt and that
>> the msi logic sits on it and sends an event when the interrupt signal
>> goes high, but when we acknowledge some bits but not all it does not
>> send another interrupt.
>
> Correct, we have to acknowledge all current outstanding event sources
> before we get another MSI interrupt. It looks like the MSI interrupt is
> triggered on the edge transition of a logical OR of all irq sources.
>
>> Baring playing games with what version of the card has working logic
>> and which does not we seem to have to simple choices (if we don't want
>> to loop possibly forever).
>> - Don't use the msi logic on this card.
>> - Move all of the logic into rtl8169_poll and only come out of NAPI
>>   mode when we have caught up with all of the interrupt work.
>> 
>> Is that how you understand the hardware issue you are trying to work
>> around?
>
> That's how I understood the issue I was working around with the
> problematic patch, but I thought I had covered both issues fairly well
> without having to split the handling any further -- we ACK all existing
> sources each pass through the loop, so we'll get a new interrupt on the
> unmasked events, but not on ones we've masked out for NAPI until NAPI
> completes and unmasks them.


> I'm curious how you managed to receive an packet between us clearing the
> all current sources and reading the current source list continuously for
> 60+ seconds -- the loop is basically


> status = get IRQ events from chip
> while (status) {
> 	/* process events, start NAPI if needed */
> 	clear current events from chip
> 	status = get IRQ events from chip
> }
>
> That seems like a very small race window to consistently hit --
> especially for long enough to trigger soft lockups.

Interesting indeed.  When I hit the guard we had popped out of NAPI
mode while we were in the loop.  The only way to do that is if
poll and interrupt were running on different cpus.

I am a bit curious about TxDescUnavail.  Perhaps we had a temporary
memory shortage and that is what was screaming?  I don't think we do
anything at all with that state.

Perhaps the flaw here is simply not masking TxDescUnavail while we are
in NAPI mode?

Eric
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Eric W. Biederman Aug. 25, 2009, 9:37 p.m. UTC | #4

David Dillow <dave@thedillows.org> writes:

> On Mon, 2009-08-24 at 17:51 -0700, Eric W. Biederman wrote:
>> When I decode the bits in status they are TxOK, RxOK and TxDescUnavail so it looks
>> there is some bidirectional communication going on.
>> 
>> Do we really want to loop when those bits are set?
>
> Maybe not when only those bits are set, but I worry that we would trade
> one race for another where we stop getting interrupts from the card.
>
>> Perhaps we want to remove them from rtl_cfg_infos for the part?
>
> Then you'd never get an interrupt for them in the first place, I think.
>
> I'm not real happy with the interrupt handling in the driver; it makes a
> certain amount of sense to split the MSI vs non-MSI interrupt cases out.
> It also means another pass through re-auditing things against the vendor
> driver. That's more work than I'm able to commit to at the moment.
>
> I've not been able to reproduce it locally on my r8169d, running for ~30
> minutes straight at full speed. I've not tried running it in UP, though.
> Perhaps I can do that tomorrow.
>
> Here's a possible patch to mask the NAPI events while we're running in
> NAPI mode. I'm not sure it is going to help, since the intr_mask was
> 0xffff when you hit the loop guard, so I left it in for now.

Ok.  I now get what your patch was trying to do and that does look like
a reasonable test.  

I prefer:
while ((status != 0xffff) && (status & tp->intr_mask))

The presence of TxDescUnavail suggests as is usually the case
that the interrupt status bits are independent of which interrupts
are actually enabled to fire.

I will take a moment and give that a try.

I still like the idea of masking everything having poll do all
of the work and then unmasking everything.  That seems a little less
fragile to me.

Eric

> diff --git a/drivers/net/r8169.c b/drivers/net/r8169.c
> index b82780d..12755b7 100644
> --- a/drivers/net/r8169.c
> +++ b/drivers/net/r8169.c
> @@ -3556,6 +3556,7 @@ static irqreturn_t rtl8169_interrupt(int irq, void *dev_instance)
>  	void __iomem *ioaddr = tp->mmio_addr;
>  	int handled = 0;
>  	int status;
> +	int count = 0;
>  
>  	/* loop handling interrupts until we have no new ones or
>  	 * we hit a invalid/hotplug case.
> @@ -3564,6 +3565,15 @@ static irqreturn_t rtl8169_interrupt(int irq, void *dev_instance)
>  	while (status && status != 0xffff) {
>  		handled = 1;
>  
> +		if (count++ > 100) {
> +			printk_once("r8169 screaming irq status %08x "
> +				"mask %08x event %08x napi %08x\n",
> +				status, tp->intr_mask, tp->intr_event,
> +				tp->napi_event);
> +			break;
> +		}
> +
> +
>  		/* Handle all of the error cases first. These will reset
>  		 * the chip, so just exit the loop.
>  		 */
> @@ -3613,6 +3623,7 @@ static irqreturn_t rtl8169_interrupt(int irq, void *dev_instance)
>  		RTL_W16(IntrStatus,
>  			(status & RxFIFOOver) ? (status | RxOverflow) : status);
>  		status = RTL_R16(IntrStatus);
> +		status &= tp->intr_mask;
>  	}
>  
>  	return IRQ_RETVAL(handled);
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

David Dillow Aug. 25, 2009, 9:46 p.m. UTC | #5

On Tue, 2009-08-25 at 14:24 -0700, Eric W. Biederman wrote:
> David Dillow <dave@thedillows.org> writes:
> > I'm curious how you managed to receive an packet between us clearing the
> > all current sources and reading the current source list continuously for
> > 60+ seconds -- the loop is basically
> 
> 
> > status = get IRQ events from chip
> > while (status) {
> > 	/* process events, start NAPI if needed */
> > 	clear current events from chip
> > 	status = get IRQ events from chip
> > }
> >
> > That seems like a very small race window to consistently hit --
> > especially for long enough to trigger soft lockups.
> 
> Interesting indeed.  When I hit the guard we had popped out of NAPI
> mode while we were in the loop.  The only way to do that is if
> poll and interrupt were running on different cpus.

That is the normal case on an SMP machine, but again that race window
should be fairly small as well -- from the __napi_schedule() to the
acking of the interrupt source is only a few lines of code, most of
which is in an error case that is skipped. Granted there may be a fair
number of instructions there if debugging or tracing is on -- I've not
checked -- but even then hitting that race consistently for 60+ seconds
doesn't seem likely.

Being out of NAPI in the guard may be a red herring -- it doesn't tell
us how long you were out of NAPI when you hit it. If there's a stuck bit
somewhere, then you could have been out of NAPI after the first cycle
and we'd have no way to tell. You could add some variables to keep track
of the status and mask values, and how long ago they changed to see.

> I am a bit curious about TxDescUnavail.  Perhaps we had a temporary
> memory shortage and that is what was screaming?  I don't think we do
> anything at all with that state.

TxDescUnavail is normal -- it means the chip finished sending everything
we asked it to.

> Perhaps the flaw here is simply not masking TxDescUnavail while we are
> in NAPI mode?

No, we never enable it on the chip, and it gets masked out when we
decide if we want to go to NAPI mode -- it is not set in tp->napi_event:

	if (status & tp->intr_mask & tp->napi_event) {

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

David Dillow Aug. 25, 2009, 9:54 p.m. UTC | #6

On Tue, 2009-08-25 at 14:37 -0700, Eric W. Biederman wrote:
> David Dillow <dave@thedillows.org> writes:
> > Here's a possible patch to mask the NAPI events while we're running in
> > NAPI mode. I'm not sure it is going to help, since the intr_mask was
> > 0xffff when you hit the loop guard, so I left it in for now.
> 
> Ok.  I now get what your patch was trying to do and that does look like
> a reasonable test.  
> 
> I prefer:
> while ((status != 0xffff) && (status & tp->intr_mask))

I had thought of going that route first, but if you have any interrupt
event sources set, you want to enter the loop at least once to clear
them, otherwise you never see another MSI interrupt.

If the masking is the way things play out, then I'd put it where I had
it and put in a comment as to why it is there.

> The presence of TxDescUnavail suggests as is usually the case
> that the interrupt status bits are independent of which interrupts
> are actually enabled to fire.

Yes, but I seem to recall the MSI's edge detection being especially
oddly done -- I did tests with various masks and using the ability to
have it generate an interrupt on user demand, and IIRC it was handled
before the mask was applied, so we really did care about the events that
were active -- but I may misremember.

> I will take a moment and give that a try.
> 
> I still like the idea of masking everything having poll do all
> of the work and then unmasking everything.  That seems a little less
> fragile to me.

I wouldn't object if you did it, but I don't have time for it right now.
And it may make Francois's life harder when he does his periodic sweep
of the vendor driver, looking for differences.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Francois Romieu Aug. 25, 2009, 10:19 p.m. UTC | #7

Eric W. Biederman <ebiederm@xmission.com> :
[...]
> I am a bit curious about TxDescUnavail.  Perhaps we had a temporary
> memory shortage and that is what was screaming?  I don't think we do
> anything at all with that state.

You are not alone, the driver completely ignores this bit.

As far as I remember, the TxDescUnavail event mostly pops up when the
driver makes an excessive use of TxPoll requests.

> Perhaps the flaw here is simply not masking TxDescUnavail while we are
> in NAPI mode ?

Yes, it is worth trying.

Francois Romieu Aug. 25, 2009, 11:11 p.m. UTC | #8

David Dillow <dave@thedillows.org> :
[...]
> I wouldn't object if you did it, but I don't have time for it right now.
> And it may make Francois's life harder when he does his periodic sweep
> of the vendor driver, looking for differences.

This part of Realtek's driver(s) is not too tricky (I wonder if some code is
there by design or accident but it is a different story).

I do not feel safe with the TxDescUnavail bit : the driver does not
explicitely do anything to handle it but the behavior of the driver
can change depending on it. :o/

[2.6.30-rc4] r8169: avoid losing MSI interrupts

Commit Message

Comments

Patch