diff mbox

[RFC] bnx2x: fix tx queue locking and memory barriers

Message ID 20100225140834.0169e9f2@dhcp-lab-109.englab.brq.redhat.com
State Superseded, archived
Delegated to: David Miller
Headers show

Commit Message

Stanislaw Gruszka Feb. 25, 2010, 1:08 p.m. UTC
We have done some optimizations in bnx2x_start_xmit() and bnx2x_tx_int(), which 
in my opinion can lead into some theoretical race conditions. 

I can be pretty wrong here, but if so, we have to optimize some other drivers,
which use memory barriers/locking schema from that patch (like tg3, bnx2).

Memory barriers here IMHO, prevent to make queue permanently stopped when on one
cpu bnx2x_tx_int() make queue empty, whereas on other cpu bnx2x_start_xmit() see
it full and make stop it, such cause queue will be stopped forever. 

I'm not quite sure what for is __netif_tx_lock, but other drivers use it.




--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

David Miller Feb. 25, 2010, 10:18 a.m. UTC | #1
From: Stanislaw Gruszka <sgruszka@redhat.com>
Date: Thu, 25 Feb 2010 14:08:34 +0100

> Memory barriers here IMHO, prevent to make queue permanently stopped
> when on one cpu bnx2x_tx_int() make queue empty, whereas on other
> cpu bnx2x_start_xmit() see it full and make stop it, such cause
> queue will be stopped forever.

Instead of having an opinion, please show the exact sequence
of events that can lead to this situation.  With such facts
inhand, you will have no need for an opinion :-)
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Vladislav Zolotarov Feb. 25, 2010, 1:28 p.m. UTC | #2
Stanislaw,

there is no need for memory barrier in bnx2x_tx_avail() as long as they are taken outside the function the way we minimize the possibility of calls for that barrier only to the rare cases, when the ring is going to be closed (XOFF).

The same is with the smp_mb() in bnx2x_tx_int() - we don't want to call it unless we really needed and this is only in the case we need to XON the queue which is currently XOFFed.

The tx_lock() is not needed there as well. There is a slight and very hypothetical possibility for the case when we might release the queue, when it's full and then there is a possibility that bnx2x_start_xmit() will return NETIF_TX_BUSY. And even if it happens (while nobody has reported about such a case) it's not fatal.

Dave, I'm running stress tests on the patch, which will eliminate even that hypothetical issue, I've mentioned before. Once we fill confident with this patch we'll send it to u.

Thanks,
vlad


> -----Original Message-----
> From: Stanislaw Gruszka [mailto:sgruszka@redhat.com] 
> Sent: Thursday, February 25, 2010 3:09 PM
> To: netdev@vger.kernel.org; Vladislav Zolotarov
> Cc: David Miller; Eilon Greenstein; David Howells
> Subject: [RFC PATCH] bnx2x: fix tx queue locking and memory barriers
> 
> We have done some optimizations in bnx2x_start_xmit() and 
> bnx2x_tx_int(), which 
> in my opinion can lead into some theoretical race conditions. 
> 
> I can be pretty wrong here, but if so, we have to optimize 
> some other drivers,
> which use memory barriers/locking schema from that patch 
> (like tg3, bnx2).
> 
> Memory barriers here IMHO, prevent to make queue permanently 
> stopped when on one
> cpu bnx2x_tx_int() make queue empty, whereas on other cpu 
> bnx2x_start_xmit() see
> it full and make stop it, such cause queue will be stopped forever. 
> 
> I'm not quite sure what for is __netif_tx_lock, but other 
> drivers use it.
> 
> diff --git a/drivers/net/bnx2x_main.c b/drivers/net/bnx2x_main.c
> index 5adf2a0..ca91aa8 100644
> --- a/drivers/net/bnx2x_main.c
> +++ b/drivers/net/bnx2x_main.c
> @@ -893,7 +893,10 @@ static inline u16 bnx2x_tx_avail(struct 
> bnx2x_fastpath *fp)
>  	u16 prod;
>  	u16 cons;
>  
> -	barrier(); /* Tell compiler that prod and cons can change */
> +	/* prod and cons can change on other cpu, want to see
> +	   consistend available space and queue (stop/running) state */
> +	smp_mb();
> +
>  	prod = fp->tx_bd_prod;
>  	cons = fp->tx_bd_cons;
>  
> @@ -957,21 +960,23 @@ static int bnx2x_tx_int(struct 
> bnx2x_fastpath *fp)
>  	fp->tx_pkt_cons = sw_cons;
>  	fp->tx_bd_cons = bd_cons;
>  
> +	/* Need to make the tx_bd_cons update visible to start_xmit()
> +	 * before checking for netif_tx_queue_stopped().  Without the
> +	 * memory barrier, there is a small possibility that 
> start_xmit()
> +	 * will miss it and cause the queue to be stopped forever. This
> +	 * can happen when we make queue empty here, when on other cpu
> +	 * start_xmit() still see it becoming full and stop.
> +	 */
> +	smp_mb();
> +
>  	/* TBD need a thresh? */
>  	if (unlikely(netif_tx_queue_stopped(txq))) {
> -
> -		/* Need to make the tx_bd_cons update visible 
> to start_xmit()
> -		 * before checking for 
> netif_tx_queue_stopped().  Without the
> -		 * memory barrier, there is a small possibility that
> -		 * start_xmit() will miss it and cause the 
> queue to be stopped
> -		 * forever.
> -		 */
> -		smp_mb();
> -
> +		__netif_tx_lock(txq, smp_processor_id());
>  		if ((netif_tx_queue_stopped(txq)) &&
>  		    (bp->state == BNX2X_STATE_OPEN) &&
>  		    (bnx2x_tx_avail(fp) >= MAX_SKB_FRAGS + 3))
>  			netif_tx_wake_queue(txq);
> +		__netif_tx_unlock(txq);
>  	}
>  	return 0;
>  }
> 
> 
> 
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Stanislaw Gruszka Feb. 25, 2010, 3:40 p.m. UTC | #3
On Thu, 25 Feb 2010 02:18:22 -0800 (PST)
David Miller <davem@davemloft.net> wrote:

> > Memory barriers here IMHO, prevent to make queue permanently stopped
> > when on one cpu bnx2x_tx_int() make queue empty, whereas on other
> > cpu bnx2x_start_xmit() see it full and make stop it, such cause
> > queue will be stopped forever.
> 
> Instead of having an opinion, please show the exact sequence
> of events that can lead to this situation.  With such facts
> inhand, you will have no need for an opinion :-)

Ok, here is the story:

Queue (4000 elements) is almost full: fp->tx_bd_prod = 3980, fp->tx_bd_cons = 0,
Packets ware already transmitted by device, but that was not reported to
the driver because interrupts where disabled. We are transmitting new skb
with 20 fragments.

cpu0: (in bnx2x_poll)                   cpu1: (transferring data)

bnx2x_tx_int():                         bnx2x_start_xmit():
local fp->tx_bd_cons = 3980;            send more data to device
return

bnx2x_tx_int():                         local fp->tx_bd_prod = 4000
local fp->tx_bd_cons = 4000;            local fp->tx_bd_cons still 0
queue not stopped                       no avail space in queue
return;                                 stop queue
                                        smp_mb() - not paired, cpu1 does not "see" cpu0 caches changes
                                        local fp->tx_bd_cons still 0
                                        no wake
					
Finally queue is sopped and device will not generate interrupt nor
call bnx2x_tx_int() from bnx2x_poll() since bnx2x_has_tx_work() will
return false.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Vladislav Zolotarov Feb. 25, 2010, 3:49 p.m. UTC | #4
In bnx2x_start_xmit(): right after the "stop queue" there is an smp_mb(), which will bring the cpu0 cache and a fresh fp->tx_bd_cons value to cpu1 and the following if() will return true and the queue will be released from bnx2x_start_xmit() flow.

No deadlock here.

vlad

> -----Original Message-----
> From: Stanislaw Gruszka [mailto:sgruszka@redhat.com] 
> Sent: Thursday, February 25, 2010 5:40 PM
> To: David Miller
> Cc: netdev@vger.kernel.org; Vladislav Zolotarov; Eilon 
> Greenstein; dhowells@redhat.com
> Subject: Re: [RFC PATCH] bnx2x: fix tx queue locking and 
> memory barriers
> 
> On Thu, 25 Feb 2010 02:18:22 -0800 (PST)
> David Miller <davem@davemloft.net> wrote:
> 
> > > Memory barriers here IMHO, prevent to make queue 
> permanently stopped
> > > when on one cpu bnx2x_tx_int() make queue empty, whereas on other
> > > cpu bnx2x_start_xmit() see it full and make stop it, such cause
> > > queue will be stopped forever.
> > 
> > Instead of having an opinion, please show the exact sequence
> > of events that can lead to this situation.  With such facts
> > inhand, you will have no need for an opinion :-)
> 
> Ok, here is the story:
> 
> Queue (4000 elements) is almost full: fp->tx_bd_prod = 3980, 
> fp->tx_bd_cons = 0,
> Packets ware already transmitted by device, but that was not 
> reported to
> the driver because interrupts where disabled. We are 
> transmitting new skb
> with 20 fragments.
> 
> cpu0: (in bnx2x_poll)                   cpu1: (transferring data)
> 
> bnx2x_tx_int():                         bnx2x_start_xmit():
> local fp->tx_bd_cons = 3980;            send more data to device
> return
> 
> bnx2x_tx_int():                         local fp->tx_bd_prod = 4000
> local fp->tx_bd_cons = 4000;            local fp->tx_bd_cons still 0
> queue not stopped                       no avail space in queue
> return;                                 stop queue
>                                         smp_mb() - not 
> paired, cpu1 does not "see" cpu0 caches changes
>                                         local fp->tx_bd_cons still 0
>                                         no wake
> 					
> Finally queue is sopped and device will not generate interrupt nor
> call bnx2x_tx_int() from bnx2x_poll() since bnx2x_has_tx_work() will
> return false.
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Stanislaw Gruszka Feb. 25, 2010, 4:03 p.m. UTC | #5
On Thu, 25 Feb 2010 07:49:48 -0800
"Vladislav Zolotarov" <vladz@broadcom.com> wrote:

> In bnx2x_start_xmit(): right after the "stop queue" there is an smp_mb(), which will bring the cpu0 cache and a fresh fp->tx_bd_cons value to cpu1 and the following if() will return true and the queue will be released from bnx2x_start_xmit() flow.

If I understand correctly what is written in Documentation/memory-barriers.txt
this smp_mb() need to have another "paired" smp_{w}mb() on cpu0 to make value
be updated on cpu1, which is missing.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller Feb. 25, 2010, 4:06 p.m. UTC | #6
From: Stanislaw Gruszka <sgruszka@redhat.com>
Date: Thu, 25 Feb 2010 17:03:08 +0100

> On Thu, 25 Feb 2010 07:49:48 -0800
> "Vladislav Zolotarov" <vladz@broadcom.com> wrote:
> 
>> In bnx2x_start_xmit(): right after the "stop queue" there is an smp_mb(), which will bring the cpu0 cache and a fresh fp->tx_bd_cons value to cpu1 and the following if() will return true and the queue will be released from bnx2x_start_xmit() flow.
> 
> If I understand correctly what is written in Documentation/memory-barriers.txt
> this smp_mb() need to have another "paired" smp_{w}mb() on cpu0 to make value
> be updated on cpu1, which is missing.

The invocation of ->hard_start_xmit() creates an implicit memory
barrier because all such invocations take the netdev spinlock.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Vladislav Zolotarov Feb. 25, 2010, 4:14 p.m. UTC | #7
Correct. There a missing barrier in bnx2x_tx_int(). I'll create a proper patch shortly. 

Thanks. 

> -----Original Message-----
> From: Stanislaw Gruszka [mailto:sgruszka@redhat.com] 
> Sent: Thursday, February 25, 2010 6:03 PM
> To: Vladislav Zolotarov
> Cc: David Miller; netdev@vger.kernel.org; Eilon Greenstein; 
> dhowells@redhat.com
> Subject: Re: [RFC PATCH] bnx2x: fix tx queue locking and 
> memory barriers
> 
> On Thu, 25 Feb 2010 07:49:48 -0800
> "Vladislav Zolotarov" <vladz@broadcom.com> wrote:
> 
> > In bnx2x_start_xmit(): right after the "stop queue" there 
> is an smp_mb(), which will bring the cpu0 cache and a fresh 
> fp->tx_bd_cons value to cpu1 and the following if() will 
> return true and the queue will be released from 
> bnx2x_start_xmit() flow.
> 
> If I understand correctly what is written in 
> Documentation/memory-barriers.txt
> this smp_mb() need to have another "paired" smp_{w}mb() on 
> cpu0 to make value
> be updated on cpu1, which is missing.
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Stanislaw Gruszka Feb. 25, 2010, 4:16 p.m. UTC | #8
On Thu, 25 Feb 2010 08:06:24 -0800 (PST)
David Miller <davem@davemloft.net> wrote:

> > If I understand correctly what is written in Documentation/memory-barriers.txt
> > this smp_mb() need to have another "paired" smp_{w}mb() on cpu0 to make value
> > be updated on cpu1, which is missing.
> 
> The invocation of ->hard_start_xmit() creates an implicit memory
> barrier because all such invocations take the netdev spinlock.

But barrier is missing on cpu which call bnx2x_pull(), how spinlock
before bnx2x_start_xmit() helps with that? Below again the whole
picture:

cpu0: (in bnx2x_poll)                   cpu1: (transferring data)

bnx2x_tx_int():                         bnx2x_start_xmit():
local fp->tx_bd_cons = 3980;            send more data to device
return

bnx2x_tx_int():                         local fp->tx_bd_prod = 4000
local fp->tx_bd_cons = 4000;            local fp->tx_bd_cons still 0
queue not stopped                       no avail space in queue
return;                                 stop queue
                                        smp_mb() - not paired, cpu1 does not "see" cpu0 caches changes
                                        local fp->tx_bd_cons still 0
                                        no wake
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Howells March 10, 2010, 5:09 p.m. UTC | #9
Stanislaw Gruszka <sgruszka@redhat.com> wrote:

> -	barrier(); /* Tell compiler that prod and cons can change */
> +	/* prod and cons can change on other cpu, want to see
> +	   consistend available space and queue (stop/running) state */
> +	smp_mb();
> +
>  	prod = fp->tx_bd_prod;
>  	cons = fp->tx_bd_cons;

I suspect that this isn't what you want.

The barrier() didn't tell the compiler that fp->tx_bd_prod and fp->tx_bd_cons
could change.  What it did was to say that the accesses to those two variables
must be performed after all the other accesses issued by that CPU prior to the
barrier - at least as far as the compiler is concerned.

You don't need to separate the reads of tx_bd_prod and tx_bd_cons above with a
memory barrier.  They aren't ever altered in the same place.


What you want is something more like the following pseudocode.

To insert into a circular buffer:

	bd_prod = fp->tx_bd_prod;
	bd_cons = fp->tx_bd_cons;

	if (CIRC_SPACE(bd_cons, bd_prod, NUM_TX_BD) <= 0)
		goto no_space;

	/* get a tx_buf and first BD */
	tx_start_bd = &fp->tx_desc_ring[bd_prod].start_bd;
	tx_start_bd->bd_flags.as_bitfield = ETH_TX_BD_FLAGS_START_BD;
	tx_start_bd->general_data = (UNICAST_ADDRESS <<
				     ETH_TX_START_BD_ETH_ADDR_TYPE_SHIFT);
	tx_start_bd->general_data |= (1 << ETH_TX_START_BD_HDR_NBDS_SHIFT);

	smp_wmb(); /* commit buffer contents before incrementing index */

	fp->tx_bd_prod = TX_BD(bd_prod + 1);

To read from a circular buffer:

	bd_prod = fp->tx_bd_prod;
	bd_cons = fp->tx_bd_cons;

	smp_read_barrier_depends(); /* read index before reading contents */

	if (CIRC_CNT(bd_cons, bd_prod, NUM_TX_BD) <= 0)
		goto no_data;

	tx_start_bd = &fp->tx_desc_ring[bd_cons].start_bd;
	munge_descriptor(tx_start_bd);

	smp_mb(); /* finish reading descriptor before incrementing index */

	fp->tx_bd_cons = TX_BD(bd_cons + 1);

At least, I'm fairly certain that's correct.

David
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Howells March 10, 2010, 5:19 p.m. UTC | #10
David Howells <dhowells@redhat.com> wrote:

> > -	barrier(); /* Tell compiler that prod and cons can change */
> > +	/* prod and cons can change on other cpu, want to see
> > +	   consistend available space and queue (stop/running) state */
> > +	smp_mb();
> > +
> >  	prod = fp->tx_bd_prod;
> >  	cons = fp->tx_bd_cons;
> 
> I suspect that this isn't what you want.
> 
> The barrier() didn't tell the compiler that fp->tx_bd_prod and fp->tx_bd_cons
> could change.  What it did was to say that the accesses to those two variables
> must be performed after all the other accesses issued by that CPU prior to the
> barrier - at least as far as the compiler is concerned.
> 
> You don't need to separate the reads of tx_bd_prod and tx_bd_cons above with a
> memory barrier.  They aren't ever altered in the same place.

Having said that, you might need a memory barrier before reading tx_bd_prod in
the consumer if the producer waggles a flag in memory to indicate to the
consumer that it should consume, and a memory barrier in the producer before
waggling that flag:

	[producer]
	...
	smp_wmb(); /* commit buffer contents before incrementing index */
	fp->tx_bd_prod = TX_BD(bd_prod + 1);
	smp_wmb(); /* commit increment index before prodding consumer */
	prod_consumer();

	[consumer]
	check_prod_flag();
	smp_rmb(); /* read producer index after checking prod flag */
	bd_prod = fp->tx_bd_prod;
	bd_cons = fp->tx_bd_cons;
	smp_read_barrier_depends(); /* read index before reading contents */

David
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller March 10, 2010, 5:49 p.m. UTC | #11
From: David Howells <dhowells@redhat.com>
Date: Wed, 10 Mar 2010 17:09:50 +0000

> The barrier() didn't tell the compiler that fp->tx_bd_prod and fp->tx_bd_cons
> could change.  What it did was to say that the accesses to those two variables
> must be performed after all the other accesses issued by that CPU prior to the
> barrier - at least as far as the compiler is concerned.

barrier() has a "memory" asm clobber which says that all memory could
have changed.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Howells March 10, 2010, 6:32 p.m. UTC | #12
David Miller <davem@davemloft.net> wrote:

> > The barrier() didn't tell the compiler that fp->tx_bd_prod and
> > fp->tx_bd_cons could change.  What it did was to say that the accesses to
> > those two variables must be performed after all the other accesses issued
> > by that CPU prior to the barrier - at least as far as the compiler is
> > concerned.
> 
> barrier() has a "memory" asm clobber which says that all memory could
> have changed.

Indeed, but only as far as the compiler is concerned.

However, the problem is almost certainly not this, but that the item to be
read from the buffer may not have been written yet by the producing CPU, even
though it's written the head pointer.

David
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Stanislaw Gruszka March 11, 2010, 1:10 p.m. UTC | #13
On Wed, Mar 10, 2010 at 05:09:50PM +0000, David Howells wrote:
> Stanislaw Gruszka <sgruszka@redhat.com> wrote:
> 
> > -	barrier(); /* Tell compiler that prod and cons can change */
> > +	/* prod and cons can change on other cpu, want to see
> > +	   consistend available space and queue (stop/running) state */
> > +	smp_mb();
> > +
> >  	prod = fp->tx_bd_prod;
> >  	cons = fp->tx_bd_cons;
> 
> I suspect that this isn't what you want.

Yes, I realized that. I posted other patches, removing you from CC, since
thought you are not interested.

http://patchwork.ozlabs.org/patch/47172/
http://patchwork.ozlabs.org/patch/47173/
http://patchwork.ozlabs.org/patch/47174/

I removed there barrier() and not use smp_mb(), with explanation why.

Stanislaw
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/drivers/net/bnx2x_main.c b/drivers/net/bnx2x_main.c
index 5adf2a0..ca91aa8 100644
--- a/drivers/net/bnx2x_main.c
+++ b/drivers/net/bnx2x_main.c
@@ -893,7 +893,10 @@  static inline u16 bnx2x_tx_avail(struct bnx2x_fastpath *fp)
 	u16 prod;
 	u16 cons;
 
-	barrier(); /* Tell compiler that prod and cons can change */
+	/* prod and cons can change on other cpu, want to see
+	   consistend available space and queue (stop/running) state */
+	smp_mb();
+
 	prod = fp->tx_bd_prod;
 	cons = fp->tx_bd_cons;
 
@@ -957,21 +960,23 @@  static int bnx2x_tx_int(struct bnx2x_fastpath *fp)
 	fp->tx_pkt_cons = sw_cons;
 	fp->tx_bd_cons = bd_cons;
 
+	/* Need to make the tx_bd_cons update visible to start_xmit()
+	 * before checking for netif_tx_queue_stopped().  Without the
+	 * memory barrier, there is a small possibility that start_xmit()
+	 * will miss it and cause the queue to be stopped forever. This
+	 * can happen when we make queue empty here, when on other cpu
+	 * start_xmit() still see it becoming full and stop.
+	 */
+	smp_mb();
+
 	/* TBD need a thresh? */
 	if (unlikely(netif_tx_queue_stopped(txq))) {
-
-		/* Need to make the tx_bd_cons update visible to start_xmit()
-		 * before checking for netif_tx_queue_stopped().  Without the
-		 * memory barrier, there is a small possibility that
-		 * start_xmit() will miss it and cause the queue to be stopped
-		 * forever.
-		 */
-		smp_mb();
-
+		__netif_tx_lock(txq, smp_processor_id());
 		if ((netif_tx_queue_stopped(txq)) &&
 		    (bp->state == BNX2X_STATE_OPEN) &&
 		    (bnx2x_tx_avail(fp) >= MAX_SKB_FRAGS + 3))
 			netif_tx_wake_queue(txq);
+		__netif_tx_unlock(txq);
 	}
 	return 0;
 }