Message ID | 1430273488-8403-2-git-send-email-ast@plumgrid.com
---|---
State | RFC, archived
Delegated to | David Miller
On Tue, 2015-04-28 at 19:11 -0700, Alexei Starovoitov wrote:
> +	if (pkt_dev->flags & F_DO_RX) {
> +		local_bh_disable();
> +		atomic_add(burst, &pkt_dev->skb->users);
> +		do {
> +			ret = netif_receive_skb(pkt_dev->skb);
> +			if (ret == NET_RX_DROP)
> +				pkt_dev->errors++;
> +			pkt_dev->last_ok = 1;
> +			pkt_dev->sofar++;
> +			pkt_dev->seq_num++;
> +		} while (--burst > 0);
> +		local_bh_enable();
> +		goto out;
> +	}
> +

This looks buggy.

skb can be put on a queue, so skb->next and skb->prev cannot be reused,
or queues will be corrupted.

Note that on TX, it is possible to have the same issue if you use a
virtual device like bonding, and skb is queued on a slave qdisc.

(That's why we have this IFF_TX_SKB_SHARING flag)
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On 4/28/15 9:14 PM, Eric Dumazet wrote:
> On Tue, 2015-04-28 at 19:11 -0700, Alexei Starovoitov wrote:
>
>> +	if (pkt_dev->flags & F_DO_RX) {
>> +		local_bh_disable();
>> +		atomic_add(burst, &pkt_dev->skb->users);
>> +		do {
>> +			ret = netif_receive_skb(pkt_dev->skb);
>> +			if (ret == NET_RX_DROP)
>> +				pkt_dev->errors++;
>> +			pkt_dev->last_ok = 1;
>> +			pkt_dev->sofar++;
>> +			pkt_dev->seq_num++;
>> +		} while (--burst > 0);
>> +		local_bh_enable();
>> +		goto out;
>> +	}
>> +
>
> This looks buggy.
>
> skb can be put on a queue, so skb->next and skb->prev cannot be reused,
> or queues will be corrupted.

don't see the bug yet.
Any layer that wants to do such queueing should do skb_share_check
first. Just like ip_rcv does. So everything in IP world should
work fine, because it will be operating on clean cloned skb.

> Note that on TX, it is possible to have the same issue if you use a
> virtual device like bonding, and skb is queued on a slave qdisc.
>
> (That's why we have this IFF_TX_SKB_SHARING flag)

yep, that's why xmit into veth is not useful for benchmarking of rx,
since the hottest functions are alloc_skb and fill_packet, while
netif_receive_skb is not even seen in top 10.
On Wed, 2015-04-29 at 14:55 -0700, Alexei Starovoitov wrote:
> On 4/28/15 9:14 PM, Eric Dumazet wrote:
> > On Tue, 2015-04-28 at 19:11 -0700, Alexei Starovoitov wrote:
> >
> >> +	if (pkt_dev->flags & F_DO_RX) {
> >> +		local_bh_disable();
> >> +		atomic_add(burst, &pkt_dev->skb->users);
> >> +		do {
> >> +			ret = netif_receive_skb(pkt_dev->skb);
> >> +			if (ret == NET_RX_DROP)
> >> +				pkt_dev->errors++;
> >> +			pkt_dev->last_ok = 1;
> >> +			pkt_dev->sofar++;
> >> +			pkt_dev->seq_num++;
> >> +		} while (--burst > 0);
> >> +		local_bh_enable();
> >> +		goto out;
> >> +	}
> >> +
> >
> > This looks buggy.
> >
> > skb can be put on a queue, so skb->next and skb->prev cannot be reused,
> > or queues will be corrupted.
>
> don't see the bug yet.
> Any layer that wants to do such queueing should do skb_share_check
> first. Just like ip_rcv does. So everything in IP world should
> work fine, because it will be operating on clean cloned skb.

Really this is what _you_ think is needed, so that your patch can fly.

In the current state of the stack, the skb_share_check() is done where
we know that a packet _might_ be delivered to multiple end points
(deliver_skb() does atomic_inc(&skb->users)).

But RPS/RFS/GRO do not care about your new rule.

Yes, before reaching __netif_receive_skb_core(), packets are supposed to
belong to the stack. We are supposed to queue them, without adding a
check for skb->users being one or not, and eventually add an expensive
memory allocation/copy.

We are not going to add an extra check just to make pktgen rx fast.
pktgen will have to comply to existing rules.
On 4/29/15 3:19 PM, Eric Dumazet wrote:
> On Wed, 2015-04-29 at 14:55 -0700, Alexei Starovoitov wrote:
>> On 4/28/15 9:14 PM, Eric Dumazet wrote:
>>> On Tue, 2015-04-28 at 19:11 -0700, Alexei Starovoitov wrote:
>>>
>>> This looks buggy.
>>>
>>> skb can be put on a queue, so skb->next and skb->prev cannot be reused,
>>> or queues will be corrupted.
>>
>> don't see the bug yet.
>> Any layer that wants to do such queueing should do skb_share_check
>> first. Just like ip_rcv does. So everything in IP world should
>> work fine, because it will be operating on clean cloned skb.
>
> Really this is what _you_ think is needed, so that your patch can fly.
>
> In the current state of the stack, the skb_share_check() is done where
> we know that a packet _might_ be delivered to multiple end points
> (deliver_skb() does atomic_inc(&skb->users)).
>
> But RPS/RFS/GRO do not care about your new rule.
>
> Yes, before reaching __netif_receive_skb_core(), packets are supposed to
> belong to the stack. We are supposed to queue them, without adding a
> check for skb->users being one or not, and eventually add an expensive
> memory allocation/copy.
>
> We are not going to add an extra check just to make pktgen rx fast.
> pktgen will have to comply to existing rules.

I'm not making and not suggesting any new rules.
ip_rcv is doing this skb_share_check() not because of pktgen rx,
but because there can be taps and deliver_skb() as you said.
gro has a different interface and this pktgen cannot benchmark it.
rps/rfs is not benchmarkable by this approach either.
To me this is all fine. I'm not trying to build a universal
benchmarking tool. This one is dumb and simple and primarily
oriented to benchmark changes to netif_receive_skb and ingress
qdisc only. I'm not suggesting to use it everywhere.
I already mentioned in the cover letter:
"The profile dump looks as expected for RX of UDP packets
without local socket except presence of __skb_clone."

Clearly I'm not suggesting to use pktgen rx to optimize the IP stack,
and not suggesting at all that the stack should assume users != 1
when an skb hits netif_receive_skb. Today at the beginning of
netif_receive_skb we know that users == 1 without checking.
I'm not changing that assumption.
Just like the pktgen xmit path is cheating a little bit while
benchmarking TX, I'm cheating a little bit with users != 1 on RX.
On Wed, 2015-04-29 at 15:38 -0700, Alexei Starovoitov wrote:
> I'm not making and not suggesting any new rules.
> ip_rcv is doing this skb_share_check() not because of pktgen rx,
> but because there can be taps and deliver_skb() as you said.
> gro has a different interface and this pktgen cannot benchmark it.
> rps/rfs is not benchmarkable by this approach either.
> To me this is all fine. I'm not trying to build a universal
> benchmarking tool. This one is dumb and simple and primarily
> oriented to benchmark changes to netif_receive_skb and ingress
> qdisc only. I'm not suggesting to use it everywhere.
> I already mentioned in the cover letter:
> "The profile dump looks as expected for RX of UDP packets
> without local socket except presence of __skb_clone."
> Clearly I'm not suggesting to use pktgen rx to optimize the IP stack,
> and not suggesting at all that the stack should assume users != 1
> when an skb hits netif_receive_skb. Today at the beginning of
> netif_receive_skb we know that users == 1 without checking.
> I'm not changing that assumption.
> Just like the pktgen xmit path is cheating a little bit while
> benchmarking TX, I'm cheating a little bit with users != 1 on RX.

Look, you don't have to write all this, just fix the bug I mentioned
to you about RPS / RFS.
(details in Documentation/networking/scaling.txt)

Simply read the code in enqueue_to_backlog(); this obviously needs to
own skb->next/prev.

So pktgen in RX mode MUST deliver skb with skb->users = 1, there is no
way around it.
On 4/29/15 3:56 PM, Eric Dumazet wrote:
>
> So pktgen in RX mode MUST deliver skb with skb->users = 1, there is no
> way around it.

if I only knew how to do it...
The cost of continuously allocating skbs is way higher than
netif_receive_skb itself. Such a benchmarking tool would measure the
speed of skb alloc/free instead of the speed of netif_receive_skb.
Are you suggesting to pre-allocate 10s of millions of skbs and
then feed them in one go? The profile will be dominated by
cache misses in the first few lines of __netif_receive_skb_core()
where it accesses skb->dev, data, head. Doesn't sound too useful either.
Other thoughts?
On Wed, 2015-04-29 at 16:28 -0700, Alexei Starovoitov wrote:
> On 4/29/15 3:56 PM, Eric Dumazet wrote:
>>
>> So pktgen in RX mode MUST deliver skb with skb->users = 1, there is no
>> way around it.
>
> if I only knew how to do it...
> The cost of continuously allocating skbs is way higher than
> netif_receive_skb itself. Such a benchmarking tool would measure the
> speed of skb alloc/free instead of the speed of netif_receive_skb.
> Are you suggesting to pre-allocate 10s of millions of skbs and
> then feed them in one go? The profile will be dominated by
> cache misses in the first few lines of __netif_receive_skb_core()
> where it accesses skb->dev, data, head. Doesn't sound too useful either.
> Other thoughts?

The code I copy pasted from your patch is buggy. Not the whole thing.
You have to replace it by something smarter.

This should not be hard really. Zap the 'burst/clone' thing, this is
not going to work for RX. TX was okay, not RX.

You could for instance do:

	atomic_inc(&skb->users);
	netif_receive_skb(skb);
	if (atomic_read(&skb->users) != 1) {
		/* This is too bad, I can not recycle this skb
		 * because it is still used */
		consume_skb(skb);
		/* allocate a fresh new skb */
		skb = ...
	} else {
		/* Yeah! Cost of reusing this skb was one atomic op */
	}
On 4/29/15 4:39 PM, Eric Dumazet wrote:
>
> Zap the 'burst/clone' thing, this is not going to work for RX.
> TX was okay, not RX.
>
> You could for instance do:
>
>	atomic_inc(&skb->users);
>	netif_receive_skb(skb);
>	if (atomic_read(&skb->users) != 1) {
>		/* This is too bad, I can not recycle this skb
>		 * because it is still used */
>		consume_skb(skb);
>		/* allocate a fresh new skb */
>		skb = ...
>	} else {
>		/* Yeah! Cost of reusing this skb was one atomic op */
>	}

ahh, great! I think I'm starting to understand.
then the following should be ok as well?

	atomic_add(burst, &skb->users);
	do {
		netif_receive_skb(skb);
		if (atomic_read(&skb->users) != burst) {
			/* too bad, can not recycle */
			atomic_sub(burst - 1, &skb->users);
			consume_skb(skb);
			/* allocate a fresh new skb */
			skb = ...
			/* and get out of this loop */
		} else {
			/* Yeah! the cost of reusing this skb was one
			 * atomic op amortized over 'burst' iterations */
		}
	} while (--burst > 0);
diff --git a/net/core/pktgen.c b/net/core/pktgen.c
index 508155b283dd..4f6c56bca550 100644
--- a/net/core/pktgen.c
+++ b/net/core/pktgen.c
@@ -203,6 +203,7 @@
 #define F_NODE          (1<<15)	/* Node memory alloc*/
 #define F_UDPCSUM       (1<<16)	/* Include UDP checksum */
 #define F_NO_TIMESTAMP  (1<<17)	/* Don't timestamp packets (default TS) */
+#define F_DO_RX         (1<<18)	/* generate packets for RX */
 
 /* Thread control flag bits */
 #define T_STOP        (1<<0)	/* Stop run */
@@ -1080,7 +1081,7 @@ static ssize_t pktgen_if_write(struct file *file,
 		len = num_arg(&user_buffer[i], 10, &value);
 		if (len < 0)
 			return len;
-		if ((value > 0) &&
+		if ((value > 0) && !(pkt_dev->flags & F_DO_RX) &&
 		    (!(pkt_dev->odev->priv_flags & IFF_TX_SKB_SHARING)))
 			return -ENOTSUPP;
 		i += len;
@@ -1089,6 +1090,12 @@ static ssize_t pktgen_if_write(struct file *file,
 		sprintf(pg_result, "OK: clone_skb=%d", pkt_dev->clone_skb);
 		return count;
 	}
+	if (!strcmp(name, "rx")) {
+		pkt_dev->flags |= F_DO_RX;
+
+		sprintf(pg_result, "OK: RX is ON");
+		return count;
+	}
 	if (!strcmp(name, "count")) {
 		len = num_arg(&user_buffer[i], 10, &value);
 		if (len < 0)
@@ -1134,7 +1141,7 @@ static ssize_t pktgen_if_write(struct file *file,
 			return len;
 		i += len;
-		if ((value > 1) &&
+		if ((value > 1) && !(pkt_dev->flags & F_DO_RX) &&
 		    (!(pkt_dev->odev->priv_flags & IFF_TX_SKB_SHARING)))
 			return -ENOTSUPP;
 		pkt_dev->burst = value < 1 ? 1 : value;
@@ -3349,11 +3356,29 @@ static void pktgen_xmit(struct pktgen_dev *pkt_dev)
 		pkt_dev->last_pkt_size = pkt_dev->skb->len;
 		pkt_dev->allocated_skbs++;
 		pkt_dev->clone_count = 0;	/* reset counter */
+		if (pkt_dev->flags & F_DO_RX)
+			pkt_dev->skb->protocol = eth_type_trans(pkt_dev->skb,
+								pkt_dev->skb->dev);
 	}
 
 	if (pkt_dev->delay && pkt_dev->last_ok)
 		spin(pkt_dev, pkt_dev->next_tx);
 
+	if (pkt_dev->flags & F_DO_RX) {
+		local_bh_disable();
+		atomic_add(burst, &pkt_dev->skb->users);
+		do {
+			ret = netif_receive_skb(pkt_dev->skb);
+			if (ret == NET_RX_DROP)
+				pkt_dev->errors++;
+			pkt_dev->last_ok = 1;
+			pkt_dev->sofar++;
+			pkt_dev->seq_num++;
+		} while (--burst > 0);
+		local_bh_enable();
+		goto out;
+	}
+
 	txq = skb_get_tx_queue(odev, pkt_dev->skb);
 
 	local_bh_disable();
@@ -3403,6 +3428,7 @@ unlock:
 
 	local_bh_enable();
 
+out:
 	/* If pkt_dev->count is zero, then run forever */
 	if ((pkt_dev->count != 0) && (pkt_dev->sofar >= pkt_dev->count)) {
 		pktgen_wait_for_skb(pkt_dev);
Introduce 'RX' mode for pktgen which generates the packets using familiar
pktgen commands, but feeds them into netif_receive_skb() instead of
ndo_start_xmit(). It can be used to benchmark different kernel RX paths.

Sample script 'pktgen.sh':

#!/bin/bash
function pgset() {
  local result

  echo $1 > $PGDEV
  result=`cat $PGDEV | fgrep "Result: OK:"`
  if [ "$result" = "" ]; then
    cat $PGDEV | fgrep Result:
  fi
}

ETH=$1

PGDEV=/proc/net/pktgen/kpktgend_0
pgset "rem_device_all"
pgset "add_device $ETH"

PGDEV=/proc/net/pktgen/$ETH
pgset "rx"
pgset "clone_skb 1000"
pgset "pkt_size 60"
pgset "dst 99.1.1.2"
pgset "dst_mac 90:e2:ba:6e:a8:e5"
pgset "count 10000000"
pgset "burst 32"

PGDEV=/proc/net/pktgen/pgctrl
echo "Running... ctrl^C to stop"
pgset "start"
echo "Done"
cat /proc/net/pktgen/$ETH

Usage:
$ sudo ./pktgen.sh eth2
...
Result: OK: 232376(c232372+d3) usec, 10000000 (60byte,0frags)
  43033682pps 20656Mb/sec (20656167360bps) errors: 10000000

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
---
 net/core/pktgen.c | 30 ++++++++++++++++++++++++++++--
 1 file changed, 28 insertions(+), 2 deletions(-)