Message ID | 20180515193128.GA11901@plex.lan |
---|---|
State | RFC, archived |
Delegated to: | David Miller |
Series | Poor TCP performance with XPS enabled after scrubbing skb |

On 05/15/2018 12:31 PM, Flavio Leitner wrote:
> Hi,
>
> There is a significant throughput issue (~50% drop) for a single TCP
> stream when the skb is scrubbed and XPS is enabled.
>
> If I turn CONFIG_XPS off, then the issue never happens and the test
> reaches line rate. The same happens if I echo 0 to tx-*/xps_cpus.
>
> It looks like that when the skb is scrubbed, there is no more reference
> to the struct sock,

And this is really the problem here, since it breaks back pressure (and TCP Small Queues).

I am not sure why skb_orphan() is used in this scrubbing really.

On Tue, May 15, 2018 at 02:08:09PM -0700, Eric Dumazet wrote:
>
>
> On 05/15/2018 12:31 PM, Flavio Leitner wrote:
> > Hi,
> >
> > There is a significant throughput issue (~50% drop) for a single TCP
> > stream when the skb is scrubbed and XPS is enabled.
> >
> > If I turn CONFIG_XPS off, then the issue never happens and the test
> > reaches line rate. The same happens if I echo 0 to tx-*/xps_cpus.
> >
> > It looks like that when the skb is scrubbed, there is no more reference
> > to the struct sock,
>
> And this is really the problem here, since it breaks back pressure (and TCP Small queues)
>
> I am not sure why skb_orphan() is used in this scrubbing really.
>

veth originally called skb_orphan() in veth_xmit(), most probably because there was no TX completion. Then the code got generalized to dev_forward_skb() and later on moved to skb_scrub_packet().

The issue is that we call skb_scrub_packet() on both the TX and RX paths, and that is done while crossing netns. It doesn't look correct to keep the ->sk, because I suspect that iptables/selinux/bpf, or some code path that I am probably missing, could expose or use the wrong ->sk, for example.

However, netdev_pick_tx() can't store the queue mapping without ->sk.

The hack in the first email relies on the headers (skb_tx_hash) to always select the same TX queue, which solves the original problem but not the TCP small queues issue you mentioned.
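
To make the point about netdev_pick_tx() and ->sk concrete, here is a minimal userspace sketch of the caching behaviour described above. It is a toy model, not kernel code: `model_sock`, `model_skb`, `model_xps_queue()`, `model_pick_tx()` and the 4-queue device are all hypothetical names and values chosen for illustration, and the "one queue per CPU, modulo the queue count" XPS map is an assumption.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_NR_QUEUES 4	/* hypothetical device with 4 TX queues */

/* Hypothetical stand-in for the part of struct sock that caches a queue. */
struct model_sock {
	int tx_queue_mapping;	/* -1 means "no queue cached yet" */
};

/* Hypothetical stand-in for struct sk_buff. */
struct model_skb {
	struct model_sock *sk;	/* NULL once the skb has been orphaned */
};

/* Toy XPS map (assumption): CPU n transmits on queue n % MODEL_NR_QUEUES. */
int model_xps_queue(unsigned int cpu)
{
	return (int)(cpu % MODEL_NR_QUEUES);
}

/*
 * Pick a TX queue. With a socket attached, the first choice is cached in
 * the socket and reused for later packets of the flow. Without a socket
 * there is nowhere to cache it, so the choice follows the current CPU.
 */
int model_pick_tx(struct model_skb *skb, unsigned int cpu)
{
	int queue;

	if (skb->sk != NULL && skb->sk->tx_queue_mapping >= 0)
		return skb->sk->tx_queue_mapping;

	queue = model_xps_queue(cpu);
	assert(queue >= 0 && queue < MODEL_NR_QUEUES);

	if (skb->sk != NULL)
		skb->sk->tx_queue_mapping = queue;

	return queue;
}
```

In this model, a flow whose packets still carry a socket stays on the queue picked for its first packet even if later packets are sent from another CPU, while orphaned packets of the same flow move to whichever queue the current CPU maps to.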

From: Flavio Leitner <fbl@sysclose.org>
Date: Thu, 24 May 2018 16:17:29 -0300

> veth originally called skb_orphan() on veth_xmit() most probably
> because there was no TX completion. Then the code got generalized to
> dev_forward_skb() and later on moved to skb_scrub_packet().
>
> The issue is that we call skb_scrub_packet() on TX and RX paths and
> that is done while crossing netns. It doesn't look correct to keep
> the ->sk because I suspect that iptables/selinux/bpf, or some code
> path that I am probably missing could expose/use the wrong ->sk, for
> example.
>
> However, netdev_pick_tx() can't store the queue mapping without ->sk.
>
> The hack in the first email relies on the headers (skb_tx_hash) to
> always select the same TX queue, which solves the original problem
> but not the TCP small queues you mentioned.

Right, we can't allow a socket reference to escape over a netns crossing.

However, that is where we get the queue mapping state.

We might need to put the sk-based decision into the skb somehow in order to satisfy these two incompatible requirements.

diff --git a/include/net/busy_poll.h b/include/net/busy_poll.h
index 71c72a9..482d046 100644
--- a/include/net/busy_poll.h
+++ b/include/net/busy_poll.h
@@ -31,9 +31,10 @@

 /*		0 - Reserved to indicate value not set
  *     1..NR_CPUS - Reserved for sender_cpu
- *  NR_CPUS+1..~0 - Region available for NAPI IDs
+ *      NR_CPUS+1 - Scrubbed packet, do not use XPS
+ *  NR_CPUS+2..~0 - Region available for NAPI IDs
  */
-#define MIN_NAPI_ID ((unsigned int)(NR_CPUS + 1))
+#define MIN_NAPI_ID ((unsigned int)(NR_CPUS + 2))

 #ifdef CONFIG_NET_RX_BUSY_POLL

diff --git a/net/core/dev.c b/net/core/dev.c
index af0558b..5567d4f 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3398,6 +3398,9 @@ static inline int get_xps_queue(struct net_device *dev, struct sk_buff *skb)
 	struct xps_map *map;
 	int queue_index = -1;

+	if (skb->sender_cpu == (u32)(NR_CPUS + 1))
+		return -1;
+
 	rcu_read_lock();
 	dev_maps = rcu_dereference(dev->xps_maps);
 	if (dev_maps) {
@@ -3459,7 +3462,7 @@ struct netdev_queue *netdev_pick_tx(struct net_device *dev,
 #ifdef CONFIG_XPS
 	u32 sender_cpu = skb->sender_cpu - 1;

-	if (sender_cpu >= (u32)NR_CPUS)
+	if (sender_cpu >= (u32)NR_CPUS + 1)
 		skb->sender_cpu = raw_smp_processor_id() + 1;
 #endif

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 345b518..99040a0 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -4898,6 +4898,7 @@ void skb_scrub_packet(struct sk_buff *skb, bool xnet)
 	ipvs_reset(skb);
 	skb_orphan(skb);
 	skb->mark = 0;
+	skb->sender_cpu = (u32)(NR_CPUS + 1);
 }
 EXPORT_SYMBOL_GPL(skb_scrub_packet);
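
The value layout the patch introduces for the sender_cpu field can be checked in isolation. The following is a small userspace sketch of that encoding, assuming an 8-CPU build purely for illustration; `MODEL_NR_CPUS`, the `model_*` helpers and the enum are hypothetical names, and only the comparisons themselves are taken from the diff.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_NR_CPUS		8u	/* assumption; the kernel uses its configured NR_CPUS */
#define MODEL_SCRUBBED		(MODEL_NR_CPUS + 1u)	/* marker set by skb_scrub_packet() in the patch */
#define MODEL_MIN_NAPI_ID	(MODEL_NR_CPUS + 2u)	/* patched MIN_NAPI_ID */

enum model_kind {
	MODEL_KIND_UNSET,	/* 0 */
	MODEL_KIND_SENDER_CPU,	/* 1..NR_CPUS */
	MODEL_KIND_SCRUBBED,	/* NR_CPUS+1 */
	MODEL_KIND_NAPI_ID	/* NR_CPUS+2..~0 */
};

/* Classify a field value according to the patched busy_poll.h comment. */
enum model_kind model_classify(uint32_t value)
{
	if (value == 0)
		return MODEL_KIND_UNSET;
	if (value <= MODEL_NR_CPUS)
		return MODEL_KIND_SENDER_CPU;
	if (value == MODEL_SCRUBBED)
		return MODEL_KIND_SCRUBBED;
	return MODEL_KIND_NAPI_ID;
}

/*
 * Same comparison as the patched netdev_pick_tx(): record the current CPU
 * unless the field already holds a sender CPU or the scrubbed marker.
 */
uint32_t model_update_sender_cpu(uint32_t field, uint32_t current_cpu)
{
	uint32_t sender_cpu = field - 1u;	/* 0 wraps around to UINT32_MAX */

	assert(current_cpu < MODEL_NR_CPUS);
	if (sender_cpu >= MODEL_NR_CPUS + 1u)
		return current_cpu + 1u;
	return field;
}

/* Same test as the early return added to get_xps_queue(). */
int model_xps_allowed(uint32_t field)
{
	return field != MODEL_SCRUBBED;
}
```

With these definitions, an unset field (0) or a NAPI ID is overwritten with the current CPU plus one, whereas the scrubbed marker survives the netdev_pick_tx() check and then makes the XPS lookup bail out, so the caller falls back to the hash-based queue selection.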