Message ID | 20130520101601.14133.874.stgit@ladj378.jer.intel.com |
---|---|
State | Changes Requested, archived |
Delegated to: | David Miller |
Headers | show |
On Mon, 2013-05-20 at 13:16 +0300, Eliezer Tamir wrote: > Adds a new ndo_ll_poll method and the code that supports and uses it. > This method can be used by low latency applications to busy poll ethernet > device queues directly from the socket code. The ip_low_latency_poll sysctl > entry controls how many cycles to poll. Set to zero to disable. > This changelog lacks a lot of information, see below. Part of this information was in your 0/4 text, but it wont be included in the git tree. > Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> > Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com> > Tested-by: Willem de Bruijn <willemb@google.com> > Signed-off-by: Eliezer Tamir <eliezer.tamir@linux.intel.com> > --- > > + > +static inline void skb_mark_ll(struct sk_buff *skb, struct napi_struct *napi) > +{ > + skb->dev_ref = napi; > +} > + > +static inline void sk_mark_ll(struct sock *sk, struct sk_buff *skb) > +{ > + sk->dev_ref = skb->dev_ref; > +} I do not see why it's safe to keep a pointer to a napi object without taking a reference, or something to prevent object being removed. Using a genid might be enough. (some counter incremented every time a napi is dismantled) Alternatively, use a napi_id instead of a pointer. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
From: Eric Dumazet <eric.dumazet@gmail.com> Date: Mon, 20 May 2013 08:29:24 -0700 > Part of this information was in your 0/4 text, but it wont be included > in the git tree. Yes it will, in the merge commit I make when I merge this stuff in. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 20/05/2013 18:29, Eric Dumazet wrote: > On Mon, 2013-05-20 at 13:16 +0300, Eliezer Tamir wrote: --- >> +static inline void skb_mark_ll(struct sk_buff *skb, struct napi_struct *napi) >> +{ >> + skb->dev_ref = napi; >> +} >> + >> +static inline void sk_mark_ll(struct sock *sk, struct sk_buff *skb) >> +{ >> + sk->dev_ref = skb->dev_ref; >> +} > > I do not see why it's safe to keep a pointer to a napi object without > taking a reference, or something to prevent object being removed. > > Using a genid might be enough. (some counter incremented every time a > napi is dismantled) I really like this approach and I tried it. The main problem I had is that you need to increase the size of the skb to store the generation id unless you stuff it in the flags2 bitfield. There appear to be only 7 useful bit left there. Is it OK to use them all up? > Alternatively, use a napi_id instead of a pointer. I'm not sure I understand what you propose. -Eliezer -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, 2013-05-21 at 10:28 +0300, Eliezer Tamir wrote: > On 20/05/2013 18:29, Eric Dumazet wrote: > > On Mon, 2013-05-20 at 13:16 +0300, Eliezer Tamir wrote: > --- > >> +static inline void skb_mark_ll(struct sk_buff *skb, struct napi_struct *napi) > >> +{ > >> + skb->dev_ref = napi; > >> +} > >> + > >> +static inline void sk_mark_ll(struct sock *sk, struct sk_buff *skb) > >> +{ > >> + sk->dev_ref = skb->dev_ref; > >> +} > > > > I do not see why it's safe to keep a pointer to a napi object without > > taking a reference, or something to prevent object being removed. > > > > Using a genid might be enough. (some counter incremented every time a > > napi is dismantled) > > I really like this approach and I tried it. > The main problem I had is that you need to increase the size of the skb > to store the generation id unless you stuff it in the flags2 bitfield. > There appear to be only 7 useful bit left there. > Is it OK to use them all up? > > > > Alternatively, use a napi_id instead of a pointer. > > I'm not sure I understand what you propose. Oh well. To get a pointer to a struct net_device, we can use ifindex, and do a rcu lookup into a hash table to get the net_device. We do not need {pointer,ifindex} but {ifindex} is enough My suggestion is to not have skb->skb_ref but skb->napi_index : Its safe to copy its value from skb->napi_index to sk->napi_index without refcounting. All NAPI need to get a unique napi_index, and be inserted in a hash table for immediate/fast lookup. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, 21 May 2013, Eric Dumazet wrote: : > > Alternatively, use a napi_id instead of a pointer. : > : > I'm not sure I understand what you propose. : : Oh well. : : To get a pointer to a struct net_device, we can use ifindex, and do a : rcu lookup into a hash table to get the net_device. We do not need : {pointer,ifindex} but {ifindex} is enough : : My suggestion is to not have skb->skb_ref but skb->napi_index : Its safe : to copy its value from skb->napi_index to sk->napi_index without : refcounting. : : All NAPI need to get a unique napi_index, and be inserted in a hash : table for immediate/fast lookup. : Maybe even that's not needed. Couldn't skb->queue_mapping give the correct NAPI instance in multiqueue nics? The NAPI instance could be made easily available from skb->dev. In any case an index is much better than a new pointer. Pekka -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, 2013-05-21 at 19:02 +0200, Pekka Riikonen wrote: > Maybe even that's not needed. Couldn't skb->queue_mapping give the > correct NAPI instance in multiqueue nics? The NAPI instance could be made > easily available from skb->dev. In any case an index is much better than > a new pointer. We do not keep skb->dev information once a packet leaves the rcu protected region. Once packet is queued to tcp input queues, skb->dev is NULL. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, 2013-05-21 at 10:48 -0700, Eric Dumazet wrote: > We do not keep skb->dev information once a packet leaves the rcu > protected region. > > Once packet is queued to tcp input queues, skb->dev is NULL. This is done in tcp_v4_rcv() & tcp_v6_rcv() -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
From: Pekka Riikonen <priikone@iki.fi> Date: Tue, 21 May 2013 19:02:19 +0200 (CEST) > On Tue, 21 May 2013, Eric Dumazet wrote: > > : > > Alternatively, use a napi_id instead of a pointer. > : > > : > I'm not sure I understand what you propose. > : > : Oh well. > : > : To get a pointer to a struct net_device, we can use ifindex, and do a > : rcu lookup into a hash table to get the net_device. We do not need > : {pointer,ifindex} but {ifindex} is enough > : > : My suggestion is to not have skb->skb_ref but skb->napi_index : Its safe > : to copy its value from skb->napi_index to sk->napi_index without > : refcounting. > : > : All NAPI need to get a unique napi_index, and be inserted in a hash > : table for immediate/fast lookup. > : > Maybe even that's not needed. Couldn't skb->queue_mapping give the > correct NAPI instance in multiqueue nics? The NAPI instance could be made > easily available from skb->dev. In any case an index is much better than > a new pointer. Queue mapping is more volatile, and consider layered devices. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 21/05/2013 20:51, Eric Dumazet wrote: > On Tue, 2013-05-21 at 10:48 -0700, Eric Dumazet wrote: > >> We do not keep skb->dev information once a packet leaves the rcu >> protected region. >> >> Once packet is queued to tcp input queues, skb->dev is NULL. > > This is done in tcp_v4_rcv() & tcp_v6_rcv() So if we move calling sk_mark_ll() into the rcu protected region, We would not need the gen id in the skb. We could save skb->dev and skb->queue_mapping along with the generation id in the socket. then we don't need to change the skb struct at all. skb_mark_ll() can go away. we also call sk_mark_ll in one central place in net_protocol->handler instead of all over the place. or am I missing something? -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, 2013-05-21 at 22:25 +0300, Eliezer Tamir wrote: > On 21/05/2013 20:51, Eric Dumazet wrote: > > On Tue, 2013-05-21 at 10:48 -0700, Eric Dumazet wrote: > > > >> We do not keep skb->dev information once a packet leaves the rcu > >> protected region. > >> > >> Once packet is queued to tcp input queues, skb->dev is NULL. > > > > This is done in tcp_v4_rcv() & tcp_v6_rcv() > > So if we move calling sk_mark_ll() into the rcu protected region, > We would not need the gen id in the skb. > We could save skb->dev and skb->queue_mapping along with the generation > id in the socket. > > then we don't need to change the skb struct at all. > skb_mark_ll() can go away. > we also call sk_mark_ll in one central place in net_protocol->handler > instead of all over the place. > > or am I missing something? You cannot keep a pointer to some object without taking a reference on the object. Thats why I suggested to use an napi identifier instead, this is safe and cheap. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, 21 May 2013, David Miller wrote: > From: Pekka Riikonen <priikone@iki.fi> > Date: Tue, 21 May 2013 19:02:19 +0200 (CEST) > >> On Tue, 21 May 2013, Eric Dumazet wrote: >> >> : > > Alternatively, use a napi_id instead of a pointer. >> : > >> : > I'm not sure I understand what you propose. >> : >> : Oh well. >> : >> : To get a pointer to a struct net_device, we can use ifindex, and do a >> : rcu lookup into a hash table to get the net_device. We do not need >> : {pointer,ifindex} but {ifindex} is enough >> : >> : My suggestion is to not have skb->skb_ref but skb->napi_index : Its safe >> : to copy its value from skb->napi_index to sk->napi_index without >> : refcounting. >> : >> : All NAPI need to get a unique napi_index, and be inserted in a hash >> : table for immediate/fast lookup. >> : >> Maybe even that's not needed. Couldn't skb->queue_mapping give the >> correct NAPI instance in multiqueue nics? The NAPI instance could be made >> easily available from skb->dev. In any case an index is much better than >> a new pointer. > > Queue mapping is more volatile, and consider layered devices. > Yes, true. The napi_index then is probably the way to go. Main thing for me is that it doesn't increase skb size when in union with dma_cookie (skb has been growing lately). Pekka -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt index f98ca63..cfcf0ea 100644 --- a/Documentation/networking/ip-sysctl.txt +++ b/Documentation/networking/ip-sysctl.txt @@ -19,6 +19,11 @@ ip_no_pmtu_disc - BOOLEAN Disable Path MTU Discovery. default FALSE +ip_low_latency_poll - INTEGER + Low latency busy poll timeout. (needs CONFIG_INET_LL_RX_POLL) + Approximate time in ms to spin waiting for packets on the device queue. + default 0 + min_pmtu - INTEGER default 552 - minimum discovered Path MTU diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index a94a5a0..e25f798 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -943,6 +943,9 @@ struct net_device_ops { gfp_t gfp); void (*ndo_netpoll_cleanup)(struct net_device *dev); #endif +#ifdef CONFIG_INET_LL_RX_POLL + int (*ndo_ll_poll)(struct napi_struct *dev); +#endif int (*ndo_set_vf_mac)(struct net_device *dev, int queue, u8 *mac); int (*ndo_set_vf_vlan)(struct net_device *dev, diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index 2e0ced1..4047e1e 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -384,6 +384,7 @@ typedef unsigned char *sk_buff_data_t; * @no_fcs: Request NIC to treat last 4 bytes as Ethernet FCS * @dma_cookie: a cookie to one of several possible DMA operations * done by skb DMA functions + * @dev_ref: the NAPI struct this skb came from * @secmark: security marking * @mark: Generic packet mark * @dropcount: total number of sk_receive_queue overflows @@ -497,8 +498,11 @@ struct sk_buff { /* 7/9 bit hole (depending on ndisc_nodetype presence) */ kmemcheck_bitfield_end(flags2); -#ifdef CONFIG_NET_DMA - dma_cookie_t dma_cookie; +#if defined CONFIG_NET_DMA || defined CONFIG_INET_LL_RX_POLL + union { + struct napi_struct *dev_ref; + dma_cookie_t dma_cookie; + }; #endif #ifdef CONFIG_NETWORK_SECMARK __u32 secmark; diff --git a/include/net/ll_poll.h b/include/net/ll_poll.h new file mode 100644 index 0000000..d3c86d6 --- /dev/null +++ b/include/net/ll_poll.h @@ -0,0 +1,86 @@ +/* + * low latency network device queue flush + * Copyright(c) 2013 Intel Corporation. + * Author: Eliezer Tamir + * + * For now this depends on CONFIG_I86_TSC + */ + +#ifndef _LINUX_NET_LL_POLL_H +#define _LINUX_NET_LL_POLL_H + +#ifdef CONFIG_INET_LL_RX_POLL +#include <linux/netdevice.h> +struct napi_struct; +extern int sysctl_net_ll_poll __read_mostly; + +/* return values from ndo_ll_poll */ +#define LL_FLUSH_DONE 0 +#define LL_FLUSH_FAILED 1 +#define LL_FLUSH_BUSY 2 + +/* we don't mind a ~2.5% imprecision */ +#define TSC_MHZ (tsc_khz >> 10) + +static inline bool sk_valid_ll(struct sock *sk) +{ + return sysctl_net_ll_poll && sk->dev_ref && + !need_resched() && !signal_pending(current); +} + +static inline bool sk_poll_ll(struct sock *sk, int nonblock) +{ + struct napi_struct *napi = sk->dev_ref; + const struct net_device_ops *ops; + unsigned long end_time = TSC_MHZ * ACCESS_ONCE(sysctl_net_ll_poll) + + get_cycles(); + + if (!napi->dev->netdev_ops->ndo_ll_poll) + return false; + + local_bh_disable(); + + ops = napi->dev->netdev_ops; + while (skb_queue_empty(&sk->sk_receive_queue) && + !time_after((unsigned long)get_cycles(), end_time)) { + if (ops->ndo_ll_poll(napi) == LL_FLUSH_FAILED) + break; /* premanent failure */ + if (nonblock) + break; + } + + local_bh_enable(); + + return !skb_queue_empty(&sk->sk_receive_queue); +} + +static inline void skb_mark_ll(struct sk_buff *skb, struct napi_struct *napi) +{ + skb->dev_ref = napi; +} + +static inline void sk_mark_ll(struct sock *sk, struct sk_buff *skb) +{ + sk->dev_ref = skb->dev_ref; +} +#else /* CONFIG_INET_LL_RX_FLUSH */ + +static inline bool sk_valid_ll(struct sock *sk) +{ + return 0; +} + +static inline bool sk_poll_ll(struct sock *sk, int nonblock) +{ + return 0; +} + +static inline void skb_mark_ll(struct sk_buff *skb, struct napi_struct *napi) +{ +} + +static inline void sk_mark_ll(struct sock *sk, struct sk_buff *skb) +{ +} +#endif /* CONFIG_INET_LL_RX_FLUSH */ +#endif /* _LINUX_NET_LL_POLL_H */ diff --git a/include/net/sock.h b/include/net/sock.h index 66772cf..eeabdcd4 100644 --- a/include/net/sock.h +++ b/include/net/sock.h @@ -399,6 +399,9 @@ struct sock { int (*sk_backlog_rcv)(struct sock *sk, struct sk_buff *skb); void (*sk_destruct)(struct sock *sk); +#ifdef CONFIG_INET_LL_RX_POLL + struct napi_struct *dev_ref; +#endif }; /* diff --git a/net/core/datagram.c b/net/core/datagram.c index b71423d..df3dab8 100644 --- a/net/core/datagram.c +++ b/net/core/datagram.c @@ -56,6 +56,7 @@ #include <net/sock.h> #include <net/tcp_states.h> #include <trace/events/skb.h> +#include <net/ll_poll.h> /* * Is a socket 'connection oriented' ? @@ -201,12 +202,18 @@ struct sk_buff *__skb_recv_datagram(struct sock *sk, unsigned int flags, } else __skb_unlink(skb, queue); + sk_mark_ll(sk, skb); spin_unlock_irqrestore(&queue->lock, cpu_flags); *off = _off; return skb; } spin_unlock_irqrestore(&queue->lock, cpu_flags); +#ifdef CONFIG_INET_LL_RX_POLL + if (sk_valid_ll(sk) && sk_poll_ll(sk, flags & MSG_DONTWAIT)) + continue; +#endif + /* User doesn't want to wait */ error = -EAGAIN; if (!timeo) diff --git a/net/core/skbuff.c b/net/core/skbuff.c index af9185d..4efd230 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -739,6 +739,10 @@ static void __copy_skb_header(struct sk_buff *new, const struct sk_buff *old) new->vlan_tci = old->vlan_tci; skb_copy_secmark(new, old); + +#ifdef CONFIG_INET_LL_RX_POLL + new->dev_ref = old->dev_ref; +#endif } /* diff --git a/net/core/sock.c b/net/core/sock.c index 6ba327d..d8058ce 100644 --- a/net/core/sock.c +++ b/net/core/sock.c @@ -139,6 +139,8 @@ #include <net/tcp.h> #endif +#include <net/ll_poll.h> + static DEFINE_MUTEX(proto_list_mutex); static LIST_HEAD(proto_list); @@ -2284,6 +2286,10 @@ void sock_init_data(struct socket *sock, struct sock *sk) sk->sk_stamp = ktime_set(-1L, 0); +#ifdef CONFIG_INET_LL_RX_POLL + sk->dev_ref = NULL; +#endif + /* * Before updating sk_refcnt, we must commit prior changes to memory * (Documentation/RCU/rculist_nulls.txt for details) diff --git a/net/ipv4/Kconfig b/net/ipv4/Kconfig index 8603ca8..d209f0f 100644 --- a/net/ipv4/Kconfig +++ b/net/ipv4/Kconfig @@ -409,6 +409,18 @@ config INET_LRO If unsure, say Y. +config INET_LL_RX_POLL + bool "Low Latency Receive Poll" + depends on X86_TSC + default n + ---help--- + Support Low Latency Receive Queue Poll. + (For network card drivers which support this option.) + When waiting for data in read or poll call directly into the the device driver + to flush packets which may be pending on the device queues into the stack. + + If unsure, say N. + config INET_DIAG tristate "INET: socket monitoring interface" default y diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c index fa2f63f..d0fcaaf 100644 --- a/net/ipv4/sysctl_net_ipv4.c +++ b/net/ipv4/sysctl_net_ipv4.c @@ -25,6 +25,7 @@ #include <net/inet_frag.h> #include <net/ping.h> #include <net/tcp_memcontrol.h> +#include <net/ll_poll.h> static int zero; static int one = 1; @@ -326,6 +327,15 @@ static struct ctl_table ipv4_table[] = { .mode = 0644, .proc_handler = proc_dointvec }, +#ifdef CONFIG_INET_LL_RX_POLL + { + .procname = "ip_low_latency_poll", + .data = &sysctl_net_ll_poll, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = proc_dointvec + }, +#endif { .procname = "tcp_syn_retries", .data = &sysctl_tcp_syn_retries, diff --git a/net/socket.c b/net/socket.c index 6b94633..e7cce5b 100644 --- a/net/socket.c +++ b/net/socket.c @@ -105,6 +105,12 @@ #include <linux/sockios.h> #include <linux/atalk.h> +#ifdef CONFIG_INET_LL_RX_POLL +#include <net/ll_poll.h> +int sysctl_net_ll_poll __read_mostly; +EXPORT_SYMBOL_GPL(sysctl_net_ll_poll); +#endif + static int sock_no_open(struct inode *irrelevant, struct file *dontcare); static ssize_t sock_aio_read(struct kiocb *iocb, const struct iovec *iov, unsigned long nr_segs, loff_t pos); @@ -1142,13 +1148,29 @@ EXPORT_SYMBOL(sock_create_lite); /* No kernel lock held - perfect */ static unsigned int sock_poll(struct file *file, poll_table *wait) { + unsigned int poll_result; struct socket *sock; /* * We can't return errors to poll, so it's either yes or no. */ sock = file->private_data; - return sock->ops->poll(file, sock, wait); + + poll_result = sock->ops->poll(file, sock, wait); + +#ifdef CONFIG_INET_LL_RX_POLL + if (wait && + !(poll_result & (POLLRDNORM | POLLERR | POLLRDHUP | POLLHUP))) { + struct sock *sk = sock->sk; + + /* only try once per poll */ + if (sk_valid_ll(sk) && sk_poll_ll(sk, 1)) + poll_result = sock->ops->poll(file, sock, wait); + + } +#endif /* CONFIG_INET_LL_RX_POLL */ + + return poll_result; } static int sock_mmap(struct file *file, struct vm_area_struct *vma)