Message ID | 4A092F59.7020900@cosmosbay.com |
---|---|
State | Superseded, archived |
Delegated to: | David Miller |
Headers | show |
Eric Dumazet a écrit : > One point of contention in high network loads is the dst_release() performed > when a transmited skb is freed. This is because NIC tx completion calls > dev_kree_skb() long after original call to dev_queue_xmit(skb). > > CPU cache is cold and the atomic op in dst_release() stalls. On SMP, this is > quite visible if one CPU is 100% handling softirqs for a network device, > since dst_clone() is done by other cpus, involving cache line ping pongs. > > It seems right place to release dst is in dev_hard_start_xmit(), for most > devices but ones that are virtual, and some exceptions. > > David Miller suggested to define a new device flag, set in alloc_netdev_mq() > (so that most devices set it at init time), and carefuly unset in devices > which dont want a NULL skb->dst in their ndo_start_xmit(). > > List of devices that must clear this flag is : > > - loopback device, because it calls netif_rx() and quoting Patrick : > "ip_route_input() doesn't accept loopback addresses, so loopback packets > already need to have a dst_entry attached." > - appletalk/ipddp.c : needs skb->dst in its xmit function > > - And all devices that call again dev_queue_xmit() from their xmit function > (as some classifiers need skb->dst) : bonding, macvlan, eql, ifb, hdlc_fr > > Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> > --- > drivers/net/appletalk/ipddp.c | 1 + > drivers/net/bonding/bond_main.c | 1 + > drivers/net/eql.c | 1 + > drivers/net/ifb.c | 1 + > drivers/net/loopback.c | 1 + > drivers/net/macvlan.c | 1 + > drivers/net/wan/hdlc_fr.c | 1 + > include/linux/if.h | 3 +++ > net/core/dev.c | 9 +++++++++ > 9 files changed, 19 insertions(+) > > diff --git a/drivers/net/appletalk/ipddp.c b/drivers/net/appletalk/ipddp.c > index da64ba8..0268561 100644 > --- a/drivers/net/appletalk/ipddp.c > +++ b/drivers/net/appletalk/ipddp.c > @@ -71,6 +71,7 @@ static struct net_device * __init ipddp_init(void) > if (!dev) > return ERR_PTR(-ENOMEM); > > + dev->priv_flags &= IFF_XMIT_DST_RELEASE; Oops, I forgot the ~ here, sorry > strcpy(dev->name, "ipddp%d"); > > if (version_printed++ == 0) > diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c > index 815191d..a29f421 100644 > --- a/drivers/net/bonding/bond_main.c > +++ b/drivers/net/bonding/bond_main.c > @@ -5148,6 +5148,7 @@ int bond_create(char *name, struct bond_params *params) > goto out_rtnl; > } > > + bond_dev->priv_flags &= ~IFF_XMIT_DST_RELEASE; > if (!name) { > res = dev_alloc_name(bond_dev, "bond%d"); > if (res < 0) > diff --git a/drivers/net/eql.c b/drivers/net/eql.c > index 5210bb1..19b7dd9 100644 > --- a/drivers/net/eql.c > +++ b/drivers/net/eql.c > @@ -194,6 +194,7 @@ static void __init eql_setup(struct net_device *dev) > > dev->type = ARPHRD_SLIP; > dev->tx_queue_len = 5; /* Hands them off fast */ > + dev->priv_flags &= ~IFF_XMIT_DST_RELEASE; > } > > static int eql_open(struct net_device *dev) > diff --git a/drivers/net/ifb.c b/drivers/net/ifb.c > index 60a2630..96713ef 100644 > --- a/drivers/net/ifb.c > +++ b/drivers/net/ifb.c > @@ -156,6 +156,7 @@ static void ifb_setup(struct net_device *dev) > > dev->flags |= IFF_NOARP; > dev->flags &= ~IFF_MULTICAST; > + dev->priv_flags &= ~IFF_XMIT_DST_RELEASE; > random_ether_addr(dev->dev_addr); > } > > diff --git a/drivers/net/loopback.c b/drivers/net/loopback.c > index 6f71157..da472c6 100644 > --- a/drivers/net/loopback.c > +++ b/drivers/net/loopback.c > @@ -170,6 +170,7 @@ static void loopback_setup(struct net_device *dev) > dev->tx_queue_len = 0; > dev->type = ARPHRD_LOOPBACK; /* 0x0001*/ > dev->flags = IFF_LOOPBACK; > + dev->priv_flags &= ~IFF_XMIT_DST_RELEASE; > dev->features = NETIF_F_SG | NETIF_F_FRAGLIST > | NETIF_F_TSO > | NETIF_F_NO_CSUM > diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c > index 329cd50..d5334b4 100644 > --- a/drivers/net/macvlan.c > +++ b/drivers/net/macvlan.c > @@ -414,6 +414,7 @@ static void macvlan_setup(struct net_device *dev) > { > ether_setup(dev); > > + dev->priv_flags &= ~IFF_XMIT_DST_RELEASE; > dev->netdev_ops = &macvlan_netdev_ops; > dev->destructor = free_netdev; > dev->header_ops = &macvlan_hard_header_ops, > diff --git a/drivers/net/wan/hdlc_fr.c b/drivers/net/wan/hdlc_fr.c > index 8005301..bfa0161 100644 > --- a/drivers/net/wan/hdlc_fr.c > +++ b/drivers/net/wan/hdlc_fr.c > @@ -1054,6 +1054,7 @@ static void pvc_setup(struct net_device *dev) > dev->flags = IFF_POINTOPOINT; > dev->hard_header_len = 10; > dev->addr_len = 2; > + dev->priv_flags &= ~IFF_XMIT_DST_RELEASE; > } > > static const struct net_device_ops pvc_ops = { > diff --git a/include/linux/if.h b/include/linux/if.h > index 1108f3e..b9a6229 100644 > --- a/include/linux/if.h > +++ b/include/linux/if.h > @@ -67,6 +67,9 @@ > #define IFF_ISATAP 0x80 /* ISATAP interface (RFC4214) */ > #define IFF_MASTER_ARPMON 0x100 /* bonding master, ARP mon in use */ > #define IFF_WAN_HDLC 0x200 /* WAN HDLC device */ > +#define IFF_XMIT_DST_RELEASE 0x400 /* dev_hard_start_xmit() is allowed to > + * release skb->dst > + */ > > #define IF_GET_IFACE 0x0001 /* for querying only */ > #define IF_GET_PROTO 0x0002 > diff --git a/net/core/dev.c b/net/core/dev.c > index 14dd725..b9aef53 100644 > --- a/net/core/dev.c > +++ b/net/core/dev.c > @@ -1688,6 +1688,14 @@ int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev, > goto gso; > } > > + /* > + * If device doesnt need skb->dst, release it right now while > + * its hot in this cpu cache > + */ > + if ((dev->priv_flags & IFF_XMIT_DST_RELEASE) && skb->dst) { > + dst_release(skb->dst); > + skb->dst = NULL; > + } > rc = ops->ndo_start_xmit(skb, dev); > /* > * TODO: if skb_orphan() was called by > @@ -5028,6 +5036,7 @@ struct net_device *alloc_netdev_mq(int sizeof_priv, const char *name, > netdev_init_queues(dev); > > INIT_LIST_HEAD(&dev->napi_list); > + dev->priv_flags = IFF_XMIT_DST_RELEASE; > setup(dev); > strcpy(dev->name, name); > return dev; > -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, May 12, 2009 at 10:21:12AM +0200, Eric Dumazet wrote: > Eric Dumazet a écrit : ... > > List of devices that must clear this flag is : > > > > - loopback device, because it calls netif_rx() and quoting Patrick : > > "ip_route_input() doesn't accept loopback addresses, so loopback packets > > already need to have a dst_entry attached." > > - appletalk/ipddp.c : needs skb->dst in its xmit function > > > > - And all devices that call again dev_queue_xmit() from their xmit function > > (as some classifiers need skb->dst) : bonding, macvlan, eql, ifb, hdlc_fr Why not vlan (and/or maybe others in net/ using dev_queue_xmit)? Jarek P. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Jarek Poplawski a écrit : > On Tue, May 12, 2009 at 10:21:12AM +0200, Eric Dumazet wrote: >> Eric Dumazet a écrit : > ... >>> List of devices that must clear this flag is : >>> >>> - loopback device, because it calls netif_rx() and quoting Patrick : >>> "ip_route_input() doesn't accept loopback addresses, so loopback packets >>> already need to have a dst_entry attached." >>> - appletalk/ipddp.c : needs skb->dst in its xmit function >>> >>> - And all devices that call again dev_queue_xmit() from their xmit function >>> (as some classifiers need skb->dst) : bonding, macvlan, eql, ifb, hdlc_fr > > Why not vlan (and/or maybe others in net/ using dev_queue_xmit)? > Yes I spoted vlan earlier this afternoon. For other net/*, I didnt not find candidates yet. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, May 12, 2009 at 09:44:52PM +0200, Eric Dumazet wrote: > Jarek Poplawski a écrit : > > On Tue, May 12, 2009 at 10:21:12AM +0200, Eric Dumazet wrote: > >> Eric Dumazet a écrit : > > ... > >>> List of devices that must clear this flag is : > >>> > >>> - loopback device, because it calls netif_rx() and quoting Patrick : > >>> "ip_route_input() doesn't accept loopback addresses, so loopback packets > >>> already need to have a dst_entry attached." > >>> - appletalk/ipddp.c : needs skb->dst in its xmit function > >>> > >>> - And all devices that call again dev_queue_xmit() from their xmit function > >>> (as some classifiers need skb->dst) : bonding, macvlan, eql, ifb, hdlc_fr > > > > Why not vlan (and/or maybe others in net/ using dev_queue_xmit)? > > > > Yes I spoted vlan earlier this afternoon. > > For other net/*, I didnt not find candidates yet. Hmm..., if vlan, then why not pppoe (and/or maybe others in drivers/net/ using dev_queue_xmit)? Jarek P. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, May 12, 2009 at 10:05:15PM +0200, Jarek Poplawski wrote: ... > Hmm..., if vlan, then why not pppoe (and/or maybe others in drivers/net/ > using dev_queue_xmit)? I wonder if we can't simplify this e.g. by checking indirectly for some hardware flag/capability like checksums etc. Some older hardware could be omitted, by is it really so big deal? Jarek P. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Jarek Poplawski a écrit : > On Tue, May 12, 2009 at 09:44:52PM +0200, Eric Dumazet wrote: >> Jarek Poplawski a écrit : >>> On Tue, May 12, 2009 at 10:21:12AM +0200, Eric Dumazet wrote: >>>> Eric Dumazet a écrit : >>> ... >>>>> List of devices that must clear this flag is : >>>>> >>>>> - loopback device, because it calls netif_rx() and quoting Patrick : >>>>> "ip_route_input() doesn't accept loopback addresses, so loopback packets >>>>> already need to have a dst_entry attached." >>>>> - appletalk/ipddp.c : needs skb->dst in its xmit function >>>>> >>>>> - And all devices that call again dev_queue_xmit() from their xmit function >>>>> (as some classifiers need skb->dst) : bonding, macvlan, eql, ifb, hdlc_fr >>> Why not vlan (and/or maybe others in net/ using dev_queue_xmit)? >>> >> Yes I spoted vlan earlier this afternoon. >> >> For other net/*, I didnt not find candidates yet. > > Hmm..., if vlan, then why not pppoe (and/or maybe others in drivers/net/ > using dev_queue_xmit)? > As I said, I didnt found other relevant uses, but I am probably wrong :) About ppoe for example, two calls to dev_queue_xmit() are not relevant. One is from static int pppoe_sendmsg(struct kiocb *iocb, struct socket *sock, struct msghdr *m, size_t total_len) This is a normal direct call, not a call from its ndo_start_xmit() Second one is from static int __pppoe_xmit(struct sock *sk, struct sk_buff *skb) Same observation, there is no impact for this one as well. So we have to care on device drivers that have a ndo_start_xmit() call that could re-enter dev_queue_xmit(). Not care about drivers that call dev_queue_xmit() from other paths. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, May 12, 2009 at 10:52:27PM +0200, Eric Dumazet wrote: > Jarek Poplawski a écrit : > > On Tue, May 12, 2009 at 09:44:52PM +0200, Eric Dumazet wrote: > >> Jarek Poplawski a écrit : > >>> On Tue, May 12, 2009 at 10:21:12AM +0200, Eric Dumazet wrote: > >>>> Eric Dumazet a écrit : > >>> ... > >>>>> List of devices that must clear this flag is : > >>>>> > >>>>> - loopback device, because it calls netif_rx() and quoting Patrick : > >>>>> "ip_route_input() doesn't accept loopback addresses, so loopback packets > >>>>> already need to have a dst_entry attached." > >>>>> - appletalk/ipddp.c : needs skb->dst in its xmit function > >>>>> > >>>>> - And all devices that call again dev_queue_xmit() from their xmit function > >>>>> (as some classifiers need skb->dst) : bonding, macvlan, eql, ifb, hdlc_fr > >>> Why not vlan (and/or maybe others in net/ using dev_queue_xmit)? > >>> > >> Yes I spoted vlan earlier this afternoon. > >> > >> For other net/*, I didnt not find candidates yet. > > > > Hmm..., if vlan, then why not pppoe (and/or maybe others in drivers/net/ > > using dev_queue_xmit)? > > > > As I said, I didnt found other relevant uses, but I am probably wrong :) > > About ppoe for example, two calls to dev_queue_xmit() are not relevant. > > One is from > > static int pppoe_sendmsg(struct kiocb *iocb, struct socket *sock, > struct msghdr *m, size_t total_len) > > This is a normal direct call, not a call from its ndo_start_xmit() > > Second one is from > > static int __pppoe_xmit(struct sock *sk, struct sk_buff *skb) > > Same observation, there is no impact for this one as well. > > > > So we have to care on device drivers that have a ndo_start_xmit() call that could > re-enter dev_queue_xmit(). Not care about drivers that call dev_queue_xmit() from > other paths. > Isn't it called through ppp_generic's ndo_start_xmit()? Jarek P. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/drivers/net/appletalk/ipddp.c b/drivers/net/appletalk/ipddp.c index da64ba8..0268561 100644 --- a/drivers/net/appletalk/ipddp.c +++ b/drivers/net/appletalk/ipddp.c @@ -71,6 +71,7 @@ static struct net_device * __init ipddp_init(void) if (!dev) return ERR_PTR(-ENOMEM); + dev->priv_flags &= IFF_XMIT_DST_RELEASE; strcpy(dev->name, "ipddp%d"); if (version_printed++ == 0) diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c index 815191d..a29f421 100644 --- a/drivers/net/bonding/bond_main.c +++ b/drivers/net/bonding/bond_main.c @@ -5148,6 +5148,7 @@ int bond_create(char *name, struct bond_params *params) goto out_rtnl; } + bond_dev->priv_flags &= ~IFF_XMIT_DST_RELEASE; if (!name) { res = dev_alloc_name(bond_dev, "bond%d"); if (res < 0) diff --git a/drivers/net/eql.c b/drivers/net/eql.c index 5210bb1..19b7dd9 100644 --- a/drivers/net/eql.c +++ b/drivers/net/eql.c @@ -194,6 +194,7 @@ static void __init eql_setup(struct net_device *dev) dev->type = ARPHRD_SLIP; dev->tx_queue_len = 5; /* Hands them off fast */ + dev->priv_flags &= ~IFF_XMIT_DST_RELEASE; } static int eql_open(struct net_device *dev) diff --git a/drivers/net/ifb.c b/drivers/net/ifb.c index 60a2630..96713ef 100644 --- a/drivers/net/ifb.c +++ b/drivers/net/ifb.c @@ -156,6 +156,7 @@ static void ifb_setup(struct net_device *dev) dev->flags |= IFF_NOARP; dev->flags &= ~IFF_MULTICAST; + dev->priv_flags &= ~IFF_XMIT_DST_RELEASE; random_ether_addr(dev->dev_addr); } diff --git a/drivers/net/loopback.c b/drivers/net/loopback.c index 6f71157..da472c6 100644 --- a/drivers/net/loopback.c +++ b/drivers/net/loopback.c @@ -170,6 +170,7 @@ static void loopback_setup(struct net_device *dev) dev->tx_queue_len = 0; dev->type = ARPHRD_LOOPBACK; /* 0x0001*/ dev->flags = IFF_LOOPBACK; + dev->priv_flags &= ~IFF_XMIT_DST_RELEASE; dev->features = NETIF_F_SG | NETIF_F_FRAGLIST | NETIF_F_TSO | NETIF_F_NO_CSUM diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c index 329cd50..d5334b4 100644 --- a/drivers/net/macvlan.c +++ b/drivers/net/macvlan.c @@ -414,6 +414,7 @@ static void macvlan_setup(struct net_device *dev) { ether_setup(dev); + dev->priv_flags &= ~IFF_XMIT_DST_RELEASE; dev->netdev_ops = &macvlan_netdev_ops; dev->destructor = free_netdev; dev->header_ops = &macvlan_hard_header_ops, diff --git a/drivers/net/wan/hdlc_fr.c b/drivers/net/wan/hdlc_fr.c index 8005301..bfa0161 100644 --- a/drivers/net/wan/hdlc_fr.c +++ b/drivers/net/wan/hdlc_fr.c @@ -1054,6 +1054,7 @@ static void pvc_setup(struct net_device *dev) dev->flags = IFF_POINTOPOINT; dev->hard_header_len = 10; dev->addr_len = 2; + dev->priv_flags &= ~IFF_XMIT_DST_RELEASE; } static const struct net_device_ops pvc_ops = { diff --git a/include/linux/if.h b/include/linux/if.h index 1108f3e..b9a6229 100644 --- a/include/linux/if.h +++ b/include/linux/if.h @@ -67,6 +67,9 @@ #define IFF_ISATAP 0x80 /* ISATAP interface (RFC4214) */ #define IFF_MASTER_ARPMON 0x100 /* bonding master, ARP mon in use */ #define IFF_WAN_HDLC 0x200 /* WAN HDLC device */ +#define IFF_XMIT_DST_RELEASE 0x400 /* dev_hard_start_xmit() is allowed to + * release skb->dst + */ #define IF_GET_IFACE 0x0001 /* for querying only */ #define IF_GET_PROTO 0x0002 diff --git a/net/core/dev.c b/net/core/dev.c index 14dd725..b9aef53 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -1688,6 +1688,14 @@ int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev, goto gso; } + /* + * If device doesnt need skb->dst, release it right now while + * its hot in this cpu cache + */ + if ((dev->priv_flags & IFF_XMIT_DST_RELEASE) && skb->dst) { + dst_release(skb->dst); + skb->dst = NULL; + } rc = ops->ndo_start_xmit(skb, dev); /* * TODO: if skb_orphan() was called by @@ -5028,6 +5036,7 @@ struct net_device *alloc_netdev_mq(int sizeof_priv, const char *name, netdev_init_queues(dev); INIT_LIST_HEAD(&dev->napi_list); + dev->priv_flags = IFF_XMIT_DST_RELEASE; setup(dev); strcpy(dev->name, name); return dev;
One point of contention in high network loads is the dst_release() performed when a transmited skb is freed. This is because NIC tx completion calls dev_kree_skb() long after original call to dev_queue_xmit(skb). CPU cache is cold and the atomic op in dst_release() stalls. On SMP, this is quite visible if one CPU is 100% handling softirqs for a network device, since dst_clone() is done by other cpus, involving cache line ping pongs. It seems right place to release dst is in dev_hard_start_xmit(), for most devices but ones that are virtual, and some exceptions. David Miller suggested to define a new device flag, set in alloc_netdev_mq() (so that most devices set it at init time), and carefuly unset in devices which dont want a NULL skb->dst in their ndo_start_xmit(). List of devices that must clear this flag is : - loopback device, because it calls netif_rx() and quoting Patrick : "ip_route_input() doesn't accept loopback addresses, so loopback packets already need to have a dst_entry attached." - appletalk/ipddp.c : needs skb->dst in its xmit function - And all devices that call again dev_queue_xmit() from their xmit function (as some classifiers need skb->dst) : bonding, macvlan, eql, ifb, hdlc_fr Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> --- drivers/net/appletalk/ipddp.c | 1 + drivers/net/bonding/bond_main.c | 1 + drivers/net/eql.c | 1 + drivers/net/ifb.c | 1 + drivers/net/loopback.c | 1 + drivers/net/macvlan.c | 1 + drivers/net/wan/hdlc_fr.c | 1 + include/linux/if.h | 3 +++ net/core/dev.c | 9 +++++++++ 9 files changed, 19 insertions(+) -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html