
pktgen node allocation

Message ID 19363.14702.909265.380669@gargle.gargle.HOWL
State Accepted, archived
Delegated to: David Miller

Commit Message

Robert Olsson March 19, 2010, 8:44 a.m. UTC
Hi,
Here is a patch to control packet node allocation and, implicitly,
how packets are DMA'd, etc.

The NODE_ALLOC flag enables the feature, defaulting to numa_node_id();
when enabled, the node can also be set explicitly via a new
node parameter.
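
A minimal usage sketch (interface and node numbers are illustrative;
pgset is the small helper used by the sample pktgen scripts):

PGDEV=/proc/net/pktgen/eth0@0   # device previously added to a pktgen thread
pgset "flag NODE_ALLOC"         # node-aware skb allocation, numa_node_id() by default
pgset "node 1"                  # or pin allocations to an explicit node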

Tested this with 10 Intel 82599 ports on a TYAN S7025 with E5520 CPUs.
Was able to TX/DMA ~80 Gbit/s to Ethernet wires.

Cheers
					--ro


Signed-off-by: Robert Olsson <robert.olsson@its.uu.se>




Comments

Eric Dumazet March 19, 2010, 9:28 a.m. UTC | #1
On Friday, 19 March 2010 at 09:44 +0100, Robert Olsson wrote:
> 
> Hi,
> Here is a patch to control packet node allocation and, implicitly,
> how packets are DMA'd, etc.
> 
> The NODE_ALLOC flag enables the feature, defaulting to numa_node_id();
> when enabled, the node can also be set explicitly via a new
> node parameter
> 
> Tested this with 10 Intel 82599 ports on a TYAN S7025 with E5520 CPUs.
> Was able to TX/DMA ~80 Gbit/s to Ethernet wires.
> 
> Cheers
> 					--ro
> 

I cannot understand how this can help.

__netdev_alloc_skb() is supposed to already take NUMA properties
into account:

int node = dev->dev.parent ? dev_to_node(dev->dev.parent) : -1;

If this doesn't work, we should fix the core stack, not only pktgen :)

Are you allocating memory on the node where the pktgen CPU is running, or
on the node close to the NIC?
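
A quick way to see which node dev_to_node() reports for a NIC is the PCI
device's sysfs attribute; a sketch, with eth0 purely illustrative:

cat /sys/class/net/eth0/device/numa_node   # prints -1 when no node is recorded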


Thanks


Robert Olsson March 19, 2010, 1:35 p.m. UTC | #2
Eric Dumazet writes:
 > On Friday, 19 March 2010 at 09:44 +0100, Robert Olsson wrote:
 > 
 > I cannot understand how this can help.
 > 
 > __netdev_alloc_skb() is supposed to already take NUMA properties
 > into account:
 > 
 > int node = dev->dev.parent ? dev_to_node(dev->dev.parent) : -1;
 > 
 > If this doesn't work, we should fix the core stack, not only pktgen :)
 > 
 > Are you allocating memory on the node where the pktgen CPU is running, or
 > on the node close to the NIC?

 I didn't say it should help; the idea was to give some hooks to
 experiment with and see the effects of different node memory allocations.
 There are many degrees of freedom w.r.t. buses (devices)/CPUs/memory.

 Cheers
				--ro

Eric Dumazet March 19, 2010, 1:47 p.m. UTC | #3
On Friday, 19 March 2010 at 14:35 +0100, robert@herjulf.net wrote:
> Eric Dumazet writes:
>  > On Friday, 19 March 2010 at 09:44 +0100, Robert Olsson wrote:
>  > 
>  > I cannot understand how this can help.
>  > 
>  > __netdev_alloc_skb() is supposed to already take NUMA properties
>  > into account:
>  > 
>  > int node = dev->dev.parent ? dev_to_node(dev->dev.parent) : -1;
>  > 
>  > If this doesn't work, we should fix the core stack, not only pktgen :)
>  > 
>  > Are you allocating memory on the node where the pktgen CPU is running, or
>  > on the node close to the NIC?
> 
>  I didn't say it should help; the idea was to give some hooks to
>  experiment with and see the effects of different node memory allocations.
>  There are many degrees of freedom w.r.t. buses (devices)/CPUs/memory.
> 

Well, you said "Tested this with 10 Intel 82599 ports on a TYAN S7025
with E5520 CPUs. Was able to TX/DMA ~80 Gbit/s to Ethernet wires."

I am interested to know what particular setup you used to maximize
throughput then, or are you saying you managed to reduce it? :)


David Miller March 22, 2010, 3:37 a.m. UTC | #4
From: robert@herjulf.net
Date: Fri, 19 Mar 2010 14:35:22 +0100

> 
> Eric Dumazet writes:
>  > On Friday, 19 March 2010 at 09:44 +0100, Robert Olsson wrote:
>  > 
>  > I cannot understand how this can help.
>  > 
>  > __netdev_alloc_skb() is supposed to already take NUMA properties
>  > into account:
>  > 
>  > int node = dev->dev.parent ? dev_to_node(dev->dev.parent) : -1;
>  > 
>  > If this doesn't work, we should fix the core stack, not only pktgen :)
>  > 
>  > Are you allocating memory on the node where the pktgen CPU is running, or
>  > on the node close to the NIC?
> 
>  I didn't say it should help; the idea was to give some hooks to
>  experiment with and see the effects of different node memory allocations.
>  There are many degrees of freedom w.r.t. buses (devices)/CPUs/memory.

I think it's a useful feature, and by default the netdev allocation path
is still used, so... applied to net-next-2.6
Robert Olsson March 22, 2010, 6:24 a.m. UTC | #5
Eric Dumazet writes:

 > Well, you said "Tested this with 10 Intel 82599 ports on a TYAN S7025
 > with E5520 CPUs. Was able to TX/DMA ~80 Gbit/s to Ethernet wires."
 > 
 > I am interested to know what particular setup you used to maximize
 > throughput then, or are you saying you managed to reduce it? :)


Some notes from the experiment; it's getting
complex and hairy. Anyway, here are results from the first
tests to give you an idea... My colleague Olof
might have some comments/details.

pktgen sending on 10 * 10g interfaces. 

[From pktgen script]
fn()
{
  i=$1  #ifname
  c=$2  #queue / cpu core
  n=$3  # numa node
  PGDEV=/proc/net/pktgen/kpktgend_$c
  pgset "add_device eth$i@$c  "
  PGDEV=/proc/net/pktgen/eth$i@$c
  pgset "node $n"
  pgset "$COUNT"
  pgset "flag NODE_ALLOC"
  pgset "$CLONE_SKB"
  pgset "$PKT_SIZE"
  pgset "$DELAY"
  pgset "dst 10.0.0.0" 
}      

remove_all
# Setup

# TYAN S7025 with two nodes.
# Each node has its own bus with its own TYLERSBURG bridge,
# so eth0-eth3 are closest to node0, which in turn "owns"
# CPU cores 0-3 in this HW setup. So we set up
# pktgen according to this. clone_skb=1000000.
# Slots used are PCIe x16 except where PCIe x8 is indicated.

# eth0 queue=0(CPU) node=0
fn 0 0 0
fn 1 1 0
fn 2 2 0
fn 3 3 0
fn 4 4 1
fn 5 5 1
fn 6 6 1
fn 7 7 1
fn 8 12 1
fn 9 13 1
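
Before starting, each device's configuration can be double-checked by
reading its /proc file; with this patch the output gains a node: line and
the NODE_ALLOC flag (sketch, eth0@0 as an example):

grep -E 'Flags|node' /proc/net/pktgen/eth0@0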

Result "manually" tuned. 

eth0 9617.7 M bit/s      822 k pps 
eth1 9619.1 M bit/s      823 k pps 
eth2 9619.1 M bit/s      823 k pps 
eth3 9619.2 M bit/s      823 k pps 
eth4 5995.2 M bit/s      512 k pps  <-  PCIe-x8
eth5 5995.3 M bit/s      512 k pps  <-  PCIe-x8
eth6 9619.2 M bit/s      823 k pps 
eth7 9619.2 M bit/s      823 k pps 
eth8 9619.1 M bit/s      823 k pps 
eth9 9619.0 M bit/s      823 k pps 

> 90 Gbit/s

Result "manually" mistuned by switching node 0 and 1. 

eth0 9613.6 M bit/s      822 k pps 
eth1 9614.9 M bit/s      822 k pps 
eth2 9615.0 M bit/s      822 k pps 
eth3 9615.1 M bit/s      822 k pps 
eth4 2918.5 M bit/s      249 k pps  <-  PCIe-x8
eth5 2918.4 M bit/s      249 k pps  <-  PCIe-x8
eth6 8597.0 M bit/s      735 k pps 
eth7 8597.0 M bit/s      735 k pps 
eth8 8568.3 M bit/s      733 k pps 
eth9 8568.3 M bit/s      733 k pps 

A lot of things remain to be investigated...

Cheers
					--ro
Eric Dumazet March 22, 2010, 7:43 a.m. UTC | #6
On Monday, 22 March 2010 at 07:24 +0100, Robert Olsson wrote:
> Eric Dumazet writes:
> 
>  > Well, you said "Tested this with 10 Intel 82599 ports on a TYAN S7025
>  > with E5520 CPUs. Was able to TX/DMA ~80 Gbit/s to Ethernet wires."
>  > 
>  > I am interested to know what particular setup you used to maximize
>  > throughput then, or are you saying you managed to reduce it? :)
> 
> 
> Some notes from the experiment; it's getting
> complex and hairy. Anyway, here are results from the first
> tests to give you an idea... My colleague Olof
> might have some comments/details.
> 
> pktgen sending on 10 * 10g interfaces. 
> 
> [From pktgen script]
> fn()
> {
>   i=$1  #ifname
>   c=$2  #queue / cpu core
>   n=$3  # numa node
>   PGDEV=/proc/net/pktgen/kpktgend_$c
>   pgset "add_device eth$i@$c  "
>   PGDEV=/proc/net/pktgen/eth$i@$c
>   pgset "node $n"
>   pgset "$COUNT"
>   pgset "flag NODE_ALLOC"
>   pgset "$CLONE_SKB"
>   pgset "$PKT_SIZE"
>   pgset "$DELAY"
>   pgset "dst 10.0.0.0" 
> }      
> 
> remove_all
> # Setup
> 
> # TYAN S7025 with two nodes.
> # Each node has its own bus with its own TYLERSBURG bridge,
> # so eth0-eth3 are closest to node0, which in turn "owns"
> # CPU cores 0-3 in this HW setup. So we set up
> # pktgen according to this. clone_skb=1000000.
> # Slots used are PCIe x16 except where PCIe x8 is indicated.
> 
> # eth0 queue=0(CPU) node=0
> fn 0 0 0
> fn 1 1 0
> fn 2 2 0
> fn 3 3 0
> fn 4 4 1
> fn 5 5 1
> fn 6 6 1
> fn 7 7 1
> fn 8 12 1
> fn 9 13 1
> 
> Result "manually" tuned. 
> 
> eth0 9617.7 M bit/s      822 k pps 
> eth1 9619.1 M bit/s      823 k pps 
> eth2 9619.1 M bit/s      823 k pps 
> eth3 9619.2 M bit/s      823 k pps 
> eth4 5995.2 M bit/s      512 k pps  <-  PCIe-x8
> eth5 5995.3 M bit/s      512 k pps  <-  PCIe-x8
> eth6 9619.2 M bit/s      823 k pps 
> eth7 9619.2 M bit/s      823 k pps 
> eth8 9619.1 M bit/s      823 k pps 
> eth9 9619.0 M bit/s      823 k pps 
> 
> > 90 Gbit/s
> 
> Result "manually" mistuned by switching node 0 and 1. 
> 
> eth0 9613.6 M bit/s      822 k pps 
> eth1 9614.9 M bit/s      822 k pps 
> eth2 9615.0 M bit/s      822 k pps 
> eth3 9615.1 M bit/s      822 k pps 
> eth4 2918.5 M bit/s      249 k pps  <-  PCIe-x8
> eth5 2918.4 M bit/s      249 k pps  <-  PCIe-x8
> eth6 8597.0 M bit/s      735 k pps 
> eth7 8597.0 M bit/s      735 k pps 
> eth8 8568.3 M bit/s      733 k pps 
> eth9 8568.3 M bit/s      733 k pps 
> 
> A lot of things remain to be investigated...

Sure :)

I wonder why eth0-eth3 results are unchanged after a node flip.

Thanks for sharing


Robert Olsson March 22, 2010, 6:05 p.m. UTC | #7
Eric Dumazet writes:

 > > Result "manually" tuned. 
 > > 
 > > eth0 9617.7 M bit/s      822 k pps 
 > > eth1 9619.1 M bit/s      823 k pps 
 > > eth2 9619.1 M bit/s      823 k pps 
 > > eth3 9619.2 M bit/s      823 k pps 
 > > eth4 5995.2 M bit/s      512 k pps  <-  PCIe-x8
 > > eth5 5995.3 M bit/s      512 k pps  <-  PCIe-x8
 > > eth6 9619.2 M bit/s      823 k pps 
 > > eth7 9619.2 M bit/s      823 k pps 
 > > eth8 9619.1 M bit/s      823 k pps 
 > > eth9 9619.0 M bit/s      823 k pps 
 > > 
 > > > 90 Gbit/s

 The DMA potential of this box is about four 10g ports.

 > > Result "manually" mistuned by swapping nodes 0 and 1.
 > > 
 > > eth0 9613.6 M bit/s      822 k pps 
 > > eth1 9614.9 M bit/s      822 k pps 
 > > eth2 9615.0 M bit/s      822 k pps 
 > > eth3 9615.1 M bit/s      822 k pps 
 > > eth4 2918.5 M bit/s      249 k pps  <-  PCIe-x8
 > > eth5 2918.4 M bit/s      249 k pps  <-  PCIe-x8
 > > eth6 8597.0 M bit/s      735 k pps 
 > > eth7 8597.0 M bit/s      735 k pps 
 > > eth8 8568.3 M bit/s      733 k pps 
 > > eth9 8568.3 M bit/s      733 k pps 
 > > 
 > I wonder why eth0-eth3 results are unchanged after a node flip.

 Yes, it's strange.

 With clone_skb=1 we could see differences with just one GigE interface
 using 64-byte pkts, so it might be very different on 10g. We're
 unfortunately getting closer to hardware...
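
 For context: clone_skb sets how many times each skb is reused before a
 fresh one is allocated, so clone_skb=1 exercises the allocator on nearly
 every packet, while clone_skb=1000000 amortizes allocation away. A sketch,
 device name illustrative:

 PGDEV=/proc/net/pktgen/eth0@0
 pgset "clone_skb 1"   # near per-packet allocation makes node placement visible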

 Cheers
					--ro

 

Patch

diff --git a/net/core/pktgen.c b/net/core/pktgen.c
index 4392381..c195fd0 100644
--- a/net/core/pktgen.c
+++ b/net/core/pktgen.c
@@ -169,7 +169,7 @@ 
 #include <asm/dma.h>
 #include <asm/div64.h>		/* do_div */
 
-#define VERSION 	"2.72"
+#define VERSION 	"2.73"
 #define IP_NAME_SZ 32
 #define MAX_MPLS_LABELS 16 /* This is the max label stack depth */
 #define MPLS_STACK_BOTTOM htonl(0x00000100)
@@ -190,6 +190,7 @@ 
 #define F_IPSEC_ON    (1<<12)	/* ipsec on for flows */
 #define F_QUEUE_MAP_RND (1<<13)	/* queue map Random */
 #define F_QUEUE_MAP_CPU (1<<14)	/* queue map mirrors smp_processor_id() */
+#define F_NODE          (1<<15)	/* Node memory alloc*/
 
 /* Thread control flag bits */
 #define T_STOP        (1<<0)	/* Stop run */
@@ -372,6 +373,7 @@  struct pktgen_dev {
 
 	u16 queue_map_min;
 	u16 queue_map_max;
+	int node;               /* Memory node */
 
 #ifdef CONFIG_XFRM
 	__u8	ipsmode;		/* IPSEC mode (config) */
@@ -607,6 +609,9 @@  static int pktgen_if_show(struct seq_file *seq, void *v)
 	if (pkt_dev->traffic_class)
 		seq_printf(seq, "     traffic_class: 0x%02x\n", pkt_dev->traffic_class);
 
+	if (pkt_dev->node >= 0)
+		seq_printf(seq, "     node: %d\n", pkt_dev->node);
+
 	seq_printf(seq, "     Flags: ");
 
 	if (pkt_dev->flags & F_IPV6)
@@ -660,6 +665,9 @@  static int pktgen_if_show(struct seq_file *seq, void *v)
 	if (pkt_dev->flags & F_SVID_RND)
 		seq_printf(seq, "SVID_RND  ");
 
+	if (pkt_dev->flags & F_NODE)
+		seq_printf(seq, "NODE_ALLOC  ");
+
 	seq_puts(seq, "\n");
 
 	/* not really stopped, more like last-running-at */
@@ -1074,6 +1082,21 @@  static ssize_t pktgen_if_write(struct file *file,
 			pkt_dev->dst_mac_count);
 		return count;
 	}
+	if (!strcmp(name, "node")) {
+		len = num_arg(&user_buffer[i], 10, &value);
+		if (len < 0)
+			return len;
+
+		i += len;
+
+		if(node_possible(value)) {
+			pkt_dev->node = value;
+			sprintf(pg_result, "OK: node=%d", pkt_dev->node);
+		}
+		else
+			sprintf(pg_result, "ERROR: node not possible");
+		return count;
+	}
 	if (!strcmp(name, "flag")) {
 		char f[32];
 		memset(f, 0, 32);
@@ -1166,12 +1189,18 @@  static ssize_t pktgen_if_write(struct file *file,
 		else if (strcmp(f, "!IPV6") == 0)
 			pkt_dev->flags &= ~F_IPV6;
 
+		else if (strcmp(f, "NODE_ALLOC") == 0)
+			pkt_dev->flags |= F_NODE;
+
+		else if (strcmp(f, "!NODE_ALLOC") == 0)
+			pkt_dev->flags &= ~F_NODE;
+
 		else {
 			sprintf(pg_result,
 				"Flag -:%s:- unknown\nAvailable flags, (prepend ! to un-set flag):\n%s",
 				f,
 				"IPSRC_RND, IPDST_RND, UDPSRC_RND, UDPDST_RND, "
-				"MACSRC_RND, MACDST_RND, TXSIZE_RND, IPV6, MPLS_RND, VID_RND, SVID_RND, FLOW_SEQ, IPSEC\n");
+				"MACSRC_RND, MACDST_RND, TXSIZE_RND, IPV6, MPLS_RND, VID_RND, SVID_RND, FLOW_SEQ, IPSEC, NODE_ALLOC\n");
 			return count;
 		}
 		sprintf(pg_result, "OK: flags=0x%x", pkt_dev->flags);
@@ -2572,9 +2601,27 @@  static struct sk_buff *fill_packet_ipv4(struct net_device *odev,
 	mod_cur_headers(pkt_dev);
 
 	datalen = (odev->hard_header_len + 16) & ~0xf;
-	skb = __netdev_alloc_skb(odev,
-				 pkt_dev->cur_pkt_size + 64
-				 + datalen + pkt_dev->pkt_overhead, GFP_NOWAIT);
+
+	if(pkt_dev->flags & F_NODE) {
+		int node;
+
+		if(pkt_dev->node >= 0)
+			node = pkt_dev->node;
+		else
+			node =  numa_node_id();
+
+		skb = __alloc_skb(NET_SKB_PAD + pkt_dev->cur_pkt_size + 64
+				  + datalen + pkt_dev->pkt_overhead, GFP_NOWAIT, 0, node);
+		if (likely(skb)) {
+			skb_reserve(skb, NET_SKB_PAD);
+			skb->dev = odev;
+		}
+	}
+	else
+	  skb = __netdev_alloc_skb(odev,
+				   pkt_dev->cur_pkt_size + 64
+				   + datalen + pkt_dev->pkt_overhead, GFP_NOWAIT);
+
 	if (!skb) {
 		sprintf(pkt_dev->result, "No memory");
 		return NULL;
@@ -3674,6 +3721,7 @@  static int pktgen_add_device(struct pktgen_thread *t, const char *ifname)
 	pkt_dev->svlan_p = 0;
 	pkt_dev->svlan_cfi = 0;
 	pkt_dev->svlan_id = 0xffff;
+	pkt_dev->node = -1;
 
 	err = pktgen_setup_dev(pkt_dev, ifname);
 	if (err)
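
For reference, a hedged sketch of how the new node parameter behaves from
userspace on a two-node box (device name illustrative; pgset is the usual
helper that writes to $PGDEV and checks the Result: line):

PGDEV=/proc/net/pktgen/eth0@0
pgset "node 1"   # Result: OK: node=1
pgset "node 7"   # rejected by node_possible(): Result: ERROR: node not possible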