diff mbox

[5/5,v4] net: add old_queue_mapping into skb->cb

Message ID 1292475410-24665-1-git-send-email-xiaosuo@gmail.com
State RFC, archived
Delegated to: David Miller
Headers show

Commit Message

Changli Gao Dec. 16, 2010, 4:56 a.m. UTC
For the skbs returned from ifb, we should use the queue_mapping
saved before ifb.

We save old queue_mapping in old_queue_mapping just before calling 
dev_queue_xmit, and restore the old_queue_mapping to queue_mapping
just before reinjecting the ingress packets.

A new struct dev_skb_cb is added, and valid in qdisc and gso layer.
The original qdisc_skb_cb and DEV_GSO_CB use dev_skb_cb as the first
member.

netem_skb_cb is changed to contain qdisc_skb_cb.

Signed-off-by: Changli Gao <xiaosuo@gmail.com>
---
v4: don't restore the old_queue_mapping for egress packets.
v3: use the method from Eric to allocate memory from skb->cb. Thank him.
v2: save old_queue_mapping in skb->cb
 drivers/net/ifb.c         |    1 +
 include/linux/netdevice.h |   10 ++++++++++
 include/net/sch_generic.h |    3 ++-
 net/core/dev.c            |   15 ++++++++++-----
 net/sched/act_mirred.c    |    1 +
 net/sched/sch_netem.c     |    8 ++++----
 6 files changed, 28 insertions(+), 10 deletions(-)
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Eric Dumazet Dec. 16, 2010, 5:47 a.m. UTC | #1
Le jeudi 16 décembre 2010 à 12:56 +0800, Changli Gao a écrit :
> For the skbs returned from ifb, we should use the queue_mapping
> saved before ifb.
> 
> We save old queue_mapping in old_queue_mapping just before calling 
> dev_queue_xmit, and restore the old_queue_mapping to queue_mapping
> just before reinjecting the ingress packets.
> 
> A new struct dev_skb_cb is added, and valid in qdisc and gso layer.
> The original qdisc_skb_cb and DEV_GSO_CB use dev_skb_cb as the first
> member.
> 
> netem_skb_cb is changed to contain qdisc_skb_cb.
> 
> Signed-off-by: Changli Gao <xiaosuo@gmail.com>

Acked-by: Eric Dumazet <eric.dumazet@gmail.com>


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
jamal Dec. 17, 2010, 1:09 p.m. UTC | #2
On Thu, 2010-12-16 at 12:56 +0800, Changli Gao wrote:
> For the skbs returned from ifb, we should use the queue_mapping
> saved before ifb.
> 
> We save old queue_mapping in old_queue_mapping just before calling 
> dev_queue_xmit, and restore the old_queue_mapping to queue_mapping
> just before reinjecting the ingress packets.
> 
> A new struct dev_skb_cb is added, and valid in qdisc and gso layer.
> The original qdisc_skb_cb and DEV_GSO_CB use dev_skb_cb as the first
> member.
> 
> netem_skb_cb is changed to contain qdisc_skb_cb.

I am sorry Changli - I think we are talking past each other. I
a conflicted on the whole point of saving and restoring these
devqueue mappings. I understand that for ifb, saving and restoring the
original devs is fundamental for its operation- but i am not sure i see
it for the queues. As an example:

---
# For all packets arriving on ifb0, change mapping to 3 and
# redirect to to ifb1

tc filter add dev ifb0 parent 1:0 protocol ip prio 10 u32 \
match u32 0 0 flowid 1:2 \
action skbedit queue_mapping 4 \
action mirred egress redirect dev ifb1
#
# redirect all packets arriving in eth0 to ifb0
$TC filter add eth0 parent 1:0 protocol ip prio 10 u32 \
match u32 0 0 flowid 1:2 action mirred egress redirect dev ifb0
----

what is the expected behavior?

cheers,
jamal

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Changli Gao Dec. 17, 2010, 1:41 p.m. UTC | #3
On Fri, Dec 17, 2010 at 9:09 PM, jamal <hadi@cyberus.ca> wrote:
> On Thu, 2010-12-16 at 12:56 +0800, Changli Gao wrote:
>
> I am sorry Changli - I think we are talking past each other. I
> a conflicted on the whole point of saving and restoring these
> devqueue mappings. I understand that for ifb, saving and restoring the
> original devs is fundamental for its operation- but i am not sure i see
> it for the queues. As an example:
>
> ---
> # For all packets arriving on ifb0, change mapping to 3 and
> # redirect to to ifb1
>
> tc filter add dev ifb0 parent 1:0 protocol ip prio 10 u32 \
> match u32 0 0 flowid 1:2 \
> action skbedit queue_mapping 4 \
> action mirred egress redirect dev ifb1
> #
> # redirect all packets arriving in eth0 to ifb0
> $TC filter add eth0 parent 1:0 protocol ip prio 10 u32 \
> match u32 0 0 flowid 1:2 action mirred egress redirect dev ifb0
> ----
>
> what is the expected behavior?
>

I doubt it can work.

eth0 -> ifb0: skb->skb_iif = eth0.
ifb0 -> ifb1: skb->skb_iif = ifb0
ifb1 -> ifb0: as skb->skb_iif == ifb0, skb->skb_iif = ifb1
ifb0 -> ifb1...
...

Did you test it?
jamal Dec. 21, 2010, 1:07 p.m. UTC | #4
On Fri, 2010-12-17 at 21:41 +0800, Changli Gao wrote:

Sorry for the latency - I am a little swamped.

> I doubt it can work.

It should work - this used to be part of my regression tests.
If it doesnt work something is broken.

In any case, I shouldnt have used this example because
it distracted from the point i was trying to make:
You are restoring the old qmap when the point is we could change
it to a new one. A simpler example illustrating how
a qmap could be changed:

----
tc filter add dev ifb0 parent 1:0 protocol ip prio 10 u32 \
 match u32 0 0 flowid 1:2 \
 action skbedit queue_mapping 4
---

cheers,
jamal

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Changli Gao Dec. 21, 2010, 2:03 p.m. UTC | #5
On Tue, Dec 21, 2010 at 9:07 PM, jamal <hadi@cyberus.ca> wrote:
> On Fri, 2010-12-17 at 21:41 +0800, Changli Gao wrote:
>
> Sorry for the latency - I am a little swamped.
>
>> I doubt it can work.
>
> It should work - this used to be part of my regression tests.
> If it doesnt work something is broken.

When I tested it, my OS got frozen.

>
> In any case, I shouldnt have used this example because
> it distracted from the point i was trying to make:
> You are restoring the old qmap when the point is we could change
> it to a new one. A simpler example illustrating how
> a qmap could be changed:
>
> ----
> tc filter add dev ifb0 parent 1:0 protocol ip prio 10 u32 \
>  match u32 0 0 flowid 1:2 \
>  action skbedit queue_mapping 4
> ----

Currently, you can only change the rx queue mapping, because for tx,
dev_pick_tx() doesn't use skb->queue_mapping to choose tx queue.

However, I don't think change the rx queue mapping is a good idea.
When the skbs returned from ifb enter netif_receive_skb() again,
get_rps_cpu() may warn about the wrong rx queue, and my this patch is
used to solve this problem. Even though the rx queue is legal, a
different rps_cpus settings will be used, and the skbs may be
redirected to different CPUs. Is it expected?
Eric Dumazet Dec. 21, 2010, 3:24 p.m. UTC | #6
Le mardi 21 décembre 2010 à 22:03 +0800, Changli Gao a écrit :
> However, I don't think change the rx queue mapping is a good idea.
> When the skbs returned from ifb enter netif_receive_skb() again,
> get_rps_cpu() may warn about the wrong rx queue, and my this patch is
> used to solve this problem. Even though the rx queue is legal, a
> different rps_cpus settings will be used, and the skbs may be
> redirected to different CPUs. Is it expected?
> 
> 

Do we really want a multi queue ifb at all ?

Why not use percpu data and LLTX, like we did for other virtual devices
(loopback, tunnels, vlans, ...)

I guess most ifb uses need to finaly deliver packets in a monoqueue
anyway, optimizing ifb might raise lock contention on this resource.

See what we did in commit 79640a4ca6955e3e (net: add additional lock to
qdisc to increase throughput) : Adding one spinlock actually helped a
lot ;)



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Changli Gao Dec. 22, 2010, 12:08 a.m. UTC | #7
On Tue, Dec 21, 2010 at 11:24 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> Do we really want a multi queue ifb at all ?
>
> Why not use percpu data and LLTX, like we did for other virtual devices
> (loopback, tunnels, vlans, ...)
>
> I guess most ifb uses need to finaly deliver packets in a monoqueue
> anyway, optimizing ifb might raise lock contention on this resource.
>

For ingress shaping, the later processing should be in parallel, so
multiple ri_tasklets are needed.

For one queue ifb, the queue mapping should be saved and restored for
ingress skbs, as it is reset to 0 in dev_pick_tx(ifb).
jamal Dec. 23, 2010, 1 p.m. UTC | #8
On Tue, 2010-12-21 at 22:03 +0800, Changli Gao wrote:

> When I tested it, my OS got frozen.

I will look into it the next opportunity i get. The example i showed
is on egress btw. A ping from outside that matches the filter
will be a good test.

> Currently, you can only change the rx queue mapping, because for tx,
> dev_pick_tx() doesn't use skb->queue_mapping to choose tx queue.

If skbedit is on egress, it will happen after (and override whatever
dev_pick_tx() chose), no? Thats the whole point for skbedits queuemap
editing.

> However, I don't think change the rx queue mapping is a good idea.

I agree for that as a default policy. But it is
policy that skbedit can and should be able to override.

> When the skbs returned from ifb enter netif_receive_skb() again,
> get_rps_cpu() may warn about the wrong rx queue, and my this patch is
> used to solve this problem. Even though the rx queue is legal, a
> different rps_cpus settings will be used, and the skbs may be
> redirected to different CPUs. Is it expected?

I am not sure without analyzing what performance impact would be, i.e i
think that the only reason i wouldnt do it is because it may have crazy
effect on performance but:
If i wanted to override the choice made by rps through some policy, why
shouldnt i be able to do it? Same thing if i wanted to bypass rps. tc
level seems appropriate.
I may be misreading the code: Quick glance at the code indicates users
have no choice on ingress: rps happens first then we can do tc level -
so it doesnt matter what changes we make to the queue map it will not
take effect in any case. Am i mistaken?

cheers,
jamal

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
jamal Dec. 23, 2010, 1:21 p.m. UTC | #9
On Tue, 2010-12-21 at 16:24 +0100, Eric Dumazet wrote:

> 
> Do we really want a multi queue ifb at all ?
> 
> Why not use percpu data and LLTX, like we did for other virtual devices
> (loopback, tunnels, vlans, ...)
> 
> I guess most ifb uses need to finaly deliver packets in a monoqueue
> anyway, optimizing ifb might raise lock contention on this resource.

I guess once you start having hardware that is multiqueue on the ingress
side at least then something per cpu is needed on ifb. But i agree that
the optimizations may end up harming the simplicity that ifb intended.
It is already jumping a lot of hoops to work around things as is.

> See what we did in commit 79640a4ca6955e3e (net: add additional lock to
> qdisc to increase throughput) : Adding one spinlock actually helped a
> lot ;)

Yes, that was fascinating stuff;-> I am still scratching my head and
continuing to itch on when i can get more time to look closely with some
testing. But i dont see the connection with what Changli is attempting
with multiq ifb - what do you have in mind.

cheers,
jamal



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/drivers/net/ifb.c b/drivers/net/ifb.c
index 918a38e..1632345 100644
--- a/drivers/net/ifb.c
+++ b/drivers/net/ifb.c
@@ -96,6 +96,7 @@  static void ri_tasklet(unsigned long arg)
 			dev_queue_xmit(skb);
 		} else if (from & AT_INGRESS) {
 			skb_pull(skb, skb->dev->hard_header_len);
+			skb->queue_mapping = dev_skb_cb(skb)->old_queue_mapping;
 			netif_rx(skb);
 		} else
 			BUG();
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index d31bc3c..6f128e3 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1295,6 +1295,16 @@  struct napi_gro_cb {
 
 #define NAPI_GRO_CB(skb) ((struct napi_gro_cb *)(skb)->cb)
 
+struct dev_skb_cb {
+	u16		old_queue_mapping;
+};
+
+static inline struct dev_skb_cb *dev_skb_cb(struct sk_buff *skb)
+{
+	BUILD_BUG_ON(sizeof(skb->cb) < sizeof(struct dev_skb_cb));
+	return (struct dev_skb_cb *)skb->cb;
+}
+
 struct packet_type {
 	__be16			type;	/* This is really htons(ether_type). */
 	struct net_device	*dev;	/* NULL is wildcarded here	     */
diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index ea1f8a8..52c32e3 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -198,8 +198,8 @@  struct tcf_proto {
 };
 
 struct qdisc_skb_cb {
+	struct dev_skb_cb	dev_cb; /* must be the first */
 	unsigned int		pkt_len;
-	char			data[];
 };
 
 static inline int qdisc_qlen(struct Qdisc *q)
@@ -209,6 +209,7 @@  static inline int qdisc_qlen(struct Qdisc *q)
 
 static inline struct qdisc_skb_cb *qdisc_skb_cb(struct sk_buff *skb)
 {
+	BUILD_BUG_ON(sizeof(skb->cb) < sizeof(struct qdisc_skb_cb));
 	return (struct qdisc_skb_cb *)skb->cb;
 }
 
diff --git a/net/core/dev.c b/net/core/dev.c
index d28b3a0..bf5ced5 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1901,10 +1901,15 @@  static int illegal_highdma(struct net_device *dev, struct sk_buff *skb)
 }
 
 struct dev_gso_cb {
-	void (*destructor)(struct sk_buff *skb);
+	struct dev_skb_cb	dev_cb; /* must be the first */
+	void			(*destructor)(struct sk_buff *skb);
 };
 
-#define DEV_GSO_CB(skb) ((struct dev_gso_cb *)(skb)->cb)
+static inline struct dev_gso_cb *dev_gso_cb(struct sk_buff *skb)
+{
+	BUILD_BUG_ON(sizeof(skb->cb) < sizeof(struct dev_gso_cb));
+	return (struct dev_gso_cb *)skb->cb;
+}
 
 static void dev_gso_skb_destructor(struct sk_buff *skb)
 {
@@ -1918,7 +1923,7 @@  static void dev_gso_skb_destructor(struct sk_buff *skb)
 		kfree_skb(nskb);
 	} while (skb->next);
 
-	cb = DEV_GSO_CB(skb);
+	cb = dev_gso_cb(skb);
 	if (cb->destructor)
 		cb->destructor(skb);
 }
@@ -1947,7 +1952,7 @@  static int dev_gso_segment(struct sk_buff *skb)
 		return PTR_ERR(segs);
 
 	skb->next = segs;
-	DEV_GSO_CB(skb)->destructor = skb->destructor;
+	dev_gso_cb(skb)->destructor = skb->destructor;
 	skb->destructor = dev_gso_skb_destructor;
 
 	return 0;
@@ -2103,7 +2108,7 @@  gso:
 
 out_kfree_gso_skb:
 	if (likely(skb->next == NULL))
-		skb->destructor = DEV_GSO_CB(skb)->destructor;
+		skb->destructor = dev_gso_cb(skb)->destructor;
 out_kfree_skb:
 	kfree_skb(skb);
 out:
diff --git a/net/sched/act_mirred.c b/net/sched/act_mirred.c
index 0c311be..419e82f 100644
--- a/net/sched/act_mirred.c
+++ b/net/sched/act_mirred.c
@@ -197,6 +197,7 @@  static int tcf_mirred(struct sk_buff *skb, struct tc_action *a,
 
 	skb2->skb_iif = skb->dev->ifindex;
 	skb2->dev = dev;
+	dev_skb_cb(skb2)->old_queue_mapping = skb->queue_mapping;
 	dev_queue_xmit(skb2);
 	err = 0;
 
diff --git a/net/sched/sch_netem.c b/net/sched/sch_netem.c
index e5593c0..c2cf72f 100644
--- a/net/sched/sch_netem.c
+++ b/net/sched/sch_netem.c
@@ -77,14 +77,14 @@  struct netem_sched_data {
 
 /* Time stamp put into socket buffer control block */
 struct netem_skb_cb {
-	psched_time_t	time_to_send;
+	struct qdisc_skb_cb	qdisc_cb; /* must be the first */
+	psched_time_t		time_to_send;
 };
 
 static inline struct netem_skb_cb *netem_skb_cb(struct sk_buff *skb)
 {
-	BUILD_BUG_ON(sizeof(skb->cb) <
-		sizeof(struct qdisc_skb_cb) + sizeof(struct netem_skb_cb));
-	return (struct netem_skb_cb *)qdisc_skb_cb(skb)->data;
+	BUILD_BUG_ON(sizeof(skb->cb) < sizeof(struct netem_skb_cb));
+	return (struct netem_skb_cb *)skb->cb;
 }
 
 /* init_crandom - initialize correlated random number generator