pkt_sched: act_xt support new Xtables interface

Message ID	50D8413C.8050508@openwrt.org
State	RFC, archived
Delegated to:	David Miller
Headers	From: Felix Fietkau <nbd@openwrt.org>
	To: Jamal Hadi Salim <jhs@mojatatu.com>
	CC: Yury Stankevich <urykhy@gmail.com>, Hasan Chowdhury <shemonc@gmail.com>, Stephen Hemminger <shemminger@vyatta.com>, Jan Engelhardt <jengelh@inai.de>, "netdev@vger.kernel.org" <netdev@vger.kernel.org>, pablo@netfilter.org, netfilter-devel@vger.kernel.org
	Date: Mon, 24 Dec 2012 12:49:16 +0100
	Subject: Re: [PATCH] pkt_sched: act_xt support new Xtables interface
	In-Reply-To: <50D83DDB.102@mojatatu.com>

Felix Fietkau Dec. 24, 2012, 11:49 a.m. UTC

On 2012-12-24 12:34 PM, Jamal Hadi Salim wrote:
> 
> Some good news Yury.
> I am told Felix Fietkau <nbd@openwrt.org> (on CC) actually
> already solved this issue and it is a feature in openwrt. I
> cant find the code.
> 
> Felix - Yury is trying to retrieve skb->mark fields from
> netfilter connmark. My understanding is you have written
> such an action. Can you please point us to it - and any
> reason you havent submitted this for inclusion in kernel
> proper?
After I added it as an experiment, I got distracted with other projects
again and forgot about submitting it. Take a look at the code - if the
approach is reasonable, I'll submit this thing for inclusion soon.

- Felix


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Jamal Hadi Salim Dec. 24, 2012, 12:19 p.m. UTC | #1

On 12-12-24 06:49 AM, Felix Fietkau wrote:

>
> After I added it as an experiment, I got distracted with other projects
> again and forgot about submitting it. Take a look at the code - if the
> approach is reasonable, I'll submit this thing for inclusion soon.
>

Excellent ;-> Simple and elegant.

Usable as is - some minor comments.
First nitpick: The name is not very reflective, how about:
GetMarkFromConntrack or something along those lines?


> +static int tcf_connmark(struct sk_buff *skb, const struct tc_action *a,
> +		       struct tcf_result *res)
> +{
> +	struct nf_conn *c;
> +	enum ip_conntrack_info ctinfo;
> +	int proto;
> +	int r;
> +
> +	if (skb->protocol == htons(ETH_P_IP)) {
> +		if (skb->len < sizeof(struct iphdr))
> +			goto out;
> +		proto = PF_INET;
> +	} else if (skb->protocol == htons(ETH_P_IPV6)) {
> +		if (skb->len < sizeof(struct ipv6hdr))
> +			goto out;
> +		proto = PF_INET6;
> +	} else
> +		goto out;
> +

I would have said that this action is probably also not useful for
the egress qdisc path since skb->mark would already be set. It may be
worth checking skb->tc_verd and skipping the overhead of the
nf_conntrack_in() call. Look at act_mirred for such a check.
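
A minimal userspace sketch of that ingress-only check follows. It is not kernel code: struct mock_skb, the MOCK_AT_* values and mock_ct_lookup() are hypothetical stand-ins for skb, the attach-point bits that act_mirred of that era read (as far as I recall, via G_TC_AT(skb->tc_verd)), and the nf_conntrack_in() + nf_ct_get() sequence. It only shows the control flow: the lookup cost is paid on ingress and skipped everywhere else.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-ins for the kernel's attach-point bits (assumed, not copied). */
enum mock_attach_point {
	MOCK_AT_STACK   = 0x0,
	MOCK_AT_INGRESS = 0x1,
	MOCK_AT_EGRESS  = 0x2,
};

/* Hypothetical, trimmed-down packet: only the fields this sketch needs. */
struct mock_skb {
	uint32_t mark;        /* plays the role of skb->mark */
	unsigned int at;      /* plays the role of the tc_verd attach point */
	unsigned int lookups; /* counts simulated conntrack lookups */
};

/* Stand-in for the expensive conntrack lookup. */
static bool mock_ct_lookup(struct mock_skb *skb, uint32_t *ct_mark)
{
	skb->lookups++;
	*ct_mark = 0xa;       /* pretend the connection carries mark 0xa */
	return true;
}

/* Restore the mark only on ingress; elsewhere leave the packet untouched. */
static void mock_connmark(struct mock_skb *skb)
{
	uint32_t ct_mark;

	if (!(skb->at & MOCK_AT_INGRESS))
		return;       /* not ingress: skip the lookup entirely */

	if (mock_ct_lookup(skb, &ct_mark))
		skb->mark = ct_mark;
}
```

With a packet tagged MOCK_AT_EGRESS, mock_connmark() returns before any lookup and the existing mark is preserved; with MOCK_AT_INGRESS the mark is overwritten from the simulated connection.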

> +	r = nf_conntrack_in(dev_net(skb->dev), proto, NF_INET_PRE_ROUTING, skb);
> +	if (r != NF_ACCEPT)
> +		goto out;
> +
> +	c = nf_ct_get(skb, &ctinfo);
> +	if (!c)
> +		goto out;
> +
> +	skb->mark = c->mark;
> +	nf_conntrack_put(skb->nfct);
> +	skb->nfct = NULL;
> +
> +out:
> +	return TC_ACT_PIPE;

Ok, perhaps set tcf_action in (iproute2) user space to TC_ACT_PIPE then 
just return policy->tcf_action here.

Even better is to have a different TC_ACT_XXX returned for failure
vs success... Your success path becomes TC_ACT_PIPE and you let the
user program the failure branch optionally. This would allow
branching to different actions on success/failure, for example:
if mark is found {
    if mark is 0xa
        redirect to ifb0
    else
        redirect to ifb1
} else
    set mark to 3 then redirect to ifb9

etc.
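
The branching idea above can be sketched in plain userspace C. Everything here is an assumption made for illustration: struct mock_connmark_policy, mock_connmark() and mock_pick_ifb() are hypothetical names, and the choice of failure verdict is arbitrary. Only the numeric MOCK_TC_ACT_* values mirror the kernel's TC_ACT_* verdicts.

```c
#include <stddef.h>
#include <stdint.h>

/* Same numeric values as TC_ACT_OK / TC_ACT_PIPE in linux/pkt_cls.h. */
#define MOCK_TC_ACT_OK    0
#define MOCK_TC_ACT_PIPE  3

/* Hypothetical per-action policy, as user space might program it. */
struct mock_connmark_policy {
	int on_success;   /* verdict when a conntrack mark was found */
	int on_failure;   /* verdict when no conntrack entry exists */
};

struct mock_skb {
	uint32_t mark;
};

/* ct_mark == NULL models "no conntrack entry for this packet". */
static int mock_connmark(struct mock_skb *skb, const uint32_t *ct_mark,
			 const struct mock_connmark_policy *p)
{
	if (!ct_mark)
		return p->on_failure;

	skb->mark = *ct_mark;
	return p->on_success;
}

/* The example policy from the mail: returns the ifb index to redirect to. */
static int mock_pick_ifb(struct mock_skb *skb, const uint32_t *ct_mark)
{
	static const struct mock_connmark_policy p = {
		.on_success = MOCK_TC_ACT_PIPE,
		.on_failure = MOCK_TC_ACT_OK,   /* arbitrary for this sketch */
	};

	if (mock_connmark(skb, ct_mark, &p) == p.on_success)
		return skb->mark == 0xa ? 0 : 1;  /* ifb0 or ifb1 */

	skb->mark = 3;                            /* failure branch */
	return 9;                                 /* ifb9 */
}
```

The point of the split is that the action itself only reports which branch was taken; what happens on each branch stays under the user's control.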

Not sure if that made sense. I am under the influence of nyquil ;->

cheers,
jamal



Pablo Neira Ayuso Dec. 24, 2012, 1:12 p.m. UTC | #2

Hi Felix,

On Mon, Dec 24, 2012 at 12:49:16PM +0100, Felix Fietkau wrote:
> On 2012-12-24 12:34 PM, Jamal Hadi Salim wrote:
> > 
> > Some good news Yury.
> > I am told Felix Fietkau <nbd@openwrt.org> (on CC) actually
> > already solved this issue and it is a feature in openwrt. I
> > cant find the code.
> > 
> > Felix - Yury is trying to retrieve skb->mark fields from
> > netfilter connmark. My understanding is you have written
> > such an action. Can you please point us to it - and any
> > reason you havent submitted this for inclusion in kernel
> > proper?
> After I added it as an experiment, I got distracted with other projects
> again and forgot about submitting it. Take a look at the code - if the
> approach is reasonable, I'll submit this thing for inclusion soon.
> 
> - Felix
> 
> --- /dev/null
> +++ b/net/sched/act_connmark.c
> @@ -0,0 +1,137 @@
> +/*
> + * Copyright (c) 2011 Felix Fietkau <nbd@openwrt.org>
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms and conditions of the GNU General Public License,
> + * version 2, as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope it will be useful, but WITHOUT
> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> + * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
> + * more details.
> + *
> + * You should have received a copy of the GNU General Public License along with
> + * this program; if not, write to the Free Software Foundation, Inc., 59 Temple
> + * Place - Suite 330, Boston, MA 02111-1307 USA.
> + */
> +
> +#include <linux/module.h>
> +#include <linux/init.h>
> +#include <linux/kernel.h>
> +#include <linux/skbuff.h>
> +#include <linux/rtnetlink.h>
> +#include <linux/pkt_cls.h>
> +#include <linux/ip.h>
> +#include <linux/ipv6.h>
> +#include <net/netlink.h>
> +#include <net/pkt_sched.h>
> +#include <net/act_api.h>
> +
> +#include <net/netfilter/nf_conntrack.h>
> +#include <net/netfilter/nf_conntrack_core.h>
> +
> +#define TCA_ACT_CONNMARK	20
> +
> +#define CONNMARK_TAB_MASK     3
> +static struct tcf_common *tcf_connmark_ht[CONNMARK_TAB_MASK + 1];
> +static u32 connmark_idx_gen;
> +static DEFINE_RWLOCK(connmark_lock);
> +
> +static struct tcf_hashinfo connmark_hash_info = {
> +	.htab	=	tcf_connmark_ht,
> +	.hmask	=	CONNMARK_TAB_MASK,
> +	.lock	=	&connmark_lock,
> +};
> +
> +static int tcf_connmark(struct sk_buff *skb, const struct tc_action *a,
> +		       struct tcf_result *res)
> +{
> +	struct nf_conn *c;
> +	enum ip_conntrack_info ctinfo;
> +	int proto;
> +	int r;
> +
> +	if (skb->protocol == htons(ETH_P_IP)) {
> +		if (skb->len < sizeof(struct iphdr))
> +			goto out;
> +		proto = PF_INET;
> +	} else if (skb->protocol == htons(ETH_P_IPV6)) {
> +		if (skb->len < sizeof(struct ipv6hdr))
> +			goto out;
> +		proto = PF_INET6;
> +	} else
> +		goto out;
> +
> +	r = nf_conntrack_in(dev_net(skb->dev), proto, NF_INET_PRE_ROUTING, skb);

Conntrack needs to see defragmented packets, so you have to call
nf_defrag_ipv4 / _ipv6 respectively before that.
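
For context, only the first IPv4 fragment carries the transport header, so a lookup on a later fragment has no ports to match on. The userspace sketch below only detects fragments, it does not reassemble them; the MOCK_* constants and function names are hypothetical, chosen to mirror the usual "more fragments" flag and offset mask of the IPv4 frag_off field.

```c
#include <arpa/inet.h>
#include <stdbool.h>
#include <stdint.h>

#define MOCK_IP_MF      0x2000  /* "more fragments" flag */
#define MOCK_IP_OFFSET  0x1fff  /* fragment offset, in 8-byte units */

/* frag_off_be is the IPv4 header's frag_off field in network byte order. */
static bool mock_ip_is_fragment(uint16_t frag_off_be)
{
	return (ntohs(frag_off_be) & (MOCK_IP_MF | MOCK_IP_OFFSET)) != 0;
}

/* Only complete datagrams may go to the (simulated) conntrack lookup. */
static bool mock_safe_for_conntrack(uint16_t frag_off_be)
{
	return !mock_ip_is_fragment(frag_off_be);
}
```

A packet with only the "don't fragment" bit set (0x4000) passes, while a first fragment (MF set) or a later fragment (non-zero offset) does not and would have to be reassembled first.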

This also changes the semantics of the raw table in iptables since it
will now see packets with conntrack already attached. So this would
also break -j CT --notrack.

This needs more thinking. I can appreciate the value of calling
conntrack from different points of the packet traversal, but there are
a couple of things we have to resolve before allowing that.

Jamal Hadi Salim Dec. 24, 2012, 2:05 p.m. UTC | #3

Hi Pablo,

On 12-12-24 08:12 AM, Pablo Neira Ayuso wrote:

>
> conntrack needs to see defragmented packets, you have to call
> nf_defrag_ipv4 / _ipv6 respectively before that.
>

This should not be too hard to do - although my thinking says this
should be a separate action.

> This also changes the semantics of the raw table in iptables since it
> will now see packet with conntrack already attached. So this would
> also break -j CT --notrack.
>

Is there a flag we can check which says a flow is not to be tracked?
Doesn't nf_conntrack_in() fail if --notrack is set?

> This needs more thinking. I can appreciate the value of calling
> conntrack from different points of the packet traversal, but there are
> a couple of thing we have to resolve before allowing that.

There is user need for this Pablo - as you can see from what Felix
deployed, a much wider audience seems to depend on it.
What do we need to do to get this to work properly?

cheers,
jamal


Pablo Neira Ayuso Dec. 24, 2012, 6:19 p.m. UTC | #4

Hi Jamal,

On Mon, Dec 24, 2012 at 09:05:42AM -0500, Jamal Hadi Salim wrote:
> On 12-12-24 08:12 AM, Pablo Neira Ayuso wrote:
> >
> >conntrack needs to see defragmented packets, you have to call
> >nf_defrag_ipv4 / _ipv6 respectively before that.
> >
> 
> This should not be too hard to do - although my thinking says this
> should be a separate action.
> 
> >This also changes the semantics of the raw table in iptables since it
> >will now see packet with conntrack already attached. So this would
> >also break -j CT --notrack.
> 
> Is there a flag we can check which says a flow is not to be tracked?
> Doesnt nf_conntrack_in() fail if --no track is set?

The notrack dummy conntrack (consider it a flag) is attached in the
prerouting raw table. By attaching conntracks at ingress, the notrack
flag will be ignored. Note that this also breaks conntrack templates
via -j CT, which allow us to set custom conntrack timeouts, zones and
helpers at prerouting raw.

Basically, ct templates are attached via -j CT, this template is
munched by nf_conntrack_in, which adds the corresponding ct features
based on the template information.

> >This needs more thinking. I can appreciate the value of calling
> >conntrack from different points of the packet traversal, but there are
> >a couple of thing we have to resolve before allowing that.
> 
> There is user need for this Pablo - as you can see from what Felix
> deployed it seems to be used a lot more wider audience dependency.
> What do we need to do to get this to work properly?

The conntrack code needs to be generalized to allow creating a conntrack
with all its features at once (so we can remove the template
infrastructure). Even after that, -j CT rules will still be ignored
if you're using, let's name it, act_ct from ingress to attach the
conntrack to the packet.

With the current approach you're using, people will see conntracks in
the iptables raw table, which breaks the current semantics.

We'll have the netfilter workshop in Q1/Q2 2013 (still TBA). I think
this is material for discussion there.

cheers,
Pablo

Pablo Neira Ayuso Dec. 26, 2012, 11:10 p.m. UTC | #5

On Mon, Dec 24, 2012 at 07:19:43PM +0100, Pablo Neira Ayuso wrote:
> Hi Jamal,
> 
> On Mon, Dec 24, 2012 at 09:05:42AM -0500, Jamal Hadi Salim wrote:
> > On 12-12-24 08:12 AM, Pablo Neira Ayuso wrote:
> > >
> > >conntrack needs to see defragmented packets, you have to call
> > >nf_defrag_ipv4 / _ipv6 respectively before that.
> > >
> > 
> > This should not be too hard to do - although my thinking says this
> > should be a separate action.
> > 
> > >This also changes the semantics of the raw table in iptables since it
> > >will now see packet with conntrack already attached. So this would
> > >also break -j CT --notrack.
> > 
> > Is there a flag we can check which says a flow is not to be tracked?
> > Doesnt nf_conntrack_in() fail if --no track is set?
> 
> The notrack dummy conntrack (consider it a flag) is attached in
> prerouting raw table. By attaching conntracks at ingress, the notrack
> flag will be ignored. Note that this also breaks conntrack templates
> via -j CT, that allows us to set custom conntrack timeouts, zones and
> helpers at prerouting raw.
> 
> Basically, ct templates are attached via -j CT, this template is
> munched by nf_conntrack_in, which adds the corresponding ct features
> based on the template information.

I'm still turning this over and I can't come up with an easy
solution that doesn't break the existing semantics. One possibility
would be to drop the ct reference after leaving ingress, so the lookup
happens again in prerouting after the raw table to attach it again and
no ct is seen in the raw table, but:

1) it's suboptimal in case users have rules using ct at ingress and in
   iptables.

2) the conntrack template infrastructure needs to be reworked/replaced
   by something more flexible to attach features to conntracks, so we
   can still attach features for conntrack entries that were created
   at ingress (so helpers / custom timeouts / notrack don't break).
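
The "drop the reference after ingress" possibility can be mocked up in userspace as below. struct mock_ct, struct mock_skb and all function names are hypothetical. The ingress step mirrors what the posted patch already does after copying the mark (nf_conntrack_put(skb->nfct) followed by skb->nfct = NULL), and the second attach models the repeated lookup from point 1. The sketch does not model point 2: the entry created at ingress still lacks any template features.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a conntrack entry with a reference count. */
struct mock_ct {
	uint32_t mark;
	int refcnt;
};

/* Hypothetical stand-in for an skb: a mark plus an optional attached ct. */
struct mock_skb {
	uint32_t mark;
	struct mock_ct *ct;   /* plays the role of skb->nfct */
};

/* Ingress: attach, copy the mark, then drop the reference again. */
static void mock_ingress_connmark(struct mock_skb *skb, struct mock_ct *ct)
{
	ct->refcnt++;         /* the lookup takes a reference */
	skb->ct = ct;

	skb->mark = ct->mark;

	ct->refcnt--;         /* release it before leaving ingress */
	skb->ct = NULL;       /* so the raw table sees no ct attached */
}

/* What a later hook (e.g. the raw table) would observe. */
static int mock_has_ct(const struct mock_skb *skb)
{
	return skb->ct != NULL;
}

/* Prerouting, after raw: the lookup has to run a second time. */
static void mock_prerouting_attach(struct mock_skb *skb, struct mock_ct *ct)
{
	ct->refcnt++;
	skb->ct = ct;
}
```

Between the two calls mock_has_ct() reports nothing attached, which is the property the raw table relies on; the cost is the second lookup in mock_prerouting_attach().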
