
[1/1] netfilter: Add possibility to turn off netfilters defrag per netns

Message ID 201201041118.18552.hans.schillstrom@ericsson.com
State Not Applicable, archived
Delegated to: David Miller

Commit Message

Hans Schillstrom Jan. 4, 2012, 10:18 a.m. UTC
On Wednesday 04 January 2012 10:03:49 Jozsef Kadlecsik wrote:
> On Wed, 4 Jan 2012, Hans Schillstrom wrote:
> 
> > On Wednesday 04 January 2012 09:28:05 Jozsef Kadlecsik wrote:
> > > 
> > > On Wed, 4 Jan 2012, Hans Schillstrom wrote:
> > > 
> > > > In some cases it is not desirable to have auto defrag,
> > > > e.g. in a cluster where packets can arrive on different blades.
> > > > In that case it is possible to use containers (LXC) and send
> > > > all fragments to one place where defrag is enabled.
> > > > 
> > > > This patch makes it possible to turn off defrag per network namespace
> > > > by setting net.netfilter.nf_conntrack_nodefrag to 1.
> > > > Both IPv4 and IPv6 are affected by this sysctl.
> > > > The default is 0, which means defrag is enabled.
> > > 
> > > Conntrack assumes that the packets are defragmented and will drop any 
> > > fragmented one. So your patch results in packet drops.
> > 
> > Hmmm, more work...
> > > 
> > > Also, if you want to disable defragmentation then why don't you simply 
> > > "mark" the packets with the NOTRACK target?
> > 
> > I don't think that will work since NF_IP_PRI_CONNTRACK_DEFRAG is -400
> 
> Then change NF_IP_PRI_RAW so that it precedes NF_IP_PRI_CONNTRACK_DEFRAG. 
> It should be made possible for the raw table to completely override
> conntrack, and defrag is an implicit part of the latter.
> 

Another idea: turn off both conntrack and defrag,
i.e. do like NOTRACK and rename the flag?

Quick example for IPv4:
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Jan Engelhardt Jan. 4, 2012, 11:17 a.m. UTC | #1
On Wednesday 2012-01-04 11:18, Hans Schillstrom wrote:

>An other idea, turn off both conntrack and defrag
>i.e. do like NOTRAC and rename the flag  ?

Or just add a new table - that one we can remove/stash
when I get my xt2 patches out.
Hans Schillstrom Jan. 4, 2012, 11:48 a.m. UTC | #2
On Wednesday 04 January 2012 12:17:18 Jan Engelhardt wrote:
> On Wednesday 2012-01-04 11:18, Hans Schillstrom wrote:
> 
> >An other idea, turn off both conntrack and defrag
> >i.e. do like NOTRAC and rename the flag  ?
> 
> Or just add a new table - that one we can remove/stash
> when I get my xt2 patches out.
> 
I like that idea: an "early" table at prio -500 with PREROUTING.
There is also a need for a new flag, "--allfrags",
i.e. all fragments need to be sorted out and sent to the same destination for defrag.

ex.
iptables -t early -A PREROUTING -i eth0 --allfrags -j NOTRACK
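
For reference, a minimal sketch of why a table at priority -500 would see
packets before defrag while the raw table does not. Hooks registered on the
same netfilter hook point are called in ascending priority order. The values
for defrag, raw, conntrack and mangle are the ones I recall from
include/linux/netfilter_ipv4.h for kernels of this period; the "early" entry
and all helper names here are hypothetical and only illustrate the proposal.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Priorities mirror the kernel enum values discussed in this thread.
 * PRIO_EARLY is the hypothetical table proposed above, not kernel code. */
enum hook_prio {
	PRIO_EARLY            = -500, /* hypothetical "early" table */
	PRIO_CONNTRACK_DEFRAG = -400, /* NF_IP_PRI_CONNTRACK_DEFRAG */
	PRIO_RAW              = -300, /* NF_IP_PRI_RAW */
	PRIO_CONNTRACK        = -200, /* NF_IP_PRI_CONNTRACK */
	PRIO_MANGLE           = -150  /* NF_IP_PRI_MANGLE */
};

struct hook {
	const char *name;
	int priority;
};

/* qsort comparator: lower priority value sorts first. */
static int cmp_prio(const void *a, const void *b)
{
	const struct hook *x = a, *y = b;

	return (x->priority > y->priority) - (x->priority < y->priority);
}

/* Put the hooks in the order they would be called on one hook point. */
static void order_hooks(struct hook *hooks, size_t n)
{
	assert(hooks != NULL);
	qsort(hooks, n, sizeof(*hooks), cmp_prio);
}

/* Return 1 if `first` appears before `second` in an ordered array. */
static int runs_before(const struct hook *hooks, size_t n,
		       const char *first, const char *second)
{
	size_t i, fi = n, si = n;

	for (i = 0; i < n; i++) {
		if (strcmp(hooks[i].name, first) == 0)
			fi = i;
		if (strcmp(hooks[i].name, second) == 0)
			si = i;
	}
	return fi < n && si < n && fi < si;
}
```

With these numbers a NOTRACK rule in raw (-300) is reached only after defrag
(-400) has already run, which is the problem noted earlier in the thread; a
table at -500 would be reached first.
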
Pablo Neira Ayuso Jan. 4, 2012, 5:40 p.m. UTC | #3
On Wed, Jan 04, 2012 at 12:48:35PM +0100, Hans Schillstrom wrote:
> I like that idea, an "early" table at prio -500 with PREROUTING.
> There is also a need for a new flag "--allfrags"
> i.e. all fragments needs to be sorted out and sent to same dest for defrag.
> 
> ex.
> iptables -t early -A PREROUTING -i eth0 --allfrags -j NOTRACK

New tables add too much overhead. We have discussed this before with
Patrick.

Since this still remains specific to your needs, I think you can
remove the nf_conntrack module in your setup.

I can't come up with one sane setup that would want to selectively
defragment some traffic and not the rest.

Am I missing anything else?
Jozsef Kadlecsik Jan. 4, 2012, 6:05 p.m. UTC | #4
On Wed, 4 Jan 2012, Pablo Neira Ayuso wrote:

> On Wed, Jan 04, 2012 at 12:48:35PM +0100, Hans Schillstrom wrote:
> > I like that idea, an "early" table at prio -500 with PREROUTING.
> > There is also a need for a new flag "--allfrags"
> > i.e. all fragments needs to be sorted out and sent to same dest for defrag.
> > 
> > ex.
> > iptables -t early -A PREROUTING -i eth0 --allfrags -j NOTRACK
> 
> New tables add too much overhead. We have discussed this before with
> Patrick.
> 
> Since this still remains specific to your needs, I think you can
> remove nf_conntrack module in your setup.
> 
> I don't come with one sane setup that may want selectively defragment
> some traffic yes and other not.
> 
> Am I missing anything else?

I agree. If you don't want defragmentation at all, then make sure you 
don't load the nf_conntrack module directly/indirectly. Conntrack doesn't 
work without defragmentation anyway.

The only thing that such a really-early table could buy at the moment is 
the ability to specify which flows not to defragment at layer 3.

If we had dynamic hook registration and hook priorities at table level, 
that would come in handy now.

Best regards,
Jozsef
-
E-mail  : kadlec@blackhole.kfki.hu, kadlecsik.jozsef@wigner.mta.hu
PGP key : http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address : Wigner Research Centre for Physics, Hungarian Academy of Sciences
          H-1525 Budapest 114, POB. 49, Hungary
Hans Schillstrom Jan. 4, 2012, 8:45 p.m. UTC | #5
On Wednesday, January 04, 2012 18:40:35 Pablo Neira Ayuso wrote:
> On Wed, Jan 04, 2012 at 12:48:35PM +0100, Hans Schillstrom wrote:
> > I like that idea, an "early" table at prio -500 with PREROUTING.
> > There is also a need for a new flag "--allfrags"
> > i.e. all fragments needs to be sorted out and sent to same dest for defrag.
> > 
> > ex.
> > iptables -t early -A PREROUTING -i eth0 --allfrags -j NOTRACK
> 
> New tables add too much overhead. We have discussed this before with
> Patrick.
> 
> Since this still remains specific to your needs, I think you can
> remove nf_conntrack module in your setup.
> 
> I don't come with one sane setup that may want selectively defragment
> some traffic yes and other not.
> 
> Am I missing anything else?
>

I might have been a little bit unclear, so I'll try the opposite :-)

Network namespaces, i.e. Linux Containers (LXC), create new possibilities:
Linux moves into new domains such as large cluster controllers.

When you have two or more interfaces (on different machines) that receive data
from the Internet, you will sooner or later end up with fragments of the same
packet on different interfaces.

If you deal with virtual IPs in the cluster (which is very common),
there must be some place where packet defrag occurs before sending
the packet to a load balancer.

Hardware is cheap but space and power consumption are not, so
no one wants extra hardware. If possible, extra hops should also be avoided.

With existing functionality, an extra level of physical machines must be
added between the FW/GW and the load balancers to do the defrag,
which is not very efficient.

With a solution where it's possible to sort out fragments early
(based on e.g. source address) and send them to the same container for
defragmentation, no extra hardware is needed and only fragmented packets
take an extra hop.


A simplified example:
(ASCII graphics have some limitations)

            Blade 1
         +------------+
         |   +-----+  | Defrag/LB
Inet A   |   | FW. |  |  Traffic                VIP 11.1.1.1
---------+-> | LXC |--|-->+                     Blade a
         |   +-----+  |   |                    +-------+
         |      |<----|---+                    | Appl. |
         |   +-----+  |   |       +-------- >  | Serv. |
         |   | LB. |__|___|_______|            +-------+
         |   | IPVS|  |   |       |
         |   +-----+  |   |       |
         +------------+   |       |
                          |       |
            Blade 2       |       |
         +------------+   |       |             VIP 11.1.1.1
         |   +-----+  |   |       |             Blade b
Inet B   |   | FW. |  |   |       |            +-------+
---------+-> | LXC |--|-->|       |            | Appl. |
         |   +-----+  |   |       +----------> | Serv. |
         |      | <---|---+       |            +-------+
         |   +-----+  |   |       |
         |   | LB. |__|___|_______|
         |   | IPVS|  |   |       |             VIP 11.1.1.1
         |   +-----+  |   |       |             Blade c
         +------------+   |       |            +-------+
                          |       |            | Appl. |
            Blade n       |       +--------->  | Serv. |
         +------------+   |       |            +-------+
         |   +-----+  |   |       |
Inet N   |   | FW. |  |   |       |             VIP 11.1.1.1
---------+-> | LXC |--|-->|       |             Blade x
         |   +-----+  |   |       |            +-------+
         |      |<----|---+       |            | Appl. |
         |   +-----+  |           +--------->  | Serv. |
         |   | LB. |__|___________|            +-------+
         |   | IPVS|  |
         |   +-----+  |
         +------------+

You might even co-locate the Appl on the FW/GW Blades.
The ideal solution would be where you can sort out fragments based on interface
and have defrag on others. (In this case even the first fragment)
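
To make "even the first fragment" concrete, here is a minimal sketch of the
test such a match would need, assuming the IPv4 header's 16-bit
flags/fragment-offset field has already been converted to host byte order.
The function and macro names are made up for illustration; the bit layout
(MF flag 0x2000, 13-bit offset mask 0x1FFF) is the standard IPv4 one. The
existing iptables -f match only covers second and further fragments, which is
why a separate flag is being discussed.

```c
#include <assert.h>
#include <stdint.h>

#define FRAG_MF          0x2000u /* "more fragments" flag */
#define FRAG_OFFSET_MASK 0x1FFFu /* fragment offset, in 8-byte units */

/* Any fragment, including the first one (offset 0 with MF set). */
static int is_any_fragment(uint16_t frag_off_host)
{
	return (frag_off_host & (FRAG_MF | FRAG_OFFSET_MASK)) != 0;
}

/* Only non-first fragments, i.e. what iptables -f selects today. */
static int is_later_fragment(uint16_t frag_off_host)
{
	return (frag_off_host & FRAG_OFFSET_MASK) != 0;
}

/* Usage example covering the interesting header values. */
static void fragment_examples(void)
{
	assert(!is_any_fragment(0x4000));   /* DF set, not a fragment */
	assert(is_any_fragment(0x2000));    /* first fragment: MF, offset 0 */
	assert(!is_later_fragment(0x2000)); /* ... which -f would not match */
	assert(is_any_fragment(0x00b9));    /* last fragment: offset 185, MF clear */
	assert(is_later_fragment(0x00b9));
}
```
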

Regards
Hans Schillstrom
Hans Schillstrom Jan. 4, 2012, 8:56 p.m. UTC | #6
On Wednesday, January 04, 2012 19:05:10 Jozsef Kadlecsik wrote:
> I agree. If you don't want defragmentation at all, then make sure you 
> don't load the nf_conntrack module directly/indirectly. Conntrack doesn't 
> work without defragmentation anyway.

We are using LXC, and it's only the container that holds the external 
interface that can't have defragmentation.
The problem is that once the module is loaded, you have it in all namespaces :-(

> 
> The only thing what such a really-early table could buy at the moment is 
> to specify which flows not to defragment at layer 3 level.
> 
> If we had dynamic hooks registration and hook priorities at table level, 
> that'd come handy now.

I do agree.

> 

Regards
Hans
Hans Schillstrom Jan. 4, 2012, 9:15 p.m. UTC | #7
Hello Again

On Wednesday, January 04, 2012 18:40:35 Pablo Neira Ayuso wrote:
> On Wed, Jan 04, 2012 at 12:48:35PM +0100, Hans Schillstrom wrote:
> > I like that idea, an "early" table at prio -500 with PREROUTING.
> > There is also a need for a new flag "--allfrags"
> > i.e. all fragments needs to be sorted out and sent to same dest for defrag.
> > 
> > ex.
> > iptables -t early -A PREROUTING -i eth0 --allfrags -j NOTRACK
> 
> New tables add too much overhead. We have discussed this before with
> Patrick.
> 
Only if loaded .. 
It would have been the perfect solution.
Is the discussion about the overhead on the list (I can't find it)?

I made a quick test with an "early" table
and an --allfrags fix (for IPv4), and it works really well.

iptables -t early -A PREROUTING -i eth0 -a -j NOTRACK
iptables -t mangle -A PREROUTING -i eth0 -a -j HMARK --mod 3 --offs 100

So your opinion is no more tables,
even if the table is rarely loaded?

Regards
Hans
Jozsef Kadlecsik Jan. 4, 2012, 9:40 p.m. UTC | #8
On Wed, 4 Jan 2012, Hans Schillstrom wrote:

> We are using LXC and it's only in the container that holds the external 
> interface that can't have defragmentation.
> The problem is if it's loaded you have it in all namespaces :-(

Conntrack is per net namespace. You may have one container with conntrack 
enabled and another one without conntrack.

Moreover, if you may receive fragments of the same packet at different 
interfaces in different blades, then you may receive different whole 
packets of the same flow at different interfaces/blades. But stateful 
firewalling relies on the assumption that all packets go through the 
firewall. Because that is not assured, conntrack must not run in the 
containers you denoted as FW LXC.

Best regards,
Jozsef
-
E-mail  : kadlec@blackhole.kfki.hu, kadlecsik.jozsef@wigner.mta.hu
PGP key : http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address : Wigner Research Centre for Physics, Hungarian Academy of Sciences
          H-1525 Budapest 114, POB. 49, Hungary
Hans Schillstrom Jan. 5, 2012, 7:19 a.m. UTC | #9
On Wednesday 04 January 2012 22:40:09 Jozsef Kadlecsik wrote:
> On Wed, 4 Jan 2012, Hans Schillstrom wrote:
> 
> > We are using LXC and it's only in the container that holds the external 
> > interface that can't have defragmentation.
> > The problem is if it's loaded you have it in all namespaces :-(
> 
> Conntrack is per net namespaces. You may have one container with conntrack 
> enabled and another one without conntrack.

How do you disable conntrack per netns?
I can't see how to do it except with NOTRACK,
and then the nf_defrag issue is still there...

> 
> Moreover, if you may receive fragments of the same packet at different 
> interfaces in different blades, then you may receive different whole 
> packets of the same flow at different interfaces/blades. But stateful 
> firewalling relies on the assumption that all packets goes through of the 
> firewall. 
True, you can't have a stateful fw at that stage because of fragments.
> Because it's not assured, conntrack may not run in the 
> containers you denoted as FW LXC.
That's why I want to disable defrag and conntrack in them.

A single flow, with Containers in any Blade.

    +---------------------------+    /              +--------------------+
--> | FW (no CT)frag  HMARK sel |--->---            | Conntrack and IPVS |---->
    +---------------------------+    \              +--------------------+
            \ (fragments)                                  ..
             v 
         +---------------------------+   /                 ..
         |     de-frag  HMARK sel    |----->  
         +---------------------------+   \          +--------------------+
                                                    | Conntrack and IPVS |---->
                                                    +--------------------+

Note that HMARK makes a preselection of which IPVS to use and directs the flow
to the same IPVS regardless of which blade/interface it arrives on,
i.e. the defragmented packet will reach the same IPVS as the others.
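
A minimal sketch of that preselection, assuming the mark is derived as
hash % mod + offs, as the "--mod 3 --offs 100" options in the earlier test
suggest. The hash below (FNV-1a over source and destination address) is a
stand-in chosen for illustration and is not the hash HMARK actually uses; the
function names are hypothetical too. Only layer 3 addresses go into the hash,
because non-first fragments carry no port numbers, so every fragment and every
whole packet of a flow maps to the same mark on any blade.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in hash (FNV-1a) over the two IPv4 addresses, byte by byte. */
static uint32_t addr_hash(uint32_t saddr, uint32_t daddr)
{
	uint32_t words[2] = { saddr, daddr };
	uint32_t h = 2166136261u;

	for (int w = 0; w < 2; w++) {
		for (int b = 0; b < 4; b++) {
			h ^= (words[w] >> (8 * b)) & 0xffu;
			h *= 16777619u;
		}
	}
	return h;
}

/* Map a flow to one of `mod` marks starting at `offs`. */
static uint32_t select_mark(uint32_t saddr, uint32_t daddr,
			    uint32_t mod, uint32_t offs)
{
	assert(mod != 0);
	return addr_hash(saddr, daddr) % mod + offs;
}
```

With mod 3 and offs 100 the result is always 100, 101 or 102, and it depends
only on the address pair, not on the receiving interface.
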

Patch

--- a/net/ipv4/netfilter/nf_defrag_ipv4.c
+++ b/net/ipv4/netfilter/nf_defrag_ipv4.c
@@ -74,6 +74,14 @@  static unsigned int ipv4_conntrack_defrag(unsigned int hooknum,
...
+	const struct net_device *dev = (hooknum == NF_INET_LOCAL_OUT ?
+					out : in);
+
+	/* Defrag disabled in this netns and not previously seen (loopback)? */
+	if (dev_net(dev)->ct.sysctl_notrac_defrag && !skb->nfct) {
+		/* Attach the fake "untracked" conntrack entry, as NOTRACK does */
+		skb->nfct = &nf_ct_untracked_get()->ct_general;
+		skb->nfctinfo = IP_CT_NEW;
+		nf_conntrack_get(skb->nfct);
+		return NF_ACCEPT;
+	}
...

-- 
Regards
Hans Schillstrom <hans.schillstrom@ericsson.com>