Message ID: 20100413145944.GA7716@redhat.com
State: New
On Tue, Apr 13, 2010 at 05:59:44PM +0300, Michael S. Tsirkin wrote: > The following situation was observed in the field: > tap1 sends packets, tap2 does not consume them, as a result > tap1 can not be closed. This happens because > tun/tap devices can hang on to skbs undefinitely. > > As noted by Herbert, possible solutions include a timeout followed by a > copy/change of ownership of the skb, or always copying/changing > ownership if we're going into a hostile device. > > This patch implements the second approach. > > Note: one issue still remaining is that since skbs > keep reference to tun socket and tun socket has a > reference to tun device, we won't flush backlog, > instead simply waiting for all skbs to get transmitted. > At least this is not user-triggerable, and > this was not reported in practice, my assumption is > other devices besides tap complete an skb > within finite time after it has been queued. > > A possible solution for the second issue > would not to have socket reference the device, > instead, implement dev->destructor for tun, and > wait for all skbs to complete there, but this > needs some thought, probably too risky for 2.6.34. > > Signed-off-by: Michael S. Tsirkin <mst@redhat.com> > Tested-by: Yan Vugenfirer <yvugenfi@redhat.com> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Thanks,
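For context on why the orphan fixes the close hang: each skb that tap1 sends and tap2 leaves sitting in its queue still points back at tap1's tun socket through skb->sk and its destructor, and that socket in turn pins the sending device, so tap1 cannot be torn down until tap2's reader drains the backlog. Around this kernel version skb_orphan() looked roughly like this (include/linux/skbuff.h, comments added here):

```c
static inline void skb_orphan(struct sk_buff *skb)
{
	if (skb->destructor)
		skb->destructor(skb);	/* e.g. sock_wfree(): uncharge sk_wmem_alloc, wake the writer */
	skb->destructor = NULL;
	skb->sk		= NULL;		/* the buffer no longer pins the sending socket */
}
```

Calling this at xmit time means a hostile receiver only ever holds unowned buffers, so the sender's socket accounting is released immediately and its close no longer depends on the receiver's good behaviour.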
Michael S. Tsirkin wrote: > The following situation was observed in the field: > tap1 sends packets, tap2 does not consume them, as a result > tap1 can not be closed. And before that, tap1 may not be able to send further packets to anyone else on the bridge as its TX resources were blocked by tap2 - that's what we saw in the field. > This happens because > tun/tap devices can hang on to skbs undefinitely. > > As noted by Herbert, possible solutions include a timeout followed by a > copy/change of ownership of the skb, or always copying/changing > ownership if we're going into a hostile device. > > This patch implements the second approach. > > Note: one issue still remaining is that since skbs > keep reference to tun socket and tun socket has a > reference to tun device, we won't flush backlog, > instead simply waiting for all skbs to get transmitted. > At least this is not user-triggerable, and > this was not reported in practice, my assumption is > other devices besides tap complete an skb > within finite time after it has been queued. > > A possible solution for the second issue > would not to have socket reference the device, > instead, implement dev->destructor for tun, and > wait for all skbs to complete there, but this > needs some thought, probably too risky for 2.6.34. > > Signed-off-by: Michael S. Tsirkin <mst@redhat.com> > Tested-by: Yan Vugenfirer <yvugenfi@redhat.com> > > --- > > Please review the below, and consider for 2.6.34, > and stable trees. > > drivers/net/tun.c | 4 ++++ > 1 files changed, 4 insertions(+), 0 deletions(-) > > diff --git a/drivers/net/tun.c b/drivers/net/tun.c > index 96c39bd..4326520 100644 > --- a/drivers/net/tun.c > +++ b/drivers/net/tun.c > @@ -387,6 +387,10 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev) > } > } > > + /* Orphan the skb - required as we might hang on to it > + * for indefinite time. */ > + skb_orphan(skb); > + > /* Enqueue packet */ > skb_queue_tail(&tun->socket.sk->sk_receive_queue, skb); > dev->trans_start = jiffies; Nice solution, thanks! Jan
Le mardi 13 avril 2010 à 17:36 +0200, Jan Kiszka a écrit : > Michael S. Tsirkin wrote: > > The following situation was observed in the field: > > tap1 sends packets, tap2 does not consume them, as a result > > tap1 can not be closed. > > And before that, tap1 may not be able to send further packets to anyone > else on the bridge as its TX resources were blocked by tap2 - that's > what we saw in the field. > After the patch, tap1 is able to flood tap2, and tap3/tap4 not able to send one single frame. Is it OK ? Back to the problem : tap1 cannot be closed. Why ? because of refcounts ? When a socket with inflight tx packets is closed, we dont block the close, we only delay the socket freeing once all packets were delivered and freed.
Eric Dumazet wrote: > Le mardi 13 avril 2010 à 17:36 +0200, Jan Kiszka a écrit : >> Michael S. Tsirkin wrote: >>> The following situation was observed in the field: >>> tap1 sends packets, tap2 does not consume them, as a result >>> tap1 can not be closed. >> And before that, tap1 may not be able to send further packets to anyone >> else on the bridge as its TX resources were blocked by tap2 - that's >> what we saw in the field. >> > > After the patch, tap1 is able to flood tap2, and tap3/tap4 not able to > send one single frame. Is it OK ? I think if that's a real issue, you have to apply traffic shaping to the untrusted nodes. The existing flow-control scheme was fragile anyway as you had to translate packet lengths on TX side to packet counts on RX. Jan
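The fragile flow-control scheme Jan refers to is the check at the top of tun_net_xmit() in this kernel version; roughly (abridged excerpt, not a complete function):

```c
	if (skb_queue_len(&tun->socket.sk->sk_receive_queue) >= dev->tx_queue_len) {
		if (!(tun->flags & TUN_ONE_QUEUE)) {
			/* Normal queueing mode: stop our TX queue and let the
			 * packet scheduler handle dropping of further packets. */
			netif_stop_queue(dev);
			dev->stats.tx_fifo_errors++;
		} else {
			/* Single queue mode: the driver drops the packet itself. */
			goto drop;
		}
	}
```

The receiver's backlog is capped in packets (tx_queue_len) while the sender's in-flight budget is accounted in bytes against its sndbuf, which is the length-to-count mismatch Jan is pointing at.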
On Tue, Apr 13, 2010 at 06:40:38PM +0200, Eric Dumazet wrote: > Le mardi 13 avril 2010 à 17:36 +0200, Jan Kiszka a écrit : > > Michael S. Tsirkin wrote: > > > The following situation was observed in the field: > > > tap1 sends packets, tap2 does not consume them, as a result > > > tap1 can not be closed. > > > > And before that, tap1 may not be able to send further packets to anyone > > else on the bridge as its TX resources were blocked by tap2 - that's > > what we saw in the field. > > > > After the patch, tap1 is able to flood tap2, and tap3/tap4 not able to > send one single frame. Is it OK ? Yes :) This was always possible. Number of senders needed to flood a receiver might vary depending on send/recv queue size that you set. External sources can also fill your RX queue if you let them. In the end, we need to rely on the scheduler for fairness, or apply packet shaping. > Back to the problem : tap1 cannot be closed. > > Why ? because of refcounts ? Yes. > When a socket with inflight tx packets is closed, we dont block the > close, we only delay the socket freeing once all packets were delivered > and freed. > Which is wrong, since this is under userspace control, so you get unkillable processes.
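The queue sizes Michael mentions are tunable from userspace: the receive side is the tap device's txqueuelen, and the send side is the per-socket limit tun exposes through the TUNSETSNDBUF ioctl. A minimal sketch of the latter (error handling omitted):

```c
#include <sys/ioctl.h>
#include <linux/if_tun.h>

/* Cap how many bytes of in-flight skbs this tun/tap fd may have
 * charged to its socket at any one time. */
static int tun_set_sndbuf(int tun_fd, int sndbuf_bytes)
{
	return ioctl(tun_fd, TUNSETSNDBUF, &sndbuf_bytes);
}
```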
Le mardi 13 avril 2010 à 20:39 +0300, Michael S. Tsirkin a écrit : > > When a socket with inflight tx packets is closed, we dont block the > > close, we only delay the socket freeing once all packets were delivered > > and freed. > > > > Which is wrong, since this is under userspace control, so you get > unkillable processes. > We do not get unkillable processes, at least with sockets I was thinking about (TCP/UDP ones). Maybe tun sockets can behave the same ? Herbert Acked your patch, so I guess its OK, but I think it can be dangerous. Anyway my feeling is that we try to add various mechanisms to keep a hostile user flooding another one. For example, UDP got memory accounting quite recently, and we added socket backlog limits very recently. It was considered not needed few years ago.
On Tue, Apr 13, 2010 at 08:31:03PM +0200, Eric Dumazet wrote: > Le mardi 13 avril 2010 à 20:39 +0300, Michael S. Tsirkin a écrit : > > > > When a socket with inflight tx packets is closed, we dont block the > > > close, we only delay the socket freeing once all packets were delivered > > > and freed. > > > > > > > Which is wrong, since this is under userspace control, so you get > > unkillable processes. > > > > We do not get unkillable processes, at least with sockets I was thinking > about (TCP/UDP ones). > > Maybe tun sockets can behave the same ? Looks like that's what my patch does: ip_rcv seems to call skb_orphan too. > Herbert Acked your patch, so I guess its OK, but I think it can be > dangerous. > Anyway my feeling is that we try to add various mechanisms to keep a > hostile user flooding another one. > > For example, UDP got memory accounting quite recently, and we added > socket backlog limits very recently. It was considered not needed few > years ago. >
Le mardi 13 avril 2010 à 23:25 +0300, Michael S. Tsirkin a écrit : > On Tue, Apr 13, 2010 at 08:31:03PM +0200, Eric Dumazet wrote: > > Le mardi 13 avril 2010 à 20:39 +0300, Michael S. Tsirkin a écrit : > > > > > > When a socket with inflight tx packets is closed, we dont block the > > > > close, we only delay the socket freeing once all packets were delivered > > > > and freed. > > > > > > > > > > Which is wrong, since this is under userspace control, so you get > > > unkillable processes. > > > > > > > We do not get unkillable processes, at least with sockets I was thinking > > about (TCP/UDP ones). > > > > Maybe tun sockets can behave the same ? > > Looks like that's what my patch does: ip_rcv seems to call > skb_orphan too. Well, I was speaking of tx side, you speak of receiving side. An external flood (coming from another domain) is another problem. A sender might flood the 'network' inside our domain. How can we reasonably limit the sender ? Maybe the answer is 'We can not', but it should be stated somewhere, so that someone can address this point later.
On Tue, Apr 13, 2010 at 10:38:06PM +0200, Eric Dumazet wrote: > Le mardi 13 avril 2010 à 23:25 +0300, Michael S. Tsirkin a écrit : > > On Tue, Apr 13, 2010 at 08:31:03PM +0200, Eric Dumazet wrote: > > > Le mardi 13 avril 2010 à 20:39 +0300, Michael S. Tsirkin a écrit : > > > > > > > > When a socket with inflight tx packets is closed, we dont block the > > > > > close, we only delay the socket freeing once all packets were delivered > > > > > and freed. > > > > > > > > > > > > > Which is wrong, since this is under userspace control, so you get > > > > unkillable processes. > > > > > > > > > > We do not get unkillable processes, at least with sockets I was thinking > > > about (TCP/UDP ones). > > > > > > Maybe tun sockets can behave the same ? > > > > Looks like that's what my patch does: ip_rcv seems to call > > skb_orphan too. > > Well, I was speaking of tx side, you speak of receiving side. Point is, both ip_rcv and my patch call skb_orphan. > An external flood (coming from another domain) is another problem. > > A sender might flood the 'network' inside our domain. How can we > reasonably limit the sender ? > > Maybe the answer is 'We can not', but it should be stated somewhere, so > that someone can address this point later. > And whatever's done should ideally work for tap to IP and IP to IP sockets as well, not just tap to tap.
On Tue, Apr 13, 2010 at 08:31:03PM +0200, Eric Dumazet wrote: > > Herbert Acked your patch, so I guess its OK, but I think it can be > dangerous. The tun socket accounting was never designed to stop it from flooding another tun interface. It's there to stop it from transmitting above a destination interface TX bandwidth and cause unnecessary packet drops. It also limits the total amount of kernel memory that can be pinned down by a single tun interface. In this case, all we're doing is shifting the accounting from the "hardware" queue to the qdisc queue. So your ability to flood a tun interface is essentially unchanged. BTW we do the same thing in a number of hardware drivers, as well as virtio-net. Cheers,
From: Herbert Xu <herbert@gondor.apana.org.au> Date: Wed, 14 Apr 2010 08:58:22 +0800 > On Tue, Apr 13, 2010 at 08:31:03PM +0200, Eric Dumazet wrote: >> >> Herbert Acked your patch, so I guess its OK, but I think it can be >> dangerous. > > The tun socket accounting was never designed to stop it from > flooding another tun interface. It's there to stop it from > transmitting above a destination interface TX bandwidth and > cause unnecessary packet drops. It also limits the total amount > of kernel memory that can be pinned down by a single tun interface. > > In this case, all we're doing is shifting the accounting from the > "hardware" queue to the qdisc queue. > > So your ability to flood a tun interface is essentially unchanged. > > BTW we do the same thing in a number of hardware drivers, as well > as virtio-net. Right. Although this reminds me about the whole SKB orphaning on xmit issue that keeps coming back to haunt us. If there weren't odd references to the SKB's socket in the packet scheduler et al. we could just orphan these things right upon entry to the qdisc and not have to add hacks like this to every driver. In fact... maybe we can just do it in dev_hard_queue_xmit() since we are out of the qdisc at that point.... but I guess there might be weird drivers that want the SKB socket in their ->xmit routine... Ho hum. In any event that's net-next-2.6 exploratory material, and I've applied this patch to net-2.6, thanks!
On Tue, Apr 13, 2010 at 05:59:44PM +0300, Michael S. Tsirkin wrote: > The following situation was observed in the field: > tap1 sends packets, tap2 does not consume them, as a result > tap1 can not be closed. This happens because > tun/tap devices can hang on to skbs undefinitely. > > As noted by Herbert, possible solutions include a timeout followed by a > copy/change of ownership of the skb, or always copying/changing > ownership if we're going into a hostile device. > > This patch implements the second approach. > > Note: one issue still remaining is that since skbs > keep reference to tun socket and tun socket has a > reference to tun device, we won't flush backlog, > instead simply waiting for all skbs to get transmitted. > At least this is not user-triggerable, and > this was not reported in practice, my assumption is > other devices besides tap complete an skb > within finite time after it has been queued. > > A possible solution for the second issue > would not to have socket reference the device, > instead, implement dev->destructor for tun, and > wait for all skbs to complete there, but this > needs some thought, probably too risky for 2.6.34. > > Signed-off-by: Michael S. Tsirkin <mst@redhat.com> > Tested-by: Yan Vugenfirer <yvugenfi@redhat.com> > > --- > > Please review the below, and consider for 2.6.34, > and stable trees. > > drivers/net/tun.c | 4 ++++ > 1 files changed, 4 insertions(+), 0 deletions(-) > > diff --git a/drivers/net/tun.c b/drivers/net/tun.c > index 96c39bd..4326520 100644 > --- a/drivers/net/tun.c > +++ b/drivers/net/tun.c > @@ -387,6 +387,10 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev) > } > } > > + /* Orphan the skb - required as we might hang on to it > + * for indefinite time. */ > + skb_orphan(skb); > + > /* Enqueue packet */ > skb_queue_tail(&tun->socket.sk->sk_receive_queue, skb); > dev->trans_start = jiffies; > -- > 1.7.0.2.280.gc6f05 This is commit 0110d6f22f392f976e84ab49da1b42f85b64a3c5 in net-2.6 Please cherry-pick this fix in stable kernels (2.6.32 and 2.6.33). Thanks!
On Wed, Apr 21, 2010 at 01:46:33PM +0200, Jan Kiszka wrote: > Michael S. Tsirkin wrote: > > On Tue, Apr 13, 2010 at 05:59:44PM +0300, Michael S. Tsirkin wrote: > >> The following situation was observed in the field: > >> tap1 sends packets, tap2 does not consume them, as a result > >> tap1 can not be closed. This happens because > >> tun/tap devices can hang on to skbs undefinitely. > >> > >> As noted by Herbert, possible solutions include a timeout followed by a > >> copy/change of ownership of the skb, or always copying/changing > >> ownership if we're going into a hostile device. > >> > >> This patch implements the second approach. > >> > >> Note: one issue still remaining is that since skbs > >> keep reference to tun socket and tun socket has a > >> reference to tun device, we won't flush backlog, > >> instead simply waiting for all skbs to get transmitted. > >> At least this is not user-triggerable, and > >> this was not reported in practice, my assumption is > >> other devices besides tap complete an skb > >> within finite time after it has been queued. > >> > >> A possible solution for the second issue > >> would not to have socket reference the device, > >> instead, implement dev->destructor for tun, and > >> wait for all skbs to complete there, but this > >> needs some thought, probably too risky for 2.6.34. > >> > >> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> > >> Tested-by: Yan Vugenfirer <yvugenfi@redhat.com> > >> > >> --- > >> > >> Please review the below, and consider for 2.6.34, > >> and stable trees. > >> > >> drivers/net/tun.c | 4 ++++ > >> 1 files changed, 4 insertions(+), 0 deletions(-) > >> > >> diff --git a/drivers/net/tun.c b/drivers/net/tun.c > >> index 96c39bd..4326520 100644 > >> --- a/drivers/net/tun.c > >> +++ b/drivers/net/tun.c > >> @@ -387,6 +387,10 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev) > >> } > >> } > >> > >> + /* Orphan the skb - required as we might hang on to it > >> + * for indefinite time. */ > >> + skb_orphan(skb); > >> + > >> /* Enqueue packet */ > >> skb_queue_tail(&tun->socket.sk->sk_receive_queue, skb); > >> dev->trans_start = jiffies; > >> -- > >> 1.7.0.2.280.gc6f05 > > > > This is commit 0110d6f22f392f976e84ab49da1b42f85b64a3c5 in net-2.6 > > Please cherry-pick this fix in stable kernels (2.6.32 and 2.6.33). > > > > Thanks! > > > > Before I forget again: > > Tested-by: Jan Kiszka <jan.kiszka@siemens.com> > (namely the field test of our customer) > > Jan Cool, thanks! > -- > Siemens AG, Corporate Technology, CT T DE IT 1 > Corporate Competence Center Embedded Linux
Michael S. Tsirkin wrote: > On Tue, Apr 13, 2010 at 05:59:44PM +0300, Michael S. Tsirkin wrote: >> The following situation was observed in the field: >> tap1 sends packets, tap2 does not consume them, as a result >> tap1 can not be closed. This happens because >> tun/tap devices can hang on to skbs undefinitely. >> >> As noted by Herbert, possible solutions include a timeout followed by a >> copy/change of ownership of the skb, or always copying/changing >> ownership if we're going into a hostile device. >> >> This patch implements the second approach. >> >> Note: one issue still remaining is that since skbs >> keep reference to tun socket and tun socket has a >> reference to tun device, we won't flush backlog, >> instead simply waiting for all skbs to get transmitted. >> At least this is not user-triggerable, and >> this was not reported in practice, my assumption is >> other devices besides tap complete an skb >> within finite time after it has been queued. >> >> A possible solution for the second issue >> would not to have socket reference the device, >> instead, implement dev->destructor for tun, and >> wait for all skbs to complete there, but this >> needs some thought, probably too risky for 2.6.34. >> >> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> >> Tested-by: Yan Vugenfirer <yvugenfi@redhat.com> >> >> --- >> >> Please review the below, and consider for 2.6.34, >> and stable trees. >> >> drivers/net/tun.c | 4 ++++ >> 1 files changed, 4 insertions(+), 0 deletions(-) >> >> diff --git a/drivers/net/tun.c b/drivers/net/tun.c >> index 96c39bd..4326520 100644 >> --- a/drivers/net/tun.c >> +++ b/drivers/net/tun.c >> @@ -387,6 +387,10 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev) >> } >> } >> >> + /* Orphan the skb - required as we might hang on to it >> + * for indefinite time. */ >> + skb_orphan(skb); >> + >> /* Enqueue packet */ >> skb_queue_tail(&tun->socket.sk->sk_receive_queue, skb); >> dev->trans_start = jiffies; >> -- >> 1.7.0.2.280.gc6f05 > > This is commit 0110d6f22f392f976e84ab49da1b42f85b64a3c5 in net-2.6 > Please cherry-pick this fix in stable kernels (2.6.32 and 2.6.33). > > Thanks! > Before I forget again: Tested-by: Jan Kiszka <jan.kiszka@siemens.com> (namely the field test of our customer) Jan
On Wed, Apr 21, 2010 at 02:35:57PM +0300, Michael S. Tsirkin wrote: > On Tue, Apr 13, 2010 at 05:59:44PM +0300, Michael S. Tsirkin wrote: > > The following situation was observed in the field: > > tap1 sends packets, tap2 does not consume them, as a result > > tap1 can not be closed. This happens because > > tun/tap devices can hang on to skbs undefinitely. > > > > As noted by Herbert, possible solutions include a timeout followed by a > > copy/change of ownership of the skb, or always copying/changing > > ownership if we're going into a hostile device. > > > > This patch implements the second approach. > > > > Note: one issue still remaining is that since skbs > > keep reference to tun socket and tun socket has a > > reference to tun device, we won't flush backlog, > > instead simply waiting for all skbs to get transmitted. > > At least this is not user-triggerable, and > > this was not reported in practice, my assumption is > > other devices besides tap complete an skb > > within finite time after it has been queued. > > > > A possible solution for the second issue > > would not to have socket reference the device, > > instead, implement dev->destructor for tun, and > > wait for all skbs to complete there, but this > > needs some thought, probably too risky for 2.6.34. > > > > Signed-off-by: Michael S. Tsirkin <mst@redhat.com> > > Tested-by: Yan Vugenfirer <yvugenfi@redhat.com> > > > > --- > > > > Please review the below, and consider for 2.6.34, > > and stable trees. > > > > drivers/net/tun.c | 4 ++++ > > 1 files changed, 4 insertions(+), 0 deletions(-) > > > > diff --git a/drivers/net/tun.c b/drivers/net/tun.c > > index 96c39bd..4326520 100644 > > --- a/drivers/net/tun.c > > +++ b/drivers/net/tun.c > > @@ -387,6 +387,10 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev) > > } > > } > > > > + /* Orphan the skb - required as we might hang on to it > > + * for indefinite time. */ > > + skb_orphan(skb); > > + > > /* Enqueue packet */ > > skb_queue_tail(&tun->socket.sk->sk_receive_queue, skb); > > dev->trans_start = jiffies; > > -- > > 1.7.0.2.280.gc6f05 > > This is commit 0110d6f22f392f976e84ab49da1b42f85b64a3c5 in net-2.6 > Please cherry-pick this fix in stable kernels (2.6.32 and 2.6.33). David Miller queues up the patches for the network subsystem for the stable trees, and then forwards them to me when he feels they are ready. So I'll defer to him on this one and wait for it to come from him. thanks, greg k-h
On Wed, 2010-04-14 at 08:58 +0800, Herbert Xu wrote: > On Tue, Apr 13, 2010 at 08:31:03PM +0200, Eric Dumazet wrote: > > > > Herbert Acked your patch, so I guess its OK, but I think it can be > > dangerous. > > The tun socket accounting was never designed to stop it from > flooding another tun interface. It's there to stop it from > transmitting above a destination interface TX bandwidth and > cause unnecessary packet drops. It also limits the total amount > of kernel memory that can be pinned down by a single tun interface. > > In this case, all we're doing is shifting the accounting from the > "hardware" queue to the qdisc queue. > > So your ability to flood a tun interface is essentially unchanged. I've just been looking at VPN performance, using netperf to flood an openconnect/ocserv connection over GigE and profiling my VPN client. If I run netperf over the *unencrypted* link, it only sends 1Gb/s of packets — because the packets are correctly accounted to netperf's UDP socket until the moment they're actually transmitted on the wire, and the backpressure works correctly. When I run over the VPN, netperf thinks it sent 2½ times the amount of TX traffic. Packets are being dropped by the tun device before even feeding them up to the VPN client to be sent — presumably because of this skb_orphan() call. (The client itself should do the right thing, and only suck packets out of the tun at the rate it can shove them out *its* UDP socket.) Did we ever look at the alternative solution of taking ownership only after a timeout, or on demand when we need to shut down the device?
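The backpressure David is describing comes from the socket write-memory accounting: an skb built for the sending socket stays charged against that socket until its destructor runs, normally when the NIC frees the buffer after transmission. Simplified versions of the two helpers involved (not verbatim kernel code; the real sock_wfree() also drops the socket's last reference once nothing is left in flight):

```c
/* Charging: done when the skb is created for a socket, e.g. via
 * sock_alloc_send_skb() on the UDP sendmsg or tun_get_user() path. */
static inline void skb_set_owner_w(struct sk_buff *skb, struct sock *sk)
{
	skb_orphan(skb);
	skb->sk = sk;
	skb->destructor = sock_wfree;
	atomic_add(skb->truesize, &sk->sk_wmem_alloc);	/* counts against sk_sndbuf */
}

/* Uncharging: runs when the skb is freed - or orphaned. */
void sock_wfree(struct sk_buff *skb)
{
	struct sock *sk = skb->sk;

	atomic_sub(skb->truesize, &sk->sk_wmem_alloc);
	sk->sk_write_space(sk);		/* wake a sender blocked on its sndbuf */
}
```

Because tun_net_xmit() orphans at queueing time, sock_wfree() fires the moment the packet lands on the tun backlog, so netperf's socket never fills and it can overrun the VPN client.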
On Sun, Feb 01, 2015 at 11:20:33AM +0000, David Woodhouse wrote: > On Wed, 2010-04-14 at 08:58 +0800, Herbert Xu wrote: > > On Tue, Apr 13, 2010 at 08:31:03PM +0200, Eric Dumazet wrote: > > > > > > Herbert Acked your patch, so I guess its OK, but I think it can be > > > dangerous. > > > > The tun socket accounting was never designed to stop it from > > flooding another tun interface. It's there to stop it from > > transmitting above a destination interface TX bandwidth and > > cause unnecessary packet drops. It also limits the total amount > > of kernel memory that can be pinned down by a single tun interface. > > > > In this case, all we're doing is shifting the accounting from the > > "hardware" queue to the qdisc queue. > > > > So your ability to flood a tun interface is essentially unchanged. > > I've just been looking at VPN performance, using netperf to flood an > openconnect/ocserv connection over GigE and profiling my VPN client. > > If I run netperf over the *unencrypted* link, it only sends 1Gb/s of > packets — because the packets are correctly accounted to netperf's UDP > socket until the moment they're actually transmitted on the wire, and > the backpressure works correctly. > > When I run over the VPN, netperf thinks it sent 2½ times the amount of > TX traffic. At some level, it's expected: netperf's manual actually says: A UDP_STREAM test has no end-to-end flow control - UDP provides none and neither does netperf. However, if you wish, you can configure netperf with --enable-intervals=yes to enable the global command-line -b and -w options to pace bursts of traffic onto the network. > Packets are being dropped by the tun device before even > feeding them up to the VPN client to be sent — presumably because of > this skb_orphan() call. (The client itself should do the right thing, > and only suck packets out of the tun at the rate it can shove them out > *its* UDP socket.) A simple work-around is to limit the rate using a non work conservig qdisc. > Did we ever look at the alternative solution of taking ownership only > after a timeout, or on demand when we need to shut down the device? I've been thinking about this on and off, but didn't find a good safe solution yet. For timeout, the difficulty is to find a good timer value, low enough to avoid DOS attacks but high enough to avoid spurious packet drops (and expensive timer interrupts).
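For illustration only, the timeout variant being discussed could look something like the sketch below. Nothing like this exists in the driver, the names (orphan_timer, TUN_ORPHAN_JIFFIES) are invented, and it reuses the field names of the 2010-era driver quoted above. The skb keeps its owner while it waits in the queue and only loses it if the reader fails to drain the backlog in time:

```c
#define TUN_ORPHAN_JIFFIES	(HZ / 2)	/* the hard part: choosing this value */

static void tun_orphan_timeout(unsigned long data)
{
	struct tun_struct *tun = (struct tun_struct *)data;
	struct sk_buff_head *q = &tun->socket.sk->sk_receive_queue;
	struct sk_buff *skb;
	unsigned long flags;

	/* Anything still queued after the deadline loses its owner, so a
	 * stuck reader can no longer pin the sender's socket or delay its
	 * close indefinitely - while well-behaved readers keep backpressure. */
	spin_lock_irqsave(&q->lock, flags);
	skb_queue_walk(q, skb)
		skb_orphan(skb);
	spin_unlock_irqrestore(&q->lock, flags);
}

/* tun_net_xmit() would drop its unconditional skb_orphan() and instead
 * arm the timer when it queues a packet, e.g.:
 *	mod_timer(&tun->orphan_timer, jiffies + TUN_ORPHAN_JIFFIES);
 */
```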
On Sun, 2015-02-01 at 14:26 +0200, Michael S. Tsirkin wrote: > > When I run over the VPN, netperf thinks it sent 2½ times the amount of > > TX traffic. > > At some level, it's expected: netperf's manual actually says: > A UDP_STREAM test has no end-to-end flow control - UDP provides none and > neither does netperf. However, if you wish, you can configure netperf > with --enable-intervals=yes to enable the global command-line -b and -w > options to pace bursts of traffic onto the network. True, but UDP is just a canary for other protocols here. We *should* be able to keep track of the packets before they even leave the box, and know that we haven't even managed to send them. Even if we know it's a datagram protocol and it's potentially going to be dropped in transit later. Of course, now I'm looking closely at the path these packets take to leave the box, it starts to offend me that they're being passed up to userspace just to encrypt them (as DTLS or ESP) and then send them back down to the kernel on a UDP socket. The kernel already knows how to {en,de}crypt ESP, and do the sequence number checking on incoming packets. I'm wondering if we bypass userspace in that case somehow — let userspace negotiate the encryption and connect the UDP socket, then just pass the socket fd and the parameters to the kernel so that incoming packets are decrypted and 'received' on the tun device, and outgoing packets on the tun device are encrypted and sent out on the UDP socket. The performance isn't too much of an issue for a VPN *client* in practice, but we have a server implementation too which would probably benefit quite well from such an offload facility. If I were to look at such a thing, would it provoke screams of horror?
From: David Woodhouse <dwmw2@infradead.org> Date: Sun, 01 Feb 2015 13:33:50 +0000 > Of course, now I'm looking closely at the path these packets take to > leave the box, it starts to offend me that they're being passed up to > userspace just to encrypt them (as DTLS or ESP) and then send them back > down to the kernel on a UDP socket. The kernel already knows how to > {en,de}crypt ESP, and do the sequence number checking on incoming > packets. It's funny, I thought we had an IPSEC stack....
On Sun, 2015-02-01 at 12:19 -0800, David Miller wrote: > From: David Woodhouse <dwmw2@infradead.org> > Date: Sun, 01 Feb 2015 13:33:50 +0000 > > > Of course, now I'm looking closely at the path these packets take to > > leave the box, it starts to offend me that they're being passed up to > > userspace just to encrypt them (as DTLS or ESP) and then send them back > > down to the kernel on a UDP socket. The kernel already knows how to > > {en,de}crypt ESP, and do the sequence number checking on incoming > > packets. > > It's funny, I thought we had an IPSEC stack.... Right. But I'm trying to work out how we can sanely *use* that from a VPN client. The client normally sets up a tun device, configuring it with appropriate IP addresses and routes by invoking vpnc-script or passing the information back to NetworkManager. The client itself might not even have root privs, in the NetworkManager case. The initial authentication and connection are done over HTTPS, and packets *can* be passed that way if they need to be. But obviously the client *also* tries to set up a UDP data transport too — which is DTLS in the case of Cisco AnyConnect, and ESP in UDP for Juniper. If it *can* get communication over UDP, it'll use it. Otherwise it just passes packets over the TCP connection. So it needs to dynamically set up and tear down the ESP/DTLS tunnels as and when they are working. Ideally we want it such that that packets routed to the tun device get transparently encrypted and sent out on the UDP socket, and packets received from UDP and successfully decrypted will appear to have arrived on the tun device. The user may be manually tweaking the routing, or setting up firewall/NAT/etc. on the tun device. I can see how to set up an ESP in UDP tunnel such that it looks like the packets are actually departing on the *physical* interface (which in practice I suppose they are). But that's going to be fairly complex to set up, and extremely non-intuitive and hard to manage for the user. To the extent that I don't think it's actually deployable. I really was looking for some way to push down something like an XFRM state into the tun device and just say "shove them out here until I tell you otherwise".
From: David Woodhouse <dwmw2@infradead.org> Date: Sun, 01 Feb 2015 21:29:43 +0000 > I really was looking for some way to push down something like an XFRM > state into the tun device and just say "shove them out here until I tell > you otherwise". People decided to use TUN and push VPN stuff back into userspace, and there are repercussions for that decision. I'm not saying this to be mean or whatever, but I was very disappointed when userland IPSEC solutions using TUN started showing up. We might as well have not have implemented the IPSEC stack at all, because as a result of the userland VPN stuff our IPSEC stack is largely unused except by a very narrow group of users.
On Sun, 2015-02-01 at 21:07 -0800, David Miller wrote: > From: David Woodhouse <dwmw2@infradead.org> > Date: Sun, 01 Feb 2015 21:29:43 +0000 > > > I really was looking for some way to push down something like an XFRM > > state into the tun device and just say "shove them out here until I tell > > you otherwise". > > People decided to use TUN and push VPN stuff back into userspace, > and there are repercussions for that decision. > > I'm not saying this to be mean or whatever, but I was very > disappointed when userland IPSEC solutions using TUN started showing > up. Yeah. That's a valid criticism of vpnc, certainly. I never did understand why it reimplemented the IPSec stack. For my OpenConnect client it's somewhat more justified though — the initial data transport there is over TLS, which the kernel doesn't support. And if we *can* establish UDP communication, that's over DTLS which the kernel doesn't support either. It's not even the *standard* version of DTLS because Cisco are still using a pre-RFC4347 version of the protocol. And we also need to probe the UDP connectivity and do keepalives and manage the fallback to using the TCP data transport. It's not like vpnc where it really is just a case of setting up the ESP context and letting it run. It's only now I've added Juniper support, which uses ESP-in-UDP for the data transport, that I'm doing something that the kernel supports at all. And now I'm looking at how to make use of that. > We might as well have not have implemented the IPSEC stack at all, > because as a result of the userland VPN stuff our IPSEC stack is > largely unused except by a very narrow group of users. Well, I'd love to make better use of it if I can. I do suspect it makes most sense for userspace to continue to manage the probing of UDP connectivity, and the fallback to TCP mode — and I suspect it also makes sense to continue to use tun for passing packets up to the VPN client when it's using the TCP transport. So the question would be how we handle redirecting the packet flow to the optional UDP transport, when the VPN client determines that it's available. For the sake of the user setting up firewall and routing rules, I do think it's important that it continues to appear to userspace as the *same* device for the entire lifetime of the session, regardless of which transport the packets happen to be using at a given moment in time. It doesn't *have* to be tun, though. You don't seem to like my suggestion of somehow pushing down an XFRM state to the tun device to direct the packets out there instead of up to userspace. Do you have an alternative suggestion... or a specific concern that would help me come up with something you like better? I'm guessing you don't want to push the *whole* management of the TLS control connection *and* the UDP transport, and probing the latter with keepalives, into the kernel? I certainly don't :)
On Mon, Feb 02, 2015 at 07:27:10AM +0000, David Woodhouse wrote: > On Sun, 2015-02-01 at 21:07 -0800, David Miller wrote: > > > We might as well have not have implemented the IPSEC stack at all, > > because as a result of the userland VPN stuff our IPSEC stack is > > largely unused except by a very narrow group of users. > > Well, I'd love to make better use of it if I can. I do suspect it makes > most sense for userspace to continue to manage the probing of UDP > connectivity, and the fallback to TCP mode — and I suspect it also makes > sense to continue to use tun for passing packets up to the VPN client > when it's using the TCP transport. > > So the question would be how we handle redirecting the packet flow to > the optional UDP transport, when the VPN client determines that it's > available. For the sake of the user setting up firewall and routing > rules, I do think it's important that it continues to appear to > userspace as the *same* device for the entire lifetime of the session, > regardless of which transport the packets happen to be using at a given > moment in time. It doesn't *have* to be tun, though. > > You don't seem to like my suggestion of somehow pushing down an XFRM > state to the tun device to direct the packets out there instead of up to > userspace. Do you have an alternative suggestion... or a specific > concern that would help me come up with something you like better? Maybe you want to use a virtual tunnel interface (vti) what we have already. Everything that is routed through such an interface is guaranteed to be either encrypted if a matching xfrm state is present or dropped. Same on the rceive side, everything that is received by this interface is guaranteed to be IPsec processed. So you can do a routing based decision about the IPsec processing. While I'm sure it could handle the ESP in UDP encapsulation, I'm not that sure about your TCP fallback because this requires a valid xfrm state to allow packets to pass. Using the same interface for both is probably not possible.
Hi, On Mon, Feb 02, 2015 at 07:27:10AM +0000, David Woodhouse wrote: > On Sun, 2015-02-01 at 21:07 -0800, David Miller wrote: > > From: David Woodhouse <dwmw2@infradead.org> > > Date: Sun, 01 Feb 2015 21:29:43 +0000 > > > > > I really was looking for some way to push down something like an XFRM > > > state into the tun device and just say "shove them out here until I tell > > > you otherwise". > > > > People decided to use TUN and push VPN stuff back into userspace, > > and there are repercussions for that decision. > > > > I'm not saying this to be mean or whatever, but I was very > > disappointed when userland IPSEC solutions using TUN started showing > > up. > > Yeah. That's a valid criticism of vpnc, certainly. I never did > understand why it reimplemented the IPSec stack. > > For my OpenConnect client it's somewhat more justified though — the > initial data transport there is over TLS, which the kernel doesn't > support. And if we *can* establish UDP communication, that's over DTLS > which the kernel doesn't support either. It's not even the *standard* > version of DTLS because Cisco are still using a pre-RFC4347 version of > the protocol. And we also need to probe the UDP connectivity and do > keepalives and manage the fallback to using the TCP data transport. > > It's not like vpnc where it really is just a case of setting up the ESP > context and letting it run. > > It's only now I've added Juniper support, which uses ESP-in-UDP for the > data transport, that I'm doing something that the kernel supports at > all. And now I'm looking at how to make use of that. > > > We might as well have not have implemented the IPSEC stack at all, > > because as a result of the userland VPN stuff our IPSEC stack is > > largely unused except by a very narrow group of users. > > Well, I'd love to make better use of it if I can. I do suspect it makes > most sense for userspace to continue to manage the probing of UDP > connectivity, and the fallback to TCP mode — and I suspect it also makes > sense to continue to use tun for passing packets up to the VPN client > when it's using the TCP transport. > > So the question would be how we handle redirecting the packet flow to > the optional UDP transport, when the VPN client determines that it's > available. For the sake of the user setting up firewall and routing > rules, I do think it's important that it continues to appear to > userspace as the *same* device for the entire lifetime of the session, > regardless of which transport the packets happen to be using at a given > moment in time. It doesn't *have* to be tun, though. Since you want to provide connectivity over HTTPS which is not possible in kernel space, you are stuck with keeping the tun device. So the packet flow in that case is identical to how e.g. OpenVPN does it: - tunX holds default route - OpenConnect then: - receives packets on /dev/tun - holds TCP socket to VPN concentrator - does encapsulation into TLS Speaking of optimisation, the interesting part is the alternative flow via IPsec in UDP. AFAICT, it should be possible to setup an ESP in UDP tunnel using XFRM (see ip-xfrm(8) for reference), although I didn't try that myself. The funny thing with XFRM is, it applies before the routing decision does: If my IPsec policy matches, the packet goes that way no matter what the routing table says about the original destination. This can be used to override the default route provided via tun0 in the above case. 
Of course, OpenConnect has to manage all the XFRM/policy stuff on it's own, since switching from ESP in UDP back to TLS would mean to tear down the XFRM tunnel. OpenConnect would have to setup (a limited) XFRM and send test traffic to decide whether to set it up fully (if limited) or tear it down (if unlimited) again so traffic arrives at tunX again. In my opinion, this might work. The whole setup is probably about as intuitive as the fact that kernel IPsec tunnel mode does not naturally provide an own interface. Firewall setup on top of that might become a matter of try-and-error. Maybe having a VTI interface and merely moving the default route instead of fiddling with policies all the time might make things a little easier to comprehend, but surely adds some performance overhead. Cheers, Phil
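One practical note on the ESP-in-UDP leg Phil describes: besides the xfrm state/policy pair, the kernel needs the UDP socket itself to be flagged as an encapsulation socket so inbound ESP-in-UDP datagrams are diverted to the XFRM stack instead of being delivered to userspace - the same thing IKE daemons do for NAT-T. A minimal sketch (not OpenConnect code):

```c
#include <sys/socket.h>
#include <netinet/in.h>
#include <linux/udp.h>		/* UDP_ENCAP, UDP_ENCAP_ESPINUDP */

/* Mark an already-connected UDP socket as carrying ESP-in-UDP. */
static int mark_espinudp(int udp_fd)
{
	int encap = UDP_ENCAP_ESPINUDP;

	return setsockopt(udp_fd, IPPROTO_UDP, UDP_ENCAP, &encap, sizeof(encap));
}
```

The outbound direction is driven by the matching xfrm state's encapsulation template (the "encap espinudp" parameters in ip-xfrm terms), so once both sides are installed the tunnel data path runs entirely in the kernel.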
On Mon, 2015-02-02 at 09:24 +0100, Steffen Klassert wrote: > > Maybe you want to use a virtual tunnel interface (vti) what we have > already. Everything that is routed through such an interface is > guaranteed to be either encrypted if a matching xfrm state is present > or dropped. Same on the rceive side, everything that is received by > this interface is guaranteed to be IPsec processed. So you can do > a routing based decision about the IPsec processing. > > While I'm sure it could handle the ESP in UDP encapsulation, I'm not that > sure about your TCP fallback because this requires a valid xfrm state > to allow packets to pass. Using the same interface for both is probably > not possible. I'm trying to imagine how we could make it work in practice if we end up exposing two *different* interfaces and having to change the kernel's routing according to whether we have UDP connectivity at any given moment in time. Given how painful it already is to maintain vpnc-script and make it do the right thing for split-include and split-exclude routing, I'm not really sure I want to go there. Even if we could get such a scheme to work, it would probably also require retaining root privileges to make the changes — and one of the security benefits over the proprietary VPN clients is that we don't *need* to run as root. We can either drop privs after running vpnc-script to do the initial routing setup, or in the NetworkManager case we *never* run with elevated privileges; we just pass the IP/routing information back over DBus to NetworkManager. It occurs to me that for the approach I was thinking about, I wouldn't even need to touch the internals of the tun driver. It could be a separate driver which just uses tun_get_socket(). Userspace could hand it the file descriptors of the tun device and the connected UDP socket, along with the encryption parameters — and then just stop reading packets from the tun device for itself.
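The hook David mentions does exist: tun_get_socket() is exported precisely so another kernel component (today, vhost-net) can take over a tun fd's packet queue. A hypothetical module following that route might resolve a userspace-supplied fd like this (sketch only; grab_tun_socket and its caller are made up):

```c
#include <linux/err.h>
#include <linux/file.h>
#include <linux/if_tun.h>

static struct socket *grab_tun_socket(int fd)
{
	struct file *file = fget(fd);
	struct socket *sock;

	if (!file)
		return ERR_PTR(-EBADF);

	sock = tun_get_socket(file);	/* an ERR_PTR() if this is not a tun/tap file */
	if (IS_ERR(sock))
		fput(file);
	/* On success we keep the file reference; fput() it when detaching. */
	return sock;
}
```

The connected UDP socket fd and the negotiated keys would be handed down the same way, keeping the handshake, keepalives and TCP fallback in userspace while the steady-state data path stays in the kernel.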
On Mon, 2015-02-02 at 16:23 +0100, Phil Sutter wrote: > Since you want to provide connectivity over HTTPS which is not possible > in kernel space, you are stuck with keeping the tun device. So the > packet flow in that case is identical to how e.g. OpenVPN does it: > > - tunX holds default route > - OpenConnect then: > - receives packets on /dev/tun > - holds TCP socket to VPN concentrator > - does encapsulation into TLS > > Speaking of optimisation, the interesting part is the alternative flow > via IPsec in UDP. Right. The packet flow you describe is what we already have. Except of course we already *do* establish the UDP connection (which is DTLS when we're talking to a Cisco AnyConnect server, and ESP in UDP when we're talking to Juniper). If we get responses to keepalive packets, we'll send outbound packets over the UDP connection. If the UDP connectivity goes AWOL, we'll fall back to sending on TCP. Rekeying of the UDP connection is handled over the TCP control connection too. Even in the DTLS case, the master secret and session ID are exchanged over TCP and the DTLS is actually done as a 'session resume', without the normal DTLS handshake ever happening. As you say, I'm stuck with keeping the tun device (or something very much like it). This *isn't* like vpnc where I can set up an IPSec config and just let it run. > AFAICT, it should be possible to setup an ESP in UDP > tunnel using XFRM (see ip-xfrm(8) for reference), although I didn't try > that myself. The funny thing with XFRM is, it applies before the routing > decision does: If my IPsec policy matches, the packet goes that way no > matter what the routing table says about the original destination. This > can be used to override the default route provided via tun0 in the above > case. Except it isn't even the default route. We get given a bunch of split includes or split excludes from the VPN server. We pass them to vpnc-script or NetworkManager to actually set the routes up, and those tools may make their own tweaks to what the server requested — denying the default route and setting up explicit routes, or adding firewall rules or NAT to incoming/outgoing packets on the tun device. If it is no longer *just* the single tun device, everything gets really complicated. Even *before* we talk about changing it on the fly during normal operation. > Of course, OpenConnect has to manage all the XFRM/policy stuff on it's > own, since switching from ESP in UDP back to TLS would mean to tear down > the XFRM tunnel. OpenConnect would have to setup (a limited) XFRM and > send test traffic to decide whether to set it up fully (if limited) or > tear it down (if unlimited) again so traffic arrives at tunX again. Right. And ideally without CAP_NET_ADMIN. > In my opinion, this might work. The whole setup is probably about as > intuitive as the fact that kernel IPsec tunnel mode does not naturally > provide an own interface. Firewall setup on top of that might become a > matter of try-and-error. Maybe having a VTI interface and merely moving > the default route instead of fiddling with policies all the time might > make things a little easier to comprehend, but surely adds some > performance overhead. I think even the latter is sufficiently complex to manage that it's not worth pursuing. I may throw together my suggested hack using tun_get_socket() and see how much it makes *me* barf before deciding whether to show it here for more feedback :)
From: David Woodhouse <dwmw2@infradead.org> Date: Mon, 02 Feb 2015 07:27:10 +0000 > I'm guessing you don't want to push the *whole* management of the TLS > control connection *and* the UDP transport, and probing the latter with > keepalives, into the kernel? I certainly don't :) Whilst Herbert Xu and I have discussed in the past supporting automatic SSL handling of socket data during socket writes in the kernel, doing TLS stuff would be a bit of a stretch :-)
On Tue, 2015-02-03 at 16:19 -0800, David Miller wrote: > From: David Woodhouse <dwmw2@infradead.org> > Date: Mon, 02 Feb 2015 07:27:10 +0000 > > > I'm guessing you don't want to push the *whole* management of the TLS > > control connection *and* the UDP transport, and probing the latter with > > keepalives, into the kernel? I certainly don't :) > > Whilst Herbert Xu and I have discussed in the past supporting > automatic SSL handling of socket data during socket writes in the > kernel, doing TLS stuff would be a bit of a stretch :-) Right. For the DTLS I was thinking we'd do the handshake in userspace and then hand the UDP socket down. At that point it's basically the same as ESP with the bytes in a slightly different place. So I really am looking at an option for "here's a UDP socket to send those tun packets out on, with <this> encryption setup" as the sanest plan I can come up with.
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 96c39bd..4326520 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -387,6 +387,10 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
 		}
 	}
 
+	/* Orphan the skb - required as we might hang on to it
+	 * for indefinite time. */
+	skb_orphan(skb);
+
 	/* Enqueue packet */
 	skb_queue_tail(&tun->socket.sk->sk_receive_queue, skb);
 	dev->trans_start = jiffies;