diff mbox series

udp: Allow to defer reception until connect() happened

Message ID 20181129015626.85441-1-cpaasch@apple.com
State Rejected, archived
Delegated to: David Miller
Headers show
Series udp: Allow to defer reception until connect() happened | expand

Commit Message

Christoph Paasch Nov. 29, 2018, 1:56 a.m. UTC
There are use-cases where a host wants to use a UDP socket with a
specific 4-tuple. The way to do this is to bind() and then connect() the
socket. However, after the bind(), the socket starts receiving data even
if it does not match the intended 4-tuple. That is because after the
bind() UDP-socket will match in the lookup for all incoming UDP-traffic
that has the specific IP/port.

This patch prevents any incoming traffic until the connect() system-call
is called whenever the app sets the UDP socket-option
UDP_WAIT_FOR_CONNECT.

Signed-off-by: Christoph Paasch <cpaasch@apple.com>
---

Notes:
    Changes compared to the original RFC-submission:
    
    * Make it a UDP-specific socket-option
    * Rename it to 'wait-for-connect'
    
    Wrt to the discussion on the original RFC submission
    (cfr., https://marc.info/?l=linux-netdev&m=154102843910587&w=2):
    
    We believe that this patch is still useful to enable applications to use
    different models of implementing a UDP-server. For some frameworks it is much
    easier to have a socket per-connection and thus let the de-multiplexing happen
    in the kernel instead of the app.

 include/linux/udp.h      |  5 ++++-
 include/net/udp.h        |  1 +
 include/uapi/linux/udp.h |  1 +
 net/ipv4/udp.c           | 26 +++++++++++++++++++++++++-
 net/ipv4/udplite.c       |  2 +-
 net/ipv6/udp.c           | 15 ++++++++++++++-
 net/ipv6/udp_impl.h      |  1 +
 net/ipv6/udplite.c       |  2 +-
 8 files changed, 48 insertions(+), 5 deletions(-)

Comments

Eric Dumazet Nov. 29, 2018, 2:19 a.m. UTC | #1
On Wed, Nov 28, 2018 at 5:57 PM Christoph Paasch <cpaasch@apple.com> wrote:
>
> There are use-cases where a host wants to use a UDP socket with a
> specific 4-tuple. The way to do this is to bind() and then connect() the
> socket. However, after the bind(), the socket starts receiving data even
> if it does not match the intended 4-tuple. That is because after the
> bind() UDP-socket will match in the lookup for all incoming UDP-traffic
> that has the specific IP/port.
>
> This patch prevents any incoming traffic until the connect() system-call
> is called whenever the app sets the UDP socket-option
> UDP_WAIT_FOR_CONNECT.

Please do not add something that could mislead applications writers to
think UDP stack can scale.

UDP stack does not have a full hash on 4-tuples, it means that
incoming traffic on a 'shared port' has
to scan a list of XXX sockets to find the best match ...

Also you add another cache line miss in UDP lookup to access
udp_sk()->wait_for_connect.

recvfrom() can be used to filter whatever frame that came before the connect()
Eric Dumazet Nov. 29, 2018, 3:15 a.m. UTC | #2
On Wed, Nov 28, 2018 at 7:09 PM Jana Iyengar <jri.ietf@gmail.com> wrote:
>
>
> On Wed, Nov 28, 2018 at 6:19 PM Eric Dumazet <edumazet@google.com> wrote:
>>
>> On Wed, Nov 28, 2018 at 5:57 PM Christoph Paasch <cpaasch@apple.com> wrote:
>> >
>> > There are use-cases where a host wants to use a UDP socket with a
>> > specific 4-tuple. The way to do this is to bind() and then connect() the
>> > socket. However, after the bind(), the socket starts receiving data even
>> > if it does not match the intended 4-tuple. That is because after the
>> > bind() UDP-socket will match in the lookup for all incoming UDP-traffic
>> > that has the specific IP/port.
>> >
>> > This patch prevents any incoming traffic until the connect() system-call
>> > is called whenever the app sets the UDP socket-option
>> > UDP_WAIT_FOR_CONNECT.
>>
>> Please do not add something that could mislead applications writers to
>> think UDP stack can scale.
>
>
>> UDP stack does not have a full hash on 4-tuples, it means that
>> incoming traffic on a 'shared port' has
>> to scan a list of XXX sockets to find the best match ...
>
>
>> Also you add another cache line miss in UDP lookup to access
>> udp_sk()->wait_for_connect.
>>
>> recvfrom() can be used to filter whatever frame that came before the connect()
>
>
> I don't think I understand that argument -- connect() is supported for UDP sockets, and UDP sockets are being used for serving QUIC traffic. Are you suggesting that connect() never be used?

If the source port is not shared, Christoph patch is not needed.

If it is shared, then a whole can of worm is opened.

Trying to hack UDP stack while it is not fully 4-tuple ready is not
going to fly.
Christoph Paasch Nov. 29, 2018, 10:47 p.m. UTC | #3
Hello,

On 28/11/18 - 19:15:12, Eric Dumazet wrote:
> On Wed, Nov 28, 2018 at 7:09 PM Jana Iyengar <jri.ietf@gmail.com> wrote:
> >
> >
> > On Wed, Nov 28, 2018 at 6:19 PM Eric Dumazet <edumazet@google.com> wrote:
> >>
> >> On Wed, Nov 28, 2018 at 5:57 PM Christoph Paasch <cpaasch@apple.com> wrote:
> >> >
> >> > There are use-cases where a host wants to use a UDP socket with a
> >> > specific 4-tuple. The way to do this is to bind() and then connect() the
> >> > socket. However, after the bind(), the socket starts receiving data even
> >> > if it does not match the intended 4-tuple. That is because after the
> >> > bind() UDP-socket will match in the lookup for all incoming UDP-traffic
> >> > that has the specific IP/port.
> >> >
> >> > This patch prevents any incoming traffic until the connect() system-call
> >> > is called whenever the app sets the UDP socket-option
> >> > UDP_WAIT_FOR_CONNECT.
> >>
> >> Please do not add something that could mislead applications writers to
> >> think UDP stack can scale.
> >
> >
> >> UDP stack does not have a full hash on 4-tuples, it means that
> >> incoming traffic on a 'shared port' has
> >> to scan a list of XXX sockets to find the best match ...
> >
> >
> >> Also you add another cache line miss in UDP lookup to access
> >> udp_sk()->wait_for_connect.

Fair enough. We could add the socket later to the hash (see below).

> >>
> >> recvfrom() can be used to filter whatever frame that came before the connect()
> >
> >
> > I don't think I understand that argument -- connect() is supported for UDP sockets, and UDP sockets are being used for serving QUIC traffic. Are you suggesting that connect() never be used?
> 
> If the source port is not shared, Christoph patch is not needed.
> 
> If it is shared, then a whole can of worm is opened.
> 
> Trying to hack UDP stack while it is not fully 4-tuple ready is not
> going to fly.

Indeed, the UDP-stack is not fully 4-tuple ready.


What are your thoughts on getting it there?

Should be doable by simply using a similar approach as TCP, no? Any caveats
you see with that?

Then, when one sets the "wait-for-connect"-flag we would add the socket to
the hash-table only at connect()-time also addressing the cache-line miss
you mentioned above.


Thanks,
Christoph
Eric Dumazet Nov. 29, 2018, 10:59 p.m. UTC | #4
On Thu, Nov 29, 2018 at 2:47 PM Christoph Paasch <cpaasch@apple.com> wrote:

> Indeed, the UDP-stack is not fully 4-tuple ready.
>
>
> What are your thoughts on getting it there?

This would request an additional lookup, and heavy duty servers using
non connected sockets
would pay the price for an extra lookup for each incoming packets.

DNS servers and QUIC servers would not like that, since they have better use
of a single (unconnected) UDP socket per cpu/thread.


>
> Should be doable by simply using a similar approach as TCP, no? Any caveats
> you see with that?
>
> Then, when one sets the "wait-for-connect"-flag we would add the socket to
> the hash-table only at connect()-time also addressing the cache-line miss
> you mentioned above.

Sure.
Christoph Paasch Dec. 3, 2018, 4:45 p.m. UTC | #5
On 29/11/18 - 14:59:36, Eric Dumazet wrote:
> On Thu, Nov 29, 2018 at 2:47 PM Christoph Paasch <cpaasch@apple.com> wrote:
> 
> > Indeed, the UDP-stack is not fully 4-tuple ready.
> >
> >
> > What are your thoughts on getting it there?
> 
> This would request an additional lookup, and heavy duty servers using
> non connected sockets
> would pay the price for an extra lookup for each incoming packets.

Indeed.


But I'm sure we can do it such that those servers don't pay a cost.

E.g., with a jump-label that would only be set when one does a "wait-for-connect"
and only these sockets end up in that 4-tuple hash-table.

> DNS servers and QUIC servers would not like that, since they have better use
> of a single (unconnected) UDP socket per cpu/thread.

I'm actually wondering, how are you doing Quic connection migration when
having a single unconnected UDP socket per CPU? That would involve quite
some cross-CPU communication, no?

Except if something else is steering the different UDP-flows that belong to the
same Quic connection to the right CPU...


Christoph



> 
> 
> >
> > Should be doable by simply using a similar approach as TCP, no? Any caveats
> > you see with that?
> >
> > Then, when one sets the "wait-for-connect"-flag we would add the socket to
> > the hash-table only at connect()-time also addressing the cache-line miss
> > you mentioned above.
> 
> Sure.
Eric Dumazet Dec. 3, 2018, 4:50 p.m. UTC | #6
On Mon, Dec 3, 2018 at 8:45 AM Christoph Paasch <cpaasch@apple.com> wrote:
>
> On 29/11/18 - 14:59:36, Eric Dumazet wrote:
> > On Thu, Nov 29, 2018 at 2:47 PM Christoph Paasch <cpaasch@apple.com> wrote:
> >
> > > Indeed, the UDP-stack is not fully 4-tuple ready.
> > >
> > >
> > > What are your thoughts on getting it there?
> >
> > This would request an additional lookup, and heavy duty servers using
> > non connected sockets
> > would pay the price for an extra lookup for each incoming packets.
>
> Indeed.
>
>
> But I'm sure we can do it such that those servers don't pay a cost.
>
> E.g., with a jump-label that would only be set when one does a "wait-for-connect"
> and only these sockets end up in that 4-tuple hash-table.

That will work only for benchmarks.
Typical servers have a variety of concurrent tasks.

jump labels are kind of doomed with network namespaces,
and we cant really enable/disable them from a non privileged application,
since this slows down the whole machine.

>
> > DNS servers and QUIC servers would not like that, since they have better use
> > of a single (unconnected) UDP socket per cpu/thread.
>
> I'm actually wondering, how are you doing Quic connection migration when
> having a single unconnected UDP socket per CPU? That would involve quite
> some cross-CPU communication, no?

Maybe a small percentage of packets do need this, so we do not care.

>
> Except if something else is steering the different UDP-flows that belong to the
> same Quic connection to the right CPU...
>
>
> Christoph
>
>
>
> >
> >
> > >
> > > Should be doable by simply using a similar approach as TCP, no? Any caveats
> > > you see with that?
> > >
> > > Then, when one sets the "wait-for-connect"-flag we would add the socket to
> > > the hash-table only at connect()-time also addressing the cache-line miss
> > > you mentioned above.
> >
> > Sure.
diff mbox series

Patch

diff --git a/include/linux/udp.h b/include/linux/udp.h
index 2725c83395bf..9a715d25ce36 100644
--- a/include/linux/udp.h
+++ b/include/linux/udp.h
@@ -55,7 +55,10 @@  struct udp_sock {
 					   * different encapsulation layer set
 					   * this
 					   */
-			 gro_enabled:1;	/* Can accept GRO packets */
+			 gro_enabled:1,	/* Can accept GRO packets */
+			 wait_for_connect:1;/* Wait until app calls connect()
+					     * before accepting incoming data
+					     */
 	/*
 	 * Following member retains the information to create a UDP header
 	 * when the socket is uncorked.
diff --git a/include/net/udp.h b/include/net/udp.h
index fd6d948755c8..b7467a4129b3 100644
--- a/include/net/udp.h
+++ b/include/net/udp.h
@@ -285,6 +285,7 @@  int udp_get_port(struct sock *sk, unsigned short snum,
 				  const struct sock *));
 int udp_err(struct sk_buff *, u32);
 int udp_abort(struct sock *sk, int err);
+int udp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len);
 int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len);
 int udp_push_pending_frames(struct sock *sk);
 void udp_flush_pending_frames(struct sock *sk);
diff --git a/include/uapi/linux/udp.h b/include/uapi/linux/udp.h
index 30baccb6c9c4..b61f8e8dd80b 100644
--- a/include/uapi/linux/udp.h
+++ b/include/uapi/linux/udp.h
@@ -34,6 +34,7 @@  struct udphdr {
 #define UDP_NO_CHECK6_RX 102	/* Disable accpeting checksum for UDP6 */
 #define UDP_SEGMENT	103	/* Set GSO segmentation size */
 #define UDP_GRO		104	/* This socket can receive UDP GRO packets */
+#define UDP_WAIT_FOR_CONNECT	105	/* Don't accept incoming data until the app calls connect() */
 
 /* UDP encapsulation types */
 #define UDP_ENCAP_ESPINUDP_NON_IKE	1 /* draft-ietf-ipsec-nat-t-ike-00/01 */
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index aff2a8e99e01..c5adafcfd52f 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -407,6 +407,9 @@  static int compute_score(struct sock *sk, struct net *net,
 		return -1;
 	score += 4;
 
+	if (udp_sk(sk)->wait_for_connect)
+		return -1;
+
 	if (sk->sk_incoming_cpu == raw_smp_processor_id())
 		score++;
 	return score;
@@ -2601,6 +2604,10 @@  int udp_lib_setsockopt(struct sock *sk, int level, int optname,
 		release_sock(sk);
 		break;
 
+	case UDP_WAIT_FOR_CONNECT:
+		up->wait_for_connect = valbool;
+		break;
+
 	/*
 	 * 	UDP-Lite's partial checksum coverage (RFC 3828).
 	 */
@@ -2695,6 +2702,10 @@  int udp_lib_getsockopt(struct sock *sk, int level, int optname,
 		val = up->gso_size;
 		break;
 
+	case UDP_WAIT_FOR_CONNECT:
+		val = up->wait_for_connect;
+		break;
+
 	/* The following two cannot be changed on UDP sockets, the return is
 	 * always 0 (which corresponds to the full checksum coverage of UDP). */
 	case UDPLITE_SEND_CSCOV:
@@ -2779,12 +2790,25 @@  int udp_abort(struct sock *sk, int err)
 }
 EXPORT_SYMBOL_GPL(udp_abort);
 
+int udp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
+{
+	int ret;
+
+	ret = ip4_datagram_connect(sk, uaddr, addr_len);
+
+	if (!ret)
+		udp_sk(sk)->wait_for_connect = 0;
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(udp_v4_connect);
+
 struct proto udp_prot = {
 	.name			= "UDP",
 	.owner			= THIS_MODULE,
 	.close			= udp_lib_close,
 	.pre_connect		= udp_pre_connect,
-	.connect		= ip4_datagram_connect,
+	.connect		= udp_v4_connect,
 	.disconnect		= udp_disconnect,
 	.ioctl			= udp_ioctl,
 	.init			= udp_init_sock,
diff --git a/net/ipv4/udplite.c b/net/ipv4/udplite.c
index 39c7f17d916f..22c74822ff30 100644
--- a/net/ipv4/udplite.c
+++ b/net/ipv4/udplite.c
@@ -41,7 +41,7 @@  struct proto 	udplite_prot = {
 	.name		   = "UDP-Lite",
 	.owner		   = THIS_MODULE,
 	.close		   = udp_lib_close,
-	.connect	   = ip4_datagram_connect,
+	.connect	   = udp_v4_connect,
 	.disconnect	   = udp_disconnect,
 	.ioctl		   = udp_ioctl,
 	.init		   = udplite_sk_init,
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 09cba4cfe31f..c3a9101380de 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -1681,6 +1681,19 @@  void udp6_proc_exit(struct net *net)
 }
 #endif /* CONFIG_PROC_FS */
 
+int udp_v6_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
+{
+	int ret;
+
+	ret = ip6_datagram_connect(sk, uaddr, addr_len);
+
+	if (!ret)
+		udp_sk(sk)->wait_for_connect = 0;
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(udp_v6_connect);
+
 /* ------------------------------------------------------------------------ */
 
 struct proto udpv6_prot = {
@@ -1688,7 +1701,7 @@  struct proto udpv6_prot = {
 	.owner			= THIS_MODULE,
 	.close			= udp_lib_close,
 	.pre_connect		= udpv6_pre_connect,
-	.connect		= ip6_datagram_connect,
+	.connect		= udp_v6_connect,
 	.disconnect		= udp_disconnect,
 	.ioctl			= udp_ioctl,
 	.init			= udp_init_sock,
diff --git a/net/ipv6/udp_impl.h b/net/ipv6/udp_impl.h
index 5730e6503cb4..14e847b63db9 100644
--- a/net/ipv6/udp_impl.h
+++ b/net/ipv6/udp_impl.h
@@ -14,6 +14,7 @@  int __udp6_lib_err(struct sk_buff *, struct inet6_skb_parm *, u8, u8, int,
 
 int udp_v6_get_port(struct sock *sk, unsigned short snum);
 
+int udp_v6_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len);
 int udpv6_getsockopt(struct sock *sk, int level, int optname,
 		     char __user *optval, int __user *optlen);
 int udpv6_setsockopt(struct sock *sk, int level, int optname,
diff --git a/net/ipv6/udplite.c b/net/ipv6/udplite.c
index a125aebc29e5..035e1b2f2529 100644
--- a/net/ipv6/udplite.c
+++ b/net/ipv6/udplite.c
@@ -38,7 +38,7 @@  struct proto udplitev6_prot = {
 	.name		   = "UDPLITEv6",
 	.owner		   = THIS_MODULE,
 	.close		   = udp_lib_close,
-	.connect	   = ip6_datagram_connect,
+	.connect	   = udp_v6_connect,
 	.disconnect	   = udp_disconnect,
 	.ioctl		   = udp_ioctl,
 	.init		   = udplite_sk_init,