diff mbox

ip: find correct route for socket which is not bound to a device

Message ID 1442385255-27014-1-git-send-email-wen.gang.wang@oracle.com
State Changes Requested, archived
Delegated to: David Miller
Headers show

Commit Message

Wengang Wang Sept. 16, 2015, 6:34 a.m. UTC
For multi-cast, we should find valid route(thus get the meaniful pmtu) for
the package on the socket which is not bound to a device(sk_bound_dev_if
being 0) too.

From man page of socket(7)

       SO_BINDTODEVICE
		Bind this socket to a particular device like “eth0”, as
		specified in the passed interface name.  If the name is an
		empty string or the option length is zero, the socket
		device binding is removed. The  passed  option is  a
		variable-length null-terminated interface name string with
		the maximum size of IFNAMSIZ.  If a socket is bound to an
		interface, only packets received from that particular
		interface are processed by the socket. Note that this works
		only for some socket types, particularly AF_INET sockets.
		It is not supported for packet sockets (use normal bind(2)
		there).

The man page doesn't say when socket not bound packages won't be routed.

A problem is hit that all multi-cast packages dropped by kernel(from sender
host). The lower layer is IPoIB with MTU being 7000. And I was sending 4096
length multi-cast  package. In side IPoIB the first send is dropped because
is exeeding the internal package size limitation mcast_mtu which is 2044.
So IPoIB calls ip_rt_update_pmtu (indirectly) trying to set path mtu. A
correct route is configured for the multi-cast, so the setting of pmtu
cucceeded and the next multi-cast package(to the same target) is expected
to succeed(it would be well fragmented accroding to the pmtu I just set).
But actually the second and later multi-cast packages got dropped too. And
the reason is that the neighor looking up(fib_lookup) is skipped because of
the socket is not bound to device(sk_bound_dev_if being 0). After applied
the patch I proposed here, it works fine.

Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
---
 net/ipv4/route.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

Comments

David Miller Sept. 18, 2015, 4:20 a.m. UTC | #1
From: Wengang Wang <wen.gang.wang@oracle.com>
Date: Wed, 16 Sep 2015 14:34:15 +0800

> For multi-cast, we should find valid route(thus get the meaniful pmtu) for
> the package on the socket which is not bound to a device(sk_bound_dev_if
> being 0) too.

Your patch breaks exactly the situation explained in full detail
in the huge comment about the first change you are making:

			/* Special hack: user can direct multicasts
			   and limited broadcast via necessary interface
			   without fiddling with IP_MULTICAST_IF or IP_PKTINFO.
			   This hack is not just for fun, it allows
			   vic,vat and friends to work.
			   They bind socket to loopback, set ttl to zero
			   and expect that it will work.
			   From the viewpoint of routing cache they are broken,
			   because we are not allowed to build multicast path
			   with loopback source addr (look, routing cache
			   cannot know, that ttl is zero, so that packet
			   will not leave this host and route is valid).
			   Luckily, this hack is good workaround.
			 */

That situation will now fail after your patch.

So I cannot apply this patch, sorry.

I know what you want, you want to end up with a cached route that
will track the PMTU.  But this hack will break other things at
the same time so is not acceptable.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Wengang Wang Sept. 21, 2015, 2:04 a.m. UTC | #2
Hi David,

Thanks for your review on this.

在 2015年09月18日 12:20, David Miller 写道:
> From: Wengang Wang <wen.gang.wang@oracle.com>
> Date: Wed, 16 Sep 2015 14:34:15 +0800
>
>> For multi-cast, we should find valid route(thus get the meaniful pmtu) for
>> the package on the socket which is not bound to a device(sk_bound_dev_if
>> being 0) too.
> Your patch breaks exactly the situation explained in full detail
> in the huge comment about the first change you are making:
>
> 			/* Special hack: user can direct multicasts
> 			   and limited broadcast via necessary interface
> 			   without fiddling with IP_MULTICAST_IF or IP_PKTINFO.
> 			   This hack is not just for fun, it allows
> 			   vic,vat and friends to work.
> 			   They bind socket to loopback, set ttl to zero
> 			   and expect that it will work.
> 			   From the viewpoint of routing cache they are broken,
> 			   because we are not allowed to build multicast path
> 			   with loopback source addr (look, routing cache
> 			   cannot know, that ttl is zero, so that packet
> 			   will not leave this host and route is valid).
> 			   Luckily, this hack is good workaround.
> 			 */
>
> That situation will now fail after your patch.

Yes, I noticed the above comment before I made the patch. I have 
question regarding that
comment: Seems it's talking about the loopback source address(loopback 
device interface
is with the index of 1). Do you think we can make something specific to 
loopback device and
let the package with other outbound devices go through the routing cache?


> So I cannot apply this patch, sorry.
>
> I know what you want, you want to end up with a cached route that
> will track the PMTU.
Yes, that's exactly what I want.
Actually the user space can work fine by adding a syscall 
setsockopt(..., SO_BINDTODEVICE, ..) to
bind socket to the device. Well, the problem here is that there are a 
lot of user space applications which
don't bind sockets and they are working fine with old kernel(2.6.39). 
When moving to new kernel(4.x),
the problem appeared. And seems it's hard for force all of them change 
to bind sockets and recompile.
  -- that's the pain.  Do you have any idea which won't break the 
existing hack but can help with my case?

thanks,
wengang
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller Sept. 21, 2015, 4:22 a.m. UTC | #3
From: Wengang Wang <wen.gang.wang@oracle.com>
Date: Mon, 21 Sep 2015 10:04:59 +0800

>  Do you have any idea which won't break the existing
>  -- hack but can help with my case?

Although I am obligated to find deficiencies and shortcomings in your
submissions, I am not obligated to design or implement your changes
correctly for you, sorry.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 5f4a556..032481a 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -2097,7 +2097,7 @@  struct rtable *__ip_route_output_key(struct net *net, struct flowi4 *fl4)
 			 */
 
 			fl4->flowi4_oif = dev_out->ifindex;
-			goto make_route;
+			goto lookup;
 		}
 
 		if (!(fl4->flowi4_flags & FLOWI_FLAG_ANYSRC)) {
@@ -2153,6 +2153,7 @@  struct rtable *__ip_route_output_key(struct net *net, struct flowi4 *fl4)
 		goto make_route;
 	}
 
+lookup:
 	if (fib_lookup(net, fl4, &res, 0)) {
 		res.fi = NULL;
 		res.table = NULL;