diff mbox

netlink: trim skb to exact size to avoid MSG_TRUNC

Message ID 1444698923-13992-1-git-send-email-ronen.arad@intel.com
State Changes Requested, archived
Delegated to: David Miller
Headers show

Commit Message

Arad, Ronen Oct. 13, 2015, 1:15 a.m. UTC
The available room in the skb allocated in netlink_dump for iproute2
show requests (e.g. "ip link [show]", "bridge [-c] vlan show") should
be trimmed to the exact size requested in order to avoid MSG_TRUNC flag
set in netlink_recvmg.
This was handled properly for small skb allocated when no interface has
many VLANs configured. This patch applies the same logic to larger skbs
which are allocated using the calculated min_dump_alloc size.

Signed-off-by: Ronen Arad <ronen.arad@intel.com>
---
 net/netlink/af_netlink.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

Comments

Thomas Graf Oct. 13, 2015, 8:56 a.m. UTC | #1
On 10/12/15 at 06:15pm, Ronen Arad wrote:
> The available room in the skb allocated in netlink_dump for iproute2
> show requests (e.g. "ip link [show]", "bridge [-c] vlan show") should
> be trimmed to the exact size requested in order to avoid MSG_TRUNC flag
> set in netlink_recvmg.
> This was handled properly for small skb allocated when no interface has
> many VLANs configured. This patch applies the same logic to larger skbs
> which are allocated using the calculated min_dump_alloc size.
> 
> Signed-off-by: Ronen Arad <ronen.arad@intel.com>

Wouldn't this imply a bug in which rtnl_calcit() does not account for
some data that is later dumped? How else could the skb end up being
larger than what alloc_size accounts for at this point unless we end
up stuffing 2x smallish messages into the padded projection of the
calculated maximum message size of that type. Is that what you are
seeing?

Regardless of that, alloc_size is likely larger than nlk->max_recvmsg_len
anyway at this point so unless the reader suddenly provides a different
message size or does peeking it will likely still be truncated.

I'm just trying to understand which exact case you are solving here.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Arad, Ronen Oct. 13, 2015, 5:52 p.m. UTC | #2
>-----Original Message-----
>From: Thomas Graf [mailto:tgraf@suug.ch]
>Sent: Tuesday, October 13, 2015 1:56 AM
>To: Arad, Ronen
>Cc: netdev@vger.kernel.org
>Subject: Re: [PATCH] netlink: trim skb to exact size to avoid MSG_TRUNC
>
>On 10/12/15 at 06:15pm, Ronen Arad wrote:
>> The available room in the skb allocated in netlink_dump for iproute2
>> show requests (e.g. "ip link [show]", "bridge [-c] vlan show") should
>> be trimmed to the exact size requested in order to avoid MSG_TRUNC flag
>> set in netlink_recvmg.
>> This was handled properly for small skb allocated when no interface has
>> many VLANs configured. This patch applies the same logic to larger skbs
>> which are allocated using the calculated min_dump_alloc size.
>>
>> Signed-off-by: Ronen Arad <ronen.arad@intel.com>
>
>Wouldn't this imply a bug in which rtnl_calcit() does not account for
>some data that is later dumped?
[@Ronen] rtnl_calcit() is not bug-free. It is not, however, the direct
cause of the problem this patch intends to solve.
rtnl_calcit() overestimates the min_alloc_size. The space for VLAN_INFO for
the maximum number ov VLANs on any interface is always added to
min_alloc_size even when the dump request does not specify VLAN or
compressed VLANs. The overestimation is because if_nlmsg_size() does not
pass ext_filter_mask to  rtnl_link_get_af_size() or rtnl_link_get_size().
(ext_filter_mask is passed to rtnl_vfinfo_size() and rtnl_port_size())
 
>How else could the skb end up being
>larger than what alloc_size accounts for at this point unless we end
>up stuffing 2x smallish messages into the padded projection of the
>calculated maximum message size of that type.
[@Ronen] The skb size (i.e. the tailroom of a newly allocated skb) is
greater than the argument passed to netlink_skb_alloc(). I didn't fully
looked at the allocation mechanism (there could be multiple ways and
netlink skbs could be memory mapped)). My understanding is that this is
expected as indicated by the comment in the code.

Netlink dump by design attempts to pack as many smallish messages into the
same skb to minimize the number of parts of a multi-part dump response.
The min_alloc_size is intended to guarantee drivers/modules sufficient
space for the largest dump message of a single netdev. 

>Is that what you are
>seeing?

[@Ronen] My initial observation was that with many VLANs configured, the 
min_alloc_size is greater than the 16KiB buffer iproute2 uses for both
"ip link show" and "bridge [-c[ompressvlans]] vlan show" commands. This buf
is defined locally within iproute2's lib/libnetlink.c.
Each VLAN_INFO attribute takes 8 bytes. 4094 VLANs requires 32,752 bytes.
Compressed VLANs would need a lot less (at the extreme only 8 bytes for a
single range covering the entire 1-4094 or any other range).

I bumped iproute2 buffer size from 16KiB to 3*16KiB when I first noticed
"Message truncated" error from "ip" and "bridge" commands. I was surprised
when such a large buffer remained insufficient for a system with more
interfaces.

The root cause is that the skb is allocated with more space than requested
and the dump is allowed to use this extra space available. In my case I
observed skb allocated with total space of 65,216 bytes when 34,420
were requested.
 
>
>Regardless of that, alloc_size is likely larger than nlk->max_recvmsg_len
>anyway at this point so unless the reader suddenly provides a different
>message size or does peeking it will likely still be truncated.
>
[@Ronen] My reader as I described above is providing a larger message
which I'm trying to properly size. I'm aware that libnl shields
applications from the need to know and provide properly sized buffer by
peeking or/and re-allocating a buffer.
My issue is with iproute2 "ip link show" and "bridge vlan show" commands.
 
>I'm just trying to understand which exact case you are solving here.
[@Ronen] My patch applies the same logic that is used when allocation is
done by nlk->max_recvmsg_len to allocation that is done by min_alloc_size.
The code could be clearer like that:

	/* NLMSG_GOODSIZE is small to avoid high order allocations being
	 * required, but it makes sense to _attempt_ a 16K bytes allocation
	 * to reduce number of system calls on dump operations, if user
	 * ever provided a big enough buffer.
	 */
	cb = &nlk->cb;
	alloc_min_size = max_t(int, cb->min_dump_alloc, NLMSG_GOODSIZE);

	if (alloc_min_size < nlk->max_recvmsg_len) {
		alloc_size = nlk->max_recvmsg_len;
		skb = netlink_alloc_skb(sk, alloc_size, nlk->portid,
					GFP_KERNEL |
					__GFP_NOWARN |
					__GFP_NORETRY);
	}
	if (!skb) {
		alloc_size = alloc_min_size;
		skb = netlink_alloc_skb(sk, alloc_size, nlk->portid,
					GFP_KERNEL);
	}
	if (!skb)
		goto errout_skb;
	/* available room should be exact amount to avoid MSG_TRUNC */
      skb_reserve(skb, skb_tailroom(skb) - alloc_size);

Allocation is always performed by alloc_size which could be
nlk->max_recvmsg_len (only when min_dump_alloc is sufficiently small) and
upon failure falling back to alloc_min_size.
The trimming of the skb space is common regardless of the allocation call.
I tried to submit the minimal patch to address the issue. If you think the
Re-organized code is better I can re-submit a V2.


 
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Thomas Graf Oct. 14, 2015, 7:44 a.m. UTC | #3
On 10/13/15 at 05:52pm, Arad, Ronen wrote:
> [@Ronen] My reader as I described above is providing a larger message
> which I'm trying to properly size. I'm aware that libnl shields
> applications from the need to know and provide properly sized buffer by
> peeking or/and re-allocating a buffer.
> My issue is with iproute2 "ip link show" and "bridge vlan show" commands.

>  
> >I'm just trying to understand which exact case you are solving here.
> Allocation is always performed by alloc_size which could be
> nlk->max_recvmsg_len (only when min_dump_alloc is sufficiently small) and
> upon failure falling back to alloc_min_size.
> The trimming of the skb space is common regardless of the allocation call.
> I tried to submit the minimal patch to address the issue. If you think the
> Re-organized code is better I can re-submit a V2.

I was about to suggest the same code change after initial discussion ;-)

So you are fixing the case where >2x messages fit the padded skb size.
This was not clear from the commit message. I would appreciate a note
in the commit message and updated code comment to reflect this.

The fix is definitely not incorrect and the penalty for readers which
peek first is less than I thought since nlk->max_recvmsg_len is at
least 16K in size. Since most peekers will double buffer sizes they
will most likely end up growing nlk->max_recvmsg_len after the first
read.

However, if alloc_size is > 16K, we would have typically ended up with a
giant skb which peeking users were able to take advantage of while
with this fix this is no longer the case.

However #2, I'll see if it makes sense to look at MSG_PEEK in recvmsg
and change nlk->max_recvmsg_len accordingly so we take advantage of
the full skb size on sockets which perform peeking. Given that both
reader behaviours can be preserved, I'm good with your proposed v2.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Arad, Ronen Oct. 15, 2015, 6:41 a.m. UTC | #4
>-----Original Message-----
>From: netdev-owner@vger.kernel.org [mailto:netdev-owner@vger.kernel.org] On
>Behalf Of Thomas Graf
>Sent: Wednesday, October 14, 2015 12:45 AM
>To: Arad, Ronen
>Cc: netdev@vger.kernel.org
>Subject: Re: [PATCH] netlink: trim skb to exact size to avoid MSG_TRUNC
>
>On 10/13/15 at 05:52pm, Arad, Ronen wrote:
>> [@Ronen] My reader as I described above is providing a larger message
>> which I'm trying to properly size. I'm aware that libnl shields
>> applications from the need to know and provide properly sized buffer by
>> peeking or/and re-allocating a buffer.
>> My issue is with iproute2 "ip link show" and "bridge vlan show" commands.
>
>>
>> >I'm just trying to understand which exact case you are solving here.
>> Allocation is always performed by alloc_size which could be
>> nlk->max_recvmsg_len (only when min_dump_alloc is sufficiently small) and
>> upon failure falling back to alloc_min_size.
>> The trimming of the skb space is common regardless of the allocation call.
>> I tried to submit the minimal patch to address the issue. If you think the
>> Re-organized code is better I can re-submit a V2.
>
>I was about to suggest the same code change after initial discussion ;-)
>
>So you are fixing the case where >2x messages fit the padded skb size.
[@Ronen] I'm not sure I understand this statement. I'm fixing the padding
of the skb such that reader could have reasonable buffer size based on the
largest netdev. It is just happened that the skb allocation was about
double the size. Probably because the allocation was some kind of power
of 2 and the requested size was slightly above the next lower power of 2.

On a separate patch titled 
[PATCH net-next v3] netlink: Rightsize IFLA_AF_SPEC size calculation
I'm reducing the over-estimation of the buffer size for "ip link"
requests. It turned out that VLAN information space was added to
unrelated dump requests since ext_filter_mask was not passed to
rtnl_link_get_af_size().
The "rightsizing" patch also reduces the buffer size of compressed VLANs
dump. Non-compressed VLANs dump will continue to require more than the
16KiB buffer size from somewhere around 1800 VLANs and above (based on
8 bytes per VLAN plus other attributes consuming about 1700 bytes).
Using the same logic the full range of 4094 VLANs uncompressed would take
Roughly 34500 bytes. It looks like a "safe" iproute2 statically sized
Buffer would have to be about 36000 bytes or so.

>This was not clear from the commit message. I would appreciate a note
>in the commit message and updated code comment to reflect this.
>
>The fix is definitely not incorrect and the penalty for readers which
>peek first is less than I thought since nlk->max_recvmsg_len is at
>least 16K in size. Since most peekers will double buffer sizes they
>will most likely end up growing nlk->max_recvmsg_len after the first
>read.
[@Ronen] nlk->max_recvmsg_len is actually capped at 16KiB.
/* Record the max length of recvmsg() calls for future allocations */
nlk->max_recvmsg_len = max(nlk->max_recvmsg_len, len);
nlk->max_recvmsg_len = min_t(size_t, nlk->max_recvmsg_len, 16384);
>
>However, if alloc_size is > 16K, we would have typically ended up with a
>giant skb which peeking users were able to take advantage of while
>with this fix this is no longer the case.
[@Ronen] As I noted above, peeking reader could only enjoy saving up to 
16KiB.
>
>However #2, I'll see if it makes sense to look at MSG_PEEK in recvmsg
>and change nlk->max_recvmsg_len accordingly so we take advantage of
>the full skb size on sockets which perform peeking. Given that both
>reader behaviours can be preserved, I'm good with your proposed v2.
[@Ronen] I'll submit by suggested v2.
>--
>To unsubscribe from this list: send the line "unsubscribe netdev" in
>the body of a message to majordomo@vger.kernel.org
>More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index 8f060d7..d628253 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -2817,9 +2817,13 @@  static int netlink_dump(struct sock *sk)
 			skb_reserve(skb, skb_tailroom(skb) -
 					 nlk->max_recvmsg_len);
 	}
-	if (!skb)
+	if (!skb) {
 		skb = netlink_alloc_skb(sk, alloc_size, nlk->portid,
 					GFP_KERNEL);
+		/* available room should be exact amount to avoid MSG_TRUNC */
+		if (skb)
+			skb_reserve(skb, skb_tailroom(skb) - alloc_size);
+	}
 	if (!skb)
 		goto errout_skb;
 	netlink_skb_set_owner_r(skb, sk);