netlink: enable skb header refcounting before sending first broadcast

Message ID 20150710115141.12980.88829.stgit@buzz
State Changes Requested, archived
Delegated to: David Miller

Commit Message

Konstantin Khlebnikov July 10, 2015, 11:51 a.m. UTC
This fixes a race between non-atomic updates of adjacent bit-fields:
the update of skb->cloned could be lost because netlink broadcast clones
the skb after sending it to the first listener, who sets skb->peeked on
the same skb. As a result, atomic refcounting of the skb header stays
disabled and skb_release_data() frees it twice. The race leads to a
double free in kmalloc-xxx.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Fixes: b19372273164 ("net: reorganize sk_buff for faster __copy_skb_header()")
---
 net/netlink/af_netlink.c |    6 ++++++
 1 file changed, 6 insertions(+)


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
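
The lost update described in the commit message can be replayed deterministically in user space. What follows is a minimal sketch, not kernel code: struct fake_skb_flags and replay_lost_update() are hypothetical names, and the layout only assumes that cloned and peeked live in the same storage unit, as the commit message states. The interleaving is replayed by hand in a single thread rather than by racing two CPUs.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for the flag bits of struct sk_buff. Only the field
 * names come from the thread; the layout is illustrative. */
struct fake_skb_flags {
	unsigned int cloned:1;
	unsigned int nohdr:1;
	unsigned int peeked:1;
};

/* Replays the losing interleaving step by step:
 *   1. the receiver loads the whole storage unit,
 *   2. the sender sets cloned (as a clone operation would),
 *   3. the receiver sets peeked in its stale copy and stores it back.
 * If cloned_before_send is non-zero, cloned is set before step 1,
 * which models what the patch does. */
static struct fake_skb_flags replay_lost_update(int cloned_before_send)
{
	struct fake_skb_flags shared;
	struct fake_skb_flags receiver_copy;

	memset(&shared, 0, sizeof(shared));
	if (cloned_before_send)
		shared.cloned = 1;

	receiver_copy = shared;		/* receiver: load */
	shared.cloned = 1;		/* sender: set cloned */
	receiver_copy.peeked = 1;	/* receiver: modify */
	shared = receiver_copy;		/* receiver: store, overwriting cloned */

	return shared;
}
```

Calling replay_lost_update(0) yields flags with peeked set and cloned clear, which is the lost update. Calling replay_lost_update(1) keeps both bits set, because the receiver's stale copy already contains cloned.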

Comments

Eric Dumazet July 10, 2015, 1:49 p.m. UTC | #1
On Fri, 2015-07-10 at 14:51 +0300, Konstantin Khlebnikov wrote:
> This fixes race between non-atomic updates of adjacent bit-fields:
> skb->cloned could be lost because netlink broadcast clones skb after
> sending it to the first listener who sets skb->peeked at the same skb.
> As a result atomic refcounting of skb header stays disabled and
> skb_release_data() frees it twice. Race leads to double-free in kmalloc-xxx.
> 
> Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
> Fixes: b19372273164 ("net: reorganize sk_buff for faster __copy_skb_header()")
> ---
>  net/netlink/af_netlink.c |    6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
> index dea925388a5b..921e0d8dfe3a 100644
> --- a/net/netlink/af_netlink.c
> +++ b/net/netlink/af_netlink.c
> @@ -2028,6 +2028,12 @@ int netlink_broadcast_filtered(struct sock *ssk, struct sk_buff *skb, u32 portid
>  	info.tx_filter = filter;
>  	info.tx_data = filter_data;
>  
> +	/* Enable atomic refcounting in skb_release_data() before first send:
> +	 * non-atomic set of that bit-field in __skb_clone() could race with
> +	 * __skb_recv_datagram() which touches the same set of bit-fields.
> +	 */
> +	skb->cloned = 1;
> +
>  	/* While we sleep in clone, do not allow to change socket list */
>  
>  	netlink_lock_table();

Wow, this is tricky.

I wonder how you found this bug ????

Acked-by: Eric Dumazet <edumazet@google.com>



Konstantin Khlebnikov July 10, 2015, 2:08 p.m. UTC | #2
On 10.07.2015 16:49, Eric Dumazet wrote:
> On Fri, 2015-07-10 at 14:51 +0300, Konstantin Khlebnikov wrote:
>> This fixes race between non-atomic updates of adjacent bit-fields:
>> skb->cloned could be lost because netlink broadcast clones skb after
>> sending it to the first listener who sets skb->peeked at the same skb.
>> As a result atomic refcounting of skb header stays disabled and
>> skb_release_data() frees it twice. Race leads to double-free in kmalloc-xxx.
>>
>> Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
>> Fixes: b19372273164 ("net: reorganize sk_buff for faster __copy_skb_header()")
>> ---
>>   net/netlink/af_netlink.c |    6 ++++++
>>   1 file changed, 6 insertions(+)
>>
>> diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
>> index dea925388a5b..921e0d8dfe3a 100644
>> --- a/net/netlink/af_netlink.c
>> +++ b/net/netlink/af_netlink.c
>> @@ -2028,6 +2028,12 @@ int netlink_broadcast_filtered(struct sock *ssk, struct sk_buff *skb, u32 portid
>>   	info.tx_filter = filter;
>>   	info.tx_data = filter_data;
>>
>> +	/* Enable atomic refcounting in skb_release_data() before first send:
>> +	 * non-atomic set of that bit-field in __skb_clone() could race with
>> +	 * __skb_recv_datagram() which touches the same set of bit-fields.
>> +	 */
>> +	skb->cloned = 1;
>> +
>>   	/* While we sleep in clone, do not allow to change socket list */
>>
>>   	netlink_lock_table();
>
> Wow, this is tricky.
>
> I wonder how you found this bug ????

In some setups the race happens quite often: once or twice per hour.
I guess the main trigger was openvswitch, which generates a lot of
netlink traffic. Debugging was a real pain, though.

>
> Acked-by: Eric Dumazet <edumazet@google.com>
>
>
>
Herbert Xu July 13, 2015, 7:23 a.m. UTC | #3
On Fri, Jul 10, 2015 at 02:51:41PM +0300, Konstantin Khlebnikov wrote:
> This fixes race between non-atomic updates of adjacent bit-fields:
> skb->cloned could be lost because netlink broadcast clones skb after
> sending it to the first listener who sets skb->peeked at the same skb.
> As a result atomic refcounting of skb header stays disabled and
> skb_release_data() frees it twice. Race leads to double-free in kmalloc-xxx.
> 
> Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
> Fixes: b19372273164 ("net: reorganize sk_buff for faster __copy_skb_header()")
> ---
>  net/netlink/af_netlink.c |    6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
> index dea925388a5b..921e0d8dfe3a 100644
> --- a/net/netlink/af_netlink.c
> +++ b/net/netlink/af_netlink.c
> @@ -2028,6 +2028,12 @@ int netlink_broadcast_filtered(struct sock *ssk, struct sk_buff *skb, u32 portid
>  	info.tx_filter = filter;
>  	info.tx_data = filter_data;
>  
> +	/* Enable atomic refcounting in skb_release_data() before first send:
> +	 * non-atomic set of that bit-field in __skb_clone() could race with
> +	 * __skb_recv_datagram() which touches the same set of bit-fields.
> +	 */
> +	skb->cloned = 1;
> +
>  	/* While we sleep in clone, do not allow to change socket list */
>  
>  	netlink_lock_table();

Your effort in finding this bug is wonderful.  However I think
the fix is a bit dirty.

The real issue here is that the recv path no longer handles shared
skbs.  So either we need to fix the recv path to not touch skbs
without cloning them, or we need to get rid of the use of shared
skbs in netlink.

In fact it looks like I introduced the bug way back in

commit a59322be07c964e916d15be3df473fb7ba20c41e
Author: Herbert Xu <herbert@gondor.apana.org.au>
Date:   Wed Dec 5 01:53:40 2007 -0800

    [UDP]: Only increment counter on first peek/recv

I will try to mend this error :)

Cheers,
Eric Dumazet July 13, 2015, 8:05 a.m. UTC | #4
On Mon, 2015-07-13 at 15:23 +0800, Herbert Xu wrote:

> The real issue here is that the recv path no longer handles shared
> skbs.  So either we need to fix the recv path to not touch skbs
> without cloning them, or we need to get rid of the use of shared
> skbs in netlink.
> 
> In fact it looks like I introduced the bug way back in
> 
> commit a59322be07c964e916d15be3df473fb7ba20c41e
> Author: Herbert Xu <herbert@gondor.apana.org.au>
> Date:   Wed Dec 5 01:53:40 2007 -0800
> 
>     [UDP]: Only increment counter on first peek/recv
> 
> I will try to mend this error :)
> 
> Cheers,

Herbert, UDP peek support is very buggy anyway, because of deferred
checksums.

__skb_checksum_complete() will happily manipulate csum, ip_summed,
csum_complete_sw & csum_valid.

Ideally, peek should never touch the skb (apart from skb->users).


Herbert Xu July 13, 2015, 8:10 a.m. UTC | #5
On Mon, Jul 13, 2015 at 10:05:42AM +0200, Eric Dumazet wrote:
>
> Herbert, UDP peek support is very buggy anyway, because of deferred
> checksums
> 
> __skb_checksum_complete() will happily manipulate csum, ip_summed,
> csum_complete_sw & csum_valid
> 
> Ideally, peek should never touch skb (but skb->users)

I think UDP should be OK because the main creator of shared skbs
is af_packet and in that case the IP stack will clone the skb upon
entry.  AFAIK there aren't any entities doing the shared skb trick
within the IP stack.

IOW the UDP stack does not have to worry about shared skbs, unlike
netlink.

Cheers,
Eric Dumazet July 13, 2015, 8:22 a.m. UTC | #6
On Mon, 2015-07-13 at 16:10 +0800, Herbert Xu wrote:
> On Mon, Jul 13, 2015 at 10:05:42AM +0200, Eric Dumazet wrote:
> >
> > Herbert, UDP peek support is very buggy anyway, because of deferred
> > checksums
> > 
> > __skb_checksum_complete() will happily manipulate csum, ip_summed,
> > csum_complete_sw & csum_valid
> > 
> > Ideally, peek should never touch skb (but skb->users)
> 
> I think UDP should be OK because the main creator of shared skbs
> is af_packet and in that case the IP stack will clone the skb upon
> entry.  AFAIK there aren't any entities doing the shared skb trick
> within the IP stack.
> 
> IOW the UDP stack does not have to worry about shared skbs, unlike
> netlink.

It should worry, in case multiple threads are using MSG_PEEK on the
same UDP socket ;)

The problem here is not the producer (might be unicast packets btw),
but multiple 'consumers'.

It turns out your patch would also solve this problem.


Herbert Xu July 13, 2015, 8:25 a.m. UTC | #7
On Mon, Jul 13, 2015 at 10:22:34AM +0200, Eric Dumazet wrote:
>
> It should worry, in case multiple threads are using MSG_PEEK on same udp
> socket ;)

That should be fine because we already hold a spinlock on the
queue.

Cheers,
Eric Dumazet July 13, 2015, 8:28 a.m. UTC | #8
On Mon, Jul 13, 2015 at 10:25 AM, Herbert Xu
<herbert@gondor.apana.org.au> wrote:
> On Mon, Jul 13, 2015 at 10:22:34AM +0200, Eric Dumazet wrote:
>>
>> It should worry, in case multiple threads are using MSG_PEEK on same udp
>> socket ;)
>
> That should be fine because we already hold a spinlock on the
> queue.
>

Except that UDP checksums are checked outside of spinlock protection.
Herbert Xu July 13, 2015, 8:31 a.m. UTC | #9
On Mon, Jul 13, 2015 at 10:28:19AM +0200, Eric Dumazet wrote:
>
> Except that udp checksum are checked outside of spinlock protection.

Good point.  I wonder when this got broken.  I'll do some digging.

Cheers,

Patch

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index dea925388a5b..921e0d8dfe3a 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -2028,6 +2028,12 @@  int netlink_broadcast_filtered(struct sock *ssk, struct sk_buff *skb, u32 portid
 	info.tx_filter = filter;
 	info.tx_data = filter_data;
 
+	/* Enable atomic refcounting in skb_release_data() before first send:
+	 * non-atomic set of that bit-field in __skb_clone() could race with
+	 * __skb_recv_datagram() which touches the same set of bit-fields.
+	 */
+	skb->cloned = 1;
+
 	/* While we sleep in clone, do not allow to change socket list */
 
 	netlink_lock_table();
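
The comment in the patch ties the cloned bit to refcounting in skb_release_data(). A simplified user-space model of that dependency is sketched here. It assumes that the release path consults the shared data reference count only when the cloned flag is set. All names (fake_skb, fake_shared_data, fake_release_data, original_release_frees_shared_data) are hypothetical, a plain int replaces the kernel's atomic counter, and the allocator is not modelled, so the sketch shows the premature free of still-referenced data rather than the resulting double free in kmalloc itself.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, heavily simplified model; not the kernel's structures. */
struct fake_shared_data {
	int dataref;	/* plain int here; the kernel uses an atomic counter */
	bool freed;	/* records that the data block was released */
};

struct fake_skb {
	bool cloned;
	struct fake_shared_data *data;
};

/* Models the assumed shape of the release path: the reference count is only
 * decremented and checked when the cloned flag is set. Without the flag the
 * data is treated as exclusively owned and released immediately. */
static void fake_release_data(struct fake_skb *skb)
{
	if (skb->cloned && --skb->data->dataref != 0)
		return;			/* another holder still uses the data */
	skb->data->freed = true;	/* stands in for freeing the data */
}

/* One original and one clone share a data block (dataref == 2). Releases
 * only the original and reports whether the shared block was freed even
 * though the clone still references it. */
static bool original_release_frees_shared_data(bool original_has_cloned_flag)
{
	struct fake_shared_data data = { .dataref = 2, .freed = false };
	struct fake_skb original = {
		.cloned = original_has_cloned_flag,
		.data = &data,
	};

	fake_release_data(&original);
	return data.freed;
}
```

In this model, original_release_frees_shared_data(true) returns false because the count drops from 2 to 1 and the data survives for the clone. With the flag lost, original_release_frees_shared_data(false) returns true: the data is released while the clone still points at it.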