Message ID | 20190907134532.31975-1-ap420073@gmail.com |
---|---|
State | Changes Requested |
Delegated to: | David Miller |
Headers | show |
Series | net: fix nested device bugs | expand |
From: Taehee Yoo <ap420073@gmail.com> Date: Sat, 7 Sep 2019 22:45:32 +0900 > Current code doesn't limit the number of nested devices. > Nested devices would be handled recursively and this needs huge stack > memory. So, unlimited nested devices could make stack overflow. ... > Splat looks like: > [ 140.483124] BUG: looking up invalid subclass: 8 > [ 140.483505] turning off the locking correctness validator. The limit here is not stack memory, but a limit in the lockdep validator, which can probably be fixed by other means. This was the feedback I saw given for the previous version of this series as well.
On Thu, 12 Sep 2019 at 07:32, David Miller <davem@davemloft.net> wrote: > Hi David Thank you for the review! > From: Taehee Yoo <ap420073@gmail.com> > Date: Sat, 7 Sep 2019 22:45:32 +0900 > > > Current code doesn't limit the number of nested devices. > > Nested devices would be handled recursively and this needs huge stack > > memory. So, unlimited nested devices could make stack overflow. > ... > > Splat looks like: > > [ 140.483124] BUG: looking up invalid subclass: 8 > > [ 140.483505] turning off the locking correctness validator. > > The limit here is not stack memory, but a limit in the lockdep > validator, which can probably be fixed by other means. > > This was the feedback I saw given for the previous version of > this series as well. I just realized this commit message is not enough for describing the problems. It looks like that "invalid subclass" makes panic. But this is not. The panic is not related to "invalid subclass" lockdep problem. There are two splats. 1. [ 140.483124] BUG: looking up invalid subclass: 8 2. [ 168.446539] BUG: KASAN: slab-out-of-bounds in __unwind_start+0x71/0x850 [ 168.794493] Rebooting in 5 seconds.. The first message is just a warning message of lockdep because of too deep lockdep subclasses and it doesn't make any problem and panic. This message can be ignored right now because other patches of this patchset avoids this problem using dynamic lockdep key. The second message is a panic message. This is stack overflow and it occurs because of too deep nested devices. The goal of this patch is to fix this stack overflow problem. I tested with this reproducer commands without lockdep. ip link add dummy0 type dummy ip link add link dummy0 name vlan1 type vlan id 1 ip link set vlan1 up for i in {2..200} do let A=$i-1 ip link add name vlan$i link vlan$A type vlan id $i done ip link del vlan1 <-- this command is added. Splat looks like: [ 181.594092] Thread overran stack, or stack corrupted [ 181.594726] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN PTI [ 181.595417] CPU: 1 PID: 1402 Comm: ip Not tainted 5.3.0-rc7+ #173 [ 181.596193] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006 [ 181.605986] RIP: 0010:stack_depot_fetch+0x10/0x30 [ 181.606588] Code: 00 75 10 48 8b 73 18 48 89 ef 5b 5d e9 59 bf 89 ff 0f 0b e8 02 3f 9d ff eb e9 89 f8 c1 ef 110 [ 181.609820] RSP: 0018:ffff8880cbebedd8 EFLAGS: 00010006 [ 181.610485] RAX: 00000000001fffff RBX: ffff8880cbebfc00 RCX: 0000000000000000 [ 181.611394] RDX: 000000000000001d RSI: ffff8880cbebede0 RDI: 0000000000003ff0 [ 181.612297] RBP: ffffea00032fae00 R08: ffffed101b5a3eab R09: ffffed101b5a3eab [ 181.613222] R10: 0000000000000001 R11: ffffed101b5a3eaa R12: ffff8880d6115e80 [ 181.614148] R13: ffff8880cbebeac0 R14: ffff8880cbebfc00 R15: ffff8880cbebef80 [ 181.615053] FS: 00007f46140510c0(0000) GS:ffff8880dad00000(0000) knlGS:0000000000000000 [ 181.616085] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 181.616841] CR2: ffffffffb9a7a0d8 CR3: 00000000bc356003 CR4: 00000000000606e0 [ 181.635748] Call Trace: [ 181.635996] Modules linked in: 8021q garp stp mrp llc dummy veth openvswitch nsh nf_conncount nf_nat nf_conntrs [ 181.637360] CR2: ffffffffb9a7a0d8 [ 181.637670] ---[ end trace f890ce3e5c51ceb4 ]--- [ 181.638096] RIP: 0010:stack_depot_fetch+0x10/0x30 [ 181.638524] Code: 00 75 10 48 8b 73 18 48 89 ef 5b 5d e9 59 bf 89 ff 0f 0b e8 02 3f 9d ff eb e9 89 f8 c1 ef 110 [ 181.805441] RSP: 0018:ffff8880cbebedd8 EFLAGS: 00010006 [ 181.900192] RAX: 00000000001fffff RBX: ffff8880cbebfc00 RCX: 0000000000000000 [ 181.901119] RDX: 000000000000001d RSI: ffff8880cbebede0 RDI: 0000000000003ff0 [ 181.902038] RBP: ffffea00032fae00 R08: ffffed101b5a3eab R09: ffffed101b5a3eab [ 181.902960] R10: 0000000000000001 R11: ffffed101b5a3eaa R12: ffff8880d6115e80 [ 181.903885] R13: ffff8880cbebeac0 R14: ffff8880cbebfc00 R15: ffff8880cbebef80 [ 181.904825] FS: 00007f46140510c0(0000) GS:ffff8880dad00000(0000) knlGS:0000000000000000 [ 181.905862] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 181.906604] CR2: ffffffffb9a7a0d8 CR3: 00000000bc356003 CR4: 00000000000606e0 [ 181.907525] Kernel panic - not syncing: Fatal exception [ 181.908179] Kernel Offset: 0x34000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffff) [ 181.909176] Rebooting in 5 seconds..
From: Taehee Yoo <ap420073@gmail.com> Date: Thu, 12 Sep 2019 12:56:19 +0900 > I tested with this reproducer commands without lockdep. > > ip link add dummy0 type dummy > ip link add link dummy0 name vlan1 type vlan id 1 > ip link set vlan1 up > > for i in {2..200} > do > let A=$i-1 > > ip link add name vlan$i link vlan$A type vlan id $i > done > ip link del vlan1 <-- this command is added. Is there any other device type which allows arbitrary nesting depth in this manner other than VLAN? Perhaps it is the VLAN nesting depth that we should limit instead of all of this extra code.
On Thu, 12 Sep 2019 at 18:38, David Miller <davem@davemloft.net> wrote: > > From: Taehee Yoo <ap420073@gmail.com> > Date: Thu, 12 Sep 2019 12:56:19 +0900 > > > I tested with this reproducer commands without lockdep. > > > > ip link add dummy0 type dummy > > ip link add link dummy0 name vlan1 type vlan id 1 > > ip link set vlan1 up > > > > for i in {2..200} > > do > > let A=$i-1 > > > > ip link add name vlan$i link vlan$A type vlan id $i > > done > > ip link del vlan1 <-- this command is added. > > Is there any other device type which allows arbitrary nesting depth > in this manner other than VLAN? Perhaps it is the VLAN nesting > depth that we should limit instead of all of this extra code. Below device types have the same problem. VLAN, BONDING, TEAM, VXLAN, MACVLAN, and MACSEC. All the below test commands reproduce a panic. BONDING test commands: ip link add bond0 type bond for i in {1..200} do let A=$i-1 ip link add bond$i type bond ip link set bond$i master bond$A done ip link set bond5 master bond0 TEAM test commands: ip link add team0 type team for i in {1..200} do let A=$i-1 ip link add team$i type team ip link set team$i master team$A done MACSEC test commands: ip link add link lo macsec0 type macsec for i in {1..100} do let A=$i-1 ip link add link macsec$A macsec$i type macsec done ip link del macsec0 MACVLAN test commands: ip link add dummy0 type dummy ip link add macvlan1 link dummy0 type macvlan ip link add vlan2 link macvlan1 type vlan id 2 let i=3 for j in {1..100} do let A=$i-1 ip link add macvlan$i link vlan$A type macvlan let i=$i+1 let A=$i-1 ip link add vlan$i link macvlan$A type vlan id $i let i=$i+1 done ip link del dummy0 VXLAN test commands: ip link add vxlan1 type vxlan dev lo id 1 dstport 1 for i in {2..100} do let A=$i-1 ip link add vxlan$i type vxlan dev vxlan$A id $i dstport $i done ip link del vxlan1
From: Taehee Yoo <ap420073@gmail.com> Date: Thu, 12 Sep 2019 19:14:37 +0900 > On Thu, 12 Sep 2019 at 18:38, David Miller <davem@davemloft.net> wrote: >> >> From: Taehee Yoo <ap420073@gmail.com> >> Date: Thu, 12 Sep 2019 12:56:19 +0900 >> >> > I tested with this reproducer commands without lockdep. >> > >> > ip link add dummy0 type dummy >> > ip link add link dummy0 name vlan1 type vlan id 1 >> > ip link set vlan1 up >> > >> > for i in {2..200} >> > do >> > let A=$i-1 >> > >> > ip link add name vlan$i link vlan$A type vlan id $i >> > done >> > ip link del vlan1 <-- this command is added. >> >> Is there any other device type which allows arbitrary nesting depth >> in this manner other than VLAN? Perhaps it is the VLAN nesting >> depth that we should limit instead of all of this extra code. > > Below device types have the same problem. > VLAN, BONDING, TEAM, VXLAN, MACVLAN, and MACSEC. > All the below test commands reproduce a panic. I think then we need to move the traversals over to a iterative rather than recursive algorithm.
On Thu, 12 Sep 2019 at 20:37, David Miller <davem@davemloft.net> wrote: > > From: Taehee Yoo <ap420073@gmail.com> > Date: Thu, 12 Sep 2019 19:14:37 +0900 > > > On Thu, 12 Sep 2019 at 18:38, David Miller <davem@davemloft.net> wrote: > >> > >> From: Taehee Yoo <ap420073@gmail.com> > >> Date: Thu, 12 Sep 2019 12:56:19 +0900 > >> > >> > I tested with this reproducer commands without lockdep. > >> > > >> > ip link add dummy0 type dummy > >> > ip link add link dummy0 name vlan1 type vlan id 1 > >> > ip link set vlan1 up > >> > > >> > for i in {2..200} > >> > do > >> > let A=$i-1 > >> > > >> > ip link add name vlan$i link vlan$A type vlan id $i > >> > done > >> > ip link del vlan1 <-- this command is added. > >> > >> Is there any other device type which allows arbitrary nesting depth > >> in this manner other than VLAN? Perhaps it is the VLAN nesting > >> depth that we should limit instead of all of this extra code. > > > > Below device types have the same problem. > > VLAN, BONDING, TEAM, VXLAN, MACVLAN, and MACSEC. > > All the below test commands reproduce a panic. > > I think then we need to move the traversals over to a iterative > rather than recursive algorithm. I agree with that. So I will check it out then send a v3 patchset. Thank you!
On Thu, 12 Sep 2019 at 20:37, David Miller <davem@davemloft.net> wrote: > > From: Taehee Yoo <ap420073@gmail.com> > Date: Thu, 12 Sep 2019 19:14:37 +0900 > > > On Thu, 12 Sep 2019 at 18:38, David Miller <davem@davemloft.net> wrote: > >> > >> From: Taehee Yoo <ap420073@gmail.com> > >> Date: Thu, 12 Sep 2019 12:56:19 +0900 > >> > >> > I tested with this reproducer commands without lockdep. > >> > > >> > ip link add dummy0 type dummy > >> > ip link add link dummy0 name vlan1 type vlan id 1 > >> > ip link set vlan1 up > >> > > >> > for i in {2..200} > >> > do > >> > let A=$i-1 > >> > > >> > ip link add name vlan$i link vlan$A type vlan id $i > >> > done > >> > ip link del vlan1 <-- this command is added. > >> > >> Is there any other device type which allows arbitrary nesting depth > >> in this manner other than VLAN? Perhaps it is the VLAN nesting > >> depth that we should limit instead of all of this extra code. > > > > Below device types have the same problem. > > VLAN, BONDING, TEAM, VXLAN, MACVLAN, and MACSEC. > > All the below test commands reproduce a panic. > > I think then we need to move the traversals over to a iterative > rather than recursive algorithm. Just to clarify, I have a question. There are two recursive routines in the code. a) netdev_walk_all_{lower/upper}_dev() that are used to traversal all of their lower or upper devices. b) VLAN, BONDING, TEAM, VXLAN, MACVLAN, and MACSEC modules internally handle their lower/upper devices recursively when an event is received such as unregistering. what is the routine that you mentioned?
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index 88292953aa6f..5bb5756129af 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -1624,6 +1624,8 @@ enum netdev_priv_flags { * @type: Interface hardware type * @hard_header_len: Maximum hardware header length. * @min_header_len: Minimum hardware header length + * @upper_level: Maximum depth level of upper devices. + * @lower_level: Maximum depth level of lower devices. * * @needed_headroom: Extra headroom the hardware may need, but not in all * cases can this be guaranteed @@ -1854,6 +1856,8 @@ struct net_device { unsigned short type; unsigned short hard_header_len; unsigned char min_header_len; + unsigned char upper_level; + unsigned char lower_level; unsigned short needed_headroom; unsigned short needed_tailroom; diff --git a/net/core/dev.c b/net/core/dev.c index 0891f499c1bb..6a4b4ce62204 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -146,6 +146,7 @@ #include "net-sysfs.h" #define MAX_GRO_SKBS 8 +#define MAX_NEST_DEV 8 /* This should be increased if a protocol with a bigger head is added. */ #define GRO_MAX_HEAD (MAX_HEADER + 128) @@ -6602,6 +6603,21 @@ struct net_device *netdev_upper_get_next_dev_rcu(struct net_device *dev, } EXPORT_SYMBOL(netdev_upper_get_next_dev_rcu); +static struct net_device *netdev_next_upper_dev(struct net_device *dev, + struct list_head **iter) +{ + struct netdev_adjacent *upper; + + upper = list_entry((*iter)->next, struct netdev_adjacent, list); + + if (&upper->list == &dev->adj_list.upper) + return NULL; + + *iter = &upper->list; + + return upper->dev; +} + static struct net_device *netdev_next_upper_dev_rcu(struct net_device *dev, struct list_head **iter) { @@ -6619,6 +6635,33 @@ static struct net_device *netdev_next_upper_dev_rcu(struct net_device *dev, return upper->dev; } +int netdev_walk_all_upper_dev(struct net_device *dev, + int (*fn)(struct net_device *dev, + void *data), + void *data) +{ + struct net_device *udev; + struct list_head *iter; + int ret; + + for (iter = &dev->adj_list.upper, + udev = netdev_next_upper_dev(dev, &iter); + udev; + udev = netdev_next_upper_dev(dev, &iter)) { + /* first is the upper device itself */ + ret = fn(udev, data); + if (ret) + return ret; + + /* then look at all of its upper devices */ + ret = netdev_walk_all_upper_dev(udev, fn, data); + if (ret) + return ret; + } + + return 0; +} + int netdev_walk_all_upper_dev_rcu(struct net_device *dev, int (*fn)(struct net_device *dev, void *data), @@ -6785,6 +6828,52 @@ static struct net_device *netdev_next_lower_dev_rcu(struct net_device *dev, return lower->dev; } +static u8 __netdev_upper_depth(struct net_device *dev) +{ + struct net_device *udev; + struct list_head *iter; + u8 max_depth = 0; + + for (iter = &dev->adj_list.upper, + udev = netdev_next_upper_dev(dev, &iter); + udev; + udev = netdev_next_upper_dev(dev, &iter)) { + if (max_depth < udev->upper_level) + max_depth = udev->upper_level; + } + + return max_depth; +} + +static u8 __netdev_lower_depth(struct net_device *dev) +{ + struct net_device *ldev; + struct list_head *iter; + u8 max_depth = 0; + + for (iter = &dev->adj_list.lower, + ldev = netdev_next_lower_dev(dev, &iter); + ldev; + ldev = netdev_next_lower_dev(dev, &iter)) { + if (max_depth < ldev->lower_level) + max_depth = ldev->lower_level; + } + + return max_depth; +} + +static int __netdev_update_upper_level(struct net_device *dev, void *data) +{ + dev->upper_level = __netdev_upper_depth(dev) + 1; + return 0; +} + +static int __netdev_update_lower_level(struct net_device *dev, void *data) +{ + dev->lower_level = __netdev_lower_depth(dev) + 1; + return 0; +} + int netdev_walk_all_lower_dev_rcu(struct net_device *dev, int (*fn)(struct net_device *dev, void *data), @@ -7063,6 +7152,9 @@ static int __netdev_upper_dev_link(struct net_device *dev, if (netdev_has_upper_dev(upper_dev, dev)) return -EBUSY; + if ((dev->lower_level + upper_dev->upper_level) > MAX_NEST_DEV) + return -EMLINK; + if (!master) { if (netdev_has_upper_dev(dev, upper_dev)) return -EEXIST; @@ -7089,6 +7181,12 @@ static int __netdev_upper_dev_link(struct net_device *dev, if (ret) goto rollback; + __netdev_update_upper_level(dev, NULL); + netdev_walk_all_lower_dev(dev, __netdev_update_upper_level, NULL); + + __netdev_update_lower_level(upper_dev, NULL); + netdev_walk_all_upper_dev(upper_dev, __netdev_update_lower_level, NULL); + return 0; rollback: @@ -7171,6 +7269,12 @@ void netdev_upper_dev_unlink(struct net_device *dev, call_netdevice_notifiers_info(NETDEV_CHANGEUPPER, &changeupper_info.info); + + __netdev_update_upper_level(dev, NULL); + netdev_walk_all_lower_dev(dev, __netdev_update_upper_level, NULL); + + __netdev_update_lower_level(upper_dev, NULL); + netdev_walk_all_upper_dev(upper_dev, __netdev_update_lower_level, NULL); } EXPORT_SYMBOL(netdev_upper_dev_unlink); @@ -9157,6 +9261,8 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name, dev->gso_max_size = GSO_MAX_SIZE; dev->gso_max_segs = GSO_MAX_SEGS; + dev->upper_level = 1; + dev->lower_level = 1; INIT_LIST_HEAD(&dev->napi_list); INIT_LIST_HEAD(&dev->unreg_list);
Current code doesn't limit the number of nested devices. Nested devices would be handled recursively and this needs huge stack memory. So, unlimited nested devices could make stack overflow. This patch adds upper_level and lower_leve, they are common variables and represent maximum lower/upper depth. When upper/lower device is attached or dettached, {lower/upper}_level are updated. and if maximum depth is bigger than 8, attach routine fails and returns -EMLINK. Test commands: ip link add dummy0 type dummy ip link add link dummy0 name vlan1 type vlan id 1 ip link set vlan1 up for i in {2..100} do let A=$i-1 ip link add name vlan$i link vlan$A type vlan id $i done Splat looks like: [ 140.483124] BUG: looking up invalid subclass: 8 [ 140.483505] turning off the locking correctness validator. [ 140.483505] CPU: 0 PID: 1324 Comm: ip Not tainted 5.3.0-rc7+ #322 [ 140.483505] Hardware name: To be filled by O.E.M. To be filled by O.E.M./Aptio CRB, BIOS 5.6.5 07/08/2015 [ 140.483505] Call Trace: [ 140.483505] dump_stack+0x7c/0xbb [ 140.483505] register_lock_class+0x64d/0x14d0 [ 140.483505] ? is_dynamic_key+0x230/0x230 [ 140.483505] ? module_assert_mutex_or_preempt+0x41/0x70 [ 140.483505] ? __module_address+0x3f/0x3c0 [ 140.483505] lockdep_init_map+0x24e/0x630 [ 140.483505] vlan_dev_init+0x828/0xce0 [8021q] [ 140.483505] register_netdevice+0x24f/0xd70 [ 140.483505] ? netdev_change_features+0xa0/0xa0 [ 140.483505] ? dev_get_nest_level+0xe1/0x170 [ 140.483505] register_vlan_dev+0x29b/0x710 [8021q] [ 140.483505] __rtnl_newlink+0xb75/0x1180 [ ... ] [ 168.446539] WARNING: can't dereference registers at 00000000bef3d701 for ip apic_timer_interrupt+0xf/0x20 [ 168.466843] ================================================================== [ 168.469452] BUG: KASAN: slab-out-of-bounds in __unwind_start+0x71/0x850 [ 168.480707] Write of size 88 at addr ffff8880b8856d38 by task ip/1758 [ 168.480707] [ 168.480707] CPU: 1 PID: 1758 Comm: ip Not tainted 5.3.0-rc7+ #322 [ ... ] [ 168.794493] Rebooting in 5 seconds.. Signed-off-by: Taehee Yoo <ap420073@gmail.com> --- v1 -> v2 : this patch isn't changed include/linux/netdevice.h | 4 ++ net/core/dev.c | 106 ++++++++++++++++++++++++++++++++++++++ 2 files changed, 110 insertions(+)