Message ID | f8fde7da-9402-4e17-1bcc-cba08f1ec18b@virtuozzo.com |
---|---|
State | Changes Requested |
Delegated to: | David Miller |
Headers | show |
Series | skb_can_coalesce() merges tcp frags with XFS-related slab objects | expand |
On 02/20/2019 05:34 AM, Vasily Averin wrote: > Dear David, > > currently do_tcp_sendpages() calls skb_can_coalesce() to merge proper tcp fragments. > If these fragments are slab objects and the data is not transferred out of the local host > then tcp_recvmsg() can crash host on BUG_ON (see [2] below). > > There is known usecase when slab objects are provided to tcp_sendpage: > XFS over locally landed network blockdevice. > > I found few such cases: > - _drbd_send_page() had PageSlab() check log time ago. > - recently Ilya Dryomov fixed it in ceph > by commit 7e241f647dc7 "libceph: fall back to sendmsg for slab pages" > > Recently OpenVZ team noticed this problem during experiments with > XFS over locally-landed iscsi target. > > I would note: triggered BUG is not a real problem but false alert, > that though crashes host. > > I can fix last problem by adding PageSlab() into iscsi_tcp_segment_map(), > however it does not fix the problem completely, > there are chances that the problem will be reproduced again with some other filesystems > or with some other kind of network blockdevice. > > David, what do you think, is it probably better to add PageSlab() check > directly into skb_can_coalesce()? (see [1] below) > No, this would be wrong. There is no way a page fragment can be backed by slab object, since a page fragment can be shared (the page refcount needs to be manipulated, without slab/slub being aware of this) Please fix the callers. > Thank you, > Vasily Averin > > [1] The patch preventing the merge of the Slab-based tcp fragments > diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h > index 95d25b010a25..e1d200ba1fef 100644 > --- a/include/linux/skbuff.h > +++ b/include/linux/skbuff.h > @@ -3089,7 +3089,7 @@ static inline bool skb_can_coalesce(struct sk_buff *skb, int i, > if (i) { > const struct skb_frag_struct *frag = &skb_shinfo(skb)->frags[i - 1]; > > - return page == skb_frag_page(frag) && > + return page == skb_frag_page(frag) && !PageSlab(page) && > off == frag->page_offset + skb_frag_size(frag); > } > return false; > > [2] oops example, RHEL7-based OpenVZ node, XFS over locally-landed iscsi target > > [ 4902.545219] usercopy: kernel memory exposure attempt detected from ffff8c1497c92200 (kmalloc-512) (1024 bytes) > [ 4902.550585] ------------[ cut here ]------------ > [ 4902.552472] kernel BUG at mm/usercopy.c:72! > [ 4902.554148] invalid opcode: 0000 [#1] SMP > [ 4902.555906] Modules linked in: nf_conntrack_netlink xt_mark raw_diag udp_diag netlink_diag af_packet_diag unix_diag dm_service_time iscsi_tcp libiscsi_tcp libiscsi xfs xt_CHECKSUM ip6t_rpfilter ipt_REJECT nf_reject_ipv4 ip6t_REJECT nf_reject_ipv6 xt_conntrack ebtable_nat ebtable_broute ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter ip_tables binfmt_misc tun devlink tcp_diag inet_diag ip_set nfnetlink fuse kvm_intel ppdev kvm i2c_piix4 irqbypass sg virtio_balloon parport_pc parport joydev pcspkr libcrc32c br_netfilter veth overlay ip6_vzprivnet ip6_vznetstat > [ 4903.413397] ip_vznetstat ip_vzprivnet vziolimit vzevent vzlist vzstat vznetstat vznetdev vzmon vzdev bridge pio_kaio pio_nfs pio_direct pfmt_raw pfmt_ploop1 ploop ext4 mbcache jbd2 sd_mod crc_t10dif sr_mod crct10dif_generic cdrom crct10dif_common ata_generic pata_acpi 8021q garp mrp stp llc virtio_net virtio_console virtio_scsi bochs_drm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm ata_piix scsi_transport_iscsi drm libata crc32c_intel serio_raw virtio_pci virtio_ring dm_multipath virtio drm_panel_orientation_quirks floppy sunrpc dm_mirror dm_region_hash dm_log dm_mod [last unloaded: ip_tables] > [ 4903.433902] CPU: 0 PID: 13793 Comm: tgtd ve: 0 Kdump: loaded Not tainted 3.10.0-957.1.3.vz7.83.4 #1 83.4 > [ 4903.436975] Hardware name: Virtuozzo KVM, BIOS 1.10.2-3.1.vz7.3 04/01/2014 > [ 4903.439474] task: ffff8c14f898c740 ti: ffff8c14f8a94000 task.ti: ffff8c14f8a94000 > [ 4903.442278] RIP: 0010:[<ffffffff9ce59427>] [<ffffffff9ce59427>] __check_object_size+0x87/0x250 > [ 4903.445370] RSP: 0018:ffff8c14f8a97b58 EFLAGS: 00010246 > [ 4903.447547] RAX: 0000000000000062 RBX: ffff8c1497c92200 RCX: 0000000000000000 > [ 4903.450108] RDX: 0000000000000000 RSI: ffff8c167fc138d8 RDI: ffff8c167fc138d8 > [ 4903.452753] RBP: ffff8c14f8a97b78 R08: 0000000000000004 R09: ffff8c166d14af00 > [ 4903.455451] R10: 0000000000000080 R11: ffff97b6016ffff8 R12: 0000000000000400 > [ 4903.458142] R13: 0000000000000001 R14: ffff8c1497c92600 R15: 0000000000000400 > [ 4903.460759] FS: 00007fd59b2b7740(0000) GS:ffff8c167fc00000(0000) knlGS:0000000000000000 > [ 4903.463776] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [ 4903.466111] CR2: 0000000002701b58 CR3: 00000000b8a98000 CR4: 00000000000006f0 > [ 4903.468860] Call Trace: > [ 4903.470483] [<ffffffff9cfb310d>] memcpy_toiovec+0x4d/0xb0 > [ 4903.472780] [<ffffffff9d2589e8>] skb_copy_datagram_iovec+0x128/0x280 > [ 4903.475388] [<ffffffff9d2c0646>] tcp_recvmsg+0x246/0xbb0 > [ 4903.477884] [<ffffffff9cdd42cd>] ? __alloc_pages_nodemask+0x26d/0x610 > [ 4903.480432] [<ffffffff9d2ef1d0>] inet_recvmsg+0x80/0xb0 > [ 4903.482718] [<ffffffff9d2467bc>] sock_aio_read.part.12+0x14c/0x170 > [ 4903.485255] [<ffffffff9d246801>] sock_aio_read+0x21/0x30 > [ 4903.487495] [<ffffffff9ce5edd6>] do_sync_read+0x96/0xe0 > [ 4903.489710] [<ffffffff9ce5f8b5>] vfs_read+0x145/0x170 > [ 4903.491975] [<ffffffff9ce606cf>] SyS_read+0x7f/0xf0 > [ 4903.494179] [<ffffffff9d3a4de1>] ? system_call_after_swapgs+0xae/0x146 > [ 4903.496684] [<ffffffff9d3a4e9b>] system_call_fastpath+0x22/0x27 > [ 4903.499143] [<ffffffff9d3a4de1>] ? system_call_after_swapgs+0xae/0x146 > [ 4903.501748] Code: 45 d1 48 c7 c6 f8 c3 68 9d 48 c7 c1 93 5e 69 9d 48 0f 45 f1 49 89 c0 4d 89 e1 48 89 d9 48 c7 c7 68 2b 69 9d 31 c0 e8 4f 1e 53 00 <0f> 0b 0f 1f 80 00 00 00 00 48 c7 c0 00 00 c0 9c 4c 39 f0 73 0d > [ 4903.510508] RIP [<ffffffff9ce59427>] __check_object_size+0x87/0x250 > [ 4903.513149] RSP <ffff8c14f8a97b58> >
On 2/20/19 6:53 PM, Eric Dumazet wrote: > On 02/20/2019 05:34 AM, Vasily Averin wrote: >> Dear David, >> >> currently do_tcp_sendpages() calls skb_can_coalesce() to merge proper tcp fragments. >> If these fragments are slab objects and the data is not transferred out of the local host >> then tcp_recvmsg() can crash host on BUG_ON (see [2] below). >> >> There is known usecase when slab objects are provided to tcp_sendpage: >> XFS over locally landed network blockdevice. >> >> I found few such cases: >> - _drbd_send_page() had PageSlab() check log time ago. >> - recently Ilya Dryomov fixed it in ceph >> by commit 7e241f647dc7 "libceph: fall back to sendmsg for slab pages" >> >> Recently OpenVZ team noticed this problem during experiments with >> XFS over locally-landed iscsi target. >> >> I would note: triggered BUG is not a real problem but false alert, >> that though crashes host. >> >> I can fix last problem by adding PageSlab() into iscsi_tcp_segment_map(), >> however it does not fix the problem completely, >> there are chances that the problem will be reproduced again with some other filesystems >> or with some other kind of network blockdevice. >> >> David, what do you think, is it probably better to add PageSlab() check >> directly into skb_can_coalesce()? (see [1] below) >> > > No, this would be wrong. > > There is no way a page fragment can be backed by slab object, > since a page fragment can be shared (the page refcount needs to be manipulated, without slab/slub > being aware of this) Thank you for explanation, though this happen in real life and triggers BUG_ON only if receiving side is located on the same host. Is it probably makes sense to add WARN_ON into skb_can_coalesce to detect such cases? > Please fix the callers. Ok, will do it.
On 02/20/2019 08:19 AM, Vasily Averin wrote: > Thank you for explanation, > though this happen in real life and triggers BUG_ON only if receiving side is located on the same host. > Is it probably makes sense to add WARN_ON into skb_can_coalesce to detect such cases? Yes, but please do it only in the sendpage() path, or only in CONFIG_DEBUG_PAGEALLOC / CONFIG_DEBUG_VM cases. tcp_sendmsg() uses a per task page (look at sk_page_frag()), and it seems strange to recheck what we already know (it is a page not backed/used by SLAB)
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index 95d25b010a25..e1d200ba1fef 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -3089,7 +3089,7 @@ static inline bool skb_can_coalesce(struct sk_buff *skb, int i, if (i) { const struct skb_frag_struct *frag = &skb_shinfo(skb)->frags[i - 1]; - return page == skb_frag_page(frag) && + return page == skb_frag_page(frag) && !PageSlab(page) && off == frag->page_offset + skb_frag_size(frag); } return false;