Date: Thu, 28 Jul 2016 14:28:23 +0300
From: Denys Fedoryshchenko
To: Guillaume Nault
Cc: Cong Wang, Linux Kernel Network Developers
Subject: Re: 4.6.3, pppoe + shaper workload, skb_panic / skb_push / ppp_start_xmit
In-Reply-To: <20160728110948.GA3046@alphalink.fr>
References: <20160728110948.GA3046@alphalink.fr>
X-Patchwork-Id: 653681
X-Mailing-List: netdev@vger.kernel.org

On 2016-07-28 14:09, Guillaume Nault wrote:
> On Tue, Jul 12,
2016 at 10:31:18AM -0700, Cong Wang wrote:
>> On Mon, Jul 11, 2016 at 12:45 PM, wrote:
>> > Hi
>> >
>> > On latest kernel i noticed kernel panic happening 1-2 times per day. It is
>> > also happening on older kernel (at least 4.5.3).
>> >
>> ...
>> > [42916.426463] Call Trace:
>> > [42916.426658]
>> >
>> > [42916.426719] [] skb_push+0x36/0x37
>> > [42916.427111] [] ppp_start_xmit+0x10f/0x150 [ppp_generic]
>> > [42916.427314] [] dev_hard_start_xmit+0x25a/0x2d3
>> > [42916.427516] [] ? validate_xmit_skb.isra.107.part.108+0x11d/0x238
>> > [42916.427858] [] sch_direct_xmit+0x89/0x1b5
>> > [42916.428060] [] __qdisc_run+0x133/0x170
>> > [42916.428261] [] net_tx_action+0xe3/0x148
>> > [42916.428462] [] __do_softirq+0xb9/0x1a9
>> > [42916.428663] [] irq_exit+0x37/0x7c
>> > [42916.428862] [] smp_apic_timer_interrupt+0x3d/0x48
>> > [42916.429063] [] apic_timer_interrupt+0x7c/0x90
>>
>> Interesting, we call a skb_cow_head() before skb_push() in
>> ppp_start_xmit(),
>> I have no idea why this could happen.
>>
> The skb is corrupted: head is at ffff8800b0bf2800 while data is at
> ffa00500b0bf284c.
>
> Figuring out how this corruption happened is going to be hard without a
> way to reproduce the problem.
>
> Denys, can you confirm you're using a vanilla kernel?
> Also I guess the ppp devices and tc settings are handled by accel-ppp.
> If so, can you share more info about your setup (accel-ppp.conf, radius
> attributes, iptables...) so that I can try to reproduce it on my
> machines?

I have a slight modification from vanilla (the sch_htb.c patch at the end
of this message), but I guess it should not be the reason for the crash. It
is related to another system: without it I was unable to shape over 7Gbps;
maybe with the latest kernel I will not need this patch.

I'm trying to find reproducible conditions for the crash, because right now
it happens only on some servers in large networks (completely different
ISPs, so I have ruled out a hardware fault of a specific server).
It is a complex config: I have accel-ppp, plus my own "shaping daemon" that
applies several shapers on ppp interfaces.
The worst thing is that it happens only with live customers; I am unable to
reproduce it in stress tests.
Also, until the recent kernel I was getting different panic messages (but
all related to ppp).
I think at least one cause of the crashes was fixed by "ppp: defer netns
reference release for ppp channel" in 4.7.0 (maybe that's why I am getting
fewer crashes recently).
I also tried various kernel debug options that don't cause major
performance degradation (lock checking, freed memory poisoning, etc.),
without any luck yet.
Would it be useful if I post panics that occur at least twice? (I will post
an example below, which I got recently.)
Of course, if I manage to find reproducible conditions I will send them
immediately.

[ 5449.900988] general protection fault: 0000 [#1] SMP
[ 5449.901263] Modules linked in: cls_fw act_police cls_u32 sch_ingress sch_sfq sch_htb pppoe pppox ppp_generic slhc netconsole configfs xt_nat ts_bm xt_string xt_connmark xt_TCPMSS xt_tcpudp xt_mark iptable_filter iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle ip_tables x_tables 8021q garp mrp stp llc ixgbe dca
[ 5449.904989] CPU: 1 PID: 6359 Comm: ip Not tainted 4.7.0-build-0109 #2
[ 5449.905255] Hardware name: Supermicro X10SLM+-LN4F/X10SLM+-LN4F, BIOS 3.0 04/24/2015
[ 5449.905712] task: ffff8803eef40000 ti: ffff8803fd754000 task.ti: ffff8803fd754000
[ 5449.906168] RIP: 0010:[] [] inet_fill_ifaddr+0x5a/0x264
[ 5449.906710] RSP: 0018:ffff8803fd757b98 EFLAGS: 00010286
[ 5449.906976] RAX: ffff8803ef65cb90 RBX: ffff8803f7d2cd00 RCX: 0000000000000000
[ 5449.907248] RDX: 0000000800000002 RSI: ffff8803ef65cb90 RDI: ffff8803ef65cba8
[ 5449.907519] RBP: ffff8803fd757be0 R08: 0000000000000008 R09: 0000000000000002
[ 5449.907792] R10: ffa005040269f480 R11: ffffffff820a1c00 R12: ffa005040269f480
[ 5449.908067] R13: ffff8803ef65cb90 R14: 0000000000000000 R15: ffff8803f7d2cd00
[ 5449.908339] FS: 00007f660674d700(0000) GS:ffff88041fc40000(0000) knlGS:0000000000000000
[ 5449.908796] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 5449.909067] CR2: 00000000008b9018 CR3: 00000003f2a11000 CR4: 00000000001406e0
[ 5449.909339] Stack:
[ 5449.909598] 0163a8c0869711ac 0000008000000000 ffffffffffffffff 0003e1d50003e1d5
[ 5449.910329] ffff8800d54c0ac8 ffff8803f0d90000 0000000000000005 0000000000000000
[ 5449.911066] ffff8803f7d2cd00 ffff8803fd757c40 ffffffff818a9f73 ffffffff820a1c00
[ 5449.911803] Call Trace:
[ 5449.912061] [] inet_dump_ifaddr+0xfb/0x185
[ 5449.912332] [] rtnl_dump_all+0xa9/0xc2
[ 5449.912601] [] netlink_dump+0xf0/0x25c
[ 5449.912873] [] netlink_recvmsg+0x1a9/0x2d3
[ 5449.913142] [] sock_recvmsg+0x14/0x16
[ 5449.913407] [] ___sys_recvmsg+0xea/0x1a1
[ 5449.913675] [] ? alloc_pages_vma+0x167/0x1a0
[ 5449.913945] [] ? page_add_new_anon_rmap+0xb4/0xbd
[ 5449.914212] [] ? lru_cache_add_active_or_unevictable+0x31/0x9d
[ 5449.914664] [] ? handle_mm_fault+0x632/0x112d
[ 5449.914940] [] ? vma_merge+0x27e/0x2b1
[ 5449.915208] [] __sys_recvmsg+0x3d/0x5e
[ 5449.915478] [] ? __sys_recvmsg+0x3d/0x5e
[ 5449.915747] [] SyS_recvmsg+0xd/0x17
[ 5449.916017] [] entry_SYSCALL_64_fastpath+0x17/0x93
[ 5449.916287] Code: e5 41 57 41 56 41 55 41 54 49 89 f4 53 89 c6 48 89 fb 48 83 ec 20 e8 be b0 fc ff 48 85 c0 49 89 c5 0f 84 f4 01 00 00 c6 40 10 02 8a 44 24 41 41 83 ce ff 45 89 f7 41 88 45 11 41 8b 44 24 44
[ 5449.921684] RIP [] inet_fill_ifaddr+0x5a/0x264
[ 5449.922028] RSP
[ 5449.922547] ---[ end trace 18580d58f51e3038 ]---
[ 5449.923705] Kernel panic - not syncing: Fatal exception
[ 5449.923979] Kernel Offset: disabled
[ 5449.925873] Rebooting in 5 seconds..
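
A note for reading the dump above: R10 and R12 both hold ffa005040269f480, and the skb data pointer quoted earlier in this thread was ffa00500b0bf284c. Assuming these machines use 48-bit virtual addresses (4-level paging, which is an assumption, not something the logs state), bits 63..47 of a valid pointer must all be equal, and an access through a pointer that breaks this rule raises a general protection fault rather than a page fault. That is consistent with the "general protection fault" header, though it does not say where the value came from. A minimal sketch of the check, with hypothetical helper names `is_canonical48` and `ptr_diff_bits`:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * True if addr is canonical for 48-bit virtual addressing:
 * bits 63..47 must be all zeros or all ones.
 * (48-bit addressing is an assumption about the affected machines.)
 */
static inline bool is_canonical48(uint64_t addr)
{
	uint64_t top = addr >> 47;	/* the upper 17 bits */

	return top == 0 || top == 0x1ffff;
}

/* Bits that differ between two pointer values (hypothetical helper). */
static inline uint64_t ptr_diff_bits(uint64_t a, uint64_t b)
{
	return a ^ b;
}
```

With the values from this thread, `is_canonical48(0xffff8800b0bf2800)` (the skb head) is true, while `is_canonical48(0xffa00500b0bf284c)` (the skb data) and `is_canonical48(0xffa005040269f480)` (R10/R12) are false. `ptr_diff_bits(head, data)` gives 0x005f8d000000004c: the low 32 bits differ only by the 0x4c offset, so the damage sits in the upper 32 bits of the pointer.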

[43221.432450] general protection fault: 0000 [#1] SMP
[43221.432656] Modules linked in: intel_ips intel_smartconnect intel_rst cls_fw act_police cls_u32 sch_ingress sch_sfq sch_htb pppoe pppox ppp_generic slhc netconsole configfs xt_nat ts_bm xt_string xt_connmark xt_TCPMSS xt_tcpudp xt_mark iptable_filter iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle ip_tables x_tables 8021q garp mrp stp llc ixgbe dca
[43221.433815] CPU: 3 PID: 29196 Comm: accel-cmd Not tainted 4.7.0-build-0110 #2
[43221.434024] Hardware name: Supermicro X10SLM+-LN4F/X10SLM+-LN4F, BIOS 3.0 04/24/2015
[43221.434414] task: ffff8803dcc39780 ti: ffff8800cdb18000 task.ti: ffff8800cdb18000
[43221.434805] RIP: 0010:[] [] inet_fill_ifaddr+0x5a/0x264
[43221.435202] RSP: 0018:ffff8800cdb1bb98 EFLAGS: 00010282
[43221.435406] RAX: ffff8803fe89efb0 RBX: ffff8803de661500 RCX: 0000000000000000
[43221.435616] RDX: 0000000800000002 RSI: ffff8803fe89efb0 RDI: ffff8803fe89efc8
[43221.435823] RBP: ffff8800cdb1bbe0 R08: 0000000000000008 R09: 0000000000000002
[43221.436030] R10: ffa0050402880f80 R11: ffffffff820a1680 R12: ffa0050402880f80
[43221.436234] R13: ffff8803fe89efb0 R14: 0000000000000000 R15: ffff8803de661500
[43221.436436] FS: 00007f25a2539700(0000) GS:ffff88041fcc0000(0000) knlGS:0000000000000000
[43221.436821] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[43221.437023] CR2: 000000000060f000 CR3: 00000000cd2e8000 CR4: 00000000001406e0
[43221.437227] Stack:
[43221.437419] 0163a8c0818411ac 0000008000000000 ffffffffffffffff 003a44db003a44db
[43221.437827] ffff8803fe5992c8 ffff8803f5b04000 0000000000000003 0000000000000000
[43221.438230] ffff8803de661500 ffff8800cdb1bc40 ffffffff818a85f6 ffffffff820a1680
[43221.438636] Call Trace:
[43221.438834] [] inet_dump_ifaddr+0xfb/0x185
[43221.439035] [] rtnl_dump_all+0xa9/0xc2
[43221.439241] [] netlink_dump+0xf0/0x25c
[43221.439441] [] netlink_recvmsg+0x1a9/0x2d3
[43221.439641] [] sock_recvmsg+0x14/0x16
[43221.439841] [] ___sys_recvmsg+0xea/0x1a1
[43221.440043] [] ? alloc_pages_vma+0x167/0x1a0
[43221.440247] [] ? page_add_new_anon_rmap+0xb4/0xbd
[43221.440449] [] ? lru_cache_add_active_or_unevictable+0x31/0x9d
[43221.440837] [] ? handle_mm_fault+0x632/0x112d
[43221.441038] [] ? SyS_sendto+0xef/0x120
[43221.441241] [] __sys_recvmsg+0x3d/0x5e
[43221.441443] [] ? __sys_recvmsg+0x3d/0x5e
[43221.441644] [] SyS_recvmsg+0xd/0x17
[43221.441849] [] entry_SYSCALL_64_fastpath+0x17/0x93
[43221.442055] Code: e5 41 57 41 56 41 55 41 54 49 89 f4 53 89 c6 48 89 fb 48 83 ec 20 e8 be b0 fc ff 48 85 c0 49 89 c5 0f 84 f4 01 00 00 c6 40 10 02 8a 44 24 41 41 83 ce ff 45 89 f7 41 88 45 11 41 8b 44 24 44
[43221.442945] RIP [] inet_fill_ifaddr+0x5a/0x264
[43221.443151] RSP
[43221.445125] ---[ end trace 99273d413e56a193 ]---
[43221.446262] Kernel panic - not syncing: Fatal exception
[43221.446536] Kernel Offset: disabled
[43221.448446] Rebooting in 5 seconds..
Jul 27 23:41:44 10.0.253.19
Jul 27 23:41:44 10.0.253.19 [43226.451328] ACPI MEMORY or I/O RESET_REG.

--- linux/net/sched/sch_htb.c	2016-06-08 01:23:53.000000000 +0000
+++ linux-new/net/sched/sch_htb.c	2016-06-21 14:03:08.398486593 +0000
@@ -1495,10 +1495,10 @@
 				    cl->common.classid);
 			cl->quantum = 1000;
 		}
-		if (!hopt->quantum && cl->quantum > 200000) {
+		if (!hopt->quantum && cl->quantum > 2000000) {
 			pr_warn("HTB: quantum of class %X is big. Consider r2q change.\n",
 				cl->common.classid);
-			cl->quantum = 200000;
+			cl->quantum = 2000000;
 		}
 		if (hopt->quantum)
 			cl->quantum = hopt->quantum;
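
To show what the patch above changes, here is a rough userspace sketch of the quantum selection for a leaf class. It rests on assumptions not visible in the hunk: that the derived quantum is the class rate in bytes per second divided by r2q, and that r2q is 10 unless configured otherwise. The function name `htb_leaf_quantum` and its parameters are hypothetical; only the 1000 lower clamp, the 200000 / 2000000 upper clamp and the user-quantum override come from the diff itself.

```c
#include <limits.h>
#include <stdint.h>

#define HTB_MIN_QUANTUM 1000u	/* lower clamp seen in the patch context */

/*
 * Sketch of the leaf-class quantum choice (not the kernel code itself).
 * Assumption: derived quantum = rate_bytes_ps / r2q, with r2q non-zero.
 * max_quantum is 200000 in the stock kernel, 2000000 with the patch.
 * user_quantum == 0 means the user did not set one.
 */
static inline uint32_t htb_leaf_quantum(uint64_t rate_bytes_ps, uint32_t r2q,
					uint32_t user_quantum,
					uint32_t max_quantum)
{
	uint64_t q = rate_bytes_ps / r2q;

	if (user_quantum)
		return user_quantum;	/* an explicit quantum always wins */
	if (q > INT_MAX)
		q = INT_MAX;
	if (q < HTB_MIN_QUANTUM)
		return HTB_MIN_QUANTUM;	/* too small: raised to 1000 */
	if (q > max_quantum)
		return max_quantum;	/* too big: capped, kernel warns */
	return (uint32_t)q;
}
```

For a 7 Gbit/s class (875000000 bytes/s) with r2q 10 and no explicit quantum, the derived value is 87500000, so `htb_leaf_quantum(875000000, 10, 0, 200000)` returns 200000 and `htb_leaf_quantum(875000000, 10, 0, 2000000)` returns 2000000. Under these assumptions the class hits the upper clamp either way, and the patch raises that clamp tenfold.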