X-Patchwork-Id: 1939226
From: Rafał Miłecki
Date: Mon, 4 Sep 2023 10:33:59 +0200
To: Peter Zijlstra, Ingo Molnar, Will Deacon, Waiman Long, Boqun Feng, Russell King, Daniel Lezcano, Thomas Gleixner, Florian Fainelli, linux-clk@vger.kernel.org, linux-arm-kernel@lists.infradead.org, netdev@vger.kernel.org, linux-kernel@vger.kernel.org
Cc: openwrt-devel@lists.openwrt.org, bcm-kernel-feedback-list@broadcom.com
Subject: ARM BCM53573 SoC hangs/lockups caused by locks/clock/random changes

I made a second attempt at debugging some longstanding stability issues affecting BCM53573 SoCs. Those are single CPU core ARM Cortex-A7 boards with a pretty slow arch timer running at 36.8 kHz. After 0 to 20 minutes of close to zero activity I experience hangs and need to wait a minute for the watchdog to kick in and reboot the device.

First debugging attempt:
https://lore.kernel.org/netdev/0f9d0cd6-d344-7915-7bc1-7a090b8305d2@gmail.com/T/
("ARM board lockups/hangs triggered by locks and mutexes")

After a lot of bisecting, testing & hacking I believe there are 3 types of kernel aspects that affect BCM53573 stability. I'd like to describe them below to document my debugging work. I'm clueless at this point. Maybe someone can come up with an idea of the actual issue & ideally a solution.

##### 1. Locking

During my first bisecting attempts I found multiple locking-related commits that regressed stability.

Bisected commits:
131287ff833d ("once: add DO_ONCE_SLOW() for sleepable contexts")
and a following group:
d0d583484d2e ("locking/refcount: Consolidate implementations of refcount_t")
dab787c73f6e ("locking/refcount: Consolidate REFCOUNT_{MAX,SATURATED} definitions")
0d3182fbe689 ("locking/refcount: Move saturation warnings out of line")
809554147d60 ("locking/refcount: Improve performance of generic REFCOUNT_FULL code")
9c9269977f03 ("locking/refcount: Move the bulk of the REFCOUNT_FULL implementation into the header")
04bff7d7b808 ("locking/refcount: Remove unused refcount_*_checked() variants")
513b19a43bec ("locking/refcount: Ensure integer operands are treated as signed")
68b4ee68e8c8 ("locking/refcount: Define constants for saturation and max refcount values")

I don't believe there is actually anything wrong with the above changes. Maybe it's some tiny timing thing that my board just doesn't like?

##### 2. Clock (arm,armv7-timer)

While comparing the main clock in Broadcom's SDK with the upstream one I noticed a tiny difference: the mask value. I don't know if it makes any sense, but switching from CLOCKSOURCE_MASK(56) to CLOCKSOURCE_MASK(64) in arm_arch_timer.c (to match the SDK) increases average uptime (time before a hang/lockup happens) from 4 minutes to 36 minutes.

##### 3. Random code changes

During my bisecting attempts I found one commit that regressed kernel stability but whose actual changes were meaningless in the context of locking. It was commit ad9b10d1eaad ("mtd: core: introduce of support for dynamic partitions"):
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ad9b10d1eaada169bd764abcab58f08538877e26

I thought that maybe it was all about making add_mtd_device() bigger and changing the addresses of a lot of symbols (looking at System.map). So I reverted that mtd commit and developed a dummy change relocating as few symbols (System.map) as possible while still breaking stability (see the kernel/sched/idle.c diff below).

###

As those hangs/lockups are related to so many different changes it's really hard to debug them.

This bug seems to be specific to the slow arch clock, and it affects stability only when kernel locking code and symbol layout trigger some very specific timing.

Enabling CONFIG_PROVE_LOCKING seems to make the issue go away but it affects so much code it's hard to tell why it actually matters.

Same for disabling CONFIG_SMP. I noticed Broadcom's SDK keeps it disabled. I tried it and it indeed improves stability (I had 3 devices with 6 days of uptime and counting). Again it affects a lot of kernel parts so it's hard to tell why it helps.

Unless someone comes up with some magic solution I'll probably try building BCM53573 images without CONFIG_SMP for my personal needs.

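For readers unfamiliar with the mask mentioned in section 2, here is a minimal user-space sketch of what a clocksource mask is used for, assuming the usual "low N bits set" definition and a masked subtraction for cycle deltas. The function names (clocksource_mask, cycles_delta, seconds_to_wrap) are hypothetical helpers written for this note, not kernel API, and the sketch says nothing about why the mask change affects uptime on this board.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Value with the low `bits` bits set (assumed meaning of CLOCKSOURCE_MASK). */
static uint64_t clocksource_mask(unsigned int bits)
{
    assert(bits >= 1 && bits <= 64);
    /* Shifting a 64-bit value by 64 is undefined, so handle it separately. */
    return bits == 64 ? UINT64_MAX : (UINT64_C(1) << bits) - 1;
}

/* Cycles elapsed between two raw counter reads, tolerant of one wrap-around. */
static uint64_t cycles_delta(uint64_t now, uint64_t last, uint64_t mask)
{
    return (now - last) & mask;
}

/* Approximate seconds until a `bits`-wide counter wraps at freq_hz. */
static double seconds_to_wrap(unsigned int bits, double freq_hz)
{
    return ((double)clocksource_mask(bits) + 1.0) / freq_hz;
}

#ifdef DEMO_MAIN /* build with -DDEMO_MAIN for a small command-line demo */
int main(void)
{
    const double freq_hz = 36800.0; /* 36.8 kHz, the rate quoted in this message */
    uint64_t m56 = clocksource_mask(56);
    uint64_t last = m56 - 4; /* 5 ticks before a hypothetical 56-bit wrap */
    uint64_t now = 5;        /* 5 ticks after it */

    printf("mask56 = 0x%016" PRIx64 ", delta = %" PRIu64 "\n",
           m56, cycles_delta(now, last, m56));
    printf("mask64 delta = %" PRIu64 "\n",
           cycles_delta(now, last, clocksource_mask(64)));
    printf("56-bit wrap after ~%.3g s\n", seconds_to_wrap(56, freq_hz));
    return 0;
}
#endif
```

With a 56-bit mask the two reads above come out 10 cycles apart. With a 64-bit mask the same reads give a huge delta, but only if the hardware counter really were 56 bits wide and had wrapped. At 36.8 kHz a 56-bit wrap would take on the order of 10^12 seconds, so an actual wrap does not look like a plausible explanation for hangs within minutes.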
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -94,6 +94,21 @@ void __cpuidle default_idle_call(void)
 		arch_cpu_idle();
 		start_critical_timings();
 	}
+
+	if (cpu_idle_force_poll == 1234)
+		arch_cpu_idle();
+	if (cpu_idle_force_poll == 5678)
+		arch_cpu_idle();
+	if (cpu_idle_force_poll == 1234)
+		arch_cpu_idle();
+	if (cpu_idle_force_poll == 5678)
+		arch_cpu_idle();
+	if (cpu_idle_force_poll == 1234)
+		arch_cpu_idle();
+	if (cpu_idle_force_poll == 5678)
+		arch_cpu_idle();
+	if (cpu_idle_force_poll == 1234)
+		arch_cpu_idle();
 }
 
 static int call_cpuidle(struct cpuidle_driver *drv, struct cpuidle_device *dev,

The above dummy change didn't relocate thousands of symbols but only about 20 of them. They happened to be lock symbols, however.

Does it make any sense for the above diff to regress kernel stability for me and cause hangs/lockups?

--- System.map.good
+++ System.map.bad
@@ -22214,36 +22214,36 @@
 c062e7e0 T __cpuidle_text_start
 c062e7e0 t cpu_idle_poll
 c062e860 T default_idle_call
-c062e884 T __cpuidle_text_end
-c062e888 T __lock_text_start
-c062e8a0 T _raw_spin_unlock_irqrestore
-c062e8c0 T _raw_spin_trylock
-c062e900 T _raw_write_unlock_irqrestore
-c062e920 T _raw_read_trylock
-c062e960 T _raw_write_trylock
-c062e9a0 T _raw_spin_lock_bh
-c062ea00 T _raw_read_lock_bh
-c062ea40 T _raw_write_lock_bh
-c062ea80 T _raw_spin_trylock_bh
-c062eb00 T _raw_spin_unlock_bh
-c062eb40 T _raw_write_unlock_bh
-c062eb80 T _raw_read_unlock_bh
-c062ebc0 T _raw_read_unlock_irqrestore
-c062ec00 T _raw_write_lock
-c062ec40 T _raw_write_lock_irq
-c062ec80 T _raw_write_lock_irqsave
-c062ecc0 T _raw_read_lock
-c062ed00 T _raw_spin_lock
-c062ed40 T _raw_read_lock_irq
-c062ed80 T _raw_spin_lock_irq
-c062ede0 T _raw_spin_lock_irqsave
-c062ee40 T _raw_read_lock_irqsave
-c062ee70 T __hyp_text_end
-c062ee70 T __hyp_text_start
-c062ee70 T __kprobes_text_end
-c062ee70 T __kprobes_text_start
-c062ee70 T __lock_text_end
-c062ee70 T _etext
+c062e954 T __cpuidle_text_end
+c062e958 T __lock_text_start
+c062e960 T _raw_spin_unlock_irqrestore
+c062e980 T _raw_spin_trylock
+c062e9c0 T _raw_write_unlock_irqrestore
+c062e9e0 T _raw_read_trylock
+c062ea20 T _raw_write_trylock
+c062ea60 T _raw_spin_lock_bh
+c062eac0 T _raw_read_lock_bh
+c062eb00 T _raw_write_lock_bh
+c062eb40 T _raw_spin_trylock_bh
+c062ebc0 T _raw_spin_unlock_bh
+c062ec00 T _raw_write_unlock_bh
+c062ec40 T _raw_read_unlock_bh
+c062ec80 T _raw_read_unlock_irqrestore
+c062ecc0 T _raw_write_lock
+c062ed00 T _raw_write_lock_irq
+c062ed40 T _raw_write_lock_irqsave
+c062ed80 T _raw_read_lock
+c062edc0 T _raw_spin_lock
+c062ee00 T _raw_read_lock_irq
+c062ee40 T _raw_spin_lock_irq
+c062eea0 T _raw_spin_lock_irqsave
+c062ef00 T _raw_read_lock_irqsave
+c062ef30 T __hyp_text_end
+c062ef30 T __hyp_text_start
+c062ef30 T __kprobes_text_end
+c062ef30 T __kprobes_text_start
+c062ef30 T __lock_text_end
+c062ef30 T _etext
 c062f000 D __start_rodata
 c062f000 D sigreturn_codes
 c062f044 d cpu_arch_name
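
To make the relocation in the System.map diff above easier to read, the following sketch computes how far a few of the listed symbols moved. The addresses are copied from the diff. The struct and function names (sym_pair, sym_shift, offset_in_block) are hypothetical, and the 64-byte block size in the demo is only an illustrative assumption, not a statement about this SoC's caches.

```c
#include <assert.h>
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* One symbol with its address in the good and the bad build. */
struct sym_pair {
    const char *name;
    uint32_t good;
    uint32_t bad;
};

/* A few rows copied from the System.map.good / System.map.bad diff. */
static const struct sym_pair pairs[] = {
    { "__cpuidle_text_end",          0xc062e884, 0xc062e954 },
    { "__lock_text_start",           0xc062e888, 0xc062e958 },
    { "_raw_spin_unlock_irqrestore", 0xc062e8a0, 0xc062e960 },
    { "_raw_spin_lock",              0xc062ed00, 0xc062edc0 },
    { "_raw_spin_lock_irqsave",      0xc062ede0, 0xc062eea0 },
    { "_etext",                      0xc062ee70, 0xc062ef30 },
};

/* Number of bytes a symbol moved between the two builds. */
static uint32_t sym_shift(const struct sym_pair *p)
{
    return p->bad - p->good;
}

/* Offset of an address inside an aligned block of `block` bytes. */
static uint32_t offset_in_block(uint32_t addr, uint32_t block)
{
    return addr % block;
}

#ifdef DEMO_MAIN /* build with -DDEMO_MAIN to print the table */
int main(void)
{
    for (size_t i = 0; i < sizeof(pairs) / sizeof(pairs[0]); i++) {
        const struct sym_pair *p = &pairs[i];

        printf("%-28s shift=0x%" PRIx32 " offset(good)=%" PRIu32
               " offset(bad)=%" PRIu32 "\n",
               p->name, sym_shift(p),
               offset_in_block(p->good, 64), offset_in_block(p->bad, 64));
    }
    return 0;
}
#endif
```

For the rows above, the two section markers after default_idle_call move by 0xd0 bytes while the _raw_* lock functions and _etext move by 0xc0 bytes. Since 0xc0 is a multiple of 64, the lock functions keep the same offset within any 64-byte-aligned block, so whatever changes the timing, it does not appear to be a change in their alignment at that granularity.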