From patchwork Fri May 31 06:13:01 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Stewart Smith X-Patchwork-Id: 1108102 Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@bilbo.ozlabs.org Received: from lists.ozlabs.org (lists.ozlabs.org [IPv6:2401:3900:2:1::3]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits)) (No client certificate requested) by ozlabs.org (Postfix) with ESMTPS id 45FZHh0nrHz9s00 for ; Fri, 31 May 2019 16:27:28 +1000 (AEST) Authentication-Results: ozlabs.org; dmarc=none (p=none dis=none) header.from=linux.ibm.com Received: from lists.ozlabs.org (lists.ozlabs.org [IPv6:2401:3900:2:1::3]) by lists.ozlabs.org (Postfix) with ESMTP id 45FZHg6bmDzDqK6 for ; Fri, 31 May 2019 16:27:27 +1000 (AEST) X-Original-To: skiboot@lists.ozlabs.org Delivered-To: skiboot@lists.ozlabs.org Authentication-Results: lists.ozlabs.org; spf=pass (mailfrom) smtp.mailfrom=linux.ibm.com (client-ip=148.163.156.1; helo=mx0a-001b2d01.pphosted.com; envelope-from=stewart@linux.ibm.com; receiver=) Authentication-Results: lists.ozlabs.org; dmarc=none (p=none dis=none) header.from=linux.ibm.com Received: from mx0a-001b2d01.pphosted.com (mx0a-001b2d01.pphosted.com [148.163.156.1]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by lists.ozlabs.org (Postfix) with ESMTPS id 45FZ0N1dmhzDqYC for ; Fri, 31 May 2019 16:14:12 +1000 (AEST) Received: from pps.filterd (m0098394.ppops.net [127.0.0.1]) by mx0a-001b2d01.pphosted.com (8.16.0.27/8.16.0.27) with SMTP id x4V64Jxu078514 for ; Fri, 31 May 2019 02:14:09 -0400 Received: from e31.co.us.ibm.com (e31.co.us.ibm.com [32.97.110.149]) by mx0a-001b2d01.pphosted.com with ESMTP id 2stv42d088-1 (version=TLSv1.2 cipher=AES256-GCM-SHA384 bits=256 verify=NOT) for ; Fri, 31 May 2019 02:14:09 -0400 Received: from localhost by e31.co.us.ibm.com with IBM ESMTP SMTP Gateway: Authorized Use Only! Violators will be prosecuted for from ; Fri, 31 May 2019 07:14:08 +0100 Received: from b03cxnp07029.gho.boulder.ibm.com (9.17.130.16) by e31.co.us.ibm.com (192.168.1.131) with IBM ESMTP SMTP Gateway: Authorized Use Only! Violators will be prosecuted; (version=TLSv1/SSLv3 cipher=AES256-GCM-SHA384 bits=256/256) Fri, 31 May 2019 07:14:05 +0100 Received: from b03ledav006.gho.boulder.ibm.com (b03ledav006.gho.boulder.ibm.com [9.17.130.237]) by b03cxnp07029.gho.boulder.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP id x4V6E4pD41877736 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK) for ; Fri, 31 May 2019 06:14:04 GMT Received: from b03ledav006.gho.boulder.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id 8E1A3C605B for ; Fri, 31 May 2019 06:14:04 +0000 (GMT) Received: from b03ledav006.gho.boulder.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id 2A6E3C605A for ; Fri, 31 May 2019 06:14:04 +0000 (GMT) Received: from birb.localdomain (unknown [9.185.142.91]) by b03ledav006.gho.boulder.ibm.com (Postfix) with ESMTP for ; Fri, 31 May 2019 06:14:04 +0000 (GMT) Received: by birb.localdomain (Postfix, from userid 1000) id 7C1C4507BD0; Fri, 31 May 2019 16:13:57 +1000 (AEST) From: Stewart Smith To: skiboot@lists.ozlabs.org Date: Fri, 31 May 2019 16:13:01 +1000 X-Mailer: git-send-email 2.21.0 In-Reply-To: <20190531061351.22973-1-stewart@linux.ibm.com> References: <20190531061351.22973-1-stewart@linux.ibm.com> MIME-Version: 1.0 X-TM-AS-GCONF: 00 x-cbid: 19053106-8235-0000-0000-00000EA0F3AB X-IBM-SpamModules-Scores: X-IBM-SpamModules-Versions: BY=3.00011189; HX=3.00000242; KW=3.00000007; PH=3.00000004; SC=3.00000286; SDB=6.01211088; UDB=6.00636341; IPR=6.00992124; MB=3.00027126; MTD=3.00000008; XFM=3.00000015; UTC=2019-05-31 06:14:06 X-IBM-AV-DETECTION: SAVI=unused REMOTE=unused XFE=unused x-cbparentid: 19053106-8236-0000-0000-000045C9C0EF Message-Id: <20190531061351.22973-61-stewart@linux.ibm.com> X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10434:, , definitions=2019-05-31_03:, , signatures=0 X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0 priorityscore=1501 malwarescore=0 suspectscore=13 phishscore=0 bulkscore=0 spamscore=0 clxscore=1015 lowpriorityscore=0 mlxscore=0 impostorscore=0 mlxlogscore=999 adultscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.0.1-1810050000 definitions=main-1905310040 Subject: [Skiboot] [PATCH 060/110] skiboot 6.0.9 release notes X-BeenThere: skiboot@lists.ozlabs.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Mailing list for skiboot development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: skiboot-bounces+incoming=patchwork.ozlabs.org@lists.ozlabs.org Sender: "Skiboot" Signed-off-by: Stewart Smith (cherry picked from commit 19484fa4338f84617ae8282f17b43d658afc3589) Signed-off-by: Stewart Smith --- doc/release-notes/skiboot-6.0.9.rst | 139 ++++++++++++++++++++++++++++ 1 file changed, 139 insertions(+) create mode 100644 doc/release-notes/skiboot-6.0.9.rst diff --git a/doc/release-notes/skiboot-6.0.9.rst b/doc/release-notes/skiboot-6.0.9.rst new file mode 100644 index 000000000000..87eb797843ff --- /dev/null +++ b/doc/release-notes/skiboot-6.0.9.rst @@ -0,0 +1,139 @@ +.. _skiboot-6.0.9: + +============= +skiboot-6.0.9 +============= + +skiboot 6.0.9 was released on Friday October 12th, 2018. It replaces +:ref:`skiboot-6.0.8` as the current stable release in the 6.0.x series. + +It is recommended that 6.0.9 be used instead of any previous 6.0.x version +due to the bug fixes it contains. + +The bug fixes are: + +- opal/hmi: Ignore debug trigger inject core FIR. + + Core FIR[60] is a side effect of the work around for the CI Vector Load + issue in DD2.1. Usually this gets delivered as HMI with HMER[17] where + Linux already ignores it. But it looks like in some cases we may happen + to see CORE_FIR[60] while we are already in Malfunction Alert HMI + (HMER[0]) due to other reasons e.g. CAPI recovery or NPU xstop. If that + happens then just ignore it instead of crashing kernel as not recoverable. + +- opal/hmi: Handle early HMIs on thread0 when secondaries are still in OPAL. + + When primary thread receives a CORE level HMI for timer facility errors + while secondaries are still in OPAL, thread 0 ends up in rendez-vous + waiting for secondaries to get into hmi handling. This is because OPAL + runs with MSR(EE=0) and hence HMIs are delayed on secondary threads until + they are given to Linux OS. Fix this by adding a check for secondary + state and force them in hmi handling by queuing job on secondary threads. + + I have tested this by injecting HDEC parity error very early during Linux + kernel boot. Recovery works fine for non-TB errors. But if TB is bad at + this very eary stage we already doomed. + + Without this patch we see: :: + + [ 285.046347408,7] OPAL: Start CPU 0x0843 (PIR 0x0843) -> 0x000000000000a83c + [ 285.051160609,7] OPAL: Start CPU 0x0844 (PIR 0x0844) -> 0x000000000000a83c + [ 285.055359021,7] HMI: Received HMI interrupt: HMER = 0x0840000000000000 + [ 285.055361439,7] HMI: [Loc: U78D3.ND1.WZS004A-P1-C48]: P:8 C:17 T:0: TFMR(2e12002870e14000) Timer Facility Error + [ 286.232183823,3] HMI: Rendez-vous stage 1 timeout, CPU 0x844 waiting for thread 1 (sptr=0000ccc1) + [ 287.409002056,3] HMI: Rendez-vous stage 1 timeout, CPU 0x844 waiting for thread 2 (sptr=0000ccc1) + [ 289.073820164,3] HMI: Rendez-vous stage 1 timeout, CPU 0x844 waiting for thread 3 (sptr=0000ccc1) + [ 290.250638683,3] HMI: Rendez-vous stage 1 timeout, CPU 0x844 waiting for thread 1 (sptr=0000ccc2) + [ 291.427456821,3] HMI: Rendez-vous stage 1 timeout, CPU 0x844 waiting for thread 2 (sptr=0000ccc2) + [ 293.092274807,3] HMI: Rendez-vous stage 1 timeout, CPU 0x844 waiting for thread 3 (sptr=0000ccc2) + [ 294.269092904,3] HMI: Rendez-vous stage 1 timeout, CPU 0x844 waiting for thread 1 (sptr=0000ccc3) + [ 295.445910944,3] HMI: Rendez-vous stage 1 timeout, CPU 0x844 waiting for thread 2 (sptr=0000ccc3) + [ 297.110728970,3] HMI: Rendez-vous stage 1 timeout, CPU 0x844 waiting for thread 3 (sptr=0000ccc3) + + After this patch: :: + + [ 259.401719351,7] OPAL: Start CPU 0x0841 (PIR 0x0841) -> 0x000000000000a83c + [ 259.406259572,7] OPAL: Start CPU 0x0842 (PIR 0x0842) -> 0x000000000000a83c + [ 259.410615534,7] OPAL: Start CPU 0x0843 (PIR 0x0843) -> 0x000000000000a83c + [ 259.415444519,7] OPAL: Start CPU 0x0844 (PIR 0x0844) -> 0x000000000000a83c + [ 259.419641401,7] HMI: Received HMI interrupt: HMER = 0x0840000000000000 + [ 259.419644124,7] HMI: [Loc: U78D3.ND1.WZS004A-P1-C48]: P:8 C:17 T:0: TFMR(2e12002870e04000) Timer Facility Error + [ 259.419650678,7] HMI: Sending hmi job to thread 1 + [ 259.419652744,7] HMI: Sending hmi job to thread 2 + [ 259.419653051,7] HMI: Received HMI interrupt: HMER = 0x0840000000000000 + [ 259.419654725,7] HMI: Sending hmi job to thread 3 + [ 259.419654916,7] HMI: Received HMI interrupt: HMER = 0x0840000000000000 + [ 259.419658025,7] HMI: Received HMI interrupt: HMER = 0x0840000000000000 + [ 259.419658406,7] HMI: [Loc: U78D3.ND1.WZS004A-P1-C48]: P:8 C:17 T:2: TFMR(2e12002870e04000) Timer Facility Error + [ 259.419663095,7] HMI: [Loc: U78D3.ND1.WZS004A-P1-C48]: P:8 C:17 T:3: TFMR(2e12002870e04000) Timer Facility Error + [ 259.419655234,7] HMI: [Loc: U78D3.ND1.WZS004A-P1-C48]: P:8 C:17 T:1: TFMR(2e12002870e04000) Timer Facility Error + [ 259.425109779,7] OPAL: Start CPU 0x0845 (PIR 0x0845) -> 0x000000000000a83c + [ 259.429870681,7] OPAL: Start CPU 0x0846 (PIR 0x0846) -> 0x000000000000a83c + [ 259.434549250,7] OPAL: Start CPU 0x0847 (PIR 0x0847) -> 0x000000000000a83c + +- hw/bt.c: quieten all the noisy BT/IPMI messages +- npu2: Use correct kill type for TCE invalidation + + kill_type is enum of OPAL_PCI_TCE_KILL_PAGES, OPAL_PCI_TCE_KILL_PE, + OPAL_PCI_TCE_KILL_ALL and phb4_tce_kill() gets it right but + npu2_tce_kill() uses OPAL_PCI_TCE_KILL which is an OPAL API token. + +- hw/npu2-opencapi: Fix setting of supported OpenCAPI templates + + In opal_npu_tl_set(), we made a typo that means the OPAL_NPU_TL_SET call + may not clear the enable bits for templates that were previously enabled + but are now disabled. + + Fix the typo so we clear NPU2_OTL_CONFIG1_TX_TEMP2_EN as well as + TEMP{1,3}_EN. + +- phb4: Workaround PHB errata with CFG write UR/CA errors + + If the PHB encounters a UR or CA status on a CFG write, it will + incorrectly freeze the wrong PE. Instead of using the PE# specified + in the CONFIG_ADDRESS register, it will use the PE# of whatever + MMIO occurred last. + + Work around this disabling freeze on such errors + +- phb4: Handle allocation errors in phb4_eeh_dump_regs() + + If the zalloc fails (and it can be a rather large allocation), + we will overwite memory at 0 instead of failing. + +- phb4: Don't try to access non-existent PEST entries + + In a POWER9 chip, some PHB4s have 256 PEs, some have 512. + + Currently, the diagnostics code retrieves 512 unconditionally, + which is wrong and causes us to incorrectly report bogus values + for the "high" PEs on the small PHBs. + + Use the actual number of implemented PEs instead + +- phb4: Don't probe a PHB if its garded + + Presently phb4_probe_stack() causes an exception while trying to probe + a PHB if its garded. This causes skiboot to go into a reboot loop with + following exception log: :: + + *********************************************** + Fatal MCE at 000000003006ecd4 .probe_phb4+0x570 + CFAR : 00000000300b98a0 + + Aborting! + CPU 0018 Backtrace: + S: 0000000031cc37e0 R: 000000003001a51c ._abort+0x4c + S: 0000000031cc3860 R: 0000000030028170 .exception_entry+0x180 + S: 0000000031cc3a40 R: 0000000000001f10 * + S: 0000000031cc3c20 R: 000000003006ecb0 .probe_phb4+0x54c + S: 0000000031cc3e30 R: 0000000030014ca4 .main_cpu_entry+0x5b0 + S: 0000000031cc3f00 R: 0000000030002700 boot_entry+0x1b8 + + This is caused as phb4_probe_stack() will ignore all xscom read/write + errors to enable PHB Bars and then tries to perform an mmio to read + PHB Version registers that cause the fatal MCE. + + We fix this by ignoring the PHB probe if the first xscom_write() to + populate the PHB Bar register fails, which indicates that there is + something wrong with the PHB.