witherspoon: Squash spurious I2C errors

Message ID	20191127044758.32596-1-oohall@gmail.com
State	Accepted
Headers	show Return-Path: <skiboot-bounces+incoming=patchwork.ozlabs.org@lists.ozlabs.org> From: Oliver O'Halloran <oohall@gmail.com> To: skiboot@lists.ozlabs.org Date: Wed, 27 Nov 2019 15:47:58 +1100 Message-Id: <20191127044758.32596-1-oohall@gmail.com> MIME-Version: 1.0 Subject: [Skiboot] [PATCH] witherspoon: Squash spurious I2C errors Precedence: list Cc: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Errors-To: skiboot-bounces+incoming=patchwork.ozlabs.org@lists.ozlabs.org Sender: "Skiboot" <skiboot-bounces+incoming=patchwork.ozlabs.org@lists.ozlabs.org>
Series	witherspoon: Squash spurious I2C errors \| expand witherspoon: Squash spurious I2C errors

Message ID

20191127044758.32596-1-oohall@gmail.com

State

Accepted

Headers

From: Oliver O'Halloran <oohall@gmail.com>
To: skiboot@lists.ozlabs.org
Date: Wed, 27 Nov 2019 15:47:58 +1100
Message-Id: <20191127044758.32596-1-oohall@gmail.com>
MIME-Version: 1.0
Subject: [Skiboot] [PATCH] witherspoon: Squash spurious I2C errors
Precedence: list
Cc: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Errors-To: skiboot-bounces+incoming=patchwork.ozlabs.org@lists.ozlabs.org
Sender: "Skiboot"
	<skiboot-bounces+incoming=patchwork.ozlabs.org@lists.ozlabs.org>

Series

witherspoon: Squash spurious I2C errors | expand

Checks

Context	Check	Description
snowpatch_ozlabs/apply_patch	success	Successfully applied on branch master (d75e82dbfbb9443efeb3f9a5921ac23605aab469)
snowpatch_ozlabs/snowpatch_job_snowpatch-skiboot	success	Test snowpatch/job/snowpatch-skiboot on branch master
snowpatch_ozlabs/snowpatch_job_snowpatch-skiboot-dco	success	Signed-off-by present

Context

Check

Description

snowpatch_ozlabs/apply_patch

success

Successfully applied on branch master (d75e82dbfbb9443efeb3f9a5921ac23605aab469)

snowpatch_ozlabs/snowpatch_job_snowpatch-skiboot

success

Test snowpatch/job/snowpatch-skiboot on branch master

snowpatch_ozlabs/snowpatch_job_snowpatch-skiboot-dco

success

Signed-off-by present

Commit Message

Oliver O'Halloran Nov. 27, 2019, 4:47 a.m. UTC

On witherspoon there's an I2C bus connecting the each chip to the GPUs
connected to that chip via NVLink. That bus has a 750kOhm pullup on the
system planar with prevents the bus from operating correctly.

Each GPU has a smaller pullup which makes the bus usable when a GPU is
plugged in, but on systems without GPUs we get a lot of spurious I2C
master errors. Specificly, because of the oversized pullup the SDA and
SCL for that bus cannot return to '1' fast enough, so the master assumes
that another master is driving the bus and that it should stop.

Work around this by disabling the affected port when there's no GPUs
detected in the system.

Cc: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
---
Tested on a system with a GPU on each socket and on a system with none.
Couldn't find a mixed system.
---
 platforms/astbmc/witherspoon.c | 25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)

Comments

Andrew Donnellan Nov. 27, 2019, 5:04 a.m. UTC | #1

On 27/11/19 3:47 pm, Oliver O'Halloran wrote:
> On witherspoon there's an I2C bus connecting the each chip to the GPUs
> connected to that chip via NVLink. That bus has a 750kOhm pullup on the
> system planar with prevents the bus from operating correctly.
> 
> Each GPU has a smaller pullup which makes the bus usable when a GPU is
> plugged in, but on systems without GPUs we get a lot of spurious I2C
> master errors. Specificly, because of the oversized pullup the SDA and
> SCL for that bus cannot return to '1' fast enough, so the master assumes
> that another master is driving the bus and that it should stop.
> 
> Work around this by disabling the affected port when there's no GPUs
> detected in the system.
> 
> Cc: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
> Signed-off-by: Oliver O'Halloran <oohall@gmail.com>

Looks reasonable.

Reviewed-by: Andrew Donnellan <ajd@linux.ibm.com>

> ---
> Tested on a system with a GPU on each socket and on a system with none.
> Couldn't find a mixed system.
> ---
>   platforms/astbmc/witherspoon.c | 25 +++++++++++++++++++++++++
>   1 file changed, 25 insertions(+)
> 
> diff --git a/platforms/astbmc/witherspoon.c b/platforms/astbmc/witherspoon.c
> index edf84fb89601..081996684603 100644
> --- a/platforms/astbmc/witherspoon.c
> +++ b/platforms/astbmc/witherspoon.c
> @@ -465,6 +465,7 @@ static void npu2_phb_nvlink_dt(struct phb *npuphb)
>   static void witherspoon_finalise_dt(bool is_reboot)
>   {
>   	struct dt_node *np;
> +	struct proc_chip *c;
>   
>   	if (is_reboot)
>   		return;
> @@ -479,6 +480,30 @@ static void witherspoon_finalise_dt(bool is_reboot)
>   			continue;
>   		npu2_phb_nvlink_dt(npphb);
>   	}
> +
> +	/*
> +	 * The I2C bus on used to talk to the GPUs has a 750K pullup
> +	 * which is way too big. If there's no GPUs connected to the
> +	 * chip all I2C transactions fail with an Arb loss error since
> +	 * SCL/SDA don't return to the idle state fast enough. Disable
> +	 * the port to squash the errors.
> +	 */
> +	for (c = next_chip(0); c; c = next_chip(c)) {
> +		bool detected = false;
> +		int i;
> +
> +		np = dt_find_by_path(c->devnode, "i2cm@a1000/i2c-bus@4");
> +		if (!np)
> +			continue;
> +
> +		for (i = 0; i < 3; i++)
> +			detected |= occ_get_gpu_presence(c, i);
> +
> +		if (!detected) {
> +			dt_check_del_prop(np, "status");
> +			dt_add_property_string(np, "status", "disabled");
> +		}
> +	}
>   }
>   
>   /* The only difference between these is the PCI slot handling */
>

Oliver O'Halloran Dec. 16, 2019, 3:59 a.m. UTC | #2

On Wed, Nov 27, 2019 at 3:48 PM Oliver O'Halloran <oohall@gmail.com> wrote:
>
> On witherspoon there's an I2C bus connecting the each chip to the GPUs
> connected to that chip via NVLink. That bus has a 750kOhm pullup on the
> system planar with prevents the bus from operating correctly.
>
> Each GPU has a smaller pullup which makes the bus usable when a GPU is
> plugged in, but on systems without GPUs we get a lot of spurious I2C
> master errors. Specificly, because of the oversized pullup the SDA and
> SCL for that bus cannot return to '1' fast enough, so the master assumes
> that another master is driving the bus and that it should stop.
>
> Work around this by disabling the affected port when there's no GPUs
> detected in the system.
>
> Cc: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
> Signed-off-by: Oliver O'Halloran <oohall@gmail.com>

Merged as dc85bd46d20db4d878216283f818573cb0c8c05b

diff --git a/platforms/astbmc/witherspoon.c b/platforms/astbmc/witherspoon.c
index edf84fb89601..081996684603 100644
--- a/platforms/astbmc/witherspoon.c
+++ b/platforms/astbmc/witherspoon.c
@@ -465,6 +465,7 @@  static void npu2_phb_nvlink_dt(struct phb *npuphb)
 static void witherspoon_finalise_dt(bool is_reboot)
 {
 	struct dt_node *np;
+	struct proc_chip *c;
 
 	if (is_reboot)
 		return;
@@ -479,6 +480,30 @@  static void witherspoon_finalise_dt(bool is_reboot)
 			continue;
 		npu2_phb_nvlink_dt(npphb);
 	}
+
+	/*
+	 * The I2C bus on used to talk to the GPUs has a 750K pullup
+	 * which is way too big. If there's no GPUs connected to the
+	 * chip all I2C transactions fail with an Arb loss error since
+	 * SCL/SDA don't return to the idle state fast enough. Disable
+	 * the port to squash the errors.
+	 */
+	for (c = next_chip(0); c; c = next_chip(c)) {
+		bool detected = false;
+		int i;
+
+		np = dt_find_by_path(c->devnode, "i2cm@a1000/i2c-bus@4");
+		if (!np)
+			continue;
+
+		for (i = 0; i < 3; i++)
+			detected |= occ_get_gpu_presence(c, i);
+
+		if (!detected) {
+			dt_check_del_prop(np, "status");
+			dt_add_property_string(np, "status", "disabled");
+		}
+	}
 }
 
 /* The only difference between these is the PCI slot handling */

witherspoon: Squash spurious I2C errors

Checks

Commit Message

Comments

Patch