diff mbox series

npu2/hw-procedures: Remove assertion from check_credits()

Message ID 20191114161304.18130-1-arbab@linux.ibm.com
State Accepted
Headers show
Series npu2/hw-procedures: Remove assertion from check_credits() | expand

Checks

Context Check Description
snowpatch_ozlabs/apply_patch success Successfully applied on branch master (d75e82dbfbb9443efeb3f9a5921ac23605aab469)
snowpatch_ozlabs/snowpatch_job_snowpatch-skiboot success Test snowpatch/job/snowpatch-skiboot on branch master
snowpatch_ozlabs/snowpatch_job_snowpatch-skiboot-dco success Signed-off-by present

Commit Message

Reza Arbab Nov. 14, 2019, 4:13 p.m. UTC
The RX clock mux in the NVLink PHY can glitch, which will manifest in
hard to diagnose behavior--at best, a checkstop during the first link
traffic. The only reliable way we found to detect this was by checking
for a discrepancy in the credits we expect to receive during link
training.

Since the time the check was added, we've found that

* Commit ac6f1599ff33 ("npu2: hw-procedures: Add phy_rx_clock_sel()")
does work around the original glitch.

* Asserting is too harsh. Before root cause was established, it was
thought this could have been a manufacturing defect and we wanted to
loudly fail hardware acceptance boot cycle tests.

* It seems there is a valid situation in which credits are off from
the expected value. During GPU hot reset, a CPU prefetch across the link
can affect the credit count before we check.

Given all of the above, remove the assert().

Cc: stable # 6.0.x
Signed-off-by: Reza Arbab <arbab@linux.ibm.com>
---
 hw/npu2-hw-procedures.c | 15 ++++++---------
 1 file changed, 6 insertions(+), 9 deletions(-)

Comments

Oliver O'Halloran Nov. 20, 2019, 12:25 a.m. UTC | #1
On Fri, Nov 15, 2019 at 3:13 AM Reza Arbab <arbab@linux.ibm.com> wrote:
>
> The RX clock mux in the NVLink PHY can glitch, which will manifest in
> hard to diagnose behavior--at best, a checkstop during the first link
> traffic. The only reliable way we found to detect this was by checking
> for a discrepancy in the credits we expect to receive during link
> training.
>
> Since the time the check was added, we've found that
>
> * Commit ac6f1599ff33 ("npu2: hw-procedures: Add phy_rx_clock_sel()")
> does work around the original glitch.
>
> * Asserting is too harsh. Before root cause was established, it was
> thought this could have been a manufacturing defect and we wanted to
> loudly fail hardware acceptance boot cycle tests.
>
> * It seems there is a valid situation in which credits are off from
> the expected value. During GPU hot reset, a CPU prefetch across the link
> can affect the credit count before we check.
>
> Given all of the above, remove the assert().
>
> Cc: stable # 6.0.x
> Signed-off-by: Reza Arbab <arbab@linux.ibm.com>

Merged as 24664b48642845d620e225111bf6184f3c102f60

Vasant, can you pick this up? I don't think you're monitoring
stable@arbab-laptop.localdomain, but maybe you are ;)
Reza Arbab Dec. 2, 2019, 7:49 p.m. UTC | #2
On Wed, Nov 20, 2019 at 11:25:49AM +1100, Oliver O'Halloran wrote:
>On Fri, Nov 15, 2019 at 3:13 AM Reza Arbab <arbab@linux.ibm.com> wrote:
>> Cc: stable # 6.0.x
>> Signed-off-by: Reza Arbab <arbab@linux.ibm.com>
>
>Merged as 24664b48642845d620e225111bf6184f3c102f60
>
>Vasant, can you pick this up? I don't think you're monitoring
>stable@arbab-laptop.localdomain, but maybe you are ;)

There's nothing stable about my laptop. :)

Vasant, ping?
diff mbox series

Patch

diff --git a/hw/npu2-hw-procedures.c b/hw/npu2-hw-procedures.c
index 86864629a66b..2a03beda606a 100644
--- a/hw/npu2-hw-procedures.c
+++ b/hw/npu2-hw-procedures.c
@@ -756,17 +756,14 @@  static uint32_t check_credit(struct npu2_dev *ndev, uint64_t reg,
 
 static uint32_t check_credits(struct npu2_dev *ndev)
 {
-	int fail = 0;
 	uint64_t val;
 
-	fail += CHECK_CREDIT(ndev, NPU2_NTL_CRED_HDR_CREDIT_RX, 0x0BE0BE0000000000ULL);
-	fail += CHECK_CREDIT(ndev, NPU2_NTL_RSP_HDR_CREDIT_RX, 0x0BE0BE0000000000ULL);
-	fail += CHECK_CREDIT(ndev, NPU2_NTL_CRED_DATA_CREDIT_RX, 0x1001000000000000ULL);
-	fail += CHECK_CREDIT(ndev, NPU2_NTL_RSP_DATA_CREDIT_RX, 0x1001000000000000ULL);
-	fail += CHECK_CREDIT(ndev, NPU2_NTL_DBD_HDR_CREDIT_RX, 0x0640640000000000ULL);
-	fail += CHECK_CREDIT(ndev, NPU2_NTL_ATSD_HDR_CREDIT_RX, 0x0200200000000000ULL);
-
-	assert(!fail);
+	CHECK_CREDIT(ndev, NPU2_NTL_CRED_HDR_CREDIT_RX, 0x0BE0BE0000000000ULL);
+	CHECK_CREDIT(ndev, NPU2_NTL_RSP_HDR_CREDIT_RX, 0x0BE0BE0000000000ULL);
+	CHECK_CREDIT(ndev, NPU2_NTL_CRED_DATA_CREDIT_RX, 0x1001000000000000ULL);
+	CHECK_CREDIT(ndev, NPU2_NTL_RSP_DATA_CREDIT_RX, 0x1001000000000000ULL);
+	CHECK_CREDIT(ndev, NPU2_NTL_DBD_HDR_CREDIT_RX, 0x0640640000000000ULL);
+	CHECK_CREDIT(ndev, NPU2_NTL_ATSD_HDR_CREDIT_RX, 0x0200200000000000ULL);
 
 	val = npu2_read(ndev->npu, NPU2_NTL_MISC_CFG1(ndev));
 	val &= 0xFF3FFFFFFFFFFFFF;