Message ID | 20191114161304.18130-1-arbab@linux.ibm.com |
---|---|
State | Accepted |
Headers | show |
Series | npu2/hw-procedures: Remove assertion from check_credits() | expand |
Context | Check | Description |
---|---|---|
snowpatch_ozlabs/apply_patch | success | Successfully applied on branch master (d75e82dbfbb9443efeb3f9a5921ac23605aab469) |
snowpatch_ozlabs/snowpatch_job_snowpatch-skiboot | success | Test snowpatch/job/snowpatch-skiboot on branch master |
snowpatch_ozlabs/snowpatch_job_snowpatch-skiboot-dco | success | Signed-off-by present |
On Fri, Nov 15, 2019 at 3:13 AM Reza Arbab <arbab@linux.ibm.com> wrote: > > The RX clock mux in the NVLink PHY can glitch, which will manifest in > hard to diagnose behavior--at best, a checkstop during the first link > traffic. The only reliable way we found to detect this was by checking > for a discrepancy in the credits we expect to receive during link > training. > > Since the time the check was added, we've found that > > * Commit ac6f1599ff33 ("npu2: hw-procedures: Add phy_rx_clock_sel()") > does work around the original glitch. > > * Asserting is too harsh. Before root cause was established, it was > thought this could have been a manufacturing defect and we wanted to > loudly fail hardware acceptance boot cycle tests. > > * It seems there is a valid situation in which credits are off from > the expected value. During GPU hot reset, a CPU prefetch across the link > can affect the credit count before we check. > > Given all of the above, remove the assert(). > > Cc: stable # 6.0.x > Signed-off-by: Reza Arbab <arbab@linux.ibm.com> Merged as 24664b48642845d620e225111bf6184f3c102f60 Vasant, can you pick this up? I don't think you're monitoring stable@arbab-laptop.localdomain, but maybe you are ;)
On Wed, Nov 20, 2019 at 11:25:49AM +1100, Oliver O'Halloran wrote: >On Fri, Nov 15, 2019 at 3:13 AM Reza Arbab <arbab@linux.ibm.com> wrote: >> Cc: stable # 6.0.x >> Signed-off-by: Reza Arbab <arbab@linux.ibm.com> > >Merged as 24664b48642845d620e225111bf6184f3c102f60 > >Vasant, can you pick this up? I don't think you're monitoring >stable@arbab-laptop.localdomain, but maybe you are ;) There's nothing stable about my laptop. :) Vasant, ping?
diff --git a/hw/npu2-hw-procedures.c b/hw/npu2-hw-procedures.c index 86864629a66b..2a03beda606a 100644 --- a/hw/npu2-hw-procedures.c +++ b/hw/npu2-hw-procedures.c @@ -756,17 +756,14 @@ static uint32_t check_credit(struct npu2_dev *ndev, uint64_t reg, static uint32_t check_credits(struct npu2_dev *ndev) { - int fail = 0; uint64_t val; - fail += CHECK_CREDIT(ndev, NPU2_NTL_CRED_HDR_CREDIT_RX, 0x0BE0BE0000000000ULL); - fail += CHECK_CREDIT(ndev, NPU2_NTL_RSP_HDR_CREDIT_RX, 0x0BE0BE0000000000ULL); - fail += CHECK_CREDIT(ndev, NPU2_NTL_CRED_DATA_CREDIT_RX, 0x1001000000000000ULL); - fail += CHECK_CREDIT(ndev, NPU2_NTL_RSP_DATA_CREDIT_RX, 0x1001000000000000ULL); - fail += CHECK_CREDIT(ndev, NPU2_NTL_DBD_HDR_CREDIT_RX, 0x0640640000000000ULL); - fail += CHECK_CREDIT(ndev, NPU2_NTL_ATSD_HDR_CREDIT_RX, 0x0200200000000000ULL); - - assert(!fail); + CHECK_CREDIT(ndev, NPU2_NTL_CRED_HDR_CREDIT_RX, 0x0BE0BE0000000000ULL); + CHECK_CREDIT(ndev, NPU2_NTL_RSP_HDR_CREDIT_RX, 0x0BE0BE0000000000ULL); + CHECK_CREDIT(ndev, NPU2_NTL_CRED_DATA_CREDIT_RX, 0x1001000000000000ULL); + CHECK_CREDIT(ndev, NPU2_NTL_RSP_DATA_CREDIT_RX, 0x1001000000000000ULL); + CHECK_CREDIT(ndev, NPU2_NTL_DBD_HDR_CREDIT_RX, 0x0640640000000000ULL); + CHECK_CREDIT(ndev, NPU2_NTL_ATSD_HDR_CREDIT_RX, 0x0200200000000000ULL); val = npu2_read(ndev->npu, NPU2_NTL_MISC_CFG1(ndev)); val &= 0xFF3FFFFFFFFFFFFF;
The RX clock mux in the NVLink PHY can glitch, which will manifest in hard to diagnose behavior--at best, a checkstop during the first link traffic. The only reliable way we found to detect this was by checking for a discrepancy in the credits we expect to receive during link training. Since the time the check was added, we've found that * Commit ac6f1599ff33 ("npu2: hw-procedures: Add phy_rx_clock_sel()") does work around the original glitch. * Asserting is too harsh. Before root cause was established, it was thought this could have been a manufacturing defect and we wanted to loudly fail hardware acceptance boot cycle tests. * It seems there is a valid situation in which credits are off from the expected value. During GPU hot reset, a CPU prefetch across the link can affect the credit count before we check. Given all of the above, remove the assert(). Cc: stable # 6.0.x Signed-off-by: Reza Arbab <arbab@linux.ibm.com> --- hw/npu2-hw-procedures.c | 15 ++++++--------- 1 file changed, 6 insertions(+), 9 deletions(-)