Message ID | 1531276356-18469-1-git-send-email-arbab@linux.ibm.com |
---|---|
State | Accepted |
Headers | show |
Series | npu2/hw-procedures: Fence bricks via NTL instead of MISC | expand |
On 11/07/18 12:32, Reza Arbab wrote: > There are a couple of places we can set/unset fence for a brick: > > 1. MISC register: NPU2_MISC_FENCE_STATE > 2. NTL register for the brick: NPU2_NTL_MISC_CFG1(ndev) > > Recent testing of ATS in combination with GPU reset has exposed a side > effect of using (1); if fence is set for all six bricks, it triggers a > sticky nmmu latch which prevents the NPU from getting ATR responses. > This manifests as a hang in the tests. > > We have npu2_dev_fence_brick() which uses (1), and only two calls to it. > Replace the call which sets fence with a write to (2). Remove the > corresponding unset call entirely. It's unneeded because the procedures > already do a progression from full fence to half to idle using (2). > > Signed-off-by: Reza Arbab <arbab@linux.ibm.com> Hmm, I wonder whether there's any side effects from how we use NPU2_MISC_FENCE_STATE in the opencapi code... anything we should be aware of? This looks good to me. Reviewed-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> > --- > hw/npu2-hw-procedures.c | 31 +++++++------------------------ > 1 file changed, 7 insertions(+), 24 deletions(-) > > diff --git a/hw/npu2-hw-procedures.c b/hw/npu2-hw-procedures.c > index 7ab08f0..8de6b4d 100644 > --- a/hw/npu2-hw-procedures.c > +++ b/hw/npu2-hw-procedures.c > @@ -236,26 +236,6 @@ static bool poll_fence_status(struct npu2_dev *ndev, uint64_t val) > return false; > } > > -static int64_t npu2_dev_fence_brick(struct npu2_dev *ndev, bool set) > -{ > - /* > - * Add support for queisce/fence the brick at > - * procedure reset time. > - */ > - uint32_t brick; > - uint64_t val; > - > - brick = ndev->index; > - if (set) > - brick += 6; > - > - val = PPC_BIT(brick); > - NPU2DEVINF(ndev, "%s fence brick %d, val %llx\n", set ? "set" : "clear", > - ndev->index, val); > - npu2_write(ndev->npu, NPU2_MISC_FENCE_STATE, val); > - return 0; > -} > - > /* Procedure 1.2.1 - Reset NPU/NDL */ > uint32_t reset_ntl(struct npu2_dev *ndev) > { > @@ -326,9 +306,6 @@ static uint32_t reset_ntl_release(struct npu2_dev *ndev) > > } > > - /* Release the fence */ > - npu2_dev_fence_brick(ndev, false); > - > val = npu2_read(ndev->npu, NPU2_NTL_MISC_CFG1(ndev)); > val &= 0xFFBFFFFFFFFFFFFF; > npu2_write(ndev->npu, NPU2_NTL_MISC_CFG1(ndev), val); > @@ -952,7 +929,13 @@ int64_t npu2_dev_procedure(void *dev, struct pci_cfg_reg_filter *pcrf, > > void npu2_dev_procedure_reset(struct npu2_dev *dev) > { > - npu2_dev_fence_brick(dev, true); > + uint64_t val; > + > + /* Fence the brick */ > + val = npu2_read(dev->npu, NPU2_NTL_MISC_CFG1(dev)); > + val |= PPC_BIT(8) | PPC_BIT(9); > + npu2_write(dev->npu, NPU2_NTL_MISC_CFG1(dev), val); > + > npu2_clear_link_flag(dev, NPU2_DEV_DL_RESET); > } > >
Reza Arbab <arbab@linux.ibm.com> writes: > There are a couple of places we can set/unset fence for a brick: > > 1. MISC register: NPU2_MISC_FENCE_STATE > 2. NTL register for the brick: NPU2_NTL_MISC_CFG1(ndev) > > Recent testing of ATS in combination with GPU reset has exposed a side > effect of using (1); if fence is set for all six bricks, it triggers a > sticky nmmu latch which prevents the NPU from getting ATR responses. > This manifests as a hang in the tests. > > We have npu2_dev_fence_brick() which uses (1), and only two calls to it. > Replace the call which sets fence with a write to (2). Remove the > corresponding unset call entirely. It's unneeded because the procedures > already do a progression from full fence to half to idle using (2). > > Signed-off-by: Reza Arbab <arbab@linux.ibm.com> > --- > hw/npu2-hw-procedures.c | 31 +++++++------------------------ > 1 file changed, 7 insertions(+), 24 deletions(-) Thanks, seeing as we want this for stable as well, merged to: master as of 5ff8763c9b0421d8de0f4346ca211c853d2406d4 6.0.x as of 966a2839484f164979c5118e52928f8509b4982b
On Wed, Jul 11, 2018 at 01:56:10PM +1000, Andrew Donnellan wrote: >Hmm, I wonder whether there's any side effects from how we use >NPU2_MISC_FENCE_STATE in the opencapi code... anything we should be >aware of? I can't tell if the opencapi code is similar enough to run into the root problem which motivated the change. Would it be possible for every brick to be fenced at once, say if you did FLR on every device simultaneously? Even if so, I don't know if that would then defeat ATS as it does in the NVLink case.
diff --git a/hw/npu2-hw-procedures.c b/hw/npu2-hw-procedures.c index 7ab08f0..8de6b4d 100644 --- a/hw/npu2-hw-procedures.c +++ b/hw/npu2-hw-procedures.c @@ -236,26 +236,6 @@ static bool poll_fence_status(struct npu2_dev *ndev, uint64_t val) return false; } -static int64_t npu2_dev_fence_brick(struct npu2_dev *ndev, bool set) -{ - /* - * Add support for queisce/fence the brick at - * procedure reset time. - */ - uint32_t brick; - uint64_t val; - - brick = ndev->index; - if (set) - brick += 6; - - val = PPC_BIT(brick); - NPU2DEVINF(ndev, "%s fence brick %d, val %llx\n", set ? "set" : "clear", - ndev->index, val); - npu2_write(ndev->npu, NPU2_MISC_FENCE_STATE, val); - return 0; -} - /* Procedure 1.2.1 - Reset NPU/NDL */ uint32_t reset_ntl(struct npu2_dev *ndev) { @@ -326,9 +306,6 @@ static uint32_t reset_ntl_release(struct npu2_dev *ndev) } - /* Release the fence */ - npu2_dev_fence_brick(ndev, false); - val = npu2_read(ndev->npu, NPU2_NTL_MISC_CFG1(ndev)); val &= 0xFFBFFFFFFFFFFFFF; npu2_write(ndev->npu, NPU2_NTL_MISC_CFG1(ndev), val); @@ -952,7 +929,13 @@ int64_t npu2_dev_procedure(void *dev, struct pci_cfg_reg_filter *pcrf, void npu2_dev_procedure_reset(struct npu2_dev *dev) { - npu2_dev_fence_brick(dev, true); + uint64_t val; + + /* Fence the brick */ + val = npu2_read(dev->npu, NPU2_NTL_MISC_CFG1(dev)); + val |= PPC_BIT(8) | PPC_BIT(9); + npu2_write(dev->npu, NPU2_NTL_MISC_CFG1(dev), val); + npu2_clear_link_flag(dev, NPU2_DEV_DL_RESET); }
There are a couple of places we can set/unset fence for a brick: 1. MISC register: NPU2_MISC_FENCE_STATE 2. NTL register for the brick: NPU2_NTL_MISC_CFG1(ndev) Recent testing of ATS in combination with GPU reset has exposed a side effect of using (1); if fence is set for all six bricks, it triggers a sticky nmmu latch which prevents the NPU from getting ATR responses. This manifests as a hang in the tests. We have npu2_dev_fence_brick() which uses (1), and only two calls to it. Replace the call which sets fence with a write to (2). Remove the corresponding unset call entirely. It's unneeded because the procedures already do a progression from full fence to half to idle using (2). Signed-off-by: Reza Arbab <arbab@linux.ibm.com> --- hw/npu2-hw-procedures.c | 31 +++++++------------------------ 1 file changed, 7 insertions(+), 24 deletions(-)