diff mbox series

npu2/hw-procedures: Fence bricks via NTL instead of MISC

Message ID 1531276356-18469-1-git-send-email-arbab@linux.ibm.com
State Accepted
Headers show
Series npu2/hw-procedures: Fence bricks via NTL instead of MISC | expand

Commit Message

Reza Arbab July 11, 2018, 2:32 a.m. UTC
There are a couple of places we can set/unset fence for a brick:

1. MISC register: NPU2_MISC_FENCE_STATE
2. NTL register for the brick: NPU2_NTL_MISC_CFG1(ndev)

Recent testing of ATS in combination with GPU reset has exposed a side
effect of using (1); if fence is set for all six bricks, it triggers a
sticky nmmu latch which prevents the NPU from getting ATR responses.
This manifests as a hang in the tests.

We have npu2_dev_fence_brick() which uses (1), and only two calls to it.
Replace the call which sets fence with a write to (2). Remove the
corresponding unset call entirely. It's unneeded because the procedures
already do a progression from full fence to half to idle using (2).

Signed-off-by: Reza Arbab <arbab@linux.ibm.com>
---
 hw/npu2-hw-procedures.c | 31 +++++++------------------------
 1 file changed, 7 insertions(+), 24 deletions(-)

Comments

Andrew Donnellan July 11, 2018, 3:56 a.m. UTC | #1
On 11/07/18 12:32, Reza Arbab wrote:
> There are a couple of places we can set/unset fence for a brick:
> 
> 1. MISC register: NPU2_MISC_FENCE_STATE
> 2. NTL register for the brick: NPU2_NTL_MISC_CFG1(ndev)
> 
> Recent testing of ATS in combination with GPU reset has exposed a side
> effect of using (1); if fence is set for all six bricks, it triggers a
> sticky nmmu latch which prevents the NPU from getting ATR responses.
> This manifests as a hang in the tests.
> 
> We have npu2_dev_fence_brick() which uses (1), and only two calls to it.
> Replace the call which sets fence with a write to (2). Remove the
> corresponding unset call entirely. It's unneeded because the procedures
> already do a progression from full fence to half to idle using (2).
> 
> Signed-off-by: Reza Arbab <arbab@linux.ibm.com>

Hmm, I wonder whether there's any side effects from how we use 
NPU2_MISC_FENCE_STATE in the opencapi code... anything we should be 
aware of?

This looks good to me.

Reviewed-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>

> ---
>   hw/npu2-hw-procedures.c | 31 +++++++------------------------
>   1 file changed, 7 insertions(+), 24 deletions(-)
> 
> diff --git a/hw/npu2-hw-procedures.c b/hw/npu2-hw-procedures.c
> index 7ab08f0..8de6b4d 100644
> --- a/hw/npu2-hw-procedures.c
> +++ b/hw/npu2-hw-procedures.c
> @@ -236,26 +236,6 @@ static bool poll_fence_status(struct npu2_dev *ndev, uint64_t val)
>   	return false;
>   }
>   
> -static int64_t npu2_dev_fence_brick(struct npu2_dev *ndev, bool set)
> -{
> -	/*
> -	 * Add support for queisce/fence the brick at
> -	 * procedure reset time.
> -	 */
> -	uint32_t brick;
> -	uint64_t val;
> -
> -	brick = ndev->index;
> -	if (set)
> -		brick += 6;
> -
> -	val = PPC_BIT(brick);
> -	NPU2DEVINF(ndev, "%s fence brick %d, val %llx\n", set ? "set" : "clear",
> -			ndev->index, val);
> -	npu2_write(ndev->npu, NPU2_MISC_FENCE_STATE, val);
> -	return 0;
> -}
> -
>   /* Procedure 1.2.1 - Reset NPU/NDL */
>   uint32_t reset_ntl(struct npu2_dev *ndev)
>   {
> @@ -326,9 +306,6 @@ static uint32_t reset_ntl_release(struct npu2_dev *ndev)
>   
>   	}
>   
> -	/* Release the fence */
> -	npu2_dev_fence_brick(ndev, false);
> -
>   	val = npu2_read(ndev->npu, NPU2_NTL_MISC_CFG1(ndev));
>   	val &= 0xFFBFFFFFFFFFFFFF;
>   	npu2_write(ndev->npu, NPU2_NTL_MISC_CFG1(ndev), val);
> @@ -952,7 +929,13 @@ int64_t npu2_dev_procedure(void *dev, struct pci_cfg_reg_filter *pcrf,
>   
>   void npu2_dev_procedure_reset(struct npu2_dev *dev)
>   {
> -	npu2_dev_fence_brick(dev, true);
> +	uint64_t val;
> +
> +	/* Fence the brick */
> +	val = npu2_read(dev->npu, NPU2_NTL_MISC_CFG1(dev));
> +	val |= PPC_BIT(8) | PPC_BIT(9);
> +	npu2_write(dev->npu, NPU2_NTL_MISC_CFG1(dev), val);
> +
>   	npu2_clear_link_flag(dev, NPU2_DEV_DL_RESET);
>   }
>   
>
Stewart Smith July 11, 2018, 5:08 a.m. UTC | #2
Reza Arbab <arbab@linux.ibm.com> writes:
> There are a couple of places we can set/unset fence for a brick:
>
> 1. MISC register: NPU2_MISC_FENCE_STATE
> 2. NTL register for the brick: NPU2_NTL_MISC_CFG1(ndev)
>
> Recent testing of ATS in combination with GPU reset has exposed a side
> effect of using (1); if fence is set for all six bricks, it triggers a
> sticky nmmu latch which prevents the NPU from getting ATR responses.
> This manifests as a hang in the tests.
>
> We have npu2_dev_fence_brick() which uses (1), and only two calls to it.
> Replace the call which sets fence with a write to (2). Remove the
> corresponding unset call entirely. It's unneeded because the procedures
> already do a progression from full fence to half to idle using (2).
>
> Signed-off-by: Reza Arbab <arbab@linux.ibm.com>
> ---
>  hw/npu2-hw-procedures.c | 31 +++++++------------------------
>  1 file changed, 7 insertions(+), 24 deletions(-)

Thanks, seeing as we want this for stable as well, merged to:
master as of 5ff8763c9b0421d8de0f4346ca211c853d2406d4
6.0.x as of 966a2839484f164979c5118e52928f8509b4982b
Reza Arbab July 11, 2018, 4:11 p.m. UTC | #3
On Wed, Jul 11, 2018 at 01:56:10PM +1000, Andrew Donnellan wrote:
>Hmm, I wonder whether there's any side effects from how we use 
>NPU2_MISC_FENCE_STATE in the opencapi code... anything we should be 
>aware of?

I can't tell if the opencapi code is similar enough to run into the root 
problem which motivated the change.

Would it be possible for every brick to be fenced at once, say if you 
did FLR on every device simultaneously? Even if so, I don't know if that 
would then defeat ATS as it does in the NVLink case.
diff mbox series

Patch

diff --git a/hw/npu2-hw-procedures.c b/hw/npu2-hw-procedures.c
index 7ab08f0..8de6b4d 100644
--- a/hw/npu2-hw-procedures.c
+++ b/hw/npu2-hw-procedures.c
@@ -236,26 +236,6 @@  static bool poll_fence_status(struct npu2_dev *ndev, uint64_t val)
 	return false;
 }
 
-static int64_t npu2_dev_fence_brick(struct npu2_dev *ndev, bool set)
-{
-	/*
-	 * Add support for queisce/fence the brick at
-	 * procedure reset time.
-	 */
-	uint32_t brick;
-	uint64_t val;
-
-	brick = ndev->index;
-	if (set)
-		brick += 6;
-
-	val = PPC_BIT(brick);
-	NPU2DEVINF(ndev, "%s fence brick %d, val %llx\n", set ? "set" : "clear",
-			ndev->index, val);
-	npu2_write(ndev->npu, NPU2_MISC_FENCE_STATE, val);
-	return 0;
-}
-
 /* Procedure 1.2.1 - Reset NPU/NDL */
 uint32_t reset_ntl(struct npu2_dev *ndev)
 {
@@ -326,9 +306,6 @@  static uint32_t reset_ntl_release(struct npu2_dev *ndev)
 
 	}
 
-	/* Release the fence */
-	npu2_dev_fence_brick(ndev, false);
-
 	val = npu2_read(ndev->npu, NPU2_NTL_MISC_CFG1(ndev));
 	val &= 0xFFBFFFFFFFFFFFFF;
 	npu2_write(ndev->npu, NPU2_NTL_MISC_CFG1(ndev), val);
@@ -952,7 +929,13 @@  int64_t npu2_dev_procedure(void *dev, struct pci_cfg_reg_filter *pcrf,
 
 void npu2_dev_procedure_reset(struct npu2_dev *dev)
 {
-	npu2_dev_fence_brick(dev, true);
+	uint64_t val;
+
+	/* Fence the brick */
+	val = npu2_read(dev->npu, NPU2_NTL_MISC_CFG1(dev));
+	val |= PPC_BIT(8) | PPC_BIT(9);
+	npu2_write(dev->npu, NPU2_NTL_MISC_CFG1(dev), val);
+
 	npu2_clear_link_flag(dev, NPU2_DEV_DL_RESET);
 }