| Message ID | 20240801033432.106837-1-amhetre@nvidia.com |
|---|---|
| State | Changes Requested |
| Series | [V2,1/2] iommu: Optimize IOMMU UnMap |
On 01/08/2024 4:34 am, Ashish Mhetre wrote:
> The current __arm_lpae_unmap() function calls dma_sync() on individual
> PTEs after clearing them. Overall unmap performance can be improved by
> around 25% for large buffer sizes by combining the syncs for adjacent
> leaf entries.
> Optimize the unmap time by clearing all the leaf entries and issuing a
> single dma_sync() for them.
> Below is detailed analysis of average unmap latency(in us) with and
> without this optimization obtained by running dma_map_benchmark for
> different buffer sizes.
>
>             UnMap Latency(us)
> Size     Without        With           % gain with
>          optimiztion    optimization   optimization
>
> 4KB      3              3              0
> 8KB      4              3.8            5
> 16KB     6.1            5.4            11.48
> 32KB     10.2           8.5            16.67
> 64KB     18.5           14.9           19.46
> 128KB    35             27.5           21.43
> 256KB    67.5           52.2           22.67
> 512KB    127.9          97.2           24.00
> 1MB      248.6          187.4          24.62
> 2MB      65.5           65.5           0
> 4MB      119.2          119            0.17
>
> Signed-off-by: Ashish Mhetre <amhetre@nvidia.com>
> ---
> Changes in V2:
> - Updated the commit message to be imperative.
> - Fixed ptep at incorrect index getting cleared for non-leaf entries.
> ---
>  drivers/iommu/io-pgtable-arm.c | 34 +++++++++++++++++++++-------------
>  1 file changed, 21 insertions(+), 13 deletions(-)
>
> diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
> index f5d9fd1f45bf..32401948b980 100644
> --- a/drivers/iommu/io-pgtable-arm.c
> +++ b/drivers/iommu/io-pgtable-arm.c
> @@ -274,13 +274,15 @@ static void __arm_lpae_sync_pte(arm_lpae_iopte *ptep, int num_entries,
>                                    sizeof(*ptep) * num_entries, DMA_TO_DEVICE);
>  }
>
> -static void __arm_lpae_clear_pte(arm_lpae_iopte *ptep, struct io_pgtable_cfg *cfg)
> +static void __arm_lpae_clear_pte(arm_lpae_iopte *ptep, struct io_pgtable_cfg *cfg, int num_entries)
>  {
> +        int i;

You can make this a nice tidy loop-local declaration now.

> -        *ptep = 0;
> +        for (i = 0; i < num_entries; i++)
> +                ptep[i] = 0;
>
>          if (!cfg->coherent_walk)
> -                __arm_lpae_sync_pte(ptep, 1, cfg);
> +                __arm_lpae_sync_pte(ptep, num_entries, cfg);
>  }
>
>  static size_t __arm_lpae_unmap(struct arm_lpae_io_pgtable *data,
> @@ -635,9 +637,10 @@ static size_t __arm_lpae_unmap(struct arm_lpae_io_pgtable *data,
>                                 unsigned long iova, size_t size, size_t pgcount,
>                                 int lvl, arm_lpae_iopte *ptep)
>  {
> +        bool gather_queued;
>          arm_lpae_iopte pte;
>          struct io_pgtable *iop = &data->iop;
> -        int i = 0, num_entries, max_entries, unmap_idx_start;
> +        int i = 0, j = 0, num_entries, max_entries, unmap_idx_start;

Similarly there's no need to initialise j here, but I'd make it loop-local anyway.

>          /* Something went horribly wrong and we ran out of page table */
>          if (WARN_ON(lvl == ARM_LPAE_MAX_LEVELS))
> @@ -652,28 +655,33 @@ static size_t __arm_lpae_unmap(struct arm_lpae_io_pgtable *data,
>          /* If the size matches this level, we're in the right place */
>          if (size == ARM_LPAE_BLOCK_SIZE(lvl, data)) {
>                  max_entries = ARM_LPAE_PTES_PER_TABLE(data) - unmap_idx_start;
> +                gather_queued = iommu_iotlb_gather_queued(gather);

This is used exactly once, do we really need to introduce the variable?
>                  num_entries = min_t(int, pgcount, max_entries);
>
> -                while (i < num_entries) {
> -                        pte = READ_ONCE(*ptep);
> +                /* Find and handle non-leaf entries */
> +                for (i = 0; i < num_entries; i++) {
> +                        pte = READ_ONCE(ptep[i]);
>                          if (WARN_ON(!pte))
>                                  break;
>
> -                        __arm_lpae_clear_pte(ptep, &iop->cfg);
> -
>                          if (!iopte_leaf(pte, lvl, iop->fmt)) {
> +                                __arm_lpae_clear_pte(&ptep[i], &iop->cfg, 1);
> +
>                                  /* Also flush any partial walks */
>                                  io_pgtable_tlb_flush_walk(iop, iova + i * size, size,
>                                                            ARM_LPAE_GRANULE(data));
>                                  __arm_lpae_free_pgtable(data, lvl + 1, iopte_deref(pte, data));
> -                        } else if (!iommu_iotlb_gather_queued(gather)) {
> -                                io_pgtable_tlb_add_page(iop, gather, iova + i * size, size);
>                          }
> -
> -                        ptep++;
> -                        i++;
>                  }
>
> +                /* Clear the remaining entries */
> +                if (i)

It seems a little non-obvious to optimise for one specific corner of
unexpected failures here. I'd hope a zero-sized dma_sync wouldn't blow
up in that case, but if you want to be safe then I'd cover it by
tweaking the condition in __arm_lpae_clear_pte() to "if
(!cfg->coherent_walk && num_entries)".

Thanks,
Robin.

> +                        __arm_lpae_clear_pte(ptep, &iop->cfg, i);
> +
> +                if (!gather_queued)
> +                        for (j = 0; j < i; j++)
> +                                io_pgtable_tlb_add_page(iop, gather, iova + j * size, size);
> +
>                  return i * size;
>          } else if (iopte_leaf(pte, lvl, iop->fmt)) {
>                  /*
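Putting Robin's three comments together, the tail of this hunk might end up looking something like the sketch below: the single-use gather_queued variable goes away, j becomes loop-local, and the i == 0 case is left to the helper rather than checked at the call site. This is only an illustration of the suggested direction, not code from the posted series.

                /* Clear all the leaf entries found above in one batch;
                 * assumes __arm_lpae_clear_pte() tolerates i == 0, per the
                 * suggestion above. */
                __arm_lpae_clear_pte(ptep, &iop->cfg, i);

                if (!iommu_iotlb_gather_queued(gather))
                        for (int j = 0; j < i; j++)
                                io_pgtable_tlb_add_page(iop, gather,
                                                        iova + j * size, size);

                return i * size;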
On 8/1/2024 9:53 PM, Robin Murphy wrote:
> External email: Use caution opening links or attachments
>
>
> On 01/08/2024 4:34 am, Ashish Mhetre wrote:
>> The current __arm_lpae_unmap() function calls dma_sync() on individual
>> PTEs after clearing them. Overall unmap performance can be improved by
>> around 25% for large buffer sizes by combining the syncs for adjacent
>> leaf entries.
>> Optimize the unmap time by clearing all the leaf entries and issuing a
>> single dma_sync() for them.
>> Below is detailed analysis of average unmap latency(in us) with and
>> without this optimization obtained by running dma_map_benchmark for
>> different buffer sizes.
>>
>>             UnMap Latency(us)
>> Size     Without        With           % gain with
>>          optimiztion    optimization   optimization
>>
>> 4KB      3              3              0
>> 8KB      4              3.8            5
>> 16KB     6.1            5.4            11.48
>> 32KB     10.2           8.5            16.67
>> 64KB     18.5           14.9           19.46
>> 128KB    35             27.5           21.43
>> 256KB    67.5           52.2           22.67
>> 512KB    127.9          97.2           24.00
>> 1MB      248.6          187.4          24.62
>> 2MB      65.5           65.5           0
>> 4MB      119.2          119            0.17
>>
>> Signed-off-by: Ashish Mhetre <amhetre@nvidia.com>
>> ---
>> Changes in V2:
>> - Updated the commit message to be imperative.
>> - Fixed ptep at incorrect index getting cleared for non-leaf entries.
>> ---
>>  drivers/iommu/io-pgtable-arm.c | 34 +++++++++++++++++++++-------------
>>  1 file changed, 21 insertions(+), 13 deletions(-)
>>
>> diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
>> index f5d9fd1f45bf..32401948b980 100644
>> --- a/drivers/iommu/io-pgtable-arm.c
>> +++ b/drivers/iommu/io-pgtable-arm.c
>> @@ -274,13 +274,15 @@ static void __arm_lpae_sync_pte(arm_lpae_iopte *ptep, int num_entries,
>>                                    sizeof(*ptep) * num_entries, DMA_TO_DEVICE);
>>  }
>>
>> -static void __arm_lpae_clear_pte(arm_lpae_iopte *ptep, struct io_pgtable_cfg *cfg)
>> +static void __arm_lpae_clear_pte(arm_lpae_iopte *ptep, struct io_pgtable_cfg *cfg, int num_entries)
>>  {
>> +        int i;
>
> You can make this a nice tidy loop-local declaration now.
>
Sure, will update this in new version.
>> -        *ptep = 0;
>> +        for (i = 0; i < num_entries; i++)
>> +                ptep[i] = 0;
>>
>>          if (!cfg->coherent_walk)
>> -                __arm_lpae_sync_pte(ptep, 1, cfg);
>> +                __arm_lpae_sync_pte(ptep, num_entries, cfg);
>>  }
>>
>>  static size_t __arm_lpae_unmap(struct arm_lpae_io_pgtable *data,
>> @@ -635,9 +637,10 @@ static size_t __arm_lpae_unmap(struct arm_lpae_io_pgtable *data,
>>                                 unsigned long iova, size_t size, size_t pgcount,
>>                                 int lvl, arm_lpae_iopte *ptep)
>>  {
>> +        bool gather_queued;
>>          arm_lpae_iopte pte;
>>          struct io_pgtable *iop = &data->iop;
>> -        int i = 0, num_entries, max_entries, unmap_idx_start;
>> +        int i = 0, j = 0, num_entries, max_entries, unmap_idx_start;
>
> Similarly there's no need to initialise j here, but I'd make it
> loop-local anyway.
>
Ack, I will make j loop-local.
>>          /* Something went horribly wrong and we ran out of page table */
>>          if (WARN_ON(lvl == ARM_LPAE_MAX_LEVELS))
>> @@ -652,28 +655,33 @@ static size_t __arm_lpae_unmap(struct arm_lpae_io_pgtable *data,
>>          /* If the size matches this level, we're in the right place */
>>          if (size == ARM_LPAE_BLOCK_SIZE(lvl, data)) {
>>                  max_entries = ARM_LPAE_PTES_PER_TABLE(data) - unmap_idx_start;
>> +                gather_queued = iommu_iotlb_gather_queued(gather);
>
> This is used exactly once, do we really need to introduce the variable?
>
Ack, I will remove the unnecessary variable.
>>                  num_entries = min_t(int, pgcount, max_entries);
>>
>> -                while (i < num_entries) {
>> -                        pte = READ_ONCE(*ptep);
>> +                /* Find and handle non-leaf entries */
>> +                for (i = 0; i < num_entries; i++) {
>> +                        pte = READ_ONCE(ptep[i]);
>>                          if (WARN_ON(!pte))
>>                                  break;
>>
>> -                        __arm_lpae_clear_pte(ptep, &iop->cfg);
>> -
>>                          if (!iopte_leaf(pte, lvl, iop->fmt)) {
>> +                                __arm_lpae_clear_pte(&ptep[i], &iop->cfg, 1);
>> +
>>                                  /* Also flush any partial walks */
>>                                  io_pgtable_tlb_flush_walk(iop, iova + i * size, size,
>>                                                            ARM_LPAE_GRANULE(data));
>>                                  __arm_lpae_free_pgtable(data, lvl + 1, iopte_deref(pte, data));
>> -                        } else if (!iommu_iotlb_gather_queued(gather)) {
>> -                                io_pgtable_tlb_add_page(iop, gather, iova + i * size, size);
>>                          }
>> -
>> -                        ptep++;
>> -                        i++;
>>                  }
>>
>> +                /* Clear the remaining entries */
>> +                if (i)
>
> It seems a little non-obvious to optimise for one specific corner of
> unexpected failures here. I'd hope a zero-sized dma_sync wouldn't blow
> up in that case, but if you want to be safe then I'd cover it by
> tweaking the condition in __arm_lpae_clear_pte() to "if
> (!cfg->coherent_walk && num_entries)".
>
> Thanks,
> Robin.
>
Yes, makes sense to cover this in __arm_lpae_clear_pte(). I will update this in V3.
>> +                        __arm_lpae_clear_pte(ptep, &iop->cfg, i);
>> +
>> +                if (!gather_queued)
>> +                        for (j = 0; j < i; j++)
>> +                                io_pgtable_tlb_add_page(iop, gather, iova + j * size, size);
>> +
>>                  return i * size;
>>          } else if (iopte_leaf(pte, lvl, iop->fmt)) {
>>                  /*
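For completeness, here is a minimal sketch of what the helper could look like in V3 with the agreed change folded in, i.e. the V2 signature plus Robin's suggested guard and a loop-local counter. It is illustrative only; the actual V3 posting may differ.

static void __arm_lpae_clear_pte(arm_lpae_iopte *ptep, struct io_pgtable_cfg *cfg,
                                 int num_entries)
{
        /* Loop-local counter, per the review comment */
        for (int i = 0; i < num_entries; i++)
                ptep[i] = 0;

        /*
         * Skip the sync entirely for a zero-sized clear rather than
         * special-casing it at the call site.
         */
        if (!cfg->coherent_walk && num_entries)
                __arm_lpae_sync_pte(ptep, num_entries, cfg);
}

With the guard inside the helper, the "if (i)" check at the call site can simply be dropped.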
diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
index f5d9fd1f45bf..32401948b980 100644
--- a/drivers/iommu/io-pgtable-arm.c
+++ b/drivers/iommu/io-pgtable-arm.c
@@ -274,13 +274,15 @@ static void __arm_lpae_sync_pte(arm_lpae_iopte *ptep, int num_entries,
                                    sizeof(*ptep) * num_entries, DMA_TO_DEVICE);
 }
 
-static void __arm_lpae_clear_pte(arm_lpae_iopte *ptep, struct io_pgtable_cfg *cfg)
+static void __arm_lpae_clear_pte(arm_lpae_iopte *ptep, struct io_pgtable_cfg *cfg, int num_entries)
 {
+        int i;
-        *ptep = 0;
+        for (i = 0; i < num_entries; i++)
+                ptep[i] = 0;
 
         if (!cfg->coherent_walk)
-                __arm_lpae_sync_pte(ptep, 1, cfg);
+                __arm_lpae_sync_pte(ptep, num_entries, cfg);
 }
 
 static size_t __arm_lpae_unmap(struct arm_lpae_io_pgtable *data,
@@ -635,9 +637,10 @@ static size_t __arm_lpae_unmap(struct arm_lpae_io_pgtable *data,
                                unsigned long iova, size_t size, size_t pgcount,
                                int lvl, arm_lpae_iopte *ptep)
 {
+        bool gather_queued;
         arm_lpae_iopte pte;
         struct io_pgtable *iop = &data->iop;
-        int i = 0, num_entries, max_entries, unmap_idx_start;
+        int i = 0, j = 0, num_entries, max_entries, unmap_idx_start;
 
         /* Something went horribly wrong and we ran out of page table */
         if (WARN_ON(lvl == ARM_LPAE_MAX_LEVELS))
@@ -652,28 +655,33 @@ static size_t __arm_lpae_unmap(struct arm_lpae_io_pgtable *data,
         /* If the size matches this level, we're in the right place */
         if (size == ARM_LPAE_BLOCK_SIZE(lvl, data)) {
                 max_entries = ARM_LPAE_PTES_PER_TABLE(data) - unmap_idx_start;
+                gather_queued = iommu_iotlb_gather_queued(gather);
                 num_entries = min_t(int, pgcount, max_entries);
 
-                while (i < num_entries) {
-                        pte = READ_ONCE(*ptep);
+                /* Find and handle non-leaf entries */
+                for (i = 0; i < num_entries; i++) {
+                        pte = READ_ONCE(ptep[i]);
                         if (WARN_ON(!pte))
                                 break;
 
-                        __arm_lpae_clear_pte(ptep, &iop->cfg);
-
                         if (!iopte_leaf(pte, lvl, iop->fmt)) {
+                                __arm_lpae_clear_pte(&ptep[i], &iop->cfg, 1);
+
                                 /* Also flush any partial walks */
                                 io_pgtable_tlb_flush_walk(iop, iova + i * size, size,
                                                           ARM_LPAE_GRANULE(data));
                                 __arm_lpae_free_pgtable(data, lvl + 1, iopte_deref(pte, data));
-                        } else if (!iommu_iotlb_gather_queued(gather)) {
-                                io_pgtable_tlb_add_page(iop, gather, iova + i * size, size);
                         }
-
-                        ptep++;
-                        i++;
                 }
 
+                /* Clear the remaining entries */
+                if (i)
+                        __arm_lpae_clear_pte(ptep, &iop->cfg, i);
+
+                if (!gather_queued)
+                        for (j = 0; j < i; j++)
+                                io_pgtable_tlb_add_page(iop, gather, iova + j * size, size);
+
                 return i * size;
         } else if (iopte_leaf(pte, lvl, iop->fmt)) {
                 /*
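The helper that ends up being called once per batch is the existing __arm_lpae_sync_pte(), whose body is only partially visible in the hunk context above. Reconstructed roughly from that context (treat it as a sketch rather than a literal excerpt), it maps a run of adjacent PTEs onto a single dma_sync_single_for_device() call, which is where the saving for large unmaps comes from:

static void __arm_lpae_sync_pte(arm_lpae_iopte *ptep, int num_entries,
                                struct io_pgtable_cfg *cfg)
{
        /* One cache-maintenance call covers all num_entries adjacent PTEs */
        dma_sync_single_for_device(cfg->iommu_dev, __arm_lpae_dma_addr(ptep),
                                   sizeof(*ptep) * num_entries, DMA_TO_DEVICE);
}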
The current __arm_lpae_unmap() function calls dma_sync() on individual
PTEs after clearing them. Overall unmap performance can be improved by
around 25% for large buffer sizes by combining the syncs for adjacent
leaf entries.
Optimize the unmap time by clearing all the leaf entries and issuing a
single dma_sync() for them.
Below is a detailed analysis of average unmap latency (in us) with and
without this optimization, obtained by running dma_map_benchmark for
different buffer sizes.

            UnMap Latency (us)
Size     Without        With           % gain with
         optimization   optimization   optimization

4KB      3              3              0
8KB      4              3.8            5
16KB     6.1            5.4            11.48
32KB     10.2           8.5            16.67
64KB     18.5           14.9           19.46
128KB    35             27.5           21.43
256KB    67.5           52.2           22.67
512KB    127.9          97.2           24.00
1MB      248.6          187.4          24.62
2MB      65.5           65.5           0
4MB      119.2          119            0.17

Signed-off-by: Ashish Mhetre <amhetre@nvidia.com>
---
Changes in V2:
- Updated the commit message to be imperative.
- Fixed ptep at incorrect index getting cleared for non-leaf entries.
---
 drivers/iommu/io-pgtable-arm.c | 34 +++++++++++++++++++++-------------
 1 file changed, 21 insertions(+), 13 deletions(-)
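To make the shape of the change described above concrete, here is a stripped-down sketch of the calling pattern, using the driver's own helper names with details elided; it is a conceptual illustration, not a literal excerpt from either version of the code:

/* Before: one dma_sync() per cleared PTE */
for (i = 0; i < num_entries; i++) {
        ptep[i] = 0;
        __arm_lpae_sync_pte(&ptep[i], 1, cfg);          /* sync each entry */
}

/* After: clear the whole run of adjacent leaf PTEs, then sync it once */
for (i = 0; i < num_entries; i++)
        ptep[i] = 0;
__arm_lpae_sync_pte(ptep, num_entries, cfg);            /* single sync */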