[RFCv2,00/36] Process management for IOMMU + SVM for SMMUv3

Message ID	20171006133203.22803-1-jean-philippe.brucker@arm.com
Headers	show Return-Path: <devicetree-owner@vger.kernel.org> From: Jean-Philippe Brucker <jean-philippe.brucker@arm.com> To: linux-arm-kernel@lists.infradead.org, linux-pci@vger.kernel.org, linux-acpi@vger.kernel.org, devicetree@vger.kernel.org, iommu@lists.linux-foundation.org Cc: joro@8bytes.org, robh+dt@kernel.org, mark.rutland@arm.com, catalin.marinas@arm.com, will.deacon@arm.com, lorenzo.pieralisi@arm.com, hanjun.guo@linaro.org, sudeep.holla@arm.com, rjw@rjwysocki.net, lenb@kernel.org, robin.murphy@arm.com, bhelgaas@google.com, alex.williamson@redhat.com, tn@semihalf.com, liubo95@huawei.com, thunder.leizhen@huawei.com, xieyisheng1@huawei.com, gabriele.paoloni@huawei.com, nwatters@codeaurora.org, okaya@codeaurora.org, rfranz@cavium.com, dwmw2@infradead.org, jacob.jun.pan@linux.intel.com, yi.l.liu@intel.com, ashok.raj@intel.com, robdclark@gmail.com Subject: [RFCv2 PATCH 00/36] Process management for IOMMU + SVM for SMMUv3 Date: Fri, 6 Oct 2017 14:31:27 +0100 Message-Id: <20171006133203.22803-1-jean-philippe.brucker@arm.com> Sender: devicetree-owner@vger.kernel.org Precedence: bulk
Series	Process management for IOMMU + SVM for SMMUv3 \| expand [RFCv2,00/36] Process management for IOMMU + SVM for SMMUv3 [RFCv2,12/36] dt-bindings: document stall and PASID properties for IOMMU masters

Jean-Philippe Brucker Oct. 6, 2017, 1:31 p.m. UTC

Following discussions at plumbers and elsewhere, it seems like we need to
unify some of the Shared Virtual Memory (SVM) code, in order to define
clear semantics for the SVM API.

My previous RFC [1] was centered on the SMMUv3, but some of this code will
need to be reused by the SMMUv2 and virtio-iommu drivers. This second
proposal focuses on abstracting a little more into the core IOMMU API, and
also trying to find common ground for all SVM-capable IOMMUs.

SVM is, in the context of the IOMMU, sharing page tables between a process
and a device. Traditionally it requires IO Page Fault and Process Address
Space ID capabilities in device and IOMMU.

* A device driver can bind a process to a device, with iommu_process_bind.
  Internally we hold on to the mm and get notified of its activity with an
  mmu_notifier. The bond is removed by exit_mm, by a call to
  iommu_process_unbind or iommu_detach_device.

* iommu_process_bind returns a 20-bit PASID (PCI terminology) to the
  device driver, which programs it into the device to access the process
  address space.

* The device and the IOMMU support recoverable page faults. This can be
  either ATS+PRI for PCI, or platform-specific mechanisms such as Stall
  for SMMU.

Ideally systems wanting to use SVM have to support these three features,
but in practice we'll see implementations supporting just a subset of
them, especially in validation environments. So even if this particular
patchset assumes all three capabilities, it should also be possible to
support PASID without IOPF (by pinning everything, see non-system SVM in
OpenCL), or IOPF without PASID (sharing the single device address space
with a process, could be useful for DPDK+VFIO).

Implementing both these cases would enable PT sharing alone. Some people
would also like IOPF alone without SVM (covered by this series) or process
management without shared PT (not covered). Using these features
individually is also important for testing, as SVM is in its infancy and
providing easy ways to test is essential to reduce the number of quirks
down the line.

  Process management
  ==================

The first part of this series introduces boilerplate code for managing
PASIDs and processes bound to devices. It's something any IOMMU driver
that wants to support bind/unbind will have to do, and it is difficult to
get right.

Patches
1: iommu_process and PASID allocation, attach and release
2: process_exit callback for device drivers
3: iommu_process search by PASID
4: track process changes with an MMU notifiers
5: bind and unbind operations

My proposal uses the following model:

* The PASID space is system-wide. This means that a Linux process will
  have a single PASID. I introduce the iommu_process structure and a
  global IDR to manage this.

* An iommu_process can be bound to multiple domains, and a domain can have
  multiple iommu_process.

* IOMMU groups share same PASID table. IOMMU groups are a convenient way
  to cover various hardware weaknesses that prevent a group of device to
  be isolated by the IOMMU (untrusted bridge, for instance). It's foolish
  to assume that all PASID implementations will perfectly isolate devices
  within a bus and functions within a device, so let's assume all devices
  within an IOMMU group have to share PASID traffic as well. In general
  there will be a single device per group.

* It's up to the driver implementation to decide where to implement the
  PASID tables. For SMMU it's more convenient to have a single PASID table
  per domain. And I think the model fits better with the existing IOMMU
  API: IOVA traffic is shared by all devices in a domain, so should PASID
  traffic.

  This isn't a hard requirement though, an implementation can still have a
  PASID table for each device.

  Fault handling
  ==============

The second part adds a few helpers for distributing recoverable and
unrecoverable faults to other parts of the kernel:

* to the mm subsystem, when process page tables are shared with a device,
* to VFIO allowing it to forward translation faults to guests, and let
  them to recover from it,
* to device drivers that need to do something a bit more complex than just
  displaying a fault on dmesg.

You'll notice that this overlaps the work carried out by Jacob Pan for
vSVM fault reporting (published a few hours ago! [2]), which goes in the
same direction. For iommu_fault definition and handler registration it's
probably best to go with his more complete patchset, but I needed some
code to present the full solution and a way to describe both PRI and stall
data.

Patches
6: a new fault handler registration for device drivers (see also [2])
7: report faults to device drivers or add them to a workqueue (ditto)
8: call handle_mm_fault for recoverable faults
9: allow device driver to register blocking handlers

For the moment the interactions between process and fault queue are the
following. Hopefully it should be sufficient.

* When unbinding a process, the fault queue has to be flushed to ensure
  that no old fault will hit a future process that obtains the same PASID.

* When handling a fault, find a process by PASID and handle the fault on
  its mm. The process structure is refcounted, so releasing it in the
  fault handler might free the process.

Patch 10 adds a VFIO interface for binding a device owned by a userspace
driver to processes. I didn't add capability detection now, leaving that
discussion for later (also needed by vSVM).

  ARM SMMUv3 support
  ==================

The third part adds an example user, the SMMUv3 driver. A lot of
preparatory work is still needed to support these features, I only
extracted a small part of the previous series to make it common.

If you don't care about SMMU I advise to look at patches 21, which uses
the new process management interface. Patches 27, 29 and 35 use the new
fault queue for PRI and Stall.

Patches:
11:     track domain-master links (for ATS and CD invalidation)
12-13   add stall and PASID properties to the device tree
     -> New.
14-15:  add SSID support to the SMMU
     -> Now initializes the CD tables from the value found in DT.
16-20:  share ASID and page tables part 1
21:     implement iommu-process operations
     -> New.
22-26:  share ASID and page tables part 2
27:     use the new fault queue
     -> New.
28:     find masters by SID
     -> New.
29:     add stall support
     -> New.
30-36:  add PCI ATS, PRI and PASID
     -> Now uses mostly core code

This series is available on my svm/rfc2 branch [3]. It is based on v4.14
with Yisheng's stall fix [4]. Patch 8 also requires mmput_async which
should be added back soon enough [5]. Updates and fixes will go on
branch svm/current until next version.

Hoping this helps,
Jean

[1] https://lists.linuxfoundation.org/pipermail/iommu/2017-February/020599.html
[2] https://patchwork.kernel.org/patch/9988089/
[3] git://linux-arm.org/linux-jpb svm/rfc2
[4] https://patchwork.kernel.org/patch/9963863/
[5] https://patchwork.kernel.org/patch/9952257/

Jean-Philippe Brucker (36):
  iommu: Keep track of processes and PASIDs
  iommu: Add a process_exit callback for device drivers
  iommu/process: Add public function to search for a process
  iommu/process: Track process changes with an mmu_notifier
  iommu/process: Bind and unbind process to and from devices
  iommu: Extend fault reporting
  iommu: Add a fault handler
  iommu/fault: Handle mm faults
  iommu/fault: Allow blocking fault handlers
  vfio: Add support for Shared Virtual Memory
  iommu/arm-smmu-v3: Link domains and devices
  dt-bindings: document stall and PASID properties for IOMMU masters
  iommu/of: Add stall and pasid properties to iommu_fwspec
  iommu/arm-smmu-v3: Add support for Substream IDs
  iommu/arm-smmu-v3: Add second level of context descriptor table
  iommu/arm-smmu-v3: Add support for VHE
  iommu/arm-smmu-v3: Support broadcast TLB maintenance
  iommu/arm-smmu-v3: Add SVM feature checking
  arm64: mm: Pin down ASIDs for sharing contexts with devices
  iommu/arm-smmu-v3: Track ASID state
  iommu/arm-smmu-v3: Implement process operations
  iommu/io-pgtable-arm: Factor out ARM LPAE register defines
  iommu/arm-smmu-v3: Share process page tables
  iommu/arm-smmu-v3: Steal private ASID from a domain
  iommu/arm-smmu-v3: Use shared ASID set
  iommu/arm-smmu-v3: Add support for Hardware Translation Table Update
  iommu/arm-smmu-v3: Register fault workqueue
  iommu/arm-smmu-v3: Maintain a SID->device structure
  iommu/arm-smmu-v3: Add stall support for platform devices
  ACPI/IORT: Check ATS capability in root complex nodes
  iommu/arm-smmu-v3: Add support for PCI ATS
  iommu/arm-smmu-v3: Hook ATC invalidation to process ops
  iommu/arm-smmu-v3: Disable tagged pointers
  PCI: Make "PRG Response PASID Required" handling common
  iommu/arm-smmu-v3: Add support for PRI
  iommu/arm-smmu-v3: Add support for PCI PASID

 Documentation/devicetree/bindings/iommu/iommu.txt |   24 +
 MAINTAINERS                                       |    1 +
 arch/arm64/include/asm/mmu.h                      |    1 +
 arch/arm64/include/asm/mmu_context.h              |   11 +-
 arch/arm64/mm/context.c                           |   80 +-
 drivers/acpi/arm64/iort.c                         |   11 +
 drivers/iommu/Kconfig                             |   19 +
 drivers/iommu/Makefile                            |    2 +
 drivers/iommu/amd_iommu.c                         |   19 +-
 drivers/iommu/arm-smmu-v3.c                       | 1990 ++++++++++++++++++---
 drivers/iommu/io-pgfault.c                        |  421 +++++
 drivers/iommu/io-pgtable-arm.c                    |   48 +-
 drivers/iommu/io-pgtable-arm.h                    |   67 +
 drivers/iommu/iommu-process.c                     |  604 +++++++
 drivers/iommu/iommu.c                             |  113 ++
 drivers/iommu/of_iommu.c                          |   10 +
 drivers/pci/ats.c                                 |   17 +
 drivers/vfio/vfio_iommu_type1.c                   |  243 ++-
 include/linux/iommu.h                             |  254 ++-
 include/linux/pci-ats.h                           |    8 +
 include/uapi/linux/pci_regs.h                     |    1 +
 include/uapi/linux/vfio.h                         |   69 +
 22 files changed, 3690 insertions(+), 323 deletions(-)
 create mode 100644 drivers/iommu/io-pgfault.c
 create mode 100644 drivers/iommu/io-pgtable-arm.h
 create mode 100644 drivers/iommu/iommu-process.c

Bjorn Helgaas Oct. 6, 2017, 6:11 p.m. UTC | #1

On Fri, Oct 06, 2017 at 02:32:01PM +0100, Jean-Philippe Brucker wrote:
> The PASID ECN to the PCIe spec added a bit in the PRI status register that
> allows a Function to declare whether a PRG Response should contain the
> PASID prefix or not.
> 
> Move the helper that accesses it from amd_iommu into the PCI subsystem,
> renaming it to be consistent with the current spec (PRPR - PRG Response
> PASID Required).
> 
> Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>

Acked-by: Bjorn Helgaas <bhelgaas@google.com>

I assume this will be merged with the rest of the series, probably via an
IOMMU tree.

> ---
>  drivers/iommu/amd_iommu.c     | 19 +------------------
>  drivers/pci/ats.c             | 17 +++++++++++++++++
>  include/linux/pci-ats.h       |  8 ++++++++
>  include/uapi/linux/pci_regs.h |  1 +
>  4 files changed, 27 insertions(+), 18 deletions(-)
> 
> diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
> index 51f8215877f5..45036a253d63 100644
> --- a/drivers/iommu/amd_iommu.c
> +++ b/drivers/iommu/amd_iommu.c
> @@ -2039,23 +2039,6 @@ static int pdev_iommuv2_enable(struct pci_dev *pdev)
>  	return ret;
>  }
>  
> -/* FIXME: Move this to PCI code */
> -#define PCI_PRI_TLP_OFF		(1 << 15)
> -
> -static bool pci_pri_tlp_required(struct pci_dev *pdev)
> -{
> -	u16 status;
> -	int pos;
> -
> -	pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_PRI);
> -	if (!pos)
> -		return false;
> -
> -	pci_read_config_word(pdev, pos + PCI_PRI_STATUS, &status);
> -
> -	return (status & PCI_PRI_TLP_OFF) ? true : false;
> -}
> -
>  /*
>   * If a device is not yet associated with a domain, this function
>   * assigns it visible for the hardware
> @@ -2084,7 +2067,7 @@ static int attach_device(struct device *dev,
>  
>  			dev_data->ats.enabled = true;
>  			dev_data->ats.qdep    = pci_ats_queue_depth(pdev);
> -			dev_data->pri_tlp     = pci_pri_tlp_required(pdev);
> +			dev_data->pri_tlp     = pci_prg_resp_requires_prefix(pdev);
>  		}
>  	} else if (amd_iommu_iotlb_sup &&
>  		   pci_enable_ats(pdev, PAGE_SHIFT) == 0) {
> diff --git a/drivers/pci/ats.c b/drivers/pci/ats.c
> index ad8ddbbbf245..f95e42df728b 100644
> --- a/drivers/pci/ats.c
> +++ b/drivers/pci/ats.c
> @@ -389,3 +389,20 @@ int pci_max_pasids(struct pci_dev *pdev)
>  }
>  EXPORT_SYMBOL_GPL(pci_max_pasids);
>  #endif /* CONFIG_PCI_PASID */
> +
> +#if defined(CONFIG_PCI_PASID) && defined(CONFIG_PCI_PRI)
> +bool pci_prg_resp_requires_prefix(struct pci_dev *pdev)
> +{
> +	u16 status;
> +	int pos;
> +
> +	pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_PRI);
> +	if (!pos)
> +		return false;
> +
> +	pci_read_config_word(pdev, pos + PCI_PRI_STATUS, &status);
> +
> +	return !!(status & PCI_PRI_STATUS_PRPR);
> +}
> +EXPORT_SYMBOL_GPL(pci_prg_resp_requires_prefix);
> +#endif /* CONFIG_PCI_PASID && CONFIG_PCI_PRI */
> diff --git a/include/linux/pci-ats.h b/include/linux/pci-ats.h
> index 782fb8e0755f..367ea9448441 100644
> --- a/include/linux/pci-ats.h
> +++ b/include/linux/pci-ats.h
> @@ -67,5 +67,13 @@ static inline int pci_max_pasids(struct pci_dev *pdev)
>  
>  #endif /* CONFIG_PCI_PASID */
>  
> +#if defined(CONFIG_PCI_PASID) && defined(CONFIG_PCI_PRI)
> +bool pci_prg_resp_requires_prefix(struct pci_dev *pdev);
> +#else
> +static inline bool pci_prg_resp_requires_prefix(struct pci_dev *pdev)
> +{
> +	return false;
> +}
> +#endif /* CONFIG_PCI_PASID && CONFIG_PCI_PRI */
>  
>  #endif /* LINUX_PCI_ATS_H*/
> diff --git a/include/uapi/linux/pci_regs.h b/include/uapi/linux/pci_regs.h
> index f8d58045926f..a0eeb16a2bfe 100644
> --- a/include/uapi/linux/pci_regs.h
> +++ b/include/uapi/linux/pci_regs.h
> @@ -862,6 +862,7 @@
>  #define  PCI_PRI_STATUS_RF	0x001	/* Response Failure */
>  #define  PCI_PRI_STATUS_UPRGI	0x002	/* Unexpected PRG index */
>  #define  PCI_PRI_STATUS_STOPPED	0x100	/* PRI Stopped */
> +#define  PCI_PRI_STATUS_PRPR	0x8000	/* PRG Response requires PASID prefix */
>  #define PCI_PRI_MAX_REQ		0x08	/* PRI max reqs supported */
>  #define PCI_PRI_ALLOC_REQ	0x0c	/* PRI max reqs allowed */
>  #define PCI_EXT_CAP_PRI_SIZEOF	16
> -- 
> 2.13.3
> 
--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Yisheng Xie Oct. 9, 2017, 9:49 a.m. UTC | #2

Hi Jean,

On 2017/10/6 21:31, Jean-Philippe Brucker wrote:
> Following discussions at plumbers and elsewhere, it seems like we need to
> unify some of the Shared Virtual Memory (SVM) code, in order to define
> clear semantics for the SVM API.
> 
> My previous RFC [1] was centered on the SMMUv3, but some of this code will
> need to be reused by the SMMUv2 and virtio-iommu drivers. This second
> proposal focuses on abstracting a little more into the core IOMMU API, and
> also trying to find common ground for all SVM-capable IOMMUs.
> 
> SVM is, in the context of the IOMMU, sharing page tables between a process
> and a device. Traditionally it requires IO Page Fault and Process Address
> Space ID capabilities in device and IOMMU.
> 
> * A device driver can bind a process to a device, with iommu_process_bind.
>   Internally we hold on to the mm and get notified of its activity with an
>   mmu_notifier. The bond is removed by exit_mm, by a call to
>   iommu_process_unbind or iommu_detach_device.
> 
> * iommu_process_bind returns a 20-bit PASID (PCI terminology) to the
>   device driver, which programs it into the device to access the process
>   address space.
> 
> * The device and the IOMMU support recoverable page faults. This can be
>   either ATS+PRI for PCI, or platform-specific mechanisms such as Stall
>   for SMMU.
> 
> Ideally systems wanting to use SVM have to support these three features,
> but in practice we'll see implementations supporting just a subset of
> them, especially in validation environments. So even if this particular
> patchset assumes all three capabilities, it should also be possible to
> support PASID without IOPF (by pinning everything, see non-system SVM in
> OpenCL)
How to pin everything? If user malloc anything we should pin it. It should
from user or driver?

> , or IOPF without PASID (sharing the single device address space
> with a process, could be useful for DPDK+VFIO).
> 
> Implementing both these cases would enable PT sharing alone. Some people
> would also like IOPF alone without SVM (covered by this series) or process
> management without shared PT (not covered). Using these features
> individually is also important for testing, as SVM is in its infancy and
> providing easy ways to test is essential to reduce the number of quirks
> down the line.
> 
>   Process management
>   ==================
> 
> The first part of this series introduces boilerplate code for managing
> PASIDs and processes bound to devices. It's something any IOMMU driver
> that wants to support bind/unbind will have to do, and it is difficult to
> get right.
> 
> Patches
> 1: iommu_process and PASID allocation, attach and release
> 2: process_exit callback for device drivers
> 3: iommu_process search by PASID
> 4: track process changes with an MMU notifiers
> 5: bind and unbind operations
> 
> My proposal uses the following model:
> 
> * The PASID space is system-wide. This means that a Linux process will
>   have a single PASID. I introduce the iommu_process structure and a
>   global IDR to manage this.
> 
> * An iommu_process can be bound to multiple domains, and a domain can have
>   multiple iommu_process.
when bind a task to device, can we create a single domain for it? I am thinking
about process management without shared PT(for some device only support PASID
without pri ability), it seems hard to expand if a domain have multiple iommu_process?
Do you have any idea about this?

> 
> * IOMMU groups share same PASID table. IOMMU groups are a convenient way
>   to cover various hardware weaknesses that prevent a group of device to
>   be isolated by the IOMMU (untrusted bridge, for instance). It's foolish
>   to assume that all PASID implementations will perfectly isolate devices
>   within a bus and functions within a device, so let's assume all devices
>   within an IOMMU group have to share PASID traffic as well. In general
>   there will be a single device per group.
> 
> * It's up to the driver implementation to decide where to implement the
>   PASID tables. For SMMU it's more convenient to have a single PASID table
>   per domain. And I think the model fits better with the existing IOMMU
>   API: IOVA traffic is shared by all devices in a domain, so should PASID
>   traffic.
What's the meaning of "share PASID traffic"? PASID space is system-wide,
and a domain can have multiple iommu_process , so a domain can have multiple
PASIDs , one PASID for a iommu_process, right?

Yisheng Xie
Thanks


--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Jean-Philippe Brucker Oct. 9, 2017, 11:36 a.m. UTC | #3

Hi,

On 09/10/17 10:49, Yisheng Xie wrote:
> Hi Jean,
> 
> On 2017/10/6 21:31, Jean-Philippe Brucker wrote:
>> Following discussions at plumbers and elsewhere, it seems like we need to
>> unify some of the Shared Virtual Memory (SVM) code, in order to define
>> clear semantics for the SVM API.
>>
>> My previous RFC [1] was centered on the SMMUv3, but some of this code will
>> need to be reused by the SMMUv2 and virtio-iommu drivers. This second
>> proposal focuses on abstracting a little more into the core IOMMU API, and
>> also trying to find common ground for all SVM-capable IOMMUs.
>>
>> SVM is, in the context of the IOMMU, sharing page tables between a process
>> and a device. Traditionally it requires IO Page Fault and Process Address
>> Space ID capabilities in device and IOMMU.
>>
>> * A device driver can bind a process to a device, with iommu_process_bind.
>>   Internally we hold on to the mm and get notified of its activity with an
>>   mmu_notifier. The bond is removed by exit_mm, by a call to
>>   iommu_process_unbind or iommu_detach_device.
>>
>> * iommu_process_bind returns a 20-bit PASID (PCI terminology) to the
>>   device driver, which programs it into the device to access the process
>>   address space.
>>
>> * The device and the IOMMU support recoverable page faults. This can be
>>   either ATS+PRI for PCI, or platform-specific mechanisms such as Stall
>>   for SMMU.
>>
>> Ideally systems wanting to use SVM have to support these three features,
>> but in practice we'll see implementations supporting just a subset of
>> them, especially in validation environments. So even if this particular
>> patchset assumes all three capabilities, it should also be possible to
>> support PASID without IOPF (by pinning everything, see non-system SVM in
>> OpenCL)
> How to pin everything? If user malloc anything we should pin it. It should
> from user or driver?

For userspace drivers, I guess it would be via a VFIO ioctl, that does the
same preparatory work as VFIO_IOMMU_MAP_DMA, but doesn't call iommu_map.
For things like OpenCL SVM buffers, it's the kernel driver that does the
pinning, just like VFIO does it, before launching the work on a SVM buffer.

>> , or IOPF without PASID (sharing the single device address space
>> with a process, could be useful for DPDK+VFIO).
>>
>> Implementing both these cases would enable PT sharing alone. Some people
>> would also like IOPF alone without SVM (covered by this series) or process
>> management without shared PT (not covered). Using these features
>> individually is also important for testing, as SVM is in its infancy and
>> providing easy ways to test is essential to reduce the number of quirks
>> down the line.
>>
>>   Process management
>>   ==================
>>
>> The first part of this series introduces boilerplate code for managing
>> PASIDs and processes bound to devices. It's something any IOMMU driver
>> that wants to support bind/unbind will have to do, and it is difficult to
>> get right.
>>
>> Patches
>> 1: iommu_process and PASID allocation, attach and release
>> 2: process_exit callback for device drivers
>> 3: iommu_process search by PASID
>> 4: track process changes with an MMU notifiers
>> 5: bind and unbind operations
>>
>> My proposal uses the following model:
>>
>> * The PASID space is system-wide. This means that a Linux process will
>>   have a single PASID. I introduce the iommu_process structure and a
>>   global IDR to manage this.
>>
>> * An iommu_process can be bound to multiple domains, and a domain can have
>>   multiple iommu_process.
> when bind a task to device, can we create a single domain for it? I am thinking
> about process management without shared PT(for some device only support PASID
> without pri ability), it seems hard to expand if a domain have multiple iommu_process?
> Do you have any idea about this?

A device always has to be in a domain, as far as I know. Not supporting
PRI forces you to pin down all user mappings (or just the ones you use for
DMA) but you should sill be able to share PT. Now if you don't support
shared PT either, but only PASID, then you'll have to use io-pgtable and a
new map/unmap API on an iommu_process. I don't understand your concern
though, how would the link between process and domains prevent this use-case?

>> * IOMMU groups share same PASID table. IOMMU groups are a convenient way
>>   to cover various hardware weaknesses that prevent a group of device to
>>   be isolated by the IOMMU (untrusted bridge, for instance). It's foolish
>>   to assume that all PASID implementations will perfectly isolate devices
>>   within a bus and functions within a device, so let's assume all devices
>>   within an IOMMU group have to share PASID traffic as well. In general
>>   there will be a single device per group.
>>
>> * It's up to the driver implementation to decide where to implement the
>>   PASID tables. For SMMU it's more convenient to have a single PASID table
>>   per domain. And I think the model fits better with the existing IOMMU
>>   API: IOVA traffic is shared by all devices in a domain, so should PASID
>>   traffic.
> What's the meaning of "share PASID traffic"? PASID space is system-wide,
> and a domain can have multiple iommu_process , so a domain can have multiple
> PASIDs , one PASID for a iommu_process, right?

Yes, I meant that if a device can access mappings for a specific PASID,
then other devices in that same domain are also able to access them.

A few reasons for this choice in the SMMU:
* As all devices in an IOMMU group will be in the same domain and share
the same PASID traffic, it encompasses that case. Groups are the smallest
isolation granularity, then users are free to choose to put different
IOMMU groups in different domains.
* For architectures that can have both non-PASID and PASID traffic
simultaneously, like the SMMU, it is simpler to reason about PASID tables
being a domain, rather than sharing PASID0 within the domain and handling
all others per device.
* It's the same principle as non-PASID mappings (iommu_map/unmap is on a
domain).
* It implement the classic example of IOMMU architectures where multiple
device descriptors point to the same PASID tables.
* It may be desirable for drivers to share PASIDs within a domain, if they
are actually using domains for conveniently sharing address spaces between
devices. I'm not sure how much this is used as a feature. It does model a
shared bus where each device can snoop DMA, so it may be useful.

bind/unbind operations are done on devices and not domains, though,
because it allows users to know which device supports PASID, PRI, etc.

Thanks,
Jean
--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Joerg Roedel Oct. 11, 2017, 11:33 a.m. UTC | #4

Hi Jean-Philipe,

Thanks for your patches, this is definitly a step in the right direction
to get generic support for IO virtual memory into the IOMMU core code.

But I see an issue with the design around task_struct, please see
below.

On Fri, Oct 06, 2017 at 02:31:32PM +0100, Jean-Philippe Brucker wrote:
> +int iommu_process_bind_device(struct device *dev, struct task_struct *task,
> +			      int *pasid, int flags)

I just took this patch as an example, it is in the other patches as
well. The code is designed around 'struct task_struct' while it should
really be designed around 'struct mm_struct'. You are not attaching a
specific process to a device, but the address-space of one or more
processes. And that should be reflected in the design.

There are several bad implications of building it around task_struct,
one is the life-time of the binding. If the address space is detached
from the device when the process exits, only the thread that did the
setup can can safely make use of the device, because if the device is
accessed from another thread it will crash when the setup-thread exits.

There are other benefits of using mm_struct, for example that you can
use mmu_notifiers to exit-handling.

Here is how I think the base API should look like:

	* iommu_iovm_device_init(struct device *dev);
	  iommu_iovm_device_shutdown(struct device *dev);

	  These two functions do the device specific setup/shutdown. For
	  PCI this would include enabling the PRI, PASID, and ATS
	  capabilities and setting up a PASID table for the device.

	* iommu_iovm_bind_mm(struct device *dev, struct mm_struct *mm,
			     iovm_shutdown_cb *cb);
	  iommu_iovm_unbind_mm(struct device *dev, struct mm_struct *mm);

	  These functions add and delete the entries in the PASID table
	  for the device and setup mmu_notifiers for the mm_struct to
	  keep IOMMU TLBs in sync with the CPU TLBs.

	  The shutdown_cb is invoked by the IOMMU core code when the
	  mm_struct goes away, for example because the process
	  segfaults.

	The PASID handling is best done these functions as well, unless
	there is a strong reason to allow device drivers to do the
	handling themselves.

The context data can be stored directly in mm_struct, including the
PASID for that mm.

There is of course more functionality needed, the above only outlines
the very basic needs.

Regards,

	Joerg

--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Jean-Philippe Brucker Oct. 12, 2017, 11:13 a.m. UTC | #5

Hi Joerg,

On 11/10/17 12:33, Joerg Roedel wrote:
> Hi Jean-Philipe,
> 
> Thanks for your patches, this is definitly a step in the right direction
> to get generic support for IO virtual memory into the IOMMU core code.
> 
> But I see an issue with the design around task_struct, please see
> below.
> 
> On Fri, Oct 06, 2017 at 02:31:32PM +0100, Jean-Philippe Brucker wrote:
>> +int iommu_process_bind_device(struct device *dev, struct task_struct *task,
>> +			      int *pasid, int flags)
> 
> I just took this patch as an example, it is in the other patches as
> well. The code is designed around 'struct task_struct' while it should
> really be designed around 'struct mm_struct'. You are not attaching a
> specific process to a device, but the address-space of one or more
> processes. And that should be reflected in the design.

Agreed. The task struct is only really needed for obtaining the mm in my
code. It also keeps hold of a pid, but that's wrong and easy to remove.

> There are several bad implications of building it around task_struct,
> one is the life-time of the binding. If the address space is detached
> from the device when the process exits, only the thread that did the
> setup can can safely make use of the device, because if the device is
> accessed from another thread it will crash when the setup-thread exits.
> 
> There are other benefits of using mm_struct, for example that you can
> use mmu_notifiers to exit-handling.
> 
> Here is how I think the base API should look like:
> 
> 	* iommu_iovm_device_init(struct device *dev);
> 	  iommu_iovm_device_shutdown(struct device *dev);
> 
> 	  These two functions do the device specific setup/shutdown. For
> 	  PCI this would include enabling the PRI, PASID, and ATS
> 	  capabilities and setting up a PASID table for the device.

Ok. On SMMU the PASID table also hosts the non-PASID page table pointer,
so the table and capability cannot be setup later than attach_device (and
we'll probably have to enable PASID in add_device). But I suppose it's an
implementation detail.

Some device drivers will want to use ATS alone, for accelerating IOVA
traffic. Should we enable it automatically, or provide device drivers with
a way to enable it manually? According to the PCI spec, PASID has to be
enabled before ATS, so device_init would have to first disable ATS in that
case.

> 	* iommu_iovm_bind_mm(struct device *dev, struct mm_struct *mm,
> 			     iovm_shutdown_cb *cb);
> 	  iommu_iovm_unbind_mm(struct device *dev, struct mm_struct *mm);
> 
> 	  These functions add and delete the entries in the PASID table
> 	  for the device and setup mmu_notifiers for the mm_struct to
> 	  keep IOMMU TLBs in sync with the CPU TLBs.
> 
> 	  The shutdown_cb is invoked by the IOMMU core code when the
> 	  mm_struct goes away, for example because the process
> 	  segfaults.
> 
> 	The PASID handling is best done these functions as well, unless
> 	there is a strong reason to allow device drivers to do the
> 	handling themselves.
> 
> The context data can be stored directly in mm_struct, including the
> PASID for that mm.

Changing mm_struct probably isn't required at the moment, since the mm
subsystem won't use the context data or the PASID. Outside of
drivers/iommu/, only the caller of bind_mm needs the PASID in order to
program it into the device. The only advantage I see would be slightly
faster bind(), when finding out if a mm is already bound to devices. But
we don't really need fast bind(), so I don't think we'd have enough
material to argue for a change in mm_struct.

We do need to allocate a separate "iommu_mm_struct" wrapper to store the
mmu_notifier. Maybe bind() could return this structure (that contains the
PASID), and unbind() would take this iommu_mm_struct as argument?

Thanks
Jean
--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Yisheng Xie Oct. 12, 2017, 12:05 p.m. UTC | #6

Hi Jean,

Thanks for replying.
On 2017/10/9 19:36, Jean-Philippe Brucker wrote:
> Hi,
> 
> On 09/10/17 10:49, Yisheng Xie wrote:
>> Hi Jean,
>>
>> On 2017/10/6 21:31, Jean-Philippe Brucker wrote:
>>> Following discussions at plumbers and elsewhere, it seems like we need to
>>> unify some of the Shared Virtual Memory (SVM) code, in order to define
>>> clear semantics for the SVM API.
>>>
>>> My previous RFC [1] was centered on the SMMUv3, but some of this code will
>>> need to be reused by the SMMUv2 and virtio-iommu drivers. This second
>>> proposal focuses on abstracting a little more into the core IOMMU API, and
>>> also trying to find common ground for all SVM-capable IOMMUs.
>>>
>>> SVM is, in the context of the IOMMU, sharing page tables between a process
>>> and a device. Traditionally it requires IO Page Fault and Process Address
>>> Space ID capabilities in device and IOMMU.
>>>
>>> * A device driver can bind a process to a device, with iommu_process_bind.
>>>   Internally we hold on to the mm and get notified of its activity with an
>>>   mmu_notifier. The bond is removed by exit_mm, by a call to
>>>   iommu_process_unbind or iommu_detach_device.
>>>
>>> * iommu_process_bind returns a 20-bit PASID (PCI terminology) to the
>>>   device driver, which programs it into the device to access the process
>>>   address space.
>>>
>>> * The device and the IOMMU support recoverable page faults. This can be
>>>   either ATS+PRI for PCI, or platform-specific mechanisms such as Stall
>>>   for SMMU.
>>>
>>> Ideally systems wanting to use SVM have to support these three features,
>>> but in practice we'll see implementations supporting just a subset of
>>> them, especially in validation environments. So even if this particular
>>> patchset assumes all three capabilities, it should also be possible to
>>> support PASID without IOPF (by pinning everything, see non-system SVM in
>>> OpenCL)
>> How to pin everything? If user malloc anything we should pin it. It should
>> from user or driver?
> 
> For userspace drivers, I guess it would be via a VFIO ioctl, that does the
> same preparatory work as VFIO_IOMMU_MAP_DMA, but doesn't call iommu_map.
> For things like OpenCL SVM buffers, it's the kernel driver that does the
> pinning, just like VFIO does it, before launching the work on a SVM buffer.
> 
>>> , or IOPF without PASID (sharing the single device address space
>>> with a process, could be useful for DPDK+VFIO).
>>>
>>> Implementing both these cases would enable PT sharing alone. Some people
>>> would also like IOPF alone without SVM (covered by this series) or process
>>> management without shared PT (not covered). Using these features
>>> individually is also important for testing, as SVM is in its infancy and
>>> providing easy ways to test is essential to reduce the number of quirks
>>> down the line.
>>>
>>>   Process management
>>>   ==================
>>>
>>> The first part of this series introduces boilerplate code for managing
>>> PASIDs and processes bound to devices. It's something any IOMMU driver
>>> that wants to support bind/unbind will have to do, and it is difficult to
>>> get right.
>>>
>>> Patches
>>> 1: iommu_process and PASID allocation, attach and release
>>> 2: process_exit callback for device drivers
>>> 3: iommu_process search by PASID
>>> 4: track process changes with an MMU notifiers
>>> 5: bind and unbind operations
>>>
>>> My proposal uses the following model:
>>>
>>> * The PASID space is system-wide. This means that a Linux process will
>>>   have a single PASID. I introduce the iommu_process structure and a
>>>   global IDR to manage this.
>>>
>>> * An iommu_process can be bound to multiple domains, and a domain can have
>>>   multiple iommu_process.
>> when bind a task to device, can we create a single domain for it? I am thinking
>> about process management without shared PT(for some device only support PASID
>> without pri ability), it seems hard to expand if a domain have multiple iommu_process?
>> Do you have any idea about this?
> 
> A device always has to be in a domain, as far as I know. Not supporting
> PRI forces you to pin down all user mappings (or just the ones you use for
> DMA) but you should sill be able to share PT. Now if you don't support
> shared PT either, but only PASID, then you'll have to use io-pgtable and a
> new map/unmap API on an iommu_process. I don't understand your concern
> though, how would the link between process and domains prevent this use-case?
> 
So you mean that if an iommu_process bind to multiple devices it should create
multiple io-pgtables? or just share the same io-pgtable?

>>> * IOMMU groups share same PASID table. IOMMU groups are a convenient way
>>>   to cover various hardware weaknesses that prevent a group of device to
>>>   be isolated by the IOMMU (untrusted bridge, for instance). It's foolish
>>>   to assume that all PASID implementations will perfectly isolate devices
>>>   within a bus and functions within a device, so let's assume all devices
>>>   within an IOMMU group have to share PASID traffic as well. In general
>>>   there will be a single device per group.
>>>
>>> * It's up to the driver implementation to decide where to implement the
>>>   PASID tables. For SMMU it's more convenient to have a single PASID table
>>>   per domain. And I think the model fits better with the existing IOMMU
>>>   API: IOVA traffic is shared by all devices in a domain, so should PASID
>>>   traffic.
>> What's the meaning of "share PASID traffic"? PASID space is system-wide,
>> and a domain can have multiple iommu_process , so a domain can have multiple
>> PASIDs , one PASID for a iommu_process, right?
I get what your mean now, thanks for your explain.

> 
> Yes, I meant that if a device can access mappings for a specific PASID,
> then other devices in that same domain are also able to access them.
> 
> A few reasons for this choice in the SMMU:
> * As all devices in an IOMMU group will be in the same domain and share
> the same PASID traffic, it encompasses that case. Groups are the smallest
> isolation granularity, then users are free to choose to put different
> IOMMU groups in different domains.
> * For architectures that can have both non-PASID and PASID traffic
> simultaneously, like the SMMU, it is simpler to reason about PASID tables
> being a domain, rather than sharing PASID0 within the domain and handling
> all others per device.
> * It's the same principle as non-PASID mappings (iommu_map/unmap is on a
> domain).
> * It implement the classic example of IOMMU architectures where multiple
> device descriptors point to the same PASID tables.
> * It may be desirable for drivers to share PASIDs within a domain, if they
> are actually using domains for conveniently sharing address spaces between
> devices. I'm not sure how much this is used as a feature. It does model a
> shared bus where each device can snoop DMA, so it may be useful.
> 

I get another question about this design, thinking about the following case:

If a platform device with PASID ability, e.g. accelerator, which have multiple
accelerator process units(APUs), it may create multiple virtual devices, one
virtual device represent for an APU, with the same sid.

They can be in different groups, however must be in the same domain as this
design, for domain held the PASID table, right? So how could they be used by
different guest OS?

Thanks
Yisheng Xie

> bind/unbind operations are done on devices and not domains, though,
> because it allows users to know which device supports PASID, PRI, etc.
> 
> Thanks,
> Jean
> 
> .
> 

--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Joerg Roedel Oct. 12, 2017, 12:47 p.m. UTC | #7

Hi Jean-Philippe,

On Thu, Oct 12, 2017 at 12:13:20PM +0100, Jean-Philippe Brucker wrote:
> On 11/10/17 12:33, Joerg Roedel wrote:
> > Here is how I think the base API should look like:
> > 
> > 	* iommu_iovm_device_init(struct device *dev);
> > 	  iommu_iovm_device_shutdown(struct device *dev);
> > 
> > 	  These two functions do the device specific setup/shutdown. For
> > 	  PCI this would include enabling the PRI, PASID, and ATS
> > 	  capabilities and setting up a PASID table for the device.
> 
> Ok. On SMMU the PASID table also hosts the non-PASID page table pointer,
> so the table and capability cannot be setup later than attach_device (and
> we'll probably have to enable PASID in add_device). But I suppose it's an
> implementation detail.

Right, when the capabilities are enabled is an implementation detail of
the iommu-drivers. 

> Some device drivers will want to use ATS alone, for accelerating IOVA
> traffic. Should we enable it automatically, or provide device drivers with
> a way to enable it manually? According to the PCI spec, PASID has to be
> enabled before ATS, so device_init would have to first disable ATS in that
> case.

Yes, driver can enable ATS for normal use of a device, and
disable/re-enable it when the driver requests PASID/PRI functionality.
That is also an implementation detail. You should probably also document
that the init/shutdown functions may interrupt device operation, so that
driver writers are aware of that.

> 
> > 	* iommu_iovm_bind_mm(struct device *dev, struct mm_struct *mm,
> > 			     iovm_shutdown_cb *cb);
> > 	  iommu_iovm_unbind_mm(struct device *dev, struct mm_struct *mm);
> > 
> > 	  These functions add and delete the entries in the PASID table
> > 	  for the device and setup mmu_notifiers for the mm_struct to
> > 	  keep IOMMU TLBs in sync with the CPU TLBs.
> > 
> > 	  The shutdown_cb is invoked by the IOMMU core code when the
> > 	  mm_struct goes away, for example because the process
> > 	  segfaults.
> > 
> > 	The PASID handling is best done these functions as well, unless
> > 	there is a strong reason to allow device drivers to do the
> > 	handling themselves.
> > 
> > The context data can be stored directly in mm_struct, including the
> > PASID for that mm.
> 
> Changing mm_struct probably isn't required at the moment, since the mm
> subsystem won't use the context data or the PASID. Outside of
> drivers/iommu/, only the caller of bind_mm needs the PASID in order to
> program it into the device. The only advantage I see would be slightly
> faster bind(), when finding out if a mm is already bound to devices. But
> we don't really need fast bind(), so I don't think we'd have enough
> material to argue for a change in mm_struct.

The idea behind storing the PASID in mm_struct is that we have a
system-wide PASID-allocator and only one PASID per address space, even
when accessed from multiple devices.

There will be hardware implemenations where this is required, afaik. It
doesn't mean that it _needs_ to be part of mm_struct, but it is
certainly easier than tracking this 1-1 relation separatly.

> We do need to allocate a separate "iommu_mm_struct" wrapper to store the
> mmu_notifier. Maybe bind() could return this structure (that contains the
> PASID), and unbind() would take this iommu_mm_struct as argument?

I'd like to track this iommu_mm_struct only in iommu-code, otherwise
drivers need to track the pointer somewhere. And we need to track the
mm_struct->iommu_mm_struct relation anyway in core-code to handle events
like segfaults, when the whole mm_struct goes away under us. So when we
track this in core code, there is no need to track this again in the
device drivers.


Regards,

	Joerg

--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Jean-Philippe Brucker Oct. 12, 2017, 12:55 p.m. UTC | #8

On 12/10/17 13:05, Yisheng Xie wrote:
[...]
>>>> * An iommu_process can be bound to multiple domains, and a domain can have
>>>>   multiple iommu_process.
>>> when bind a task to device, can we create a single domain for it? I am thinking
>>> about process management without shared PT(for some device only support PASID
>>> without pri ability), it seems hard to expand if a domain have multiple iommu_process?
>>> Do you have any idea about this?
>>
>> A device always has to be in a domain, as far as I know. Not supporting
>> PRI forces you to pin down all user mappings (or just the ones you use for
>> DMA) but you should sill be able to share PT. Now if you don't support
>> shared PT either, but only PASID, then you'll have to use io-pgtable and a
>> new map/unmap API on an iommu_process. I don't understand your concern
>> though, how would the link between process and domains prevent this use-case?
>>
> So you mean that if an iommu_process bind to multiple devices it should create
> multiple io-pgtables? or just share the same io-pgtable?

I don't know to be honest, I haven't thought much about the io-pgtable
case, I'm all about sharing the mm :)

It really depends on what the user (GPU driver I assume) wants. I think
that if you're not sharing an mm with the device, then you're trying to
hide parts of the process to the device, so you'd also want the
flexibility of having different io-pgtables between devices. Different
devices accessing isolated parts of the process requires separate io-pgtables.

>>>> * IOMMU groups share same PASID table. IOMMU groups are a convenient way
>>>>   to cover various hardware weaknesses that prevent a group of device to
>>>>   be isolated by the IOMMU (untrusted bridge, for instance). It's foolish
>>>>   to assume that all PASID implementations will perfectly isolate devices
>>>>   within a bus and functions within a device, so let's assume all devices
>>>>   within an IOMMU group have to share PASID traffic as well. In general
>>>>   there will be a single device per group.
>>>>
>>>> * It's up to the driver implementation to decide where to implement the
>>>>   PASID tables. For SMMU it's more convenient to have a single PASID table
>>>>   per domain. And I think the model fits better with the existing IOMMU
>>>>   API: IOVA traffic is shared by all devices in a domain, so should PASID
>>>>   traffic.
>>> What's the meaning of "share PASID traffic"? PASID space is system-wide,
>>> and a domain can have multiple iommu_process , so a domain can have multiple
>>> PASIDs , one PASID for a iommu_process, right?
> I get what your mean now, thanks for your explain.
> 
>>
>> Yes, I meant that if a device can access mappings for a specific PASID,
>> then other devices in that same domain are also able to access them.
>>
>> A few reasons for this choice in the SMMU:
>> * As all devices in an IOMMU group will be in the same domain and share
>> the same PASID traffic, it encompasses that case. Groups are the smallest
>> isolation granularity, then users are free to choose to put different
>> IOMMU groups in different domains.
>> * For architectures that can have both non-PASID and PASID traffic
>> simultaneously, like the SMMU, it is simpler to reason about PASID tables
>> being a domain, rather than sharing PASID0 within the domain and handling
>> all others per device.
>> * It's the same principle as non-PASID mappings (iommu_map/unmap is on a
>> domain).
>> * It implement the classic example of IOMMU architectures where multiple
>> device descriptors point to the same PASID tables.
>> * It may be desirable for drivers to share PASIDs within a domain, if they
>> are actually using domains for conveniently sharing address spaces between
>> devices. I'm not sure how much this is used as a feature. It does model a
>> shared bus where each device can snoop DMA, so it may be useful.
>>
> 
> I get another question about this design, thinking about the following case:
> 
> If a platform device with PASID ability, e.g. accelerator, which have multiple
> accelerator process units(APUs), it may create multiple virtual devices, one
> virtual device represent for an APU, with the same sid.
> 
> They can be in different groups, however must be in the same domain as this
> design, for domain held the PASID table, right? So how could they be used by
> different guest OS?

If they have the same SID, they will be in the same group as there will be
a single entry in the SMMU stream table. Otherwise, if the virtual devices
can be properly isolated from each other (in the same way as PCI SR-IOV),
they will each have their own SID, can each be in a different IOMMU group
and can be assigned to different guests.

Thanks,
Jean
--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Jordan Crouse Oct. 12, 2017, 3:28 p.m. UTC | #9

On Thu, Oct 12, 2017 at 01:55:32PM +0100, Jean-Philippe Brucker wrote:
> On 12/10/17 13:05, Yisheng Xie wrote:
> [...]
> >>>> * An iommu_process can be bound to multiple domains, and a domain can have
> >>>>   multiple iommu_process.
> >>> when bind a task to device, can we create a single domain for it? I am thinking
> >>> about process management without shared PT(for some device only support PASID
> >>> without pri ability), it seems hard to expand if a domain have multiple iommu_process?
> >>> Do you have any idea about this?
> >>
> >> A device always has to be in a domain, as far as I know. Not supporting
> >> PRI forces you to pin down all user mappings (or just the ones you use for
> >> DMA) but you should sill be able to share PT. Now if you don't support
> >> shared PT either, but only PASID, then you'll have to use io-pgtable and a
> >> new map/unmap API on an iommu_process. I don't understand your concern
> >> though, how would the link between process and domains prevent this use-case?
> >>
> > So you mean that if an iommu_process bind to multiple devices it should create
> > multiple io-pgtables? or just share the same io-pgtable?
> 
> I don't know to be honest, I haven't thought much about the io-pgtable
> case, I'm all about sharing the mm :)
> 
> It really depends on what the user (GPU driver I assume) wants. I think
> that if you're not sharing an mm with the device, then you're trying to
> hide parts of the process to the device, so you'd also want the
> flexibility of having different io-pgtables between devices. Different
> devices accessing isolated parts of the process requires separate io-pgtables.

In our specific Snapdragon use case the GPU is the only entity that cares about
process specific io-pgtables.  Everything else (display, video, camera) is happy
using a global io-ptgable.  The reasoning is that the GPU is programmable from
user space and can be easily used to copy data whereas the other use cases have
mostly fixed functions.

Even if different devices did want to have a process specific io-pgtable I doubt
we would share them.  Every device uses the IOMMU differently and the magic
needed to share a io-pgtable between (for example) a GPU and a DSP would be
prohibitively complicated.

Jordan

Sinan Kaya Oct. 20, 2017, 11:32 p.m. UTC | #10

few nits below.

> +/*
> + * Allocate a iommu_process structure for the given task.
> + *
> + * Ideally we shouldn't need the domain parameter, since iommu_process is
> + * system-wide, but we use it to retrieve the driver's allocation ops and a
> + * PASID range.
> + */
> +static struct iommu_process *
> +iommu_process_alloc(struct iommu_domain *domain, struct task_struct *task)
> +{
> +	int err;
> +	int pasid;
> +	struct iommu_process *process;
> +
> +	if (WARN_ON(!domain->ops->process_alloc || !domain->ops->process_free))
> +		return ERR_PTR(-ENODEV);
> +
> +	process = domain->ops->process_alloc(task);
> +	if (IS_ERR(process))
> +		return process;
> +	if (!process)
> +		return ERR_PTR(-ENOMEM);
> +
> +	process->pid		= get_task_pid(task, PIDTYPE_PID);
> +	process->release	= domain->ops->process_free;
> +	INIT_LIST_HEAD(&process->domains);
> +	kref_init(&process->kref);
> +
nit, I think you should place this check right after the pid assignment.

> +	if (!process->pid) {
> +		err = -EINVAL;
> +		goto err_free_process;
> +	}
> +
> +	idr_preload(GFP_KERNEL);
> +	spin_lock(&iommu_process_lock);
> +	pasid = idr_alloc_cyclic(&iommu_process_idr, process, domain->min_pasid,
> +				 domain->max_pasid + 1, GFP_ATOMIC);
> +	process->pasid = pasid;
> +	spin_unlock(&iommu_process_lock);
> +	idr_preload_end();
> +

nit, You can maybe return here if pasid is not negative.

> +	if (pasid < 0) {
> +		err = pasid;
> +		goto err_put_pid;
> +	}
> +
> +	return process;
> +
> +err_put_pid:
> +	put_pid(process->pid);
> +
> +err_free_process:
> +	domain->ops->process_free(process);
> +
> +	return ERR_PTR(err);
> +}
> +
> +static void iommu_process_release(struct kref *kref)
> +{
> +	struct iommu_process *process;
> +	void (*release)(struct iommu_process *);
> +
> +	assert_spin_locked(&iommu_process_lock);
> +
> +	process = container_of(kref, struct iommu_process, kref);

if we are concerned about things going wrong (assert above), we should
also add some pointer check here (WARN) for process and release pointers as well.

> +	release = process->release;
> +
> +	WARN_ON(!list_empty(&process->domains));
> +
> +	idr_remove(&iommu_process_idr, process->pasid);
> +	put_pid(process->pid);
> +	release(process);
> +}
> +
> +/*
> + * Returns non-zero if a reference to the process was successfully taken.
> + * Returns zero if the process is being freed and should not be used.
> + */
> +static int iommu_process_get_locked(struct iommu_process *process)
> +{
> +	assert_spin_locked(&iommu_process_lock);
> +
> +	if (process)
> +		return kref_get_unless_zero(&process->kref);
> +
> +	return 0;
> +}
> +
> +static void iommu_process_put_locked(struct iommu_process *process)
> +{
> +	assert_spin_locked(&iommu_process_lock);
> +
> +	kref_put(&process->kref, iommu_process_release);
> +}
> +
> +static int iommu_process_attach(struct iommu_domain *domain, struct device *dev,
> +				struct iommu_process *process)
> +{
> +	int err;
> +	int pasid = process->pasid;
> +	struct iommu_context *context;
> +
> +	if (WARN_ON(!domain->ops->process_attach || !domain->ops->process_detach))
> +		return -ENODEV;
> +
> +	if (pasid > domain->max_pasid || pasid < domain->min_pasid)
> +		return -ENOSPC;
> +
> +	context = kzalloc(sizeof(*context), GFP_KERNEL);
> +	if (!context)
> +		return -ENOMEM;
> +

devm_kzalloc maybe?

> +	context->process	= process;
> +	context->domain		= domain;
> +	refcount_set(&context->ref, 1);
> +
> +	spin_lock(&iommu_process_lock);
> +	err = domain->ops->process_attach(domain, dev, process, true);
> +	if (err) {
> +		kfree(context);
> +		spin_unlock(&iommu_process_lock);
> +		return err;
> +	}
> +
> +	list_add(&context->process_head, &process->domains);
> +	list_add(&context->domain_head, &domain->processes);
> +	spin_unlock(&iommu_process_lock);
> +
> +	return 0;
> +}

Sinan Kaya Oct. 21, 2017, 3:47 p.m. UTC | #11

Just some improvement suggestions.

On 10/6/2017 9:31 AM, Jean-Philippe Brucker wrote:
> +	spin_lock(&iommu_process_lock);
> +	idr_for_each_entry(&iommu_process_idr, process, i) {
> +		if (process->pid != pid)
> +			continue;
if you see this construct a lot, this could be a for_each_iommu_process.

> +
> +		if (!iommu_process_get_locked(process)) {
> +			/* Process is defunct, create a new one */
> +			process = NULL;
> +			break;
> +		}
> +
> +		/* Great, is it also bound to this domain? */
> +		list_for_each_entry(cur_context, &process->domains,
> +				    process_head) {
> +			if (cur_context->domain != domain)
> +				continue;
if you see this construct a lot, this could be a for_each_iommu_process_domain.

> +
> +			context = cur_context;
> +			*pasid = process->pasid;
> +
> +			/* Splendid, tell the driver and increase the ref */
> +			err = iommu_process_attach_locked(context, dev);
> +			if (err)
> +				iommu_process_put_locked(process);
> +
> +			break;
> +		}
> +		break;
> +	}
> +	spin_unlock(&iommu_process_lock);
> +	put_pid(pid);
> +
> +	if (context)
> +		return err;

I think you should make the section above a independent function and return here when the
context is found.

Yi Liu Oct. 23, 2017, 11:04 a.m. UTC | #12

Hi Jean,

> -----Original Message-----
> From: Jean-Philippe Brucker [mailto:jean-philippe.brucker@arm.com]
> Sent: Friday, October 6, 2017 9:31 PM
> To: linux-arm-kernel@lists.infradead.org; linux-pci@vger.kernel.org; linux-
> acpi@vger.kernel.org; devicetree@vger.kernel.org; iommu@lists.linux-
> foundation.org
> Cc: joro@8bytes.org; robh+dt@kernel.org; mark.rutland@arm.com;
> catalin.marinas@arm.com; will.deacon@arm.com; lorenzo.pieralisi@arm.com;
> hanjun.guo@linaro.org; sudeep.holla@arm.com; rjw@rjwysocki.net;
> lenb@kernel.org; robin.murphy@arm.com; bhelgaas@google.com;
> alex.williamson@redhat.com; tn@semihalf.com; liubo95@huawei.com;
> thunder.leizhen@huawei.com; xieyisheng1@huawei.com;
> gabriele.paoloni@huawei.com; nwatters@codeaurora.org; okaya@codeaurora.org;
> rfranz@cavium.com; dwmw2@infradead.org; jacob.jun.pan@linux.intel.com; Liu, Yi
> L <yi.l.liu@intel.com>; Raj, Ashok <ashok.raj@intel.com>; robdclark@gmail.com
> Subject: [RFCv2 PATCH 01/36] iommu: Keep track of processes and PASIDs
> 
> IOMMU drivers need a way to bind Linux processes to devices. This is used for
> Shared Virtual Memory (SVM), where devices support paging. In that mode, DMA can
> directly target virtual addresses of a process.
> 
> Introduce boilerplate code for allocating process structures and binding them to
> devices. Four operations are added to IOMMU drivers:
> 
> * process_alloc, process_free: to create an iommu_process structure and
>   perform architecture-specific operations required to grab the process
>   (for instance on ARM SMMU, pin down the CPU ASID). There is a single
>   iommu_process structure per Linux process.
> 
> * process_attach: attach a process to a device. The IOMMU driver checks
>   that the device is capable of sharing an address space with this
>   process, and writes the PASID table entry to install the process page
>   directory.
> 
>   Some IOMMU drivers (e.g. ARM SMMU and virtio-iommu) will have a single
>   PASID table per domain, for convenience. Other can implement it
>   differently but to help these drivers, process_attach and process_detach
>   take a 'first' or 'last' parameter telling whether they need to
>   install/remove the PASID entry or only send the required TLB
>   invalidations.
> 
> * process_detach: detach a process from a device. The IOMMU driver removes
>   the PASID table entry and invalidates the IOTLBs.
> 
> process_attach and process_detach operations are serialized with a spinlock. At the
> moment it is global, but if we try to optimize it, the core should at least prevent
> concurrent attach/detach on the same domain.
> (so multi-level PASID table code can allocate tables lazily without having to go
> through the io-pgtable concurrency nightmare). process_alloc can sleep, but
> process_free must not (because we'll have to call it from
> call_srcu.)
> 
> At the moment we use an IDR for allocating PASIDs and retrieving contexts.
> We also use a single spinlock. These can be refined and optimized later (a custom
> allocator will be needed for top-down PASID allocation).
> 
> Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
> ---
>  drivers/iommu/Kconfig         |  10 ++
>  drivers/iommu/Makefile        |   1 +
>  drivers/iommu/iommu-process.c | 225
> ++++++++++++++++++++++++++++++++++++++++++
>  drivers/iommu/iommu.c         |   1 +
>  include/linux/iommu.h         |  24 +++++
>  5 files changed, 261 insertions(+)
>  create mode 100644 drivers/iommu/iommu-process.c
> 
> diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig index
> f3a21343e636..1ea5c90e37be 100644
> --- a/drivers/iommu/Kconfig
> +++ b/drivers/iommu/Kconfig
> @@ -74,6 +74,16 @@ config IOMMU_DMA
>  	select IOMMU_IOVA
>  	select NEED_SG_DMA_LENGTH
> 
> +config IOMMU_PROCESS
> +	bool "Process management API for the IOMMU"
> +	select IOMMU_API
> +	help
> +	  Enable process management for the IOMMU API. In systems that support
> +	  it, device drivers can bind processes to devices and share their page
> +	  tables using this API.
> +
> +	  If unsure, say N here.
> +
>  config FSL_PAMU
>  	bool "Freescale IOMMU support"
>  	depends on PCI
> diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile index
> b910aea813a1..a2832edbfaa2 100644
> --- a/drivers/iommu/Makefile
> +++ b/drivers/iommu/Makefile
> @@ -1,6 +1,7 @@
>  obj-$(CONFIG_IOMMU_API) += iommu.o
>  obj-$(CONFIG_IOMMU_API) += iommu-traces.o
>  obj-$(CONFIG_IOMMU_API) += iommu-sysfs.o
> +obj-$(CONFIG_IOMMU_PROCESS) += iommu-process.o
>  obj-$(CONFIG_IOMMU_DMA) += dma-iommu.o
>  obj-$(CONFIG_IOMMU_IO_PGTABLE) += io-pgtable.o
>  obj-$(CONFIG_IOMMU_IO_PGTABLE_ARMV7S) += io-pgtable-arm-v7s.o diff --git
> a/drivers/iommu/iommu-process.c b/drivers/iommu/iommu-process.c new file
> mode 100644 index 000000000000..a7e5a1c94305
> --- /dev/null
> +++ b/drivers/iommu/iommu-process.c
> @@ -0,0 +1,225 @@
> +/*
> + * Track processes bound to devices
> + *
> + * This program is free software; you can redistribute it and/or modify
> +it
> + * under the terms of the GNU General Public License version 2 as
> +published
> + * by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program; if not, write to the Free Software
> + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307
> +USA
> + *
> + * Copyright (C) 2017 ARM Ltd.
> + *
> + * Author: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>  */
> +
> +#include <linux/idr.h>
> +#include <linux/iommu.h>
> +#include <linux/slab.h>
> +#include <linux/spinlock.h>
> +
> +/* Link between a domain and a process */ struct iommu_context {
> +	struct iommu_process	*process;
> +	struct iommu_domain	*domain;
> +
> +	struct list_head	process_head;
> +	struct list_head	domain_head;
> +
> +	/* Number of devices that use this context */
> +	refcount_t		ref;
> +};
> +
> +/*
> + * Because we're using an IDR, PASIDs are limited to 31 bits (the sign
> +bit is
> + * used for returning errors). In practice implementations will use at
> +most 20
> + * bits, which is the PCI limit.
> + */
> +static DEFINE_IDR(iommu_process_idr);
> +
> +/*
> + * For the moment this is an all-purpose lock. It serializes
> + * access/modifications to contexts (process-domain links),
> +access/modifications
> + * to the PASID IDR, and changes to process refcount as well.
> + */
> +static DEFINE_SPINLOCK(iommu_process_lock);
> +
> +/*
> + * Allocate a iommu_process structure for the given task.
> + *
> + * Ideally we shouldn't need the domain parameter, since iommu_process
> +is
> + * system-wide, but we use it to retrieve the driver's allocation ops
> +and a
> + * PASID range.
> + */
> +static struct iommu_process *
> +iommu_process_alloc(struct iommu_domain *domain, struct task_struct
> +*task) {
> +	int err;
> +	int pasid;
> +	struct iommu_process *process;
> +
> +	if (WARN_ON(!domain->ops->process_alloc || !domain->ops-
> >process_free))
> +		return ERR_PTR(-ENODEV);
> +
> +	process = domain->ops->process_alloc(task);
> +	if (IS_ERR(process))
> +		return process;
> +	if (!process)
> +		return ERR_PTR(-ENOMEM);
> +
> +	process->pid		= get_task_pid(task, PIDTYPE_PID);
> +	process->release	= domain->ops->process_free;
> +	INIT_LIST_HEAD(&process->domains);
> +	kref_init(&process->kref);
> +
> +	if (!process->pid) {
> +		err = -EINVAL;
> +		goto err_free_process;
> +	}
> +
> +	idr_preload(GFP_KERNEL);
> +	spin_lock(&iommu_process_lock);
> +	pasid = idr_alloc_cyclic(&iommu_process_idr, process, domain->min_pasid,
> +				 domain->max_pasid + 1, GFP_ATOMIC);
> +	process->pasid = pasid;

[Liu, Yi L] If I'm understanding well, here is managing the pasid allocation in iommu
layer instead of vendor iommu driver? Is there strong reason here? I think pasid
management may be better within vendor iommu driver as pasid management
could differ from vendor to vendor.

Regards,
Yi L

--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Jean-Philippe Brucker Oct. 23, 2017, 12:17 p.m. UTC | #13

On 23/10/17 12:04, Liu, Yi L wrote:
>> +	idr_preload(GFP_KERNEL);
>> +	spin_lock(&iommu_process_lock);
>> +	pasid = idr_alloc_cyclic(&iommu_process_idr, process, domain->min_pasid,
>> +				 domain->max_pasid + 1, GFP_ATOMIC);
>> +	process->pasid = pasid;
> 
> [Liu, Yi L] If I'm understanding well, here is managing the pasid allocation in iommu
> layer instead of vendor iommu driver? Is there strong reason here? I think pasid
> management may be better within vendor iommu driver as pasid management
> could differ from vendor to vendor.

But that's the thing, we're trying to abstract PASID and process
management to have it in the core, because there shouldn't be many
differences from vendor to vendor. This way we have the allocation code in
one place and vendor drivers don't have to copy paste it from other drivers.

It's just a global number within a range, so I don't think vendors will
have many different ways of designing it.

Thanks,
Jean

--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Ashok Raj Oct. 25, 2017, 6:05 p.m. UTC | #14

Hi Jean

On Mon, Oct 23, 2017 at 01:17:07PM +0100, Jean-Philippe Brucker wrote:
> On 23/10/17 12:04, Liu, Yi L wrote:
> >> +	idr_preload(GFP_KERNEL);
> >> +	spin_lock(&iommu_process_lock);
> >> +	pasid = idr_alloc_cyclic(&iommu_process_idr, process, domain->min_pasid,
> >> +				 domain->max_pasid + 1, GFP_ATOMIC);
> >> +	process->pasid = pasid;
> > 
> > [Liu, Yi L] If I'm understanding well, here is managing the pasid allocation in iommu
> > layer instead of vendor iommu driver? Is there strong reason here? I think pasid
> > management may be better within vendor iommu driver as pasid management
> > could differ from vendor to vendor.
> 
> But that's the thing, we're trying to abstract PASID and process
> management to have it in the core, because there shouldn't be many
> differences from vendor to vendor. This way we have the allocation code in
> one place and vendor drivers don't have to copy paste it from other drivers.

I think this makes sense for the native case and also in the vIOMMU
if the PASID tables and allocation are completely managed by the guest.

If the vIOMMU requires any co-ordination in how the PASID's are allocated
for guest devices there might need to be some control on how these are 
allocated that ultimately need to be managed by VMM/Physical IOMMU. For 
instance if the PASID space is sparse for e.g

if we make the PASID allocation as one of the ops, the IOMMU implementation
will choose the default function, or if it choose a differnt mechanism it would
have that flexibility.

Does this make sense?

Cheers,
Ashok

> 
> It's just a global number within a range, so I don't think vendors will
> have many different ways of designing it.
> 
> Thanks,
> Jean
> 
> 
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Jean-Philippe Brucker Oct. 30, 2017, 10:28 a.m. UTC | #15

On 25/10/17 19:05, Raj, Ashok wrote:
> Hi Jean
> 
> On Mon, Oct 23, 2017 at 01:17:07PM +0100, Jean-Philippe Brucker wrote:
>> On 23/10/17 12:04, Liu, Yi L wrote:
>>>> +	idr_preload(GFP_KERNEL);
>>>> +	spin_lock(&iommu_process_lock);
>>>> +	pasid = idr_alloc_cyclic(&iommu_process_idr, process, domain->min_pasid,
>>>> +				 domain->max_pasid + 1, GFP_ATOMIC);
>>>> +	process->pasid = pasid;
>>>
>>> [Liu, Yi L] If I'm understanding well, here is managing the pasid allocation in iommu
>>> layer instead of vendor iommu driver? Is there strong reason here? I think pasid
>>> management may be better within vendor iommu driver as pasid management
>>> could differ from vendor to vendor.
>>
>> But that's the thing, we're trying to abstract PASID and process
>> management to have it in the core, because there shouldn't be many
>> differences from vendor to vendor. This way we have the allocation code in
>> one place and vendor drivers don't have to copy paste it from other drivers.
> 
> I think this makes sense for the native case and also in the vIOMMU
> if the PASID tables and allocation are completely managed by the guest.
> 
> If the vIOMMU requires any co-ordination in how the PASID's are allocated
> for guest devices there might need to be some control on how these are 
> allocated that ultimately need to be managed by VMM/Physical IOMMU. For 
> instance if the PASID space is sparse for e.g

(I don't have your example)

> if we make the PASID allocation as one of the ops, the IOMMU implementation
> will choose the default function, or if it choose a differnt mechanism it would
> have that flexibility.
> 
> Does this make sense?

If the PASID space is sparse, maybe we can add a firmware or probe
mechanism to declare reserved PASIDs, like we have for reserved IOVAs,
that feeds into the core IOMMU driver. But I agree that we can always let
vendor drivers implement their own allocator if they need one in the
future. For the moment it can stay generic.

Thanks,
Jean
--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Jean-Philippe Brucker Nov. 2, 2017, 4:20 p.m. UTC | #16

Hi Sinan,

Sorry for the delay, thanks for reviewing this!

On 21/10/17 00:32, Sinan Kaya wrote:
> few nits below.
> 
>> +/*
>> + * Allocate a iommu_process structure for the given task.
>> + *
>> + * Ideally we shouldn't need the domain parameter, since iommu_process is
>> + * system-wide, but we use it to retrieve the driver's allocation ops and a
>> + * PASID range.
>> + */
>> +static struct iommu_process *
>> +iommu_process_alloc(struct iommu_domain *domain, struct task_struct *task)
>> +{
>> +	int err;
>> +	int pasid;
>> +	struct iommu_process *process;
>> +
>> +	if (WARN_ON(!domain->ops->process_alloc || !domain->ops->process_free))
>> +		return ERR_PTR(-ENODEV);
>> +
>> +	process = domain->ops->process_alloc(task);
>> +	if (IS_ERR(process))
>> +		return process;
>> +	if (!process)
>> +		return ERR_PTR(-ENOMEM);
>> +
>> +	process->pid		= get_task_pid(task, PIDTYPE_PID);
>> +	process->release	= domain->ops->process_free;
>> +	INIT_LIST_HEAD(&process->domains);
>> +	kref_init(&process->kref);
>> +
> nit, I think you should place this check right after the pid assignment.

Sure

>> +	if (!process->pid) {
>> +		err = -EINVAL;
>> +		goto err_free_process;
>> +	}
>> +
>> +	idr_preload(GFP_KERNEL);
>> +	spin_lock(&iommu_process_lock);
>> +	pasid = idr_alloc_cyclic(&iommu_process_idr, process, domain->min_pasid,
>> +				 domain->max_pasid + 1, GFP_ATOMIC);
>> +	process->pasid = pasid;
>> +	spin_unlock(&iommu_process_lock);
>> +	idr_preload_end();
>> +
> 
> nit, You can maybe return here if pasid is not negative.

Ok

>> +	if (pasid < 0) {
>> +		err = pasid;
>> +		goto err_put_pid;
>> +	}
>> +
>> +	return process;
>> +
>> +err_put_pid:
>> +	put_pid(process->pid);
>> +
>> +err_free_process:
>> +	domain->ops->process_free(process);
>> +
>> +	return ERR_PTR(err);
>> +}
>> +
>> +static void iommu_process_release(struct kref *kref)
>> +{
>> +	struct iommu_process *process;
>> +	void (*release)(struct iommu_process *);
>> +
>> +	assert_spin_locked(&iommu_process_lock);
>> +
>> +	process = container_of(kref, struct iommu_process, kref);
> 
> if we are concerned about things going wrong (assert above), we should
> also add some pointer check here (WARN) for process and release pointers as well.

We can probably get rid of this assert, as any external users releasing
the process will have to go through iommu_process_put() which takes the
lock. process_alloc() ensures that release isn't NULL, and process should
always be valid here since we're being called from kref_put, but I should
check the value in process_put.

>> +	release = process->release;
>> +
>> +	WARN_ON(!list_empty(&process->domains));
>> +
>> +	idr_remove(&iommu_process_idr, process->pasid);
>> +	put_pid(process->pid);
>> +	release(process);
>> +}
>> +
>> +/*
>> + * Returns non-zero if a reference to the process was successfully taken.
>> + * Returns zero if the process is being freed and should not be used.
>> + */
>> +static int iommu_process_get_locked(struct iommu_process *process)
>> +{
>> +	assert_spin_locked(&iommu_process_lock);
>> +
>> +	if (process)
>> +		return kref_get_unless_zero(&process->kref);
>> +
>> +	return 0;
>> +}
>> +
>> +static void iommu_process_put_locked(struct iommu_process *process)
>> +{
>> +	assert_spin_locked(&iommu_process_lock);
>> +
>> +	kref_put(&process->kref, iommu_process_release);
>> +}
>> +
>> +static int iommu_process_attach(struct iommu_domain *domain, struct device *dev,
>> +				struct iommu_process *process)
>> +{
>> +	int err;
>> +	int pasid = process->pasid;
>> +	struct iommu_context *context;
>> +
>> +	if (WARN_ON(!domain->ops->process_attach || !domain->ops->process_detach))
>> +		return -ENODEV;
>> +
>> +	if (pasid > domain->max_pasid || pasid < domain->min_pasid)
>> +		return -ENOSPC;
>> +
>> +	context = kzalloc(sizeof(*context), GFP_KERNEL);
>> +	if (!context)
>> +		return -ENOMEM;
>> +
> 
> devm_kzalloc maybe?

I don't think we can ever leak contexts. Before the device is released, it
has to detach() from the domain, which will unbind from any process (call
unbind_dev_all()).

Thanks,
Jean
--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Jean-Philippe Brucker Nov. 2, 2017, 4:21 p.m. UTC | #17

On 21/10/17 16:47, Sinan Kaya wrote:
> Just some improvement suggestions.
> 
> On 10/6/2017 9:31 AM, Jean-Philippe Brucker wrote:
>> +	spin_lock(&iommu_process_lock);
>> +	idr_for_each_entry(&iommu_process_idr, process, i) {
>> +		if (process->pid != pid)
>> +			continue;
> if you see this construct a lot, this could be a for_each_iommu_process.
> 
>> +
>> +		if (!iommu_process_get_locked(process)) {
>> +			/* Process is defunct, create a new one */
>> +			process = NULL;
>> +			break;
>> +		}
>> +
>> +		/* Great, is it also bound to this domain? */
>> +		list_for_each_entry(cur_context, &process->domains,
>> +				    process_head) {
>> +			if (cur_context->domain != domain)
>> +				continue;
> if you see this construct a lot, this could be a for_each_iommu_process_domain.
> 
>> +
>> +			context = cur_context;
>> +			*pasid = process->pasid;
>> +
>> +			/* Splendid, tell the driver and increase the ref */
>> +			err = iommu_process_attach_locked(context, dev);
>> +			if (err)
>> +				iommu_process_put_locked(process);
>> +
>> +			break;
>> +		}
>> +		break;
>> +	}
>> +	spin_unlock(&iommu_process_lock);
>> +	put_pid(pid);
>> +
>> +	if (context)
>> +		return err;
> 
> I think you should make the section above a independent function and return here when the
> context is found.

Hopefully this code will only be needed for bind(), but moving it to a
separate function should look better.

Thanks,
Jean

--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Yisheng Xie Nov. 3, 2017, 5:45 a.m. UTC | #18

Hi Jean,

On 2017/11/3 1:02, Shameerali Kolothum Thodi wrote:
> 
> 
>> -----Original Message-----
>> From: Jean-Philippe Brucker [mailto:Jean-Philippe.Brucker@arm.com]
>> Sent: Thursday, November 02, 2017 3:52 PM
>> To: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>
>> Cc: linux-arm-kernel@lists.infradead.org; linux-pci@vger.kernel.org; linux-
>> acpi@vger.kernel.org; devicetree@vger.kernel.org; iommu@lists.linux-
>> foundation.org; Mark Rutland <Mark.Rutland@arm.com>; xieyisheng (A)
>> <xieyisheng1@huawei.com>; Gabriele Paoloni
>> <gabriele.paoloni@huawei.com>; Catalin Marinas
>> <Catalin.Marinas@arm.com>; Will Deacon <Will.Deacon@arm.com>;
>> okaya@codeaurora.org; yi.l.liu@intel.com; Lorenzo Pieralisi
>> <Lorenzo.Pieralisi@arm.com>; ashok.raj@intel.com; tn@semihalf.com;
>> joro@8bytes.org; rfranz@cavium.com; lenb@kernel.org;
>> jacob.jun.pan@linux.intel.com; alex.williamson@redhat.com;
>> robh+dt@kernel.org; Leizhen (ThunderTown) <thunder.leizhen@huawei.com>;
>> bhelgaas@google.com; dwmw2@infradead.org; liubo (CU)
>> <liubo95@huawei.com>; rjw@rjwysocki.net; robdclark@gmail.com;
>> hanjun.guo@linaro.org; Sudeep Holla <Sudeep.Holla@arm.com>; Robin
>> Murphy <Robin.Murphy@arm.com>; nwatters@codeaurora.org; Linuxarm
>> <linuxarm@huawei.com>
>> Subject: Re: [RFCv2 PATCH 14/36] iommu/arm-smmu-v3: Add support for
>> Substream IDs
>>
>> Hi Shameer,
>>
>> On Thu, Nov 02, 2017 at 12:49:32PM +0000, Shameerali Kolothum Thodi wrote:
>>> We had a go with this series on HiSIlicon D05 platform which doesn't have
>>> support for ssids/ATS/PRI, to make sure it generally works.
>>>
>>> But observed the below crash on boot,
>>>
>>> [   16.009084] WARNING: CPU: 59 PID: 391 at mm/page_alloc.c:3883
>> __alloc_pages_nodemask+0x19c/0xc48
>>> [   16.026797] Modules linked in:
>>> [   16.032944] CPU: 59 PID: 391 Comm: kworker/59:1 Not tainted 4.14.0-rc1-
>> 159539-ge42aca3 #236
>>> [...]
>>> [   16.068206] Workqueue: events deferred_probe_work_func
>>> [   16.078557] task: ffff8017d38a0000 task.stack: ffff00000b198000
>>> [   16.090486] PC is at __alloc_pages_nodemask+0x19c/0xc48
>>> [   16.101013] LR is at __alloc_pages_nodemask+0xe0/0xc48
>>> [   16.469220] [<ffff000008186b94>] __alloc_pages_nodemask+0x19c/0xc48
>>> [   16.481854] [<ffff0000081d65b0>] alloc_pages_current+0x80/0xcc
>>> [   16.493607] [<ffff000008182be8>] __get_free_pages+0xc/0x38
>>> [   16.504661] [<ffff0000083c4d58>] swiotlb_alloc_coherent+0x64/0x190
>>> [   16.517117] [<ffff00000809824c>] __dma_alloc+0x110/0x204
>>> [   16.527820] [<ffff00000858e850>] dmam_alloc_coherent+0x88/0xf0
>>> [   16.539575] [<ffff000008568884>]
>> arm_smmu_domain_finalise_s1+0x60/0x248
>>> [   16.552909] [<ffff00000856c104>] arm_smmu_attach_dev+0x264/0x300
>>> [   16.565013] [<ffff00000855d40c>] __iommu_attach_device+0x48/0x5c
>>> [   16.577117] [<ffff00000855e698>] iommu_group_add_device+0x144/0x3a4
>>> [   16.589746] [<ffff00000855ed18>] iommu_group_get_for_dev+0x70/0xf8
>>> [   16.602201] [<ffff00000856a314>] arm_smmu_add_device+0x1a4/0x418
>>> [   16.614308] [<ffff00000849dfcc>] iort_iommu_configure+0xf0/0x16c
>>> [   16.626416] [<ffff000008468c50>] acpi_dma_configure+0x30/0x70
>>> [   16.637994] [<ffff00000858f00c>] dma_configure+0xa8/0xd4
>>> [   16.648695] [<ffff00000857706c>] driver_probe_device+0x1a4/0x2dc
>>> [   16.673081] [<ffff0000085752c8>] bus_for_each_drv+0x54/0x94
>>> [   16.684307] [<ffff000008576db0>] __device_attach+0xc4/0x12c
>>> [   16.695533] [<ffff000008577350>] device_initial_probe+0x10/0x18
>>> [   16.707462] [<ffff0000085762b4>] bus_probe_device+0x90/0x98
>>>
>>> After a bit of debug it looks like on platforms where ssid is not supported,
>>> s1_cfg.num_contexts is set to zero and it eventually results in this crash
>>> in,
>>> arm_smmu_domain_finalise_s1() -->arm_smmu_alloc_cd_tables()-->
>>> arm_smmu_alloc_cd_leaf_table() as num_leaf_entries is zero.
>>>
>>> With the below fix, it works on D05 now,
>>>
>>> diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
>>> index 8ad90e2..51f5821 100644
>>> --- a/drivers/iommu/arm-smmu-v3.c
>>> +++ b/drivers/iommu/arm-smmu-v3.c
>>> @@ -2433,7 +2433,10 @@ static int arm_smmu_domain_finalise(struct
>> iommu_domain *domain,
>>>                         domain->min_pasid = 1;
>>>                         domain->max_pasid = master->num_ssids - 1;
>>>                         smmu_domain->s1_cfg.num_contexts = master->num_ssids;
>>> +               } else {
>>> +                       smmu_domain->s1_cfg.num_contexts = 1;
>>>                 }
>>> +
>>>                 smmu_domain->s1_cfg.can_stall = master->ste.can_stall;
>>>                 break;
>>>         case ARM_SMMU_DOMAIN_NESTED:
>>>
>>>
>>> I am not sure this is right place do this. Please take a look.
>>
>> Thanks for testing the series and reporting the bug. I added the
>> following patch to branch svm/current, does it work for you?
> 
> Yes, it does.
> 
> Thanks,
> Shameer
>  
>> diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
>> index 42c8378624ed..edda466adc81 100644
>> --- a/drivers/iommu/arm-smmu-v3.c
>> +++ b/drivers/iommu/arm-smmu-v3.c
>> @@ -3169,9 +3169,7 @@ static int arm_smmu_add_device(struct device *dev)
>>                 }
>>         }
>>
>> -       if (smmu->ssid_bits)
>> -               master->num_ssids = 1 << min(smmu->ssid_bits,
>> -                                            fwspec->num_pasid_bits);
>> +       master->num_ssids = 1 << min(smmu->ssid_bits, fwspec-
>>> num_pasid_bits);

If fwspec->num_pasid_bits = 0, then master have _one_ num_ssids ?

It seems Shameerali's fix is better ?

>>
>>         if (fwspec->can_stall && smmu->features & ARM_SMMU_FEAT_STALLS) {
>>                 master->can_fault = true;
> 

--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Jean-Philippe Brucker Nov. 3, 2017, 9:37 a.m. UTC | #19

On 03/11/17 05:45, Yisheng Xie wrote:
> Hi Jean,
> 
> On 2017/11/3 1:02, Shameerali Kolothum Thodi wrote:
>>
>>
>>> -----Original Message-----
>>> From: Jean-Philippe Brucker [mailto:Jean-Philippe.Brucker@arm.com]
>>> Sent: Thursday, November 02, 2017 3:52 PM
>>> To: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>
>>> Cc: linux-arm-kernel@lists.infradead.org; linux-pci@vger.kernel.org; linux-
>>> acpi@vger.kernel.org; devicetree@vger.kernel.org; iommu@lists.linux-
>>> foundation.org; Mark Rutland <Mark.Rutland@arm.com>; xieyisheng (A)
>>> <xieyisheng1@huawei.com>; Gabriele Paoloni
>>> <gabriele.paoloni@huawei.com>; Catalin Marinas
>>> <Catalin.Marinas@arm.com>; Will Deacon <Will.Deacon@arm.com>;
>>> okaya@codeaurora.org; yi.l.liu@intel.com; Lorenzo Pieralisi
>>> <Lorenzo.Pieralisi@arm.com>; ashok.raj@intel.com; tn@semihalf.com;
>>> joro@8bytes.org; rfranz@cavium.com; lenb@kernel.org;
>>> jacob.jun.pan@linux.intel.com; alex.williamson@redhat.com;
>>> robh+dt@kernel.org; Leizhen (ThunderTown) <thunder.leizhen@huawei.com>;
>>> bhelgaas@google.com; dwmw2@infradead.org; liubo (CU)
>>> <liubo95@huawei.com>; rjw@rjwysocki.net; robdclark@gmail.com;
>>> hanjun.guo@linaro.org; Sudeep Holla <Sudeep.Holla@arm.com>; Robin
>>> Murphy <Robin.Murphy@arm.com>; nwatters@codeaurora.org; Linuxarm
>>> <linuxarm@huawei.com>
>>> Subject: Re: [RFCv2 PATCH 14/36] iommu/arm-smmu-v3: Add support for
>>> Substream IDs
>>>
>>> Hi Shameer,
>>>
>>> On Thu, Nov 02, 2017 at 12:49:32PM +0000, Shameerali Kolothum Thodi wrote:
>>>> We had a go with this series on HiSIlicon D05 platform which doesn't have
>>>> support for ssids/ATS/PRI, to make sure it generally works.
>>>>
>>>> But observed the below crash on boot,
>>>>
>>>> [   16.009084] WARNING: CPU: 59 PID: 391 at mm/page_alloc.c:3883
>>> __alloc_pages_nodemask+0x19c/0xc48
>>>> [   16.026797] Modules linked in:
>>>> [   16.032944] CPU: 59 PID: 391 Comm: kworker/59:1 Not tainted 4.14.0-rc1-
>>> 159539-ge42aca3 #236
>>>> [...]
>>>> [   16.068206] Workqueue: events deferred_probe_work_func
>>>> [   16.078557] task: ffff8017d38a0000 task.stack: ffff00000b198000
>>>> [   16.090486] PC is at __alloc_pages_nodemask+0x19c/0xc48
>>>> [   16.101013] LR is at __alloc_pages_nodemask+0xe0/0xc48
>>>> [   16.469220] [<ffff000008186b94>] __alloc_pages_nodemask+0x19c/0xc48
>>>> [   16.481854] [<ffff0000081d65b0>] alloc_pages_current+0x80/0xcc
>>>> [   16.493607] [<ffff000008182be8>] __get_free_pages+0xc/0x38
>>>> [   16.504661] [<ffff0000083c4d58>] swiotlb_alloc_coherent+0x64/0x190
>>>> [   16.517117] [<ffff00000809824c>] __dma_alloc+0x110/0x204
>>>> [   16.527820] [<ffff00000858e850>] dmam_alloc_coherent+0x88/0xf0
>>>> [   16.539575] [<ffff000008568884>]
>>> arm_smmu_domain_finalise_s1+0x60/0x248
>>>> [   16.552909] [<ffff00000856c104>] arm_smmu_attach_dev+0x264/0x300
>>>> [   16.565013] [<ffff00000855d40c>] __iommu_attach_device+0x48/0x5c
>>>> [   16.577117] [<ffff00000855e698>] iommu_group_add_device+0x144/0x3a4
>>>> [   16.589746] [<ffff00000855ed18>] iommu_group_get_for_dev+0x70/0xf8
>>>> [   16.602201] [<ffff00000856a314>] arm_smmu_add_device+0x1a4/0x418
>>>> [   16.614308] [<ffff00000849dfcc>] iort_iommu_configure+0xf0/0x16c
>>>> [   16.626416] [<ffff000008468c50>] acpi_dma_configure+0x30/0x70
>>>> [   16.637994] [<ffff00000858f00c>] dma_configure+0xa8/0xd4
>>>> [   16.648695] [<ffff00000857706c>] driver_probe_device+0x1a4/0x2dc
>>>> [   16.673081] [<ffff0000085752c8>] bus_for_each_drv+0x54/0x94
>>>> [   16.684307] [<ffff000008576db0>] __device_attach+0xc4/0x12c
>>>> [   16.695533] [<ffff000008577350>] device_initial_probe+0x10/0x18
>>>> [   16.707462] [<ffff0000085762b4>] bus_probe_device+0x90/0x98
>>>>
>>>> After a bit of debug it looks like on platforms where ssid is not supported,
>>>> s1_cfg.num_contexts is set to zero and it eventually results in this crash
>>>> in,
>>>> arm_smmu_domain_finalise_s1() -->arm_smmu_alloc_cd_tables()-->
>>>> arm_smmu_alloc_cd_leaf_table() as num_leaf_entries is zero.
>>>>
>>>> With the below fix, it works on D05 now,
>>>>
>>>> diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
>>>> index 8ad90e2..51f5821 100644
>>>> --- a/drivers/iommu/arm-smmu-v3.c
>>>> +++ b/drivers/iommu/arm-smmu-v3.c
>>>> @@ -2433,7 +2433,10 @@ static int arm_smmu_domain_finalise(struct
>>> iommu_domain *domain,
>>>>                         domain->min_pasid = 1;
>>>>                         domain->max_pasid = master->num_ssids - 1;
>>>>                         smmu_domain->s1_cfg.num_contexts = master->num_ssids;
>>>> +               } else {
>>>> +                       smmu_domain->s1_cfg.num_contexts = 1;
>>>>                 }
>>>> +
>>>>                 smmu_domain->s1_cfg.can_stall = master->ste.can_stall;
>>>>                 break;
>>>>         case ARM_SMMU_DOMAIN_NESTED:
>>>>
>>>>
>>>> I am not sure this is right place do this. Please take a look.
>>>
>>> Thanks for testing the series and reporting the bug. I added the
>>> following patch to branch svm/current, does it work for you?
>>
>> Yes, it does.
>>
>> Thanks,
>> Shameer
>>  
>>> diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
>>> index 42c8378624ed..edda466adc81 100644
>>> --- a/drivers/iommu/arm-smmu-v3.c
>>> +++ b/drivers/iommu/arm-smmu-v3.c
>>> @@ -3169,9 +3169,7 @@ static int arm_smmu_add_device(struct device *dev)
>>>                 }
>>>         }
>>>
>>> -       if (smmu->ssid_bits)
>>> -               master->num_ssids = 1 << min(smmu->ssid_bits,
>>> -                                            fwspec->num_pasid_bits);
>>> +       master->num_ssids = 1 << min(smmu->ssid_bits, fwspec-
>>>> num_pasid_bits);
> 
> If fwspec->num_pasid_bits = 0, then master have _one_ num_ssids ?

Yes, the context table allocator always needs to allocate at least one
entry, even if the master or SMMU doesn't support SSID. I think an earlier
version called this field "num_contexts", maybe we should got back to that
name for clarity?

Thanks,
Jean
--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Shameerali Kolothum Thodi Nov. 3, 2017, 9:39 a.m. UTC | #20

> -----Original Message-----

> From: Jean-Philippe Brucker [mailto:jean-philippe.brucker@arm.com]

> Sent: Friday, November 03, 2017 9:37 AM

> To: xieyisheng (A) <xieyisheng1@huawei.com>; Shameerali Kolothum Thodi

> <shameerali.kolothum.thodi@huawei.com>

> Cc: linux-arm-kernel@lists.infradead.org; linux-pci@vger.kernel.org; linux-

> acpi@vger.kernel.org; devicetree@vger.kernel.org; iommu@lists.linux-

> foundation.org; Mark Rutland <Mark.Rutland@arm.com>; Gabriele Paoloni

> <gabriele.paoloni@huawei.com>; Catalin Marinas

> <Catalin.Marinas@arm.com>; Will Deacon <Will.Deacon@arm.com>;

> okaya@codeaurora.org; yi.l.liu@intel.com; Lorenzo Pieralisi

> <Lorenzo.Pieralisi@arm.com>; ashok.raj@intel.com; tn@semihalf.com;

> joro@8bytes.org; rfranz@cavium.com; lenb@kernel.org;

> jacob.jun.pan@linux.intel.com; alex.williamson@redhat.com;

> robh+dt@kernel.org; Leizhen (ThunderTown) <thunder.leizhen@huawei.com>;

> bhelgaas@google.com; dwmw2@infradead.org; liubo (CU)

> <liubo95@huawei.com>; rjw@rjwysocki.net; robdclark@gmail.com;

> hanjun.guo@linaro.org; Sudeep Holla <Sudeep.Holla@arm.com>; Robin

> Murphy <Robin.Murphy@arm.com>; nwatters@codeaurora.org; Linuxarm

> <linuxarm@huawei.com>

> Subject: Re: [RFCv2 PATCH 14/36] iommu/arm-smmu-v3: Add support for

> Substream IDs

> 

> On 03/11/17 05:45, Yisheng Xie wrote:

> > Hi Jean,

> >

> > On 2017/11/3 1:02, Shameerali Kolothum Thodi wrote:

> >>

> >>

> >>> -----Original Message-----

> >>> From: Jean-Philippe Brucker [mailto:Jean-Philippe.Brucker@arm.com]

> >>> Sent: Thursday, November 02, 2017 3:52 PM

> >>> To: Shameerali Kolothum Thodi

> <shameerali.kolothum.thodi@huawei.com>

> >>> Cc: linux-arm-kernel@lists.infradead.org; linux-pci@vger.kernel.org; linux-

> >>> acpi@vger.kernel.org; devicetree@vger.kernel.org; iommu@lists.linux-

> >>> foundation.org; Mark Rutland <Mark.Rutland@arm.com>; xieyisheng (A)

> >>> <xieyisheng1@huawei.com>; Gabriele Paoloni

> >>> <gabriele.paoloni@huawei.com>; Catalin Marinas

> >>> <Catalin.Marinas@arm.com>; Will Deacon <Will.Deacon@arm.com>;

> >>> okaya@codeaurora.org; yi.l.liu@intel.com; Lorenzo Pieralisi

> >>> <Lorenzo.Pieralisi@arm.com>; ashok.raj@intel.com; tn@semihalf.com;

> >>> joro@8bytes.org; rfranz@cavium.com; lenb@kernel.org;

> >>> jacob.jun.pan@linux.intel.com; alex.williamson@redhat.com;

> >>> robh+dt@kernel.org; Leizhen (ThunderTown)

> <thunder.leizhen@huawei.com>;

> >>> bhelgaas@google.com; dwmw2@infradead.org; liubo (CU)

> >>> <liubo95@huawei.com>; rjw@rjwysocki.net; robdclark@gmail.com;

> >>> hanjun.guo@linaro.org; Sudeep Holla <Sudeep.Holla@arm.com>; Robin

> >>> Murphy <Robin.Murphy@arm.com>; nwatters@codeaurora.org; Linuxarm

> >>> <linuxarm@huawei.com>

> >>> Subject: Re: [RFCv2 PATCH 14/36] iommu/arm-smmu-v3: Add support for

> >>> Substream IDs

> >>>

> >>> Hi Shameer,

> >>>

> >>> On Thu, Nov 02, 2017 at 12:49:32PM +0000, Shameerali Kolothum Thodi

> wrote:

> >>>> We had a go with this series on HiSIlicon D05 platform which doesn't have

> >>>> support for ssids/ATS/PRI, to make sure it generally works.

> >>>>

> >>>> But observed the below crash on boot,

> >>>>

> >>>> [   16.009084] WARNING: CPU: 59 PID: 391 at mm/page_alloc.c:3883

> >>> __alloc_pages_nodemask+0x19c/0xc48

> >>>> [   16.026797] Modules linked in:

> >>>> [   16.032944] CPU: 59 PID: 391 Comm: kworker/59:1 Not tainted 4.14.0-

> rc1-

> >>> 159539-ge42aca3 #236

> >>>> [...]

> >>>> [   16.068206] Workqueue: events deferred_probe_work_func

> >>>> [   16.078557] task: ffff8017d38a0000 task.stack: ffff00000b198000

> >>>> [   16.090486] PC is at __alloc_pages_nodemask+0x19c/0xc48

> >>>> [   16.101013] LR is at __alloc_pages_nodemask+0xe0/0xc48

> >>>> [   16.469220] [<ffff000008186b94>]

> __alloc_pages_nodemask+0x19c/0xc48

> >>>> [   16.481854] [<ffff0000081d65b0>] alloc_pages_current+0x80/0xcc

> >>>> [   16.493607] [<ffff000008182be8>] __get_free_pages+0xc/0x38

> >>>> [   16.504661] [<ffff0000083c4d58>] swiotlb_alloc_coherent+0x64/0x190

> >>>> [   16.517117] [<ffff00000809824c>] __dma_alloc+0x110/0x204

> >>>> [   16.527820] [<ffff00000858e850>] dmam_alloc_coherent+0x88/0xf0

> >>>> [   16.539575] [<ffff000008568884>]

> >>> arm_smmu_domain_finalise_s1+0x60/0x248

> >>>> [   16.552909] [<ffff00000856c104>]

> arm_smmu_attach_dev+0x264/0x300

> >>>> [   16.565013] [<ffff00000855d40c>] __iommu_attach_device+0x48/0x5c

> >>>> [   16.577117] [<ffff00000855e698>]

> iommu_group_add_device+0x144/0x3a4

> >>>> [   16.589746] [<ffff00000855ed18>]

> iommu_group_get_for_dev+0x70/0xf8

> >>>> [   16.602201] [<ffff00000856a314>]

> arm_smmu_add_device+0x1a4/0x418

> >>>> [   16.614308] [<ffff00000849dfcc>] iort_iommu_configure+0xf0/0x16c

> >>>> [   16.626416] [<ffff000008468c50>] acpi_dma_configure+0x30/0x70

> >>>> [   16.637994] [<ffff00000858f00c>] dma_configure+0xa8/0xd4

> >>>> [   16.648695] [<ffff00000857706c>] driver_probe_device+0x1a4/0x2dc

> >>>> [   16.673081] [<ffff0000085752c8>] bus_for_each_drv+0x54/0x94

> >>>> [   16.684307] [<ffff000008576db0>] __device_attach+0xc4/0x12c

> >>>> [   16.695533] [<ffff000008577350>] device_initial_probe+0x10/0x18

> >>>> [   16.707462] [<ffff0000085762b4>] bus_probe_device+0x90/0x98

> >>>>

> >>>> After a bit of debug it looks like on platforms where ssid is not supported,

> >>>> s1_cfg.num_contexts is set to zero and it eventually results in this crash

> >>>> in,

> >>>> arm_smmu_domain_finalise_s1() -->arm_smmu_alloc_cd_tables()-->

> >>>> arm_smmu_alloc_cd_leaf_table() as num_leaf_entries is zero.

> >>>>

> >>>> With the below fix, it works on D05 now,

> >>>>

> >>>> diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-

> v3.c

> >>>> index 8ad90e2..51f5821 100644

> >>>> --- a/drivers/iommu/arm-smmu-v3.c

> >>>> +++ b/drivers/iommu/arm-smmu-v3.c

> >>>> @@ -2433,7 +2433,10 @@ static int arm_smmu_domain_finalise(struct

> >>> iommu_domain *domain,

> >>>>                         domain->min_pasid = 1;

> >>>>                         domain->max_pasid = master->num_ssids - 1;

> >>>>                         smmu_domain->s1_cfg.num_contexts = master-

> >num_ssids;

> >>>> +               } else {

> >>>> +                       smmu_domain->s1_cfg.num_contexts = 1;

> >>>>                 }

> >>>> +

> >>>>                 smmu_domain->s1_cfg.can_stall = master->ste.can_stall;

> >>>>                 break;

> >>>>         case ARM_SMMU_DOMAIN_NESTED:

> >>>>

> >>>>

> >>>> I am not sure this is right place do this. Please take a look.

> >>>

> >>> Thanks for testing the series and reporting the bug. I added the

> >>> following patch to branch svm/current, does it work for you?

> >>

> >> Yes, it does.

> >>

> >> Thanks,

> >> Shameer

> >>

> >>> diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-

> v3.c

> >>> index 42c8378624ed..edda466adc81 100644

> >>> --- a/drivers/iommu/arm-smmu-v3.c

> >>> +++ b/drivers/iommu/arm-smmu-v3.c

> >>> @@ -3169,9 +3169,7 @@ static int arm_smmu_add_device(struct device

> *dev)

> >>>                 }

> >>>         }

> >>>

> >>> -       if (smmu->ssid_bits)

> >>> -               master->num_ssids = 1 << min(smmu->ssid_bits,

> >>> -                                            fwspec->num_pasid_bits);

> >>> +       master->num_ssids = 1 << min(smmu->ssid_bits, fwspec-

> >>>> num_pasid_bits);

> >

> > If fwspec->num_pasid_bits = 0, then master have _one_ num_ssids ?

> 

> Yes, the context table allocator always needs to allocate at least one

> entry, even if the master or SMMU doesn't support SSID. I think an earlier

> version called this field "num_contexts", maybe we should got back to that

> name for clarity?


+1 for that. As ssid can be zero as per the spec, num_ssids=1 will be slightly misleading.

Thanks,
Shameer

Yisheng Xie Nov. 6, 2017, 12:50 a.m. UTC | #21

On 2017/11/3 17:37, Jean-Philippe Brucker wrote:
> On 03/11/17 05:45, Yisheng Xie wrote:
>> Hi Jean,
>>
>> On 2017/11/3 1:02, Shameerali Kolothum Thodi wrote:
>>>
>>>
>>>> -----Original Message-----
>>>> From: Jean-Philippe Brucker [mailto:Jean-Philippe.Brucker@arm.com]
>>>> Sent: Thursday, November 02, 2017 3:52 PM
>>>> To: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>
>>>> Cc: linux-arm-kernel@lists.infradead.org; linux-pci@vger.kernel.org; linux-
>>>> acpi@vger.kernel.org; devicetree@vger.kernel.org; iommu@lists.linux-
>>>> foundation.org; Mark Rutland <Mark.Rutland@arm.com>; xieyisheng (A)
>>>> <xieyisheng1@huawei.com>; Gabriele Paoloni
>>>> <gabriele.paoloni@huawei.com>; Catalin Marinas
>>>> <Catalin.Marinas@arm.com>; Will Deacon <Will.Deacon@arm.com>;
>>>> okaya@codeaurora.org; yi.l.liu@intel.com; Lorenzo Pieralisi
>>>> <Lorenzo.Pieralisi@arm.com>; ashok.raj@intel.com; tn@semihalf.com;
>>>> joro@8bytes.org; rfranz@cavium.com; lenb@kernel.org;
>>>> jacob.jun.pan@linux.intel.com; alex.williamson@redhat.com;
>>>> robh+dt@kernel.org; Leizhen (ThunderTown) <thunder.leizhen@huawei.com>;
>>>> bhelgaas@google.com; dwmw2@infradead.org; liubo (CU)
>>>> <liubo95@huawei.com>; rjw@rjwysocki.net; robdclark@gmail.com;
>>>> hanjun.guo@linaro.org; Sudeep Holla <Sudeep.Holla@arm.com>; Robin
>>>> Murphy <Robin.Murphy@arm.com>; nwatters@codeaurora.org; Linuxarm
>>>> <linuxarm@huawei.com>
>>>> Subject: Re: [RFCv2 PATCH 14/36] iommu/arm-smmu-v3: Add support for
>>>> Substream IDs
>>>>
>>>> Hi Shameer,
>>>>
>>>> On Thu, Nov 02, 2017 at 12:49:32PM +0000, Shameerali Kolothum Thodi wrote:
>>>>> We had a go with this series on HiSIlicon D05 platform which doesn't have
>>>>> support for ssids/ATS/PRI, to make sure it generally works.
>>>>>
>>>>> But observed the below crash on boot,
>>>>>
>>>>> [   16.009084] WARNING: CPU: 59 PID: 391 at mm/page_alloc.c:3883
>>>> __alloc_pages_nodemask+0x19c/0xc48
>>>>> [   16.026797] Modules linked in:
>>>>> [   16.032944] CPU: 59 PID: 391 Comm: kworker/59:1 Not tainted 4.14.0-rc1-
>>>> 159539-ge42aca3 #236
>>>>> [...]
>>>>> [   16.068206] Workqueue: events deferred_probe_work_func
>>>>> [   16.078557] task: ffff8017d38a0000 task.stack: ffff00000b198000
>>>>> [   16.090486] PC is at __alloc_pages_nodemask+0x19c/0xc48
>>>>> [   16.101013] LR is at __alloc_pages_nodemask+0xe0/0xc48
>>>>> [   16.469220] [<ffff000008186b94>] __alloc_pages_nodemask+0x19c/0xc48
>>>>> [   16.481854] [<ffff0000081d65b0>] alloc_pages_current+0x80/0xcc
>>>>> [   16.493607] [<ffff000008182be8>] __get_free_pages+0xc/0x38
>>>>> [   16.504661] [<ffff0000083c4d58>] swiotlb_alloc_coherent+0x64/0x190
>>>>> [   16.517117] [<ffff00000809824c>] __dma_alloc+0x110/0x204
>>>>> [   16.527820] [<ffff00000858e850>] dmam_alloc_coherent+0x88/0xf0
>>>>> [   16.539575] [<ffff000008568884>]
>>>> arm_smmu_domain_finalise_s1+0x60/0x248
>>>>> [   16.552909] [<ffff00000856c104>] arm_smmu_attach_dev+0x264/0x300
>>>>> [   16.565013] [<ffff00000855d40c>] __iommu_attach_device+0x48/0x5c
>>>>> [   16.577117] [<ffff00000855e698>] iommu_group_add_device+0x144/0x3a4
>>>>> [   16.589746] [<ffff00000855ed18>] iommu_group_get_for_dev+0x70/0xf8
>>>>> [   16.602201] [<ffff00000856a314>] arm_smmu_add_device+0x1a4/0x418
>>>>> [   16.614308] [<ffff00000849dfcc>] iort_iommu_configure+0xf0/0x16c
>>>>> [   16.626416] [<ffff000008468c50>] acpi_dma_configure+0x30/0x70
>>>>> [   16.637994] [<ffff00000858f00c>] dma_configure+0xa8/0xd4
>>>>> [   16.648695] [<ffff00000857706c>] driver_probe_device+0x1a4/0x2dc
>>>>> [   16.673081] [<ffff0000085752c8>] bus_for_each_drv+0x54/0x94
>>>>> [   16.684307] [<ffff000008576db0>] __device_attach+0xc4/0x12c
>>>>> [   16.695533] [<ffff000008577350>] device_initial_probe+0x10/0x18
>>>>> [   16.707462] [<ffff0000085762b4>] bus_probe_device+0x90/0x98
>>>>>
>>>>> After a bit of debug it looks like on platforms where ssid is not supported,
>>>>> s1_cfg.num_contexts is set to zero and it eventually results in this crash
>>>>> in,
>>>>> arm_smmu_domain_finalise_s1() -->arm_smmu_alloc_cd_tables()-->
>>>>> arm_smmu_alloc_cd_leaf_table() as num_leaf_entries is zero.
>>>>>
>>>>> With the below fix, it works on D05 now,
>>>>>
>>>>> diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
>>>>> index 8ad90e2..51f5821 100644
>>>>> --- a/drivers/iommu/arm-smmu-v3.c
>>>>> +++ b/drivers/iommu/arm-smmu-v3.c
>>>>> @@ -2433,7 +2433,10 @@ static int arm_smmu_domain_finalise(struct
>>>> iommu_domain *domain,
>>>>>                         domain->min_pasid = 1;
>>>>>                         domain->max_pasid = master->num_ssids - 1;
>>>>>                         smmu_domain->s1_cfg.num_contexts = master->num_ssids;
>>>>> +               } else {
>>>>> +                       smmu_domain->s1_cfg.num_contexts = 1;
>>>>>                 }
>>>>> +
>>>>>                 smmu_domain->s1_cfg.can_stall = master->ste.can_stall;
>>>>>                 break;
>>>>>         case ARM_SMMU_DOMAIN_NESTED:
>>>>>
>>>>>
>>>>> I am not sure this is right place do this. Please take a look.
>>>>
>>>> Thanks for testing the series and reporting the bug. I added the
>>>> following patch to branch svm/current, does it work for you?
>>>
>>> Yes, it does.
>>>
>>> Thanks,
>>> Shameer
>>>  
>>>> diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
>>>> index 42c8378624ed..edda466adc81 100644
>>>> --- a/drivers/iommu/arm-smmu-v3.c
>>>> +++ b/drivers/iommu/arm-smmu-v3.c
>>>> @@ -3169,9 +3169,7 @@ static int arm_smmu_add_device(struct device *dev)
>>>>                 }
>>>>         }
>>>>
>>>> -       if (smmu->ssid_bits)
>>>> -               master->num_ssids = 1 << min(smmu->ssid_bits,
>>>> -                                            fwspec->num_pasid_bits);
>>>> +       master->num_ssids = 1 << min(smmu->ssid_bits, fwspec-
>>>>> num_pasid_bits);
>>
>> If fwspec->num_pasid_bits = 0, then master have _one_ num_ssids ?
> 
> Yes, the context table allocator always needs to allocate at least one
> entry, even if the master or SMMU doesn't support SSID. I think an earlier
> version called this field "num_contexts", maybe we should got back to that
> name for clarity?
> 
Yes, it will be more clear.

Thanks
Yisheng

> Thanks,
> Jean
> 
> .
> 

--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Bob Liu Nov. 8, 2017, 1:21 a.m. UTC | #22

Hi Jean,

On 2017/10/12 20:55, Jean-Philippe Brucker wrote:
> On 12/10/17 13:05, Yisheng Xie wrote:
> [...]
>>>>> * An iommu_process can be bound to multiple domains, and a domain can have
>>>>>   multiple iommu_process.
>>>> when bind a task to device, can we create a single domain for it? I am thinking
>>>> about process management without shared PT(for some device only support PASID
>>>> without pri ability), it seems hard to expand if a domain have multiple iommu_process?
>>>> Do you have any idea about this?
>>>
>>> A device always has to be in a domain, as far as I know. Not supporting
>>> PRI forces you to pin down all user mappings (or just the ones you use for
>>> DMA) but you should sill be able to share PT. Now if you don't support
>>> shared PT either, but only PASID, then you'll have to use io-pgtable and a
>>> new map/unmap API on an iommu_process. I don't understand your concern
>>> though, how would the link between process and domains prevent this use-case?
>>>
>> So you mean that if an iommu_process bind to multiple devices it should create
>> multiple io-pgtables? or just share the same io-pgtable?
> 
> I don't know to be honest, I haven't thought much about the io-pgtable
> case, I'm all about sharing the mm :)
> 

Sorry to get back to this thread, but traditional DMA_MAP use case may also want to
enable Substreamid/PASID.
As a general framework, you may also consider SubStreamid/Pasid support for dma map/io-pgtable.

We're considering make io-pgtables per SubStreamid/Pasid, but haven't decide put all 
io-pgtables into a single domain or iommu_process.

Thanks,
Liubo

> It really depends on what the user (GPU driver I assume) wants. I think
> that if you're not sharing an mm with the device, then you're trying to
> hide parts of the process to the device, so you'd also want the
> flexibility of having different io-pgtables between devices. Different
> devices accessing isolated parts of the process requires separate io-pgtables.
> 


--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Jean-Philippe Brucker Nov. 8, 2017, 10:50 a.m. UTC | #23

Hi Liubo,

On 08/11/17 01:21, Bob Liu wrote:
> Hi Jean,
> 
> On 2017/10/12 20:55, Jean-Philippe Brucker wrote:
>> On 12/10/17 13:05, Yisheng Xie wrote:
>> [...]
>>>>>> * An iommu_process can be bound to multiple domains, and a domain can have
>>>>>>   multiple iommu_process.
>>>>> when bind a task to device, can we create a single domain for it? I am thinking
>>>>> about process management without shared PT(for some device only support PASID
>>>>> without pri ability), it seems hard to expand if a domain have multiple iommu_process?
>>>>> Do you have any idea about this?
>>>>
>>>> A device always has to be in a domain, as far as I know. Not supporting
>>>> PRI forces you to pin down all user mappings (or just the ones you use for
>>>> DMA) but you should sill be able to share PT. Now if you don't support
>>>> shared PT either, but only PASID, then you'll have to use io-pgtable and a
>>>> new map/unmap API on an iommu_process. I don't understand your concern
>>>> though, how would the link between process and domains prevent this use-case?
>>>>
>>> So you mean that if an iommu_process bind to multiple devices it should create
>>> multiple io-pgtables? or just share the same io-pgtable?
>>
>> I don't know to be honest, I haven't thought much about the io-pgtable
>> case, I'm all about sharing the mm :)
>>
> 
> Sorry to get back to this thread, but traditional DMA_MAP use case may also want to
> enable Substreamid/PASID.
> As a general framework, you may also consider SubStreamid/Pasid support for dma map/io-pgtable.
> 
> We're considering make io-pgtables per SubStreamid/Pasid, but haven't decide put all 
> io-pgtables into a single domain or iommu_process.

Yes they should be in a single domain, see also my other reply here:
http://www.spinics.net/lists/arm-kernel/msg613586.html

I've only been thinking about the IOMMU API for the moment, but I guess
the VFIO API would use this extension? I suppose it would be a new PASID
field to DMA_MAP along with a flag. The PASID would probably be allocated
with BIND + some special flag.

Thanks,
Jean
--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Bharat Kumar Gogada Nov. 8, 2017, 5:50 p.m. UTC | #24

Hi Jean,

+static struct iommu_process *
+iommu_process_alloc(struct iommu_domain *domain, struct task_struct 
+*task) {
+	int err;
+	int pasid;
+	struct iommu_process *process;
+
+	if (WARN_ON(!domain->ops->process_alloc || !domain->ops->process_free))
+		return ERR_PTR(-ENODEV);
+
+	process = domain->ops->process_alloc(task);
+	if (IS_ERR(process))
+		return process;
+	if (!process)
+		return ERR_PTR(-ENOMEM);
+
+	process->pid		= get_task_pid(task, PIDTYPE_PID);
+	process->release	= domain->ops->process_free;
+	INIT_LIST_HEAD(&process->domains);
+	kref_init(&process->kref);
+
+	if (!process->pid) {
+		err = -EINVAL;
+		goto err_free_process;
+	}
+
+	idr_preload(GFP_KERNEL);
+	spin_lock(&iommu_process_lock);
+	pasid = idr_alloc_cyclic(&iommu_process_idr, process, domain->min_pasid,
+				 domain->max_pasid + 1, GFP_ATOMIC);
If EP supports only one pasid; domain->min_pasid=1 and domain->max_pasid=0.
When idr_alloc_cyclic is called it invokes idr_get_free_cmn function where we have following condition. (Based on kernel 4.14-rc6)
	if (!radix_tree_tagged(root, IDR_FREE)) 
                start = max(start, maxindex + 1);
              if (start > max) 
		return ERR_PTR(-ENOSPC);
Here max is being assigned zero by the time this function is invoked, this value is based on domain->max_pasid.
This condition fails and ENOSPC is returned.
	
In this case even though hardware supports PASID, BIND flow fails.
Any reason why pasid allocation moved to idr allocations rather than bitmap allocations as in v1 patches ?

+	process->pasid = pasid;

Regards,
Bharat
--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Yisheng Xie Nov. 9, 2017, 3:32 a.m. UTC | #25

Hi Jean,

On 2017/10/6 21:31, Jean-Philippe Brucker wrote:
> Hook process operations to support PASID and page table sharing with the
> SMMUv3:
> 
> +
> +static void arm_smmu_process_exit(struct iommu_domain *domain,
> +				  struct iommu_process *process)
> +{
> +	struct arm_smmu_master_data *master;
> +	struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
> +
> +	if (!domain->process_exit)
> +		return;

If domain do not set process_exit handler, just return? smmu do not
need invalid ATC, clear cd entry, etc.? Maybe you should check when
call domain->process_exit?

> +
> +	spin_lock(&smmu_domain->devices_lock);
> +	list_for_each_entry(master, &smmu_domain->devices, list) {
> +		if (!master->processes)
> +			continue;
> +
> +		master->processes--;
Add
		if (domain->process_exit)
here?
> +		domain->process_exit(domain, master->dev, process->pasid,
> +				     domain->process_exit_token);
> +
> +		/* TODO: inval ATC */
> +	}
> +	spin_unlock(&smmu_domain->devices_lock);
> +
> +	arm_smmu_write_ctx_desc(smmu_domain, process->pasid, NULL);
> +
> +	/* TODO: Invalidate all mappings if not DVM */
> +}
> +
Thanks
Yisheng Xie

--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Jean-Philippe Brucker Nov. 9, 2017, 12:08 p.m. UTC | #26

Hi,

On 09/11/17 03:32, Yisheng Xie wrote:
> Hi Jean,
> 
> On 2017/10/6 21:31, Jean-Philippe Brucker wrote:
>> Hook process operations to support PASID and page table sharing with the
>> SMMUv3:
>>
>> +
>> +static void arm_smmu_process_exit(struct iommu_domain *domain,
>> +				  struct iommu_process *process)
>> +{
>> +	struct arm_smmu_master_data *master;
>> +	struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
>> +
>> +	if (!domain->process_exit)
>> +		return;
> 
> If domain do not set process_exit handler, just return? smmu do not
> need invalid ATC, clear cd entry, etc.? Maybe you should check when
> call domain->process_exit?

Indeed, that doesn't make sense. I'll move the check below.

Thanks,
Jean

>> +
>> +	spin_lock(&smmu_domain->devices_lock);
>> +	list_for_each_entry(master, &smmu_domain->devices, list) {
>> +		if (!master->processes)
>> +			continue;
>> +
>> +		master->processes--;
> Add
> 		if (domain->process_exit)
> here?
>> +		domain->process_exit(domain, master->dev, process->pasid,
>> +				     domain->process_exit_token);
>> +
>> +		/* TODO: inval ATC */
>> +	}
>> +	spin_unlock(&smmu_domain->devices_lock);
>> +
>> +	arm_smmu_write_ctx_desc(smmu_domain, process->pasid, NULL);
>> +
>> +	/* TODO: Invalidate all mappings if not DVM */
>> +}
>> +
> Thanks
> Yisheng Xie
> 
> 

--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Jean-philippe Brucker Nov. 9, 2017, 12:13 p.m. UTC | #27

Hi Bharat,

On 11/08/2017 05:50 PM, Bharat Kumar Gogada wrote:
> Hi Jean,
> 
> +static struct iommu_process *
> +iommu_process_alloc(struct iommu_domain *domain, struct task_struct
> +*task) {
> +       int err;
> +       int pasid;
> +       struct iommu_process *process;
> +
> +       if (WARN_ON(!domain->ops->process_alloc ||
> !domain->ops->process_free))
> +               return ERR_PTR(-ENODEV);
> +
> +       process = domain->ops->process_alloc(task);
> +       if (IS_ERR(process))
> +               return process;
> +       if (!process)
> +               return ERR_PTR(-ENOMEM);
> +
> +       process->pid            = get_task_pid(task, PIDTYPE_PID);
> +       process->release        = domain->ops->process_free;
> +       INIT_LIST_HEAD(&process->domains);
> +       kref_init(&process->kref);
> +
> +       if (!process->pid) {
> +               err = -EINVAL;
> +               goto err_free_process;
> +       }
> +
> +       idr_preload(GFP_KERNEL);
> +       spin_lock(&iommu_process_lock);
> +       pasid = idr_alloc_cyclic(&iommu_process_idr, process,
> domain->min_pasid,
> +                                domain->max_pasid + 1, GFP_ATOMIC);
> If EP supports only one pasid; domain->min_pasid=1 and domain->max_pasid=0.
> When idr_alloc_cyclic is called it invokes idr_get_free_cmn function
> where we have following condition. (Based on kernel 4.14-rc6)
>         if (!radix_tree_tagged(root, IDR_FREE))
>                 start = max(start, maxindex + 1);
>               if (start > max)
>                 return ERR_PTR(-ENOSPC);
> Here max is being assigned zero by the time this function is invoked,
> this value is based on domain->max_pasid.
> This condition fails and ENOSPC is returned.
>        
> In this case even though hardware supports PASID, BIND flow fails.

It should fail, since we're reserving PASID 0 for non-PASID transactions
with S1DSS=0b10. In addition, the SMMUv3 specification does not allow
using PASID with a single entry. See the description of S1CDMax in 5.2
Stream Table Entry:

"when this field is 0, the substreams of the STE are disabled and one CD
is available. (The minimum useful number of substreams is 2.) Any
transaction with a SubstreamID will be terminated with an abort and a
C_BAD_SUBSTREAMID event recorded."

> Any reason why pasid allocation moved to idr allocations rather than
> bitmap allocations as in v1 patches ?

Yes, idr provides a convenient way to quickly retrieve the context
associated with a PASID, when handling a fault. v1 had the allocation in a
bitmap and storing in a rb-tree. By using an idr we combine both and rely
on a well-tested infrastructure.

Note that in the future we might need to go back to handcrafting the PASID
allocation, but it will probably still be based on idr.

Thanks,
Jean
--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Jean-Philippe Brucker Nov. 9, 2017, 12:16 p.m. UTC | #28

Hi Bharat,

On 11/08/2017 05:50 PM, Bharat Kumar Gogada wrote:
> Hi Jean,
> 
> +static struct iommu_process *
> +iommu_process_alloc(struct iommu_domain *domain, struct task_struct
> +*task) {
> +       int err;
> +       int pasid;
> +       struct iommu_process *process;
> +
> +       if (WARN_ON(!domain->ops->process_alloc ||
> !domain->ops->process_free))
> +               return ERR_PTR(-ENODEV);
> +
> +       process = domain->ops->process_alloc(task);
> +       if (IS_ERR(process))
> +               return process;
> +       if (!process)
> +               return ERR_PTR(-ENOMEM);
> +
> +       process->pid            = get_task_pid(task, PIDTYPE_PID);
> +       process->release        = domain->ops->process_free;
> +       INIT_LIST_HEAD(&process->domains);
> +       kref_init(&process->kref);
> +
> +       if (!process->pid) {
> +               err = -EINVAL;
> +               goto err_free_process;
> +       }
> +
> +       idr_preload(GFP_KERNEL);
> +       spin_lock(&iommu_process_lock);
> +       pasid = idr_alloc_cyclic(&iommu_process_idr, process,
> domain->min_pasid,
> +                                domain->max_pasid + 1, GFP_ATOMIC);
> If EP supports only one pasid; domain->min_pasid=1 and domain->max_pasid=0.
> When idr_alloc_cyclic is called it invokes idr_get_free_cmn function
> where we have following condition. (Based on kernel 4.14-rc6)
>         if (!radix_tree_tagged(root, IDR_FREE))
>                 start = max(start, maxindex + 1);
>               if (start > max)
>                 return ERR_PTR(-ENOSPC);
> Here max is being assigned zero by the time this function is invoked,
> this value is based on domain->max_pasid.
> This condition fails and ENOSPC is returned.
>        
> In this case even though hardware supports PASID, BIND flow fails.

It should fail, since we're reserving PASID 0 for non-PASID transactions
with S1DSS=0b10. In addition, the SMMUv3 specification does not allow
using PASID with a single entry. See the description of S1CDMax in 5.2
Stream Table Entry:

"when this field is 0, the substreams of the STE are disabled and one CD
is available. (The minimum useful number of substreams is 2.) Any
transaction with a SubstreamID will be terminated with an abort and a
C_BAD_SUBSTREAMID event recorded."

> Any reason why pasid allocation moved to idr allocations rather than
> bitmap allocations as in v1 patches ?

Yes, idr provides a convenient way to quickly retrieve the context
associated with a PASID, when handling a fault. v1 had the allocation in a
bitmap and storing in a rb-tree. By using an idr we combine both and rely
on a well-tested infrastructure.

Note that in the future we might need to go back to handcrafting the PASID
allocation, but it will probably still be based on idr.

Thanks,
Jean
--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Bharat Kumar Gogada Nov. 13, 2017, 11:06 a.m. UTC | #29

>        
> In this case even though hardware supports PASID, BIND flow fails.

It should fail, since we're reserving PASID 0 for non-PASID transactions with S1DSS=0b10. In addition, the SMMUv3 specification does not allow using PASID with a single entry. See the description of S1CDMax in 5.2 Stream Table Entry:

"when this field is 0, the substreams of the STE are disabled and one CD is available. (The minimum useful number of substreams is 2.) Any transaction with a SubstreamID will be terminated with an abort and a C_BAD_SUBSTREAMID event recorded."

> Any reason why pasid allocation moved to idr allocations rather than 
> bitmap allocations as in v1 patches ?

Yes, idr provides a convenient way to quickly retrieve the context associated with a PASID, when handling a fault. v1 had the allocation in a bitmap and storing in a rb-tree. By using an idr we combine both and rely on a well-tested infrastructure.

Note that in the future we might need to go back to handcrafting the PASID allocation, but it will probably still be based on idr.

Thanks for the clarification, Jean. 
--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Bharat Kumar Gogada Nov. 16, 2017, 2:19 p.m. UTC | #30

Hi Jean,

+static size_t
+arm_smmu_atc_inv_domain(struct arm_smmu_domain *smmu_domain, int ssid,
+			unsigned long iova, size_t size)
+{
+	unsigned long flags;
+	struct arm_smmu_cmdq_ent cmd;
+	struct arm_smmu_master_data *master;
+
+	arm_smmu_atc_inv_to_cmd(ssid, iova, size, &cmd);
+
+	spin_lock_irqsave(&smmu_domain->devices_lock, flags);
+	list_for_each_entry(master, &smmu_domain->devices, list)
+		arm_smmu_atc_inv_master(master, &cmd);
+	spin_unlock_irqrestore(&smmu_domain->devices_lock, flags);
+
+	return size;
+}
+
 /* IOMMU API */
 static bool arm_smmu_capable(enum iommu_cap cap)  { @@ -2361,6 +2506,8 @@ static void arm_smmu_detach_dev(struct device *dev)
 		__iommu_process_unbind_dev_all(&smmu_domain->domain, dev);
 
 	if (smmu_domain) {
+		arm_smmu_atc_inv_master_all(master, 0);
+
In BIND flow, when VFIO_IOMMU_UNBIND is invoked invalidation is sent on allocated PASID for this application.
When vfio group fd is closed after UNBIND,  arm_smmu_detach_dev is invoked now invalidation is being sent on ssid zero.
Why  invalidation needs to be sent on ssid zero ?

Regards,
Bharat

--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Jean-Philippe Brucker Nov. 16, 2017, 3:03 p.m. UTC | #31

Hi Bharat,

On 16/11/17 14:19, Bharat Kumar Gogada wrote:
> Hi Jean,
> 
> +static size_t
> +arm_smmu_atc_inv_domain(struct arm_smmu_domain *smmu_domain, int ssid,
> +			unsigned long iova, size_t size)
> +{
> +	unsigned long flags;
> +	struct arm_smmu_cmdq_ent cmd;
> +	struct arm_smmu_master_data *master;
> +
> +	arm_smmu_atc_inv_to_cmd(ssid, iova, size, &cmd);
> +
> +	spin_lock_irqsave(&smmu_domain->devices_lock, flags);
> +	list_for_each_entry(master, &smmu_domain->devices, list)
> +		arm_smmu_atc_inv_master(master, &cmd);
> +	spin_unlock_irqrestore(&smmu_domain->devices_lock, flags);
> +
> +	return size;
> +}
> +
>  /* IOMMU API */
>  static bool arm_smmu_capable(enum iommu_cap cap)  { @@ -2361,6 +2506,8 @@ static void arm_smmu_detach_dev(struct device *dev)
>  		__iommu_process_unbind_dev_all(&smmu_domain->domain, dev);
>  
>  	if (smmu_domain) {
> +		arm_smmu_atc_inv_master_all(master, 0);
> +
> In BIND flow, when VFIO_IOMMU_UNBIND is invoked invalidation is sent on allocated PASID for this application.
> When vfio group fd is closed after UNBIND,  arm_smmu_detach_dev is invoked now invalidation is being sent on ssid zero.
> Why  invalidation needs to be sent on ssid zero ?

It's possible to use bind/unbind and map/unmap APIs at the same time on a
domain. map/unmap modifies non-ssid mappings as usual, for context
descriptor 0. Note that SSID 0 is converted to "no SSID" by
arm_smmu_atc_inv_to_cmd.

That said, I think VFIO cleans all DMA mappings when the VFIO group fd is
closed, which will send individual ATC invalidation for each mapping. But
VFIO may not be the only user of this API, and future users may be less
careful. It's safer to send this global invalidation whenever we detach
the domain, to ensure we're not leaving stale ATC entries for the next user.

Thanks,
Jean
--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Bharat Kumar Gogada Nov. 17, 2017, 6:11 a.m. UTC | #32

> +
> +	spin_lock_irqsave(&smmu_domain->devices_lock, flags);
> +	list_for_each_entry(master, &smmu_domain->devices, list)
> +		arm_smmu_atc_inv_master(master, &cmd);
> +	spin_unlock_irqrestore(&smmu_domain->devices_lock, flags);
> +
> +	return size;
> +}
> +
>  /* IOMMU API */
>  static bool arm_smmu_capable(enum iommu_cap cap)  { @@ -2361,6 +2506,8 @@ static void arm_smmu_detach_dev(struct device *dev)
>  		__iommu_process_unbind_dev_all(&smmu_domain->domain, dev);
>  
>  	if (smmu_domain) {
> +		arm_smmu_atc_inv_master_all(master, 0);
> +
> In BIND flow, when VFIO_IOMMU_UNBIND is invoked invalidation is sent on allocated PASID for this application.
> When vfio group fd is closed after UNBIND,  arm_smmu_detach_dev is invoked now invalidation is being sent on ssid zero.
> Why  invalidation needs to be sent on ssid zero ?

It's possible to use bind/unbind and map/unmap APIs at the same time on a domain. map/unmap modifies non-ssid mappings as usual, for context descriptor 0. Note that SSID 0 is converted to "no SSID" by arm_smmu_atc_inv_to_cmd.

That said, I think VFIO cleans all DMA mappings when the VFIO group fd is closed, which will send individual ATC invalidation for each mapping. But VFIO may not be the only user of this API, and future users may be less careful. It's safer to send this global invalidation whenever we detach the domain, to ensure we're not leaving stale ATC entries for the next user.

Thanks Jean, I see that currently vfio_group_fops_open does not allow multiple instances. 
If a device supports multiple PASID there might be different applications running parallel. 
So why is multiple instances restricted ?

Regards,
Bharat



--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Jean-Philippe Brucker Nov. 17, 2017, 11:39 a.m. UTC | #33

On 17/11/17 06:11, Bharat Kumar Gogada wrote:
[...]
> Thanks Jean, I see that currently vfio_group_fops_open does not allow multiple instances. 
> If a device supports multiple PASID there might be different applications running parallel. 
> So why is multiple instances restricted ?

You can't have multiple processes owning the same PCI device, it's
unmanageable.

For using multiple PASIDs, my idea was that the userspace driver ("the
server"), that owns the device, would have a way to partition it into
smaller frames. It forks to create "clients" and assigns a PASID to each
of them (by issuing VFIO_BIND(client_pid) -> pasid, then writing the PASID
into a privileged MMIO frame that defines the partition properties). Each
client accesses an unprivileged MMIO frame to use a device partition (or
sends commands to the server via IPC), and can perform DMA on its own
virtual memory.

This is complete speculation of course, we have very little information on
how PASID-capable devices will be designed, so I'm trying to imagine
likely scenarios.

Thanks,
Jean
--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Bob Liu Nov. 22, 2017, 3:15 a.m. UTC | #34

Hey Jean,

On 2017/10/6 21:31, Jean-Philippe Brucker wrote:
> IOMMU drivers need a way to bind Linux processes to devices. This is used
> for Shared Virtual Memory (SVM), where devices support paging. In that
> mode, DMA can directly target virtual addresses of a process.
> 
> Introduce boilerplate code for allocating process structures and binding
> them to devices. Four operations are added to IOMMU drivers:
> 
> * process_alloc, process_free: to create an iommu_process structure and
>   perform architecture-specific operations required to grab the process
>   (for instance on ARM SMMU, pin down the CPU ASID). There is a single
>   iommu_process structure per Linux process.
> 

I'm a bit confused here.
The original meaning of iommu_domain is a virtual addrspace defined by a set of io page table.
(fix me if I misunderstood).
Then what's the meaning of iommu_domain and iommu_process after introducing iommu_process?
Could you consider document these concepts? 

Thanks,
Liubo

--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Jean-Philippe Brucker Nov. 22, 2017, 1:04 p.m. UTC | #35

On 22/11/17 03:15, Bob Liu wrote:
> Hey Jean,
> 
> On 2017/10/6 21:31, Jean-Philippe Brucker wrote:
>> IOMMU drivers need a way to bind Linux processes to devices. This is used
>> for Shared Virtual Memory (SVM), where devices support paging. In that
>> mode, DMA can directly target virtual addresses of a process.
>>
>> Introduce boilerplate code for allocating process structures and binding
>> them to devices. Four operations are added to IOMMU drivers:
>>
>> * process_alloc, process_free: to create an iommu_process structure and
>>   perform architecture-specific operations required to grab the process
>>   (for instance on ARM SMMU, pin down the CPU ASID). There is a single
>>   iommu_process structure per Linux process.
>>
> 
> I'm a bit confused here.
> The original meaning of iommu_domain is a virtual addrspace defined by a set of io page table.
> (fix me if I misunderstood).

iommu_domain can also be seen as a logical partition of devices that share
the same address spaces (the concept comes from AMD and Intel IOMMU
domains, I believe). Without PASIDs it was a single address space, with
PASIDs it can have multiple address spaces.

> Then what's the meaning of iommu_domain and iommu_process after introducing iommu_process?
> Could you consider document these concepts? 

iommu_process is used to keep track of Linux process address spaces. I'll
rename it to io_mm in next version, to make it clear that it doesn't
represent a Linux task but an mm_struct instead. However the
implementation stays pretty much identical. A domain can be associated to
multiple io_mm, and an io_mm can be associated to multiple domains.

In the IOMMU architectures I know, PASID is implemented like this. You
have the device tables (stream tables on SMMU), pointing to PASID tables
(context descriptor tables on SMMU). In the following diagram,

                    .->+--------+
                   / 0 |        |------ io_pgtable
                  /    +--------+
                 /   1 |        |------ io_mm->mm X
    +--------+  /      +--------+
  0 |      A |-'     2 |        |-.
    +--------+         +--------+  \
  1 |        |       3 |        |   \
    +--------+         +--------+    -- io_mm->mm Y
  2 |      B |--.     PASID tables  /
    +--------+   \                 |
  3 |      B |----+--->+--------+  |
    +--------+   /   0 |        |- | -- io_pgtable
  4 |      B |--'      +--------+  |
    +--------+       1 |        |  |
  Device tables        +--------+  |
                     2 |        |--'
                       +--------+
                     3 |        |------ io_mm->priv io_pgtable
                       +--------+
                      PASID tables

* Device 0 (e.g. PCI 0000:00:00.0) is in domain A.
* Devices 2, 3 and 4 are in domain B.
* Domain A has the top set of PASID tables.
* Domain B has the bottom set of PASID tables.

* Domain A is bound to process address space X.
  -> Device 0 can access X with PASID 1.
* Both domains A and B are bound to process address space Y.
  -> Devices 0, 2, 3 and 4 can access Y with PASID 2

* PASID 0 is special on Arm SMMU (with S1DSS=0b10). It will always be
  reserved for classic DMA map/unmap. Even for hypothetical devices that
  don't support non-pasid transactions, I'd like to keep this convention.
  It should be quite useful for device drivers to have PASID 0 available
  with DMA map/unmap.

* When introducing "private" PASID address spaces (that many are asking
  for), which are backed by a set of io-pgtable and map/unmap ops, I
  suppose they would reuse the io_mm structure. In this example PASID 3 is
  associated to a private address space and not backed by an mm. Since the
  PASID space is global, PASID 3 won't be available for any other domain.

Does this clarify the current design, or is it just more confusing?

Thanks,
Jean
--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Bob Liu Nov. 23, 2017, 10:33 a.m. UTC | #36

On 2017/11/22 21:04, Jean-Philippe Brucker wrote:
> On 22/11/17 03:15, Bob Liu wrote:
>> Hey Jean,
>>
>> On 2017/10/6 21:31, Jean-Philippe Brucker wrote:
>>> IOMMU drivers need a way to bind Linux processes to devices. This is used
>>> for Shared Virtual Memory (SVM), where devices support paging. In that
>>> mode, DMA can directly target virtual addresses of a process.
>>>
>>> Introduce boilerplate code for allocating process structures and binding
>>> them to devices. Four operations are added to IOMMU drivers:
>>>
>>> * process_alloc, process_free: to create an iommu_process structure and
>>>   perform architecture-specific operations required to grab the process
>>>   (for instance on ARM SMMU, pin down the CPU ASID). There is a single
>>>   iommu_process structure per Linux process.
>>>
>>
>> I'm a bit confused here.
>> The original meaning of iommu_domain is a virtual addrspace defined by a set of io page table.
>> (fix me if I misunderstood).
> 
> iommu_domain can also be seen as a logical partition of devices that share
> the same address spaces (the concept comes from AMD and Intel IOMMU
> domains, I believe). Without PASIDs it was a single address space, with
> PASIDs it can have multiple address spaces.
> 
>> Then what's the meaning of iommu_domain and iommu_process after introducing iommu_process?
>> Could you consider document these concepts? 
> 
> iommu_process is used to keep track of Linux process address spaces. I'll
> rename it to io_mm in next version, to make it clear that it doesn't
> represent a Linux task but an mm_struct instead. However the
> implementation stays pretty much identical. A domain can be associated to
> multiple io_mm, and an io_mm can be associated to multiple domains.
> 
> In the IOMMU architectures I know, PASID is implemented like this. You
> have the device tables (stream tables on SMMU), pointing to PASID tables
> (context descriptor tables on SMMU). In the following diagram,
> 
>                     .->+--------+
>                    / 0 |        |------ io_pgtable
>                   /    +--------+
>                  /   1 |        |------ io_mm->mm X
>     +--------+  /      +--------+
>   0 |      A |-'     2 |        |-.
>     +--------+         +--------+  \
>   1 |        |       3 |        |   \
>     +--------+         +--------+    -- io_mm->mm Y
>   2 |      B |--.     PASID tables  /
>     +--------+   \                 |
>   3 |      B |----+--->+--------+  |
>     +--------+   /   0 |        |- | -- io_pgtable
>   4 |      B |--'      +--------+  |
>     +--------+       1 |        |  |
>   Device tables        +--------+  |
>                      2 |        |--'
>                        +--------+
>                      3 |        |------ io_mm->priv io_pgtable
>                        +--------+
>                       PASID tables
> 
> * Device 0 (e.g. PCI 0000:00:00.0) is in domain A.
> * Devices 2, 3 and 4 are in domain B.
> * Domain A has the top set of PASID tables.
> * Domain B has the bottom set of PASID tables.
> 
> * Domain A is bound to process address space X.
>   -> Device 0 can access X with PASID 1.
> * Both domains A and B are bound to process address space Y.
>   -> Devices 0, 2, 3 and 4 can access Y with PASID 2
> 
> * PASID 0 is special on Arm SMMU (with S1DSS=0b10). It will always be
>   reserved for classic DMA map/unmap. Even for hypothetical devices that
>   don't support non-pasid transactions, I'd like to keep this convention.
>   It should be quite useful for device drivers to have PASID 0 available
>   with DMA map/unmap.
> 
> * When introducing "private" PASID address spaces (that many are asking
>   for), which are backed by a set of io-pgtable and map/unmap ops, I
>   suppose they would reuse the io_mm structure. In this example PASID 3 is
>   associated to a private address space and not backed by an mm. Since the
>   PASID space is global, PASID 3 won't be available for any other domain.
> 
> Does this clarify the current design, or is it just more confusing?
> 

It's very helpful, thank you very much!

Regards,
Liubo


--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Bob Liu Nov. 24, 2017, 8:23 a.m. UTC | #37

On 2017/10/6 21:31, Jean-Philippe Brucker wrote:
> Add two new ioctl for VFIO containers. VFIO_DEVICE_BIND_PROCESS creates a
> bond between a container and a process address space, identified by a
> device-specific ID named PASID. This allows the device to target DMA
> transactions at the process virtual addresses without a need for mapping
> and unmapping buffers explicitly in the IOMMU. The process page tables are
> shared with the IOMMU, and mechanisms such as PCI ATS/PRI may be used to
> handle faults. VFIO_DEVICE_UNBIND_PROCESS removed a bond identified by a
> PASID.
> 

How about hide bind/unbind into ioctl(VFIO_SET_IOMMU)? 
e.g always bind to current process in SET_IOMMU.

Not sure about the real use case.

> Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
> ---
>  drivers/vfio/vfio_iommu_type1.c | 243 +++++++++++++++++++++++++++++++++++++++-
>  include/uapi/linux/vfio.h       |  69 ++++++++++++
>  2 files changed, 311 insertions(+), 1 deletion(-)
> 


--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Jean-Philippe Brucker Nov. 24, 2017, 10:58 a.m. UTC | #38

On 24/11/17 08:23, Bob Liu wrote:
> On 2017/10/6 21:31, Jean-Philippe Brucker wrote:
>> Add two new ioctl for VFIO containers. VFIO_DEVICE_BIND_PROCESS creates a
>> bond between a container and a process address space, identified by a
>> device-specific ID named PASID. This allows the device to target DMA
>> transactions at the process virtual addresses without a need for mapping
>> and unmapping buffers explicitly in the IOMMU. The process page tables are
>> shared with the IOMMU, and mechanisms such as PCI ATS/PRI may be used to
>> handle faults. VFIO_DEVICE_UNBIND_PROCESS removed a bond identified by a
>> PASID.
>> 
> 
> How about hide bind/unbind into ioctl(VFIO_SET_IOMMU)?
> e.g always bind to current process in SET_IOMMU.
> 
> Not sure about the real use case.

I guess you could introduce a new VFIO IOMMU type for this. I think this
would be useful for SVA without PASID: if the device supports I/O page
faults, use SET_IOMMU with a VFIO_SVA_IOMMU type (for example) and the
process is bound automatically to the default translation context of the
device. This requires a new IOMMU type because the MAP/UNMAP ioctl won't
work anymore.

I'm not keen on introducing loads of new features in the APIs at the
moment, because I only have the IOMMU point of view, not many endpoint users.

Thanks,
Jean
--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Yisheng Xie Nov. 29, 2017, 6:08 a.m. UTC | #39

On 2017/10/6 21:31, Jean-Philippe Brucker wrote:
> +int iommu_process_bind_device(struct device *dev, struct task_struct *task,
> +			      int *pasid, int flags)
> +{
[..]
> +			err = iommu_process_attach_locked(context, dev);
> +			if (err)
> +				iommu_process_put_locked(process);
one ref for a context is enough right? so also need call iommu_process_put_locked()
if attach ok, or will be leak if user call bind twice for the same device and task.

--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Yisheng Xie Nov. 29, 2017, 6:15 a.m. UTC | #40

Hi Jean,

On 2017/10/6 21:31, Jean-Philippe Brucker wrote:
> -	if (domain->ext_handler) {
> +	if (domain->handler_flags & IOMMU_FAULT_HANDLER_ATOMIC) {
> +		fault->flags |= IOMMU_FAULT_ATOMIC;

Why remove the condition of domain->ext_handler? should it be much better like:
  if ((domain->handler_flags & IOMMU_FAULT_HANDLER_ATOMIC) && domain->ext_handler)

If domain->ext_handler is NULL, and (domain->handler_flags & IOMMU_FAULT_HANDLER_ATOMIC)
is true. It will oops, right?

>  		ret = domain->ext_handler(domain, dev, fault,
>  					  domain->handler_token);

Thanks
Yisheng Xie

--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Jean-Philippe Brucker Nov. 29, 2017, 3:01 p.m. UTC | #41

Hello,

On 29/11/17 06:15, Yisheng Xie wrote:
> Hi Jean,
> 
> On 2017/10/6 21:31, Jean-Philippe Brucker wrote:
>> -	if (domain->ext_handler) {
>> +	if (domain->handler_flags & IOMMU_FAULT_HANDLER_ATOMIC) {
>> +		fault->flags |= IOMMU_FAULT_ATOMIC;
> 
> Why remove the condition of domain->ext_handler? should it be much better like:
>   if ((domain->handler_flags & IOMMU_FAULT_HANDLER_ATOMIC) && domain->ext_handler)
> 
> If domain->ext_handler is NULL, and (domain->handler_flags & IOMMU_FAULT_HANDLER_ATOMIC)
> is true. It will oops, right?

I removed the check because ext_handler shouldn't be NULL if handler_flags
has a bit set (as per iommu_set_ext_fault_handler). But you're right that
this is fragile, and I overlooked the case where users could call
set_ext_fault_handler to clear the fault handler.

(Note that this ext_handler will most likely be replaced by the fault
infrastructure that Jacob is working on:
https://patchwork.kernel.org/patch/10063385/ to which we should add the
atomic/blocking flags)

Thanks,
Jean
--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Jean-Philippe Brucker Nov. 29, 2017, 3:01 p.m. UTC | #42

On 29/11/17 06:08, Yisheng Xie wrote:
> 
> 
> On 2017/10/6 21:31, Jean-Philippe Brucker wrote:
>> +int iommu_process_bind_device(struct device *dev, struct task_struct *task,
>> +			      int *pasid, int flags)
>> +{
> [..]
>> +			err = iommu_process_attach_locked(context, dev);
>> +			if (err)
>> +				iommu_process_put_locked(process);
> one ref for a context is enough right? so also need call iommu_process_put_locked()
> if attach ok, or will be leak if user call bind twice for the same device and task.

I wasn't sure, I think I prefer taking one ref for each bind. If user
calls bind twice, it should call unbind twice as well (in case of leak we
free the context on process exit).

Also with this implementation, user can call bind for two devices in the
same domain, which will share the same context structure. So we have to
take as many refs as bind() calls.

Thanks,
Jean
--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Yisheng Xie Nov. 30, 2017, 1:11 a.m. UTC | #43

Hi, Jean,

On 2017/11/29 23:01, Jean-Philippe Brucker wrote:
> On 29/11/17 06:08, Yisheng Xie wrote:
>>
>>
>> On 2017/10/6 21:31, Jean-Philippe Brucker wrote:
>>> +int iommu_process_bind_device(struct device *dev, struct task_struct *task,
>>> +			      int *pasid, int flags)
>>> +{
>> [..]
>>> +			err = iommu_process_attach_locked(context, dev);
>>> +			if (err)
>>> +				iommu_process_put_locked(process);
>> one ref for a context is enough right? so also need call iommu_process_put_locked()
>> if attach ok, or will be leak if user call bind twice for the same device and task.
> 
> I wasn't sure, I think I prefer taking one ref for each bind. If user
> calls bind twice, it should call unbind twice as well (in case of leak we
> free the context on process exit).
> 
> Also with this implementation, user can call bind for two devices in the
> same domain, which will share the same context structure. So we have to
> take as many refs as bind() calls.

hmm, it has two ref, one for _context_ and the other for *process* (or maybe mm for
your next version), right? For each bind it will take a ref of context as present
design. but why also process ref need be taken for each bind? I mean it seems does
not break _user can call bind for two devices in the same domain_.

And if you really want to take a ref of *process* for echo bind, you should put it when
unbind, right? I just not find where you put the ref of process when unbind. But just put
the process ref when free context.

Maybe I just miss something.

Thanks
Yisheng Xie
> 
> Thanks,
> Jean
> 
> .
> 

--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Yisheng Xie Nov. 30, 2017, 2:45 a.m. UTC | #44

hi jean,

On 2017/11/29 23:01, Jean-Philippe Brucker wrote:
> Hello,
> 
> On 29/11/17 06:15, Yisheng Xie wrote:
>> Hi Jean,
>>
>> On 2017/10/6 21:31, Jean-Philippe Brucker wrote:
>>> -	if (domain->ext_handler) {
>>> +	if (domain->handler_flags & IOMMU_FAULT_HANDLER_ATOMIC) {
>>> +		fault->flags |= IOMMU_FAULT_ATOMIC;
>>
>> Why remove the condition of domain->ext_handler? should it be much better like:
>>   if ((domain->handler_flags & IOMMU_FAULT_HANDLER_ATOMIC) && domain->ext_handler)
>>
>> If domain->ext_handler is NULL, and (domain->handler_flags & IOMMU_FAULT_HANDLER_ATOMIC)
>> is true. It will oops, right?
> 
> I removed the check because ext_handler shouldn't be NULL if handler_flags
> has a bit set (as per iommu_set_ext_fault_handler). But you're right that
> this is fragile, and I overlooked the case where users could call
> set_ext_fault_handler to clear the fault handler.
> 
> (Note that this ext_handler will most likely be replaced by the fault
> infrastructure that Jacob is working on:
> https://patchwork.kernel.org/patch/10063385/ to which we should add the
> atomic/blocking flags)
> 

Get it, thanks for your explanation.

Thanks
Yisheng Xie

> Thanks,
> Jean
> 
> .
> 

--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Jean-Philippe Brucker Nov. 30, 2017, 1:39 p.m. UTC | #45

On 30/11/17 01:11, Yisheng Xie wrote:
> Hi, Jean,
> 
> On 2017/11/29 23:01, Jean-Philippe Brucker wrote:
>> On 29/11/17 06:08, Yisheng Xie wrote:
>>>
>>>
>>> On 2017/10/6 21:31, Jean-Philippe Brucker wrote:
>>>> +int iommu_process_bind_device(struct device *dev, struct task_struct *task,
>>>> +			      int *pasid, int flags)
>>>> +{
>>> [..]
>>>> +			err = iommu_process_attach_locked(context, dev);
>>>> +			if (err)
>>>> +				iommu_process_put_locked(process);
>>> one ref for a context is enough right? so also need call iommu_process_put_locked()
>>> if attach ok, or will be leak if user call bind twice for the same device and task.
>>
>> I wasn't sure, I think I prefer taking one ref for each bind. If user
>> calls bind twice, it should call unbind twice as well (in case of leak we
>> free the context on process exit).
>>
>> Also with this implementation, user can call bind for two devices in the
>> same domain, which will share the same context structure. So we have to
>> take as many refs as bind() calls.
> 
> 
> hmm, it has two ref, one for _context_ and the other for *process* (or maybe mm for
> your next version), right? For each bind it will take a ref of context as present
> design. but why also process ref need be taken for each bind? I mean it seems does
> not break _user can call bind for two devices in the same domain_.
> 
> And if you really want to take a ref of *process* for echo bind, you should put it when
> unbind, right? I just not find where you put the ref of process when unbind. But just put
> the process ref when free context.
> 
> Maybe I just miss something.

No you're right I misunderstood, sorry about that. Each context has a
single ref to a process, so we do need to drop the process ref here as you
pointed out.

I thought I exercised this path though, I'll update my test suite. Also
attach_locked shouldn't take a context ref if attach fails...

Thanks a lot,
Jean
--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Yisheng Xie Dec. 6, 2017, 6:51 a.m. UTC | #46

Hi Jean,

On 2017/10/6 21:31, Jean-Philippe Brucker wrote:
> If the SMMU supports it and the kernel was built with HTTU support, enable
> +	if (IS_ENABLED(CONFIG_ARM64_HW_AFDBM) && (reg & (IDR0_HA | IDR0_HD))) {
> +		smmu->features |= ARM_SMMU_FEAT_HA;
> +		if (reg & IDR0_HD)
> +			smmu->features |= ARM_SMMU_FEAT_HD;
> +	}

What is relationship of armv8.1 HW_AFDBM and SMMUv3 HTTU? I mean why we need
IS_ENABLED(CONFIG_ARM64_HW_AFDBM) ?

If CONFIG_ARM64_HW_AFDBM=y but the process do not support ARMv8.1, should it also
enable related feature for SMMUv3?

Thanks
Yisheng Xie
> +
>  	/*
>  	 * If the CPU is using VHE, but the SMMU doesn't support it, the SMMU
>  	 * will create TLB entries for NH-EL1 world and will miss the
> 

--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Jean-Philippe Brucker Dec. 6, 2017, 11:06 a.m. UTC | #47

On 06/12/17 06:51, Yisheng Xie wrote:
> Hi Jean,
> 
> On 2017/10/6 21:31, Jean-Philippe Brucker wrote:
>> If the SMMU supports it and the kernel was built with HTTU support, enable
>> +	if (IS_ENABLED(CONFIG_ARM64_HW_AFDBM) && (reg & (IDR0_HA | IDR0_HD))) {
>> +		smmu->features |= ARM_SMMU_FEAT_HA;
>> +		if (reg & IDR0_HD)
>> +			smmu->features |= ARM_SMMU_FEAT_HD;
>> +	}
> 
> What is relationship of armv8.1 HW_AFDBM and SMMUv3 HTTU? I mean why we need
> IS_ENABLED(CONFIG_ARM64_HW_AFDBM) ?

I think the reason we needed this was that, without CONFIG_ARM64_HW_AFDBM,
the CPU wouldn't update the pte atomically and pte_dirty() wouldn't check
the DBM bit 51 (only the SW dirty bit 55).

Since af29678fe785 ("arm64: Remove the !CONFIG_ARM64_HW_AFDBM alternative
code paths") removed lots of #ifdefs, I'll see if we can remove the above
IS_ENABLED as well.

> If CONFIG_ARM64_HW_AFDBM=y but the process do not support ARMv8.1, should it also
> enable related feature for SMMUv3?

Yes I think we can enable HTTU in the SMMU even if the CPU doesn't support
it (though HTTU is only useful when sharing process address spaces). The
mm code checks for both HW and SW bits even when the CPU doesn't support
ARMv8.1

Thanks,
Jean
--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Sinan Kaya Jan. 19, 2018, 4:52 a.m. UTC | #48

Hi Jean-Philippe,

On 10/6/2017 9:31 AM, Jean-Philippe Brucker wrote:
>  /**
> + * iommu_process_bind_device - Bind a process address space to a device
> + * @dev: the device
> + * @task: the process to bind
> + * @pasid: valid address where the PASID will be stored
> + * @flags: bond properties (IOMMU_PROCESS_BIND_*)
> + *
> + * Create a bond between device and task, allowing the device to access the
> + * process address space using the returned PASID.
> + *
> + * On success, 0 is returned and @pasid contains a valid ID. Otherwise, an error
> + * is returned.
> + */
> +int iommu_process_bind_device(struct device *dev, struct task_struct *task,
> +			      int *pasid, int flags)

This API doesn't play nice with endpoint device drivers that have PASID limitations.

The AMD driver seems to have PASID limitations per product that are not being
advertised in the PCI capability.

device_iommu_pasid_init()
{
        pasid_limit = min_t(unsigned int,
                        (unsigned int)(1 << kfd->device_info->max_pasid_bits),
                        iommu_info.max_pasids);
        /*
         * last pasid is used for kernel queues doorbells
         * in the future the last pasid might be used for a kernel thread.
         */
        pasid_limit = min_t(unsigned int,
                                pasid_limit,
                                kfd->doorbell_process_limit - 1);
}

kfd->device_info->max_pasid_bits seems to contain per device limitations.

Would you be willing to extend the API so that the requester can impose some limit
on the PASID value that is getting allocated.

Sinan

Jean-Philippe Brucker Jan. 19, 2018, 10:27 a.m. UTC | #49

Hi Sinan,

On 19/01/18 04:52, Sinan Kaya wrote:
> Hi Jean-Philippe,
> 
> On 10/6/2017 9:31 AM, Jean-Philippe Brucker wrote:
>>  /**
>> + * iommu_process_bind_device - Bind a process address space to a device
>> + * @dev: the device
>> + * @task: the process to bind
>> + * @pasid: valid address where the PASID will be stored
>> + * @flags: bond properties (IOMMU_PROCESS_BIND_*)
>> + *
>> + * Create a bond between device and task, allowing the device to access the
>> + * process address space using the returned PASID.
>> + *
>> + * On success, 0 is returned and @pasid contains a valid ID. Otherwise, an error
>> + * is returned.
>> + */
>> +int iommu_process_bind_device(struct device *dev, struct task_struct *task,
>> +			      int *pasid, int flags)
> 
> This API doesn't play nice with endpoint device drivers that have PASID limitations.
> 
> The AMD driver seems to have PASID limitations per product that are not being
> advertised in the PCI capability.
> 
> device_iommu_pasid_init()
> {
>         pasid_limit = min_t(unsigned int,
>                         (unsigned int)(1 << kfd->device_info->max_pasid_bits),
>                         iommu_info.max_pasids);
>         /*
>          * last pasid is used for kernel queues doorbells
>          * in the future the last pasid might be used for a kernel thread.
>          */
>         pasid_limit = min_t(unsigned int,
>                                 pasid_limit,
>                                 kfd->doorbell_process_limit - 1);
> }
> 
> kfd->device_info->max_pasid_bits seems to contain per device limitations.
> 
> Would you be willing to extend the API so that the requester can impose some limit
> on the PASID value that is getting allocated.

Good point. Following the feedback for this series, next version adds
another public function:

int iommu_sva_device_init(struct device *dev, int features);

that has to be called by the device driver before any bind(). The intent
is to let some IOMMU drivers initialize PASID tables and other features
lazily, only if the device driver actually intends to use them. Maybe I
could change this function to:

int iommu_sva_device_init(struct device *dev, int features, unsigned int
max_pasid);

@features is a bitmask telling what the device driver needs (PASID and/or
page faults). If features has IOMMU_SVA_FEAT_PASID set, then device driver
can set a max_pasid limit, that we'd store in our private device-iommu
data. If max_pasid is 0, then we'd use the PCI limit.

Thanks,
Jean
--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Sinan Kaya Jan. 19, 2018, 1:07 p.m. UTC | #50

On 2018-01-19 05:27, Jean-Philippe Brucker wrote:
> Hi Sinan,
> 
> On 19/01/18 04:52, Sinan Kaya wrote:
>> Hi Jean-Philippe,
>> 
>> On 10/6/2017 9:31 AM, Jean-Philippe Brucker wrote:
>>>  /**
>>> + * iommu_process_bind_device - Bind a process address space to a 
>>> device
>>> + * @dev: the device
>>> + * @task: the process to bind
>>> + * @pasid: valid address where the PASID will be stored
>>> + * @flags: bond properties (IOMMU_PROCESS_BIND_*)
>>> + *
>>> + * Create a bond between device and task, allowing the device to 
>>> access the
>>> + * process address space using the returned PASID.
>>> + *
>>> + * On success, 0 is returned and @pasid contains a valid ID. 
>>> Otherwise, an error
>>> + * is returned.
>>> + */
>>> +int iommu_process_bind_device(struct device *dev, struct task_struct 
>>> *task,
>>> +			      int *pasid, int flags)
>> 
>> This API doesn't play nice with endpoint device drivers that have 
>> PASID limitations.
>> 
>> The AMD driver seems to have PASID limitations per product that are 
>> not being
>> advertised in the PCI capability.
>> 
>> device_iommu_pasid_init()
>> {
>>         pasid_limit = min_t(unsigned int,
>>                         (unsigned int)(1 << 
>> kfd->device_info->max_pasid_bits),
>>                         iommu_info.max_pasids);
>>         /*
>>          * last pasid is used for kernel queues doorbells
>>          * in the future the last pasid might be used for a kernel 
>> thread.
>>          */
>>         pasid_limit = min_t(unsigned int,
>>                                 pasid_limit,
>>                                 kfd->doorbell_process_limit - 1);
>> }
>> 
>> kfd->device_info->max_pasid_bits seems to contain per device 
>> limitations.
>> 
>> Would you be willing to extend the API so that the requester can 
>> impose some limit
>> on the PASID value that is getting allocated.
> 
> Good point. Following the feedback for this series, next version adds
> another public function:
> 
> int iommu_sva_device_init(struct device *dev, int features);
> 
> that has to be called by the device driver before any bind(). The 
> intent
> is to let some IOMMU drivers initialize PASID tables and other features
> lazily, only if the device driver actually intends to use them. Maybe I
> could change this function to:
> 
> int iommu_sva_device_init(struct device *dev, int features, unsigned 
> int
> max_pasid);
> 
> @features is a bitmask telling what the device driver needs (PASID 
> and/or
> page faults). If features has IOMMU_SVA_FEAT_PASID set, then device 
> driver
> can set a max_pasid limit, that we'd store in our private device-iommu
> data. If max_pasid is 0, then we'd use the PCI limit.

Yes, this should work.

> 
> Thanks,
> Jean
--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[RFCv2,00/36] Process management for IOMMU + SVM for SMMUv3

Message

Comments