Message ID | 1457322077-26640-9-git-send-email-aik@ozlabs.ru |
---|---|
State | Changes Requested |
Headers | show |
On Mon, Mar 07, 2016 at 02:41:16PM +1100, Alexey Kardashevskiy wrote: > This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT > and H_STUFF_TCE requests targeted an IOMMU TCE table used for VFIO > without passing them to user space which saves time on switching > to user space and back. > > Both real and virtual modes are supported. The kernel tries to > handle a TCE request in the real mode, if fails it passes the request > to the virtual mode to complete the operation. If it a virtual mode > handler fails, the request is passed to user space; this is not expected > to happen ever though. Well... not expect to happen with a qemu which uses this. Presumably it will fall back to userspace routinely if you have an old qemu that doesn't add the liobn mappings. > The first user of this is VFIO on POWER. Trampolines to the VFIO external > user API functions are required for this patch. I'm not sure what you mean by "trampoline" here. > This uses a VFIO KVM device to associate a logical bus number (LIOBN) > with an VFIO IOMMU group fd and enable in-kernel handling of map/unmap > requests. Group fd? Or container fd? The group fd wouldn't make a lot of sense. > To make use of the feature, the user space has to create a guest view > of the TCE table via KVM_CAP_SPAPR_TCE/KVM_CAP_SPAPR_TCE_64 and > then associate a LIOBN with this table via VFIO KVM device, > a KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN property (which is added in > the next patch). > > Tests show that this patch increases transmission speed from 220MB/s > to 750..1020MB/s on 10Gb network (Chelsea CXGB3 10Gb ethernet card). Is that with or without DDW (i.e. with or without a 64-bit DMA window)? > Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> > --- > arch/powerpc/kvm/book3s_64_vio.c | 184 +++++++++++++++++++++++++++++++++++ > arch/powerpc/kvm/book3s_64_vio_hv.c | 186 ++++++++++++++++++++++++++++++++++++ > 2 files changed, 370 insertions(+) > > diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c > index 7965fc7..9417d12 100644 > --- a/arch/powerpc/kvm/book3s_64_vio.c > +++ b/arch/powerpc/kvm/book3s_64_vio.c > @@ -33,6 +33,7 @@ > #include <asm/kvm_ppc.h> > #include <asm/kvm_book3s.h> > #include <asm/mmu-hash64.h> > +#include <asm/mmu_context.h> > #include <asm/hvcall.h> > #include <asm/synch.h> > #include <asm/ppc-opcode.h> > @@ -317,11 +318,161 @@ fail: > return ret; > } > > +static long kvmppc_tce_iommu_mapped_dec(struct iommu_table *tbl, > + unsigned long entry) > +{ > + struct mm_iommu_table_group_mem_t *mem = NULL; > + const unsigned long pgsize = 1ULL << tbl->it_page_shift; > + unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry); > + > + if (!pua) > + return H_HARDWARE; > + > + mem = mm_iommu_lookup(*pua, pgsize); > + if (!mem) > + return H_HARDWARE; > + > + mm_iommu_mapped_dec(mem); > + > + *pua = 0; > + > + return H_SUCCESS; > +} > + > +static long kvmppc_tce_iommu_unmap(struct iommu_table *tbl, > + unsigned long entry) > +{ > + enum dma_data_direction dir = DMA_NONE; > + unsigned long hpa = 0; > + > + if (iommu_tce_xchg(tbl, entry, &hpa, &dir)) > + return H_HARDWARE; > + > + if (dir == DMA_NONE) > + return H_SUCCESS; > + > + return kvmppc_tce_iommu_mapped_dec(tbl, entry); > +} > + > +long kvmppc_tce_iommu_map(struct kvm *kvm, struct iommu_table *tbl, > + unsigned long entry, unsigned long gpa, > + enum dma_data_direction dir) > +{ > + long ret; > + unsigned long hpa, ua, *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry); > + struct mm_iommu_table_group_mem_t *mem; > + > + if (!pua) > + return H_HARDWARE; H_HARDWARE? Or H_PARAMETER? This essentially means the guest has supplied a bad physical address, doesn't it? > + if (kvmppc_gpa_to_ua(kvm, gpa, &ua, NULL)) > + return H_HARDWARE; > + > + mem = mm_iommu_lookup(ua, 1ULL << tbl->it_page_shift); > + if (!mem) > + return H_HARDWARE; > + > + if (mm_iommu_ua_to_hpa(mem, ua, &hpa)) > + return H_HARDWARE; > + > + if (mm_iommu_mapped_inc(mem)) > + return H_HARDWARE; > + > + ret = iommu_tce_xchg(tbl, entry, &hpa, &dir); > + if (ret) { > + mm_iommu_mapped_dec(mem); > + return H_TOO_HARD; > + } > + > + if (dir != DMA_NONE) > + kvmppc_tce_iommu_mapped_dec(tbl, entry); > + > + *pua = ua; IIUC this means you have a copy of the UA for every group attached to the TCE table, but they'll all be the same. Any way to avoid that duplication? > + return 0; > +} > + > +long kvmppc_h_put_tce_iommu(struct kvm_vcpu *vcpu, > + struct iommu_table *tbl, > + unsigned long liobn, unsigned long ioba, > + unsigned long tce) > +{ > + long idx, ret = H_HARDWARE; > + const unsigned long entry = ioba >> tbl->it_page_shift; > + const unsigned long gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE); > + const enum dma_data_direction dir = iommu_tce_direction(tce); > + > + /* Clear TCE */ > + if (dir == DMA_NONE) { > + if (iommu_tce_clear_param_check(tbl, ioba, 0, 1)) > + return H_PARAMETER; > + > + return kvmppc_tce_iommu_unmap(tbl, entry); > + } > + > + /* Put TCE */ > + if (iommu_tce_put_param_check(tbl, ioba, tce)) > + return H_PARAMETER; > + > + idx = srcu_read_lock(&vcpu->kvm->srcu); > + ret = kvmppc_tce_iommu_map(vcpu->kvm, tbl, entry, gpa, dir); > + srcu_read_unlock(&vcpu->kvm->srcu, idx); > + > + return ret; > +} > + > +static long kvmppc_h_put_tce_indirect_iommu(struct kvm_vcpu *vcpu, > + struct iommu_table *tbl, unsigned long ioba, > + u64 __user *tces, unsigned long npages) > +{ > + unsigned long i, ret, tce, gpa; > + const unsigned long entry = ioba >> tbl->it_page_shift; > + > + for (i = 0; i < npages; ++i) { > + gpa = be64_to_cpu(tces[i]) & ~(TCE_PCI_READ | TCE_PCI_WRITE); > + > + if (iommu_tce_put_param_check(tbl, ioba + > + (i << tbl->it_page_shift), gpa)) > + return H_PARAMETER; > + } > + > + for (i = 0; i < npages; ++i) { > + tce = be64_to_cpu(tces[i]); tces is a user address, which means it should only be dereferenced via get_user() or copy_from_user() helpers. > + gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE); > + > + ret = kvmppc_tce_iommu_map(vcpu->kvm, tbl, entry + i, gpa, > + iommu_tce_direction(tce)); > + if (ret != H_SUCCESS) > + return ret; > + } > + > + return H_SUCCESS; > +} > + > +long kvmppc_h_stuff_tce_iommu(struct kvm_vcpu *vcpu, > + struct iommu_table *tbl, > + unsigned long liobn, unsigned long ioba, > + unsigned long tce_value, unsigned long npages) > +{ > + unsigned long i; > + const unsigned long entry = ioba >> tbl->it_page_shift; > + > + if (iommu_tce_clear_param_check(tbl, ioba, tce_value, npages)) > + return H_PARAMETER; > + > + for (i = 0; i < npages; ++i) > + kvmppc_tce_iommu_unmap(tbl, entry + i); > + > + return H_SUCCESS; > +} > + > long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn, > unsigned long ioba, unsigned long tce) > { > struct kvmppc_spapr_tce_table *stt = kvmppc_find_table(vcpu, liobn); > long ret; > + struct kvmppc_spapr_tce_group *kg; > + struct iommu_table *tbltmp = NULL; > > /* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */ > /* liobn, ioba, tce); */ > @@ -337,6 +488,15 @@ long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn, > if (ret != H_SUCCESS) > return ret; > > + list_for_each_entry_lockless(kg, &stt->groups, next) { > + if (kg->tbl == tbltmp) > + continue; > + tbltmp = kg->tbl; > + ret = kvmppc_h_put_tce_iommu(vcpu, kg->tbl, liobn, ioba, tce); > + if (ret != H_SUCCESS) > + return ret; > + } > + > kvmppc_tce_put(stt, ioba >> stt->page_shift, tce); > > return H_SUCCESS; > @@ -351,6 +511,8 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu, > long i, ret = H_SUCCESS, idx; > unsigned long entry, ua = 0; > u64 __user *tces, tce; > + struct kvmppc_spapr_tce_group *kg; > + struct iommu_table *tbltmp = NULL; > > stt = kvmppc_find_table(vcpu, liobn); > if (!stt) > @@ -378,6 +540,16 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu, > } > tces = (u64 __user *) ua; > > + list_for_each_entry_lockless(kg, &stt->groups, next) { > + if (kg->tbl == tbltmp) > + continue; > + tbltmp = kg->tbl; > + ret = kvmppc_h_put_tce_indirect_iommu(vcpu, > + kg->tbl, ioba, tces, npages); > + if (ret != H_SUCCESS) > + goto unlock_exit; > + } > + > for (i = 0; i < npages; ++i) { > if (get_user(tce, tces + i)) { > ret = H_TOO_HARD; > @@ -405,6 +577,8 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu, > { > struct kvmppc_spapr_tce_table *stt; > long i, ret; > + struct kvmppc_spapr_tce_group *kg; > + struct iommu_table *tbltmp = NULL; > > stt = kvmppc_find_table(vcpu, liobn); > if (!stt) > @@ -418,6 +592,16 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu, > if (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ)) > return H_PARAMETER; > > + list_for_each_entry_lockless(kg, &stt->groups, next) { > + if (kg->tbl == tbltmp) > + continue; > + tbltmp = kg->tbl; > + ret = kvmppc_h_stuff_tce_iommu(vcpu, kg->tbl, liobn, ioba, > + tce_value, npages); > + if (ret != H_SUCCESS) > + return ret; > + } > + > for (i = 0; i < npages; ++i, ioba += (1ULL << stt->page_shift)) > kvmppc_tce_put(stt, ioba >> stt->page_shift, tce_value); > > diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c > index 11163ae..6567d6c 100644 > --- a/arch/powerpc/kvm/book3s_64_vio_hv.c > +++ b/arch/powerpc/kvm/book3s_64_vio_hv.c > @@ -26,6 +26,7 @@ > #include <linux/slab.h> > #include <linux/hugetlb.h> > #include <linux/list.h> > +#include <linux/iommu.h> > > #include <asm/tlbflush.h> > #include <asm/kvm_ppc.h> > @@ -212,11 +213,162 @@ static struct mm_iommu_table_group_mem_t *kvmppc_rm_iommu_lookup( > return mm_iommu_lookup_rm(mm, ua, size); > } > > +static long kvmppc_rm_tce_iommu_mapped_dec(struct kvm_vcpu *vcpu, > + struct iommu_table *tbl, unsigned long entry) > +{ > + struct mm_iommu_table_group_mem_t *mem = NULL; > + const unsigned long pgsize = 1ULL << tbl->it_page_shift; > + unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry); > + > + if (!pua) > + return H_SUCCESS; > + > + pua = (void *) vmalloc_to_phys(pua); > + if (!pua) > + return H_SUCCESS; > + > + mem = kvmppc_rm_iommu_lookup(vcpu, *pua, pgsize); > + if (!mem) > + return H_HARDWARE; > + > + mm_iommu_mapped_dec(mem); > + > + *pua = 0; > + > + return H_SUCCESS; > +} > + > +static long kvmppc_rm_tce_iommu_unmap(struct kvm_vcpu *vcpu, > + struct iommu_table *tbl, unsigned long entry) > +{ > + enum dma_data_direction dir = DMA_NONE; > + unsigned long hpa = 0; > + > + if (iommu_tce_xchg_rm(tbl, entry, &hpa, &dir)) > + return H_HARDWARE; > + > + if (dir == DMA_NONE) > + return H_SUCCESS; > + > + return kvmppc_rm_tce_iommu_mapped_dec(vcpu, tbl, entry); > +} > + > +long kvmppc_rm_tce_iommu_map(struct kvm_vcpu *vcpu, struct iommu_table *tbl, > + unsigned long entry, unsigned long gpa, > + enum dma_data_direction dir) > +{ > + long ret; > + unsigned long hpa = 0, ua; > + unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry); > + struct mm_iommu_table_group_mem_t *mem; > + > + if (kvmppc_gpa_to_ua(vcpu->kvm, gpa, &ua, NULL)) > + return H_HARDWARE; > + mem = kvmppc_rm_iommu_lookup(vcpu, ua, 1ULL << tbl->it_page_shift); > + if (!mem) > + return H_HARDWARE; > + > + if (mm_iommu_rm_ua_to_hpa(mem, ua, &hpa)) > + return H_HARDWARE; > + > + pua = (void *) vmalloc_to_phys(pua); > + if (!pua) > + return H_HARDWARE; > + > + if (mm_iommu_mapped_inc(mem)) > + return H_HARDWARE; > + > + ret = iommu_tce_xchg_rm(tbl, entry, &hpa, &dir); > + if (ret) { > + mm_iommu_mapped_dec(mem); > + return H_TOO_HARD; > + } > + > + if (dir != DMA_NONE) > + kvmppc_rm_tce_iommu_mapped_dec(vcpu, tbl, entry); > + > + *pua = ua; > + > + return 0; > +} > +EXPORT_SYMBOL_GPL(kvmppc_rm_tce_iommu_map); > + > +static long kvmppc_rm_h_put_tce_iommu(struct kvm_vcpu *vcpu, > + struct iommu_table *tbl, unsigned long liobn, > + unsigned long ioba, unsigned long tce) > +{ > + const unsigned long entry = ioba >> tbl->it_page_shift; > + const unsigned long gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE); > + const enum dma_data_direction dir = iommu_tce_direction(tce); > + > + /* Clear TCE */ > + if (dir == DMA_NONE) { > + if (iommu_tce_clear_param_check(tbl, ioba, 0, 1)) > + return H_PARAMETER; > + > + return kvmppc_rm_tce_iommu_unmap(vcpu, tbl, entry); > + } > + > + /* Put TCE */ > + if (iommu_tce_put_param_check(tbl, ioba, gpa)) > + return H_PARAMETER; > + > + return kvmppc_rm_tce_iommu_map(vcpu, tbl, entry, gpa, dir); > +} > + > +static long kvmppc_rm_h_put_tce_indirect_iommu(struct kvm_vcpu *vcpu, > + struct iommu_table *tbl, unsigned long ioba, > + u64 *tces, unsigned long npages) > +{ > + unsigned long i, ret, tce, gpa; > + const unsigned long entry = ioba >> tbl->it_page_shift; > + > + for (i = 0; i < npages; ++i) { > + gpa = be64_to_cpu(tces[i]) & ~(TCE_PCI_READ | TCE_PCI_WRITE); > + > + if (iommu_tce_put_param_check(tbl, ioba + > + (i << tbl->it_page_shift), gpa)) > + return H_PARAMETER; > + } > + > + for (i = 0; i < npages; ++i) { > + tce = be64_to_cpu(tces[i]); > + gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE); > + > + ret = kvmppc_rm_tce_iommu_map(vcpu, tbl, entry + i, gpa, > + iommu_tce_direction(tce)); > + if (ret != H_SUCCESS) > + return ret; > + } > + > + return H_SUCCESS; > +} > + > +static long kvmppc_rm_h_stuff_tce_iommu(struct kvm_vcpu *vcpu, > + struct iommu_table *tbl, > + unsigned long liobn, unsigned long ioba, > + unsigned long tce_value, unsigned long npages) > +{ > + unsigned long i; > + const unsigned long entry = ioba >> tbl->it_page_shift; > + > + if (iommu_tce_clear_param_check(tbl, ioba, tce_value, npages)) > + return H_PARAMETER; > + > + for (i = 0; i < npages; ++i) > + kvmppc_rm_tce_iommu_unmap(vcpu, tbl, entry + i); > + > + return H_SUCCESS; > +} > + > long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn, > unsigned long ioba, unsigned long tce) > { > struct kvmppc_spapr_tce_table *stt = kvmppc_find_table(vcpu, liobn); > long ret; > + struct kvmppc_spapr_tce_group *kg; > + struct iommu_table *tbltmp = NULL; > > /* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */ > /* liobn, ioba, tce); */ > @@ -232,6 +384,16 @@ long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn, > if (ret != H_SUCCESS) > return ret; > > + list_for_each_entry_lockless(kg, &stt->groups, next) { > + if (kg->tbl == tbltmp) > + continue; > + tbltmp = kg->tbl; > + ret = kvmppc_rm_h_put_tce_iommu(vcpu, kg->tbl, > + liobn, ioba, tce); > + if (ret != H_SUCCESS) > + return ret; > + } > + > kvmppc_tce_put(stt, ioba >> stt->page_shift, tce); > > return H_SUCCESS; > @@ -272,6 +434,7 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu, > long i, ret = H_SUCCESS; > unsigned long tces, entry, ua = 0; > unsigned long *rmap = NULL; > + struct iommu_table *tbltmp = NULL; > > stt = kvmppc_find_table(vcpu, liobn); > if (!stt) > @@ -299,6 +462,7 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu, > * depend on hpt. > */ > struct mm_iommu_table_group_mem_t *mem; > + struct kvmppc_spapr_tce_group *kg; > > if (kvmppc_gpa_to_ua(vcpu->kvm, tce_list, &ua, NULL)) > return H_TOO_HARD; > @@ -306,6 +470,16 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu, > mem = kvmppc_rm_iommu_lookup(vcpu, ua, IOMMU_PAGE_SIZE_4K); > if (!mem || mm_iommu_rm_ua_to_hpa(mem, ua, &tces)) > return H_TOO_HARD; > + > + list_for_each_entry_lockless(kg, &stt->groups, next) { > + if (kg->tbl == tbltmp) > + continue; > + tbltmp = kg->tbl; > + ret = kvmppc_rm_h_put_tce_indirect_iommu(vcpu, > + kg->tbl, ioba, (u64 *)tces, npages); > + if (ret != H_SUCCESS) > + return ret; > + } > } else { > /* > * This is emulated devices case. > @@ -355,6 +529,8 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu, > { > struct kvmppc_spapr_tce_table *stt; > long i, ret; > + struct kvmppc_spapr_tce_group *kg; > + struct iommu_table *tbltmp = NULL; > > stt = kvmppc_find_table(vcpu, liobn); > if (!stt) > @@ -368,6 +544,16 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu, > if (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ)) > return H_PARAMETER; > > + list_for_each_entry_lockless(kg, &stt->groups, next) { > + if (kg->tbl == tbltmp) > + continue; > + tbltmp = kg->tbl; > + ret = kvmppc_rm_h_stuff_tce_iommu(vcpu, kg->tbl, > + liobn, ioba, tce_value, npages); > + if (ret != H_SUCCESS) > + return ret; > + } > + > for (i = 0; i < npages; ++i, ioba += (1ULL << stt->page_shift)) > kvmppc_tce_put(stt, ioba >> stt->page_shift, tce_value); >
On 03/08/2016 10:08 PM, David Gibson wrote: > On Mon, Mar 07, 2016 at 02:41:16PM +1100, Alexey Kardashevskiy wrote: >> This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT >> and H_STUFF_TCE requests targeted an IOMMU TCE table used for VFIO >> without passing them to user space which saves time on switching >> to user space and back. >> >> Both real and virtual modes are supported. The kernel tries to >> handle a TCE request in the real mode, if fails it passes the request >> to the virtual mode to complete the operation. If it a virtual mode >> handler fails, the request is passed to user space; this is not expected >> to happen ever though. > > Well... not expect to happen with a qemu which uses this. Presumably > it will fall back to userspace routinely if you have an old qemu that > doesn't add the liobn mappings. Ah. Ok, thanks, I'll add this to the commit log. >> The first user of this is VFIO on POWER. Trampolines to the VFIO external >> user API functions are required for this patch. > > I'm not sure what you mean by "trampoline" here. For example, look at kvm_vfio_group_get_external_user. It calls symbol_get(vfio_group_get_external_user) and then calls a function via the returned pointer. Is there a better word for this? >> This uses a VFIO KVM device to associate a logical bus number (LIOBN) >> with an VFIO IOMMU group fd and enable in-kernel handling of map/unmap >> requests. > > Group fd? Or container fd? The group fd wouldn't make a lot of > sense. Group. KVM has no idea about containers. >> To make use of the feature, the user space has to create a guest view >> of the TCE table via KVM_CAP_SPAPR_TCE/KVM_CAP_SPAPR_TCE_64 and >> then associate a LIOBN with this table via VFIO KVM device, >> a KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN property (which is added in >> the next patch). >> >> Tests show that this patch increases transmission speed from 220MB/s >> to 750..1020MB/s on 10Gb network (Chelsea CXGB3 10Gb ethernet card). > > Is that with or without DDW (i.e. with or without a 64-bit DMA window)? Without DDW, I should have mentioned this. The patch is from the times when there was no DDW :( >> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> >> --- >> arch/powerpc/kvm/book3s_64_vio.c | 184 +++++++++++++++++++++++++++++++++++ >> arch/powerpc/kvm/book3s_64_vio_hv.c | 186 ++++++++++++++++++++++++++++++++++++ >> 2 files changed, 370 insertions(+) >> >> diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c >> index 7965fc7..9417d12 100644 >> --- a/arch/powerpc/kvm/book3s_64_vio.c >> +++ b/arch/powerpc/kvm/book3s_64_vio.c >> @@ -33,6 +33,7 @@ >> #include <asm/kvm_ppc.h> >> #include <asm/kvm_book3s.h> >> #include <asm/mmu-hash64.h> >> +#include <asm/mmu_context.h> >> #include <asm/hvcall.h> >> #include <asm/synch.h> >> #include <asm/ppc-opcode.h> >> @@ -317,11 +318,161 @@ fail: >> return ret; >> } >> >> +static long kvmppc_tce_iommu_mapped_dec(struct iommu_table *tbl, >> + unsigned long entry) >> +{ >> + struct mm_iommu_table_group_mem_t *mem = NULL; >> + const unsigned long pgsize = 1ULL << tbl->it_page_shift; >> + unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry); >> + >> + if (!pua) >> + return H_HARDWARE; >> + >> + mem = mm_iommu_lookup(*pua, pgsize); >> + if (!mem) >> + return H_HARDWARE; >> + >> + mm_iommu_mapped_dec(mem); >> + >> + *pua = 0; >> + >> + return H_SUCCESS; >> +} >> + >> +static long kvmppc_tce_iommu_unmap(struct iommu_table *tbl, >> + unsigned long entry) >> +{ >> + enum dma_data_direction dir = DMA_NONE; >> + unsigned long hpa = 0; >> + >> + if (iommu_tce_xchg(tbl, entry, &hpa, &dir)) >> + return H_HARDWARE; >> + >> + if (dir == DMA_NONE) >> + return H_SUCCESS; >> + >> + return kvmppc_tce_iommu_mapped_dec(tbl, entry); >> +} >> + >> +long kvmppc_tce_iommu_map(struct kvm *kvm, struct iommu_table *tbl, >> + unsigned long entry, unsigned long gpa, >> + enum dma_data_direction dir) >> +{ >> + long ret; >> + unsigned long hpa, ua, *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry); >> + struct mm_iommu_table_group_mem_t *mem; >> + >> + if (!pua) >> + return H_HARDWARE; > > H_HARDWARE? Or H_PARAMETER? This essentially means the guest has > supplied a bad physical address, doesn't it? Well, may be. I'll change. If it not H_TOO_HARD, it does not make any difference after all :) >> + if (kvmppc_gpa_to_ua(kvm, gpa, &ua, NULL)) >> + return H_HARDWARE; >> + >> + mem = mm_iommu_lookup(ua, 1ULL << tbl->it_page_shift); >> + if (!mem) >> + return H_HARDWARE; >> + >> + if (mm_iommu_ua_to_hpa(mem, ua, &hpa)) >> + return H_HARDWARE; >> + >> + if (mm_iommu_mapped_inc(mem)) >> + return H_HARDWARE; >> + >> + ret = iommu_tce_xchg(tbl, entry, &hpa, &dir); >> + if (ret) { >> + mm_iommu_mapped_dec(mem); >> + return H_TOO_HARD; >> + } >> + >> + if (dir != DMA_NONE) >> + kvmppc_tce_iommu_mapped_dec(tbl, entry); >> + >> + *pua = ua; > > IIUC this means you have a copy of the UA for every group attached to > the TCE table, but they'll all be the same. Any way to avoid that > duplication? It is for every container, not a group. On P8, I allow multiple groups to go to the same container, that means that a container has one or two iommu_table, and each iommu_table has this "ua" list but since tables are different (window size, page size, content), these "ua" arrays are also different.
On Wed, Mar 09, 2016 at 07:46:47PM +1100, Alexey Kardashevskiy wrote: > On 03/08/2016 10:08 PM, David Gibson wrote: > >On Mon, Mar 07, 2016 at 02:41:16PM +1100, Alexey Kardashevskiy wrote: > >>This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT > >>and H_STUFF_TCE requests targeted an IOMMU TCE table used for VFIO > >>without passing them to user space which saves time on switching > >>to user space and back. > >> > >>Both real and virtual modes are supported. The kernel tries to > >>handle a TCE request in the real mode, if fails it passes the request > >>to the virtual mode to complete the operation. If it a virtual mode > >>handler fails, the request is passed to user space; this is not expected > >>to happen ever though. > > > >Well... not expect to happen with a qemu which uses this. Presumably > >it will fall back to userspace routinely if you have an old qemu that > >doesn't add the liobn mappings. > > > Ah. Ok, thanks, I'll add this to the commit log. Ok. > >>The first user of this is VFIO on POWER. Trampolines to the VFIO external > >>user API functions are required for this patch. > > > >I'm not sure what you mean by "trampoline" here. > > For example, look at kvm_vfio_group_get_external_user. It calls > symbol_get(vfio_group_get_external_user) and then calls a function via the > returned pointer. > > Is there a better word for this? Uh.. probably although I don't immediately know what. "Trampoline" usually refers to code on the stack used for bouncing places, which isn't what this resembles. > >>This uses a VFIO KVM device to associate a logical bus number (LIOBN) > >>with an VFIO IOMMU group fd and enable in-kernel handling of map/unmap > >>requests. > > > >Group fd? Or container fd? The group fd wouldn't make a lot of > >sense. > > > Group. KVM has no idea about containers. That's not going to fly. Having a liobn registered against just one group in a container makes no sense at all. Conceptually, if not physically, the container shares a single set of TCE tables. If handling that means teaching KVM the concept of containers, then so be it. Btw, I'm not sure yet if extending the existing vfio kvm device to make the vfio<->kvm linkages makes sense. I think the reason some x86 machines need that is quite different from how we're using it for Power. I haven't got a clear enough picture yet to be sure either way. The other option that would seem likely to me would be a "bind VFIO container" ioctl() on the fd associated with a kernel accelerated TCE table. > >>To make use of the feature, the user space has to create a guest view > >>of the TCE table via KVM_CAP_SPAPR_TCE/KVM_CAP_SPAPR_TCE_64 and > >>then associate a LIOBN with this table via VFIO KVM device, > >>a KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN property (which is added in > >>the next patch). > >> > >>Tests show that this patch increases transmission speed from 220MB/s > >>to 750..1020MB/s on 10Gb network (Chelsea CXGB3 10Gb ethernet card). > > > >Is that with or without DDW (i.e. with or without a 64-bit DMA window)? > > > Without DDW, I should have mentioned this. The patch is from the times when > there was no DDW :( Ok. > >>Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> > >>--- > >> arch/powerpc/kvm/book3s_64_vio.c | 184 +++++++++++++++++++++++++++++++++++ > >> arch/powerpc/kvm/book3s_64_vio_hv.c | 186 ++++++++++++++++++++++++++++++++++++ > >> 2 files changed, 370 insertions(+) > >> > >>diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c > >>index 7965fc7..9417d12 100644 > >>--- a/arch/powerpc/kvm/book3s_64_vio.c > >>+++ b/arch/powerpc/kvm/book3s_64_vio.c > >>@@ -33,6 +33,7 @@ > >> #include <asm/kvm_ppc.h> > >> #include <asm/kvm_book3s.h> > >> #include <asm/mmu-hash64.h> > >>+#include <asm/mmu_context.h> > >> #include <asm/hvcall.h> > >> #include <asm/synch.h> > >> #include <asm/ppc-opcode.h> > >>@@ -317,11 +318,161 @@ fail: > >> return ret; > >> } > >> > >>+static long kvmppc_tce_iommu_mapped_dec(struct iommu_table *tbl, > >>+ unsigned long entry) > >>+{ > >>+ struct mm_iommu_table_group_mem_t *mem = NULL; > >>+ const unsigned long pgsize = 1ULL << tbl->it_page_shift; > >>+ unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry); > >>+ > >>+ if (!pua) > >>+ return H_HARDWARE; > >>+ > >>+ mem = mm_iommu_lookup(*pua, pgsize); > >>+ if (!mem) > >>+ return H_HARDWARE; > >>+ > >>+ mm_iommu_mapped_dec(mem); > >>+ > >>+ *pua = 0; > >>+ > >>+ return H_SUCCESS; > >>+} > >>+ > >>+static long kvmppc_tce_iommu_unmap(struct iommu_table *tbl, > >>+ unsigned long entry) > >>+{ > >>+ enum dma_data_direction dir = DMA_NONE; > >>+ unsigned long hpa = 0; > >>+ > >>+ if (iommu_tce_xchg(tbl, entry, &hpa, &dir)) > >>+ return H_HARDWARE; > >>+ > >>+ if (dir == DMA_NONE) > >>+ return H_SUCCESS; > >>+ > >>+ return kvmppc_tce_iommu_mapped_dec(tbl, entry); > >>+} > >>+ > >>+long kvmppc_tce_iommu_map(struct kvm *kvm, struct iommu_table *tbl, > >>+ unsigned long entry, unsigned long gpa, > >>+ enum dma_data_direction dir) > >>+{ > >>+ long ret; > >>+ unsigned long hpa, ua, *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry); > >>+ struct mm_iommu_table_group_mem_t *mem; > >>+ > >>+ if (!pua) > >>+ return H_HARDWARE; > > > >H_HARDWARE? Or H_PARAMETER? This essentially means the guest has > >supplied a bad physical address, doesn't it? > > Well, may be. I'll change. If it not H_TOO_HARD, it does not make any > difference after all :) > > > > >>+ if (kvmppc_gpa_to_ua(kvm, gpa, &ua, NULL)) > >>+ return H_HARDWARE; > >>+ > >>+ mem = mm_iommu_lookup(ua, 1ULL << tbl->it_page_shift); > >>+ if (!mem) > >>+ return H_HARDWARE; > >>+ > >>+ if (mm_iommu_ua_to_hpa(mem, ua, &hpa)) > >>+ return H_HARDWARE; > >>+ > >>+ if (mm_iommu_mapped_inc(mem)) > >>+ return H_HARDWARE; > >>+ > >>+ ret = iommu_tce_xchg(tbl, entry, &hpa, &dir); > >>+ if (ret) { > >>+ mm_iommu_mapped_dec(mem); > >>+ return H_TOO_HARD; > >>+ } > >>+ > >>+ if (dir != DMA_NONE) > >>+ kvmppc_tce_iommu_mapped_dec(tbl, entry); > >>+ > >>+ *pua = ua; > > > >IIUC this means you have a copy of the UA for every group attached to > >the TCE table, but they'll all be the same. Any way to avoid that > >duplication? > > It is for every container, not a group. On P8, I allow multiple groups to go > to the same container, that means that a container has one or two > iommu_table, and each iommu_table has this "ua" list but since tables are > different (window size, page size, content), these "ua" arrays are also > different. Erm.. but h_put_tce iterates h_put_tce_iommu through all the groups attached to the stt, and each one seems to update pua. Or is that what the if (kg->tbl == tbltmp) continue; is supposed to avoid? In which case what ensures that the stt->groups list is ordered by tbl pointer?
On 03/10/2016 04:18 PM, David Gibson wrote: > On Wed, Mar 09, 2016 at 07:46:47PM +1100, Alexey Kardashevskiy wrote: >> On 03/08/2016 10:08 PM, David Gibson wrote: >>> On Mon, Mar 07, 2016 at 02:41:16PM +1100, Alexey Kardashevskiy wrote: >>>> This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT >>>> and H_STUFF_TCE requests targeted an IOMMU TCE table used for VFIO >>>> without passing them to user space which saves time on switching >>>> to user space and back. >>>> >>>> Both real and virtual modes are supported. The kernel tries to >>>> handle a TCE request in the real mode, if fails it passes the request >>>> to the virtual mode to complete the operation. If it a virtual mode >>>> handler fails, the request is passed to user space; this is not expected >>>> to happen ever though. >>> >>> Well... not expect to happen with a qemu which uses this. Presumably >>> it will fall back to userspace routinely if you have an old qemu that >>> doesn't add the liobn mappings. >> >> >> Ah. Ok, thanks, I'll add this to the commit log. > > Ok. > >>>> The first user of this is VFIO on POWER. Trampolines to the VFIO external >>>> user API functions are required for this patch. >>> >>> I'm not sure what you mean by "trampoline" here. >> >> For example, look at kvm_vfio_group_get_external_user. It calls >> symbol_get(vfio_group_get_external_user) and then calls a function via the >> returned pointer. >> >> Is there a better word for this? > > Uh.. probably although I don't immediately know what. "Trampoline" > usually refers to code on the stack used for bouncing places, which > isn't what this resembles. "Dynamic wrapper"? >>>> This uses a VFIO KVM device to associate a logical bus number (LIOBN) >>>> with an VFIO IOMMU group fd and enable in-kernel handling of map/unmap >>>> requests. >>> >>> Group fd? Or container fd? The group fd wouldn't make a lot of >>> sense. >> >> >> Group. KVM has no idea about containers. > > That's not going to fly. Having a liobn registered against just one > group in a container makes no sense at all. Conceptually, if not > physically, the container shares a single set of TCE tables. If > handling that means teaching KVM the concept of containers, then so be > it. > > Btw, I'm not sure yet if extending the existing vfio kvm device to > make the vfio<->kvm linkages makes sense. I think the reason some x86 > machines need that is quite different from how we're using it for > Power. I haven't got a clear enough picture yet to be sure either > way. > > The other option that would seem likely to me would be a "bind VFIO > container" ioctl() on the fd associated with a kernel accelerated TCE table. Oh, I just noticed this response. I need to digest it. Looks like this is going to take other 2 years to upstream... >>>> To make use of the feature, the user space has to create a guest view >>>> of the TCE table via KVM_CAP_SPAPR_TCE/KVM_CAP_SPAPR_TCE_64 and >>>> then associate a LIOBN with this table via VFIO KVM device, >>>> a KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN property (which is added in >>>> the next patch). >>>> >>>> Tests show that this patch increases transmission speed from 220MB/s >>>> to 750..1020MB/s on 10Gb network (Chelsea CXGB3 10Gb ethernet card). >>> >>> Is that with or without DDW (i.e. with or without a 64-bit DMA window)? >> >> >> Without DDW, I should have mentioned this. The patch is from the times when >> there was no DDW :( > > Ok. > >>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> >>>> --- >>>> arch/powerpc/kvm/book3s_64_vio.c | 184 +++++++++++++++++++++++++++++++++++ >>>> arch/powerpc/kvm/book3s_64_vio_hv.c | 186 ++++++++++++++++++++++++++++++++++++ >>>> 2 files changed, 370 insertions(+) >>>> >>>> diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c >>>> index 7965fc7..9417d12 100644 >>>> --- a/arch/powerpc/kvm/book3s_64_vio.c >>>> +++ b/arch/powerpc/kvm/book3s_64_vio.c >>>> @@ -33,6 +33,7 @@ >>>> #include <asm/kvm_ppc.h> >>>> #include <asm/kvm_book3s.h> >>>> #include <asm/mmu-hash64.h> >>>> +#include <asm/mmu_context.h> >>>> #include <asm/hvcall.h> >>>> #include <asm/synch.h> >>>> #include <asm/ppc-opcode.h> >>>> @@ -317,11 +318,161 @@ fail: >>>> return ret; >>>> } >>>> >>>> +static long kvmppc_tce_iommu_mapped_dec(struct iommu_table *tbl, >>>> + unsigned long entry) >>>> +{ >>>> + struct mm_iommu_table_group_mem_t *mem = NULL; >>>> + const unsigned long pgsize = 1ULL << tbl->it_page_shift; >>>> + unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry); >>>> + >>>> + if (!pua) >>>> + return H_HARDWARE; >>>> + >>>> + mem = mm_iommu_lookup(*pua, pgsize); >>>> + if (!mem) >>>> + return H_HARDWARE; >>>> + >>>> + mm_iommu_mapped_dec(mem); >>>> + >>>> + *pua = 0; >>>> + >>>> + return H_SUCCESS; >>>> +} >>>> + >>>> +static long kvmppc_tce_iommu_unmap(struct iommu_table *tbl, >>>> + unsigned long entry) >>>> +{ >>>> + enum dma_data_direction dir = DMA_NONE; >>>> + unsigned long hpa = 0; >>>> + >>>> + if (iommu_tce_xchg(tbl, entry, &hpa, &dir)) >>>> + return H_HARDWARE; >>>> + >>>> + if (dir == DMA_NONE) >>>> + return H_SUCCESS; >>>> + >>>> + return kvmppc_tce_iommu_mapped_dec(tbl, entry); >>>> +} >>>> + >>>> +long kvmppc_tce_iommu_map(struct kvm *kvm, struct iommu_table *tbl, >>>> + unsigned long entry, unsigned long gpa, >>>> + enum dma_data_direction dir) >>>> +{ >>>> + long ret; >>>> + unsigned long hpa, ua, *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry); >>>> + struct mm_iommu_table_group_mem_t *mem; >>>> + >>>> + if (!pua) >>>> + return H_HARDWARE; >>> >>> H_HARDWARE? Or H_PARAMETER? This essentially means the guest has >>> supplied a bad physical address, doesn't it? >> >> Well, may be. I'll change. If it not H_TOO_HARD, it does not make any >> difference after all :) >> >> >> >>>> + if (kvmppc_gpa_to_ua(kvm, gpa, &ua, NULL)) >>>> + return H_HARDWARE; >>>> + >>>> + mem = mm_iommu_lookup(ua, 1ULL << tbl->it_page_shift); >>>> + if (!mem) >>>> + return H_HARDWARE; >>>> + >>>> + if (mm_iommu_ua_to_hpa(mem, ua, &hpa)) >>>> + return H_HARDWARE; >>>> + >>>> + if (mm_iommu_mapped_inc(mem)) >>>> + return H_HARDWARE; >>>> + >>>> + ret = iommu_tce_xchg(tbl, entry, &hpa, &dir); >>>> + if (ret) { >>>> + mm_iommu_mapped_dec(mem); >>>> + return H_TOO_HARD; >>>> + } >>>> + >>>> + if (dir != DMA_NONE) >>>> + kvmppc_tce_iommu_mapped_dec(tbl, entry); >>>> + >>>> + *pua = ua; >>> >>> IIUC this means you have a copy of the UA for every group attached to >>> the TCE table, but they'll all be the same. Any way to avoid that >>> duplication? >> >> It is for every container, not a group. On P8, I allow multiple groups to go >> to the same container, that means that a container has one or two >> iommu_table, and each iommu_table has this "ua" list but since tables are >> different (window size, page size, content), these "ua" arrays are also >> different. > > Erm.. but h_put_tce iterates h_put_tce_iommu through all the groups > attached to the stt, and each one seems to update pua. > > Or is that what the if (kg->tbl == tbltmp) continue; is supposed to > avoid? In which case what ensures that the stt->groups list is > ordered by tbl pointer? Nothing. In the normal case (POWER8 IODA2) all groups on the same liobn have the same iommu_table, so the first group's one gets updated, other do not but it is ok as they use the same table. In a bad case (POWER7 IODA1, multiple containers per LIOBN) the same @ua can be updated more than once. Well, not a huge loss.
On Fri, Mar 11, 2016 at 01:15:20PM +1100, Alexey Kardashevskiy wrote: > On 03/10/2016 04:18 PM, David Gibson wrote: > >On Wed, Mar 09, 2016 at 07:46:47PM +1100, Alexey Kardashevskiy wrote: > >>On 03/08/2016 10:08 PM, David Gibson wrote: > >>>On Mon, Mar 07, 2016 at 02:41:16PM +1100, Alexey Kardashevskiy wrote: > >>>>This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT > >>>>and H_STUFF_TCE requests targeted an IOMMU TCE table used for VFIO > >>>>without passing them to user space which saves time on switching > >>>>to user space and back. > >>>> > >>>>Both real and virtual modes are supported. The kernel tries to > >>>>handle a TCE request in the real mode, if fails it passes the request > >>>>to the virtual mode to complete the operation. If it a virtual mode > >>>>handler fails, the request is passed to user space; this is not expected > >>>>to happen ever though. > >>> > >>>Well... not expect to happen with a qemu which uses this. Presumably > >>>it will fall back to userspace routinely if you have an old qemu that > >>>doesn't add the liobn mappings. > >> > >> > >>Ah. Ok, thanks, I'll add this to the commit log. > > > >Ok. > > > >>>>The first user of this is VFIO on POWER. Trampolines to the VFIO external > >>>>user API functions are required for this patch. > >>> > >>>I'm not sure what you mean by "trampoline" here. > >> > >>For example, look at kvm_vfio_group_get_external_user. It calls > >>symbol_get(vfio_group_get_external_user) and then calls a function via the > >>returned pointer. > >> > >>Is there a better word for this? > > > >Uh.. probably although I don't immediately know what. "Trampoline" > >usually refers to code on the stack used for bouncing places, which > >isn't what this resembles. > > "Dynamic wrapper"? Sure, that'll do. > >>>>This uses a VFIO KVM device to associate a logical bus number (LIOBN) > >>>>with an VFIO IOMMU group fd and enable in-kernel handling of map/unmap > >>>>requests. > >>> > >>>Group fd? Or container fd? The group fd wouldn't make a lot of > >>>sense. > >> > >> > >>Group. KVM has no idea about containers. > > > >That's not going to fly. Having a liobn registered against just one > >group in a container makes no sense at all. Conceptually, if not > >physically, the container shares a single set of TCE tables. If > >handling that means teaching KVM the concept of containers, then so be > >it. > > > >Btw, I'm not sure yet if extending the existing vfio kvm device to > >make the vfio<->kvm linkages makes sense. I think the reason some x86 > >machines need that is quite different from how we're using it for > >Power. I haven't got a clear enough picture yet to be sure either > >way. > > > >The other option that would seem likely to me would be a "bind VFIO > >container" ioctl() on the fd associated with a kernel accelerated TCE table. > > > Oh, I just noticed this response. I need to digest it. Looks like this is > going to take other 2 years to upstream... > > > >>>>To make use of the feature, the user space has to create a guest view > >>>>of the TCE table via KVM_CAP_SPAPR_TCE/KVM_CAP_SPAPR_TCE_64 and > >>>>then associate a LIOBN with this table via VFIO KVM device, > >>>>a KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN property (which is added in > >>>>the next patch). > >>>> > >>>>Tests show that this patch increases transmission speed from 220MB/s > >>>>to 750..1020MB/s on 10Gb network (Chelsea CXGB3 10Gb ethernet card). > >>> > >>>Is that with or without DDW (i.e. with or without a 64-bit DMA window)? > >> > >> > >>Without DDW, I should have mentioned this. The patch is from the times when > >>there was no DDW :( > > > >Ok. > > > >>>>Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> > >>>>--- > >>>> arch/powerpc/kvm/book3s_64_vio.c | 184 +++++++++++++++++++++++++++++++++++ > >>>> arch/powerpc/kvm/book3s_64_vio_hv.c | 186 ++++++++++++++++++++++++++++++++++++ > >>>> 2 files changed, 370 insertions(+) > >>>> > >>>>diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c > >>>>index 7965fc7..9417d12 100644 > >>>>--- a/arch/powerpc/kvm/book3s_64_vio.c > >>>>+++ b/arch/powerpc/kvm/book3s_64_vio.c > >>>>@@ -33,6 +33,7 @@ > >>>> #include <asm/kvm_ppc.h> > >>>> #include <asm/kvm_book3s.h> > >>>> #include <asm/mmu-hash64.h> > >>>>+#include <asm/mmu_context.h> > >>>> #include <asm/hvcall.h> > >>>> #include <asm/synch.h> > >>>> #include <asm/ppc-opcode.h> > >>>>@@ -317,11 +318,161 @@ fail: > >>>> return ret; > >>>> } > >>>> > >>>>+static long kvmppc_tce_iommu_mapped_dec(struct iommu_table *tbl, > >>>>+ unsigned long entry) > >>>>+{ > >>>>+ struct mm_iommu_table_group_mem_t *mem = NULL; > >>>>+ const unsigned long pgsize = 1ULL << tbl->it_page_shift; > >>>>+ unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry); > >>>>+ > >>>>+ if (!pua) > >>>>+ return H_HARDWARE; > >>>>+ > >>>>+ mem = mm_iommu_lookup(*pua, pgsize); > >>>>+ if (!mem) > >>>>+ return H_HARDWARE; > >>>>+ > >>>>+ mm_iommu_mapped_dec(mem); > >>>>+ > >>>>+ *pua = 0; > >>>>+ > >>>>+ return H_SUCCESS; > >>>>+} > >>>>+ > >>>>+static long kvmppc_tce_iommu_unmap(struct iommu_table *tbl, > >>>>+ unsigned long entry) > >>>>+{ > >>>>+ enum dma_data_direction dir = DMA_NONE; > >>>>+ unsigned long hpa = 0; > >>>>+ > >>>>+ if (iommu_tce_xchg(tbl, entry, &hpa, &dir)) > >>>>+ return H_HARDWARE; > >>>>+ > >>>>+ if (dir == DMA_NONE) > >>>>+ return H_SUCCESS; > >>>>+ > >>>>+ return kvmppc_tce_iommu_mapped_dec(tbl, entry); > >>>>+} > >>>>+ > >>>>+long kvmppc_tce_iommu_map(struct kvm *kvm, struct iommu_table *tbl, > >>>>+ unsigned long entry, unsigned long gpa, > >>>>+ enum dma_data_direction dir) > >>>>+{ > >>>>+ long ret; > >>>>+ unsigned long hpa, ua, *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry); > >>>>+ struct mm_iommu_table_group_mem_t *mem; > >>>>+ > >>>>+ if (!pua) > >>>>+ return H_HARDWARE; > >>> > >>>H_HARDWARE? Or H_PARAMETER? This essentially means the guest has > >>>supplied a bad physical address, doesn't it? > >> > >>Well, may be. I'll change. If it not H_TOO_HARD, it does not make any > >>difference after all :) > >> > >> > >> > >>>>+ if (kvmppc_gpa_to_ua(kvm, gpa, &ua, NULL)) > >>>>+ return H_HARDWARE; > >>>>+ > >>>>+ mem = mm_iommu_lookup(ua, 1ULL << tbl->it_page_shift); > >>>>+ if (!mem) > >>>>+ return H_HARDWARE; > >>>>+ > >>>>+ if (mm_iommu_ua_to_hpa(mem, ua, &hpa)) > >>>>+ return H_HARDWARE; > >>>>+ > >>>>+ if (mm_iommu_mapped_inc(mem)) > >>>>+ return H_HARDWARE; > >>>>+ > >>>>+ ret = iommu_tce_xchg(tbl, entry, &hpa, &dir); > >>>>+ if (ret) { > >>>>+ mm_iommu_mapped_dec(mem); > >>>>+ return H_TOO_HARD; > >>>>+ } > >>>>+ > >>>>+ if (dir != DMA_NONE) > >>>>+ kvmppc_tce_iommu_mapped_dec(tbl, entry); > >>>>+ > >>>>+ *pua = ua; > >>> > >>>IIUC this means you have a copy of the UA for every group attached to > >>>the TCE table, but they'll all be the same. Any way to avoid that > >>>duplication? > >> > >>It is for every container, not a group. On P8, I allow multiple groups to go > >>to the same container, that means that a container has one or two > >>iommu_table, and each iommu_table has this "ua" list but since tables are > >>different (window size, page size, content), these "ua" arrays are also > >>different. > > > >Erm.. but h_put_tce iterates h_put_tce_iommu through all the groups > >attached to the stt, and each one seems to update pua. > > > >Or is that what the if (kg->tbl == tbltmp) continue; is supposed to > >avoid? In which case what ensures that the stt->groups list is > >ordered by tbl pointer? > > > Nothing. In the normal case (POWER8 IODA2) all groups on the same liobn have > the same iommu_table, so the first group's one gets updated, other do not > but it is ok as they use the same table. Right, which is another indication that group is the wrong concept to use here. > In a bad case (POWER7 IODA1, multiple containers per LIOBN) the same @ua can > be updated more than once. Well, not a huge loss. Ugh.. this really seems to be based on knowing the specific cases we have in practice, rather than writing code that's correct based on ly on the properties that the objects are defined to have. The latter approach will make for much more robust and extensible code.
diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c index 7965fc7..9417d12 100644 --- a/arch/powerpc/kvm/book3s_64_vio.c +++ b/arch/powerpc/kvm/book3s_64_vio.c @@ -33,6 +33,7 @@ #include <asm/kvm_ppc.h> #include <asm/kvm_book3s.h> #include <asm/mmu-hash64.h> +#include <asm/mmu_context.h> #include <asm/hvcall.h> #include <asm/synch.h> #include <asm/ppc-opcode.h> @@ -317,11 +318,161 @@ fail: return ret; } +static long kvmppc_tce_iommu_mapped_dec(struct iommu_table *tbl, + unsigned long entry) +{ + struct mm_iommu_table_group_mem_t *mem = NULL; + const unsigned long pgsize = 1ULL << tbl->it_page_shift; + unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry); + + if (!pua) + return H_HARDWARE; + + mem = mm_iommu_lookup(*pua, pgsize); + if (!mem) + return H_HARDWARE; + + mm_iommu_mapped_dec(mem); + + *pua = 0; + + return H_SUCCESS; +} + +static long kvmppc_tce_iommu_unmap(struct iommu_table *tbl, + unsigned long entry) +{ + enum dma_data_direction dir = DMA_NONE; + unsigned long hpa = 0; + + if (iommu_tce_xchg(tbl, entry, &hpa, &dir)) + return H_HARDWARE; + + if (dir == DMA_NONE) + return H_SUCCESS; + + return kvmppc_tce_iommu_mapped_dec(tbl, entry); +} + +long kvmppc_tce_iommu_map(struct kvm *kvm, struct iommu_table *tbl, + unsigned long entry, unsigned long gpa, + enum dma_data_direction dir) +{ + long ret; + unsigned long hpa, ua, *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry); + struct mm_iommu_table_group_mem_t *mem; + + if (!pua) + return H_HARDWARE; + + if (kvmppc_gpa_to_ua(kvm, gpa, &ua, NULL)) + return H_HARDWARE; + + mem = mm_iommu_lookup(ua, 1ULL << tbl->it_page_shift); + if (!mem) + return H_HARDWARE; + + if (mm_iommu_ua_to_hpa(mem, ua, &hpa)) + return H_HARDWARE; + + if (mm_iommu_mapped_inc(mem)) + return H_HARDWARE; + + ret = iommu_tce_xchg(tbl, entry, &hpa, &dir); + if (ret) { + mm_iommu_mapped_dec(mem); + return H_TOO_HARD; + } + + if (dir != DMA_NONE) + kvmppc_tce_iommu_mapped_dec(tbl, entry); + + *pua = ua; + + return 0; +} + +long kvmppc_h_put_tce_iommu(struct kvm_vcpu *vcpu, + struct iommu_table *tbl, + unsigned long liobn, unsigned long ioba, + unsigned long tce) +{ + long idx, ret = H_HARDWARE; + const unsigned long entry = ioba >> tbl->it_page_shift; + const unsigned long gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE); + const enum dma_data_direction dir = iommu_tce_direction(tce); + + /* Clear TCE */ + if (dir == DMA_NONE) { + if (iommu_tce_clear_param_check(tbl, ioba, 0, 1)) + return H_PARAMETER; + + return kvmppc_tce_iommu_unmap(tbl, entry); + } + + /* Put TCE */ + if (iommu_tce_put_param_check(tbl, ioba, tce)) + return H_PARAMETER; + + idx = srcu_read_lock(&vcpu->kvm->srcu); + ret = kvmppc_tce_iommu_map(vcpu->kvm, tbl, entry, gpa, dir); + srcu_read_unlock(&vcpu->kvm->srcu, idx); + + return ret; +} + +static long kvmppc_h_put_tce_indirect_iommu(struct kvm_vcpu *vcpu, + struct iommu_table *tbl, unsigned long ioba, + u64 __user *tces, unsigned long npages) +{ + unsigned long i, ret, tce, gpa; + const unsigned long entry = ioba >> tbl->it_page_shift; + + for (i = 0; i < npages; ++i) { + gpa = be64_to_cpu(tces[i]) & ~(TCE_PCI_READ | TCE_PCI_WRITE); + + if (iommu_tce_put_param_check(tbl, ioba + + (i << tbl->it_page_shift), gpa)) + return H_PARAMETER; + } + + for (i = 0; i < npages; ++i) { + tce = be64_to_cpu(tces[i]); + gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE); + + ret = kvmppc_tce_iommu_map(vcpu->kvm, tbl, entry + i, gpa, + iommu_tce_direction(tce)); + if (ret != H_SUCCESS) + return ret; + } + + return H_SUCCESS; +} + +long kvmppc_h_stuff_tce_iommu(struct kvm_vcpu *vcpu, + struct iommu_table *tbl, + unsigned long liobn, unsigned long ioba, + unsigned long tce_value, unsigned long npages) +{ + unsigned long i; + const unsigned long entry = ioba >> tbl->it_page_shift; + + if (iommu_tce_clear_param_check(tbl, ioba, tce_value, npages)) + return H_PARAMETER; + + for (i = 0; i < npages; ++i) + kvmppc_tce_iommu_unmap(tbl, entry + i); + + return H_SUCCESS; +} + long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn, unsigned long ioba, unsigned long tce) { struct kvmppc_spapr_tce_table *stt = kvmppc_find_table(vcpu, liobn); long ret; + struct kvmppc_spapr_tce_group *kg; + struct iommu_table *tbltmp = NULL; /* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */ /* liobn, ioba, tce); */ @@ -337,6 +488,15 @@ long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn, if (ret != H_SUCCESS) return ret; + list_for_each_entry_lockless(kg, &stt->groups, next) { + if (kg->tbl == tbltmp) + continue; + tbltmp = kg->tbl; + ret = kvmppc_h_put_tce_iommu(vcpu, kg->tbl, liobn, ioba, tce); + if (ret != H_SUCCESS) + return ret; + } + kvmppc_tce_put(stt, ioba >> stt->page_shift, tce); return H_SUCCESS; @@ -351,6 +511,8 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu, long i, ret = H_SUCCESS, idx; unsigned long entry, ua = 0; u64 __user *tces, tce; + struct kvmppc_spapr_tce_group *kg; + struct iommu_table *tbltmp = NULL; stt = kvmppc_find_table(vcpu, liobn); if (!stt) @@ -378,6 +540,16 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu, } tces = (u64 __user *) ua; + list_for_each_entry_lockless(kg, &stt->groups, next) { + if (kg->tbl == tbltmp) + continue; + tbltmp = kg->tbl; + ret = kvmppc_h_put_tce_indirect_iommu(vcpu, + kg->tbl, ioba, tces, npages); + if (ret != H_SUCCESS) + goto unlock_exit; + } + for (i = 0; i < npages; ++i) { if (get_user(tce, tces + i)) { ret = H_TOO_HARD; @@ -405,6 +577,8 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu, { struct kvmppc_spapr_tce_table *stt; long i, ret; + struct kvmppc_spapr_tce_group *kg; + struct iommu_table *tbltmp = NULL; stt = kvmppc_find_table(vcpu, liobn); if (!stt) @@ -418,6 +592,16 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu, if (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ)) return H_PARAMETER; + list_for_each_entry_lockless(kg, &stt->groups, next) { + if (kg->tbl == tbltmp) + continue; + tbltmp = kg->tbl; + ret = kvmppc_h_stuff_tce_iommu(vcpu, kg->tbl, liobn, ioba, + tce_value, npages); + if (ret != H_SUCCESS) + return ret; + } + for (i = 0; i < npages; ++i, ioba += (1ULL << stt->page_shift)) kvmppc_tce_put(stt, ioba >> stt->page_shift, tce_value); diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c index 11163ae..6567d6c 100644 --- a/arch/powerpc/kvm/book3s_64_vio_hv.c +++ b/arch/powerpc/kvm/book3s_64_vio_hv.c @@ -26,6 +26,7 @@ #include <linux/slab.h> #include <linux/hugetlb.h> #include <linux/list.h> +#include <linux/iommu.h> #include <asm/tlbflush.h> #include <asm/kvm_ppc.h> @@ -212,11 +213,162 @@ static struct mm_iommu_table_group_mem_t *kvmppc_rm_iommu_lookup( return mm_iommu_lookup_rm(mm, ua, size); } +static long kvmppc_rm_tce_iommu_mapped_dec(struct kvm_vcpu *vcpu, + struct iommu_table *tbl, unsigned long entry) +{ + struct mm_iommu_table_group_mem_t *mem = NULL; + const unsigned long pgsize = 1ULL << tbl->it_page_shift; + unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry); + + if (!pua) + return H_SUCCESS; + + pua = (void *) vmalloc_to_phys(pua); + if (!pua) + return H_SUCCESS; + + mem = kvmppc_rm_iommu_lookup(vcpu, *pua, pgsize); + if (!mem) + return H_HARDWARE; + + mm_iommu_mapped_dec(mem); + + *pua = 0; + + return H_SUCCESS; +} + +static long kvmppc_rm_tce_iommu_unmap(struct kvm_vcpu *vcpu, + struct iommu_table *tbl, unsigned long entry) +{ + enum dma_data_direction dir = DMA_NONE; + unsigned long hpa = 0; + + if (iommu_tce_xchg_rm(tbl, entry, &hpa, &dir)) + return H_HARDWARE; + + if (dir == DMA_NONE) + return H_SUCCESS; + + return kvmppc_rm_tce_iommu_mapped_dec(vcpu, tbl, entry); +} + +long kvmppc_rm_tce_iommu_map(struct kvm_vcpu *vcpu, struct iommu_table *tbl, + unsigned long entry, unsigned long gpa, + enum dma_data_direction dir) +{ + long ret; + unsigned long hpa = 0, ua; + unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry); + struct mm_iommu_table_group_mem_t *mem; + + if (kvmppc_gpa_to_ua(vcpu->kvm, gpa, &ua, NULL)) + return H_HARDWARE; + + mem = kvmppc_rm_iommu_lookup(vcpu, ua, 1ULL << tbl->it_page_shift); + if (!mem) + return H_HARDWARE; + + if (mm_iommu_rm_ua_to_hpa(mem, ua, &hpa)) + return H_HARDWARE; + + pua = (void *) vmalloc_to_phys(pua); + if (!pua) + return H_HARDWARE; + + if (mm_iommu_mapped_inc(mem)) + return H_HARDWARE; + + ret = iommu_tce_xchg_rm(tbl, entry, &hpa, &dir); + if (ret) { + mm_iommu_mapped_dec(mem); + return H_TOO_HARD; + } + + if (dir != DMA_NONE) + kvmppc_rm_tce_iommu_mapped_dec(vcpu, tbl, entry); + + *pua = ua; + + return 0; +} +EXPORT_SYMBOL_GPL(kvmppc_rm_tce_iommu_map); + +static long kvmppc_rm_h_put_tce_iommu(struct kvm_vcpu *vcpu, + struct iommu_table *tbl, unsigned long liobn, + unsigned long ioba, unsigned long tce) +{ + const unsigned long entry = ioba >> tbl->it_page_shift; + const unsigned long gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE); + const enum dma_data_direction dir = iommu_tce_direction(tce); + + /* Clear TCE */ + if (dir == DMA_NONE) { + if (iommu_tce_clear_param_check(tbl, ioba, 0, 1)) + return H_PARAMETER; + + return kvmppc_rm_tce_iommu_unmap(vcpu, tbl, entry); + } + + /* Put TCE */ + if (iommu_tce_put_param_check(tbl, ioba, gpa)) + return H_PARAMETER; + + return kvmppc_rm_tce_iommu_map(vcpu, tbl, entry, gpa, dir); +} + +static long kvmppc_rm_h_put_tce_indirect_iommu(struct kvm_vcpu *vcpu, + struct iommu_table *tbl, unsigned long ioba, + u64 *tces, unsigned long npages) +{ + unsigned long i, ret, tce, gpa; + const unsigned long entry = ioba >> tbl->it_page_shift; + + for (i = 0; i < npages; ++i) { + gpa = be64_to_cpu(tces[i]) & ~(TCE_PCI_READ | TCE_PCI_WRITE); + + if (iommu_tce_put_param_check(tbl, ioba + + (i << tbl->it_page_shift), gpa)) + return H_PARAMETER; + } + + for (i = 0; i < npages; ++i) { + tce = be64_to_cpu(tces[i]); + gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE); + + ret = kvmppc_rm_tce_iommu_map(vcpu, tbl, entry + i, gpa, + iommu_tce_direction(tce)); + if (ret != H_SUCCESS) + return ret; + } + + return H_SUCCESS; +} + +static long kvmppc_rm_h_stuff_tce_iommu(struct kvm_vcpu *vcpu, + struct iommu_table *tbl, + unsigned long liobn, unsigned long ioba, + unsigned long tce_value, unsigned long npages) +{ + unsigned long i; + const unsigned long entry = ioba >> tbl->it_page_shift; + + if (iommu_tce_clear_param_check(tbl, ioba, tce_value, npages)) + return H_PARAMETER; + + for (i = 0; i < npages; ++i) + kvmppc_rm_tce_iommu_unmap(vcpu, tbl, entry + i); + + return H_SUCCESS; +} + long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn, unsigned long ioba, unsigned long tce) { struct kvmppc_spapr_tce_table *stt = kvmppc_find_table(vcpu, liobn); long ret; + struct kvmppc_spapr_tce_group *kg; + struct iommu_table *tbltmp = NULL; /* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */ /* liobn, ioba, tce); */ @@ -232,6 +384,16 @@ long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn, if (ret != H_SUCCESS) return ret; + list_for_each_entry_lockless(kg, &stt->groups, next) { + if (kg->tbl == tbltmp) + continue; + tbltmp = kg->tbl; + ret = kvmppc_rm_h_put_tce_iommu(vcpu, kg->tbl, + liobn, ioba, tce); + if (ret != H_SUCCESS) + return ret; + } + kvmppc_tce_put(stt, ioba >> stt->page_shift, tce); return H_SUCCESS; @@ -272,6 +434,7 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu, long i, ret = H_SUCCESS; unsigned long tces, entry, ua = 0; unsigned long *rmap = NULL; + struct iommu_table *tbltmp = NULL; stt = kvmppc_find_table(vcpu, liobn); if (!stt) @@ -299,6 +462,7 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu, * depend on hpt. */ struct mm_iommu_table_group_mem_t *mem; + struct kvmppc_spapr_tce_group *kg; if (kvmppc_gpa_to_ua(vcpu->kvm, tce_list, &ua, NULL)) return H_TOO_HARD; @@ -306,6 +470,16 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu, mem = kvmppc_rm_iommu_lookup(vcpu, ua, IOMMU_PAGE_SIZE_4K); if (!mem || mm_iommu_rm_ua_to_hpa(mem, ua, &tces)) return H_TOO_HARD; + + list_for_each_entry_lockless(kg, &stt->groups, next) { + if (kg->tbl == tbltmp) + continue; + tbltmp = kg->tbl; + ret = kvmppc_rm_h_put_tce_indirect_iommu(vcpu, + kg->tbl, ioba, (u64 *)tces, npages); + if (ret != H_SUCCESS) + return ret; + } } else { /* * This is emulated devices case. @@ -355,6 +529,8 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu, { struct kvmppc_spapr_tce_table *stt; long i, ret; + struct kvmppc_spapr_tce_group *kg; + struct iommu_table *tbltmp = NULL; stt = kvmppc_find_table(vcpu, liobn); if (!stt) @@ -368,6 +544,16 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu, if (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ)) return H_PARAMETER; + list_for_each_entry_lockless(kg, &stt->groups, next) { + if (kg->tbl == tbltmp) + continue; + tbltmp = kg->tbl; + ret = kvmppc_rm_h_stuff_tce_iommu(vcpu, kg->tbl, + liobn, ioba, tce_value, npages); + if (ret != H_SUCCESS) + return ret; + } + for (i = 0; i < npages; ++i, ioba += (1ULL << stt->page_shift)) kvmppc_tce_put(stt, ioba >> stt->page_shift, tce_value);
This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT and H_STUFF_TCE requests targeted an IOMMU TCE table used for VFIO without passing them to user space which saves time on switching to user space and back. Both real and virtual modes are supported. The kernel tries to handle a TCE request in the real mode, if fails it passes the request to the virtual mode to complete the operation. If it a virtual mode handler fails, the request is passed to user space; this is not expected to happen ever though. The first user of this is VFIO on POWER. Trampolines to the VFIO external user API functions are required for this patch. This uses a VFIO KVM device to associate a logical bus number (LIOBN) with an VFIO IOMMU group fd and enable in-kernel handling of map/unmap requests. To make use of the feature, the user space has to create a guest view of the TCE table via KVM_CAP_SPAPR_TCE/KVM_CAP_SPAPR_TCE_64 and then associate a LIOBN with this table via VFIO KVM device, a KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN property (which is added in the next patch). Tests show that this patch increases transmission speed from 220MB/s to 750..1020MB/s on 10Gb network (Chelsea CXGB3 10Gb ethernet card). Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> --- arch/powerpc/kvm/book3s_64_vio.c | 184 +++++++++++++++++++++++++++++++++++ arch/powerpc/kvm/book3s_64_vio_hv.c | 186 ++++++++++++++++++++++++++++++++++++ 2 files changed, 370 insertions(+)