Message ID | 1575269124-17885-2-git-send-email-linuxram@us.ibm.com (mailing list archive)
---|---
State | Superseded
Series | Enable IOMMU support for pseries Secure VMs

Context | Check | Description
---|---|---
snowpatch_ozlabs/apply_patch | success | Successfully applied on branch powerpc/merge (2e6c4d7e1c5990fa2ccca6db0868a05640ac1df1)
snowpatch_ozlabs/checkpatch | success | total: 0 errors, 0 warnings, 0 checks, 47 lines checked
snowpatch_ozlabs/needsstable | success | Patch has no Fixes tags
On 02/12/2019 17:45, Ram Pai wrote:
> H_PUT_TCE_INDIRECT hcall uses a page filled with TCE entries, as one of
> its parameters. One page is dedicated per cpu, for the lifetime of the
> kernel for this purpose. On secure VMs, contents of this page, when
> accessed by the hypervisor, retrieves encrypted TCE entries. Hypervisor
> needs to know the unencrypted entries, to update the TCE table
> accordingly. There is nothing secret or sensitive about these entries.
> Hence share the page with the hypervisor.

This unsecures a page in the guest at a random place, which creates an
additional attack surface; it is hard to exploit indeed, but nevertheless
it is there. A safer option would be not to use the hcall-multi-tce
hypertas option (which translates to FW_FEATURE_MULTITCE in the guest).

Also, what is this for anyway? If I understand things right, you cannot
map any random guest memory; you should only be mapping that 64MB-ish
bounce buffer array, but 1) I do not see that happening (I may have
missed it) and 2) it should be done once, and it takes little time for
whatever memory size we allow for bounce buffers anyway. Thanks,

> Signed-off-by: Ram Pai <linuxram@us.ibm.com>
> ---
>  arch/powerpc/platforms/pseries/iommu.c | 23 ++++++++++++++++++++---
>  1 file changed, 20 insertions(+), 3 deletions(-)
>
> diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
> index 6ba081d..0720831 100644
> --- a/arch/powerpc/platforms/pseries/iommu.c
> +++ b/arch/powerpc/platforms/pseries/iommu.c
> @@ -37,6 +37,7 @@
>  #include <asm/mmzone.h>
>  #include <asm/plpar_wrappers.h>
>  #include <asm/svm.h>
> +#include <asm/ultravisor.h>
>
>  #include "pseries.h"
>
> @@ -179,6 +180,23 @@ static int tce_build_pSeriesLP(struct iommu_table *tbl, long tcenum,
>
>  static DEFINE_PER_CPU(__be64 *, tce_page);
>
> +/*
> + * Allocate a tce page. If secure VM, share the page with the hypervisor.
> + *
> + * NOTE: the TCE page is shared with the hypervisor explicitly and remains
> + * shared for the lifetime of the kernel. It is implicitly unshared at kernel
> + * shutdown through a UV_UNSHARE_ALL_PAGES ucall.
> + */
> +static __be64 *alloc_tce_page(void)
> +{
> +	__be64 *tcep = (__be64 *)__get_free_page(GFP_ATOMIC);
> +
> +	if (tcep && is_secure_guest())
> +		uv_share_page(PHYS_PFN(__pa(tcep)), 1);
> +
> +	return tcep;
> +}
> +
>  static int tce_buildmulti_pSeriesLP(struct iommu_table *tbl, long tcenum,
>  				    long npages, unsigned long uaddr,
>  				    enum dma_data_direction direction,
> @@ -206,8 +224,7 @@ static int tce_buildmulti_pSeriesLP(struct iommu_table *tbl, long tcenum,
>  	 * from iommu_alloc{,_sg}()
>  	 */
>  	if (!tcep) {
> -		tcep = (__be64 *)__get_free_page(GFP_ATOMIC);
> -		/* If allocation fails, fall back to the loop implementation */
> +		tcep = alloc_tce_page();
>  		if (!tcep) {
>  			local_irq_restore(flags);
>  			return tce_build_pSeriesLP(tbl, tcenum, npages, uaddr,
> @@ -405,7 +422,7 @@ static int tce_setrange_multi_pSeriesLP(unsigned long start_pfn,
>  	tcep = __this_cpu_read(tce_page);
>
>  	if (!tcep) {
> -		tcep = (__be64 *)__get_free_page(GFP_ATOMIC);
> +		tcep = alloc_tce_page();
>  		if (!tcep) {
>  			local_irq_enable();
>  			return -ENOMEM;
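[Editor's note: a minimal userspace sketch of the alloc_tce_page() control flow in the patch above. __get_free_page(), is_secure_guest() and uv_share_page() are kernel/ultravisor interfaces; here they are replaced with hypothetical stubs so the share-on-allocate logic can be exercised outside the kernel.]

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define PAGE_SIZE 4096UL

static int secure_guest = 1;      /* pretend we run as a secure VM */
static uint64_t last_shared_pfn;  /* records what the stub "shared" */

static int is_secure_guest(void) { return secure_guest; }

/* Stub for the UV_SHARE_PAGE ucall: just remember the PFN it was given. */
static void uv_share_page(uint64_t pfn, uint64_t npages)
{
    last_shared_pfn = pfn;
    printf("shared pfn 0x%llx (%llu page(s))\n",
           (unsigned long long)pfn, (unsigned long long)npages);
}

/* Mirrors the patch: allocate one page for staging TCEs and, on a secure
 * VM, explicitly share it with the hypervisor before first use. */
static uint64_t *alloc_tce_page(void)
{
    uint64_t *tcep = aligned_alloc(PAGE_SIZE, PAGE_SIZE);

    if (tcep && is_secure_guest())
        uv_share_page((uintptr_t)tcep / PAGE_SIZE, 1);

    return tcep;
}
```

The point of the structure is that sharing happens exactly once per page, at allocation time, so every later H_PUT_TCE_INDIRECT call finds the staging page already readable by the hypervisor.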
On Tue, Dec 03, 2019 at 11:56:43AM +1100, Alexey Kardashevskiy wrote:
> On 02/12/2019 17:45, Ram Pai wrote:
> > H_PUT_TCE_INDIRECT hcall uses a page filled with TCE entries, as one of
> > its parameters. One page is dedicated per cpu, for the lifetime of the
> > kernel for this purpose. On secure VMs, contents of this page, when
> > accessed by the hypervisor, retrieves encrypted TCE entries. Hypervisor
> > needs to know the unencrypted entries, to update the TCE table
> > accordingly. There is nothing secret or sensitive about these entries.
> > Hence share the page with the hypervisor.
>
> This unsecures a page in the guest in a random place which creates an
> additional attack surface which is hard to exploit indeed but
> nevertheless it is there.
> A safer option would be not to use the
> hcall-multi-tce hyperrtas option (which translates FW_FEATURE_MULTITCE
> in the guest).

Hmm... how do we not use it? AFAICT the hcall-multi-tce option gets invoked
automatically when the IOMMU option is enabled. This happens even on a
normal VM when IOMMU is enabled.

> Also what is this for anyway?

This is for sending indirect-TCE entries to the hypervisor. The hypervisor
must be able to read those TCE entries, so that it can use those entries to
populate the TCE table with the correct mappings.

> if I understand things right, you cannot
> map any random guest memory, you should only be mapping that 64MB-ish
> bounce buffer array but 1) I do not see that happening (I may have
> missed it) 2) it should be done once and it takes a little time for
> whatever memory size we allow for bounce buffers anyway. Thanks,

Any random guest memory can be shared by the guest. Maybe you are confusing
this with the SWIOTLB bounce buffers used by PCI devices to transfer data
to the hypervisor?
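[Editor's note: a sketch of the batching Ram describes, loosely following tce_buildmulti_pSeriesLP(): the guest fills the (shared) per-cpu page with TCEs and hands its address to H_PUT_TCE_INDIRECT, at most one page of entries per hcall. plpar_tce_put_indirect() is a hypothetical stub standing in for the real hcall, and hv_table stands in for the hypervisor-side TCE table.]

```c
#include <assert.h>
#include <stdint.h>

#define TCE_PAGE_ENTRIES 512        /* 4K page / 8-byte TCE */

static uint64_t hv_table[4096];     /* pretend hypervisor-side TCE table */

/* Stub hcall: the hypervisor reads npages TCEs from the shared page. */
static long plpar_tce_put_indirect(const uint64_t *tce_page, long tcenum,
                                   long npages)
{
    for (long i = 0; i < npages; i++)
        hv_table[tcenum + i] = tce_page[i];
    return 0;
}

/* Map npages starting at tcenum, batching up to one page of TCEs per hcall. */
static long build_tces(uint64_t *tce_page, long tcenum, long npages,
                       uint64_t rpn)
{
    while (npages > 0) {
        long limit = npages < TCE_PAGE_ENTRIES ? npages : TCE_PAGE_ENTRIES;

        for (long i = 0; i < limit; i++)
            tce_page[i] = (rpn++ << 12) | 0x3;  /* RPN + read/write bits */

        if (plpar_tce_put_indirect(tce_page, tcenum, limit))
            return -1;
        tcenum += limit;
        npages -= limit;
    }
    return 0;
}
```

This is why the staging page must be readable by the hypervisor on a secure VM: the hcall's payload lives in that page, not in registers.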
On 03/12/2019 13:08, Ram Pai wrote:
> On Tue, Dec 03, 2019 at 11:56:43AM +1100, Alexey Kardashevskiy wrote:
>> On 02/12/2019 17:45, Ram Pai wrote:
..snip..
>> This unsecures a page in the guest in a random place which creates an
>> additional attack surface which is hard to exploit indeed but
>> nevertheless it is there.
>> A safer option would be not to use the
>> hcall-multi-tce hyperrtas option (which translates FW_FEATURE_MULTITCE
>> in the guest).
>
> Hmm... How do we not use it? AFAICT hcall-multi-tce option gets invoked
> automatically when IOMMU option is enabled.

It is advertised by QEMU but the guest does not have to use it.

> This happens even
> on a normal VM when IOMMU is enabled.
>
>> Also what is this for anyway?
>
> This is for sending indirect-TCE entries to the hypervisor.
> The hypervisor must be able to read those TCE entries, so that it can
> use those entires to populate the TCE table with the correct mappings.
>
>> if I understand things right, you cannot
>> map any random guest memory, you should only be mapping that 64MB-ish
>> bounce buffer array but 1) I do not see that happening (I may have
>> missed it) 2) it should be done once and it takes a little time for
>> whatever memory size we allow for bounce buffers anyway. Thanks,
>
> Any random guest memory can be shared by the guest.

Yes, but we do not want this to be this random. I thought the whole idea
of swiotlb was to restrict the amount of shared memory to the bare minimum;
what do I miss?

> Maybe you are confusing this with the SWIOTLB bounce buffers used by PCI
> devices, to transfer data to the hypervisor?

Is this not for pci+swiotlb? The cover letter suggests it is for
virtio-scsi-_pci_ with iommu_platform=on, which makes it a normal pci
device, just like emulated XHCI. Thanks,
On Tue, Dec 03, 2019 at 01:15:04PM +1100, Alexey Kardashevskiy wrote:
> On 03/12/2019 13:08, Ram Pai wrote:
..snip..
>> Hmm... How do we not use it? AFAICT hcall-multi-tce option gets invoked
>> automatically when IOMMU option is enabled.
>
> It is advertised by QEMU but the guest does not have to use it.

Are you suggesting that even a normal guest not use hcall-multi-tce, or
just a secure guest?

>>> if I understand things right, you cannot
>>> map any random guest memory, you should only be mapping that 64MB-ish
>>> bounce buffer array ...
>>
>> Any random guest memory can be shared by the guest.
>
> Yes but we do not want this to be this random.

It is not sharing some random page. It is sharing a page that is
ear-marked for communicating TCE entries. Yes, the address of the page can
be random, depending on where the allocator decides to allocate it, but
the purpose of the page is not random: that page is used for one specific
purpose, to communicate the TCE entries to the hypervisor.

> I thought the whole idea
> of swiotlb was to restrict the amount of shared memory to bare minimum,
> what do I miss?

I think you are making an incorrect connection between this patch and
SWIOTLB. This patch has nothing to do with SWIOTLB.

>> Maybe you are confusing this with the SWIOTLB bounce buffers used by PCI
>> devices, to transfer data to the hypervisor?
>
> Is not this for pci+swiotlb?

No. This patch is NOT for PCI+SWIOTLB. The SWIOTLB pages are a different
set of pages, allocated and earmarked for bounce buffering.

This patch is purely to help the hypervisor set up the TCE table, in the
presence of an IOMMU.

> The cover letter suggests it is for
> virtio-scsi-_pci_ with iommu_platform=on which makes it a
> normal pci device just like emulated XHCI. Thanks,

Well, I guess the cover letter is probably confusing. There are two
patches, which together enable virtio on secure guests in the presence of
an IOMMU.

The second patch enables virtio, in the presence of an IOMMU, to use the
DMA_ops+SWIOTLB infrastructure to correctly navigate the I/O to virtio
devices. However, that by itself won't work if the TCE entries are not
correctly set up in the TCE tables. The first patch, i.e. this patch,
helps accomplish that.

Hope this clears up the confusion.
RP
On 03/12/2019 15:05, Ram Pai wrote:
> On Tue, Dec 03, 2019 at 01:15:04PM +1100, Alexey Kardashevskiy wrote:
..snip..
>> It is advertised by QEMU but the guest does not have to use it.
>
> Are you suggesting that even normal-guest, not use hcall-multi-tce?
> or just secure-guest?

Just secure.

>> Yes but we do not want this to be this random.
>
> It is not sharing some random page. It is sharing a page that is
> ear-marked for communicating TCE entries. Yes the address of the page
> can be random, depending on where the allocator decides to allocate it.
> The purpose of the page is not random.

I was talking about the location.

> That page is used for one specific purpose; to communicate the TCE
> entries to the hypervisor.
>
>> I thought the whole idea
>> of swiotlb was to restrict the amount of shared memory to bare minimum,
>> what do I miss?
>
> I think, you are making a incorrect connection between this patch and
> SWIOTLB. This patch has nothing to do with SWIOTLB.

I can see this and this is the confusing part.

>>> Maybe you are confusing this with the SWIOTLB bounce buffers used by
>>> PCI devices, to transfer data to the hypervisor?
>>
>> Is not this for pci+swiotlb?
>
> No. This patch is NOT for PCI+SWIOTLB. The SWIOTLB pages are a
> different set of pages allocated and earmarked for bounce buffering.
>
> This patch is purely to help the hypervisor setup the TCE table, in the
> presence of a IOMMU.

Then the hypervisor should be able to access the guest pages mapped for
DMA, and these pages should be made unsecure for this to work. Where/when
does this happen?

>> The cover letter suggests it is for
>> virtio-scsi-_pci_ with iommu_platform=on which makes it a
>> normal pci device just like emulated XHCI. Thanks,
>
> Well, I guess, the cover letter is probably confusing. There are two
> patches, which togather enable virtio on secure guests, in the presence
> of IOMMU.
>
> The second patch enables virtio in the presence of a IOMMU, to use
> DMA_ops+SWIOTLB infrastructure, to correctly navigate the I/O to virtio
> devices.

The second patch does nothing in relation to the problem being solved.

> However that by itself wont work if the TCE entires are not correctly
> setup in the TCE tables. The first patch; i.e this patch, helps
> accomplish that.
> Hope this clears up the confusion.
On Tue, Dec 03, 2019 at 03:24:37PM +1100, Alexey Kardashevskiy wrote:
> On 03/12/2019 15:05, Ram Pai wrote:
..snip..
>> Are you suggesting that even normal-guest, not use hcall-multi-tce?
>> or just secure-guest?
>
> Just secure.

Hmm... how are the TCE entries communicated to the hypervisor, if
hcall-multi-tce is disabled?

>> This patch is purely to help the hypervisor setup the TCE table, in the
>> presence of a IOMMU.
>
> Then the hypervisor should be able to access the guest pages mapped for
> DMA and these pages should be made unsecure for this to work. Where/when
> does this happen?

This happens in the SWIOTLB code. The code to do that is already upstream.

The sharing of the pages containing the SWIOTLB bounce buffers is done in
init_svm(), which calls swiotlb_update_mem_attributes(), which calls
set_memory_decrypted(). In the case of pseries, set_memory_decrypted()
calls uv_share_page().

The code that bounces the contents of an I/O buffer through the SWIOTLB
buffers is in swiotlb_bounce().

>>> The cover letter suggests it is for
>>> virtio-scsi-_pci_ with iommu_platform=on which makes it a
>>> normal pci device just like emulated XHCI. Thanks,
>>
>> Well, I guess, the cover letter is probably confusing. There are two
>> patches, which togather enable virtio on secure guests, in the presence
>> of IOMMU.
>>
>> The second patch enables virtio in the presence of a IOMMU, to use
>> DMA_ops+SWIOTLB infrastructure, to correctly navigate the I/O to virtio
>> devices.
>
> The second patch does nothing in relation to the problem being solved.

The second patch registers dma_iommu_ops with the PCI system. Doing so
enables I/O to take the dma_iommu_ops path, which internally leads it
through the SWIOTLB path. Without that, the I/O fails to reach its
destination.

>> However that by itself wont work if the TCE entires are not correctly
>> setup in the TCE tables. The first patch; i.e this patch, helps
>> accomplish that.
>> Hope this clears up the confusion.
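[Editor's note: the call chain Ram names (init_svm() -> swiotlb_update_mem_attributes() -> set_memory_decrypted() -> uv_share_page()) can be sketched as follows. The function names follow the mail; the bodies are hypothetical stand-ins that only count how many pages get shared, to show that the whole bounce pool is shared once at boot.]

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE 4096UL

static unsigned long shared_pages;  /* counts pages "shared" with the HV */

/* Stub ucall: record that npages were made accessible to the hypervisor. */
static void uv_share_page(uint64_t pfn, uint64_t npages)
{
    (void)pfn;
    shared_pages += npages;
}

/* pseries set_memory_decrypted(): share the backing pages via ucall. */
static int set_memory_decrypted(unsigned long addr, int numpages)
{
    uv_share_page(addr / PAGE_SIZE, numpages);
    return 0;
}

/* swiotlb_update_mem_attributes(): decrypt the whole bounce pool. */
static void swiotlb_update_mem_attributes(unsigned long start,
                                          unsigned long bytes)
{
    set_memory_decrypted(start, (bytes + PAGE_SIZE - 1) / PAGE_SIZE);
}

/* init_svm(): one-time setup when booting as a secure guest. */
static void init_svm(unsigned long swiotlb_start, unsigned long swiotlb_bytes)
{
    swiotlb_update_mem_attributes(swiotlb_start, swiotlb_bytes);
}
```

Per-I/O data then only moves through this already-shared region via swiotlb_bounce(); no further sharing is needed at DMA-map time.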
On 04/12/2019 03:52, Ram Pai wrote:
> On Tue, Dec 03, 2019 at 03:24:37PM +1100, Alexey Kardashevskiy wrote:
..snip..
>>> Are you suggesting that even normal-guest, not use hcall-multi-tce?
>>> or just secure-guest?
>>
>> Just secure.
>
> hmm.. how are the TCE entries communicated to the hypervisor, if
> hcall-multi-tce is disabled?

Via H_PUT_TCE, which updates one entry at once (sets or clears).
hcall-multi-tce enables H_PUT_TCE_INDIRECT (512 entries at once) and
H_STUFF_TCE (clearing, up to 4bln at once? many); these are simply an
optimization.

>> Then the hypervisor should be able to access the guest pages mapped for
>> DMA and these pages should be made unsecure for this to work. Where/when
>> does this happen?
>
> This happens in the SWIOTLB code. The code to do that is already
> upstream.
>
> The sharing of the pages containing the SWIOTLB bounce buffers is done
> in init_svm() which calls swiotlb_update_mem_attributes() which calls
> set_memory_decrypted(). In the case of pseries, set_memory_decrypted()
> calls uv_share_page().

This does not seem enough: when you enforce iommu_platform=on, QEMU starts
accessing virtio buffers via the IOMMU, so the bounce buffers have to be
mapped explicitly, via H_PUT_TCE & co. Where does this happen?

> The code that bounces the contents of a I/O buffer through the
> SWIOTLB buffers, is in swiotlb_bounce().
>
>>> The second patch enables virtio in the presence of a IOMMU, to use
>>> DMA_ops+SWIOTLB infrastructure, to correctly navigate the I/O to virtio
>>> devices.
>>
>> The second patch does nothing in relation to the problem being solved.
>
> The second patch registers dma_iommu_ops with the PCI-system. Doing so
> enables I/O to take the dma_iommu_ops path, which internally
> leads it through the SWIOTLB path. Without that, the I/O fails to reach
> its destination.

This is not what the commit log says. What DMA ops were used before 2/2? I
thought it was NULL, which should have turned into direct, which then would
switch to swiotlb; but since the recent DMA reworks it is even harder to
tell what happens with DMA setup. Thanks,
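[Editor's note: the fallback Alexey refers to, without hcall-multi-tce, can be sketched like this: each TCE is sent with its own H_PUT_TCE hcall, one entry at a time, so no shared staging page is needed at all. plpar_tce_put() is a hypothetical stub for the hcall, and hv_table stands in for the hypervisor-side TCE table.]

```c
#include <assert.h>
#include <stdint.h>

static uint64_t hv_table[1024];  /* pretend hypervisor-side TCE table */

/* Stub hcall: the hypervisor updates exactly one entry per call. */
static long plpar_tce_put(long tcenum, uint64_t tce)
{
    hv_table[tcenum] = tce;
    return 0;
}

/* Equivalent of the tce_build_pSeriesLP() loop: one hcall per page mapped. */
static long build_tces_single(long tcenum, long npages, uint64_t rpn)
{
    for (long i = 0; i < npages; i++)
        if (plpar_tce_put(tcenum + i, ((rpn + i) << 12) | 0x3))
            return -1;
    return 0;
}
```

The trade-off under discussion: this path avoids unsecuring any guest page, at the cost of one hypervisor transition per TCE instead of one per 512.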
On Wed, Dec 04, 2019 at 11:04:04AM +1100, Alexey Kardashevskiy wrote: > > > On 04/12/2019 03:52, Ram Pai wrote: > > On Tue, Dec 03, 2019 at 03:24:37PM +1100, Alexey Kardashevskiy wrote: > >> > >> > >> On 03/12/2019 15:05, Ram Pai wrote: > >>> On Tue, Dec 03, 2019 at 01:15:04PM +1100, Alexey Kardashevskiy wrote: > >>>> > >>>> > >>>> On 03/12/2019 13:08, Ram Pai wrote: > >>>>> On Tue, Dec 03, 2019 at 11:56:43AM +1100, Alexey Kardashevskiy wrote: > >>>>>> > >>>>>> > >>>>>> On 02/12/2019 17:45, Ram Pai wrote: > >>>>>>> H_PUT_TCE_INDIRECT hcall uses a page filled with TCE entries, as one of > >>>>>>> its parameters. One page is dedicated per cpu, for the lifetime of the > >>>>>>> kernel for this purpose. On secure VMs, contents of this page, when > >>>>>>> accessed by the hypervisor, retrieves encrypted TCE entries. Hypervisor > >>>>>>> needs to know the unencrypted entries, to update the TCE table > >>>>>>> accordingly. There is nothing secret or sensitive about these entries. > >>>>>>> Hence share the page with the hypervisor. > >>>>>> > >>>>>> This unsecures a page in the guest in a random place which creates an > >>>>>> additional attack surface which is hard to exploit indeed but > >>>>>> nevertheless it is there. > >>>>>> A safer option would be not to use the > >>>>>> hcall-multi-tce hyperrtas option (which translates FW_FEATURE_MULTITCE > >>>>>> in the guest). > >>>>> > >>>>> > >>>>> Hmm... How do we not use it? AFAICT hcall-multi-tce option gets invoked > >>>>> automatically when IOMMU option is enabled. > >>>> > >>>> It is advertised by QEMU but the guest does not have to use it. > >>> > >>> Are you suggesting that even normal-guest, not use hcall-multi-tce? > >>> or just secure-guest? > >> > >> > >> Just secure. > > > > hmm.. how are the TCE entries communicated to the hypervisor, if > > hcall-multi-tce is disabled? > > Via H_PUT_TCE which updates 1 entry at once (sets or clears). 
> hcall-multi-tce enables H_PUT_TCE_INDIRECT (512 entries at once) and
> H_STUFF_TCE (clearing, up to 4bln at once? many), these are simply an
> optimization.

Do you still think, secure-VM should use H_PUT_TCE and not
H_PUT_TCE_INDIRECT? And normal VM should use H_PUT_TCE_INDIRECT?
Is there any advantage of special casing it for secure-VMs?
In fact, we could make use of as much optimization as possible.

> >>>> Is not this for pci+swiotlb?
..snip..
> >>> This patch is purely to help the hypervisor setup the TCE table, in the
> >>> presence of a IOMMU.
> >>
> >> Then the hypervisor should be able to access the guest pages mapped for
> >> DMA and these pages should be made unsecure for this to work. Where/when
> >> does this happen?
> >
> > This happens in the SWIOTLB code. The code to do that is already
> > upstream.
> >
> > The sharing of the pages containing the SWIOTLB bounce buffers is done
> > in init_svm() which calls swiotlb_update_mem_attributes() which calls
> > set_memory_decrypted(). In the case of pseries, set_memory_decrypted()
> > calls uv_share_page().
>
> This does not seem enough as when you enforce iommu_platform=on, QEMU
> starts accessing virtio buffers via IOMMU so bounce buffers have to be
> mapped explicitly, via H_PUT_TCE&co, where does this happen?

I think it happens at boot time. Every page of the guest memory is TCE
mapped, if iommu is enabled. SWIOTLB pages get implicitly TCE-mapped
as part of that operation.

RP
On 04/12/2019 11:49, Ram Pai wrote:
> On Wed, Dec 04, 2019 at 11:04:04AM +1100, Alexey Kardashevskiy wrote:
..snip..
>>> hmm.. how are the TCE entries communicated to the hypervisor, if
>>> hcall-multi-tce is disabled?
>>
>> Via H_PUT_TCE which updates 1 entry at once (sets or clears).
>> hcall-multi-tce enables H_PUT_TCE_INDIRECT (512 entries at once) and
>> H_STUFF_TCE (clearing, up to 4bln at once? many), these are simply an
>> optimization.
>
> Do you still think, secure-VM should use H_PUT_TCE and not
> H_PUT_TCE_INDIRECT? And normal VM should use H_PUT_TCE_INDIRECT?
> Is there any advantage of special casing it for secure-VMs.

Reducing the amount of insecure memory at random location.

> In fact, we could make use of as much optimization as possible.
..snip..
> I think, it happens at boot time. Every page of the guest memory is TCE
> mapped, if iommu is enabled. SWIOTLB pages get implicitly TCE-mapped
> as part of that operation.

Ah I see. This works via the huge dma window. Ok, makes sense now.

It just seems like a waste that we could map swiotlb 1:1 via the always
existing small DMA window but instead we rely on a huge window to map
these small buffers. This way we are wasting the entire 32bit window and
most of the huge window. We may fix it in the future (not right now) but
for now I would still avoid unsecuring additional memory.

Thanks,
On Wed, Dec 04, 2019 at 12:08:09PM +1100, Alexey Kardashevskiy wrote:
> On 04/12/2019 11:49, Ram Pai wrote:
..snip..
> > Do you still think, secure-VM should use H_PUT_TCE and not
> > H_PUT_TCE_INDIRECT? And normal VM should use H_PUT_TCE_INDIRECT?
> > Is there any advantage of special casing it for secure-VMs.
>
> Reducing the amount of insecure memory at random location.

The other approach we could use for that - which would still allow
H_PUT_TCE_INDIRECT, would be to allocate the TCE buffer page from the
same pool that we use for the bounce buffers.  I assume there must
already be some sort of allocator for that?

..snip..
On Sun, 2019-12-01 at 22:45 -0800, Ram Pai wrote:
> @@ -206,8 +224,7 @@ static int tce_buildmulti_pSeriesLP(struct iommu_table *tbl, long tcenum,
>  	 * from iommu_alloc{,_sg}()
>  	 */
>  	if (!tcep) {
> -		tcep = (__be64 *)__get_free_page(GFP_ATOMIC);
> -		/* If allocation fails, fall back to the loop implementation */
> +		tcep = alloc_tce_page();
>  		if (!tcep) {
>  			local_irq_restore(flags);
>  			return tce_build_pSeriesLP(tbl, tcenum, npages, uaddr,

The comment about failing allocation was removed, but I see no change of
behaviour here.

Can you please explain what/where it changes?

Best regards,
Leonardo
On Wed, Dec 04, 2019 at 03:26:50PM -0300, Leonardo Bras wrote:
> On Sun, 2019-12-01 at 22:45 -0800, Ram Pai wrote:
..snip..
> The comment about failing allocation was removed, but I see no change of
> behaviour here.
>
> Can you please explain what/where it changes?

You observed it right. The comment should stay put. Will have it fixed
in my next version.

Thanks,
RP
On Wed, Dec 04, 2019 at 02:36:18PM +1100, David Gibson wrote:
> On Wed, Dec 04, 2019 at 12:08:09PM +1100, Alexey Kardashevskiy wrote:
..snip..
> > > Do you still think, secure-VM should use H_PUT_TCE and not
> > > H_PUT_TCE_INDIRECT? And normal VM should use H_PUT_TCE_INDIRECT?
> > > Is there any advantage of special casing it for secure-VMs.
> >
> > Reducing the amount of insecure memory at random location.
>
> The other approach we could use for that - which would still allow
> H_PUT_TCE_INDIRECT, would be to allocate the TCE buffer page from the
> same pool that we use for the bounce buffers.  I assume there must
> already be some sort of allocator for that?

The allocator for swiotlb is buried deep in the swiotlb code. It is
not exposed to the outside-swiotlb world. Will have to do major surgery
to expose it.

I was thinking, maybe we share the page, finish the INDIRECT_TCE call,
and unshare the page. This will address Alexey's concern of having
shared pages at random location, and will also give me my performance
optimization. Alexey: ok?

RP
On 05/12/2019 07:42, Ram Pai wrote:
> On Wed, Dec 04, 2019 at 02:36:18PM +1100, David Gibson wrote:
..snip..
>> The other approach we could use for that - which would still allow
>> H_PUT_TCE_INDIRECT, would be to allocate the TCE buffer page from the
>> same pool that we use for the bounce buffers.  I assume there must
>> already be some sort of allocator for that?
>
> The allocator for swiotlb is buried deep in the swiotlb code. It is
> not exposed to the outside-swiotlb world. Will have to do major surgery
> to expose it.
>
> I was thinking, maybe we share the page, finish the INDIRECT_TCE call,
> and unshare the page. This will address Alexey's concern of having
> shared pages at random location, and will also give me my performance
> optimization. Alexey: ok?

I really do not see the point. I really think we should do 1:1 mapping
of swiotlb buffers using the default 32bit window using H_PUT_TCE and
this should be more than enough. I do not think the amount of code will
be dramatically different compared to unsecuring and securing a page or
using one of the swiotlb pages for this purpose.

Thanks,
On Thu, Dec 05, 2019 at 09:26:14AM +1100, Alexey Kardashevskiy wrote:
> On 05/12/2019 07:42, Ram Pai wrote:
..snip..
> I really do not see the point. I really think we should do 1:1 mapping
> of swiotlb buffers using the default 32bit window using H_PUT_TCE and
> this should be more than enough, I do not think the amount of code will
> be dramatically different compared to unsecuring and securing a page or
> using one of swiotlb pages for this purpose. Thanks,

Ok. I will address your major concern -- "do not create new shared pages
at random location" -- in my next version of the patch.

Using the 32bit DMA window just to map the SWIOTLB buffers will be some
effort. Hope we can stage it that way.

RP
On Wed, Dec 04, 2019 at 12:42:32PM -0800, Ram Pai wrote:
> > The other approach we could use for that - which would still allow
> > H_PUT_TCE_INDIRECT, would be to allocate the TCE buffer page from the
> > same pool that we use for the bounce buffers.  I assume there must
> > already be some sort of allocator for that?
>
> The allocator for swiotlb is buried deep in the swiotlb code. It is
> not exposed to the outside-swiotlb world. Will have to do major surgery
> to expose it.

I don't think it would require all that many changes, but I'd really
hate the layering of calling into it directly.  Do we have a struct
device associated with the iommu that doesn't get iommu translations
itself?  If we do a dma_alloc_coherent on that you'll get the memory
pool for free.
On Thu, Dec 05, 2019 at 09:26:14AM +1100, Alexey Kardashevskiy wrote:
> On 05/12/2019 07:42, Ram Pai wrote:
..snip..
> I really do not see the point. I really think we should do 1:1 mapping
> of swiotlb buffers using the default 32bit window using H_PUT_TCE and
> this should be more than enough, I do not think the amount of code will
> be dramatically different compared to unsecuring and securing a page or
> using one of swiotlb pages for this purpose. Thanks,

Ok, there are three different issues to be addressed here.

(a) How to map the TCE entries in the TCE table? Should we use
    H_PUT_TCE, or H_PUT_TCE_INDIRECT?

(b) How much of the guest memory must be mapped in the TCE table?
    Should it be the entire guest memory, or just the memory used by
    the SWIOTLB?

(c) What mapping window must be used? Is it the 64bit ddw, or the
    default 32bit ddw?

Regardless of how we resolve issues (b) and (c), we still need to
resolve (a).
The main concern you have about issue (a) is that random pages are
permanently shared, something that can be exploited and can cause
security issues. I tend to agree; this is possible, though I am not
sure how. But yes, we need to address this concern, since security is
paramount to Secure Virtual Machines.

The ways to resolve (a) are --

  (i) grab a page from the SWIOTLB pool and use H_PUT_TCE_INDIRECT, OR
 (ii) simply use H_PUT_TCE, OR
(iii) share the page prior to H_PUT_TCE_INDIRECT, and unshare the page
      once done.

Solution (i) has a layering violation, as Christoph alluded to in his
previous reply. The swiotlb buffers are meant for I/O and DMA related
activity. We would be abusing these swiotlb pages to communicate TCE
entries to the hypervisor. Secondly, the IOMMU code has no idea where
its pages are sourced from, and should not know either. I am
uncomfortable going this route.

There is some upstream discussion about having a separate pool of
shared pages on a secure VM, https://lkml.org/lkml/2019/11/14/381.
That solution, when ready, may be exploitable here.

I have coded solution (ii) and it works. But the boot path slows down
significantly -- a huge number of H_PUT_TCE hcalls. Very hurtful.

I strongly think solution (iii) is the right way to go. I have coded
it, it works, and the boot path is much faster. However I am not sure
if you have a concern with this solution. In any case, I am sending my
next version of the patch based on solution (iii).

Once this is addressed, I will address (b) and (c).

RP
diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 6ba081d..0720831 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -37,6 +37,7 @@
 #include <asm/mmzone.h>
 #include <asm/plpar_wrappers.h>
 #include <asm/svm.h>
+#include <asm/ultravisor.h>
 
 #include "pseries.h"
 
@@ -179,6 +180,23 @@ static int tce_build_pSeriesLP(struct iommu_table *tbl, long tcenum,
 
 static DEFINE_PER_CPU(__be64 *, tce_page);
 
+/*
+ * Allocate a tce page. If secure VM, share the page with the hypervisor.
+ *
+ * NOTE: the TCE page is shared with the hypervisor explicitly and remains
+ * shared for the lifetime of the kernel. It is implicitly unshared at kernel
+ * shutdown through a UV_UNSHARE_ALL_PAGES ucall.
+ */
+static __be64 *alloc_tce_page(void)
+{
+	__be64 *tcep = (__be64 *)__get_free_page(GFP_ATOMIC);
+
+	if (tcep && is_secure_guest())
+		uv_share_page(PHYS_PFN(__pa(tcep)), 1);
+
+	return tcep;
+}
+
 static int tce_buildmulti_pSeriesLP(struct iommu_table *tbl, long tcenum,
 				    long npages, unsigned long uaddr,
 				    enum dma_data_direction direction,
@@ -206,8 +224,7 @@ static int tce_buildmulti_pSeriesLP(struct iommu_table *tbl, long tcenum,
 	 * from iommu_alloc{,_sg}()
 	 */
 	if (!tcep) {
-		tcep = (__be64 *)__get_free_page(GFP_ATOMIC);
-		/* If allocation fails, fall back to the loop implementation */
+		tcep = alloc_tce_page();
 		if (!tcep) {
 			local_irq_restore(flags);
 			return tce_build_pSeriesLP(tbl, tcenum, npages, uaddr,
@@ -405,7 +422,7 @@ static int tce_setrange_multi_pSeriesLP(unsigned long start_pfn,
 
 	tcep = __this_cpu_read(tce_page);
 	if (!tcep) {
-		tcep = (__be64 *)__get_free_page(GFP_ATOMIC);
+		tcep = alloc_tce_page();
 		if (!tcep) {
 			local_irq_enable();
 			return -ENOMEM;
H_PUT_TCE_INDIRECT hcall uses a page filled with TCE entries, as one of
its parameters. One page is dedicated per cpu, for the lifetime of the
kernel, for this purpose. On secure VMs, contents of this page, when
accessed by the hypervisor, retrieve encrypted TCE entries. The
hypervisor needs to know the unencrypted entries, to update the TCE
table accordingly. There is nothing secret or sensitive about these
entries. Hence share the page with the hypervisor.

Signed-off-by: Ram Pai <linuxram@us.ibm.com>
---
 arch/powerpc/platforms/pseries/iommu.c | 23 ++++++++++++++++++++---
 1 file changed, 20 insertions(+), 3 deletions(-)