
[v4,1/2] powerpc/pseries/iommu: Share the per-cpu TCE page with the hypervisor.

Message ID 1575269124-17885-2-git-send-email-linuxram@us.ibm.com (mailing list archive)
State Superseded
Headers show
Series Enable IOMMU support for pseries Secure VMs | expand

Checks

Context Check Description
snowpatch_ozlabs/apply_patch success Successfully applied on branch powerpc/merge (2e6c4d7e1c5990fa2ccca6db0868a05640ac1df1)
snowpatch_ozlabs/checkpatch success total: 0 errors, 0 warnings, 0 checks, 47 lines checked
snowpatch_ozlabs/needsstable success Patch has no Fixes tags

Commit Message

Ram Pai Dec. 2, 2019, 6:45 a.m. UTC
The H_PUT_TCE_INDIRECT hcall takes, as one of its parameters, a page
filled with TCE entries. One page per CPU is dedicated to this purpose
for the lifetime of the kernel. On secure VMs, when the hypervisor
accesses the contents of this page it reads encrypted TCE entries, but
it needs the unencrypted entries to update the TCE table accordingly.
There is nothing secret or sensitive about these entries, so share the
page with the hypervisor.

Signed-off-by: Ram Pai <linuxram@us.ibm.com>
---
 arch/powerpc/platforms/pseries/iommu.c | 23 ++++++++++++++++++++---
 1 file changed, 20 insertions(+), 3 deletions(-)

Comments

Alexey Kardashevskiy Dec. 3, 2019, 12:56 a.m. UTC | #1
On 02/12/2019 17:45, Ram Pai wrote:
> H_PUT_TCE_INDIRECT hcall uses a page filled with TCE entries, as one of
> its parameters. One page is dedicated per cpu, for the lifetime of the
> kernel for this purpose. On secure VMs, contents of this page, when
> accessed by the hypervisor, retrieves encrypted TCE entries.  Hypervisor
> needs to know the unencrypted entries, to update the TCE table
> accordingly.  There is nothing secret or sensitive about these entries.
> Hence share the page with the hypervisor.

This unsecures a page in the guest at a random location, which creates
an additional attack surface; it is hard to exploit, indeed, but
nevertheless it is there. A safer option would be not to use the
hcall-multi-tce hypertas option (which translates to FW_FEATURE_MULTITCE
in the guest).

Also what is this for anyway? If I understand things right, you cannot
map any random guest memory, you should only be mapping that 64MB-ish
bounce buffer array but 1) I do not see that happening (I may have
missed it) 2) it should be done once and it takes a little time for
whatever memory size we allow for bounce buffers anyway. Thanks,


> 
> Signed-off-by: Ram Pai <linuxram@us.ibm.com>
> ---
>  arch/powerpc/platforms/pseries/iommu.c | 23 ++++++++++++++++++++---
>  1 file changed, 20 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
> index 6ba081d..0720831 100644
> --- a/arch/powerpc/platforms/pseries/iommu.c
> +++ b/arch/powerpc/platforms/pseries/iommu.c
> @@ -37,6 +37,7 @@
>  #include <asm/mmzone.h>
>  #include <asm/plpar_wrappers.h>
>  #include <asm/svm.h>
> +#include <asm/ultravisor.h>
>  
>  #include "pseries.h"
>  
> @@ -179,6 +180,23 @@ static int tce_build_pSeriesLP(struct iommu_table *tbl, long tcenum,
>  
>  static DEFINE_PER_CPU(__be64 *, tce_page);
>  
> +/*
> + * Allocate a tce page.  If secure VM, share the page with the hypervisor.
> + *
> + * NOTE: the TCE page is shared with the hypervisor explicitly and remains
> + * shared for the lifetime of the kernel. It is implicitly unshared at kernel
> + * shutdown through a UV_UNSHARE_ALL_PAGES ucall.
> + */
> +static __be64 *alloc_tce_page(void)
> +{
> +	__be64 *tcep = (__be64 *)__get_free_page(GFP_ATOMIC);
> +
> +	if (tcep && is_secure_guest())
> +		uv_share_page(PHYS_PFN(__pa(tcep)), 1);
> +
> +	return tcep;
> +}
> +
>  static int tce_buildmulti_pSeriesLP(struct iommu_table *tbl, long tcenum,
>  				     long npages, unsigned long uaddr,
>  				     enum dma_data_direction direction,
> @@ -206,8 +224,7 @@ static int tce_buildmulti_pSeriesLP(struct iommu_table *tbl, long tcenum,
>  	 * from iommu_alloc{,_sg}()
>  	 */
>  	if (!tcep) {
> -		tcep = (__be64 *)__get_free_page(GFP_ATOMIC);
> -		/* If allocation fails, fall back to the loop implementation */
> +		tcep = alloc_tce_page();
>  		if (!tcep) {
>  			local_irq_restore(flags);
>  			return tce_build_pSeriesLP(tbl, tcenum, npages, uaddr,
> @@ -405,7 +422,7 @@ static int tce_setrange_multi_pSeriesLP(unsigned long start_pfn,
>  	tcep = __this_cpu_read(tce_page);
>  
>  	if (!tcep) {
> -		tcep = (__be64 *)__get_free_page(GFP_ATOMIC);
> +		tcep = alloc_tce_page();
>  		if (!tcep) {
>  			local_irq_enable();
>  			return -ENOMEM;
>
Ram Pai Dec. 3, 2019, 2:08 a.m. UTC | #2
On Tue, Dec 03, 2019 at 11:56:43AM +1100, Alexey Kardashevskiy wrote:
> 
> 
> On 02/12/2019 17:45, Ram Pai wrote:
> > H_PUT_TCE_INDIRECT hcall uses a page filled with TCE entries, as one of
> > its parameters. One page is dedicated per cpu, for the lifetime of the
> > kernel for this purpose. On secure VMs, contents of this page, when
> > accessed by the hypervisor, retrieves encrypted TCE entries.  Hypervisor
> > needs to know the unencrypted entries, to update the TCE table
> > accordingly.  There is nothing secret or sensitive about these entries.
> > Hence share the page with the hypervisor.
> 
> This unsecures a page in the guest in a random place which creates an
> additional attack surface which is hard to exploit indeed but
> nevertheless it is there.
> A safer option would be not to use the
> hcall-multi-tce hyperrtas option (which translates FW_FEATURE_MULTITCE
> in the guest).


Hmm... how do we not use it?  AFAICT the hcall-multi-tce option gets used
automatically when the IOMMU option is enabled.  This happens even
on a normal VM when the IOMMU is enabled.


> 
> Also what is this for anyway? 

This is for sending indirect-TCE entries to the hypervisor.
The hypervisor must be able to read those TCE entries, so that it can
use those entries to populate the TCE table with the correct mappings.

> if I understand things right, you cannot
> map any random guest memory, you should only be mapping that 64MB-ish
> bounce buffer array but 1) I do not see that happening (I may have
> missed it) 2) it should be done once and it takes a little time for
> whatever memory size we allow for bounce buffers anyway. Thanks,

Any random guest memory can be shared by the guest.

Maybe you are confusing this with the SWIOTLB bounce buffers used by PCI
devices to transfer data to the hypervisor?

Alexey Kardashevskiy Dec. 3, 2019, 2:15 a.m. UTC | #3
On 03/12/2019 13:08, Ram Pai wrote:
> On Tue, Dec 03, 2019 at 11:56:43AM +1100, Alexey Kardashevskiy wrote:
>>
>>
>> On 02/12/2019 17:45, Ram Pai wrote:
>>> H_PUT_TCE_INDIRECT hcall uses a page filled with TCE entries, as one of
>>> its parameters. One page is dedicated per cpu, for the lifetime of the
>>> kernel for this purpose. On secure VMs, contents of this page, when
>>> accessed by the hypervisor, retrieves encrypted TCE entries.  Hypervisor
>>> needs to know the unencrypted entries, to update the TCE table
>>> accordingly.  There is nothing secret or sensitive about these entries.
>>> Hence share the page with the hypervisor.
>>
>> This unsecures a page in the guest in a random place which creates an
>> additional attack surface which is hard to exploit indeed but
>> nevertheless it is there.
>> A safer option would be not to use the
>> hcall-multi-tce hyperrtas option (which translates FW_FEATURE_MULTITCE
>> in the guest).
> 
> 
> Hmm... How do we not use it?  AFAICT hcall-multi-tce option gets invoked
> automatically when IOMMU option is enabled.

It is advertised by QEMU but the guest does not have to use it.

> This happens even
> on a normal VM when IOMMU is enabled.
> 
> 
>>
>> Also what is this for anyway? 
> 
> This is for sending indirect-TCE entries to the hypervisor.
> The hypervisor must be able to read those TCE entries, so that it can 
> use those entires to populate the TCE table with the correct mappings.
> 
>> if I understand things right, you cannot
>> map any random guest memory, you should only be mapping that 64MB-ish
>> bounce buffer array but 1) I do not see that happening (I may have
>> missed it) 2) it should be done once and it takes a little time for
>> whatever memory size we allow for bounce buffers anyway. Thanks,
> 
> Any random guest memory can be shared by the guest. 

Yes but we do not want this to be this random. I thought the whole idea
of swiotlb was to restrict the amount of shared memory to bare minimum,
what do I miss?

> Maybe you are confusing this with the SWIOTLB bounce buffers used by PCI
> devices, to transfer data to the hypervisor?

Is this not for pci+swiotlb? The cover letter suggests it is for
virtio-scsi-_pci_ with iommu_platform=on, which makes it a normal pci
device just like emulated XHCI. Thanks,




Ram Pai Dec. 3, 2019, 4:05 a.m. UTC | #4
On Tue, Dec 03, 2019 at 01:15:04PM +1100, Alexey Kardashevskiy wrote:
> 
> 
> On 03/12/2019 13:08, Ram Pai wrote:
> > On Tue, Dec 03, 2019 at 11:56:43AM +1100, Alexey Kardashevskiy wrote:
> >>
> >>
> >> On 02/12/2019 17:45, Ram Pai wrote:
> >>> H_PUT_TCE_INDIRECT hcall uses a page filled with TCE entries, as one of
> >>> its parameters. One page is dedicated per cpu, for the lifetime of the
> >>> kernel for this purpose. On secure VMs, contents of this page, when
> >>> accessed by the hypervisor, retrieves encrypted TCE entries.  Hypervisor
> >>> needs to know the unencrypted entries, to update the TCE table
> >>> accordingly.  There is nothing secret or sensitive about these entries.
> >>> Hence share the page with the hypervisor.
> >>
> >> This unsecures a page in the guest in a random place which creates an
> >> additional attack surface which is hard to exploit indeed but
> >> nevertheless it is there.
> >> A safer option would be not to use the
> >> hcall-multi-tce hyperrtas option (which translates FW_FEATURE_MULTITCE
> >> in the guest).
> > 
> > 
> > Hmm... How do we not use it?  AFAICT hcall-multi-tce option gets invoked
> > automatically when IOMMU option is enabled.
> 
> It is advertised by QEMU but the guest does not have to use it.

Are you suggesting that even a normal guest should not use
hcall-multi-tce, or just a secure guest?

> 
> > This happens even
> > on a normal VM when IOMMU is enabled.
> > 
> > 
> >>
> >> Also what is this for anyway? 
> > 
> > This is for sending indirect-TCE entries to the hypervisor.
> > The hypervisor must be able to read those TCE entries, so that it can 
> > use those entires to populate the TCE table with the correct mappings.
> > 
> >> if I understand things right, you cannot
> >> map any random guest memory, you should only be mapping that 64MB-ish
> >> bounce buffer array but 1) I do not see that happening (I may have
> >> missed it) 2) it should be done once and it takes a little time for
> >> whatever memory size we allow for bounce buffers anyway. Thanks,
> > 
> > Any random guest memory can be shared by the guest. 
> 
> Yes but we do not want this to be this random. 

It is not sharing some random page. It is sharing a page that is
earmarked for communicating TCE entries. Yes, the address of the page
can be random, depending on where the allocator decides to allocate it.
The purpose of the page is not random.

That page is used for one specific purpose: to communicate the TCE
entries to the hypervisor.

> I thought the whole idea
> of swiotlb was to restrict the amount of shared memory to bare minimum,
> what do I miss?

I think you are making an incorrect connection between this patch and
SWIOTLB.  This patch has nothing to do with SWIOTLB.

> 
> > Maybe you are confusing this with the SWIOTLB bounce buffers used by
> > PCI devices, to transfer data to the hypervisor?
> 
> Is not this for pci+swiotlb? 


No. This patch is NOT for PCI+SWIOTLB.  The SWIOTLB pages are a
different set of pages, allocated and earmarked for bounce buffering.

This patch is purely to help the hypervisor set up the TCE table, in the
presence of an IOMMU.

>The cover letter suggests it is for
> virtio-scsi-_pci_ with 	iommu_platform=on which makes it a
> normal pci device just like emulated XHCI. Thanks,

Well, I guess the cover letter is probably confusing. There are two
patches, which together enable virtio on secure guests in the presence
of an IOMMU.

The second patch enables virtio, in the presence of an IOMMU, to use the
DMA-ops+SWIOTLB infrastructure to correctly route the I/O to virtio
devices.

However, that by itself won't work if the TCE entries are not correctly
set up in the TCE tables.  The first patch, i.e. this patch, helps
accomplish that.

Hope this clears up the confusion.
RP
Alexey Kardashevskiy Dec. 3, 2019, 4:24 a.m. UTC | #5
On 03/12/2019 15:05, Ram Pai wrote:
> On Tue, Dec 03, 2019 at 01:15:04PM +1100, Alexey Kardashevskiy wrote:
>>
>>
>> On 03/12/2019 13:08, Ram Pai wrote:
>>> On Tue, Dec 03, 2019 at 11:56:43AM +1100, Alexey Kardashevskiy wrote:
>>>>
>>>>
>>>> On 02/12/2019 17:45, Ram Pai wrote:
>>>>> H_PUT_TCE_INDIRECT hcall uses a page filled with TCE entries, as one of
>>>>> its parameters. One page is dedicated per cpu, for the lifetime of the
>>>>> kernel for this purpose. On secure VMs, contents of this page, when
>>>>> accessed by the hypervisor, retrieves encrypted TCE entries.  Hypervisor
>>>>> needs to know the unencrypted entries, to update the TCE table
>>>>> accordingly.  There is nothing secret or sensitive about these entries.
>>>>> Hence share the page with the hypervisor.
>>>>
>>>> This unsecures a page in the guest in a random place which creates an
>>>> additional attack surface which is hard to exploit indeed but
>>>> nevertheless it is there.
>>>> A safer option would be not to use the
>>>> hcall-multi-tce hyperrtas option (which translates FW_FEATURE_MULTITCE
>>>> in the guest).
>>>
>>>
>>> Hmm... How do we not use it?  AFAICT hcall-multi-tce option gets invoked
>>> automatically when IOMMU option is enabled.
>>
>> It is advertised by QEMU but the guest does not have to use it.
> 
> Are you suggesting that even normal-guest, not use hcall-multi-tce?
> or just secure-guest?  


Just secure.


> 
>>
>>> This happens even
>>> on a normal VM when IOMMU is enabled.
>>>
>>>
>>>>
>>>> Also what is this for anyway? 
>>>
>>> This is for sending indirect-TCE entries to the hypervisor.
>>> The hypervisor must be able to read those TCE entries, so that it can 
>>> use those entires to populate the TCE table with the correct mappings.
>>>
>>>> if I understand things right, you cannot
>>>> map any random guest memory, you should only be mapping that 64MB-ish
>>>> bounce buffer array but 1) I do not see that happening (I may have
>>>> missed it) 2) it should be done once and it takes a little time for
>>>> whatever memory size we allow for bounce buffers anyway. Thanks,
>>>
>>> Any random guest memory can be shared by the guest. 
>>
>> Yes but we do not want this to be this random. 
> 
> It is not sharing some random page. It is sharing a page that is
> ear-marked for communicating TCE entries. Yes the address of the page
> can be random, depending on where the allocator decides to allocate it.
> The purpose of the page is not random.

I was talking about the location.


> That page is used for one specific purpose; to communicate the TCE
> entries to the hypervisor.  
> 
>> I thought the whole idea
>> of swiotlb was to restrict the amount of shared memory to bare minimum,
>> what do I miss?
> 
> I think, you are making a incorrect connection between this patch and
> SWIOTLB.  This patch has nothing to do with SWIOTLB.

I can see that, and this is the confusing part.


>>
>>> Maybe you are confusing this with the SWIOTLB bounce buffers used by
>>> PCI devices, to transfer data to the hypervisor?
>>
>> Is not this for pci+swiotlb? 
> 
> 
> No. This patch is NOT for PCI+SWIOTLB.  The SWIOTLB pages are a
> different set of pages allocated and earmarked for bounce buffering.
> 
> This patch is purely to help the hypervisor setup the TCE table, in the
> presence of a IOMMU.

Then the hypervisor should be able to access the guest pages mapped for
DMA and these pages should be made unsecure for this to work. Where/when
does this happen?


>> The cover letter suggests it is for
>> virtio-scsi-_pci_ with 	iommu_platform=on which makes it a
>> normal pci device just like emulated XHCI. Thanks,
> 
> Well, I guess, the cover letter is probably confusing. There are two
> patches, which togather enable virtio on secure guests, in the presence
> of IOMMU.
> 
> The second patch enables virtio in the presence of a IOMMU, to use
> DMA_ops+SWIOTLB infrastructure, to correctly navigate the I/O to virtio
> devices.

The second patch does nothing in relation to the problem being solved.


> However that by itself wont work if the TCE entires are not correctly
> setup in the TCE tables.  The first patch; i.e this patch, helps
> accomplish that.
>> Hope this clears up the confusion.
Ram Pai Dec. 3, 2019, 4:52 p.m. UTC | #6
On Tue, Dec 03, 2019 at 03:24:37PM +1100, Alexey Kardashevskiy wrote:
> 
> 
> On 03/12/2019 15:05, Ram Pai wrote:
> > On Tue, Dec 03, 2019 at 01:15:04PM +1100, Alexey Kardashevskiy wrote:
> >>
> >>
> >> On 03/12/2019 13:08, Ram Pai wrote:
> >>> On Tue, Dec 03, 2019 at 11:56:43AM +1100, Alexey Kardashevskiy wrote:
> >>>>
> >>>>
> >>>> On 02/12/2019 17:45, Ram Pai wrote:
> >>>>> H_PUT_TCE_INDIRECT hcall uses a page filled with TCE entries, as one of
> >>>>> its parameters. One page is dedicated per cpu, for the lifetime of the
> >>>>> kernel for this purpose. On secure VMs, contents of this page, when
> >>>>> accessed by the hypervisor, retrieves encrypted TCE entries.  Hypervisor
> >>>>> needs to know the unencrypted entries, to update the TCE table
> >>>>> accordingly.  There is nothing secret or sensitive about these entries.
> >>>>> Hence share the page with the hypervisor.
> >>>>
> >>>> This unsecures a page in the guest in a random place which creates an
> >>>> additional attack surface which is hard to exploit indeed but
> >>>> nevertheless it is there.
> >>>> A safer option would be not to use the
> >>>> hcall-multi-tce hyperrtas option (which translates FW_FEATURE_MULTITCE
> >>>> in the guest).
> >>>
> >>>
> >>> Hmm... How do we not use it?  AFAICT hcall-multi-tce option gets invoked
> >>> automatically when IOMMU option is enabled.
> >>
> >> It is advertised by QEMU but the guest does not have to use it.
> > 
> > Are you suggesting that even normal-guest, not use hcall-multi-tce?
> > or just secure-guest?  
> 
> 
> Just secure.

Hmm... how are the TCE entries communicated to the hypervisor if
hcall-multi-tce is disabled?

> 
> 
> > 
> >>
> >>> This happens even
> >>> on a normal VM when IOMMU is enabled.
> >>>
> >>>
> >>>>
> >>>> Also what is this for anyway? 
> >>>
> >>> This is for sending indirect-TCE entries to the hypervisor.
> >>> The hypervisor must be able to read those TCE entries, so that it can 
> >>> use those entires to populate the TCE table with the correct mappings.
> >>>
> >>>> if I understand things right, you cannot
> >>>> map any random guest memory, you should only be mapping that 64MB-ish
> >>>> bounce buffer array but 1) I do not see that happening (I may have
> >>>> missed it) 2) it should be done once and it takes a little time for
> >>>> whatever memory size we allow for bounce buffers anyway. Thanks,
> >>>
> >>> Any random guest memory can be shared by the guest. 
> >>
> >> Yes but we do not want this to be this random. 
> > 
> > It is not sharing some random page. It is sharing a page that is
> > ear-marked for communicating TCE entries. Yes the address of the page
> > can be random, depending on where the allocator decides to allocate it.
> > The purpose of the page is not random.
> 
> I was talking about the location.
> 
> 
> > That page is used for one specific purpose; to communicate the TCE
> > entries to the hypervisor.  
> > 
> >> I thought the whole idea
> >> of swiotlb was to restrict the amount of shared memory to bare minimum,
> >> what do I miss?
> > 
> > I think, you are making a incorrect connection between this patch and
> > SWIOTLB.  This patch has nothing to do with SWIOTLB.
> 
> I can see this and this is the confusing part.
> 
> 
> >>
> >>> Maybe you are confusing this with the SWIOTLB bounce buffers used by
> >>> PCI devices, to transfer data to the hypervisor?
> >>
> >> Is not this for pci+swiotlb? 
> > 
> > 
> > No. This patch is NOT for PCI+SWIOTLB.  The SWIOTLB pages are a
> > different set of pages allocated and earmarked for bounce buffering.
> > 
> > This patch is purely to help the hypervisor setup the TCE table, in the
> > presence of a IOMMU.
> 
> Then the hypervisor should be able to access the guest pages mapped for
> DMA and these pages should be made unsecure for this to work. Where/when
> does this happen?

This happens in the SWIOTLB code.  The code to do that is already
upstream.  

The sharing of the pages containing the SWIOTLB bounce buffers is done
in init_svm(), which calls swiotlb_update_mem_attributes(), which calls
set_memory_decrypted().  In the case of pseries, set_memory_decrypted()
calls uv_share_page().

The code that bounces the contents of an I/O buffer through the
SWIOTLB buffers is in swiotlb_bounce().

> 
> 
> >> The cover letter suggests it is for
> >> virtio-scsi-_pci_ with 	iommu_platform=on which makes it a
> >> normal pci device just like emulated XHCI. Thanks,
> > 
> > Well, I guess, the cover letter is probably confusing. There are two
> > patches, which togather enable virtio on secure guests, in the presence
> > of IOMMU.
> > 
> > The second patch enables virtio in the presence of a IOMMU, to use
> > DMA_ops+SWIOTLB infrastructure, to correctly navigate the I/O to virtio
> > devices.
> 
> The second patch does nothing in relation to the problem being solved.

The second patch registers dma_iommu_ops with the PCI subsystem.  Doing
so enables I/O to take the dma_iommu_ops path, which internally
leads it through the SWIOTLB path. Without that, the I/O fails to reach
its destination.

> 
> 
> > However that by itself wont work if the TCE entires are not correctly
> > setup in the TCE tables.  The first patch; i.e this patch, helps
> > accomplish that.
> >> Hope this clears up the confusion.
Alexey Kardashevskiy Dec. 4, 2019, 12:04 a.m. UTC | #7
On 04/12/2019 03:52, Ram Pai wrote:
> On Tue, Dec 03, 2019 at 03:24:37PM +1100, Alexey Kardashevskiy wrote:
>>
>>
>> On 03/12/2019 15:05, Ram Pai wrote:
>>> On Tue, Dec 03, 2019 at 01:15:04PM +1100, Alexey Kardashevskiy wrote:
>>>>
>>>>
>>>> On 03/12/2019 13:08, Ram Pai wrote:
>>>>> On Tue, Dec 03, 2019 at 11:56:43AM +1100, Alexey Kardashevskiy wrote:
>>>>>>
>>>>>>
>>>>>> On 02/12/2019 17:45, Ram Pai wrote:
>>>>>>> H_PUT_TCE_INDIRECT hcall uses a page filled with TCE entries, as one of
>>>>>>> its parameters. One page is dedicated per cpu, for the lifetime of the
>>>>>>> kernel for this purpose. On secure VMs, contents of this page, when
>>>>>>> accessed by the hypervisor, retrieves encrypted TCE entries.  Hypervisor
>>>>>>> needs to know the unencrypted entries, to update the TCE table
>>>>>>> accordingly.  There is nothing secret or sensitive about these entries.
>>>>>>> Hence share the page with the hypervisor.
>>>>>>
>>>>>> This unsecures a page in the guest in a random place which creates an
>>>>>> additional attack surface which is hard to exploit indeed but
>>>>>> nevertheless it is there.
>>>>>> A safer option would be not to use the
>>>>>> hcall-multi-tce hyperrtas option (which translates FW_FEATURE_MULTITCE
>>>>>> in the guest).
>>>>>
>>>>>
>>>>> Hmm... How do we not use it?  AFAICT hcall-multi-tce option gets invoked
>>>>> automatically when IOMMU option is enabled.
>>>>
>>>> It is advertised by QEMU but the guest does not have to use it.
>>>
>>> Are you suggesting that even normal-guest, not use hcall-multi-tce?
>>> or just secure-guest?  
>>
>>
>> Just secure.
> 
> hmm..  how are the TCE entries communicated to the hypervisor, if
> hcall-multi-tce is disabled?

Via H_PUT_TCE, which updates one entry at a time (sets or clears).
hcall-multi-tce enables H_PUT_TCE_INDIRECT (512 entries at once) and
H_STUFF_TCE (clearing, up to ~4 billion entries at once); these are
simply an optimization.


> 
>>
>>
>>>
>>>>
>>>>> This happens even
>>>>> on a normal VM when IOMMU is enabled.
>>>>>
>>>>>
>>>>>>
>>>>>> Also what is this for anyway? 
>>>>>
>>>>> This is for sending indirect-TCE entries to the hypervisor.
>>>>> The hypervisor must be able to read those TCE entries, so that it can 
>>>>> use those entires to populate the TCE table with the correct mappings.
>>>>>
>>>>>> if I understand things right, you cannot
>>>>>> map any random guest memory, you should only be mapping that 64MB-ish
>>>>>> bounce buffer array but 1) I do not see that happening (I may have
>>>>>> missed it) 2) it should be done once and it takes a little time for
>>>>>> whatever memory size we allow for bounce buffers anyway. Thanks,
>>>>>
>>>>> Any random guest memory can be shared by the guest. 
>>>>
>>>> Yes but we do not want this to be this random. 
>>>
>>> It is not sharing some random page. It is sharing a page that is
>>> ear-marked for communicating TCE entries. Yes the address of the page
>>> can be random, depending on where the allocator decides to allocate it.
>>> The purpose of the page is not random.
>>
>> I was talking about the location.
>>
>>
>>> That page is used for one specific purpose; to communicate the TCE
>>> entries to the hypervisor.  
>>>
>>>> I thought the whole idea
>>>> of swiotlb was to restrict the amount of shared memory to bare minimum,
>>>> what do I miss?
>>>
>>> I think, you are making a incorrect connection between this patch and
>>> SWIOTLB.  This patch has nothing to do with SWIOTLB.
>>
>> I can see this and this is the confusing part.
>>
>>
>>>>
>>>>> Maybe you are confusing this with the SWIOTLB bounce buffers used by
>>>>> PCI devices, to transfer data to the hypervisor?
>>>>
>>>> Is not this for pci+swiotlb? 
>>>
>>>
>>> No. This patch is NOT for PCI+SWIOTLB.  The SWIOTLB pages are a
>>> different set of pages allocated and earmarked for bounce buffering.
>>>
>>> This patch is purely to help the hypervisor setup the TCE table, in the
>>> presence of a IOMMU.
>>
>> Then the hypervisor should be able to access the guest pages mapped for
>> DMA and these pages should be made unsecure for this to work. Where/when
>> does this happen?
> 
> This happens in the SWIOTLB code.  The code to do that is already
> upstream.  
>
> The sharing of the pages containing the SWIOTLB bounce buffers is done
> in init_svm() which calls swiotlb_update_mem_attributes() which calls
> set_memory_decrypted().  In the case of pseries, set_memory_decrypted() calls 
> uv_share_page().


This does not seem enough: when you enforce iommu_platform=on, QEMU
starts accessing virtio buffers via the IOMMU, so the bounce buffers
have to be mapped explicitly, via H_PUT_TCE & co. Where does this happen?


> 
> The code that bounces the contents of a I/O buffer through the 
> SWIOTLB buffers, is in swiotlb_bounce().
> 
>>
>>
>>>> The cover letter suggests it is for
>>>> virtio-scsi-_pci_ with 	iommu_platform=on which makes it a
>>>> normal pci device just like emulated XHCI. Thanks,
>>>
>>> Well, I guess, the cover letter is probably confusing. There are two
>>> patches, which togather enable virtio on secure guests, in the presence
>>> of IOMMU.
>>>
>>> The second patch enables virtio in the presence of a IOMMU, to use
>>> DMA_ops+SWIOTLB infrastructure, to correctly navigate the I/O to virtio
>>> devices.
>>
>> The second patch does nothing in relation to the problem being solved.
> 
> The second patch registers dma_iommu_ops with the PCI-system.  Doing so
> enables I/O to take the dma_iommu_ops path, which internally 
> leads it through the SWIOTLB path. Without that, the I/O fails to reach
> its destination.


This is not what the commit log says. What DMA ops were used before 2/2?
I thought it was NULL, which should have turned into direct ops, which
then would switch to swiotlb; but since the recent DMA reworks it is even
harder to tell what happens with the DMA setup. Thanks,


>>> However that by itself wont work if the TCE entires are not correctly
>>> setup in the TCE tables.  The first patch; i.e this patch, helps
>>> accomplish that.
>>>> Hope this clears up the confusion.
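For readers skimming the thread, the bounce-buffering mechanism Ram describes (swiotlb_bounce() copying I/O data through the pre-shared pool) can be sketched in a few lines of userspace C. This is a toy model, not the kernel's swiotlb code; the names `bounce_map`/`bounce_unmap` and the single static pool are invented for illustration:

```c
#include <assert.h>
#include <string.h>

/* Toy model: the hypervisor may only read the pre-shared bounce pool. */
static char bounce_pool[4096];          /* stands in for the decrypted SWIOTLB region */

/* "Map for DMA": copy the (conceptually encrypted) buffer into the pool. */
static char *bounce_map(const char *buf, size_t len)
{
	assert(len <= sizeof(bounce_pool));
	memcpy(bounce_pool, buf, len);  /* models swiotlb_bounce(), TO_DEVICE */
	return bounce_pool;             /* device DMAs to/from this address */
}

/* "Unmap": copy device-written data back into the private buffer. */
static void bounce_unmap(char *buf, size_t len)
{
	memcpy(buf, bounce_pool, len);  /* models swiotlb_bounce(), FROM_DEVICE */
}
```

The point of the model is that only the pool is ever decrypted and shared; the guest's private buffer is never exposed to the hypervisor directly.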
Ram Pai Dec. 4, 2019, 12:49 a.m. UTC | #8
On Wed, Dec 04, 2019 at 11:04:04AM +1100, Alexey Kardashevskiy wrote:
> 
> 
> On 04/12/2019 03:52, Ram Pai wrote:
> > On Tue, Dec 03, 2019 at 03:24:37PM +1100, Alexey Kardashevskiy wrote:
> >>
> >>
> >> On 03/12/2019 15:05, Ram Pai wrote:
> >>> On Tue, Dec 03, 2019 at 01:15:04PM +1100, Alexey Kardashevskiy wrote:
> >>>>
> >>>>
> >>>> On 03/12/2019 13:08, Ram Pai wrote:
> >>>>> On Tue, Dec 03, 2019 at 11:56:43AM +1100, Alexey Kardashevskiy wrote:
> >>>>>>
> >>>>>>
> >>>>>> On 02/12/2019 17:45, Ram Pai wrote:
> >>>>>>> H_PUT_TCE_INDIRECT hcall uses a page filled with TCE entries, as one of
> >>>>>>> its parameters. One page is dedicated per cpu, for the lifetime of the
> >>>>>>> kernel for this purpose. On secure VMs, contents of this page, when
> >>>>>>> accessed by the hypervisor, retrieves encrypted TCE entries.  Hypervisor
> >>>>>>> needs to know the unencrypted entries, to update the TCE table
> >>>>>>> accordingly.  There is nothing secret or sensitive about these entries.
> >>>>>>> Hence share the page with the hypervisor.
> >>>>>>
> >>>>>> This unsecures a page in the guest in a random place which creates an
> >>>>>> additional attack surface which is hard to exploit indeed but
> >>>>>> nevertheless it is there.
> >>>>>> A safer option would be not to use the
> >>>>>> hcall-multi-tce hyperrtas option (which translates FW_FEATURE_MULTITCE
> >>>>>> in the guest).
> >>>>>
> >>>>>
> >>>>> Hmm... How do we not use it?  AFAICT hcall-multi-tce option gets invoked
> >>>>> automatically when IOMMU option is enabled.
> >>>>
> >>>> It is advertised by QEMU but the guest does not have to use it.
> >>>
> >>> Are you suggesting that even normal-guest, not use hcall-multi-tce?
> >>> or just secure-guest?  
> >>
> >>
> >> Just secure.
> > 
> > hmm..  how are the TCE entries communicated to the hypervisor, if
> > hcall-multi-tce is disabled?
> 
> Via H_PUT_TCE which updates 1 entry at once (sets or clears).
> hcall-multi-tce  enables H_PUT_TCE_INDIRECT (512 entries at once) and
> H_STUFF_TCE (clearing, up to 4bln at once? many), these are simply an
> optimization.

Do you still think a secure VM should use H_PUT_TCE and not
H_PUT_TCE_INDIRECT, while a normal VM should use H_PUT_TCE_INDIRECT?
Is there any advantage to special-casing secure VMs?
In fact, we could make use of as much optimization as possible.


> 
> >>>> Is not this for pci+swiotlb? 
..snip..
> >>> This patch is purely to help the hypervisor setup the TCE table, in the
> >>> presence of a IOMMU.
> >>
> >> Then the hypervisor should be able to access the guest pages mapped for
> >> DMA and these pages should be made unsecure for this to work. Where/when
> >> does this happen?
> > 
> > This happens in the SWIOTLB code.  The code to do that is already
> > upstream.  
> >
> > The sharing of the pages containing the SWIOTLB bounce buffers is done
> > in init_svm() which calls swiotlb_update_mem_attributes() which calls
> > set_memory_decrypted().  In the case of pseries, set_memory_decrypted() calls 
> > uv_share_page().
> 
> 
> This does not seem enough as when you enforce iommu_platform=on, QEMU
> starts accessing virtio buffers via IOMMU so bounce buffers have to be
> mapped explicitly, via H_PUT_TCE&co, where does this happen?
> 

I think it happens at boot time. Every page of guest memory is
TCE-mapped if the IOMMU is enabled; the SWIOTLB pages get implicitly
TCE-mapped as part of that operation.

RP
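Ram's point about boot-time mapping can be illustrated with a toy model of a TCE table that maps every guest page 1:1 through a huge DMA window. All names and sizes below are invented for illustration; the real table lives on the hypervisor side and is programmed via the H_PUT_TCE* hcalls:

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT  12
#define GUEST_PAGES 1024                     /* toy guest: 4 MiB of RAM */

static uint64_t tce_table[GUEST_PAGES];      /* one TCE per guest page */

/* Boot-time setup: map every guest page 1:1 into the DMA window. */
static void map_all_guest_memory(void)
{
	for (uint64_t pfn = 0; pfn < GUEST_PAGES; pfn++)
		tce_table[pfn] = (pfn << PAGE_SHIFT) | 0x3;  /* R/W bits */
}

/* DMA address -> guest physical address, as the IOMMU would resolve it. */
static uint64_t tce_translate(uint64_t dma_addr)
{
	uint64_t tce = tce_table[dma_addr >> PAGE_SHIFT];

	return (tce & ~0xfffULL) | (dma_addr & 0xfff);
}
```

Because every guest page gets an entry up front, the SWIOTLB pool is covered without any explicit per-buffer mapping, which is why the bounce buffers "just work" under the huge window.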
Alexey Kardashevskiy Dec. 4, 2019, 1:08 a.m. UTC | #9
On 04/12/2019 11:49, Ram Pai wrote:
> On Wed, Dec 04, 2019 at 11:04:04AM +1100, Alexey Kardashevskiy wrote:
>>
>>
>> On 04/12/2019 03:52, Ram Pai wrote:
>>> On Tue, Dec 03, 2019 at 03:24:37PM +1100, Alexey Kardashevskiy wrote:
>>>>
>>>>
>>>> On 03/12/2019 15:05, Ram Pai wrote:
>>>>> On Tue, Dec 03, 2019 at 01:15:04PM +1100, Alexey Kardashevskiy wrote:
>>>>>>
>>>>>>
>>>>>> On 03/12/2019 13:08, Ram Pai wrote:
>>>>>>> On Tue, Dec 03, 2019 at 11:56:43AM +1100, Alexey Kardashevskiy wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> On 02/12/2019 17:45, Ram Pai wrote:
>>>>>>>>> H_PUT_TCE_INDIRECT hcall uses a page filled with TCE entries, as one of
>>>>>>>>> its parameters. One page is dedicated per cpu, for the lifetime of the
>>>>>>>>> kernel for this purpose. On secure VMs, contents of this page, when
>>>>>>>>> accessed by the hypervisor, retrieves encrypted TCE entries.  Hypervisor
>>>>>>>>> needs to know the unencrypted entries, to update the TCE table
>>>>>>>>> accordingly.  There is nothing secret or sensitive about these entries.
>>>>>>>>> Hence share the page with the hypervisor.
>>>>>>>>
>>>>>>>> This unsecures a page in the guest in a random place which creates an
>>>>>>>> additional attack surface which is hard to exploit indeed but
>>>>>>>> nevertheless it is there.
>>>>>>>> A safer option would be not to use the
>>>>>>>> hcall-multi-tce hyperrtas option (which translates FW_FEATURE_MULTITCE
>>>>>>>> in the guest).
>>>>>>>
>>>>>>>
>>>>>>> Hmm... How do we not use it?  AFAICT hcall-multi-tce option gets invoked
>>>>>>> automatically when IOMMU option is enabled.
>>>>>>
>>>>>> It is advertised by QEMU but the guest does not have to use it.
>>>>>
>>>>> Are you suggesting that even normal-guest, not use hcall-multi-tce?
>>>>> or just secure-guest?  
>>>>
>>>>
>>>> Just secure.
>>>
>>> hmm..  how are the TCE entries communicated to the hypervisor, if
>>> hcall-multi-tce is disabled?
>>
>> Via H_PUT_TCE which updates 1 entry at once (sets or clears).
>> hcall-multi-tce  enables H_PUT_TCE_INDIRECT (512 entries at once) and
>> H_STUFF_TCE (clearing, up to 4bln at once? many), these are simply an
>> optimization.
> 
> Do you still think, secure-VM should use H_PUT_TCE and not
> H_PUT_TCE_INDIRECT?  And normal VM should use H_PUT_TCE_INDIRECT?
> Is there any advantage of special casing it for secure-VMs.


Reducing the amount of insecure memory at random location.


> In fact, we could make use of as much optimization as possible.
> 
> 
>>
>>>>>> Is not this for pci+swiotlb? 
> ..snip..
>>>>> This patch is purely to help the hypervisor setup the TCE table, in the
>>>>> presence of a IOMMU.
>>>>
>>>> Then the hypervisor should be able to access the guest pages mapped for
>>>> DMA and these pages should be made unsecure for this to work. Where/when
>>>> does this happen?
>>>
>>> This happens in the SWIOTLB code.  The code to do that is already
>>> upstream.  
>>>
>>> The sharing of the pages containing the SWIOTLB bounce buffers is done
>>> in init_svm() which calls swiotlb_update_mem_attributes() which calls
>>> set_memory_decrypted().  In the case of pseries, set_memory_decrypted() calls 
>>> uv_share_page().
>>
>>
>> This does not seem enough as when you enforce iommu_platform=on, QEMU
>> starts accessing virtio buffers via IOMMU so bounce buffers have to be
>> mapped explicitly, via H_PUT_TCE&co, where does this happen?
>>
> 
> I think, it happens at boot time. Every page of the guest memory is TCE
> mapped, if iommu is enabled. SWIOTLB pages get implicitly TCE-mapped
> as part of that operation.


Ah, I see. This works via the huge DMA window. Ok, makes sense now.

It just seems like a waste: we could map the swiotlb buffers 1:1 via the
always-present small DMA window, but instead we rely on a huge window to
map these small buffers. This way we are wasting the entire 32-bit window
and most of the huge window. We may fix it in the future (not right now),
but for now I would still avoid unsecuring additional memory. Thanks,
David Gibson Dec. 4, 2019, 3:36 a.m. UTC | #10
On Wed, Dec 04, 2019 at 12:08:09PM +1100, Alexey Kardashevskiy wrote:
> 
> 
> On 04/12/2019 11:49, Ram Pai wrote:
> > On Wed, Dec 04, 2019 at 11:04:04AM +1100, Alexey Kardashevskiy wrote:
> >>
> >>
> >> On 04/12/2019 03:52, Ram Pai wrote:
> >>> On Tue, Dec 03, 2019 at 03:24:37PM +1100, Alexey Kardashevskiy wrote:
> >>>>
> >>>>
> >>>> On 03/12/2019 15:05, Ram Pai wrote:
> >>>>> On Tue, Dec 03, 2019 at 01:15:04PM +1100, Alexey Kardashevskiy wrote:
> >>>>>>
> >>>>>>
> >>>>>> On 03/12/2019 13:08, Ram Pai wrote:
> >>>>>>> On Tue, Dec 03, 2019 at 11:56:43AM +1100, Alexey Kardashevskiy wrote:
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On 02/12/2019 17:45, Ram Pai wrote:
> >>>>>>>>> H_PUT_TCE_INDIRECT hcall uses a page filled with TCE entries, as one of
> >>>>>>>>> its parameters. One page is dedicated per cpu, for the lifetime of the
> >>>>>>>>> kernel for this purpose. On secure VMs, contents of this page, when
> >>>>>>>>> accessed by the hypervisor, retrieves encrypted TCE entries.  Hypervisor
> >>>>>>>>> needs to know the unencrypted entries, to update the TCE table
> >>>>>>>>> accordingly.  There is nothing secret or sensitive about these entries.
> >>>>>>>>> Hence share the page with the hypervisor.
> >>>>>>>>
> >>>>>>>> This unsecures a page in the guest in a random place which creates an
> >>>>>>>> additional attack surface which is hard to exploit indeed but
> >>>>>>>> nevertheless it is there.
> >>>>>>>> A safer option would be not to use the
> >>>>>>>> hcall-multi-tce hyperrtas option (which translates FW_FEATURE_MULTITCE
> >>>>>>>> in the guest).
> >>>>>>>
> >>>>>>>
> >>>>>>> Hmm... How do we not use it?  AFAICT hcall-multi-tce option gets invoked
> >>>>>>> automatically when IOMMU option is enabled.
> >>>>>>
> >>>>>> It is advertised by QEMU but the guest does not have to use it.
> >>>>>
> >>>>> Are you suggesting that even normal-guest, not use hcall-multi-tce?
> >>>>> or just secure-guest?  
> >>>>
> >>>>
> >>>> Just secure.
> >>>
> >>> hmm..  how are the TCE entries communicated to the hypervisor, if
> >>> hcall-multi-tce is disabled?
> >>
> >> Via H_PUT_TCE which updates 1 entry at once (sets or clears).
> >> hcall-multi-tce  enables H_PUT_TCE_INDIRECT (512 entries at once) and
> >> H_STUFF_TCE (clearing, up to 4bln at once? many), these are simply an
> >> optimization.
> > 
> > Do you still think, secure-VM should use H_PUT_TCE and not
> > H_PUT_TCE_INDIRECT?  And normal VM should use H_PUT_TCE_INDIRECT?
> > Is there any advantage of special casing it for secure-VMs.
> 
> 
> Reducing the amount of insecure memory at random location.

The other approach we could use for that - which would still allow
H_PUT_TCE_INDIRECT, would be to allocate the TCE buffer page from the
same pool that we use for the bounce buffers.  I assume there must
already be some sort of allocator for that?

> > In fact, we could make use of as much optimization as possible.
> > 
> > 
> >>
> >>>>>> Is not this for pci+swiotlb? 
> > ..snip..
> >>>>> This patch is purely to help the hypervisor setup the TCE table, in the
> >>>>> presence of a IOMMU.
> >>>>
> >>>> Then the hypervisor should be able to access the guest pages mapped for
> >>>> DMA and these pages should be made unsecure for this to work. Where/when
> >>>> does this happen?
> >>>
> >>> This happens in the SWIOTLB code.  The code to do that is already
> >>> upstream.  
> >>>
> >>> The sharing of the pages containing the SWIOTLB bounce buffers is done
> >>> in init_svm() which calls swiotlb_update_mem_attributes() which calls
> >>> set_memory_decrypted().  In the case of pseries, set_memory_decrypted() calls 
> >>> uv_share_page().
> >>
> >>
> >> This does not seem enough as when you enforce iommu_platform=on, QEMU
> >> starts accessing virtio buffers via IOMMU so bounce buffers have to be
> >> mapped explicitly, via H_PUT_TCE&co, where does this happen?
> >>
> > 
> > I think, it happens at boot time. Every page of the guest memory is TCE
> > mapped, if iommu is enabled. SWIOTLB pages get implicitly TCE-mapped
> > as part of that operation.
> 
> 
> Ah I see. This works via the huge dma window. Ok, makes sense now.
> 
> It just seems like a waste that we could map swiotlb 1:1 via the always
> existing small DMA window but instead we rely on a huge window to map
> these small buffers. This way we are wasting the entire 32bit window and
> most of the huge window. We may fix it in the future (not right now) but
> for now I would still avoid unsecuring additional memory. Thanks,
> 
>
Leonardo Bras Dec. 4, 2019, 6:26 p.m. UTC | #11
On Sun, 2019-12-01 at 22:45 -0800, Ram Pai wrote:
> @@ -206,8 +224,7 @@ static int tce_buildmulti_pSeriesLP(struct iommu_table *tbl, long tcenum,
>          * from iommu_alloc{,_sg}()
>          */
>         if (!tcep) {
> -               tcep = (__be64 *)__get_free_page(GFP_ATOMIC);
> -               /* If allocation fails, fall back to the loop implementation */
> +               tcep = alloc_tce_page();
>                 if (!tcep) {
>                         local_irq_restore(flags);
>                         return tce_build_pSeriesLP(tbl, tcenum, npages, uaddr,

The comment about failing allocation was removed, but I see no change of
behaviour here. 

Can you please explain what/where it changes? 

Best regards,

Leonardo
Ram Pai Dec. 4, 2019, 8:27 p.m. UTC | #12
On Wed, Dec 04, 2019 at 03:26:50PM -0300, Leonardo Bras wrote:
> On Sun, 2019-12-01 at 22:45 -0800, Ram Pai wrote:
> > @@ -206,8 +224,7 @@ static int tce_buildmulti_pSeriesLP(struct iommu_table *tbl, long tcenum,
> >          * from iommu_alloc{,_sg}()
> >          */
> >         if (!tcep) {
> > -               tcep = (__be64 *)__get_free_page(GFP_ATOMIC);
> > -               /* If allocation fails, fall back to the loop implementation */
> > +               tcep = alloc_tce_page();
> >                 if (!tcep) {
> >                         local_irq_restore(flags);
> >                         return tce_build_pSeriesLP(tbl, tcenum, npages, uaddr,
> 
> The comment about failing allocation was removed, but I see no chage of
> behaviour here. 
> 
> Can you please explain what/where it changes? 

You observed it right. The comment should stay put.  Will have it fixed
in my next version.

Thanks,
RP
Ram Pai Dec. 4, 2019, 8:42 p.m. UTC | #13
On Wed, Dec 04, 2019 at 02:36:18PM +1100, David Gibson wrote:
> On Wed, Dec 04, 2019 at 12:08:09PM +1100, Alexey Kardashevskiy wrote:
> > 
> > 
> > On 04/12/2019 11:49, Ram Pai wrote:
> > > On Wed, Dec 04, 2019 at 11:04:04AM +1100, Alexey Kardashevskiy wrote:
> > >>
> > >>
> > >> On 04/12/2019 03:52, Ram Pai wrote:
> > >>> On Tue, Dec 03, 2019 at 03:24:37PM +1100, Alexey Kardashevskiy wrote:
> > >>>>
> > >>>>
> > >>>> On 03/12/2019 15:05, Ram Pai wrote:
> > >>>>> On Tue, Dec 03, 2019 at 01:15:04PM +1100, Alexey Kardashevskiy wrote:
> > >>>>>>
> > >>>>>>
> > >>>>>> On 03/12/2019 13:08, Ram Pai wrote:
> > >>>>>>> On Tue, Dec 03, 2019 at 11:56:43AM +1100, Alexey Kardashevskiy wrote:
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> On 02/12/2019 17:45, Ram Pai wrote:
> > >>>>>>>>> H_PUT_TCE_INDIRECT hcall uses a page filled with TCE entries, as one of
> > >>>>>>>>> its parameters. One page is dedicated per cpu, for the lifetime of the
> > >>>>>>>>> kernel for this purpose. On secure VMs, contents of this page, when
> > >>>>>>>>> accessed by the hypervisor, retrieves encrypted TCE entries.  Hypervisor
> > >>>>>>>>> needs to know the unencrypted entries, to update the TCE table
> > >>>>>>>>> accordingly.  There is nothing secret or sensitive about these entries.
> > >>>>>>>>> Hence share the page with the hypervisor.
> > >>>>>>>>
> > >>>>>>>> This unsecures a page in the guest in a random place which creates an
> > >>>>>>>> additional attack surface which is hard to exploit indeed but
> > >>>>>>>> nevertheless it is there.
> > >>>>>>>> A safer option would be not to use the
> > >>>>>>>> hcall-multi-tce hyperrtas option (which translates FW_FEATURE_MULTITCE
> > >>>>>>>> in the guest).
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> Hmm... How do we not use it?  AFAICT hcall-multi-tce option gets invoked
> > >>>>>>> automatically when IOMMU option is enabled.
> > >>>>>>
> > >>>>>> It is advertised by QEMU but the guest does not have to use it.
> > >>>>>
> > >>>>> Are you suggesting that even normal-guest, not use hcall-multi-tce?
> > >>>>> or just secure-guest?  
> > >>>>
> > >>>>
> > >>>> Just secure.
> > >>>
> > >>> hmm..  how are the TCE entries communicated to the hypervisor, if
> > >>> hcall-multi-tce is disabled?
> > >>
> > >> Via H_PUT_TCE which updates 1 entry at once (sets or clears).
> > >> hcall-multi-tce  enables H_PUT_TCE_INDIRECT (512 entries at once) and
> > >> H_STUFF_TCE (clearing, up to 4bln at once? many), these are simply an
> > >> optimization.
> > > 
> > > Do you still think, secure-VM should use H_PUT_TCE and not
> > > H_PUT_TCE_INDIRECT?  And normal VM should use H_PUT_TCE_INDIRECT?
> > > Is there any advantage of special casing it for secure-VMs.
> > 
> > 
> > Reducing the amount of insecure memory at random location.
> 
> The other approach we could use for that - which would still allow
> H_PUT_TCE_INDIRECT, would be to allocate the TCE buffer page from the
> same pool that we use for the bounce buffers.  I assume there must
> already be some sort of allocator for that?

The allocator for swiotlb is buried deep in the swiotlb code. It is 
not exposed to the outside-swiotlb world. Will have to do major surgery
to expose it.

I was thinking, maybe we share the page, finish the H_PUT_TCE_INDIRECT
call, and unshare the page.  This will address Alexey's concern of having
shared pages at random locations, and will also give me my performance
optimization.  Alexey: ok?

RP
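The transient-sharing scheme Ram proposes here can be modelled as a small state machine. Everything below is an invented stub standing in for uv_share_page()/uv_unshare_page() and the H_PUT_TCE_INDIRECT hcall; it sketches only the control flow, not kernel code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Invented test doubles for the ultravisor calls and the hcall. */
static bool page_shared;
static int  entries_put;

static void uv_share_page_stub(void)   { page_shared = true;  }
static void uv_unshare_page_stub(void) { page_shared = false; }

static void h_put_tce_indirect_stub(const unsigned long *tces, size_t n)
{
	(void)tces;
	assert(page_shared);    /* hypervisor can only read a shared page */
	entries_put += (int)n;
}

/* Share only for the duration of the hcall, then unshare again. */
static void put_tces_transiently(const unsigned long *tces, size_t n)
{
	uv_share_page_stub();
	h_put_tce_indirect_stub(tces, n);
	uv_unshare_page_stub();
}
```

The invariant the stubs enforce is exactly the one under debate: outside the hcall window, no extra page is left shared.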
Alexey Kardashevskiy Dec. 4, 2019, 10:26 p.m. UTC | #14
On 05/12/2019 07:42, Ram Pai wrote:
> On Wed, Dec 04, 2019 at 02:36:18PM +1100, David Gibson wrote:
>> On Wed, Dec 04, 2019 at 12:08:09PM +1100, Alexey Kardashevskiy wrote:
>>>
>>>
>>> On 04/12/2019 11:49, Ram Pai wrote:
>>>> On Wed, Dec 04, 2019 at 11:04:04AM +1100, Alexey Kardashevskiy wrote:
>>>>>
>>>>>
>>>>> On 04/12/2019 03:52, Ram Pai wrote:
>>>>>> On Tue, Dec 03, 2019 at 03:24:37PM +1100, Alexey Kardashevskiy wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 03/12/2019 15:05, Ram Pai wrote:
>>>>>>>> On Tue, Dec 03, 2019 at 01:15:04PM +1100, Alexey Kardashevskiy wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 03/12/2019 13:08, Ram Pai wrote:
>>>>>>>>>> On Tue, Dec 03, 2019 at 11:56:43AM +1100, Alexey Kardashevskiy wrote:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On 02/12/2019 17:45, Ram Pai wrote:
>>>>>>>>>>>> H_PUT_TCE_INDIRECT hcall uses a page filled with TCE entries, as one of
>>>>>>>>>>>> its parameters. One page is dedicated per cpu, for the lifetime of the
>>>>>>>>>>>> kernel for this purpose. On secure VMs, contents of this page, when
>>>>>>>>>>>> accessed by the hypervisor, retrieves encrypted TCE entries.  Hypervisor
>>>>>>>>>>>> needs to know the unencrypted entries, to update the TCE table
>>>>>>>>>>>> accordingly.  There is nothing secret or sensitive about these entries.
>>>>>>>>>>>> Hence share the page with the hypervisor.
>>>>>>>>>>>
>>>>>>>>>>> This unsecures a page in the guest in a random place which creates an
>>>>>>>>>>> additional attack surface which is hard to exploit indeed but
>>>>>>>>>>> nevertheless it is there.
>>>>>>>>>>> A safer option would be not to use the
>>>>>>>>>>> hcall-multi-tce hyperrtas option (which translates FW_FEATURE_MULTITCE
>>>>>>>>>>> in the guest).
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Hmm... How do we not use it?  AFAICT hcall-multi-tce option gets invoked
>>>>>>>>>> automatically when IOMMU option is enabled.
>>>>>>>>>
>>>>>>>>> It is advertised by QEMU but the guest does not have to use it.
>>>>>>>>
>>>>>>>> Are you suggesting that even normal-guest, not use hcall-multi-tce?
>>>>>>>> or just secure-guest?  
>>>>>>>
>>>>>>>
>>>>>>> Just secure.
>>>>>>
>>>>>> hmm..  how are the TCE entries communicated to the hypervisor, if
>>>>>> hcall-multi-tce is disabled?
>>>>>
>>>>> Via H_PUT_TCE which updates 1 entry at once (sets or clears).
>>>>> hcall-multi-tce  enables H_PUT_TCE_INDIRECT (512 entries at once) and
>>>>> H_STUFF_TCE (clearing, up to 4bln at once? many), these are simply an
>>>>> optimization.
>>>>
>>>> Do you still think, secure-VM should use H_PUT_TCE and not
>>>> H_PUT_TCE_INDIRECT?  And normal VM should use H_PUT_TCE_INDIRECT?
>>>> Is there any advantage of special casing it for secure-VMs.
>>>
>>>
>>> Reducing the amount of insecure memory at random location.
>>
>> The other approach we could use for that - which would still allow
>> H_PUT_TCE_INDIRECT, would be to allocate the TCE buffer page from the
>> same pool that we use for the bounce buffers.  I assume there must
>> already be some sort of allocator for that?
> 
> The allocator for swiotlb is buried deep in the swiotlb code. It is 
> not exposed to the outside-swiotlb world. Will have to do major surgery
> to expose it.
> 
> I was thinking, maybe we share the page, finish the INDIRECT_TCE call,
> and unshare the page.  This will address Alexey's concern of having
> shared pages at random location, and will also give me my performance
> optimization.  Alexey: ok?


I really do not see the point. I really think we should do a 1:1 mapping
of the swiotlb buffers using the default 32-bit window via H_PUT_TCE, and
this should be more than enough; I do not think the amount of code will
be dramatically different compared to unsecuring and securing a page or
using one of the swiotlb pages for this purpose. Thanks,
Ram Pai Dec. 5, 2019, 2:15 a.m. UTC | #15
On Thu, Dec 05, 2019 at 09:26:14AM +1100, Alexey Kardashevskiy wrote:
> 
> 
> On 05/12/2019 07:42, Ram Pai wrote:
> > On Wed, Dec 04, 2019 at 02:36:18PM +1100, David Gibson wrote:
> >> On Wed, Dec 04, 2019 at 12:08:09PM +1100, Alexey Kardashevskiy wrote:
> >>>
> >>>
> >>> On 04/12/2019 11:49, Ram Pai wrote:
> >>>> On Wed, Dec 04, 2019 at 11:04:04AM +1100, Alexey Kardashevskiy wrote:
> >>>>>
> >>>>>
> >>>>> On 04/12/2019 03:52, Ram Pai wrote:
> >>>>>> On Tue, Dec 03, 2019 at 03:24:37PM +1100, Alexey Kardashevskiy wrote:
> >>>>>>>
> >>>>>>>
> >>>>>>> On 03/12/2019 15:05, Ram Pai wrote:
> >>>>>>>> On Tue, Dec 03, 2019 at 01:15:04PM +1100, Alexey Kardashevskiy wrote:
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On 03/12/2019 13:08, Ram Pai wrote:
> >>>>>>>>>> On Tue, Dec 03, 2019 at 11:56:43AM +1100, Alexey Kardashevskiy wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> On 02/12/2019 17:45, Ram Pai wrote:
> >>>>>>>>>>>> H_PUT_TCE_INDIRECT hcall uses a page filled with TCE entries, as one of
> >>>>>>>>>>>> its parameters. One page is dedicated per cpu, for the lifetime of the
> >>>>>>>>>>>> kernel for this purpose. On secure VMs, contents of this page, when
> >>>>>>>>>>>> accessed by the hypervisor, retrieves encrypted TCE entries.  Hypervisor
> >>>>>>>>>>>> needs to know the unencrypted entries, to update the TCE table
> >>>>>>>>>>>> accordingly.  There is nothing secret or sensitive about these entries.
> >>>>>>>>>>>> Hence share the page with the hypervisor.
> >>>>>>>>>>>
> >>>>>>>>>>> This unsecures a page in the guest in a random place which creates an
> >>>>>>>>>>> additional attack surface which is hard to exploit indeed but
> >>>>>>>>>>> nevertheless it is there.
> >>>>>>>>>>> A safer option would be not to use the
> >>>>>>>>>>> hcall-multi-tce hyperrtas option (which translates FW_FEATURE_MULTITCE
> >>>>>>>>>>> in the guest).
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Hmm... How do we not use it?  AFAICT hcall-multi-tce option gets invoked
> >>>>>>>>>> automatically when IOMMU option is enabled.
> >>>>>>>>>
> >>>>>>>>> It is advertised by QEMU but the guest does not have to use it.
> >>>>>>>>
> >>>>>>>> Are you suggesting that even normal-guest, not use hcall-multi-tce?
> >>>>>>>> or just secure-guest?  
> >>>>>>>
> >>>>>>>
> >>>>>>> Just secure.
> >>>>>>
> >>>>>> hmm..  how are the TCE entries communicated to the hypervisor, if
> >>>>>> hcall-multi-tce is disabled?
> >>>>>
> >>>>> Via H_PUT_TCE which updates 1 entry at once (sets or clears).
> >>>>> hcall-multi-tce  enables H_PUT_TCE_INDIRECT (512 entries at once) and
> >>>>> H_STUFF_TCE (clearing, up to 4bln at once? many), these are simply an
> >>>>> optimization.
> >>>>
> >>>> Do you still think, secure-VM should use H_PUT_TCE and not
> >>>> H_PUT_TCE_INDIRECT?  And normal VM should use H_PUT_TCE_INDIRECT?
> >>>> Is there any advantage of special casing it for secure-VMs.
> >>>
> >>>
> >>> Reducing the amount of insecure memory at random location.
> >>
> >> The other approach we could use for that - which would still allow
> >> H_PUT_TCE_INDIRECT, would be to allocate the TCE buffer page from the
> >> same pool that we use for the bounce buffers.  I assume there must
> >> already be some sort of allocator for that?
> > 
> > The allocator for swiotlb is buried deep in the swiotlb code. It is 
> > not exposed to the outside-swiotlb world. Will have to do major surgery
> > to expose it.
> > 
> > I was thinking, maybe we share the page, finish the INDIRECT_TCE call,
> > and unshare the page.  This will address Alexey's concern of having
> > shared pages at random location, and will also give me my performance
> > optimization.  Alexey: ok?
> 
> 
> I really do not see the point. I really think we should to 1:1 mapping
> of swtiotlb buffers using the default 32bit window using H_PUT_TCE and
> this should be more than enough, I do not think the amount of code will
> be dramatically different compared to unsecuring and securing a page or
> using one of swtiotlb pages for this purpose. Thanks,

Ok. I will address your major concern -- "do not create new shared pages
at random locations" -- in my next version of the patch.  Using the 32-bit
DMA window just to map the SWIOTLB buffers will take some effort; I hope
we can stage it that way.

RP
Christoph Hellwig Dec. 5, 2019, 8:28 a.m. UTC | #16
On Wed, Dec 04, 2019 at 12:42:32PM -0800, Ram Pai wrote:
> > The other approach we could use for that - which would still allow
> > H_PUT_TCE_INDIRECT, would be to allocate the TCE buffer page from the
> > same pool that we use for the bounce buffers.  I assume there must
> > already be some sort of allocator for that?
> 
> The allocator for swiotlb is buried deep in the swiotlb code. It is 
> not exposed to the outside-swiotlb world. Will have to do major surgery
> to expose it.

I don't think it would require all that much changes, but I'd really
hate the layering of calling into it directly.  Do we have a struct
device associated with the iommu that doesn't get iommu translations
themselves?  If we do a dma_alloc_coherent on that you'll get the
memory pool for free.
Ram Pai Dec. 6, 2019, 11:10 p.m. UTC | #17
On Thu, Dec 05, 2019 at 09:26:14AM +1100, Alexey Kardashevskiy wrote:
> 
> 
> On 05/12/2019 07:42, Ram Pai wrote:
.snip...
> >>>> Do you still think, secure-VM should use H_PUT_TCE and not
> >>>> H_PUT_TCE_INDIRECT?  And normal VM should use H_PUT_TCE_INDIRECT?
> >>>> Is there any advantage of special casing it for secure-VMs.
> >>>
> >>>
> >>> Reducing the amount of insecure memory at random location.
> >>
> >> The other approach we could use for that - which would still allow
> >> H_PUT_TCE_INDIRECT, would be to allocate the TCE buffer page from the
> >> same pool that we use for the bounce buffers.  I assume there must
> >> already be some sort of allocator for that?
> > 
> > The allocator for swiotlb is buried deep in the swiotlb code. It is 
> > not exposed to the outside-swiotlb world. Will have to do major surgery
> > to expose it.
> > 
> > I was thinking, maybe we share the page, finish the INDIRECT_TCE call,
> > and unshare the page.  This will address Alexey's concern of having
> > shared pages at random location, and will also give me my performance
> > optimization.  Alexey: ok?
> 
> 
> I really do not see the point. I really think we should to 1:1 mapping
> of swtiotlb buffers using the default 32bit window using H_PUT_TCE and
> this should be more than enough, I do not think the amount of code will
> be dramatically different compared to unsecuring and securing a page or
> using one of swtiotlb pages for this purpose. Thanks,

Ok. There are three different issues to be addressed here.

(a) How to map the TCE entries in the TCE table?  Should we use H_PUT_TCE, 
	or H_PUT_TCE_INDIRECT?

(b) How much of the guest memory must be mapped in the TCE table? Should
   it be the entire guest memory, or only the memory used by the SWIOTLB?

(c) What mapping window must be used? Is it the 64-bit DDW, or the 
	default 32-bit DDW?

Regardless of how we resolve issues (b) and (c), we still need to
resolve (a).  Your main concern about solution (a) is that random pages
are permanently shared, something that can be exploited and can cause
security issues.  I tend to agree this is possible, though I am not sure
how.  But yes, we need to address this concern, since security is
paramount to Secure Virtual Machines.

The way to resolve (a) is --
(i) grab a page from the SWIOTLB pool and use H_PUT_TCE_INDIRECT,
	OR
(ii) simply use H_PUT_TCE,
	OR
(iii) share the page prior to H_PUT_TCE_INDIRECT, and
	unshare the page once done.

Solution (i) has a layering violation, as Christoph alluded to in his
previous reply. The swiotlb buffers are meant for I/O and DMA-related
activity; we would be abusing these swiotlb pages to communicate TCE
entries to the hypervisor.  Secondly, the IOMMU code has no idea where its
pages are sourced from, and should not know either. I am uncomfortable
going this route. There is some upstream discussion about having a separate
pool of shared pages on secure VMs, https://lkml.org/lkml/2019/11/14/381.
That solution, when ready, may be leveraged here.

I have coded solution (ii) and it works, but the boot path slows down
significantly: a huge number of H_PUT_TCE hcalls. Very hurtful.

I strongly think solution (iii) is the right way to go. I have coded it,
it works, and the boot path is much faster.  However, I am not sure if you
have a concern with this solution.

In any case, I am sending my next version of the patch based on solution
(iii).

Once this is addressed, I will address (b) and (c).

RP
Patch

diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 6ba081d..0720831 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -37,6 +37,7 @@ 
 #include <asm/mmzone.h>
 #include <asm/plpar_wrappers.h>
 #include <asm/svm.h>
+#include <asm/ultravisor.h>
 
 #include "pseries.h"
 
@@ -179,6 +180,23 @@  static int tce_build_pSeriesLP(struct iommu_table *tbl, long tcenum,
 
 static DEFINE_PER_CPU(__be64 *, tce_page);
 
+/*
+ * Allocate a tce page.  If secure VM, share the page with the hypervisor.
+ *
+ * NOTE: the TCE page is shared with the hypervisor explicitly and remains
+ * shared for the lifetime of the kernel. It is implicitly unshared at kernel
+ * shutdown through a UV_UNSHARE_ALL_PAGES ucall.
+ */
+static __be64 *alloc_tce_page(void)
+{
+	__be64 *tcep = (__be64 *)__get_free_page(GFP_ATOMIC);
+
+	if (tcep && is_secure_guest())
+		uv_share_page(PHYS_PFN(__pa(tcep)), 1);
+
+	return tcep;
+}
+
 static int tce_buildmulti_pSeriesLP(struct iommu_table *tbl, long tcenum,
 				     long npages, unsigned long uaddr,
 				     enum dma_data_direction direction,
@@ -206,8 +224,7 @@  static int tce_buildmulti_pSeriesLP(struct iommu_table *tbl, long tcenum,
 	 * from iommu_alloc{,_sg}()
 	 */
 	if (!tcep) {
-		tcep = (__be64 *)__get_free_page(GFP_ATOMIC);
-		/* If allocation fails, fall back to the loop implementation */
+		tcep = alloc_tce_page();
 		if (!tcep) {
 			local_irq_restore(flags);
 			return tce_build_pSeriesLP(tbl, tcenum, npages, uaddr,
@@ -405,7 +422,7 @@  static int tce_setrange_multi_pSeriesLP(unsigned long start_pfn,
 	tcep = __this_cpu_read(tce_page);
 
 	if (!tcep) {
-		tcep = (__be64 *)__get_free_page(GFP_ATOMIC);
+		tcep = alloc_tce_page();
 		if (!tcep) {
 			local_irq_enable();
 			return -ENOMEM;