
[RFC,45/48] RISC-V: ioremap: Implement for arch specific ioremap hooks

Message ID 20230419221716.3603068-46-atishp@rivosinc.com
State Changes Requested
Series RISC-V CoVE support

Commit Message

Atish Kumar Patra April 19, 2023, 10:17 p.m. UTC
From: Rajnesh Kanwal <rkanwal@rivosinc.com>

Guests running in CoVE must notify the host about their MMIO regions
so that the host can enable MMIO emulation.

Signed-off-by: Rajnesh Kanwal <rkanwal@rivosinc.com>
Signed-off-by: Atish Patra <atishp@rivosinc.com>
---
 arch/riscv/mm/Makefile  |  1 +
 arch/riscv/mm/ioremap.c | 45 +++++++++++++++++++++++++++++++++++++++++
 2 files changed, 46 insertions(+)
 create mode 100644 arch/riscv/mm/ioremap.c

Comments

Dave Hansen April 20, 2023, 10:15 p.m. UTC | #1
On 4/19/23 15:17, Atish Patra wrote:
> The guests running in CoVE must notify the host about its mmio regions
> so that host can enable mmio emulation.

This one doesn't make a lot of sense to me.

The guest and host must agree about the guest's physical layout up
front.  In general, the host gets to dictate that layout.  It tells the
guest, up front, what is present in the guest physical address space.

This callback appears to say to the host:

	Hey, I (the guest) am treating this guest physical area as MMIO.

But the host and guest have to agree _somewhere_ what the MMIO is used
for, not just that it is being used as MMIO.
Atish Kumar Patra April 21, 2023, 7:24 p.m. UTC | #2
On Fri, Apr 21, 2023 at 3:46 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 4/19/23 15:17, Atish Patra wrote:
> > The guests running in CoVE must notify the host about its mmio regions
> > so that host can enable mmio emulation.
>
> This one doesn't make a lot of sense to me.
>
> The guest and host must agree about the guest's physical layout up
> front.  In general, the host gets to dictate that layout.  It tells the
> guest, up front, what is present in the guest physical address space.
>

That is passed through DT/ACPI (which will be measured) to the guest.

> This callback appears to say to the host:
>
>         Hey, I (the guest) am treating this guest physical area as MMIO.
>
> But the host and guest have to agree _somewhere_ what the MMIO is used
> for, not just that it is being used as MMIO.
>

Yes. The TSM (TEE Security Manager), which is the equivalent of the TDX
module, also needs to be aware of the MMIO regions so that it can
forward the faults accordingly.
Most of the MMIO is emulated in the host (userspace, or kernel
emulation if present).
The host is outside the trust boundary of the guest. Thus, the guest
needs to make sure the host only emulates the designated MMIO regions.
Otherwise, it opens an attack surface to a malicious host.

All other confidential computing solutions depend on guest-initiated
MMIO as well. AFAIK, TDX & SEV rely on #VE-like exceptions to invoke
it, while this patch is similar to what pKVM does.
This approach lets the enlightened guest control which MMIO regions it
wants the host to emulate.
It can be a subset of the regions in the layout the host provided. The
guest device filtering solution is based on this idea as well [1].

[1] https://lore.kernel.org/all/20210930010511.3387967-1-sathyanarayanan.kuppuswamy@linux.intel.com/


Dave Hansen April 24, 2023, 1:48 p.m. UTC | #3
On 4/21/23 12:24, Atish Kumar Patra wrote:
> On Fri, Apr 21, 2023 at 3:46 AM Dave Hansen <dave.hansen@intel.com> wrote:
>> This callback appears to say to the host:
>>
>>         Hey, I (the guest) am treating this guest physical area as MMIO.
>>
>> But the host and guest have to agree _somewhere_ what the MMIO is used
>> for, not just that it is being used as MMIO.
> 
> Yes. The TSM (TEE Security Manager) which is equivalent to TDX also 
> needs to be aware of the MMIO regions so that it can forward the
> faults accordingly. Most of the MMIO is emulated in the host
> (userspace or kernel emulation if present). The host is outside the
> trust boundary of the guest. Thus, guest needs to make sure the host 
> only emulates the designated MMIO region. Otherwise, it opens an 
> attack surface from a malicious host.
How does this mechanism stop the host from emulating something outside
the designated region?

On TDX, for instance, the guest page tables have a shared/private bit.
Private pages get TDX protections to (among other things) keep the page
contents confidential from the host.  Shared pages can be used for MMIO
and don't have those protections.

If the host goes and tries to flip a page from private->shared, TDX
protections will kick in and prevent it.

None of this requires the guest to tell the host where it expects MMIO
to be located.

> All other confidential computing solutions also depend on guest 
> initiated MMIO as well. AFAIK, the TDX & SEV relies on #VE like
> exceptions to invoke that while this patch is similar to what pkvm
> does. This approach lets the enlightened guest control which MMIO
> regions it wants the host to emulate.

I'm not _quite_ sure what "guest initiated" means.  But SEV and TDX
don't require an ioremap hook like this.  So, even if they *are* "guest
initiated", the question still remains how they work without this patch,
or what they are missing without it.

> It can be a subset of the region's host provided the layout. The
> guest device filtering solution is based on this idea as well [1].
> 
> [1] https://lore.kernel.org/all/20210930010511.3387967-1-sathyanarayanan.kuppuswamy@linux.intel.com/

I don't really see the connection.  Even if that series was going
forward (I'm not sure it is) there is no ioremap hook there.  There's
also no guest->host communication in that series.  The guest doesn't
_tell_ the host where the MMIO is, it just declines to run code for
devices that it didn't expect to see.

I'm still rather confused here.
Atish Kumar Patra April 25, 2023, 8 a.m. UTC | #4
On Mon, Apr 24, 2023 at 7:18 PM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 4/21/23 12:24, Atish Kumar Patra wrote:
> > On Fri, Apr 21, 2023 at 3:46 AM Dave Hansen <dave.hansen@intel.com> wrote:
> >> This callback appears to say to the host:
> >>
> >>         Hey, I (the guest) am treating this guest physical area as MMIO.
> >>
> >> But the host and guest have to agree _somewhere_ what the MMIO is used
> >> for, not just that it is being used as MMIO.
> >
> > Yes. The TSM (TEE Security Manager) which is equivalent to TDX also
> > needs to be aware of the MMIO regions so that it can forward the
> > faults accordingly. Most of the MMIO is emulated in the host
> > (userspace or kernel emulation if present). The host is outside the
> > trust boundary of the guest. Thus, guest needs to make sure the host
> > only emulates the designated MMIO region. Otherwise, it opens an
> > attack surface from a malicious host.
> How does this mechanism stop the host from emulating something outside
> the designated region?
>
> On TDX, for instance, the guest page table have a shared/private bit.
> Private pages get TDX protections to (among other things) keep the page
> contents confidential from the host.  Shared pages can be used for MMIO
> and don't have those protections.
>
> If the host goes and tries to flip a page from private->shared, TDX
> protections will kick in and prevent it.
>
> None of this requires the guest to tell the host where it expects MMIO
> to be located.
>
> > All other confidential computing solutions also depend on guest
> > initiated MMIO as well. AFAIK, the TDX & SEV relies on #VE like
> > exceptions to invoke that while this patch is similar to what pkvm
> > does. This approach lets the enlightened guest control which MMIO
> > regions it wants the host to emulate.
>
> I'm not _quite_ sure what "guest initiated" means.  But SEV and TDX
> don't require an ioremap hook like this.  So, even if they *are* "guest
> initiated", the question still remains how they work without this patch,
> or what they are missing without it.
>

Maybe I misunderstood your question earlier. Are you concerned about
guests invoking any MMIO-region-specific calls in the ioremap path, or
about passing that information to the host?
Earlier, I assumed the former, but it seems you are concerned about the
latter as well. Sorry for the confusion in that case.
The guest initiation is necessary, while the host notification can be
made optional.
"Guest initiated" means the guest tells the TSM (the equivalent of the
TDX module in RISC-V) the MMIO region details.
The TSM keeps track of this, and any page faults that happen in that
region are forwarded to the host by the TSM after instruction decoding.
Thus, the TSM can make sure that only ioremapped regions are considered
MMIO regions. Otherwise, all memory outside the guest physical region
would be considered MMIO.

In the current CoVE implementation, that MMIO region information is also
passed to the host to provide additional flexibility. The host may
choose to do an additional sanity check and bail out, without going to
userspace, if the fault address does not belong to the requested MMIO
regions. This is purely an optimization and may not be mandatory.


> > It can be a subset of the region's host provided the layout. The
> > guest device filtering solution is based on this idea as well [1].
> >
> > [1] https://lore.kernel.org/all/20210930010511.3387967-1-sathyanarayanan.kuppuswamy@linux.intel.com/
>
> I don't really see the connection.  Even if that series was going
> forward (I'm not sure it is) there is no ioremap hook there.  There's
> also no guest->host communication in that series.  The guest doesn't
> _tell_ the host where the MMIO is, it just declines to run code for
> devices that it didn't expect to see.
>

This is a recent version of the above series from the TDX GitHub. It is
a WIP as well and has not been posted to the mailing list, so it may
still be undergoing revisions.
As per my understanding, the ioremap changes for TDX mark the
ioremapped pages as shared.
The guest->host communication happens in the #VE exception handler,
where the guest converts the fault into a hypercall by invoking
TDG.VP.VMCALL with an EPT violation set. The host emulates an MMIO
access if it gets a VMCALL with an EPT violation.
Please correct me if I am wrong.

As I said above, the objective here is to notify the TSM where the
MMIO is. Notifying the host is just an optimization that we chose to
add. In fact, in this series the KVM code doesn't do anything with
that information.
The commit text can probably be improved to clarify that.


> I'm still rather confused here.
Dave Hansen April 25, 2023, 1:10 p.m. UTC | #5
On 4/25/23 01:00, Atish Kumar Patra wrote:
> On Mon, Apr 24, 2023 at 7:18 PM Dave Hansen <dave.hansen@intel.com> wrote:
>> On 4/21/23 12:24, Atish Kumar Patra wrote:
>> I'm not _quite_ sure what "guest initiated" means.  But SEV and TDX
>> don't require an ioremap hook like this.  So, even if they *are* "guest
>> initiated", the question still remains how they work without this patch,
>> or what they are missing without it.
> 
> Maybe I misunderstood your question earlier. Are you concerned about guests
> invoking any MMIO region specific calls in the ioremap path or passing
> that information to the host ?

My concern is that I don't know why this patch is here.  There should be
a very simple answer to the question: Why does RISC-V need this patch
but x86 does not?

> Earlier, I assumed the former but it seems you are also concerned
> about the latter as well. Sorry for the confusion in that case.
> The guest initiation is necessary while the host notification can be
> made optional.
> The "guest initiated" means the guest tells the TSM (equivalent of TDX
> module in RISC-V) the MMIO region details.
> The TSM keeps a track of this and any page faults that happen in that
> region are forwarded
> to the host by the TSM after the instruction decoding. Thus TSM can
> make sure that only ioremapped regions are
> considered MMIO regions. Otherwise, all memory outside the guest
> physical region will be considered as the MMIO region.

Ahh, OK, that's a familiar problem.  I see the connection to device
filtering now.

Is this functionality in the current set?  I went looking for it and all
I found was the host notification side.

Is this the only mechanism by which the guest tells the TSM which parts
of the guest physical address space can be exposed to the host?

For TDX and SEV, that information is inferred from a bit in the page
tables.  Essentially, there are dedicated guest physical addresses that
tell the hardware how to treat the mappings: should the secure page
tables or the host's EPT/NPT be consulted?

If that mechanism is different for RISC-V, it would go a long way to
explaining why RISC-V needs this patch.

> In the current CoVE implementation, that MMIO region information is also
> passed to the host to provide additional flexibility. The host may
> choose to do additional
> sanity check and bail if the fault address does not belong to
> requested MMIO regions without
> going to the userspace. This is purely an optimization and may not be mandatory.

Makes sense, thanks for the explanation.

>>> It can be a subset of the region's host provided the layout. The
>>> guest device filtering solution is based on this idea as well [1].
>>>
>>> [1] https://lore.kernel.org/all/20210930010511.3387967-1-sathyanarayanan.kuppuswamy@linux.intel.com/
>>
>> I don't really see the connection.  Even if that series was going
>> forward (I'm not sure it is) there is no ioremap hook there.  There's
>> also no guest->host communication in that series.  The guest doesn't
>> _tell_ the host where the MMIO is, it just declines to run code for
>> devices that it didn't expect to see.
> 
> This is a recent version of the above series from tdx github. This is
> a WIP as well and has not been posted to
> the mailing list. Thus, it may be going under revisions as well.
> As per my understanding the above ioremap changes for TDX mark the
> ioremapped pages as shared.
> The guest->host communication happen in the #VE exception handler
> where the guest converts this to a hypercall by invoking TDG.VP.VMCALL
> with an EPT violation set. The host would emulate an MMIO address if
> it gets an VMCALL with EPT violation.
> Please correct me if I am wrong.

Yeah, TDX does:

1. Guest MMIO access
2. Guest #VE handler (if the access faults)
3. Guest hypercall->host
4. Host fixes the fault
5. Hypercall returns, guest returns from #VE via IRET
6. Guest retries MMIO instruction

From what you said, RISC-V appears to do:

1. Guest MMIO access
2. Host MMIO handler
3. Host handles the fault, returns
4. Guest retries MMIO instruction

In other words, this mechanism does the same thing but short-circuits
the trip through #VE and the hypercall.

What happens if this ioremap() hook is not in place?  Does the hardware
(or TSM) generate an exception like TDX gets?  If so, it's probably
possible to move this "notify the TSM" code to that exception handler
instead of needing an ioremap() hook.

I'm not saying that it's _better_ to do that, but it would allow you to
get rid of this patch for now and get me to shut up. :)

> As I said above, the objective here is to notify the TSM where the 
> MMIO is. Notifying the host is just an optimization that we choose to
> add. In fact, in this series the KVM code doesn't do anything with
> that information. The commit text probably can be improved to clarify
> that.

Just to close the loop here, please go take a look at
pgprot_decrypted().  That's where the x86 guest page table bit gets to
tell the hardware that the mapping might cause a #VE and is under the
control of the host.  That's the extent of what x86 does at ioremap() time.

So, to summarize, we have:

x86:
1. Guest page table bit to mark shared (host) vs. private (guest)
   control
2. #VE if there is a fault on a shared mapping to call into the host

RISC-V:
1. Guest->TSM call to mark MMIO vs. private
2. Faults in the MMIO area are then transparent to the guest

That design difference would, indeed, help explain why this patch is
here.  I'm still not 100% convinced that the patch is *required*, but I
at least understand how we arrived here.
Atish Kumar Patra April 26, 2023, 8:02 a.m. UTC | #6
On Tue, Apr 25, 2023 at 6:40 PM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 4/25/23 01:00, Atish Kumar Patra wrote:
> > On Mon, Apr 24, 2023 at 7:18 PM Dave Hansen <dave.hansen@intel.com> wrote:
> >> On 4/21/23 12:24, Atish Kumar Patra wrote:
> >> I'm not _quite_ sure what "guest initiated" means.  But SEV and TDX
> >> don't require an ioremap hook like this.  So, even if they *are* "guest
> >> initiated", the question still remains how they work without this patch,
> >> or what they are missing without it.
> >
> > Maybe I misunderstood your question earlier. Are you concerned about guests
> > invoking any MMIO region specific calls in the ioremap path or passing
> > that information to the host ?
>
> My concern is that I don't know why this patch is here.  There should be
> a very simple answer to the question: Why does RISC-V need this patch
> but x86 does not?
>
> > Earlier, I assumed the former but it seems you are also concerned
> > about the latter as well. Sorry for the confusion in that case.
> > The guest initiation is necessary while the host notification can be
> > made optional.
> > The "guest initiated" means the guest tells the TSM (equivalent of TDX
> > module in RISC-V) the MMIO region details.
> > The TSM keeps a track of this and any page faults that happen in that
> > region are forwarded
> > to the host by the TSM after the instruction decoding. Thus TSM can
> > make sure that only ioremapped regions are
> > considered MMIO regions. Otherwise, all memory outside the guest
> > physical region will be considered as the MMIO region.
>
> Ahh, OK, that's a familiar problem.  I see the connection to device
> filtering now.
>
> Is this functionality in the current set?  I went looking for it and all
> I found was the host notification side.
>

The current series doesn't have the guest filtering feature enabled.
However, we have implemented guest filtering, and it is maintained in a
separate tree:

https://github.com/rivosinc/linux/tree/cove-integration-device-filtering

We did not include those patches in this series because the TDX patches
are still under development. We plan to rebase our changes once the
revised patches are available.

> Is this the only mechanism by which the guest tells the TSM which parts
> of the guest physical address space can be exposed to the host?
>

This is the current approach defined in the CoVE spec. The guest informs
the TSM about both shared memory & MMIO regions via dedicated SBI calls
(e.g. sbi_covg_[add/remove]_mmio_region and
sbi_covg_[share/unshare]_memory_region).

> For TDX and SEV, that information is inferred from a bit in the page
> tables.  Essentially, there are dedicated guest physical addresses that
> tell the hardware how to treat the mappings: should the secure page
> tables or the host's EPT/NPT be consulted?
>

Yes. We don't have such a mechanism defined in CoVE yet.
Having said that, there is nothing in the ISA to prevent it, and it is
doable. Specific bits in the PTE could also be used to encode shared &
MMIO physical memory addresses.
The TSM implementation would probably need to implement a software page
walker in that case.

Are there any performance differences between the two approaches?
As per my understanding, we save some boot-time privilege transitions
and need fewer ABIs, but add the cost of a software walk on runtime
faults.

> If that mechanism is different for RISC-V, it would go a long way to
> explaining why RISC-V needs this patch.
>
> > In the current CoVE implementation, that MMIO region information is also
> > passed to the host to provide additional flexibility. The host may
> > choose to do additional
> > sanity check and bail if the fault address does not belong to
> > requested MMIO regions without
> > going to the userspace. This is purely an optimization and may not be mandatory.
>
> Makes sense, thanks for the explanation.
>
> >>> It can be a subset of the region's host provided the layout. The
> >>> guest device filtering solution is based on this idea as well [1].
> >>>
> >>> [1] https://lore.kernel.org/all/20210930010511.3387967-1-sathyanarayanan.kuppuswamy@linux.intel.com/
> >>
> >> I don't really see the connection.  Even if that series was going
> >> forward (I'm not sure it is) there is no ioremap hook there.  There's
> >> also no guest->host communication in that series.  The guest doesn't
> >> _tell_ the host where the MMIO is, it just declines to run code for
> >> devices that it didn't expect to see.
> >
> > This is a recent version of the above series from tdx github. This is
> > a WIP as well and has not been posted to
> > the mailing list. Thus, it may be going under revisions as well.
> > As per my understanding the above ioremap changes for TDX mark the
> > ioremapped pages as shared.
> > The guest->host communication happen in the #VE exception handler
> > where the guest converts this to a hypercall by invoking TDG.VP.VMCALL
> > with an EPT violation set. The host would emulate an MMIO address if
> > it gets an VMCALL with EPT violation.
> > Please correct me if I am wrong.
>
> Yeah, TDX does:
>
> 1. Guest MMIO access
> 2. Guest #VE handler (if the access faults)
> 3. Guest hypercall->host
> 4. Host fixes the fault
> 5. Hypercall returns, guest returns from #VE via IRET
> 6. Guest retries MMIO instruction
>
> From what you said, RISC-V appears to do:
>
> 1. Guest MMIO access
> 2. Host MMIO handler
> 3. Host handles the fault, returns
> 4. Guest retries MMIO instruction
>
> In other words, this mechanism does the same thing but short-circuits
> the trip through #VE and the hypercall.
>

Yes. Thanks for summarizing the TDX approach.

> What happens if this ioremap() hook is not in place?  Does the hardware
> (or TSM) generate an exception like TDX gets?  If so, it's probably
> possible to move this "notify the TSM" code to that exception handler
> instead of needing an ioremap() hook.
>

We don't have a #VE like exception mechanism in RISC-V.

> I'm not saying that it's _better_ to do that, but it would allow you to
> get rid of this patch for now and get me to shut up. :)
>
> > As I said above, the objective here is to notify the TSM where the
> > MMIO is. Notifying the host is just an optimization that we choose to
> > add. In fact, in this series the KVM code doesn't do anything with
> > that information. The commit text probably can be improved to clarify
> > that.
>
> Just to close the loop here, please go take a look at
> pgprot_decrypted().  That's where the x86 guest page table bit gets to
> tell the hardware that the mapping might cause a #VE and is under the
> control of the host.  That's the extent of what x86 does at ioremap() time.
>
> So, to summarize, we have:
>
> x86:
> 1. Guest page table bit to mark shared (host) vs. private (guest)
>    control
> 2. #VE if there is a fault on a shared mapping to call into the host
>
> RISC-V:
> 1. Guest->TSM call to mark MMIO vs. private
> 2. Faults in the MMIO area are then transparent to the guest
>

Yup. This discussion raised a very valid design aspect of the CoVE spec.
To summarize, we need to investigate whether using PTE bits, instead of
additional ABI, for managing shared/confidential/ioremapped pages makes
more sense.

Thanks for putting up with my answers and the feedback :).

> That design difference would, indeed, help explain why this patch is
> here.  I'm still not 100% convinced that the patch is *required*, but I
> at least understand how we arrived here.
Anup Patel April 26, 2023, 10:30 a.m. UTC | #7
On Wed, Apr 26, 2023 at 1:32 PM Atish Kumar Patra <atishp@rivosinc.com> wrote:
>
> On Tue, Apr 25, 2023 at 6:40 PM Dave Hansen <dave.hansen@intel.com> wrote:
> >
> > On 4/25/23 01:00, Atish Kumar Patra wrote:
> > > On Mon, Apr 24, 2023 at 7:18 PM Dave Hansen <dave.hansen@intel.com> wrote:
> > >> On 4/21/23 12:24, Atish Kumar Patra wrote:
> > >> I'm not _quite_ sure what "guest initiated" means.  But SEV and TDX
> > >> don't require an ioremap hook like this.  So, even if they *are* "guest
> > >> initiated", the question still remains how they work without this patch,
> > >> or what they are missing without it.
> > >
> > > Maybe I misunderstood your question earlier. Are you concerned about guests
> > > invoking any MMIO region specific calls in the ioremap path or passing
> > > that information to the host ?
> >
> > My concern is that I don't know why this patch is here.  There should be
> > a very simple answer to the question: Why does RISC-V need this patch
> > but x86 does not?
> >
> > > Earlier, I assumed the former but it seems you are also concerned
> > > about the latter as well. Sorry for the confusion in that case.
> > > The guest initiation is necessary while the host notification can be
> > > made optional.
> > > The "guest initiated" means the guest tells the TSM (equivalent of TDX
> > > module in RISC-V) the MMIO region details.
> > > The TSM keeps a track of this and any page faults that happen in that
> > > region are forwarded
> > > to the host by the TSM after the instruction decoding. Thus TSM can
> > > make sure that only ioremapped regions are
> > > considered MMIO regions. Otherwise, all memory outside the guest
> > > physical region will be considered as the MMIO region.
> >
> > Ahh, OK, that's a familiar problem.  I see the connection to device
> > filtering now.
> >
> > Is this functionality in the current set?  I went looking for it and all
> > I found was the host notification side.
> >
>
> The current series doesn't have the guest filtering feature enabled.
> However, we implemented guest filtering and is maintained in a separate tree
>
> https://github.com/rivosinc/linux/tree/cove-integration-device-filtering
>
> We did not include those in this series because the tdx patches are
> still undergoing
> development. We are planning to rebase our changes once the revised
> patches are available.
>
> > Is this the only mechanism by which the guest tells the TSM which parts
> > of the guest physical address space can be exposed to the host?
> >
>
> This is the current approach defined in CoVE spec. Guest informs about both
> shared memory & mmio regions via dedicated SBI calls (
> e.g sbi_covg_[add/remove]_mmio_region and
> sbi_covg_[share/unshare]_memory_region)
>
> > For TDX and SEV, that information is inferred from a bit in the page
> > tables.  Essentially, there are dedicated guest physical addresses that
> > tell the hardware how to treat the mappings: should the secure page
> > tables or the host's EPT/NPT be consulted?
> >
>
> Yes. We don't have such a mechanism defined in CoVE yet.
> Having said that, there is nothing in ISA to prevent that and it is doable.
> Some specific bits in the PTE entry can also be used to encode for
> shared & mmio physical memory addresses.
> The TSM implementation will probably need to implement a software page
> walker in that case.

We can certainly use the PTE bits defined by the Svpbmt extension to
differentiate between IO and memory. Also, we can use the PTE
SW bits to differentiate between shared and non-shared memory.

>
> Are there any performance advantages between the two approaches ?
> As per my understanding, we are saving some boot time privilege
> transitions & less ABIs but
> adds the cost of software walk at runtime faults.

Performance-wise, both approaches will be the same, because with the
PTE-based approach the TSM can map the shared memory on demand and do
the software walk upon first access.

>
> > If that mechanism is different for RISC-V, it would go a long way to
> > explaining why RISC-V needs this patch.
> >
> > > In the current CoVE implementation, that MMIO region information is also
> > > passed to the host to provide additional flexibility. The host may
> > > choose to do additional
> > > sanity check and bail if the fault address does not belong to
> > > requested MMIO regions without
> > > going to the userspace. This is purely an optimization and may not be mandatory.
> >
> > Makes sense, thanks for the explanation.
> >
> > >>> It can be a subset of the region's host provided the layout. The
> > >>> guest device filtering solution is based on this idea as well [1].
> > >>>
> > >>> [1] https://lore.kernel.org/all/20210930010511.3387967-1-sathyanarayanan.kuppuswamy@linux.intel.com/
> > >>
> > >> I don't really see the connection.  Even if that series was going
> > >> forward (I'm not sure it is) there is no ioremap hook there.  There's
> > >> also no guest->host communication in that series.  The guest doesn't
> > >> _tell_ the host where the MMIO is, it just declines to run code for
> > >> devices that it didn't expect to see.
> > >
> > > This is a recent version of the above series from tdx github. This is
> > > a WIP as well and has not been posted to
> > > the mailing list. Thus, it may be going under revisions as well.
> > > As per my understanding the above ioremap changes for TDX mark the
> > > ioremapped pages as shared.
> > > The guest->host communication happen in the #VE exception handler
> > > where the guest converts this to a hypercall by invoking TDG.VP.VMCALL
> > > with an EPT violation set. The host would emulate an MMIO address if
> > > it gets an VMCALL with EPT violation.
> > > Please correct me if I am wrong.
> >
> > Yeah, TDX does:
> >
> > 1. Guest MMIO access
> > 2. Guest #VE handler (if the access faults)
> > 3. Guest hypercall->host
> > 4. Host fixes the fault
> > 5. Hypercall returns, guest returns from #VE via IRET
> > 6. Guest retries MMIO instruction
> >
> > From what you said, RISC-V appears to do:
> >
> > 1. Guest MMIO access
> > 2. Host MMIO handler
> > 3. Host handles the fault, returns
> > 4. Guest retries MMIO instruction
> >
> > In other words, this mechanism does the same thing but short-circuits
> > the trip through #VE and the hypercall.
> >
>
> Yes. Thanks for summarizing the tdx approach.
>
> > What happens if this ioremap() hook is not in place?  Does the hardware
> > (or TSM) generate an exception like TDX gets?  If so, it's probably
> > possible to move this "notify the TSM" code to that exception handler
> > instead of needing an ioremap() hook.
> >
>
> We don't have a #VE like exception mechanism in RISC-V.
>
> > I'm not saying that it's _better_ to do that, but it would allow you to
> > get rid of this patch for now and get me to shut up. :)
> >
> > > As I said above, the objective here is to notify the TSM where the
> > > MMIO is. Notifying the host is just an optimization that we choose to
> > > add. In fact, in this series the KVM code doesn't do anything with
> > > that information. The commit text probably can be improved to clarify
> > > that.
> >
> > Just to close the loop here, please go take a look at
> > pgprot_decrypted().  That's where the x86 guest page table bit gets to
> > tell the hardware that the mapping might cause a #VE and is under the
> > control of the host.  That's the extent of what x86 does at ioremap() time.
> >
> > So, to summarize, we have:
> >
> > x86:
> > 1. Guest page table bit to mark shared (host) vs. private (guest)
> >    control
> > 2. #VE if there is a fault on a shared mapping to call into the host
> >
> > RISC-V:
> > 1. Guest->TSM call to mark MMIO vs. private
> > 2. Faults in the MMIO area are then transparent to the guest
> >
>
> Yup. This discussion raised a very valid design aspect of the CoVE spec.
> To summarize, we need to investigate whether using PTE bits instead of
> additional ABI
> for managing shared/confidential/ioremapped pages makes more sense.
>
> Thanks for putting up with my answers and the feedback :).

I think we should re-evaluate the PTE (or software walk) based approach
for CoVE spec. It is better to keep the CoVE spec as minimal as possible
and define SBI calls only if absolutely required.

>
> > That design difference would, indeed, help explain why this patch is
> > here.  I'm still not 100% convinced that the patch is *required*, but I
> > at least understand how we arrived here.

Regards,
Anup
Andrew Bresticker April 26, 2023, 1:55 p.m. UTC | #8
On Wed, Apr 26, 2023 at 6:30 AM Anup Patel <apatel@ventanamicro.com> wrote:
>
> On Wed, Apr 26, 2023 at 1:32 PM Atish Kumar Patra <atishp@rivosinc.com> wrote:
> >
> > On Tue, Apr 25, 2023 at 6:40 PM Dave Hansen <dave.hansen@intel.com> wrote:
> > >
> > > On 4/25/23 01:00, Atish Kumar Patra wrote:
> > > > On Mon, Apr 24, 2023 at 7:18 PM Dave Hansen <dave.hansen@intel.com> wrote:
> > > >> On 4/21/23 12:24, Atish Kumar Patra wrote:
> > > >> I'm not _quite_ sure what "guest initiated" means.  But SEV and TDX
> > > >> don't require an ioremap hook like this.  So, even if they *are* "guest
> > > >> initiated", the question still remains how they work without this patch,
> > > >> or what they are missing without it.
> > > >
> > > > Maybe I misunderstood your question earlier. Are you concerned about guests
> > > > invoking any MMIO region specific calls in the ioremap path or passing
> > > > that information to the host?
> > >
> > > My concern is that I don't know why this patch is here.  There should be
> > > a very simple answer to the question: Why does RISC-V need this patch
> > > but x86 does not?
> > >
> > > > Earlier, I assumed the former but it seems you are also concerned
> > > > about the latter as well. Sorry for the confusion in that case.
> > > > The guest initiation is necessary while the host notification can be
> > > > made optional.
> > > > The "guest initiated" means the guest tells the TSM (equivalent of TDX
> > > > module in RISC-V) the MMIO region details.
> > > > The TSM keeps track of this, and any page faults that happen in that
> > > > region are forwarded to the host by the TSM after instruction decoding.
> > > > Thus the TSM can make sure that only ioremapped regions are considered
> > > > MMIO regions. Otherwise, all memory outside the guest physical region
> > > > would be considered an MMIO region.
> > >
> > > Ahh, OK, that's a familiar problem.  I see the connection to device
> > > filtering now.
> > >
> > > Is this functionality in the current set?  I went looking for it and all
> > > I found was the host notification side.
> > >
> >
> > The current series doesn't have the guest filtering feature enabled.
> > However, we implemented guest filtering, and it is maintained in a separate tree:
> >
> > https://github.com/rivosinc/linux/tree/cove-integration-device-filtering
> >
> > We did not include those in this series because the tdx patches are
> > still undergoing
> > development. We are planning to rebase our changes once the revised
> > patches are available.
> >
> > > Is this the only mechanism by which the guest tells the TSM which parts
> > > of the guest physical address space can be exposed to the host?
> > >
> >
> > This is the current approach defined in the CoVE spec. The guest informs
> > the TSM about both shared memory and MMIO regions via dedicated SBI calls
> > (e.g. sbi_covg_[add/remove]_mmio_region and
> > sbi_covg_[share/unshare]_memory_region).
> >
> > > For TDX and SEV, that information is inferred from a bit in the page
> > > tables.  Essentially, there are dedicated guest physical addresses that
> > > tell the hardware how to treat the mappings: should the secure page
> > > tables or the host's EPT/NPT be consulted?
> > >
> >
> > Yes. We don't have such a mechanism defined in CoVE yet.
> > Having said that, there is nothing in the ISA to prevent it, and it is
> > doable. Specific bits in the PTE could also be used to encode shared and
> > MMIO physical memory addresses. The TSM implementation would probably
> > need to implement a software page walker in that case.
>
> We can certainly use PTE bits defined by the Svpbmt extension to
> differentiate between IO and memory. Also, we can use the PTE
> SW bits to differentiate between shared and non-shared memory.
>
> >
> > Are there any performance advantages to either approach?
> > As per my understanding, we save some boot-time privilege transitions
> > and need fewer ABIs, but add the cost of a software walk at runtime
> > faults.
>
> Performance wise both approaches will be the same because in
> case of PTE based approach, the TSM can on-demand map the
> shared memory and do software walk upon first access.

For MMIO sure, we can use Svpbmt or an RSW bit in the VS-stage PTE.
Performance-wise the difference is a few fetches from guest memory by
the TSM vs. a lookup by the TSM in an internal data structure.
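To make the RSW-bit idea concrete: the RISC-V privileged spec reserves PTE bits 8-9 (RSW) for software use, so a TSM doing a software walk of the guest's VS-stage page tables could test one of them to classify the faulting GPA as MMIO. A minimal sketch, where the choice of RSW bit 8 and the helper names are illustrative assumptions, not anything defined by the CoVE spec:

```c
#include <stdbool.h>
#include <stdint.h>

/* RISC-V Sv39/Sv48 PTE layout: bit 0 is V (valid); bits 8-9 are RSW,
 * reserved for software. Treating RSW bit 8 as an "MMIO" tag is a
 * made-up convention for illustration only. */
#define PTE_V        (1ULL << 0)
#define PTE_RSW_MMIO (1ULL << 8)

/* Guest side: tag a mapping as MMIO instead of issuing an SBI call. */
static inline uint64_t pte_mark_mmio(uint64_t pte)
{
	return pte | PTE_RSW_MMIO;
}

/* TSM side: check the tag during its software walk of the VS-stage
 * tables, forwarding the access to the host only when it is set. */
static inline bool pte_is_mmio(uint64_t pte)
{
	return (pte & PTE_V) && (pte & PTE_RSW_MMIO);
}
```

Under such a scheme the sbi_covg_[add/remove]_mmio_region calls would become unnecessary; the cost moves from boot-time ABI calls to the per-fault software walk discussed above.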

It's a little more complicated for shared <-> private conversion,
though. If we were to emulate what TDX does with separate Shared vs
Secure EPTs, we could use the MSB of the GPA to divide GPA space in
half between private vs shared. But then we need to enable the host to
reclaim the private pages on a private -> shared conversion: either
the TSM must track which parts of GPA space have been converted (which
gets complicated in the presence of hugepages), or we let the host
remove whatever private pages it wants. For the latter we'd then need
an "accept" flow -- we don't have a #VE equivalent on RISC-V, but I
suppose we could use access fault exceptions for this purpose.
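To make the GPA-split alternative concrete: in a TDX-like scheme, the top bit of the guest physical address selects between the private and shared halves, so a shared <-> private conversion is just a bit flip plus the accept/reclaim handling described above. The address width and bit position below are assumptions for illustration (TDX derives the shared bit from the configured GPA width; CoVE defines no such split today):

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical 48-bit guest physical address space whose top bit
 * selects the shared half, mirroring TDX's Shared-EPT split. Neither
 * the width nor the bit position comes from the CoVE spec. */
#define GPA_WIDTH       48
#define GPA_SHARED_BIT  (1ULL << (GPA_WIDTH - 1))

static inline uint64_t gpa_to_shared(uint64_t gpa)
{
	return gpa | GPA_SHARED_BIT;
}

static inline uint64_t gpa_to_private(uint64_t gpa)
{
	return gpa & ~GPA_SHARED_BIT;
}

static inline bool gpa_is_shared(uint64_t gpa)
{
	return !!(gpa & GPA_SHARED_BIT);
}
```

This mirrors what the thread describes for x86 via pgprot_decrypted(): the shared/private attribute lives in the address bits seen by the second-stage translation rather than in a TSM-side table.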

-Andrew
>
> >
> > > If that mechanism is different for RISC-V, it would go a long way to
> > > explaining why RISC-V needs this patch.
> > >
> > > > In the current CoVE implementation, that MMIO region information is also
> > > > passed to the host to provide additional flexibility. The host may
> > > > choose to do additional sanity checks and bail if the fault address
> > > > does not belong to the requested MMIO regions, without going to
> > > > userspace. This is purely an optimization and may not be mandatory.
> > >
> > > Makes sense, thanks for the explanation.
> > >
> > > >>> It can be a subset of the region's host provided the layout. The
> > > >>> guest device filtering solution is based on this idea as well [1].
> > > >>>
> > > >>> [1] https://lore.kernel.org/all/20210930010511.3387967-1-sathyanarayanan.kuppuswamy@linux.intel.com/
> > > >>
> > > >> I don't really see the connection.  Even if that series was going
> > > >> forward (I'm not sure it is) there is no ioremap hook there.  There's
> > > >> also no guest->host communication in that series.  The guest doesn't
> > > >> _tell_ the host where the MMIO is, it just declines to run code for
> > > >> devices that it didn't expect to see.
> > > >
> > > > This is a recent version of the above series from the TDX GitHub. It is
> > > > a WIP as well and has not been posted to the mailing list, so it may
> > > > still be undergoing revisions.
> > > > As per my understanding, the above ioremap changes for TDX mark the
> > > > ioremapped pages as shared. The guest->host communication happens in
> > > > the #VE exception handler, where the guest converts the fault to a
> > > > hypercall by invoking TDG.VP.VMCALL with an EPT violation set. The
> > > > host would emulate the MMIO access if it gets a VMCALL with an EPT
> > > > violation.
> > > > Please correct me if I am wrong.
> > >
> > > Yeah, TDX does:
> > >
> > > 1. Guest MMIO access
> > > 2. Guest #VE handler (if the access faults)
> > > 3. Guest hypercall->host
> > > 4. Host fixes the fault
> > > 5. Hypercall returns, guest returns from #VE via IRET
> > > 6. Guest retries MMIO instruction
> > >
> > > From what you said, RISC-V appears to do:
> > >
> > > 1. Guest MMIO access
> > > 2. Host MMIO handler
> > > 3. Host handles the fault, returns
> > > 4. Guest retries MMIO instruction
> > >
> > > In other words, this mechanism does the same thing but short-circuits
> > > the trip through #VE and the hypercall.
> > >
> >
> > Yes. Thanks for summarizing the tdx approach.
> >
> > > What happens if this ioremap() hook is not in place?  Does the hardware
> > > (or TSM) generate an exception like TDX gets?  If so, it's probably
> > > possible to move this "notify the TSM" code to that exception handler
> > > instead of needing an ioremap() hook.
> > >
> >
> > We don't have a #VE like exception mechanism in RISC-V.
> >
> > > I'm not saying that it's _better_ to do that, but it would allow you to
> > > get rid of this patch for now and get me to shut up. :)
> > >
> > > > As I said above, the objective here is to notify the TSM where the
> > > > MMIO is. Notifying the host is just an optimization that we choose to
> > > > add. In fact, in this series the KVM code doesn't do anything with
> > > > that information. The commit text probably can be improved to clarify
> > > > that.
> > >
> > > Just to close the loop here, please go take a look at
> > > pgprot_decrypted().  That's where the x86 guest page table bit gets to
> > > tell the hardware that the mapping might cause a #VE and is under the
> > > control of the host.  That's the extent of what x86 does at ioremap() time.
> > >
> > > So, to summarize, we have:
> > >
> > > x86:
> > > 1. Guest page table bit to mark shared (host) vs. private (guest)
> > >    control
> > > 2. #VE if there is a fault on a shared mapping to call into the host
> > >
> > > RISC-V:
> > > 1. Guest->TSM call to mark MMIO vs. private
> > > 2. Faults in the MMIO area are then transparent to the guest
> > >
> >
> > Yup. This discussion raised a very valid design aspect of the CoVE spec.
> > To summarize, we need to investigate whether using PTE bits instead of
> > additional ABI
> > for managing shared/confidential/ioremapped pages makes more sense.
> >
> > Thanks for putting up with my answers and the feedback :).
>
> I think we should re-evaluate the PTE (or software walk) based approach
> for CoVE spec. It is better to keep the CoVE spec as minimal as possible
> and define SBI calls only if absolutely required.
>
> >
> > > That design difference would, indeed, help explain why this patch is
> > > here.  I'm still not 100% convinced that the patch is *required*, but I
> > > at least understand how we arrived here.
>
> Regards,
> Anup
diff mbox series

Patch

diff --git a/arch/riscv/mm/Makefile b/arch/riscv/mm/Makefile
index 1fd9b60..721b557 100644
--- a/arch/riscv/mm/Makefile
+++ b/arch/riscv/mm/Makefile
@@ -15,6 +15,7 @@  obj-y += cacheflush.o
 obj-y += context.o
 obj-y += pgtable.o
 obj-y += pmem.o
+obj-y += ioremap.o
 
 ifeq ($(CONFIG_MMU),y)
 obj-$(CONFIG_SMP) += tlbflush.o
diff --git a/arch/riscv/mm/ioremap.c b/arch/riscv/mm/ioremap.c
new file mode 100644
index 0000000..0d4e026
--- /dev/null
+++ b/arch/riscv/mm/ioremap.c
@@ -0,0 +1,45 @@ 
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2023 Rivos Inc.
+ *
+ * Authors:
+ *     Rajnesh Kanwal <rkanwal@rivosinc.com>
+ */
+
+#include <linux/export.h>
+#include <linux/mm.h>
+#include <linux/vmalloc.h>
+#include <linux/io.h>
+#include <asm/covg_sbi.h>
+#include <asm/cove.h>
+#include <asm-generic/io.h>
+
+void ioremap_phys_range_hook(phys_addr_t addr, size_t size, pgprot_t prot)
+{
+	unsigned long offset;
+
+	if (!is_cove_guest())
+		return;
+
+	/* Page-align address and size. */
+	offset = addr & (~PAGE_MASK);
+	addr -= offset;
+	size = PAGE_ALIGN(size + offset);
+
+	sbi_covg_add_mmio_region(addr, size);
+}
+
+void iounmap_phys_range_hook(phys_addr_t addr, size_t size)
+{
+	unsigned long offset;
+
+	if (!is_cove_guest())
+		return;
+
+	/* Page-align address and size. */
+	offset = addr & (~PAGE_MASK);
+	addr -= offset;
+	size = PAGE_ALIGN(size + offset);
+
+	sbi_covg_remove_mmio_region(addr, size);
+}