diff mbox series

[RFC] KVM: PPC: Book3s HV: Allow setting GTSE for the nested guest

Message ID 20220304182657.2489303-1-farosas@linux.ibm.com
State New
Headers show
Series [RFC] KVM: PPC: Book3s HV: Allow setting GTSE for the nested guest | expand

Commit Message

Fabiano Rosas March 4, 2022, 6:26 p.m. UTC
We're currently getting a Program Interrupt inside the nested guest
kernel when running with GTSE disabled in the nested hypervisor. We
allow any guest a cmdline override of GTSE for migration purposes. The
nested guest does not know it needs to use the option and tries to run
'tlbie' with LPCR_GTSE=0.

The details are a bit more intricate:

QEMU always sets GTSE=1 in OV5 even before calling KVM. At prom_init,
guests use the OV5 value to set MMU_FTR_GTSE. This setting can be
overridden by 'radix_hcall_invalidate=on' in the kernel cmdline. The
option itself depends on the availability of
FW_FEATURE_RPT_INVALIDATE, which it tied to QEMU's cap-rpt-invalidate
capability.

The MMU_FTR_GTSE flag leads guests to set PROC_TABLE_GTSE in their
process tables and after H_REGISTER_PROC_TBL, both QEMU and KVM will
set LPCR_GTSE=1 for that guest. Unless the guest uses the cmdline
override, in which case:

  MMU_FTR_GTSE=0 -> PROC_TABLE_GTSE=0 -> LPCR_GTSE=0

We don't allow the nested hypervisor to set some LPCR bits for its
nested guests, so if the nested HV has LPCR_GTSE=0, its nested guests
will also have LPCR_GTSE=0. But since the only thing that can really
flip GTSE is the cmdline override, if a nested guest runs without it,
then the sequence goes:

  MMU_FTR_GTSE=1 -> PROC_TABLE_GTSE=1 -> LPCR_GTSE=0.

With LPCR_GTSE=0 the HW will treat 'tlbie' as HV privileged.

This patch allows a nested HV to set LPCR_GTSE for its nested guests
so the LPCR setting will match what the nested guest sees in OV5.

Signed-off-by: Fabiano Rosas <farosas@linux.ibm.com>
---
 arch/powerpc/kvm/book3s_hv_nested.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Aneesh Kumar K V March 7, 2022, 5:58 a.m. UTC | #1
Fabiano Rosas <farosas@linux.ibm.com> writes:

> We're currently getting a Program Interrupt inside the nested guest
> kernel when running with GTSE disabled in the nested hypervisor. We
> allow any guest a cmdline override of GTSE for migration purposes. The
> nested guest does not know it needs to use the option and tries to run
> 'tlbie' with LPCR_GTSE=0.
>
> The details are a bit more intricate:
>
> QEMU always sets GTSE=1 in OV5 even before calling KVM. At prom_init,
> guests use the OV5 value to set MMU_FTR_GTSE. This setting can be
> overridden by 'radix_hcall_invalidate=on' in the kernel cmdline. The
> option itself depends on the availability of
> FW_FEATURE_RPT_INVALIDATE, which it tied to QEMU's cap-rpt-invalidate
> capability.
>
> The MMU_FTR_GTSE flag leads guests to set PROC_TABLE_GTSE in their
> process tables and after H_REGISTER_PROC_TBL, both QEMU and KVM will
> set LPCR_GTSE=1 for that guest. Unless the guest uses the cmdline
> override, in which case:
>
>   MMU_FTR_GTSE=0 -> PROC_TABLE_GTSE=0 -> LPCR_GTSE=0
>
> We don't allow the nested hypervisor to set some LPCR bits for its
> nested guests, so if the nested HV has LPCR_GTSE=0, its nested guests
> will also have LPCR_GTSE=0. But since the only thing that can really
> flip GTSE is the cmdline override, if a nested guest runs without it,
> then the sequence goes:
>
>   MMU_FTR_GTSE=1 -> PROC_TABLE_GTSE=1 -> LPCR_GTSE=0.
>
> With LPCR_GTSE=0 the HW will treat 'tlbie' as HV privileged.
>
> This patch allows a nested HV to set LPCR_GTSE for its nested guests
> so the LPCR setting will match what the nested guest sees in OV5.

This needs a Fixes: tag?

I am not sure what is broken. If L1 doesn't support GTSE, then it should
publish the same to L2 and L2 should not use tlbie. That was working
before? Or is it that the kernel command to disable gtse when used with
L2 kernel is broken? 

>
> Signed-off-by: Fabiano Rosas <farosas@linux.ibm.com>
> ---
>  arch/powerpc/kvm/book3s_hv_nested.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/powerpc/kvm/book3s_hv_nested.c b/arch/powerpc/kvm/book3s_hv_nested.c
> index 9d373f8963ee..5b9008d89f90 100644
> --- a/arch/powerpc/kvm/book3s_hv_nested.c
> +++ b/arch/powerpc/kvm/book3s_hv_nested.c
> @@ -262,7 +262,7 @@ static void load_l2_hv_regs(struct kvm_vcpu *vcpu,
>  	 * Don't let L1 change LPCR bits for the L2 except these:
>  	 */
>  	mask = LPCR_DPFD | LPCR_ILE | LPCR_TC | LPCR_AIL | LPCR_LD |
> -		LPCR_LPES | LPCR_MER;
> +		LPCR_LPES | LPCR_MER | LPCR_GTSE;
>  
>  	/*
>  	 * Additional filtering is required depending on hardware
> -- 
> 2.34.1
Fabiano Rosas March 7, 2022, 3:17 p.m. UTC | #2
"Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:

> Fabiano Rosas <farosas@linux.ibm.com> writes:
>
>> We're currently getting a Program Interrupt inside the nested guest
>> kernel when running with GTSE disabled in the nested hypervisor. We
>> allow any guest a cmdline override of GTSE for migration purposes. The
>> nested guest does not know it needs to use the option and tries to run
>> 'tlbie' with LPCR_GTSE=0.
>>
>> The details are a bit more intricate:
>>
>> QEMU always sets GTSE=1 in OV5 even before calling KVM. At prom_init,
>> guests use the OV5 value to set MMU_FTR_GTSE. This setting can be
>> overridden by 'radix_hcall_invalidate=on' in the kernel cmdline. The
>> option itself depends on the availability of
>> FW_FEATURE_RPT_INVALIDATE, which it tied to QEMU's cap-rpt-invalidate
>> capability.
>>
>> The MMU_FTR_GTSE flag leads guests to set PROC_TABLE_GTSE in their
>> process tables and after H_REGISTER_PROC_TBL, both QEMU and KVM will
>> set LPCR_GTSE=1 for that guest. Unless the guest uses the cmdline
>> override, in which case:
>>
>>   MMU_FTR_GTSE=0 -> PROC_TABLE_GTSE=0 -> LPCR_GTSE=0
>>
>> We don't allow the nested hypervisor to set some LPCR bits for its
>> nested guests, so if the nested HV has LPCR_GTSE=0, its nested guests
>> will also have LPCR_GTSE=0. But since the only thing that can really
>> flip GTSE is the cmdline override, if a nested guest runs without it,
>> then the sequence goes:
>>
>>   MMU_FTR_GTSE=1 -> PROC_TABLE_GTSE=1 -> LPCR_GTSE=0.
>>
>> With LPCR_GTSE=0 the HW will treat 'tlbie' as HV privileged.
>>
>> This patch allows a nested HV to set LPCR_GTSE for its nested guests
>> so the LPCR setting will match what the nested guest sees in OV5.
>
> This needs a Fixes: tag?

This feature was done in many pieces, I think it will end up being the
commit that enabled the H_RPT_INVALIDATE capability:

Fixes: b87cc116c7e1 ("KVM: PPC: Book3S HV: Add KVM_CAP_PPC_RPT_INVALIDATE capability")

> I am not sure what is broken. If L1 doesn't support GTSE, then it should
> publish the same to L2 and L2 should not use tlbie.

L1 cannot set L2's LPCR to the correct value because L0 will not allow
it. That is what this patch is changing.

I looked into having QEMU set the proper values to use with CAS, but
that is done in QEMU too early, before the first dispatch of L2 (which
is when L0 decides that L1 is not allowed to modify some bits). So QEMU
always advertises GTSE=1.

> That was working before? Or is it that the kernel command to disable
> gtse when used with L2 kernel is broken?

The command line works, but it needs to be explicitly given when
starting the L2. There is no link between L1 and L2 when it comes to
GTSE aside from the LPCR value L1 chose to use. So L2 can start with no
command line at all, while L1 had GTSE disabled.

AFAICT, this has always been broken. The stack leading to this is:

NIP [c00000000008615c] radix__flush_tlb_kernel_range+0x13c/0x420
[c000000000075840] change_page_attr+0xb0/0x240
[c00000000044624c] __apply_to_page_range+0x5ac/0xb90
[c000000000075bbc] change_memory_attr+0x7c/0x150
[c000000000350390] bpf_prog_select_runtime+0x200/0x290
[c000000000d9400c] bpf_migrate_filter+0x18c/0x1e0
[c000000000d95f38] bpf_prog_create+0x178/0x1d0
[c00000000130e4f4] ptp_classifier_init+0x4c/0x80
[c00000000130d874] sock_init+0xe0/0x100
[c0000000000121e0] do_one_initcall+0x60/0x2c0
[c0000000012b48cc] kernel_init_freeable+0x33c/0x3dc
[c0000000000127c8] kernel_init+0x44/0x18c
[c00000000000ce64] ret_from_kernel_thread+0x5c/0x64

>>
>> Signed-off-by: Fabiano Rosas <farosas@linux.ibm.com>
>> ---
>>  arch/powerpc/kvm/book3s_hv_nested.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/arch/powerpc/kvm/book3s_hv_nested.c b/arch/powerpc/kvm/book3s_hv_nested.c
>> index 9d373f8963ee..5b9008d89f90 100644
>> --- a/arch/powerpc/kvm/book3s_hv_nested.c
>> +++ b/arch/powerpc/kvm/book3s_hv_nested.c
>> @@ -262,7 +262,7 @@ static void load_l2_hv_regs(struct kvm_vcpu *vcpu,
>>  	 * Don't let L1 change LPCR bits for the L2 except these:
>>  	 */
>>  	mask = LPCR_DPFD | LPCR_ILE | LPCR_TC | LPCR_AIL | LPCR_LD |
>> -		LPCR_LPES | LPCR_MER;
>> +		LPCR_LPES | LPCR_MER | LPCR_GTSE;
>>  
>>  	/*
>>  	 * Additional filtering is required depending on hardware
>> -- 
>> 2.34.1
diff mbox series

Patch

diff --git a/arch/powerpc/kvm/book3s_hv_nested.c b/arch/powerpc/kvm/book3s_hv_nested.c
index 9d373f8963ee..5b9008d89f90 100644
--- a/arch/powerpc/kvm/book3s_hv_nested.c
+++ b/arch/powerpc/kvm/book3s_hv_nested.c
@@ -262,7 +262,7 @@  static void load_l2_hv_regs(struct kvm_vcpu *vcpu,
 	 * Don't let L1 change LPCR bits for the L2 except these:
 	 */
 	mask = LPCR_DPFD | LPCR_ILE | LPCR_TC | LPCR_AIL | LPCR_LD |
-		LPCR_LPES | LPCR_MER;
+		LPCR_LPES | LPCR_MER | LPCR_GTSE;
 
 	/*
 	 * Additional filtering is required depending on hardware