[SRU,J,0/2] 5.15.0-85 live migration regression

Message ID 20230920022235.111762-1-chengen.du@canonical.com

Chengen Du Sept. 20, 2023, 2:22 a.m. UTC
BugLink: https://bugs.launchpad.net/bugs/2036675

SRU Justification:

[Impact]
The fixes introduced for LP#2032164, aimed at resolving a live migration issue, have unintentionally led to a regression.
Consequently, a previously functional live migration pattern now fails when tested with the 5.15.0-85 kernel from -proposed.

Specifically, live migration from a PKRU-enabled host running a kernel older than 5.15.0-85 to a host running the 5.15.0-85 kernel fails.
This happens regardless of whether the destination host has PKRU enabled.
In both scenarios the live migration fails, though in different ways: one hangs, while the other fails due to a PCID flag issue.
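
As a rough illustration, the failing pattern described above can be written as a predicate over the source and destination hosts. This is only a sketch: the names parse_abi and migration_affected are hypothetical, and it assumes the regression is present in exactly the 5.15.0-85 build, as reported here.

```python
import re

# Assumption: only the 5.15.0-85 build from -proposed carries the regression.
REGRESSED_ABI = (5, 15, 0, 85)


def parse_abi(release: str) -> tuple:
    """Turn a release string such as '5.15.0-85-generic' into (5, 15, 0, 85)."""
    m = re.match(r"(\d+)\.(\d+)\.(\d+)-(\d+)", release)
    if m is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return tuple(int(g) for g in m.groups())


def migration_affected(src_release: str, src_has_pkru: bool, dst_release: str) -> bool:
    """True when a host pair matches the failing pattern from the report.

    Destination PKRU support is deliberately not a parameter: the report
    says the migration fails either way.
    """
    return (
        src_has_pkru
        and parse_abi(src_release) < REGRESSED_ABI
        and parse_abi(dst_release) == REGRESSED_ABI
    )
```

For example, migration_affected("5.15.0-84-generic", True, "5.15.0-85-generic") evaluates to True, while the same call with a source host lacking PKRU evaluates to False.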

[Fix]
To address the regression introduced by the fixes for LP#2032164, we first revert the following commits.
A more comprehensive solution will follow.

commit fa9225d64f215e8109de10f6b6c7a08f033d0ec0
Author: Dr. David Alan Gilbert <dgilbert@redhat.com>
Date: Mon Aug 21 14:47:28 2023 +0800

    KVM: x86: Always enable legacy FP/SSE in allowed user XFEATURES

commit 27a189b881278c8ad9c16b0ee05668d724352733
Author: Leonardo Bras <leobras@redhat.com>
Date: Mon Aug 21 14:47:27 2023 +0800

    x86/kvm/fpu: Limit guest user_xfeatures to supported bits of XCR0
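
Going by the two commit subjects, the combined effect of the reverted changes can be sketched as bit arithmetic on the guest's XCR0 mask. This is an illustration only, not the kernel code: allowed_user_xfeatures is a hypothetical helper, and the bit positions are the architectural XCR0 ones (x87 = 0, SSE = 1, AVX = 2, PKRU = 9).

```python
# Architectural XCR0 bit positions.
XFEATURE_MASK_FP = 1 << 0
XFEATURE_MASK_SSE = 1 << 1
XFEATURE_MASK_YMM = 1 << 2
XFEATURE_MASK_PKRU = 1 << 9
XFEATURE_MASK_FPSSE = XFEATURE_MASK_FP | XFEATURE_MASK_SSE


def allowed_user_xfeatures(guest_supported_xcr0: int) -> int:
    """Hypothetical helper: limit the mask to the guest's XCR0 bits,
    but always keep the legacy FP/SSE bits set."""
    return guest_supported_xcr0 | XFEATURE_MASK_FPSSE


# A guest without PKRU (x87 + SSE + AVX only) ends up with a mask
# that does not contain the PKRU bit.
guest_xcr0 = XFEATURE_MASK_FP | XFEATURE_MASK_SSE | XFEATURE_MASK_YMM
mask = allowed_user_xfeatures(guest_xcr0)
print(hex(mask), bool(mask & XFEATURE_MASK_PKRU))
```

With these assumptions, a guest whose XCR0 mask is empty still gets FP and SSE, and a guest that lacks PKRU never has bit 9 in the resulting mask.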

[Test Plan]
With the reverts applied, the issue originally fixed in LP#2032164 will recur.
To reproduce it, follow these steps:
1. Set up two machines: one with PKRU support and the other without.
2. Start a guest that lacks PKRU support on the machine with PKRU support.
3. Use libvirt to migrate that guest to the machine that lacks PKRU support.
4. The following error appears on the destination machine:
KVM: entry failed, hardware error 0x80000021

If you're running a guest on an Intel machine without unrestricted mode
support, the failure can be most likely due to the guest entering an invalid
state for Intel VT. For example, the guest maybe running in big real mode
which is not supported on less recent Intel processors.

EAX=86cf7970 EBX=00000000 ECX=00000001 EDX=005b0036
ESI=00000087 EDI=00000087 EBP=87c03e38 ESP=87c03e18
EIP=86cf7d5e EFL=00000246 [---Z-P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0000 00000000 0000ffff 00009300
CS =f000 ffff0000 0000ffff 00009b00
SS =0000 00000000 0000ffff 00009300
DS =0000 00000000 0000ffff 00009300
FS =0000 00000000 0000ffff 00009300
GS =0000 00000000 0000ffff 00009300
LDT=0000 00000000 0000ffff 00008200
TR =0000 00000000 0000ffff 00008b00
GDT= 00000000 0000ffff
IDT= 00000000 0000ffff
CR0=60000010 CR2=00000000 CR3=00000000 CR4=00000000
DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
DR6=00000000ffff0ff0 DR7=0000000000000400
EFER=0000000000000000
Code=00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 <00> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
2023-07-09T03:03:14.911750Z qemu-system-x86_64: terminating on signal 15 from pid 4134 (/usr/sbin/libvirtd)
2023-07-09 03:03:15.312+0000: shutting down, reason=destroyed
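
For step 1, one way to tell which machine has PKRU support is to look for the pku flag in /proc/cpuinfo. The sketch below assumes a Linux x86 host; the helper names cpu_flags and has_pku are hypothetical.

```python
def cpu_flags(cpuinfo_text: str) -> set:
    """Collect the flag names from the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def has_pku(cpuinfo_text: str) -> bool:
    """True when the CPU advertises protection keys (the 'pku' flag)."""
    return "pku" in cpu_flags(cpuinfo_text)
```

On each host, has_pku(open("/proc/cpuinfo").read()) then reports whether the machine is the PKRU-capable one.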

[Where problems could occur]
Reverting the commits restores the original behavior,
so the issue from LP#2032164 will be present again.

Chengen Du (2):
  Revert "KVM: x86: Always enable legacy FP/SSE in allowed user
    XFEATURES"
  Revert "x86/kvm/fpu: Limit guest user_xfeatures to supported bits of
    XCR0"

 arch/x86/kvm/cpuid.c | 8 --------
 1 file changed, 8 deletions(-)

Comments

Stefan Bader Sept. 20, 2023, 7:01 a.m. UTC | #1
Acked-by: Stefan Bader <stefan.bader@canonical.com>
Roxana Nicolescu Sept. 20, 2023, 7:06 a.m. UTC | #2
Acked-by: Roxana Nicolescu <roxana.nicolescu@canonical.com>
Stefan Bader Sept. 20, 2023, 7:20 a.m. UTC | #3
Applied to jammy:linux/master-prep (in preparation for re-spin). Thanks.

-Stefan