[SRU,J,I,F,v2,0/2] rcu stalls with many storage key guests (LP: 1975582)

Message ID 20220610125502.462958-1-frank.heimes@canonical.com

Message

Frank Heimes June 10, 2022, 12:55 p.m. UTC
BugLink: https://bugs.launchpad.net/bugs/1975582

SRU Justification:

[Impact] 

 * Ubuntu on s390x KVM environments with lots of large guests with storage
   keys can be affected by rcu stalls.

 * These rcu stalls can cause the system to crash/dump.

[Fix]

 * 3ae11dbcfac9 3ae11dbcfac906a8c3a480e98660a823130dc16a "s390/mm: use non-quiescing sske for KVM switch to keyed guest"

 * 6d5946274df1 6d5946274df1fff539a7eece458a43be733d1db8 "s390/gmap: voluntarily schedule during key setting"

[Test Plan]

 * There is no trigger or direct test or re-creation of the 
   problem situation possible, but...

 * an IBM z13 or LinuxONE (or newer) LPAR is needed that
   runs Ubuntu Server 20.04 LTS or 18.04 LTS with HWE kernel
   and acts as KVM host with several large guests running
   on top with storage keys.

 * Let such a system run for days under significant load
   and watch the logs for rcu issues.

 * Prior to the submission of this SRU, patched test kernels
   for focal 5.4 and bionic hwe-5.4 were created and tested.
   They ran for days in a staging environment at IBM
   without further issues.

 * The modifications are all limited to s390x.

 * A test kernel was built (see below) that ran in a test environment
   at IBM under appropriate load for several days.
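
   Watching the logs for rcu issues during such a multi-day run can be
   partly automated. The following is only a minimal sketch and not part
   of the patches: the function name find_rcu_stalls and the regular
   expression are assumptions, based on the usual shape of the kernel's
   stall warnings (for example "rcu: INFO: rcu_sched self-detected stall
   on CPU"). Adjust the pattern to what the kernel under test actually
   prints.

```python
import re
from typing import Iterable, List

# Assumed pattern: a word starting with "rcu" followed later on the same
# line by "stall" or "stalls". This is a heuristic, not an exhaustive list
# of kernel messages.
RCU_STALL = re.compile(r"\brcu\w*\b.*\bstalls?\b", re.IGNORECASE)


def find_rcu_stalls(lines: Iterable[str]) -> List[str]:
    """Return the log lines that look like RCU stall warnings."""
    return [line.rstrip("\n") for line in lines if RCU_STALL.search(line)]
```

   The function accepts any iterable of lines, so it can be fed an open
   log file or the captured output of a command such as "journalctl -k".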

[Where problems could occur]

 * The KVM switch to keyed guest now uses non-quiescing sske
   instead of classic sske, so the KVM behaviour might change
   and the storage keys might be harmed.

 * The now more generous scheduling while setting keys
   has an impact on the guest memory management and mapping
   which will lead to a different performance.

 * This, with the introduction of __s390_enable_skey_pmd and
   cond_resched, might increase the overhead in certain situations,
   but eventually improves the responsiveness over time,
   and hence avoids rcu stalls.

[Other Info]
 
 * Since the patches are upstream in 5.19-rc1,
   they will be included in the kernel that is planned for kinetic (5.19).

 * Hence this is an SRU to jammy, impish and focal.

v2: since this SRU is not only for J, but also for I and F

Christian Borntraeger (2):
  s390/gmap: voluntarily schedule during key setting
  s390/mm: use non-quiescing sske for KVM switch to keyed guest

 arch/s390/mm/gmap.c    | 14 ++++++++++++++
 arch/s390/mm/pgtable.c |  2 +-
 2 files changed, 15 insertions(+), 1 deletion(-)

Comments

Tim Gardner June 10, 2022, 2:18 p.m. UTC | #1
Acked-by: Tim Gardner <tim.gardner@canonical.com>

Stefan Bader June 15, 2022, 7:50 a.m. UTC | #2

For Impish, there is a chance that this will not make it. There is only one 
cycle until EOL, so if this is important it would be good if you explicitly 
mentioned this (as a reply here).

Acked-by: Stefan Bader <stefan.bader@canonical.com>
Frank Heimes June 15, 2022, 7:57 a.m. UTC | #3
No, it is not super important for Impish (but it is for F and J), knowing
that I reaches its end of life soon.
I just did the SRU this way (incl. Impish) since it's recommended by the
SRU process to patch without gaps (to avoid any potential regressions on
updates).
If Impish reaches its EOL in the meantime, that's fine ...


Stefan Bader June 21, 2022, 5:25 p.m. UTC | #4

Applied to jammy,impish,focal:linux/master-next. For Impish this may or may not 
be released or moved into hwe-5.13. Thanks.

-Stefan