Message ID | 20220614135414.37746-5-ldufour@linux.ibm.com (mailing list archive) |
---|---|
State | Superseded |
Series | Extending NMI watchdog during LPM |

Context | Check | Description |
---|---|---|
snowpatch_ozlabs/github-powerpc_kernel_qemu | fail | kernel (g5_defconfig, korg-5.5.0, /linux/arch/powerpc/configs/g5-qemu.config) failed at step build. |
snowpatch_ozlabs/github-powerpc_ppctests | success | Successfully ran 10 jobs. |
snowpatch_ozlabs/github-powerpc_sparse | fail | 2 of 4 jobs failed. |
snowpatch_ozlabs/github-powerpc_selftests | success | Successfully ran 10 jobs. |
snowpatch_ozlabs/github-powerpc_clang | success | Successfully ran 7 jobs. |

Laurent Dufour <ldufour@linux.ibm.com> writes:
> diff --git a/arch/powerpc/platforms/pseries/mobility.c b/arch/powerpc/platforms/pseries/mobility.c
> index 179bbd4ae881..4284ceaf9060 100644
> --- a/arch/powerpc/platforms/pseries/mobility.c
> +++ b/arch/powerpc/platforms/pseries/mobility.c
> @@ -48,6 +48,39 @@ struct update_props_workarea {
>  #define MIGRATION_SCOPE (1)
>  #define PRRN_SCOPE -2
>
> +#ifdef CONFIG_PPC_WATCHDOG
> +static unsigned int lpm_nmi_wd_factor = 200;
> +
> +#ifdef CONFIG_SYSCTL
> +static struct ctl_table lpm_nmi_wd_factor_ctl_table[] = {
> +	{
> +		.procname = "lpm_nmi_watchdog_factor",

Assuming the basic idea is acceptable, I suggest making the user-visible
name more generic (e.g. "nmi_watchdog_factor") in case it makes sense to
apply this to other contexts in the future.

> +		.data = &lpm_nmi_wd_factor,
> +		.maxlen = sizeof(int),
> +		.mode = 0644,
> +		.proc_handler = proc_douintvec_minmax,
> +	},
> +	{}
> +};
> +static struct ctl_table lpm_nmi_wd_factor_sysctl_root[] = {
> +	{
> +		.procname = "kernel",
> +		.mode = 0555,
> +		.child = lpm_nmi_wd_factor_ctl_table,
> +	},
> +	{}
> +};
> +
> +static int __init register_lpm_nmi_wd_factor_sysctl(void)
> +{
> +	register_sysctl_table(lpm_nmi_wd_factor_sysctl_root);
> +
> +	return 0;
> +}
> +device_initcall(register_lpm_nmi_wd_factor_sysctl);
> +#endif /* CONFIG_SYSCTL */
> +#endif /* CONFIG_PPC_WATCHDOG */
> +
>  static int mobility_rtas_call(int token, char *buf, s32 scope)
>  {
>  	int rc;
> @@ -702,6 +735,7 @@ static int pseries_suspend(u64 handle)
>  static int pseries_migrate_partition(u64 handle)
>  {
>  	int ret;
> +	unsigned int factor = lpm_nmi_wd_factor;
>
>  	ret = wait_for_vasi_session_suspending(handle);
>  	if (ret)
> @@ -709,6 +743,13 @@ static int pseries_migrate_partition(u64 handle)
>
>  	vas_migration_handler(VAS_SUSPEND);
>
> +#ifdef CONFIG_PPC_WATCHDOG
> +	if (factor) {
> +		pr_info("Set the NMI watchdog factor to %u%%\n", factor);
> +		watchdog_nmi_set_lpm_factor(factor);
> +	}
> +#endif /* CONFIG_PPC_WATCHDOG */
> +
>  	ret = pseries_suspend(handle);
>  	if (ret == 0) {
>  		post_mobility_fixup();
> @@ -716,6 +757,13 @@ static int pseries_migrate_partition(u64 handle)
>  	} else
>  		pseries_cancel_migration(handle, ret);
>
> +#ifdef CONFIG_PPC_WATCHDOG
> +	if (factor) {
> +		pr_info("Restoring NMI watchdog timer\n");
> +		watchdog_nmi_set_lpm_factor(0);
> +	}
> +#endif /* CONFIG_PPC_WATCHDOG */
> +

A couple more suggestions:

* Move the prints into a single statement in watchdog_nmi_set_lpm_factor().

* Add no-op versions of watchdog_nmi_set_lpm_factor for
  !CONFIG_PPC_WATCHDOG so we can minimize the #ifdef here.

Otherwise this looks fine to me.

On 23/06/2022, 19:28:34, Nathan Lynch wrote:
> Laurent Dufour <ldufour@linux.ibm.com> writes:
>> diff --git a/arch/powerpc/platforms/pseries/mobility.c b/arch/powerpc/platforms/pseries/mobility.c
>> index 179bbd4ae881..4284ceaf9060 100644
>> --- a/arch/powerpc/platforms/pseries/mobility.c
>> +++ b/arch/powerpc/platforms/pseries/mobility.c
>> @@ -48,6 +48,39 @@ struct update_props_workarea {
>>  #define MIGRATION_SCOPE (1)
>>  #define PRRN_SCOPE -2
>>
>> +#ifdef CONFIG_PPC_WATCHDOG
>> +static unsigned int lpm_nmi_wd_factor = 200;
>> +
>> +#ifdef CONFIG_SYSCTL
>> +static struct ctl_table lpm_nmi_wd_factor_ctl_table[] = {
>> +	{
>> +		.procname = "lpm_nmi_watchdog_factor",
>
> Assuming the basic idea is acceptable, I suggest making the user-visible
> name more generic (e.g. "nmi_watchdog_factor") in case it makes sense to
> apply this to other contexts in the future.

Fair enough, indeed, I was wondering if "lpm" is meaningful.

>
>> +		.data = &lpm_nmi_wd_factor,
>> +		.maxlen = sizeof(int),
>> +		.mode = 0644,
>> +		.proc_handler = proc_douintvec_minmax,
>> +	},
>> +	{}
>> +};
>> +static struct ctl_table lpm_nmi_wd_factor_sysctl_root[] = {
>> +	{
>> +		.procname = "kernel",
>> +		.mode = 0555,
>> +		.child = lpm_nmi_wd_factor_ctl_table,
>> +	},
>> +	{}
>> +};
>> +
>> +static int __init register_lpm_nmi_wd_factor_sysctl(void)
>> +{
>> +	register_sysctl_table(lpm_nmi_wd_factor_sysctl_root);
>> +
>> +	return 0;
>> +}
>> +device_initcall(register_lpm_nmi_wd_factor_sysctl);
>> +#endif /* CONFIG_SYSCTL */
>> +#endif /* CONFIG_PPC_WATCHDOG */
>> +
>>  static int mobility_rtas_call(int token, char *buf, s32 scope)
>>  {
>>  	int rc;
>> @@ -702,6 +735,7 @@ static int pseries_suspend(u64 handle)
>>  static int pseries_migrate_partition(u64 handle)
>>  {
>>  	int ret;
>> +	unsigned int factor = lpm_nmi_wd_factor;
>>
>>  	ret = wait_for_vasi_session_suspending(handle);
>>  	if (ret)
>> @@ -709,6 +743,13 @@ static int pseries_migrate_partition(u64 handle)
>>
>>  	vas_migration_handler(VAS_SUSPEND);
>>
>> +#ifdef CONFIG_PPC_WATCHDOG
>> +	if (factor) {
>> +		pr_info("Set the NMI watchdog factor to %u%%\n", factor);
>> +		watchdog_nmi_set_lpm_factor(factor);
>> +	}
>> +#endif /* CONFIG_PPC_WATCHDOG */
>> +
>>  	ret = pseries_suspend(handle);
>>  	if (ret == 0) {
>>  		post_mobility_fixup();
>> @@ -716,6 +757,13 @@ static int pseries_migrate_partition(u64 handle)
>>  	} else
>>  		pseries_cancel_migration(handle, ret);
>>
>> +#ifdef CONFIG_PPC_WATCHDOG
>> +	if (factor) {
>> +		pr_info("Restoring NMI watchdog timer\n");
>> +		watchdog_nmi_set_lpm_factor(0);
>> +	}
>> +#endif /* CONFIG_PPC_WATCHDOG */
>> +
>
> A couple more suggestions:
>
> * Move the prints into a single statement in watchdog_nmi_set_lpm_factor().

You're right that sounds a better place.

>
> * Add no-op versions of watchdog_nmi_set_lpm_factor for
>   !CONFIG_PPC_WATCHDOG so we can minimize the #ifdef here.
>

Furthermore, this breaks compilation when !CONFIG_PPC_WATCHDOG because
lpm_nmi_wd_factor is not defined.

I'll rework that part.

> Otherwise this looks fine to me.

Thanks,
Laurent.
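
The two review suggestions above (a single print inside watchdog_nmi_set_lpm_factor() and a no-op version for !CONFIG_PPC_WATCHDOG) can be illustrated with a minimal userspace sketch. This is not the reworked patch: printf() stands in for pr_info(), and mock_wd_factor, mock_wd_factor_read(), migrate_sketch() and migrate_sketch_example() are hypothetical names introduced only for this illustration. Only watchdog_nmi_set_lpm_factor and CONFIG_PPC_WATCHDOG come from the thread.

```c
#include <assert.h>
#include <stdio.h>

/* Comment this out to build the no-op variant; the caller stays unchanged. */
#define CONFIG_PPC_WATCHDOG 1

#ifdef CONFIG_PPC_WATCHDOG
static unsigned int mock_wd_factor;	/* mock state standing in for the real watchdog */

/* Userspace mock: the state change and the single log statement live here. */
static void watchdog_nmi_set_lpm_factor(unsigned int factor)
{
	mock_wd_factor = factor;
	printf("NMI watchdog factor set to %u%%\n", factor);
}

static unsigned int mock_wd_factor_read(void)
{
	return mock_wd_factor;
}
#else
/* No-op versions, so callers need no #ifdef of their own. */
static inline void watchdog_nmi_set_lpm_factor(unsigned int factor)
{
	(void)factor;
}

static inline unsigned int mock_wd_factor_read(void)
{
	return 0;
}
#endif

/* Hypothetical caller: returns the factor in effect during the "suspend" step. */
static unsigned int migrate_sketch(unsigned int factor)
{
	unsigned int during;

	watchdog_nmi_set_lpm_factor(factor);
	during = mock_wd_factor_read();	/* stands in for pseries_suspend() */
	watchdog_nmi_set_lpm_factor(0);

	return during;
}

static void migrate_sketch_example(void)
{
	migrate_sketch(200);
	/* Whichever variant is built, the factor is back to 0 afterwards. */
	assert(mock_wd_factor_read() == 0);
}
```

With this shape, the caller compiles identically whether or not CONFIG_PPC_WATCHDOG is defined, which is the point of the second suggestion. Reading the sysctl-backed variable behind a helper would avoid the build break mentioned in the reply in the same way.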

diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst
index ddccd1077462..53701ed671de 100644
--- a/Documentation/admin-guide/sysctl/kernel.rst
+++ b/Documentation/admin-guide/sysctl/kernel.rst
@@ -485,6 +485,18 @@ When ``kptr_restrict`` is set to 2, kernel pointers printed using
 %pK will be replaced with 0s regardless of privileges.
 
 
+lpm_nmi_watchdog_factor (PPC only)
+==================================
+
+Factor apply to to the NMI watchdog timeout (only when ``nmi_watchdog`` is
+set to 1). This factor represents the percentage added to
+``watchdog_thresh`` when calculating the NMI watchdog timeout during a
+LPM. The soft lockup timeout is not impacted.
+
+A value of 0 means no change. The default value is 200 meaning the NMI
+watchdog is set to 30s (based on ``watchdog_thresh`` equal to 10).
+
+
 modprobe
 ========
 
diff --git a/arch/powerpc/platforms/pseries/mobility.c b/arch/powerpc/platforms/pseries/mobility.c
index 179bbd4ae881..4284ceaf9060 100644
--- a/arch/powerpc/platforms/pseries/mobility.c
+++ b/arch/powerpc/platforms/pseries/mobility.c
@@ -48,6 +48,39 @@ struct update_props_workarea {
 #define MIGRATION_SCOPE (1)
 #define PRRN_SCOPE -2
 
+#ifdef CONFIG_PPC_WATCHDOG
+static unsigned int lpm_nmi_wd_factor = 200;
+
+#ifdef CONFIG_SYSCTL
+static struct ctl_table lpm_nmi_wd_factor_ctl_table[] = {
+	{
+		.procname = "lpm_nmi_watchdog_factor",
+		.data = &lpm_nmi_wd_factor,
+		.maxlen = sizeof(int),
+		.mode = 0644,
+		.proc_handler = proc_douintvec_minmax,
+	},
+	{}
+};
+static struct ctl_table lpm_nmi_wd_factor_sysctl_root[] = {
+	{
+		.procname = "kernel",
+		.mode = 0555,
+		.child = lpm_nmi_wd_factor_ctl_table,
+	},
+	{}
+};
+
+static int __init register_lpm_nmi_wd_factor_sysctl(void)
+{
+	register_sysctl_table(lpm_nmi_wd_factor_sysctl_root);
+
+	return 0;
+}
+device_initcall(register_lpm_nmi_wd_factor_sysctl);
+#endif /* CONFIG_SYSCTL */
+#endif /* CONFIG_PPC_WATCHDOG */
+
 static int mobility_rtas_call(int token, char *buf, s32 scope)
 {
 	int rc;
@@ -702,6 +735,7 @@ static int pseries_suspend(u64 handle)
 static int pseries_migrate_partition(u64 handle)
 {
 	int ret;
+	unsigned int factor = lpm_nmi_wd_factor;
 
 	ret = wait_for_vasi_session_suspending(handle);
 	if (ret)
@@ -709,6 +743,13 @@ static int pseries_migrate_partition(u64 handle)
 
 	vas_migration_handler(VAS_SUSPEND);
 
+#ifdef CONFIG_PPC_WATCHDOG
+	if (factor) {
+		pr_info("Set the NMI watchdog factor to %u%%\n", factor);
+		watchdog_nmi_set_lpm_factor(factor);
+	}
+#endif /* CONFIG_PPC_WATCHDOG */
+
 	ret = pseries_suspend(handle);
 	if (ret == 0) {
 		post_mobility_fixup();
@@ -716,6 +757,13 @@ static int pseries_migrate_partition(u64 handle)
 	} else
 		pseries_cancel_migration(handle, ret);
 
+#ifdef CONFIG_PPC_WATCHDOG
+	if (factor) {
+		pr_info("Restoring NMI watchdog timer\n");
+		watchdog_nmi_set_lpm_factor(0);
+	}
+#endif /* CONFIG_PPC_WATCHDOG */
+
 	vas_migration_handler(VAS_RESUME);
 
 	return ret;

During an LPM, while the memory transfer is in progress on the arrival
side, some latency is generated when accessing pages that have not yet
been transferred. As a result, the NMI watchdog may be triggered too
frequently, which increases the risk of hitting an NMI interrupt in a bad
place in the kernel, leading to a kernel panic.

Disabling the hard lockup watchdog until the memory transfer completes
could be too strong a workaround: some users would want this timeout to
eventually be triggered if the system is hanging, even during LPM.

Introduce a new sysctl variable, lpm_nmi_watchdog_factor. It allows a
factor to be applied to the NMI watchdog timeout during an LPM. Just
before the CPUs are stopped for the switchover sequence, the NMI watchdog
timer is set to watchdog_thresh + factor%.

A value of 0 has no effect. The default value is 200, meaning that the NMI
watchdog is set to 30s during LPM (based on a 10s watchdog_thresh value).
Once the memory transfer is complete, the factor is reset to 0.

Setting this value to a high number is like disabling the NMI watchdog
during an LPM.

Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
---
 Documentation/admin-guide/sysctl/kernel.rst | 12 ++++++
 arch/powerpc/platforms/pseries/mobility.c   | 48 +++++++++++++++++++++
 2 files changed, 60 insertions(+)
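
The timeout arithmetic described in the commit message (threshold plus factor percent of the threshold) can be checked with a small standalone sketch. The helper name scaled_watchdog_thresh() and its signature are assumptions made for illustration. This is not the kernel's watchdog code, which is not shown in this patch.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper, not part of the patch: return the threshold
 * (in seconds) after adding factor_pct percent of it. The arithmetic is
 * done in 64 bits so the product of two 32-bit values cannot overflow.
 */
static uint64_t scaled_watchdog_thresh(unsigned int thresh_secs,
				       unsigned int factor_pct)
{
	uint64_t thresh = thresh_secs;

	return thresh + thresh * factor_pct / 100;
}

/* Worked examples matching the values quoted in the commit message. */
static void scaled_watchdog_thresh_examples(void)
{
	assert(scaled_watchdog_thresh(10, 200) == 30);	/* default factor */
	assert(scaled_watchdog_thresh(10, 0) == 10);	/* 0 means no change */
}
```

In this sketch the integer division truncates, so small thresholds combined with small factors round down (for example, 1s with a factor of 50 stays at 1s).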