Message ID | 20221010023315.98396-1-zhouzhouyi@gmail.com (mailing list archive) |
---|---|
State | New |
Series | [linux-next,RFC] powerpc: fix HOTPLUG error in rcutorture |
Context | Check | Description |
---|---|---|
snowpatch_ozlabs/github-powerpc_selftests | success | Successfully ran 10 jobs. |
snowpatch_ozlabs/github-powerpc_ppctests | success | Successfully ran 10 jobs. |
snowpatch_ozlabs/github-powerpc_kernel_qemu | success | Successfully ran 23 jobs. |
snowpatch_ozlabs/github-powerpc_sparse | success | Successfully ran 4 jobs. |
snowpatch_ozlabs/github-powerpc_clang | success | Successfully ran 6 jobs. |
Zhouyi Zhou <zhouzhouyi@gmail.com> writes:
> I think we should avoid torture offline the cpu who do tick timer
> when nohz full is running.

Can you tell us what the bug you're fixing is?

Did you see a crash/oops/hang etc? Or are you just proposing this as
something that would be a good idea?

> Tested on PPC VM of Open Source Lab of Oregon State University.
> The test results show that after the fix, the success rate of
> rcutorture is improved.
> After:
> Successes: 40 Failures: 9
> Before:
> Successes: 38 Failures: 11
>
> I examined the console.log and Make.out files one by one, no new
> compile error or test error is introduced by above fix.
>
> Signed-off-by: Zhouyi Zhou <zhouzhouyi@gmail.com>
> ---
> Dear PPC developers
>
> I found this bug when trying to do rcutorture tests in ppc VM of
> Open Source Lab of Oregon State University:
>
> ubuntu@ubuntu:~/linux-next/tools/testing/selftests/rcutorture/res/2022.09.30-01.06.22-torture$ find . -name "console.log.diags"|xargs grep HOTPLUG
> ./results-scftorture/NOPREEMPT/console.log.diags:WARNING: HOTPLUG FAILURES NOPREEMPT
> ./results-rcutorture/TASKS03/console.log.diags:WARNING: HOTPLUG FAILURES TASKS03
> ./results-rcutorture/TREE04/console.log.diags:WARNING: HOTPLUG FAILURES TREE04
> ./results-scftorture-kasan/NOPREEMPT/console.log.diags:WARNING: HOTPLUG FAILURES NOPREEMPT
> ./results-rcutorture-kasan/TASKS03/console.log.diags:WARNING: HOTPLUG FAILURES TASKS03
> ./results-rcutorture-kasan/TREE04/console.log.diags:WARNING: HOTPLUG FAILURES TREE04
>
> I tried to fix this bug.
>
> Thanks for your patience and guidance ;-)
>
> Thanks
> Zhouyi
> --
>  arch/powerpc/kernel/sysfs.c | 8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/arch/powerpc/kernel/sysfs.c b/arch/powerpc/kernel/sysfs.c
> index ef9a61718940..be9c0e45337e 100644
> --- a/arch/powerpc/kernel/sysfs.c
> +++ b/arch/powerpc/kernel/sysfs.c
> @@ -4,6 +4,7 @@
>  #include <linux/smp.h>
>  #include <linux/percpu.h>
>  #include <linux/init.h>
> +#include <linux/tick.h>
>  #include <linux/sched.h>
>  #include <linux/export.h>
>  #include <linux/nodemask.h>
> @@ -21,6 +22,7 @@
>  #include <asm/firmware.h>
>  #include <asm/idle.h>
>  #include <asm/svm.h>
> +#include "../../../kernel/time/tick-internal.h"

Needing to include this internal header is a sign that we are using the
wrong API or otherwise using time keeping internals we shouldn't be.

>  #include "cacheinfo.h"
>  #include "setup.h"
> @@ -1151,7 +1153,11 @@ static int __init topology_init(void)
>  		 * CPU. For instance, the boot cpu might never be valid
>  		 * for hotplugging.
>  		 */
> -		if (smp_ops && smp_ops->cpu_offline_self)
> +		if (smp_ops && smp_ops->cpu_offline_self
> +#ifdef CONFIG_NO_HZ_FULL
> +		    && !(tick_nohz_full_running && tick_do_timer_cpu == cpu)
> +#endif
> +		    )

I can't see any other arches doing anything like this. I don't think
it's the arches responsibility.

If the time keeping core needs a CPU to stay online to run the timer
then it needs to organise that itself IMHO :)

cheers

>  			c->hotpluggable = 1;
>  #endif
>
> --
> 2.25.1
Thanks Michael for reviewing my patch

On Mon, Oct 10, 2022 at 7:21 PM Michael Ellerman <mpe@ellerman.id.au> wrote:
>
> Zhouyi Zhou <zhouzhouyi@gmail.com> writes:
> > I think we should avoid torture offline the cpu who do tick timer
> > when nohz full is running.
>
> Can you tell us what the bug you're fixing is?
>
> Did you see a crash/oops/hang etc? Or are you just proposing this as
> something that would be a good idea?

Sorry for the trouble and inconvenience I have caused the community.
I did not make myself clear in my patch.
The full story is as follows:
1) cd linux-next
2) ./tools/testing/selftests/rcutorture/bin/torture.sh
   after 19 hours ;-)
3) tail ./tools/testing/selftests/rcutorture/res/2022.09.30-01.06.22-torture/results-scftorture/NOPREEMPT/console.log

[ 121.449268][ T57] scftorture: scf_invoked_count VER: 2415215 resched: 697463 single: 619512/619760 single_ofl: 255751/256554 single_rpc: 620692 single_rpc_ofl: 0 many: 155476/154658 all: 77282/76988 onoff: 3/3:5/6 18,25:9,28 63:93 (HZ=100) ste: 0 stnmie: 0 stnmoe: 0 staf: 0
[ 121.454485][ T57] scftorture: --- End of test: LOCK_HOTPLUG: verbose=1 holdoff=10 longwait=0 nthreads=4 onoff_holdoff=30 onoff_interval=1000 shutdown_secs=1 stat_interval=15 stutter=5 use_cpus_read_lock=0, weight_resched=-1, weight_single=-1, weight_single_rpc=-1, weight_single_wait=-1, weight_many=-1, weight_many_wait=-1, weight_all=-1, weight_all_wait=-1
[ 121.469305][ T57] reboot: Power down

I see "End of test: LOCK_HOTPLUG", which means the function
torture_offline in kernel/torture.c failed to bring down the CPU.
4) Then I traced the cause to tick_nohz_cpu_down:
	if (tick_nohz_full_running && tick_do_timer_cpu == cpu)
		return -EBUSY;
5) I created the above patch

> > Tested on PPC VM of Open Source Lab of Oregon State University.
> > The test results show that after the fix, the success rate of
> > rcutorture is improved.
> > After:
> > Successes: 40 Failures: 9
> > Before:
> > Successes: 38 Failures: 11
> >
> > I examined the console.log and Make.out files one by one, no new
> > compile error or test error is introduced by above fix.
> >
> > Signed-off-by: Zhouyi Zhou <zhouzhouyi@gmail.com>
> > ---
> > Dear PPC developers
> >
> > I found this bug when trying to do rcutorture tests in ppc VM of
> > Open Source Lab of Oregon State University:
> >
> > ubuntu@ubuntu:~/linux-next/tools/testing/selftests/rcutorture/res/2022.09.30-01.06.22-torture$ find . -name "console.log.diags"|xargs grep HOTPLUG
> > ./results-scftorture/NOPREEMPT/console.log.diags:WARNING: HOTPLUG FAILURES NOPREEMPT
> > ./results-rcutorture/TASKS03/console.log.diags:WARNING: HOTPLUG FAILURES TASKS03
> > ./results-rcutorture/TREE04/console.log.diags:WARNING: HOTPLUG FAILURES TREE04
> > ./results-scftorture-kasan/NOPREEMPT/console.log.diags:WARNING: HOTPLUG FAILURES NOPREEMPT
> > ./results-rcutorture-kasan/TASKS03/console.log.diags:WARNING: HOTPLUG FAILURES TASKS03
> > ./results-rcutorture-kasan/TREE04/console.log.diags:WARNING: HOTPLUG FAILURES TREE04
> >
> > I tried to fix this bug.
> >
> > Thanks for your patience and guidance ;-)
> >
> > Thanks
> > Zhouyi
> > --
> >  arch/powerpc/kernel/sysfs.c | 8 +++++++-
> >  1 file changed, 7 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/powerpc/kernel/sysfs.c b/arch/powerpc/kernel/sysfs.c
> > index ef9a61718940..be9c0e45337e 100644
> > --- a/arch/powerpc/kernel/sysfs.c
> > +++ b/arch/powerpc/kernel/sysfs.c
> > @@ -4,6 +4,7 @@
> >  #include <linux/smp.h>
> >  #include <linux/percpu.h>
> >  #include <linux/init.h>
> > +#include <linux/tick.h>
> >  #include <linux/sched.h>
> >  #include <linux/export.h>
> >  #include <linux/nodemask.h>
> > @@ -21,6 +22,7 @@
> >  #include <asm/firmware.h>
> >  #include <asm/idle.h>
> >  #include <asm/svm.h>
> > +#include "../../../kernel/time/tick-internal.h"
>
> Needing to include this internal header is a sign that we are using the
> wrong API or otherwise using time keeping internals we shouldn't be.

Yes, when I did this, I suspected there was something wrong in my patch.

> >  #include "cacheinfo.h"
> >  #include "setup.h"
> > @@ -1151,7 +1153,11 @@ static int __init topology_init(void)
> >  		 * CPU. For instance, the boot cpu might never be valid
> >  		 * for hotplugging.
> >  		 */
> > -		if (smp_ops && smp_ops->cpu_offline_self)
> > +		if (smp_ops && smp_ops->cpu_offline_self
> > +#ifdef CONFIG_NO_HZ_FULL
> > +		    && !(tick_nohz_full_running && tick_do_timer_cpu == cpu)
> > +#endif
> > +		    )
>
> I can't see any other arches doing anything like this. I don't think
> it's the arches responsibility.

Agree!

x86 seems to disable hotplug of CPU0 by default, while
tick_do_timer_cpu has a default value of 0.

#ifdef CONFIG_BOOTPARAM_HOTPLUG_CPU0
static int cpu0_hotpluggable = 1;
#else
static int cpu0_hotpluggable;
static int __init enable_cpu0_hotplug(char *str)
{
	cpu0_hotpluggable = 1;
	return 1;
}

__setup("cpu0_hotplug", enable_cpu0_hotplug);
#endif

I need more time to understand the relationship between x86's
cpu0_hotpluggable and tick_do_timer_cpu, but I also tend to think
it is the timekeeping mechanism's responsibility.

> If the time keeping core needs a CPU to stay online to run the timer
> then it needs to organise that itself IMHO :)

Um, I am going to submit a patch to the timekeeping community sometime
next month ;-)

Thanks again
Cheers
Zhouyi

> cheers
>
> >  			c->hotpluggable = 1;
> >  #endif
> >
> > --
> > 2.25.1
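The refusal described in step 4 can be pictured with a small userspace model. This is only a sketch: the `model_*` names, the stats structure, and the offline helper are hypothetical stand-ins, not kernel code. Only the two-line condition is taken from the `tick_nohz_cpu_down` snippet quoted in the mail.

```c
#include <errno.h>
#include <stdbool.h>

/*
 * Userspace stand-ins for the kernel's tick_nohz_full_running and
 * tick_do_timer_cpu. These are illustrative variables, not the
 * kernel definitions.
 */
static bool model_nohz_full_running = true;
static int model_do_timer_cpu = 0;

/* Same condition as the snippet quoted from tick_nohz_cpu_down(). */
static int model_tick_nohz_cpu_down(int cpu)
{
	if (model_nohz_full_running && model_do_timer_cpu == cpu)
		return -EBUSY;
	return 0;
}

/* Hypothetical bookkeeping for offline attempts made by a test. */
struct model_onoff_stats {
	long attempts;
	long successes;
};

/*
 * Hypothetical offline attempt: every call counts as an attempt,
 * and only calls that are not refused count as successes.
 */
static int model_offline(int cpu, struct model_onoff_stats *st)
{
	int ret;

	st->attempts++;
	ret = model_tick_nohz_cpu_down(cpu);
	if (ret == 0)
		st->successes++;
	return ret;
}
```

In this model, asking to offline CPU 0 returns `-EBUSY` and leaves `successes` behind `attempts`, while CPU 1 goes down normally. That gap between attempts and successes is the kind of mismatch a hotplug test would report as a failure.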
Hi,

I have also reproduced the same behaviour on RISC-V:
[ 120.156380] scftorture: --- End of test: LOCK_HOTPLUG

So I guess it is not the arch's responsibility.
I am very interested in it ;-)

Thank you both for your guidance!

Cheers
Zhouyi

On Tue, Oct 11, 2022 at 9:59 AM Zhouyi Zhou <zhouzhouyi@gmail.com> wrote:
> > I can't see any other arches doing anything like this. I don't think
> > it's the arches responsibility.
> Agree!
diff --git a/arch/powerpc/kernel/sysfs.c b/arch/powerpc/kernel/sysfs.c
index ef9a61718940..be9c0e45337e 100644
--- a/arch/powerpc/kernel/sysfs.c
+++ b/arch/powerpc/kernel/sysfs.c
@@ -4,6 +4,7 @@
 #include <linux/smp.h>
 #include <linux/percpu.h>
 #include <linux/init.h>
+#include <linux/tick.h>
 #include <linux/sched.h>
 #include <linux/export.h>
 #include <linux/nodemask.h>
@@ -21,6 +22,7 @@
 #include <asm/firmware.h>
 #include <asm/idle.h>
 #include <asm/svm.h>
+#include "../../../kernel/time/tick-internal.h"
 #include "cacheinfo.h"
 #include "setup.h"
@@ -1151,7 +1153,11 @@ static int __init topology_init(void)
 		 * CPU. For instance, the boot cpu might never be valid
 		 * for hotplugging.
 		 */
-		if (smp_ops && smp_ops->cpu_offline_self)
+		if (smp_ops && smp_ops->cpu_offline_self
+#ifdef CONFIG_NO_HZ_FULL
+		    && !(tick_nohz_full_running && tick_do_timer_cpu == cpu)
+#endif
+		    )
 			c->hotpluggable = 1;
 #endif
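To make the effect of the last hunk easier to follow, here is a userspace sketch of the decision it changes. Everything in it is an assumption made for illustration: `struct model_cpu`, `model_mark_hotpluggable`, and its parameters are hypothetical, with `has_offline_self` standing in for `smp_ops && smp_ops->cpu_offline_self` and the last two parameters standing in for `tick_nohz_full_running` and `tick_do_timer_cpu`.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the per-CPU structure carrying the flag. */
struct model_cpu {
	int hotpluggable;
};

/*
 * Models the patched condition: a CPU is marked hotpluggable only if
 * the platform can offline it AND it is not the CPU doing the tick
 * timer while nohz_full is running. As in the patch, the flag is only
 * ever set to 1 here; otherwise it keeps its previous value.
 */
static void model_mark_hotpluggable(struct model_cpu *c, int cpu,
				    bool has_offline_self,
				    bool nohz_full_running,
				    int do_timer_cpu)
{
	bool is_timer_cpu = nohz_full_running && do_timer_cpu == cpu;

	if (has_offline_self && !is_timer_cpu)
		c->hotpluggable = 1;
}
```

With zero-initialised structures, `nohz_full_running` true, and `do_timer_cpu` equal to 0, CPU 0 keeps `hotpluggable == 0` and every other CPU gets 1. With `nohz_full_running` false the model behaves like the unpatched condition. Note that the hunk sits in `topology_init`, so the decision is taken once at init time.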
I think we should avoid torture offline the cpu who do tick timer
when nohz full is running.

Tested on PPC VM of Open Source Lab of Oregon State University.
The test results show that after the fix, the success rate of
rcutorture is improved.
After:
Successes: 40 Failures: 9
Before:
Successes: 38 Failures: 11

I examined the console.log and Make.out files one by one, no new
compile error or test error is introduced by above fix.

Signed-off-by: Zhouyi Zhou <zhouzhouyi@gmail.com>
---
Dear PPC developers

I found this bug when trying to do rcutorture tests in ppc VM of
Open Source Lab of Oregon State University:

ubuntu@ubuntu:~/linux-next/tools/testing/selftests/rcutorture/res/2022.09.30-01.06.22-torture$ find . -name "console.log.diags"|xargs grep HOTPLUG
./results-scftorture/NOPREEMPT/console.log.diags:WARNING: HOTPLUG FAILURES NOPREEMPT
./results-rcutorture/TASKS03/console.log.diags:WARNING: HOTPLUG FAILURES TASKS03
./results-rcutorture/TREE04/console.log.diags:WARNING: HOTPLUG FAILURES TREE04
./results-scftorture-kasan/NOPREEMPT/console.log.diags:WARNING: HOTPLUG FAILURES NOPREEMPT
./results-rcutorture-kasan/TASKS03/console.log.diags:WARNING: HOTPLUG FAILURES TASKS03
./results-rcutorture-kasan/TREE04/console.log.diags:WARNING: HOTPLUG FAILURES TREE04

I tried to fix this bug.

Thanks for your patience and guidance ;-)

Thanks
Zhouyi
--
 arch/powerpc/kernel/sysfs.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)
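For readers who want the reported counts as rates, the arithmetic can be sketched as follows. The helper name `success_rate_percent` is hypothetical and only the four counts come from the message above.

```c
/*
 * Success rate in percent for a run with the given counts.
 * Returns 0.0 when no results were recorded at all.
 */
static double success_rate_percent(int successes, int failures)
{
	int total = successes + failures;

	if (total == 0)
		return 0.0;
	return 100.0 * successes / total;
}
```

Both runs have 49 results in total, so `success_rate_percent(40, 9)` is about 81.6 and `success_rate_percent(38, 11)` is about 77.6: a difference of two runs out of 49.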