Message ID | 1523092180-5365-1-git-send-email-liwang@redhat.com
---|---
State | Superseded
Series | [RFC] read_all: give more time to wait children finish read action
On 2018/04/07 17:09, Li Wang wrote:
> 1. Some children are still working on the read I/O, but the parent tries
> to stop them immediately after visit_dir(). Although stop_attempts is
> 65535, it still sometimes fails, so we get the following worker stalled
> messages in the test.

Hi Li,

I got the same message on RHEL6.9Alpha as well, and this patch looks good to me.

Thanks,
Xiao Yang

> # uname -rm
> 4.16.0-rc7 ppc64
> # ./read_all -d /sys -q -r 10
> tst_test.c:987: INFO: Timeout per run is 0h 05m 00s
> read_all.c:280: BROK: Worker 26075 is stalled
> read_all.c:280: WARN: Worker 26075 is stalled
> read_all.c:280: WARN: Worker 26079 is stalled
> read_all.c:280: WARN: Worker 26087 is stalled
>
> 2. The sched_work() push action runs in an infinite loop; here I propose
> to limit it to a bounded number of attempts.
>
> Signed-off-by: Li Wang <liwang@redhat.com>
> ---
>  testcases/kernel/fs/read_all/read_all.c | 10 +++++++---
>  1 file changed, 7 insertions(+), 3 deletions(-)
>
> diff --git a/testcases/kernel/fs/read_all/read_all.c b/testcases/kernel/fs/read_all/read_all.c
> index b7ed540..ab206e7 100644
> --- a/testcases/kernel/fs/read_all/read_all.c
> +++ b/testcases/kernel/fs/read_all/read_all.c
> @@ -280,6 +280,7 @@ static void stop_workers(void)
>  					workers[i].pid);
>  				break;
>  			}
> +			usleep(100);
>  		}
>  	}
>  }
> @@ -306,9 +307,12 @@ static void sched_work(const char *path)
>  		if (pushed)
>  			break;
>
> -		if (++push_attempts > worker_count) {
> -			usleep(100);
> -			push_attempts = 0;
> +		usleep(100);
> +		if (++push_attempts > 0xffff) {
> +			tst_brk(TBROK,
> +				"Attempts %d times but still failed to push %s",
> +				push_attempts, path);
> +			break;
>  		}
>  	}
> }
Hello,

Li Wang writes:

> 1. Some children are still working on the read I/O, but the parent tries
> to stop them immediately after visit_dir(). Although stop_attempts is
> 65535, it still sometimes fails, so we get the following worker stalled
> messages in the test.
>
> # uname -rm
> 4.16.0-rc7 ppc64
> # ./read_all -d /sys -q -r 10
> tst_test.c:987: INFO: Timeout per run is 0h 05m 00s
> read_all.c:280: BROK: Worker 26075 is stalled
> read_all.c:280: WARN: Worker 26075 is stalled
> read_all.c:280: WARN: Worker 26079 is stalled
> read_all.c:280: WARN: Worker 26087 is stalled

Wow, three workers have their queues perfectly filled... I guess I
accidentally created a brute force box packing algorithm.

> 2. The sched_work() push action runs in an infinite loop; here I propose
> to limit it to a bounded number of attempts.

I think this is moving the problem instead of solving it. Increasing the
number of stop_attempts should have the same effect unless the workers
are permanently blocked on I/O. However this might be better because it
removes the sleep.

Possibly we should actually try to determine if a worker is blocked
reading a file and print the file name.

> Signed-off-by: Li Wang <liwang@redhat.com>
> ---
>  testcases/kernel/fs/read_all/read_all.c | 10 +++++++---
>  1 file changed, 7 insertions(+), 3 deletions(-)
>
> diff --git a/testcases/kernel/fs/read_all/read_all.c b/testcases/kernel/fs/read_all/read_all.c
> index b7ed540..ab206e7 100644
> --- a/testcases/kernel/fs/read_all/read_all.c
> +++ b/testcases/kernel/fs/read_all/read_all.c
> @@ -280,6 +280,7 @@ static void stop_workers(void)
>  					workers[i].pid);
>  				break;
>  			}
> +			usleep(100);
>  		}
>  	}
>  }
> @@ -306,9 +307,12 @@ static void sched_work(const char *path)
>  		if (pushed)
>  			break;
>
> -		if (++push_attempts > worker_count) {
> -			usleep(100);
> -			push_attempts = 0;
> +		usleep(100);
> +		if (++push_attempts > 0xffff) {

Maybe add another f to this.

> +			tst_brk(TBROK,
> +				"Attempts %d times but still failed to push %s",
> +				push_attempts, path);
> +			break;
>  		}
>  	}
> }
Hi Richard,

Richard Palethorpe <rpalethorpe@suse.de> wrote:

> Hello,
>
> Li Wang writes:
>
> > 1. Some children are still working on the read I/O, but the parent tries
> > to stop them immediately after visit_dir(). Although stop_attempts is
> > 65535, it still sometimes fails, so we get the following worker stalled
> > messages in the test.
> >
> > # uname -rm
> > 4.16.0-rc7 ppc64
> > # ./read_all -d /sys -q -r 10
> > tst_test.c:987: INFO: Timeout per run is 0h 05m 00s
> > read_all.c:280: BROK: Worker 26075 is stalled
> > read_all.c:280: WARN: Worker 26075 is stalled
> > read_all.c:280: WARN: Worker 26079 is stalled
> > read_all.c:280: WARN: Worker 26087 is stalled
>
> Wow, three workers have their queues perfectly filled... I guess I
> accidentally created a brute force box packing algorithm.
>
> > 2. The sched_work() push action runs in an infinite loop; here I propose
> > to limit it to a bounded number of attempts.
>
> I think this is moving the problem instead of solving it. Increasing the
> number of stop_attempts should have the same effect unless the workers
> are permanently blocked on I/O. However this might be better because it
> removes the sleep.

Hmm, I'm not sure you fully got my point; maybe (apologies!) I shouldn't
have tried to fix two problems in one patch.

For the blocked I/O issue, I just added 'usleep(100)' in the stop_workers()
function. Increasing stop_attempts, as you suggest, would also work, but
without the sleep the loop still finishes very quickly, and we would
probably need a very large stop_attempts value to waste enough time.

The second change, in sched_work(), just guarantees that we can exit the
infinite loop if something goes wrong with the queue_push action.

> Possibly we should actually try to determine if a worker is blocked
> reading a file and print the file name.

You are right; I'm still looking for a better way to avoid this blocked
I/O issue.

> > Signed-off-by: Li Wang <liwang@redhat.com>
> > ---
> >  testcases/kernel/fs/read_all/read_all.c | 10 +++++++---
> >  1 file changed, 7 insertions(+), 3 deletions(-)
> >
> > diff --git a/testcases/kernel/fs/read_all/read_all.c b/testcases/kernel/fs/read_all/read_all.c
> > index b7ed540..ab206e7 100644
> > --- a/testcases/kernel/fs/read_all/read_all.c
> > +++ b/testcases/kernel/fs/read_all/read_all.c
> > @@ -280,6 +280,7 @@ static void stop_workers(void)
> >  					workers[i].pid);
> >  				break;
> >  			}
> > +			usleep(100);
> >  		}
> >  	}
> >  }
> > @@ -306,9 +307,12 @@ static void sched_work(const char *path)
> >  		if (pushed)
> >  			break;
> >
> > -		if (++push_attempts > worker_count) {
> > -			usleep(100);
> > -			push_attempts = 0;
> > +		usleep(100);
> > +		if (++push_attempts > 0xffff) {
>
> Maybe add another f to this.

There's no need for that many attempts; my test shows the push action
succeeds in fewer than 20 tries.

> > +			tst_brk(TBROK,
> > +				"Attempts %d times but still failed to push %s",
> > +				push_attempts, path);
> > +			break;
> >  		}
> >  	}
> > }
>
> --
> Thank you,
> Richard.
Li Wang writes:

> Hi Richard,
>
> Richard Palethorpe <rpalethorpe@suse.de> wrote:
>
>> Hello,
>>
>> Li Wang writes:
>>
>> > 1. Some children are still working on the read I/O, but the parent tries
>> > to stop them immediately after visit_dir(). Although stop_attempts is
>> > 65535, it still sometimes fails, so we get the following worker stalled
>> > messages in the test.
>> >
>> > # uname -rm
>> > 4.16.0-rc7 ppc64
>> > # ./read_all -d /sys -q -r 10
>> > tst_test.c:987: INFO: Timeout per run is 0h 05m 00s
>> > read_all.c:280: BROK: Worker 26075 is stalled
>> > read_all.c:280: WARN: Worker 26075 is stalled
>> > read_all.c:280: WARN: Worker 26079 is stalled
>> > read_all.c:280: WARN: Worker 26087 is stalled
>>
>> Wow, three workers have their queues perfectly filled... I guess I
>> accidentally created a brute force box packing algorithm.
>>
>> > 2. The sched_work() push action runs in an infinite loop; here I propose
>> > to limit it to a bounded number of attempts.
>>
>> I think this is moving the problem instead of solving it. Increasing the
>> number of stop_attempts should have the same effect unless the workers
>> are permanently blocked on I/O. However this might be better because it
>> removes the sleep.
>
> Hmm, I'm not sure you fully got my point; maybe (apologies!) I shouldn't
> have tried to fix two problems in one patch.
>
> For the blocked I/O issue, I just added 'usleep(100)' in the stop_workers()
> function. Increasing stop_attempts, as you suggest, would also work, but
> without the sleep the loop still finishes very quickly, and we would
> probably need a very large stop_attempts value to waste enough time.
>
> The second change, in sched_work(), just guarantees that we can exit the
> infinite loop if something goes wrong with the queue_push action.

An infinite loop is not a big problem because the test library will time
out, but not knowing what caused the timeout is a problem. When writing
the test I was not sure whether timeouts would be a problem, but from your
testing it appears that they are.

>> Possibly we should actually try to determine if a worker is blocked
>> reading a file and print the file name.
>
> You are right; I'm still looking for a better way to avoid this blocked
> I/O issue.

Maybe you could add a field to the worker struct containing the current
file being read and a timestamp of when it started reading that
file. Then we can use that information later to find out if a particular
file is taking a long time to read.

>> > Signed-off-by: Li Wang <liwang@redhat.com>
>> > ---
>> >  testcases/kernel/fs/read_all/read_all.c | 10 +++++++---
>> >  1 file changed, 7 insertions(+), 3 deletions(-)
>> >
>> > diff --git a/testcases/kernel/fs/read_all/read_all.c b/testcases/kernel/fs/read_all/read_all.c
>> > index b7ed540..ab206e7 100644
>> > --- a/testcases/kernel/fs/read_all/read_all.c
>> > +++ b/testcases/kernel/fs/read_all/read_all.c
>> > @@ -280,6 +280,7 @@ static void stop_workers(void)
>> >  					workers[i].pid);
>> >  				break;
>> >  			}
>> > +			usleep(100);
>> >  		}
>> >  	}
>> >  }
>> > @@ -306,9 +307,12 @@ static void sched_work(const char *path)
>> >  		if (pushed)
>> >  			break;
>> >
>> > -		if (++push_attempts > worker_count) {
>> > -			usleep(100);
>> > -			push_attempts = 0;
>> > +		usleep(100);
>> > +		if (++push_attempts > 0xffff) {
>>
>> Maybe add another f to this.
>
> There's no need for that many attempts; my test shows the push action
> succeeds in fewer than 20 tries.

OK, I think this change is good except that using a constant time for
the usleep is probably bad. Instead we could use an exponential
function; so we can start with a sleep of 1 then double it up to a
maximum of 20 times (approximately a second for the final wait).

>> > +			tst_brk(TBROK,
>> > +				"Attempts %d times but still failed to push %s",
>> > +				push_attempts, path);
>> > +			break;
>> >  		}
>> >  	}
>> > }

--
Thank you,
Richard.
Hi!

> OK, I think this change is good except that using a constant time for
> the usleep is probably bad. Instead we could use an exponential
> function; so we can start with a sleep of 1 then double it up to a
> maximum of 20 times (approximately a second for the final wait).

Actually I was thinking of adding such functionality to the test library,
exactly because I do not want to reimplement the exponentially incremented
sleep-for-retry over and over in each testcase...

Maybe we need a TST_RETRY_ON_NONZERO() macro that would build the retry
loop around a function passed as a parameter, or something like that.
Hello,

Cyril Hrubis writes:

> Hi!
>> OK, I think this change is good except that using a constant time for
>> the usleep is probably bad. Instead we could use an exponential
>> function; so we can start with a sleep of 1 then double it up to a
>> maximum of 20 times (approximately a second for the final wait).
>
> Actually I was thinking of adding such functionality to the test library,
> exactly because I do not want to reimplement the exponentially incremented
> sleep-for-retry over and over in each testcase...

Sounds like a good idea.

> Maybe we need a TST_RETRY_ON_NONZERO() macro that would build the retry
> loop around a function passed as a parameter, or something like that.

Or TST_RETRY_LOOP, TST_RETRY_WHILE or
TST_RETRY_WITH_EXPONENTIAL_BACKOFF. :-)

Maybe it should not be a magic macro though, but just an inline function
which you call in the body of a loop, performing the sleep based on a
couple of parameters. I suppose it depends on how nice one can make the
macro.

--
Thank you,
Richard.
Hi!

> > Maybe we need a TST_RETRY_ON_NONZERO() macro that would build the retry
> > loop around a function passed as a parameter, or something like that.
>
> Or TST_RETRY_LOOP, TST_RETRY_WHILE or
> TST_RETRY_WITH_EXPONENTIAL_BACKOFF. :-)
>
> Maybe it should not be a magic macro though, but just an inline function
> which you call in the body of a loop, performing the sleep based on a
> couple of parameters. I suppose it depends on how nice one can make the
> macro.

Well, yes. But I would go for a simple sleep for the sake of fixing this
test for now, then replace it with something better once we settle on the
actual implementation.
Richard Palethorpe <rpalethorpe@suse.de> wrote:

> > You are right, I'm still looking for a better way to avoid this blocked
> > I/O issue.
>
> Maybe you could add a field to the worker struct containing the current
> file being read and a timestamp of when it started reading that
> file. Then we can use that information later to find out if a particular
> file is taking a long time to read.

I tried to measure the read time of the files in the children: 0~200 ms
elapsed for each one. For the parent, on the other hand, the slowest run
of stop_workers() costs only 8 ms, and even visit_dir(root_dir) pushing
all of the files takes only 1000 ms. That probably explains why some
children appear stalled there.

========
read_all.c:231: INFO: read(/sys/kernel/debug/tracing/per_cpu/cpu11/snapshot_raw), elapsed_ms = 90: EAGAIN/EWOULDBLOCK
read_all.c:231: INFO: read(/sys/kernel/debug/tracing/per_cpu/cpu10/trace_pipe), elapsed_ms = 0: EAGAIN/EWOULDBLOCK
read_all.c:231: INFO: read(/sys/kernel/debug/tracing/per_cpu/cpu12/trace_pipe), elapsed_ms = 110: EAGAIN/EWOULDBLOCK
...
read_all.c:231: INFO: read(/sys/kernel/debug/tracing/per_cpu/cpu9/trace_pipe_raw), elapsed_ms = 160: EAGAIN/EWOULDBLOCK
...
read_all.c:231: INFO: read(/sys/kernel/debug/tracing/per_cpu/cpu9/trace_pipe_raw), elapsed_ms = 180: EAGAIN/EWOULDBLOCK
read_all.c:231: INFO: read(/sys/kernel/debug/tracing/per_cpu/cpu11/trace_pipe), elapsed_ms = 0: EAGAIN/EWOULDBLOCK
...
read_all.c:231: INFO: read(/sys/kernel/debug/tracing/per_cpu/cpu10/snapshot_raw), elapsed_ms = 130: EAGAIN/EWOULDBLOCK
read_all.c:231: INFO: read(/sys/kernel/debug/tracing/per_cpu/cpu6/trace_pipe), elapsed_ms = 200: EAGAIN/EWOULDBLOCK
read_all.c:231: INFO: read(/sys/kernel/debug/tracing/per_cpu/cpu9/trace_pipe), elapsed_ms = 0: EAGAIN/EWOULDBLOCK
read_all.c:231: INFO: read(/sys/kernel/debug/tracing/per_cpu/cpu5/snapshot_raw), elapsed_ms = 0: EAGAIN/EWOULDBLOCK
On Mon, Apr 9, 2018 at 8:19 PM, Richard Palethorpe <rpalethorpe@suse.de> wrote:

> Hello,
>
> Cyril Hrubis writes:
>
>> Hi!
>>> OK, I think this change is good except that using a constant time for
>>> the usleep is probably bad. Instead we could use an exponential
>>> function; so we can start with a sleep of 1 then double it up to a
>>> maximum of 20 times (approximately a second for the final wait).
>>
>> Actually I was thinking of adding such functionality to the test library,
>> exactly because I do not want to reimplement the exponentially incremented
>> sleep-for-retry over and over in each testcase...
>
> Sounds like a good idea.
>
>> Maybe we need a TST_RETRY_ON_NONZERO() macro that would build the retry
>> loop around a function passed as a parameter, or something like that.
>
> Or TST_RETRY_LOOP, TST_RETRY_WHILE or
> TST_RETRY_WITH_EXPONENTIAL_BACKOFF. :-)
>
> Maybe it should not be a magic macro though, but just an inline function
> which you call in the body of a loop, performing the sleep based on a
> couple of parameters. I suppose it depends on how nice one can make the
> macro.

I have written two new macros based on this suggestion, and will send out
the patch to request your comments after finishing my rough testing.
Thanks!
diff --git a/testcases/kernel/fs/read_all/read_all.c b/testcases/kernel/fs/read_all/read_all.c
index b7ed540..ab206e7 100644
--- a/testcases/kernel/fs/read_all/read_all.c
+++ b/testcases/kernel/fs/read_all/read_all.c
@@ -280,6 +280,7 @@ static void stop_workers(void)
 					workers[i].pid);
 				break;
 			}
+			usleep(100);
 		}
 	}
 }
@@ -306,9 +307,12 @@ static void sched_work(const char *path)
 		if (pushed)
 			break;
 
-		if (++push_attempts > worker_count) {
-			usleep(100);
-			push_attempts = 0;
+		usleep(100);
+		if (++push_attempts > 0xffff) {
+			tst_brk(TBROK,
+				"Attempts %d times but still failed to push %s",
+				push_attempts, path);
+			break;
 		}
 	}
 }
1. Some children are still working on the read I/O, but the parent tries
to stop them immediately after visit_dir(). Although stop_attempts is
65535, it still sometimes fails, so we get the following worker stalled
messages in the test.

# uname -rm
4.16.0-rc7 ppc64
# ./read_all -d /sys -q -r 10
tst_test.c:987: INFO: Timeout per run is 0h 05m 00s
read_all.c:280: BROK: Worker 26075 is stalled
read_all.c:280: WARN: Worker 26075 is stalled
read_all.c:280: WARN: Worker 26079 is stalled
read_all.c:280: WARN: Worker 26087 is stalled

2. The sched_work() push action runs in an infinite loop; here I propose
to limit it to a bounded number of attempts.

Signed-off-by: Li Wang <liwang@redhat.com>
---
 testcases/kernel/fs/read_all/read_all.c | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)