Message ID | 573AEC75.40508@gmail.com |
---|---|
State | RFC |
Headers | show |
Hi Mats, On 2016-05-17 12:03, Mats Karrman wrote: > Hi Felix, others, > > I have been experiencing problems with the init scripts dispatch > suddenly stopping (indefinitely). > This happens maybe once in 100 reboots. > After inserting a new start script that launches another daemon > (cgrulesengd) very early in the boot process, the failures started to > come a lot more frequently, maybe once in 10 reboots, making this a real > issue. > I'm normally using the versions of procd and libubox selected by OpenWRT > BB branch but I have tested the latest versions from the git repos with > the same result. > So far I have only got this to happen on a quite fast board (ARM dual > CorexA9 @ 1GHz). > Inserting trace prints in libubox changes behavior, also suggesting the > problem is timing dependent. > > When init hangs: > - it is still possible to log in on console > - there is always a zombie start script, e.g. S11sysctl. > - by killing a process (e.g. ubusd or cgrulesengd) the init process > continues. > - otherwise generating an event, e.g inserting something into a USB port > also makes the init continue. > > I have traced the problem down to the "epoll_wait" call in > libubox::uloop.c::uloop_fetch_events(). > The following patch makes sure epoll_wait is never called without a timeout. > My tests show that this solves the problem. > I have been able to observe the case when the boot gets stuck and then > continues after the 8s timeout. > However I'm not sure that this is the correct fix for the problem as > there may be other reasons that there is no event in the first place. > Your feedback would be welcome! I just took a look at the code and uloop's processing of signals looked a bit racy to me. I've pushed a commit that makes it use signalfd if available. I also found that waitpid wasn't being retried on signal interrupt, so I added an extra check there. The changes are in libubox git, but not in OpenWrt/LEDE yet. Please test if this fixes your issue. Thanks, - Felix
On 2016-05-17 13:29, Felix Fietkau wrote: > I just took a look at the code and uloop's processing of signals looked > a bit racy to me. I've pushed a commit that makes it use signalfd if > available. I also found that waitpid wasn't being retried on signal > interrupt, so I added an extra check there. The changes are in libubox > git, but not in OpenWrt/LEDE yet. > Please test if this fixes your issue. > > Thanks, > > - Felix Tried that but no immediate success, but it might have provided some additional clues. Now the boot hangs early on *every* boot but after logging in I found something different in the ps list. There is a Broadcom utility (smd) that is called from one of the start scripts (S10environment). It's purpose is to set scheduling priority and cpu affinity for some of the Broadcom proprietary processes, The smd program handles fork rather ugly. The parent only loops until it receives SIGCHLD and then exits without any wait. With the modified libubox I get a zombie smd child and sleeping smd parent and S11environment (no other zombie). Not sure exactly how this happened but I got to think about something written in the wait man page: """ If a parent process terminates, then its "zombie" children (if any) are adopted by init(8), which automatically performs a wait to remove the zombies. """ Is this wait really (unconditionally) implemented in procd or could that be what I accomplished with the "forced timeout" patch? I fixed the ugly fork and got the system to boot once. Then tried the original libubox with the fixed smd program but this was not enough to get things working (25 reboots to hang). Now I'm running reboot tests with your new libubox and fixed smd... // Mats
On 2016-05-17 17:31, Mats Karrman wrote: > > On 2016-05-17 13:29, Felix Fietkau wrote: >> I just took a look at the code and uloop's processing of signals looked >> a bit racy to me. I've pushed a commit that makes it use signalfd if >> available. I also found that waitpid wasn't being retried on signal >> interrupt, so I added an extra check there. The changes are in libubox >> git, but not in OpenWrt/LEDE yet. >> Please test if this fixes your issue. >> >> Thanks, >> >> - Felix > Tried that but no immediate success, but it might have provided > some additional clues. Now the boot hangs early on *every* boot > but after logging in I found something different in the ps list. > There is a Broadcom utility (smd) that is called from one of the > start scripts (S10environment). It's purpose is to set scheduling > priority and cpu affinity for some of the Broadcom proprietary > processes, The smd program handles fork rather ugly. The > parent only loops until it receives SIGCHLD and then exits without > any wait. With the modified libubox I get a zombie smd child and > sleeping smd parent and S11environment (no other zombie). > > Not sure exactly how this happened but I got to think about > something written in the wait man page: > > """ > If a parent process terminates, then its "zombie" children (if any) > are adopted by init(8), which automatically performs a wait to > remove the zombies. > """ > > Is this wait really (unconditionally) implemented in procd or could > that be what I accomplished with the "forced timeout" patch? > > I fixed the ugly fork and got the system to boot once. > Then tried the original libubox with the fixed smd program but > this was not enough to get things working (25 reboots to hang). > > Now I'm running reboot tests with your new libubox and fixed smd... More than 250 reboots without problem :) Clearly the smd program is broken, but still it doesn't feel good that it manages to hang the init process. Considering that timing is involved it's difficult to make any certain conclusions but it seems like having uloop epoll_wait to time out occasionally isn't such a bad idea? // Mats
On 2016-05-18 11:38, Mats Karrman wrote: > > > On 2016-05-17 17:31, Mats Karrman wrote: >> >> On 2016-05-17 13:29, Felix Fietkau wrote: >>> I just took a look at the code and uloop's processing of signals looked >>> a bit racy to me. I've pushed a commit that makes it use signalfd if >>> available. I also found that waitpid wasn't being retried on signal >>> interrupt, so I added an extra check there. The changes are in libubox >>> git, but not in OpenWrt/LEDE yet. >>> Please test if this fixes your issue. >>> >>> Thanks, >>> >>> - Felix >> Tried that but no immediate success, but it might have provided >> some additional clues. Now the boot hangs early on *every* boot >> but after logging in I found something different in the ps list. >> There is a Broadcom utility (smd) that is called from one of the >> start scripts (S10environment). It's purpose is to set scheduling >> priority and cpu affinity for some of the Broadcom proprietary >> processes, The smd program handles fork rather ugly. The >> parent only loops until it receives SIGCHLD and then exits without >> any wait. With the modified libubox I get a zombie smd child and >> sleeping smd parent and S11environment (no other zombie). >> >> Not sure exactly how this happened but I got to think about >> something written in the wait man page: >> >> """ >> If a parent process terminates, then its "zombie" children (if any) >> are adopted by init(8), which automatically performs a wait to >> remove the zombies. >> """ >> >> Is this wait really (unconditionally) implemented in procd or could >> that be what I accomplished with the "forced timeout" patch? >> >> I fixed the ugly fork and got the system to boot once. >> Then tried the original libubox with the fixed smd program but >> this was not enough to get things working (25 reboots to hang). >> >> Now I'm running reboot tests with your new libubox and fixed smd... > More than 250 reboots without problem :) > > Clearly the smd program is broken, but still it doesn't feel good that it > manages to hang the init process. Considering that timing is involved > it's difficult to make any certain conclusions but it seems like having > uloop epoll_wait to time out occasionally isn't such a bad idea? I agree, that definitely needs fixing. What kernel are you using? - Felix
On 2016-05-18 13:01, Felix Fietkau wrote: > On 2016-05-18 11:38, Mats Karrman wrote: >> >> On 2016-05-17 17:31, Mats Karrman wrote: >>> On 2016-05-17 13:29, Felix Fietkau wrote: >>>> I just took a look at the code and uloop's processing of signals looked >>>> a bit racy to me. I've pushed a commit that makes it use signalfd if >>>> available. I also found that waitpid wasn't being retried on signal >>>> interrupt, so I added an extra check there. The changes are in libubox >>>> git, but not in OpenWrt/LEDE yet. >>>> Please test if this fixes your issue. >>>> >>>> Thanks, >>>> >>>> - Felix >>> Tried that but no immediate success, but it might have provided >>> some additional clues. Now the boot hangs early on *every* boot >>> but after logging in I found something different in the ps list. >>> There is a Broadcom utility (smd) that is called from one of the >>> start scripts (S10environment). It's purpose is to set scheduling >>> priority and cpu affinity for some of the Broadcom proprietary >>> processes, The smd program handles fork rather ugly. The >>> parent only loops until it receives SIGCHLD and then exits without >>> any wait. With the modified libubox I get a zombie smd child and >>> sleeping smd parent and S11environment (no other zombie). >>> >>> Not sure exactly how this happened but I got to think about >>> something written in the wait man page: >>> >>> """ >>> If a parent process terminates, then its "zombie" children (if any) >>> are adopted by init(8), which automatically performs a wait to >>> remove the zombies. >>> """ >>> >>> Is this wait really (unconditionally) implemented in procd or could >>> that be what I accomplished with the "forced timeout" patch? >>> >>> I fixed the ugly fork and got the system to boot once. >>> Then tried the original libubox with the fixed smd program but >>> this was not enough to get things working (25 reboots to hang). >>> >>> Now I'm running reboot tests with your new libubox and fixed smd... >> More than 250 reboots without problem :) >> >> Clearly the smd program is broken, but still it doesn't feel good that it >> manages to hang the init process. Considering that timing is involved >> it's difficult to make any certain conclusions but it seems like having >> uloop epoll_wait to time out occasionally isn't such a bad idea? > I agree, that definitely needs fixing. What kernel are you using? It's the 3.4.11-rt19 from the Broadcom SDK v4.16, so very old... Now I also noticed, with your libubox fixes (and my fixed smd) I still get some zombies, even though the system seems to boot OK all the way (the corresponding services being defunct though). With my epoll_wait timeout fix on the original libubox, this does not happen. BR // Mats
On 2016-05-18 14:00, Mats Karrman wrote: > > > On 2016-05-18 13:01, Felix Fietkau wrote: >> On 2016-05-18 11:38, Mats Karrman wrote: >>> >>> On 2016-05-17 17:31, Mats Karrman wrote: >>>> On 2016-05-17 13:29, Felix Fietkau wrote: >>>>> I just took a look at the code and uloop's processing of signals looked >>>>> a bit racy to me. I've pushed a commit that makes it use signalfd if >>>>> available. I also found that waitpid wasn't being retried on signal >>>>> interrupt, so I added an extra check there. The changes are in libubox >>>>> git, but not in OpenWrt/LEDE yet. >>>>> Please test if this fixes your issue. >>>>> >>>>> Thanks, >>>>> >>>>> - Felix >>>> Tried that but no immediate success, but it might have provided >>>> some additional clues. Now the boot hangs early on *every* boot >>>> but after logging in I found something different in the ps list. >>>> There is a Broadcom utility (smd) that is called from one of the >>>> start scripts (S10environment). It's purpose is to set scheduling >>>> priority and cpu affinity for some of the Broadcom proprietary >>>> processes, The smd program handles fork rather ugly. The >>>> parent only loops until it receives SIGCHLD and then exits without >>>> any wait. With the modified libubox I get a zombie smd child and >>>> sleeping smd parent and S11environment (no other zombie). >>>> >>>> Not sure exactly how this happened but I got to think about >>>> something written in the wait man page: >>>> >>>> """ >>>> If a parent process terminates, then its "zombie" children (if any) >>>> are adopted by init(8), which automatically performs a wait to >>>> remove the zombies. >>>> """ >>>> >>>> Is this wait really (unconditionally) implemented in procd or could >>>> that be what I accomplished with the "forced timeout" patch? >>>> >>>> I fixed the ugly fork and got the system to boot once. >>>> Then tried the original libubox with the fixed smd program but >>>> this was not enough to get things working (25 reboots to hang). >>>> >>>> Now I'm running reboot tests with your new libubox and fixed smd... >>> More than 250 reboots without problem :) >>> >>> Clearly the smd program is broken, but still it doesn't feel good that it >>> manages to hang the init process. Considering that timing is involved >>> it's difficult to make any certain conclusions but it seems like having >>> uloop epoll_wait to time out occasionally isn't such a bad idea? >> I agree, that definitely needs fixing. What kernel are you using? > It's the 3.4.11-rt19 from the Broadcom SDK v4.16, so very old... > > Now I also noticed, with your libubox fixes (and my fixed smd) I still get > some zombies, even though the system seems to boot OK all the way > (the corresponding services being defunct though). > With my epoll_wait timeout fix on the original libubox, this does not > happen. Can you try backporting this to your kernel? https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=128dd1759d96ad36c379240f8b9463e8acfd37a1 - Felix
On 2016-05-18 15:09, Mats Karrman wrote: > > > On 2016-05-18 14:03, Felix Fietkau wrote: >> On 2016-05-18 14:00, Mats Karrman wrote: >>> >>> On 2016-05-18 13:01, Felix Fietkau wrote: >>>> On 2016-05-18 11:38, Mats Karrman wrote: >>>>> On 2016-05-17 17:31, Mats Karrman wrote: >>>>>> On 2016-05-17 13:29, Felix Fietkau wrote: >>>>>>> I just took a look at the code and uloop's processing of signals >>>>>>> looked >>>>>>> a bit racy to me. I've pushed a commit that makes it use >>>>>>> signalfd if >>>>>>> available. I also found that waitpid wasn't being retried on signal >>>>>>> interrupt, so I added an extra check there. The changes are in >>>>>>> libubox >>>>>>> git, but not in OpenWrt/LEDE yet. >>>>>>> Please test if this fixes your issue. >>>>>>> >>>>>>> Thanks, >>>>>>> >>>>>>> - Felix >>>>>> Tried that but no immediate success, but it might have provided >>>>>> some additional clues. Now the boot hangs early on *every* boot >>>>>> but after logging in I found something different in the ps list. >>>>>> There is a Broadcom utility (smd) that is called from one of the >>>>>> start scripts (S10environment). It's purpose is to set scheduling >>>>>> priority and cpu affinity for some of the Broadcom proprietary >>>>>> processes, The smd program handles fork rather ugly. The >>>>>> parent only loops until it receives SIGCHLD and then exits without >>>>>> any wait. With the modified libubox I get a zombie smd child and >>>>>> sleeping smd parent and S11environment (no other zombie). >>>>>> >>>>>> Not sure exactly how this happened but I got to think about >>>>>> something written in the wait man page: >>>>>> >>>>>> """ >>>>>> If a parent process terminates, then its "zombie" children (if any) >>>>>> are adopted by init(8), which automatically performs a wait to >>>>>> remove the zombies. >>>>>> """ >>>>>> >>>>>> Is this wait really (unconditionally) implemented in procd or could >>>>>> that be what I accomplished with the "forced timeout" patch? >>>>>> >>>>>> I fixed the ugly fork and got the system to boot once. >>>>>> Then tried the original libubox with the fixed smd program but >>>>>> this was not enough to get things working (25 reboots to hang). >>>>>> >>>>>> Now I'm running reboot tests with your new libubox and fixed smd... >>>>> More than 250 reboots without problem :) >>>>> >>>>> Clearly the smd program is broken, but still it doesn't feel good >>>>> that it >>>>> manages to hang the init process. Considering that timing is involved >>>>> it's difficult to make any certain conclusions but it seems like >>>>> having >>>>> uloop epoll_wait to time out occasionally isn't such a bad idea? >>>> I agree, that definitely needs fixing. What kernel are you using? >>> It's the 3.4.11-rt19 from the Broadcom SDK v4.16, so very old... >>> >>> Now I also noticed, with your libubox fixes (and my fixed smd) I >>> still get >>> some zombies, even though the system seems to boot OK all the way >>> (the corresponding services being defunct though). >>> With my epoll_wait timeout fix on the original libubox, this does not >>> happen. >> Can you try backporting this to your kernel? >> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=128dd1759d96ad36c379240f8b9463e8acfd37a1 >> >> >> - Felix > OK, did that. > > First, I tested original libubox without my epoll_wait timeout fix > (but fixed smd), > init hung after 7 reboots. Same state as before. > > Second, I tested your libubox (with fixed smd), seems to be no change, > i.e. > some processes/scripts ends up as zombies. > > Third, added my epoll_wait timeout fix to your libubox, made no > difference.. Also noted that other stuff is not working properly with your fixes in libubox, e.g. system upgrade hangs with a zombie sh and after scp copying files to the box I get a zombie dropbear. Child reaping in general seems to be broken. Backed down to latest libubox before your fixes and the zombies/problems disappear so it has no relation to the kernel patch or smd (original intit hang problem still there of course). // Mats
diff --git a/uloop.c b/uloop.c index ea160a0..8343bc5 100644 --- a/uloop.c +++ b/uloop.c @@ -256,7 +256,7 @@ static int uloop_fetch_events(int timeout) { int n, nfds; - nfds = epoll_wait(poll_fd, events, ARRAY_SIZE(events), timeout); + nfds = epoll_wait(poll_fd, events, ARRAY_SIZE(events), timeout < 0 ? 8000 : timeout); for (n = 0; n < nfds; ++n) { struct uloop_fd_event *cur = &cur_fds[n];