| Message ID | 20160218174427.GG13177@mtj.duckdns.org |
|---|---|
| State | Not Applicable, archived |
| Delegated to: | David Miller |
On 02/18/2016, 06:44 PM, Tejun Heo wrote:
> Hello,
>
> Can you please do the following?
>
> 1. Remove WQ_MEM_RECLAIM from the affected workqueue and see whether
>    the problem is reproducible. WQ_MEM_RECLAIM on anything bluetooth
>    doesn't make sense btw. Why is it there?
>
> 2. If WQ_MEM_RECLAIM makes the issue go away, see whether the attached
>    patch works too.

Hello,

1. didn't help, the problem persists. So I haven't applied the patch from 2.

thanks,
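For reference, step 1 of the experiment amounts to dropping the flag at allocation time, so no rescuer thread is created for the queue. A minimal sketch, assuming an hci workqueue allocated roughly like the bluetooth ones; the "hci0" name, the flag combination, and the call site are assumptions for illustration, not the actual net/bluetooth/hci_core.c code:

```c
#include <linux/errno.h>
#include <linux/workqueue.h>

static struct workqueue_struct *hci_wq;

static int example_setup(void)
{
	/* original shape (assumed): WQ_MEM_RECLAIM creates a rescuer */
	/* hci_wq = alloc_workqueue("hci0", WQ_HIGHPRI | WQ_MEM_RECLAIM, 1); */

	/* step 1 of the experiment: same queue without WQ_MEM_RECLAIM */
	hci_wq = alloc_workqueue("hci0", WQ_HIGHPRI, 1);
	return hci_wq ? 0 : -ENOMEM;
}
```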
On 02/19/2016, 11:20 AM, Jiri Slaby wrote:
> On 02/18/2016, 06:44 PM, Tejun Heo wrote:
>> Hello,
>>
>> Can you please do the following?
>>
>> 1. Remove WQ_MEM_RECLAIM from the affected workqueue and see whether
>>    the problem is reproducible. WQ_MEM_RECLAIM on anything bluetooth
>>    doesn't make sense btw. Why is it there?
>>
>> 2. If WQ_MEM_RECLAIM makes the issue go away, see whether the attached
>>    patch works too.
>
> Hello,
>
> 1. didn't help, the problem persists. So I haven't applied the patch from 2.

FWIW I dumped more info about the wq:
wq->name='hci0' pwq=ffff8800390d7600 wq->dfl_pwq=ffff8800390d5200
pwq->refcnt=2 pwq->nr_active=0 delayed_works: <nothing>

> thanks,
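The one-line dump above reads like the output of an ad-hoc debug printk; something along these lines, placed inside the for_each_pwq() loop in destroy_workqueue(), would produce it (a hypothetical snippet against kernel/workqueue.c internals, where struct pool_workqueue is private to that file):

```c
/* hypothetical debug printk inside destroy_workqueue()'s pwq loop */
pr_info("wq->name='%s' pwq=%p wq->dfl_pwq=%p pwq->refcnt=%d pwq->nr_active=%d delayed_works:%s\n",
	wq->name, pwq, wq->dfl_pwq, pwq->refcnt, pwq->nr_active,
	list_empty(&pwq->delayed_works) ? " <nothing>" : "");
```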
Hello, Jiri.

On Fri, Feb 19, 2016 at 01:10:00PM +0100, Jiri Slaby wrote:
> > 1. didn't help, the problem persists. So I haven't applied the patch from 2.
>
> FWIW I dumped more info about the wq:
> wq->name='hci0' pwq=ffff8800390d7600 wq->dfl_pwq=ffff8800390d5200
> pwq->refcnt=2 pwq->nr_active=0 delayed_works: <nothing>

Can you please print out the same info for all pwq's during shutdown?
It looks like we're leaking pwq refcnt but I can't spot a place where
that could happen on an empty pwq.

Thanks.
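Extending the same dump to every pwq of the workqueue, as requested, could look like this (again a hypothetical sketch against kernel/workqueue.c internals; for_each_pwq() must run with wq->mutex or the RCU read lock held, and pwq is the loop cursor declared in destroy_workqueue()):

```c
/* hypothetical: dump the refcount state Tejun asks about for all pwqs */
mutex_lock(&wq->mutex);
for_each_pwq(pwq, wq)
	pr_info("  pwq=%p dfl=%d refcnt=%d nr_active=%d delayed=%s\n",
		pwq, pwq == wq->dfl_pwq, pwq->refcnt, pwq->nr_active,
		list_empty(&pwq->delayed_works) ? "empty" : "pending");
mutex_unlock(&wq->mutex);
```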
Hi,

On 03/02/2016, 04:45 PM, Tejun Heo wrote:
> On Fri, Feb 19, 2016 at 01:10:00PM +0100, Jiri Slaby wrote:
>>> 1. didn't help, the problem persists. So I haven't applied the patch from 2.
>>
>> FWIW I dumped more info about the wq:
>> wq->name='hci0' pwq=ffff8800390d7600 wq->dfl_pwq=ffff8800390d5200
>> pwq->refcnt=2 pwq->nr_active=0 delayed_works: <nothing>
>
> Can you please print out the same info for all pwq's during shutdown?
> It looks like we're leaking pwq refcnt but I can't spot a place where
> that could happen on an empty pwq.

I have not done that yet, but today, I see:
destroy_workqueue: name='req_hci0' pwq=ffff88002f590300
wq->dfl_pwq=ffff88002f591e00 pwq->refcnt=2 pwq->nr_active=0 delayed_works:
pwq 12: cpus=0-1 node=0 flags=0x4 nice=-20 active=0/1
  in-flight: 18568:wq_barrier_func

thanks,
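The "in-flight: 18568:wq_barrier_func" line is the telling part (the "pwq 12: ..." format looks like the per-pwq dump of show_workqueue_state()): wq_barrier_func is the callback of the barrier work item that flush_work()/flush_workqueue() queues right behind the work being flushed. Simplified from kernel/workqueue.c, paraphrased rather than a verbatim copy:

```c
#include <linux/completion.h>
#include <linux/kernel.h>
#include <linux/workqueue.h>

/* flush_work() queues one of these after the flushed work and sleeps
 * on ->done, so an in-flight wq_barrier_func means some task is
 * currently blocked inside a flush on this pwq. */
struct wq_barrier {
	struct work_struct	work;
	struct completion	done;
};

static void wq_barrier_func(struct work_struct *work)
{
	struct wq_barrier *barr = container_of(work, struct wq_barrier, work);

	complete(&barr->done);
}
```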
Hello,

Sorry about the delay.

On Thu, Mar 03, 2016 at 10:12:01AM +0100, Jiri Slaby wrote:
> On 03/02/2016, 04:45 PM, Tejun Heo wrote:
> > On Fri, Feb 19, 2016 at 01:10:00PM +0100, Jiri Slaby wrote:
> >>> 1. didn't help, the problem persists. So I haven't applied the patch from 2.
> >>
> >> FWIW I dumped more info about the wq:
> >> wq->name='hci0' pwq=ffff8800390d7600 wq->dfl_pwq=ffff8800390d5200
> >> pwq->refcnt=2 pwq->nr_active=0 delayed_works: <nothing>
> >
> > Can you please print out the same info for all pwq's during shutdown?
> > It looks like we're leaking pwq refcnt but I can't spot a place where
> > that could happen on an empty pwq.
>
> I have not done that yet, but today, I see:
> destroy_workqueue: name='req_hci0' pwq=ffff88002f590300
> wq->dfl_pwq=ffff88002f591e00 pwq->refcnt=2 pwq->nr_active=0 delayed_works:
> pwq 12: cpus=0-1 node=0 flags=0x4 nice=-20 active=0/1
>   in-flight: 18568:wq_barrier_func

So, this means that there's flush_work() racing against workqueue
destruction, which can't be safe. :(

Thanks.
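As an illustration of the race being described (hypothetical driver-style code, not taken from the bluetooth stack): if one thread can still reach flush_work() while another is already inside destroy_workqueue(), the flush barrier can be queued onto a pwq that is being torn down, which matches the in-flight wq_barrier_func in the dump above:

```c
#include <linux/workqueue.h>

static struct workqueue_struct *wq;	/* hypothetical module state */
static struct work_struct io_work;	/* INIT_WORK()'d during setup */

/* thread A: teardown path */
static void teardown(void)
{
	destroy_workqueue(wq);	/* drains the queue, then frees its pwqs */
}

/* thread B: if this can run concurrently with teardown(), the barrier
 * queued by flush_work() races with the destruction of the pwqs */
static void racing_user(void)
{
	flush_work(&io_work);
}
```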
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 7ff5dc7..9824d4f 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -4012,7 +4012,7 @@ void destroy_workqueue(struct workqueue_struct *wq)
 	/* drain it before proceeding with destruction */
 	drain_workqueue(wq);
-	/* sanity checks */
+	/* nothing should be in flight */
 	mutex_lock(&wq->mutex);
 	for_each_pwq(pwq, wq) {
 		int i;
@@ -4024,8 +4024,7 @@ void destroy_workqueue(struct workqueue_struct *wq)
 			}
 		}
-		if (WARN_ON((pwq != wq->dfl_pwq) && (pwq->refcnt > 1)) ||
-		    WARN_ON(pwq->nr_active) ||
+		if (WARN_ON(pwq->nr_active) ||
 		    WARN_ON(!list_empty(&pwq->delayed_works))) {
 			mutex_unlock(&wq->mutex);
 			return;
 		}
@@ -4046,6 +4045,13 @@ void destroy_workqueue(struct workqueue_struct *wq)
 	if (wq->rescuer)
 		kthread_stop(wq->rescuer->task);

+	/* rescuer is gone, everything should be quiescent now */
+	WARN_ON(!list_empty(&wq->maydays));
+	mutex_lock(&wq->mutex);
+	for_each_pwq(pwq, wq)
+		WARN_ON((pwq != wq->dfl_pwq) && (pwq->refcnt > 1));
+	mutex_unlock(&wq->mutex);
+
 	if (!(wq->flags & WQ_UNBOUND)) {
 		/*
 		 * The base ref is never dropped on per-cpu pwqs.  Directly
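Independent of the workqueue-core patch above, the rule this thread implies for workqueue users is that no flush_work() caller may still be running once teardown starts. A minimal sketch of such an ordering, continuing the hypothetical example above (stop_queueing_new_work() is a made-up stand-in for whatever driver-specific mechanism prevents further queue_work() calls):

```c
/* hypothetical safe teardown ordering for the example driver */
static void safe_teardown(void)
{
	stop_queueing_new_work();	/* hypothetical: no new queue_work() */
	cancel_work_sync(&io_work);	/* wait out any in-flight instance */
	destroy_workqueue(wq);		/* nothing can flush against it now */
}
```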