
net/bluetooth: workqueue destruction WARNING in hci_unregister_dev

Message ID 20160218174427.GG13177@mtj.duckdns.org
State Not Applicable, archived
Delegated to: David Miller

Commit Message

Tejun Heo Feb. 18, 2016, 5:44 p.m. UTC
Hello,

Can you please do the following?

1. Remove WQ_MEM_RECLAIM from the affected workqueue and see whether
   the problem is still reproducible.  WQ_MEM_RECLAIM on anything bluetooth
   doesn't make sense btw.  Why is it there?

2. If removing WQ_MEM_RECLAIM makes the issue go away, see whether the
   attached patch works too.

Thanks.

Comments

Jiri Slaby Feb. 19, 2016, 10:20 a.m. UTC | #1
On 02/18/2016, 06:44 PM, Tejun Heo wrote:
> Hello,
> 
> Can you please do the following?
> 
> 1. Remove WQ_MEM_RECLAIM from the affected workqueue and see whether
>    the problem is still reproducible.  WQ_MEM_RECLAIM on anything bluetooth
>    doesn't make sense btw.  Why is it there?
> 
> 2. If removing WQ_MEM_RECLAIM makes the issue go away, see whether the
>    attached patch works too.

Hello,

1. didn't help, the problem persists. So I haven't applied the patch from 2.

thanks,
Jiri Slaby Feb. 19, 2016, 12:10 p.m. UTC | #2
On 02/19/2016, 11:20 AM, Jiri Slaby wrote:
> On 02/18/2016, 06:44 PM, Tejun Heo wrote:
>> Hello,
>>
>> Can you please do the following?
>>
>> 1. Remove WQ_MEM_RECLAIM from the affected workqueue and see whether
>>    the problem is still reproducible.  WQ_MEM_RECLAIM on anything bluetooth
>>    doesn't make sense btw.  Why is it there?
>>
>> 2. If removing WQ_MEM_RECLAIM makes the issue go away, see whether the
>>    attached patch works too.
> 
> Hello,
> 
> 1. didn't help, the problem persists. So I haven't applied the patch from 2.

FWIW I dumped more info about the wq:
wq->name='hci0' pwq=ffff8800390d7600 wq->dfl_pwq=ffff8800390d5200
pwq->refcnt=2 pwq->nr_active=0 delayed_works: <nothing>

> thanks,
Tejun Heo March 2, 2016, 3:45 p.m. UTC | #3
Hello, Jiri.

On Fri, Feb 19, 2016 at 01:10:00PM +0100, Jiri Slaby wrote:
> > 1. didn't help, the problem persists. So I haven't applied the patch from 2.
> 
> FWIW I dumped more info about the wq:
> wq->name='hci0' pwq=ffff8800390d7600 wq->dfl_pwq=ffff8800390d5200
> pwq->refcnt=2 pwq->nr_active=0 delayed_works: <nothing>

Can you please print out the same info for all pwq's during shutdown?
It looks like we're leaking pwq refcnt but I can't spot a place where
that could happen on an empty pwq.

Thanks.
Jiri Slaby March 3, 2016, 9:12 a.m. UTC | #4
Hi,

On 03/02/2016, 04:45 PM, Tejun Heo wrote:
> On Fri, Feb 19, 2016 at 01:10:00PM +0100, Jiri Slaby wrote:
>>> 1. didn't help, the problem persists. So I haven't applied the patch from 2.
>>
>> FWIW I dumped more info about the wq:
>> wq->name='hci0' pwq=ffff8800390d7600 wq->dfl_pwq=ffff8800390d5200
>> pwq->refcnt=2 pwq->nr_active=0 delayed_works: <nothing>
> 
> Can you please print out the same info for all pwq's during shutdown?
> It looks like we're leaking pwq refcnt but I can't spot a place where
> that could happen on an empty pwq.

I have not done that yet, but today, I see:
destroy_workqueue: name='req_hci0' pwq=ffff88002f590300
wq->dfl_pwq=ffff88002f591e00 pwq->refcnt=2 pwq->nr_active=0 delayed_works:
   pwq 12: cpus=0-1 node=0 flags=0x4 nice=-20 active=0/1
     in-flight: 18568:wq_barrier_func

thanks,
Tejun Heo March 11, 2016, 5:12 p.m. UTC | #5
Hello,

Sorry about the delay.

On Thu, Mar 03, 2016 at 10:12:01AM +0100, Jiri Slaby wrote:
> On 03/02/2016, 04:45 PM, Tejun Heo wrote:
> > On Fri, Feb 19, 2016 at 01:10:00PM +0100, Jiri Slaby wrote:
> >>> 1. didn't help, the problem persists. So I haven't applied the patch from 2.
> >>
> >> FWIW I dumped more info about the wq:
> >> wq->name='hci0' pwq=ffff8800390d7600 wq->dfl_pwq=ffff8800390d5200
> >> pwq->refcnt=2 pwq->nr_active=0 delayed_works: <nothing>
> > 
> > Can you please print out the same info for all pwq's during shutdown?
> > It looks like we're leaking pwq refcnt but I can't spot a place where
> > that could happen on an empty pwq.
> 
> I have not done that yet, but today, I see:
> destroy_workqueue: name='req_hci0' pwq=ffff88002f590300
> wq->dfl_pwq=ffff88002f591e00 pwq->refcnt=2 pwq->nr_active=0 delayed_works:
>    pwq 12: cpus=0-1 node=0 flags=0x4 nice=-20 active=0/1
>      in-flight: 18568:wq_barrier_func

So, this means that there's flush_work() racing against workqueue
destruction, which can't be safe. :(

Thanks.

Patch

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 7ff5dc7..9824d4f 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -4012,7 +4012,7 @@ void destroy_workqueue(struct workqueue_struct *wq)
 	/* drain it before proceeding with destruction */
 	drain_workqueue(wq);
 
-	/* sanity checks */
+	/* nothing should be in flight */
 	mutex_lock(&wq->mutex);
 	for_each_pwq(pwq, wq) {
 		int i;
@@ -4024,8 +4024,7 @@ void destroy_workqueue(struct workqueue_struct *wq)
 			}
 		}
 
-		if (WARN_ON((pwq != wq->dfl_pwq) && (pwq->refcnt > 1)) ||
-		    WARN_ON(pwq->nr_active) ||
+		if (WARN_ON(pwq->nr_active) ||
 		    WARN_ON(!list_empty(&pwq->delayed_works))) {
 			mutex_unlock(&wq->mutex);
 			return;
@@ -4046,6 +4045,13 @@ void destroy_workqueue(struct workqueue_struct *wq)
 	if (wq->rescuer)
 		kthread_stop(wq->rescuer->task);
 
+	/* rescuer is gone, everything should be quiescent now */
+	WARN_ON(!list_empty(&wq->maydays));
+	mutex_lock(&wq->mutex);
+	for_each_pwq(pwq, wq)
+		WARN_ON((pwq != wq->dfl_pwq) && (pwq->refcnt > 1));
+	mutex_unlock(&wq->mutex);
+
 	if (!(wq->flags & WQ_UNBOUND)) {
 		/*
 		 * The base ref is never dropped on per-cpu pwqs.  Directly