diff mbox

[v2,2/4] driver core: enable drivers to use deferred probefrom init

Message ID 201407292107.EGB64066.OOQMFSHLJVtFFO@I-love.SAKURA.ne.jp
State RFC, archived
Delegated to: David Miller
Headers show

Commit Message

Tetsuo Handa July 29, 2014, 12:07 p.m. UTC
Luis R. Rodriguez wrote:
> On Mon, Jul 28, 2014 at 5:35 PM, Greg KH <gregkh@linuxfoundation.org> wrote:
> > On Mon, Jul 28, 2014 at 05:26:34PM -0700, Luis R. Rodriguez wrote:
> >> To ignore SIGKILL ?
> >
> > Sorry, I thought this was a userspace change that caused this.
> >
> > As it's a kernel change, well, maybe that patch should be reverted...
> 
> That's certainly viable. Oleg?

I don't want to revert that patch.

I'm trying to reduce use of blocking operations that wait in unkillable state,
for the OOM killer is optimistic enough to wait forever even if the OOM-killed
process cannot be terminated due to having dependency on other threads that are
waiting the OOM-killed process to terminate and release memory. Linux kernel is
too optimistic about memory reclaim; memory allocation/reclaim deadlock is not
detectable.

> 
> If its reverted we won't know which drivers are hitting over the new
> 30 second limit requirement imposed by userspace, which the culprit
> commit forces failure on probe now. This series at least would allow
> us to annotate which drivers are brake the 30 second init limit, and
> enable a work around alternative if we wish to not revert the commit.
> This  series essentially should be considered an alternative solution
> to what was proposed initially by Joseph Salisbury, it may not be
> perfect but its a proposal. I welcome others.
(...snipped...)
> This also just addresses this *one* Ethernet driver, there was that
> SCSI driver that created the original report -- Canonical merged
> Joseph's fix as a work around, there is no general solution for this
> yet, and again with that work around you won't find which drivers run
> into this issue. There may be others found later so this is why I
> resorted to the deferred workqueue as a solution for now and to enable
> us to annotate which drivers need fixing as I expect getting the fix
> done / everyone happy can take time.

If you want to know which drivers are hitting over the new 30 second
limit requirement but don't want to break the boot, I think we can do
"what Ubuntu chose as a work around" + "a warning message" like below.


Hannes Reinecke wrote:
> Well ... from my POV there are two issues here:
> 
> The one is that systemd essentially nailed its internal worker 
> timeout to 30 seconds. And there is no way of modifying that short 
> of recompiling. Which should be fixed, so that at least one can
> modify this timeout.
> 
> The other one is that systemd killing a worker by sending SIGKILL, 
> which will kill modprobe terminally.
> Which definitely needs to be fixed.
> 
> But if we have both issues resolved (eg by allowing udevd to use a 
> longer timeout and revert the latter patch) we can identify the 
> offending drivers _and_ get the system to boot by simply adding a 
> kernel commandline parameter.
> Which is _far_ preferrable from a maintenance perspective.
> Users tend to become annoyed if their system doesn't boot ...

The proposal which allows longer timeout was expired.
https://bugs.launchpad.net/bugs/1297248
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Benjamin Poirier July 29, 2014, 10:25 p.m. UTC | #1
On 2014/07/29 21:07, Tetsuo Handa wrote:
> Luis R. Rodriguez wrote:
> > On Mon, Jul 28, 2014 at 5:35 PM, Greg KH <gregkh@linuxfoundation.org> wrote:
> > > On Mon, Jul 28, 2014 at 05:26:34PM -0700, Luis R. Rodriguez wrote:
> > >> To ignore SIGKILL ?
> > >
> > > Sorry, I thought this was a userspace change that caused this.
> > >
> > > As it's a kernel change, well, maybe that patch should be reverted...
> > 
> > That's certainly viable. Oleg?
> 
> I don't want to revert that patch.

I agree that 786235ee should not be reverted to fix the problem of
modules that receive sigkill from udev while they are initializing. In
fact, while it may fix the case that was reported with mptsas, it would
not fix cxgb4 because there are other code paths that check for pending
signals and that abort (ex. pci_vpd_pci22_wait()).

Reverting 786235ee effectively works around the problem by making
modprobe unkillable. The proper solution would be to make sure that udev
does not send sigkill to modprobe in the first place, either by making
the timeout longer or by making the module probe faster.

If you have other reasons for reverting 786235ee, then it's a different
story.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Luis R. Rodriguez July 30, 2014, 12:14 a.m. UTC | #2
On Tue, Jul 29, 2014 at 03:25:29PM -0700, Benjamin Poirier wrote:
> On 2014/07/29 21:07, Tetsuo Handa wrote:
> > Luis R. Rodriguez wrote:
> > > On Mon, Jul 28, 2014 at 5:35 PM, Greg KH <gregkh@linuxfoundation.org> wrote:
> > > > On Mon, Jul 28, 2014 at 05:26:34PM -0700, Luis R. Rodriguez wrote:
> > > >> To ignore SIGKILL ?
> > > >
> > > > Sorry, I thought this was a userspace change that caused this.
> > > >
> > > > As it's a kernel change, well, maybe that patch should be reverted...
> > > 
> > > That's certainly viable. Oleg?
> > 
> > I don't want to revert that patch.
> 
> I agree that 786235ee should not be reverted to fix the problem of
> modules that receive sigkill from udev while they are initializing. In
> fact, while it may fix the case that was reported with mptsas, it would
> not fix cxgb4 because there are other code paths that check for pending
> signals and that abort (ex. pci_vpd_pci22_wait()).
> 
> Reverting 786235ee effectively works around the problem by making
> modprobe unkillable. The proper solution would be to make sure that udev
> does not send sigkill to modprobe in the first place, either by making
> the timeout longer or by making the module probe faster.

Hannes sent a patch for systemd that enables a kernel command line override
for the timeout, this however still means some drivers can fail and distros
would have to use the longest known timeout for the supported kernel.

http://lists.freedesktop.org/archives/systemd-devel/2014-July/021601.html

Tetsuo is it possible / desirable to allow tasks to not kill unless the
reason is OOM ? Its unclear if this was discussed before, sorry if it was,
have just been a bit busy today to review the archive / discussions on this.

To *fatally* kill a module if it does not reach a time limit is rather harsh
without properly thinking about the entire picture of possible issues and 
reasons for the timeout and also consequences of the kill, essentially
what has happened is we are breaking ome boots on at least storage drivers
that take long, and now networking on one driver at least. I think we all
agree these drivers need fixing, there is no one arguing over that.
but to allow a timeout to fatally kill the damn system seems rather
stupid too if what we want to do is to get drivers fixed. It is both *hard
to debug* (see the bug reports) and simply just irritating to users.

The original commit on systemd that introduced the timeout is commit
e64fae55 but the purpose of that commit was to send to hell drivers
that are not using asynch firmware loading, but this is not the only
reason why some drivers would hit the timeout limit. As Benjamin notes
the cxgb4 driver issue is a bit more complex than that, as I've noted
I've sent some initial patches to help with asynch firmware but proper
integration is a bit more complex and even if we remove firmware out
of the picture (this was tried) the driver *still* takes more than
30 seconds to load and fails as Benjamin indicated. As Greg notes a bus
driver stub can be written -- but this will take a bit of time folks,
even if its a day or two, or a week or to just test things. In the
meantime we simply have broken systems / networking as collateral to
a CVE patch that in turn allowed systemd to also kill drivers on a
30 second timeout under the assumption it was all asynch firmware
loading. Collateral should not be the way to introduce new driver
requirements, specially if its breaking boots. All I'm saying is we
should just try to warn here, and not be fatal.

Hannes' patch will allow us to override the timeout through the comand
line but we're essentially still killing drivers that don't meet the new
implicit rules. This doesn't seem optimal and hence the discussion.

  Luis
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Tetsuo Handa July 30, 2014, 2:22 a.m. UTC | #3
Luis R. Rodriguez wrote:
> Tetsuo is it possible / desirable to allow tasks to not kill unless the
> reason is OOM ? Its unclear if this was discussed before, sorry if it was,
> have just been a bit busy today to review the archive / discussions on this.

Are we aware that the 10 seconds timeout after SIGKILL is not the duration
between the beginning of module loading and the end of kthread_create() but
the duration to wait for kthreadd to create a new kernel thread?

If the kthreadd is unable to create a new kernel thread within 10 seconds,
something very bad is happening. For example, memory allocation deadlock
sequence shown below might be happening.

 (1) process1 holds a mutex using mutex_lock().
 (2) process1 calls kthread_create() and enters into killable wait state
     at wait_for_completion_killable().
 (3) kthreadd calls kernel_thread() and enters into oom-killable busy loop
     due to out of memory at alloc_pages_nodemask().
 (4) process2 is chosen by the OOM killer, but process2 is unable to
     terminate because process2 is waiting in unkillable state at
     mutex_lock() which was held by process1 at (1).
 (5) kthreadd continues busy loop because process2 does not release memory
     and the OOM killer does not kill more processes.
 (6) process1 continues waiting in oom-killable state because process1 is
     not chosen by the OOM killer.

See? The system will remain unresponding unless somebody releases memory
that is enough for kthreadd to complete. We cannot teach process1 that
process1 needs to give up waiting for kthreadd and call mutex_unlock()
in order to allow process2 to terminate. Also, we cannot teach the OOM
killer that process1 needs to be oom-killed after process2 is oom-killed.

Making the 10 seconds timeout after SIGKILL longer is safe.
Changing it to no-timeout-unless-oom-killed is unsafe.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Luis R. Rodriguez July 30, 2014, 2:26 p.m. UTC | #4
On Wed, Jul 30, 2014 at 11:22:14AM +0900, Tetsuo Handa wrote:
> Luis R. Rodriguez wrote:
> > Tetsuo is it possible / desirable to allow tasks to not kill unless the
> > reason is OOM ? Its unclear if this was discussed before, sorry if it was,
> > have just been a bit busy today to review the archive / discussions on this.
> 
> Are we aware that the 10 seconds timeout after SIGKILL is not the duration
> between the beginning of module loading and the end of kthread_create() but
> the duration to wait for kthreadd to create a new kernel thread?
> 
> If the kthreadd is unable to create a new kernel thread within 10 seconds,
> something very bad is happening. For example, memory allocation deadlock
> sequence shown below might be happening.
> 
>  (1) process1 holds a mutex using mutex_lock().
>  (2) process1 calls kthread_create() and enters into killable wait state
>      at wait_for_completion_killable().
>  (3) kthreadd calls kernel_thread() and enters into oom-killable busy loop
>      due to out of memory at alloc_pages_nodemask().
>  (4) process2 is chosen by the OOM killer, but process2 is unable to
>      terminate because process2 is waiting in unkillable state at
>      mutex_lock() which was held by process1 at (1).
>  (5) kthreadd continues busy loop because process2 does not release memory
>      and the OOM killer does not kill more processes.
>  (6) process1 continues waiting in oom-killable state because process1 is
>      not chosen by the OOM killer.
> 
> See? The system will remain unresponding unless somebody releases memory
> that is enough for kthreadd to complete.

I see but we're talking about large systems with gobs of memory so I'm pretty
sure memory should not be the issue here.

> We cannot teach process1 that
> process1 needs to give up waiting for kthreadd and call mutex_unlock()
> in order to allow process2 to terminate. Also, we cannot teach the OOM
> killer that process1 needs to be oom-killed after process2 is oom-killed.
> 
> Making the 10 seconds timeout after SIGKILL longer is safe.
> Changing it to no-timeout-unless-oom-killed is unsafe.

To be clear we have *not* merged the 10 second workaround:

https://launchpadlibrarian.net/169714201/kthread-Do-not-leave-kthread_create-immediately.patch

and come to think of it the work aroaund is aligned with what I was thinking
*without *waiting for 10 seconds, but my question was whether or not it was
reasonable to have the process request to go through this excemption.
So we would not do this all the time, but only for processes that would
request this, in this case modprobe.

  Luis
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/kernel/kthread.c b/kernel/kthread.c
index c2390f4..43da2b1 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -292,6 +292,26 @@  struct task_struct *kthread_create_on_node(int (*threadfn)(void *data),
 	 * new kernel thread.
 	 */
 	if (unlikely(wait_for_completion_killable(&done))) {
+		int i = 0;
+
+		/*
+		 * I got SIGKILL, but wait for 10 more seconds for completion
+		 * unless chosen by the OOM killer. This delay is there as a
+		 * workaround for boot failure caused by SIGKILL upon device
+		 * driver initialization timeout.
+		 */
+		if (!test_tsk_thread_flag(current, TIF_MEMDIE)) {
+			pr_warn("I already got SIGKILL but ignoring it up to "
+				"10 seconds, in case the caller can't survive "
+				"fail-immediately-upon-SIGKILL event. "
+				"Please make sure that the caller can survive "
+				"this event, for this delay will be removed "
+				"in the future.\n");
+			dump_stack();
+		}
+		while (i++ < 10 && !test_tsk_thread_flag(current, TIF_MEMDIE))
+			if (wait_for_completion_timeout(&done, HZ))
+				goto ready;
 		/*
 		 * If I was SIGKILLed before kthreadd (or new kernel thread)
 		 * calls complete(), leave the cleanup of this structure to
@@ -305,6 +325,7 @@  struct task_struct *kthread_create_on_node(int (*threadfn)(void *data),
 		 */
 		wait_for_completion(&done);
 	}
+ready:
 	task = create->result;
 	if (!IS_ERR(task)) {
 		static const struct sched_param param = { .sched_priority = 0 };