[v1,5/5] driver-core: add driver asynchronous probe support

Message ID	1411768637-6809-6-git-send-email-mcgrof@do-not-panic.com
State	Not Applicable, archived
Delegated to:	David Miller
Headers:
Return-Path: <netdev-owner@vger.kernel.org>
From: "Luis R. Rodriguez" <mcgrof@do-not-panic.com>
To: gregkh@linuxfoundation.org, dmitry.torokhov@gmail.com, tiwai@suse.de, tj@kernel.org, arjan@linux.intel.com
Cc: teg@jklm.no, rmilasan@suse.com, werner@suse.com, oleg@redhat.com, hare@suse.com, bpoirier@suse.de, santosh@chelsio.com, pmladek@suse.cz, dbueso@suse.com, mcgrof@suse.com, linux-kernel@vger.kernel.org, Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>, Joseph Salisbury <joseph.salisbury@canonical.com>, Kay Sievers <kay@vrfy.org>, One Thousand Gnomes <gnomes@lxorguk.ukuu.org.uk>, Tim Gardner <tim.gardner@canonical.com>, Pierre Fersing <pierre-fersing@pierref.org>, Andrew Morton <akpm@linux-foundation.org>, Nagalakshmi Nandigama <nagalakshmi.nandigama@avagotech.com>, Praveen Krishnamoorthy <praveen.krishnamoorthy@avagotech.com>, Sreekanth Reddy <sreekanth.reddy@avagotech.com>, Abhijit Mahajan <abhijit.mahajan@avagotech.com>, Casey Leedom <leedom@chelsio.com>, Hariprasad S <hariprasad@chelsio.com>, MPT-FusionLinux.pdl@avagotech.com, linux-scsi@vger.kernel.org, netdev@vger.kernel.org
Subject: [PATCH v1 5/5] driver-core: add driver asynchronous probe support
Date: Fri, 26 Sep 2014 14:57:17 -0700
Message-Id: <1411768637-6809-6-git-send-email-mcgrof@do-not-panic.com>
In-Reply-To: <1411768637-6809-1-git-send-email-mcgrof@do-not-panic.com>
References: <1411768637-6809-1-git-send-email-mcgrof@do-not-panic.com>
Sender: netdev-owner@vger.kernel.org
Precedence: bulk

Luis R. Rodriguez Sept. 26, 2014, 9:57 p.m. UTC

From: "Luis R. Rodriguez" <mcgrof@suse.com>

Some init systems may wish to have device drivers run
their bus probe() asynchronously. This implements support
for that and allows userspace to request async probe as a
preference through a generic shared device driver module
parameter, async_probe. Async probe is exposed through a
module parameter because synchronous probe has been
prevalent for years, so some userspace might exist which
relies on the device driver probing synchronously and on
the assumption that the devices it provides will be
available immediately afterwards.

Some device drivers might not be able to run async probe,
so we enable device drivers to annotate this to prevent
the module parameter from having any effect on them.

This implementation uses queue_work(system_unbound_wq)
to queue async probes. This should enable probe to run
slightly *faster* if the driver's probe path does not
interact much with other workqueues; otherwise it may
run _slightly_ slower. Tests were done with cxgb4,
which is known to take long on probe, both without
having to run request_firmware() [0] and then by
requiring it to use request_firmware() [1]. The
differences in run time are only measurable in microseconds:

Tejun Heo Sept. 28, 2014, 3:03 p.m. UTC | #1

Hello,

On Fri, Sep 26, 2014 at 02:57:17PM -0700, Luis R. Rodriguez wrote:
...
> Systemd should consider enabling async probe on device drivers
> it loads through systemd-udev but probably does not want to
> enable it for modules loaded through systemd-modules-load
> (modules-load.d). At least on my booting enablign async probe
> for all modules fails to boot as such in order to make this

Did you find out why boot failed with those modules?

> a bit more useful we whitelist a few buses where it should be
> at least in theory safe to try to enable async probe. This
> way even if systemd tried to ask to enable async probe for all
> its device drivers the kernel won't blindly do this. We also
> have the sync_probe flag which device drivers can themselves
> enable *iff* its known the device driver should never async
> probe.
> 
> In order to help *test* things folks can use the bus.safe_mod_async_probe=1
> kernel parameter which will work as if userspace would have
> requested all modules to load with async probe. Daring folks can
> also use bus.force_mod_async_probe=1 which will enable asynch probe
> even on buses not tested in any way yet, if you use that though
> you're on your own.

If those two knobs are meant for debugging, let's please make that
fact immediately evident.  e.g. Make them ugly boot params like
"__DEVEL__driver_force_mod_async_probe".  Devel/debug options ending
up becoming stable interface are really nasty.

> +struct driver_attach_work {
> +	struct work_struct work;
> +	struct device_driver *driver;
> +};
> +
>  struct driver_private {
>  	struct kobject kobj;
>  	struct klist klist_devices;
>  	struct klist_node knode_bus;
>  	struct module_kobject *mkobj;
> +	struct driver_attach_work *attach_work;
>  	struct device_driver *driver;
>  };

How many bytes are we saving by allocating it separately?  Can't we
just embed it in driver_private?
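
For illustration, a userspace sketch of the embedded layout follows. The
work_struct and container_of definitions here are simplified stand-ins,
not the kernel's, and attach_ran and demo_run_embedded_work() are
hypothetical names that exist only to make the example checkable.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for the kernel's work_struct (assumption). */
struct work_struct {
	void (*func)(struct work_struct *work);
};

/* Simplified stand-in for the kernel's container_of() (assumption). */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct device_driver {
	const char *name;
};

struct driver_private {
	struct work_struct attach_work;	/* embedded: no separate allocation */
	struct device_driver *driver;
	int attach_ran;			/* hypothetical, for the demo only */
};

/* The work function recovers driver_private directly from the work item. */
static void driver_attach_workfn(struct work_struct *work)
{
	struct driver_private *priv =
		container_of(work, struct driver_private, attach_work);

	priv->attach_ran = 1;
}

/* Hypothetical driver for the sketch: runs the work item inline. */
static int demo_run_embedded_work(void)
{
	struct device_driver drv = { .name = "demo" };
	struct driver_private priv = { .driver = &drv };

	priv.attach_work.func = driver_attach_workfn;
	priv.attach_work.func(&priv.attach_work);
	return priv.attach_ran;
}
```

With the work item embedded there is no kzalloc() failure path and no
second kfree(), and the back-pointer to the driver can come from
driver_private itself.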

> +static void driver_attach_workfn(struct work_struct *work)
> +{
> +	int ret;
> +	struct driver_attach_work *attach_work =
> +		container_of(work, struct driver_attach_work, work);
> +	struct device_driver *drv = attach_work->driver;
> +	ktime_t calltime, delta, rettime;
> +	unsigned long long duration;

This could just be a personal preference but I think it's easier to
read if local vars w/ initializers come before the ones w/o.

> +
> +	calltime = ktime_get();
> +
> +	ret = driver_attach(drv);
> +	if (ret != 0) {
> +		remove_driver_private(drv);
> +		bus_put(drv->bus);
> +	}
> +
> +	rettime = ktime_get();
> +	delta = ktime_sub(rettime, calltime);
> +	duration = (unsigned long long) ktime_to_ns(delta) >> 10;
> +
> +	pr_debug("bus: '%s': add driver %s attach completed after %lld usecs\n",
> +		 drv->bus->name, drv->name, duration);

Why do we have the above printout for async path but not sync path?
It's kinda weird for the code path to diverge like this.  Shouldn't
the only difference be the context probes are running from?
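
A rough userspace sketch of what a shared timing helper could look like,
so that the sync and async paths differ only in calling context. All
names here (timed_attach, attach_timing, the fake clock and fake attach)
are hypothetical; the clock is injected only so the example can be
checked deterministically.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t (*clock_ns_fn)(void);	/* hypothetical clock source */
typedef int (*attach_fn)(void *drv);	/* stands in for driver_attach() */

struct attach_timing {
	int ret;		/* return value of the attach call */
	uint64_t usecs;		/* elapsed time in microseconds */
};

/* One helper for both paths: call it directly or from a work function. */
static struct attach_timing timed_attach(attach_fn attach, void *drv,
					 clock_ns_fn now)
{
	struct attach_timing t;
	uint64_t start = now();

	t.ret = attach(drv);
	/* Exact ns -> usec; a ">> 10" shift divides by 1024 instead. */
	t.usecs = (now() - start) / 1000;
	return t;
}

/* Fake clock and fake attach, for the demo only. */
static uint64_t fake_now_ns;

static uint64_t fake_clock(void)
{
	return fake_now_ns;
}

static int fake_attach(void *drv)
{
	(void)drv;
	fake_now_ns += 1024000;	/* pretend attach took 1024000 ns */
	return 0;
}
```

Note that the patch's ">> 10" under-reports by roughly 2.3% compared to
dividing by 1000: 1024000 ns comes out as 1000 instead of 1024 usecs.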

...
> +static bool drv_enable_async_probe(struct device_driver *drv,
> +				   struct bus_type *bus)
> +{
> +	struct module *mod;
> +
> +	if (!drv->owner || drv->sync_probe)
> +		return false;
> +
> +	if (force_mod_async)
> +		return true;
> +
> +	mod = drv->owner;
> +	if (!safe_mod_async && !mod->async_probe_requested)
> +		return false;
> +
> +	/* For now lets avoid stupid bug reports */
> +	if (!strcmp(bus->name, "pci") ||
> +	    !strcmp(bus->name, "pci_express") ||
> +	    !strcmp(bus->name, "hid") ||
> +	    !strcmp(bus->name, "sdio") ||
> +	    !strcmp(bus->name, "gameport") ||
> +	    !strcmp(bus->name, "mmc") ||
> +	    !strcmp(bus->name, "i2c") ||
> +	    !strcmp(bus->name, "platform") ||
> +	    !strcmp(bus->name, "usb"))
> +		return true;

Ugh... things like this tend to become permanent.  Do we really need
this?  And how are we gonna find out what's broken and why w/o bug
reports?
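
If the list does stay for a while, a table lookup would at least keep it
in one place. A minimal userspace sketch, with the bus names copied from
the hunk above; the helper name bus_allows_async_probe() is hypothetical
and not part of the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Bus names copied from the patch's strcmp() chain. */
static const char * const async_probe_buses[] = {
	"pci", "pci_express", "hid", "sdio", "gameport",
	"mmc", "i2c", "platform", "usb",
};

/* Hypothetical helper: true if the bus name is on the list. */
static bool bus_allows_async_probe(const char *bus_name)
{
	size_t n = sizeof(async_probe_buses) / sizeof(async_probe_buses[0]);
	size_t i;

	for (i = 0; i < n; i++) {
		if (!strcmp(bus_name, async_probe_buses[i]))
			return true;
	}
	return false;
}
```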

> diff --git a/drivers/base/dd.c b/drivers/base/dd.c
> index e4ffbcf..7999aba 100644
> --- a/drivers/base/dd.c
> +++ b/drivers/base/dd.c
> @@ -507,6 +507,13 @@ static void __device_release_driver(struct device *dev)
>  
>  	drv = dev->driver;
>  	if (drv) {
> +		if (drv->owner && !drv->sync_probe) {
> +			struct module *mod = drv->owner;
> +			struct driver_private *priv = drv->p;
> +
> +			if (mod->async_probe_requested)
> +				flush_work(&priv->attach_work->work);

This can be an unconditional flush_work(&priv->attach_work) if
attach_work isn't separately allocated.

>  static int unknown_module_param_cb(char *param, char *val, const char *modname,
>  				   void *arg)
>  {
> +	int ret;
> +	struct module *mod = arg;

Ditto with the order of definitions.

> +	if (strcmp(param, "async_probe") == 0) {
> +		mod->async_probe_requested = true;
> +		return 0;
> +	}

Generally looks good to me.

Thanks a lot for doing this! :)

Tom Gundersen Sept. 28, 2014, 5:07 p.m. UTC | #2

Hi Luis,

Thanks for the patches and the detailed analysis.

Feel free to add

Acked-by: Tom Gundersen <teg@jklm.no>

Minor comments on the commit message below.

On Fri, Sep 26, 2014 at 11:57 PM, Luis R. Rodriguez
<mcgrof@do-not-panic.com> wrote:
> From: "Luis R. Rodriguez" <mcgrof@suse.com>
>
> Some init systems may wish to express the desire to have
> device drivers run their device driver's bus probe() run
> asynchronously. This implements support for this and
> allows userspace to request async probe as a preference
> through a generic shared device driver module parameter,
> async_probe. Implemention for async probe is supported
> through a module parameter given that since synchronous
> probe has been prevalent for years some userspace might
> exist which relies on the fact that the device driver will
> probe synchronously and the assumption that devices it
> provides will be immediately available after this.
>
> Some device driver might not be able to run async probe
> so we enable device drivers to annotate this to prevent
> this module parameter from having any effect on them.
>
> This implementation uses queue_work(system_unbound_wq)
> to queue async probes, this should enable probe to run
> slightly *faster* if the driver's probe path did not
> have much interaction with other workqueues otherwise
> it may run _slightly_ slower. Tests were done with cxgb4,
> which is known to take long on probe, both without
> having to run request_firmware() [0] and then by
> requiring it to use request_firmware() [1]. The
> difference in run time are only measurable in microseconds:
>
> =====================================================================|
> strategy                                fw (usec)       no-fw (usec) |
> ---------------------------------------------------------------------|
> synchronous                             24472569        1307563      |
> kthread                                 25066415.5      1309868.5    |
> queue_work(system_unbound_wq)           24913661.5      1307631      |
> ---------------------------------------------------------------------|
>
> In practice, in seconds, the difference is barely noticeable:
>
> =====================================================================|
> strategy                                fw (s)          no-fw (s)    |
> ---------------------------------------------------------------------|
> synchronous                             24.47           1.31         |
> kthread                                 25.07           1.31         |
> queue_work(system_unbound_wq)           24.91           1.31         |
> ---------------------------------------------------------------------|
>
> [0] http://ftp.suse.com/pub/people/mcgrof/async-probe/probe-cgxb4-no-firmware.png
> [1] http://ftp.suse.com/pub/people/mcgrof/async-probe/probe-cgxb4-firmware.png
>
> The rest of the commit log documents why this feature was implemented
> primarily first for systemd and things it should consider next.
>
> Systemd has a general timeout for all workers currently set to 180
> seconds after which it will send a sigkill signal. Systemd now has a
> warning which is issued once it reaches 1/3 of the timeout. The original
> motivation for the systemd timeout was to help track device drivers
> which do not use asynch firmware loading on init() and the timeout was
> originally set to 30 seconds.

Please note that the motivation for the timeout in systemd had nothing
to do with async firmware loading (that was just the case where
problems cropped up). The motivation was to not allow udev-workers to
stay around indefinitely, and hence put an upper-bound on
their duration (initially 180 s). At some point the bound was reduced
to 30 seconds to make sure module-loading would bail out before the
kernel's firmware loading timeout would bail out (60s I believe). That
is no longer relevant, which is why it was safe to reset the timeout
to 180 s.

> Since systemd + kernel are heavily tied in for the purposes of this
> patch it is assumed you have merged on systemd the following
> commits:
>
> 671174136525ddf208cdbe75d6d6bd159afa961f        udev: timeout - warn after a third of the timeout before killing
> b5338a19864ac3f5632aee48069a669479621dca        udev: timeout - increase timeout
> 2e92633dbae52f5ac9b7b2e068935990d475d2cd        udev: bump event timeout to 60 seconds
> be2ea723b1d023b3d385d3b791ee4607cbfb20ca        udev: remove userspace firmware loading support
> 9f20a8a376f924c8eb5423cfc1f98644fc1e2d1a        udev: fixup commit
> dd5eddd28a74a49607a8fffcaf960040dba98479        udev: unify event timeout handling
> 9719859c07aa13539ed2cd4b31972cd30f678543        udevd: add --event-timeout commandline option
>
> Since we bundle together serially driver init() and probe()
> on module initialiation systemd's imposed timeout  put a limit on the
> amount of time a driver init() and probe routines can take. There's a
> few overlooked issues with this and the timeout in general:
>
> 0) Not all drivers are killed, the signal is just sent and
>    the kill will only be acted upoon if the driver you loaded
>    happens to have some code path that either uses kthreads (which
>    as of 786235ee are now killable), or uses some code which checks for
>    fatal_signal_pending() on the kernel somewhere -- i.e: pci_read_vpd().

Shouldn't this be seen as something to be fixed in the kernel? I mean,
do we not want userspace to have the possibility to kill udev/modprobe
even disregarding the worker timeouts (say at shutdown, or before
switching from the initrd)?

> 1) Since systemd is the only one logging the sigkill debugging that
>    drivers are not loaded or in the worst case *failed to boot* because
>    of a sigkill has proven hard to debug.

Care to clarify this a bit? Are the udev logs somehow unclear? If you
think we can improve the logging from udev, please ping me about that
and I'll sort it out.

> 2) When and if the signal is received by the driver somehow
>    the driver may fail at different points in its initialization
>    and unless all error paths on the driver are implemented
>    perfectly this could mean leaving a device in a half
>    initialized state.
>
> 3) The timeout is penalizing device drivers that take long on
>    probe(), this wasn't the original motivation. Systemd seems
>    to have been under assumption that probe was asynchronous,
>    this perhaps is true as an *objective* and goal for *some
>    subsystems* but by no means is it true that we've been on a wide
>    crusade to ensure this for all device drivers. It may be a good
>    idea for *many* device drivers but penalizing them with a kill
>    for taking long on probe is simply unacceptable specially
>    when the timeout is completely arbitrary.

The point is really not to "penalize" anything, we just need to make
sure we put some sort of restrictions on our workers so they don't
hang around forever.

> 4) The driver core calls probe for *all* devices that a driver can
>    claim and it does so serially, so if a device driver will need
>    to probe 3 devices and if probe on the device driver is synchronous
>    the amount of time that module loading will take will be:
>
>    driver load time = init() + probe for 3 devices serially
>
>    The timeout ultimatley ends up limiting the number of devices that
>    *any* device driver can support based on the following formula:
>
>    number_devices =          systemd_timeout
>                       -------------------------------------
>                          max known probe time for driver
>
>    Lastly since the error value passed down is the value of
>    the probe for the last device probed the module will fail
>    to load and all devices will fail to be available.
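
As a worked instance of the formula in item 4, here is a small userspace
sketch. The function name max_serial_probes() is hypothetical, and the
sketch assumes init() time is negligible, which the formula itself also
leaves out.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Upper bound on how many devices one driver can probe serially before
 * the worker timeout fires. init() time is ignored (assumption).
 */
static uint64_t max_serial_probes(uint64_t timeout_usec, uint64_t probe_usec)
{
	assert(probe_usec > 0);
	return timeout_usec / probe_usec;
}
```

Using the synchronous cxgb4 figure with firmware from the table above
(24472569 usec), a 180 s timeout allows 7 serial probes and a 30 s
timeout allows only 1. Without firmware loading (1307563 usec) the 180 s
bound is 137.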
>
> In the Linux kernel we don't want to work around the timeout,
> instead systemd must be changed to take all the above into
> consideration when issuing any kills on device drivers, ideally
> the sigkill should be considered to be ignored at least for
> kmod. In addition to this we help systemd by giving it what it
> originally considered was there and enable it to ask device
> drivers to use asynchronous probe. This patch addresses that
> feature.
>
> Systemd should consider enabling async probe on device drivers
> it loads through systemd-udev but probably does not want to
> enable it for modules loaded through systemd-modules-load
> (modules-load.d). At least on my booting enablign async probe
> for all modules fails to boot as such in order to make this
> a bit more useful we whitelist a few buses where it should be
> at least in theory safe to try to enable async probe. This
> way even if systemd tried to ask to enable async probe for all
> its device drivers the kernel won't blindly do this. We also
> have the sync_probe flag which device drivers can themselves
> enable *iff* its known the device driver should never async
> probe.
>
> In order to help *test* things folks can use the bus.safe_mod_async_probe=1
> kernel parameter which will work as if userspace would have
> requested all modules to load with async probe. Daring folks can
> also use bus.force_mod_async_probe=1 which will enable asynch probe
> even on buses not tested in any way yet, if you use that though
> you're on your own.
>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Arjan van de Ven <arjan@linux.intel.com>
> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Cc: Joseph Salisbury <joseph.salisbury@canonical.com>
> Cc: Kay Sievers <kay@vrfy.org>
> Cc: One Thousand Gnomes <gnomes@lxorguk.ukuu.org.uk>
> Cc: Tim Gardner <tim.gardner@canonical.com>
> Cc: Pierre Fersing <pierre-fersing@pierref.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Oleg Nesterov <oleg@redhat.com>
> Cc: Benjamin Poirier <bpoirier@suse.de>
> Cc: Nagalakshmi Nandigama <nagalakshmi.nandigama@avagotech.com>
> Cc: Praveen Krishnamoorthy <praveen.krishnamoorthy@avagotech.com>
> Cc: Sreekanth Reddy <sreekanth.reddy@avagotech.com>
> Cc: Abhijit Mahajan <abhijit.mahajan@avagotech.com>
> Cc: Casey Leedom <leedom@chelsio.com>
> Cc: Hariprasad S <hariprasad@chelsio.com>
> Cc: Santosh Rastapur <santosh@chelsio.com>
> Cc: MPT-FusionLinux.pdl@avagotech.com
> Cc: linux-scsi@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: netdev@vger.kernel.org
> Signed-off-by: Luis R. Rodriguez <mcgrof@suse.com>
> ---
>  drivers/base/base.h    |   6 +++
>  drivers/base/bus.c     | 137 +++++++++++++++++++++++++++++++++++++++++++++++--
>  drivers/base/dd.c      |   7 +++
>  include/linux/module.h |   2 +
>  kernel/module.c        |  12 ++++-
>  5 files changed, 159 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/base/base.h b/drivers/base/base.h
> index 251c5d3..24836f1 100644
> --- a/drivers/base/base.h
> +++ b/drivers/base/base.h
> @@ -43,11 +43,17 @@ struct subsys_private {
>  };
>  #define to_subsys_private(obj) container_of(obj, struct subsys_private, subsys.kobj)
>
> +struct driver_attach_work {
> +       struct work_struct work;
> +       struct device_driver *driver;
> +};
> +
>  struct driver_private {
>         struct kobject kobj;
>         struct klist klist_devices;
>         struct klist_node knode_bus;
>         struct module_kobject *mkobj;
> +       struct driver_attach_work *attach_work;
>         struct device_driver *driver;
>  };
>  #define to_driver(obj) container_of(obj, struct driver_private, kobj)
> diff --git a/drivers/base/bus.c b/drivers/base/bus.c
> index a5f41e4..41e321e6 100644
> --- a/drivers/base/bus.c
> +++ b/drivers/base/bus.c
> @@ -85,6 +85,7 @@ static void driver_release(struct kobject *kobj)
>         struct driver_private *drv_priv = to_driver(kobj);
>
>         pr_debug("driver: '%s': %s\n", kobject_name(kobj), __func__);
> +       kfree(drv_priv->attach_work);
>         kfree(drv_priv);
>  }
>
> @@ -662,10 +663,125 @@ static void remove_driver_private(struct device_driver *drv)
>         struct driver_private *priv = drv->p;
>
>         kobject_put(&priv->kobj);
> +       kfree(priv->attach_work);
>         kfree(priv);
>         drv->p = NULL;
>  }
>
> +static void driver_attach_workfn(struct work_struct *work)
> +{
> +       int ret;
> +       struct driver_attach_work *attach_work =
> +               container_of(work, struct driver_attach_work, work);
> +       struct device_driver *drv = attach_work->driver;
> +       ktime_t calltime, delta, rettime;
> +       unsigned long long duration;
> +
> +       calltime = ktime_get();
> +
> +       ret = driver_attach(drv);
> +       if (ret != 0) {
> +               remove_driver_private(drv);
> +               bus_put(drv->bus);
> +       }
> +
> +       rettime = ktime_get();
> +       delta = ktime_sub(rettime, calltime);
> +       duration = (unsigned long long) ktime_to_ns(delta) >> 10;
> +
> +       pr_debug("bus: '%s': add driver %s attach completed after %lld usecs\n",
> +                drv->bus->name, drv->name, duration);
> +}
> +
> +int bus_driver_async_probe(struct device_driver *drv)
> +{
> +       struct driver_private *priv = drv->p;
> +
> +       priv->attach_work = kzalloc(sizeof(struct driver_attach_work),
> +                                   GFP_KERNEL);
> +       if (!priv->attach_work)
> +               return -ENOMEM;
> +
> +       priv->attach_work->driver = drv;
> +       INIT_WORK(&priv->attach_work->work, driver_attach_workfn);
> +
> +       /* Keep this as pr_info() until this is prevalent */
> +       pr_info("bus: '%s': probe for driver %s is run asynchronously\n",
> +                drv->bus->name, drv->name);
> +
> +       queue_work(system_unbound_wq, &priv->attach_work->work);
> +
> +       return 0;
> +}
> +
> +/*
> + */
> +static bool safe_mod_async = false;
> +module_param_named(safe_mod_async_probe, safe_mod_async, bool, 0400);
> +MODULE_PARM_DESC(safe_mod_async_probe,
> +                "Enable async probe on all modules safely");
> +
> +static bool force_mod_async = false;
> +module_param_named(force_mod_async_probe, force_mod_async, bool, 0400);
> +MODULE_PARM_DESC(force_mod_async_probe,
> +                "Force async probe on all modules");
> +
> +/**
> + * drv_enable_async_probe - evaluates if async probe should be used
> + * @drv: device driver to evaluate
> + * @bus: the bus for the device driver
> + *
> + * The driver core supports enabling asynchronous probe on device drivers
> + * by requiring userspace to pass the module parameter "async_probe".
> + * Currently only modules are enabled to use this feature. If a device
> + * driver is known to not work properly with asynchronous probe they
> + * can force disable asynchronous probe from being enabled through
> + * userspace by adding setting sync_probe to true on the @drv. We require
> + * async probe to be requested from userspace given that we have historically
> + * supported synchronous probe and some userspaces may exist which depend
> + * on this functionality. Userspace may wish to use asynchronous probe for
> + * most device drivers but since this can fail boot in practice we only
> + * enable it currently for a set of buses.
> + *
> + * If you'd like to test enabling async probe for all buses whitelisted
> + * you can enable the safe_mod_async_probe module parameter. Note that its
> + * not a good idea to always enable this, in particular you probably don't
> + * want drivers under modules-load.d to use this. This module parameter should
> + * only be used to help test. If you'd like to test even futher you can
> + * use force_mod_async_probe, that will force enable async probe on all
> + * drivers, regardless if its bus type, it should however be used with
> + * caution.
> + */
> +static bool drv_enable_async_probe(struct device_driver *drv,
> +                                  struct bus_type *bus)
> +{
> +       struct module *mod;
> +
> +       if (!drv->owner || drv->sync_probe)
> +               return false;
> +
> +       if (force_mod_async)
> +               return true;
> +
> +       mod = drv->owner;
> +       if (!safe_mod_async && !mod->async_probe_requested)
> +               return false;
> +
> +       /* For now lets avoid stupid bug reports */
> +       if (!strcmp(bus->name, "pci") ||
> +           !strcmp(bus->name, "pci_express") ||
> +           !strcmp(bus->name, "hid") ||
> +           !strcmp(bus->name, "sdio") ||
> +           !strcmp(bus->name, "gameport") ||
> +           !strcmp(bus->name, "mmc") ||
> +           !strcmp(bus->name, "i2c") ||
> +           !strcmp(bus->name, "platform") ||
> +           !strcmp(bus->name, "usb"))
> +               return true;
> +
> +       return false;
> +}
> +
>  /**
>   * bus_add_driver - Add a driver to the bus.
>   * @drv: driver.
> @@ -675,6 +791,7 @@ int bus_add_driver(struct device_driver *drv)
>         struct bus_type *bus;
>         struct driver_private *priv;
>         int error = 0;
> +       bool async_probe = false;
>
>         bus = bus_get(drv->bus);
>         if (!bus)
> @@ -696,11 +813,19 @@ int bus_add_driver(struct device_driver *drv)
>         if (error)
>                 goto out_unregister;
>
> +       async_probe = drv_enable_async_probe(drv, bus);
> +
>         klist_add_tail(&priv->knode_bus, &bus->p->klist_drivers);
>         if (drv->bus->p->drivers_autoprobe) {
> -               error = driver_attach(drv);
> -               if (error)
> -                       goto out_unregister;
> +               if (async_probe) {
> +                       error = bus_driver_async_probe(drv);
> +                       if (error)
> +                               goto out_unregister;
> +               } else {
> +                       error = driver_attach(drv);
> +                       if (error)
> +                               goto out_unregister;
> +               }
>         }
>         module_add_driver(drv->owner, drv);
>
> @@ -1267,6 +1392,12 @@ EXPORT_SYMBOL_GPL(subsys_virtual_register);
>
>  int __init buses_init(void)
>  {
> +       if (unlikely(safe_mod_async))
> +               pr_info("Enabled safe_mod_async -- you may run into issues\n");
> +
> +       if (unlikely(force_mod_async))
> +               pr_info("Enabling force_mod_async -- you're on your own!\n");
> +
>         bus_kset = kset_create_and_add("bus", &bus_uevent_ops, NULL);
>         if (!bus_kset)
>                 return -ENOMEM;
> diff --git a/drivers/base/dd.c b/drivers/base/dd.c
> index e4ffbcf..7999aba 100644
> --- a/drivers/base/dd.c
> +++ b/drivers/base/dd.c
> @@ -507,6 +507,13 @@ static void __device_release_driver(struct device *dev)
>
>         drv = dev->driver;
>         if (drv) {
> +               if (drv->owner && !drv->sync_probe) {
> +                       struct module *mod = drv->owner;
> +                       struct driver_private *priv = drv->p;
> +
> +                       if (mod->async_probe_requested)
> +                               flush_work(&priv->attach_work->work);
> +               }
>                 pm_runtime_get_sync(dev);
>
>                 driver_sysfs_remove(dev);
> diff --git a/include/linux/module.h b/include/linux/module.h
> index 71f282a..1e9e017 100644
> --- a/include/linux/module.h
> +++ b/include/linux/module.h
> @@ -271,6 +271,8 @@ struct module {
>         bool sig_ok;
>  #endif
>
> +       bool async_probe_requested;
> +
>         /* symbols that will be GPL-only in the near future. */
>         const struct kernel_symbol *gpl_future_syms;
>         const unsigned long *gpl_future_crcs;
> diff --git a/kernel/module.c b/kernel/module.c
> index 88f3d6c..31d71ff 100644
> --- a/kernel/module.c
> +++ b/kernel/module.c
> @@ -3175,8 +3175,16 @@ out:
>  static int unknown_module_param_cb(char *param, char *val, const char *modname,
>                                    void *arg)
>  {
> +       int ret;
> +       struct module *mod = arg;
> +
> +       if (strcmp(param, "async_probe") == 0) {
> +               mod->async_probe_requested = true;
> +               return 0;
> +       }
> +
>         /* Check for magic 'dyndbg' arg */
> -       int ret = ddebug_dyndbg_module_param_cb(param, val, modname);
> +       ret = ddebug_dyndbg_module_param_cb(param, val, modname);
>         if (ret != 0)
>                 pr_warn("%s: unknown parameter '%s' ignored\n", modname, param);
>         return 0;
> @@ -3278,7 +3286,7 @@ static int load_module(struct load_info *info, const char __user *uargs,
>
>         /* Module is ready to execute: parsing args may do that. */
>         after_dashes = parse_args(mod->name, mod->args, mod->kp, mod->num_kp,
> -                                 -32768, 32767, NULL,
> +                                 -32768, 32767, mod,
>                                   unknown_module_param_cb);
>         if (IS_ERR(after_dashes)) {
>                 err = PTR_ERR(after_dashes);
> --
> 2.1.0
>

Dmitry Torokhov Sept. 28, 2014, 7:22 p.m. UTC | #3

Hi Luis,

On Fri, Sep 26, 2014 at 02:57:17PM -0700, Luis R. Rodriguez wrote:
> +static bool drv_enable_async_probe(struct device_driver *drv,
> +				   struct bus_type *bus)
> +{
> +	struct module *mod;
> +
> +	if (!drv->owner || drv->sync_probe)
> +		return false;

This bit is one of the biggest issues I have with the patch set. Why is async
probing limited to modules only? I mentioned several times that we need
async probing for built-in drivers, and the way you are structuring the flags
(async by default for modules, possibly opt-out of async for modules, forcibly
sync for built-in) makes it hard to extend the infrastructure to the built-in case.
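
To make the flag structure concrete, here is a userspace model of the
decision order in drv_enable_async_probe() as posted. The struct and the
function name are hypothetical; each field flattens one input that the
patch reads from drv, the module, or a kernel parameter.

```c
#include <assert.h>
#include <stdbool.h>

/* Inputs to the decision, flattened into one struct for the sketch. */
struct probe_request {
	bool is_module;		/* drv->owner != NULL */
	bool sync_probe;	/* driver opted out via sync_probe */
	bool force_mod_async;	/* bus.force_mod_async_probe=1 */
	bool safe_mod_async;	/* bus.safe_mod_async_probe=1 */
	bool async_requested;	/* module loaded with async_probe */
	bool bus_whitelisted;	/* bus name is on the patch's list */
};

/* Same order of checks as the posted drv_enable_async_probe(). */
static bool model_enable_async_probe(const struct probe_request *r)
{
	if (!r->is_module || r->sync_probe)
		return false;
	if (r->force_mod_async)
		return true;
	if (!r->safe_mod_async && !r->async_requested)
		return false;
	return r->bus_whitelisted;
}
```

In this model a built-in driver is rejected by the first check, so not
even the force knob can make it probe asynchronously.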

Also, as far as I can see, you are only considering the case where a driver is
being bound to already registered devices. If you have a module that creates a
device for a driver that is already loaded and takes a long time to probe, you
would still be probing synchronously even if the driver/module requested async
behavior.

So for me it is NAK in the current form.

Thanks.

Luis R. Rodriguez Sept. 29, 2014, 9:22 p.m. UTC | #4

On Sun, Sep 28, 2014 at 11:03:29AM -0400, Tejun Heo wrote:
> Hello,
> 
> On Fri, Sep 26, 2014 at 02:57:17PM -0700, Luis R. Rodriguez wrote:
> ...
> > Systemd should consider enabling async probe on device drivers
> > it loads through systemd-udev but probably does not want to
> > enable it for modules loaded through systemd-modules-load
> > (modules-load.d). At least on my booting enablign async probe
> > for all modules fails to boot as such in order to make this
> 
> Did you find out why boot failed with those modules?

No, it seems this was early in boot and I haven't been able to capture logs
of the faults yet. More on this below.

> > a bit more useful we whitelist a few buses where it should be
> > at least in theory safe to try to enable async probe. This
> > way even if systemd tried to ask to enable async probe for all
> > its device drivers the kernel won't blindly do this. We also
> > have the sync_probe flag which device drivers can themselves
> > enable *iff* its known the device driver should never async
> > probe.
> > 
> > In order to help *test* things folks can use the bus.safe_mod_async_probe=1
> > kernel parameter which will work as if userspace would have
> > requested all modules to load with async probe. Daring folks can
> > also use bus.force_mod_async_probe=1 which will enable asynch probe
> > even on buses not tested in any way yet, if you use that though
> > you're on your own.
> 
> If those two knobs are meant for debugging, let's please make that
> fact immediately evident.  e.g. Make them ugly boot params like
> "__DEVEL__driver_force_mod_async_probe".  Devel/debug options ending
> up becoming stable interface are really nasty.

Sure, makes sense. I wasn't sure how to make this clear; a naming
convention seems good to me, but I had also added at least a print
about this to the log. Ideally I think a TAINT_DEBUG would be best,
and it seems it could be useful for many other cases in the kernel;
we could also just re-use TAINT_CRAP. Thoughts? Greg?

> > +struct driver_attach_work {
> > +	struct work_struct work;
> > +	struct device_driver *driver;
> > +};
> > +
> >  struct driver_private {
> >  	struct kobject kobj;
> >  	struct klist klist_devices;
> >  	struct klist_node knode_bus;
> >  	struct module_kobject *mkobj;
> > +	struct driver_attach_work *attach_work;
> >  	struct device_driver *driver;
> >  };
> 
> How many bytes are we saving by allocating it separately?

This saves us 24 bytes per device driver.

>  Can't we just embed it in driver_private?

We sure can, and that is my preference as well, but just in case
I wanted to show the alternative space-saving approach too and
let folks decide. There's also the technical aspect of hiding
that data structure from drivers, which may be worth doing, but
I personally prefer the simplicity of stuffing it on the public
data structure. As you note below, we could then also
unconditionally flush_work() on __device_release_driver().

Greg, any preference?
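
To make the size trade-off concrete, here is a small userspace sketch. The
structs are stand-ins with made-up names (model_*), not the kernel
definitions; they only assume a work item shaped like one word of state, a
two-pointer list node and a callback pointer, as on a build without lockdep.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-ins, NOT the kernel definitions. */
struct model_list_head { struct model_list_head *next, *prev; };

struct model_work {
	long data;                              /* one word of state */
	struct model_list_head entry;           /* two pointers */
	void (*func)(struct model_work *work);  /* callback */
};

/* Option 1: pointer to a separately allocated work item (as in the patch). */
struct model_priv_ptr {
	void *other_fields[4];                  /* placeholder for existing members */
	struct model_work *attach_work;
};

/* Option 2: work item embedded directly. */
struct model_priv_embedded {
	void *other_fields[4];                  /* placeholder for existing members */
	struct model_work attach_work;
};

/* Extra bytes each driver pays when the work item is embedded. */
static size_t embed_cost_bytes(void)
{
	assert(sizeof(struct model_priv_embedded) >= sizeof(struct model_priv_ptr));
	return sizeof(struct model_priv_embedded) - sizeof(struct model_priv_ptr);
}
```

On a typical 64-bit build the stand-in work item is 32 bytes and the pointer
is 8, so embed_cost_bytes() comes out at 24, which would be consistent with
the figure above under those assumptions.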

> > +static void driver_attach_workfn(struct work_struct *work)
> > +{
> > +	int ret;
> > +	struct driver_attach_work *attach_work =
> > +		container_of(work, struct driver_attach_work, work);
> > +	struct device_driver *drv = attach_work->driver;
> > +	ktime_t calltime, delta, rettime;
> > +	unsigned long long duration;
> 
> This could just be a personal preference but I think it's easier to
> read if local vars w/ initializers come before the ones w/o.

We gotta standardize on *something*. I tend to declare them
in the order in which they are used (in this case I failed to
list calltime first), but sure, I'll put the initialized ones
first, I don't care much.

> > +
> > +	calltime = ktime_get();
> > +
> > +	ret = driver_attach(drv);
> > +	if (ret != 0) {
> > +		remove_driver_private(drv);
> > +		bus_put(drv->bus);
> > +	}
> > +
> > +	rettime = ktime_get();
> > +	delta = ktime_sub(rettime, calltime);
> > +	duration = (unsigned long long) ktime_to_ns(delta) >> 10;
> > +
> > +	pr_debug("bus: '%s': add driver %s attach completed after %lld usecs\n",
> > +		 drv->bus->name, drv->name, duration);
> 
> Why do we have the above printout for async path but not sync path?
> It's kinda weird for the code path to diverge like this.  Shouldn't
> the only difference be the context probes are running from?

Yeah sure, I'll remove this. It was useful to me for testing, when
evaluating this against kthreads / sync runs, but it certainly was mostly
for debugging.
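
A side note on the `>> 10` in the quoted hunk: it divides by 1024 rather than
1000, so the printed "usecs" value is an approximation that reads slightly
low. A minimal userspace sketch of the arithmetic (the function names are
made up for illustration):

```c
#include <assert.h>
#include <stdint.h>

/* Same shortcut as the quoted hunk: shift by 10, i.e. divide by 1024. */
static uint64_t ns_to_usecs_shift(uint64_t ns)
{
	return ns >> 10;
}

/* Exact conversion, for comparison. */
static uint64_t ns_to_usecs_exact(uint64_t ns)
{
	return ns / 1000;
}

/* How far the shifted value falls short of the exact one, in usecs. */
static uint64_t shift_shortfall_usecs(uint64_t ns)
{
	uint64_t exact = ns_to_usecs_exact(ns);
	uint64_t approx = ns_to_usecs_shift(ns);

	assert(approx <= exact); /* 1024 > 1000, so the shift never overshoots */
	return exact - approx;
}
```

For a one second probe (1000000000 ns) the shift reports 976562 where the
exact value is 1000000, a shortfall of roughly 2.3%.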

> ...
> > +static bool drv_enable_async_probe(struct device_driver *drv,
> > +				   struct bus_type *bus)
> > +{
> > +	struct module *mod;
> > +
> > +	if (!drv->owner || drv->sync_probe)
> > +		return false;
> > +
> > +	if (force_mod_async)
> > +		return true;
> > +
> > +	mod = drv->owner;
> > +	if (!safe_mod_async && !mod->async_probe_requested)
> > +		return false;
> > +
> > +	/* For now lets avoid stupid bug reports */
> > +	if (!strcmp(bus->name, "pci") ||
> > +	    !strcmp(bus->name, "pci_express") ||
> > +	    !strcmp(bus->name, "hid") ||
> > +	    !strcmp(bus->name, "sdio") ||
> > +	    !strcmp(bus->name, "gameport") ||
> > +	    !strcmp(bus->name, "mmc") ||
> > +	    !strcmp(bus->name, "i2c") ||
> > +	    !strcmp(bus->name, "platform") ||
> > +	    !strcmp(bus->name, "usb"))
> > +		return true;
> 
> Ugh... things like this tend to become permanent.  Do we really need
> this?  And how are we gonna find out what's broken why w/o bug
> reports?

Yeah... well, we have two options: keep something like this to
at least make it generally useful, or remove it and let folks who
care start fixing async for all modules. The downside to removing
it is that async probe becomes pretty much useless on most systems
right now: systemd would probably have to carry the list above
itself if it wanted to start using this without breaking systems.

Let me know what is preferred.

> > diff --git a/drivers/base/dd.c b/drivers/base/dd.c
> > index e4ffbcf..7999aba 100644
> > --- a/drivers/base/dd.c
> > +++ b/drivers/base/dd.c
> > @@ -507,6 +507,13 @@ static void __device_release_driver(struct device *dev)
> >  
> >  	drv = dev->driver;
> >  	if (drv) {
> > +		if (drv->owner && !drv->sync_probe) {
> > +			struct module *mod = drv->owner;
> > +			struct driver_private *priv = drv->p;
> > +
> > +			if (mod->async_probe_requested)
> > +				flush_work(&priv->attach_work->work);
> 
> This can be unconditional flush_work(&priv->attach_work) if attach_work
> isn't separately allocated.

Indeed.

> >  static int unknown_module_param_cb(char *param, char *val, const char *modname,
> >  				   void *arg)
> >  {
> > +	int ret;
> > +	struct module *mod = arg;
> 
> Ditto with the order of definitions.

Amended.

> Generally looks good to me.
> 
> Thanks a lot for doing this! :)

Thanks for the review and pointers so far.

  Luis
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Tejun Heo Sept. 29, 2014, 9:26 p.m. UTC | #5

Hello, Luis.

On Mon, Sep 29, 2014 at 11:22:08PM +0200, Luis R. Rodriguez wrote:
> > > +	/* For now lets avoid stupid bug reports */
> > > +	if (!strcmp(bus->name, "pci") ||
> > > +	    !strcmp(bus->name, "pci_express") ||
> > > +	    !strcmp(bus->name, "hid") ||
> > > +	    !strcmp(bus->name, "sdio") ||
> > > +	    !strcmp(bus->name, "gameport") ||
> > > +	    !strcmp(bus->name, "mmc") ||
> > > +	    !strcmp(bus->name, "i2c") ||
> > > +	    !strcmp(bus->name, "platform") ||
> > > +	    !strcmp(bus->name, "usb"))
> > > +		return true;
> > 
> > Ugh... things like this tend to become permanent.  Do we really need
> > this?  And how are we gonna find out what's broken why w/o bug
> > reports?
> 
> Yeah... well we have two options, one is have something like this to
> at least make it generally useful or remove this and let folks who
> care start fixing async for all modules. The downside to removing
> this is it makes async probe pretty much useless on most systems
> right now, it would mean systemd would have to probably consider
> the list above if they wanted to start using this without expecting
> systems to not work.

So, I'd much prefer blacklist approach if something like this is a
necessity.  That way, we'd at least know what doesn't work.

Thanks.
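
The difference between the two policies can be sketched in userspace as
follows. The whitelist names are copied from the quoted hunk; the blacklist
entry is a hypothetical placeholder (no bus is claimed to be broken here),
and the helper names are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Bus names copied from the quoted hunk. */
static const char *const async_whitelist[] = {
	"pci", "pci_express", "hid", "sdio", "gameport",
	"mmc", "i2c", "platform", "usb",
};

/* Hypothetical entry, for illustration only. */
static const char *const async_blacklist[] = {
	"example_broken_bus",
};

#define COUNT(a) (sizeof(a) / sizeof((a)[0]))

static bool name_in(const char *name, const char *const *list, size_t n)
{
	size_t i;

	assert(name != NULL);
	for (i = 0; i < n; i++)
		if (strcmp(name, list[i]) == 0)
			return true;
	return false;
}

/* Whitelist: buses nobody has listed stay synchronous. */
static bool whitelist_allows_async(const char *bus)
{
	return name_in(bus, async_whitelist, COUNT(async_whitelist));
}

/* Blacklist: buses nobody has listed get async probe. */
static bool blacklist_allows_async(const char *bus)
{
	return !name_in(bus, async_blacklist, COUNT(async_blacklist));
}
```

With the whitelist an unlisted bus is never exercised, so breakage on it is
never reported; with the blacklist it is exercised by default and every
entry in the list documents a known failure.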

Greg Kroah-Hartman Sept. 29, 2014, 9:59 p.m. UTC | #6

On Mon, Sep 29, 2014 at 11:22:08PM +0200, Luis R. Rodriguez wrote:
> On Sun, Sep 28, 2014 at 11:03:29AM -0400, Tejun Heo wrote:
> > Hello,
> > 
> > On Fri, Sep 26, 2014 at 02:57:17PM -0700, Luis R. Rodriguez wrote:
> > ...
> > > Systemd should consider enabling async probe on device drivers
> > > it loads through systemd-udev but probably does not want to
> > > enable it for modules loaded through systemd-modules-load
> > > > (modules-load.d). At least on my booting enabling async probe
> > > for all modules fails to boot as such in order to make this
> > 
> > Did you find out why boot failed with those modules?
> 
> No, it seems this was early in boot and I haven't been able to capture the logs
> yet of the faults. More on this below.
> 
> > > a bit more useful we whitelist a few buses where it should be
> > > at least in theory safe to try to enable async probe. This
> > > way even if systemd tried to ask to enable async probe for all
> > > its device drivers the kernel won't blindly do this. We also
> > > have the sync_probe flag which device drivers can themselves
> > > enable *iff* its known the device driver should never async
> > > probe.
> > > 
> > > In order to help *test* things folks can use the bus.safe_mod_async_probe=1
> > > kernel parameter which will work as if userspace would have
> > > requested all modules to load with async probe. Daring folks can
> > > also use bus.force_mod_async_probe=1 which will enable asynch probe
> > > even on buses not tested in any way yet, if you use that though
> > > you're on your own.
> > 
> > If those two knobs are meant for debugging, let's please make that
> > fact immediately evident.  e.g. Make them ugly boot params like
> > "__DEVEL__driver_force_mod_async_probe".  Devel/debug options ending
> > up becoming stable interface are really nasty.
> 
> Sure make sense, I wasn't quite sure how to make this quite clear,
> a naming convention seems good to me but I also had added at least
> a print about this on the log. Ideally I think a TAINT_DEBUG would
> be best and it seems it could be useful for many other cases in
> the kernel, we could also just re-use TAINT_CRAP as well. Thoughts?
> Greg?

TAINT_CRAP is for drivers/staging/ code, don't try to repurpose it for
some other horrid option.  There's no reason we can't add more taint
flags for this.

greg k-h

Luis R. Rodriguez Sept. 29, 2014, 10:10 p.m. UTC | #7

On Mon, Sep 29, 2014 at 2:59 PM, Greg KH <gregkh@linuxfoundation.org> wrote:
>> Sure make sense, I wasn't quite sure how to make this quite clear,
>> a naming convention seems good to me but I also had added at least
>> a print about this on the log. Ideally I think a TAINT_DEBUG would
>> be best and it seems it could be useful for many other cases in
>> the kernel, we could also just re-use TAINT_CRAP as well. Thoughts?
>> Greg?
>
> TAINT_CRAP is for drivers/staging/ code, don't try to repurpose it for
> some other horrid option.  There's no reason we can't add more taint
> flags for this.

OK thanks, I'll add TAINT_DEBUG. Any preference where to stuff struct
driver_attach_work *attach_work ? On the private data structure as
this patch currently implements, saving us 24 bytes and hiding it from
drivers, or stuffing it on the device driver and simplifying the core
code?

 Luis

Greg Kroah-Hartman Sept. 29, 2014, 10:24 p.m. UTC | #8

On Mon, Sep 29, 2014 at 03:10:22PM -0700, Luis R. Rodriguez wrote:
> On Mon, Sep 29, 2014 at 2:59 PM, Greg KH <gregkh@linuxfoundation.org> wrote:
> >> Sure make sense, I wasn't quite sure how to make this quite clear,
> >> a naming convention seems good to me but I also had added at least
> >> a print about this on the log. Ideally I think a TAINT_DEBUG would
> >> be best and it seems it could be useful for many other cases in
> >> the kernel, we could also just re-use TAINT_CRAP as well. Thoughts?
> >> Greg?
> >
> > TAINT_CRAP is for drivers/staging/ code, don't try to repurpose it for
> > some other horrid option.  There's no reason we can't add more taint
> > flags for this.
> 
> OK thanks, I'll add TAINT_DEBUG. Any preference where to stuff struct
> driver_attach_work *attach_work ? On the private data structure as
> this patch currently implements, saving us 24 bytes and hiding it from
> drivers, or stuffing it on the device driver and simplifying the core
> code?

I honestly haven't even looked at this series, sorry.  It's too late,
near the close of the merge window for 3.18, and I have been on the road
for the past week in France.

greg k-h

Luis R. Rodriguez Sept. 30, 2014, 2:27 a.m. UTC | #9

On Sun, Sep 28, 2014 at 07:07:24PM +0200, Tom Gundersen wrote:
> On Fri, Sep 26, 2014 at 11:57 PM, Luis R. Rodriguez
> <mcgrof@do-not-panic.com> wrote:
> > From: "Luis R. Rodriguez" <mcgrof@suse.com>
> > Systemd has a general timeout for all workers currently set to 180
> > seconds after which it will send a sigkill signal. Systemd now has a
> > warning which is issued once it reaches 1/3 of the timeout. The original
> > motivation for the systemd timeout was to help track device drivers
> > which do not use asynch firmware loading on init() and the timeout was
> > originally set to 30 seconds.
> 
> Please note that the motivation for the timeout in systemd had nothing
> to do with async firmware loading (that was just the case where
> problems cropped up).

*Part* of the original kill logic, according to the commit log, was actually
due to the assumption that the issues observed *were* synchronous firmware
loading on module init():

commit e64fae5573e566ce4fd9b23c68ac8f3096603314
Author: Kay Sievers <kay.sievers@vrfy.org>
Date:   Wed Jan 18 05:06:18 2012 +0100

    udevd: kill hanging event processes after 30 seconds

    Some broken kernel drivers load firmware synchronously in the module init
    path and block modprobe until the firmware request is fulfilled.
    <...>

My point here is not to point fingers but to explain why we went on with
this, and how we failed to realize until later that the driver core
runs probe together with init. When a few folks pointed out the issues
with the kill, the issue was punted back to kernel developers, and the
assumption, even among some kernel maintainers, was that init paths
with sync behaviour were causing the delays and that those drivers were
broken. It is important to highlight that these assumptions set
us off on the wrong path for a while, hunting for a fix either in
drivers or elsewhere.

> The motivation was to not allow udev-workers to
> stay around indefinitely, and hence put an upper-bound on
> their duration (initially 180 s). At some point the bound was reduced
> to 30 seconds to make sure module-loading would bail out before the
> kernel's firmware loading timeout would bail out (60s I believe).

Sure, part of it was that, but folks beat on driver developers about
the kill, insisting it was the drivers that were broken. It was only when
the Chelsio folks called bloody murder because their delays were on probe
that we realized there was a bit more to this than what was being pushed
back on to driver developers.

> That
> is no longer relevant, which is why it was safe to reset the timeout
> to 180 s.

Indeed :D

> > Since systemd + kernel are heavily tied in for the purposes of this
> > patch it is assumed you have merged on systemd the following
> > commits:
> >
> > 671174136525ddf208cdbe75d6d6bd159afa961f        udev: timeout - warn after a third of the timeout before killing
> > b5338a19864ac3f5632aee48069a669479621dca        udev: timeout - increase timeout
> > 2e92633dbae52f5ac9b7b2e068935990d475d2cd        udev: bump event timeout to 60 seconds
> > be2ea723b1d023b3d385d3b791ee4607cbfb20ca        udev: remove userspace firmware loading support
> > 9f20a8a376f924c8eb5423cfc1f98644fc1e2d1a        udev: fixup commit
> > dd5eddd28a74a49607a8fffcaf960040dba98479        udev: unify event timeout handling
> > 9719859c07aa13539ed2cd4b31972cd30f678543        udevd: add --event-timeout commandline option
> >
> > Since we bundle together serially driver init() and probe()
> > on module initialization systemd's imposed timeout puts a limit on the
> > amount of time a driver init() and probe routines can take. There's a
> > few overlooked issues with this and the timeout in general:
> >
> > 0) Not all drivers are killed, the signal is just sent and
> >    the kill will only be acted upon if the driver you loaded
> >    happens to have some code path that either uses kthreads (which
> >    as of 786235ee are now killable), or uses some code which checks for
> >    fatal_signal_pending() on the kernel somewhere -- i.e: pci_read_vpd().
> 
> Shouldn't this be seen as something to be fixed in the kernel?

That's a great question. After CVE-2012-4398 and the series of patches
that enabled the OOM killer to kill things, followed by 786235ee to also
handle OOM on kthreads, it seems imperative that we strive towards this. In
practice, however, if you're getting OOMs on boot you have far more serious
issues to be concerned about than handling CVE-2012-4398. Another issue is
that even if we wanted to treat this as critical right now, driver error
paths on module loading tend to be pretty buggy, and we'd probably end up
causing more issues than we fix if the sigkill that triggered this was an
arbitrary timeout, especially if the timeout is not properly justified.
Addressing sigkill due to OOM is important, but as noted, if you're running
out of memory at load time you have other problems to be concerned about.

So the timeout is probably not a good reason to extend the kill onto more
drivers, as doing so would probably create more issues than it fixes
right now.

> I mean,
> do we not want userspace to have the possibility to kill udev/modprobe
> even disregarding the worker timeouts (say at shutdown, or before
> switching from the initrd)?

That's a good point, and I think the merit of handling a kill for the
other reasons (shutdown, switching from the initrd) should be addressed
separately. I mean that justifying the kill for those other
reasons does not justify the existing kill on timeout for synchronous
probing.

If it's important to handle the kill on shutdown / switching initrd,
that should be dealt with orthogonally.

> > 1) Since systemd is the only one logging the sigkill debugging that
> >    drivers are not loaded or in the worst case *failed to boot* because
> >    of a sigkill has proven hard to debug.
> 
> Care to clarify this a bit? Are the udev logs somehow unclear? 

Sure, so the problem is that folks debugging were not aware of what systemd was
doing.  Let me be clear that the original 30 second sigkill timeout was
passed down onto driver maintainers as an undocumented new kernel policy, a
slap-in-the-face-you-must-obviously-be-doing-something-wrong (TM) approach.
This was a policy decision passed down as a *reactive* measure; not many folks
were aware of it or of what systemd was doing. What made the situation even worse
was that, as noted in 0), even though the sigkill had been sent since systemd
commit e64fae55 (January 2012), it was not being picked up by many
drivers. To be clear, the sigkill was only picked up if your driver happened
to have some code on init / probe that checked for
fatal_signal_pending(), and even when that triggered, folks debugging were in no
way, shape or form expecting a sigkill from userspace on modprobe, as it was not well
known that this was part of the policy they should follow. Shit started to hit
the fan a bit more widely when kernel commit 786235ee (Nov 2013) was merged
upstream, which allowed kthreads to be killed, and more drivers started failing.
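
Why the kill only lands on some drivers can be modelled in userspace like
this. Everything here is a hypothetical stand-in (model_task, MODEL_STEPS and
the two probe functions are not kernel API); the only point is that a pending
fatal signal has no effect unless the code path checks for it, and that a
path which does check stops part-way through initialization.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_STEPS 4

/* Hypothetical model state, not kernel API. */
struct model_task {
	int kill_at_step;  /* step at which the signal becomes pending, -1 = never */
	int steps_done;    /* how many initialization steps completed */
};

/* Stand-in for a fatal_signal_pending() style check. */
static bool model_fatal_signal_pending(const struct model_task *t)
{
	return t->kill_at_step >= 0 && t->steps_done >= t->kill_at_step;
}

/* A probe path that never checks: it always runs to completion. */
static int probe_ignoring_signal(struct model_task *t)
{
	assert(t != NULL);
	t->steps_done = 0;
	while (t->steps_done < MODEL_STEPS)
		t->steps_done++;
	return 0;
}

/* A probe path that checks before each step: it can stop half-way. */
static int probe_checking_signal(struct model_task *t)
{
	assert(t != NULL);
	t->steps_done = 0;
	while (t->steps_done < MODEL_STEPS) {
		if (model_fatal_signal_pending(t))
			return -EINTR; /* caller must unwind steps_done steps */
		t->steps_done++;
	}
	return 0;
}
```

With kill_at_step set to 2, the first function still finishes all four
steps, while the second returns -EINTR after two, leaving whatever those two
steps set up to be undone by the error path.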

An example of an ancient bug that no one handled until recently:

https://bugzilla.kernel.org/show_bug.cgi?id=59581

There is a proper fix for this now, but the kill was what was causing it
in the first place. The kill was justified on the grounds that these drivers
*should* be using async probe, but by no means does that mean the kill was
justified for all subsystems / drivers. The bug also sent
people on the wrong track, and it was only when Alexander poked me
about the issue we were seeing on cxgb4 likely being related that we
started to really zero in on the real issue.

The first driver reported / studied due to the kill from systemd was
mptsas:

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1276705

A full bisect was done just to try to understand what the issue was.
Then there was the tug of war over whether to revert the patch that
allowed the kthread to be killed or to treat this as a systemd issue that
required increasing the timeout. This was still a storage driver,
and increasing the timeout arbitrarily really would not have helped
address the root cause of the issue.

The next non-storage driver bug that was reported and heavily
debugged was cxgb4 and it wasn't easy to debug:

https://bugzilla.suse.com/show_bug.cgi?id=877622

The conclusion then is that folks were simply not aware of this new de-facto
policy, which was well intentioned but obviously incorrect, and no one
was really paying attention to systemd-udevd logs. If we want chatty
behaviour that people will pick up, we probably want a WARN()
in the kernel instead, especially before we kill a driver, and even then I'm
sure this can irritate some folks.

> If you think we can improve the logging from udev, please ping me about that
> and I'll sort it out.

I think the logging done by systemd is fine; there are a few issues with the
way things trickled down and with what we now need to do. First and foremost
there was a general communication issue about this new timing policy, and
obviously it would have helped if it had also had more design / review from
others. It's no one's fault, but we should learn from it. Design policies in
systemd that can affect the kernel / drivers could likely use a bit more review
from a wider audience, and should probably include folks who are going to be
more critical than those who would typically be favorable. Without wider review
we could end up with something like a filter bubble [0] but applied to
engineering, a design filter bubble, if you will. So apart from addressing
logging, it's important to reflect on this issue and aim for having
something like a Red Team [1] on design involving systemd and the kernel. This
is especially true if we are to really marry these two together more and more.
The more critical people can be the better, but of course they need to
provide constructive criticism, not just rants.

In terms of logging:

Do we know if distributions / users are reviewing systemd-udevd logs for
certain types of issues with as much diligence as they give kernel logs when
systemd makes decisions affecting the kernel? If not, we should consider a way
to make that happen. In this case the fact that drivers were being killed while
being loaded was missed because it was unexpected, so folks
didn't know to look for it, but apart from that the *reason* for the
kill probably could have helped too. To help with both of these we have to
consider whether we are going to keep the sigkill in systemd on module loading
due to a timeout.
As you clarified, the goal of the timeout is to avoid having udev workers stay
around indefinitely, but I think we need to give kmod workers a bit more
consideration.  The point of this patch set was partly to give systemd what it
assumed was there, but clearly we can't assume all drivers can be loaded
asynchronously without issues right now. That means that even with this
functionality merged, systemd will have to cope with the fact that some drivers
will be loaded with synchronous probe.  A general timeout, especially one with
a sigkill, is probably not a good idea then, unless of course:

0) those device drivers / subsystem maintainer want a timeout
1) the above decision can distinguish between sync probe / async probe
   being done

To address 0), perhaps one solution is that subsystem maintainers who
feel this is needed can express it in a data structure somewhere,
perhaps on the bus, and/or with a per-driver override, for example.

For 1) we could expose what we end up doing through sysfs.
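
One possible shape for the bus default plus driver override idea in 0) is
sketched below in userspace. All names and fields are hypothetical; nothing
like this exists in the patch set, and the value 0 meaning "never kill" is an
assumption made only for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_NO_TIMEOUT 0u  /* hypothetical: 0 means "never kill" */

/* Hypothetical per-bus default chosen by the subsystem maintainer. */
struct model_bus_policy {
	unsigned int timeout_secs;
};

/* Hypothetical per-driver view with an optional override. */
struct model_drv_policy {
	const struct model_bus_policy *bus;
	bool has_override;
	unsigned int override_secs;
};

/* The driver override wins; otherwise fall back to the bus default. */
static unsigned int effective_timeout_secs(const struct model_drv_policy *drv)
{
	assert(drv != NULL && drv->bus != NULL);
	return drv->has_override ? drv->override_secs : drv->bus->timeout_secs;
}
```

A value resolved this way is also the kind of thing that could be exported
read-only to userspace, so a worker could tell whether a timeout applies at
all before arming one.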

Of course userspace could also simply want to put in place some
requirements, but in terms of a timeout / kill it would also have to
accept that it cannot always get what it wants. For instance, we now know
that async probe may not be possible on some drivers.

Perhaps it's best to think about this differently and work out now a
way to do that efficiently instead of reactively. Apart from having
the ability to let systemd ask for async probe, what else do we want
to accomplish?

[0] http://en.wikipedia.org/wiki/Filter_bubble
[1] http://en.wikipedia.org/wiki/Red_team

> > 2) When and if the signal is received by the driver somehow
> >    the driver may fail at different points in its initialization
> >    and unless all error paths on the driver are implemented
> >    perfectly this could mean leaving a device in a half
> >    initialized state.
> >
> > 3) The timeout is penalizing device drivers that take long on
> >    probe(), this wasn't the original motivation. Systemd seems
> >    to have been under assumption that probe was asynchronous,
> >    this perhaps is true as an *objective* and goal for *some
> >    subsystems* but by no means is it true that we've been on a wide
> >    crusade to ensure this for all device drivers. It may be a good
> >    idea for *many* device drivers but penalizing them with a kill
> >    for taking long on probe is simply unacceptable specially
> >    when the timeout is completely arbitrary.
> 
> The point is really not to "penalize" anything, we just need to make
> sure we put some sort of restrictions on our workers so they don't
> hang around forever.

Thanks for clarifying this. Can you explain what issues could arise
from making an exception that allows kmod workers to hang around
completing init + probe beyond a certain defined amount of time without
being killed?

  Luis

Luis R. Rodriguez Sept. 30, 2014, 7:15 a.m. UTC | #10

On Sun, Sep 28, 2014 at 12:22:47PM -0700, Dmitry Torokhov wrote:
> Hi Luis,
> 
> On Fri, Sep 26, 2014 at 02:57:17PM -0700, Luis R. Rodriguez wrote:
> > +static bool drv_enable_async_probe(struct device_driver *drv,
> > +				   struct bus_type *bus)
> > +{
> > +	struct module *mod;
> > +
> > +	if (!drv->owner || drv->sync_probe)
> > +		return false;
> 
> This bit is one of the biggest issues I have with the patch set. Why async
> probing is limited to modules only?

Because Tejun wanted to address this separately. So it's not that we will
restrict this to modules; a non-module solution should be added as an evolution
on top of this, as a secondary step.

> I mentioned several times that we need
> async probing for built-in drivers and the way you are structuring the flags
> (async by default for modules, possibly opt-out of async for modules, forcibly
> sync for built-in) it is hard to extend the infrastructure for built-in case.

I confess I haven't tried enabling built-in as a secondary step, but that's just
due to lack of time right now. I don't think it's impossible; I actually
think it should be fairly trivial. Are there real blockers you see to doing
this as an evolutionary step?
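
For reference, the early checks of the quoted drv_enable_async_probe() can be
modelled in userspace as below. The structs are stand-ins carrying only the
fields those checks read, the function name is made up, and the bus whitelist
that follows in the real function is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-ins for the fields the quoted checks read. */
struct model_module {
	bool async_probe_requested;
};

struct model_driver {
	const struct model_module *owner;  /* NULL for a built-in driver */
	bool sync_probe;
};

/* Mirrors the quoted checks up to, but not including, the bus whitelist. */
static bool model_async_candidate(const struct model_driver *drv,
				  bool force_mod_async, bool safe_mod_async)
{
	assert(drv != NULL);
	if (!drv->owner || drv->sync_probe)
		return false;
	if (force_mod_async)
		return true;
	return safe_mod_async || drv->owner->async_probe_requested;
}
```

Because the owner check comes first, a built-in driver is rejected even when
the force knob is set, which is the limitation being raised here.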

> Also, as far as I can see, you are only considering the case where driver is
> being bound to already registered devices. If you have a module that creates a
> device for a driver that is already loaded and takes long time to probe you
> would still be probing synchronously even if driver/module requested async
> behavior.

Can you provide an example code path where this is hit? I'd certainly like to
address that as well.

  Luis

Luis R. Rodriguez Sept. 30, 2014, 7:21 a.m. UTC | #11

On Mon, Sep 29, 2014 at 05:26:01PM -0400, Tejun Heo wrote:
> Hello, Luis.
> 
> On Mon, Sep 29, 2014 at 11:22:08PM +0200, Luis R. Rodriguez wrote:
> > > > +	/* For now lets avoid stupid bug reports */
> > > > +	if (!strcmp(bus->name, "pci") ||
> > > > +	    !strcmp(bus->name, "pci_express") ||
> > > > +	    !strcmp(bus->name, "hid") ||
> > > > +	    !strcmp(bus->name, "sdio") ||
> > > > +	    !strcmp(bus->name, "gameport") ||
> > > > +	    !strcmp(bus->name, "mmc") ||
> > > > +	    !strcmp(bus->name, "i2c") ||
> > > > +	    !strcmp(bus->name, "platform") ||
> > > > +	    !strcmp(bus->name, "usb"))
> > > > +		return true;
> > > 
> > > Ugh... things like this tend to become permanent.  Do we really need
> > > this?  And how are we gonna find out what's broken why w/o bug
> > > reports?
> > 
> > Yeah... well we have two options, one is have something like this to
> > at least make it generally useful or remove this and let folks who
> > care start fixing async for all modules. The downside to removing
> > this is it makes async probe pretty much useless on most systems
> > right now, it would mean systemd would have to probably consider
> > the list above if they wanted to start using this without expecting
> > systems to not work.
> 
> So, I'd much prefer blacklist approach if something like this is a
> necessity.  That way, we'd at least know what doesn't work.

For buses? Or do you mean you'd want to wait until we have a decent
list of drivers with the sync probe flag set? If the latter, it may take
a while to get that list for this to be somewhat useful.

  Luis

Luis R. Rodriguez Sept. 30, 2014, 7:47 a.m. UTC | #12

On Tue, Sep 30, 2014 at 04:27:51AM +0200, Luis R. Rodriguez wrote:
> On Sun, Sep 28, 2014 at 07:07:24PM +0200, Tom Gundersen wrote:
> > On Fri, Sep 26, 2014 at 11:57 PM, Luis R. Rodriguez
> > <mcgrof@do-not-panic.com> wrote:
> > > From: "Luis R. Rodriguez" <mcgrof@suse.com>
> > > 0) Not all drivers are killed, the signal is just sent and
> > >    the kill will only be acted upon if the driver you loaded
> > >    happens to have some code path that either uses kthreads (which
> > >    as of 786235ee are now killable), or uses some code which checks for
> > >    fatal_signal_pending() on the kernel somewhere -- i.e: pci_read_vpd().
> > 
> > Shouldn't this be seen as something to be fixed in the kernel?
> 
> That's a great question. In practice now after CVE-2012-4398 and its series of
> patches added which enabled OOM to kill things followed by 786235ee to also
> handle OOM on kthreads it seems imperative we strive towards this, in practice
> however if you're getting OOMs on boot you have far more serious issue to be
> concerned over than handling CVE-2012-4398. Another issue is that even if we
> wanted to address this a critical right now on module loading driver error
> paths tend to be pretty buggy and we'd probably end up causing more issues than
> fixing anything if the sigkill that triggered this was an arbitrary timeout,
> specially if the timeout is not properly justified.

<-- snip -->

> So extending the kill onto more drivers *because* of the timeout is probably
> not a good reason as it would probably create more issue than fix anything
> right now.

A bit more on this. Tejun added devres while trying to convert libata to
use iomap, and in the process also helped address buggy failure paths in
drivers [0]. Even with devres in place and the devm functions available, they
weren't widely adopted until recent kernels [1]. There is even further
research on precisely these sorts of errors, such as "Hector: Detecting
Resource-Release Omission Faults in error-handling code for systems software" [2],
but unfortunately there is no data over time. Another paper is "An approach to
improving the structure of error-handling code in the Linux kernel" [3], which
tries to move error handling code from the middle of a function to gotos
targeting shared code at the end of the function...

So we have buggy error paths in drivers, and trusting them unfortunately isn't
a good idea at this point. They should be fixed, but saying we should
kill all drivers equally right now would likely introduce more issues than
anything.

[0] http://lwn.net/Articles/215861/
[1] http://www.slideshare.net/ennael/kernel-recipes-2013?qid=f0888b85-377b-4b29-95c3-f4e59822f5b3&v=default&b=&from_search=6
    See slide 6 on graph usage of devm functions over time
[2] http://coccinelle.lip6.fr/papers/dsn2013.pdf
[3] http://coccinelle.lip6.fr/papers/lctes11.pdf
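
For readers unfamiliar with the goto style mentioned above, here is a minimal
userspace sketch. The device struct, the buffer sizes and the fail_at
fault-injection parameter are all made up for illustration; malloc()/free()
stand in for whatever resources a real probe would acquire.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical device state, for illustration only. */
struct model_dev {
	char *regs;
	char *ring;
};

/* fail_at: 0 = no failure, 1 = first allocation fails, 2 = second fails. */
static int model_probe(struct model_dev *dev, int fail_at)
{
	int ret;

	assert(dev != NULL);
	dev->regs = NULL;
	dev->ring = NULL;

	dev->regs = (fail_at == 1) ? NULL : malloc(64);
	if (!dev->regs) {
		ret = -ENOMEM;
		goto err_out;
	}

	dev->ring = (fail_at == 2) ? NULL : malloc(256);
	if (!dev->ring) {
		ret = -ENOMEM;
		goto err_free_regs;
	}

	return 0;

	/* Shared unwinding code: each label undoes one earlier step. */
err_free_regs:
	free(dev->regs);
	dev->regs = NULL;
err_out:
	return ret;
}

/* Normal teardown releases everything in reverse order. */
static void model_remove(struct model_dev *dev)
{
	assert(dev != NULL);
	free(dev->ring);
	free(dev->regs);
	dev->ring = NULL;
	dev->regs = NULL;
}
```

A failure injected at the second step leaves nothing allocated, which is the
property that gets lost when a new exit point (such as an early return on a
fatal signal) is added without a matching label.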

  Luis

Tom Gundersen Sept. 30, 2014, 9:22 a.m. UTC | #13

On Tue, Sep 30, 2014 at 4:27 AM, Luis R. Rodriguez <mcgrof@suse.com> wrote:
> On Sun, Sep 28, 2014 at 07:07:24PM +0200, Tom Gundersen wrote:
>> On Fri, Sep 26, 2014 at 11:57 PM, Luis R. Rodriguez
>> <mcgrof@do-not-panic.com> wrote:
>> > From: "Luis R. Rodriguez" <mcgrof@suse.com>
>> > Systemd has a general timeout for all workers currently set to 180
>> > seconds after which it will send a sigkill signal. Systemd now has a
>> > warning which is issued once it reaches 1/3 of the timeout. The original
>> > motivation for the systemd timeout was to help track device drivers
>> > which do not use asynch firmware loading on init() and the timeout was
>> > originally set to 30 seconds.
>>
>> Please note that the motivation for the timeout in systemd had nothing
>> to do with async firmware loading (that was just the case where
>> problems cropped up).
>
> *Part *of the original kill logic, according to the commit log, was actually
> due to the assumption that the issues observed *were* synchronous firmware
> loading on module init():
>
> commit e64fae5573e566ce4fd9b23c68ac8f3096603314
> Author: Kay Sievers <kay.sievers@vrfy.org>
> Date:   Wed Jan 18 05:06:18 2012 +0100
>
>     udevd: kill hanging event processes after 30 seconds
>
>     Some broken kernel drivers load firmware synchronously in the module init
>     path and block modprobe until the firmware request is fulfilled.
>     <...>

This was a workaround to avoid a deadlock between udev and the kernel.
The 180 s timeout was already in place before this change, and was not
motivated by firmware loading. Also note that this patch was not about
"tracking device drivers", just about avoiding dead-lock.

> My point here is not to point fingers but to explain why we went on with
> this and how we failed to realize only until later that the driver core
> ran probe together with init. When a few folks pointed out the issues
> with the kill the issue was punted back to kernel developers and the
> assumption even among some kernel maintainers was that it was init paths
> with sync behaviour that was causing some delays and they were broken
> drivers. It is important to highlight these assumptions ended up setting
> us off on the wrong path for a while in a hunt to try to fix this issue
> either in driver or elsewhere.

Ok. I'm not sure the motivations for user-space changes is important
to include in the commit message, but if you do I'll try to clarify
things to avoid misunderstandings.

> Thanks for clarifying this, can you explain what issues could arise
> from making an exception to allowing kmod workers to hang around
> completing init + probe over a certain defined amount of time without
> being killed?

We could run out of udev workers and the whole boot would hang.

The way I see it, the current status from systemd's side is: our
short-term work-around is to increase the timeout, and at the moment
it appears no long-term solution is needed (i.e., it seems like the
right thing to do is to make sure insmod can be near instantaneous, it
appears people are working towards this goal, and so far no examples
have cropped up showing that it is fundamentally impossible (once/if
they do, we should of course revisit the problem)).

Cheers,

Tom

Luis R. Rodriguez Sept. 30, 2014, 3:24 p.m. UTC | #14

On Tue, Sep 30, 2014 at 11:22:14AM +0200, Tom Gundersen wrote:
> On Tue, Sep 30, 2014 at 4:27 AM, Luis R. Rodriguez <mcgrof@suse.com> wrote:
> > On Sun, Sep 28, 2014 at 07:07:24PM +0200, Tom Gundersen wrote:
> >> On Fri, Sep 26, 2014 at 11:57 PM, Luis R. Rodriguez
> >> <mcgrof@do-not-panic.com> wrote:
> >> > From: "Luis R. Rodriguez" <mcgrof@suse.com>
> >> > Systemd has a general timeout for all workers currently set to 180
> >> > seconds after which it will send a sigkill signal. Systemd now has a
> >> > warning which is issued once it reaches 1/3 of the timeout. The original
> >> > motivation for the systemd timeout was to help track device drivers
> >> > which do not use asynch firmware loading on init() and the timeout was
> >> > originally set to 30 seconds.
> >>
> >> Please note that the motivation for the timeout in systemd had nothing
> >> to do with async firmware loading (that was just the case where
> >> problems cropped up).
> >
> > *Part *of the original kill logic, according to the commit log, was actually
> > due to the assumption that the issues observed *were* synchronous firmware
> > loading on module init():
> >
> > commit e64fae5573e566ce4fd9b23c68ac8f3096603314
> > Author: Kay Sievers <kay.sievers@vrfy.org>
> > Date:   Wed Jan 18 05:06:18 2012 +0100
> >
> >     udevd: kill hanging event processes after 30 seconds
> >
> >     Some broken kernel drivers load firmware synchronously in the module init
> >     path and block modprobe until the firmware request is fulfilled.
> >     <...>
> 
> This was a workaround to avoid a deadlock between udev and the kernel.
> The 180 s timeout was already in place before this change, and was not
> motivated by firmware loading. Also note that this patch was not about
> "tracking device drivers", just about avoiding dead-lock.

Thanks, can you elaborate on how a deadlock can occur if the kmod
worker is not at some point sigkilled?

> > My point here is not to point fingers but to explain why we went on with
> > this and how we failed to realize only until later that the driver core
> > ran probe together with init. When a few folks pointed out the issues
> > with the kill the issue was punted back to kernel developers and the
> > assumption even among some kernel maintainers was that it was init paths
> > with sync behaviour that was causing some delays and they were broken
> > drivers. It is important to highlight these assumptions ended up setting
> > us off on the wrong path for a while in a hunt to try to fix this issue
> > either in driver or elsewhere.
> 
> Ok. I'm not sure the motivations for user-space changes is important
> to include in the commit message, but if you do I'll try to clarify
> things to avoid misunderstandings.

I can try to omit it on the next series.

> > Thanks for clarifying this, can you explain what issues could arise
> > from making an exception to allowing kmod workers to hang around
> > completing init + probe over a certain defined amount of time without
> > being killed?
> 
> We could run out of udev workers and the whole boot would hang.

Is the issue that if there is no extra worker available and all are
idling on sleep / synchronous long work, boot will potentially hang
unless a new worker becomes available to do more work? If so I can
see the sigkill helping for hanging tasks, but that doesn't necessarily
mean it's a good idea to kill module loads that are taking a while. Also,
what if the sigkill is avoided for *just* kmod workers?

> The way I see it, the current status from systemd's side is: our
> short-term work-around is to increase the timeout, and at the moment
> it appears no long-term solution is needed (i.e., it seems like the
> right thing to do is to make sure insmod can be near instantaneous, it
> appears people are working towards this goal, and so far no examples
> have cropped up showing that it is fundamentally impossible (once/if
> they do, we should of course revisit the problem)).

That again would be reactive behaviour. What would prevent avoiding the
sigkill only for kmod workers? Is it known that the deadlock is imminent?
If the number of kmod workers that would hit the timeout is
considered low, I don't see how that's possible, so why not just lift
the sigkill?

  Luis

Tom Gundersen Oct. 2, 2014, 6:12 a.m. UTC | #15

On Tue, Sep 30, 2014 at 5:24 PM, Luis R. Rodriguez <mcgrof@suse.com> wrote:
>> > commit e64fae5573e566ce4fd9b23c68ac8f3096603314
>> > Author: Kay Sievers <kay.sievers@vrfy.org>
>> > Date:   Wed Jan 18 05:06:18 2012 +0100
>> >
>> >     udevd: kill hanging event processes after 30 seconds
>> >
>> >     Some broken kernel drivers load firmware synchronously in the module init
>> >     path and block modprobe until the firmware request is fulfilled.
>> >     <...>
>>
>> This was a workaround to avoid a deadlock between udev and the kernel.
>> The 180 s timeout was already in place before this change, and was not
>> motivated by firmware loading. Also note that this patch was not about
>> "tracking device drivers", just about avoiding dead-lock.
>
> Thanks, can you elaborate on how a deadlock can occur if the kmod
> worker is not at some point sigkilled?

This was only relevant when udev did the firmware loading. modprobe
would wait for the kernel, which would wait for the firmware loading,
which would wait for modprobe. This is no longer a problem as udev
does not do firmware loading any more.

> Is the issue that if there is no extra worker available and all are
> idling on sleep / synchronous long work boot will potentially hang
> unless a new worker becomes available to do more work?

Correct.

> If so I can
> see the sigkill helping for hanging tasks but it doesn't necessarily
> mean its a good idea to kill modules loading taking a while. Also
> what if the sigkill is just avoided for *just* kmod workers?

Depending on the number of devices you have, I suppose we could still
exhaust the workers.

>> The way I see it, the current status from systemd's side is: our
>> short-term work-around is to increase the timeout, and at the moment
>> it appears no long-term solution is needed (i.e., it seems like the
>> right thing to do is to make sure insmod can be near instantaneous, it
>> appears people are working towards this goal, and so far no examples
>> have cropped up showing that it is fundamentally impossible (once/if
>> they do, we should of course revisit the problem)).
>
> That again would be reactive behaviour, what would prevent avoiding the
> sigkill only for kmod workers? Is it known the deadlock is imminent?
> If the amount of workers for kmod that would hit the timeout is
> considered low I don't see how that's possible and why not just lift
> the sigkill.

Making kmod a special case is of course possible. However, as long as
there is no fundamental reason why kmod should get this special
treatment, this just looks like a work-around to me. We already have a
work-around, which is to increase the global timeout. If you still
think we should do something different in systemd, it is probably best
to take the discussion to systemd-devel to make sure all the relevant
people are involved.

Cheers,

Tom

Luis R. Rodriguez Oct. 2, 2014, 8:06 p.m. UTC | #16

As per Tom, adding systemd-devel for advice / review of the request to avoid
the sigkill for kmod workers. Keeping others on Cc as it's a discussion that
I think benefits from having both camps involved, especially since we've been
ping-ponging back and forth on this particular topic for a long time now.

On Thu, Oct 02, 2014 at 08:12:37AM +0200, Tom Gundersen wrote:
> On Tue, Sep 30, 2014 at 5:24 PM, Luis R. Rodriguez <mcgrof@suse.com> wrote:
> >> > commit e64fae5573e566ce4fd9b23c68ac8f3096603314
> >> > Author: Kay Sievers <kay.sievers@vrfy.org>
> >> > Date:   Wed Jan 18 05:06:18 2012 +0100
> >> >
> >> >     udevd: kill hanging event processes after 30 seconds
> >> >
> >> >     Some broken kernel drivers load firmware synchronously in the module init
> >> >     path and block modprobe until the firmware request is fulfilled.
> >> >     <...>
> >>
> >> This was a workaround to avoid a deadlock between udev and the kernel.
> >> The 180 s timeout was already in place before this change, and was not
> >> motivated by firmware loading. Also note that this patch was not about
> >> "tracking device drivers", just about avoiding dead-lock.
> >
> > Thanks, can you elaborate on how a deadlock can occur if the kmod
> > worker is not at some point sigkilled?
> 
> This was only relevant when udev did the firmware loading. modprobe
> would wait for the kernel, which would wait for the firmware loading,
> which would wait for modprobe. This is no longer a problem as udev
> does not do firmware loading any more.

Thanks for clarifying. So the deadlock concern is no longer there, therefore
it is not a reason to keep the sigkill for kmod.

> > Is the issue that if there is no extra worker available and all are
> > idling on sleep / synchronous long work boot will potentially hang
> > unless a new worker becomes available to do more work?
> 
> Correct.

Ok.

> > If so I can
> > see the sigkill helping for hanging tasks but it doesn't necessarily
> > mean its a good idea to kill modules loading taking a while. Also
> > what if the sigkill is just avoided for *just* kmod workers?
> 
> Depending on the number of devices you have, I suppose we could still
> exhaust the workers.

Ok, can systemd dynamically create a worker or set of workers per device
that crops up? Async probe for most drivers will help with this but
having it dynamic should help as well, especially since there are drivers
that will require synchronous probe, and since the async probe
mechanism is not yet merged.

> >> The way I see it, the current status from systemd's side is: our
> >> short-term work-around is to increase the timeout, and at the moment
> >> it appears no long-term solution is needed (i.e., it seems like the
> >> right thing to do is to make sure insmod can be near instantaneous, it
> >> appears people are working towards this goal, and so far no examples
> >> have cropped up showing that it is fundamentally impossible (once/if
> >> they do, we should of course revisit the problem)).
> >
> > That again would be reactive behaviour, what would prevent avoiding the
> > sigkill only for kmod workers? Is it known the deadlock is imminent?
> > If the amount of workers for kmod that would hit the timeout is
> > considered low I don't see how that's possible and why not just lift
> > the sigkill.
> 
> Making kmod a special case is of course possible. However, as long as
> there is no fundamental reason why kmod should get this special
> treatment, this just looks like a work-around to me. 

I've mentioned a series of five reasons why it's a bad idea right now to
sigkill modules [0]. We've reviewed each of them and at least
items 2-4 remain particularly valid fundamental reasons to avoid it,
especially if the deadlock is no longer possible. Running out of
workers because they are loading modules and that is taking a while
is not really a good reason to be killing them, especially
if the timeout is already set to a high value. All we're doing there is
limiting Linux / number of devices arbitrarily just to help free
workers, and it seems that should be dealt with differently. Killing
module loading arbitrarily in the middle is not advisable and can cause
more issues than it solves.

The async probe mechanism will help free workers faster but this patch series
is still evolving. We should still address the sigkill for kmod workers
separately and see what remaining reasons we have for it in light of the
possible issues highlighted that it can introduce if kept. If we want to
capture drivers taking long on probe each subsystem should handle that and WARN
/ pick up on it. We cannot however assume that this is a generally bad thing, as
discussed before. We will also not be able to async probe *every* driver,
which is why the series provides a flag to specify this.

[0] https://lkml.org/lkml/2014/9/26/879

> We already have a
> work-around, which is to increase the global timeout. If you still
> think we should do something different in systemd, it is probably best
> to take the discussion to systemd-devel to make sure all the relevant
> people are involved.

Sure, I've included systemd-devel. Hope is we can have a constructive
discussion on the sigkill for kmod.

  Luis

Luis R. Rodriguez Oct. 2, 2014, 11:29 p.m. UTC | #17

On Tue, Sep 30, 2014 at 09:21:59AM +0200, Luis R. Rodriguez wrote:
> On Mon, Sep 29, 2014 at 05:26:01PM -0400, Tejun Heo wrote:
> > Hello, Luis.
> > 
> > On Mon, Sep 29, 2014 at 11:22:08PM +0200, Luis R. Rodriguez wrote:
> > > > > +	/* For now lets avoid stupid bug reports */
> > > > > +	if (!strcmp(bus->name, "pci") ||
> > > > > +	    !strcmp(bus->name, "pci_express") ||
> > > > > +	    !strcmp(bus->name, "hid") ||
> > > > > +	    !strcmp(bus->name, "sdio") ||
> > > > > +	    !strcmp(bus->name, "gameport") ||
> > > > > +	    !strcmp(bus->name, "mmc") ||
> > > > > +	    !strcmp(bus->name, "i2c") ||
> > > > > +	    !strcmp(bus->name, "platform") ||
> > > > > +	    !strcmp(bus->name, "usb"))
> > > > > +		return true;
> > > > 
> > > > Ugh... things like this tend to become permanent.  Do we really need
> > > > this?  And how are we gonna find out what's broken why w/o bug
> > > > reports?
> > > 
> > > Yeah... well we have two options, one is have something like this to
> > > at least make it generally useful or remove this and let folks who
> > > care start fixing async for all modules. The downside to removing
> > > this is it makes async probe pretty much useless on most systems
> > > right now, it would mean systemd would have to probably consider
> > > the list above if they wanted to start using this without expecting
> > > systems to not work.
> > 
> > So, I'd much prefer blacklist approach if something like this is a
> > necessity.  That way, we'd at least know what doesn't work.
> 
> For buses? Or do you mean you'd want to wait until we have a decent
> list of drivers with the sync probe flag set? If the later it may take
> a while to get that list for this to be somewhat useful.

OK I'm removing this part and it works well for me now on my laptop
and an AMD server without a white list, so all the junk above will
be removed in the next series.

  Luis

Luis R. Rodriguez Oct. 2, 2014, 11:31 p.m. UTC | #18

On Tue, Sep 30, 2014 at 09:15:55AM +0200, Luis R. Rodriguez wrote:
> Can you provide an example code path hit here? I'll certainly like to address
> that as well.

I managed to enable built-in driver support on top of this series,
I'll send them as part of the next series but I suspect we'll want
to discuss blacklist/whitelist a bit more there.

 Luis

Tom Gundersen Oct. 3, 2014, 8:23 a.m. UTC | #19

On Thu, Oct 2, 2014 at 10:06 PM, Luis R. Rodriguez <mcgrof@suse.com> wrote:
> On Thu, Oct 02, 2014 at 08:12:37AM +0200, Tom Gundersen wrote:
>> Making kmod a special case is of course possible. However, as long as
>> there is no fundamental reason why kmod should get this special
>> treatment, this just looks like a work-around to me.
>
> I've mentioned a series of five reasons why its a bad idea right now to
> sigkill modules [0], we've reviewed them each and still at least
> items 2-4 remain particularly valid fundamental reasons to avoid it

So items 2-4 basically say "there currently are drivers that cannot
deal with sigkill after a three minute timeout".

In the short-term we already have the solution: increase the timeout.
In the long-term, we have two choices, either permanently add some
heuristic to udev to deal with drivers taking a very long time to be
inserted, or fix the drivers not to take such a long time. A priori,
it makes no sense to me that drivers spend unbounded amounts of time
to get inserted, so fixing the drivers seems like the most reasonable
approach to me. That said, I'm of course open to be proven wrong if
there are some drivers that fundamentally _must_ take a long time to
insert (but we should then discuss why that is and how we can best
deal with the situation, rather than adding some hack up-front when we
don't even know if it is needed).

Your patch series should go a long way towards fixing the drivers (and
I imagine there being a lot of low-hanging fruit that can easily be
fixed once your series has landed), and the fact that we have now
increased the udev timeout from 30 to 180 seconds should also greatly
reduce the problem.

Cheers,

Tom

Luis R. Rodriguez Oct. 3, 2014, 4:54 p.m. UTC | #20

On Fri, Oct 3, 2014 at 1:23 AM, Tom Gundersen <teg@jklm.no> wrote:
> On Thu, Oct 2, 2014 at 10:06 PM, Luis R. Rodriguez <mcgrof@suse.com> wrote:
>> On Thu, Oct 02, 2014 at 08:12:37AM +0200, Tom Gundersen wrote:
>>> Making kmod a special case is of course possible. However, as long as
>>> there is no fundamental reason why kmod should get this special
>>> treatment, this just looks like a work-around to me.
>>
>> I've mentioned a series of five reasons why its a bad idea right now to
>> sigkill modules [0], we've reviewed them each and still at least
>> items 2-4 remain particularly valid fundamental reasons to avoid it
>
> So items 2-4 basically say "there currently are drivers that cannot
> deal with sigkill after a three minute timeout".

No, dealing with the sigkill gracefully is all related to 2), as it
says it's probably a terrible idea to be triggering exit paths at
random points in device drivers on init / probe. And while one could
argue that perhaps that can be cleaned up, I provided tons of
references and even *research effort* on this particular area, so the
issues over this point should by no means be easily brushed off. And
it may be true that we can fix some things on Linux but a) that
requires a kernel upgrade for users and b) some users may end up buying
hardware that is only supported through a proprietary driver, and
getting those fixes is not trivial and almost impossible in some
cases.

3) says it is fundamentally incorrect to limit with any arbitrary
timeout the bus probe routine

4) talks about how the timeout is creating a limit on the number of
devices a device driver can support on Linux, as follows, given that the
driver core batches *all* probes for one device driver serially:

   number_devices =          systemd_timeout
                      -------------------------------------
                         max known probe time for driver

We have device drivers which we *know* just on *probe* will take over
1 minute, this means that by default for these device drivers folks
can only install 3 devices of that type on a system. One can surely
address things on the kernel but again assuming folks use defaults and
don't upgrade their kernel the sigkill is simply limiting Linux right
now, even if it is for the short term.
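
To put numbers on the formula above, here is a small userspace sketch of the
arithmetic. The helper names max_devices() and load_time_secs() are
hypothetical and exist only for this example; they are not part of the kernel
or systemd. The sketch assumes, as described above, that all probes for one
driver run serially within a single module load.

```c
#include <assert.h>

/*
 * Upper bound on the number of devices one driver can probe serially
 * before the worker loading it hits the timeout. Integer division
 * rounds down, since a partially completed probe does not count.
 */
unsigned int max_devices(unsigned int timeout_secs, unsigned int probe_secs)
{
	assert(probe_secs > 0);
	return timeout_secs / probe_secs;
}

/*
 * Total time the module load occupies the worker: module init plus
 * the cumulative probe time across all devices bound to the driver.
 */
unsigned int load_time_secs(unsigned int init_secs, unsigned int probe_secs,
			    unsigned int ndevs)
{
	return init_secs + probe_secs * ndevs;
}
```

With the 180 second timeout and a 60 second probe, max_devices(180, 60) gives
3, and a fourth device pushes load_time_secs(0, 60, 4) to 240 seconds, past
the timeout. With the older 30 second timeout the same driver could not even
finish one probe.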

> In the short-term we already have the solution: increase the timeout.

Short term implies what will be supported for a while for tons of
deployments of systemd. The kernel command line workaround for
increasing the timeout is a reactive measure, it's not addressing the
problem architecturally. If the sigkill is going to be maintained for
kmod, its implications should be well documented as well in terms of
the impact and limitations on both device drivers and the number of
devices a driver can support.

> In the long-term, we have two choices, either permanently add some
> heuristic to udev to deal with drivers taking a very long time to be
> inserted, or fix the drivers not to take such a long time.

Drivers taking long on init should probably be addressed, but drivers
taking long on probe are not broken, especially since the driver core
probes all supported devices for one device driver serially, so the
probe time is actually cumulative.

> A priori,
> it makes no sense to me that drivers spend unbounded amounts of time
> to get inserted, so fixing the drivers seems like the most reasonable
> approach to me. That said, I'm of course open to be proven wrong if
> there are some drivers that fundamentally _must_ take a long time to
> insert (but we should then discuss why that is and how we can best
> deal with the situation, rather than adding some hack up-front when we
> don't even know if it is needed).

Ok hold on. Async probe on the driver core will be a new feature and
there are even caveats that Tejun pointed out which are important for
distributions to consider before embracing it. Of course folks can
ignore these but by no means should it be considered that tons of
device drivers were broken; what we are providing is a new
mechanism. And then there are device drivers which will need work in
order to use async probe, some will require fixes on init / probe
assumptions as I provided for the amd64_edac driver but for others
only time will tell what is required.

> Your patch series should go a long way towards fixing the drivers (and
> I imagine there being a lot of low-hanging fruit that can easily be
> fixed once your series has landed), and the fact that we have now
> increased the udev timeout from 30 to 180 seconds should also greatly
> reduce the problem.

Sure, I do ask for folks to revisit the short term solution though, I
did my best to communicate / document the issues.

  Luis

Luis R. Rodriguez Oct. 3, 2014, 8:11 p.m. UTC | #21

On Fri, Sep 26, 2014 at 02:57:17PM -0700, Luis R. Rodriguez wrote:
> +	queue_work(system_unbound_wq, &priv->attach_work->work);

Tejun,

based on my testing so far using system_highpri_wq instead of
system_unbound_wq yields close to par / better boot times
than synchronous probe support for all modules. How set are
you on using system_unbound_wq? About to punt out a new
series which also addresses built-in.

  Luis

Luis R. Rodriguez Oct. 3, 2014, 9:12 p.m. UTC | #22

On Fri, Oct 03, 2014 at 10:11:26PM +0200, Luis R. Rodriguez wrote:
> On Fri, Sep 26, 2014 at 02:57:17PM -0700, Luis R. Rodriguez wrote:
> > +	queue_work(system_unbound_wq, &priv->attach_work->work);
> 
> Tejun,
> 
> based on my testing so far using system_highpri_wq instead of
> system_unbound_wq yields close to par / better boot times
> than synchronous probe support for all modules. How set are
> you on using system_unbound_wq? About to punt out a new
> series which also addresses built-in.

Never mind, folks can change this later with better empirical
testing than I can provide. Right now the differences I
see are not too conclusive, and I suspect we'll see more of
a difference once the right built-in drivers are selected
to probe asynchronously.

  Luis
