Message ID | 20170530160445.12810-1-lvivier@redhat.com |
---|---|
State | New |
Headers | show |
On Tue, 30 May 2017 18:04:45 +0200 Laurent Vivier <lvivier@redhat.com> wrote: > For QEMU, a hotlugged device is a device added using the HMP/QMP > interface. > For SPAPR, a hotplugged device is a device added while the > machine is running. In this case QEMU doesn't update internal > state but relies on the OS for this part > > In the case of migration, when we (libvirt) hotplug a device > on the source guest, we (libvirt) generally hotplug the same > device on the destination guest. But in this case, the machine > is stopped (RUN_STATE_INMIGRATE) and QEMU must not expect > the OS will manage it as an hotplugged device as it will > be "imported" by the migration. > > This patch changes the meaning of "hotplugged" in spapr.c > to manage a QEMU hotplugged device like a "coldplugged" one > when the machine is awaiting an incoming migration. > > Signed-off-by: Laurent Vivier <lvivier@redhat.com> > --- Reviewed-by: Greg Kurz <groug@kaod.org> > hw/ppc/spapr.c | 20 ++++++++++++++------ > 1 file changed, 14 insertions(+), 6 deletions(-) > > diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c > index 0980d73..f1302d0 100644 > --- a/hw/ppc/spapr.c > +++ b/hw/ppc/spapr.c > @@ -2511,6 +2511,12 @@ static void spapr_nmi(NMIState *n, int cpu_index, Error **errp) > } > } > > +static bool spapr_coldplugged(DeviceState *dev) > +{ > + return runstate_check(RUN_STATE_INMIGRATE) || > + !dev->hotplugged; > +} > + > static void spapr_add_lmbs(DeviceState *dev, uint64_t addr_start, uint64_t size, > uint32_t node, bool dedicated_hp_event_source, > Error **errp) > @@ -2521,6 +2527,7 @@ static void spapr_add_lmbs(DeviceState *dev, uint64_t addr_start, uint64_t size, > int i, fdt_offset, fdt_size; > void *fdt; > uint64_t addr = addr_start; > + bool coldplugged = spapr_coldplugged(dev); > > for (i = 0; i < nr_lmbs; i++) { > drc = spapr_dr_connector_by_id(SPAPR_DR_CONNECTOR_TYPE_LMB, > @@ -2532,9 +2539,9 @@ static void spapr_add_lmbs(DeviceState *dev, uint64_t addr_start, uint64_t size, > SPAPR_MEMORY_BLOCK_SIZE); > > drck = SPAPR_DR_CONNECTOR_GET_CLASS(drc); > - drck->attach(drc, dev, fdt, fdt_offset, !dev->hotplugged, errp); > + drck->attach(drc, dev, fdt, fdt_offset, coldplugged, errp); > addr += SPAPR_MEMORY_BLOCK_SIZE; > - if (!dev->hotplugged) { > + if (coldplugged) { > /* guests expect coldplugged LMBs to be pre-allocated */ > drck->set_allocation_state(drc, SPAPR_DR_ALLOCATION_STATE_USABLE); > drck->set_isolation_state(drc, SPAPR_DR_ISOLATION_STATE_UNISOLATED); > @@ -2543,7 +2550,7 @@ static void spapr_add_lmbs(DeviceState *dev, uint64_t addr_start, uint64_t size, > /* send hotplug notification to the > * guest only in case of hotplugged memory > */ > - if (dev->hotplugged) { > + if (!coldplugged) { > if (dedicated_hp_event_source) { > drc = spapr_dr_connector_by_id(SPAPR_DR_CONNECTOR_TYPE_LMB, > addr_start / SPAPR_MEMORY_BLOCK_SIZE); > @@ -2776,6 +2783,7 @@ static void spapr_core_plug(HotplugHandler *hotplug_dev, DeviceState *dev, > int smt = kvmppc_smt_threads(); > CPUArchId *core_slot; > int index; > + bool coldplugged = spapr_coldplugged(dev); > > core_slot = spapr_find_cpu_slot(MACHINE(hotplug_dev), cc->core_id, &index); > if (!core_slot) { > @@ -2797,7 +2805,7 @@ static void spapr_core_plug(HotplugHandler *hotplug_dev, DeviceState *dev, > > if (drc) { > sPAPRDRConnectorClass *drck = SPAPR_DR_CONNECTOR_GET_CLASS(drc); > - drck->attach(drc, dev, fdt, fdt_offset, !dev->hotplugged, &local_err); > + drck->attach(drc, dev, fdt, fdt_offset, coldplugged, &local_err); > if (local_err) { > g_free(fdt); > error_propagate(errp, local_err); > @@ -2805,7 +2813,7 @@ static void spapr_core_plug(HotplugHandler *hotplug_dev, DeviceState *dev, > } > } > > - if (dev->hotplugged) { > + if (!coldplugged) { > /* > * Send hotplug notification interrupt to the guest only in case > * of hotplugged CPUs. > @@ -2838,7 +2846,7 @@ static void spapr_core_pre_plug(HotplugHandler *hotplug_dev, DeviceState *dev, > int node_id; > int index; > > - if (dev->hotplugged && !mc->has_hotpluggable_cpus) { > + if (!spapr_coldplugged(dev) && !mc->has_hotpluggable_cpus) { > error_setg(&local_err, "CPU hotplug not supported for this machine"); > goto out; > }
On Tue, May 30, 2017 at 06:04:45PM +0200, Laurent Vivier wrote: > For QEMU, a hotlugged device is a device added using the HMP/QMP > interface. > For SPAPR, a hotplugged device is a device added while the > machine is running. In this case QEMU doesn't update internal > state but relies on the OS for this part > > In the case of migration, when we (libvirt) hotplug a device > on the source guest, we (libvirt) generally hotplug the same > device on the destination guest. But in this case, the machine > is stopped (RUN_STATE_INMIGRATE) and QEMU must not expect > the OS will manage it as an hotplugged device as it will > be "imported" by the migration. > > This patch changes the meaning of "hotplugged" in spapr.c > to manage a QEMU hotplugged device like a "coldplugged" one > when the machine is awaiting an incoming migration. > > Signed-off-by: Laurent Vivier <lvivier@redhat.com> So, I think this is a reasonable concept, at least in terms of cleanliness and not doing unnecessary work. However, if it's fixing bugs, I suspect that means we still have problems elsewhere. Specifically, what is it we're doing before the incoming migration that's breaking things. Even if it's unnecessary, anything done there should be overwritten by the incoming stream. That should certainly be the case (now) for the DRC state variables. Maybe not for the queued hotplug events - but that means we should update the queue migration to make sure we clear anything existing on the destination before adding migrated events. I'm also concerned by the fact that this makes changes for memory and cpu hotplug, but not for PCI devices. Why aren't they also affected by this problem? One nit in the implementation, see below: > --- > hw/ppc/spapr.c | 20 ++++++++++++++------ > 1 file changed, 14 insertions(+), 6 deletions(-) > > diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c > index 0980d73..f1302d0 100644 > --- a/hw/ppc/spapr.c > +++ b/hw/ppc/spapr.c > @@ -2511,6 +2511,12 @@ static void spapr_nmi(NMIState *n, int cpu_index, Error **errp) > } > } > > +static bool spapr_coldplugged(DeviceState *dev) > +{ > + return runstate_check(RUN_STATE_INMIGRATE) || > + !dev->hotplugged; > +} > + > static void spapr_add_lmbs(DeviceState *dev, uint64_t addr_start, uint64_t size, > uint32_t node, bool dedicated_hp_event_source, > Error **errp) > @@ -2521,6 +2527,7 @@ static void spapr_add_lmbs(DeviceState *dev, uint64_t addr_start, uint64_t size, > int i, fdt_offset, fdt_size; > void *fdt; > uint64_t addr = addr_start; > + bool coldplugged = spapr_coldplugged(dev); > > for (i = 0; i < nr_lmbs; i++) { > drc = spapr_dr_connector_by_id(SPAPR_DR_CONNECTOR_TYPE_LMB, > @@ -2532,9 +2539,9 @@ static void spapr_add_lmbs(DeviceState *dev, uint64_t addr_start, uint64_t size, > SPAPR_MEMORY_BLOCK_SIZE); > > drck = SPAPR_DR_CONNECTOR_GET_CLASS(drc); > - drck->attach(drc, dev, fdt, fdt_offset, !dev->hotplugged, errp); > + drck->attach(drc, dev, fdt, fdt_offset, coldplugged, errp); > addr += SPAPR_MEMORY_BLOCK_SIZE; > - if (!dev->hotplugged) { > + if (coldplugged) { > /* guests expect coldplugged LMBs to be pre-allocated */ > drck->set_allocation_state(drc, SPAPR_DR_ALLOCATION_STATE_USABLE); > drck->set_isolation_state(drc, SPAPR_DR_ISOLATION_STATE_UNISOLATED); > @@ -2543,7 +2550,7 @@ static void spapr_add_lmbs(DeviceState *dev, uint64_t addr_start, uint64_t size, > /* send hotplug notification to the > * guest only in case of hotplugged memory > */ > - if (dev->hotplugged) { > + if (!coldplugged) { > if (dedicated_hp_event_source) { > drc = spapr_dr_connector_by_id(SPAPR_DR_CONNECTOR_TYPE_LMB, > addr_start / SPAPR_MEMORY_BLOCK_SIZE); > @@ -2776,6 +2783,7 @@ static void spapr_core_plug(HotplugHandler *hotplug_dev, DeviceState *dev, > int smt = kvmppc_smt_threads(); > CPUArchId *core_slot; > int index; > + bool coldplugged = spapr_coldplugged(dev); > > core_slot = spapr_find_cpu_slot(MACHINE(hotplug_dev), cc->core_id, &index); > if (!core_slot) { > @@ -2797,7 +2805,7 @@ static void spapr_core_plug(HotplugHandler *hotplug_dev, DeviceState *dev, > > if (drc) { > sPAPRDRConnectorClass *drck = SPAPR_DR_CONNECTOR_GET_CLASS(drc); > - drck->attach(drc, dev, fdt, fdt_offset, !dev->hotplugged, &local_err); > + drck->attach(drc, dev, fdt, fdt_offset, coldplugged, &local_err); > if (local_err) { > g_free(fdt); > error_propagate(errp, local_err); > @@ -2805,7 +2813,7 @@ static void spapr_core_plug(HotplugHandler *hotplug_dev, DeviceState *dev, > } > } > > - if (dev->hotplugged) { > + if (!coldplugged) { > /* > * Send hotplug notification interrupt to the guest only in case > * of hotplugged CPUs. > @@ -2838,7 +2846,7 @@ static void spapr_core_pre_plug(HotplugHandler *hotplug_dev, DeviceState *dev, > int node_id; > int index; > > - if (dev->hotplugged && !mc->has_hotpluggable_cpus) { > + if (!spapr_coldplugged(dev) && !mc->has_hotpluggable_cpus) { It probably doesn't matter in practice, but in this specific instance, I think you want the "raw" qemu meaning of hotplugged rather than the spapr meaning. > error_setg(&local_err, "CPU hotplug not supported for this machine"); > goto out; > }
On 31/05/2017 06:35, David Gibson wrote: > On Tue, May 30, 2017 at 06:04:45PM +0200, Laurent Vivier wrote: >> For QEMU, a hotlugged device is a device added using the HMP/QMP >> interface. >> For SPAPR, a hotplugged device is a device added while the >> machine is running. In this case QEMU doesn't update internal >> state but relies on the OS for this part >> >> In the case of migration, when we (libvirt) hotplug a device >> on the source guest, we (libvirt) generally hotplug the same >> device on the destination guest. But in this case, the machine >> is stopped (RUN_STATE_INMIGRATE) and QEMU must not expect >> the OS will manage it as an hotplugged device as it will >> be "imported" by the migration. >> >> This patch changes the meaning of "hotplugged" in spapr.c >> to manage a QEMU hotplugged device like a "coldplugged" one >> when the machine is awaiting an incoming migration. >> >> Signed-off-by: Laurent Vivier <lvivier@redhat.com> > > So, I think this is a reasonable concept, at least in terms of > cleanliness and not doing unnecessary work. However, if it's fixing > bugs, I suspect that means we still have problems elsewhere. > > Specifically, what is it we're doing before the incoming migration > that's breaking things. Even if it's unnecessary, anything done there > should be overwritten by the incoming stream. That should certainly > be the case (now) for the DRC state variables. Maybe not for the > queued hotplug events - but that means we should update the queue > migration to make sure we clear anything existing on the destination > before adding migrated events. > > I'm also concerned by the fact that this makes changes for memory and > cpu hotplug, but not for PCI devices. Why aren't they also affected > by this problem? There are some specific tests for PCI that change the behavior. For instance, see hw/ppc/spapr_drc.c, set_allocation_state() 151 if (drc->type != SPAPR_DR_CONNECTOR_TYPE_PCI) { 152 drc->allocation_state = state; 153 if (drc->awaiting_release && 154 drc->allocation_state == SPAPR_DR_ALLOCATION_STATE_UNUSABLE) { 155 trace_spapr_drc_set_allocation_state_finalizing(get_index(drc)); 156 drck->detach(drc, DEVICE(drc->dev), drc->detach_cb, 157 drc->detach_cb_opaque, NULL); 158 } else if (drc->allocation_state == SPAPR_DR_ALLOCATION_STATE_USABLE) { 159 drc->awaiting_allocation = false; 160 } 161 } attach(): 394 drc->signalled = (drc->type != SPAPR_DR_CONNECTOR_TYPE_PCI) 395 ? true : coldplug; 396 397 if (drc->type != SPAPR_DR_CONNECTOR_TYPE_PCI) { 398 drc->awaiting_allocation = true; 399 } 400 detach() 442 if (drc->type != SPAPR_DR_CONNECTOR_TYPE_PCI && 443 drc->allocation_state != SPAPR_DR_ALLOCATION_STATE_UNUSABLE) { 444 trace_spapr_drc_awaiting_unusable(get_index(drc)); 445 drc->awaiting_release = true; 446 return; 447 } and more... > > One nit in the implementation, see below: I agree, will fix. Thanks, Laurent
On Wed, 31 May 2017 14:35:57 +1000 David Gibson <david@gibson.dropbear.id.au> wrote: > On Tue, May 30, 2017 at 06:04:45PM +0200, Laurent Vivier wrote: > > For QEMU, a hotlugged device is a device added using the HMP/QMP > > interface. > > For SPAPR, a hotplugged device is a device added while the > > machine is running. In this case QEMU doesn't update internal > > state but relies on the OS for this part > > > > In the case of migration, when we (libvirt) hotplug a device > > on the source guest, we (libvirt) generally hotplug the same > > device on the destination guest. But in this case, the machine > > is stopped (RUN_STATE_INMIGRATE) and QEMU must not expect > > the OS will manage it as an hotplugged device as it will > > be "imported" by the migration. > > > > This patch changes the meaning of "hotplugged" in spapr.c > > to manage a QEMU hotplugged device like a "coldplugged" one > > when the machine is awaiting an incoming migration. > > > > Signed-off-by: Laurent Vivier <lvivier@redhat.com> > > So, I think this is a reasonable concept, at least in terms of > cleanliness and not doing unnecessary work. However, if it's fixing > bugs, I suspect that means we still have problems elsewhere. > > Specifically, what is it we're doing before the incoming migration > that's breaking things. Even if it's unnecessary, anything done there > should be overwritten by the incoming stream. That should certainly > be the case (now) for the DRC state variables. Maybe not for the > queued hotplug events - but that means we should update the queue > migration to make sure we clear anything existing on the destination > before adding migrated events. As David pointed out state of devices on target side should be overwritten by migration stream and migration hooks should fix up any side-effects that devices 'coldplugged' on target side had before migration stream. So patch looks like band aid, I'd hold off applying it and redefining hotplugged meaning at least until it's clearly understood what's going wrong during migration and if it could be fixed using generic approach before resorting to this patch. > I'm also concerned by the fact that this makes changes for memory and > cpu hotplug, but not for PCI devices. Why aren't they also affected > by this problem? > > One nit in the implementation, see below: > > > --- > > hw/ppc/spapr.c | 20 ++++++++++++++------ > > 1 file changed, 14 insertions(+), 6 deletions(-) > > > > diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c > > index 0980d73..f1302d0 100644 > > --- a/hw/ppc/spapr.c > > +++ b/hw/ppc/spapr.c > > @@ -2511,6 +2511,12 @@ static void spapr_nmi(NMIState *n, int cpu_index, Error **errp) > > } > > } > > > > +static bool spapr_coldplugged(DeviceState *dev) > > +{ > > + return runstate_check(RUN_STATE_INMIGRATE) || > > + !dev->hotplugged; > > +} > > + > > static void spapr_add_lmbs(DeviceState *dev, uint64_t addr_start, uint64_t size, > > uint32_t node, bool dedicated_hp_event_source, > > Error **errp) > > @@ -2521,6 +2527,7 @@ static void spapr_add_lmbs(DeviceState *dev, uint64_t addr_start, uint64_t size, > > int i, fdt_offset, fdt_size; > > void *fdt; > > uint64_t addr = addr_start; > > + bool coldplugged = spapr_coldplugged(dev); > > > > for (i = 0; i < nr_lmbs; i++) { > > drc = spapr_dr_connector_by_id(SPAPR_DR_CONNECTOR_TYPE_LMB, > > @@ -2532,9 +2539,9 @@ static void spapr_add_lmbs(DeviceState *dev, uint64_t addr_start, uint64_t size, > > SPAPR_MEMORY_BLOCK_SIZE); > > > > drck = SPAPR_DR_CONNECTOR_GET_CLASS(drc); > > - drck->attach(drc, dev, fdt, fdt_offset, !dev->hotplugged, errp); > > + drck->attach(drc, dev, fdt, fdt_offset, coldplugged, errp); > > addr += SPAPR_MEMORY_BLOCK_SIZE; > > - if (!dev->hotplugged) { > > + if (coldplugged) { > > /* guests expect coldplugged LMBs to be pre-allocated */ > > drck->set_allocation_state(drc, SPAPR_DR_ALLOCATION_STATE_USABLE); > > drck->set_isolation_state(drc, SPAPR_DR_ISOLATION_STATE_UNISOLATED); > > @@ -2543,7 +2550,7 @@ static void spapr_add_lmbs(DeviceState *dev, uint64_t addr_start, uint64_t size, > > /* send hotplug notification to the > > * guest only in case of hotplugged memory > > */ > > - if (dev->hotplugged) { > > + if (!coldplugged) { > > if (dedicated_hp_event_source) { > > drc = spapr_dr_connector_by_id(SPAPR_DR_CONNECTOR_TYPE_LMB, > > addr_start / SPAPR_MEMORY_BLOCK_SIZE); > > @@ -2776,6 +2783,7 @@ static void spapr_core_plug(HotplugHandler *hotplug_dev, DeviceState *dev, > > int smt = kvmppc_smt_threads(); > > CPUArchId *core_slot; > > int index; > > + bool coldplugged = spapr_coldplugged(dev); > > > > core_slot = spapr_find_cpu_slot(MACHINE(hotplug_dev), cc->core_id, &index); > > if (!core_slot) { > > @@ -2797,7 +2805,7 @@ static void spapr_core_plug(HotplugHandler *hotplug_dev, DeviceState *dev, > > > > if (drc) { > > sPAPRDRConnectorClass *drck = SPAPR_DR_CONNECTOR_GET_CLASS(drc); > > - drck->attach(drc, dev, fdt, fdt_offset, !dev->hotplugged, &local_err); > > + drck->attach(drc, dev, fdt, fdt_offset, coldplugged, &local_err); > > if (local_err) { > > g_free(fdt); > > error_propagate(errp, local_err); > > @@ -2805,7 +2813,7 @@ static void spapr_core_plug(HotplugHandler *hotplug_dev, DeviceState *dev, > > } > > } > > > > - if (dev->hotplugged) { > > + if (!coldplugged) { > > /* > > * Send hotplug notification interrupt to the guest only in case > > * of hotplugged CPUs. > > @@ -2838,7 +2846,7 @@ static void spapr_core_pre_plug(HotplugHandler *hotplug_dev, DeviceState *dev, > > int node_id; > > int index; > > > > - if (dev->hotplugged && !mc->has_hotpluggable_cpus) { > > + if (!spapr_coldplugged(dev) && !mc->has_hotpluggable_cpus) { > > It probably doesn't matter in practice, but in this specific instance, > I think you want the "raw" qemu meaning of hotplugged rather than the > spapr meaning. > > > error_setg(&local_err, "CPU hotplug not supported for this machine"); > > goto out; > > } >
Quoting David Gibson (2017-05-30 23:35:57) > On Tue, May 30, 2017 at 06:04:45PM +0200, Laurent Vivier wrote: > > For QEMU, a hotlugged device is a device added using the HMP/QMP > > interface. > > For SPAPR, a hotplugged device is a device added while the > > machine is running. In this case QEMU doesn't update internal > > state but relies on the OS for this part > > > > In the case of migration, when we (libvirt) hotplug a device > > on the source guest, we (libvirt) generally hotplug the same > > device on the destination guest. But in this case, the machine > > is stopped (RUN_STATE_INMIGRATE) and QEMU must not expect > > the OS will manage it as an hotplugged device as it will > > be "imported" by the migration. > > > > This patch changes the meaning of "hotplugged" in spapr.c > > to manage a QEMU hotplugged device like a "coldplugged" one > > when the machine is awaiting an incoming migration. > > > > Signed-off-by: Laurent Vivier <lvivier@redhat.com> > > So, I think this is a reasonable concept, at least in terms of > cleanliness and not doing unnecessary work. However, if it's fixing > bugs, I suspect that means we still have problems elsewhere. I was hoping a lot of these issues would go away once we default the initial/reset DRC states to "coldplugged". I think your pending patch: "spapr: Make DRC reset force DRC into known state" But I didn't consider the fact that libvirt will be issuing these hotplugs *after* reset, so those states would indeed need to be fixed up again to reflect boot-time,attached as opposed to boot-time,unattached before starting the target. So I do think this patch addresses a specific bug that isn't obviously fixable elsewhere. To me it seems like the only way to avoid doing something like what this patch does is to migrate all attached DRCs from the source in all cases. This would break backward-migration though, unless we switch from using subregions for DRCs to explicitly disabling DRC migration based on machine type. That approach seems to similar to what x86 does, e.g. hw/acpi/ich9.c and hw/acpi/piix.c migrate vmstate_memhp_state (corresponding to all DIMMs' slot status) in all cases where memory hotplug is enabled. If they were to do this using subregions for DIMMs in a transitional state I think similar issues would pop up in that code as well. Even if we take this route, we still need to explicitly suppress hotplug events during INMIGRATE to avoid extra events going on the queue. *Unless* we similarly rely purely on the ones sent by the source. I believe the proposed event migration patches using VMSTATE_QTAILQ_V only add to the list, so we'd need a variant that either nukes the list first, or a pre-load hook in vmstate_spapr_pending_events that does the same. Personally, it's seeming like the general approach of not special-casing INMIGRATE, and just letting migration do the fixing, is easier to deal with conceptually, albeit somewhat less flexible in terms of backward compatibility. Both approaches seem reasonable though. > > Specifically, what is it we're doing before the incoming migration > that's breaking things. Even if it's unnecessary, anything done there > should be overwritten by the incoming stream. That should certainly > be the case (now) for the DRC state variables. Maybe not for the > queued hotplug events - but that means we should update the queue > migration to make sure we clear anything existing on the destination > before adding migrated events. > > I'm also concerned by the fact that this makes changes for memory and > cpu hotplug, but not for PCI devices. Why aren't they also affected > by this problem? > > One nit in the implementation, see below: > > > --- > > hw/ppc/spapr.c | 20 ++++++++++++++------ > > 1 file changed, 14 insertions(+), 6 deletions(-) > > > > diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c > > index 0980d73..f1302d0 100644 > > --- a/hw/ppc/spapr.c > > +++ b/hw/ppc/spapr.c > > @@ -2511,6 +2511,12 @@ static void spapr_nmi(NMIState *n, int cpu_index, Error **errp) > > } > > } > > > > +static bool spapr_coldplugged(DeviceState *dev) > > +{ > > + return runstate_check(RUN_STATE_INMIGRATE) || > > + !dev->hotplugged; > > +} > > + > > static void spapr_add_lmbs(DeviceState *dev, uint64_t addr_start, uint64_t size, > > uint32_t node, bool dedicated_hp_event_source, > > Error **errp) > > @@ -2521,6 +2527,7 @@ static void spapr_add_lmbs(DeviceState *dev, uint64_t addr_start, uint64_t size, > > int i, fdt_offset, fdt_size; > > void *fdt; > > uint64_t addr = addr_start; > > + bool coldplugged = spapr_coldplugged(dev); > > > > for (i = 0; i < nr_lmbs; i++) { > > drc = spapr_dr_connector_by_id(SPAPR_DR_CONNECTOR_TYPE_LMB, > > @@ -2532,9 +2539,9 @@ static void spapr_add_lmbs(DeviceState *dev, uint64_t addr_start, uint64_t size, > > SPAPR_MEMORY_BLOCK_SIZE); > > > > drck = SPAPR_DR_CONNECTOR_GET_CLASS(drc); > > - drck->attach(drc, dev, fdt, fdt_offset, !dev->hotplugged, errp); > > + drck->attach(drc, dev, fdt, fdt_offset, coldplugged, errp); > > addr += SPAPR_MEMORY_BLOCK_SIZE; > > - if (!dev->hotplugged) { > > + if (coldplugged) { > > /* guests expect coldplugged LMBs to be pre-allocated */ > > drck->set_allocation_state(drc, SPAPR_DR_ALLOCATION_STATE_USABLE); > > drck->set_isolation_state(drc, SPAPR_DR_ISOLATION_STATE_UNISOLATED); > > @@ -2543,7 +2550,7 @@ static void spapr_add_lmbs(DeviceState *dev, uint64_t addr_start, uint64_t size, > > /* send hotplug notification to the > > * guest only in case of hotplugged memory > > */ > > - if (dev->hotplugged) { > > + if (!coldplugged) { > > if (dedicated_hp_event_source) { > > drc = spapr_dr_connector_by_id(SPAPR_DR_CONNECTOR_TYPE_LMB, > > addr_start / SPAPR_MEMORY_BLOCK_SIZE); > > @@ -2776,6 +2783,7 @@ static void spapr_core_plug(HotplugHandler *hotplug_dev, DeviceState *dev, > > int smt = kvmppc_smt_threads(); > > CPUArchId *core_slot; > > int index; > > + bool coldplugged = spapr_coldplugged(dev); > > > > core_slot = spapr_find_cpu_slot(MACHINE(hotplug_dev), cc->core_id, &index); > > if (!core_slot) { > > @@ -2797,7 +2805,7 @@ static void spapr_core_plug(HotplugHandler *hotplug_dev, DeviceState *dev, > > > > if (drc) { > > sPAPRDRConnectorClass *drck = SPAPR_DR_CONNECTOR_GET_CLASS(drc); > > - drck->attach(drc, dev, fdt, fdt_offset, !dev->hotplugged, &local_err); > > + drck->attach(drc, dev, fdt, fdt_offset, coldplugged, &local_err); > > if (local_err) { > > g_free(fdt); > > error_propagate(errp, local_err); > > @@ -2805,7 +2813,7 @@ static void spapr_core_plug(HotplugHandler *hotplug_dev, DeviceState *dev, > > } > > } > > > > - if (dev->hotplugged) { > > + if (!coldplugged) { > > /* > > * Send hotplug notification interrupt to the guest only in case > > * of hotplugged CPUs. > > @@ -2838,7 +2846,7 @@ static void spapr_core_pre_plug(HotplugHandler *hotplug_dev, DeviceState *dev, > > int node_id; > > int index; > > > > - if (dev->hotplugged && !mc->has_hotpluggable_cpus) { > > + if (!spapr_coldplugged(dev) && !mc->has_hotpluggable_cpus) { > > It probably doesn't matter in practice, but in this specific instance, > I think you want the "raw" qemu meaning of hotplugged rather than the > spapr meaning. > > > error_setg(&local_err, "CPU hotplug not supported for this machine"); > > goto out; > > } > > -- > David Gibson | I'll have my music baroque, and my code > david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ > | _way_ _around_! > http://www.ozlabs.org/~dgibson
On Thu, 08 Jun 2017 15:00:53 -0500 Michael Roth <mdroth@linux.vnet.ibm.com> wrote: > Quoting David Gibson (2017-05-30 23:35:57) > > On Tue, May 30, 2017 at 06:04:45PM +0200, Laurent Vivier wrote: > > > For QEMU, a hotlugged device is a device added using the HMP/QMP > > > interface. > > > For SPAPR, a hotplugged device is a device added while the > > > machine is running. In this case QEMU doesn't update internal > > > state but relies on the OS for this part > > > > > > In the case of migration, when we (libvirt) hotplug a device > > > on the source guest, we (libvirt) generally hotplug the same > > > device on the destination guest. But in this case, the machine > > > is stopped (RUN_STATE_INMIGRATE) and QEMU must not expect > > > the OS will manage it as an hotplugged device as it will > > > be "imported" by the migration. > > > > > > This patch changes the meaning of "hotplugged" in spapr.c > > > to manage a QEMU hotplugged device like a "coldplugged" one > > > when the machine is awaiting an incoming migration. > > > > > > Signed-off-by: Laurent Vivier <lvivier@redhat.com> > > > > So, I think this is a reasonable concept, at least in terms of > > cleanliness and not doing unnecessary work. However, if it's fixing > > bugs, I suspect that means we still have problems elsewhere. > > I was hoping a lot of these issues would go away once we default > the initial/reset DRC states to "coldplugged". I think your pending > patch: > > "spapr: Make DRC reset force DRC into known state" > > But I didn't consider the fact that libvirt will be issuing these > hotplugs *after* reset, so those states would indeed need to > be fixed up again to reflect boot-time,attached as opposed to > boot-time,unattached before starting the target. > > So I do think this patch addresses a specific bug that isn't > obviously fixable elsewhere. > > To me it seems like the only way to avoid doing something like > what this patch does is to migrate all attached DRCs from the > source in all cases. > > This would break backward-migration though, unless we switch from > using subregions for DRCs to explicitly disabling DRC migration > based on machine type. we could leave old machines broken and fix only new machine types, then it would be easy ot migrate 'additional' DRC state as subsection only on new for new machines. > > That approach seems to similar to what x86 does, e.g. > hw/acpi/ich9.c and hw/acpi/piix.c migrate vmstate_memhp_state > (corresponding to all DIMMs' slot status) in all cases where > memory hotplug is enabled. If they were to do this using > subregions for DIMMs in a transitional state I think similar > issues would pop up in that code as well. > > Even if we take this route, we still need to explicitly suppress > hotplug events during INMIGRATE to avoid extra events going on > the queue. *Unless* we similarly rely purely on the ones sent by > the source. pc/q35 might also lose events if device is hotplugged during migration, in addition migration would fail anyway since dst qemu should be launched with all devices that are present on src. ex: consider if one hotplugs DIMM during migration, it creates RAM region mapped into guest and that region might be transferred as part of VMState (not sure if it even works) and considering dst qemu has no idea about hotplugged memory mapping, the migration would fail on receiving unknown VMState. Hotplug generally doesn't work during migration, so it should be disabled in a generic way on migration start and re-enabled on target on migration completion. How about blocking device_add when INMIGRATE state and unblocking it when switching to runnig on dst? > I believe the proposed event migration patches using > VMSTATE_QTAILQ_V only add to the list, so we'd need a variant > that either nukes the list first, or a pre-load hook in > vmstate_spapr_pending_events that does the same. > > Personally, it's seeming like the general approach of not > special-casing INMIGRATE, and just letting migration do the > fixing, is easier to deal with conceptually, albeit somewhat > less flexible in terms of backward compatibility. Both approaches > seem reasonable though. > > > > > Specifically, what is it we're doing before the incoming migration > > that's breaking things. Even if it's unnecessary, anything done there > > should be overwritten by the incoming stream. That should certainly > > be the case (now) for the DRC state variables. Maybe not for the > > queued hotplug events - but that means we should update the queue > > migration to make sure we clear anything existing on the destination > > before adding migrated events. > > > > I'm also concerned by the fact that this makes changes for memory and > > cpu hotplug, but not for PCI devices. Why aren't they also affected > > by this problem? > > > > One nit in the implementation, see below: > > > > > --- > > > hw/ppc/spapr.c | 20 ++++++++++++++------ > > > 1 file changed, 14 insertions(+), 6 deletions(-) > > > > > > diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c > > > index 0980d73..f1302d0 100644 > > > --- a/hw/ppc/spapr.c > > > +++ b/hw/ppc/spapr.c > > > @@ -2511,6 +2511,12 @@ static void spapr_nmi(NMIState *n, int cpu_index, Error **errp) > > > } > > > } > > > > > > +static bool spapr_coldplugged(DeviceState *dev) > > > +{ > > > + return runstate_check(RUN_STATE_INMIGRATE) || > > > + !dev->hotplugged; > > > +} > > > + > > > static void spapr_add_lmbs(DeviceState *dev, uint64_t addr_start, uint64_t size, > > > uint32_t node, bool dedicated_hp_event_source, > > > Error **errp) > > > @@ -2521,6 +2527,7 @@ static void spapr_add_lmbs(DeviceState *dev, uint64_t addr_start, uint64_t size, > > > int i, fdt_offset, fdt_size; > > > void *fdt; > > > uint64_t addr = addr_start; > > > + bool coldplugged = spapr_coldplugged(dev); > > > > > > for (i = 0; i < nr_lmbs; i++) { > > > drc = spapr_dr_connector_by_id(SPAPR_DR_CONNECTOR_TYPE_LMB, > > > @@ -2532,9 +2539,9 @@ static void spapr_add_lmbs(DeviceState *dev, uint64_t addr_start, uint64_t size, > > > SPAPR_MEMORY_BLOCK_SIZE); > > > > > > drck = SPAPR_DR_CONNECTOR_GET_CLASS(drc); > > > - drck->attach(drc, dev, fdt, fdt_offset, !dev->hotplugged, errp); > > > + drck->attach(drc, dev, fdt, fdt_offset, coldplugged, errp); > > > addr += SPAPR_MEMORY_BLOCK_SIZE; > > > - if (!dev->hotplugged) { > > > + if (coldplugged) { > > > /* guests expect coldplugged LMBs to be pre-allocated */ > > > drck->set_allocation_state(drc, SPAPR_DR_ALLOCATION_STATE_USABLE); > > > drck->set_isolation_state(drc, SPAPR_DR_ISOLATION_STATE_UNISOLATED); > > > @@ -2543,7 +2550,7 @@ static void spapr_add_lmbs(DeviceState *dev, uint64_t addr_start, uint64_t size, > > > /* send hotplug notification to the > > > * guest only in case of hotplugged memory > > > */ > > > - if (dev->hotplugged) { > > > + if (!coldplugged) { > > > if (dedicated_hp_event_source) { > > > drc = spapr_dr_connector_by_id(SPAPR_DR_CONNECTOR_TYPE_LMB, > > > addr_start / SPAPR_MEMORY_BLOCK_SIZE); > > > @@ -2776,6 +2783,7 @@ static void spapr_core_plug(HotplugHandler *hotplug_dev, DeviceState *dev, > > > int smt = kvmppc_smt_threads(); > > > CPUArchId *core_slot; > > > int index; > > > + bool coldplugged = spapr_coldplugged(dev); > > > > > > core_slot = spapr_find_cpu_slot(MACHINE(hotplug_dev), cc->core_id, &index); > > > if (!core_slot) { > > > @@ -2797,7 +2805,7 @@ static void spapr_core_plug(HotplugHandler *hotplug_dev, DeviceState *dev, > > > > > > if (drc) { > > > sPAPRDRConnectorClass *drck = SPAPR_DR_CONNECTOR_GET_CLASS(drc); > > > - drck->attach(drc, dev, fdt, fdt_offset, !dev->hotplugged, &local_err); > > > + drck->attach(drc, dev, fdt, fdt_offset, coldplugged, &local_err); > > > if (local_err) { > > > g_free(fdt); > > > error_propagate(errp, local_err); > > > @@ -2805,7 +2813,7 @@ static void spapr_core_plug(HotplugHandler *hotplug_dev, DeviceState *dev, > > > } > > > } > > > > > > - if (dev->hotplugged) { > > > + if (!coldplugged) { > > > /* > > > * Send hotplug notification interrupt to the guest only in case > > > * of hotplugged CPUs. > > > @@ -2838,7 +2846,7 @@ static void spapr_core_pre_plug(HotplugHandler *hotplug_dev, DeviceState *dev, > > > int node_id; > > > int index; > > > > > > - if (dev->hotplugged && !mc->has_hotpluggable_cpus) { > > > + if (!spapr_coldplugged(dev) && !mc->has_hotpluggable_cpus) { > > > > It probably doesn't matter in practice, but in this specific instance, > > I think you want the "raw" qemu meaning of hotplugged rather than the > > spapr meaning. > > > > > error_setg(&local_err, "CPU hotplug not supported for this machine"); > > > goto out; > > > } > > > > -- > > David Gibson | I'll have my music baroque, and my code > > david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ > > | _way_ _around_! > > http://www.ozlabs.org/~dgibson > >
On Thu, Jun 08, 2017 at 03:00:53PM -0500, Michael Roth wrote: > Quoting David Gibson (2017-05-30 23:35:57) > > On Tue, May 30, 2017 at 06:04:45PM +0200, Laurent Vivier wrote: > > > For QEMU, a hotlugged device is a device added using the HMP/QMP > > > interface. > > > For SPAPR, a hotplugged device is a device added while the > > > machine is running. In this case QEMU doesn't update internal > > > state but relies on the OS for this part > > > > > > In the case of migration, when we (libvirt) hotplug a device > > > on the source guest, we (libvirt) generally hotplug the same > > > device on the destination guest. But in this case, the machine > > > is stopped (RUN_STATE_INMIGRATE) and QEMU must not expect > > > the OS will manage it as an hotplugged device as it will > > > be "imported" by the migration. > > > > > > This patch changes the meaning of "hotplugged" in spapr.c > > > to manage a QEMU hotplugged device like a "coldplugged" one > > > when the machine is awaiting an incoming migration. > > > > > > Signed-off-by: Laurent Vivier <lvivier@redhat.com> > > > > So, I think this is a reasonable concept, at least in terms of > > cleanliness and not doing unnecessary work. However, if it's fixing > > bugs, I suspect that means we still have problems elsewhere. > > I was hoping a lot of these issues would go away once we default > the initial/reset DRC states to "coldplugged". I think your pending > patch: > > "spapr: Make DRC reset force DRC into known state" > > But I didn't consider the fact that libvirt will be issuing these > hotplugs *after* reset, so those states would indeed need to > be fixed up again to reflect boot-time,attached as opposed to > boot-time,unattached before starting the target. So, something I haven't written yet, but I'm considering for my DRC cleanup series is to force a reset of the DRC objects (and just the DRC objects) at CAS time. Since that's when the guest gets the final version of the device tree, that should ensure that the DRC state is in sync with whether the guest sees the device as hotplugged or coldplugged.
On Fri, Jun 09, 2017 at 10:27:33AM +0200, Igor Mammedov wrote: > On Thu, 08 Jun 2017 15:00:53 -0500 > Michael Roth <mdroth@linux.vnet.ibm.com> wrote: > > > Quoting David Gibson (2017-05-30 23:35:57) > > > On Tue, May 30, 2017 at 06:04:45PM +0200, Laurent Vivier wrote: > > > > For QEMU, a hotlugged device is a device added using the HMP/QMP > > > > interface. > > > > For SPAPR, a hotplugged device is a device added while the > > > > machine is running. In this case QEMU doesn't update internal > > > > state but relies on the OS for this part > > > > > > > > In the case of migration, when we (libvirt) hotplug a device > > > > on the source guest, we (libvirt) generally hotplug the same > > > > device on the destination guest. But in this case, the machine > > > > is stopped (RUN_STATE_INMIGRATE) and QEMU must not expect > > > > the OS will manage it as an hotplugged device as it will > > > > be "imported" by the migration. > > > > > > > > This patch changes the meaning of "hotplugged" in spapr.c > > > > to manage a QEMU hotplugged device like a "coldplugged" one > > > > when the machine is awaiting an incoming migration. > > > > > > > > Signed-off-by: Laurent Vivier <lvivier@redhat.com> > > > > > > So, I think this is a reasonable concept, at least in terms of > > > cleanliness and not doing unnecessary work. However, if it's fixing > > > bugs, I suspect that means we still have problems elsewhere. > > > > I was hoping a lot of these issues would go away once we default > > the initial/reset DRC states to "coldplugged". I think your pending > > patch: > > > > "spapr: Make DRC reset force DRC into known state" > > > > But I didn't consider the fact that libvirt will be issuing these > > hotplugs *after* reset, so those states would indeed need to > > be fixed up again to reflect boot-time,attached as opposed to > > boot-time,unattached before starting the target. > > > > So I do think this patch addresses a specific bug that isn't > > obviously fixable elsewhere. > > > > To me it seems like the only way to avoid doing something like > > what this patch does is to migrate all attached DRCs from the > > source in all cases. > > > > This would break backward-migration though, unless we switch from > > using subregions for DRCs to explicitly disabling DRC migration > > based on machine type. > we could leave old machines broken and fix only new machine types, > then it would be easy ot migrate 'additional' DRC state as subsection > only on new for new machines. > > > > > That approach seems to similar to what x86 does, e.g. > > hw/acpi/ich9.c and hw/acpi/piix.c migrate vmstate_memhp_state > > (corresponding to all DIMMs' slot status) in all cases where > > memory hotplug is enabled. If they were to do this using > > subregions for DIMMs in a transitional state I think similar > > issues would pop up in that code as well. > > > > Even if we take this route, we still need to explicitly suppress > > hotplug events during INMIGRATE to avoid extra events going on > > the queue. *Unless* we similarly rely purely on the ones sent by > > the source. > pc/q35 might also lose events if device is hotplugged during migration, > in addition migration would fail anyway since dst qemu > should be launched with all devices that are present on src. > > ex: consider if one hotplugs DIMM during migration, it creates > RAM region mapped into guest and that region might be transferred > as part of VMState (not sure if it even works) > and considering dst qemu has no idea about hotplugged memory mapping, > the migration would fail on receiving unknown VMState. > > Hotplug generally doesn't work during migration, so it should be disabled > in a generic way on migration start and re-enabled on target > on migration completion. How about blocking device_add when > INMIGRATE state and unblocking it when switching to runnig on dst? Yeah, that sounds like a good idea. We need to cover the case of migration during an incomplete hot (un)plug as well as hotplug during a migration.
Quoting Igor Mammedov (2017-06-09 03:27:33) > On Thu, 08 Jun 2017 15:00:53 -0500 > Michael Roth <mdroth@linux.vnet.ibm.com> wrote: > > > Quoting David Gibson (2017-05-30 23:35:57) > > > On Tue, May 30, 2017 at 06:04:45PM +0200, Laurent Vivier wrote: > > > > For QEMU, a hotlugged device is a device added using the HMP/QMP > > > > interface. > > > > For SPAPR, a hotplugged device is a device added while the > > > > machine is running. In this case QEMU doesn't update internal > > > > state but relies on the OS for this part > > > > > > > > In the case of migration, when we (libvirt) hotplug a device > > > > on the source guest, we (libvirt) generally hotplug the same > > > > device on the destination guest. But in this case, the machine > > > > is stopped (RUN_STATE_INMIGRATE) and QEMU must not expect > > > > the OS will manage it as an hotplugged device as it will > > > > be "imported" by the migration. > > > > > > > > This patch changes the meaning of "hotplugged" in spapr.c > > > > to manage a QEMU hotplugged device like a "coldplugged" one > > > > when the machine is awaiting an incoming migration. > > > > > > > > Signed-off-by: Laurent Vivier <lvivier@redhat.com> > > > > > > So, I think this is a reasonable concept, at least in terms of > > > cleanliness and not doing unnecessary work. However, if it's fixing > > > bugs, I suspect that means we still have problems elsewhere. > > > > I was hoping a lot of these issues would go away once we default > > the initial/reset DRC states to "coldplugged". I think your pending > > patch: > > > > "spapr: Make DRC reset force DRC into known state" > > > > But I didn't consider the fact that libvirt will be issuing these > > hotplugs *after* reset, so those states would indeed need to > > be fixed up again to reflect boot-time,attached as opposed to > > boot-time,unattached before starting the target. > > > > So I do think this patch addresses a specific bug that isn't > > obviously fixable elsewhere. > > > > To me it seems like the only way to avoid doing something like > > what this patch does is to migrate all attached DRCs from the > > source in all cases. > > > > This would break backward-migration though, unless we switch from > > using subregions for DRCs to explicitly disabling DRC migration > > based on machine type. > we could leave old machines broken and fix only new machine types, > then it would be easy ot migrate 'additional' DRC state as subsection > only on new for new machines. That's an option, but subsections were only really used for backward compatibility. Not sure how much we have to gain from using both. > > > > > That approach seems to similar to what x86 does, e.g. > > hw/acpi/ich9.c and hw/acpi/piix.c migrate vmstate_memhp_state > > (corresponding to all DIMMs' slot status) in all cases where > > memory hotplug is enabled. If they were to do this using > > subregions for DIMMs in a transitional state I think similar > > issues would pop up in that code as well. > > > > Even if we take this route, we still need to explicitly suppress > > hotplug events during INMIGRATE to avoid extra events going on > > the queue. *Unless* we similarly rely purely on the ones sent by > > the source. > pc/q35 might also lose events if device is hotplugged during migration, > in addition migration would fail anyway since dst qemu > should be launched with all devices that are present on src. > > ex: consider if one hotplugs DIMM during migration, it creates > RAM region mapped into guest and that region might be transferred > as part of VMState (not sure if it even works) > and considering dst qemu has no idea about hotplugged memory mapping, > the migration would fail on receiving unknown VMState. > > Hotplug generally doesn't work during migration, so it should be disabled > in a generic way on migration start and re-enabled on target > on migration completion. How about blocking device_add when > INMIGRATE state and unblocking it when switching to runnig on dst? Maybe I'm misunderstanding the intent of this patch, but in our own testing we've seen that even for CPUs hotplugged *before* migration starts, libvirt will add them to the dest via device_add instead of via the command-line. If the CPUs were all specified via command-line, I don't think these patches would be needed, since the coldplug hooks would be executed without the need to make any special considerations for INMIGRATE. This libvirt commit seems to confirm that the CPUs are added via device_add, and we've seen similar behavior in our testing: commit 9eb9106ea51b43102ee51132f69780b2c86ccbca Author: Peter Krempa <pkrempa@redhat.com> Date: Thu Aug 4 14:36:24 2016 +0200 qemu: command: Add support for sparse vcpu topologies Add support for using the new approach to hotplug vcpus using device_add during startup of qemu to allow sparse vcpu topologies. There are a few limitations imposed by qemu on the supported configuration: - vcpu0 needs to be always present and not hotpluggable - non-hotpluggable cpus need to be ordered at the beginning - order of the vcpus needs to be unique for every single hotpluggable entity Qemu also doesn't really allow to query the information necessary to start a VM with the vcpus directly on the commandline. Fortunately they can be hotplugged during startup. The new hotplug code uses the following approach: - non-hotpluggable vcpus are counted and put to the -smp option - qemu is started - qemu is queried for the necessary information - the configuration is checked - the hotpluggable vcpus are hotplugged - vcpus are started This patch adds a lot of checking code and enables the support to specify the individual vcpu element with qemu. So I don't think disabling migration during inmigrate is a possible alternative unless we rework how libvirt handles this. The only alternative to this patch that I'm aware of would be to always migrate DRCs when dev->hotplugged == true.
On Tue, 13 Jun 2017 16:42:45 -0500 Michael Roth <mdroth@linux.vnet.ibm.com> wrote: > Quoting Igor Mammedov (2017-06-09 03:27:33) > > On Thu, 08 Jun 2017 15:00:53 -0500 > > Michael Roth <mdroth@linux.vnet.ibm.com> wrote: > > > > > Quoting David Gibson (2017-05-30 23:35:57) > > > > On Tue, May 30, 2017 at 06:04:45PM +0200, Laurent Vivier wrote: > > > > > For QEMU, a hotlugged device is a device added using the HMP/QMP > > > > > interface. > > > > > For SPAPR, a hotplugged device is a device added while the > > > > > machine is running. In this case QEMU doesn't update internal > > > > > state but relies on the OS for this part > > > > > > > > > > In the case of migration, when we (libvirt) hotplug a device > > > > > on the source guest, we (libvirt) generally hotplug the same > > > > > device on the destination guest. But in this case, the machine > > > > > is stopped (RUN_STATE_INMIGRATE) and QEMU must not expect > > > > > the OS will manage it as an hotplugged device as it will > > > > > be "imported" by the migration. > > > > > > > > > > This patch changes the meaning of "hotplugged" in spapr.c > > > > > to manage a QEMU hotplugged device like a "coldplugged" one > > > > > when the machine is awaiting an incoming migration. > > > > > > > > > > Signed-off-by: Laurent Vivier <lvivier@redhat.com> > > > > > > > > So, I think this is a reasonable concept, at least in terms of > > > > cleanliness and not doing unnecessary work. However, if it's fixing > > > > bugs, I suspect that means we still have problems elsewhere. > > > > > > I was hoping a lot of these issues would go away once we default > > > the initial/reset DRC states to "coldplugged". I think your pending > > > patch: > > > > > > "spapr: Make DRC reset force DRC into known state" > > > > > > But I didn't consider the fact that libvirt will be issuing these > > > hotplugs *after* reset, so those states would indeed need to > > > be fixed up again to reflect boot-time,attached as opposed to > > > boot-time,unattached before starting the target. > > > > > > So I do think this patch addresses a specific bug that isn't > > > obviously fixable elsewhere. > > > > > > To me it seems like the only way to avoid doing something like > > > what this patch does is to migrate all attached DRCs from the > > > source in all cases. > > > > > > This would break backward-migration though, unless we switch from > > > using subregions for DRCs to explicitly disabling DRC migration > > > based on machine type. > > we could leave old machines broken and fix only new machine types, > > then it would be easy ot migrate 'additional' DRC state as subsection > > only on new for new machines. > > That's an option, but subsections were only really used for backward > compatibility. Not sure how much we have to gain from using both. If I remember correctly subsections could be/are used for forward compat stuff i.e. subsection is generated on source side when .needed callback returns true and destinations will just consume whatever data were sent without looking at .need callback. So source could generate extra DRC subsection when cpu hotplug is enabled for new machine types, ex: f816a62daa adding David/Juan to CC list to correct me if I'm wrong. > > > That approach seems to similar to what x86 does, e.g. > > > hw/acpi/ich9.c and hw/acpi/piix.c migrate vmstate_memhp_state > > > (corresponding to all DIMMs' slot status) in all cases where > > > memory hotplug is enabled. If they were to do this using > > > subregions for DIMMs in a transitional state I think similar > > > issues would pop up in that code as well. > > > > > > Even if we take this route, we still need to explicitly suppress > > > hotplug events during INMIGRATE to avoid extra events going on > > > the queue. *Unless* we similarly rely purely on the ones sent by > > > the source. > > pc/q35 might also lose events if device is hotplugged during migration, > > in addition migration would fail anyway since dst qemu > > should be launched with all devices that are present on src. > > > > ex: consider if one hotplugs DIMM during migration, it creates > > RAM region mapped into guest and that region might be transferred > > as part of VMState (not sure if it even works) > > and considering dst qemu has no idea about hotplugged memory mapping, > > the migration would fail on receiving unknown VMState. > > > > Hotplug generally doesn't work during migration, so it should be disabled > > in a generic way on migration start and re-enabled on target > > on migration completion. How about blocking device_add when > > INMIGRATE state and unblocking it when switching to runnig on dst? > > Maybe I'm misunderstanding the intent of this patch, but in our own > testing we've seen that even for CPUs hotplugged *before* migration > starts, libvirt will add them to the dest via device_add instead of > via the command-line. the way migration currently works, this behavior seems fine to me, whether hotplugged CPUs on target side are specified with -device or device_add, it shouldn't affect machine behavior. It doesn't in case of cpu for x86, where hotplug process state is migrated as VMSTATE_CPU_HOTPLUG subsection of piix4_pm/ich9 device. > If the CPUs were all specified via command-line, I don't think these > patches would be needed, since the coldplug hooks would be executed > without the need to make any special considerations for INMIGRATE. I don't think that it's libvirt's problem, user if free to use either -device or device_add to add devices on target side. We might re-enable writing to hotplugged property (36cccb8c5) and ask mgmt to set its value on target but that would not work fine for old mgmt tools. Perhaps we should migrate DeviceState::hotplugged property state as part of every device so that target could fixup device state according to its value, but I'm not sure if it is useful. > This libvirt commit seems to confirm that the CPUs are added via > device_add, and we've seen similar behavior in our testing: > > > commit 9eb9106ea51b43102ee51132f69780b2c86ccbca > Author: Peter Krempa <pkrempa@redhat.com> > Date: Thu Aug 4 14:36:24 2016 +0200 > > qemu: command: Add support for sparse vcpu topologies > > Add support for using the new approach to hotplug vcpus using device_add > during startup of qemu to allow sparse vcpu topologies. > > There are a few limitations imposed by qemu on the supported > configuration: > - vcpu0 needs to be always present and not hotpluggable > - non-hotpluggable cpus need to be ordered at the beginning > - order of the vcpus needs to be unique for every single hotpluggable > entity > > Qemu also doesn't really allow to query the information necessary to > start a VM with the vcpus directly on the commandline. Fortunately they > can be hotplugged during startup. > > The new hotplug code uses the following approach: > - non-hotpluggable vcpus are counted and put to the -smp option > - qemu is started > - qemu is queried for the necessary information > - the configuration is checked > - the hotpluggable vcpus are hotplugged > - vcpus are started > > This patch adds a lot of checking code and enables the support to > specify the individual vcpu element with qemu. > > > So I don't think disabling migration during inmigrate is a possible > alternative unless we rework how libvirt handles this. The only > alternative to this patch that I'm aware of would be to always > migrate DRCs when dev->hotplugged == true. Currently I'd suggest to look into always migrate DRCs if cpu hotplug is enabled even if dev->hotplugged is false (not nice but it might work). Consider: SRC1: hotplug CPU1 => CPU1.hotplugged = true DST1: -device CPU1 => CPU1.hotplugged = false so in current code relying on CPU1.hotplugged would not work as expected, it works by accident because libvirt uses device_add on target DST1: device_add CPU1 => CPU1.hotplugged = true If we try to fix it by migrating 'DeviceState::hotplugged' flag, we would need CPU/memory/machine specific migration hooks which will fix device/machine state as by the time migration stream is processed on target side, all devices are already wired up using -device or device_add paths (cold/hotplugged paths). Approach doesn't seem robust to me. May be we should 1. make DeviceState:hotpluggable property write-able again 2. transmit DeviceState:hotpluggable as part of migration stream 3. add generic migration hook which will check if target and source value match, if value differs => fail/abort migration. 4. in case values mismatch mgmt will be forced to explicitly provide hotplugged property value on -device/device_add That would enforce consistent DeviceState:hotpluggable value on target and source. We can enforce it only for new machine types so it won't break old mgmt tools with old machine types but would force mgmt for new machines to use hotplugged property on target so QEMU could rely on its value for migration purposes.
Igor Mammedov <imammedo@redhat.com> wrote: > On Tue, 13 Jun 2017 16:42:45 -0500 > Michael Roth <mdroth@linux.vnet.ibm.com> wrote: > >> Quoting Igor Mammedov (2017-06-09 03:27:33) >> > On Thu, 08 Jun 2017 15:00:53 -0500 >> > Michael Roth <mdroth@linux.vnet.ibm.com> wrote: >> > >> > > Quoting David Gibson (2017-05-30 23:35:57) >> > > > On Tue, May 30, 2017 at 06:04:45PM +0200, Laurent Vivier wrote: >> > > > > For QEMU, a hotlugged device is a device added using the HMP/QMP >> > > > > interface. >> > > > > For SPAPR, a hotplugged device is a device added while the >> > > > > machine is running. In this case QEMU doesn't update internal >> > > > > state but relies on the OS for this part >> > > > > >> > > > > In the case of migration, when we (libvirt) hotplug a device >> > > > > on the source guest, we (libvirt) generally hotplug the same >> > > > > device on the destination guest. But in this case, the machine >> > > > > is stopped (RUN_STATE_INMIGRATE) and QEMU must not expect >> > > > > the OS will manage it as an hotplugged device as it will >> > > > > be "imported" by the migration. >> > > > > >> > > > > This patch changes the meaning of "hotplugged" in spapr.c >> > > > > to manage a QEMU hotplugged device like a "coldplugged" one >> > > > > when the machine is awaiting an incoming migration. >> > > > > >> > > > > Signed-off-by: Laurent Vivier <lvivier@redhat.com> >> > > > >> > > > So, I think this is a reasonable concept, at least in terms of >> > > > cleanliness and not doing unnecessary work. However, if it's fixing >> > > > bugs, I suspect that means we still have problems elsewhere. >> > > >> > > I was hoping a lot of these issues would go away once we default >> > > the initial/reset DRC states to "coldplugged". I think your pending >> > > patch: >> > > >> > > "spapr: Make DRC reset force DRC into known state" >> > > >> > > But I didn't consider the fact that libvirt will be issuing these >> > > hotplugs *after* reset, so those states would indeed need to >> > > be fixed up again to reflect boot-time,attached as opposed to >> > > boot-time,unattached before starting the target. >> > > >> > > So I do think this patch addresses a specific bug that isn't >> > > obviously fixable elsewhere. >> > > >> > > To me it seems like the only way to avoid doing something like >> > > what this patch does is to migrate all attached DRCs from the >> > > source in all cases. >> > > >> > > This would break backward-migration though, unless we switch from >> > > using subregions for DRCs to explicitly disabling DRC migration >> > > based on machine type. >> > we could leave old machines broken and fix only new machine types, >> > then it would be easy ot migrate 'additional' DRC state as subsection >> > only on new for new machines. >> >> That's an option, but subsections were only really used for backward >> compatibility. Not sure how much we have to gain from using both. > If I remember correctly subsections could be/are used for forward compat stuff > i.e. subsection is generated on source side when .needed callback returns > true and destinations will just consume whatever data were sent > without looking at .need callback. So source could generate extra > DRC subsection when cpu hotplug is enabled for new machine types, > ex: f816a62daa > > adding David/Juan to CC list to correct me if I'm wrong. Yeap. subsections are used when we know that we have missed some data (or we need some more data for some other reason). If the data would have been required always, we would have detected before. So subsections allows us to remain compatible, if needed() returns false, we are compatible with old version, and if it returns true, we send the additional data because we know that it is needed. So we have the following cases: old-qemu -> old-qemu will work as before if "needed" data is required, migration fails old-qemu -> new-qemu identical to previous new-qemu -> new-qemu subsection is sent when nededed new-qemu -> old-qemu subsection is sent if it is neded, and then it breaks migration but we know that it would have failed anyways. new-qemu -M old-machine-type -> new-qemu -M old-machine-type we sent and recognize the new subsection if it is required so, even with old machine types, if the qemus are new, we do the right thing. This is how subsections are supposed to work. I haven't investigated your particular problem or set of patches, if you need help here, please ask. Later, Juan.
* Igor Mammedov (imammedo@redhat.com) wrote: > On Tue, 13 Jun 2017 16:42:45 -0500 > Michael Roth <mdroth@linux.vnet.ibm.com> wrote: > > > Quoting Igor Mammedov (2017-06-09 03:27:33) > > > On Thu, 08 Jun 2017 15:00:53 -0500 > > > Michael Roth <mdroth@linux.vnet.ibm.com> wrote: > > > > > > > Quoting David Gibson (2017-05-30 23:35:57) > > > > > On Tue, May 30, 2017 at 06:04:45PM +0200, Laurent Vivier wrote: > > > > > > For QEMU, a hotlugged device is a device added using the HMP/QMP > > > > > > interface. > > > > > > For SPAPR, a hotplugged device is a device added while the > > > > > > machine is running. In this case QEMU doesn't update internal > > > > > > state but relies on the OS for this part > > > > > > > > > > > > In the case of migration, when we (libvirt) hotplug a device > > > > > > on the source guest, we (libvirt) generally hotplug the same > > > > > > device on the destination guest. But in this case, the machine > > > > > > is stopped (RUN_STATE_INMIGRATE) and QEMU must not expect > > > > > > the OS will manage it as an hotplugged device as it will > > > > > > be "imported" by the migration. > > > > > > > > > > > > This patch changes the meaning of "hotplugged" in spapr.c > > > > > > to manage a QEMU hotplugged device like a "coldplugged" one > > > > > > when the machine is awaiting an incoming migration. > > > > > > > > > > > > Signed-off-by: Laurent Vivier <lvivier@redhat.com> > > > > > > > > > > So, I think this is a reasonable concept, at least in terms of > > > > > cleanliness and not doing unnecessary work. However, if it's fixing > > > > > bugs, I suspect that means we still have problems elsewhere. > > > > > > > > I was hoping a lot of these issues would go away once we default > > > > the initial/reset DRC states to "coldplugged". I think your pending > > > > patch: > > > > > > > > "spapr: Make DRC reset force DRC into known state" > > > > > > > > But I didn't consider the fact that libvirt will be issuing these > > > > hotplugs *after* reset, so those states would indeed need to > > > > be fixed up again to reflect boot-time,attached as opposed to > > > > boot-time,unattached before starting the target. > > > > > > > > So I do think this patch addresses a specific bug that isn't > > > > obviously fixable elsewhere. > > > > > > > > To me it seems like the only way to avoid doing something like > > > > what this patch does is to migrate all attached DRCs from the > > > > source in all cases. > > > > > > > > This would break backward-migration though, unless we switch from > > > > using subregions for DRCs to explicitly disabling DRC migration > > > > based on machine type. > > > we could leave old machines broken and fix only new machine types, > > > then it would be easy ot migrate 'additional' DRC state as subsection > > > only on new for new machines. > > > > That's an option, but subsections were only really used for backward > > compatibility. Not sure how much we have to gain from using both. > If I remember correctly subsections could be/are used for forward compat stuff > i.e. subsection is generated on source side when .needed callback returns > true and destinations will just consume whatever data were sent > without looking at .need callback. So source could generate extra > DRC subsection when cpu hotplug is enabled for new machine types, > ex: f816a62daa > > adding David/Juan to CC list to correct me if I'm wrong. Yes I think that's right; but note that the destination does have to know about the subsection definition to consume it. It can't consume a subsection sent by a newer qemu which it doesn't have a definition of and so can't parse. Dave > > > > That approach seems to similar to what x86 does, e.g. > > > > hw/acpi/ich9.c and hw/acpi/piix.c migrate vmstate_memhp_state > > > > (corresponding to all DIMMs' slot status) in all cases where > > > > memory hotplug is enabled. If they were to do this using > > > > subregions for DIMMs in a transitional state I think similar > > > > issues would pop up in that code as well. > > > > > > > > Even if we take this route, we still need to explicitly suppress > > > > hotplug events during INMIGRATE to avoid extra events going on > > > > the queue. *Unless* we similarly rely purely on the ones sent by > > > > the source. > > > pc/q35 might also lose events if device is hotplugged during migration, > > > in addition migration would fail anyway since dst qemu > > > should be launched with all devices that are present on src. > > > > > > ex: consider if one hotplugs DIMM during migration, it creates > > > RAM region mapped into guest and that region might be transferred > > > as part of VMState (not sure if it even works) > > > and considering dst qemu has no idea about hotplugged memory mapping, > > > the migration would fail on receiving unknown VMState. > > > > > > Hotplug generally doesn't work during migration, so it should be disabled > > > in a generic way on migration start and re-enabled on target > > > on migration completion. How about blocking device_add when > > > INMIGRATE state and unblocking it when switching to runnig on dst? > > > > Maybe I'm misunderstanding the intent of this patch, but in our own > > testing we've seen that even for CPUs hotplugged *before* migration > > starts, libvirt will add them to the dest via device_add instead of > > via the command-line. > the way migration currently works, this behavior seems fine to me, > whether hotplugged CPUs on target side are specified with -device or > device_add, it shouldn't affect machine behavior. It doesn't in > case of cpu for x86, where hotplug process state is > migrated as VMSTATE_CPU_HOTPLUG subsection of piix4_pm/ich9 > device. > > > > If the CPUs were all specified via command-line, I don't think these > > patches would be needed, since the coldplug hooks would be executed > > without the need to make any special considerations for INMIGRATE. > I don't think that it's libvirt's problem, user if free to use either > -device or device_add to add devices on target side. > > We might re-enable writing to hotplugged property (36cccb8c5) > and ask mgmt to set its value on target but that would not work > fine for old mgmt tools. > Perhaps we should migrate DeviceState::hotplugged property state > as part of every device so that target could fixup device state > according to its value, but I'm not sure if it is useful. > > > > This libvirt commit seems to confirm that the CPUs are added via > > device_add, and we've seen similar behavior in our testing: > > > > > > commit 9eb9106ea51b43102ee51132f69780b2c86ccbca > > Author: Peter Krempa <pkrempa@redhat.com> > > Date: Thu Aug 4 14:36:24 2016 +0200 > > > > qemu: command: Add support for sparse vcpu topologies > > > > Add support for using the new approach to hotplug vcpus using device_add > > during startup of qemu to allow sparse vcpu topologies. > > > > There are a few limitations imposed by qemu on the supported > > configuration: > > - vcpu0 needs to be always present and not hotpluggable > > - non-hotpluggable cpus need to be ordered at the beginning > > - order of the vcpus needs to be unique for every single hotpluggable > > entity > > > > Qemu also doesn't really allow to query the information necessary to > > start a VM with the vcpus directly on the commandline. Fortunately they > > can be hotplugged during startup. > > > > The new hotplug code uses the following approach: > > - non-hotpluggable vcpus are counted and put to the -smp option > > - qemu is started > > - qemu is queried for the necessary information > > - the configuration is checked > > - the hotpluggable vcpus are hotplugged > > - vcpus are started > > > > This patch adds a lot of checking code and enables the support to > > specify the individual vcpu element with qemu. > > > > > > So I don't think disabling migration during inmigrate is a possible > > alternative unless we rework how libvirt handles this. The only > > alternative to this patch that I'm aware of would be to always > > migrate DRCs when dev->hotplugged == true. > Currently I'd suggest to look into always migrate DRCs if cpu hotplug > is enabled even if dev->hotplugged is false (not nice but it might work). > Consider: > SRC1: hotplug CPU1 => CPU1.hotplugged = true > DST1: -device CPU1 => CPU1.hotplugged = false > so in current code relying on CPU1.hotplugged would not work as expected, > it works by accident because libvirt uses device_add on target > DST1: device_add CPU1 => CPU1.hotplugged = true > > If we try to fix it by migrating 'DeviceState::hotplugged' flag, > we would need CPU/memory/machine specific migration hooks which will > fix device/machine state as by the time migration stream is processed > on target side, all devices are already wired up using -device or > device_add paths (cold/hotplugged paths). > Approach doesn't seem robust to me. > > May be we should > 1. make DeviceState:hotpluggable property write-able again > 2. transmit DeviceState:hotpluggable as part of migration stream > 3. add generic migration hook which will check if target and > source value match, if value differs => fail/abort migration. > 4. in case values mismatch mgmt will be forced to explicitly > provide hotplugged property value on -device/device_add > That would enforce consistent DeviceState:hotpluggable value > on target and source. > We can enforce it only for new machine types so it won't break > old mgmt tools with old machine types but would force mgmt > for new machines to use hotplugged property on target > so QEMU could rely on its value for migration purposes. -- Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
Quoting Igor Mammedov (2017-06-14 04:00:01) > On Tue, 13 Jun 2017 16:42:45 -0500 > Michael Roth <mdroth@linux.vnet.ibm.com> wrote: > > > Quoting Igor Mammedov (2017-06-09 03:27:33) > > > On Thu, 08 Jun 2017 15:00:53 -0500 > > > Michael Roth <mdroth@linux.vnet.ibm.com> wrote: > > > > > > > Quoting David Gibson (2017-05-30 23:35:57) > > > > > On Tue, May 30, 2017 at 06:04:45PM +0200, Laurent Vivier wrote: > > > > > > For QEMU, a hotlugged device is a device added using the HMP/QMP > > > > > > interface. > > > > > > For SPAPR, a hotplugged device is a device added while the > > > > > > machine is running. In this case QEMU doesn't update internal > > > > > > state but relies on the OS for this part > > > > > > > > > > > > In the case of migration, when we (libvirt) hotplug a device > > > > > > on the source guest, we (libvirt) generally hotplug the same > > > > > > device on the destination guest. But in this case, the machine > > > > > > is stopped (RUN_STATE_INMIGRATE) and QEMU must not expect > > > > > > the OS will manage it as an hotplugged device as it will > > > > > > be "imported" by the migration. > > > > > > > > > > > > This patch changes the meaning of "hotplugged" in spapr.c > > > > > > to manage a QEMU hotplugged device like a "coldplugged" one > > > > > > when the machine is awaiting an incoming migration. > > > > > > > > > > > > Signed-off-by: Laurent Vivier <lvivier@redhat.com> > > > > > > > > > > So, I think this is a reasonable concept, at least in terms of > > > > > cleanliness and not doing unnecessary work. However, if it's fixing > > > > > bugs, I suspect that means we still have problems elsewhere. > > > > > > > > I was hoping a lot of these issues would go away once we default > > > > the initial/reset DRC states to "coldplugged". I think your pending > > > > patch: > > > > > > > > "spapr: Make DRC reset force DRC into known state" > > > > > > > > But I didn't consider the fact that libvirt will be issuing these > > > > hotplugs *after* reset, so those states would indeed need to > > > > be fixed up again to reflect boot-time,attached as opposed to > > > > boot-time,unattached before starting the target. > > > > > > > > So I do think this patch addresses a specific bug that isn't > > > > obviously fixable elsewhere. > > > > > > > > To me it seems like the only way to avoid doing something like > > > > what this patch does is to migrate all attached DRCs from the > > > > source in all cases. > > > > > > > > This would break backward-migration though, unless we switch from > > > > using subregions for DRCs to explicitly disabling DRC migration > > > > based on machine type. > > > we could leave old machines broken and fix only new machine types, > > > then it would be easy ot migrate 'additional' DRC state as subsection > > > only on new for new machines. > > > > That's an option, but subsections were only really used for backward > > compatibility. Not sure how much we have to gain from using both. > If I remember correctly subsections could be/are used for forward compat stuff > i.e. subsection is generated on source side when .needed callback returns > true and destinations will just consume whatever data were sent > without looking at .need callback. So source could generate extra > DRC subsection when cpu hotplug is enabled for new machine types, > ex: f816a62daa Well, what I was thinking was that if we dropped the approach of basing .needed around "is this DRC in a transitional state?" (which can only be determined on the source) in favor of "if (dev->hotplugged)", we could possible get away with a non-subsection VMSD. But you're right, unless we can ensure dev->hotplugged is sychronized on both source and dest, we might still need to use a subsection regardless. *Unless* we just migrate all DRCs indiscriminately... > > adding David/Juan to CC list to correct me if I'm wrong. Thanks Juan and David for clarifying. > > > > > That approach seems to similar to what x86 does, e.g. > > > > hw/acpi/ich9.c and hw/acpi/piix.c migrate vmstate_memhp_state > > > > (corresponding to all DIMMs' slot status) in all cases where > > > > memory hotplug is enabled. If they were to do this using > > > > subregions for DIMMs in a transitional state I think similar > > > > issues would pop up in that code as well. > > > > > > > > Even if we take this route, we still need to explicitly suppress > > > > hotplug events during INMIGRATE to avoid extra events going on > > > > the queue. *Unless* we similarly rely purely on the ones sent by > > > > the source. > > > pc/q35 might also lose events if device is hotplugged during migration, > > > in addition migration would fail anyway since dst qemu > > > should be launched with all devices that are present on src. > > > > > > ex: consider if one hotplugs DIMM during migration, it creates > > > RAM region mapped into guest and that region might be transferred > > > as part of VMState (not sure if it even works) > > > and considering dst qemu has no idea about hotplugged memory mapping, > > > the migration would fail on receiving unknown VMState. > > > > > > Hotplug generally doesn't work during migration, so it should be disabled > > > in a generic way on migration start and re-enabled on target > > > on migration completion. How about blocking device_add when > > > INMIGRATE state and unblocking it when switching to runnig on dst? > > > > Maybe I'm misunderstanding the intent of this patch, but in our own > > testing we've seen that even for CPUs hotplugged *before* migration > > starts, libvirt will add them to the dest via device_add instead of > > via the command-line. > the way migration currently works, this behavior seems fine to me, > whether hotplugged CPUs on target side are specified with -device or > device_add, it shouldn't affect machine behavior. It doesn't in > case of cpu for x86, where hotplug process state is > migrated as VMSTATE_CPU_HOTPLUG subsection of piix4_pm/ich9 > device. Ok, looks like that works because the all the CPU/ACPI hotplug state is always migrated for newer machines. So we could do similar in our case for DRCs, and that would work, but in our case that's not quite ideal since we generally have a few hundred DRCs for a normal guest. It's maybe 8KB of data so it's not a huge deal, but it would be nice if we could avoid that. If we start having larger guests with multiple PHBs that could be an issue though, since we have 32*8 DRCs per PHB (and there's been some talk on having libvirt default to starting spapr guests with multiple PHBs (possibly the max) to deal with some limitations around hotplug for passthrough devices, so even if we don't have any passthrough devices attached we'd still have to deal with these DRCs if we take this approach). > > > > If the CPUs were all specified via command-line, I don't think these > > patches would be needed, since the coldplug hooks would be executed > > without the need to make any special considerations for INMIGRATE. > I don't think that it's libvirt's problem, user if free to use either > -device or device_add to add devices on target side. > > We might re-enable writing to hotplugged property (36cccb8c5) > and ask mgmt to set its value on target but that would not work > fine for old mgmt tools. > Perhaps we should migrate DeviceState::hotplugged property state > as part of every device so that target could fixup device state > according to its value, but I'm not sure if it is useful. > > > > This libvirt commit seems to confirm that the CPUs are added via > > device_add, and we've seen similar behavior in our testing: > > > > > > commit 9eb9106ea51b43102ee51132f69780b2c86ccbca > > Author: Peter Krempa <pkrempa@redhat.com> > > Date: Thu Aug 4 14:36:24 2016 +0200 > > > > qemu: command: Add support for sparse vcpu topologies > > > > Add support for using the new approach to hotplug vcpus using device_add > > during startup of qemu to allow sparse vcpu topologies. > > > > There are a few limitations imposed by qemu on the supported > > configuration: > > - vcpu0 needs to be always present and not hotpluggable > > - non-hotpluggable cpus need to be ordered at the beginning > > - order of the vcpus needs to be unique for every single hotpluggable > > entity > > > > Qemu also doesn't really allow to query the information necessary to > > start a VM with the vcpus directly on the commandline. Fortunately they > > can be hotplugged during startup. > > > > The new hotplug code uses the following approach: > > - non-hotpluggable vcpus are counted and put to the -smp option > > - qemu is started > > - qemu is queried for the necessary information > > - the configuration is checked > > - the hotpluggable vcpus are hotplugged > > - vcpus are started > > > > This patch adds a lot of checking code and enables the support to > > specify the individual vcpu element with qemu. > > > > > > So I don't think disabling migration during inmigrate is a possible > > alternative unless we rework how libvirt handles this. The only > > alternative to this patch that I'm aware of would be to always > > migrate DRCs when dev->hotplugged == true. > Currently I'd suggest to look into always migrate DRCs if cpu hotplug > is enabled even if dev->hotplugged is false (not nice but it might work). > Consider: > SRC1: hotplug CPU1 => CPU1.hotplugged = true > DST1: -device CPU1 => CPU1.hotplugged = false > so in current code relying on CPU1.hotplugged would not work as expected, > it works by accident because libvirt uses device_add on target > DST1: device_add CPU1 => CPU1.hotplugged = true It's actually the reverse for us, DST1: -device CPU1 works, because default DRC state for CPU1.hotplugged = false matches the state a hotplugged CPU will be brought to after onlining at the source, so we don't send it over the wire in the first place once it reaches that post-hotplug/coldplug state. So things work as expected, even though technically the source has dev->hotplugged == true, whereas the dest has dev->hotplugged == false. It's the DST1: device_add case that wasn't accounted for when the DRC migration patches were written, as those don't default to coldplug, so, because the source doesn't send it, it ends up being presented in pre-hotplug state because the dest doesn't know that the guest already onlined the resource and transitioned it to post-hotplug/coldplug state. Ironically, dev->hotplugged is true on both source and dest in this case, but it ends up being the broken one. But your point stands, the fact that both situations are possible means we can't currently rely on dev->hotplugged without migrating it, infering it based on QEMU lifecycle, or forcing management to set it. But that raises a 2nd point. Our dilemma isn't that we can't rely on dev->hotplugged being synchronized (though if it was we could build something around that), our dilemma is that we make the following assumption in our code: "Devices present at start-time will be handled the same way, on source or dest, regardless of whether they were added via cmdline or via device_add prior to machine start / migration stream processing." And I think that's a sensible expectation, since in theory even the source could build up a machine via device_add prior to starting it, and the reasonable default there is dev->hotplugged = false rather than the opposite. That suggests a need to fix things outside of migration. So far, all QEMU's existing migration code has managed ok with the dest being starting with dev->hotplugged == false via cmdline devices, even though maybe they were hotplugged on the source. To me, it makes more sense to maintain this behavior by fixing up this relatively new use-case of adding devices via device_add before start to match the same expectations we have around cmdline-specified devices. This would fix migration for spapr, leave it working for everyone else (since that's basically what we expect for everything except newer-style cpu hotplug), and also make the device-add-before-start be truly synonymous with cmdline-created devices (which is applicable even outside of migration). > > If we try to fix it by migrating 'DeviceState::hotplugged' flag, > we would need CPU/memory/machine specific migration hooks which will > fix device/machine state as by the time migration stream is processed > on target side, all devices are already wired up using -device or > device_add paths (cold/hotplugged paths). > Approach doesn't seem robust to me. If we infer it on the target within qdev, without relying on migration stream, maybe those fix-ups are less invasive. And what about the non-migration case as mentioned above? Should the source also be able to assume device_add-before-start matches behavior for cmdline-specified devices? > > May be we should > 1. make DeviceState:hotpluggable property write-able again > 2. transmit DeviceState:hotpluggable as part of migration stream > 3. add generic migration hook which will check if target and > source value match, if value differs => fail/abort migration. > 4. in case values mismatch mgmt will be forced to explicitly > provide hotplugged property value on -device/device_add > That would enforce consistent DeviceState:hotpluggable value > on target and source. > We can enforce it only for new machine types so it won't break > old mgmt tools with old machine types but would force mgmt > for new machines to use hotplugged property on target > so QEMU could rely on its value for migration purposes. > That would work, and generalizing this beyond spapr seems appropriate. It also has reasonable semantics, and it would work for us *provided that* we always send DRC state for hotplugged devices and not just DRCs in a transitional state: SRC1: device_add $cpu -> dev->hotplugged == true -> device starts in pre-hotplug, ends up in post-hotplug state after guest onlines it <migrate> DST1: device_add $cpu,hotplugged=true -> dev->hotplugged == true -> device starts in pre-hotplug state. guest sends updated state to transition DRC to post-hotplug But what about stuff like mem/pci? Currently, migration works for cases like: SRC1: device_add virtio-net-pci DST1: qemu -device virtio-net-pci Even though DST1 has dev->hotplugged == false, and SRC1 has the opposite. So for new machines, checking SRC1:dev->hotplugged == DST1:dev->hotplugged would fail, even though the migration scenario is unchanged from before. So management would now have to do: SRC1: device_add virtio-net-pci DST1: qemu -device virtio-net-pci,hotplugged=true But the code behavior is a bit different then, since we now get an ACPI hotplug event via the hotplug handler. Maybe the migration stream fixes that up for us, but I think we would need to audit this and similar cases to be sure. That's all fine if it's necessary, but I feel like this is the hard way to address what's actually a much more specific issue: that device_add before machine-start doesn't currently match the behavior for a device started via cmdline. i.e. dev->hotplugged in the former vs. !dev->hotplugged in the latter. I don't really see a good reason these 2 cases should be different, and we can bring them to parity by doing something like: 1. For device_adds after qdev_machine_creation_done(), but before machine start, set a flag: reset_before_start. 2. At the start of processing migration stream, or unpausing a -S guest (in the non-migration case), issue a system-wide reset if reset_before_start is set. 3. reset handlers will already unset dev->hotplugged at that point and re-execute all the hotplug hooks with dev->hotplugged == false. This should put everything in a state that's identical to cmdline-created devices. 4. Only allow management to do device_add before it sends the migration stream (if management doesn't already guard against this then it's probably a bug anyway) This allows management to treat device_add/cmdline as being completely synonymous for guests that haven't started yet, both for -incoming and -S in general, and it maintains the behavior that existing migration code expects of cmdline-specified devices.
On Wed, 14 Jun 2017 19:27:12 -0500 Michael Roth <mdroth@linux.vnet.ibm.com> wrote: > Quoting Igor Mammedov (2017-06-14 04:00:01) > > On Tue, 13 Jun 2017 16:42:45 -0500 > > Michael Roth <mdroth@linux.vnet.ibm.com> wrote: > > > > > Quoting Igor Mammedov (2017-06-09 03:27:33) [...] > > Currently I'd suggest to look into always migrate DRCs if cpu hotplug > > is enabled even if dev->hotplugged is false (not nice but it might work). > > Consider: > > SRC1: hotplug CPU1 => CPU1.hotplugged = true > > DST1: -device CPU1 => CPU1.hotplugged = false > > so in current code relying on CPU1.hotplugged would not work as expected, > > it works by accident because libvirt uses device_add on target > > DST1: device_add CPU1 => CPU1.hotplugged = true > > It's actually the reverse for us, DST1: -device CPU1 works, because > default DRC state for CPU1.hotplugged = false matches the state a > hotplugged CPU will be brought to after onlining at the source, so > we don't send it over the wire in the first place once it reaches > that post-hotplug/coldplug state. So things work as expected, even > though technically the source has dev->hotplugged == true, whereas > the dest has dev->hotplugged == false. in your case it seems fragile to rely on -device setting hotplugged cpu on target the way you want. it could be: SRC: hotplug CPU and start migration with DST: -device cpu[,hotplugged=false] *: machine is in hotplugging state and one would lose transitional state if DRC is not migrated. > It's the DST1: device_add case that wasn't accounted for when the DRC > migration patches were written, as those don't default to coldplug, > so, because the source doesn't send it, it ends up being presented > in pre-hotplug state because the dest doesn't know that the guest > already onlined the resource and transitioned it to > post-hotplug/coldplug state. Ironically, dev->hotplugged > is true on both source and dest in this case, but it ends up being > the broken one. it looks like for hotplugged CPUs DRC state should be always migrated > But your point stands, the fact that both situations are possible > means we can't currently rely on dev->hotplugged without migrating > it, infering it based on QEMU lifecycle, or forcing management to > set it. > > But that raises a 2nd point. Our dilemma isn't that we can't > rely on dev->hotplugged being synchronized (though if it > was we could build something around that), our dilemma is > that we make the following assumption in our code: > > "Devices present at start-time will be handled the same way, > on source or dest, regardless of whether they were added via > cmdline or via device_add prior to machine start / migration > stream processing." > > And I think that's a sensible expectation, since in theory > even the source could build up a machine via device_add > prior to starting it, and the reasonable default there is > dev->hotplugged = false rather than the opposite. That > suggests a need to fix things outside of migration. Agreed to a degree, i.e. -device/device_add before machine has been started without migration should follow coldplug path it shouldn't cause problems for CPU/mem hotplug on x86 and maybe will work for PCI (it may change behavior of ACPI based hotplug and bridges), CCing Marcel to confirm. > So far, all QEMU's existing migration code has managed ok > with the dest being starting with dev->hotplugged == false > via cmdline devices, even though maybe they were hotplugged > on the source. To me, it makes more sense to maintain this > behavior by fixing up this relatively new use-case of > adding devices via device_add before start to match the > same expectations we have around cmdline-specified devices. > > This would fix migration for spapr, leave it working for > everyone else (since that's basically what we expect for > everything except newer-style cpu hotplug), and also make > the device-add-before-start be truly synonymous with > cmdline-created devices (which is applicable even outside > of migration). Neither -device or device_add can't really bring DRC/CPU into the state that devices might be at the moment when their state is transferred to target. i.e. any state that has been changed after -device/device_add on SRC, should be in migration stream. I'd say even if state would eventually go back default (coldplugged) when hotplug is completed. So trying to avoid transmitting runtime state to optimize some bytes on migration stream is just asking for trouble. [...] > > > > May be we should > > 1. make DeviceState:hotpluggable property write-able again > > 2. transmit DeviceState:hotpluggable as part of migration stream > > 3. add generic migration hook which will check if target and > > source value match, if value differs => fail/abort migration. > > 4. in case values mismatch mgmt will be forced to explicitly > > provide hotplugged property value on -device/device_add > > That would enforce consistent DeviceState:hotpluggable value > > on target and source. > > We can enforce it only for new machine types so it won't break > > old mgmt tools with old machine types but would force mgmt > > for new machines to use hotplugged property on target > > so QEMU could rely on its value for migration purposes. > > > > That would work, and generalizing this beyond spapr seems > appropriate. > > It also has reasonable semantics, and it would work for us > *provided that* we always send DRC state for hotplugged devices > and not just DRCs in a transitional state: > > SRC1: device_add $cpu > -> dev->hotplugged == true > -> device starts in pre-hotplug, ends up in post-hotplug state > after guest onlines it > <migrate> > DST1: device_add $cpu,hotplugged=true > -> dev->hotplugged == true > -> device starts in pre-hotplug state. guest sends updated state > to transition DRC to post-hotplug > > But what about stuff like mem/pci? Currently, migration works for > cases like: > > SRC1: device_add virtio-net-pci > DST1: qemu -device virtio-net-pci > > Even though DST1 has dev->hotplugged == false, and SRC1 has the > opposite. So for new machines, checking SRC1:dev->hotplugged == > DST1:dev->hotplugged would fail, even though the migration > scenario is unchanged from before. > > So management would now have to do: > > SRC1: device_add virtio-net-pci > DST1: qemu -device virtio-net-pci,hotplugged=true > > But the code behavior is a bit different then, since we now get > an ACPI hotplug event via the hotplug handler. Maybe the > migration stream fixes that up for us, but I think we would need > to audit this and similar cases to be sure. > > That's all fine if it's necessary, but I feel like this is > the hard way to address what's actually a much more specific > issue: that device_add before machine-start doesn't currently > match the behavior for a device started via cmdline. i.e. > dev->hotplugged in the former vs. !dev->hotplugged in the > latter. I don't really see a good reason these 2 cases should > be different, and we can bring them to parity by doing > something like: > > 1. For device_adds after qdev_machine_creation_done(), but > before machine start, set a flag: reset_before_start. > 2. At the start of processing migration stream, or unpausing > a -S guest (in the non-migration case), issue a system-wide > reset if reset_before_start is set. > 3. reset handlers will already unset dev->hotplugged at that > point and re-execute all the hotplug hooks with > dev->hotplugged == false. This should put everything in > a state that's identical to cmdline-created devices. instead of flag for non migration case we could use RUN_STATE_PRELAUNCH -> RUN_STATE_RUNNING transition to reset all devices or maybe do something like this: diff --git a/hw/core/qdev.c b/hw/core/qdev.c index 0ce45a2..cdeb8f8 100644 --- a/hw/core/qdev.c +++ b/hw/core/qdev.c @@ -1020,7 +1010,7 @@ static void device_initfn(Object *obj) ObjectClass *class; Property *prop; - if (qdev_hotplug) { + if (runstate_check(RUN_STATE_RUNNING) || ...) { dev->hotplugged = 1; qdev_hot_added = true; } > 4. Only allow management to do device_add before it sends > the migration stream (if management doesn't already guard > against this then it's probably a bug anyway) seems like Juan already took care of it. > This allows management to treat device_add/cmdline as being > completely synonymous for guests that haven't started yet, > both for -incoming and -S in general, and it maintains > the behavior that existing migration code expects of > cmdline-specified devices.
On Fri, Jun 16, 2017 at 03:53:12PM +0200, Igor Mammedov wrote: > On Wed, 14 Jun 2017 19:27:12 -0500 > Michael Roth <mdroth@linux.vnet.ibm.com> wrote: > > > Quoting Igor Mammedov (2017-06-14 04:00:01) > > > On Tue, 13 Jun 2017 16:42:45 -0500 > > > Michael Roth <mdroth@linux.vnet.ibm.com> wrote: > > > > > > > Quoting Igor Mammedov (2017-06-09 03:27:33) > [...] > > > > Currently I'd suggest to look into always migrate DRCs if cpu hotplug > > > is enabled even if dev->hotplugged is false (not nice but it might work). > > > Consider: > > > SRC1: hotplug CPU1 => CPU1.hotplugged = true > > > DST1: -device CPU1 => CPU1.hotplugged = false > > > so in current code relying on CPU1.hotplugged would not work as expected, > > > it works by accident because libvirt uses device_add on target > > > DST1: device_add CPU1 => CPU1.hotplugged = true > > > > It's actually the reverse for us, DST1: -device CPU1 works, because > > default DRC state for CPU1.hotplugged = false matches the state a > > hotplugged CPU will be brought to after onlining at the source, so > > we don't send it over the wire in the first place once it reaches > > that post-hotplug/coldplug state. So things work as expected, even > > though technically the source has dev->hotplugged == true, whereas > > the dest has dev->hotplugged == false. > in your case it seems fragile to rely on -device setting hotplugged cpu > on target the way you want. > > it could be: > > SRC: hotplug CPU and start migration with DST: -device cpu[,hotplugged=false] > *: machine is in hotplugging state and one would lose transitional state if DRC is not migrated. > > > It's the DST1: device_add case that wasn't accounted for when the DRC > > migration patches were written, as those don't default to coldplug, > > so, because the source doesn't send it, it ends up being presented > > in pre-hotplug state because the dest doesn't know that the guest > > already onlined the resource and transitioned it to > > post-hotplug/coldplug state. Ironically, dev->hotplugged > > is true on both source and dest in this case, but it ends up being > > the broken one. > it looks like for hotplugged CPUs DRC state should be always migrated Yeah, I think in the first instance we should just unconditionally transfer DRC state for newer machine types (at least once I clean up what exactly the DRC state _is_). For old machine types things are still likely to be broken, but it won't be a regression. If we can work out a way to fix things for older machine types, that's a bonus, but it looks like it's very difficult, maybe impossible. First priority should be sane semantics for the newer machine types. > > But your point stands, the fact that both situations are possible > > means we can't currently rely on dev->hotplugged without migrating > > it, infering it based on QEMU lifecycle, or forcing management to > > set it. > > > > But that raises a 2nd point. Our dilemma isn't that we can't > > rely on dev->hotplugged being synchronized (though if it > > was we could build something around that), our dilemma is > > that we make the following assumption in our code: > > > > "Devices present at start-time will be handled the same way, > > on source or dest, regardless of whether they were added via > > cmdline or via device_add prior to machine start / migration > > stream processing." > > > > And I think that's a sensible expectation, since in theory > > even the source could build up a machine via device_add > > prior to starting it, and the reasonable default there is > > dev->hotplugged = false rather than the opposite. That > > suggests a need to fix things outside of migration. > Agreed to a degree, i.e. > > -device/device_add before machine has been started > without migration should follow coldplug path > > it shouldn't cause problems for CPU/mem hotplug on x86 > and maybe will work for PCI (it may change behavior of > ACPI based hotplug and bridges), > CCing Marcel to confirm. So for ppc the main problem with early plugged devices is a duplication of device tree info between that delivered by CAS, and that delivered through configure-connector. I see two basic approaches to fixing this: 1) At CAS, reset all DRCs before building the device tree to send to the guest. That will essentially "convert" everything present at CAS time into coldplugged device. There might still be a problem if there are existing hotplug events in the queue from before CAS. 2) When building the device tree (at both CAS and reset) check the DRC state, and omit DT information for anything that's not in CONFIGURED state. That should be correct, because the guest should call configure-connector for anything not yet in CONFIGURED state, at which point it will get the DT information. I'm not sure which approach is best yet. I'm not 100% sure of (1) is correct in all cases, but it is a bit simpler than (2). > > So far, all QEMU's existing migration code has managed ok > > with the dest being starting with dev->hotplugged == false > > via cmdline devices, even though maybe they were hotplugged > > on the source. To me, it makes more sense to maintain this > > behavior by fixing up this relatively new use-case of > > adding devices via device_add before start to match the > > same expectations we have around cmdline-specified devices. > > > > This would fix migration for spapr, leave it working for > > everyone else (since that's basically what we expect for > > everything except newer-style cpu hotplug), and also make > > the device-add-before-start be truly synonymous with > > cmdline-created devices (which is applicable even outside > > of migration). > Neither -device or device_add can't really bring DRC/CPU into > the state that devices might be at the moment when their state > is transferred to target. > > i.e. any state that has been changed after -device/device_add > on SRC, should be in migration stream. I'd say even if state would > eventually go back default (coldplugged) when hotplug is completed. > So trying to avoid transmitting runtime state to optimize some bytes > on migration stream is just asking for trouble. Yeah, I'm coming to the same conclusion. Note that the reason for omitting state wasn't to save space in the migration stream, but to make the stream compatible with older versions which never transmit the state - at least in as many cases as possible. > [...] > > > > > > May be we should > > > 1. make DeviceState:hotpluggable property write-able again > > > 2. transmit DeviceState:hotpluggable as part of migration stream > > > 3. add generic migration hook which will check if target and > > > source value match, if value differs => fail/abort migration. > > > 4. in case values mismatch mgmt will be forced to explicitly > > > provide hotplugged property value on -device/device_add > > > That would enforce consistent DeviceState:hotpluggable value > > > on target and source. > > > We can enforce it only for new machine types so it won't break > > > old mgmt tools with old machine types but would force mgmt > > > for new machines to use hotplugged property on target > > > so QEMU could rely on its value for migration purposes. > > > > > > > That would work, and generalizing this beyond spapr seems > > appropriate. > > > > It also has reasonable semantics, and it would work for us > > *provided that* we always send DRC state for hotplugged devices > > and not just DRCs in a transitional state: > > > > SRC1: device_add $cpu > > -> dev->hotplugged == true > > -> device starts in pre-hotplug, ends up in post-hotplug state > > after guest onlines it > > <migrate> > > DST1: device_add $cpu,hotplugged=true > > -> dev->hotplugged == true > > -> device starts in pre-hotplug state. guest sends updated state > > to transition DRC to post-hotplug > > > > But what about stuff like mem/pci? Currently, migration works for > > cases like: > > > > SRC1: device_add virtio-net-pci > > DST1: qemu -device virtio-net-pci > > > > Even though DST1 has dev->hotplugged == false, and SRC1 has the > > opposite. So for new machines, checking SRC1:dev->hotplugged == > > DST1:dev->hotplugged would fail, even though the migration > > scenario is unchanged from before. > > > > So management would now have to do: > > > > SRC1: device_add virtio-net-pci > > DST1: qemu -device virtio-net-pci,hotplugged=true > > > > But the code behavior is a bit different then, since we now get > > an ACPI hotplug event via the hotplug handler. Maybe the > > migration stream fixes that up for us, but I think we would need > > to audit this and similar cases to be sure. > > > > That's all fine if it's necessary, but I feel like this is > > the hard way to address what's actually a much more specific > > issue: that device_add before machine-start doesn't currently > > match the behavior for a device started via cmdline. i.e. > > dev->hotplugged in the former vs. !dev->hotplugged in the > > latter. I don't really see a good reason these 2 cases should > > be different, and we can bring them to parity by doing > > something like: > > > > 1. For device_adds after qdev_machine_creation_done(), but > > before machine start, set a flag: reset_before_start. > > 2. At the start of processing migration stream, or unpausing > > a -S guest (in the non-migration case), issue a system-wide > > reset if reset_before_start is set. > > 3. reset handlers will already unset dev->hotplugged at that > > point and re-execute all the hotplug hooks with > > dev->hotplugged == false. This should put everything in > > a state that's identical to cmdline-created devices. > instead of flag for non migration case we could use > RUN_STATE_PRELAUNCH -> RUN_STATE_RUNNING > transition to reset all devices or > maybe do something like this: Hrm, does the general reset call happen now before or after this transition? Resetting DRCs at CAS time should accomplish the same thing. > diff --git a/hw/core/qdev.c b/hw/core/qdev.c > index 0ce45a2..cdeb8f8 100644 > --- a/hw/core/qdev.c > +++ b/hw/core/qdev.c > > @@ -1020,7 +1010,7 @@ static void device_initfn(Object *obj) > ObjectClass *class; > Property *prop; > > - if (qdev_hotplug) { > + if (runstate_check(RUN_STATE_RUNNING) || ...) { > dev->hotplugged = 1; > qdev_hot_added = true; > } > > > 4. Only allow management to do device_add before it sends > > the migration stream (if management doesn't already guard > > against this then it's probably a bug anyway) > seems like Juan already took care of it. > > > This allows management to treat device_add/cmdline as being > > completely synonymous for guests that haven't started yet, > > both for -incoming and -S in general, and it maintains > > the behavior that existing migration code expects of > > cmdline-specified devices. > >
On Fri, 16 Jun 2017 22:40:53 +0800 David Gibson <david@gibson.dropbear.id.au> wrote: > On Fri, Jun 16, 2017 at 03:53:12PM +0200, Igor Mammedov wrote: > > On Wed, 14 Jun 2017 19:27:12 -0500 > > Michael Roth <mdroth@linux.vnet.ibm.com> wrote: > > > > > Quoting Igor Mammedov (2017-06-14 04:00:01) > > > > On Tue, 13 Jun 2017 16:42:45 -0500 > > > > Michael Roth <mdroth@linux.vnet.ibm.com> wrote: > > > > > > > > > Quoting Igor Mammedov (2017-06-09 03:27:33) [...] > > > > > > > > May be we should > > > > 1. make DeviceState:hotpluggable property write-able again > > > > 2. transmit DeviceState:hotpluggable as part of migration stream > > > > 3. add generic migration hook which will check if target and > > > > source value match, if value differs => fail/abort migration. > > > > 4. in case values mismatch mgmt will be forced to explicitly > > > > provide hotplugged property value on -device/device_add > > > > That would enforce consistent DeviceState:hotpluggable value > > > > on target and source. > > > > We can enforce it only for new machine types so it won't break > > > > old mgmt tools with old machine types but would force mgmt > > > > for new machines to use hotplugged property on target > > > > so QEMU could rely on its value for migration purposes. > > > > > > > > > > That would work, and generalizing this beyond spapr seems > > > appropriate. > > > > > > It also has reasonable semantics, and it would work for us > > > *provided that* we always send DRC state for hotplugged devices > > > and not just DRCs in a transitional state: > > > > > > SRC1: device_add $cpu > > > -> dev->hotplugged == true > > > -> device starts in pre-hotplug, ends up in post-hotplug state > > > after guest onlines it > > > <migrate> > > > DST1: device_add $cpu,hotplugged=true > > > -> dev->hotplugged == true > > > -> device starts in pre-hotplug state. guest sends updated state > > > to transition DRC to post-hotplug > > > > > > But what about stuff like mem/pci? Currently, migration works for > > > cases like: > > > > > > SRC1: device_add virtio-net-pci > > > DST1: qemu -device virtio-net-pci > > > > > > Even though DST1 has dev->hotplugged == false, and SRC1 has the > > > opposite. So for new machines, checking SRC1:dev->hotplugged == > > > DST1:dev->hotplugged would fail, even though the migration > > > scenario is unchanged from before. > > > > > > So management would now have to do: > > > > > > SRC1: device_add virtio-net-pci > > > DST1: qemu -device virtio-net-pci,hotplugged=true > > > > > > But the code behavior is a bit different then, since we now get > > > an ACPI hotplug event via the hotplug handler. Maybe the > > > migration stream fixes that up for us, but I think we would need > > > to audit this and similar cases to be sure. > > > > > > That's all fine if it's necessary, but I feel like this is > > > the hard way to address what's actually a much more specific > > > issue: that device_add before machine-start doesn't currently > > > match the behavior for a device started via cmdline. i.e. > > > dev->hotplugged in the former vs. !dev->hotplugged in the > > > latter. I don't really see a good reason these 2 cases should > > > be different, and we can bring them to parity by doing > > > something like: > > > > > > 1. For device_adds after qdev_machine_creation_done(), but > > > before machine start, set a flag: reset_before_start. > > > 2. At the start of processing migration stream, or unpausing > > > a -S guest (in the non-migration case), issue a system-wide > > > reset if reset_before_start is set. > > > 3. reset handlers will already unset dev->hotplugged at that > > > point and re-execute all the hotplug hooks with > > > dev->hotplugged == false. This should put everything in > > > a state that's identical to cmdline-created devices. > > instead of flag for non migration case we could use > > RUN_STATE_PRELAUNCH -> RUN_STATE_RUNNING > > transition to reset all devices or > > maybe do something like this: > > Hrm, does the general reset call happen now before or after this > transition? Resetting DRCs at CAS time should accomplish the same thing. currently it's before, see vl.c qdev_machine_creation_done(); qemu_register_reset(qbus_reset_all_fn, sysbus_get_default()); qemu_run_machine_init_done_notifiers(); ... qemu_system_reset(SHUTDOWN_CAUSE_NONE); register_global_state(); if (replay_mode != REPLAY_MODE_NONE) { replay_vmstate_init(); } else if (loadvm) { Error *local_err = NULL; if (load_snapshot(loadvm, &local_err) < 0) { error_report_err(local_err); autostart = 0; } } ... } else if (autostart) { vm_start(); } ... main_loop(); <-- currently device_add works here simplified version: without migration and with -S option 'autostart = 0' so we get monitor/qmp prompt with RUN_STATE_PRELAUNCH and when command 'cont' is issued, RUN_STATE_PRELAUNCH -> RUN_STATE_RUNNING > > > diff --git a/hw/core/qdev.c b/hw/core/qdev.c > > index 0ce45a2..cdeb8f8 100644 > > --- a/hw/core/qdev.c > > +++ b/hw/core/qdev.c > > > > @@ -1020,7 +1010,7 @@ static void device_initfn(Object *obj) > > ObjectClass *class; > > Property *prop; > > > > - if (qdev_hotplug) { > > + if (runstate_check(RUN_STATE_RUNNING) || ...) { > > dev->hotplugged = 1; > > qdev_hot_added = true; > > } > > > > > 4. Only allow management to do device_add before it sends > > > the migration stream (if management doesn't already guard > > > against this then it's probably a bug anyway) > > seems like Juan already took care of it. > > > > > This allows management to treat device_add/cmdline as being > > > completely synonymous for guests that haven't started yet, > > > both for -incoming and -S in general, and it maintains > > > the behavior that existing migration code expects of > > > cmdline-specified devices. > > > > >
Quoting David Gibson (2017-06-16 09:40:53) > On Fri, Jun 16, 2017 at 03:53:12PM +0200, Igor Mammedov wrote: > > On Wed, 14 Jun 2017 19:27:12 -0500 > > Michael Roth <mdroth@linux.vnet.ibm.com> wrote: > > > > > Quoting Igor Mammedov (2017-06-14 04:00:01) > > > > On Tue, 13 Jun 2017 16:42:45 -0500 > > > > Michael Roth <mdroth@linux.vnet.ibm.com> wrote: > > > > > > > > > Quoting Igor Mammedov (2017-06-09 03:27:33) > > [...] > > > > > > Currently I'd suggest to look into always migrate DRCs if cpu hotplug > > > > is enabled even if dev->hotplugged is false (not nice but it might work). > > > > Consider: > > > > SRC1: hotplug CPU1 => CPU1.hotplugged = true > > > > DST1: -device CPU1 => CPU1.hotplugged = false > > > > so in current code relying on CPU1.hotplugged would not work as expected, > > > > it works by accident because libvirt uses device_add on target > > > > DST1: device_add CPU1 => CPU1.hotplugged = true > > > > > > It's actually the reverse for us, DST1: -device CPU1 works, because > > > default DRC state for CPU1.hotplugged = false matches the state a > > > hotplugged CPU will be brought to after onlining at the source, so > > > we don't send it over the wire in the first place once it reaches > > > that post-hotplug/coldplug state. So things work as expected, even > > > though technically the source has dev->hotplugged == true, whereas > > > the dest has dev->hotplugged == false. > > in your case it seems fragile to rely on -device setting hotplugged cpu > > on target the way you want. > > > > it could be: > > > > SRC: hotplug CPU and start migration with DST: -device cpu[,hotplugged=false] > > *: machine is in hotplugging state and one would lose transitional state if DRC is not migrated. > > > > > It's the DST1: device_add case that wasn't accounted for when the DRC > > > migration patches were written, as those don't default to coldplug, > > > so, because the source doesn't send it, it ends up being presented > > > in pre-hotplug state because the dest doesn't know that the guest > > > already onlined the resource and transitioned it to > > > post-hotplug/coldplug state. Ironically, dev->hotplugged > > > is true on both source and dest in this case, but it ends up being > > > the broken one. > > it looks like for hotplugged CPUs DRC state should be always migrated > > Yeah, I think in the first instance we should just unconditionally > transfer DRC state for newer machine types (at least once I clean up > what exactly the DRC state _is_). For old machine types things are > still likely to be broken, but it won't be a regression. > > If we can work out a way to fix things for older machine types, that's > a bonus, but it looks like it's very difficult, maybe impossible. > First priority should be sane semantics for the newer machine types. > > > > But your point stands, the fact that both situations are possible > > > means we can't currently rely on dev->hotplugged without migrating > > > it, infering it based on QEMU lifecycle, or forcing management to > > > set it. > > > > > > But that raises a 2nd point. Our dilemma isn't that we can't > > > rely on dev->hotplugged being synchronized (though if it > > > was we could build something around that), our dilemma is > > > that we make the following assumption in our code: > > > > > > "Devices present at start-time will be handled the same way, > > > on source or dest, regardless of whether they were added via > > > cmdline or via device_add prior to machine start / migration > > > stream processing." > > > > > > And I think that's a sensible expectation, since in theory > > > even the source could build up a machine via device_add > > > prior to starting it, and the reasonable default there is > > > dev->hotplugged = false rather than the opposite. That > > > suggests a need to fix things outside of migration. > > Agreed to a degree, i.e. > > > > -device/device_add before machine has been started > > without migration should follow coldplug path > > > > it shouldn't cause problems for CPU/mem hotplug on x86 > > and maybe will work for PCI (it may change behavior of > > ACPI based hotplug and bridges), > > CCing Marcel to confirm. > > So for ppc the main problem with early plugged devices is a > duplication of device tree info between that delivered by CAS, and > that delivered through configure-connector. I see two basic > approaches to fixing this: > > 1) At CAS, reset all DRCs before building the device tree to send to > the guest. That will essentially "convert" everything present at CAS > time into coldplugged device. There might still be a problem if there > are existing hotplug events in the queue from before CAS. > > 2) When building the device tree (at both CAS and reset) check the DRC > state, and omit DT information for anything that's not in CONFIGURED > state. That should be correct, because the guest should call > configure-connector for anything not yet in CONFIGURED state, at which > point it will get the DT information. > > I'm not sure which approach is best yet. I'm not 100% sure of (1) is > correct in all cases, but it is a bit simpler than (2). This fixes up devices added before CAS (whether via -device/device_add) on the source, but at the time of migration we will have (usually) already completed CAS. We also can't repeat the CAS-time fixups on the target, because that would reset any DRC state sent via migration (which would include things like transitional states outside of post-hotplug/coldplug). So either we sent all DRC state (regardless of hotplug or not), or we need a reliable starting state on the dest from which we can determine what needs to be sent by the source. The 2 options are: a) ensure all devices on the dest start off in a coldplug state, at point we can simply send DRC for anything hotplugged. This isn't the case now because device_add on the target results in the initial state being hotplug, where -device's are in coldplug. I'm proposing we ensure both these cases default coldplug state on the target if device_add occurs before the dest is started. b) ensure dev->hotplugged is in sync between src/dest by migrating it or forcing management to set it. Based on that knowledge, we can determine on the src side what the default state will be on the dest, and from that knowledge determine what state needs to be sent. Igor has suggested an approach for handling it this way. Personally I'm leaning toward a), since assuming that dest devices will start in a coldplug state is basically what most of the migration code already does (since up until fairly recently, the hotplugged devices were generally specified via cmdline on the dest and not via device_add) > > > > So far, all QEMU's existing migration code has managed ok > > > with the dest being starting with dev->hotplugged == false > > > via cmdline devices, even though maybe they were hotplugged > > > on the source. To me, it makes more sense to maintain this > > > behavior by fixing up this relatively new use-case of > > > adding devices via device_add before start to match the > > > same expectations we have around cmdline-specified devices. > > > > > > This would fix migration for spapr, leave it working for > > > everyone else (since that's basically what we expect for > > > everything except newer-style cpu hotplug), and also make > > > the device-add-before-start be truly synonymous with > > > cmdline-created devices (which is applicable even outside > > > of migration). > > Neither -device or device_add can't really bring DRC/CPU into > > the state that devices might be at the moment when their state > > is transferred to target. > > > > i.e. any state that has been changed after -device/device_add > > on SRC, should be in migration stream. I'd say even if state would > > eventually go back default (coldplugged) when hotplug is completed. > > So trying to avoid transmitting runtime state to optimize some bytes > > on migration stream is just asking for trouble. > > Yeah, I'm coming to the same conclusion. Note that the reason for > omitting state wasn't to save space in the migration stream, but to > make the stream compatible with older versions which never transmit > the state - at least in as many cases as possible. > > > [...] > > > > > > > > May be we should > > > > 1. make DeviceState:hotpluggable property write-able again > > > > 2. transmit DeviceState:hotpluggable as part of migration stream > > > > 3. add generic migration hook which will check if target and > > > > source value match, if value differs => fail/abort migration. > > > > 4. in case values mismatch mgmt will be forced to explicitly > > > > provide hotplugged property value on -device/device_add > > > > That would enforce consistent DeviceState:hotpluggable value > > > > on target and source. > > > > We can enforce it only for new machine types so it won't break > > > > old mgmt tools with old machine types but would force mgmt > > > > for new machines to use hotplugged property on target > > > > so QEMU could rely on its value for migration purposes. > > > > > > > > > > That would work, and generalizing this beyond spapr seems > > > appropriate. > > > > > > It also has reasonable semantics, and it would work for us > > > *provided that* we always send DRC state for hotplugged devices > > > and not just DRCs in a transitional state: > > > > > > SRC1: device_add $cpu > > > -> dev->hotplugged == true > > > -> device starts in pre-hotplug, ends up in post-hotplug state > > > after guest onlines it > > > <migrate> > > > DST1: device_add $cpu,hotplugged=true > > > -> dev->hotplugged == true > > > -> device starts in pre-hotplug state. guest sends updated state > > > to transition DRC to post-hotplug > > > > > > But what about stuff like mem/pci? Currently, migration works for > > > cases like: > > > > > > SRC1: device_add virtio-net-pci > > > DST1: qemu -device virtio-net-pci > > > > > > Even though DST1 has dev->hotplugged == false, and SRC1 has the > > > opposite. So for new machines, checking SRC1:dev->hotplugged == > > > DST1:dev->hotplugged would fail, even though the migration > > > scenario is unchanged from before. > > > > > > So management would now have to do: > > > > > > SRC1: device_add virtio-net-pci > > > DST1: qemu -device virtio-net-pci,hotplugged=true > > > > > > But the code behavior is a bit different then, since we now get > > > an ACPI hotplug event via the hotplug handler. Maybe the > > > migration stream fixes that up for us, but I think we would need > > > to audit this and similar cases to be sure. > > > > > > That's all fine if it's necessary, but I feel like this is > > > the hard way to address what's actually a much more specific > > > issue: that device_add before machine-start doesn't currently > > > match the behavior for a device started via cmdline. i.e. > > > dev->hotplugged in the former vs. !dev->hotplugged in the > > > latter. I don't really see a good reason these 2 cases should > > > be different, and we can bring them to parity by doing > > > something like: > > > > > > 1. For device_adds after qdev_machine_creation_done(), but > > > before machine start, set a flag: reset_before_start. > > > 2. At the start of processing migration stream, or unpausing > > > a -S guest (in the non-migration case), issue a system-wide > > > reset if reset_before_start is set. > > > 3. reset handlers will already unset dev->hotplugged at that > > > point and re-execute all the hotplug hooks with > > > dev->hotplugged == false. This should put everything in > > > a state that's identical to cmdline-created devices. > > instead of flag for non migration case we could use > > RUN_STATE_PRELAUNCH -> RUN_STATE_RUNNING > > transition to reset all devices or > > maybe do something like this: > > Hrm, does the general reset call happen now before or after this > transition? Resetting DRCs at CAS time should accomplish the same thing. > > > diff --git a/hw/core/qdev.c b/hw/core/qdev.c > > index 0ce45a2..cdeb8f8 100644 > > --- a/hw/core/qdev.c > > +++ b/hw/core/qdev.c > > > > @@ -1020,7 +1010,7 @@ static void device_initfn(Object *obj) > > ObjectClass *class; > > Property *prop; > > > > - if (qdev_hotplug) { > > + if (runstate_check(RUN_STATE_RUNNING) || ...) { > > dev->hotplugged = 1; > > qdev_hot_added = true; > > } > > > > > 4. Only allow management to do device_add before it sends > > > the migration stream (if management doesn't already guard > > > against this then it's probably a bug anyway) > > seems like Juan already took care of it. > > > > > This allows management to treat device_add/cmdline as being > > > completely synonymous for guests that haven't started yet, > > > both for -incoming and -S in general, and it maintains > > > the behavior that existing migration code expects of > > > cmdline-specified devices. > > > > > > -- > David Gibson | I'll have my music baroque, and my code > david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ > | _way_ _around_! > http://www.ozlabs.org/~dgibson
Quoting Igor Mammedov (2017-06-16 08:53:12) > On Wed, 14 Jun 2017 19:27:12 -0500 > Michael Roth <mdroth@linux.vnet.ibm.com> wrote: > > > Quoting Igor Mammedov (2017-06-14 04:00:01) > > > On Tue, 13 Jun 2017 16:42:45 -0500 > > > Michael Roth <mdroth@linux.vnet.ibm.com> wrote: > > > > > > > Quoting Igor Mammedov (2017-06-09 03:27:33) > [...] > > > > Currently I'd suggest to look into always migrate DRCs if cpu hotplug > > > is enabled even if dev->hotplugged is false (not nice but it might work). > > > Consider: > > > SRC1: hotplug CPU1 => CPU1.hotplugged = true > > > DST1: -device CPU1 => CPU1.hotplugged = false > > > so in current code relying on CPU1.hotplugged would not work as expected, > > > it works by accident because libvirt uses device_add on target > > > DST1: device_add CPU1 => CPU1.hotplugged = true > > > > It's actually the reverse for us, DST1: -device CPU1 works, because > > default DRC state for CPU1.hotplugged = false matches the state a > > hotplugged CPU will be brought to after onlining at the source, so > > we don't send it over the wire in the first place once it reaches > > that post-hotplug/coldplug state. So things work as expected, even > > though technically the source has dev->hotplugged == true, whereas > > the dest has dev->hotplugged == false. > in your case it seems fragile to rely on -device setting hotplugged cpu > on target the way you want. Well, it's fragile in the sense that we try to assume post-hotplug == coldplug as a way to avoid migrating hotplugged devices in some cases. This was mostly done to allow for backward-compatibility without relying on machine-specific behavior, but I have no problem with just always migrating DRC state for hotplugged devices. But that still leaves the following issue: on the source, we want to assume the default DRC state is coldplugged state. Based on that, we just migrate the hotplugged DRCs and everything would be fine, since we assume the target will initially put the DRC state in the coldplug state for all DRCs, same way it would for -device cpu/etc. But device_add on the target doesn't behave this way, it defaults to hotplugged state. It doesn't matter if the device is still in a hotplug state on the source, even if we reset the machine on the source side before migrating, device_add would still default to dev->hotplugged, and -device would still default to !dev->hotplugged. So we can't just send hotplugged DRCs, nor can we just send coldplugged DRCs, because there's no consistency in what the initial starting state will be on the dest. If they use -device to specify it it does one thing, if they use device_add it does another. So we're still stuck sending *all* DRCs. So addressing that inconsistency is the key issue I think. > > it could be: > > SRC: hotplug CPU and start migration with DST: -device cpu[,hotplugged=false] > *: machine is in hotplugging state and one would lose transitional state if DRC is not migrated. In this case we'd actually check to see that the DRC is in a transitional state via .needed and migrate it. It's only when we've concluded that the state is identical to coldplug state that we don't send it. But as mentioned above, this is more of an optimization, even if we just always send DRC's for hotplugged devices, things are still broken, and we need a reliable way to infer the starting state of devices on the dest to fix them. > > > It's the DST1: device_add case that wasn't accounted for when the DRC > > migration patches were written, as those don't default to coldplug, > > so, because the source doesn't send it, it ends up being presented > > in pre-hotplug state because the dest doesn't know that the guest > > already onlined the resource and transitioned it to > > post-hotplug/coldplug state. Ironically, dev->hotplugged > > is true on both source and dest in this case, but it ends up being > > the broken one. > it looks like for hotplugged CPUs DRC state should be always migrated That would fix this case, but even if we adopt that approach we still have this scenario that's broken: SRC: hotplug CPU, reboot, CPU is now in a coldplugged state, so we don't migrate it DST: device_add CPU, currently this defaults to dev->hotplugged == true, so this incorrectly gets exposed to the guest as being in a transitional state. This all stems from pre-start device_add semantics not aligning with cmdline-specified ones. > > > > But your point stands, the fact that both situations are possible > > means we can't currently rely on dev->hotplugged without migrating > > it, infering it based on QEMU lifecycle, or forcing management to > > set it. > > > > But that raises a 2nd point. Our dilemma isn't that we can't > > rely on dev->hotplugged being synchronized (though if it > > was we could build something around that), our dilemma is > > that we make the following assumption in our code: > > > > "Devices present at start-time will be handled the same way, > > on source or dest, regardless of whether they were added via > > cmdline or via device_add prior to machine start / migration > > stream processing." > > > > And I think that's a sensible expectation, since in theory > > even the source could build up a machine via device_add > > prior to starting it, and the reasonable default there is > > dev->hotplugged = false rather than the opposite. That > > suggests a need to fix things outside of migration. > Agreed to a degree, i.e. > > -device/device_add before machine has been started > without migration should follow coldplug path I really feel like adopting this for the migration case as well is the cleanest solution. AFAICT none of the migration code really cares about synchronizing dev->hotplugged, something as simple as: SRC: device_add virtio-net-pci DST: -device virtio-net-pci results in a mismatch between dev->hotplugged. But the migration code doesn't really care, it's all based around knowledge of what the initial starting state of the device on DST will be. Now that we cases where device_add might be used instead of -device on DST, we no longer have that knowledge on the source. Having device_add-before-start match existing cmdline behavior seems like the most direct way to address this, and it seems sensible since they ostensibly serve the exact same purpose. > > it shouldn't cause problems for CPU/mem hotplug on x86 > and maybe will work for PCI (it may change behavior of > ACPI based hotplug and bridges), > CCing Marcel to confirm. > > > > So far, all QEMU's existing migration code has managed ok > > with the dest being starting with dev->hotplugged == false > > via cmdline devices, even though maybe they were hotplugged > > on the source. To me, it makes more sense to maintain this > > behavior by fixing up this relatively new use-case of > > adding devices via device_add before start to match the > > same expectations we have around cmdline-specified devices. > > > > This would fix migration for spapr, leave it working for > > everyone else (since that's basically what we expect for > > everything except newer-style cpu hotplug), and also make > > the device-add-before-start be truly synonymous with > > cmdline-created devices (which is applicable even outside > > of migration). > Neither -device or device_add can't really bring DRC/CPU into > the state that devices might be at the moment when their state > is transferred to target. > > i.e. any state that has been changed after -device/device_add > on SRC, should be in migration stream. I'd say even if state would > eventually go back default (coldplugged) when hotplug is completed. > So trying to avoid transmitting runtime state to optimize some bytes > on migration stream is just asking for trouble. > > [...] > > > > > > May be we should > > > 1. make DeviceState:hotpluggable property write-able again > > > 2. transmit DeviceState:hotpluggable as part of migration stream > > > 3. add generic migration hook which will check if target and > > > source value match, if value differs => fail/abort migration. > > > 4. in case values mismatch mgmt will be forced to explicitly > > > provide hotplugged property value on -device/device_add > > > That would enforce consistent DeviceState:hotpluggable value > > > on target and source. > > > We can enforce it only for new machine types so it won't break > > > old mgmt tools with old machine types but would force mgmt > > > for new machines to use hotplugged property on target > > > so QEMU could rely on its value for migration purposes. > > > > > > > That would work, and generalizing this beyond spapr seems > > appropriate. > > > > It also has reasonable semantics, and it would work for us > > *provided that* we always send DRC state for hotplugged devices > > and not just DRCs in a transitional state: > > > > SRC1: device_add $cpu > > -> dev->hotplugged == true > > -> device starts in pre-hotplug, ends up in post-hotplug state > > after guest onlines it > > <migrate> > > DST1: device_add $cpu,hotplugged=true > > -> dev->hotplugged == true > > -> device starts in pre-hotplug state. guest sends updated state > > to transition DRC to post-hotplug > > > > But what about stuff like mem/pci? Currently, migration works for > > cases like: > > > > SRC1: device_add virtio-net-pci > > DST1: qemu -device virtio-net-pci > > > > Even though DST1 has dev->hotplugged == false, and SRC1 has the > > opposite. So for new machines, checking SRC1:dev->hotplugged == > > DST1:dev->hotplugged would fail, even though the migration > > scenario is unchanged from before. > > > > So management would now have to do: > > > > SRC1: device_add virtio-net-pci > > DST1: qemu -device virtio-net-pci,hotplugged=true > > > > But the code behavior is a bit different then, since we now get > > an ACPI hotplug event via the hotplug handler. Maybe the > > migration stream fixes that up for us, but I think we would need > > to audit this and similar cases to be sure. > > > > That's all fine if it's necessary, but I feel like this is > > the hard way to address what's actually a much more specific > > issue: that device_add before machine-start doesn't currently > > match the behavior for a device started via cmdline. i.e. > > dev->hotplugged in the former vs. !dev->hotplugged in the > > latter. I don't really see a good reason these 2 cases should > > be different, and we can bring them to parity by doing > > something like: > > > > 1. For device_adds after qdev_machine_creation_done(), but > > before machine start, set a flag: reset_before_start. > > 2. At the start of processing migration stream, or unpausing > > a -S guest (in the non-migration case), issue a system-wide > > reset if reset_before_start is set. > > 3. reset handlers will already unset dev->hotplugged at that > > point and re-execute all the hotplug hooks with > > dev->hotplugged == false. This should put everything in > > a state that's identical to cmdline-created devices. > instead of flag for non migration case we could use > RUN_STATE_PRELAUNCH -> RUN_STATE_RUNNING > transition to reset all devices or > maybe do something like this: > > diff --git a/hw/core/qdev.c b/hw/core/qdev.c > index 0ce45a2..cdeb8f8 100644 > --- a/hw/core/qdev.c > +++ b/hw/core/qdev.c > > @@ -1020,7 +1010,7 @@ static void device_initfn(Object *obj) > ObjectClass *class; > Property *prop; > > - if (qdev_hotplug) { > + if (runstate_check(RUN_STATE_RUNNING) || ...) { > dev->hotplugged = 1; > qdev_hot_added = true; > } Yah that sort of approach (this would coincedentally address the migration case as well, not sure if you are on board with that approach or not) is what I initially had in mind. The global flag seemed safer because we get a machine-wide reset will all devices already plugged just like for cmdline, but I'm not sure if that is really needed or not. This is definitely more elegant though. > > > 4. Only allow management to do device_add before it sends > > the migration stream (if management doesn't already guard > > against this then it's probably a bug anyway) > seems like Juan already took care of it. Ok, good to know. > > > This allows management to treat device_add/cmdline as being > > completely synonymous for guests that haven't started yet, > > both for -incoming and -S in general, and it maintains > > the behavior that existing migration code expects of > > cmdline-specified devices. > >
On Fri, Jun 16, 2017 at 11:15:46AM -0500, Michael Roth wrote: > Quoting David Gibson (2017-06-16 09:40:53) > > On Fri, Jun 16, 2017 at 03:53:12PM +0200, Igor Mammedov wrote: > > > On Wed, 14 Jun 2017 19:27:12 -0500 > > > Michael Roth <mdroth@linux.vnet.ibm.com> wrote: > > > > > > > Quoting Igor Mammedov (2017-06-14 04:00:01) > > > > > On Tue, 13 Jun 2017 16:42:45 -0500 > > > > > Michael Roth <mdroth@linux.vnet.ibm.com> wrote: > > > > > > > > > > > Quoting Igor Mammedov (2017-06-09 03:27:33) > > > [...] > > > > > > > > Currently I'd suggest to look into always migrate DRCs if cpu hotplug > > > > > is enabled even if dev->hotplugged is false (not nice but it might work). > > > > > Consider: > > > > > SRC1: hotplug CPU1 => CPU1.hotplugged = true > > > > > DST1: -device CPU1 => CPU1.hotplugged = false > > > > > so in current code relying on CPU1.hotplugged would not work as expected, > > > > > it works by accident because libvirt uses device_add on target > > > > > DST1: device_add CPU1 => CPU1.hotplugged = true > > > > > > > > It's actually the reverse for us, DST1: -device CPU1 works, because > > > > default DRC state for CPU1.hotplugged = false matches the state a > > > > hotplugged CPU will be brought to after onlining at the source, so > > > > we don't send it over the wire in the first place once it reaches > > > > that post-hotplug/coldplug state. So things work as expected, even > > > > though technically the source has dev->hotplugged == true, whereas > > > > the dest has dev->hotplugged == false. > > > in your case it seems fragile to rely on -device setting hotplugged cpu > > > on target the way you want. > > > > > > it could be: > > > > > > SRC: hotplug CPU and start migration with DST: -device cpu[,hotplugged=false] > > > *: machine is in hotplugging state and one would lose transitional state if DRC is not migrated. > > > > > > > It's the DST1: device_add case that wasn't accounted for when the DRC > > > > migration patches were written, as those don't default to coldplug, > > > > so, because the source doesn't send it, it ends up being presented > > > > in pre-hotplug state because the dest doesn't know that the guest > > > > already onlined the resource and transitioned it to > > > > post-hotplug/coldplug state. Ironically, dev->hotplugged > > > > is true on both source and dest in this case, but it ends up being > > > > the broken one. > > > it looks like for hotplugged CPUs DRC state should be always migrated > > > > Yeah, I think in the first instance we should just unconditionally > > transfer DRC state for newer machine types (at least once I clean up > > what exactly the DRC state _is_). For old machine types things are > > still likely to be broken, but it won't be a regression. > > > > If we can work out a way to fix things for older machine types, that's > > a bonus, but it looks like it's very difficult, maybe impossible. > > First priority should be sane semantics for the newer machine types. > > > > > > But your point stands, the fact that both situations are possible > > > > means we can't currently rely on dev->hotplugged without migrating > > > > it, infering it based on QEMU lifecycle, or forcing management to > > > > set it. > > > > > > > > But that raises a 2nd point. Our dilemma isn't that we can't > > > > rely on dev->hotplugged being synchronized (though if it > > > > was we could build something around that), our dilemma is > > > > that we make the following assumption in our code: > > > > > > > > "Devices present at start-time will be handled the same way, > > > > on source or dest, regardless of whether they were added via > > > > cmdline or via device_add prior to machine start / migration > > > > stream processing." > > > > > > > > And I think that's a sensible expectation, since in theory > > > > even the source could build up a machine via device_add > > > > prior to starting it, and the reasonable default there is > > > > dev->hotplugged = false rather than the opposite. That > > > > suggests a need to fix things outside of migration. > > > Agreed to a degree, i.e. > > > > > > -device/device_add before machine has been started > > > without migration should follow coldplug path > > > > > > it shouldn't cause problems for CPU/mem hotplug on x86 > > > and maybe will work for PCI (it may change behavior of > > > ACPI based hotplug and bridges), > > > CCing Marcel to confirm. > > > > So for ppc the main problem with early plugged devices is a > > duplication of device tree info between that delivered by CAS, and > > that delivered through configure-connector. I see two basic > > approaches to fixing this: > > > > 1) At CAS, reset all DRCs before building the device tree to send to > > the guest. That will essentially "convert" everything present at CAS > > time into coldplugged device. There might still be a problem if there > > are existing hotplug events in the queue from before CAS. > > > > 2) When building the device tree (at both CAS and reset) check the DRC > > state, and omit DT information for anything that's not in CONFIGURED > > state. That should be correct, because the guest should call > > configure-connector for anything not yet in CONFIGURED state, at which > > point it will get the DT information. > > > > I'm not sure which approach is best yet. I'm not 100% sure of (1) is > > correct in all cases, but it is a bit simpler than (2). > > This fixes up devices added before CAS (whether via -device/device_add) > on the source, but at the time of migration we will have (usually) > already completed CAS. We also can't repeat the CAS-time fixups on the > target, because that would reset any DRC state sent via migration (which > would include things like transitional states outside of > post-hotplug/coldplug). Sorry, I wasn't clear. The options above are for the problems with hotplug during early boot, they won't help migration cases. > So either we sent all DRC state (regardless of hotplug or not), or > we need a reliable starting state on the dest from which we can > determine what needs to be sent by the source. The 2 options are: So for new machine types, I think we should always send DRC state. > a) ensure all devices on the dest start off in a coldplug state, at > point we can simply send DRC for anything hotplugged. This isn't > the case now because device_add on the target results in the initial > state being hotplug, where -device's are in coldplug. I'm proposing > we ensure both these cases default coldplug state on the target if > device_add occurs before the dest is started. I think that will become the case with my Part IV DRC cleanups. IIRC, there is a system reset performed before processing the incoming migration. With my DRC changes we reset the DRC state based only on the device presence at reset time. Since the device will be hotplugged before the incoming migration, it should be reset to coldplug equivalent state. > b) ensure dev->hotplugged is in sync between src/dest by migrating it > or forcing management to set it. Based on that knowledge, we can > determine on the src side what the default state will be on the dest, > and from that knowledge determine what state needs to be sent. Igor > has suggested an approach for handling it this way. This sounds like a bad idea - I don't know what other effects altering this generic qemu state will have. > Personally I'm leaning toward a), since assuming that dest devices will > start in a coldplug state is basically what most of the migration code > already does (since up until fairly recently, the hotplugged devices > were generally specified via cmdline on the dest and not via > device_add) Yes, I agree. > > > > So far, all QEMU's existing migration code has managed ok > > > > with the dest being starting with dev->hotplugged == false > > > > via cmdline devices, even though maybe they were hotplugged > > > > on the source. To me, it makes more sense to maintain this > > > > behavior by fixing up this relatively new use-case of > > > > adding devices via device_add before start to match the > > > > same expectations we have around cmdline-specified devices. > > > > > > > > This would fix migration for spapr, leave it working for > > > > everyone else (since that's basically what we expect for > > > > everything except newer-style cpu hotplug), and also make > > > > the device-add-before-start be truly synonymous with > > > > cmdline-created devices (which is applicable even outside > > > > of migration). > > > Neither -device or device_add can't really bring DRC/CPU into > > > the state that devices might be at the moment when their state > > > is transferred to target. > > > > > > i.e. any state that has been changed after -device/device_add > > > on SRC, should be in migration stream. I'd say even if state would > > > eventually go back default (coldplugged) when hotplug is completed. > > > So trying to avoid transmitting runtime state to optimize some bytes > > > on migration stream is just asking for trouble. > > > > Yeah, I'm coming to the same conclusion. Note that the reason for > > omitting state wasn't to save space in the migration stream, but to > > make the stream compatible with older versions which never transmit > > the state - at least in as many cases as possible. > > > > > [...] > > > > > > > > > > May be we should > > > > > 1. make DeviceState:hotpluggable property write-able again > > > > > 2. transmit DeviceState:hotpluggable as part of migration stream > > > > > 3. add generic migration hook which will check if target and > > > > > source value match, if value differs => fail/abort migration. > > > > > 4. in case values mismatch mgmt will be forced to explicitly > > > > > provide hotplugged property value on -device/device_add > > > > > That would enforce consistent DeviceState:hotpluggable value > > > > > on target and source. > > > > > We can enforce it only for new machine types so it won't break > > > > > old mgmt tools with old machine types but would force mgmt > > > > > for new machines to use hotplugged property on target > > > > > so QEMU could rely on its value for migration purposes. > > > > > > > > > > > > > That would work, and generalizing this beyond spapr seems > > > > appropriate. > > > > > > > > It also has reasonable semantics, and it would work for us > > > > *provided that* we always send DRC state for hotplugged devices > > > > and not just DRCs in a transitional state: > > > > > > > > SRC1: device_add $cpu > > > > -> dev->hotplugged == true > > > > -> device starts in pre-hotplug, ends up in post-hotplug state > > > > after guest onlines it > > > > <migrate> > > > > DST1: device_add $cpu,hotplugged=true > > > > -> dev->hotplugged == true > > > > -> device starts in pre-hotplug state. guest sends updated state > > > > to transition DRC to post-hotplug > > > > > > > > But what about stuff like mem/pci? Currently, migration works for > > > > cases like: > > > > > > > > SRC1: device_add virtio-net-pci > > > > DST1: qemu -device virtio-net-pci > > > > > > > > Even though DST1 has dev->hotplugged == false, and SRC1 has the > > > > opposite. So for new machines, checking SRC1:dev->hotplugged == > > > > DST1:dev->hotplugged would fail, even though the migration > > > > scenario is unchanged from before. > > > > > > > > So management would now have to do: > > > > > > > > SRC1: device_add virtio-net-pci > > > > DST1: qemu -device virtio-net-pci,hotplugged=true > > > > > > > > But the code behavior is a bit different then, since we now get > > > > an ACPI hotplug event via the hotplug handler. Maybe the > > > > migration stream fixes that up for us, but I think we would need > > > > to audit this and similar cases to be sure. > > > > > > > > That's all fine if it's necessary, but I feel like this is > > > > the hard way to address what's actually a much more specific > > > > issue: that device_add before machine-start doesn't currently > > > > match the behavior for a device started via cmdline. i.e. > > > > dev->hotplugged in the former vs. !dev->hotplugged in the > > > > latter. I don't really see a good reason these 2 cases should > > > > be different, and we can bring them to parity by doing > > > > something like: > > > > > > > > 1. For device_adds after qdev_machine_creation_done(), but > > > > before machine start, set a flag: reset_before_start. > > > > 2. At the start of processing migration stream, or unpausing > > > > a -S guest (in the non-migration case), issue a system-wide > > > > reset if reset_before_start is set. > > > > 3. reset handlers will already unset dev->hotplugged at that > > > > point and re-execute all the hotplug hooks with > > > > dev->hotplugged == false. This should put everything in > > > > a state that's identical to cmdline-created devices. > > > instead of flag for non migration case we could use > > > RUN_STATE_PRELAUNCH -> RUN_STATE_RUNNING > > > transition to reset all devices or > > > maybe do something like this: > > > > Hrm, does the general reset call happen now before or after this > > transition? Resetting DRCs at CAS time should accomplish the same thing. > > > > > diff --git a/hw/core/qdev.c b/hw/core/qdev.c > > > index 0ce45a2..cdeb8f8 100644 > > > --- a/hw/core/qdev.c > > > +++ b/hw/core/qdev.c > > > > > > @@ -1020,7 +1010,7 @@ static void device_initfn(Object *obj) > > > ObjectClass *class; > > > Property *prop; > > > > > > - if (qdev_hotplug) { > > > + if (runstate_check(RUN_STATE_RUNNING) || ...) { > > > dev->hotplugged = 1; > > > qdev_hot_added = true; > > > } > > > > > > > 4. Only allow management to do device_add before it sends > > > > the migration stream (if management doesn't already guard > > > > against this then it's probably a bug anyway) > > > seems like Juan already took care of it. > > > > > > > This allows management to treat device_add/cmdline as being > > > > completely synonymous for guests that haven't started yet, > > > > both for -incoming and -S in general, and it maintains > > > > the behavior that existing migration code expects of > > > > cmdline-specified devices. > > > > > > > > >
On Fri, Jun 16, 2017 at 12:19:26PM -0500, Michael Roth wrote: > Quoting Igor Mammedov (2017-06-16 08:53:12) > > On Wed, 14 Jun 2017 19:27:12 -0500 > > Michael Roth <mdroth@linux.vnet.ibm.com> wrote: > > > > > Quoting Igor Mammedov (2017-06-14 04:00:01) > > > > On Tue, 13 Jun 2017 16:42:45 -0500 > > > > Michael Roth <mdroth@linux.vnet.ibm.com> wrote: > > > > > > > > > Quoting Igor Mammedov (2017-06-09 03:27:33) > > [...] > > > > > > Currently I'd suggest to look into always migrate DRCs if cpu hotplug > > > > is enabled even if dev->hotplugged is false (not nice but it might work). > > > > Consider: > > > > SRC1: hotplug CPU1 => CPU1.hotplugged = true > > > > DST1: -device CPU1 => CPU1.hotplugged = false > > > > so in current code relying on CPU1.hotplugged would not work as expected, > > > > it works by accident because libvirt uses device_add on target > > > > DST1: device_add CPU1 => CPU1.hotplugged = true > > > > > > It's actually the reverse for us, DST1: -device CPU1 works, because > > > default DRC state for CPU1.hotplugged = false matches the state a > > > hotplugged CPU will be brought to after onlining at the source, so > > > we don't send it over the wire in the first place once it reaches > > > that post-hotplug/coldplug state. So things work as expected, even > > > though technically the source has dev->hotplugged == true, whereas > > > the dest has dev->hotplugged == false. > > in your case it seems fragile to rely on -device setting hotplugged cpu > > on target the way you want. > > Well, it's fragile in the sense that we try to assume post-hotplug == > coldplug as a way to avoid migrating hotplugged devices in some cases. > This was mostly done to allow for backward-compatibility without > relying on machine-specific behavior, but I have no problem with just > always migrating DRC state for hotplugged devices. > > But that still leaves the following issue: on the source, we want to > assume the default DRC state is coldplugged state. Based on that, > we just migrate the hotplugged DRCs and everything would be fine, > since we assume the target will initially put the DRC state in the > coldplug state for all DRCs, same way it would for -device cpu/etc. > > But device_add on the target doesn't behave this way, it defaults to > hotplugged state. It doesn't matter if the device is still in a hotplug > state on the source, even if we reset the machine on the source side > before migrating, device_add would still default to dev->hotplugged, > and -device would still default to !dev->hotplugged. > > So we can't just send hotplugged DRCs, nor can we just send coldplugged > DRCs, because there's no consistency in what the initial starting state > will be on the dest. If they use -device to specify it it does one > thing, if they use device_add it does another. So we're still stuck > sending *all* DRCs. > > So addressing that inconsistency is the key issue I think. I think that trying to do this in terms of dev->hotplugged is a mistake. That's really more about the device's history than its current state. What we want to detect is when the DRC is in a "transitional" state. Basically if no hotplug processing is ongoing, a DRC should be in one of two states: A) DRC in UNUSABLE/ISOLATED state, no device plugged. B) DRC in CONFIGURED state, device plugged. Coldplugged devices go straight to state (B), an "empty" DRC is in state (A). When something is hotplugged the DRCs go through a few transitional states, before arriving at state (B). Likewise when something is unplugged, there are various transitions, and then it reaches state (A). This is the common case, of course; if the guest is doing something unusual, a "transitional" state can last arbitrarily long. So, I think we always need to transfer DRC state if it's anything other than (A) or (B). On the destination, we determine starting state (A) or (B) depending on whether a device is plugged (my reset patch in DRC cleanups Part IV will already do this, I think). No reference to, or manipulation of, dev->hotplugged is necessary. > > it could be: > > > > SRC: hotplug CPU and start migration with DST: -device cpu[,hotplugged=false] > > *: machine is in hotplugging state and one would lose transitional state if DRC is not migrated. > > In this case we'd actually check to see that the DRC is in a transitional > state via .needed and migrate it. It's only when we've concluded that > the state is identical to coldplug state that we don't send it. > > But as mentioned above, this is more of an optimization, even if we > just always send DRC's for hotplugged devices, things are still broken, > and we need a reliable way to infer the starting state of devices on > the dest to fix them. > > > > > > It's the DST1: device_add case that wasn't accounted for when the DRC > > > migration patches were written, as those don't default to coldplug, > > > so, because the source doesn't send it, it ends up being presented > > > in pre-hotplug state because the dest doesn't know that the guest > > > already onlined the resource and transitioned it to > > > post-hotplug/coldplug state. Ironically, dev->hotplugged > > > is true on both source and dest in this case, but it ends up being > > > the broken one. > > it looks like for hotplugged CPUs DRC state should be always migrated > > That would fix this case, but even if we adopt that approach we still > have this scenario that's broken: > > SRC: hotplug CPU, reboot, CPU is now in a coldplugged state, so we don't > migrate it > DST: device_add CPU, currently this defaults to dev->hotplugged == true, > so this incorrectly gets exposed to the guest as being in a > transitional state. > > This all stems from pre-start device_add semantics not aligning with > cmdline-specified ones. Right, we can fix this if we reset all DRC states based on device presence before processing the incoming migration. > > > > > > > > But your point stands, the fact that both situations are possible > > > means we can't currently rely on dev->hotplugged without migrating > > > it, infering it based on QEMU lifecycle, or forcing management to > > > set it. > > > > > > But that raises a 2nd point. Our dilemma isn't that we can't > > > rely on dev->hotplugged being synchronized (though if it > > > was we could build something around that), our dilemma is > > > that we make the following assumption in our code: > > > > > > "Devices present at start-time will be handled the same way, > > > on source or dest, regardless of whether they were added via > > > cmdline or via device_add prior to machine start / migration > > > stream processing." > > > > > > And I think that's a sensible expectation, since in theory > > > even the source could build up a machine via device_add > > > prior to starting it, and the reasonable default there is > > > dev->hotplugged = false rather than the opposite. That > > > suggests a need to fix things outside of migration. > > Agreed to a degree, i.e. > > > > -device/device_add before machine has been started > > without migration should follow coldplug path > > I really feel like adopting this for the migration case as > well is the cleanest solution. AFAICT none of the migration > code really cares about synchronizing dev->hotplugged, > something as simple as: > > SRC: device_add virtio-net-pci > DST: -device virtio-net-pci > > results in a mismatch between dev->hotplugged. But the migration > code doesn't really care, it's all based around knowledge of > what the initial starting state of the device on DST will be. > > Now that we cases where device_add might be used instead of > -device on DST, we no longer have that knowledge on the source. > Having device_add-before-start match existing cmdline behavior > seems like the most direct way to address this, and it seems > sensible since they ostensibly serve the exact same purpose. Come to think of it, I'm not sure we should really have any tests of dev->hotplugged in our hotplug paths. If we put everything into "hotplugged" state, but the reset converts already attached things to coldplug state, I think that should cover it, yes? Maybe we need to omit notifications with !dev->hotplugged, although even then we should flush the event queue on reset. > > it shouldn't cause problems for CPU/mem hotplug on x86 > > and maybe will work for PCI (it may change behavior of > > ACPI based hotplug and bridges), > > CCing Marcel to confirm. > > > > > > > So far, all QEMU's existing migration code has managed ok > > > with the dest being starting with dev->hotplugged == false > > > via cmdline devices, even though maybe they were hotplugged > > > on the source. To me, it makes more sense to maintain this > > > behavior by fixing up this relatively new use-case of > > > adding devices via device_add before start to match the > > > same expectations we have around cmdline-specified devices. > > > > > > This would fix migration for spapr, leave it working for > > > everyone else (since that's basically what we expect for > > > everything except newer-style cpu hotplug), and also make > > > the device-add-before-start be truly synonymous with > > > cmdline-created devices (which is applicable even outside > > > of migration). > > Neither -device or device_add can't really bring DRC/CPU into > > the state that devices might be at the moment when their state > > is transferred to target. > > > > i.e. any state that has been changed after -device/device_add > > on SRC, should be in migration stream. I'd say even if state would > > eventually go back default (coldplugged) when hotplug is completed. > > So trying to avoid transmitting runtime state to optimize some bytes > > on migration stream is just asking for trouble. > > > > [...] > > > > > > > > May be we should > > > > 1. make DeviceState:hotpluggable property write-able again > > > > 2. transmit DeviceState:hotpluggable as part of migration stream > > > > 3. add generic migration hook which will check if target and > > > > source value match, if value differs => fail/abort migration. > > > > 4. in case values mismatch mgmt will be forced to explicitly > > > > provide hotplugged property value on -device/device_add > > > > That would enforce consistent DeviceState:hotpluggable value > > > > on target and source. > > > > We can enforce it only for new machine types so it won't break > > > > old mgmt tools with old machine types but would force mgmt > > > > for new machines to use hotplugged property on target > > > > so QEMU could rely on its value for migration purposes. > > > > > > > > > > That would work, and generalizing this beyond spapr seems > > > appropriate. > > > > > > It also has reasonable semantics, and it would work for us > > > *provided that* we always send DRC state for hotplugged devices > > > and not just DRCs in a transitional state: > > > > > > SRC1: device_add $cpu > > > -> dev->hotplugged == true > > > -> device starts in pre-hotplug, ends up in post-hotplug state > > > after guest onlines it > > > <migrate> > > > DST1: device_add $cpu,hotplugged=true > > > -> dev->hotplugged == true > > > -> device starts in pre-hotplug state. guest sends updated state > > > to transition DRC to post-hotplug > > > > > > But what about stuff like mem/pci? Currently, migration works for > > > cases like: > > > > > > SRC1: device_add virtio-net-pci > > > DST1: qemu -device virtio-net-pci > > > > > > Even though DST1 has dev->hotplugged == false, and SRC1 has the > > > opposite. So for new machines, checking SRC1:dev->hotplugged == > > > DST1:dev->hotplugged would fail, even though the migration > > > scenario is unchanged from before. > > > > > > So management would now have to do: > > > > > > SRC1: device_add virtio-net-pci > > > DST1: qemu -device virtio-net-pci,hotplugged=true > > > > > > But the code behavior is a bit different then, since we now get > > > an ACPI hotplug event via the hotplug handler. Maybe the > > > migration stream fixes that up for us, but I think we would need > > > to audit this and similar cases to be sure. > > > > > > That's all fine if it's necessary, but I feel like this is > > > the hard way to address what's actually a much more specific > > > issue: that device_add before machine-start doesn't currently > > > match the behavior for a device started via cmdline. i.e. > > > dev->hotplugged in the former vs. !dev->hotplugged in the > > > latter. I don't really see a good reason these 2 cases should > > > be different, and we can bring them to parity by doing > > > something like: > > > > > > 1. For device_adds after qdev_machine_creation_done(), but > > > before machine start, set a flag: reset_before_start. > > > 2. At the start of processing migration stream, or unpausing > > > a -S guest (in the non-migration case), issue a system-wide > > > reset if reset_before_start is set. > > > 3. reset handlers will already unset dev->hotplugged at that > > > point and re-execute all the hotplug hooks with > > > dev->hotplugged == false. This should put everything in > > > a state that's identical to cmdline-created devices. > > instead of flag for non migration case we could use > > RUN_STATE_PRELAUNCH -> RUN_STATE_RUNNING > > transition to reset all devices or > > maybe do something like this: > > > > diff --git a/hw/core/qdev.c b/hw/core/qdev.c > > index 0ce45a2..cdeb8f8 100644 > > --- a/hw/core/qdev.c > > +++ b/hw/core/qdev.c > > > > @@ -1020,7 +1010,7 @@ static void device_initfn(Object *obj) > > ObjectClass *class; > > Property *prop; > > > > - if (qdev_hotplug) { > > + if (runstate_check(RUN_STATE_RUNNING) || ...) { > > dev->hotplugged = 1; > > qdev_hot_added = true; > > } > > Yah that sort of approach (this would coincedentally address the > migration case as well, not sure if you are on board with that > approach or not) is what I initially had in mind. The > global flag seemed safer because we get a machine-wide reset > will all devices already plugged just like for cmdline, but I'm > not sure if that is really needed or not. This is definitely more > elegant though. > > > > > > 4. Only allow management to do device_add before it sends > > > the migration stream (if management doesn't already guard > > > against this then it's probably a bug anyway) > > seems like Juan already took care of it. > > Ok, good to know. > > > > > > This allows management to treat device_add/cmdline as being > > > completely synonymous for guests that haven't started yet, > > > both for -incoming and -S in general, and it maintains > > > the behavior that existing migration code expects of > > > cmdline-specified devices. > > > > >
Quoting David Gibson (2017-06-18 04:59:33) > On Fri, Jun 16, 2017 at 11:15:46AM -0500, Michael Roth wrote: > > Quoting David Gibson (2017-06-16 09:40:53) > > > On Fri, Jun 16, 2017 at 03:53:12PM +0200, Igor Mammedov wrote: > > > > On Wed, 14 Jun 2017 19:27:12 -0500 > > > > Michael Roth <mdroth@linux.vnet.ibm.com> wrote: > > > > > > > > > Quoting Igor Mammedov (2017-06-14 04:00:01) > > > > > > On Tue, 13 Jun 2017 16:42:45 -0500 > > > > > > Michael Roth <mdroth@linux.vnet.ibm.com> wrote: > > > > > > > > > > > > > Quoting Igor Mammedov (2017-06-09 03:27:33) > > > > [...] > > > > > > > > > > Currently I'd suggest to look into always migrate DRCs if cpu hotplug > > > > > > is enabled even if dev->hotplugged is false (not nice but it might work). > > > > > > Consider: > > > > > > SRC1: hotplug CPU1 => CPU1.hotplugged = true > > > > > > DST1: -device CPU1 => CPU1.hotplugged = false > > > > > > so in current code relying on CPU1.hotplugged would not work as expected, > > > > > > it works by accident because libvirt uses device_add on target > > > > > > DST1: device_add CPU1 => CPU1.hotplugged = true > > > > > > > > > > It's actually the reverse for us, DST1: -device CPU1 works, because > > > > > default DRC state for CPU1.hotplugged = false matches the state a > > > > > hotplugged CPU will be brought to after onlining at the source, so > > > > > we don't send it over the wire in the first place once it reaches > > > > > that post-hotplug/coldplug state. So things work as expected, even > > > > > though technically the source has dev->hotplugged == true, whereas > > > > > the dest has dev->hotplugged == false. > > > > in your case it seems fragile to rely on -device setting hotplugged cpu > > > > on target the way you want. > > > > > > > > it could be: > > > > > > > > SRC: hotplug CPU and start migration with DST: -device cpu[,hotplugged=false] > > > > *: machine is in hotplugging state and one would lose transitional state if DRC is not migrated. > > > > > > > > > It's the DST1: device_add case that wasn't accounted for when the DRC > > > > > migration patches were written, as those don't default to coldplug, > > > > > so, because the source doesn't send it, it ends up being presented > > > > > in pre-hotplug state because the dest doesn't know that the guest > > > > > already onlined the resource and transitioned it to > > > > > post-hotplug/coldplug state. Ironically, dev->hotplugged > > > > > is true on both source and dest in this case, but it ends up being > > > > > the broken one. > > > > it looks like for hotplugged CPUs DRC state should be always migrated > > > > > > Yeah, I think in the first instance we should just unconditionally > > > transfer DRC state for newer machine types (at least once I clean up > > > what exactly the DRC state _is_). For old machine types things are > > > still likely to be broken, but it won't be a regression. > > > > > > If we can work out a way to fix things for older machine types, that's > > > a bonus, but it looks like it's very difficult, maybe impossible. > > > First priority should be sane semantics for the newer machine types. > > > > > > > > But your point stands, the fact that both situations are possible > > > > > means we can't currently rely on dev->hotplugged without migrating > > > > > it, infering it based on QEMU lifecycle, or forcing management to > > > > > set it. > > > > > > > > > > But that raises a 2nd point. Our dilemma isn't that we can't > > > > > rely on dev->hotplugged being synchronized (though if it > > > > > was we could build something around that), our dilemma is > > > > > that we make the following assumption in our code: > > > > > > > > > > "Devices present at start-time will be handled the same way, > > > > > on source or dest, regardless of whether they were added via > > > > > cmdline or via device_add prior to machine start / migration > > > > > stream processing." > > > > > > > > > > And I think that's a sensible expectation, since in theory > > > > > even the source could build up a machine via device_add > > > > > prior to starting it, and the reasonable default there is > > > > > dev->hotplugged = false rather than the opposite. That > > > > > suggests a need to fix things outside of migration. > > > > Agreed to a degree, i.e. > > > > > > > > -device/device_add before machine has been started > > > > without migration should follow coldplug path > > > > > > > > it shouldn't cause problems for CPU/mem hotplug on x86 > > > > and maybe will work for PCI (it may change behavior of > > > > ACPI based hotplug and bridges), > > > > CCing Marcel to confirm. > > > > > > So for ppc the main problem with early plugged devices is a > > > duplication of device tree info between that delivered by CAS, and > > > that delivered through configure-connector. I see two basic > > > approaches to fixing this: > > > > > > 1) At CAS, reset all DRCs before building the device tree to send to > > > the guest. That will essentially "convert" everything present at CAS > > > time into coldplugged device. There might still be a problem if there > > > are existing hotplug events in the queue from before CAS. > > > > > > 2) When building the device tree (at both CAS and reset) check the DRC > > > state, and omit DT information for anything that's not in CONFIGURED > > > state. That should be correct, because the guest should call > > > configure-connector for anything not yet in CONFIGURED state, at which > > > point it will get the DT information. > > > > > > I'm not sure which approach is best yet. I'm not 100% sure of (1) is > > > correct in all cases, but it is a bit simpler than (2). > > > > This fixes up devices added before CAS (whether via -device/device_add) > > on the source, but at the time of migration we will have (usually) > > already completed CAS. We also can't repeat the CAS-time fixups on the > > target, because that would reset any DRC state sent via migration (which > > would include things like transitional states outside of > > post-hotplug/coldplug). > > Sorry, I wasn't clear. The options above are for the problems with > hotplug during early boot, they won't help migration cases. > > > So either we sent all DRC state (regardless of hotplug or not), or > > we need a reliable starting state on the dest from which we can > > determine what needs to be sent by the source. The 2 options are: > > So for new machine types, I think we should always send DRC state. > > > a) ensure all devices on the dest start off in a coldplug state, at > > point we can simply send DRC for anything hotplugged. This isn't > > the case now because device_add on the target results in the initial > > state being hotplug, where -device's are in coldplug. I'm proposing > > we ensure both these cases default coldplug state on the target if > > device_add occurs before the dest is started. > > I think that will become the case with my Part IV DRC cleanups. IIRC, > there is a system reset performed before processing the incoming > migration. With my DRC changes we reset the DRC state based only on There's a system-wide reset, but it happens too early in cases where libvirt opts to specify devices via device_add on the dest rather than cmdline. In that case the sequence ends up being: 1) initialize cmdline devices and run their coldplug hooks 2) qdev_machine_creation_done(), all devices initialized after this point will be dev->hotplugged == true 3) issue system-wide reset 4) libvirt issues device_add for the CPU we've hotplugged on the source dev->hotplugged == true. 5) we set the DRC state to "pre-hotplug" in anticipation that the that guest needs to finish onlining devices (same way we'd handle it on the source). Except this is breaks, since the source was assuming we'd put it in "coldplug/post-hotplug" just like we would for cmdline-specified devices. That assumption doesn't hold when libvirt opts to use device_add to build up a machine 6) dest starts receiving migration stream So adding that reset prior to 6) is actually what I'm proposing we should do. (as well as adding a reset prior to machine start in general, even outside of migration, to handle boot-time devices specified via device_add, which I think Igor is on board with already. if we have agreement on both the migration and non-migration cases we can generalize the approach a bit more) > the device presence at reset time. Since the device will be > hotplugged before the incoming migration, it should be reset to > coldplug equivalent state. My understanding is that your patches move all our DRC-specific coldplug hooks to a DRC reset() function, so the hotplug handler will always assume that that scenario is handled by the DRC reset path. I think that that's indeed the proper way to handle this. But in the scenario above, we won't get that DRC reset() prior to 6), so that assumption doesn't hold. We need to ensure a system-wide reset happens prior to 6) even with your proposed series. I'm simplifying a little bit though: technically, we can address this specific scenario by always migrating DRCs for dev->hotplugged devices. But that end up being a band-aid: if you hotplug a CPU on the source, then do a reboot, *then* do migration you will hit that exact same issue. So I really think we need a machine-wide reset prior to 6) regardless. > > > b) ensure dev->hotplugged is in sync between src/dest by migrating it > > or forcing management to set it. Based on that knowledge, we can > > determine on the src side what the default state will be on the dest, > > and from that knowledge determine what state needs to be sent. Igor > > has suggested an approach for handling it this way. > > This sounds like a bad idea - I don't know what other effects altering > this generic qemu state will have. > > > Personally I'm leaning toward a), since assuming that dest devices will > > start in a coldplug state is basically what most of the migration code > > already does (since up until fairly recently, the hotplugged devices > > were generally specified via cmdline on the dest and not via > > device_add) > > Yes, I agree. > > > > > > So far, all QEMU's existing migration code has managed ok > > > > > with the dest being starting with dev->hotplugged == false > > > > > via cmdline devices, even though maybe they were hotplugged > > > > > on the source. To me, it makes more sense to maintain this > > > > > behavior by fixing up this relatively new use-case of > > > > > adding devices via device_add before start to match the > > > > > same expectations we have around cmdline-specified devices. > > > > > > > > > > This would fix migration for spapr, leave it working for > > > > > everyone else (since that's basically what we expect for > > > > > everything except newer-style cpu hotplug), and also make > > > > > the device-add-before-start be truly synonymous with > > > > > cmdline-created devices (which is applicable even outside > > > > > of migration). > > > > Neither -device or device_add can't really bring DRC/CPU into > > > > the state that devices might be at the moment when their state > > > > is transferred to target. > > > > > > > > i.e. any state that has been changed after -device/device_add > > > > on SRC, should be in migration stream. I'd say even if state would > > > > eventually go back default (coldplugged) when hotplug is completed. > > > > So trying to avoid transmitting runtime state to optimize some bytes > > > > on migration stream is just asking for trouble. > > > > > > Yeah, I'm coming to the same conclusion. Note that the reason for > > > omitting state wasn't to save space in the migration stream, but to > > > make the stream compatible with older versions which never transmit > > > the state - at least in as many cases as possible. > > > > > > > [...] > > > > > > > > > > > > May be we should > > > > > > 1. make DeviceState:hotpluggable property write-able again > > > > > > 2. transmit DeviceState:hotpluggable as part of migration stream > > > > > > 3. add generic migration hook which will check if target and > > > > > > source value match, if value differs => fail/abort migration. > > > > > > 4. in case values mismatch mgmt will be forced to explicitly > > > > > > provide hotplugged property value on -device/device_add > > > > > > That would enforce consistent DeviceState:hotpluggable value > > > > > > on target and source. > > > > > > We can enforce it only for new machine types so it won't break > > > > > > old mgmt tools with old machine types but would force mgmt > > > > > > for new machines to use hotplugged property on target > > > > > > so QEMU could rely on its value for migration purposes. > > > > > > > > > > > > > > > > That would work, and generalizing this beyond spapr seems > > > > > appropriate. > > > > > > > > > > It also has reasonable semantics, and it would work for us > > > > > *provided that* we always send DRC state for hotplugged devices > > > > > and not just DRCs in a transitional state: > > > > > > > > > > SRC1: device_add $cpu > > > > > -> dev->hotplugged == true > > > > > -> device starts in pre-hotplug, ends up in post-hotplug state > > > > > after guest onlines it > > > > > <migrate> > > > > > DST1: device_add $cpu,hotplugged=true > > > > > -> dev->hotplugged == true > > > > > -> device starts in pre-hotplug state. guest sends updated state > > > > > to transition DRC to post-hotplug > > > > > > > > > > But what about stuff like mem/pci? Currently, migration works for > > > > > cases like: > > > > > > > > > > SRC1: device_add virtio-net-pci > > > > > DST1: qemu -device virtio-net-pci > > > > > > > > > > Even though DST1 has dev->hotplugged == false, and SRC1 has the > > > > > opposite. So for new machines, checking SRC1:dev->hotplugged == > > > > > DST1:dev->hotplugged would fail, even though the migration > > > > > scenario is unchanged from before. > > > > > > > > > > So management would now have to do: > > > > > > > > > > SRC1: device_add virtio-net-pci > > > > > DST1: qemu -device virtio-net-pci,hotplugged=true > > > > > > > > > > But the code behavior is a bit different then, since we now get > > > > > an ACPI hotplug event via the hotplug handler. Maybe the > > > > > migration stream fixes that up for us, but I think we would need > > > > > to audit this and similar cases to be sure. > > > > > > > > > > That's all fine if it's necessary, but I feel like this is > > > > > the hard way to address what's actually a much more specific > > > > > issue: that device_add before machine-start doesn't currently > > > > > match the behavior for a device started via cmdline. i.e. > > > > > dev->hotplugged in the former vs. !dev->hotplugged in the > > > > > latter. I don't really see a good reason these 2 cases should > > > > > be different, and we can bring them to parity by doing > > > > > something like: > > > > > > > > > > 1. For device_adds after qdev_machine_creation_done(), but > > > > > before machine start, set a flag: reset_before_start. > > > > > 2. At the start of processing migration stream, or unpausing > > > > > a -S guest (in the non-migration case), issue a system-wide > > > > > reset if reset_before_start is set. > > > > > 3. reset handlers will already unset dev->hotplugged at that > > > > > point and re-execute all the hotplug hooks with > > > > > dev->hotplugged == false. This should put everything in > > > > > a state that's identical to cmdline-created devices. > > > > instead of flag for non migration case we could use > > > > RUN_STATE_PRELAUNCH -> RUN_STATE_RUNNING > > > > transition to reset all devices or > > > > maybe do something like this: > > > > > > Hrm, does the general reset call happen now before or after this > > > transition? Resetting DRCs at CAS time should accomplish the same thing. > > > > > > > diff --git a/hw/core/qdev.c b/hw/core/qdev.c > > > > index 0ce45a2..cdeb8f8 100644 > > > > --- a/hw/core/qdev.c > > > > +++ b/hw/core/qdev.c > > > > > > > > @@ -1020,7 +1010,7 @@ static void device_initfn(Object *obj) > > > > ObjectClass *class; > > > > Property *prop; > > > > > > > > - if (qdev_hotplug) { > > > > + if (runstate_check(RUN_STATE_RUNNING) || ...) { > > > > dev->hotplugged = 1; > > > > qdev_hot_added = true; > > > > } > > > > > > > > > 4. Only allow management to do device_add before it sends > > > > > the migration stream (if management doesn't already guard > > > > > against this then it's probably a bug anyway) > > > > seems like Juan already took care of it. > > > > > > > > > This allows management to treat device_add/cmdline as being > > > > > completely synonymous for guests that haven't started yet, > > > > > both for -incoming and -S in general, and it maintains > > > > > the behavior that existing migration code expects of > > > > > cmdline-specified devices. > > > > > > > > > > > > > > > -- > David Gibson | I'll have my music baroque, and my code > david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ > | _way_ _around_! > http://www.ozlabs.org/~dgibson
Quoting David Gibson (2017-06-18 05:24:22) > On Fri, Jun 16, 2017 at 12:19:26PM -0500, Michael Roth wrote: > > Quoting Igor Mammedov (2017-06-16 08:53:12) > > > On Wed, 14 Jun 2017 19:27:12 -0500 > > > Michael Roth <mdroth@linux.vnet.ibm.com> wrote: > > > > > > > Quoting Igor Mammedov (2017-06-14 04:00:01) > > > > > On Tue, 13 Jun 2017 16:42:45 -0500 > > > > > Michael Roth <mdroth@linux.vnet.ibm.com> wrote: > > > > > > > > > > > Quoting Igor Mammedov (2017-06-09 03:27:33) > > > [...] > > > > > > > > Currently I'd suggest to look into always migrate DRCs if cpu hotplug > > > > > is enabled even if dev->hotplugged is false (not nice but it might work). > > > > > Consider: > > > > > SRC1: hotplug CPU1 => CPU1.hotplugged = true > > > > > DST1: -device CPU1 => CPU1.hotplugged = false > > > > > so in current code relying on CPU1.hotplugged would not work as expected, > > > > > it works by accident because libvirt uses device_add on target > > > > > DST1: device_add CPU1 => CPU1.hotplugged = true > > > > > > > > It's actually the reverse for us, DST1: -device CPU1 works, because > > > > default DRC state for CPU1.hotplugged = false matches the state a > > > > hotplugged CPU will be brought to after onlining at the source, so > > > > we don't send it over the wire in the first place once it reaches > > > > that post-hotplug/coldplug state. So things work as expected, even > > > > though technically the source has dev->hotplugged == true, whereas > > > > the dest has dev->hotplugged == false. > > > in your case it seems fragile to rely on -device setting hotplugged cpu > > > on target the way you want. > > > > Well, it's fragile in the sense that we try to assume post-hotplug == > > coldplug as a way to avoid migrating hotplugged devices in some cases. > > This was mostly done to allow for backward-compatibility without > > relying on machine-specific behavior, but I have no problem with just > > always migrating DRC state for hotplugged devices. > > > > But that still leaves the following issue: on the source, we want to > > assume the default DRC state is coldplugged state. Based on that, > > we just migrate the hotplugged DRCs and everything would be fine, > > since we assume the target will initially put the DRC state in the > > coldplug state for all DRCs, same way it would for -device cpu/etc. > > > > But device_add on the target doesn't behave this way, it defaults to > > hotplugged state. It doesn't matter if the device is still in a hotplug > > state on the source, even if we reset the machine on the source side > > before migrating, device_add would still default to dev->hotplugged, > > and -device would still default to !dev->hotplugged. > > > > So we can't just send hotplugged DRCs, nor can we just send coldplugged > > DRCs, because there's no consistency in what the initial starting state > > will be on the dest. If they use -device to specify it it does one > > thing, if they use device_add it does another. So we're still stuck > > sending *all* DRCs. > > > > So addressing that inconsistency is the key issue I think. > > I think that trying to do this in terms of dev->hotplugged is a > mistake. That's really more about the device's history than its > current state. > > What we want to detect is when the DRC is in a "transitional" state. > Basically if no hotplug processing is ongoing, a DRC should be in one > of two states: > > A) DRC in UNUSABLE/ISOLATED state, no device plugged. > > B) DRC in CONFIGURED state, device plugged. > > Coldplugged devices go straight to state (B), an "empty" DRC is in > state (A). When something is hotplugged the DRCs go through a few > transitional states, before arriving at state (B). Likewise when > something is unplugged, there are various transitions, and then it > reaches state (A). > > This is the common case, of course; if the guest is doing something > unusual, a "transitional" state can last arbitrarily long. > > So, I think we always need to transfer DRC state if it's anything > other than (A) or (B). On the destination, we determine starting > state (A) or (B) depending on whether a device is plugged (my reset > patch in DRC cleanups Part IV will already do this, I think). > > No reference to, or manipulation of, dev->hotplugged is necessary. > > > > it could be: > > > > > > SRC: hotplug CPU and start migration with DST: -device cpu[,hotplugged=false] > > > *: machine is in hotplugging state and one would lose transitional state if DRC is not migrated. > > > > In this case we'd actually check to see that the DRC is in a transitional > > state via .needed and migrate it. It's only when we've concluded that > > the state is identical to coldplug state that we don't send it. > > > > But as mentioned above, this is more of an optimization, even if we > > just always send DRC's for hotplugged devices, things are still broken, > > and we need a reliable way to infer the starting state of devices on > > the dest to fix them. > > > > > > > > > It's the DST1: device_add case that wasn't accounted for when the DRC > > > > migration patches were written, as those don't default to coldplug, > > > > so, because the source doesn't send it, it ends up being presented > > > > in pre-hotplug state because the dest doesn't know that the guest > > > > already onlined the resource and transitioned it to > > > > post-hotplug/coldplug state. Ironically, dev->hotplugged > > > > is true on both source and dest in this case, but it ends up being > > > > the broken one. > > > it looks like for hotplugged CPUs DRC state should be always migrated > > > > That would fix this case, but even if we adopt that approach we still > > have this scenario that's broken: > > > > SRC: hotplug CPU, reboot, CPU is now in a coldplugged state, so we don't > > migrate it > > DST: device_add CPU, currently this defaults to dev->hotplugged == true, > > so this incorrectly gets exposed to the guest as being in a > > transitional state. > > > > This all stems from pre-start device_add semantics not aligning with > > cmdline-specified ones. > > Right, we can fix this if we reset all DRC states based on device > presence before processing the incoming migration. Yes, though that not the case now. For device_add's on the dest, the DRC reset()'s would've already been triggered, and they won't trigger again before we process the migration stream. We can manually trigger them via migration hooks in spapr code, but I think that punts similar issues other archs. Might hit. A system-wide reset after any device_add-before-starts addresses this for everyone and bring the machine back to a very familiar state. > > > > > > > > > > > > > But your point stands, the fact that both situations are possible > > > > means we can't currently rely on dev->hotplugged without migrating > > > > it, infering it based on QEMU lifecycle, or forcing management to > > > > set it. > > > > > > > > But that raises a 2nd point. Our dilemma isn't that we can't > > > > rely on dev->hotplugged being synchronized (though if it > > > > was we could build something around that), our dilemma is > > > > that we make the following assumption in our code: > > > > > > > > "Devices present at start-time will be handled the same way, > > > > on source or dest, regardless of whether they were added via > > > > cmdline or via device_add prior to machine start / migration > > > > stream processing." > > > > > > > > And I think that's a sensible expectation, since in theory > > > > even the source could build up a machine via device_add > > > > prior to starting it, and the reasonable default there is > > > > dev->hotplugged = false rather than the opposite. That > > > > suggests a need to fix things outside of migration. > > > Agreed to a degree, i.e. > > > > > > -device/device_add before machine has been started > > > without migration should follow coldplug path > > > > I really feel like adopting this for the migration case as > > well is the cleanest solution. AFAICT none of the migration > > code really cares about synchronizing dev->hotplugged, > > something as simple as: > > > > SRC: device_add virtio-net-pci > > DST: -device virtio-net-pci > > > > results in a mismatch between dev->hotplugged. But the migration > > code doesn't really care, it's all based around knowledge of > > what the initial starting state of the device on DST will be. > > > > Now that we cases where device_add might be used instead of > > -device on DST, we no longer have that knowledge on the source. > > Having device_add-before-start match existing cmdline behavior > > seems like the most direct way to address this, and it seems > > sensible since they ostensibly serve the exact same purpose. > > Come to think of it, I'm not sure we should really have any tests of > dev->hotplugged in our hotplug paths. If we put everything into > "hotplugged" state, but the reset converts already attached things to > coldplug state, I think that should cover it, yes? It would if the DRC reset happened after device_add on dest, but it doesn't currently. I outlined an approach earlier in the thread for ensuring a system-wide reset is performed prior to processing migration stream if any device_add's occurred beforehand. If it seems reasonable to you and Igor I can post an RFC for it. (Igor: I think this also demonstrates a scenario where just setting dev->hotplugged = false for device_add-before-start isn't sufficient. The DRC reset() approach David mentions would require a system-wide reset with all devices already plugged) > > Maybe we need to omit notifications with !dev->hotplugged, although > even then we should flush the event queue on reset. > > > > it shouldn't cause problems for CPU/mem hotplug on x86 > > > and maybe will work for PCI (it may change behavior of > > > ACPI based hotplug and bridges), > > > CCing Marcel to confirm. > > > > > > > > > > So far, all QEMU's existing migration code has managed ok > > > > with the dest being starting with dev->hotplugged == false > > > > via cmdline devices, even though maybe they were hotplugged > > > > on the source. To me, it makes more sense to maintain this > > > > behavior by fixing up this relatively new use-case of > > > > adding devices via device_add before start to match the > > > > same expectations we have around cmdline-specified devices. > > > > > > > > This would fix migration for spapr, leave it working for > > > > everyone else (since that's basically what we expect for > > > > everything except newer-style cpu hotplug), and also make > > > > the device-add-before-start be truly synonymous with > > > > cmdline-created devices (which is applicable even outside > > > > of migration). > > > Neither -device or device_add can't really bring DRC/CPU into > > > the state that devices might be at the moment when their state > > > is transferred to target. > > > > > > i.e. any state that has been changed after -device/device_add > > > on SRC, should be in migration stream. I'd say even if state would > > > eventually go back default (coldplugged) when hotplug is completed. > > > So trying to avoid transmitting runtime state to optimize some bytes > > > on migration stream is just asking for trouble. > > > > > > [...] > > > > > > > > > > May be we should > > > > > 1. make DeviceState:hotpluggable property write-able again > > > > > 2. transmit DeviceState:hotpluggable as part of migration stream > > > > > 3. add generic migration hook which will check if target and > > > > > source value match, if value differs => fail/abort migration. > > > > > 4. in case values mismatch mgmt will be forced to explicitly > > > > > provide hotplugged property value on -device/device_add > > > > > That would enforce consistent DeviceState:hotpluggable value > > > > > on target and source. > > > > > We can enforce it only for new machine types so it won't break > > > > > old mgmt tools with old machine types but would force mgmt > > > > > for new machines to use hotplugged property on target > > > > > so QEMU could rely on its value for migration purposes. > > > > > > > > > > > > > That would work, and generalizing this beyond spapr seems > > > > appropriate. > > > > > > > > It also has reasonable semantics, and it would work for us > > > > *provided that* we always send DRC state for hotplugged devices > > > > and not just DRCs in a transitional state: > > > > > > > > SRC1: device_add $cpu > > > > -> dev->hotplugged == true > > > > -> device starts in pre-hotplug, ends up in post-hotplug state > > > > after guest onlines it > > > > <migrate> > > > > DST1: device_add $cpu,hotplugged=true > > > > -> dev->hotplugged == true > > > > -> device starts in pre-hotplug state. guest sends updated state > > > > to transition DRC to post-hotplug > > > > > > > > But what about stuff like mem/pci? Currently, migration works for > > > > cases like: > > > > > > > > SRC1: device_add virtio-net-pci > > > > DST1: qemu -device virtio-net-pci > > > > > > > > Even though DST1 has dev->hotplugged == false, and SRC1 has the > > > > opposite. So for new machines, checking SRC1:dev->hotplugged == > > > > DST1:dev->hotplugged would fail, even though the migration > > > > scenario is unchanged from before. > > > > > > > > So management would now have to do: > > > > > > > > SRC1: device_add virtio-net-pci > > > > DST1: qemu -device virtio-net-pci,hotplugged=true > > > > > > > > But the code behavior is a bit different then, since we now get > > > > an ACPI hotplug event via the hotplug handler. Maybe the > > > > migration stream fixes that up for us, but I think we would need > > > > to audit this and similar cases to be sure. > > > > > > > > That's all fine if it's necessary, but I feel like this is > > > > the hard way to address what's actually a much more specific > > > > issue: that device_add before machine-start doesn't currently > > > > match the behavior for a device started via cmdline. i.e. > > > > dev->hotplugged in the former vs. !dev->hotplugged in the > > > > latter. I don't really see a good reason these 2 cases should > > > > be different, and we can bring them to parity by doing > > > > something like: > > > > > > > > 1. For device_adds after qdev_machine_creation_done(), but > > > > before machine start, set a flag: reset_before_start. > > > > 2. At the start of processing migration stream, or unpausing > > > > a -S guest (in the non-migration case), issue a system-wide > > > > reset if reset_before_start is set. > > > > 3. reset handlers will already unset dev->hotplugged at that > > > > point and re-execute all the hotplug hooks with > > > > dev->hotplugged == false. This should put everything in > > > > a state that's identical to cmdline-created devices. > > > instead of flag for non migration case we could use > > > RUN_STATE_PRELAUNCH -> RUN_STATE_RUNNING > > > transition to reset all devices or > > > maybe do something like this: > > > > > > diff --git a/hw/core/qdev.c b/hw/core/qdev.c > > > index 0ce45a2..cdeb8f8 100644 > > > --- a/hw/core/qdev.c > > > +++ b/hw/core/qdev.c > > > > > > @@ -1020,7 +1010,7 @@ static void device_initfn(Object *obj) > > > ObjectClass *class; > > > Property *prop; > > > > > > - if (qdev_hotplug) { > > > + if (runstate_check(RUN_STATE_RUNNING) || ...) { > > > dev->hotplugged = 1; > > > qdev_hot_added = true; > > > } > > > > Yah that sort of approach (this would coincedentally address the > > migration case as well, not sure if you are on board with that > > approach or not) is what I initially had in mind. The > > global flag seemed safer because we get a machine-wide reset > > will all devices already plugged just like for cmdline, but I'm > > not sure if that is really needed or not. This is definitely more > > elegant though. > > > > > > > > > 4. Only allow management to do device_add before it sends > > > > the migration stream (if management doesn't already guard > > > > against this then it's probably a bug anyway) > > > seems like Juan already took care of it. > > > > Ok, good to know. > > > > > > > > > This allows management to treat device_add/cmdline as being > > > > completely synonymous for guests that haven't started yet, > > > > both for -incoming and -S in general, and it maintains > > > > the behavior that existing migration code expects of > > > > cmdline-specified devices. > > > > > > > > > > -- > David Gibson | I'll have my music baroque, and my code > david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ > | _way_ _around_! > http://www.ozlabs.org/~dgibson
On Sun, Jun 18, 2017 at 07:37:00AM -0500, Michael Roth wrote: > Quoting David Gibson (2017-06-18 04:59:33) > > On Fri, Jun 16, 2017 at 11:15:46AM -0500, Michael Roth wrote: > > > Quoting David Gibson (2017-06-16 09:40:53) > > > > On Fri, Jun 16, 2017 at 03:53:12PM +0200, Igor Mammedov wrote: > > > > > On Wed, 14 Jun 2017 19:27:12 -0500 > > > > > Michael Roth <mdroth@linux.vnet.ibm.com> wrote: > > > > > > > > > > > Quoting Igor Mammedov (2017-06-14 04:00:01) > > > > > > > On Tue, 13 Jun 2017 16:42:45 -0500 > > > > > > > Michael Roth <mdroth@linux.vnet.ibm.com> wrote: > > > > > > > > > > > > > > > Quoting Igor Mammedov (2017-06-09 03:27:33) > > > > > [...] > > > > > > > > > > > > Currently I'd suggest to look into always migrate DRCs if cpu hotplug > > > > > > > is enabled even if dev->hotplugged is false (not nice but it might work). > > > > > > > Consider: > > > > > > > SRC1: hotplug CPU1 => CPU1.hotplugged = true > > > > > > > DST1: -device CPU1 => CPU1.hotplugged = false > > > > > > > so in current code relying on CPU1.hotplugged would not work as expected, > > > > > > > it works by accident because libvirt uses device_add on target > > > > > > > DST1: device_add CPU1 => CPU1.hotplugged = true > > > > > > > > > > > > It's actually the reverse for us, DST1: -device CPU1 works, because > > > > > > default DRC state for CPU1.hotplugged = false matches the state a > > > > > > hotplugged CPU will be brought to after onlining at the source, so > > > > > > we don't send it over the wire in the first place once it reaches > > > > > > that post-hotplug/coldplug state. So things work as expected, even > > > > > > though technically the source has dev->hotplugged == true, whereas > > > > > > the dest has dev->hotplugged == false. > > > > > in your case it seems fragile to rely on -device setting hotplugged cpu > > > > > on target the way you want. > > > > > > > > > > it could be: > > > > > > > > > > SRC: hotplug CPU and start migration with DST: -device cpu[,hotplugged=false] > > > > > *: machine is in hotplugging state and one would lose transitional state if DRC is not migrated. > > > > > > > > > > > It's the DST1: device_add case that wasn't accounted for when the DRC > > > > > > migration patches were written, as those don't default to coldplug, > > > > > > so, because the source doesn't send it, it ends up being presented > > > > > > in pre-hotplug state because the dest doesn't know that the guest > > > > > > already onlined the resource and transitioned it to > > > > > > post-hotplug/coldplug state. Ironically, dev->hotplugged > > > > > > is true on both source and dest in this case, but it ends up being > > > > > > the broken one. > > > > > it looks like for hotplugged CPUs DRC state should be always migrated > > > > > > > > Yeah, I think in the first instance we should just unconditionally > > > > transfer DRC state for newer machine types (at least once I clean up > > > > what exactly the DRC state _is_). For old machine types things are > > > > still likely to be broken, but it won't be a regression. > > > > > > > > If we can work out a way to fix things for older machine types, that's > > > > a bonus, but it looks like it's very difficult, maybe impossible. > > > > First priority should be sane semantics for the newer machine types. > > > > > > > > > > But your point stands, the fact that both situations are possible > > > > > > means we can't currently rely on dev->hotplugged without migrating > > > > > > it, infering it based on QEMU lifecycle, or forcing management to > > > > > > set it. > > > > > > > > > > > > But that raises a 2nd point. Our dilemma isn't that we can't > > > > > > rely on dev->hotplugged being synchronized (though if it > > > > > > was we could build something around that), our dilemma is > > > > > > that we make the following assumption in our code: > > > > > > > > > > > > "Devices present at start-time will be handled the same way, > > > > > > on source or dest, regardless of whether they were added via > > > > > > cmdline or via device_add prior to machine start / migration > > > > > > stream processing." > > > > > > > > > > > > And I think that's a sensible expectation, since in theory > > > > > > even the source could build up a machine via device_add > > > > > > prior to starting it, and the reasonable default there is > > > > > > dev->hotplugged = false rather than the opposite. That > > > > > > suggests a need to fix things outside of migration. > > > > > Agreed to a degree, i.e. > > > > > > > > > > -device/device_add before machine has been started > > > > > without migration should follow coldplug path > > > > > > > > > > it shouldn't cause problems for CPU/mem hotplug on x86 > > > > > and maybe will work for PCI (it may change behavior of > > > > > ACPI based hotplug and bridges), > > > > > CCing Marcel to confirm. > > > > > > > > So for ppc the main problem with early plugged devices is a > > > > duplication of device tree info between that delivered by CAS, and > > > > that delivered through configure-connector. I see two basic > > > > approaches to fixing this: > > > > > > > > 1) At CAS, reset all DRCs before building the device tree to send to > > > > the guest. That will essentially "convert" everything present at CAS > > > > time into coldplugged device. There might still be a problem if there > > > > are existing hotplug events in the queue from before CAS. > > > > > > > > 2) When building the device tree (at both CAS and reset) check the DRC > > > > state, and omit DT information for anything that's not in CONFIGURED > > > > state. That should be correct, because the guest should call > > > > configure-connector for anything not yet in CONFIGURED state, at which > > > > point it will get the DT information. > > > > > > > > I'm not sure which approach is best yet. I'm not 100% sure of (1) is > > > > correct in all cases, but it is a bit simpler than (2). > > > > > > This fixes up devices added before CAS (whether via -device/device_add) > > > on the source, but at the time of migration we will have (usually) > > > already completed CAS. We also can't repeat the CAS-time fixups on the > > > target, because that would reset any DRC state sent via migration (which > > > would include things like transitional states outside of > > > post-hotplug/coldplug). > > > > Sorry, I wasn't clear. The options above are for the problems with > > hotplug during early boot, they won't help migration cases. > > > > > So either we sent all DRC state (regardless of hotplug or not), or > > > we need a reliable starting state on the dest from which we can > > > determine what needs to be sent by the source. The 2 options are: > > > > So for new machine types, I think we should always send DRC state. > > > > > a) ensure all devices on the dest start off in a coldplug state, at > > > point we can simply send DRC for anything hotplugged. This isn't > > > the case now because device_add on the target results in the initial > > > state being hotplug, where -device's are in coldplug. I'm proposing > > > we ensure both these cases default coldplug state on the target if > > > device_add occurs before the dest is started. > > > > I think that will become the case with my Part IV DRC cleanups. IIRC, > > there is a system reset performed before processing the incoming > > migration. With my DRC changes we reset the DRC state based only on > > There's a system-wide reset, but it happens too early in cases where > libvirt opts to specify devices via device_add on the dest rather than > cmdline. > > In that case the sequence ends up being: > > 1) initialize cmdline devices and run their coldplug hooks > 2) qdev_machine_creation_done(), all devices initialized > after this point will be dev->hotplugged == true > 3) issue system-wide reset > 4) libvirt issues device_add for the CPU we've hotplugged on the source > dev->hotplugged == true. > 5) we set the DRC state to "pre-hotplug" in anticipation that the > that guest needs to finish onlining devices (same way we'd > handle it on the source). Except this is breaks, since the source > was assuming we'd put it in "coldplug/post-hotplug" just like we > would for cmdline-specified devices. That assumption doesn't hold > when libvirt opts to use device_add to build up a machine > 6) dest starts receiving migration stream > > So adding that reset prior to 6) is actually what I'm proposing we > should do. Ah, right. I mistakenly assumed that the reset would happen immediately prior to processing the incoming stream. > (as well as adding a reset prior to machine start in general, even > outside of migration, to handle boot-time devices specified via > device_add, which I think Igor is on board with already. if we have > agreement on both the migration and non-migration cases we can > generalize the approach a bit more) > > > the device presence at reset time. Since the device will be > > hotplugged before the incoming migration, it should be reset to > > coldplug equivalent state. > > My understanding is that your patches move all our DRC-specific > coldplug hooks to a DRC reset() function, so the hotplug handler > will always assume that that scenario is handled by the DRC reset > path. I think that that's indeed the proper way to handle this. Well, the patches I have so far don't quite get there (even including the ones drafted byt not posted). But that's the direction I'm heading in, yes. > But in the scenario above, we won't get that DRC reset() prior > to 6), so that assumption doesn't hold. We need to ensure a > system-wide reset happens prior to 6) even with your proposed > series. > > I'm simplifying a little bit though: technically, we can address > this specific scenario by always migrating DRCs for > dev->hotplugged devices. But that end up being a band-aid: if you > hotplug a CPU on the source, then do a reboot, *then* do > migration you will hit that exact same issue. > > So I really think we need a machine-wide reset prior to 6) > regardless. We probably don't need a true machine-wide reset - just to call the reset() hook on every DRC. > > > > > > b) ensure dev->hotplugged is in sync between src/dest by migrating it > > > or forcing management to set it. Based on that knowledge, we can > > > determine on the src side what the default state will be on the dest, > > > and from that knowledge determine what state needs to be sent. Igor > > > has suggested an approach for handling it this way. > > > > This sounds like a bad idea - I don't know what other effects altering > > this generic qemu state will have. > > > > > Personally I'm leaning toward a), since assuming that dest devices will > > > start in a coldplug state is basically what most of the migration code > > > already does (since up until fairly recently, the hotplugged devices > > > were generally specified via cmdline on the dest and not via > > > device_add) > > > > Yes, I agree. > > > > > > > > So far, all QEMU's existing migration code has managed ok > > > > > > with the dest being starting with dev->hotplugged == false > > > > > > via cmdline devices, even though maybe they were hotplugged > > > > > > on the source. To me, it makes more sense to maintain this > > > > > > behavior by fixing up this relatively new use-case of > > > > > > adding devices via device_add before start to match the > > > > > > same expectations we have around cmdline-specified devices. > > > > > > > > > > > > This would fix migration for spapr, leave it working for > > > > > > everyone else (since that's basically what we expect for > > > > > > everything except newer-style cpu hotplug), and also make > > > > > > the device-add-before-start be truly synonymous with > > > > > > cmdline-created devices (which is applicable even outside > > > > > > of migration). > > > > > Neither -device or device_add can't really bring DRC/CPU into > > > > > the state that devices might be at the moment when their state > > > > > is transferred to target. > > > > > > > > > > i.e. any state that has been changed after -device/device_add > > > > > on SRC, should be in migration stream. I'd say even if state would > > > > > eventually go back default (coldplugged) when hotplug is completed. > > > > > So trying to avoid transmitting runtime state to optimize some bytes > > > > > on migration stream is just asking for trouble. > > > > > > > > Yeah, I'm coming to the same conclusion. Note that the reason for > > > > omitting state wasn't to save space in the migration stream, but to > > > > make the stream compatible with older versions which never transmit > > > > the state - at least in as many cases as possible. > > > > > > > > > [...] > > > > > > > > > > > > > > May be we should > > > > > > > 1. make DeviceState:hotpluggable property write-able again > > > > > > > 2. transmit DeviceState:hotpluggable as part of migration stream > > > > > > > 3. add generic migration hook which will check if target and > > > > > > > source value match, if value differs => fail/abort migration. > > > > > > > 4. in case values mismatch mgmt will be forced to explicitly > > > > > > > provide hotplugged property value on -device/device_add > > > > > > > That would enforce consistent DeviceState:hotpluggable value > > > > > > > on target and source. > > > > > > > We can enforce it only for new machine types so it won't break > > > > > > > old mgmt tools with old machine types but would force mgmt > > > > > > > for new machines to use hotplugged property on target > > > > > > > so QEMU could rely on its value for migration purposes. > > > > > > > > > > > > > > > > > > > That would work, and generalizing this beyond spapr seems > > > > > > appropriate. > > > > > > > > > > > > It also has reasonable semantics, and it would work for us > > > > > > *provided that* we always send DRC state for hotplugged devices > > > > > > and not just DRCs in a transitional state: > > > > > > > > > > > > SRC1: device_add $cpu > > > > > > -> dev->hotplugged == true > > > > > > -> device starts in pre-hotplug, ends up in post-hotplug state > > > > > > after guest onlines it > > > > > > <migrate> > > > > > > DST1: device_add $cpu,hotplugged=true > > > > > > -> dev->hotplugged == true > > > > > > -> device starts in pre-hotplug state. guest sends updated state > > > > > > to transition DRC to post-hotplug > > > > > > > > > > > > But what about stuff like mem/pci? Currently, migration works for > > > > > > cases like: > > > > > > > > > > > > SRC1: device_add virtio-net-pci > > > > > > DST1: qemu -device virtio-net-pci > > > > > > > > > > > > Even though DST1 has dev->hotplugged == false, and SRC1 has the > > > > > > opposite. So for new machines, checking SRC1:dev->hotplugged == > > > > > > DST1:dev->hotplugged would fail, even though the migration > > > > > > scenario is unchanged from before. > > > > > > > > > > > > So management would now have to do: > > > > > > > > > > > > SRC1: device_add virtio-net-pci > > > > > > DST1: qemu -device virtio-net-pci,hotplugged=true > > > > > > > > > > > > But the code behavior is a bit different then, since we now get > > > > > > an ACPI hotplug event via the hotplug handler. Maybe the > > > > > > migration stream fixes that up for us, but I think we would need > > > > > > to audit this and similar cases to be sure. > > > > > > > > > > > > That's all fine if it's necessary, but I feel like this is > > > > > > the hard way to address what's actually a much more specific > > > > > > issue: that device_add before machine-start doesn't currently > > > > > > match the behavior for a device started via cmdline. i.e. > > > > > > dev->hotplugged in the former vs. !dev->hotplugged in the > > > > > > latter. I don't really see a good reason these 2 cases should > > > > > > be different, and we can bring them to parity by doing > > > > > > something like: > > > > > > > > > > > > 1. For device_adds after qdev_machine_creation_done(), but > > > > > > before machine start, set a flag: reset_before_start. > > > > > > 2. At the start of processing migration stream, or unpausing > > > > > > a -S guest (in the non-migration case), issue a system-wide > > > > > > reset if reset_before_start is set. > > > > > > 3. reset handlers will already unset dev->hotplugged at that > > > > > > point and re-execute all the hotplug hooks with > > > > > > dev->hotplugged == false. This should put everything in > > > > > > a state that's identical to cmdline-created devices. > > > > > instead of flag for non migration case we could use > > > > > RUN_STATE_PRELAUNCH -> RUN_STATE_RUNNING > > > > > transition to reset all devices or > > > > > maybe do something like this: > > > > > > > > Hrm, does the general reset call happen now before or after this > > > > transition? Resetting DRCs at CAS time should accomplish the same thing. > > > > > > > > > diff --git a/hw/core/qdev.c b/hw/core/qdev.c > > > > > index 0ce45a2..cdeb8f8 100644 > > > > > --- a/hw/core/qdev.c > > > > > +++ b/hw/core/qdev.c > > > > > > > > > > @@ -1020,7 +1010,7 @@ static void device_initfn(Object *obj) > > > > > ObjectClass *class; > > > > > Property *prop; > > > > > > > > > > - if (qdev_hotplug) { > > > > > + if (runstate_check(RUN_STATE_RUNNING) || ...) { > > > > > dev->hotplugged = 1; > > > > > qdev_hot_added = true; > > > > > } > > > > > > > > > > > 4. Only allow management to do device_add before it sends > > > > > > the migration stream (if management doesn't already guard > > > > > > against this then it's probably a bug anyway) > > > > > seems like Juan already took care of it. > > > > > > > > > > > This allows management to treat device_add/cmdline as being > > > > > > completely synonymous for guests that haven't started yet, > > > > > > both for -incoming and -S in general, and it maintains > > > > > > the behavior that existing migration code expects of > > > > > > cmdline-specified devices. > > > > > > > > > > > > > > > > > > > >
On 16/06/2017 16:53, Igor Mammedov wrote: > On Wed, 14 Jun 2017 19:27:12 -0500 > Michael Roth <mdroth@linux.vnet.ibm.com> wrote: > >> Quoting Igor Mammedov (2017-06-14 04:00:01) >>> On Tue, 13 Jun 2017 16:42:45 -0500 >>> Michael Roth <mdroth@linux.vnet.ibm.com> wrote: >>> >>>> Quoting Igor Mammedov (2017-06-09 03:27:33) [...] >> But that raises a 2nd point. Our dilemma isn't that we can't >> rely on dev->hotplugged being synchronized (though if it >> was we could build something around that), our dilemma is >> that we make the following assumption in our code: >> >> "Devices present at start-time will be handled the same way, >> on source or dest, regardless of whether they were added via >> cmdline or via device_add prior to machine start / migration >> stream processing." >> >> And I think that's a sensible expectation, since in theory >> even the source could build up a machine via device_add >> prior to starting it, and the reasonable default there is >> dev->hotplugged = false rather than the opposite. That >> suggests a need to fix things outside of migration. > Agreed to a degree, i.e. > > -device/device_add before machine has been started > without migration should follow coldplug path > > it shouldn't cause problems for CPU/mem hotplug on x86 > and maybe will work for PCI (it may change behavior of > ACPI based hotplug and bridges), > CCing Marcel to confirm. > As long as machine_done is not triggered and the dev->hotplugged is not set I suppose it will work for PCI. But why allow device_add before the machine is running? I don't see the benefits for it. I did see in the mail thread that libvirt may build the machine using device_add, but again, what is the gain? What you can do before the machine is running do it at command line. I would also not allow hot-plugging once migration is started and delay migration until we finish all hot-plugging operations (not directly related to this thread). Thanks, Marcel > [...]
On 06/19/2017 09:40 AM, Marcel Apfelbaum wrote: > On 16/06/2017 16:53, Igor Mammedov wrote: >> On Wed, 14 Jun 2017 19:27:12 -0500 >> Michael Roth <mdroth@linux.vnet.ibm.com> wrote: >> >>> Quoting Igor Mammedov (2017-06-14 04:00:01) >>>> On Tue, 13 Jun 2017 16:42:45 -0500 >>>> Michael Roth <mdroth@linux.vnet.ibm.com> wrote: >>>>> Quoting Igor Mammedov (2017-06-09 03:27:33) > > [...] > >>> But that raises a 2nd point. Our dilemma isn't that we can't >>> rely on dev->hotplugged being synchronized (though if it >>> was we could build something around that), our dilemma is >>> that we make the following assumption in our code: >>> >>> "Devices present at start-time will be handled the same way, >>> on source or dest, regardless of whether they were added via >>> cmdline or via device_add prior to machine start / migration >>> stream processing." >>> >>> And I think that's a sensible expectation, since in theory >>> even the source could build up a machine via device_add >>> prior to starting it, and the reasonable default there is >>> dev->hotplugged = false rather than the opposite. That >>> suggests a need to fix things outside of migration. >> Agreed to a degree, i.e. >> >> -device/device_add before machine has been started >> without migration should follow coldplug path >> >> it shouldn't cause problems for CPU/mem hotplug on x86 >> and maybe will work for PCI (it may change behavior of >> ACPI based hotplug and bridges), >> CCing Marcel to confirm. >> > > As long as machine_done is not triggered and the dev->hotplugged > is not set I suppose it will work for PCI. > > But why allow device_add before the machine is running? > I don't see the benefits for it. I did see in the mail thread > that libvirt may build the machine using device_add, but > again, what is the gain? What you can do before the machine is running > do it at command line. This takes us back to an older discussion about this subject. If I recall correctly the only reason we're allowing this is because Libvirt does it. At that time there where no mentions of any other use case or benefits from allowing it in PPC. I haven't had the time to check it but if Libvirt can get away with device_add before the machine is running in x86 (or any other arch for instance) then supporting this scenario in PPC makes management happier, avoiding more arch specific code from their side to handle migration and hotplugs. A discussion that might be worth having is why support device_add/del operations before the machine is started in any architecture at all, not just pseries. Not sure if this is the right moment/thread for that though. Thanks, Daniel > > I would also not allow hot-plugging once migration is started > and delay migration until we finish all hot-plugging operations (not > directly related to this thread). > > > Thanks, > Marcel > > >> > > > [...] >
diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c index 0980d73..f1302d0 100644 --- a/hw/ppc/spapr.c +++ b/hw/ppc/spapr.c @@ -2511,6 +2511,12 @@ static void spapr_nmi(NMIState *n, int cpu_index, Error **errp) } } +static bool spapr_coldplugged(DeviceState *dev) +{ + return runstate_check(RUN_STATE_INMIGRATE) || + !dev->hotplugged; +} + static void spapr_add_lmbs(DeviceState *dev, uint64_t addr_start, uint64_t size, uint32_t node, bool dedicated_hp_event_source, Error **errp) @@ -2521,6 +2527,7 @@ static void spapr_add_lmbs(DeviceState *dev, uint64_t addr_start, uint64_t size, int i, fdt_offset, fdt_size; void *fdt; uint64_t addr = addr_start; + bool coldplugged = spapr_coldplugged(dev); for (i = 0; i < nr_lmbs; i++) { drc = spapr_dr_connector_by_id(SPAPR_DR_CONNECTOR_TYPE_LMB, @@ -2532,9 +2539,9 @@ static void spapr_add_lmbs(DeviceState *dev, uint64_t addr_start, uint64_t size, SPAPR_MEMORY_BLOCK_SIZE); drck = SPAPR_DR_CONNECTOR_GET_CLASS(drc); - drck->attach(drc, dev, fdt, fdt_offset, !dev->hotplugged, errp); + drck->attach(drc, dev, fdt, fdt_offset, coldplugged, errp); addr += SPAPR_MEMORY_BLOCK_SIZE; - if (!dev->hotplugged) { + if (coldplugged) { /* guests expect coldplugged LMBs to be pre-allocated */ drck->set_allocation_state(drc, SPAPR_DR_ALLOCATION_STATE_USABLE); drck->set_isolation_state(drc, SPAPR_DR_ISOLATION_STATE_UNISOLATED); @@ -2543,7 +2550,7 @@ static void spapr_add_lmbs(DeviceState *dev, uint64_t addr_start, uint64_t size, /* send hotplug notification to the * guest only in case of hotplugged memory */ - if (dev->hotplugged) { + if (!coldplugged) { if (dedicated_hp_event_source) { drc = spapr_dr_connector_by_id(SPAPR_DR_CONNECTOR_TYPE_LMB, addr_start / SPAPR_MEMORY_BLOCK_SIZE); @@ -2776,6 +2783,7 @@ static void spapr_core_plug(HotplugHandler *hotplug_dev, DeviceState *dev, int smt = kvmppc_smt_threads(); CPUArchId *core_slot; int index; + bool coldplugged = spapr_coldplugged(dev); core_slot = spapr_find_cpu_slot(MACHINE(hotplug_dev), cc->core_id, &index); if (!core_slot) { @@ -2797,7 +2805,7 @@ static void spapr_core_plug(HotplugHandler *hotplug_dev, DeviceState *dev, if (drc) { sPAPRDRConnectorClass *drck = SPAPR_DR_CONNECTOR_GET_CLASS(drc); - drck->attach(drc, dev, fdt, fdt_offset, !dev->hotplugged, &local_err); + drck->attach(drc, dev, fdt, fdt_offset, coldplugged, &local_err); if (local_err) { g_free(fdt); error_propagate(errp, local_err); @@ -2805,7 +2813,7 @@ static void spapr_core_plug(HotplugHandler *hotplug_dev, DeviceState *dev, } } - if (dev->hotplugged) { + if (!coldplugged) { /* * Send hotplug notification interrupt to the guest only in case * of hotplugged CPUs. @@ -2838,7 +2846,7 @@ static void spapr_core_pre_plug(HotplugHandler *hotplug_dev, DeviceState *dev, int node_id; int index; - if (dev->hotplugged && !mc->has_hotpluggable_cpus) { + if (!spapr_coldplugged(dev) && !mc->has_hotpluggable_cpus) { error_setg(&local_err, "CPU hotplug not supported for this machine"); goto out; }
For QEMU, a hotlugged device is a device added using the HMP/QMP interface. For SPAPR, a hotplugged device is a device added while the machine is running. In this case QEMU doesn't update internal state but relies on the OS for this part In the case of migration, when we (libvirt) hotplug a device on the source guest, we (libvirt) generally hotplug the same device on the destination guest. But in this case, the machine is stopped (RUN_STATE_INMIGRATE) and QEMU must not expect the OS will manage it as an hotplugged device as it will be "imported" by the migration. This patch changes the meaning of "hotplugged" in spapr.c to manage a QEMU hotplugged device like a "coldplugged" one when the machine is awaiting an incoming migration. Signed-off-by: Laurent Vivier <lvivier@redhat.com> --- hw/ppc/spapr.c | 20 ++++++++++++++------ 1 file changed, 14 insertions(+), 6 deletions(-)