
[for-9.1,v5,00/14] migration: Improve error reporting

Message ID 20240320064911.545001-1-clg@redhat.com

Message

Cédric Le Goater March 20, 2024, 6:48 a.m. UTC
Hello,

The motivation behind these changes is to improve error reporting to
the upper management layer (libvirt) by providing a more detailed
error, so that it can decide, depending on the reported error, whether
to try migration again later. This would be useful in cases where
migration fails due to a lack of HW resources on the host. For
instance, some adapters can only initiate a limited number of
simultaneous dirty tracking requests, which imposes a limit on the
number of VMs that can be migrated simultaneously.

We are not quite ready for such a mechanism, but what we can do first is
to clean up the error reporting in the early save_setup sequence. This
is what the following changes propose, by adding an Error** argument to
various handlers and propagating it to the core migration subsystem.
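
As an illustration of the pattern described above, here is a minimal,
self-contained sketch. It does not use QEMU's real Error API: the Error
struct, toy_error_setg(), the SaveSetupFn type and the two handlers are
simplified, hypothetical stand-ins that only mimic the convention of a
handler filling an Error** on failure and the core loop handing that
error back to its caller unchanged.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

/* Toy stand-in for an error object: just a formatted message. */
typedef struct Error {
    char msg[128];
} Error;

/* Toy counterpart of an error setter: store a message in *errp. */
static void toy_error_setg(Error **errp, const char *fmt, ...)
{
    va_list ap;

    if (!errp) {
        return;                 /* caller does not want the details */
    }
    assert(*errp == NULL);      /* an error must not be overwritten */
    *errp = calloc(1, sizeof(**errp));
    if (!*errp) {
        abort();
    }
    va_start(ap, fmt);
    vsnprintf((*errp)->msg, sizeof((*errp)->msg), fmt, ap);
    va_end(ap);
}

/* Hypothetical handler type: 0 on success, negative on failure. */
typedef int (*SaveSetupFn)(void *opaque, Error **errp);

/* A handler that always succeeds and never touches errp. */
static int ram_like_setup(void *opaque, Error **errp)
{
    (void)opaque;
    (void)errp;
    return 0;
}

/* A handler that fails, with an error set, when resources run out. */
static int device_like_setup(void *opaque, Error **errp)
{
    int *free_trackers = opaque;

    if (*free_trackers == 0) {
        toy_error_setg(errp, "no dirty tracking resources left on host");
        return -1;
    }
    (*free_trackers)--;
    return 0;
}

/* Core loop: stop at the first failure; errp already holds the error. */
static int run_save_setup(SaveSetupFn *handlers, void **opaques,
                          size_t n, Error **errp)
{
    for (size_t i = 0; i < n; i++) {
        int ret = handlers[i](opaques[i], errp);
        if (ret < 0) {
            return ret;
        }
    }
    return 0;
}
```

A caller would pass the address of a NULL-initialized Error pointer to
run_save_setup() and, on a negative return, report err->msg before
freeing it. The point of the sketch is that a failing handler never
returns an error code without also setting the error.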


The patchset is organized as follows:

* [1-5] are prerequisite changes in other components related to the
  migration save_setup() handler. They make sure a failure is not
  returned without setting an error.

  s390/stattrib: Add Error** argument to set_migrationmode() handler
  vfio: Always report an error in vfio_save_setup()
  migration: Always report an error in block_save_setup()
  migration: Always report an error in ram_save_setup()
  migration: Add Error** argument to vmstate_save()

* [6-14] are the core changes in migration and memory components to
  propagate an error reported in a save_setup() handler.

  migration: Add Error** argument to qemu_savevm_state_setup()
  migration: Add Error** argument to .save_setup() handler
  migration: Add Error** argument to .load_setup() handler
  memory: Add Error** argument to .log_global_start() handler
  migration: Introduce ram_bitmaps_destroy()
  memory: Add Error** argument to the global_dirty_log routines
  migration: Add Error** argument to ram_state_init()
  migration: Add Error** argument to xbzrle_init()
  migration: Modify ram_init_bitmaps() to report dirty tracking errors

The VFIO changes depend on the above. They are simpler and have been
reviewed already. I kept them for another series.

Thanks,

C.

Changes in v5:
 
 - Rebased on 2e128776dc56 ("migration: Skip only empty block devices")
 - Removed Fabiano's R-b because of changes 
 - Handled qemu_savevm_state_setup() failures after waiting for
   virtio-net-failover devices to unplug.
 - Removed memory_global_dirty_log_rollback()
 - Introduced memory_global_dirty_log_do_start() to call
   .log_global_start() handlers and do the rollback in case of error.
 - Kept modification of the global_dirty_tracking flag within
   memory_global_dirty_log_start()  
 - Added an assert on error of a .log_global_start() handler in
   listener_add_address_space()
 - Removed Yong Huang's R-b
 - Introduced ram_bitmaps_destroy()
 - Added Error** argument to ram_state_init() and xbzrle_init()
 - Made use of ram_bitmaps_destroy() in ram_init_bitmaps() to clean up
   allocated bitmaps
 - Took into account changes of ram_state_init() and xbzrle_init() to
   propagate the error.
 - Reduced series to migration. VFIO can come later. 
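
The start-then-rollback behaviour mentioned above for
memory_global_dirty_log_do_start() can be sketched as follows. This is
not the code from the series: ToyListener, its fields and the toy_*
functions are hypothetical, and undoing the already started listeners
in reverse order is an assumption made for the illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical listener with a start handler that may fail. */
typedef struct ToyListener {
    const char *name;
    bool fail_on_start;     /* simulate a handler that cannot start */
    bool started;
} ToyListener;

static bool toy_log_global_start(ToyListener *l)
{
    if (l->fail_on_start) {
        return false;
    }
    l->started = true;
    return true;
}

static void toy_log_global_stop(ToyListener *l)
{
    assert(l->started);     /* only stop what was actually started */
    l->started = false;
}

/*
 * Start dirty logging on every listener. If one handler fails, stop
 * the listeners started so far and report which listener failed.
 */
static bool toy_dirty_log_do_start(ToyListener *ls, size_t n,
                                   const char **failed)
{
    for (size_t i = 0; i < n; i++) {
        if (!toy_log_global_start(&ls[i])) {
            if (failed) {
                *failed = ls[i].name;
            }
            /* Roll back listeners [0, i) in reverse order. */
            while (i-- > 0) {
                toy_log_global_stop(&ls[i]);
            }
            return false;
        }
    }
    return true;
}
```

With three listeners where the third one fails, the function returns
false and leaves none of them started, so the caller sees either a
fully started set or an untouched one.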

Changes in v4:

 - Fixed frenchism futur to future
 - Fixed typo in set_migrationmode() handler
 - Added error_free() in hmp_migrationmode()
 - Fixed state name printed out in error returned by vfio_save_setup()
 - Fixed test on error returned by qemu_file_get_error()
 - Added an error when bdrv_nb_sectors() returns a negative value 
 - Dropped log_global_stop() and log_global_sync() changes
 - Dropped MEMORY_LISTENER_CALL_LOG_GLOBAL
 - Modified memory_global_dirty_log_start() to loop on the list of
   listeners and handle errors directly.
 - Introduced memory_global_dirty_log_rollback() to revert operations
   previously done

Changes in v3:

 - New changes to make sure an error is always set in case of failure.
   This is the reason behind the 5/6 extra patches. (Markus)
 - Documentation fixup (Peter + Avihai)
 - Set migration state to MIGRATION_STATUS_FAILED always
 - Fixed error handling in bg_migration_thread() (Peter)
 - Fixed return value of vfio_listener_log_global_start/stop(). 
   Went unnoticed because value is not tested. (Peter)
 - Added ERRP_GUARD() when error_prepend() is used
 - Used error_setg_errno() when possible
    
Changes in v2:

- Removed v1 patches addressing the return-path thread termination as
  they are now superseded by:
  https://lore.kernel.org/qemu-devel/20240226203122.22894-1-farosas@suse.de/
- Documentation updates of handlers
- Removed call to PRECOPY_NOTIFY_SETUP notifiers in case of errors
- Modified routines taking an Error** argument to return a bool when
  possible and made adjustments in callers.
- Added new MEMORY_LISTENER_CALL_LOG_GLOBAL macro for .log_global*()
  handlers
- Handled SETUP state when migration terminates
- Modified memory_get_xlat_addr() to take an Error** argument
- Various refinements on error handling

Cédric Le Goater (14):
  s390/stattrib: Add Error** argument to set_migrationmode() handler
  vfio: Always report an error in vfio_save_setup()
  migration: Always report an error in block_save_setup()
  migration: Always report an error in ram_save_setup()
  migration: Add Error** argument to vmstate_save()
  migration: Add Error** argument to qemu_savevm_state_setup()
  migration: Add Error** argument to .save_setup() handler
  migration: Add Error** argument to .load_setup() handler
  memory: Add Error** argument to .log_global_start() handler
  migration: Introduce ram_bitmaps_destroy()
  memory: Add Error** argument to the global_dirty_log routines
  migration: Add Error** argument to ram_state_init()
  migration: Add Error** argument to xbzrle_init()
  migration: Modify ram_init_bitmaps() to report dirty tracking errors

 include/exec/memory.h                 |  10 ++-
 include/hw/s390x/storage-attributes.h |   2 +-
 include/migration/register.h          |   6 +-
 migration/savevm.h                    |   2 +-
 hw/i386/xen/xen-hvm.c                 |   5 +-
 hw/ppc/spapr.c                        |   2 +-
 hw/s390x/s390-stattrib-kvm.c          |  12 ++-
 hw/s390x/s390-stattrib.c              |  15 ++--
 hw/vfio/common.c                      |   4 +-
 hw/vfio/migration.c                   |  29 +++++--
 hw/virtio/vhost.c                     |   3 +-
 migration/block-dirty-bitmap.c        |   4 +-
 migration/block.c                     |  17 +++--
 migration/dirtyrate.c                 |  13 +++-
 migration/migration.c                 |  33 +++++++-
 migration/ram.c                       | 106 +++++++++++++++++---------
 migration/savevm.c                    |  57 ++++++++------
 system/memory.c                       |  40 +++++++++-
 18 files changed, 261 insertions(+), 99 deletions(-)

Comments

Peter Xu March 22, 2024, 1:42 p.m. UTC | #1
On Wed, Mar 20, 2024 at 07:48:56AM +0100, Cédric Le Goater wrote:

queued for 9.1, thanks.