mbox series

[v2,00/53] migration/rdma: Error handling fixes

Message ID 20230928132019.2544702-1-armbru@redhat.com
Headers show
Series migration/rdma: Error handling fixes | expand

Message

Markus Armbruster Sept. 28, 2023, 1:19 p.m. UTC
Oh dear, where to start.  There's so much wrong, and in pretty obvious
ways.  This code should never have passed review.  I'm refraining from
saying more; see the commit messages instead.

Issues remaining after this series include:

* Terrible error messages

* Some error message cascades remain

* There is no written contract for QEMUFileHooks, and the
  responsibility for reporting errors is unclear

* There seem to be no tests whatsoever

PATCH 29 is arguably a matter of taste.  I made my case for it during
review of v1.  If maintainers don't want it, I'll drop it.

Related: [PATCH 1/7] migration/rdma: Fix save_page method to fail on
polling error

v2:
* PATCH 06: New
* PATCH 07: Tweak local variables in _readv() and _writev() for
  consistency [Fabiano]
* PATCH 13: Drop a comment that has become useless [Fabiano]
* PATCH 14: Fix spelling of qemu_rdma_buffer_mergeable() [Eric]
* PATCH 16: New; adding temporary FIXMEs
* PATCH 17: Fix typo in commit message [Li Zhijian]
* PATCH 26: Squash in old PATCH 23 [Li Zhijian]
* PATCH 27: Fix type of RDMAContext member @errored [Eric]
* PATCH 40: Resolve a temporary FIXME
* PATCH 44: Drop advice on qemu_rdma_alloc_pd_cq() failure in
  qemu_rdma_source_init() [Li Zhijian]
* PATCH 46+49: Resolve a temporary FIXME
* PATCH 51: Downgrade to warning instead of silence [Li Zhijian]
* PATCH 52: Resolve the last temporary FIXME
* PATCH 53: Convert to tracing instead [Li Zhijian]

Markus Armbruster (53):
  migration/rdma: Clean up qemu_rdma_poll()'s return type
  migration/rdma: Clean up qemu_rdma_data_init()'s return type
  migration/rdma: Clean up rdma_delete_block()'s return type
  migration/rdma: Drop fragile wr_id formatting
  migration/rdma: Consistently use uint64_t for work request IDs
  migration/rdma: Fix unwanted integer truncation
  migration/rdma: Clean up two more harmless signed vs. unsigned issues
  migration/rdma: Give qio_channel_rdma_source_funcs internal linkage
  migration/rdma: Fix qemu_rdma_accept() to return failure on errors
  migration/rdma: Put @errp parameter last
  migration/rdma: Eliminate error_propagate()
  migration/rdma: Drop rdma_add_block() error handling
  migration/rdma: Drop qemu_rdma_search_ram_block() error handling
  migration/rdma: Make qemu_rdma_buffer_mergeable() return bool
  migration/rdma: Use bool for two RDMAContext flags
  migration/rdma: Fix or document problematic uses of errno
  migration/rdma: Ditch useless numeric error codes in error messages
  migration/rdma: Fix io_writev(), io_readv() methods to obey contract
  migration/rdma: Replace dangerous macro CHECK_ERROR_STATE()
  migration/rdma: Fix qemu_rdma_broken_ipv6_kernel() to set error
  migration/rdma: Fix qemu_get_cm_event_timeout() to always set error
  migration/rdma: Drop dead qemu_rdma_data_init() code for !@host_port
  migration/rdma: Fix QEMUFileHooks method return values
  migration/rdma: Fix rdma_getaddrinfo() error checking
  migration/rdma: Return -1 instead of negative errno code
  migration/rdma: Dumb down remaining int error values to -1
  migration/rdma: Replace int error_state by bool errored
  migration/rdma: Drop superfluous assignments to @ret
  migration/rdma: Check negative error values the same way everywhere
  migration/rdma: Plug a memory leak and improve a message
  migration/rdma: Delete inappropriate error_report() in macro ERROR()
  migration/rdma: Retire macro ERROR()
  migration/rdma: Fix error handling around rdma_getaddrinfo()
  migration/rdma: Drop "@errp is clear" guards around error_setg()
  migration/rdma: Convert qemu_rdma_exchange_recv() to Error
  migration/rdma: Convert qemu_rdma_exchange_send() to Error
  migration/rdma: Convert qemu_rdma_exchange_get_response() to Error
  migration/rdma: Convert qemu_rdma_reg_whole_ram_blocks() to Error
  migration/rdma: Convert qemu_rdma_write_flush() to Error
  migration/rdma: Convert qemu_rdma_write_one() to Error
  migration/rdma: Convert qemu_rdma_write() to Error
  migration/rdma: Convert qemu_rdma_post_send_control() to Error
  migration/rdma: Convert qemu_rdma_post_recv_control() to Error
  migration/rdma: Convert qemu_rdma_alloc_pd_cq() to Error
  migration/rdma: Silence qemu_rdma_resolve_host()
  migration/rdma: Silence qemu_rdma_connect()
  migration/rdma: Silence qemu_rdma_reg_control()
  migration/rdma: Don't report received completion events as error
  migration/rdma: Silence qemu_rdma_block_for_wrid()
  migration/rdma: Silence qemu_rdma_register_and_get_keys()
  migration/rdma: Downgrade qemu_rdma_cleanup() errors to warnings
  migration/rdma: Use error_report() & friends instead of stderr
  migration/rdma: Replace flawed device detail dump by tracing

 migration/rdma.c       | 1026 ++++++++++++++++++++--------------------
 migration/trace-events |   10 +-
 2 files changed, 514 insertions(+), 522 deletions(-)

Comments

Juan Quintela Oct. 4, 2023, 5:52 p.m. UTC | #1
Markus Armbruster <armbru@redhat.com> wrote:
> Oh dear, where to start.  There's so much wrong, and in pretty obvious
> ways.  This code should never have passed review.  I'm refraining from
> saying more; see the commit messages instead.
>
> Issues remaining after this series include:
>
> * Terrible error messages
>
> * Some error message cascades remain
>
> * There is no written contract for QEMUFileHooks, and the
>   responsibility for reporting errors is unclear
>
> * There seem to be no tests whatsoever
>
> PATCH 29 is arguably a matter of taste.  I made my case for it during
> review of v1.  If maintainers don't want it, I'll drop it.
>
> Related: [PATCH 1/7] migration/rdma: Fix save_page method to fail on
> polling error

Hi Markus

I integrated everything except:

>   migration/rdma: Fix or document problematic uses of errno

Most of them are dropped on following patches.

>   migration/rdma: Use error_report() & friends instead of stderr

You said you have to resend this one.

There were some conflicts, I was careful, but one never knows.  So you
are wellcome to take a look when the PULL cames to the tree.

Thanks, Juan.
Markus Armbruster Oct. 5, 2023, 5:07 a.m. UTC | #2
Juan Quintela <quintela@redhat.com> writes:

> Markus Armbruster <armbru@redhat.com> wrote:
>> Oh dear, where to start.  There's so much wrong, and in pretty obvious
>> ways.  This code should never have passed review.  I'm refraining from
>> saying more; see the commit messages instead.
>>
>> Issues remaining after this series include:
>>
>> * Terrible error messages
>>
>> * Some error message cascades remain
>>
>> * There is no written contract for QEMUFileHooks, and the
>>   responsibility for reporting errors is unclear
>>
>> * There seem to be no tests whatsoever
>>
>> PATCH 29 is arguably a matter of taste.  I made my case for it during
>> review of v1.  If maintainers don't want it, I'll drop it.
>>
>> Related: [PATCH 1/7] migration/rdma: Fix save_page method to fail on
>> polling error
>
> Hi Markus
>
> I integrated everything except:
>
>>   migration/rdma: Fix or document problematic uses of errno
>
> Most of them are dropped on following patches.

The hunks that are dropped in later patches are:

* Four FIXME comments about incorrect or problematic use of perror().

  If you drop the patch, you have to adjust the later patches that
  remove these hunks.  Resolving the conflicts is *not* enough; you also
  have to correct the commit messages.

The hunks that are not dropped are:

* Three comments about bugs (either library doc bug or incorrect use of
  @errno here).  I'd hate to lose them.

* One bug fix, in qemu_rdma_advise_prefetch_mr().  Losing this one would
  be foolish.

Please consider keeping the patch.

>>   migration/rdma: Use error_report() & friends instead of stderr
>
> You said you have to resend this one.

Can do, but since the change is trivial, perhaps you could make it in
your tree without a resend.  Change the line

                warn_report("WARN: migrations may fail:"

to

                warn_report("migrations may fail:"

> There were some conflicts, I was careful, but one never knows.  So you
> are wellcome to take a look when the PULL cames to the tree.
>
> Thanks, Juan.
Juan Quintela Oct. 5, 2023, 6:37 a.m. UTC | #3
Markus Armbruster <armbru@redhat.com> wrote:
> Juan Quintela <quintela@redhat.com> writes:
>
>> Markus Armbruster <armbru@redhat.com> wrote:
>>> Oh dear, where to start.  There's so much wrong, and in pretty obvious
>>> ways.  This code should never have passed review.  I'm refraining from
>>> saying more; see the commit messages instead.
>>>
>>> Issues remaining after this series include:
>>>
>>> * Terrible error messages
>>>
>>> * Some error message cascades remain
>>>
>>> * There is no written contract for QEMUFileHooks, and the
>>>   responsibility for reporting errors is unclear
>>>
>>> * There seem to be no tests whatsoever
>>>
>>> PATCH 29 is arguably a matter of taste.  I made my case for it during
>>> review of v1.  If maintainers don't want it, I'll drop it.
>>>
>>> Related: [PATCH 1/7] migration/rdma: Fix save_page method to fail on
>>> polling error
>>
>> Hi Markus
>>
>> I integrated everything except:
>>
>>>   migration/rdma: Fix or document problematic uses of errno
>>
>> Most of them are dropped on following patches.
>
> The hunks that are dropped in later patches are:
>
> * Four FIXME comments about incorrect or problematic use of perror().
>
>   If you drop the patch, you have to adjust the later patches that
>   remove these hunks.  Resolving the conflicts is *not* enough; you also
>   have to correct the commit messages.
>
> The hunks that are not dropped are:
>
> * Three comments about bugs (either library doc bug or incorrect use of
>   @errno here).  I'd hate to lose them.
>
> * One bug fix, in qemu_rdma_advise_prefetch_mr().  Losing this one would
>   be foolish.
>
> Please consider keeping the patch.

And here I am, having to redo the merge from this patch O:-)

>>>   migration/rdma: Use error_report() & friends instead of stderr
>>
>> You said you have to resend this one.
>
> Can do, but since the change is trivial, perhaps you could make it in
> your tree without a resend.  Change the line
>
>                 warn_report("WARN: migrations may fail:"
>
> to
>
>                 warn_report("migrations may fail:"

I thought it was more complicated.

Ok doing both.

Thanks, Juan.