diff mbox series

[v3,08/10] migration: handle the error condition properly

Message ID 20180807091209.13531-9-xiaoguangrong@tencent.com
State New
Headers show
Series migration: compression optimization | expand

Commit Message

Xiao Guangrong Aug. 7, 2018, 9:12 a.m. UTC
From: Xiao Guangrong <xiaoguangrong@tencent.com>

ram_find_and_save_block() can return negative if any error hanppens,
however, it is completely ignored in current code

Signed-off-by: Xiao Guangrong <xiaoguangrong@tencent.com>
---
 migration/ram.c | 18 +++++++++++++++---
 1 file changed, 15 insertions(+), 3 deletions(-)

Comments

Peter Xu Aug. 8, 2018, 5:08 a.m. UTC | #1
On Tue, Aug 07, 2018 at 05:12:07PM +0800, guangrong.xiao@gmail.com wrote:
> From: Xiao Guangrong <xiaoguangrong@tencent.com>
> 
> ram_find_and_save_block() can return negative if any error hanppens,
> however, it is completely ignored in current code

Could you hint me where we'll return an error?

(Anyway I agree that the error handling is not that good, mostly
 because the QEMUFile APIs does not provide proper return code, e.g.,
 qemu_put_be64 returns void)

> 
> Signed-off-by: Xiao Guangrong <xiaoguangrong@tencent.com>
> ---
>  migration/ram.c | 18 +++++++++++++++---
>  1 file changed, 15 insertions(+), 3 deletions(-)
> 
> diff --git a/migration/ram.c b/migration/ram.c
> index 55966bc2c1..09be01dca2 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -2367,7 +2367,8 @@ static int ram_save_host_page(RAMState *rs, PageSearchStatus *pss,
>   *
>   * Called within an RCU critical section.
>   *
> - * Returns the number of pages written where zero means no dirty pages
> + * Returns the number of pages written where zero means no dirty pages,
> + * or negative on error
>   *
>   * @rs: current RAM state
>   * @last_stage: if we are at the completion stage
> @@ -3202,6 +3203,12 @@ static int ram_save_iterate(QEMUFile *f, void *opaque)
>              done = 1;
>              break;
>          }
> +
> +        if (pages < 0) {
> +            qemu_file_set_error(f, pages);
> +            break;
> +        }
> +
>          rs->iterations++;
>  
>          /* we want to check in the 1st loop, just in case it was the 1st time
> @@ -3243,7 +3250,7 @@ out:
>  /**
>   * ram_save_complete: function called to send the remaining amount of ram
>   *
> - * Returns zero to indicate success
> + * Returns zero to indicate success or negative on error
>   *
>   * Called with iothread lock
>   *
> @@ -3254,6 +3261,7 @@ static int ram_save_complete(QEMUFile *f, void *opaque)
>  {
>      RAMState **temp = opaque;
>      RAMState *rs = *temp;
> +    int ret = 0;
>  
>      rcu_read_lock();
>  
> @@ -3274,6 +3282,10 @@ static int ram_save_complete(QEMUFile *f, void *opaque)
>          if (pages == 0) {
>              break;
>          }
> +        if (pages < 0) {
> +            ret = pages;
> +            break;
> +        }
>      }
>  
>      flush_compressed_data(rs);
> @@ -3285,7 +3297,7 @@ static int ram_save_complete(QEMUFile *f, void *opaque)
>      qemu_put_be64(f, RAM_SAVE_FLAG_EOS);
>      qemu_fflush(f);
>  
> -    return 0;
> +    return ret;
>  }
>  
>  static void ram_save_pending(QEMUFile *f, void *opaque, uint64_t max_size,
> -- 
> 2.14.4
> 

Regards,
Xiao Guangrong Aug. 8, 2018, 6:29 a.m. UTC | #2
On 08/08/2018 01:08 PM, Peter Xu wrote:
> On Tue, Aug 07, 2018 at 05:12:07PM +0800, guangrong.xiao@gmail.com wrote:
>> From: Xiao Guangrong <xiaoguangrong@tencent.com>
>>
>> ram_find_and_save_block() can return negative if any error hanppens,
>> however, it is completely ignored in current code
> 
> Could you hint me where we'll return an error?
> 

I think control_save_page() may return a error condition but i am not
good at it ... Other places look safe _currently_. These functions were
designed to have error returned anyway.

> (Anyway I agree that the error handling is not that good, mostly
>   because the QEMUFile APIs does not provide proper return code, e.g.,
>   qemu_put_be64 returns void)
> 

Yes, it is, the returned error condition is mixed in file's API and
function's return value... :(
Peter Xu Aug. 8, 2018, 6:56 a.m. UTC | #3
On Wed, Aug 08, 2018 at 02:29:52PM +0800, Xiao Guangrong wrote:
> 
> 
> On 08/08/2018 01:08 PM, Peter Xu wrote:
> > On Tue, Aug 07, 2018 at 05:12:07PM +0800, guangrong.xiao@gmail.com wrote:
> > > From: Xiao Guangrong <xiaoguangrong@tencent.com>
> > > 
> > > ram_find_and_save_block() can return negative if any error hanppens,
> > > however, it is completely ignored in current code
> > 
> > Could you hint me where we'll return an error?
> > 
> 
> I think control_save_page() may return a error condition but i am not
> good at it ... Other places look safe _currently_. These functions were
> designed to have error returned anyway.

Ah, the RDMA codes...

Then I feel like this patch would be more suitable to be put into some
of the RDMA series - at least we'd better be clear about what errors
we're going to capture.  For non-RDMA, it seems a bit helpless after
all - AFAIU we're depending on the few qemu_file_get_error() calls to
detect output errors.

> 
> > (Anyway I agree that the error handling is not that good, mostly
> >   because the QEMUFile APIs does not provide proper return code, e.g.,
> >   qemu_put_be64 returns void)
> > 
> 
> Yes, it is, the returned error condition is mixed in file's API and
> function's return value... :(
> 

Regards,
Xiao Guangrong Aug. 8, 2018, 7:23 a.m. UTC | #4
On 08/08/2018 02:56 PM, Peter Xu wrote:
> On Wed, Aug 08, 2018 at 02:29:52PM +0800, Xiao Guangrong wrote:
>>
>>
>> On 08/08/2018 01:08 PM, Peter Xu wrote:
>>> On Tue, Aug 07, 2018 at 05:12:07PM +0800, guangrong.xiao@gmail.com wrote:
>>>> From: Xiao Guangrong <xiaoguangrong@tencent.com>
>>>>
>>>> ram_find_and_save_block() can return negative if any error hanppens,
>>>> however, it is completely ignored in current code
>>>
>>> Could you hint me where we'll return an error?
>>>
>>
>> I think control_save_page() may return a error condition but i am not
>> good at it ... Other places look safe _currently_. These functions were
>> designed to have error returned anyway.
> 
> Ah, the RDMA codes...
> 
> Then I feel like this patch would be more suitable to be put into some
> of the RDMA series - at least we'd better be clear about what errors
> we're going to capture.  For non-RDMA, it seems a bit helpless after
> all - AFAIU we're depending on the few qemu_file_get_error() calls to
> detect output errors.

So, are you talking about to modify the semantic of these functions,
ram_save_host_page(), ram_save_target_page(), etc, and make them
be:
"Returns the number of pages written where zero means no dirty pages,
error conditions are indicated in the QEMUFile @rs->file if it
happened."

If it's what you want, i will update the comments and make the
implementation more clear to reflect this fact for these
functions
Peter Xu Aug. 8, 2018, 8:46 a.m. UTC | #5
On Wed, Aug 08, 2018 at 03:23:22PM +0800, Xiao Guangrong wrote:
> 
> 
> On 08/08/2018 02:56 PM, Peter Xu wrote:
> > On Wed, Aug 08, 2018 at 02:29:52PM +0800, Xiao Guangrong wrote:
> > > 
> > > 
> > > On 08/08/2018 01:08 PM, Peter Xu wrote:
> > > > On Tue, Aug 07, 2018 at 05:12:07PM +0800, guangrong.xiao@gmail.com wrote:
> > > > > From: Xiao Guangrong <xiaoguangrong@tencent.com>
> > > > > 
> > > > > ram_find_and_save_block() can return negative if any error hanppens,
> > > > > however, it is completely ignored in current code
> > > > 
> > > > Could you hint me where we'll return an error?
> > > > 
> > > 
> > > I think control_save_page() may return a error condition but i am not
> > > good at it ... Other places look safe _currently_. These functions were
> > > designed to have error returned anyway.
> > 
> > Ah, the RDMA codes...
> > 
> > Then I feel like this patch would be more suitable to be put into some
> > of the RDMA series - at least we'd better be clear about what errors
> > we're going to capture.  For non-RDMA, it seems a bit helpless after
> > all - AFAIU we're depending on the few qemu_file_get_error() calls to
> > detect output errors.
> 
> So, are you talking about to modify the semantic of these functions,
> ram_save_host_page(), ram_save_target_page(), etc, and make them
> be:
> "Returns the number of pages written where zero means no dirty pages,
> error conditions are indicated in the QEMUFile @rs->file if it
> happened."
> 
> If it's what you want, i will update the comments and make the
> implementation more clear to reflect this fact for these
> functions

Not really; I am just unclear about how this patch could help current
code, however I have no objection on the content.  Let's see whether
Dave or Juan would like it.

Thanks,
Dr. David Alan Gilbert Aug. 8, 2018, 2:11 p.m. UTC | #6
* Xiao Guangrong (guangrong.xiao@gmail.com) wrote:
> 
> 
> On 08/08/2018 01:08 PM, Peter Xu wrote:
> > On Tue, Aug 07, 2018 at 05:12:07PM +0800, guangrong.xiao@gmail.com wrote:
> > > From: Xiao Guangrong <xiaoguangrong@tencent.com>
> > > 
> > > ram_find_and_save_block() can return negative if any error hanppens,
> > > however, it is completely ignored in current code
> > 
> > Could you hint me where we'll return an error?
> > 
> 
> I think control_save_page() may return a error condition but i am not
> good at it ... Other places look safe _currently_. These functions were
> designed to have error returned anyway.

ram_control_save_page's return is checked by control_save_page which
returns true/false but sets *pages to a return value.

What I'd need to follow closely is the case where ram_control_save_page
returns RAM_SAVE_CONTROL_DELAYED, in that case control_save_page I think
returns with *pages=-1 and returns true.
And I think in that case ram_save_target_page can leak that -1 - hmm.

Now, ram_save_host_page already checks for <0 and will return that,
but I think that would potentially loop in ram_find_and_save_block; I'm
not sure we want to change that or not!

Dave

> 
> > (Anyway I agree that the error handling is not that good, mostly
> >   because the QEMUFile APIs does not provide proper return code, e.g.,
> >   qemu_put_be64 returns void)
> > 
> 
> Yes, it is, the returned error condition is mixed in file's API and
> function's return value... :(
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
Xiao Guangrong Aug. 9, 2018, 3:08 a.m. UTC | #7
On 08/08/2018 10:11 PM, Dr. David Alan Gilbert wrote:
> * Xiao Guangrong (guangrong.xiao@gmail.com) wrote:
>>
>>
>> On 08/08/2018 01:08 PM, Peter Xu wrote:
>>> On Tue, Aug 07, 2018 at 05:12:07PM +0800, guangrong.xiao@gmail.com wrote:
>>>> From: Xiao Guangrong <xiaoguangrong@tencent.com>
>>>>
>>>> ram_find_and_save_block() can return negative if any error hanppens,
>>>> however, it is completely ignored in current code
>>>
>>> Could you hint me where we'll return an error?
>>>
>>
>> I think control_save_page() may return a error condition but i am not
>> good at it ... Other places look safe _currently_. These functions were
>> designed to have error returned anyway.
> 
> ram_control_save_page's return is checked by control_save_page which
> returns true/false but sets *pages to a return value.
> 
> What I'd need to follow closely is the case where ram_control_save_page
> returns RAM_SAVE_CONTROL_DELAYED, in that case control_save_page I think
> returns with *pages=-1 and returns true.
> And I think in that case ram_save_target_page can leak that -1 - hmm.
> 
> Now, ram_save_host_page already checks for <0 and will return that,
> but I think that would potentially loop in ram_find_and_save_block; I'm
> not sure we want to change that or not!

ram_find_and_save_block() will continue the look only if ram_save_host_page
returns zero:

......
         if (found) {
             pages = ram_save_host_page(rs, &pss, last_stage);
         }
     } while (!pages && again);

IMHO, how to change the code really depends on the semantic of these functions,
based on the comments of them, returning error conditions is the current
semantic.

Another choice would be the one squashes error conditions to QEMUFile and
adapt comments and code of these functions to reflect the new semantic
clearly.

Which one do you guys prefer to? :)
diff mbox series

Patch

diff --git a/migration/ram.c b/migration/ram.c
index 55966bc2c1..09be01dca2 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -2367,7 +2367,8 @@  static int ram_save_host_page(RAMState *rs, PageSearchStatus *pss,
  *
  * Called within an RCU critical section.
  *
- * Returns the number of pages written where zero means no dirty pages
+ * Returns the number of pages written where zero means no dirty pages,
+ * or negative on error
  *
  * @rs: current RAM state
  * @last_stage: if we are at the completion stage
@@ -3202,6 +3203,12 @@  static int ram_save_iterate(QEMUFile *f, void *opaque)
             done = 1;
             break;
         }
+
+        if (pages < 0) {
+            qemu_file_set_error(f, pages);
+            break;
+        }
+
         rs->iterations++;
 
         /* we want to check in the 1st loop, just in case it was the 1st time
@@ -3243,7 +3250,7 @@  out:
 /**
  * ram_save_complete: function called to send the remaining amount of ram
  *
- * Returns zero to indicate success
+ * Returns zero to indicate success or negative on error
  *
  * Called with iothread lock
  *
@@ -3254,6 +3261,7 @@  static int ram_save_complete(QEMUFile *f, void *opaque)
 {
     RAMState **temp = opaque;
     RAMState *rs = *temp;
+    int ret = 0;
 
     rcu_read_lock();
 
@@ -3274,6 +3282,10 @@  static int ram_save_complete(QEMUFile *f, void *opaque)
         if (pages == 0) {
             break;
         }
+        if (pages < 0) {
+            ret = pages;
+            break;
+        }
     }
 
     flush_compressed_data(rs);
@@ -3285,7 +3297,7 @@  static int ram_save_complete(QEMUFile *f, void *opaque)
     qemu_put_be64(f, RAM_SAVE_FLAG_EOS);
     qemu_fflush(f);
 
-    return 0;
+    return ret;
 }
 
 static void ram_save_pending(QEMUFile *f, void *opaque, uint64_t max_size,