diff mbox series

[PULL,1/4] balloon: Allow multiple inhibit users

Message ID 153452570408.13804.6971118770752634964.stgit@gimli.home
State New
Headers show
Series [PULL,1/4] balloon: Allow multiple inhibit users | expand

Commit Message

Alex Williamson Aug. 17, 2018, 5:08 p.m. UTC
A simple true/false internal state does not allow multiple users.  Fix
this within the existing interface by converting to a counter, so long
as the counter is elevated, ballooning is inhibited.

Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---
 balloon.c |   13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)

Comments

Christian Borntraeger Aug. 22, 2018, 3:56 p.m. UTC | #1
This breaks qemu-io test for me.

#0  0x000003ff98f3e2d4 in raise () at /lib64/libc.so.6
#1  0x000003ff98f239a8 in abort () at /lib64/libc.so.6
#2  0x000003ff98f3632e in __assert_fail_base () at /lib64/libc.so.6
#3  0x000003ff98f363ac in  () at /lib64/libc.so.6
#4  0x000000000108301a in qemu_balloon_inhibit (state=state@entry=false) at /home/cborntra/REPOS/qemu/balloon.c:56
#5  0x00000000012026f2 in postcopy_ram_incoming_cleanup (mis=mis@entry=0x163bc5c0) at /home/cborntra/REPOS/qemu/migration/postcopy-ram.c:542
#6  0x00000000011eed5e in process_incoming_migration_co (opaque=<optimized out>) at /home/cborntra/REPOS/qemu/migration/migration.c:407
#7  0x0000000001374000 in coroutine_trampoline (i0=<optimized out>, i1=<optimized out>) at /home/cborntra/REPOS/qemu/util/coroutine-ucontext.c:116
#8  0x000003ff98f53c02 in __makecontext_ret () at /lib64/libc.so.6

Cant see right now whats wrong.

On 08/17/2018 07:08 PM, Alex Williamson wrote:
> A simple true/false internal state does not allow multiple users.  Fix
> this within the existing interface by converting to a counter, so long
> as the counter is elevated, ballooning is inhibited.
> 
> Reviewed-by: David Hildenbrand <david@redhat.com>
> Reviewed-by: Peter Xu <peterx@redhat.com>
> Reviewed-by: Cornelia Huck <cohuck@redhat.com>
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> ---
>  balloon.c |   13 ++++++++++---
>  1 file changed, 10 insertions(+), 3 deletions(-)
> 
> diff --git a/balloon.c b/balloon.c
> index 6bf0a9681377..931987983858 100644
> --- a/balloon.c
> +++ b/balloon.c
> @@ -26,6 +26,7 @@
>  
>  #include "qemu/osdep.h"
>  #include "qemu-common.h"
> +#include "qemu/atomic.h"
>  #include "exec/cpu-common.h"
>  #include "sysemu/kvm.h"
>  #include "sysemu/balloon.h"
> @@ -37,16 +38,22 @@
>  static QEMUBalloonEvent *balloon_event_fn;
>  static QEMUBalloonStatus *balloon_stat_fn;
>  static void *balloon_opaque;
> -static bool balloon_inhibited;
> +static int balloon_inhibit_count;
>  
>  bool qemu_balloon_is_inhibited(void)
>  {
> -    return balloon_inhibited;
> +    return atomic_read(&balloon_inhibit_count) > 0;
>  }
>  
>  void qemu_balloon_inhibit(bool state)
>  {
> -    balloon_inhibited = state;
> +    if (state) {
> +        atomic_inc(&balloon_inhibit_count);
> +    } else {
> +        atomic_dec(&balloon_inhibit_count);
> +    }
> +
> +    assert(atomic_read(&balloon_inhibit_count) >= 0);
>  }
>  
>  static bool have_balloon(Error **errp)
> 
>
Alex Williamson Aug. 22, 2018, 5:47 p.m. UTC | #2
On Wed, 22 Aug 2018 17:56:12 +0200
Christian Borntraeger <borntraeger@de.ibm.com> wrote:

> This breaks qemu-io test for me.
> 
> #0  0x000003ff98f3e2d4 in raise () at /lib64/libc.so.6
> #1  0x000003ff98f239a8 in abort () at /lib64/libc.so.6
> #2  0x000003ff98f3632e in __assert_fail_base () at /lib64/libc.so.6
> #3  0x000003ff98f363ac in  () at /lib64/libc.so.6
> #4  0x000000000108301a in qemu_balloon_inhibit (state=state@entry=false) at /home/cborntra/REPOS/qemu/balloon.c:56
> #5  0x00000000012026f2 in postcopy_ram_incoming_cleanup (mis=mis@entry=0x163bc5c0) at /home/cborntra/REPOS/qemu/migration/postcopy-ram.c:542
> #6  0x00000000011eed5e in process_incoming_migration_co (opaque=<optimized out>) at /home/cborntra/REPOS/qemu/migration/migration.c:407
> #7  0x0000000001374000 in coroutine_trampoline (i0=<optimized out>, i1=<optimized out>) at /home/cborntra/REPOS/qemu/util/coroutine-ucontext.c:116
> #8  0x000003ff98f53c02 in __makecontext_ret () at /lib64/libc.so.6
> 
> Cant see right now whats wrong.

I'd have to guess it's the assert added to verify increments and
decrements are balanced, which suggests postcopy is de-inhibiting more
than it's inhibiting :-\  Thanks,

Alex
Alex Williamson Aug. 22, 2018, 7:01 p.m. UTC | #3
On Wed, 22 Aug 2018 11:47:09 -0600
Alex Williamson <alex.williamson@redhat.com> wrote:

> On Wed, 22 Aug 2018 17:56:12 +0200
> Christian Borntraeger <borntraeger@de.ibm.com> wrote:
> 
> > This breaks qemu-io test for me.
> > 
> > #0  0x000003ff98f3e2d4 in raise () at /lib64/libc.so.6
> > #1  0x000003ff98f239a8 in abort () at /lib64/libc.so.6
> > #2  0x000003ff98f3632e in __assert_fail_base () at /lib64/libc.so.6
> > #3  0x000003ff98f363ac in  () at /lib64/libc.so.6
> > #4  0x000000000108301a in qemu_balloon_inhibit (state=state@entry=false) at /home/cborntra/REPOS/qemu/balloon.c:56
> > #5  0x00000000012026f2 in postcopy_ram_incoming_cleanup (mis=mis@entry=0x163bc5c0) at /home/cborntra/REPOS/qemu/migration/postcopy-ram.c:542
> > #6  0x00000000011eed5e in process_incoming_migration_co (opaque=<optimized out>) at /home/cborntra/REPOS/qemu/migration/migration.c:407
> > #7  0x0000000001374000 in coroutine_trampoline (i0=<optimized out>, i1=<optimized out>) at /home/cborntra/REPOS/qemu/util/coroutine-ucontext.c:116
> > #8  0x000003ff98f53c02 in __makecontext_ret () at /lib64/libc.so.6
> > 
> > Cant see right now whats wrong.  
> 
> I'd have to guess it's the assert added to verify increments and
> decrements are balanced, which suggests postcopy is de-inhibiting more
> than it's inhibiting :-\  Thanks,

When postcopy inhibits ballooning, it does so from the migration stream
at loaddvm_process_command().  If the command is
MIG_CMD_POSTCOPY_LISTEN, this calls loadvm_postcopy_handle_listen().
If postcopy RAM is enabled, this calls postcopy_ram_enable_notify(),
which is where the balloon inhibitor is triggered.  I think the
postcopy_ram_listen_thread, which is created at the end of
loadvm_postcopy_handle_listen(), is intended to be where the inhibit
would be lifted, but I note a couple error paths before that where the
inhibit is not released.

postcopy_ram_listen_thread() calls postcopy_ram_incoming_cleanup(),
which always releases the balloon inhibit, regardless of the postcopy
RAM capability, and of course there are some error cases in advance of
this that would again leave the inhibit in place.

We also have the coroutine process_incoming_migration_co, launched from
migration_incoming_process() which seems to do a gratuitous call to
postcopy_ram_incoming_cleanup() and will release the inhibit regardless
of whether it was activated previous.

So apparently I assumed too much about the existing balloon inhibitor
use case, it's clearly not as general as the interface would lead one
to believe.  The immediate solution is probably to make a postcopy
specific wrapper around qemu_balloon_inhibit() that keeps a static
internal state to avoid over incrementing or decrementing.  That should
resolve this and then we can move on to whether we want to change the
overall internal interface or fixup some of the other cases noted
above.  Thanks,

Alex
Cornelia Huck Aug. 23, 2018, 8:10 a.m. UTC | #4
On Wed, 22 Aug 2018 13:01:39 -0600
Alex Williamson <alex.williamson@redhat.com> wrote:

> So apparently I assumed too much about the existing balloon inhibitor
> use case, it's clearly not as general as the interface would lead one
> to believe.  The immediate solution is probably to make a postcopy
> specific wrapper around qemu_balloon_inhibit() that keeps a static
> internal state to avoid over incrementing or decrementing.  That should
> resolve this and then we can move on to whether we want to change the
> overall internal interface or fixup some of the other cases noted
> above.  Thanks,
> 
> Alex
> 

My gut feeling is that an interface that allows code to add
yet-another-inhibitor and to remove it again (IOW, what this patch
implemented) is the way to go and that we should push out any checking
whether a inhibitor actually can be removed to the callers.

So, I think it actually makes a lot of sense for the postcopy code to
keep track whether it actually submitted 'its' inhibit and only call
into the generic inhibitor code to release it if that's the case.
diff mbox series

Patch

diff --git a/balloon.c b/balloon.c
index 6bf0a9681377..931987983858 100644
--- a/balloon.c
+++ b/balloon.c
@@ -26,6 +26,7 @@ 
 
 #include "qemu/osdep.h"
 #include "qemu-common.h"
+#include "qemu/atomic.h"
 #include "exec/cpu-common.h"
 #include "sysemu/kvm.h"
 #include "sysemu/balloon.h"
@@ -37,16 +38,22 @@ 
 static QEMUBalloonEvent *balloon_event_fn;
 static QEMUBalloonStatus *balloon_stat_fn;
 static void *balloon_opaque;
-static bool balloon_inhibited;
+static int balloon_inhibit_count;
 
 bool qemu_balloon_is_inhibited(void)
 {
-    return balloon_inhibited;
+    return atomic_read(&balloon_inhibit_count) > 0;
 }
 
 void qemu_balloon_inhibit(bool state)
 {
-    balloon_inhibited = state;
+    if (state) {
+        atomic_inc(&balloon_inhibit_count);
+    } else {
+        atomic_dec(&balloon_inhibit_count);
+    }
+
+    assert(atomic_read(&balloon_inhibit_count) >= 0);
 }
 
 static bool have_balloon(Error **errp)