[00/12] migration: improve multithreads for compression and decompression

Message ID 20180604095520.8563-1-xiaoguangrong@tencent.com

Xiao Guangrong June 4, 2018, 9:55 a.m. UTC
From: Xiao Guangrong <xiaoguangrong@tencent.com>

Background
----------
The current implementation of compression and decompression is very
hard to enable in production. We noticed that too many wait/wake
operations go through kernel space, and CPU usage is very low even
when the system is otherwise idle.

The reasons are:
1) too many locks are used for synchronization: there is a global
   lock and each thread has its own lock; the migration thread and
   the worker threads have to go to sleep whenever these locks are busy

2) the migration thread submits requests to the threads one by one,
   but only one request can be pending per thread, which means a
   thread has to go back to sleep as soon as it finishes its request

Our Ideas
---------
To make it work better, we introduce a new multithread model. The
user (currently the migration thread) submits requests to the threads
in a round-robin manner; each thread has its own request ring, whose
capacity is 4, and puts its results into a global ring that is
lockless for multiple producers. The user fetches results out of the
global ring and performs the remaining operations for each request,
e.g., sending the compressed data out for migration on the source
QEMU.

The other work in this patchset adds statistics to check whether
compression works as expected, and makes the migration thread faster
so that it can feed more requests to the threads.
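
To make the model concrete, here is a minimal sketch in C of the
structures and the round-robin submission described above. All names
here (Threads, threads_submit, ring_put, ...) are illustrative only
and do not claim to match the actual API introduced in
migration/threads.h:

    /* Illustrative sketch only, not the actual patchset API. */
    typedef struct Ring Ring;                 /* the lockless ring described below */
    int   ring_put(Ring *ring, void *data);   /* returns -1 if the ring is full    */
    void *ring_get(Ring *ring);               /* returns NULL if the ring is empty */

    typedef struct {
        Ring *requests;              /* SPSC ring: user -> worker, capacity 4 */
    } WorkerThread;

    typedef struct {
        WorkerThread *workers;
        int           nr_workers;
        int           next;          /* round-robin cursor */
        Ring         *results;       /* MPSC ring: all workers -> user */
    } Threads;

    /* Submit one request, trying each worker once in round-robin order. */
    static int threads_submit(Threads *t, void *request)
    {
        int i;

        for (i = 0; i < t->nr_workers; i++) {
            WorkerThread *w = &t->workers[t->next];

            t->next = (t->next + 1) % t->nr_workers;
            if (ring_put(w->requests, request) == 0) {
                return 0;
            }
        }
        return -1;   /* every per-thread ring is full; the caller handles it */
    }

    /* Drain finished requests from the global result ring. */
    static void threads_poll_results(Threads *t, void (*complete)(void *result))
    {
        void *result;

        while ((result = ring_get(t->results)) != NULL) {
            complete(result);  /* e.g. send the compressed page on the source */
        }
    }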

Implementation of the Ring
--------------------------
The key component is the ring, which supports both
single-producer/single-consumer and multiple-producer/single-consumer
operation.

Many lessons were learned from the Linux kernel's kfifo (1) and DPDK's
rte_ring (2) before I wrote this implementation. It corrects some
memory-barrier bugs in kfifo and is a simpler lockless version of
rte_ring, since multiple access is currently only allowed on the
producer side.

With a single producer and a single consumer, it behaves as a
traditional FIFO. With multiple producers, it uses the following
algorithm:

For the producer, updating the ring takes two steps:
   - first step, occupy an entry in the ring:

retry:
      in = ring->in
      if (cmpxchg(&ring->in, in, in + 1) != in)
            goto retry;

     After that, the entry pointed to by ring->data[in] is owned by
     the producer.

     assert(ring->data[in] == NULL);

     Note that no other producer can touch this entry, so it is
     guaranteed to still be in the initialized (NULL) state.

   - second step, write the data to the entry:

     ring->data[in] = data;

   (A compilable sketch of these two steps is given below.)
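
As a compilable illustration of the two producer steps, using
GCC/Clang atomic builtins; the MpscRing type and the function name
mpsc_ring_put are assumptions made for this sketch and differ from
the actual patch code:

    #include <assert.h>
    #include <stdbool.h>
    #include <stdint.h>

    typedef struct {
        uint32_t size;     /* power of two */
        uint32_t in;       /* producer index, increases monotonically */
        uint32_t out;      /* consumer index, increases monotonically */
        void    *data[];   /* slot i lives at data[i & (size - 1)]    */
    } MpscRing;

    static int mpsc_ring_put(MpscRing *ring, void *data)
    {
        uint32_t in = __atomic_load_n(&ring->in, __ATOMIC_ACQUIRE);

        /* Step 1: occupy a slot by advancing ring->in with a CAS loop.
         * On failure the CAS reloads the current value into 'in'. */
        do {
            if (in - __atomic_load_n(&ring->out, __ATOMIC_ACQUIRE) >= ring->size) {
                return -1;   /* ring full */
            }
        } while (!__atomic_compare_exchange_n(&ring->in, &in, in + 1, false,
                                              __ATOMIC_ACQ_REL, __ATOMIC_ACQUIRE));

        /* No other producer can touch this slot now; it must still be NULL. */
        assert(ring->data[in & (ring->size - 1)] == NULL);

        /* Step 2: publish the data; the release store guarantees the
         * consumer cannot see the slot filled before its contents. */
        __atomic_store_n(&ring->data[in & (ring->size - 1)], data,
                         __ATOMIC_RELEASE);
        return 0;
    }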

For the consumer, it first checks whether there is an available entry
in the ring and then fetches it:

     if (!ring_is_empty(ring))
          entry = &ring->data[ring->out];

     Note: ring->out has not been updated yet, so the entry it points
     to is completely owned by the consumer.

Then it checks whether the data is ready:

retry:
     if (*entry == NULL)
            goto retry;

A NULL entry means the producer has updated the index but has not yet
written the data.

Finally, it fetches the valid data out, resets the entry to the
initialized state, and updates ring->out to make the entry usable by
the producer again:

      data = *entry;
      *entry = NULL;
      ring->out++;

Memory barriers are omitted here; please refer to the comments in the
code. (A sketch with the barriers made explicit is given below.)
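
For completeness, a matching single-consumer sketch with the barriers
made explicit, under the same assumptions as the producer sketch
above (again an illustration, not the patch code):

    /* Single consumer: only this thread ever writes ring->out. */
    static void *mpsc_ring_get(MpscRing *ring)
    {
        uint32_t out = ring->out;
        void **entry = &ring->data[out & (ring->size - 1)];
        void *data;

        if (out == __atomic_load_n(&ring->in, __ATOMIC_ACQUIRE)) {
            return NULL;    /* ring empty */
        }

        /* The producer may have advanced ring->in without having
         * written the data yet; spin until the slot is non-NULL.
         * The acquire load pairs with the producer's release store. */
        while ((data = __atomic_load_n(entry, __ATOMIC_ACQUIRE)) == NULL) {
            /* busy-wait; real code can yield or poll other work here */
        }

        *entry = NULL;      /* reset the slot to its initialized state */

        /* Publish ring->out only after the slot is NULL again, so a
         * producer reusing this slot is guaranteed to see it cleared. */
        __atomic_store_n(&ring->out, out + 1, __ATOMIC_RELEASE);
        return data;
    }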

Performance Result
------------------
The test was based on top of the patch:
   ring: introduce lockless ring buffer
That means the earlier optimizations in this series are applied to
both the original case and the new multithread model.

We tested live migration between two hosts:
   Intel(R) Xeon(R) Gold 6142 CPU @ 2.60GHz * 64 + 256G memory
migrating a VM with 16 vCPUs and 60G of memory between them. During
the migration, multiple threads repeatedly wrote to the VM's memory.

We used 16 threads on the destination to decompress the data; on the
source, we tried both 8 threads and 16 threads to compress the data.

--- Before our work ---
Migration could not finish with either 8 threads or 16 threads. The
data is as follows:

Use 8 threads to compress:
- on the source:
	    migration thread   compress-threads
CPU usage       70%          some use 36%, others are very low ~20%
- on the destination:
            main thread        decompress-threads
CPU usage       100%         some use ~40%, others are very low ~2%

Migration status (CANNOT FINISH):
info migrate
globals:
store-global-state: on
only-migratable: off
send-configuration: on
send-section-footer: on
capabilities: xbzrle: off rdma-pin-all: off auto-converge: off zero-blocks: off compress: on events: off postcopy-ram: off x-colo: off release-ram: off block: off return-path: off pause-before-switchover: off x-multifd: off dirty-bitmaps: off postcopy-blocktime: off
Migration status: active
total time: 1019540 milliseconds
expected downtime: 2263 milliseconds
setup: 218 milliseconds
transferred ram: 252419995 kbytes
throughput: 2469.45 mbps
remaining ram: 15611332 kbytes
total ram: 62931784 kbytes
duplicate: 915323 pages
skipped: 0 pages
normal: 59673047 pages
normal bytes: 238692188 kbytes
dirty sync count: 28
page size: 4 kbytes
dirty pages rate: 170551 pages
compression pages: 121309323 pages
compression busy: 60588337
compression busy rate: 0.36
compression reduced size: 484281967178
compression rate: 0.97

Use 16 threads to compress:
- on the source:
	    migration thread   compress-threads
CPU usage       96%          some use 45%, others are very low ~6%
- on the destination:
            main thread        decompress-threads
CPU usage       96%         some use 58%, others are very low ~10%

Migration status (CANNOT FINISH):
info migrate
globals:
store-global-state: on
only-migratable: off
send-configuration: on
send-section-footer: on
capabilities: xbzrle: off rdma-pin-all: off auto-converge: off zero-blocks: off compress: on events: off postcopy-ram: off x-colo: off release-ram: off block: off return-path: off pause-before-switchover: off x-multifd: off dirty-bitmaps: off postcopy-blocktime: off
Migration status: active
total time: 1189221 milliseconds
expected downtime: 6824 milliseconds
setup: 220 milliseconds
transferred ram: 90620052 kbytes
throughput: 840.41 mbps
remaining ram: 3678760 kbytes
total ram: 62931784 kbytes
duplicate: 195893 pages
skipped: 0 pages
normal: 17290715 pages
normal bytes: 69162860 kbytes
dirty sync count: 33
page size: 4 kbytes
dirty pages rate: 175039 pages
compression pages: 186739419 pages
compression busy: 17486568
compression busy rate: 0.09
compression reduced size: 744546683892
compression rate: 0.97

--- After our work ---
Migration finished quickly with both 8 threads and 16 threads. The
data is as follows:

Use 8 threads to compress:
- on the source:
	    migration thread   compress-threads
CPU usage       30%               30% (all threads have same CPU usage)
- on the destination:
            main thread        decompress-threads
CPU usage       100%              50% (all threads have same CPU usage)

Migration status (finished in 219467 ms):
info migrate
globals:
store-global-state: on
only-migratable: off
send-configuration: on
send-section-footer: on
capabilities: xbzrle: off rdma-pin-all: off auto-converge: off zero-blocks: off compress: on events: off postcopy-ram: off x-colo: off release-ram: off block: off return-path: off pause-before-switchover: off x-multifd: off dirty-bitmaps: off postcopy-blocktime: off
Migration status: completed
total time: 219467 milliseconds
downtime: 115 milliseconds
setup: 222 milliseconds
transferred ram: 88510173 kbytes
throughput: 3303.81 mbps
remaining ram: 0 kbytes
total ram: 62931784 kbytes
duplicate: 2211775 pages
skipped: 0 pages
normal: 21166222 pages
normal bytes: 84664888 kbytes
dirty sync count: 15
page size: 4 kbytes
compression pages: 32045857 pages
compression busy: 23377968
compression busy rate: 0.34
compression reduced size: 127767894329
compression rate: 0.97

Use 16 threads to compress:
- on the source:
	    migration thread   compress-threads
CPU usage       60%               60% (all threads have same CPU usage)
- on the destination:
            main thread        decompress-threads
CPU usage       100%              75% (all threads have same CPU usage)

Migration status (finished in 64118 ms):
info migrate
globals:
store-global-state: on
only-migratable: off
send-configuration: on
send-section-footer: on
capabilities: xbzrle: off rdma-pin-all: off auto-converge: off zero-blocks: off compress: on events: off postcopy-ram: off x-colo: off release-ram: off block: off return-path: off pause-before-switchover: off x-multifd: off dirty-bitmaps: off postcopy-blocktime: off
Migration status: completed
total time: 64118 milliseconds
downtime: 29 milliseconds
setup: 223 milliseconds
transferred ram: 13345135 kbytes
throughput: 1705.10 mbps
remaining ram: 0 kbytes
total ram: 62931784 kbytes
duplicate: 574921 pages
skipped: 0 pages
normal: 2570281 pages
normal bytes: 10281124 kbytes
dirty sync count: 9
page size: 4 kbytes
compression pages: 28007024 pages
compression busy: 3145182
compression busy rate: 0.08
compression reduced size: 111829024985
compression rate: 0.97


(1) https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/linux/kfifo.h
(2) http://dpdk.org/doc/api/rte__ring_8h.html 

Xiao Guangrong (12):
  migration: do not wait if no free thread
  migration: fix counting normal page for compression
  migration: fix counting xbzrle cache_miss_rate
  migration: introduce migration_update_rates
  migration: show the statistics of compression
  migration: do not detect zero page for compression
  migration: hold the lock only if it is really needed
  migration: do not flush_compressed_data at the end of each iteration
  ring: introduce lockless ring buffer
  migration: introduce lockless multithreads model
  migration: use lockless Multithread model for compression
  migration: use lockless Multithread model for decompression

 hmp.c                   |  13 +
 include/qemu/queue.h    |   1 +
 migration/Makefile.objs |   1 +
 migration/migration.c   |  11 +
 migration/ram.c         | 898 ++++++++++++++++++++++--------------------------
 migration/ram.h         |   1 +
 migration/ring.h        | 265 ++++++++++++++
 migration/threads.c     | 265 ++++++++++++++
 migration/threads.h     | 116 +++++++
 qapi/migration.json     |  25 +-
 10 files changed, 1109 insertions(+), 487 deletions(-)
 create mode 100644 migration/ring.h
 create mode 100644 migration/threads.c
 create mode 100644 migration/threads.h

Comments

Peter Xu June 11, 2018, 8 a.m. UTC | #1
On Mon, Jun 04, 2018 at 05:55:08PM +0800, guangrong.xiao@gmail.com wrote:
> From: Xiao Guangrong <xiaoguangrong@tencent.com>
> 
> Background
> ----------
> The current implementation of compression and decompression is very
> hard to enable in production. We noticed that too many wait/wake
> operations go through kernel space, and CPU usage is very low even
> when the system is otherwise idle.
>
> The reasons are:
> 1) too many locks are used for synchronization: there is a global
>    lock and each thread has its own lock; the migration thread and
>    the worker threads have to go to sleep whenever these locks are busy
>
> 2) the migration thread submits requests to the threads one by one,
>    but only one request can be pending per thread, which means a
>    thread has to go back to sleep as soon as it finishes its request
>
> Our Ideas
> ---------
> To make it work better, we introduce a new multithread model. The
> user (currently the migration thread) submits requests to the threads
> in a round-robin manner; each thread has its own request ring, whose
> capacity is 4, and puts its results into a global ring that is
> lockless for multiple producers. The user fetches results out of the
> global ring and performs the remaining operations for each request,
> e.g., sending the compressed data out for migration on the source
> QEMU.
>
> The other work in this patchset adds statistics to check whether
> compression works as expected, and makes the migration thread faster
> so that it can feed more requests to the threads.

Hi, Guangrong,

I'm not sure whether my understanding is correct, but AFAIU the old
code has a major defect: it depends too much on the big lock.  The
critical section of the small lock seems to be always very short, and
it is also per-thread.  However, we use the big lock in lots of
places: flushing compressed data, queueing every page, and sending
the notifications in the compression thread.

I haven't yet read the whole work; it seems to be quite nice
according to your test results.  However, have you thought about
first removing the big lock without touching much of the rest of the
code, and then continuing to improve it?  Or have you ever tried to
do so?  I don't think you need to do extra work for this, but I would
appreciate it if you have existing test results to share.

In other words, would it be nicer to separate the work into two
pieces?

- one to refactor the existing locks, to see what we can gain by
  simplifying the locks to a minimum.  AFAIU the locking used now is
  still not ideal; my thinking is that _maybe_ we can start by
  removing the big lock, and use a semaphore or something to replace
  the "done" notification while still keeping the small lock?  Even
  some busy looping?

- one to introduce the lockless ring buffer, to demonstrate how the
  lockless data structure helps compared to the locking approach

Then we can know how much each item contributed to the performance
numbers.  After all, the new ring and thread model seems to be a big
chunk of work (sorry I haven't read it yet, but I will).

What do you think?

Thanks,
Xiao Guangrong June 12, 2018, 3:19 a.m. UTC | #2
On 06/11/2018 04:00 PM, Peter Xu wrote:
> On Mon, Jun 04, 2018 at 05:55:08PM +0800, guangrong.xiao@gmail.com wrote:
>> From: Xiao Guangrong <xiaoguangrong@tencent.com>
>>
>> Background
>> ----------
>> The current implementation of compression and decompression is very
>> hard to enable in production. We noticed that too many wait/wake
>> operations go through kernel space, and CPU usage is very low even
>> when the system is otherwise idle.
>>
>> The reasons are:
>> 1) too many locks are used for synchronization: there is a global
>>    lock and each thread has its own lock; the migration thread and
>>    the worker threads have to go to sleep whenever these locks are busy
>>
>> 2) the migration thread submits requests to the threads one by one,
>>    but only one request can be pending per thread, which means a
>>    thread has to go back to sleep as soon as it finishes its request
>>
>> Our Ideas
>> ---------
>> To make it work better, we introduce a new multithread model. The
>> user (currently the migration thread) submits requests to the threads
>> in a round-robin manner; each thread has its own request ring, whose
>> capacity is 4, and puts its results into a global ring that is
>> lockless for multiple producers. The user fetches results out of the
>> global ring and performs the remaining operations for each request,
>> e.g., sending the compressed data out for migration on the source
>> QEMU.
>>
>> The other work in this patchset adds statistics to check whether
>> compression works as expected, and makes the migration thread faster
>> so that it can feed more requests to the threads.
> 
> Hi, Guangrong,
> 
> I'm not sure whether my understanding is correct, but AFAIU the old
> code has a major defect: it depends too much on the big lock.  The
> critical section of the small lock seems to be always very short, and
> it is also per-thread.  However, we use the big lock in lots of
> places: flushing compressed data, queueing every page, and sending
> the notifications in the compression thread.
> 

The lock is one issue; however, another issue is that a thread has
to go to sleep after finishing one request, and the main thread (the
live migration thread) needs to go into kernel space to wake the
thread up for every single request.

We also observed that linearly scanning the threads one by one to
see which one is free is not cache-friendly...

> I haven't yet read the whole work; it seems to be quite nice
> according to your test results.  However, have you thought about
> first removing the big lock without touching much of the rest of the
> code, and then continuing to improve it?  Or have you ever tried to
> do so?  I don't think you need to do extra work for this, but I would
> appreciate it if you have existing test results to share.
> 

If you really want the performance numbers, I will try it...

Actually, the first version we used in our production is a lockless
multi-thread model (only one atomic operation is needed for both the
producer and the consumer), but only one request can be fed to each
thread. It's comparable to your suggestion (and should be far faster
than your suggestion).

We observed that the shortcoming of this solution is that too many
waits and wakeups trap into the kernel, so the CPU is idle and the
bandwidth is low.

> In other words, would it be nicer to separate the work into two
> pieces?
> 
> - one to refactor the existing locks, to see what we can gain by
>    simplifying the locks to a minimum.  AFAIU the locking used now is
>    still not ideal; my thinking is that _maybe_ we can start by
>    removing the big lock, and use a semaphore or something to replace
>    the "done" notification while still keeping the small lock?  Even
>    some busy looping?
> 

Note: no lock is used after this patchset...

> - one to introduce the lockless ring buffer, to demonstrate how the
>    lockless data structure helps compared to the locking approach
> 
> Then we can know how much each item contributed to the performance
> numbers.  After all, the new ring and thread model seems to be a big
> chunk of work (sorry I haven't read it yet, but I will).

It is really a huge burden to refactor the old code and later
completely remove it.

We redesigned the data structures and algorithms completely and
abstracted the model to clean up the code used for compression and
decompression; it's not easy to modify the old code part by part... :(

But... if it is really needed, I will try to figure out a way to
address your suggestion. :)

Thanks!
Peter Xu June 12, 2018, 5:36 a.m. UTC | #3
On Tue, Jun 12, 2018 at 11:19:14AM +0800, Xiao Guangrong wrote:
> 
> 
> On 06/11/2018 04:00 PM, Peter Xu wrote:
> > On Mon, Jun 04, 2018 at 05:55:08PM +0800, guangrong.xiao@gmail.com wrote:
> > > From: Xiao Guangrong <xiaoguangrong@tencent.com>
> > > 
> > > Background
> > > ----------
> > > The current implementation of compression and decompression is very
> > > hard to enable in production. We noticed that too many wait/wake
> > > operations go through kernel space, and CPU usage is very low even
> > > when the system is otherwise idle.
> > >
> > > The reasons are:
> > > 1) too many locks are used for synchronization: there is a global
> > >    lock and each thread has its own lock; the migration thread and
> > >    the worker threads have to go to sleep whenever these locks are busy
> > >
> > > 2) the migration thread submits requests to the threads one by one,
> > >    but only one request can be pending per thread, which means a
> > >    thread has to go back to sleep as soon as it finishes its request
> > >
> > > Our Ideas
> > > ---------
> > > To make it work better, we introduce a new multithread model. The
> > > user (currently the migration thread) submits requests to the threads
> > > in a round-robin manner; each thread has its own request ring, whose
> > > capacity is 4, and puts its results into a global ring that is
> > > lockless for multiple producers. The user fetches results out of the
> > > global ring and performs the remaining operations for each request,
> > > e.g., sending the compressed data out for migration on the source
> > > QEMU.
> > >
> > > The other work in this patchset adds statistics to check whether
> > > compression works as expected, and makes the migration thread faster
> > > so that it can feed more requests to the threads.
> > 
> > Hi, Guangrong,
> > 
> > I'm not sure whether my understanding is correct, but AFAIU the old
> > code has a major defect: it depends too much on the big lock.  The
> > critical section of the small lock seems to be always very short, and
> > it is also per-thread.  However, we use the big lock in lots of
> > places: flushing compressed data, queueing every page, and sending
> > the notifications in the compression thread.
> > 
> 
> The lock is one issue; however, another issue is that a thread has
> to go to sleep after finishing one request, and the main thread (the
> live migration thread) needs to go into kernel space to wake the
> thread up for every single request.
>
> We also observed that linearly scanning the threads one by one to
> see which one is free is not cache-friendly...

I don't quite understand how this can be fixed from a cache POV, but
I'll read the series first before asking further.

> 
> > I haven't yet read the whole work; it seems to be quite nice
> > according to your test results.  However, have you thought about
> > first removing the big lock without touching much of the rest of the
> > code, and then continuing to improve it?  Or have you ever tried to
> > do so?  I don't think you need to do extra work for this, but I would
> > appreciate it if you have existing test results to share.
> > 
> 
> If you really want the performance numbers, I will try it...

Then that'll be enough for me.  Please only provide the performance
numbers if more people ask for them.  Otherwise please feel free to
put that aside.

> 
> Actually, the first version we used in our production is a lockless
> multi-thread model (only one atomic operation is needed for both the
> producer and the consumer), but only one request can be fed to each
> thread. It's comparable to your suggestion (and should be far faster
> than your suggestion).
> 
> We observed that the shortcoming of this solution is that too many
> waits and wakeups trap into the kernel, so the CPU is idle and the
> bandwidth is low.

Okay.

> 
> > In other words, would it be nicer to separate the work into two
> > pieces?
> > 
> > - one to refactor the existing locks, to see what we can gain by
> >    simplifying the locks to a minimum.  AFAIU the locking used now is
> >    still not ideal; my thinking is that _maybe_ we can start by
> >    removing the big lock, and use a semaphore or something to replace
> >    the "done" notification while still keeping the small lock?  Even
> >    some busy looping?
> > 
> 
> Note: no lock is used after this patchset...
> 
> > - one to introduce the lockless ring buffer, to demonstrate how the
> >    lockless data structure helps compared to the locking approach
> > 
> > Then we can know how much each item contributed to the performance
> > numbers.  After all, the new ring and thread model seems to be a big
> > chunk of work (sorry I haven't read it yet, but I will).
> 
> It is really a huge burden to refactor the old code and later
> completely remove it.
> 
> We redesigned the data structures and algorithms completely and
> abstracted the model to clean up the code used for compression and
> decompression; it's not easy to modify the old code part by part... :(

Yeah; my suggestion above is based on the possibility that removing
the big lock won't be that hard (I _feel_ like it could be <100 LOC,
but I can't tell).  If it's proven to be very hard already, then I'm
totally fine without it.

> 
> But... if it is really needed, I will try to figure out a way to
> address your suggestion. :)

Not necessary.  I won't spend your time on my solo question.  :)

My point is not a strong one, it's just how I think about it:
generally I would prefer incremental changes along the way,
especially when the first step seems obvious (for this, again, I
would try operating on the big lock first).  But this is of course
not a reason to refuse your work.

I'll study your whole series soon; meanwhile let's see how other
people think.

Thanks,