Message ID: 20180330075128.26919-1-xiaoguangrong@tencent.com
Series: migration: improve and cleanup compression
Hi,

This series failed the docker-quick@centos6 build test. Please find the testing
commands and their output below. If you have Docker installed, you can probably
reproduce it locally.

Type: series
Message-id: 20180330075128.26919-1-xiaoguangrong@tencent.com
Subject: [Qemu-devel] [PATCH v3 00/10] migration: improve and cleanup compression

=== TEST SCRIPT BEGIN ===
#!/bin/bash
set -e
git submodule update --init dtc
# Let docker tests dump environment info
export SHOW_ENV=1
export J=8
time make docker-test-quick@centos6
=== TEST SCRIPT END ===

Updating 3c8cf5a9c21ff8782164d1def7f44bd888713384
Switched to a new branch 'test'
eec13da29e migration: remove ram_save_compressed_page()
4742355795 migration: introduce save_normal_page()
7bebdd70d7 migration: move calling save_zero_page to the common place
a05cecb88d migration: move calling control_save_page to the common place
2aa4825057 migration: move some code to ram_save_host_page
ea1dfe9e22 migration: introduce control_save_page()
7efa946755 migration: detect compression and decompression errors
5c24f92c70 migration: stop decompression to allocate and free memory frequently
a9f99164c2 migration: stop compression to allocate and free memory frequently
503cb617bc migration: stop compressing page in migration thread

=== OUTPUT BEGIN ===
Submodule 'dtc' (git://git.qemu-project.org/dtc.git) registered for path 'dtc'
Cloning into '/var/tmp/patchew-tester-tmp-k24r8yo6/src/dtc'...
Submodule path 'dtc': checked out 'e54388015af1fb4bf04d0bca99caba1074d9cc42'
  BUILD   centos6
make[1]: Entering directory '/var/tmp/patchew-tester-tmp-k24r8yo6/src'
  GEN     /var/tmp/patchew-tester-tmp-k24r8yo6/src/docker-src.2018-03-31-04.21.52.28266/qemu.tar
Cloning into '/var/tmp/patchew-tester-tmp-k24r8yo6/src/docker-src.2018-03-31-04.21.52.28266/qemu.tar.vroot'...
done.
Checking out files: 100% (6066/6066), done.
Your branch is up-to-date with 'origin/test'.
Submodule 'dtc' (git://git.qemu-project.org/dtc.git) registered for path 'dtc'
Cloning into '/var/tmp/patchew-tester-tmp-k24r8yo6/src/docker-src.2018-03-31-04.21.52.28266/qemu.tar.vroot/dtc'...
Submodule path 'dtc': checked out 'e54388015af1fb4bf04d0bca99caba1074d9cc42'
Submodule 'ui/keycodemapdb' (git://git.qemu.org/keycodemapdb.git) registered for path 'ui/keycodemapdb'
Cloning into '/var/tmp/patchew-tester-tmp-k24r8yo6/src/docker-src.2018-03-31-04.21.52.28266/qemu.tar.vroot/ui/keycodemapdb'...
Submodule path 'ui/keycodemapdb': checked out '6b3d716e2b6472eb7189d3220552280ef3d832ce'
tar: /var/tmp/patchew-tester-tmp-k24r8yo6/src/docker-src.2018-03-31-04.21.52.28266/qemu.tar: Wrote only 4096 of 10240 bytes
tar: Error is not recoverable: exiting now
failed to create tar file
  COPY    RUNNER
  RUN     test-quick in qemu:centos6
tar: Unexpected EOF in archive
tar: Unexpected EOF in archive
tar: Error is not recoverable: exiting now
/var/tmp/qemu/run: line 32: prep_fail: command not found

Packages installed:
SDL-devel-1.2.14-7.el6_7.1.x86_64 bison-2.4.1-5.el6.x86_64 bzip2-devel-1.0.5-7.el6_0.x86_64
ccache-3.1.6-2.el6.x86_64 csnappy-devel-0-6.20150729gitd7bc683.el6.x86_64 flex-2.5.35-9.el6.x86_64
gcc-4.4.7-18.el6.x86_64 gettext-0.17-18.el6.x86_64 git-1.7.1-9.el6_9.x86_64
glib2-devel-2.28.8-9.el6.x86_64 libepoxy-devel-1.2-3.el6.x86_64 libfdt-devel-1.4.0-1.el6.x86_64
librdmacm-devel-1.0.21-0.el6.x86_64 lzo-devel-2.03-3.1.el6_5.1.x86_64 make-3.81-23.el6.x86_64
mesa-libEGL-devel-11.0.7-4.el6.x86_64 mesa-libgbm-devel-11.0.7-4.el6.x86_64
package g++ is not installed
pixman-devel-0.32.8-1.el6.x86_64 spice-glib-devel-0.26-8.el6.x86_64 spice-server-devel-0.12.4-16.el6.x86_64
tar-1.23-15.el6_8.x86_64 vte-devel-0.25.1-9.el6.x86_64 xen-devel-4.6.6-2.el6.x86_64
zlib-devel-1.2.3-29.el6.x86_64

Environment variables:
PACKAGES=bison bzip2-devel ccache csnappy-devel flex g++ gcc gettext git glib2-devel libepoxy-devel libfdt-devel librdmacm-devel lzo-devel make mesa-libEGL-devel mesa-libgbm-devel pixman-devel SDL-devel spice-glib-devel spice-server-devel tar vte-devel xen-devel zlib-devel
HOSTNAME=9926f00f82ac
MAKEFLAGS= -j8
J=8
CCACHE_DIR=/var/tmp/ccache
EXTRA_CONFIGURE_OPTS=
V=
SHOW_ENV=1
PATH=/usr/lib/ccache:/usr/lib64/ccache:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
PWD=/
TARGET_LIST=
SHLVL=1
HOME=/root
TEST_DIR=/tmp/qemu-test
FEATURES= dtc
DEBUG=
_=/usr/bin/env

/var/tmp/qemu/run: line 52: cd: /tmp/qemu-test/src/tests/docker: No such file or directory
/var/tmp/qemu/run: line 57: /test-quick: No such file or directory
/var/tmp/qemu/run: line 57: exec: /test-quick: cannot execute: No such file or directory
Traceback (most recent call last):
  File "./tests/docker/docker.py", line 407, in <module>
    sys.exit(main())
  File "./tests/docker/docker.py", line 404, in main
    return args.cmdobj.run(args, argv)
  File "./tests/docker/docker.py", line 261, in run
    return Docker().run(argv, args.keep, quiet=args.quiet)
  File "./tests/docker/docker.py", line 229, in run
    quiet=quiet)
  File "./tests/docker/docker.py", line 147, in _do_check
    return subprocess.check_call(self._command + cmd, **kwargs)
  File "/usr/lib64/python2.7/subprocess.py", line 186, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['docker', 'run', '--label', 'com.qemu.instance.uuid=9d6718c634bc11e8af1152540069c830', '-u', '0', '--security-opt', 'seccomp=unconfined', '--rm', '--net=none', '-e', 'TARGET_LIST=', '-e', 'EXTRA_CONFIGURE_OPTS=', '-e', 'V=', '-e', 'J=8', '-e', 'DEBUG=', '-e', 'SHOW_ENV=1', '-e', 'CCACHE_DIR=/var/tmp/ccache', '-v', '/root/.cache/qemu-docker-ccache:/var/tmp/ccache:z', '-v', '/var/tmp/patchew-tester-tmp-k24r8yo6/src/docker-src.2018-03-31-04.21.52.28266:/var/tmp/qemu:z,ro', 'qemu:centos6', '/var/tmp/qemu/run', 'test-quick']' returned non-zero exit status 126
make[1]: *** [tests/docker/Makefile.include:129: docker-run] Error 1
make[1]: Leaving directory '/var/tmp/patchew-tester-tmp-k24r8yo6/src'
make: *** [tests/docker/Makefile.include:163: docker-run-test-quick@centos6] Error 2

real    0m34.910s
user    0m9.086s
sys     0m7.322s
=== OUTPUT END ===

Test command exited with code: 2

---
Email generated automatically by Patchew [http://patchew.org/].
Please send your feedback to patchew-devel@redhat.com
Hi Paolo, Michael, Stefan and others,

Could anyone merge this patchset if it looks okay to you?

On 03/30/2018 03:51 PM, guangrong.xiao@gmail.com wrote:
> From: Xiao Guangrong <xiaoguangrong@tencent.com>
>
> [...]
On 08/04/2018 05:19, Xiao Guangrong wrote:
>
> Hi Paolo, Michael, Stefan and others,
>
> Could anyone merge this patchset if it looks okay to you?

Hi Guangrong,

Dave and Juan will take care of merging it.  However, right now QEMU is
in freeze, so they may wait a week or two.  If they have reviewed it,
it's certainly on their radar!

Thanks,

Paolo

> On 03/30/2018 03:51 PM, guangrong.xiao@gmail.com wrote:
>> From: Xiao Guangrong <xiaoguangrong@tencent.com>
>>
>> [...]
* Paolo Bonzini (pbonzini@redhat.com) wrote:
> On 08/04/2018 05:19, Xiao Guangrong wrote:
> >
> > Hi Paolo, Michael, Stefan and others,
> >
> > Could anyone merge this patchset if it looks okay to you?
>
> Hi Guangrong,
>
> Dave and Juan will take care of merging it.  However, right now QEMU is
> in freeze, so they may wait a week or two.  If they have reviewed it,
> it's certainly on their radar!

Yep, one of us will get it at the start of 2.13.

Dave

> Thanks,
>
> Paolo
>
> > On 03/30/2018 03:51 PM, guangrong.xiao@gmail.com wrote:
> >> From: Xiao Guangrong <xiaoguangrong@tencent.com>
> >>
> >> [...]
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
* guangrong.xiao@gmail.com (guangrong.xiao@gmail.com) wrote:
> From: Xiao Guangrong <xiaoguangrong@tencent.com>
>

Queued.

> [...]
> --
> 2.14.3

--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
From: Xiao Guangrong <xiaoguangrong@tencent.com>

Changelog in v3:
The following changes are from Peter's review:
1) use comp_param[i].file and decomp_param[i].compbuf to indicate whether
   the thread is properly initialized or not
2) save the file used by the RAM loader to a global variable instead of
   caching it per decompression thread

Changelog in v2:
Thanks to the reviews from Dave, Peter, Wei and Jiang Biao, the changes
in this version are:
1) include the performance numbers in the cover letter
2) add some comments to explain how z_stream->opaque is used in the
   patchset
3) allocate an internal per-thread buffer to store the data to be
   compressed
4) add a new patch that moves some code to ram_save_host_page() so
   that 'goto' can be omitted gracefully
5) split the optimization of compression and decompression into two
   separate patches
6) refine and correct code style

This is the first part of our work to improve compression and make it
more useful in production.

The first patch resolves the problem that the migration thread spends
too much CPU time compressing memory when it jumps to a new block,
which leaves the network badly underutilized.

The second patch fixes the performance issue that too many VM-exits
happen during live migration when compression is used; it is caused by
large amounts of memory being returned to the kernel frequently, as
memory is allocated and freed for every single call to compress2()
(see the standalone sketch below).

The remaining patches clean the code up dramatically.

Performance numbers:
We tested it on my desktop, i7-4790 + 16G, by locally live-migrating a
VM with 8 vCPUs + 6G memory, with max-bandwidth limited to 350. During
the migration, a workload with 8 threads repeatedly writes the whole 6G
of memory in the VM.

Before this patchset the bandwidth is ~25 mbps; after applying it, the
bandwidth is ~50 mbps.

We also collected perf data for patches 2 and 3 on our production
systems. Before the patchset:
+  57.88%  kqemu  [kernel.kallsyms]  [k] queued_spin_lock_slowpath
+  10.55%  kqemu  [kernel.kallsyms]  [k] __lock_acquire
+   4.83%  kqemu  [kernel.kallsyms]  [k] flush_tlb_func_common

-   1.16%  kqemu  [kernel.kallsyms]  [k] lock_acquire
   - lock_acquire
      - 15.68% _raw_spin_lock
         + 29.42% __schedule
         + 29.14% perf_event_context_sched_out
         + 23.60% tdp_page_fault
         + 10.54% do_anonymous_page
         +  2.07% kvm_mmu_notifier_invalidate_range_start
         +  1.83% zap_pte_range
         +  1.44% kvm_mmu_notifier_invalidate_range_end

After applying our work:
+  51.92%  kqemu  [kernel.kallsyms]  [k] queued_spin_lock_slowpath
+  14.82%  kqemu  [kernel.kallsyms]  [k] __lock_acquire
+   1.47%  kqemu  [kernel.kallsyms]  [k] mark_lock.clone.0
+   1.46%  kqemu  [kernel.kallsyms]  [k] native_sched_clock
+   1.31%  kqemu  [kernel.kallsyms]  [k] lock_acquire
+   1.24%  kqemu  libc-2.12.so       [.] __memset_sse2

-  14.82%  kqemu  [kernel.kallsyms]  [k] __lock_acquire
   - __lock_acquire
      - 99.75% lock_acquire
         - 18.38% _raw_spin_lock
            + 39.62% tdp_page_fault
            + 31.32% __schedule
            + 27.53% perf_event_context_sched_out
            +  0.58% hrtimer_interrupt

We can see the TLB flush and mmu-lock contention have gone.
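[Editor's note] To make the idea behind the second patch concrete, here is a
minimal, self-contained zlib sketch. It is not QEMU's code and all names are
illustrative; it only shows the allocation pattern. compress2() sets up and
tears down a full deflate state (several hundred kilobytes of working memory)
internally on every call, so compressing one guest page at a time repeatedly
allocates and frees that memory, which is the frequent "memory returned to the
kernel" the cover letter describes. Keeping one long-lived, pre-initialized
z_stream per compression thread and merely resetting it between pages avoids
that churn.

#include <stdio.h>
#include <string.h>
#include <zlib.h>

#define PAGE_SIZE 4096

/* Pre-patch pattern: one compress2() per page.  compress2() runs
 * deflateInit()/deflateEnd() internally, so the large deflate state is
 * allocated and freed for every single page. */
static int compress_page_naive(const unsigned char *page,
                               unsigned char *out, size_t out_size)
{
    uLongf dst_len = out_size;

    return compress2(out, &dst_len, page, PAGE_SIZE, 1);
}

/* Post-patch pattern: the stream is initialized once per worker thread
 * and only reset between pages, so nothing is allocated or freed on the
 * hot path. */
static int compress_page_reuse(z_stream *zs, const unsigned char *page,
                               unsigned char *out, size_t out_size)
{
    int ret = deflateReset(zs);      /* cheap: keeps the allocated state */

    if (ret != Z_OK) {
        return ret;
    }
    zs->next_in   = (Bytef *)page;
    zs->avail_in  = PAGE_SIZE;
    zs->next_out  = out;
    zs->avail_out = out_size;

    ret = deflate(zs, Z_FINISH);     /* compress the whole page in one go */
    return ret == Z_STREAM_END ? Z_OK : Z_DATA_ERROR;
}

int main(void)
{
    static unsigned char page[PAGE_SIZE], out[PAGE_SIZE * 2];
    z_stream zs;
    int i;

    memset(page, 0x5a, sizeof(page));

    /* the pre-patch pattern, shown once for contrast */
    if (compress_page_naive(page, out, sizeof(out)) != Z_OK) {
        fprintf(stderr, "compress2 failed\n");
    }

    memset(&zs, 0, sizeof(zs));
    if (deflateInit(&zs, 1) != Z_OK) {   /* once, at thread setup */
        return 1;
    }
    for (i = 0; i < 4; i++) {            /* stands in for the per-page loop */
        if (compress_page_reuse(&zs, page, out, sizeof(out)) != Z_OK) {
            fprintf(stderr, "compression error\n");
        }
    }
    printf("last page compressed to %lu bytes\n",
           (unsigned long)(sizeof(out) - zs.avail_out));
    deflateEnd(&zs);                     /* once, at thread teardown */

    return 0;
}

In the real patchset the stream lives in the per-thread compression
parameters and errors are propagated back to the migration code; the sketch
only demonstrates the reuse pattern.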
Xiao Guangrong (10):
  migration: stop compressing page in migration thread
  migration: stop compression to allocate and free memory frequently
  migration: stop decompression to allocate and free memory frequently
  migration: detect compression and decompression errors
  migration: introduce control_save_page()
  migration: move some code to ram_save_host_page
  migration: move calling control_save_page to the common place
  migration: move calling save_zero_page to the common place
  migration: introduce save_normal_page()
  migration: remove ram_save_compressed_page()

 migration/qemu-file.c |   43 ++++-
 migration/qemu-file.h |    6 +-
 migration/ram.c       |  482 ++++++++++++++++++++++++++++++--------------------
 3 files changed, 324 insertions(+), 207 deletions(-)
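[Editor's note] The decompression side follows the same pattern, and the
"migration: detect compression and decompression errors" patch adds the other
half: the decompressor's return codes must be checked and propagated instead
of being assumed to succeed. The sketch below is again a minimal standalone
zlib example under illustrative names, not QEMU's code; it keeps one
long-lived inflate stream per decompression thread and reports failures to the
caller.

#include <stdio.h>
#include <string.h>
#include <zlib.h>

#define PAGE_SIZE 4096

/* One long-lived inflate stream per decompression thread: initialized
 * once, reset for every incoming compressed page, torn down at exit. */
static int decompress_page(z_stream *zs, const unsigned char *in,
                           size_t in_len, unsigned char *page)
{
    int ret = inflateReset(zs);

    if (ret != Z_OK) {
        return ret;
    }
    zs->next_in   = (Bytef *)in;
    zs->avail_in  = in_len;
    zs->next_out  = page;
    zs->avail_out = PAGE_SIZE;

    ret = inflate(zs, Z_FINISH);

    /* A corrupted stream shows up as an inflate error; a truncated page
     * shows up as the wrong output length.  Either way the caller must
     * see the failure instead of silently using a bad page. */
    if (ret != Z_STREAM_END) {
        return ret == Z_OK ? Z_DATA_ERROR : ret;
    }
    if (zs->avail_out != 0) {
        return Z_DATA_ERROR;
    }
    return Z_OK;
}

int main(void)
{
    static unsigned char page[PAGE_SIZE];
    unsigned char bogus[16] = { 0xde, 0xad, 0xbe, 0xef };
    z_stream zs;

    memset(&zs, 0, sizeof(zs));
    if (inflateInit(&zs) != Z_OK) {      /* once, at thread setup */
        return 1;
    }

    /* Feeding garbage demonstrates that the error actually surfaces. */
    if (decompress_page(&zs, bogus, sizeof(bogus), page) != Z_OK) {
        fprintf(stderr, "decompression failed, rejecting this page\n");
    }

    inflateEnd(&zs);                     /* once, at thread teardown */
    return 0;
}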