[v5,00/13] WIP: Use Intel DSA accelerator to offload zero page checking in multifd live migration.

Message ID	20240711215244.19237-1-yichen.wang@bytedance.com
Headers
From: Yichen Wang <yichen.wang@bytedance.com>
To: Paolo Bonzini <pbonzini@redhat.com>, Marc-André Lureau <marcandre.lureau@redhat.com>, Daniel P. Berrangé <berrange@redhat.com>, Thomas Huth <thuth@redhat.com>, Philippe Mathieu-Daudé <philmd@linaro.org>, Peter Xu <peterx@redhat.com>, Fabiano Rosas <farosas@suse.de>, Eric Blake <eblake@redhat.com>, Markus Armbruster <armbru@redhat.com>, "Michael S. Tsirkin" <mst@redhat.com>, Cornelia Huck <cohuck@redhat.com>, qemu-devel@nongnu.org
Cc: "Hao Xiang" <hao.xiang@linux.dev>, "Liu, Yuan1" <yuan1.liu@intel.com>, "Shivam Kumar" <shivam.kumar1@nutanix.com>, "Ho-Ren (Jack) Chuang" <horenchuang@bytedance.com>, "Yichen Wang" <yichen.wang@bytedance.com>
Subject: [PATCH v5 00/13] WIP: Use Intel DSA accelerator to offload zero page checking in multifd live migration.
Date: Thu, 11 Jul 2024 14:52:35 -0700
Message-Id: <20240711215244.19237-1-yichen.wang@bytedance.com>
Series	WIP: Use Intel DSA accelerator to offload zero page checking in multifd live migration.
[v5,00/13] WIP: Use Intel DSA accelerator to offload zero page checking in multifd live migration.
[v5,01/13] meson: Introduce new instruction set enqcmd to the build system.
[v5,02/13] util/dsa: Add idxd into linux header copy list.
[v5,03/13] util/dsa: Implement DSA device start and stop logic.
[v5,04/13] util/dsa: Implement DSA task enqueue and dequeue.
[v5,05/13] util/dsa: Implement DSA task asynchronous completion thread model.
[v5,06/13] util/dsa: Implement zero page checking in DSA task.
[v5,07/13] util/dsa: Implement DSA task asynchronous submission and wait for completion.
[v5,08/13] migration/multifd: Add new migration option for multifd DSA offloading.
[v5,09/13] migration/multifd: Prepare to introduce DSA acceleration on the multifd path.
[v5,10/13] migration/multifd: Enable DSA offloading in multifd sender path.
[v5,11/13] migration/multifd: Add migration option set packet size.
[v5,12/13] util/dsa: Add unit test coverage for Intel DSA task submission and completion.
[v5,13/13] migration/multifd: Add integration tests for multifd with Intel DSA offloading.

Yichen Wang July 11, 2024, 9:52 p.m. UTC

v5
* Rebase on top of 39a032cea23e522268519d89bb738974bc43b6f6.
* Rename struct definitions with typedef and CamelCase names;
* Add build and runtime checks about DSA accelerator;
* Address all comments from v4 reviews about typos, licenses, comments,
error reporting, etc.

v4
* Rebase on top of 85b597413d4370cb168f711192eaef2eb70535ac.
* A separate "multifd zero page checking" patchset was split from this
patchset's v3 and got merged into master. v4 re-applied the rest of all
commits on top of that patchset, re-factored and re-tested.
https://lore.kernel.org/all/20240311180015.3359271-1-hao.xiang@linux.dev/
* There is some feedback from v3 that I likely overlooked.
 
v3
* Rebase on top of 7425b6277f12e82952cede1f531bfc689bf77fb1.
* Fix error/warning from checkpatch.pl
* Fix use-after-free bug when multifd-dsa-accel option is not set.
* Handle error from dsa_init and correctly propagate the error.
* Remove unnecessary call to dsa_stop.
* Detect availability of DSA feature at compile time.
* Implement a generic batch_task structure and a DSA specific one dsa_batch_task.
* Remove all exit() calls and propagate errors correctly.
* Use bytes instead of page count to configure multifd-packet-size option.

v2
* Rebase on top of 3e01f1147a16ca566694b97eafc941d62fa1e8d8.
* Leave Juan's changes in their original form instead of squashing them.
* Add a new commit to refactor the multifd_send_thread function to prepare for introducing the DSA offload functionality.
* Use page count to configure multifd-packet-size option.
* Don't use the FLAKY flag in DSA tests.
* Test if the DSA integration test is set up correctly and skip the
test if not.
* Fixed broken link in the previous patch cover.

* Background:

I posted an RFC about DSA offloading in QEMU:
https://patchew.org/QEMU/20230529182001.2232069-1-hao.xiang@bytedance.com/

This patchset implements the DSA offloading on zero page checking in
multifd live migration code path.

* Overview:

Intel Data Streaming Accelerator (DSA) was introduced with Intel's 4th
generation Xeon server processors, aka Sapphire Rapids.
https://cdrdv2-public.intel.com/671116/341204-intel-data-streaming-accelerator-spec.pdf
https://www.intel.com/content/www/us/en/content-details/759709/intel-data-streaming-accelerator-user-guide.html
One of the things DSA can do is to offload memory comparison workloads
from the CPU to the DSA accelerator hardware. This patchset implements a
solution to offload QEMU's zero page checking from the CPU to the DSA
accelerator hardware. We gain two benefits from this change:
1. Reduced CPU usage in the multifd live migration workflow across all
use cases.
2. Reduced total migration time in some use cases.

* Design:

These are the logical steps to perform DSA offloading:
1. Configure DSA accelerators and create user space openable DSA work
queues via the idxd driver.
2. Map DSA's work queue into a user space address space.
3. Fill an in-memory task descriptor to describe the memory operation.
4. Use the dedicated CPU instruction ENQCMD (via the _enqcmd intrinsic)
to queue a task descriptor to the work queue.
5. Poll the task descriptor's completion status field until the task
completes.
6. Check return status.
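
Steps 3 to 6 can be illustrated with a small, software-only sketch. It
does not talk to real hardware: the descriptor and completion layouts,
the status values and every sketch_* name are hypothetical stand-ins
(the real definitions live in linux/idxd.h), and sketch_submit() merely
models the device in place of an ENQCMD submission to the mapped work
queue portal.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical status values; the real ones are defined by the hardware. */
enum { ST_PENDING = 0, ST_SUCCESS = 1 };

/* Simplified stand-in for a completion record. */
typedef struct {
    volatile uint8_t status;   /* written by the "device" when it is done */
    uint8_t result;            /* 0 means the buffer matched the pattern */
} SketchCompletion;

/* Simplified stand-in for a compare-pattern task descriptor. */
typedef struct {
    const void *src;           /* buffer to check */
    uint64_t pattern;          /* 8-byte pattern, 0 for zero page checking */
    uint32_t len;              /* length in bytes, a multiple of 8 */
    SketchCompletion *comp;    /* where the completion status is reported */
} SketchDescriptor;

/* Software model of the device: compare the buffer against the pattern. */
static void sketch_device_run(const SketchDescriptor *d)
{
    const uint8_t *p = d->src;
    uint8_t result = 0;

    for (uint32_t off = 0; off + 8 <= d->len; off += 8) {
        uint64_t word;
        memcpy(&word, p + off, sizeof(word));
        if (word != d->pattern) {
            result = 1;
            break;
        }
    }
    d->comp->result = result;
    d->comp->status = ST_SUCCESS;
}

/* Step 4: real code would submit the descriptor with ENQCMD here. */
static int sketch_submit(const SketchDescriptor *d)
{
    sketch_device_run(d);
    return 0;
}

/* Steps 3 to 6 for a single buffer; returns 0 on success. */
static int sketch_buffer_is_zero(const void *buf, uint32_t len, int *is_zero)
{
    SketchCompletion comp = { .status = ST_PENDING, .result = 0 };
    SketchDescriptor desc = {
        .src = buf, .pattern = 0, .len = len, .comp = &comp,
    };

    if (sketch_submit(&desc) != 0) {          /* step 4: submit */
        return -1;
    }
    while (comp.status == ST_PENDING) {       /* step 5: poll */
        /* real code would issue a pause instruction between iterations */
    }
    if (comp.status != ST_SUCCESS) {          /* step 6: check status */
        return -1;
    }
    *is_zero = (comp.result == 0);
    return 0;
}
```

Because the model completes synchronously, the polling loop exits on its
first check; with real hardware the completion record is written
asynchronously, which is why the polling cost matters.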

The memory operation is now done entirely by the accelerator hardware,
but the new workflow introduces overhead: the extra CPU cost of
preparing and submitting the task descriptors, and the extra CPU cost of
polling for completion. The design is built around minimizing these two
overheads.

1. In order to reduce the overhead on task preparation and submission,
we use batch descriptors. A batch descriptor will contain N individual
zero page checking tasks where the default N is 128 (default packet size
/ page size) and we can increase N by setting the packet size via a new
migration option.
2. The multifd sender threads prepare and submit batch tasks to the DSA
hardware and wait on a synchronization object for task completion.
Whenever a DSA task is submitted, the task structure is added to a
thread safe queue. It is safe for multiple multifd sender threads to
submit tasks concurrently.
3. Multiple DSA hardware devices can be used. During multifd initialization,
every sender thread will be assigned a DSA device to work with. We
use a round-robin scheme to evenly distribute the work across all used
DSA devices.
4. Use a dedicated thread dsa_completion to perform busy polling for all
DSA task completions. The thread keeps dequeuing DSA tasks from the
thread safe queue. The thread blocks when there is no outstanding DSA
task. When polling for completion of a DSA task, the thread uses the
_mm_pause intrinsic (the PAUSE instruction) between the iterations of a
busy loop to save some CPU power and to free up core resources for the
sibling hyperthread.
5. The DSA accelerator can encounter errors. The most common error is a
page fault. We have tested using devices to handle page faults but
performance is bad. Right now, if DSA hits a page fault, we fall back to
using the CPU to complete the rest of the work. The CPU fallback is done
in the multifd sender thread.
6. Added a new migration option multifd-dsa-accel to set the DSA device
path. If set, the multifd workflow will leverage the DSA devices for
offloading.
7. Added a new migration option multifd-normal-page-ratio to make
multifd live migration easier to test. Setting a normal page ratio will
make live migration recognize a zero page as a normal page and send
the entire payload over the network. If we want to send a large network
payload and analyze throughput, this option is useful.
8. Added a new migration option multifd-packet-size. This can increase
the number of pages being zero page checked and sent over the network.
The extra synchronization between the sender threads and the dsa
completion thread is an overhead. Using a large packet size can reduce
that overhead.
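
A few of the design points above (batch size from item 1, round-robin
device assignment from item 3, CPU fallback from item 5) can be sketched
in plain C. This is only an illustration under stated assumptions, not
the code in util/dsa.c: all sketch_* names are hypothetical, a 4 KiB
page size is assumed, and the fallback is modeled per page rather than
as "the rest of the work" after the first fault.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u   /* assumed page size */

/* Hypothetical per-page outcome reported by the device. */
enum SketchStatus { SKETCH_OK, SKETCH_PAGE_FAULT };

/* Item 1: number of zero page checks per batch = packet size / page size. */
static size_t sketch_batch_len(size_t packet_size)
{
    return packet_size / SKETCH_PAGE_SIZE;
}

/* Item 3: round-robin mapping of a sender thread to a device index. */
static size_t sketch_pick_device(size_t thread_id, size_t nr_devices)
{
    assert(nr_devices > 0);
    return thread_id % nr_devices;
}

/* Plain CPU zero check, used as the fallback path. */
static bool sketch_cpu_is_zero(const uint8_t *page)
{
    for (size_t i = 0; i < SKETCH_PAGE_SIZE; i++) {
        if (page[i] != 0) {
            return false;
        }
    }
    return true;
}

/*
 * Item 5: keep the device's answer where it completed successfully and
 * redo the check on the CPU for every page it could not finish.
 * Returns the number of pages that needed the CPU fallback.
 */
static size_t sketch_finish_batch(uint8_t *const pages[],
                                  const enum SketchStatus status[],
                                  bool is_zero[], size_t n)
{
    size_t fallbacks = 0;

    for (size_t i = 0; i < n; i++) {
        if (status[i] != SKETCH_OK) {
            is_zero[i] = sketch_cpu_is_zero(pages[i]);
            fallbacks++;
        }
    }
    return fallbacks;
}
```

With 4 KiB pages, a batch of 128 checks corresponds to a 512 KiB packet,
so sketch_batch_len(512 * 1024) yields 128, and raising the packet size
raises the batch length proportionally.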

* Performance:

We use two Intel 4th generation Xeon servers for testing.

Architecture:        x86_64
CPU(s):              192
Thread(s) per core:  2
Core(s) per socket:  48
Socket(s):           2
NUMA node(s):        2
Vendor ID:           GenuineIntel
CPU family:          6
Model:               143
Model name:          Intel(R) Xeon(R) Platinum 8457C
Stepping:            8
CPU MHz:             2538.624
CPU max MHz:         3800.0000
CPU min MHz:         800.0000

We perform multifd live migration with the setup below:
1. VM has 100GB memory.
2. Use the new migration option multifd-set-normal-page-ratio to control the total
size of the payload sent over the network.
3. Use 8 multifd channels.
4. Use tcp for live migration.
5. Use CPU to perform zero page checking as the baseline.
6. Use one DSA device to offload zero page checking to compare with the baseline.
7. Use "perf sched record" and "perf sched timehist" to analyze CPU usage.

A) Scenario 1: 50% (50GB) normal pages on a 100GB VM.

	CPU usage

	|---------------|---------------|---------------|---------------|
	|		|comm		|runtime(msec)	|totaltime(msec)|
	|---------------|---------------|---------------|---------------|
	|Baseline	|live_migration	|5657.58	|		|
	|		|multifdsend_0	|3931.563	|		|
	|		|multifdsend_1	|4405.273	|		|
	|		|multifdsend_2	|3941.968	|		|
	|		|multifdsend_3	|5032.975	|		|
	|		|multifdsend_4	|4533.865	|		|
	|		|multifdsend_5	|4530.461	|		|
	|		|multifdsend_6	|5171.916	|		|
	|		|multifdsend_7	|4722.769	|41922		|
	|---------------|---------------|---------------|---------------|
	|DSA		|live_migration	|6129.168	|		|
	|		|multifdsend_0	|2954.717	|		|
	|		|multifdsend_1	|2766.359	|		|
	|		|multifdsend_2	|2853.519	|		|
	|		|multifdsend_3	|2740.717	|		|
	|		|multifdsend_4	|2824.169	|		|
	|		|multifdsend_5	|2966.908	|		|
	|		|multifdsend_6	|2611.137	|		|
	|		|multifdsend_7	|3114.732	|		|
	|		|dsa_completion	|3612.564	|32568		|
	|---------------|---------------|---------------|---------------|

Baseline total runtime is calculated by adding up the runtime of all
multifdsend_X and live_migration threads. DSA offloading total runtime
is calculated by adding up the runtime of all multifdsend_X,
live_migration and dsa_completion threads. That is 41922 msec vs. 32568
msec of runtime, a saving of about 22% in total CPU usage.
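
For reference, the savings figure is the relative reduction of the
summed thread runtime. A minimal sketch of that arithmetic, with a
hypothetical helper name, using only the totals reported in the table:

```c
#include <assert.h>

/* Percentage of CPU runtime saved relative to the baseline total. */
static double cpu_savings_percent(double baseline_ms, double offload_ms)
{
    assert(baseline_ms > 0.0);
    return (baseline_ms - offload_ms) / baseline_ms * 100.0;
}
```

Calling cpu_savings_percent(41922, 32568) gives (41922 - 32568) / 41922,
which is roughly 22.3%.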

	Latency
	|---------------|---------------|---------------|---------------|---------------|---------------|
	|		|total time	|down time	|throughput	|transferred-ram|total-ram	|
	|---------------|---------------|---------------|---------------|---------------|---------------|
	|Baseline	|10343 ms	|161 ms		|41007.00 mbps	|51583797 kb	|102400520 kb	|
	|---------------|---------------|---------------|---------------|---------------|---------------|
	|DSA offload	|9535 ms	|135 ms		|46554.40 mbps	|53947545 kb	|102400520 kb	|
	|---------------|---------------|---------------|---------------|---------------|---------------|

Total time is 8% faster and down time is 16% faster.

B) Scenario 2: 100% (100GB) zero pages on a 100GB VM.

	CPU usage
	|---------------|---------------|---------------|---------------|
	|		|comm		|runtime(msec)	|totaltime(msec)|
	|---------------|---------------|---------------|---------------|
	|Baseline	|live_migration	|4860.718	|		|
	|	 	|multifdsend_0	|748.875	|		|
	|		|multifdsend_1	|898.498	|		|
	|		|multifdsend_2	|787.456	|		|
	|		|multifdsend_3	|764.537	|		|
	|		|multifdsend_4	|785.687	|		|
	|		|multifdsend_5	|756.941	|		|
	|		|multifdsend_6	|774.084	|		|
	|		|multifdsend_7	|782.900	|11154		|
	|---------------|---------------|---------------|---------------|
	|DSA offloading	|live_migration	|3846.976	|		|
	|		|multifdsend_0	|191.880	|		|
	|		|multifdsend_1	|166.331	|		|
	|		|multifdsend_2	|168.528	|		|
	|		|multifdsend_3	|197.831	|		|
	|		|multifdsend_4	|169.580	|		|
	|		|multifdsend_5	|167.984	|		|
	|		|multifdsend_6	|198.042	|		|
	|		|multifdsend_7	|170.624	|		|
	|		|dsa_completion	|3428.669	|8700		|
	|---------------|---------------|---------------|---------------|

Baseline total runtime is 11154 msec and DSA offloading total runtime is
8700 msec. That is 22% CPU savings.

	Latency
	|---------------|---------------|---------------|---------------|---------------|---------------|
	|		|total time	|down time	|throughput	|transferred-ram|total-ram	|
	|---------------|---------------|---------------|---------------|---------------|---------------|
	|Baseline	|4867 ms	|20 ms		|1.51 mbps	|565 kb		|102400520 kb	|
	|---------------|---------------|---------------|---------------|---------------|---------------|
	|DSA offload	|3888 ms	|18 ms		|1.89 mbps	|565 kb		|102400520 kb	|
	|---------------|---------------|---------------|---------------|---------------|---------------|

Total time is 20% faster and down time is 10% faster.

* Testing:

1. Added unit tests to cover the added code path in dsa.c
2. Added integration tests to cover multifd live migration using DSA
offloading.

Hao Xiang (12):
  meson: Introduce new instruction set enqcmd to the build system.
  util/dsa: Implement DSA device start and stop logic.
  util/dsa: Implement DSA task enqueue and dequeue.
  util/dsa: Implement DSA task asynchronous completion thread model.
  util/dsa: Implement zero page checking in DSA task.
  util/dsa: Implement DSA task asynchronous submission and wait for
    completion.
  migration/multifd: Add new migration option for multifd DSA
    offloading.
  migration/multifd: Prepare to introduce DSA acceleration on the
    multifd path.
  migration/multifd: Enable DSA offloading in multifd sender path.
  migration/multifd: Add migration option set packet size.
  util/dsa: Add unit test coverage for Intel DSA task submission and
    completion.
  migration/multifd: Add integration tests for multifd with Intel DSA
    offloading.

Yichen Wang (1):
  util/dsa: Add idxd into linux header copy list.

 include/qemu/dsa.h              |  176 +++++
 meson.build                     |   14 +
 meson_options.txt               |    2 +
 migration/migration-hmp-cmds.c  |   22 +-
 migration/migration.c           |    2 +-
 migration/multifd-zero-page.c   |  100 ++-
 migration/multifd-zlib.c        |    6 +-
 migration/multifd-zstd.c        |    6 +-
 migration/multifd.c             |   53 +-
 migration/multifd.h             |    8 +-
 migration/options.c             |   85 +++
 migration/options.h             |    2 +
 qapi/migration.json             |   49 +-
 scripts/meson-buildoptions.sh   |    3 +
 scripts/update-linux-headers.sh |    2 +-
 tests/qtest/migration-test.c    |   80 ++-
 tests/unit/meson.build          |    6 +
 tests/unit/test-dsa.c           |  503 ++++++++++++++
 util/dsa.c                      | 1082 +++++++++++++++++++++++++++++++
 util/meson.build                |    3 +
 20 files changed, 2177 insertions(+), 27 deletions(-)
 create mode 100644 include/qemu/dsa.h
 create mode 100644 tests/unit/test-dsa.c
 create mode 100644 util/dsa.c

Michael S. Tsirkin July 11, 2024, 10:49 p.m. UTC | #1

On Thu, Jul 11, 2024 at 02:52:35PM -0700, Yichen Wang wrote:
> * Performance:
> 
> We use two Intel 4th generation Xeon servers for testing.
> 
> Architecture:        x86_64
> CPU(s):              192
> Thread(s) per core:  2
> Core(s) per socket:  48
> Socket(s):           2
> NUMA node(s):        2
> Vendor ID:           GenuineIntel
> CPU family:          6
> Model:               143
> Model name:          Intel(R) Xeon(R) Platinum 8457C
> Stepping:            8
> CPU MHz:             2538.624
> CPU max MHz:         3800.0000
> CPU min MHz:         800.0000
> 
> We perform multifd live migration with below setup:
> 1. VM has 100GB memory. 
> 2. Use the new migration option multifd-set-normal-page-ratio to control the total
> size of the payload sent over the network.
> 3. Use 8 multifd channels.
> 4. Use tcp for live migration.
> 4. Use CPU to perform zero page checking as the baseline.
> 5. Use one DSA device to offload zero page checking to compare with the baseline.
> 6. Use "perf sched record" and "perf sched timehist" to analyze CPU usage.
> 
> A) Scenario 1: 50% (50GB) normal pages on an 100GB vm.
> 
> 	CPU usage
> 
> 	|---------------|---------------|---------------|---------------|
> 	|		|comm		|runtime(msec)	|totaltime(msec)|
> 	|---------------|---------------|---------------|---------------|
> 	|Baseline	|live_migration	|5657.58	|		|
> 	|		|multifdsend_0	|3931.563	|		|
> 	|		|multifdsend_1	|4405.273	|		|
> 	|		|multifdsend_2	|3941.968	|		|
> 	|		|multifdsend_3	|5032.975	|		|
> 	|		|multifdsend_4	|4533.865	|		|
> 	|		|multifdsend_5	|4530.461	|		|
> 	|		|multifdsend_6	|5171.916	|		|
> 	|		|multifdsend_7	|4722.769	|41922		|
> 	|---------------|---------------|---------------|---------------|
> 	|DSA		|live_migration	|6129.168	|		|
> 	|		|multifdsend_0	|2954.717	|		|
> 	|		|multifdsend_1	|2766.359	|		|
> 	|		|multifdsend_2	|2853.519	|		|
> 	|		|multifdsend_3	|2740.717	|		|
> 	|		|multifdsend_4	|2824.169	|		|
> 	|		|multifdsend_5	|2966.908	|		|
> 	|		|multifdsend_6	|2611.137	|		|
> 	|		|multifdsend_7	|3114.732	|		|
> 	|		|dsa_completion	|3612.564	|32568		|
> 	|---------------|---------------|---------------|---------------|
> 
> Baseline total runtime is calculated by adding up all multifdsend_X
> and live_migration threads runtime. DSA offloading total runtime is
> calculated by adding up all multifdsend_X, live_migration and
> dsa_completion threads runtime. 41922 msec VS 32568 msec runtime and
> that is 23% total CPU usage savings.


Here the DSA was mostly idle.

Sounds good but a question: what if several qemu instances are
migrated in parallel?

Some accelerators tend to basically stall if several tasks
are trying to use them at the same time.

Where is the boundary here?

Michael S. Tsirkin July 12, 2024, 10:58 a.m. UTC | #2

On Thu, Jul 11, 2024 at 02:52:35PM -0700, Yichen Wang wrote:
> v5
> * Rebase on top of 39a032cea23e522268519d89bb738974bc43b6f6.
> * Rename struct definitions with typedef and CamelCase names;
> * Add build and runtime checks about DSA accelerator;
> * Address all comments from v4 reviews about typos, licenses, comments,
> error reporting, etc.
> 
> v4
> * Rebase on top of 85b597413d4370cb168f711192eaef2eb70535ac.
> * A separate "multifd zero page checking" patchset was split from this
> patchset's v3 and got merged into master. v4 re-applied the rest of all
> commits on top of that patchset, re-factored and re-tested.
> https://lore.kernel.org/all/20240311180015.3359271-1-hao.xiang@linux.dev/
> * There are some feedback from v3 I likely overlooked.
>  
> v3
> * Rebase on top of 7425b6277f12e82952cede1f531bfc689bf77fb1.
> * Fix error/warning from checkpatch.pl
> * Fix use-after-free bug when multifd-dsa-accel option is not set.
> * Handle error from dsa_init and correctly propogate the error.
> * Remove unnecessary call to dsa_stop.
> * Detect availability of DSA feature at compile time.
> * Implement a generic batch_task structure and a DSA specific one dsa_batch_task.
> * Remove all exit() calls and propagate errors correctly.
> * Use bytes instead of page count to configure multifd-packet-size option.
> 
> v2
> * Rebase on top of 3e01f1147a16ca566694b97eafc941d62fa1e8d8.
> * Leave Juan's changes in their original form instead of squashing them.
> * Add a new commit to refactor the multifd_send_thread function to prepare for introducing the DSA offload functionality.
> * Use page count to configure multifd-packet-size option.
> * Don't use the FLAKY flag in DSA tests.
> * Test if DSA integration test is setup correctly and skip the test if
> * not.
> * Fixed broken link in the previous patch cover.
> 
> * Background:

The DSA interface here is extremely low level: it would mean we add a
ton of complex fragile code for any new accelerator.
Please add something high level and simple on top of this.
Off the top of my head:

void start_memcmp(void *a, void *b, int cnt, void *opaque,
	void (*callback)(int result, void *a, void *b, int cnt, void *opaque)
	);

Do all the batching hacks internally.
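
To make the shape of that suggestion concrete, here is a minimal sketch
that keeps the signature above and backs it with a trivial CPU
implementation. It is only an assumption about how such a wrapper might
look: the MemcmpDone typedef and the record_equal callback are
hypothetical, and an accelerator backend would queue and batch requests
behind the same entry point instead of completing immediately.

```c
#include <assert.h>
#include <string.h>

/* Completion callback type matching the signature proposed above. */
typedef void (*MemcmpDone)(int result, void *a, void *b, int cnt,
                           void *opaque);

/*
 * CPU backend: compare and complete right away. A DSA backend could
 * collect requests into a batch and invoke the callbacks on completion.
 */
static void start_memcmp(void *a, void *b, int cnt, void *opaque,
                         MemcmpDone callback)
{
    assert(cnt >= 0);
    int result = memcmp(a, b, (size_t)cnt);
    callback(result, a, b, cnt, opaque);
}

/* Example callback: store 1 in *opaque if the buffers were equal. */
static void record_equal(int result, void *a, void *b, int cnt,
                         void *opaque)
{
    (void)a;
    (void)b;
    (void)cnt;
    *(int *)opaque = (result == 0);
}
```

A caller only sees start_memcmp() and its callback, so the batching,
device selection and fallback logic can change without touching the
migration code.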




> I posted an RFC about DSA offloading in QEMU:
> https://patchew.org/QEMU/20230529182001.2232069-1-hao.xiang@bytedance.com/
> 
> This patchset implements the DSA offloading on zero page checking in
> multifd live migration code path.
> 
> * Overview:
> 
> Intel Data Streaming Accelerator(DSA) is introduced in Intel's 4th generation
> Xeon server, aka Sapphire Rapids.
> https://cdrdv2-public.intel.com/671116/341204-intel-data-streaming-accelerator-spec.pdf
> https://www.intel.com/content/www/us/en/content-details/759709/intel-data-streaming-accelerator-user-guide.html
> One of the things DSA can do is to offload memory comparison workload from
> CPU to DSA accelerator hardware. This patchset implements a solution to offload
> QEMU's zero page checking from CPU to DSA accelerator hardware. We gain
> two benefits from this change:
> 1. Reduces CPU usage in multifd live migration workflow across all use
> cases.
> 2. Reduces migration total time in some use cases. 
> 
> * Design:
> 
> These are the logical steps to perform DSA offloading:
> 1. Configure DSA accelerators and create user space openable DSA work
> queues via the idxd driver.
> 2. Map DSA's work queue into a user space address space.
> 3. Fill an in-memory task descriptor to describe the memory operation.
> 4. Use dedicated CPU instruction _enqcmd to queue a task descriptor to
> the work queue.
> 5. Pull the task descriptor's completion status field until the task
> completes.
> 6. Check return status.
> 
> The memory operation is now totally done by the accelerator hardware but
> the new workflow introduces overheads. The overhead is the extra cost CPU
> prepares and submits the task descriptors and the extra cost CPU pulls for
> completion. The design is around minimizing these two overheads.
> 
> 1. In order to reduce the overhead on task preparation and submission,
> we use batch descriptors. A batch descriptor will contain N individual
> zero page checking tasks where the default N is 128 (default packet size
> / page size) and we can increase N by setting the packet size via a new
> migration option.
> 2. The multifd sender threads prepares and submits batch tasks to DSA
> hardware and it waits on a synchronization object for task completion.
> Whenever a DSA task is submitted, the task structure is added to a
> thread safe queue. It's safe to have multiple multifd sender threads to
> submit tasks concurrently.
> 3. Multiple DSA hardware devices can be used. During multifd initialization,
> every sender thread will be assigned a DSA device to work with. We
> use a round-robin scheme to evenly distribute the work across all used
> DSA devices.
> 4. Use a dedicated thread dsa_completion to perform busy pulling for all
> DSA task completions. The thread keeps dequeuing DSA tasks from the
> thread safe queue. The thread blocks when there is no outstanding DSA
> task. When pulling for completion of a DSA task, the thread uses CPU
> instruction _mm_pause between the iterations of a busy loop to save some
> CPU power as well as optimizing core resources for the other hypercore.
> 5. DSA accelerator can encounter errors. The most popular error is a
> page fault. We have tested using devices to handle page faults but
> performance is bad. Right now, if DSA hits a page fault, we fallback to
> use CPU to complete the rest of the work. The CPU fallback is done in
> the multifd sender thread.
> 6. Added a new migration option multifd-dsa-accel to set the DSA device
> path. If set, the multifd workflow will leverage the DSA devices for
> offloading.
> 7. Added a new migration option multifd-normal-page-ratio to make
> multifd live migration easier to test. Setting a normal page ratio will
> make live migration recognize a zero page as a normal page and send
> the entire payload over the network. If we want to send a large network
> payload and analyze throughput, this option is useful.
> 8. Added a new migration option multifd-packet-size. This can increase
> the number of pages being zero page checked and sent over the network.
> The extra synchronization between the sender threads and the dsa
> completion thread is an overhead. Using a large packet size can reduce
> that overhead.
> 
> * Performance:
> 
> We use two Intel 4th generation Xeon servers for testing.
> 
> Architecture:        x86_64
> CPU(s):              192
> Thread(s) per core:  2
> Core(s) per socket:  48
> Socket(s):           2
> NUMA node(s):        2
> Vendor ID:           GenuineIntel
> CPU family:          6
> Model:               143
> Model name:          Intel(R) Xeon(R) Platinum 8457C
> Stepping:            8
> CPU MHz:             2538.624
> CPU max MHz:         3800.0000
> CPU min MHz:         800.0000
> 
> We perform multifd live migration with below setup:
> 1. VM has 100GB memory. 
> 2. Use the new migration option multifd-set-normal-page-ratio to control the total
> size of the payload sent over the network.
> 3. Use 8 multifd channels.
> 4. Use tcp for live migration.
> 4. Use CPU to perform zero page checking as the baseline.
> 5. Use one DSA device to offload zero page checking to compare with the baseline.
> 6. Use "perf sched record" and "perf sched timehist" to analyze CPU usage.
> 
> A) Scenario 1: 50% (50GB) normal pages on an 100GB vm.
> 
> 	CPU usage
> 
> 	|---------------|---------------|---------------|---------------|
> 	|		|comm		|runtime(msec)	|totaltime(msec)|
> 	|---------------|---------------|---------------|---------------|
> 	|Baseline	|live_migration	|5657.58	|		|
> 	|		|multifdsend_0	|3931.563	|		|
> 	|		|multifdsend_1	|4405.273	|		|
> 	|		|multifdsend_2	|3941.968	|		|
> 	|		|multifdsend_3	|5032.975	|		|
> 	|		|multifdsend_4	|4533.865	|		|
> 	|		|multifdsend_5	|4530.461	|		|
> 	|		|multifdsend_6	|5171.916	|		|
> 	|		|multifdsend_7	|4722.769	|41922		|
> 	|---------------|---------------|---------------|---------------|
> 	|DSA		|live_migration	|6129.168	|		|
> 	|		|multifdsend_0	|2954.717	|		|
> 	|		|multifdsend_1	|2766.359	|		|
> 	|		|multifdsend_2	|2853.519	|		|
> 	|		|multifdsend_3	|2740.717	|		|
> 	|		|multifdsend_4	|2824.169	|		|
> 	|		|multifdsend_5	|2966.908	|		|
> 	|		|multifdsend_6	|2611.137	|		|
> 	|		|multifdsend_7	|3114.732	|		|
> 	|		|dsa_completion	|3612.564	|32568		|
> 	|---------------|---------------|---------------|---------------|
> 
> Baseline total runtime is calculated by adding up all multifdsend_X
> and live_migration threads runtime. DSA offloading total runtime is
> calculated by adding up all multifdsend_X, live_migration and
> dsa_completion threads runtime. 41922 msec VS 32568 msec runtime and
> that is 23% total CPU usage savings.
> 
> 	Latency
> 	|---------------|---------------|---------------|---------------|---------------|---------------|
> 	|		|total time	|down time	|throughput	|transferred-ram|total-ram	|
> 	|---------------|---------------|---------------|---------------|---------------|---------------|	
> 	|Baseline	|10343 ms	|161 ms		|41007.00 mbps	|51583797 kb	|102400520 kb	|
> 	|---------------|---------------|---------------|---------------|-------------------------------|
> 	|DSA offload	|9535 ms	|135 ms		|46554.40 mbps	|53947545 kb	|102400520 kb	|	
> 	|---------------|---------------|---------------|---------------|---------------|---------------|
> 
> Total time is 8% faster and down time is 16% faster.
> 
> B) Scenario 2: 100% (100GB) zero pages on an 100GB vm.
> 
> 	CPU usage
> 	|---------------|---------------|---------------|---------------|
> 	|		|comm		|runtime(msec)	|totaltime(msec)|
> 	|---------------|---------------|---------------|---------------|
> 	|Baseline	|live_migration	|4860.718	|		|
> 	|	 	|multifdsend_0	|748.875	|		|
> 	|		|multifdsend_1	|898.498	|		|
> 	|		|multifdsend_2	|787.456	|		|
> 	|		|multifdsend_3	|764.537	|		|
> 	|		|multifdsend_4	|785.687	|		|
> 	|		|multifdsend_5	|756.941	|		|
> 	|		|multifdsend_6	|774.084	|		|
> 	|		|multifdsend_7	|782.900	|11154		|
> 	|---------------|---------------|-------------------------------|
> 	|DSA offloading	|live_migration	|3846.976	|		|
> 	|		|multifdsend_0	|191.880	|		|
> 	|		|multifdsend_1	|166.331	|		|
> 	|		|multifdsend_2	|168.528	|		|
> 	|		|multifdsend_3	|197.831	|		|
> 	|		|multifdsend_4	|169.580	|		|
> 	|		|multifdsend_5	|167.984	|		|
> 	|		|multifdsend_6	|198.042	|		|
> 	|		|multifdsend_7	|170.624	|		|
> 	|		|dsa_completion	|3428.669	|8700		|
> 	|---------------|---------------|---------------|---------------|
> 
> Baseline total runtime is 11154 msec and DSA offloading total runtime is
> 8700 msec. That is 22% CPU savings.
> 
> 	Latency
> 	|--------------------------------------------------------------------------------------------|
> 	|		|total time	|down time	|throughput	|transferred-ram|total-ram   |
> 	|---------------|---------------|---------------|---------------|---------------|------------|	
> 	|Baseline	|4867 ms	|20 ms		|1.51 mbps	|565 kb		|102400520 kb|
> 	|---------------|---------------|---------------|---------------|----------------------------|
> 	|DSA offload	|3888 ms	|18 ms		|1.89 mbps	|565 kb		|102400520 kb|	
> 	|---------------|---------------|---------------|---------------|---------------|------------|
> 
> Total time 20% faster and down time 10% faster.
> 
> * Testing:
> 
> 1. Added unit tests for cover the added code path in dsa.c
> 2. Added integration tests to cover multifd live migration using DSA
> offloading.
> 
> Hao Xiang (12):
>   meson: Introduce new instruction set enqcmd to the build system.
>   util/dsa: Implement DSA device start and stop logic.
>   util/dsa: Implement DSA task enqueue and dequeue.
>   util/dsa: Implement DSA task asynchronous completion thread model.
>   util/dsa: Implement zero page checking in DSA task.
>   util/dsa: Implement DSA task asynchronous submission and wait for
>     completion.
>   migration/multifd: Add new migration option for multifd DSA
>     offloading.
>   migration/multifd: Prepare to introduce DSA acceleration on the
>     multifd path.
>   migration/multifd: Enable DSA offloading in multifd sender path.
>   migration/multifd: Add migration option set packet size.
>   util/dsa: Add unit test coverage for Intel DSA task submission and
>     completion.
>   migration/multifd: Add integration tests for multifd with Intel DSA
>     offloading.
> 
> Yichen Wang (1):
>   util/dsa: Add idxd into linux header copy list.
> 
>  include/qemu/dsa.h              |  176 +++++
>  meson.build                     |   14 +
>  meson_options.txt               |    2 +
>  migration/migration-hmp-cmds.c  |   22 +-
>  migration/migration.c           |    2 +-
>  migration/multifd-zero-page.c   |  100 ++-
>  migration/multifd-zlib.c        |    6 +-
>  migration/multifd-zstd.c        |    6 +-
>  migration/multifd.c             |   53 +-
>  migration/multifd.h             |    8 +-
>  migration/options.c             |   85 +++
>  migration/options.h             |    2 +
>  qapi/migration.json             |   49 +-
>  scripts/meson-buildoptions.sh   |    3 +
>  scripts/update-linux-headers.sh |    2 +-
>  tests/qtest/migration-test.c    |   80 ++-
>  tests/unit/meson.build          |    6 +
>  tests/unit/test-dsa.c           |  503 ++++++++++++++
>  util/dsa.c                      | 1082 +++++++++++++++++++++++++++++++
>  util/meson.build                |    3 +
>  20 files changed, 2177 insertions(+), 27 deletions(-)
>  create mode 100644 include/qemu/dsa.h
>  create mode 100644 tests/unit/test-dsa.c
>  create mode 100644 util/dsa.c
> 
> -- 
> Yichen Wang
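
As a cross-check of the Scenario 1 CPU-usage figures in the cover letter,
the per-thread runtimes can be summed directly. The sketch below is only an
illustration: the array and function names are made up for this note, and the
values are copied from the Scenario 1 table. Summing the rows gives totals
within a few msec of the reported 41922 and 32568, and either pair of totals
works out to a saving of roughly 22%, close to the 23% quoted.

```c
#include <assert.h>
#include <stddef.h>

/* Per-thread runtimes in msec, copied from the Scenario 1 table. */
const double baseline_msec[] = {
    5657.58,                                    /* live_migration */
    3931.563, 4405.273, 3941.968, 5032.975,     /* multifdsend_0..3 */
    4533.865, 4530.461, 5171.916, 4722.769,     /* multifdsend_4..7 */
};

const double dsa_msec[] = {
    6129.168,                                   /* live_migration */
    2954.717, 2766.359, 2853.519, 2740.717,     /* multifdsend_0..3 */
    2824.169, 2966.908, 2611.137, 3114.732,     /* multifdsend_4..7 */
    3612.564,                                   /* dsa_completion */
};

/* Add up n runtimes. */
double sum_msec(const double *v, size_t n)
{
    double total = 0.0;

    for (size_t i = 0; i < n; i++) {
        total += v[i];
    }
    return total;
}

/* CPU time saved by offloading, as a percentage of the baseline. */
double savings_pct(double baseline, double offload)
{
    assert(baseline > 0.0);
    return (baseline - offload) / baseline * 100.0;
}
```

Calling savings_pct() with the two results of sum_msec(), or with the
reported totals, gives the percentage saving for the scenario.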

Yuan Liu July 15, 2024, 8:29 a.m. UTC | #3

> On Thu, Jul 11, 2024 at 02:52:35PM -0700, Yichen Wang wrote:
> > * Performance:
> >
> > We use two Intel 4th generation Xeon servers for testing.
> >
> > Architecture:        x86_64
> > CPU(s):              192
> > Thread(s) per core:  2
> > Core(s) per socket:  48
> > Socket(s):           2
> > NUMA node(s):        2
> > Vendor ID:           GenuineIntel
> > CPU family:          6
> > Model:               143
> > Model name:          Intel(R) Xeon(R) Platinum 8457C
> > Stepping:            8
> > CPU MHz:             2538.624
> > CPU max MHz:         3800.0000
> > CPU min MHz:         800.0000
> >
> > We perform multifd live migration with the setup below:
> > 1. VM has 100GB memory.
> > 2. Use the new migration option multifd-set-normal-page-ratio to control
> > the total size of the payload sent over the network.
> > 3. Use 8 multifd channels.
> > 4. Use tcp for live migration.
> > 5. Use CPU to perform zero page checking as the baseline.
> > 6. Use one DSA device to offload zero page checking to compare with the
> > baseline.
> > 7. Use "perf sched record" and "perf sched timehist" to analyze CPU
> > usage.
> >
> > A) Scenario 1: 50% (50GB) normal pages on a 100GB VM.
> >
> > 	CPU usage
> >
> > 	|---------------|---------------|---------------|---------------|
> > 	|		|comm		|runtime(msec)	|totaltime(msec)|
> > 	|---------------|---------------|---------------|---------------|
> > 	|Baseline	|live_migration	|5657.58	|		|
> > 	|		|multifdsend_0	|3931.563	|		|
> > 	|		|multifdsend_1	|4405.273	|		|
> > 	|		|multifdsend_2	|3941.968	|		|
> > 	|		|multifdsend_3	|5032.975	|		|
> > 	|		|multifdsend_4	|4533.865	|		|
> > 	|		|multifdsend_5	|4530.461	|		|
> > 	|		|multifdsend_6	|5171.916	|		|
> > 	|		|multifdsend_7	|4722.769	|41922		|
> > 	|---------------|---------------|---------------|---------------|
> > 	|DSA		|live_migration	|6129.168	|		|
> > 	|		|multifdsend_0	|2954.717	|		|
> > 	|		|multifdsend_1	|2766.359	|		|
> > 	|		|multifdsend_2	|2853.519	|		|
> > 	|		|multifdsend_3	|2740.717	|		|
> > 	|		|multifdsend_4	|2824.169	|		|
> > 	|		|multifdsend_5	|2966.908	|		|
> > 	|		|multifdsend_6	|2611.137	|		|
> > 	|		|multifdsend_7	|3114.732	|		|
> > 	|		|dsa_completion	|3612.564	|32568		|
> > 	|---------------|---------------|---------------|---------------|
> >
> > Baseline total runtime is calculated by adding up all multifdsend_X
> > and live_migration threads runtime. DSA offloading total runtime is
> > calculated by adding up all multifdsend_X, live_migration and
> > dsa_completion threads runtime. 41922 msec VS 32568 msec runtime and
> > that is 23% total CPU usage savings.
> 
> 
> Here the DSA was mostly idle.
> 
> Sounds good but a question: what if several qemu instances are
> migrated in parallel?
> 
> Some accelerators tend to basically stall if several tasks
> are trying to use them at the same time.
> 
> Where is the boundary here?

A DSA device can be assigned to multiple QEMU instances.
The DSA resource that each process uses is called a work queue. Each DSA
device supports up to 8 work queues, and each work queue is either a
dedicated queue or a shared queue.

A dedicated queue can serve only one process. In theory, there is no limit
on the number of processes that can use a shared queue, because it is based
on enqcmd + SVM technology.

https://www.kernel.org/doc/html/v5.17/x86/sva.html


Michael S. Tsirkin July 15, 2024, 12:23 p.m. UTC | #4

On Mon, Jul 15, 2024 at 08:29:03AM +0000, Liu, Yuan1 wrote:
> A DSA device can be assigned to multiple QEMU instances.
> The DSA resource that each process uses is called a work queue. Each DSA
> device supports up to 8 work queues, and each work queue is either a
> dedicated queue or a shared queue.
>
> A dedicated queue can serve only one process. In theory, there is no limit
> on the number of processes that can use a shared queue, because it is based
> on enqcmd + SVM technology.
> 
> https://www.kernel.org/doc/html/v5.17/x86/sva.html

This server has 200 CPUs, which can conceivably migrate around 100
single-CPU QEMU instances with no issue. What happens if you do this with DSA?


Yuan Liu July 15, 2024, 1:09 p.m. UTC | #5

> This server has 200 CPUs, which can conceivably migrate around 100
> single-CPU QEMU instances with no issue. What happens if you do this with DSA?

First, the DSA work queue needs to be configured in shared mode, and one
queue is enough.

The maximum depth of a DSA hardware work queue is 128, which means that no
more than 128 zero-page detection tasks can be queued at once. Beyond that,
enqcmd returns an error until the work queue is available again.

I don't have any data yet for 100 QEMU instances migrating concurrently. I
think the 100 zero-page detection tasks can be submitted successfully to the
DSA hardware work queue, but the throughput of DSA's zero-page detection also
needs to be considered. Once the DSA reaches its maximum throughput, the work
queue may fill up quickly, which will leave some QEMU instances temporarily
unable to submit new tasks to the DSA. This is likely to happen in the first
round of migration memory iteration.
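
To make the queue-depth argument concrete, here is a toy model of one shared
work queue, assuming the depth of 128 mentioned above. Everything else is
hypothetical and chosen only for illustration: the SimWq type, the sim_*
functions, and the idea that every submitter offers one task per time step
while the device retires a fixed number of tasks per step. It does not touch
real hardware or the idxd interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define WQ_DEPTH 128    /* work queue depth stated in this thread */

/* Toy model of one shared work queue; not the real DSA interface. */
typedef struct {
    size_t in_flight;   /* tasks currently occupying queue slots */
    size_t accepted;    /* submissions that found a free slot */
    size_t rejected;    /* submissions refused because the queue was full */
} SimWq;

/* Model one submission attempt: it fails when the queue is full. */
bool sim_wq_submit(SimWq *wq)
{
    assert(wq != NULL);
    if (wq->in_flight >= WQ_DEPTH) {
        wq->rejected++;
        return false;
    }
    wq->in_flight++;
    wq->accepted++;
    return true;
}

/* Model the device retiring up to n tasks in one time step. */
void sim_wq_complete(SimWq *wq, size_t n)
{
    assert(wq != NULL);
    wq->in_flight -= (n < wq->in_flight) ? n : wq->in_flight;
}

/*
 * Each step, every submitter tries to enqueue one task, then the device
 * retires per_step tasks. Returns how many submissions were refused.
 */
size_t sim_run(size_t submitters, size_t per_step, size_t steps)
{
    SimWq wq = { 0, 0, 0 };

    for (size_t t = 0; t < steps; t++) {
        for (size_t i = 0; i < submitters; i++) {
            sim_wq_submit(&wq);
        }
        sim_wq_complete(&wq, per_step);
    }
    return wq.rejected;
}
```

In this model, 100 submitters never see a refusal as long as the device
retires at least 100 tasks per step. Once the retire rate drops below the
submission rate, the queue fills within a few steps and refusals begin, which
is the temporary stall described above.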


Michael S. Tsirkin July 15, 2024, 2:42 p.m. UTC | #6

On Mon, Jul 15, 2024 at 01:09:59PM +0000, Liu, Yuan1 wrote:
> First, the DSA work queue needs to be configured in shared mode, and one
> queue is enough.
>
> The maximum depth of a DSA hardware work queue is 128, which means that no
> more than 128 zero-page detection tasks can be queued at once. Beyond that,
> enqcmd returns an error until the work queue is available again.
>
> I don't have any data yet for 100 QEMU instances migrating concurrently. I
> think the 100 zero-page detection tasks can be submitted successfully to the
> DSA hardware work queue, but the throughput of DSA's zero-page detection also
> needs to be considered. Once the DSA reaches its maximum throughput, the work
> queue may fill up quickly, which will leave some QEMU instances temporarily
> unable to submit new tasks to the DSA.

The unfortunate reality here would be that there's likely no QoS; this
is purely FIFO, right?

> This is likely to happen in the first round of migration
> memory iteration.

Try testing this and see then?

Yuan Liu July 15, 2024, 3:23 p.m. UTC | #7

> The unfortunate reality here would be that there's likely no QoS; this
> is purely FIFO, right?

Yes, this scenario may be FIFO, assuming that each task covers the same
number of pages. DSA hardware consists of multiple work engines that process
tasks concurrently, and they usually fetch tasks from the work queue in a
round-robin way.

DSA supports priority and flow control at work queue granularity.
https://github.com/intel/idxd-config/blob/stable/Documentation/accfg/accel-config-config-wq.txt
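
The condition "each task covers the same number of pages" matters because
several engines drain one queue. A small sketch of that reasoning follows,
under loud assumptions: tasks are handed out in submission order to whichever
engine frees up first (a simplification of round-robin fetching), each task
has an abstract cost, and all names and bounds here are invented for this
note.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_ENGINES 8   /* illustrative bounds, not hardware limits */
#define MAX_TASKS   64

/*
 * Hand tasks out in submission (FIFO) order to whichever engine becomes
 * free first, and record each task's completion time in finish[].
 */
void sim_dispatch(const unsigned *cost, size_t ntasks, size_t nengines,
                  unsigned *finish)
{
    unsigned free_at[MAX_ENGINES] = { 0 };

    assert(nengines >= 1 && nengines <= MAX_ENGINES);
    for (size_t i = 0; i < ntasks; i++) {
        size_t e = 0;

        /* Pick the engine that is free earliest (lowest index on ties). */
        for (size_t k = 1; k < nengines; k++) {
            if (free_at[k] < free_at[e]) {
                e = k;
            }
        }
        finish[i] = free_at[e] + cost[i];
        free_at[e] = finish[i];
    }
}

/* True if no task completes before a task that was submitted earlier. */
bool completes_in_fifo_order(const unsigned *cost, size_t ntasks,
                             size_t nengines)
{
    unsigned finish[MAX_TASKS];

    assert(ntasks <= MAX_TASKS);
    sim_dispatch(cost, ntasks, nengines, finish);
    for (size_t i = 1; i < ntasks; i++) {
        if (finish[i] < finish[i - 1]) {
            return false;
        }
    }
    return true;
}
```

With equal costs, completions follow submission order in this model. With two
engines and one large task ahead of several small ones, the small tasks finish
first, so the behaviour is no longer strictly FIFO.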

> > This is likely to happen in the first round of migration
> > memory iteration.
> 
> Try testing this and see then?

Yes, I can test based on this patch set. Please review the test scenario:
My server has 192 CPUs, 8 DSA devices, and a 100Gbps NIC.
All 8 DSA devices serve 100 QEMU instances for simultaneous live migration.
Each VM has 1 vCPU and 1G of memory, with no workload in the VM.

You want to know if some QEMU instances are stalled because of DSA, right?

> --
> MST

Yuan Liu July 15, 2024, 3:57 p.m. UTC | #8

> -----Original Message-----
> From: Liu, Yuan1
> Sent: Monday, July 15, 2024 11:23 PM
> Subject: RE: [PATCH v5 00/13] WIP: Use Intel DSA accelerator to offload
> zero page checking in multifd live migration.
> 
> > -----Original Message-----
> > From: Michael S. Tsirkin <mst@redhat.com>
> > Sent: Monday, July 15, 2024 10:43 PM
> > Subject: Re: [PATCH v5 00/13] WIP: Use Intel DSA accelerator to offload
> > zero page checking in multifd live migration.
> >
> > On Mon, Jul 15, 2024 at 01:09:59PM +0000, Liu, Yuan1 wrote:
> > > > -----Original Message-----
> > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > Sent: Monday, July 15, 2024 8:24 PM
> > > > Subject: Re: [PATCH v5 00/13] WIP: Use Intel DSA accelerator to
> > offload
> > > > zero page checking in multifd live migration.
> > > >
> > > > On Mon, Jul 15, 2024 at 08:29:03AM +0000, Liu, Yuan1 wrote:
> > > > > > -----Original Message-----
> > > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > > Sent: Friday, July 12, 2024 6:49 AM
> > > > > > Subject: Re: [PATCH v5 00/13] WIP: Use Intel DSA accelerator to
> > > > offload
> > > > > > zero page checking in multifd live migration.
> > > > > >
> > > > > > On Thu, Jul 11, 2024 at 02:52:35PM -0700, Yichen Wang wrote:
> > > > > > > * Performance:
> > > > > > >
> > > > > > > We use two Intel 4th generation Xeon servers for testing.
> > > > > > >
> > > > > > > Architecture:        x86_64
> > > > > > > CPU(s):              192
> > > > > > > Thread(s) per core:  2
> > > > > > > Core(s) per socket:  48
> > > > > > > Socket(s):           2
> > > > > > > NUMA node(s):        2
> > > > > > > Vendor ID:           GenuineIntel
> > > > > > > CPU family:          6
> > > > > > > Model:               143
> > > > > > > Model name:          Intel(R) Xeon(R) Platinum 8457C
> > > > > > > Stepping:            8
> > > > > > > CPU MHz:             2538.624
> > > > > > > CPU max MHz:         3800.0000
> > > > > > > CPU min MHz:         800.0000
> > > > > > >
> > > > > > > > We perform multifd live migration with the setup below:
> > > > > > > > 1. VM has 100GB memory.
> > > > > > > > 2. Use the new migration option multifd-set-normal-page-ratio to
> > > > > > > > control the total size of the payload sent over the network.
> > > > > > > > 3. Use 8 multifd channels.
> > > > > > > > 4. Use tcp for live migration.
> > > > > > > > 5. Use CPU to perform zero page checking as the baseline.
> > > > > > > > 6. Use one DSA device to offload zero page checking to compare with
> > > > > > > > the baseline.
> > > > > > > > 7. Use "perf sched record" and "perf sched timehist" to analyze CPU
> > > > > > > > usage.
> > > > > > > >
> > > > > > > > A) Scenario 1: 50% (50GB) normal pages on a 100GB VM.
> > > > > > > >
> > > > > > > > CPU usage
> > > > > > > >
> > > > > > > > |---------------|---------------|---------------|---------------|
> > > > > > > > |               |comm           |runtime(msec)  |totaltime(msec)|
> > > > > > > > |---------------|---------------|---------------|---------------|
> > > > > > > > |Baseline       |live_migration |5657.58        |               |
> > > > > > > > |               |multifdsend_0  |3931.563       |               |
> > > > > > > > |               |multifdsend_1  |4405.273       |               |
> > > > > > > > |               |multifdsend_2  |3941.968       |               |
> > > > > > > > |               |multifdsend_3  |5032.975       |               |
> > > > > > > > |               |multifdsend_4  |4533.865       |               |
> > > > > > > > |               |multifdsend_5  |4530.461       |               |
> > > > > > > > |               |multifdsend_6  |5171.916       |               |
> > > > > > > > |               |multifdsend_7  |4722.769       |41922          |
> > > > > > > > |---------------|---------------|---------------|---------------|
> > > > > > > > |DSA            |live_migration |6129.168       |               |
> > > > > > > > |               |multifdsend_0  |2954.717       |               |
> > > > > > > > |               |multifdsend_1  |2766.359       |               |
> > > > > > > > |               |multifdsend_2  |2853.519       |               |
> > > > > > > > |               |multifdsend_3  |2740.717       |               |
> > > > > > > > |               |multifdsend_4  |2824.169       |               |
> > > > > > > > |               |multifdsend_5  |2966.908       |               |
> > > > > > > > |               |multifdsend_6  |2611.137       |               |
> > > > > > > > |               |multifdsend_7  |3114.732       |               |
> > > > > > > > |               |dsa_completion |3612.564       |32568          |
> > > > > > > > |---------------|---------------|---------------|---------------|
> > > > > > > >
> > > > > > > > Baseline total runtime is calculated by adding up all multifdsend_X
> > > > > > > > and live_migration threads runtime. DSA offloading total runtime is
> > > > > > > > calculated by adding up all multifdsend_X, live_migration and
> > > > > > > > dsa_completion threads runtime. 41922 msec VS 32568 msec runtime and
> > > > > > > > that is 23% total CPU usage savings.
> > > > > >
> > > > > >
> > > > > > Here the DSA was mostly idle.
> > > > > >
> > > > > > Sounds good but a question: what if several qemu instances are
> > > > > > migrated in parallel?
> > > > > >
> > > > > > Some accelerators tend to basically stall if several tasks
> > > > > > are trying to use them at the same time.
> > > > > >
> > > > > > Where is the boundary here?

If I understand correctly, you are concerned that in some scenarios the
accelerator itself becomes the migration bottleneck and degrades migration
performance.

My understanding is that the goal is to make full use of the accelerator
bandwidth, and once the accelerator is the bottleneck, zero-page detection
falls back to the CPU.

For example, when enqcmd returns an error, which means the work queue is full,
we can add a retry mechanism or use CPU detection directly.
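
As a rough illustration of the retry-then-fallback idea above, here is a small Python sketch. It does not show how the patch set handles this. It assumes a hypothetical `try_submit` callable that returns the accelerator's answer, or `None` when the submission is rejected because the work queue is full; `check_zero_page`, `BusyQueue` and the retry count of 3 are likewise invented for the example.

```python
def is_zero_page_cpu(page: bytes) -> bool:
    """CPU zero-page check: True if every byte is zero."""
    return not any(page)


def check_zero_page(page, try_submit, max_retries=3):
    """Try the accelerator a few times, then fall back to the CPU.

    try_submit(page) is a hypothetical callable that returns the
    accelerator's answer (True/False), or None when the work queue
    is full and the submission was not accepted.
    Returns (is_zero, path) where path is "dsa" or "cpu".
    """
    for _ in range(max_retries):
        result = try_submit(page)
        if result is not None:
            return result, "dsa"
    return is_zero_page_cpu(page), "cpu"


class BusyQueue:
    """Stand-in for a work queue that rejects the first `busy` submissions."""

    def __init__(self, busy):
        self.busy = busy

    def __call__(self, page):
        if self.busy > 0:
            self.busy -= 1
            return None  # models enqcmd reporting "queue full"
        return not any(page)


zero = bytes(4096)
data = b"\x01" + bytes(4095)
r1 = check_zero_page(zero, BusyQueue(busy=1))   # one rejection, then accepted
r2 = check_zero_page(data, BusyQueue(busy=10))  # always full: CPU fallback
```

With a bounded retry count, a saturated queue costs a few failed submissions and then CPU time, and migration itself does not fail.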

> > > > > A DSA device can be assigned to multiple Qemu instances.
> > > > > The DSA resource used by each process is called a work queue, each DSA
> > > > > device can support up to 8 work queues and work queues are classified into
> > > > > dedicated queues and shared queues.
> > > > >
> > > > > A dedicated queue can only serve one process. Theoretically, there is no limit
> > > > > on the number of processes in a shared queue, it is based on enqcmd + SVM technology.
> > > > >
> > > > > https://www.kernel.org/doc/html/v5.17/x86/sva.html

Michael S. Tsirkin July 15, 2024, 4:08 p.m. UTC | #9

On Mon, Jul 15, 2024 at 03:23:13PM +0000, Liu, Yuan1 wrote:
> > 
> > The unfortunate reality here would be that there's likely no QoS, this
> > is purely fifo, right?
> 
> Yes, this scenario may be fifo, assuming that the number of pages each task
> is the same, because DSA hardware consists of multiple work engines, they can
> process tasks concurrently, usually in a round-robin way to get tasks from the
> work queue.	
> 
> DSA supports priority and flow control based on work queue granularity.
> https://github.com/intel/idxd-config/blob/stable/Documentation/accfg/accel-config-config-wq.txt

Right, but it seems clear there aren't enough work queues for a typical setup.

> > > This is likely to happen in the first round of migration
> > > memory iteration.
> > 
> > Try testing this and see then?
> 
> Yes, I can test based on this patch set. Please review the test scenario
> My server has 192 CPUs, and 8 DSA devices, 100Gbps NIC.
> All 8 DSA devices serve 100 Qemu instances for simultaneous live migration.
> Each VM has 1 vCPU, and 1G memory, with no workload in the VM.
> 
> You want to know if some Qemu instances are stalled because of DSA, right?

And generally, just run the same benchmark you did, compared to CPU:
worst-case and average numbers would be interesting.
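
For reference, the comparison used in the cover letter (summed thread runtime, baseline versus DSA) can be reproduced with a few lines of Python. The per-thread numbers are copied from the quoted table; `cpu_saving` and `summarize` are helper names invented here. Summing the rows as printed gives totals a few msec above the quoted 41922 and 32568, and a saving of roughly 22%. The same `summarize` helper could be applied to per-instance results of a 100-VM run once such data exists; none is available in this thread.

```python
from statistics import mean

# Per-thread runtimes (msec) copied from the table in the cover letter.
baseline = [5657.58, 3931.563, 4405.273, 3941.968, 5032.975,
            4533.865, 4530.461, 5171.916, 4722.769]
dsa = [6129.168, 2954.717, 2766.359, 2853.519, 2740.717,
       2824.169, 2966.908, 2611.137, 3114.732, 3612.564]


def cpu_saving(base_runtimes, new_runtimes):
    """Fractional reduction in summed thread runtime."""
    base_total = sum(base_runtimes)
    return (base_total - sum(new_runtimes)) / base_total


def summarize(samples):
    """Average and worst case (largest value) of a list of measurements."""
    return mean(samples), max(samples)


baseline_total = sum(baseline)
dsa_total = sum(dsa)
saving = cpu_saving(baseline, dsa)
# Average and worst case over the eight multifdsend threads in the DSA run.
avg_send, worst_send = summarize(dsa[1:9])
```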

> > --
> > MST

Michael S. Tsirkin July 15, 2024, 4:24 p.m. UTC | #10

On Mon, Jul 15, 2024 at 03:57:42PM +0000, Liu, Yuan1 wrote:
> > > > > > > > that is 23% total CPU usage savings.
> > > > > > >
> > > > > > >
> > > > > > > Here the DSA was mostly idle.
> > > > > > >
> > > > > > > Sounds good but a question: what if several qemu instances are
> > > > > > > migrated in parallel?
> > > > > > >
> > > > > > > Some accelerators tend to basically stall if several tasks
> > > > > > > are trying to use them at the same time.
> > > > > > >
> > > > > > > Where is the boundary here?
> 
> If I understand correctly, you are concerned that in some scenarios the
> accelerator itself is the migration bottleneck, causing the migration performance
> to be degraded.
> 
> My understanding is to make full use of the accelerator bandwidth, and once
> the accelerator is the bottleneck, it will fall back to zero-page detection
> by the CPU.
> 
> For example, when the enqcmd command returns an error which means the work queue
> is full, then we can add some retry mechanisms or directly use CPU detection.


How is it handled in your patch? If you just abort migration unless
enqcmd succeeds, would that not be a bug, where loading the system
leads to migration failures?

Yuan Liu July 16, 2024, 1:21 a.m. UTC | #11

> -----Original Message-----
> From: Michael S. Tsirkin <mst@redhat.com>
> Sent: Tuesday, July 16, 2024 12:09 AM
> Subject: Re: [PATCH v5 00/13] WIP: Use Intel DSA accelerator to offload
> zero page checking in multifd live migration.
> 
> On Mon, Jul 15, 2024 at 03:23:13PM +0000, Liu, Yuan1 wrote:
> > > The unfortunate reality here would be that there's likely no QoS, this
> > > is purely fifo, right?
> >
> > Yes, this scenario may be fifo, assuming that the number of pages in each
> > task is the same, because DSA hardware consists of multiple work engines
> > that can process tasks concurrently, usually fetching tasks from the
> > work queue in a round-robin way.
> >
> > DSA supports priority and flow control at work queue granularity.
> > https://github.com/intel/idxd-config/blob/stable/Documentation/accfg/accel-config-config-wq.txt
> 
> Right but it seems clear there aren't enough work queues for a typical
> setup.
> 
> > > > This is likely to happen in the first round of migration
> > > > memory iteration.
> > >
> > > Try testing this and see then?
> >
> > Yes, I can test based on this patch set. Please review the test scenario:
> > My server has 192 CPUs, 8 DSA devices, and a 100Gbps NIC.
> > All 8 DSA devices serve 100 Qemu instances for simultaneous live
> > migration.
> > Each VM has 1 vCPU and 1G memory, with no workload in the VM.
> >
> > You want to know if some Qemu instances are stalled because of DSA,
> > right?
> 
> And generally just run same benchmark you did compared to cpu:
> worst case and average numbers would be interesting.

Sure, I will run a test for this.

> > > --
> > > MST

Yuan Liu July 16, 2024, 1:25 a.m. UTC | #12

> -----Original Message-----
> From: Michael S. Tsirkin <mst@redhat.com>
> Sent: Tuesday, July 16, 2024 12:24 AM
> To: Liu, Yuan1 <yuan1.liu@intel.com>
> Cc: Wang, Yichen <yichen.wang@bytedance.com>; Paolo Bonzini
> <pbonzini@redhat.com>; Marc-André Lureau <marcandre.lureau@redhat.com>;
> Daniel P. Berrangé <berrange@redhat.com>; Thomas Huth <thuth@redhat.com>;
> Philippe Mathieu-Daudé <philmd@linaro.org>; Peter Xu <peterx@redhat.com>;
> Fabiano Rosas <farosas@suse.de>; Eric Blake <eblake@redhat.com>; Markus
> Armbruster <armbru@redhat.com>; Cornelia Huck <cohuck@redhat.com>; qemu-
> devel@nongnu.org; Hao Xiang <hao.xiang@linux.dev>; Kumar, Shivam
> <shivam.kumar1@nutanix.com>; Ho-Ren (Jack) Chuang
> <horenchuang@bytedance.com>
> Subject: Re: [PATCH v5 00/13] WIP: Use Intel DSA accelerator to offload
> zero page checking in multifd live migration.
> 
> On Mon, Jul 15, 2024 at 03:57:42PM +0000, Liu, Yuan1 wrote:
> > > > > > > > > that is 23% total CPU usage savings.
> > > > > > > >
> > > > > > > >
> > > > > > > > Here the DSA was mostly idle.
> > > > > > > >
> > > > > > > > > Sounds good but a question: what if several qemu instances
> > > > > > > > > are migrated in parallel?
> > > > > > > >
> > > > > > > > Some accelerators tend to basically stall if several tasks
> > > > > > > > are trying to use them at the same time.
> > > > > > > >
> > > > > > > > Where is the boundary here?
> >
> > If I understand correctly, you are concerned that in some scenarios the
> > accelerator itself is the migration bottleneck, causing the migration
> > performance to be degraded.
> >
> > My understanding is that we make full use of the accelerator bandwidth,
> > and once the accelerator is the bottleneck, fall back to zero-page
> > detection by the CPU.
> >
> > For example, when the enqcmd command returns an error, which means the
> > work queue is full, we can add a retry mechanism or directly use CPU
> > detection.
> 
> 
> How is it handled in your patch? If you just abort migration unless
> enqcmd succeeds then would that not be a bug, where loading the system
> leads to migration failures?

Sorry for this, I have just started reviewing this patch. The content we
discussed before is only related to the DSA device itself and may not be
related to this patch's implementation. I will review the issue you mentioned
carefully. Thank you for your reminder.

> --
> MST
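
To make the retry-or-fall-back idea discussed above concrete, here is a
minimal sketch in C. It is not taken from the patch set and does not use the
real enqcmd instruction or any idxd/QEMU API: ModelWq, model_wq_submit(),
cpu_page_is_zero(), check_page() and MAX_RETRIES are all hypothetical names
invented for illustration. Only the queue depth of 128 comes from the thread.
The model has no asynchronous completion side, so the "offloaded" result is
computed in software.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define WQ_DEPTH 128      /* work queue depth quoted in the thread */
#define MAX_RETRIES 3     /* hypothetical retry budget, not from the patch */

/* Hypothetical stand-in for a shared DSA work queue (not an idxd type). */
typedef struct {
    unsigned in_flight;   /* descriptors currently occupying the queue */
} ModelWq;

/* Models a submission: succeeds if a slot is free, else reports "full". */
static bool model_wq_submit(ModelWq *wq)
{
    assert(wq->in_flight <= WQ_DEPTH);
    if (wq->in_flight == WQ_DEPTH) {
        return false;     /* queue full, caller may retry or fall back */
    }
    wq->in_flight++;
    return true;
}

/* Plain CPU zero-page check used as the fallback path. */
static bool cpu_page_is_zero(const uint8_t *page, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (page[i] != 0) {
            return false;
        }
    }
    return true;
}

typedef enum { CHECKED_BY_DSA, CHECKED_BY_CPU } CheckPath;

/*
 * Try to offload up to MAX_RETRIES times. If the queue stays full, check
 * the page on the CPU instead of failing the migration.
 */
static CheckPath check_page(ModelWq *wq, const uint8_t *page, size_t len,
                            bool *is_zero)
{
    for (int attempt = 0; attempt < MAX_RETRIES; attempt++) {
        if (model_wq_submit(wq)) {
            /* Software result stands in for the hardware completion. */
            *is_zero = cpu_page_is_zero(page, len);
            wq->in_flight--;   /* task completes and frees its slot */
            return CHECKED_BY_DSA;
        }
    }
    *is_zero = cpu_page_is_zero(page, len);
    return CHECKED_BY_CPU;
}
```

With a policy along these lines, a full work queue would slow an instance
down to CPU speed rather than abort its migration. Whether the patch set
behaves this way is exactly the open question raised above.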

Fabiano Rosas July 16, 2024, 9:47 p.m. UTC | #13

Yichen Wang <yichen.wang@bytedance.com> writes:

> v5
> * Rebase on top of 39a032cea23e522268519d89bb738974bc43b6f6.
> * Rename struct definitions with typedef and CamelCase names;
> * Add build and runtime checks about DSA accelerator;
> * Address all comments from v4 reviews about typos, licenses, comments,
> error reporting, etc.

Hi,

You forgot to make sure the patches compile without DSA support as
well! =)

Also, please be more explicit on the state of the series, the WIP on the
title is not enough. You can send the whole series as RFC (e.g. PATCH RFC v5)
if it's not ready to merge, or put the RFC tag only on the patches you
need help with. But make sure you have some words in the cover-letter
stating what is going on.

Another point is, I see you have applied some suggestions from the
previous version, but did those on top of the existing code in some
cases. Try to avoid that and please fix it for the next version. That
is, don't add code in one patch just to remove it on the next, try to
apply the changes/suggestions on the patch that introduces the code, as
much as possible.
