
migration: Refine the convergence detection

Message ID 89555383887355d848c1005bfde1641bcddff024.1721821071.git.yong.huang@smartx.com
State New
Series migration: Refine the convergence detection

Commit Message

Yong Huang July 24, 2024, 11:39 a.m. UTC
Currently, the convergence algorithm determines that the migration
cannot converge according to the following principle:
the dirty pages generated in the current iteration exceed a specific
percentage (throttle-trigger-threshold, 50 by default) of the amount
of data transmitted. Let's refer to this criterion as the
"transmission speed". If this criterion is met twice or more
(dirty_rate_high_cnt >= 2), the throttle percentage is increased.
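
For reference, the existing check in migration_trigger_throttle() works
roughly like the following simplified sketch (based on the code this
patch touches; the actual throttling calls are omitted):

    uint64_t threshold = migrate_throttle_trigger_threshold(); /* 50 by default */
    uint64_t bytes_xfer_period =
        migration_transferred_bytes() - rs->bytes_xfer_prev;
    uint64_t bytes_dirty_period =
        rs->num_dirty_pages_period * TARGET_PAGE_SIZE;
    uint64_t bytes_dirty_threshold = bytes_xfer_period * threshold / 100;

    if ((bytes_dirty_period > bytes_dirty_threshold) &&
        (++rs->dirty_rate_high_cnt >= 2)) {
        rs->dirty_rate_high_cnt = 0;
        /* start or increase throttling here */
    }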

In most cases, the above implementation is appropriate. However, for a
VM with huge memory and a high memory load, each iteration is
time-consuming. The VM's computing performance may be throttled at
a high percentage and for a long time due to the repeated
confirmation behavior, which may be intolerable for some
computationally sensitive software in the VM. The refinement targets
this scenario.

As the comment in the migration_trigger_throttle function mentions,
the original algorithm confirms the criterion repeatedly in order to
avoid erroneous detection. Put differently, once the detection becomes
more reliable, it no longer needs to be confirmed twice.

In the refinement, in order to make the detection more accurate, we
introduce another criterion, called the "valid transmission ratio",
to determine migration convergence. The "valid transmission ratio"
is the ratio of bytes_xfer_period to bytes_dirty_period, which
describes the migration efficiency.

When the algorithm repeatedly detects that the current iteration's
"valid transmission ratio" is lower than that of the previous
iteration, it determines that the migration cannot converge.
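
As a purely illustrative example with hypothetical numbers: if an
iteration transfers 1000 MB while the guest dirties 600 MB, the
dirty-to-transferred ratio is 60%, already above the default 50%
threshold; if the next iteration again transfers 1000 MB but the guest
dirties 900 MB, that ratio rises to 90%, i.e. the "valid transmission
ratio" keeps falling. When the ratio stays above the threshold and keeps
worsening like this across iterations, the refined algorithm treats it
as evidence of non-convergence and, after repeated confirmation,
increases the throttle.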

For the "transmission speed" and "valid transmission ratio", if one
of the two criteria is met, the penalty percentage would be increased.
This saves the time of the entire iteration and therefore reduces
the time of VM performance degradation.

In conclusion, this refinement significantly reduces the time required
for the throttle percentage to step up to its maximum while the VM is
under a high memory load.

The following is the test environment:
----------------------------------------------------------------------
a. Test tool:
guestperf

Refer to the following link to see details:
https://github.com/qemu/qemu/tree/master/tests/migration/guestperf

b. Test VM scale:
CPU: 10; Memory: 28GB

c. Average bandwidth between source and destination for migration:
1.53 Gbits/sec
----------------------------------------------------------------------
All the supplementary test data shown below are based on the above
test environment.

We use the stress tool contained in initrd-stress.img to dirty
ramsize MB of memory on every CPU in the guest; refer to the following
link for the source code:
https://github.com/qemu/qemu/blob/master/tests/migration/stress.c

We run the following command to compare our refined QEMU with the
original QEMU:

$ ./migration/guestperf.py --cpus 10 --mem 28 --max-iters 40 \
--binary $(path_to_qemu-kvm) \
--dst-host $(destination_ip) \
--transport tcp --kernel $(path_to_vmlinuz) \
--initrd $(path_to_initrd-stress.img) \
--auto-converge \
--auto-converge-step 10 \
--max-iters 40

The following data shows the migration test results as the stress
increases. Ignoring the title row, the odd rows show the unrefined QEMU
test data, and the even rows show the refined QEMU test data:

|---------+--------+-----------+--------------+------------+------------|
| ramsize | success | totaltime | downtime(ms) | iterations | switchover |
| (MB)    |        | (ms)      | (ms)         |            | throttle   |
|         |        |           |              |            | percent    |
|---------+--------+-----------+--------------+------------+------------|
|     350 |    yes |    170285 |          399 |         23 |         99 |
|     350 |    yes |     85962 |          412 |         24 |         70 |
|     350 |    yes |    176305 |          199 |         20 |         99 |
|     350 |    yes |     66729 |          321 |         11 |         40 |
|     400 |    yes |    183042 |          469 |         21 |         99 |
|     400 |    yes |     77225 |          421 |         10 |         30 |
|     400 |    yes |    183641 |          866 |         27 |         99 |
|     400 |    yes |     59796 |          479 |         15 |         50 |
|     450 |    yes |    165881 |          820 |         21 |         99 |
|     450 |    yes |     87930 |          368 |         20 |         90 |
|     450 |    yes |    195448 |          452 |         23 |         99 |
|     450 |    yes |     70394 |          295 |          6 |         20 |
|     500 |    yes |    112587 |          471 |         19 |         99 |
|     500 |    yes |     57401 |          299 |          5 |         30 |
|     500 |    yes |    110683 |          657 |         21 |         99 |
|     500 |    yes |     69949 |          649 |          8 |         40 |
|     600 |    yes |    111036 |          324 |         23 |         99 |
|     600 |    yes |     63455 |          346 |          4 |         20 |
|     600 |    yes |    126667 |          426 |         20 |         99 |
|     600 |    yes |    101024 |          643 |         20 |         99 |
|    1000 |    yes |    296216 |          660 |         23 |         99 |
|    1000 |    yes |    106915 |          468 |         16 |         99 |
|    1000 |     no |    300000 |              |            |            |
|    1000 |    yes |    125819 |          824 |         17 |         99 |
|    1200 |     no |    300000 |              |            |            |
|    1200 |    yes |    127379 |          664 |         14 |         90 |
|    1200 |     no |    300000 |              |            |            |
|    1200 |    yes |     67086 |          793 |         11 |         50 |
|---------+--------+-----------+--------------+------------+------------|

To summarize the data above, no data points suggest a regression;
moreover, the throttle algorithm seems to become more responsive to
dirty-rate increases due to the refined detection.

Signed-off-by: Hyman Huang <yong.huang@smartx.com>
---
 migration/ram.c        | 48 +++++++++++++++++++++++++++++++++++++++---
 migration/trace-events |  1 +
 2 files changed, 46 insertions(+), 3 deletions(-)

Comments

Peter Xu July 30, 2024, 8:01 p.m. UTC | #1
On Wed, Jul 24, 2024 at 07:39:29PM +0800, Hyman Huang wrote:
> Currently, the convergence algorithm determines that the migration
> cannot converge according to the following principle:
> The dirty pages generated in current iteration exceed a specific
> percentage (throttle-trigger-threshold, 50 by default) of the number
> of transmissions. Let's refer to this criteria as the
> "transmission speed," If this criteria is met more than or equal to
> twice (dirty_rate_high_cnt >= 2), the throttle percentage needs to
> be increased.
> 
> In most cases, above implementation is appropriate. However, for a
> VM with a huge memory and high memory overload, each iteration is

When you're talking about huge memory, I do agree the current throttle
logic doesn't seem to scale, because migration_trigger_throttle() is
only called once per iteration, so it won't be invoked for a long time if
one iteration can take a long time.

I wonder whether you considered fixing that for the huge VM case in some
way, so that we may need to pay more for the sync overhead, but the
throttle logic can still get scheduled from time to time.

E.g., on a 10TB system with even a 100Gbps network, one iteration can take:

  10TB / ~10GB/s = ~1000 seconds = ~17min

It means migration_trigger_throttle() will only be invoked once every 17
mins.

> time-consuming. The VM's computing performance may be throttled at
> a high percentage and last for a long time due to the repeated
> confirmation behavior. Which may be intolerable for some
> computationally sensitive software in the VM. The refinement is for
> this scenario.
> 
> As the comment mentioned in the migration_trigger_throttle function,
> in order to avoid erroneous detection, the original algorithm confirms
> the criteria repeatedly. Put differently, once the detection become
> more reliable, it does not need to be confirmed twice.
> 
> In the refinement, in order to make the detection more accurate, we
> introduce another criteria, called the "valid transmission ratio"
> to determine the migration convergence. The "valid transmission ratio"
> is the ratio of bytes_xfer_period and bytes_dirty_period, which
> actually describe the migration efficiency.
> 
> When the algorithm repeatedly detects that the current iteration
> "valid transmission ratio" is lower than the previous iteration,
> the algorithm determines that the migration cannot converge.
> 
> For the "transmission speed" and "valid transmission ratio", if one
> of the two criteria is met, the penalty percentage would be increased.
> This saves the time of the entire iteration and therefore reduces
> the time of VM performance degradation.
> 
> In conclusion, this refinement significantly reduces the processing
> time required for the throttle percentage step to its maximum while
> the VM is under a high memory load.
> 
> The following are test environment:
> ----------------------------------------------------------------------
> a. Test tool:
> guestperf
> 
> Refer to the following link to see details:
> https://github.com/qemu/qemu/tree/master/tests/migration/guestperf
> 
> b. Test VM scale:
> CPU: 10; Memory: 28GB

Isn't 28GB far from a huge VM?  It'd be nice to know exactly what the
underlying problem is first.  So are we talking about "real huge VMs" (I'd
say at least a few hundred GBs), or "28GB VMs" (mostly.. every VM QEMU
invokes)?

> 
> c. Average bandwidth between source and destination for migration:
> 1.53 Gbits/sec
> ----------------------------------------------------------------------
> All the supplementary test data shown as follows are basing on
> above test environment.
> 
> We use stress tool contained in the initrd-stress.img to update
> ramsize MB on every CPU in guest, refer to the following link to
> see the source code:
> https://github.com/qemu/qemu/blob/master/tests/migration/stress.c
> 
> We run the following command to compare our refined QEMU with the
> original QEMU:
> 
> $ ./migration/guestperf.py --cpus 10 --mem 28 --max-iters 40 \
> --binary $(path_to_qemu-kvm) \
> --dst-host $(destination_ip) \
> --transport tcp --kernel $(path_to_vmlinuz) \
> --initrd $(path_to_initrd-stress.img) \
> --auto-converge \
> --auto-converge-step 10 \
> --max-iters 40

So this is for auto-converge.  How's the dirty limit solution?  I am
surprised you switched to the old one.  Does it mean that auto-converge is
better in some cases?

> 
> The following data shows the migration test results with an increase in
> stress, ignoring the title row, the odd rows show the unrefined QEMU
> test data, and the even rows show the refined QEMU test data:
> 
> |---------+--------+-----------+--------------+------------+------------|
> | ramsize | sucess | totaltime | downtime(ms) | iterations | switchover |
> | (MB)    |        | (ms)      | (ms)         |            | throttle   |
> |         |        |           |              |            | percent    |
> |---------+--------+-----------+--------------+------------+------------|
> |     350 |    yes |    170285 |          399 |         23 |         99 |
> |     350 |    yes |     85962 |          412 |         24 |         70 |
> |     350 |    yes |    176305 |          199 |         20 |         99 |
> |     350 |    yes |     66729 |          321 |         11 |         40 |
> |     400 |    yes |    183042 |          469 |         21 |         99 |
> |     400 |    yes |     77225 |          421 |         10 |         30 |
> |     400 |    yes |    183641 |          866 |         27 |         99 |
> |     400 |    yes |     59796 |          479 |         15 |         50 |
> |     450 |    yes |    165881 |          820 |         21 |         99 |
> |     450 |    yes |     87930 |          368 |         20 |         90 |
> |     450 |    yes |    195448 |          452 |         23 |         99 |
> |     450 |    yes |     70394 |          295 |          6 |         20 |
> |     500 |    yes |    112587 |          471 |         19 |         99 |
> |     500 |    yes |     57401 |          299 |          5 |         30 |
> |     500 |    yes |    110683 |          657 |         21 |         99 |
> |     500 |    yes |     69949 |          649 |          8 |         40 |
> |     600 |    yes |    111036 |          324 |         23 |         99 |
> |     600 |    yes |     63455 |          346 |          4 |         20 |
> |     600 |    yes |    126667 |          426 |         20 |         99 |
> |     600 |    yes |    101024 |          643 |         20 |         99 |
> |    1000 |    yes |    296216 |          660 |         23 |         99 |
> |    1000 |    yes |    106915 |          468 |         16 |         99 |
> |    1000 |     no |    300000 |              |            |            |
> |    1000 |    yes |    125819 |          824 |         17 |         99 |
> |    1200 |     no |    300000 |              |            |            |
> |    1200 |    yes |    127379 |          664 |         14 |         90 |
> |    1200 |     no |    300000 |              |            |            |
> |    1200 |    yes |     67086 |          793 |         11 |         50 |
> |---------+--------+-----------+--------------+------------+------------|
> 
> To summarize the data above, any data that implies negative optimization
> does not appear, and morover, the throttle algorithm seems to become more
> responsive to dirty rate increases due to the refined detection.
> 
> Signed-off-by: Hyman Huang <yong.huang@smartx.com>
> ---
>  migration/ram.c        | 48 +++++++++++++++++++++++++++++++++++++++---
>  migration/trace-events |  1 +
>  2 files changed, 46 insertions(+), 3 deletions(-)
> 
> diff --git a/migration/ram.c b/migration/ram.c
> index edec1a2d07..18b2d422b5 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -414,6 +414,17 @@ struct RAMState {
>       * RAM migration.
>       */
>      unsigned int postcopy_bmap_sync_requested;
> +
> +    /*
> +     * Ratio of bytes_dirty_period and bytes_xfer_period in the last
> +     * iteration
> +     */
> +    uint64_t dirty_ratio_pct;
> +    /*
> +     * How many times is the most recent iteration dirty ratio is
> +     * higher than previous one
> +     */
> +    int dirty_ratio_high_cnt;
>  };
>  typedef struct RAMState RAMState;
>  
> @@ -1013,6 +1024,32 @@ static void migration_dirty_limit_guest(void)
>      trace_migration_dirty_limit_guest(quota_dirtyrate);
>  }
>  
> +static bool migration_dirty_ratio_unexpected(RAMState *rs)
> +{
> +    uint64_t threshold = migrate_throttle_trigger_threshold();
> +    uint64_t bytes_xfer_period =
> +        migration_transferred_bytes() - rs->bytes_xfer_prev;
> +    uint64_t bytes_dirty_period = rs->num_dirty_pages_period * TARGET_PAGE_SIZE;
> +    uint64_t prev, curr;
> +
> +    /* Skip the first iterations since it isn't helpful */
> +    if (stat64_get(&mig_stats.dirty_sync_count) == 1 || !bytes_xfer_period) {
> +        return false;
> +    }
> +
> +    curr = 100 * (bytes_dirty_period * 1.0 / bytes_xfer_period);
> +
> +    prev = rs->dirty_ratio_pct;
> +    rs->dirty_ratio_pct = curr;
> +
> +    if (prev == 0 || curr <= threshold) {
> +        return false;
> +    }
> +
> +    trace_dirty_ratio_pct(curr, prev);
> +    return curr >= prev;
> +}
> +
>  static void migration_trigger_throttle(RAMState *rs)
>  {
>      uint64_t threshold = migrate_throttle_trigger_threshold();
> @@ -1028,9 +1065,14 @@ static void migration_trigger_throttle(RAMState *rs)
>       * we were in this routine reaches the threshold. If that happens
>       * twice, start or increase throttling.
>       */
> -    if ((bytes_dirty_period > bytes_dirty_threshold) &&
> -        (++rs->dirty_rate_high_cnt >= 2)) {
> -        rs->dirty_rate_high_cnt = 0;
> +    if ((migration_dirty_ratio_unexpected(rs) &&
> +         (++rs->dirty_ratio_high_cnt >= 2)) ||
> +        ((bytes_dirty_period > bytes_dirty_threshold) &&
> +         (++rs->dirty_rate_high_cnt >= 2))) {

I'm afraid this is a mess.

Now it's (A||B) and either condition can trigger this throttle logic.  Both A
& B make decisions on merely the same parameters.  It's totally
unpredictable to me how these cnts bump, and not readable.

What I can see of how this could make migration shorter is that now either
A or B can trigger the throttle, so it triggers at a faster pace than
before.  E.g. IIUC if we drop dirty_rate_high_cnt completely it'll also
achieve a similar thing, at least in the guestperf tests.

Have you considered what I mentioned above that may make auto converge work
better with "real huge VM"s (by allowing sync >1 times for each iteration),
or have you considered postcopy?

Thanks,

> +        rs->dirty_rate_high_cnt =
> +            rs->dirty_rate_high_cnt >= 2 ? 0 : rs->dirty_rate_high_cnt;
> +        rs->dirty_ratio_high_cnt =
> +            rs->dirty_ratio_high_cnt >= 2 ? 0 : rs->dirty_ratio_high_cnt;
>          if (migrate_auto_converge()) {
>              trace_migration_throttle();
>              mig_throttle_guest_down(bytes_dirty_period,
> diff --git a/migration/trace-events b/migration/trace-events
> index 0b7c3324fb..654c52c5e4 100644
> --- a/migration/trace-events
> +++ b/migration/trace-events
> @@ -90,6 +90,7 @@ put_qlist_end(const char *field_name, const char *vmsd_name) "%s(%s)"
>  qemu_file_fclose(void) ""
>  
>  # ram.c
> +dirty_ratio_pct(uint64_t cur, uint64_t prev) "current ratio: %" PRIu64 " previous ratio: %" PRIu64
>  get_queued_page(const char *block_name, uint64_t tmp_offset, unsigned long page_abs) "%s/0x%" PRIx64 " page_abs=0x%lx"
>  get_queued_page_not_dirty(const char *block_name, uint64_t tmp_offset, unsigned long page_abs) "%s/0x%" PRIx64 " page_abs=0x%lx"
>  migration_bitmap_sync_start(void) ""
> -- 
> 2.39.1
>
Yong Huang Aug. 5, 2024, 7:03 a.m. UTC | #2
Sorry for the late reply.

On Wed, Jul 31, 2024 at 4:01 AM Peter Xu <peterx@redhat.com> wrote:

> On Wed, Jul 24, 2024 at 07:39:29PM +0800, Hyman Huang wrote:
> > Currently, the convergence algorithm determines that the migration
> > cannot converge according to the following principle:
> > The dirty pages generated in current iteration exceed a specific
> > percentage (throttle-trigger-threshold, 50 by default) of the number
> > of transmissions. Let's refer to this criteria as the
> > "transmission speed," If this criteria is met more than or equal to
> > twice (dirty_rate_high_cnt >= 2), the throttle percentage needs to
> > be increased.
> >
> > In most cases, above implementation is appropriate. However, for a
> > VM with a huge memory and high memory overload, each iteration is
>
> When you're talking about huge memory, I do agree the current throttle
> logic doesn't look like to scale, because migration_trigger_throttle() is
> only called for each iteration, so it won't be invoked for a long time if
> one iteration can take a long time.
>
> I wonder whether you considered fixing that for a huge VM case in some way,
> so that we may need to pay for the sync overhead more, but maybe the
> throttle logic can still get scheduled from time to time.


> E.g., on a 10TB system with even a 100Gbps network, one iteration can take:
>
>   10TB / ~10GB/s = ~1000 seconds = ~17min
>
> It means migration_trigger_throttle() will only be invoked once every 17
> mins.


Agree; and in another case, when the VM is running at a high dirty rate
during migration, the iteration similarly takes a long time.


> > time-consuming. The VM's computing performance may be throttled at
> > a high percentage and last for a long time due to the repeated
> > confirmation behavior. Which may be intolerable for some
> > computationally sensitive software in the VM. The refinement is for
> > this scenario.
> >
> > As the comment mentioned in the migration_trigger_throttle function,
> > in order to avoid erroneous detection, the original algorithm confirms
> > the criteria repeatedly. Put differently, once the detection become
> > more reliable, it does not need to be confirmed twice.
> >
> > In the refinement, in order to make the detection more accurate, we
> > introduce another criteria, called the "valid transmission ratio"
> > to determine the migration convergence. The "valid transmission ratio"
> > is the ratio of bytes_xfer_period and bytes_dirty_period, which
> > actually describe the migration efficiency.
> >
> > When the algorithm repeatedly detects that the current iteration
> > "valid transmission ratio" is lower than the previous iteration,
> > the algorithm determines that the migration cannot converge.
> >
> > For the "transmission speed" and "valid transmission ratio", if one
> > of the two criteria is met, the penalty percentage would be increased.
> > This saves the time of the entire iteration and therefore reduces
> > the time of VM performance degradation.
> >
> > In conclusion, this refinement significantly reduces the processing
> > time required for the throttle percentage step to its maximum while
> > the VM is under a high memory load.
> >
> > The following are test environment:
> > ----------------------------------------------------------------------
> > a. Test tool:
> > guestperf
> >
> > Refer to the following link to see details:
> > https://github.com/qemu/qemu/tree/master/tests/migration/guestperf
> >
> > b. Test VM scale:
> > CPU: 10; Memory: 28GB
>
> Isn't 28GB not a huge VM at all?  It'll be nice to know exactly what's the
> problem behind first.  So are we talking about "real huge VM"s (I'd say at
> least a few hundreds GBs), or "28GB VMs" (mostly.. every VM QEMU invokes)?
>
> >
> > c. Average bandwidth between source and destination for migration:
> > 1.53 Gbits/sec
> > ----------------------------------------------------------------------
> > All the supplementary test data shown as follows are basing on
> > above test environment.
> >
> > We use stress tool contained in the initrd-stress.img to update
> > ramsize MB on every CPU in guest, refer to the following link to
> > see the source code:
> > https://github.com/qemu/qemu/blob/master/tests/migration/stress.c
> >
> > We run the following command to compare our refined QEMU with the
> > original QEMU:
> >
> > $ ./migration/guestperf.py --cpus 10 --mem 28 --max-iters 40 \
> > --binary $(path_to_qemu-kvm) \
> > --dst-host $(destination_ip) \
> > --transport tcp --kernel $(path_to_vmlinuz) \
> > --initrd $(path_to_initrd-stress.img) \
> > --auto-converge \
> > --auto-converge-step 10 \
> > --max-iters 40
>
> So this is for aut-converge.  How's the dirty limit solution?  I am
>

This patch is a general solution, not just for auto-converge. I missed
the dirty limit test case here; I'll post the dirty limit test results
in the next version.


> surprised you switched to the old one.  Does it mean that auto-converge is
> better in some cases?
>

For non-x86 hardware platforms, auto-converge is still the only solution
to throttle the guest, so it still reaps the benefits from this patch.


>
> >
> > The following data shows the migration test results with an increase in
> > stress, ignoring the title row, the odd rows show the unrefined QEMU
> > test data, and the even rows show the refined QEMU test data:
> >
> > |---------+--------+-----------+--------------+------------+------------|
> > | ramsize | sucess | totaltime | downtime(ms) | iterations | switchover |
> > | (MB)    |        | (ms)      | (ms)         |            | throttle   |
> > |         |        |           |              |            | percent    |
> > |---------+--------+-----------+--------------+------------+------------|
> > |     350 |    yes |    170285 |          399 |         23 |         99 |
> > |     350 |    yes |     85962 |          412 |         24 |         70 |
> > |     350 |    yes |    176305 |          199 |         20 |         99 |
> > |     350 |    yes |     66729 |          321 |         11 |         40 |
> > |     400 |    yes |    183042 |          469 |         21 |         99 |
> > |     400 |    yes |     77225 |          421 |         10 |         30 |
> > |     400 |    yes |    183641 |          866 |         27 |         99 |
> > |     400 |    yes |     59796 |          479 |         15 |         50 |
> > |     450 |    yes |    165881 |          820 |         21 |         99 |
> > |     450 |    yes |     87930 |          368 |         20 |         90 |
> > |     450 |    yes |    195448 |          452 |         23 |         99 |
> > |     450 |    yes |     70394 |          295 |          6 |         20 |
> > |     500 |    yes |    112587 |          471 |         19 |         99 |
> > |     500 |    yes |     57401 |          299 |          5 |         30 |
> > |     500 |    yes |    110683 |          657 |         21 |         99 |
> > |     500 |    yes |     69949 |          649 |          8 |         40 |
> > |     600 |    yes |    111036 |          324 |         23 |         99 |
> > |     600 |    yes |     63455 |          346 |          4 |         20 |
> > |     600 |    yes |    126667 |          426 |         20 |         99 |
> > |     600 |    yes |    101024 |          643 |         20 |         99 |
> > |    1000 |    yes |    296216 |          660 |         23 |         99 |
> > |    1000 |    yes |    106915 |          468 |         16 |         99 |
> > |    1000 |     no |    300000 |              |            |            |
> > |    1000 |    yes |    125819 |          824 |         17 |         99 |
> > |    1200 |     no |    300000 |              |            |            |
> > |    1200 |    yes |    127379 |          664 |         14 |         90 |
> > |    1200 |     no |    300000 |              |            |            |
> > |    1200 |    yes |     67086 |          793 |         11 |         50 |
> > |---------+--------+-----------+--------------+------------+------------|
> >
> > To summarize the data above, any data that implies negative optimization
> > does not appear, and morover, the throttle algorithm seems to become more
> > responsive to dirty rate increases due to the refined detection.
> >
> > Signed-off-by: Hyman Huang <yong.huang@smartx.com>
> > ---
> >  migration/ram.c        | 48 +++++++++++++++++++++++++++++++++++++++---
> >  migration/trace-events |  1 +
> >  2 files changed, 46 insertions(+), 3 deletions(-)
> >
> > diff --git a/migration/ram.c b/migration/ram.c
> > index edec1a2d07..18b2d422b5 100644
> > --- a/migration/ram.c
> > +++ b/migration/ram.c
> > @@ -414,6 +414,17 @@ struct RAMState {
> >       * RAM migration.
> >       */
> >      unsigned int postcopy_bmap_sync_requested;
> > +
> > +    /*
> > +     * Ratio of bytes_dirty_period and bytes_xfer_period in the last
> > +     * iteration
> > +     */
> > +    uint64_t dirty_ratio_pct;
> > +    /*
> > +     * How many times is the most recent iteration dirty ratio is
> > +     * higher than previous one
> > +     */
> > +    int dirty_ratio_high_cnt;
> >  };
> >  typedef struct RAMState RAMState;
> >
> > @@ -1013,6 +1024,32 @@ static void migration_dirty_limit_guest(void)
> >      trace_migration_dirty_limit_guest(quota_dirtyrate);
> >  }
> >
> > +static bool migration_dirty_ratio_unexpected(RAMState *rs)
> > +{
> > +    uint64_t threshold = migrate_throttle_trigger_threshold();
> > +    uint64_t bytes_xfer_period =
> > +        migration_transferred_bytes() - rs->bytes_xfer_prev;
> > +    uint64_t bytes_dirty_period = rs->num_dirty_pages_period *
> TARGET_PAGE_SIZE;
> > +    uint64_t prev, curr;
> > +
> > +    /* Skip the first iterations since it isn't helpful */
> > +    if (stat64_get(&mig_stats.dirty_sync_count) == 1 ||
> !bytes_xfer_period) {
> > +        return false;
> > +    }
> > +
> > +    curr = 100 * (bytes_dirty_period * 1.0 / bytes_xfer_period);
> > +
> > +    prev = rs->dirty_ratio_pct;
> > +    rs->dirty_ratio_pct = curr;
> > +
> > +    if (prev == 0 || curr <= threshold) {
> > +        return false;
> > +    }
> > +
> > +    trace_dirty_ratio_pct(curr, prev);
> > +    return curr >= prev;
> > +}
> > +
> >  static void migration_trigger_throttle(RAMState *rs)
> >  {
> >      uint64_t threshold = migrate_throttle_trigger_threshold();
> > @@ -1028,9 +1065,14 @@ static void migration_trigger_throttle(RAMState
> *rs)
> >       * we were in this routine reaches the threshold. If that happens
> >       * twice, start or increase throttling.
> >       */
> > -    if ((bytes_dirty_period > bytes_dirty_threshold) &&
> > -        (++rs->dirty_rate_high_cnt >= 2)) {
> > -        rs->dirty_rate_high_cnt = 0;
> > +    if ((migration_dirty_ratio_unexpected(rs) &&
> > +         (++rs->dirty_ratio_high_cnt >= 2)) ||
> > +        ((bytes_dirty_period > bytes_dirty_threshold) &&
> > +         (++rs->dirty_rate_high_cnt >= 2))) {
>
> I'm afraid this is a mess.
>
> Now it's (A||B) and any condition can trigger this throttle logic.  Both A
> & B make decisions on merely the same parameters.  It's totally
> unpredictable to me on how these cnts bumps, and not readable.
>

Indeed, this is not readable. How about introducing a migration capability
like "precise-detection" to make this patch backward-compatible?


>
> What I kind of see how this could make migration shorter is that now either
> A or B can trigger the throttle, so it triggered in a faster pace than
> before.  E.g. IIUC if we drop dirty_rate_high_cnt completely it'll also
> achieve similar thing at least in guestperf tests.
>

Yes!  I get it, and my original idea was to drop dirty_rate_high_cnt.

As we mentioned above, migration needs to pay more for the sync overhead
once a VM is configured with huge memory or is running at a high dirty rate.

Dropping dirty_rate_high_cnt will make the iterations take less time
in the above cases.  I think this is feasible since there is no other
reason to retain dirty_rate_high_cnt once we make the detection precise;
let me know if I missed something.
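
Just to illustrate the idea (a rough, untested sketch only), the trigger
would then become something like:

    if (migration_dirty_ratio_unexpected(rs) ||
        (bytes_dirty_period > bytes_dirty_threshold)) {
        if (migrate_auto_converge()) {
            trace_migration_throttle();
            mig_throttle_guest_down(bytes_dirty_period,
                                    bytes_dirty_threshold);
        }
        /* ... dirty-limit handling stays as it is today ... */
    }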


>
> Have you considered what I mentioned above that may make auto converge work
> better with "real huge VM"s (by allowing sync >1 times for each iteration),
> or have you considered postcopy?
>

Yes, IMHO this is another refinement direction for auto-converge; we'll
try it and keep communicating with upstream once it makes progress.


>
> Thanks,
>
> > +        rs->dirty_rate_high_cnt =
> > +            rs->dirty_rate_high_cnt >= 2 ? 0 : rs->dirty_rate_high_cnt;
> > +        rs->dirty_ratio_high_cnt =
> > +            rs->dirty_ratio_high_cnt >= 2 ? 0 :
> rs->dirty_ratio_high_cnt;
> >          if (migrate_auto_converge()) {
> >              trace_migration_throttle();
> >              mig_throttle_guest_down(bytes_dirty_period,
> > diff --git a/migration/trace-events b/migration/trace-events
> > index 0b7c3324fb..654c52c5e4 100644
> > --- a/migration/trace-events
> > +++ b/migration/trace-events
> > @@ -90,6 +90,7 @@ put_qlist_end(const char *field_name, const char
> *vmsd_name) "%s(%s)"
> >  qemu_file_fclose(void) ""
> >
> >  # ram.c
> > +dirty_ratio_pct(uint64_t cur, uint64_t prev) "current ratio: %" PRIu64
> " previous ratio: %" PRIu64
> >  get_queued_page(const char *block_name, uint64_t tmp_offset, unsigned
> long page_abs) "%s/0x%" PRIx64 " page_abs=0x%lx"
> >  get_queued_page_not_dirty(const char *block_name, uint64_t tmp_offset,
> unsigned long page_abs) "%s/0x%" PRIx64 " page_abs=0x%lx"
> >  migration_bitmap_sync_start(void) ""
> > --
> > 2.39.1
> >
>
> --
> Peter Xu
>
>
Peter Xu Aug. 6, 2024, 4:45 p.m. UTC | #3
On Mon, Aug 05, 2024 at 03:03:27PM +0800, Yong Huang wrote:
> Sorry for the late reply.
> 
> On Wed, Jul 31, 2024 at 4:01 AM Peter Xu <peterx@redhat.com> wrote:
> 
> > On Wed, Jul 24, 2024 at 07:39:29PM +0800, Hyman Huang wrote:
> > > Currently, the convergence algorithm determines that the migration
> > > cannot converge according to the following principle:
> > > The dirty pages generated in current iteration exceed a specific
> > > percentage (throttle-trigger-threshold, 50 by default) of the number
> > > of transmissions. Let's refer to this criteria as the
> > > "transmission speed," If this criteria is met more than or equal to
> > > twice (dirty_rate_high_cnt >= 2), the throttle percentage needs to
> > > be increased.
> > >
> > > In most cases, above implementation is appropriate. However, for a
> > > VM with a huge memory and high memory overload, each iteration is
> >
> > When you're talking about huge memory, I do agree the current throttle
> > logic doesn't look like to scale, because migration_trigger_throttle() is
> > only called for each iteration, so it won't be invoked for a long time if
> > one iteration can take a long time.
> >
> > I wonder whether you considered fixing that for a huge VM case in some way,
> > so that we may need to pay for the sync overhead more, but maybe the
> > throttle logic can still get scheduled from time to time.
> 
> 
> > E.g., on a 10TB system with even a 100Gbps network, one iteration can take:
> >
> >   10TB / ~10GB/s = ~1000 seconds = ~17min
> >
> > It means migration_trigger_throttle() will only be invoked once every 17
> > mins.
> 
> 
> Agree, and another case, VM is at a high dirty rate when migrating,
> similarly making the iteration take a long time.
> 
> 
> > > time-consuming. The VM's computing performance may be throttled at
> > > a high percentage and last for a long time due to the repeated
> > > confirmation behavior. Which may be intolerable for some
> > > computationally sensitive software in the VM. The refinement is for
> > > this scenario.
> > >
> > > As the comment mentioned in the migration_trigger_throttle function,
> > > in order to avoid erroneous detection, the original algorithm confirms
> > > the criteria repeatedly. Put differently, once the detection become
> > > more reliable, it does not need to be confirmed twice.
> > >
> > > In the refinement, in order to make the detection more accurate, we
> > > introduce another criteria, called the "valid transmission ratio"
> > > to determine the migration convergence. The "valid transmission ratio"
> > > is the ratio of bytes_xfer_period and bytes_dirty_period, which
> > > actually describe the migration efficiency.
> > >
> > > When the algorithm repeatedly detects that the current iteration
> > > "valid transmission ratio" is lower than the previous iteration,
> > > the algorithm determines that the migration cannot converge.
> > >
> > > For the "transmission speed" and "valid transmission ratio", if one
> > > of the two criteria is met, the penalty percentage would be increased.
> > > This saves the time of the entire iteration and therefore reduces
> > > the time of VM performance degradation.
> > >
> > > In conclusion, this refinement significantly reduces the processing
> > > time required for the throttle percentage step to its maximum while
> > > the VM is under a high memory load.
> > >
> > > The following are test environment:
> > > ----------------------------------------------------------------------
> > > a. Test tool:
> > > guestperf
> > >
> > > Refer to the following link to see details:
> > > https://github.com/qemu/qemu/tree/master/tests/migration/guestperf
> > >
> > > b. Test VM scale:
> > > CPU: 10; Memory: 28GB
> >
> > Isn't 28GB not a huge VM at all?  It'll be nice to know exactly what's the
> > problem behind first.  So are we talking about "real huge VM"s (I'd say at
> > least a few hundreds GBs), or "28GB VMs" (mostly.. every VM QEMU invokes)?
> >
> > >
> > > c. Average bandwidth between source and destination for migration:
> > > 1.53 Gbits/sec
> > > ----------------------------------------------------------------------
> > > All the supplementary test data shown as follows are basing on
> > > above test environment.
> > >
> > > We use stress tool contained in the initrd-stress.img to update
> > > ramsize MB on every CPU in guest, refer to the following link to
> > > see the source code:
> > > https://github.com/qemu/qemu/blob/master/tests/migration/stress.c
> > >
> > > We run the following command to compare our refined QEMU with the
> > > original QEMU:
> > >
> > > $ ./migration/guestperf.py --cpus 10 --mem 28 --max-iters 40 \
> > > --binary $(path_to_qemu-kvm) \
> > > --dst-host $(destination_ip) \
> > > --transport tcp --kernel $(path_to_vmlinuz) \
> > > --initrd $(path_to_initrd-stress.img) \
> > > --auto-converge \
> > > --auto-converge-step 10 \
> > > --max-iters 40
> >
> > So this is for aut-converge.  How's the dirty limit solution?  I am
> >
> 
> This patch is a general solution, not just for auto-convergence, here
> I missed the dirty limit test case, i'll post the dirty limit test result
> in the next version.
> 
> 
> > surprised you switched to the old one.  Does it mean that auto-converge is
> > better in some cases?
> >
> 
> Since for non-x86 hardware platforms, auto-converge is still the only
> solution
> to throttle the guest.  It still reap the benefits from this patch.

ARM should have that too now, if that's what you're talking about.

That said, for quite some time I was wondering whether upstream should
focus on some solution that will work out for convergence, where I
ultimately chose postcopy.  That's also why I was asking whether you
considered postcopy.

It'll be more focused for everyone if we can have one major solution to
this problem rather than spreading the energy over a few.

> 
> 
> >
> > >
> > > The following data shows the migration test results with an increase in
> > > stress, ignoring the title row, the odd rows show the unrefined QEMU
> > > test data, and the even rows show the refined QEMU test data:
> > >
> > > |---------+--------+-----------+--------------+------------+------------|
> > > | ramsize | sucess | totaltime | downtime(ms) | iterations | switchover |
> > > | (MB)    |        | (ms)      | (ms)         |            | throttle   |
> > > |         |        |           |              |            | percent    |
> > > |---------+--------+-----------+--------------+------------+------------|
> > > |     350 |    yes |    170285 |          399 |         23 |         99 |
> > > |     350 |    yes |     85962 |          412 |         24 |         70 |
> > > |     350 |    yes |    176305 |          199 |         20 |         99 |
> > > |     350 |    yes |     66729 |          321 |         11 |         40 |
> > > |     400 |    yes |    183042 |          469 |         21 |         99 |
> > > |     400 |    yes |     77225 |          421 |         10 |         30 |
> > > |     400 |    yes |    183641 |          866 |         27 |         99 |
> > > |     400 |    yes |     59796 |          479 |         15 |         50 |
> > > |     450 |    yes |    165881 |          820 |         21 |         99 |
> > > |     450 |    yes |     87930 |          368 |         20 |         90 |
> > > |     450 |    yes |    195448 |          452 |         23 |         99 |
> > > |     450 |    yes |     70394 |          295 |          6 |         20 |
> > > |     500 |    yes |    112587 |          471 |         19 |         99 |
> > > |     500 |    yes |     57401 |          299 |          5 |         30 |
> > > |     500 |    yes |    110683 |          657 |         21 |         99 |
> > > |     500 |    yes |     69949 |          649 |          8 |         40 |
> > > |     600 |    yes |    111036 |          324 |         23 |         99 |
> > > |     600 |    yes |     63455 |          346 |          4 |         20 |
> > > |     600 |    yes |    126667 |          426 |         20 |         99 |
> > > |     600 |    yes |    101024 |          643 |         20 |         99 |
> > > |    1000 |    yes |    296216 |          660 |         23 |         99 |
> > > |    1000 |    yes |    106915 |          468 |         16 |         99 |
> > > |    1000 |     no |    300000 |              |            |            |
> > > |    1000 |    yes |    125819 |          824 |         17 |         99 |
> > > |    1200 |     no |    300000 |              |            |            |
> > > |    1200 |    yes |    127379 |          664 |         14 |         90 |
> > > |    1200 |     no |    300000 |              |            |            |
> > > |    1200 |    yes |     67086 |          793 |         11 |         50 |
> > > |---------+--------+-----------+--------------+------------+------------|
> > >
> > > To summarize the data above, any data that implies negative optimization
> > > does not appear, and morover, the throttle algorithm seems to become more
> > > responsive to dirty rate increases due to the refined detection.
> > >
> > > Signed-off-by: Hyman Huang <yong.huang@smartx.com>
> > > ---
> > >  migration/ram.c        | 48 +++++++++++++++++++++++++++++++++++++++---
> > >  migration/trace-events |  1 +
> > >  2 files changed, 46 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/migration/ram.c b/migration/ram.c
> > > index edec1a2d07..18b2d422b5 100644
> > > --- a/migration/ram.c
> > > +++ b/migration/ram.c
> > > @@ -414,6 +414,17 @@ struct RAMState {
> > >       * RAM migration.
> > >       */
> > >      unsigned int postcopy_bmap_sync_requested;
> > > +
> > > +    /*
> > > +     * Ratio of bytes_dirty_period and bytes_xfer_period in the last
> > > +     * iteration
> > > +     */
> > > +    uint64_t dirty_ratio_pct;
> > > +    /*
> > > +     * How many times is the most recent iteration dirty ratio is
> > > +     * higher than previous one
> > > +     */
> > > +    int dirty_ratio_high_cnt;
> > >  };
> > >  typedef struct RAMState RAMState;
> > >
> > > @@ -1013,6 +1024,32 @@ static void migration_dirty_limit_guest(void)
> > >      trace_migration_dirty_limit_guest(quota_dirtyrate);
> > >  }
> > >
> > > +static bool migration_dirty_ratio_unexpected(RAMState *rs)
> > > +{
> > > +    uint64_t threshold = migrate_throttle_trigger_threshold();
> > > +    uint64_t bytes_xfer_period =
> > > +        migration_transferred_bytes() - rs->bytes_xfer_prev;
> > > +    uint64_t bytes_dirty_period = rs->num_dirty_pages_period *
> > TARGET_PAGE_SIZE;
> > > +    uint64_t prev, curr;
> > > +
> > > +    /* Skip the first iterations since it isn't helpful */
> > > +    if (stat64_get(&mig_stats.dirty_sync_count) == 1 ||
> > !bytes_xfer_period) {
> > > +        return false;
> > > +    }
> > > +
> > > +    curr = 100 * (bytes_dirty_period * 1.0 / bytes_xfer_period);
> > > +
> > > +    prev = rs->dirty_ratio_pct;
> > > +    rs->dirty_ratio_pct = curr;
> > > +
> > > +    if (prev == 0 || curr <= threshold) {
> > > +        return false;
> > > +    }
> > > +
> > > +    trace_dirty_ratio_pct(curr, prev);
> > > +    return curr >= prev;
> > > +}
> > > +
> > >  static void migration_trigger_throttle(RAMState *rs)
> > >  {
> > >      uint64_t threshold = migrate_throttle_trigger_threshold();
> > > @@ -1028,9 +1065,14 @@ static void migration_trigger_throttle(RAMState
> > *rs)
> > >       * we were in this routine reaches the threshold. If that happens
> > >       * twice, start or increase throttling.
> > >       */
> > > -    if ((bytes_dirty_period > bytes_dirty_threshold) &&
> > > -        (++rs->dirty_rate_high_cnt >= 2)) {
> > > -        rs->dirty_rate_high_cnt = 0;
> > > +    if ((migration_dirty_ratio_unexpected(rs) &&
> > > +         (++rs->dirty_ratio_high_cnt >= 2)) ||
> > > +        ((bytes_dirty_period > bytes_dirty_threshold) &&
> > > +         (++rs->dirty_rate_high_cnt >= 2))) {
> >
> > I'm afraid this is a mess.
> >
> > Now it's (A||B) and any condition can trigger this throttle logic.  Both A
> > & B make decisions on merely the same parameters.  It's totally
> > unpredictable to me on how these cnts bumps, and not readable.
> >
> 
> Indeed, this is not readable. How about introducing a migration capability
> like "precise-detection" to make this patch backward-compatible?

If we want to make some change here anyway, I think it's better we do it
in the best form rather than keep adding caps.

For now "stick with postcopy" is still one way to go, but I'm not sure
whether it applies in your setup.  Otherwise I'd consider syncing more for
huge VMs.  I mean "throttle only on each sync, sync once per iteration" may
have made sense in the old days, but perhaps not anymore.  That sounds like
the real root cause of this issue.
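
A very rough illustration of that direction (the helper name, the interval,
and the exact sync entry point are assumptions for discussion, not a
concrete proposal): the iteration loop could force a bitmap sync based on
elapsed time, so that migration_trigger_throttle() still runs even when a
single pass over RAM takes many minutes:

    /* Hypothetical sketch in migration/ram.c */
    #define THROTTLE_CHECK_INTERVAL_MS 5000

    static void maybe_sync_for_throttle(RAMState *rs)
    {
        int64_t now = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);

        /* Pay the sync cost more often so the throttle logic gets a chance */
        if (now - rs->time_last_bitmap_sync > THROTTLE_CHECK_INTERVAL_MS) {
            migration_bitmap_sync_precopy(rs, false);
        }
    }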

> 
> 
> >
> > What I kind of see how this could make migration shorter is that now either
> > A or B can trigger the throttle, so it triggered in a faster pace than
> > before.  E.g. IIUC if we drop dirty_rate_high_cnt completely it'll also
> > achieve similar thing at least in guestperf tests.
> >
> 
> Yes !  I get it, and what my original idea is to drop the
> dirty_rate_high_cnt.
> 
> As we mentioned above: migration needs to pay for the sync overhead more
> once a VM is configured with huge memory or running at a high dirty rate.
> 
> Dropping the dirty_rate_high_cnt will make the iteration take less time
> in above cases.  I think this is feasible since there is no other reasons to
> retain the  dirty_rate_high_cnt once we make the detection precise,
> let me know if i missed something.
> 
> 
> >
> > Have you considered what I mentioned above that may make auto converge work
> > better with "real huge VM"s (by allowing sync >1 times for each iteration),
> > or have you considered postcopy?
> >
> 
> Yes, IMHO, this is another refinement direction for the auto-converge, we'll
> try this and keep communication with upstream once it gets progress.

Thanks,
Yong Huang Aug. 7, 2024, 6:46 a.m. UTC | #4
On Wed, Aug 7, 2024 at 12:45 AM Peter Xu <peterx@redhat.com> wrote:

> On Mon, Aug 05, 2024 at 03:03:27PM +0800, Yong Huang wrote:
> > Sorry for the late reply.
> >
> > On Wed, Jul 31, 2024 at 4:01 AM Peter Xu <peterx@redhat.com> wrote:
> >
> > > On Wed, Jul 24, 2024 at 07:39:29PM +0800, Hyman Huang wrote:
> > > > Currently, the convergence algorithm determines that the migration
> > > > cannot converge according to the following principle:
> > > > The dirty pages generated in current iteration exceed a specific
> > > > percentage (throttle-trigger-threshold, 50 by default) of the number
> > > > of transmissions. Let's refer to this criteria as the
> > > > "transmission speed," If this criteria is met more than or equal to
> > > > twice (dirty_rate_high_cnt >= 2), the throttle percentage needs to
> > > > be increased.
> > > >
> > > > In most cases, above implementation is appropriate. However, for a
> > > > VM with a huge memory and high memory overload, each iteration is
> > >
> > > When you're talking about huge memory, I do agree the current throttle
> > > logic doesn't look like to scale, because migration_trigger_throttle()
> is
> > > only called for each iteration, so it won't be invoked for a long time
> if
> > > one iteration can take a long time.
> > >
> > > I wonder whether you considered fixing that for a huge VM case in some
> way,
> > > so that we may need to pay for the sync overhead more, but maybe the
> > > throttle logic can still get scheduled from time to time.
> >
> >
> > > E.g., on a 10TB system with even a 100Gbps network, one iteration can
> take:
> > >
> > >   10TB / ~10GB/s = ~1000 seconds = ~17min
> > >
> > > It means migration_trigger_throttle() will only be invoked once every
> 17
> > > mins.
> >
> >
> > Agree, and another case, VM is at a high dirty rate when migrating,
> > similarly making the iteration take a long time.
> >
> >
> > > > time-consuming. The VM's computing performance may be throttled at
> > > > a high percentage and last for a long time due to the repeated
> > > > confirmation behavior. Which may be intolerable for some
> > > > computationally sensitive software in the VM. The refinement is for
> > > > this scenario.
> > > >
> > > > As the comment mentioned in the migration_trigger_throttle function,
> > > > in order to avoid erroneous detection, the original algorithm
> confirms
> > > > the criteria repeatedly. Put differently, once the detection become
> > > > more reliable, it does not need to be confirmed twice.
> > > >
> > > > In the refinement, in order to make the detection more accurate, we
> > > > introduce another criteria, called the "valid transmission ratio"
> > > > to determine the migration convergence. The "valid transmission
> ratio"
> > > > is the ratio of bytes_xfer_period and bytes_dirty_period, which
> > > > actually describe the migration efficiency.
> > > >
> > > > When the algorithm repeatedly detects that the current iteration
> > > > "valid transmission ratio" is lower than the previous iteration,
> > > > the algorithm determines that the migration cannot converge.
> > > >
> > > > For the "transmission speed" and "valid transmission ratio", if one
> > > > of the two criteria is met, the penalty percentage would be
> increased.
> > > > This saves the time of the entire iteration and therefore reduces
> > > > the time of VM performance degradation.
> > > >
> > > > In conclusion, this refinement significantly reduces the processing
> > > > time required for the throttle percentage step to its maximum while
> > > > the VM is under a high memory load.
> > > >
> > > > The following are test environment:
> > > >
> ----------------------------------------------------------------------
> > > > a. Test tool:
> > > > guestperf
> > > >
> > > > Refer to the following link to see details:
> > > > https://github.com/qemu/qemu/tree/master/tests/migration/guestperf
> > > >
> > > > b. Test VM scale:
> > > > CPU: 10; Memory: 28GB
> > >
> > > Isn't 28GB not a huge VM at all?  It'll be nice to know exactly what's
> the
> > > problem behind first.  So are we talking about "real huge VM"s (I'd
> say at
> > > least a few hundreds GBs), or "28GB VMs" (mostly.. every VM QEMU
> invokes)?
> > >
> > > >
> > > > c. Average bandwidth between source and destination for migration:
> > > > 1.53 Gbits/sec
> > > >
> ----------------------------------------------------------------------
> > > > All the supplementary test data shown as follows are basing on
> > > > above test environment.
> > > >
> > > > We use stress tool contained in the initrd-stress.img to update
> > > > ramsize MB on every CPU in guest, refer to the following link to
> > > > see the source code:
> > > > https://github.com/qemu/qemu/blob/master/tests/migration/stress.c
> > > >
> > > > We run the following command to compare our refined QEMU with the
> > > > original QEMU:
> > > >
> > > > $ ./migration/guestperf.py --cpus 10 --mem 28 --max-iters 40 \
> > > > --binary $(path_to_qemu-kvm) \
> > > > --dst-host $(destination_ip) \
> > > > --transport tcp --kernel $(path_to_vmlinuz) \
> > > > --initrd $(path_to_initrd-stress.img) \
> > > > --auto-converge \
> > > > --auto-converge-step 10 \
> > > > --max-iters 40
> > >
> > > So this is for aut-converge.  How's the dirty limit solution?  I am
> > >
> >
> > This patch is a general solution, not just for auto-convergence, here
> > I missed the dirty limit test case, i'll post the dirty limit test result
> > in the next version.
> >
> >
> > > surprised you switched to the old one.  Does it mean that
> auto-converge is
> > > better in some cases?
> > >
> >
> > Since for non-x86 hardware platforms, auto-converge is still the only
> > solution
> > to throttle the guest.  It still reap the benefits from this patch.
>
> ARM should have that too now, if that's what you're talking about.
>

OK, thanks for the tip; I missed the important series:

https://patchew.org/QEMU/20230509022122.20888-1-gshan@redhat.com/
https://lore.kernel.org/kvmarm/20220819005601.198436-1-gshan@redhat.com/#r

I'll try to enable the dirty-limit on both arches and see how it goes.


> Said that, for quite some time I was wondering whether upstream should be
> focused on some solution that will work out for convergence, where I
> ultimately chose postcopy.  That's also why I was asking whether you
> considered postcopy.
>

Got it. So far, applying postcopy still needs more testing and practice
in our production environment. For stability reasons, it may not be the
first choice for us.


>
> It'll be more focused for everyone if we can have one major solution over
> this problem rather than spreading the energy over a few.
>

Agree.


>
> >
> >
> > >
> > > >
> > > > The following data shows the migration test results with an increase
> in
> > > > stress, ignoring the title row, the odd rows show the unrefined QEMU
> > > > test data, and the even rows show the refined QEMU test data:
> > > >
> > > >
> |---------+--------+-----------+--------------+------------+------------|
> > > > | ramsize | sucess | totaltime | downtime(ms) | iterations |
> switchover |
> > > > | (MB)    |        | (ms)      | (ms)         |            |
> throttle   |
> > > > |         |        |           |              |            |
> percent    |
> > > >
> |---------+--------+-----------+--------------+------------+------------|
> > > > |     350 |    yes |    170285 |          399 |         23 |
>  99 |
> > > > |     350 |    yes |     85962 |          412 |         24 |
>  70 |
> > > > |     350 |    yes |    176305 |          199 |         20 |
>  99 |
> > > > |     350 |    yes |     66729 |          321 |         11 |
>  40 |
> > > > |     400 |    yes |    183042 |          469 |         21 |
>  99 |
> > > > |     400 |    yes |     77225 |          421 |         10 |
>  30 |
> > > > |     400 |    yes |    183641 |          866 |         27 |
>  99 |
> > > > |     400 |    yes |     59796 |          479 |         15 |
>  50 |
> > > > |     450 |    yes |    165881 |          820 |         21 |
>  99 |
> > > > |     450 |    yes |     87930 |          368 |         20 |
>  90 |
> > > > |     450 |    yes |    195448 |          452 |         23 |
>  99 |
> > > > |     450 |    yes |     70394 |          295 |          6 |
>  20 |
> > > > |     500 |    yes |    112587 |          471 |         19 |
>  99 |
> > > > |     500 |    yes |     57401 |          299 |          5 |
>  30 |
> > > > |     500 |    yes |    110683 |          657 |         21 |
>  99 |
> > > > |     500 |    yes |     69949 |          649 |          8 |
>  40 |
> > > > |     600 |    yes |    111036 |          324 |         23 |
>  99 |
> > > > |     600 |    yes |     63455 |          346 |          4 |
>  20 |
> > > > |     600 |    yes |    126667 |          426 |         20 |
>  99 |
> > > > |     600 |    yes |    101024 |          643 |         20 |
>  99 |
> > > > |    1000 |    yes |    296216 |          660 |         23 |
>  99 |
> > > > |    1000 |    yes |    106915 |          468 |         16 |
>  99 |
> > > > |    1000 |     no |    300000 |              |            |
>     |
> > > > |    1000 |    yes |    125819 |          824 |         17 |
>  99 |
> > > > |    1200 |     no |    300000 |              |            |
>     |
> > > > |    1200 |    yes |    127379 |          664 |         14 |
>  90 |
> > > > |    1200 |     no |    300000 |              |            |
>     |
> > > > |    1200 |    yes |     67086 |          793 |         11 |
>  50 |
> > > >
> |---------+--------+-----------+--------------+------------+------------|
> > > >
> > > > To summarize the data above, any data that implies negative
> optimization
> > > > does not appear, and morover, the throttle algorithm seems to become
> more
> > > > responsive to dirty rate increases due to the refined detection.
> > > >
> > > > Signed-off-by: Hyman Huang <yong.huang@smartx.com>
> > > > ---
> > > >  migration/ram.c        | 48
> +++++++++++++++++++++++++++++++++++++++---
> > > >  migration/trace-events |  1 +
> > > >  2 files changed, 46 insertions(+), 3 deletions(-)
> > > >
> > > > diff --git a/migration/ram.c b/migration/ram.c
> > > > index edec1a2d07..18b2d422b5 100644
> > > > --- a/migration/ram.c
> > > > +++ b/migration/ram.c
> > > > @@ -414,6 +414,17 @@ struct RAMState {
> > > >       * RAM migration.
> > > >       */
> > > >      unsigned int postcopy_bmap_sync_requested;
> > > > +
> > > > +    /*
> > > > +     * Ratio of bytes_dirty_period and bytes_xfer_period in the last
> > > > +     * iteration
> > > > +     */
> > > > +    uint64_t dirty_ratio_pct;
> > > > +    /*
> > > > +     * How many times is the most recent iteration dirty ratio is
> > > > +     * higher than previous one
> > > > +     */
> > > > +    int dirty_ratio_high_cnt;
> > > >  };
> > > >  typedef struct RAMState RAMState;
> > > >
> > > > @@ -1013,6 +1024,32 @@ static void migration_dirty_limit_guest(void)
> > > >      trace_migration_dirty_limit_guest(quota_dirtyrate);
> > > >  }
> > > >
> > > > +static bool migration_dirty_ratio_unexpected(RAMState *rs)
> > > > +{
> > > > +    uint64_t threshold = migrate_throttle_trigger_threshold();
> > > > +    uint64_t bytes_xfer_period =
> > > > +        migration_transferred_bytes() - rs->bytes_xfer_prev;
> > > > +    uint64_t bytes_dirty_period = rs->num_dirty_pages_period * TARGET_PAGE_SIZE;
> > > > +    uint64_t prev, curr;
> > > > +
> > > > +    /* Skip the first iteration since it isn't helpful */
> > > > +    if (stat64_get(&mig_stats.dirty_sync_count) == 1 || !bytes_xfer_period) {
> > > > +        return false;
> > > > +    }
> > > > +
> > > > +    curr = 100 * (bytes_dirty_period * 1.0 / bytes_xfer_period);
> > > > +
> > > > +    prev = rs->dirty_ratio_pct;
> > > > +    rs->dirty_ratio_pct = curr;
> > > > +
> > > > +    if (prev == 0 || curr <= threshold) {
> > > > +        return false;
> > > > +    }
> > > > +
> > > > +    trace_dirty_ratio_pct(curr, prev);
> > > > +    return curr >= prev;
> > > > +}
> > > > +
> > > >  static void migration_trigger_throttle(RAMState *rs)
> > > >  {
> > > >      uint64_t threshold = migrate_throttle_trigger_threshold();
> > > > @@ -1028,9 +1065,14 @@ static void migration_trigger_throttle(RAMState *rs)
> > > >       * we were in this routine reaches the threshold. If that happens
> > > >       * twice, start or increase throttling.
> > > >       */
> > > > -    if ((bytes_dirty_period > bytes_dirty_threshold) &&
> > > > -        (++rs->dirty_rate_high_cnt >= 2)) {
> > > > -        rs->dirty_rate_high_cnt = 0;
> > > > +    if ((migration_dirty_ratio_unexpected(rs) &&
> > > > +         (++rs->dirty_ratio_high_cnt >= 2)) ||
> > > > +        ((bytes_dirty_period > bytes_dirty_threshold) &&
> > > > +         (++rs->dirty_rate_high_cnt >= 2))) {
> > >
> > > I'm afraid this is a mess.
> > >
> > > Now it's (A||B) and any condition can trigger this throttle logic.
> > > Both A & B make decisions on merely the same parameters.  It's totally
> > > unpredictable to me how these cnts bump, and not readable.
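For illustration only, here is a tiny standalone C program (not QEMU code;
the byte counts are made up) that mimics the combined condition above and
shows how the two counters bump, short-circuit, and reset independently:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define THRESHOLD 50 /* throttle-trigger-threshold, percent */

static uint64_t prev_ratio_pct;
static int ratio_high_cnt, rate_high_cnt;

/* Same shape as migration_dirty_ratio_unexpected(): ratio not improving */
static bool ratio_unexpected(uint64_t dirty_bytes, uint64_t xfer_bytes)
{
    uint64_t curr = 100 * dirty_bytes / xfer_bytes;
    uint64_t prev = prev_ratio_pct;

    prev_ratio_pct = curr;
    if (prev == 0 || curr <= THRESHOLD) {
        return false;
    }
    return curr >= prev;
}

static void iteration(uint64_t dirty_bytes, uint64_t xfer_bytes)
{
    uint64_t dirty_threshold = xfer_bytes * THRESHOLD / 100;

    /* The (A||B) form under discussion: note the short-circuiting */
    if ((ratio_unexpected(dirty_bytes, xfer_bytes) && ++ratio_high_cnt >= 2) ||
        (dirty_bytes > dirty_threshold && ++rate_high_cnt >= 2)) {
        rate_high_cnt = rate_high_cnt >= 2 ? 0 : rate_high_cnt;
        ratio_high_cnt = ratio_high_cnt >= 2 ? 0 : ratio_high_cnt;
        printf("throttle   (ratio_cnt=%d rate_cnt=%d)\n",
               ratio_high_cnt, rate_high_cnt);
    } else {
        printf("no action  (ratio_cnt=%d rate_cnt=%d)\n",
               ratio_high_cnt, rate_high_cnt);
    }
}

int main(void)
{
    /* dirty bytes grow while transferred bytes per iteration stay flat */
    iteration(600, 1000);
    iteration(700, 1000);
    iteration(800, 1000);
    iteration(900, 1000);
    return 0;
}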
> > >
> >
> > Indeed, this is not readable. How about introducing a migration
> > capability like "precise-detection" to make this patch backward-compatible?
>
> If we want to do some change here anyway, I'd think it better we do it in
> the best form rather than keep adding caps.


> For now "stick with postcopy" is still one way to go, but not sure whether
> it applies in your setup.  Otherwise I'd consider we sync more for huge
> VMs.  I mean "throttle only for each sync, sync for each iteration" may
> make sense in the old days, but perhaps not anymore.  That sounds like the
> real root cause of this issue.
>

These inspired me,  I'll try this way. :)
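As a rough, self-contained sketch of that direction (plain C, not QEMU code;
every name and number below is fabricated): sync the dirty bitmap whenever a
fixed budget of bytes has been sent since the last sync, rather than once per
full RAM pass, and make the throttle decision at every sync using the default
50% threshold.

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

#define GiB (1ULL << 30)
#define MiB (1ULL << 20)

static const uint64_t sync_budget = 4 * GiB;    /* bytes sent between syncs */

static void sync_and_check(uint64_t dirty_period, uint64_t sent_period)
{
    /* stand-in for the real bitmap sync plus convergence check */
    printf("sync: dirty=%" PRIu64 "MiB sent=%" PRIu64 "MiB -> %s\n",
           dirty_period / MiB, sent_period / MiB,
           dirty_period * 100 > sent_period * 50 ? "throttle" : "ok");
}

int main(void)
{
    uint64_t sent = 0, last_sync = 0;
    uint64_t dirty_per_sync = 1 * GiB;          /* fabricated dirty bytes */

    /* pretend to stream a 28 GiB guest in 256 MiB send steps */
    while (sent < 28 * GiB) {
        sent += 256 * MiB;
        if (sent - last_sync >= sync_budget) {
            sync_and_check(dirty_per_sync, sent - last_sync);
            dirty_per_sync += 512 * MiB;        /* the workload heats up */
            last_sync = sent;
        }
    }
    return 0;
}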


>
> >
> >
> > >
> > > What I kind of see is how this could make migration shorter: now either
> > > A or B can trigger the throttle, so it triggers at a faster pace than
> > > before.  E.g. IIUC if we drop dirty_rate_high_cnt completely it'll also
> > > achieve a similar thing, at least in guestperf tests.
> > >
> >
> > Yes!  I get it; my original idea was to drop the
> > dirty_rate_high_cnt.
> >
> > As we mentioned above: migration needs to pay more for the sync overhead
> > once a VM is configured with huge memory or running at a high dirty rate.
> >
> > Dropping the dirty_rate_high_cnt will make the iteration take less time
> > in the above cases.  I think this is feasible since there is no other
> > reason to retain the dirty_rate_high_cnt once we make the detection
> > precise; let me know if I missed something.
>

I still think dropping the dirty_rate_high_cnt is a refinement once migration
has a precise detection of convergence. What do you think?
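To make the question concrete, one possible reading of it (an untested sketch,
not the actual proposal; it assumes the two checks are simply combined so that
a single sync where the dirty bytes exceed the threshold and the efficiency
ratio is not improving is enough) would reduce the trigger in
migration_trigger_throttle() to something like:

    /* Sketch only: dirty_rate_high_cnt dropped, no repeated confirmation */
    if (migration_dirty_ratio_unexpected(rs) &&
        bytes_dirty_period > bytes_dirty_threshold) {
        if (migrate_auto_converge()) {
            trace_migration_throttle();
            mig_throttle_guest_down(bytes_dirty_period,
                                    bytes_dirty_threshold);
        }
    }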


> >
> >
> > >
> > > Have you considered what I mentioned above that may make auto converge
> > > work better with "real huge VM"s (by allowing sync >1 times for each
> > > iteration), or have you considered postcopy?
> > >
> >
> > Yes, IMHO, this is another refinement direction for the auto-converge;
> > we'll try this and keep communicating with upstream once it makes progress.
>
> Thanks,
>
> --
> Peter Xu
>
>
Peter Xu Aug. 7, 2024, 12:59 p.m. UTC | #5
On Wed, Aug 07, 2024 at 02:46:29PM +0800, Yong Huang wrote:
> I still think dropping the dirty_rate_high_cnt is a refinement once migration
> has a precise detection of convergence. What do you think?

Maybe; I haven't thought those through yet.  If so, it will make sense to base
this on the sync change first; then we can have a parameter (doesn't sound
like a cap on its own) for keeping the old auto-converge, and pile new logic
there.

Thanks,
Yong Huang Aug. 7, 2024, 3:43 p.m. UTC | #6
On Wed, Aug 7, 2024 at 8:59 PM Peter Xu <peterx@redhat.com> wrote:

> On Wed, Aug 07, 2024 at 02:46:29PM +0800, Yong Huang wrote:
> > I still think dropping the dirty_rate_high_cnt is a refinement once
> > migration has a precise detection of convergence. What do you think?
>
> Maybe; I haven't thought those through yet.  If so, it will make sense to
> base this on the sync change first; then we can have a parameter (doesn't
> sound like a cap on its own) for keeping the old auto-converge, and pile
> new logic there.
>

Got it, I'll try it this way and see how it works


>
> Thanks,
>
> --
> Peter Xu
>
>
diff mbox series

Patch

diff --git a/migration/ram.c b/migration/ram.c
index edec1a2d07..18b2d422b5 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -414,6 +414,17 @@  struct RAMState {
      * RAM migration.
      */
     unsigned int postcopy_bmap_sync_requested;
+
+    /*
+     * Ratio of bytes_dirty_period and bytes_xfer_period in the last
+     * iteration
+     */
+    uint64_t dirty_ratio_pct;
+    /*
+     * How many times the most recent iteration's dirty ratio has been
+     * higher than the previous one
+     */
+    int dirty_ratio_high_cnt;
 };
 typedef struct RAMState RAMState;
 
@@ -1013,6 +1024,32 @@  static void migration_dirty_limit_guest(void)
     trace_migration_dirty_limit_guest(quota_dirtyrate);
 }
 
+static bool migration_dirty_ratio_unexpected(RAMState *rs)
+{
+    uint64_t threshold = migrate_throttle_trigger_threshold();
+    uint64_t bytes_xfer_period =
+        migration_transferred_bytes() - rs->bytes_xfer_prev;
+    uint64_t bytes_dirty_period = rs->num_dirty_pages_period * TARGET_PAGE_SIZE;
+    uint64_t prev, curr;
+
+    /* Skip the first iteration since it isn't helpful */
+    if (stat64_get(&mig_stats.dirty_sync_count) == 1 || !bytes_xfer_period) {
+        return false;
+    }
+
+    curr = 100 * (bytes_dirty_period * 1.0 / bytes_xfer_period);
+
+    prev = rs->dirty_ratio_pct;
+    rs->dirty_ratio_pct = curr;
+
+    if (prev == 0 || curr <= threshold) {
+        return false;
+    }
+
+    trace_dirty_ratio_pct(curr, prev);
+    return curr >= prev;
+}
+
 static void migration_trigger_throttle(RAMState *rs)
 {
     uint64_t threshold = migrate_throttle_trigger_threshold();
@@ -1028,9 +1065,14 @@  static void migration_trigger_throttle(RAMState *rs)
      * we were in this routine reaches the threshold. If that happens
      * twice, start or increase throttling.
      */
-    if ((bytes_dirty_period > bytes_dirty_threshold) &&
-        (++rs->dirty_rate_high_cnt >= 2)) {
-        rs->dirty_rate_high_cnt = 0;
+    if ((migration_dirty_ratio_unexpected(rs) &&
+         (++rs->dirty_ratio_high_cnt >= 2)) ||
+        ((bytes_dirty_period > bytes_dirty_threshold) &&
+         (++rs->dirty_rate_high_cnt >= 2))) {
+        rs->dirty_rate_high_cnt =
+            rs->dirty_rate_high_cnt >= 2 ? 0 : rs->dirty_rate_high_cnt;
+        rs->dirty_ratio_high_cnt =
+            rs->dirty_ratio_high_cnt >= 2 ? 0 : rs->dirty_ratio_high_cnt;
         if (migrate_auto_converge()) {
             trace_migration_throttle();
             mig_throttle_guest_down(bytes_dirty_period,
diff --git a/migration/trace-events b/migration/trace-events
index 0b7c3324fb..654c52c5e4 100644
--- a/migration/trace-events
+++ b/migration/trace-events
@@ -90,6 +90,7 @@  put_qlist_end(const char *field_name, const char *vmsd_name) "%s(%s)"
 qemu_file_fclose(void) ""
 
 # ram.c
+dirty_ratio_pct(uint64_t cur, uint64_t prev) "current ratio: %" PRIu64 " previous ratio: %" PRIu64
 get_queued_page(const char *block_name, uint64_t tmp_offset, unsigned long page_abs) "%s/0x%" PRIx64 " page_abs=0x%lx"
 get_queued_page_not_dirty(const char *block_name, uint64_t tmp_offset, unsigned long page_abs) "%s/0x%" PRIx64 " page_abs=0x%lx"
 migration_bitmap_sync_start(void) ""