diff mbox series

[RFC,10/10] tests/migration-tests: Add test case for responsive CPU throttle

Message ID 96eeea4efd3417212d6e2639bc118b90d4dcf926.1725889277.git.yong.huang@smartx.com
State New
Headers show
Series migration: auto-converge refinements for huge VM | expand

Commit Message

Yong Huang Sept. 9, 2024, 1:47 p.m. UTC
Despite the fact that the responsive CPU throttle is enabled,
the dirty sync count may not always increase because this is
an optimization that might not happen in any situation.

This test case just making sure it doesn't interfere with any
current functionality.

Signed-off-by: Hyman Huang <yong.huang@smartx.com>
---
 tests/qtest/migration-test.c | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

Comments

Peter Maydell Sept. 9, 2024, 2:02 p.m. UTC | #1
On Mon, 9 Sept 2024 at 14:51, Hyman Huang <yong.huang@smartx.com> wrote:
>
> Despite the fact that the responsive CPU throttle is enabled,
> the dirty sync count may not always increase because this is
> an optimization that might not happen in any situation.
>
> This test case just making sure it doesn't interfere with any
> current functionality.
>
> Signed-off-by: Hyman Huang <yong.huang@smartx.com>

tests/qtest/migration-test already runs 75 different
subtests, takes up a massive chunk of our "make check"
time, and is very commonly a "times out" test on some
of our CI jobs. It runs on five different guest CPU
architectures, each one of which takes between 2 and
5 minutes to complete the full migration-test.

Do we really need to make it even bigger?

thanks
-- PMM
Peter Xu Sept. 9, 2024, 2:36 p.m. UTC | #2
On Mon, Sep 09, 2024 at 03:02:57PM +0100, Peter Maydell wrote:
> On Mon, 9 Sept 2024 at 14:51, Hyman Huang <yong.huang@smartx.com> wrote:
> >
> > Despite the fact that the responsive CPU throttle is enabled,
> > the dirty sync count may not always increase because this is
> > an optimization that might not happen in any situation.
> >
> > This test case just making sure it doesn't interfere with any
> > current functionality.
> >
> > Signed-off-by: Hyman Huang <yong.huang@smartx.com>
> 
> tests/qtest/migration-test already runs 75 different
> subtests, takes up a massive chunk of our "make check"
> time, and is very commonly a "times out" test on some
> of our CI jobs. It runs on five different guest CPU
> architectures, each one of which takes between 2 and
> 5 minutes to complete the full migration-test.
> 
> Do we really need to make it even bigger?

I'll try to find some time in the next few weeks looking into this to see
whether we can further shrink migration test times after previous attemps
from Dan.  At least a low hanging fruit is we should indeed put some more
tests into g_test_slow(), and this new test could also be a candidate (then
we can run "-m slow" for migration PRs only).

Thanks,
Yong Huang Sept. 9, 2024, 2:43 p.m. UTC | #3
On Mon, Sep 9, 2024 at 10:03 PM Peter Maydell <peter.maydell@linaro.org>
wrote:

> On Mon, 9 Sept 2024 at 14:51, Hyman Huang <yong.huang@smartx.com> wrote:
> >
> > Despite the fact that the responsive CPU throttle is enabled,
> > the dirty sync count may not always increase because this is
> > an optimization that might not happen in any situation.
> >
> > This test case just making sure it doesn't interfere with any
> > current functionality.
> >
> > Signed-off-by: Hyman Huang <yong.huang@smartx.com>
>
> tests/qtest/migration-test already runs 75 different
> subtests, takes up a massive chunk of our "make check"
> time, and is very commonly a "times out" test on some
> of our CI jobs. It runs on five different guest CPU
> architectures, each one of which takes between 2 and
> 5 minutes to complete the full migration-test.
>
> Do we really need to make it even bigger?
>

No, I don't insist on that; the cpu-responsive-throttle
parameter may also be enabled on the existing
migrate_auto_converge test case by default.

Thank for the comment.

Yong.


>
> thanks
> -- PMM
>
Fabiano Rosas Sept. 9, 2024, 9:54 p.m. UTC | #4
Peter Xu <peterx@redhat.com> writes:

> On Mon, Sep 09, 2024 at 03:02:57PM +0100, Peter Maydell wrote:
>> On Mon, 9 Sept 2024 at 14:51, Hyman Huang <yong.huang@smartx.com> wrote:
>> >
>> > Despite the fact that the responsive CPU throttle is enabled,
>> > the dirty sync count may not always increase because this is
>> > an optimization that might not happen in any situation.
>> >
>> > This test case just making sure it doesn't interfere with any
>> > current functionality.
>> >
>> > Signed-off-by: Hyman Huang <yong.huang@smartx.com>
>> 
>> tests/qtest/migration-test already runs 75 different
>> subtests, takes up a massive chunk of our "make check"
>> time, and is very commonly a "times out" test on some
>> of our CI jobs. It runs on five different guest CPU
>> architectures, each one of which takes between 2 and
>> 5 minutes to complete the full migration-test.
>> 
>> Do we really need to make it even bigger?
>
> I'll try to find some time in the next few weeks looking into this to see
> whether we can further shrink migration test times after previous attemps
> from Dan.  At least a low hanging fruit is we should indeed put some more
> tests into g_test_slow(), and this new test could also be a candidate (then
> we can run "-m slow" for migration PRs only).

I think we could (using -m slow or any other method) separate tests
that are generic enough that every CI run should benefit from them
vs. tests that are only useful once someone starts touching migration
code. I'd say very few in the former category and most of them in the
latter.

For an idea of where migration bugs lie, I took a look at what was
fixed since 2022:

# bugs | device/subsystem/arch
----------------------------------
    54 | migration
    10 | vfio
     6 | ppc
     3 | virtio-gpu
     2 | pcie_sriov, tpm_emulator,
          vdpa, virtio-rng-pci
     1 | arm, block, gpio, lasi,
          pci, s390, scsi-disk,
          virtio-mem, TCG

From these, ignoring the migration bugs, the migration-tests cover some
of: arm, ppc, s390, TCG. The device_opts[1] patch hasn't merged yet, but
once it is, then virtio-gpu would be covered and we could investigate
adding some of the others.

For actual migration code issues:

# bugs | (sub)subsystem | kind
----------------------------------------------
    13 | multifd        | correctness/races
     8 | ram            | correctness
     8 | rdma:          | general programming
     7 | qmp            | new api bugs
     5 | postcopy       | races
     4 | file:          | leaks
     3 | return path    | races
     3 | fd_cleanup     | races
     2 | savevm, aio/coroutines
     1 | xbzrle, colo, dirtyrate, exec:,
          windows, iochannel, qemufile,
          arch (ppc64le)

Here, the migration-tests cover well: multifd, ram, qmp, postcopy,
file, rp, fd_cleanup, iochannel, qemufile, xbzrle.

My suggestion is we run per arch:

"/precopy/tcp/plain"
"/precopy/tcp/tls/psk/match",
"/postcopy/plain"
"/postcopy/preempt/plain"
"/postcopy/preempt/recovery/plain"
"/multifd/tcp/plain/cancel"
"/multifd/tcp/uri/plain/none"

and x86 gets extra:

"/precopy/unix/suspend/live"
"/precopy/unix/suspend/notlive"
"/dirty_ring"

(the other dirty_* tests are too slow)

All the rest go behind a knob that people touching migration code will
enable.

wdyt?

1- allows adding devices to QEMU cmdline for migration-test
https://lore.kernel.org/r/20240523201922.28007-4-farosas@suse.de
Peter Xu Sept. 10, 2024, 9:22 p.m. UTC | #5
On Mon, Sep 09, 2024 at 06:54:46PM -0300, Fabiano Rosas wrote:
> Peter Xu <peterx@redhat.com> writes:
> 
> > On Mon, Sep 09, 2024 at 03:02:57PM +0100, Peter Maydell wrote:
> >> On Mon, 9 Sept 2024 at 14:51, Hyman Huang <yong.huang@smartx.com> wrote:
> >> >
> >> > Despite the fact that the responsive CPU throttle is enabled,
> >> > the dirty sync count may not always increase because this is
> >> > an optimization that might not happen in any situation.
> >> >
> >> > This test case just making sure it doesn't interfere with any
> >> > current functionality.
> >> >
> >> > Signed-off-by: Hyman Huang <yong.huang@smartx.com>
> >> 
> >> tests/qtest/migration-test already runs 75 different
> >> subtests, takes up a massive chunk of our "make check"
> >> time, and is very commonly a "times out" test on some
> >> of our CI jobs. It runs on five different guest CPU
> >> architectures, each one of which takes between 2 and
> >> 5 minutes to complete the full migration-test.
> >> 
> >> Do we really need to make it even bigger?
> >
> > I'll try to find some time in the next few weeks looking into this to see
> > whether we can further shrink migration test times after previous attemps
> > from Dan.  At least a low hanging fruit is we should indeed put some more
> > tests into g_test_slow(), and this new test could also be a candidate (then
> > we can run "-m slow" for migration PRs only).
> 
> I think we could (using -m slow or any other method) separate tests
> that are generic enough that every CI run should benefit from them
> vs. tests that are only useful once someone starts touching migration
> code. I'd say very few in the former category and most of them in the
> latter.
> 
> For an idea of where migration bugs lie, I took a look at what was
> fixed since 2022:
> 
> # bugs | device/subsystem/arch
> ----------------------------------
>     54 | migration
>     10 | vfio
>      6 | ppc
>      3 | virtio-gpu
>      2 | pcie_sriov, tpm_emulator,
>           vdpa, virtio-rng-pci
>      1 | arm, block, gpio, lasi,
>           pci, s390, scsi-disk,
>           virtio-mem, TCG

Just curious; how did you collect these?

> 
> From these, ignoring the migration bugs, the migration-tests cover some
> of: arm, ppc, s390, TCG. The device_opts[1] patch hasn't merged yet, but
> once it is, then virtio-gpu would be covered and we could investigate
> adding some of the others.
> 
> For actual migration code issues:
> 
> # bugs | (sub)subsystem | kind
> ----------------------------------------------
>     13 | multifd        | correctness/races
>      8 | ram            | correctness
>      8 | rdma:          | general programming

8 rdma bugs??? ouch..

>      7 | qmp            | new api bugs
>      5 | postcopy       | races
>      4 | file:          | leaks
>      3 | return path    | races
>      3 | fd_cleanup     | races
>      2 | savevm, aio/coroutines
>      1 | xbzrle, colo, dirtyrate, exec:,
>           windows, iochannel, qemufile,
>           arch (ppc64le)
> 
> Here, the migration-tests cover well: multifd, ram, qmp, postcopy,
> file, rp, fd_cleanup, iochannel, qemufile, xbzrle.
> 
> My suggestion is we run per arch:
> 
> "/precopy/tcp/plain"
> "/precopy/tcp/tls/psk/match",
> "/postcopy/plain"
> "/postcopy/preempt/plain"
> "/postcopy/preempt/recovery/plain"
> "/multifd/tcp/plain/cancel"
> "/multifd/tcp/uri/plain/none"

Don't you want to still keep a few multifd / file tests?

IIUC some file ops can still be relevant to archs.  Multifd still has one
bug that can only reproduce on arm64.. but not x86_64.  I remember it's a
race condition when migration finishes, and the issue could be memory
ordering relevant, but maybe not.

> 
> and x86 gets extra:
> 
> "/precopy/unix/suspend/live"
> "/precopy/unix/suspend/notlive"
> "/dirty_ring"

dirty ring will be disabled anyway when !x86, so probably not a major
concern.

> 
> (the other dirty_* tests are too slow)

These are the 10 slowest tests when I run locally:

/x86_64/migration/multifd/tcp/tls/x509/allow-anon-client 2.41
/x86_64/migration/postcopy/recovery/plain 2.43
/x86_64/migration/multifd/tcp/tls/x509/default-host 2.66
/x86_64/migration/multifd/tcp/tls/x509/override-host 2.86
/x86_64/migration/postcopy/tls/psk 2.91
/x86_64/migration/postcopy/preempt/recovery/tls/psk 3.08
/x86_64/migration/postcopy/preempt/tls/psk 3.30
/x86_64/migration/postcopy/recovery/tls/psk 3.81
/x86_64/migration/vcpu_dirty_limit 13.29
/x86_64/migration/precopy/unix/xbzrle 27.55

Are you aware of people using xbzrle at all?

> 
> All the rest go behind a knob that people touching migration code will
> enable.
> 
> wdyt?

Agree with the general idea, but I worry above exact list can be too small.

IMHO we can definitely, at least, move the last two into slow list
(vcpu_dirty_limit and xbzrle), then it'll already save us 40sec each run..

> 
> 1- allows adding devices to QEMU cmdline for migration-test
> https://lore.kernel.org/r/20240523201922.28007-4-farosas@suse.de
>
Fabiano Rosas Sept. 10, 2024, 10:23 p.m. UTC | #6
Peter Xu <peterx@redhat.com> writes:

> On Mon, Sep 09, 2024 at 06:54:46PM -0300, Fabiano Rosas wrote:
>> Peter Xu <peterx@redhat.com> writes:
>> 
>> > On Mon, Sep 09, 2024 at 03:02:57PM +0100, Peter Maydell wrote:
>> >> On Mon, 9 Sept 2024 at 14:51, Hyman Huang <yong.huang@smartx.com> wrote:
>> >> >
>> >> > Despite the fact that the responsive CPU throttle is enabled,
>> >> > the dirty sync count may not always increase because this is
>> >> > an optimization that might not happen in any situation.
>> >> >
>> >> > This test case just making sure it doesn't interfere with any
>> >> > current functionality.
>> >> >
>> >> > Signed-off-by: Hyman Huang <yong.huang@smartx.com>
>> >> 
>> >> tests/qtest/migration-test already runs 75 different
>> >> subtests, takes up a massive chunk of our "make check"
>> >> time, and is very commonly a "times out" test on some
>> >> of our CI jobs. It runs on five different guest CPU
>> >> architectures, each one of which takes between 2 and
>> >> 5 minutes to complete the full migration-test.
>> >> 
>> >> Do we really need to make it even bigger?
>> >
>> > I'll try to find some time in the next few weeks looking into this to see
>> > whether we can further shrink migration test times after previous attemps
>> > from Dan.  At least a low hanging fruit is we should indeed put some more
>> > tests into g_test_slow(), and this new test could also be a candidate (then
>> > we can run "-m slow" for migration PRs only).
>> 
>> I think we could (using -m slow or any other method) separate tests
>> that are generic enough that every CI run should benefit from them
>> vs. tests that are only useful once someone starts touching migration
>> code. I'd say very few in the former category and most of them in the
>> latter.
>> 
>> For an idea of where migration bugs lie, I took a look at what was
>> fixed since 2022:
>> 
>> # bugs | device/subsystem/arch
>> ----------------------------------
>>     54 | migration
>>     10 | vfio
>>      6 | ppc
>>      3 | virtio-gpu
>>      2 | pcie_sriov, tpm_emulator,
>>           vdpa, virtio-rng-pci
>>      1 | arm, block, gpio, lasi,
>>           pci, s390, scsi-disk,
>>           virtio-mem, TCG
>
> Just curious; how did you collect these?

git log --since=2022 and then squinted at it. I wrote a warning to take
this with a grain of salt, but it missed the final version.

>
>> 
>> From these, ignoring the migration bugs, the migration-tests cover some
>> of: arm, ppc, s390, TCG. The device_opts[1] patch hasn't merged yet, but
>> once it is, then virtio-gpu would be covered and we could investigate
>> adding some of the others.
>> 
>> For actual migration code issues:
>> 
>> # bugs | (sub)subsystem | kind
>> ----------------------------------------------
>>     13 | multifd        | correctness/races
>>      8 | ram            | correctness
>>      8 | rdma:          | general programming
>
> 8 rdma bugs??? ouch..

Mostly caught by a cleanup from Markus. Silly stuff like of mixed signed
integer comparisons and bugs in error handling. I don't even want to
look too much at it.

...hopefully this release we'll manage to resolve that situation.

>
>>      7 | qmp            | new api bugs
>>      5 | postcopy       | races
>>      4 | file:          | leaks
>>      3 | return path    | races
>>      3 | fd_cleanup     | races
>>      2 | savevm, aio/coroutines
>>      1 | xbzrle, colo, dirtyrate, exec:,
>>           windows, iochannel, qemufile,
>>           arch (ppc64le)
>> 
>> Here, the migration-tests cover well: multifd, ram, qmp, postcopy,
>> file, rp, fd_cleanup, iochannel, qemufile, xbzrle.
>> 
>> My suggestion is we run per arch:
>> 
>> "/precopy/tcp/plain"
>> "/precopy/tcp/tls/psk/match",
>> "/postcopy/plain"
>> "/postcopy/preempt/plain"
>> "/postcopy/preempt/recovery/plain"
>> "/multifd/tcp/plain/cancel"
>> "/multifd/tcp/uri/plain/none"
>
> Don't you want to still keep a few multifd / file tests?

Not really, but I won't object if you want to add some more in there. To
be honest, I want to get out of people's way as much as I can because
having to revisit this every couple of months is stressful to me.

My rationale for those is:

"/precopy/tcp/plain":
 Smoke test, the most common migration

"/precopy/tcp/tls/psk/match":
 Something might change in the distro regarding tls. Such as:
 https://gitlab.com/qemu-project/qemu/-/issues/1937

"/postcopy/plain":
 Smoke test for postcopy

"/postcopy/preempt/plain":
 Just to exercise the preempt thread

"/postcopy/preempt/recovery/plain":
 Recovery has had some issues with races in the past

"/multifd/tcp/plain/cancel":
 The MVP of catching concurrency issues

"/multifd/tcp/uri/plain/none":
 Smoke test for multifd

All in all, these will test basic funcionality and run very often. The
more tests we add to this set, the less return we get in relation to the
time they take.

>
> IIUC some file ops can still be relevant to archs.  Multifd still has one
> bug that can only reproduce on arm64.. but not x86_64.  I remember it's a
> race condition when migration finishes, and the issue could be memory
> ordering relevant, but maybe not.

I'm not aware of anything. I believe the last arm64 bug we had was the
threadinfo stuff[1]. If you remember what it's about, let me know.

1- 01ec0f3a92 ("migration/multifd: Protect accesses to migration_threads").

>
>> 
>> and x86 gets extra:
>> 
>> "/precopy/unix/suspend/live"
>> "/precopy/unix/suspend/notlive"
>> "/dirty_ring"
>
> dirty ring will be disabled anyway when !x86, so probably not a major
> concern.
>
>> 
>> (the other dirty_* tests are too slow)
>
> These are the 10 slowest tests when I run locally:
>
> /x86_64/migration/multifd/tcp/tls/x509/allow-anon-client 2.41
> /x86_64/migration/postcopy/recovery/plain 2.43
> /x86_64/migration/multifd/tcp/tls/x509/default-host 2.66
> /x86_64/migration/multifd/tcp/tls/x509/override-host 2.86
> /x86_64/migration/postcopy/tls/psk 2.91
> /x86_64/migration/postcopy/preempt/recovery/tls/psk 3.08
> /x86_64/migration/postcopy/preempt/tls/psk 3.30
> /x86_64/migration/postcopy/recovery/tls/psk 3.81
> /x86_64/migration/vcpu_dirty_limit 13.29
> /x86_64/migration/precopy/unix/xbzrle 27.55
>
> Are you aware of people using xbzrle at all?

Nope.

>
>> 
>> All the rest go behind a knob that people touching migration code will
>> enable.
>> 
>> wdyt?
>
> Agree with the general idea, but I worry above exact list can be too small.

We won't stop running the full set of tests. We can run them in our
forks' CI as much as we want. There are no cases of people reporting a
migration bug because their 'make check' caught something that ours
didn't.

Besides, the main strength of CI is to catch bugs when someone makes a
code change. If people touch migration code, then we'll run it in our CI
anyway. If they touch device code and that device is migrated by default
then any one of the simple tests will catch the issue when it runs via
the migration-compat job. If the device is not enabled by default, then
no tests will catch it.

The worst case scenario is they touch some code completely unrelated and
their 'make check' or CI run breaks because of some race in the
migration code. That's not what CI is for, that's just an annoyance for
everyone. I'd rather it breaks in our hands and then we'll go fix it.

>
> IMHO we can definitely, at least, move the last two into slow list
> (vcpu_dirty_limit and xbzrle), then it'll already save us 40sec each run..

Agreed. I'll send a patch once I get out from under downstream stuff.

>
>> 
>> 1- allows adding devices to QEMU cmdline for migration-test
>> https://lore.kernel.org/r/20240523201922.28007-4-farosas@suse.de
>>
Peter Xu Sept. 11, 2024, 3:59 p.m. UTC | #7
On Tue, Sep 10, 2024 at 07:23:43PM -0300, Fabiano Rosas wrote:
> Peter Xu <peterx@redhat.com> writes:
> 
> > On Mon, Sep 09, 2024 at 06:54:46PM -0300, Fabiano Rosas wrote:
> >> Peter Xu <peterx@redhat.com> writes:
> >> 
> >> > On Mon, Sep 09, 2024 at 03:02:57PM +0100, Peter Maydell wrote:
> >> >> On Mon, 9 Sept 2024 at 14:51, Hyman Huang <yong.huang@smartx.com> wrote:
> >> >> >
> >> >> > Despite the fact that the responsive CPU throttle is enabled,
> >> >> > the dirty sync count may not always increase because this is
> >> >> > an optimization that might not happen in any situation.
> >> >> >
> >> >> > This test case just making sure it doesn't interfere with any
> >> >> > current functionality.
> >> >> >
> >> >> > Signed-off-by: Hyman Huang <yong.huang@smartx.com>
> >> >> 
> >> >> tests/qtest/migration-test already runs 75 different
> >> >> subtests, takes up a massive chunk of our "make check"
> >> >> time, and is very commonly a "times out" test on some
> >> >> of our CI jobs. It runs on five different guest CPU
> >> >> architectures, each one of which takes between 2 and
> >> >> 5 minutes to complete the full migration-test.
> >> >> 
> >> >> Do we really need to make it even bigger?
> >> >
> >> > I'll try to find some time in the next few weeks looking into this to see
> >> > whether we can further shrink migration test times after previous attemps
> >> > from Dan.  At least a low hanging fruit is we should indeed put some more
> >> > tests into g_test_slow(), and this new test could also be a candidate (then
> >> > we can run "-m slow" for migration PRs only).
> >> 
> >> I think we could (using -m slow or any other method) separate tests
> >> that are generic enough that every CI run should benefit from them
> >> vs. tests that are only useful once someone starts touching migration
> >> code. I'd say very few in the former category and most of them in the
> >> latter.
> >> 
> >> For an idea of where migration bugs lie, I took a look at what was
> >> fixed since 2022:
> >> 
> >> # bugs | device/subsystem/arch
> >> ----------------------------------
> >>     54 | migration
> >>     10 | vfio
> >>      6 | ppc
> >>      3 | virtio-gpu
> >>      2 | pcie_sriov, tpm_emulator,
> >>           vdpa, virtio-rng-pci
> >>      1 | arm, block, gpio, lasi,
> >>           pci, s390, scsi-disk,
> >>           virtio-mem, TCG
> >
> > Just curious; how did you collect these?
> 
> git log --since=2022 and then squinted at it. I wrote a warning to take
> this with a grain of salt, but it missed the final version.
> 
> >
> >> 
> >> From these, ignoring the migration bugs, the migration-tests cover some
> >> of: arm, ppc, s390, TCG. The device_opts[1] patch hasn't merged yet, but
> >> once it is, then virtio-gpu would be covered and we could investigate
> >> adding some of the others.
> >> 
> >> For actual migration code issues:
> >> 
> >> # bugs | (sub)subsystem | kind
> >> ----------------------------------------------
> >>     13 | multifd        | correctness/races
> >>      8 | ram            | correctness
> >>      8 | rdma:          | general programming
> >
> > 8 rdma bugs??? ouch..
> 
> Mostly caught by a cleanup from Markus. Silly stuff like of mixed signed
> integer comparisons and bugs in error handling. I don't even want to
> look too much at it.
> 
> ...hopefully this release we'll manage to resolve that situation.
> 
> >
> >>      7 | qmp            | new api bugs
> >>      5 | postcopy       | races
> >>      4 | file:          | leaks
> >>      3 | return path    | races
> >>      3 | fd_cleanup     | races
> >>      2 | savevm, aio/coroutines
> >>      1 | xbzrle, colo, dirtyrate, exec:,
> >>           windows, iochannel, qemufile,
> >>           arch (ppc64le)
> >> 
> >> Here, the migration-tests cover well: multifd, ram, qmp, postcopy,
> >> file, rp, fd_cleanup, iochannel, qemufile, xbzrle.
> >> 
> >> My suggestion is we run per arch:
> >> 
> >> "/precopy/tcp/plain"
> >> "/precopy/tcp/tls/psk/match",
> >> "/postcopy/plain"
> >> "/postcopy/preempt/plain"
> >> "/postcopy/preempt/recovery/plain"
> >> "/multifd/tcp/plain/cancel"
> >> "/multifd/tcp/uri/plain/none"
> >
> > Don't you want to still keep a few multifd / file tests?
> 
> Not really, but I won't object if you want to add some more in there. To
> be honest, I want to get out of people's way as much as I can because
> having to revisit this every couple of months is stressful to me.

I hope migration tests are not too obstructive yet so far :) - this
discussion is still within a context where we might add one more slow test
in CI, and we probably simply should make it a -m slow test.

> 
> My rationale for those is:
> 
> "/precopy/tcp/plain":
>  Smoke test, the most common migration

E.g. unix is missing, and I'm not sure which we use for e.g. kubevirt.

And note that even if unix shares the socket iochannel with tcp, it may not
work the same.  For example, if you remember I mentioned I was looking at
threadify the dest qemu receive side, i have a draft there but currently it
has a bug to hang a unix migration, not tcp..

So IMHO it's not easy to justify which we can simply drop, if we still want
to provide some good gating in CI.  And I won't be 100% surprised if some
other non-migration PR (e.g. io/) changed some slight bit that unix is
broken and tcp keeps working, for example, then we loose some CI benefits.

> 
> "/precopy/tcp/tls/psk/match":
>  Something might change in the distro regarding tls. Such as:
>  https://gitlab.com/qemu-project/qemu/-/issues/1937
> 
> "/postcopy/plain":
>  Smoke test for postcopy
> 
> "/postcopy/preempt/plain":
>  Just to exercise the preempt thread
> 
> "/postcopy/preempt/recovery/plain":
>  Recovery has had some issues with races in the past
> 
> "/multifd/tcp/plain/cancel":
>  The MVP of catching concurrency issues
> 
> "/multifd/tcp/uri/plain/none":
>  Smoke test for multifd
> 
> All in all, these will test basic funcionality and run very often. The
> more tests we add to this set, the less return we get in relation to the
> time they take.

This is true.  We can try to discuss more on which is more important; I
still think something might be good to be added on top of above.

There's also the other way - at some point, I still want to improve
migration-test run speed, and please have a look if you like too at some
point: so there's still chance (average is ~2sec as of now), IMHO, we don't
lose anything in CI but runs simply faster.

> 
> >
> > IIUC some file ops can still be relevant to archs.  Multifd still has one
> > bug that can only reproduce on arm64.. but not x86_64.  I remember it's a
> > race condition when migration finishes, and the issue could be memory
> > ordering relevant, but maybe not.
> 
> I'm not aware of anything. I believe the last arm64 bug we had was the
> threadinfo stuff[1]. If you remember what it's about, let me know.
> 
> 1- 01ec0f3a92 ("migration/multifd: Protect accesses to migration_threads").

https://issues.redhat.com/browse/RHEL-45628

On RH side Bandan is looking at it, but he's during a long holidays
recently.

> 
> >
> >> 
> >> and x86 gets extra:
> >> 
> >> "/precopy/unix/suspend/live"
> >> "/precopy/unix/suspend/notlive"
> >> "/dirty_ring"
> >
> > dirty ring will be disabled anyway when !x86, so probably not a major
> > concern.
> >
> >> 
> >> (the other dirty_* tests are too slow)
> >
> > These are the 10 slowest tests when I run locally:
> >
> > /x86_64/migration/multifd/tcp/tls/x509/allow-anon-client 2.41
> > /x86_64/migration/postcopy/recovery/plain 2.43
> > /x86_64/migration/multifd/tcp/tls/x509/default-host 2.66
> > /x86_64/migration/multifd/tcp/tls/x509/override-host 2.86
> > /x86_64/migration/postcopy/tls/psk 2.91
> > /x86_64/migration/postcopy/preempt/recovery/tls/psk 3.08
> > /x86_64/migration/postcopy/preempt/tls/psk 3.30
> > /x86_64/migration/postcopy/recovery/tls/psk 3.81
> > /x86_64/migration/vcpu_dirty_limit 13.29
> > /x86_64/migration/precopy/unix/xbzrle 27.55
> >
> > Are you aware of people using xbzrle at all?
> 
> Nope.
> 
> >
> >> 
> >> All the rest go behind a knob that people touching migration code will
> >> enable.
> >> 
> >> wdyt?
> >
> > Agree with the general idea, but I worry above exact list can be too small.
> 
> We won't stop running the full set of tests. We can run them in our
> forks' CI as much as we want. There are no cases of people reporting a
> migration bug because their 'make check' caught something that ours
> didn't.

IIUC it's hard to say - when the test is in CI maintainers can catch them
already before sending a formal PR.

If the test is run by default in make check, a developer can trigger a
migration issue (with his/her patch applied) then one can notice it
introduced a bug, fix it, then post the patches.  We won't know whether
that happened.

So one thing we can do (if you think worthwhile to do it now) is we shrink
the default test case a lot as you proposed, then we wait and see what
breaks, and then we gradually add tests back when it can be used to find
breakages.  But that'll also take time if it really can find such tests,
because then we'll need to debug them one by one (instead of developer /
maintainer looking into them with their expertise knowledge..).  I'm not
sure whether it's worthwhile to do it now, but I don't feel strongly if we
can still have a reliable set of default test cases.

> 
> Besides, the main strength of CI is to catch bugs when someone makes a
> code change. If people touch migration code, then we'll run it in our CI
> anyway. If they touch device code and that device is migrated by default
> then any one of the simple tests will catch the issue when it runs via
> the migration-compat job. If the device is not enabled by default, then
> no tests will catch it.
> 
> The worst case scenario is they touch some code completely unrelated and
> their 'make check' or CI run breaks because of some race in the
> migration code. That's not what CI is for, that's just an annoyance for
> everyone. I'd rather it breaks in our hands and then we'll go fix it.
> 
> >
> > IMHO we can definitely, at least, move the last two into slow list
> > (vcpu_dirty_limit and xbzrle), then it'll already save us 40sec each run..
> 
> Agreed. I'll send a patch once I get out from under downstream stuff.
> 
> >
> >> 
> >> 1- allows adding devices to QEMU cmdline for migration-test
> >> https://lore.kernel.org/r/20240523201922.28007-4-farosas@suse.de
> >> 
>
Fabiano Rosas Sept. 11, 2024, 7:48 p.m. UTC | #8
Peter Xu <peterx@redhat.com> writes:

> On Tue, Sep 10, 2024 at 07:23:43PM -0300, Fabiano Rosas wrote:
>> Peter Xu <peterx@redhat.com> writes:
>> 
>> > On Mon, Sep 09, 2024 at 06:54:46PM -0300, Fabiano Rosas wrote:
>> >> Peter Xu <peterx@redhat.com> writes:
>> >> 
>> >> > On Mon, Sep 09, 2024 at 03:02:57PM +0100, Peter Maydell wrote:
>> >> >> On Mon, 9 Sept 2024 at 14:51, Hyman Huang <yong.huang@smartx.com> wrote:
>> >> >> >
>> >> >> > Despite the fact that the responsive CPU throttle is enabled,
>> >> >> > the dirty sync count may not always increase because this is
>> >> >> > an optimization that might not happen in any situation.
>> >> >> >
>> >> >> > This test case just making sure it doesn't interfere with any
>> >> >> > current functionality.
>> >> >> >
>> >> >> > Signed-off-by: Hyman Huang <yong.huang@smartx.com>
>> >> >> 
>> >> >> tests/qtest/migration-test already runs 75 different
>> >> >> subtests, takes up a massive chunk of our "make check"
>> >> >> time, and is very commonly a "times out" test on some
>> >> >> of our CI jobs. It runs on five different guest CPU
>> >> >> architectures, each one of which takes between 2 and
>> >> >> 5 minutes to complete the full migration-test.
>> >> >> 
>> >> >> Do we really need to make it even bigger?
>> >> >
>> >> > I'll try to find some time in the next few weeks looking into this to see
>> >> > whether we can further shrink migration test times after previous attemps
>> >> > from Dan.  At least a low hanging fruit is we should indeed put some more
>> >> > tests into g_test_slow(), and this new test could also be a candidate (then
>> >> > we can run "-m slow" for migration PRs only).
>> >> 
>> >> I think we could (using -m slow or any other method) separate tests
>> >> that are generic enough that every CI run should benefit from them
>> >> vs. tests that are only useful once someone starts touching migration
>> >> code. I'd say very few in the former category and most of them in the
>> >> latter.
>> >> 
>> >> For an idea of where migration bugs lie, I took a look at what was
>> >> fixed since 2022:
>> >> 
>> >> # bugs | device/subsystem/arch
>> >> ----------------------------------
>> >>     54 | migration
>> >>     10 | vfio
>> >>      6 | ppc
>> >>      3 | virtio-gpu
>> >>      2 | pcie_sriov, tpm_emulator,
>> >>           vdpa, virtio-rng-pci
>> >>      1 | arm, block, gpio, lasi,
>> >>           pci, s390, scsi-disk,
>> >>           virtio-mem, TCG
>> >
>> > Just curious; how did you collect these?
>> 
>> git log --since=2022 and then squinted at it. I wrote a warning to take
>> this with a grain of salt, but it missed the final version.
>> 
>> >
>> >> 
>> >> From these, ignoring the migration bugs, the migration-tests cover some
>> >> of: arm, ppc, s390, TCG. The device_opts[1] patch hasn't merged yet, but
>> >> once it is, then virtio-gpu would be covered and we could investigate
>> >> adding some of the others.
>> >> 
>> >> For actual migration code issues:
>> >> 
>> >> # bugs | (sub)subsystem | kind
>> >> ----------------------------------------------
>> >>     13 | multifd        | correctness/races
>> >>      8 | ram            | correctness
>> >>      8 | rdma:          | general programming
>> >
>> > 8 rdma bugs??? ouch..
>> 
>> Mostly caught by a cleanup from Markus. Silly stuff like of mixed signed
>> integer comparisons and bugs in error handling. I don't even want to
>> look too much at it.
>> 
>> ...hopefully this release we'll manage to resolve that situation.
>> 
>> >
>> >>      7 | qmp            | new api bugs
>> >>      5 | postcopy       | races
>> >>      4 | file:          | leaks
>> >>      3 | return path    | races
>> >>      3 | fd_cleanup     | races
>> >>      2 | savevm, aio/coroutines
>> >>      1 | xbzrle, colo, dirtyrate, exec:,
>> >>           windows, iochannel, qemufile,
>> >>           arch (ppc64le)
>> >> 
>> >> Here, the migration-tests cover well: multifd, ram, qmp, postcopy,
>> >> file, rp, fd_cleanup, iochannel, qemufile, xbzrle.
>> >> 
>> >> My suggestion is we run per arch:
>> >> 
>> >> "/precopy/tcp/plain"
>> >> "/precopy/tcp/tls/psk/match",
>> >> "/postcopy/plain"
>> >> "/postcopy/preempt/plain"
>> >> "/postcopy/preempt/recovery/plain"
>> >> "/multifd/tcp/plain/cancel"
>> >> "/multifd/tcp/uri/plain/none"
>> >
>> > Don't you want to still keep a few multifd / file tests?
>> 
>> Not really, but I won't object if you want to add some more in there. To
>> be honest, I want to get out of people's way as much as I can because
>> having to revisit this every couple of months is stressful to me.
>
> I hope migration tests are not too obstructive yet so far :) - this
> discussion is still within a context where we might add one more slow test
> in CI, and we probably simply should make it a -m slow test.
>
>> 
>> My rationale for those is:
>> 
>> "/precopy/tcp/plain":
>>  Smoke test, the most common migration
>
> E.g. unix is missing, and I'm not sure which we use for e.g. kubevirt.
>
> And note that even if unix shares the socket iochannel with tcp, it may not
> work the same.  For example, if you remember I mentioned I was looking at
> threadify the dest qemu receive side, i have a draft there but currently it
> has a bug to hang a unix migration, not tcp..

Ok, let's add a unix one, no problem.

>
> So IMHO it's not easy to justify which we can simply drop, if we still want
> to provide some good gating in CI.

It's not exactly about dropping, it's about which ones need to be run at
*every* invocation of make check (x5 because of the multiple archs) and
at every git branch push in CI (unless -o ci.skip). For gating we don't
need all the tests. Many of them are testing the same core code with
just a few differences at the margins.

>  And I won't be 100% surprised if some
> other non-migration PR (e.g. io/) changed some slight bit that unix is
> broken and tcp keeps working, for example, then we loose some CI benefits.

IMO, these non-migration PRs are exactly what we have to worry
about. Because migration PRs would run with -m slow and we'd catch the
issue there.

>
>> 
>> "/precopy/tcp/tls/psk/match":
>>  Something might change in the distro regarding tls. Such as:
>>  https://gitlab.com/qemu-project/qemu/-/issues/1937
>> 
>> "/postcopy/plain":
>>  Smoke test for postcopy
>> 
>> "/postcopy/preempt/plain":
>>  Just to exercise the preempt thread
>> 
>> "/postcopy/preempt/recovery/plain":
>>  Recovery has had some issues with races in the past
>> 
>> "/multifd/tcp/plain/cancel":
>>  The MVP of catching concurrency issues
>> 
>> "/multifd/tcp/uri/plain/none":
>>  Smoke test for multifd
>> 
>> All in all, these will test basic funcionality and run very often. The
>> more tests we add to this set, the less return we get in relation to the
>> time they take.
>
> This is true.  We can try to discuss more on which is more important; I
> still think something might be good to be added on top of above.

Sure, just add what you think we need.

>
> There's also the other way - at some point, I still want to improve
> migration-test run speed, and please have a look if you like too at some
> point: so there's still chance (average is ~2sec as of now), IMHO, we don't
> lose anything in CI but runs simply faster.

My only idea, but it requires a bit of work, is to do unit testing on
the interfaces. Anything before migration_fd_connect(). Then we could
stop doing a full migration for those.

>
>> 
>> >
>> > IIUC some file ops can still be relevant to archs.  Multifd still has one
>> > bug that can only reproduce on arm64.. but not x86_64.  I remember it's a
>> > race condition when migration finishes, and the issue could be memory
>> > ordering relevant, but maybe not.
>> 
>> I'm not aware of anything. I believe the last arm64 bug we had was the
>> threadinfo stuff[1]. If you remember what it's about, let me know.
>> 
>> 1- 01ec0f3a92 ("migration/multifd: Protect accesses to migration_threads").
>
> https://issues.redhat.com/browse/RHEL-45628

Interesting. But if multifd/tcp/plain/cancel doesn't catch this I don't
think other tests will. Also, we don't have tests for zerocopy AFAIK.

>
> On RH side Bandan is looking at it, but he's during a long holidays
> recently.

Good luck to him.

>
>> 
>> >
>> >> 
>> >> and x86 gets extra:
>> >> 
>> >> "/precopy/unix/suspend/live"
>> >> "/precopy/unix/suspend/notlive"
>> >> "/dirty_ring"
>> >
>> > dirty ring will be disabled anyway when !x86, so probably not a major
>> > concern.
>> >
>> >> 
>> >> (the other dirty_* tests are too slow)
>> >
>> > These are the 10 slowest tests when I run locally:
>> >
>> > /x86_64/migration/multifd/tcp/tls/x509/allow-anon-client 2.41
>> > /x86_64/migration/postcopy/recovery/plain 2.43
>> > /x86_64/migration/multifd/tcp/tls/x509/default-host 2.66
>> > /x86_64/migration/multifd/tcp/tls/x509/override-host 2.86
>> > /x86_64/migration/postcopy/tls/psk 2.91
>> > /x86_64/migration/postcopy/preempt/recovery/tls/psk 3.08
>> > /x86_64/migration/postcopy/preempt/tls/psk 3.30
>> > /x86_64/migration/postcopy/recovery/tls/psk 3.81
>> > /x86_64/migration/vcpu_dirty_limit 13.29
>> > /x86_64/migration/precopy/unix/xbzrle 27.55
>> >
>> > Are you aware of people using xbzrle at all?
>> 
>> Nope.
>> 
>> >
>> >> 
>> >> All the rest go behind a knob that people touching migration code will
>> >> enable.
>> >> 
>> >> wdyt?
>> >
>> > Agree with the general idea, but I worry above exact list can be too small.
>> 
>> We won't stop running the full set of tests. We can run them in our
>> forks' CI as much as we want. There are no cases of people reporting a
>> migration bug because their 'make check' caught something that ours
>> didn't.
>
> IIUC it's hard to say - when the test is in CI maintainers can catch them
> already before sending a formal PR.
>
> If the test is run by default in make check, a developer can trigger a
> migration issue (with his/her patch applied) then one can notice it
> introduced a bug, fix it, then post the patches.  We won't know whether
> that happened.

Good point.

>
> So one thing we can do (if you think worthwhile to do it now) is we shrink
> the default test case a lot as you proposed, then we wait and see what
> breaks, and then we gradually add tests back when it can be used to find
> breakages.  But that'll also take time if it really can find such tests,
> because then we'll need to debug them one by one (instead of developer /
> maintainer looking into them with their expertise knowledge..).  I'm not
> sure whether it's worthwhile to do it now, but I don't feel strongly if we
> can still have a reliable set of default test cases.

We first need a way to enable them for the migration CI. Do we have a
variable that CI understands that can be used to enable slow tests?

>
>> 
>> Besides, the main strength of CI is to catch bugs when someone makes a
>> code change. If people touch migration code, then we'll run it in our CI
>> anyway. If they touch device code and that device is migrated by default
>> then any one of the simple tests will catch the issue when it runs via
>> the migration-compat job. If the device is not enabled by default, then
>> no tests will catch it.
>> 
>> The worst case scenario is they touch some code completely unrelated and
>> their 'make check' or CI run breaks because of some race in the
>> migration code. That's not what CI is for, that's just an annoyance for
>> everyone. I'd rather it breaks in our hands and then we'll go fix it.
>> 
>> >
>> > IMHO we can definitely, at least, move the last two into slow list
>> > (vcpu_dirty_limit and xbzrle), then it'll already save us 40sec each run..
>> 
>> Agreed. I'll send a patch once I get out from under downstream stuff.
>> 
>> >
>> >> 
>> >> 1- allows adding devices to QEMU cmdline for migration-test
>> >> https://lore.kernel.org/r/20240523201922.28007-4-farosas@suse.de
>> >> 
>>
Peter Xu Sept. 11, 2024, 8:37 p.m. UTC | #9
On Wed, Sep 11, 2024 at 04:48:21PM -0300, Fabiano Rosas wrote:
> Peter Xu <peterx@redhat.com> writes:
> 
> > On Tue, Sep 10, 2024 at 07:23:43PM -0300, Fabiano Rosas wrote:
> >> Peter Xu <peterx@redhat.com> writes:
> >> 
> >> > On Mon, Sep 09, 2024 at 06:54:46PM -0300, Fabiano Rosas wrote:
> >> >> Peter Xu <peterx@redhat.com> writes:
> >> >> 
> >> >> > On Mon, Sep 09, 2024 at 03:02:57PM +0100, Peter Maydell wrote:
> >> >> >> On Mon, 9 Sept 2024 at 14:51, Hyman Huang <yong.huang@smartx.com> wrote:
> >> >> >> >
> >> >> >> > Despite the fact that the responsive CPU throttle is enabled,
> >> >> >> > the dirty sync count may not always increase because this is
> >> >> >> > an optimization that might not happen in any situation.
> >> >> >> >
> >> >> >> > This test case just making sure it doesn't interfere with any
> >> >> >> > current functionality.
> >> >> >> >
> >> >> >> > Signed-off-by: Hyman Huang <yong.huang@smartx.com>
> >> >> >> 
> >> >> >> tests/qtest/migration-test already runs 75 different
> >> >> >> subtests, takes up a massive chunk of our "make check"
> >> >> >> time, and is very commonly a "times out" test on some
> >> >> >> of our CI jobs. It runs on five different guest CPU
> >> >> >> architectures, each one of which takes between 2 and
> >> >> >> 5 minutes to complete the full migration-test.
> >> >> >> 
> >> >> >> Do we really need to make it even bigger?
> >> >> >
> >> >> > I'll try to find some time in the next few weeks looking into this to see
> >> >> > whether we can further shrink migration test times after previous attemps
> >> >> > from Dan.  At least a low hanging fruit is we should indeed put some more
> >> >> > tests into g_test_slow(), and this new test could also be a candidate (then
> >> >> > we can run "-m slow" for migration PRs only).
> >> >> 
> >> >> I think we could (using -m slow or any other method) separate tests
> >> >> that are generic enough that every CI run should benefit from them
> >> >> vs. tests that are only useful once someone starts touching migration
> >> >> code. I'd say very few in the former category and most of them in the
> >> >> latter.
> >> >> 
> >> >> For an idea of where migration bugs lie, I took a look at what was
> >> >> fixed since 2022:
> >> >> 
> >> >> # bugs | device/subsystem/arch
> >> >> ----------------------------------
> >> >>     54 | migration
> >> >>     10 | vfio
> >> >>      6 | ppc
> >> >>      3 | virtio-gpu
> >> >>      2 | pcie_sriov, tpm_emulator,
> >> >>           vdpa, virtio-rng-pci
> >> >>      1 | arm, block, gpio, lasi,
> >> >>           pci, s390, scsi-disk,
> >> >>           virtio-mem, TCG
> >> >
> >> > Just curious; how did you collect these?
> >> 
> >> git log --since=2022 and then squinted at it. I wrote a warning to take
> >> this with a grain of salt, but it missed the final version.
> >> 
> >> >
> >> >> 
> >> >> From these, ignoring the migration bugs, the migration-tests cover some
> >> >> of: arm, ppc, s390, TCG. The device_opts[1] patch hasn't merged yet, but
> >> >> once it is, then virtio-gpu would be covered and we could investigate
> >> >> adding some of the others.
> >> >> 
> >> >> For actual migration code issues:
> >> >> 
> >> >> # bugs | (sub)subsystem | kind
> >> >> ----------------------------------------------
> >> >>     13 | multifd        | correctness/races
> >> >>      8 | ram            | correctness
> >> >>      8 | rdma:          | general programming
> >> >
> >> > 8 rdma bugs??? ouch..
> >> 
> >> Mostly caught by a cleanup from Markus. Silly stuff like of mixed signed
> >> integer comparisons and bugs in error handling. I don't even want to
> >> look too much at it.
> >> 
> >> ...hopefully this release we'll manage to resolve that situation.
> >> 
> >> >
> >> >>      7 | qmp            | new api bugs
> >> >>      5 | postcopy       | races
> >> >>      4 | file:          | leaks
> >> >>      3 | return path    | races
> >> >>      3 | fd_cleanup     | races
> >> >>      2 | savevm, aio/coroutines
> >> >>      1 | xbzrle, colo, dirtyrate, exec:,
> >> >>           windows, iochannel, qemufile,
> >> >>           arch (ppc64le)
> >> >> 
> >> >> Here, the migration-tests cover well: multifd, ram, qmp, postcopy,
> >> >> file, rp, fd_cleanup, iochannel, qemufile, xbzrle.
> >> >> 
> >> >> My suggestion is we run per arch:
> >> >> 
> >> >> "/precopy/tcp/plain"
> >> >> "/precopy/tcp/tls/psk/match",
> >> >> "/postcopy/plain"
> >> >> "/postcopy/preempt/plain"
> >> >> "/postcopy/preempt/recovery/plain"
> >> >> "/multifd/tcp/plain/cancel"
> >> >> "/multifd/tcp/uri/plain/none"
> >> >
> >> > Don't you want to still keep a few multifd / file tests?
> >> 
> >> Not really, but I won't object if you want to add some more in there. To
> >> be honest, I want to get out of people's way as much as I can because
> >> having to revisit this every couple of months is stressful to me.
> >
> > I hope migration tests are not too obstructive yet so far :) - this
> > discussion is still within a context where we might add one more slow test
> > in CI, and we probably simply should make it a -m slow test.
> >
> >> 
> >> My rationale for those is:
> >> 
> >> "/precopy/tcp/plain":
> >>  Smoke test, the most common migration
> >
> > E.g. unix is missing, and I'm not sure which we use for e.g. kubevirt.
> >
> > And note that even if unix shares the socket iochannel with tcp, it may not
> > work the same.  For example, if you remember I mentioned I was looking at
> > threadify the dest qemu receive side, i have a draft there but currently it
> > has a bug to hang a unix migration, not tcp..
> 
> Ok, let's add a unix one, no problem.
> 
> >
> > So IMHO it's not easy to justify which we can simply drop, if we still want
> > to provide some good gating in CI.
> 
> It's not exactly about dropping, it's about which ones need to be run at
> *every* invocation of make check (x5 because of the multiple archs) and

Yeah, indeed we could consider reducing the number of runs, maybe.  However
I do remember we used to have migration-test bugs only reproduced with some
specific distro, well..

Now I'm looking at the pipelines..

Why 5x, btw?  I saw alphine, centos, debian, fedora, opensuse, ubuntu,
cfi-x86_64, then compat-x86_64.  So that's 8x?  They're for the same arch
(amd64) so far.

Maybe we can skip some indeed.

> at every git branch push in CI (unless -o ci.skip). For gating we don't
> need all the tests. Many of them are testing the same core code with
> just a few differences at the margins.
> 
> >  And I won't be 100% surprised if some
> > other non-migration PR (e.g. io/) changed some slight bit that unix is
> > broken and tcp keeps working, for example, then we loose some CI benefits.
> 
> IMO, these non-migration PRs are exactly what we have to worry
> about. Because migration PRs would run with -m slow and we'd catch the
> issue there.
> 
> >
> >> 
> >> "/precopy/tcp/tls/psk/match":
> >>  Something might change in the distro regarding tls. Such as:
> >>  https://gitlab.com/qemu-project/qemu/-/issues/1937
> >> 
> >> "/postcopy/plain":
> >>  Smoke test for postcopy
> >> 
> >> "/postcopy/preempt/plain":
> >>  Just to exercise the preempt thread
> >> 
> >> "/postcopy/preempt/recovery/plain":
> >>  Recovery has had some issues with races in the past
> >> 
> >> "/multifd/tcp/plain/cancel":
> >>  The MVP of catching concurrency issues
> >> 
> >> "/multifd/tcp/uri/plain/none":
> >>  Smoke test for multifd
> >> 
> >> All in all, these will test basic funcionality and run very often. The
> >> more tests we add to this set, the less return we get in relation to the
> >> time they take.
> >
> > This is true.  We can try to discuss more on which is more important; I
> > still think something might be good to be added on top of above.
> 
> Sure, just add what you think we need.

Let's leave the detailed discussion on what to choose until the patch
phase.  IIUC this can be the last we do.

We've just queued your other patch to "slow down" the two time consuming
tests, we save 40s for each CI kick (total ~90s now locally, so that's 30%
cut already all acrosss!), not so bad as a start.

Then if we can remove some out of 8 (e.g., we can choose 2-3), then if it's
3 system runs, it'll be another 62% cut, a total of:

  1.0 - 9/13 * 3/8 = ~75%

So if we keep removing 5 system running it we can reduce it to only a
quarter time comparing to before for each CI run.

Would that be good enough already for us to live for quite a few more
releases?  They're so far all low hanging fruits.

If we want to move further (and you think we should then start with
reducing test case, rather than profiling the test cases), then feel free
to go ahead and send a patch when you're ready.  But maybe you don't want
to bother after you notice that we can already shrink 75% of runtime even
without dropping anything.  Your call!

> 
> >
> > There's also the other way - at some point, I still want to improve
> > migration-test run speed, and please have a look if you like too at some
> > point: so there's still chance (average is ~2sec as of now), IMHO, we don't
> > lose anything in CI but runs simply faster.
> 
> My only idea, but it requires a bit of work, is to do unit testing on
> the interfaces. Anything before migration_fd_connect(). Then we could
> stop doing a full migration for those.

What I'm thinking is some further profiling, like what Dan used to do.  I
feel like we still have chance, considering what we run as guest image is
simply a loop.. so basically zero boot time logically.

> 
> >
> >> 
> >> >
> >> > IIUC some file ops can still be relevant to archs.  Multifd still has one
> >> > bug that can only reproduce on arm64.. but not x86_64.  I remember it's a
> >> > race condition when migration finishes, and the issue could be memory
> >> > ordering relevant, but maybe not.
> >> 
> >> I'm not aware of anything. I believe the last arm64 bug we had was the
> >> threadinfo stuff[1]. If you remember what it's about, let me know.
> >> 
> >> 1- 01ec0f3a92 ("migration/multifd: Protect accesses to migration_threads").
> >
> > https://issues.redhat.com/browse/RHEL-45628
> 
> Interesting. But if multifd/tcp/plain/cancel doesn't catch this I don't
> think other tests will. Also, we don't have tests for zerocopy AFAIK.

That'll be challenging as it requires locked_vm limit.

> 
> >
> > On RH side Bandan is looking at it, but he's during a long holidays
> > recently.
> 
> Good luck to him.
> 
> >
> >> 
> >> >
> >> >> 
> >> >> and x86 gets extra:
> >> >> 
> >> >> "/precopy/unix/suspend/live"
> >> >> "/precopy/unix/suspend/notlive"
> >> >> "/dirty_ring"
> >> >
> >> > dirty ring will be disabled anyway when !x86, so probably not a major
> >> > concern.
> >> >
> >> >> 
> >> >> (the other dirty_* tests are too slow)
> >> >
> >> > These are the 10 slowest tests when I run locally:
> >> >
> >> > /x86_64/migration/multifd/tcp/tls/x509/allow-anon-client 2.41
> >> > /x86_64/migration/postcopy/recovery/plain 2.43
> >> > /x86_64/migration/multifd/tcp/tls/x509/default-host 2.66
> >> > /x86_64/migration/multifd/tcp/tls/x509/override-host 2.86
> >> > /x86_64/migration/postcopy/tls/psk 2.91
> >> > /x86_64/migration/postcopy/preempt/recovery/tls/psk 3.08
> >> > /x86_64/migration/postcopy/preempt/tls/psk 3.30
> >> > /x86_64/migration/postcopy/recovery/tls/psk 3.81
> >> > /x86_64/migration/vcpu_dirty_limit 13.29
> >> > /x86_64/migration/precopy/unix/xbzrle 27.55
> >> >
> >> > Are you aware of people using xbzrle at all?
> >> 
> >> Nope.
> >> 
> >> >
> >> >> 
> >> >> All the rest go behind a knob that people touching migration code will
> >> >> enable.
> >> >> 
> >> >> wdyt?
> >> >
> >> > Agree with the general idea, but I worry above exact list can be too small.
> >> 
> >> We won't stop running the full set of tests. We can run them in our
> >> forks' CI as much as we want. There are no cases of people reporting a
> >> migration bug because their 'make check' caught something that ours
> >> didn't.
> >
> > IIUC it's hard to say - when the test is in CI maintainers can catch them
> > already before sending a formal PR.
> >
> > If the test is run by default in make check, a developer can trigger a
> > migration issue (with his/her patch applied) then one can notice it
> > introduced a bug, fix it, then post the patches.  We won't know whether
> > that happened.
> 
> Good point.
> 
> >
> > So one thing we can do (if you think worthwhile to do it now) is we shrink
> > the default test case a lot as you proposed, then we wait and see what
> > breaks, and then we gradually add tests back when it can be used to find
> > breakages.  But that'll also take time if it really can find such tests,
> > because then we'll need to debug them one by one (instead of developer /
> > maintainer looking into them with their expertise knowledge..).  I'm not
> > sure whether it's worthwhile to do it now, but I don't feel strongly if we
> > can still have a reliable set of default test cases.
> 
> We first need a way to enable them for the migration CI. Do we have a
> variable that CI understands that can be used to enable slow tests?

I'm not aware of.  Yes sounds like we should have one if it doesn't exist.

> 
> >
> >> 
> >> Besides, the main strength of CI is to catch bugs when someone makes a
> >> code change. If people touch migration code, then we'll run it in our CI
> >> anyway. If they touch device code and that device is migrated by default
> >> then any one of the simple tests will catch the issue when it runs via
> >> the migration-compat job. If the device is not enabled by default, then
> >> no tests will catch it.
> >> 
> >> The worst case scenario is they touch some code completely unrelated and
> >> their 'make check' or CI run breaks because of some race in the
> >> migration code. That's not what CI is for, that's just an annoyance for
> >> everyone. I'd rather it breaks in our hands and then we'll go fix it.
> >> 
> >> >
> >> > IMHO we can definitely, at least, move the last two into slow list
> >> > (vcpu_dirty_limit and xbzrle), then it'll already save us 40sec each run..
> >> 
> >> Agreed. I'll send a patch once I get out from under downstream stuff.
> >> 
> >> >
> >> >> 
> >> >> 1- allows adding devices to QEMU cmdline for migration-test
> >> >> https://lore.kernel.org/r/20240523201922.28007-4-farosas@suse.de
> >> >> 
> >> 
>
Fabiano Rosas Sept. 11, 2024, 9:26 p.m. UTC | #10
Peter Xu <peterx@redhat.com> writes:

> On Wed, Sep 11, 2024 at 04:48:21PM -0300, Fabiano Rosas wrote:
>> Peter Xu <peterx@redhat.com> writes:
>> 
>> > On Tue, Sep 10, 2024 at 07:23:43PM -0300, Fabiano Rosas wrote:
>> >> Peter Xu <peterx@redhat.com> writes:
>> >> 
>> >> > On Mon, Sep 09, 2024 at 06:54:46PM -0300, Fabiano Rosas wrote:
>> >> >> Peter Xu <peterx@redhat.com> writes:
>> >> >> 
>> >> >> > On Mon, Sep 09, 2024 at 03:02:57PM +0100, Peter Maydell wrote:
>> >> >> >> On Mon, 9 Sept 2024 at 14:51, Hyman Huang <yong.huang@smartx.com> wrote:
>> >> >> >> >
>> >> >> >> > Despite the fact that the responsive CPU throttle is enabled,
>> >> >> >> > the dirty sync count may not always increase because this is
>> >> >> >> > an optimization that might not happen in any situation.
>> >> >> >> >
>> >> >> >> > This test case just making sure it doesn't interfere with any
>> >> >> >> > current functionality.
>> >> >> >> >
>> >> >> >> > Signed-off-by: Hyman Huang <yong.huang@smartx.com>
>> >> >> >> 
>> >> >> >> tests/qtest/migration-test already runs 75 different
>> >> >> >> subtests, takes up a massive chunk of our "make check"
>> >> >> >> time, and is very commonly a "times out" test on some
>> >> >> >> of our CI jobs. It runs on five different guest CPU
>> >> >> >> architectures, each one of which takes between 2 and
>> >> >> >> 5 minutes to complete the full migration-test.
>> >> >> >> 
>> >> >> >> Do we really need to make it even bigger?
>> >> >> >
>> >> >> > I'll try to find some time in the next few weeks looking into this to see
>> >> >> > whether we can further shrink migration test times after previous attemps
>> >> >> > from Dan.  At least a low hanging fruit is we should indeed put some more
>> >> >> > tests into g_test_slow(), and this new test could also be a candidate (then
>> >> >> > we can run "-m slow" for migration PRs only).
>> >> >> 
>> >> >> I think we could (using -m slow or any other method) separate tests
>> >> >> that are generic enough that every CI run should benefit from them
>> >> >> vs. tests that are only useful once someone starts touching migration
>> >> >> code. I'd say very few in the former category and most of them in the
>> >> >> latter.
>> >> >> 
>> >> >> For an idea of where migration bugs lie, I took a look at what was
>> >> >> fixed since 2022:
>> >> >> 
>> >> >> # bugs | device/subsystem/arch
>> >> >> ----------------------------------
>> >> >>     54 | migration
>> >> >>     10 | vfio
>> >> >>      6 | ppc
>> >> >>      3 | virtio-gpu
>> >> >>      2 | pcie_sriov, tpm_emulator,
>> >> >>           vdpa, virtio-rng-pci
>> >> >>      1 | arm, block, gpio, lasi,
>> >> >>           pci, s390, scsi-disk,
>> >> >>           virtio-mem, TCG
>> >> >
>> >> > Just curious; how did you collect these?
>> >> 
>> >> git log --since=2022 and then squinted at it. I wrote a warning to take
>> >> this with a grain of salt, but it missed the final version.
>> >> 
>> >> >
>> >> >> 
>> >> >> From these, ignoring the migration bugs, the migration-tests cover some
>> >> >> of: arm, ppc, s390, TCG. The device_opts[1] patch hasn't merged yet, but
>> >> >> once it is, then virtio-gpu would be covered and we could investigate
>> >> >> adding some of the others.
>> >> >> 
>> >> >> For actual migration code issues:
>> >> >> 
>> >> >> # bugs | (sub)subsystem | kind
>> >> >> ----------------------------------------------
>> >> >>     13 | multifd        | correctness/races
>> >> >>      8 | ram            | correctness
>> >> >>      8 | rdma:          | general programming
>> >> >
>> >> > 8 rdma bugs??? ouch..
>> >> 
>> >> Mostly caught by a cleanup from Markus. Silly stuff like of mixed signed
>> >> integer comparisons and bugs in error handling. I don't even want to
>> >> look too much at it.
>> >> 
>> >> ...hopefully this release we'll manage to resolve that situation.
>> >> 
>> >> >
>> >> >>      7 | qmp            | new api bugs
>> >> >>      5 | postcopy       | races
>> >> >>      4 | file:          | leaks
>> >> >>      3 | return path    | races
>> >> >>      3 | fd_cleanup     | races
>> >> >>      2 | savevm, aio/coroutines
>> >> >>      1 | xbzrle, colo, dirtyrate, exec:,
>> >> >>           windows, iochannel, qemufile,
>> >> >>           arch (ppc64le)
>> >> >> 
>> >> >> Here, the migration-tests cover well: multifd, ram, qmp, postcopy,
>> >> >> file, rp, fd_cleanup, iochannel, qemufile, xbzrle.
>> >> >> 
>> >> >> My suggestion is we run per arch:
>> >> >> 
>> >> >> "/precopy/tcp/plain"
>> >> >> "/precopy/tcp/tls/psk/match",
>> >> >> "/postcopy/plain"
>> >> >> "/postcopy/preempt/plain"
>> >> >> "/postcopy/preempt/recovery/plain"
>> >> >> "/multifd/tcp/plain/cancel"
>> >> >> "/multifd/tcp/uri/plain/none"
>> >> >
>> >> > Don't you want to still keep a few multifd / file tests?
>> >> 
>> >> Not really, but I won't object if you want to add some more in there. To
>> >> be honest, I want to get out of people's way as much as I can because
>> >> having to revisit this every couple of months is stressful to me.
>> >
>> > I hope migration tests are not too obstructive yet so far :) - this
>> > discussion is still within a context where we might add one more slow test
>> > in CI, and we probably simply should make it a -m slow test.
>> >
>> >> 
>> >> My rationale for those is:
>> >> 
>> >> "/precopy/tcp/plain":
>> >>  Smoke test, the most common migration
>> >
>> > E.g. unix is missing, and I'm not sure which we use for e.g. kubevirt.
>> >
>> > And note that even if unix shares the socket iochannel with tcp, it may not
>> > work the same.  For example, if you remember I mentioned I was looking at
>> > threadify the dest qemu receive side, i have a draft there but currently it
>> > has a bug to hang a unix migration, not tcp..
>> 
>> Ok, let's add a unix one, no problem.
>> 
>> >
>> > So IMHO it's not easy to justify which we can simply drop, if we still want
>> > to provide some good gating in CI.
>> 
>> It's not exactly about dropping, it's about which ones need to be run at
>> *every* invocation of make check (x5 because of the multiple archs) and
>
> Yeah, indeed we could consider reducing the number of runs, maybe.  However
> I do remember we used to have migration-test bugs only reproduced with some
> specific distro, well..
>
> Now I'm looking at the pipelines..
>
> Why 5x, btw?

5x because migration-test supports 5 architectures (i386, x86_64, ppc64,
aarch64, s390x), so if your build includes all of those, make check will
run migration-test five times.

>  I saw alphine, centos, debian, fedora, opensuse, ubuntu,
> cfi-x86_64, then compat-x86_64.  So that's 8x?  They're for the same arch
> (amd64) so far.

I agree we should fine tune this in CI, but this is a different
discussion. We can't currently skip migration-test without skipping make
check-qtest. I proposed a while ago to have a separate make
check-migration, but people didn't like that, maybe we should revisit
the idea.

>
> Maybe we can skip some indeed.
>
>> at every git branch push in CI (unless -o ci.skip). For gating we don't
>> need all the tests. Many of them are testing the same core code with
>> just a few differences at the margins.
>> 
>> >  And I won't be 100% surprised if some
>> > other non-migration PR (e.g. io/) changed some slight bit that unix is
>> > broken and tcp keeps working, for example, then we loose some CI benefits.
>> 
>> IMO, these non-migration PRs are exactly what we have to worry
>> about. Because migration PRs would run with -m slow and we'd catch the
>> issue there.
>> 
>> >
>> >> 
>> >> "/precopy/tcp/tls/psk/match":
>> >>  Something might change in the distro regarding tls. Such as:
>> >>  https://gitlab.com/qemu-project/qemu/-/issues/1937
>> >> 
>> >> "/postcopy/plain":
>> >>  Smoke test for postcopy
>> >> 
>> >> "/postcopy/preempt/plain":
>> >>  Just to exercise the preempt thread
>> >> 
>> >> "/postcopy/preempt/recovery/plain":
>> >>  Recovery has had some issues with races in the past
>> >> 
>> >> "/multifd/tcp/plain/cancel":
>> >>  The MVP of catching concurrency issues
>> >> 
>> >> "/multifd/tcp/uri/plain/none":
>> >>  Smoke test for multifd
>> >> 
>> >> All in all, these will test basic funcionality and run very often. The
>> >> more tests we add to this set, the less return we get in relation to the
>> >> time they take.
>> >
>> > This is true.  We can try to discuss more on which is more important; I
>> > still think something might be good to be added on top of above.
>> 
>> Sure, just add what you think we need.
>
> Let's leave the detailed discussion on what to choose until the patch
> phase.  IIUC this can be the last we do.

ok

>
> We've just queued your other patch to "slow down" the two time consuming
> tests, we save 40s for each CI kick (total ~90s now locally, so that's 30%
> cut already all acrosss!), not so bad as a start.
>
> Then if we can remove some out of 8 (e.g., we can choose 2-3), then if it's
> 3 system runs, it'll be another 62% cut, a total of:
>
>   1.0 - 9/13 * 3/8 = ~75%
>
> So if we keep removing 5 system running it we can reduce it to only a
> quarter time comparing to before for each CI run.
>
> Would that be good enough already for us to live for quite a few more
> releases?  They're so far all low hanging fruits.

Reducing the runs in CI is not a low hanging fruit. We'd need to first
split migration-test from check-qtest somehow. But that would mean not
running migration-test at all during make check, which is worse than
just reducing migration-test while keeping it in make check.

>
> If we want to move further (and you think we should then start with
> reducing test case, rather than profiling the test cases), then feel free
> to go ahead and send a patch when you're ready.  But maybe you don't want
> to bother after you notice that we can already shrink 75% of runtime even
> without dropping anything.  Your call!

I don't think we're discussing total CI time at this point, so the math
doesn't really add up. We're not looking into making the CI finish
faster. We're looking into making migration-test finish faster. That
would reduce timeouts in CI, speed-up make check and reduce the chance
of random race conditions* affecting other people/staging runs.

*- like the one I'm debugging at this very moment in multifd+ppc, are
   you aware of that btw?

>
>> 
>> >
>> > There's also the other way - at some point, I still want to improve
>> > migration-test run speed, and please have a look if you like too at some
>> > point: so there's still chance (average is ~2sec as of now), IMHO, we don't
>> > lose anything in CI but runs simply faster.
>> 
>> My only idea, but it requires a bit of work, is to do unit testing on
>> the interfaces. Anything before migration_fd_connect(). Then we could
>> stop doing a full migration for those.
>
> What I'm thinking is some further profiling, like what Dan used to do.  I
> feel like we still have chance, considering what we run as guest image is
> simply a loop.. so basically zero boot time logically.

I started looking again today into code coverage tools to see if we can
trim the tests that are redundant. But no promises, I already have a
queue of stuff that's pending.

>
>> 
>> >
>> >> 
>> >> >
>> >> > IIUC some file ops can still be relevant to archs.  Multifd still has one
>> >> > bug that can only reproduce on arm64.. but not x86_64.  I remember it's a
>> >> > race condition when migration finishes, and the issue could be memory
>> >> > ordering relevant, but maybe not.
>> >> 
>> >> I'm not aware of anything. I believe the last arm64 bug we had was the
>> >> threadinfo stuff[1]. If you remember what it's about, let me know.
>> >> 
>> >> 1- 01ec0f3a92 ("migration/multifd: Protect accesses to migration_threads").
>> >
>> > https://issues.redhat.com/browse/RHEL-45628
>> 
>> Interesting. But if multifd/tcp/plain/cancel doesn't catch this I don't
>> think other tests will. Also, we don't have tests for zerocopy AFAIK.
>
> That'll be challenging as it requires locked_vm limit.
>
>> 
>> >
>> > On RH side Bandan is looking at it, but he's during a long holidays
>> > recently.
>> 
>> Good luck to him.
>> 
>> >
>> >> 
>> >> >
>> >> >> 
>> >> >> and x86 gets extra:
>> >> >> 
>> >> >> "/precopy/unix/suspend/live"
>> >> >> "/precopy/unix/suspend/notlive"
>> >> >> "/dirty_ring"
>> >> >
>> >> > dirty ring will be disabled anyway when !x86, so probably not a major
>> >> > concern.
>> >> >
>> >> >> 
>> >> >> (the other dirty_* tests are too slow)
>> >> >
>> >> > These are the 10 slowest tests when I run locally:
>> >> >
>> >> > /x86_64/migration/multifd/tcp/tls/x509/allow-anon-client 2.41
>> >> > /x86_64/migration/postcopy/recovery/plain 2.43
>> >> > /x86_64/migration/multifd/tcp/tls/x509/default-host 2.66
>> >> > /x86_64/migration/multifd/tcp/tls/x509/override-host 2.86
>> >> > /x86_64/migration/postcopy/tls/psk 2.91
>> >> > /x86_64/migration/postcopy/preempt/recovery/tls/psk 3.08
>> >> > /x86_64/migration/postcopy/preempt/tls/psk 3.30
>> >> > /x86_64/migration/postcopy/recovery/tls/psk 3.81
>> >> > /x86_64/migration/vcpu_dirty_limit 13.29
>> >> > /x86_64/migration/precopy/unix/xbzrle 27.55
>> >> >
>> >> > Are you aware of people using xbzrle at all?
>> >> 
>> >> Nope.
>> >> 
>> >> >
>> >> >> 
>> >> >> All the rest go behind a knob that people touching migration code will
>> >> >> enable.
>> >> >> 
>> >> >> wdyt?
>> >> >
>> >> > Agree with the general idea, but I worry above exact list can be too small.
>> >> 
>> >> We won't stop running the full set of tests. We can run them in our
>> >> forks' CI as much as we want. There are no cases of people reporting a
>> >> migration bug because their 'make check' caught something that ours
>> >> didn't.
>> >
>> > IIUC it's hard to say - when the test is in CI maintainers can catch them
>> > already before sending a formal PR.
>> >
>> > If the test is run by default in make check, a developer can trigger a
>> > migration issue (with his/her patch applied) then one can notice it
>> > introduced a bug, fix it, then post the patches.  We won't know whether
>> > that happened.
>> 
>> Good point.
>> 
>> >
>> > So one thing we can do (if you think worthwhile to do it now) is we shrink
>> > the default test case a lot as you proposed, then we wait and see what
>> > breaks, and then we gradually add tests back when it can be used to find
>> > breakages.  But that'll also take time if it really can find such tests,
>> > because then we'll need to debug them one by one (instead of developer /
>> > maintainer looking into them with their expertise knowledge..).  I'm not
>> > sure whether it's worthwhile to do it now, but I don't feel strongly if we
>> > can still have a reliable set of default test cases.
>> 
>> We first need a way to enable them for the migration CI. Do we have a
>> variable that CI understands that can be used to enable slow tests?
>
> I'm not aware of.  Yes sounds like we should have one if it doesn't exist.
>
>> 
>> >
>> >> 
>> >> Besides, the main strength of CI is to catch bugs when someone makes a
>> >> code change. If people touch migration code, then we'll run it in our CI
>> >> anyway. If they touch device code and that device is migrated by default
>> >> then any one of the simple tests will catch the issue when it runs via
>> >> the migration-compat job. If the device is not enabled by default, then
>> >> no tests will catch it.
>> >> 
>> >> The worst case scenario is they touch some code completely unrelated and
>> >> their 'make check' or CI run breaks because of some race in the
>> >> migration code. That's not what CI is for, that's just an annoyance for
>> >> everyone. I'd rather it breaks in our hands and then we'll go fix it.
>> >> 
>> >> >
>> >> > IMHO we can definitely, at least, move the last two into slow list
>> >> > (vcpu_dirty_limit and xbzrle), then it'll already save us 40sec each run..
>> >> 
>> >> Agreed. I'll send a patch once I get out from under downstream stuff.
>> >> 
>> >> >
>> >> >> 
>> >> >> 1- allows adding devices to QEMU cmdline for migration-test
>> >> >> https://lore.kernel.org/r/20240523201922.28007-4-farosas@suse.de
>> >> >> 
>> >> 
>>
Peter Maydell Sept. 12, 2024, 8:13 a.m. UTC | #11
On Wed, 11 Sept 2024 at 22:26, Fabiano Rosas <farosas@suse.de> wrote:
> I don't think we're discussing total CI time at this point, so the math
> doesn't really add up. We're not looking into making the CI finish
> faster. We're looking into making migration-test finish faster. That
> would reduce timeouts in CI, speed-up make check and reduce the chance
> of random race conditions* affecting other people/staging runs.

Right. The reason migration-test appears on my radar is because
it is very frequently the thing that shows up as "this sometimes
just fails or just times out and if you hit retry it goes away
again". That might not be migration-test's fault specifically,
because those retries tend to be certain CI configs (s390,
the i686-tci one), and I have some theories about what might be
causing it (e.g. build system runs 4 migration-tests in parallel,
which means 8 QEMU processes which is too many for the number
of host CPUs). But right now I look at CI job failures and my reaction
is "oh, it's the migration-test failing yet again" :-(

For some examples from this week:

https://gitlab.com/qemu-project/qemu/-/jobs/7802183144
https://gitlab.com/qemu-project/qemu/-/jobs/7799842373
https://gitlab.com/qemu-project/qemu/-/jobs/7786579152
https://gitlab.com/qemu-project/qemu/-/jobs/7786579155

-- PMM
Fabiano Rosas Sept. 12, 2024, 1:48 p.m. UTC | #12
Peter Maydell <peter.maydell@linaro.org> writes:

> On Wed, 11 Sept 2024 at 22:26, Fabiano Rosas <farosas@suse.de> wrote:
>> I don't think we're discussing total CI time at this point, so the math
>> doesn't really add up. We're not looking into making the CI finish
>> faster. We're looking into making migration-test finish faster. That
>> would reduce timeouts in CI, speed-up make check and reduce the chance
>> of random race conditions* affecting other people/staging runs.
>
> Right. The reason migration-test appears on my radar is because
> it is very frequently the thing that shows up as "this sometimes
> just fails or just times out and if you hit retry it goes away
> again". That might not be migration-test's fault specifically,
> because those retries tend to be certain CI configs (s390,
> the i686-tci one), and I have some theories about what might be
> causing it (e.g. build system runs 4 migration-tests in parallel,
> which means 8 QEMU processes which is too many for the number
> of host CPUs). But right now I look at CI job failures and my reaction
> is "oh, it's the migration-test failing yet again" :-(

And then I go: "oh, people complaining about migration-test again, I
thought we had fixed all the issues this time". It's frustrating for
everyone, as I said previously.

>
> For some examples from this week:
>
> https://gitlab.com/qemu-project/qemu/-/jobs/7802183144
> https://gitlab.com/qemu-project/qemu/-/jobs/7799842373
> https://gitlab.com/qemu-project/qemu/-/jobs/7786579152
> https://gitlab.com/qemu-project/qemu/-/jobs/7786579155

About these:

There are 2 instances of plain-old-SIGSEGV here. Both happen in
non-x86_64 runs and on the /multifd/tcp/plain/cancel test, which means
they're either races or memory ordering issues. Having i386 crashing
points to the former. So having the CI loaded and causing timeouts is
probably what exposed the issue.

The thread is mig/dst/recv_7 and grepping the objdump output shows:
<set_bit_atomic> 55 48 89 e5 48 89 7d e8 48 89 75 e0 48 8b 45 e8 83 e0
3f ba 01 00 00 00 89 c1 48 d3 e2 48 89 d0 48 89 45 f0 48 8b 45 e8 48 c1
e8 06 48 8d 14 c5 00 00 00 00 48 8b 45 e0 48 01 d0 48 89 45 f8 48 8b 45
f8 48 8b 55 f0 <f0> 48 09 10 90 5d c3

I tried a bisect overnight, but it seems the issue has been there since
before 9.0. I'll try to repro with gdb attached or get a core dump.
Peter Maydell Sept. 12, 2024, 2:09 p.m. UTC | #13
On Thu, 12 Sept 2024 at 14:48, Fabiano Rosas <farosas@suse.de> wrote:
> Peter Maydell <peter.maydell@linaro.org> writes:
> > For some examples from this week:
> >
> > https://gitlab.com/qemu-project/qemu/-/jobs/7802183144
> > https://gitlab.com/qemu-project/qemu/-/jobs/7799842373
> > https://gitlab.com/qemu-project/qemu/-/jobs/7786579152
> > https://gitlab.com/qemu-project/qemu/-/jobs/7786579155
>
> About these:
>
> There are 2 instances of plain-old-SIGSEGV here. Both happen in
> non-x86_64 runs and on the /multifd/tcp/plain/cancel test, which means
> they're either races or memory ordering issues. Having i386 crashing
> points to the former. So having the CI loaded and causing timeouts is
> probably what exposed the issue.

They're also both TCI. Would these tests be relying on
specific atomic-access behaviour in the guest code that's
running, or is all the avoidance-of-races in the migration
code in QEMU itself?

(I don't know of any particular problems with TCI's
implementation of atomic accesses, so this is just a stab
in the dark.)

thanks
-- PMM
Fabiano Rosas Sept. 12, 2024, 2:28 p.m. UTC | #14
Peter Maydell <peter.maydell@linaro.org> writes:

> On Thu, 12 Sept 2024 at 14:48, Fabiano Rosas <farosas@suse.de> wrote:
>> Peter Maydell <peter.maydell@linaro.org> writes:
>> > For some examples from this week:
>> >
>> > https://gitlab.com/qemu-project/qemu/-/jobs/7802183144
>> > https://gitlab.com/qemu-project/qemu/-/jobs/7799842373
>> > https://gitlab.com/qemu-project/qemu/-/jobs/7786579152
>> > https://gitlab.com/qemu-project/qemu/-/jobs/7786579155
>>
>> About these:
>>
>> There are 2 instances of plain-old-SIGSEGV here. Both happen in
>> non-x86_64 runs and on the /multifd/tcp/plain/cancel test, which means
>> they're either races or memory ordering issues. Having i386 crashing
>> points to the former. So having the CI loaded and causing timeouts is
>> probably what exposed the issue.
>
> They're also both TCI. Would these tests be relying on
> specific atomic-access behaviour in the guest code that's
> running, or is all the avoidance-of-races in the migration
> code in QEMU itself?

I misspoke about memory ordering, this is all just the x86 host and the
multifd threads in QEMU having synchronization issues.

>
> (I don't know of any particular problems with TCI's
> implementation of atomic accesses, so this is just a stab
> in the dark.)
>
> thanks
> -- PMM
Peter Xu Sept. 12, 2024, 3:09 p.m. UTC | #15
On Thu, Sep 12, 2024 at 09:13:16AM +0100, Peter Maydell wrote:
> On Wed, 11 Sept 2024 at 22:26, Fabiano Rosas <farosas@suse.de> wrote:
> > I don't think we're discussing total CI time at this point, so the math
> > doesn't really add up. We're not looking into making the CI finish
> > faster. We're looking into making migration-test finish faster. That
> > would reduce timeouts in CI, speed-up make check and reduce the chance
> > of random race conditions* affecting other people/staging runs.
> 
> Right. The reason migration-test appears on my radar is because
> it is very frequently the thing that shows up as "this sometimes
> just fails or just times out and if you hit retry it goes away
> again". That might not be migration-test's fault specifically,
> because those retries tend to be certain CI configs (s390,
> the i686-tci one), and I have some theories about what might be
> causing it (e.g. build system runs 4 migration-tests in parallel,
> which means 8 QEMU processes which is too many for the number
> of host CPUs). But right now I look at CI job failures and my reaction
> is "oh, it's the migration-test failing yet again" :-(
> 
> For some examples from this week:
> 
> https://gitlab.com/qemu-project/qemu/-/jobs/7802183144
> https://gitlab.com/qemu-project/qemu/-/jobs/7799842373  <--------[1]
> https://gitlab.com/qemu-project/qemu/-/jobs/7786579152  <--------[2]
> https://gitlab.com/qemu-project/qemu/-/jobs/7786579155

Ah right, the TIMEOUT is unfortunate, especially if tests can be run in
parallel.  It indeed sounds like no good way to finally solve.. I don't
also see how speeding up / reducing tests in migration test would help, as
that's (from some degree..) is the same as tuning the timeout value bigger.
When the tests are less it'll fit into 480s window, but maybe it's too
quick now we wonder whether we should shrink it to e.g. 90s, but then it
can timeout again when on a busy host with less capability of concurrency.

But indeed there're two ERRORs ([1,2] above)..  I collected some more info
here before the log expires:

=================================8<================================

*** /i386/migration/multifd/tcp/plain/cancel, qtest-i386 on s390 host

https://gitlab.com/qemu-project/qemu/-/jobs/7799842373

101/953 qemu:qtest+qtest-i386 / qtest-i386/migration-test                         ERROR          144.32s   killed by signal 6 SIGABRT
>>> QTEST_QEMU_STORAGE_DAEMON_BINARY=./storage-daemon/qemu-storage-daemon G_TEST_DBUS_DAEMON=/home/gitlab-runner/builds/zEr9wY_L/0/qemu-project/qemu/tests/dbus-vmstate-daemon.sh PYTHON=/home/gitlab-runner/builds/zEr9wY_L/0/qemu-project/qemu/build/pyvenv/bin/python3 QTEST_QEMU_IMG=./qemu-img MALLOC_PERTURB_=144 QTEST_QEMU_BINARY=./qemu-system-i386 /home/gitlab-runner/builds/zEr9wY_L/0/qemu-project/qemu/build/tests/qtest/migration-test --tap -k
――――――――――――――――――――――――――――――――――――― ✀  ―――――――――――――――――――――――――――――――――――――
stderr:
warning: fd: migration to a file is deprecated. Use file: instead.
warning: fd: migration to a file is deprecated. Use file: instead.
../tests/qtest/libqtest.c:205: kill_qemu() detected QEMU death from signal 11 (Segmentation fault) (core dumped)
(test program exited with status code -6)
TAP parsing error: Too few tests run (expected 53, got 39)
――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――

# Start of plain tests
# Running /i386/migration/multifd/tcp/plain/cancel
# Using machine type: pc-i440fx-9.2
# starting QEMU: exec ./qemu-system-i386 -qtest unix:/tmp/qtest-3273509.sock -qtest-log /dev/null -chardev socket,path=/tmp/qtest-3273509.qmp,id=char0 -mon chardev=char0,mode=control -display none -audio none -accel kvm -accel tcg -machine pc-i440fx-9.2, -name source,debug-threads=on -m 150M -serial file:/tmp/migration-test-4112T2/src_serial -drive if=none,id=d0,file=/tmp/migration-test-4112T2/bootsect,format=raw -device ide-hd,drive=d0,secs=1,cyls=1,heads=1    2>/dev/null -accel qtest
# starting QEMU: exec ./qemu-system-i386 -qtest unix:/tmp/qtest-3273509.sock -qtest-log /dev/null -chardev socket,path=/tmp/qtest-3273509.qmp,id=char0 -mon chardev=char0,mode=control -display none -audio none -accel kvm -accel tcg -machine pc-i440fx-9.2, -name target,debug-threads=on -m 150M -serial file:/tmp/migration-test-4112T2/dest_serial -incoming defer -drive if=none,id=d0,file=/tmp/migration-test-4112T2/bootsect,format=raw -device ide-hd,drive=d0,secs=1,cyls=1,heads=1    2>/dev/null -accel qtest
----------------------------------- stderr -----------------------------------
warning: fd: migration to a file is deprecated. Use file: instead.
warning: fd: migration to a file is deprecated. Use file: instead.
../tests/qtest/libqtest.c:205: kill_qemu() detected QEMU death from signal 11 (Segmentation fault) (core dumped)

*** /ppc64/migration/multifd/tcp/plain/cancel, qtest-ppc64 on i686 host

https://gitlab.com/qemu-project/qemu/-/jobs/7786579152

174/315 qemu:qtest+qtest-ppc64 / qtest-ppc64/migration-test                       ERROR          381.00s   killed by signal 6 SIGABRT
>>> PYTHON=/builds/qemu-project/qemu/build/pyvenv/bin/python3 QTEST_QEMU_IMG=./qemu-img G_TEST_DBUS_DAEMON=/builds/qemu-project/qemu/tests/dbus-vmstate-daemon.sh QTEST_QEMU_BINARY=./qemu-system-ppc64 MALLOC_PERTURB_=178 QTEST_QEMU_STORAGE_DAEMON_BINARY=./storage-daemon/qemu-storage-daemon /builds/qemu-project/qemu/build/tests/qtest/migration-test --tap -k
――――――――――――――――――――――――――――――――――――― ✀  ―――――――――――――――――――――――――――――――――――――
stderr:
qemu-system-ppc64: Cannot read from TLS channel: The TLS connection was non-properly terminated.
warning: fd: migration to a file is deprecated. Use file: instead.
warning: fd: migration to a file is deprecated. Use file: instead.
../tests/qtest/libqtest.c:205: kill_qemu() detected QEMU death from signal 11 (Segmentation fault) (core dumped)
(test program exited with status code -6)
TAP parsing error: Too few tests run (expected 61, got 47)
――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――

# Start of plain tests
# Running /ppc64/migration/multifd/tcp/plain/cancel
# Using machine type: pseries-9.2
# starting QEMU: exec ./qemu-system-ppc64 -qtest unix:/tmp/qtest-40766.sock -qtest-log /dev/null -chardev socket,path=/tmp/qtest-40766.qmp,id=char0 -mon chardev=char0,mode=control -display none -audio none -accel kvm -accel tcg -machine pseries-9.2,vsmt=8 -name source,debug-threads=on -m 256M -serial file:/tmp/migration-test-H0Z1T2/src_serial -nodefaults -machine cap-cfpc=broken,cap-sbbc=broken,cap-ibs=broken,cap-ccf-assist=off, -bios /tmp/migration-test-H0Z1T2/bootsect    2>/dev/null -accel qtest
# starting QEMU: exec ./qemu-system-ppc64 -qtest unix:/tmp/qtest-40766.sock -qtest-log /dev/null -chardev socket,path=/tmp/qtest-40766.qmp,id=char0 -mon chardev=char0,mode=control -display none -audio none -accel kvm -accel tcg -machine pseries-9.2,vsmt=8 -name target,debug-threads=on -m 256M -serial file:/tmp/migration-test-H0Z1T2/dest_serial -incoming defer -nodefaults -machine cap-cfpc=broken,cap-sbbc=broken,cap-ibs=broken,cap-ccf-assist=off, -bios /tmp/migration-test-H0Z1T2/bootsect    2>/dev/null -accel qtest
----------------------------------- stderr -----------------------------------
qemu-system-ppc64: Cannot read from TLS channel: The TLS connection was non-properly terminated.
warning: fd: migration to a file is deprecated. Use file: instead.
warning: fd: migration to a file is deprecated. Use file: instead.
../tests/qtest/libqtest.c:205: kill_qemu() detected QEMU death from signal 11 (Segmentation fault) (core dumped)

(test program exited with status code -6)
=================================8<================================

So.. it's the same test (multifd/tcp/plain/cancel) that is failing on
different host / arch being tested.  What is more weird is the two failures
are different, the 2nd failure throw out a TLS error even though the test
doesn't yet have tls involved.

Fabiano, is this the issue you're looking at?

Peter, do you think it'll be helpful if we temporarily mark this test as
"slow" too so it's not run in CI (so we still run it ourselves when prepare
migration PR, with the hope that it can reproduce)?

Thanks,
Peter Maydell Sept. 12, 2024, 3:14 p.m. UTC | #16
On Thu, 12 Sept 2024 at 16:09, Peter Xu <peterx@redhat.com> wrote:
>
> On Thu, Sep 12, 2024 at 09:13:16AM +0100, Peter Maydell wrote:
> > On Wed, 11 Sept 2024 at 22:26, Fabiano Rosas <farosas@suse.de> wrote:
> > > I don't think we're discussing total CI time at this point, so the math
> > > doesn't really add up. We're not looking into making the CI finish
> > > faster. We're looking into making migration-test finish faster. That
> > > would reduce timeouts in CI, speed-up make check and reduce the chance
> > > of random race conditions* affecting other people/staging runs.
> >
> > Right. The reason migration-test appears on my radar is because
> > it is very frequently the thing that shows up as "this sometimes
> > just fails or just times out and if you hit retry it goes away
> > again". That might not be migration-test's fault specifically,
> > because those retries tend to be certain CI configs (s390,
> > the i686-tci one), and I have some theories about what might be
> > causing it (e.g. build system runs 4 migration-tests in parallel,
> > which means 8 QEMU processes which is too many for the number
> > of host CPUs). But right now I look at CI job failures and my reaction
> > is "oh, it's the migration-test failing yet again" :-(
> >
> > For some examples from this week:
> >
> > https://gitlab.com/qemu-project/qemu/-/jobs/7802183144
> > https://gitlab.com/qemu-project/qemu/-/jobs/7799842373  <--------[1]
> > https://gitlab.com/qemu-project/qemu/-/jobs/7786579152  <--------[2]
> > https://gitlab.com/qemu-project/qemu/-/jobs/7786579155
>
> Ah right, the TIMEOUT is unfortunate, especially if tests can be run in
> parallel.  It indeed sounds like no good way to finally solve.. I don't
> also see how speeding up / reducing tests in migration test would help, as
> that's (from some degree..) is the same as tuning the timeout value bigger.
> When the tests are less it'll fit into 480s window, but maybe it's too
> quick now we wonder whether we should shrink it to e.g. 90s, but then it
> can timeout again when on a busy host with less capability of concurrency.

For the TIMEOUT part on cross-i686-tci I plan to try this patch:
https://patchew.org/QEMU/20240912151003.2045031-1-peter.maydell@linaro.org/
which makes 'make check' single-threaded; that will help to see
if the parallelism is a problem. (If it is then we might want
to do a more generalised approach rather than just for that one
CI job.)

> But indeed there're two ERRORs ([1,2] above)..  I collected some more info
> here before the log expires:

> So.. it's the same test (multifd/tcp/plain/cancel) that is failing on
> different host / arch being tested.  What is more weird is the two failures
> are different, the 2nd failure throw out a TLS error even though the test
> doesn't yet have tls involved.
>
> Fabiano, is this the issue you're looking at?
>
> Peter, do you think it'll be helpful if we temporarily mark this test as
> "slow" too so it's not run in CI (so we still run it ourselves when prepare
> migration PR, with the hope that it can reproduce)?

If you think that specific test is flaky then I think that's
probably a good idea. As usual with this kind of thing,
probably best to have a comment next to the test noting
why and with a URL to a gitlab issue for it, so we don't
forget why we disabled the test.

-- PMM
Fabiano Rosas Sept. 12, 2024, 3:37 p.m. UTC | #17
Peter Xu <peterx@redhat.com> writes:

> On Thu, Sep 12, 2024 at 09:13:16AM +0100, Peter Maydell wrote:
>> On Wed, 11 Sept 2024 at 22:26, Fabiano Rosas <farosas@suse.de> wrote:
>> > I don't think we're discussing total CI time at this point, so the math
>> > doesn't really add up. We're not looking into making the CI finish
>> > faster. We're looking into making migration-test finish faster. That
>> > would reduce timeouts in CI, speed-up make check and reduce the chance
>> > of random race conditions* affecting other people/staging runs.
>> 
>> Right. The reason migration-test appears on my radar is because
>> it is very frequently the thing that shows up as "this sometimes
>> just fails or just times out and if you hit retry it goes away
>> again". That might not be migration-test's fault specifically,
>> because those retries tend to be certain CI configs (s390,
>> the i686-tci one), and I have some theories about what might be
>> causing it (e.g. build system runs 4 migration-tests in parallel,
>> which means 8 QEMU processes which is too many for the number
>> of host CPUs). But right now I look at CI job failures and my reaction
>> is "oh, it's the migration-test failing yet again" :-(
>> 
>> For some examples from this week:
>> 
>> https://gitlab.com/qemu-project/qemu/-/jobs/7802183144
>> https://gitlab.com/qemu-project/qemu/-/jobs/7799842373  <--------[1]
>> https://gitlab.com/qemu-project/qemu/-/jobs/7786579152  <--------[2]
>> https://gitlab.com/qemu-project/qemu/-/jobs/7786579155
>
> Ah right, the TIMEOUT is unfortunate, especially if tests can be run in
> parallel.  It indeed sounds like no good way to finally solve.. I don't
> also see how speeding up / reducing tests in migration test would help, as
> that's (from some degree..) is the same as tuning the timeout value bigger.
> When the tests are less it'll fit into 480s window, but maybe it's too
> quick now we wonder whether we should shrink it to e.g. 90s, but then it
> can timeout again when on a busy host with less capability of concurrency.
>
> But indeed there're two ERRORs ([1,2] above)..  I collected some more info
> here before the log expires:
>
> =================================8<================================
>
> *** /i386/migration/multifd/tcp/plain/cancel, qtest-i386 on s390 host
>
> https://gitlab.com/qemu-project/qemu/-/jobs/7799842373
>
> 101/953 qemu:qtest+qtest-i386 / qtest-i386/migration-test                         ERROR          144.32s   killed by signal 6 SIGABRT
>>>> QTEST_QEMU_STORAGE_DAEMON_BINARY=./storage-daemon/qemu-storage-daemon G_TEST_DBUS_DAEMON=/home/gitlab-runner/builds/zEr9wY_L/0/qemu-project/qemu/tests/dbus-vmstate-daemon.sh PYTHON=/home/gitlab-runner/builds/zEr9wY_L/0/qemu-project/qemu/build/pyvenv/bin/python3 QTEST_QEMU_IMG=./qemu-img MALLOC_PERTURB_=144 QTEST_QEMU_BINARY=./qemu-system-i386 /home/gitlab-runner/builds/zEr9wY_L/0/qemu-project/qemu/build/tests/qtest/migration-test --tap -k
> ――――――――――――――――――――――――――――――――――――― ✀  ―――――――――――――――――――――――――――――――――――――
> stderr:
> warning: fd: migration to a file is deprecated. Use file: instead.
> warning: fd: migration to a file is deprecated. Use file: instead.
> ../tests/qtest/libqtest.c:205: kill_qemu() detected QEMU death from signal 11 (Segmentation fault) (core dumped)
> (test program exited with status code -6)
> TAP parsing error: Too few tests run (expected 53, got 39)
> ――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――
>
> # Start of plain tests
> # Running /i386/migration/multifd/tcp/plain/cancel
> # Using machine type: pc-i440fx-9.2
> # starting QEMU: exec ./qemu-system-i386 -qtest unix:/tmp/qtest-3273509.sock -qtest-log /dev/null -chardev socket,path=/tmp/qtest-3273509.qmp,id=char0 -mon chardev=char0,mode=control -display none -audio none -accel kvm -accel tcg -machine pc-i440fx-9.2, -name source,debug-threads=on -m 150M -serial file:/tmp/migration-test-4112T2/src_serial -drive if=none,id=d0,file=/tmp/migration-test-4112T2/bootsect,format=raw -device ide-hd,drive=d0,secs=1,cyls=1,heads=1    2>/dev/null -accel qtest
> # starting QEMU: exec ./qemu-system-i386 -qtest unix:/tmp/qtest-3273509.sock -qtest-log /dev/null -chardev socket,path=/tmp/qtest-3273509.qmp,id=char0 -mon chardev=char0,mode=control -display none -audio none -accel kvm -accel tcg -machine pc-i440fx-9.2, -name target,debug-threads=on -m 150M -serial file:/tmp/migration-test-4112T2/dest_serial -incoming defer -drive if=none,id=d0,file=/tmp/migration-test-4112T2/bootsect,format=raw -device ide-hd,drive=d0,secs=1,cyls=1,heads=1    2>/dev/null -accel qtest
> ----------------------------------- stderr -----------------------------------
> warning: fd: migration to a file is deprecated. Use file: instead.
> warning: fd: migration to a file is deprecated. Use file: instead.
> ../tests/qtest/libqtest.c:205: kill_qemu() detected QEMU death from signal 11 (Segmentation fault) (core dumped)
>
> *** /ppc64/migration/multifd/tcp/plain/cancel, qtest-ppc64 on i686 host
>
> https://gitlab.com/qemu-project/qemu/-/jobs/7786579152
>
> 174/315 qemu:qtest+qtest-ppc64 / qtest-ppc64/migration-test                       ERROR          381.00s   killed by signal 6 SIGABRT
>>>> PYTHON=/builds/qemu-project/qemu/build/pyvenv/bin/python3 QTEST_QEMU_IMG=./qemu-img G_TEST_DBUS_DAEMON=/builds/qemu-project/qemu/tests/dbus-vmstate-daemon.sh QTEST_QEMU_BINARY=./qemu-system-ppc64 MALLOC_PERTURB_=178 QTEST_QEMU_STORAGE_DAEMON_BINARY=./storage-daemon/qemu-storage-daemon /builds/qemu-project/qemu/build/tests/qtest/migration-test --tap -k
> ――――――――――――――――――――――――――――――――――――― ✀  ―――――――――――――――――――――――――――――――――――――
> stderr:
> qemu-system-ppc64: Cannot read from TLS channel: The TLS connection was non-properly terminated.
> warning: fd: migration to a file is deprecated. Use file: instead.
> warning: fd: migration to a file is deprecated. Use file: instead.
> ../tests/qtest/libqtest.c:205: kill_qemu() detected QEMU death from signal 11 (Segmentation fault) (core dumped)
> (test program exited with status code -6)
> TAP parsing error: Too few tests run (expected 61, got 47)
> ――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――
>
> # Start of plain tests
> # Running /ppc64/migration/multifd/tcp/plain/cancel
> # Using machine type: pseries-9.2
> # starting QEMU: exec ./qemu-system-ppc64 -qtest unix:/tmp/qtest-40766.sock -qtest-log /dev/null -chardev socket,path=/tmp/qtest-40766.qmp,id=char0 -mon chardev=char0,mode=control -display none -audio none -accel kvm -accel tcg -machine pseries-9.2,vsmt=8 -name source,debug-threads=on -m 256M -serial file:/tmp/migration-test-H0Z1T2/src_serial -nodefaults -machine cap-cfpc=broken,cap-sbbc=broken,cap-ibs=broken,cap-ccf-assist=off, -bios /tmp/migration-test-H0Z1T2/bootsect    2>/dev/null -accel qtest
> # starting QEMU: exec ./qemu-system-ppc64 -qtest unix:/tmp/qtest-40766.sock -qtest-log /dev/null -chardev socket,path=/tmp/qtest-40766.qmp,id=char0 -mon chardev=char0,mode=control -display none -audio none -accel kvm -accel tcg -machine pseries-9.2,vsmt=8 -name target,debug-threads=on -m 256M -serial file:/tmp/migration-test-H0Z1T2/dest_serial -incoming defer -nodefaults -machine cap-cfpc=broken,cap-sbbc=broken,cap-ibs=broken,cap-ccf-assist=off, -bios /tmp/migration-test-H0Z1T2/bootsect    2>/dev/null -accel qtest
> ----------------------------------- stderr -----------------------------------
> qemu-system-ppc64: Cannot read from TLS channel: The TLS connection was non-properly terminated.
> warning: fd: migration to a file is deprecated. Use file: instead.
> warning: fd: migration to a file is deprecated. Use file: instead.
> ../tests/qtest/libqtest.c:205: kill_qemu() detected QEMU death from signal 11 (Segmentation fault) (core dumped)
>
> (test program exited with status code -6)
> =================================8<================================
>
> So.. it's the same test (multifd/tcp/plain/cancel) that is failing on
> different host / arch being tested.  What is more weird is the two failures
> are different, the 2nd failure throw out a TLS error even though the test
> doesn't yet have tls involved.

I think that's just a parallel test being cancelled prematurely, either
due to the crash or due to the timeout.

>
> Fabiano, is this the issue you're looking at?

Yes. I can reproduce locally by running 2 processes in parallel: 1 loop
with make -j$(nproc) check and another loop with tcp/plain/cancel. It
takes ~1h to hit. I've seen crashes with ppc64, s390 and
aarch64.

> Peter, do you think it'll be helpful if we temporarily mark this test as
> "slow" too so it's not run in CI (so we still run it ourselves when prepare
> migration PR, with the hope that it can reproduce)?
>
> Thanks,
Fabiano Rosas Sept. 12, 2024, 10:52 p.m. UTC | #18
Fabiano Rosas <farosas@suse.de> writes:

> Peter Xu <peterx@redhat.com> writes:
>
>> On Thu, Sep 12, 2024 at 09:13:16AM +0100, Peter Maydell wrote:
>>> On Wed, 11 Sept 2024 at 22:26, Fabiano Rosas <farosas@suse.de> wrote:
>>> > I don't think we're discussing total CI time at this point, so the math
>>> > doesn't really add up. We're not looking into making the CI finish
>>> > faster. We're looking into making migration-test finish faster. That
>>> > would reduce timeouts in CI, speed-up make check and reduce the chance
>>> > of random race conditions* affecting other people/staging runs.
>>> 
>>> Right. The reason migration-test appears on my radar is because
>>> it is very frequently the thing that shows up as "this sometimes
>>> just fails or just times out and if you hit retry it goes away
>>> again". That might not be migration-test's fault specifically,
>>> because those retries tend to be certain CI configs (s390,
>>> the i686-tci one), and I have some theories about what might be
>>> causing it (e.g. build system runs 4 migration-tests in parallel,
>>> which means 8 QEMU processes which is too many for the number
>>> of host CPUs). But right now I look at CI job failures and my reaction
>>> is "oh, it's the migration-test failing yet again" :-(
>>> 
>>> For some examples from this week:
>>> 
>>> https://gitlab.com/qemu-project/qemu/-/jobs/7802183144
>>> https://gitlab.com/qemu-project/qemu/-/jobs/7799842373  <--------[1]
>>> https://gitlab.com/qemu-project/qemu/-/jobs/7786579152  <--------[2]
>>> https://gitlab.com/qemu-project/qemu/-/jobs/7786579155
>>
>> Ah right, the TIMEOUT is unfortunate, especially if tests can be run in
>> parallel.  It indeed sounds like no good way to finally solve.. I don't
>> also see how speeding up / reducing tests in migration test would help, as
>> that's (from some degree..) is the same as tuning the timeout value bigger.
>> When the tests are less it'll fit into 480s window, but maybe it's too
>> quick now we wonder whether we should shrink it to e.g. 90s, but then it
>> can timeout again when on a busy host with less capability of concurrency.
>>
>> But indeed there're two ERRORs ([1,2] above)..  I collected some more info
>> here before the log expires:
>>
>> =================================8<================================
>>
>> *** /i386/migration/multifd/tcp/plain/cancel, qtest-i386 on s390 host
>>
>> https://gitlab.com/qemu-project/qemu/-/jobs/7799842373
>>
>> 101/953 qemu:qtest+qtest-i386 / qtest-i386/migration-test                         ERROR          144.32s   killed by signal 6 SIGABRT
>>>>> QTEST_QEMU_STORAGE_DAEMON_BINARY=./storage-daemon/qemu-storage-daemon G_TEST_DBUS_DAEMON=/home/gitlab-runner/builds/zEr9wY_L/0/qemu-project/qemu/tests/dbus-vmstate-daemon.sh PYTHON=/home/gitlab-runner/builds/zEr9wY_L/0/qemu-project/qemu/build/pyvenv/bin/python3 QTEST_QEMU_IMG=./qemu-img MALLOC_PERTURB_=144 QTEST_QEMU_BINARY=./qemu-system-i386 /home/gitlab-runner/builds/zEr9wY_L/0/qemu-project/qemu/build/tests/qtest/migration-test --tap -k
>> ――――――――――――――――――――――――――――――――――――― ✀  ―――――――――――――――――――――――――――――――――――――
>> stderr:
>> warning: fd: migration to a file is deprecated. Use file: instead.
>> warning: fd: migration to a file is deprecated. Use file: instead.
>> ../tests/qtest/libqtest.c:205: kill_qemu() detected QEMU death from signal 11 (Segmentation fault) (core dumped)
>> (test program exited with status code -6)
>> TAP parsing error: Too few tests run (expected 53, got 39)
>> ――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――
>>
>> # Start of plain tests
>> # Running /i386/migration/multifd/tcp/plain/cancel
>> # Using machine type: pc-i440fx-9.2
>> # starting QEMU: exec ./qemu-system-i386 -qtest unix:/tmp/qtest-3273509.sock -qtest-log /dev/null -chardev socket,path=/tmp/qtest-3273509.qmp,id=char0 -mon chardev=char0,mode=control -display none -audio none -accel kvm -accel tcg -machine pc-i440fx-9.2, -name source,debug-threads=on -m 150M -serial file:/tmp/migration-test-4112T2/src_serial -drive if=none,id=d0,file=/tmp/migration-test-4112T2/bootsect,format=raw -device ide-hd,drive=d0,secs=1,cyls=1,heads=1    2>/dev/null -accel qtest
>> # starting QEMU: exec ./qemu-system-i386 -qtest unix:/tmp/qtest-3273509.sock -qtest-log /dev/null -chardev socket,path=/tmp/qtest-3273509.qmp,id=char0 -mon chardev=char0,mode=control -display none -audio none -accel kvm -accel tcg -machine pc-i440fx-9.2, -name target,debug-threads=on -m 150M -serial file:/tmp/migration-test-4112T2/dest_serial -incoming defer -drive if=none,id=d0,file=/tmp/migration-test-4112T2/bootsect,format=raw -device ide-hd,drive=d0,secs=1,cyls=1,heads=1    2>/dev/null -accel qtest
>> ----------------------------------- stderr -----------------------------------
>> warning: fd: migration to a file is deprecated. Use file: instead.
>> warning: fd: migration to a file is deprecated. Use file: instead.
>> ../tests/qtest/libqtest.c:205: kill_qemu() detected QEMU death from signal 11 (Segmentation fault) (core dumped)
>>
>> *** /ppc64/migration/multifd/tcp/plain/cancel, qtest-ppc64 on i686 host
>>
>> https://gitlab.com/qemu-project/qemu/-/jobs/7786579152
>>
>> 174/315 qemu:qtest+qtest-ppc64 / qtest-ppc64/migration-test                       ERROR          381.00s   killed by signal 6 SIGABRT
>>>>> PYTHON=/builds/qemu-project/qemu/build/pyvenv/bin/python3 QTEST_QEMU_IMG=./qemu-img G_TEST_DBUS_DAEMON=/builds/qemu-project/qemu/tests/dbus-vmstate-daemon.sh QTEST_QEMU_BINARY=./qemu-system-ppc64 MALLOC_PERTURB_=178 QTEST_QEMU_STORAGE_DAEMON_BINARY=./storage-daemon/qemu-storage-daemon /builds/qemu-project/qemu/build/tests/qtest/migration-test --tap -k
>> ――――――――――――――――――――――――――――――――――――― ✀  ―――――――――――――――――――――――――――――――――――――
>> stderr:
>> qemu-system-ppc64: Cannot read from TLS channel: The TLS connection was non-properly terminated.
>> warning: fd: migration to a file is deprecated. Use file: instead.
>> warning: fd: migration to a file is deprecated. Use file: instead.
>> ../tests/qtest/libqtest.c:205: kill_qemu() detected QEMU death from signal 11 (Segmentation fault) (core dumped)
>> (test program exited with status code -6)
>> TAP parsing error: Too few tests run (expected 61, got 47)
>> ――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――
>>
>> # Start of plain tests
>> # Running /ppc64/migration/multifd/tcp/plain/cancel
>> # Using machine type: pseries-9.2
>> # starting QEMU: exec ./qemu-system-ppc64 -qtest unix:/tmp/qtest-40766.sock -qtest-log /dev/null -chardev socket,path=/tmp/qtest-40766.qmp,id=char0 -mon chardev=char0,mode=control -display none -audio none -accel kvm -accel tcg -machine pseries-9.2,vsmt=8 -name source,debug-threads=on -m 256M -serial file:/tmp/migration-test-H0Z1T2/src_serial -nodefaults -machine cap-cfpc=broken,cap-sbbc=broken,cap-ibs=broken,cap-ccf-assist=off, -bios /tmp/migration-test-H0Z1T2/bootsect    2>/dev/null -accel qtest
>> # starting QEMU: exec ./qemu-system-ppc64 -qtest unix:/tmp/qtest-40766.sock -qtest-log /dev/null -chardev socket,path=/tmp/qtest-40766.qmp,id=char0 -mon chardev=char0,mode=control -display none -audio none -accel kvm -accel tcg -machine pseries-9.2,vsmt=8 -name target,debug-threads=on -m 256M -serial file:/tmp/migration-test-H0Z1T2/dest_serial -incoming defer -nodefaults -machine cap-cfpc=broken,cap-sbbc=broken,cap-ibs=broken,cap-ccf-assist=off, -bios /tmp/migration-test-H0Z1T2/bootsect    2>/dev/null -accel qtest
>> ----------------------------------- stderr -----------------------------------
>> qemu-system-ppc64: Cannot read from TLS channel: The TLS connection was non-properly terminated.
>> warning: fd: migration to a file is deprecated. Use file: instead.
>> warning: fd: migration to a file is deprecated. Use file: instead.
>> ../tests/qtest/libqtest.c:205: kill_qemu() detected QEMU death from signal 11 (Segmentation fault) (core dumped)
>>
>> (test program exited with status code -6)
>> =================================8<================================
>>
>> So.. it's the same test (multifd/tcp/plain/cancel) that is failing on
>> different host / arch being tested.  What is more weird is the two failures
>> are different, the 2nd failure throw out a TLS error even though the test
>> doesn't yet have tls involved.
>
> I think that's just a parallel test being cancelled prematurely, either
> due to the crash or due to the timeout.
>
>>
>> Fabiano, is this the issue you're looking at?
>
> Yes. I can reproduce locally by running 2 processes in parallel: 1 loop
> with make -j$(nproc) check and another loop with tcp/plain/cancel. It
> takes ~1h to hit. I've seen crashes with ppc64, s390 and
> aarch64.
>

Ok, the issue is that after commit 5ef7e26bdb ("migration/multifd: solve
zero page causing multiple page faults"), the multifd code started using
the rb->receivedmap bitmap, which belongs to the ram code and is
initialized and *freed* from the ram SaveVMHandlers.

process_incoming_migration_co()        ...
  qemu_loadvm_state()                  multifd_nocomp_recv()
    qemu_loadvm_state_cleanup()          ramblock_recv_bitmap_set_offset()
      rb->receivedmap = NULL               set_bit_atomic(..., rb->receivedmap)
  ...
  migration_incoming_state_destroy()
    multifd_recv_cleanup()
      multifd_recv_terminate_threads(NULL)

Multifd threads are live until migration_incoming_state_destroy(), which
is called some time later.

>> Peter, do you think it'll be helpful if we temporarily mark this test as
>> "slow" too so it's not run in CI (so we still run it ourselves when prepare
>> migration PR, with the hope that it can reproduce)?
>>
>> Thanks,
Peter Xu Sept. 13, 2024, 3 p.m. UTC | #19
On Thu, Sep 12, 2024 at 07:52:48PM -0300, Fabiano Rosas wrote:
> Fabiano Rosas <farosas@suse.de> writes:
> 
> > Peter Xu <peterx@redhat.com> writes:
> >
> >> On Thu, Sep 12, 2024 at 09:13:16AM +0100, Peter Maydell wrote:
> >>> On Wed, 11 Sept 2024 at 22:26, Fabiano Rosas <farosas@suse.de> wrote:
> >>> > I don't think we're discussing total CI time at this point, so the math
> >>> > doesn't really add up. We're not looking into making the CI finish
> >>> > faster. We're looking into making migration-test finish faster. That
> >>> > would reduce timeouts in CI, speed-up make check and reduce the chance
> >>> > of random race conditions* affecting other people/staging runs.
> >>> 
> >>> Right. The reason migration-test appears on my radar is because
> >>> it is very frequently the thing that shows up as "this sometimes
> >>> just fails or just times out and if you hit retry it goes away
> >>> again". That might not be migration-test's fault specifically,
> >>> because those retries tend to be certain CI configs (s390,
> >>> the i686-tci one), and I have some theories about what might be
> >>> causing it (e.g. build system runs 4 migration-tests in parallel,
> >>> which means 8 QEMU processes which is too many for the number
> >>> of host CPUs). But right now I look at CI job failures and my reaction
> >>> is "oh, it's the migration-test failing yet again" :-(
> >>> 
> >>> For some examples from this week:
> >>> 
> >>> https://gitlab.com/qemu-project/qemu/-/jobs/7802183144
> >>> https://gitlab.com/qemu-project/qemu/-/jobs/7799842373  <--------[1]
> >>> https://gitlab.com/qemu-project/qemu/-/jobs/7786579152  <--------[2]
> >>> https://gitlab.com/qemu-project/qemu/-/jobs/7786579155
> >>
> >> Ah right, the TIMEOUT is unfortunate, especially if tests can be run in
> >> parallel.  It indeed sounds like no good way to finally solve.. I don't
> >> also see how speeding up / reducing tests in migration test would help, as
> >> that's (from some degree..) is the same as tuning the timeout value bigger.
> >> When the tests are less it'll fit into 480s window, but maybe it's too
> >> quick now we wonder whether we should shrink it to e.g. 90s, but then it
> >> can timeout again when on a busy host with less capability of concurrency.
> >>
> >> But indeed there're two ERRORs ([1,2] above)..  I collected some more info
> >> here before the log expires:
> >>
> >> =================================8<================================
> >>
> >> *** /i386/migration/multifd/tcp/plain/cancel, qtest-i386 on s390 host
> >>
> >> https://gitlab.com/qemu-project/qemu/-/jobs/7799842373
> >>
> >> 101/953 qemu:qtest+qtest-i386 / qtest-i386/migration-test                         ERROR          144.32s   killed by signal 6 SIGABRT
> >>>>> QTEST_QEMU_STORAGE_DAEMON_BINARY=./storage-daemon/qemu-storage-daemon G_TEST_DBUS_DAEMON=/home/gitlab-runner/builds/zEr9wY_L/0/qemu-project/qemu/tests/dbus-vmstate-daemon.sh PYTHON=/home/gitlab-runner/builds/zEr9wY_L/0/qemu-project/qemu/build/pyvenv/bin/python3 QTEST_QEMU_IMG=./qemu-img MALLOC_PERTURB_=144 QTEST_QEMU_BINARY=./qemu-system-i386 /home/gitlab-runner/builds/zEr9wY_L/0/qemu-project/qemu/build/tests/qtest/migration-test --tap -k
> >> ――――――――――――――――――――――――――――――――――――― ✀  ―――――――――――――――――――――――――――――――――――――
> >> stderr:
> >> warning: fd: migration to a file is deprecated. Use file: instead.
> >> warning: fd: migration to a file is deprecated. Use file: instead.
> >> ../tests/qtest/libqtest.c:205: kill_qemu() detected QEMU death from signal 11 (Segmentation fault) (core dumped)
> >> (test program exited with status code -6)
> >> TAP parsing error: Too few tests run (expected 53, got 39)
> >> ――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――
> >>
> >> # Start of plain tests
> >> # Running /i386/migration/multifd/tcp/plain/cancel
> >> # Using machine type: pc-i440fx-9.2
> >> # starting QEMU: exec ./qemu-system-i386 -qtest unix:/tmp/qtest-3273509.sock -qtest-log /dev/null -chardev socket,path=/tmp/qtest-3273509.qmp,id=char0 -mon chardev=char0,mode=control -display none -audio none -accel kvm -accel tcg -machine pc-i440fx-9.2, -name source,debug-threads=on -m 150M -serial file:/tmp/migration-test-4112T2/src_serial -drive if=none,id=d0,file=/tmp/migration-test-4112T2/bootsect,format=raw -device ide-hd,drive=d0,secs=1,cyls=1,heads=1    2>/dev/null -accel qtest
> >> # starting QEMU: exec ./qemu-system-i386 -qtest unix:/tmp/qtest-3273509.sock -qtest-log /dev/null -chardev socket,path=/tmp/qtest-3273509.qmp,id=char0 -mon chardev=char0,mode=control -display none -audio none -accel kvm -accel tcg -machine pc-i440fx-9.2, -name target,debug-threads=on -m 150M -serial file:/tmp/migration-test-4112T2/dest_serial -incoming defer -drive if=none,id=d0,file=/tmp/migration-test-4112T2/bootsect,format=raw -device ide-hd,drive=d0,secs=1,cyls=1,heads=1    2>/dev/null -accel qtest
> >> ----------------------------------- stderr -----------------------------------
> >> warning: fd: migration to a file is deprecated. Use file: instead.
> >> warning: fd: migration to a file is deprecated. Use file: instead.
> >> ../tests/qtest/libqtest.c:205: kill_qemu() detected QEMU death from signal 11 (Segmentation fault) (core dumped)
> >>
> >> *** /ppc64/migration/multifd/tcp/plain/cancel, qtest-ppc64 on i686 host
> >>
> >> https://gitlab.com/qemu-project/qemu/-/jobs/7786579152
> >>
> >> 174/315 qemu:qtest+qtest-ppc64 / qtest-ppc64/migration-test                       ERROR          381.00s   killed by signal 6 SIGABRT
> >>>>> PYTHON=/builds/qemu-project/qemu/build/pyvenv/bin/python3 QTEST_QEMU_IMG=./qemu-img G_TEST_DBUS_DAEMON=/builds/qemu-project/qemu/tests/dbus-vmstate-daemon.sh QTEST_QEMU_BINARY=./qemu-system-ppc64 MALLOC_PERTURB_=178 QTEST_QEMU_STORAGE_DAEMON_BINARY=./storage-daemon/qemu-storage-daemon /builds/qemu-project/qemu/build/tests/qtest/migration-test --tap -k
> >> ――――――――――――――――――――――――――――――――――――― ✀  ―――――――――――――――――――――――――――――――――――――
> >> stderr:
> >> qemu-system-ppc64: Cannot read from TLS channel: The TLS connection was non-properly terminated.
> >> warning: fd: migration to a file is deprecated. Use file: instead.
> >> warning: fd: migration to a file is deprecated. Use file: instead.
> >> ../tests/qtest/libqtest.c:205: kill_qemu() detected QEMU death from signal 11 (Segmentation fault) (core dumped)
> >> (test program exited with status code -6)
> >> TAP parsing error: Too few tests run (expected 61, got 47)
> >> ――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――
> >>
> >> # Start of plain tests
> >> # Running /ppc64/migration/multifd/tcp/plain/cancel
> >> # Using machine type: pseries-9.2
> >> # starting QEMU: exec ./qemu-system-ppc64 -qtest unix:/tmp/qtest-40766.sock -qtest-log /dev/null -chardev socket,path=/tmp/qtest-40766.qmp,id=char0 -mon chardev=char0,mode=control -display none -audio none -accel kvm -accel tcg -machine pseries-9.2,vsmt=8 -name source,debug-threads=on -m 256M -serial file:/tmp/migration-test-H0Z1T2/src_serial -nodefaults -machine cap-cfpc=broken,cap-sbbc=broken,cap-ibs=broken,cap-ccf-assist=off, -bios /tmp/migration-test-H0Z1T2/bootsect    2>/dev/null -accel qtest
> >> # starting QEMU: exec ./qemu-system-ppc64 -qtest unix:/tmp/qtest-40766.sock -qtest-log /dev/null -chardev socket,path=/tmp/qtest-40766.qmp,id=char0 -mon chardev=char0,mode=control -display none -audio none -accel kvm -accel tcg -machine pseries-9.2,vsmt=8 -name target,debug-threads=on -m 256M -serial file:/tmp/migration-test-H0Z1T2/dest_serial -incoming defer -nodefaults -machine cap-cfpc=broken,cap-sbbc=broken,cap-ibs=broken,cap-ccf-assist=off, -bios /tmp/migration-test-H0Z1T2/bootsect    2>/dev/null -accel qtest
> >> ----------------------------------- stderr -----------------------------------
> >> qemu-system-ppc64: Cannot read from TLS channel: The TLS connection was non-properly terminated.
> >> warning: fd: migration to a file is deprecated. Use file: instead.
> >> warning: fd: migration to a file is deprecated. Use file: instead.
> >> ../tests/qtest/libqtest.c:205: kill_qemu() detected QEMU death from signal 11 (Segmentation fault) (core dumped)
> >>
> >> (test program exited with status code -6)
> >> =================================8<================================
> >>
> >> So.. it's the same test (multifd/tcp/plain/cancel) that is failing on
> >> different host / arch being tested.  What is more weird is the two failures
> >> are different, the 2nd failure throw out a TLS error even though the test
> >> doesn't yet have tls involved.
> >
> > I think that's just a parallel test being cancelled prematurely, either
> > due to the crash or due to the timeout.
> >
> >>
> >> Fabiano, is this the issue you're looking at?
> >
> > Yes. I can reproduce locally by running 2 processes in parallel: 1 loop
> > with make -j$(nproc) check and another loop with tcp/plain/cancel. It
> > takes ~1h to hit. I've seen crashes with ppc64, s390 and
> > aarch64.
> >
> 
> Ok, the issue is that after commit 5ef7e26bdb ("migration/multifd: solve
> zero page causing multiple page faults"), the multifd code started using
> the rb->receivedmap bitmap, which belongs to the ram code and is
> initialized and *freed* from the ram SaveVMHandlers.
> 
> process_incoming_migration_co()        ...
>   qemu_loadvm_state()                  multifd_nocomp_recv()
>     qemu_loadvm_state_cleanup()          ramblock_recv_bitmap_set_offset()
>       rb->receivedmap = NULL               set_bit_atomic(..., rb->receivedmap)
>   ...
>   migration_incoming_state_destroy()
>     multifd_recv_cleanup()
>       multifd_recv_terminate_threads(NULL)
> 
> Multifd threads are live until migration_incoming_state_destroy(), which
> is called some time later.

Thanks for the debugging.  Hmm I would expect loadvm should wait until all
ram is received somehow..

And when looking, I also found we're ambiguous on the cleanups
now.. probably a separate issue that not yet cause crashes.  I meant we
even don't have a consistent order to clean these in precopy / postcopy,
as:

We have this:

  qemu_loadvm_state
    qemu_loadvm_state_cleanup              <------------- [1]
  migration_bh_schedule(process_incoming_migration_bh, mis);
  (then in bh...)
  process_incoming_migration_bh
    migration_incoming_state_destroy       <------------- [2]

But then:
  
  postcopy_ram_listen_thread
    migration_incoming_state_destroy       <------------- [2]
    qemu_loadvm_state_cleanup              <------------- [1]

I think it makes more sense to do [2] after [1], so maybe postcopy needs
changing too..

> 
> >> Peter, do you think it'll be helpful if we temporarily mark this test as
> >> "slow" too so it's not run in CI (so we still run it ourselves when prepare
> >> migration PR, with the hope that it can reproduce)?
> >>
> >> Thanks,
>
Peter Xu Sept. 13, 2024, 3:02 p.m. UTC | #20
On Thu, Sep 12, 2024 at 04:14:20PM +0100, Peter Maydell wrote:
> On Thu, 12 Sept 2024 at 16:09, Peter Xu <peterx@redhat.com> wrote:
> >
> > On Thu, Sep 12, 2024 at 09:13:16AM +0100, Peter Maydell wrote:
> > > On Wed, 11 Sept 2024 at 22:26, Fabiano Rosas <farosas@suse.de> wrote:
> > > > I don't think we're discussing total CI time at this point, so the math
> > > > doesn't really add up. We're not looking into making the CI finish
> > > > faster. We're looking into making migration-test finish faster. That
> > > > would reduce timeouts in CI, speed-up make check and reduce the chance
> > > > of random race conditions* affecting other people/staging runs.
> > >
> > > Right. The reason migration-test appears on my radar is because
> > > it is very frequently the thing that shows up as "this sometimes
> > > just fails or just times out and if you hit retry it goes away
> > > again". That might not be migration-test's fault specifically,
> > > because those retries tend to be certain CI configs (s390,
> > > the i686-tci one), and I have some theories about what might be
> > > causing it (e.g. build system runs 4 migration-tests in parallel,
> > > which means 8 QEMU processes which is too many for the number
> > > of host CPUs). But right now I look at CI job failures and my reaction
> > > is "oh, it's the migration-test failing yet again" :-(
> > >
> > > For some examples from this week:
> > >
> > > https://gitlab.com/qemu-project/qemu/-/jobs/7802183144
> > > https://gitlab.com/qemu-project/qemu/-/jobs/7799842373  <--------[1]
> > > https://gitlab.com/qemu-project/qemu/-/jobs/7786579152  <--------[2]
> > > https://gitlab.com/qemu-project/qemu/-/jobs/7786579155
> >
> > Ah right, the TIMEOUT is unfortunate, especially if tests can be run in
> > parallel.  It indeed sounds like no good way to finally solve.. I don't
> > also see how speeding up / reducing tests in migration test would help, as
> > that's (from some degree..) is the same as tuning the timeout value bigger.
> > When the tests are less it'll fit into 480s window, but maybe it's too
> > quick now we wonder whether we should shrink it to e.g. 90s, but then it
> > can timeout again when on a busy host with less capability of concurrency.
> 
> For the TIMEOUT part on cross-i686-tci I plan to try this patch:
> https://patchew.org/QEMU/20240912151003.2045031-1-peter.maydell@linaro.org/
> which makes 'make check' single-threaded; that will help to see
> if the parallelism is a problem. (If it is then we might want
> to do a more generalised approach rather than just for that one
> CI job.)

Sounds good.

> 
> > But indeed there're two ERRORs ([1,2] above)..  I collected some more info
> > here before the log expires:
> 
> > So.. it's the same test (multifd/tcp/plain/cancel) that is failing on
> > different host / arch being tested.  What is more weird is the two failures
> > are different, the 2nd failure throw out a TLS error even though the test
> > doesn't yet have tls involved.
> >
> > Fabiano, is this the issue you're looking at?
> >
> > Peter, do you think it'll be helpful if we temporarily mark this test as
> > "slow" too so it's not run in CI (so we still run it ourselves when prepare
> > migration PR, with the hope that it can reproduce)?
> 
> If you think that specific test is flaky then I think that's
> probably a good idea. As usual with this kind of thing,
> probably best to have a comment next to the test noting
> why and with a URL to a gitlab issue for it, so we don't
> forget why we disabled the test.

Looks like Fabiano root-caused the issue.  We'll see how that goes, or we
can prepare a patch to make it optional with the comments in place.

Thanks,
Fabiano Rosas Sept. 13, 2024, 3:09 p.m. UTC | #21
Peter Xu <peterx@redhat.com> writes:

> On Thu, Sep 12, 2024 at 07:52:48PM -0300, Fabiano Rosas wrote:
>> Fabiano Rosas <farosas@suse.de> writes:
>> 
>> > Peter Xu <peterx@redhat.com> writes:
>> >
>> >> On Thu, Sep 12, 2024 at 09:13:16AM +0100, Peter Maydell wrote:
>> >>> On Wed, 11 Sept 2024 at 22:26, Fabiano Rosas <farosas@suse.de> wrote:
>> >>> > I don't think we're discussing total CI time at this point, so the math
>> >>> > doesn't really add up. We're not looking into making the CI finish
>> >>> > faster. We're looking into making migration-test finish faster. That
>> >>> > would reduce timeouts in CI, speed-up make check and reduce the chance
>> >>> > of random race conditions* affecting other people/staging runs.
>> >>> 
>> >>> Right. The reason migration-test appears on my radar is because
>> >>> it is very frequently the thing that shows up as "this sometimes
>> >>> just fails or just times out and if you hit retry it goes away
>> >>> again". That might not be migration-test's fault specifically,
>> >>> because those retries tend to be certain CI configs (s390,
>> >>> the i686-tci one), and I have some theories about what might be
>> >>> causing it (e.g. build system runs 4 migration-tests in parallel,
>> >>> which means 8 QEMU processes which is too many for the number
>> >>> of host CPUs). But right now I look at CI job failures and my reaction
>> >>> is "oh, it's the migration-test failing yet again" :-(
>> >>> 
>> >>> For some examples from this week:
>> >>> 
>> >>> https://gitlab.com/qemu-project/qemu/-/jobs/7802183144
>> >>> https://gitlab.com/qemu-project/qemu/-/jobs/7799842373  <--------[1]
>> >>> https://gitlab.com/qemu-project/qemu/-/jobs/7786579152  <--------[2]
>> >>> https://gitlab.com/qemu-project/qemu/-/jobs/7786579155
>> >>
>> >> Ah right, the TIMEOUT is unfortunate, especially if tests can be run in
>> >> parallel.  It indeed sounds like no good way to finally solve.. I don't
>> >> also see how speeding up / reducing tests in migration test would help, as
>> >> that's (from some degree..) is the same as tuning the timeout value bigger.
>> >> When the tests are less it'll fit into 480s window, but maybe it's too
>> >> quick now we wonder whether we should shrink it to e.g. 90s, but then it
>> >> can timeout again when on a busy host with less capability of concurrency.
>> >>
>> >> But indeed there're two ERRORs ([1,2] above)..  I collected some more info
>> >> here before the log expires:
>> >>
>> >> =================================8<================================
>> >>
>> >> *** /i386/migration/multifd/tcp/plain/cancel, qtest-i386 on s390 host
>> >>
>> >> https://gitlab.com/qemu-project/qemu/-/jobs/7799842373
>> >>
>> >> 101/953 qemu:qtest+qtest-i386 / qtest-i386/migration-test                         ERROR          144.32s   killed by signal 6 SIGABRT
>> >>>>> QTEST_QEMU_STORAGE_DAEMON_BINARY=./storage-daemon/qemu-storage-daemon G_TEST_DBUS_DAEMON=/home/gitlab-runner/builds/zEr9wY_L/0/qemu-project/qemu/tests/dbus-vmstate-daemon.sh PYTHON=/home/gitlab-runner/builds/zEr9wY_L/0/qemu-project/qemu/build/pyvenv/bin/python3 QTEST_QEMU_IMG=./qemu-img MALLOC_PERTURB_=144 QTEST_QEMU_BINARY=./qemu-system-i386 /home/gitlab-runner/builds/zEr9wY_L/0/qemu-project/qemu/build/tests/qtest/migration-test --tap -k
>> >> ――――――――――――――――――――――――――――――――――――― ✀  ―――――――――――――――――――――――――――――――――――――
>> >> stderr:
>> >> warning: fd: migration to a file is deprecated. Use file: instead.
>> >> warning: fd: migration to a file is deprecated. Use file: instead.
>> >> ../tests/qtest/libqtest.c:205: kill_qemu() detected QEMU death from signal 11 (Segmentation fault) (core dumped)
>> >> (test program exited with status code -6)
>> >> TAP parsing error: Too few tests run (expected 53, got 39)
>> >> ――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――
>> >>
>> >> # Start of plain tests
>> >> # Running /i386/migration/multifd/tcp/plain/cancel
>> >> # Using machine type: pc-i440fx-9.2
>> >> # starting QEMU: exec ./qemu-system-i386 -qtest unix:/tmp/qtest-3273509.sock -qtest-log /dev/null -chardev socket,path=/tmp/qtest-3273509.qmp,id=char0 -mon chardev=char0,mode=control -display none -audio none -accel kvm -accel tcg -machine pc-i440fx-9.2, -name source,debug-threads=on -m 150M -serial file:/tmp/migration-test-4112T2/src_serial -drive if=none,id=d0,file=/tmp/migration-test-4112T2/bootsect,format=raw -device ide-hd,drive=d0,secs=1,cyls=1,heads=1    2>/dev/null -accel qtest
>> >> # starting QEMU: exec ./qemu-system-i386 -qtest unix:/tmp/qtest-3273509.sock -qtest-log /dev/null -chardev socket,path=/tmp/qtest-3273509.qmp,id=char0 -mon chardev=char0,mode=control -display none -audio none -accel kvm -accel tcg -machine pc-i440fx-9.2, -name target,debug-threads=on -m 150M -serial file:/tmp/migration-test-4112T2/dest_serial -incoming defer -drive if=none,id=d0,file=/tmp/migration-test-4112T2/bootsect,format=raw -device ide-hd,drive=d0,secs=1,cyls=1,heads=1    2>/dev/null -accel qtest
>> >> ----------------------------------- stderr -----------------------------------
>> >> warning: fd: migration to a file is deprecated. Use file: instead.
>> >> warning: fd: migration to a file is deprecated. Use file: instead.
>> >> ../tests/qtest/libqtest.c:205: kill_qemu() detected QEMU death from signal 11 (Segmentation fault) (core dumped)
>> >>
>> >> *** /ppc64/migration/multifd/tcp/plain/cancel, qtest-ppc64 on i686 host
>> >>
>> >> https://gitlab.com/qemu-project/qemu/-/jobs/7786579152
>> >>
>> >> 174/315 qemu:qtest+qtest-ppc64 / qtest-ppc64/migration-test                       ERROR          381.00s   killed by signal 6 SIGABRT
>> >>>>> PYTHON=/builds/qemu-project/qemu/build/pyvenv/bin/python3 QTEST_QEMU_IMG=./qemu-img G_TEST_DBUS_DAEMON=/builds/qemu-project/qemu/tests/dbus-vmstate-daemon.sh QTEST_QEMU_BINARY=./qemu-system-ppc64 MALLOC_PERTURB_=178 QTEST_QEMU_STORAGE_DAEMON_BINARY=./storage-daemon/qemu-storage-daemon /builds/qemu-project/qemu/build/tests/qtest/migration-test --tap -k
>> >> ――――――――――――――――――――――――――――――――――――― ✀  ―――――――――――――――――――――――――――――――――――――
>> >> stderr:
>> >> qemu-system-ppc64: Cannot read from TLS channel: The TLS connection was non-properly terminated.
>> >> warning: fd: migration to a file is deprecated. Use file: instead.
>> >> warning: fd: migration to a file is deprecated. Use file: instead.
>> >> ../tests/qtest/libqtest.c:205: kill_qemu() detected QEMU death from signal 11 (Segmentation fault) (core dumped)
>> >> (test program exited with status code -6)
>> >> TAP parsing error: Too few tests run (expected 61, got 47)
>> >> ――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――
>> >>
>> >> # Start of plain tests
>> >> # Running /ppc64/migration/multifd/tcp/plain/cancel
>> >> # Using machine type: pseries-9.2
>> >> # starting QEMU: exec ./qemu-system-ppc64 -qtest unix:/tmp/qtest-40766.sock -qtest-log /dev/null -chardev socket,path=/tmp/qtest-40766.qmp,id=char0 -mon chardev=char0,mode=control -display none -audio none -accel kvm -accel tcg -machine pseries-9.2,vsmt=8 -name source,debug-threads=on -m 256M -serial file:/tmp/migration-test-H0Z1T2/src_serial -nodefaults -machine cap-cfpc=broken,cap-sbbc=broken,cap-ibs=broken,cap-ccf-assist=off, -bios /tmp/migration-test-H0Z1T2/bootsect    2>/dev/null -accel qtest
>> >> # starting QEMU: exec ./qemu-system-ppc64 -qtest unix:/tmp/qtest-40766.sock -qtest-log /dev/null -chardev socket,path=/tmp/qtest-40766.qmp,id=char0 -mon chardev=char0,mode=control -display none -audio none -accel kvm -accel tcg -machine pseries-9.2,vsmt=8 -name target,debug-threads=on -m 256M -serial file:/tmp/migration-test-H0Z1T2/dest_serial -incoming defer -nodefaults -machine cap-cfpc=broken,cap-sbbc=broken,cap-ibs=broken,cap-ccf-assist=off, -bios /tmp/migration-test-H0Z1T2/bootsect    2>/dev/null -accel qtest
>> >> ----------------------------------- stderr -----------------------------------
>> >> qemu-system-ppc64: Cannot read from TLS channel: The TLS connection was non-properly terminated.
>> >> warning: fd: migration to a file is deprecated. Use file: instead.
>> >> warning: fd: migration to a file is deprecated. Use file: instead.
>> >> ../tests/qtest/libqtest.c:205: kill_qemu() detected QEMU death from signal 11 (Segmentation fault) (core dumped)
>> >>
>> >> (test program exited with status code -6)
>> >> =================================8<================================
>> >>
>> >> So.. it's the same test (multifd/tcp/plain/cancel) that is failing on
>> >> different host / arch being tested.  What is more weird is the two failures
>> >> are different, the 2nd failure throw out a TLS error even though the test
>> >> doesn't yet have tls involved.
>> >
>> > I think that's just a parallel test being cancelled prematurely, either
>> > due to the crash or due to the timeout.
>> >
>> >>
>> >> Fabiano, is this the issue you're looking at?
>> >
>> > Yes. I can reproduce locally by running 2 processes in parallel: 1 loop
>> > with make -j$(nproc) check and another loop with tcp/plain/cancel. It
>> > takes ~1h to hit. I've seen crashes with ppc64, s390 and
>> > aarch64.
>> >
>> 
>> Ok, the issue is that after commit 5ef7e26bdb ("migration/multifd: solve
>> zero page causing multiple page faults"), the multifd code started using
>> the rb->receivedmap bitmap, which belongs to the ram code and is
>> initialized and *freed* from the ram SaveVMHandlers.
>> 
>> process_incoming_migration_co()        ...
>>   qemu_loadvm_state()                  multifd_nocomp_recv()
>>     qemu_loadvm_state_cleanup()          ramblock_recv_bitmap_set_offset()
>>       rb->receivedmap = NULL               set_bit_atomic(..., rb->receivedmap)
>>   ...
>>   migration_incoming_state_destroy()
>>     multifd_recv_cleanup()
>>       multifd_recv_terminate_threads(NULL)
>> 
>> Multifd threads are live until migration_incoming_state_destroy(), which
>> is called some time later.
>
> Thanks for the debugging.  Hmm I would expect loadvm should wait until all
> ram is received somehow..

Looks like a similar issue as when we didn't have the multifd_send sync
working correctly and ram code would run and do cleanup.

>
> And when looking, I also found we're ambiguous on the cleanups
> now.. probably a separate issue that not yet cause crashes.  I meant we
> even don't have a consistent order to clean these in precopy / postcopy,
> as:
>
> We have this:
>
>   qemu_loadvm_state
>     qemu_loadvm_state_cleanup              <------------- [1]
>   migration_bh_schedule(process_incoming_migration_bh, mis);
>   (then in bh...)
>   process_incoming_migration_bh
>     migration_incoming_state_destroy       <------------- [2]
>
> But then:
>   
>   postcopy_ram_listen_thread
>     migration_incoming_state_destroy       <------------- [2]
>     qemu_loadvm_state_cleanup              <------------- [1]
>
> I think it makes more sense to do [2] after [1], so maybe postcopy needs
> changing too..

Yes, ram_load_cleanup doesn't do much:

static int ram_load_cleanup(void *opaque)
{
    RAMBlock *rb;

    RAMBLOCK_FOREACH_NOT_IGNORED(rb) {
        qemu_ram_block_writeback(rb);
    }

    xbzrle_load_cleanup();

    return 0;
}

We can probably change the order just fine.

For the bug I'm now trying to add a helper to init/clean the receivedmap
and only call it from code that uses it (postcopy, multifd_recv),
instead of always init/clean along with ram_load_setup/cleanup.

>
>> 
>> >> Peter, do you think it'll be helpful if we temporarily mark this test as
>> >> "slow" too so it's not run in CI (so we still run it ourselves when prepare
>> >> migration PR, with the hope that it can reproduce)?
>> >>
>> >> Thanks,
>>
Fabiano Rosas Sept. 13, 2024, 3:17 p.m. UTC | #22
Fabiano Rosas <farosas@suse.de> writes:

> Peter Xu <peterx@redhat.com> writes:
>
>> On Thu, Sep 12, 2024 at 07:52:48PM -0300, Fabiano Rosas wrote:
>>> Fabiano Rosas <farosas@suse.de> writes:
>>> 
>>> > Peter Xu <peterx@redhat.com> writes:
>>> >
>>> >> On Thu, Sep 12, 2024 at 09:13:16AM +0100, Peter Maydell wrote:
>>> >>> On Wed, 11 Sept 2024 at 22:26, Fabiano Rosas <farosas@suse.de> wrote:
>>> >>> > I don't think we're discussing total CI time at this point, so the math
>>> >>> > doesn't really add up. We're not looking into making the CI finish
>>> >>> > faster. We're looking into making migration-test finish faster. That
>>> >>> > would reduce timeouts in CI, speed-up make check and reduce the chance
>>> >>> > of random race conditions* affecting other people/staging runs.
>>> >>> 
>>> >>> Right. The reason migration-test appears on my radar is because
>>> >>> it is very frequently the thing that shows up as "this sometimes
>>> >>> just fails or just times out and if you hit retry it goes away
>>> >>> again". That might not be migration-test's fault specifically,
>>> >>> because those retries tend to be certain CI configs (s390,
>>> >>> the i686-tci one), and I have some theories about what might be
>>> >>> causing it (e.g. build system runs 4 migration-tests in parallel,
>>> >>> which means 8 QEMU processes which is too many for the number
>>> >>> of host CPUs). But right now I look at CI job failures and my reaction
>>> >>> is "oh, it's the migration-test failing yet again" :-(
>>> >>> 
>>> >>> For some examples from this week:
>>> >>> 
>>> >>> https://gitlab.com/qemu-project/qemu/-/jobs/7802183144
>>> >>> https://gitlab.com/qemu-project/qemu/-/jobs/7799842373  <--------[1]
>>> >>> https://gitlab.com/qemu-project/qemu/-/jobs/7786579152  <--------[2]
>>> >>> https://gitlab.com/qemu-project/qemu/-/jobs/7786579155
>>> >>
>>> >> Ah right, the TIMEOUT is unfortunate, especially if tests can be run in
>>> >> parallel.  It indeed sounds like no good way to finally solve.. I don't
>>> >> also see how speeding up / reducing tests in migration test would help, as
>>> >> that's (from some degree..) is the same as tuning the timeout value bigger.
>>> >> When the tests are less it'll fit into 480s window, but maybe it's too
>>> >> quick now we wonder whether we should shrink it to e.g. 90s, but then it
>>> >> can timeout again when on a busy host with less capability of concurrency.
>>> >>
>>> >> But indeed there're two ERRORs ([1,2] above)..  I collected some more info
>>> >> here before the log expires:
>>> >>
>>> >> =================================8<================================
>>> >>
>>> >> *** /i386/migration/multifd/tcp/plain/cancel, qtest-i386 on s390 host
>>> >>
>>> >> https://gitlab.com/qemu-project/qemu/-/jobs/7799842373
>>> >>
>>> >> 101/953 qemu:qtest+qtest-i386 / qtest-i386/migration-test                         ERROR          144.32s   killed by signal 6 SIGABRT
>>> >>>>> QTEST_QEMU_STORAGE_DAEMON_BINARY=./storage-daemon/qemu-storage-daemon G_TEST_DBUS_DAEMON=/home/gitlab-runner/builds/zEr9wY_L/0/qemu-project/qemu/tests/dbus-vmstate-daemon.sh PYTHON=/home/gitlab-runner/builds/zEr9wY_L/0/qemu-project/qemu/build/pyvenv/bin/python3 QTEST_QEMU_IMG=./qemu-img MALLOC_PERTURB_=144 QTEST_QEMU_BINARY=./qemu-system-i386 /home/gitlab-runner/builds/zEr9wY_L/0/qemu-project/qemu/build/tests/qtest/migration-test --tap -k
>>> >> ――――――――――――――――――――――――――――――――――――― ✀  ―――――――――――――――――――――――――――――――――――――
>>> >> stderr:
>>> >> warning: fd: migration to a file is deprecated. Use file: instead.
>>> >> warning: fd: migration to a file is deprecated. Use file: instead.
>>> >> ../tests/qtest/libqtest.c:205: kill_qemu() detected QEMU death from signal 11 (Segmentation fault) (core dumped)
>>> >> (test program exited with status code -6)
>>> >> TAP parsing error: Too few tests run (expected 53, got 39)
>>> >> ――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――
>>> >>
>>> >> # Start of plain tests
>>> >> # Running /i386/migration/multifd/tcp/plain/cancel
>>> >> # Using machine type: pc-i440fx-9.2
>>> >> # starting QEMU: exec ./qemu-system-i386 -qtest unix:/tmp/qtest-3273509.sock -qtest-log /dev/null -chardev socket,path=/tmp/qtest-3273509.qmp,id=char0 -mon chardev=char0,mode=control -display none -audio none -accel kvm -accel tcg -machine pc-i440fx-9.2, -name source,debug-threads=on -m 150M -serial file:/tmp/migration-test-4112T2/src_serial -drive if=none,id=d0,file=/tmp/migration-test-4112T2/bootsect,format=raw -device ide-hd,drive=d0,secs=1,cyls=1,heads=1    2>/dev/null -accel qtest
>>> >> # starting QEMU: exec ./qemu-system-i386 -qtest unix:/tmp/qtest-3273509.sock -qtest-log /dev/null -chardev socket,path=/tmp/qtest-3273509.qmp,id=char0 -mon chardev=char0,mode=control -display none -audio none -accel kvm -accel tcg -machine pc-i440fx-9.2, -name target,debug-threads=on -m 150M -serial file:/tmp/migration-test-4112T2/dest_serial -incoming defer -drive if=none,id=d0,file=/tmp/migration-test-4112T2/bootsect,format=raw -device ide-hd,drive=d0,secs=1,cyls=1,heads=1    2>/dev/null -accel qtest
>>> >> ----------------------------------- stderr -----------------------------------
>>> >> warning: fd: migration to a file is deprecated. Use file: instead.
>>> >> warning: fd: migration to a file is deprecated. Use file: instead.
>>> >> ../tests/qtest/libqtest.c:205: kill_qemu() detected QEMU death from signal 11 (Segmentation fault) (core dumped)
>>> >>
>>> >> *** /ppc64/migration/multifd/tcp/plain/cancel, qtest-ppc64 on i686 host
>>> >>
>>> >> https://gitlab.com/qemu-project/qemu/-/jobs/7786579152
>>> >>
>>> >> 174/315 qemu:qtest+qtest-ppc64 / qtest-ppc64/migration-test                       ERROR          381.00s   killed by signal 6 SIGABRT
>>> >>>>> PYTHON=/builds/qemu-project/qemu/build/pyvenv/bin/python3 QTEST_QEMU_IMG=./qemu-img G_TEST_DBUS_DAEMON=/builds/qemu-project/qemu/tests/dbus-vmstate-daemon.sh QTEST_QEMU_BINARY=./qemu-system-ppc64 MALLOC_PERTURB_=178 QTEST_QEMU_STORAGE_DAEMON_BINARY=./storage-daemon/qemu-storage-daemon /builds/qemu-project/qemu/build/tests/qtest/migration-test --tap -k
>>> >> ――――――――――――――――――――――――――――――――――――― ✀  ―――――――――――――――――――――――――――――――――――――
>>> >> stderr:
>>> >> qemu-system-ppc64: Cannot read from TLS channel: The TLS connection was non-properly terminated.
>>> >> warning: fd: migration to a file is deprecated. Use file: instead.
>>> >> warning: fd: migration to a file is deprecated. Use file: instead.
>>> >> ../tests/qtest/libqtest.c:205: kill_qemu() detected QEMU death from signal 11 (Segmentation fault) (core dumped)
>>> >> (test program exited with status code -6)
>>> >> TAP parsing error: Too few tests run (expected 61, got 47)
>>> >> ――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――
>>> >>
>>> >> # Start of plain tests
>>> >> # Running /ppc64/migration/multifd/tcp/plain/cancel
>>> >> # Using machine type: pseries-9.2
>>> >> # starting QEMU: exec ./qemu-system-ppc64 -qtest unix:/tmp/qtest-40766.sock -qtest-log /dev/null -chardev socket,path=/tmp/qtest-40766.qmp,id=char0 -mon chardev=char0,mode=control -display none -audio none -accel kvm -accel tcg -machine pseries-9.2,vsmt=8 -name source,debug-threads=on -m 256M -serial file:/tmp/migration-test-H0Z1T2/src_serial -nodefaults -machine cap-cfpc=broken,cap-sbbc=broken,cap-ibs=broken,cap-ccf-assist=off, -bios /tmp/migration-test-H0Z1T2/bootsect    2>/dev/null -accel qtest
>>> >> # starting QEMU: exec ./qemu-system-ppc64 -qtest unix:/tmp/qtest-40766.sock -qtest-log /dev/null -chardev socket,path=/tmp/qtest-40766.qmp,id=char0 -mon chardev=char0,mode=control -display none -audio none -accel kvm -accel tcg -machine pseries-9.2,vsmt=8 -name target,debug-threads=on -m 256M -serial file:/tmp/migration-test-H0Z1T2/dest_serial -incoming defer -nodefaults -machine cap-cfpc=broken,cap-sbbc=broken,cap-ibs=broken,cap-ccf-assist=off, -bios /tmp/migration-test-H0Z1T2/bootsect    2>/dev/null -accel qtest
>>> >> ----------------------------------- stderr -----------------------------------
>>> >> qemu-system-ppc64: Cannot read from TLS channel: The TLS connection was non-properly terminated.
>>> >> warning: fd: migration to a file is deprecated. Use file: instead.
>>> >> warning: fd: migration to a file is deprecated. Use file: instead.
>>> >> ../tests/qtest/libqtest.c:205: kill_qemu() detected QEMU death from signal 11 (Segmentation fault) (core dumped)
>>> >>
>>> >> (test program exited with status code -6)
>>> >> =================================8<================================
>>> >>
>>> >> So.. it's the same test (multifd/tcp/plain/cancel) that is failing on
>>> >> different host / arch being tested.  What is more weird is the two failures
>>> >> are different, the 2nd failure throw out a TLS error even though the test
>>> >> doesn't yet have tls involved.
>>> >
>>> > I think that's just a parallel test being cancelled prematurely, either
>>> > due to the crash or due to the timeout.
>>> >
>>> >>
>>> >> Fabiano, is this the issue you're looking at?
>>> >
>>> > Yes. I can reproduce locally by running 2 processes in parallel: 1 loop
>>> > with make -j$(nproc) check and another loop with tcp/plain/cancel. It
>>> > takes ~1h to hit. I've seen crashes with ppc64, s390 and
>>> > aarch64.
>>> >
>>> 
>>> Ok, the issue is that after commit 5ef7e26bdb ("migration/multifd: solve
>>> zero page causing multiple page faults"), the multifd code started using
>>> the rb->receivedmap bitmap, which belongs to the ram code and is
>>> initialized and *freed* from the ram SaveVMHandlers.
>>> 
>>> process_incoming_migration_co()        ...
>>>   qemu_loadvm_state()                  multifd_nocomp_recv()
>>>     qemu_loadvm_state_cleanup()          ramblock_recv_bitmap_set_offset()
>>>       rb->receivedmap = NULL               set_bit_atomic(..., rb->receivedmap)
>>>   ...
>>>   migration_incoming_state_destroy()
>>>     multifd_recv_cleanup()
>>>       multifd_recv_terminate_threads(NULL)
>>> 
>>> Multifd threads are live until migration_incoming_state_destroy(), which
>>> is called some time later.
>>
>> Thanks for the debugging.  Hmm I would expect loadvm should wait until all
>> ram is received somehow..
>
> Looks like a similar issue as when we didn't have the multifd_send sync
> working correctly and ram code would run and do cleanup.

Btw, this is hard to debug, but I bet what's happening is that the
ram_load code itself is exiting due to qemufile error. So there wouldn't
be a way to make it wait for multifd.
Peter Xu Sept. 13, 2024, 3:38 p.m. UTC | #23
On Fri, Sep 13, 2024 at 12:17:40PM -0300, Fabiano Rosas wrote:
> Fabiano Rosas <farosas@suse.de> writes:
> 
> > Peter Xu <peterx@redhat.com> writes:
> >
> >> On Thu, Sep 12, 2024 at 07:52:48PM -0300, Fabiano Rosas wrote:
> >>> Fabiano Rosas <farosas@suse.de> writes:
> >>> 
> >>> > Peter Xu <peterx@redhat.com> writes:
> >>> >
> >>> >> On Thu, Sep 12, 2024 at 09:13:16AM +0100, Peter Maydell wrote:
> >>> >>> On Wed, 11 Sept 2024 at 22:26, Fabiano Rosas <farosas@suse.de> wrote:
> >>> >>> > I don't think we're discussing total CI time at this point, so the math
> >>> >>> > doesn't really add up. We're not looking into making the CI finish
> >>> >>> > faster. We're looking into making migration-test finish faster. That
> >>> >>> > would reduce timeouts in CI, speed-up make check and reduce the chance
> >>> >>> > of random race conditions* affecting other people/staging runs.
> >>> >>> 
> >>> >>> Right. The reason migration-test appears on my radar is because
> >>> >>> it is very frequently the thing that shows up as "this sometimes
> >>> >>> just fails or just times out and if you hit retry it goes away
> >>> >>> again". That might not be migration-test's fault specifically,
> >>> >>> because those retries tend to be certain CI configs (s390,
> >>> >>> the i686-tci one), and I have some theories about what might be
> >>> >>> causing it (e.g. build system runs 4 migration-tests in parallel,
> >>> >>> which means 8 QEMU processes which is too many for the number
> >>> >>> of host CPUs). But right now I look at CI job failures and my reaction
> >>> >>> is "oh, it's the migration-test failing yet again" :-(
> >>> >>> 
> >>> >>> For some examples from this week:
> >>> >>> 
> >>> >>> https://gitlab.com/qemu-project/qemu/-/jobs/7802183144
> >>> >>> https://gitlab.com/qemu-project/qemu/-/jobs/7799842373  <--------[1]
> >>> >>> https://gitlab.com/qemu-project/qemu/-/jobs/7786579152  <--------[2]
> >>> >>> https://gitlab.com/qemu-project/qemu/-/jobs/7786579155
> >>> >>
> >>> >> Ah right, the TIMEOUT is unfortunate, especially if tests can be run in
> >>> >> parallel.  It indeed sounds like no good way to finally solve.. I don't
> >>> >> also see how speeding up / reducing tests in migration test would help, as
> >>> >> that's (from some degree..) is the same as tuning the timeout value bigger.
> >>> >> When the tests are less it'll fit into 480s window, but maybe it's too
> >>> >> quick now we wonder whether we should shrink it to e.g. 90s, but then it
> >>> >> can timeout again when on a busy host with less capability of concurrency.
> >>> >>
> >>> >> But indeed there're two ERRORs ([1,2] above)..  I collected some more info
> >>> >> here before the log expires:
> >>> >>
> >>> >> =================================8<================================
> >>> >>
> >>> >> *** /i386/migration/multifd/tcp/plain/cancel, qtest-i386 on s390 host
> >>> >>
> >>> >> https://gitlab.com/qemu-project/qemu/-/jobs/7799842373
> >>> >>
> >>> >> 101/953 qemu:qtest+qtest-i386 / qtest-i386/migration-test                         ERROR          144.32s   killed by signal 6 SIGABRT
> >>> >>>>> QTEST_QEMU_STORAGE_DAEMON_BINARY=./storage-daemon/qemu-storage-daemon G_TEST_DBUS_DAEMON=/home/gitlab-runner/builds/zEr9wY_L/0/qemu-project/qemu/tests/dbus-vmstate-daemon.sh PYTHON=/home/gitlab-runner/builds/zEr9wY_L/0/qemu-project/qemu/build/pyvenv/bin/python3 QTEST_QEMU_IMG=./qemu-img MALLOC_PERTURB_=144 QTEST_QEMU_BINARY=./qemu-system-i386 /home/gitlab-runner/builds/zEr9wY_L/0/qemu-project/qemu/build/tests/qtest/migration-test --tap -k
> >>> >> ――――――――――――――――――――――――――――――――――――― ✀  ―――――――――――――――――――――――――――――――――――――
> >>> >> stderr:
> >>> >> warning: fd: migration to a file is deprecated. Use file: instead.
> >>> >> warning: fd: migration to a file is deprecated. Use file: instead.
> >>> >> ../tests/qtest/libqtest.c:205: kill_qemu() detected QEMU death from signal 11 (Segmentation fault) (core dumped)
> >>> >> (test program exited with status code -6)
> >>> >> TAP parsing error: Too few tests run (expected 53, got 39)
> >>> >> ――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――
> >>> >>
> >>> >> # Start of plain tests
> >>> >> # Running /i386/migration/multifd/tcp/plain/cancel
> >>> >> # Using machine type: pc-i440fx-9.2
> >>> >> # starting QEMU: exec ./qemu-system-i386 -qtest unix:/tmp/qtest-3273509.sock -qtest-log /dev/null -chardev socket,path=/tmp/qtest-3273509.qmp,id=char0 -mon chardev=char0,mode=control -display none -audio none -accel kvm -accel tcg -machine pc-i440fx-9.2, -name source,debug-threads=on -m 150M -serial file:/tmp/migration-test-4112T2/src_serial -drive if=none,id=d0,file=/tmp/migration-test-4112T2/bootsect,format=raw -device ide-hd,drive=d0,secs=1,cyls=1,heads=1    2>/dev/null -accel qtest
> >>> >> # starting QEMU: exec ./qemu-system-i386 -qtest unix:/tmp/qtest-3273509.sock -qtest-log /dev/null -chardev socket,path=/tmp/qtest-3273509.qmp,id=char0 -mon chardev=char0,mode=control -display none -audio none -accel kvm -accel tcg -machine pc-i440fx-9.2, -name target,debug-threads=on -m 150M -serial file:/tmp/migration-test-4112T2/dest_serial -incoming defer -drive if=none,id=d0,file=/tmp/migration-test-4112T2/bootsect,format=raw -device ide-hd,drive=d0,secs=1,cyls=1,heads=1    2>/dev/null -accel qtest
> >>> >> ----------------------------------- stderr -----------------------------------
> >>> >> warning: fd: migration to a file is deprecated. Use file: instead.
> >>> >> warning: fd: migration to a file is deprecated. Use file: instead.
> >>> >> ../tests/qtest/libqtest.c:205: kill_qemu() detected QEMU death from signal 11 (Segmentation fault) (core dumped)
> >>> >>
> >>> >> *** /ppc64/migration/multifd/tcp/plain/cancel, qtest-ppc64 on i686 host
> >>> >>
> >>> >> https://gitlab.com/qemu-project/qemu/-/jobs/7786579152
> >>> >>
> >>> >> 174/315 qemu:qtest+qtest-ppc64 / qtest-ppc64/migration-test                       ERROR          381.00s   killed by signal 6 SIGABRT
> >>> >>>>> PYTHON=/builds/qemu-project/qemu/build/pyvenv/bin/python3 QTEST_QEMU_IMG=./qemu-img G_TEST_DBUS_DAEMON=/builds/qemu-project/qemu/tests/dbus-vmstate-daemon.sh QTEST_QEMU_BINARY=./qemu-system-ppc64 MALLOC_PERTURB_=178 QTEST_QEMU_STORAGE_DAEMON_BINARY=./storage-daemon/qemu-storage-daemon /builds/qemu-project/qemu/build/tests/qtest/migration-test --tap -k
> >>> >> ――――――――――――――――――――――――――――――――――――― ✀  ―――――――――――――――――――――――――――――――――――――
> >>> >> stderr:
> >>> >> qemu-system-ppc64: Cannot read from TLS channel: The TLS connection was non-properly terminated.
> >>> >> warning: fd: migration to a file is deprecated. Use file: instead.
> >>> >> warning: fd: migration to a file is deprecated. Use file: instead.
> >>> >> ../tests/qtest/libqtest.c:205: kill_qemu() detected QEMU death from signal 11 (Segmentation fault) (core dumped)
> >>> >> (test program exited with status code -6)
> >>> >> TAP parsing error: Too few tests run (expected 61, got 47)
> >>> >> ――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――
> >>> >>
> >>> >> # Start of plain tests
> >>> >> # Running /ppc64/migration/multifd/tcp/plain/cancel
> >>> >> # Using machine type: pseries-9.2
> >>> >> # starting QEMU: exec ./qemu-system-ppc64 -qtest unix:/tmp/qtest-40766.sock -qtest-log /dev/null -chardev socket,path=/tmp/qtest-40766.qmp,id=char0 -mon chardev=char0,mode=control -display none -audio none -accel kvm -accel tcg -machine pseries-9.2,vsmt=8 -name source,debug-threads=on -m 256M -serial file:/tmp/migration-test-H0Z1T2/src_serial -nodefaults -machine cap-cfpc=broken,cap-sbbc=broken,cap-ibs=broken,cap-ccf-assist=off, -bios /tmp/migration-test-H0Z1T2/bootsect    2>/dev/null -accel qtest
> >>> >> # starting QEMU: exec ./qemu-system-ppc64 -qtest unix:/tmp/qtest-40766.sock -qtest-log /dev/null -chardev socket,path=/tmp/qtest-40766.qmp,id=char0 -mon chardev=char0,mode=control -display none -audio none -accel kvm -accel tcg -machine pseries-9.2,vsmt=8 -name target,debug-threads=on -m 256M -serial file:/tmp/migration-test-H0Z1T2/dest_serial -incoming defer -nodefaults -machine cap-cfpc=broken,cap-sbbc=broken,cap-ibs=broken,cap-ccf-assist=off, -bios /tmp/migration-test-H0Z1T2/bootsect    2>/dev/null -accel qtest
> >>> >> ----------------------------------- stderr -----------------------------------
> >>> >> qemu-system-ppc64: Cannot read from TLS channel: The TLS connection was non-properly terminated.
> >>> >> warning: fd: migration to a file is deprecated. Use file: instead.
> >>> >> warning: fd: migration to a file is deprecated. Use file: instead.
> >>> >> ../tests/qtest/libqtest.c:205: kill_qemu() detected QEMU death from signal 11 (Segmentation fault) (core dumped)
> >>> >>
> >>> >> (test program exited with status code -6)
> >>> >> =================================8<================================
> >>> >>
> >>> >> So.. it's the same test (multifd/tcp/plain/cancel) that is failing on
> >>> >> different host / arch being tested.  What is more weird is the two failures
> >>> >> are different, the 2nd failure throw out a TLS error even though the test
> >>> >> doesn't yet have tls involved.
> >>> >
> >>> > I think that's just a parallel test being cancelled prematurely, either
> >>> > due to the crash or due to the timeout.
> >>> >
> >>> >>
> >>> >> Fabiano, is this the issue you're looking at?
> >>> >
> >>> > Yes. I can reproduce locally by running 2 processes in parallel: 1 loop
> >>> > with make -j$(nproc) check and another loop with tcp/plain/cancel. It
> >>> > takes ~1h to hit. I've seen crashes with ppc64, s390 and
> >>> > aarch64.
> >>> >
> >>> 
> >>> Ok, the issue is that after commit 5ef7e26bdb ("migration/multifd: solve
> >>> zero page causing multiple page faults"), the multifd code started using
> >>> the rb->receivedmap bitmap, which belongs to the ram code and is
> >>> initialized and *freed* from the ram SaveVMHandlers.
> >>> 
> >>> process_incoming_migration_co()        ...
> >>>   qemu_loadvm_state()                  multifd_nocomp_recv()
> >>>     qemu_loadvm_state_cleanup()          ramblock_recv_bitmap_set_offset()
> >>>       rb->receivedmap = NULL               set_bit_atomic(..., rb->receivedmap)
> >>>   ...
> >>>   migration_incoming_state_destroy()
> >>>     multifd_recv_cleanup()
> >>>       multifd_recv_terminate_threads(NULL)
> >>> 
> >>> Multifd threads are live until migration_incoming_state_destroy(), which
> >>> is called some time later.
> >>
> >> Thanks for the debugging.  Hmm I would expect loadvm should wait until all
> >> ram is received somehow..
> >
> > Looks like a similar issue as when we didn't have the multifd_send sync
> > working correctly and ram code would run and do cleanup.
> 
> Btw, this is hard to debug, but I bet what's happening is that the
> ram_load code itself is exiting due to qemufile error. So there wouldn't
> be a way to make it wait for multifd.

One more thing is I remember one of the error is not the crash but some TLS
disconnection error.  I wonder which one you can reproduce and why that TLS
code can got kicked off in the multifd cancel test.  Perhaps the memory was
simply corrupted around, so bitmap ops can write to some other memory?
Fabiano Rosas Sept. 13, 2024, 5:51 p.m. UTC | #24
Peter Xu <peterx@redhat.com> writes:

> On Fri, Sep 13, 2024 at 12:17:40PM -0300, Fabiano Rosas wrote:
>> Fabiano Rosas <farosas@suse.de> writes:
>> 
>> > Peter Xu <peterx@redhat.com> writes:
>> >
>> >> On Thu, Sep 12, 2024 at 07:52:48PM -0300, Fabiano Rosas wrote:
>> >>> Fabiano Rosas <farosas@suse.de> writes:
>> >>> 
>> >>> > Peter Xu <peterx@redhat.com> writes:
>> >>> >
>> >>> >> On Thu, Sep 12, 2024 at 09:13:16AM +0100, Peter Maydell wrote:
>> >>> >>> On Wed, 11 Sept 2024 at 22:26, Fabiano Rosas <farosas@suse.de> wrote:
>> >>> >>> > I don't think we're discussing total CI time at this point, so the math
>> >>> >>> > doesn't really add up. We're not looking into making the CI finish
>> >>> >>> > faster. We're looking into making migration-test finish faster. That
>> >>> >>> > would reduce timeouts in CI, speed-up make check and reduce the chance
>> >>> >>> > of random race conditions* affecting other people/staging runs.
>> >>> >>> 
>> >>> >>> Right. The reason migration-test appears on my radar is because
>> >>> >>> it is very frequently the thing that shows up as "this sometimes
>> >>> >>> just fails or just times out and if you hit retry it goes away
>> >>> >>> again". That might not be migration-test's fault specifically,
>> >>> >>> because those retries tend to be certain CI configs (s390,
>> >>> >>> the i686-tci one), and I have some theories about what might be
>> >>> >>> causing it (e.g. build system runs 4 migration-tests in parallel,
>> >>> >>> which means 8 QEMU processes which is too many for the number
>> >>> >>> of host CPUs). But right now I look at CI job failures and my reaction
>> >>> >>> is "oh, it's the migration-test failing yet again" :-(
>> >>> >>> 
>> >>> >>> For some examples from this week:
>> >>> >>> 
>> >>> >>> https://gitlab.com/qemu-project/qemu/-/jobs/7802183144
>> >>> >>> https://gitlab.com/qemu-project/qemu/-/jobs/7799842373  <--------[1]
>> >>> >>> https://gitlab.com/qemu-project/qemu/-/jobs/7786579152  <--------[2]
>> >>> >>> https://gitlab.com/qemu-project/qemu/-/jobs/7786579155
>> >>> >>
>> >>> >> Ah right, the TIMEOUT is unfortunate, especially if tests can be run in
>> >>> >> parallel.  It indeed sounds like no good way to finally solve.. I don't
>> >>> >> also see how speeding up / reducing tests in migration test would help, as
>> >>> >> that's (from some degree..) is the same as tuning the timeout value bigger.
>> >>> >> When the tests are less it'll fit into 480s window, but maybe it's too
>> >>> >> quick now we wonder whether we should shrink it to e.g. 90s, but then it
>> >>> >> can timeout again when on a busy host with less capability of concurrency.
>> >>> >>
>> >>> >> But indeed there're two ERRORs ([1,2] above)..  I collected some more info
>> >>> >> here before the log expires:
>> >>> >>
>> >>> >> =================================8<================================
>> >>> >>
>> >>> >> *** /i386/migration/multifd/tcp/plain/cancel, qtest-i386 on s390 host
>> >>> >>
>> >>> >> https://gitlab.com/qemu-project/qemu/-/jobs/7799842373
>> >>> >>
>> >>> >> 101/953 qemu:qtest+qtest-i386 / qtest-i386/migration-test                         ERROR          144.32s   killed by signal 6 SIGABRT
>> >>> >>>>> QTEST_QEMU_STORAGE_DAEMON_BINARY=./storage-daemon/qemu-storage-daemon G_TEST_DBUS_DAEMON=/home/gitlab-runner/builds/zEr9wY_L/0/qemu-project/qemu/tests/dbus-vmstate-daemon.sh PYTHON=/home/gitlab-runner/builds/zEr9wY_L/0/qemu-project/qemu/build/pyvenv/bin/python3 QTEST_QEMU_IMG=./qemu-img MALLOC_PERTURB_=144 QTEST_QEMU_BINARY=./qemu-system-i386 /home/gitlab-runner/builds/zEr9wY_L/0/qemu-project/qemu/build/tests/qtest/migration-test --tap -k
>> >>> >> ――――――――――――――――――――――――――――――――――――― ✀  ―――――――――――――――――――――――――――――――――――――
>> >>> >> stderr:
>> >>> >> warning: fd: migration to a file is deprecated. Use file: instead.
>> >>> >> warning: fd: migration to a file is deprecated. Use file: instead.
>> >>> >> ../tests/qtest/libqtest.c:205: kill_qemu() detected QEMU death from signal 11 (Segmentation fault) (core dumped)
>> >>> >> (test program exited with status code -6)
>> >>> >> TAP parsing error: Too few tests run (expected 53, got 39)
>> >>> >> ――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――
>> >>> >>
>> >>> >> # Start of plain tests
>> >>> >> # Running /i386/migration/multifd/tcp/plain/cancel
>> >>> >> # Using machine type: pc-i440fx-9.2
>> >>> >> # starting QEMU: exec ./qemu-system-i386 -qtest unix:/tmp/qtest-3273509.sock -qtest-log /dev/null -chardev socket,path=/tmp/qtest-3273509.qmp,id=char0 -mon chardev=char0,mode=control -display none -audio none -accel kvm -accel tcg -machine pc-i440fx-9.2, -name source,debug-threads=on -m 150M -serial file:/tmp/migration-test-4112T2/src_serial -drive if=none,id=d0,file=/tmp/migration-test-4112T2/bootsect,format=raw -device ide-hd,drive=d0,secs=1,cyls=1,heads=1    2>/dev/null -accel qtest
>> >>> >> # starting QEMU: exec ./qemu-system-i386 -qtest unix:/tmp/qtest-3273509.sock -qtest-log /dev/null -chardev socket,path=/tmp/qtest-3273509.qmp,id=char0 -mon chardev=char0,mode=control -display none -audio none -accel kvm -accel tcg -machine pc-i440fx-9.2, -name target,debug-threads=on -m 150M -serial file:/tmp/migration-test-4112T2/dest_serial -incoming defer -drive if=none,id=d0,file=/tmp/migration-test-4112T2/bootsect,format=raw -device ide-hd,drive=d0,secs=1,cyls=1,heads=1    2>/dev/null -accel qtest
>> >>> >> ----------------------------------- stderr -----------------------------------
>> >>> >> warning: fd: migration to a file is deprecated. Use file: instead.
>> >>> >> warning: fd: migration to a file is deprecated. Use file: instead.
>> >>> >> ../tests/qtest/libqtest.c:205: kill_qemu() detected QEMU death from signal 11 (Segmentation fault) (core dumped)
>> >>> >>
>> >>> >> *** /ppc64/migration/multifd/tcp/plain/cancel, qtest-ppc64 on i686 host
>> >>> >>
>> >>> >> https://gitlab.com/qemu-project/qemu/-/jobs/7786579152
>> >>> >>
>> >>> >> 174/315 qemu:qtest+qtest-ppc64 / qtest-ppc64/migration-test                       ERROR          381.00s   killed by signal 6 SIGABRT
>> >>> >>>>> PYTHON=/builds/qemu-project/qemu/build/pyvenv/bin/python3 QTEST_QEMU_IMG=./qemu-img G_TEST_DBUS_DAEMON=/builds/qemu-project/qemu/tests/dbus-vmstate-daemon.sh QTEST_QEMU_BINARY=./qemu-system-ppc64 MALLOC_PERTURB_=178 QTEST_QEMU_STORAGE_DAEMON_BINARY=./storage-daemon/qemu-storage-daemon /builds/qemu-project/qemu/build/tests/qtest/migration-test --tap -k
>> >>> >> ――――――――――――――――――――――――――――――――――――― ✀  ―――――――――――――――――――――――――――――――――――――
>> >>> >> stderr:
>> >>> >> qemu-system-ppc64: Cannot read from TLS channel: The TLS connection was non-properly terminated.
>> >>> >> warning: fd: migration to a file is deprecated. Use file: instead.
>> >>> >> warning: fd: migration to a file is deprecated. Use file: instead.
>> >>> >> ../tests/qtest/libqtest.c:205: kill_qemu() detected QEMU death from signal 11 (Segmentation fault) (core dumped)
>> >>> >> (test program exited with status code -6)
>> >>> >> TAP parsing error: Too few tests run (expected 61, got 47)
>> >>> >> ――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――
>> >>> >>
>> >>> >> # Start of plain tests
>> >>> >> # Running /ppc64/migration/multifd/tcp/plain/cancel
>> >>> >> # Using machine type: pseries-9.2
>> >>> >> # starting QEMU: exec ./qemu-system-ppc64 -qtest unix:/tmp/qtest-40766.sock -qtest-log /dev/null -chardev socket,path=/tmp/qtest-40766.qmp,id=char0 -mon chardev=char0,mode=control -display none -audio none -accel kvm -accel tcg -machine pseries-9.2,vsmt=8 -name source,debug-threads=on -m 256M -serial file:/tmp/migration-test-H0Z1T2/src_serial -nodefaults -machine cap-cfpc=broken,cap-sbbc=broken,cap-ibs=broken,cap-ccf-assist=off, -bios /tmp/migration-test-H0Z1T2/bootsect    2>/dev/null -accel qtest
>> >>> >> # starting QEMU: exec ./qemu-system-ppc64 -qtest unix:/tmp/qtest-40766.sock -qtest-log /dev/null -chardev socket,path=/tmp/qtest-40766.qmp,id=char0 -mon chardev=char0,mode=control -display none -audio none -accel kvm -accel tcg -machine pseries-9.2,vsmt=8 -name target,debug-threads=on -m 256M -serial file:/tmp/migration-test-H0Z1T2/dest_serial -incoming defer -nodefaults -machine cap-cfpc=broken,cap-sbbc=broken,cap-ibs=broken,cap-ccf-assist=off, -bios /tmp/migration-test-H0Z1T2/bootsect    2>/dev/null -accel qtest
>> >>> >> ----------------------------------- stderr -----------------------------------
>> >>> >> qemu-system-ppc64: Cannot read from TLS channel: The TLS connection was non-properly terminated.
>> >>> >> warning: fd: migration to a file is deprecated. Use file: instead.
>> >>> >> warning: fd: migration to a file is deprecated. Use file: instead.
>> >>> >> ../tests/qtest/libqtest.c:205: kill_qemu() detected QEMU death from signal 11 (Segmentation fault) (core dumped)
>> >>> >>
>> >>> >> (test program exited with status code -6)
>> >>> >> =================================8<================================
>> >>> >>
>> >>> >> So.. it's the same test (multifd/tcp/plain/cancel) that is failing on
>> >>> >> different host / arch being tested.  What is more weird is the two failures
>> >>> >> are different, the 2nd failure throw out a TLS error even though the test
>> >>> >> doesn't yet have tls involved.
>> >>> >
>> >>> > I think that's just a parallel test being cancelled prematurely, either
>> >>> > due to the crash or due to the timeout.
>> >>> >
>> >>> >>
>> >>> >> Fabiano, is this the issue you're looking at?
>> >>> >
>> >>> > Yes. I can reproduce locally by running 2 processes in parallel: 1 loop
>> >>> > with make -j$(nproc) check and another loop with tcp/plain/cancel. It
>> >>> > takes ~1h to hit. I've seen crashes with ppc64, s390 and
>> >>> > aarch64.
>> >>> >
>> >>> 
>> >>> Ok, the issue is that after commit 5ef7e26bdb ("migration/multifd: solve
>> >>> zero page causing multiple page faults"), the multifd code started using
>> >>> the rb->receivedmap bitmap, which belongs to the ram code and is
>> >>> initialized and *freed* from the ram SaveVMHandlers.
>> >>> 
>> >>> process_incoming_migration_co()        ...
>> >>>   qemu_loadvm_state()                  multifd_nocomp_recv()
>> >>>     qemu_loadvm_state_cleanup()          ramblock_recv_bitmap_set_offset()
>> >>>       rb->receivedmap = NULL               set_bit_atomic(..., rb->receivedmap)
>> >>>   ...
>> >>>   migration_incoming_state_destroy()
>> >>>     multifd_recv_cleanup()
>> >>>       multifd_recv_terminate_threads(NULL)
>> >>> 
>> >>> Multifd threads are live until migration_incoming_state_destroy(), which
>> >>> is called some time later.
>> >>
>> >> Thanks for the debugging.  Hmm I would expect loadvm should wait until all
>> >> ram is received somehow..
>> >
>> > Looks like a similar issue as when we didn't have the multifd_send sync
>> > working correctly and ram code would run and do cleanup.
>> 
>> Btw, this is hard to debug, but I bet what's happening is that the
>> ram_load code itself is exiting due to qemufile error. So there wouldn't
>> be a way to make it wait for multifd.
>
> One more thing is I remember one of the error is not the crash but some TLS
> disconnection error.  I wonder which one you can reproduce and why that TLS
> code can got kicked off in the multifd cancel test.  Perhaps the memory was
> simply corrupted around, so bitmap ops can write to some other memory?

I haven't seen that error. I said in another email that I think that is
due to the timeout cancelling the tests forcefully.
diff mbox series

Patch

diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
index 4626301435..cf0b1dcb50 100644
--- a/tests/qtest/migration-test.c
+++ b/tests/qtest/migration-test.c
@@ -718,6 +718,7 @@  typedef struct {
 typedef struct {
     /* CPU throttle parameters */
     bool periodic;
+    bool responsive;
 } AutoConvergeArgs;
 
 static int test_migrate_start(QTestState **from, QTestState **to,
@@ -2795,6 +2796,7 @@  static void test_migrate_auto_converge_args(AutoConvergeArgs *input_args)
     QTestState *from, *to;
     int64_t percentage;
     bool periodic = (input_args && input_args->periodic);
+    bool responsive = (input_args && input_args->responsive);
 
     /*
      * We want the test to be stable and as fast as possible.
@@ -2820,6 +2822,16 @@  static void test_migrate_auto_converge_args(AutoConvergeArgs *input_args)
                 periodic_throttle_interval);
     }
 
+    if (responsive) {
+        /*
+         * The dirty-sync-count may not always go down while using responsive
+         * throttle because it is an optimization and may not take effect in
+         * any scenario. Just making sure this feature doesn't break any
+         * existing functionality by turning it on.
+         */
+        migrate_set_parameter_bool(from, "cpu-responsive-throttle", true);
+    }
+
     /*
      * Set the initial parameters so that the migration could not converge
      * without throttling.
@@ -2902,6 +2914,12 @@  static void test_migrate_auto_converge_periodic_throttle(void)
     test_migrate_auto_converge_args(&args);
 }
 
+static void test_migrate_auto_converge_responsive_throttle(void)
+{
+    AutoConvergeArgs args = {.responsive = true};
+    test_migrate_auto_converge_args(&args);
+}
+
 static void *
 test_migrate_precopy_tcp_multifd_start_common(QTestState *from,
                                               QTestState *to,
@@ -3955,6 +3973,8 @@  int main(int argc, char **argv)
                            test_migrate_auto_converge);
         migration_test_add("/migration/auto_converge_periodic_throttle",
                            test_migrate_auto_converge_periodic_throttle);
+        migration_test_add("/migration/auto_converge_responsive_throttle",
+                           test_migrate_auto_converge_responsive_throttle);
         if (g_str_equal(arch, "x86_64") &&
             has_kvm && kvm_dirty_ring_supported()) {
             migration_test_add("/migration/dirty_limit",