mbox series

[0/6] net: fix non-deterministic failures of the 'netdev-socket' qtest

Message ID 20240104162942.211458-1-berrange@redhat.com
Headers show
Series net: fix non-deterministic failures of the 'netdev-socket' qtest | expand

Message

Daniel P. Berrangé Jan. 4, 2024, 4:29 p.m. UTC
We've previously bumped up the timeouts in the netdev-socket qtest
to supposedly fix non-deterministic failures, however, the failures
are still hitting CI.

A simple 'listen()' and 'connect()' pairing across 2 QEMU processes
should be very quick to execute, even under high system load, so it
was never likely that the test was failing due to timeouts being
reached.

The actual root cause was a race condition in the test design. It
was spawning a QEMU with a 'server' netdev, and then spawning one
with the 'client' netdev. There was insufficient synchronization,
however, so it was possible for the 2nd QEMU process to attempt
to 'connect()' before the 'listen()' call was made by the 1st QEMU.

In the test scenarios that did not use the 'reconnect' flag, this
would result in the client QEMU never getting into the expected
state. The test code would thus loop on 'info network' until
hitting the maximum wait time.

This series reverts the increased timeouts, and fixes synchronization
in the test scenarios. It also improves reporting of errors in the
socket netdev backend so that 'info network' reports what actually
went wrong rather than a useless generic 'connection error' string.
This will help us diagnose any future CI problems, should they occurr.

Daniel P. Berrangé (6):
  Revert "netdev: set timeout depending on loadavg"
  Revert "osdep: add getloadavg"
  Revert "tests/qtest/netdev-socket: Raise connection timeout to 120
    seconds"
  net: add explicit info about connecting/listening state
  net: handle QIOTask completion to report useful error message
  qtest: ensure netdev-socket tests have non-overlapping names

 include/qemu/osdep.h        | 10 ---------
 meson.build                 |  1 -
 net/stream.c                | 18 +++++++++++-----
 tests/qtest/netdev-socket.c | 42 +++++++------------------------------
 4 files changed, 21 insertions(+), 50 deletions(-)

Comments

Stefan Hajnoczi Jan. 4, 2024, 4:45 p.m. UTC | #1
On Thu, 4 Jan 2024 at 11:30, Daniel P. Berrangé <berrange@redhat.com> wrote:
>
> We've previously bumped up the timeouts in the netdev-socket qtest
> to supposedly fix non-deterministic failures, however, the failures
> are still hitting CI.
>
> A simple 'listen()' and 'connect()' pairing across 2 QEMU processes
> should be very quick to execute, even under high system load, so it
> was never likely that the test was failing due to timeouts being
> reached.
>
> The actual root cause was a race condition in the test design. It
> was spawning a QEMU with a 'server' netdev, and then spawning one
> with the 'client' netdev. There was insufficient synchronization,
> however, so it was possible for the 2nd QEMU process to attempt
> to 'connect()' before the 'listen()' call was made by the 1st QEMU.
>
> In the test scenarios that did not use the 'reconnect' flag, this
> would result in the client QEMU never getting into the expected
> state. The test code would thus loop on 'info network' until
> hitting the maximum wait time.
>
> This series reverts the increased timeouts, and fixes synchronization
> in the test scenarios. It also improves reporting of errors in the
> socket netdev backend so that 'info network' reports what actually
> went wrong rather than a useless generic 'connection error' string.
> This will help us diagnose any future CI problems, should they occurr.
>
> Daniel P. Berrangé (6):
>   Revert "netdev: set timeout depending on loadavg"
>   Revert "osdep: add getloadavg"
>   Revert "tests/qtest/netdev-socket: Raise connection timeout to 120
>     seconds"
>   net: add explicit info about connecting/listening state
>   net: handle QIOTask completion to report useful error message
>   qtest: ensure netdev-socket tests have non-overlapping names
>
>  include/qemu/osdep.h        | 10 ---------
>  meson.build                 |  1 -
>  net/stream.c                | 18 +++++++++++-----
>  tests/qtest/netdev-socket.c | 42 +++++++------------------------------
>  4 files changed, 21 insertions(+), 50 deletions(-)

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Daniel P. Berrangé Jan. 9, 2024, 1:49 p.m. UTC | #2
Hi Jason,

As the net/ maintainer, could you take a look at this series.
This failure has been causing pain for CI for quite a while.

If you're happy with it, I can include it in a pending pull
request of other misc patches I have.

On Thu, Jan 04, 2024 at 04:29:36PM +0000, Daniel P. Berrangé wrote:
> We've previously bumped up the timeouts in the netdev-socket qtest
> to supposedly fix non-deterministic failures, however, the failures
> are still hitting CI.
> 
> A simple 'listen()' and 'connect()' pairing across 2 QEMU processes
> should be very quick to execute, even under high system load, so it
> was never likely that the test was failing due to timeouts being
> reached.
> 
> The actual root cause was a race condition in the test design. It
> was spawning a QEMU with a 'server' netdev, and then spawning one
> with the 'client' netdev. There was insufficient synchronization,
> however, so it was possible for the 2nd QEMU process to attempt
> to 'connect()' before the 'listen()' call was made by the 1st QEMU.
> 
> In the test scenarios that did not use the 'reconnect' flag, this
> would result in the client QEMU never getting into the expected
> state. The test code would thus loop on 'info network' until
> hitting the maximum wait time.
> 
> This series reverts the increased timeouts, and fixes synchronization
> in the test scenarios. It also improves reporting of errors in the
> socket netdev backend so that 'info network' reports what actually
> went wrong rather than a useless generic 'connection error' string.
> This will help us diagnose any future CI problems, should they occurr.
> 
> Daniel P. Berrangé (6):
>   Revert "netdev: set timeout depending on loadavg"
>   Revert "osdep: add getloadavg"
>   Revert "tests/qtest/netdev-socket: Raise connection timeout to 120
>     seconds"
>   net: add explicit info about connecting/listening state
>   net: handle QIOTask completion to report useful error message
>   qtest: ensure netdev-socket tests have non-overlapping names
> 
>  include/qemu/osdep.h        | 10 ---------
>  meson.build                 |  1 -
>  net/stream.c                | 18 +++++++++++-----
>  tests/qtest/netdev-socket.c | 42 +++++++------------------------------
>  4 files changed, 21 insertions(+), 50 deletions(-)
> 
> -- 
> 2.43.0
> 

With regards,
Daniel
Jason Wang Jan. 15, 2024, 2:36 a.m. UTC | #3
On Fri, Jan 5, 2024 at 12:45 AM Stefan Hajnoczi <stefanha@gmail.com> wrote:
>
> On Thu, 4 Jan 2024 at 11:30, Daniel P. Berrangé <berrange@redhat.com> wrote:
> >
> > We've previously bumped up the timeouts in the netdev-socket qtest
> > to supposedly fix non-deterministic failures, however, the failures
> > are still hitting CI.
> >
> > A simple 'listen()' and 'connect()' pairing across 2 QEMU processes
> > should be very quick to execute, even under high system load, so it
> > was never likely that the test was failing due to timeouts being
> > reached.
> >
> > The actual root cause was a race condition in the test design. It
> > was spawning a QEMU with a 'server' netdev, and then spawning one
> > with the 'client' netdev. There was insufficient synchronization,
> > however, so it was possible for the 2nd QEMU process to attempt
> > to 'connect()' before the 'listen()' call was made by the 1st QEMU.
> >
> > In the test scenarios that did not use the 'reconnect' flag, this
> > would result in the client QEMU never getting into the expected
> > state. The test code would thus loop on 'info network' until
> > hitting the maximum wait time.
> >
> > This series reverts the increased timeouts, and fixes synchronization
> > in the test scenarios. It also improves reporting of errors in the
> > socket netdev backend so that 'info network' reports what actually
> > went wrong rather than a useless generic 'connection error' string.
> > This will help us diagnose any future CI problems, should they occurr.
> >
> > Daniel P. Berrangé (6):
> >   Revert "netdev: set timeout depending on loadavg"
> >   Revert "osdep: add getloadavg"
> >   Revert "tests/qtest/netdev-socket: Raise connection timeout to 120
> >     seconds"
> >   net: add explicit info about connecting/listening state
> >   net: handle QIOTask completion to report useful error message
> >   qtest: ensure netdev-socket tests have non-overlapping names
> >
> >  include/qemu/osdep.h        | 10 ---------
> >  meson.build                 |  1 -
> >  net/stream.c                | 18 +++++++++++-----
> >  tests/qtest/netdev-socket.c | 42 +++++++------------------------------
> >  4 files changed, 21 insertions(+), 50 deletions(-)
>
> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
>

Queued.

Thanks
Peter Maydell Jan. 15, 2024, 10:19 a.m. UTC | #4
On Mon, 15 Jan 2024 at 02:37, Jason Wang <jasowang@redhat.com> wrote:
>
> On Fri, Jan 5, 2024 at 12:45 AM Stefan Hajnoczi <stefanha@gmail.com> wrote:
> >
> > On Thu, 4 Jan 2024 at 11:30, Daniel P. Berrangé <berrange@redhat.com> wrote:
> > >
> > > We've previously bumped up the timeouts in the netdev-socket qtest
> > > to supposedly fix non-deterministic failures, however, the failures
> > > are still hitting CI.
> > >
> > > A simple 'listen()' and 'connect()' pairing across 2 QEMU processes
> > > should be very quick to execute, even under high system load, so it
> > > was never likely that the test was failing due to timeouts being
> > > reached.
> > >
> > > The actual root cause was a race condition in the test design. It
> > > was spawning a QEMU with a 'server' netdev, and then spawning one
> > > with the 'client' netdev. There was insufficient synchronization,
> > > however, so it was possible for the 2nd QEMU process to attempt
> > > to 'connect()' before the 'listen()' call was made by the 1st QEMU.
> > >
> > > In the test scenarios that did not use the 'reconnect' flag, this
> > > would result in the client QEMU never getting into the expected
> > > state. The test code would thus loop on 'info network' until
> > > hitting the maximum wait time.
> > >
> > > This series reverts the increased timeouts, and fixes synchronization
> > > in the test scenarios. It also improves reporting of errors in the
> > > socket netdev backend so that 'info network' reports what actually
> > > went wrong rather than a useless generic 'connection error' string.
> > > This will help us diagnose any future CI problems, should they occurr.
> > >
> > > Daniel P. Berrangé (6):
> > >   Revert "netdev: set timeout depending on loadavg"
> > >   Revert "osdep: add getloadavg"
> > >   Revert "tests/qtest/netdev-socket: Raise connection timeout to 120
> > >     seconds"
> > >   net: add explicit info about connecting/listening state
> > >   net: handle QIOTask completion to report useful error message
> > >   qtest: ensure netdev-socket tests have non-overlapping names
> > >
> > >  include/qemu/osdep.h        | 10 ---------
> > >  meson.build                 |  1 -
> > >  net/stream.c                | 18 +++++++++++-----
> > >  tests/qtest/netdev-socket.c | 42 +++++++------------------------------
> > >  4 files changed, 21 insertions(+), 50 deletions(-)
> >
> > Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
> >
>
> Queued.

These are already upstream, via thuth's testing queue.

thanks
-- PMM
Jason Wang Jan. 16, 2024, 1:05 a.m. UTC | #5
On Mon, Jan 15, 2024 at 6:19 PM Peter Maydell <peter.maydell@linaro.org> wrote:
>
> On Mon, 15 Jan 2024 at 02:37, Jason Wang <jasowang@redhat.com> wrote:
> >
> > On Fri, Jan 5, 2024 at 12:45 AM Stefan Hajnoczi <stefanha@gmail.com> wrote:
> > >
> > > On Thu, 4 Jan 2024 at 11:30, Daniel P. Berrangé <berrange@redhat.com> wrote:
> > > >
> > > > We've previously bumped up the timeouts in the netdev-socket qtest
> > > > to supposedly fix non-deterministic failures, however, the failures
> > > > are still hitting CI.
> > > >
> > > > A simple 'listen()' and 'connect()' pairing across 2 QEMU processes
> > > > should be very quick to execute, even under high system load, so it
> > > > was never likely that the test was failing due to timeouts being
> > > > reached.
> > > >
> > > > The actual root cause was a race condition in the test design. It
> > > > was spawning a QEMU with a 'server' netdev, and then spawning one
> > > > with the 'client' netdev. There was insufficient synchronization,
> > > > however, so it was possible for the 2nd QEMU process to attempt
> > > > to 'connect()' before the 'listen()' call was made by the 1st QEMU.
> > > >
> > > > In the test scenarios that did not use the 'reconnect' flag, this
> > > > would result in the client QEMU never getting into the expected
> > > > state. The test code would thus loop on 'info network' until
> > > > hitting the maximum wait time.
> > > >
> > > > This series reverts the increased timeouts, and fixes synchronization
> > > > in the test scenarios. It also improves reporting of errors in the
> > > > socket netdev backend so that 'info network' reports what actually
> > > > went wrong rather than a useless generic 'connection error' string.
> > > > This will help us diagnose any future CI problems, should they occurr.
> > > >
> > > > Daniel P. Berrangé (6):
> > > >   Revert "netdev: set timeout depending on loadavg"
> > > >   Revert "osdep: add getloadavg"
> > > >   Revert "tests/qtest/netdev-socket: Raise connection timeout to 120
> > > >     seconds"
> > > >   net: add explicit info about connecting/listening state
> > > >   net: handle QIOTask completion to report useful error message
> > > >   qtest: ensure netdev-socket tests have non-overlapping names
> > > >
> > > >  include/qemu/osdep.h        | 10 ---------
> > > >  meson.build                 |  1 -
> > > >  net/stream.c                | 18 +++++++++++-----
> > > >  tests/qtest/netdev-socket.c | 42 +++++++------------------------------
> > > >  4 files changed, 21 insertions(+), 50 deletions(-)
> > >
> > > Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
> > >
> >
> > Queued.
>
> These are already upstream, via thuth's testing queue.
>
> thanks
> -- PMM
>

Great, so I dropped this.

Thanks