Message ID | 20230912131340.405619-1-npiggin@gmail.com |
---|---|
State | New |
Headers | show |
Series | tests/avocado: Fix console data loss | expand |
On Tue, Sep 12, 2023 at 11:13:40PM +1000, Nicholas Piggin wrote: > Occasionally some avocado tests will fail waiting for console line > despite the machine running correctly. Console data goes missing, as can > be seen in the console log. This is due to _console_interaction calling > makefile() on the console socket each time it is invoked, which must be > losing old buffer contents when going out of scope. > > It is not enough to makefile() with buffered=0. That helps significantly > but data loss is still possible. My guess is that readline() has a line > buffer even when the file is in unbuffered mode, that can eat data. > > Fix this by providing a console file that persists for the life of the > console. > > Signed-off-by: Nicholas Piggin <npiggin@gmail.com> > --- > > For some reason, ppc_prep_40p.py:IbmPrep40pMachine.test_openbios_192m > was flakey for me due to this bug. I don't know why that in particular, > 3 calls to wait_for_console_pattern probably helps. > > I didn't pinpoint when the bug was introduced because the original > was probably not buggy because it was only run once at the end of the > test. At some point after it was moved to common code, something would > have started to call it more than once which is where potential for bug > is introduced. I'm suspecting this patch might also fix many other failures we see in avocado, all with wait_for_console_pattern in their trace https://gitlab.com/qemu-project/qemu/-/issues/1884 > python/qemu/machine/machine.py | 19 +++++++++++++++++++ > tests/avocado/avocado_qemu/__init__.py | 2 +- > 2 files changed, 20 insertions(+), 1 deletion(-) > > diff --git a/python/qemu/machine/machine.py b/python/qemu/machine/machine.py > index c16a0b6fed..35d5a672db 100644 > --- a/python/qemu/machine/machine.py > +++ b/python/qemu/machine/machine.py > @@ -191,6 +191,7 @@ def __init__(self, > self.sock_dir, f"{self._name}.con" > ) > self._console_socket: Optional[socket.socket] = None > + self._console_file: Optional[socket.SocketIO] = None > self._remove_files: List[str] = [] > self._user_killed = False > self._quit_issued = False > @@ -509,6 +510,11 @@ def _early_cleanup(self) -> None: > # If we keep the console socket open, we may deadlock waiting > # for QEMU to exit, while QEMU is waiting for the socket to > # become writable. > + if self._console_file is not None: > + LOG.debug("Closing console file") > + self._console_file.close() > + self._console_file = None > + > if self._console_socket is not None: > LOG.debug("Closing console socket") > self._console_socket.close() > @@ -874,12 +880,25 @@ def console_socket(self) -> socket.socket: > Returns a socket connected to the console > """ > if self._console_socket is None: > + LOG.debug("Opening console socket") > self._console_socket = console_socket.ConsoleSocket( > self._console_address, > file=self._console_log_path, > drain=self._drain_console) > return self._console_socket > > + @property > + def console_file(self) -> socket.SocketIO: > + """ > + Returns a file associated with the console socket > + """ > + if self._console_file is None: > + LOG.debug("Opening console file") > + self._console_file = self.console_socket.makefile(mode='rb', > + buffering=0, > + encoding='utf-8') > + return self._console_file > + > @property > def temp_dir(self) -> str: > """ > diff --git a/tests/avocado/avocado_qemu/__init__.py b/tests/avocado/avocado_qemu/__init__.py > index 33090903f1..0172a359b7 100644 > --- a/tests/avocado/avocado_qemu/__init__.py > +++ b/tests/avocado/avocado_qemu/__init__.py > @@ -137,7 +137,7 @@ def _console_interaction(test, success_message, failure_message, > assert not keep_sending or send_string > if vm is None: > vm = test.vm > - console = vm.console_socket.makefile(mode='rb', encoding='utf-8') > + console = vm.console_file > console_logger = logging.getLogger('console') > while True: > if send_string: Reviewed-by: Daniel P. Berrangé <berrange@redhat.com> With regards, Daniel
+Alex who also looked at that IIRC. On 12/9/23 15:13, Nicholas Piggin wrote: > Occasionally some avocado tests will fail waiting for console line > despite the machine running correctly. Console data goes missing, as can > be seen in the console log. This is due to _console_interaction calling > makefile() on the console socket each time it is invoked, which must be > losing old buffer contents when going out of scope. > > It is not enough to makefile() with buffered=0. That helps significantly > but data loss is still possible. My guess is that readline() has a line > buffer even when the file is in unbuffered mode, that can eat data. > > Fix this by providing a console file that persists for the life of the > console. > > Signed-off-by: Nicholas Piggin <npiggin@gmail.com> > --- > > For some reason, ppc_prep_40p.py:IbmPrep40pMachine.test_openbios_192m > was flakey for me due to this bug. I don't know why that in particular, > 3 calls to wait_for_console_pattern probably helps. > > I didn't pinpoint when the bug was introduced because the original > was probably not buggy because it was only run once at the end of the > test. At some point after it was moved to common code, something would > have started to call it more than once which is where potential for bug > is introduced. > > Thanks, > Nick > > python/qemu/machine/machine.py | 19 +++++++++++++++++++ > tests/avocado/avocado_qemu/__init__.py | 2 +- > 2 files changed, 20 insertions(+), 1 deletion(-) > > diff --git a/python/qemu/machine/machine.py b/python/qemu/machine/machine.py > index c16a0b6fed..35d5a672db 100644 > --- a/python/qemu/machine/machine.py > +++ b/python/qemu/machine/machine.py > @@ -191,6 +191,7 @@ def __init__(self, > self.sock_dir, f"{self._name}.con" > ) > self._console_socket: Optional[socket.socket] = None > + self._console_file: Optional[socket.SocketIO] = None > self._remove_files: List[str] = [] > self._user_killed = False > self._quit_issued = False > @@ -509,6 +510,11 @@ def _early_cleanup(self) -> None: > # If we keep the console socket open, we may deadlock waiting > # for QEMU to exit, while QEMU is waiting for the socket to > # become writable. > + if self._console_file is not None: > + LOG.debug("Closing console file") > + self._console_file.close() > + self._console_file = None > + > if self._console_socket is not None: > LOG.debug("Closing console socket") > self._console_socket.close() > @@ -874,12 +880,25 @@ def console_socket(self) -> socket.socket: > Returns a socket connected to the console > """ > if self._console_socket is None: > + LOG.debug("Opening console socket") > self._console_socket = console_socket.ConsoleSocket( > self._console_address, > file=self._console_log_path, > drain=self._drain_console) > return self._console_socket > > + @property > + def console_file(self) -> socket.SocketIO: > + """ > + Returns a file associated with the console socket > + """ > + if self._console_file is None: > + LOG.debug("Opening console file") > + self._console_file = self.console_socket.makefile(mode='rb', > + buffering=0, > + encoding='utf-8') > + return self._console_file > + > @property > def temp_dir(self) -> str: > """ > diff --git a/tests/avocado/avocado_qemu/__init__.py b/tests/avocado/avocado_qemu/__init__.py > index 33090903f1..0172a359b7 100644 > --- a/tests/avocado/avocado_qemu/__init__.py > +++ b/tests/avocado/avocado_qemu/__init__.py > @@ -137,7 +137,7 @@ def _console_interaction(test, success_message, failure_message, > assert not keep_sending or send_string > if vm is None: > vm = test.vm > - console = vm.console_socket.makefile(mode='rb', encoding='utf-8') > + console = vm.console_file > console_logger = logging.getLogger('console') > while True: > if send_string:
Nicholas Piggin <npiggin@gmail.com> writes: > Occasionally some avocado tests will fail waiting for console line > despite the machine running correctly. Console data goes missing, as can > be seen in the console log. This is due to _console_interaction calling > makefile() on the console socket each time it is invoked, which must be > losing old buffer contents when going out of scope. > > It is not enough to makefile() with buffered=0. That helps significantly > but data loss is still possible. My guess is that readline() has a line > buffer even when the file is in unbuffered mode, that can eat data. > > Fix this by providing a console file that persists for the life of the > console. > > Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Queued to testing/next, thanks. > --- > > For some reason, ppc_prep_40p.py:IbmPrep40pMachine.test_openbios_192m > was flakey for me due to this bug. I don't know why that in particular, > 3 calls to wait_for_console_pattern probably helps. > > I didn't pinpoint when the bug was introduced because the original > was probably not buggy because it was only run once at the end of the > test. At some point after it was moved to common code, something would > have started to call it more than once which is where potential for bug > is introduced. There is a sprawling mass somewhere between: - pythons buffering of IO - device models dropping chars when blocked - noisy tests with competing console output that adds up to unreliable tests that rely on seeing certain patterns on the console. > > Thanks, > Nick > > python/qemu/machine/machine.py | 19 +++++++++++++++++++ > tests/avocado/avocado_qemu/__init__.py | 2 +- > 2 files changed, 20 insertions(+), 1 deletion(-) > > diff --git a/python/qemu/machine/machine.py b/python/qemu/machine/machine.py > index c16a0b6fed..35d5a672db 100644 > --- a/python/qemu/machine/machine.py > +++ b/python/qemu/machine/machine.py > @@ -191,6 +191,7 @@ def __init__(self, > self.sock_dir, f"{self._name}.con" > ) > self._console_socket: Optional[socket.socket] = None > + self._console_file: Optional[socket.SocketIO] = None > self._remove_files: List[str] = [] > self._user_killed = False > self._quit_issued = False > @@ -509,6 +510,11 @@ def _early_cleanup(self) -> None: > # If we keep the console socket open, we may deadlock waiting > # for QEMU to exit, while QEMU is waiting for the socket to > # become writable. > + if self._console_file is not None: > + LOG.debug("Closing console file") > + self._console_file.close() > + self._console_file = None > + > if self._console_socket is not None: > LOG.debug("Closing console socket") > self._console_socket.close() > @@ -874,12 +880,25 @@ def console_socket(self) -> socket.socket: > Returns a socket connected to the console > """ > if self._console_socket is None: > + LOG.debug("Opening console socket") > self._console_socket = console_socket.ConsoleSocket( > self._console_address, > file=self._console_log_path, > drain=self._drain_console) > return self._console_socket > > + @property > + def console_file(self) -> socket.SocketIO: > + """ > + Returns a file associated with the console socket > + """ > + if self._console_file is None: > + LOG.debug("Opening console file") > + self._console_file = self.console_socket.makefile(mode='rb', > + buffering=0, > + encoding='utf-8') > + return self._console_file > + > @property > def temp_dir(self) -> str: > """ > diff --git a/tests/avocado/avocado_qemu/__init__.py b/tests/avocado/avocado_qemu/__init__.py > index 33090903f1..0172a359b7 100644 > --- a/tests/avocado/avocado_qemu/__init__.py > +++ b/tests/avocado/avocado_qemu/__init__.py > @@ -137,7 +137,7 @@ def _console_interaction(test, success_message, failure_message, > assert not keep_sending or send_string > if vm is None: > vm = test.vm > - console = vm.console_socket.makefile(mode='rb', encoding='utf-8') > + console = vm.console_file > console_logger = logging.getLogger('console') > while True: > if send_string:
On Wed, Sep 13, 2023 at 4:53 AM Alex Bennée <alex.bennee@linaro.org> wrote: > > > Nicholas Piggin <npiggin@gmail.com> writes: > > > Occasionally some avocado tests will fail waiting for console line > > despite the machine running correctly. Console data goes missing, as can > > be seen in the console log. This is due to _console_interaction calling > > makefile() on the console socket each time it is invoked, which must be > > losing old buffer contents when going out of scope. > > > > It is not enough to makefile() with buffered=0. That helps significantly > > but data loss is still possible. My guess is that readline() has a line > > buffer even when the file is in unbuffered mode, that can eat data. > > > > Fix this by providing a console file that persists for the life of the > > console. > > > > Signed-off-by: Nicholas Piggin <npiggin@gmail.com> > > Queued to testing/next, thanks. Sure, ACK. I didn't realize that avocado was technically creating multiple buffered files like this. --js > > > --- > > > > For some reason, ppc_prep_40p.py:IbmPrep40pMachine.test_openbios_192m > > was flakey for me due to this bug. I don't know why that in particular, > > 3 calls to wait_for_console_pattern probably helps. > > > > I didn't pinpoint when the bug was introduced because the original > > was probably not buggy because it was only run once at the end of the > > test. At some point after it was moved to common code, something would > > have started to call it more than once which is where potential for bug > > is introduced. > > There is a sprawling mass somewhere between: > > - pythons buffering of IO > - device models dropping chars when blocked > - noisy tests with competing console output > > that adds up to unreliable tests that rely on seeing certain patterns on > the console. > > > > > Thanks, > > Nick > > > > python/qemu/machine/machine.py | 19 +++++++++++++++++++ > > tests/avocado/avocado_qemu/__init__.py | 2 +- > > 2 files changed, 20 insertions(+), 1 deletion(-) > > > > diff --git a/python/qemu/machine/machine.py b/python/qemu/machine/machine.py > > index c16a0b6fed..35d5a672db 100644 > > --- a/python/qemu/machine/machine.py > > +++ b/python/qemu/machine/machine.py > > @@ -191,6 +191,7 @@ def __init__(self, > > self.sock_dir, f"{self._name}.con" > > ) > > self._console_socket: Optional[socket.socket] = None > > + self._console_file: Optional[socket.SocketIO] = None > > self._remove_files: List[str] = [] > > self._user_killed = False > > self._quit_issued = False > > @@ -509,6 +510,11 @@ def _early_cleanup(self) -> None: > > # If we keep the console socket open, we may deadlock waiting > > # for QEMU to exit, while QEMU is waiting for the socket to > > # become writable. > > + if self._console_file is not None: > > + LOG.debug("Closing console file") > > + self._console_file.close() > > + self._console_file = None > > + > > if self._console_socket is not None: > > LOG.debug("Closing console socket") > > self._console_socket.close() > > @@ -874,12 +880,25 @@ def console_socket(self) -> socket.socket: > > Returns a socket connected to the console > > """ > > if self._console_socket is None: > > + LOG.debug("Opening console socket") > > self._console_socket = console_socket.ConsoleSocket( > > self._console_address, > > file=self._console_log_path, > > drain=self._drain_console) > > return self._console_socket > > > > + @property > > + def console_file(self) -> socket.SocketIO: > > + """ > > + Returns a file associated with the console socket > > + """ > > + if self._console_file is None: > > + LOG.debug("Opening console file") > > + self._console_file = self.console_socket.makefile(mode='rb', > > + buffering=0, > > + encoding='utf-8') > > + return self._console_file > > + > > @property > > def temp_dir(self) -> str: > > """ > > diff --git a/tests/avocado/avocado_qemu/__init__.py b/tests/avocado/avocado_qemu/__init__.py > > index 33090903f1..0172a359b7 100644 > > --- a/tests/avocado/avocado_qemu/__init__.py > > +++ b/tests/avocado/avocado_qemu/__init__.py > > @@ -137,7 +137,7 @@ def _console_interaction(test, success_message, failure_message, > > assert not keep_sending or send_string > > if vm is None: > > vm = test.vm > > - console = vm.console_socket.makefile(mode='rb', encoding='utf-8') > > + console = vm.console_file > > console_logger = logging.getLogger('console') > > while True: > > if send_string: > > > -- > Alex Bennée > Virtualisation Tech Lead @ Linaro >
On Wed Sep 13, 2023 at 6:51 PM AEST, Alex Bennée wrote: > > Nicholas Piggin <npiggin@gmail.com> writes: > > > Occasionally some avocado tests will fail waiting for console line > > despite the machine running correctly. Console data goes missing, as can > > be seen in the console log. This is due to _console_interaction calling > > makefile() on the console socket each time it is invoked, which must be > > losing old buffer contents when going out of scope. > > > > It is not enough to makefile() with buffered=0. That helps significantly > > but data loss is still possible. My guess is that readline() has a line > > buffer even when the file is in unbuffered mode, that can eat data. > > > > Fix this by providing a console file that persists for the life of the > > console. > > > > Signed-off-by: Nicholas Piggin <npiggin@gmail.com> > > Queued to testing/next, thanks. > > > --- > > > > For some reason, ppc_prep_40p.py:IbmPrep40pMachine.test_openbios_192m > > was flakey for me due to this bug. I don't know why that in particular, > > 3 calls to wait_for_console_pattern probably helps. > > > > I didn't pinpoint when the bug was introduced because the original > > was probably not buggy because it was only run once at the end of the > > test. At some point after it was moved to common code, something would > > have started to call it more than once which is where potential for bug > > is introduced. > > There is a sprawling mass somewhere between: > > - pythons buffering of IO > - device models dropping chars when blocked > - noisy tests with competing console output > > that adds up to unreliable tests that rely on seeing certain patterns on > the console. Yeah it's a tricky bug and a difficult stack to diagnose. I started to look at 40p machine firmware console at first since it was happening on there. It's actually not too bad now, I was irritating it by putting delays in various avocado console socket reading, which can trigger it easily (my guess is due to delay allowing file buffer to pull in more data than is consumed). With patch the only check-avocado failures I was getting was some OS watchdog timeouts in their console print code caused by back pressure. Thanks, Nick
diff --git a/python/qemu/machine/machine.py b/python/qemu/machine/machine.py index c16a0b6fed..35d5a672db 100644 --- a/python/qemu/machine/machine.py +++ b/python/qemu/machine/machine.py @@ -191,6 +191,7 @@ def __init__(self, self.sock_dir, f"{self._name}.con" ) self._console_socket: Optional[socket.socket] = None + self._console_file: Optional[socket.SocketIO] = None self._remove_files: List[str] = [] self._user_killed = False self._quit_issued = False @@ -509,6 +510,11 @@ def _early_cleanup(self) -> None: # If we keep the console socket open, we may deadlock waiting # for QEMU to exit, while QEMU is waiting for the socket to # become writable. + if self._console_file is not None: + LOG.debug("Closing console file") + self._console_file.close() + self._console_file = None + if self._console_socket is not None: LOG.debug("Closing console socket") self._console_socket.close() @@ -874,12 +880,25 @@ def console_socket(self) -> socket.socket: Returns a socket connected to the console """ if self._console_socket is None: + LOG.debug("Opening console socket") self._console_socket = console_socket.ConsoleSocket( self._console_address, file=self._console_log_path, drain=self._drain_console) return self._console_socket + @property + def console_file(self) -> socket.SocketIO: + """ + Returns a file associated with the console socket + """ + if self._console_file is None: + LOG.debug("Opening console file") + self._console_file = self.console_socket.makefile(mode='rb', + buffering=0, + encoding='utf-8') + return self._console_file + @property def temp_dir(self) -> str: """ diff --git a/tests/avocado/avocado_qemu/__init__.py b/tests/avocado/avocado_qemu/__init__.py index 33090903f1..0172a359b7 100644 --- a/tests/avocado/avocado_qemu/__init__.py +++ b/tests/avocado/avocado_qemu/__init__.py @@ -137,7 +137,7 @@ def _console_interaction(test, success_message, failure_message, assert not keep_sending or send_string if vm is None: vm = test.vm - console = vm.console_socket.makefile(mode='rb', encoding='utf-8') + console = vm.console_file console_logger = logging.getLogger('console') while True: if send_string:
Occasionally some avocado tests will fail waiting for console line despite the machine running correctly. Console data goes missing, as can be seen in the console log. This is due to _console_interaction calling makefile() on the console socket each time it is invoked, which must be losing old buffer contents when going out of scope. It is not enough to makefile() with buffered=0. That helps significantly but data loss is still possible. My guess is that readline() has a line buffer even when the file is in unbuffered mode, that can eat data. Fix this by providing a console file that persists for the life of the console. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> --- For some reason, ppc_prep_40p.py:IbmPrep40pMachine.test_openbios_192m was flakey for me due to this bug. I don't know why that in particular, 3 calls to wait_for_console_pattern probably helps. I didn't pinpoint when the bug was introduced because the original was probably not buggy because it was only run once at the end of the test. At some point after it was moved to common code, something would have started to call it more than once which is where potential for bug is introduced. Thanks, Nick python/qemu/machine/machine.py | 19 +++++++++++++++++++ tests/avocado/avocado_qemu/__init__.py | 2 +- 2 files changed, 20 insertions(+), 1 deletion(-)