diff mbox series

[ovs-dev,v3] python-stream: handle SSL error in do_handshake

Message ID da98669bd765b1d8e66cf4c1ba76cf142b67f472.camel@cloudandheat.com
State Accepted
Commit b456b1a02f629c2438ef2c3f247f35c8712f12c6
Headers show
Series [ovs-dev,v3] python-stream: handle SSL error in do_handshake | expand

Checks

Context Check Description
ovsrobot/apply-robot success apply and check: success
ovsrobot/github-robot-_Build_and_Test success github build: passed
ovsrobot/intel-ovs-compilation success test: success

Commit Message

Stefan Hoffmann April 21, 2023, 8:29 a.m. UTC
In some cases ovsdb server or relay gets restarted, ovsdb python clients
may keep the local socket open. Instead of reconnecting a lot of failures
will be logged.
This can be reproduced with ssl connections to the server/relay and
restarting it, so it has the same IP after restart.

This patch catches the Exceptions at do_handshake to recreate the
connection on the client side.

Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Stefan Hoffmann <stefan.hoffmann@cloudandheat.com>
Signed-off-by: Luca Czesla <luca.czesla@mail.schwarz>
Signed-off-by: Max Lamprecht <max.lamprecht@mail.schwarz>
Co-authored-by: Luca Czesla <luca.czesla@mail.schwarz>
Co-authored-by: Max Lamprecht <max.lamprecht@mail.schwarz>
---
 python/ovs/stream.py | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

Comments

Stefan Hoffmann April 25, 2023, 1:08 p.m. UTC | #1
On Mon, 2023-04-24 at 16:42 -0500, Terry Wilson wrote:
> 
> On Fri, Apr 21, 2023 at 3:29 AM Stefan Hoffmann
> <stefan.hoffmann@cloudandheat.com> wrote:
> > In some cases ovsdb server or relay gets restarted, ovsdb python
> > clients
> > may keep the local socket open. Instead of reconnecting a lot of
> > failures
> > will be logged.
> > This can be reproduced with ssl connections to the server/relay and
> > restarting it, so it has the same IP after restart.
> > 
> > This patch catches the Exceptions at do_handshake to recreate the
> > connection on the client side.
> > 
> 
> 
> Is this something we can add some tests for? I know we have some
> other tests that involve restarts, but I don't think they involve
> python clients or SSL.

I can try to find a proper place to add such an test. Can you point me
to the correct place to add python clients to that test?
But I'm not sure, if we run always into that issue at a restart.
>  
> > Reviewed-by: Simon Horman <simon.horman@corigine.com>
> > Signed-off-by: Stefan Hoffmann <stefan.hoffmann@cloudandheat.com>
> > Signed-off-by: Luca Czesla <luca.czesla@mail.schwarz>
> > Signed-off-by: Max Lamprecht <max.lamprecht@mail.schwarz>
> > Co-authored-by: Luca Czesla <luca.czesla@mail.schwarz>
> > Co-authored-by: Max Lamprecht <max.lamprecht@mail.schwarz>
> > ---
> >  python/ovs/stream.py | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> > 
> > diff --git a/python/ovs/stream.py b/python/ovs/stream.py
> > index ac5b0fd0c..b32341076 100644
> > --- a/python/ovs/stream.py
> > +++ b/python/ovs/stream.py
> > @@ -824,7 +824,8 @@ class SSLStream(Stream):
> >              self.socket.do_handshake()
> >          except ssl.SSLWantReadError:
> >              return errno.EAGAIN
> > -        except ssl.SSLSyscallError as e:
> > +        except (ssl.SSLSyscallError, ssl.SSLZeroReturnError,
> > +                ssl.SSLEOFError, OSError) as e:
> >              return ovs.socket_util.get_exception_errno(e)
> > 
> >          return 0
Ilya Maximets April 25, 2023, 4:13 p.m. UTC | #2
On 4/25/23 15:08, Stefan Hoffmann wrote:
> On Mon, 2023-04-24 at 16:42 -0500, Terry Wilson wrote:
>>
>> On Fri, Apr 21, 2023 at 3:29 AM Stefan Hoffmann
>> <stefan.hoffmann@cloudandheat.com> wrote:
>>> In some cases ovsdb server or relay gets restarted, ovsdb python
>>> clients
>>> may keep the local socket open. Instead of reconnecting a lot of
>>> failures
>>> will be logged.
>>> This can be reproduced with ssl connections to the server/relay and
>>> restarting it, so it has the same IP after restart.
>>>
>>> This patch catches the Exceptions at do_handshake to recreate the
>>> connection on the client side.
>>>
>>
>>
>> Is this something we can add some tests for? I know we have some
>> other tests that involve restarts, but I don't think they involve
>> python clients or SSL.
> 
> I can try to find a proper place to add such an test. Can you point me
> to the correct place to add python clients to that test?
> But I'm not sure, if we run always into that issue at a restart.

Usually, such tests live somewhere around tests/ovsdb-idl.at.
There is a basic CHECK_STREAM_OPEN_BLOCK test there, but it
doesn't check SSL connections, maybe it's worth it to extend
the test to be able to connect via SSL.  This will require
changing the tests/test-stream.py test script.

On the subject, I'm not sure if the issue can be reliably
reproduced, especially because it requires OpenSSL 3.0, AFAICT.

I have one question for the code below.

>>  
>>> Reviewed-by: Simon Horman <simon.horman@corigine.com>
>>> Signed-off-by: Stefan Hoffmann <stefan.hoffmann@cloudandheat.com>
>>> Signed-off-by: Luca Czesla <luca.czesla@mail.schwarz>
>>> Signed-off-by: Max Lamprecht <max.lamprecht@mail.schwarz>
>>> Co-authored-by: Luca Czesla <luca.czesla@mail.schwarz>
>>> Co-authored-by: Max Lamprecht <max.lamprecht@mail.schwarz>
>>> ---
>>>  python/ovs/stream.py | 3 ++-
>>>  1 file changed, 2 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/python/ovs/stream.py b/python/ovs/stream.py
>>> index ac5b0fd0c..b32341076 100644
>>> --- a/python/ovs/stream.py
>>> +++ b/python/ovs/stream.py
>>> @@ -824,7 +824,8 @@ class SSLStream(Stream):
>>>              self.socket.do_handshake()
>>>          except ssl.SSLWantReadError:
>>>              return errno.EAGAIN
>>> -        except ssl.SSLSyscallError as e:
>>> +        except (ssl.SSLSyscallError, ssl.SSLZeroReturnError,
>>> +                ssl.SSLEOFError, OSError) as e:

Do we actually need to catch the generic OSError here?
It looks a bit odd to catch non-ssl exception.

Best regards, Ilya Maximets.
Stefan Hoffmann April 26, 2023, 8:13 a.m. UTC | #3
On Tue, 2023-04-25 at 18:13 +0200, Ilya Maximets wrote:
> On 4/25/23 15:08, Stefan Hoffmann wrote:
> > On Mon, 2023-04-24 at 16:42 -0500, Terry Wilson wrote:
> > > 
> > > On Fri, Apr 21, 2023 at 3:29 AM Stefan Hoffmann
> > > <stefan.hoffmann@cloudandheat.com> wrote:
> > > > In some cases ovsdb server or relay gets restarted, ovsdb python
> > > > clients
> > > > may keep the local socket open. Instead of reconnecting a lot of
> > > > failures
> > > > will be logged.
> > > > This can be reproduced with ssl connections to the server/relay and
> > > > restarting it, so it has the same IP after restart.
> > > > 
> > > > This patch catches the Exceptions at do_handshake to recreate the
> > > > connection on the client side.
> > > > 
> > > 
> > > 
> > > Is this something we can add some tests for? I know we have some
> > > other tests that involve restarts, but I don't think they involve
> > > python clients or SSL.
> > 
> > I can try to find a proper place to add such an test. Can you point me
> > to the correct place to add python clients to that test?
> > But I'm not sure, if we run always into that issue at a restart.
> 
> Usually, such tests live somewhere around tests/ovsdb-idl.at.
> There is a basic CHECK_STREAM_OPEN_BLOCK test there, but it
> doesn't check SSL connections, maybe it's worth it to extend
> the test to be able to connect via SSL.  This will require
> changing the tests/test-stream.py test script.
> 
> On the subject, I'm not sure if the issue can be reliably
> reproduced, especially because it requires OpenSSL 3.0, AFAICT.
> 
> I have one question for the code below.
> 
> > >  
> > > > Reviewed-by: Simon Horman <simon.horman@corigine.com>
> > > > Signed-off-by: Stefan Hoffmann <stefan.hoffmann@cloudandheat.com>
> > > > Signed-off-by: Luca Czesla <luca.czesla@mail.schwarz>
> > > > Signed-off-by: Max Lamprecht <max.lamprecht@mail.schwarz>
> > > > Co-authored-by: Luca Czesla <luca.czesla@mail.schwarz>
> > > > Co-authored-by: Max Lamprecht <max.lamprecht@mail.schwarz>
> > > > ---
> > > >  python/ovs/stream.py | 3 ++-
> > > >  1 file changed, 2 insertions(+), 1 deletion(-)
> > > > 
> > > > diff --git a/python/ovs/stream.py b/python/ovs/stream.py
> > > > index ac5b0fd0c..b32341076 100644
> > > > --- a/python/ovs/stream.py
> > > > +++ b/python/ovs/stream.py
> > > > @@ -824,7 +824,8 @@ class SSLStream(Stream):
> > > >              self.socket.do_handshake()
> > > >          except ssl.SSLWantReadError:
> > > >              return errno.EAGAIN
> > > > -        except ssl.SSLSyscallError as e:
> > > > +        except (ssl.SSLSyscallError, ssl.SSLZeroReturnError,
> > > > +                ssl.SSLEOFError, OSError) as e:
> 
> Do we actually need to catch the generic OSError here?
> It looks a bit odd to catch non-ssl exception.

We got an "Transport endpoint is not connected" sometimes.
It tries getpeername() at a socket that is not connected anymore or
something like that. That results in the OSError.
We also saw, that the file descriptor was still there but not
connected.

Here the traceback (I hope you can still read that after line wrap):
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-
packages/ovsdbapp/backend/ovs_idl/connection.py", line 107, in run
    self.idl.run()
 File "/usr/local/lib/python3.9/site-packages/ovs-3.1.0-
py3.9.egg/ovs/db/idl.py", line 433, in run
    self._session.run()
  File "/usr/local/lib/python3.9/site-packages/ovs-3.1.0-
py3.9.egg/ovs/jsonrpc.py", line 519, in run
    error = self.stream.connect()
  File "/usr/local/lib/python3.9/site-packages/ovs-3.1.0-
py3.9.egg/ovs/stream.py", line 824, in connect
    self.socket.do_handshake()
  File "/usr/local/lib/python3.9/site-packages/eventlet/green/ssl.py",
line 312, in do_handshake
    return self._call_trampolining(
  File "/usr/local/lib/python3.9/site-packages/eventlet/green/ssl.py",
line 158, in _call_trampolining
    return func(*a, **kw)
  File "/usr/local/lib/python3.9/ssl.py", line 1305, in do_handshake
    self._check_connected()
  File "/usr/local/lib/python3.9/ssl.py", line 1089, in
_check_connected
    self.getpeername()

OSError: [Errno 107] Transport endpoint is not connected
> 
> Best regards, Ilya Maximets.
Ilya Maximets April 26, 2023, 10:57 a.m. UTC | #4
On 4/26/23 10:13, Stefan Hoffmann wrote:
> On Tue, 2023-04-25 at 18:13 +0200, Ilya Maximets wrote:
>> On 4/25/23 15:08, Stefan Hoffmann wrote:
>>> On Mon, 2023-04-24 at 16:42 -0500, Terry Wilson wrote:
>>>>
>>>> On Fri, Apr 21, 2023 at 3:29 AM Stefan Hoffmann
>>>> <stefan.hoffmann@cloudandheat.com> wrote:
>>>>> In some cases ovsdb server or relay gets restarted, ovsdb python
>>>>> clients
>>>>> may keep the local socket open. Instead of reconnecting a lot of
>>>>> failures
>>>>> will be logged.
>>>>> This can be reproduced with ssl connections to the server/relay and
>>>>> restarting it, so it has the same IP after restart.
>>>>>
>>>>> This patch catches the Exceptions at do_handshake to recreate the
>>>>> connection on the client side.
>>>>>
>>>>
>>>>
>>>> Is this something we can add some tests for? I know we have some
>>>> other tests that involve restarts, but I don't think they involve
>>>> python clients or SSL.
>>>
>>> I can try to find a proper place to add such an test. Can you point me
>>> to the correct place to add python clients to that test?
>>> But I'm not sure, if we run always into that issue at a restart.
>>
>> Usually, such tests live somewhere around tests/ovsdb-idl.at.
>> There is a basic CHECK_STREAM_OPEN_BLOCK test there, but it
>> doesn't check SSL connections, maybe it's worth it to extend
>> the test to be able to connect via SSL.  This will require
>> changing the tests/test-stream.py test script.
>>
>> On the subject, I'm not sure if the issue can be reliably
>> reproduced, especially because it requires OpenSSL 3.0, AFAICT.
>>
>> I have one question for the code below.
>>
>>>>  
>>>>> Reviewed-by: Simon Horman <simon.horman@corigine.com>
>>>>> Signed-off-by: Stefan Hoffmann <stefan.hoffmann@cloudandheat.com>
>>>>> Signed-off-by: Luca Czesla <luca.czesla@mail.schwarz>
>>>>> Signed-off-by: Max Lamprecht <max.lamprecht@mail.schwarz>
>>>>> Co-authored-by: Luca Czesla <luca.czesla@mail.schwarz>
>>>>> Co-authored-by: Max Lamprecht <max.lamprecht@mail.schwarz>
>>>>> ---
>>>>>  python/ovs/stream.py | 3 ++-
>>>>>  1 file changed, 2 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/python/ovs/stream.py b/python/ovs/stream.py
>>>>> index ac5b0fd0c..b32341076 100644
>>>>> --- a/python/ovs/stream.py
>>>>> +++ b/python/ovs/stream.py
>>>>> @@ -824,7 +824,8 @@ class SSLStream(Stream):
>>>>>              self.socket.do_handshake()
>>>>>          except ssl.SSLWantReadError:
>>>>>              return errno.EAGAIN
>>>>> -        except ssl.SSLSyscallError as e:
>>>>> +        except (ssl.SSLSyscallError, ssl.SSLZeroReturnError,
>>>>> +                ssl.SSLEOFError, OSError) as e:
>>
>> Do we actually need to catch the generic OSError here?
>> It looks a bit odd to catch non-ssl exception.
> 
> We got an "Transport endpoint is not connected" sometimes.
> It tries getpeername() at a socket that is not connected anymore or
> something like that. That results in the OSError.
> We also saw, that the file descriptor was still there but not
> connected.
> 
> Here the traceback (I hope you can still read that after line wrap):
> Traceback (most recent call last):
>   File "/usr/local/lib/python3.9/site-
> packages/ovsdbapp/backend/ovs_idl/connection.py", line 107, in run
>     self.idl.run()
>  File "/usr/local/lib/python3.9/site-packages/ovs-3.1.0-
> py3.9.egg/ovs/db/idl.py", line 433, in run
>     self._session.run()
>   File "/usr/local/lib/python3.9/site-packages/ovs-3.1.0-
> py3.9.egg/ovs/jsonrpc.py", line 519, in run
>     error = self.stream.connect()
>   File "/usr/local/lib/python3.9/site-packages/ovs-3.1.0-
> py3.9.egg/ovs/stream.py", line 824, in connect
>     self.socket.do_handshake()
>   File "/usr/local/lib/python3.9/site-packages/eventlet/green/ssl.py",
> line 312, in do_handshake
>     return self._call_trampolining(
>   File "/usr/local/lib/python3.9/site-packages/eventlet/green/ssl.py",
> line 158, in _call_trampolining
>     return func(*a, **kw)
>   File "/usr/local/lib/python3.9/ssl.py", line 1305, in do_handshake
>     self._check_connected()
>   File "/usr/local/lib/python3.9/ssl.py", line 1089, in
> _check_connected
>     self.getpeername()
> 
> OSError: [Errno 107] Transport endpoint is not connected

OK.  Yeah, looking through the ssl.py implementation, it can indeed
throw a generic OSError.  That's unfortunate, but OK, I guess.

Thanks for clarification!

>>
>> Best regards, Ilya Maximets.
>
Ilya Maximets April 26, 2023, 3:10 p.m. UTC | #5
On 4/26/23 12:57, Ilya Maximets wrote:
> On 4/26/23 10:13, Stefan Hoffmann wrote:
>> On Tue, 2023-04-25 at 18:13 +0200, Ilya Maximets wrote:
>>> On 4/25/23 15:08, Stefan Hoffmann wrote:
>>>> On Mon, 2023-04-24 at 16:42 -0500, Terry Wilson wrote:
>>>>>
>>>>> On Fri, Apr 21, 2023 at 3:29 AM Stefan Hoffmann
>>>>> <stefan.hoffmann@cloudandheat.com> wrote:
>>>>>> In some cases ovsdb server or relay gets restarted, ovsdb python
>>>>>> clients
>>>>>> may keep the local socket open. Instead of reconnecting a lot of
>>>>>> failures
>>>>>> will be logged.
>>>>>> This can be reproduced with ssl connections to the server/relay and
>>>>>> restarting it, so it has the same IP after restart.
>>>>>>
>>>>>> This patch catches the Exceptions at do_handshake to recreate the
>>>>>> connection on the client side.
>>>>>>
>>>>>
>>>>>
>>>>> Is this something we can add some tests for? I know we have some
>>>>> other tests that involve restarts, but I don't think they involve
>>>>> python clients or SSL.
>>>>
>>>> I can try to find a proper place to add such an test. Can you point me
>>>> to the correct place to add python clients to that test?
>>>> But I'm not sure, if we run always into that issue at a restart.
>>>
>>> Usually, such tests live somewhere around tests/ovsdb-idl.at.
>>> There is a basic CHECK_STREAM_OPEN_BLOCK test there, but it
>>> doesn't check SSL connections, maybe it's worth it to extend
>>> the test to be able to connect via SSL.  This will require
>>> changing the tests/test-stream.py test script.
>>>
>>> On the subject, I'm not sure if the issue can be reliably
>>> reproduced, especially because it requires OpenSSL 3.0, AFAICT.
>>>
>>> I have one question for the code below.
>>>
>>>>>  
>>>>>> Reviewed-by: Simon Horman <simon.horman@corigine.com>
>>>>>> Signed-off-by: Stefan Hoffmann <stefan.hoffmann@cloudandheat.com>
>>>>>> Signed-off-by: Luca Czesla <luca.czesla@mail.schwarz>
>>>>>> Signed-off-by: Max Lamprecht <max.lamprecht@mail.schwarz>
>>>>>> Co-authored-by: Luca Czesla <luca.czesla@mail.schwarz>
>>>>>> Co-authored-by: Max Lamprecht <max.lamprecht@mail.schwarz>
>>>>>> ---
>>>>>>  python/ovs/stream.py | 3 ++-
>>>>>>  1 file changed, 2 insertions(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/python/ovs/stream.py b/python/ovs/stream.py
>>>>>> index ac5b0fd0c..b32341076 100644
>>>>>> --- a/python/ovs/stream.py
>>>>>> +++ b/python/ovs/stream.py
>>>>>> @@ -824,7 +824,8 @@ class SSLStream(Stream):
>>>>>>              self.socket.do_handshake()
>>>>>>          except ssl.SSLWantReadError:
>>>>>>              return errno.EAGAIN
>>>>>> -        except ssl.SSLSyscallError as e:
>>>>>> +        except (ssl.SSLSyscallError, ssl.SSLZeroReturnError,
>>>>>> +                ssl.SSLEOFError, OSError) as e:
>>>
>>> Do we actually need to catch the generic OSError here?
>>> It looks a bit odd to catch non-ssl exception.
>>
>> We got an "Transport endpoint is not connected" sometimes.
>> It tries getpeername() at a socket that is not connected anymore or
>> something like that. That results in the OSError.
>> We also saw, that the file descriptor was still there but not
>> connected.
>>
>> Here the traceback (I hope you can still read that after line wrap):
>> Traceback (most recent call last):
>>   File "/usr/local/lib/python3.9/site-
>> packages/ovsdbapp/backend/ovs_idl/connection.py", line 107, in run
>>     self.idl.run()
>>  File "/usr/local/lib/python3.9/site-packages/ovs-3.1.0-
>> py3.9.egg/ovs/db/idl.py", line 433, in run
>>     self._session.run()
>>   File "/usr/local/lib/python3.9/site-packages/ovs-3.1.0-
>> py3.9.egg/ovs/jsonrpc.py", line 519, in run
>>     error = self.stream.connect()
>>   File "/usr/local/lib/python3.9/site-packages/ovs-3.1.0-
>> py3.9.egg/ovs/stream.py", line 824, in connect
>>     self.socket.do_handshake()
>>   File "/usr/local/lib/python3.9/site-packages/eventlet/green/ssl.py",
>> line 312, in do_handshake
>>     return self._call_trampolining(
>>   File "/usr/local/lib/python3.9/site-packages/eventlet/green/ssl.py",
>> line 158, in _call_trampolining
>>     return func(*a, **kw)
>>   File "/usr/local/lib/python3.9/ssl.py", line 1305, in do_handshake
>>     self._check_connected()
>>   File "/usr/local/lib/python3.9/ssl.py", line 1089, in
>> _check_connected
>>     self.getpeername()
>>
>> OSError: [Errno 107] Transport endpoint is not connected
> 
> OK.  Yeah, looking through the ssl.py implementation, it can indeed
> throw a generic OSError.  That's unfortunate, but OK, I guess.
> 
> Thanks for clarification!

Since it's unlikely we will have a test specific for this issue,
I applied the change as is for now.

Also, backported down to 2.17.

It would stll be great to have more test coverage around python
and SSL, but that is slightly out of the scope of the current fix.

Thanks!

Best regards, Ilya Maximets.
diff mbox series

Patch

diff --git a/python/ovs/stream.py b/python/ovs/stream.py
index ac5b0fd0c..b32341076 100644
--- a/python/ovs/stream.py
+++ b/python/ovs/stream.py
@@ -824,7 +824,8 @@  class SSLStream(Stream):
             self.socket.do_handshake()
         except ssl.SSLWantReadError:
             return errno.EAGAIN
-        except ssl.SSLSyscallError as e:
+        except (ssl.SSLSyscallError, ssl.SSLZeroReturnError,
+                ssl.SSLEOFError, OSError) as e:
             return ovs.socket_util.get_exception_errno(e)
 
         return 0