Message ID | 20240221074214.96824-1-zhihui.ding@easystack.cn |
---|---|
State | Changes Requested |
Headers | show |
Series | [ovs-dev,v1] ovs-tcpdump: Cleanup mirror failed with twice fatal signals | expand |
Context | Check | Description |
---|---|---|
ovsrobot/apply-robot | warning | apply and check: warning |
ovsrobot/github-robot-_Build_and_Test | success | github build: passed |
ovsrobot/intel-ovs-compilation | success | test: success |
Daniel Ding <danieldin186@gmail.com> writes: > After running ovs-tcpdump and inputs multiple CTRL+C, the program will > raise the following exception. > > Error in atexit._run_exitfuncs: > Traceback (most recent call last): > File "/usr/bin/ovs-tcpdump", line 421, in cleanup_mirror > ovsdb = OVSDB(db_sock) > File "/usr/bin/ovs-tcpdump", line 168, in __init__ > OVSDB.wait_for_db_change(self._idl_conn) # Initial Sync with DB > File "/usr/bin/ovs-tcpdump", line 155, in wait_for_db_change > while idl.change_seqno == seq and not idl.run(): > > Signed-off-by: Daniel Ding <zhihui.ding@easystack.cn> > --- LGTM for the linux side - maybe Alin might check the windows side. When you post v2 you can keep my Reviewed-by: Aaron Conole <aconole@redhat.com> > utilities/ovs-tcpdump.in | 15 +++++++++++++++ > 1 file changed, 15 insertions(+) > > diff --git a/utilities/ovs-tcpdump.in b/utilities/ovs-tcpdump.in > index 4cbd9a5d3..eec2262d6 100755 > --- a/utilities/ovs-tcpdump.in > +++ b/utilities/ovs-tcpdump.in > @@ -17,6 +17,7 @@ > import os > import pwd > from random import randint > +import signal > import subprocess > import sys > import time > @@ -417,8 +418,22 @@ def py_which(executable): > for path in os.environ["PATH"].split(os.pathsep)) > > > +def ignore_fatal_signals(): > + if sys.platform == "win32": > + signals = [signal.SIGTERM, signal.SIGINT] > + else: > + signals = [signal.SIGTERM, signal.SIGINT, signal.SIGHUP, > + signal.SIGALRM] > + > + for signr in signals: > + signal.signal(signr, signal.SIG_IGN) > + > + Just a nit, but it might be clearer to put the ignore_fatal_signals as a function scoped in teardown below. That way it is a bit clearer that this will only be called to turn off these in the double failure case. > def teardown(db_sock, interface, mirror_interface, tap_created): > def cleanup_mirror(): > + # Ignore fatal signals, avoid it to break OVSDB operations. > + # So that cleanup mirror and ports be finished. > + ignore_fatal_signals() > try: > ovsdb = OVSDB(db_sock) > ovsdb.destroy_mirror(interface, ovsdb.port_bridge(interface))
On 2/21/24 15:27, Aaron Conole wrote: > Daniel Ding <danieldin186@gmail.com> writes: > >> After running ovs-tcpdump and inputs multiple CTRL+C, the program will >> raise the following exception. >> >> Error in atexit._run_exitfuncs: >> Traceback (most recent call last): >> File "/usr/bin/ovs-tcpdump", line 421, in cleanup_mirror >> ovsdb = OVSDB(db_sock) >> File "/usr/bin/ovs-tcpdump", line 168, in __init__ >> OVSDB.wait_for_db_change(self._idl_conn) # Initial Sync with DB >> File "/usr/bin/ovs-tcpdump", line 155, in wait_for_db_change >> while idl.change_seqno == seq and not idl.run(): >> >> Signed-off-by: Daniel Ding <zhihui.ding@easystack.cn> >> --- > > LGTM for the linux side - maybe Alin might check the windows side. The code is copied from the fatal-signla library, so it is likely fine. However, I don;t think this issue should be fixed on the application level (ovs-tcpdump). There seems to be adifference in how C and python fatal-signla libraries behave. The C version calls the hooks in the run() function and the signal handler only updates the signal number variable. So, if the same signal arrives multiple times it will be handled only once and the run() function will not exit until all the hooks are done, unless it is a segfault. With python implementation we're calling hooks right from the signal handler (it's not a real signal handler, so we're allowed to use not really singal-safe functions there). So, if another signal arrives while we're executing the hooks, the second hook execution will be skipped due to 'recursion' check, but the signal will be re-raised immediately, killing the process. There is likley a way to fix that by using a mutex instead of recursion check and blocking it while executing the hooks. The signal number will need to be stored in the signal handler and the signal will need to be re-raised in the call_hooks() instead of signal handler. We can use mutex.acquire(blocking=False) as a way to prevent recursion. Does that make sense? Best regards, Ilya Maximets.
> 2024年2月22日 上午1:55,Ilya Maximets <i.maximets@ovn.org> 写道: > > On 2/21/24 15:27, Aaron Conole wrote: >> Daniel Ding <danieldin186@gmail.com> writes: >> >>> After running ovs-tcpdump and inputs multiple CTRL+C, the program will >>> raise the following exception. >>> >>> Error in atexit._run_exitfuncs: >>> Traceback (most recent call last): >>> File "/usr/bin/ovs-tcpdump", line 421, in cleanup_mirror >>> ovsdb = OVSDB(db_sock) >>> File "/usr/bin/ovs-tcpdump", line 168, in __init__ >>> OVSDB.wait_for_db_change(self._idl_conn) # Initial Sync with DB >>> File "/usr/bin/ovs-tcpdump", line 155, in wait_for_db_change >>> while idl.change_seqno == seq and not idl.run(): >>> >>> Signed-off-by: Daniel Ding <zhihui.ding@easystack.cn> >>> --- >> >> LGTM for the linux side - maybe Alin might check the windows side. > > The code is copied from the fatal-signla library, so it is likely fine. > However, I don;t think this issue should be fixed on the application > level (ovs-tcpdump). There seems to be adifference in how C and python > fatal-signla libraries behave. The C version calls the hooks in the run() > function and the signal handler only updates the signal number variable. > So, if the same signal arrives multiple times it will be handled only once > and the run() function will not exit until all the hooks are done, unless > it is a segfault. > > With python implementation we're calling hooks right from the signal > handler (it's not a real signal handler, so we're allowed to use not > really singal-safe functions there). So, if another signal arrives while > we're executing the hooks, the second hook execution will be skipped > due to 'recursion' check, but the signal will be re-raised immediately, > killing the process. As your suggestions, I rewrite recurse as a threading.Lock. So that protect calling hooks applications registered. > > There is likley a way to fix that by using a mutex instead of recursion > check and blocking it while executing the hooks. The signal number > will need to be stored in the signal handler and the signal will need > to be re-raised in the call_hooks() instead of signal handler. > We can use mutex.acquire(blocking=False) as a way to prevent recursion. > To avoid breaking calling hooks, I set the signal restored in the signal handler to SIG_IGN in call_hooks. And move re-raised the signal in the call_hooks with locked. > Does that make sense? > > Best regards, Ilya Maximets. diff --git a/python/ovs/fatal_signal.py b/python/ovs/fatal_signal.py index cb2e99e87..3aa56a166 100644 --- a/python/ovs/fatal_signal.py +++ b/python/ovs/fatal_signal.py @@ -16,6 +16,7 @@ import atexit import os import signal import sys +import threading import ovs.vlog @@ -112,28 +113,34 @@ def _unlink(file_): def _signal_handler(signr, _): _call_hooks(signr) - # Re-raise the signal with the default handling so that the program - # termination status reflects that we were killed by this signal. - signal.signal(signr, signal.SIG_DFL) - os.kill(os.getpid(), signr) - def _atexit_handler(): _call_hooks(0) -recurse = False +recurse = threading.Lock() def _call_hooks(signr): global recurse - if recurse: + if recurse.locked(): return - recurse = True - for hook, cancel, run_at_exit in _hooks: - if signr != 0 or run_at_exit: - hook() + with recurse: + if signr != 0: + signal.signal(signr, signal.SIG_IGN) + else: + signal.signal(signal.SIGINT, signal.SIG_IGN) + + for hook, cancel, run_at_exit in _hooks: + if signr != 0 or run_at_exit: + hook() + + if signr != 0: + # Re-raise the signal with the default handling so that the program + # termination status reflects that we were killed by this signal. + signal.signal(signr, signal.SIG_DFL) + os.kill(os.getpid(), signr) _inited = False @@ -165,7 +172,6 @@ def signal_alarm(timeout): if sys.platform == "win32": import time - import threading class Alarm (threading.Thread): def __init__(self, timeout): -- 2.43.0 Thanks Ilya Maximets, please help me review again. Best regards, Daniel Ding.
On 2/22/24 11:02, Daniel Ding wrote: > > >> 2024年2月22日 上午1:55,Ilya Maximets <i.maximets@ovn.org <mailto:i.maximets@ovn.org>> 写道: >> >> On 2/21/24 15:27, Aaron Conole wrote: >>> Daniel Ding <danieldin186@gmail.com <mailto:danieldin186@gmail.com>> writes: >>> >>>> After running ovs-tcpdump and inputs multiple CTRL+C, the program will >>>> raise the following exception. >>>> >>>> Error in atexit._run_exitfuncs: >>>> Traceback (most recent call last): >>>> File "/usr/bin/ovs-tcpdump", line 421, in cleanup_mirror >>>> ovsdb = OVSDB(db_sock) >>>> File "/usr/bin/ovs-tcpdump", line 168, in __init__ >>>> OVSDB.wait_for_db_change(self._idl_conn) # Initial Sync with DB >>>> File "/usr/bin/ovs-tcpdump", line 155, in wait_for_db_change >>>> while idl.change_seqno == seq and not idl.run(): >>>> >>>> Signed-off-by: Daniel Ding <zhihui.ding@easystack.cn <mailto:zhihui.ding@easystack.cn>> >>>> --- >>> >>> LGTM for the linux side - maybe Alin might check the windows side. >> >> The code is copied from the fatal-signla library, so it is likely fine. >> However, I don;t think this issue should be fixed on the application >> level (ovs-tcpdump). There seems to be adifference in how C and python >> fatal-signla libraries behave. The C version calls the hooks in the run() >> function and the signal handler only updates the signal number variable. >> So, if the same signal arrives multiple times it will be handled only once >> and the run() function will not exit until all the hooks are done, unless >> it is a segfault. >> >> With python implementation we're calling hooks right from the signal >> handler (it's not a real signal handler, so we're allowed to use not >> really singal-safe functions there). So, if another signal arrives while >> we're executing the hooks, the second hook execution will be skipped >> due to 'recursion' check, but the signal will be re-raised immediately, >> killing the process. > > As your suggestions, I rewrite recurse as a threading.Lock. So that protect calling hooks > applications registered. > >> >> There is likley a way to fix that by using a mutex instead of recursion >> check and blocking it while executing the hooks. The signal number >> will need to be stored in the signal handler and the signal will need >> to be re-raised in the call_hooks() instead of signal handler. >> We can use mutex.acquire(blocking=False) as a way to prevent recursion. >> > > To avoid breaking calling hooks, I set the signal restored in the signal handler to SIG_IGN in > call_hooks. I'm not sure this is necessary, see below. > And move re-raised the signal in the call_hooks with locked. > >> Does that make sense? >> >> Best regards, Ilya Maximets. > > diff --git a/python/ovs/fatal_signal.py b/python/ovs/fatal_signal.py > index cb2e99e87..3aa56a166 100644 > --- a/python/ovs/fatal_signal.py > +++ b/python/ovs/fatal_signal.py > @@ -16,6 +16,7 @@ import atexit > import os > import signal > import sys > +import threading > > import ovs.vlog > > @@ -112,28 +113,34 @@ def _unlink(file_): > def _signal_handler(signr, _): > _call_hooks(signr) > > - # Re-raise the signal with the default handling so that the program > - # termination status reflects that we were killed by this signal. > - signal.signal(signr, signal.SIG_DFL) > - os.kill(os.getpid(), signr) > - > > def _atexit_handler(): > _call_hooks(0) > > > -recurse = False > +recurse = threading.Lock() It's better to rename this to just 'mutex' or 'lock'. > > > def _call_hooks(signr): > global recurse > - if recurse: > + if recurse.locked(): > return There is an awkward race window between checking that the mutex is locked and actually attempting to take it below. It's probably not a big deal here, but it's better to take the lock at the same time with checking, e.g.: if not mutex.acquire(blocking=False): return And since we're going to end the process anyway (it's a handler for fatal signals after all), we don't actually need to unlock it, AFAICT. > - recurse = True > > - for hook, cancel, run_at_exit in _hooks: > - if signr != 0 or run_at_exit: > - hook() > + with recurse: > + if signr != 0: > + signal.signal(signr, signal.SIG_IGN) > + else: > + signal.signal(signal.SIGINT, signal.SIG_IGN) If another signal arrives while we're under the lock, the signal handler will call the _call_hooks(), check that the lock is already taken and return. There is no need to ignore it. Also, if a different signal arrives the code above will not ignore it anyway. > + > + for hook, cancel, run_at_exit in _hooks: > + if signr != 0 or run_at_exit: > + hook() > + > + if signr != 0: > + # Re-raise the signal with the default handling so that the program > + # termination status reflects that we were killed by this signal. > + signal.signal(signr, signal.SIG_DFL) > + os.kill(os.getpid(), signr) We may want to place this in the try: finally: block, so we still re-raise the signal even if one of the hooks throws an exception. Hooks should not generally throw exceptions, but may be safer this way. What do you think? > > > _inited = False > @@ -165,7 +172,6 @@ def signal_alarm(timeout): > > if sys.platform == "win32": > import time > - import threading > > class Alarm (threading.Thread): > def __init__(self, timeout): > -- > 2.43.0 > > Thanks Ilya Maximets, please help me review again. > > Best regards, Daniel Ding. >
On 2/22/24 11:53, Ilya Maximets wrote: > On 2/22/24 11:02, Daniel Ding wrote: >> >> >>> 2024年2月22日 上午1:55,Ilya Maximets <i.maximets@ovn.org <mailto:i.maximets@ovn.org>> 写道: >>> >>> On 2/21/24 15:27, Aaron Conole wrote: >>>> Daniel Ding <danieldin186@gmail.com <mailto:danieldin186@gmail.com>> writes: >>>> >>>>> After running ovs-tcpdump and inputs multiple CTRL+C, the program will >>>>> raise the following exception. >>>>> >>>>> Error in atexit._run_exitfuncs: >>>>> Traceback (most recent call last): >>>>> File "/usr/bin/ovs-tcpdump", line 421, in cleanup_mirror >>>>> ovsdb = OVSDB(db_sock) >>>>> File "/usr/bin/ovs-tcpdump", line 168, in __init__ >>>>> OVSDB.wait_for_db_change(self._idl_conn) # Initial Sync with DB >>>>> File "/usr/bin/ovs-tcpdump", line 155, in wait_for_db_change >>>>> while idl.change_seqno == seq and not idl.run(): >>>>> >>>>> Signed-off-by: Daniel Ding <zhihui.ding@easystack.cn <mailto:zhihui.ding@easystack.cn>> >>>>> --- >>>> >>>> LGTM for the linux side - maybe Alin might check the windows side. >>> >>> The code is copied from the fatal-signla library, so it is likely fine. >>> However, I don;t think this issue should be fixed on the application >>> level (ovs-tcpdump). There seems to be adifference in how C and python >>> fatal-signla libraries behave. The C version calls the hooks in the run() >>> function and the signal handler only updates the signal number variable. >>> So, if the same signal arrives multiple times it will be handled only once >>> and the run() function will not exit until all the hooks are done, unless >>> it is a segfault. >>> >>> With python implementation we're calling hooks right from the signal >>> handler (it's not a real signal handler, so we're allowed to use not >>> really singal-safe functions there). So, if another signal arrives while >>> we're executing the hooks, the second hook execution will be skipped >>> due to 'recursion' check, but the signal will be re-raised immediately, >>> killing the process. >> >> As your suggestions, I rewrite recurse as a threading.Lock. So that protect calling hooks >> applications registered. >> >>> >>> There is likley a way to fix that by using a mutex instead of recursion >>> check and blocking it while executing the hooks. The signal number >>> will need to be stored in the signal handler and the signal will need >>> to be re-raised in the call_hooks() instead of signal handler. >>> We can use mutex.acquire(blocking=False) as a way to prevent recursion. >>> >> >> To avoid breaking calling hooks, I set the signal restored in the signal handler to SIG_IGN in >> call_hooks. > > I'm not sure this is necessary, see below. > >> And move re-raised the signal in the call_hooks with locked. >> >>> Does that make sense? >>> >>> Best regards, Ilya Maximets. >> >> diff --git a/python/ovs/fatal_signal.py b/python/ovs/fatal_signal.py >> index cb2e99e87..3aa56a166 100644 >> --- a/python/ovs/fatal_signal.py >> +++ b/python/ovs/fatal_signal.py >> @@ -16,6 +16,7 @@ import atexit >> import os >> import signal >> import sys >> +import threading >> >> import ovs.vlog >> >> @@ -112,28 +113,34 @@ def _unlink(file_): >> def _signal_handler(signr, _): >> _call_hooks(signr) >> >> - # Re-raise the signal with the default handling so that the program >> - # termination status reflects that we were killed by this signal. >> - signal.signal(signr, signal.SIG_DFL) >> - os.kill(os.getpid(), signr) >> - >> >> def _atexit_handler(): >> _call_hooks(0) >> >> >> -recurse = False >> +recurse = threading.Lock() > > It's better to rename this to just 'mutex' or 'lock'. > >> >> >> def _call_hooks(signr): >> global recurse >> - if recurse: >> + if recurse.locked(): >> return > > There is an awkward race window between checking that the mutex is > locked and actually attempting to take it below. It's probably not > a big deal here, but it's better to take the lock at the same time > with checking, e.g.: > > if not mutex.acquire(blocking=False): > return > > And since we're going to end the process anyway (it's a handler > for fatal signals after all), we don't actually need to unlock it, > AFAICT. > >> - recurse = True >> >> - for hook, cancel, run_at_exit in _hooks: >> - if signr != 0 or run_at_exit: >> - hook() >> + with recurse: >> + if signr != 0: >> + signal.signal(signr, signal.SIG_IGN) >> + else: >> + signal.signal(signal.SIGINT, signal.SIG_IGN) > > If another signal arrives while we're under the lock, the signal > handler will call the _call_hooks(), check that the lock is already > taken and return. There is no need to ignore it. > Also, if a different signal arrives the code above will not ignore > it anyway. > >> + >> + for hook, cancel, run_at_exit in _hooks: >> + if signr != 0 or run_at_exit: >> + hook() >> + >> + if signr != 0: >> + # Re-raise the signal with the default handling so that the program >> + # termination status reflects that we were killed by this signal. >> + signal.signal(signr, signal.SIG_DFL) >> + os.kill(os.getpid(), signr) > > We may want to place this in the try: finally: block, so we still > re-raise the signal even if one of the hooks throws an exception. > Hooks should not generally throw exceptions, but may be safer this > way. On a second thought, if we re-raise after the exception in the hook, we may loose the exception. So, letting the exception to kill the application may be better in this case then hiding it with re-raising a signal. > > What do you think? > >> >> >> _inited = False >> @@ -165,7 +172,6 @@ def signal_alarm(timeout): >> >> if sys.platform == "win32": >> import time >> - import threading >> >> class Alarm (threading.Thread): >> def __init__(self, timeout): >> -- >> 2.43.0 >> >> Thanks Ilya Maximets, please help me review again. >> >> Best regards, Daniel Ding. >> >
> 2024年2月22日 下午7:20,Ilya Maximets <i.maximets@ovn.org> 写道: > > On 2/22/24 11:53, Ilya Maximets wrote: >> On 2/22/24 11:02, Daniel Ding wrote: >>> >>> >>>> 2024年2月22日 上午1:55,Ilya Maximets <i.maximets@ovn.org <mailto:i.maximets@ovn.org>> 写道: >>>> >>>> On 2/21/24 15:27, Aaron Conole wrote: >>>>> Daniel Ding <danieldin186@gmail.com <mailto:danieldin186@gmail.com>> writes: >>>>> >>>>>> After running ovs-tcpdump and inputs multiple CTRL+C, the program will >>>>>> raise the following exception. >>>>>> >>>>>> Error in atexit._run_exitfuncs: >>>>>> Traceback (most recent call last): >>>>>> File "/usr/bin/ovs-tcpdump", line 421, in cleanup_mirror >>>>>> ovsdb = OVSDB(db_sock) >>>>>> File "/usr/bin/ovs-tcpdump", line 168, in __init__ >>>>>> OVSDB.wait_for_db_change(self._idl_conn) # Initial Sync with DB >>>>>> File "/usr/bin/ovs-tcpdump", line 155, in wait_for_db_change >>>>>> while idl.change_seqno == seq and not idl.run(): >>>>>> >>>>>> Signed-off-by: Daniel Ding <zhihui.ding@easystack.cn <mailto:zhihui.ding@easystack.cn>> >>>>>> --- >>>>> >>>>> LGTM for the linux side - maybe Alin might check the windows side. >>>> >>>> The code is copied from the fatal-signla library, so it is likely fine. >>>> However, I don;t think this issue should be fixed on the application >>>> level (ovs-tcpdump). There seems to be adifference in how C and python >>>> fatal-signla libraries behave. The C version calls the hooks in the run() >>>> function and the signal handler only updates the signal number variable. >>>> So, if the same signal arrives multiple times it will be handled only once >>>> and the run() function will not exit until all the hooks are done, unless >>>> it is a segfault. >>>> >>>> With python implementation we're calling hooks right from the signal >>>> handler (it's not a real signal handler, so we're allowed to use not >>>> really singal-safe functions there). So, if another signal arrives while >>>> we're executing the hooks, the second hook execution will be skipped >>>> due to 'recursion' check, but the signal will be re-raised immediately, >>>> killing the process. >>> >>> As your suggestions, I rewrite recurse as a threading.Lock. So that protect calling hooks >>> applications registered. >>> >>>> >>>> There is likley a way to fix that by using a mutex instead of recursion >>>> check and blocking it while executing the hooks. The signal number >>>> will need to be stored in the signal handler and the signal will need >>>> to be re-raised in the call_hooks() instead of signal handler. >>>> We can use mutex.acquire(blocking=False) as a way to prevent recursion. >>>> >>> >>> To avoid breaking calling hooks, I set the signal restored in the signal handler to SIG_IGN in >>> call_hooks. >> >> I'm not sure this is necessary, see below. The default handler of SIGINT is default_int_handler, so it not register the signal handler successfully. When received CTRL+C again, the program be break, and calling hook can't execute completely. >> >>> And move re-raised the signal in the call_hooks with locked. >>> >>>> Does that make sense? >>>> >>>> Best regards, Ilya Maximets. >>> >>> diff --git a/python/ovs/fatal_signal.py b/python/ovs/fatal_signal.py >>> index cb2e99e87..3aa56a166 100644 >>> --- a/python/ovs/fatal_signal.py >>> +++ b/python/ovs/fatal_signal.py >>> @@ -16,6 +16,7 @@ import atexit >>> import os >>> import signal >>> import sys >>> +import threading >>> >>> import ovs.vlog >>> >>> @@ -112,28 +113,34 @@ def _unlink(file_): >>> def _signal_handler(signr, _): >>> _call_hooks(signr) >>> >>> - # Re-raise the signal with the default handling so that the program >>> - # termination status reflects that we were killed by this signal. >>> - signal.signal(signr, signal.SIG_DFL) >>> - os.kill(os.getpid(), signr) >>> - >>> >>> def _atexit_handler(): >>> _call_hooks(0) >>> >>> >>> -recurse = False >>> +recurse = threading.Lock() >> >> It's better to rename this to just 'mutex' or 'lock'. Done >> >>> >>> >>> def _call_hooks(signr): >>> global recurse >>> - if recurse: >>> + if recurse.locked(): >>> return >> >> There is an awkward race window between checking that the mutex is >> locked and actually attempting to take it below. It's probably not >> a big deal here, but it's better to take the lock at the same time >> with checking, e.g.: >> >> if not mutex.acquire(blocking=False): >> return >> >> And since we're going to end the process anyway (it's a handler >> for fatal signals after all), we don't actually need to unlock it, >> AFAICT. Agree it. >> >>> - recurse = True >>> >>> - for hook, cancel, run_at_exit in _hooks: >>> - if signr != 0 or run_at_exit: >>> - hook() >>> + with recurse: >>> + if signr != 0: >>> + signal.signal(signr, signal.SIG_IGN) >>> + else: >>> + signal.signal(signal.SIGINT, signal.SIG_IGN) >> >> If another signal arrives while we're under the lock, the signal >> handler will call the _call_hooks(), check that the lock is already >> taken and return. There is no need to ignore it. >> Also, if a different signal arrives the code above will not ignore >> it anyway. >> You’re right, but the SIGINT is exception in the _init because its default handler is not SIG_DFL. So I append a new condition for SIGINT to register the signal handler. for signr in signals: handler = signal.getsignal(signr) if (handler == signal.SIG_DFL or handler == signal.default_int_handler): signal.signal(signr, _signal_handler) >>> + >>> + for hook, cancel, run_at_exit in _hooks: >>> + if signr != 0 or run_at_exit: >>> + hook() >>> + >>> + if signr != 0: >>> + # Re-raise the signal with the default handling so that the program >>> + # termination status reflects that we were killed by this signal. >>> + signal.signal(signr, signal.SIG_DFL) >>> + os.kill(os.getpid(), signr) >> >> We may want to place this in the try: finally: block, so we still >> re-raise the signal even if one of the hooks throws an exception. >> Hooks should not generally throw exceptions, but may be safer this >> way. > > On a second thought, if we re-raise after the exception in the > hook, we may loose the exception. So, letting the exception to > kill the application may be better in this case then hiding it > with re-raising a signal. > >> >> What do you think? >> >>> >>> >>> _inited = False >>> @@ -165,7 +172,6 @@ def signal_alarm(timeout): >>> >>> if sys.platform == "win32": >>> import time >>> - import threading >>> >>> class Alarm (threading.Thread): >>> def __init__(self, timeout): >>> -- >>> 2.43.0 >>> >>> Thanks Ilya Maximets, please help me review again. >>> >>> Best regards, Daniel Ding. python/ovs/fatal_signal.py | 24 +++++++++++++----------- 1 file changed, 13 insertions(+), 11 deletions(-) diff --git a/python/ovs/fatal_signal.py b/python/ovs/fatal_signal.py index cb2e99e87..077f50dd5 100644 --- a/python/ovs/fatal_signal.py +++ b/python/ovs/fatal_signal.py @@ -16,6 +16,7 @@ import atexit import os import signal import sys +import threading import ovs.vlog @@ -112,29 +113,29 @@ def _unlink(file_): def _signal_handler(signr, _): _call_hooks(signr) - # Re-raise the signal with the default handling so that the program - # termination status reflects that we were killed by this signal. - signal.signal(signr, signal.SIG_DFL) - os.kill(os.getpid(), signr) - def _atexit_handler(): _call_hooks(0) -recurse = False +mutex = threading.Lock() def _call_hooks(signr): - global recurse - if recurse: + global mutex + if not mutex.acquire(blocking=False): return - recurse = True for hook, cancel, run_at_exit in _hooks: if signr != 0 or run_at_exit: hook() + if signr != 0: + # Re-raise the signal with the default handling so that the program + # termination status reflects that we were killed by this signal. + signal.signal(signr, signal.SIG_DFL) + os.kill(os.getpid(), signr) + _inited = False @@ -150,7 +151,9 @@ def _init(): signal.SIGALRM] for signr in signals: - if signal.getsignal(signr) == signal.SIG_DFL: + handler = signal.getsignal(signr) + if (handler == signal.SIG_DFL or + handler == signal.default_int_handler): signal.signal(signr, _signal_handler) atexit.register(_atexit_handler) @@ -165,7 +168,6 @@ def signal_alarm(timeout): if sys.platform == "win32": import time - import threading class Alarm (threading.Thread): def __init__(self, timeout): Thanks for your review, Ilya Maximets. Regards, Daniel Ding
On 2/22/24 14:12, Daniel Ding wrote: > > > >> 2024年2月22日 下午7:20,Ilya Maximets <i.maximets@ovn.org <mailto:i.maximets@ovn.org>> 写道: >> >> On 2/22/24 11:53, Ilya Maximets wrote: >>> On 2/22/24 11:02, Daniel Ding wrote: >>>> >>>> >>>>> 2024年2月22日 上午1:55,Ilya Maximets <i.maximets@ovn.org <mailto:i.maximets@ovn.org> <mailto:i.maximets@ovn.org <mailto:i.maximets@ovn.org>>> 写道: >>>>> >>>>> On 2/21/24 15:27, Aaron Conole wrote: >>>>>> Daniel Ding <danieldin186@gmail.com <mailto:danieldin186@gmail.com> <mailto:danieldin186@gmail.com <mailto:danieldin186@gmail.com>>> writes: >>>>>> >>>>>>> After running ovs-tcpdump and inputs multiple CTRL+C, the program will >>>>>>> raise the following exception. >>>>>>> >>>>>>> Error in atexit._run_exitfuncs: >>>>>>> Traceback (most recent call last): >>>>>>> File "/usr/bin/ovs-tcpdump", line 421, in cleanup_mirror >>>>>>> ovsdb = OVSDB(db_sock) >>>>>>> File "/usr/bin/ovs-tcpdump", line 168, in __init__ >>>>>>> OVSDB.wait_for_db_change(self._idl_conn) # Initial Sync with DB >>>>>>> File "/usr/bin/ovs-tcpdump", line 155, in wait_for_db_change >>>>>>> while idl.change_seqno == seq and not idl.run(): >>>>>>> >>>>>>> Signed-off-by: Daniel Ding <zhihui.ding@easystack.cn <mailto:zhihui.ding@easystack.cn> <mailto:zhihui.ding@easystack.cn <mailto:zhihui.ding@easystack.cn>>> >>>>>>> --- >>>>>> >>>>>> LGTM for the linux side - maybe Alin might check the windows side. >>>>> >>>>> The code is copied from the fatal-signla library, so it is likely fine. >>>>> However, I don;t think this issue should be fixed on the application >>>>> level (ovs-tcpdump). There seems to be adifference in how C and python >>>>> fatal-signla libraries behave. The C version calls the hooks in the run() >>>>> function and the signal handler only updates the signal number variable. >>>>> So, if the same signal arrives multiple times it will be handled only once >>>>> and the run() function will not exit until all the hooks are done, unless >>>>> it is a segfault. >>>>> >>>>> With python implementation we're calling hooks right from the signal >>>>> handler (it's not a real signal handler, so we're allowed to use not >>>>> really singal-safe functions there). So, if another signal arrives while >>>>> we're executing the hooks, the second hook execution will be skipped >>>>> due to 'recursion' check, but the signal will be re-raised immediately, >>>>> killing the process. >>>> >>>> As your suggestions, I rewrite recurse as a threading.Lock. So that protect calling hooks >>>> applications registered. >>>> >>>>> >>>>> There is likley a way to fix that by using a mutex instead of recursion >>>>> check and blocking it while executing the hooks. The signal number >>>>> will need to be stored in the signal handler and the signal will need >>>>> to be re-raised in the call_hooks() instead of signal handler. >>>>> We can use mutex.acquire(blocking=False) as a way to prevent recursion. >>>>> >>>> >>>> To avoid breaking calling hooks, I set the signal restored in the signal handler to SIG_IGN in >>>> call_hooks. >>> >>> I'm not sure this is necessary, see below. > > The default handler of SIGINT is default_int_handler, so it not register the signal handler > successfully. When received CTRL+C again, the program be break, and calling hook can't > execute completely. > >>> >>>> And move re-raised the signal in the call_hooks with locked. >>>> >>>>> Does that make sense? >>>>> >>>>> Best regards, Ilya Maximets. >>>> >>>> diff --git a/python/ovs/fatal_signal.py b/python/ovs/fatal_signal.py >>>> index cb2e99e87..3aa56a166 100644 >>>> --- a/python/ovs/fatal_signal.py >>>> +++ b/python/ovs/fatal_signal.py >>>> @@ -16,6 +16,7 @@ import atexit >>>> import os >>>> import signal >>>> import sys >>>> +import threading >>>> >>>> import ovs.vlog >>>> >>>> @@ -112,28 +113,34 @@ def _unlink(file_): >>>> def _signal_handler(signr, _): >>>> _call_hooks(signr) >>>> >>>> - # Re-raise the signal with the default handling so that the program >>>> - # termination status reflects that we were killed by this signal. >>>> - signal.signal(signr, signal.SIG_DFL) >>>> - os.kill(os.getpid(), signr) >>>> - >>>> >>>> def _atexit_handler(): >>>> _call_hooks(0) >>>> >>>> >>>> -recurse = False >>>> +recurse = threading.Lock() >>> >>> It's better to rename this to just 'mutex' or 'lock'. > > Done > >>> >>>> >>>> >>>> def _call_hooks(signr): >>>> global recurse >>>> - if recurse: >>>> + if recurse.locked(): >>>> return >>> >>> There is an awkward race window between checking that the mutex is >>> locked and actually attempting to take it below. It's probably not >>> a big deal here, but it's better to take the lock at the same time >>> with checking, e.g.: >>> >>> if not mutex.acquire(blocking=False): >>> return >>> >>> And since we're going to end the process anyway (it's a handler >>> for fatal signals after all), we don't actually need to unlock it, >>> AFAICT. > > Agree it. > >>> >>>> - recurse = True >>>> >>>> - for hook, cancel, run_at_exit in _hooks: >>>> - if signr != 0 or run_at_exit: >>>> - hook() >>>> + with recurse: >>>> + if signr != 0: >>>> + signal.signal(signr, signal.SIG_IGN) >>>> + else: >>>> + signal.signal(signal.SIGINT, signal.SIG_IGN) >>> >>> If another signal arrives while we're under the lock, the signal >>> handler will call the _call_hooks(), check that the lock is already >>> taken and return. There is no need to ignore it. >>> Also, if a different signal arrives the code above will not ignore >>> it anyway. >>> > > You’re right, but the SIGINT is exception in the _init because its default handler is not SIG_DFL. > So I append a new condition for SIGINT to register the signal handler. > > for signr in signals: > handler = signal.getsignal(signr) > if (handler == signal.SIG_DFL or > handler == signal.default_int_handler): > signal.signal(signr, _signal_handler) Hmm, yeah, makes sense. With that, do we need to catch the KeyboardInterrupt exception in the ovs-tcpdump now? > > >>>> + >>>> + for hook, cancel, run_at_exit in _hooks: >>>> + if signr != 0 or run_at_exit: >>>> + hook() >>>> + >>>> + if signr != 0: >>>> + # Re-raise the signal with the default handling so that the program >>>> + # termination status reflects that we were killed by this signal. >>>> + signal.signal(signr, signal.SIG_DFL) >>>> + os.kill(os.getpid(), signr) >>> >>> We may want to place this in the try: finally: block, so we still >>> re-raise the signal even if one of the hooks throws an exception. >>> Hooks should not generally throw exceptions, but may be safer this >>> way. >> >> On a second thought, if we re-raise after the exception in the >> hook, we may loose the exception. So, letting the exception to >> kill the application may be better in this case then hiding it >> with re-raising a signal. >> >>> >>> What do you think? >>> >>>> >>>> >>>> _inited = False >>>> @@ -165,7 +172,6 @@ def signal_alarm(timeout): >>>> >>>> if sys.platform == "win32": >>>> import time >>>> - import threading >>>> >>>> class Alarm (threading.Thread): >>>> def __init__(self, timeout): >>>> -- >>>> 2.43.0 >>>> >>>> Thanks Ilya Maximets, please help me review again. >>>> >>>> Best regards, Daniel Ding. > > > python/ovs/fatal_signal.py | 24 +++++++++++++----------- > 1 file changed, 13 insertions(+), 11 deletions(-) > > diff --git a/python/ovs/fatal_signal.py b/python/ovs/fatal_signal.py > index cb2e99e87..077f50dd5 100644 > --- a/python/ovs/fatal_signal.py > +++ b/python/ovs/fatal_signal.py > @@ -16,6 +16,7 @@ import atexit > import os > import signal > import sys > +import threading > > import ovs.vlog > > @@ -112,29 +113,29 @@ def _unlink(file_): > def _signal_handler(signr, _): > _call_hooks(signr) > > - # Re-raise the signal with the default handling so that the program > - # termination status reflects that we were killed by this signal. > - signal.signal(signr, signal.SIG_DFL) > - os.kill(os.getpid(), signr) > - > > def _atexit_handler(): > _call_hooks(0) > > > -recurse = False > +mutex = threading.Lock() > > > def _call_hooks(signr): > - global recurse > - if recurse: > + global mutex > + if not mutex.acquire(blocking=False): > return > - recurse = True > > for hook, cancel, run_at_exit in _hooks: > if signr != 0 or run_at_exit: > hook() > > + if signr != 0: > + # Re-raise the signal with the default handling so that the program > + # termination status reflects that we were killed by this signal. > + signal.signal(signr, signal.SIG_DFL) > + os.kill(os.getpid(), signr) > + > > _inited = False > > @@ -150,7 +151,9 @@ def _init(): > signal.SIGALRM] > > for signr in signals: > - if signal.getsignal(signr) == signal.SIG_DFL: > + handler = signal.getsignal(signr) > + if (handler == signal.SIG_DFL or > + handler == signal.default_int_handler): > signal.signal(signr, _signal_handler) > atexit.register(_atexit_handler) > > @@ -165,7 +168,6 @@ def signal_alarm(timeout): > > if sys.platform == "win32": > import time > - import threading > > class Alarm (threading.Thread): > def __init__(self, timeout): > > > > Thanks for your review, Ilya Maximets. I left a comment above that may need to be addressed. But otherwise, this version seems mostly fine. Could you please send it as a new PATCH v2 with adjusted patch name and the commit message? It will get tested and it will be easier to continue reviews this way. Thanks! Best regards, Ilya Maximets.
diff --git a/utilities/ovs-tcpdump.in b/utilities/ovs-tcpdump.in index 4cbd9a5d3..eec2262d6 100755 --- a/utilities/ovs-tcpdump.in +++ b/utilities/ovs-tcpdump.in @@ -17,6 +17,7 @@ import os import pwd from random import randint +import signal import subprocess import sys import time @@ -417,8 +418,22 @@ def py_which(executable): for path in os.environ["PATH"].split(os.pathsep)) +def ignore_fatal_signals(): + if sys.platform == "win32": + signals = [signal.SIGTERM, signal.SIGINT] + else: + signals = [signal.SIGTERM, signal.SIGINT, signal.SIGHUP, + signal.SIGALRM] + + for signr in signals: + signal.signal(signr, signal.SIG_IGN) + + def teardown(db_sock, interface, mirror_interface, tap_created): def cleanup_mirror(): + # Ignore fatal signals, avoid it to break OVSDB operations. + # So that cleanup mirror and ports be finished. + ignore_fatal_signals() try: ovsdb = OVSDB(db_sock) ovsdb.destroy_mirror(interface, ovsdb.port_bridge(interface))
After running ovs-tcpdump and inputs multiple CTRL+C, the program will raise the following exception. Error in atexit._run_exitfuncs: Traceback (most recent call last): File "/usr/bin/ovs-tcpdump", line 421, in cleanup_mirror ovsdb = OVSDB(db_sock) File "/usr/bin/ovs-tcpdump", line 168, in __init__ OVSDB.wait_for_db_change(self._idl_conn) # Initial Sync with DB File "/usr/bin/ovs-tcpdump", line 155, in wait_for_db_change while idl.change_seqno == seq and not idl.run(): Signed-off-by: Daniel Ding <zhihui.ding@easystack.cn> --- utilities/ovs-tcpdump.in | 15 +++++++++++++++ 1 file changed, 15 insertions(+)