diff mbox series

[ovs-dev,1/9] ipsec: Add a helper function to run commands from the monitor.

Message ID 20241029101608.2991596-2-i.maximets@ovn.org
State Changes Requested
Headers show
Series ipsec: Resiliency to Libreswan failures. | expand

Checks

Context Check Description
ovsrobot/apply-robot success apply and check: success
ovsrobot/github-robot-_Build_and_Test success github build: passed
ovsrobot/github-robot-_Build_and_Test success github build: passed

Commit Message

Ilya Maximets Oct. 29, 2024, 10:14 a.m. UTC
Until now, functions that needed to call external programs like openssl
or ipsec commands were using subprocess commands directly.  Most of
these calls had no failure checks or any logging making it hard to
understand what is happening inside the daemon when something doesn't
work as intended.

Some commands also had a chance to not read the command output in full.
That might sound like not a big problem, but in practice it causes
ovs-monitor-ipsec to deadlock pluto and itself with certain versions of
Libreswan (mainly Libreswan 5+).  The order of events is following:

 1. ovs-monitor-ipsec calls ipsec status redirecting the output
    to a pipe.
 2. ipsec status calls ipsec whack.
 3. ipsec whack connects to pluto and asks for status.
 4. ovs-monitor-ipsec doesn't read the pipe in full.
 5. ipsec whack blocks on write to the other side of the pipe
    when it runs out of buffer space.
 6. pluto blocks on sendmsg to ipsec whack for the same reason.
 7. ovs-monitor-ipsec calls another ipsec command and blocks
    on connection to pluto.

In this scenario the running process is at the mercy of garbage
collector and it doesn't run because we're blocked on calling another
ipsec command.  All the processes are completely blocked and will not
do any work until ipsec whack is killed.

With this change we're introducing a new function that will be used
for all the external process execution commands and will read the full
output before returning, avoiding the deadlock.  It will also log all
the failures as warnings and the commands themselves at the debug level.

We'll be adding more logic into this function in later commits as well,
so it will not stay that simple.

Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
---
 ipsec/ovs-monitor-ipsec.in | 296 +++++++++++++++++--------------------
 1 file changed, 134 insertions(+), 162 deletions(-)

Comments

Roi Dayan Oct. 29, 2024, 12:16 p.m. UTC | #1
On 29/10/2024 12:14, Ilya Maximets wrote:
> Until now, functions that needed to call external programs like openssl
> or ipsec commands were using subprocess commands directly.  Most of
> these calls had no failure checks or any logging making it hard to
> understand what is happening inside the daemon when something doesn't
> work as intended.
> 
> Some commands also had a chance to not read the command output in full.
> That might sound like not a big problem, but in practice it causes
> ovs-monitor-ipsec to deadlock pluto and itself with certain versions of
> Libreswan (mainly Libreswan 5+).  The order of events is following:
> 
>  1. ovs-monitor-ipsec calls ipsec status redirecting the output
>     to a pipe.
>  2. ipsec status calls ipsec whack.
>  3. ipsec whack connects to pluto and asks for status.
>  4. ovs-monitor-ipsec doesn't read the pipe in full.
>  5. ipsec whack blocks on write to the other side of the pipe
>     when it runs out of buffer space.
>  6. pluto blocks on sendmsg to ipsec whack for the same reason.
>  7. ovs-monitor-ipsec calls another ipsec command and blocks
>     on connection to pluto.
> 
> In this scenario the running process is at the mercy of garbage
> collector and it doesn't run because we're blocked on calling another
> ipsec command.  All the processes are completely blocked and will not
> do any work until ipsec whack is killed.
> 
> With this change we're introducing a new function that will be used
> for all the external process execution commands and will read the full
> output before returning, avoiding the deadlock.  It will also log all
> the failures as warnings and the commands themselves at the debug level.
> 
> We'll be adding more logic into this function in later commits as well,
> so it will not stay that simple.
> 
> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
> ---
>  ipsec/ovs-monitor-ipsec.in | 296 +++++++++++++++++--------------------
>  1 file changed, 134 insertions(+), 162 deletions(-)
> 
> diff --git a/ipsec/ovs-monitor-ipsec.in b/ipsec/ovs-monitor-ipsec.in
> index 37c509ac6..19a401609 100755
> --- a/ipsec/ovs-monitor-ipsec.in
> +++ b/ipsec/ovs-monitor-ipsec.in
> @@ -84,6 +84,28 @@ monitor = None
>  xfrm = None
>  
>  
> +def run_command(args, description=None):
> +    """ This function runs the process args[0] with args[1:] arguments
> +    and returns a tuple: return-code, stdout, stderr. """
> +
> +    if not description:

the last arg as a description might not give much or confuse?
maybe it's better to use the first arg i.e. the command ?

> +        description = args[-1]
> +
> +    vlog.dbg("Running %s" % args)
> +    proc = subprocess.Popen(args, stdout=subprocess.PIPE,
> +                            stderr=subprocess.PIPE)
> +    pout, perr = proc.communicate()
> +

you can skip call to len() in the if statement.
pout and perr should always be a string empty or not.
the if statement will work well on perr itself.

> +    if proc.returncode or len(perr):
> +        vlog.warn("Failed to %s; exit code: %d"
> +                  % (description, proc.returncode))
> +        vlog.warn("cmdline: %s" % proc.args)
> +        vlog.warn("stderr: %s" % perr)
> +        vlog.warn("stdout: %s" % pout)
> +

Since pout and perr are always strings you don't need
the or statement.

this should work the same

    return proc.returncode, pout, perr

also every caller that uses pout and/or perr
will do decode().strip().
so why not return it like so by default?

    return proc.returncode, pout.decode().strip(), perr.decode()


> +    return proc.returncode, pout or b'', perr or b''
> +
> +
>  class XFRM(object):
>      """This class is a simple wrapper around ip-xfrm (8) command line
>      utility.  We are using this class only for informational purposes
> @@ -99,13 +121,14 @@ class XFRM(object):
>          where <key> is destination IPv4 address and <value> is SELECTOR of
>          the IPsec policy."""
>          policies = {}
> -        proc = subprocess.Popen([self.IP, 'xfrm', 'policy'],
> -                                stdout=subprocess.PIPE)
> -        while True:
> -            line = proc.stdout.readline().strip().decode()
> -            if line == '':
> -                break
> -            a = line.split(" ")
> +
> +        ret, pout, perr = run_command([self.IP, 'xfrm', 'policy'],
> +                                      "get XFRM policies")
> +        if ret:
> +            return policies
> +
> +        for line in pout.decode().splitlines():
> +            a = line.strip().split(" ")
>              if len(a) >= 4 and a[0] == "src" and a[2] == "dst":
>                  dst = (a[3].split("/"))[0]
>                  if dst not in policies:
> @@ -122,13 +145,14 @@ class XFRM(object):
>          in a dictionary where <key> is destination IPv4 address and <value>
>          is SELECTOR."""
>          securities = {}
> -        proc = subprocess.Popen([self.IP, 'xfrm', 'state'],
> -                                stdout=subprocess.PIPE)
> -        while True:
> -            line = proc.stdout.readline().strip().decode()
> -            if line == '':
> -                break
> -            a = line.split(" ")
> +
> +        ret, pout, perr = run_command([self.IP, 'xfrm', 'state'],
> +                                      "get XFRM state")
> +        if ret:
> +            return securities
> +
> +        for line in pout.decode().splitlines():
> +            a = line.strip().split(" ")
>              if len(a) >= 4 and a[0] == "sel" \
>                      and a[1] == "src" and a[3] == "dst":
>                  remote_ip = a[4].rstrip().split("/")[0]
> @@ -242,7 +266,7 @@ conn prevent_unencrypted_vxlan
>          f.close()
>  
>          vlog.info("Restarting StrongSwan")
> -        subprocess.call([self.IPSEC, "restart"])
> +        run_command([self.IPSEC, "restart"], "restart StrongSwan")
>  
>      def get_active_conns(self):
>          """This function parses output from 'ipsec status' command.
> @@ -252,13 +276,13 @@ conn prevent_unencrypted_vxlan
>          sample line from the parsed outpus as <value>. """
>  
>          conns = {}
> -        proc = subprocess.Popen([self.IPSEC, 'status'], stdout=subprocess.PIPE)
> +        ret, pout, perr = run_command([self.IPSEC, 'status'],
> +                                      "get active connections")
> +        if ret:
> +            return conns
>  
> -        while True:
> -            line = proc.stdout.readline().strip().decode()
> -            if line == '':
> -                break
> -            tunnel_name = line.split(":")
> +        for line in pout.decode().splitlines():
> +            tunnel_name = line.strip().split(":")
>              if len(tunnel_name) < 2:
>                  continue
>              m = re.match(r"(.*)(-in-\d+|-out-\d+|-\d+).*", tunnel_name[0])
> @@ -341,15 +365,11 @@ conn prevent_unencrypted_vxlan
>          Once strongSwan vici bindings will be distributed with major
>          Linux distributions this function could be simplified."""
>          vlog.info("Refreshing StrongSwan configuration")
> -        proc = subprocess.Popen([self.IPSEC, "update"],
> -                        stdout=subprocess.PIPE,
> -                        stderr=subprocess.PIPE)
> -        outs, errs = proc.communicate()
> -        if proc.returncode != 0:
> -            vlog.err("StrongSwan failed to update configuration:\n"
> -                           "%s \n %s" % (str(outs), str(errs)))
> -
> -        subprocess.call([self.IPSEC, "rereadsecrets"])
> +
> +        run_command([self.IPSEC, "update"],
> +                    "update StrongSwan's configuration")
> +        run_command([self.IPSEC, "rereadsecrets"], "re-read secrets")
> +
>          # "ipsec update" command does not remove those tunnels that were
>          # updated or that disappeared from the ipsec.conf file.  So, we have
>          # to manually remove them by calling "ipsec stroke down-nb <tunnel>"
> @@ -382,7 +402,8 @@ conn prevent_unencrypted_vxlan
>  
>                  if not tunnel or tunnel.version != ver:
>                      vlog.info("%s is outdated %u" % (conn, ver))
> -                    subprocess.call([self.IPSEC, "stroke", "down-nb", conn])
> +                    run_command([self.IPSEC, "stroke", "down-nb", conn],
> +                                "stroke the outdated %s" % conn)
>  
>  
>  class LibreSwanHelper(object):
> @@ -460,13 +481,11 @@ conn prevent_unencrypted_vxlan
>          # Collect version infromation
>          self.IPSEC = libreswan_root_prefix + "/usr/sbin/ipsec"
>          self.IPSEC_AUTO = [self.IPSEC]
> -        proc = subprocess.Popen([self.IPSEC, "--version"],
> -                                stdout=subprocess.PIPE,
> -                                encoding="latin1")
> -        pout, perr = proc.communicate()
>  
> -        v = re.match("^Libreswan v?(.*)$", pout)
> +        ret, pout, perr = run_command([self.IPSEC, "--version"],
> +                                      "get Libreswan's version")
>          try:
> +            v = re.match("^Libreswan v?(.*)$", pout.decode().strip())
>              version = int(v.group(1).split(".")[0])
>          except:
>              version = 0
> @@ -513,7 +532,7 @@ conn prevent_unencrypted_vxlan
>          f.close()
>  
>          vlog.info("Restarting LibreSwan")
> -        subprocess.call([self.IPSEC, "restart"])
> +        run_command([self.IPSEC, "restart"], "restart Libreswan")
>  
>      def config_init(self):
>          self.conf_file = open(self.IPSEC_CONF, "w")
> @@ -599,8 +618,10 @@ conn prevent_unencrypted_vxlan
>  
>      def refresh(self, monitor):
>          vlog.info("Refreshing LibreSwan configuration")
> -        subprocess.call(self.IPSEC_AUTO + ["--ctlsocket", self.IPSEC_CTL,
> -                        "--config", self.IPSEC_CONF, "--rereadsecrets"])
> +        run_command(self.IPSEC_AUTO + ["--ctlsocket", self.IPSEC_CTL,
> +                                       "--config", self.IPSEC_CONF,
> +                                       "--rereadsecrets"],
> +                    "re-read secrets")
>          tunnels = set(monitor.tunnels.keys())
>  
>          # Delete old connections
> @@ -627,9 +648,10 @@ conn prevent_unencrypted_vxlan
>  
>                  if not tunnel or tunnel.version != ver:
>                      vlog.info("%s is outdated %u" % (conn, ver))
> -                    subprocess.call(self.IPSEC_AUTO + ["--ctlsocket",
> -                                    self.IPSEC_CTL, "--config",
> -                                    self.IPSEC_CONF, "--delete", conn])
> +                    run_command(self.IPSEC_AUTO +
> +                                ["--ctlsocket", self.IPSEC_CTL,
> +                                 "--config", self.IPSEC_CONF,
> +                                 "--delete", conn], "delete %s" % conn)
>                  elif ifname in tunnels:
>                      tunnels.remove(ifname)
>  
> @@ -649,43 +671,43 @@ conn prevent_unencrypted_vxlan
>          # Update shunt policy if changed
>          if monitor.conf_in_use["skb_mark"] != monitor.conf["skb_mark"]:
>              if monitor.conf["skb_mark"]:
> -                subprocess.call(self.IPSEC_AUTO +
> +                run_command(self.IPSEC_AUTO +
>                              ["--config", self.IPSEC_CONF,
>                              "--ctlsocket", self.IPSEC_CTL,
>                              "--add",
>                              "--asynchronous", "prevent_unencrypted_gre"])
> -                subprocess.call(self.IPSEC_AUTO +
> +                run_command(self.IPSEC_AUTO +
>                              ["--config", self.IPSEC_CONF,
>                              "--ctlsocket", self.IPSEC_CTL,
>                              "--add",
>                              "--asynchronous", "prevent_unencrypted_geneve"])
> -                subprocess.call(self.IPSEC_AUTO +
> +                run_command(self.IPSEC_AUTO +
>                              ["--config", self.IPSEC_CONF,
>                              "--ctlsocket", self.IPSEC_CTL,
>                              "--add",
>                              "--asynchronous", "prevent_unencrypted_stt"])
> -                subprocess.call(self.IPSEC_AUTO +
> +                run_command(self.IPSEC_AUTO +
>                              ["--config", self.IPSEC_CONF,
>                              "--ctlsocket", self.IPSEC_CTL,
>                              "--add",
>                              "--asynchronous", "prevent_unencrypted_vxlan"])
>              else:
> -                subprocess.call(self.IPSEC_AUTO +
> +                run_command(self.IPSEC_AUTO +
>                              ["--config", self.IPSEC_CONF,
>                              "--ctlsocket", self.IPSEC_CTL,
>                              "--delete",
>                              "--asynchronous", "prevent_unencrypted_gre"])
> -                subprocess.call(self.IPSEC_AUTO +
> +                run_command(self.IPSEC_AUTO +
>                              ["--config", self.IPSEC_CONF,
>                              "--ctlsocket", self.IPSEC_CTL,
>                              "--delete",
>                              "--asynchronous", "prevent_unencrypted_geneve"])
> -                subprocess.call(self.IPSEC_AUTO +
> +                run_command(self.IPSEC_AUTO +
>                              ["--config", self.IPSEC_CONF,
>                              "--ctlsocket", self.IPSEC_CTL,
>                              "--delete",
>                              "--asynchronous", "prevent_unencrypted_stt"])
> -                subprocess.call(self.IPSEC_AUTO +
> +                run_command(self.IPSEC_AUTO +
>                              ["--config", self.IPSEC_CONF,
>                              "--ctlsocket", self.IPSEC_CTL,
>                              "--delete",
> @@ -700,14 +722,13 @@ conn prevent_unencrypted_vxlan
>          sample line from the parsed outpus as <value>. """
>  
>          conns = {}
> -        proc = subprocess.Popen([self.IPSEC, 'status', '--ctlsocket',
> -                                self.IPSEC_CTL], stdout=subprocess.PIPE)
> -
> -        while True:
> -            line = proc.stdout.readline().strip().decode()
> -            if line == '':
> -                break
> +        ret, pout, perr = run_command([self.IPSEC, 'status',
> +                                      '--ctlsocket', self.IPSEC_CTL],
> +                                      "get active connections")
> +        if ret:
> +            return conns
>  
> +        for line in pout.decode().splitlines():
>              m = re.search(r"#\d+: \"(.*)\".*", line)
>              if not m:
>                  continue
> @@ -732,117 +753,72 @@ conn prevent_unencrypted_vxlan
>          # the "ipsec auto --start" command is lost. Just retry to make sure
>          # the command is received by LibreSwan.
>          while True:
> -            proc = subprocess.Popen(self.IPSEC_AUTO +
> -                                    ["--config", self.IPSEC_CONF,
> -                                    "--ctlsocket", self.IPSEC_CTL,
> -                                    "--start",
> -                                    "--asynchronous", conn],
> -                                    stdout=subprocess.PIPE,
> -                                    stderr=subprocess.PIPE)
> -            perr = str(proc.stderr.read())
> -            pout = str(proc.stdout.read())
> -            if not re.match(r".*Connection refused.*", perr) and \
> -                    not re.match(r".*need --listen.*", pout):
> +            ret, pout, perr = run_command(self.IPSEC_AUTO +
> +                                          ["--config", self.IPSEC_CONF,
> +                                          "--ctlsocket", self.IPSEC_CTL,
> +                                          "--start",
> +                                          "--asynchronous", conn],
> +                                          "start %s" % conn)
> +            if not re.match(r".*Connection refused.*", perr.decode()) and \
> +                    not re.match(r".*need --listen.*", pout.decode()):
>                  break
>  
> -        if re.match(r".*[F|f]ailed to initiate connection.*", pout):
> +        if re.match(r".*[F|f]ailed to initiate connection.*", pout.decode()):
>              vlog.err('Failed to initiate connection through'
>                      ' Interface %s.\n' % (conn.split('-')[0]))
> -            vlog.err(pout)
> +            vlog.err("stdout: %s" % pout)
>  
>      def _nss_clear_database(self):
>          """Remove all OVS IPsec related state from the NSS database"""
> -        try:
> -            proc = subprocess.Popen(['certutil', '-L', '-d',
> -                                    self.IPSEC_D],
> -                                    stdout=subprocess.PIPE,
> -                                    stderr=subprocess.PIPE,
> -                                    universal_newlines=True)
> -            lines = proc.stdout.readlines()
> -
> -            for line in lines:
> -                s = line.strip().split()
> -                if len(s) < 1:
> -                    continue
> -                name = s[0]
> -                if name.startswith(self.CERT_PREFIX):
> -                    self._nss_delete_cert(name)
> -                elif name.startswith(self.CERTKEY_PREFIX):
> -                    self._nss_delete_cert_and_key(name)
> +        ret, pout, perr = run_command(['certutil', '-L', '-d', self.IPSEC_D],
> +                                      "clear NSS database")
> +        if ret:
> +            return
>  
> -        except Exception as e:
> -            vlog.err("Failed to clear NSS database.\n" + str(e))
> +        for line in pout.decode().splitlines():
> +            s = line.strip().split()
> +            if len(s) < 1:
> +                continue
> +            name = s[0]
> +            if name.startswith(self.CERT_PREFIX):
> +                self._nss_delete_cert(name)
> +            elif name.startswith(self.CERTKEY_PREFIX):
> +                self._nss_delete_cert_and_key(name)
>  
>      def _nss_import_cert(self, cert, name, cert_type):
>          """Cert_type is 'CT,,' for the CA certificate and 'P,P,P' for the
>          normal certificate."""
> -        try:
> -            proc = subprocess.Popen(['certutil', '-A', '-a', '-i', cert,
> -                                    '-d', self.IPSEC_D, '-n',
> -                                    name, '-t', cert_type],
> -                                    stdout=subprocess.PIPE,
> -                                    stderr=subprocess.PIPE)
> -            proc.wait()
> -            if proc.returncode:
> -                raise Exception(proc.stderr.read())
> -        except Exception as e:
> -            vlog.err("Failed to import certificate into NSS.\n" + str(e))
> +        run_command(['certutil', '-A', '-a', '-i', cert, '-d', self.IPSEC_D,
> +                     '-n', name, '-t', cert_type],
> +                    "import certificate %s into NSS" % name)
>  
>      def _nss_delete_cert(self, name):
> -        try:
> -            proc = subprocess.Popen(['certutil', '-D', '-d',
> -                                    self.IPSEC_D, '-n', name],
> -                                    stdout=subprocess.PIPE,
> -                                    stderr=subprocess.PIPE)
> -            proc.wait()
> -            if proc.returncode:
> -                raise Exception(proc.stderr.read())
> -        except Exception as e:
> -            vlog.err("Failed to delete certificate from NSS.\n" + str(e))
> +        run_command(['certutil', '-D', '-d', self.IPSEC_D, '-n', name],
> +                    "delete certificate %s from NSS" % name)
>  
>      def _nss_import_cert_and_key(self, cert, key, name):
> -        try:
> -            # Avoid deleting other files
> -            path = os.path.abspath('/tmp/%s.p12' % name)
> -            if not path.startswith('/tmp/'):
> -                raise Exception("Illegal certificate name!")
> -
> -            # Create p12 file from pem files
> -            proc = subprocess.Popen(['openssl', 'pkcs12', '-export',
> -                                    '-in', cert, '-inkey', key, '-out',
> -                                    path, '-name', name, '-passout', 'pass:'],
> -                                    stdout=subprocess.PIPE,
> -                                    stderr=subprocess.PIPE)
> -            proc.wait()
> -            if proc.returncode:
> -                raise Exception(proc.stderr.read())
> -
> -            # Load p12 file to the database
> -            proc = subprocess.Popen(['pk12util', '-i', path, '-d',
> -                                    self.IPSEC_D, '-W', ''],
> -                                    stdout=subprocess.PIPE,
> -                                    stderr=subprocess.PIPE)
> -            proc.wait()
> -            if proc.returncode:
> -                raise Exception(proc.stderr.read())
> -
> -        except Exception as e:
> -            vlog.err("Import cert and key failed.\n" + str(e))
> +        # Avoid deleting other files
> +        path = os.path.abspath('/tmp/%s.p12' % name)
> +        if not path.startswith('/tmp/'):
> +            vlog.err("Illegal certificate name '%s'!" % name)
> +            return
> +
> +        if run_command(['openssl', 'pkcs12', '-export',
> +                       '-in', cert, '-inkey', key,
> +                       '-out', path, '-name', name,
> +                       '-passout', 'pass:'],
> +                       "create p12 file from pem files")[0]:
> +            return
> +
> +        # Load p12 file to the database
> +        run_command(['pk12util', '-i', path, '-d', self.IPSEC_D, '-W', ''],
> +                    "load p12 file to the NSS database")
>          os.remove(path)
>  
>      def _nss_delete_cert_and_key(self, name):
> -        try:
> -            # Delete certificate and private key
> -            proc = subprocess.Popen(['certutil', '-F', '-d',
> -                                    self.IPSEC_D, '-n', name],
> -                                    stdout=subprocess.PIPE,
> -                                    stderr=subprocess.PIPE)
> -            proc.wait()
> -            if proc.returncode:
> -                raise Exception(proc.stderr.read())
> -
> -        except Exception as e:
> -            vlog.err("Delete cert and key failed.\n" + str(e))
> +        # Delete certificate and private key
> +        run_command(['certutil', '-F', '-d', self.IPSEC_D, '-n', name],
> +                    "delete certificate and private key for %s" % name)
>  
>  
>  class IPsecTunnel(object):
> @@ -1220,19 +1196,15 @@ class IPsecMonitor(object):
>              self.ike_helper.refresh(self)
>  
>      def _get_cn_from_cert(self, cert):
> -        try:
> -            proc = subprocess.Popen(['openssl', 'x509', '-noout', '-subject',
> -                                    '-nameopt', 'RFC2253', '-in', cert],
> -                                    stdout=subprocess.PIPE,
> -                                    stderr=subprocess.PIPE)
> -            proc.wait()
> -            if proc.returncode:
> -                raise Exception(proc.stderr.read())
> -            m = re.search(r"CN=(.+?),", proc.stdout.readline().decode())
> -            if not m:
> -                raise Exception("No CN in the certificate subject.")
> -        except Exception as e:
> -            vlog.warn(str(e))
> +        ret, pout, perr = run_command(['openssl', 'x509', '-noout', '-subject',
> +                                       '-nameopt', 'RFC2253', '-in', cert],
> +                                       "get certificate %s options" % cert)
> +        if ret:
> +            return None
> +
> +        m = re.search(r"CN=(.+?),", pout.decode().strip())
> +        if not m:
> +            vlog.warn("No CN in the certificate subject (%s)." % cert)
>              return None
>  
>          return m.group(1)
Ilya Maximets Oct. 29, 2024, 12:50 p.m. UTC | #2
On 10/29/24 13:16, Roi Dayan wrote:
> 
> 
> On 29/10/2024 12:14, Ilya Maximets wrote:
>> Until now, functions that needed to call external programs like openssl
>> or ipsec commands were using subprocess commands directly.  Most of
>> these calls had no failure checks or any logging making it hard to
>> understand what is happening inside the daemon when something doesn't
>> work as intended.
>>
>> Some commands also had a chance to not read the command output in full.
>> That might sound like not a big problem, but in practice it causes
>> ovs-monitor-ipsec to deadlock pluto and itself with certain versions of
>> Libreswan (mainly Libreswan 5+).  The order of events is following:
>>
>>  1. ovs-monitor-ipsec calls ipsec status redirecting the output
>>     to a pipe.
>>  2. ipsec status calls ipsec whack.
>>  3. ipsec whack connects to pluto and asks for status.
>>  4. ovs-monitor-ipsec doesn't read the pipe in full.
>>  5. ipsec whack blocks on write to the other side of the pipe
>>     when it runs out of buffer space.
>>  6. pluto blocks on sendmsg to ipsec whack for the same reason.
>>  7. ovs-monitor-ipsec calls another ipsec command and blocks
>>     on connection to pluto.
>>
>> In this scenario the running process is at the mercy of garbage
>> collector and it doesn't run because we're blocked on calling another
>> ipsec command.  All the processes are completely blocked and will not
>> do any work until ipsec whack is killed.
>>
>> With this change we're introducing a new function that will be used
>> for all the external process execution commands and will read the full
>> output before returning, avoiding the deadlock.  It will also log all
>> the failures as warnings and the commands themselves at the debug level.
>>
>> We'll be adding more logic into this function in later commits as well,
>> so it will not stay that simple.
>>
>> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
>> ---
>>  ipsec/ovs-monitor-ipsec.in | 296 +++++++++++++++++--------------------
>>  1 file changed, 134 insertions(+), 162 deletions(-)
>>
>> diff --git a/ipsec/ovs-monitor-ipsec.in b/ipsec/ovs-monitor-ipsec.in
>> index 37c509ac6..19a401609 100755
>> --- a/ipsec/ovs-monitor-ipsec.in
>> +++ b/ipsec/ovs-monitor-ipsec.in
>> @@ -84,6 +84,28 @@ monitor = None
>>  xfrm = None
>>  
>>  
>> +def run_command(args, description=None):
>> +    """ This function runs the process args[0] with args[1:] arguments
>> +    and returns a tuple: return-code, stdout, stderr. """
>> +
>> +    if not description:
> 
> the last arg as a description might not give much or confuse?
> maybe it's better to use the first arg i.e. the command ?

Thanks, Roi, for taking a look!

The first argument in most cases is 'ipsec', so it doesn't give
a lot of context.  The last argument is used, just because I was
a little lazy and didn't want to add repeating descriptions to
all the calls like this:

    run_command(self.IPSEC_AUTO +
                ["--config", self.IPSEC_CONF,
                "--ctlsocket", self.IPSEC_CTL,
                "--add",
                "--asynchronous", "prevent_unencrypted_gre"])

These ones have a recongnizeable last argument, which is an actual
command name.  Most other types of calls do have their description
specified.

So, we could either keep as is, or add "prevent unencrypted gre"
to the commands like the one above, which seems a little repetitive.
But using the first argument doesn't seem to be very useful.

WDYT?

> 
>> +        description = args[-1]
>> +
>> +    vlog.dbg("Running %s" % args)
>> +    proc = subprocess.Popen(args, stdout=subprocess.PIPE,
>> +                            stderr=subprocess.PIPE)
>> +    pout, perr = proc.communicate()
>> +
> 
> you can skip call to len() in the if statement.
> pout and perr should always be a string empty or not.
> the if statement will work well on perr itself.

Yeah, good point, I'll change that.

> 
>> +    if proc.returncode or len(perr):
>> +        vlog.warn("Failed to %s; exit code: %d"
>> +                  % (description, proc.returncode))
>> +        vlog.warn("cmdline: %s" % proc.args)
>> +        vlog.warn("stderr: %s" % perr)
>> +        vlog.warn("stdout: %s" % pout)
>> +
> 
> Since pout and perr are always strings you don't need
> the or statement.
> 
> this should work the same
> 
>     return proc.returncode, pout, perr

I guess, there were too many verions of this code and they were not
initialized at some point during timeout exception handling in the
patch #5, but they are now, so I should fix this, agreed.

> 
> also every caller that uses pout and/or perr
> will do decode().strip().
> so why not return it like so by default?
> 
>     return proc.returncode, pout.decode().strip(), perr.decode()

Good point.  I'd leave the strip() for the caller, but otherwise
it makes sense to decode here, true.

> 
> 
>> +    return proc.returncode, pout or b'', perr or b''
>> +
>> +

Best regards, Ilya Maximets.
diff mbox series

Patch

diff --git a/ipsec/ovs-monitor-ipsec.in b/ipsec/ovs-monitor-ipsec.in
index 37c509ac6..19a401609 100755
--- a/ipsec/ovs-monitor-ipsec.in
+++ b/ipsec/ovs-monitor-ipsec.in
@@ -84,6 +84,28 @@  monitor = None
 xfrm = None
 
 
+def run_command(args, description=None):
+    """ This function runs the process args[0] with args[1:] arguments
+    and returns a tuple: return-code, stdout, stderr. """
+
+    if not description:
+        description = args[-1]
+
+    vlog.dbg("Running %s" % args)
+    proc = subprocess.Popen(args, stdout=subprocess.PIPE,
+                            stderr=subprocess.PIPE)
+    pout, perr = proc.communicate()
+
+    if proc.returncode or len(perr):
+        vlog.warn("Failed to %s; exit code: %d"
+                  % (description, proc.returncode))
+        vlog.warn("cmdline: %s" % proc.args)
+        vlog.warn("stderr: %s" % perr)
+        vlog.warn("stdout: %s" % pout)
+
+    return proc.returncode, pout or b'', perr or b''
+
+
 class XFRM(object):
     """This class is a simple wrapper around ip-xfrm (8) command line
     utility.  We are using this class only for informational purposes
@@ -99,13 +121,14 @@  class XFRM(object):
         where <key> is destination IPv4 address and <value> is SELECTOR of
         the IPsec policy."""
         policies = {}
-        proc = subprocess.Popen([self.IP, 'xfrm', 'policy'],
-                                stdout=subprocess.PIPE)
-        while True:
-            line = proc.stdout.readline().strip().decode()
-            if line == '':
-                break
-            a = line.split(" ")
+
+        ret, pout, perr = run_command([self.IP, 'xfrm', 'policy'],
+                                      "get XFRM policies")
+        if ret:
+            return policies
+
+        for line in pout.decode().splitlines():
+            a = line.strip().split(" ")
             if len(a) >= 4 and a[0] == "src" and a[2] == "dst":
                 dst = (a[3].split("/"))[0]
                 if dst not in policies:
@@ -122,13 +145,14 @@  class XFRM(object):
         in a dictionary where <key> is destination IPv4 address and <value>
         is SELECTOR."""
         securities = {}
-        proc = subprocess.Popen([self.IP, 'xfrm', 'state'],
-                                stdout=subprocess.PIPE)
-        while True:
-            line = proc.stdout.readline().strip().decode()
-            if line == '':
-                break
-            a = line.split(" ")
+
+        ret, pout, perr = run_command([self.IP, 'xfrm', 'state'],
+                                      "get XFRM state")
+        if ret:
+            return securities
+
+        for line in pout.decode().splitlines():
+            a = line.strip().split(" ")
             if len(a) >= 4 and a[0] == "sel" \
                     and a[1] == "src" and a[3] == "dst":
                 remote_ip = a[4].rstrip().split("/")[0]
@@ -242,7 +266,7 @@  conn prevent_unencrypted_vxlan
         f.close()
 
         vlog.info("Restarting StrongSwan")
-        subprocess.call([self.IPSEC, "restart"])
+        run_command([self.IPSEC, "restart"], "restart StrongSwan")
 
     def get_active_conns(self):
         """This function parses output from 'ipsec status' command.
@@ -252,13 +276,13 @@  conn prevent_unencrypted_vxlan
         sample line from the parsed outpus as <value>. """
 
         conns = {}
-        proc = subprocess.Popen([self.IPSEC, 'status'], stdout=subprocess.PIPE)
+        ret, pout, perr = run_command([self.IPSEC, 'status'],
+                                      "get active connections")
+        if ret:
+            return conns
 
-        while True:
-            line = proc.stdout.readline().strip().decode()
-            if line == '':
-                break
-            tunnel_name = line.split(":")
+        for line in pout.decode().splitlines():
+            tunnel_name = line.strip().split(":")
             if len(tunnel_name) < 2:
                 continue
             m = re.match(r"(.*)(-in-\d+|-out-\d+|-\d+).*", tunnel_name[0])
@@ -341,15 +365,11 @@  conn prevent_unencrypted_vxlan
         Once strongSwan vici bindings will be distributed with major
         Linux distributions this function could be simplified."""
         vlog.info("Refreshing StrongSwan configuration")
-        proc = subprocess.Popen([self.IPSEC, "update"],
-                        stdout=subprocess.PIPE,
-                        stderr=subprocess.PIPE)
-        outs, errs = proc.communicate()
-        if proc.returncode != 0:
-            vlog.err("StrongSwan failed to update configuration:\n"
-                           "%s \n %s" % (str(outs), str(errs)))
-
-        subprocess.call([self.IPSEC, "rereadsecrets"])
+
+        run_command([self.IPSEC, "update"],
+                    "update StrongSwan's configuration")
+        run_command([self.IPSEC, "rereadsecrets"], "re-read secrets")
+
         # "ipsec update" command does not remove those tunnels that were
         # updated or that disappeared from the ipsec.conf file.  So, we have
         # to manually remove them by calling "ipsec stroke down-nb <tunnel>"
@@ -382,7 +402,8 @@  conn prevent_unencrypted_vxlan
 
                 if not tunnel or tunnel.version != ver:
                     vlog.info("%s is outdated %u" % (conn, ver))
-                    subprocess.call([self.IPSEC, "stroke", "down-nb", conn])
+                    run_command([self.IPSEC, "stroke", "down-nb", conn],
+                                "stroke the outdated %s" % conn)
 
 
 class LibreSwanHelper(object):
@@ -460,13 +481,11 @@  conn prevent_unencrypted_vxlan
         # Collect version infromation
         self.IPSEC = libreswan_root_prefix + "/usr/sbin/ipsec"
         self.IPSEC_AUTO = [self.IPSEC]
-        proc = subprocess.Popen([self.IPSEC, "--version"],
-                                stdout=subprocess.PIPE,
-                                encoding="latin1")
-        pout, perr = proc.communicate()
 
-        v = re.match("^Libreswan v?(.*)$", pout)
+        ret, pout, perr = run_command([self.IPSEC, "--version"],
+                                      "get Libreswan's version")
         try:
+            v = re.match("^Libreswan v?(.*)$", pout.decode().strip())
             version = int(v.group(1).split(".")[0])
         except:
             version = 0
@@ -513,7 +532,7 @@  conn prevent_unencrypted_vxlan
         f.close()
 
         vlog.info("Restarting LibreSwan")
-        subprocess.call([self.IPSEC, "restart"])
+        run_command([self.IPSEC, "restart"], "restart Libreswan")
 
     def config_init(self):
         self.conf_file = open(self.IPSEC_CONF, "w")
@@ -599,8 +618,10 @@  conn prevent_unencrypted_vxlan
 
     def refresh(self, monitor):
         vlog.info("Refreshing LibreSwan configuration")
-        subprocess.call(self.IPSEC_AUTO + ["--ctlsocket", self.IPSEC_CTL,
-                        "--config", self.IPSEC_CONF, "--rereadsecrets"])
+        run_command(self.IPSEC_AUTO + ["--ctlsocket", self.IPSEC_CTL,
+                                       "--config", self.IPSEC_CONF,
+                                       "--rereadsecrets"],
+                    "re-read secrets")
         tunnels = set(monitor.tunnels.keys())
 
         # Delete old connections
@@ -627,9 +648,10 @@  conn prevent_unencrypted_vxlan
 
                 if not tunnel or tunnel.version != ver:
                     vlog.info("%s is outdated %u" % (conn, ver))
-                    subprocess.call(self.IPSEC_AUTO + ["--ctlsocket",
-                                    self.IPSEC_CTL, "--config",
-                                    self.IPSEC_CONF, "--delete", conn])
+                    run_command(self.IPSEC_AUTO +
+                                ["--ctlsocket", self.IPSEC_CTL,
+                                 "--config", self.IPSEC_CONF,
+                                 "--delete", conn], "delete %s" % conn)
                 elif ifname in tunnels:
                     tunnels.remove(ifname)
 
@@ -649,43 +671,43 @@  conn prevent_unencrypted_vxlan
         # Update shunt policy if changed
         if monitor.conf_in_use["skb_mark"] != monitor.conf["skb_mark"]:
             if monitor.conf["skb_mark"]:
-                subprocess.call(self.IPSEC_AUTO +
+                run_command(self.IPSEC_AUTO +
                             ["--config", self.IPSEC_CONF,
                             "--ctlsocket", self.IPSEC_CTL,
                             "--add",
                             "--asynchronous", "prevent_unencrypted_gre"])
-                subprocess.call(self.IPSEC_AUTO +
+                run_command(self.IPSEC_AUTO +
                             ["--config", self.IPSEC_CONF,
                             "--ctlsocket", self.IPSEC_CTL,
                             "--add",
                             "--asynchronous", "prevent_unencrypted_geneve"])
-                subprocess.call(self.IPSEC_AUTO +
+                run_command(self.IPSEC_AUTO +
                             ["--config", self.IPSEC_CONF,
                             "--ctlsocket", self.IPSEC_CTL,
                             "--add",
                             "--asynchronous", "prevent_unencrypted_stt"])
-                subprocess.call(self.IPSEC_AUTO +
+                run_command(self.IPSEC_AUTO +
                             ["--config", self.IPSEC_CONF,
                             "--ctlsocket", self.IPSEC_CTL,
                             "--add",
                             "--asynchronous", "prevent_unencrypted_vxlan"])
             else:
-                subprocess.call(self.IPSEC_AUTO +
+                run_command(self.IPSEC_AUTO +
                             ["--config", self.IPSEC_CONF,
                             "--ctlsocket", self.IPSEC_CTL,
                             "--delete",
                             "--asynchronous", "prevent_unencrypted_gre"])
-                subprocess.call(self.IPSEC_AUTO +
+                run_command(self.IPSEC_AUTO +
                             ["--config", self.IPSEC_CONF,
                             "--ctlsocket", self.IPSEC_CTL,
                             "--delete",
                             "--asynchronous", "prevent_unencrypted_geneve"])
-                subprocess.call(self.IPSEC_AUTO +
+                run_command(self.IPSEC_AUTO +
                             ["--config", self.IPSEC_CONF,
                             "--ctlsocket", self.IPSEC_CTL,
                             "--delete",
                             "--asynchronous", "prevent_unencrypted_stt"])
-                subprocess.call(self.IPSEC_AUTO +
+                run_command(self.IPSEC_AUTO +
                             ["--config", self.IPSEC_CONF,
                             "--ctlsocket", self.IPSEC_CTL,
                             "--delete",
@@ -700,14 +722,13 @@  conn prevent_unencrypted_vxlan
         sample line from the parsed outpus as <value>. """
 
         conns = {}
-        proc = subprocess.Popen([self.IPSEC, 'status', '--ctlsocket',
-                                self.IPSEC_CTL], stdout=subprocess.PIPE)
-
-        while True:
-            line = proc.stdout.readline().strip().decode()
-            if line == '':
-                break
+        ret, pout, perr = run_command([self.IPSEC, 'status',
+                                      '--ctlsocket', self.IPSEC_CTL],
+                                      "get active connections")
+        if ret:
+            return conns
 
+        for line in pout.decode().splitlines():
             m = re.search(r"#\d+: \"(.*)\".*", line)
             if not m:
                 continue
@@ -732,117 +753,72 @@  conn prevent_unencrypted_vxlan
         # the "ipsec auto --start" command is lost. Just retry to make sure
         # the command is received by LibreSwan.
         while True:
-            proc = subprocess.Popen(self.IPSEC_AUTO +
-                                    ["--config", self.IPSEC_CONF,
-                                    "--ctlsocket", self.IPSEC_CTL,
-                                    "--start",
-                                    "--asynchronous", conn],
-                                    stdout=subprocess.PIPE,
-                                    stderr=subprocess.PIPE)
-            perr = str(proc.stderr.read())
-            pout = str(proc.stdout.read())
-            if not re.match(r".*Connection refused.*", perr) and \
-                    not re.match(r".*need --listen.*", pout):
+            ret, pout, perr = run_command(self.IPSEC_AUTO +
+                                          ["--config", self.IPSEC_CONF,
+                                          "--ctlsocket", self.IPSEC_CTL,
+                                          "--start",
+                                          "--asynchronous", conn],
+                                          "start %s" % conn)
+            if not re.match(r".*Connection refused.*", perr.decode()) and \
+                    not re.match(r".*need --listen.*", pout.decode()):
                 break
 
-        if re.match(r".*[F|f]ailed to initiate connection.*", pout):
+        if re.match(r".*[F|f]ailed to initiate connection.*", pout.decode()):
             vlog.err('Failed to initiate connection through'
                     ' Interface %s.\n' % (conn.split('-')[0]))
-            vlog.err(pout)
+            vlog.err("stdout: %s" % pout)
 
     def _nss_clear_database(self):
         """Remove all OVS IPsec related state from the NSS database"""
-        try:
-            proc = subprocess.Popen(['certutil', '-L', '-d',
-                                    self.IPSEC_D],
-                                    stdout=subprocess.PIPE,
-                                    stderr=subprocess.PIPE,
-                                    universal_newlines=True)
-            lines = proc.stdout.readlines()
-
-            for line in lines:
-                s = line.strip().split()
-                if len(s) < 1:
-                    continue
-                name = s[0]
-                if name.startswith(self.CERT_PREFIX):
-                    self._nss_delete_cert(name)
-                elif name.startswith(self.CERTKEY_PREFIX):
-                    self._nss_delete_cert_and_key(name)
+        ret, pout, perr = run_command(['certutil', '-L', '-d', self.IPSEC_D],
+                                      "clear NSS database")
+        if ret:
+            return
 
-        except Exception as e:
-            vlog.err("Failed to clear NSS database.\n" + str(e))
+        for line in pout.decode().splitlines():
+            s = line.strip().split()
+            if len(s) < 1:
+                continue
+            name = s[0]
+            if name.startswith(self.CERT_PREFIX):
+                self._nss_delete_cert(name)
+            elif name.startswith(self.CERTKEY_PREFIX):
+                self._nss_delete_cert_and_key(name)
 
     def _nss_import_cert(self, cert, name, cert_type):
         """Cert_type is 'CT,,' for the CA certificate and 'P,P,P' for the
         normal certificate."""
-        try:
-            proc = subprocess.Popen(['certutil', '-A', '-a', '-i', cert,
-                                    '-d', self.IPSEC_D, '-n',
-                                    name, '-t', cert_type],
-                                    stdout=subprocess.PIPE,
-                                    stderr=subprocess.PIPE)
-            proc.wait()
-            if proc.returncode:
-                raise Exception(proc.stderr.read())
-        except Exception as e:
-            vlog.err("Failed to import certificate into NSS.\n" + str(e))
+        run_command(['certutil', '-A', '-a', '-i', cert, '-d', self.IPSEC_D,
+                     '-n', name, '-t', cert_type],
+                    "import certificate %s into NSS" % name)
 
     def _nss_delete_cert(self, name):
-        try:
-            proc = subprocess.Popen(['certutil', '-D', '-d',
-                                    self.IPSEC_D, '-n', name],
-                                    stdout=subprocess.PIPE,
-                                    stderr=subprocess.PIPE)
-            proc.wait()
-            if proc.returncode:
-                raise Exception(proc.stderr.read())
-        except Exception as e:
-            vlog.err("Failed to delete certificate from NSS.\n" + str(e))
+        run_command(['certutil', '-D', '-d', self.IPSEC_D, '-n', name],
+                    "delete certificate %s from NSS" % name)
 
     def _nss_import_cert_and_key(self, cert, key, name):
-        try:
-            # Avoid deleting other files
-            path = os.path.abspath('/tmp/%s.p12' % name)
-            if not path.startswith('/tmp/'):
-                raise Exception("Illegal certificate name!")
-
-            # Create p12 file from pem files
-            proc = subprocess.Popen(['openssl', 'pkcs12', '-export',
-                                    '-in', cert, '-inkey', key, '-out',
-                                    path, '-name', name, '-passout', 'pass:'],
-                                    stdout=subprocess.PIPE,
-                                    stderr=subprocess.PIPE)
-            proc.wait()
-            if proc.returncode:
-                raise Exception(proc.stderr.read())
-
-            # Load p12 file to the database
-            proc = subprocess.Popen(['pk12util', '-i', path, '-d',
-                                    self.IPSEC_D, '-W', ''],
-                                    stdout=subprocess.PIPE,
-                                    stderr=subprocess.PIPE)
-            proc.wait()
-            if proc.returncode:
-                raise Exception(proc.stderr.read())
-
-        except Exception as e:
-            vlog.err("Import cert and key failed.\n" + str(e))
+        # Avoid deleting other files
+        path = os.path.abspath('/tmp/%s.p12' % name)
+        if not path.startswith('/tmp/'):
+            vlog.err("Illegal certificate name '%s'!" % name)
+            return
+
+        if run_command(['openssl', 'pkcs12', '-export',
+                       '-in', cert, '-inkey', key,
+                       '-out', path, '-name', name,
+                       '-passout', 'pass:'],
+                       "create p12 file from pem files")[0]:
+            return
+
+        # Load p12 file to the database
+        run_command(['pk12util', '-i', path, '-d', self.IPSEC_D, '-W', ''],
+                    "load p12 file to the NSS database")
         os.remove(path)
 
     def _nss_delete_cert_and_key(self, name):
-        try:
-            # Delete certificate and private key
-            proc = subprocess.Popen(['certutil', '-F', '-d',
-                                    self.IPSEC_D, '-n', name],
-                                    stdout=subprocess.PIPE,
-                                    stderr=subprocess.PIPE)
-            proc.wait()
-            if proc.returncode:
-                raise Exception(proc.stderr.read())
-
-        except Exception as e:
-            vlog.err("Delete cert and key failed.\n" + str(e))
+        # Delete certificate and private key
+        run_command(['certutil', '-F', '-d', self.IPSEC_D, '-n', name],
+                    "delete certificate and private key for %s" % name)
 
 
 class IPsecTunnel(object):
@@ -1220,19 +1196,15 @@  class IPsecMonitor(object):
             self.ike_helper.refresh(self)
 
     def _get_cn_from_cert(self, cert):
-        try:
-            proc = subprocess.Popen(['openssl', 'x509', '-noout', '-subject',
-                                    '-nameopt', 'RFC2253', '-in', cert],
-                                    stdout=subprocess.PIPE,
-                                    stderr=subprocess.PIPE)
-            proc.wait()
-            if proc.returncode:
-                raise Exception(proc.stderr.read())
-            m = re.search(r"CN=(.+?),", proc.stdout.readline().decode())
-            if not m:
-                raise Exception("No CN in the certificate subject.")
-        except Exception as e:
-            vlog.warn(str(e))
+        ret, pout, perr = run_command(['openssl', 'x509', '-noout', '-subject',
+                                       '-nameopt', 'RFC2253', '-in', cert],
+                                       "get certificate %s options" % cert)
+        if ret:
+            return None
+
+        m = re.search(r"CN=(.+?),", pout.decode().strip())
+        if not m:
+            vlog.warn("No CN in the certificate subject (%s)." % cert)
             return None
 
         return m.group(1)