diff mbox series

[ovs-dev,v2] tc: fix crash on EAGAIN return from recvmsg on netlink socket.

Message ID 20230511100350.3898835-1-frode.nordahl@canonical.com
State Superseded
Delegated to: Simon Horman
Headers show
Series [ovs-dev,v2] tc: fix crash on EAGAIN return from recvmsg on netlink socket. | expand

Checks

Context Check Description
ovsrobot/apply-robot success apply and check: success
ovsrobot/github-robot-_Build_and_Test fail github build: failed
ovsrobot/intel-ovs-compilation success test: success

Commit Message

Frode Nordahl May 11, 2023, 10:03 a.m. UTC
The tc module combines the use of the `tc_transact` helper
function for communication with the in-kernel tc infrastructure
with assertions on the reply data by `ofpbuf_at_assert` on the
received data prior to further processing.

`tc_transact` in turn calls `nl_transact`, which via
`nl_transact_multiple__` ultimately calls and handles return
value from `recvmsg`.  On error a check for EAGAIN is performed
and a consequence of this condition is effectively to provide a
non-error (0) result and an empty reply buffer.

Before this change, the `tc_transact` and, consumers of it, were
unaware of this condition.  The use of assertions on the reply
buffer can as such lead to a fatal crash of OVS.

To be fair, the behavior of `nl_transact` when handling an EAGAIN
return is currently not documented, so this change also adds that
documentation.

While fixing the problem, it led me to find potential problems
with several of the one-time initialization functions in
the netdev-offload-tc module.  Before this patch they would
happily accept temporary failure as permanent one, with
consequences for the remaining lifetime of the process.

For the record, the symptom of the crash is this in the log:
EMER|../include/openvswitch/ofpbuf.h:194: assertion offset + size <= b->size failed in ofpbuf_at_assert()

And an excerpt of the backtrace looks like this:
0x0000561dac1396d1 in ofpbuf_at_assert (b=0x7fb650005d20, b=0x7fb650005d20, offset=16, size=20) at ../include/openvswitch/ofpbuf.h:194
tc_replace_flower (id=<optimized out>, flower=<optimized out>) at ../lib/tc.c:3223
0x0000561dac128155 in netdev_tc_flow_put (netdev=0x561dacf91840, match=<optimized out>, actions=<optimized out>, actions_len=<optimized out>,
ufid=<optimized out>, info=<optimized out>, stats=<optimized out>) at ../lib/netdev-offload-tc.c:2096
0x0000561dac117541 in netdev_flow_put (stats=<optimized out>, info=0x7fb65b7ba780, ufid=<optimized out>, act_len=<optimized out>, actions=<optimized out>,
match=0x7fb65b7ba980, netdev=0x561dacf91840) at ../lib/netdev-offload.c:257
parse_flow_put (put=0x7fb65b7bcc50, dpif=0x561dad0ad550) at ../lib/dpif-netlink.c:2297
try_send_to_netdev (op=0x7fb65b7bcc48, dpif=0x561dad0ad550) at ../lib/dpif-netlink.c:2384

Reported-At: https://launchpad.net/bugs/2018500
Fixes f98e418fbdb6 (tc: Add tc flower functions)
Fixes 407556ac6c90 (netlink-socket: Avoid forcing a reply for final message in a transaction.)
Signed-off-by: Frode Nordahl <frode.nordahl@canonical.com>
---
 lib/dpif-netlink.c      |  4 +++-
 lib/dpif.c              |  2 +-
 lib/netdev-linux.c      | 40 ++++++++++++++++++++++++++--------------
 lib/netdev-offload-tc.c | 31 ++++++++++++++++++++++---------
 lib/netlink-socket.c    |  5 +++++
 lib/tc.c                | 18 ++++++++++++++++++
 6 files changed, 75 insertions(+), 25 deletions(-)

Comments

Roi Dayan May 11, 2023, 10:50 a.m. UTC | #1
On 11/05/2023 13:03, Frode Nordahl wrote:
> The tc module combines the use of the `tc_transact` helper
> function for communication with the in-kernel tc infrastructure
> with assertions on the reply data by `ofpbuf_at_assert` on the
> received data prior to further processing.
> 
> `tc_transact` in turn calls `nl_transact`, which via
> `nl_transact_multiple__` ultimately calls and handles return
> value from `recvmsg`.  On error a check for EAGAIN is performed
> and a consequence of this condition is effectively to provide a
> non-error (0) result and an empty reply buffer.
> 
> Before this change, the `tc_transact` and, consumers of it, were
> unaware of this condition.  The use of assertions on the reply
> buffer can as such lead to a fatal crash of OVS.
> 
> To be fair, the behavior of `nl_transact` when handling an EAGAIN
> return is currently not documented, so this change also adds that
> documentation.
> 
> While fixing the problem, it led me to find potential problems
> with several of the one-time initialization functions in
> the netdev-offload-tc module.  Before this patch they would
> happily accept temporary failure as permanent one, with
> consequences for the remaining lifetime of the process.

Good catch on EAGAIN returning empty buffer.
I have a question about the change to loop EAGAIN in netdev-offload-tc calls.
Why not to keep treat EAGAIN as an error? the revalidator will call
again if needed.
In current situation what will happen if kernel will keep giving EAGAIN?
why to keep ovs in this loop and not resume normal operation until next revalidation?
This is so OVS could keep do other tasks as needed.

> 
> For the record, the symptom of the crash is this in the log:
> EMER|../include/openvswitch/ofpbuf.h:194: assertion offset + size <= b->size failed in ofpbuf_at_assert()
> 
> And an excerpt of the backtrace looks like this:
> 0x0000561dac1396d1 in ofpbuf_at_assert (b=0x7fb650005d20, b=0x7fb650005d20, offset=16, size=20) at ../include/openvswitch/ofpbuf.h:194
> tc_replace_flower (id=<optimized out>, flower=<optimized out>) at ../lib/tc.c:3223
> 0x0000561dac128155 in netdev_tc_flow_put (netdev=0x561dacf91840, match=<optimized out>, actions=<optimized out>, actions_len=<optimized out>,
> ufid=<optimized out>, info=<optimized out>, stats=<optimized out>) at ../lib/netdev-offload-tc.c:2096
> 0x0000561dac117541 in netdev_flow_put (stats=<optimized out>, info=0x7fb65b7ba780, ufid=<optimized out>, act_len=<optimized out>, actions=<optimized out>,
> match=0x7fb65b7ba980, netdev=0x561dacf91840) at ../lib/netdev-offload.c:257
> parse_flow_put (put=0x7fb65b7bcc50, dpif=0x561dad0ad550) at ../lib/dpif-netlink.c:2297
> try_send_to_netdev (op=0x7fb65b7bcc48, dpif=0x561dad0ad550) at ../lib/dpif-netlink.c:2384
> 
> Reported-At: https://launchpad.net/bugs/2018500
> Fixes f98e418fbdb6 (tc: Add tc flower functions)
> Fixes 407556ac6c90 (netlink-socket: Avoid forcing a reply for final message in a transaction.)
> Signed-off-by: Frode Nordahl <frode.nordahl@canonical.com>
> ---
>  lib/dpif-netlink.c      |  4 +++-
>  lib/dpif.c              |  2 +-
>  lib/netdev-linux.c      | 40 ++++++++++++++++++++++++++--------------
>  lib/netdev-offload-tc.c | 31 ++++++++++++++++++++++---------
>  lib/netlink-socket.c    |  5 +++++
>  lib/tc.c                | 18 ++++++++++++++++++
>  6 files changed, 75 insertions(+), 25 deletions(-)
> 
> diff --git a/lib/dpif-netlink.c b/lib/dpif-netlink.c
> index 60bd39643..75537b2f8 100644
> --- a/lib/dpif-netlink.c
> +++ b/lib/dpif-netlink.c
> @@ -2340,7 +2340,9 @@ parse_flow_put(struct dpif_netlink *dpif, struct dpif_flow_put *put)
>              }
>              netdev_set_hw_info(oor_netdev, HW_INFO_TYPE_OOR, true);
>          }
> -        level = (err == ENOSPC || err == EOPNOTSUPP) ? VLL_DBG : VLL_ERR;
> +        level = (err == EAGAIN ||
> +                 err == ENOSPC ||
> +                 err == EOPNOTSUPP) ? VLL_DBG : VLL_ERR;
>          VLOG_RL(&rl, level, "failed to offload flow: %s: %s",
>                  ovs_strerror(err),
>                  (oor_netdev ? oor_netdev->name : dev->name));
> diff --git a/lib/dpif.c b/lib/dpif.c
> index 3305401fe..2bbb0aac4 100644
> --- a/lib/dpif.c
> +++ b/lib/dpif.c
> @@ -1751,7 +1751,7 @@ flow_message_log_level(int error)
>       * Kernels that support flow wildcarding will reject these flows as
>       * duplicates (EEXIST), so lower the log level to debug for these
>       * types of messages. */
> -    return (error && error != EEXIST) ? VLL_WARN : VLL_DBG;
> +    return (error && error != EEXIST && error != EAGAIN) ? VLL_WARN : VLL_DBG;
>  }
>  
>  static bool
> diff --git a/lib/netdev-linux.c b/lib/netdev-linux.c
> index 36620199e..a6065bf69 100644
> --- a/lib/netdev-linux.c
> +++ b/lib/netdev-linux.c
> @@ -5817,8 +5817,11 @@ tc_get_policer_action(uint32_t index, struct ofputil_meter_stats *stats)
>  
>      error = tc_transact(&request, &replyp);
>      if (error) {
> -        VLOG_ERR_RL(&rl, "Failed to dump police action (index: %u), err=%d",
> -                    index, error);
> +        VLOG_RL(&rl,
> +                (error == EAGAIN) ? VLL_DBG : VLL_ERR,
> +                "Failed to dump police action (index: %u), err=%d",
> +                index,
> +                error);
>          return error;
>      }
>  
> @@ -5849,8 +5852,11 @@ tc_del_policer_action(uint32_t index, struct ofputil_meter_stats *stats)
>  
>      error = tc_transact(&request, &replyp);
>      if (error) {
> -        VLOG_ERR_RL(&rl, "Failed to delete police action (index: %u), err=%d",
> -                    index, error);
> +        VLOG_RL(&rl,
> +                (error == EAGAIN) ? VLL_DBG : VLL_ERR,
> +                "Failed to delete police action (index: %u), err=%d",
> +                index,
> +                error);
>          return error;
>      }
>  
> @@ -6114,11 +6120,13 @@ tc_query_class(const struct netdev *netdev,
>  
>      error = tc_transact(&request, replyp);
>      if (error) {
> -        VLOG_WARN_RL(&rl, "query %s class %u:%u (parent %u:%u) failed (%s)",
> -                     netdev_get_name(netdev),
> -                     tc_get_major(handle), tc_get_minor(handle),
> -                     tc_get_major(parent), tc_get_minor(parent),
> -                     ovs_strerror(error));
> +        VLOG_RL(&rl,
> +                (error == EAGAIN) ? VLL_DBG : VLL_WARN,
> +                "query %s class %u:%u (parent %u:%u) failed (%s)",
> +                netdev_get_name(netdev),
> +                tc_get_major(handle), tc_get_minor(handle),
> +                tc_get_major(parent), tc_get_minor(parent),
> +                ovs_strerror(error));
>      }
>      return error;
>  }
> @@ -6140,10 +6148,12 @@ tc_delete_class(const struct netdev *netdev, unsigned int handle)
>  
>      error = tc_transact(&request, NULL);
>      if (error) {
> -        VLOG_WARN_RL(&rl, "delete %s class %u:%u failed (%s)",
> -                     netdev_get_name(netdev),
> -                     tc_get_major(handle), tc_get_minor(handle),
> -                     ovs_strerror(error));
> +        VLOG_RL(&rl,
> +                (error == EAGAIN) ? VLL_DBG : VLL_WARN,
> +                "delete %s class %u:%u failed (%s)",
> +                netdev_get_name(netdev),
> +                tc_get_major(handle), tc_get_minor(handle),
> +                ovs_strerror(error));
>      }
>      return error;
>  }
> @@ -6262,7 +6272,9 @@ tc_query_qdisc(const struct netdev *netdev_)
>                  ops = &tc_ops_other;
>              }
>          }
> -    } else if ((!error && !qdisc->size) || error == ENOENT) {
> +    } else if ((!error && !qdisc->size) ||
> +               error == ENOENT || error == EAGAIN)
> +    {
>          /* Either it's a built-in qdisc, or (on Linux pre-2.6.35) it's a qdisc
>           * set up by some other entity that doesn't have a handle 1:0.  We will
>           * assume that it's the system default qdisc. */
> diff --git a/lib/netdev-offload-tc.c b/lib/netdev-offload-tc.c
> index 4f26dd8cc..4c15989d6 100644
> --- a/lib/netdev-offload-tc.c
> +++ b/lib/netdev-offload-tc.c
> @@ -2505,10 +2505,11 @@ netdev_tc_flow_get(struct netdev *netdev,
>  
>      err = tc_get_flower(&id, &flower);
>      if (err) {
> -        VLOG_ERR_RL(&error_rl,
> -                    "flow get failed (dev %s prio %d handle %d): %s",
> -                    netdev_get_name(netdev), id.prio, id.handle,
> -                    ovs_strerror(err));
> +        VLOG_RL(&error_rl,
> +                (err == EAGAIN) ? VLL_DBG : VLL_ERR,
> +                "flow get failed (dev %s prio %d handle %d): %s",
> +                netdev_get_name(netdev), id.prio, id.handle,
> +                ovs_strerror(err));
>          return err;
>      }
>  
> @@ -2594,7 +2595,9 @@ probe_multi_mask_per_prio(int ifindex)
>      memset(&flower.mask.dst_mac, 0xff, sizeof flower.mask.dst_mac);
>  
>      id1 = tc_make_tcf_id(ifindex, block_id, prio, TC_INGRESS);
> -    error = tc_replace_flower(&id1, &flower);
> +    do {
> +        error = tc_replace_flower(&id1, &flower);
> +    } while (error == EAGAIN);
>      if (error) {
>          goto out;
>      }
> @@ -2603,7 +2606,9 @@ probe_multi_mask_per_prio(int ifindex)
>      memset(&flower.mask.src_mac, 0xff, sizeof flower.mask.src_mac);
>  
>      id2 = tc_make_tcf_id(ifindex, block_id, prio, TC_INGRESS);
> -    error = tc_replace_flower(&id2, &flower);
> +    do {
> +        error = tc_replace_flower(&id2, &flower);
> +    } while (error == EAGAIN);
>      tc_del_flower_filter(&id1);
>  
>      if (error) {
> @@ -2625,6 +2630,7 @@ probe_insert_ct_state_rule(int ifindex, uint16_t ct_state, struct tcf_id *id)
>  {
>      int prio = TC_RESERVED_PRIORITY_MAX + 1;
>      struct tc_flower flower;
> +    int error;
>  
>      memset(&flower, 0, sizeof flower);
>      flower.key.ct_state = ct_state;
> @@ -2634,7 +2640,10 @@ probe_insert_ct_state_rule(int ifindex, uint16_t ct_state, struct tcf_id *id)
>      flower.mask.eth_type = OVS_BE16_MAX;
>  
>      *id = tc_make_tcf_id(ifindex, 0, prio, TC_INGRESS);
> -    return tc_replace_flower(id, &flower);
> +    do {
> +        error = tc_replace_flower(id, &flower);
> +    } while (error == EAGAIN);
> +    return error;
>  }
>  
>  static void
> @@ -2657,7 +2666,9 @@ probe_ct_state_support(int ifindex)
>          goto out;
>      }
>  
> -    error = tc_get_flower(&id, &flower);
> +    do {
> +        error = tc_get_flower(&id, &flower);
> +    } while (error == EAGAIN);
>      if (error || flower.mask.ct_state != ct_state) {
>          goto out_del;
>      }
> @@ -2732,7 +2743,9 @@ probe_tc_block_support(int ifindex)
>      memset(&flower.mask.dst_mac, 0xff, sizeof flower.mask.dst_mac);
>  
>      id = tc_make_tcf_id(ifindex, block_id, prio, TC_INGRESS);
> -    error = tc_replace_flower(&id, &flower);
> +    do {
> +        error = tc_replace_flower(&id, &flower);
> +    } while (error == EAGAIN);
>  
>      tc_add_del_qdisc(ifindex, false, block_id, TC_INGRESS);
>  
> diff --git a/lib/netlink-socket.c b/lib/netlink-socket.c
> index 80da20d9f..dea060fc3 100644
> --- a/lib/netlink-socket.c
> +++ b/lib/netlink-socket.c
> @@ -1798,6 +1798,11 @@ nl_pool_release(struct nl_sock *sock)
>   *
>   *      2. Resending the request causes it to be re-executed, so the request
>   *         needs to be idempotent.
> + *
> + *      3. In the event that the kernel is too busy to handle the request to
> + *         receive the response (i.e. EAGAIN), this function will still return
> + *         0.  The caller can check for this condition by checking for a zero
> + *         size of the 'replyp' ofpbuf buffer.
>   */
>  int
>  nl_transact(int protocol, const struct ofpbuf *request,
> diff --git a/lib/tc.c b/lib/tc.c
> index 5c32c6f97..7df6a10c5 100644
> --- a/lib/tc.c
> +++ b/lib/tc.c
> @@ -237,11 +237,29 @@ static void request_from_tcf_id(struct tcf_id *id, uint16_t eth_type,
>      }
>  }
>  
> +/* The `tc_transact` function is a wrapper around `nl_transact` with the
> + * addition of:
> + *
> + * 1. the 'request' ofpbuf buffer is freed after `nl_transact` returns,
> + *    regardless of success or failure.
> + *
> + * 2. When a 'replyp' pointer is provided; in the event of the kernel
> + *    being too busy to process the request for the response, a positive
> + *    error return will be provided with the value of EAGAIN.
> + *
> + * Please acquaint yourself with the documentation of the `nl_transact`
> + * function in the netlink-sock module before making use of this function.
> + */
>  int
>  tc_transact(struct ofpbuf *request, struct ofpbuf **replyp)
>  {
>      int error = nl_transact(NETLINK_ROUTE, request, replyp);
>      ofpbuf_uninit(request);
> +
> +    if (!error && replyp && !(*replyp)->size) {
> +        return EAGAIN;
> +    }
> +
>      return error;
>  }
>
Frode Nordahl May 11, 2023, 11:08 a.m. UTC | #2
Hello, Roi,

Thank you for your prompt reply, see response in-line below.

On Thu, May 11, 2023 at 12:51 PM Roi Dayan <roid@nvidia.com> wrote:
> On 11/05/2023 13:03, Frode Nordahl wrote:
> > The tc module combines the use of the `tc_transact` helper
> > function for communication with the in-kernel tc infrastructure
> > with assertions on the reply data by `ofpbuf_at_assert` on the
> > received data prior to further processing.
> >
> > `tc_transact` in turn calls `nl_transact`, which via
> > `nl_transact_multiple__` ultimately calls and handles return
> > value from `recvmsg`.  On error a check for EAGAIN is performed
> > and a consequence of this condition is effectively to provide a
> > non-error (0) result and an empty reply buffer.
> >
> > Before this change, the `tc_transact` and, consumers of it, were
> > unaware of this condition.  The use of assertions on the reply
> > buffer can as such lead to a fatal crash of OVS.
> >
> > To be fair, the behavior of `nl_transact` when handling an EAGAIN
> > return is currently not documented, so this change also adds that
> > documentation.
> >
> > While fixing the problem, it led me to find potential problems
> > with several of the one-time initialization functions in
> > the netdev-offload-tc module.  Before this patch they would
> > happily accept temporary failure as permanent one, with
> > consequences for the remaining lifetime of the process.
>
> Good catch on EAGAIN returning empty buffer.
> I have a question about the change to loop EAGAIN in netdev-offload-tc calls.
> Why not to keep treat EAGAIN as an error? the revalidator will call
> again if needed.
> In current situation what will happen if kernel will keep giving EAGAIN?
> why to keep ovs in this loop and not resume normal operation until next revalidation?
> This is so OVS could keep do other tasks as needed.

The change to loop was only made in one-time initialization functions
(i.e. `probe_multi_mask_per_prio`, `probe_insert_ct_state_rule`,
 `probe_ct_state_support`, `probe_tc_block_support`).

AFAICT, these are all ultimately called from within a one time
initialization block, i.e:

    if (ovsthread_once_start(&once)) {}

Which means they are only attempted once per ovs-vswitchd process invocation.

Or am I missing something?
Simon Horman May 12, 2023, 8:35 a.m. UTC | #3
On Thu, May 11, 2023 at 12:03:50PM +0200, Frode Nordahl wrote:
> The tc module combines the use of the `tc_transact` helper
> function for communication with the in-kernel tc infrastructure
> with assertions on the reply data by `ofpbuf_at_assert` on the
> received data prior to further processing.
> 
> `tc_transact` in turn calls `nl_transact`, which via
> `nl_transact_multiple__` ultimately calls and handles return
> value from `recvmsg`.  On error a check for EAGAIN is performed
> and a consequence of this condition is effectively to provide a
> non-error (0) result and an empty reply buffer.
> 
> Before this change, the `tc_transact` and, consumers of it, were
> unaware of this condition.  The use of assertions on the reply
> buffer can as such lead to a fatal crash of OVS.
> 
> To be fair, the behavior of `nl_transact` when handling an EAGAIN
> return is currently not documented, so this change also adds that
> documentation.
> 
> While fixing the problem, it led me to find potential problems
> with several of the one-time initialization functions in
> the netdev-offload-tc module.  Before this patch they would
> happily accept temporary failure as permanent one, with
> consequences for the remaining lifetime of the process.
> 
> For the record, the symptom of the crash is this in the log:
> EMER|../include/openvswitch/ofpbuf.h:194: assertion offset + size <= b->size failed in ofpbuf_at_assert()
> 
> And an excerpt of the backtrace looks like this:
> 0x0000561dac1396d1 in ofpbuf_at_assert (b=0x7fb650005d20, b=0x7fb650005d20, offset=16, size=20) at ../include/openvswitch/ofpbuf.h:194
> tc_replace_flower (id=<optimized out>, flower=<optimized out>) at ../lib/tc.c:3223
> 0x0000561dac128155 in netdev_tc_flow_put (netdev=0x561dacf91840, match=<optimized out>, actions=<optimized out>, actions_len=<optimized out>,
> ufid=<optimized out>, info=<optimized out>, stats=<optimized out>) at ../lib/netdev-offload-tc.c:2096
> 0x0000561dac117541 in netdev_flow_put (stats=<optimized out>, info=0x7fb65b7ba780, ufid=<optimized out>, act_len=<optimized out>, actions=<optimized out>,
> match=0x7fb65b7ba980, netdev=0x561dacf91840) at ../lib/netdev-offload.c:257
> parse_flow_put (put=0x7fb65b7bcc50, dpif=0x561dad0ad550) at ../lib/dpif-netlink.c:2297
> try_send_to_netdev (op=0x7fb65b7bcc48, dpif=0x561dad0ad550) at ../lib/dpif-netlink.c:2384
> 
> Reported-At: https://launchpad.net/bugs/2018500
> Fixes f98e418fbdb6 (tc: Add tc flower functions)
> Fixes 407556ac6c90 (netlink-socket: Avoid forcing a reply for final message in a transaction.)
> Signed-off-by: Frode Nordahl <frode.nordahl@canonical.com>

Hi Frode,

thanks for the patch, a nettlesome problem to be sure.

...

> diff --git a/lib/netdev-linux.c b/lib/netdev-linux.c
> index 36620199e..a6065bf69 100644
> --- a/lib/netdev-linux.c
> +++ b/lib/netdev-linux.c
> @@ -5817,8 +5817,11 @@ tc_get_policer_action(uint32_t index, struct ofputil_meter_stats *stats)
>  
>      error = tc_transact(&request, &replyp);
>      if (error) {
> -        VLOG_ERR_RL(&rl, "Failed to dump police action (index: %u), err=%d",
> -                    index, error);
> +        VLOG_RL(&rl,
> +                (error == EAGAIN) ? VLL_DBG : VLL_ERR,
> +                "Failed to dump police action (index: %u), err=%d",
> +                index,
> +                error);

nit: could we economise the number of lines?

        VLOG_RL(&rl, (error == EAGAIN) ? VLL_DBG : VLL_ERR,
                "Failed to dump police action (index: %u), err=%d",
                index, error);

>          return error;
>      }
>  

...

> @@ -2594,7 +2595,9 @@ probe_multi_mask_per_prio(int ifindex)
>      memset(&flower.mask.dst_mac, 0xff, sizeof flower.mask.dst_mac);
>  
>      id1 = tc_make_tcf_id(ifindex, block_id, prio, TC_INGRESS);
> -    error = tc_replace_flower(&id1, &flower);
> +    do {
> +        error = tc_replace_flower(&id1, &flower);

Perhaps there is a reason why this can't arise,
but I'm concerned that this loop could spin forever.

> +    } while (error == EAGAIN);
>      if (error) {
>          goto out;
>      }

...
Frode Nordahl May 12, 2023, 9:25 a.m. UTC | #4
On Fri, May 12, 2023 at 10:35 AM Simon Horman <simon.horman@corigine.com> wrote:
>
> On Thu, May 11, 2023 at 12:03:50PM +0200, Frode Nordahl wrote:
> > The tc module combines the use of the `tc_transact` helper
> > function for communication with the in-kernel tc infrastructure
> > with assertions on the reply data by `ofpbuf_at_assert` on the
> > received data prior to further processing.
> >
> > `tc_transact` in turn calls `nl_transact`, which via
> > `nl_transact_multiple__` ultimately calls and handles return
> > value from `recvmsg`.  On error a check for EAGAIN is performed
> > and a consequence of this condition is effectively to provide a
> > non-error (0) result and an empty reply buffer.
> >
> > Before this change, the `tc_transact` and, consumers of it, were
> > unaware of this condition.  The use of assertions on the reply
> > buffer can as such lead to a fatal crash of OVS.
> >
> > To be fair, the behavior of `nl_transact` when handling an EAGAIN
> > return is currently not documented, so this change also adds that
> > documentation.
> >
> > While fixing the problem, it led me to find potential problems
> > with several of the one-time initialization functions in
> > the netdev-offload-tc module.  Before this patch they would
> > happily accept temporary failure as permanent one, with
> > consequences for the remaining lifetime of the process.
> >
> > For the record, the symptom of the crash is this in the log:
> > EMER|../include/openvswitch/ofpbuf.h:194: assertion offset + size <= b->size failed in ofpbuf_at_assert()
> >
> > And an excerpt of the backtrace looks like this:
> > 0x0000561dac1396d1 in ofpbuf_at_assert (b=0x7fb650005d20, b=0x7fb650005d20, offset=16, size=20) at ../include/openvswitch/ofpbuf.h:194
> > tc_replace_flower (id=<optimized out>, flower=<optimized out>) at ../lib/tc.c:3223
> > 0x0000561dac128155 in netdev_tc_flow_put (netdev=0x561dacf91840, match=<optimized out>, actions=<optimized out>, actions_len=<optimized out>,
> > ufid=<optimized out>, info=<optimized out>, stats=<optimized out>) at ../lib/netdev-offload-tc.c:2096
> > 0x0000561dac117541 in netdev_flow_put (stats=<optimized out>, info=0x7fb65b7ba780, ufid=<optimized out>, act_len=<optimized out>, actions=<optimized out>,
> > match=0x7fb65b7ba980, netdev=0x561dacf91840) at ../lib/netdev-offload.c:257
> > parse_flow_put (put=0x7fb65b7bcc50, dpif=0x561dad0ad550) at ../lib/dpif-netlink.c:2297
> > try_send_to_netdev (op=0x7fb65b7bcc48, dpif=0x561dad0ad550) at ../lib/dpif-netlink.c:2384
> >
> > Reported-At: https://launchpad.net/bugs/2018500
> > Fixes f98e418fbdb6 (tc: Add tc flower functions)
> > Fixes 407556ac6c90 (netlink-socket: Avoid forcing a reply for final message in a transaction.)
> > Signed-off-by: Frode Nordahl <frode.nordahl@canonical.com>
>
> Hi Frode,
>
> thanks for the patch, a nettlesome problem to be sure.

Thank you for taking the time to review!

> ...
>
> > diff --git a/lib/netdev-linux.c b/lib/netdev-linux.c
> > index 36620199e..a6065bf69 100644
> > --- a/lib/netdev-linux.c
> > +++ b/lib/netdev-linux.c
> > @@ -5817,8 +5817,11 @@ tc_get_policer_action(uint32_t index, struct ofputil_meter_stats *stats)
> >
> >      error = tc_transact(&request, &replyp);
> >      if (error) {
> > -        VLOG_ERR_RL(&rl, "Failed to dump police action (index: %u), err=%d",
> > -                    index, error);
> > +        VLOG_RL(&rl,
> > +                (error == EAGAIN) ? VLL_DBG : VLL_ERR,
> > +                "Failed to dump police action (index: %u), err=%d",
> > +                index,
> > +                error);
>
> nit: could we economise the number of lines?

Sure, I will compact these.

>         VLOG_RL(&rl, (error == EAGAIN) ? VLL_DBG : VLL_ERR,
>                 "Failed to dump police action (index: %u), err=%d",
>                 index, error);
>
> >          return error;
> >      }
> >
>
> ...
>
> > @@ -2594,7 +2595,9 @@ probe_multi_mask_per_prio(int ifindex)
> >      memset(&flower.mask.dst_mac, 0xff, sizeof flower.mask.dst_mac);
> >
> >      id1 = tc_make_tcf_id(ifindex, block_id, prio, TC_INGRESS);
> > -    error = tc_replace_flower(&id1, &flower);
> > +    do {
> > +        error = tc_replace_flower(&id1, &flower);
>
> Perhaps there is a reason why this can't arise,
> but I'm concerned that this loop could spin forever.

I understand the concern, my rationale is that if we make the wrong decision
here because of a temporary error, it will have consequences for the rest of
the execution time of the process/thread, which may be a source of
inconsistent behavior. But we absolutely do not want to hang forever either.

I'll revisit this and see if there is a way to retry the
initialization steps for a
finite amount of attempts/time and move on with a warning/error logged
after that.

Does that sound reasonable?
Simon Horman May 12, 2023, 9:43 a.m. UTC | #5
On Fri, May 12, 2023 at 11:25:02AM +0200, Frode Nordahl wrote:
> On Fri, May 12, 2023 at 10:35 AM Simon Horman <simon.horman@corigine.com> wrote:
> >
> > On Thu, May 11, 2023 at 12:03:50PM +0200, Frode Nordahl wrote:
> > > The tc module combines the use of the `tc_transact` helper
> > > function for communication with the in-kernel tc infrastructure
> > > with assertions on the reply data by `ofpbuf_at_assert` on the
> > > received data prior to further processing.
> > >
> > > `tc_transact` in turn calls `nl_transact`, which via
> > > `nl_transact_multiple__` ultimately calls and handles return
> > > value from `recvmsg`.  On error a check for EAGAIN is performed
> > > and a consequence of this condition is effectively to provide a
> > > non-error (0) result and an empty reply buffer.
> > >
> > > Before this change, the `tc_transact` and, consumers of it, were
> > > unaware of this condition.  The use of assertions on the reply
> > > buffer can as such lead to a fatal crash of OVS.
> > >
> > > To be fair, the behavior of `nl_transact` when handling an EAGAIN
> > > return is currently not documented, so this change also adds that
> > > documentation.
> > >
> > > While fixing the problem, it led me to find potential problems
> > > with several of the one-time initialization functions in
> > > the netdev-offload-tc module.  Before this patch they would
> > > happily accept temporary failure as permanent one, with
> > > consequences for the remaining lifetime of the process.
> > >
> > > For the record, the symptom of the crash is this in the log:
> > > EMER|../include/openvswitch/ofpbuf.h:194: assertion offset + size <= b->size failed in ofpbuf_at_assert()
> > >
> > > And an excerpt of the backtrace looks like this:
> > > 0x0000561dac1396d1 in ofpbuf_at_assert (b=0x7fb650005d20, b=0x7fb650005d20, offset=16, size=20) at ../include/openvswitch/ofpbuf.h:194
> > > tc_replace_flower (id=<optimized out>, flower=<optimized out>) at ../lib/tc.c:3223
> > > 0x0000561dac128155 in netdev_tc_flow_put (netdev=0x561dacf91840, match=<optimized out>, actions=<optimized out>, actions_len=<optimized out>,
> > > ufid=<optimized out>, info=<optimized out>, stats=<optimized out>) at ../lib/netdev-offload-tc.c:2096
> > > 0x0000561dac117541 in netdev_flow_put (stats=<optimized out>, info=0x7fb65b7ba780, ufid=<optimized out>, act_len=<optimized out>, actions=<optimized out>,
> > > match=0x7fb65b7ba980, netdev=0x561dacf91840) at ../lib/netdev-offload.c:257
> > > parse_flow_put (put=0x7fb65b7bcc50, dpif=0x561dad0ad550) at ../lib/dpif-netlink.c:2297
> > > try_send_to_netdev (op=0x7fb65b7bcc48, dpif=0x561dad0ad550) at ../lib/dpif-netlink.c:2384
> > >
> > > Reported-At: https://launchpad.net/bugs/2018500
> > > Fixes f98e418fbdb6 (tc: Add tc flower functions)
> > > Fixes 407556ac6c90 (netlink-socket: Avoid forcing a reply for final message in a transaction.)
> > > Signed-off-by: Frode Nordahl <frode.nordahl@canonical.com>
> >
> > Hi Frode,
> >
> > thanks for the patch, a nettlesome problem to be sure.
> 
> Thank you for taking the time to review!
> 
> > ...
> >
> > > diff --git a/lib/netdev-linux.c b/lib/netdev-linux.c
> > > index 36620199e..a6065bf69 100644
> > > --- a/lib/netdev-linux.c
> > > +++ b/lib/netdev-linux.c
> > > @@ -5817,8 +5817,11 @@ tc_get_policer_action(uint32_t index, struct ofputil_meter_stats *stats)
> > >
> > >      error = tc_transact(&request, &replyp);
> > >      if (error) {
> > > -        VLOG_ERR_RL(&rl, "Failed to dump police action (index: %u), err=%d",
> > > -                    index, error);
> > > +        VLOG_RL(&rl,
> > > +                (error == EAGAIN) ? VLL_DBG : VLL_ERR,
> > > +                "Failed to dump police action (index: %u), err=%d",
> > > +                index,
> > > +                error);
> >
> > nit: could we economise the number of lines?
> 
> Sure, I will compact these.
> 
> >         VLOG_RL(&rl, (error == EAGAIN) ? VLL_DBG : VLL_ERR,
> >                 "Failed to dump police action (index: %u), err=%d",
> >                 index, error);
> >
> > >          return error;
> > >      }
> > >
> >
> > ...
> >
> > > @@ -2594,7 +2595,9 @@ probe_multi_mask_per_prio(int ifindex)
> > >      memset(&flower.mask.dst_mac, 0xff, sizeof flower.mask.dst_mac);
> > >
> > >      id1 = tc_make_tcf_id(ifindex, block_id, prio, TC_INGRESS);
> > > -    error = tc_replace_flower(&id1, &flower);
> > > +    do {
> > > +        error = tc_replace_flower(&id1, &flower);
> >
> > Perhaps there is a reason why this can't arise,
> > but I'm concerned that this loop could spin forever.
> 
> I understand the concern, my rationale is that if we make the wrong decision
> here because of a temporary error, it will have consequences for the rest of
> the execution time of the process/thread, which may be a source of
> inconsistent behavior. But we absolutely do not want to hang forever either.
> 
> I'll revisit this and see if there is a way to retry the
> initialization steps for a
> finite amount of attempts/time and move on with a warning/error logged
> after that.
> 
> Does that sound reasonable?

Yes, that would be what I would do too.
diff mbox series

Patch

diff --git a/lib/dpif-netlink.c b/lib/dpif-netlink.c
index 60bd39643..75537b2f8 100644
--- a/lib/dpif-netlink.c
+++ b/lib/dpif-netlink.c
@@ -2340,7 +2340,9 @@  parse_flow_put(struct dpif_netlink *dpif, struct dpif_flow_put *put)
             }
             netdev_set_hw_info(oor_netdev, HW_INFO_TYPE_OOR, true);
         }
-        level = (err == ENOSPC || err == EOPNOTSUPP) ? VLL_DBG : VLL_ERR;
+        level = (err == EAGAIN ||
+                 err == ENOSPC ||
+                 err == EOPNOTSUPP) ? VLL_DBG : VLL_ERR;
         VLOG_RL(&rl, level, "failed to offload flow: %s: %s",
                 ovs_strerror(err),
                 (oor_netdev ? oor_netdev->name : dev->name));
diff --git a/lib/dpif.c b/lib/dpif.c
index 3305401fe..2bbb0aac4 100644
--- a/lib/dpif.c
+++ b/lib/dpif.c
@@ -1751,7 +1751,7 @@  flow_message_log_level(int error)
      * Kernels that support flow wildcarding will reject these flows as
      * duplicates (EEXIST), so lower the log level to debug for these
      * types of messages. */
-    return (error && error != EEXIST) ? VLL_WARN : VLL_DBG;
+    return (error && error != EEXIST && error != EAGAIN) ? VLL_WARN : VLL_DBG;
 }
 
 static bool
diff --git a/lib/netdev-linux.c b/lib/netdev-linux.c
index 36620199e..a6065bf69 100644
--- a/lib/netdev-linux.c
+++ b/lib/netdev-linux.c
@@ -5817,8 +5817,11 @@  tc_get_policer_action(uint32_t index, struct ofputil_meter_stats *stats)
 
     error = tc_transact(&request, &replyp);
     if (error) {
-        VLOG_ERR_RL(&rl, "Failed to dump police action (index: %u), err=%d",
-                    index, error);
+        VLOG_RL(&rl,
+                (error == EAGAIN) ? VLL_DBG : VLL_ERR,
+                "Failed to dump police action (index: %u), err=%d",
+                index,
+                error);
         return error;
     }
 
@@ -5849,8 +5852,11 @@  tc_del_policer_action(uint32_t index, struct ofputil_meter_stats *stats)
 
     error = tc_transact(&request, &replyp);
     if (error) {
-        VLOG_ERR_RL(&rl, "Failed to delete police action (index: %u), err=%d",
-                    index, error);
+        VLOG_RL(&rl,
+                (error == EAGAIN) ? VLL_DBG : VLL_ERR,
+                "Failed to delete police action (index: %u), err=%d",
+                index,
+                error);
         return error;
     }
 
@@ -6114,11 +6120,13 @@  tc_query_class(const struct netdev *netdev,
 
     error = tc_transact(&request, replyp);
     if (error) {
-        VLOG_WARN_RL(&rl, "query %s class %u:%u (parent %u:%u) failed (%s)",
-                     netdev_get_name(netdev),
-                     tc_get_major(handle), tc_get_minor(handle),
-                     tc_get_major(parent), tc_get_minor(parent),
-                     ovs_strerror(error));
+        VLOG_RL(&rl,
+                (error == EAGAIN) ? VLL_DBG : VLL_WARN,
+                "query %s class %u:%u (parent %u:%u) failed (%s)",
+                netdev_get_name(netdev),
+                tc_get_major(handle), tc_get_minor(handle),
+                tc_get_major(parent), tc_get_minor(parent),
+                ovs_strerror(error));
     }
     return error;
 }
@@ -6140,10 +6148,12 @@  tc_delete_class(const struct netdev *netdev, unsigned int handle)
 
     error = tc_transact(&request, NULL);
     if (error) {
-        VLOG_WARN_RL(&rl, "delete %s class %u:%u failed (%s)",
-                     netdev_get_name(netdev),
-                     tc_get_major(handle), tc_get_minor(handle),
-                     ovs_strerror(error));
+        VLOG_RL(&rl,
+                (error == EAGAIN) ? VLL_DBG : VLL_WARN,
+                "delete %s class %u:%u failed (%s)",
+                netdev_get_name(netdev),
+                tc_get_major(handle), tc_get_minor(handle),
+                ovs_strerror(error));
     }
     return error;
 }
@@ -6262,7 +6272,9 @@  tc_query_qdisc(const struct netdev *netdev_)
                 ops = &tc_ops_other;
             }
         }
-    } else if ((!error && !qdisc->size) || error == ENOENT) {
+    } else if ((!error && !qdisc->size) ||
+               error == ENOENT || error == EAGAIN)
+    {
         /* Either it's a built-in qdisc, or (on Linux pre-2.6.35) it's a qdisc
          * set up by some other entity that doesn't have a handle 1:0.  We will
          * assume that it's the system default qdisc. */
diff --git a/lib/netdev-offload-tc.c b/lib/netdev-offload-tc.c
index 4f26dd8cc..4c15989d6 100644
--- a/lib/netdev-offload-tc.c
+++ b/lib/netdev-offload-tc.c
@@ -2505,10 +2505,11 @@  netdev_tc_flow_get(struct netdev *netdev,
 
     err = tc_get_flower(&id, &flower);
     if (err) {
-        VLOG_ERR_RL(&error_rl,
-                    "flow get failed (dev %s prio %d handle %d): %s",
-                    netdev_get_name(netdev), id.prio, id.handle,
-                    ovs_strerror(err));
+        VLOG_RL(&error_rl,
+                (err == EAGAIN) ? VLL_DBG : VLL_ERR,
+                "flow get failed (dev %s prio %d handle %d): %s",
+                netdev_get_name(netdev), id.prio, id.handle,
+                ovs_strerror(err));
         return err;
     }
 
@@ -2594,7 +2595,9 @@  probe_multi_mask_per_prio(int ifindex)
     memset(&flower.mask.dst_mac, 0xff, sizeof flower.mask.dst_mac);
 
     id1 = tc_make_tcf_id(ifindex, block_id, prio, TC_INGRESS);
-    error = tc_replace_flower(&id1, &flower);
+    do {
+        error = tc_replace_flower(&id1, &flower);
+    } while (error == EAGAIN);
     if (error) {
         goto out;
     }
@@ -2603,7 +2606,9 @@  probe_multi_mask_per_prio(int ifindex)
     memset(&flower.mask.src_mac, 0xff, sizeof flower.mask.src_mac);
 
     id2 = tc_make_tcf_id(ifindex, block_id, prio, TC_INGRESS);
-    error = tc_replace_flower(&id2, &flower);
+    do {
+        error = tc_replace_flower(&id2, &flower);
+    } while (error == EAGAIN);
     tc_del_flower_filter(&id1);
 
     if (error) {
@@ -2625,6 +2630,7 @@  probe_insert_ct_state_rule(int ifindex, uint16_t ct_state, struct tcf_id *id)
 {
     int prio = TC_RESERVED_PRIORITY_MAX + 1;
     struct tc_flower flower;
+    int error;
 
     memset(&flower, 0, sizeof flower);
     flower.key.ct_state = ct_state;
@@ -2634,7 +2640,10 @@  probe_insert_ct_state_rule(int ifindex, uint16_t ct_state, struct tcf_id *id)
     flower.mask.eth_type = OVS_BE16_MAX;
 
     *id = tc_make_tcf_id(ifindex, 0, prio, TC_INGRESS);
-    return tc_replace_flower(id, &flower);
+    do {
+        error = tc_replace_flower(id, &flower);
+    } while (error == EAGAIN);
+    return error;
 }
 
 static void
@@ -2657,7 +2666,9 @@  probe_ct_state_support(int ifindex)
         goto out;
     }
 
-    error = tc_get_flower(&id, &flower);
+    do {
+        error = tc_get_flower(&id, &flower);
+    } while (error == EAGAIN);
     if (error || flower.mask.ct_state != ct_state) {
         goto out_del;
     }
@@ -2732,7 +2743,9 @@  probe_tc_block_support(int ifindex)
     memset(&flower.mask.dst_mac, 0xff, sizeof flower.mask.dst_mac);
 
     id = tc_make_tcf_id(ifindex, block_id, prio, TC_INGRESS);
-    error = tc_replace_flower(&id, &flower);
+    do {
+        error = tc_replace_flower(&id, &flower);
+    } while (error == EAGAIN);
 
     tc_add_del_qdisc(ifindex, false, block_id, TC_INGRESS);
 
diff --git a/lib/netlink-socket.c b/lib/netlink-socket.c
index 80da20d9f..dea060fc3 100644
--- a/lib/netlink-socket.c
+++ b/lib/netlink-socket.c
@@ -1798,6 +1798,11 @@  nl_pool_release(struct nl_sock *sock)
  *
  *      2. Resending the request causes it to be re-executed, so the request
  *         needs to be idempotent.
+ *
+ *      3. In the event that the kernel is too busy to handle the request to
+ *         receive the response (i.e. EAGAIN), this function will still return
+ *         0.  The caller can check for this condition by checking for a zero
+ *         size of the 'replyp' ofpbuf buffer.
  */
 int
 nl_transact(int protocol, const struct ofpbuf *request,
diff --git a/lib/tc.c b/lib/tc.c
index 5c32c6f97..7df6a10c5 100644
--- a/lib/tc.c
+++ b/lib/tc.c
@@ -237,11 +237,29 @@  static void request_from_tcf_id(struct tcf_id *id, uint16_t eth_type,
     }
 }
 
+/* The `tc_transact` function is a wrapper around `nl_transact` with the
+ * addition of:
+ *
+ * 1. the 'request' ofpbuf buffer is freed after `nl_transact` returns,
+ *    regardless of success or failure.
+ *
+ * 2. When a 'replyp' pointer is provided; in the event of the kernel
+ *    being too busy to process the request for the response, a positive
+ *    error return will be provided with the value of EAGAIN.
+ *
+ * Please acquaint yourself with the documentation of the `nl_transact`
+ * function in the netlink-sock module before making use of this function.
+ */
 int
 tc_transact(struct ofpbuf *request, struct ofpbuf **replyp)
 {
     int error = nl_transact(NETLINK_ROUTE, request, replyp);
     ofpbuf_uninit(request);
+
+    if (!error && replyp && !(*replyp)->size) {
+        return EAGAIN;
+    }
+
     return error;
 }