diff mbox series

[ovs-dev,v3] tc: fix crash on EAGAIN return from recvmsg on netlink socket.

Message ID 20230515080448.954879-1-frode.nordahl@canonical.com
State Changes Requested
Headers show
Series [ovs-dev,v3] tc: fix crash on EAGAIN return from recvmsg on netlink socket. | expand

Checks

Context Check Description
ovsrobot/apply-robot success apply and check: success
ovsrobot/github-robot-_Build_and_Test success github build: passed
ovsrobot/intel-ovs-compilation success test: success

Commit Message

Frode Nordahl May 15, 2023, 8:04 a.m. UTC
The tc module combines the use of the `tc_transact` helper
function for communication with the in-kernel tc infrastructure
with assertions on the reply data by `ofpbuf_at_assert` on the
received data prior to further processing.

`tc_transact` in turn calls `nl_transact`, which via
`nl_transact_multiple__` ultimately calls and handles return
value from `recvmsg`.  On error a check for EAGAIN is performed
and a consequence of this condition is effectively to provide a
non-error (0) result and an empty reply buffer.

Before this change, the `tc_transact` and, consumers of it, were
unaware of this condition.  The use of assertions on the reply
buffer can as such lead to a fatal crash of OVS.

To be fair, the behavior of `nl_transact` when handling an EAGAIN
return is currently not documented, so this change also adds that
documentation.

While fixing the problem, it led me to find potential problems
with the one-time initialization functions in the netdev-offload-tc
module.  Make use of the information now available about an EAGAIN
condition to retry one-time initialization, and resort to logging
a warning if probing of tc features fails due to temporary
situations such as resource depletion.

For the record, the symptom of the crash is this in the log:
EMER|../include/openvswitch/ofpbuf.h:194: assertion offset + size <= b->size failed in ofpbuf_at_assert()

And an excerpt of the backtrace looks like this:
0x0000561dac1396d1 in ofpbuf_at_assert (b=0x7fb650005d20, b=0x7fb650005d20, offset=16, size=20) at ../include/openvswitch/ofpbuf.h:194
tc_replace_flower (id=<optimized out>, flower=<optimized out>) at ../lib/tc.c:3223
0x0000561dac128155 in netdev_tc_flow_put (netdev=0x561dacf91840, match=<optimized out>, actions=<optimized out>, actions_len=<optimized out>,
ufid=<optimized out>, info=<optimized out>, stats=<optimized out>) at ../lib/netdev-offload-tc.c:2096
0x0000561dac117541 in netdev_flow_put (stats=<optimized out>, info=0x7fb65b7ba780, ufid=<optimized out>, act_len=<optimized out>, actions=<optimized out>,
match=0x7fb65b7ba980, netdev=0x561dacf91840) at ../lib/netdev-offload.c:257
parse_flow_put (put=0x7fb65b7bcc50, dpif=0x561dad0ad550) at ../lib/dpif-netlink.c:2297
try_send_to_netdev (op=0x7fb65b7bcc48, dpif=0x561dad0ad550) at ../lib/dpif-netlink.c:2384

Reported-At: https://launchpad.net/bugs/2018500
Fixes f98e418fbdb6 (tc: Add tc flower functions)
Fixes 407556ac6c90 (netlink-socket: Avoid forcing a reply for final message in a transaction.)
Signed-off-by: Frode Nordahl <frode.nordahl@canonical.com>
---
 lib/dpif-netlink.c      |  4 +-
 lib/dpif.c              |  2 +-
 lib/netdev-linux.c      | 34 +++++++++-------
 lib/netdev-offload-tc.c | 86 +++++++++++++++++++++++++++++++++++------
 lib/netlink-socket.c    |  5 +++
 lib/tc.c                | 18 +++++++++
 6 files changed, 122 insertions(+), 27 deletions(-)

Comments

Roi Dayan May 15, 2023, 8:37 a.m. UTC | #1
On 15/05/2023 11:04, Frode Nordahl wrote:
> The tc module combines the use of the `tc_transact` helper
> function for communication with the in-kernel tc infrastructure
> with assertions on the reply data by `ofpbuf_at_assert` on the
> received data prior to further processing.
> 
> `tc_transact` in turn calls `nl_transact`, which via
> `nl_transact_multiple__` ultimately calls and handles return
> value from `recvmsg`.  On error a check for EAGAIN is performed
> and a consequence of this condition is effectively to provide a
> non-error (0) result and an empty reply buffer.
> 
> Before this change, the `tc_transact` and, consumers of it, were
> unaware of this condition.  The use of assertions on the reply
> buffer can as such lead to a fatal crash of OVS.
> 
> To be fair, the behavior of `nl_transact` when handling an EAGAIN
> return is currently not documented, so this change also adds that
> documentation.
> 
> While fixing the problem, it led me to find potential problems
> with the one-time initialization functions in the netdev-offload-tc
> module.  Make use of the information now available about an EAGAIN
> condition to retry one-time initialization, and resort to logging
> a warning if probing of tc features fails due to temporary
> situations such as resource depletion.
> 
> For the record, the symptom of the crash is this in the log:
> EMER|../include/openvswitch/ofpbuf.h:194: assertion offset + size <= b->size failed in ofpbuf_at_assert()
> 
> And an excerpt of the backtrace looks like this:
> 0x0000561dac1396d1 in ofpbuf_at_assert (b=0x7fb650005d20, b=0x7fb650005d20, offset=16, size=20) at ../include/openvswitch/ofpbuf.h:194
> tc_replace_flower (id=<optimized out>, flower=<optimized out>) at ../lib/tc.c:3223
> 0x0000561dac128155 in netdev_tc_flow_put (netdev=0x561dacf91840, match=<optimized out>, actions=<optimized out>, actions_len=<optimized out>,
> ufid=<optimized out>, info=<optimized out>, stats=<optimized out>) at ../lib/netdev-offload-tc.c:2096
> 0x0000561dac117541 in netdev_flow_put (stats=<optimized out>, info=0x7fb65b7ba780, ufid=<optimized out>, act_len=<optimized out>, actions=<optimized out>,
> match=0x7fb65b7ba980, netdev=0x561dacf91840) at ../lib/netdev-offload.c:257
> parse_flow_put (put=0x7fb65b7bcc50, dpif=0x561dad0ad550) at ../lib/dpif-netlink.c:2297
> try_send_to_netdev (op=0x7fb65b7bcc48, dpif=0x561dad0ad550) at ../lib/dpif-netlink.c:2384
> 
> Reported-At: https://launchpad.net/bugs/2018500
> Fixes f98e418fbdb6 (tc: Add tc flower functions)
> Fixes 407556ac6c90 (netlink-socket: Avoid forcing a reply for final message in a transaction.)
> Signed-off-by: Frode Nordahl <frode.nordahl@canonical.com>
> ---
>  lib/dpif-netlink.c      |  4 +-
>  lib/dpif.c              |  2 +-
>  lib/netdev-linux.c      | 34 +++++++++-------
>  lib/netdev-offload-tc.c | 86 +++++++++++++++++++++++++++++++++++------
>  lib/netlink-socket.c    |  5 +++
>  lib/tc.c                | 18 +++++++++
>  6 files changed, 122 insertions(+), 27 deletions(-)
> 
> diff --git a/lib/dpif-netlink.c b/lib/dpif-netlink.c
> index 60bd39643..75537b2f8 100644
> --- a/lib/dpif-netlink.c
> +++ b/lib/dpif-netlink.c
> @@ -2340,7 +2340,9 @@ parse_flow_put(struct dpif_netlink *dpif, struct dpif_flow_put *put)
>              }
>              netdev_set_hw_info(oor_netdev, HW_INFO_TYPE_OOR, true);
>          }
> -        level = (err == ENOSPC || err == EOPNOTSUPP) ? VLL_DBG : VLL_ERR;
> +        level = (err == EAGAIN ||
> +                 err == ENOSPC ||
> +                 err == EOPNOTSUPP) ? VLL_DBG : VLL_ERR;
>          VLOG_RL(&rl, level, "failed to offload flow: %s: %s",
>                  ovs_strerror(err),
>                  (oor_netdev ? oor_netdev->name : dev->name));
> diff --git a/lib/dpif.c b/lib/dpif.c
> index 3305401fe..2bbb0aac4 100644
> --- a/lib/dpif.c
> +++ b/lib/dpif.c
> @@ -1751,7 +1751,7 @@ flow_message_log_level(int error)
>       * Kernels that support flow wildcarding will reject these flows as
>       * duplicates (EEXIST), so lower the log level to debug for these
>       * types of messages. */
> -    return (error && error != EEXIST) ? VLL_WARN : VLL_DBG;
> +    return (error && error != EEXIST && error != EAGAIN) ? VLL_WARN : VLL_DBG;
>  }
>  
>  static bool
> diff --git a/lib/netdev-linux.c b/lib/netdev-linux.c
> index 36620199e..a0294ca00 100644
> --- a/lib/netdev-linux.c
> +++ b/lib/netdev-linux.c
> @@ -5817,8 +5817,9 @@ tc_get_policer_action(uint32_t index, struct ofputil_meter_stats *stats)
>  
>      error = tc_transact(&request, &replyp);
>      if (error) {
> -        VLOG_ERR_RL(&rl, "Failed to dump police action (index: %u), err=%d",
> -                    index, error);
> +        VLOG_RL(&rl, error == EAGAIN ? VLL_DBG : VLL_ERR,
> +                "Failed to dump police action (index: %u), err=%d",
> +                index, error);
>          return error;
>      }
>  
> @@ -5849,8 +5850,9 @@ tc_del_policer_action(uint32_t index, struct ofputil_meter_stats *stats)
>  
>      error = tc_transact(&request, &replyp);
>      if (error) {
> -        VLOG_ERR_RL(&rl, "Failed to delete police action (index: %u), err=%d",
> -                    index, error);
> +        VLOG_RL(&rl, error == EAGAIN ? VLL_DBG : VLL_ERR,
> +                "Failed to delete police action (index: %u), err=%d",
> +                index, error);
>          return error;
>      }
>  
> @@ -6114,11 +6116,12 @@ tc_query_class(const struct netdev *netdev,
>  
>      error = tc_transact(&request, replyp);
>      if (error) {
> -        VLOG_WARN_RL(&rl, "query %s class %u:%u (parent %u:%u) failed (%s)",
> -                     netdev_get_name(netdev),
> -                     tc_get_major(handle), tc_get_minor(handle),
> -                     tc_get_major(parent), tc_get_minor(parent),
> -                     ovs_strerror(error));
> +        VLOG_RL(&rl, error == EAGAIN ? VLL_DBG : VLL_WARN,
> +                "query %s class %u:%u (parent %u:%u) failed (%s)",
> +                netdev_get_name(netdev),
> +                tc_get_major(handle), tc_get_minor(handle),
> +                tc_get_major(parent), tc_get_minor(parent),
> +                ovs_strerror(error));
>      }
>      return error;
>  }
> @@ -6140,10 +6143,11 @@ tc_delete_class(const struct netdev *netdev, unsigned int handle)
>  
>      error = tc_transact(&request, NULL);
>      if (error) {
> -        VLOG_WARN_RL(&rl, "delete %s class %u:%u failed (%s)",
> -                     netdev_get_name(netdev),
> -                     tc_get_major(handle), tc_get_minor(handle),
> -                     ovs_strerror(error));
> +        VLOG_RL(&rl, error == EAGAIN ? VLL_DBG : VLL_WARN,
> +                "delete %s class %u:%u failed (%s)",
> +                netdev_get_name(netdev),
> +                tc_get_major(handle), tc_get_minor(handle),
> +                ovs_strerror(error));
>      }
>      return error;
>  }
> @@ -6262,7 +6266,9 @@ tc_query_qdisc(const struct netdev *netdev_)
>                  ops = &tc_ops_other;
>              }
>          }
> -    } else if ((!error && !qdisc->size) || error == ENOENT) {
> +    } else if ((!error && !qdisc->size) ||
> +               error == ENOENT || error == EAGAIN)
> +    {
>          /* Either it's a built-in qdisc, or (on Linux pre-2.6.35) it's a qdisc
>           * set up by some other entity that doesn't have a handle 1:0.  We will
>           * assume that it's the system default qdisc. */
> diff --git a/lib/netdev-offload-tc.c b/lib/netdev-offload-tc.c
> index 4f26dd8cc..4fb9dc239 100644
> --- a/lib/netdev-offload-tc.c
> +++ b/lib/netdev-offload-tc.c
> @@ -2505,10 +2505,10 @@ netdev_tc_flow_get(struct netdev *netdev,
>  
>      err = tc_get_flower(&id, &flower);
>      if (err) {
> -        VLOG_ERR_RL(&error_rl,
> -                    "flow get failed (dev %s prio %d handle %d): %s",
> -                    netdev_get_name(netdev), id.prio, id.handle,
> -                    ovs_strerror(err));
> +        VLOG_RL(&error_rl, err == EAGAIN ? VLL_DBG : VLL_ERR,
> +                "flow get failed (dev %s prio %d handle %d): %s",
> +                netdev_get_name(netdev), id.prio, id.handle,
> +                ovs_strerror(err));
>          return err;
>      }
>  
> @@ -2571,11 +2571,36 @@ netdev_tc_get_n_flows(struct netdev *netdev, uint64_t *n_flows)
>      return 0;
>  }
>  
> +/* This macro is for use by one-time initialization functions, where we have
> + * one shot per thread/process to perform a pertinent initialization task that
> + * may return a temporary error (EAGAIN).
> + *
> + * With min/max values of 1/64 we would retry 7 times, spending at the
> + * most 127 * 1E6 nsec (0.127s) sleeping.
> + */
> +#define NETDEV_OFFLOAD_TC_BACKOFF_MIN 1
> +#define NETDEV_OFFLOAD_TC_BACKOFF_MAX 64
> +#define NETDEV_OFFLOAD_TC_RETRY_WITH_BACKOFF(BACKOFF, ERROR, CONDITION,       \
> +                                             FUNCTION, ...)                   \
> +    BACKOFF = NETDEV_OFFLOAD_TC_BACKOFF_MIN;                                  \
> +    do {                                                                      \
> +        ERROR = (FUNCTION)(__VA_ARGS__);                                      \
> +        if (CONDITION) {                                                      \
> +            xnanosleep(BACKOFF * 1E6);                                        \
> +            if (BACKOFF < NETDEV_OFFLOAD_TC_BACKOFF_MAX) {                    \
> +                BACKOFF <<= 1;                                                \
> +            } else {                                                          \
> +                break;                                                        \
> +            }                                                                 \
> +        }                                                                     \
> +    } while (CONDITION);
> +
>  static void
>  probe_multi_mask_per_prio(int ifindex)
>  {
>      struct tc_flower flower;
>      struct tcf_id id1, id2;
> +    uint64_t backoff;
>      int block_id = 0;
>      int prio = 1;
>      int error;
> @@ -2594,8 +2619,13 @@ probe_multi_mask_per_prio(int ifindex)
>      memset(&flower.mask.dst_mac, 0xff, sizeof flower.mask.dst_mac);
>  
>      id1 = tc_make_tcf_id(ifindex, block_id, prio, TC_INGRESS);
> -    error = tc_replace_flower(&id1, &flower);
> +    NETDEV_OFFLOAD_TC_RETRY_WITH_BACKOFF(backoff, error, error == EAGAIN,
> +                                         tc_replace_flower, &id1, &flower);
>      if (error) {
> +        if (error == EAGAIN) {
> +            VLOG_WARN("probe tc: unable to probe for multiple mask "
> +                      "support: %s", ovs_strerror(error));
> +        }
>          goto out;
>      }
>  
> @@ -2603,10 +2633,15 @@ probe_multi_mask_per_prio(int ifindex)
>      memset(&flower.mask.src_mac, 0xff, sizeof flower.mask.src_mac);
>  
>      id2 = tc_make_tcf_id(ifindex, block_id, prio, TC_INGRESS);
> -    error = tc_replace_flower(&id2, &flower);
> +    NETDEV_OFFLOAD_TC_RETRY_WITH_BACKOFF(backoff, error, error == EAGAIN,
> +                                         tc_replace_flower, &id2, &flower);
>      tc_del_flower_filter(&id1);
>  
>      if (error) {
> +        if (error == EAGAIN) {
> +            VLOG_WARN("probe tc: unable to probe for multiple mask "
> +                      "support: %s", ovs_strerror(error));
> +        }
>          goto out;
>      }
>  
> @@ -2642,6 +2677,7 @@ probe_ct_state_support(int ifindex)
>  {
>      struct tc_flower flower;
>      uint16_t ct_state;
> +    uint64_t backoff;
>      struct tcf_id id;
>      int error;
>  
> @@ -2657,8 +2693,13 @@ probe_ct_state_support(int ifindex)
>          goto out;
>      }
>  
> -    error = tc_get_flower(&id, &flower);
> +    NETDEV_OFFLOAD_TC_RETRY_WITH_BACKOFF(backoff, error, error == EAGAIN,
> +                                         tc_get_flower, &id, &flower);
>      if (error || flower.mask.ct_state != ct_state) {
> +        if (error == EAGAIN) {
> +            VLOG_WARN("probe tc: unable to probe ct_state support: %s",
> +                      ovs_strerror(error));
> +        }
>          goto out_del;
>      }
>  
> @@ -2670,10 +2711,16 @@ probe_ct_state_support(int ifindex)
>  
>      /* Test for reject, ct_state >= MAX */
>      ct_state = ~0;
> -    error = probe_insert_ct_state_rule(ifindex, ct_state, &id);
> +    NETDEV_OFFLOAD_TC_RETRY_WITH_BACKOFF(backoff, error, error == EAGAIN,
> +                                         probe_insert_ct_state_rule, ifindex,
> +                                         ct_state, &id);
>      if (!error) {
>          /* No reject, can't continue probing other flags */
>          goto out_del;
> +    } else if (error == EAGAIN) {
> +        VLOG_WARN("probe tc: unable to probe ct_state support: %s",
> +                  ovs_strerror(error));
> +        goto out_del;
>      }
>  
>      tc_del_flower_filter(&id);
> @@ -2682,8 +2729,14 @@ probe_ct_state_support(int ifindex)
>      memset(&flower, 0, sizeof flower);
>      ct_state = TCA_FLOWER_KEY_CT_FLAGS_TRACKED |
>                 TCA_FLOWER_KEY_CT_FLAGS_INVALID;
> -    error = probe_insert_ct_state_rule(ifindex, ct_state, &id);
> +    NETDEV_OFFLOAD_TC_RETRY_WITH_BACKOFF(backoff, error, error == EAGAIN,
> +                                         probe_insert_ct_state_rule, ifindex,
> +                                         ct_state, &id);
>      if (error) {
> +        if (error == EAGAIN) {
> +            VLOG_WARN("probe tc: unable to probe ct_state support: %s",
> +                      ovs_strerror(error));
> +        }
>          goto out;
>      }
>  
> @@ -2695,8 +2748,14 @@ probe_ct_state_support(int ifindex)
>      ct_state = TCA_FLOWER_KEY_CT_FLAGS_TRACKED |
>                 TCA_FLOWER_KEY_CT_FLAGS_ESTABLISHED |
>                 TCA_FLOWER_KEY_CT_FLAGS_REPLY;
> -    error = probe_insert_ct_state_rule(ifindex, ct_state, &id);
> +    NETDEV_OFFLOAD_TC_RETRY_WITH_BACKOFF(backoff, error, error == EAGAIN,
> +                                         probe_insert_ct_state_rule, ifindex,
> +                                         ct_state, &id);
>      if (error) {
> +        if (error == EAGAIN) {
> +            VLOG_WARN("probe tc: unable to probe ct_state support: %s",
> +                      ovs_strerror(error));
> +        }
>          goto out;
>      }
>  
> @@ -2714,6 +2773,7 @@ probe_tc_block_support(int ifindex)
>  {
>      struct tc_flower flower;
>      uint32_t block_id = 1;
> +    uint64_t backoff;
>      struct tcf_id id;
>      int prio = 0;
>      int error;
> @@ -2732,13 +2792,17 @@ probe_tc_block_support(int ifindex)
>      memset(&flower.mask.dst_mac, 0xff, sizeof flower.mask.dst_mac);
>  
>      id = tc_make_tcf_id(ifindex, block_id, prio, TC_INGRESS);
> -    error = tc_replace_flower(&id, &flower);
> +    NETDEV_OFFLOAD_TC_RETRY_WITH_BACKOFF(backoff, error, error == EAGAIN,
> +                                         tc_replace_flower, &id, &flower);
>  
>      tc_add_del_qdisc(ifindex, false, block_id, TC_INGRESS);
>  
>      if (!error) {
>          block_support = true;
>          VLOG_INFO("probe tc: block offload is supported.");
> +    } else if (error == EAGAIN) {
> +        VLOG_WARN("probe tc: unable to probe block offload support: %s",
> +                  ovs_strerror(error));
>      }
>  }
>  
> diff --git a/lib/netlink-socket.c b/lib/netlink-socket.c
> index 80da20d9f..dea060fc3 100644
> --- a/lib/netlink-socket.c
> +++ b/lib/netlink-socket.c
> @@ -1798,6 +1798,11 @@ nl_pool_release(struct nl_sock *sock)
>   *
>   *      2. Resending the request causes it to be re-executed, so the request
>   *         needs to be idempotent.
> + *
> + *      3. In the event that the kernel is too busy to handle the request to
> + *         receive the response (i.e. EAGAIN), this function will still return
> + *         0.  The caller can check for this condition by checking for a zero
> + *         size of the 'replyp' ofpbuf buffer.
>   */
>  int
>  nl_transact(int protocol, const struct ofpbuf *request,
> diff --git a/lib/tc.c b/lib/tc.c
> index 5c32c6f97..7df6a10c5 100644
> --- a/lib/tc.c
> +++ b/lib/tc.c
> @@ -237,11 +237,29 @@ static void request_from_tcf_id(struct tcf_id *id, uint16_t eth_type,
>      }
>  }
>  
> +/* The `tc_transact` function is a wrapper around `nl_transact` with the
> + * addition of:
> + *
> + * 1. the 'request' ofpbuf buffer is freed after `nl_transact` returns,
> + *    regardless of success or failure.
> + *
> + * 2. When a 'replyp' pointer is provided; in the event of the kernel
> + *    being too busy to process the request for the response, a positive
> + *    error return will be provided with the value of EAGAIN.
> + *
> + * Please acquaint yourself with the documentation of the `nl_transact`
> + * function in the netlink-sock module before making use of this function.
> + */
>  int
>  tc_transact(struct ofpbuf *request, struct ofpbuf **replyp)
>  {
>      int error = nl_transact(NETLINK_ROUTE, request, replyp);
>      ofpbuf_uninit(request);
> +
> +    if (!error && replyp && !(*replyp)->size) {
> +        return EAGAIN;
> +    }
> +
>      return error;
>  }
>  

looks fine to me. thanks.

Acked-by: Roi Dayan <roid@nvidia.com>
Simon Horman May 17, 2023, 8:05 p.m. UTC | #2
On Mon, May 15, 2023 at 10:04:48AM +0200, Frode Nordahl wrote:
> The tc module combines the use of the `tc_transact` helper
> function for communication with the in-kernel tc infrastructure
> with assertions on the reply data by `ofpbuf_at_assert` on the
> received data prior to further processing.
> 
> `tc_transact` in turn calls `nl_transact`, which via
> `nl_transact_multiple__` ultimately calls and handles return
> value from `recvmsg`.  On error a check for EAGAIN is performed
> and a consequence of this condition is effectively to provide a
> non-error (0) result and an empty reply buffer.
> 
> Before this change, the `tc_transact` and, consumers of it, were
> unaware of this condition.  The use of assertions on the reply
> buffer can as such lead to a fatal crash of OVS.
> 
> To be fair, the behavior of `nl_transact` when handling an EAGAIN
> return is currently not documented, so this change also adds that
> documentation.
> 
> While fixing the problem, it led me to find potential problems
> with the one-time initialization functions in the netdev-offload-tc
> module.  Make use of the information now available about an EAGAIN
> condition to retry one-time initialization, and resort to logging
> a warning if probing of tc features fails due to temporary
> situations such as resource depletion.
> 
> For the record, the symptom of the crash is this in the log:
> EMER|../include/openvswitch/ofpbuf.h:194: assertion offset + size <= b->size failed in ofpbuf_at_assert()
> 
> And an excerpt of the backtrace looks like this:
> 0x0000561dac1396d1 in ofpbuf_at_assert (b=0x7fb650005d20, b=0x7fb650005d20, offset=16, size=20) at ../include/openvswitch/ofpbuf.h:194
> tc_replace_flower (id=<optimized out>, flower=<optimized out>) at ../lib/tc.c:3223
> 0x0000561dac128155 in netdev_tc_flow_put (netdev=0x561dacf91840, match=<optimized out>, actions=<optimized out>, actions_len=<optimized out>,
> ufid=<optimized out>, info=<optimized out>, stats=<optimized out>) at ../lib/netdev-offload-tc.c:2096
> 0x0000561dac117541 in netdev_flow_put (stats=<optimized out>, info=0x7fb65b7ba780, ufid=<optimized out>, act_len=<optimized out>, actions=<optimized out>,
> match=0x7fb65b7ba980, netdev=0x561dacf91840) at ../lib/netdev-offload.c:257
> parse_flow_put (put=0x7fb65b7bcc50, dpif=0x561dad0ad550) at ../lib/dpif-netlink.c:2297
> try_send_to_netdev (op=0x7fb65b7bcc48, dpif=0x561dad0ad550) at ../lib/dpif-netlink.c:2384
> 
> Reported-At: https://launchpad.net/bugs/2018500
> Fixes f98e418fbdb6 (tc: Add tc flower functions)
> Fixes 407556ac6c90 (netlink-socket: Avoid forcing a reply for final message in a transaction.)
> Signed-off-by: Frode Nordahl <frode.nordahl@canonical.com>

Reviewed-by: Simon Horman <simon.horman@corigine.com>
Eelco Chaudron May 19, 2023, 9:55 a.m. UTC | #3
On 15 May 2023, at 10:04, Frode Nordahl wrote:

> The tc module combines the use of the `tc_transact` helper
> function for communication with the in-kernel tc infrastructure
> with assertions on the reply data by `ofpbuf_at_assert` on the
> received data prior to further processing.
>
> `tc_transact` in turn calls `nl_transact`, which via
> `nl_transact_multiple__` ultimately calls and handles return
> value from `recvmsg`.  On error a check for EAGAIN is performed
> and a consequence of this condition is effectively to provide a
> non-error (0) result and an empty reply buffer.
>
> Before this change, the `tc_transact` and, consumers of it, were
> unaware of this condition.  The use of assertions on the reply
> buffer can as such lead to a fatal crash of OVS.
>
> To be fair, the behavior of `nl_transact` when handling an EAGAIN
> return is currently not documented, so this change also adds that
> documentation.
>
> While fixing the problem, it led me to find potential problems
> with the one-time initialization functions in the netdev-offload-tc
> module.  Make use of the information now available about an EAGAIN
> condition to retry one-time initialization, and resort to logging
> a warning if probing of tc features fails due to temporary
> situations such as resource depletion.
>
> For the record, the symptom of the crash is this in the log:
> EMER|../include/openvswitch/ofpbuf.h:194: assertion offset + size <= b->size failed in ofpbuf_at_assert()
>
> And an excerpt of the backtrace looks like this:
> 0x0000561dac1396d1 in ofpbuf_at_assert (b=0x7fb650005d20, b=0x7fb650005d20, offset=16, size=20) at ../include/openvswitch/ofpbuf.h:194
> tc_replace_flower (id=<optimized out>, flower=<optimized out>) at ../lib/tc.c:3223
> 0x0000561dac128155 in netdev_tc_flow_put (netdev=0x561dacf91840, match=<optimized out>, actions=<optimized out>, actions_len=<optimized out>,
> ufid=<optimized out>, info=<optimized out>, stats=<optimized out>) at ../lib/netdev-offload-tc.c:2096
> 0x0000561dac117541 in netdev_flow_put (stats=<optimized out>, info=0x7fb65b7ba780, ufid=<optimized out>, act_len=<optimized out>, actions=<optimized out>,
> match=0x7fb65b7ba980, netdev=0x561dacf91840) at ../lib/netdev-offload.c:257
> parse_flow_put (put=0x7fb65b7bcc50, dpif=0x561dad0ad550) at ../lib/dpif-netlink.c:2297
> try_send_to_netdev (op=0x7fb65b7bcc48, dpif=0x561dad0ad550) at ../lib/dpif-netlink.c:2384

Thanks for this patch. See some comments inline below.

Cheers,

Eelco

> Reported-At: https://launchpad.net/bugs/2018500
> Fixes f98e418fbdb6 (tc: Add tc flower functions)
> Fixes 407556ac6c90 (netlink-socket: Avoid forcing a reply for final message in a transaction.)
> Signed-off-by: Frode Nordahl <frode.nordahl@canonical.com>
> ---
>  lib/dpif-netlink.c      |  4 +-
>  lib/dpif.c              |  2 +-
>  lib/netdev-linux.c      | 34 +++++++++-------
>  lib/netdev-offload-tc.c | 86 +++++++++++++++++++++++++++++++++++------
>  lib/netlink-socket.c    |  5 +++
>  lib/tc.c                | 18 +++++++++
>  6 files changed, 122 insertions(+), 27 deletions(-)
>
> diff --git a/lib/dpif-netlink.c b/lib/dpif-netlink.c
> index 60bd39643..75537b2f8 100644
> --- a/lib/dpif-netlink.c
> +++ b/lib/dpif-netlink.c
> @@ -2340,7 +2340,9 @@ parse_flow_put(struct dpif_netlink *dpif, struct dpif_flow_put *put)
>              }
>              netdev_set_hw_info(oor_netdev, HW_INFO_TYPE_OOR, true);
>          }
> -        level = (err == ENOSPC || err == EOPNOTSUPP) ? VLL_DBG : VLL_ERR;
> +        level = (err == EAGAIN ||
> +                 err == ENOSPC ||
> +                 err == EOPNOTSUPP) ? VLL_DBG : VLL_ERR;
>          VLOG_RL(&rl, level, "failed to offload flow: %s: %s",
>                  ovs_strerror(err),
>                  (oor_netdev ? oor_netdev->name : dev->name));

Do we want a debug for EAGAIN here? I think it could as well be an error, as we failed to offload a flow due to a busy system?

> diff --git a/lib/dpif.c b/lib/dpif.c
> index 3305401fe..2bbb0aac4 100644
> --- a/lib/dpif.c
> +++ b/lib/dpif.c
> @@ -1751,7 +1751,7 @@ flow_message_log_level(int error)
>       * Kernels that support flow wildcarding will reject these flows as
>       * duplicates (EEXIST), so lower the log level to debug for these
>       * types of messages. */
> -    return (error && error != EEXIST) ? VLL_WARN : VLL_DBG;
> +    return (error && error != EEXIST && error != EAGAIN) ? VLL_WARN : VLL_DBG;

Just a bit, but as the above comment explicitly	mentions why EEXIST should be DBG level, we might want to add a word on why EAGAIN should also be.

In addition same as above, i.e., do we want a debug for EAGAIN here?

>  }
>
>  static bool
> diff --git a/lib/netdev-linux.c b/lib/netdev-linux.c
> index 36620199e..a0294ca00 100644
> --- a/lib/netdev-linux.c
> +++ b/lib/netdev-linux.c
> @@ -5817,8 +5817,9 @@ tc_get_policer_action(uint32_t index, struct ofputil_meter_stats *stats)
>
>      error = tc_transact(&request, &replyp);
>      if (error) {
> -        VLOG_ERR_RL(&rl, "Failed to dump police action (index: %u), err=%d",
> -                    index, error);
> +        VLOG_RL(&rl, error == EAGAIN ? VLL_DBG : VLL_ERR,
> +                "Failed to dump police action (index: %u), err=%d",
> +                index, error);
>          return error;
>      }
>
> @@ -5849,8 +5850,9 @@ tc_del_policer_action(uint32_t index, struct ofputil_meter_stats *stats)
>
>      error = tc_transact(&request, &replyp);
>      if (error) {
> -        VLOG_ERR_RL(&rl, "Failed to delete police action (index: %u), err=%d",
> -                    index, error);
> +        VLOG_RL(&rl, error == EAGAIN ? VLL_DBG : VLL_ERR,
> +                "Failed to delete police action (index: %u), err=%d",
> +                index, error);
>          return error;
>      }

The two above will still result in a warning by upper-level callers, so dbg is fine.

> @@ -6114,11 +6116,12 @@ tc_query_class(const struct netdev *netdev,
>
>      error = tc_transact(&request, replyp);
>      if (error) {
> -        VLOG_WARN_RL(&rl, "query %s class %u:%u (parent %u:%u) failed (%s)",
> -                     netdev_get_name(netdev),
> -                     tc_get_major(handle), tc_get_minor(handle),
> -                     tc_get_major(parent), tc_get_minor(parent),
> -                     ovs_strerror(error));
> +        VLOG_RL(&rl, error == EAGAIN ? VLL_DBG : VLL_WARN,
> +                "query %s class %u:%u (parent %u:%u) failed (%s)",
> +                netdev_get_name(netdev),
> +                tc_get_major(handle), tc_get_minor(handle),
> +                tc_get_major(parent), tc_get_minor(parent),
> +                ovs_strerror(error));
>      }
>      return error;
>  }
> @@ -6140,10 +6143,11 @@ tc_delete_class(const struct netdev *netdev, unsigned int handle)
>
>      error = tc_transact(&request, NULL);
>      if (error) {
> -        VLOG_WARN_RL(&rl, "delete %s class %u:%u failed (%s)",
> -                     netdev_get_name(netdev),
> -                     tc_get_major(handle), tc_get_minor(handle),
> -                     ovs_strerror(error));
> +        VLOG_RL(&rl, error == EAGAIN ? VLL_DBG : VLL_WARN,
> +                "delete %s class %u:%u failed (%s)",
> +                netdev_get_name(netdev),
> +                tc_get_major(handle), tc_get_minor(handle),
> +                ovs_strerror(error));
>      }
>      return error;
>  }
> @@ -6262,7 +6266,9 @@ tc_query_qdisc(const struct netdev *netdev_)
>                  ops = &tc_ops_other;
>              }
>          }
> -    } else if ((!error && !qdisc->size) || error == ENOENT) {
> +    } else if ((!error && !qdisc->size) ||
> +               error == ENOENT || error == EAGAIN)

Is the (!error && !qdisc->size) not the same as EAGAIN is now? If so, we could remove the (!error && !qdisc->size).
Same for a couple of lines above where we do ‘if (!error && qdisc->size)’ which can now be just !error.

> +    {
>          /* Either it's a built-in qdisc, or (on Linux pre-2.6.35) it's a qdisc
>           * set up by some other entity that doesn't have a handle 1:0.  We will
>           * assume that it's the system default qdisc. */
> diff --git a/lib/netdev-offload-tc.c b/lib/netdev-offload-tc.c
> index 4f26dd8cc..4fb9dc239 100644
> --- a/lib/netdev-offload-tc.c
> +++ b/lib/netdev-offload-tc.c
> @@ -2505,10 +2505,10 @@ netdev_tc_flow_get(struct netdev *netdev,
>
>      err = tc_get_flower(&id, &flower);
>      if (err) {
> -        VLOG_ERR_RL(&error_rl,
> -                    "flow get failed (dev %s prio %d handle %d): %s",
> -                    netdev_get_name(netdev), id.prio, id.handle,
> -                    ovs_strerror(err));
> +        VLOG_RL(&error_rl, err == EAGAIN ? VLL_DBG : VLL_ERR,

Do we want a debug for EAGAIN here? I think it could as well be an error/warning as the system was busy?

> +                "flow get failed (dev %s prio %d handle %d): %s",
> +                netdev_get_name(netdev), id.prio, id.handle,
> +                ovs_strerror(err));
>          return err;
>      }
>
> @@ -2571,11 +2571,36 @@ netdev_tc_get_n_flows(struct netdev *netdev, uint64_t *n_flows)
>      return 0;
>  }
>
> +/* This macro is for use by one-time initialization functions, where we have
> + * one shot per thread/process to perform a pertinent initialization task that
> + * may return a temporary error (EAGAIN).
> + *
> + * With min/max values of 1/64 we would retry 7 times, spending at the
> + * most 127 * 1E6 nsec (0.127s) sleeping.
> + */
> +#define NETDEV_OFFLOAD_TC_BACKOFF_MIN 1
> +#define NETDEV_OFFLOAD_TC_BACKOFF_MAX 64
> +#define NETDEV_OFFLOAD_TC_RETRY_WITH_BACKOFF(BACKOFF, ERROR, CONDITION,       \

I think it would be nice to define the backoff variable inside the macro as a local variable, as we do not need it outside the macro.

> +                                             FUNCTION, ...)                   \
> +    BACKOFF = NETDEV_OFFLOAD_TC_BACKOFF_MIN;                                  \
> +    do {                                                                      \
> +        ERROR = (FUNCTION)(__VA_ARGS__);                                      \
> +        if (CONDITION) {                                                      \
> +            xnanosleep(BACKOFF * 1E6);                                        \
> +            if (BACKOFF < NETDEV_OFFLOAD_TC_BACKOFF_MAX) {                    \
> +                BACKOFF <<= 1;                                                \
> +            } else {                                                          \
> +                break;                                                        \
> +            }                                                                 \
> +        }                                                                     \
> +    } while (CONDITION);
> +
>  static void
>  probe_multi_mask_per_prio(int ifindex)
>  {
>      struct tc_flower flower;
>      struct tcf_id id1, id2;
> +    uint64_t backoff;
>      int block_id = 0;
>      int prio = 1;
>      int error;
> @@ -2594,8 +2619,13 @@ probe_multi_mask_per_prio(int ifindex)
>      memset(&flower.mask.dst_mac, 0xff, sizeof flower.mask.dst_mac);
>
>      id1 = tc_make_tcf_id(ifindex, block_id, prio, TC_INGRESS);
> -    error = tc_replace_flower(&id1, &flower);
> +    NETDEV_OFFLOAD_TC_RETRY_WITH_BACKOFF(backoff, error, error == EAGAIN,
> +                                         tc_replace_flower, &id1, &flower);
>      if (error) {
> +        if (error == EAGAIN) {
> +            VLOG_WARN("probe tc: unable to probe for multiple mask "
> +                      "support: %s", ovs_strerror(error));
> +        }
>          goto out;
>      }
>
> @@ -2603,10 +2633,15 @@ probe_multi_mask_per_prio(int ifindex)
>      memset(&flower.mask.src_mac, 0xff, sizeof flower.mask.src_mac);
>
>      id2 = tc_make_tcf_id(ifindex, block_id, prio, TC_INGRESS);
> -    error = tc_replace_flower(&id2, &flower);
> +    NETDEV_OFFLOAD_TC_RETRY_WITH_BACKOFF(backoff, error, error == EAGAIN,
> +                                         tc_replace_flower, &id2, &flower);
>      tc_del_flower_filter(&id1);
>
>      if (error) {
> +        if (error == EAGAIN) {
> +            VLOG_WARN("probe tc: unable to probe for multiple mask "
> +                      "support: %s", ovs_strerror(error));
> +        }
>          goto out;
>      }
>
> @@ -2642,6 +2677,7 @@ probe_ct_state_support(int ifindex)
>  {
>      struct tc_flower flower;
>      uint16_t ct_state;
> +    uint64_t backoff;
>      struct tcf_id id;
>      int error;
>
> @@ -2657,8 +2693,13 @@ probe_ct_state_support(int ifindex)
>          goto out;
>      }
>
> -    error = tc_get_flower(&id, &flower);
> +    NETDEV_OFFLOAD_TC_RETRY_WITH_BACKOFF(backoff, error, error == EAGAIN,
> +                                         tc_get_flower, &id, &flower);
>      if (error || flower.mask.ct_state != ct_state) {
> +        if (error == EAGAIN) {
> +            VLOG_WARN("probe tc: unable to probe ct_state support: %s",
> +                      ovs_strerror(error));
> +        }
>          goto out_del;
>      }
>
> @@ -2670,10 +2711,16 @@ probe_ct_state_support(int ifindex)
>
>      /* Test for reject, ct_state >= MAX */
>      ct_state = ~0;
> -    error = probe_insert_ct_state_rule(ifindex, ct_state, &id);
> +    NETDEV_OFFLOAD_TC_RETRY_WITH_BACKOFF(backoff, error, error == EAGAIN,
> +                                         probe_insert_ct_state_rule, ifindex,
> +                                         ct_state, &id);
>      if (!error) {
>          /* No reject, can't continue probing other flags */
>          goto out_del;
> +    } else if (error == EAGAIN) {
> +        VLOG_WARN("probe tc: unable to probe ct_state support: %s",
> +                  ovs_strerror(error));
> +        goto out_del;
>      }
>
>      tc_del_flower_filter(&id);
> @@ -2682,8 +2729,14 @@ probe_ct_state_support(int ifindex)
>      memset(&flower, 0, sizeof flower);
>      ct_state = TCA_FLOWER_KEY_CT_FLAGS_TRACKED |
>                 TCA_FLOWER_KEY_CT_FLAGS_INVALID;
> -    error = probe_insert_ct_state_rule(ifindex, ct_state, &id);
> +    NETDEV_OFFLOAD_TC_RETRY_WITH_BACKOFF(backoff, error, error == EAGAIN,
> +                                         probe_insert_ct_state_rule, ifindex,
> +                                         ct_state, &id);
>      if (error) {
> +        if (error == EAGAIN) {
> +            VLOG_WARN("probe tc: unable to probe ct_state support: %s",
> +                      ovs_strerror(error));
> +        }
>          goto out;
>      }
>
> @@ -2695,8 +2748,14 @@ probe_ct_state_support(int ifindex)
>      ct_state = TCA_FLOWER_KEY_CT_FLAGS_TRACKED |
>                 TCA_FLOWER_KEY_CT_FLAGS_ESTABLISHED |
>                 TCA_FLOWER_KEY_CT_FLAGS_REPLY;
> -    error = probe_insert_ct_state_rule(ifindex, ct_state, &id);
> +    NETDEV_OFFLOAD_TC_RETRY_WITH_BACKOFF(backoff, error, error == EAGAIN,
> +                                         probe_insert_ct_state_rule, ifindex,
> +                                         ct_state, &id);
>      if (error) {
> +        if (error == EAGAIN) {
> +            VLOG_WARN("probe tc: unable to probe ct_state support: %s",
> +                      ovs_strerror(error));
> +        }
>          goto out;
>      }
>
> @@ -2714,6 +2773,7 @@ probe_tc_block_support(int ifindex)
>  {
>      struct tc_flower flower;
>      uint32_t block_id = 1;
> +    uint64_t backoff;
>      struct tcf_id id;
>      int prio = 0;
>      int error;
> @@ -2732,13 +2792,17 @@ probe_tc_block_support(int ifindex)
>      memset(&flower.mask.dst_mac, 0xff, sizeof flower.mask.dst_mac);
>
>      id = tc_make_tcf_id(ifindex, block_id, prio, TC_INGRESS);
> -    error = tc_replace_flower(&id, &flower);
> +    NETDEV_OFFLOAD_TC_RETRY_WITH_BACKOFF(backoff, error, error == EAGAIN,
> +                                         tc_replace_flower, &id, &flower);
>
>      tc_add_del_qdisc(ifindex, false, block_id, TC_INGRESS);
>
>      if (!error) {
>          block_support = true;
>          VLOG_INFO("probe tc: block offload is supported.");
> +    } else if (error == EAGAIN) {
> +        VLOG_WARN("probe tc: unable to probe block offload support: %s",
> +                  ovs_strerror(error));
>      }
>  }
>
> diff --git a/lib/netlink-socket.c b/lib/netlink-socket.c
> index 80da20d9f..dea060fc3 100644
> --- a/lib/netlink-socket.c
> +++ b/lib/netlink-socket.c
> @@ -1798,6 +1798,11 @@ nl_pool_release(struct nl_sock *sock)
>   *
>   *      2. Resending the request causes it to be re-executed, so the request
>   *         needs to be idempotent.
> + *
> + *      3. In the event that the kernel is too busy to handle the request to
> + *         receive the response (i.e. EAGAIN), this function will still return
> + *         0.  The caller can check for this condition by checking for a zero
> + *         size of the 'replyp' ofpbuf buffer.
>   */
>  int
>  nl_transact(int protocol, const struct ofpbuf *request,
> diff --git a/lib/tc.c b/lib/tc.c
> index 5c32c6f97..7df6a10c5 100644
> --- a/lib/tc.c
> +++ b/lib/tc.c
> @@ -237,11 +237,29 @@ static void request_from_tcf_id(struct tcf_id *id, uint16_t eth_type,
>      }
>  }
>
> +/* The `tc_transact` function is a wrapper around `nl_transact` with the
> + * addition of:
> + *
> + * 1. the 'request' ofpbuf buffer is freed after `nl_transact` returns,
> + *    regardless of success or failure.
> + *
> + * 2. When a 'replyp' pointer is provided; in the event of the kernel
> + *    being too busy to process the request for the response, a positive
> + *    error return will be provided with the value of EAGAIN.
> + *
> + * Please acquaint yourself with the documentation of the `nl_transact`
> + * function in the netlink-sock module before making use of this function.
> + */
>  int
>  tc_transact(struct ofpbuf *request, struct ofpbuf **replyp)
>  {
>      int error = nl_transact(NETLINK_ROUTE, request, replyp);
>      ofpbuf_uninit(request);
> +
> +    if (!error && replyp && !(*replyp)->size) {

I think it would be really nice to add a coverage counter for this, so we know it happened as most of it is hidden behind a debug message.

> +        return EAGAIN;
> +    }
> +
>      return error;
>  }
>
> -- 
> 2.39.2
>
> _______________________________________________
> dev mailing list
> dev@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
Frode Nordahl May 19, 2023, 10:31 a.m. UTC | #4
On Fri, May 19, 2023 at 11:55 AM Eelco Chaudron <echaudro@redhat.com> wrote:
>
>
>
> On 15 May 2023, at 10:04, Frode Nordahl wrote:
>
> > The tc module combines the use of the `tc_transact` helper
> > function for communication with the in-kernel tc infrastructure
> > with assertions on the reply data by `ofpbuf_at_assert` on the
> > received data prior to further processing.
> >
> > `tc_transact` in turn calls `nl_transact`, which via
> > `nl_transact_multiple__` ultimately calls and handles return
> > value from `recvmsg`.  On error a check for EAGAIN is performed
> > and a consequence of this condition is effectively to provide a
> > non-error (0) result and an empty reply buffer.
> >
> > Before this change, the `tc_transact` and, consumers of it, were
> > unaware of this condition.  The use of assertions on the reply
> > buffer can as such lead to a fatal crash of OVS.
> >
> > To be fair, the behavior of `nl_transact` when handling an EAGAIN
> > return is currently not documented, so this change also adds that
> > documentation.
> >
> > While fixing the problem, it led me to find potential problems
> > with the one-time initialization functions in the netdev-offload-tc
> > module.  Make use of the information now available about an EAGAIN
> > condition to retry one-time initialization, and resort to logging
> > a warning if probing of tc features fails due to temporary
> > situations such as resource depletion.
> >
> > For the record, the symptom of the crash is this in the log:
> > EMER|../include/openvswitch/ofpbuf.h:194: assertion offset + size <= b->size failed in ofpbuf_at_assert()
> >
> > And an excerpt of the backtrace looks like this:
> > 0x0000561dac1396d1 in ofpbuf_at_assert (b=0x7fb650005d20, b=0x7fb650005d20, offset=16, size=20) at ../include/openvswitch/ofpbuf.h:194
> > tc_replace_flower (id=<optimized out>, flower=<optimized out>) at ../lib/tc.c:3223
> > 0x0000561dac128155 in netdev_tc_flow_put (netdev=0x561dacf91840, match=<optimized out>, actions=<optimized out>, actions_len=<optimized out>,
> > ufid=<optimized out>, info=<optimized out>, stats=<optimized out>) at ../lib/netdev-offload-tc.c:2096
> > 0x0000561dac117541 in netdev_flow_put (stats=<optimized out>, info=0x7fb65b7ba780, ufid=<optimized out>, act_len=<optimized out>, actions=<optimized out>,
> > match=0x7fb65b7ba980, netdev=0x561dacf91840) at ../lib/netdev-offload.c:257
> > parse_flow_put (put=0x7fb65b7bcc50, dpif=0x561dad0ad550) at ../lib/dpif-netlink.c:2297
> > try_send_to_netdev (op=0x7fb65b7bcc48, dpif=0x561dad0ad550) at ../lib/dpif-netlink.c:2384
>
> Thanks for this patch. See some comments inline below.

Hello, Eelco,

Thank you for taking the time to review, much appreciated! I'll
respond inline as well.

> Cheers,
>
> Eelco
>
> > Reported-At: https://launchpad.net/bugs/2018500
> > Fixes f98e418fbdb6 (tc: Add tc flower functions)
> > Fixes 407556ac6c90 (netlink-socket: Avoid forcing a reply for final message in a transaction.)

I see now that I've forgotten a `:` after `Fixes` which makes
patchwork not account for
them, I'll fix that in the next iteration

> > Signed-off-by: Frode Nordahl <frode.nordahl@canonical.com>
> > ---
> >  lib/dpif-netlink.c      |  4 +-
> >  lib/dpif.c              |  2 +-
> >  lib/netdev-linux.c      | 34 +++++++++-------
> >  lib/netdev-offload-tc.c | 86 +++++++++++++++++++++++++++++++++++------
> >  lib/netlink-socket.c    |  5 +++
> >  lib/tc.c                | 18 +++++++++
> >  6 files changed, 122 insertions(+), 27 deletions(-)
> >
> > diff --git a/lib/dpif-netlink.c b/lib/dpif-netlink.c
> > index 60bd39643..75537b2f8 100644
> > --- a/lib/dpif-netlink.c
> > +++ b/lib/dpif-netlink.c
> > @@ -2340,7 +2340,9 @@ parse_flow_put(struct dpif_netlink *dpif, struct dpif_flow_put *put)
> >              }
> >              netdev_set_hw_info(oor_netdev, HW_INFO_TYPE_OOR, true);
> >          }
> > -        level = (err == ENOSPC || err == EOPNOTSUPP) ? VLL_DBG : VLL_ERR;
> > +        level = (err == EAGAIN ||
> > +                 err == ENOSPC ||
> > +                 err == EOPNOTSUPP) ? VLL_DBG : VLL_ERR;
> >          VLOG_RL(&rl, level, "failed to offload flow: %s: %s",
> >                  ovs_strerror(err),
> >                  (oor_netdev ? oor_netdev->name : dev->name));
>
> Do we want a debug for EAGAIN here? I think it could as well be an error, as we failed to offload a flow due to a busy system?

The reasoning is that the flow will be attempted offloaded again, and if the
system is already busy, adding to the load by possibly rampant logging would
not help on the system load?

Do you think this condition is not to be expected with this subsystem?

> > diff --git a/lib/dpif.c b/lib/dpif.c
> > index 3305401fe..2bbb0aac4 100644
> > --- a/lib/dpif.c
> > +++ b/lib/dpif.c
> > @@ -1751,7 +1751,7 @@ flow_message_log_level(int error)
> >       * Kernels that support flow wildcarding will reject these flows as
> >       * duplicates (EEXIST), so lower the log level to debug for these
> >       * types of messages. */
> > -    return (error && error != EEXIST) ? VLL_WARN : VLL_DBG;
> > +    return (error && error != EEXIST && error != EAGAIN) ? VLL_WARN : VLL_DBG;
>
> Just a bit, but as the above comment explicitly mentions why EEXIST should be DBG level, we might want to add a word on why EAGAIN should also be.

Same reasoning for suppressing EAGAIN as above, and it surely deserves a
comment, thank you for pointing it out!

> In addition same as above, i.e., do we want a debug for EAGAIN here?

The reason for adding it here is that as we discovered the `netlink-socket`
module is currently suppressing these in a rather surprising manner, but there
was probably a reason for the suppression.

Now that we are exposing a new temporary error condition in the `tc` module,
many of the call paths end up in higher level dpif callbacks.  I'm approaching
this with a bit of conservative safety in mind, squelching the logging down to
DBG level to ensure that we don't cause any issues for existing systems or
tests.

> >  }
> >
> >  static bool
> > diff --git a/lib/netdev-linux.c b/lib/netdev-linux.c
> > index 36620199e..a0294ca00 100644
> > --- a/lib/netdev-linux.c
> > +++ b/lib/netdev-linux.c
> > @@ -5817,8 +5817,9 @@ tc_get_policer_action(uint32_t index, struct ofputil_meter_stats *stats)
> >
> >      error = tc_transact(&request, &replyp);
> >      if (error) {
> > -        VLOG_ERR_RL(&rl, "Failed to dump police action (index: %u), err=%d",
> > -                    index, error);
> > +        VLOG_RL(&rl, error == EAGAIN ? VLL_DBG : VLL_ERR,
> > +                "Failed to dump police action (index: %u), err=%d",
> > +                index, error);
> >          return error;
> >      }
> >
> > @@ -5849,8 +5850,9 @@ tc_del_policer_action(uint32_t index, struct ofputil_meter_stats *stats)
> >
> >      error = tc_transact(&request, &replyp);
> >      if (error) {
> > -        VLOG_ERR_RL(&rl, "Failed to delete police action (index: %u), err=%d",
> > -                    index, error);
> > +        VLOG_RL(&rl, error == EAGAIN ? VLL_DBG : VLL_ERR,
> > +                "Failed to delete police action (index: %u), err=%d",
> > +                index, error);
> >          return error;
> >      }
>
> The two above will still result in a warning by upper-level callers, so dbg is fine.
>
> > @@ -6114,11 +6116,12 @@ tc_query_class(const struct netdev *netdev,
> >
> >      error = tc_transact(&request, replyp);
> >      if (error) {
> > -        VLOG_WARN_RL(&rl, "query %s class %u:%u (parent %u:%u) failed (%s)",
> > -                     netdev_get_name(netdev),
> > -                     tc_get_major(handle), tc_get_minor(handle),
> > -                     tc_get_major(parent), tc_get_minor(parent),
> > -                     ovs_strerror(error));
> > +        VLOG_RL(&rl, error == EAGAIN ? VLL_DBG : VLL_WARN,
> > +                "query %s class %u:%u (parent %u:%u) failed (%s)",
> > +                netdev_get_name(netdev),
> > +                tc_get_major(handle), tc_get_minor(handle),
> > +                tc_get_major(parent), tc_get_minor(parent),
> > +                ovs_strerror(error));
> >      }
> >      return error;
> >  }
> > @@ -6140,10 +6143,11 @@ tc_delete_class(const struct netdev *netdev, unsigned int handle)
> >
> >      error = tc_transact(&request, NULL);
> >      if (error) {
> > -        VLOG_WARN_RL(&rl, "delete %s class %u:%u failed (%s)",
> > -                     netdev_get_name(netdev),
> > -                     tc_get_major(handle), tc_get_minor(handle),
> > -                     ovs_strerror(error));
> > +        VLOG_RL(&rl, error == EAGAIN ? VLL_DBG : VLL_WARN,
> > +                "delete %s class %u:%u failed (%s)",
> > +                netdev_get_name(netdev),
> > +                tc_get_major(handle), tc_get_minor(handle),
> > +                ovs_strerror(error));
> >      }
> >      return error;
> >  }
> > @@ -6262,7 +6266,9 @@ tc_query_qdisc(const struct netdev *netdev_)
> >                  ops = &tc_ops_other;
> >              }
> >          }
> > -    } else if ((!error && !qdisc->size) || error == ENOENT) {
> > +    } else if ((!error && !qdisc->size) ||
> > +               error == ENOENT || error == EAGAIN)
>
> Is the (!error && !qdisc->size) not the same as EAGAIN is now? If so, we could remove the (!error && !qdisc->size).
> Same for a couple of lines above where we do ‘if (!error && qdisc->size)’ which can now be just !error.

Ack.

> > +    {
> >          /* Either it's a built-in qdisc, or (on Linux pre-2.6.35) it's a qdisc
> >           * set up by some other entity that doesn't have a handle 1:0.  We will
> >           * assume that it's the system default qdisc. */
> > diff --git a/lib/netdev-offload-tc.c b/lib/netdev-offload-tc.c
> > index 4f26dd8cc..4fb9dc239 100644
> > --- a/lib/netdev-offload-tc.c
> > +++ b/lib/netdev-offload-tc.c
> > @@ -2505,10 +2505,10 @@ netdev_tc_flow_get(struct netdev *netdev,
> >
> >      err = tc_get_flower(&id, &flower);
> >      if (err) {
> > -        VLOG_ERR_RL(&error_rl,
> > -                    "flow get failed (dev %s prio %d handle %d): %s",
> > -                    netdev_get_name(netdev), id.prio, id.handle,
> > -                    ovs_strerror(err));
> > +        VLOG_RL(&error_rl, err == EAGAIN ? VLL_DBG : VLL_ERR,
>
> Do we want a debug for EAGAIN here? I think it could as well be an error/warning as the system was busy?

I'd lean towards yes, given short periods of the system being busy and returning
EAGAIN on calls is to be expected, and rampant logging about it when it happens
would not help with the load?

Do you think this condition is not to be expected with this subsystem?

> > +                "flow get failed (dev %s prio %d handle %d): %s",
> > +                netdev_get_name(netdev), id.prio, id.handle,
> > +                ovs_strerror(err));
> >          return err;
> >      }
> >
> > @@ -2571,11 +2571,36 @@ netdev_tc_get_n_flows(struct netdev *netdev, uint64_t *n_flows)
> >      return 0;
> >  }
> >
> > +/* This macro is for use by one-time initialization functions, where we have
> > + * one shot per thread/process to perform a pertinent initialization task that
> > + * may return a temporary error (EAGAIN).
> > + *
> > + * With min/max values of 1/64 we would retry 7 times, spending at the
> > + * most 127 * 1E6 nsec (0.127s) sleeping.
> > + */
> > +#define NETDEV_OFFLOAD_TC_BACKOFF_MIN 1
> > +#define NETDEV_OFFLOAD_TC_BACKOFF_MAX 64
> > +#define NETDEV_OFFLOAD_TC_RETRY_WITH_BACKOFF(BACKOFF, ERROR, CONDITION,       \
>
> I think it would be nice to define the backoff variable inside the macro as a local variable, as we do not need it outside the macro.

Ack, was worried about the risk of unexpectedly redefining/shadowing a variable
already defined by the consumer of the macro, but I guess the compiler will
be helpful in pointing this out.  Will fix it.

> > +                                             FUNCTION, ...)                   \
> > +    BACKOFF = NETDEV_OFFLOAD_TC_BACKOFF_MIN;                                  \
> > +    do {                                                                      \
> > +        ERROR = (FUNCTION)(__VA_ARGS__);                                      \
> > +        if (CONDITION) {                                                      \
> > +            xnanosleep(BACKOFF * 1E6);                                        \
> > +            if (BACKOFF < NETDEV_OFFLOAD_TC_BACKOFF_MAX) {                    \
> > +                BACKOFF <<= 1;                                                \
> > +            } else {                                                          \
> > +                break;                                                        \
> > +            }                                                                 \
> > +        }                                                                     \
> > +    } while (CONDITION);
> > +
> >  static void
> >  probe_multi_mask_per_prio(int ifindex)
> >  {
> >      struct tc_flower flower;
> >      struct tcf_id id1, id2;
> > +    uint64_t backoff;
> >      int block_id = 0;
> >      int prio = 1;
> >      int error;
> > @@ -2594,8 +2619,13 @@ probe_multi_mask_per_prio(int ifindex)
> >      memset(&flower.mask.dst_mac, 0xff, sizeof flower.mask.dst_mac);
> >
> >      id1 = tc_make_tcf_id(ifindex, block_id, prio, TC_INGRESS);
> > -    error = tc_replace_flower(&id1, &flower);
> > +    NETDEV_OFFLOAD_TC_RETRY_WITH_BACKOFF(backoff, error, error == EAGAIN,
> > +                                         tc_replace_flower, &id1, &flower);
> >      if (error) {
> > +        if (error == EAGAIN) {
> > +            VLOG_WARN("probe tc: unable to probe for multiple mask "
> > +                      "support: %s", ovs_strerror(error));
> > +        }
> >          goto out;
> >      }
> >
> > @@ -2603,10 +2633,15 @@ probe_multi_mask_per_prio(int ifindex)
> >      memset(&flower.mask.src_mac, 0xff, sizeof flower.mask.src_mac);
> >
> >      id2 = tc_make_tcf_id(ifindex, block_id, prio, TC_INGRESS);
> > -    error = tc_replace_flower(&id2, &flower);
> > +    NETDEV_OFFLOAD_TC_RETRY_WITH_BACKOFF(backoff, error, error == EAGAIN,
> > +                                         tc_replace_flower, &id2, &flower);
> >      tc_del_flower_filter(&id1);
> >
> >      if (error) {
> > +        if (error == EAGAIN) {
> > +            VLOG_WARN("probe tc: unable to probe for multiple mask "
> > +                      "support: %s", ovs_strerror(error));
> > +        }
> >          goto out;
> >      }
> >
> > @@ -2642,6 +2677,7 @@ probe_ct_state_support(int ifindex)
> >  {
> >      struct tc_flower flower;
> >      uint16_t ct_state;
> > +    uint64_t backoff;
> >      struct tcf_id id;
> >      int error;
> >
> > @@ -2657,8 +2693,13 @@ probe_ct_state_support(int ifindex)
> >          goto out;
> >      }
> >
> > -    error = tc_get_flower(&id, &flower);
> > +    NETDEV_OFFLOAD_TC_RETRY_WITH_BACKOFF(backoff, error, error == EAGAIN,
> > +                                         tc_get_flower, &id, &flower);
> >      if (error || flower.mask.ct_state != ct_state) {
> > +        if (error == EAGAIN) {
> > +            VLOG_WARN("probe tc: unable to probe ct_state support: %s",
> > +                      ovs_strerror(error));
> > +        }
> >          goto out_del;
> >      }
> >
> > @@ -2670,10 +2711,16 @@ probe_ct_state_support(int ifindex)
> >
> >      /* Test for reject, ct_state >= MAX */
> >      ct_state = ~0;
> > -    error = probe_insert_ct_state_rule(ifindex, ct_state, &id);
> > +    NETDEV_OFFLOAD_TC_RETRY_WITH_BACKOFF(backoff, error, error == EAGAIN,
> > +                                         probe_insert_ct_state_rule, ifindex,
> > +                                         ct_state, &id);
> >      if (!error) {
> >          /* No reject, can't continue probing other flags */
> >          goto out_del;
> > +    } else if (error == EAGAIN) {
> > +        VLOG_WARN("probe tc: unable to probe ct_state support: %s",
> > +                  ovs_strerror(error));
> > +        goto out_del;
> >      }
> >
> >      tc_del_flower_filter(&id);
> > @@ -2682,8 +2729,14 @@ probe_ct_state_support(int ifindex)
> >      memset(&flower, 0, sizeof flower);
> >      ct_state = TCA_FLOWER_KEY_CT_FLAGS_TRACKED |
> >                 TCA_FLOWER_KEY_CT_FLAGS_INVALID;
> > -    error = probe_insert_ct_state_rule(ifindex, ct_state, &id);
> > +    NETDEV_OFFLOAD_TC_RETRY_WITH_BACKOFF(backoff, error, error == EAGAIN,
> > +                                         probe_insert_ct_state_rule, ifindex,
> > +                                         ct_state, &id);
> >      if (error) {
> > +        if (error == EAGAIN) {
> > +            VLOG_WARN("probe tc: unable to probe ct_state support: %s",
> > +                      ovs_strerror(error));
> > +        }
> >          goto out;
> >      }
> >
> > @@ -2695,8 +2748,14 @@ probe_ct_state_support(int ifindex)
> >      ct_state = TCA_FLOWER_KEY_CT_FLAGS_TRACKED |
> >                 TCA_FLOWER_KEY_CT_FLAGS_ESTABLISHED |
> >                 TCA_FLOWER_KEY_CT_FLAGS_REPLY;
> > -    error = probe_insert_ct_state_rule(ifindex, ct_state, &id);
> > +    NETDEV_OFFLOAD_TC_RETRY_WITH_BACKOFF(backoff, error, error == EAGAIN,
> > +                                         probe_insert_ct_state_rule, ifindex,
> > +                                         ct_state, &id);
> >      if (error) {
> > +        if (error == EAGAIN) {
> > +            VLOG_WARN("probe tc: unable to probe ct_state support: %s",
> > +                      ovs_strerror(error));
> > +        }
> >          goto out;
> >      }
> >
> > @@ -2714,6 +2773,7 @@ probe_tc_block_support(int ifindex)
> >  {
> >      struct tc_flower flower;
> >      uint32_t block_id = 1;
> > +    uint64_t backoff;
> >      struct tcf_id id;
> >      int prio = 0;
> >      int error;
> > @@ -2732,13 +2792,17 @@ probe_tc_block_support(int ifindex)
> >      memset(&flower.mask.dst_mac, 0xff, sizeof flower.mask.dst_mac);
> >
> >      id = tc_make_tcf_id(ifindex, block_id, prio, TC_INGRESS);
> > -    error = tc_replace_flower(&id, &flower);
> > +    NETDEV_OFFLOAD_TC_RETRY_WITH_BACKOFF(backoff, error, error == EAGAIN,
> > +                                         tc_replace_flower, &id, &flower);
> >
> >      tc_add_del_qdisc(ifindex, false, block_id, TC_INGRESS);
> >
> >      if (!error) {
> >          block_support = true;
> >          VLOG_INFO("probe tc: block offload is supported.");
> > +    } else if (error == EAGAIN) {
> > +        VLOG_WARN("probe tc: unable to probe block offload support: %s",
> > +                  ovs_strerror(error));
> >      }
> >  }
> >
> > diff --git a/lib/netlink-socket.c b/lib/netlink-socket.c
> > index 80da20d9f..dea060fc3 100644
> > --- a/lib/netlink-socket.c
> > +++ b/lib/netlink-socket.c
> > @@ -1798,6 +1798,11 @@ nl_pool_release(struct nl_sock *sock)
> >   *
> >   *      2. Resending the request causes it to be re-executed, so the request
> >   *         needs to be idempotent.
> > + *
> > + *      3. In the event that the kernel is too busy to handle the request to
> > + *         receive the response (i.e. EAGAIN), this function will still return
> > + *         0.  The caller can check for this condition by checking for a zero
> > + *         size of the 'replyp' ofpbuf buffer.
> >   */
> >  int
> >  nl_transact(int protocol, const struct ofpbuf *request,
> > diff --git a/lib/tc.c b/lib/tc.c
> > index 5c32c6f97..7df6a10c5 100644
> > --- a/lib/tc.c
> > +++ b/lib/tc.c
> > @@ -237,11 +237,29 @@ static void request_from_tcf_id(struct tcf_id *id, uint16_t eth_type,
> >      }
> >  }
> >
> > +/* The `tc_transact` function is a wrapper around `nl_transact` with the
> > + * addition of:
> > + *
> > + * 1. the 'request' ofpbuf buffer is freed after `nl_transact` returns,
> > + *    regardless of success or failure.
> > + *
> > + * 2. When a 'replyp' pointer is provided; in the event of the kernel
> > + *    being too busy to process the request for the response, a positive
> > + *    error return will be provided with the value of EAGAIN.
> > + *
> > + * Please acquaint yourself with the documentation of the `nl_transact`
> > + * function in the netlink-sock module before making use of this function.
> > + */
> >  int
> >  tc_transact(struct ofpbuf *request, struct ofpbuf **replyp)
> >  {
> >      int error = nl_transact(NETLINK_ROUTE, request, replyp);
> >      ofpbuf_uninit(request);
> > +
> > +    if (!error && replyp && !(*replyp)->size) {
>
> I think it would be really nice to add a coverage counter for this, so we know it happened as most of it is hidden behind a debug message.

Ack, good point, will add!
Eelco Chaudron May 22, 2023, 9:33 a.m. UTC | #5
On 19 May 2023, at 12:31, Frode Nordahl wrote:

> On Fri, May 19, 2023 at 11:55 AM Eelco Chaudron <echaudro@redhat.com> wrote:
>>
>>
>>
>> On 15 May 2023, at 10:04, Frode Nordahl wrote:
>>
>>> The tc module combines the use of the `tc_transact` helper
>>> function for communication with the in-kernel tc infrastructure
>>> with assertions on the reply data by `ofpbuf_at_assert` on the
>>> received data prior to further processing.
>>>
>>> `tc_transact` in turn calls `nl_transact`, which via
>>> `nl_transact_multiple__` ultimately calls and handles return
>>> value from `recvmsg`.  On error a check for EAGAIN is performed
>>> and a consequence of this condition is effectively to provide a
>>> non-error (0) result and an empty reply buffer.
>>>
>>> Before this change, the `tc_transact` and, consumers of it, were
>>> unaware of this condition.  The use of assertions on the reply
>>> buffer can as such lead to a fatal crash of OVS.
>>>
>>> To be fair, the behavior of `nl_transact` when handling an EAGAIN
>>> return is currently not documented, so this change also adds that
>>> documentation.
>>>
>>> While fixing the problem, it led me to find potential problems
>>> with the one-time initialization functions in the netdev-offload-tc
>>> module.  Make use of the information now available about an EAGAIN
>>> condition to retry one-time initialization, and resort to logging
>>> a warning if probing of tc features fails due to temporary
>>> situations such as resource depletion.
>>>
>>> For the record, the symptom of the crash is this in the log:
>>> EMER|../include/openvswitch/ofpbuf.h:194: assertion offset + size <= b->size failed in ofpbuf_at_assert()
>>>
>>> And an excerpt of the backtrace looks like this:
>>> 0x0000561dac1396d1 in ofpbuf_at_assert (b=0x7fb650005d20, b=0x7fb650005d20, offset=16, size=20) at ../include/openvswitch/ofpbuf.h:194
>>> tc_replace_flower (id=<optimized out>, flower=<optimized out>) at ../lib/tc.c:3223
>>> 0x0000561dac128155 in netdev_tc_flow_put (netdev=0x561dacf91840, match=<optimized out>, actions=<optimized out>, actions_len=<optimized out>,
>>> ufid=<optimized out>, info=<optimized out>, stats=<optimized out>) at ../lib/netdev-offload-tc.c:2096
>>> 0x0000561dac117541 in netdev_flow_put (stats=<optimized out>, info=0x7fb65b7ba780, ufid=<optimized out>, act_len=<optimized out>, actions=<optimized out>,
>>> match=0x7fb65b7ba980, netdev=0x561dacf91840) at ../lib/netdev-offload.c:257
>>> parse_flow_put (put=0x7fb65b7bcc50, dpif=0x561dad0ad550) at ../lib/dpif-netlink.c:2297
>>> try_send_to_netdev (op=0x7fb65b7bcc48, dpif=0x561dad0ad550) at ../lib/dpif-netlink.c:2384
>>
>> Thanks for this patch. See some comments inline below.
>
> Hello, Eelco,
>
> Thank you for taking the time to review, much appreciated! I'll
> respond inline as well.

See my responses inline again ;) I’m wondering if you know why we would fail with EAGAIN on a nonblocking socket? Is it TC or NetLink-specific?

>> Cheers,
>>
>> Eelco
>>
>>> Reported-At: https://launchpad.net/bugs/2018500
>>> Fixes f98e418fbdb6 (tc: Add tc flower functions)
>>> Fixes 407556ac6c90 (netlink-socket: Avoid forcing a reply for final message in a transaction.)
>
> I see now that I've forgotten a `:` after `Fixes` which makes
> patchwork not account for
> them, I'll fix that in the next iteration
>
>>> Signed-off-by: Frode Nordahl <frode.nordahl@canonical.com>
>>> ---
>>>  lib/dpif-netlink.c      |  4 +-
>>>  lib/dpif.c              |  2 +-
>>>  lib/netdev-linux.c      | 34 +++++++++-------
>>>  lib/netdev-offload-tc.c | 86 +++++++++++++++++++++++++++++++++++------
>>>  lib/netlink-socket.c    |  5 +++
>>>  lib/tc.c                | 18 +++++++++
>>>  6 files changed, 122 insertions(+), 27 deletions(-)
>>>
>>> diff --git a/lib/dpif-netlink.c b/lib/dpif-netlink.c
>>> index 60bd39643..75537b2f8 100644
>>> --- a/lib/dpif-netlink.c
>>> +++ b/lib/dpif-netlink.c
>>> @@ -2340,7 +2340,9 @@ parse_flow_put(struct dpif_netlink *dpif, struct dpif_flow_put *put)
>>>              }
>>>              netdev_set_hw_info(oor_netdev, HW_INFO_TYPE_OOR, true);
>>>          }
>>> -        level = (err == ENOSPC || err == EOPNOTSUPP) ? VLL_DBG : VLL_ERR;
>>> +        level = (err == EAGAIN ||
>>> +                 err == ENOSPC ||
>>> +                 err == EOPNOTSUPP) ? VLL_DBG : VLL_ERR;
>>>          VLOG_RL(&rl, level, "failed to offload flow: %s: %s",
>>>                  ovs_strerror(err),
>>>                  (oor_netdev ? oor_netdev->name : dev->name));
>>
>> Do we want a debug for EAGAIN here? I think it could as well be an error, as we failed to offload a flow due to a busy system?
>
> The reasoning is that the flow will be attempted offloaded again, and if the
> system is already busy, adding to the load by possibly rampant logging would
> not help on the system load?
>
> Do you think this condition is not to be expected with this subsystem?

As you mentioned, we did not log it before so we might not know. But I would expect it not to happen often, if it does, the system might not have enough resources to begin with. Anyway, we have a rate limit of five messages per minute.

Do we know why the kernel would return EAGAIN, i.e. what is TC busy with to return this? The socket is not nonblocking, so it must be TC-specific I guess.

>>> diff --git a/lib/dpif.c b/lib/dpif.c
>>> index 3305401fe..2bbb0aac4 100644
>>> --- a/lib/dpif.c
>>> +++ b/lib/dpif.c
>>> @@ -1751,7 +1751,7 @@ flow_message_log_level(int error)
>>>       * Kernels that support flow wildcarding will reject these flows as
>>>       * duplicates (EEXIST), so lower the log level to debug for these
>>>       * types of messages. */
>>> -    return (error && error != EEXIST) ? VLL_WARN : VLL_DBG;
>>> +    return (error && error != EEXIST && error != EAGAIN) ? VLL_WARN : VLL_DBG;
>>
>> Just a bit, but as the above comment explicitly mentions why EEXIST should be DBG level, we might want to add a word on why EAGAIN should also be.
>
> Same reasoning for suppressing EAGAIN as above, and it surely deserves a
> comment, thank you for pointing it out!
>
>> In addition same as above, i.e., do we want a debug for EAGAIN here?
>
> The reason for adding it here is that as we discovered the `netlink-socket`
> module is currently suppressing these in a rather surprising manner, but there
> was probably a reason for the suppression.

Same as above, do we know why we could get EAGAIN from the kernel? Is it NetLink specific or TC?

> Now that we are exposing a new temporary error condition in the `tc` module,
> many of the call paths end up in higher level dpif callbacks.  I'm approaching
> this with a bit of conservative safety in mind, squelching the logging down to
> DBG level to ensure that we don't cause any issues for existing systems or
> tests.

But I agree for this one we might want to keep it debug as it will log upto 60 a minute :(

>>>  }
>>>
>>>  static bool
>>> diff --git a/lib/netdev-linux.c b/lib/netdev-linux.c
>>> index 36620199e..a0294ca00 100644
>>> --- a/lib/netdev-linux.c
>>> +++ b/lib/netdev-linux.c
>>> @@ -5817,8 +5817,9 @@ tc_get_policer_action(uint32_t index, struct ofputil_meter_stats *stats)
>>>
>>>      error = tc_transact(&request, &replyp);
>>>      if (error) {
>>> -        VLOG_ERR_RL(&rl, "Failed to dump police action (index: %u), err=%d",
>>> -                    index, error);
>>> +        VLOG_RL(&rl, error == EAGAIN ? VLL_DBG : VLL_ERR,
>>> +                "Failed to dump police action (index: %u), err=%d",
>>> +                index, error);
>>>          return error;
>>>      }
>>>
>>> @@ -5849,8 +5850,9 @@ tc_del_policer_action(uint32_t index, struct ofputil_meter_stats *stats)
>>>
>>>      error = tc_transact(&request, &replyp);
>>>      if (error) {
>>> -        VLOG_ERR_RL(&rl, "Failed to delete police action (index: %u), err=%d",
>>> -                    index, error);
>>> +        VLOG_RL(&rl, error == EAGAIN ? VLL_DBG : VLL_ERR,
>>> +                "Failed to delete police action (index: %u), err=%d",
>>> +                index, error);
>>>          return error;
>>>      }
>>
>> The two above will still result in a warning by upper-level callers, so dbg is fine.
>>
>>> @@ -6114,11 +6116,12 @@ tc_query_class(const struct netdev *netdev,
>>>
>>>      error = tc_transact(&request, replyp);
>>>      if (error) {
>>> -        VLOG_WARN_RL(&rl, "query %s class %u:%u (parent %u:%u) failed (%s)",
>>> -                     netdev_get_name(netdev),
>>> -                     tc_get_major(handle), tc_get_minor(handle),
>>> -                     tc_get_major(parent), tc_get_minor(parent),
>>> -                     ovs_strerror(error));
>>> +        VLOG_RL(&rl, error == EAGAIN ? VLL_DBG : VLL_WARN,
>>> +                "query %s class %u:%u (parent %u:%u) failed (%s)",
>>> +                netdev_get_name(netdev),
>>> +                tc_get_major(handle), tc_get_minor(handle),
>>> +                tc_get_major(parent), tc_get_minor(parent),
>>> +                ovs_strerror(error));
>>>      }
>>>      return error;
>>>  }
>>> @@ -6140,10 +6143,11 @@ tc_delete_class(const struct netdev *netdev, unsigned int handle)
>>>
>>>      error = tc_transact(&request, NULL);
>>>      if (error) {
>>> -        VLOG_WARN_RL(&rl, "delete %s class %u:%u failed (%s)",
>>> -                     netdev_get_name(netdev),
>>> -                     tc_get_major(handle), tc_get_minor(handle),
>>> -                     ovs_strerror(error));
>>> +        VLOG_RL(&rl, error == EAGAIN ? VLL_DBG : VLL_WARN,
>>> +                "delete %s class %u:%u failed (%s)",
>>> +                netdev_get_name(netdev),
>>> +                tc_get_major(handle), tc_get_minor(handle),
>>> +                ovs_strerror(error));
>>>      }
>>>      return error;
>>>  }
>>> @@ -6262,7 +6266,9 @@ tc_query_qdisc(const struct netdev *netdev_)
>>>                  ops = &tc_ops_other;
>>>              }
>>>          }
>>> -    } else if ((!error && !qdisc->size) || error == ENOENT) {
>>> +    } else if ((!error && !qdisc->size) ||
>>> +               error == ENOENT || error == EAGAIN)
>>
>> Is the (!error && !qdisc->size) not the same as EAGAIN is now? If so, we could remove the (!error && !qdisc->size).
>> Same for a couple of lines above where we do ‘if (!error && qdisc->size)’ which can now be just !error.
>
> Ack.
>
>>> +    {
>>>          /* Either it's a built-in qdisc, or (on Linux pre-2.6.35) it's a qdisc
>>>           * set up by some other entity that doesn't have a handle 1:0.  We will
>>>           * assume that it's the system default qdisc. */
>>> diff --git a/lib/netdev-offload-tc.c b/lib/netdev-offload-tc.c
>>> index 4f26dd8cc..4fb9dc239 100644
>>> --- a/lib/netdev-offload-tc.c
>>> +++ b/lib/netdev-offload-tc.c
>>> @@ -2505,10 +2505,10 @@ netdev_tc_flow_get(struct netdev *netdev,
>>>
>>>      err = tc_get_flower(&id, &flower);
>>>      if (err) {
>>> -        VLOG_ERR_RL(&error_rl,
>>> -                    "flow get failed (dev %s prio %d handle %d): %s",
>>> -                    netdev_get_name(netdev), id.prio, id.handle,
>>> -                    ovs_strerror(err));
>>> +        VLOG_RL(&error_rl, err == EAGAIN ? VLL_DBG : VLL_ERR,
>>
>> Do we want a debug for EAGAIN here? I think it could as well be an error/warning as the system was busy?
>
> I'd lean towards yes, given short periods of the system being busy and returning
> EAGAIN on calls is to be expected, and rampant logging about it when it happens
> would not help with the load?
>
> Do you think this condition is not to be expected with this subsystem?

I think with 5 errors per second, it should be no problem, and it will give visibility into why stats might be off.

If you do like to keep this change, the error_rl should also be replaced with rl for debug messages.

>>> +                "flow get failed (dev %s prio %d handle %d): %s",
>>> +                netdev_get_name(netdev), id.prio, id.handle,
>>> +                ovs_strerror(err));
>>>          return err;
>>>      }
>>>
>>> @@ -2571,11 +2571,36 @@ netdev_tc_get_n_flows(struct netdev *netdev, uint64_t *n_flows)
>>>      return 0;
>>>  }
>>>
>>> +/* This macro is for use by one-time initialization functions, where we have
>>> + * one shot per thread/process to perform a pertinent initialization task that
>>> + * may return a temporary error (EAGAIN).
>>> + *
>>> + * With min/max values of 1/64 we would retry 7 times, spending at the
>>> + * most 127 * 1E6 nsec (0.127s) sleeping.
>>> + */
>>> +#define NETDEV_OFFLOAD_TC_BACKOFF_MIN 1
>>> +#define NETDEV_OFFLOAD_TC_BACKOFF_MAX 64
>>> +#define NETDEV_OFFLOAD_TC_RETRY_WITH_BACKOFF(BACKOFF, ERROR, CONDITION,       \
>>
>> I think it would be nice to define the backoff variable inside the macro as a local variable, as we do not need it outside the macro.
>
> Ack, was worried about the risk of unexpectedly redefining/shadowing a variable
> already defined by the consumer of the macro, but I guess the compiler will
> be helpful in pointing this out.  Will fix it.

Thanks

>>> +                                             FUNCTION, ...)                   \
>>> +    BACKOFF = NETDEV_OFFLOAD_TC_BACKOFF_MIN;                                  \
>>> +    do {                                                                      \
>>> +        ERROR = (FUNCTION)(__VA_ARGS__);                                      \
>>> +        if (CONDITION) {                                                      \
>>> +            xnanosleep(BACKOFF * 1E6);                                        \
>>> +            if (BACKOFF < NETDEV_OFFLOAD_TC_BACKOFF_MAX) {                    \
>>> +                BACKOFF <<= 1;                                                \
>>> +            } else {                                                          \
>>> +                break;                                                        \
>>> +            }                                                                 \
>>> +        }                                                                     \
>>> +    } while (CONDITION);
>>> +
>>>  static void
>>>  probe_multi_mask_per_prio(int ifindex)
>>>  {
>>>      struct tc_flower flower;
>>>      struct tcf_id id1, id2;
>>> +    uint64_t backoff;
>>>      int block_id = 0;
>>>      int prio = 1;
>>>      int error;
>>> @@ -2594,8 +2619,13 @@ probe_multi_mask_per_prio(int ifindex)
>>>      memset(&flower.mask.dst_mac, 0xff, sizeof flower.mask.dst_mac);
>>>
>>>      id1 = tc_make_tcf_id(ifindex, block_id, prio, TC_INGRESS);
>>> -    error = tc_replace_flower(&id1, &flower);
>>> +    NETDEV_OFFLOAD_TC_RETRY_WITH_BACKOFF(backoff, error, error == EAGAIN,
>>> +                                         tc_replace_flower, &id1, &flower);
>>>      if (error) {
>>> +        if (error == EAGAIN) {
>>> +            VLOG_WARN("probe tc: unable to probe for multiple mask "
>>> +                      "support: %s", ovs_strerror(error));
>>> +        }
>>>          goto out;
>>>      }
>>>
>>> @@ -2603,10 +2633,15 @@ probe_multi_mask_per_prio(int ifindex)
>>>      memset(&flower.mask.src_mac, 0xff, sizeof flower.mask.src_mac);
>>>
>>>      id2 = tc_make_tcf_id(ifindex, block_id, prio, TC_INGRESS);
>>> -    error = tc_replace_flower(&id2, &flower);
>>> +    NETDEV_OFFLOAD_TC_RETRY_WITH_BACKOFF(backoff, error, error == EAGAIN,
>>> +                                         tc_replace_flower, &id2, &flower);
>>>      tc_del_flower_filter(&id1);
>>>
>>>      if (error) {
>>> +        if (error == EAGAIN) {
>>> +            VLOG_WARN("probe tc: unable to probe for multiple mask "
>>> +                      "support: %s", ovs_strerror(error));
>>> +        }
>>>          goto out;
>>>      }
>>>
>>> @@ -2642,6 +2677,7 @@ probe_ct_state_support(int ifindex)
>>>  {
>>>      struct tc_flower flower;
>>>      uint16_t ct_state;
>>> +    uint64_t backoff;
>>>      struct tcf_id id;
>>>      int error;
>>>
>>> @@ -2657,8 +2693,13 @@ probe_ct_state_support(int ifindex)
>>>          goto out;
>>>      }
>>>
>>> -    error = tc_get_flower(&id, &flower);
>>> +    NETDEV_OFFLOAD_TC_RETRY_WITH_BACKOFF(backoff, error, error == EAGAIN,
>>> +                                         tc_get_flower, &id, &flower);
>>>      if (error || flower.mask.ct_state != ct_state) {
>>> +        if (error == EAGAIN) {
>>> +            VLOG_WARN("probe tc: unable to probe ct_state support: %s",
>>> +                      ovs_strerror(error));
>>> +        }
>>>          goto out_del;
>>>      }
>>>
>>> @@ -2670,10 +2711,16 @@ probe_ct_state_support(int ifindex)
>>>
>>>      /* Test for reject, ct_state >= MAX */
>>>      ct_state = ~0;
>>> -    error = probe_insert_ct_state_rule(ifindex, ct_state, &id);
>>> +    NETDEV_OFFLOAD_TC_RETRY_WITH_BACKOFF(backoff, error, error == EAGAIN,
>>> +                                         probe_insert_ct_state_rule, ifindex,
>>> +                                         ct_state, &id);
>>>      if (!error) {
>>>          /* No reject, can't continue probing other flags */
>>>          goto out_del;
>>> +    } else if (error == EAGAIN) {
>>> +        VLOG_WARN("probe tc: unable to probe ct_state support: %s",
>>> +                  ovs_strerror(error));
>>> +        goto out_del;
>>>      }
>>>
>>>      tc_del_flower_filter(&id);
>>> @@ -2682,8 +2729,14 @@ probe_ct_state_support(int ifindex)
>>>      memset(&flower, 0, sizeof flower);
>>>      ct_state = TCA_FLOWER_KEY_CT_FLAGS_TRACKED |
>>>                 TCA_FLOWER_KEY_CT_FLAGS_INVALID;
>>> -    error = probe_insert_ct_state_rule(ifindex, ct_state, &id);
>>> +    NETDEV_OFFLOAD_TC_RETRY_WITH_BACKOFF(backoff, error, error == EAGAIN,
>>> +                                         probe_insert_ct_state_rule, ifindex,
>>> +                                         ct_state, &id);
>>>      if (error) {
>>> +        if (error == EAGAIN) {
>>> +            VLOG_WARN("probe tc: unable to probe ct_state support: %s",
>>> +                      ovs_strerror(error));
>>> +        }
>>>          goto out;
>>>      }
>>>
>>> @@ -2695,8 +2748,14 @@ probe_ct_state_support(int ifindex)
>>>      ct_state = TCA_FLOWER_KEY_CT_FLAGS_TRACKED |
>>>                 TCA_FLOWER_KEY_CT_FLAGS_ESTABLISHED |
>>>                 TCA_FLOWER_KEY_CT_FLAGS_REPLY;
>>> -    error = probe_insert_ct_state_rule(ifindex, ct_state, &id);
>>> +    NETDEV_OFFLOAD_TC_RETRY_WITH_BACKOFF(backoff, error, error == EAGAIN,
>>> +                                         probe_insert_ct_state_rule, ifindex,
>>> +                                         ct_state, &id);
>>>      if (error) {
>>> +        if (error == EAGAIN) {
>>> +            VLOG_WARN("probe tc: unable to probe ct_state support: %s",
>>> +                      ovs_strerror(error));
>>> +        }
>>>          goto out;
>>>      }
>>>
>>> @@ -2714,6 +2773,7 @@ probe_tc_block_support(int ifindex)
>>>  {
>>>      struct tc_flower flower;
>>>      uint32_t block_id = 1;
>>> +    uint64_t backoff;
>>>      struct tcf_id id;
>>>      int prio = 0;
>>>      int error;
>>> @@ -2732,13 +2792,17 @@ probe_tc_block_support(int ifindex)
>>>      memset(&flower.mask.dst_mac, 0xff, sizeof flower.mask.dst_mac);
>>>
>>>      id = tc_make_tcf_id(ifindex, block_id, prio, TC_INGRESS);
>>> -    error = tc_replace_flower(&id, &flower);
>>> +    NETDEV_OFFLOAD_TC_RETRY_WITH_BACKOFF(backoff, error, error == EAGAIN,
>>> +                                         tc_replace_flower, &id, &flower);
>>>
>>>      tc_add_del_qdisc(ifindex, false, block_id, TC_INGRESS);
>>>
>>>      if (!error) {
>>>          block_support = true;
>>>          VLOG_INFO("probe tc: block offload is supported.");
>>> +    } else if (error == EAGAIN) {
>>> +        VLOG_WARN("probe tc: unable to probe block offload support: %s",
>>> +                  ovs_strerror(error));
>>>      }
>>>  }
>>>
>>> diff --git a/lib/netlink-socket.c b/lib/netlink-socket.c
>>> index 80da20d9f..dea060fc3 100644
>>> --- a/lib/netlink-socket.c
>>> +++ b/lib/netlink-socket.c
>>> @@ -1798,6 +1798,11 @@ nl_pool_release(struct nl_sock *sock)
>>>   *
>>>   *      2. Resending the request causes it to be re-executed, so the request
>>>   *         needs to be idempotent.
>>> + *
>>> + *      3. In the event that the kernel is too busy to handle the request to
>>> + *         receive the response (i.e. EAGAIN), this function will still return
>>> + *         0.  The caller can check for this condition by checking for a zero
>>> + *         size of the 'replyp' ofpbuf buffer.
>>>   */
>>>  int
>>>  nl_transact(int protocol, const struct ofpbuf *request,
>>> diff --git a/lib/tc.c b/lib/tc.c
>>> index 5c32c6f97..7df6a10c5 100644
>>> --- a/lib/tc.c
>>> +++ b/lib/tc.c
>>> @@ -237,11 +237,29 @@ static void request_from_tcf_id(struct tcf_id *id, uint16_t eth_type,
>>>      }
>>>  }
>>>
>>> +/* The `tc_transact` function is a wrapper around `nl_transact` with the
>>> + * addition of:
>>> + *
>>> + * 1. the 'request' ofpbuf buffer is freed after `nl_transact` returns,
>>> + *    regardless of success or failure.
>>> + *
>>> + * 2. When a 'replyp' pointer is provided; in the event of the kernel
>>> + *    being too busy to process the request for the response, a positive
>>> + *    error return will be provided with the value of EAGAIN.
>>> + *
>>> + * Please acquaint yourself with the documentation of the `nl_transact`
>>> + * function in the netlink-sock module before making use of this function.
>>> + */
>>>  int
>>>  tc_transact(struct ofpbuf *request, struct ofpbuf **replyp)
>>>  {
>>>      int error = nl_transact(NETLINK_ROUTE, request, replyp);
>>>      ofpbuf_uninit(request);
>>> +
>>> +    if (!error && replyp && !(*replyp)->size) {
>>
>> I think it would be really nice to add a coverage counter for this, so we know it happened as most of it is hidden behind a debug message.
>
> Ack, good point, will add!
>
> -- 
> Frode Nordahl
>
>>> +        return EAGAIN;
>>> +    }
>>> +
>>>      return error;
>>>  }
>>>
>>> --
>>> 2.39.2
>>>
>>> _______________________________________________
>>> dev mailing list
>>> dev@openvswitch.org
>>> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
>>
Frode Nordahl May 22, 2023, 10:32 a.m. UTC | #6
On Mon, May 22, 2023 at 11:33 AM Eelco Chaudron <echaudro@redhat.com> wrote:
>
>
>
> On 19 May 2023, at 12:31, Frode Nordahl wrote:
>
> > On Fri, May 19, 2023 at 11:55 AM Eelco Chaudron <echaudro@redhat.com> wrote:
> >>
> >>
> >>
> >> On 15 May 2023, at 10:04, Frode Nordahl wrote:
> >>
> >>> The tc module combines the use of the `tc_transact` helper
> >>> function for communication with the in-kernel tc infrastructure
> >>> with assertions on the reply data by `ofpbuf_at_assert` on the
> >>> received data prior to further processing.
> >>>
> >>> `tc_transact` in turn calls `nl_transact`, which via
> >>> `nl_transact_multiple__` ultimately calls and handles return
> >>> value from `recvmsg`.  On error a check for EAGAIN is performed
> >>> and a consequence of this condition is effectively to provide a
> >>> non-error (0) result and an empty reply buffer.
> >>>
> >>> Before this change, the `tc_transact` and, consumers of it, were
> >>> unaware of this condition.  The use of assertions on the reply
> >>> buffer can as such lead to a fatal crash of OVS.
> >>>
> >>> To be fair, the behavior of `nl_transact` when handling an EAGAIN
> >>> return is currently not documented, so this change also adds that
> >>> documentation.
> >>>
> >>> While fixing the problem, it led me to find potential problems
> >>> with the one-time initialization functions in the netdev-offload-tc
> >>> module.  Make use of the information now available about an EAGAIN
> >>> condition to retry one-time initialization, and resort to logging
> >>> a warning if probing of tc features fails due to temporary
> >>> situations such as resource depletion.
> >>>
> >>> For the record, the symptom of the crash is this in the log:
> >>> EMER|../include/openvswitch/ofpbuf.h:194: assertion offset + size <= b->size failed in ofpbuf_at_assert()
> >>>
> >>> And an excerpt of the backtrace looks like this:
> >>> 0x0000561dac1396d1 in ofpbuf_at_assert (b=0x7fb650005d20, b=0x7fb650005d20, offset=16, size=20) at ../include/openvswitch/ofpbuf.h:194
> >>> tc_replace_flower (id=<optimized out>, flower=<optimized out>) at ../lib/tc.c:3223
> >>> 0x0000561dac128155 in netdev_tc_flow_put (netdev=0x561dacf91840, match=<optimized out>, actions=<optimized out>, actions_len=<optimized out>,
> >>> ufid=<optimized out>, info=<optimized out>, stats=<optimized out>) at ../lib/netdev-offload-tc.c:2096
> >>> 0x0000561dac117541 in netdev_flow_put (stats=<optimized out>, info=0x7fb65b7ba780, ufid=<optimized out>, act_len=<optimized out>, actions=<optimized out>,
> >>> match=0x7fb65b7ba980, netdev=0x561dacf91840) at ../lib/netdev-offload.c:257
> >>> parse_flow_put (put=0x7fb65b7bcc50, dpif=0x561dad0ad550) at ../lib/dpif-netlink.c:2297
> >>> try_send_to_netdev (op=0x7fb65b7bcc48, dpif=0x561dad0ad550) at ../lib/dpif-netlink.c:2384
> >>
> >> Thanks for this patch. See some comments inline below.
> >
> > Hello, Eelco,
> >
> > Thank you for taking the time to review, much appreciated! I'll
> > respond inline as well.
>
> See my responses inline again ;) I’m wondering if you know why we would fail with EAGAIN on a nonblocking socket? Is it TC or NetLink-specific?

The recvmsg call [0] is made with the MSG_DONTWAIT flag, and
a return of EAGAIN is then to be expected whenever the call
would block [1].

So I don't think this is TC or netlink specific, this is just the
semantics of a nonblocking socket.

0: https://github.com/openvswitch/ovs/blob/8045c0f8de5192355ca438ed7eef77457c3c1625/lib/netlink-socket.c#L712
1: https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/tree/man2/recv.2#n113

> >> Cheers,
> >>
> >> Eelco
> >>
> >>> Reported-At: https://launchpad.net/bugs/2018500
> >>> Fixes f98e418fbdb6 (tc: Add tc flower functions)
> >>> Fixes 407556ac6c90 (netlink-socket: Avoid forcing a reply for final message in a transaction.)
> >
> > I see now that I've forgotten a `:` after `Fixes` which makes
> > patchwork not account for
> > them, I'll fix that in the next iteration
> >
> >>> Signed-off-by: Frode Nordahl <frode.nordahl@canonical.com>
> >>> ---
> >>>  lib/dpif-netlink.c      |  4 +-
> >>>  lib/dpif.c              |  2 +-
> >>>  lib/netdev-linux.c      | 34 +++++++++-------
> >>>  lib/netdev-offload-tc.c | 86 +++++++++++++++++++++++++++++++++++------
> >>>  lib/netlink-socket.c    |  5 +++
> >>>  lib/tc.c                | 18 +++++++++
> >>>  6 files changed, 122 insertions(+), 27 deletions(-)
> >>>
> >>> diff --git a/lib/dpif-netlink.c b/lib/dpif-netlink.c
> >>> index 60bd39643..75537b2f8 100644
> >>> --- a/lib/dpif-netlink.c
> >>> +++ b/lib/dpif-netlink.c
> >>> @@ -2340,7 +2340,9 @@ parse_flow_put(struct dpif_netlink *dpif, struct dpif_flow_put *put)
> >>>              }
> >>>              netdev_set_hw_info(oor_netdev, HW_INFO_TYPE_OOR, true);
> >>>          }
> >>> -        level = (err == ENOSPC || err == EOPNOTSUPP) ? VLL_DBG : VLL_ERR;
> >>> +        level = (err == EAGAIN ||
> >>> +                 err == ENOSPC ||
> >>> +                 err == EOPNOTSUPP) ? VLL_DBG : VLL_ERR;
> >>>          VLOG_RL(&rl, level, "failed to offload flow: %s: %s",
> >>>                  ovs_strerror(err),
> >>>                  (oor_netdev ? oor_netdev->name : dev->name));
> >>
> >> Do we want a debug for EAGAIN here? I think it could as well be an error, as we failed to offload a flow due to a busy system?
> >
> > The reasoning is that the flow will be attempted offloaded again, and if the
> > system is already busy, adding to the load by possibly rampant logging would
> > not help on the system load?
> >
> > Do you think this condition is not to be expected with this subsystem?
>
> As you mentioned, we did not log it before so we might not know. But I would expect it not to happen often, if it does, the system might not have enough resources to begin with. Anyway, we have a rate limit of five messages per minute.
>
> Do we know why the kernel would return EAGAIN, i.e. what is TC busy with to return this? The socket is not nonblocking, so it must be TC-specific I guess.

See above.

> >>> diff --git a/lib/dpif.c b/lib/dpif.c
> >>> index 3305401fe..2bbb0aac4 100644
> >>> --- a/lib/dpif.c
> >>> +++ b/lib/dpif.c
> >>> @@ -1751,7 +1751,7 @@ flow_message_log_level(int error)
> >>>       * Kernels that support flow wildcarding will reject these flows as
> >>>       * duplicates (EEXIST), so lower the log level to debug for these
> >>>       * types of messages. */
> >>> -    return (error && error != EEXIST) ? VLL_WARN : VLL_DBG;
> >>> +    return (error && error != EEXIST && error != EAGAIN) ? VLL_WARN : VLL_DBG;
> >>
> >> Just a bit, but as the above comment explicitly mentions why EEXIST should be DBG level, we might want to add a word on why EAGAIN should also be.
> >
> > Same reasoning for suppressing EAGAIN as above, and it surely deserves a
> > comment, thank you for pointing it out!
> >
> >> In addition same as above, i.e., do we want a debug for EAGAIN here?
> >
> > The reason for adding it here is that as we discovered the `netlink-socket`
> > module is currently suppressing these in a rather surprising manner, but there
> > was probably a reason for the suppression.
>
> Same as above, do we know why we could get EAGAIN from the kernel? Is it NetLink specific or TC?

See above.

> > Now that we are exposing a new temporary error condition in the `tc` module,
> > many of the call paths end up in higher level dpif callbacks.  I'm approaching
> > this with a bit of conservative safety in mind, squelching the logging down to
> > DBG level to ensure that we don't cause any issues for existing systems or
> > tests.
>
> But I agree for this one we might want to keep it debug as it will log upto 60 a minute :(

++

> >>>  }
> >>>
> >>>  static bool
> >>> diff --git a/lib/netdev-linux.c b/lib/netdev-linux.c
> >>> index 36620199e..a0294ca00 100644
> >>> --- a/lib/netdev-linux.c
> >>> +++ b/lib/netdev-linux.c
> >>> @@ -5817,8 +5817,9 @@ tc_get_policer_action(uint32_t index, struct ofputil_meter_stats *stats)
> >>>
> >>>      error = tc_transact(&request, &replyp);
> >>>      if (error) {
> >>> -        VLOG_ERR_RL(&rl, "Failed to dump police action (index: %u), err=%d",
> >>> -                    index, error);
> >>> +        VLOG_RL(&rl, error == EAGAIN ? VLL_DBG : VLL_ERR,
> >>> +                "Failed to dump police action (index: %u), err=%d",
> >>> +                index, error);
> >>>          return error;
> >>>      }
> >>>
> >>> @@ -5849,8 +5850,9 @@ tc_del_policer_action(uint32_t index, struct ofputil_meter_stats *stats)
> >>>
> >>>      error = tc_transact(&request, &replyp);
> >>>      if (error) {
> >>> -        VLOG_ERR_RL(&rl, "Failed to delete police action (index: %u), err=%d",
> >>> -                    index, error);
> >>> +        VLOG_RL(&rl, error == EAGAIN ? VLL_DBG : VLL_ERR,
> >>> +                "Failed to delete police action (index: %u), err=%d",
> >>> +                index, error);
> >>>          return error;
> >>>      }
> >>
> >> The two above will still result in a warning by upper-level callers, so dbg is fine.
> >>
> >>> @@ -6114,11 +6116,12 @@ tc_query_class(const struct netdev *netdev,
> >>>
> >>>      error = tc_transact(&request, replyp);
> >>>      if (error) {
> >>> -        VLOG_WARN_RL(&rl, "query %s class %u:%u (parent %u:%u) failed (%s)",
> >>> -                     netdev_get_name(netdev),
> >>> -                     tc_get_major(handle), tc_get_minor(handle),
> >>> -                     tc_get_major(parent), tc_get_minor(parent),
> >>> -                     ovs_strerror(error));
> >>> +        VLOG_RL(&rl, error == EAGAIN ? VLL_DBG : VLL_WARN,
> >>> +                "query %s class %u:%u (parent %u:%u) failed (%s)",
> >>> +                netdev_get_name(netdev),
> >>> +                tc_get_major(handle), tc_get_minor(handle),
> >>> +                tc_get_major(parent), tc_get_minor(parent),
> >>> +                ovs_strerror(error));
> >>>      }
> >>>      return error;
> >>>  }
> >>> @@ -6140,10 +6143,11 @@ tc_delete_class(const struct netdev *netdev, unsigned int handle)
> >>>
> >>>      error = tc_transact(&request, NULL);
> >>>      if (error) {
> >>> -        VLOG_WARN_RL(&rl, "delete %s class %u:%u failed (%s)",
> >>> -                     netdev_get_name(netdev),
> >>> -                     tc_get_major(handle), tc_get_minor(handle),
> >>> -                     ovs_strerror(error));
> >>> +        VLOG_RL(&rl, error == EAGAIN ? VLL_DBG : VLL_WARN,
> >>> +                "delete %s class %u:%u failed (%s)",
> >>> +                netdev_get_name(netdev),
> >>> +                tc_get_major(handle), tc_get_minor(handle),
> >>> +                ovs_strerror(error));
> >>>      }
> >>>      return error;
> >>>  }
> >>> @@ -6262,7 +6266,9 @@ tc_query_qdisc(const struct netdev *netdev_)
> >>>                  ops = &tc_ops_other;
> >>>              }
> >>>          }
> >>> -    } else if ((!error && !qdisc->size) || error == ENOENT) {
> >>> +    } else if ((!error && !qdisc->size) ||
> >>> +               error == ENOENT || error == EAGAIN)
> >>
> >> Is the (!error && !qdisc->size) not the same as EAGAIN is now? If so, we could remove the (!error && !qdisc->size).
> >> Same for a couple of lines above where we do ‘if (!error && qdisc->size)’ which can now be just !error.
> >
> > Ack.
> >
> >>> +    {
> >>>          /* Either it's a built-in qdisc, or (on Linux pre-2.6.35) it's a qdisc
> >>>           * set up by some other entity that doesn't have a handle 1:0.  We will
> >>>           * assume that it's the system default qdisc. */
> >>> diff --git a/lib/netdev-offload-tc.c b/lib/netdev-offload-tc.c
> >>> index 4f26dd8cc..4fb9dc239 100644
> >>> --- a/lib/netdev-offload-tc.c
> >>> +++ b/lib/netdev-offload-tc.c
> >>> @@ -2505,10 +2505,10 @@ netdev_tc_flow_get(struct netdev *netdev,
> >>>
> >>>      err = tc_get_flower(&id, &flower);
> >>>      if (err) {
> >>> -        VLOG_ERR_RL(&error_rl,
> >>> -                    "flow get failed (dev %s prio %d handle %d): %s",
> >>> -                    netdev_get_name(netdev), id.prio, id.handle,
> >>> -                    ovs_strerror(err));
> >>> +        VLOG_RL(&error_rl, err == EAGAIN ? VLL_DBG : VLL_ERR,
> >>
> >> Do we want a debug for EAGAIN here? I think it could as well be an error/warning as the system was busy?
> >
> > I'd lean towards yes, given short periods of the system being busy and returning
> > EAGAIN on calls is to be expected, and rampant logging about it when it happens
> > would not help with the load?
> >
> > Do you think this condition is not to be expected with this subsystem?
>
> I think with 5 errors per second, it should be no problem, and it will give visibility into why stats might be off.
>
> If you do like to keep this change, the error_rl should also be replaced with rl for debug messages.

Ack

> >>> +                "flow get failed (dev %s prio %d handle %d): %s",
> >>> +                netdev_get_name(netdev), id.prio, id.handle,
> >>> +                ovs_strerror(err));
> >>>          return err;
> >>>      }
> >>>
> >>> @@ -2571,11 +2571,36 @@ netdev_tc_get_n_flows(struct netdev *netdev, uint64_t *n_flows)
> >>>      return 0;
> >>>  }
> >>>
> >>> +/* This macro is for use by one-time initialization functions, where we have
> >>> + * one shot per thread/process to perform a pertinent initialization task that
> >>> + * may return a temporary error (EAGAIN).
> >>> + *
> >>> + * With min/max values of 1/64 we would retry 7 times, spending at the
> >>> + * most 127 * 1E6 nsec (0.127s) sleeping.
> >>> + */
> >>> +#define NETDEV_OFFLOAD_TC_BACKOFF_MIN 1
> >>> +#define NETDEV_OFFLOAD_TC_BACKOFF_MAX 64
> >>> +#define NETDEV_OFFLOAD_TC_RETRY_WITH_BACKOFF(BACKOFF, ERROR, CONDITION,       \
> >>
> >> I think it would be nice to define the backoff variable inside the macro as a local variable, as we do not need it outside the macro.
> >
> > Ack, was worried about the risk of unexpectedly redefining/shadowing a variable
> > already defined by the consumer of the macro, but I guess the compiler will
> > be helpful in pointing this out.  Will fix it.
>
> Thanks
Eelco Chaudron May 22, 2023, 2:31 p.m. UTC | #7
On 22 May 2023, at 12:32, Frode Nordahl wrote:

> On Mon, May 22, 2023 at 11:33 AM Eelco Chaudron <echaudro@redhat.com> wrote:
>>
>>
>>
>> On 19 May 2023, at 12:31, Frode Nordahl wrote:
>>
>>> On Fri, May 19, 2023 at 11:55 AM Eelco Chaudron <echaudro@redhat.com> wrote:
>>>>
>>>>
>>>>
>>>> On 15 May 2023, at 10:04, Frode Nordahl wrote:
>>>>
>>>>> The tc module combines the use of the `tc_transact` helper
>>>>> function for communication with the in-kernel tc infrastructure
>>>>> with assertions on the reply data by `ofpbuf_at_assert` on the
>>>>> received data prior to further processing.
>>>>>
>>>>> `tc_transact` in turn calls `nl_transact`, which via
>>>>> `nl_transact_multiple__` ultimately calls and handles return
>>>>> value from `recvmsg`.  On error a check for EAGAIN is performed
>>>>> and a consequence of this condition is effectively to provide a
>>>>> non-error (0) result and an empty reply buffer.
>>>>>
>>>>> Before this change, the `tc_transact` and, consumers of it, were
>>>>> unaware of this condition.  The use of assertions on the reply
>>>>> buffer can as such lead to a fatal crash of OVS.
>>>>>
>>>>> To be fair, the behavior of `nl_transact` when handling an EAGAIN
>>>>> return is currently not documented, so this change also adds that
>>>>> documentation.
>>>>>
>>>>> While fixing the problem, it led me to find potential problems
>>>>> with the one-time initialization functions in the netdev-offload-tc
>>>>> module.  Make use of the information now available about an EAGAIN
>>>>> condition to retry one-time initialization, and resort to logging
>>>>> a warning if probing of tc features fails due to temporary
>>>>> situations such as resource depletion.
>>>>>
>>>>> For the record, the symptom of the crash is this in the log:
>>>>> EMER|../include/openvswitch/ofpbuf.h:194: assertion offset + size <= b->size failed in ofpbuf_at_assert()
>>>>>
>>>>> And an excerpt of the backtrace looks like this:
>>>>> 0x0000561dac1396d1 in ofpbuf_at_assert (b=0x7fb650005d20, b=0x7fb650005d20, offset=16, size=20) at ../include/openvswitch/ofpbuf.h:194
>>>>> tc_replace_flower (id=<optimized out>, flower=<optimized out>) at ../lib/tc.c:3223
>>>>> 0x0000561dac128155 in netdev_tc_flow_put (netdev=0x561dacf91840, match=<optimized out>, actions=<optimized out>, actions_len=<optimized out>,
>>>>> ufid=<optimized out>, info=<optimized out>, stats=<optimized out>) at ../lib/netdev-offload-tc.c:2096
>>>>> 0x0000561dac117541 in netdev_flow_put (stats=<optimized out>, info=0x7fb65b7ba780, ufid=<optimized out>, act_len=<optimized out>, actions=<optimized out>,
>>>>> match=0x7fb65b7ba980, netdev=0x561dacf91840) at ../lib/netdev-offload.c:257
>>>>> parse_flow_put (put=0x7fb65b7bcc50, dpif=0x561dad0ad550) at ../lib/dpif-netlink.c:2297
>>>>> try_send_to_netdev (op=0x7fb65b7bcc48, dpif=0x561dad0ad550) at ../lib/dpif-netlink.c:2384
>>>>
>>>> Thanks for this patch. See some comments inline below.
>>>
>>> Hello, Eelco,
>>>
>>> Thank you for taking the time to review, much appreciated! I'll
>>> respond inline as well.
>>
>> See my responses inline again ;) I’m wondering if you know why we would fail with EAGAIN on a nonblocking socket? Is it TC or NetLink-specific?
>
> The recvmsg call [0] is made with the MSG_DONTWAIT flag, and
> a return of EAGAIN is then to be expected whenever the call
> would block [1].
>
> So I don't think this is TC or netlink specific, this is just the
> semantics of a nonblocking socket.
>
> 0: https://github.com/openvswitch/ovs/blob/8045c0f8de5192355ca438ed7eef77457c3c1625/lib/netlink-socket.c#L712
> 1: https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/tree/man2/recv.2#n113

I should have drank more coffee as wait != MSG_DONTWAIT :)

So peeking at some kernel code, the most obvious way for this to happen is if the driver is still busy processing the request. Did you see this with specific offload hardware, or even without hw offloading?

I would still like the specific netdev_tc_flow_get() to be of level error (they are protected by a max 5 a minute), as it interacts with the revalidator (but potentially only at ovs restart). In theory netdev_tc_flow_put() might also be a candidate for logging, but there is none, so the coverage counter is what we will have for now…

See one more comment below on a potential memory leak…

>>>> Cheers,
>>>>
>>>> Eelco
>>>>
>>>>> Reported-At: https://launchpad.net/bugs/2018500
>>>>> Fixes f98e418fbdb6 (tc: Add tc flower functions)
>>>>> Fixes 407556ac6c90 (netlink-socket: Avoid forcing a reply for final message in a transaction.)
>>>
>>> I see now that I've forgotten a `:` after `Fixes` which makes
>>> patchwork not account for
>>> them, I'll fix that in the next iteration
>>>
>>>>> Signed-off-by: Frode Nordahl <frode.nordahl@canonical.com>
>>>>> ---
>>>>>  lib/dpif-netlink.c      |  4 +-
>>>>>  lib/dpif.c              |  2 +-
>>>>>  lib/netdev-linux.c      | 34 +++++++++-------
>>>>>  lib/netdev-offload-tc.c | 86 +++++++++++++++++++++++++++++++++++------
>>>>>  lib/netlink-socket.c    |  5 +++
>>>>>  lib/tc.c                | 18 +++++++++
>>>>>  6 files changed, 122 insertions(+), 27 deletions(-)
>>>>>
>>>>> diff --git a/lib/dpif-netlink.c b/lib/dpif-netlink.c
>>>>> index 60bd39643..75537b2f8 100644
>>>>> --- a/lib/dpif-netlink.c
>>>>> +++ b/lib/dpif-netlink.c
>>>>> @@ -2340,7 +2340,9 @@ parse_flow_put(struct dpif_netlink *dpif, struct dpif_flow_put *put)
>>>>>              }
>>>>>              netdev_set_hw_info(oor_netdev, HW_INFO_TYPE_OOR, true);
>>>>>          }
>>>>> -        level = (err == ENOSPC || err == EOPNOTSUPP) ? VLL_DBG : VLL_ERR;
>>>>> +        level = (err == EAGAIN ||
>>>>> +                 err == ENOSPC ||
>>>>> +                 err == EOPNOTSUPP) ? VLL_DBG : VLL_ERR;
>>>>>          VLOG_RL(&rl, level, "failed to offload flow: %s: %s",
>>>>>                  ovs_strerror(err),
>>>>>                  (oor_netdev ? oor_netdev->name : dev->name));
>>>>
>>>> Do we want a debug for EAGAIN here? I think it could as well be an error, as we failed to offload a flow due to a busy system?
>>>
>>> The reasoning is that the flow will be attempted offloaded again, and if the
>>> system is already busy, adding to the load by possibly rampant logging would
>>> not help on the system load?
>>>
>>> Do you think this condition is not to be expected with this subsystem?
>>
>> As you mentioned, we did not log it before so we might not know. But I would expect it not to happen often, if it does, the system might not have enough resources to begin with. Anyway, we have a rate limit of five messages per minute.
>>
>> Do we know why the kernel would return EAGAIN, i.e. what is TC busy with to return this? The socket is not nonblocking, so it must be TC-specific I guess.
>
> See above.
>
>>>>> diff --git a/lib/dpif.c b/lib/dpif.c
>>>>> index 3305401fe..2bbb0aac4 100644
>>>>> --- a/lib/dpif.c
>>>>> +++ b/lib/dpif.c
>>>>> @@ -1751,7 +1751,7 @@ flow_message_log_level(int error)
>>>>>       * Kernels that support flow wildcarding will reject these flows as
>>>>>       * duplicates (EEXIST), so lower the log level to debug for these
>>>>>       * types of messages. */
>>>>> -    return (error && error != EEXIST) ? VLL_WARN : VLL_DBG;
>>>>> +    return (error && error != EEXIST && error != EAGAIN) ? VLL_WARN : VLL_DBG;
>>>>
>>>> Just a bit, but as the above comment explicitly mentions why EEXIST should be DBG level, we might want to add a word on why EAGAIN should also be.
>>>
>>> Same reasoning for suppressing EAGAIN as above, and it surely deserves a
>>> comment, thank you for pointing it out!
>>>
>>>> In addition same as above, i.e., do we want a debug for EAGAIN here?
>>>
>>> The reason for adding it here is that as we discovered the `netlink-socket`
>>> module is currently suppressing these in a rather surprising manner, but there
>>> was probably a reason for the suppression.
>>
>> Same as above, do we know why we could get EAGAIN from the kernel? Is it NetLink specific or TC?
>
> See above.
>
>>> Now that we are exposing a new temporary error condition in the `tc` module,
>>> many of the call paths end up in higher level dpif callbacks.  I'm approaching
>>> this with a bit of conservative safety in mind, squelching the logging down to
>>> DBG level to ensure that we don't cause any issues for existing systems or
>>> tests.
>>
>> But I agree for this one we might want to keep it debug as it will log upto 60 a minute :(
>
> ++
>
>>>>>  }
>>>>>
>>>>>  static bool
>>>>> diff --git a/lib/netdev-linux.c b/lib/netdev-linux.c
>>>>> index 36620199e..a0294ca00 100644
>>>>> --- a/lib/netdev-linux.c
>>>>> +++ b/lib/netdev-linux.c
>>>>> @@ -5817,8 +5817,9 @@ tc_get_policer_action(uint32_t index, struct ofputil_meter_stats *stats)
>>>>>
>>>>>      error = tc_transact(&request, &replyp);
>>>>>      if (error) {
>>>>> -        VLOG_ERR_RL(&rl, "Failed to dump police action (index: %u), err=%d",
>>>>> -                    index, error);
>>>>> +        VLOG_RL(&rl, error == EAGAIN ? VLL_DBG : VLL_ERR,
>>>>> +                "Failed to dump police action (index: %u), err=%d",
>>>>> +                index, error);
>>>>>          return error;
>>>>>      }
>>>>>
>>>>> @@ -5849,8 +5850,9 @@ tc_del_policer_action(uint32_t index, struct ofputil_meter_stats *stats)
>>>>>
>>>>>      error = tc_transact(&request, &replyp);
>>>>>      if (error) {
>>>>> -        VLOG_ERR_RL(&rl, "Failed to delete police action (index: %u), err=%d",
>>>>> -                    index, error);
>>>>> +        VLOG_RL(&rl, error == EAGAIN ? VLL_DBG : VLL_ERR,
>>>>> +                "Failed to delete police action (index: %u), err=%d",
>>>>> +                index, error);
>>>>>          return error;
>>>>>      }
>>>>
>>>> The two above will still result in a warning by upper-level callers, so dbg is fine.
>>>>
>>>>> @@ -6114,11 +6116,12 @@ tc_query_class(const struct netdev *netdev,
>>>>>
>>>>>      error = tc_transact(&request, replyp);
>>>>>      if (error) {
>>>>> -        VLOG_WARN_RL(&rl, "query %s class %u:%u (parent %u:%u) failed (%s)",
>>>>> -                     netdev_get_name(netdev),
>>>>> -                     tc_get_major(handle), tc_get_minor(handle),
>>>>> -                     tc_get_major(parent), tc_get_minor(parent),
>>>>> -                     ovs_strerror(error));
>>>>> +        VLOG_RL(&rl, error == EAGAIN ? VLL_DBG : VLL_WARN,
>>>>> +                "query %s class %u:%u (parent %u:%u) failed (%s)",
>>>>> +                netdev_get_name(netdev),
>>>>> +                tc_get_major(handle), tc_get_minor(handle),
>>>>> +                tc_get_major(parent), tc_get_minor(parent),
>>>>> +                ovs_strerror(error));
>>>>>      }
>>>>>      return error;
>>>>>  }
>>>>> @@ -6140,10 +6143,11 @@ tc_delete_class(const struct netdev *netdev, unsigned int handle)
>>>>>
>>>>>      error = tc_transact(&request, NULL);
>>>>>      if (error) {
>>>>> -        VLOG_WARN_RL(&rl, "delete %s class %u:%u failed (%s)",
>>>>> -                     netdev_get_name(netdev),
>>>>> -                     tc_get_major(handle), tc_get_minor(handle),
>>>>> -                     ovs_strerror(error));
>>>>> +        VLOG_RL(&rl, error == EAGAIN ? VLL_DBG : VLL_WARN,
>>>>> +                "delete %s class %u:%u failed (%s)",
>>>>> +                netdev_get_name(netdev),
>>>>> +                tc_get_major(handle), tc_get_minor(handle),
>>>>> +                ovs_strerror(error));
>>>>>      }
>>>>>      return error;
>>>>>  }
>>>>> @@ -6262,7 +6266,9 @@ tc_query_qdisc(const struct netdev *netdev_)
>>>>>                  ops = &tc_ops_other;
>>>>>              }
>>>>>          }
>>>>> -    } else if ((!error && !qdisc->size) || error == ENOENT) {
>>>>> +    } else if ((!error && !qdisc->size) ||
>>>>> +               error == ENOENT || error == EAGAIN)
>>>>
>>>> Is the (!error && !qdisc->size) not the same as EAGAIN is now? If so, we could remove the (!error && !qdisc->size).
>>>> Same for a couple of lines above where we do ‘if (!error && qdisc->size)’ which can now be just !error.
>>>
>>> Ack.
>>>
>>>>> +    {
>>>>>          /* Either it's a built-in qdisc, or (on Linux pre-2.6.35) it's a qdisc
>>>>>           * set up by some other entity that doesn't have a handle 1:0.  We will
>>>>>           * assume that it's the system default qdisc. */
>>>>> diff --git a/lib/netdev-offload-tc.c b/lib/netdev-offload-tc.c
>>>>> index 4f26dd8cc..4fb9dc239 100644
>>>>> --- a/lib/netdev-offload-tc.c
>>>>> +++ b/lib/netdev-offload-tc.c
>>>>> @@ -2505,10 +2505,10 @@ netdev_tc_flow_get(struct netdev *netdev,
>>>>>
>>>>>      err = tc_get_flower(&id, &flower);
>>>>>      if (err) {
>>>>> -        VLOG_ERR_RL(&error_rl,
>>>>> -                    "flow get failed (dev %s prio %d handle %d): %s",
>>>>> -                    netdev_get_name(netdev), id.prio, id.handle,
>>>>> -                    ovs_strerror(err));
>>>>> +        VLOG_RL(&error_rl, err == EAGAIN ? VLL_DBG : VLL_ERR,
>>>>
>>>> Do we want a debug for EAGAIN here? I think it could as well be an error/warning as the system was busy?
>>>
>>> I'd lean towards yes, given short periods of the system being busy and returning
>>> EAGAIN on calls is to be expected, and rampant logging about it when it happens
>>> would not help with the load?
>>>
>>> Do you think this condition is not to be expected with this subsystem?
>>
>> I think with 5 errors per second, it should be no problem, and it will give visibility into why stats might be off.
>>
>> If you do like to keep this change, the error_rl should also be replaced with rl for debug messages.
>
> Ack
>
>>>>> +                "flow get failed (dev %s prio %d handle %d): %s",
>>>>> +                netdev_get_name(netdev), id.prio, id.handle,
>>>>> +                ovs_strerror(err));
>>>>>          return err;
>>>>>      }
>>>>>
>>>>> @@ -2571,11 +2571,36 @@ netdev_tc_get_n_flows(struct netdev *netdev, uint64_t *n_flows)
>>>>>      return 0;
>>>>>  }
>>>>>
>>>>> +/* This macro is for use by one-time initialization functions, where we have
>>>>> + * one shot per thread/process to perform a pertinent initialization task that
>>>>> + * may return a temporary error (EAGAIN).
>>>>> + *
>>>>> + * With min/max values of 1/64 we would retry 7 times, spending at the
>>>>> + * most 127 * 1E6 nsec (0.127s) sleeping.
>>>>> + */
>>>>> +#define NETDEV_OFFLOAD_TC_BACKOFF_MIN 1
>>>>> +#define NETDEV_OFFLOAD_TC_BACKOFF_MAX 64
>>>>> +#define NETDEV_OFFLOAD_TC_RETRY_WITH_BACKOFF(BACKOFF, ERROR, CONDITION,       \
>>>>
>>>> I think it would be nice to define the backoff variable inside the macro as a local variable, as we do not need it outside the macro.
>>>
>>> Ack, was worried about the risk of unexpectedly redefining/shadowing a variable
>>> already defined by the consumer of the macro, but I guess the compiler will
>>> be helpful in pointing this out.  Will fix it.
>>
>> Thanks
>
> -- 
> Frode Nordahl
>
>>
>>>>> +                                             FUNCTION, ...)                   \
>>>>> +    BACKOFF = NETDEV_OFFLOAD_TC_BACKOFF_MIN;                                  \
>>>>> +    do {                                                                      \
>>>>> +        ERROR = (FUNCTION)(__VA_ARGS__);                                      \
>>>>> +        if (CONDITION) {                                                      \
>>>>> +            xnanosleep(BACKOFF * 1E6);                                        \
>>>>> +            if (BACKOFF < NETDEV_OFFLOAD_TC_BACKOFF_MAX) {                    \
>>>>> +                BACKOFF <<= 1;                                                \
>>>>> +            } else {                                                          \
>>>>> +                break;                                                        \
>>>>> +            }                                                                 \
>>>>> +        }                                                                     \
>>>>> +    } while (CONDITION);
>>>>> +
>>>>>  static void
>>>>>  probe_multi_mask_per_prio(int ifindex)
>>>>>  {
>>>>>      struct tc_flower flower;
>>>>>      struct tcf_id id1, id2;
>>>>> +    uint64_t backoff;
>>>>>      int block_id = 0;
>>>>>      int prio = 1;
>>>>>      int error;
>>>>> @@ -2594,8 +2619,13 @@ probe_multi_mask_per_prio(int ifindex)
>>>>>      memset(&flower.mask.dst_mac, 0xff, sizeof flower.mask.dst_mac);
>>>>>
>>>>>      id1 = tc_make_tcf_id(ifindex, block_id, prio, TC_INGRESS);
>>>>> -    error = tc_replace_flower(&id1, &flower);
>>>>> +    NETDEV_OFFLOAD_TC_RETRY_WITH_BACKOFF(backoff, error, error == EAGAIN,
>>>>> +                                         tc_replace_flower, &id1, &flower);
>>>>>      if (error) {
>>>>> +        if (error == EAGAIN) {
>>>>> +            VLOG_WARN("probe tc: unable to probe for multiple mask "
>>>>> +                      "support: %s", ovs_strerror(error));
>>>>> +        }
>>>>>          goto out;
>>>>>      }
>>>>>
>>>>> @@ -2603,10 +2633,15 @@ probe_multi_mask_per_prio(int ifindex)
>>>>>      memset(&flower.mask.src_mac, 0xff, sizeof flower.mask.src_mac);
>>>>>
>>>>>      id2 = tc_make_tcf_id(ifindex, block_id, prio, TC_INGRESS);
>>>>> -    error = tc_replace_flower(&id2, &flower);
>>>>> +    NETDEV_OFFLOAD_TC_RETRY_WITH_BACKOFF(backoff, error, error == EAGAIN,
>>>>> +                                         tc_replace_flower, &id2, &flower);
>>>>>      tc_del_flower_filter(&id1);
>>>>>
>>>>>      if (error) {
>>>>> +        if (error == EAGAIN) {
>>>>> +            VLOG_WARN("probe tc: unable to probe for multiple mask "
>>>>> +                      "support: %s", ovs_strerror(error));
>>>>> +        }
>>>>>          goto out;
>>>>>      }
>>>>>
>>>>> @@ -2642,6 +2677,7 @@ probe_ct_state_support(int ifindex)
>>>>>  {
>>>>>      struct tc_flower flower;
>>>>>      uint16_t ct_state;
>>>>> +    uint64_t backoff;
>>>>>      struct tcf_id id;
>>>>>      int error;
>>>>>
>>>>> @@ -2657,8 +2693,13 @@ probe_ct_state_support(int ifindex)
>>>>>          goto out;
>>>>>      }
>>>>>
>>>>> -    error = tc_get_flower(&id, &flower);
>>>>> +    NETDEV_OFFLOAD_TC_RETRY_WITH_BACKOFF(backoff, error, error == EAGAIN,
>>>>> +                                         tc_get_flower, &id, &flower);
>>>>>      if (error || flower.mask.ct_state != ct_state) {
>>>>> +        if (error == EAGAIN) {
>>>>> +            VLOG_WARN("probe tc: unable to probe ct_state support: %s",
>>>>> +                      ovs_strerror(error));
>>>>> +        }
>>>>>          goto out_del;
>>>>>      }
>>>>>
>>>>> @@ -2670,10 +2711,16 @@ probe_ct_state_support(int ifindex)
>>>>>
>>>>>      /* Test for reject, ct_state >= MAX */
>>>>>      ct_state = ~0;
>>>>> -    error = probe_insert_ct_state_rule(ifindex, ct_state, &id);
>>>>> +    NETDEV_OFFLOAD_TC_RETRY_WITH_BACKOFF(backoff, error, error == EAGAIN,
>>>>> +                                         probe_insert_ct_state_rule, ifindex,
>>>>> +                                         ct_state, &id);
>>>>>      if (!error) {
>>>>>          /* No reject, can't continue probing other flags */
>>>>>          goto out_del;
>>>>> +    } else if (error == EAGAIN) {
>>>>> +        VLOG_WARN("probe tc: unable to probe ct_state support: %s",
>>>>> +                  ovs_strerror(error));
>>>>> +        goto out_del;
>>>>>      }
>>>>>
>>>>>      tc_del_flower_filter(&id);
>>>>> @@ -2682,8 +2729,14 @@ probe_ct_state_support(int ifindex)
>>>>>      memset(&flower, 0, sizeof flower);
>>>>>      ct_state = TCA_FLOWER_KEY_CT_FLAGS_TRACKED |
>>>>>                 TCA_FLOWER_KEY_CT_FLAGS_INVALID;
>>>>> -    error = probe_insert_ct_state_rule(ifindex, ct_state, &id);
>>>>> +    NETDEV_OFFLOAD_TC_RETRY_WITH_BACKOFF(backoff, error, error == EAGAIN,
>>>>> +                                         probe_insert_ct_state_rule, ifindex,
>>>>> +                                         ct_state, &id);
>>>>>      if (error) {
>>>>> +        if (error == EAGAIN) {
>>>>> +            VLOG_WARN("probe tc: unable to probe ct_state support: %s",
>>>>> +                      ovs_strerror(error));
>>>>> +        }
>>>>>          goto out;
>>>>>      }
>>>>>
>>>>> @@ -2695,8 +2748,14 @@ probe_ct_state_support(int ifindex)
>>>>>      ct_state = TCA_FLOWER_KEY_CT_FLAGS_TRACKED |
>>>>>                 TCA_FLOWER_KEY_CT_FLAGS_ESTABLISHED |
>>>>>                 TCA_FLOWER_KEY_CT_FLAGS_REPLY;
>>>>> -    error = probe_insert_ct_state_rule(ifindex, ct_state, &id);
>>>>> +    NETDEV_OFFLOAD_TC_RETRY_WITH_BACKOFF(backoff, error, error == EAGAIN,
>>>>> +                                         probe_insert_ct_state_rule, ifindex,
>>>>> +                                         ct_state, &id);
>>>>>      if (error) {
>>>>> +        if (error == EAGAIN) {
>>>>> +            VLOG_WARN("probe tc: unable to probe ct_state support: %s",
>>>>> +                      ovs_strerror(error));
>>>>> +        }
>>>>>          goto out;
>>>>>      }
>>>>>
>>>>> @@ -2714,6 +2773,7 @@ probe_tc_block_support(int ifindex)
>>>>>  {
>>>>>      struct tc_flower flower;
>>>>>      uint32_t block_id = 1;
>>>>> +    uint64_t backoff;
>>>>>      struct tcf_id id;
>>>>>      int prio = 0;
>>>>>      int error;
>>>>> @@ -2732,13 +2792,17 @@ probe_tc_block_support(int ifindex)
>>>>>      memset(&flower.mask.dst_mac, 0xff, sizeof flower.mask.dst_mac);
>>>>>
>>>>>      id = tc_make_tcf_id(ifindex, block_id, prio, TC_INGRESS);
>>>>> -    error = tc_replace_flower(&id, &flower);
>>>>> +    NETDEV_OFFLOAD_TC_RETRY_WITH_BACKOFF(backoff, error, error == EAGAIN,
>>>>> +                                         tc_replace_flower, &id, &flower);
>>>>>
>>>>>      tc_add_del_qdisc(ifindex, false, block_id, TC_INGRESS);
>>>>>
>>>>>      if (!error) {
>>>>>          block_support = true;
>>>>>          VLOG_INFO("probe tc: block offload is supported.");
>>>>> +    } else if (error == EAGAIN) {
>>>>> +        VLOG_WARN("probe tc: unable to probe block offload support: %s",
>>>>> +                  ovs_strerror(error));
>>>>>      }
>>>>>  }
>>>>>
>>>>> diff --git a/lib/netlink-socket.c b/lib/netlink-socket.c
>>>>> index 80da20d9f..dea060fc3 100644
>>>>> --- a/lib/netlink-socket.c
>>>>> +++ b/lib/netlink-socket.c
>>>>> @@ -1798,6 +1798,11 @@ nl_pool_release(struct nl_sock *sock)
>>>>>   *
>>>>>   *      2. Resending the request causes it to be re-executed, so the request
>>>>>   *         needs to be idempotent.
>>>>> + *
>>>>> + *      3. In the event that the kernel is too busy to handle the request to
>>>>> + *         receive the response (i.e. EAGAIN), this function will still return
>>>>> + *         0.  The caller can check for this condition by checking for a zero
>>>>> + *         size of the 'replyp' ofpbuf buffer.
>>>>>   */
>>>>>  int
>>>>>  nl_transact(int protocol, const struct ofpbuf *request,
>>>>> diff --git a/lib/tc.c b/lib/tc.c
>>>>> index 5c32c6f97..7df6a10c5 100644
>>>>> --- a/lib/tc.c
>>>>> +++ b/lib/tc.c
>>>>> @@ -237,11 +237,29 @@ static void request_from_tcf_id(struct tcf_id *id, uint16_t eth_type,
>>>>>      }
>>>>>  }
>>>>>
>>>>> +/* The `tc_transact` function is a wrapper around `nl_transact` with the
>>>>> + * addition of:
>>>>> + *
>>>>> + * 1. the 'request' ofpbuf buffer is freed after `nl_transact` returns,
>>>>> + *    regardless of success or failure.
>>>>> + *
>>>>> + * 2. When a 'replyp' pointer is provided; in the event of the kernel
>>>>> + *    being too busy to process the request for the response, a positive
>>>>> + *    error return will be provided with the value of EAGAIN.
>>>>> + *
>>>>> + * Please acquaint yourself with the documentation of the `nl_transact`
>>>>> + * function in the netlink-sock module before making use of this function.
>>>>> + */
>>>>>  int
>>>>>  tc_transact(struct ofpbuf *request, struct ofpbuf **replyp)
>>>>>  {
>>>>>      int error = nl_transact(NETLINK_ROUTE, request, replyp);
>>>>>      ofpbuf_uninit(request);
>>>>> +
>>>>> +    if (!error && replyp && !(*replyp)->size) {
>>>>

The nl_trasact() function has the following statement “ on failure '*replyp' is set to NULL.”. Looking at the code, it expects the same from tc_transact() so we might need to free the buffer here. And maybe add a note to the comment above.

>>>> I think it would be really nice to add a coverage counter for this, so we know it happened as most of it is hidden behind a debug message.
>>>
>>> Ack, good point, will add!
>>>
>>> --
>>> Frode Nordahl
>>>
>>>>> +        return EAGAIN;
>>>>> +    }
>>>>> +
>>>>>      return error;
>>>>>  }
>>>>>
>>>>> --
>>>>> 2.39.2
>>>>>
>>>>> _______________________________________________
>>>>> dev mailing list
>>>>> dev@openvswitch.org
>>>>> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
>>>>
>>
Frode Nordahl May 22, 2023, 5:33 p.m. UTC | #8
On Mon, May 22, 2023 at 4:31 PM Eelco Chaudron <echaudro@redhat.com> wrote:
>
>
>
> On 22 May 2023, at 12:32, Frode Nordahl wrote:
>
> > On Mon, May 22, 2023 at 11:33 AM Eelco Chaudron <echaudro@redhat.com> wrote:
> >>
> >>
> >>
> >> On 19 May 2023, at 12:31, Frode Nordahl wrote:
> >>
> >>> On Fri, May 19, 2023 at 11:55 AM Eelco Chaudron <echaudro@redhat.com> wrote:
> >>>>
> >>>>
> >>>>
> >>>> On 15 May 2023, at 10:04, Frode Nordahl wrote:
> >>>>
> >>>>> The tc module combines the use of the `tc_transact` helper
> >>>>> function for communication with the in-kernel tc infrastructure
> >>>>> with assertions on the reply data by `ofpbuf_at_assert` on the
> >>>>> received data prior to further processing.
> >>>>>
> >>>>> `tc_transact` in turn calls `nl_transact`, which via
> >>>>> `nl_transact_multiple__` ultimately calls and handles return
> >>>>> value from `recvmsg`.  On error a check for EAGAIN is performed
> >>>>> and a consequence of this condition is effectively to provide a
> >>>>> non-error (0) result and an empty reply buffer.
> >>>>>
> >>>>> Before this change, the `tc_transact` and, consumers of it, were
> >>>>> unaware of this condition.  The use of assertions on the reply
> >>>>> buffer can as such lead to a fatal crash of OVS.
> >>>>>
> >>>>> To be fair, the behavior of `nl_transact` when handling an EAGAIN
> >>>>> return is currently not documented, so this change also adds that
> >>>>> documentation.
> >>>>>
> >>>>> While fixing the problem, it led me to find potential problems
> >>>>> with the one-time initialization functions in the netdev-offload-tc
> >>>>> module.  Make use of the information now available about an EAGAIN
> >>>>> condition to retry one-time initialization, and resort to logging
> >>>>> a warning if probing of tc features fails due to temporary
> >>>>> situations such as resource depletion.
> >>>>>
> >>>>> For the record, the symptom of the crash is this in the log:
> >>>>> EMER|../include/openvswitch/ofpbuf.h:194: assertion offset + size <= b->size failed in ofpbuf_at_assert()
> >>>>>
> >>>>> And an excerpt of the backtrace looks like this:
> >>>>> 0x0000561dac1396d1 in ofpbuf_at_assert (b=0x7fb650005d20, b=0x7fb650005d20, offset=16, size=20) at ../include/openvswitch/ofpbuf.h:194
> >>>>> tc_replace_flower (id=<optimized out>, flower=<optimized out>) at ../lib/tc.c:3223
> >>>>> 0x0000561dac128155 in netdev_tc_flow_put (netdev=0x561dacf91840, match=<optimized out>, actions=<optimized out>, actions_len=<optimized out>,
> >>>>> ufid=<optimized out>, info=<optimized out>, stats=<optimized out>) at ../lib/netdev-offload-tc.c:2096
> >>>>> 0x0000561dac117541 in netdev_flow_put (stats=<optimized out>, info=0x7fb65b7ba780, ufid=<optimized out>, act_len=<optimized out>, actions=<optimized out>,
> >>>>> match=0x7fb65b7ba980, netdev=0x561dacf91840) at ../lib/netdev-offload.c:257
> >>>>> parse_flow_put (put=0x7fb65b7bcc50, dpif=0x561dad0ad550) at ../lib/dpif-netlink.c:2297
> >>>>> try_send_to_netdev (op=0x7fb65b7bcc48, dpif=0x561dad0ad550) at ../lib/dpif-netlink.c:2384
> >>>>
> >>>> Thanks for this patch. See some comments inline below.
> >>>
> >>> Hello, Eelco,
> >>>
> >>> Thank you for taking the time to review, much appreciated! I'll
> >>> respond inline as well.
> >>
> >> See my responses inline again ;) I’m wondering if you know why we would fail with EAGAIN on a nonblocking socket? Is it TC or NetLink-specific?
> >
> > The recvmsg call [0] is made with the MSG_DONTWAIT flag, and
> > a return of EAGAIN is then to be expected whenever the call
> > would block [1].
> >
> > So I don't think this is TC or netlink specific, this is just the
> > semantics of a nonblocking socket.
> >
> > 0: https://github.com/openvswitch/ovs/blob/8045c0f8de5192355ca438ed7eef77457c3c1625/lib/netlink-socket.c#L712
> > 1: https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/tree/man2/recv.2#n113
>
> I should have drank more coffee as wait != MSG_DONTWAIT :)

Good that we agree, I was wondering if we were talking past each other :-)

> So peeking at some kernel code, the most obvious way for this to happen is if the driver is still busy processing the request. Did you see this with specific offload hardware, or even without hw offloading?

The crash occurs regularly in a production environment with offload to
real hardware, a CX-6 DX card to be specific. Setting
other_config:hw-offload=false and restarting Open vSwitch makes it stop.

We have so far been unable to reproduce this in a lab environment, but that
is most likely due to limitations with the lab equipment.

In the production environment there are mostly regular virtio instances, so my
hunch is that OVS is attempting to offload a lot of stuff that is not
offloadable,
leading to the condition we see here.

So there will be a follow-up investigation to figure out and resolve that issue,
but we think it would be most pertinent to prevent the crash from happening
before we continue on that journey.

> I would still like the specific netdev_tc_flow_get() to be of level error (they are protected by a max 5 a minute), as it interacts with the revalidator (but potentially only at ovs restart). In theory netdev_tc_flow_put() might also be a candidate for logging, but there is none, so the coverage counter is what we will have for now…

Ack.

> See one more comment below on a potential memory leak…
>
> >>>> Cheers,
> >>>>
> >>>> Eelco
> >>>>
> >>>>> Reported-At: https://launchpad.net/bugs/2018500
> >>>>> Fixes f98e418fbdb6 (tc: Add tc flower functions)
> >>>>> Fixes 407556ac6c90 (netlink-socket: Avoid forcing a reply for final message in a transaction.)
> >>>
> >>> I see now that I've forgotten a `:` after `Fixes` which makes
> >>> patchwork not account for
> >>> them, I'll fix that in the next iteration
> >>>
> >>>>> Signed-off-by: Frode Nordahl <frode.nordahl@canonical.com>
> >>>>> ---
> >>>>>  lib/dpif-netlink.c      |  4 +-
> >>>>>  lib/dpif.c              |  2 +-
> >>>>>  lib/netdev-linux.c      | 34 +++++++++-------
> >>>>>  lib/netdev-offload-tc.c | 86 +++++++++++++++++++++++++++++++++++------
> >>>>>  lib/netlink-socket.c    |  5 +++
> >>>>>  lib/tc.c                | 18 +++++++++
> >>>>>  6 files changed, 122 insertions(+), 27 deletions(-)
> >>>>>
> >>>>> diff --git a/lib/dpif-netlink.c b/lib/dpif-netlink.c
> >>>>> index 60bd39643..75537b2f8 100644
> >>>>> --- a/lib/dpif-netlink.c
> >>>>> +++ b/lib/dpif-netlink.c
> >>>>> @@ -2340,7 +2340,9 @@ parse_flow_put(struct dpif_netlink *dpif, struct dpif_flow_put *put)
> >>>>>              }
> >>>>>              netdev_set_hw_info(oor_netdev, HW_INFO_TYPE_OOR, true);
> >>>>>          }
> >>>>> -        level = (err == ENOSPC || err == EOPNOTSUPP) ? VLL_DBG : VLL_ERR;
> >>>>> +        level = (err == EAGAIN ||
> >>>>> +                 err == ENOSPC ||
> >>>>> +                 err == EOPNOTSUPP) ? VLL_DBG : VLL_ERR;
> >>>>>          VLOG_RL(&rl, level, "failed to offload flow: %s: %s",
> >>>>>                  ovs_strerror(err),
> >>>>>                  (oor_netdev ? oor_netdev->name : dev->name));
> >>>>
> >>>> Do we want a debug for EAGAIN here? I think it could as well be an error, as we failed to offload a flow due to a busy system?
> >>>
> >>> The reasoning is that the flow will be attempted offloaded again, and if the
> >>> system is already busy, adding to the load by possibly rampant logging would
> >>> not help on the system load?
> >>>
> >>> Do you think this condition is not to be expected with this subsystem?
> >>
> >> As you mentioned, we did not log it before so we might not know. But I would expect it not to happen often, if it does, the system might not have enough resources to begin with. Anyway, we have a rate limit of five messages per minute.
> >>
> >> Do we know why the kernel would return EAGAIN, i.e. what is TC busy with to return this? The socket is not nonblocking, so it must be TC-specific I guess.
> >
> > See above.
> >
> >>>>> diff --git a/lib/dpif.c b/lib/dpif.c
> >>>>> index 3305401fe..2bbb0aac4 100644
> >>>>> --- a/lib/dpif.c
> >>>>> +++ b/lib/dpif.c
> >>>>> @@ -1751,7 +1751,7 @@ flow_message_log_level(int error)
> >>>>>       * Kernels that support flow wildcarding will reject these flows as
> >>>>>       * duplicates (EEXIST), so lower the log level to debug for these
> >>>>>       * types of messages. */
> >>>>> -    return (error && error != EEXIST) ? VLL_WARN : VLL_DBG;
> >>>>> +    return (error && error != EEXIST && error != EAGAIN) ? VLL_WARN : VLL_DBG;
> >>>>
> >>>> Just a bit, but as the above comment explicitly mentions why EEXIST should be DBG level, we might want to add a word on why EAGAIN should also be.
> >>>
> >>> Same reasoning for suppressing EAGAIN as above, and it surely deserves a
> >>> comment, thank you for pointing it out!
> >>>
> >>>> In addition same as above, i.e., do we want a debug for EAGAIN here?
> >>>
> >>> The reason for adding it here is that as we discovered the `netlink-socket`
> >>> module is currently suppressing these in a rather surprising manner, but there
> >>> was probably a reason for the suppression.
> >>
> >> Same as above, do we know why we could get EAGAIN from the kernel? Is it NetLink specific or TC?
> >
> > See above.
> >
> >>> Now that we are exposing a new temporary error condition in the `tc` module,
> >>> many of the call paths end up in higher level dpif callbacks.  I'm approaching
> >>> this with a bit of conservative safety in mind, squelching the logging down to
> >>> DBG level to ensure that we don't cause any issues for existing systems or
> >>> tests.
> >>
> >> But I agree for this one we might want to keep it debug as it will log upto 60 a minute :(
> >
> > ++
> >
> >>>>>  }
> >>>>>
> >>>>>  static bool
> >>>>> diff --git a/lib/netdev-linux.c b/lib/netdev-linux.c
> >>>>> index 36620199e..a0294ca00 100644
> >>>>> --- a/lib/netdev-linux.c
> >>>>> +++ b/lib/netdev-linux.c
> >>>>> @@ -5817,8 +5817,9 @@ tc_get_policer_action(uint32_t index, struct ofputil_meter_stats *stats)
> >>>>>
> >>>>>      error = tc_transact(&request, &replyp);
> >>>>>      if (error) {
> >>>>> -        VLOG_ERR_RL(&rl, "Failed to dump police action (index: %u), err=%d",
> >>>>> -                    index, error);
> >>>>> +        VLOG_RL(&rl, error == EAGAIN ? VLL_DBG : VLL_ERR,
> >>>>> +                "Failed to dump police action (index: %u), err=%d",
> >>>>> +                index, error);
> >>>>>          return error;
> >>>>>      }
> >>>>>
> >>>>> @@ -5849,8 +5850,9 @@ tc_del_policer_action(uint32_t index, struct ofputil_meter_stats *stats)
> >>>>>
> >>>>>      error = tc_transact(&request, &replyp);
> >>>>>      if (error) {
> >>>>> -        VLOG_ERR_RL(&rl, "Failed to delete police action (index: %u), err=%d",
> >>>>> -                    index, error);
> >>>>> +        VLOG_RL(&rl, error == EAGAIN ? VLL_DBG : VLL_ERR,
> >>>>> +                "Failed to delete police action (index: %u), err=%d",
> >>>>> +                index, error);
> >>>>>          return error;
> >>>>>      }
> >>>>
> >>>> The two above will still result in a warning by upper-level callers, so dbg is fine.
> >>>>
> >>>>> @@ -6114,11 +6116,12 @@ tc_query_class(const struct netdev *netdev,
> >>>>>
> >>>>>      error = tc_transact(&request, replyp);
> >>>>>      if (error) {
> >>>>> -        VLOG_WARN_RL(&rl, "query %s class %u:%u (parent %u:%u) failed (%s)",
> >>>>> -                     netdev_get_name(netdev),
> >>>>> -                     tc_get_major(handle), tc_get_minor(handle),
> >>>>> -                     tc_get_major(parent), tc_get_minor(parent),
> >>>>> -                     ovs_strerror(error));
> >>>>> +        VLOG_RL(&rl, error == EAGAIN ? VLL_DBG : VLL_WARN,
> >>>>> +                "query %s class %u:%u (parent %u:%u) failed (%s)",
> >>>>> +                netdev_get_name(netdev),
> >>>>> +                tc_get_major(handle), tc_get_minor(handle),
> >>>>> +                tc_get_major(parent), tc_get_minor(parent),
> >>>>> +                ovs_strerror(error));
> >>>>>      }
> >>>>>      return error;
> >>>>>  }
> >>>>> @@ -6140,10 +6143,11 @@ tc_delete_class(const struct netdev *netdev, unsigned int handle)
> >>>>>
> >>>>>      error = tc_transact(&request, NULL);
> >>>>>      if (error) {
> >>>>> -        VLOG_WARN_RL(&rl, "delete %s class %u:%u failed (%s)",
> >>>>> -                     netdev_get_name(netdev),
> >>>>> -                     tc_get_major(handle), tc_get_minor(handle),
> >>>>> -                     ovs_strerror(error));
> >>>>> +        VLOG_RL(&rl, error == EAGAIN ? VLL_DBG : VLL_WARN,
> >>>>> +                "delete %s class %u:%u failed (%s)",
> >>>>> +                netdev_get_name(netdev),
> >>>>> +                tc_get_major(handle), tc_get_minor(handle),
> >>>>> +                ovs_strerror(error));
> >>>>>      }
> >>>>>      return error;
> >>>>>  }
> >>>>> @@ -6262,7 +6266,9 @@ tc_query_qdisc(const struct netdev *netdev_)
> >>>>>                  ops = &tc_ops_other;
> >>>>>              }
> >>>>>          }
> >>>>> -    } else if ((!error && !qdisc->size) || error == ENOENT) {
> >>>>> +    } else if ((!error && !qdisc->size) ||
> >>>>> +               error == ENOENT || error == EAGAIN)
> >>>>
> >>>> Is the (!error && !qdisc->size) not the same as EAGAIN is now? If so, we could remove the (!error && !qdisc->size).
> >>>> Same for a couple of lines above where we do ‘if (!error && qdisc->size)’ which can now be just !error.
> >>>
> >>> Ack.
> >>>
> >>>>> +    {
> >>>>>          /* Either it's a built-in qdisc, or (on Linux pre-2.6.35) it's a qdisc
> >>>>>           * set up by some other entity that doesn't have a handle 1:0.  We will
> >>>>>           * assume that it's the system default qdisc. */
> >>>>> diff --git a/lib/netdev-offload-tc.c b/lib/netdev-offload-tc.c
> >>>>> index 4f26dd8cc..4fb9dc239 100644
> >>>>> --- a/lib/netdev-offload-tc.c
> >>>>> +++ b/lib/netdev-offload-tc.c
> >>>>> @@ -2505,10 +2505,10 @@ netdev_tc_flow_get(struct netdev *netdev,
> >>>>>
> >>>>>      err = tc_get_flower(&id, &flower);
> >>>>>      if (err) {
> >>>>> -        VLOG_ERR_RL(&error_rl,
> >>>>> -                    "flow get failed (dev %s prio %d handle %d): %s",
> >>>>> -                    netdev_get_name(netdev), id.prio, id.handle,
> >>>>> -                    ovs_strerror(err));
> >>>>> +        VLOG_RL(&error_rl, err == EAGAIN ? VLL_DBG : VLL_ERR,
> >>>>
> >>>> Do we want a debug for EAGAIN here? I think it could as well be an error/warning as the system was busy?
> >>>
> >>> I'd lean towards yes, given short periods of the system being busy and returning
> >>> EAGAIN on calls is to be expected, and rampant logging about it when it happens
> >>> would not help with the load?
> >>>
> >>> Do you think this condition is not to be expected with this subsystem?
> >>
> >> I think with 5 errors per second, it should be no problem, and it will give visibility into why stats might be off.
> >>
> >> If you do like to keep this change, the error_rl should also be replaced with rl for debug messages.
> >
> > Ack
> >
> >>>>> +                "flow get failed (dev %s prio %d handle %d): %s",
> >>>>> +                netdev_get_name(netdev), id.prio, id.handle,
> >>>>> +                ovs_strerror(err));
> >>>>>          return err;
> >>>>>      }
> >>>>>
> >>>>> @@ -2571,11 +2571,36 @@ netdev_tc_get_n_flows(struct netdev *netdev, uint64_t *n_flows)
> >>>>>      return 0;
> >>>>>  }
> >>>>>
> >>>>> +/* This macro is for use by one-time initialization functions, where we have
> >>>>> + * one shot per thread/process to perform a pertinent initialization task that
> >>>>> + * may return a temporary error (EAGAIN).
> >>>>> + *
> >>>>> + * With min/max values of 1/64 we would retry 7 times, spending at the
> >>>>> + * most 127 * 1E6 nsec (0.127s) sleeping.
> >>>>> + */
> >>>>> +#define NETDEV_OFFLOAD_TC_BACKOFF_MIN 1
> >>>>> +#define NETDEV_OFFLOAD_TC_BACKOFF_MAX 64
> >>>>> +#define NETDEV_OFFLOAD_TC_RETRY_WITH_BACKOFF(BACKOFF, ERROR, CONDITION,       \
> >>>>
> >>>> I think it would be nice to define the backoff variable inside the macro as a local variable, as we do not need it outside the macro.
> >>>
> >>> Ack, was worried about the risk of unexpectedly redefining/shadowing a variable
> >>> already defined by the consumer of the macro, but I guess the compiler will
> >>> be helpful in pointing this out.  Will fix it.
> >>
> >> Thanks
> >
> > --
> > Frode Nordahl
> >
> >>
> >>>>> +                                             FUNCTION, ...)                   \
> >>>>> +    BACKOFF = NETDEV_OFFLOAD_TC_BACKOFF_MIN;                                  \
> >>>>> +    do {                                                                      \
> >>>>> +        ERROR = (FUNCTION)(__VA_ARGS__);                                      \
> >>>>> +        if (CONDITION) {                                                      \
> >>>>> +            xnanosleep(BACKOFF * 1E6);                                        \
> >>>>> +            if (BACKOFF < NETDEV_OFFLOAD_TC_BACKOFF_MAX) {                    \
> >>>>> +                BACKOFF <<= 1;                                                \
> >>>>> +            } else {                                                          \
> >>>>> +                break;                                                        \
> >>>>> +            }                                                                 \
> >>>>> +        }                                                                     \
> >>>>> +    } while (CONDITION);
> >>>>> +
> >>>>>  static void
> >>>>>  probe_multi_mask_per_prio(int ifindex)
> >>>>>  {
> >>>>>      struct tc_flower flower;
> >>>>>      struct tcf_id id1, id2;
> >>>>> +    uint64_t backoff;
> >>>>>      int block_id = 0;
> >>>>>      int prio = 1;
> >>>>>      int error;
> >>>>> @@ -2594,8 +2619,13 @@ probe_multi_mask_per_prio(int ifindex)
> >>>>>      memset(&flower.mask.dst_mac, 0xff, sizeof flower.mask.dst_mac);
> >>>>>
> >>>>>      id1 = tc_make_tcf_id(ifindex, block_id, prio, TC_INGRESS);
> >>>>> -    error = tc_replace_flower(&id1, &flower);
> >>>>> +    NETDEV_OFFLOAD_TC_RETRY_WITH_BACKOFF(backoff, error, error == EAGAIN,
> >>>>> +                                         tc_replace_flower, &id1, &flower);
> >>>>>      if (error) {
> >>>>> +        if (error == EAGAIN) {
> >>>>> +            VLOG_WARN("probe tc: unable to probe for multiple mask "
> >>>>> +                      "support: %s", ovs_strerror(error));
> >>>>> +        }
> >>>>>          goto out;
> >>>>>      }
> >>>>>
> >>>>> @@ -2603,10 +2633,15 @@ probe_multi_mask_per_prio(int ifindex)
> >>>>>      memset(&flower.mask.src_mac, 0xff, sizeof flower.mask.src_mac);
> >>>>>
> >>>>>      id2 = tc_make_tcf_id(ifindex, block_id, prio, TC_INGRESS);
> >>>>> -    error = tc_replace_flower(&id2, &flower);
> >>>>> +    NETDEV_OFFLOAD_TC_RETRY_WITH_BACKOFF(backoff, error, error == EAGAIN,
> >>>>> +                                         tc_replace_flower, &id2, &flower);
> >>>>>      tc_del_flower_filter(&id1);
> >>>>>
> >>>>>      if (error) {
> >>>>> +        if (error == EAGAIN) {
> >>>>> +            VLOG_WARN("probe tc: unable to probe for multiple mask "
> >>>>> +                      "support: %s", ovs_strerror(error));
> >>>>> +        }
> >>>>>          goto out;
> >>>>>      }
> >>>>>
> >>>>> @@ -2642,6 +2677,7 @@ probe_ct_state_support(int ifindex)
> >>>>>  {
> >>>>>      struct tc_flower flower;
> >>>>>      uint16_t ct_state;
> >>>>> +    uint64_t backoff;
> >>>>>      struct tcf_id id;
> >>>>>      int error;
> >>>>>
> >>>>> @@ -2657,8 +2693,13 @@ probe_ct_state_support(int ifindex)
> >>>>>          goto out;
> >>>>>      }
> >>>>>
> >>>>> -    error = tc_get_flower(&id, &flower);
> >>>>> +    NETDEV_OFFLOAD_TC_RETRY_WITH_BACKOFF(backoff, error, error == EAGAIN,
> >>>>> +                                         tc_get_flower, &id, &flower);
> >>>>>      if (error || flower.mask.ct_state != ct_state) {
> >>>>> +        if (error == EAGAIN) {
> >>>>> +            VLOG_WARN("probe tc: unable to probe ct_state support: %s",
> >>>>> +                      ovs_strerror(error));
> >>>>> +        }
> >>>>>          goto out_del;
> >>>>>      }
> >>>>>
> >>>>> @@ -2670,10 +2711,16 @@ probe_ct_state_support(int ifindex)
> >>>>>
> >>>>>      /* Test for reject, ct_state >= MAX */
> >>>>>      ct_state = ~0;
> >>>>> -    error = probe_insert_ct_state_rule(ifindex, ct_state, &id);
> >>>>> +    NETDEV_OFFLOAD_TC_RETRY_WITH_BACKOFF(backoff, error, error == EAGAIN,
> >>>>> +                                         probe_insert_ct_state_rule, ifindex,
> >>>>> +                                         ct_state, &id);
> >>>>>      if (!error) {
> >>>>>          /* No reject, can't continue probing other flags */
> >>>>>          goto out_del;
> >>>>> +    } else if (error == EAGAIN) {
> >>>>> +        VLOG_WARN("probe tc: unable to probe ct_state support: %s",
> >>>>> +                  ovs_strerror(error));
> >>>>> +        goto out_del;
> >>>>>      }
> >>>>>
> >>>>>      tc_del_flower_filter(&id);
> >>>>> @@ -2682,8 +2729,14 @@ probe_ct_state_support(int ifindex)
> >>>>>      memset(&flower, 0, sizeof flower);
> >>>>>      ct_state = TCA_FLOWER_KEY_CT_FLAGS_TRACKED |
> >>>>>                 TCA_FLOWER_KEY_CT_FLAGS_INVALID;
> >>>>> -    error = probe_insert_ct_state_rule(ifindex, ct_state, &id);
> >>>>> +    NETDEV_OFFLOAD_TC_RETRY_WITH_BACKOFF(backoff, error, error == EAGAIN,
> >>>>> +                                         probe_insert_ct_state_rule, ifindex,
> >>>>> +                                         ct_state, &id);
> >>>>>      if (error) {
> >>>>> +        if (error == EAGAIN) {
> >>>>> +            VLOG_WARN("probe tc: unable to probe ct_state support: %s",
> >>>>> +                      ovs_strerror(error));
> >>>>> +        }
> >>>>>          goto out;
> >>>>>      }
> >>>>>
> >>>>> @@ -2695,8 +2748,14 @@ probe_ct_state_support(int ifindex)
> >>>>>      ct_state = TCA_FLOWER_KEY_CT_FLAGS_TRACKED |
> >>>>>                 TCA_FLOWER_KEY_CT_FLAGS_ESTABLISHED |
> >>>>>                 TCA_FLOWER_KEY_CT_FLAGS_REPLY;
> >>>>> -    error = probe_insert_ct_state_rule(ifindex, ct_state, &id);
> >>>>> +    NETDEV_OFFLOAD_TC_RETRY_WITH_BACKOFF(backoff, error, error == EAGAIN,
> >>>>> +                                         probe_insert_ct_state_rule, ifindex,
> >>>>> +                                         ct_state, &id);
> >>>>>      if (error) {
> >>>>> +        if (error == EAGAIN) {
> >>>>> +            VLOG_WARN("probe tc: unable to probe ct_state support: %s",
> >>>>> +                      ovs_strerror(error));
> >>>>> +        }
> >>>>>          goto out;
> >>>>>      }
> >>>>>
> >>>>> @@ -2714,6 +2773,7 @@ probe_tc_block_support(int ifindex)
> >>>>>  {
> >>>>>      struct tc_flower flower;
> >>>>>      uint32_t block_id = 1;
> >>>>> +    uint64_t backoff;
> >>>>>      struct tcf_id id;
> >>>>>      int prio = 0;
> >>>>>      int error;
> >>>>> @@ -2732,13 +2792,17 @@ probe_tc_block_support(int ifindex)
> >>>>>      memset(&flower.mask.dst_mac, 0xff, sizeof flower.mask.dst_mac);
> >>>>>
> >>>>>      id = tc_make_tcf_id(ifindex, block_id, prio, TC_INGRESS);
> >>>>> -    error = tc_replace_flower(&id, &flower);
> >>>>> +    NETDEV_OFFLOAD_TC_RETRY_WITH_BACKOFF(backoff, error, error == EAGAIN,
> >>>>> +                                         tc_replace_flower, &id, &flower);
> >>>>>
> >>>>>      tc_add_del_qdisc(ifindex, false, block_id, TC_INGRESS);
> >>>>>
> >>>>>      if (!error) {
> >>>>>          block_support = true;
> >>>>>          VLOG_INFO("probe tc: block offload is supported.");
> >>>>> +    } else if (error == EAGAIN) {
> >>>>> +        VLOG_WARN("probe tc: unable to probe block offload support: %s",
> >>>>> +                  ovs_strerror(error));
> >>>>>      }
> >>>>>  }
> >>>>>
> >>>>> diff --git a/lib/netlink-socket.c b/lib/netlink-socket.c
> >>>>> index 80da20d9f..dea060fc3 100644
> >>>>> --- a/lib/netlink-socket.c
> >>>>> +++ b/lib/netlink-socket.c
> >>>>> @@ -1798,6 +1798,11 @@ nl_pool_release(struct nl_sock *sock)
> >>>>>   *
> >>>>>   *      2. Resending the request causes it to be re-executed, so the request
> >>>>>   *         needs to be idempotent.
> >>>>> + *
> >>>>> + *      3. In the event that the kernel is too busy to handle the request to
> >>>>> + *         receive the response (i.e. EAGAIN), this function will still return
> >>>>> + *         0.  The caller can check for this condition by checking for a zero
> >>>>> + *         size of the 'replyp' ofpbuf buffer.
> >>>>>   */
> >>>>>  int
> >>>>>  nl_transact(int protocol, const struct ofpbuf *request,
> >>>>> diff --git a/lib/tc.c b/lib/tc.c
> >>>>> index 5c32c6f97..7df6a10c5 100644
> >>>>> --- a/lib/tc.c
> >>>>> +++ b/lib/tc.c
> >>>>> @@ -237,11 +237,29 @@ static void request_from_tcf_id(struct tcf_id *id, uint16_t eth_type,
> >>>>>      }
> >>>>>  }
> >>>>>
> >>>>> +/* The `tc_transact` function is a wrapper around `nl_transact` with the
> >>>>> + * addition of:
> >>>>> + *
> >>>>> + * 1. the 'request' ofpbuf buffer is freed after `nl_transact` returns,
> >>>>> + *    regardless of success or failure.
> >>>>> + *
> >>>>> + * 2. When a 'replyp' pointer is provided; in the event of the kernel
> >>>>> + *    being too busy to process the request for the response, a positive
> >>>>> + *    error return will be provided with the value of EAGAIN.
> >>>>> + *
> >>>>> + * Please acquaint yourself with the documentation of the `nl_transact`
> >>>>> + * function in the netlink-sock module before making use of this function.
> >>>>> + */
> >>>>>  int
> >>>>>  tc_transact(struct ofpbuf *request, struct ofpbuf **replyp)
> >>>>>  {
> >>>>>      int error = nl_transact(NETLINK_ROUTE, request, replyp);
> >>>>>      ofpbuf_uninit(request);
> >>>>> +
> >>>>> +    if (!error && replyp && !(*replyp)->size) {
> >>>>
>
> The nl_trasact() function has the following statement “ on failure '*replyp' is set to NULL.”. Looking at the code, it expects the same from tc_transact() so we might need to free the buffer here. And maybe add a note to the comment above.

Thank you for pointing that out. It did not occur to me since
`nl_transact` currently
handles the EAGAIN as a non-error, but once we change that for the callers of
`tc_transct` we might indeed need to handle it differently. Will check
and update
according to findings.
diff mbox series

Patch

diff --git a/lib/dpif-netlink.c b/lib/dpif-netlink.c
index 60bd39643..75537b2f8 100644
--- a/lib/dpif-netlink.c
+++ b/lib/dpif-netlink.c
@@ -2340,7 +2340,9 @@  parse_flow_put(struct dpif_netlink *dpif, struct dpif_flow_put *put)
             }
             netdev_set_hw_info(oor_netdev, HW_INFO_TYPE_OOR, true);
         }
-        level = (err == ENOSPC || err == EOPNOTSUPP) ? VLL_DBG : VLL_ERR;
+        level = (err == EAGAIN ||
+                 err == ENOSPC ||
+                 err == EOPNOTSUPP) ? VLL_DBG : VLL_ERR;
         VLOG_RL(&rl, level, "failed to offload flow: %s: %s",
                 ovs_strerror(err),
                 (oor_netdev ? oor_netdev->name : dev->name));
diff --git a/lib/dpif.c b/lib/dpif.c
index 3305401fe..2bbb0aac4 100644
--- a/lib/dpif.c
+++ b/lib/dpif.c
@@ -1751,7 +1751,7 @@  flow_message_log_level(int error)
      * Kernels that support flow wildcarding will reject these flows as
      * duplicates (EEXIST), so lower the log level to debug for these
      * types of messages. */
-    return (error && error != EEXIST) ? VLL_WARN : VLL_DBG;
+    return (error && error != EEXIST && error != EAGAIN) ? VLL_WARN : VLL_DBG;
 }
 
 static bool
diff --git a/lib/netdev-linux.c b/lib/netdev-linux.c
index 36620199e..a0294ca00 100644
--- a/lib/netdev-linux.c
+++ b/lib/netdev-linux.c
@@ -5817,8 +5817,9 @@  tc_get_policer_action(uint32_t index, struct ofputil_meter_stats *stats)
 
     error = tc_transact(&request, &replyp);
     if (error) {
-        VLOG_ERR_RL(&rl, "Failed to dump police action (index: %u), err=%d",
-                    index, error);
+        VLOG_RL(&rl, error == EAGAIN ? VLL_DBG : VLL_ERR,
+                "Failed to dump police action (index: %u), err=%d",
+                index, error);
         return error;
     }
 
@@ -5849,8 +5850,9 @@  tc_del_policer_action(uint32_t index, struct ofputil_meter_stats *stats)
 
     error = tc_transact(&request, &replyp);
     if (error) {
-        VLOG_ERR_RL(&rl, "Failed to delete police action (index: %u), err=%d",
-                    index, error);
+        VLOG_RL(&rl, error == EAGAIN ? VLL_DBG : VLL_ERR,
+                "Failed to delete police action (index: %u), err=%d",
+                index, error);
         return error;
     }
 
@@ -6114,11 +6116,12 @@  tc_query_class(const struct netdev *netdev,
 
     error = tc_transact(&request, replyp);
     if (error) {
-        VLOG_WARN_RL(&rl, "query %s class %u:%u (parent %u:%u) failed (%s)",
-                     netdev_get_name(netdev),
-                     tc_get_major(handle), tc_get_minor(handle),
-                     tc_get_major(parent), tc_get_minor(parent),
-                     ovs_strerror(error));
+        VLOG_RL(&rl, error == EAGAIN ? VLL_DBG : VLL_WARN,
+                "query %s class %u:%u (parent %u:%u) failed (%s)",
+                netdev_get_name(netdev),
+                tc_get_major(handle), tc_get_minor(handle),
+                tc_get_major(parent), tc_get_minor(parent),
+                ovs_strerror(error));
     }
     return error;
 }
@@ -6140,10 +6143,11 @@  tc_delete_class(const struct netdev *netdev, unsigned int handle)
 
     error = tc_transact(&request, NULL);
     if (error) {
-        VLOG_WARN_RL(&rl, "delete %s class %u:%u failed (%s)",
-                     netdev_get_name(netdev),
-                     tc_get_major(handle), tc_get_minor(handle),
-                     ovs_strerror(error));
+        VLOG_RL(&rl, error == EAGAIN ? VLL_DBG : VLL_WARN,
+                "delete %s class %u:%u failed (%s)",
+                netdev_get_name(netdev),
+                tc_get_major(handle), tc_get_minor(handle),
+                ovs_strerror(error));
     }
     return error;
 }
@@ -6262,7 +6266,9 @@  tc_query_qdisc(const struct netdev *netdev_)
                 ops = &tc_ops_other;
             }
         }
-    } else if ((!error && !qdisc->size) || error == ENOENT) {
+    } else if ((!error && !qdisc->size) ||
+               error == ENOENT || error == EAGAIN)
+    {
         /* Either it's a built-in qdisc, or (on Linux pre-2.6.35) it's a qdisc
          * set up by some other entity that doesn't have a handle 1:0.  We will
          * assume that it's the system default qdisc. */
diff --git a/lib/netdev-offload-tc.c b/lib/netdev-offload-tc.c
index 4f26dd8cc..4fb9dc239 100644
--- a/lib/netdev-offload-tc.c
+++ b/lib/netdev-offload-tc.c
@@ -2505,10 +2505,10 @@  netdev_tc_flow_get(struct netdev *netdev,
 
     err = tc_get_flower(&id, &flower);
     if (err) {
-        VLOG_ERR_RL(&error_rl,
-                    "flow get failed (dev %s prio %d handle %d): %s",
-                    netdev_get_name(netdev), id.prio, id.handle,
-                    ovs_strerror(err));
+        VLOG_RL(&error_rl, err == EAGAIN ? VLL_DBG : VLL_ERR,
+                "flow get failed (dev %s prio %d handle %d): %s",
+                netdev_get_name(netdev), id.prio, id.handle,
+                ovs_strerror(err));
         return err;
     }
 
@@ -2571,11 +2571,36 @@  netdev_tc_get_n_flows(struct netdev *netdev, uint64_t *n_flows)
     return 0;
 }
 
+/* This macro is for use by one-time initialization functions, where we have
+ * one shot per thread/process to perform a pertinent initialization task that
+ * may return a temporary error (EAGAIN).
+ *
+ * With min/max values of 1/64 we would retry 7 times, spending at the
+ * most 127 * 1E6 nsec (0.127s) sleeping.
+ */
+#define NETDEV_OFFLOAD_TC_BACKOFF_MIN 1
+#define NETDEV_OFFLOAD_TC_BACKOFF_MAX 64
+#define NETDEV_OFFLOAD_TC_RETRY_WITH_BACKOFF(BACKOFF, ERROR, CONDITION,       \
+                                             FUNCTION, ...)                   \
+    BACKOFF = NETDEV_OFFLOAD_TC_BACKOFF_MIN;                                  \
+    do {                                                                      \
+        ERROR = (FUNCTION)(__VA_ARGS__);                                      \
+        if (CONDITION) {                                                      \
+            xnanosleep(BACKOFF * 1E6);                                        \
+            if (BACKOFF < NETDEV_OFFLOAD_TC_BACKOFF_MAX) {                    \
+                BACKOFF <<= 1;                                                \
+            } else {                                                          \
+                break;                                                        \
+            }                                                                 \
+        }                                                                     \
+    } while (CONDITION);
+
 static void
 probe_multi_mask_per_prio(int ifindex)
 {
     struct tc_flower flower;
     struct tcf_id id1, id2;
+    uint64_t backoff;
     int block_id = 0;
     int prio = 1;
     int error;
@@ -2594,8 +2619,13 @@  probe_multi_mask_per_prio(int ifindex)
     memset(&flower.mask.dst_mac, 0xff, sizeof flower.mask.dst_mac);
 
     id1 = tc_make_tcf_id(ifindex, block_id, prio, TC_INGRESS);
-    error = tc_replace_flower(&id1, &flower);
+    NETDEV_OFFLOAD_TC_RETRY_WITH_BACKOFF(backoff, error, error == EAGAIN,
+                                         tc_replace_flower, &id1, &flower);
     if (error) {
+        if (error == EAGAIN) {
+            VLOG_WARN("probe tc: unable to probe for multiple mask "
+                      "support: %s", ovs_strerror(error));
+        }
         goto out;
     }
 
@@ -2603,10 +2633,15 @@  probe_multi_mask_per_prio(int ifindex)
     memset(&flower.mask.src_mac, 0xff, sizeof flower.mask.src_mac);
 
     id2 = tc_make_tcf_id(ifindex, block_id, prio, TC_INGRESS);
-    error = tc_replace_flower(&id2, &flower);
+    NETDEV_OFFLOAD_TC_RETRY_WITH_BACKOFF(backoff, error, error == EAGAIN,
+                                         tc_replace_flower, &id2, &flower);
     tc_del_flower_filter(&id1);
 
     if (error) {
+        if (error == EAGAIN) {
+            VLOG_WARN("probe tc: unable to probe for multiple mask "
+                      "support: %s", ovs_strerror(error));
+        }
         goto out;
     }
 
@@ -2642,6 +2677,7 @@  probe_ct_state_support(int ifindex)
 {
     struct tc_flower flower;
     uint16_t ct_state;
+    uint64_t backoff;
     struct tcf_id id;
     int error;
 
@@ -2657,8 +2693,13 @@  probe_ct_state_support(int ifindex)
         goto out;
     }
 
-    error = tc_get_flower(&id, &flower);
+    NETDEV_OFFLOAD_TC_RETRY_WITH_BACKOFF(backoff, error, error == EAGAIN,
+                                         tc_get_flower, &id, &flower);
     if (error || flower.mask.ct_state != ct_state) {
+        if (error == EAGAIN) {
+            VLOG_WARN("probe tc: unable to probe ct_state support: %s",
+                      ovs_strerror(error));
+        }
         goto out_del;
     }
 
@@ -2670,10 +2711,16 @@  probe_ct_state_support(int ifindex)
 
     /* Test for reject, ct_state >= MAX */
     ct_state = ~0;
-    error = probe_insert_ct_state_rule(ifindex, ct_state, &id);
+    NETDEV_OFFLOAD_TC_RETRY_WITH_BACKOFF(backoff, error, error == EAGAIN,
+                                         probe_insert_ct_state_rule, ifindex,
+                                         ct_state, &id);
     if (!error) {
         /* No reject, can't continue probing other flags */
         goto out_del;
+    } else if (error == EAGAIN) {
+        VLOG_WARN("probe tc: unable to probe ct_state support: %s",
+                  ovs_strerror(error));
+        goto out_del;
     }
 
     tc_del_flower_filter(&id);
@@ -2682,8 +2729,14 @@  probe_ct_state_support(int ifindex)
     memset(&flower, 0, sizeof flower);
     ct_state = TCA_FLOWER_KEY_CT_FLAGS_TRACKED |
                TCA_FLOWER_KEY_CT_FLAGS_INVALID;
-    error = probe_insert_ct_state_rule(ifindex, ct_state, &id);
+    NETDEV_OFFLOAD_TC_RETRY_WITH_BACKOFF(backoff, error, error == EAGAIN,
+                                         probe_insert_ct_state_rule, ifindex,
+                                         ct_state, &id);
     if (error) {
+        if (error == EAGAIN) {
+            VLOG_WARN("probe tc: unable to probe ct_state support: %s",
+                      ovs_strerror(error));
+        }
         goto out;
     }
 
@@ -2695,8 +2748,14 @@  probe_ct_state_support(int ifindex)
     ct_state = TCA_FLOWER_KEY_CT_FLAGS_TRACKED |
                TCA_FLOWER_KEY_CT_FLAGS_ESTABLISHED |
                TCA_FLOWER_KEY_CT_FLAGS_REPLY;
-    error = probe_insert_ct_state_rule(ifindex, ct_state, &id);
+    NETDEV_OFFLOAD_TC_RETRY_WITH_BACKOFF(backoff, error, error == EAGAIN,
+                                         probe_insert_ct_state_rule, ifindex,
+                                         ct_state, &id);
     if (error) {
+        if (error == EAGAIN) {
+            VLOG_WARN("probe tc: unable to probe ct_state support: %s",
+                      ovs_strerror(error));
+        }
         goto out;
     }
 
@@ -2714,6 +2773,7 @@  probe_tc_block_support(int ifindex)
 {
     struct tc_flower flower;
     uint32_t block_id = 1;
+    uint64_t backoff;
     struct tcf_id id;
     int prio = 0;
     int error;
@@ -2732,13 +2792,17 @@  probe_tc_block_support(int ifindex)
     memset(&flower.mask.dst_mac, 0xff, sizeof flower.mask.dst_mac);
 
     id = tc_make_tcf_id(ifindex, block_id, prio, TC_INGRESS);
-    error = tc_replace_flower(&id, &flower);
+    NETDEV_OFFLOAD_TC_RETRY_WITH_BACKOFF(backoff, error, error == EAGAIN,
+                                         tc_replace_flower, &id, &flower);
 
     tc_add_del_qdisc(ifindex, false, block_id, TC_INGRESS);
 
     if (!error) {
         block_support = true;
         VLOG_INFO("probe tc: block offload is supported.");
+    } else if (error == EAGAIN) {
+        VLOG_WARN("probe tc: unable to probe block offload support: %s",
+                  ovs_strerror(error));
     }
 }
 
diff --git a/lib/netlink-socket.c b/lib/netlink-socket.c
index 80da20d9f..dea060fc3 100644
--- a/lib/netlink-socket.c
+++ b/lib/netlink-socket.c
@@ -1798,6 +1798,11 @@  nl_pool_release(struct nl_sock *sock)
  *
  *      2. Resending the request causes it to be re-executed, so the request
  *         needs to be idempotent.
+ *
+ *      3. In the event that the kernel is too busy to handle the request to
+ *         receive the response (i.e. EAGAIN), this function will still return
+ *         0.  The caller can check for this condition by checking for a zero
+ *         size of the 'replyp' ofpbuf buffer.
  */
 int
 nl_transact(int protocol, const struct ofpbuf *request,
diff --git a/lib/tc.c b/lib/tc.c
index 5c32c6f97..7df6a10c5 100644
--- a/lib/tc.c
+++ b/lib/tc.c
@@ -237,11 +237,29 @@  static void request_from_tcf_id(struct tcf_id *id, uint16_t eth_type,
     }
 }
 
+/* The `tc_transact` function is a wrapper around `nl_transact` with the
+ * addition of:
+ *
+ * 1. the 'request' ofpbuf buffer is freed after `nl_transact` returns,
+ *    regardless of success or failure.
+ *
+ * 2. When a 'replyp' pointer is provided; in the event of the kernel
+ *    being too busy to process the request for the response, a positive
+ *    error return will be provided with the value of EAGAIN.
+ *
+ * Please acquaint yourself with the documentation of the `nl_transact`
+ * function in the netlink-sock module before making use of this function.
+ */
 int
 tc_transact(struct ofpbuf *request, struct ofpbuf **replyp)
 {
     int error = nl_transact(NETLINK_ROUTE, request, replyp);
     ofpbuf_uninit(request);
+
+    if (!error && replyp && !(*replyp)->size) {
+        return EAGAIN;
+    }
+
     return error;
 }