
[ovs-dev,v9] netdev-dpdk: add control plane protection support

Message ID 20230222154321.23421-1-rjarry@redhat.com
State Superseded
Series [ovs-dev,v9] netdev-dpdk: add control plane protection support

Checks

Context Check Description
ovsrobot/apply-robot success apply and check: success
ovsrobot/github-robot-_Build_and_Test success github build: passed
ovsrobot/intel-ovs-compilation success test: success

Commit Message

Robin Jarry Feb. 22, 2023, 3:43 p.m. UTC
Some control protocols are used to maintain link status between
forwarding engines (e.g. LACP). When the system is not sized properly,
the PMD threads may not be able to process all incoming traffic from the
configured Rx queues. When a signaling packet of such protocols is
dropped, it can cause link flapping, worsening the situation.

Use the RTE flow API to redirect these protocols into a dedicated Rx
queue. The assumption is made that the ratio between control protocol
traffic and user data traffic is very low and thus this dedicated Rx
queue will never get full. The RSS redirection table is re-programmed to
only use the other Rx queues. The RSS table size is stored in the
netdev_dpdk structure at port initialization to avoid requesting the
information again when changing the port configuration.
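The re-programmed redirection table simply spreads entries round-robin over the remaining Rx queues, so the dedicated queue receives no RSS traffic. A minimal standalone sketch of that fill logic (hypothetical helper name, simplified from the patch's dpdk_cp_prot_rss_configure()):

```c
#include <assert.h>
#include <stdint.h>

/* Fill an RSS redirection table round-robin over the first 'rss_n_rxq'
 * queues.  Queue indexes >= rss_n_rxq (i.e. the control plane queue)
 * never appear in the table, so RSS spreads regular traffic only over
 * the remaining queues. */
static void
fill_reta(uint16_t *reta, uint16_t reta_size, uint16_t rss_n_rxq)
{
    for (uint16_t i = 0; i < reta_size; i++) {
        reta[i] = i % rss_n_rxq;
    }
}
```

The patch hands the resulting table to rte_eth_dev_rss_reta_update(); keeping reta_size a sane multiple of the data queue count is what the 128-entry workaround later in the diff is about.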

The additional Rx queue will be assigned a PMD core like any other Rx
queue. Polling that extra queue may introduce increased latency and
a slight performance penalty, with the benefit of preventing link flapping.

This feature must be enabled per port on specific protocols via the
cp-protection option. This option takes a comma-separated list of
protocol names. It is only supported on Ethernet ports. This feature is
experimental.

If the user has already configured multiple Rx queues on the port, an
additional one will be allocated for control plane packets. If the
hardware cannot satisfy the requested number of Rx queues, the last Rx
queue will be assigned for control plane traffic. If only one Rx queue
is available, the cp-protection feature will be disabled. If the
hardware does not support the required RTE flow matchers/actions, the
feature will be disabled.
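The fallback rules above boil down to simple queue accounting, sketched here with hypothetical helper names (the patch implements this inline in netdev_dpdk_reconfigure()):

```c
#include <assert.h>
#include <stdbool.h>

/* One extra Rx queue is requested on top of the user's n_rxq when
 * cp-protection is enabled. */
static int
requested_rxq_count(int user_n_rxq, bool cp_prot)
{
    return cp_prot ? user_n_rxq + 1 : user_n_rxq;
}

/* After the hardware grants a queue count, cp-protection only remains
 * possible if one queue can be dedicated to control plane traffic while
 * at least one is left for regular traffic. */
static bool
cp_prot_possible(int granted_n_rxq)
{
    return granted_n_rxq >= 2;
}
```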

It cannot be enabled when other_config:hw-offload=true as it may
conflict with the offloaded RTE flows. Similarly, if hw-offload is
enabled while some ports already have cp-protection enabled, RTE flow
offloading will be disabled on these ports.

Example use:

 ovs-vsctl add-bond br-phy bond0 phy0 phy1 -- \
   set interface phy0 type=dpdk options:dpdk-devargs=0000:ca:00.0 -- \
   set interface phy0 options:cp-protection=lacp -- \
   set interface phy1 type=dpdk options:dpdk-devargs=0000:ca:00.1 -- \
   set interface phy1 options:cp-protection=lacp

As a starting point, only one protocol is supported: LACP. Other
protocols can be added in the future. NIC compatibility should be
checked.
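LACP frames are identified purely by their EtherType (0x8809, the Slow Protocols EtherType), which is what the rte_flow rule in the patch matches in hardware. A software-side equivalent of that match (illustrative helper, not part of the patch):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define ETH_TYPE_LACP 0x8809  /* Slow Protocols EtherType carrying LACP. */

/* An untagged Ethernet frame is a candidate LACP frame if the 16-bit
 * EtherType at byte offset 12 (after the two 6-byte MAC addresses)
 * equals 0x8809. */
static bool
is_lacp_frame(const uint8_t *frame)
{
    uint16_t ether_type = ((uint16_t) frame[12] << 8) | frame[13];
    return ether_type == ETH_TYPE_LACP;
}
```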

To validate that this works as intended, I used a traffic generator to
generate random traffic slightly above the machine capacity at line rate
on a two ports bond interface. OVS is configured to receive traffic on
two VLANs and pop/push them in a br-int bridge based on tags set on
patch ports.

   +----------------------+
   |         DUT          |
   |+--------------------+|
   ||       br-int       || in_port=patch10,actions=mod_dl_src:$patch11,mod_dl_dst:$tgen1,output:patch11
   ||                    || in_port=patch11,actions=mod_dl_src:$patch10,mod_dl_dst:$tgen0,output:patch10
   || patch10    patch11 ||
   |+---|-----------|----+|
   |    |           |     |
   |+---|-----------|----+|
   || patch00    patch01 ||
   ||  tag:10    tag:20  ||
   ||                    ||
   ||       br-phy       || default flow, action=NORMAL
   ||                    ||
   ||       bond0        || balance-slb, lacp=passive, lacp-time=fast
   ||    phy0   phy1     ||
   |+------|-----|-------+|
   +-------|-----|--------+
           |     |
   +-------|-----|--------+
   |     port0  port1     | balance L3/L4, lacp=active, lacp-time=fast
   |         lag          | mode trunk VLANs 10, 20
   |                      |
   |        switch        |
   |                      |
   |  vlan 10    vlan 20  |  mode access
   |   port2      port3   |
   +-----|----------|-----+
         |          |
   +-----|----------|-----+
   |   tgen0      tgen1   |  Random traffic that is properly balanced
   |                      |  across the bond ports in both directions.
   |  traffic generator   |
   +----------------------+

Without cp-protection, the bond0 links are randomly switching to
"defaulted" when one of the LACP packets sent by the switch is dropped
because the RX queues are full and the PMD threads did not process them
fast enough. When that happens, all traffic must go through a single
link, which causes the above-line-rate traffic to be dropped.

 ~# ovs-appctl lacp/show-stats bond0
 ---- bond0 statistics ----
 member: phy0:
   TX PDUs: 347246
   RX PDUs: 14865
   RX Bad PDUs: 0
   RX Marker Request PDUs: 0
   Link Expired: 168
   Link Defaulted: 0
   Carrier Status Changed: 0
 member: phy1:
   TX PDUs: 347245
   RX PDUs: 14919
   RX Bad PDUs: 0
   RX Marker Request PDUs: 0
   Link Expired: 147
   Link Defaulted: 1
   Carrier Status Changed: 0

When cp-protection is enabled, no LACP packet is dropped and the bond
links remain enabled at all times, maximizing the throughput. Neither
the "Link Expired" nor the "Link Defaulted" counters are incremented
anymore.

This feature may be considered as "QoS". However, it does not work by
limiting the rate of traffic explicitly. It only guarantees that some
protocols have a lower chance of being dropped because the PMD cores
cannot keep up with regular traffic.

The choice of protocols is limited on purpose. This is not meant to be
configurable by users. Some limited configurability could be considered
in the future, but it would expose users to more potential issues if they
accidentally redirect all traffic into the control plane queue.

Cc: Anthony Harivel <aharivel@redhat.com>
Cc: Christophe Fontaine <cfontain@redhat.com>
Cc: David Marchand <david.marchand@redhat.com>
Cc: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Robin Jarry <rjarry@redhat.com>
---
v8 -> v9:

* Rebased on cf288fdfe2bf ("AUTHORS: Add Liang Mancang and Viacheslav
  Galaktionov.")
* Reset rte_flow_error struct before passing it to functions.
* Refined some comments.
* Updated check for hw-offload on a per-port basis. That way, if a port
  already has cp-protection enabled, hw-offload will not be enabled on
  it but cp-protection will continue to work until next restart.
  However, on next restart, hw-offload will be checked first and
  therefore cp-protection will be disabled on all ports.

Unless there are significant reservations about this patch, would it be
OK to include it in 3.2?

Thanks!

 Documentation/topics/dpdk/phy.rst |  77 ++++++++
 NEWS                              |   4 +
 lib/netdev-dpdk.c                 | 310 +++++++++++++++++++++++++++++-
 vswitchd/vswitch.xml              |  26 +++
 4 files changed, 414 insertions(+), 3 deletions(-)

Comments

Robin Jarry Feb. 23, 2023, 5:08 p.m. UTC | #1
For the record, I have tested this feature as a non-privileged user:

 ovs-ctl --ovs-user="openvswitch:hugetlbfs" start

With the proper UDEV rules for vfio based devices, I have successfully
configured cp-protection=lacp on i40e (Intel X710 NICs) and mlx5
(ConnectX-5 Ex NICs) drivers.

 cp-protection: redirected lacp traffic to rx queue 1
 cp-protection: redirected other traffic to rx queue 0
Aaron Conole March 7, 2023, 6:57 p.m. UTC | #2
Robin Jarry <rjarry@redhat.com> writes:

> Some control protocols are used to maintain link status between
> forwarding engines (e.g. LACP). When the system is not sized properly,
> the PMD threads may not be able to process all incoming traffic from the
> configured Rx queues. When a signaling packet of such protocols is
> dropped, it can cause link flapping, worsening the situation.
>
> Use the RTE flow API to redirect these protocols into a dedicated Rx
> queue. The assumption is made that the ratio between control protocol
> traffic and user data traffic is very low and thus this dedicated Rx
> queue will never get full. The RSS redirection table is re-programmed to
> only use the other Rx queues. The RSS table size is stored in the
> netdev_dpdk structure at port initialization to avoid requesting the
> information again when changing the port configuration.
>
> The additional Rx queue will be assigned a PMD core like any other Rx
> queue. Polling that extra queue may introduce increased latency and
> a slight performance penalty at the benefit of preventing link flapping.
>
> This feature must be enabled per port on specific protocols via the
> cp-protection option. This option takes a comma-separated list of
> protocol names. It is only supported on Ethernet ports. This feature is
> experimental.
>
> If the user has already configured multiple Rx queues on the port, an
> additional one will be allocated for control plane packets. If the
> hardware cannot satisfy the requested number of Rx queues, the
> last Rx queue will be assigned for control plane. If only one Rx queue
> is available, the cp-protection feature will be disabled. If the
> hardware does not support the RTE flow matchers/actions, the feature
> will be disabled.
>
> It cannot be enabled when other_config:hw-offload=true as it may
> conflict with the offloaded RTE flows. Similarly, if hw-offload is
> enabled while some ports already have cp-protection enabled, RTE flow
> offloading will be disabled on these ports.

I'm concerned about this - this is a negative interference with rte_flow
offload.  And rte_flow offload would also just alleviate these problems,
yes?  Maybe we should encourage a user to just turn on flow offloading?

Additionally, it doesn't seem to have a good solution for kernel side.
And I worry about adding flows into the system that the ofproto layer
doesn't know about but will interact with - but maybe I'm just being
paranoid.

> Example use:
>
>  ovs-vsctl add-bond br-phy bond0 phy0 phy1 -- \
>    set interface phy0 type=dpdk options:dpdk-devargs=0000:ca:00.0 -- \
>    set interface phy0 options:cp-protection=lacp -- \
>    set interface phy1 type=dpdk options:dpdk-devargs=0000:ca:00.1 -- \
>    set interface phy1 options:cp-protection=lacp
>
> As a starting point, only one protocol is supported: LACP. Other
> protocols can be added in the future. NIC compatibility should be
> checked.
>
> To validate that this works as intended, I used a traffic generator to
> generate random traffic slightly above the machine capacity at line rate
> on a two ports bond interface. OVS is configured to receive traffic on
> two VLANs and pop/push them in a br-int bridge based on tags set on
> patch ports.
>
>    +----------------------+
>    |         DUT          |
>    |+--------------------+|
>    ||       br-int       || in_port=patch10,actions=mod_dl_src:$patch11,mod_dl_dst:$tgen1,output:patch11
>    ||                    || in_port=patch11,actions=mod_dl_src:$patch10,mod_dl_dst:$tgen0,output:patch10
>    || patch10    patch11 ||
>    |+---|-----------|----+|
>    |    |           |     |
>    |+---|-----------|----+|
>    || patch00    patch01 ||
>    ||  tag:10    tag:20  ||
>    ||                    ||
>    ||       br-phy       || default flow, action=NORMAL
>    ||                    ||
>    ||       bond0        || balance-slb, lacp=passive, lacp-time=fast
>    ||    phy0   phy1     ||
>    |+------|-----|-------+|
>    +-------|-----|--------+
>            |     |
>    +-------|-----|--------+
>    |     port0  port1     | balance L3/L4, lacp=active, lacp-time=fast
>    |         lag          | mode trunk VLANs 10, 20
>    |                      |
>    |        switch        |
>    |                      |
>    |  vlan 10    vlan 20  |  mode access
>    |   port2      port3   |
>    +-----|----------|-----+
>          |          |
>    +-----|----------|-----+
>    |   tgen0      tgen1   |  Random traffic that is properly balanced
>    |                      |  across the bond ports in both directions.
>    |  traffic generator   |
>    +----------------------+
>
> Without cp-protection, the bond0 links are randomly switching to
> "defaulted" when one of the LACP packets sent by the switch is dropped
> because the RX queues are full and the PMD threads did not process them
> fast enough. When that happens, all traffic must go through a single
> link which causes above line rate traffic to be dropped.
>
>  ~# ovs-appctl lacp/show-stats bond0
>  ---- bond0 statistics ----
>  member: phy0:
>    TX PDUs: 347246
>    RX PDUs: 14865
>    RX Bad PDUs: 0
>    RX Marker Request PDUs: 0
>    Link Expired: 168
>    Link Defaulted: 0
>    Carrier Status Changed: 0
>  member: phy1:
>    TX PDUs: 347245
>    RX PDUs: 14919
>    RX Bad PDUs: 0
>    RX Marker Request PDUs: 0
>    Link Expired: 147
>    Link Defaulted: 1
>    Carrier Status Changed: 0
>
> When cp-protection is enabled, no LACP packet is dropped and the bond
> links remain enabled at all times, maximizing the throughput. Neither
> the "Link Expired" nor the "Link Defaulted" counters are incremented
> anymore.
>
> This feature may be considered as "QoS". However, it does not work by
> limiting the rate of traffic explicitly. It only guarantees that some
> protocols have a lower chance of being dropped because the PMD cores
> cannot keep up with regular traffic.
>
> The choice of protocols is limited on purpose. This is not meant to be
> configurable by users. Some limited configurability could be considered
> in the future, but it would expose users to more potential issues if they
> accidentally redirect all traffic into the control plane queue.
>
> Cc: Anthony Harivel <aharivel@redhat.com>
> Cc: Christophe Fontaine <cfontain@redhat.com>
> Cc: David Marchand <david.marchand@redhat.com>
> Cc: Kevin Traynor <ktraynor@redhat.com>
> Signed-off-by: Robin Jarry <rjarry@redhat.com>
> ---
> v8 -> v9:
>
> * Rebased on cf288fdfe2bf ("AUTHORS: Add Liang Mancang and Viacheslav
>   Galaktionov.")
> * Reset rte_flow_error struct before passing it to functions.
> * Refined some comments.
> * Updated check for hw-offload on a per-port basis. That way, if a port
>   already has cp-protection enabled, hw-offload will not be enabled on
>   it but cp-protection will continue to work until next restart.
>   However, on next restart, hw-offload will be checked first and
>   therefore cp-protection will be disabled on all ports.
>
> Unless there are significant reservations about this patch, would it be
> OK to include it in 3.2?
>
> Thanks!
>
>  Documentation/topics/dpdk/phy.rst |  77 ++++++++
>  NEWS                              |   4 +
>  lib/netdev-dpdk.c                 | 310 +++++++++++++++++++++++++++++-
>  vswitchd/vswitch.xml              |  26 +++
>  4 files changed, 414 insertions(+), 3 deletions(-)
>
> diff --git a/Documentation/topics/dpdk/phy.rst b/Documentation/topics/dpdk/phy.rst
> index 4b0fe8dded3a..518b67134639 100644
> --- a/Documentation/topics/dpdk/phy.rst
> +++ b/Documentation/topics/dpdk/phy.rst
> @@ -131,6 +131,83 @@ possible with DPDK acceleration. It is possible to configure multiple Rx queues
>  for ``dpdk`` ports, thus ensuring this is not a bottleneck for performance. For
>  information on configuring PMD threads, refer to :doc:`pmd`.
>  
> +Control Plane Protection
> +------------------------
> +
> +.. warning:: This feature is experimental.
> +
> +Some control protocols are used to maintain link status between forwarding
> +engines. In SDN environments, these packets share the same physical network
> +as the user data traffic.
> +
> +When the system is not sized properly, the PMD threads may not be able to
> +process all incoming traffic from the configured Rx queues. When a signaling
> +packet of such protocols is dropped, it can cause link flapping, worsening the
> +situation.
> +
> +Some physical NICs can be programmed to put these protocols in a dedicated
> +hardware Rx queue using the rte_flow__ API.
> +
> +__ https://doc.dpdk.org/guides-22.11/prog_guide/rte_flow.html
> +
> +The currently supported control plane protocols are:
> +
> +``lacp``
> +   `Link Aggregation Control Protocol`__. Ether type ``0x8809``.
> +
> +   __ https://www.ieee802.org/3/ad/public/mar99/seaman_1_0399.pdf
> +
> +.. warning::
> +
> +   This feature is not compatible with all NICs. Refer to the DPDK
> +   `compatibility matrix`__ and vendor documentation for more details.
> +
> +   __ https://doc.dpdk.org/guides-22.11/nics/overview.html
> +
> +Control plane protection must be enabled on specific protocols per port. The
> +``cp-protection`` option requires a comma-separated list of protocol names::
> +
> +   $ ovs-vsctl add-port br0 dpdk-p0 -- set Interface dpdk-p0 type=dpdk \
> +        options:dpdk-devargs=0000:01:00.0 options:n_rxq=2 \
> +        options:cp-protection=lacp
> +
> +.. note::
> +
> +   If multiple Rx queues are already configured, regular RSS (Receive Side
> +   Scaling) queue balancing is done on all but the extra control plane
> +   protection queue.
> +
> +.. tip::
> +
> +   You can check if control plane protection is supported on a port with the
> +   following command::
> +
> +      $ ovs-vsctl get interface dpdk-p0 status
> +      {cp_protection_queue="2", driver_name=..., rss_queues="0-1"}
> +
> +   This will also show in ``ovs-vswitchd.log``::
> +
> +      INFO|dpdk-p0: cp-protection: redirecting lacp traffic to queue 2
> +      INFO|dpdk-p0: cp-protection: applying rss on queues 0-1
> +
> +   If the hardware does not support redirecting control plane traffic to
> +   a dedicated queue, it will be explicit::
> +
> +      $ ovs-vsctl get interface dpdk-p0 status
> +      {cp_protection=unsupported, driver_name=...}
> +
> +   More details can often be found in ``ovs-vswitchd.log``::
> +
> +      WARN|dpdk-p0: cp-protection: failed to add lacp flow: Unsupported pattern
> +
> +To disable control plane protection on a port, use the following command::
> +
> +   $ ovs-vsctl remove Interface dpdk-p0 options cp-protection
> +
> +You can see that it has been disabled in ``ovs-vswitchd.log``::
> +
> +   INFO|dpdk-p0: cp-protection: disabled
> +
>  .. _dpdk-phy-flow-control:
>  
>  Flow Control
> diff --git a/NEWS b/NEWS
> index 85b34962145e..22505ea9d4ad 100644
> --- a/NEWS
> +++ b/NEWS
> @@ -10,6 +10,10 @@ Post-v3.1.0
>         in order to create OVSDB sockets with access mode of 0770.
>     - QoS:
>       * Added new configuration option 'jitter' for a linux-netem QoS type.
> +   - DPDK:
> +     * New experimental "cp-protection=<protocol>" option to redirect certain
> +       protocols (for now, only LACP) to a dedicated hardware queue using
> +       RTE flow.
>  
>  
>  v3.1.0 - 16 Feb 2023
> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
> index fb0dd43f75c5..549fc425edba 100644
> --- a/lib/netdev-dpdk.c
> +++ b/lib/netdev-dpdk.c
> @@ -415,6 +415,10 @@ enum dpdk_hw_ol_features {
>      NETDEV_TX_SCTP_CHECKSUM_OFFLOAD = 1 << 4,
>  };
>  
> +enum dpdk_cp_prot_flags {
> +    DPDK_CP_PROT_LACP = 1 << 0,
> +};
> +
>  /*
>   * In order to avoid confusion in variables names, following naming convention
>   * should be used, if possible:
> @@ -504,11 +508,18 @@ struct netdev_dpdk {
>           * so we remember the request and update them next time
>           * netdev_dpdk*_reconfigure() is called */
>          int requested_mtu;
> +        /* User input for n_rxq + an optional control plane protection queue
> +         * (see netdev_dpdk_reconfigure). This field is different from the
> +         * other requested_* fields as it may contain a different value than
> +         * the user input. */
>          int requested_n_txq;
>          int requested_n_rxq;
>          int requested_rxq_size;
>          int requested_txq_size;
>  
> +        /* User input for n_rxq (see dpdk_set_rxq_config). */
> +        int user_n_rxq;
> +
>          /* Number of rx/tx descriptors for physical devices */
>          int rxq_size;
>          int txq_size;
> @@ -534,6 +545,13 @@ struct netdev_dpdk {
>  
>          /* VF configuration. */
>          struct eth_addr requested_hwaddr;
> +
> +        /* Requested control plane protection flags,
> +         * from the enum set 'dpdk_cp_prot_flags'. */
> +        uint64_t requested_cp_prot_flags;
> +        uint64_t cp_prot_flags;
> +        size_t cp_prot_flows_num;
> +        struct rte_flow **cp_prot_flows;
>      );
>  
>      PADDED_MEMBERS(CACHE_LINE_SIZE,
> @@ -1310,9 +1328,14 @@ common_construct(struct netdev *netdev, dpdk_port_t port_no,
>      netdev->n_rxq = 0;
>      netdev->n_txq = 0;
>      dev->requested_n_rxq = NR_QUEUE;
> +    dev->user_n_rxq = NR_QUEUE;
>      dev->requested_n_txq = NR_QUEUE;
>      dev->requested_rxq_size = NIC_PORT_DEFAULT_RXQ_SIZE;
>      dev->requested_txq_size = NIC_PORT_DEFAULT_TXQ_SIZE;
> +    dev->requested_cp_prot_flags = 0;
> +    dev->cp_prot_flags = 0;
> +    dev->cp_prot_flows_num = 0;
> +    dev->cp_prot_flows = NULL;
>  
>      /* Initialize the flow control to NULL */
>      memset(&dev->fc_conf, 0, sizeof dev->fc_conf);
> @@ -1487,6 +1510,8 @@ common_destruct(struct netdev_dpdk *dev)
>      ovs_mutex_destroy(&dev->mutex);
>  }
>  
> +static void dpdk_cp_prot_unconfigure(struct netdev_dpdk *dev);
> +
>  static void
>  netdev_dpdk_destruct(struct netdev *netdev)
>  {
> @@ -1494,6 +1519,9 @@ netdev_dpdk_destruct(struct netdev *netdev)
>  
>      ovs_mutex_lock(&dpdk_mutex);
>  
> +    /* Destroy any rte flows to allow RXQs to be removed. */
> +    dpdk_cp_prot_unconfigure(dev);
> +
>      rte_eth_dev_stop(dev->port_id);
>      dev->started = false;
>  
> @@ -1908,8 +1936,8 @@ dpdk_set_rxq_config(struct netdev_dpdk *dev, const struct smap *args)
>      int new_n_rxq;
>  
>      new_n_rxq = MAX(smap_get_int(args, "n_rxq", NR_QUEUE), 1);
> -    if (new_n_rxq != dev->requested_n_rxq) {
> -        dev->requested_n_rxq = new_n_rxq;
> +    if (new_n_rxq != dev->user_n_rxq) {
> +        dev->user_n_rxq = new_n_rxq;
>          netdev_request_reconfigure(&dev->up);
>      }
>  }
> @@ -1931,6 +1959,48 @@ dpdk_process_queue_size(struct netdev *netdev, const struct smap *args,
>      }
>  }
>  
> +static void
> +dpdk_cp_prot_set_config(struct netdev *netdev, struct netdev_dpdk *dev,
> +                        const struct smap *args, char **errp)
> +{
> +    const char *arg = smap_get_def(args, "cp-protection", "");
> +    char *token, *saveptr, *buf;
> +    uint64_t flags = 0;
> +
> +    buf = xstrdup(arg);
> +    token = strtok_r(buf, ",", &saveptr);
> +    while (token) {
> +        if (strcmp(token, "lacp") == 0) {
> +            flags |= DPDK_CP_PROT_LACP;
> +        } else {
> +            VLOG_WARN_BUF(errp, "%s options:cp-protection "
> +                          "unknown protocol '%s'",
> +                          netdev_get_name(netdev), token);
> +        }
> +        token = strtok_r(NULL, ",", &saveptr);
> +    }
> +    free(buf);
> +
> +    if (flags && dev->type != DPDK_DEV_ETH) {
> +        VLOG_WARN_BUF(errp, "%s options:cp-protection "
> +                      "is only supported on ethernet ports",
> +                      netdev_get_name(netdev));
> +        flags = 0;
> +    }
> +
> +    if (flags && ovsrcu_get(void *, &netdev->hw_info.offload_data)) {
> +        VLOG_WARN_BUF(errp, "%s options:cp-protection "
> +                      "is incompatible with hw-offload",
> +                      netdev_get_name(netdev));
> +        flags = 0;
> +    }
> +
> +    if (flags != dev->requested_cp_prot_flags) {
> +        dev->requested_cp_prot_flags = flags;
> +        netdev_request_reconfigure(netdev);
> +    }
> +}
> +
>  static int
>  netdev_dpdk_set_config(struct netdev *netdev, const struct smap *args,
>                         char **errp)
> @@ -1950,6 +2020,8 @@ netdev_dpdk_set_config(struct netdev *netdev, const struct smap *args,
>      ovs_mutex_lock(&dpdk_mutex);
>      ovs_mutex_lock(&dev->mutex);
>  
> +    dpdk_cp_prot_set_config(netdev, dev, args, errp);
> +
>      dpdk_set_rxq_config(dev, args);
>  
>      dpdk_process_queue_size(netdev, args, "n_rxq_desc",
> @@ -3819,9 +3891,12 @@ netdev_dpdk_get_status(const struct netdev *netdev, struct smap *args)
>  {
>      struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
>      struct rte_eth_dev_info dev_info;
> +    size_t cp_prot_flows_num;
> +    uint64_t cp_prot_flags;
>      const char *bus_info;
>      uint32_t link_speed;
>      uint32_t dev_flags;
> +    int n_rxq;
>  
>      if (!rte_eth_dev_is_valid_port(dev->port_id)) {
>          return ENODEV;
> @@ -3833,6 +3908,9 @@ netdev_dpdk_get_status(const struct netdev *netdev, struct smap *args)
>      link_speed = dev->link.link_speed;
>      dev_flags = *dev_info.dev_flags;
>      bus_info = rte_dev_bus_info(dev_info.device);
> +    cp_prot_flags = dev->cp_prot_flags;
> +    cp_prot_flows_num = dev->cp_prot_flows_num;
> +    n_rxq = netdev->n_rxq;
>      ovs_mutex_unlock(&dev->mutex);
>      ovs_mutex_unlock(&dpdk_mutex);
>  
> @@ -3875,6 +3953,19 @@ netdev_dpdk_get_status(const struct netdev *netdev, struct smap *args)
>                          ETH_ADDR_ARGS(dev->hwaddr));
>      }
>  
> +    if (cp_prot_flags) {
> +        if (!cp_prot_flows_num) {
> +            smap_add(args, "cp_protection", "unsupported");
> +        } else {
> +            smap_add_format(args, "cp_protection_queue", "%d", n_rxq - 1);
> +            if (n_rxq > 2) {
> +                smap_add_format(args, "rss_queues", "0-%d", n_rxq - 2);
> +            } else {
> +                smap_add(args, "rss_queues", "0");
> +            }
> +        }
> +    }
> +
>      return 0;
>  }
>  
> @@ -5178,16 +5269,203 @@ static const struct dpdk_qos_ops trtcm_policer_ops = {
>      .qos_queue_dump_state_init = trtcm_policer_qos_queue_dump_state_init
>  };
>  
> +static int
> +dpdk_cp_prot_add_flow(struct netdev_dpdk *dev,
> +                      const struct rte_flow_item items[],
> +                      const char *desc)
> +{
> +    const struct rte_flow_attr attr = { .ingress = 1 };
> +    const struct rte_flow_action actions[] = {
> +        {
> +            .type = RTE_FLOW_ACTION_TYPE_QUEUE,
> +            .conf = &(const struct rte_flow_action_queue) {
> +                .index = dev->up.n_rxq - 1,
> +            },
> +        },
> +        { .type = RTE_FLOW_ACTION_TYPE_END },
> +    };
> +    struct rte_flow_error error;
> +    struct rte_flow *flow;
> +    size_t num;
> +    int err;
> +
> +    set_error(&error, RTE_FLOW_ERROR_TYPE_NONE);
> +    err = rte_flow_validate(dev->port_id, &attr, items, actions, &error);
> +    if (err) {
> +        VLOG_WARN("%s: cp-protection: device does not support %s flow: %s",
> +                  netdev_get_name(&dev->up), desc,
> +                  error.message ? error.message : "");
> +        goto out;
> +    }
> +
> +    set_error(&error, RTE_FLOW_ERROR_TYPE_NONE);
> +    flow = rte_flow_create(dev->port_id, &attr, items, actions, &error);
> +    if (flow == NULL) {
> +        VLOG_WARN("%s: cp-protection: failed to add %s flow: %s",
> +                  netdev_get_name(&dev->up), desc,
> +                  error.message ? error.message : "");
> +        err = rte_errno;
> +        goto out;
> +    }
> +
> +    num = dev->cp_prot_flows_num + 1;
> +    dev->cp_prot_flows = xrealloc(dev->cp_prot_flows, sizeof(flow) * num);
> +    dev->cp_prot_flows[dev->cp_prot_flows_num] = flow;
> +    dev->cp_prot_flows_num = num;
> +
> +    VLOG_INFO("%s: cp-protection: redirected %s traffic to rx queue %d",
> +              netdev_get_name(&dev->up), desc, dev->up.n_rxq - 1);
> +out:
> +    return err;
> +}
> +
> +#define RETA_CONF_SIZE (RTE_ETH_RSS_RETA_SIZE_512 / RTE_ETH_RETA_GROUP_SIZE)
> +
> +static int
> +dpdk_cp_prot_rss_configure(struct netdev_dpdk *dev, int rss_n_rxq)
> +{
> +    struct rte_eth_rss_reta_entry64 reta_conf[RETA_CONF_SIZE];
> +    struct rte_eth_dev_info info;
> +    int err;
> +
> +    rte_eth_dev_info_get(dev->port_id, &info);
> +
> +    if (info.reta_size % rss_n_rxq != 0 &&
> +        info.reta_size < RTE_ETH_RSS_RETA_SIZE_128) {
> +        /*
> +         * Some drivers set reta_size equal to the total number of rxqs that
> +         * are configured when it is a power of two. Since we are actually
> +         * reconfiguring the redirection table to exclude the last rxq, we may
> +         * end up with an imbalanced redirection table. For example, such
> +         * configuration:
> +         *
> +         *   options:n_rxq=3 options:cp-protection=lacp
> +         *
> +         * Will actually configure 4 rxqs on the NIC, and the default reta to:
> +         *
> +         *   [0, 1, 2, 3]
> +         *
> +         * And dpdk_cp_prot_rss_configure() will reconfigure reta to:
> +         *
> +         *   [0, 1, 2, 0]
> +         *
> +         * Causing queue 0 to receive twice as much traffic as queues 1 and 2.
> +         *
> +         * Work around that corner case by forcing a bigger redirection table
> +         * size to 128 entries when reta_size is not a multiple of rss_n_rxq
> +         * and when reta_size is less than 128. This value seems to be
> +         * supported by most of the drivers that also support rte flow.
> +         */
> +        info.reta_size = RTE_ETH_RSS_RETA_SIZE_128;
> +    }
> +
> +    memset(reta_conf, 0, sizeof(reta_conf));
> +    for (uint16_t i = 0; i < info.reta_size; i++) {
> +        uint16_t idx = i / RTE_ETH_RETA_GROUP_SIZE;
> +        uint16_t shift = i % RTE_ETH_RETA_GROUP_SIZE;
> +        reta_conf[idx].mask |= 1ULL << shift;
> +        reta_conf[idx].reta[shift] = i % rss_n_rxq;
> +    }
> +    err = rte_eth_dev_rss_reta_update(dev->port_id, reta_conf, info.reta_size);
> +    if (err < 0) {
> +        VLOG_WARN("%s: failed to configure RSS redirection table: err=%d",
> +                  netdev_get_name(&dev->up), err);
> +    }
> +
> +    return err;
> +}
> +
> +static int
> +dpdk_cp_prot_configure(struct netdev_dpdk *dev)
> +{
> +    int err = 0;
> +
> +    if (dev->up.n_rxq < 2) {
> +        err = ENOTSUP;
> +        VLOG_WARN("%s: cp-protection: not enough available rx queues",
> +                  netdev_get_name(&dev->up));
> +        goto out;
> +    }
> +
> +    if (dev->requested_cp_prot_flags & DPDK_CP_PROT_LACP) {
> +        const struct rte_flow_item items[] = {
> +            {
> +                .type = RTE_FLOW_ITEM_TYPE_ETH,
> +                .spec = &(const struct rte_flow_item_eth){
> +                    .type = htons(ETH_TYPE_LACP),
> +                },
> +                .mask = &(const struct rte_flow_item_eth){
> +                    .type = htons(0xffff),
> +                },
> +            },
> +            { .type = RTE_FLOW_ITEM_TYPE_END },
> +        };
> +        err = dpdk_cp_prot_add_flow(dev, items, "lacp");
> +        if (err) {
> +            goto out;
> +        }
> +    }
> +
> +    if (dev->cp_prot_flows_num) {
> +        /* Reconfigure RSS reta in all but the cp protection queue. */
> +        err = dpdk_cp_prot_rss_configure(dev, dev->up.n_rxq - 1);
> +        if (!err) {
> +            if (dev->up.n_rxq == 2) {
> +                VLOG_INFO("%s: cp-protection: redirected other traffic to "
> +                          "rx queue 0", netdev_get_name(&dev->up));
> +            } else {
> +                VLOG_INFO("%s: cp-protection: applied rss on rx queue 0-%u",
> +                          netdev_get_name(&dev->up), dev->up.n_rxq - 2);
> +            }
> +        }
> +    }
> +
> +out:
> +    return err;
> +}
> +
> +static void
> +dpdk_cp_prot_unconfigure(struct netdev_dpdk *dev)
> +{
> +    struct rte_flow_error error;
> +
> +    if (!dev->cp_prot_flows_num) {
> +        return;
> +    }
> +
> +    VLOG_DBG("%s: cp-protection: reset flows", netdev_get_name(&dev->up));
> +
> +    for (int i = 0; i < dev->cp_prot_flows_num; i++) {
> +        set_error(&error, RTE_FLOW_ERROR_TYPE_NONE);
> +        if (rte_flow_destroy(dev->port_id, dev->cp_prot_flows[i], &error)) {
> +            VLOG_DBG("%s: cp-protection: failed to destroy flow: %s",
> +                     netdev_get_name(&dev->up),
> +                     error.message ? error.message : "");
> +        }
> +    }
> +    free(dev->cp_prot_flows);
> +    dev->cp_prot_flows_num = 0;
> +    dev->cp_prot_flows = NULL;
> +}
> +
>  static int
>  netdev_dpdk_reconfigure(struct netdev *netdev)
>  {
>      struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
> +    bool try_cp_prot;
>      int err = 0;
>  
>      ovs_mutex_lock(&dev->mutex);
>  
> +    try_cp_prot = dev->requested_cp_prot_flags != 0;
> +    dev->requested_n_rxq = dev->user_n_rxq;
> +    if (try_cp_prot) {
> +        dev->requested_n_rxq += 1;
> +    }
> +
>      if (netdev->n_txq == dev->requested_n_txq
>          && netdev->n_rxq == dev->requested_n_rxq
> +        && dev->cp_prot_flags == dev->requested_cp_prot_flags
>          && dev->mtu == dev->requested_mtu
>          && dev->lsc_interrupt_mode == dev->requested_lsc_interrupt_mode
>          && dev->rxq_size == dev->requested_rxq_size
> @@ -5200,6 +5478,9 @@ netdev_dpdk_reconfigure(struct netdev *netdev)
>          goto out;
>      }
>  
> +retry:
> +    dpdk_cp_prot_unconfigure(dev);
> +
>      if (dev->reset_needed) {
>          rte_eth_dev_reset(dev->port_id);
>          if_notifier_manual_report();
> @@ -5224,6 +5505,7 @@ netdev_dpdk_reconfigure(struct netdev *netdev)
>      dev->txq_size = dev->requested_txq_size;
>  
>      rte_free(dev->tx_q);
> +    dev->tx_q = NULL;
>  
>      if (!eth_addr_equals(dev->hwaddr, dev->requested_hwaddr)) {
>          err = netdev_dpdk_set_etheraddr__(dev, dev->requested_hwaddr);
> @@ -5255,6 +5537,23 @@ netdev_dpdk_reconfigure(struct netdev *netdev)
>       */
>      dev->requested_hwaddr = dev->hwaddr;
>  
> +    if (try_cp_prot) {
> +        err = dpdk_cp_prot_configure(dev);
> +        if (err) {
> +            /* No hw support, disable & recover gracefully. */
> +            try_cp_prot = false;
> +            /* The extra queue must be explicitly removed here to ensure
> +             * that it is unconfigured immediately. */
> +            dev->requested_n_rxq = dev->user_n_rxq;
> +            goto retry;
> +        }
> +    } else {
> +        VLOG_INFO("%s: cp-protection: disabled", netdev_get_name(&dev->up));
> +    }
> +    dev->cp_prot_flags = dev->requested_cp_prot_flags;
> +
>      dev->tx_q = netdev_dpdk_alloc_txq(netdev->n_txq);
>      if (!dev->tx_q) {
>          err = ENOMEM;
> @@ -5468,7 +5767,12 @@ netdev_dpdk_flow_api_supported(struct netdev *netdev)
>      ovs_mutex_lock(&dev->mutex);
>      if (dev->type == DPDK_DEV_ETH) {
>          /* TODO: Check if we able to offload some minimal flow. */
> -        ret = true;
> +        if (dev->requested_cp_prot_flags) {
> +            VLOG_WARN("%s: hw-offload is mutually exclusive with "
> +                      "cp-protection", netdev_get_name(netdev));
> +        } else {
> +            ret = true;
> +        }
>      }
>      ovs_mutex_unlock(&dev->mutex);
>  out:
> diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
> index 05ac1fbe5ef6..e6575f84eb3c 100644
> --- a/vswitchd/vswitch.xml
> +++ b/vswitchd/vswitch.xml
> @@ -3453,6 +3453,32 @@ ovs-vsctl add-port br0 p0 -- set Interface p0 type=patch options:peer=p1 \
>          <p>This option may only be used with dpdk VF representors.</p>
>        </column>
>  
> +      <column name="options" key="cp-protection"
> +              type='{"type": "string", "enum": ["set", ["lacp"]]}'>
> +        <p>
> +          Allocate an extra Rx queue for control plane packets of the specified
> +          protocol(s).
> +        </p>
> +        <p>
> +          If the user has already configured multiple Rx queues via
> +          <code>options:n_rxq</code> on the port, an additional one will be
> +          allocated for control plane packets. If the hardware cannot satisfy
> +          the requested number of Rx queues, the last Rx queue will
> +          be assigned for control plane. If only one Rx queue is available or
> +          if the hardware does not support the RTE flow matchers/actions
> +          required to redirect the selected protocols,
> +          <code>cp-protection</code> will be disabled.
> +        </p>
> +        <p>
> +          This feature is mutually exclusive with
> +          <code>other_config:hw-offload</code> as it may conflict with the
> +          offloaded RTE flows.
> +        </p>
> +        <p>
> +          Disabled by default.
> +        </p>
> +      </column>
> +
>        <column name="other_config" key="tx-steering"
>                type='{"type": "string",
>                       "enum": ["set", ["thread", "hash"]]}'>
Robin Jarry March 7, 2023, 10:26 p.m. UTC | #3
Hi Aaron,

Thanks for your feedback.

Aaron Conole, Mar 07, 2023 at 19:57:
> I'm concerned about this - this is a negative interference with rte_flow
> offload.  And rte_flow offload would also just alleviate these problems,
> yes?  Maybe we should encourage a user to just turn on flow offloading?

Indeed this feature will not get along with rte_flow offload, but as you
said, if your hardware and flow pipeline are compatible with rte_flow
offload, then, you probably don't need this feature since your PMD
threads will only handle a limited portion of the data plane traffic.

The CP protection flows are very basic (for LACP, only matching the
ether type, no VLAN support needed, etc.) and they should be supported
by a broad range of NICs (even Intel 82599 Niantic from 2013).
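
To make that concrete: the LACP rule only inspects the 2-byte ethertype of
an untagged frame. A software model of what the hardware classifier checks
(plain C, no DPDK dependency; the function name is illustrative, not part
of the patch):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Software model of the cp-protection rte_flow rule for LACP: compare
 * the 16-bit ethertype at byte offset 12 of an untagged Ethernet frame
 * against 0x8809 (Slow Protocols) with a full 0xffff mask. No VLAN
 * tags, no IP headers involved. */
static int is_lacp_frame(const uint8_t *frame, size_t len)
{
    /* 14 bytes = dst MAC (6) + src MAC (6) + ethertype (2). */
    return len >= 14 && frame[12] == 0x88 && frame[13] == 0x09;
}
```

Anything a NIC can match at a fixed offset like this should be able to
install the rule, which is why even older hardware qualifies.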

This feature is mostly aimed at pure DPDK setups (without extended
rte_flow offload support) where LACP is mixed in the same queues and
processed by the same CPUs as data plane traffic.

> Additionally, it doesn't seem to have a good solution for kernel side.

For non-DPDK use cases, I see two separate paths. Please do correct me
if I am wrong:

1) pure kernel (not tested, only from ethtool(8))

   ethtool -L $dev rx $((rxqs + 1))
   ethtool -X $dev equal $((rxqs - 1))
   ethtool -U $dev flow-type ether proto 0x8809 m 0xffff action $rxqs

2) tc flow offload

   Only mlx5 supported hardware is concerned and there should be direct
   support for hardware bond offload (LACP handled in the NIC).

> And I worry about adding flows into the system that the ofproto layer
> doesn't know about but will interact with - but maybe I'm just being
> paranoid.

There should be no overlap with the ofproto layer. The CP protection
flows only route certain packets into certain RXQs. The packets are
still processed by the same pipeline.
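
For reference, the RSS side of the patch is just as contained: the
redirection table is rewritten so that entry i points at queue
i % rss_n_rxq, which by construction never selects the last
(cp-protection) queue. A standalone sketch of that arithmetic (plain C,
no DPDK; `reta_entry`/`reta_hits` are illustrative names, not the
patch's API):

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative model of the reta filled by dpdk_cp_prot_rss_configure():
 * entry i of the redirection table maps to rx queue i % rss_n_rxq, so
 * the last rx queue (reserved for control plane traffic) is excluded. */
static uint16_t reta_entry(uint16_t i, uint16_t rss_n_rxq)
{
    return i % rss_n_rxq;
}

/* Count how many reta entries land on a given queue. With reta_size
 * forced to 128 (the patch's workaround), 3 RSS queues get 43/43/42
 * entries, instead of the skewed [0, 1, 2, 0] a 4-entry table gives. */
static int reta_hits(uint16_t reta_size, uint16_t rss_n_rxq, uint16_t queue)
{
    int hits = 0;

    for (uint16_t i = 0; i < reta_size; i++) {
        if (reta_entry(i, rss_n_rxq) == queue) {
            hits++;
        }
    }
    return hits;
}
```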

I understand your reservations as this involves intrusive elements in
the packet flow and does not play well with RTE flow offload.

However, actual openflow pipelines (like the ones in OVN deployments)
are supported in RTE flow by a limited and recent set of hardware. Users
with existing deployments and older hardware cannot benefit from this.
This feature could allow making these environments more resilient and
less prone to link flapping.

Also, given that cp-protection is disabled by default, there will be no
conflict with DPDK hw-offload when there is hardware support.

What do you think?
Christophe Fontaine March 8, 2023, 10:59 a.m. UTC | #4
On Tue, Mar 7, 2023 at 7:58 PM Aaron Conole <aconole@redhat.com> wrote:
>
> Robin Jarry <rjarry@redhat.com> writes:
>
> > Some control protocols are used to maintain link status between
> > forwarding engines (e.g. LACP). When the system is not sized properly,
> > the PMD threads may not be able to process all incoming traffic from the
> > configured Rx queues. When a signaling packet of such protocols is
> > dropped, it can cause link flapping, worsening the situation.
> >
> > Use the RTE flow API to redirect these protocols into a dedicated Rx
> > queue. The assumption is made that the ratio between control protocol
> > traffic and user data traffic is very low and thus this dedicated Rx
> > queue will never get full. The RSS redirection table is re-programmed to
> > only use the other Rx queues. The RSS table size is stored in the
> > netdev_dpdk structure at port initialization to avoid requesting the
> > information again when changing the port configuration.
> >
> > The additional Rx queue will be assigned a PMD core like any other Rx
> > queue. Polling that extra queue may introduce increased latency and
> > a slight performance penalty at the benefit of preventing link flapping.
> >
> > This feature must be enabled per port on specific protocols via the
> > cp-protection option. This option takes a comma-separated list of
> > protocol names. It is only supported on ethernet ports. This feature is
> > experimental.
> >
> > If the user has already configured multiple Rx queues on the port, an
> > additional one will be allocated for control plane packets. If the
> > hardware cannot satisfy the requested number of Rx queues, the
> > last Rx queue will be assigned for control plane. If only one Rx queue
> > is available, the cp-protection feature will be disabled. If the
> > hardware does not support the RTE flow matchers/actions, the feature
> > will be disabled.
> >
> > It cannot be enabled when other_config:hw-offload=true as it may
> > conflict with the offloaded RTE flows. Similarly, if hw-offload is
> > enabled while some ports already have cp-protection enabled, RTE flow
> > offloading will be disabled on these ports.
>
> I'm concerned about this - this is a negative interference with rte_flow
> offload.  And rte_flow offload would also just alleviate these problems,
> yes?  Maybe we should encourage a user to just turn on flow offloading?

I agree that *in the near future* rte_flow is a great solution, but
having a dedicated queue to handle control plane packets is a great
addition.

Yet, I don't think rte_flow offload will "just" alleviate these problems:
- rte_flow offload, in the current state, isn't a working solution for
all NICs (Niantic)
- with a dedicated queue for LACP (and in the future, other CP packets
such as BFD), we can maintain the link status even when the system is
over capacity, whereas we don't have a way to express this constraint
with rte_flow.

If I'm not mistaken, rte_flow priorities are not supported by Niantic
NICs, but with other NICs, could we have:
- high priority flow for the CP protection
- lower priorities dedicated for rte_flow offload
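
If it helps the discussion: rte_flow rules within the same group are
ordered by their `attr.priority` field (lower value wins, on PMDs that
support priorities), so a high-priority cp-protection rule could in
principle coexist with lower-priority offload rules. A toy model of that
resolution, with illustrative names (not DPDK API):

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of rte_flow priority resolution within one group: when
 * several rules match a packet, the rule with the lowest numeric
 * priority wins. The struct and callbacks are illustrative only. */
struct toy_flow {
    int priority;                   /* Lower value = higher priority. */
    int (*matches)(int ethertype);
    int dest_queue;
};

static int match_lacp(int ethertype) { return ethertype == 0x8809; }
static int match_any(int ethertype)  { (void) ethertype; return 1; }

static int steer(const struct toy_flow *flows, size_t n, int ethertype)
{
    const struct toy_flow *best = NULL;

    for (size_t i = 0; i < n; i++) {
        if (flows[i].matches(ethertype)
            && (!best || flows[i].priority < best->priority)) {
            best = &flows[i];
        }
    }
    return best ? best->dest_queue : -1;
}
```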

So, what do you think about getting that improvement now, and keeping
it in mind as a requirement for rte_flow offload ?
Christophe


>
> Additionally, it doesn't seem to have a good solution for kernel side.
> And I worry about adding flows into the system that the ofproto layer
> doesn't know about but will interact with - but maybe I'm just being
> paranoid.
>
> > Example use:
> >
> >  ovs-vsctl add-bond br-phy bond0 phy0 phy1 -- \
> >    set interface phy0 type=dpdk options:dpdk-devargs=0000:ca:00.0 -- \
> >    set interface phy0 options:cp-protection=lacp -- \
> >    set interface phy1 type=dpdk options:dpdk-devargs=0000:ca:00.1 -- \
> >    set interface phy1 options:cp-protection=lacp
> >
> > As a starting point, only one protocol is supported: LACP. Other
> > protocols can be added in the future. NIC compatibility should be
> > checked.
> >
> > To validate that this works as intended, I used a traffic generator to
> > generate random traffic slightly above the machine capacity at line rate
> > on a two ports bond interface. OVS is configured to receive traffic on
> > two VLANs and pop/push them in a br-int bridge based on tags set on
> > patch ports.
> >
> >    +----------------------+
> >    |         DUT          |
> >    |+--------------------+|
> >    ||       br-int       || in_port=patch10,actions=mod_dl_src:$patch11,mod_dl_dst:$tgen1,output:patch11
> >    ||                    || in_port=patch11,actions=mod_dl_src:$patch10,mod_dl_dst:$tgen0,output:patch10
> >    || patch10    patch11 ||
> >    |+---|-----------|----+|
> >    |    |           |     |
> >    |+---|-----------|----+|
> >    || patch00    patch01 ||
> >    ||  tag:10    tag:20  ||
> >    ||                    ||
> >    ||       br-phy       || default flow, action=NORMAL
> >    ||                    ||
> >    ||       bond0        || balance-slb, lacp=passive, lacp-time=fast
> >    ||    phy0   phy1     ||
> >    |+------|-----|-------+|
> >    +-------|-----|--------+
> >            |     |
> >    +-------|-----|--------+
> >    |     port0  port1     | balance L3/L4, lacp=active, lacp-time=fast
> >    |         lag          | mode trunk VLANs 10, 20
> >    |                      |
> >    |        switch        |
> >    |                      |
> >    |  vlan 10    vlan 20  |  mode access
> >    |   port2      port3   |
> >    +-----|----------|-----+
> >          |          |
> >    +-----|----------|-----+
> >    |   tgen0      tgen1   |  Random traffic that is properly balanced
> >    |                      |  across the bond ports in both directions.
> >    |  traffic generator   |
> >    +----------------------+
> >
> > Without cp-protection, the bond0 links are randomly switching to
> > "defaulted" when one of the LACP packets sent by the switch is dropped
> > because the RX queues are full and the PMD threads did not process them
> > fast enough. When that happens, all traffic must go through a single
> > link which causes above line rate traffic to be dropped.
> >
> >  ~# ovs-appctl lacp/show-stats bond0
> >  ---- bond0 statistics ----
> >  member: phy0:
> >    TX PDUs: 347246
> >    RX PDUs: 14865
> >    RX Bad PDUs: 0
> >    RX Marker Request PDUs: 0
> >    Link Expired: 168
> >    Link Defaulted: 0
> >    Carrier Status Changed: 0
> >  member: phy1:
> >    TX PDUs: 347245
> >    RX PDUs: 14919
> >    RX Bad PDUs: 0
> >    RX Marker Request PDUs: 0
> >    Link Expired: 147
> >    Link Defaulted: 1
> >    Carrier Status Changed: 0
> >
> > When cp-protection is enabled, no LACP packet is dropped and the bond
> > links remain enabled at all times, maximizing the throughput. Neither
> > the "Link Expired" nor the "Link Defaulted" counters are incremented
> > anymore.
> >
> > This feature may be considered as "QoS". However, it does not work by
> > limiting the rate of traffic explicitly. It only guarantees that some
> > protocols have a lower chance of being dropped because the PMD cores
> > cannot keep up with regular traffic.
> >
> > The choice of protocols is limited on purpose. This is not meant to be
> > configurable by users. Some limited configurability could be considered
> > in the future but it would expose users to more potential issues if they
> > accidentally redirect all traffic to the control plane queue.
> >
> > Cc: Anthony Harivel <aharivel@redhat.com>
> > Cc: Christophe Fontaine <cfontain@redhat.com>
> > Cc: David Marchand <david.marchand@redhat.com>
> > Cc: Kevin Traynor <ktraynor@redhat.com>
> > Signed-off-by: Robin Jarry <rjarry@redhat.com>
> > ---
> > v8 -> v9:
> >
> > * Rebased on cf288fdfe2bf ("AUTHORS: Add Liang Mancang and Viacheslav
> >   Galaktionov.")
> > * Reset rte_flow_error struct before passing it to functions.
> > * Refined some comments.
> > * Updated check for hw-offload on a per-port basis. That way, if a port
> >   already has cp-protection enabled, hw-offload will not be enabled on
> >   it but cp-protection will continue to work until next restart.
> >   However, On next restart, hw-offload will be checked first and
> >   therefore cp-protection will be disabled on all ports.
> >
> > Unless there are significant reserves about this patch. Would it be ok
> > to include it for 3.2?
> >
> > Thanks!
> >
> >  Documentation/topics/dpdk/phy.rst |  77 ++++++++
> >  NEWS                              |   4 +
> >  lib/netdev-dpdk.c                 | 310 +++++++++++++++++++++++++++++-
> >  vswitchd/vswitch.xml              |  26 +++
> >  4 files changed, 414 insertions(+), 3 deletions(-)
> >
> > diff --git a/Documentation/topics/dpdk/phy.rst b/Documentation/topics/dpdk/phy.rst
> > index 4b0fe8dded3a..518b67134639 100644
> > --- a/Documentation/topics/dpdk/phy.rst
> > +++ b/Documentation/topics/dpdk/phy.rst
> > @@ -131,6 +131,83 @@ possible with DPDK acceleration. It is possible to configure multiple Rx queues
> >  for ``dpdk`` ports, thus ensuring this is not a bottleneck for performance. For
> >  information on configuring PMD threads, refer to :doc:`pmd`.
> >
> > +Control Plane Protection
> > +------------------------
> > +
> > +.. warning:: This feature is experimental.
> > +
> > +Some control protocols are used to maintain link status between forwarding
> > +engines. In SDN environments, these packets share the same physical network
> > +as the user data traffic.
> > +
> > +When the system is not sized properly, the PMD threads may not be able to
> > +process all incoming traffic from the configured Rx queues. When a signaling
> > +packet of such protocols is dropped, it can cause link flapping, worsening the
> > +situation.
> > +
> > +Some physical NICs can be programmed to put these protocols in a dedicated
> > +hardware Rx queue using the rte_flow__ API.
> > +
> > +__ https://doc.dpdk.org/guides-22.11/prog_guide/rte_flow.html
> > +
> > +The currently supported control plane protocols are:
> > +
> > +``lacp``
> > +   `Link Aggregation Control Protocol`__. Ether type ``0x8809``.
> > +
> > +   __ https://www.ieee802.org/3/ad/public/mar99/seaman_1_0399.pdf
> > +
> > +.. warning::
> > +
> > +   This feature is not compatible with all NICs. Refer to the DPDK
> > +   `compatibility matrix`__ and vendor documentation for more details.
> > +
> > +   __ https://doc.dpdk.org/guides-22.11/nics/overview.html
> > +
> > +Control plane protection must be enabled on specific protocols per port. The
> > +``cp-protection`` option requires a comma-separated list of protocol names::
> > +
> > +   $ ovs-vsctl add-port br0 dpdk-p0 -- set Interface dpdk-p0 type=dpdk \
> > +        options:dpdk-devargs=0000:01:00.0 options:n_rxq=2 \
> > +        options:cp-protection=lacp
> > +
> > +.. note::
> > +
> > +   If multiple Rx queues are already configured, regular RSS (Receive Side
> > +   Scaling) queue balancing is done on all but the extra control plane
> > +   protection queue.
> > +
> > +.. tip::
> > +
> > +   You can check if control plane protection is supported on a port with the
> > +   following command::
> > +
> > +      $ ovs-vsctl get interface dpdk-p0 status
> > +      {cp_protection_queue="2", driver_name=..., rss_queues="0-1"}
> > +
> > +   This will also show in ``ovs-vswitchd.log``::
> > +
> > +      INFO|dpdk-p0: cp-protection: redirected lacp traffic to rx queue 2
> > +      INFO|dpdk-p0: cp-protection: applied rss on rx queue 0-1
> > +
> > +   If the hardware does not support redirecting control plane traffic to
> > +   a dedicated queue, it will be explicit::
> > +
> > +      $ ovs-vsctl get interface dpdk-p0 status
> > +      {cp_protection=unsupported, driver_name=...}
> > +
> > +   More details can often be found in ``ovs-vswitchd.log``::
> > +
> > +      WARN|dpdk-p0: cp-protection: failed to add lacp flow: Unsupported pattern
> > +
> > +To disable control plane protection on a port, use the following command::
> > +
> > +   $ ovs-vsctl remove Interface dpdk-p0 options cp-protection
> > +
> > +You can see that it has been disabled in ``ovs-vswitchd.log``::
> > +
> > +   INFO|dpdk-p0: cp-protection: disabled
> > +
> >  .. _dpdk-phy-flow-control:
> >
> >  Flow Control
> > diff --git a/NEWS b/NEWS
> > index 85b34962145e..22505ea9d4ad 100644
> > --- a/NEWS
> > +++ b/NEWS
> > @@ -10,6 +10,10 @@ Post-v3.1.0
> >         in order to create OVSDB sockets with access mode of 0770.
> >     - QoS:
> >       * Added new configuration option 'jitter' for a linux-netem QoS type.
> > +   - DPDK:
> > +     * New experimental "cp-protection=<protocol>" option to redirect certain
> > +       protocols (for now, only LACP) to a dedicated hardware queue using
> > +       RTE flow.
> >
> >
> >  v3.1.0 - 16 Feb 2023
> > diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
> > index fb0dd43f75c5..549fc425edba 100644
> > --- a/lib/netdev-dpdk.c
> > +++ b/lib/netdev-dpdk.c
> > @@ -415,6 +415,10 @@ enum dpdk_hw_ol_features {
> >      NETDEV_TX_SCTP_CHECKSUM_OFFLOAD = 1 << 4,
> >  };
> >
> > +enum dpdk_cp_prot_flags {
> > +    DPDK_CP_PROT_LACP = 1 << 0,
> > +};
> > +
> >  /*
> >   * In order to avoid confusion in variables names, following naming convention
> >   * should be used, if possible:
> > @@ -504,11 +508,18 @@ struct netdev_dpdk {
> >           * so we remember the request and update them next time
> >           * netdev_dpdk*_reconfigure() is called */
> >          int requested_mtu;
> > +        /* User input for n_rxq + an optional control plane protection queue
> > +         * (see netdev_dpdk_reconfigure). This field is different from the
> > +         * other requested_* fields as it may contain a different value than
> > +         * the user input. */
> >          int requested_n_txq;
> >          int requested_n_rxq;
> >          int requested_rxq_size;
> >          int requested_txq_size;
> >
> > +        /* User input for n_rxq (see dpdk_set_rxq_config). */
> > +        int user_n_rxq;
> > +
> >          /* Number of rx/tx descriptors for physical devices */
> >          int rxq_size;
> >          int txq_size;
> > @@ -534,6 +545,13 @@ struct netdev_dpdk {
> >
> >          /* VF configuration. */
> >          struct eth_addr requested_hwaddr;
> > +
> > +        /* Requested control plane protection flags,
> > +         * from the enum set 'dpdk_cp_prot_flags'. */
> > +        uint64_t requested_cp_prot_flags;
> > +        uint64_t cp_prot_flags;
> > +        size_t cp_prot_flows_num;
> > +        struct rte_flow **cp_prot_flows;
> >      );
> >
> >      PADDED_MEMBERS(CACHE_LINE_SIZE,
> > @@ -1310,9 +1328,14 @@ common_construct(struct netdev *netdev, dpdk_port_t port_no,
> >      netdev->n_rxq = 0;
> >      netdev->n_txq = 0;
> >      dev->requested_n_rxq = NR_QUEUE;
> > +    dev->user_n_rxq = NR_QUEUE;
> >      dev->requested_n_txq = NR_QUEUE;
> >      dev->requested_rxq_size = NIC_PORT_DEFAULT_RXQ_SIZE;
> >      dev->requested_txq_size = NIC_PORT_DEFAULT_TXQ_SIZE;
> > +    dev->requested_cp_prot_flags = 0;
> > +    dev->cp_prot_flags = 0;
> > +    dev->cp_prot_flows_num = 0;
> > +    dev->cp_prot_flows = NULL;
> >
> >      /* Initialize the flow control to NULL */
> >      memset(&dev->fc_conf, 0, sizeof dev->fc_conf);
> > @@ -1487,6 +1510,8 @@ common_destruct(struct netdev_dpdk *dev)
> >      ovs_mutex_destroy(&dev->mutex);
> >  }
> >
> > +static void dpdk_cp_prot_unconfigure(struct netdev_dpdk *dev);
> > +
> >  static void
> >  netdev_dpdk_destruct(struct netdev *netdev)
> >  {
> > @@ -1494,6 +1519,9 @@ netdev_dpdk_destruct(struct netdev *netdev)
> >
> >      ovs_mutex_lock(&dpdk_mutex);
> >
> > +    /* Destroy any rte flows to allow RXQs to be removed. */
> > +    dpdk_cp_prot_unconfigure(dev);
> > +
> >      rte_eth_dev_stop(dev->port_id);
> >      dev->started = false;
> >
> > @@ -1908,8 +1936,8 @@ dpdk_set_rxq_config(struct netdev_dpdk *dev, const struct smap *args)
> >      int new_n_rxq;
> >
> >      new_n_rxq = MAX(smap_get_int(args, "n_rxq", NR_QUEUE), 1);
> > -    if (new_n_rxq != dev->requested_n_rxq) {
> > -        dev->requested_n_rxq = new_n_rxq;
> > +    if (new_n_rxq != dev->user_n_rxq) {
> > +        dev->user_n_rxq = new_n_rxq;
> >          netdev_request_reconfigure(&dev->up);
> >      }
> >  }
> > @@ -1931,6 +1959,48 @@ dpdk_process_queue_size(struct netdev *netdev, const struct smap *args,
> >      }
> >  }
> >
> > +static void
> > +dpdk_cp_prot_set_config(struct netdev *netdev, struct netdev_dpdk *dev,
> > +                        const struct smap *args, char **errp)
> > +{
> > +    const char *arg = smap_get_def(args, "cp-protection", "");
> > +    char *token, *saveptr, *buf;
> > +    uint64_t flags = 0;
> > +
> > +    buf = xstrdup(arg);
> > +    token = strtok_r(buf, ",", &saveptr);
> > +    while (token) {
> > +        if (strcmp(token, "lacp") == 0) {
> > +            flags |= DPDK_CP_PROT_LACP;
> > +        } else {
> > +            VLOG_WARN_BUF(errp, "%s options:cp-protection "
> > +                          "unknown protocol '%s'",
> > +                          netdev_get_name(netdev), token);
> > +        }
> > +        token = strtok_r(NULL, ",", &saveptr);
> > +    }
> > +    free(buf);
> > +
> > +    if (flags && dev->type != DPDK_DEV_ETH) {
> > +        VLOG_WARN_BUF(errp, "%s options:cp-protection "
> > +                      "is only supported on ethernet ports",
> > +                      netdev_get_name(netdev));
> > +        flags = 0;
> > +    }
> > +
> > +    if (flags && ovsrcu_get(void *, &netdev->hw_info.offload_data)) {
> > +        VLOG_WARN_BUF(errp, "%s options:cp-protection "
> > +                      "is incompatible with hw-offload",
> > +                      netdev_get_name(netdev));
> > +        flags = 0;
> > +    }
> > +
> > +    if (flags != dev->requested_cp_prot_flags) {
> > +        dev->requested_cp_prot_flags = flags;
> > +        netdev_request_reconfigure(netdev);
> > +    }
> > +}
> > +
> >  static int
> >  netdev_dpdk_set_config(struct netdev *netdev, const struct smap *args,
> >                         char **errp)
> > @@ -1950,6 +2020,8 @@ netdev_dpdk_set_config(struct netdev *netdev, const struct smap *args,
> >      ovs_mutex_lock(&dpdk_mutex);
> >      ovs_mutex_lock(&dev->mutex);
> >
> > +    dpdk_cp_prot_set_config(netdev, dev, args, errp);
> > +
> >      dpdk_set_rxq_config(dev, args);
> >
> >      dpdk_process_queue_size(netdev, args, "n_rxq_desc",
> > @@ -3819,9 +3891,12 @@ netdev_dpdk_get_status(const struct netdev *netdev, struct smap *args)
> >  {
> >      struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
> >      struct rte_eth_dev_info dev_info;
> > +    size_t cp_prot_flows_num;
> > +    uint64_t cp_prot_flags;
> >      const char *bus_info;
> >      uint32_t link_speed;
> >      uint32_t dev_flags;
> > +    int n_rxq;
> >
> >      if (!rte_eth_dev_is_valid_port(dev->port_id)) {
> >          return ENODEV;
> > @@ -3833,6 +3908,9 @@ netdev_dpdk_get_status(const struct netdev *netdev, struct smap *args)
> >      link_speed = dev->link.link_speed;
> >      dev_flags = *dev_info.dev_flags;
> >      bus_info = rte_dev_bus_info(dev_info.device);
> > +    cp_prot_flags = dev->cp_prot_flags;
> > +    cp_prot_flows_num = dev->cp_prot_flows_num;
> > +    n_rxq = netdev->n_rxq;
> >      ovs_mutex_unlock(&dev->mutex);
> >      ovs_mutex_unlock(&dpdk_mutex);
> >
> > @@ -3875,6 +3953,19 @@ netdev_dpdk_get_status(const struct netdev *netdev, struct smap *args)
> >                          ETH_ADDR_ARGS(dev->hwaddr));
> >      }
> >
> > +    if (cp_prot_flags) {
> > +        if (!cp_prot_flows_num) {
> > +            smap_add(args, "cp_protection", "unsupported");
> > +        } else {
> > +            smap_add_format(args, "cp_protection_queue", "%d", n_rxq - 1);
> > +            if (n_rxq > 2) {
> > +                smap_add_format(args, "rss_queues", "0-%d", n_rxq - 2);
> > +            } else {
> > +                smap_add(args, "rss_queues", "0");
> > +            }
> > +        }
> > +    }
> > +
> >      return 0;
> >  }
> >
> > @@ -5178,16 +5269,203 @@ static const struct dpdk_qos_ops trtcm_policer_ops = {
> >      .qos_queue_dump_state_init = trtcm_policer_qos_queue_dump_state_init
> >  };
> >
> > +static int
> > +dpdk_cp_prot_add_flow(struct netdev_dpdk *dev,
> > +                      const struct rte_flow_item items[],
> > +                      const char *desc)
> > +{
> > +    const struct rte_flow_attr attr = { .ingress = 1 };
> > +    const struct rte_flow_action actions[] = {
> > +        {
> > +            .type = RTE_FLOW_ACTION_TYPE_QUEUE,
> > +            .conf = &(const struct rte_flow_action_queue) {
> > +                .index = dev->up.n_rxq - 1,
> > +            },
> > +        },
> > +        { .type = RTE_FLOW_ACTION_TYPE_END },
> > +    };
> > +    struct rte_flow_error error;
> > +    struct rte_flow *flow;
> > +    size_t num;
> > +    int err;
> > +
> > +    set_error(&error, RTE_FLOW_ERROR_TYPE_NONE);
> > +    err = rte_flow_validate(dev->port_id, &attr, items, actions, &error);
> > +    if (err) {
> > +        VLOG_WARN("%s: cp-protection: device does not support %s flow: %s",
> > +                  netdev_get_name(&dev->up), desc,
> > +                  error.message ? error.message : "");
> > +        goto out;
> > +    }
> > +
> > +    set_error(&error, RTE_FLOW_ERROR_TYPE_NONE);
> > +    flow = rte_flow_create(dev->port_id, &attr, items, actions, &error);
> > +    if (flow == NULL) {
> > +        VLOG_WARN("%s: cp-protection: failed to add %s flow: %s",
> > +                  netdev_get_name(&dev->up), desc,
> > +                  error.message ? error.message : "");
> > +        err = rte_errno;
> > +        goto out;
> > +    }
> > +
> > +    num = dev->cp_prot_flows_num + 1;
> > +    dev->cp_prot_flows = xrealloc(dev->cp_prot_flows, sizeof(flow) * num);
> > +    dev->cp_prot_flows[dev->cp_prot_flows_num] = flow;
> > +    dev->cp_prot_flows_num = num;
> > +
> > +    VLOG_INFO("%s: cp-protection: redirected %s traffic to rx queue %d",
> > +              netdev_get_name(&dev->up), desc, dev->up.n_rxq - 1);
> > +out:
> > +    return err;
> > +}
> > +
> > +#define RETA_CONF_SIZE (RTE_ETH_RSS_RETA_SIZE_512 / RTE_ETH_RETA_GROUP_SIZE)
> > +
> > +static int
> > +dpdk_cp_prot_rss_configure(struct netdev_dpdk *dev, int rss_n_rxq)
> > +{
> > +    struct rte_eth_rss_reta_entry64 reta_conf[RETA_CONF_SIZE];
> > +    struct rte_eth_dev_info info;
> > +    int err;
> > +
> > +    rte_eth_dev_info_get(dev->port_id, &info);
> > +
> > +    if (info.reta_size % rss_n_rxq != 0 &&
> > +        info.reta_size < RTE_ETH_RSS_RETA_SIZE_128) {
> > +        /*
> > +         * Some drivers set reta_size equal to the total number of rxqs that
> > +         * are configured when it is a power of two. Since we are actually
> > +         * reconfiguring the redirection table to exclude the last rxq, we may
> > +         * end up with an imbalanced redirection table. For example, such
> > +         * configuration:
> > +         *
> > +         *   options:n_rxq=3 options:cp-protection=lacp
> > +         *
> > +         * Will actually configure 4 rxqs on the NIC, and the default reta to:
> > +         *
> > +         *   [0, 1, 2, 3]
> > +         *
> > +         * And dpdk_cp_prot_rss_configure() will reconfigure reta to:
> > +         *
> > +         *   [0, 1, 2, 0]
> > +         *
> > +         * Causing queue 0 to receive twice as much traffic as queues 1 and 2.
> > +         *
> > +         * Work around that corner case by forcing a bigger redirection table
> > +         * size to 128 entries when reta_size is not a multiple of rss_n_rxq
> > +         * and when reta_size is less than 128. This value seems to be
> > +         * supported by most of the drivers that also support rte flow.
> > +         */
> > +        info.reta_size = RTE_ETH_RSS_RETA_SIZE_128;
> > +    }
> > +
> > +    memset(reta_conf, 0, sizeof(reta_conf));
> > +    for (uint16_t i = 0; i < info.reta_size; i++) {
> > +        uint16_t idx = i / RTE_ETH_RETA_GROUP_SIZE;
> > +        uint16_t shift = i % RTE_ETH_RETA_GROUP_SIZE;
> > +        reta_conf[idx].mask |= 1ULL << shift;
> > +        reta_conf[idx].reta[shift] = i % rss_n_rxq;
> > +    }
> > +    err = rte_eth_dev_rss_reta_update(dev->port_id, reta_conf, info.reta_size);
> > +    if (err < 0) {
> > +        VLOG_WARN("%s: failed to configure RSS redirection table: err=%d",
> > +                  netdev_get_name(&dev->up), err);
> > +    }
> > +
> > +    return err;
> > +}
> > +
> > +static int
> > +dpdk_cp_prot_configure(struct netdev_dpdk *dev)
> > +{
> > +    int err = 0;
> > +
> > +    if (dev->up.n_rxq < 2) {
> > +        err = ENOTSUP;
> > +        VLOG_WARN("%s: cp-protection: not enough available rx queues",
> > +                  netdev_get_name(&dev->up));
> > +        goto out;
> > +    }
> > +
> > +    if (dev->requested_cp_prot_flags & DPDK_CP_PROT_LACP) {
> > +        const struct rte_flow_item items[] = {
> > +            {
> > +                .type = RTE_FLOW_ITEM_TYPE_ETH,
> > +                .spec = &(const struct rte_flow_item_eth){
> > +                    .type = htons(ETH_TYPE_LACP),
> > +                },
> > +                .mask = &(const struct rte_flow_item_eth){
> > +                    .type = htons(0xffff),
> > +                },
> > +            },
> > +            { .type = RTE_FLOW_ITEM_TYPE_END },
> > +        };
> > +        err = dpdk_cp_prot_add_flow(dev, items, "lacp");
> > +        if (err) {
> > +            goto out;
> > +        }
> > +    }
> > +
> > +    if (dev->cp_prot_flows_num) {
> > +        /* Reconfigure RSS reta in all but the cp protection queue. */
> > +        err = dpdk_cp_prot_rss_configure(dev, dev->up.n_rxq - 1);
> > +        if (!err) {
> > +            if (dev->up.n_rxq == 2) {
> > +                VLOG_INFO("%s: cp-protection: redirected other traffic to "
> > +                          "rx queue 0", netdev_get_name(&dev->up));
> > +            } else {
> > +                VLOG_INFO("%s: cp-protection: applied rss on rx queue 0-%u",
> > +                          netdev_get_name(&dev->up), dev->up.n_rxq - 2);
> > +            }
> > +        }
> > +    }
> > +
> > +out:
> > +    return err;
> > +}
> > +
> > +static void
> > +dpdk_cp_prot_unconfigure(struct netdev_dpdk *dev)
> > +{
> > +    struct rte_flow_error error;
> > +
> > +    if (!dev->cp_prot_flows_num) {
> > +        return;
> > +    }
> > +
> > +    VLOG_DBG("%s: cp-protection: reset flows", netdev_get_name(&dev->up));
> > +
> > +    for (int i = 0; i < dev->cp_prot_flows_num; i++) {
> > +        set_error(&error, RTE_FLOW_ERROR_TYPE_NONE);
> > +        if (rte_flow_destroy(dev->port_id, dev->cp_prot_flows[i], &error)) {
> > +            VLOG_DBG("%s: cp-protection: failed to destroy flow: %s",
> > +                     netdev_get_name(&dev->up),
> > +                     error.message ? error.message : "");
> > +        }
> > +    }
> > +    free(dev->cp_prot_flows);
> > +    dev->cp_prot_flows_num = 0;
> > +    dev->cp_prot_flows = NULL;
> > +}
> > +
> >  static int
> >  netdev_dpdk_reconfigure(struct netdev *netdev)
> >  {
> >      struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
> > +    bool try_cp_prot;
> >      int err = 0;
> >
> >      ovs_mutex_lock(&dev->mutex);
> >
> > +    try_cp_prot = dev->requested_cp_prot_flags != 0;
> > +    dev->requested_n_rxq = dev->user_n_rxq;
> > +    if (try_cp_prot) {
> > +        dev->requested_n_rxq += 1;
> > +    }
> > +
> >      if (netdev->n_txq == dev->requested_n_txq
> >          && netdev->n_rxq == dev->requested_n_rxq
> > +        && dev->cp_prot_flags == dev->requested_cp_prot_flags
> >          && dev->mtu == dev->requested_mtu
> >          && dev->lsc_interrupt_mode == dev->requested_lsc_interrupt_mode
> >          && dev->rxq_size == dev->requested_rxq_size
> > @@ -5200,6 +5478,9 @@ netdev_dpdk_reconfigure(struct netdev *netdev)
> >          goto out;
> >      }
> >
> > +retry:
> > +    dpdk_cp_prot_unconfigure(dev);
> > +
> >      if (dev->reset_needed) {
> >          rte_eth_dev_reset(dev->port_id);
> >          if_notifier_manual_report();
> > @@ -5224,6 +5505,7 @@ netdev_dpdk_reconfigure(struct netdev *netdev)
> >      dev->txq_size = dev->requested_txq_size;
> >
> >      rte_free(dev->tx_q);
> > +    dev->tx_q = NULL;
> >
> >      if (!eth_addr_equals(dev->hwaddr, dev->requested_hwaddr)) {
> >          err = netdev_dpdk_set_etheraddr__(dev, dev->requested_hwaddr);
> > @@ -5255,6 +5537,23 @@ netdev_dpdk_reconfigure(struct netdev *netdev)
> >       */
> >      dev->requested_hwaddr = dev->hwaddr;
> >
> > +    if (try_cp_prot) {
> > +        err = dpdk_cp_prot_configure(dev);
> > +        if (err) {
> > +            /* No hw support, disable & recover gracefully. */
> > +            try_cp_prot = false;
> > +            /*
> > +             * The extra queue must be explicitly removed here to ensure that
> > +             * it is unconfigured immediately.
> > +             */
> > +            dev->requested_n_rxq = dev->user_n_rxq;
> > +            goto retry;
> > +        }
> > +    } else {
> > +        VLOG_INFO("%s: cp-protection: disabled", netdev_get_name(&dev->up));
> > +    }
> > +    dev->cp_prot_flags = dev->requested_cp_prot_flags;
> > +
> >      dev->tx_q = netdev_dpdk_alloc_txq(netdev->n_txq);
> >      if (!dev->tx_q) {
> >          err = ENOMEM;
> > @@ -5468,7 +5767,12 @@ netdev_dpdk_flow_api_supported(struct netdev *netdev)
> >      ovs_mutex_lock(&dev->mutex);
> >      if (dev->type == DPDK_DEV_ETH) {
> >          /* TODO: Check if we able to offload some minimal flow. */
> > -        ret = true;
> > +        if (dev->requested_cp_prot_flags) {
> > +            VLOG_WARN("%s: hw-offload is mutually exclusive with "
> > +                      "cp-protection", netdev_get_name(netdev));
> > +        } else {
> > +            ret = true;
> > +        }
> >      }
> >      ovs_mutex_unlock(&dev->mutex);
> >  out:
> > diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
> > index 05ac1fbe5ef6..e6575f84eb3c 100644
> > --- a/vswitchd/vswitch.xml
> > +++ b/vswitchd/vswitch.xml
> > @@ -3453,6 +3453,32 @@ ovs-vsctl add-port br0 p0 -- set Interface p0 type=patch options:peer=p1 \
> >          <p>This option may only be used with dpdk VF representors.</p>
> >        </column>
> >
> > +      <column name="options" key="cp-protection"
> > +              type='{"type": "string", "enum": ["set", ["lacp"]]}'>
> > +        <p>
> > +          Allocate an extra Rx queue for control plane packets of the specified
> > +          protocol(s).
> > +        </p>
> > +        <p>
> > +          If the user has already configured multiple Rx queues via
> > +          <code>options:n_rxq</code> on the port, an additional one will be
> > +          allocated for control plane packets. If the hardware cannot satisfy
> > +          the requested number of Rx queues, the last Rx queue will
> > +          be assigned to the control plane. If only one Rx queue is available or
> > +          if the hardware does not support the RTE flow matchers/actions
> > +          required to redirect the selected protocols,
> > +          <code>cp-protection</code> will be disabled.
> > +        </p>
> > +        <p>
> > +          This feature is mutually exclusive with
> > +          <code>other_options:hw-offload</code> as it may conflict with the
> > +          offloaded RTE flows.
> > +        </p>
> > +        <p>
> > +          Disabled by default.
> > +        </p>
> > +      </column>
> > +
> >        <column name="other_config" key="tx-steering"
> >                type='{"type": "string",
> >                       "enum": ["set", ["thread", "hash"]]}'>
>
> _______________________________________________
> dev mailing list
> dev@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
>
Kevin Traynor March 23, 2023, 2:27 p.m. UTC | #5
On 22/02/2023 15:43, Robin Jarry wrote:
> Some control protocols are used to maintain link status between
> forwarding engines (e.g. LACP). When the system is not sized properly,
> the PMD threads may not be able to process all incoming traffic from the
> configured Rx queues. When a signaling packet of such protocols is
> dropped, it can cause link flapping, worsening the situation.
> 
> Use the RTE flow API to redirect these protocols into a dedicated Rx
> queue. The assumption is made that the ratio between control protocol
> traffic and user data traffic is very low and thus this dedicated Rx
> queue will never get full. The RSS redirection table is re-programmed to
> only use the other Rx queues. The RSS table size is stored in the
> netdev_dpdk structure at port initialization to avoid requesting the
> information again when changing the port configuration.
> 
> The additional Rx queue will be assigned a PMD core like any other Rx
> queue. Polling that extra queue may introduce increased latency and
> a slight performance penalty at the benefit of preventing link flapping.
> 
> This feature must be enabled per port on specific protocols via the
> cp-protection option. This option takes a comma-separated list of
> protocol names. It is only supported on ethernet ports. This feature is
> experimental.
> 
> If the user has already configured multiple Rx queues on the port, an
> additional one will be allocated for control plane packets. If the
> hardware cannot satisfy the requested number of Rx queues, the
> last Rx queue will be assigned for control plane. If only one Rx queue
> is available, the cp-protection feature will be disabled. If the
> hardware does not support the RTE flow matchers/actions, the feature
> will be disabled.
> 
> It cannot be enabled when other_config:hw-offload=true as it may
> conflict with the offloaded RTE flows. Similarly, if hw-offload is
> enabled while some ports already have cp-protection enabled, RTE flow
> offloading will be disabled on these ports.
> 
> Example use:
> 
>   ovs-vsctl add-bond br-phy bond0 phy0 phy1 -- \
>     set interface phy0 type=dpdk options:dpdk-devargs=0000:ca:00.0 -- \
>     set interface phy0 options:cp-protection=lacp -- \
>     set interface phy1 type=dpdk options:dpdk-devargs=0000:ca:00.1 -- \
>     set interface phy1 options:cp-protection=lacp
> 
> As a starting point, only one protocol is supported: LACP. Other
> protocols can be added in the future. NIC compatibility should be
> checked.
> 
> To validate that this works as intended, I used a traffic generator to
> generate random traffic slightly above the machine capacity at line rate
> on a two ports bond interface. OVS is configured to receive traffic on
> two VLANs and pop/push them in a br-int bridge based on tags set on
> patch ports.
> 
>     +----------------------+
>     |         DUT          |
>     |+--------------------+|
>     ||       br-int       || in_port=patch10,actions=mod_dl_src:$patch11,mod_dl_dst:$tgen1,output:patch11
>     ||                    || in_port=patch11,actions=mod_dl_src:$patch10,mod_dl_dst:$tgen0,output:patch10
>     || patch10    patch11 ||
>     |+---|-----------|----+|
>     |    |           |     |
>     |+---|-----------|----+|
>     || patch00    patch01 ||
>     ||  tag:10    tag:20  ||
>     ||                    ||
>     ||       br-phy       || default flow, action=NORMAL
>     ||                    ||
>     ||       bond0        || balance-slb, lacp=passive, lacp-time=fast
>     ||    phy0   phy1     ||
>     |+------|-----|-------+|
>     +-------|-----|--------+
>             |     |
>     +-------|-----|--------+
>     |     port0  port1     | balance L3/L4, lacp=active, lacp-time=fast
>     |         lag          | mode trunk VLANs 10, 20
>     |                      |
>     |        switch        |
>     |                      |
>     |  vlan 10    vlan 20  |  mode access
>     |   port2      port3   |
>     +-----|----------|-----+
>           |          |
>     +-----|----------|-----+
>     |   tgen0      tgen1   |  Random traffic that is properly balanced
>     |                      |  across the bond ports in both directions.
>     |  traffic generator   |
>     +----------------------+
> 
> Without cp-protection, the bond0 links are randomly switching to
> "defaulted" when one of the LACP packets sent by the switch is dropped
> because the RX queues are full and the PMD threads did not process them
> fast enough. When that happens, all traffic must go through a single
> link which causes above line rate traffic to be dropped.
> 
>   ~# ovs-appctl lacp/show-stats bond0
>   ---- bond0 statistics ----
>   member: phy0:
>     TX PDUs: 347246
>     RX PDUs: 14865
>     RX Bad PDUs: 0
>     RX Marker Request PDUs: 0
>     Link Expired: 168
>     Link Defaulted: 0
>     Carrier Status Changed: 0
>   member: phy1:
>     TX PDUs: 347245
>     RX PDUs: 14919
>     RX Bad PDUs: 0
>     RX Marker Request PDUs: 0
>     Link Expired: 147
>     Link Defaulted: 1
>     Carrier Status Changed: 0
> 
> When cp-protection is enabled, no LACP packet is dropped and the bond
> links remain enabled at all times, maximizing the throughput. Neither
> the "Link Expired" nor the "Link Defaulted" counters are incremented
> anymore.
> 
> This feature may be considered as "QoS". However, it does not work by
> limiting the rate of traffic explicitly. It only guarantees that some
> protocols have a lower chance of being dropped because the PMD cores
> cannot keep up with regular traffic.
> 
> The choice of protocols is limited on purpose. This is not meant to be
> configurable by users. Some limited configurability could be considered
> in the future but it would expose to more potential issues if users are
> accidentally redirecting all traffic in the control plane queue.
> 
> Cc: Anthony Harivel <aharivel@redhat.com>
> Cc: Christophe Fontaine <cfontain@redhat.com>
> Cc: David Marchand <david.marchand@redhat.com>
> Cc: Kevin Traynor <ktraynor@redhat.com>
> Signed-off-by: Robin Jarry <rjarry@redhat.com>
> ---
> v8 -> v9:
> 
> * Rebased on cf288fdfe2bf ("AUTHORS: Add Liang Mancang and Viacheslav
>    Galaktionov.")
> * Reset rte_flow_error struct before passing it to functions.
> * Refined some comments.
> * Updated check for hw-offload on a per-port basis. That way, if a port
>    already has cp-protection enabled, hw-offload will not be enabled on
>    it but cp-protection will continue to work until next restart.
>    However, On next restart, hw-offload will be checked first and
>    therefore cp-protection will be disabled on all ports.
> 

Hi Robin,

Regarding having both features enabled, I think it's an issue that it's 
chronology-based if they are enabled while running. It introduces 
another variable that might confuse things.

For example, the operation could be changed from cp-proto to hw-offload 
on a port by restarting OVS, which would probably be unexpected by a 
user. I mentioned it while chatting to Ilya and he agreed that same 
state in ovsdb should mean same state in ovs-vswitchd.

So that would mean having a binary priority between the two features and 
  removing one if the higher priority one was later enabled (either 
globally or per-port?).

Whatever the co-existence (or not) is, I think it's better to resolve it 
in mail first to avoid you having to rework code over again. I don't 
think it needs to be super-smart as these are experimental features, 
just needs to be consistent and clearly documented for the user.

Code wise, I've tested previous versions and I think the code is in 
pretty good shape overall. I'll do another pass review/testing when the 
hwol/cp-prot prio is resolved.

thanks,
Kevin.

> Unless there are significant reservations about this patch, would it be ok
> to include it for 3.2?
> 
> Thanks!
> 
>   Documentation/topics/dpdk/phy.rst |  77 ++++++++
>   NEWS                              |   4 +
>   lib/netdev-dpdk.c                 | 310 +++++++++++++++++++++++++++++-
>   vswitchd/vswitch.xml              |  26 +++
>   4 files changed, 414 insertions(+), 3 deletions(-)
> 

<snip>
Robin Jarry April 6, 2023, 8:47 a.m. UTC | #6
Kevin Traynor, Mar 23, 2023 at 15:27:
> Hi Robin,
>
> Regarding having both features enabled, I think it's an issue that it's 
> chronology-based if they are enabled while running. It introduces 
> another variable that might confuse things.
>
> For example, the operation could be changed from cp-proto to hw-offload 
> on a port by restarting OVS, which would probably be unexpected by a 
> user. I mentioned it while chatting to Ilya and he agreed that same 
> state in ovsdb should mean same state in ovs-vswitchd.
>
> So that would mean having a binary priority between the two features and 
>   removing one if the higher priority one was later enabled (either 
> globally or per-port?).
>
> Whatever the co-existence (or not) is, I think it's better to resolve it 
> in mail first to avoid you having to rework code over again. I don't 
> think it needs to be super-smart as these are experimental features, 
> just needs to be consistent and clearly documented for the user.
>
> Code wise, I've tested previous versions and I think the code is in 
> pretty good shape overall. I'll do another pass review/testing when the 
> hwol/cp-prot prio is resolved.
>
> thanks,
> Kevin.

Hi Kevin,

sorry not to have replied earlier, I got caught up in other issues :)

I agree that having a deterministic priority between rte flow offload
and control plane protection is a must have. However, I am not sure how
to implement it in the current state of things.

The main issue is that cp-protection is dpdk specific whereas hw-offload
is in the abstract netdev layer. There is no way to check the state of
cp-protection from netdev-offload.c. Maybe I could expose a minimal
generic API in netdev.h to determine if hw-offload can be enabled for
a specific device or not. And implement it for dpdk, based on the value
of cp-protection.

What do you think?
Kevin Traynor April 6, 2023, 12:05 p.m. UTC | #7
On 06/04/2023 09:47, Robin Jarry wrote:
> Kevin Traynor, Mar 23, 2023 at 15:27:
>> Hi Robin,
>>
>> Regarding having both features enabled, I think it's an issue that it's
>> chronology-based if they are enabled while running. It introduces
>> another variable that might confuse things.
>>
>> For example, the operation could be changed from cp-proto to hw-offload
>> on a port by restarting OVS, which would probably be unexpected by a
>> user. I mentioned it while chatting to Ilya and he agreed that same
>> state in ovsdb should mean same state in ovs-vswitchd.
>>
>> So that would mean having a binary priority between the two features and
>>    removing one if the higher priority one was later enabled (either
>> globally or per-port?).
>>
>> Whatever the co-existence (or not) is, I think it's better to resolve it 
>> in mail first to avoid you having to rework code over again. I don't
>> think it needs to be super-smart as these are experimental features,
>> just needs to be consistent and clearly documented for the user.
>>
>> Code wise, I've tested previous versions and I think the code is in
>> pretty good shape overall. I'll do another pass review/testing when the
>> hwol/cp-prot prio is resolved.
>>
>> thanks,
>> Kevin.
> 
> Hi Kevin,
> 
> sorry not to have replied earlier, I got caught up in other issues :)
> 
> I agree that having a deterministic priority between rte flow offload
> and control plane protection is a must have. However, I am not sure how
> to implement it in the current state of things.
> 
> The main issue is that cp-protection is dpdk specific whereas hw-offload
> is in the abstract netdev layer. There is no way to check the state of
> cp-protection from netdev-offload.c. Maybe I could expose a minimal
> generic API in netdev.h to determine if hw-offload can be enabled for
> a specific device or not. And implement it for dpdk, based on the value
> of cp-protection.
> 
> What do you think?
> 

In netdev-dpdk's netdev_dpdk_flow_api_supported(), there is already code 
to check whether cp-proto is enabled on a port when hw-offload gets 
set. There is also code to unconfigure the cp-proto flows in case the 
user removed the option.

So rather than printing the mutually exclusive warning in 
netdev_dpdk_flow_api_supported(), could you just trigger an 
unconfigure of the cp-proto?
diff mbox series

Patch

diff --git a/Documentation/topics/dpdk/phy.rst b/Documentation/topics/dpdk/phy.rst
index 4b0fe8dded3a..518b67134639 100644
--- a/Documentation/topics/dpdk/phy.rst
+++ b/Documentation/topics/dpdk/phy.rst
@@ -131,6 +131,83 @@  possible with DPDK acceleration. It is possible to configure multiple Rx queues
 for ``dpdk`` ports, thus ensuring this is not a bottleneck for performance. For
 information on configuring PMD threads, refer to :doc:`pmd`.
 
+Control Plane Protection
+------------------------
+
+.. warning:: This feature is experimental.
+
+Some control protocols are used to maintain link status between forwarding
+engines. In SDN environments, these packets share the same physical network
+as the user data traffic.
+
+When the system is not sized properly, the PMD threads may not be able to
+process all incoming traffic from the configured Rx queues. When a signaling
+packet of such protocols is dropped, it can cause link flapping, worsening the
+situation.
+
+Some physical NICs can be programmed to put these protocols in a dedicated
+hardware Rx queue using the rte_flow__ API.
+
+__ https://doc.dpdk.org/guides-22.11/prog_guide/rte_flow.html
+
+The currently supported control plane protocols are:
+
+``lacp``
+   `Link Aggregation Control Protocol`__. Ether type ``0x8809``.
+
+   __ https://www.ieee802.org/3/ad/public/mar99/seaman_1_0399.pdf
+
+.. warning::
+
+   This feature is not compatible with all NICs. Refer to the DPDK
+   `compatibility matrix`__ and vendor documentation for more details.
+
+   __ https://doc.dpdk.org/guides-22.11/nics/overview.html
+
+Control plane protection must be enabled on specific protocols per port. The
+``cp-protection`` option requires a comma-separated list of protocol names::
+
+   $ ovs-vsctl add-port br0 dpdk-p0 -- set Interface dpdk-p0 type=dpdk \
+        options:dpdk-devargs=0000:01:00.0 options:n_rxq=2 \
+        options:cp-protection=lacp
+
+.. note::
+
+   If multiple Rx queues are already configured, regular RSS (Receive Side
+   Scaling) queue balancing is done on all but the extra control plane
+   protection queue.
+
+.. tip::
+
+   You can check if control plane protection is supported on a port with the
+   following command::
+
+      $ ovs-vsctl get interface dpdk-p0 status
+      {cp_protection_queue="2", driver_name=..., rss_queues="0-1"}
+
+   This will also show in ``ovs-vswitchd.log``::
+
+      INFO|dpdk-p0: cp-protection: redirecting lacp traffic to queue 2
+      INFO|dpdk-p0: cp-protection: applying rss on queues 0-1
+
+   If the hardware does not support redirecting control plane traffic to
+   a dedicated queue, it will be explicit::
+
+      $ ovs-vsctl get interface dpdk-p0 status
+      {cp_protection=unsupported, driver_name=...}
+
+   More details can often be found in ``ovs-vswitchd.log``::
+
+      WARN|dpdk-p0: cp-protection: failed to add lacp flow: Unsupported pattern
+
+To disable control plane protection on a port, use the following command::
+
+   $ ovs-vsctl remove Interface dpdk-p0 options cp-protection
+
+You can see that it has been disabled in ``ovs-vswitchd.log``::
+
+   INFO|dpdk-p0: cp-protection: disabled
+
 .. _dpdk-phy-flow-control:
 
 Flow Control
diff --git a/NEWS b/NEWS
index 85b34962145e..22505ea9d4ad 100644
--- a/NEWS
+++ b/NEWS
@@ -10,6 +10,10 @@  Post-v3.1.0
        in order to create OVSDB sockets with access mode of 0770.
    - QoS:
      * Added new configuration option 'jitter' for a linux-netem QoS type.
+   - DPDK:
+     * New experimental "cp-protection=<protocol>" option to redirect certain
+       protocols (for now, only LACP) to a dedicated hardware queue using
+       RTE flow.
 
 
 v3.1.0 - 16 Feb 2023
diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
index fb0dd43f75c5..549fc425edba 100644
--- a/lib/netdev-dpdk.c
+++ b/lib/netdev-dpdk.c
@@ -415,6 +415,10 @@  enum dpdk_hw_ol_features {
     NETDEV_TX_SCTP_CHECKSUM_OFFLOAD = 1 << 4,
 };
 
+enum dpdk_cp_prot_flags {
+    DPDK_CP_PROT_LACP = 1 << 0,
+};
+
 /*
  * In order to avoid confusion in variables names, following naming convention
  * should be used, if possible:
@@ -504,11 +508,18 @@  struct netdev_dpdk {
          * so we remember the request and update them next time
          * netdev_dpdk*_reconfigure() is called */
         int requested_mtu;
+        /* User input for n_rxq + an optional control plane protection queue
+         * (see netdev_dpdk_reconfigure). This field is different from the
+         * other requested_* fields as it may contain a different value than
+         * the user input. */
         int requested_n_txq;
         int requested_n_rxq;
         int requested_rxq_size;
         int requested_txq_size;
 
+        /* User input for n_rxq (see dpdk_set_rxq_config). */
+        int user_n_rxq;
+
         /* Number of rx/tx descriptors for physical devices */
         int rxq_size;
         int txq_size;
@@ -534,6 +545,13 @@  struct netdev_dpdk {
 
         /* VF configuration. */
         struct eth_addr requested_hwaddr;
+
+        /* Requested control plane protection flags,
+         * from the enum set 'dpdk_cp_prot_flags'. */
+        uint64_t requested_cp_prot_flags;
+        uint64_t cp_prot_flags;
+        size_t cp_prot_flows_num;
+        struct rte_flow **cp_prot_flows;
     );
 
     PADDED_MEMBERS(CACHE_LINE_SIZE,
@@ -1310,9 +1328,14 @@  common_construct(struct netdev *netdev, dpdk_port_t port_no,
     netdev->n_rxq = 0;
     netdev->n_txq = 0;
     dev->requested_n_rxq = NR_QUEUE;
+    dev->user_n_rxq = NR_QUEUE;
     dev->requested_n_txq = NR_QUEUE;
     dev->requested_rxq_size = NIC_PORT_DEFAULT_RXQ_SIZE;
     dev->requested_txq_size = NIC_PORT_DEFAULT_TXQ_SIZE;
+    dev->requested_cp_prot_flags = 0;
+    dev->cp_prot_flags = 0;
+    dev->cp_prot_flows_num = 0;
+    dev->cp_prot_flows = NULL;
 
     /* Initialize the flow control to NULL */
     memset(&dev->fc_conf, 0, sizeof dev->fc_conf);
@@ -1487,6 +1510,8 @@  common_destruct(struct netdev_dpdk *dev)
     ovs_mutex_destroy(&dev->mutex);
 }
 
+static void dpdk_cp_prot_unconfigure(struct netdev_dpdk *dev);
+
 static void
 netdev_dpdk_destruct(struct netdev *netdev)
 {
@@ -1494,6 +1519,9 @@  netdev_dpdk_destruct(struct netdev *netdev)
 
     ovs_mutex_lock(&dpdk_mutex);
 
+    /* Destroy any rte flows to allow RXQs to be removed. */
+    dpdk_cp_prot_unconfigure(dev);
+
     rte_eth_dev_stop(dev->port_id);
     dev->started = false;
 
@@ -1908,8 +1936,8 @@  dpdk_set_rxq_config(struct netdev_dpdk *dev, const struct smap *args)
     int new_n_rxq;
 
     new_n_rxq = MAX(smap_get_int(args, "n_rxq", NR_QUEUE), 1);
-    if (new_n_rxq != dev->requested_n_rxq) {
-        dev->requested_n_rxq = new_n_rxq;
+    if (new_n_rxq != dev->user_n_rxq) {
+        dev->user_n_rxq = new_n_rxq;
         netdev_request_reconfigure(&dev->up);
     }
 }
@@ -1931,6 +1959,48 @@  dpdk_process_queue_size(struct netdev *netdev, const struct smap *args,
     }
 }
 
+static void
+dpdk_cp_prot_set_config(struct netdev *netdev, struct netdev_dpdk *dev,
+                        const struct smap *args, char **errp)
+{
+    const char *arg = smap_get_def(args, "cp-protection", "");
+    char *token, *saveptr, *buf;
+    uint64_t flags = 0;
+
+    buf = xstrdup(arg);
+    token = strtok_r(buf, ",", &saveptr);
+    while (token) {
+        if (strcmp(token, "lacp") == 0) {
+            flags |= DPDK_CP_PROT_LACP;
+        } else {
+            VLOG_WARN_BUF(errp, "%s options:cp-protection "
+                          "unknown protocol '%s'",
+                          netdev_get_name(netdev), token);
+        }
+        token = strtok_r(NULL, ",", &saveptr);
+    }
+    free(buf);
+
+    if (flags && dev->type != DPDK_DEV_ETH) {
+        VLOG_WARN_BUF(errp, "%s options:cp-protection "
+                      "is only supported on ethernet ports",
+                      netdev_get_name(netdev));
+        flags = 0;
+    }
+
+    if (flags && ovsrcu_get(void *, &netdev->hw_info.offload_data)) {
+        VLOG_WARN_BUF(errp, "%s options:cp-protection "
+                      "is incompatible with hw-offload",
+                      netdev_get_name(netdev));
+        flags = 0;
+    }
+
+    if (flags != dev->requested_cp_prot_flags) {
+        dev->requested_cp_prot_flags = flags;
+        netdev_request_reconfigure(netdev);
+    }
+}
+
 static int
 netdev_dpdk_set_config(struct netdev *netdev, const struct smap *args,
                        char **errp)
@@ -1950,6 +2020,8 @@  netdev_dpdk_set_config(struct netdev *netdev, const struct smap *args,
     ovs_mutex_lock(&dpdk_mutex);
     ovs_mutex_lock(&dev->mutex);
 
+    dpdk_cp_prot_set_config(netdev, dev, args, errp);
+
     dpdk_set_rxq_config(dev, args);
 
     dpdk_process_queue_size(netdev, args, "n_rxq_desc",
@@ -3819,9 +3891,12 @@  netdev_dpdk_get_status(const struct netdev *netdev, struct smap *args)
 {
     struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
     struct rte_eth_dev_info dev_info;
+    size_t cp_prot_flows_num;
+    uint64_t cp_prot_flags;
     const char *bus_info;
     uint32_t link_speed;
     uint32_t dev_flags;
+    int n_rxq;
 
     if (!rte_eth_dev_is_valid_port(dev->port_id)) {
         return ENODEV;
@@ -3833,6 +3908,9 @@  netdev_dpdk_get_status(const struct netdev *netdev, struct smap *args)
     link_speed = dev->link.link_speed;
     dev_flags = *dev_info.dev_flags;
     bus_info = rte_dev_bus_info(dev_info.device);
+    cp_prot_flags = dev->cp_prot_flags;
+    cp_prot_flows_num = dev->cp_prot_flows_num;
+    n_rxq = netdev->n_rxq;
     ovs_mutex_unlock(&dev->mutex);
     ovs_mutex_unlock(&dpdk_mutex);
 
@@ -3875,6 +3953,19 @@  netdev_dpdk_get_status(const struct netdev *netdev, struct smap *args)
                         ETH_ADDR_ARGS(dev->hwaddr));
     }
 
+    if (cp_prot_flags) {
+        if (!cp_prot_flows_num) {
+            smap_add(args, "cp_protection", "unsupported");
+        } else {
+            smap_add_format(args, "cp_protection_queue", "%d", n_rxq - 1);
+            if (n_rxq > 2) {
+                smap_add_format(args, "rss_queues", "0-%d", n_rxq - 2);
+            } else {
+                smap_add(args, "rss_queues", "0");
+            }
+        }
+    }
+
     return 0;
 }
 
@@ -5178,16 +5269,203 @@  static const struct dpdk_qos_ops trtcm_policer_ops = {
     .qos_queue_dump_state_init = trtcm_policer_qos_queue_dump_state_init
 };
 
+static int
+dpdk_cp_prot_add_flow(struct netdev_dpdk *dev,
+                      const struct rte_flow_item items[],
+                      const char *desc)
+{
+    const struct rte_flow_attr attr = { .ingress = 1 };
+    const struct rte_flow_action actions[] = {
+        {
+            .type = RTE_FLOW_ACTION_TYPE_QUEUE,
+            .conf = &(const struct rte_flow_action_queue) {
+                .index = dev->up.n_rxq - 1,
+            },
+        },
+        { .type = RTE_FLOW_ACTION_TYPE_END },
+    };
+    struct rte_flow_error error;
+    struct rte_flow *flow;
+    size_t num;
+    int err;
+
+    set_error(&error, RTE_FLOW_ERROR_TYPE_NONE);
+    err = rte_flow_validate(dev->port_id, &attr, items, actions, &error);
+    if (err) {
+        VLOG_WARN("%s: cp-protection: device does not support %s flow: %s",
+                  netdev_get_name(&dev->up), desc,
+                  error.message ? error.message : "");
+        goto out;
+    }
+
+    set_error(&error, RTE_FLOW_ERROR_TYPE_NONE);
+    flow = rte_flow_create(dev->port_id, &attr, items, actions, &error);
+    if (flow == NULL) {
+        VLOG_WARN("%s: cp-protection: failed to add %s flow: %s",
+                  netdev_get_name(&dev->up), desc,
+                  error.message ? error.message : "");
+        err = rte_errno;
+        goto out;
+    }
+
+    num = dev->cp_prot_flows_num + 1;
+    dev->cp_prot_flows = xrealloc(dev->cp_prot_flows, sizeof(flow) * num);
+    dev->cp_prot_flows[dev->cp_prot_flows_num] = flow;
+    dev->cp_prot_flows_num = num;
+
+    VLOG_INFO("%s: cp-protection: redirected %s traffic to rx queue %d",
+              netdev_get_name(&dev->up), desc, dev->up.n_rxq - 1);
+out:
+    return err;
+}
+
+#define RETA_CONF_SIZE (RTE_ETH_RSS_RETA_SIZE_512 / RTE_ETH_RETA_GROUP_SIZE)
+
+static int
+dpdk_cp_prot_rss_configure(struct netdev_dpdk *dev, int rss_n_rxq)
+{
+    struct rte_eth_rss_reta_entry64 reta_conf[RETA_CONF_SIZE];
+    struct rte_eth_dev_info info;
+    int err;
+
+    rte_eth_dev_info_get(dev->port_id, &info);
+
+    if (info.reta_size % rss_n_rxq != 0 &&
+        info.reta_size < RTE_ETH_RSS_RETA_SIZE_128) {
+        /*
+         * Some drivers set reta_size equal to the total number of rxqs that
+         * are configured when it is a power of two. Since we are actually
+         * reconfiguring the redirection table to exclude the last rxq, we may
+         * end up with an imbalanced redirection table. For example, such
+         * configuration:
+         *
+         *   options:n_rxq=3 options:cp-protection=lacp
+         *
+         * Will actually configure 4 rxqs on the NIC, and the default reta to:
+         *
+         *   [0, 1, 2, 3]
+         *
+         * And dpdk_cp_prot_rss_configure() will reconfigure reta to:
+         *
+         *   [0, 1, 2, 0]
+         *
+         * Causing queue 0 to receive twice as much traffic as queues 1 and 2.
+         *
+         * Work around this corner case by forcing a larger redirection table
+         * size of 128 entries when reta_size is not a multiple of rss_n_rxq
+         * and when reta_size is less than 128. This value seems to be
+         * supported by most of the drivers that also support rte flow.
+         */
+        info.reta_size = RTE_ETH_RSS_RETA_SIZE_128;
+    }
+
+    memset(reta_conf, 0, sizeof(reta_conf));
+    for (uint16_t i = 0; i < info.reta_size; i++) {
+        uint16_t idx = i / RTE_ETH_RETA_GROUP_SIZE;
+        uint16_t shift = i % RTE_ETH_RETA_GROUP_SIZE;
+        reta_conf[idx].mask |= 1ULL << shift;
+        reta_conf[idx].reta[shift] = i % rss_n_rxq;
+    }
+    err = rte_eth_dev_rss_reta_update(dev->port_id, reta_conf, info.reta_size);
+    if (err < 0) {
+        VLOG_WARN("%s: failed to configure RSS redirection table: err=%d",
+                  netdev_get_name(&dev->up), err);
+    }
+
+    return err;
+}
+
+static int
+dpdk_cp_prot_configure(struct netdev_dpdk *dev)
+{
+    int err = 0;
+
+    if (dev->up.n_rxq < 2) {
+        err = ENOTSUP;
+        VLOG_WARN("%s: cp-protection: not enough available rx queues",
+                  netdev_get_name(&dev->up));
+        goto out;
+    }
+
+    if (dev->requested_cp_prot_flags & DPDK_CP_PROT_LACP) {
+        const struct rte_flow_item items[] = {
+            {
+                .type = RTE_FLOW_ITEM_TYPE_ETH,
+                .spec = &(const struct rte_flow_item_eth){
+                    .type = htons(ETH_TYPE_LACP),
+                },
+                .mask = &(const struct rte_flow_item_eth){
+                    .type = htons(0xffff),
+                },
+            },
+            { .type = RTE_FLOW_ITEM_TYPE_END },
+        };
+        err = dpdk_cp_prot_add_flow(dev, items, "lacp");
+        if (err) {
+            goto out;
+        }
+    }
+
+    if (dev->cp_prot_flows_num) {
+        /* Reconfigure RSS reta in all but the cp protection queue. */
+        err = dpdk_cp_prot_rss_configure(dev, dev->up.n_rxq - 1);
+        if (!err) {
+            if (dev->up.n_rxq == 2) {
+                VLOG_INFO("%s: cp-protection: redirected other traffic to "
+                          "rx queue 0", netdev_get_name(&dev->up));
+            } else {
+                VLOG_INFO("%s: cp-protection: applied rss on rx queues 0-%u",
+                          netdev_get_name(&dev->up), dev->up.n_rxq - 2);
+            }
+        }
+    }
+
+out:
+    return err;
+}
+
+static void
+dpdk_cp_prot_unconfigure(struct netdev_dpdk *dev)
+{
+    struct rte_flow_error error;
+
+    if (!dev->cp_prot_flows_num) {
+        return;
+    }
+
+    VLOG_DBG("%s: cp-protection: reset flows", netdev_get_name(&dev->up));
+
+    for (int i = 0; i < dev->cp_prot_flows_num; i++) {
+        set_error(&error, RTE_FLOW_ERROR_TYPE_NONE);
+        if (rte_flow_destroy(dev->port_id, dev->cp_prot_flows[i], &error)) {
+            VLOG_DBG("%s: cp-protection: failed to destroy flow: %s",
+                     netdev_get_name(&dev->up),
+                     error.message ? error.message : "");
+        }
+    }
+    free(dev->cp_prot_flows);
+    dev->cp_prot_flows_num = 0;
+    dev->cp_prot_flows = NULL;
+}
+
 static int
 netdev_dpdk_reconfigure(struct netdev *netdev)
 {
     struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
+    bool try_cp_prot;
     int err = 0;
 
     ovs_mutex_lock(&dev->mutex);
 
+    try_cp_prot = dev->requested_cp_prot_flags != 0;
+    dev->requested_n_rxq = dev->user_n_rxq;
+    if (try_cp_prot) {
+        dev->requested_n_rxq += 1;
+    }
+
     if (netdev->n_txq == dev->requested_n_txq
         && netdev->n_rxq == dev->requested_n_rxq
+        && dev->cp_prot_flags == dev->requested_cp_prot_flags
         && dev->mtu == dev->requested_mtu
         && dev->lsc_interrupt_mode == dev->requested_lsc_interrupt_mode
         && dev->rxq_size == dev->requested_rxq_size
@@ -5200,6 +5478,9 @@  netdev_dpdk_reconfigure(struct netdev *netdev)
         goto out;
     }
 
+retry:
+    dpdk_cp_prot_unconfigure(dev);
+
     if (dev->reset_needed) {
         rte_eth_dev_reset(dev->port_id);
         if_notifier_manual_report();
@@ -5224,6 +5505,7 @@  netdev_dpdk_reconfigure(struct netdev *netdev)
     dev->txq_size = dev->requested_txq_size;
 
     rte_free(dev->tx_q);
+    dev->tx_q = NULL;
 
     if (!eth_addr_equals(dev->hwaddr, dev->requested_hwaddr)) {
         err = netdev_dpdk_set_etheraddr__(dev, dev->requested_hwaddr);
@@ -5255,6 +5537,23 @@  netdev_dpdk_reconfigure(struct netdev *netdev)
      */
     dev->requested_hwaddr = dev->hwaddr;
 
+    if (try_cp_prot) {
+        err = dpdk_cp_prot_configure(dev);
+        if (err) {
+            /* No hw support, disable & recover gracefully. */
+            try_cp_prot = false;
+            /*
+             * The extra queue must be explicitly removed here to ensure that
+             * it is unconfigured immediately.
+             */
+            dev->requested_n_rxq = dev->user_n_rxq;
+            goto retry;
+        }
+    } else {
+        VLOG_INFO("%s: cp-protection: disabled", netdev_get_name(&dev->up));
+    }
+    dev->cp_prot_flags = dev->requested_cp_prot_flags;
+
     dev->tx_q = netdev_dpdk_alloc_txq(netdev->n_txq);
     if (!dev->tx_q) {
         err = ENOMEM;
@@ -5468,7 +5767,12 @@  netdev_dpdk_flow_api_supported(struct netdev *netdev)
     ovs_mutex_lock(&dev->mutex);
     if (dev->type == DPDK_DEV_ETH) {
         /* TODO: Check if we able to offload some minimal flow. */
-        ret = true;
+        if (dev->requested_cp_prot_flags) {
+            VLOG_WARN("%s: hw-offload is mutually exclusive with "
+                      "cp-protection", netdev_get_name(netdev));
+        } else {
+            ret = true;
+        }
     }
     ovs_mutex_unlock(&dev->mutex);
 out:
diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
index 05ac1fbe5ef6..e6575f84eb3c 100644
--- a/vswitchd/vswitch.xml
+++ b/vswitchd/vswitch.xml
@@ -3453,6 +3453,32 @@  ovs-vsctl add-port br0 p0 -- set Interface p0 type=patch options:peer=p1 \
         <p>This option may only be used with dpdk VF representors.</p>
       </column>
 
+      <column name="options" key="cp-protection"
+              type='{"type": "string", "enum": ["set", ["lacp"]]}'>
+        <p>
+          Allocate an extra Rx queue for control plane packets of the specified
+          protocol(s).
+        </p>
+        <p>
+          If the user has already configured multiple Rx queues
+          (<code>options:n_rxq</code>) on the port, an additional one will be
+          allocated for control plane packets. If the hardware cannot satisfy
+          the requested number of Rx queues, the last Rx queue will be
+          assigned to the control plane. If only one Rx queue is available or
+          if the hardware does not support the RTE flow matchers/actions
+          required to redirect the selected protocols,
+          <code>cp-protection</code> will be disabled.
+        </p>
+        <p>
+          This feature is mutually exclusive with
+          <code>other_config:hw-offload</code> as it may conflict with the
+          offloaded RTE flows.
+        </p>
+        <p>
+          Disabled by default.
+        </p>
+      </column>
+
       <column name="other_config" key="tx-steering"
               type='{"type": "string",
                      "enum": ["set", ["thread", "hash"]]}'>