[ovs-dev,v3,4/5] dpif-netdev: Introduce hash-based Tx packet steering mode.

Message ID 20211215162630.90249-5-maxime.coquelin@redhat.com
State Superseded
Series dpif-netdev: Hash-based Tx packet steering

Checks

Context Check Description
ovsrobot/apply-robot success apply and check: success
ovsrobot/github-robot-_Build_and_Test success github build: passed

Commit Message

Maxime Coquelin Dec. 15, 2021, 4:26 p.m. UTC
This patch adds a new hash Tx steering mode that
distributes the traffic on all the Tx queues, regardless
of the number of PMD threads. It is useful for guests
expecting traffic to be distributed on all the vCPUs.

The idea here is to re-use the 5-tuple hash of the packets,
already computed to build the flow batches (and so it
does not provide flexibility on which fields are part of
the hash).

There is also no user-configurable indirection table,
given that the feature is transparent to the guest. The
queue selection is just a modulo operation between the
packet hash and the number of Tx queues.
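
As a minimal illustration of that selection rule, here is a standalone
sketch with hypothetical names (not the actual OVS symbols):

```c
#include <stdint.h>

/* Hash-mode Tx queue selection: the packet's 5-tuple RSS hash, already
 * computed for flow batching, reduced modulo the port's Tx queue count.
 * Hash mode is only enabled for ports with more than one Tx queue, so
 * no zero guard is needed here. */
int
hash_mode_select_txq(uint32_t rss_hash, int n_txq)
{
    return (int) (rss_hash % (uint32_t) n_txq);
}
```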

There are no (at least intentionally) functional changes
for the existing XPS and static modes. There should not be
noticeable performance changes for these modes (only one
more branch in the hot path).

For the hash mode, performance could be impacted due to
locking when multiple PMD threads are in use (same as
XPS mode) and also because of the second level of batching.

Regarding the batching, the existing Tx port output_pkts
is not modified. It means that at most NETDEV_MAX_BURST
packets can be batched across all the Tx queues. A second
level of batching is done in dp_netdev_pmd_flush_output_on_port(),
only for this hash mode.
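
The second batching level can be illustrated with a simplified,
self-contained sketch (hypothetical types and names; the real code
operates on struct dp_packet_batch and calls netdev_send()):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_BURST 32   /* Stands in for NETDEV_MAX_BURST. */
#define MAX_TXQ    8

/* A per-queue batch; in this sketch a batch just records hashes. */
struct mini_batch {
    uint32_t hashes[MAX_BURST];
    size_t count;
};

/* Second-level batching: split a shared output burst per Tx queue by
 * packet hash, then flush each non-empty per-queue batch, as
 * dp_netdev_pmd_flush_output_on_port() does in hash mode.  Returns
 * the number of batches that would be flushed. */
int
rebatch_by_hash(const uint32_t *pkt_hashes, size_t n_pkts, int n_txq,
                struct mini_batch txq_pkts[])
{
    int flushed = 0;
    size_t i;
    int q;

    for (i = 0; i < n_pkts; i++) {
        int qid = pkt_hashes[i] % (uint32_t) n_txq;
        struct mini_batch *b = &txq_pkts[qid];

        b->hashes[b->count++] = pkt_hashes[i];
    }
    for (q = 0; q < n_txq; q++) {
        if (txq_pkts[q].count) {
            /* netdev_send(netdev, q, &txq_pkts[q], true); */
            flushed++;
            txq_pkts[q].count = 0;   /* Re-init for the next burst. */
        }
    }
    return flushed;
}

/* Fixed-input self-check: 5 packets spread over 3 Tx queues. */
int
demo_rebatch(void)
{
    struct mini_batch batches[MAX_TXQ];
    uint32_t hashes[] = { 0, 1, 2, 3, 4 };

    memset(batches, 0, sizeof batches);
    return rebatch_by_hash(hashes, 5, 3, batches);
}
```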

Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com>
---
 Documentation/automake.mk                     |   1 +
 Documentation/topics/index.rst                |   1 +
 .../topics/userspace-tx-steering.rst          |  78 +++++++++++
 lib/dpif-netdev.c                             | 125 ++++++++++++++----
 4 files changed, 182 insertions(+), 23 deletions(-)
 create mode 100644 Documentation/topics/userspace-tx-steering.rst

Comments

Kevin Traynor Dec. 16, 2021, 6 p.m. UTC | #1
On 15/12/2021 16:26, Maxime Coquelin wrote:
> This patch adds a new hash Tx steering mode that
> distributes the traffic on all the Tx queues, whatever the
> number of PMD threads. It would be useful for guests
> expecting traffic to be distributed on all the vCPUs.
> 
> The idea here is to re-use the 5-tuple hash of the packets,
> already computed to build the flow batches (and so it
> does not provide flexibility on which fields are part of
> the hash).
> 
> There is also no user-configurable indirection table,
> given the feature is transparent to the guest. The queue
> selection is just a modulo operation between the packet
> hash and the number of Tx queues.
> 
> There are no (at least intentionally) functional changes
> for the existing XPS and static modes. There should not be
> noticeable performance changes for these modes (only one
> more branch in the hot path).
> 
> For the hash mode, performance could be impacted due to
> locking when multiple PMD threads are in use (same as
> XPS mode) and also because of the second level of batching.
> 
> Regarding the batching, the existing Tx port output_pkts
> is not modified. It means that at maximum, NETDEV_MAX_BURST
> can be batched for all the Tx queues. A second level of
> batching is done in dp_netdev_pmd_flush_output_on_port(),
> only for this hash mode.
> 

Thanks Maxime. This feature is really needed and the way the 
implementation complements the existing code is very slick.

I tested many combinations switching between modes while traffic was 
flowing and everything was working well.

As an interesting test case, I compared the performance of using 3 txqs 
with default and hash modes. I saw a 13% Rx throughput drop in the VM 
with hash mode. I think that is quite reasonable considering the extra 
locking and the potential to remove other bottlenecks in a real VNF.

I didn't dig in too deep on the clearing/loading cached ports code but 
it's working when changing mode on the fly. The rest of the code looks 
good. Couple of comments below.

> Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com>
> ---
>   Documentation/automake.mk                     |   1 +
>   Documentation/topics/index.rst                |   1 +
>   .../topics/userspace-tx-steering.rst          |  78 +++++++++++
>   lib/dpif-netdev.c                             | 125 ++++++++++++++----
>   4 files changed, 182 insertions(+), 23 deletions(-)
>   create mode 100644 Documentation/topics/userspace-tx-steering.rst
> 
> diff --git a/Documentation/automake.mk b/Documentation/automake.mk
> index 137cc57c5..01e3c4f9e 100644
> --- a/Documentation/automake.mk
> +++ b/Documentation/automake.mk
> @@ -58,6 +58,7 @@ DOC_SOURCE = \
>   	Documentation/topics/record-replay.rst \
>   	Documentation/topics/tracing.rst \
>   	Documentation/topics/userspace-tso.rst \
> +	Documentation/topics/userspace-tx-steering.rst \
>   	Documentation/topics/windows.rst \
>   	Documentation/howto/index.rst \
>   	Documentation/howto/dpdk.rst \
> diff --git a/Documentation/topics/index.rst b/Documentation/topics/index.rst
> index d8ccbd757..3699fd5c4 100644
> --- a/Documentation/topics/index.rst
> +++ b/Documentation/topics/index.rst
> @@ -55,3 +55,4 @@ OVS
>      userspace-tso
>      idl-compound-indexes
>      ovs-extensions
> +   userspace-tx-steering
> diff --git a/Documentation/topics/userspace-tx-steering.rst b/Documentation/topics/userspace-tx-steering.rst
> new file mode 100644
> index 000000000..3de05c4ae
> --- /dev/null
> +++ b/Documentation/topics/userspace-tx-steering.rst
> @@ -0,0 +1,78 @@
> +..
> +      Licensed under the Apache License, Version 2.0 (the "License"); you may
> +      not use this file except in compliance with the License. You may obtain
> +      a copy of the License at
> +
> +          http://www.apache.org/licenses/LICENSE-2.0
> +
> +      Unless required by applicable law or agreed to in writing, software
> +      distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
> +      WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
> +      License for the specific language governing permissions and limitations
> +      under the License.
> +
> +      Convention for heading levels in Open vSwitch documentation:
> +
> +      =======  Heading 0 (reserved for the title in a document)
> +      -------  Heading 1
> +      ~~~~~~~  Heading 2
> +      +++++++  Heading 3
> +      '''''''  Heading 4
> +
> +      Avoid deeper levels because they do not render well.
> +
> +============================
> +Userspace Tx packet steering
> +============================
> +
> +The userspace datapath supports three transmit packet steering modes.
> +
> +Static mode
> +~~~~~~~~~~~
> +
> +This mode is automatically selected when the port's ``tx-steering`` option is
> +set to ``default`` or unset, and if the port's number of Tx queues is greater
> +than or equal to the number of PMD threads.
> +
> +This is the recommended mode for performance reasons, as the Tx lock is not
> +acquired. If the number of Tx queues is greater than the number of PMDs, the
> +remaining Tx queues will not be used.
> +
> +XPS mode
> +~~~~~~~~
> +
> +This mode is automatically selected when the port's ``tx-steering`` option is
> +set to ``default`` or unset, and if the port's number of Tx queues is lower
> +than the number of PMD threads.
> +
> +This mode may have a performance impact, given the Tx lock acquisition is
> +required as several PMD threads may use the same Tx queue.
> +
> +Hash mode
> +~~~~~~~~~
> +
> +Hash-based Tx packet steering mode distributes the traffic on all the port's
> +transmit queues, regardless of the number of PMD threads. Queue selection is
> +based on the 5-tuple hash already computed to build the flow batches, the
> +selected queue being the modulo between the hash and the number of Tx queues
> +of the port.
> +
> +Hash mode may be used for example with Vhost-user ports, when the number of
> +vCPUs and queues of the guest are greater than the number of PMD threads.
> +Without hash mode, the Tx queues used would be limited to the number of PMDs.
> +
> +Hash-based Tx packet steering may have an impact on the performance, given the
> +Tx lock acquisition is required and a second level of batching is performed.
> +
> +This feature is disabled by default.
> +
> +Usage
> +~~~~~
> +
> +To enable hash mode::
> +
> +    $ ovs-vsctl set Interface <iface> other_config:tx-steering=hash
> +
> +To disable hash mode::
> +
> +    $ ovs-vsctl set Interface <iface> other_config:tx-steering=default

I think it might be better to use a more descriptive name than 'default'.

That way there's no naming conflict if the actual default behaviour was 
ever changed to 'hash' mode in the future. Also, perhaps the default 
mode could be set by the user in future so they wouldn't have to set for 
every vhost port.

Perhaps, tx-steering=pmd|hash ?

It is well explained above, but from a user perspective there are only 
two choices, so maybe there should be two 'modes' (e.g. 
tx-steering=pmd|hash), with XPS explained as a feature that may run in 
the 'pmd' mode. Internally in the code, it probably makes sense to keep 
the 3 values in the enum etc.

> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
> index 0407f30b3..b8c528d6e 100644
> --- a/lib/dpif-netdev.c
> +++ b/lib/dpif-netdev.c
> @@ -385,15 +385,21 @@ struct dp_netdev_rxq {
>       atomic_ullong cycles_intrvl[PMD_INTERVAL_MAX];
>   };
>   
> +enum txq_req_mode {
> +    TXQ_REQ_MODE_DEFAULT,
> +    TXQ_REQ_MODE_HASH,
> +};
> +
>   enum txq_mode {
>       TXQ_MODE_STATIC,
>       TXQ_MODE_XPS,
> +    TXQ_MODE_HASH,
>   };
>   
>   /* A port in a netdev-based datapath. */
>   struct dp_netdev_port {
>       odp_port_t port_no;
> -    enum txq_mode txq_mode;     /* static, XPS */
> +    enum txq_mode txq_mode;     /* static, XPS, HASH. */
>       bool need_reconfigure;      /* True if we should reconfigure netdev. */
>       struct netdev *netdev;
>       struct hmap_node node;      /* Node in dp_netdev's 'ports'. */
> @@ -405,6 +411,7 @@ struct dp_netdev_port {
>       bool emc_enabled;           /* If true EMC will be used. */
>       char *type;                 /* Port type as requested by user. */
>       char *rxq_affinity_list;    /* Requested affinity of rx queues. */
> +    enum txq_req_mode txq_requested_mode;
>   };
>   
>   static bool dp_netdev_flow_ref(struct dp_netdev_flow *);
> @@ -440,6 +447,7 @@ struct tx_port {
>       struct hmap_node node;
>       long long flush_time;
>       struct dp_packet_batch output_pkts;
> +    struct dp_packet_batch *txq_pkts; /* Only for hash mode */
>       struct dp_netdev_rxq *output_pkts_rxqs[NETDEV_MAX_BURST];
>   };
>   
> @@ -4435,6 +4443,8 @@ dpif_netdev_port_set_config(struct dpif *dpif, odp_port_t port_no,
>       int error = 0;
>       const char *affinity_list = smap_get(cfg, "pmd-rxq-affinity");
>       bool emc_enabled = smap_get_bool(cfg, "emc-enable", true);
> +    const char *tx_steering_mode = smap_get(cfg, "tx-steering");
> +    enum txq_req_mode txq_mode;
>   
>       ovs_mutex_lock(&dp->port_mutex);
>       error = get_port_by_number(dp, port_no, &port);
> @@ -4476,19 +4486,34 @@ dpif_netdev_port_set_config(struct dpif *dpif, odp_port_t port_no,
>       }
>   
>       /* Checking for RXq affinity changes. */
> -    if (!netdev_is_pmd(port->netdev)
> -        || nullable_string_is_equal(affinity_list, port->rxq_affinity_list)) {
> -        goto unlock;
> +    if (netdev_is_pmd(port->netdev)
> +        && !nullable_string_is_equal(affinity_list, port->rxq_affinity_list)) {
> +
> +        error = dpif_netdev_port_set_rxq_affinity(port, affinity_list);
> +        if (error) {
> +            goto unlock;
> +        }
> +        free(port->rxq_affinity_list);
> +        port->rxq_affinity_list = nullable_xstrdup(affinity_list);
> +
> +        dp_netdev_request_reconfigure(dp);
>       }
>   
> -    error = dpif_netdev_port_set_rxq_affinity(port, affinity_list);
> -    if (error) {
> -        goto unlock;
> +    if (nullable_string_is_equal(tx_steering_mode, "hash")) {
> +        txq_mode = TXQ_REQ_MODE_HASH;
> +    } else {
> +        txq_mode = TXQ_REQ_MODE_DEFAULT;
> +    }
> +
> +    if (txq_mode != port->txq_requested_mode) {
> +        port->txq_requested_mode = txq_mode;
> +        VLOG_INFO("%s: Txq mode has been set to %s.",
> +                  netdev_get_name(port->netdev),
> +                  (txq_mode == TXQ_REQ_MODE_DEFAULT) ? "default" : "hash");
> +        dp_netdev_request_reconfigure(dp);
> +
>       }
> -    free(port->rxq_affinity_list);
> -    port->rxq_affinity_list = nullable_xstrdup(affinity_list);
>   
> -    dp_netdev_request_reconfigure(dp);
>   unlock:
>       ovs_mutex_unlock(&dp->port_mutex);
>       return error;
> @@ -4603,18 +4628,38 @@ dp_netdev_pmd_flush_output_on_port(struct dp_netdev_pmd_thread *pmd,
>   
>       cycle_timer_start(&pmd->perf_stats, &timer);
>   
> -    if (p->port->txq_mode == TXQ_MODE_XPS) {
> -        tx_qid = dpif_netdev_xps_get_tx_qid(pmd, p);
> -        concurrent_txqs = true;
> -    } else {
> -        tx_qid = pmd->static_tx_qid;
> -        concurrent_txqs = false;
> -    }
> -
>       output_cnt = dp_packet_batch_size(&p->output_pkts);
>       ovs_assert(output_cnt > 0);
>   
> -    netdev_send(p->port->netdev, tx_qid, &p->output_pkts, concurrent_txqs);
> +    if (p->port->txq_mode == TXQ_MODE_HASH) {
> +        int n_txq = netdev_n_txq(p->port->netdev);
> +

I'm pretty sure it's not possible due to the default, but it might be 
worth adding a check for 'n_txq > 0' here as it would cause a divide by 
zero below.
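
A defensive version of the selection with the suggested guard might look
like this (hypothetical standalone sketch, not the actual patch code):

```c
#include <stdint.h>

/* Hash-mode queue selection with the suggested guard: if a port ever
 * reported zero Tx queues, fall back to queue 0 rather than dividing
 * by zero.  Illustrative helper name only. */
int
select_txq_guarded(uint32_t rss_hash, int n_txq)
{
    if (n_txq <= 0) {
        return 0;
    }
    return (int) (rss_hash % (uint32_t) n_txq);
}
```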

> +        /* Re-batch per txq based on packet hash. */
> +        for (i = 0; i < output_cnt; i++) {
> +            struct dp_packet *packet = p->output_pkts.packets[i];
> +
> +            tx_qid = dp_packet_get_rss_hash(packet) % n_txq;
> +            dp_packet_batch_add(&p->txq_pkts[tx_qid], packet);
> +        }
> +
> +        /* Flush batches of each Tx queues. */
> +        for (i = 0; i < n_txq; i++) {
> +            if (dp_packet_batch_is_empty(&p->txq_pkts[i])) {
> +                continue;
> +            }
> +            netdev_send(p->port->netdev, i, &p->txq_pkts[i], true);
> +            dp_packet_batch_init(&p->txq_pkts[i]);
> +        }
> +    } else {
> +        if (p->port->txq_mode == TXQ_MODE_XPS) {
> +            tx_qid = dpif_netdev_xps_get_tx_qid(pmd, p);
> +            concurrent_txqs = true;
> +        } else {
> +            tx_qid = pmd->static_tx_qid;
> +            concurrent_txqs = false;
> +        }
> +        netdev_send(p->port->netdev, tx_qid, &p->output_pkts, concurrent_txqs);
> +    }
>       dp_packet_batch_init(&p->output_pkts);
>   
>       /* Update time of the next flush. */
> @@ -5775,7 +5820,9 @@ reconfigure_datapath(struct dp_netdev *dp)
>       HMAP_FOR_EACH (port, node, &dp->ports) {
>           if (netdev_is_reconf_required(port->netdev)
>               || ((port->txq_mode == TXQ_MODE_XPS)
> -                != (netdev_n_txq(port->netdev) < wanted_txqs))) {
> +                != (netdev_n_txq(port->netdev) < wanted_txqs))
> +            || ((port->txq_mode == TXQ_MODE_HASH)
> +                != (port->txq_requested_mode == TXQ_REQ_MODE_HASH))) {
>               port->need_reconfigure = true;
>           }
>       }
> @@ -5810,8 +5857,15 @@ reconfigure_datapath(struct dp_netdev *dp)
>               seq_change(dp->port_seq);
>               port_destroy(port);
>           } else {
> -            port->txq_mode = (netdev_n_txq(port->netdev) < wanted_txqs) ?
> -                TXQ_MODE_XPS : TXQ_MODE_STATIC;
> +            /* With a single queue, there is no point in using hash mode. */
> +            if (port->txq_requested_mode == TXQ_REQ_MODE_HASH &&
> +                    netdev_n_txq(port->netdev) > 1) {
> +                port->txq_mode = TXQ_MODE_HASH;
> +            } else if (netdev_n_txq(port->netdev) < wanted_txqs) {
> +                port->txq_mode = TXQ_MODE_XPS;
> +            } else {
> +                port->txq_mode = TXQ_MODE_STATIC;
> +            }
>           }
>       }
>   
> @@ -6100,9 +6154,11 @@ pmd_free_cached_ports(struct dp_netdev_pmd_thread *pmd)
>       dpif_netdev_xps_revalidate_pmd(pmd, true);
>   
>       HMAP_FOR_EACH_POP (tx_port_cached, node, &pmd->tnl_port_cache) {
> +        free(tx_port_cached->txq_pkts);
>           free(tx_port_cached);
>       }
>       HMAP_FOR_EACH_POP (tx_port_cached, node, &pmd->send_port_cache) {
> +        free(tx_port_cached->txq_pkts);
>           free(tx_port_cached);
>       }
>   }
> @@ -6122,14 +6178,27 @@ pmd_load_cached_ports(struct dp_netdev_pmd_thread *pmd)
>       hmap_shrink(&pmd->tnl_port_cache);
>   
>       HMAP_FOR_EACH (tx_port, node, &pmd->tx_ports) {
> +        int n_txq = netdev_n_txq(tx_port->port->netdev);
> +        struct dp_packet_batch *txq_pkts_cached;
> +
>           if (netdev_has_tunnel_push_pop(tx_port->port->netdev)) {
>               tx_port_cached = xmemdup(tx_port, sizeof *tx_port_cached);
> +            if (tx_port->txq_pkts) {
> +                txq_pkts_cached = xmemdup(tx_port->txq_pkts,
> +                                          n_txq * sizeof *tx_port->txq_pkts);
> +                tx_port_cached->txq_pkts = txq_pkts_cached;
> +            }
>               hmap_insert(&pmd->tnl_port_cache, &tx_port_cached->node,
>                           hash_port_no(tx_port_cached->port->port_no));
>           }
>   
> -        if (netdev_n_txq(tx_port->port->netdev)) {
> +        if (n_txq) {
>               tx_port_cached = xmemdup(tx_port, sizeof *tx_port_cached);
> +            if (tx_port->txq_pkts) {
> +                txq_pkts_cached = xmemdup(tx_port->txq_pkts,
> +                                          n_txq * sizeof *tx_port->txq_pkts);
> +                tx_port_cached->txq_pkts = txq_pkts_cached;
> +            }
>               hmap_insert(&pmd->send_port_cache, &tx_port_cached->node,
>                           hash_port_no(tx_port_cached->port->port_no));
>           }
> @@ -6911,6 +6980,7 @@ dp_netdev_pmd_clear_ports(struct dp_netdev_pmd_thread *pmd)
>           free(poll);
>       }
>       HMAP_FOR_EACH_POP (port, node, &pmd->tx_ports) {
> +        free(port->txq_pkts);
>           free(port);
>       }
>       ovs_mutex_unlock(&pmd->port_mutex);
> @@ -6981,6 +7051,14 @@ dp_netdev_add_port_tx_to_pmd(struct dp_netdev_pmd_thread *pmd,
>       tx->flush_time = 0LL;
>       dp_packet_batch_init(&tx->output_pkts);
>   
> +    if (tx->port->txq_mode == TXQ_MODE_HASH) {
> +        int i, n_txq = netdev_n_txq(tx->port->netdev);
> +        tx->txq_pkts = xzalloc(n_txq * sizeof *tx->txq_pkts);
> +        for (i = 0; i < n_txq; i++) {
> +            dp_packet_batch_init(&tx->txq_pkts[i]);
> +        }
> +    }
> +
>       hmap_insert(&pmd->tx_ports, &tx->node, hash_port_no(tx->port->port_no));
>       pmd->need_reload = true;
>   }
> @@ -6993,6 +7071,7 @@ dp_netdev_del_port_tx_from_pmd(struct dp_netdev_pmd_thread *pmd,
>       OVS_REQUIRES(pmd->port_mutex)
>   {
>       hmap_remove(&pmd->tx_ports, &tx->node);
> +    free(tx->txq_pkts);
>       free(tx);
>       pmd->need_reload = true;
>   }
>
Maxime Coquelin Dec. 16, 2021, 8:04 p.m. UTC | #2
Hi Kevin,

On 12/16/21 19:00, Kevin Traynor wrote:
> On 15/12/2021 16:26, Maxime Coquelin wrote:
>> This patch adds a new hash Tx steering mode that
>> distributes the traffic on all the Tx queues, whatever the
>> number of PMD threads. It would be useful for guests
>> expecting traffic to be distributed on all the vCPUs.
>>
>> The idea here is to re-use the 5-tuple hash of the packets,
>> already computed to build the flow batches (and so it
>> does not provide flexibility on which fields are part of
>> the hash).
>>
>> There is also no user-configurable indirection table,
>> given the feature is transparent to the guest. The queue
>> selection is just a modulo operation between the packet
>> hash and the number of Tx queues.
>>
>> There are no (at least intentionally) functional changes
>> for the existing XPS and static modes. There should not be
>> noticeable performance changes for these modes (only one
>> more branch in the hot path).
>>
>> For the hash mode, performance could be impacted due to
>> locking when multiple PMD threads are in use (same as
>> XPS mode) and also because of the second level of batching.
>>
>> Regarding the batching, the existing Tx port output_pkts
>> is not modified. It means that at maximum, NETDEV_MAX_BURST
>> can be batched for all the Tx queues. A second level of
>> batching is done in dp_netdev_pmd_flush_output_on_port(),
>> only for this hash mode.
>>
> 
> Thanks Maxime. This feature is really needed and the way the 
> implementation complements the existing code is very slick.
> 
> I tested many combinations switching between modes while traffic was 
> flowing and everything was working well.
> 
> An interesting test case, I tested with the performance of using 3 txqs 
> with default and hash modes. I saw a 13% rx tput drop in the VM with 
> hash mode. I think that is quite reasonable considering the extra 
> locking and the potential to remove other bottlenecks in a real VNF.

Out of curiosity, what kind of traffic was sent?

> I didn't dig in too deep on the clearing/loading cached ports code but 
> it's working when changing mode on the fly. The rest of the code looks 
> good. Couple of comments below.
> 
>> Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com>
>> ---
>>   Documentation/automake.mk                     |   1 +
>>   Documentation/topics/index.rst                |   1 +
>>   .../topics/userspace-tx-steering.rst          |  78 +++++++++++
>>   lib/dpif-netdev.c                             | 125 ++++++++++++++----
>>   4 files changed, 182 insertions(+), 23 deletions(-)
>>   create mode 100644 Documentation/topics/userspace-tx-steering.rst
>>
>> diff --git a/Documentation/automake.mk b/Documentation/automake.mk
>> index 137cc57c5..01e3c4f9e 100644
>> --- a/Documentation/automake.mk
>> +++ b/Documentation/automake.mk
>> @@ -58,6 +58,7 @@ DOC_SOURCE = \
>>       Documentation/topics/record-replay.rst \
>>       Documentation/topics/tracing.rst \
>>       Documentation/topics/userspace-tso.rst \
>> +    Documentation/topics/userspace-tx-steering.rst \
>>       Documentation/topics/windows.rst \
>>       Documentation/howto/index.rst \
>>       Documentation/howto/dpdk.rst \
>> diff --git a/Documentation/topics/index.rst 
>> b/Documentation/topics/index.rst
>> index d8ccbd757..3699fd5c4 100644
>> --- a/Documentation/topics/index.rst
>> +++ b/Documentation/topics/index.rst
>> @@ -55,3 +55,4 @@ OVS
>>      userspace-tso
>>      idl-compound-indexes
>>      ovs-extensions
>> +   userspace-tx-steering
>> diff --git a/Documentation/topics/userspace-tx-steering.rst 
>> b/Documentation/topics/userspace-tx-steering.rst
>> new file mode 100644
>> index 000000000..3de05c4ae
>> --- /dev/null
>> +++ b/Documentation/topics/userspace-tx-steering.rst
>> @@ -0,0 +1,78 @@
>> +..
>> +      Licensed under the Apache License, Version 2.0 (the "License"); 
>> you may
>> +      not use this file except in compliance with the License. You 
>> may obtain
>> +      a copy of the License at
>> +
>> +          http://www.apache.org/licenses/LICENSE-2.0
>> +
>> +      Unless required by applicable law or agreed to in writing, 
>> software
>> +      distributed under the License is distributed on an "AS IS" 
>> BASIS, WITHOUT
>> +      WARRANTIES OR CONDITIONS OF ANY KIND, either express or 
>> implied. See the
>> +      License for the specific language governing permissions and 
>> limitations
>> +      under the License.
>> +
>> +      Convention for heading levels in Open vSwitch documentation:
>> +
>> +      =======  Heading 0 (reserved for the title in a document)
>> +      -------  Heading 1
>> +      ~~~~~~~  Heading 2
>> +      +++++++  Heading 3
>> +      '''''''  Heading 4
>> +
>> +      Avoid deeper levels because they do not render well.
>> +
>> +============================
>> +Userspace Tx packet steering
>> +============================
>> +
>> +The userspace datapath supports three transmit packet steering modes.
>> +
>> +Static mode
>> +~~~~~~~~~~~
>> +
>> +This mode is automatically selected when the port's ``tx-steering``
>> +option is set to ``default`` or unset, and if the port's number of Tx
>> +queues is greater than or equal to the number of PMD threads.
>> +
>> +This is the recommended mode for performance reasons, as the Tx lock is
>> +not acquired. If the number of Tx queues is greater than the number of
>> +PMDs, the remaining Tx queues will not be used.
>> +
>> +XPS mode
>> +~~~~~~~~
>> +
>> +This mode is automatically selected when the port's ``tx-steering``
>> +option is set to ``default`` or unset, and if the port's number of Tx
>> +queues is lower than the number of PMD threads.
>> +
>> +This mode may have a performance impact, given the Tx lock acquisition
>> +is required as several PMD threads may use the same Tx queue.
>> +
>> +Hash mode
>> +~~~~~~~~~
>> +
>> +Hash-based Tx packet steering mode distributes the traffic on all the
>> +port's transmit queues, regardless of the number of PMD threads. Queue
>> +selection is based on the 5-tuple hash already computed to build the
>> +flow batches, the selected queue being the modulo between the hash and
>> +the number of Tx queues of the port.
>> +
>> +Hash mode may be used for example with Vhost-user ports, when the
>> +number of vCPUs and queues of the guest are greater than the number of
>> +PMD threads. Without hash mode, the Tx queues used would be limited to
>> +the number of PMDs.
>> +
>> +Hash-based Tx packet steering may have an impact on the performance,
>> +given the Tx lock acquisition is required and a second level of
>> +batching is performed.
>> +
>> +This feature is disabled by default.
>> +
>> +Usage
>> +~~~~~
>> +
>> +To enable hash mode::
>> +
>> +    $ ovs-vsctl set Interface <iface> other_config:tx-steering=hash
>> +
>> +To disable hash mode::
>> +
>> +    $ ovs-vsctl set Interface <iface> other_config:tx-steering=default
> 
> I think it might be better to use a more descriptive name than 'default'.
> 
> That way there's no naming conflict if the actual default behaviour was 
> ever changed to 'hash' mode in the future. Also, perhaps the default 
> mode could be set by the user in future so they wouldn't have to set for 
> every vhost port.
> 
> Perhaps, tx-steering=pmd|hash ?
> 
> It is well explained above, but maybe from a user perspective as there 
> is only 2 choices, then there should be 2 'modes' (e.g. 
> tx-steering=pmd|hash), and xps is explained as a feature that may run in 
> the 'pmd' mode. Internally in the code, it probably makes sense to keep 
> the 3 values in the enum etc.

OK, that sounds reasonable. Ilya, what do you think?

>> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
>> index 0407f30b3..b8c528d6e 100644
>> --- a/lib/dpif-netdev.c
>> +++ b/lib/dpif-netdev.c
>> @@ -385,15 +385,21 @@ struct dp_netdev_rxq {
>>       atomic_ullong cycles_intrvl[PMD_INTERVAL_MAX];
>>   };
>> +enum txq_req_mode {
>> +    TXQ_REQ_MODE_DEFAULT,
>> +    TXQ_REQ_MODE_HASH,
>> +};
>> +
>>   enum txq_mode {
>>       TXQ_MODE_STATIC,
>>       TXQ_MODE_XPS,
>> +    TXQ_MODE_HASH,
>>   };
>>   /* A port in a netdev-based datapath. */
>>   struct dp_netdev_port {
>>       odp_port_t port_no;
>> -    enum txq_mode txq_mode;     /* static, XPS */
>> +    enum txq_mode txq_mode;     /* static, XPS, HASH. */
>>       bool need_reconfigure;      /* True if we should reconfigure 
>> netdev. */
>>       struct netdev *netdev;
>>       struct hmap_node node;      /* Node in dp_netdev's 'ports'. */
>> @@ -405,6 +411,7 @@ struct dp_netdev_port {
>>       bool emc_enabled;           /* If true EMC will be used. */
>>       char *type;                 /* Port type as requested by user. */
>>       char *rxq_affinity_list;    /* Requested affinity of rx queues. */
>> +    enum txq_req_mode txq_requested_mode;
>>   };
>>   static bool dp_netdev_flow_ref(struct dp_netdev_flow *);
>> @@ -440,6 +447,7 @@ struct tx_port {
>>       struct hmap_node node;
>>       long long flush_time;
>>       struct dp_packet_batch output_pkts;
>> +    struct dp_packet_batch *txq_pkts; /* Only for hash mode */
>>       struct dp_netdev_rxq *output_pkts_rxqs[NETDEV_MAX_BURST];
>>   };
>> @@ -4435,6 +4443,8 @@ dpif_netdev_port_set_config(struct dpif *dpif, 
>> odp_port_t port_no,
>>       int error = 0;
>>       const char *affinity_list = smap_get(cfg, "pmd-rxq-affinity");
>>       bool emc_enabled = smap_get_bool(cfg, "emc-enable", true);
>> +    const char *tx_steering_mode = smap_get(cfg, "tx-steering");
>> +    enum txq_req_mode txq_mode;
>>       ovs_mutex_lock(&dp->port_mutex);
>>       error = get_port_by_number(dp, port_no, &port);
>> @@ -4476,19 +4486,34 @@ dpif_netdev_port_set_config(struct dpif *dpif, 
>> odp_port_t port_no,
>>       }
>>       /* Checking for RXq affinity changes. */
>> -    if (!netdev_is_pmd(port->netdev)
>> -        || nullable_string_is_equal(affinity_list, 
>> port->rxq_affinity_list)) {
>> -        goto unlock;
>> +    if (netdev_is_pmd(port->netdev)
>> +        && !nullable_string_is_equal(affinity_list, 
>> port->rxq_affinity_list)) {
>> +
>> +        error = dpif_netdev_port_set_rxq_affinity(port, affinity_list);
>> +        if (error) {
>> +            goto unlock;
>> +        }
>> +        free(port->rxq_affinity_list);
>> +        port->rxq_affinity_list = nullable_xstrdup(affinity_list);
>> +
>> +        dp_netdev_request_reconfigure(dp);
>>       }
>> -    error = dpif_netdev_port_set_rxq_affinity(port, affinity_list);
>> -    if (error) {
>> -        goto unlock;
>> +    if (nullable_string_is_equal(tx_steering_mode, "hash")) {
>> +        txq_mode = TXQ_REQ_MODE_HASH;
>> +    } else {
>> +        txq_mode = TXQ_REQ_MODE_DEFAULT;
>> +    }
>> +
>> +    if (txq_mode != port->txq_requested_mode) {
>> +        port->txq_requested_mode = txq_mode;
>> +        VLOG_INFO("%s: Txq mode has been set to %s.",
>> +                  netdev_get_name(port->netdev),
>> +                  (txq_mode == TXQ_REQ_MODE_DEFAULT) ? "default" : 
>> "hash");
>> +        dp_netdev_request_reconfigure(dp);
>> +
>>       }
>> -    free(port->rxq_affinity_list);
>> -    port->rxq_affinity_list = nullable_xstrdup(affinity_list);
>> -    dp_netdev_request_reconfigure(dp);
>>   unlock:
>>       ovs_mutex_unlock(&dp->port_mutex);
>>       return error;
>> @@ -4603,18 +4628,38 @@ dp_netdev_pmd_flush_output_on_port(struct 
>> dp_netdev_pmd_thread *pmd,
>>       cycle_timer_start(&pmd->perf_stats, &timer);
>> -    if (p->port->txq_mode == TXQ_MODE_XPS) {
>> -        tx_qid = dpif_netdev_xps_get_tx_qid(pmd, p);
>> -        concurrent_txqs = true;
>> -    } else {
>> -        tx_qid = pmd->static_tx_qid;
>> -        concurrent_txqs = false;
>> -    }
>> -
>>       output_cnt = dp_packet_batch_size(&p->output_pkts);
>>       ovs_assert(output_cnt > 0);
>> -    netdev_send(p->port->netdev, tx_qid, &p->output_pkts, 
>> concurrent_txqs);
>> +    if (p->port->txq_mode == TXQ_MODE_HASH) {
>> +        int n_txq = netdev_n_txq(p->port->netdev);
>> +
> 
> I'm pretty sure it's not possible due to the default, but it might be 
> worth adding a check for 'n_txq > 0' here as it would cause a divide by 
> zero below.

TXQ_MODE_HASH can only be selected if n_txq > 1. So, I am not sure this
is necessary.
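
Indeed, the reconfigure logic only picks TXQ_MODE_HASH for ports with
more than one Tx queue. A standalone sketch of that selection rule
(hypothetical helper mirroring the patch's reconfigure_datapath() logic):

```c
#include <stdbool.h>

enum txq_mode { TXQ_MODE_STATIC, TXQ_MODE_XPS, TXQ_MODE_HASH };

/* Mirror of the mode-selection rule in reconfigure_datapath(): hash
 * mode requires more than one Tx queue; otherwise XPS is used when
 * queues are scarce, or static mode when there are enough. */
enum txq_mode
pick_txq_mode(bool hash_requested, int n_txq, int wanted_txqs)
{
    if (hash_requested && n_txq > 1) {
        return TXQ_MODE_HASH;
    } else if (n_txq < wanted_txqs) {
        return TXQ_MODE_XPS;
    } else {
        return TXQ_MODE_STATIC;
    }
}
```

So a single-queue port that requests hash mode falls back to XPS or
static mode, and the modulo in the hot path never sees n_txq == 0.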

Thanks for the review & testing!
Maxime
Kevin Traynor Dec. 17, 2021, 9:45 a.m. UTC | #3
On 16/12/2021 20:04, Maxime Coquelin wrote:
> Hi Kevin,
> 
> On 12/16/21 19:00, Kevin Traynor wrote:
>> On 15/12/2021 16:26, Maxime Coquelin wrote:
>>> This patch adds a new hash Tx steering mode that
>>> distributes the traffic on all the Tx queues, whatever the
>>> number of PMD threads. It would be useful for guests
>>> expecting traffic to be distributed on all the vCPUs.
>>>
>>> The idea here is to re-use the 5-tuple hash of the packets,
>>> already computed to build the flow batches (and so it
>>> does not provide flexibility on which fields are part of
>>> the hash).
>>>
>>> There is also no user-configurable indirection table,
>>> given the feature is transparent to the guest. The queue
>>> selection is just a modulo operation between the packet
>>> hash and the number of Tx queues.
>>>
>>> There are no (at least intentionally) functional changes
>>> for the existing XPS and static modes. There should not be
>>> noticeable performance changes for these modes (only one
>>> more branch in the hot path).
>>>
>>> For the hash mode, performance could be impacted due to
>>> locking when multiple PMD threads are in use (same as
>>> XPS mode) and also because of the second level of batching.
>>>
>>> Regarding the batching, the existing Tx port output_pkts
>>> is not modified. It means that at maximum, NETDEV_MAX_BURST
>>> can be batched for all the Tx queues. A second level of
>>> batching is done in dp_netdev_pmd_flush_output_on_port(),
>>> only for this hash mode.
>>>
>>
>> Thanks Maxime. This feature is really needed and the way the
>> implementation complements the existing code is very slick.
>>
>> I tested many combinations switching between modes while traffic was
>> flowing and everything was working well.
>>
>> An interesting test case, I tested with the performance of using 3 txqs
>> with default and hash modes. I saw a 13% rx tput drop in the VM with
>> hash mode. I think that is quite reasonable considering the extra
>> locking and the potential to remove other bottlenecks in a real VNF.
> 
> Out of curiosity, what kind of traffic was sent?
> 

Default pktgen-dpdk with 'enable all range' and programmed PVP flows in 
OVS (i.e. not NORMAL) for PVP. So it was 64 byte TCP packets with a lot 
of variation in the 5-tuple.

>> I didn't dig in too deep on the clearing/loading cached ports code but
>> it's working when changing mode on the fly. The rest of the code looks
>> good. Couple of comments below.
>>
>>> Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com>
>>> ---
>>>    Documentation/automake.mk                     |   1 +
>>>    Documentation/topics/index.rst                |   1 +
>>>    .../topics/userspace-tx-steering.rst          |  78 +++++++++++
>>>    lib/dpif-netdev.c                             | 125 ++++++++++++++----
>>>    4 files changed, 182 insertions(+), 23 deletions(-)
>>>    create mode 100644 Documentation/topics/userspace-tx-steering.rst
>>>
>>> diff --git a/Documentation/automake.mk b/Documentation/automake.mk
>>> index 137cc57c5..01e3c4f9e 100644
>>> --- a/Documentation/automake.mk
>>> +++ b/Documentation/automake.mk
>>> @@ -58,6 +58,7 @@ DOC_SOURCE = \
>>>        Documentation/topics/record-replay.rst \
>>>        Documentation/topics/tracing.rst \
>>>        Documentation/topics/userspace-tso.rst \
>>> +    Documentation/topics/userspace-tx-steering.rst \
>>>        Documentation/topics/windows.rst \
>>>        Documentation/howto/index.rst \
>>>        Documentation/howto/dpdk.rst \
>>> diff --git a/Documentation/topics/index.rst
>>> b/Documentation/topics/index.rst
>>> index d8ccbd757..3699fd5c4 100644
>>> --- a/Documentation/topics/index.rst
>>> +++ b/Documentation/topics/index.rst
>>> @@ -55,3 +55,4 @@ OVS
>>>       userspace-tso
>>>       idl-compound-indexes
>>>       ovs-extensions
>>> +   userspace-tx-steering
>>> diff --git a/Documentation/topics/userspace-tx-steering.rst
>>> b/Documentation/topics/userspace-tx-steering.rst
>>> new file mode 100644
>>> index 000000000..3de05c4ae
>>> --- /dev/null
>>> +++ b/Documentation/topics/userspace-tx-steering.rst
>>> @@ -0,0 +1,78 @@
>>> +..
>>> +      Licensed under the Apache License, Version 2.0 (the "License");
>>> you may
>>> +      not use this file except in compliance with the License. You
>>> may obtain
>>> +      a copy of the License at
>>> +
>>> +          http://www.apache.org/licenses/LICENSE-2.0
>>> +
>>> +      Unless required by applicable law or agreed to in writing,
>>> software
>>> +      distributed under the License is distributed on an "AS IS"
>>> BASIS, WITHOUT
>>> +      WARRANTIES OR CONDITIONS OF ANY KIND, either express or
>>> implied. See the
>>> +      License for the specific language governing permissions and
>>> limitations
>>> +      under the License.
>>> +
>>> +      Convention for heading levels in Open vSwitch documentation:
>>> +
>>> +      =======  Heading 0 (reserved for the title in a document)
>>> +      -------  Heading 1
>>> +      ~~~~~~~  Heading 2
>>> +      +++++++  Heading 3
>>> +      '''''''  Heading 4
>>> +
>>> +      Avoid deeper levels because they do not render well.
>>> +
>>> +============================
>>> +Userspace Tx packet steering
>>> +============================
>>> +
>>> +The userspace datapath supports three transmit packet steering modes.
>>> +
>>> +Static mode
>>> +~~~~~~~~~~~
>>> +
>>> +This mode is automatically selected when port's ``tx-steering``
>>> option is set
>>> +to ``default`` or unset, and if the port's number of Tx queues is
>>> greater than
>>> +or equal to the number of PMD threads.
>>> +
>>> +This is the recommended mode for performance reasons, as the Tx lock is not
>>> +acquired. If the number of Tx queues is greater than the number of
>>> PMDs, the
>>> +remaining Tx queues will not be used.
>>> +
>>> +XPS mode
>>> +~~~~~~~~
>>> +
>>> +This mode is automatically selected when port's ``tx-steering``
>>> option is set
>>> +to ``default`` or unset, and if the port's number of Tx queues is
>>> lower than
>>> +the number of PMD threads.
>>> +
>>> +This mode may have a performance impact, given the Tx lock
>>> acquisition is
>>> +required as several PMD threads may use the same Tx queue.
>>> +
>>> +Hash mode
>>> +~~~~~~~~~
>>> +
>>> +Hash-based Tx packet steering mode distributes the traffic on all
>>> the port's
>>> +transmit queues, whatever the number of PMD threads. Queue selection
>>> is based
>>> +on the 5-tuple hash already computed to build the flow batches, the
>>> selected
>>> +queue being the modulo between the hash and the number of Tx queues
>>> of the
>>> +port.
>>> +
>>> +Hash mode may be used for example with Vhost-user ports, when the
>>> number of
>>> +vCPUs and queues of the guest is greater than the number of PMD
>>> threads.
>>> +Without hash mode, the Tx queues used would be limited to the number
>>> of PMDs.
>>> +
>>> +Hash-based Tx packet steering may have an impact on the performance,
>>> given the
>>> +Tx lock acquisition is required and a second level of batching is
>>> performed.
>>> +
>>> +This feature is disabled by default.
>>> +
>>> +Usage
>>> +~~~~~
>>> +
>>> +To enable hash mode::
>>> +
>>> +    $ ovs-vsctl set Interface <iface> other_config:tx-steering=hash
>>> +
>>> +To disable hash mode::
>>> +
>>> +    $ ovs-vsctl set Interface <iface> other_config:tx-steering=default
>>
>> I think it might be better to use a more descriptive name than 'default'.
>>
>> That way there's no naming conflict if the actual default behaviour was
>> ever changed to 'hash' mode in the future. Also, perhaps the default
>> mode could be set by the user in future so they wouldn't have to set for
>> every vhost port.
>>
>> Perhaps, tx-steering=pmd|hash ?
>>
>> It is well explained above, but maybe from a user perspective as there
>> is only 2 choices, then there should be 2 'modes' (e.g.
>> tx-steering=pmd|hash), and xps is explained as a feature that may run in
>> the 'pmd' mode. Internally in the code, it probably makes sense to keep
>> the 3 values in the enum etc.
> 
> OK, that sounds reasonable. Ilya, what do you think?
> 
>>> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
>>> index 0407f30b3..b8c528d6e 100644
>>> --- a/lib/dpif-netdev.c
>>> +++ b/lib/dpif-netdev.c
>>> @@ -385,15 +385,21 @@ struct dp_netdev_rxq {
>>>        atomic_ullong cycles_intrvl[PMD_INTERVAL_MAX];
>>>    };
>>> +enum txq_req_mode {
>>> +    TXQ_REQ_MODE_DEFAULT,
>>> +    TXQ_REQ_MODE_HASH,
>>> +};
>>> +
>>>    enum txq_mode {
>>>        TXQ_MODE_STATIC,
>>>        TXQ_MODE_XPS,
>>> +    TXQ_MODE_HASH,
>>>    };
>>>    /* A port in a netdev-based datapath. */
>>>    struct dp_netdev_port {
>>>        odp_port_t port_no;
>>> -    enum txq_mode txq_mode;     /* static, XPS */
>>> +    enum txq_mode txq_mode;     /* static, XPS, HASH. */
>>>        bool need_reconfigure;      /* True if we should reconfigure
>>> netdev. */
>>>        struct netdev *netdev;
>>>        struct hmap_node node;      /* Node in dp_netdev's 'ports'. */
>>> @@ -405,6 +411,7 @@ struct dp_netdev_port {
>>>        bool emc_enabled;           /* If true EMC will be used. */
>>>        char *type;                 /* Port type as requested by user. */
>>>        char *rxq_affinity_list;    /* Requested affinity of rx queues. */
>>> +    enum txq_req_mode txq_requested_mode;
>>>    };
>>>    static bool dp_netdev_flow_ref(struct dp_netdev_flow *);
>>> @@ -440,6 +447,7 @@ struct tx_port {
>>>        struct hmap_node node;
>>>        long long flush_time;
>>>        struct dp_packet_batch output_pkts;
>>> +    struct dp_packet_batch *txq_pkts; /* Only for hash mode */
>>>        struct dp_netdev_rxq *output_pkts_rxqs[NETDEV_MAX_BURST];
>>>    };
>>> @@ -4435,6 +4443,8 @@ dpif_netdev_port_set_config(struct dpif *dpif,
>>> odp_port_t port_no,
>>>        int error = 0;
>>>        const char *affinity_list = smap_get(cfg, "pmd-rxq-affinity");
>>>        bool emc_enabled = smap_get_bool(cfg, "emc-enable", true);
>>> +    const char *tx_steering_mode = smap_get(cfg, "tx-steering");
>>> +    enum txq_req_mode txq_mode;
>>>        ovs_mutex_lock(&dp->port_mutex);
>>>        error = get_port_by_number(dp, port_no, &port);
>>> @@ -4476,19 +4486,34 @@ dpif_netdev_port_set_config(struct dpif *dpif,
>>> odp_port_t port_no,
>>>        }
>>>        /* Checking for RXq affinity changes. */
>>> -    if (!netdev_is_pmd(port->netdev)
>>> -        || nullable_string_is_equal(affinity_list,
>>> port->rxq_affinity_list)) {
>>> -        goto unlock;
>>> +    if (netdev_is_pmd(port->netdev)
>>> +        && !nullable_string_is_equal(affinity_list,
>>> port->rxq_affinity_list)) {
>>> +
>>> +        error = dpif_netdev_port_set_rxq_affinity(port, affinity_list);
>>> +        if (error) {
>>> +            goto unlock;
>>> +        }
>>> +        free(port->rxq_affinity_list);
>>> +        port->rxq_affinity_list = nullable_xstrdup(affinity_list);
>>> +
>>> +        dp_netdev_request_reconfigure(dp);
>>>        }
>>> -    error = dpif_netdev_port_set_rxq_affinity(port, affinity_list);
>>> -    if (error) {
>>> -        goto unlock;
>>> +    if (nullable_string_is_equal(tx_steering_mode, "hash")) {
>>> +        txq_mode = TXQ_REQ_MODE_HASH;
>>> +    } else {
>>> +        txq_mode = TXQ_REQ_MODE_DEFAULT;
>>> +    }
>>> +
>>> +    if (txq_mode != port->txq_requested_mode) {
>>> +        port->txq_requested_mode = txq_mode;
>>> +        VLOG_INFO("%s: Txq mode has been set to %s.",
>>> +                  netdev_get_name(port->netdev),
>>> +                  (txq_mode == TXQ_REQ_MODE_DEFAULT) ? "default" :
>>> "hash");
>>> +        dp_netdev_request_reconfigure(dp);
>>> +
>>>        }
>>> -    free(port->rxq_affinity_list);
>>> -    port->rxq_affinity_list = nullable_xstrdup(affinity_list);
>>> -    dp_netdev_request_reconfigure(dp);
>>>    unlock:
>>>        ovs_mutex_unlock(&dp->port_mutex);
>>>        return error;
>>> @@ -4603,18 +4628,38 @@ dp_netdev_pmd_flush_output_on_port(struct
>>> dp_netdev_pmd_thread *pmd,
>>>        cycle_timer_start(&pmd->perf_stats, &timer);
>>> -    if (p->port->txq_mode == TXQ_MODE_XPS) {
>>> -        tx_qid = dpif_netdev_xps_get_tx_qid(pmd, p);
>>> -        concurrent_txqs = true;
>>> -    } else {
>>> -        tx_qid = pmd->static_tx_qid;
>>> -        concurrent_txqs = false;
>>> -    }
>>> -
>>>        output_cnt = dp_packet_batch_size(&p->output_pkts);
>>>        ovs_assert(output_cnt > 0);
>>> -    netdev_send(p->port->netdev, tx_qid, &p->output_pkts,
>>> concurrent_txqs);
>>> +    if (p->port->txq_mode == TXQ_MODE_HASH) {
>>> +        int n_txq = netdev_n_txq(p->port->netdev);
>>> +
>>
>> I'm pretty sure it's not possible due to the default, but it might be
>> worth adding a check for 'n_txq > 0' here as it would cause a divide by
>> zero below.
> 
> TXQ_MODE_HASH can only be selected if n_txq > 1. So, I am not sure this
> is necessary.
> 

ok, that's an explicit check rather than just relying on the default, so
sounds good and it avoids putting another check in the datapath.

thanks,
Kevin.

> Thanks for the review & testing!
> Maxime
>
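[Editor's note: the re-batching under discussion (queue selection as hash modulo the number of Tx queues, with a second level of per-txq batches) can be sketched with simplified stand-in types. The `pkt`/`pkt_batch` structures below are illustrative only, not the actual OVS `dp_packet`/`dp_packet_batch` definitions.]

```c
#include <stddef.h>
#include <stdint.h>

#define NETDEV_MAX_BURST 32   /* Same cap OVS applies to one output batch. */

/* Illustrative stand-ins for struct dp_packet / struct dp_packet_batch. */
struct pkt {
    uint32_t rss_hash;  /* 5-tuple hash computed while building flow batches. */
};

struct pkt_batch {
    size_t count;
    struct pkt *pkts[NETDEV_MAX_BURST];
};

/* Second level of batching for TXQ_MODE_HASH: split one output batch into
 * per-txq batches, selecting each queue as hash % n_txq.  Callers must
 * guarantee n_txq > 0 (in the patch, hash mode is only selected when
 * n_txq > 1, which is the explicit check discussed above). */
static void
rebatch_by_hash(const struct pkt_batch *out, struct pkt_batch *txq_batches,
                int n_txq)
{
    for (size_t i = 0; i < out->count; i++) {
        struct pkt *p = out->pkts[i];
        uint32_t tx_qid = p->rss_hash % (uint32_t) n_txq;

        txq_batches[tx_qid].pkts[txq_batches[tx_qid].count++] = p;
    }
}
```

Because the upper level still caps the input at NETDEV_MAX_BURST packets, no per-txq batch can overflow a NETDEV_MAX_BURST-sized array.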
Maxime Coquelin Dec. 17, 2021, 9:53 a.m. UTC | #4
On 12/17/21 10:45, Kevin Traynor wrote:
> On 16/12/2021 20:04, Maxime Coquelin wrote:
>> Hi Kevin,
>>
>> On 12/16/21 19:00, Kevin Traynor wrote:
>>> On 15/12/2021 16:26, Maxime Coquelin wrote:
>>>> This patch adds a new hash Tx steering mode that
>>>> distributes the traffic on all the Tx queues, whatever the
>>>> number of PMD threads. It would be useful for guests
>>>> expecting traffic to be distributed on all the vCPUs.
>>>>
>>>> The idea here is to re-use the 5-tuple hash of the packets,
>>>> already computed to build the flow batches (and so it
>>>> does not provide flexibility on which fields are part of
>>>> the hash).
>>>>
>>>> There is also no user-configurable indirection table,
>>>> given the feature is transparent to the guest. The queue
>>>> selection is just a modulo operation between the packet
>>>> hash and the number of Tx queues.
>>>>
>>>> There are no (at least intentionally) functional changes
>>>> for the existing XPS and static modes. There should not be
>>>> noticeable performance changes for these modes (only one
>>>> more branch in the hot path).
>>>>
>>>> For the hash mode, performance could be impacted due to
>>>> locking when multiple PMD threads are in use (same as
>>>> XPS mode) and also because of the second level of batching.
>>>>
>>>> Regarding the batching, the existing Tx port output_pkts
>>>> is not modified. It means that at maximum, NETDEV_MAX_BURST
>>>> can be batched for all the Tx queues. A second level of
>>>> batching is done in dp_netdev_pmd_flush_output_on_port(),
>>>> only for this hash mode.
>>>>
>>>
>>> Thanks Maxime. This feature is really needed and the way the
>>> implementation complements the existing code is very slick.
>>>
>>> I tested many combinations switching between modes while traffic was
>>> flowing and everything was working well.
>>>
>>> An interesting test case, I tested with the performance of using 3 txqs
>>> with default and hash modes. I saw a 13% rx tput drop in the VM with
>>> hash mode. I think that is quite reasonable considering the extra
>>> locking and the potential to remove other bottlenecks in a real VNF.
>>
>> Out of curiosity, what kind of traffic was sent?
>>
> 
> Default pktgen-dpdk with 'enable all range' and programmed PVP flows in 
> OVS (i.e. not NORMAL) for PVP. So it was 64 byte TCP packets with a lot 
> of variation in the 5-tuple.

Ok, thanks! In this case, the 13% rx throughput drop is reasonable!

Thanks,
Maxime
Ilya Maximets Dec. 17, 2021, 12:23 p.m. UTC | #5
On 12/16/21 21:04, Maxime Coquelin wrote:
> Hi Kevin,
> 
> On 12/16/21 19:00, Kevin Traynor wrote:
>> On 15/12/2021 16:26, Maxime Coquelin wrote:
>>> This patch adds a new hash Tx steering mode that
>>> distributes the traffic on all the Tx queues, whatever the
>>> number of PMD threads. It would be useful for guests
>>> expecting traffic to be distributed on all the vCPUs.
>>>
>>> The idea here is to re-use the 5-tuple hash of the packets,
>>> already computed to build the flow batches (and so it
>>> does not provide flexibility on which fields are part of
>>> the hash).
>>>
>>> There is also no user-configurable indirection table,
>>> given the feature is transparent to the guest. The queue
>>> selection is just a modulo operation between the packet
>>> hash and the number of Tx queues.
>>>
>>> There are no (at least intentionally) functional changes
>>> for the existing XPS and static modes. There should not be
>>> noticeable performance changes for these modes (only one
>>> more branch in the hot path).
>>>
>>> For the hash mode, performance could be impacted due to
>>> locking when multiple PMD threads are in use (same as
>>> XPS mode) and also because of the second level of batching.
>>>
>>> Regarding the batching, the existing Tx port output_pkts
>>> is not modified. It means that at maximum, NETDEV_MAX_BURST
>>> can be batched for all the Tx queues. A second level of
>>> batching is done in dp_netdev_pmd_flush_output_on_port(),
>>> only for this hash mode.
>>>
>>
>> Thanks Maxime. This feature is really needed and the way the implementation complements the existing code is very slick.
>>
>> I tested many combinations switching between modes while traffic was flowing and everything was working well.
>>
>> An interesting test case, I tested with the performance of using 3 txqs with default and hash modes. I saw a 13% rx tput drop in the VM with hash mode. I think that is quite reasonable considering the extra locking and the potential to remove other bottlenecks in a real VNF.
> 
> Out of curiosity, what kind of traffic was sent?
> 
>> I didn't dig in too deep on the clearing/loading cached ports code but it's working when changing mode on the fly. The rest of the code looks good. Couple of comments below.
>>
>>> Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com>
>>> ---
>>>   Documentation/automake.mk                     |   1 +
>>>   Documentation/topics/index.rst                |   1 +
>>>   .../topics/userspace-tx-steering.rst          |  78 +++++++++++
>>>   lib/dpif-netdev.c                             | 125 ++++++++++++++----
>>>   4 files changed, 182 insertions(+), 23 deletions(-)
>>>   create mode 100644 Documentation/topics/userspace-tx-steering.rst
>>>
>>> diff --git a/Documentation/automake.mk b/Documentation/automake.mk
>>> index 137cc57c5..01e3c4f9e 100644
>>> --- a/Documentation/automake.mk
>>> +++ b/Documentation/automake.mk
>>> @@ -58,6 +58,7 @@ DOC_SOURCE = \
>>>       Documentation/topics/record-replay.rst \
>>>       Documentation/topics/tracing.rst \
>>>       Documentation/topics/userspace-tso.rst \
>>> +    Documentation/topics/userspace-tx-steering.rst \
>>>       Documentation/topics/windows.rst \
>>>       Documentation/howto/index.rst \
>>>       Documentation/howto/dpdk.rst \
>>> diff --git a/Documentation/topics/index.rst b/Documentation/topics/index.rst
>>> index d8ccbd757..3699fd5c4 100644
>>> --- a/Documentation/topics/index.rst
>>> +++ b/Documentation/topics/index.rst
>>> @@ -55,3 +55,4 @@ OVS
>>>      userspace-tso
>>>      idl-compound-indexes
>>>      ovs-extensions
>>> +   userspace-tx-steering
>>> diff --git a/Documentation/topics/userspace-tx-steering.rst b/Documentation/topics/userspace-tx-steering.rst
>>> new file mode 100644
>>> index 000000000..3de05c4ae
>>> --- /dev/null
>>> +++ b/Documentation/topics/userspace-tx-steering.rst
>>> @@ -0,0 +1,78 @@
>>> +..
>>> +      Licensed under the Apache License, Version 2.0 (the "License"); you may
>>> +      not use this file except in compliance with the License. You may obtain
>>> +      a copy of the License at
>>> +
>>> +          http://www.apache.org/licenses/LICENSE-2.0
>>> +
>>> +      Unless required by applicable law or agreed to in writing, software
>>> +      distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
>>> +      WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
>>> +      License for the specific language governing permissions and limitations
>>> +      under the License.
>>> +
>>> +      Convention for heading levels in Open vSwitch documentation:
>>> +
>>> +      =======  Heading 0 (reserved for the title in a document)
>>> +      -------  Heading 1
>>> +      ~~~~~~~  Heading 2
>>> +      +++++++  Heading 3
>>> +      '''''''  Heading 4
>>> +
>>> +      Avoid deeper levels because they do not render well.
>>> +
>>> +============================
>>> +Userspace Tx packet steering
>>> +============================
>>> +
>>> +The userspace datapath supports three transmit packet steering modes.
>>> +
>>> +Static mode
>>> +~~~~~~~~~~~
>>> +
>>> +This mode is automatically selected when port's ``tx-steering`` option is set
>>> +to ``default`` or unset, and if the port's number of Tx queues is greater
>>> +than or equal to the number of PMD threads.
>>> +
>>> +This is the recommended mode for performance reasons, as the Tx lock is not
>>> +acquired. If the number of Tx queues is greater than the number of PMDs, the
>>> +remaining Tx queues will not be used.
>>> +
>>> +XPS mode
>>> +~~~~~~~~
>>> +
>>> +This mode is automatically selected when port's ``tx-steering`` option is set
>>> +to ``default`` or unset, and if the port's number of Tx queues is lower than
>>> +the number of PMD threads.
>>> +
>>> +This mode may have a performance impact, given the Tx lock acquisition is
>>> +required as several PMD threads may use the same Tx queue.
>>> +
>>> +Hash mode
>>> +~~~~~~~~~
>>> +
>>> +Hash-based Tx packet steering mode distributes the traffic on all the port's
>>> +transmit queues, whatever the number of PMD threads. Queue selection is based
>>> +on the 5-tuple hash already computed to build the flow batches, the selected
>>> +queue being the modulo between the hash and the number of Tx queues of the
>>> +port.
>>> +
>>> +Hash mode may be used for example with Vhost-user ports, when the number of
>>> +vCPUs and queues of the guest is greater than the number of PMD threads.
>>> +Without hash mode, the Tx queues used would be limited to the number of PMDs.
>>> +
>>> +Hash-based Tx packet steering may have an impact on the performance, given the
>>> +Tx lock acquisition is required and a second level of batching is performed.
>>> +
>>> +This feature is disabled by default.
>>> +
>>> +Usage
>>> +~~~~~
>>> +
>>> +To enable hash mode::
>>> +
>>> +    $ ovs-vsctl set Interface <iface> other_config:tx-steering=hash
>>> +
>>> +To disable hash mode::
>>> +
>>> +    $ ovs-vsctl set Interface <iface> other_config:tx-steering=default
>>
>> I think it might be better to use a more descriptive name than 'default'.
>>
>> That way there's no naming conflict if the actual default behaviour was ever changed to 'hash' mode in the future. Also, perhaps the default mode could be set by the user in future so they wouldn't have to set for every vhost port.
>>
>> Perhaps, tx-steering=pmd|hash ?
>>
>> It is well explained above, but maybe from a user perspective as there is only 2 choices, then there should be 2 'modes' (e.g. tx-steering=pmd|hash), and xps is explained as a feature that may run in the 'pmd' mode. Internally in the code, it probably makes sense to keep the 3 values in the enum etc.
> 
> OK, that sounds reasonable. Ilya, what do you think?

Seems OK, but 'pmd', IMO, is a bad name.  We have a lot of things
called 'pmd*' in OVS and most of these names make very little sense,
to the point where what we call PMD in OVS is not a Poll Mode Driver
and has a very little relation to drivers in general.

So, please, don't call it 'pmd'.  I'd rename a pile of other stuff
from 'pmd*' to something else instead.  :)

In general, I suggested 'default' just because it's hard to come up
with a single word that would describe both the static and dynamic
queue assignments.  But I agree that some less vague name would be
better.  If you think that 'pmd' is descriptive enough, call it maybe
'thread' instead.  WDYT?


And I agree that it's better to not split modes in the documentation
since they can not actually be configured by a user.


And since we're here, one comment for the code itself:

> +    if (p->port->txq_mode == TXQ_MODE_HASH) {
> +        int n_txq = netdev_n_txq(p->port->netdev);
> +
> +        /* Re-batch per txq based on packet hash. */
> +        for (i = 0; i < output_cnt; i++) {
> +            struct dp_packet *packet = p->output_pkts.packets[i];
> +
> +            tx_qid = dp_packet_get_rss_hash(packet) % n_txq;
> +            dp_packet_batch_add(&p->txq_pkts[tx_qid], packet);
> +        }

I think, you still have to check that RSS is valid with the
dp_packet_rss_valid() function before using it and re-calculate
if it's not available.  That is because RSS is not always calculated,
e.g. in case of a partial HW offload the miniflow_extract is skipped
and hash is not calculated either.  That might become a problem if
for any reason the driver didn't provide a hash (doesn't support?) or
for a future developments like 'simple match' optimization where the
packet parsing is mostly skipped.

Since we don't have a miniflow here, we, probably, need something like
this:
  https://github.com/openvswitch/ovs/blob/893693e8085cf4163a1ca3ae6b712bf4716ed8b7/lib/odp-execute.c#L896

Best regards, Ilya Maximets.
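[Editor's note: the guard Ilya suggests can be sketched as below, with stand-in names (`rss_valid`, `recompute_5tuple_hash`) in place of the real `dp_packet_rss_valid()` check and the flow-extraction based recalculation; the stand-in hash value is arbitrary.]

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative stand-in for struct dp_packet; in OVS the validity check is
 * dp_packet_rss_valid() and the hash is read with dp_packet_get_rss_hash(). */
struct pkt {
    bool rss_valid;
    uint32_t rss_hash;
};

/* Stand-in for recomputing the 5-tuple hash; the real code would extract
 * the flow and hash it, similar to what odp-execute.c does for the hash
 * action.  A fixed value keeps the sketch deterministic. */
static uint32_t
recompute_5tuple_hash(const struct pkt *p)
{
    (void) p;
    return 0xdeadbeefu;
}

/* Select a Tx queue for hash mode, falling back to a recomputed hash when
 * none was set, e.g. because miniflow_extract was skipped for a partially
 * HW-offloaded packet.  n_txq must be > 0. */
static uint32_t
hash_txq_select(struct pkt *p, int n_txq)
{
    if (!p->rss_valid) {
        p->rss_hash = recompute_5tuple_hash(p);
        p->rss_valid = true;   /* Cache it so later stages reuse the hash. */
    }
    return p->rss_hash % (uint32_t) n_txq;
}
```

The caching step mirrors the intent of the linked odp-execute.c code: once recomputed, the hash is stored back on the packet so it is not recalculated per stage.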
Ilya Maximets Dec. 17, 2021, 12:52 p.m. UTC | #6
On 12/17/21 13:23, Ilya Maximets wrote:
> On 12/16/21 21:04, Maxime Coquelin wrote:
>> Hi Kevin,
>>
>> On 12/16/21 19:00, Kevin Traynor wrote:
>>> On 15/12/2021 16:26, Maxime Coquelin wrote:
>>>> This patch adds a new hash Tx steering mode that
>>>> distributes the traffic on all the Tx queues, whatever the
>>>> number of PMD threads. It would be useful for guests
>>>> expecting traffic to be distributed on all the vCPUs.
>>>>
>>>> The idea here is to re-use the 5-tuple hash of the packets,
>>>> already computed to build the flow batches (and so it
>>>> does not provide flexibility on which fields are part of
>>>> the hash).
>>>>
>>>> There is also no user-configurable indirection table,
>>>> given the feature is transparent to the guest. The queue
>>>> selection is just a modulo operation between the packet
>>>> hash and the number of Tx queues.
>>>>
>>>> There are no (at least intentionally) functional changes
>>>> for the existing XPS and static modes. There should not be
>>>> noticeable performance changes for these modes (only one
>>>> more branch in the hot path).
>>>>
>>>> For the hash mode, performance could be impacted due to
>>>> locking when multiple PMD threads are in use (same as
>>>> XPS mode) and also because of the second level of batching.
>>>>
>>>> Regarding the batching, the existing Tx port output_pkts
>>>> is not modified. It means that at maximum, NETDEV_MAX_BURST
>>>> can be batched for all the Tx queues. A second level of
>>>> batching is done in dp_netdev_pmd_flush_output_on_port(),
>>>> only for this hash mode.
>>>>
>>>
>>> Thanks Maxime. This feature is really needed and the way the implementation complements the existing code is very slick.
>>>
>>> I tested many combinations switching between modes while traffic was flowing and everything was working well.
>>>
>>> An interesting test case, I tested with the performance of using 3 txqs with default and hash modes. I saw a 13% rx tput drop in the VM with hash mode. I think that is quite reasonable considering the extra locking and the potential to remove other bottlenecks in a real VNF.
>>
>> Out of curiosity, what kind of traffic was sent?
>>
>>> I didn't dig in too deep on the clearing/loading cached ports code but it's working when changing mode on the fly. The rest of the code looks good. Couple of comments below.
>>>
>>>> Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com>
>>>> ---
>>>>   Documentation/automake.mk                     |   1 +
>>>>   Documentation/topics/index.rst                |   1 +
>>>>   .../topics/userspace-tx-steering.rst          |  78 +++++++++++
>>>>   lib/dpif-netdev.c                             | 125 ++++++++++++++----
>>>>   4 files changed, 182 insertions(+), 23 deletions(-)
>>>>   create mode 100644 Documentation/topics/userspace-tx-steering.rst
>>>>
>>>> diff --git a/Documentation/automake.mk b/Documentation/automake.mk
>>>> index 137cc57c5..01e3c4f9e 100644
>>>> --- a/Documentation/automake.mk
>>>> +++ b/Documentation/automake.mk
>>>> @@ -58,6 +58,7 @@ DOC_SOURCE = \
>>>>       Documentation/topics/record-replay.rst \
>>>>       Documentation/topics/tracing.rst \
>>>>       Documentation/topics/userspace-tso.rst \
>>>> +    Documentation/topics/userspace-tx-steering.rst \
>>>>       Documentation/topics/windows.rst \
>>>>       Documentation/howto/index.rst \
>>>>       Documentation/howto/dpdk.rst \
>>>> diff --git a/Documentation/topics/index.rst b/Documentation/topics/index.rst
>>>> index d8ccbd757..3699fd5c4 100644
>>>> --- a/Documentation/topics/index.rst
>>>> +++ b/Documentation/topics/index.rst
>>>> @@ -55,3 +55,4 @@ OVS
>>>>      userspace-tso
>>>>      idl-compound-indexes
>>>>      ovs-extensions
>>>> +   userspace-tx-steering
>>>> diff --git a/Documentation/topics/userspace-tx-steering.rst b/Documentation/topics/userspace-tx-steering.rst
>>>> new file mode 100644
>>>> index 000000000..3de05c4ae
>>>> --- /dev/null
>>>> +++ b/Documentation/topics/userspace-tx-steering.rst
>>>> @@ -0,0 +1,78 @@
>>>> +..
>>>> +      Licensed under the Apache License, Version 2.0 (the "License"); you may
>>>> +      not use this file except in compliance with the License. You may obtain
>>>> +      a copy of the License at
>>>> +
>>>> +          http://www.apache.org/licenses/LICENSE-2.0
>>>> +
>>>> +      Unless required by applicable law or agreed to in writing, software
>>>> +      distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
>>>> +      WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
>>>> +      License for the specific language governing permissions and limitations
>>>> +      under the License.
>>>> +
>>>> +      Convention for heading levels in Open vSwitch documentation:
>>>> +
>>>> +      =======  Heading 0 (reserved for the title in a document)
>>>> +      -------  Heading 1
>>>> +      ~~~~~~~  Heading 2
>>>> +      +++++++  Heading 3
>>>> +      '''''''  Heading 4
>>>> +
>>>> +      Avoid deeper levels because they do not render well.
>>>> +
>>>> +============================
>>>> +Userspace Tx packet steering
>>>> +============================
>>>> +
>>>> +The userspace datapath supports three transmit packet steering modes.
>>>> +
>>>> +Static mode
>>>> +~~~~~~~~~~~
>>>> +
>>>> +This mode is automatically selected when port's ``tx-steering`` option is set
>>>> +to ``default`` or unset, and if the port's number of Tx queues is greater
>>>> +than or equal to the number of PMD threads.
>>>> +
>>>> +This is the recommended mode for performance reasons, as the Tx lock is not
>>>> +acquired. If the number of Tx queues is greater than the number of PMDs, the
>>>> +remaining Tx queues will not be used.
>>>> +
>>>> +XPS mode
>>>> +~~~~~~~~
>>>> +
>>>> +This mode is automatically selected when port's ``tx-steering`` option is set
>>>> +to ``default`` or unset, and if the port's number of Tx queues is lower than
>>>> +the number of PMD threads.
>>>> +
>>>> +This mode may have a performance impact, given the Tx lock acquisition is
>>>> +required as several PMD threads may use the same Tx queue.
>>>> +
>>>> +Hash mode
>>>> +~~~~~~~~~
>>>> +
>>>> +Hash-based Tx packet steering mode distributes the traffic on all the port's
>>>> +transmit queues, whatever the number of PMD threads. Queue selection is based
>>>> +on the 5-tuple hash already computed to build the flow batches, the selected
>>>> +queue being the modulo between the hash and the number of Tx queues of the
>>>> +port.
>>>> +
>>>> +Hash mode may be used for example with Vhost-user ports, when the number of
>>>> +vCPUs and queues of the guest is greater than the number of PMD threads.
>>>> +Without hash mode, the Tx queues used would be limited to the number of PMDs.
>>>> +
>>>> +Hash-based Tx packet steering may have an impact on the performance, given the
>>>> +Tx lock acquisition is required and a second level of batching is performed.
>>>> +
>>>> +This feature is disabled by default.
>>>> +
>>>> +Usage
>>>> +~~~~~
>>>> +
>>>> +To enable hash mode::
>>>> +
>>>> +    $ ovs-vsctl set Interface <iface> other_config:tx-steering=hash
>>>> +
>>>> +To disable hash mode::
>>>> +
>>>> +    $ ovs-vsctl set Interface <iface> other_config:tx-steering=default
>>>
>>> I think it might be better to use a more descriptive name than 'default'.
>>>
>>> That way there's no naming conflict if the actual default behaviour was ever changed to 'hash' mode in the future. Also, perhaps the default mode could be set by the user in future so they wouldn't have to set for every vhost port.
>>>
>>> Perhaps, tx-steering=pmd|hash ?
>>>
>>> It is well explained above, but maybe from a user perspective as there is only 2 choices, then there should be 2 'modes' (e.g. tx-steering=pmd|hash), and xps is explained as a feature that may run in the 'pmd' mode. Internally in the code, it probably makes sense to keep the 3 values in the enum etc.
>>
>> OK, that sounds reasonable. Ilya, what do you think?
> 
> Seems OK, but 'pmd', IMO, is a bad name.  We have a lot of things
> called 'pmd*' in OVS and most of these names make very little sense,
> to the point where what we call PMD in OVS is not a Poll Mode Driver
> and has very little relation to drivers in general.
> 
> So, please, don't call it 'pmd'.  I'd rename a pile of other stuff
> from 'pmd*' to something else instead.  :)
> 
> In general, I suggested 'default' just because it's hard to come up
> with a single word that would describe both the static and dynamic
> queue assignments.  But I agree that some less vague name should be
> better.  If you think that 'pmd' is descriptive enough, call it maybe
> 'thread' instead.  WDYT?
> 
> 
> And I agree that it's better to not split modes in the documentation
> since they can not actually be configured by a user.
> 
> 
> And since we're here, one comment for the code itself:
> 
>> +    if (p->port->txq_mode == TXQ_MODE_HASH) {
>> +        int n_txq = netdev_n_txq(p->port->netdev);
>> +
>> +        /* Re-batch per txq based on packet hash. */
>> +        for (i = 0; i < output_cnt; i++) {
>> +            struct dp_packet *packet = p->output_pkts.packets[i];
>> +
>> +            tx_qid = dp_packet_get_rss_hash(packet) % n_txq;
>> +            dp_packet_batch_add(&p->txq_pkts[tx_qid], packet);
>> +        }
> 
> I think, you still have to check that RSS is valid with the
> dp_packet_rss_valid() function before using it and re-calculate
> if it's not available.  That is because RSS is not always calculated,
> e.g. in case of a partial HW offload the miniflow_extract is skipped
> and hash is not calculated either.  That might become a problem if
> for any reason the driver didn't provide a hash (doesn't support?) or
> for a future developments like 'simple match' optimization where the
> packet parsing is mostly skipped.
> 
> Since we don't have a miniflow here, we, probably, need something like
> this:
>   https://github.com/openvswitch/ovs/blob/893693e8085cf4163a1ca3ae6b712bf4716ed8b7/lib/odp-execute.c#L896

And, actually, you should be able to reproduce the issue by just enabling
the hw-offload in your testcase. dummy-pmd supports partial HW offload
and doesn't support RSS, so it should be invalid at the point where you're
sending packets to the output port.  Might be also a good idea to try that
under the address sanitizer.

> 
> Best regards, Ilya Maximets.
>
Kevin Traynor Dec. 17, 2021, 1:03 p.m. UTC | #7
On 17/12/2021 12:23, Ilya Maximets wrote:
> On 12/16/21 21:04, Maxime Coquelin wrote:
>> Hi Kevin,
>>
>> On 12/16/21 19:00, Kevin Traynor wrote:
>>> On 15/12/2021 16:26, Maxime Coquelin wrote:
>>>> This patch adds a new hash Tx steering mode that
>>>> distributes the traffic on all the Tx queues, whatever the
>>>> number of PMD threads. It would be useful for guests
>>>> expecting traffic to be distributed on all the vCPUs.
>>>>
>>>> The idea here is to re-use the 5-tuple hash of the packets,
>>>> already computed to build the flows batches (and so it
>>>> does not provide flexibility on which fields are part of
>>>> the hash).
>>>>
>>>> There is also no user-configurable indirection table,
>>>> given the feature is transparent to the guest. The queue
>>>> selection is just a modulo operation between the packet
>>>> hash and the number of Tx queues.
>>>>
>>>> There are no (at least intentionally) functional changes
>>>> for the existing XPS and static modes. There should not be
>>>> noticeable performance changes for these modes (only one
>>>> more branch in the hot path).
>>>>
>>>> For the hash mode, performance could be impacted due to
>>>> locking when multiple PMD threads are in use (same as
>>>> XPS mode) and also because of the second level of batching.
>>>>
>>>> Regarding the batching, the existing Tx port output_pkts
>>>> is not modified. It means that at maximum, NETDEV_MAX_BURST
>>>> can be batched for all the Tx queues. A second level of
>>>> batching is done in dp_netdev_pmd_flush_output_on_port(),
>>>> only for this hash mode.
>>>>
>>>
>>> Thanks Maxime. This feature is really needed and the way the implementation complements the existing code is very slick.
>>>
>>> I tested many combinations switching between modes while traffic was flowing and everything was working well.
>>>
>>> As an interesting test case, I tested the performance of using 3 txqs with default and hash modes. I saw a 13% rx tput drop in the VM with hash mode. I think that is quite reasonable considering the extra locking and the potential to remove other bottlenecks in a real VNF.
>>
>> Out of curiosity, what kind of traffic was sent?
>>
>>> I didn't dig in too deep on the clearing/loading cached ports code but it's working when changing mode on the fly. The rest of the code looks good. Couple of comments below.
>>>
>>>> Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com>
>>>> ---
>>>>    Documentation/automake.mk                     |   1 +
>>>>    Documentation/topics/index.rst                |   1 +
>>>>    .../topics/userspace-tx-steering.rst          |  78 +++++++++++
>>>>    lib/dpif-netdev.c                             | 125 ++++++++++++++----
>>>>    4 files changed, 182 insertions(+), 23 deletions(-)
>>>>    create mode 100644 Documentation/topics/userspace-tx-steering.rst
>>>>
>>>> diff --git a/Documentation/automake.mk b/Documentation/automake.mk
>>>> index 137cc57c5..01e3c4f9e 100644
>>>> --- a/Documentation/automake.mk
>>>> +++ b/Documentation/automake.mk
>>>> @@ -58,6 +58,7 @@ DOC_SOURCE = \
>>>>        Documentation/topics/record-replay.rst \
>>>>        Documentation/topics/tracing.rst \
>>>>        Documentation/topics/userspace-tso.rst \
>>>> +    Documentation/topics/userspace-tx-steering.rst \
>>>>        Documentation/topics/windows.rst \
>>>>        Documentation/howto/index.rst \
>>>>        Documentation/howto/dpdk.rst \
>>>> diff --git a/Documentation/topics/index.rst b/Documentation/topics/index.rst
>>>> index d8ccbd757..3699fd5c4 100644
>>>> --- a/Documentation/topics/index.rst
>>>> +++ b/Documentation/topics/index.rst
>>>> @@ -55,3 +55,4 @@ OVS
>>>>       userspace-tso
>>>>       idl-compound-indexes
>>>>       ovs-extensions
>>>> +   userspace-tx-steering
>>>> diff --git a/Documentation/topics/userspace-tx-steering.rst b/Documentation/topics/userspace-tx-steering.rst
>>>> new file mode 100644
>>>> index 000000000..3de05c4ae
>>>> --- /dev/null
>>>> +++ b/Documentation/topics/userspace-tx-steering.rst
>>>> @@ -0,0 +1,78 @@
>>>> +..
>>>> +      Licensed under the Apache License, Version 2.0 (the "License"); you may
>>>> +      not use this file except in compliance with the License. You may obtain
>>>> +      a copy of the License at
>>>> +
>>>> +          http://www.apache.org/licenses/LICENSE-2.0
>>>> +
>>>> +      Unless required by applicable law or agreed to in writing, software
>>>> +      distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
>>>> +      WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
>>>> +      License for the specific language governing permissions and limitations
>>>> +      under the License.
>>>> +
>>>> +      Convention for heading levels in Open vSwitch documentation:
>>>> +
>>>> +      =======  Heading 0 (reserved for the title in a document)
>>>> +      -------  Heading 1
>>>> +      ~~~~~~~  Heading 2
>>>> +      +++++++  Heading 3
>>>> +      '''''''  Heading 4
>>>> +
>>>> +      Avoid deeper levels because they do not render well.
>>>> +
>>>> +============================
>>>> +Userspace Tx packet steering
>>>> +============================
>>>> +
>>>> +The userspace datapath supports three transmit packet steering modes.
>>>> +
>>>> +Static mode
>>>> +~~~~~~~~~~~
>>>> +
>>>> +This mode is automatically selected when the port's ``tx-steering`` option
>>>> +is set to ``default`` or unset, and the port's number of Tx queues is
>>>> +greater than or equal to the number of PMD threads.
>>>> +
>>>> +This is the recommended mode for performance reasons, as the Tx lock is not
>>>> +acquired. If the number of Tx queues is greater than the number of PMDs, the
>>>> +remaining Tx queues will not be used.
>>>> +
>>>> +XPS mode
>>>> +~~~~~~~~
>>>> +
>>>> +This mode is automatically selected when the port's ``tx-steering`` option
>>>> +is set to ``default`` or unset, and the port's number of Tx queues is lower
>>>> +than the number of PMD threads.
>>>> +
>>>> +This mode may have a performance impact, given that Tx lock acquisition is
>>>> +required as several PMD threads may use the same Tx queue.
>>>> +
>>>> +Hash mode
>>>> +~~~~~~~~~
>>>> +
>>>> +Hash-based Tx packet steering mode distributes the traffic on all the
>>>> +port's transmit queues, whatever the number of PMD threads. Queue selection
>>>> +is based on the 5-tuple hash already computed to build the flow batches;
>>>> +the selected queue is the packet hash modulo the number of Tx queues of
>>>> +the port.
>>>> +
>>>> +Hash mode may be used, for example, with Vhost-user ports, when the number
>>>> +of vCPUs and queues of the guest are greater than the number of PMD threads.
>>>> +Without hash mode, the Tx queues used would be limited to the number of PMD
>>>> +threads.
>>>> +
>>>> +Hash-based Tx packet steering may have a performance impact, given that Tx
>>>> +lock acquisition is required and a second level of batching is performed.
>>>> +
>>>> +This feature is disabled by default.
>>>> +
>>>> +Usage
>>>> +~~~~~
>>>> +
>>>> +To enable hash mode::
>>>> +
>>>> +    $ ovs-vsctl set Interface <iface> other_config:tx-steering=hash
>>>> +
>>>> +To disable hash mode::
>>>> +
>>>> +    $ ovs-vsctl set Interface <iface> other_config:tx-steering=default
>>>
>>> I think it might be better to use a more descriptive name than 'default'.
>>>
>>> That way there's no naming conflict if the actual default behaviour was ever changed to 'hash' mode in the future. Also, perhaps the default mode could be set by the user in future so they wouldn't have to set for every vhost port.
>>>
>>> Perhaps, tx-steering=pmd|hash ?
>>>
>>> It is well explained above, but maybe from a user perspective as there is only 2 choices, then there should be 2 'modes' (e.g. tx-steering=pmd|hash), and xps is explained as a feature that may run in the 'pmd' mode. Internally in the code, it probably makes sense to keep the 3 values in the enum etc.
>>
>> OK, that sounds reasonable. Ilya, what do you think?
> 
> Seems OK, but 'pmd', IMO, is a bad name.  We have a lot of things
> called 'pmd*' in OVS and most of these names make very little sense,
> to the point where what we call PMD in OVS is not a Poll Mode Driver
> and has very little relation to drivers in general.
> 
> So, please, don't call it 'pmd'.  I'd rename a pile of other stuff
> from 'pmd*' to something else instead.  :)
> 

Too late - that acronym is weaponized.

> In general, I suggested 'default' just because it's hard to come up
> with a single word that would describe both the static and dynamic
> queue assignments.  But I agree that some less vague name should be
> better.  If you think that 'pmd' is descriptive enough, call it maybe
> 'thread' instead.  WDYT?
> 

Not tied to 'pmd', just want to make sure that we don't close the option 
of allowing the default to change in the future, and so we don't end up 
with a situation where default is no longer the default :-)

Just throwing out other ideas: 'auto', 'ovs', 'ovscfg'?

> 
> And I agree that it's better to not split modes in the documentation
> since they can not actually be configured by a user.
> 
> 
> And since we're here, one comment for the code itself:
> 
>> +    if (p->port->txq_mode == TXQ_MODE_HASH) {
>> +        int n_txq = netdev_n_txq(p->port->netdev);
>> +
>> +        /* Re-batch per txq based on packet hash. */
>> +        for (i = 0; i < output_cnt; i++) {
>> +            struct dp_packet *packet = p->output_pkts.packets[i];
>> +
>> +            tx_qid = dp_packet_get_rss_hash(packet) % n_txq;
>> +            dp_packet_batch_add(&p->txq_pkts[tx_qid], packet);
>> +        }
> 
> I think, you still have to check that RSS is valid with the
> dp_packet_rss_valid() function before using it and re-calculate
> if it's not available.  That is because RSS is not always calculated,
> e.g. in case of a partial HW offload the miniflow_extract is skipped
> and hash is not calculated either.  That might become a problem if
> for any reason the driver didn't provide a hash (doesn't support?) or
> for a future developments like 'simple match' optimization where the
> packet parsing is mostly skipped.
> 
> Since we don't have a miniflow here, we, probably, need something like
> this:
>    https://github.com/openvswitch/ovs/blob/893693e8085cf4163a1ca3ae6b712bf4716ed8b7/lib/odp-execute.c#L896
> 
> Best regards, Ilya Maximets.
>
Maxime Coquelin Dec. 17, 2021, 1:20 p.m. UTC | #8
On 12/17/21 13:52, Ilya Maximets wrote:
> On 12/17/21 13:23, Ilya Maximets wrote:
>> On 12/16/21 21:04, Maxime Coquelin wrote:
>>> Hi Kevin,
>>>
>>> On 12/16/21 19:00, Kevin Traynor wrote:
>>>> On 15/12/2021 16:26, Maxime Coquelin wrote:
>>>>> This patch adds a new hash Tx steering mode that
>>>>> distributes the traffic on all the Tx queues, whatever the
>>>>> number of PMD threads. It would be useful for guests
>>>>> expecting traffic to be distributed on all the vCPUs.
>>>>>
>>>>> The idea here is to re-use the 5-tuple hash of the packets,
>>>>> already computed to build the flows batches (and so it
>>>>> does not provide flexibility on which fields are part of
>>>>> the hash).
>>>>>
>>>>> There is also no user-configurable indirection table,
>>>>> given the feature is transparent to the guest. The queue
>>>>> selection is just a modulo operation between the packet
>>>>> hash and the number of Tx queues.
>>>>>
>>>>> There are no (at least intentionally) functional changes
>>>>> for the existing XPS and static modes. There should not be
>>>>> noticeable performance changes for these modes (only one
>>>>> more branch in the hot path).
>>>>>
>>>>> For the hash mode, performance could be impacted due to
>>>>> locking when multiple PMD threads are in use (same as
>>>>> XPS mode) and also because of the second level of batching.
>>>>>
>>>>> Regarding the batching, the existing Tx port output_pkts
>>>>> is not modified. It means that at maximum, NETDEV_MAX_BURST
>>>>> can be batched for all the Tx queues. A second level of
>>>>> batching is done in dp_netdev_pmd_flush_output_on_port(),
>>>>> only for this hash mode.
>>>>>
>>>>
>>>> Thanks Maxime. This feature is really needed and the way the implementation complements the existing code is very slick.
>>>>
>>>> I tested many combinations switching between modes while traffic was flowing and everything was working well.
>>>>
>>>> As an interesting test case, I tested the performance of using 3 txqs with default and hash modes. I saw a 13% rx tput drop in the VM with hash mode. I think that is quite reasonable considering the extra locking and the potential to remove other bottlenecks in a real VNF.
>>>
>>> Out of curiosity, what kind of traffic was sent?
>>>
>>>> I didn't dig in too deep on the clearing/loading cached ports code but it's working when changing mode on the fly. The rest of the code looks good. Couple of comments below.
>>>>
>>>>> Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com>
>>>>> ---
>>>>>    Documentation/automake.mk                     |   1 +
>>>>>    Documentation/topics/index.rst                |   1 +
>>>>>    .../topics/userspace-tx-steering.rst          |  78 +++++++++++
>>>>>    lib/dpif-netdev.c                             | 125 ++++++++++++++----
>>>>>    4 files changed, 182 insertions(+), 23 deletions(-)
>>>>>    create mode 100644 Documentation/topics/userspace-tx-steering.rst
>>>>>
>>>>> diff --git a/Documentation/automake.mk b/Documentation/automake.mk
>>>>> index 137cc57c5..01e3c4f9e 100644
>>>>> --- a/Documentation/automake.mk
>>>>> +++ b/Documentation/automake.mk
>>>>> @@ -58,6 +58,7 @@ DOC_SOURCE = \
>>>>>        Documentation/topics/record-replay.rst \
>>>>>        Documentation/topics/tracing.rst \
>>>>>        Documentation/topics/userspace-tso.rst \
>>>>> +    Documentation/topics/userspace-tx-steering.rst \
>>>>>        Documentation/topics/windows.rst \
>>>>>        Documentation/howto/index.rst \
>>>>>        Documentation/howto/dpdk.rst \
>>>>> diff --git a/Documentation/topics/index.rst b/Documentation/topics/index.rst
>>>>> index d8ccbd757..3699fd5c4 100644
>>>>> --- a/Documentation/topics/index.rst
>>>>> +++ b/Documentation/topics/index.rst
>>>>> @@ -55,3 +55,4 @@ OVS
>>>>>       userspace-tso
>>>>>       idl-compound-indexes
>>>>>       ovs-extensions
>>>>> +   userspace-tx-steering
>>>>> diff --git a/Documentation/topics/userspace-tx-steering.rst b/Documentation/topics/userspace-tx-steering.rst
>>>>> new file mode 100644
>>>>> index 000000000..3de05c4ae
>>>>> --- /dev/null
>>>>> +++ b/Documentation/topics/userspace-tx-steering.rst
>>>>> @@ -0,0 +1,78 @@
>>>>> +..
>>>>> +      Licensed under the Apache License, Version 2.0 (the "License"); you may
>>>>> +      not use this file except in compliance with the License. You may obtain
>>>>> +      a copy of the License at
>>>>> +
>>>>> +          http://www.apache.org/licenses/LICENSE-2.0
>>>>> +
>>>>> +      Unless required by applicable law or agreed to in writing, software
>>>>> +      distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
>>>>> +      WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
>>>>> +      License for the specific language governing permissions and limitations
>>>>> +      under the License.
>>>>> +
>>>>> +      Convention for heading levels in Open vSwitch documentation:
>>>>> +
>>>>> +      =======  Heading 0 (reserved for the title in a document)
>>>>> +      -------  Heading 1
>>>>> +      ~~~~~~~  Heading 2
>>>>> +      +++++++  Heading 3
>>>>> +      '''''''  Heading 4
>>>>> +
>>>>> +      Avoid deeper levels because they do not render well.
>>>>> +
>>>>> +============================
>>>>> +Userspace Tx packet steering
>>>>> +============================
>>>>> +
>>>>> +The userspace datapath supports three transmit packet steering modes.
>>>>> +
>>>>> +Static mode
>>>>> +~~~~~~~~~~~
>>>>> +
>>>>> +This mode is automatically selected when the port's ``tx-steering`` option
>>>>> +is set to ``default`` or unset, and the port's number of Tx queues is
>>>>> +greater than or equal to the number of PMD threads.
>>>>> +
>>>>> +This is the recommended mode for performance reasons, as the Tx lock is not
>>>>> +acquired. If the number of Tx queues is greater than the number of PMDs, the
>>>>> +remaining Tx queues will not be used.
>>>>> +
>>>>> +XPS mode
>>>>> +~~~~~~~~
>>>>> +
>>>>> +This mode is automatically selected when the port's ``tx-steering`` option
>>>>> +is set to ``default`` or unset, and the port's number of Tx queues is lower
>>>>> +than the number of PMD threads.
>>>>> +
>>>>> +This mode may have a performance impact, given that Tx lock acquisition is
>>>>> +required as several PMD threads may use the same Tx queue.
>>>>> +
>>>>> +Hash mode
>>>>> +~~~~~~~~~
>>>>> +
>>>>> +Hash-based Tx packet steering mode distributes the traffic on all the
>>>>> +port's transmit queues, whatever the number of PMD threads. Queue selection
>>>>> +is based on the 5-tuple hash already computed to build the flow batches;
>>>>> +the selected queue is the packet hash modulo the number of Tx queues of
>>>>> +the port.
>>>>> +
>>>>> +Hash mode may be used, for example, with Vhost-user ports, when the number
>>>>> +of vCPUs and queues of the guest are greater than the number of PMD threads.
>>>>> +Without hash mode, the Tx queues used would be limited to the number of PMD
>>>>> +threads.
>>>>> +
>>>>> +Hash-based Tx packet steering may have a performance impact, given that Tx
>>>>> +lock acquisition is required and a second level of batching is performed.
>>>>> +
>>>>> +This feature is disabled by default.
>>>>> +
>>>>> +Usage
>>>>> +~~~~~
>>>>> +
>>>>> +To enable hash mode::
>>>>> +
>>>>> +    $ ovs-vsctl set Interface <iface> other_config:tx-steering=hash
>>>>> +
>>>>> +To disable hash mode::
>>>>> +
>>>>> +    $ ovs-vsctl set Interface <iface> other_config:tx-steering=default
>>>>
>>>> I think it might be better to use a more descriptive name than 'default'.
>>>>
>>>> That way there's no naming conflict if the actual default behaviour was ever changed to 'hash' mode in the future. Also, perhaps the default mode could be set by the user in future so they wouldn't have to set for every vhost port.
>>>>
>>>> Perhaps, tx-steering=pmd|hash ?
>>>>
>>>> It is well explained above, but maybe from a user perspective as there is only 2 choices, then there should be 2 'modes' (e.g. tx-steering=pmd|hash), and xps is explained as a feature that may run in the 'pmd' mode. Internally in the code, it probably makes sense to keep the 3 values in the enum etc.
>>>
>>> OK, that sounds reasonable. Ilya, what do you think?
>>
>> Seems OK, but 'pmd', IMO, is a bad name.  We have a lot of things
>> called 'pmd*' in OVS and most of these names make very little sense,
>> to the point where what we call PMD in OVS is not a Poll Mode Driver
>> and has very little relation to drivers in general.
>>
>> So, please, don't call it 'pmd'.  I'd rename a pile of other stuff
>> from 'pmd*' to something else instead.  :)
>>
>> In general, I suggested 'default' just because it's hard to come up
>> with a single word that would describe both the static and dynamic
>> queue assignments.  But I agree that some less vague name should be
>> better.  If you think that 'pmd' is descriptive enough, call it maybe
>> 'thread' instead.  WDYT?
>>

Ok, I think 'thread' is well-suited.

>> And I agree that it's better to not split modes in the documentation
>> since they can not actually be configured by a user.
>>

Got it, I will change to document 'thread' and 'hash' modes. In the
'thread' part, I will briefly mention the static and XPS sub-modes.

>> And since we're here, one comment for the code itself:
>>
>>> +    if (p->port->txq_mode == TXQ_MODE_HASH) {
>>> +        int n_txq = netdev_n_txq(p->port->netdev);
>>> +
>>> +        /* Re-batch per txq based on packet hash. */
>>> +        for (i = 0; i < output_cnt; i++) {
>>> +            struct dp_packet *packet = p->output_pkts.packets[i];
>>> +
>>> +            tx_qid = dp_packet_get_rss_hash(packet) % n_txq;
>>> +            dp_packet_batch_add(&p->txq_pkts[tx_qid], packet);
>>> +        }
>>
>> I think, you still have to check that RSS is valid with the
>> dp_packet_rss_valid() function before using it and re-calculate
>> if it's not available.  That is because RSS is not always calculated,
>> e.g. in case of a partial HW offload the miniflow_extract is skipped
>> and hash is not calculated either.  That might become a problem if
>> for any reason the driver didn't provide a hash (doesn't support?) or
>> for a future developments like 'simple match' optimization where the
>> packet parsing is mostly skipped.
>>
>> Since we don't have a miniflow here, we, probably, need something like
>> this:
>>    https://github.com/openvswitch/ovs/blob/893693e8085cf4163a1ca3ae6b712bf4716ed8b7/lib/odp-execute.c#L896
> 
> And, actually, you should be able to reproduce the issue by just enabling
> the hw-offload in your testcase. dummy-pmd supports partial HW offload
> and doesn't support RSS, so it should be invalid at the point where you're
> sending packets to the output port.  Might be also a good idea to try that
> under the address sanitizer.

Good catch, and thanks for providing a way to reproduce it!

Maxime
>>
>> Best regards, Ilya Maximets.
>>
>
Maxime Coquelin Dec. 17, 2021, 1:22 p.m. UTC | #9
On 12/17/21 14:03, Kevin Traynor wrote:
> On 17/12/2021 12:23, Ilya Maximets wrote:
>> On 12/16/21 21:04, Maxime Coquelin wrote:
>>> Hi Kevin,
>>>
>>> On 12/16/21 19:00, Kevin Traynor wrote:
>>>> On 15/12/2021 16:26, Maxime Coquelin wrote:
>>>>> This patch adds a new hash Tx steering mode that
>>>>> distributes the traffic on all the Tx queues, whatever the
>>>>> number of PMD threads. It would be useful for guests
>>>>> expecting traffic to be distributed on all the vCPUs.
>>>>>
>>>>> The idea here is to re-use the 5-tuple hash of the packets,
>>>>> already computed to build the flows batches (and so it
>>>>> does not provide flexibility on which fields are part of
>>>>> the hash).
>>>>>
>>>>> There is also no user-configurable indirection table,
>>>>> given the feature is transparent to the guest. The queue
>>>>> selection is just a modulo operation between the packet
>>>>> hash and the number of Tx queues.
>>>>>
>>>>> There are no (at least intentionally) functional changes
>>>>> for the existing XPS and static modes. There should not be
>>>>> noticeable performance changes for these modes (only one
>>>>> more branch in the hot path).
>>>>>
>>>>> For the hash mode, performance could be impacted due to
>>>>> locking when multiple PMD threads are in use (same as
>>>>> XPS mode) and also because of the second level of batching.
>>>>>
>>>>> Regarding the batching, the existing Tx port output_pkts
>>>>> is not modified. It means that at maximum, NETDEV_MAX_BURST
>>>>> can be batched for all the Tx queues. A second level of
>>>>> batching is done in dp_netdev_pmd_flush_output_on_port(),
>>>>> only for this hash mode.
>>>>>
>>>>
>>>> Thanks Maxime. This feature is really needed and the way the 
>>>> implementation complements the existing code is very slick.
>>>>
>>>> I tested many combinations switching between modes while traffic was 
>>>> flowing and everything was working well.
>>>>
>>>> As an interesting test case, I tested the performance of using 3 
>>>> txqs with default and hash modes. I saw a 13% rx tput drop in the VM 
>>>> with hash mode. I think that is quite reasonable considering the 
>>>> extra locking and the potential to remove other bottlenecks in a 
>>>> real VNF.
>>>
>>> Out of curiosity, what kind of traffic was sent?
>>>
>>>> I didn't dig in too deep on the clearing/loading cached ports code 
>>>> but it's working when changing mode on the fly. The rest of the code 
>>>> looks good. Couple of comments below.
>>>>
>>>>> Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com>
>>>>> ---
>>>>>    Documentation/automake.mk                     |   1 +
>>>>>    Documentation/topics/index.rst                |   1 +
>>>>>    .../topics/userspace-tx-steering.rst          |  78 +++++++++++
>>>>>    lib/dpif-netdev.c                             | 125 
>>>>> ++++++++++++++----
>>>>>    4 files changed, 182 insertions(+), 23 deletions(-)
>>>>>    create mode 100644 Documentation/topics/userspace-tx-steering.rst
>>>>>
>>>>> diff --git a/Documentation/automake.mk b/Documentation/automake.mk
>>>>> index 137cc57c5..01e3c4f9e 100644
>>>>> --- a/Documentation/automake.mk
>>>>> +++ b/Documentation/automake.mk
>>>>> @@ -58,6 +58,7 @@ DOC_SOURCE = \
>>>>>        Documentation/topics/record-replay.rst \
>>>>>        Documentation/topics/tracing.rst \
>>>>>        Documentation/topics/userspace-tso.rst \
>>>>> +    Documentation/topics/userspace-tx-steering.rst \
>>>>>        Documentation/topics/windows.rst \
>>>>>        Documentation/howto/index.rst \
>>>>>        Documentation/howto/dpdk.rst \
>>>>> diff --git a/Documentation/topics/index.rst 
>>>>> b/Documentation/topics/index.rst
>>>>> index d8ccbd757..3699fd5c4 100644
>>>>> --- a/Documentation/topics/index.rst
>>>>> +++ b/Documentation/topics/index.rst
>>>>> @@ -55,3 +55,4 @@ OVS
>>>>>       userspace-tso
>>>>>       idl-compound-indexes
>>>>>       ovs-extensions
>>>>> +   userspace-tx-steering
>>>>> diff --git a/Documentation/topics/userspace-tx-steering.rst 
>>>>> b/Documentation/topics/userspace-tx-steering.rst
>>>>> new file mode 100644
>>>>> index 000000000..3de05c4ae
>>>>> --- /dev/null
>>>>> +++ b/Documentation/topics/userspace-tx-steering.rst
>>>>> @@ -0,0 +1,78 @@
>>>>> +..
>>>>> +      Licensed under the Apache License, Version 2.0 (the 
>>>>> "License"); you may
>>>>> +      not use this file except in compliance with the License. You 
>>>>> may obtain
>>>>> +      a copy of the License at
>>>>> +
>>>>> +          http://www.apache.org/licenses/LICENSE-2.0
>>>>> +
>>>>> +      Unless required by applicable law or agreed to in writing, 
>>>>> software
>>>>> +      distributed under the License is distributed on an "AS IS" 
>>>>> BASIS, WITHOUT
>>>>> +      WARRANTIES OR CONDITIONS OF ANY KIND, either express or 
>>>>> implied. See the
>>>>> +      License for the specific language governing permissions and 
>>>>> limitations
>>>>> +      under the License.
>>>>> +
>>>>> +      Convention for heading levels in Open vSwitch documentation:
>>>>> +
>>>>> +      =======  Heading 0 (reserved for the title in a document)
>>>>> +      -------  Heading 1
>>>>> +      ~~~~~~~  Heading 2
>>>>> +      +++++++  Heading 3
>>>>> +      '''''''  Heading 4
>>>>> +
>>>>> +      Avoid deeper levels because they do not render well.
>>>>> +
>>>>> +============================
>>>>> +Userspace Tx packet steering
>>>>> +============================
>>>>> +
>>>>> +The userspace datapath supports three transmit packet steering 
>>>>> modes.
>>>>> +
>>>>> +Static mode
>>>>> +~~~~~~~~~~~
>>>>> +
>>>>> +This mode is automatically selected when the port's ``tx-steering`` 
>>>>> option is set
>>>>> +to ``default`` or unset, and the port's number of Tx queues is 
>>>>> greater than
>>>>> +or equal to the number of PMD threads.
>>>>> +
>>>>> +This is the recommended mode for performance reasons, as the Tx lock 
>>>>> is not
>>>>> +acquired. If the number of Tx queues is greater than the number of 
>>>>> PMDs, the
>>>>> +remaining Tx queues will not be used.
>>>>> +
>>>>> +XPS mode
>>>>> +~~~~~~~~
>>>>> +
>>>>> +This mode is automatically selected when the port's ``tx-steering`` 
>>>>> option is set
>>>>> +to ``default`` or unset, and the port's number of Tx queues is 
>>>>> lower than
>>>>> +the number of PMD threads.
>>>>> +
>>>>> +This mode may have a performance impact, given that Tx lock 
>>>>> acquisition is
>>>>> +required as several PMD threads may use the same Tx queue.
>>>>> +
>>>>> +Hash mode
>>>>> +~~~~~~~~~
>>>>> +
>>>>> +Hash-based Tx packet steering mode distributes the traffic on all 
>>>>> the port's
>>>>> +transmit queues, whatever the number of PMD threads. Queue 
>>>>> selection is based
>>>>> +on the 5-tuple hash already computed to build the flow batches; 
>>>>> the selected
>>>>> +queue is the packet hash modulo the number of Tx 
>>>>> queues of the port.
>>>>> +
>>>>> +Hash mode may be used, for example, with Vhost-user ports, when the 
>>>>> number of
>>>>> +vCPUs and queues of the guest are greater than the number of PMD 
>>>>> threads.
>>>>> +Without hash mode, the Tx queues used would be limited to the 
>>>>> number of PMD threads.
>>>>> +
>>>>> +Hash-based Tx packet steering may have a performance impact, 
>>>>> given that
>>>>> +Tx lock acquisition is required and a second level of batching is 
>>>>> performed.
>>>>> +
>>>>> +This feature is disabled by default.
>>>>> +
>>>>> +Usage
>>>>> +~~~~~
>>>>> +
>>>>> +To enable hash mode::
>>>>> +
>>>>> +    $ ovs-vsctl set Interface <iface> other_config:tx-steering=hash
>>>>> +
>>>>> +To disable hash mode::
>>>>> +
>>>>> +    $ ovs-vsctl set Interface <iface> 
>>>>> other_config:tx-steering=default
>>>>
>>>> I think it might be better to use a more descriptive name than 
>>>> 'default'.
>>>>
>>>> That way there's no naming conflict if the actual default behaviour 
>>>> was ever changed to 'hash' mode in the future. Also, perhaps the 
>>>> default mode could be set by the user in future so they wouldn't 
>>>> have to set for every vhost port.
>>>>
>>>> Perhaps, tx-steering=pmd|hash ?
>>>>
>>>> It is well explained above, but maybe from a user perspective as 
>>>> there are only 2 choices, then there should be 2 'modes' (e.g. 
>>>> tx-steering=pmd|hash), and xps is explained as a feature that may 
>>>> run in the 'pmd' mode. Internally in the code, it probably makes 
>>>> sense to keep the 3 values in the enum etc.
>>>
>>> OK, that sounds reasonable. Ilya, what do you think?
>>
>> Seems OK, but 'pmd', IMO, is a bad name.  We have a lot of things
>> called 'pmd*' in OVS and most of these names make very little sense,
>> to the point where what we call PMD in OVS is not a Poll Mode Driver
>> and has very little relation to drivers in general.
>>
>> So, please, don't call it 'pmd'.  I'd rename a pile of other stuff
>> from 'pmd*' to something else instead.  :)
>>
> 
> Too late - that acronym is weaponized.
> 
>> In general, I suggested 'default' just because it's hard to come up
>> with a single word that would describe both the static and dynamic
>> queue assignments.  But I agree that some less vague name should be
>> better.  If you think that 'pmd' is descriptive enough, call it maybe
>> 'thread' instead.  WDYT?
>>
> 
> Not tied to 'pmd', just want to make sure that we don't close the option 
> of allowing the default to change in the future, and so we don't end up 
> with a situation where default is no longer the default :-)
> 
> Just throwing out other ideas, 'auto', 'ovs', 'ovscfg' ?

I personally prefer 'thread'; 'auto' might be misleading given hash mode
will not be selected automatically. And I find 'ovs' and 'ovscfg' too
vague.

Thanks,
Maxime

>>
>> And I agree that it's better to not split modes in the documentation
>> since they can not actually be configured by a user.
>>
>>
>> And since we're here, one comment for the code itself:
>>
>>> +    if (p->port->txq_mode == TXQ_MODE_HASH) {
>>> +        int n_txq = netdev_n_txq(p->port->netdev);
>>> +
>>> +        /* Re-batch per txq based on packet hash. */
>>> +        for (i = 0; i < output_cnt; i++) {
>>> +            struct dp_packet *packet = p->output_pkts.packets[i];
>>> +
>>> +            tx_qid = dp_packet_get_rss_hash(packet) % n_txq;
>>> +            dp_packet_batch_add(&p->txq_pkts[tx_qid], packet);
>>> +        }
>>
>> I think, you still have to check that RSS is valid with the
>> dp_packet_rss_valid() function before using it and re-calculate
>> if it's not available.  That is because RSS is not always calculated,
>> e.g. in case of a partial HW offload the miniflow_extract is skipped
>> and hash is not calculated either.  That might become a problem if
>> for any reason the driver didn't provide a hash (doesn't support?) or
>> for a future developments like 'simple match' optimization where the
>> packet parsing is mostly skipped.
>>
>> Since we don't have a miniflow here, we, probably, need something like
>> this:
>>    
>> https://github.com/openvswitch/ovs/blob/893693e8085cf4163a1ca3ae6b712bf4716ed8b7/lib/odp-execute.c#L896 
>>
>>
>> Best regards, Ilya Maximets.
>>
>
Kevin Traynor Dec. 17, 2021, 1:24 p.m. UTC | #10
On 17/12/2021 13:22, Maxime Coquelin wrote:
> 
> 
> On 12/17/21 14:03, Kevin Traynor wrote:
>> On 17/12/2021 12:23, Ilya Maximets wrote:
>>> On 12/16/21 21:04, Maxime Coquelin wrote:
>>>> Hi Kevin,
>>>>
>>>> On 12/16/21 19:00, Kevin Traynor wrote:
>>>>> On 15/12/2021 16:26, Maxime Coquelin wrote:
>>>>>
>>>>> Thanks Maxime. This feature is really needed and the way the
>>>>> implementation compliments the existing code is very slick.
>>>>>
>>>>> I tested many combinations switching between modes while traffic was
>>>>> flowing and everything was working well.
>>>>>
>>>>> An interesting test case, I tested with the performance of using 3
>>>>> txqs with default and hash modes. I saw a 13% rx tput drop in the VM
>>>>> with hash mode. I think that is quite reasonable considering the
>>>>> extra locking and the potential to remove other bottlenecks in a
>>>>> real VNF.
>>>>
>>>> Out of curiosity, what kind of traffic was sent?
>>>>
>>>>> I didn't dig in too deep on the clearing/loading cached ports code
>>>>> but it's working when changing mode on the fly. The rest of the code
>>>>> looks good. Couple of comments below.
>>>>>
> 
> I personally prefer 'thread'; 'auto' might be misleading given hash mode
> will not be selected automatically. And I find 'ovs' and 'ovscfg' too
> vague.
> 

Sounds good. Thanks!

> Thanks,
> Maxime
> 
Maxime Coquelin Dec. 17, 2021, 1:37 p.m. UTC | #11
On 12/17/21 14:20, Maxime Coquelin wrote:
> 
> 
> On 12/17/21 13:52, Ilya Maximets wrote:
>> On 12/17/21 13:23, Ilya Maximets wrote:
>>> On 12/16/21 21:04, Maxime Coquelin wrote:
>>>> Hi Kevin,
>>>>
>>>> On 12/16/21 19:00, Kevin Traynor wrote:
>>>>> On 15/12/2021 16:26, Maxime Coquelin wrote:
>>>>>> This patch adds a new hash Tx steering mode that
>>>>>> distributes the traffic on all the Tx queues, whatever the
>>>>>> number of PMD threads. It would be useful for guests
>>>>>> expecting traffic to be distributed on all the vCPUs.
>>>>>>
>>>>>> The idea here is to re-use the 5-tuple hash of the packets,
>>>>>> already computed to build the flows batches (and so it
>>>>>> does not provide flexibility on which fields are part of
>>>>>> the hash).
>>>>>>
>>>>>> There are also no user-configurable indirection table,
>>>>>> given the feature is transparent to the guest. The queue
>>>>>> selection is just a modulo operation between the packet
>>>>>> hash and the number of Tx queues.
>>>>>>
>>>>>> There are no (at least intentionnally) functionnal changes
>>>>>> for the existing XPS and static modes. There should not be
>>>>>> noticeable performance changes for these modes (only one
>>>>>> more branch in the hot path).
>>>>>>
>>>>>> For the hash mode, performance could be impacted due to
>>>>>> locking when multiple PMD threads are in use (same as
>>>>>> XPS mode) and also because of the second level of batching.
>>>>>>
>>>>>> Regarding the batching, the existing Tx port output_pkts
>>>>>> is not modified. It means that at maximum, NETDEV_MAX_BURST
>>>>>> can be batched for all the Tx queues. A second level of
>>>>>> batching is done in dp_netdev_pmd_flush_output_on_port(),
>>>>>> only for this hash mode.
>>>>>>
>>>>>
>>>>> Thanks Maxime. This feature is really needed and the way the 
>>>>> implementation compliments the existing code is very slick.
>>>>>
>>>>> I tested many combinations switching between modes while traffic 
>>>>> was flowing and everything was working well.
>>>>>
>>>>> An interesting test case, I tested with the performance of using 3 
>>>>> txqs with default and hash modes. I saw a 13% rx tput drop in the 
>>>>> VM with hash mode. I think that is quite reasonable considering the 
>>>>> extra locking and the potential to remove other bottlenecks in a 
>>>>> real VNF.
>>>>
>>>> Out of curiosity, what kind of traffic was sent?
>>>>
>>>>> I didn't dig in too deep on the clearing/loading cached ports code 
>>>>> but it's working when changing mode on the fly. The rest of the 
>>>>> code looks good. Couple of comments below.
>>>>>
>>>>>> Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com>
>>>>>> ---
>>>>>>    Documentation/automake.mk                     |   1 +
>>>>>>    Documentation/topics/index.rst                |   1 +
>>>>>>    .../topics/userspace-tx-steering.rst          |  78 +++++++++++
>>>>>>    lib/dpif-netdev.c                             | 125 
>>>>>> ++++++++++++++----
>>>>>>    4 files changed, 182 insertions(+), 23 deletions(-)
>>>>>>    create mode 100644 Documentation/topics/userspace-tx-steering.rst
>>>>>>
>>>>>> diff --git a/Documentation/automake.mk b/Documentation/automake.mk
>>>>>> index 137cc57c5..01e3c4f9e 100644
>>>>>> --- a/Documentation/automake.mk
>>>>>> +++ b/Documentation/automake.mk
>>>>>> @@ -58,6 +58,7 @@ DOC_SOURCE = \
>>>>>>        Documentation/topics/record-replay.rst \
>>>>>>        Documentation/topics/tracing.rst \
>>>>>>        Documentation/topics/userspace-tso.rst \
>>>>>> +    Documentation/topics/userspace-tx-steering.rst \
>>>>>>        Documentation/topics/windows.rst \
>>>>>>        Documentation/howto/index.rst \
>>>>>>        Documentation/howto/dpdk.rst \
>>>>>> diff --git a/Documentation/topics/index.rst 
>>>>>> b/Documentation/topics/index.rst
>>>>>> index d8ccbd757..3699fd5c4 100644
>>>>>> --- a/Documentation/topics/index.rst
>>>>>> +++ b/Documentation/topics/index.rst
>>>>>> @@ -55,3 +55,4 @@ OVS
>>>>>>       userspace-tso
>>>>>>       idl-compound-indexes
>>>>>>       ovs-extensions
>>>>>> +   userspace-tx-steering
>>>>>> diff --git a/Documentation/topics/userspace-tx-steering.rst 
>>>>>> b/Documentation/topics/userspace-tx-steering.rst
>>>>>> new file mode 100644
>>>>>> index 000000000..3de05c4ae
>>>>>> --- /dev/null
>>>>>> +++ b/Documentation/topics/userspace-tx-steering.rst
>>>>>> @@ -0,0 +1,78 @@
>>>>>> +..
>>>>>> +      Licensed under the Apache License, Version 2.0 (the 
>>>>>> "License"); you may
>>>>>> +      not use this file except in compliance with the License. 
>>>>>> You may obtain
>>>>>> +      a copy of the License at
>>>>>> +
>>>>>> +          http://www.apache.org/licenses/LICENSE-2.0
>>>>>> +
>>>>>> +      Unless required by applicable law or agreed to in writing, 
>>>>>> software
>>>>>> +      distributed under the License is distributed on an "AS IS" 
>>>>>> BASIS, WITHOUT
>>>>>> +      WARRANTIES OR CONDITIONS OF ANY KIND, either express or 
>>>>>> implied. See the
>>>>>> +      License for the specific language governing permissions and 
>>>>>> limitations
>>>>>> +      under the License.
>>>>>> +
>>>>>> +      Convention for heading levels in Open vSwitch documentation:
>>>>>> +
>>>>>> +      =======  Heading 0 (reserved for the title in a document)
>>>>>> +      -------  Heading 1
>>>>>> +      ~~~~~~~  Heading 2
>>>>>> +      +++++++  Heading 3
>>>>>> +      '''''''  Heading 4
>>>>>> +
>>>>>> +      Avoid deeper levels because they do not render well.
>>>>>> +
>>>>>> +============================
>>>>>> +Userspace Tx packet steering
>>>>>> +============================
>>>>>> +
>>>>>> +The userspace datapath supports three transmit packets steering 
>>>>>> modes.
>>>>>> +
>>>>>> +Static mode
>>>>>> +~~~~~~~~~~~
>>>>>> +
>>>>>> +This mode is automatically selected when port's ``tx-steering`` 
>>>>>> option is set
>>>>>> +to ``default`` or unset, and if the port's number of Tx queues is 
>>>>>> greater or
>>>>>> +equal than the number of PMD threads.
>>>>>> +
>>>>>> +This the recommended mode for performance reasons, as the Tx lock 
>>>>>> is not
>>>>>> +acquired. If the number of Tx queues is greater than the number 
>>>>>> of PMDs, the
>>>>>> +remaining Tx queues will not be used.
>>>>>> +
>>>>>> +XPS mode
>>>>>> +~~~~~~~~
>>>>>> +
>>>>>> +This mode is automatically selected when port's ``tx-steering`` 
>>>>>> option is set
>>>>>> +to ``default`` or unset, and if the port's number of Tx queues is 
>>>>>> lower than
>>>>>> +the number of PMD threads.
>>>>>> +
>>>>>> +This mode may have a performance impact, given the Tx lock 
>>>>>> acquisition is
>>>>>> +required as several PMD threads may use the same Tx queue.
>>>>>> +
>>>>>> +Hash mode
>>>>>> +~~~~~~~~~
>>>>>> +
>>>>>> +Hash-based Tx packets steering mode distributes the traffic on 
>>>>>> all the port's
>>>>>> +transmit queues, whatever the number of PMD threads. Queue 
>>>>>> selection is based
>>>>>> +on the 5-tuples hash already computed to build the flows batches, 
>>>>>> the selected
>>>>>> +queue being the modulo between the hash and the number of Tx 
>>>>>> queues of the
>>>>>> +port.
>>>>>> +
>>>>>> +Hash mode may be used for example with Vhost-user ports, when the 
>>>>>> number of
>>>>>> +vCPUs and queues of thevguest are greater than the number of PMD 
>>>>>> threads.
>>>>>> +Without hash mode, the Tx queues used would bevlimited to the 
>>>>>> number of PMD.
>>>>>> +
>>>>>> +Hash-based Tx packet steering may have an impact on the 
>>>>>> performance, given the
>>>>>> +Tx lock acquisition is required and a second level of batching is 
>>>>>> performed.
>>>>>> +
>>>>>> +This feature is disabled by default.
>>>>>> +
>>>>>> +Usage
>>>>>> +~~~~~
>>>>>> +
>>>>>> +To enable hash mode::
>>>>>> +
>>>>>> +    $ ovs-vsctl set Interface <iface> other_config:tx-steering=hash
>>>>>> +
>>>>>> +To disable hash mode::
>>>>>> +
>>>>>> +    $ ovs-vsctl set Interface <iface> 
>>>>>> other_config:tx-steering=default
>>>>>
>>>>> I think it might be better to use a more descriptive name than 
>>>>> 'default'.
>>>>>
>>>>> That way there's no naming conflict if the actual default behaviour 
>>>>> was ever changed to 'hash' mode in the future. Also, perhaps the 
>>>>> default mode could be set by the user in future so they wouldn't 
>>>>> have to set for every vhost port.
>>>>>
>>>>> Perhaps, tx-steering=pmd|hash ?
>>>>>
>>>>> It is well explained above, but maybe from a user perspective as 
>>>>> there is only 2 choices, then there should be 2 'modes' (e.g. 
>>>>> tx-steering=pmd|hash), and xps is explained as a feature that may 
>>>>> run in the 'pmd' mode. Internally in the code, it probably makes 
>>>>> sense to keep the 3 values in the enum etc.
>>>>
>>>> OK, that sounds reasonnable. Ilya, what do you think?
>>>
>>> Seems OK, but 'pmd', IMO, is a bad name.  We have a lot of things
>>> called 'pmd*' in OVS and most of these names makes very little sense,
>>> to the point where what we call PMD in OVS is not a Poll Mode Driver
>>> and has a very little relation to drivers in general.
>>>
>>> So, please, don't call it 'pmd'.  I'd rename a pile of other stuff
>>> from 'pmd*' to something else instead.  :)
>>>
>>> In general, I suggested 'default' just because it's hard to come up
>>> with a single word that would describe both the static and dynamic
>>> queue assignments.  But I agree that some less vague name should be
>>> better.  If you think that 'pmd' is descriptive enough, call it maybe
>>> 'thread' instead.  WDYT?
>>>
> 
> Ok, I think 'thread' is well-suited.
> 
>>> And I agree that it's better to not split modes in the documentation
>>> since they can not actually be configured by a user.
>>>
> 
> Got it, I will change to document 'thread' and 'hash' modes. In 'thread' 
> part, I will briefly mention the static and XPS sub-modes.
> 
>>> And since we're here, one comment for the code itself:
>>>
>>>> +    if (p->port->txq_mode == TXQ_MODE_HASH) {
>>>> +        int n_txq = netdev_n_txq(p->port->netdev);
>>>> +
>>>> +        /* Re-batch per txq based on packet hash. */
>>>> +        for (i = 0; i < output_cnt; i++) {
>>>> +            struct dp_packet *packet = p->output_pkts.packets[i];
>>>> +
>>>> +            tx_qid = dp_packet_get_rss_hash(packet) % n_txq;
>>>> +            dp_packet_batch_add(&p->txq_pkts[tx_qid], packet);
>>>> +        }
>>>
>>> I think, you still have to check that RSS is valid with the
>>> dp_packet_rss_valid() function before using it and re-calculate
>>> if it's not available.  That is because RSS is not always calculated,
>>> e.g. in case of a partial HW offload the miniflow_extract is skipped
>>> and hash is not calculated either.  That might become a problem if
>>> for any reason the driver didn't provide a hash (doesn't support?) or
>>> for a future developments like 'simple match' optimization where the
>>> packet parsing is mostly skipped.
>>>
>>> Since we don't have a miniflow here, we, probably, need something like
>>> this:
>>>    
>>> https://github.com/openvswitch/ovs/blob/893693e8085cf4163a1ca3ae6b712bf4716ed8b7/lib/odp-execute.c#L896 
>>>
>>
>> And, actually, you should be able to reproduce the issue by just enabling
>> the hw-offload in your testcase. dummy-pmd supports partial HW offload
>> and doesn't support RSS, so it should be invalid at the point where 
>> you're
>> sending packets to the output port.  Might be also a good idea to try 
>> that
>> under the address sanitizer.
> 
> Good catch, and thanks for providing a way to reproduce it!

I can reproduce it with setting other_config:hw-offload=true, and it 
even helped to find an issue in my test!

Thanks,
Maxime

> Maxime
>>>
>>> Best regards, Ilya Maximets.
>>>
>>
diff mbox series

Patch

diff --git a/Documentation/automake.mk b/Documentation/automake.mk
index 137cc57c5..01e3c4f9e 100644
--- a/Documentation/automake.mk
+++ b/Documentation/automake.mk
@@ -58,6 +58,7 @@  DOC_SOURCE = \
 	Documentation/topics/record-replay.rst \
 	Documentation/topics/tracing.rst \
 	Documentation/topics/userspace-tso.rst \
+	Documentation/topics/userspace-tx-steering.rst \
 	Documentation/topics/windows.rst \
 	Documentation/howto/index.rst \
 	Documentation/howto/dpdk.rst \
diff --git a/Documentation/topics/index.rst b/Documentation/topics/index.rst
index d8ccbd757..3699fd5c4 100644
--- a/Documentation/topics/index.rst
+++ b/Documentation/topics/index.rst
@@ -55,3 +55,4 @@  OVS
    userspace-tso
    idl-compound-indexes
    ovs-extensions
+   userspace-tx-steering
diff --git a/Documentation/topics/userspace-tx-steering.rst b/Documentation/topics/userspace-tx-steering.rst
new file mode 100644
index 000000000..3de05c4ae
--- /dev/null
+++ b/Documentation/topics/userspace-tx-steering.rst
@@ -0,0 +1,78 @@ 
+..
+      Licensed under the Apache License, Version 2.0 (the "License"); you may
+      not use this file except in compliance with the License. You may obtain
+      a copy of the License at
+
+          http://www.apache.org/licenses/LICENSE-2.0
+
+      Unless required by applicable law or agreed to in writing, software
+      distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
+      WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
+      License for the specific language governing permissions and limitations
+      under the License.
+
+      Convention for heading levels in Open vSwitch documentation:
+
+      =======  Heading 0 (reserved for the title in a document)
+      -------  Heading 1
+      ~~~~~~~  Heading 2
+      +++++++  Heading 3
+      '''''''  Heading 4
+
+      Avoid deeper levels because they do not render well.
+
+============================
+Userspace Tx packet steering
+============================
+
+The userspace datapath supports three transmit packet steering modes.
+
+Static mode
+~~~~~~~~~~~
+
+This mode is automatically selected when the port's ``tx-steering`` option is
+set to ``default`` or unset, and the port's number of Tx queues is greater
+than or equal to the number of PMD threads.
+
+This is the recommended mode for performance reasons, as the Tx lock is not
+acquired. If the number of Tx queues is greater than the number of PMDs, the
+remaining Tx queues will not be used.
+
+XPS mode
+~~~~~~~~
+
+This mode is automatically selected when the port's ``tx-steering`` option is
+set to ``default`` or unset, and the port's number of Tx queues is lower
+than the number of PMD threads.
+
+This mode may have a performance impact: since several PMD threads may use
+the same Tx queue, Tx lock acquisition is required.
+
+Hash mode
+~~~~~~~~~
+
+Hash-based Tx packet steering mode distributes the traffic on all the port's
+transmit queues, whatever the number of PMD threads. Queue selection is based
+on the 5-tuple hash already computed to build the flow batches; the selected
+queue is the result of the packet hash modulo the number of Tx queues of
+the port.
+
+Hash mode may be used, for example, with Vhost-user ports, when the number of
+vCPUs and queues of the guest are greater than the number of PMD threads.
+Without hash mode, the Tx queues used would be limited to the number of PMDs.
+
+Hash-based Tx packet steering may have a performance impact, since Tx lock
+acquisition is required and a second level of batching is performed.
+
+This feature is disabled by default.
+
+Usage
+~~~~~
+
+To enable hash mode::
+
+    $ ovs-vsctl set Interface <iface> other_config:tx-steering=hash
+
+To disable hash mode::
+
+    $ ovs-vsctl set Interface <iface> other_config:tx-steering=default
diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index 0407f30b3..b8c528d6e 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -385,15 +385,21 @@  struct dp_netdev_rxq {
     atomic_ullong cycles_intrvl[PMD_INTERVAL_MAX];
 };
 
+enum txq_req_mode {
+    TXQ_REQ_MODE_DEFAULT,
+    TXQ_REQ_MODE_HASH,
+};
+
 enum txq_mode {
     TXQ_MODE_STATIC,
     TXQ_MODE_XPS,
+    TXQ_MODE_HASH,
 };
 
 /* A port in a netdev-based datapath. */
 struct dp_netdev_port {
     odp_port_t port_no;
-    enum txq_mode txq_mode;     /* static, XPS */
+    enum txq_mode txq_mode;     /* static, XPS, HASH. */
     bool need_reconfigure;      /* True if we should reconfigure netdev. */
     struct netdev *netdev;
     struct hmap_node node;      /* Node in dp_netdev's 'ports'. */
@@ -405,6 +411,7 @@  struct dp_netdev_port {
     bool emc_enabled;           /* If true EMC will be used. */
     char *type;                 /* Port type as requested by user. */
     char *rxq_affinity_list;    /* Requested affinity of rx queues. */
+    enum txq_req_mode txq_requested_mode;
 };
 
 static bool dp_netdev_flow_ref(struct dp_netdev_flow *);
@@ -440,6 +447,7 @@  struct tx_port {
     struct hmap_node node;
     long long flush_time;
     struct dp_packet_batch output_pkts;
+    struct dp_packet_batch *txq_pkts; /* Only for hash mode. */
     struct dp_netdev_rxq *output_pkts_rxqs[NETDEV_MAX_BURST];
 };
 
@@ -4435,6 +4443,8 @@  dpif_netdev_port_set_config(struct dpif *dpif, odp_port_t port_no,
     int error = 0;
     const char *affinity_list = smap_get(cfg, "pmd-rxq-affinity");
     bool emc_enabled = smap_get_bool(cfg, "emc-enable", true);
+    const char *tx_steering_mode = smap_get(cfg, "tx-steering");
+    enum txq_req_mode txq_mode;
 
     ovs_mutex_lock(&dp->port_mutex);
     error = get_port_by_number(dp, port_no, &port);
@@ -4476,19 +4486,34 @@  dpif_netdev_port_set_config(struct dpif *dpif, odp_port_t port_no,
     }
 
     /* Checking for RXq affinity changes. */
-    if (!netdev_is_pmd(port->netdev)
-        || nullable_string_is_equal(affinity_list, port->rxq_affinity_list)) {
-        goto unlock;
+    if (netdev_is_pmd(port->netdev)
+        && !nullable_string_is_equal(affinity_list, port->rxq_affinity_list)) {
+
+        error = dpif_netdev_port_set_rxq_affinity(port, affinity_list);
+        if (error) {
+            goto unlock;
+        }
+        free(port->rxq_affinity_list);
+        port->rxq_affinity_list = nullable_xstrdup(affinity_list);
+
+        dp_netdev_request_reconfigure(dp);
     }
 
-    error = dpif_netdev_port_set_rxq_affinity(port, affinity_list);
-    if (error) {
-        goto unlock;
+    if (nullable_string_is_equal(tx_steering_mode, "hash")) {
+        txq_mode = TXQ_REQ_MODE_HASH;
+    } else {
+        txq_mode = TXQ_REQ_MODE_DEFAULT;
+    }
+
+    if (txq_mode != port->txq_requested_mode) {
+        port->txq_requested_mode = txq_mode;
+        VLOG_INFO("%s: Txq mode has been set to %s.",
+                  netdev_get_name(port->netdev),
+                  (txq_mode == TXQ_REQ_MODE_DEFAULT) ? "default" : "hash");
+        dp_netdev_request_reconfigure(dp);
+
     }
-    free(port->rxq_affinity_list);
-    port->rxq_affinity_list = nullable_xstrdup(affinity_list);
 
-    dp_netdev_request_reconfigure(dp);
 unlock:
     ovs_mutex_unlock(&dp->port_mutex);
     return error;
@@ -4603,18 +4628,38 @@  dp_netdev_pmd_flush_output_on_port(struct dp_netdev_pmd_thread *pmd,
 
     cycle_timer_start(&pmd->perf_stats, &timer);
 
-    if (p->port->txq_mode == TXQ_MODE_XPS) {
-        tx_qid = dpif_netdev_xps_get_tx_qid(pmd, p);
-        concurrent_txqs = true;
-    } else {
-        tx_qid = pmd->static_tx_qid;
-        concurrent_txqs = false;
-    }
-
     output_cnt = dp_packet_batch_size(&p->output_pkts);
     ovs_assert(output_cnt > 0);
 
-    netdev_send(p->port->netdev, tx_qid, &p->output_pkts, concurrent_txqs);
+    if (p->port->txq_mode == TXQ_MODE_HASH) {
+        int n_txq = netdev_n_txq(p->port->netdev);
+
+        /* Re-batch per txq based on packet hash. */
+        for (i = 0; i < output_cnt; i++) {
+            struct dp_packet *packet = p->output_pkts.packets[i];
+
+            tx_qid = dp_packet_get_rss_hash(packet) % n_txq;
+            dp_packet_batch_add(&p->txq_pkts[tx_qid], packet);
+        }
+
+        /* Flush the batch of each Tx queue. */
+        for (i = 0; i < n_txq; i++) {
+            if (dp_packet_batch_is_empty(&p->txq_pkts[i])) {
+                continue;
+            }
+            netdev_send(p->port->netdev, i, &p->txq_pkts[i], true);
+            dp_packet_batch_init(&p->txq_pkts[i]);
+        }
+    } else {
+        if (p->port->txq_mode == TXQ_MODE_XPS) {
+            tx_qid = dpif_netdev_xps_get_tx_qid(pmd, p);
+            concurrent_txqs = true;
+        } else {
+            tx_qid = pmd->static_tx_qid;
+            concurrent_txqs = false;
+        }
+        netdev_send(p->port->netdev, tx_qid, &p->output_pkts, concurrent_txqs);
+    }
     dp_packet_batch_init(&p->output_pkts);
 
     /* Update time of the next flush. */
@@ -5775,7 +5820,9 @@  reconfigure_datapath(struct dp_netdev *dp)
     HMAP_FOR_EACH (port, node, &dp->ports) {
         if (netdev_is_reconf_required(port->netdev)
             || ((port->txq_mode == TXQ_MODE_XPS)
-                != (netdev_n_txq(port->netdev) < wanted_txqs))) {
+                != (netdev_n_txq(port->netdev) < wanted_txqs))
+            || ((port->txq_mode == TXQ_MODE_HASH)
+                != (port->txq_requested_mode == TXQ_REQ_MODE_HASH))) {
             port->need_reconfigure = true;
         }
     }
@@ -5810,8 +5857,15 @@  reconfigure_datapath(struct dp_netdev *dp)
             seq_change(dp->port_seq);
             port_destroy(port);
         } else {
-            port->txq_mode = (netdev_n_txq(port->netdev) < wanted_txqs) ?
-                TXQ_MODE_XPS : TXQ_MODE_STATIC;
+            /* With a single queue, there is no point in using hash mode. */
+            if (port->txq_requested_mode == TXQ_REQ_MODE_HASH &&
+                    netdev_n_txq(port->netdev) > 1) {
+                port->txq_mode = TXQ_MODE_HASH;
+            } else if (netdev_n_txq(port->netdev) < wanted_txqs) {
+                port->txq_mode = TXQ_MODE_XPS;
+            } else {
+                port->txq_mode = TXQ_MODE_STATIC;
+            }
         }
     }
 
@@ -6100,9 +6154,11 @@  pmd_free_cached_ports(struct dp_netdev_pmd_thread *pmd)
     dpif_netdev_xps_revalidate_pmd(pmd, true);
 
     HMAP_FOR_EACH_POP (tx_port_cached, node, &pmd->tnl_port_cache) {
+        free(tx_port_cached->txq_pkts);
         free(tx_port_cached);
     }
     HMAP_FOR_EACH_POP (tx_port_cached, node, &pmd->send_port_cache) {
+        free(tx_port_cached->txq_pkts);
         free(tx_port_cached);
     }
 }
@@ -6122,14 +6178,27 @@  pmd_load_cached_ports(struct dp_netdev_pmd_thread *pmd)
     hmap_shrink(&pmd->tnl_port_cache);
 
     HMAP_FOR_EACH (tx_port, node, &pmd->tx_ports) {
+        int n_txq = netdev_n_txq(tx_port->port->netdev);
+        struct dp_packet_batch *txq_pkts_cached;
+
         if (netdev_has_tunnel_push_pop(tx_port->port->netdev)) {
             tx_port_cached = xmemdup(tx_port, sizeof *tx_port_cached);
+            if (tx_port->txq_pkts) {
+                txq_pkts_cached = xmemdup(tx_port->txq_pkts,
+                                          n_txq * sizeof *tx_port->txq_pkts);
+                tx_port_cached->txq_pkts = txq_pkts_cached;
+            }
             hmap_insert(&pmd->tnl_port_cache, &tx_port_cached->node,
                         hash_port_no(tx_port_cached->port->port_no));
         }
 
-        if (netdev_n_txq(tx_port->port->netdev)) {
+        if (n_txq) {
             tx_port_cached = xmemdup(tx_port, sizeof *tx_port_cached);
+            if (tx_port->txq_pkts) {
+                txq_pkts_cached = xmemdup(tx_port->txq_pkts,
+                                          n_txq * sizeof *tx_port->txq_pkts);
+                tx_port_cached->txq_pkts = txq_pkts_cached;
+            }
             hmap_insert(&pmd->send_port_cache, &tx_port_cached->node,
                         hash_port_no(tx_port_cached->port->port_no));
         }
@@ -6911,6 +6980,7 @@  dp_netdev_pmd_clear_ports(struct dp_netdev_pmd_thread *pmd)
         free(poll);
     }
     HMAP_FOR_EACH_POP (port, node, &pmd->tx_ports) {
+        free(port->txq_pkts);
         free(port);
     }
     ovs_mutex_unlock(&pmd->port_mutex);
@@ -6981,6 +7051,14 @@  dp_netdev_add_port_tx_to_pmd(struct dp_netdev_pmd_thread *pmd,
     tx->flush_time = 0LL;
     dp_packet_batch_init(&tx->output_pkts);
 
+    if (tx->port->txq_mode == TXQ_MODE_HASH) {
+        int i, n_txq = netdev_n_txq(tx->port->netdev);
+        tx->txq_pkts = xzalloc(n_txq * sizeof *tx->txq_pkts);
+        for (i = 0; i < n_txq; i++) {
+            dp_packet_batch_init(&tx->txq_pkts[i]);
+        }
+    }
+
     hmap_insert(&pmd->tx_ports, &tx->node, hash_port_no(tx->port->port_no));
     pmd->need_reload = true;
 }
@@ -6993,6 +7071,7 @@  dp_netdev_del_port_tx_from_pmd(struct dp_netdev_pmd_thread *pmd,
     OVS_REQUIRES(pmd->port_mutex)
 {
     hmap_remove(&pmd->tx_ports, &tx->node);
+    free(tx->txq_pkts);
     free(tx);
     pmd->need_reload = true;
 }