| Message ID | 57976A35.3020705@samsung.com |
| --- | --- |
| State | Not Applicable |
Looks mostly good to me, a couple more comments inline

Thanks,

Daniele

On 26/07/2016 06:48, "Ilya Maximets" <i.maximets@samsung.com> wrote:

>On 26.07.2016 04:46, Daniele Di Proietto wrote:
>> Thanks for the patch.
>>
>> I haven't been able to apply this without the XPS patch.
>
>That was the original idea. Using of this patch with current
>tx queue management may lead to performance issues on multiqueue
>configurations.

Ok, in this case it should be part of the same series.

>
>> This looks like a perfect chance to add more tests to pmd.at.

I can do it if you want

>
>Sounds good.
>
>> I started taking a look at this patch and I have a few comments inline. I'll keep looking at it tomorrow
>>
>> Thanks,
>>
>> Daniele
>>
>>
>> On 15/07/2016 04:54, "Ilya Maximets" <i.maximets@samsung.com> wrote:
>>
>>> New 'other_config:pmd-rxq-affinity' field for Interface table to
>>> perform manual pinning of RX queues to desired cores.
>>>
>>> This functionality is required to achieve maximum performance because
>>> all kinds of ports have different cost of rx/tx operations and
>>> only user can know about expected workload on different ports.
>>>
>>> Example:
>>>     # ./bin/ovs-vsctl set interface dpdk0 options:n_rxq=4 \
>>>                       other_config:pmd-rxq-affinity="0:3,1:7,3:8"
>>>     Queue #0 pinned to core 3;
>>>     Queue #1 pinned to core 7;
>>>     Queue #2 not pinned.
>>>     Queue #3 pinned to core 8;
>>>
>>> It's decided to automatically isolate cores that have rxq explicitly
>>> assigned to them because it's useful to keep constant polling rate on
>>> some performance critical ports while adding/deleting other ports
>>> without explicit pinning of all ports.
>>>
>>> Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
>>> ---
>>>  INSTALL.DPDK.md      |  49 +++++++++++-
>>>  NEWS                 |   2 +
>>>  lib/dpif-netdev.c    | 218 ++++++++++++++++++++++++++++++++++++++++++---------
>>>  tests/pmd.at         |   6 ++
>>>  vswitchd/vswitch.xml |  23 ++++++
>>>  5 files changed, 257 insertions(+), 41 deletions(-)
>>>
>>> diff --git a/INSTALL.DPDK.md b/INSTALL.DPDK.md
>>> index 5407794..7609aa7 100644
>>> --- a/INSTALL.DPDK.md
>>> +++ b/INSTALL.DPDK.md
>>> @@ -289,14 +289,57 @@ advanced install guide [INSTALL.DPDK-ADVANCED.md]
>>>       # Check current stats
>>>         ovs-appctl dpif-netdev/pmd-stats-show
>>>
>>> +     # Clear previous stats
>>> +       ovs-appctl dpif-netdev/pmd-stats-clear
>>> +     ```
>>> +
>>> +  7. Port/rxq assigment to PMD threads
>>> +
>>> +     ```
>>>       # Show port/rxq assignment
>>>         ovs-appctl dpif-netdev/pmd-rxq-show
>>> +     ```
>>>
>>> -     # Clear previous stats
>>> -       ovs-appctl dpif-netdev/pmd-stats-clear
>>> +     To change default rxq assignment to pmd threads rxqs may be manually
>>> +     pinned to desired cores using:
>>> +
>>> +     ```
>>> +     ovs-vsctl set Interface <iface> \
>>> +               other_config:pmd-rxq-affinity=<rxq-affinity-list>
>>>       ```
>>> +     where:
>>> +
>>> +     ```
>>> +     <rxq-affinity-list> ::= NULL | <non-empty-list>
>>> +     <non-empty-list> ::= <affinity-pair> |
>>> +                          <affinity-pair> , <non-empty-list>
>>> +     <affinity-pair> ::= <queue-id> : <core-id>
>>> +     ```
>>> +
>>> +     Example:
>>> +
>>> +     ```
>>> +     ovs-vsctl set interface dpdk0 options:n_rxq=4 \
>>> +               other_config:pmd-rxq-affinity="0:3,1:7,3:8"
>>> +
>>> +     Queue #0 pinned to core 3;
>>> +     Queue #1 pinned to core 7;
>>> +     Queue #2 not pinned.
>>> +     Queue #3 pinned to core 8;
>>> +     ```
>>> +
>>> +     After that PMD threads on cores where RX queues was pinned will become
>>> +     `isolated`. This means that this thread will poll only pinned RX queues.
>>> +
>>> +     WARNING: If there are no `non-isolated` PMD threads, `non-pinned` RX queues
>>> +     will not be polled. Also, if provided `core_id` is not available (ex. this
>>> +     `core_id` not in `pmd-cpu-mask`), RX queue will not be polled by any
>>> +     PMD thread.
>>> +
>>> +     Isolation of PMD threads also can be checked using
>>> +     `ovs-appctl dpif-netdev/pmd-rxq-show` command.
>>>
>>> -  7. Stop vswitchd & Delete bridge
>>> +  8. Stop vswitchd & Delete bridge
>>>
>>>      ```
>>>      ovs-appctl -t ovs-vswitchd exit
>>> diff --git a/NEWS b/NEWS
>>> index 6496dc1..9ccc1f5 100644
>>> --- a/NEWS
>>> +++ b/NEWS
>>> @@ -44,6 +44,8 @@ Post-v2.5.0
>>>        Old 'other_config:n-dpdk-rxqs' is no longer supported.
>>>        Not supported by vHost interfaces. For them number of rx and tx queues
>>>        is applied from connected virtio device.
>>> +   * New 'other_config:pmd-rxq-affinity' field for PMD interfaces, that
>>> +     allows to pin port's rx queues to desired cores.
>>>    * New appctl command 'dpif-netdev/pmd-rxq-show' to check the port/rxq
>>>      assignment.
>>>    * Type of log messages from PMD threads changed from INFO to DBG.
>>> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
>>> index 18ce316..e5a8dec 100644
>>> --- a/lib/dpif-netdev.c
>>> +++ b/lib/dpif-netdev.c
>>> @@ -63,6 +63,7 @@
>>>  #include "random.h"
>>>  #include "seq.h"
>>>  #include "shash.h"
>>> +#include "smap.h"
>>>  #include "sset.h"
>>>  #include "timeval.h"
>>>  #include "tnl-neigh-cache.h"
>>> @@ -250,6 +251,12 @@ enum pmd_cycles_counter_type {
>>>
>>>  #define XPS_TIMEOUT_MS 500LL
>>>
>>> +/* Contained by struct dp_netdev_port's 'rxqs' member. */
>>> +struct dp_netdev_rxq {
>>> +    struct netdev_rxq *rxq;
>>> +    unsigned core_id;       /* Сore to which this queue is pinned. */
>>> +};
>>> +
>>>  /* A port in a netdev-based datapath. */
>>>  struct dp_netdev_port {
>>>      odp_port_t port_no;
>>> @@ -257,10 +264,11 @@ struct dp_netdev_port {
>>>      struct hmap_node node;  /* Node in dp_netdev's 'ports'. */
>>>      struct netdev_saved_flags *sf;
>>>      unsigned n_rxq;         /* Number of elements in 'rxq' */
>>> -    struct netdev_rxq **rxq;
>>> +    struct dp_netdev_rxq *rxqs;
>>>      unsigned *txq_used;     /* Number of threads that uses each tx queue. */
>>>      struct ovs_mutex txq_used_mutex;
>>>      char *type;             /* Port type as requested by user. */
>>> +    char *rxq_affinity_list; /* Requested affinity of rx queues. */
>>>  };
>>>
>>>  /* Contained by struct dp_netdev_flow's 'stats' member. */
>>> @@ -447,6 +455,7 @@ struct dp_netdev_pmd_thread {
>>>      pthread_t thread;
>>>      unsigned core_id;       /* CPU core id of this pmd thread. */
>>>      int numa_id;            /* numa node id of this pmd thread. */
>>> +    bool isolated;
>>>
>>>      /* Queue id used by this pmd thread to send packets on all netdevs.
>>>       * All tx_qid's are unique and less than 'ovs_numa_get_n_cores() + 1'. */
>>> @@ -541,6 +550,8 @@ static struct dp_netdev_pmd_thread *
>>>  dp_netdev_less_loaded_pmd_on_numa(struct dp_netdev *dp, int numa_id);
>>>  static void dp_netdev_reset_pmd_threads(struct dp_netdev *dp)
>>>      OVS_REQUIRES(dp->port_mutex);
>>> +static void reconfigure_pmd_threads(struct dp_netdev *dp)
>>> +    OVS_REQUIRES(dp->port_mutex);
>>>  static bool dp_netdev_pmd_try_ref(struct dp_netdev_pmd_thread *pmd);
>>>  static void dp_netdev_pmd_unref(struct dp_netdev_pmd_thread *pmd);
>>>  static void dp_netdev_pmd_flow_flush(struct dp_netdev_pmd_thread *pmd);
>>> @@ -731,8 +742,10 @@ pmd_info_show_rxq(struct ds *reply, struct dp_netdev_pmd_thread *pmd)
>>>      struct rxq_poll *poll;
>>>      const char *prev_name = NULL;
>>>
>>> -    ds_put_format(reply, "pmd thread numa_id %d core_id %u:\n",
>>> -                  pmd->numa_id, pmd->core_id);
>>> +    ds_put_format(reply,
>>> +                  "pmd thread numa_id %d core_id %u:\nisolated : %s\n",
>>
>> I think we should put a "\t" before "isolated:"
>
>OK.
>
>>> +                  pmd->numa_id, pmd->core_id, (pmd->isolated)
>>> +                                              ? "true" : "false");
>>>
>>>      ovs_mutex_lock(&pmd->port_mutex);
>>>      LIST_FOR_EACH (poll, node, &pmd->poll_list) {
>>> @@ -1196,18 +1209,19 @@ port_create(const char *devname, const char *open_type, const char *type,
>>>      port->port_no = port_no;
>>>      port->netdev = netdev;
>>>      port->n_rxq = netdev_n_rxq(netdev);
>>> -    port->rxq = xcalloc(port->n_rxq, sizeof *port->rxq);
>>> +    port->rxqs = xcalloc(port->n_rxq, sizeof *port->rxqs);
>>>      port->txq_used = xcalloc(netdev_n_txq(netdev), sizeof *port->txq_used);
>>>      port->type = xstrdup(type);
>>>      ovs_mutex_init(&port->txq_used_mutex);
>>>
>>>      for (i = 0; i < port->n_rxq; i++) {
>>> -        error = netdev_rxq_open(netdev, &port->rxq[i], i);
>>> +        error = netdev_rxq_open(netdev, &port->rxqs[i].rxq, i);
>>>          if (error) {
>>>              VLOG_ERR("%s: cannot receive packets on this network device (%s)",
>>>                       devname, ovs_strerror(errno));
>>>              goto out_rxq_close;
>>>          }
>>> +        port->rxqs[i].core_id = -1;
>>>          n_open_rxqs++;
>>>      }
>>>
>>> @@ -1223,12 +1237,12 @@ port_create(const char *devname, const char *open_type, const char *type,
>>>
>>>  out_rxq_close:
>>>      for (i = 0; i < n_open_rxqs; i++) {
>>> -        netdev_rxq_close(port->rxq[i]);
>>> +        netdev_rxq_close(port->rxqs[i].rxq);
>>>      }
>>>      ovs_mutex_destroy(&port->txq_used_mutex);
>>>      free(port->type);
>>>      free(port->txq_used);
>>> -    free(port->rxq);
>>> +    free(port->rxqs);
>>>      free(port);
>>>
>>>  out:
>>> @@ -1365,11 +1379,12 @@ port_destroy(struct dp_netdev_port *port)
>>>      netdev_restore_flags(port->sf);
>>>
>>>      for (unsigned i = 0; i < port->n_rxq; i++) {
>>> -        netdev_rxq_close(port->rxq[i]);
>>> +        netdev_rxq_close(port->rxqs[i].rxq);
>>>      }
>>>      ovs_mutex_destroy(&port->txq_used_mutex);
>>> +    free(port->rxq_affinity_list);
>>>      free(port->txq_used);
>>> -    free(port->rxq);
>>> +    free(port->rxqs);
>>>      free(port->type);
>>>      free(port);
>>>  }
>>> @@ -2539,6 +2554,97 @@ dpif_netdev_pmd_set(struct dpif *dpif, const char *cmask)
>>>      return 0;
>>>  }
>>>
>>> +/* Parses affinity list and returns result in 'core_ids'. */
>>> +static int
>>> +parse_affinity_list(const char *affinity_list, unsigned *core_ids, int n_rxq)
>>> +{
>>> +    unsigned i;
>>> +    char *list, *pt, *saveptr = NULL;
>>> +    int error = 0;
>>> +
>>> +    for (i = 0; i < n_rxq; i++) {
>>> +        core_ids[i] = -1;
>>> +    }
>>> +
>>> +    if (!affinity_list) {
>>> +        return 0;
>>> +    }
>>> +
>>> +    list = xstrdup(affinity_list);
>>> +    for (pt = strtok_r(list, ",:", &saveptr); pt;
>>> +         pt = strtok_r(NULL, ",:", &saveptr)) {
>>> +        int rxq_id, core_id;
>>> +
>>> +        rxq_id = strtol(pt, NULL, 10);
>>> +        if (rxq_id < 0) {
>>> +            error = EINVAL;
>>> +            break;
>>> +        }
>>> +        pt = strtok_r(NULL, ",:", &saveptr);
>>> +        if (!pt || (core_id = strtol(pt, NULL, 10)) < 0) {
>>> +            error = EINVAL;
>>> +            break;
>>> +        }
>>> +        core_ids[rxq_id] = core_id;
>>> +    }

I'd like this parser to be stricter:

Can we use strsep() instead of strtok()? This way adjacent separator won't
be collapsed.

I think we should accept only "," to split affinity pairs and accept only
":" to split the queue and the core.

I think we should make sure that rxq_id < n_rxq, otherwise we will write
out of the bounds of the array.

Can we use parse_int() instead of strtol? It has better error handling.
If we want to allow spaces after the integer we should use another strsep()
with " " before parse_int().

>>> +    free(list);
>>> +    return error;
>>> +}
>>> +
>>> +/* Parses 'affinity_list' and applies configuration if it is valid. */
>>> +static int
>>> +dpif_netdev_port_set_rxq_affinity(struct dp_netdev_port *port,
>>> +                                  const char *affinity_list)
>>> +{
>>> +    unsigned *core_ids, i;
>>> +    int error = 0;
>>> +
>>> +    core_ids = xmalloc(port->n_rxq * sizeof *core_ids);
>>> +    if (parse_affinity_list(affinity_list, core_ids, port->n_rxq)) {
>>> +        error = EINVAL;
>>> +        goto exit;
>>> +    }
>>> +
>>> +    for (i = 0; i < port->n_rxq; i++) {
>>> +        port->rxqs[i].core_id = core_ids[i];
>>> +    }
>>> +
>>> +exit:
>>> +    free(core_ids);
>>> +    return error;
>>> +}
>>> +
>>> +/* Changes the affinity of port's rx queues. The changes are actually applied
>>> + * in dpif_netdev_run(). */
>>> +static int
>>> +dpif_netdev_port_set_config(struct dpif *dpif, odp_port_t port_no,
>>> +                            const struct smap *cfg)
>>> +{
>>> +    struct dp_netdev *dp = get_dp_netdev(dpif);
>>> +    struct dp_netdev_port *port;
>>> +    int error = 0;
>>> +    const char *affinity_list = smap_get(cfg, "pmd-rxq-affinity");
>>> +
>>> +    ovs_mutex_lock(&dp->port_mutex);
>>> +    error = get_port_by_number(dp, port_no, &port);
>>> +    if (error || !netdev_is_pmd(port->netdev)
>>> +        || nullable_string_is_equal(affinity_list, port->rxq_affinity_list)) {
>>> +        goto unlock;
>>> +    }
>>> +
>>> +    error = dpif_netdev_port_set_rxq_affinity(port, affinity_list);
>>> +    if (error) {
>>> +        goto unlock;
>>> +    }
>>> +    free(port->rxq_affinity_list);
>>> +    port->rxq_affinity_list = nullable_xstrdup(affinity_list);
>>> +
>>> +    reconfigure_pmd_threads(dp);
>>
>> This will execute reconfigure the threads immediately.
>>
>> Can't we postpone the changes to dpif_netdev_run(), so that if multiple ports
>> are changed we stop the threads only once?
>
>I guess, we can.
>How about implementing of 2 functions:
> * dp_netdev_request_reconfigure()
> * dp_netdev_is_reconf_required()
>just like for 'netdev'?
>
>Maybe something like following fixup will fit (not tested):
>-----------------------------------------------------------------------
>diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
>index 64a4b29..cf93b52 100644
>--- a/lib/dpif-netdev.c
>+++ b/lib/dpif-netdev.c
>@@ -224,13 +224,27 @@ struct dp_netdev {
>      * 'struct dp_netdev_pmd_thread' in 'per_pmd_key'. */
>     ovsthread_key_t per_pmd_key;
>
>+    struct seq *reconfigure_seq;
>+    uint64_t last_reconfigure_seq;
>+
>     /* Cpu mask for pin of pmd threads. */
>-    char *requested_pmd_cmask;
>     char *pmd_cmask;
>
>     uint64_t last_tnl_conf_seq;
> };
>
>+static void
>+dp_netdev_request_reconfigure(struct dp_netdev *dp)
>+{
>+    seq_change(dp->reconfigure_seq);
>+}
>+
>+static bool
>+dp_netdev_is_reconf_required(struct dp_netdev *dp)
>+{
>+    return seq_read(dp->reconfigure_seq) != dp->last_reconfigure_seq;
>+}
>+
> static struct dp_netdev_port *dp_netdev_lookup_port(const struct dp_netdev *dp,
>                                                     odp_port_t)
>     OVS_REQUIRES(dp->port_mutex);
>@@ -954,6 +968,9 @@ create_dp_netdev(const char *name, const struct dpif_class *class,
>     dp->port_seq = seq_create();
>     fat_rwlock_init(&dp->upcall_rwlock);
>
>+    dp->reconfigure_seq = seq_create();
>+    dp->last_reconfigure_seq = seq_read(dp->reconfigure_seq);
>+
>     /* Disable upcalls by default. */
>     dp_netdev_disable_upcall(dp);
>     dp->upcall_aux = NULL;
>@@ -2546,11 +2563,13 @@ dpif_netdev_pmd_set(struct dpif *dpif, const char *cmask)
> {
>     struct dp_netdev *dp = get_dp_netdev(dpif);
>
>-    if (!nullable_string_is_equal(dp->requested_pmd_cmask, cmask)) {
>-        free(dp->requested_pmd_cmask);
>-        dp->requested_pmd_cmask = nullable_xstrdup(cmask);
>+    if (!nullable_string_is_equal(dp->pmd_cmask, cmask)) {
>+        free(dp->pmd_cmask);
>+        dp->pmd_cmask = nullable_xstrdup(cmask);
>     }
>
>+    dp_netdev_request_reconfigure(dp);
>+
>     return 0;
> }
>
>@@ -2639,7 +2658,7 @@ dpif_netdev_port_set_config(struct dpif *dpif, odp_port_t port_no,
>     free(port->rxq_affinity_list);
>     port->rxq_affinity_list = nullable_xstrdup(affinity_list);
>
>-    reconfigure_pmd_threads(dp);
>+    dp_netdev_request_reconfigure(dp);
> unlock:
>     ovs_mutex_unlock(&dp->port_mutex);
>     return error;
>@@ -2796,6 +2815,8 @@ reconfigure_pmd_threads(struct dp_netdev *dp)
> {
>     struct dp_netdev_port *port, *next;
>
>+    dp->last_reconfigure_seq = seq_read(dp->reconfigure_seq);
>+
>     dp_netdev_destroy_all_pmds(dp);
>
>     HMAP_FOR_EACH_SAFE (port, next, node, &dp->ports) {
>@@ -2809,10 +2830,7 @@ reconfigure_pmd_threads(struct dp_netdev *dp)
>         }
>     }
>     /* Reconfigures the cpu mask. */
>-    ovs_numa_set_cpu_mask(dp->requested_pmd_cmask);
>-    free(dp->pmd_cmask);
>-    dp->pmd_cmask = nullable_xstrdup(dp->requested_pmd_cmask);
>-
>+    ovs_numa_set_cpu_mask(dp->pmd_cmask);
>     /* Restores the non-pmd. */
>     dp_netdev_set_nonpmd(dp);
>     /* Restores all pmd threads. */
>@@ -2861,8 +2879,7 @@ dpif_netdev_run(struct dpif *dpif)
>
>     dp_netdev_pmd_unref(non_pmd);
>
>-    if (!nullable_string_is_equal(dp->pmd_cmask, dp->requested_pmd_cmask)
>-        || ports_require_restart(dp)) {
>+    if (dp_netdev_is_reconf_required(dp) || ports_require_restart(dp)) {
>         reconfigure_pmd_threads(dp);
>     }
>     ovs_mutex_unlock(&dp->port_mutex);
>
>-----------------------------------------------------------------------

I think we need to destroy the seq object in dp_netdev_free(), other than
that I like the idea and the implementation, please include the incremental.
>
>>> +unlock:
>>> +    ovs_mutex_unlock(&dp->port_mutex);
>>> +    return error;
>>> +}
>>> +
>>>  static int
>>>  dpif_netdev_queue_to_priority(const struct dpif *dpif OVS_UNUSED,
>>>                                uint32_t queue_id, uint32_t *priority)
>>> @@ -2638,7 +2744,7 @@ static int
>>>  port_reconfigure(struct dp_netdev_port *port)
>>>  {
>>>      struct netdev *netdev = port->netdev;
>>> -    int i, err;
>>> +    int i, err, old_n_rxq;
>>>
>>>      if (!netdev_is_reconf_required(netdev)) {
>>>          return 0;
>>> @@ -2646,9 +2752,10 @@ port_reconfigure(struct dp_netdev_port *port)
>>>
>>>      /* Closes the existing 'rxq's. */
>>>      for (i = 0; i < port->n_rxq; i++) {
>>> -        netdev_rxq_close(port->rxq[i]);
>>> -        port->rxq[i] = NULL;
>>> +        netdev_rxq_close(port->rxqs[i].rxq);
>>> +        port->rxqs[i].rxq = NULL;
>>>      }
>>> +    old_n_rxq = port->n_rxq;
>>>      port->n_rxq = 0;
>>>
>>>      /* Allows 'netdev' to apply the pending configuration changes. */
>>> @@ -2659,19 +2766,27 @@ port_reconfigure(struct dp_netdev_port *port)
>>>          return err;
>>>      }
>>>      /* If the netdev_reconfigure() above succeeds, reopens the 'rxq's. */
>>> -    port->rxq = xrealloc(port->rxq, sizeof *port->rxq * netdev_n_rxq(netdev));
>>> +    port->rxqs = xrealloc(port->rxqs,
>>> +                          sizeof *port->rxqs * netdev_n_rxq(netdev));
>>>      /* Realloc 'used' counters for tx queues. */
>>>      free(port->txq_used);
>>>      port->txq_used = xcalloc(netdev_n_txq(netdev), sizeof *port->txq_used);
>>>
>>>      for (i = 0; i < netdev_n_rxq(netdev); i++) {
>>> -        err = netdev_rxq_open(netdev, &port->rxq[i], i);
>>> +        err = netdev_rxq_open(netdev, &port->rxqs[i].rxq, i);
>>>          if (err) {
>>>              return err;
>>>          }
>>> +        /* Initialization for newly allocated memory. */
>>> +        if (i >= old_n_rxq) {
>>> +            port->rxqs[i].core_id = -1;
>>> +        }
>>
>> The above is not necessary, dpif_netdev_port_set_rxq_affinity() will
>> set the appropriate affinity, right?
>
>Yes. You're right. Thanks.
>
>>>          port->n_rxq++;
>>>      }
>>>
>>> +    /* Parse affinity list to apply configuration for new queues. */
>>> +    dpif_netdev_port_set_rxq_affinity(port, port->rxq_affinity_list);
>>> +
>>>      return 0;
>>>  }
>>>
>>> @@ -2737,7 +2852,7 @@ dpif_netdev_run(struct dpif *dpif)
>>>              int i;
>>>
>>>              for (i = 0; i < port->n_rxq; i++) {
>>> -                dp_netdev_process_rxq_port(non_pmd, port, port->rxq[i]);
>>> +                dp_netdev_process_rxq_port(non_pmd, port, port->rxqs[i].rxq);
>>>              }
>>>          }
>>>      }
>>> @@ -2777,7 +2892,7 @@ dpif_netdev_wait(struct dpif *dpif)
>>>              int i;
>>>
>>>              for (i = 0; i < port->n_rxq; i++) {
>>> -                netdev_rxq_wait(port->rxq[i]);
>>> +                netdev_rxq_wait(port->rxqs[i].rxq);
>>>              }
>>>          }
>>>      }
>>> @@ -3256,9 +3371,9 @@ dp_netdev_del_port_from_all_pmds(struct dp_netdev *dp,
>>>  }
>>>
>>>
>>> -/* Returns PMD thread from this numa node with fewer rx queues to poll.
>>> - * Returns NULL if there is no PMD threads on this numa node.
>>> - * Can be called safely only by main thread. */
>>> +/* Returns non-isolated PMD thread from this numa node with fewer
>>> + * rx queues to poll. Returns NULL if there is no non-isolated PMD threads
>>
>> Double space
>>
>> s/threads/thread/
>
>Thanks. I'll fix this in v4.
>
>Best regards, Ilya Maximets.
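
For readers following the review, here is a minimal, standalone sketch of the stricter parsing Daniele asks for above: strsep() so that adjacent separators are not collapsed, "," accepted only between pairs, ":" accepted only between queue id and core id, and a bounds check on the queue id. It is illustrative only, not the patch: it uses plain libc (strdup(), strtol() wrapped in a hypothetical parse_uint() helper) where the in-tree version would use the OVS equivalents.

```c
/* Sketch: stricter parsing of "<rxq>:<core>[,<rxq>:<core>]..." lists. */
#define _DEFAULT_SOURCE          /* for strsep() on glibc */
#include <errno.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper standing in for OVS's integer parsing: accepts a
 * non-negative decimal integer with no trailing characters. */
static int
parse_uint(const char *s, long *value)
{
    char *end;

    errno = 0;
    *value = strtol(s, &end, 10);
    if (errno || end == s || *end != '\0' || *value < 0 || *value > INT_MAX) {
        return EINVAL;
    }
    return 0;
}

/* Fills core_ids[0..n_rxq-1] from a list such as "0:3,1:7,3:8".
 * Queues that are not mentioned keep core id -1 (not pinned). */
static int
parse_affinity_list(const char *affinity_list, int *core_ids, int n_rxq)
{
    char *copy, *pairs, *pair;
    int error = 0;
    int i;

    for (i = 0; i < n_rxq; i++) {
        core_ids[i] = -1;
    }
    if (!affinity_list || !affinity_list[0]) {
        return 0;
    }

    copy = pairs = strdup(affinity_list);
    if (!copy) {
        return ENOMEM;
    }
    /* "," only between pairs; adjacent commas yield an empty token and fail. */
    while (!error && (pair = strsep(&pairs, ",")) != NULL) {
        char *core_str = pair;
        char *rxq_str = strsep(&core_str, ":");   /* ":" only inside a pair */
        long rxq_id, core_id;

        if (!core_str                             /* missing ":" */
            || parse_uint(rxq_str, &rxq_id)
            || parse_uint(core_str, &core_id)
            || rxq_id >= n_rxq) {                 /* stay inside the array */
            error = EINVAL;
        } else {
            core_ids[rxq_id] = (int) core_id;
        }
    }
    free(copy);
    return error;
}

int
main(void)
{
    int core_ids[4];
    int i;

    if (!parse_affinity_list("0:3,1:7,3:8", core_ids, 4)) {
        for (i = 0; i < 4; i++) {
            printf("queue %d -> core %d\n", i, core_ids[i]);
        }
    }
    return 0;
}
```

Rejecting trailing characters in parse_uint() is what makes inputs such as "0:3:5" or "0 :3" fail instead of being silently truncated; allowing a trailing space, as Daniele notes, would take one more strsep() on " " before the conversion.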
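
The deferred-reconfiguration idea is the main point of the incremental above, so a tiny self-contained illustration of the pattern may help: configuration changes bump a sequence number to request a reconfiguration, and the main loop compares it with the last value it acknowledged, so several back-to-back changes collapse into a single restart of the PMD threads. OVS implements this with 'struct seq' (lib/seq.h); the toy below stands in a plain C11 atomic counter, so it is a sketch of the idea rather than of the real API. The loose end Daniele points out, freeing the seq in dp_netdev_free(), would presumably be a single seq_destroy(dp->reconfigure_seq) call in that teardown path, mirroring the seq_create() added in create_dp_netdev().

```c
/* Toy illustration of the request/acknowledge pattern from the incremental
 * above.  Not OVS code: a C11 atomic counter stands in for 'struct seq'. */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct toy_dp {
    atomic_uint_least64_t reconfigure_seq;  /* bumped on every request */
    uint64_t last_reconfigure_seq;          /* last value acted upon */
};

static void
request_reconfigure(struct toy_dp *dp)
{
    atomic_fetch_add(&dp->reconfigure_seq, 1);
}

static bool
is_reconf_required(struct toy_dp *dp)
{
    return atomic_load(&dp->reconfigure_seq) != dp->last_reconfigure_seq;
}

static void
reconfigure(struct toy_dp *dp)
{
    /* Acknowledge first, like reconfigure_pmd_threads() reading
     * last_reconfigure_seq at the top: a request that arrives while we are
     * reconfiguring simply triggers another pass later. */
    dp->last_reconfigure_seq = atomic_load(&dp->reconfigure_seq);
    printf("reconfiguring PMD threads\n");
}

int
main(void)
{
    struct toy_dp dp = { 0 };

    /* Several back-to-back changes (e.g. affinity on two ports plus a new
     * pmd-cpu-mask)... */
    request_reconfigure(&dp);
    request_reconfigure(&dp);
    request_reconfigure(&dp);

    /* ...coalesce into a single reconfiguration in the main loop. */
    if (is_reconf_required(&dp)) {
        reconfigure(&dp);
    }
    printf("still required: %s\n", is_reconf_required(&dp) ? "yes" : "no");
    return 0;
}
```

This mirrors the point raised in the review: multiple port changes in one ovsdb transaction no longer stop and restart the PMD threads once per port, only once per dpif_netdev_run() pass that notices the sequence number moved.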