Message ID | 1671282936-15771-1-git-send-email-lic121@chinatelecom.cn |
---|---|
State | Accepted |
Commit | 46e04ec31bb2b889bd5715d436be2bdc0268f08b |
Series | [ovs-dev,v2] dpif-netdev: calculate per numa variance |
Context | Check | Description |
---|---|---|
ovsrobot/apply-robot | success | apply and check: success |
ovsrobot/github-robot-_Build_and_Test | success | github build: passed |
ovsrobot/intel-ovs-compilation | success | test: success |
Hi Cheng,

On 17/12/2022 13:15, Cheng Li wrote:
> Currently, pmd_rebalance_dry_run() calculate overall variance of
> all pmds regardless of their numa location. The overall result may
> hide un-balance in an individual numa.
>
> Considering the following case. Numa0 is free because VMs on numa0
> are not sending pkts, while numa1 is busy. Within numa1, pmds
> workloads are not balanced. Obviously, moving 500 kpps workloads from
> pmd 126 to pmd 62 will make numa1 much more balance. For numa1
> the variance improment will be almost 100%, because after rebalance

typo: improvement - maybe maintainer can fix on apply and punctuate
commit title ("dpif-netdev: Calculate per numa variance.") :-)

> each pmd in numa1 holds same workload(variance ~= 0). But the overall
> variance improvement is only about 20%, which may not tigger auto_lb.
>
> ```
> numa_id core_id kpps
>       0      30    0
>       0      31    0
>       0      94    0
>       0      95    0
>       1     126 1500
>       1     127 1000
>       1      63 1000
>       1      62  500
> ```
>
> As auto_lb doesn't balance workload across numa nodes. So it makes
> more sense to calculate variance improvemnet per numa node.
>
> Signed-off-by: Cheng Li <lic121@chinatelecom.cn>
> Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
> Co-authored-by: Kevin Traynor <ktraynor@redhat.com>

I think it's a nice improvement, thanks for sending it. I reviewed and
tested a bit and the debug logs look good now [1].

Acked-by: Kevin Traynor <ktraynor@redhat.com>

[1]
2022-12-21T18:55:41Z|00335|dpif_netdev|DBG|Numa node 0. Current variance 225 Estimated variance 256. Variance improvement 0%.
2022-12-21T18:55:57Z|00336|dpif_netdev|DBG|Numa node 1. Current variance 0 Estimated variance 0. Variance improvement 0%.
2022-12-21T18:56:03Z|00337|dpif_netdev|DBG|PMD load variance improvement threshold 5% is not met.

> ---
>
> Notes:
>     v2:
>     - Commit msg update
>     - Update doc for per numa variance
>     - Reword variance improment log msg
>     - Not break numa variance loop for debug purpose
>
>  Documentation/topics/dpdk/pmd.rst |  8 ++--
>  lib/dpif-netdev.c                 | 85 +++++++++++++++++++--------------------
>  2 files changed, 46 insertions(+), 47 deletions(-)
>
> diff --git a/Documentation/topics/dpdk/pmd.rst b/Documentation/topics/dpdk/pmd.rst
> index b259cc8..c335757 100644
> --- a/Documentation/topics/dpdk/pmd.rst
> +++ b/Documentation/topics/dpdk/pmd.rst
> @@ -278,10 +278,10 @@ If a PMD core is detected to be above the load threshold and the minimum
>  pre-requisites are met, a dry-run using the current PMD assignment algorithm is
>  performed.
>
> -The current variance of load between the PMD cores and estimated variance from
> -the dry-run are both calculated. If the estimated dry-run variance is improved
> -from the current one by the variance threshold, a new Rx queue to PMD
> -assignment will be performed.
> +For each numa node, the current variance of load between the PMD cores and
> +estimated variance from the dry-run are both calculated. If any numa's
> +estimated dry-run variance is improved from the current one by the variance
> +threshold, a new Rx queue to PMD assignment will be performed.
>
>  For example, to set the variance improvement threshold to 40%::
>
> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
> index 2c08a71..7ff923b 100644
> --- a/lib/dpif-netdev.c
> +++ b/lib/dpif-netdev.c
> @@ -6076,39 +6076,33 @@ rxq_scheduling(struct dp_netdev *dp)
>  static uint64_t variance(uint64_t a[], int n);
>
>  static uint64_t
> -sched_numa_list_variance(struct sched_numa_list *numa_list)
> +sched_numa_variance(struct sched_numa *numa)
>  {
> -    struct sched_numa *numa;
>      uint64_t *percent_busy = NULL;
> -    unsigned total_pmds = 0;
>      int n_proc = 0;
>      uint64_t var;
>
> -    HMAP_FOR_EACH (numa, node, &numa_list->numas) {
> -        total_pmds += numa->n_pmds;
> -        percent_busy = xrealloc(percent_busy,
> -                                total_pmds * sizeof *percent_busy);
> +    percent_busy = xmalloc(numa->n_pmds * sizeof *percent_busy);
>
> -        for (unsigned i = 0; i < numa->n_pmds; i++) {
> -            struct sched_pmd *sched_pmd;
> -            uint64_t total_cycles = 0;
> +    for (unsigned i = 0; i < numa->n_pmds; i++) {
> +        struct sched_pmd *sched_pmd;
> +        uint64_t total_cycles = 0;
>
> -            sched_pmd = &numa->pmds[i];
> -            /* Exclude isolated PMDs from variance calculations. */
> -            if (sched_pmd->isolated == true) {
> -                continue;
> -            }
> -            /* Get the total pmd cycles for an interval. */
> -            atomic_read_relaxed(&sched_pmd->pmd->intrvl_cycles, &total_cycles);
> +        sched_pmd = &numa->pmds[i];
> +        /* Exclude isolated PMDs from variance calculations. */
> +        if (sched_pmd->isolated == true) {
> +            continue;
> +        }
> +        /* Get the total pmd cycles for an interval. */
> +        atomic_read_relaxed(&sched_pmd->pmd->intrvl_cycles, &total_cycles);
>
> -            if (total_cycles) {
> -                /* Estimate the cycles to cover all intervals. */
> -                total_cycles *= PMD_INTERVAL_MAX;
> -                percent_busy[n_proc++] = (sched_pmd->pmd_proc_cycles * 100)
> -                                             / total_cycles;
> -            } else {
> -                percent_busy[n_proc++] = 0;
> -            }
> +        if (total_cycles) {
> +            /* Estimate the cycles to cover all intervals. */
> +            total_cycles *= PMD_INTERVAL_MAX;
> +            percent_busy[n_proc++] = (sched_pmd->pmd_proc_cycles * 100)
> +                                         / total_cycles;
> +        } else {
> +            percent_busy[n_proc++] = 0;
>          }
>      }
>      var = variance(percent_busy, n_proc);
> @@ -6182,6 +6176,7 @@ pmd_rebalance_dry_run(struct dp_netdev *dp)
>      struct sched_numa_list numa_list_est;
>      bool thresh_met = false;
>      uint64_t current_var, estimate_var;
> +    struct sched_numa *numa_cur, *numa_est;
>      uint64_t improvement = 0;
>
>      VLOG_DBG("PMD auto load balance performing dry run.");
> @@ -6200,25 +6195,29 @@ pmd_rebalance_dry_run(struct dp_netdev *dp)
>          sched_numa_list_count(&numa_list_est) == 1) {
>
>          /* Calculate variances. */
> -        current_var = sched_numa_list_variance(&numa_list_cur);
> -        estimate_var = sched_numa_list_variance(&numa_list_est);
> -
> -        if (estimate_var < current_var) {
> -            improvement = ((current_var - estimate_var) * 100) / current_var;
> -        }
> -        VLOG_DBG("Current variance %"PRIu64" Estimated variance %"PRIu64".",
> -                 current_var, estimate_var);
> -        VLOG_DBG("Variance improvement %"PRIu64"%%.", improvement);
> -
> -        if (improvement >= dp->pmd_alb.rebalance_improve_thresh) {
> -            thresh_met = true;
> -            VLOG_DBG("PMD load variance improvement threshold %u%% "
> -                     "is met.", dp->pmd_alb.rebalance_improve_thresh);
> -        } else {
> -            VLOG_DBG("PMD load variance improvement threshold "
> -                     "%u%% is not met.",
> -                     dp->pmd_alb.rebalance_improve_thresh);
> +        HMAP_FOR_EACH (numa_cur, node, &numa_list_cur.numas) {
> +            numa_est = sched_numa_list_lookup(&numa_list_est,
> +                                              numa_cur->numa_id);
> +            if (!numa_est) {
> +                continue;
> +            }
> +            current_var = sched_numa_variance(numa_cur);
> +            estimate_var = sched_numa_variance(numa_est);
> +            if (estimate_var < current_var) {
> +                improvement = ((current_var - estimate_var) * 100)
> +                              / current_var;
> +            }
> +            VLOG_DBG("Numa node %d. Current variance %"PRIu64" Estimated "
> +                     "variance %"PRIu64". Variance improvement %"PRIu64"%%.",
> +                     numa_cur->numa_id, current_var,
> +                     estimate_var, improvement);
> +            if (improvement >= dp->pmd_alb.rebalance_improve_thresh) {
> +                thresh_met = true;
> +            }
>          }
> +        VLOG_DBG("PMD load variance improvement threshold %u%% is %s.",
> +                 dp->pmd_alb.rebalance_improve_thresh,
> +                 thresh_met ? "met" : "not met");
>      } else {
>          VLOG_DBG("PMD auto load balance detected cross-numa polling with "
>                   "multiple numa nodes. Unable to accurately estimate.");
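To make the "~100% per numa vs. only ~20% overall" figures discussed above concrete, here is a small standalone sketch that recomputes them from the kpps table in the commit message. It is illustrative only: the real dry run works on per-PMD percent-busy values rather than kpps, and `simple_variance()`/`improvement_pct()` below are simplified stand-ins for `variance()` and the improvement calculation in lib/dpif-netdev.c.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Mean of squared deviations from the mean, using integer math in the
 * same spirit as the variance helper in lib/dpif-netdev.c. */
static uint64_t
simple_variance(const uint64_t a[], int n)
{
    uint64_t mean = 0, var = 0;

    for (int i = 0; i < n; i++) {
        mean += a[i];
    }
    mean /= n;
    for (int i = 0; i < n; i++) {
        uint64_t d = a[i] > mean ? a[i] - mean : mean - a[i];
        var += d * d;
    }
    return var / n;
}

/* Percentage improvement of the estimated variance over the current one. */
static uint64_t
improvement_pct(uint64_t cur, uint64_t est)
{
    return est < cur ? ((cur - est) * 100) / cur : 0;
}

int
main(void)
{
    /* kpps per pmd from the commit message, before and after rebalance. */
    uint64_t all_cur[] = { 0, 0, 0, 0, 1500, 1000, 1000, 500 };
    uint64_t all_est[] = { 0, 0, 0, 0, 1000, 1000, 1000, 1000 };
    uint64_t numa1_cur[] = { 1500, 1000, 1000, 500 };
    uint64_t numa1_est[] = { 1000, 1000, 1000, 1000 };

    printf("overall improvement: %" PRIu64 "%%\n",
           improvement_pct(simple_variance(all_cur, 8),
                           simple_variance(all_est, 8)));   /* prints 20%  */
    printf("numa1 improvement:   %" PRIu64 "%%\n",
           improvement_pct(simple_variance(numa1_cur, 4),
                           simple_variance(numa1_est, 4))); /* prints 100% */
    return 0;
}
```

Built as an ordinary C program, it prints an overall improvement of 20% but a numa1 improvement of 100%, which is exactly the gap the per-numa calculation in this patch closes.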
On 12/21/22 20:36, Kevin Traynor wrote:
> Hi Cheng,
>
> On 17/12/2022 13:15, Cheng Li wrote:
>> Currently, pmd_rebalance_dry_run() calculate overall variance of
>> all pmds regardless of their numa location. The overall result may
>> hide un-balance in an individual numa.
>>
>> Considering the following case. Numa0 is free because VMs on numa0
>> are not sending pkts, while numa1 is busy. Within numa1, pmds
>> workloads are not balanced. Obviously, moving 500 kpps workloads from
>> pmd 126 to pmd 62 will make numa1 much more balance. For numa1
>> the variance improment will be almost 100%, because after rebalance
>
> typo: improvement - maybe maintainer can fix on apply and punctuate
> commit title ("dpif-netdev: Calculate per numa variance.") :-)

I have a git hook that capitalizes the first letter and adds a period
at the end for every patch that I'm applying. :)

>
>> each pmd in numa1 holds same workload(variance ~= 0). But the overall
>> variance improvement is only about 20%, which may not tigger auto_lb.
>>
>> ```
>> numa_id core_id kpps
>>       0      30    0
>>       0      31    0
>>       0      94    0
>>       0      95    0
>>       1     126 1500
>>       1     127 1000
>>       1      63 1000
>>       1      62  500
>> ```
>>
>> As auto_lb doesn't balance workload across numa nodes. So it makes
>> more sense to calculate variance improvemnet per numa node.
>>
>> Signed-off-by: Cheng Li <lic121@chinatelecom.cn>
>> Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
>> Co-authored-by: Kevin Traynor <ktraynor@redhat.com>
>
> I think it's a nice improvement, thanks for sending it. I reviewed and
> tested a bit and the debug logs look good now [1].
>
> Acked-by: Kevin Traynor <ktraynor@redhat.com>

Applied.  Thanks!

Best regards,
Ilya Maximets.
diff --git a/Documentation/topics/dpdk/pmd.rst b/Documentation/topics/dpdk/pmd.rst
index b259cc8..c335757 100644
--- a/Documentation/topics/dpdk/pmd.rst
+++ b/Documentation/topics/dpdk/pmd.rst
@@ -278,10 +278,10 @@ If a PMD core is detected to be above the load threshold and the minimum
 pre-requisites are met, a dry-run using the current PMD assignment algorithm is
 performed.
 
-The current variance of load between the PMD cores and estimated variance from
-the dry-run are both calculated. If the estimated dry-run variance is improved
-from the current one by the variance threshold, a new Rx queue to PMD
-assignment will be performed.
+For each numa node, the current variance of load between the PMD cores and
+estimated variance from the dry-run are both calculated. If any numa's
+estimated dry-run variance is improved from the current one by the variance
+threshold, a new Rx queue to PMD assignment will be performed.
 
 For example, to set the variance improvement threshold to 40%::
 
diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index 2c08a71..7ff923b 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -6076,39 +6076,33 @@ rxq_scheduling(struct dp_netdev *dp)
 static uint64_t variance(uint64_t a[], int n);
 
 static uint64_t
-sched_numa_list_variance(struct sched_numa_list *numa_list)
+sched_numa_variance(struct sched_numa *numa)
 {
-    struct sched_numa *numa;
     uint64_t *percent_busy = NULL;
-    unsigned total_pmds = 0;
     int n_proc = 0;
     uint64_t var;
 
-    HMAP_FOR_EACH (numa, node, &numa_list->numas) {
-        total_pmds += numa->n_pmds;
-        percent_busy = xrealloc(percent_busy,
-                                total_pmds * sizeof *percent_busy);
+    percent_busy = xmalloc(numa->n_pmds * sizeof *percent_busy);
 
-        for (unsigned i = 0; i < numa->n_pmds; i++) {
-            struct sched_pmd *sched_pmd;
-            uint64_t total_cycles = 0;
+    for (unsigned i = 0; i < numa->n_pmds; i++) {
+        struct sched_pmd *sched_pmd;
+        uint64_t total_cycles = 0;
 
-            sched_pmd = &numa->pmds[i];
-            /* Exclude isolated PMDs from variance calculations. */
-            if (sched_pmd->isolated == true) {
-                continue;
-            }
-            /* Get the total pmd cycles for an interval. */
-            atomic_read_relaxed(&sched_pmd->pmd->intrvl_cycles, &total_cycles);
+        sched_pmd = &numa->pmds[i];
+        /* Exclude isolated PMDs from variance calculations. */
+        if (sched_pmd->isolated == true) {
+            continue;
+        }
+        /* Get the total pmd cycles for an interval. */
+        atomic_read_relaxed(&sched_pmd->pmd->intrvl_cycles, &total_cycles);
 
-            if (total_cycles) {
-                /* Estimate the cycles to cover all intervals. */
-                total_cycles *= PMD_INTERVAL_MAX;
-                percent_busy[n_proc++] = (sched_pmd->pmd_proc_cycles * 100)
-                                             / total_cycles;
-            } else {
-                percent_busy[n_proc++] = 0;
-            }
+        if (total_cycles) {
+            /* Estimate the cycles to cover all intervals. */
+            total_cycles *= PMD_INTERVAL_MAX;
+            percent_busy[n_proc++] = (sched_pmd->pmd_proc_cycles * 100)
+                                         / total_cycles;
+        } else {
+            percent_busy[n_proc++] = 0;
        }
     }
     var = variance(percent_busy, n_proc);
@@ -6182,6 +6176,7 @@ pmd_rebalance_dry_run(struct dp_netdev *dp)
     struct sched_numa_list numa_list_est;
     bool thresh_met = false;
     uint64_t current_var, estimate_var;
+    struct sched_numa *numa_cur, *numa_est;
     uint64_t improvement = 0;
 
     VLOG_DBG("PMD auto load balance performing dry run.");
@@ -6200,25 +6195,29 @@ pmd_rebalance_dry_run(struct dp_netdev *dp)
         sched_numa_list_count(&numa_list_est) == 1) {
 
         /* Calculate variances. */
-        current_var = sched_numa_list_variance(&numa_list_cur);
-        estimate_var = sched_numa_list_variance(&numa_list_est);
-
-        if (estimate_var < current_var) {
-            improvement = ((current_var - estimate_var) * 100) / current_var;
-        }
-        VLOG_DBG("Current variance %"PRIu64" Estimated variance %"PRIu64".",
-                 current_var, estimate_var);
-        VLOG_DBG("Variance improvement %"PRIu64"%%.", improvement);
-
-        if (improvement >= dp->pmd_alb.rebalance_improve_thresh) {
-            thresh_met = true;
-            VLOG_DBG("PMD load variance improvement threshold %u%% "
-                     "is met.", dp->pmd_alb.rebalance_improve_thresh);
-        } else {
-            VLOG_DBG("PMD load variance improvement threshold "
-                     "%u%% is not met.",
-                     dp->pmd_alb.rebalance_improve_thresh);
+        HMAP_FOR_EACH (numa_cur, node, &numa_list_cur.numas) {
+            numa_est = sched_numa_list_lookup(&numa_list_est,
+                                              numa_cur->numa_id);
+            if (!numa_est) {
+                continue;
+            }
+            current_var = sched_numa_variance(numa_cur);
+            estimate_var = sched_numa_variance(numa_est);
+            if (estimate_var < current_var) {
+                improvement = ((current_var - estimate_var) * 100)
+                              / current_var;
+            }
+            VLOG_DBG("Numa node %d. Current variance %"PRIu64" Estimated "
+                     "variance %"PRIu64". Variance improvement %"PRIu64"%%.",
+                     numa_cur->numa_id, current_var,
+                     estimate_var, improvement);
+            if (improvement >= dp->pmd_alb.rebalance_improve_thresh) {
+                thresh_met = true;
+            }
        }
+        VLOG_DBG("PMD load variance improvement threshold %u%% is %s.",
+                 dp->pmd_alb.rebalance_improve_thresh,
+                 thresh_met ? "met" : "not met");
    } else {
        VLOG_DBG("PMD auto load balance detected cross-numa polling with "
                 "multiple numa nodes. Unable to accurately estimate.");
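As the pmd.rst change above describes, the dry run now triggers a reassignment as soon as any single numa node clears the improvement threshold. A minimal sketch of that decision rule, assuming per-numa current and estimated variances have already been computed (the helper name and signature here are illustrative, not OVS API):

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative only: returns true if any numa node's estimated variance
 * improves on its current variance by at least 'thresh' percent.  This
 * mirrors the per-numa loop added to pmd_rebalance_dry_run(). */
static bool
any_numa_meets_threshold(const uint64_t cur_var[], const uint64_t est_var[],
                         int n_numas, unsigned thresh)
{
    for (int i = 0; i < n_numas; i++) {
        uint64_t improvement = 0;

        if (est_var[i] < cur_var[i]) {
            improvement = ((cur_var[i] - est_var[i]) * 100) / cur_var[i];
        }
        if (improvement >= thresh) {
            return true;
        }
    }
    return false;
}
```

For the commit-message example, numa 0 shows no improvement while numa 1 improves by roughly 100%, so a 25% threshold is met through numa 1 alone, even though the pooled improvement across both nodes is only about 20%.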