Field | Value
---|---
Message ID | 1670152577-4830-1-git-send-email-lic121@chinatelecom.cn
State | Changes Requested
Series | [ovs-dev] dpif-netdev: calculate per numa variance

Context | Check | Description
---|---|---
ovsrobot/apply-robot | success | apply and check: success
ovsrobot/github-robot-_Build_and_Test | success | github build: passed
On 04/12/2022 11:16, Cheng Li wrote:
> Currently, pmd_rebalance_dry_run() calculate overall variance of
> all pmds regardless of their numa location. The overall result may
> hide un-balance in an individual numa.
>

Hi. Thanks, the idea looks a good one. I didn't test yet, couple of
comments on the code below.

> Considering the following case. Numa 0 is free because VMs on numa0
> are not sending pkts, while numa 1 is busy. Within numa 1, pmds
> workloads are not balanced. Obviously, moving 500 kpps workloads from
> pmd 126 to pmd 61 will make numa1 much more balance. For numa1
> the variance improment will be almost 100%, because after rebalance
> each pmd in numa1 holds same workload(variance ~= 0). But the overall
> variance improvement is only about 20%, which may not tigger auto_lb.
>
> ```
> numa_id core_id kpps
>       0      30    0
>       0      31    0
>       0      94    0
>       0      95    0
>       1     126 1500
>       1     127 1000
>       1      63 1000
>       1      62  500
> ```
>
> As auto_lb doesn't work if any coss_numa rxq exists, it means that

That's not fully true. It can run with cross numa in a very limited
circumstances where there is only PMD active on 1 numa.

It doesn't diminish the idea here, but just best not write that
blanket statement as it may confuse.

> auto_lb only balance rxq assignment within each numa. So it makes
> more sense to calculate variance improvemnet per numa node.
>
> Signed-off-by: Cheng Li <lic121@chinatelecom.cn>
> ---
>  lib/dpif-netdev.c | 90 +++++++++++++++++++++++++++++--------------------------
>  1 file changed, 47 insertions(+), 43 deletions(-)
>

I think at least some docs would need to be updated to say it's
variance per NUMA.
> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c > index 2c08a71..6a53f13 100644 > --- a/lib/dpif-netdev.c > +++ b/lib/dpif-netdev.c > @@ -6076,39 +6076,33 @@ rxq_scheduling(struct dp_netdev *dp) > static uint64_t variance(uint64_t a[], int n); > > static uint64_t > -sched_numa_list_variance(struct sched_numa_list *numa_list) > +sched_numa_variance(struct sched_numa *numa) > { > - struct sched_numa *numa; > uint64_t *percent_busy = NULL; > - unsigned total_pmds = 0; > int n_proc = 0; > uint64_t var; > > - HMAP_FOR_EACH (numa, node, &numa_list->numas) { > - total_pmds += numa->n_pmds; > - percent_busy = xrealloc(percent_busy, > - total_pmds * sizeof *percent_busy); > + percent_busy = xmalloc(numa->n_pmds * sizeof *percent_busy); > > - for (unsigned i = 0; i < numa->n_pmds; i++) { > - struct sched_pmd *sched_pmd; > - uint64_t total_cycles = 0; > + for (unsigned i = 0; i < numa->n_pmds; i++) { > + struct sched_pmd *sched_pmd; > + uint64_t total_cycles = 0; > > - sched_pmd = &numa->pmds[i]; > - /* Exclude isolated PMDs from variance calculations. */ > - if (sched_pmd->isolated == true) { > - continue; > - } > - /* Get the total pmd cycles for an interval. */ > - atomic_read_relaxed(&sched_pmd->pmd->intrvl_cycles, &total_cycles); > + sched_pmd = &numa->pmds[i]; > + /* Exclude isolated PMDs from variance calculations. */ > + if (sched_pmd->isolated == true) { > + continue; > + } > + /* Get the total pmd cycles for an interval. */ > + atomic_read_relaxed(&sched_pmd->pmd->intrvl_cycles, &total_cycles); > > - if (total_cycles) { > - /* Estimate the cycles to cover all intervals. */ > - total_cycles *= PMD_INTERVAL_MAX; > - percent_busy[n_proc++] = (sched_pmd->pmd_proc_cycles * 100) > - / total_cycles; > - } else { > - percent_busy[n_proc++] = 0; > - } > + if (total_cycles) { > + /* Estimate the cycles to cover all intervals. 
*/ > + total_cycles *= PMD_INTERVAL_MAX; > + percent_busy[n_proc++] = (sched_pmd->pmd_proc_cycles * 100) > + / total_cycles; > + } else { > + percent_busy[n_proc++] = 0; > } > } > var = variance(percent_busy, n_proc); > @@ -6182,6 +6176,7 @@ pmd_rebalance_dry_run(struct dp_netdev *dp) > struct sched_numa_list numa_list_est; > bool thresh_met = false; > uint64_t current_var, estimate_var; > + struct sched_numa *numa_cur, *numa_est; > uint64_t improvement = 0; > > VLOG_DBG("PMD auto load balance performing dry run."); > @@ -6200,24 +6195,33 @@ pmd_rebalance_dry_run(struct dp_netdev *dp) > sched_numa_list_count(&numa_list_est) == 1) { > > /* Calculate variances. */ > - current_var = sched_numa_list_variance(&numa_list_cur); > - estimate_var = sched_numa_list_variance(&numa_list_est); > - > - if (estimate_var < current_var) { > - improvement = ((current_var - estimate_var) * 100) / current_var; > - } > - VLOG_DBG("Current variance %"PRIu64" Estimated variance %"PRIu64".", > - current_var, estimate_var); > - VLOG_DBG("Variance improvement %"PRIu64"%%.", improvement); > - > - if (improvement >= dp->pmd_alb.rebalance_improve_thresh) { > - thresh_met = true; > - VLOG_DBG("PMD load variance improvement threshold %u%% " > - "is met.", dp->pmd_alb.rebalance_improve_thresh); > - } else { > - VLOG_DBG("PMD load variance improvement threshold " > - "%u%% is not met.", > - dp->pmd_alb.rebalance_improve_thresh); > + HMAP_FOR_EACH (numa_cur, node, &numa_list_cur.numas) { > + numa_est = sched_numa_list_lookup(&numa_list_est, > + numa_cur->numa_id); > + if (!numa_est) { > + continue; > + } > + current_var = sched_numa_variance(numa_cur); > + estimate_var = sched_numa_variance(numa_est); > + if (estimate_var < current_var) { > + improvement = ((current_var - estimate_var) * 100) > + / current_var; > + } > + VLOG_DBG("Numa node %d. Variance improvement %"PRIu64"%%. 
Current" > + " variance %"PRIu64" Estimated variance %"PRIu64".", > + numa_cur->numa_id, improvement, > + current_var, estimate_var); Not sure the reason for changing the logging order? Just the NUMA node needs to be added. > + > + if (improvement >= dp->pmd_alb.rebalance_improve_thresh) { > + VLOG_DBG("PMD load variance improvement threshold %u%% " > + "is met.", dp->pmd_alb.rebalance_improve_thresh); > + thresh_met = true; > + break; I think it's better to remove the break here and move the "result" logging outside the for loop. Just set the thresh_met = true at this point. So then the logging looks like. "PMD auto load balance dry run" "Numa node x. Current y...Estime z...Improve n" "Numa node x. Current y...Estime z...Improve n" "PMD load improvement of m is met." or "...not met" For debug it would be important to continue the run on the other NUMA even if the threshold is already met, so removing the break; would allow this. > + } else { > + VLOG_DBG("PMD load variance improvement threshold " > + "%u%% is not met.", > + dp->pmd_alb.rebalance_improve_thresh); > + } > } > } else { > VLOG_DBG("PMD auto load balance detected cross-numa polling with "
On Fri, Dec 09, 2022 at 03:32:06PM +0000, Kevin Traynor wrote:
> On 04/12/2022 11:16, Cheng Li wrote:
> > Currently, pmd_rebalance_dry_run() calculate overall variance of
> > all pmds regardless of their numa location. The overall result may
> > hide un-balance in an individual numa.
> >
>
> Hi. Thanks, the idea looks a good one. I didn't test yet, couple of
> comments on the code below.

Hi Kevin, thanks for the review.

>
> > Considering the following case. Numa 0 is free because VMs on numa0
> > are not sending pkts, while numa 1 is busy. Within numa 1, pmds
> > workloads are not balanced. Obviously, moving 500 kpps workloads from
> > pmd 126 to pmd 61 will make numa1 much more balance. For numa1
> > the variance improment will be almost 100%, because after rebalance
> > each pmd in numa1 holds same workload(variance ~= 0). But the overall
> > variance improvement is only about 20%, which may not tigger auto_lb.
> >
> > ```
> > numa_id core_id kpps
> >       0      30    0
> >       0      31    0
> >       0      94    0
> >       0      95    0
> >       1     126 1500
> >       1     127 1000
> >       1      63 1000
> >       1      62  500
> > ```
> >
> > As auto_lb doesn't work if any coss_numa rxq exists, it means that
>
> That's not fully true. It can run with cross numa in a very limited
> circumstances where there is only PMD active on 1 numa.
>
> It doesn't diminish the idea here, but just best not write that
> blanket statement as it may confuse.

Makes sense, I will reword the description.

>
> > auto_lb only balance rxq assignment within each numa. So it makes
> > more sense to calculate variance improvemnet per numa node.
> >
> > Signed-off-by: Cheng Li <lic121@chinatelecom.cn>
> > ---
> >  lib/dpif-netdev.c | 90 +++++++++++++++++++++++++++++--------------------------
> >  1 file changed, 47 insertions(+), 43 deletions(-)
> >
>
> I think at least some docs would need to be updated to say it's
> variance per NUMA.

After going through the doc directory, I found two places:
1. Documentation/topics/dpdk/pmd.rst
2.
NEWS > > >diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c > >index 2c08a71..6a53f13 100644 > >--- a/lib/dpif-netdev.c > >+++ b/lib/dpif-netdev.c > >@@ -6076,39 +6076,33 @@ rxq_scheduling(struct dp_netdev *dp) > > static uint64_t variance(uint64_t a[], int n); > > static uint64_t > >-sched_numa_list_variance(struct sched_numa_list *numa_list) > >+sched_numa_variance(struct sched_numa *numa) > > { > >- struct sched_numa *numa; > > uint64_t *percent_busy = NULL; > >- unsigned total_pmds = 0; > > int n_proc = 0; > > uint64_t var; > >- HMAP_FOR_EACH (numa, node, &numa_list->numas) { > >- total_pmds += numa->n_pmds; > >- percent_busy = xrealloc(percent_busy, > >- total_pmds * sizeof *percent_busy); > >+ percent_busy = xmalloc(numa->n_pmds * sizeof *percent_busy); > >- for (unsigned i = 0; i < numa->n_pmds; i++) { > >- struct sched_pmd *sched_pmd; > >- uint64_t total_cycles = 0; > >+ for (unsigned i = 0; i < numa->n_pmds; i++) { > >+ struct sched_pmd *sched_pmd; > >+ uint64_t total_cycles = 0; > >- sched_pmd = &numa->pmds[i]; > >- /* Exclude isolated PMDs from variance calculations. */ > >- if (sched_pmd->isolated == true) { > >- continue; > >- } > >- /* Get the total pmd cycles for an interval. */ > >- atomic_read_relaxed(&sched_pmd->pmd->intrvl_cycles, &total_cycles); > >+ sched_pmd = &numa->pmds[i]; > >+ /* Exclude isolated PMDs from variance calculations. */ > >+ if (sched_pmd->isolated == true) { > >+ continue; > >+ } > >+ /* Get the total pmd cycles for an interval. */ > >+ atomic_read_relaxed(&sched_pmd->pmd->intrvl_cycles, &total_cycles); > >- if (total_cycles) { > >- /* Estimate the cycles to cover all intervals. */ > >- total_cycles *= PMD_INTERVAL_MAX; > >- percent_busy[n_proc++] = (sched_pmd->pmd_proc_cycles * 100) > >- / total_cycles; > >- } else { > >- percent_busy[n_proc++] = 0; > >- } > >+ if (total_cycles) { > >+ /* Estimate the cycles to cover all intervals. 
*/ > >+ total_cycles *= PMD_INTERVAL_MAX; > >+ percent_busy[n_proc++] = (sched_pmd->pmd_proc_cycles * 100) > >+ / total_cycles; > >+ } else { > >+ percent_busy[n_proc++] = 0; > > } > > } > > var = variance(percent_busy, n_proc); > >@@ -6182,6 +6176,7 @@ pmd_rebalance_dry_run(struct dp_netdev *dp) > > struct sched_numa_list numa_list_est; > > bool thresh_met = false; > > uint64_t current_var, estimate_var; > >+ struct sched_numa *numa_cur, *numa_est; > > uint64_t improvement = 0; > > VLOG_DBG("PMD auto load balance performing dry run."); > >@@ -6200,24 +6195,33 @@ pmd_rebalance_dry_run(struct dp_netdev *dp) > > sched_numa_list_count(&numa_list_est) == 1) { > > /* Calculate variances. */ > >- current_var = sched_numa_list_variance(&numa_list_cur); > >- estimate_var = sched_numa_list_variance(&numa_list_est); > >- > >- if (estimate_var < current_var) { > >- improvement = ((current_var - estimate_var) * 100) / current_var; > >- } > >- VLOG_DBG("Current variance %"PRIu64" Estimated variance %"PRIu64".", > >- current_var, estimate_var); > >- VLOG_DBG("Variance improvement %"PRIu64"%%.", improvement); > >- > >- if (improvement >= dp->pmd_alb.rebalance_improve_thresh) { > >- thresh_met = true; > >- VLOG_DBG("PMD load variance improvement threshold %u%% " > >- "is met.", dp->pmd_alb.rebalance_improve_thresh); > >- } else { > >- VLOG_DBG("PMD load variance improvement threshold " > >- "%u%% is not met.", > >- dp->pmd_alb.rebalance_improve_thresh); > >+ HMAP_FOR_EACH (numa_cur, node, &numa_list_cur.numas) { > >+ numa_est = sched_numa_list_lookup(&numa_list_est, > >+ numa_cur->numa_id); > >+ if (!numa_est) { > >+ continue; > >+ } > >+ current_var = sched_numa_variance(numa_cur); > >+ estimate_var = sched_numa_variance(numa_est); > >+ if (estimate_var < current_var) { > >+ improvement = ((current_var - estimate_var) * 100) > >+ / current_var; > >+ } > >+ VLOG_DBG("Numa node %d. Variance improvement %"PRIu64"%%. 
Current" > >+ " variance %"PRIu64" Estimated variance %"PRIu64".", > >+ numa_cur->numa_id, improvement, > >+ current_var, estimate_var); > > Not sure the reason for changing the logging order? Just the NUMA > node needs to be added. I didn't mean to update the logging order, just want to make the map from numa id to improvement/variance_value be clear. So I chose a simple way, which merges 'numa id', 'improvement' and 'variance value' into one log msg. You prefer variance value goes first, then improvement? like below. "Numa node <I>. Current variance <A> Estimated variance <B>. Variance improvement <P>." > > >+ > >+ if (improvement >= dp->pmd_alb.rebalance_improve_thresh) { > >+ VLOG_DBG("PMD load variance improvement threshold %u%% " > >+ "is met.", dp->pmd_alb.rebalance_improve_thresh); > >+ thresh_met = true; > >+ break; > > > I think it's better to remove the break here and move the "result" > logging outside the for loop. Just set the thresh_met = true at this > point. So then the logging looks like. > > "PMD auto load balance dry run" > "Numa node x. Current y...Estime z...Improve n" > "Numa node x. Current y...Estime z...Improve n" > "PMD load improvement of m is met." or "...not met" > > For debug it would be important to continue the run on the other > NUMA even if the threshold is already met, so removing the break; > would allow this. It makes sense, will update in v2. > > >+ } else { > >+ VLOG_DBG("PMD load variance improvement threshold " > >+ "%u%% is not met.", > >+ dp->pmd_alb.rebalance_improve_thresh); > >+ } > > } > > } else { > > VLOG_DBG("PMD auto load balance detected cross-numa polling with " >
On 10/12/2022 07:21, Cheng Li wrote:
> On Fri, Dec 09, 2022 at 03:32:06PM +0000, Kevin Traynor wrote:
>> On 04/12/2022 11:16, Cheng Li wrote:
>>> Currently, pmd_rebalance_dry_run() calculate overall variance of
>>> all pmds regardless of their numa location. The overall result may
>>> hide un-balance in an individual numa.
>>>
>>
>> Hi. Thanks, the idea looks a good one. I didn't test yet, couple of
>> comments on the code below.
>
> Hi Kevin, thanks for the review.
>
>>
>>> Considering the following case. Numa 0 is free because VMs on numa0
>>> are not sending pkts, while numa 1 is busy. Within numa 1, pmds
>>> workloads are not balanced. Obviously, moving 500 kpps workloads from
>>> pmd 126 to pmd 61 will make numa1 much more balance. For numa1
>>> the variance improment will be almost 100%, because after rebalance
>>> each pmd in numa1 holds same workload(variance ~= 0). But the overall
>>> variance improvement is only about 20%, which may not tigger auto_lb.
>>>
>>> ```
>>> numa_id core_id kpps
>>>       0      30    0
>>>       0      31    0
>>>       0      94    0
>>>       0      95    0
>>>       1     126 1500
>>>       1     127 1000
>>>       1      63 1000
>>>       1      62  500
>>> ```
>>>
>>> As auto_lb doesn't work if any coss_numa rxq exists, it means that
>>
>> That's not fully true. It can run with cross numa in a very limited
>> circumstances where there is only PMD active on 1 numa.
>>
>> It doesn't diminish the idea here, but just best not write that
>> blanket statement as it may confuse.
>
> Makes sense, I will reword the description.
>
>>
>>> auto_lb only balance rxq assignment within each numa. So it makes
>>> more sense to calculate variance improvemnet per numa node.
>>>
>>> Signed-off-by: Cheng Li <lic121@chinatelecom.cn>
>>> ---
>>>  lib/dpif-netdev.c | 90 +++++++++++++++++++++++++++++--------------------------
>>>  1 file changed, 47 insertions(+), 43 deletions(-)
>>>
>>
>> I think at least some docs would need to be updated to say it's
>> variance per NUMA.
>
> After going through the doc directory, I found two pleases:
> 1.
Documentation/topics/dpdk/pmd.rst Yeah, would be good to even mention it here. > 2. NEWS > Not sure it's really NEWS worthy as I'd consider it more an optimisation than new functionality, but I'm fine either way. >> >>> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c >>> index 2c08a71..6a53f13 100644 >>> --- a/lib/dpif-netdev.c >>> +++ b/lib/dpif-netdev.c >>> @@ -6076,39 +6076,33 @@ rxq_scheduling(struct dp_netdev *dp) >>> static uint64_t variance(uint64_t a[], int n); >>> static uint64_t >>> -sched_numa_list_variance(struct sched_numa_list *numa_list) >>> +sched_numa_variance(struct sched_numa *numa) >>> { >>> - struct sched_numa *numa; >>> uint64_t *percent_busy = NULL; >>> - unsigned total_pmds = 0; >>> int n_proc = 0; >>> uint64_t var; >>> - HMAP_FOR_EACH (numa, node, &numa_list->numas) { >>> - total_pmds += numa->n_pmds; >>> - percent_busy = xrealloc(percent_busy, >>> - total_pmds * sizeof *percent_busy); >>> + percent_busy = xmalloc(numa->n_pmds * sizeof *percent_busy); >>> - for (unsigned i = 0; i < numa->n_pmds; i++) { >>> - struct sched_pmd *sched_pmd; >>> - uint64_t total_cycles = 0; >>> + for (unsigned i = 0; i < numa->n_pmds; i++) { >>> + struct sched_pmd *sched_pmd; >>> + uint64_t total_cycles = 0; >>> - sched_pmd = &numa->pmds[i]; >>> - /* Exclude isolated PMDs from variance calculations. */ >>> - if (sched_pmd->isolated == true) { >>> - continue; >>> - } >>> - /* Get the total pmd cycles for an interval. */ >>> - atomic_read_relaxed(&sched_pmd->pmd->intrvl_cycles, &total_cycles); >>> + sched_pmd = &numa->pmds[i]; >>> + /* Exclude isolated PMDs from variance calculations. */ >>> + if (sched_pmd->isolated == true) { >>> + continue; >>> + } >>> + /* Get the total pmd cycles for an interval. */ >>> + atomic_read_relaxed(&sched_pmd->pmd->intrvl_cycles, &total_cycles); >>> - if (total_cycles) { >>> - /* Estimate the cycles to cover all intervals. 
*/ >>> - total_cycles *= PMD_INTERVAL_MAX; >>> - percent_busy[n_proc++] = (sched_pmd->pmd_proc_cycles * 100) >>> - / total_cycles; >>> - } else { >>> - percent_busy[n_proc++] = 0; >>> - } >>> + if (total_cycles) { >>> + /* Estimate the cycles to cover all intervals. */ >>> + total_cycles *= PMD_INTERVAL_MAX; >>> + percent_busy[n_proc++] = (sched_pmd->pmd_proc_cycles * 100) >>> + / total_cycles; >>> + } else { >>> + percent_busy[n_proc++] = 0; >>> } >>> } >>> var = variance(percent_busy, n_proc); >>> @@ -6182,6 +6176,7 @@ pmd_rebalance_dry_run(struct dp_netdev *dp) >>> struct sched_numa_list numa_list_est; >>> bool thresh_met = false; >>> uint64_t current_var, estimate_var; >>> + struct sched_numa *numa_cur, *numa_est; >>> uint64_t improvement = 0; >>> VLOG_DBG("PMD auto load balance performing dry run."); >>> @@ -6200,24 +6195,33 @@ pmd_rebalance_dry_run(struct dp_netdev *dp) >>> sched_numa_list_count(&numa_list_est) == 1) { >>> /* Calculate variances. */ >>> - current_var = sched_numa_list_variance(&numa_list_cur); >>> - estimate_var = sched_numa_list_variance(&numa_list_est); >>> - >>> - if (estimate_var < current_var) { >>> - improvement = ((current_var - estimate_var) * 100) / current_var; >>> - } >>> - VLOG_DBG("Current variance %"PRIu64" Estimated variance %"PRIu64".", >>> - current_var, estimate_var); >>> - VLOG_DBG("Variance improvement %"PRIu64"%%.", improvement); >>> - >>> - if (improvement >= dp->pmd_alb.rebalance_improve_thresh) { >>> - thresh_met = true; >>> - VLOG_DBG("PMD load variance improvement threshold %u%% " >>> - "is met.", dp->pmd_alb.rebalance_improve_thresh); >>> - } else { >>> - VLOG_DBG("PMD load variance improvement threshold " >>> - "%u%% is not met.", >>> - dp->pmd_alb.rebalance_improve_thresh); >>> + HMAP_FOR_EACH (numa_cur, node, &numa_list_cur.numas) { >>> + numa_est = sched_numa_list_lookup(&numa_list_est, >>> + numa_cur->numa_id); >>> + if (!numa_est) { >>> + continue; >>> + } >>> + current_var = sched_numa_variance(numa_cur); >>> 
+ estimate_var = sched_numa_variance(numa_est); >>> + if (estimate_var < current_var) { >>> + improvement = ((current_var - estimate_var) * 100) >>> + / current_var; >>> + } >>> + VLOG_DBG("Numa node %d. Variance improvement %"PRIu64"%%. Current" >>> + " variance %"PRIu64" Estimated variance %"PRIu64".", >>> + numa_cur->numa_id, improvement, >>> + current_var, estimate_var); >> >> Not sure the reason for changing the logging order? Just the NUMA >> node needs to be added. > > I didn't mean to update the logging order, just want to make the map > from numa id to improvement/variance_value be clear. So I chose a simple > way, which merges 'numa id', 'improvement' and 'variance value' into one > log msg. You prefer variance value goes first, then improvement? like > below. > "Numa node <I>. Current variance <A> Estimated variance <B>. Variance > improvement <P>." > Yes, because it gives the inputs first and then the result. I admit, it's just a personal preference and others may disagree. >> >>> + >>> + if (improvement >= dp->pmd_alb.rebalance_improve_thresh) { >>> + VLOG_DBG("PMD load variance improvement threshold %u%% " >>> + "is met.", dp->pmd_alb.rebalance_improve_thresh); >>> + thresh_met = true; >>> + break; >> >> >> I think it's better to remove the break here and move the "result" >> logging outside the for loop. Just set the thresh_met = true at this >> point. So then the logging looks like. >> >> "PMD auto load balance dry run" >> "Numa node x. Current y...Estime z...Improve n" >> "Numa node x. Current y...Estime z...Improve n" >> "PMD load improvement of m is met." or "...not met" >> >> For debug it would be important to continue the run on the other >> NUMA even if the threshold is already met, so removing the break; >> would allow this. > > It makes sense, will update in v2. > Example below, but feel free to do a different way. It would still print "not met" for the disallowed cross-numa case, but that seems ok. 
@@ -6214,12 +6214,5 @@ pmd_rebalance_dry_run(struct dp_netdev *dp)

             if (improvement >= dp->pmd_alb.rebalance_improve_thresh) {
-                VLOG_DBG("PMD load variance improvement threshold %u%% "
-                         "is met.", dp->pmd_alb.rebalance_improve_thresh);
                 thresh_met = true;
-                break;
-            } else {
-                VLOG_DBG("PMD load variance improvement threshold "
-                         "%u%% is not met.",
-                         dp->pmd_alb.rebalance_improve_thresh);
             }
         }
@@ -6229,4 +6222,8 @@ pmd_rebalance_dry_run(struct dp_netdev *dp)
     }

+    VLOG_DBG("PMD load variance improvement threshold %u%% "
+             "is %s.", dp->pmd_alb.rebalance_improve_thresh,
+             thresh_met ? "met" : "not met");
+
     sched_numa_list_free_entries(&numa_list_cur);
     sched_numa_list_free_entries(&numa_list_est);

>>
>>> +            } else {
>>> +                VLOG_DBG("PMD load variance improvement threshold "
>>> +                         "%u%% is not met.",
>>> +                         dp->pmd_alb.rebalance_improve_thresh);
>>> +            }
>>>         }
>>>     } else {
>>>         VLOG_DBG("PMD auto load balance detected cross-numa polling with "
>>
>
diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index 2c08a71..6a53f13 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -6076,39 +6076,33 @@ rxq_scheduling(struct dp_netdev *dp)
 static uint64_t variance(uint64_t a[], int n);

 static uint64_t
-sched_numa_list_variance(struct sched_numa_list *numa_list)
+sched_numa_variance(struct sched_numa *numa)
 {
-    struct sched_numa *numa;
     uint64_t *percent_busy = NULL;
-    unsigned total_pmds = 0;
     int n_proc = 0;
     uint64_t var;

-    HMAP_FOR_EACH (numa, node, &numa_list->numas) {
-        total_pmds += numa->n_pmds;
-        percent_busy = xrealloc(percent_busy,
-                                total_pmds * sizeof *percent_busy);
+    percent_busy = xmalloc(numa->n_pmds * sizeof *percent_busy);

-        for (unsigned i = 0; i < numa->n_pmds; i++) {
-            struct sched_pmd *sched_pmd;
-            uint64_t total_cycles = 0;
+    for (unsigned i = 0; i < numa->n_pmds; i++) {
+        struct sched_pmd *sched_pmd;
+        uint64_t total_cycles = 0;

-            sched_pmd = &numa->pmds[i];
-            /* Exclude isolated PMDs from variance calculations. */
-            if (sched_pmd->isolated == true) {
-                continue;
-            }
-            /* Get the total pmd cycles for an interval. */
-            atomic_read_relaxed(&sched_pmd->pmd->intrvl_cycles, &total_cycles);
+        sched_pmd = &numa->pmds[i];
+        /* Exclude isolated PMDs from variance calculations. */
+        if (sched_pmd->isolated == true) {
+            continue;
+        }
+        /* Get the total pmd cycles for an interval. */
+        atomic_read_relaxed(&sched_pmd->pmd->intrvl_cycles, &total_cycles);

-            if (total_cycles) {
-                /* Estimate the cycles to cover all intervals. */
-                total_cycles *= PMD_INTERVAL_MAX;
-                percent_busy[n_proc++] = (sched_pmd->pmd_proc_cycles * 100)
-                                             / total_cycles;
-            } else {
-                percent_busy[n_proc++] = 0;
-            }
+        if (total_cycles) {
+            /* Estimate the cycles to cover all intervals. */
+            total_cycles *= PMD_INTERVAL_MAX;
+            percent_busy[n_proc++] = (sched_pmd->pmd_proc_cycles * 100)
+                                         / total_cycles;
+        } else {
+            percent_busy[n_proc++] = 0;
         }
     }
     var = variance(percent_busy, n_proc);
@@ -6182,6 +6176,7 @@ pmd_rebalance_dry_run(struct dp_netdev *dp)
     struct sched_numa_list numa_list_est;
     bool thresh_met = false;
     uint64_t current_var, estimate_var;
+    struct sched_numa *numa_cur, *numa_est;
     uint64_t improvement = 0;

     VLOG_DBG("PMD auto load balance performing dry run.");
@@ -6200,24 +6195,33 @@ pmd_rebalance_dry_run(struct dp_netdev *dp)
         sched_numa_list_count(&numa_list_est) == 1) {

         /* Calculate variances. */
-        current_var = sched_numa_list_variance(&numa_list_cur);
-        estimate_var = sched_numa_list_variance(&numa_list_est);
-
-        if (estimate_var < current_var) {
-            improvement = ((current_var - estimate_var) * 100) / current_var;
-        }
-        VLOG_DBG("Current variance %"PRIu64" Estimated variance %"PRIu64".",
-                 current_var, estimate_var);
-        VLOG_DBG("Variance improvement %"PRIu64"%%.", improvement);
-
-        if (improvement >= dp->pmd_alb.rebalance_improve_thresh) {
-            thresh_met = true;
-            VLOG_DBG("PMD load variance improvement threshold %u%% "
-                     "is met.", dp->pmd_alb.rebalance_improve_thresh);
-        } else {
-            VLOG_DBG("PMD load variance improvement threshold "
-                     "%u%% is not met.",
-                     dp->pmd_alb.rebalance_improve_thresh);
+        HMAP_FOR_EACH (numa_cur, node, &numa_list_cur.numas) {
+            numa_est = sched_numa_list_lookup(&numa_list_est,
+                                              numa_cur->numa_id);
+            if (!numa_est) {
+                continue;
+            }
+            current_var = sched_numa_variance(numa_cur);
+            estimate_var = sched_numa_variance(numa_est);
+            if (estimate_var < current_var) {
+                improvement = ((current_var - estimate_var) * 100)
+                                  / current_var;
+            }
+            VLOG_DBG("Numa node %d. Variance improvement %"PRIu64"%%. Current"
+                     " variance %"PRIu64" Estimated variance %"PRIu64".",
+                     numa_cur->numa_id, improvement,
+                     current_var, estimate_var);
+
+            if (improvement >= dp->pmd_alb.rebalance_improve_thresh) {
+                VLOG_DBG("PMD load variance improvement threshold %u%% "
+                         "is met.", dp->pmd_alb.rebalance_improve_thresh);
+                thresh_met = true;
+                break;
+            } else {
+                VLOG_DBG("PMD load variance improvement threshold "
+                         "%u%% is not met.",
+                         dp->pmd_alb.rebalance_improve_thresh);
+            }
         }
     } else {
         VLOG_DBG("PMD auto load balance detected cross-numa polling with "
Currently, pmd_rebalance_dry_run() calculates the overall variance of all pmds regardless of their numa location. The overall result may hide an imbalance in an individual numa.

Consider the following case. Numa 0 is free because VMs on numa0 are not sending pkts, while numa 1 is busy. Within numa 1, pmd workloads are not balanced. Obviously, moving 500 kpps of workload from pmd 126 to pmd 62 will make numa1 much more balanced. For numa1 the variance improvement will be almost 100%, because after rebalancing each pmd in numa1 holds the same workload (variance ~= 0). But the overall variance improvement is only about 20%, which may not trigger auto_lb.

```
numa_id core_id kpps
      0      30    0
      0      31    0
      0      94    0
      0      95    0
      1     126 1500
      1     127 1000
      1      63 1000
      1      62  500
```

As auto_lb doesn't work if any cross_numa rxq exists, it means that auto_lb only balances rxq assignment within each numa. So it makes more sense to calculate the variance improvement per numa node.

Signed-off-by: Cheng Li <lic121@chinatelecom.cn>
---
 lib/dpif-netdev.c | 90 +++++++++++++++++++++++++++++--------------------------
 1 file changed, 47 insertions(+), 43 deletions(-)