Field | Value
---|---
Message ID | 1670152577-4830-1-git-send-email-lic121@chinatelecom.cn
State | Changes Requested
Series | [ovs-dev] dpif-netdev: calculate per numa variance

Context | Check | Description
---|---|---
ovsrobot/apply-robot | success | apply and check: success
ovsrobot/github-robot-_Build_and_Test | success | github build: passed
On 04/12/2022 11:16, Cheng Li wrote:
> Currently, pmd_rebalance_dry_run() calculate overall variance of
> all pmds regardless of their numa location. The overall result may
> hide un-balance in an individual numa.
>

Hi. Thanks, the idea looks a good one. I didn't test yet, couple of
comments on the code below.

> Considering the following case. Numa 0 is free because VMs on numa0
> are not sending pkts, while numa 1 is busy. Within numa 1, pmds
> workloads are not balanced. Obviously, moving 500 kpps workloads from
> pmd 126 to pmd 61 will make numa1 much more balance. For numa1
> the variance improment will be almost 100%, because after rebalance
> each pmd in numa1 holds same workload(variance ~= 0). But the overall
> variance improvement is only about 20%, which may not tigger auto_lb.
>
> ```
> numa_id core_id kpps
>       0      30    0
>       0      31    0
>       0      94    0
>       0      95    0
>       1     126 1500
>       1     127 1000
>       1      63 1000
>       1      62  500
> ```
>
> As auto_lb doesn't work if any coss_numa rxq exists, it means that

That's not fully true. It can run with cross numa in a very limited
circumstances where there is only PMD active on 1 numa.

It doesn't diminish the idea here, but just best not write that
blanket statement as it may confuse.

> auto_lb only balance rxq assignment within each numa. So it makes
> more sense to calculate variance improvemnet per numa node.
>
> Signed-off-by: Cheng Li <lic121@chinatelecom.cn>
> ---
>  lib/dpif-netdev.c | 90 +++++++++++++++++++++++++++++--------------------------
>  1 file changed, 47 insertions(+), 43 deletions(-)
>

I think at least some docs would need to be updated to say it's
variance per NUMA.
> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c > index 2c08a71..6a53f13 100644 > --- a/lib/dpif-netdev.c > +++ b/lib/dpif-netdev.c > @@ -6076,39 +6076,33 @@ rxq_scheduling(struct dp_netdev *dp) > static uint64_t variance(uint64_t a[], int n); > > static uint64_t > -sched_numa_list_variance(struct sched_numa_list *numa_list) > +sched_numa_variance(struct sched_numa *numa) > { > - struct sched_numa *numa; > uint64_t *percent_busy = NULL; > - unsigned total_pmds = 0; > int n_proc = 0; > uint64_t var; > > - HMAP_FOR_EACH (numa, node, &numa_list->numas) { > - total_pmds += numa->n_pmds; > - percent_busy = xrealloc(percent_busy, > - total_pmds * sizeof *percent_busy); > + percent_busy = xmalloc(numa->n_pmds * sizeof *percent_busy); > > - for (unsigned i = 0; i < numa->n_pmds; i++) { > - struct sched_pmd *sched_pmd; > - uint64_t total_cycles = 0; > + for (unsigned i = 0; i < numa->n_pmds; i++) { > + struct sched_pmd *sched_pmd; > + uint64_t total_cycles = 0; > > - sched_pmd = &numa->pmds[i]; > - /* Exclude isolated PMDs from variance calculations. */ > - if (sched_pmd->isolated == true) { > - continue; > - } > - /* Get the total pmd cycles for an interval. */ > - atomic_read_relaxed(&sched_pmd->pmd->intrvl_cycles, &total_cycles); > + sched_pmd = &numa->pmds[i]; > + /* Exclude isolated PMDs from variance calculations. */ > + if (sched_pmd->isolated == true) { > + continue; > + } > + /* Get the total pmd cycles for an interval. */ > + atomic_read_relaxed(&sched_pmd->pmd->intrvl_cycles, &total_cycles); > > - if (total_cycles) { > - /* Estimate the cycles to cover all intervals. */ > - total_cycles *= PMD_INTERVAL_MAX; > - percent_busy[n_proc++] = (sched_pmd->pmd_proc_cycles * 100) > - / total_cycles; > - } else { > - percent_busy[n_proc++] = 0; > - } > + if (total_cycles) { > + /* Estimate the cycles to cover all intervals. 
*/ > + total_cycles *= PMD_INTERVAL_MAX; > + percent_busy[n_proc++] = (sched_pmd->pmd_proc_cycles * 100) > + / total_cycles; > + } else { > + percent_busy[n_proc++] = 0; > } > } > var = variance(percent_busy, n_proc); > @@ -6182,6 +6176,7 @@ pmd_rebalance_dry_run(struct dp_netdev *dp) > struct sched_numa_list numa_list_est; > bool thresh_met = false; > uint64_t current_var, estimate_var; > + struct sched_numa *numa_cur, *numa_est; > uint64_t improvement = 0; > > VLOG_DBG("PMD auto load balance performing dry run."); > @@ -6200,24 +6195,33 @@ pmd_rebalance_dry_run(struct dp_netdev *dp) > sched_numa_list_count(&numa_list_est) == 1) { > > /* Calculate variances. */ > - current_var = sched_numa_list_variance(&numa_list_cur); > - estimate_var = sched_numa_list_variance(&numa_list_est); > - > - if (estimate_var < current_var) { > - improvement = ((current_var - estimate_var) * 100) / current_var; > - } > - VLOG_DBG("Current variance %"PRIu64" Estimated variance %"PRIu64".", > - current_var, estimate_var); > - VLOG_DBG("Variance improvement %"PRIu64"%%.", improvement); > - > - if (improvement >= dp->pmd_alb.rebalance_improve_thresh) { > - thresh_met = true; > - VLOG_DBG("PMD load variance improvement threshold %u%% " > - "is met.", dp->pmd_alb.rebalance_improve_thresh); > - } else { > - VLOG_DBG("PMD load variance improvement threshold " > - "%u%% is not met.", > - dp->pmd_alb.rebalance_improve_thresh); > + HMAP_FOR_EACH (numa_cur, node, &numa_list_cur.numas) { > + numa_est = sched_numa_list_lookup(&numa_list_est, > + numa_cur->numa_id); > + if (!numa_est) { > + continue; > + } > + current_var = sched_numa_variance(numa_cur); > + estimate_var = sched_numa_variance(numa_est); > + if (estimate_var < current_var) { > + improvement = ((current_var - estimate_var) * 100) > + / current_var; > + } > + VLOG_DBG("Numa node %d. Variance improvement %"PRIu64"%%. 
Current" > + " variance %"PRIu64" Estimated variance %"PRIu64".", > + numa_cur->numa_id, improvement, > + current_var, estimate_var); Not sure the reason for changing the logging order? Just the NUMA node needs to be added. > + > + if (improvement >= dp->pmd_alb.rebalance_improve_thresh) { > + VLOG_DBG("PMD load variance improvement threshold %u%% " > + "is met.", dp->pmd_alb.rebalance_improve_thresh); > + thresh_met = true; > + break; I think it's better to remove the break here and move the "result" logging outside the for loop. Just set the thresh_met = true at this point. So then the logging looks like. "PMD auto load balance dry run" "Numa node x. Current y...Estime z...Improve n" "Numa node x. Current y...Estime z...Improve n" "PMD load improvement of m is met." or "...not met" For debug it would be important to continue the run on the other NUMA even if the threshold is already met, so removing the break; would allow this. > + } else { > + VLOG_DBG("PMD load variance improvement threshold " > + "%u%% is not met.", > + dp->pmd_alb.rebalance_improve_thresh); > + } > } > } else { > VLOG_DBG("PMD auto load balance detected cross-numa polling with "
On Fri, Dec 09, 2022 at 03:32:06PM +0000, Kevin Traynor wrote:
> On 04/12/2022 11:16, Cheng Li wrote:
> > Currently, pmd_rebalance_dry_run() calculate overall variance of
> > all pmds regardless of their numa location. The overall result may
> > hide un-balance in an individual numa.
> >
>
> Hi. Thanks, the idea looks a good one. I didn't test yet, couple of
> comments on the code below.

Hi Kevin, thanks for the review.

>
> > Considering the following case. Numa 0 is free because VMs on numa0
> > are not sending pkts, while numa 1 is busy. Within numa 1, pmds
> > workloads are not balanced. Obviously, moving 500 kpps workloads from
> > pmd 126 to pmd 61 will make numa1 much more balance. For numa1
> > the variance improment will be almost 100%, because after rebalance
> > each pmd in numa1 holds same workload(variance ~= 0). But the overall
> > variance improvement is only about 20%, which may not tigger auto_lb.
> >
> > ```
> > numa_id core_id kpps
> >       0      30    0
> >       0      31    0
> >       0      94    0
> >       0      95    0
> >       1     126 1500
> >       1     127 1000
> >       1      63 1000
> >       1      62  500
> > ```
> >
> > As auto_lb doesn't work if any coss_numa rxq exists, it means that
>
> That's not fully true. It can run with cross numa in a very limited
> circumstances where there is only PMD active on 1 numa.
>
> It doesn't diminish the idea here, but just best not write that
> blanket statement as it may confuse.

Makes sense, I will reword the description.

>
> > auto_lb only balance rxq assignment within each numa. So it makes
> > more sense to calculate variance improvemnet per numa node.
> >
> > Signed-off-by: Cheng Li <lic121@chinatelecom.cn>
> > ---
> >  lib/dpif-netdev.c | 90 +++++++++++++++++++++++++++++--------------------------
> >  1 file changed, 47 insertions(+), 43 deletions(-)
> >
>
> I think at least some docs would need to be updated to say it's
> variance per NUMA.

After going through the doc directory, I found two places:
1. Documentation/topics/dpdk/pmd.rst
2.
NEWS > > >diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c > >index 2c08a71..6a53f13 100644 > >--- a/lib/dpif-netdev.c > >+++ b/lib/dpif-netdev.c > >@@ -6076,39 +6076,33 @@ rxq_scheduling(struct dp_netdev *dp) > > static uint64_t variance(uint64_t a[], int n); > > static uint64_t > >-sched_numa_list_variance(struct sched_numa_list *numa_list) > >+sched_numa_variance(struct sched_numa *numa) > > { > >- struct sched_numa *numa; > > uint64_t *percent_busy = NULL; > >- unsigned total_pmds = 0; > > int n_proc = 0; > > uint64_t var; > >- HMAP_FOR_EACH (numa, node, &numa_list->numas) { > >- total_pmds += numa->n_pmds; > >- percent_busy = xrealloc(percent_busy, > >- total_pmds * sizeof *percent_busy); > >+ percent_busy = xmalloc(numa->n_pmds * sizeof *percent_busy); > >- for (unsigned i = 0; i < numa->n_pmds; i++) { > >- struct sched_pmd *sched_pmd; > >- uint64_t total_cycles = 0; > >+ for (unsigned i = 0; i < numa->n_pmds; i++) { > >+ struct sched_pmd *sched_pmd; > >+ uint64_t total_cycles = 0; > >- sched_pmd = &numa->pmds[i]; > >- /* Exclude isolated PMDs from variance calculations. */ > >- if (sched_pmd->isolated == true) { > >- continue; > >- } > >- /* Get the total pmd cycles for an interval. */ > >- atomic_read_relaxed(&sched_pmd->pmd->intrvl_cycles, &total_cycles); > >+ sched_pmd = &numa->pmds[i]; > >+ /* Exclude isolated PMDs from variance calculations. */ > >+ if (sched_pmd->isolated == true) { > >+ continue; > >+ } > >+ /* Get the total pmd cycles for an interval. */ > >+ atomic_read_relaxed(&sched_pmd->pmd->intrvl_cycles, &total_cycles); > >- if (total_cycles) { > >- /* Estimate the cycles to cover all intervals. */ > >- total_cycles *= PMD_INTERVAL_MAX; > >- percent_busy[n_proc++] = (sched_pmd->pmd_proc_cycles * 100) > >- / total_cycles; > >- } else { > >- percent_busy[n_proc++] = 0; > >- } > >+ if (total_cycles) { > >+ /* Estimate the cycles to cover all intervals. 
*/ > >+ total_cycles *= PMD_INTERVAL_MAX; > >+ percent_busy[n_proc++] = (sched_pmd->pmd_proc_cycles * 100) > >+ / total_cycles; > >+ } else { > >+ percent_busy[n_proc++] = 0; > > } > > } > > var = variance(percent_busy, n_proc); > >@@ -6182,6 +6176,7 @@ pmd_rebalance_dry_run(struct dp_netdev *dp) > > struct sched_numa_list numa_list_est; > > bool thresh_met = false; > > uint64_t current_var, estimate_var; > >+ struct sched_numa *numa_cur, *numa_est; > > uint64_t improvement = 0; > > VLOG_DBG("PMD auto load balance performing dry run."); > >@@ -6200,24 +6195,33 @@ pmd_rebalance_dry_run(struct dp_netdev *dp) > > sched_numa_list_count(&numa_list_est) == 1) { > > /* Calculate variances. */ > >- current_var = sched_numa_list_variance(&numa_list_cur); > >- estimate_var = sched_numa_list_variance(&numa_list_est); > >- > >- if (estimate_var < current_var) { > >- improvement = ((current_var - estimate_var) * 100) / current_var; > >- } > >- VLOG_DBG("Current variance %"PRIu64" Estimated variance %"PRIu64".", > >- current_var, estimate_var); > >- VLOG_DBG("Variance improvement %"PRIu64"%%.", improvement); > >- > >- if (improvement >= dp->pmd_alb.rebalance_improve_thresh) { > >- thresh_met = true; > >- VLOG_DBG("PMD load variance improvement threshold %u%% " > >- "is met.", dp->pmd_alb.rebalance_improve_thresh); > >- } else { > >- VLOG_DBG("PMD load variance improvement threshold " > >- "%u%% is not met.", > >- dp->pmd_alb.rebalance_improve_thresh); > >+ HMAP_FOR_EACH (numa_cur, node, &numa_list_cur.numas) { > >+ numa_est = sched_numa_list_lookup(&numa_list_est, > >+ numa_cur->numa_id); > >+ if (!numa_est) { > >+ continue; > >+ } > >+ current_var = sched_numa_variance(numa_cur); > >+ estimate_var = sched_numa_variance(numa_est); > >+ if (estimate_var < current_var) { > >+ improvement = ((current_var - estimate_var) * 100) > >+ / current_var; > >+ } > >+ VLOG_DBG("Numa node %d. Variance improvement %"PRIu64"%%. 
Current" > >+ " variance %"PRIu64" Estimated variance %"PRIu64".", > >+ numa_cur->numa_id, improvement, > >+ current_var, estimate_var); > > Not sure the reason for changing the logging order? Just the NUMA > node needs to be added. I didn't mean to update the logging order, just want to make the map from numa id to improvement/variance_value be clear. So I chose a simple way, which merges 'numa id', 'improvement' and 'variance value' into one log msg. You prefer variance value goes first, then improvement? like below. "Numa node <I>. Current variance <A> Estimated variance <B>. Variance improvement <P>." > > >+ > >+ if (improvement >= dp->pmd_alb.rebalance_improve_thresh) { > >+ VLOG_DBG("PMD load variance improvement threshold %u%% " > >+ "is met.", dp->pmd_alb.rebalance_improve_thresh); > >+ thresh_met = true; > >+ break; > > > I think it's better to remove the break here and move the "result" > logging outside the for loop. Just set the thresh_met = true at this > point. So then the logging looks like. > > "PMD auto load balance dry run" > "Numa node x. Current y...Estime z...Improve n" > "Numa node x. Current y...Estime z...Improve n" > "PMD load improvement of m is met." or "...not met" > > For debug it would be important to continue the run on the other > NUMA even if the threshold is already met, so removing the break; > would allow this. It makes sense, will update in v2. > > >+ } else { > >+ VLOG_DBG("PMD load variance improvement threshold " > >+ "%u%% is not met.", > >+ dp->pmd_alb.rebalance_improve_thresh); > >+ } > > } > > } else { > > VLOG_DBG("PMD auto load balance detected cross-numa polling with " >
On 10/12/2022 07:21, Cheng Li wrote:
> On Fri, Dec 09, 2022 at 03:32:06PM +0000, Kevin Traynor wrote:
>> On 04/12/2022 11:16, Cheng Li wrote:
>>> Currently, pmd_rebalance_dry_run() calculate overall variance of
>>> all pmds regardless of their numa location. The overall result may
>>> hide un-balance in an individual numa.
>>>
>>
>> Hi. Thanks, the idea looks a good one. I didn't test yet, couple of
>> comments on the code below.
>
> Hi Kevin, thanks for the review.
>
>>
>>> Considering the following case. Numa 0 is free because VMs on numa0
>>> are not sending pkts, while numa 1 is busy. Within numa 1, pmds
>>> workloads are not balanced. Obviously, moving 500 kpps workloads from
>>> pmd 126 to pmd 61 will make numa1 much more balance. For numa1
>>> the variance improment will be almost 100%, because after rebalance
>>> each pmd in numa1 holds same workload(variance ~= 0). But the overall
>>> variance improvement is only about 20%, which may not tigger auto_lb.
>>>
>>> ```
>>> numa_id core_id kpps
>>>       0      30    0
>>>       0      31    0
>>>       0      94    0
>>>       0      95    0
>>>       1     126 1500
>>>       1     127 1000
>>>       1      63 1000
>>>       1      62  500
>>> ```
>>>
>>> As auto_lb doesn't work if any coss_numa rxq exists, it means that
>>
>> That's not fully true. It can run with cross numa in a very limited
>> circumstances where there is only PMD active on 1 numa.
>>
>> It doesn't diminish the idea here, but just best not write that
>> blanket statement as it may confuse.
>
> Makes sense, I will reword the description.
>
>>
>>> auto_lb only balance rxq assignment within each numa. So it makes
>>> more sense to calculate variance improvemnet per numa node.
>>>
>>> Signed-off-by: Cheng Li <lic121@chinatelecom.cn>
>>> ---
>>>  lib/dpif-netdev.c | 90 +++++++++++++++++++++++++++++--------------------------
>>>  1 file changed, 47 insertions(+), 43 deletions(-)
>>>
>>
>> I think at least some docs would need to be updated to say it's
>> variance per NUMA.
>
> After going through the doc directory, I found two pleases:
> 1.
Documentation/topics/dpdk/pmd.rst Yeah, would be good to even mention it here. > 2. NEWS > Not sure it's really NEWS worthy as I'd consider it more an optimisation than new functionality, but I'm fine either way. >> >>> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c >>> index 2c08a71..6a53f13 100644 >>> --- a/lib/dpif-netdev.c >>> +++ b/lib/dpif-netdev.c >>> @@ -6076,39 +6076,33 @@ rxq_scheduling(struct dp_netdev *dp) >>> static uint64_t variance(uint64_t a[], int n); >>> static uint64_t >>> -sched_numa_list_variance(struct sched_numa_list *numa_list) >>> +sched_numa_variance(struct sched_numa *numa) >>> { >>> - struct sched_numa *numa; >>> uint64_t *percent_busy = NULL; >>> - unsigned total_pmds = 0; >>> int n_proc = 0; >>> uint64_t var; >>> - HMAP_FOR_EACH (numa, node, &numa_list->numas) { >>> - total_pmds += numa->n_pmds; >>> - percent_busy = xrealloc(percent_busy, >>> - total_pmds * sizeof *percent_busy); >>> + percent_busy = xmalloc(numa->n_pmds * sizeof *percent_busy); >>> - for (unsigned i = 0; i < numa->n_pmds; i++) { >>> - struct sched_pmd *sched_pmd; >>> - uint64_t total_cycles = 0; >>> + for (unsigned i = 0; i < numa->n_pmds; i++) { >>> + struct sched_pmd *sched_pmd; >>> + uint64_t total_cycles = 0; >>> - sched_pmd = &numa->pmds[i]; >>> - /* Exclude isolated PMDs from variance calculations. */ >>> - if (sched_pmd->isolated == true) { >>> - continue; >>> - } >>> - /* Get the total pmd cycles for an interval. */ >>> - atomic_read_relaxed(&sched_pmd->pmd->intrvl_cycles, &total_cycles); >>> + sched_pmd = &numa->pmds[i]; >>> + /* Exclude isolated PMDs from variance calculations. */ >>> + if (sched_pmd->isolated == true) { >>> + continue; >>> + } >>> + /* Get the total pmd cycles for an interval. */ >>> + atomic_read_relaxed(&sched_pmd->pmd->intrvl_cycles, &total_cycles); >>> - if (total_cycles) { >>> - /* Estimate the cycles to cover all intervals. 
*/ >>> - total_cycles *= PMD_INTERVAL_MAX; >>> - percent_busy[n_proc++] = (sched_pmd->pmd_proc_cycles * 100) >>> - / total_cycles; >>> - } else { >>> - percent_busy[n_proc++] = 0; >>> - } >>> + if (total_cycles) { >>> + /* Estimate the cycles to cover all intervals. */ >>> + total_cycles *= PMD_INTERVAL_MAX; >>> + percent_busy[n_proc++] = (sched_pmd->pmd_proc_cycles * 100) >>> + / total_cycles; >>> + } else { >>> + percent_busy[n_proc++] = 0; >>> } >>> } >>> var = variance(percent_busy, n_proc); >>> @@ -6182,6 +6176,7 @@ pmd_rebalance_dry_run(struct dp_netdev *dp) >>> struct sched_numa_list numa_list_est; >>> bool thresh_met = false; >>> uint64_t current_var, estimate_var; >>> + struct sched_numa *numa_cur, *numa_est; >>> uint64_t improvement = 0; >>> VLOG_DBG("PMD auto load balance performing dry run."); >>> @@ -6200,24 +6195,33 @@ pmd_rebalance_dry_run(struct dp_netdev *dp) >>> sched_numa_list_count(&numa_list_est) == 1) { >>> /* Calculate variances. */ >>> - current_var = sched_numa_list_variance(&numa_list_cur); >>> - estimate_var = sched_numa_list_variance(&numa_list_est); >>> - >>> - if (estimate_var < current_var) { >>> - improvement = ((current_var - estimate_var) * 100) / current_var; >>> - } >>> - VLOG_DBG("Current variance %"PRIu64" Estimated variance %"PRIu64".", >>> - current_var, estimate_var); >>> - VLOG_DBG("Variance improvement %"PRIu64"%%.", improvement); >>> - >>> - if (improvement >= dp->pmd_alb.rebalance_improve_thresh) { >>> - thresh_met = true; >>> - VLOG_DBG("PMD load variance improvement threshold %u%% " >>> - "is met.", dp->pmd_alb.rebalance_improve_thresh); >>> - } else { >>> - VLOG_DBG("PMD load variance improvement threshold " >>> - "%u%% is not met.", >>> - dp->pmd_alb.rebalance_improve_thresh); >>> + HMAP_FOR_EACH (numa_cur, node, &numa_list_cur.numas) { >>> + numa_est = sched_numa_list_lookup(&numa_list_est, >>> + numa_cur->numa_id); >>> + if (!numa_est) { >>> + continue; >>> + } >>> + current_var = sched_numa_variance(numa_cur); >>> 
+ estimate_var = sched_numa_variance(numa_est); >>> + if (estimate_var < current_var) { >>> + improvement = ((current_var - estimate_var) * 100) >>> + / current_var; >>> + } >>> + VLOG_DBG("Numa node %d. Variance improvement %"PRIu64"%%. Current" >>> + " variance %"PRIu64" Estimated variance %"PRIu64".", >>> + numa_cur->numa_id, improvement, >>> + current_var, estimate_var); >> >> Not sure the reason for changing the logging order? Just the NUMA >> node needs to be added. > > I didn't mean to update the logging order, just want to make the map > from numa id to improvement/variance_value be clear. So I chose a simple > way, which merges 'numa id', 'improvement' and 'variance value' into one > log msg. You prefer variance value goes first, then improvement? like > below. > "Numa node <I>. Current variance <A> Estimated variance <B>. Variance > improvement <P>." > Yes, because it gives the inputs first and then the result. I admit, it's just a personal preference and others may disagree. >> >>> + >>> + if (improvement >= dp->pmd_alb.rebalance_improve_thresh) { >>> + VLOG_DBG("PMD load variance improvement threshold %u%% " >>> + "is met.", dp->pmd_alb.rebalance_improve_thresh); >>> + thresh_met = true; >>> + break; >> >> >> I think it's better to remove the break here and move the "result" >> logging outside the for loop. Just set the thresh_met = true at this >> point. So then the logging looks like. >> >> "PMD auto load balance dry run" >> "Numa node x. Current y...Estime z...Improve n" >> "Numa node x. Current y...Estime z...Improve n" >> "PMD load improvement of m is met." or "...not met" >> >> For debug it would be important to continue the run on the other >> NUMA even if the threshold is already met, so removing the break; >> would allow this. > > It makes sense, will update in v2. > Example below, but feel free to do a different way. It would still print "not met" for the disallowed cross-numa case, but that seems ok. 
@@ -6214,12 +6214,5 @@ pmd_rebalance_dry_run(struct dp_netdev *dp)

             if (improvement >= dp->pmd_alb.rebalance_improve_thresh) {
-                VLOG_DBG("PMD load variance improvement threshold %u%% "
-                         "is met.", dp->pmd_alb.rebalance_improve_thresh);
                 thresh_met = true;
-                break;
-            } else {
-                VLOG_DBG("PMD load variance improvement threshold "
-                         "%u%% is not met.",
-                         dp->pmd_alb.rebalance_improve_thresh);
             }
         }
@@ -6229,4 +6222,8 @@ pmd_rebalance_dry_run(struct dp_netdev *dp)
     }

+    VLOG_DBG("PMD load variance improvement threshold %u%% "
+             "is %s.", dp->pmd_alb.rebalance_improve_thresh,
+             thresh_met ? "met" : "not met");
+
     sched_numa_list_free_entries(&numa_list_cur);
     sched_numa_list_free_entries(&numa_list_est);

>>
>>> +            } else {
>>> +                VLOG_DBG("PMD load variance improvement threshold "
>>> +                         "%u%% is not met.",
>>> +                         dp->pmd_alb.rebalance_improve_thresh);
>>> +            }
>>>         }
>>>     } else {
>>>         VLOG_DBG("PMD auto load balance detected cross-numa polling with "
>>
>
diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index 2c08a71..6a53f13 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -6076,39 +6076,33 @@ rxq_scheduling(struct dp_netdev *dp)
 static uint64_t variance(uint64_t a[], int n);

 static uint64_t
-sched_numa_list_variance(struct sched_numa_list *numa_list)
+sched_numa_variance(struct sched_numa *numa)
 {
-    struct sched_numa *numa;
     uint64_t *percent_busy = NULL;
-    unsigned total_pmds = 0;
     int n_proc = 0;
     uint64_t var;

-    HMAP_FOR_EACH (numa, node, &numa_list->numas) {
-        total_pmds += numa->n_pmds;
-        percent_busy = xrealloc(percent_busy,
-                                total_pmds * sizeof *percent_busy);
+    percent_busy = xmalloc(numa->n_pmds * sizeof *percent_busy);

-        for (unsigned i = 0; i < numa->n_pmds; i++) {
-            struct sched_pmd *sched_pmd;
-            uint64_t total_cycles = 0;
+    for (unsigned i = 0; i < numa->n_pmds; i++) {
+        struct sched_pmd *sched_pmd;
+        uint64_t total_cycles = 0;

-            sched_pmd = &numa->pmds[i];
-            /* Exclude isolated PMDs from variance calculations. */
-            if (sched_pmd->isolated == true) {
-                continue;
-            }
-            /* Get the total pmd cycles for an interval. */
-            atomic_read_relaxed(&sched_pmd->pmd->intrvl_cycles, &total_cycles);
+        sched_pmd = &numa->pmds[i];
+        /* Exclude isolated PMDs from variance calculations. */
+        if (sched_pmd->isolated == true) {
+            continue;
+        }
+        /* Get the total pmd cycles for an interval. */
+        atomic_read_relaxed(&sched_pmd->pmd->intrvl_cycles, &total_cycles);

-            if (total_cycles) {
-                /* Estimate the cycles to cover all intervals. */
-                total_cycles *= PMD_INTERVAL_MAX;
-                percent_busy[n_proc++] = (sched_pmd->pmd_proc_cycles * 100)
-                                             / total_cycles;
-            } else {
-                percent_busy[n_proc++] = 0;
-            }
+        if (total_cycles) {
+            /* Estimate the cycles to cover all intervals. */
+            total_cycles *= PMD_INTERVAL_MAX;
+            percent_busy[n_proc++] = (sched_pmd->pmd_proc_cycles * 100)
+                                         / total_cycles;
+        } else {
+            percent_busy[n_proc++] = 0;
         }
     }
     var = variance(percent_busy, n_proc);
@@ -6182,6 +6176,7 @@ pmd_rebalance_dry_run(struct dp_netdev *dp)
     struct sched_numa_list numa_list_est;
     bool thresh_met = false;
     uint64_t current_var, estimate_var;
+    struct sched_numa *numa_cur, *numa_est;
     uint64_t improvement = 0;

     VLOG_DBG("PMD auto load balance performing dry run.");
@@ -6200,24 +6195,33 @@ pmd_rebalance_dry_run(struct dp_netdev *dp)
         sched_numa_list_count(&numa_list_est) == 1) {

         /* Calculate variances. */
-        current_var = sched_numa_list_variance(&numa_list_cur);
-        estimate_var = sched_numa_list_variance(&numa_list_est);
-
-        if (estimate_var < current_var) {
-            improvement = ((current_var - estimate_var) * 100) / current_var;
-        }
-        VLOG_DBG("Current variance %"PRIu64" Estimated variance %"PRIu64".",
-                 current_var, estimate_var);
-        VLOG_DBG("Variance improvement %"PRIu64"%%.", improvement);
-
-        if (improvement >= dp->pmd_alb.rebalance_improve_thresh) {
-            thresh_met = true;
-            VLOG_DBG("PMD load variance improvement threshold %u%% "
-                     "is met.", dp->pmd_alb.rebalance_improve_thresh);
-        } else {
-            VLOG_DBG("PMD load variance improvement threshold "
-                     "%u%% is not met.",
-                     dp->pmd_alb.rebalance_improve_thresh);
+        HMAP_FOR_EACH (numa_cur, node, &numa_list_cur.numas) {
+            numa_est = sched_numa_list_lookup(&numa_list_est,
+                                              numa_cur->numa_id);
+            if (!numa_est) {
+                continue;
+            }
+            current_var = sched_numa_variance(numa_cur);
+            estimate_var = sched_numa_variance(numa_est);
+            if (estimate_var < current_var) {
+                improvement = ((current_var - estimate_var) * 100)
+                                  / current_var;
+            }
+            VLOG_DBG("Numa node %d. Variance improvement %"PRIu64"%%. Current"
+                     " variance %"PRIu64" Estimated variance %"PRIu64".",
+                     numa_cur->numa_id, improvement,
+                     current_var, estimate_var);
+
+            if (improvement >= dp->pmd_alb.rebalance_improve_thresh) {
+                VLOG_DBG("PMD load variance improvement threshold %u%% "
+                         "is met.", dp->pmd_alb.rebalance_improve_thresh);
+                thresh_met = true;
+                break;
+            } else {
+                VLOG_DBG("PMD load variance improvement threshold "
+                         "%u%% is not met.",
+                         dp->pmd_alb.rebalance_improve_thresh);
+            }
         }
     } else {
         VLOG_DBG("PMD auto load balance detected cross-numa polling with "
Currently, pmd_rebalance_dry_run() calculates the overall variance of all pmds regardless of their numa location. The overall result may hide an imbalance in an individual numa.

Consider the following case. Numa 0 is free because VMs on numa0 are not sending pkts, while numa 1 is busy. Within numa 1, pmd workloads are not balanced. Obviously, moving 500 kpps of workload from pmd 126 to pmd 62 will make numa1 much more balanced. For numa1 the variance improvement will be almost 100%, because after rebalancing each pmd in numa1 holds the same workload (variance ~= 0). But the overall variance improvement is only about 20%, which may not trigger auto_lb.

```
numa_id core_id kpps
      0      30    0
      0      31    0
      0      94    0
      0      95    0
      1     126 1500
      1     127 1000
      1      63 1000
      1      62  500
```

As auto_lb doesn't work if any cross_numa rxq exists, it means that auto_lb only balances rxq assignment within each numa. So it makes more sense to calculate the variance improvement per numa node.

Signed-off-by: Cheng Li <lic121@chinatelecom.cn>
---
 lib/dpif-netdev.c | 90 +++++++++++++++++++++++++++++--------------------------
 1 file changed, 47 insertions(+), 43 deletions(-)