[ovs-dev] ofproto-dpif: Improve load balancing in dp_hash select groups.

Message ID 20240920124135.1006001-1-i.maximets@ovn.org
State Accepted
Commit cb64234786438684a05f523bfe6c0d90815ec413
Delegated to: Ilya Maximets
Headers show
Series [ovs-dev] ofproto-dpif: Improve load balancing in dp_hash select groups. | expand

Checks

Context                                   Check     Description
ovsrobot/apply-robot                      success   apply and check: success
ovsrobot/github-robot-_Build_and_Test     success   github build: passed

Commit Message

Ilya Maximets Sept. 20, 2024, 12:40 p.m. UTC
ovn-kubernetes users observe uneven distribution of traffic among
service backends.  It gets noticeably worse when 9+ backends are
configured.  For example:

    Server        connections
    server-6nbg7     634
    server-72br8     326  <-----
    server-7b8rm     605
    server-9tnz7     655
    server-c84hw     596
    server-mjskl     622
    server-ptkcc     336  <-----
    server-rl7pk     592
    server-xd2cb     634

Here we can see that, out of 5000 connections total, servers '72br8'
and 'ptkcc' received about half as many connections as any other
server.  With 10 backends, 4 servers will receive 2x less traffic,
and so on.  The situation is common, because Kubernetes users
frequently scale their applications with something like
'kubectl scale --replicas=10'.

Services are implemented as OVN load-balancers, which are in turn
realized as OpenFlow select groups with the dp_hash selection method.
Balancing works by allocating a hash space (the number of hashes is
a power of 2) and distributing the hashes among the group buckets
using Webster's method, taking bucket weights into account.

This works well enough when buckets have different weights, because
OVS needs to allocate a larger hash space in order to accommodate
the weight difference.  However, in the case of OVN load-balancers,
all the buckets (backends) have the same weight.  This leads to
allocation of a smaller hash space, just large enough to fit all the
backends (the closest power of 2).  For example, if we have 10
backends (buckets), then 16 hashes will be allocated.  After the
hashes are distributed, we end up with 6 buckets having 2 hashes
each and 4 buckets having 1 hash.  So, each of the 6 buckets will
receive on average 2x more traffic than each of the remaining 4.
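
For illustration, here is a minimal standalone sketch of Webster's
(Sainte-Laguë) apportionment that reproduces this 6-buckets-get-2 /
4-buckets-get-1 split for 10 equal-weight buckets and 16 hashes.
It is only a model of the allocation step described above, not the
OVS implementation, and all names in it are made up for the example:

    /* Sketch only: assign each of 'n_hash' hash values to one of
     * 'n_buckets' weighted buckets using Webster's method: the next
     * hash always goes to the bucket with the highest priority
     * weight / (2 * hashes_already_assigned + 1). */
    #include <stdint.h>
    #include <stdio.h>

    static void
    websters_method(const uint32_t *weights, size_t n_buckets,
                    size_t n_hash, size_t *slots)
    {
        for (size_t i = 0; i < n_buckets; i++) {
            slots[i] = 0;
        }
        for (size_t h = 0; h < n_hash; h++) {
            size_t best = 0;
            double best_prio = -1.0;

            for (size_t i = 0; i < n_buckets; i++) {
                double prio = (double) weights[i] / (2.0 * slots[i] + 1.0);

                if (prio > best_prio) {
                    best_prio = prio;
                    best = i;
                }
            }
            slots[best]++;    /* Hash value 'h' maps to bucket 'best'. */
        }
    }

    int
    main(void)
    {
        uint32_t weights[10];
        size_t slots[10];

        for (size_t i = 0; i < 10; i++) {
            weights[i] = 100;    /* All backends equally weighted. */
        }
        websters_method(weights, 10, 16, slots);
        for (size_t i = 0; i < 10; i++) {
            /* Prints 2 hashes for 6 of the buckets and 1 for the rest. */
            printf("bucket %zu: %zu hashes\n", i, slots[i]);
        }
        return 0;
    }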

The problem can be mitigated by expanding the hash space, and this is
already done for cases with a small number of buckets by requiring
the hash space to have at least 16 hashes.  However, that is just not
large enough for 9 or 10 buckets to provide a good distribution.  At
the same time, blindly increasing the minimum is not a good solution
either, as we do not want too large a hash space for a small number
of buckets, simply because each hash is a separate datapath flow and
having too many of them may cause performance issues.

The approach taken in this change is to ensure that the hash space is
at least 4 times larger than the number of buckets, but not larger
than the maximum allowed (256).  This provides a better distribution
without unnecessarily exploding the number of datapath flows for
services with relatively few backends.

Here is some data to demonstrate why 4 was chosen as the coefficient:

  coeff.  :   1         2         3         4         5        100
  -------------------------------------------------------------------
  AvgDiff : 43.1 %    27.1 %    18.3 %    15.1 %    13.6 %    10.9 %
  MaxDiff : 50.0 %    33.3 %    25.0 %    20.0 %    20.0 %    20.0 %
  AvgDev  : 24.9 %    13.4 %     8.5 %     6.9 %     6.1 %     4.8 %
  MaxDev  : 35.4 %    20.4 %    14.4 %    11.2 %    11.2 %    11.2 %
  --------+----------------------------------------------------------
    16    :   1         1         1         1         1         -
    32    :  17         9         6         5         4         -
    64    :  33        17        11         9         7         -
   128    :  65        33        22        17        13         1
   256    : 129        65        43        33        26         2
  --------+----------------------------------------------------------
           current                       proposed

The table shows the average and maximum load difference (Diff)
between backends across groups with 1 to 64 equally weighted
backends, as well as the average and maximum standard deviation (Dev)
of the load distribution for the same groups.  For example, with a
coefficient of 2, the maximum difference between two backends will be
33% and the maximum standard deviation will be 20.4%.  With the
current logic (coefficient of 1) the maximum difference is as high as
50%, as shown in the example at the beginning, with a standard
deviation of 35.4%.

The bottom half of the table shows from how many buckets we start to
use a particular number of hashes.  For example, with a coefficient
of 3 we will have 16 hashes for 1 to 5 buckets, 32 hashes for 6-10,
64 hashes for 11-21, and so on.
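
The thresholds in the bottom half of the table can be reproduced with
a small sketch of the sizing logic (with equal weights, min_slots
equals the number of buckets).  The COEFF and MAX_HASH_VALUES names
below are illustrative stand-ins mirroring the proposed change; this
is not the OVS code itself:

    /* Sketch: hash space size for 'n_buckets' equally weighted
     * buckets, given a coefficient and the 256-hash cap. */
    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>

    #define COEFF            4
    #define MAX_HASH_VALUES  256

    static uint64_t
    round_up_pow2(uint64_t x)
    {
        uint64_t p = 1;

        while (p < x) {
            p <<= 1;
        }
        return p;
    }

    int
    main(void)
    {
        for (uint64_t n_buckets = 1; n_buckets <= 64; n_buckets++) {
            uint64_t min_slots = n_buckets;    /* Equal weights. */
            uint64_t want = n_buckets * COEFF;

            if (want > MAX_HASH_VALUES) {
                want = MAX_HASH_VALUES;
            }
            if (want < min_slots) {
                want = min_slots;
            }

            uint64_t n_hash = round_up_pow2(want);

            if (n_hash < 16) {
                n_hash = 16;
            }
            /* E.g. 9 buckets -> 64 hashes, 17 buckets -> 128 hashes. */
            printf("%2" PRIu64 " buckets -> %3" PRIu64 " hashes\n",
                   n_buckets, n_hash);
        }
        return 0;
    }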

According to the table, a coefficient of 4 is about where we achieve
a good enough standard deviation for the load (11.2%) while still not
creating too many hashes for cases with a low number of backends.
The standard deviation also doesn't go down much with higher
coefficients.

The data is aggregated for up to 64 backends because with higher
numbers we're getting close to the cap of 256 hashes, so the
deviation increases.  However, the load difference with that many
backends should have a lower impact on a cluster in general.  The
change improves the cases with 64+ backends too, but to get very good
numbers there we'll need to consider raising the upper limit for the
hash space, which is a separate change to think about.

The added test checks that the standard deviation of the load
distribution between buckets is below 12.5% for all cases up to 64
buckets.  It's just a nice round number that I've chosen and that
should IMO be good enough.

Note: when the number of buckets is a power of 2, we get an ideal
distribution with zero deviation, because the hashes can be mapped
evenly to the buckets.

Note 2: The change doesn't affect the distribution for 4 or fewer
backends, since those groups still get the minimum of 16 hashes.

Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
---

Not sure if this should be considered as a bug fix and be backported.
Thoughts on this are welcome.

 ofproto/ofproto-dpif.c |  6 +++--
 tests/ofproto-dpif.at  | 54 ++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 58 insertions(+), 2 deletions(-)

Comments

Eelco Chaudron Sept. 26, 2024, 9:03 a.m. UTC | #1
On 20 Sep 2024, at 14:40, Ilya Maximets wrote:

> [...]

Thanks for the patch and the extensive log message description.
The change looks good to me. Did some basic testing, but I do not see how this would break any real traffic.

Cheers,

Eelco

Acked-by: Eelco Chaudron <echaudro@redhat.com>


Aaron Conole Oct. 1, 2024, 5:27 p.m. UTC | #2
Ilya Maximets <i.maximets@ovn.org> writes:

> [...]
>
> Not sure if this should be considered as a bug fix and be backported.
> Thoughts on this are welcome.

I think it should be okay to backport.  It isn't a new feature.

Acked-by: Aaron Conole <aconole@redhat.com>

Ilya Maximets Oct. 7, 2024, 9:13 p.m. UTC | #3
On 10/1/24 19:27, Aaron Conole wrote:
> Ilya Maximets <i.maximets@ovn.org> writes:
> 
>> [...]
>>
>> Not sure if this should be considered as a bug fix and be backported.
>> Thoughts on this are welcome.
> 
> I think it should be okay to backport.  It isn't a new feature.
> 
> Acked-by: Aaron Conole <aconole@redhat.com>

Thanks, Aaron and Eelco!

After some consideration, I think this should indeed be treated as a
bug fix, because OpenFlow users expect OVS to honor the weights of
the buckets, and in this case the load on equally weighted buckets
differs significantly, by up to 2x.

Applied and backported down to 3.1: the issue was originally reported
against 3.1, the fix is simple, and we do provide occasional releases
for the last 4 branches, according to our release policies.  At the
same time, the issue doesn't seem critical enough to backport to the
old-LTS 2.17.

Best regards, Ilya Maximets.

Patch

diff --git a/ofproto/ofproto-dpif.c b/ofproto/ofproto-dpif.c
index 4659d8a3b..bf43d5d4b 100644
--- a/ofproto/ofproto-dpif.c
+++ b/ofproto/ofproto-dpif.c
@@ -5341,8 +5341,10 @@  group_setup_dp_hash_table(struct group_dpif *group, size_t max_hash)
              min_weight, total_weight);
 
     uint64_t min_slots = DIV_ROUND_UP(total_weight, min_weight);
-    uint64_t min_slots2 = ROUND_UP_POW2(min_slots);
-    uint64_t n_hash = MAX(16, min_slots2);
+    uint64_t min_slots2 =
+        MAX(min_slots, MIN(n_buckets * 4, MAX_SELECT_GROUP_HASH_VALUES));
+    uint64_t min_slots3 = ROUND_UP_POW2(min_slots2);
+    uint64_t n_hash = MAX(16, min_slots3);
     if (n_hash > MAX_SELECT_GROUP_HASH_VALUES ||
         (max_hash != 0 && n_hash > max_hash)) {
         VLOG_DBG("  Too many hash values required: %"PRIu64, n_hash);
diff --git a/tests/ofproto-dpif.at b/tests/ofproto-dpif.at
index 12cb7f7a6..1c99dc3bb 100644
--- a/tests/ofproto-dpif.at
+++ b/tests/ofproto-dpif.at
@@ -1238,6 +1238,60 @@  bucket3 >= 500
 OVS_VSWITCHD_STOP
 AT_CLEANUP
 
+AT_SETUP([ofproto-dpif - select group with dp_hash and equal weights])
+
+OVS_VSWITCHD_START
+add_of_ports br0 1 10
+
+AT_CHECK([ovs-appctl vlog/set ofproto_dpif:file:dbg vconn:file:info])
+
+AT_DATA([stddev.awk], [
+  {
+    # $1 (target) is a mean value, because all weights are the same.
+    # $2 (hits) is an actual number of hashes assigned to this bucket.
+    n_hashes += $2
+    n_buckets++
+    sum_sq_diff += ($2 - $1) * ($2 - $1)
+  }
+  END {
+    mean = n_hashes / n_buckets
+    stddev = sqrt(sum_sq_diff / n_buckets)
+    stddevp = stddev * 100 / mean
+
+    print "hashes:", n_hashes, "buckets:", n_buckets
+    print "mean:", mean, "stddev:", stddev, "(", stddevp, "% )"
+
+    # Make sure that standard deviation of load between buckets is below 12.5%.
+    # Note: it's not a strict requirement, but a good number that passes tests.
+    if (stddevp <= 12.5) { print "PASS" }
+    else { print "FAIL" }
+  }
+])
+
+m4_define([CHECK_DISTRIBUTION], [
+  AT_CHECK([tail -n $1 ovs-vswitchd.log | grep 'ofproto_dpif|DBG|.*Bucket' \
+                | sed 's/.*target=\([[0-9\.]]*\) hits=\([[0-9]]*\)/\1 \2/' \
+                | awk -f stddev.awk], [0], [stdout])
+  AT_CHECK([grep -q "buckets: $2" stdout])
+  AT_CHECK([grep -q 'PASS' stdout])
+])
+
+m4_define([OF_GROUP], [group_id=$1,type=select,selection_method=dp_hash])
+m4_define([OFG_BUCKET], [bucket=weight=$1,output:10])
+
+dnl Test load distribution in groups with up to 64 equally weighted buckets.
+m4_define([OFG_BUCKETS], [OFG_BUCKET(100)])
+m4_for([id], [1], [64], [1], [
+  get_log_next_line_num
+  AT_CHECK([ovs-ofctl -O OpenFlow15 add-group br0 \
+                "OF_GROUP(id),OFG_BUCKETS()"])
+  CHECK_DISTRIBUTION([+$LINENUM], [id])
+  m4_append([OFG_BUCKETS], [,OFG_BUCKET(100)])
+])
+
+OVS_VSWITCHD_STOP
+AT_CLEANUP
+
 AT_SETUP([ofproto-dpif - select group with explicit dp_hash selection method])
 
 OVS_VSWITCHD_START