| Message ID | 20210823200223.680933-1-i.maximets@ovn.org |
|---|---|
| State | Accepted |
| Series | [ovs-dev] ovn-northd: Fix extremely inefficient usage of lflow hash map. |

| Context | Check | Description |
|---|---|---|
| ovsrobot/apply-robot | success | apply and check: success |
| ovsrobot/github-robot-_Build_and_Test | success | github build: passed |
| ovsrobot/github-robot-_ovn-kubernetes | fail | github build: failed |
Should not be the case. The map is pre-sized for the size from the previous iterations. Line 12861 in my tree which is probably a few commits out of date: fast_hmap_size_for(&lflows, max_seen_lflow_size); And immediately after building the lflows: if (hmap_count(&lflows) > max_seen_lflow_size) { max_seen_lflow_size = hmap_count(&lflows); } So the map SHOULD be sized correctly - to the most recent seen lflow count. A. On 23/08/2021 21:02, Ilya Maximets wrote: > 'lflow_map' is never expanded because northd always uses fast > insertion. This leads to the case where we have a hash map > with only 128 initial buckets and every ovn_lflow_find() ends > up iterating over n_lflows / 128 entries. It's thousands of > logical flows or even more. For example, it takes forever for > ovn-northd to start up with the Northbound Db from the 120 node > density-heavy test from ovn-heater, because every lookup is slower > than previous one. I aborted the process after 15 minutes of > waiting, because there was no sign that it will converge. With > this change applied the loop completes in only 11 seconds. > > Hash map will be pre-allocated to the maximum seen number of > logical flows on a second iteration, but this doesn't help for > the first iteration when northd first time connects to a big > Northbound database, which is a common case during failover or > cluster upgrade. And there is an even trickier case where big > NbDB transaction that explodes the number of logical flows received > on not the first run. > > We can't expand the hash map in case of parallel build, so this > should be fixed separately. > > CC: Anton Ivanov <anton.ivanov@cambridgegreys.com> > Fixes: 74daa0607c7f ("ovn-northd: Introduce parallel lflow build") > Signed-off-by: Ilya Maximets <i.maximets@ovn.org> > --- > northd/ovn-northd.c | 6 +++++- > 1 file changed, 5 insertions(+), 1 deletion(-) > > diff --git a/northd/ovn-northd.c b/northd/ovn-northd.c > index 3d8e21a4f..40cf957c0 100644 > --- a/northd/ovn-northd.c > +++ b/northd/ovn-northd.c > @@ -4387,7 +4387,11 @@ do_ovn_lflow_add(struct hmap *lflow_map, struct ovn_datapath *od, > nullable_xstrdup(ctrl_meter), > ovn_lflow_hint(stage_hint), where); > hmapx_add(&lflow->od_group, od); > - hmap_insert_fast(lflow_map, &lflow->hmap_node, hash); > + if (!use_parallel_build) { > + hmap_insert(lflow_map, &lflow->hmap_node, hash); > + } else { > + hmap_insert_fast(lflow_map, &lflow->hmap_node, hash); > + } > } > > /* Adds a row with the specified contents to the Logical_Flow table. */
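For readers following the thread, a minimal sketch of the difference the patch relies on. The types below are simplified stand-ins, not the real definitions from OVS lib/hmap.h: hmap_insert_fast() links a node into its bucket without ever growing the bucket array, while hmap_insert() additionally rehashes into a larger array once the load factor crosses a threshold (the exact threshold shown here is an assumption).

```c
#include <stdint.h>
#include <stdlib.h>

/* Simplified stand-ins for the OVS structures; the real definitions live in
 * lib/hmap.h and differ in detail. */
struct node { uint32_t hash; struct node *next; };
struct map  { struct node **buckets; size_t mask; size_t n; }; /* mask == n_buckets - 1 */

/* Model of hmap_insert_fast(): link the node into its bucket, never resize.
 * Stuck at the initial 128 buckets, chains grow as n_lflows / 128, so every
 * later ovn_lflow_find() walks a longer and longer chain. */
static void
insert_fast(struct map *m, struct node *node, uint32_t hash)
{
    struct node **bucket = &m->buckets[hash & m->mask];
    node->hash = hash;
    node->next = *bucket;
    *bucket = node;
    m->n++;
}

/* Model of hmap_insert(): the same insertion, plus a rehash into a bucket
 * array twice as large when the load factor gets too high (assumed
 * threshold), which keeps the average chain length bounded.
 * Error handling omitted for brevity. */
static void
insert(struct map *m, struct node *node, uint32_t hash)
{
    insert_fast(m, node, hash);
    if (m->n / 2 > m->mask) {
        size_t new_mask = (m->mask << 1) | 1;
        struct node **new_buckets = calloc(new_mask + 1, sizeof *new_buckets);
        for (size_t i = 0; i <= m->mask; i++) {
            for (struct node *p = m->buckets[i], *next; p; p = next) {
                next = p->next;
                p->next = new_buckets[p->hash & new_mask];
                new_buckets[p->hash & new_mask] = p;
            }
        }
        free(m->buckets);
        m->buckets = new_buckets;
        m->mask = new_mask;
    }
}
```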
On 8/23/21 10:20 PM, Anton Ivanov wrote: > Should not be the case. > > The map is pre-sized for the size from the previous iterations. > > Line 12861 in my tree which is probably a few commits out of date: > > fast_hmap_size_for(&lflows, max_seen_lflow_size); > > And immediately after building the lflows: > > if (hmap_count(&lflows) > max_seen_lflow_size) { > max_seen_lflow_size = hmap_count(&lflows); > } > > So the map SHOULD be sized correctly - to the most recent seen lflow count. Please, re-read the commit message. It's a first run of the loop and the 'max_seen_lflow_size' is default 128 at this point. > > A. > > On 23/08/2021 21:02, Ilya Maximets wrote: >> 'lflow_map' is never expanded because northd always uses fast >> insertion. This leads to the case where we have a hash map >> with only 128 initial buckets and every ovn_lflow_find() ends >> up iterating over n_lflows / 128 entries. It's thousands of >> logical flows or even more. For example, it takes forever for >> ovn-northd to start up with the Northbound Db from the 120 node >> density-heavy test from ovn-heater, because every lookup is slower >> than previous one. I aborted the process after 15 minutes of >> waiting, because there was no sign that it will converge. With >> this change applied the loop completes in only 11 seconds. >> >> Hash map will be pre-allocated to the maximum seen number of >> logical flows on a second iteration, but this doesn't help for >> the first iteration when northd first time connects to a big >> Northbound database, which is a common case during failover or >> cluster upgrade. And there is an even trickier case where big >> NbDB transaction that explodes the number of logical flows received >> on not the first run. >> >> We can't expand the hash map in case of parallel build, so this >> should be fixed separately. >> >> CC: Anton Ivanov <anton.ivanov@cambridgegreys.com> >> Fixes: 74daa0607c7f ("ovn-northd: Introduce parallel lflow build") >> Signed-off-by: Ilya Maximets <i.maximets@ovn.org> >> --- >> northd/ovn-northd.c | 6 +++++- >> 1 file changed, 5 insertions(+), 1 deletion(-) >> >> diff --git a/northd/ovn-northd.c b/northd/ovn-northd.c >> index 3d8e21a4f..40cf957c0 100644 >> --- a/northd/ovn-northd.c >> +++ b/northd/ovn-northd.c >> @@ -4387,7 +4387,11 @@ do_ovn_lflow_add(struct hmap *lflow_map, struct ovn_datapath *od, >> nullable_xstrdup(ctrl_meter), >> ovn_lflow_hint(stage_hint), where); >> hmapx_add(&lflow->od_group, od); >> - hmap_insert_fast(lflow_map, &lflow->hmap_node, hash); >> + if (!use_parallel_build) { >> + hmap_insert(lflow_map, &lflow->hmap_node, hash); >> + } else { >> + hmap_insert_fast(lflow_map, &lflow->hmap_node, hash); >> + } >> } >> /* Adds a row with the specified contents to the Logical_Flow table. */ > >
On 23/08/2021 21:26, Ilya Maximets wrote: > On 8/23/21 10:20 PM, Anton Ivanov wrote: >> Should not be the case. >> >> The map is pre-sized for the size from the previous iterations. >> >> Line 12861 in my tree which is probably a few commits out of date: >> >> fast_hmap_size_for(&lflows, max_seen_lflow_size); >> >> And immediately after building the lflows: >> >> if (hmap_count(&lflows) > max_seen_lflow_size) { >> max_seen_lflow_size = hmap_count(&lflows); >> } >> >> So the map SHOULD be sized correctly - to the most recent seen lflow count. > Please, re-read the commit message. It's a first run of the loop > and the 'max_seen_lflow_size' is default 128 at this point. Ack, Not using auto-resizing in single threaded mode is a bug. Thanks for fixing it. From that perspective the patch is a straight +1 from me. From the perspective of the use case stated in the commit message- I am not sure it addresses it. If northd is in single-threaded mode and is tackling a GIGANTIC database, it may never complete the first iteration before the expiration of the timers and everyone deciding that northd is AWOL. In that case, if it is multi-threaded from the start, it should probably grab the sizing from the lflow table hash in south db. That would be a good approximation for the first run. A. > >> A. >> >> On 23/08/2021 21:02, Ilya Maximets wrote: >>> 'lflow_map' is never expanded because northd always uses fast >>> insertion. This leads to the case where we have a hash map >>> with only 128 initial buckets and every ovn_lflow_find() ends >>> up iterating over n_lflows / 128 entries. It's thousands of >>> logical flows or even more. For example, it takes forever for >>> ovn-northd to start up with the Northbound Db from the 120 node >>> density-heavy test from ovn-heater, because every lookup is slower >>> than previous one. I aborted the process after 15 minutes of >>> waiting, because there was no sign that it will converge. With >>> this change applied the loop completes in only 11 seconds. >>> >>> Hash map will be pre-allocated to the maximum seen number of >>> logical flows on a second iteration, but this doesn't help for >>> the first iteration when northd first time connects to a big >>> Northbound database, which is a common case during failover or >>> cluster upgrade. And there is an even trickier case where big >>> NbDB transaction that explodes the number of logical flows received >>> on not the first run. >>> >>> We can't expand the hash map in case of parallel build, so this >>> should be fixed separately. >>> >>> CC: Anton Ivanov <anton.ivanov@cambridgegreys.com> >>> Fixes: 74daa0607c7f ("ovn-northd: Introduce parallel lflow build") >>> Signed-off-by: Ilya Maximets <i.maximets@ovn.org> >>> --- >>> northd/ovn-northd.c | 6 +++++- >>> 1 file changed, 5 insertions(+), 1 deletion(-) >>> >>> diff --git a/northd/ovn-northd.c b/northd/ovn-northd.c >>> index 3d8e21a4f..40cf957c0 100644 >>> --- a/northd/ovn-northd.c >>> +++ b/northd/ovn-northd.c >>> @@ -4387,7 +4387,11 @@ do_ovn_lflow_add(struct hmap *lflow_map, struct ovn_datapath *od, >>> nullable_xstrdup(ctrl_meter), >>> ovn_lflow_hint(stage_hint), where); >>> hmapx_add(&lflow->od_group, od); >>> - hmap_insert_fast(lflow_map, &lflow->hmap_node, hash); >>> + if (!use_parallel_build) { >>> + hmap_insert(lflow_map, &lflow->hmap_node, hash); >>> + } else { >>> + hmap_insert_fast(lflow_map, &lflow->hmap_node, hash); >>> + } >>> } >>> /* Adds a row with the specified contents to the Logical_Flow table. */ >> >
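A rough sketch of the sizing-from-SbDB idea suggested above. The include paths and the IDL macro follow the naming conventions of the generated OVN southbound bindings, but the whole snippet is illustrative rather than a proposed patch; in particular the placement and the MAX() seeding are assumptions.

```c
#include "lib/ovn-sb-idl.h"   /* generated southbound IDL bindings */

/* Count the Logical_Flow rows already present in the southbound DB and use
 * that as a first-run approximation for the lflow hash map size. */
static size_t
count_sb_lflows(const struct ovsdb_idl *ovnsb_idl)
{
    const struct sbrec_logical_flow *sbflow;
    size_t n = 0;

    SBREC_LOGICAL_FLOW_FOR_EACH (sbflow, ovnsb_idl) {
        n++;
    }
    return n;
}

/* Before the first build_lflows() run (illustrative placement):
 *
 *     max_seen_lflow_size = MAX(max_seen_lflow_size,
 *                               count_sb_lflows(ctx->ovnsb_idl));
 *     fast_hmap_size_for(&lflows, max_seen_lflow_size);
 */
```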
On 8/23/21 10:37 PM, Anton Ivanov wrote: > On 23/08/2021 21:26, Ilya Maximets wrote: >> On 8/23/21 10:20 PM, Anton Ivanov wrote: >>> Should not be the case. >>> >>> The map is pre-sized for the size from the previous iterations. >>> >>> Line 12861 in my tree which is probably a few commits out of date: >>> >>> fast_hmap_size_for(&lflows, max_seen_lflow_size); >>> >>> And immediately after building the lflows: >>> >>> if (hmap_count(&lflows) > max_seen_lflow_size) { >>> max_seen_lflow_size = hmap_count(&lflows); >>> } >>> >>> So the map SHOULD be sized correctly - to the most recent seen lflow count. >> Please, re-read the commit message. It's a first run of the loop >> and the 'max_seen_lflow_size' is default 128 at this point. > > Ack, > > Not using auto-resizing in single threaded mode is a bug. Thanks for fixing it. > > From that perspective the patch is a straight +1 from me. > > From the perspective of the use case stated in the commit message- I am not sure it addresses it. > > If northd is in single-threaded mode and is tackling a GIGANTIC> database, it may never complete the first iteration before the > expiration of the timers and everyone deciding that northd is AWOL. Well, how do you suggest to fix that? Obviously, we can always create a database that northd will never be able to process in a reasonable amount of time. And it doesn't matter if it's single- or multi-threaded. In this case NbDB is only 9MB in size, which is very reasonable, and northd on my laptop takes more than 15 minutes to process it (I killed it at this point). With the patch applied it took only 11 seconds. So, for me, this patch pretty much fixes the issue. 11 seconds is not that bad, e.g. ovn-k8s configures inactivity probes for clients to 180. It would be great to reduce, but we're not there yet. > > In that case, if it is multi-threaded from the start, it should probably > grab the sizing from the lflow table hash in south db. That would be a > good approximation for the first run. This will not work for a case where SbDB is empty for any reason while NbDB is not. And there is still a case where northd initially connects to semi-empty databases and after few iterations NbDB receives a big transaction and generates a big update for northd. > > A. > >> >>> A. >>> >>> On 23/08/2021 21:02, Ilya Maximets wrote: >>>> 'lflow_map' is never expanded because northd always uses fast >>>> insertion. This leads to the case where we have a hash map >>>> with only 128 initial buckets and every ovn_lflow_find() ends >>>> up iterating over n_lflows / 128 entries. It's thousands of >>>> logical flows or even more. For example, it takes forever for >>>> ovn-northd to start up with the Northbound Db from the 120 node >>>> density-heavy test from ovn-heater, because every lookup is slower >>>> than previous one. I aborted the process after 15 minutes of >>>> waiting, because there was no sign that it will converge. With >>>> this change applied the loop completes in only 11 seconds. >>>> >>>> Hash map will be pre-allocated to the maximum seen number of >>>> logical flows on a second iteration, but this doesn't help for >>>> the first iteration when northd first time connects to a big >>>> Northbound database, which is a common case during failover or >>>> cluster upgrade. And there is an even trickier case where big >>>> NbDB transaction that explodes the number of logical flows received >>>> on not the first run. >>>> >>>> We can't expand the hash map in case of parallel build, so this >>>> should be fixed separately. 
>>>> >>>> CC: Anton Ivanov <anton.ivanov@cambridgegreys.com> >>>> Fixes: 74daa0607c7f ("ovn-northd: Introduce parallel lflow build") >>>> Signed-off-by: Ilya Maximets <i.maximets@ovn.org> >>>> --- >>>> northd/ovn-northd.c | 6 +++++- >>>> 1 file changed, 5 insertions(+), 1 deletion(-) >>>> >>>> diff --git a/northd/ovn-northd.c b/northd/ovn-northd.c >>>> index 3d8e21a4f..40cf957c0 100644 >>>> --- a/northd/ovn-northd.c >>>> +++ b/northd/ovn-northd.c >>>> @@ -4387,7 +4387,11 @@ do_ovn_lflow_add(struct hmap *lflow_map, struct ovn_datapath *od, >>>> nullable_xstrdup(ctrl_meter), >>>> ovn_lflow_hint(stage_hint), where); >>>> hmapx_add(&lflow->od_group, od); >>>> - hmap_insert_fast(lflow_map, &lflow->hmap_node, hash); >>>> + if (!use_parallel_build) { >>>> + hmap_insert(lflow_map, &lflow->hmap_node, hash); >>>> + } else { >>>> + hmap_insert_fast(lflow_map, &lflow->hmap_node, hash); >>>> + } >>>> } >>>> /* Adds a row with the specified contents to the Logical_Flow table. */ >>> >> >
On 23/08/2021 22:36, Ilya Maximets wrote: > On 8/23/21 10:37 PM, Anton Ivanov wrote: >> On 23/08/2021 21:26, Ilya Maximets wrote: >>> On 8/23/21 10:20 PM, Anton Ivanov wrote: >>>> Should not be the case. >>>> >>>> The map is pre-sized for the size from the previous iterations. >>>> >>>> Line 12861 in my tree which is probably a few commits out of date: >>>> >>>> fast_hmap_size_for(&lflows, max_seen_lflow_size); >>>> >>>> And immediately after building the lflows: >>>> >>>> if (hmap_count(&lflows) > max_seen_lflow_size) { >>>> max_seen_lflow_size = hmap_count(&lflows); >>>> } >>>> >>>> So the map SHOULD be sized correctly - to the most recent seen lflow count. >>> Please, re-read the commit message. It's a first run of the loop >>> and the 'max_seen_lflow_size' is default 128 at this point. >> Ack, >> >> Not using auto-resizing in single threaded mode is a bug. Thanks for fixing it. >> >> From that perspective the patch is a straight +1 from me. >> >> From the perspective of the use case stated in the commit message- I am not sure it addresses it. >> >> If northd is in single-threaded mode and is tackling a GIGANTIC> database, it may never complete the first iteration before the >> expiration of the timers and everyone deciding that northd is AWOL. > Well, how do you suggest to fix that? Obviously, we can always create > a database that northd will never be able to process in a reasonable > amount of time. And it doesn't matter if it's single- or multi-threaded. > > In this case NbDB is only 9MB in size, which is very reasonable, and > northd on my laptop takes more than 15 minutes to process it (I killed > it at this point). With the patch applied it took only 11 seconds. > So, for me, this patch pretty much fixes the issue. 11 seconds is not > that bad, e.g. ovn-k8s configures inactivity probes for clients to 180. > It would be great to reduce, but we're not there yet. > >> In that case, if it is multi-threaded from the start, it should probably >> grab the sizing from the lflow table hash in south db. That would be a >> good approximation for the first run. > This will not work for a case where SbDB is empty for any reason while > NbDB is not. And there is still a case where northd initially connects > to semi-empty databases and after few iterations NbDB receives a big > transaction and generates a big update for northd. A partial fix is to resize to optimum size the hash after lflow processing. If the sizing was correct - 99.9% of the case this will be a noop. If the sizing was incorrect, it will be resized so that the DP searches and all other ops which were recently added for flow reduction will work optimally. This still does not work for lflow compute with dpgroups + parallel upon initial connect and without a SB database to use for size guidance. It will work for all other cases. I will send two separate patches to address the cases which can be easily addressed and see what can be done with the dp+parallel upon initial connect to an empty sb database. Brgds, A > >> A. >> >>>> A. >>>> >>>> On 23/08/2021 21:02, Ilya Maximets wrote: >>>>> 'lflow_map' is never expanded because northd always uses fast >>>>> insertion. This leads to the case where we have a hash map >>>>> with only 128 initial buckets and every ovn_lflow_find() ends >>>>> up iterating over n_lflows / 128 entries. It's thousands of >>>>> logical flows or even more. 
For example, it takes forever for >>>>> ovn-northd to start up with the Northbound Db from the 120 node >>>>> density-heavy test from ovn-heater, because every lookup is slower >>>>> than previous one. I aborted the process after 15 minutes of >>>>> waiting, because there was no sign that it will converge. With >>>>> this change applied the loop completes in only 11 seconds. >>>>> >>>>> Hash map will be pre-allocated to the maximum seen number of >>>>> logical flows on a second iteration, but this doesn't help for >>>>> the first iteration when northd first time connects to a big >>>>> Northbound database, which is a common case during failover or >>>>> cluster upgrade. And there is an even trickier case where big >>>>> NbDB transaction that explodes the number of logical flows received >>>>> on not the first run. >>>>> >>>>> We can't expand the hash map in case of parallel build, so this >>>>> should be fixed separately. >>>>> >>>>> CC: Anton Ivanov <anton.ivanov@cambridgegreys.com> >>>>> Fixes: 74daa0607c7f ("ovn-northd: Introduce parallel lflow build") >>>>> Signed-off-by: Ilya Maximets <i.maximets@ovn.org> >>>>> --- >>>>> northd/ovn-northd.c | 6 +++++- >>>>> 1 file changed, 5 insertions(+), 1 deletion(-) >>>>> >>>>> diff --git a/northd/ovn-northd.c b/northd/ovn-northd.c >>>>> index 3d8e21a4f..40cf957c0 100644 >>>>> --- a/northd/ovn-northd.c >>>>> +++ b/northd/ovn-northd.c >>>>> @@ -4387,7 +4387,11 @@ do_ovn_lflow_add(struct hmap *lflow_map, struct ovn_datapath *od, >>>>> nullable_xstrdup(ctrl_meter), >>>>> ovn_lflow_hint(stage_hint), where); >>>>> hmapx_add(&lflow->od_group, od); >>>>> - hmap_insert_fast(lflow_map, &lflow->hmap_node, hash); >>>>> + if (!use_parallel_build) { >>>>> + hmap_insert(lflow_map, &lflow->hmap_node, hash); >>>>> + } else { >>>>> + hmap_insert_fast(lflow_map, &lflow->hmap_node, hash); >>>>> + } >>>>> } >>>>> /* Adds a row with the specified contents to the Logical_Flow table. */ >
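A minimal sketch of the "resize after lflow processing" idea: workers in the parallel path keep using hmap_insert_fast() (they cannot safely grow a map others are inserting into), and a single explicit expansion runs once the build has finished and the process is single-threaded again. hmap_expand() is the existing OVS helper that grows the bucket array to fit the current element count; the surrounding function is illustrative.

```c
#include "openvswitch/hmap.h"

/* Called once, after build_lflows() has completed and all worker threads
 * have merged their results back into 'lflows'. */
static void
post_build_resize(struct hmap *lflows, size_t *max_seen_lflow_size)
{
    if (hmap_count(lflows) > *max_seen_lflow_size) {
        *max_seen_lflow_size = hmap_count(lflows);
    }
    /* Single-threaded here, so one rehash is safe.  If the pre-sizing guess
     * was right, this is close to a no-op; if it was wrong, later datapath
     * lookups run against a properly sized table. */
    hmap_expand(lflows);
}
```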
On 8/24/21 7:36 AM, Anton Ivanov wrote: > On 23/08/2021 22:36, Ilya Maximets wrote: >> On 8/23/21 10:37 PM, Anton Ivanov wrote: >>> On 23/08/2021 21:26, Ilya Maximets wrote: >>>> On 8/23/21 10:20 PM, Anton Ivanov wrote: >>>>> Should not be the case. >>>>> >>>>> The map is pre-sized for the size from the previous iterations. >>>>> >>>>> Line 12861 in my tree which is probably a few commits out of date: >>>>> >>>>> fast_hmap_size_for(&lflows, max_seen_lflow_size); >>>>> >>>>> And immediately after building the lflows: >>>>> >>>>> if (hmap_count(&lflows) > max_seen_lflow_size) { >>>>> max_seen_lflow_size = hmap_count(&lflows); >>>>> } >>>>> >>>>> So the map SHOULD be sized correctly - to the most recent seen lflow count. >>>> Please, re-read the commit message. It's a first run of the loop >>>> and the 'max_seen_lflow_size' is default 128 at this point. >>> Ack, >>> >>> Not using auto-resizing in single threaded mode is a bug. Thanks for fixing it. >>> >>> From that perspective the patch is a straight +1 from me. >>> >>> From the perspective of the use case stated in the commit message- I am not sure it addresses it. >>> >>> If northd is in single-threaded mode and is tackling a GIGANTIC> database, it may never complete the first iteration before the >>> expiration of the timers and everyone deciding that northd is AWOL. >> Well, how do you suggest to fix that? Obviously, we can always create >> a database that northd will never be able to process in a reasonable >> amount of time. And it doesn't matter if it's single- or multi-threaded. >> >> In this case NbDB is only 9MB in size, which is very reasonable, and >> northd on my laptop takes more than 15 minutes to process it (I killed >> it at this point). With the patch applied it took only 11 seconds. >> So, for me, this patch pretty much fixes the issue. 11 seconds is not >> that bad, e.g. ovn-k8s configures inactivity probes for clients to 180. >> It would be great to reduce, but we're not there yet. >> >>> In that case, if it is multi-threaded from the start, it should probably >>> grab the sizing from the lflow table hash in south db. That would be a >>> good approximation for the first run. >> This will not work for a case where SbDB is empty for any reason while >> NbDB is not. And there is still a case where northd initially connects >> to semi-empty databases and after few iterations NbDB receives a big >> transaction and generates a big update for northd. > > A partial fix is to resize to optimum size the hash after lflow processing. I'm not sure I understand what you mean here, because resizing after lflow processing will not help. The lflow processing itself is the very slow part that we're trying to make faster here. > > If the sizing was correct - 99.9% of the case this will be a noop. > > If the sizing was incorrect, it will be resized so that the DP searches and all other ops which were recently added for flow reduction will work optimally. > > This still does not work for lflow compute with dpgroups + parallel upon initial connect and without a SB database to use for size guidance. It will work for all other cases. > > I will send two separate patches to address the cases which can be easily addressed and see what can be done with the dp+parallel upon initial connect to an empty sb database. > > Brgds, > > A > >> >>> A. >>> >>>>> A. >>>>> >>>>> On 23/08/2021 21:02, Ilya Maximets wrote: >>>>>> 'lflow_map' is never expanded because northd always uses fast >>>>>> insertion. 
This leads to the case where we have a hash map >>>>>> with only 128 initial buckets and every ovn_lflow_find() ends >>>>>> up iterating over n_lflows / 128 entries. It's thousands of >>>>>> logical flows or even more. For example, it takes forever for >>>>>> ovn-northd to start up with the Northbound Db from the 120 node >>>>>> density-heavy test from ovn-heater, because every lookup is slower >>>>>> than previous one. I aborted the process after 15 minutes of >>>>>> waiting, because there was no sign that it will converge. With >>>>>> this change applied the loop completes in only 11 seconds. >>>>>> >>>>>> Hash map will be pre-allocated to the maximum seen number of >>>>>> logical flows on a second iteration, but this doesn't help for >>>>>> the first iteration when northd first time connects to a big >>>>>> Northbound database, which is a common case during failover or >>>>>> cluster upgrade. And there is an even trickier case where big >>>>>> NbDB transaction that explodes the number of logical flows received >>>>>> on not the first run. >>>>>> >>>>>> We can't expand the hash map in case of parallel build, so this >>>>>> should be fixed separately. >>>>>> >>>>>> CC: Anton Ivanov <anton.ivanov@cambridgegreys.com> >>>>>> Fixes: 74daa0607c7f ("ovn-northd: Introduce parallel lflow build") >>>>>> Signed-off-by: Ilya Maximets <i.maximets@ovn.org> >>>>>> --- >>>>>> northd/ovn-northd.c | 6 +++++- >>>>>> 1 file changed, 5 insertions(+), 1 deletion(-) >>>>>> >>>>>> diff --git a/northd/ovn-northd.c b/northd/ovn-northd.c >>>>>> index 3d8e21a4f..40cf957c0 100644 >>>>>> --- a/northd/ovn-northd.c >>>>>> +++ b/northd/ovn-northd.c >>>>>> @@ -4387,7 +4387,11 @@ do_ovn_lflow_add(struct hmap *lflow_map, struct ovn_datapath *od, >>>>>> nullable_xstrdup(ctrl_meter), >>>>>> ovn_lflow_hint(stage_hint), where); >>>>>> hmapx_add(&lflow->od_group, od); >>>>>> - hmap_insert_fast(lflow_map, &lflow->hmap_node, hash); >>>>>> + if (!use_parallel_build) { >>>>>> + hmap_insert(lflow_map, &lflow->hmap_node, hash); >>>>>> + } else { >>>>>> + hmap_insert_fast(lflow_map, &lflow->hmap_node, hash); >>>>>> + } >>>>>> } >>>>>> /* Adds a row with the specified contents to the Logical_Flow table. */ >> >
On 24/08/2021 12:05, Ilya Maximets wrote: > On 8/24/21 7:36 AM, Anton Ivanov wrote: >> On 23/08/2021 22:36, Ilya Maximets wrote: >>> On 8/23/21 10:37 PM, Anton Ivanov wrote: >>>> On 23/08/2021 21:26, Ilya Maximets wrote: >>>>> On 8/23/21 10:20 PM, Anton Ivanov wrote: >>>>>> Should not be the case. >>>>>> >>>>>> The map is pre-sized for the size from the previous iterations. >>>>>> >>>>>> Line 12861 in my tree which is probably a few commits out of date: >>>>>> >>>>>> fast_hmap_size_for(&lflows, max_seen_lflow_size); >>>>>> >>>>>> And immediately after building the lflows: >>>>>> >>>>>> if (hmap_count(&lflows) > max_seen_lflow_size) { >>>>>> max_seen_lflow_size = hmap_count(&lflows); >>>>>> } >>>>>> >>>>>> So the map SHOULD be sized correctly - to the most recent seen lflow count. >>>>> Please, re-read the commit message. It's a first run of the loop >>>>> and the 'max_seen_lflow_size' is default 128 at this point. >>>> Ack, >>>> >>>> Not using auto-resizing in single threaded mode is a bug. Thanks for fixing it. >>>> >>>> From that perspective the patch is a straight +1 from me. >>>> >>>> From the perspective of the use case stated in the commit message- I am not sure it addresses it. >>>> >>>> If northd is in single-threaded mode and is tackling a GIGANTIC> database, it may never complete the first iteration before the >>>> expiration of the timers and everyone deciding that northd is AWOL. >>> Well, how do you suggest to fix that? Obviously, we can always create >>> a database that northd will never be able to process in a reasonable >>> amount of time. And it doesn't matter if it's single- or multi-threaded. >>> >>> In this case NbDB is only 9MB in size, which is very reasonable, and >>> northd on my laptop takes more than 15 minutes to process it (I killed >>> it at this point). With the patch applied it took only 11 seconds. >>> So, for me, this patch pretty much fixes the issue. 11 seconds is not >>> that bad, e.g. ovn-k8s configures inactivity probes for clients to 180. >>> It would be great to reduce, but we're not there yet. >>> >>>> In that case, if it is multi-threaded from the start, it should probably >>>> grab the sizing from the lflow table hash in south db. That would be a >>>> good approximation for the first run. >>> This will not work for a case where SbDB is empty for any reason while >>> NbDB is not. And there is still a case where northd initially connects >>> to semi-empty databases and after few iterations NbDB receives a big >>> transaction and generates a big update for northd. >> A partial fix is to resize to optimum size the hash after lflow processing. > I'm not sure I understand what you mean here, because resizing after > lflow processing will not help. The lflow processing itself is the > very slow part that we're trying to make faster here. That can be the case only with dpgroups. Otherwise lflows are just a destination to dump data and the bucket sizing is irrelevant because there is never any lookup inside lflows during processing. The lflow map is just used to store data. So if it is suboptimal at the exit of build_lflows() the resize will fix it before the first lookup shortly thereafter. Are you running it with dpgroups enabled? In that case there are lookups inside lflows during build which happen under a per-bucket lock. So in addition to suboptimal size when searching the contention depends on the number of buckets. If they are too few, the system becomes heavily contended resulting in ridiculous computation sizes. 
For the case of "dpgroups + parallel + first iteration + pre-existing large database" there is no cure short of pre-allocating the hash to maximum size. I am scale testing that as well as resize (for non-dp-groups cases) at present. Brgds, A. > >> If the sizing was correct - 99.9% of the case this will be a noop. >> >> If the sizing was incorrect, it will be resized so that the DP searches and all other ops which were recently added for flow reduction will work optimally. >> >> This still does not work for lflow compute with dpgroups + parallel upon initial connect and without a SB database to use for size guidance. It will work for all other cases. >> >> I will send two separate patches to address the cases which can be easily addressed and see what can be done with the dp+parallel upon initial connect to an empty sb database. >> >> Brgds, >> >> A >> >>>> A. >>>> >>>>>> A. >>>>>> >>>>>> On 23/08/2021 21:02, Ilya Maximets wrote: >>>>>>> 'lflow_map' is never expanded because northd always uses fast >>>>>>> insertion. This leads to the case where we have a hash map >>>>>>> with only 128 initial buckets and every ovn_lflow_find() ends >>>>>>> up iterating over n_lflows / 128 entries. It's thousands of >>>>>>> logical flows or even more. For example, it takes forever for >>>>>>> ovn-northd to start up with the Northbound Db from the 120 node >>>>>>> density-heavy test from ovn-heater, because every lookup is slower >>>>>>> than previous one. I aborted the process after 15 minutes of >>>>>>> waiting, because there was no sign that it will converge. With >>>>>>> this change applied the loop completes in only 11 seconds. >>>>>>> >>>>>>> Hash map will be pre-allocated to the maximum seen number of >>>>>>> logical flows on a second iteration, but this doesn't help for >>>>>>> the first iteration when northd first time connects to a big >>>>>>> Northbound database, which is a common case during failover or >>>>>>> cluster upgrade. And there is an even trickier case where big >>>>>>> NbDB transaction that explodes the number of logical flows received >>>>>>> on not the first run. >>>>>>> >>>>>>> We can't expand the hash map in case of parallel build, so this >>>>>>> should be fixed separately. >>>>>>> >>>>>>> CC: Anton Ivanov <anton.ivanov@cambridgegreys.com> >>>>>>> Fixes: 74daa0607c7f ("ovn-northd: Introduce parallel lflow build") >>>>>>> Signed-off-by: Ilya Maximets <i.maximets@ovn.org> >>>>>>> --- >>>>>>> northd/ovn-northd.c | 6 +++++- >>>>>>> 1 file changed, 5 insertions(+), 1 deletion(-) >>>>>>> >>>>>>> diff --git a/northd/ovn-northd.c b/northd/ovn-northd.c >>>>>>> index 3d8e21a4f..40cf957c0 100644 >>>>>>> --- a/northd/ovn-northd.c >>>>>>> +++ b/northd/ovn-northd.c >>>>>>> @@ -4387,7 +4387,11 @@ do_ovn_lflow_add(struct hmap *lflow_map, struct ovn_datapath *od, >>>>>>> nullable_xstrdup(ctrl_meter), >>>>>>> ovn_lflow_hint(stage_hint), where); >>>>>>> hmapx_add(&lflow->od_group, od); >>>>>>> - hmap_insert_fast(lflow_map, &lflow->hmap_node, hash); >>>>>>> + if (!use_parallel_build) { >>>>>>> + hmap_insert(lflow_map, &lflow->hmap_node, hash); >>>>>>> + } else { >>>>>>> + hmap_insert_fast(lflow_map, &lflow->hmap_node, hash); >>>>>>> + } >>>>>>> } >>>>>>> /* Adds a row with the specified contents to the Logical_Flow table. */ >
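A hypothetical sketch of the per-bucket locking described above; the names and structure are invented for illustration and are not lifted from ovn-northd.c. The point is that with dp-groups each worker both searches a bucket and inserts into it while holding that bucket's lock, so an undersized bucket array hurts twice: more threads collide on the same lock, and each holds it longer while walking a longer chain.

```c
#include <pthread.h>
#include "openvswitch/hmap.h"

/* One lock per bucket, allocated alongside the lflow map (hypothetical). */
static pthread_mutex_t *bucket_locks;

/* With dp-groups enabled, a worker first looks for an equal lflow so it can
 * just extend its datapath group; only if none exists does it insert a new
 * one.  Both steps happen under the lock for the target bucket. */
static void
lflow_add_or_extend(struct hmap *lflows, uint32_t hash /* , lflow fields */)
{
    pthread_mutex_t *lock = &bucket_locks[hash & lflows->mask];

    pthread_mutex_lock(lock);
    /* ovn_lflow_find(lflows, ..., hash): walk the chain for this bucket.  */
    /* Found: hmapx_add(&existing->od_group, od).                          */
    /* Not found: allocate the lflow and hmap_insert_fast() it.            */
    pthread_mutex_unlock(lock);
}
```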
On 8/24/21 1:18 PM, Anton Ivanov wrote: > > On 24/08/2021 12:05, Ilya Maximets wrote: >> On 8/24/21 7:36 AM, Anton Ivanov wrote: >>> On 23/08/2021 22:36, Ilya Maximets wrote: >>>> On 8/23/21 10:37 PM, Anton Ivanov wrote: >>>>> On 23/08/2021 21:26, Ilya Maximets wrote: >>>>>> On 8/23/21 10:20 PM, Anton Ivanov wrote: >>>>>>> Should not be the case. >>>>>>> >>>>>>> The map is pre-sized for the size from the previous iterations. >>>>>>> >>>>>>> Line 12861 in my tree which is probably a few commits out of date: >>>>>>> >>>>>>> fast_hmap_size_for(&lflows, max_seen_lflow_size); >>>>>>> >>>>>>> And immediately after building the lflows: >>>>>>> >>>>>>> if (hmap_count(&lflows) > max_seen_lflow_size) { >>>>>>> max_seen_lflow_size = hmap_count(&lflows); >>>>>>> } >>>>>>> >>>>>>> So the map SHOULD be sized correctly - to the most recent seen lflow count. >>>>>> Please, re-read the commit message. It's a first run of the loop >>>>>> and the 'max_seen_lflow_size' is default 128 at this point. >>>>> Ack, >>>>> >>>>> Not using auto-resizing in single threaded mode is a bug. Thanks for fixing it. >>>>> >>>>> From that perspective the patch is a straight +1 from me. >>>>> >>>>> From the perspective of the use case stated in the commit message- I am not sure it addresses it. >>>>> >>>>> If northd is in single-threaded mode and is tackling a GIGANTIC> database, it may never complete the first iteration before the >>>>> expiration of the timers and everyone deciding that northd is AWOL. >>>> Well, how do you suggest to fix that? Obviously, we can always create >>>> a database that northd will never be able to process in a reasonable >>>> amount of time. And it doesn't matter if it's single- or multi-threaded. >>>> >>>> In this case NbDB is only 9MB in size, which is very reasonable, and >>>> northd on my laptop takes more than 15 minutes to process it (I killed >>>> it at this point). With the patch applied it took only 11 seconds. >>>> So, for me, this patch pretty much fixes the issue. 11 seconds is not >>>> that bad, e.g. ovn-k8s configures inactivity probes for clients to 180. >>>> It would be great to reduce, but we're not there yet. >>>> >>>>> In that case, if it is multi-threaded from the start, it should probably >>>>> grab the sizing from the lflow table hash in south db. That would be a >>>>> good approximation for the first run. >>>> This will not work for a case where SbDB is empty for any reason while >>>> NbDB is not. And there is still a case where northd initially connects >>>> to semi-empty databases and after few iterations NbDB receives a big >>>> transaction and generates a big update for northd. >>> A partial fix is to resize to optimum size the hash after lflow processing. >> I'm not sure I understand what you mean here, because resizing after >> lflow processing will not help. The lflow processing itself is the >> very slow part that we're trying to make faster here. > > That can be the case only with dpgroups. Otherwise lflows are just a destination to dump data and the bucket sizing is irrelevant because there is never any lookup inside lflows during processing. The lflow map is just used to store data. So if it is suboptimal at the exit of build_lflows() the resize will fix it before the first lookup shortly thereafter. > > Are you running it with dpgroups enabled? In that case there are lookups inside lflows during build which happen under a per-bucket lock. So in addition to suboptimal size when searching the contention depends on the number of buckets. 
If they are too few, the system becomes heavily contended resulting in ridiculous computation sizes. Oh, I see. Indeed, without dp-groups there is no lookup during lflow build. I missed that. So, yes, I agree that for a case without dp-groups, re-sizing after lflow processing should work. We need that for parallel case. Current patch (use hmap_insert() that resizes if needed) helps for all non-parallel cases. I'm mostly running dp-groups + non-parallel which is a default case for ovn-heater/ovn-k8s. > > For the case of "dpgroups + parallel + first iteration + pre-existing large database" there is no cure short of pre-allocating the hash to maximum size. Yeah, dp-groups + parallel is a hard case. > > I am scale testing that as well as resize (for non-dp-groups cases) at present. > > Brgds, > > A. > >> >>> If the sizing was correct - 99.9% of the case this will be a noop. >>> >>> If the sizing was incorrect, it will be resized so that the DP searches and all other ops which were recently added for flow reduction will work optimally. >>> >>> This still does not work for lflow compute with dpgroups + parallel upon initial connect and without a SB database to use for size guidance. It will work for all other cases. >>> >>> I will send two separate patches to address the cases which can be easily addressed and see what can be done with the dp+parallel upon initial connect to an empty sb database. >>> >>> Brgds, >>> >>> A >>> >>>>> A. >>>>> >>>>>>> A. >>>>>>> >>>>>>> On 23/08/2021 21:02, Ilya Maximets wrote: >>>>>>>> 'lflow_map' is never expanded because northd always uses fast >>>>>>>> insertion. This leads to the case where we have a hash map >>>>>>>> with only 128 initial buckets and every ovn_lflow_find() ends >>>>>>>> up iterating over n_lflows / 128 entries. It's thousands of >>>>>>>> logical flows or even more. For example, it takes forever for >>>>>>>> ovn-northd to start up with the Northbound Db from the 120 node >>>>>>>> density-heavy test from ovn-heater, because every lookup is slower >>>>>>>> than previous one. I aborted the process after 15 minutes of >>>>>>>> waiting, because there was no sign that it will converge. With >>>>>>>> this change applied the loop completes in only 11 seconds. >>>>>>>> >>>>>>>> Hash map will be pre-allocated to the maximum seen number of >>>>>>>> logical flows on a second iteration, but this doesn't help for >>>>>>>> the first iteration when northd first time connects to a big >>>>>>>> Northbound database, which is a common case during failover or >>>>>>>> cluster upgrade. And there is an even trickier case where big >>>>>>>> NbDB transaction that explodes the number of logical flows received >>>>>>>> on not the first run. >>>>>>>> >>>>>>>> We can't expand the hash map in case of parallel build, so this >>>>>>>> should be fixed separately. 
>>>>>>>> >>>>>>>> CC: Anton Ivanov <anton.ivanov@cambridgegreys.com> >>>>>>>> Fixes: 74daa0607c7f ("ovn-northd: Introduce parallel lflow build") >>>>>>>> Signed-off-by: Ilya Maximets <i.maximets@ovn.org> >>>>>>>> --- >>>>>>>> northd/ovn-northd.c | 6 +++++- >>>>>>>> 1 file changed, 5 insertions(+), 1 deletion(-) >>>>>>>> >>>>>>>> diff --git a/northd/ovn-northd.c b/northd/ovn-northd.c >>>>>>>> index 3d8e21a4f..40cf957c0 100644 >>>>>>>> --- a/northd/ovn-northd.c >>>>>>>> +++ b/northd/ovn-northd.c >>>>>>>> @@ -4387,7 +4387,11 @@ do_ovn_lflow_add(struct hmap *lflow_map, struct ovn_datapath *od, >>>>>>>> nullable_xstrdup(ctrl_meter), >>>>>>>> ovn_lflow_hint(stage_hint), where); >>>>>>>> hmapx_add(&lflow->od_group, od); >>>>>>>> - hmap_insert_fast(lflow_map, &lflow->hmap_node, hash); >>>>>>>> + if (!use_parallel_build) { >>>>>>>> + hmap_insert(lflow_map, &lflow->hmap_node, hash); >>>>>>>> + } else { >>>>>>>> + hmap_insert_fast(lflow_map, &lflow->hmap_node, hash); >>>>>>>> + } >>>>>>>> } >>>>>>>> /* Adds a row with the specified contents to the Logical_Flow table. */ >>
On 24/08/2021 12:46, Ilya Maximets wrote: > On 8/24/21 1:18 PM, Anton Ivanov wrote: >> On 24/08/2021 12:05, Ilya Maximets wrote: >>> On 8/24/21 7:36 AM, Anton Ivanov wrote: >>>> On 23/08/2021 22:36, Ilya Maximets wrote: >>>>> On 8/23/21 10:37 PM, Anton Ivanov wrote: >>>>>> On 23/08/2021 21:26, Ilya Maximets wrote: >>>>>>> On 8/23/21 10:20 PM, Anton Ivanov wrote: >>>>>>>> Should not be the case. >>>>>>>> >>>>>>>> The map is pre-sized for the size from the previous iterations. >>>>>>>> >>>>>>>> Line 12861 in my tree which is probably a few commits out of date: >>>>>>>> >>>>>>>> fast_hmap_size_for(&lflows, max_seen_lflow_size); >>>>>>>> >>>>>>>> And immediately after building the lflows: >>>>>>>> >>>>>>>> if (hmap_count(&lflows) > max_seen_lflow_size) { >>>>>>>> max_seen_lflow_size = hmap_count(&lflows); >>>>>>>> } >>>>>>>> >>>>>>>> So the map SHOULD be sized correctly - to the most recent seen lflow count. >>>>>>> Please, re-read the commit message. It's a first run of the loop >>>>>>> and the 'max_seen_lflow_size' is default 128 at this point. >>>>>> Ack, >>>>>> >>>>>> Not using auto-resizing in single threaded mode is a bug. Thanks for fixing it. >>>>>> >>>>>> From that perspective the patch is a straight +1 from me. >>>>>> >>>>>> From the perspective of the use case stated in the commit message- I am not sure it addresses it. >>>>>> >>>>>> If northd is in single-threaded mode and is tackling a GIGANTIC> database, it may never complete the first iteration before the >>>>>> expiration of the timers and everyone deciding that northd is AWOL. >>>>> Well, how do you suggest to fix that? Obviously, we can always create >>>>> a database that northd will never be able to process in a reasonable >>>>> amount of time. And it doesn't matter if it's single- or multi-threaded. >>>>> >>>>> In this case NbDB is only 9MB in size, which is very reasonable, and >>>>> northd on my laptop takes more than 15 minutes to process it (I killed >>>>> it at this point). With the patch applied it took only 11 seconds. >>>>> So, for me, this patch pretty much fixes the issue. 11 seconds is not >>>>> that bad, e.g. ovn-k8s configures inactivity probes for clients to 180. >>>>> It would be great to reduce, but we're not there yet. >>>>> >>>>>> In that case, if it is multi-threaded from the start, it should probably >>>>>> grab the sizing from the lflow table hash in south db. That would be a >>>>>> good approximation for the first run. >>>>> This will not work for a case where SbDB is empty for any reason while >>>>> NbDB is not. And there is still a case where northd initially connects >>>>> to semi-empty databases and after few iterations NbDB receives a big >>>>> transaction and generates a big update for northd. >>>> A partial fix is to resize to optimum size the hash after lflow processing. >>> I'm not sure I understand what you mean here, because resizing after >>> lflow processing will not help. The lflow processing itself is the >>> very slow part that we're trying to make faster here. >> That can be the case only with dpgroups. Otherwise lflows are just a destination to dump data and the bucket sizing is irrelevant because there is never any lookup inside lflows during processing. The lflow map is just used to store data. So if it is suboptimal at the exit of build_lflows() the resize will fix it before the first lookup shortly thereafter. >> >> Are you running it with dpgroups enabled? In that case there are lookups inside lflows during build which happen under a per-bucket lock. 
So in addition to suboptimal size when searching the contention depends on the number of buckets. If they are too few, the system becomes heavily contended resulting in ridiculous computation sizes. > Oh, I see. Indeed, without dp-groups there is no lookup during > lflow build. I missed that. So, yes, I agree that for a case > without dp-groups, re-sizing after lflow processing should work. > We need that for parallel case. > > Current patch (use hmap_insert() that resizes if needed) helps > for all non-parallel cases. Indeed. It should go in. I will sort out the other cases to the extent possible. Brgds, A. > > I'm mostly running dp-groups + non-parallel which is a default > case for ovn-heater/ovn-k8s. > >> For the case of "dpgroups + parallel + first iteration + pre-existing large database" there is no cure short of pre-allocating the hash to maximum size. > Yeah, dp-groups + parallel is a hard case. > >> I am scale testing that as well as resize (for non-dp-groups cases) at present. >> >> Brgds, >> >> A. >> >>>> If the sizing was correct - 99.9% of the case this will be a noop. >>>> >>>> If the sizing was incorrect, it will be resized so that the DP searches and all other ops which were recently added for flow reduction will work optimally. >>>> >>>> This still does not work for lflow compute with dpgroups + parallel upon initial connect and without a SB database to use for size guidance. It will work for all other cases. >>>> >>>> I will send two separate patches to address the cases which can be easily addressed and see what can be done with the dp+parallel upon initial connect to an empty sb database. >>>> >>>> Brgds, >>>> >>>> A >>>> >>>>>> A. >>>>>> >>>>>>>> A. >>>>>>>> >>>>>>>> On 23/08/2021 21:02, Ilya Maximets wrote: >>>>>>>>> 'lflow_map' is never expanded because northd always uses fast >>>>>>>>> insertion. This leads to the case where we have a hash map >>>>>>>>> with only 128 initial buckets and every ovn_lflow_find() ends >>>>>>>>> up iterating over n_lflows / 128 entries. It's thousands of >>>>>>>>> logical flows or even more. For example, it takes forever for >>>>>>>>> ovn-northd to start up with the Northbound Db from the 120 node >>>>>>>>> density-heavy test from ovn-heater, because every lookup is slower >>>>>>>>> than previous one. I aborted the process after 15 minutes of >>>>>>>>> waiting, because there was no sign that it will converge. With >>>>>>>>> this change applied the loop completes in only 11 seconds. >>>>>>>>> >>>>>>>>> Hash map will be pre-allocated to the maximum seen number of >>>>>>>>> logical flows on a second iteration, but this doesn't help for >>>>>>>>> the first iteration when northd first time connects to a big >>>>>>>>> Northbound database, which is a common case during failover or >>>>>>>>> cluster upgrade. And there is an even trickier case where big >>>>>>>>> NbDB transaction that explodes the number of logical flows received >>>>>>>>> on not the first run. >>>>>>>>> >>>>>>>>> We can't expand the hash map in case of parallel build, so this >>>>>>>>> should be fixed separately. 
>>>>>>>>> >>>>>>>>> CC: Anton Ivanov <anton.ivanov@cambridgegreys.com> >>>>>>>>> Fixes: 74daa0607c7f ("ovn-northd: Introduce parallel lflow build") >>>>>>>>> Signed-off-by: Ilya Maximets <i.maximets@ovn.org> >>>>>>>>> --- >>>>>>>>> northd/ovn-northd.c | 6 +++++- >>>>>>>>> 1 file changed, 5 insertions(+), 1 deletion(-) >>>>>>>>> >>>>>>>>> diff --git a/northd/ovn-northd.c b/northd/ovn-northd.c >>>>>>>>> index 3d8e21a4f..40cf957c0 100644 >>>>>>>>> --- a/northd/ovn-northd.c >>>>>>>>> +++ b/northd/ovn-northd.c >>>>>>>>> @@ -4387,7 +4387,11 @@ do_ovn_lflow_add(struct hmap *lflow_map, struct ovn_datapath *od, >>>>>>>>> nullable_xstrdup(ctrl_meter), >>>>>>>>> ovn_lflow_hint(stage_hint), where); >>>>>>>>> hmapx_add(&lflow->od_group, od); >>>>>>>>> - hmap_insert_fast(lflow_map, &lflow->hmap_node, hash); >>>>>>>>> + if (!use_parallel_build) { >>>>>>>>> + hmap_insert(lflow_map, &lflow->hmap_node, hash); >>>>>>>>> + } else { >>>>>>>>> + hmap_insert_fast(lflow_map, &lflow->hmap_node, hash); >>>>>>>>> + } >>>>>>>>> } >>>>>>>>> /* Adds a row with the specified contents to the Logical_Flow table. */ >
On Tue, Aug 24, 2021 at 7:56 AM Anton Ivanov <anton.ivanov@cambridgegreys.com> wrote: > > > On 24/08/2021 12:46, Ilya Maximets wrote: > > On 8/24/21 1:18 PM, Anton Ivanov wrote: > >> On 24/08/2021 12:05, Ilya Maximets wrote: > >>> On 8/24/21 7:36 AM, Anton Ivanov wrote: > >>>> On 23/08/2021 22:36, Ilya Maximets wrote: > >>>>> On 8/23/21 10:37 PM, Anton Ivanov wrote: > >>>>>> On 23/08/2021 21:26, Ilya Maximets wrote: > >>>>>>> On 8/23/21 10:20 PM, Anton Ivanov wrote: > >>>>>>>> Should not be the case. > >>>>>>>> > >>>>>>>> The map is pre-sized for the size from the previous iterations. > >>>>>>>> > >>>>>>>> Line 12861 in my tree which is probably a few commits out of date: > >>>>>>>> > >>>>>>>> fast_hmap_size_for(&lflows, max_seen_lflow_size); > >>>>>>>> > >>>>>>>> And immediately after building the lflows: > >>>>>>>> > >>>>>>>> if (hmap_count(&lflows) > max_seen_lflow_size) { > >>>>>>>> max_seen_lflow_size = hmap_count(&lflows); > >>>>>>>> } > >>>>>>>> > >>>>>>>> So the map SHOULD be sized correctly - to the most recent seen lflow count. > >>>>>>> Please, re-read the commit message. It's a first run of the loop > >>>>>>> and the 'max_seen_lflow_size' is default 128 at this point. > >>>>>> Ack, > >>>>>> > >>>>>> Not using auto-resizing in single threaded mode is a bug. Thanks for fixing it. > >>>>>> > >>>>>> From that perspective the patch is a straight +1 from me. > >>>>>> > >>>>>> From the perspective of the use case stated in the commit message- I am not sure it addresses it. > >>>>>> > >>>>>> If northd is in single-threaded mode and is tackling a GIGANTIC> database, it may never complete the first iteration before the > >>>>>> expiration of the timers and everyone deciding that northd is AWOL. > >>>>> Well, how do you suggest to fix that? Obviously, we can always create > >>>>> a database that northd will never be able to process in a reasonable > >>>>> amount of time. And it doesn't matter if it's single- or multi-threaded. > >>>>> > >>>>> In this case NbDB is only 9MB in size, which is very reasonable, and > >>>>> northd on my laptop takes more than 15 minutes to process it (I killed > >>>>> it at this point). With the patch applied it took only 11 seconds. > >>>>> So, for me, this patch pretty much fixes the issue. 11 seconds is not > >>>>> that bad, e.g. ovn-k8s configures inactivity probes for clients to 180. > >>>>> It would be great to reduce, but we're not there yet. > >>>>> > >>>>>> In that case, if it is multi-threaded from the start, it should probably > >>>>>> grab the sizing from the lflow table hash in south db. That would be a > >>>>>> good approximation for the first run. > >>>>> This will not work for a case where SbDB is empty for any reason while > >>>>> NbDB is not. And there is still a case where northd initially connects > >>>>> to semi-empty databases and after few iterations NbDB receives a big > >>>>> transaction and generates a big update for northd. > >>>> A partial fix is to resize to optimum size the hash after lflow processing. > >>> I'm not sure I understand what you mean here, because resizing after > >>> lflow processing will not help. The lflow processing itself is the > >>> very slow part that we're trying to make faster here. > >> That can be the case only with dpgroups. Otherwise lflows are just a destination to dump data and the bucket sizing is irrelevant because there is never any lookup inside lflows during processing. The lflow map is just used to store data. 
So if it is suboptimal at the exit of build_lflows() the resize will fix it before the first lookup shortly thereafter. > >> > >> Are you running it with dpgroups enabled? In that case there are lookups inside lflows during build which happen under a per-bucket lock. So in addition to suboptimal size when searching the contention depends on the number of buckets. If they are too few, the system becomes heavily contended resulting in ridiculous computation sizes. > > Oh, I see. Indeed, without dp-groups there is no lookup during > > lflow build. I missed that. So, yes, I agree that for a case > > without dp-groups, re-sizing after lflow processing should work. > > We need that for parallel case. > > > > Current patch (use hmap_insert() that resizes if needed) helps > > for all non-parallel cases. > > Indeed. It should go in. > Why can't we have hmap_insert() for both parallel and non parallel configs to start with and switch over to hmap_insert_fast() when ovn-northd has successfully connected to SB DB and has approximated on the more accurate hmap size ? Thanks Numan > I will sort out the other cases to the extent possible. > > Brgds, > > A. > > > > > I'm mostly running dp-groups + non-parallel which is a default > > case for ovn-heater/ovn-k8s. > > > >> For the case of "dpgroups + parallel + first iteration + pre-existing large database" there is no cure short of pre-allocating the hash to maximum size. > > Yeah, dp-groups + parallel is a hard case. > > > >> I am scale testing that as well as resize (for non-dp-groups cases) at present. > >> > >> Brgds, > >> > >> A. > >> > >>>> If the sizing was correct - 99.9% of the case this will be a noop. > >>>> > >>>> If the sizing was incorrect, it will be resized so that the DP searches and all other ops which were recently added for flow reduction will work optimally. > >>>> > >>>> This still does not work for lflow compute with dpgroups + parallel upon initial connect and without a SB database to use for size guidance. It will work for all other cases. > >>>> > >>>> I will send two separate patches to address the cases which can be easily addressed and see what can be done with the dp+parallel upon initial connect to an empty sb database. > >>>> > >>>> Brgds, > >>>> > >>>> A > >>>> > >>>>>> A. > >>>>>> > >>>>>>>> A. > >>>>>>>> > >>>>>>>> On 23/08/2021 21:02, Ilya Maximets wrote: > >>>>>>>>> 'lflow_map' is never expanded because northd always uses fast > >>>>>>>>> insertion. This leads to the case where we have a hash map > >>>>>>>>> with only 128 initial buckets and every ovn_lflow_find() ends > >>>>>>>>> up iterating over n_lflows / 128 entries. It's thousands of > >>>>>>>>> logical flows or even more. For example, it takes forever for > >>>>>>>>> ovn-northd to start up with the Northbound Db from the 120 node > >>>>>>>>> density-heavy test from ovn-heater, because every lookup is slower > >>>>>>>>> than previous one. I aborted the process after 15 minutes of > >>>>>>>>> waiting, because there was no sign that it will converge. With > >>>>>>>>> this change applied the loop completes in only 11 seconds. > >>>>>>>>> > >>>>>>>>> Hash map will be pre-allocated to the maximum seen number of > >>>>>>>>> logical flows on a second iteration, but this doesn't help for > >>>>>>>>> the first iteration when northd first time connects to a big > >>>>>>>>> Northbound database, which is a common case during failover or > >>>>>>>>> cluster upgrade. 
And there is an even trickier case where big > >>>>>>>>> NbDB transaction that explodes the number of logical flows received > >>>>>>>>> on not the first run. > >>>>>>>>> > >>>>>>>>> We can't expand the hash map in case of parallel build, so this > >>>>>>>>> should be fixed separately. > >>>>>>>>> > >>>>>>>>> CC: Anton Ivanov <anton.ivanov@cambridgegreys.com> > >>>>>>>>> Fixes: 74daa0607c7f ("ovn-northd: Introduce parallel lflow build") > >>>>>>>>> Signed-off-by: Ilya Maximets <i.maximets@ovn.org> > >>>>>>>>> --- > >>>>>>>>> northd/ovn-northd.c | 6 +++++- > >>>>>>>>> 1 file changed, 5 insertions(+), 1 deletion(-) > >>>>>>>>> > >>>>>>>>> diff --git a/northd/ovn-northd.c b/northd/ovn-northd.c > >>>>>>>>> index 3d8e21a4f..40cf957c0 100644 > >>>>>>>>> --- a/northd/ovn-northd.c > >>>>>>>>> +++ b/northd/ovn-northd.c > >>>>>>>>> @@ -4387,7 +4387,11 @@ do_ovn_lflow_add(struct hmap *lflow_map, struct ovn_datapath *od, > >>>>>>>>> nullable_xstrdup(ctrl_meter), > >>>>>>>>> ovn_lflow_hint(stage_hint), where); > >>>>>>>>> hmapx_add(&lflow->od_group, od); > >>>>>>>>> - hmap_insert_fast(lflow_map, &lflow->hmap_node, hash); > >>>>>>>>> + if (!use_parallel_build) { > >>>>>>>>> + hmap_insert(lflow_map, &lflow->hmap_node, hash); > >>>>>>>>> + } else { > >>>>>>>>> + hmap_insert_fast(lflow_map, &lflow->hmap_node, hash); > >>>>>>>>> + } > >>>>>>>>> } > >>>>>>>>> /* Adds a row with the specified contents to the Logical_Flow table. */ > > > -- > Anton R. Ivanov > Cambridgegreys Limited. Registered in England. Company Number 10273661 > https://www.cambridgegreys.com/ > > _______________________________________________ > dev mailing list > dev@openvswitch.org > https://mail.openvswitch.org/mailman/listinfo/ovs-dev
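A hypothetical sketch of the suggestion in the question above, with an invented 'lflow_map_properly_sized' flag (not an existing northd variable): keep the resizing hmap_insert() until the map has been sized from real data at least once, then let parallel workers use hmap_insert_fast(). The reply that follows explains why this is still problematic for the parallel path.

```c
/* Invented flag: set to true once max_seen_lflow_size reflects a real run. */
static bool lflow_map_properly_sized = false;

/* In do_ovn_lflow_add() (illustrative placement): */
if (use_parallel_build && lflow_map_properly_sized) {
    hmap_insert_fast(lflow_map, &lflow->hmap_node, hash);
} else {
    hmap_insert(lflow_map, &lflow->hmap_node, hash);
}

/* After each completed build: */
if (hmap_count(&lflows) > max_seen_lflow_size) {
    max_seen_lflow_size = hmap_count(&lflows);
}
lflow_map_properly_sized = true;
```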
On 8/24/21 6:25 PM, Numan Siddique wrote: > On Tue, Aug 24, 2021 at 7:56 AM Anton Ivanov > <anton.ivanov@cambridgegreys.com> wrote: >> >> >> On 24/08/2021 12:46, Ilya Maximets wrote: >>> On 8/24/21 1:18 PM, Anton Ivanov wrote: >>>> On 24/08/2021 12:05, Ilya Maximets wrote: >>>>> On 8/24/21 7:36 AM, Anton Ivanov wrote: >>>>>> On 23/08/2021 22:36, Ilya Maximets wrote: >>>>>>> On 8/23/21 10:37 PM, Anton Ivanov wrote: >>>>>>>> On 23/08/2021 21:26, Ilya Maximets wrote: >>>>>>>>> On 8/23/21 10:20 PM, Anton Ivanov wrote: >>>>>>>>>> Should not be the case. >>>>>>>>>> >>>>>>>>>> The map is pre-sized for the size from the previous iterations. >>>>>>>>>> >>>>>>>>>> Line 12861 in my tree which is probably a few commits out of date: >>>>>>>>>> >>>>>>>>>> fast_hmap_size_for(&lflows, max_seen_lflow_size); >>>>>>>>>> >>>>>>>>>> And immediately after building the lflows: >>>>>>>>>> >>>>>>>>>> if (hmap_count(&lflows) > max_seen_lflow_size) { >>>>>>>>>> max_seen_lflow_size = hmap_count(&lflows); >>>>>>>>>> } >>>>>>>>>> >>>>>>>>>> So the map SHOULD be sized correctly - to the most recent seen lflow count. >>>>>>>>> Please, re-read the commit message. It's a first run of the loop >>>>>>>>> and the 'max_seen_lflow_size' is default 128 at this point. >>>>>>>> Ack, >>>>>>>> >>>>>>>> Not using auto-resizing in single threaded mode is a bug. Thanks for fixing it. >>>>>>>> >>>>>>>> From that perspective the patch is a straight +1 from me. >>>>>>>> >>>>>>>> From the perspective of the use case stated in the commit message- I am not sure it addresses it. >>>>>>>> >>>>>>>> If northd is in single-threaded mode and is tackling a GIGANTIC> database, it may never complete the first iteration before the >>>>>>>> expiration of the timers and everyone deciding that northd is AWOL. >>>>>>> Well, how do you suggest to fix that? Obviously, we can always create >>>>>>> a database that northd will never be able to process in a reasonable >>>>>>> amount of time. And it doesn't matter if it's single- or multi-threaded. >>>>>>> >>>>>>> In this case NbDB is only 9MB in size, which is very reasonable, and >>>>>>> northd on my laptop takes more than 15 minutes to process it (I killed >>>>>>> it at this point). With the patch applied it took only 11 seconds. >>>>>>> So, for me, this patch pretty much fixes the issue. 11 seconds is not >>>>>>> that bad, e.g. ovn-k8s configures inactivity probes for clients to 180. >>>>>>> It would be great to reduce, but we're not there yet. >>>>>>> >>>>>>>> In that case, if it is multi-threaded from the start, it should probably >>>>>>>> grab the sizing from the lflow table hash in south db. That would be a >>>>>>>> good approximation for the first run. >>>>>>> This will not work for a case where SbDB is empty for any reason while >>>>>>> NbDB is not. And there is still a case where northd initially connects >>>>>>> to semi-empty databases and after few iterations NbDB receives a big >>>>>>> transaction and generates a big update for northd. >>>>>> A partial fix is to resize to optimum size the hash after lflow processing. >>>>> I'm not sure I understand what you mean here, because resizing after >>>>> lflow processing will not help. The lflow processing itself is the >>>>> very slow part that we're trying to make faster here. >>>> That can be the case only with dpgroups. Otherwise lflows are just a destination to dump data and the bucket sizing is irrelevant because there is never any lookup inside lflows during processing. The lflow map is just used to store data. 
So if it is suboptimal at the exit of build_lflows() the resize will fix it before the first lookup shortly thereafter. >>>> >>>> Are you running it with dpgroups enabled? In that case there are lookups inside lflows during build which happen under a per-bucket lock. So in addition to suboptimal size when searching the contention depends on the number of buckets. If they are too few, the system becomes heavily contended resulting in ridiculous computation sizes. >>> Oh, I see. Indeed, without dp-groups there is no lookup during >>> lflow build. I missed that. So, yes, I agree that for a case >>> without dp-groups, re-sizing after lflow processing should work. >>> We need that for parallel case. >>> >>> Current patch (use hmap_insert() that resizes if needed) helps >>> for all non-parallel cases. >> >> Indeed. It should go in. >> > > Why can't we have hmap_insert() for both parallel and non parallel configs > to start with and switch over to hmap_insert_fast() when ovn-northd > has successfully connected to SB DB and has approximated on the > more accurate hmap size ? We can't use hmap_insert() for parallel, because resize of a hash map will crash threads that are working on it at the moment, IIUC. We could disable parallel for first several iterations, but still, this doesn't cover all the cases. i.e. the one where we have a big update from the NbDB with a fairly small DB before that. > > Thanks > Numan > >> I will sort out the other cases to the extent possible. > > >> >> Brgds, >> >> A. >> >>> >>> I'm mostly running dp-groups + non-parallel which is a default >>> case for ovn-heater/ovn-k8s. >>> >>>> For the case of "dpgroups + parallel + first iteration + pre-existing large database" there is no cure short of pre-allocating the hash to maximum size. >>> Yeah, dp-groups + parallel is a hard case. >>> >>>> I am scale testing that as well as resize (for non-dp-groups cases) at present. >>>> >>>> Brgds, >>>> >>>> A. >>>> >>>>>> If the sizing was correct - 99.9% of the case this will be a noop. >>>>>> >>>>>> If the sizing was incorrect, it will be resized so that the DP searches and all other ops which were recently added for flow reduction will work optimally. >>>>>> >>>>>> This still does not work for lflow compute with dpgroups + parallel upon initial connect and without a SB database to use for size guidance. It will work for all other cases. >>>>>> >>>>>> I will send two separate patches to address the cases which can be easily addressed and see what can be done with the dp+parallel upon initial connect to an empty sb database. >>>>>> >>>>>> Brgds, >>>>>> >>>>>> A >>>>>> >>>>>>>> A. >>>>>>>> >>>>>>>>>> A. >>>>>>>>>> >>>>>>>>>> On 23/08/2021 21:02, Ilya Maximets wrote: >>>>>>>>>>> 'lflow_map' is never expanded because northd always uses fast >>>>>>>>>>> insertion. This leads to the case where we have a hash map >>>>>>>>>>> with only 128 initial buckets and every ovn_lflow_find() ends >>>>>>>>>>> up iterating over n_lflows / 128 entries. It's thousands of >>>>>>>>>>> logical flows or even more. For example, it takes forever for >>>>>>>>>>> ovn-northd to start up with the Northbound Db from the 120 node >>>>>>>>>>> density-heavy test from ovn-heater, because every lookup is slower >>>>>>>>>>> than previous one. I aborted the process after 15 minutes of >>>>>>>>>>> waiting, because there was no sign that it will converge. With >>>>>>>>>>> this change applied the loop completes in only 11 seconds. 
>>>>>>>>>>> >>>>>>>>>>> Hash map will be pre-allocated to the maximum seen number of >>>>>>>>>>> logical flows on a second iteration, but this doesn't help for >>>>>>>>>>> the first iteration when northd first time connects to a big >>>>>>>>>>> Northbound database, which is a common case during failover or >>>>>>>>>>> cluster upgrade. And there is an even trickier case where big >>>>>>>>>>> NbDB transaction that explodes the number of logical flows received >>>>>>>>>>> on not the first run. >>>>>>>>>>> >>>>>>>>>>> We can't expand the hash map in case of parallel build, so this >>>>>>>>>>> should be fixed separately. >>>>>>>>>>> >>>>>>>>>>> CC: Anton Ivanov <anton.ivanov@cambridgegreys.com> >>>>>>>>>>> Fixes: 74daa0607c7f ("ovn-northd: Introduce parallel lflow build") >>>>>>>>>>> Signed-off-by: Ilya Maximets <i.maximets@ovn.org> >>>>>>>>>>> --- >>>>>>>>>>> northd/ovn-northd.c | 6 +++++- >>>>>>>>>>> 1 file changed, 5 insertions(+), 1 deletion(-) >>>>>>>>>>> >>>>>>>>>>> diff --git a/northd/ovn-northd.c b/northd/ovn-northd.c >>>>>>>>>>> index 3d8e21a4f..40cf957c0 100644 >>>>>>>>>>> --- a/northd/ovn-northd.c >>>>>>>>>>> +++ b/northd/ovn-northd.c >>>>>>>>>>> @@ -4387,7 +4387,11 @@ do_ovn_lflow_add(struct hmap *lflow_map, struct ovn_datapath *od, >>>>>>>>>>> nullable_xstrdup(ctrl_meter), >>>>>>>>>>> ovn_lflow_hint(stage_hint), where); >>>>>>>>>>> hmapx_add(&lflow->od_group, od); >>>>>>>>>>> - hmap_insert_fast(lflow_map, &lflow->hmap_node, hash); >>>>>>>>>>> + if (!use_parallel_build) { >>>>>>>>>>> + hmap_insert(lflow_map, &lflow->hmap_node, hash); >>>>>>>>>>> + } else { >>>>>>>>>>> + hmap_insert_fast(lflow_map, &lflow->hmap_node, hash); >>>>>>>>>>> + } >>>>>>>>>>> } >>>>>>>>>>> /* Adds a row with the specified contents to the Logical_Flow table. */ >>> >> -- >> Anton R. Ivanov >> Cambridgegreys Limited. Registered in England. Company Number 10273661 >> https://www.cambridgegreys.com/ >> >> _______________________________________________ >> dev mailing list >> dev@openvswitch.org >> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
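A sketch of the "re-size after lflow processing" idea agreed above for the parallel, non-dp-groups path, assuming the OVS hmap API (the function and its placement are illustrative, not the actual ovn-northd code): once all threads have merged their flows with hmap_insert_fast() and the map is single-threaded again, a one-time hmap_expand() restores short bucket chains before the lookups that follow.

    #include "openvswitch/hmap.h"

    static void
    demo_post_build_resize(struct hmap *lflows, size_t *max_seen_lflow_size)
    {
        /* Safe only after the parallel merge is finished and no other
         * thread touches 'lflows' any more. */
        hmap_expand(lflows);

        /* Remember the size so the next iteration can pre-size the map. */
        if (hmap_count(lflows) > *max_seen_lflow_size) {
            *max_seen_lflow_size = hmap_count(lflows);
        }
    }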
On 24/08/2021 17:25, Numan Siddique wrote: > On Tue, Aug 24, 2021 at 7:56 AM Anton Ivanov > <anton.ivanov@cambridgegreys.com> wrote: >> >> On 24/08/2021 12:46, Ilya Maximets wrote: >>> On 8/24/21 1:18 PM, Anton Ivanov wrote: >>>> On 24/08/2021 12:05, Ilya Maximets wrote: >>>>> On 8/24/21 7:36 AM, Anton Ivanov wrote: >>>>>> On 23/08/2021 22:36, Ilya Maximets wrote: >>>>>>> On 8/23/21 10:37 PM, Anton Ivanov wrote: >>>>>>>> On 23/08/2021 21:26, Ilya Maximets wrote: >>>>>>>>> On 8/23/21 10:20 PM, Anton Ivanov wrote: >>>>>>>>>> Should not be the case. >>>>>>>>>> >>>>>>>>>> The map is pre-sized for the size from the previous iterations. >>>>>>>>>> >>>>>>>>>> Line 12861 in my tree which is probably a few commits out of date: >>>>>>>>>> >>>>>>>>>> fast_hmap_size_for(&lflows, max_seen_lflow_size); >>>>>>>>>> >>>>>>>>>> And immediately after building the lflows: >>>>>>>>>> >>>>>>>>>> if (hmap_count(&lflows) > max_seen_lflow_size) { >>>>>>>>>> max_seen_lflow_size = hmap_count(&lflows); >>>>>>>>>> } >>>>>>>>>> >>>>>>>>>> So the map SHOULD be sized correctly - to the most recent seen lflow count. >>>>>>>>> Please, re-read the commit message. It's a first run of the loop >>>>>>>>> and the 'max_seen_lflow_size' is default 128 at this point. >>>>>>>> Ack, >>>>>>>> >>>>>>>> Not using auto-resizing in single threaded mode is a bug. Thanks for fixing it. >>>>>>>> >>>>>>>> From that perspective the patch is a straight +1 from me. >>>>>>>> >>>>>>>> From the perspective of the use case stated in the commit message- I am not sure it addresses it. >>>>>>>> >>>>>>>> If northd is in single-threaded mode and is tackling a GIGANTIC> database, it may never complete the first iteration before the >>>>>>>> expiration of the timers and everyone deciding that northd is AWOL. >>>>>>> Well, how do you suggest to fix that? Obviously, we can always create >>>>>>> a database that northd will never be able to process in a reasonable >>>>>>> amount of time. And it doesn't matter if it's single- or multi-threaded. >>>>>>> >>>>>>> In this case NbDB is only 9MB in size, which is very reasonable, and >>>>>>> northd on my laptop takes more than 15 minutes to process it (I killed >>>>>>> it at this point). With the patch applied it took only 11 seconds. >>>>>>> So, for me, this patch pretty much fixes the issue. 11 seconds is not >>>>>>> that bad, e.g. ovn-k8s configures inactivity probes for clients to 180. >>>>>>> It would be great to reduce, but we're not there yet. >>>>>>> >>>>>>>> In that case, if it is multi-threaded from the start, it should probably >>>>>>>> grab the sizing from the lflow table hash in south db. That would be a >>>>>>>> good approximation for the first run. >>>>>>> This will not work for a case where SbDB is empty for any reason while >>>>>>> NbDB is not. And there is still a case where northd initially connects >>>>>>> to semi-empty databases and after few iterations NbDB receives a big >>>>>>> transaction and generates a big update for northd. >>>>>> A partial fix is to resize to optimum size the hash after lflow processing. >>>>> I'm not sure I understand what you mean here, because resizing after >>>>> lflow processing will not help. The lflow processing itself is the >>>>> very slow part that we're trying to make faster here. >>>> That can be the case only with dpgroups. Otherwise lflows are just a destination to dump data and the bucket sizing is irrelevant because there is never any lookup inside lflows during processing. The lflow map is just used to store data. 
So if it is suboptimal at the exit of build_lflows() the resize will fix it before the first lookup shortly thereafter. >>>> >>>> Are you running it with dpgroups enabled? In that case there are lookups inside lflows during build which happen under a per-bucket lock. So in addition to suboptimal size when searching the contention depends on the number of buckets. If they are too few, the system becomes heavily contended resulting in ridiculous computation sizes. >>> Oh, I see. Indeed, without dp-groups there is no lookup during >>> lflow build. I missed that. So, yes, I agree that for a case >>> without dp-groups, re-sizing after lflow processing should work. >>> We need that for parallel case. >>> >>> Current patch (use hmap_insert() that resizes if needed) helps >>> for all non-parallel cases. >> Indeed. It should go in. >> > Why can't we have hmap_insert() for both parallel and non parallel configs > to start with and switch over to hmap_insert_fast() when ovn-northd > has successfully connected to SB DB and has approximated on the > more accurate hmap size ? That is possible, but not for dp_groups. If it is just the lflow compute, you can use hmap_insert, but that does not actually have any benefit. In fact, you will consume much more CPU than merging into a suboptimal hmap and then resizing it at the end. For dp_groups, the locking is per hash bucket. If you change the number of buckets (as upon resize) your locks are no longer valid and you end up corrupting the data. I am running tests on dp_groups and I am starting to think that we should abandon the parallelization of lflow compute altogether for the dp_groups case. I get at best the same results and sometimes worse results. Looking at the the picture on ovn-central node of ovn-heater the threads never ramp up to more than single digit percents - they are waiting on locks. Compared to that the brute-force lflow compute has threads ramping up to 100% and clear benefit from parallelization. This is for 36 fake nodes, 1800 ports. A. > > Thanks > Numan > >> I will sort out the other cases to the extent possible. > >> Brgds, >> >> A. >> >>> I'm mostly running dp-groups + non-parallel which is a default >>> case for ovn-heater/ovn-k8s. >>> >>>> For the case of "dpgroups + parallel + first iteration + pre-existing large database" there is no cure short of pre-allocating the hash to maximum size. >>> Yeah, dp-groups + parallel is a hard case. >>> >>>> I am scale testing that as well as resize (for non-dp-groups cases) at present. >>>> >>>> Brgds, >>>> >>>> A. >>>> >>>>>> If the sizing was correct - 99.9% of the case this will be a noop. >>>>>> >>>>>> If the sizing was incorrect, it will be resized so that the DP searches and all other ops which were recently added for flow reduction will work optimally. >>>>>> >>>>>> This still does not work for lflow compute with dpgroups + parallel upon initial connect and without a SB database to use for size guidance. It will work for all other cases. >>>>>> >>>>>> I will send two separate patches to address the cases which can be easily addressed and see what can be done with the dp+parallel upon initial connect to an empty sb database. >>>>>> >>>>>> Brgds, >>>>>> >>>>>> A >>>>>> >>>>>>>> A. >>>>>>>> >>>>>>>>>> A. >>>>>>>>>> >>>>>>>>>> On 23/08/2021 21:02, Ilya Maximets wrote: >>>>>>>>>>> 'lflow_map' is never expanded because northd always uses fast >>>>>>>>>>> insertion. 
This leads to the case where we have a hash map >>>>>>>>>>> with only 128 initial buckets and every ovn_lflow_find() ends >>>>>>>>>>> up iterating over n_lflows / 128 entries. It's thousands of >>>>>>>>>>> logical flows or even more. For example, it takes forever for >>>>>>>>>>> ovn-northd to start up with the Northbound Db from the 120 node >>>>>>>>>>> density-heavy test from ovn-heater, because every lookup is slower >>>>>>>>>>> than previous one. I aborted the process after 15 minutes of >>>>>>>>>>> waiting, because there was no sign that it will converge. With >>>>>>>>>>> this change applied the loop completes in only 11 seconds. >>>>>>>>>>> >>>>>>>>>>> Hash map will be pre-allocated to the maximum seen number of >>>>>>>>>>> logical flows on a second iteration, but this doesn't help for >>>>>>>>>>> the first iteration when northd first time connects to a big >>>>>>>>>>> Northbound database, which is a common case during failover or >>>>>>>>>>> cluster upgrade. And there is an even trickier case where big >>>>>>>>>>> NbDB transaction that explodes the number of logical flows received >>>>>>>>>>> on not the first run. >>>>>>>>>>> >>>>>>>>>>> We can't expand the hash map in case of parallel build, so this >>>>>>>>>>> should be fixed separately. >>>>>>>>>>> >>>>>>>>>>> CC: Anton Ivanov <anton.ivanov@cambridgegreys.com> >>>>>>>>>>> Fixes: 74daa0607c7f ("ovn-northd: Introduce parallel lflow build") >>>>>>>>>>> Signed-off-by: Ilya Maximets <i.maximets@ovn.org> >>>>>>>>>>> --- >>>>>>>>>>> northd/ovn-northd.c | 6 +++++- >>>>>>>>>>> 1 file changed, 5 insertions(+), 1 deletion(-) >>>>>>>>>>> >>>>>>>>>>> diff --git a/northd/ovn-northd.c b/northd/ovn-northd.c >>>>>>>>>>> index 3d8e21a4f..40cf957c0 100644 >>>>>>>>>>> --- a/northd/ovn-northd.c >>>>>>>>>>> +++ b/northd/ovn-northd.c >>>>>>>>>>> @@ -4387,7 +4387,11 @@ do_ovn_lflow_add(struct hmap *lflow_map, struct ovn_datapath *od, >>>>>>>>>>> nullable_xstrdup(ctrl_meter), >>>>>>>>>>> ovn_lflow_hint(stage_hint), where); >>>>>>>>>>> hmapx_add(&lflow->od_group, od); >>>>>>>>>>> - hmap_insert_fast(lflow_map, &lflow->hmap_node, hash); >>>>>>>>>>> + if (!use_parallel_build) { >>>>>>>>>>> + hmap_insert(lflow_map, &lflow->hmap_node, hash); >>>>>>>>>>> + } else { >>>>>>>>>>> + hmap_insert_fast(lflow_map, &lflow->hmap_node, hash); >>>>>>>>>>> + } >>>>>>>>>>> } >>>>>>>>>>> /* Adds a row with the specified contents to the Logical_Flow table. */ >> -- >> Anton R. Ivanov >> Cambridgegreys Limited. Registered in England. Company Number 10273661 >> https://www.cambridgegreys.com/ >> >> _______________________________________________ >> dev mailing list >> dev@openvswitch.org >> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
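To make the locking constraint described above concrete, a rough, hypothetical sketch of per-bucket locking (this is not OVN's actual parallel-hmap code, just an illustration): the lock protecting an lflow is chosen from its hash and the current bucket mask, so growing the bucket array mid-build would make threads pick different locks for the same data.

    #include <pthread.h>
    #include <stdint.h>

    /* Hypothetical layout: one mutex per hash bucket. */
    struct bucket_locks {
        uint32_t mask;            /* n_buckets - 1; must stay constant. */
        pthread_mutex_t *locks;   /* one lock per bucket. */
    };

    static pthread_mutex_t *
    lock_for_hash(struct bucket_locks *bl, uint32_t hash)
    {
        /* If a resize changed 'mask' while other threads hold locks
         * computed from the old mask, two threads can end up guarding
         * the same bucket with different mutexes, i.e. no exclusion at
         * all.  That is why the dp-groups parallel path cannot simply
         * let hmap_insert() grow the map. */
        return &bl->locks[hash & bl->mask];
    }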
On 24/08/2021 17:35, Ilya Maximets wrote: > On 8/24/21 6:25 PM, Numan Siddique wrote: >> On Tue, Aug 24, 2021 at 7:56 AM Anton Ivanov >> <anton.ivanov@cambridgegreys.com> wrote: >>> >>> On 24/08/2021 12:46, Ilya Maximets wrote: >>>> On 8/24/21 1:18 PM, Anton Ivanov wrote: >>>>> On 24/08/2021 12:05, Ilya Maximets wrote: >>>>>> On 8/24/21 7:36 AM, Anton Ivanov wrote: >>>>>>> On 23/08/2021 22:36, Ilya Maximets wrote: >>>>>>>> On 8/23/21 10:37 PM, Anton Ivanov wrote: >>>>>>>>> On 23/08/2021 21:26, Ilya Maximets wrote: >>>>>>>>>> On 8/23/21 10:20 PM, Anton Ivanov wrote: >>>>>>>>>>> Should not be the case. >>>>>>>>>>> >>>>>>>>>>> The map is pre-sized for the size from the previous iterations. >>>>>>>>>>> >>>>>>>>>>> Line 12861 in my tree which is probably a few commits out of date: >>>>>>>>>>> >>>>>>>>>>> fast_hmap_size_for(&lflows, max_seen_lflow_size); >>>>>>>>>>> >>>>>>>>>>> And immediately after building the lflows: >>>>>>>>>>> >>>>>>>>>>> if (hmap_count(&lflows) > max_seen_lflow_size) { >>>>>>>>>>> max_seen_lflow_size = hmap_count(&lflows); >>>>>>>>>>> } >>>>>>>>>>> >>>>>>>>>>> So the map SHOULD be sized correctly - to the most recent seen lflow count. >>>>>>>>>> Please, re-read the commit message. It's a first run of the loop >>>>>>>>>> and the 'max_seen_lflow_size' is default 128 at this point. >>>>>>>>> Ack, >>>>>>>>> >>>>>>>>> Not using auto-resizing in single threaded mode is a bug. Thanks for fixing it. >>>>>>>>> >>>>>>>>> From that perspective the patch is a straight +1 from me. >>>>>>>>> >>>>>>>>> From the perspective of the use case stated in the commit message- I am not sure it addresses it. >>>>>>>>> >>>>>>>>> If northd is in single-threaded mode and is tackling a GIGANTIC> database, it may never complete the first iteration before the >>>>>>>>> expiration of the timers and everyone deciding that northd is AWOL. >>>>>>>> Well, how do you suggest to fix that? Obviously, we can always create >>>>>>>> a database that northd will never be able to process in a reasonable >>>>>>>> amount of time. And it doesn't matter if it's single- or multi-threaded. >>>>>>>> >>>>>>>> In this case NbDB is only 9MB in size, which is very reasonable, and >>>>>>>> northd on my laptop takes more than 15 minutes to process it (I killed >>>>>>>> it at this point). With the patch applied it took only 11 seconds. >>>>>>>> So, for me, this patch pretty much fixes the issue. 11 seconds is not >>>>>>>> that bad, e.g. ovn-k8s configures inactivity probes for clients to 180. >>>>>>>> It would be great to reduce, but we're not there yet. >>>>>>>> >>>>>>>>> In that case, if it is multi-threaded from the start, it should probably >>>>>>>>> grab the sizing from the lflow table hash in south db. That would be a >>>>>>>>> good approximation for the first run. >>>>>>>> This will not work for a case where SbDB is empty for any reason while >>>>>>>> NbDB is not. And there is still a case where northd initially connects >>>>>>>> to semi-empty databases and after few iterations NbDB receives a big >>>>>>>> transaction and generates a big update for northd. >>>>>>> A partial fix is to resize to optimum size the hash after lflow processing. >>>>>> I'm not sure I understand what you mean here, because resizing after >>>>>> lflow processing will not help. The lflow processing itself is the >>>>>> very slow part that we're trying to make faster here. >>>>> That can be the case only with dpgroups. 
Otherwise lflows are just a destination to dump data and the bucket sizing is irrelevant because there is never any lookup inside lflows during processing. The lflow map is just used to store data. So if it is suboptimal at the exit of build_lflows() the resize will fix it before the first lookup shortly thereafter. >>>>> >>>>> Are you running it with dpgroups enabled? In that case there are lookups inside lflows during build which happen under a per-bucket lock. So in addition to suboptimal size when searching the contention depends on the number of buckets. If they are too few, the system becomes heavily contended resulting in ridiculous computation sizes. >>>> Oh, I see. Indeed, without dp-groups there is no lookup during >>>> lflow build. I missed that. So, yes, I agree that for a case >>>> without dp-groups, re-sizing after lflow processing should work. >>>> We need that for parallel case. >>>> >>>> Current patch (use hmap_insert() that resizes if needed) helps >>>> for all non-parallel cases. >>> Indeed. It should go in. >>> >> Why can't we have hmap_insert() for both parallel and non parallel configs >> to start with and switch over to hmap_insert_fast() when ovn-northd >> has successfully connected to SB DB and has approximated on the >> more accurate hmap size ? > We can't use hmap_insert() for parallel, because resize of a hash map > will crash threads that are working on it at the moment, IIUC. We actually can for non-dp-groups case, but the merge will become much slower. Example - the last version of the snapshot and monitor parallelization for OVS. It no longer uses pre-sized fixed hmaps because it is nearly impossible to presize an RFC7047 structure correctly. For the dp-groups case we can't. Locking of the lflows which are used for lookups is per hash bucket. You cannot change the number of buckets in the middle of the run and other locking mechanisms will be more coarse. So the marginal benefit with dp-groups at present will become none (or even slower). > > We could disable parallel for first several iterations, but still, > this doesn't cover all the cases. i.e. the one where we have a big > update from the NbDB with a fairly small DB before that. I am working on that. Let me collect some stat data for the current OVN master with all the recent changes. They have changed the "heat map" for the dp_groups case. We may need to do some adjustments to what and where we parallelize in the dp-groups case (if at all). A. > >> Thanks >> Numan >> >>> I will sort out the other cases to the extent possible. >> >>> Brgds, >>> >>> A. >>> >>>> I'm mostly running dp-groups + non-parallel which is a default >>>> case for ovn-heater/ovn-k8s. >>>> >>>>> For the case of "dpgroups + parallel + first iteration + pre-existing large database" there is no cure short of pre-allocating the hash to maximum size. >>>> Yeah, dp-groups + parallel is a hard case. >>>> >>>>> I am scale testing that as well as resize (for non-dp-groups cases) at present. >>>>> >>>>> Brgds, >>>>> >>>>> A. >>>>> >>>>>>> If the sizing was correct - 99.9% of the case this will be a noop. >>>>>>> >>>>>>> If the sizing was incorrect, it will be resized so that the DP searches and all other ops which were recently added for flow reduction will work optimally. >>>>>>> >>>>>>> This still does not work for lflow compute with dpgroups + parallel upon initial connect and without a SB database to use for size guidance. It will work for all other cases. 
>>>>>>> >>>>>>> I will send two separate patches to address the cases which can be easily addressed and see what can be done with the dp+parallel upon initial connect to an empty sb database. >>>>>>> >>>>>>> Brgds, >>>>>>> >>>>>>> A >>>>>>> >>>>>>>>> A. >>>>>>>>> >>>>>>>>>>> A. >>>>>>>>>>> >>>>>>>>>>> On 23/08/2021 21:02, Ilya Maximets wrote: >>>>>>>>>>>> 'lflow_map' is never expanded because northd always uses fast >>>>>>>>>>>> insertion. This leads to the case where we have a hash map >>>>>>>>>>>> with only 128 initial buckets and every ovn_lflow_find() ends >>>>>>>>>>>> up iterating over n_lflows / 128 entries. It's thousands of >>>>>>>>>>>> logical flows or even more. For example, it takes forever for >>>>>>>>>>>> ovn-northd to start up with the Northbound Db from the 120 node >>>>>>>>>>>> density-heavy test from ovn-heater, because every lookup is slower >>>>>>>>>>>> than previous one. I aborted the process after 15 minutes of >>>>>>>>>>>> waiting, because there was no sign that it will converge. With >>>>>>>>>>>> this change applied the loop completes in only 11 seconds. >>>>>>>>>>>> >>>>>>>>>>>> Hash map will be pre-allocated to the maximum seen number of >>>>>>>>>>>> logical flows on a second iteration, but this doesn't help for >>>>>>>>>>>> the first iteration when northd first time connects to a big >>>>>>>>>>>> Northbound database, which is a common case during failover or >>>>>>>>>>>> cluster upgrade. And there is an even trickier case where big >>>>>>>>>>>> NbDB transaction that explodes the number of logical flows received >>>>>>>>>>>> on not the first run. >>>>>>>>>>>> >>>>>>>>>>>> We can't expand the hash map in case of parallel build, so this >>>>>>>>>>>> should be fixed separately. >>>>>>>>>>>> >>>>>>>>>>>> CC: Anton Ivanov <anton.ivanov@cambridgegreys.com> >>>>>>>>>>>> Fixes: 74daa0607c7f ("ovn-northd: Introduce parallel lflow build") >>>>>>>>>>>> Signed-off-by: Ilya Maximets <i.maximets@ovn.org> >>>>>>>>>>>> --- >>>>>>>>>>>> northd/ovn-northd.c | 6 +++++- >>>>>>>>>>>> 1 file changed, 5 insertions(+), 1 deletion(-) >>>>>>>>>>>> >>>>>>>>>>>> diff --git a/northd/ovn-northd.c b/northd/ovn-northd.c >>>>>>>>>>>> index 3d8e21a4f..40cf957c0 100644 >>>>>>>>>>>> --- a/northd/ovn-northd.c >>>>>>>>>>>> +++ b/northd/ovn-northd.c >>>>>>>>>>>> @@ -4387,7 +4387,11 @@ do_ovn_lflow_add(struct hmap *lflow_map, struct ovn_datapath *od, >>>>>>>>>>>> nullable_xstrdup(ctrl_meter), >>>>>>>>>>>> ovn_lflow_hint(stage_hint), where); >>>>>>>>>>>> hmapx_add(&lflow->od_group, od); >>>>>>>>>>>> - hmap_insert_fast(lflow_map, &lflow->hmap_node, hash); >>>>>>>>>>>> + if (!use_parallel_build) { >>>>>>>>>>>> + hmap_insert(lflow_map, &lflow->hmap_node, hash); >>>>>>>>>>>> + } else { >>>>>>>>>>>> + hmap_insert_fast(lflow_map, &lflow->hmap_node, hash); >>>>>>>>>>>> + } >>>>>>>>>>>> } >>>>>>>>>>>> /* Adds a row with the specified contents to the Logical_Flow table. */ >>> -- >>> Anton R. Ivanov >>> Cambridgegreys Limited. Registered in England. Company Number 10273661 >>> https://www.cambridgegreys.com/ >>> >>> _______________________________________________ >>> dev mailing list >>> dev@openvswitch.org >>> https://mail.openvswitch.org/mailman/listinfo/ovs-dev >
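For context, a condensed sketch of the cross-iteration pre-sizing quoted throughout this thread (simplified: ovn-northd uses fast_hmap_size_for(), for which hmap_reserve() from the OVS hmap API stands in here, and the build step is only hinted at): the map is sized to the largest flow count seen so far, which is why the very first iteration still starts from the 128-bucket default.

    #include "openvswitch/hmap.h"

    static size_t max_seen_lflow_size = 128;   /* default on the first run */

    static void
    demo_iteration(struct hmap *lflows)
    {
        hmap_init(lflows);
        /* Pre-size to the largest table seen so far; still 128 on the
         * very first iteration, which is the case the patch hits. */
        hmap_reserve(lflows, max_seen_lflow_size);

        /* ... build_lflows() would fill 'lflows' here ... */

        if (hmap_count(lflows) > max_seen_lflow_size) {
            max_seen_lflow_size = hmap_count(lflows);
        }
    }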
On Tue, Aug 24, 2021 at 1:24 PM Anton Ivanov <anton.ivanov@cambridgegreys.com> wrote: > > > On 24/08/2021 17:35, Ilya Maximets wrote: > > On 8/24/21 6:25 PM, Numan Siddique wrote: > >> On Tue, Aug 24, 2021 at 7:56 AM Anton Ivanov > >> <anton.ivanov@cambridgegreys.com> wrote: > >>> > >>> On 24/08/2021 12:46, Ilya Maximets wrote: > >>>> On 8/24/21 1:18 PM, Anton Ivanov wrote: > >>>>> On 24/08/2021 12:05, Ilya Maximets wrote: > >>>>>> On 8/24/21 7:36 AM, Anton Ivanov wrote: > >>>>>>> On 23/08/2021 22:36, Ilya Maximets wrote: > >>>>>>>> On 8/23/21 10:37 PM, Anton Ivanov wrote: > >>>>>>>>> On 23/08/2021 21:26, Ilya Maximets wrote: > >>>>>>>>>> On 8/23/21 10:20 PM, Anton Ivanov wrote: > >>>>>>>>>>> Should not be the case. > >>>>>>>>>>> > >>>>>>>>>>> The map is pre-sized for the size from the previous iterations. > >>>>>>>>>>> > >>>>>>>>>>> Line 12861 in my tree which is probably a few commits out of date: > >>>>>>>>>>> > >>>>>>>>>>> fast_hmap_size_for(&lflows, max_seen_lflow_size); > >>>>>>>>>>> > >>>>>>>>>>> And immediately after building the lflows: > >>>>>>>>>>> > >>>>>>>>>>> if (hmap_count(&lflows) > max_seen_lflow_size) { > >>>>>>>>>>> max_seen_lflow_size = hmap_count(&lflows); > >>>>>>>>>>> } > >>>>>>>>>>> > >>>>>>>>>>> So the map SHOULD be sized correctly - to the most recent seen lflow count. > >>>>>>>>>> Please, re-read the commit message. It's a first run of the loop > >>>>>>>>>> and the 'max_seen_lflow_size' is default 128 at this point. > >>>>>>>>> Ack, > >>>>>>>>> > >>>>>>>>> Not using auto-resizing in single threaded mode is a bug. Thanks for fixing it. > >>>>>>>>> > >>>>>>>>> From that perspective the patch is a straight +1 from me. > >>>>>>>>> > >>>>>>>>> From the perspective of the use case stated in the commit message- I am not sure it addresses it. > >>>>>>>>> > >>>>>>>>> If northd is in single-threaded mode and is tackling a GIGANTIC> database, it may never complete the first iteration before the > >>>>>>>>> expiration of the timers and everyone deciding that northd is AWOL. > >>>>>>>> Well, how do you suggest to fix that? Obviously, we can always create > >>>>>>>> a database that northd will never be able to process in a reasonable > >>>>>>>> amount of time. And it doesn't matter if it's single- or multi-threaded. > >>>>>>>> > >>>>>>>> In this case NbDB is only 9MB in size, which is very reasonable, and > >>>>>>>> northd on my laptop takes more than 15 minutes to process it (I killed > >>>>>>>> it at this point). With the patch applied it took only 11 seconds. > >>>>>>>> So, for me, this patch pretty much fixes the issue. 11 seconds is not > >>>>>>>> that bad, e.g. ovn-k8s configures inactivity probes for clients to 180. > >>>>>>>> It would be great to reduce, but we're not there yet. > >>>>>>>> > >>>>>>>>> In that case, if it is multi-threaded from the start, it should probably > >>>>>>>>> grab the sizing from the lflow table hash in south db. That would be a > >>>>>>>>> good approximation for the first run. > >>>>>>>> This will not work for a case where SbDB is empty for any reason while > >>>>>>>> NbDB is not. And there is still a case where northd initially connects > >>>>>>>> to semi-empty databases and after few iterations NbDB receives a big > >>>>>>>> transaction and generates a big update for northd. > >>>>>>> A partial fix is to resize to optimum size the hash after lflow processing. > >>>>>> I'm not sure I understand what you mean here, because resizing after > >>>>>> lflow processing will not help. 
The lflow processing itself is the > >>>>>> very slow part that we're trying to make faster here. > >>>>> That can be the case only with dpgroups. Otherwise lflows are just a destination to dump data and the bucket sizing is irrelevant because there is never any lookup inside lflows during processing. The lflow map is just used to store data. So if it is suboptimal at the exit of build_lflows() the resize will fix it before the first lookup shortly thereafter. > >>>>> > >>>>> Are you running it with dpgroups enabled? In that case there are lookups inside lflows during build which happen under a per-bucket lock. So in addition to suboptimal size when searching the contention depends on the number of buckets. If they are too few, the system becomes heavily contended resulting in ridiculous computation sizes. > >>>> Oh, I see. Indeed, without dp-groups there is no lookup during > >>>> lflow build. I missed that. So, yes, I agree that for a case > >>>> without dp-groups, re-sizing after lflow processing should work. > >>>> We need that for parallel case. > >>>> > >>>> Current patch (use hmap_insert() that resizes if needed) helps > >>>> for all non-parallel cases. > >>> Indeed. It should go in. > >>> > >> Why can't we have hmap_insert() for both parallel and non parallel configs > >> to start with and switch over to hmap_insert_fast() when ovn-northd > >> has successfully connected to SB DB and has approximated on the > >> more accurate hmap size ? > > We can't use hmap_insert() for parallel, because resize of a hash map > > will crash threads that are working on it at the moment, IIUC. > > We actually can for non-dp-groups case, but the merge will become much slower. Example - the last version of the snapshot and monitor parallelization for OVS. It no longer uses pre-sized fixed hmaps because it is nearly impossible to presize an RFC7047 structure correctly. > > For the dp-groups case we can't. Locking of the lflows which are used for lookups is per hash bucket. You cannot change the number of buckets in the middle of the run and other locking mechanisms will be more coarse. So the marginal benefit with dp-groups at present will become none (or even slower). > > > > > We could disable parallel for first several iterations, but still, > > this doesn't cover all the cases. i.e. the one where we have a big > > update from the NbDB with a fairly small DB before that. > > I am working on that. Let me collect some stat data for the current OVN master with all the recent changes. They have changed the "heat map" for the dp_groups case. > > We may need to do some adjustments to what and where we parallelize in the dp-groups case (if at all). Ok. You're fine with the proposed patch here right ? @Ilya - Can you please update the commit message to include that the issue is seen with dp groups enabled ? I guess you can share the commit message here. No need to submit v2 for that. Thanks Numan Thanks Numan > > A. > > > > >> Thanks > >> Numan > >> > >>> I will sort out the other cases to the extent possible. > >> > >>> Brgds, > >>> > >>> A. > >>> > >>>> I'm mostly running dp-groups + non-parallel which is a default > >>>> case for ovn-heater/ovn-k8s. > >>>> > >>>>> For the case of "dpgroups + parallel + first iteration + pre-existing large database" there is no cure short of pre-allocating the hash to maximum size. > >>>> Yeah, dp-groups + parallel is a hard case. > >>>> > >>>>> I am scale testing that as well as resize (for non-dp-groups cases) at present. > >>>>> > >>>>> Brgds, > >>>>> > >>>>> A. 
> >>>>> > >>>>>>> If the sizing was correct - 99.9% of the case this will be a noop. > >>>>>>> > >>>>>>> If the sizing was incorrect, it will be resized so that the DP searches and all other ops which were recently added for flow reduction will work optimally. > >>>>>>> > >>>>>>> This still does not work for lflow compute with dpgroups + parallel upon initial connect and without a SB database to use for size guidance. It will work for all other cases. > >>>>>>> > >>>>>>> I will send two separate patches to address the cases which can be easily addressed and see what can be done with the dp+parallel upon initial connect to an empty sb database. > >>>>>>> > >>>>>>> Brgds, > >>>>>>> > >>>>>>> A > >>>>>>> > >>>>>>>>> A. > >>>>>>>>> > >>>>>>>>>>> A. > >>>>>>>>>>> > >>>>>>>>>>> On 23/08/2021 21:02, Ilya Maximets wrote: > >>>>>>>>>>>> 'lflow_map' is never expanded because northd always uses fast > >>>>>>>>>>>> insertion. This leads to the case where we have a hash map > >>>>>>>>>>>> with only 128 initial buckets and every ovn_lflow_find() ends > >>>>>>>>>>>> up iterating over n_lflows / 128 entries. It's thousands of > >>>>>>>>>>>> logical flows or even more. For example, it takes forever for > >>>>>>>>>>>> ovn-northd to start up with the Northbound Db from the 120 node > >>>>>>>>>>>> density-heavy test from ovn-heater, because every lookup is slower > >>>>>>>>>>>> than previous one. I aborted the process after 15 minutes of > >>>>>>>>>>>> waiting, because there was no sign that it will converge. With > >>>>>>>>>>>> this change applied the loop completes in only 11 seconds. > >>>>>>>>>>>> > >>>>>>>>>>>> Hash map will be pre-allocated to the maximum seen number of > >>>>>>>>>>>> logical flows on a second iteration, but this doesn't help for > >>>>>>>>>>>> the first iteration when northd first time connects to a big > >>>>>>>>>>>> Northbound database, which is a common case during failover or > >>>>>>>>>>>> cluster upgrade. And there is an even trickier case where big > >>>>>>>>>>>> NbDB transaction that explodes the number of logical flows received > >>>>>>>>>>>> on not the first run. > >>>>>>>>>>>> > >>>>>>>>>>>> We can't expand the hash map in case of parallel build, so this > >>>>>>>>>>>> should be fixed separately. > >>>>>>>>>>>> > >>>>>>>>>>>> CC: Anton Ivanov <anton.ivanov@cambridgegreys.com> > >>>>>>>>>>>> Fixes: 74daa0607c7f ("ovn-northd: Introduce parallel lflow build") > >>>>>>>>>>>> Signed-off-by: Ilya Maximets <i.maximets@ovn.org> > >>>>>>>>>>>> --- > >>>>>>>>>>>> northd/ovn-northd.c | 6 +++++- > >>>>>>>>>>>> 1 file changed, 5 insertions(+), 1 deletion(-) > >>>>>>>>>>>> > >>>>>>>>>>>> diff --git a/northd/ovn-northd.c b/northd/ovn-northd.c > >>>>>>>>>>>> index 3d8e21a4f..40cf957c0 100644 > >>>>>>>>>>>> --- a/northd/ovn-northd.c > >>>>>>>>>>>> +++ b/northd/ovn-northd.c > >>>>>>>>>>>> @@ -4387,7 +4387,11 @@ do_ovn_lflow_add(struct hmap *lflow_map, struct ovn_datapath *od, > >>>>>>>>>>>> nullable_xstrdup(ctrl_meter), > >>>>>>>>>>>> ovn_lflow_hint(stage_hint), where); > >>>>>>>>>>>> hmapx_add(&lflow->od_group, od); > >>>>>>>>>>>> - hmap_insert_fast(lflow_map, &lflow->hmap_node, hash); > >>>>>>>>>>>> + if (!use_parallel_build) { > >>>>>>>>>>>> + hmap_insert(lflow_map, &lflow->hmap_node, hash); > >>>>>>>>>>>> + } else { > >>>>>>>>>>>> + hmap_insert_fast(lflow_map, &lflow->hmap_node, hash); > >>>>>>>>>>>> + } > >>>>>>>>>>>> } > >>>>>>>>>>>> /* Adds a row with the specified contents to the Logical_Flow table. */ > >>> -- > >>> Anton R. Ivanov > >>> Cambridgegreys Limited. Registered in England. 
Company Number 10273661 > >>> https://www.cambridgegreys.com/ > >>> > >>> _______________________________________________ > >>> dev mailing list > >>> dev@openvswitch.org > >>> https://mail.openvswitch.org/mailman/listinfo/ovs-dev > > > -- > Anton R. Ivanov > Cambridgegreys Limited. Registered in England. Company Number 10273661 > https://www.cambridgegreys.com/ > > _______________________________________________ > dev mailing list > dev@openvswitch.org > https://mail.openvswitch.org/mailman/listinfo/ovs-dev >
On 24/08/2021 18:48, Numan Siddique wrote: > On Tue, Aug 24, 2021 at 1:24 PM Anton Ivanov > <anton.ivanov@cambridgegreys.com> wrote: >> >> On 24/08/2021 17:35, Ilya Maximets wrote: >>> On 8/24/21 6:25 PM, Numan Siddique wrote: >>>> On Tue, Aug 24, 2021 at 7:56 AM Anton Ivanov >>>> <anton.ivanov@cambridgegreys.com> wrote: >>>>> On 24/08/2021 12:46, Ilya Maximets wrote: >>>>>> On 8/24/21 1:18 PM, Anton Ivanov wrote: >>>>>>> On 24/08/2021 12:05, Ilya Maximets wrote: >>>>>>>> On 8/24/21 7:36 AM, Anton Ivanov wrote: >>>>>>>>> On 23/08/2021 22:36, Ilya Maximets wrote: >>>>>>>>>> On 8/23/21 10:37 PM, Anton Ivanov wrote: >>>>>>>>>>> On 23/08/2021 21:26, Ilya Maximets wrote: >>>>>>>>>>>> On 8/23/21 10:20 PM, Anton Ivanov wrote: >>>>>>>>>>>>> Should not be the case. >>>>>>>>>>>>> >>>>>>>>>>>>> The map is pre-sized for the size from the previous iterations. >>>>>>>>>>>>> >>>>>>>>>>>>> Line 12861 in my tree which is probably a few commits out of date: >>>>>>>>>>>>> >>>>>>>>>>>>> fast_hmap_size_for(&lflows, max_seen_lflow_size); >>>>>>>>>>>>> >>>>>>>>>>>>> And immediately after building the lflows: >>>>>>>>>>>>> >>>>>>>>>>>>> if (hmap_count(&lflows) > max_seen_lflow_size) { >>>>>>>>>>>>> max_seen_lflow_size = hmap_count(&lflows); >>>>>>>>>>>>> } >>>>>>>>>>>>> >>>>>>>>>>>>> So the map SHOULD be sized correctly - to the most recent seen lflow count. >>>>>>>>>>>> Please, re-read the commit message. It's a first run of the loop >>>>>>>>>>>> and the 'max_seen_lflow_size' is default 128 at this point. >>>>>>>>>>> Ack, >>>>>>>>>>> >>>>>>>>>>> Not using auto-resizing in single threaded mode is a bug. Thanks for fixing it. >>>>>>>>>>> >>>>>>>>>>> From that perspective the patch is a straight +1 from me. >>>>>>>>>>> >>>>>>>>>>> From the perspective of the use case stated in the commit message- I am not sure it addresses it. >>>>>>>>>>> >>>>>>>>>>> If northd is in single-threaded mode and is tackling a GIGANTIC> database, it may never complete the first iteration before the >>>>>>>>>>> expiration of the timers and everyone deciding that northd is AWOL. >>>>>>>>>> Well, how do you suggest to fix that? Obviously, we can always create >>>>>>>>>> a database that northd will never be able to process in a reasonable >>>>>>>>>> amount of time. And it doesn't matter if it's single- or multi-threaded. >>>>>>>>>> >>>>>>>>>> In this case NbDB is only 9MB in size, which is very reasonable, and >>>>>>>>>> northd on my laptop takes more than 15 minutes to process it (I killed >>>>>>>>>> it at this point). With the patch applied it took only 11 seconds. >>>>>>>>>> So, for me, this patch pretty much fixes the issue. 11 seconds is not >>>>>>>>>> that bad, e.g. ovn-k8s configures inactivity probes for clients to 180. >>>>>>>>>> It would be great to reduce, but we're not there yet. >>>>>>>>>> >>>>>>>>>>> In that case, if it is multi-threaded from the start, it should probably >>>>>>>>>>> grab the sizing from the lflow table hash in south db. That would be a >>>>>>>>>>> good approximation for the first run. >>>>>>>>>> This will not work for a case where SbDB is empty for any reason while >>>>>>>>>> NbDB is not. And there is still a case where northd initially connects >>>>>>>>>> to semi-empty databases and after few iterations NbDB receives a big >>>>>>>>>> transaction and generates a big update for northd. >>>>>>>>> A partial fix is to resize to optimum size the hash after lflow processing. >>>>>>>> I'm not sure I understand what you mean here, because resizing after >>>>>>>> lflow processing will not help. 
The lflow processing itself is the >>>>>>>> very slow part that we're trying to make faster here. >>>>>>> That can be the case only with dpgroups. Otherwise lflows are just a destination to dump data and the bucket sizing is irrelevant because there is never any lookup inside lflows during processing. The lflow map is just used to store data. So if it is suboptimal at the exit of build_lflows() the resize will fix it before the first lookup shortly thereafter. >>>>>>> >>>>>>> Are you running it with dpgroups enabled? In that case there are lookups inside lflows during build which happen under a per-bucket lock. So in addition to suboptimal size when searching the contention depends on the number of buckets. If they are too few, the system becomes heavily contended resulting in ridiculous computation sizes. >>>>>> Oh, I see. Indeed, without dp-groups there is no lookup during >>>>>> lflow build. I missed that. So, yes, I agree that for a case >>>>>> without dp-groups, re-sizing after lflow processing should work. >>>>>> We need that for parallel case. >>>>>> >>>>>> Current patch (use hmap_insert() that resizes if needed) helps >>>>>> for all non-parallel cases. >>>>> Indeed. It should go in. >>>>> >>>> Why can't we have hmap_insert() for both parallel and non parallel configs >>>> to start with and switch over to hmap_insert_fast() when ovn-northd >>>> has successfully connected to SB DB and has approximated on the >>>> more accurate hmap size ? >>> We can't use hmap_insert() for parallel, because resize of a hash map >>> will crash threads that are working on it at the moment, IIUC. >> We actually can for non-dp-groups case, but the merge will become much slower. Example - the last version of the snapshot and monitor parallelization for OVS. It no longer uses pre-sized fixed hmaps because it is nearly impossible to presize an RFC7047 structure correctly. >> >> For the dp-groups case we can't. Locking of the lflows which are used for lookups is per hash bucket. You cannot change the number of buckets in the middle of the run and other locking mechanisms will be more coarse. So the marginal benefit with dp-groups at present will become none (or even slower). >> >>> We could disable parallel for first several iterations, but still, >>> this doesn't cover all the cases. i.e. the one where we have a big >>> update from the NbDB with a fairly small DB before that. >> I am working on that. Let me collect some stat data for the current OVN master with all the recent changes. They have changed the "heat map" for the dp_groups case. >> >> We may need to do some adjustments to what and where we parallelize in the dp-groups case (if at all). > Ok. You're fine with the proposed patch here right ? Yes. Please go ahead. I will address the other cases and issues in separate patches. > > @Ilya - Can you please update the commit message to include that the > issue is seen with dp groups enabled ? > > I guess you can share the commit message here. No need to submit v2 for that. > > Thanks > Numan > > Thanks > Numan >> A. >> >>>> Thanks >>>> Numan >>>> >>>>> I will sort out the other cases to the extent possible. >>>>> Brgds, >>>>> >>>>> A. >>>>> >>>>>> I'm mostly running dp-groups + non-parallel which is a default >>>>>> case for ovn-heater/ovn-k8s. >>>>>> >>>>>>> For the case of "dpgroups + parallel + first iteration + pre-existing large database" there is no cure short of pre-allocating the hash to maximum size. >>>>>> Yeah, dp-groups + parallel is a hard case. 
>>>>>> >>>>>>> I am scale testing that as well as resize (for non-dp-groups cases) at present. >>>>>>> >>>>>>> Brgds, >>>>>>> >>>>>>> A. >>>>>>> >>>>>>>>> If the sizing was correct - 99.9% of the case this will be a noop. >>>>>>>>> >>>>>>>>> If the sizing was incorrect, it will be resized so that the DP searches and all other ops which were recently added for flow reduction will work optimally. >>>>>>>>> >>>>>>>>> This still does not work for lflow compute with dpgroups + parallel upon initial connect and without a SB database to use for size guidance. It will work for all other cases. >>>>>>>>> >>>>>>>>> I will send two separate patches to address the cases which can be easily addressed and see what can be done with the dp+parallel upon initial connect to an empty sb database. >>>>>>>>> >>>>>>>>> Brgds, >>>>>>>>> >>>>>>>>> A >>>>>>>>> >>>>>>>>>>> A. >>>>>>>>>>> >>>>>>>>>>>>> A. >>>>>>>>>>>>> >>>>>>>>>>>>> On 23/08/2021 21:02, Ilya Maximets wrote: >>>>>>>>>>>>>> 'lflow_map' is never expanded because northd always uses fast >>>>>>>>>>>>>> insertion. This leads to the case where we have a hash map >>>>>>>>>>>>>> with only 128 initial buckets and every ovn_lflow_find() ends >>>>>>>>>>>>>> up iterating over n_lflows / 128 entries. It's thousands of >>>>>>>>>>>>>> logical flows or even more. For example, it takes forever for >>>>>>>>>>>>>> ovn-northd to start up with the Northbound Db from the 120 node >>>>>>>>>>>>>> density-heavy test from ovn-heater, because every lookup is slower >>>>>>>>>>>>>> than previous one. I aborted the process after 15 minutes of >>>>>>>>>>>>>> waiting, because there was no sign that it will converge. With >>>>>>>>>>>>>> this change applied the loop completes in only 11 seconds. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Hash map will be pre-allocated to the maximum seen number of >>>>>>>>>>>>>> logical flows on a second iteration, but this doesn't help for >>>>>>>>>>>>>> the first iteration when northd first time connects to a big >>>>>>>>>>>>>> Northbound database, which is a common case during failover or >>>>>>>>>>>>>> cluster upgrade. And there is an even trickier case where big >>>>>>>>>>>>>> NbDB transaction that explodes the number of logical flows received >>>>>>>>>>>>>> on not the first run. >>>>>>>>>>>>>> >>>>>>>>>>>>>> We can't expand the hash map in case of parallel build, so this >>>>>>>>>>>>>> should be fixed separately. 
>>>>>>>>>>>>>> >>>>>>>>>>>>>> CC: Anton Ivanov <anton.ivanov@cambridgegreys.com> >>>>>>>>>>>>>> Fixes: 74daa0607c7f ("ovn-northd: Introduce parallel lflow build") >>>>>>>>>>>>>> Signed-off-by: Ilya Maximets <i.maximets@ovn.org> >>>>>>>>>>>>>> --- >>>>>>>>>>>>>> northd/ovn-northd.c | 6 +++++- >>>>>>>>>>>>>> 1 file changed, 5 insertions(+), 1 deletion(-) >>>>>>>>>>>>>> >>>>>>>>>>>>>> diff --git a/northd/ovn-northd.c b/northd/ovn-northd.c >>>>>>>>>>>>>> index 3d8e21a4f..40cf957c0 100644 >>>>>>>>>>>>>> --- a/northd/ovn-northd.c >>>>>>>>>>>>>> +++ b/northd/ovn-northd.c >>>>>>>>>>>>>> @@ -4387,7 +4387,11 @@ do_ovn_lflow_add(struct hmap *lflow_map, struct ovn_datapath *od, >>>>>>>>>>>>>> nullable_xstrdup(ctrl_meter), >>>>>>>>>>>>>> ovn_lflow_hint(stage_hint), where); >>>>>>>>>>>>>> hmapx_add(&lflow->od_group, od); >>>>>>>>>>>>>> - hmap_insert_fast(lflow_map, &lflow->hmap_node, hash); >>>>>>>>>>>>>> + if (!use_parallel_build) { >>>>>>>>>>>>>> + hmap_insert(lflow_map, &lflow->hmap_node, hash); >>>>>>>>>>>>>> + } else { >>>>>>>>>>>>>> + hmap_insert_fast(lflow_map, &lflow->hmap_node, hash); >>>>>>>>>>>>>> + } >>>>>>>>>>>>>> } >>>>>>>>>>>>>> /* Adds a row with the specified contents to the Logical_Flow table. */ >>>>> -- >>>>> Anton R. Ivanov >>>>> Cambridgegreys Limited. Registered in England. Company Number 10273661 >>>>> https://www.cambridgegreys.com/ >>>>> >>>>> _______________________________________________ >>>>> dev mailing list >>>>> dev@openvswitch.org >>>>> https://mail.openvswitch.org/mailman/listinfo/ovs-dev >> -- >> Anton R. Ivanov >> Cambridgegreys Limited. Registered in England. Company Number 10273661 >> https://www.cambridgegreys.com/ >> >> _______________________________________________ >> dev mailing list >> dev@openvswitch.org >> https://mail.openvswitch.org/mailman/listinfo/ovs-dev >>
On 8/24/21 7:48 PM, Numan Siddique wrote: > On Tue, Aug 24, 2021 at 1:24 PM Anton Ivanov > <anton.ivanov@cambridgegreys.com> wrote: >> >> >> On 24/08/2021 17:35, Ilya Maximets wrote: >>> On 8/24/21 6:25 PM, Numan Siddique wrote: >>>> On Tue, Aug 24, 2021 at 7:56 AM Anton Ivanov >>>> <anton.ivanov@cambridgegreys.com> wrote: >>>>> >>>>> On 24/08/2021 12:46, Ilya Maximets wrote: >>>>>> On 8/24/21 1:18 PM, Anton Ivanov wrote: >>>>>>> On 24/08/2021 12:05, Ilya Maximets wrote: >>>>>>>> On 8/24/21 7:36 AM, Anton Ivanov wrote: >>>>>>>>> On 23/08/2021 22:36, Ilya Maximets wrote: >>>>>>>>>> On 8/23/21 10:37 PM, Anton Ivanov wrote: >>>>>>>>>>> On 23/08/2021 21:26, Ilya Maximets wrote: >>>>>>>>>>>> On 8/23/21 10:20 PM, Anton Ivanov wrote: >>>>>>>>>>>>> Should not be the case. >>>>>>>>>>>>> >>>>>>>>>>>>> The map is pre-sized for the size from the previous iterations. >>>>>>>>>>>>> >>>>>>>>>>>>> Line 12861 in my tree which is probably a few commits out of date: >>>>>>>>>>>>> >>>>>>>>>>>>> fast_hmap_size_for(&lflows, max_seen_lflow_size); >>>>>>>>>>>>> >>>>>>>>>>>>> And immediately after building the lflows: >>>>>>>>>>>>> >>>>>>>>>>>>> if (hmap_count(&lflows) > max_seen_lflow_size) { >>>>>>>>>>>>> max_seen_lflow_size = hmap_count(&lflows); >>>>>>>>>>>>> } >>>>>>>>>>>>> >>>>>>>>>>>>> So the map SHOULD be sized correctly - to the most recent seen lflow count. >>>>>>>>>>>> Please, re-read the commit message. It's a first run of the loop >>>>>>>>>>>> and the 'max_seen_lflow_size' is default 128 at this point. >>>>>>>>>>> Ack, >>>>>>>>>>> >>>>>>>>>>> Not using auto-resizing in single threaded mode is a bug. Thanks for fixing it. >>>>>>>>>>> >>>>>>>>>>> From that perspective the patch is a straight +1 from me. >>>>>>>>>>> >>>>>>>>>>> From the perspective of the use case stated in the commit message- I am not sure it addresses it. >>>>>>>>>>> >>>>>>>>>>> If northd is in single-threaded mode and is tackling a GIGANTIC> database, it may never complete the first iteration before the >>>>>>>>>>> expiration of the timers and everyone deciding that northd is AWOL. >>>>>>>>>> Well, how do you suggest to fix that? Obviously, we can always create >>>>>>>>>> a database that northd will never be able to process in a reasonable >>>>>>>>>> amount of time. And it doesn't matter if it's single- or multi-threaded. >>>>>>>>>> >>>>>>>>>> In this case NbDB is only 9MB in size, which is very reasonable, and >>>>>>>>>> northd on my laptop takes more than 15 minutes to process it (I killed >>>>>>>>>> it at this point). With the patch applied it took only 11 seconds. >>>>>>>>>> So, for me, this patch pretty much fixes the issue. 11 seconds is not >>>>>>>>>> that bad, e.g. ovn-k8s configures inactivity probes for clients to 180. >>>>>>>>>> It would be great to reduce, but we're not there yet. >>>>>>>>>> >>>>>>>>>>> In that case, if it is multi-threaded from the start, it should probably >>>>>>>>>>> grab the sizing from the lflow table hash in south db. That would be a >>>>>>>>>>> good approximation for the first run. >>>>>>>>>> This will not work for a case where SbDB is empty for any reason while >>>>>>>>>> NbDB is not. And there is still a case where northd initially connects >>>>>>>>>> to semi-empty databases and after few iterations NbDB receives a big >>>>>>>>>> transaction and generates a big update for northd. >>>>>>>>> A partial fix is to resize to optimum size the hash after lflow processing. >>>>>>>> I'm not sure I understand what you mean here, because resizing after >>>>>>>> lflow processing will not help. 
The lflow processing itself is the >>>>>>>> very slow part that we're trying to make faster here. >>>>>>> That can be the case only with dpgroups. Otherwise lflows are just a destination to dump data and the bucket sizing is irrelevant because there is never any lookup inside lflows during processing. The lflow map is just used to store data. So if it is suboptimal at the exit of build_lflows() the resize will fix it before the first lookup shortly thereafter. >>>>>>> >>>>>>> Are you running it with dpgroups enabled? In that case there are lookups inside lflows during build which happen under a per-bucket lock. So in addition to suboptimal size when searching the contention depends on the number of buckets. If they are too few, the system becomes heavily contended resulting in ridiculous computation sizes. >>>>>> Oh, I see. Indeed, without dp-groups there is no lookup during >>>>>> lflow build. I missed that. So, yes, I agree that for a case >>>>>> without dp-groups, re-sizing after lflow processing should work. >>>>>> We need that for parallel case. >>>>>> >>>>>> Current patch (use hmap_insert() that resizes if needed) helps >>>>>> for all non-parallel cases. >>>>> Indeed. It should go in. >>>>> >>>> Why can't we have hmap_insert() for both parallel and non parallel configs >>>> to start with and switch over to hmap_insert_fast() when ovn-northd >>>> has successfully connected to SB DB and has approximated on the >>>> more accurate hmap size ? >>> We can't use hmap_insert() for parallel, because resize of a hash map >>> will crash threads that are working on it at the moment, IIUC. >> >> We actually can for non-dp-groups case, but the merge will become much slower. Example - the last version of the snapshot and monitor parallelization for OVS. It no longer uses pre-sized fixed hmaps because it is nearly impossible to presize an RFC7047 structure correctly. >> >> For the dp-groups case we can't. Locking of the lflows which are used for lookups is per hash bucket. You cannot change the number of buckets in the middle of the run and other locking mechanisms will be more coarse. So the marginal benefit with dp-groups at present will become none (or even slower). >> >>> >>> We could disable parallel for first several iterations, but still, >>> this doesn't cover all the cases. i.e. the one where we have a big >>> update from the NbDB with a fairly small DB before that. >> >> I am working on that. Let me collect some stat data for the current OVN master with all the recent changes. They have changed the "heat map" for the dp_groups case. >> >> We may need to do some adjustments to what and where we parallelize in the dp-groups case (if at all). > > Ok. You're fine with the proposed patch here right ? > > @Ilya - Can you please update the commit message to include that the > issue is seen with dp groups enabled ? I'm not sure about that, because the issue will still be there, but in a different place. We will not have lookups during construction of new logical flows, but we will have them while trying to reconcile existing SbDB flows and this will be slow too, i.e. the loop '/* Push changes to the Logical_Flow table to database. */' will be a few thousand times slower. So, the issue is less severe, but it is still there even with dp groups disabled. > > I guess you can share the commit message here. No need to submit v2 for that. > > Thanks > Numan > > Thanks > Numan >> >> A. >> >>> >>>> Thanks >>>> Numan >>>> >>>>> I will sort out the other cases to the extent possible. 
>>>> >>>>> Brgds, >>>>> >>>>> A. >>>>> >>>>>> I'm mostly running dp-groups + non-parallel which is a default >>>>>> case for ovn-heater/ovn-k8s. >>>>>> >>>>>>> For the case of "dpgroups + parallel + first iteration + pre-existing large database" there is no cure short of pre-allocating the hash to maximum size. >>>>>> Yeah, dp-groups + parallel is a hard case. >>>>>> >>>>>>> I am scale testing that as well as resize (for non-dp-groups cases) at present. >>>>>>> >>>>>>> Brgds, >>>>>>> >>>>>>> A. >>>>>>> >>>>>>>>> If the sizing was correct - 99.9% of the case this will be a noop. >>>>>>>>> >>>>>>>>> If the sizing was incorrect, it will be resized so that the DP searches and all other ops which were recently added for flow reduction will work optimally. >>>>>>>>> >>>>>>>>> This still does not work for lflow compute with dpgroups + parallel upon initial connect and without a SB database to use for size guidance. It will work for all other cases. >>>>>>>>> >>>>>>>>> I will send two separate patches to address the cases which can be easily addressed and see what can be done with the dp+parallel upon initial connect to an empty sb database. >>>>>>>>> >>>>>>>>> Brgds, >>>>>>>>> >>>>>>>>> A >>>>>>>>> >>>>>>>>>>> A. >>>>>>>>>>> >>>>>>>>>>>>> A. >>>>>>>>>>>>> >>>>>>>>>>>>> On 23/08/2021 21:02, Ilya Maximets wrote: >>>>>>>>>>>>>> 'lflow_map' is never expanded because northd always uses fast >>>>>>>>>>>>>> insertion. This leads to the case where we have a hash map >>>>>>>>>>>>>> with only 128 initial buckets and every ovn_lflow_find() ends >>>>>>>>>>>>>> up iterating over n_lflows / 128 entries. It's thousands of >>>>>>>>>>>>>> logical flows or even more. For example, it takes forever for >>>>>>>>>>>>>> ovn-northd to start up with the Northbound Db from the 120 node >>>>>>>>>>>>>> density-heavy test from ovn-heater, because every lookup is slower >>>>>>>>>>>>>> than previous one. I aborted the process after 15 minutes of >>>>>>>>>>>>>> waiting, because there was no sign that it will converge. With >>>>>>>>>>>>>> this change applied the loop completes in only 11 seconds. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Hash map will be pre-allocated to the maximum seen number of >>>>>>>>>>>>>> logical flows on a second iteration, but this doesn't help for >>>>>>>>>>>>>> the first iteration when northd first time connects to a big >>>>>>>>>>>>>> Northbound database, which is a common case during failover or >>>>>>>>>>>>>> cluster upgrade. And there is an even trickier case where big >>>>>>>>>>>>>> NbDB transaction that explodes the number of logical flows received >>>>>>>>>>>>>> on not the first run. >>>>>>>>>>>>>> >>>>>>>>>>>>>> We can't expand the hash map in case of parallel build, so this >>>>>>>>>>>>>> should be fixed separately. 
>>>>>>>>>>>>>> >>>>>>>>>>>>>> CC: Anton Ivanov <anton.ivanov@cambridgegreys.com> >>>>>>>>>>>>>> Fixes: 74daa0607c7f ("ovn-northd: Introduce parallel lflow build") >>>>>>>>>>>>>> Signed-off-by: Ilya Maximets <i.maximets@ovn.org> >>>>>>>>>>>>>> --- >>>>>>>>>>>>>> northd/ovn-northd.c | 6 +++++- >>>>>>>>>>>>>> 1 file changed, 5 insertions(+), 1 deletion(-) >>>>>>>>>>>>>> >>>>>>>>>>>>>> diff --git a/northd/ovn-northd.c b/northd/ovn-northd.c >>>>>>>>>>>>>> index 3d8e21a4f..40cf957c0 100644 >>>>>>>>>>>>>> --- a/northd/ovn-northd.c >>>>>>>>>>>>>> +++ b/northd/ovn-northd.c >>>>>>>>>>>>>> @@ -4387,7 +4387,11 @@ do_ovn_lflow_add(struct hmap *lflow_map, struct ovn_datapath *od, >>>>>>>>>>>>>> nullable_xstrdup(ctrl_meter), >>>>>>>>>>>>>> ovn_lflow_hint(stage_hint), where); >>>>>>>>>>>>>> hmapx_add(&lflow->od_group, od); >>>>>>>>>>>>>> - hmap_insert_fast(lflow_map, &lflow->hmap_node, hash); >>>>>>>>>>>>>> + if (!use_parallel_build) { >>>>>>>>>>>>>> + hmap_insert(lflow_map, &lflow->hmap_node, hash); >>>>>>>>>>>>>> + } else { >>>>>>>>>>>>>> + hmap_insert_fast(lflow_map, &lflow->hmap_node, hash); >>>>>>>>>>>>>> + } >>>>>>>>>>>>>> } >>>>>>>>>>>>>> /* Adds a row with the specified contents to the Logical_Flow table. */ >>>>> -- >>>>> Anton R. Ivanov >>>>> Cambridgegreys Limited. Registered in England. Company Number 10273661 >>>>> https://www.cambridgegreys.com/ >>>>> >>>>> _______________________________________________ >>>>> dev mailing list >>>>> dev@openvswitch.org >>>>> https://mail.openvswitch.org/mailman/listinfo/ovs-dev >>> >> -- >> Anton R. Ivanov >> Cambridgegreys Limited. Registered in England. Company Number 10273661 >> https://www.cambridgegreys.com/ >> >> _______________________________________________ >> dev mailing list >> dev@openvswitch.org >> https://mail.openvswitch.org/mailman/listinfo/ovs-dev >>
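To make the hmap_insert()/hmap_insert_fast() distinction discussed above concrete, here is a minimal sketch. It is not northd code: it only uses the OVS hmap calls already quoted in this thread (hmap_insert(), hmap_insert_fast()), while the toy struct, the fill() helper and the hash choice are made up for illustration.

#include <stdbool.h>
#include <stdint.h>
#include "openvswitch/hmap.h"
#include "hash.h"
#include "util.h"

struct toy_lflow {
    struct hmap_node node;   /* Stand-in for ovn_lflow's hmap_node. */
    uint32_t key;
};

/* Inserts 'n' nodes.  With 'fast' set, the bucket array never grows, so a
 * map pre-sized for the default 128 entries keeps that bucket count and
 * every later lookup walks a chain of roughly n / 128 nodes.  Without
 * 'fast', hmap_insert() expands the bucket array as the map fills and the
 * chains stay short. */
static void
fill(struct hmap *map, uint32_t n, bool fast)
{
    for (uint32_t i = 0; i < n; i++) {
        struct toy_lflow *lf = xmalloc(sizeof *lf);
        lf->key = i;

        uint32_t hash = hash_int(i, 0);
        if (fast) {
            hmap_insert_fast(map, &lf->node, hash);
        } else {
            hmap_insert(map, &lf->node, hash);
        }
    }
}

This is exactly the trade-off the patch makes in do_ovn_lflow_add(): parallel workers must use the fast path because an expansion would move nodes under them, while the single-threaded path can safely let the map grow.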
On 24/08/2021 19:16, Ilya Maximets wrote: > On 8/24/21 7:48 PM, Numan Siddique wrote: >> On Tue, Aug 24, 2021 at 1:24 PM Anton Ivanov >> <anton.ivanov@cambridgegreys.com> wrote: >>> >>> On 24/08/2021 17:35, Ilya Maximets wrote: >>>> On 8/24/21 6:25 PM, Numan Siddique wrote: >>>>> On Tue, Aug 24, 2021 at 7:56 AM Anton Ivanov >>>>> <anton.ivanov@cambridgegreys.com> wrote: >>>>>> On 24/08/2021 12:46, Ilya Maximets wrote: >>>>>>> On 8/24/21 1:18 PM, Anton Ivanov wrote: >>>>>>>> On 24/08/2021 12:05, Ilya Maximets wrote: >>>>>>>>> On 8/24/21 7:36 AM, Anton Ivanov wrote: >>>>>>>>>> On 23/08/2021 22:36, Ilya Maximets wrote: >>>>>>>>>>> On 8/23/21 10:37 PM, Anton Ivanov wrote: >>>>>>>>>>>> On 23/08/2021 21:26, Ilya Maximets wrote: >>>>>>>>>>>>> On 8/23/21 10:20 PM, Anton Ivanov wrote: >>>>>>>>>>>>>> Should not be the case. >>>>>>>>>>>>>> >>>>>>>>>>>>>> The map is pre-sized for the size from the previous iterations. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Line 12861 in my tree which is probably a few commits out of date: >>>>>>>>>>>>>> >>>>>>>>>>>>>> fast_hmap_size_for(&lflows, max_seen_lflow_size); >>>>>>>>>>>>>> >>>>>>>>>>>>>> And immediately after building the lflows: >>>>>>>>>>>>>> >>>>>>>>>>>>>> if (hmap_count(&lflows) > max_seen_lflow_size) { >>>>>>>>>>>>>> max_seen_lflow_size = hmap_count(&lflows); >>>>>>>>>>>>>> } >>>>>>>>>>>>>> >>>>>>>>>>>>>> So the map SHOULD be sized correctly - to the most recent seen lflow count. >>>>>>>>>>>>> Please, re-read the commit message. It's a first run of the loop >>>>>>>>>>>>> and the 'max_seen_lflow_size' is default 128 at this point. >>>>>>>>>>>> Ack, >>>>>>>>>>>> >>>>>>>>>>>> Not using auto-resizing in single threaded mode is a bug. Thanks for fixing it. >>>>>>>>>>>> >>>>>>>>>>>> From that perspective the patch is a straight +1 from me. >>>>>>>>>>>> >>>>>>>>>>>> From the perspective of the use case stated in the commit message- I am not sure it addresses it. >>>>>>>>>>>> >>>>>>>>>>>> If northd is in single-threaded mode and is tackling a GIGANTIC> database, it may never complete the first iteration before the >>>>>>>>>>>> expiration of the timers and everyone deciding that northd is AWOL. >>>>>>>>>>> Well, how do you suggest to fix that? Obviously, we can always create >>>>>>>>>>> a database that northd will never be able to process in a reasonable >>>>>>>>>>> amount of time. And it doesn't matter if it's single- or multi-threaded. >>>>>>>>>>> >>>>>>>>>>> In this case NbDB is only 9MB in size, which is very reasonable, and >>>>>>>>>>> northd on my laptop takes more than 15 minutes to process it (I killed >>>>>>>>>>> it at this point). With the patch applied it took only 11 seconds. >>>>>>>>>>> So, for me, this patch pretty much fixes the issue. 11 seconds is not >>>>>>>>>>> that bad, e.g. ovn-k8s configures inactivity probes for clients to 180. >>>>>>>>>>> It would be great to reduce, but we're not there yet. >>>>>>>>>>> >>>>>>>>>>>> In that case, if it is multi-threaded from the start, it should probably >>>>>>>>>>>> grab the sizing from the lflow table hash in south db. That would be a >>>>>>>>>>>> good approximation for the first run. >>>>>>>>>>> This will not work for a case where SbDB is empty for any reason while >>>>>>>>>>> NbDB is not. And there is still a case where northd initially connects >>>>>>>>>>> to semi-empty databases and after few iterations NbDB receives a big >>>>>>>>>>> transaction and generates a big update for northd. >>>>>>>>>> A partial fix is to resize to optimum size the hash after lflow processing. 
>>>>>>>>> I'm not sure I understand what you mean here, because resizing after >>>>>>>>> lflow processing will not help. The lflow processing itself is the >>>>>>>>> very slow part that we're trying to make faster here. >>>>>>>> That can be the case only with dpgroups. Otherwise lflows are just a destination to dump data and the bucket sizing is irrelevant because there is never any lookup inside lflows during processing. The lflow map is just used to store data. So if it is suboptimal at the exit of build_lflows() the resize will fix it before the first lookup shortly thereafter. >>>>>>>> >>>>>>>> Are you running it with dpgroups enabled? In that case there are lookups inside lflows during build which happen under a per-bucket lock. So in addition to suboptimal size when searching the contention depends on the number of buckets. If they are too few, the system becomes heavily contended resulting in ridiculous computation sizes. >>>>>>> Oh, I see. Indeed, without dp-groups there is no lookup during >>>>>>> lflow build. I missed that. So, yes, I agree that for a case >>>>>>> without dp-groups, re-sizing after lflow processing should work. >>>>>>> We need that for parallel case. >>>>>>> >>>>>>> Current patch (use hmap_insert() that resizes if needed) helps >>>>>>> for all non-parallel cases. >>>>>> Indeed. It should go in. >>>>>> >>>>> Why can't we have hmap_insert() for both parallel and non parallel configs >>>>> to start with and switch over to hmap_insert_fast() when ovn-northd >>>>> has successfully connected to SB DB and has approximated on the >>>>> more accurate hmap size ? >>>> We can't use hmap_insert() for parallel, because resize of a hash map >>>> will crash threads that are working on it at the moment, IIUC. >>> We actually can for non-dp-groups case, but the merge will become much slower. Example - the last version of the snapshot and monitor parallelization for OVS. It no longer uses pre-sized fixed hmaps because it is nearly impossible to presize an RFC7047 structure correctly. >>> >>> For the dp-groups case we can't. Locking of the lflows which are used for lookups is per hash bucket. You cannot change the number of buckets in the middle of the run and other locking mechanisms will be more coarse. So the marginal benefit with dp-groups at present will become none (or even slower). >>> >>>> We could disable parallel for first several iterations, but still, >>>> this doesn't cover all the cases. i.e. the one where we have a big >>>> update from the NbDB with a fairly small DB before that. >>> I am working on that. Let me collect some stat data for the current OVN master with all the recent changes. They have changed the "heat map" for the dp_groups case. >>> >>> We may need to do some adjustments to what and where we parallelize in the dp-groups case (if at all). >> Ok. You're fine with the proposed patch here right ? >> >> @Ilya - Can you please update the commit message to include that the >> issue is seen with dp groups enabled ? > I'm not sure about that, because issue will still be there, but in a > different place. We will not have lookups during construction of new > logical flows, but we will have them while trying to reconcile existing > SbDB flows and this will be slow too, i.e. the loop > '/* Push changes to the Logical_Flow table to database. */' will be a > a few thousand times slower. So, the issue is less severe, but it > still there even with dp groups disabled. I am OK with the log entry as is. 
In the absencee of a hmap_expand(&lflows) right after the computation it is valid for both dp_groups and non dp_groups cases. > >> I guess you can share the commit message here. No need to submit v2 for that. >> >> Thanks >> Numan >> >> Thanks >> Numan >>> A. >>> >>>>> Thanks >>>>> Numan >>>>> >>>>>> I will sort out the other cases to the extent possible. >>>>>> Brgds, >>>>>> >>>>>> A. >>>>>> >>>>>>> I'm mostly running dp-groups + non-parallel which is a default >>>>>>> case for ovn-heater/ovn-k8s. >>>>>>> >>>>>>>> For the case of "dpgroups + parallel + first iteration + pre-existing large database" there is no cure short of pre-allocating the hash to maximum size. >>>>>>> Yeah, dp-groups + parallel is a hard case. >>>>>>> >>>>>>>> I am scale testing that as well as resize (for non-dp-groups cases) at present. >>>>>>>> >>>>>>>> Brgds, >>>>>>>> >>>>>>>> A. >>>>>>>> >>>>>>>>>> If the sizing was correct - 99.9% of the case this will be a noop. >>>>>>>>>> >>>>>>>>>> If the sizing was incorrect, it will be resized so that the DP searches and all other ops which were recently added for flow reduction will work optimally. >>>>>>>>>> >>>>>>>>>> This still does not work for lflow compute with dpgroups + parallel upon initial connect and without a SB database to use for size guidance. It will work for all other cases. >>>>>>>>>> >>>>>>>>>> I will send two separate patches to address the cases which can be easily addressed and see what can be done with the dp+parallel upon initial connect to an empty sb database. >>>>>>>>>> >>>>>>>>>> Brgds, >>>>>>>>>> >>>>>>>>>> A >>>>>>>>>> >>>>>>>>>>>> A. >>>>>>>>>>>> >>>>>>>>>>>>>> A. >>>>>>>>>>>>>> >>>>>>>>>>>>>> On 23/08/2021 21:02, Ilya Maximets wrote: >>>>>>>>>>>>>>> 'lflow_map' is never expanded because northd always uses fast >>>>>>>>>>>>>>> insertion. This leads to the case where we have a hash map >>>>>>>>>>>>>>> with only 128 initial buckets and every ovn_lflow_find() ends >>>>>>>>>>>>>>> up iterating over n_lflows / 128 entries. It's thousands of >>>>>>>>>>>>>>> logical flows or even more. For example, it takes forever for >>>>>>>>>>>>>>> ovn-northd to start up with the Northbound Db from the 120 node >>>>>>>>>>>>>>> density-heavy test from ovn-heater, because every lookup is slower >>>>>>>>>>>>>>> than previous one. I aborted the process after 15 minutes of >>>>>>>>>>>>>>> waiting, because there was no sign that it will converge. With >>>>>>>>>>>>>>> this change applied the loop completes in only 11 seconds. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Hash map will be pre-allocated to the maximum seen number of >>>>>>>>>>>>>>> logical flows on a second iteration, but this doesn't help for >>>>>>>>>>>>>>> the first iteration when northd first time connects to a big >>>>>>>>>>>>>>> Northbound database, which is a common case during failover or >>>>>>>>>>>>>>> cluster upgrade. And there is an even trickier case where big >>>>>>>>>>>>>>> NbDB transaction that explodes the number of logical flows received >>>>>>>>>>>>>>> on not the first run. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> We can't expand the hash map in case of parallel build, so this >>>>>>>>>>>>>>> should be fixed separately. 
>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> CC: Anton Ivanov <anton.ivanov@cambridgegreys.com> >>>>>>>>>>>>>>> Fixes: 74daa0607c7f ("ovn-northd: Introduce parallel lflow build") >>>>>>>>>>>>>>> Signed-off-by: Ilya Maximets <i.maximets@ovn.org> >>>>>>>>>>>>>>> --- >>>>>>>>>>>>>>> northd/ovn-northd.c | 6 +++++- >>>>>>>>>>>>>>> 1 file changed, 5 insertions(+), 1 deletion(-) >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> diff --git a/northd/ovn-northd.c b/northd/ovn-northd.c >>>>>>>>>>>>>>> index 3d8e21a4f..40cf957c0 100644 >>>>>>>>>>>>>>> --- a/northd/ovn-northd.c >>>>>>>>>>>>>>> +++ b/northd/ovn-northd.c >>>>>>>>>>>>>>> @@ -4387,7 +4387,11 @@ do_ovn_lflow_add(struct hmap *lflow_map, struct ovn_datapath *od, >>>>>>>>>>>>>>> nullable_xstrdup(ctrl_meter), >>>>>>>>>>>>>>> ovn_lflow_hint(stage_hint), where); >>>>>>>>>>>>>>> hmapx_add(&lflow->od_group, od); >>>>>>>>>>>>>>> - hmap_insert_fast(lflow_map, &lflow->hmap_node, hash); >>>>>>>>>>>>>>> + if (!use_parallel_build) { >>>>>>>>>>>>>>> + hmap_insert(lflow_map, &lflow->hmap_node, hash); >>>>>>>>>>>>>>> + } else { >>>>>>>>>>>>>>> + hmap_insert_fast(lflow_map, &lflow->hmap_node, hash); >>>>>>>>>>>>>>> + } >>>>>>>>>>>>>>> } >>>>>>>>>>>>>>> /* Adds a row with the specified contents to the Logical_Flow table. */ >>>>>> -- >>>>>> Anton R. Ivanov >>>>>> Cambridgegreys Limited. Registered in England. Company Number 10273661 >>>>>> https://www.cambridgegreys.com/ >>>>>> >>>>>> _______________________________________________ >>>>>> dev mailing list >>>>>> dev@openvswitch.org >>>>>> https://mail.openvswitch.org/mailman/listinfo/ovs-dev >>> -- >>> Anton R. Ivanov >>> Cambridgegreys Limited. Registered in England. Company Number 10273661 >>> https://www.cambridgegreys.com/ >>> >>> _______________________________________________ >>> dev mailing list >>> dev@openvswitch.org >>> https://mail.openvswitch.org/mailman/listinfo/ovs-dev >>> >
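A sketch of the pre-size/expand sequence being discussed, using only the helpers quoted earlier in the thread (fast_hmap_size_for(), hmap_expand(), hmap_count()). The wrapper function and the build_lflows() call are simplified stand-ins for the real northd code path, not its actual signatures.

/* Simplified sketch; 'build_lflows' stands in for the real (possibly
 * parallel) lflow generation. */
static void
generate_lflows(struct hmap *lflows, size_t *max_seen_lflow_size)
{
    /* Pre-size from the previous iteration so that parallel workers,
     * which only use hmap_insert_fast(), start with a sensible bucket
     * count (128 on the very first run). */
    fast_hmap_size_for(lflows, *max_seen_lflow_size);

    build_lflows(lflows);

    /* Fix up the bucket count before the single-threaded code starts
     * doing lookups (dp-group post-processing, SbDB reconciliation). */
    hmap_expand(lflows);

    if (hmap_count(lflows) > *max_seen_lflow_size) {
        *max_seen_lflow_size = hmap_count(lflows);
    }
}

The hmap_expand() call after the build is the "right after the computation" fix-up mentioned above: it does not help the lookups performed during a dp-groups build, but it keeps every lookup that happens afterwards from walking huge chains.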
On Tue, Aug 24, 2021 at 2:20 PM Anton Ivanov <anton.ivanov@cambridgegreys.com> wrote: > > On 24/08/2021 19:16, Ilya Maximets wrote: > > On 8/24/21 7:48 PM, Numan Siddique wrote: > >> On Tue, Aug 24, 2021 at 1:24 PM Anton Ivanov > >> <anton.ivanov@cambridgegreys.com> wrote: > >>> > >>> On 24/08/2021 17:35, Ilya Maximets wrote: > >>>> On 8/24/21 6:25 PM, Numan Siddique wrote: > >>>>> On Tue, Aug 24, 2021 at 7:56 AM Anton Ivanov > >>>>> <anton.ivanov@cambridgegreys.com> wrote: > >>>>>> On 24/08/2021 12:46, Ilya Maximets wrote: > >>>>>>> On 8/24/21 1:18 PM, Anton Ivanov wrote: > >>>>>>>> On 24/08/2021 12:05, Ilya Maximets wrote: > >>>>>>>>> On 8/24/21 7:36 AM, Anton Ivanov wrote: > >>>>>>>>>> On 23/08/2021 22:36, Ilya Maximets wrote: > >>>>>>>>>>> On 8/23/21 10:37 PM, Anton Ivanov wrote: > >>>>>>>>>>>> On 23/08/2021 21:26, Ilya Maximets wrote: > >>>>>>>>>>>>> On 8/23/21 10:20 PM, Anton Ivanov wrote: > >>>>>>>>>>>>>> Should not be the case. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> The map is pre-sized for the size from the previous iterations. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Line 12861 in my tree which is probably a few commits out of date: > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> fast_hmap_size_for(&lflows, max_seen_lflow_size); > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> And immediately after building the lflows: > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> if (hmap_count(&lflows) > max_seen_lflow_size) { > >>>>>>>>>>>>>> max_seen_lflow_size = hmap_count(&lflows); > >>>>>>>>>>>>>> } > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> So the map SHOULD be sized correctly - to the most recent seen lflow count. > >>>>>>>>>>>>> Please, re-read the commit message. It's a first run of the loop > >>>>>>>>>>>>> and the 'max_seen_lflow_size' is default 128 at this point. > >>>>>>>>>>>> Ack, > >>>>>>>>>>>> > >>>>>>>>>>>> Not using auto-resizing in single threaded mode is a bug. Thanks for fixing it. > >>>>>>>>>>>> > >>>>>>>>>>>> From that perspective the patch is a straight +1 from me. > >>>>>>>>>>>> > >>>>>>>>>>>> From the perspective of the use case stated in the commit message- I am not sure it addresses it. > >>>>>>>>>>>> > >>>>>>>>>>>> If northd is in single-threaded mode and is tackling a GIGANTIC> database, it may never complete the first iteration before the > >>>>>>>>>>>> expiration of the timers and everyone deciding that northd is AWOL. > >>>>>>>>>>> Well, how do you suggest to fix that? Obviously, we can always create > >>>>>>>>>>> a database that northd will never be able to process in a reasonable > >>>>>>>>>>> amount of time. And it doesn't matter if it's single- or multi-threaded. > >>>>>>>>>>> > >>>>>>>>>>> In this case NbDB is only 9MB in size, which is very reasonable, and > >>>>>>>>>>> northd on my laptop takes more than 15 minutes to process it (I killed > >>>>>>>>>>> it at this point). With the patch applied it took only 11 seconds. > >>>>>>>>>>> So, for me, this patch pretty much fixes the issue. 11 seconds is not > >>>>>>>>>>> that bad, e.g. ovn-k8s configures inactivity probes for clients to 180. > >>>>>>>>>>> It would be great to reduce, but we're not there yet. > >>>>>>>>>>> > >>>>>>>>>>>> In that case, if it is multi-threaded from the start, it should probably > >>>>>>>>>>>> grab the sizing from the lflow table hash in south db. That would be a > >>>>>>>>>>>> good approximation for the first run. > >>>>>>>>>>> This will not work for a case where SbDB is empty for any reason while > >>>>>>>>>>> NbDB is not. 
And there is still a case where northd initially connects > >>>>>>>>>>> to semi-empty databases and after few iterations NbDB receives a big > >>>>>>>>>>> transaction and generates a big update for northd. > >>>>>>>>>> A partial fix is to resize to optimum size the hash after lflow processing. > >>>>>>>>> I'm not sure I understand what you mean here, because resizing after > >>>>>>>>> lflow processing will not help. The lflow processing itself is the > >>>>>>>>> very slow part that we're trying to make faster here. > >>>>>>>> That can be the case only with dpgroups. Otherwise lflows are just a destination to dump data and the bucket sizing is irrelevant because there is never any lookup inside lflows during processing. The lflow map is just used to store data. So if it is suboptimal at the exit of build_lflows() the resize will fix it before the first lookup shortly thereafter. > >>>>>>>> > >>>>>>>> Are you running it with dpgroups enabled? In that case there are lookups inside lflows during build which happen under a per-bucket lock. So in addition to suboptimal size when searching the contention depends on the number of buckets. If they are too few, the system becomes heavily contended resulting in ridiculous computation sizes. > >>>>>>> Oh, I see. Indeed, without dp-groups there is no lookup during > >>>>>>> lflow build. I missed that. So, yes, I agree that for a case > >>>>>>> without dp-groups, re-sizing after lflow processing should work. > >>>>>>> We need that for parallel case. > >>>>>>> > >>>>>>> Current patch (use hmap_insert() that resizes if needed) helps > >>>>>>> for all non-parallel cases. > >>>>>> Indeed. It should go in. > >>>>>> > >>>>> Why can't we have hmap_insert() for both parallel and non parallel configs > >>>>> to start with and switch over to hmap_insert_fast() when ovn-northd > >>>>> has successfully connected to SB DB and has approximated on the > >>>>> more accurate hmap size ? > >>>> We can't use hmap_insert() for parallel, because resize of a hash map > >>>> will crash threads that are working on it at the moment, IIUC. > >>> We actually can for non-dp-groups case, but the merge will become much slower. Example - the last version of the snapshot and monitor parallelization for OVS. It no longer uses pre-sized fixed hmaps because it is nearly impossible to presize an RFC7047 structure correctly. > >>> > >>> For the dp-groups case we can't. Locking of the lflows which are used for lookups is per hash bucket. You cannot change the number of buckets in the middle of the run and other locking mechanisms will be more coarse. So the marginal benefit with dp-groups at present will become none (or even slower). > >>> > >>>> We could disable parallel for first several iterations, but still, > >>>> this doesn't cover all the cases. i.e. the one where we have a big > >>>> update from the NbDB with a fairly small DB before that. > >>> I am working on that. Let me collect some stat data for the current OVN master with all the recent changes. They have changed the "heat map" for the dp_groups case. > >>> > >>> We may need to do some adjustments to what and where we parallelize in the dp-groups case (if at all). > >> Ok. You're fine with the proposed patch here right ? > >> > >> @Ilya - Can you please update the commit message to include that the > >> issue is seen with dp groups enabled ? > > I'm not sure about that, because issue will still be there, but in a > > different place. 
We will not have lookups during construction of new > > logical flows, but we will have them while trying to reconcile existing > > SbDB flows and this will be slow too, i.e. the loop > > '/* Push changes to the Logical_Flow table to database. */' will be a > > a few thousand times slower. So, the issue is less severe, but it > > still there even with dp groups disabled. Right. Thanks for pointing this out. I applied this patch to the main branch and backported to branch-21.06. Numan > > I am OK with the log entry as is. In the absencee of a > hmap_expand(&lflows) right after the computation it is valid for both > dp_groups and non dp_groups cases. > > > > >> I guess you can share the commit message here. No need to submit v2 for that. > >> > >> Thanks > >> Numan > >> > >> Thanks > >> Numan > >>> A. > >>> > >>>>> Thanks > >>>>> Numan > >>>>> > >>>>>> I will sort out the other cases to the extent possible. > >>>>>> Brgds, > >>>>>> > >>>>>> A. > >>>>>> > >>>>>>> I'm mostly running dp-groups + non-parallel which is a default > >>>>>>> case for ovn-heater/ovn-k8s. > >>>>>>> > >>>>>>>> For the case of "dpgroups + parallel + first iteration + pre-existing large database" there is no cure short of pre-allocating the hash to maximum size. > >>>>>>> Yeah, dp-groups + parallel is a hard case. > >>>>>>> > >>>>>>>> I am scale testing that as well as resize (for non-dp-groups cases) at present. > >>>>>>>> > >>>>>>>> Brgds, > >>>>>>>> > >>>>>>>> A. > >>>>>>>> > >>>>>>>>>> If the sizing was correct - 99.9% of the case this will be a noop. > >>>>>>>>>> > >>>>>>>>>> If the sizing was incorrect, it will be resized so that the DP searches and all other ops which were recently added for flow reduction will work optimally. > >>>>>>>>>> > >>>>>>>>>> This still does not work for lflow compute with dpgroups + parallel upon initial connect and without a SB database to use for size guidance. It will work for all other cases. > >>>>>>>>>> > >>>>>>>>>> I will send two separate patches to address the cases which can be easily addressed and see what can be done with the dp+parallel upon initial connect to an empty sb database. > >>>>>>>>>> > >>>>>>>>>> Brgds, > >>>>>>>>>> > >>>>>>>>>> A > >>>>>>>>>> > >>>>>>>>>>>> A. > >>>>>>>>>>>> > >>>>>>>>>>>>>> A. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> On 23/08/2021 21:02, Ilya Maximets wrote: > >>>>>>>>>>>>>>> 'lflow_map' is never expanded because northd always uses fast > >>>>>>>>>>>>>>> insertion. This leads to the case where we have a hash map > >>>>>>>>>>>>>>> with only 128 initial buckets and every ovn_lflow_find() ends > >>>>>>>>>>>>>>> up iterating over n_lflows / 128 entries. It's thousands of > >>>>>>>>>>>>>>> logical flows or even more. For example, it takes forever for > >>>>>>>>>>>>>>> ovn-northd to start up with the Northbound Db from the 120 node > >>>>>>>>>>>>>>> density-heavy test from ovn-heater, because every lookup is slower > >>>>>>>>>>>>>>> than previous one. I aborted the process after 15 minutes of > >>>>>>>>>>>>>>> waiting, because there was no sign that it will converge. With > >>>>>>>>>>>>>>> this change applied the loop completes in only 11 seconds. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> Hash map will be pre-allocated to the maximum seen number of > >>>>>>>>>>>>>>> logical flows on a second iteration, but this doesn't help for > >>>>>>>>>>>>>>> the first iteration when northd first time connects to a big > >>>>>>>>>>>>>>> Northbound database, which is a common case during failover or > >>>>>>>>>>>>>>> cluster upgrade. 
And there is an even trickier case where big > >>>>>>>>>>>>>>> NbDB transaction that explodes the number of logical flows received > >>>>>>>>>>>>>>> on not the first run. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> We can't expand the hash map in case of parallel build, so this > >>>>>>>>>>>>>>> should be fixed separately. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> CC: Anton Ivanov <anton.ivanov@cambridgegreys.com> > >>>>>>>>>>>>>>> Fixes: 74daa0607c7f ("ovn-northd: Introduce parallel lflow build") > >>>>>>>>>>>>>>> Signed-off-by: Ilya Maximets <i.maximets@ovn.org> > >>>>>>>>>>>>>>> --- > >>>>>>>>>>>>>>> northd/ovn-northd.c | 6 +++++- > >>>>>>>>>>>>>>> 1 file changed, 5 insertions(+), 1 deletion(-) > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> diff --git a/northd/ovn-northd.c b/northd/ovn-northd.c > >>>>>>>>>>>>>>> index 3d8e21a4f..40cf957c0 100644 > >>>>>>>>>>>>>>> --- a/northd/ovn-northd.c > >>>>>>>>>>>>>>> +++ b/northd/ovn-northd.c > >>>>>>>>>>>>>>> @@ -4387,7 +4387,11 @@ do_ovn_lflow_add(struct hmap *lflow_map, struct ovn_datapath *od, > >>>>>>>>>>>>>>> nullable_xstrdup(ctrl_meter), > >>>>>>>>>>>>>>> ovn_lflow_hint(stage_hint), where); > >>>>>>>>>>>>>>> hmapx_add(&lflow->od_group, od); > >>>>>>>>>>>>>>> - hmap_insert_fast(lflow_map, &lflow->hmap_node, hash); > >>>>>>>>>>>>>>> + if (!use_parallel_build) { > >>>>>>>>>>>>>>> + hmap_insert(lflow_map, &lflow->hmap_node, hash); > >>>>>>>>>>>>>>> + } else { > >>>>>>>>>>>>>>> + hmap_insert_fast(lflow_map, &lflow->hmap_node, hash); > >>>>>>>>>>>>>>> + } > >>>>>>>>>>>>>>> } > >>>>>>>>>>>>>>> /* Adds a row with the specified contents to the Logical_Flow table. */ > >>>>>> -- > >>>>>> Anton R. Ivanov > >>>>>> Cambridgegreys Limited. Registered in England. Company Number 10273661 > >>>>>> https://www.cambridgegreys.com/ > >>>>>> > >>>>>> _______________________________________________ > >>>>>> dev mailing list > >>>>>> dev@openvswitch.org > >>>>>> https://mail.openvswitch.org/mailman/listinfo/ovs-dev > >>> -- > >>> Anton R. Ivanov > >>> Cambridgegreys Limited. Registered in England. Company Number 10273661 > >>> https://www.cambridgegreys.com/ > >>> > >>> _______________________________________________ > >>> dev mailing list > >>> dev@openvswitch.org > >>> https://mail.openvswitch.org/mailman/listinfo/ovs-dev > >>> > > > > -- > Anton R. Ivanov > Cambridgegreys Limited. Registered in England. Company Number 10273661 > https://www.cambridgegreys.com/ > > _______________________________________________ > dev mailing list > dev@openvswitch.org > https://mail.openvswitch.org/mailman/listinfo/ovs-dev >
On 23/08/2021 22:36, Ilya Maximets wrote: > On 8/23/21 10:37 PM, Anton Ivanov wrote: >> On 23/08/2021 21:26, Ilya Maximets wrote: >>> On 8/23/21 10:20 PM, Anton Ivanov wrote: >>>> Should not be the case. >>>> >>>> The map is pre-sized for the size from the previous iterations. >>>> >>>> Line 12861 in my tree which is probably a few commits out of date: >>>> >>>> fast_hmap_size_for(&lflows, max_seen_lflow_size); >>>> >>>> And immediately after building the lflows: >>>> >>>> if (hmap_count(&lflows) > max_seen_lflow_size) { >>>> max_seen_lflow_size = hmap_count(&lflows); >>>> } >>>> >>>> So the map SHOULD be sized correctly - to the most recent seen lflow count. >>> Please, re-read the commit message. It's a first run of the loop >>> and the 'max_seen_lflow_size' is default 128 at this point. >> Ack, >> >> Not using auto-resizing in single threaded mode is a bug. Thanks for fixing it. >> >> From that perspective the patch is a straight +1 from me. >> >> From the perspective of the use case stated in the commit message- I am not sure it addresses it. >> >> If northd is in single-threaded mode and is tackling a GIGANTIC> database, it may never complete the first iteration before the >> expiration of the timers and everyone deciding that northd is AWOL. > Well, how do you suggest to fix that? Obviously, we can always create > a database that northd will never be able to process in a reasonable > amount of time. And it doesn't matter if it's single- or multi-threaded. I will send the additional "baseline" fixes for parallel - resizing and initial sizing tomorrow, they are fairly trivial and have tested out OK. However, they do not solve the fact that the overall "heatmap" with dp_groups have moved. A lot of processing once again happens out of the "parallel" portion and in the single threaded part. Unless I am mistaken, this can be improved. Namely, at present, after computing lflows with dp_groups they are walked in full, single dp flows separated into a hmap and reprocessed. That is suboptimal for parallel (and possibly suboptimal for single threaded). Unless I am mistaken, when dp_groups are enabled, all lflows can be initially inserted into a separate "single datapath" hmap. If the dp_group for an lflow grows to more than one, the flow is then moved to the main lflow hmap. This way, the computation will generate both types of flows straight away (optionally in parallel) and there will be no need to do a full single threaded walk of lflows after they have been generated. One question (so I can get some idea on which optimizations are worth it and which aren't). What is the percentage and overall numbers of single datapath lflows? Brgds, A. > > In this case NbDB is only 9MB in size, which is very reasonable, and > northd on my laptop takes more than 15 minutes to process it (I killed > it at this point). With the patch applied it took only 11 seconds. > So, for me, this patch pretty much fixes the issue. 11 seconds is not > that bad, e.g. ovn-k8s configures inactivity probes for clients to 180. > It would be great to reduce, but we're not there yet. > >> In that case, if it is multi-threaded from the start, it should probably >> grab the sizing from the lflow table hash in south db. That would be a >> good approximation for the first run. > This will not work for a case where SbDB is empty for any reason while > NbDB is not. 
And there is still a case where northd initially connects > to semi-empty databases and after few iterations NbDB receives a big > transaction and generates a big update for northd. > >> A. >> >>>> A. >>>> >>>> On 23/08/2021 21:02, Ilya Maximets wrote: >>>>> 'lflow_map' is never expanded because northd always uses fast >>>>> insertion. This leads to the case where we have a hash map >>>>> with only 128 initial buckets and every ovn_lflow_find() ends >>>>> up iterating over n_lflows / 128 entries. It's thousands of >>>>> logical flows or even more. For example, it takes forever for >>>>> ovn-northd to start up with the Northbound Db from the 120 node >>>>> density-heavy test from ovn-heater, because every lookup is slower >>>>> than previous one. I aborted the process after 15 minutes of >>>>> waiting, because there was no sign that it will converge. With >>>>> this change applied the loop completes in only 11 seconds. >>>>> >>>>> Hash map will be pre-allocated to the maximum seen number of >>>>> logical flows on a second iteration, but this doesn't help for >>>>> the first iteration when northd first time connects to a big >>>>> Northbound database, which is a common case during failover or >>>>> cluster upgrade. And there is an even trickier case where big >>>>> NbDB transaction that explodes the number of logical flows received >>>>> on not the first run. >>>>> >>>>> We can't expand the hash map in case of parallel build, so this >>>>> should be fixed separately. >>>>> >>>>> CC: Anton Ivanov <anton.ivanov@cambridgegreys.com> >>>>> Fixes: 74daa0607c7f ("ovn-northd: Introduce parallel lflow build") >>>>> Signed-off-by: Ilya Maximets <i.maximets@ovn.org> >>>>> --- >>>>> northd/ovn-northd.c | 6 +++++- >>>>> 1 file changed, 5 insertions(+), 1 deletion(-) >>>>> >>>>> diff --git a/northd/ovn-northd.c b/northd/ovn-northd.c >>>>> index 3d8e21a4f..40cf957c0 100644 >>>>> --- a/northd/ovn-northd.c >>>>> +++ b/northd/ovn-northd.c >>>>> @@ -4387,7 +4387,11 @@ do_ovn_lflow_add(struct hmap *lflow_map, struct ovn_datapath *od, >>>>> nullable_xstrdup(ctrl_meter), >>>>> ovn_lflow_hint(stage_hint), where); >>>>> hmapx_add(&lflow->od_group, od); >>>>> - hmap_insert_fast(lflow_map, &lflow->hmap_node, hash); >>>>> + if (!use_parallel_build) { >>>>> + hmap_insert(lflow_map, &lflow->hmap_node, hash); >>>>> + } else { >>>>> + hmap_insert_fast(lflow_map, &lflow->hmap_node, hash); >>>>> + } >>>>> } >>>>> /* Adds a row with the specified contents to the Logical_Flow table. */ >
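A rough sketch of the two-map idea proposed above, under the assumption that it would be implemented roughly as described: single-datapath lflows live in their own map and are promoted to the shared map the moment a second datapath references them, so no full single-threaded re-walk is needed afterwards. This is a proposal, not existing code; ovn_lflow_find_in() and ovn_lflow_clone() are made-up helper names standing in for whatever lookup/allocation northd would actually use.

/* Hypothetical restructuring for the dp-groups case. */
static void
add_lflow_for_datapath(struct hmap *lflows, struct hmap *single_dp_lflows,
                       struct ovn_datapath *od, const struct ovn_lflow *proto,
                       uint32_t hash)
{
    struct ovn_lflow *lflow = ovn_lflow_find_in(lflows, proto, hash);
    if (lflow) {
        /* Already shared by several datapaths: just extend its group. */
        hmapx_add(&lflow->od_group, od);
        return;
    }

    lflow = ovn_lflow_find_in(single_dp_lflows, proto, hash);
    if (lflow) {
        /* Second datapath for this flow: move it to the shared map. */
        hmap_remove(single_dp_lflows, &lflow->hmap_node);
        hmapx_add(&lflow->od_group, od);
        hmap_insert(lflows, &lflow->hmap_node, hash);
        return;
    }

    /* First occurrence: it stays in the single-datapath map for now. */
    lflow = ovn_lflow_clone(proto);
    hmapx_add(&lflow->od_group, od);
    hmap_insert(single_dp_lflows, &lflow->hmap_node, hash);
}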
On 8/24/21 10:07 PM, Anton Ivanov wrote: > On 23/08/2021 22:36, Ilya Maximets wrote: >> On 8/23/21 10:37 PM, Anton Ivanov wrote: >>> On 23/08/2021 21:26, Ilya Maximets wrote: >>>> On 8/23/21 10:20 PM, Anton Ivanov wrote: >>>>> Should not be the case. >>>>> >>>>> The map is pre-sized for the size from the previous iterations. >>>>> >>>>> Line 12861 in my tree which is probably a few commits out of date: >>>>> >>>>> fast_hmap_size_for(&lflows, max_seen_lflow_size); >>>>> >>>>> And immediately after building the lflows: >>>>> >>>>> if (hmap_count(&lflows) > max_seen_lflow_size) { >>>>> max_seen_lflow_size = hmap_count(&lflows); >>>>> } >>>>> >>>>> So the map SHOULD be sized correctly - to the most recent seen lflow count. >>>> Please, re-read the commit message. It's a first run of the loop >>>> and the 'max_seen_lflow_size' is default 128 at this point. >>> Ack, >>> >>> Not using auto-resizing in single threaded mode is a bug. Thanks for fixing it. >>> >>> From that perspective the patch is a straight +1 from me. >>> >>> From the perspective of the use case stated in the commit message- I am not sure it addresses it. >>> >>> If northd is in single-threaded mode and is tackling a GIGANTIC> database, it may never complete the first iteration before the >>> expiration of the timers and everyone deciding that northd is AWOL. >> Well, how do you suggest to fix that? Obviously, we can always create >> a database that northd will never be able to process in a reasonable >> amount of time. And it doesn't matter if it's single- or multi-threaded. > > I will send the additional "baseline" fixes for parallel - resizing and initial sizing tomorrow, they are fairly trivial and have tested out OK. > > However, they do not solve the fact that the overall "heatmap" with dp_groups have moved. A lot of processing once again happens out of the "parallel" portion and in the single threaded part. > > Unless I am mistaken, this can be improved. > > Namely, at present, after computing lflows with dp_groups they are walked in full, single dp flows separated into a hmap and reprocessed. That is suboptimal for parallel (and possibly suboptimal for single threaded). > > Unless I am mistaken, when dp_groups are enabled, all lflows can be initially inserted into a separate "single datapath" hmap. If the dp_group for an lflow grows to more than one, the flow is then moved to the main lflow hmap. This way, the computation will generate both types of flows straight away (optionally in parallel) and there will be no need to do a full single threaded walk of lflows after they have been generated. > > One question (so I can get some idea on which optimizations are worth it and which aren't). What is the percentage and overall numbers of single datapath lflows? From the DB that I have I extracted following information: Total lflows generated : 9.916.227 Ended up in SbDB: 540.795 (462.196 has no dp_group, 78.599 has a dp_group) On disk size of this DB with dp groups enabled is 270 MB. So, the lflows hashmap contains ~540K flows, 460K of them are single flows. But still it's 20 times less than number of lflows that northd generated. So, performance improvement from parallelization of this part might be not significant if dp-groups enabled. If disabled, it will be very hard for both northd and SbDB to handle database of this size even from the memory consumption standpoint. Database will take around 5 GB on disk. In memory as a parsed json object, it will be huge. I'd not advise running a setup of that size without dp groups. 
Node will ran out of memory very fast. > > Brgds, > > A. > >> >> In this case NbDB is only 9MB in size, which is very reasonable, and >> northd on my laptop takes more than 15 minutes to process it (I killed >> it at this point). With the patch applied it took only 11 seconds. >> So, for me, this patch pretty much fixes the issue. 11 seconds is not >> that bad, e.g. ovn-k8s configures inactivity probes for clients to 180. >> It would be great to reduce, but we're not there yet. >> >>> In that case, if it is multi-threaded from the start, it should probably >>> grab the sizing from the lflow table hash in south db. That would be a >>> good approximation for the first run. >> This will not work for a case where SbDB is empty for any reason while >> NbDB is not. And there is still a case where northd initially connects >> to semi-empty databases and after few iterations NbDB receives a big >> transaction and generates a big update for northd. >> >>> A. >>> >>>>> A. >>>>> >>>>> On 23/08/2021 21:02, Ilya Maximets wrote: >>>>>> 'lflow_map' is never expanded because northd always uses fast >>>>>> insertion. This leads to the case where we have a hash map >>>>>> with only 128 initial buckets and every ovn_lflow_find() ends >>>>>> up iterating over n_lflows / 128 entries. It's thousands of >>>>>> logical flows or even more. For example, it takes forever for >>>>>> ovn-northd to start up with the Northbound Db from the 120 node >>>>>> density-heavy test from ovn-heater, because every lookup is slower >>>>>> than previous one. I aborted the process after 15 minutes of >>>>>> waiting, because there was no sign that it will converge. With >>>>>> this change applied the loop completes in only 11 seconds. >>>>>> >>>>>> Hash map will be pre-allocated to the maximum seen number of >>>>>> logical flows on a second iteration, but this doesn't help for >>>>>> the first iteration when northd first time connects to a big >>>>>> Northbound database, which is a common case during failover or >>>>>> cluster upgrade. And there is an even trickier case where big >>>>>> NbDB transaction that explodes the number of logical flows received >>>>>> on not the first run. >>>>>> >>>>>> We can't expand the hash map in case of parallel build, so this >>>>>> should be fixed separately. >>>>>> >>>>>> CC: Anton Ivanov <anton.ivanov@cambridgegreys.com> >>>>>> Fixes: 74daa0607c7f ("ovn-northd: Introduce parallel lflow build") >>>>>> Signed-off-by: Ilya Maximets <i.maximets@ovn.org> >>>>>> --- >>>>>> northd/ovn-northd.c | 6 +++++- >>>>>> 1 file changed, 5 insertions(+), 1 deletion(-) >>>>>> >>>>>> diff --git a/northd/ovn-northd.c b/northd/ovn-northd.c >>>>>> index 3d8e21a4f..40cf957c0 100644 >>>>>> --- a/northd/ovn-northd.c >>>>>> +++ b/northd/ovn-northd.c >>>>>> @@ -4387,7 +4387,11 @@ do_ovn_lflow_add(struct hmap *lflow_map, struct ovn_datapath *od, >>>>>> nullable_xstrdup(ctrl_meter), >>>>>> ovn_lflow_hint(stage_hint), where); >>>>>> hmapx_add(&lflow->od_group, od); >>>>>> - hmap_insert_fast(lflow_map, &lflow->hmap_node, hash); >>>>>> + if (!use_parallel_build) { >>>>>> + hmap_insert(lflow_map, &lflow->hmap_node, hash); >>>>>> + } else { >>>>>> + hmap_insert_fast(lflow_map, &lflow->hmap_node, hash); >>>>>> + } >>>>>> } >>>>>> /* Adds a row with the specified contents to the Logical_Flow table. */ >> >
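A rough back-of-envelope from these numbers, assuming that with dp-groups each of the ~9.9M generated lflows performs one ovn_lflow_find() and that a lookup scans about half of one bucket's chain: a map stuck at 128 buckets while holding 540,795 entries ends up with chains of roughly 540795 / 128 ≈ 4,200 nodes, so the whole build performs on the order of 9.9M x ~2,000 ≈ 2x10^10 node visits. With hmap_insert() keeping the load factor bounded, chains stay around one or two nodes and the same build does on the order of 10^7 visits, which is consistent with the 15-minutes-versus-11-seconds difference reported in the commit message.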
On 24/08/2021 22:49, Ilya Maximets wrote: > On 8/24/21 10:07 PM, Anton Ivanov wrote: >> On 23/08/2021 22:36, Ilya Maximets wrote: >>> On 8/23/21 10:37 PM, Anton Ivanov wrote: >>>> On 23/08/2021 21:26, Ilya Maximets wrote: >>>>> On 8/23/21 10:20 PM, Anton Ivanov wrote: >>>>>> Should not be the case. >>>>>> >>>>>> The map is pre-sized for the size from the previous iterations. >>>>>> >>>>>> Line 12861 in my tree which is probably a few commits out of date: >>>>>> >>>>>> fast_hmap_size_for(&lflows, max_seen_lflow_size); >>>>>> >>>>>> And immediately after building the lflows: >>>>>> >>>>>> if (hmap_count(&lflows) > max_seen_lflow_size) { >>>>>> max_seen_lflow_size = hmap_count(&lflows); >>>>>> } >>>>>> >>>>>> So the map SHOULD be sized correctly - to the most recent seen lflow count. >>>>> Please, re-read the commit message. It's a first run of the loop >>>>> and the 'max_seen_lflow_size' is default 128 at this point. >>>> Ack, >>>> >>>> Not using auto-resizing in single threaded mode is a bug. Thanks for fixing it. >>>> >>>> From that perspective the patch is a straight +1 from me. >>>> >>>> From the perspective of the use case stated in the commit message- I am not sure it addresses it. >>>> >>>> If northd is in single-threaded mode and is tackling a GIGANTIC> database, it may never complete the first iteration before the >>>> expiration of the timers and everyone deciding that northd is AWOL. >>> Well, how do you suggest to fix that? Obviously, we can always create >>> a database that northd will never be able to process in a reasonable >>> amount of time. And it doesn't matter if it's single- or multi-threaded. >> I will send the additional "baseline" fixes for parallel - resizing and initial sizing tomorrow, they are fairly trivial and have tested out OK. >> >> However, they do not solve the fact that the overall "heatmap" with dp_groups have moved. A lot of processing once again happens out of the "parallel" portion and in the single threaded part. >> >> Unless I am mistaken, this can be improved. >> >> Namely, at present, after computing lflows with dp_groups they are walked in full, single dp flows separated into a hmap and reprocessed. That is suboptimal for parallel (and possibly suboptimal for single threaded). >> >> Unless I am mistaken, when dp_groups are enabled, all lflows can be initially inserted into a separate "single datapath" hmap. If the dp_group for an lflow grows to more than one, the flow is then moved to the main lflow hmap. This way, the computation will generate both types of flows straight away (optionally in parallel) and there will be no need to do a full single threaded walk of lflows after they have been generated. >> >> One question (so I can get some idea on which optimizations are worth it and which aren't). What is the percentage and overall numbers of single datapath lflows? > From the DB that I have I extracted following information: > > Total lflows generated : 9.916.227 > Ended up in SbDB: 540.795 (462.196 has no dp_group, 78.599 has a dp_group) > On disk size of this DB with dp groups enabled is 270 MB. > > So, the lflows hashmap contains ~540K flows, 460K of them are single > flows. But still it's 20 times less than number of lflows that northd > generated. So, performance improvement from parallelization of this > part might be not significant if dp-groups enabled. 
Actually, removing a 460K flow map creation, 460K flow map walk and replacing a "pop"-"push" map moves with a fast merge (all of that is possible) should give a significant performance boost even in single thread. All of that is parallelizable too (with some locking). > If disabled, it > will be very hard for both northd and SbDB to handle database of this > size even from the memory consumption standpoint. Database will take > around 5 GB on disk. In memory as a parsed json object, it will be huge. > I'd not advise running a setup of that size without dp groups. Node > will ran out of memory very fast. >> Brgds, >> >> A. >> >>> In this case NbDB is only 9MB in size, which is very reasonable, and >>> northd on my laptop takes more than 15 minutes to process it (I killed >>> it at this point). With the patch applied it took only 11 seconds. >>> So, for me, this patch pretty much fixes the issue. 11 seconds is not >>> that bad, e.g. ovn-k8s configures inactivity probes for clients to 180. >>> It would be great to reduce, but we're not there yet. >>> >>>> In that case, if it is multi-threaded from the start, it should probably >>>> grab the sizing from the lflow table hash in south db. That would be a >>>> good approximation for the first run. >>> This will not work for a case where SbDB is empty for any reason while >>> NbDB is not. And there is still a case where northd initially connects >>> to semi-empty databases and after few iterations NbDB receives a big >>> transaction and generates a big update for northd. >>> >>>> A. >>>> >>>>>> A. >>>>>> >>>>>> On 23/08/2021 21:02, Ilya Maximets wrote: >>>>>>> 'lflow_map' is never expanded because northd always uses fast >>>>>>> insertion. This leads to the case where we have a hash map >>>>>>> with only 128 initial buckets and every ovn_lflow_find() ends >>>>>>> up iterating over n_lflows / 128 entries. It's thousands of >>>>>>> logical flows or even more. For example, it takes forever for >>>>>>> ovn-northd to start up with the Northbound Db from the 120 node >>>>>>> density-heavy test from ovn-heater, because every lookup is slower >>>>>>> than previous one. I aborted the process after 15 minutes of >>>>>>> waiting, because there was no sign that it will converge. With >>>>>>> this change applied the loop completes in only 11 seconds. >>>>>>> >>>>>>> Hash map will be pre-allocated to the maximum seen number of >>>>>>> logical flows on a second iteration, but this doesn't help for >>>>>>> the first iteration when northd first time connects to a big >>>>>>> Northbound database, which is a common case during failover or >>>>>>> cluster upgrade. And there is an even trickier case where big >>>>>>> NbDB transaction that explodes the number of logical flows received >>>>>>> on not the first run. >>>>>>> >>>>>>> We can't expand the hash map in case of parallel build, so this >>>>>>> should be fixed separately. 
>>>>>>> >>>>>>> CC: Anton Ivanov <anton.ivanov@cambridgegreys.com> >>>>>>> Fixes: 74daa0607c7f ("ovn-northd: Introduce parallel lflow build") >>>>>>> Signed-off-by: Ilya Maximets <i.maximets@ovn.org> >>>>>>> --- >>>>>>> northd/ovn-northd.c | 6 +++++- >>>>>>> 1 file changed, 5 insertions(+), 1 deletion(-) >>>>>>> >>>>>>> diff --git a/northd/ovn-northd.c b/northd/ovn-northd.c >>>>>>> index 3d8e21a4f..40cf957c0 100644 >>>>>>> --- a/northd/ovn-northd.c >>>>>>> +++ b/northd/ovn-northd.c >>>>>>> @@ -4387,7 +4387,11 @@ do_ovn_lflow_add(struct hmap *lflow_map, struct ovn_datapath *od, >>>>>>> nullable_xstrdup(ctrl_meter), >>>>>>> ovn_lflow_hint(stage_hint), where); >>>>>>> hmapx_add(&lflow->od_group, od); >>>>>>> - hmap_insert_fast(lflow_map, &lflow->hmap_node, hash); >>>>>>> + if (!use_parallel_build) { >>>>>>> + hmap_insert(lflow_map, &lflow->hmap_node, hash); >>>>>>> + } else { >>>>>>> + hmap_insert_fast(lflow_map, &lflow->hmap_node, hash); >>>>>>> + } >>>>>>> } >>>>>>> /* Adds a row with the specified contents to the Logical_Flow table. */ >
diff --git a/northd/ovn-northd.c b/northd/ovn-northd.c
index 3d8e21a4f..40cf957c0 100644
--- a/northd/ovn-northd.c
+++ b/northd/ovn-northd.c
@@ -4387,7 +4387,11 @@ do_ovn_lflow_add(struct hmap *lflow_map, struct ovn_datapath *od,
                                    nullable_xstrdup(ctrl_meter),
                                    ovn_lflow_hint(stage_hint), where);
         hmapx_add(&lflow->od_group, od);
-        hmap_insert_fast(lflow_map, &lflow->hmap_node, hash);
+        if (!use_parallel_build) {
+            hmap_insert(lflow_map, &lflow->hmap_node, hash);
+        } else {
+            hmap_insert_fast(lflow_map, &lflow->hmap_node, hash);
+        }
 }
 
 /* Adds a row with the specified contents to the Logical_Flow table. */
'lflow_map' is never expanded because northd always uses fast insertion. This leads to the case where we have a hash map with only 128 initial buckets and every ovn_lflow_find() ends up iterating over n_lflows / 128 entries. It's thousands of logical flows or even more. For example, it takes forever for ovn-northd to start up with the Northbound Db from the 120 node density-heavy test from ovn-heater, because every lookup is slower than the previous one. I aborted the process after 15 minutes of waiting, because there was no sign that it would converge. With this change applied the loop completes in only 11 seconds.

The hash map will be pre-allocated to the maximum seen number of logical flows on a second iteration, but this doesn't help for the first iteration, when northd connects to a big Northbound database for the first time, which is a common case during failover or cluster upgrade. And there is an even trickier case where a big NbDB transaction explodes the number of logical flows on a run other than the first.

We can't expand the hash map in case of parallel build, so this should be fixed separately.

CC: Anton Ivanov <anton.ivanov@cambridgegreys.com>
Fixes: 74daa0607c7f ("ovn-northd: Introduce parallel lflow build")
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
---
 northd/ovn-northd.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)