Message ID | 20240315100402.1979482-1-i@liuyulong.me
---|---
State | Deferred
Delegated to: | Ilya Maximets
Series | [ovs-dev,v3] ofproto-dpif-upcall: Try lock for udpif_key map during sweep.
Context | Check | Description |
---|---|---
ovsrobot/apply-robot | success | apply and check: success |
ovsrobot/github-robot-_Build_and_Test | success | github build: passed |
ovsrobot/intel-ovs-compilation | success | test: success |
On 3/15/24 11:04, LIU Yulong wrote:
> A potential race condition happened with the following 3 threads:
> * PMD thread replaced the old_ukey and transitioned the state to
>   UKEY_DELETED.
> * RCU thread is freeing the old_ukey mutex.
> * While the revalidator thread is trying to lock the old_ukey mutex.
>
> We added some timestamps to the udpif_key state transitions and
> the udpif_key mutex free action, as well as to the sweep try_lock for
> the same udpif_key.  Whenever ovs-vswitchd dumped core, the udpif_key
> try_lock always happened a bit later than the udpif_key mutex free
> action.  For instance [3]:
> ukey_destroy_time = 13217289156490
> sweep_now = 13217289156568
>
> The second timestamp is 78us later than the first.
>
> Then the vswitchd process aborts at the revalidator thread's try_lock of
> ukey->mutex because of the NULL pointer.
>
> This patch adds a try_lock on the ukey's basket udpif_key map to
> prevent the PMD and revalidator threads from accessing the same map
> while replacing the udpif_key and transitioning the udpif_key state.

Thanks for the patch!

I still need to catch up on the ovs-discuss thread to better understand
what we're dealing with.

However, the solution proposed here may be a little too radical.  The
main problem is that we're holding the mutex for a potentially very long
time - the whole iteration over every ukey in the map, potentially
waiting for datapath operations to remove old flows.  This can take
hundreds of milliseconds or even a few seconds in some cases.  For that
time we will be blocking PMD threads and handlers.  This is probably
not a good solution.  Also, it might be masking the real issue by having
this large critical section, i.e. the issue may still be there, but just
much harder to reproduce.

Best regards, Ilya Maximets.
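As a rough, standalone illustration of the blocking concern raised above (this
is not OVS code; the sweep_thread/pmd_thread names and the 200 ms delay are
invented), the sketch below has one thread hold a shared map mutex across a
long sweep-like loop, while a second thread, standing in for a PMD handler
that needs the same mutex to install a flow, measures how long it stalls:

/* Illustration only: a long critical section on a shared map mutex stalls
 * another thread that needs the same mutex.  Not the OVS code path. */
#include <pthread.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

static pthread_mutex_t umap_mutex = PTHREAD_MUTEX_INITIALIZER;

static void *sweep_thread(void *arg)
{
    (void) arg;
    pthread_mutex_lock(&umap_mutex);
    usleep(200 * 1000);               /* Stand-in for iterating every ukey. */
    pthread_mutex_unlock(&umap_mutex);
    return NULL;
}

static void *pmd_thread(void *arg)
{
    struct timespec start, end;
    (void) arg;

    usleep(10 * 1000);                /* Let the sweep grab the mutex first. */
    clock_gettime(CLOCK_MONOTONIC, &start);
    pthread_mutex_lock(&umap_mutex);  /* Stand-in for the flow-install path. */
    clock_gettime(CLOCK_MONOTONIC, &end);
    pthread_mutex_unlock(&umap_mutex);

    long waited_ms = (end.tv_sec - start.tv_sec) * 1000
                     + (end.tv_nsec - start.tv_nsec) / 1000000;
    printf("flow install stalled for ~%ld ms\n", waited_ms);
    return NULL;
}

int main(void)
{
    pthread_t sweep, pmd;

    pthread_create(&sweep, NULL, sweep_thread, NULL);
    pthread_create(&pmd, NULL, pmd_thread, NULL);
    pthread_join(sweep, NULL);
    pthread_join(pmd, NULL);
    return 0;
}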
ACK.  Thanks, Ilya.

For now, we have deployed this lock-movement code to the test hosts; no core
dumps in revalidator_sweep__() for 10 days.

LIU Yulong
On 15 Mar 2024, at 11:04, LIU Yulong wrote:

> A potential race condition happened with the following 3 threads:
> * PMD thread replaced the old_ukey and transitioned the state to
>   UKEY_DELETED.
> * RCU thread is freeing the old_ukey mutex.
> * While the revalidator thread is trying to lock the old_ukey mutex.
> [...]
>
> More details can be found at:
> [1] https://mail.openvswitch.org/pipermail/ovs-discuss/2024-March/052973.html
> [2] https://mail.openvswitch.org/pipermail/ovs-discuss/2024-February/052949.html
> [3] https://mail.openvswitch.org/pipermail/ovs-discuss/2024-March/052993.html
>
> Signed-off-by: LIU Yulong <i@liuyulong.me>

Hi LIU,

I've examined a lot of code, but I can't seem to figure out what event could
be causing your problem.  I also don't understand why your fix would prevent
the problem from occurring (aside from possibly avoiding some quiescent
state).  Here's some of my analysis.

Assume PMD is doing try_ukey_replace():

1930 try_ukey_replace(struct umap *umap, struct udpif_key *old_ukey,
1931                  struct udpif_key *new_ukey)
1932     OVS_REQUIRES(umap->mutex)
1933     OVS_TRY_LOCK(true, new_ukey->mutex)
1934 {
1935     bool replaced = false;
1936
1937     if (!ovs_mutex_trylock(&old_ukey->mutex)) {
1938         if (old_ukey->state == UKEY_EVICTED) {
1939             /* The flow was deleted during the current revalidator dump,
1940              * but its ukey won't be fully cleaned up until the sweep phase.
1941              * In the mean time, we are receiving upcalls for this traffic.
1942              * Expedite the (new) flow install by replacing the ukey. */
1943             ovs_mutex_lock(&new_ukey->mutex);
1944             cmap_replace(&umap->cmap, &old_ukey->cmap_node,
1945                          &new_ukey->cmap_node, new_ukey->hash);
1946             new_ukey->dump_seq = old_ukey->dump_seq;
1947             ovsrcu_postpone(ukey_delete__, old_ukey);
1948             transition_ukey(old_ukey, UKEY_DELETED);
1949             transition_ukey(new_ukey, UKEY_VISIBLE);
1950             replaced = true;
1951         }
1952         ovs_mutex_unlock(&old_ukey->mutex);
1953     }
...

From this code it's evident that the ukey deletion doesn't occur immediately,
but rather after an RCU quiescent period completes.  The function
revalidator_sweep__() doesn't include any ovsrcu_quiesce() calls within the
CMAP loop; it only executes one when moving on to the next slice.  See the
code snippet below.

2983 static void
2984 revalidator_sweep__(struct revalidator *revalidator, bool purge)
2985 {
....
2995
2996     for (int i = slice; i < N_UMAPS; i += udpif->n_revalidators) {
....
3004
3005         CMAP_FOR_EACH(ukey, cmap_node, &umap->cmap) {
3006             enum flow_del_reason del_reason = FDR_NONE;
3007             enum ukey_state ukey_state;
3008
3009             /* Handler threads could be holding a ukey lock while it installs a
3010              * new flow, so don't hang around waiting for access to it. */
3011             if (ovs_mutex_trylock(&ukey->mutex)) {
3012                 continue;
3013             }
....
3037                     reval_op_init(&ops[n_ops++], result, udpif, ukey, &recircs,
3038                                   &odp_actions);
3039                 }
3040                 OVS_USDT_PROBE(revalidator_sweep__, flow_sweep_result, udpif,
3041                                ukey, result, del_reason);
3042             }
3043             ovs_mutex_unlock(&ukey->mutex);
....
3045             if (ukey_state == UKEY_EVICTED) {
3046                 /* The common flow deletion case involves deletion of the flow
3047                  * during the dump phase and ukey deletion here. */
3048                 ovs_mutex_lock(&umap->mutex);
3049                 ukey_delete(umap, ukey);
3050                 ovs_mutex_unlock(&umap->mutex);
3051             }
3052
3053             if (n_ops == REVALIDATE_MAX_BATCH) {
3054                 /* Update/delete missed flows and clean up corresponding ukeys
3055                  * if necessary. */
3056                 push_ukey_ops(udpif, umap, ops, n_ops);
3057                 n_ops = 0;
3058             }
3059         }
....
3066         ovsrcu_quiesce();
             ^^ Here we call it.
3067     }
3068 }

So maybe you could try to remove the ovsrcu_quiesce() above, as this seems to
be the only hint why we could go through a quiescent state (maybe some
compiler optimization)?

If this is not solving the problem, we might need to come up with some more
debugging.  Let me know and I can try to put some debug code together.

//Eelco
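For readers who want to see the deferred-free pattern described above in
isolation, here is a minimal, single-threaded sketch of quiescent-state-based
reclamation.  It is not the OVS implementation; all toy_rcu_* names, the
fixed-size tables, and the free_ukey callback are invented.  It only shows the
rule the analysis relies on: a pointer handed to a postpone call may only be
freed after every registered thread has reported a quiescent state.

/* Toy quiescent-state-based reclamation (QSBR) sketch -- NOT the OVS
 * implementation.  All names are made up for illustration. */
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

#define MAX_THREADS 4
#define MAX_CBS     16

struct deferred_cb {
    void (*func)(void *aux);
    void *aux;
    unsigned long target_seq[MAX_THREADS]; /* Per-thread sequence to wait for. */
};

static unsigned long thread_seq[MAX_THREADS]; /* Bumped by toy_rcu_quiesce(). */
static struct deferred_cb cbs[MAX_CBS];
static int n_cbs;

/* Defer 'func(aux)' until every thread quiesces at least once more. */
static void toy_rcu_postpone(void (*func)(void *aux), void *aux)
{
    struct deferred_cb *cb = &cbs[n_cbs++];
    cb->func = func;
    cb->aux = aux;
    for (int t = 0; t < MAX_THREADS; t++) {
        cb->target_seq[t] = thread_seq[t] + 1;
    }
}

/* Called by thread 'tid' when it holds no RCU-protected pointers. */
static void toy_rcu_quiesce(int tid)
{
    thread_seq[tid]++;
}

/* Run callbacks whose grace period has elapsed (single-threaded driver). */
static void toy_rcu_run_ready(void)
{
    for (int i = 0; i < n_cbs; i++) {
        bool ready = true;
        for (int t = 0; t < MAX_THREADS; t++) {
            if (thread_seq[t] < cbs[i].target_seq[t]) {
                ready = false;
                break;
            }
        }
        if (ready && cbs[i].func) {
            cbs[i].func(cbs[i].aux);
            cbs[i].func = NULL;   /* Mark as done. */
        }
    }
}

static void free_ukey(void *aux)
{
    printf("grace period over, freeing %p\n", aux);
    free(aux);
}

int main(void)
{
    void *old_ukey = malloc(64);

    toy_rcu_postpone(free_ukey, old_ukey);  /* Analogous to postponing ukey_delete__. */
    toy_rcu_run_ready();                    /* Nothing freed yet: no thread quiesced. */

    for (int t = 0; t < MAX_THREADS; t++) {
        toy_rcu_quiesce(t);                 /* Every thread passes a quiescent state... */
    }
    toy_rcu_run_ready();                    /* ...so now the deferred free runs. */
    return 0;
}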
Hi Eelco,

Sorry for the late reply, and thank you very much for your code examination.

Today I removed that "ovsrcu_quiesce();" at line 3066 [1] and started to run
the test in my local environment.  Let's see if that change can overcome the
issue.

Regarding the long lock hold time and the quiesce state, Ilya's feedback does
seem to hold during some service restarts or large flow installations (no
matter whether the lock move fix is applied or not).  I noticed logs like
this:

2024-04-07T01:33:53.025Z|00001|ovs_rcu(urcu2)|WARN|blocked 1001 ms waiting for main to quiesce
2024-04-07T01:33:54.025Z|00002|ovs_rcu(urcu2)|WARN|blocked 2000 ms waiting for main to quiesce
2024-04-07T01:33:56.025Z|00003|ovs_rcu(urcu2)|WARN|blocked 4000 ms waiting for main to quiesce
2024-04-07T01:34:00.025Z|00004|ovs_rcu(urcu2)|WARN|blocked 8000 ms waiting for main to quiesce
2024-04-07T01:34:08.025Z|00005|ovs_rcu(urcu2)|WARN|blocked 16001 ms waiting for main to quiesce

And

2024-04-03T03:38:20.817Z|00136|ovs_rcu(urcu2)|WARN|blocked 2640 ms waiting for dpdk_watchdog1 to quiesce
2024-04-03T03:38:20.817Z|00137|ovs_rcu(urcu2)|WARN|blocked 2640 ms waiting for dpdk_watchdog1 to quiesce
2024-04-03T03:38:53.624Z|00138|ovs_rcu(urcu2)|WARN|blocked 1066 ms waiting for revalidator6 to quiesce
2024-04-03T03:38:57.606Z|00139|ovs_rcu(urcu2)|WARN|blocked 5048 ms waiting for revalidator6 to quiesce
2024-04-03T03:38:57.606Z|00140|ovs_rcu(urcu2)|WARN|blocked 5048 ms waiting for revalidator6 to quiesce
2024-04-03T03:39:47.105Z|00141|ovs_rcu(urcu2)|WARN|blocked 54547 ms waiting for revalidator6 to quiesce
2024-04-03T03:39:47.105Z|00142|ovs_rcu(urcu2)|WARN|blocked 54548 ms waiting for revalidator6 to quiesce
2024-04-03T03:39:47.105Z|00143|ovs_rcu(urcu2)|WARN|blocked 54548 ms waiting for revalidator6 to quiesce
2024-04-03T03:42:30.702Z|00144|ovs_rcu(urcu2)|WARN|blocked 10937 ms waiting for pmd-c03/id:8 to quiesce
2024-04-03T03:42:30.703Z|00145|ovs_rcu(urcu2)|WARN|blocked 10939 ms waiting for pmd-c03/id:8 to quiesce
2024-04-03T03:42:30.704Z|00146|ovs_rcu(urcu2)|WARN|blocked 10939 ms waiting for pmd-c03/id:8 to quiesce
2024-04-03T03:42:30.704Z|00147|ovs_rcu(urcu2)|WARN|blocked 10939 ms waiting for pmd-c03/id:8 to quiesce
2024-04-03T03:42:39.083Z|00151|ovs_rcu(urcu2)|WARN|blocked 19318 ms waiting for pmd-c03/id:8 to quiesce
2024-04-03T03:42:52.928Z|00155|ovs_rcu(urcu2)|WARN|blocked 33164 ms waiting for pmd-c03/id:8 to quiesce
2024-04-03T03:45:13.305Z|00156|ovs_rcu(urcu2)|WARN|blocked 173540 ms waiting for pmd-c03/id:8 to quiesce
2024-04-03T03:45:13.305Z|00157|ovs_rcu(urcu2)|WARN|blocked 173540 ms waiting for pmd-c03/id:8 to quiesce
2024-04-03T03:47:58.186Z|00158|ovs_rcu(urcu2)|WARN|blocked 338421 ms waiting for pmd-c03/id:8 to quiesce

Not sure if this is related to the core issue.

[1] https://github.com/openvswitch/ovs/blob/master/ofproto/ofproto-dpif-upcall.c#L3066

LIU Yulong
On 7 Apr 2024, at 3:57, LIU Yulong wrote:

> Hi Eelco,
>
> Sorry for the late reply, and thank you very much for your code examination.
>
> Today I removed that "ovsrcu_quiesce();" at line 3066 [1] and started to run
> the test in my local environment.  Let's see if that change can overcome the
> issue.

Let's see…

> Regarding the long lock hold time and the quiesce state, Ilya's feedback does
> seem to hold during some service restarts or large flow installations (no
> matter whether the lock move fix is applied or not).

This looks worrisome anyway and might be worth investigating (without your patch).

> I noticed logs like this:
>
> 2024-04-07T01:33:53.025Z|00001|ovs_rcu(urcu2)|WARN|blocked 1001 ms waiting for main to quiesce
> [...]
> 2024-04-03T03:47:58.186Z|00158|ovs_rcu(urcu2)|WARN|blocked 338421 ms waiting for pmd-c03/id:8 to quiesce
>
> Not sure if this is related to the core issue.

I’ll guess it’s the opposite.  If we do not go through the quiesce state the
entries will not be deleted :)
Unfortunately, with that "ovsrcu_quiesce();" removed, ovs-vswitchd still
dumped core:

root root 957M Apr 12 01:35 core.52305

The good news is that with this change the core dump does not occur as
frequently; it now takes about 7 days.

Core backtrace:

#0  0x00007ff0cf7ab387 in raise () from /lib64/libc.so.6
#1  0x00007ff0cf7aca78 in abort () from /lib64/libc.so.6
#2  0x00005567d610eeee in ovs_abort_valist (err_no=<optimized out>, format=<optimized out>, args=args@entry=0x7ff0ba72f350) at lib/util.c:499
#3  0x00005567d610ef84 in ovs_abort (err_no=err_no@entry=0, format=format@entry=0x5567d656d290 "%s: %s() passed uninitialized ovs_mutex") at lib/util.c:491
#4  0x00005567d60d9c91 in ovs_mutex_trylock_at (l_=l_@entry=0x7ff0808e6e18, where=where@entry=0x5567d654a538 "ofproto/ofproto-dpif-upcall.c:3077") at lib/ovs-thread.c:106
#5  0x00005567d6002d26 in revalidator_sweep__ (revalidator=revalidator@entry=0x5567da631090, purge=purge@entry=false) at ofproto/ofproto-dpif-upcall.c:3077
#6  0x00005567d6005726 in revalidator_sweep (revalidator=0x5567da631090) at ofproto/ofproto-dpif-upcall.c:3135
#7  udpif_revalidator (arg=0x5567da631090) at ofproto/ofproto-dpif-upcall.c:1103
#8  0x00005567d60da84f in ovsthread_wrapper (aux_=<optimized out>) at lib/ovs-thread.c:422
#9  0x00007ff0d1902ea5 in start_thread () from /lib64/libpthread.so.0
#10 0x00007ff0cf873b0d in clone () from /lib64/libc.so.6

Some timestamps:

(gdb) p sweep_now
$1 = 15851410054368

ukey life cycle:
ukey_create_time = 15851409359172,
ukey_visible_time = 15851409359173,
ukey_operational_time = 15851409524080,
ukey_evicting_time = 15851409910342,
ukey_evicted_time = 15851409910443,
ukey_deleted_time = 15851410052878,
ukey_destroy_time = 15851410054243,
ukey_replace_time = 15851410052878

based on this test commit:
https://github.com/gotostack/ovs/commit/38a2b73af4442aa741930b3e4cff32ab7b559050

The ukey_destroy_time is 125us earlier than sweep_now:
15851410054243 - 15851410054368 = -125

So I may guess:
1. The race condition is still between such ovs_mutex_trylock and
   ovs_mutex_destroy, maybe due to some CPU clock switching between threads.
2. There are other threads which overwrite the memory of the ukey mutex lock.

I will set the code back to this lock movement change again.

Regards,
LIU Yulong
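To illustrate the failure mode in the backtrace above - a trylock on a ukey
mutex that has already been destroyed (and, in the real crash, freed) - here
is a small standalone sketch.  It is not OVS code: the guarded_mutex and
guarded_* names are invented, and the idea that the real ovs_mutex keeps a
"where" marker that destroy clears is an assumption based only on the
"passed uninitialized ovs_mutex" abort message in the backtrace.

/* Standalone illustration (not OVS code) of a trylock wrapper that keeps a
 * "where" marker and aborts when the marker has been cleared. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

struct guarded_mutex {
    pthread_mutex_t lock;
    const char *where;     /* NULL means "not initialized (or destroyed)". */
};

static void guarded_init(struct guarded_mutex *m)
{
    pthread_mutex_init(&m->lock, NULL);
    m->where = "<unlocked>";
}

static void guarded_destroy(struct guarded_mutex *m)
{
    m->where = NULL;       /* Clear the marker before tearing down. */
    pthread_mutex_destroy(&m->lock);
}

static int guarded_trylock(struct guarded_mutex *m, const char *where)
{
    if (!m->where) {
        /* Corresponds to the "passed uninitialized ovs_mutex" abort. */
        fprintf(stderr, "%s: trylock() passed uninitialized mutex\n", where);
        abort();
    }
    return pthread_mutex_trylock(&m->lock);
}

int main(void)
{
    struct guarded_mutex m;

    guarded_init(&m);
    if (!guarded_trylock(&m, "demo.c:1")) {   /* Succeeds: mutex is live. */
        pthread_mutex_unlock(&m.lock);
    }

    guarded_destroy(&m);
    guarded_trylock(&m, "demo.c:2");          /* Aborts: marker was cleared. */
    return 0;
}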
On 12 Apr 2024, at 9:31, LIU Yulong wrote:

> Unfortunately, with that "ovsrcu_quiesce();" removed, ovs-vswitchd still
> dumped core:
> [...]
> So I may guess:
> 1. The race condition is still between such ovs_mutex_trylock and
>    ovs_mutex_destroy, maybe due to some CPU clock switching between threads.
> 2. There are other threads which overwrite the memory of the ukey mutex lock.

There is something odd going on in your system, but from a code analysis, the
locking/rcu handling seems fine.

If I get some time in the next week(s), I’ll try to come up with some debug
code to better understand the problem.

Cheers,

Eelco
diff --git a/ofproto/ofproto-dpif-upcall.c b/ofproto/ofproto-dpif-upcall.c
index 9a5c5c29c..ef13f820a 100644
--- a/ofproto/ofproto-dpif-upcall.c
+++ b/ofproto/ofproto-dpif-upcall.c
@@ -2974,6 +2974,10 @@ revalidator_sweep__(struct revalidator *revalidator, bool purge)
         struct umap *umap = &udpif->ukeys[i];
         size_t n_ops = 0;
 
+        if (ovs_mutex_trylock(&umap->mutex)) {
+            continue;
+        }
+
         CMAP_FOR_EACH(ukey, cmap_node, &umap->cmap) {
             enum ukey_state ukey_state;
 
@@ -3013,9 +3017,7 @@ revalidator_sweep__(struct revalidator *revalidator, bool purge)
             if (ukey_state == UKEY_EVICTED) {
                 /* The common flow deletion case involves deletion of the flow
                  * during the dump phase and ukey deletion here. */
-                ovs_mutex_lock(&umap->mutex);
                 ukey_delete(umap, ukey);
-                ovs_mutex_unlock(&umap->mutex);
             }
 
             if (n_ops == REVALIDATE_MAX_BATCH) {
@@ -3025,6 +3027,7 @@ revalidator_sweep__(struct revalidator *revalidator, bool purge)
                 n_ops = 0;
             }
         }
+        ovs_mutex_unlock(&umap->mutex);
 
         if (n_ops) {
             push_ukey_ops(udpif, umap, ops, n_ops);
A potential race condition happened with the following 3 threads:
* PMD thread replaced the old_ukey and transitioned the state to
  UKEY_DELETED.
* RCU thread is freeing the old_ukey mutex.
* While the revalidator thread is trying to lock the old_ukey mutex.

We added some timestamps to the udpif_key state transitions and
the udpif_key mutex free action, as well as to the sweep try_lock for
the same udpif_key.  Whenever ovs-vswitchd dumped core, the udpif_key
try_lock always happened a bit later than the udpif_key mutex free
action.  For instance [3]:
ukey_destroy_time = 13217289156490
sweep_now = 13217289156568

The second timestamp is 78us later than the first.

Then the vswitchd process aborts at the revalidator thread's try_lock of
ukey->mutex because of the NULL pointer.

This patch adds a try_lock on the ukey's basket udpif_key map to
prevent the PMD and revalidator threads from accessing the same map
while replacing the udpif_key and transitioning the udpif_key state.

More details can be found at:
[1] https://mail.openvswitch.org/pipermail/ovs-discuss/2024-March/052973.html
[2] https://mail.openvswitch.org/pipermail/ovs-discuss/2024-February/052949.html
[3] https://mail.openvswitch.org/pipermail/ovs-discuss/2024-March/052993.html

Signed-off-by: LIU Yulong <i@liuyulong.me>
---
v2:
- Updated commit message to make 0-day Robot happy.
v3:
- Updated commit message to make checkpatch.py happy.
- Added some investigation details.
---
 ofproto/ofproto-dpif-upcall.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)