diff mbox series

[ovs-dev,v5] ofproto-dpif-upcall: Avoid stale ukeys leaks.

Message ID a206d08b44d77689fdfbf098f27456d3922641cd.1724914806.git.echaudro@redhat.com
State Accepted
Commit 180ab2fd635e03ffab5381bb1c22dda3f13aea7c
Delegated to: Eelco Chaudron
Headers show
Series [ovs-dev,v5] ofproto-dpif-upcall: Avoid stale ukeys leaks. | expand

Checks

Context Check Description
ovsrobot/apply-robot success apply and check: success
ovsrobot/github-robot-_Build_and_Test success github build: passed

Commit Message

Eelco Chaudron Aug. 29, 2024, 7 a.m. UTC
It is observed in some environments that there are much more ukeys than
actual DP flows. For example:

$ ovs-appctl upcall/show
system@ovs-system:
flows : (current 7) (avg 6) (max 117) (limit 2125)
offloaded flows : 525
dump duration : 1063ms
ufid enabled : true

23: (keys 3612)
24: (keys 3625)
25: (keys 3485)

The revalidator threads are busy revalidating the stale ukeys leading to
high CPU and long dump duration.

This patch tracks the number of consecutive missed dumps. If four dumps
are missed in a row, it is assumed that the datapath flow no longer
exists, and the ukey can be deleted.

Reported-by: Roi Dayan <roid@nvidia.com>
Co-authored-by: Han Zhou <hzhou@ovn.org>
Co-authored-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Han Zhou <hzhou@ovn.org>
Signed-off-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
---

v3: - Rewrote fix to use actual dump state, and added a tests case.
v4: - Added reason to end of flow_del_reason.
    - Rather than time based, make it missed dumps based.
    - Changed it from a TC specific test to a generic unit test.
v5: - Fixed spelling mistake.
    - Added coverage counter verification to the unit test.
---
 ofproto/ofproto-dpif-upcall.c                | 18 ++++++++
 tests/ofproto-dpif.at                        | 45 ++++++++++++++++++++
 utilities/usdt-scripts/flow_reval_monitor.py |  4 +-
 3 files changed, 66 insertions(+), 1 deletion(-)

Comments

Aaron Conole Aug. 30, 2024, 1 p.m. UTC | #1
Eelco Chaudron <echaudro@redhat.com> writes:

> It is observed in some environments that there are much more ukeys than
> actual DP flows. For example:
>
> $ ovs-appctl upcall/show
> system@ovs-system:
> flows : (current 7) (avg 6) (max 117) (limit 2125)
> offloaded flows : 525
> dump duration : 1063ms
> ufid enabled : true
>
> 23: (keys 3612)
> 24: (keys 3625)
> 25: (keys 3485)
>
> The revalidator threads are busy revalidating the stale ukeys leading to
> high CPU and long dump duration.
>
> This patch tracks the number of consecutive missed dumps. If four dumps
> are missed in a row, it is assumed that the datapath flow no longer
> exists, and the ukey can be deleted.
>
> Reported-by: Roi Dayan <roid@nvidia.com>
> Co-authored-by: Han Zhou <hzhou@ovn.org>
> Co-authored-by: Roi Dayan <roid@nvidia.com>
> Signed-off-by: Han Zhou <hzhou@ovn.org>
> Signed-off-by: Roi Dayan <roid@nvidia.com>
> Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
> ---

Thanks all, applied.
Roi Dayan Sept. 9, 2024, 1:45 p.m. UTC | #2
On 8/30/24 16:00, Aaron Conole wrote:
> Eelco Chaudron <echaudro@redhat.com> writes:
> 
>> It is observed in some environments that there are much more ukeys than
>> actual DP flows. For example:
>>
>> $ ovs-appctl upcall/show
>> system@ovs-system:
>> flows : (current 7) (avg 6) (max 117) (limit 2125)
>> offloaded flows : 525
>> dump duration : 1063ms
>> ufid enabled : true
>>
>> 23: (keys 3612)
>> 24: (keys 3625)
>> 25: (keys 3485)
>>
>> The revalidator threads are busy revalidating the stale ukeys leading to
>> high CPU and long dump duration.
>>
>> This patch tracks the number of consecutive missed dumps. If four dumps
>> are missed in a row, it is assumed that the datapath flow no longer
>> exists, and the ukey can be deleted.
>>
>> Reported-by: Roi Dayan <roid@nvidia.com>
>> Co-authored-by: Han Zhou <hzhou@ovn.org>
>> Co-authored-by: Roi Dayan <roid@nvidia.com>
>> Signed-off-by: Han Zhou <hzhou@ovn.org>
>> Signed-off-by: Roi Dayan <roid@nvidia.com>
>> Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
>> ---
> 
> Thanks all, applied.
> 

Hi,

Can we get this merged into 3.x branches maybe ?

 From 3.0 to 3.3 the python script and flow_del_reason enum
doesn't exists. With removing the diff and removing
"del_reason = FDR_FLOW_MISSING_DP;" should apply cleanly.

This patch applies cleanly without modification on branch 3.4.

Tell me if needed to send a back-ported patch that can be applied
to branches 3.0-3.3.

Thanks,
Roi

> _______________________________________________
> dev mailing list
> dev@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
Aaron Conole Sept. 9, 2024, 7:54 p.m. UTC | #3
Roi Dayan <roid@nvidia.com> writes:

> On 8/30/24 16:00, Aaron Conole wrote:
>> Eelco Chaudron <echaudro@redhat.com> writes:
>> 
>>> It is observed in some environments that there are much more ukeys than
>>> actual DP flows. For example:
>>>
>>> $ ovs-appctl upcall/show
>>> system@ovs-system:
>>> flows : (current 7) (avg 6) (max 117) (limit 2125)
>>> offloaded flows : 525
>>> dump duration : 1063ms
>>> ufid enabled : true
>>>
>>> 23: (keys 3612)
>>> 24: (keys 3625)
>>> 25: (keys 3485)
>>>
>>> The revalidator threads are busy revalidating the stale ukeys leading to
>>> high CPU and long dump duration.
>>>
>>> This patch tracks the number of consecutive missed dumps. If four dumps
>>> are missed in a row, it is assumed that the datapath flow no longer
>>> exists, and the ukey can be deleted.
>>>
>>> Reported-by: Roi Dayan <roid@nvidia.com>
>>> Co-authored-by: Han Zhou <hzhou@ovn.org>
>>> Co-authored-by: Roi Dayan <roid@nvidia.com>
>>> Signed-off-by: Han Zhou <hzhou@ovn.org>
>>> Signed-off-by: Roi Dayan <roid@nvidia.com>
>>> Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
>>> ---
>> Thanks all, applied.
>> 
>
> Hi,
>
> Can we get this merged into 3.x branches maybe ?

Yes, I think it makes sense there.  While there isn't a specific 'fixes'
for this, it does seem more like a fix rather than a feature.

> From 3.0 to 3.3 the python script and flow_del_reason enum
> doesn't exists. With removing the diff and removing
> "del_reason = FDR_FLOW_MISSING_DP;" should apply cleanly.

It will fail because it depends on some other support (for example
strip_dp_hash in the tests).  I'm working on a backport for it, and will
post.

> This patch applies cleanly without modification on branch 3.4.
>
> Tell me if needed to send a back-ported patch that can be applied
> to branches 3.0-3.3.

I will be posting it for each branch when it is ready - I think I should
be happy with it all by COB tomorrow.

> Thanks,
> Roi
>
>> _______________________________________________
>> dev mailing list
>> dev@openvswitch.org
>> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
>
> _______________________________________________
> dev mailing list
> dev@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
diff mbox series

Patch

diff --git a/ofproto/ofproto-dpif-upcall.c b/ofproto/ofproto-dpif-upcall.c
index 4d39bc5a7..e7d4c2b2c 100644
--- a/ofproto/ofproto-dpif-upcall.c
+++ b/ofproto/ofproto-dpif-upcall.c
@@ -57,6 +57,7 @@  COVERAGE_DEFINE(dumped_inconsistent_flow);
 COVERAGE_DEFINE(dumped_new_flow);
 COVERAGE_DEFINE(handler_duplicate_upcall);
 COVERAGE_DEFINE(revalidate_missed_dp_flow);
+COVERAGE_DEFINE(revalidate_missing_dp_flow);
 COVERAGE_DEFINE(ukey_dp_change);
 COVERAGE_DEFINE(ukey_invalid_stat_reset);
 COVERAGE_DEFINE(ukey_replace_contention);
@@ -284,6 +285,7 @@  enum flow_del_reason {
     FDR_TOO_EXPENSIVE,      /* Too expensive to revalidate. */
     FDR_UPDATE_FAIL,        /* Datapath update failed. */
     FDR_XLATION_ERROR,      /* Flow translation error. */
+    FDR_FLOW_MISSING_DP,    /* Flow is missing from the datapath. */
 };
 
 /* 'udpif_key's are responsible for tracking the little bit of state udpif
@@ -318,6 +320,7 @@  struct udpif_key {
     uint64_t dump_seq OVS_GUARDED;            /* Tracks udpif->dump_seq. */
     uint64_t reval_seq OVS_GUARDED;           /* Tracks udpif->reval_seq. */
     enum ukey_state state OVS_GUARDED;        /* Tracks ukey lifetime. */
+    uint32_t missed_dumps OVS_GUARDED;        /* Missed consecutive dumps. */
 
     /* 'state' debug information. */
     unsigned int state_thread OVS_GUARDED;    /* Thread that transitions. */
@@ -3040,6 +3043,21 @@  revalidator_sweep__(struct revalidator *revalidator, bool purge)
                     result = revalidate_ukey(udpif, ukey, &stats, &odp_actions,
                                              reval_seq, &recircs, &del_reason);
                 }
+
+                if (ukey->dump_seq != dump_seq) {
+                    ukey->missed_dumps++;
+                    if (ukey->missed_dumps >= 4) {
+                        /* If the flow was not dumped for 4 revalidator rounds,
+                         * we can assume the datapath flow no longer exists
+                         * and the ukey should be deleted. */
+                        COVERAGE_INC(revalidate_missing_dp_flow);
+                        del_reason = FDR_FLOW_MISSING_DP;
+                        result = UKEY_DELETE;
+                    }
+                } else {
+                    ukey->missed_dumps = 0;
+                }
+
                 if (result != UKEY_KEEP) {
                     /* Clears 'recircs' if filled by revalidate_ukey(). */
                     reval_op_init(&ops[n_ops++], result, udpif, ukey, &recircs,
diff --git a/tests/ofproto-dpif.at b/tests/ofproto-dpif.at
index 42fb66de6..1df944ef8 100644
--- a/tests/ofproto-dpif.at
+++ b/tests/ofproto-dpif.at
@@ -12661,3 +12661,48 @@  AT_CHECK([ovs-appctl revalidator/resume])
 
 OVS_VSWITCHD_STOP
 AT_CLEANUP
+
+AT_SETUP([ofproto-dpif - Cleanup missing datapath flows])
+
+OVS_VSWITCHD_START
+add_of_ports br0 1 2
+
+m4_define([ICMP_PKT], [m4_join([,],
+    [eth(src=50:54:00:00:00:05,dst=50:54:00:00:00:07),eth_type(0x0800)],
+    [ipv4(src=10.10.10.2,dst=10.10.10.1,proto=1,tos=1,ttl=128,frag=no)],
+    [icmp(type=8,code=0)])])
+
+AT_CHECK([ovs-ofctl del-flows br0])
+AT_CHECK([ovs-ofctl add-flow br0 'actions=normal' ])
+
+AT_CHECK([ovs-appctl netdev-dummy/receive p1 'ICMP_PKT'])
+
+AT_CHECK([ovs-appctl dpctl/dump-flows --names | strip_used | strip_stats | dnl
+          strip_duration | strip_dp_hash | sort], [0], [dnl
+flow-dump from the main thread:
+recirc_id(0),in_port(p1),packet_type(ns=0,id=0),eth(src=50:54:00:00:00:05,dst=50:54:00:00:00:07),eth_type(0x0800),ipv4(frag=no), packets:0, bytes:0, used:never, actions:br0,p2
+])
+
+dnl Make sure the ukey exists.
+AT_CHECK([ovs-appctl upcall/show | grep '(keys' | awk '{print $3}' | \
+            grep -q '1)'], [0])
+
+dnl Delete all datapath flows, and make sure they are gone.
+AT_CHECK([ovs-appctl dpctl/del-flows])
+AT_CHECK([ovs-appctl dpctl/dump-flows --names ], [0], [])
+
+dnl Move forward in time and make sure we have at least 4 * 500ms.
+AT_CHECK([ovs-appctl time/warp 3000 300], [0], [ignore])
+
+dnl Make sure no more ukeys exists.
+AT_CHECK([ovs-appctl upcall/show | grep '(keys' | awk '{print $3}' | \
+            grep -qv '0)'], [1])
+
+dnl Verify coverage counter was hit.
+AT_CHECK([ovs-appctl coverage/read-counter revalidate_missing_dp_flow], [0],
+         [dnl
+1
+])
+
+OVS_VSWITCHD_STOP(["/failed to flow_del (No such file or directory)/d"])
+AT_CLEANUP
diff --git a/utilities/usdt-scripts/flow_reval_monitor.py b/utilities/usdt-scripts/flow_reval_monitor.py
index 28479a565..80c9c98bd 100755
--- a/utilities/usdt-scripts/flow_reval_monitor.py
+++ b/utilities/usdt-scripts/flow_reval_monitor.py
@@ -255,6 +255,7 @@  FdrReasons = IntEnum(
         "FDR_TOO_EXPENSIVE",
         "FDR_UPDATE_FAIL",
         "FDR_XLATION_ERROR",
+        "FDR_FLOW_MISSING_DP"
     ],
     start=0,
 )
@@ -270,7 +271,8 @@  FdrReasonStrings = {
     FdrReasons.FDR_PURGE: "User requested flow deletion",
     FdrReasons.FDR_TOO_EXPENSIVE: "Too expensive to revalidate",
     FdrReasons.FDR_UPDATE_FAIL: "Datapath update failed",
-    FdrReasons.FDR_XLATION_ERROR: "Flow translation error"
+    FdrReasons.FDR_XLATION_ERROR: "Flow translation error",
+    FdrReasons.FDR_FLOW_MISSING_DP: "Flow is missing from the datapath"
 }