diff mbox series

[ovs-dev] STT kmod kernel panic investigated.

Message ID 332514c4-8391-4eae-ad63-fa02c19cecc3@k2.cloud
State Not Applicable
Delegated to: Ilya Maximets
Headers show
Series [ovs-dev] STT kmod kernel panic investigated. | expand

Checks

Context Check Description
ovsrobot/apply-robot fail apply and check: fail

Commit Message

Aleksandr Smirnov (K2 Cloud) Nov. 7, 2024, 8:26 a.m. UTC
Hello,

We got kernel crash in STT module in our could environment. I have made 
root cause analysis and I happy to provide RCA report here:

Version reported: OVS v2.17.6

Kernel crash:

[  120.574175] ------------[ cut here ]------------
[  120.574178] kernel BUG at net/core/skbuff.c:122!
[  120.574510] invalid opcode: 0000 [#1] SMP NOPTI
[  120.574840] CPU: 4 PID: 40 Comm: ksoftirqd/4 Kdump: loaded Tainted: 
G           OE    --------- -  - 4.18.0-477.27.1.el8_8.x86_64 #1
[  120.575538] Hardware name: Dell Inc. PowerEdge R720/0X6FFV, BIOS 
2.5.2 01/28/2015
[  120.575916] RIP: 0010:skb_panic+0x4b/0x4d
[  120.576288] Code: 00 00 50 8b 87 dc 00 00 00 50 8b 87 d8 00 00 00 50 
ff b7 e8 00 00 00 4c 8b 8f e0 00 00 00 48 c7 c7 f8 0a 1a 9d e8 58 10 95 
ff <0f> 0
b 48 8b 14 24 48 c7 c1 c0 ea ee 9c e8 a3 ff ff ff 48 c7 c6 00
[  120.577132] RSP: 0018:ffffba2581b7f678 EFLAGS: 00010246
[  120.577553] RAX: 000000000000008b RBX: ffff9087064a4b00 RCX: 
0000000000000000
[  120.577989] RDX: 0000000000000000 RSI: ffff9086f789e698 RDI: 
ffff9086f789e698
[  120.578428] RBP: 0000000000009ec2 R08: 0000000000000000 R09: 
c0000000ffff7fff
[  120.578877] R10: 0000000000000001 R11: ffffba2581b7f498 R12: 
0000000000002f1d
[  120.579576] R13: 0000000000000c20 R14: 0000000000000008 R15: 
ffff9085c999a400
[  120.580041] FS:  0000000000000000(0000) GS:ffff9086f7880000(0000) 
knlGS:0000000000000000
[  120.580512] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  120.580971] CR2: 00007f0eb711e340 CR3: 00000003b4410006 CR4: 
00000000000606e0
[  120.581445] Call Trace:
[  120.581906]  skb_push.cold.98+0x10/0x10
[  120.582374]  ovs_stt_xmit.cold.48+0x166/0x480 [openvswitch]
[  120.582870]  do_execute_actions+0xc51/0xc90 [openvswitch]
[  120.583361]  ? flow_lookup.isra.12+0x40/0xb0 [openvswitch]


Open vSwitch background to reproduce:

1. OVS action field in flow must have two 'output' commands executed on 
the same packet and routed to different ports. This double output 
results in input socket buffer will be cloned and used in subsequent 
xmit calls. Here I provide part of code that will be executed in 
do_execute_actions()

         case OVS_ACTION_ATTR_OUTPUT: {
             int port = nla_get_u32(a);
             struct sk_buff *clone;

             /* Every output action needs a separate clone
              * of 'skb', In case the output action is the
              * last action, cloning can be avoided.
              */
             if (nla_is_last(a, rem)) {
                 do_output(dp, skb, port, key);
                 /* 'skb' has been used for output.
                  */
                 return 0;
             }

             clone = skb_clone(skb, GFP_ATOMIC);
             if (clone)
                 do_output(dp, clone, port, key);

2. Input packet must be split into at least two fragments.

3. STT tunneling must be in use.


Open vSwitch kernel module background to reproduce.


1. We must have input packet consisting two fragments:
    Head part A (struct of sk_buff) and 2-nd part F (struct of sk_buff) 
A->frag_list == F

2. We call B = skb_clone(A) so, we create clone of input sk_buff. Note, 
that after clone second part of packet is not copied, thus A->flag_list 
== B->frag_list

3. Call send action on both buffers, in fact, ovs_stt_xmit(A), 
ovs_stt_xmit(B)

4. We get kernel crash.


Why it happens:

     STT code wants to prepend STT header in face of incoming packet. 
For input fragmented packet it wants to add STT header to every 
fragment. Thus, every fragment supplied with STT header cannot be 
enitire network packet anymore but rather set of individual packets.  
Therefore STT rebuilds input packet detaching fragments and turn them 
into  individual packets. Technically say, it moves A->frag_list pointer 
to A->next pointer. Then STT header is prepended to every element in skb 
chain,  (STT+)A, (STT+)A->next -- which is F.

     For now a situation is not very cute but somehow correct.

     Then we invoke same steps for B. B->frag_list still have pointer to 
F which already has STT header prepended.

     Running same way we fall into situation where B, B->next (==F) come 
to the code that adds STT headers. As result we could potentially have 
(STT+)B, (STT+STT+)B->next - which is F, so, second part of input packet 
will be corrupted, but, hopefully (?), a second attempt to save STT 
header alarmed memory availability guard code.

What is proposed to do:

     Note that Geneve code does not process fragmented-case 
individually. Logically say, we always have to prepend exactly ONE 
tunnel header in front of entire packet, it does not matter how many 
fragments it had on input. When such packet comes to output it shall be 
reassembled and possibly fragmented again by ip code. Since I find no 
reasons (except maybe historical reasons) to convert fragmented packet 
to several individual STT packets I have made an attempt to just remove 
STT packet defragmentation call, so, such like Geneve, STT adds only one 
header in front of entire packet.  After such change kernel crash gone 
and all packets were transmitted correctly in out testing environment.
I append this mail with possible patch just in case you want to have a fix.

Special attention required.

     STT code has installed NF hook that really need to reassemble 
fragments to  whole packet. I think a same or similar problem could 
happen here under same conditions. I did not make efforts to investigate 
this case.  An only question I have is Do we really need to cooperate 
with NF here?

=============== PATCH Here ==================

 From d8543538dd08bc53c729c26d119d430dd16379f6 Mon Sep 17 00:00:00 2001
From: Aleksandr Smirnov <alekssmirnov@k2.cloud>
Date: Fri, 1 Nov 2024 14:00:36 +0300
Subject: [PATCH] stt: fix kernel panic skb_push

---
  datapath/linux/compat/stt.c | 16 ----------------
  1 file changed, 16 deletions(-)

                        l3_proto, l4_proto, dst_mtu))

Comments

Ilya Maximets Nov. 27, 2024, 9:04 p.m. UTC | #1
On 11/7/24 09:26, Aleksandr Smirnov (K2 Cloud) wrote:
> Hello,
> 
> We got kernel crash in STT module in our could environment. I have made 
> root cause analysis and I happy to provide RCA report here:

Thanks for the detailed analysis!  This may be very useful for people still
using STT for whatever reason.

However, at this point in time, I don't think it makes a lot of sense to
apply any changes to the kernel module distributed within branch-2.17.
It contains way more serious issues, fixes for which were not backported
form the upstream kernel.  And it is also deprecated for a long time.
The 2.17 branch itself will reach end of life in February.

I'm also planing on full deprecation of STT and LISP tunnels support in
OVS, since there are no supported datapath implementations for these
tunnels and STT was also rejected upstream.  There is an STT implementation
for the windows datapath, but it should probably be removed, as we're not
really testing it and other datapaths will not get support for STT.  Also,
the protocol itself is not really friendly for the wild internet as it may
break middleboxes expecting proper TCP (one of the reasons it was rejected).

I'll send deprecation patches in coming weeks for review, so we can mark
them in OVS 3.5 and fully remove support in OVS 3.6.

Best regards, Ilya Maximets.

> 
> Version reported: OVS v2.17.6
> 
> Kernel crash:
> 
> [  120.574175] ------------[ cut here ]------------
> [  120.574178] kernel BUG at net/core/skbuff.c:122!
> [  120.574510] invalid opcode: 0000 [#1] SMP NOPTI
> [  120.574840] CPU: 4 PID: 40 Comm: ksoftirqd/4 Kdump: loaded Tainted: 
> G           OE    --------- -  - 4.18.0-477.27.1.el8_8.x86_64 #1
> [  120.575538] Hardware name: Dell Inc. PowerEdge R720/0X6FFV, BIOS 
> 2.5.2 01/28/2015
> [  120.575916] RIP: 0010:skb_panic+0x4b/0x4d
> [  120.576288] Code: 00 00 50 8b 87 dc 00 00 00 50 8b 87 d8 00 00 00 50 
> ff b7 e8 00 00 00 4c 8b 8f e0 00 00 00 48 c7 c7 f8 0a 1a 9d e8 58 10 95 
> ff <0f> 0
> b 48 8b 14 24 48 c7 c1 c0 ea ee 9c e8 a3 ff ff ff 48 c7 c6 00
> [  120.577132] RSP: 0018:ffffba2581b7f678 EFLAGS: 00010246
> [  120.577553] RAX: 000000000000008b RBX: ffff9087064a4b00 RCX: 
> 0000000000000000
> [  120.577989] RDX: 0000000000000000 RSI: ffff9086f789e698 RDI: 
> ffff9086f789e698
> [  120.578428] RBP: 0000000000009ec2 R08: 0000000000000000 R09: 
> c0000000ffff7fff
> [  120.578877] R10: 0000000000000001 R11: ffffba2581b7f498 R12: 
> 0000000000002f1d
> [  120.579576] R13: 0000000000000c20 R14: 0000000000000008 R15: 
> ffff9085c999a400
> [  120.580041] FS:  0000000000000000(0000) GS:ffff9086f7880000(0000) 
> knlGS:0000000000000000
> [  120.580512] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  120.580971] CR2: 00007f0eb711e340 CR3: 00000003b4410006 CR4: 
> 00000000000606e0
> [  120.581445] Call Trace:
> [  120.581906]  skb_push.cold.98+0x10/0x10
> [  120.582374]  ovs_stt_xmit.cold.48+0x166/0x480 [openvswitch]
> [  120.582870]  do_execute_actions+0xc51/0xc90 [openvswitch]
> [  120.583361]  ? flow_lookup.isra.12+0x40/0xb0 [openvswitch]
> 
> 
> Open vSwitch background to reproduce:
> 
> 1. OVS action field in flow must have two 'output' commands executed on 
> the same packet and routed to different ports. This double output 
> results in input socket buffer will be cloned and used in subsequent 
> xmit calls. Here I provide part of code that will be executed in 
> do_execute_actions()
> 
>          case OVS_ACTION_ATTR_OUTPUT: {
>              int port = nla_get_u32(a);
>              struct sk_buff *clone;
> 
>              /* Every output action needs a separate clone
>               * of 'skb', In case the output action is the
>               * last action, cloning can be avoided.
>               */
>              if (nla_is_last(a, rem)) {
>                  do_output(dp, skb, port, key);
>                  /* 'skb' has been used for output.
>                   */
>                  return 0;
>              }
> 
>              clone = skb_clone(skb, GFP_ATOMIC);
>              if (clone)
>                  do_output(dp, clone, port, key);
> 
> 2. Input packet must be split into at least two fragments.
> 
> 3. STT tunneling must be in use.
> 
> 
> Open vSwitch kernel module background to reproduce.
> 
> 
> 1. We must have input packet consisting two fragments:
>     Head part A (struct of sk_buff) and 2-nd part F (struct of sk_buff) 
> A->frag_list == F
> 
> 2. We call B = skb_clone(A) so, we create clone of input sk_buff. Note, 
> that after clone second part of packet is not copied, thus A->flag_list 
> == B->frag_list
> 
> 3. Call send action on both buffers, in fact, ovs_stt_xmit(A), 
> ovs_stt_xmit(B)
> 
> 4. We get kernel crash.
> 
> 
> Why it happens:
> 
>      STT code wants to prepend STT header in face of incoming packet. 
> For input fragmented packet it wants to add STT header to every 
> fragment. Thus, every fragment supplied with STT header cannot be 
> enitire network packet anymore but rather set of individual packets.  
> Therefore STT rebuilds input packet detaching fragments and turn them 
> into  individual packets. Technically say, it moves A->frag_list pointer 
> to A->next pointer. Then STT header is prepended to every element in skb 
> chain,  (STT+)A, (STT+)A->next -- which is F.
> 
>      For now a situation is not very cute but somehow correct.
> 
>      Then we invoke same steps for B. B->frag_list still have pointer to 
> F which already has STT header prepended.
> 
>      Running same way we fall into situation where B, B->next (==F) come 
> to the code that adds STT headers. As result we could potentially have 
> (STT+)B, (STT+STT+)B->next - which is F, so, second part of input packet 
> will be corrupted, but, hopefully (?), a second attempt to save STT 
> header alarmed memory availability guard code.
> 
> What is proposed to do:
> 
>      Note that Geneve code does not process fragmented-case 
> individually. Logically say, we always have to prepend exactly ONE 
> tunnel header in front of entire packet, it does not matter how many 
> fragments it had on input. When such packet comes to output it shall be 
> reassembled and possibly fragmented again by ip code. Since I find no 
> reasons (except maybe historical reasons) to convert fragmented packet 
> to several individual STT packets I have made an attempt to just remove 
> STT packet defragmentation call, so, such like Geneve, STT adds only one 
> header in front of entire packet.  After such change kernel crash gone 
> and all packets were transmitted correctly in out testing environment.
> I append this mail with possible patch just in case you want to have a fix.
> 
> Special attention required.
> 
>      STT code has installed NF hook that really need to reassemble 
> fragments to  whole packet. I think a same or similar problem could 
> happen here under same conditions. I did not make efforts to investigate 
> this case.  An only question I have is Do we really need to cooperate 
> with NF here?
> 
> =============== PATCH Here ==================
> 
>  From d8543538dd08bc53c729c26d119d430dd16379f6 Mon Sep 17 00:00:00 2001
> From: Aleksandr Smirnov <alekssmirnov@k2.cloud>
> Date: Fri, 1 Nov 2024 14:00:36 +0300
> Subject: [PATCH] stt: fix kernel panic skb_push
> 
> ---
>   datapath/linux/compat/stt.c | 16 ----------------
>   1 file changed, 16 deletions(-)
> 
> diff --git a/datapath/linux/compat/stt.c b/datapath/linux/compat/stt.c
> index 39a294764..7b1c7f3e7 100644
> --- a/datapath/linux/compat/stt.c
> +++ b/datapath/linux/compat/stt.c
> @@ -663,22 +663,6 @@ static struct sk_buff *push_stt_header(struct 
> sk_buff *head, __be64 tun_id,
>   {
>       struct sk_buff *skb;
> 
> -    if (skb_shinfo(head)->frag_list) {
> -        bool ipv4 = (l3_proto == htons(ETH_P_IP));
> -        bool tcp = (l4_proto == IPPROTO_TCP);
> -        bool csum_partial = (head->ip_summed == CHECKSUM_PARTIAL);
> -        int l4_offset = skb_transport_offset(head);
> -
> -        /* Need to call skb_orphan() to report currect true-size.
> -         * calling skb_orphan() in this layer is odd but SKB with
> -         * frag-list should not be associated with any socket, so
> -         * skb-orphan should be no-op. */
> -        skb_orphan(head);
> -        if (unlikely(segment_skb(&head, csum_partial,
> -                     ipv4, tcp, l4_offset)))
> -            goto error;
> -    }
> -
>       for (skb = head; skb; skb = skb->next) {
>           if (__push_stt_header(skb, tun_id, s_port, d_port, saddr, dst,
>                         l3_proto, l4_proto, dst_mtu))
diff mbox series

Patch

diff --git a/datapath/linux/compat/stt.c b/datapath/linux/compat/stt.c
index 39a294764..7b1c7f3e7 100644
--- a/datapath/linux/compat/stt.c
+++ b/datapath/linux/compat/stt.c
@@ -663,22 +663,6 @@  static struct sk_buff *push_stt_header(struct 
sk_buff *head, __be64 tun_id,
  {
      struct sk_buff *skb;

-    if (skb_shinfo(head)->frag_list) {
-        bool ipv4 = (l3_proto == htons(ETH_P_IP));
-        bool tcp = (l4_proto == IPPROTO_TCP);
-        bool csum_partial = (head->ip_summed == CHECKSUM_PARTIAL);
-        int l4_offset = skb_transport_offset(head);
-
-        /* Need to call skb_orphan() to report currect true-size.
-         * calling skb_orphan() in this layer is odd but SKB with
-         * frag-list should not be associated with any socket, so
-         * skb-orphan should be no-op. */
-        skb_orphan(head);
-        if (unlikely(segment_skb(&head, csum_partial,
-                     ipv4, tcp, l4_offset)))
-            goto error;
-    }
-
      for (skb = head; skb; skb = skb->next) {
          if (__push_stt_header(skb, tun_id, s_port, d_port, saddr, dst,