diff mbox series

[ovs-dev] ofproto/ofproto: Initialize learn add rule flag.

Message ID 1729717964-129926-1-git-send-email-mike.ovsiannikov@nutanix.com
State New
Delegated to: aaron conole
Headers show
Series [ovs-dev] ofproto/ofproto: Initialize learn add rule flag. | expand

Checks

Context Check Description
ovsrobot/apply-robot success apply and check: success
ovsrobot/github-robot-_Build_and_Test success github build: passed

Commit Message

Mike Ovsiannikov Oct. 23, 2024, 9:12 p.m. UTC
Initialize learn_adds_rule in order to prevent
crash in ofproto_flow_mod_learn_finish() due to
being invoked with uninitialized rules
collections.

Signed-off-by: Mike Ovsiannikov <mike.ovsiannikov@nutanix.com>

---
 ofproto/ofproto.c | 3 +++
 1 file changed, 3 insertions(+)

Comments

Aaron Conole Oct. 29, 2024, 1:59 p.m. UTC | #1
Mike Ovsiannikov <mike.ovsiannikov@nutanix.com> writes:

> Initialize learn_adds_rule in order to prevent
> crash in ofproto_flow_mod_learn_finish() due to
> being invoked with uninitialized rules
> collections.
>
> Signed-off-by: Mike Ovsiannikov <mike.ovsiannikov@nutanix.com>
>
> ---

Hi Mike,

I think this should have:

Fixes: 1f4a89336682 ("ofproto: Refactor packet_out handling.")

Also, I don't see many details on how we have uninitialized rules
collections.  Is it possible to include a stack trace / steps to
reproduce the problem?  Could it be that there is a race somewhere?
Is it during add/remove of some group?  We shouldn't get into this case
without at least a valid rule, I think (at least it seems like all paths
need to cross the XC_LEARN / or a learn xlate test).
Mike Ovsiannikov Oct. 29, 2024, 10:39 p.m. UTC | #2
Hi Aron,

Here is the stack trace of the failure / segmentation fault:

#0  0x00007f33b423c1c1 in ofproto_flow_mod_learn_finish (ofm=0x230db00, orig_ofproto=orig_ofproto@entry=0x1d39220) at ofproto/ofproto-provider.h:509
#1  0x00007f33b4241e9a in ofproto_dpif_xcache_execute (ofproto=0x1d39200, xcache=<optimized out>, stats=0x7fff8e76eb10) at ofproto/ofproto-dpif.c:4966
#2  0x00007f33b4247723 in nxt_resume (ofproto_=0x1d39220, pin=0x7fff8e76f7b0) at ofproto/ofproto-dpif.c:5811
#3  0x00007f33b42308dd in handle_nxt_resume (ofconn=ofconn@entry=0x20689a0, oh=<optimized out>) at ofproto/ofproto.c:3851
#4  0x00007f33b423f429 in handle_single_part_openflow (type=OFPTYPE_NXT_RESUME, oh=<optimized out>, ofconn=0x20689a0) at ofproto/ofproto.c:8823
#5  handle_openflow (ofconn=0x20689a0, msgs=0x7fff8e7707a0) at ofproto/ofproto.c:8959
#6  0x00007f33b422cce0 in ofconn_run (handle_openflow=0x7f33b423e560 <handle_openflow>, ofconn=0x20689a0) at ofproto/connmgr.c:1329
#7  connmgr_run (mgr=0x1cfc140, handle_openflow=handle_openflow@entry=0x7f33b423e560 <handle_openflow>) at ofproto/connmgr.c:356
#8  0x00007f33b4236f70 in ofproto_run (p=0x1d39220) at ofproto/ofproto.c:1990
#9  0x00000000004101fc in bridge_run__ () at vswitchd/bridge.c:3287
#10 0x000000000041615e in bridge_run () at vswitchd/bridge.c:3346
#11 0x000000000040be05 in main (argc=<optimized out>, argv=<optimized out>) at vswitchd/ovs-vswitchd.c:132

The learn.ofm structure appears to be valid at this point, except that learn_adds_rule and old_rules new_rules fields does not appear to be initialized:

(gdb) p entry->learn.ofm[0]
$3 = {
  temp_rule = 0x15fb520,
  criteria = {
    table_id = 1 '\001',
    cr = {
      node = {
        prev = 0x15bd570,
        next = {
          p = 0x15bd570
        }
      },
      priority = 31,
      cls_match = {
        p = 0x0
      },
      match = {
        {
          {
            flow = 0x21b77a0,
            mask = 0x21b77d0
          },
          flows = {0x21b77a0, 0x21b77d0}
        },
        tun_md = 0x0
      }
    },
    version = 18446744073709551614,
    cookie = 0,
    cookie_mask = 0,
    out_port = 65535,
    out_group = 4294967295,
    include_hidden = false,
    include_readonly = false
  },
  conjs = 0x0,
  n_conjs = 0,
  command = 2,
  modify_cookie = true,
  modify_may_add_flow = true,
  modify_keep_counts = true,
  event = 492535614,
  table_id = 1 '\001',
  version = 492535619,
  learn_adds_rule = 2,
  old_rules = {
    collection = {
      objs = 0x1,
      n = 492535617,
      capacity = 5,
      stub = {0x1d5b7f41, 0x30, 0x0, 0x0, 0x0}
    }
  },
  new_rules = {
    collection = {
      objs = 0x0,
      n = 0,
      capacity = 0,
      stub = {0x0, 0x0, 0x0, 0x600000000, 0x0}
    }
  }
}

In this particular case the following line in ofproto_flow_mod_learn_finish()  results in NULL pointer that later is de-referenced in subsequent if statement.
struct rule *rule = rule_collection_rules(&ofm->new_rules)[0];

As far as I can tell, presently the flag initialization only exists in one place: inside packet_xlate(). Though it does not appear to me that packet_xlate() is guaranteed be invoked after the structure is allocated inside xlate_learn_action().
The old and new rules collections appears to be initialized by the following call chain, but it is not  guaranteed that this call chain is always / unconditionally executed.
 ofproto_flow_mod_learn() /
  ofproto_flow_mod_learn_start() /
   ofproto_flow_mod_start() /
    rule_collection_init()

I did notice that the learn_adds_rule flag was introduced by 1f4a89336682 change that you’ve mentioned, though I’m not convinced that the flag is guaranteed to be always initialized.

Thank you,
 — Mike.

On Oct 29, 2024, at 6:59 AM, Aaron Conole <aconole@redhat.com> wrote:

!-------------------------------------------------------------------|
 CAUTION: External Email

|-------------------------------------------------------------------!

Mike Ovsiannikov <mike.ovsiannikov@nutanix.com> writes:

Initialize learn_adds_rule in order to prevent
crash in ofproto_flow_mod_learn_finish() due to
being invoked with uninitialized rules
collections.

Signed-off-by: Mike Ovsiannikov <mike.ovsiannikov@nutanix.com>

---

Hi Mike,

I think this should have:

Fixes: 1f4a89336682 ("ofproto: Refactor packet_out handling.")

Also, I don't see many details on how we have uninitialized rules
collections.  Is it possible to include a stack trace / steps to
reproduce the problem?  Could it be that there is a race somewhere?
Is it during add/remove of some group?  We shouldn't get into this case
without at least a valid rule, I think (at least it seems like all paths
need to cross the XC_LEARN / or a learn xlate test).
Aaron Conole Oct. 31, 2024, 1:46 p.m. UTC | #3
Mike Ovsiannikov <mike.ovsiannikov@nutanix.com> writes:

> Hi Aron, 
>
> Here is the stack trace of the failure / segmentation fault:
>
> #0  0x00007f33b423c1c1 in ofproto_flow_mod_learn_finish (ofm=0x230db00,
> orig_ofproto=orig_ofproto@entry=0x1d39220) at ofproto/ofproto-provider.h:509
> #1  0x00007f33b4241e9a in ofproto_dpif_xcache_execute (ofproto=0x1d39200,
> xcache=<optimized out>, stats=0x7fff8e76eb10) at ofproto/ofproto-dpif.c:4966
> #2  0x00007f33b4247723 in nxt_resume (ofproto_=0x1d39220, pin=0x7fff8e76f7b0) at
> ofproto/ofproto-dpif.c:5811
> #3  0x00007f33b42308dd in handle_nxt_resume (ofconn=ofconn@entry=0x20689a0,
> oh=<optimized out>) at ofproto/ofproto.c:3851
> #4  0x00007f33b423f429 in handle_single_part_openflow (type=OFPTYPE_NXT_RESUME,
> oh=<optimized out>, ofconn=0x20689a0) at ofproto/ofproto.c:8823
> #5  handle_openflow (ofconn=0x20689a0, msgs=0x7fff8e7707a0) at ofproto/ofproto.c:8959
> #6  0x00007f33b422cce0 in ofconn_run (handle_openflow=0x7f33b423e560
> <handle_openflow>, ofconn=0x20689a0) at ofproto/connmgr.c:1329
> #7  connmgr_run (mgr=0x1cfc140,
> handle_openflow=handle_openflow@entry=0x7f33b423e560 <handle_openflow>) at
> ofproto/connmgr.c:356
> #8  0x00007f33b4236f70 in ofproto_run (p=0x1d39220) at ofproto/ofproto.c:1990
> #9  0x00000000004101fc in bridge_run__ () at vswitchd/bridge.c:3287
> #10 0x000000000041615e in bridge_run () at vswitchd/bridge.c:3346
> #11 0x000000000040be05 in main (argc=<optimized out>, argv=<optimized out>) at
> vswitchd/ovs-vswitchd.c:132
>
> The learn.ofm structure appears to be valid at this point, except that learn_adds_rule and
> old_rules new_rules fields does not appear to be initialized:
>
> (gdb) p entry->learn.ofm[0]
> $3 = {
>   temp_rule = 0x15fb520, 
>   criteria = {
>     table_id = 1 '\001', 
>     cr = {
>       node = {
>         prev = 0x15bd570, 
>         next = {
>           p = 0x15bd570
>         }
>       }, 
>       priority = 31, 
>       cls_match = {
>         p = 0x0
>       }, 
>       match = {
>         {
>           {
>             flow = 0x21b77a0, 
>             mask = 0x21b77d0
>           }, 
>           flows = {0x21b77a0, 0x21b77d0}
>         }, 
>         tun_md = 0x0
>       }
>     }, 
>     version = 18446744073709551614, 
>     cookie = 0, 
>     cookie_mask = 0, 
>     out_port = 65535, 
>     out_group = 4294967295, 
>     include_hidden = false, 
>     include_readonly = false
>   }, 
>   conjs = 0x0, 
>   n_conjs = 0, 
>   command = 2, 
>   modify_cookie = true, 
>   modify_may_add_flow = true, 
>   modify_keep_counts = true, 
>   event = 492535614, 
>   table_id = 1 '\001', 
>   version = 492535619, 
>   learn_adds_rule = 2, 
>   old_rules = {
>     collection = {
>       objs = 0x1, 
>       n = 492535617, 
>       capacity = 5, 
>       stub = {0x1d5b7f41, 0x30, 0x0, 0x0, 0x0}
>     }
>   }, 
>   new_rules = {
>     collection = {
>       objs = 0x0, 
>       n = 0, 
>       capacity = 0, 
>       stub = {0x0, 0x0, 0x0, 0x600000000, 0x0}
>     }
>   }
> }
>
> In this particular case the following line in ofproto_flow_mod_learn_finish()  results in NULL
> pointer that later is de-referenced in subsequent if statement.
> struct rule *rule = rule_collection_rules(&ofm->new_rules)[0];
>
> As far as I can tell, presently the flag initialization only exists in one place: inside packet_xlate
> (). Though it does not appear to me that packet_xlate() is guaranteed be invoked after the
> structure is allocated inside xlate_learn_action().
> The old and new rules collections appears to be initialized by the following call chain, but it is
> not  guaranteed that this call chain is always / unconditionally executed.
>  ofproto_flow_mod_learn() /
>   ofproto_flow_mod_learn_start() /
>    ofproto_flow_mod_start() /
>     rule_collection_init()
>
> I did notice that the learn_adds_rule flag was introduced by 1f4a89336682 change that
> you’ve mentioned, though I’m not convinced that the flag is guaranteed to be always
> initialized.

Thanks for the details above.  You're correct - it isn't being
initialized properly.  However, this looks like it is happening with a
specific flow controller setup using some kind of continuation chain.
If you have that flow chain, maybe we can also emulate it in a test.
I'm actually concerned that we may have other uninitialized (or
incorrectly initialized) details.

> Thank you,
>  — Mike.
>
>  On Oct 29, 2024, at 6:59 AM, Aaron Conole <aconole@redhat.com> wrote:
>
>  !-------------------------------------------------------------------|
>   CAUTION: External Email
>
>  |-------------------------------------------------------------------!
>
>  Mike Ovsiannikov <mike.ovsiannikov@nutanix.com> writes:
>
>  Initialize learn_adds_rule in order to prevent
>  crash in ofproto_flow_mod_learn_finish() due to
>  being invoked with uninitialized rules
>  collections.
>
>  Signed-off-by: Mike Ovsiannikov <mike.ovsiannikov@nutanix.com>
>
>  ---
>
>  Hi Mike,
>
>  I think this should have:
>
>  Fixes: 1f4a89336682 ("ofproto: Refactor packet_out handling.")
>
>  Also, I don't see many details on how we have uninitialized rules
>  collections.  Is it possible to include a stack trace / steps to
>  reproduce the problem?  Could it be that there is a race somewhere?
>  Is it during add/remove of some group?  We shouldn't get into this case
>  without at least a valid rule, I think (at least it seems like all paths
>  need to cross the XC_LEARN / or a learn xlate test).
Mike Ovsiannikov Nov. 1, 2024, 10:32 p.m. UTC | #4
I don’t see anything peculiar with the flow controller setup. Perhaps the only difference with other test setups, that I can think of, was that in the setup wiehre failure occurred VMs were configured with two NICs one in overlay and the other in the underlay network. So far attempts to reproduce the problem with simpler test configurations were not successful.

I was thinking about running unit tests under memory debugger / valgrind, though I haven’t had a chance to try this.

On Oct 31, 2024, at 6:46 AM, Aaron Conole <aconole@redhat.com> wrote:

!-------------------------------------------------------------------|
 CAUTION: External Email

|-------------------------------------------------------------------!

Mike Ovsiannikov <mike.ovsiannikov@nutanix.com<mailto:mike.ovsiannikov@nutanix.com>> writes:

Hi Aron,

Here is the stack trace of the failure / segmentation fault:

#0  0x00007f33b423c1c1 in ofproto_flow_mod_learn_finish (ofm=0x230db00,
orig_ofproto=orig_ofproto@entry=0x1d39220) at ofproto/ofproto-provider.h:509
#1  0x00007f33b4241e9a in ofproto_dpif_xcache_execute (ofproto=0x1d39200,
xcache=<optimized out>, stats=0x7fff8e76eb10) at ofproto/ofproto-dpif.c:4966
#2  0x00007f33b4247723 in nxt_resume (ofproto_=0x1d39220, pin=0x7fff8e76f7b0) at
ofproto/ofproto-dpif.c:5811
#3  0x00007f33b42308dd in handle_nxt_resume (ofconn=ofconn@entry=0x20689a0,
oh=<optimized out>) at ofproto/ofproto.c:3851
#4  0x00007f33b423f429 in handle_single_part_openflow (type=OFPTYPE_NXT_RESUME,
oh=<optimized out>, ofconn=0x20689a0) at ofproto/ofproto.c:8823
#5  handle_openflow (ofconn=0x20689a0, msgs=0x7fff8e7707a0) at ofproto/ofproto.c:8959
#6  0x00007f33b422cce0 in ofconn_run (handle_openflow=0x7f33b423e560
<handle_openflow>, ofconn=0x20689a0) at ofproto/connmgr.c:1329
#7  connmgr_run (mgr=0x1cfc140,
handle_openflow=handle_openflow@entry=0x7f33b423e560 <handle_openflow>) at
ofproto/connmgr.c:356
#8  0x00007f33b4236f70 in ofproto_run (p=0x1d39220) at ofproto/ofproto.c:1990
#9  0x00000000004101fc in bridge_run__ () at vswitchd/bridge.c:3287
#10 0x000000000041615e in bridge_run () at vswitchd/bridge.c:3346
#11 0x000000000040be05 in main (argc=<optimized out>, argv=<optimized out>) at
vswitchd/ovs-vswitchd.c:132

The learn.ofm structure appears to be valid at this point, except that learn_adds_rule and
old_rules new_rules fields does not appear to be initialized:

(gdb) p entry->learn.ofm[0]
$3 = {
 temp_rule = 0x15fb520,
 criteria = {
   table_id = 1 '\001',
   cr = {
     node = {
       prev = 0x15bd570,
       next = {
         p = 0x15bd570
       }
     },
     priority = 31,
     cls_match = {
       p = 0x0
     },
     match = {
       {
         {
           flow = 0x21b77a0,
           mask = 0x21b77d0
         },
         flows = {0x21b77a0, 0x21b77d0}
       },
       tun_md = 0x0
     }
   },
   version = 18446744073709551614,
   cookie = 0,
   cookie_mask = 0,
   out_port = 65535,
   out_group = 4294967295,
   include_hidden = false,
   include_readonly = false
 },
 conjs = 0x0,
 n_conjs = 0,
 command = 2,
 modify_cookie = true,
 modify_may_add_flow = true,
 modify_keep_counts = true,
 event = 492535614,
 table_id = 1 '\001',
 version = 492535619,
 learn_adds_rule = 2,
 old_rules = {
   collection = {
     objs = 0x1,
     n = 492535617,
     capacity = 5,
     stub = {0x1d5b7f41, 0x30, 0x0, 0x0, 0x0}
   }
 },
 new_rules = {
   collection = {
     objs = 0x0,
     n = 0,
     capacity = 0,
     stub = {0x0, 0x0, 0x0, 0x600000000, 0x0}
   }
 }
}

In this particular case the following line in ofproto_flow_mod_learn_finish()  results in NULL
pointer that later is de-referenced in subsequent if statement.
struct rule *rule = rule_collection_rules(&ofm->new_rules)[0];

As far as I can tell, presently the flag initialization only exists in one place: inside packet_xlate
(). Though it does not appear to me that packet_xlate() is guaranteed be invoked after the
structure is allocated inside xlate_learn_action().
The old and new rules collections appears to be initialized by the following call chain, but it is
not  guaranteed that this call chain is always / unconditionally executed.
ofproto_flow_mod_learn() /
 ofproto_flow_mod_learn_start() /
  ofproto_flow_mod_start() /
   rule_collection_init()

I did notice that the learn_adds_rule flag was introduced by 1f4a89336682 change that
you’ve mentioned, though I’m not convinced that the flag is guaranteed to be always
initialized.

Thanks for the details above.  You're correct - it isn't being
initialized properly.  However, this looks like it is happening with a
specific flow controller setup using some kind of continuation chain.
If you have that flow chain, maybe we can also emulate it in a test.
I'm actually concerned that we may have other uninitialized (or
incorrectly initialized) details.

Thank you,
— Mike.

On Oct 29, 2024, at 6:59 AM, Aaron Conole <aconole@redhat.com> wrote:

!-------------------------------------------------------------------|
 CAUTION: External Email

|-------------------------------------------------------------------!

Mike Ovsiannikov <mike.ovsiannikov@nutanix.com> writes:

Initialize learn_adds_rule in order to prevent
crash in ofproto_flow_mod_learn_finish() due to
being invoked with uninitialized rules
collections.

Signed-off-by: Mike Ovsiannikov <mike.ovsiannikov@nutanix.com>

---

Hi Mike,

I think this should have:

Fixes: 1f4a89336682 ("ofproto: Refactor packet_out handling.")

Also, I don't see many details on how we have uninitialized rules
collections.  Is it possible to include a stack trace / steps to
reproduce the problem?  Could it be that there is a race somewhere?
Is it during add/remove of some group?  We shouldn't get into this case
without at least a valid rule, I think (at least it seems like all paths
need to cross the XC_LEARN / or a learn xlate test).
diff mbox series

Patch

diff --git a/ofproto/ofproto.c b/ofproto/ofproto.c
index 122a06f30..9ba93e380 100644
--- a/ofproto/ofproto.c
+++ b/ofproto/ofproto.c
@@ -8204,6 +8204,9 @@  ofproto_flow_mod_init(struct ofproto *ofproto, struct ofproto_flow_mod *ofm,
     ofm->n_conjs = 0;
     ofm->table_id = fm->table_id;
 
+    /* Initialize flag used by ofproto_dpif_xcache_execute(). */
+    ofm->learn_adds_rule = false;
+
     bool check_buffer_id = false;
 
     switch (ofm->command) {