Message ID | 1599473445-29748-1-git-send-email-dceara@redhat.com |
---|---|
State | Rejected |
Headers | show |
Series | [ovs-dev,branch-20.03] chassis: Fix the way encaps are updated for a chassis record. | expand |
On Mon, Sep 7, 2020 at 3:11 AM Dumitru Ceara <dceara@redhat.com> wrote: > > ovn-controller always stores the last configured system-id/chassis-id in > memory regardless if the connection to the SB is up or down. This is OK > as long as the change can be committed successfully when the SB DB > connection comes back up. > > Without this change, if the chassis-id changes while the SB connection is > down, ovn-controller will fail to create the new record but nevertheless > update its in-memory chassis-id. When the SB connection is restored > ovn-controller tries to find the record corresponding to the chassis-id > it stored in memory. This fails causing ovn-controller to try to insert > a new record. But at this point a constraint violation is hit in the SB > because the Encap records of the "stale" chassis still exist in the DB, > along with the old chassis record. > > This commit changes the way we search for a "stale" chassis record in the > SB to cover the above mentioned case. Also, in such cases there's no need > to recreate the Encaps, it's sufficient to update the chassis_name field. > > Fixes: 5344f24ecb1a ("ovn-controller: Refactor chassis.c to abstract the string parsing") > Signed-off-by: Dumitru Ceara <dceara@redhat.com> > > (cherry-picked from master commit 94a32fca2d2b825fece0ef5b1873459bd9857dd3) Hi Dumitru, Sorry that I missed the original review, but this patch seems causing problem in master already. When RBAC is enabled, it will result in below error: 2020-09-08T07:20:31.991Z|00014|binding|INFO|Claiming lport lsp1 for this chassis. 2020-09-08T07:20:31.992Z|00015|ovsdb_idl|WARN|transaction error: {"details":"RBAC rules for client \"chassis-1\" role \"ovn-controller\" prohibit modification of table \"Encap\".","error":"permission error"} You can simply reproduce by: make sandbox, and then run: ovn-nbctl --wait=hv sync, which will hang forever because of the problem. This patch seems to be updating the "name" field of Encap table, which violates the design of RBAC, which uses "name" as the identity of the client. We shouldn't allow an user to change system-id/chassis-id directly. I think the proper way is firstly destroying the old chassis and then creating a new chassis, which would also avoid the complex logic in ovn-controller regarding the "stale record" handling. What do you think? Thanks, Han > --- > controller/chassis.c | 60 ++++++++++++++++++++++++++++++------------------- > tests/ovn-controller.at | 17 ++++++++++++++ > 2 files changed, 54 insertions(+), 23 deletions(-) > > diff --git a/controller/chassis.c b/controller/chassis.c > index 522893e..d525463 100644 > --- a/controller/chassis.c > +++ b/controller/chassis.c > @@ -380,10 +380,7 @@ chassis_tunnels_changed(const struct sset *encap_type_set, > { > size_t encap_type_count = 0; > > - for (int i = 0; i < chassis_rec->n_encaps; i++) { > - if (strcmp(chassis_rec->name, chassis_rec->encaps[i]->chassis_name)) { > - return true; > - } > + for (size_t i = 0; i < chassis_rec->n_encaps; i++) { > > if (!sset_contains(encap_type_set, chassis_rec->encaps[i]->type)) { > return true; > @@ -456,6 +453,19 @@ chassis_build_encaps(struct ovsdb_idl_txn *ovnsb_idl_txn, > } > > /* > + * Updates encaps for a given chassis. This can happen when the chassis > + * name has changed. Also, the only thing we support updating is the > + * chassis_name. For other changes the encaps will be recreated. > + */ > +static void > +chassis_update_encaps(const struct sbrec_chassis *chassis) > +{ > + for (size_t i = 0; i < chassis->n_encaps; i++) { > + sbrec_encap_set_chassis_name(chassis->encaps[i], chassis->name); > + } > +} > + > +/* > * Returns a pointer to a chassis record from 'chassis_table' that > * matches at least one tunnel config. > */ > @@ -486,9 +496,10 @@ chassis_get_stale_record(const struct sbrec_chassis_table *chassis_table, > /* If this is a chassis config update after we initialized the record once > * then we should always be able to find it with the ID we saved in > * chassis_state. > - * Otherwise (i.e., first time we create the record) then we check if there's > - * a stale record from a previous controller run that didn't end gracefully > - * and reuse it. If not then we create a new record. > + * Otherwise (i.e., first time we create the record or if the system-id > + * changed) then we check if there's a stale record from a previous > + * controller run that didn't end gracefully and reuse it. If not then we > + * create a new record. > */ > static const struct sbrec_chassis * > chassis_get_record(struct ovsdb_idl_txn *ovnsb_idl_txn, > @@ -497,29 +508,31 @@ chassis_get_record(struct ovsdb_idl_txn *ovnsb_idl_txn, > const struct ovs_chassis_cfg *ovs_cfg, > const char *chassis_id) > { > - const struct sbrec_chassis *chassis_rec; > + const struct sbrec_chassis *chassis = NULL; > > if (chassis_info_id_inited(&chassis_state)) { > - chassis_rec = chassis_lookup_by_name(sbrec_chassis_by_name, > - chassis_info_id(&chassis_state)); > - if (!chassis_rec) { > - VLOG_DBG("Could not find Chassis, will create it" > - ": stored (%s) ovs (%s)", > + chassis = chassis_lookup_by_name(sbrec_chassis_by_name, > + chassis_info_id(&chassis_state)); > + if (!chassis) { > + VLOG_DBG("Could not find Chassis, will check if the id changed: " > + "stored (%s) ovs (%s)", > chassis_info_id(&chassis_state), chassis_id); > - if (ovnsb_idl_txn) { > - /* Recreate the chassis record. */ > - chassis_rec = sbrec_chassis_insert(ovnsb_idl_txn); > - } > } > - } else { > - chassis_rec = > - chassis_get_stale_record(chassis_table, ovs_cfg, chassis_id); > + } > > - if (!chassis_rec && ovnsb_idl_txn) { > - chassis_rec = sbrec_chassis_insert(ovnsb_idl_txn); > + if (!chassis) { > + chassis = chassis_get_stale_record(chassis_table, ovs_cfg, chassis_id); > + } > + > + if (!chassis) { > + /* Recreate the chassis record. */ > + VLOG_DBG("Could not find Chassis, will create it: %s", chassis_id); > + if (ovnsb_idl_txn) { > + return sbrec_chassis_insert(ovnsb_idl_txn); > } > } > - return chassis_rec; > + > + return chassis; > } > > /* Update a Chassis record based on the config in the ovs config. */ > @@ -567,6 +580,7 @@ chassis_update(const struct sbrec_chassis *chassis_rec, > &ovs_cfg->encap_ip_set, ovs_cfg->encap_csum, > chassis_rec); > if (!tunnels_changed) { > + chassis_update_encaps(chassis_rec); > return; > } > > diff --git a/tests/ovn-controller.at b/tests/ovn-controller.at > index 63b2581..1c7aa58 100644 > --- a/tests/ovn-controller.at > +++ b/tests/ovn-controller.at > @@ -196,6 +196,23 @@ OVS_WAIT_UNTIL([ > test "${sysid}" = "${chassis_id}" > ]) > > +# Simulate system-id changing while ovn-controller is disconnected from the > +# SB. > +valid_remote=$(ovs-vsctl get Open_vSwitch . external_ids:ovn-remote) > +invalid_remote=tcp:0.0.0.0:4242 > +ovs-vsctl set Open_vSwitch . external_ids:ovn-remote=${invalid_remote} > +expected_state="not connected" > +OVS_WAIT_UNTIL([ > + test "${expected_state}" = "$(ovn-appctl -t ovn-controller connection-status)" > +]) > +sysid=${sysid}-bar > +ovs-vsctl set Open_vSwitch . external-ids:system-id="${sysid}" > +ovs-vsctl set Open_vSwitch . external_ids:ovn-remote=${valid_remote} > +OVS_WAIT_UNTIL([ > + chassis_id=$(ovn-sbctl get Chassis "${sysid}" name) > + test "${sysid}" = "${chassis_id}" > +]) > + > # Gracefully terminate daemons > OVN_CLEANUP_SBOX([hv]) > OVN_CLEANUP_VSWITCH([main]) > -- > 1.8.3.1 > > _______________________________________________ > dev mailing list > dev@openvswitch.org > https://mail.openvswitch.org/mailman/listinfo/ovs-dev
On Tue, Sep 8, 2020 at 12:58 PM Han Zhou <zhouhan@gmail.com> wrote: > On Mon, Sep 7, 2020 at 3:11 AM Dumitru Ceara <dceara@redhat.com> wrote: > > > > ovn-controller always stores the last configured system-id/chassis-id in > > memory regardless if the connection to the SB is up or down. This is OK > > as long as the change can be committed successfully when the SB DB > > connection comes back up. > > > > Without this change, if the chassis-id changes while the SB connection is > > down, ovn-controller will fail to create the new record but nevertheless > > update its in-memory chassis-id. When the SB connection is restored > > ovn-controller tries to find the record corresponding to the chassis-id > > it stored in memory. This fails causing ovn-controller to try to insert > > a new record. But at this point a constraint violation is hit in the SB > > because the Encap records of the "stale" chassis still exist in the DB, > > along with the old chassis record. > > > > This commit changes the way we search for a "stale" chassis record in the > > SB to cover the above mentioned case. Also, in such cases there's no need > > to recreate the Encaps, it's sufficient to update the chassis_name field. > > > > Fixes: 5344f24ecb1a ("ovn-controller: Refactor chassis.c to abstract the > string parsing") > > Signed-off-by: Dumitru Ceara <dceara@redhat.com> > > > > (cherry-picked from master commit > 94a32fca2d2b825fece0ef5b1873459bd9857dd3) > > Hi Dumitru, > > Sorry that I missed the original review, but this patch seems causing > problem in master already. When RBAC is enabled, it will result in below > error: > 2020-09-08T07:20:31.991Z|00014|binding|INFO|Claiming lport lsp1 for this > chassis. > 2020-09-08T07:20:31.992Z|00015|ovsdb_idl|WARN|transaction error: > {"details":"RBAC rules for client \"chassis-1\" role \"ovn-controller\" > prohibit modification of table \"Encap\".","error":"permission error"} > > You can simply reproduce by: make sandbox, and then run: ovn-nbctl > --wait=hv sync, which will hang forever because of the problem. > This patch seems to be updating the "name" field of Encap table, which > violates the design of RBAC, which uses "name" as the identity of the > client. We shouldn't allow an user to change system-id/chassis-id directly. > I think the proper way is firstly destroying the old chassis and then > creating a new chassis, which would also avoid the complex logic in > ovn-controller regarding the "stale record" handling. What do you think? > > Hi Han, Yes. Dumitru submitted the patch to fix this issue in master and I applied it just now. Can you please test again with the latest master. https://github.com/ovn-org/ovn/commit/849c5a492a26337789db1a2dc97570989ef08c34 We see the same issue with branch-20.06 too. Thanks Numan Thanks, > Han > > > --- > > controller/chassis.c | 60 > ++++++++++++++++++++++++++++++------------------- > > tests/ovn-controller.at | 17 ++++++++++++++ > > 2 files changed, 54 insertions(+), 23 deletions(-) > > > > diff --git a/controller/chassis.c b/controller/chassis.c > > index 522893e..d525463 100644 > > --- a/controller/chassis.c > > +++ b/controller/chassis.c > > @@ -380,10 +380,7 @@ chassis_tunnels_changed(const struct sset > *encap_type_set, > > { > > size_t encap_type_count = 0; > > > > - for (int i = 0; i < chassis_rec->n_encaps; i++) { > > - if (strcmp(chassis_rec->name, > chassis_rec->encaps[i]->chassis_name)) { > > - return true; > > - } > > + for (size_t i = 0; i < chassis_rec->n_encaps; i++) { > > > > if (!sset_contains(encap_type_set, > chassis_rec->encaps[i]->type)) { > > return true; > > @@ -456,6 +453,19 @@ chassis_build_encaps(struct ovsdb_idl_txn > *ovnsb_idl_txn, > > } > > > > /* > > + * Updates encaps for a given chassis. This can happen when the chassis > > + * name has changed. Also, the only thing we support updating is the > > + * chassis_name. For other changes the encaps will be recreated. > > + */ > > +static void > > +chassis_update_encaps(const struct sbrec_chassis *chassis) > > +{ > > + for (size_t i = 0; i < chassis->n_encaps; i++) { > > + sbrec_encap_set_chassis_name(chassis->encaps[i], chassis->name); > > + } > > +} > > + > > +/* > > * Returns a pointer to a chassis record from 'chassis_table' that > > * matches at least one tunnel config. > > */ > > @@ -486,9 +496,10 @@ chassis_get_stale_record(const struct > sbrec_chassis_table *chassis_table, > > /* If this is a chassis config update after we initialized the record > once > > * then we should always be able to find it with the ID we saved in > > * chassis_state. > > - * Otherwise (i.e., first time we create the record) then we check if > there's > > - * a stale record from a previous controller run that didn't end > gracefully > > - * and reuse it. If not then we create a new record. > > + * Otherwise (i.e., first time we create the record or if the system-id > > + * changed) then we check if there's a stale record from a previous > > + * controller run that didn't end gracefully and reuse it. If not then > we > > + * create a new record. > > */ > > static const struct sbrec_chassis * > > chassis_get_record(struct ovsdb_idl_txn *ovnsb_idl_txn, > > @@ -497,29 +508,31 @@ chassis_get_record(struct ovsdb_idl_txn > *ovnsb_idl_txn, > > const struct ovs_chassis_cfg *ovs_cfg, > > const char *chassis_id) > > { > > - const struct sbrec_chassis *chassis_rec; > > + const struct sbrec_chassis *chassis = NULL; > > > > if (chassis_info_id_inited(&chassis_state)) { > > - chassis_rec = chassis_lookup_by_name(sbrec_chassis_by_name, > > - > chassis_info_id(&chassis_state)); > > - if (!chassis_rec) { > > - VLOG_DBG("Could not find Chassis, will create it" > > - ": stored (%s) ovs (%s)", > > + chassis = chassis_lookup_by_name(sbrec_chassis_by_name, > > + > chassis_info_id(&chassis_state)); > > + if (!chassis) { > > + VLOG_DBG("Could not find Chassis, will check if the id > changed: " > > + "stored (%s) ovs (%s)", > > chassis_info_id(&chassis_state), chassis_id); > > - if (ovnsb_idl_txn) { > > - /* Recreate the chassis record. */ > > - chassis_rec = sbrec_chassis_insert(ovnsb_idl_txn); > > - } > > } > > - } else { > > - chassis_rec = > > - chassis_get_stale_record(chassis_table, ovs_cfg, > chassis_id); > > + } > > > > - if (!chassis_rec && ovnsb_idl_txn) { > > - chassis_rec = sbrec_chassis_insert(ovnsb_idl_txn); > > + if (!chassis) { > > + chassis = chassis_get_stale_record(chassis_table, ovs_cfg, > chassis_id); > > + } > > + > > + if (!chassis) { > > + /* Recreate the chassis record. */ > > + VLOG_DBG("Could not find Chassis, will create it: %s", > chassis_id); > > + if (ovnsb_idl_txn) { > > + return sbrec_chassis_insert(ovnsb_idl_txn); > > } > > } > > - return chassis_rec; > > + > > + return chassis; > > } > > > > /* Update a Chassis record based on the config in the ovs config. */ > > @@ -567,6 +580,7 @@ chassis_update(const struct sbrec_chassis > *chassis_rec, > > &ovs_cfg->encap_ip_set, > ovs_cfg->encap_csum, > > chassis_rec); > > if (!tunnels_changed) { > > + chassis_update_encaps(chassis_rec); > > return; > > } > > > > diff --git a/tests/ovn-controller.at b/tests/ovn-controller.at > > index 63b2581..1c7aa58 100644 > > --- a/tests/ovn-controller.at > > +++ b/tests/ovn-controller.at > > @@ -196,6 +196,23 @@ OVS_WAIT_UNTIL([ > > test "${sysid}" = "${chassis_id}" > > ]) > > > > +# Simulate system-id changing while ovn-controller is disconnected from > the > > +# SB. > > +valid_remote=$(ovs-vsctl get Open_vSwitch . external_ids:ovn-remote) > > +invalid_remote=tcp:0.0.0.0:4242 > > +ovs-vsctl set Open_vSwitch . external_ids:ovn-remote=${invalid_remote} > > +expected_state="not connected" > > +OVS_WAIT_UNTIL([ > > + test "${expected_state}" = "$(ovn-appctl -t ovn-controller > connection-status)" > > +]) > > +sysid=${sysid}-bar > > +ovs-vsctl set Open_vSwitch . external-ids:system-id="${sysid}" > > +ovs-vsctl set Open_vSwitch . external_ids:ovn-remote=${valid_remote} > > +OVS_WAIT_UNTIL([ > > + chassis_id=$(ovn-sbctl get Chassis "${sysid}" name) > > + test "${sysid}" = "${chassis_id}" > > +]) > > + > > # Gracefully terminate daemons > > OVN_CLEANUP_SBOX([hv]) > > OVN_CLEANUP_VSWITCH([main]) > > -- > > 1.8.3.1 > > > > _______________________________________________ > > dev mailing list > > dev@openvswitch.org > > https://mail.openvswitch.org/mailman/listinfo/ovs-dev > _______________________________________________ > dev mailing list > dev@openvswitch.org > https://mail.openvswitch.org/mailman/listinfo/ovs-dev > >
On Tue, Sep 8, 2020 at 1:16 PM Numan Siddique <numans@ovn.org> wrote: > > > On Tue, Sep 8, 2020 at 12:58 PM Han Zhou <zhouhan@gmail.com> wrote: > >> On Mon, Sep 7, 2020 at 3:11 AM Dumitru Ceara <dceara@redhat.com> wrote: >> > >> > ovn-controller always stores the last configured system-id/chassis-id in >> > memory regardless if the connection to the SB is up or down. This is OK >> > as long as the change can be committed successfully when the SB DB >> > connection comes back up. >> > >> > Without this change, if the chassis-id changes while the SB connection >> is >> > down, ovn-controller will fail to create the new record but nevertheless >> > update its in-memory chassis-id. When the SB connection is restored >> > ovn-controller tries to find the record corresponding to the chassis-id >> > it stored in memory. This fails causing ovn-controller to try to insert >> > a new record. But at this point a constraint violation is hit in the SB >> > because the Encap records of the "stale" chassis still exist in the DB, >> > along with the old chassis record. >> > >> > This commit changes the way we search for a "stale" chassis record in >> the >> > SB to cover the above mentioned case. Also, in such cases there's no >> need >> > to recreate the Encaps, it's sufficient to update the chassis_name >> field. >> > >> > Fixes: 5344f24ecb1a ("ovn-controller: Refactor chassis.c to abstract the >> string parsing") >> > Signed-off-by: Dumitru Ceara <dceara@redhat.com> >> > >> > (cherry-picked from master commit >> 94a32fca2d2b825fece0ef5b1873459bd9857dd3) >> >> Hi Dumitru, >> >> Sorry that I missed the original review, but this patch seems causing >> problem in master already. When RBAC is enabled, it will result in below >> error: >> 2020-09-08T07:20:31.991Z|00014|binding|INFO|Claiming lport lsp1 for this >> chassis. >> 2020-09-08T07:20:31.992Z|00015|ovsdb_idl|WARN|transaction error: >> {"details":"RBAC rules for client \"chassis-1\" role \"ovn-controller\" >> prohibit modification of table \"Encap\".","error":"permission error"} >> >> You can simply reproduce by: make sandbox, and then run: ovn-nbctl >> --wait=hv sync, which will hang forever because of the problem. >> This patch seems to be updating the "name" field of Encap table, which >> violates the design of RBAC, which uses "name" as the identity of the >> client. We shouldn't allow an user to change system-id/chassis-id >> directly. >> I think the proper way is firstly destroying the old chassis and then >> creating a new chassis, which would also avoid the complex logic in >> ovn-controller regarding the "stale record" handling. What do you think? >> >> > Hi Han, > > Yes. Dumitru submitted the patch to fix this issue in master and I applied > it just now. > Can you please test again with the latest master. > > > https://github.com/ovn-org/ovn/commit/849c5a492a26337789db1a2dc97570989ef08c34 > > We see the same issue with branch-20.06 too. > Hi Dumitru, Can you please make a 2 patch series for branch-20.03 which includes this patch and the other one which fixes the RBAC issue. Thanks Numan > > Thanks > Numan > > Thanks, >> Han >> >> > --- >> > controller/chassis.c | 60 >> ++++++++++++++++++++++++++++++------------------- >> > tests/ovn-controller.at | 17 ++++++++++++++ >> > 2 files changed, 54 insertions(+), 23 deletions(-) >> > >> > diff --git a/controller/chassis.c b/controller/chassis.c >> > index 522893e..d525463 100644 >> > --- a/controller/chassis.c >> > +++ b/controller/chassis.c >> > @@ -380,10 +380,7 @@ chassis_tunnels_changed(const struct sset >> *encap_type_set, >> > { >> > size_t encap_type_count = 0; >> > >> > - for (int i = 0; i < chassis_rec->n_encaps; i++) { >> > - if (strcmp(chassis_rec->name, >> chassis_rec->encaps[i]->chassis_name)) { >> > - return true; >> > - } >> > + for (size_t i = 0; i < chassis_rec->n_encaps; i++) { >> > >> > if (!sset_contains(encap_type_set, >> chassis_rec->encaps[i]->type)) { >> > return true; >> > @@ -456,6 +453,19 @@ chassis_build_encaps(struct ovsdb_idl_txn >> *ovnsb_idl_txn, >> > } >> > >> > /* >> > + * Updates encaps for a given chassis. This can happen when the chassis >> > + * name has changed. Also, the only thing we support updating is the >> > + * chassis_name. For other changes the encaps will be recreated. >> > + */ >> > +static void >> > +chassis_update_encaps(const struct sbrec_chassis *chassis) >> > +{ >> > + for (size_t i = 0; i < chassis->n_encaps; i++) { >> > + sbrec_encap_set_chassis_name(chassis->encaps[i], >> chassis->name); >> > + } >> > +} >> > + >> > +/* >> > * Returns a pointer to a chassis record from 'chassis_table' that >> > * matches at least one tunnel config. >> > */ >> > @@ -486,9 +496,10 @@ chassis_get_stale_record(const struct >> sbrec_chassis_table *chassis_table, >> > /* If this is a chassis config update after we initialized the record >> once >> > * then we should always be able to find it with the ID we saved in >> > * chassis_state. >> > - * Otherwise (i.e., first time we create the record) then we check if >> there's >> > - * a stale record from a previous controller run that didn't end >> gracefully >> > - * and reuse it. If not then we create a new record. >> > + * Otherwise (i.e., first time we create the record or if the system-id >> > + * changed) then we check if there's a stale record from a previous >> > + * controller run that didn't end gracefully and reuse it. If not then >> we >> > + * create a new record. >> > */ >> > static const struct sbrec_chassis * >> > chassis_get_record(struct ovsdb_idl_txn *ovnsb_idl_txn, >> > @@ -497,29 +508,31 @@ chassis_get_record(struct ovsdb_idl_txn >> *ovnsb_idl_txn, >> > const struct ovs_chassis_cfg *ovs_cfg, >> > const char *chassis_id) >> > { >> > - const struct sbrec_chassis *chassis_rec; >> > + const struct sbrec_chassis *chassis = NULL; >> > >> > if (chassis_info_id_inited(&chassis_state)) { >> > - chassis_rec = chassis_lookup_by_name(sbrec_chassis_by_name, >> > - >> chassis_info_id(&chassis_state)); >> > - if (!chassis_rec) { >> > - VLOG_DBG("Could not find Chassis, will create it" >> > - ": stored (%s) ovs (%s)", >> > + chassis = chassis_lookup_by_name(sbrec_chassis_by_name, >> > + >> chassis_info_id(&chassis_state)); >> > + if (!chassis) { >> > + VLOG_DBG("Could not find Chassis, will check if the id >> changed: " >> > + "stored (%s) ovs (%s)", >> > chassis_info_id(&chassis_state), chassis_id); >> > - if (ovnsb_idl_txn) { >> > - /* Recreate the chassis record. */ >> > - chassis_rec = sbrec_chassis_insert(ovnsb_idl_txn); >> > - } >> > } >> > - } else { >> > - chassis_rec = >> > - chassis_get_stale_record(chassis_table, ovs_cfg, >> chassis_id); >> > + } >> > >> > - if (!chassis_rec && ovnsb_idl_txn) { >> > - chassis_rec = sbrec_chassis_insert(ovnsb_idl_txn); >> > + if (!chassis) { >> > + chassis = chassis_get_stale_record(chassis_table, ovs_cfg, >> chassis_id); >> > + } >> > + >> > + if (!chassis) { >> > + /* Recreate the chassis record. */ >> > + VLOG_DBG("Could not find Chassis, will create it: %s", >> chassis_id); >> > + if (ovnsb_idl_txn) { >> > + return sbrec_chassis_insert(ovnsb_idl_txn); >> > } >> > } >> > - return chassis_rec; >> > + >> > + return chassis; >> > } >> > >> > /* Update a Chassis record based on the config in the ovs config. */ >> > @@ -567,6 +580,7 @@ chassis_update(const struct sbrec_chassis >> *chassis_rec, >> > &ovs_cfg->encap_ip_set, >> ovs_cfg->encap_csum, >> > chassis_rec); >> > if (!tunnels_changed) { >> > + chassis_update_encaps(chassis_rec); >> > return; >> > } >> > >> > diff --git a/tests/ovn-controller.at b/tests/ovn-controller.at >> > index 63b2581..1c7aa58 100644 >> > --- a/tests/ovn-controller.at >> > +++ b/tests/ovn-controller.at >> > @@ -196,6 +196,23 @@ OVS_WAIT_UNTIL([ >> > test "${sysid}" = "${chassis_id}" >> > ]) >> > >> > +# Simulate system-id changing while ovn-controller is disconnected from >> the >> > +# SB. >> > +valid_remote=$(ovs-vsctl get Open_vSwitch . external_ids:ovn-remote) >> > +invalid_remote=tcp:0.0.0.0:4242 >> > +ovs-vsctl set Open_vSwitch . external_ids:ovn-remote=${invalid_remote} >> > +expected_state="not connected" >> > +OVS_WAIT_UNTIL([ >> > + test "${expected_state}" = "$(ovn-appctl -t ovn-controller >> connection-status)" >> > +]) >> > +sysid=${sysid}-bar >> > +ovs-vsctl set Open_vSwitch . external-ids:system-id="${sysid}" >> > +ovs-vsctl set Open_vSwitch . external_ids:ovn-remote=${valid_remote} >> > +OVS_WAIT_UNTIL([ >> > + chassis_id=$(ovn-sbctl get Chassis "${sysid}" name) >> > + test "${sysid}" = "${chassis_id}" >> > +]) >> > + >> > # Gracefully terminate daemons >> > OVN_CLEANUP_SBOX([hv]) >> > OVN_CLEANUP_VSWITCH([main]) >> > -- >> > 1.8.3.1 >> > >> > _______________________________________________ >> > dev mailing list >> > dev@openvswitch.org >> > https://mail.openvswitch.org/mailman/listinfo/ovs-dev >> _______________________________________________ >> dev mailing list >> dev@openvswitch.org >> https://mail.openvswitch.org/mailman/listinfo/ovs-dev >> >>
On 9/8/20 9:47 AM, Numan Siddique wrote: > > > On Tue, Sep 8, 2020 at 1:16 PM Numan Siddique <numans@ovn.org > <mailto:numans@ovn.org>> wrote: > > > > On Tue, Sep 8, 2020 at 12:58 PM Han Zhou <zhouhan@gmail.com > <mailto:zhouhan@gmail.com>> wrote: > > On Mon, Sep 7, 2020 at 3:11 AM Dumitru Ceara <dceara@redhat.com > <mailto:dceara@redhat.com>> wrote: > > > > ovn-controller always stores the last configured > system-id/chassis-id in > > memory regardless if the connection to the SB is up or down. > This is OK > > as long as the change can be committed successfully when the SB DB > > connection comes back up. > > > > Without this change, if the chassis-id changes while the SB > connection is > > down, ovn-controller will fail to create the new record but > nevertheless > > update its in-memory chassis-id. When the SB connection is > restored > > ovn-controller tries to find the record corresponding to the > chassis-id > > it stored in memory. This fails causing ovn-controller to try > to insert > > a new record. But at this point a constraint violation is hit > in the SB > > because the Encap records of the "stale" chassis still exist > in the DB, > > along with the old chassis record. > > > > This commit changes the way we search for a "stale" chassis > record in the > > SB to cover the above mentioned case. Also, in such cases > there's no need > > to recreate the Encaps, it's sufficient to update the > chassis_name field. > > > > Fixes: 5344f24ecb1a ("ovn-controller: Refactor chassis.c to > abstract the > string parsing") > > Signed-off-by: Dumitru Ceara <dceara@redhat.com > <mailto:dceara@redhat.com>> > > > > (cherry-picked from master commit > 94a32fca2d2b825fece0ef5b1873459bd9857dd3) > > Hi Dumitru, > Hi Han, Numan, > Sorry that I missed the original review, but this patch seems > causing > problem in master already. When RBAC is enabled, it will result > in below > error: > 2020-09-08T07:20:31.991Z|00014|binding|INFO|Claiming lport lsp1 > for this > chassis. > 2020-09-08T07:20:31.992Z|00015|ovsdb_idl|WARN|transaction error: > {"details":"RBAC rules for client \"chassis-1\" role > \"ovn-controller\" > prohibit modification of table \"Encap\".","error":"permission > error"} > > You can simply reproduce by: make sandbox, and then run: ovn-nbctl > --wait=hv sync, which will hang forever because of the problem. > This patch seems to be updating the "name" field of Encap table, > which > violates the design of RBAC, which uses "name" as the identity > of the > client. We shouldn't allow an user to change > system-id/chassis-id directly. How can we enforce this from within OVN? > I think the proper way is firstly destroying the old chassis and > then > creating a new chassis, which would also avoid the complex logic in > ovn-controller regarding the "stale record" handling. What do > you think? > > I'm afraid this would be a behavior change that might introduce bugs in CMSs that allow changing system-ids (without RBAC enabled) without explicitly deleting the "stale" Chassis *and* Chassis_private record. Also, forcing CMS to do this housekeeping will probably add complex code in the CMS. For example, if the "old" Chassis record is deleted before ovn-controller receives the update about the system-id change in OVS DB then ovn-controller will just recreate it. > Hi Han, > > Yes. Dumitru submitted the patch to fix this issue in master and I > applied it just now. > Can you please test again with the latest master. > > https://github.com/ovn-org/ovn/commit/849c5a492a26337789db1a2dc97570989ef08c34 > > We see the same issue with branch-20.06 too. > > > > Hi Dumitru, > > Can you please make a 2 patch series for branch-20.03 which includes > this patch and the other one > which fixes the RBAC issue. > > Thanks > Numan > Like you said, we have two options: 1. Leave it up to the CMS to cleanup Chassis/Chassis_private when system-id changes and completely ignore in ovn-controller any stale chassis records that might exist in the SB DB. In this case we'd just let transactions fail due to constraint violations if the CMS cleanup of the Chassis/Chassis_private records doesn't happen or is incorrect. 2. Make it more user friendly for the case when RBAC is _not used_ and try to cleanup after ourselves in ovn-controller and reuse the stale records. This is already happening with the code on the master branch and if we cherry pick [0] as Numan suggested and add it to this series it would be the case for branch-20.03 too. [0] would have to go to branch-20.06 as well. I would say, at least for now, the least intrusive change, with the least chance of breaking CMSs, is #2 above but I'll wait for confirmation before sending a new version of this patch. Thanks, Dumitru [0] https://github.com/ovn-org/ovn/commit/849c5a492a26337789db1a2dc97570989ef08c34
On Tue, Sep 8, 2020 at 2:24 AM Dumitru Ceara <dceara@redhat.com> wrote: > > On 9/8/20 9:47 AM, Numan Siddique wrote: > > > > > > On Tue, Sep 8, 2020 at 1:16 PM Numan Siddique <numans@ovn.org > > <mailto:numans@ovn.org>> wrote: > > > > > > > > On Tue, Sep 8, 2020 at 12:58 PM Han Zhou <zhouhan@gmail.com > > <mailto:zhouhan@gmail.com>> wrote: > > > > On Mon, Sep 7, 2020 at 3:11 AM Dumitru Ceara <dceara@redhat.com > > <mailto:dceara@redhat.com>> wrote: > > > > > > ovn-controller always stores the last configured > > system-id/chassis-id in > > > memory regardless if the connection to the SB is up or down. > > This is OK > > > as long as the change can be committed successfully when the SB DB > > > connection comes back up. > > > > > > Without this change, if the chassis-id changes while the SB > > connection is > > > down, ovn-controller will fail to create the new record but > > nevertheless > > > update its in-memory chassis-id. When the SB connection is > > restored > > > ovn-controller tries to find the record corresponding to the > > chassis-id > > > it stored in memory. This fails causing ovn-controller to try > > to insert > > > a new record. But at this point a constraint violation is hit > > in the SB > > > because the Encap records of the "stale" chassis still exist > > in the DB, > > > along with the old chassis record. > > > > > > This commit changes the way we search for a "stale" chassis > > record in the > > > SB to cover the above mentioned case. Also, in such cases > > there's no need > > > to recreate the Encaps, it's sufficient to update the > > chassis_name field. > > > > > > Fixes: 5344f24ecb1a ("ovn-controller: Refactor chassis.c to > > abstract the > > string parsing") > > > Signed-off-by: Dumitru Ceara <dceara@redhat.com > > <mailto:dceara@redhat.com>> > > > > > > (cherry-picked from master commit > > 94a32fca2d2b825fece0ef5b1873459bd9857dd3) > > > > Hi Dumitru, > > > > Hi Han, Numan, > > > Sorry that I missed the original review, but this patch seems > > causing > > problem in master already. When RBAC is enabled, it will result > > in below > > error: > > 2020-09-08T07:20:31.991Z|00014|binding|INFO|Claiming lport lsp1 > > for this > > chassis. > > 2020-09-08T07:20:31.992Z|00015|ovsdb_idl|WARN|transaction error: > > {"details":"RBAC rules for client \"chassis-1\" role > > \"ovn-controller\" > > prohibit modification of table \"Encap\".","error":"permission > > error"} > > > > You can simply reproduce by: make sandbox, and then run: ovn-nbctl > > --wait=hv sync, which will hang forever because of the problem. > > This patch seems to be updating the "name" field of Encap table, > > which > > violates the design of RBAC, which uses "name" as the identity > > of the > > client. We shouldn't allow an user to change > > system-id/chassis-id directly. > > How can we enforce this from within OVN? > My suggestion is don't enforce this from OVN. We can simply document this clearly. It would be good to prevent misuse as much as we can, but there are so many ways to break OVN if users don't follow the correct procedure, and we can't prevent all of them. Changing system-id is something I think very abnormal and if someone does it purposely they must know what they are doing. The logic is becoming more and more complex in OVN just for this misuse case, so I am thinking about simplifying this from the requirement point of view, unless there is a decent way of solving the problem. > > I think the proper way is firstly destroying the old chassis and > > then > > creating a new chassis, which would also avoid the complex logic in > > ovn-controller regarding the "stale record" handling. What do > > you think? > > > > > > I'm afraid this would be a behavior change that might introduce bugs in > CMSs that allow changing system-ids (without RBAC enabled) without > explicitly deleting the "stale" Chassis *and* Chassis_private record. > > Also, forcing CMS to do this housekeeping will probably add complex code > in the CMS. For example, if the "old" Chassis record is deleted before > ovn-controller receives the update about the system-id change in OVS DB > then ovn-controller will just recreate it. > Is changing system-id a common operation? I couldn't think about a real use case when we need to change system-id. Also, it doesn't make much sense to me if the solution supports changing it only when RBAC is disabled. In such case we will at least document more clearly what's supported in RBAC case and non-RBAC case, which could be more confusing than just don't support changing system-id. For corner cases, would it be sufficient to just document the proper steps of destroying and recreating a chassis whenever necessary? > > Hi Han, > > > > Yes. Dumitru submitted the patch to fix this issue in master and I > > applied it just now. > > Can you please test again with the latest master. > > > > https://github.com/ovn-org/ovn/commit/849c5a492a26337789db1a2dc97570989ef08c34 > > > > We see the same issue with branch-20.06 too. > > > > > > > > Hi Dumitru, > > > > Can you please make a 2 patch series for branch-20.03 which includes > > this patch and the other one > > which fixes the RBAC issue. > > > > Thanks > > Numan > > > > Like you said, we have two options: > > 1. Leave it up to the CMS to cleanup Chassis/Chassis_private when > system-id changes and completely ignore in ovn-controller any stale > chassis records that might exist in the SB DB. In this case we'd just > let transactions fail due to constraint violations if the CMS cleanup of > the Chassis/Chassis_private records doesn't happen or is incorrect. > > 2. Make it more user friendly for the case when RBAC is _not used_ and > try to cleanup after ourselves in ovn-controller and reuse the stale > records. This is already happening with the code on the master branch > and if we cherry pick [0] as Numan suggested and add it to this series > it would be the case for branch-20.03 too. [0] would have to go to > branch-20.06 as well. > > I would say, at least for now, the least intrusive change, with the > least chance of breaking CMSs, is #2 above but I'll wait for > confirmation before sending a new version of this patch. Unfortunately [0] is still broken. The steps to reproduce: $ make sandbox $ ovs-vsctl set open . external_ids:system-id=new-chassis $ ovs-vsctl set open . external_ids:system-id=new-chassis-2 The problem is, the "name" column is the RBAC identity of a ovn-controller client. RBAC defines what operations are allowed for an authenticated client, but the identity itself should never be changed. Otherwise, there is no point for RBAC. Thanks, Han > > Thanks, > Dumitru > > [0] > https://github.com/ovn-org/ovn/commit/849c5a492a26337789db1a2dc97570989ef08c34 >
On 9/8/20 7:43 PM, Han Zhou wrote: > > > On Tue, Sep 8, 2020 at 2:24 AM Dumitru Ceara <dceara@redhat.com > <mailto:dceara@redhat.com>> wrote: >> >> On 9/8/20 9:47 AM, Numan Siddique wrote: >> > >> > >> > On Tue, Sep 8, 2020 at 1:16 PM Numan Siddique <numans@ovn.org > <mailto:numans@ovn.org> >> > <mailto:numans@ovn.org <mailto:numans@ovn.org>>> wrote: >> > >> > >> > >> > On Tue, Sep 8, 2020 at 12:58 PM Han Zhou <zhouhan@gmail.com > <mailto:zhouhan@gmail.com> >> > <mailto:zhouhan@gmail.com <mailto:zhouhan@gmail.com>>> wrote: >> > >> > On Mon, Sep 7, 2020 at 3:11 AM Dumitru Ceara > <dceara@redhat.com <mailto:dceara@redhat.com> >> > <mailto:dceara@redhat.com <mailto:dceara@redhat.com>>> wrote: >> > > >> > > ovn-controller always stores the last configured >> > system-id/chassis-id in >> > > memory regardless if the connection to the SB is up or down. >> > This is OK >> > > as long as the change can be committed successfully when > the SB DB >> > > connection comes back up. >> > > >> > > Without this change, if the chassis-id changes while the SB >> > connection is >> > > down, ovn-controller will fail to create the new record but >> > nevertheless >> > > update its in-memory chassis-id. When the SB connection is >> > restored >> > > ovn-controller tries to find the record corresponding to the >> > chassis-id >> > > it stored in memory. This fails causing ovn-controller to try >> > to insert >> > > a new record. But at this point a constraint violation is hit >> > in the SB >> > > because the Encap records of the "stale" chassis still exist >> > in the DB, >> > > along with the old chassis record. >> > > >> > > This commit changes the way we search for a "stale" chassis >> > record in the >> > > SB to cover the above mentioned case. Also, in such cases >> > there's no need >> > > to recreate the Encaps, it's sufficient to update the >> > chassis_name field. >> > > >> > > Fixes: 5344f24ecb1a ("ovn-controller: Refactor chassis.c to >> > abstract the >> > string parsing") >> > > Signed-off-by: Dumitru Ceara <dceara@redhat.com > <mailto:dceara@redhat.com> >> > <mailto:dceara@redhat.com <mailto:dceara@redhat.com>>> >> > > >> > > (cherry-picked from master commit >> > 94a32fca2d2b825fece0ef5b1873459bd9857dd3) >> > >> > Hi Dumitru, >> > >> >> Hi Han, Numan, >> >> > Sorry that I missed the original review, but this patch seems >> > causing >> > problem in master already. When RBAC is enabled, it will result >> > in below >> > error: >> > 2020-09-08T07:20:31.991Z|00014|binding|INFO|Claiming lport lsp1 >> > for this >> > chassis. >> > 2020-09-08T07:20:31.992Z|00015|ovsdb_idl|WARN|transaction error: >> > {"details":"RBAC rules for client \"chassis-1\" role >> > \"ovn-controller\" >> > prohibit modification of table \"Encap\".","error":"permission >> > error"} >> > >> > You can simply reproduce by: make sandbox, and then run: > ovn-nbctl >> > --wait=hv sync, which will hang forever because of the problem. >> > This patch seems to be updating the "name" field of Encap table, >> > which >> > violates the design of RBAC, which uses "name" as the identity >> > of the >> > client. We shouldn't allow an user to change >> > system-id/chassis-id directly. >> >> How can we enforce this from within OVN? >> > My suggestion is don't enforce this from OVN. We can simply document > this clearly. It would be good to prevent misuse as much as we can, but > there are so many ways to break OVN if users don't follow the correct > procedure, and we can't prevent all of them. Changing system-id is > something I think very abnormal and if someone does it purposely they > must know what they are doing. The logic is becoming more and more > complex in OVN just for this misuse case, so I am thinking about > simplifying this from the requirement point of view, unless there is a > decent way of solving the problem. > >> > I think the proper way is firstly destroying the old chassis and >> > then >> > creating a new chassis, which would also avoid the complex > logic in >> > ovn-controller regarding the "stale record" handling. What do >> > you think? >> > >> > >> >> I'm afraid this would be a behavior change that might introduce bugs in >> CMSs that allow changing system-ids (without RBAC enabled) without >> explicitly deleting the "stale" Chassis *and* Chassis_private record. >> >> Also, forcing CMS to do this housekeeping will probably add complex code >> in the CMS. For example, if the "old" Chassis record is deleted before >> ovn-controller receives the update about the system-id change in OVS DB >> then ovn-controller will just recreate it. >> > Is changing system-id a common operation? I couldn't think about a real > use case when we need to change system-id. Also, it doesn't make much > sense to me if the solution supports changing it only when RBAC is > disabled. In such case we will at least document more clearly what's > supported in RBAC case and non-RBAC case, which could be more confusing > than just don't support changing system-id. For corner cases, would it > be sufficient to just document the proper steps of destroying and > recreating a chassis whenever necessary? > Right, it shouldn't happen very often but it may happen. >> > Hi Han, >> > >> > Yes. Dumitru submitted the patch to fix this issue in master and I >> > applied it just now. >> > Can you please test again with the latest master. >> > >> > > https://github.com/ovn-org/ovn/commit/849c5a492a26337789db1a2dc97570989ef08c34 >> > >> > We see the same issue with branch-20.06 too. >> > >> > >> > >> > Hi Dumitru, >> > >> > Can you please make a 2 patch series for branch-20.03 which includes >> > this patch and the other one >> > which fixes the RBAC issue. >> > >> > Thanks >> > Numan >> > >> >> Like you said, we have two options: >> >> 1. Leave it up to the CMS to cleanup Chassis/Chassis_private when >> system-id changes and completely ignore in ovn-controller any stale >> chassis records that might exist in the SB DB. In this case we'd just >> let transactions fail due to constraint violations if the CMS cleanup of >> the Chassis/Chassis_private records doesn't happen or is incorrect. >> >> 2. Make it more user friendly for the case when RBAC is _not used_ and >> try to cleanup after ourselves in ovn-controller and reuse the stale >> records. This is already happening with the code on the master branch >> and if we cherry pick [0] as Numan suggested and add it to this series >> it would be the case for branch-20.03 too. [0] would have to go to >> branch-20.06 as well. >> >> I would say, at least for now, the least intrusive change, with the >> least chance of breaking CMSs, is #2 above but I'll wait for >> confirmation before sending a new version of this patch. > > Unfortunately [0] is still broken. The steps to reproduce: > $ make sandbox > $ ovs-vsctl set open . external_ids:system-id=new-chassis > $ ovs-vsctl set open . external_ids:system-id=new-chassis-2 > > The problem is, the "name" column is the RBAC identity of a > ovn-controller client. RBAC defines what operations are allowed for an > authenticated client, but the identity itself should never be changed. > Otherwise, there is no point for RBAC. > OK but you're changing the system-id with RBAC enabled which we decided already that is not working and can't be fixed. In any case, you're right, this is getting too complicated and we clearly can't handle all cases without operator help so I'll send a new patch (on master first) to: - document that changing system-id is something that can't be handled completely in ovn-controller on its own and will need operator help to cleanup stale records in SB. - document how changing the system-id can be performed with RBAC enabled (I think we need to change the SSL certificates too to use the new system-id). - remove the parts of the code in chassis.c that reuse stale Chassis/Chassis_private records. What do you think? Thanks, Dumitru
On Tue, Sep 8, 2020 at 11:33 AM Dumitru Ceara <dceara@redhat.com> wrote: > > On 9/8/20 7:43 PM, Han Zhou wrote: > > > > > > On Tue, Sep 8, 2020 at 2:24 AM Dumitru Ceara <dceara@redhat.com > > <mailto:dceara@redhat.com>> wrote: > >> > >> On 9/8/20 9:47 AM, Numan Siddique wrote: > >> > > >> > > >> > On Tue, Sep 8, 2020 at 1:16 PM Numan Siddique <numans@ovn.org > > <mailto:numans@ovn.org> > >> > <mailto:numans@ovn.org <mailto:numans@ovn.org>>> wrote: > >> > > >> > > >> > > >> > On Tue, Sep 8, 2020 at 12:58 PM Han Zhou <zhouhan@gmail.com > > <mailto:zhouhan@gmail.com> > >> > <mailto:zhouhan@gmail.com <mailto:zhouhan@gmail.com>>> wrote: > >> > > >> > On Mon, Sep 7, 2020 at 3:11 AM Dumitru Ceara > > <dceara@redhat.com <mailto:dceara@redhat.com> > >> > <mailto:dceara@redhat.com <mailto:dceara@redhat.com>>> wrote: > >> > > > >> > > ovn-controller always stores the last configured > >> > system-id/chassis-id in > >> > > memory regardless if the connection to the SB is up or down. > >> > This is OK > >> > > as long as the change can be committed successfully when > > the SB DB > >> > > connection comes back up. > >> > > > >> > > Without this change, if the chassis-id changes while the SB > >> > connection is > >> > > down, ovn-controller will fail to create the new record but > >> > nevertheless > >> > > update its in-memory chassis-id. When the SB connection is > >> > restored > >> > > ovn-controller tries to find the record corresponding to the > >> > chassis-id > >> > > it stored in memory. This fails causing ovn-controller to try > >> > to insert > >> > > a new record. But at this point a constraint violation is hit > >> > in the SB > >> > > because the Encap records of the "stale" chassis still exist > >> > in the DB, > >> > > along with the old chassis record. > >> > > > >> > > This commit changes the way we search for a "stale" chassis > >> > record in the > >> > > SB to cover the above mentioned case. Also, in such cases > >> > there's no need > >> > > to recreate the Encaps, it's sufficient to update the > >> > chassis_name field. > >> > > > >> > > Fixes: 5344f24ecb1a ("ovn-controller: Refactor chassis.c to > >> > abstract the > >> > string parsing") > >> > > Signed-off-by: Dumitru Ceara <dceara@redhat.com > > <mailto:dceara@redhat.com> > >> > <mailto:dceara@redhat.com <mailto:dceara@redhat.com>>> > >> > > > >> > > (cherry-picked from master commit > >> > 94a32fca2d2b825fece0ef5b1873459bd9857dd3) > >> > > >> > Hi Dumitru, > >> > > >> > >> Hi Han, Numan, > >> > >> > Sorry that I missed the original review, but this patch seems > >> > causing > >> > problem in master already. When RBAC is enabled, it will result > >> > in below > >> > error: > >> > 2020-09-08T07:20:31.991Z|00014|binding|INFO|Claiming lport lsp1 > >> > for this > >> > chassis. > >> > 2020-09-08T07:20:31.992Z|00015|ovsdb_idl|WARN|transaction error: > >> > {"details":"RBAC rules for client \"chassis-1\" role > >> > \"ovn-controller\" > >> > prohibit modification of table \"Encap\".","error":"permission > >> > error"} > >> > > >> > You can simply reproduce by: make sandbox, and then run: > > ovn-nbctl > >> > --wait=hv sync, which will hang forever because of the problem. > >> > This patch seems to be updating the "name" field of Encap table, > >> > which > >> > violates the design of RBAC, which uses "name" as the identity > >> > of the > >> > client. We shouldn't allow an user to change > >> > system-id/chassis-id directly. > >> > >> How can we enforce this from within OVN? > >> > > My suggestion is don't enforce this from OVN. We can simply document > > this clearly. It would be good to prevent misuse as much as we can, but > > there are so many ways to break OVN if users don't follow the correct > > procedure, and we can't prevent all of them. Changing system-id is > > something I think very abnormal and if someone does it purposely they > > must know what they are doing. The logic is becoming more and more > > complex in OVN just for this misuse case, so I am thinking about > > simplifying this from the requirement point of view, unless there is a > > decent way of solving the problem. > > > >> > I think the proper way is firstly destroying the old chassis and > >> > then > >> > creating a new chassis, which would also avoid the complex > > logic in > >> > ovn-controller regarding the "stale record" handling. What do > >> > you think? > >> > > >> > > >> > >> I'm afraid this would be a behavior change that might introduce bugs in > >> CMSs that allow changing system-ids (without RBAC enabled) without > >> explicitly deleting the "stale" Chassis *and* Chassis_private record. > >> > >> Also, forcing CMS to do this housekeeping will probably add complex code > >> in the CMS. For example, if the "old" Chassis record is deleted before > >> ovn-controller receives the update about the system-id change in OVS DB > >> then ovn-controller will just recreate it. > >> > > Is changing system-id a common operation? I couldn't think about a real > > use case when we need to change system-id. Also, it doesn't make much > > sense to me if the solution supports changing it only when RBAC is > > disabled. In such case we will at least document more clearly what's > > supported in RBAC case and non-RBAC case, which could be more confusing > > than just don't support changing system-id. For corner cases, would it > > be sufficient to just document the proper steps of destroying and > > recreating a chassis whenever necessary? > > > > Right, it shouldn't happen very often but it may happen. > > >> > Hi Han, > >> > > >> > Yes. Dumitru submitted the patch to fix this issue in master and I > >> > applied it just now. > >> > Can you please test again with the latest master. > >> > > >> > > > https://github.com/ovn-org/ovn/commit/849c5a492a26337789db1a2dc97570989ef08c34 > >> > > >> > We see the same issue with branch-20.06 too. > >> > > >> > > >> > > >> > Hi Dumitru, > >> > > >> > Can you please make a 2 patch series for branch-20.03 which includes > >> > this patch and the other one > >> > which fixes the RBAC issue. > >> > > >> > Thanks > >> > Numan > >> > > >> > >> Like you said, we have two options: > >> > >> 1. Leave it up to the CMS to cleanup Chassis/Chassis_private when > >> system-id changes and completely ignore in ovn-controller any stale > >> chassis records that might exist in the SB DB. In this case we'd just > >> let transactions fail due to constraint violations if the CMS cleanup of > >> the Chassis/Chassis_private records doesn't happen or is incorrect. > >> > >> 2. Make it more user friendly for the case when RBAC is _not used_ and > >> try to cleanup after ourselves in ovn-controller and reuse the stale > >> records. This is already happening with the code on the master branch > >> and if we cherry pick [0] as Numan suggested and add it to this series > >> it would be the case for branch-20.03 too. [0] would have to go to > >> branch-20.06 as well. > >> > >> I would say, at least for now, the least intrusive change, with the > >> least chance of breaking CMSs, is #2 above but I'll wait for > >> confirmation before sending a new version of this patch. > > > > Unfortunately [0] is still broken. The steps to reproduce: > > $ make sandbox > > $ ovs-vsctl set open . external_ids:system-id=new-chassis > > $ ovs-vsctl set open . external_ids:system-id=new-chassis-2 > > > > The problem is, the "name" column is the RBAC identity of a > > ovn-controller client. RBAC defines what operations are allowed for an > > authenticated client, but the identity itself should never be changed. > > Otherwise, there is no point for RBAC. > > > > OK but you're changing the system-id with RBAC enabled which we decided > already that is not working and can't be fixed. > > In any case, you're right, this is getting too complicated and we > clearly can't handle all cases without operator help so I'll send a new > patch (on master first) to: > - document that changing system-id is something that can't be handled > completely in ovn-controller on its own and will need operator help to > cleanup stale records in SB. Thanks Dumitru. The plan sounds good to me. Let's be clear what will be supported in OVN and what will not. IMHO, I'd say we don't support directly changing system-id. If it needs to be changed, follow the procedure of deleting and recreating a new chassis, which can be described in detail for RBAC and non-RBAC respectively. Is this the same as what you are about to address? > - document how changing the system-id can be performed with RBAC enabled > (I think we need to change the SSL certificates too to use the new > system-id). > - remove the parts of the code in chassis.c that reuse stale > Chassis/Chassis_private records. > > What do you think? > > Thanks, > Dumitru > >
On 9/8/20 8:45 PM, Han Zhou wrote: > > > > On Tue, Sep 8, 2020 at 11:33 AM Dumitru Ceara <dceara@redhat.com > <mailto:dceara@redhat.com>> wrote: >> >> On 9/8/20 7:43 PM, Han Zhou wrote: >> > >> > >> > On Tue, Sep 8, 2020 at 2:24 AM Dumitru Ceara <dceara@redhat.com > <mailto:dceara@redhat.com> >> > <mailto:dceara@redhat.com <mailto:dceara@redhat.com>>> wrote: >> >> >> >> On 9/8/20 9:47 AM, Numan Siddique wrote: >> >> > >> >> > >> >> > On Tue, Sep 8, 2020 at 1:16 PM Numan Siddique <numans@ovn.org > <mailto:numans@ovn.org> >> > <mailto:numans@ovn.org <mailto:numans@ovn.org>> >> >> > <mailto:numans@ovn.org <mailto:numans@ovn.org> > <mailto:numans@ovn.org <mailto:numans@ovn.org>>>> wrote: >> >> > >> >> > >> >> > >> >> > On Tue, Sep 8, 2020 at 12:58 PM Han Zhou <zhouhan@gmail.com > <mailto:zhouhan@gmail.com> >> > <mailto:zhouhan@gmail.com <mailto:zhouhan@gmail.com>> >> >> > <mailto:zhouhan@gmail.com <mailto:zhouhan@gmail.com> > <mailto:zhouhan@gmail.com <mailto:zhouhan@gmail.com>>>> wrote: >> >> > >> >> > On Mon, Sep 7, 2020 at 3:11 AM Dumitru Ceara >> > <dceara@redhat.com <mailto:dceara@redhat.com> > <mailto:dceara@redhat.com <mailto:dceara@redhat.com>> >> >> > <mailto:dceara@redhat.com <mailto:dceara@redhat.com> > <mailto:dceara@redhat.com <mailto:dceara@redhat.com>>>> wrote: >> >> > > >> >> > > ovn-controller always stores the last configured >> >> > system-id/chassis-id in >> >> > > memory regardless if the connection to the SB is up or > down. >> >> > This is OK >> >> > > as long as the change can be committed successfully when >> > the SB DB >> >> > > connection comes back up. >> >> > > >> >> > > Without this change, if the chassis-id changes while the SB >> >> > connection is >> >> > > down, ovn-controller will fail to create the new record but >> >> > nevertheless >> >> > > update its in-memory chassis-id. When the SB connection is >> >> > restored >> >> > > ovn-controller tries to find the record corresponding > to the >> >> > chassis-id >> >> > > it stored in memory. This fails causing ovn-controller > to try >> >> > to insert >> >> > > a new record. But at this point a constraint violation > is hit >> >> > in the SB >> >> > > because the Encap records of the "stale" chassis still > exist >> >> > in the DB, >> >> > > along with the old chassis record. >> >> > > >> >> > > This commit changes the way we search for a "stale" chassis >> >> > record in the >> >> > > SB to cover the above mentioned case. Also, in such cases >> >> > there's no need >> >> > > to recreate the Encaps, it's sufficient to update the >> >> > chassis_name field. >> >> > > >> >> > > Fixes: 5344f24ecb1a ("ovn-controller: Refactor chassis.c to >> >> > abstract the >> >> > string parsing") >> >> > > Signed-off-by: Dumitru Ceara <dceara@redhat.com > <mailto:dceara@redhat.com> >> > <mailto:dceara@redhat.com <mailto:dceara@redhat.com>> >> >> > <mailto:dceara@redhat.com <mailto:dceara@redhat.com> > <mailto:dceara@redhat.com <mailto:dceara@redhat.com>>>> >> >> > > >> >> > > (cherry-picked from master commit >> >> > 94a32fca2d2b825fece0ef5b1873459bd9857dd3) >> >> > >> >> > Hi Dumitru, >> >> > >> >> >> >> Hi Han, Numan, >> >> >> >> > Sorry that I missed the original review, but this patch seems >> >> > causing >> >> > problem in master already. When RBAC is enabled, it will > result >> >> > in below >> >> > error: >> >> > 2020-09-08T07:20:31.991Z|00014|binding|INFO|Claiming > lport lsp1 >> >> > for this >> >> > chassis. >> >> > 2020-09-08T07:20:31.992Z|00015|ovsdb_idl|WARN|transaction > error: >> >> > {"details":"RBAC rules for client \"chassis-1\" role >> >> > \"ovn-controller\" >> >> > prohibit modification of table > \"Encap\".","error":"permission >> >> > error"} >> >> > >> >> > You can simply reproduce by: make sandbox, and then run: >> > ovn-nbctl >> >> > --wait=hv sync, which will hang forever because of the > problem. >> >> > This patch seems to be updating the "name" field of Encap > table, >> >> > which >> >> > violates the design of RBAC, which uses "name" as the > identity >> >> > of the >> >> > client. We shouldn't allow an user to change >> >> > system-id/chassis-id directly. >> >> >> >> How can we enforce this from within OVN? >> >> >> > My suggestion is don't enforce this from OVN. We can simply document >> > this clearly. It would be good to prevent misuse as much as we can, but >> > there are so many ways to break OVN if users don't follow the correct >> > procedure, and we can't prevent all of them. Changing system-id is >> > something I think very abnormal and if someone does it purposely they >> > must know what they are doing. The logic is becoming more and more >> > complex in OVN just for this misuse case, so I am thinking about >> > simplifying this from the requirement point of view, unless there is a >> > decent way of solving the problem. >> > >> >> > I think the proper way is firstly destroying the old > chassis and >> >> > then >> >> > creating a new chassis, which would also avoid the complex >> > logic in >> >> > ovn-controller regarding the "stale record" handling. What do >> >> > you think? >> >> > >> >> > >> >> >> >> I'm afraid this would be a behavior change that might introduce bugs in >> >> CMSs that allow changing system-ids (without RBAC enabled) without >> >> explicitly deleting the "stale" Chassis *and* Chassis_private record. >> >> >> >> Also, forcing CMS to do this housekeeping will probably add complex > code >> >> in the CMS. For example, if the "old" Chassis record is deleted before >> >> ovn-controller receives the update about the system-id change in OVS DB >> >> then ovn-controller will just recreate it. >> >> >> > Is changing system-id a common operation? I couldn't think about a real >> > use case when we need to change system-id. Also, it doesn't make much >> > sense to me if the solution supports changing it only when RBAC is >> > disabled. In such case we will at least document more clearly what's >> > supported in RBAC case and non-RBAC case, which could be more confusing >> > than just don't support changing system-id. For corner cases, would it >> > be sufficient to just document the proper steps of destroying and >> > recreating a chassis whenever necessary? >> > >> >> Right, it shouldn't happen very often but it may happen. >> >> >> > Hi Han, >> >> > >> >> > Yes. Dumitru submitted the patch to fix this issue in master > and I >> >> > applied it just now. >> >> > Can you please test again with the latest master. >> >> > >> >> > >> > > https://github.com/ovn-org/ovn/commit/849c5a492a26337789db1a2dc97570989ef08c34 >> >> > >> >> > We see the same issue with branch-20.06 too. >> >> > >> >> > >> >> > >> >> > Hi Dumitru, >> >> > >> >> > Can you please make a 2 patch series for branch-20.03 which includes >> >> > this patch and the other one >> >> > which fixes the RBAC issue. >> >> > >> >> > Thanks >> >> > Numan >> >> > >> >> >> >> Like you said, we have two options: >> >> >> >> 1. Leave it up to the CMS to cleanup Chassis/Chassis_private when >> >> system-id changes and completely ignore in ovn-controller any stale >> >> chassis records that might exist in the SB DB. In this case we'd just >> >> let transactions fail due to constraint violations if the CMS > cleanup of >> >> the Chassis/Chassis_private records doesn't happen or is incorrect. >> >> >> >> 2. Make it more user friendly for the case when RBAC is _not used_ and >> >> try to cleanup after ourselves in ovn-controller and reuse the stale >> >> records. This is already happening with the code on the master branch >> >> and if we cherry pick [0] as Numan suggested and add it to this series >> >> it would be the case for branch-20.03 too. [0] would have to go to >> >> branch-20.06 as well. >> >> >> >> I would say, at least for now, the least intrusive change, with the >> >> least chance of breaking CMSs, is #2 above but I'll wait for >> >> confirmation before sending a new version of this patch. >> > >> > Unfortunately [0] is still broken. The steps to reproduce: >> > $ make sandbox >> > $ ovs-vsctl set open . external_ids:system-id=new-chassis >> > $ ovs-vsctl set open . external_ids:system-id=new-chassis-2 >> > >> > The problem is, the "name" column is the RBAC identity of a >> > ovn-controller client. RBAC defines what operations are allowed for an >> > authenticated client, but the identity itself should never be changed. >> > Otherwise, there is no point for RBAC. >> > >> >> OK but you're changing the system-id with RBAC enabled which we decided >> already that is not working and can't be fixed. >> >> In any case, you're right, this is getting too complicated and we >> clearly can't handle all cases without operator help so I'll send a new >> patch (on master first) to: >> - document that changing system-id is something that can't be handled >> completely in ovn-controller on its own and will need operator help to >> cleanup stale records in SB. > > Thanks Dumitru. The plan sounds good to me. Let's be clear what will be > supported in OVN and what will not. IMHO, I'd say we don't support > directly changing system-id. If it needs to be changed, follow the > procedure of deleting and recreating a new chassis, which can be > described in detail for RBAC and non-RBAC respectively. Is this the same > as what you are about to address? > Exactly. Thanks, Dumitru >> - document how changing the system-id can be performed with RBAC enabled >> (I think we need to change the SSL certificates too to use the new >> system-id). >> - remove the parts of the code in chassis.c that reuse stale >> Chassis/Chassis_private records. >> >> What do you think? >> >> Thanks, >> Dumitru >> >>
diff --git a/controller/chassis.c b/controller/chassis.c index 522893e..d525463 100644 --- a/controller/chassis.c +++ b/controller/chassis.c @@ -380,10 +380,7 @@ chassis_tunnels_changed(const struct sset *encap_type_set, { size_t encap_type_count = 0; - for (int i = 0; i < chassis_rec->n_encaps; i++) { - if (strcmp(chassis_rec->name, chassis_rec->encaps[i]->chassis_name)) { - return true; - } + for (size_t i = 0; i < chassis_rec->n_encaps; i++) { if (!sset_contains(encap_type_set, chassis_rec->encaps[i]->type)) { return true; @@ -456,6 +453,19 @@ chassis_build_encaps(struct ovsdb_idl_txn *ovnsb_idl_txn, } /* + * Updates encaps for a given chassis. This can happen when the chassis + * name has changed. Also, the only thing we support updating is the + * chassis_name. For other changes the encaps will be recreated. + */ +static void +chassis_update_encaps(const struct sbrec_chassis *chassis) +{ + for (size_t i = 0; i < chassis->n_encaps; i++) { + sbrec_encap_set_chassis_name(chassis->encaps[i], chassis->name); + } +} + +/* * Returns a pointer to a chassis record from 'chassis_table' that * matches at least one tunnel config. */ @@ -486,9 +496,10 @@ chassis_get_stale_record(const struct sbrec_chassis_table *chassis_table, /* If this is a chassis config update after we initialized the record once * then we should always be able to find it with the ID we saved in * chassis_state. - * Otherwise (i.e., first time we create the record) then we check if there's - * a stale record from a previous controller run that didn't end gracefully - * and reuse it. If not then we create a new record. + * Otherwise (i.e., first time we create the record or if the system-id + * changed) then we check if there's a stale record from a previous + * controller run that didn't end gracefully and reuse it. If not then we + * create a new record. */ static const struct sbrec_chassis * chassis_get_record(struct ovsdb_idl_txn *ovnsb_idl_txn, @@ -497,29 +508,31 @@ chassis_get_record(struct ovsdb_idl_txn *ovnsb_idl_txn, const struct ovs_chassis_cfg *ovs_cfg, const char *chassis_id) { - const struct sbrec_chassis *chassis_rec; + const struct sbrec_chassis *chassis = NULL; if (chassis_info_id_inited(&chassis_state)) { - chassis_rec = chassis_lookup_by_name(sbrec_chassis_by_name, - chassis_info_id(&chassis_state)); - if (!chassis_rec) { - VLOG_DBG("Could not find Chassis, will create it" - ": stored (%s) ovs (%s)", + chassis = chassis_lookup_by_name(sbrec_chassis_by_name, + chassis_info_id(&chassis_state)); + if (!chassis) { + VLOG_DBG("Could not find Chassis, will check if the id changed: " + "stored (%s) ovs (%s)", chassis_info_id(&chassis_state), chassis_id); - if (ovnsb_idl_txn) { - /* Recreate the chassis record. */ - chassis_rec = sbrec_chassis_insert(ovnsb_idl_txn); - } } - } else { - chassis_rec = - chassis_get_stale_record(chassis_table, ovs_cfg, chassis_id); + } - if (!chassis_rec && ovnsb_idl_txn) { - chassis_rec = sbrec_chassis_insert(ovnsb_idl_txn); + if (!chassis) { + chassis = chassis_get_stale_record(chassis_table, ovs_cfg, chassis_id); + } + + if (!chassis) { + /* Recreate the chassis record. */ + VLOG_DBG("Could not find Chassis, will create it: %s", chassis_id); + if (ovnsb_idl_txn) { + return sbrec_chassis_insert(ovnsb_idl_txn); } } - return chassis_rec; + + return chassis; } /* Update a Chassis record based on the config in the ovs config. */ @@ -567,6 +580,7 @@ chassis_update(const struct sbrec_chassis *chassis_rec, &ovs_cfg->encap_ip_set, ovs_cfg->encap_csum, chassis_rec); if (!tunnels_changed) { + chassis_update_encaps(chassis_rec); return; } diff --git a/tests/ovn-controller.at b/tests/ovn-controller.at index 63b2581..1c7aa58 100644 --- a/tests/ovn-controller.at +++ b/tests/ovn-controller.at @@ -196,6 +196,23 @@ OVS_WAIT_UNTIL([ test "${sysid}" = "${chassis_id}" ]) +# Simulate system-id changing while ovn-controller is disconnected from the +# SB. +valid_remote=$(ovs-vsctl get Open_vSwitch . external_ids:ovn-remote) +invalid_remote=tcp:0.0.0.0:4242 +ovs-vsctl set Open_vSwitch . external_ids:ovn-remote=${invalid_remote} +expected_state="not connected" +OVS_WAIT_UNTIL([ + test "${expected_state}" = "$(ovn-appctl -t ovn-controller connection-status)" +]) +sysid=${sysid}-bar +ovs-vsctl set Open_vSwitch . external-ids:system-id="${sysid}" +ovs-vsctl set Open_vSwitch . external_ids:ovn-remote=${valid_remote} +OVS_WAIT_UNTIL([ + chassis_id=$(ovn-sbctl get Chassis "${sysid}" name) + test "${sysid}" = "${chassis_id}" +]) + # Gracefully terminate daemons OVN_CLEANUP_SBOX([hv]) OVN_CLEANUP_VSWITCH([main])
ovn-controller always stores the last configured system-id/chassis-id in memory regardless if the connection to the SB is up or down. This is OK as long as the change can be committed successfully when the SB DB connection comes back up. Without this change, if the chassis-id changes while the SB connection is down, ovn-controller will fail to create the new record but nevertheless update its in-memory chassis-id. When the SB connection is restored ovn-controller tries to find the record corresponding to the chassis-id it stored in memory. This fails causing ovn-controller to try to insert a new record. But at this point a constraint violation is hit in the SB because the Encap records of the "stale" chassis still exist in the DB, along with the old chassis record. This commit changes the way we search for a "stale" chassis record in the SB to cover the above mentioned case. Also, in such cases there's no need to recreate the Encaps, it's sufficient to update the chassis_name field. Fixes: 5344f24ecb1a ("ovn-controller: Refactor chassis.c to abstract the string parsing") Signed-off-by: Dumitru Ceara <dceara@redhat.com> (cherry-picked from master commit 94a32fca2d2b825fece0ef5b1873459bd9857dd3) --- controller/chassis.c | 60 ++++++++++++++++++++++++++++++------------------- tests/ovn-controller.at | 17 ++++++++++++++ 2 files changed, 54 insertions(+), 23 deletions(-)