Message ID | 20240326172717.1454071-2-i.maximets@ovn.org |
---|---|
State | Accepted |
Commit | aab379ec21c95971fe6a05fb94793d1744a864ce |
Series | ovsdb: raft: Fixes for cluster joining state. |
Context | Check | Description |
---|---|---|
ovsrobot/apply-robot | success | apply and check: success |
ovsrobot/github-robot-_Build_and_Test | success | github build: passed |
ovsrobot/intel-ovs-compilation | fail | test: fail |
On Tue, Mar 26, 2024 at 10:26 AM Ilya Maximets <i.maximets@ovn.org> wrote:
>
> The current implementation of leadership transfer just shoots the
> leadership in the general direction of the first stable server in the
> configuration.  It doesn't check whether the server was active recently
> or even that the connection is established.  This may result in sending
> leadership to a disconnected or otherwise unavailable server.
[...]
> +    if (n && i == n && threshold) {
> +        if (threshold > raft->commit_index) {
> +            /* Failed to transfer to servers with the highest 'match_index'.
> +             * Try other servers that are not behind the majority. */
> +            threshold = raft->commit_index;
> +        } else {
> +            /* Try any other server.  It is safe, because they either have all
> +             * the append requests queued up for them before the leadership
> +             * transfer message or their connection is broken and we will not
> +             * transfer anyway. */
> +            threshold = 0;
> +        }
> +        goto retry;

Thanks Ilya.  It seems the retry could try the earlier failed servers (e.g.
the ones for which raft_send_to_conn() returned false) one or two more times,
but it should be fine because the number of servers is very small anyway.
So this looks good to me.

Acked-by: Han Zhou <hzhou@ovn.org>
On Tue, Mar 26, 2024 at 06:27:09PM +0100, Ilya Maximets wrote:
> The current implementation of leadership transfer just shoots the
> leadership in the general direction of the first stable server in the
> configuration.  It doesn't check whether the server was active recently
> or even that the connection is established.  This may result in sending
> leadership to a disconnected or otherwise unavailable server.
[...]

Thank you, looks good to me.

Acked-by: Felix Huettner <felix.huettner@mail.schwarz>
On 3/26/24 21:10, Han Zhou wrote:
> On Tue, Mar 26, 2024 at 10:26 AM Ilya Maximets <i.maximets@ovn.org> wrote:
>>
>> The current implementation of leadership transfer just shoots the
>> leadership in the general direction of the first stable server in the
>> configuration.
[...]
>> +        goto retry;
>
> Thanks Ilya.  It seems the retry could try the earlier failed servers (e.g.
> the ones for which raft_send_to_conn() returned false) one or two more
> times, but it should be fine because the number of servers is very small
> anyway.  So this looks good to me.

Yeah, most of the time this array will have just 2 elements and commit_index
will be equal to the highest match_index, so we will likely only retry once.

> Acked-by: Han Zhou <hzhou@ovn.org>
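For illustration, below is a minimal, self-contained sketch of the threshold fallback being discussed. It is not OVS code: the server names, match indexes, commit index, and the 'sendable' flag (a stand-in for raft_send_to_conn() succeeding) are all invented for the example. With these values the first pass fails on the most up-to-date server, the threshold drops to the commit index, and the retry (which scans the failed server again, as noted above) lands on the second server.

```c
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

/* Toy model of the threshold fallback in raft_transfer_leadership().
 * Not OVS code: names, indexes, and 'sendable' are invented. */
struct server {
    const char *name;
    uint64_t match_index;
    bool sendable;          /* stand-in for raft_send_to_conn() succeeding */
};

int main(void)
{
    struct server servers[] = {
        { "s2", 105, false },  /* most up to date, but the send fails */
        { "s3", 100, true  },  /* caught up to the commit index */
    };
    size_t n = sizeof servers / sizeof *servers;
    uint64_t commit_index = 100;

    /* First pass targets the highest match_index among the candidates. */
    uint64_t threshold = 0;
    for (size_t i = 0; i < n; i++) {
        if (servers[i].match_index > threshold) {
            threshold = servers[i].match_index;
        }
    }

retry:
    for (size_t i = 0; i < n; i++) {
        struct server *s = &servers[i];
        if (s->match_index >= threshold && s->sendable) {
            printf("transferring leadership to %s\n", s->name);
            return 0;
        }
    }
    if (threshold) {
        /* Same two-step fallback as the patch: first drop to the commit
         * index (servers not behind the majority), then to zero (any
         * connected server).  Already-tried servers are scanned again,
         * which is the point raised in the review. */
        threshold = (threshold > commit_index) ? commit_index : 0;
        goto retry;
    }
    printf("no candidate found; wait for the election timer\n");
    return 0;
}
```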
```diff
diff --git a/ovsdb/raft.c b/ovsdb/raft.c
index f463afcb3..b171da345 100644
--- a/ovsdb/raft.c
+++ b/ovsdb/raft.c
@@ -1261,10 +1261,30 @@ raft_transfer_leadership(struct raft *raft, const char *reason)
         return;
     }
 
-    struct raft_server *s;
+    struct raft_server **servers, *s;
+    uint64_t threshold = 0;
+    size_t n = 0, start, i;
+
+    servers = xmalloc(hmap_count(&raft->servers) * sizeof *servers);
+
     HMAP_FOR_EACH (s, hmap_node, &raft->servers) {
-        if (!uuid_equals(&raft->sid, &s->sid)
-            && s->phase == RAFT_PHASE_STABLE) {
+        if (uuid_equals(&raft->sid, &s->sid)
+            || s->phase != RAFT_PHASE_STABLE) {
+            continue;
+        }
+        if (s->match_index > threshold) {
+            threshold = s->match_index;
+        }
+        servers[n++] = s;
+    }
+
+    start = n ? random_range(n) : 0;
+
+retry:
+    for (i = 0; i < n; i++) {
+        s = servers[(start + i) % n];
+
+        if (s->match_index >= threshold) {
             struct raft_conn *conn = raft_find_conn_by_sid(raft, &s->sid);
             if (!conn) {
                 continue;
@@ -1280,7 +1300,10 @@ raft_transfer_leadership(struct raft *raft, const char *reason)
                     .term = raft->term,
                 }
             };
-            raft_send_to_conn(raft, &rpc, conn);
+
+            if (!raft_send_to_conn(raft, &rpc, conn)) {
+                continue;
+            }
 
             raft_record_note(raft, "transfer leadership",
                              "transferring leadership to %s because %s",
@@ -1288,6 +1311,23 @@ raft_transfer_leadership(struct raft *raft, const char *reason)
             break;
         }
     }
+
+    if (n && i == n && threshold) {
+        if (threshold > raft->commit_index) {
+            /* Failed to transfer to servers with the highest 'match_index'.
+             * Try other servers that are not behind the majority. */
+            threshold = raft->commit_index;
+        } else {
+            /* Try any other server.  It is safe, because they either have all
+             * the append requests queued up for them before the leadership
+             * transfer message or their connection is broken and we will not
+             * transfer anyway. */
+            threshold = 0;
+        }
+        goto retry;
+    }
+
+    free(servers);
 }
 
 /* Send a RemoveServerRequest to the rest of the servers in the cluster.
```
The current implementation of leadership transfer just shoots the
leadership in the general direction of the first stable server in the
configuration.  It doesn't check whether the server was active recently
or even that the connection is established.  This may result in sending
leadership to a disconnected or otherwise unavailable server.

Such behavior should not cause log truncation or any other correctness
issues, because the destination server would either have all the append
requests queued up or its connection will be dropped by the leader.
In the worst case we will have a leaderless cluster until the next
election timer fires.  Other servers will notice the absence of the
leader and will trigger a new leader election normally.  However, the
potential wait for the election timer is not good, as real-world setups
may have high values configured.

Fix that by trying to transfer to servers that we know have applied the
most changes, i.e., have the highest 'match_index'.  Such servers
replied to the most recent append requests, so they have the highest
chance of being healthy.  Choose a random starting point in the list of
such servers so that we don't transfer to the same server every single
time.  This slightly improves load distribution and, most importantly,
increases the robustness of our test suite by making it cover more
cases.  Also check that the message was actually sent without immediate
failure.

If we fail to transfer to any server with the highest index, try to
transfer to any other server that is not behind the majority, and then
to any other server that is connected.  We did actually send them all
the updates (if the connection is open); they just didn't reply yet for
one reason or another.  That is still better than leaving the cluster
without a leader.

Note that there is always a chance that the transfer will fail, since
we're not waiting for it to be acknowledged (and must not wait).  In
this case, a normal election will be triggered after the election timer
fires.

Fixes: 1b1d2e6daa56 ("ovsdb: Introduce experimental support for clustered databases.")
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
---

CC: Felix Huettner <felix.huettner@mail.schwarz>

 ovsdb/raft.c | 48 ++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 44 insertions(+), 4 deletions(-)
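As a small illustration of the scan order described above: assuming random_range(n) returns a uniform value in [0, n), the loop over servers[(start + i) % n] visits every candidate exactly once, starting from a random offset. A standalone sketch with made-up values for n and start:

```c
#include <stdio.h>

int main(void)
{
    /* Sketch of the candidate scan order in the patch.  Here n = 3 and
     * start = 1 (in OVS, start would come from random_range(n)); the
     * loop visits indexes 1, 2, 0 -- each candidate exactly once. */
    size_t n = 3, start = 1;
    for (size_t i = 0; i < n; i++) {
        printf("visit servers[%zu]\n", (start + i) % n);
    }
    return 0;
}
```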