Message ID | 1441183880-26993-7-git-send-email-wency@cn.fujitsu.com |
---|---|
State | New |
Headers | show |
On 09/02/2015 02:51 AM, Wen Congyang wrote: > If the child is not ready, read/write/getlength/flush will > return -errno. It is not critical error, and can be ignored: > 1. read/write: > Just not report the error event. What happens if all the children report an error? Or is the threshold at play here? For example, if you have a threshold of 3/5, then I'm assuming that if up to two children return an errno, then it is okay to ignore; but if three or more return an errno, you haven't met threshold, so the I/O must fail. Are you ignoring ALL errors (including things like EACCES), or just EIO errors? > 2. getlength: > just ignore it. If all children's getlength return -errno, > and be ignored, return -EIO. > 3. flush: > Just ignore it. If all children's getlength return -errno, s/getlength/flush/ > and be ignored, return 0. Yuck - claiming success when all of the children fail feels dangerous. > > Usage: children.x.ignore-errors=true > > Signed-off-by: Wen Congyang <wency@cn.fujitsu.com> > Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com> > Signed-off-by: Gonglei <arei.gonglei@huawei.com> > Cc: Alberto Garcia <berto@igalia.com> > --- > block/quorum.c | 94 ++++++++++++++++++++++++++++++++++++++++++++++++---- Interface review only: > +++ b/qapi/block-core.json > @@ -1411,6 +1411,8 @@ > # @allow-write-backing-file: #optional whether the backing file is opened in > # read-write mode. It is only for backing file > # (Since 2.5 default: false) > +# @ignore-errors: #options whether the child's I/O error should be ignored. s/options/optional/ s/error/errors/ > +# it is only for quorum's child.(Since 2.5 default: false) Space after '.' in English sentences. The fact that you are documenting that this option can only be specified for quorum children makes me wonder if it belongs better as an option in BlockdevOptionsQuorum rather than BlockdevOptionsBase. Semantically, it sounds like you are trying to allow for a per-child decision of whether this particular child's errors matter to the overall quorum. So, if we have a 3/5 quorum, we can decide that for children A, B, C, and D, errors cannot be ignored, but for child E, errors are not a problem. As written, you are tying the semantics to each child BDS, and requiring special code to double-check that the property is only ever set if the BDS is used as the child of a quorum. Furthermore, if the property is set, you are then changing what the child does in response to various operations. What if you instead create a list property in the quorum parent? Maybe along the lines of: # @child-errors-okay: #optional an array of named-node children where errors will be ignored (Since 2.5, default empty) { 'struct': 'BlockdevOptionsQuorum', 'data': { '*blkverify': 'bool', 'children': [ 'BlockdevRef' ], 'vote-threshold': 'int', '*rewrite-corrupted': 'bool', '*read-pattern': 'QuorumReadPattern', '*child-errors-okay': ['str'] } } The above example of a 3/5 quorum, where only child E can ignore errors, would then be represented as: { "children": [ "A", "B", "C", "D", "E" ], 'vote-threshold':3, 'child-errors-okay': [ "E" ] } The code to ignore the errors is then done in the quorum itself (the BDS for E does not have to care about a special ignore-errors property, but just always returns the error as usual; and then the quorum is deciding how to handle the error), and you are not polluting the BDS state for something that is quorum-specific, because it is now the quorum itself that tracks the special casing. Finally, why can't hot-plug/unplug of quorum members work? If you are going to always ignore errors from a particular child, then why is that child even part of the quorum? Isn't a better design to just not add the child to the quorum until it is ready and won't be reporting errors?
On 09/03/2015 12:30 AM, Eric Blake wrote: > On 09/02/2015 02:51 AM, Wen Congyang wrote: >> If the child is not ready, read/write/getlength/flush will >> return -errno. It is not critical error, and can be ignored: >> 1. read/write: >> Just not report the error event. > > What happens if all the children report an error? Or is the threshold > at play here? Good question. What about this: if we have 5 children, only 4 children can ignore errors. > > For example, if you have a threshold of 3/5, then I'm assuming that if > up to two children return an errno, then it is okay to ignore; but if > three or more return an errno, you haven't met threshold, so the I/O > must fail. Do more check here, at least one child sucesses. > > Are you ignoring ALL errors (including things like EACCES), or just EIO > errors? Does bdrv_xxx() always return -errno on error? > > >> 2. getlength: >> just ignore it. If all children's getlength return -errno, >> and be ignored, return -EIO. >> 3. flush: >> Just ignore it. If all children's getlength return -errno, > > s/getlength/flush/ > >> and be ignored, return 0. > > Yuck - claiming success when all of the children fail feels dangerous. Yes. > >> >> Usage: children.x.ignore-errors=true >> >> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com> >> Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com> >> Signed-off-by: Gonglei <arei.gonglei@huawei.com> >> Cc: Alberto Garcia <berto@igalia.com> >> --- >> block/quorum.c | 94 ++++++++++++++++++++++++++++++++++++++++++++++++---- > > Interface review only: > >> +++ b/qapi/block-core.json >> @@ -1411,6 +1411,8 @@ >> # @allow-write-backing-file: #optional whether the backing file is opened in >> # read-write mode. It is only for backing file >> # (Since 2.5 default: false) >> +# @ignore-errors: #options whether the child's I/O error should be ignored. > > s/options/optional/ > s/error/errors/ > >> +# it is only for quorum's child.(Since 2.5 default: false) > > Space after '.' in English sentences. > > The fact that you are documenting that this option can only be specified > for quorum children makes me wonder if it belongs better as an option in > BlockdevOptionsQuorum rather than BlockdevOptionsBase. Hmm, how to define it? It is quorum's child's option, not quorum's option. > > Semantically, it sounds like you are trying to allow for a per-child > decision of whether this particular child's errors matter to the overall > quorum. So, if we have a 3/5 quorum, we can decide that for children A, > B, C, and D, errors cannot be ignored, but for child E, errors are not a > problem. > > As written, you are tying the semantics to each child BDS, and requiring > special code to double-check that the property is only ever set if the > BDS is used as the child of a quorum. Furthermore, if the property is > set, you are then changing what the child does in response to various > operations. > > What if you instead create a list property in the quorum parent? Maybe > along the lines of: > > # @child-errors-okay: #optional an array of named-node children where > errors will be ignored (Since 2.5, default empty) > > { 'struct': 'BlockdevOptionsQuorum', > 'data': { '*blkverify': 'bool', > 'children': [ 'BlockdevRef' ], > 'vote-threshold': 'int', > '*rewrite-corrupted': 'bool', > '*read-pattern': 'QuorumReadPattern', > '*child-errors-okay': ['str'] } } > > The above example of a 3/5 quorum, where only child E can ignore errors, > would then be represented as: > > { "children": [ "A", "B", "C", "D", "E" ], 'vote-threshold':3, > 'child-errors-okay': [ "E" ] } OK, I will try it. > > The code to ignore the errors is then done in the quorum itself (the BDS > for E does not have to care about a special ignore-errors property, but > just always returns the error as usual; and then the quorum is deciding > how to handle the error), and you are not polluting the BDS state for > something that is quorum-specific, because it is now the quorum itself > that tracks the special casing. > > Finally, why can't hot-plug/unplug of quorum members work? If you are > going to always ignore errors from a particular child, then why is that > child even part of the quorum? Isn't a better design to just not add > the child to the quorum until it is ready and won't be reporting errors? > Yes. In the early version, I don't use hot-plug/unplug of quorum members, so the quorum member may be not ready. Now I use hot-plug/unplug of quorum members, so the quorum's member is ready when it is hot added. In such case, the quorum member is not ready after failover. This error is expected, but I want this error can be ignored, otherwise, there may be too error events... Thanks Wen Congyang
* Eric Blake (eblake@redhat.com) wrote: > On 09/02/2015 02:51 AM, Wen Congyang wrote: > > If the child is not ready, read/write/getlength/flush will > > return -errno. It is not critical error, and can be ignored: > > 1. read/write: > > Just not report the error event. > > What happens if all the children report an error? Or is the threshold > at play here? I think it's interesting because in the COLO case the intention isn't really about a threshold (in the way you might use for RAID or mirroring), it's that one of the stores is local (and not expected to error) and one is somewhere over a network, so if it fails you don't want to stop the local VM working. However, if it fails we do need to know about it; if any write to the secondary stops then the fault-tolerance has failed (at least for that drive); so we should do *something* - I'm not sure what though. Dave > For example, if you have a threshold of 3/5, then I'm assuming that if > up to two children return an errno, then it is okay to ignore; but if > three or more return an errno, you haven't met threshold, so the I/O > must fail. > > Are you ignoring ALL errors (including things like EACCES), or just EIO > errors? > > > > 2. getlength: > > just ignore it. If all children's getlength return -errno, > > and be ignored, return -EIO. > > 3. flush: > > Just ignore it. If all children's getlength return -errno, > > s/getlength/flush/ > > > and be ignored, return 0. > > Yuck - claiming success when all of the children fail feels dangerous. > > > > > Usage: children.x.ignore-errors=true > > > > Signed-off-by: Wen Congyang <wency@cn.fujitsu.com> > > Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com> > > Signed-off-by: Gonglei <arei.gonglei@huawei.com> > > Cc: Alberto Garcia <berto@igalia.com> > > --- > > block/quorum.c | 94 ++++++++++++++++++++++++++++++++++++++++++++++++---- > > Interface review only: > > > +++ b/qapi/block-core.json > > @@ -1411,6 +1411,8 @@ > > # @allow-write-backing-file: #optional whether the backing file is opened in > > # read-write mode. It is only for backing file > > # (Since 2.5 default: false) > > +# @ignore-errors: #options whether the child's I/O error should be ignored. > > s/options/optional/ > s/error/errors/ > > > +# it is only for quorum's child.(Since 2.5 default: false) > > Space after '.' in English sentences. > > The fact that you are documenting that this option can only be specified > for quorum children makes me wonder if it belongs better as an option in > BlockdevOptionsQuorum rather than BlockdevOptionsBase. > > Semantically, it sounds like you are trying to allow for a per-child > decision of whether this particular child's errors matter to the overall > quorum. So, if we have a 3/5 quorum, we can decide that for children A, > B, C, and D, errors cannot be ignored, but for child E, errors are not a > problem. > > As written, you are tying the semantics to each child BDS, and requiring > special code to double-check that the property is only ever set if the > BDS is used as the child of a quorum. Furthermore, if the property is > set, you are then changing what the child does in response to various > operations. > > What if you instead create a list property in the quorum parent? Maybe > along the lines of: > > # @child-errors-okay: #optional an array of named-node children where > errors will be ignored (Since 2.5, default empty) > > { 'struct': 'BlockdevOptionsQuorum', > 'data': { '*blkverify': 'bool', > 'children': [ 'BlockdevRef' ], > 'vote-threshold': 'int', > '*rewrite-corrupted': 'bool', > '*read-pattern': 'QuorumReadPattern', > '*child-errors-okay': ['str'] } } > > The above example of a 3/5 quorum, where only child E can ignore errors, > would then be represented as: > > { "children": [ "A", "B", "C", "D", "E" ], 'vote-threshold':3, > 'child-errors-okay': [ "E" ] } > > The code to ignore the errors is then done in the quorum itself (the BDS > for E does not have to care about a special ignore-errors property, but > just always returns the error as usual; and then the quorum is deciding > how to handle the error), and you are not polluting the BDS state for > something that is quorum-specific, because it is now the quorum itself > that tracks the special casing. > > Finally, why can't hot-plug/unplug of quorum members work? If you are > going to always ignore errors from a particular child, then why is that > child even part of the quorum? Isn't a better design to just not add > the child to the quorum until it is ready and won't be reporting errors? > > -- > Eric Blake eblake redhat com +1-919-301-3266 > Libvirt virtualization library http://libvirt.org > -- Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
On 09/08/2015 12:56 AM, Dr. David Alan Gilbert wrote: > * Eric Blake (eblake@redhat.com) wrote: >> On 09/02/2015 02:51 AM, Wen Congyang wrote: >>> If the child is not ready, read/write/getlength/flush will >>> return -errno. It is not critical error, and can be ignored: >>> 1. read/write: >>> Just not report the error event. >> >> What happens if all the children report an error? Or is the threshold >> at play here? > > I think it's interesting because in the COLO case the intention isn't > really about a threshold (in the way you might use for RAID or mirroring), > it's that one of the stores is local (and not expected to error) and one > is somewhere over a network, so if it fails you don't want to stop > the local VM working. > > However, if it fails we do need to know about it; if any write to > the secondary stops then the fault-tolerance has failed (at least for > that drive); so we should do *something* - I'm not sure what though. The filter driver replication catches this error, and the COLO framework will get this error when doing checkpoint. Thanks Wen Congyang > > Dave > > >> For example, if you have a threshold of 3/5, then I'm assuming that if >> up to two children return an errno, then it is okay to ignore; but if >> three or more return an errno, you haven't met threshold, so the I/O >> must fail. >> >> Are you ignoring ALL errors (including things like EACCES), or just EIO >> errors? >> >> >>> 2. getlength: >>> just ignore it. If all children's getlength return -errno, >>> and be ignored, return -EIO. >>> 3. flush: >>> Just ignore it. If all children's getlength return -errno, >> >> s/getlength/flush/ >> >>> and be ignored, return 0. >> >> Yuck - claiming success when all of the children fail feels dangerous. >> >>> >>> Usage: children.x.ignore-errors=true >>> >>> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com> >>> Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com> >>> Signed-off-by: Gonglei <arei.gonglei@huawei.com> >>> Cc: Alberto Garcia <berto@igalia.com> >>> --- >>> block/quorum.c | 94 ++++++++++++++++++++++++++++++++++++++++++++++++---- >> >> Interface review only: >> >>> +++ b/qapi/block-core.json >>> @@ -1411,6 +1411,8 @@ >>> # @allow-write-backing-file: #optional whether the backing file is opened in >>> # read-write mode. It is only for backing file >>> # (Since 2.5 default: false) >>> +# @ignore-errors: #options whether the child's I/O error should be ignored. >> >> s/options/optional/ >> s/error/errors/ >> >>> +# it is only for quorum's child.(Since 2.5 default: false) >> >> Space after '.' in English sentences. >> >> The fact that you are documenting that this option can only be specified >> for quorum children makes me wonder if it belongs better as an option in >> BlockdevOptionsQuorum rather than BlockdevOptionsBase. >> >> Semantically, it sounds like you are trying to allow for a per-child >> decision of whether this particular child's errors matter to the overall >> quorum. So, if we have a 3/5 quorum, we can decide that for children A, >> B, C, and D, errors cannot be ignored, but for child E, errors are not a >> problem. >> >> As written, you are tying the semantics to each child BDS, and requiring >> special code to double-check that the property is only ever set if the >> BDS is used as the child of a quorum. Furthermore, if the property is >> set, you are then changing what the child does in response to various >> operations. >> >> What if you instead create a list property in the quorum parent? Maybe >> along the lines of: >> >> # @child-errors-okay: #optional an array of named-node children where >> errors will be ignored (Since 2.5, default empty) >> >> { 'struct': 'BlockdevOptionsQuorum', >> 'data': { '*blkverify': 'bool', >> 'children': [ 'BlockdevRef' ], >> 'vote-threshold': 'int', >> '*rewrite-corrupted': 'bool', >> '*read-pattern': 'QuorumReadPattern', >> '*child-errors-okay': ['str'] } } >> >> The above example of a 3/5 quorum, where only child E can ignore errors, >> would then be represented as: >> >> { "children": [ "A", "B", "C", "D", "E" ], 'vote-threshold':3, >> 'child-errors-okay': [ "E" ] } >> >> The code to ignore the errors is then done in the quorum itself (the BDS >> for E does not have to care about a special ignore-errors property, but >> just always returns the error as usual; and then the quorum is deciding >> how to handle the error), and you are not polluting the BDS state for >> something that is quorum-specific, because it is now the quorum itself >> that tracks the special casing. >> >> Finally, why can't hot-plug/unplug of quorum members work? If you are >> going to always ignore errors from a particular child, then why is that >> child even part of the quorum? Isn't a better design to just not add >> the child to the quorum until it is ready and won't be reporting errors? >> >> -- >> Eric Blake eblake redhat com +1-919-301-3266 >> Libvirt virtualization library http://libvirt.org >> > > > -- > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK > . >
diff --git a/block/quorum.c b/block/quorum.c index 8059861..f23dbb7 100644 --- a/block/quorum.c +++ b/block/quorum.c @@ -30,6 +30,7 @@ #define QUORUM_OPT_BLKVERIFY "blkverify" #define QUORUM_OPT_REWRITE "rewrite-corrupted" #define QUORUM_OPT_READ_PATTERN "read-pattern" +#define QUORUM_CHILDREN_OPT_IGNORE_ERRORS "ignore-errors" /* This union holds a vote hash value */ typedef union QuorumVoteValue { @@ -65,6 +66,7 @@ typedef struct QuorumVotes { /* the following structure holds the state of one quorum instance */ typedef struct BDRVQuorumState { BlockDriverState **bs; /* children BlockDriverStates */ + bool *ignore_errors; /* ignore children's error? */ int num_children; /* children count */ int max_children; /* The maximum children count, we need to reallocate * bs if num_children will larger than maximum. @@ -100,6 +102,7 @@ typedef struct QuorumChildRequest { uint8_t *buf; int ret; QuorumAIOCB *parent; + int index; } QuorumChildRequest; /* Quorum will use the following structure to track progress of each read/write @@ -212,6 +215,7 @@ static QuorumAIOCB *quorum_aio_get(BDRVQuorumState *s, acb->qcrs[i].buf = NULL; acb->qcrs[i].ret = 0; acb->qcrs[i].parent = acb; + acb->qcrs[i].index = i; } return acb; @@ -305,7 +309,7 @@ static void quorum_aio_cb(void *opaque, int ret) acb->count++; if (ret == 0) { acb->success_count++; - } else { + } else if (!s->ignore_errors[sacb->index]) { quorum_report_bad(acb, sacb->aiocb->bs->node_name, ret); } assert(acb->count <= s->num_children); @@ -716,19 +720,31 @@ static BlockAIOCB *quorum_aio_writev(BlockDriverState *bs, static int64_t quorum_getlength(BlockDriverState *bs) { BDRVQuorumState *s = bs->opaque; - int64_t result; + int64_t result = -EIO; int i; /* check that all file have the same length */ - result = bdrv_getlength(s->bs[0]); - if (result < 0) { - return result; - } - for (i = 1; i < s->num_children; i++) { + for (i = 0; i < s->num_children; i++) { int64_t value = bdrv_getlength(s->bs[i]); + if (value < 0) { return value; } + + if (value == 0 && s->ignore_errors[i]) { + /* + * If the child is not ready, it cannot return -errno, + * otherwise refresh_total_sectors() will fail when + * we open the child. + */ + continue; + } + + if (result == -EIO) { + result = value; + continue; + } + if (value != result) { return -EIO; } @@ -766,6 +782,9 @@ static coroutine_fn int quorum_co_flush(BlockDriverState *bs) for (i = 0; i < s->num_children; i++) { result = bdrv_co_flush(s->bs[i]); + if (result < 0 && s->ignore_errors[i]) { + result = 0; + } result_value.l = result; quorum_count_vote(&error_votes, &result_value, i); } @@ -840,6 +859,19 @@ static QemuOptsList quorum_runtime_opts = { }, }; +static QemuOptsList quorum_children_common_opts = { + .name = "quorum children", + .head = QTAILQ_HEAD_INITIALIZER(quorum_children_common_opts.head), + .desc = { + { + .name = QUORUM_CHILDREN_OPT_IGNORE_ERRORS, + .type = QEMU_OPT_BOOL, + .help = "ignore child I/O error", + }, + { /* end of list */ } + }, +}; + static int parse_read_pattern(const char *opt) { int i; @@ -858,6 +890,37 @@ static int parse_read_pattern(const char *opt) return -EINVAL; } +static int parse_children_options(BDRVQuorumState *s, QDict *options, + const char *indexstr, int index, + Error **errp) +{ + QemuOpts *children_opts = NULL; + Error *local_err = NULL; + int ret = 0; + bool value; + + children_opts = qemu_opts_create(&quorum_children_common_opts, NULL, 0, + &error_abort); + qemu_opts_absorb_qdict_by_index(children_opts, options, indexstr, + &local_err); + if (local_err) { + ret = -EINVAL; + goto out; + } + + value = qemu_opt_get_bool(children_opts, QUORUM_CHILDREN_OPT_IGNORE_ERRORS, + false); + s->ignore_errors[index] = value; + +out: + qemu_opts_del(children_opts); + /* propagate error */ + if (local_err) { + error_propagate(errp, local_err); + } + return ret; +} + static int quorum_open(BlockDriverState *bs, QDict *options, int flags, Error **errp) { @@ -929,12 +992,18 @@ static int quorum_open(BlockDriverState *bs, QDict *options, int flags, s->bs = g_new0(BlockDriverState *, s->num_children); opened = g_new0(bool, s->num_children); s->max_children = s->num_children; + s->ignore_errors = g_new0(bool, s->num_children); for (i = 0; i < s->num_children; i++) { char indexstr[32]; ret = snprintf(indexstr, 32, "children.%d", i); assert(ret < 32); + ret = parse_children_options(s, options, indexstr, i, &local_err); + if (ret < 0) { + goto close_exit; + } + ret = bdrv_open_image(&s->bs[i], NULL, options, indexstr, bs, &child_format, false, &local_err); if (ret < 0) { @@ -976,6 +1045,7 @@ static void quorum_close(BlockDriverState *bs) } g_free(s->bs); + g_free(s->ignore_errors); } static void quorum_detach_aio_context(BlockDriverState *bs) @@ -1014,10 +1084,18 @@ static void quorum_add_child(BlockDriverState *bs, QDict *options, Error **errp) } s->bs = g_renew(BlockDriverState *, s->bs, s->max_children + 1); + s->ignore_errors = g_renew(bool, s->ignore_errors, s->max_children + 1); s->bs[s->num_children] = NULL; s->max_children += 1; } + ret = parse_children_options(s, options, "child", s->num_children, + &local_err); + if (ret < 0) { + error_propagate(errp, local_err); + return; + } + ret = bdrv_open_image(&s->bs[s->num_children], NULL, options, "child", bs, &child_format, false, &local_err); if (ret < 0) { @@ -1058,6 +1136,8 @@ static void quorum_del_child(BlockDriverState *bs, BlockDriverState *child_bs, bdrv_drain(bs); /* We can safely remove this child now */ memmove(&s->bs[i], &s->bs[i+1], (s->num_children - i - 1) * sizeof(void *)); + memmove(&s->ignore_errors[i], &s->ignore_errors[i+1], + (s->num_children - i - 1) * sizeof(bool)); s->num_children--; s->bs[s->num_children] = NULL; bdrv_unref(child_bs); diff --git a/qapi/block-core.json b/qapi/block-core.json index bf141a2..24099ef 100644 --- a/qapi/block-core.json +++ b/qapi/block-core.json @@ -1411,6 +1411,8 @@ # @allow-write-backing-file: #optional whether the backing file is opened in # read-write mode. It is only for backing file # (Since 2.5 default: false) +# @ignore-errors: #options whether the child's I/O error should be ignored. +# it is only for quorum's child.(Since 2.5 default: false) # # Since: 1.7 ## @@ -1425,7 +1427,8 @@ '*werror': 'BlockdevOnError', '*read-only': 'bool', '*detect-zeroes': 'BlockdevDetectZeroesOptions', - '*allow-write-backing-file': 'bool' } } + '*allow-write-backing-file': 'bool', + '*ignore-errors': 'bool' } } ## # @BlockdevOptionsFile