Message ID | 20240816100723.2815-1-d-tatianin@yandex-team.ru |
---|---|
State | New |
Headers | show |
Series | chardev: allow specifying finer-grained reconnect timeouts | expand |
Daniil Tatianin <d-tatianin@yandex-team.ru> writes: > The "reconnect" option only allows to specify the time in seconds, > which is way too long for certain workflows. > > We have a lightweight disk backend server, which takes about 20ms to > live update, but due to this limitation in QEMU, previously the guest > disk controller would hang for one second because it would take this > long for QEMU to reinitialize the socket connection. > > Make it possible to specify a smaller timeout by treating the value in > "reconnect" as milliseconds via the new "reconnect-is-ms" option. > > Signed-off-by: Daniil Tatianin <d-tatianin@yandex-team.ru> Your use case demonstrates that a granularity of seconds was the wrong choice for the reconnection delay. [...] > diff --git a/qapi/char.json b/qapi/char.json > index ef58445cee..61aeccf09d 100644 > --- a/qapi/char.json > +++ b/qapi/char.json > @@ -272,8 +272,13 @@ > # (default: false) (Since: 3.1) > # > # @reconnect: For a client socket, if a socket is disconnected, then > -# attempt a reconnect after the given number of seconds. Setting > -# this to zero disables this function. (default: 0) (Since: 2.2) > +# attempt a reconnect after the given number of seconds (unless > +# @reconnect-is-ms is set to true, in that case the number is > +# treated as milliseconds). Setting this to zero disables > +# this function. (default: 0) (Since: 2.2) > +# > +# @reconnect-is-ms: The value specified in @reconnect should be treated > +# as milliseconds. (default: false) (Since: 9.2) > # > # Since: 1.4 > ## > @@ -287,7 +292,8 @@ > '*telnet': 'bool', > '*tn3270': 'bool', > '*websocket': 'bool', > - '*reconnect': 'int' }, > + '*reconnect': 'int', > + '*reconnect-is-ms': 'bool' }, > 'base': 'ChardevCommon' } > > ## I don't like this interface. PRO: compatible extension; no management application updates needed unless they want to support sub-second delays. CON: specifying a sub-second delay takes two parameters, which is awkward. CON: trap in combination with -set. Before the patch, something like -set chardev.ID.reconnect=N means N seconds no matter what. Afterwards, it depends on the value of reconnect-is-ms, which may be set far away. Mitigating factor: -set is obscure. Alternatives: 1. Change @reconnect to 'number', specify sub-second delays as fractions. PRO: compatible extension; no management application updates needed unless they want to support sub-second delays. CON: first use of floating-point for time in seconds in QAPI, as far as I can see. CON: QemuOpts can't do floating-point. 2. Deprecate @reconnect in favour of a new member with a more suitable unit. Error if both are present. PRO: after @reconnect is gone, the interface is what it arguably should've been from the start. CON: incompatible change. Management application updates needed within the deprecation grace period. Let's get additional input from management application developers. I cc'ed some. Related: NetdevSocketOptions member @reconnect.
On 8/29/24 2:56 PM, Markus Armbruster wrote: > Daniil Tatianin <d-tatianin@yandex-team.ru> writes: > >> The "reconnect" option only allows to specify the time in seconds, >> which is way too long for certain workflows. >> >> We have a lightweight disk backend server, which takes about 20ms to >> live update, but due to this limitation in QEMU, previously the guest >> disk controller would hang for one second because it would take this >> long for QEMU to reinitialize the socket connection. >> >> Make it possible to specify a smaller timeout by treating the value in >> "reconnect" as milliseconds via the new "reconnect-is-ms" option. >> >> Signed-off-by: Daniil Tatianin <d-tatianin@yandex-team.ru> > Your use case demonstrates that a granularity of seconds was the wrong > choice for the reconnection delay. > > [...] > >> diff --git a/qapi/char.json b/qapi/char.json >> index ef58445cee..61aeccf09d 100644 >> --- a/qapi/char.json >> +++ b/qapi/char.json >> @@ -272,8 +272,13 @@ >> # (default: false) (Since: 3.1) >> # >> # @reconnect: For a client socket, if a socket is disconnected, then >> -# attempt a reconnect after the given number of seconds. Setting >> -# this to zero disables this function. (default: 0) (Since: 2.2) >> +# attempt a reconnect after the given number of seconds (unless >> +# @reconnect-is-ms is set to true, in that case the number is >> +# treated as milliseconds). Setting this to zero disables >> +# this function. (default: 0) (Since: 2.2) >> +# >> +# @reconnect-is-ms: The value specified in @reconnect should be treated >> +# as milliseconds. (default: false) (Since: 9.2) >> # >> # Since: 1.4 >> ## >> @@ -287,7 +292,8 @@ >> '*telnet': 'bool', >> '*tn3270': 'bool', >> '*websocket': 'bool', >> - '*reconnect': 'int' }, >> + '*reconnect': 'int', >> + '*reconnect-is-ms': 'bool' }, >> 'base': 'ChardevCommon' } >> ## > I don't like this interface. Indeed. After discussing this more internally we have actually settled on one of the alternatives you have mentioned below. > PRO: compatible extension; no management application updates needed > unless they want to support sub-second delays. > > CON: specifying a sub-second delay takes two parameters, which is > awkward. > > CON: trap in combination with -set. Before the patch, something like > -set chardev.ID.reconnect=N means N seconds no matter what. > Afterwards, it depends on the value of reconnect-is-ms, which may be > set far away. Mitigating factor: -set is obscure. > > Alternatives: > > 1. Change @reconnect to 'number', specify sub-second delays as > fractions. > > PRO: compatible extension; no management application updates needed > unless they want to support sub-second delays. > > CON: first use of floating-point for time in seconds in QAPI, as far > as I can see. > > CON: QemuOpts can't do floating-point. > > 2. Deprecate @reconnect in favour of a new member with a more suitable > unit. Error if both are present. > > PRO: after @reconnect is gone, the interface is what it arguably > should've been from the start. > > CON: incompatible change. Management application updates needed > within the deprecation grace period. This is the one we decided to go with. We simply added a new "reconnect-ms" option, which doesn't replace the old one for backwards compatibility, but also disallows both to be specified at the same time, making them mutually exclusive. I think deprecation would be the way to go here. > > Let's get additional input from management application developers. I > cc'ed some. Sure, sounds great. Thanks! > > Related: NetdevSocketOptions member @reconnect. >
On Thu, Aug 29, 2024 at 13:56:43 +0200, Markus Armbruster wrote: > Daniil Tatianin <d-tatianin@yandex-team.ru> writes: [...] So firstly, libvirt models the timeout in the XML in seconds for now so making use of this will require some XML design plumbing making use of it if there'll be any users wanting it. When libvirt would want to make use of this the amount of work for any of the options below is almost the same (capability detection, adding of the new plumbing, etc). The only difference is what to do if nobody shows interest in exposing the sub-second intervals in libvirt. > > @@ -287,7 +292,8 @@ > > '*telnet': 'bool', > > '*tn3270': 'bool', > > '*websocket': 'bool', > > - '*reconnect': 'int' }, > > + '*reconnect': 'int', > > + '*reconnect-is-ms': 'bool' }, > > 'base': 'ChardevCommon' } > > > > ## > > I don't like this interface. > > PRO: compatible extension; no management application updates needed > unless they want to support sub-second delays. > > CON: specifying a sub-second delay takes two parameters, which is > awkward. > > CON: trap in combination with -set. Before the patch, something like > -set chardev.ID.reconnect=N means N seconds no matter what. > Afterwards, it depends on the value of reconnect-is-ms, which may be > set far away. Mitigating factor: -set is obscure. Here we'd have to do nothing. > Alternatives: > > 1. Change @reconnect to 'number', specify sub-second delays as > fractions. > > PRO: compatible extension; no management application updates needed > unless they want to support sub-second delays. > > CON: first use of floating-point for time in seconds in QAPI, as far > as I can see. > > CON: QemuOpts can't do floating-point. Same here. > > 2. Deprecate @reconnect in favour of a new member with a more suitable > unit. Error if both are present. > > PRO: after @reconnect is gone, the interface is what it arguably > should've been from the start. > > CON: incompatible change. Management application updates needed > within the deprecation grace period. This one will require change to libvirt including a capability addition even when libvirt might not want to use the new field. (Add capability detection, switch to new interface if present. Libvirt doesn't want to use deprecated interfaces to avoid future breakage.) But we've done this multiple times so it's not a big deal. > Let's get additional input from management application developers. I > cc'ed some. From libvirt's point of view I'd say either option is viable. We're okay with deprecating the old interface libvirt is used to do that. I'd personally prefer if floating point numbers were avoided.
On Thu, Aug 29, 2024 at 01:56:43PM GMT, Markus Armbruster wrote: > Daniil Tatianin <d-tatianin@yandex-team.ru> writes: > > > The "reconnect" option only allows to specify the time in seconds, > > which is way too long for certain workflows. ... > > @@ -287,7 +292,8 @@ > > '*telnet': 'bool', > > '*tn3270': 'bool', > > '*websocket': 'bool', > > - '*reconnect': 'int' }, > > + '*reconnect': 'int', > > + '*reconnect-is-ms': 'bool' }, > > 'base': 'ChardevCommon' } > > > > ## > > I don't like this interface. > > PRO: compatible extension; no management application updates needed > unless they want to support sub-second delays. > > CON: specifying a sub-second delay takes two parameters, which is > awkward. > > CON: trap in combination with -set. Before the patch, something like > -set chardev.ID.reconnect=N means N seconds no matter what. > Afterwards, it depends on the value of reconnect-is-ms, which may be > set far away. Mitigating factor: -set is obscure. I also dislike this interface. Having only one number plus an optional boolean that controls its scale is not as easy to reason about. > > Alternatives: > > 1. Change @reconnect to 'number', specify sub-second delays as > fractions. > > PRO: compatible extension; no management application updates needed > unless they want to support sub-second delays. > > CON: first use of floating-point for time in seconds in QAPI, as far > as I can see. > > CON: QemuOpts can't do floating-point. Eww. I don't see this as the compelling reason to switch to floating point. > > 2. Deprecate @reconnect in favour of a new member with a more suitable > unit. Error if both are present. > > PRO: after @reconnect is gone, the interface is what it arguably > should've been from the start. > > CON: incompatible change. Management application updates needed > within the deprecation grace period. This seems reasonable to me. We have enough places in QAPI where we want mutual exclusion (we mark both fields optional, but expect the user to provide exactly one or get an error), that I wonder if it is worth making it a first-class construct in QAPI (maybe I'm spoiled by the OneOf designation[1] in protobuf[2] used by gRPC[3] in kubernetes). Including the scale in the name is a bonus reason to switch the interface. [1] https://protobuf.dev/programming-guides/proto3/#oneof [2] https://protobuf.dev/overview/ [3] https://grpc.io/ > > Let's get additional input from management application developers. I > cc'ed some. > > Related: NetdevSocketOptions member @reconnect. >
Eric Blake <eblake@redhat.com> writes: > On Thu, Aug 29, 2024 at 01:56:43PM GMT, Markus Armbruster wrote: >> Daniil Tatianin <d-tatianin@yandex-team.ru> writes: >> >> > The "reconnect" option only allows to specify the time in seconds, >> > which is way too long for certain workflows. > > ... >> > @@ -287,7 +292,8 @@ >> > '*telnet': 'bool', >> > '*tn3270': 'bool', >> > '*websocket': 'bool', >> > - '*reconnect': 'int' }, >> > + '*reconnect': 'int', >> > + '*reconnect-is-ms': 'bool' }, >> > 'base': 'ChardevCommon' } >> > >> > ## >> >> I don't like this interface. >> >> PRO: compatible extension; no management application updates needed >> unless they want to support sub-second delays. >> >> CON: specifying a sub-second delay takes two parameters, which is >> awkward. >> >> CON: trap in combination with -set. Before the patch, something like >> -set chardev.ID.reconnect=N means N seconds no matter what. >> Afterwards, it depends on the value of reconnect-is-ms, which may be >> set far away. Mitigating factor: -set is obscure. > > I also dislike this interface. Having only one number plus an > optional boolean that controls its scale is not as easy to reason > about. > >> >> Alternatives: >> >> 1. Change @reconnect to 'number', specify sub-second delays as >> fractions. >> >> PRO: compatible extension; no management application updates needed >> unless they want to support sub-second delays. >> >> CON: first use of floating-point for time in seconds in QAPI, as far >> as I can see. >> >> CON: QemuOpts can't do floating-point. > > Eww. I don't see this as the compelling reason to switch to floating point. > >> >> 2. Deprecate @reconnect in favour of a new member with a more suitable >> unit. Error if both are present. >> >> PRO: after @reconnect is gone, the interface is what it arguably >> should've been from the start. >> >> CON: incompatible change. Management application updates needed >> within the deprecation grace period. > > This seems reasonable to me. To Daniil as well. Since Peter is okay with it on behalf of libvirt, so am I. > We have enough places in QAPI where we > want mutual exclusion (we mark both fields optional, but expect the > user to provide exactly one or get an error), that I wonder if it is > worth making it a first-class construct in QAPI (maybe I'm spoiled by > the OneOf designation[1] in protobuf[2] used by gRPC[3] in > kubernetes). "One of a set of optional members" is a constraint the QAPI language cannot express, so we have to leave enforcing it to handwritten code, and documenting it to handwritten comments. If it could express it, the constraint would be visible in introspection, and we'd have less handwritten code. The price is additional QAPI language and generator complexity. Sensible tradeoff? Hard to tell without patches. Risk: we pay for patches only to find out the answer is no. Another risk: we pay for patches, then take them just because we already paid for them, then continue to pay upkeep. I don't mean to say such QAPI language enhancements are a bad idea. I'm just pointing out the tradeoff to weigh, and the risks to take. There are more constraints that could be supported by the language, e.g. integer ranges. > Including the scale in the name is a bonus reason to > switch the interface. Yes, because the unit isn't obvious from the type. It is for things like byte counts, where we've succeeded at sticking to one plain unit[*]. We failed for time, so there it isn't obvious. > [1] https://protobuf.dev/programming-guides/proto3/#oneof > [2] https://protobuf.dev/overview/ > [3] https://grpc.io/ > >> >> Let's get additional input from management application developers. I >> cc'ed some. >> >> Related: NetdevSocketOptions member @reconnect. Could use the same treatment for consistency. Not a demand. [*] With few exceptions, such certain rate limits that use MBytes/s.
diff --git a/chardev/char-socket.c b/chardev/char-socket.c index 1ca9441b1b..d87859e641 100644 --- a/chardev/char-socket.c +++ b/chardev/char-socket.c @@ -74,7 +74,7 @@ static void qemu_chr_socket_restart_timer(Chardev *chr) assert(!s->reconnect_timer); name = g_strdup_printf("chardev-socket-reconnect-%s", chr->label); s->reconnect_timer = qemu_chr_timeout_add_ms(chr, - s->reconnect_time * 1000, + s->reconnect_time, socket_reconnect_timeout, chr); g_source_set_name(s->reconnect_timer, name); @@ -1082,7 +1082,7 @@ static int tcp_chr_wait_connected(Chardev *chr, Error **errp) if (tcp_chr_connect_client_sync(chr, &err) < 0) { if (s->reconnect_time) { error_free(err); - g_usleep(s->reconnect_time * 1000ULL * 1000ULL); + g_usleep(s->reconnect_time * 1000ULL); } else { error_propagate(errp, err); return -1; @@ -1509,6 +1509,9 @@ static void qemu_chr_parse_socket(QemuOpts *opts, ChardevBackend *backend, sock->wait = qemu_opt_get_bool(opts, "wait", true); sock->has_reconnect = qemu_opt_find(opts, "reconnect"); sock->reconnect = qemu_opt_get_number(opts, "reconnect", 0); + if (!qemu_opt_get_bool(opts, "reconnect-is-ms", false)) { + sock->reconnect *= 1000ULL; + } sock->tls_creds = g_strdup(qemu_opt_get(opts, "tls-creds")); sock->tls_authz = g_strdup(qemu_opt_get(opts, "tls-authz")); diff --git a/chardev/char.c b/chardev/char.c index 3c43fb1278..f6e5f1d5c9 100644 --- a/chardev/char.c +++ b/chardev/char.c @@ -859,6 +859,9 @@ QemuOptsList qemu_chardev_opts = { },{ .name = "reconnect", .type = QEMU_OPT_NUMBER, + },{ + .name = "reconnect-is-ms", + .type = QEMU_OPT_BOOL, },{ .name = "telnet", .type = QEMU_OPT_BOOL, diff --git a/qapi/char.json b/qapi/char.json index ef58445cee..61aeccf09d 100644 --- a/qapi/char.json +++ b/qapi/char.json @@ -272,8 +272,13 @@ # (default: false) (Since: 3.1) # # @reconnect: For a client socket, if a socket is disconnected, then -# attempt a reconnect after the given number of seconds. Setting -# this to zero disables this function. (default: 0) (Since: 2.2) +# attempt a reconnect after the given number of seconds (unless +# @reconnect-is-ms is set to true, in that case the number is +# treated as milliseconds). Setting this to zero disables +# this function. (default: 0) (Since: 2.2) +# +# @reconnect-is-ms: The value specified in @reconnect should be treated +# as milliseconds. (default: false) (Since: 9.2) # # Since: 1.4 ## @@ -287,7 +292,8 @@ '*telnet': 'bool', '*tn3270': 'bool', '*websocket': 'bool', - '*reconnect': 'int' }, + '*reconnect': 'int', + '*reconnect-is-ms': 'bool' }, 'base': 'ChardevCommon' } ##
The "reconnect" option only allows to specify the time in seconds, which is way too long for certain workflows. We have a lightweight disk backend server, which takes about 20ms to live update, but due to this limitation in QEMU, previously the guest disk controller would hang for one second because it would take this long for QEMU to reinitialize the socket connection. Make it possible to specify a smaller timeout by treating the value in "reconnect" as milliseconds via the new "reconnect-is-ms" option. Signed-off-by: Daniil Tatianin <d-tatianin@yandex-team.ru> --- chardev/char-socket.c | 7 +++++-- chardev/char.c | 3 +++ qapi/char.json | 12 +++++++++--- 3 files changed, 17 insertions(+), 5 deletions(-)