diff mbox series

chardev: allow specifying finer-grained reconnect timeouts

Message ID 20240816100723.2815-1-d-tatianin@yandex-team.ru
State New
Headers show
Series chardev: allow specifying finer-grained reconnect timeouts | expand

Commit Message

Daniil Tatianin Aug. 16, 2024, 10:07 a.m. UTC
The "reconnect" option only allows to specify the time in seconds,
which is way too long for certain workflows.

We have a lightweight disk backend server, which takes about 20ms to
live update, but due to this limitation in QEMU, previously the guest
disk controller would hang for one second because it would take this
long for QEMU to reinitialize the socket connection.

Make it possible to specify a smaller timeout by treating the value in
"reconnect" as milliseconds via the new "reconnect-is-ms" option.

Signed-off-by: Daniil Tatianin <d-tatianin@yandex-team.ru>
---
 chardev/char-socket.c |  7 +++++--
 chardev/char.c        |  3 +++
 qapi/char.json        | 12 +++++++++---
 3 files changed, 17 insertions(+), 5 deletions(-)

Comments

Markus Armbruster Aug. 29, 2024, 11:56 a.m. UTC | #1
Daniil Tatianin <d-tatianin@yandex-team.ru> writes:

> The "reconnect" option only allows to specify the time in seconds,
> which is way too long for certain workflows.
>
> We have a lightweight disk backend server, which takes about 20ms to
> live update, but due to this limitation in QEMU, previously the guest
> disk controller would hang for one second because it would take this
> long for QEMU to reinitialize the socket connection.
>
> Make it possible to specify a smaller timeout by treating the value in
> "reconnect" as milliseconds via the new "reconnect-is-ms" option.
>
> Signed-off-by: Daniil Tatianin <d-tatianin@yandex-team.ru>

Your use case demonstrates that a granularity of seconds was the wrong
choice for the reconnection delay.

[...]

> diff --git a/qapi/char.json b/qapi/char.json
> index ef58445cee..61aeccf09d 100644
> --- a/qapi/char.json
> +++ b/qapi/char.json
> @@ -272,8 +272,13 @@
>  #     (default: false) (Since: 3.1)
>  #
>  # @reconnect: For a client socket, if a socket is disconnected, then
> -#     attempt a reconnect after the given number of seconds.  Setting
> -#     this to zero disables this function.  (default: 0) (Since: 2.2)
> +#     attempt a reconnect after the given number of seconds (unless
> +#     @reconnect-is-ms is set to true, in that case the number is
> +#     treated as milliseconds).  Setting this to zero disables
> +#     this function.  (default: 0) (Since: 2.2)
> +#
> +# @reconnect-is-ms: The value specified in @reconnect should be treated
> +#     as milliseconds.  (default: false) (Since: 9.2)
>  #
>  # Since: 1.4
>  ##
> @@ -287,7 +292,8 @@
>              '*telnet': 'bool',
>              '*tn3270': 'bool',
>              '*websocket': 'bool',
> -            '*reconnect': 'int' },
> +            '*reconnect': 'int',
> +            '*reconnect-is-ms': 'bool' },
>    'base': 'ChardevCommon' }
>  
>  ##

I don't like this interface.

   PRO: compatible extension; no management application updates needed
   unless they want to support sub-second delays.

   CON: specifying a sub-second delay takes two parameters, which is
   awkward.

   CON: trap in combination with -set.  Before the patch, something like
   -set chardev.ID.reconnect=N means N seconds no matter what.
   Afterwards, it depends on the value of reconnect-is-ms, which may be
   set far away.  Mitigating factor: -set is obscure.

Alternatives:

1. Change @reconnect to 'number', specify sub-second delays as
   fractions.

   PRO: compatible extension; no management application updates needed
   unless they want to support sub-second delays.

   CON: first use of floating-point for time in seconds in QAPI, as far
   as I can see.

   CON: QemuOpts can't do floating-point.

2. Deprecate @reconnect in favour of a new member with a more suitable
   unit.  Error if both are present.

   PRO: after @reconnect is gone, the interface is what it arguably
   should've been from the start.

   CON: incompatible change.  Management application updates needed
   within the deprecation grace period.

Let's get additional input from management application developers.  I
cc'ed some.

Related: NetdevSocketOptions member @reconnect.
Daniil Tatianin Aug. 29, 2024, 12:18 p.m. UTC | #2
On 8/29/24 2:56 PM, Markus Armbruster wrote:
> Daniil Tatianin <d-tatianin@yandex-team.ru> writes:
>
>> The "reconnect" option only allows to specify the time in seconds,
>> which is way too long for certain workflows.
>>
>> We have a lightweight disk backend server, which takes about 20ms to
>> live update, but due to this limitation in QEMU, previously the guest
>> disk controller would hang for one second because it would take this
>> long for QEMU to reinitialize the socket connection.
>>
>> Make it possible to specify a smaller timeout by treating the value in
>> "reconnect" as milliseconds via the new "reconnect-is-ms" option.
>>
>> Signed-off-by: Daniil Tatianin <d-tatianin@yandex-team.ru>
> Your use case demonstrates that a granularity of seconds was the wrong
> choice for the reconnection delay.
>
> [...]
>
>> diff --git a/qapi/char.json b/qapi/char.json
>> index ef58445cee..61aeccf09d 100644
>> --- a/qapi/char.json
>> +++ b/qapi/char.json
>> @@ -272,8 +272,13 @@
>> # (default: false) (Since: 3.1)
>> #
>> # @reconnect: For a client socket, if a socket is disconnected, then
>> -# attempt a reconnect after the given number of seconds. Setting
>> -# this to zero disables this function. (default: 0) (Since: 2.2)
>> +# attempt a reconnect after the given number of seconds (unless
>> +# @reconnect-is-ms is set to true, in that case the number is
>> +# treated as milliseconds). Setting this to zero disables
>> +# this function. (default: 0) (Since: 2.2)
>> +#
>> +# @reconnect-is-ms: The value specified in @reconnect should be treated
>> +# as milliseconds. (default: false) (Since: 9.2)
>> #
>> # Since: 1.4
>> ##
>> @@ -287,7 +292,8 @@
>> '*telnet': 'bool',
>> '*tn3270': 'bool',
>> '*websocket': 'bool',
>> - '*reconnect': 'int' },
>> + '*reconnect': 'int',
>> + '*reconnect-is-ms': 'bool' },
>> 'base': 'ChardevCommon' }
>> ##
> I don't like this interface.

Indeed. After discussing this more internally we have actually settled 
on one of the alternatives you have mentioned below.

> PRO: compatible extension; no management application updates needed
> unless they want to support sub-second delays.
>
> CON: specifying a sub-second delay takes two parameters, which is
> awkward.
>
> CON: trap in combination with -set. Before the patch, something like
> -set chardev.ID.reconnect=N means N seconds no matter what.
> Afterwards, it depends on the value of reconnect-is-ms, which may be
> set far away. Mitigating factor: -set is obscure.
>
> Alternatives:
>
> 1. Change @reconnect to 'number', specify sub-second delays as
> fractions.
>
> PRO: compatible extension; no management application updates needed
> unless they want to support sub-second delays.
>
> CON: first use of floating-point for time in seconds in QAPI, as far
> as I can see.
>
> CON: QemuOpts can't do floating-point.
>
> 2. Deprecate @reconnect in favour of a new member with a more suitable
> unit. Error if both are present.
>
> PRO: after @reconnect is gone, the interface is what it arguably
> should've been from the start.
>
> CON: incompatible change. Management application updates needed
> within the deprecation grace period.

This is the one we decided to go with. We simply added a new 
"reconnect-ms" option, which doesn't replace the old one for backwards 
compatibility, but also disallows both to be specified at the same time, 
making them mutually exclusive. I think deprecation would be the way to 
go here.

>
> Let's get additional input from management application developers. I
> cc'ed some.

Sure, sounds great. Thanks!

>
> Related: NetdevSocketOptions member @reconnect.
>
Peter Krempa Aug. 29, 2024, 12:41 p.m. UTC | #3
On Thu, Aug 29, 2024 at 13:56:43 +0200, Markus Armbruster wrote:
> Daniil Tatianin <d-tatianin@yandex-team.ru> writes:

[...]

So firstly, libvirt models the timeout in the XML in seconds for now so
making use of this will require some XML design plumbing making use of
it if there'll be any users wanting it.

When libvirt would want to make use of this the amount of work for any
of the options below is almost the same (capability detection, adding
of the new plumbing, etc). The only difference is what to
do if nobody shows interest in exposing the sub-second intervals in
libvirt.

> > @@ -287,7 +292,8 @@
> >              '*telnet': 'bool',
> >              '*tn3270': 'bool',
> >              '*websocket': 'bool',
> > -            '*reconnect': 'int' },
> > +            '*reconnect': 'int',
> > +            '*reconnect-is-ms': 'bool' },
> >    'base': 'ChardevCommon' }
> >  
> >  ##
> 
> I don't like this interface.
> 
>    PRO: compatible extension; no management application updates needed
>    unless they want to support sub-second delays.
> 
>    CON: specifying a sub-second delay takes two parameters, which is
>    awkward.
> 
>    CON: trap in combination with -set.  Before the patch, something like
>    -set chardev.ID.reconnect=N means N seconds no matter what.
>    Afterwards, it depends on the value of reconnect-is-ms, which may be
>    set far away.  Mitigating factor: -set is obscure.

Here we'd have to do nothing.

> Alternatives:
> 
> 1. Change @reconnect to 'number', specify sub-second delays as
>    fractions.
> 
>    PRO: compatible extension; no management application updates needed
>    unless they want to support sub-second delays.
> 
>    CON: first use of floating-point for time in seconds in QAPI, as far
>    as I can see.
> 
>    CON: QemuOpts can't do floating-point.

Same here.

> 
> 2. Deprecate @reconnect in favour of a new member with a more suitable
>    unit.  Error if both are present.
> 
>    PRO: after @reconnect is gone, the interface is what it arguably
>    should've been from the start.
> 
>    CON: incompatible change.  Management application updates needed
>    within the deprecation grace period.

This one will require change to libvirt including a capability addition
even when libvirt might not want to use the new field. (Add capability
detection, switch to new interface if present. Libvirt doesn't want to
use deprecated interfaces to avoid future breakage.)

But we've done this multiple times so it's not a big deal.

> Let's get additional input from management application developers.  I
> cc'ed some.

From libvirt's point of view I'd say either option is viable. We're okay
with deprecating the old interface libvirt is used to do that.
I'd personally prefer if floating point numbers were avoided.
Eric Blake Aug. 29, 2024, 6:59 p.m. UTC | #4
On Thu, Aug 29, 2024 at 01:56:43PM GMT, Markus Armbruster wrote:
> Daniil Tatianin <d-tatianin@yandex-team.ru> writes:
> 
> > The "reconnect" option only allows to specify the time in seconds,
> > which is way too long for certain workflows.

...
> > @@ -287,7 +292,8 @@
> >              '*telnet': 'bool',
> >              '*tn3270': 'bool',
> >              '*websocket': 'bool',
> > -            '*reconnect': 'int' },
> > +            '*reconnect': 'int',
> > +            '*reconnect-is-ms': 'bool' },
> >    'base': 'ChardevCommon' }
> >  
> >  ##
> 
> I don't like this interface.
> 
>    PRO: compatible extension; no management application updates needed
>    unless they want to support sub-second delays.
> 
>    CON: specifying a sub-second delay takes two parameters, which is
>    awkward.
> 
>    CON: trap in combination with -set.  Before the patch, something like
>    -set chardev.ID.reconnect=N means N seconds no matter what.
>    Afterwards, it depends on the value of reconnect-is-ms, which may be
>    set far away.  Mitigating factor: -set is obscure.

I also dislike this interface.  Having only one number plus an
optional boolean that controls its scale is not as easy to reason
about.

> 
> Alternatives:
> 
> 1. Change @reconnect to 'number', specify sub-second delays as
>    fractions.
> 
>    PRO: compatible extension; no management application updates needed
>    unless they want to support sub-second delays.
> 
>    CON: first use of floating-point for time in seconds in QAPI, as far
>    as I can see.
> 
>    CON: QemuOpts can't do floating-point.

Eww.  I don't see this as the compelling reason to switch to floating point.

> 
> 2. Deprecate @reconnect in favour of a new member with a more suitable
>    unit.  Error if both are present.
> 
>    PRO: after @reconnect is gone, the interface is what it arguably
>    should've been from the start.
> 
>    CON: incompatible change.  Management application updates needed
>    within the deprecation grace period.

This seems reasonable to me.  We have enough places in QAPI where we
want mutual exclusion (we mark both fields optional, but expect the
user to provide exactly one or get an error), that I wonder if it is
worth making it a first-class construct in QAPI (maybe I'm spoiled by
the OneOf designation[1] in protobuf[2] used by gRPC[3] in
kubernetes).  Including the scale in the name is a bonus reason to
switch the interface.

[1] https://protobuf.dev/programming-guides/proto3/#oneof
[2] https://protobuf.dev/overview/
[3] https://grpc.io/

> 
> Let's get additional input from management application developers.  I
> cc'ed some.
> 
> Related: NetdevSocketOptions member @reconnect.
>
Markus Armbruster Aug. 30, 2024, 6:34 a.m. UTC | #5
Eric Blake <eblake@redhat.com> writes:

> On Thu, Aug 29, 2024 at 01:56:43PM GMT, Markus Armbruster wrote:
>> Daniil Tatianin <d-tatianin@yandex-team.ru> writes:
>> 
>> > The "reconnect" option only allows to specify the time in seconds,
>> > which is way too long for certain workflows.
>
> ...
>> > @@ -287,7 +292,8 @@
>> >              '*telnet': 'bool',
>> >              '*tn3270': 'bool',
>> >              '*websocket': 'bool',
>> > -            '*reconnect': 'int' },
>> > +            '*reconnect': 'int',
>> > +            '*reconnect-is-ms': 'bool' },
>> >    'base': 'ChardevCommon' }
>> >  
>> >  ##
>> 
>> I don't like this interface.
>> 
>>    PRO: compatible extension; no management application updates needed
>>    unless they want to support sub-second delays.
>> 
>>    CON: specifying a sub-second delay takes two parameters, which is
>>    awkward.
>> 
>>    CON: trap in combination with -set.  Before the patch, something like
>>    -set chardev.ID.reconnect=N means N seconds no matter what.
>>    Afterwards, it depends on the value of reconnect-is-ms, which may be
>>    set far away.  Mitigating factor: -set is obscure.
>
> I also dislike this interface.  Having only one number plus an
> optional boolean that controls its scale is not as easy to reason
> about.
>
>> 
>> Alternatives:
>> 
>> 1. Change @reconnect to 'number', specify sub-second delays as
>>    fractions.
>> 
>>    PRO: compatible extension; no management application updates needed
>>    unless they want to support sub-second delays.
>> 
>>    CON: first use of floating-point for time in seconds in QAPI, as far
>>    as I can see.
>> 
>>    CON: QemuOpts can't do floating-point.
>
> Eww.  I don't see this as the compelling reason to switch to floating point.
>
>> 
>> 2. Deprecate @reconnect in favour of a new member with a more suitable
>>    unit.  Error if both are present.
>> 
>>    PRO: after @reconnect is gone, the interface is what it arguably
>>    should've been from the start.
>> 
>>    CON: incompatible change.  Management application updates needed
>>    within the deprecation grace period.
>
> This seems reasonable to me.

To Daniil as well.  Since Peter is okay with it on behalf of libvirt, so
am I.

>                               We have enough places in QAPI where we
> want mutual exclusion (we mark both fields optional, but expect the
> user to provide exactly one or get an error), that I wonder if it is
> worth making it a first-class construct in QAPI (maybe I'm spoiled by
> the OneOf designation[1] in protobuf[2] used by gRPC[3] in
> kubernetes).

"One of a set of optional members" is a constraint the QAPI language
cannot express, so we have to leave enforcing it to handwritten code,
and documenting it to handwritten comments.

If it could express it, the constraint would be visible in
introspection, and we'd have less handwritten code.  The price is
additional QAPI language and generator complexity.  Sensible tradeoff?
Hard to tell without patches.  Risk: we pay for patches only to find out
the answer is no.  Another risk: we pay for patches, then take them just
because we already paid for them, then continue to pay upkeep.

I don't mean to say such QAPI language enhancements are a bad idea.  I'm
just pointing out the tradeoff to weigh, and the risks to take.

There are more constraints that could be supported by the language,
e.g. integer ranges.

>               Including the scale in the name is a bonus reason to
> switch the interface.

Yes, because the unit isn't obvious from the type.  It is for things
like byte counts, where we've succeeded at sticking to one plain
unit[*].  We failed for time, so there it isn't obvious.

> [1] https://protobuf.dev/programming-guides/proto3/#oneof
> [2] https://protobuf.dev/overview/
> [3] https://grpc.io/
>
>> 
>> Let's get additional input from management application developers.  I
>> cc'ed some.
>> 
>> Related: NetdevSocketOptions member @reconnect.

Could use the same treatment for consistency.  Not a demand.


[*] With few exceptions, such certain rate limits that use MBytes/s.
diff mbox series

Patch

diff --git a/chardev/char-socket.c b/chardev/char-socket.c
index 1ca9441b1b..d87859e641 100644
--- a/chardev/char-socket.c
+++ b/chardev/char-socket.c
@@ -74,7 +74,7 @@  static void qemu_chr_socket_restart_timer(Chardev *chr)
     assert(!s->reconnect_timer);
     name = g_strdup_printf("chardev-socket-reconnect-%s", chr->label);
     s->reconnect_timer = qemu_chr_timeout_add_ms(chr,
-                                                 s->reconnect_time * 1000,
+                                                 s->reconnect_time,
                                                  socket_reconnect_timeout,
                                                  chr);
     g_source_set_name(s->reconnect_timer, name);
@@ -1082,7 +1082,7 @@  static int tcp_chr_wait_connected(Chardev *chr, Error **errp)
             if (tcp_chr_connect_client_sync(chr, &err) < 0) {
                 if (s->reconnect_time) {
                     error_free(err);
-                    g_usleep(s->reconnect_time * 1000ULL * 1000ULL);
+                    g_usleep(s->reconnect_time * 1000ULL);
                 } else {
                     error_propagate(errp, err);
                     return -1;
@@ -1509,6 +1509,9 @@  static void qemu_chr_parse_socket(QemuOpts *opts, ChardevBackend *backend,
     sock->wait = qemu_opt_get_bool(opts, "wait", true);
     sock->has_reconnect = qemu_opt_find(opts, "reconnect");
     sock->reconnect = qemu_opt_get_number(opts, "reconnect", 0);
+    if (!qemu_opt_get_bool(opts, "reconnect-is-ms", false)) {
+        sock->reconnect *= 1000ULL;
+    }
     sock->tls_creds = g_strdup(qemu_opt_get(opts, "tls-creds"));
     sock->tls_authz = g_strdup(qemu_opt_get(opts, "tls-authz"));
 
diff --git a/chardev/char.c b/chardev/char.c
index 3c43fb1278..f6e5f1d5c9 100644
--- a/chardev/char.c
+++ b/chardev/char.c
@@ -859,6 +859,9 @@  QemuOptsList qemu_chardev_opts = {
         },{
             .name = "reconnect",
             .type = QEMU_OPT_NUMBER,
+        },{
+            .name = "reconnect-is-ms",
+            .type = QEMU_OPT_BOOL,
         },{
             .name = "telnet",
             .type = QEMU_OPT_BOOL,
diff --git a/qapi/char.json b/qapi/char.json
index ef58445cee..61aeccf09d 100644
--- a/qapi/char.json
+++ b/qapi/char.json
@@ -272,8 +272,13 @@ 
 #     (default: false) (Since: 3.1)
 #
 # @reconnect: For a client socket, if a socket is disconnected, then
-#     attempt a reconnect after the given number of seconds.  Setting
-#     this to zero disables this function.  (default: 0) (Since: 2.2)
+#     attempt a reconnect after the given number of seconds (unless
+#     @reconnect-is-ms is set to true, in that case the number is
+#     treated as milliseconds).  Setting this to zero disables
+#     this function.  (default: 0) (Since: 2.2)
+#
+# @reconnect-is-ms: The value specified in @reconnect should be treated
+#     as milliseconds.  (default: false) (Since: 9.2)
 #
 # Since: 1.4
 ##
@@ -287,7 +292,8 @@ 
             '*telnet': 'bool',
             '*tn3270': 'bool',
             '*websocket': 'bool',
-            '*reconnect': 'int' },
+            '*reconnect': 'int',
+            '*reconnect-is-ms': 'bool' },
   'base': 'ChardevCommon' }
 
 ##