diff mbox series

[7/9] monitor: fdset: Match against O_DIRECT

Message ID 20240426142042.14573-8-farosas@suse.de
State New
Headers show
Series migration/mapped-ram: Add direct-io support | expand

Commit Message

Fabiano Rosas April 26, 2024, 2:20 p.m. UTC
We're about to enable the use of O_DIRECT in the migration code and
due to the alignment restrictions imposed by filesystems we need to
make sure the flag is only used when doing aligned IO.

The migration will do parallel IO to different regions of a file, so
we need to use more than one file descriptor. Those cannot be obtained
by duplicating (dup()) since duplicated file descriptors share the
file status flags, including O_DIRECT. If one migration channel does
unaligned IO while another sets O_DIRECT to do aligned IO, the
filesystem would fail the unaligned operation.

The add-fd QMP command along with the fdset code are specifically
designed to allow the user to pass a set of file descriptors with
different access flags into QEMU to be later fetched by code that
needs to alternate between those flags when doing IO.

Extend the fdset matching to behave the same with the O_DIRECT flag.

Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
 monitor/fds.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

Comments

Peter Xu May 3, 2024, 6:53 p.m. UTC | #1
On Fri, Apr 26, 2024 at 11:20:40AM -0300, Fabiano Rosas wrote:
> We're about to enable the use of O_DIRECT in the migration code and
> due to the alignment restrictions imposed by filesystems we need to
> make sure the flag is only used when doing aligned IO.
> 
> The migration will do parallel IO to different regions of a file, so
> we need to use more than one file descriptor. Those cannot be obtained
> by duplicating (dup()) since duplicated file descriptors share the
> file status flags, including O_DIRECT. If one migration channel does
> unaligned IO while another sets O_DIRECT to do aligned IO, the
> filesystem would fail the unaligned operation.
> 
> The add-fd QMP command along with the fdset code are specifically
> designed to allow the user to pass a set of file descriptors with
> different access flags into QEMU to be later fetched by code that
> needs to alternate between those flags when doing IO.
> 
> Extend the fdset matching to behave the same with the O_DIRECT flag.
> 
> Signed-off-by: Fabiano Rosas <farosas@suse.de>
> ---
>  monitor/fds.c | 7 ++++++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/monitor/fds.c b/monitor/fds.c
> index 4ec3b7eea9..62e324fcec 100644
> --- a/monitor/fds.c
> +++ b/monitor/fds.c
> @@ -420,6 +420,11 @@ int monitor_fdset_dup_fd_add(int64_t fdset_id, int flags)
>          int fd = -1;
>          int dup_fd;
>          int mon_fd_flags;
> +        int mask = O_ACCMODE;
> +
> +#ifdef O_DIRECT
> +        mask |= O_DIRECT;
> +#endif
>  
>          if (mon_fdset->id != fdset_id) {
>              continue;
> @@ -431,7 +436,7 @@ int monitor_fdset_dup_fd_add(int64_t fdset_id, int flags)
>                  return -1;
>              }
>  
> -            if ((flags & O_ACCMODE) == (mon_fd_flags & O_ACCMODE)) {
> +            if ((flags & mask) == (mon_fd_flags & mask)) {
>                  fd = mon_fdset_fd->fd;
>                  break;
>              }

I think I see what you wanted to do, picking out the right fd out of two
when qemu_open_old(), which makes sense.

However what happens if the mgmt app only passes in 1 fd to the fdset?  The
issue is we have a "fallback dup()" plan right after this chunk of code:

        dup_fd = qemu_dup_flags(fd, flags);
        if (dup_fd == -1) {
            return -1;
        }

        mon_fdset_fd_dup = g_malloc0(sizeof(*mon_fdset_fd_dup));
        mon_fdset_fd_dup->fd = dup_fd;
        QLIST_INSERT_HEAD(&mon_fdset->dup_fds, mon_fdset_fd_dup, next);

I think it means even if the mgmt app only passes in 1 fd (rather than 2,
one with O_DIRECT, one without), QEMU can always successfully call
qemu_open_old() twice for each case, even though silently the two FDs will
actually impact on each other.  This doesn't look ideal if it's true.

But I also must confess I don't really understand this code at all: we
dup(), then we try F_SETFL on all the possible flags got passed in.
However AFAICT due to the fact that dup()ed FDs will share "struct file" it
means mostly all flags will be shared, except close-on-exec.  I don't ever
see anything protecting that F_SETFL to only touch close-on-exec, I think
it means it'll silently change file status flags for the other fd which we
dup()ed from.  Does it mean that we have issue already with such dup() usage?

Thanks,
Fabiano Rosas May 3, 2024, 9:19 p.m. UTC | #2
Peter Xu <peterx@redhat.com> writes:

> On Fri, Apr 26, 2024 at 11:20:40AM -0300, Fabiano Rosas wrote:
>> We're about to enable the use of O_DIRECT in the migration code and
>> due to the alignment restrictions imposed by filesystems we need to
>> make sure the flag is only used when doing aligned IO.
>> 
>> The migration will do parallel IO to different regions of a file, so
>> we need to use more than one file descriptor. Those cannot be obtained
>> by duplicating (dup()) since duplicated file descriptors share the
>> file status flags, including O_DIRECT. If one migration channel does
>> unaligned IO while another sets O_DIRECT to do aligned IO, the
>> filesystem would fail the unaligned operation.
>> 
>> The add-fd QMP command along with the fdset code are specifically
>> designed to allow the user to pass a set of file descriptors with
>> different access flags into QEMU to be later fetched by code that
>> needs to alternate between those flags when doing IO.
>> 
>> Extend the fdset matching to behave the same with the O_DIRECT flag.
>> 
>> Signed-off-by: Fabiano Rosas <farosas@suse.de>
>> ---
>>  monitor/fds.c | 7 ++++++-
>>  1 file changed, 6 insertions(+), 1 deletion(-)
>> 
>> diff --git a/monitor/fds.c b/monitor/fds.c
>> index 4ec3b7eea9..62e324fcec 100644
>> --- a/monitor/fds.c
>> +++ b/monitor/fds.c
>> @@ -420,6 +420,11 @@ int monitor_fdset_dup_fd_add(int64_t fdset_id, int flags)
>>          int fd = -1;
>>          int dup_fd;
>>          int mon_fd_flags;
>> +        int mask = O_ACCMODE;
>> +
>> +#ifdef O_DIRECT
>> +        mask |= O_DIRECT;
>> +#endif
>>  
>>          if (mon_fdset->id != fdset_id) {
>>              continue;
>> @@ -431,7 +436,7 @@ int monitor_fdset_dup_fd_add(int64_t fdset_id, int flags)
>>                  return -1;
>>              }
>>  
>> -            if ((flags & O_ACCMODE) == (mon_fd_flags & O_ACCMODE)) {
>> +            if ((flags & mask) == (mon_fd_flags & mask)) {
>>                  fd = mon_fdset_fd->fd;
>>                  break;
>>              }
>
> I think I see what you wanted to do, picking out the right fd out of two
> when qemu_open_old(), which makes sense.
>
> However what happens if the mgmt app only passes in 1 fd to the fdset?  The
> issue is we have a "fallback dup()" plan right after this chunk of code:
>

I'm validating the fdset at file_parse_fdset() beforehand. If there's
anything else than 2 fds then we'll error out:

    if (nfds != 2) {
        error_setg(errp, "Outgoing migration needs two fds in the fdset, "
                   "got %d", nfds);
        qmp_remove_fd(*id, false, -1, NULL);
        *id = -1;
        return false;
    }

>         dup_fd = qemu_dup_flags(fd, flags);
>         if (dup_fd == -1) {
>             return -1;
>         }
>
>         mon_fdset_fd_dup = g_malloc0(sizeof(*mon_fdset_fd_dup));
>         mon_fdset_fd_dup->fd = dup_fd;
>         QLIST_INSERT_HEAD(&mon_fdset->dup_fds, mon_fdset_fd_dup, next);
>
> I think it means even if the mgmt app only passes in 1 fd (rather than 2,
> one with O_DIRECT, one without), QEMU can always successfully call
> qemu_open_old() twice for each case, even though silently the two FDs will
> actually impact on each other.  This doesn't look ideal if it's true.
>
> But I also must confess I don't really understand this code at all: we
> dup(), then we try F_SETFL on all the possible flags got passed in.
> However AFAICT due to the fact that dup()ed FDs will share "struct file" it
> means mostly all flags will be shared, except close-on-exec.  I don't ever
> see anything protecting that F_SETFL to only touch close-on-exec, I think
> it means it'll silently change file status flags for the other fd which we
> dup()ed from.  Does it mean that we have issue already with such dup() usage?

I think you're right, but I also think there's a requirement even from
this code that the fds in the fdset cannot be dup()ed. I don't see it
enforced anywhere, but maybe that's a consequence of the larger use-case
for which this feature was introduced.

For our scenario, the open() man page says one can use kcmp() to compare
the fds and determine if they are a result of dup(). Maybe we should do
that extra check? We're defining a pretty rigid interface between QEMU
and the management layer, so not likely to break once it's written. I'm
also not sure how bad would it be to call syscall() directly from QEMU
(kcmp has no libc wrapper).

>
> Thanks,
Peter Xu May 3, 2024, 10:16 p.m. UTC | #3
On Fri, May 03, 2024 at 06:19:30PM -0300, Fabiano Rosas wrote:
> Peter Xu <peterx@redhat.com> writes:
> 
> > On Fri, Apr 26, 2024 at 11:20:40AM -0300, Fabiano Rosas wrote:
> >> We're about to enable the use of O_DIRECT in the migration code and
> >> due to the alignment restrictions imposed by filesystems we need to
> >> make sure the flag is only used when doing aligned IO.
> >> 
> >> The migration will do parallel IO to different regions of a file, so
> >> we need to use more than one file descriptor. Those cannot be obtained
> >> by duplicating (dup()) since duplicated file descriptors share the
> >> file status flags, including O_DIRECT. If one migration channel does
> >> unaligned IO while another sets O_DIRECT to do aligned IO, the
> >> filesystem would fail the unaligned operation.
> >> 
> >> The add-fd QMP command along with the fdset code are specifically
> >> designed to allow the user to pass a set of file descriptors with
> >> different access flags into QEMU to be later fetched by code that
> >> needs to alternate between those flags when doing IO.
> >> 
> >> Extend the fdset matching to behave the same with the O_DIRECT flag.
> >> 
> >> Signed-off-by: Fabiano Rosas <farosas@suse.de>
> >> ---
> >>  monitor/fds.c | 7 ++++++-
> >>  1 file changed, 6 insertions(+), 1 deletion(-)
> >> 
> >> diff --git a/monitor/fds.c b/monitor/fds.c
> >> index 4ec3b7eea9..62e324fcec 100644
> >> --- a/monitor/fds.c
> >> +++ b/monitor/fds.c
> >> @@ -420,6 +420,11 @@ int monitor_fdset_dup_fd_add(int64_t fdset_id, int flags)
> >>          int fd = -1;
> >>          int dup_fd;
> >>          int mon_fd_flags;
> >> +        int mask = O_ACCMODE;
> >> +
> >> +#ifdef O_DIRECT
> >> +        mask |= O_DIRECT;
> >> +#endif
> >>  
> >>          if (mon_fdset->id != fdset_id) {
> >>              continue;
> >> @@ -431,7 +436,7 @@ int monitor_fdset_dup_fd_add(int64_t fdset_id, int flags)
> >>                  return -1;
> >>              }
> >>  
> >> -            if ((flags & O_ACCMODE) == (mon_fd_flags & O_ACCMODE)) {
> >> +            if ((flags & mask) == (mon_fd_flags & mask)) {
> >>                  fd = mon_fdset_fd->fd;
> >>                  break;
> >>              }
> >
> > I think I see what you wanted to do, picking out the right fd out of two
> > when qemu_open_old(), which makes sense.
> >
> > However what happens if the mgmt app only passes in 1 fd to the fdset?  The
> > issue is we have a "fallback dup()" plan right after this chunk of code:
> >
> 
> I'm validating the fdset at file_parse_fdset() beforehand. If there's
> anything else than 2 fds then we'll error out:
> 
>     if (nfds != 2) {
>         error_setg(errp, "Outgoing migration needs two fds in the fdset, "
>                    "got %d", nfds);
>         qmp_remove_fd(*id, false, -1, NULL);
>         *id = -1;
>         return false;
>     }
> 
> >         dup_fd = qemu_dup_flags(fd, flags);
> >         if (dup_fd == -1) {
> >             return -1;
> >         }
> >
> >         mon_fdset_fd_dup = g_malloc0(sizeof(*mon_fdset_fd_dup));
> >         mon_fdset_fd_dup->fd = dup_fd;
> >         QLIST_INSERT_HEAD(&mon_fdset->dup_fds, mon_fdset_fd_dup, next);
> >
> > I think it means even if the mgmt app only passes in 1 fd (rather than 2,
> > one with O_DIRECT, one without), QEMU can always successfully call
> > qemu_open_old() twice for each case, even though silently the two FDs will
> > actually impact on each other.  This doesn't look ideal if it's true.
> >
> > But I also must confess I don't really understand this code at all: we
> > dup(), then we try F_SETFL on all the possible flags got passed in.
> > However AFAICT due to the fact that dup()ed FDs will share "struct file" it
> > means mostly all flags will be shared, except close-on-exec.  I don't ever
> > see anything protecting that F_SETFL to only touch close-on-exec, I think
> > it means it'll silently change file status flags for the other fd which we
> > dup()ed from.  Does it mean that we have issue already with such dup() usage?
> 
> I think you're right, but I also think there's a requirement even from
> this code that the fds in the fdset cannot be dup()ed. I don't see it
> enforced anywhere, but maybe that's a consequence of the larger use-case
> for which this feature was introduced.

I think that's the thing we need to figure out for add-fd usages.  The bad
thing is there're too many qemu_open_internal() users... so we can't easily
tell what we're looking for. May need some time reading the code or the
history.. pretty sad.  I hope someone can chim in.

> 
> For our scenario, the open() man page says one can use kcmp() to compare
> the fds and determine if they are a result of dup(). Maybe we should do
> that extra check? We're defining a pretty rigid interface between QEMU
> and the management layer, so not likely to break once it's written. I'm
> also not sure how bad would it be to call syscall() directly from QEMU
> (kcmp has no libc wrapper).

That should be all fine, see:

$ git grep " syscall(" | wc -l
28

And if we want we can also do fcntl(F_GETFL) on both fds later, making sure
they have proper flags (one must have O_DIRECT, one must not).
diff mbox series

Patch

diff --git a/monitor/fds.c b/monitor/fds.c
index 4ec3b7eea9..62e324fcec 100644
--- a/monitor/fds.c
+++ b/monitor/fds.c
@@ -420,6 +420,11 @@  int monitor_fdset_dup_fd_add(int64_t fdset_id, int flags)
         int fd = -1;
         int dup_fd;
         int mon_fd_flags;
+        int mask = O_ACCMODE;
+
+#ifdef O_DIRECT
+        mask |= O_DIRECT;
+#endif
 
         if (mon_fdset->id != fdset_id) {
             continue;
@@ -431,7 +436,7 @@  int monitor_fdset_dup_fd_add(int64_t fdset_id, int flags)
                 return -1;
             }
 
-            if ((flags & O_ACCMODE) == (mon_fd_flags & O_ACCMODE)) {
+            if ((flags & mask) == (mon_fd_flags & mask)) {
                 fd = mon_fdset_fd->fd;
                 break;
             }