Message ID: 20191017212858.13230-2-axboe@kernel.dk
State: Changes Requested
Delegated to: David Miller
Series: [1/3] io_uring: add support for async work inheriting files table
On Fri, Oct 18, 2019 at 4:01 AM Jens Axboe <axboe@kernel.dk> wrote:
> This is in preparation for adding opcodes that need to modify files
> in a process file table, either adding new ones or closing old ones.

Closing old ones would be tricky. Basically, if you call
get_files_struct() while you're between an fdget()/fdput() pair (e.g.
from sys_io_uring_enter()), you're not allowed to use that
files_struct reference to replace or close existing FDs through that
reference. (Or more accurately: if you go through fdget() with a
files_struct refcount of 1, you must not replace/close FDs in there in
any way until you've passed the corresponding fdput().)

You can avoid that if you ensure that you never use fdget()/fdput() in
the relevant places, only fget()/fput().

> If an opcode needs this, it must set REQ_F_NEED_FILES in the request
> structure. If work that needs to get punted to async context has this
> set, it will grab a reference to the process file table. When the
> work is completed, the reference is dropped again.
[...]
> @@ -2220,6 +2223,10 @@ static void io_sq_wq_submit_work(struct work_struct *work)
>  			set_fs(USER_DS);
>  		}
>  	}
> +	if (s->files && !old_files) {
> +		old_files = current->files;
> +		current->files = s->files;
> +	}

AFAIK e.g. stuff like proc_fd_link() in procfs can concurrently call
get_files_struct() even on kernel tasks, so you should take the
task_lock(current) while fiddling with the ->files pointer.

Also, maybe I'm too tired to read this correctly, but it seems like
when io_sq_wq_submit_work() is processing multiple elements with
->files pointers, this part will only consume a reference to the first
one?
>  	if (!ret) {
>  		s->has_user = cur_mm != NULL;
> @@ -2312,6 +2319,11 @@ static void io_sq_wq_submit_work(struct work_struct *work)
>  		unuse_mm(cur_mm);
>  		mmput(cur_mm);
>  	}
> +	if (old_files) {
> +		struct files_struct *files = current->files;
> +		current->files = old_files;
> +		put_files_struct(files);
> +	}

And then here the first files_struct reference is dropped, and the
rest of them leak?

>  }
>
>  /*
> @@ -2413,6 +2425,8 @@ static int __io_queue_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
>
>  	s->sqe = sqe_copy;
>  	memcpy(&req->submit, s, sizeof(*s));
> +	if (req->flags & REQ_F_NEED_FILES)
> +		req->submit.files = get_files_struct(current);

Stupid question: How does this interact with sqpoll mode? In that
case, this function is running on a kernel thread that isn't sharing
the application's files_struct, right?
On 10/17/19 8:41 PM, Jann Horn wrote:
> On Fri, Oct 18, 2019 at 4:01 AM Jens Axboe <axboe@kernel.dk> wrote:
>> This is in preparation for adding opcodes that need to modify files
>> in a process file table, either adding new ones or closing old ones.
>
> Closing old ones would be tricky.
[...]
> You can avoid that if you ensure that you never use fdget()/fdput() in
> the relevant places, only fget()/fput().

That's a good point, I didn't think the closing aspect through when
writing that changelog. File addition is the most interesting aspect,
obviously, and the only part that I care about in this patch set. I'll
change the wording.

[...]
> AFAIK e.g. stuff like proc_fd_link() in procfs can concurrently call
> get_files_struct() even on kernel tasks, so you should take the
> task_lock(current) while fiddling with the ->files pointer.

Fixed up, thanks!

> Also, maybe I'm too tired to read this correctly, but it seems like
> when io_sq_wq_submit_work() is processing multiple elements with
> ->files pointers, this part will only consume a reference to the first
> one?
Like the mm, we should only have the one file table. But there's no
reason not to handle this properly; I've amended the commit to
properly swap, so it works for any number of file tables.

[...]
> And then here the first files_struct reference is dropped, and the
> rest of them leak?

Fixed with the above change.

[...]
> Stupid question: How does this interact with sqpoll mode? In that
> case, this function is running on a kernel thread that isn't sharing
> the application's files_struct, right?

Not a stupid question! It doesn't work with sqpoll. We need to be
entered on behalf of the task, and we never see that with sqpoll
(except if NEED_WAKE is set in flags). For now I'll just forbid it
explicitly in io_accept(), just like we do for IORING_SETUP_IOPOLL.

Updated patch 1:

http://git.kernel.dk/cgit/linux-block/commit/?h=for-5.5/io_uring-test&id=df6caac708dae8ee9a74c9016e479b02ad78d436

and patch 3:

http://git.kernel.dk/cgit/linux-block/commit/?h=for-5.5/io_uring-test&id=442bb35fc4f8f28c29ea220475c45babb44ee49c
On Fri, Oct 18, 2019 at 4:01 PM Jens Axboe <axboe@kernel.dk> wrote:
> On 10/17/19 8:41 PM, Jann Horn wrote:
[...]
> Updated patch 1:
>
> http://git.kernel.dk/cgit/linux-block/commit/?h=for-5.5/io_uring-test&id=df6caac708dae8ee9a74c9016e479b02ad78d436

I don't understand what you're doing with old_files in there. In the
"s->files && !old_files" branch, "current->files = s->files" happens
without holding task_lock(), but current->files and s->files are also
the same already at that point anyway. And what's the intent behind
assigning stuff to old_files inside the loop? Isn't that going to
cause the workqueue to keep a modified current->files beyond the
runtime of the work?
On 10/18/19 8:34 AM, Jann Horn wrote:
[...]
> I don't understand what you're doing with old_files in there. In the
> "s->files && !old_files" branch, "current->files = s->files" happens
> without holding task_lock(), but current->files and s->files are also
> the same already at that point anyway. And what's the intent behind
> assigning stuff to old_files inside the loop? Isn't that going to
> cause the workqueue to keep a modified current->files beyond the
> runtime of the work?

I simply forgot to remove the old block; it should only have this one:

	if (s->files && s->files != cur_files) {
		task_lock(current);
		current->files = s->files;
		task_unlock(current);
		if (cur_files)
			put_files_struct(cur_files);
		cur_files = s->files;
	}
On Fri, Oct 18, 2019 at 4:37 PM Jens Axboe <axboe@kernel.dk> wrote:
[...]
> I simply forgot to remove the old block, it should only have this one:
>
> 	if (s->files && s->files != cur_files) {
> 		task_lock(current);
> 		current->files = s->files;
> 		task_unlock(current);
> 		if (cur_files)
> 			put_files_struct(cur_files);
> 		cur_files = s->files;
> 	}

Don't you still need a put_files_struct() in the case where "s->files
== cur_files"?
On 10/18/19 8:40 AM, Jann Horn wrote:
[...]
> Don't you still need a put_files_struct() in the case where "s->files
> == cur_files"?

I want to hold on to the files for as long as I can, to avoid
unnecessary shuffling of it. But I take it your worry here is that
we'll be calling something that manipulates ->files? Nothing should do
that unless s->files is set. We didn't hide the workqueue ->files[]
before this change either.
On Fri, Oct 18, 2019 at 4:43 PM Jens Axboe <axboe@kernel.dk> wrote:
[...]
> I want to hold on to the files for as long as I can, to avoid unnecessary
> shuffling of it. But I take it your worry here is that we'll be calling
> something that manipulates ->files? Nothing should do that, unless
> s->files is set. We didn't hide the workqueue ->files[] before this
> change either.

No, my worry is that the refcount of the files_struct is left too
high.
From what I can tell, the "do" loop in io_sq_wq_submit_work() iterates
over multiple instances of struct sqe_submit. If there are two
sqe_submit instances with the same ->files (each holding a reference
from the get_files_struct() in __io_queue_sqe()), then:

When processing the first sqe_submit instance, current->files and
cur_files are set to $user_files.
When processing the second sqe_submit instance, nothing happens
(s->files == cur_files).
After the loop, at the end of the function, put_files_struct() is
called once on $user_files.

So get_files_struct() has been called twice, but put_files_struct()
has only been called once. That leaves the refcount too high, and by
repeating this, an attacker can make the refcount wrap around and then
cause a use-after-free.
On 10/18/19 8:52 AM, Jann Horn wrote:
[...]
> No, my worry is that the refcount of the files_struct is left too
> high.
[...]
> So get_files_struct() has been called twice, but put_files_struct()
> has only been called once. That leaves the refcount too high, and by
> repeating this, an attacker can make the refcount wrap around and then
> cause a use-after-free.

Ah, now I see what you are getting at, yes, that's clearly a bug! I
wonder how we can best safely batch the drops. We could track the
number of times we've used the same files and do atomic_sub_and_test()
in a put_files_struct_many() type addition. But that would leave us
open to the issue you describe, where someone could maliciously
overflow the files refcount.

Probably not worth over-optimizing, as long as we can avoid the
current->files task lock/unlock and shuffle.

I'll update the patch.
On 10/18/19 9:00 AM, Jens Axboe wrote:
[...]
>> So get_files_struct() has been called twice, but put_files_struct()
>> has only been called once. That leaves the refcount too high, and by
>> repeating this, an attacker can make the refcount wrap around and then
>> cause a use-after-free.
>
> Ah now I see what you are getting at, yes that's clearly a bug!
[...]
> I'll update the patch.

Alright, this incremental on top should do it. And full updated patch
here:

http://git.kernel.dk/cgit/linux-block/commit/?h=for-5.5/io_uring-test&id=40449c5a3d3b16796fa13e9469c69d62986e961c

Let me know what you think.
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 68eda36bcc9c..2fed0badad38 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -2289,13 +2289,13 @@ static void io_sq_wq_submit_work(struct work_struct *work)
 				set_fs(USER_DS);
 			}
 		}
-		if (s->files && s->files != cur_files) {
+		if (cur_files)
+			put_files_struct(cur_files);
+		cur_files = s->files;
+		if (cur_files && cur_files != current->files) {
 			task_lock(current);
-			current->files = s->files;
+			current->files = cur_files;
 			task_unlock(current);
-			if (cur_files)
-				put_files_struct(cur_files);
-			cur_files = s->files;
 		}

 		if (!ret) {
@@ -2393,8 +2393,9 @@ static void io_sq_wq_submit_work(struct work_struct *work)
 		task_lock(current);
 		current->files = old_files;
 		task_unlock(current);
-		put_files_struct(cur_files);
 	}
+	if (cur_files)
+		put_files_struct(cur_files);
 }

 /*
On Fri, Oct 18, 2019 at 5:55 PM Jens Axboe <axboe@kernel.dk> wrote:
[...]
> Alright, this incremental on top should do it. And full updated patch
> here:
>
> http://git.kernel.dk/cgit/linux-block/commit/?h=for-5.5/io_uring-test&id=40449c5a3d3b16796fa13e9469c69d62986e961c
>
> Let me know what you think.
Ignoring the locking elision, basically the logic is now this:

static void io_sq_wq_submit_work(struct work_struct *work)
{
	struct io_kiocb *req = container_of(work, struct io_kiocb, work);
	struct files_struct *cur_files = NULL, *old_files;
	[...]
	old_files = current->files;
	[...]
	do {
		struct sqe_submit *s = &req->submit;
		[...]
		if (cur_files)
			/* drop cur_files reference; borrow lifetime must
			 * end before here */
			put_files_struct(cur_files);
		/* move reference ownership to cur_files */
		cur_files = s->files;
		if (cur_files) {
			task_lock(current);
			/* current->files borrows reference from cur_files;
			 * existing borrow from previous loop ends here */
			current->files = cur_files;
			task_unlock(current);
		}

		[call __io_submit_sqe()]
		[...]
	} while (req);
	[...]
	/* existing borrow ends here */
	task_lock(current);
	current->files = old_files;
	task_unlock(current);
	if (cur_files)
		/* drop cur_files reference; borrow lifetime must
		 * end before here */
		put_files_struct(cur_files);
}

If you run two iterations of this loop, with a first element that has
a ->files pointer and a second element that doesn't, then in the
second run through the loop, the reference to the files_struct will be
dropped while current->files still points to it; current->files is
only reset after the loop has ended. If someone accesses
current->files through procfs directly after that, AFAICS you'd get a
use-after-free.
On 10/18/19 10:20 AM, Jann Horn wrote:
[...]
> If you run two iterations of this loop, with a first element that has
> a ->files pointer and a second element that doesn't, then in the
> second run through the loop, the reference to the files_struct will be
> dropped while current->files still points to it; current->files is
> only reset after the loop has ended. If someone accesses
> current->files through procfs directly after that, AFAICS you'd get a
> use-after-free.

Amazing how this is still broken. You are right, and it's especially
annoying since that's exactly the case I originally talked about (not
flipping current->files if we don't have to). I just did it wrong, so
we'll leave a dangling pointer in ->files.
By far the most common case is that if one sqe has a file table it needs to attach, then any others that also carry files will reference the same set. So I want to optimize for the case where we only flip current->files once when we first see the files, and once when we're done with the loop.

Let me see if I can get this right...
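For reference, the double-get/single-put imbalance being discussed can be modeled in a few lines of userspace C. This is a toy sketch with stand-in types, not the kernel's actual structures; `flawed_submit_loop()` mirrors the shape of the buggy flow rather than the real function:

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins for the kernel structures (userspace model only). */
struct files_struct { int count; };
struct sqe_submit { struct files_struct *files; };

static void get_files_struct(struct files_struct *f) { f->count++; }
static void put_files_struct(struct files_struct *f) { f->count--; }

static struct files_struct worker_table = { .count = 1 }; /* worker's own files */

/* Mirrors the buggy io_sq_wq_submit_work() flow: ->files is swapped in
 * only once, and a single put is done at the end, no matter how many
 * submits each carried their own reference to the same table. */
static void flawed_submit_loop(struct sqe_submit *reqs, int n)
{
	struct files_struct *old_files = NULL;
	struct files_struct *current_files = &worker_table; /* current->files */
	int i;

	for (i = 0; i < n; i++) {
		struct sqe_submit *s = &reqs[i];

		if (s->files && !old_files) {
			old_files = current_files;
			current_files = s->files;
		}
		/* a second submit with the same ->files is ignored here, so
		 * the reference it took in __io_queue_sqe() is never dropped */
	}
	if (old_files) {
		put_files_struct(current_files); /* one put for two gets */
		current_files = old_files;
	}
}

/* Returns how many references are leaked beyond the task's baseline. */
int leaked_refs(void)
{
	struct files_struct table = { .count = 1 }; /* task's own reference */
	struct sqe_submit reqs[2] = { { &table }, { &table } };

	get_files_struct(&table); /* one get per submit, as in __io_queue_sqe() */
	get_files_struct(&table);

	flawed_submit_loop(reqs, 2);
	return table.count - 1; /* the refcount is left one too high */
}
```

Running this with two submits sharing one table leaves one extra reference, which is exactly the imbalance that could be driven to a refcount wraparound.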
On 10/18/19 10:36 AM, Jens Axboe wrote: >> Ignoring the locking elision, basically the logic is now this: >> >> static void io_sq_wq_submit_work(struct work_struct *work) >> { >> struct io_kiocb *req = container_of(work, struct io_kiocb, work); >> struct files_struct *cur_files = NULL, *old_files; >> [...] >> old_files = current->files; >> [...] >> do { >> struct sqe_submit *s = &req->submit; >> [...] >> if (cur_files) >> /* drop cur_files reference; borrow lifetime must >> * end before here */ >> put_files_struct(cur_files); >> /* move reference ownership to cur_files */ >> cur_files = s->files; >> if (cur_files) { >> task_lock(current); >> /* current->files borrows reference from cur_files; >> * existing borrow from previous loop ends here */ >> current->files = cur_files; >> task_unlock(current); >> } >> >> [call __io_submit_sqe()] >> [...] >> } while (req); >> [...] >> /* existing borrow ends here */ >> task_lock(current); >> current->files = old_files; >> task_unlock(current); >> if (cur_files) >> /* drop cur_files reference; borrow lifetime must >> * end before here */ >> put_files_struct(cur_files); >> } >> >> If you run two iterations of this loop, with a first element that has >> a ->files pointer and a second element that doesn't, then in the >> second run through the loop, the reference to the files_struct will be >> dropped while current->files still points to it; current->files is >> only reset after the loop has ended. If someone accesses >> current->files through procfs directly after that, AFAICS you'd get a >> use-after-free. > > Amazing how this is still broken. You are right, and it's especially > annoying since that's exactly the case I originally talked about (not > flipping current->files if we don't have to). I just did it wrong, so > we'll leave a dangling pointer in ->files. > > The by far most common case is if one sqe has a files it needs to > attach, then others that also have files will be the same set. 
So I want
> to optimize for the case where we only flip current->files once when we
> see the files, and once when we're done with the loop.
>
> Let me see if I can get this right...

I _think_ the simplest way to do it is simply to have both cur_files and
current->files hold a reference to the file table. That won't really add
any extra cost as the double increments / decrements are following each
other. Something like this incremental, totally untested.

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 2fed0badad38..b3cf3f3d7911 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -2293,9 +2293,14 @@ static void io_sq_wq_submit_work(struct work_struct *work)
 			put_files_struct(cur_files);
 		cur_files = s->files;
 		if (cur_files && cur_files != current->files) {
+			struct files_struct *old;
+
+			atomic_inc(&cur_files->count);
 			task_lock(current);
+			old = current->files;
 			current->files = cur_files;
 			task_unlock(current);
+			put_files_struct(old);
 		}
 
 		if (!ret) {
@@ -2390,9 +2395,13 @@ static void io_sq_wq_submit_work(struct work_struct *work)
 		mmput(cur_mm);
 	}
 	if (old_files != current->files) {
+		struct files_struct *old;
+
 		task_lock(current);
+		old = current->files;
 		current->files = old_files;
 		task_unlock(current);
+		put_files_struct(old);
 	}
 	if (cur_files)
 		put_files_struct(cur_files);
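Modeled the same way in userspace (toy types again, not the kernel's), the bookkeeping of the incremental above balances for the submitter's table: every swap of current->files takes its own reference and drops the displaced one, so the per-submit references from __io_queue_sqe() all get returned. The sketch only asserts on the user table's count; the worker's original table is effectively init_files in practice and is never freed:

```c
#include <assert.h>
#include <stddef.h>

/* Userspace toy model of the fixed flow (not kernel code). */
struct files_struct { int count; };
struct sqe_submit { struct files_struct *files; };

static void get_files_struct(struct files_struct *f) { f->count++; }
static void put_files_struct(struct files_struct *f) { f->count--; }

/* Runs two submits sharing one table through the fixed bookkeeping and
 * returns the user table's final refcount (its baseline is 1). */
int user_table_refs_after_loop(void)
{
	struct files_struct worker = { .count = 1 }; /* stands in for init_files */
	struct files_struct user = { .count = 1 };   /* submitter's table */
	struct sqe_submit reqs[2] = { { &user }, { &user } };
	struct files_struct *current_files = &worker; /* stands in for current->files */
	struct files_struct *old_files = current_files;
	struct files_struct *cur_files = NULL;
	int i;

	get_files_struct(&user); /* one reference per submit (__io_queue_sqe) */
	get_files_struct(&user);

	for (i = 0; i < 2; i++) {
		struct sqe_submit *s = &reqs[i];

		if (cur_files)
			put_files_struct(cur_files); /* previous submit's ref */
		cur_files = s->files;
		if (cur_files && cur_files != current_files) {
			struct files_struct *old = current_files;

			get_files_struct(cur_files); /* atomic_inc(&cur_files->count) */
			current_files = cur_files;
			put_files_struct(old);
		}
	}
	if (old_files != current_files) {
		struct files_struct *old = current_files;

		current_files = old_files;
		put_files_struct(old);
	}
	if (cur_files)
		put_files_struct(cur_files);

	return user.count; /* back to baseline: no leak, no dangling reference */
}
```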
On Fri, Oct 18, 2019 at 7:05 PM Jens Axboe <axboe@kernel.dk> wrote: > On 10/18/19 10:36 AM, Jens Axboe wrote: > >> Ignoring the locking elision, basically the logic is now this: > >> > >> static void io_sq_wq_submit_work(struct work_struct *work) > >> { > >> struct io_kiocb *req = container_of(work, struct io_kiocb, work); > >> struct files_struct *cur_files = NULL, *old_files; > >> [...] > >> old_files = current->files; > >> [...] > >> do { > >> struct sqe_submit *s = &req->submit; > >> [...] > >> if (cur_files) > >> /* drop cur_files reference; borrow lifetime must > >> * end before here */ > >> put_files_struct(cur_files); > >> /* move reference ownership to cur_files */ > >> cur_files = s->files; > >> if (cur_files) { > >> task_lock(current); > >> /* current->files borrows reference from cur_files; > >> * existing borrow from previous loop ends here */ > >> current->files = cur_files; > >> task_unlock(current); > >> } > >> > >> [call __io_submit_sqe()] > >> [...] > >> } while (req); > >> [...] > >> /* existing borrow ends here */ > >> task_lock(current); > >> current->files = old_files; > >> task_unlock(current); > >> if (cur_files) > >> /* drop cur_files reference; borrow lifetime must > >> * end before here */ > >> put_files_struct(cur_files); > >> } > >> > >> If you run two iterations of this loop, with a first element that has > >> a ->files pointer and a second element that doesn't, then in the > >> second run through the loop, the reference to the files_struct will be > >> dropped while current->files still points to it; current->files is > >> only reset after the loop has ended. If someone accesses > >> current->files through procfs directly after that, AFAICS you'd get a > >> use-after-free. > > > > Amazing how this is still broken. You are right, and it's especially > > annoying since that's exactly the case I originally talked about (not > > flipping current->files if we don't have to). 
I just did it wrong, so > > we'll leave a dangling pointer in ->files. > > > > The by far most common case is if one sqe has a files it needs to > > attach, then others that also have files will be the same set. So I want > > to optimize for the case where we only flip current->files once when we > > see the files, and once when we're done with the loop. > > > > Let me see if I can get this right... > > I _think_ the simplest way to do it is simply to have both cur_files and > current->files hold a reference to the file table. That won't really add > any extra cost as the double increments / decrements are following each > other. Something like this incremental, totally untested. > > > diff --git a/fs/io_uring.c b/fs/io_uring.c > index 2fed0badad38..b3cf3f3d7911 100644 > --- a/fs/io_uring.c > +++ b/fs/io_uring.c > @@ -2293,9 +2293,14 @@ static void io_sq_wq_submit_work(struct work_struct *work) > put_files_struct(cur_files); > cur_files = s->files; > if (cur_files && cur_files != current->files) { > + struct files_struct *old; > + > + atomic_inc(&cur_files->count); > task_lock(current); > + old = current->files; > current->files = cur_files; > task_unlock(current); > + put_files_struct(old); > } > > if (!ret) { > @@ -2390,9 +2395,13 @@ static void io_sq_wq_submit_work(struct work_struct *work) > mmput(cur_mm); > } > if (old_files != current->files) { > + struct files_struct *old; > + > task_lock(current); > + old = current->files; > current->files = old_files; > task_unlock(current); > + put_files_struct(old); > } > if (cur_files) > put_files_struct(cur_files); The only part I still feel a bit twitchy about is this part at the end: if (old_files != current->files) { struct files_struct *old; task_lock(current); old = current->files; current->files = old_files; task_unlock(current); put_files_struct(old); } If it was possible for the initial ->files to be the same as the ->files of a submission, and we got two submissions with first a different files_struct and then 
our old one, then this branch would not be executed even though it
should, which would leave the refcount of the files_struct one too
high. But that probably can't happen? Since kernel workers should be
running with &init_files (I think?) and that thing is never used for
userspace tasks. But still, I'd feel better if you could change it like
this:

diff --git a/fs/io_uring.c b/fs/io_uring.c
index f9f5c70564f0..7673035d6bfe 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -2265,6 +2265,7 @@ static void io_sq_wq_submit_work(struct work_struct *work)
 {
 	struct io_kiocb *req = container_of(work, struct io_kiocb, work);
 	struct files_struct *cur_files = NULL, *old_files;
+	bool restore_current_files = false;
 	struct io_ring_ctx *ctx = req->ctx;
 	struct mm_struct *cur_mm = NULL;
 	struct async_list *async_list;
@@ -2313,6 +2314,7 @@ static void io_sq_wq_submit_work(struct work_struct *work)
 			current->files = cur_files;
 			task_unlock(current);
 			put_files_struct(old);
+			restore_current_files = true;
 		}
 
 		if (!ret) {
@@ -2406,7 +2408,7 @@ static void io_sq_wq_submit_work(struct work_struct *work)
 		unuse_mm(cur_mm);
 		mmput(cur_mm);
 	}
-	if (old_files != current->files) {
+	if (restore_current_files) {
 		struct files_struct *old;
 
 		task_lock(current);

But actually, by the way: Is this whole files_struct thing creating a
reference loop? The files_struct has a reference to the uring file,
and the uring file has ACCEPT work that has a reference to the
files_struct. If the task gets killed and the accept work blocks, the
entire files_struct will stay alive, right?
On 10/18/19 12:06 PM, Jann Horn wrote: > On Fri, Oct 18, 2019 at 7:05 PM Jens Axboe <axboe@kernel.dk> wrote: >> On 10/18/19 10:36 AM, Jens Axboe wrote: >>>> Ignoring the locking elision, basically the logic is now this: >>>> >>>> static void io_sq_wq_submit_work(struct work_struct *work) >>>> { >>>> struct io_kiocb *req = container_of(work, struct io_kiocb, work); >>>> struct files_struct *cur_files = NULL, *old_files; >>>> [...] >>>> old_files = current->files; >>>> [...] >>>> do { >>>> struct sqe_submit *s = &req->submit; >>>> [...] >>>> if (cur_files) >>>> /* drop cur_files reference; borrow lifetime must >>>> * end before here */ >>>> put_files_struct(cur_files); >>>> /* move reference ownership to cur_files */ >>>> cur_files = s->files; >>>> if (cur_files) { >>>> task_lock(current); >>>> /* current->files borrows reference from cur_files; >>>> * existing borrow from previous loop ends here */ >>>> current->files = cur_files; >>>> task_unlock(current); >>>> } >>>> >>>> [call __io_submit_sqe()] >>>> [...] >>>> } while (req); >>>> [...] >>>> /* existing borrow ends here */ >>>> task_lock(current); >>>> current->files = old_files; >>>> task_unlock(current); >>>> if (cur_files) >>>> /* drop cur_files reference; borrow lifetime must >>>> * end before here */ >>>> put_files_struct(cur_files); >>>> } >>>> >>>> If you run two iterations of this loop, with a first element that has >>>> a ->files pointer and a second element that doesn't, then in the >>>> second run through the loop, the reference to the files_struct will be >>>> dropped while current->files still points to it; current->files is >>>> only reset after the loop has ended. If someone accesses >>>> current->files through procfs directly after that, AFAICS you'd get a >>>> use-after-free. >>> >>> Amazing how this is still broken. You are right, and it's especially >>> annoying since that's exactly the case I originally talked about (not >>> flipping current->files if we don't have to). 
I just did it wrong, so >>> we'll leave a dangling pointer in ->files. >>> >>> The by far most common case is if one sqe has a files it needs to >>> attach, then others that also have files will be the same set. So I want >>> to optimize for the case where we only flip current->files once when we >>> see the files, and once when we're done with the loop. >>> >>> Let me see if I can get this right... >> >> I _think_ the simplest way to do it is simply to have both cur_files and >> current->files hold a reference to the file table. That won't really add >> any extra cost as the double increments / decrements are following each >> other. Something like this incremental, totally untested. >> >> >> diff --git a/fs/io_uring.c b/fs/io_uring.c >> index 2fed0badad38..b3cf3f3d7911 100644 >> --- a/fs/io_uring.c >> +++ b/fs/io_uring.c >> @@ -2293,9 +2293,14 @@ static void io_sq_wq_submit_work(struct work_struct *work) >> put_files_struct(cur_files); >> cur_files = s->files; >> if (cur_files && cur_files != current->files) { >> + struct files_struct *old; >> + >> + atomic_inc(&cur_files->count); >> task_lock(current); >> + old = current->files; >> current->files = cur_files; >> task_unlock(current); >> + put_files_struct(old); >> } >> >> if (!ret) { >> @@ -2390,9 +2395,13 @@ static void io_sq_wq_submit_work(struct work_struct *work) >> mmput(cur_mm); >> } >> if (old_files != current->files) { >> + struct files_struct *old; >> + >> task_lock(current); >> + old = current->files; >> current->files = old_files; >> task_unlock(current); >> + put_files_struct(old); >> } >> if (cur_files) >> put_files_struct(cur_files); > > The only part I still feel a bit twitchy about is this part at the end: > > if (old_files != current->files) { > struct files_struct *old; > > task_lock(current); > old = current->files; > current->files = old_files; > task_unlock(current); > put_files_struct(old); > } > > If it was possible for the initial ->files to be the same as the > ->files of a submission, 
and we got two submissions with first a > different files_struct and then our old one, then this branch would > not be executed even though it should, which would leave the refcount > of the files_struct one too high. But that probably can't happen? > Since kernel workers should be running with &init_files (I think?) and > that thing is never used for userspace tasks. But still, I'd feel > better if you could change it like this: Right, that is never going to happen. But your solution is simpler, so... I'll throw some testing at it. > But actually, by the way: Is this whole files_struct thing creating a > reference loop? The files_struct has a reference to the uring file, > and the uring file has ACCEPT work that has a reference to the > files_struct. If the task gets killed and the accept work blocks, the > entire files_struct will stay alive, right? Yes, for the lifetime of the request, it does create a loop. So if the application goes away, I think you're right, the files_struct will stay. And so will the io_uring, for that matter, as we depend on the closing of the files to do the final reap. Hmm, not sure how best to handle that, to be honest. We need some way to break the loop, if the request never finishes.
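The loop being discussed can be sketched abstractly in userspace C (toy refcounts and hypothetical names, not the kernel objects): the fd table entry pins the uring file, and the pending work pins the files_struct, so once the task's own reference goes away, neither count can reach zero on its own:

```c
#include <assert.h>

/* Toy refcounts for the two objects in the cycle (not kernel code). */
struct obj { int count; };

/* Returns 1 if both objects survive the task's exit because of the
 * files_struct <-> uring-file reference cycle. */
int cycle_keeps_both_alive(void)
{
	struct obj files_table = { .count = 0 };
	struct obj uring_file = { .count = 0 };

	files_table.count++; /* the task's own reference to its files_struct */
	uring_file.count++;  /* fd table entry pointing at the uring file */
	files_table.count++; /* pending ACCEPT work's get_files_struct() */

	/* task dies: only its direct reference is dropped */
	files_table.count--;

	/* the fd entry's reference to the uring file is only dropped when the
	 * files_struct is destroyed, and the work's reference to the
	 * files_struct is only dropped when the request completes -- which it
	 * may never do for a blocked accept */
	return files_table.count > 0 && uring_file.count > 0;
}
```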
On Fri, Oct 18, 2019 at 8:16 PM Jens Axboe <axboe@kernel.dk> wrote: > On 10/18/19 12:06 PM, Jann Horn wrote: > > But actually, by the way: Is this whole files_struct thing creating a > > reference loop? The files_struct has a reference to the uring file, > > and the uring file has ACCEPT work that has a reference to the > > files_struct. If the task gets killed and the accept work blocks, the > > entire files_struct will stay alive, right? > > Yes, for the lifetime of the request, it does create a loop. So if the > application goes away, I think you're right, the files_struct will stay. > And so will the io_uring, for that matter, as we depend on the closing > of the files to do the final reap. > > Hmm, not sure how best to handle that, to be honest. We need some way to > break the loop, if the request never finishes. A wacky and dubious approach would be to, instead of taking a reference to the files_struct, abuse f_op->flush() to synchronously flush out pending requests with references to the files_struct... But it's probably a bad idea, given that in f_op->flush(), you can't easily tell which files_struct the close is coming from. I suppose you could keep a list of (fdtable, fd) pairs through which ACCEPT requests have come in and then let f_op->flush() probe whether the file pointers are gone from them...
On Thu, Oct 17, 2019 at 03:28:56PM -0600, Jens Axboe wrote:
> This is in preparation for adding opcodes that need to modify files
> in a process file table, either adding new ones or closing old ones.
>
> If an opcode needs this, it must set REQ_F_NEED_FILES in the request
> structure. If work that needs to get punted to async context have this
> set, they will grab a reference to the process file table. When the
> work is completed, the reference is dropped again.

I think IORING_OP_SENDMSG and _RECVMSG need to set this flag due to
SCM_RIGHTS control messages.

Thought I'd reply here since I just now ran into the issue that I was
getting ever-increasing wrong file descriptor numbers on pretty much
every "other" async recvmsg() call I did via io-uring while receiving
file descriptors from lxc for the seccomp-notify proxy. (I'm currently
running an ubuntu based 5.3.1 kernel.)

I ended up finding them in /proc - they show up in all kernel threads,
eg.:

root:/root # grep Name /proc/9/status
Name: mm_percpu_wq
root:/root # ls -l /proc/9/fd
total 0
lr-x------ 1 root root 64 Oct 23 12:00 0 -> '/proc/512 (deleted)'
lrwx------ 1 root root 64 Oct 23 12:00 1 -> /proc/512/mem
lr-x------ 1 root root 64 Oct 23 12:00 10 -> '/proc/11782 (deleted)'
lrwx------ 1 root root 64 Oct 23 12:00 11 -> /proc/11782/mem
lr-x------ 1 root root 64 Oct 23 12:00 12 -> '/proc/12210 (deleted)'
lrwx------ 1 root root 64 Oct 23 12:00 13 -> /proc/12210/mem
lr-x------ 1 root root 64 Oct 23 12:00 14 -> '/proc/12298 (deleted)'
lrwx------ 1 root root 64 Oct 23 12:00 15 -> /proc/12298/mem
lr-x------ 1 root root 64 Oct 23 12:00 16 -> '/proc/13955 (deleted)'
lrwx------ 1 root root 64 Oct 23 12:00 17 -> /proc/13955/mem
lr-x------ 1 root root 64 Oct 23 12:00 18 -> '/proc/13989 (deleted)'
lrwx------ 1 root root 64 Oct 23 12:00 19 -> /proc/13989/mem
lr-x------ 1 root root 64 Oct 23 12:00 2 -> '/proc/584 (deleted)'
lr-x------ 1 root root 64 Oct 23 12:00 20 -> '/proc/15502 (deleted)'
lrwx------ 1 root root 64 Oct 23 12:00 21 -> /proc/15502/mem
lr-x------ 1 root root 64 Oct 23 12:00 22 -> '/proc/15510 (deleted)'
lrwx------ 1 root root 64 Oct 23 12:00 23 -> /proc/15510/mem
lr-x------ 1 root root 64 Oct 23 12:00 24 -> '/proc/17833 (deleted)'
lrwx------ 1 root root 64 Oct 23 12:00 25 -> /proc/17833/mem
lr-x------ 1 root root 64 Oct 23 12:00 26 -> '/proc/17836 (deleted)'
lrwx------ 1 root root 64 Oct 23 12:00 27 -> /proc/17836/mem
lr-x------ 1 root root 64 Oct 23 12:00 28 -> '/proc/21929 (deleted)'
lrwx------ 1 root root 64 Oct 23 12:00 29 -> /proc/21929/mem
lrwx------ 1 root root 64 Oct 23 12:00 3 -> /proc/584/mem
lr-x------ 1 root root 64 Oct 23 12:00 30 -> '/proc/22214 (deleted)'
lrwx------ 1 root root 64 Oct 23 12:00 31 -> /proc/22214/mem
lr-x------ 1 root root 64 Oct 23 12:00 32 -> '/proc/22283 (deleted)'
lrwx------ 1 root root 64 Oct 23 12:00 33 -> /proc/22283/mem
lr-x------ 1 root root 64 Oct 23 12:00 34 -> '/proc/29795 (deleted)'
lrwx------ 1 root root 64 Oct 23 12:00 35 -> /proc/29795/mem
lr-x------ 1 root root 64 Oct 23 12:00 36 -> '/proc/30124 (deleted)'
lrwx------ 1 root root 64 Oct 23 12:00 37 -> /proc/30124/mem
lr-x------ 1 root root 64 Oct 23 12:00 38 -> '/proc/31016 (deleted)'
lrwx------ 1 root root 64 Oct 23 12:00 39 -> /proc/31016/mem
lr-x------ 1 root root 64 Oct 23 12:00 4 -> '/proc/1632 (deleted)'
lr-x------ 1 root root 64 Oct 23 12:00 40 -> '/proc/4137 (deleted)'
lrwx------ 1 root root 64 Oct 23 12:00 41 -> /proc/4137/mem
lrwx------ 1 root root 64 Oct 23 12:00 5 -> /proc/1632/mem
lr-x------ 1 root root 64 Oct 23 12:00 6 -> '/proc/3655 (deleted)'
lrwx------ 1 root root 64 Oct 23 12:00 7 -> /proc/3655/mem
lr-x------ 1 root root 64 Oct 23 12:00 8 -> '/proc/7075 (deleted)'
lrwx------ 1 root root 64 Oct 23 12:00 9 -> /proc/7075/mem
root:/root #

Those are the fds I expected to receive, and I get fd numbers
consistently increasing with them.
lxc sends the syscall-executing process' pidfd and its 'mem' fd via a socket, but instead of making it to the receiver, they end up there... I suspect that an async sendmsg() call could potentially end up accessing those instead of the ones from the sender process, but I haven't tested it... > Signed-off-by: Jens Axboe <axboe@kernel.dk> > --- > fs/io_uring.c | 21 ++++++++++++++++++--- > 1 file changed, 18 insertions(+), 3 deletions(-) > > diff --git a/fs/io_uring.c b/fs/io_uring.c > index 635856023fdf..ad462237275e 100644 > --- a/fs/io_uring.c > +++ b/fs/io_uring.c > @@ -267,10 +267,11 @@ struct io_ring_ctx { > struct sqe_submit { > const struct io_uring_sqe *sqe; > unsigned short index; > + bool has_user : 1; > + bool in_async : 1; > + bool needs_fixed_file : 1; > u32 sequence; > - bool has_user; > - bool in_async; > - bool needs_fixed_file; > + struct files_struct *files; > }; > > /* > @@ -323,6 +324,7 @@ struct io_kiocb { > #define REQ_F_FAIL_LINK 256 /* fail rest of links */ > #define REQ_F_SHADOW_DRAIN 512 /* link-drain shadow req */ > #define REQ_F_TIMEOUT 1024 /* timeout request */ > +#define REQ_F_NEED_FILES 2048 /* needs to assume file table */ > u64 user_data; > u32 result; > u32 sequence; > @@ -2191,6 +2193,7 @@ static inline bool io_sqe_needs_user(const struct io_uring_sqe *sqe) > static void io_sq_wq_submit_work(struct work_struct *work) > { > struct io_kiocb *req = container_of(work, struct io_kiocb, work); > + struct files_struct *old_files = NULL; > struct io_ring_ctx *ctx = req->ctx; > struct mm_struct *cur_mm = NULL; > struct async_list *async_list; > @@ -2220,6 +2223,10 @@ static void io_sq_wq_submit_work(struct work_struct *work) > set_fs(USER_DS); > } > } > + if (s->files && !old_files) { > + old_files = current->files; > + current->files = s->files; > + } > > if (!ret) { > s->has_user = cur_mm != NULL; > @@ -2312,6 +2319,11 @@ static void io_sq_wq_submit_work(struct work_struct *work) > unuse_mm(cur_mm); > mmput(cur_mm); > } > + if (old_files) 
{ > + struct files_struct *files = current->files; > + current->files = old_files; > + put_files_struct(files); > + } > } > > /* > @@ -2413,6 +2425,8 @@ static int __io_queue_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req, > > s->sqe = sqe_copy; > memcpy(&req->submit, s, sizeof(*s)); > + if (req->flags & REQ_F_NEED_FILES) > + req->submit.files = get_files_struct(current); > list = io_async_list_from_sqe(ctx, s->sqe); > if (!io_add_to_prev_work(list, req)) { > if (list) > @@ -2633,6 +2647,7 @@ static bool io_get_sqring(struct io_ring_ctx *ctx, struct sqe_submit *s) > s->index = head; > s->sqe = &ctx->sq_sqes[head]; > s->sequence = ctx->cached_sq_head; > + s->files = NULL; > ctx->cached_sq_head++; > return true; > } > -- > 2.17.1
On 10/23/19 6:04 AM, Wolfgang Bumiller wrote:
> On Thu, Oct 17, 2019 at 03:28:56PM -0600, Jens Axboe wrote:
>> This is in preparation for adding opcodes that need to modify files
>> in a process file table, either adding new ones or closing old ones.
>>
>> If an opcode needs this, it must set REQ_F_NEED_FILES in the request
>> structure. If work that needs to get punted to async context have this
>> set, they will grab a reference to the process file table. When the
>> work is completed, the reference is dropped again.
>
> I think IORING_OP_SENDMSG and _RECVMSG need to set this flag due to
> SCM_RIGHTS control messages.
>
> Thought I'd reply here since I just now ran into the issue that I was
> getting ever-increasing wrong file descriptor numbers on pretty much
> every "other" async recvmsg() call I did via io-uring while receiving
> file descriptors from lxc for the seccomp-notify proxy. (I'm currently
> running an ubuntu based 5.3.1 kernel.)
>
> I ended up finding them in /proc - they show up in all kernel threads,
> eg.:
>
> root:/root # grep Name /proc/9/status
> Name: mm_percpu_wq
> root:/root # ls -l /proc/9/fd
> total 0
> lr-x------ 1 root root 64 Oct 23 12:00 0 -> '/proc/512 (deleted)'
> lrwx------ 1 root root 64 Oct 23 12:00 1 -> /proc/512/mem
> lr-x------ 1 root root 64 Oct 23 12:00 10 -> '/proc/11782 (deleted)'
> lrwx------ 1 root root 64 Oct 23 12:00 11 -> /proc/11782/mem
> lr-x------ 1 root root 64 Oct 23 12:00 12 -> '/proc/12210 (deleted)'
> lrwx------ 1 root root 64 Oct 23 12:00 13 -> /proc/12210/mem
> lr-x------ 1 root root 64 Oct 23 12:00 14 -> '/proc/12298 (deleted)'
> lrwx------ 1 root root 64 Oct 23 12:00 15 -> /proc/12298/mem
> lr-x------ 1 root root 64 Oct 23 12:00 16 -> '/proc/13955 (deleted)'
> lrwx------ 1 root root 64 Oct 23 12:00 17 -> /proc/13955/mem
> lr-x------ 1 root root 64 Oct 23 12:00 18 -> '/proc/13989 (deleted)'
> lrwx------ 1 root root 64 Oct 23 12:00 19 -> /proc/13989/mem
> lr-x------ 1 root root 64 Oct 23 12:00 2 -> '/proc/584 (deleted)'
> lr-x------ 1 root root 64 Oct 23 12:00 20 -> '/proc/15502 (deleted)'
> lrwx------ 1 root root 64 Oct 23 12:00 21 -> /proc/15502/mem
> lr-x------ 1 root root 64 Oct 23 12:00 22 -> '/proc/15510 (deleted)'
> lrwx------ 1 root root 64 Oct 23 12:00 23 -> /proc/15510/mem
> lr-x------ 1 root root 64 Oct 23 12:00 24 -> '/proc/17833 (deleted)'
> lrwx------ 1 root root 64 Oct 23 12:00 25 -> /proc/17833/mem
> lr-x------ 1 root root 64 Oct 23 12:00 26 -> '/proc/17836 (deleted)'
> lrwx------ 1 root root 64 Oct 23 12:00 27 -> /proc/17836/mem
> lr-x------ 1 root root 64 Oct 23 12:00 28 -> '/proc/21929 (deleted)'
> lrwx------ 1 root root 64 Oct 23 12:00 29 -> /proc/21929/mem
> lrwx------ 1 root root 64 Oct 23 12:00 3 -> /proc/584/mem
> lr-x------ 1 root root 64 Oct 23 12:00 30 -> '/proc/22214 (deleted)'
> lrwx------ 1 root root 64 Oct 23 12:00 31 -> /proc/22214/mem
> lr-x------ 1 root root 64 Oct 23 12:00 32 -> '/proc/22283 (deleted)'
> lrwx------ 1 root root 64 Oct 23 12:00 33 -> /proc/22283/mem
> lr-x------ 1 root root 64 Oct 23 12:00 34 -> '/proc/29795 (deleted)'
> lrwx------ 1 root root 64 Oct 23 12:00 35 -> /proc/29795/mem
> lr-x------ 1 root root 64 Oct 23 12:00 36 -> '/proc/30124 (deleted)'
> lrwx------ 1 root root 64 Oct 23 12:00 37 -> /proc/30124/mem
> lr-x------ 1 root root 64 Oct 23 12:00 38 -> '/proc/31016 (deleted)'
> lrwx------ 1 root root 64 Oct 23 12:00 39 -> /proc/31016/mem
> lr-x------ 1 root root 64 Oct 23 12:00 4 -> '/proc/1632 (deleted)'
> lr-x------ 1 root root 64 Oct 23 12:00 40 -> '/proc/4137 (deleted)'
> lrwx------ 1 root root 64 Oct 23 12:00 41 -> /proc/4137/mem
> lrwx------ 1 root root 64 Oct 23 12:00 5 -> /proc/1632/mem
> lr-x------ 1 root root 64 Oct 23 12:00 6 -> '/proc/3655 (deleted)'
> lrwx------ 1 root root 64 Oct 23 12:00 7 -> /proc/3655/mem
> lr-x------ 1 root root 64 Oct 23 12:00 8 -> '/proc/7075 (deleted)'
> lrwx------ 1 root root 64 Oct 23 12:00 9 -> /proc/7075/mem
> root:/root #
>
> Those are the fds I expected to receive, and I get fd numbers
> consistently increasing with them.
> lxc sends the syscall-executing process' pidfd and its 'mem' fd via a
> socket, but instead of making it to the receiver, they end up there...
>
> I suspect that an async sendmsg() call could potentially end up
> accessing those instead of the ones from the sender process, but I
> haven't tested it...

Might "just" be a case of the sendmsg() being stuck; we can't currently
cancel work, so if they never complete, the ring won't go away.

I'm actually working on a small workqueue replacement for io_uring which
allows us to cancel things like that. It's a requirement for accept() as
well, but also for basic read/write send/recv on sockets. We're so used
to storage IO operations that complete in a finite amount of time...

But yes, I hope with that, and the flush trick that Jann suggested, that
we can make this 100% reliable for any type of operation.
On 10/18/19 12:50 PM, Jann Horn wrote: > On Fri, Oct 18, 2019 at 8:16 PM Jens Axboe <axboe@kernel.dk> wrote: >> On 10/18/19 12:06 PM, Jann Horn wrote: >>> But actually, by the way: Is this whole files_struct thing creating a >>> reference loop? The files_struct has a reference to the uring file, >>> and the uring file has ACCEPT work that has a reference to the >>> files_struct. If the task gets killed and the accept work blocks, the >>> entire files_struct will stay alive, right? >> >> Yes, for the lifetime of the request, it does create a loop. So if the >> application goes away, I think you're right, the files_struct will stay. >> And so will the io_uring, for that matter, as we depend on the closing >> of the files to do the final reap. >> >> Hmm, not sure how best to handle that, to be honest. We need some way to >> break the loop, if the request never finishes. > > A wacky and dubious approach would be to, instead of taking a > reference to the files_struct, abuse f_op->flush() to synchronously > flush out pending requests with references to the files_struct... But > it's probably a bad idea, given that in f_op->flush(), you can't > easily tell which files_struct the close is coming from. I suppose you > could keep a list of (fdtable, fd) pairs through which ACCEPT requests > have come in and then let f_op->flush() probe whether the file > pointers are gone from them... Got back to this after finishing the io-wq stuff, which we need for the cancel. Here's an updated patch: http://git.kernel.dk/cgit/linux-block/commit/?h=for-5.5/io_uring-test&id=1ea847edc58d6a54ca53001ad0c656da57257570 that seems to work for me (lightly tested), we correctly find and cancel work that is holding on to the file table. The full series sits on top of my for-5.5/io_uring-wq branch, and can be viewed here: http://git.kernel.dk/cgit/linux-block/log/?h=for-5.5/io_uring-test Let me know what you think!
On Thu, Oct 24, 2019 at 9:41 PM Jens Axboe <axboe@kernel.dk> wrote: > On 10/18/19 12:50 PM, Jann Horn wrote: > > On Fri, Oct 18, 2019 at 8:16 PM Jens Axboe <axboe@kernel.dk> wrote: > >> On 10/18/19 12:06 PM, Jann Horn wrote: > >>> But actually, by the way: Is this whole files_struct thing creating a > >>> reference loop? The files_struct has a reference to the uring file, > >>> and the uring file has ACCEPT work that has a reference to the > >>> files_struct. If the task gets killed and the accept work blocks, the > >>> entire files_struct will stay alive, right? > >> > >> Yes, for the lifetime of the request, it does create a loop. So if the > >> application goes away, I think you're right, the files_struct will stay. > >> And so will the io_uring, for that matter, as we depend on the closing > >> of the files to do the final reap. > >> > >> Hmm, not sure how best to handle that, to be honest. We need some way to > >> break the loop, if the request never finishes. > > > > A wacky and dubious approach would be to, instead of taking a > > reference to the files_struct, abuse f_op->flush() to synchronously > > flush out pending requests with references to the files_struct... But > > it's probably a bad idea, given that in f_op->flush(), you can't > > easily tell which files_struct the close is coming from. I suppose you > > could keep a list of (fdtable, fd) pairs through which ACCEPT requests > > have come in and then let f_op->flush() probe whether the file > > pointers are gone from them... > > Got back to this after finishing the io-wq stuff, which we need for the > cancel. > > Here's an updated patch: > > http://git.kernel.dk/cgit/linux-block/commit/?h=for-5.5/io_uring-test&id=1ea847edc58d6a54ca53001ad0c656da57257570 > > that seems to work for me (lightly tested), we correctly find and cancel > work that is holding on to the file table. 
> > The full series sits on top of my for-5.5/io_uring-wq branch, and can be > viewed here: > > http://git.kernel.dk/cgit/linux-block/log/?h=for-5.5/io_uring-test > > Let me know what you think! Ah, I didn't realize that the second argument to f_op->flush is a pointer to the files_struct. That's neat. Security: There is no guarantee that ->flush() will run after the last io_uring_enter() finishes. You can race like this, with threads A and B in one process and C in another one: A: sends uring fd to C via unix domain socket A: starts syscall io_uring_enter(fd, ...) A: calls fdget(fd), takes reference to file B: starts syscall close(fd) B: fd table entry is removed B: f_op->flush is invoked and finds no pending transactions B: syscall close() returns A: continues io_uring_enter(), grabbing current->files A: io_uring_enter() returns A and B: exit worker: use-after-free access to files_struct I think the solution to this would be (unless you're fine with adding some broad global read-write mutex) something like this in __io_queue_sqe(), where "fd" and "f" are the variables from io_uring_enter(), plumbed through the stack somehow: if (req->flags & REQ_F_NEED_FILES) { rcu_read_lock(); spin_lock_irq(&ctx->inflight_lock); if (fcheck(fd) == f) { list_add(&req->inflight_list, &ctx->inflight_list); req->work.files = current->files; ret = 0; } else { ret = -EBADF; } spin_unlock_irq(&ctx->inflight_lock); rcu_read_unlock(); if (ret) goto put_req; } Minor note: If a process uses dup() to duplicate the uring fd, then closes the duplicated fd, that will cause work cancellations - but I guess that's fine? Style nit: I find it a bit confusing to name both the list head and the list member heads "inflight_list". Maybe name them "inflight_list" and "inflight_entry", or something like that? Correctness: Why is the wait in io_uring_flush() TASK_INTERRUPTIBLE? Shouldn't it be TASK_UNINTERRUPTIBLE? 
If someone sends a signal to the task while it's at that schedule(), it's just going to loop back around and retry what it was doing already, right? Security + Correctness: If there is more than one io_wqe, it seems to me that io_uring_flush() calls io_wq_cancel_work(), which calls io_wqe_cancel_work(), which may return IO_WQ_CANCEL_OK if the first request it looks at is pending. In that case, io_wq_cancel_work() will immediately return, and io_uring_flush() will also immediately return. It looks like any other requests will continue running?
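For illustration, the multi-wqe cancellation gap described above can be modeled in plain userspace C. All names here are hypothetical stubs, not the real io-wq API; the enum only mirrors the shape of the kernel's IO_WQ_CANCEL_* values:

```c
#include <assert.h>

/* Illustrative stub of the io-wq cancel return codes. */
enum cancel_ret { CANCEL_OK, CANCEL_RUNNING, CANCEL_NOTFOUND };

/* Like io_wq_cancel_work() as described above: cancel the FIRST
 * pending work found and return immediately, even if more matching
 * work remains queued. */
static enum cancel_ret cancel_one(int *pending, int n)
{
    for (int i = 0; i < n; i++) {
        if (pending[i]) {
            pending[i] = 0;
            return CANCEL_OK;
        }
    }
    return CANCEL_NOTFOUND;
}

/* Count how many matching requests are still queued. */
static int remaining(const int *pending, int n)
{
    int sum = 0;
    for (int i = 0; i < n; i++)
        sum += pending[i];
    return sum;
}

/* The fix the thread converges on: keep calling until NOTFOUND is
 * returned, so no matching request can survive the flush. */
static void cancel_until_notfound(int *pending, int n)
{
    while (cancel_one(pending, n) != CANCEL_NOTFOUND)
        ;
}
```

With four queued requests, a single cancel_one() call reaps only one and returns CANCEL_OK, which is exactly the early-return problem pointed out above; cancel_until_notfound() drains all of them.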
On 10/24/19 2:31 PM, Jann Horn wrote: > On Thu, Oct 24, 2019 at 9:41 PM Jens Axboe <axboe@kernel.dk> wrote: >> On 10/18/19 12:50 PM, Jann Horn wrote: >>> [...] >> >> Got back to this after finishing the io-wq stuff, which we need for the >> cancel. >> >> Here's an updated patch: >> >> http://git.kernel.dk/cgit/linux-block/commit/?h=for-5.5/io_uring-test&id=1ea847edc58d6a54ca53001ad0c656da57257570 >> >> that seems to work for me (lightly tested), we correctly find and cancel >> work that is holding on to the file table. 
>> >> The full series sits on top of my for-5.5/io_uring-wq branch, and can be >> viewed here: >> >> http://git.kernel.dk/cgit/linux-block/log/?h=for-5.5/io_uring-test >> >> Let me know what you think! > > Ah, I didn't realize that the second argument to f_op->flush is a > pointer to the files_struct. That's neat. > > > Security: There is no guarantee that ->flush() will run after the last > io_uring_enter() finishes. You can race like this, with threads A and > B in one process and C in another one: > > A: sends uring fd to C via unix domain socket > A: starts syscall io_uring_enter(fd, ...) > A: calls fdget(fd), takes reference to file > B: starts syscall close(fd) > B: fd table entry is removed > B: f_op->flush is invoked and finds no pending transactions > B: syscall close() returns > A: continues io_uring_enter(), grabbing current->files > A: io_uring_enter() returns > A and B: exit > worker: use-after-free access to files_struct > > I think the solution to this would be (unless you're fine with adding > some broad global read-write mutex) something like this in > __io_queue_sqe(), where "fd" and "f" are the variables from > io_uring_enter(), plumbed through the stack somehow: > > if (req->flags & REQ_F_NEED_FILES) { > rcu_read_lock(); > spin_lock_irq(&ctx->inflight_lock); > if (fcheck(fd) == f) { > list_add(&req->inflight_list, > &ctx->inflight_list); > req->work.files = current->files; > ret = 0; > } else { > ret = -EBADF; > } > spin_unlock_irq(&ctx->inflight_lock); > rcu_read_unlock(); > if (ret) > goto put_req; > } First of all, thanks for the thorough look at this! We already have f available here, it's req->file. And we just made a copy of the sqe, so we have sqe->fd available as well. I fixed this up. > Minor note: If a process uses dup() to duplicate the uring fd, then > closes the duplicated fd, that will cause work cancellations - but I > guess that's fine? I don't think that's a concern. 
> Style nit: I find it a bit confusing to name both the list head and > the list member heads "inflight_list". Maybe name them "inflight_list" > and "inflight_entry", or something like that? Fixed > Correctness: Why is the wait in io_uring_flush() TASK_INTERRUPTIBLE? > Shouldn't it be TASK_UNINTERRUPTIBLE? If someone sends a signal to the > task while it's at that schedule(), it's just going to loop back > around and retry what it was doing already, right? Fixed > Security + Correctness: If there is more than one io_wqe, it seems to > me that io_uring_flush() calls io_wq_cancel_work(), which calls > io_wqe_cancel_work(), which may return IO_WQ_CANCEL_OK if the first > request it looks at is pending. In that case, io_wq_cancel_work() will > immediately return, and io_uring_flush() will also immediately return. > It looks like any other requests will continue running? Ah good point, I missed that. We need to keep looping until we get NOTFOUND returned. Fixed as well. Also added cancellation if the task is going away. Here's the incremental patch, I'll resend with the full version. 
diff --git a/fs/io_uring.c b/fs/io_uring.c index 43ae0e04fd09..ec9dadfa90d2 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -326,7 +326,7 @@ struct io_kiocb { u32 result; u32 sequence; - struct list_head inflight_list; + struct list_head inflight_entry; struct io_wq_work work; }; @@ -688,7 +688,7 @@ static void __io_free_req(struct io_kiocb *req) unsigned long flags; spin_lock_irqsave(&ctx->inflight_lock, flags); - list_del(&req->inflight_list); + list_del(&req->inflight_entry); if (wq_has_sleeper(&ctx->inflight_wait)) wake_up(&ctx->inflight_wait); spin_unlock_irqrestore(&ctx->inflight_lock, flags); @@ -2329,6 +2329,24 @@ static int io_req_set_file(struct io_ring_ctx *ctx, const struct sqe_submit *s, return 0; } +static int io_grab_files(struct io_ring_ctx *ctx, struct io_kiocb *req, + struct io_uring_sqe *sqe) +{ + int ret = -EBADF; + + rcu_read_lock(); + spin_lock_irq(&ctx->inflight_lock); + if (fcheck(sqe->fd) == req->file) { + list_add(&req->inflight_entry, &ctx->inflight_list); + req->work.files = current->files; + ret = 0; + } + spin_unlock_irq(&ctx->inflight_lock); + rcu_read_unlock(); + + return ret; +} + static int __io_queue_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req, struct sqe_submit *s) { @@ -2349,23 +2367,24 @@ static int __io_queue_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req, s->sqe = sqe_copy; memcpy(&req->submit, s, sizeof(*s)); if (req->flags & REQ_F_NEED_FILES) { - spin_lock_irq(&ctx->inflight_lock); - list_add(&req->inflight_list, - &ctx->inflight_list); - req->work.files = current->files; - spin_unlock_irq(&ctx->inflight_lock); + ret = io_grab_files(ctx, req, sqe_copy); + if (ret) { + kfree(sqe_copy); + goto err; + } } - io_queue_async_work(ctx, req); /* * Queued up for async execution, worker will release * submit reference when the iocb is actually submitted. 
*/ + io_queue_async_work(ctx, req); return 0; } } /* drop submission reference */ +err: io_put_req(req, NULL); /* and drop final reference, if we failed */ @@ -3768,38 +3787,51 @@ static int io_uring_release(struct inode *inode, struct file *file) return 0; } -static int io_uring_flush(struct file *file, void *data) +static void io_uring_cancel_files(struct io_ring_ctx *ctx, + struct files_struct *files) { - struct io_ring_ctx *ctx = file->private_data; - enum io_wq_cancel ret; struct io_kiocb *req; DEFINE_WAIT(wait); -restart: - ret = IO_WQ_CANCEL_NOTFOUND; + while (!list_empty_careful(&ctx->inflight_list)) { + enum io_wq_cancel ret = IO_WQ_CANCEL_NOTFOUND; - spin_lock_irq(&ctx->inflight_lock); - list_for_each_entry(req, &ctx->inflight_list, inflight_list) { - if (req->work.files == data) { - ret = io_wq_cancel_work(ctx->io_wq, &req->work); - break; + spin_lock_irq(&ctx->inflight_lock); + list_for_each_entry(req, &ctx->inflight_list, inflight_entry) { + if (req->work.files == files) { + ret = io_wq_cancel_work(ctx->io_wq, &req->work); + break; + } } - } - if (ret == IO_WQ_CANCEL_RUNNING) - prepare_to_wait(&ctx->inflight_wait, &wait, TASK_INTERRUPTIBLE); + if (ret == IO_WQ_CANCEL_RUNNING) + prepare_to_wait(&ctx->inflight_wait, &wait, + TASK_UNINTERRUPTIBLE); - spin_unlock_irq(&ctx->inflight_lock); + spin_unlock_irq(&ctx->inflight_lock); - /* - * If it was found running, wait for one inflight request to finish. - * Retry loop after that, as it may not have been the one we were - * looking for. - */ - if (ret == IO_WQ_CANCEL_RUNNING) { + /* + * We need to keep going until we get NOTFOUND. We only cancel + * one work at the time. + * + * If we get CANCEL_RUNNING, then wait for a work to complete + * before continuing. 
+ */ + if (ret == IO_WQ_CANCEL_OK) + continue; + else if (ret != IO_WQ_CANCEL_RUNNING) + break; schedule(); - goto restart; - } + }; +} +static int io_uring_flush(struct file *file, void *data) +{ + struct io_ring_ctx *ctx = file->private_data; + + if (fatal_signal_pending(current) || (current->flags & PF_EXITING)) + io_wq_cancel_all(ctx->io_wq); + else + io_uring_cancel_files(ctx, data); return 0; }
On 10/24/19 4:04 PM, Jens Axboe wrote: > Also added cancellation if the task is going away. Here's the > incremental patch, I'll resend with the full version. Full version: http://git.kernel.dk/cgit/linux-block/commit/?h=for-5.5/io_uring-test&id=24f099c6dd49fdedec464c5a7c0dfe2ac5f500d0
On Fri, Oct 25, 2019 at 12:04 AM Jens Axboe <axboe@kernel.dk> wrote: > On 10/24/19 2:31 PM, Jann Horn wrote: > > On Thu, Oct 24, 2019 at 9:41 PM Jens Axboe <axboe@kernel.dk> wrote: > >> [...] > >> Got back to this after finishing the io-wq stuff, which we need for the > >> cancel. 
> >> > >> Here's an updated patch: > >> > >> http://git.kernel.dk/cgit/linux-block/commit/?h=for-5.5/io_uring-test&id=1ea847edc58d6a54ca53001ad0c656da57257570 > >> > >> that seems to work for me (lightly tested), we correctly find and cancel > >> work that is holding on to the file table. > >> > >> The full series sits on top of my for-5.5/io_uring-wq branch, and can be > >> viewed here: > >> > >> http://git.kernel.dk/cgit/linux-block/log/?h=for-5.5/io_uring-test > >> > >> Let me know what you think! > > > > Ah, I didn't realize that the second argument to f_op->flush is a > > pointer to the files_struct. That's neat. > > > > > > Security: There is no guarantee that ->flush() will run after the last > > io_uring_enter() finishes. You can race like this, with threads A and > > B in one process and C in another one: > > > > A: sends uring fd to C via unix domain socket > > A: starts syscall io_uring_enter(fd, ...) > > A: calls fdget(fd), takes reference to file > > B: starts syscall close(fd) > > B: fd table entry is removed > > B: f_op->flush is invoked and finds no pending transactions > > B: syscall close() returns > > A: continues io_uring_enter(), grabbing current->files > > A: io_uring_enter() returns > > A and B: exit > > worker: use-after-free access to files_struct > > > > I think the solution to this would be (unless you're fine with adding > > some broad global read-write mutex) something like this in > > __io_queue_sqe(), where "fd" and "f" are the variables from > > io_uring_enter(), plumbed through the stack somehow: > > > > if (req->flags & REQ_F_NEED_FILES) { > > rcu_read_lock(); > > spin_lock_irq(&ctx->inflight_lock); > > if (fcheck(fd) == f) { > > list_add(&req->inflight_list, > > &ctx->inflight_list); > > req->work.files = current->files; > > ret = 0; > > } else { > > ret = -EBADF; > > } > > spin_unlock_irq(&ctx->inflight_lock); > > rcu_read_unlock(); > > if (ret) > > goto put_req; > > } > > First of all, thanks for the thorough look at this! 
> We already have f available here, it's req->file. And we just made a copy of the sqe, so we have sqe->fd available as well. I fixed this up. sqe->fd is the file descriptor we're doing I/O on, not the file descriptor of the uring file, right? Same thing for req->file. This check only detects whether the fd we're doing I/O on was closed, which is irrelevant. > > Security + Correctness: If there is more than one io_wqe, it seems to > > me that io_uring_flush() calls io_wq_cancel_work(), which calls > > io_wqe_cancel_work(), which may return IO_WQ_CANCEL_OK if the first > > request it looks at is pending. In that case, io_wq_cancel_work() will > > immediately return, and io_uring_flush() will also immediately return. > > It looks like any other requests will continue running? > > Ah good point, I missed that. We need to keep looping until we get > NOTFOUND returned. Fixed as well. > > Also added cancellation if the task is going away. Here's the > incremental patch, I'll resend with the full version. [...] > +static int io_uring_flush(struct file *file, void *data) > +{ > + struct io_ring_ctx *ctx = file->private_data; > + > + if (fatal_signal_pending(current) || (current->flags & PF_EXITING)) > + io_wq_cancel_all(ctx->io_wq); Looking at io_wq_cancel_all(), this will just send a signal to the task without waiting for anything, right? Isn't that unsafe? > + else > + io_uring_cancel_files(ctx, data); > return 0; > }
On 10/24/19 5:13 PM, Jann Horn wrote: > On Fri, Oct 25, 2019 at 12:04 AM Jens Axboe <axboe@kernel.dk> wrote: >> On 10/24/19 2:31 PM, Jann Horn wrote: >>> [...] >> >> First of all, thanks for the thorough look at this! 
>> We already have f available here, it's req->file. And we just made a copy of the sqe, so >> we have sqe->fd available as well. I fixed this up. > > sqe->fd is the file descriptor we're doing I/O on, not the file > descriptor of the uring file, right? Same thing for req->file. This > check only detects whether the fd we're doing I/O on was closed, which > is irrelevant. Duh yes, I'm an idiot. Easily fixable, I'll update this for the ring fd. >>> Security + Correctness: If there is more than one io_wqe, it seems to >>> me that io_uring_flush() calls io_wq_cancel_work(), which calls >>> io_wqe_cancel_work(), which may return IO_WQ_CANCEL_OK if the first >>> request it looks at is pending. In that case, io_wq_cancel_work() will >>> immediately return, and io_uring_flush() will also immediately return. >>> It looks like any other requests will continue running? >> >> Ah good point, I missed that. We need to keep looping until we get >> NOTFOUND returned. Fixed as well. >> >> Also added cancellation if the task is going away. Here's the >> incremental patch, I'll resend with the full version. > [...] >> +static int io_uring_flush(struct file *file, void *data) >> +{ >> + struct io_ring_ctx *ctx = file->private_data; >> + >> + if (fatal_signal_pending(current) || (current->flags & PF_EXITING)) >> + io_wq_cancel_all(ctx->io_wq); > > Looking at io_wq_cancel_all(), this will just send a signal to the > task without waiting for anything, right? Isn't that unsafe? Yes, that's a logic error, we should always do the io_uring_cancel_files(). Ala: io_uring_cancel_files(); if (fatal_signal_pending(current) || (current->flags & PF_EXITING)) io_wq_cancel_all(ctx->io_wq); Thanks!
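For illustration, the ordering fix sketched above ("we should always do the io_uring_cancel_files()") can be modeled in userspace C. cancel_files() and cancel_all() are hypothetical stubs standing in for io_uring_cancel_files() and io_wq_cancel_all(), with the one property that matters here: only the former waits for work to drain, the latter just signals and returns.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative stub state, not kernel data structures. */
struct wq {
    int files_refs;     /* requests still holding a files_struct ref */
    bool signalled;     /* set by cancel_all(), which does not wait */
};

static void cancel_files(struct wq *wq) { wq->files_refs = 0; }
static void cancel_all(struct wq *wq)   { wq->signalled = true; }

/* Original logic: on task exit only cancel_all() ran, so flush could
 * return while file-table references were still outstanding. */
static int flush_buggy(struct wq *wq, bool exiting)
{
    if (exiting)
        cancel_all(wq);
    else
        cancel_files(wq);
    return wq->files_refs;
}

/* Fixed ordering: always drain file-table requests first, then
 * signal the rest if the task is going away. */
static int flush_fixed(struct wq *wq, bool exiting)
{
    cancel_files(wq);
    if (exiting)
        cancel_all(wq);
    return wq->files_refs;
}
```

On the exit path, flush_buggy() returns with the modeled references still outstanding, while flush_fixed() drains them before signalling, which is the behavior the incremental patch below adopts.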
On 10/24/19 6:35 PM, Jens Axboe wrote: > On 10/24/19 5:13 PM, Jann Horn wrote: >> On Fri, Oct 25, 2019 at 12:04 AM Jens Axboe <axboe@kernel.dk> wrote: >>> On 10/24/19 2:31 PM, Jann Horn wrote: >>>> [...]
>>> First of all, thanks for the thorough look at this! We already have f >>> available here, it's req->file. And we just made a copy of the sqe, so >>> we have sqe->fd available as well. I fixed this up. >> >> sqe->fd is the file descriptor we're doing I/O on, not the file >> descriptor of the uring file, right? Same thing for req->file. This >> check only detects whether the fd we're doing I/O on was closed, which >> is irrelevant. > > Duh yes, I'm an idiot. Easily fixable, I'll update this for the ring fd. Incremental: diff --git a/fs/io_uring.c b/fs/io_uring.c index ec9dadfa90d2..4d94886a3d13 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -262,11 +262,13 @@ struct io_ring_ctx { struct sqe_submit { const struct io_uring_sqe *sqe; + struct file *ring_file; unsigned short index; bool has_user : 1; bool in_async : 1; bool needs_fixed_file : 1; u32 sequence; + int ring_fd; }; /* @@ -2329,14 +2331,13 @@ static int io_req_set_file(struct io_ring_ctx *ctx, const struct sqe_submit *s, return 0; } -static int io_grab_files(struct io_ring_ctx *ctx, struct io_kiocb *req, - struct io_uring_sqe *sqe) +static int io_grab_files(struct io_ring_ctx *ctx, struct io_kiocb *req) { int ret = -EBADF; rcu_read_lock(); spin_lock_irq(&ctx->inflight_lock); - if (fcheck(sqe->fd) == req->file) { + if (fcheck(req->submit.ring_fd) == req->submit.ring_file) { list_add(&req->inflight_entry, &ctx->inflight_list); req->work.files = current->files; ret = 0; @@ -2367,7 +2368,7 @@ static int __io_queue_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req, s->sqe = sqe_copy; memcpy(&req->submit, s, sizeof(*s)); if (req->flags & REQ_F_NEED_FILES) { - ret = io_grab_files(ctx, req, sqe_copy); + ret = io_grab_files(ctx, req); if (ret) { kfree(sqe_copy); goto err; @@ -2585,6 +2586,7 @@ static bool io_get_sqring(struct io_ring_ctx *ctx, struct sqe_submit *s) head = READ_ONCE(sq_array[head & ctx->sq_mask]); if (head < ctx->sq_entries) { + s->ring_file = NULL; s->index = head; s->sqe = &ctx->sq_sqes[head]; 
s->sequence = ctx->cached_sq_head; @@ -2782,7 +2784,8 @@ static int io_sq_thread(void *data) return 0; } -static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit) +static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit, + struct file *ring_file, int ring_fd) { struct io_submit_state state, *statep = NULL; struct io_kiocb *link = NULL; @@ -2824,9 +2827,11 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit) } out: + s.ring_file = ring_file; s.has_user = true; s.in_async = false; s.needs_fixed_file = false; + s.ring_fd = ring_fd; submit++; trace_io_uring_submit_sqe(ctx, true, false); io_submit_sqe(ctx, &s, statep, &link); @@ -3828,10 +3833,9 @@ static int io_uring_flush(struct file *file, void *data) { struct io_ring_ctx *ctx = file->private_data; + io_uring_cancel_files(ctx, data); if (fatal_signal_pending(current) || (current->flags & PF_EXITING)) io_wq_cancel_all(ctx->io_wq); - else - io_uring_cancel_files(ctx, data); return 0; } @@ -3903,7 +3907,7 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit, to_submit = min(to_submit, ctx->sq_entries); mutex_lock(&ctx->uring_lock); - submitted = io_ring_submit(ctx, to_submit); + submitted = io_ring_submit(ctx, to_submit, f.file, fd); mutex_unlock(&ctx->uring_lock); } if (flags & IORING_ENTER_GETEVENTS) {
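For illustration, the ring_fd/ring_file re-check that io_grab_files() performs in the diff above can be sketched in userspace C. fd_lookup() is a hypothetical stand-in for the kernel's fcheck(); none of this is kernel code, and the RCU/spinlock protection around the real check is omitted.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-ins for kernel types. */
struct file { int id; };

#define NFD 8
static struct file *fdtable[NFD];   /* models the task's fd table */

/* Models fcheck(): resolve an fd to its current file, or NULL. */
static struct file *fd_lookup(int fd)
{
    return (fd >= 0 && fd < NFD) ? fdtable[fd] : NULL;
}

/* Returns 0 if the request may record current->files, or -1 (the
 * moral equivalent of -EBADF) if the ring fd was closed or replaced
 * between io_uring_enter() and the punt to async context. */
static int grab_files(int ring_fd, struct file *ring_file)
{
    if (fd_lookup(ring_fd) != ring_file)
        return -1;
    return 0;
}
```

This is the property that closes the race from earlier in the thread: once another thread's close() has removed the fd table entry, the grab fails instead of pinning a files_struct that ->flush() has already inspected.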
diff --git a/fs/io_uring.c b/fs/io_uring.c index 635856023fdf..ad462237275e 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -267,10 +267,11 @@ struct io_ring_ctx { struct sqe_submit { const struct io_uring_sqe *sqe; unsigned short index; + bool has_user : 1; + bool in_async : 1; + bool needs_fixed_file : 1; u32 sequence; - bool has_user; - bool in_async; - bool needs_fixed_file; + struct files_struct *files; }; /* @@ -323,6 +324,7 @@ struct io_kiocb { #define REQ_F_FAIL_LINK 256 /* fail rest of links */ #define REQ_F_SHADOW_DRAIN 512 /* link-drain shadow req */ #define REQ_F_TIMEOUT 1024 /* timeout request */ +#define REQ_F_NEED_FILES 2048 /* needs to assume file table */ u64 user_data; u32 result; u32 sequence; @@ -2191,6 +2193,7 @@ static inline bool io_sqe_needs_user(const struct io_uring_sqe *sqe) static void io_sq_wq_submit_work(struct work_struct *work) { struct io_kiocb *req = container_of(work, struct io_kiocb, work); + struct files_struct *old_files = NULL; struct io_ring_ctx *ctx = req->ctx; struct mm_struct *cur_mm = NULL; struct async_list *async_list; @@ -2220,6 +2223,10 @@ static void io_sq_wq_submit_work(struct work_struct *work) set_fs(USER_DS); } } + if (s->files && !old_files) { + old_files = current->files; + current->files = s->files; + } if (!ret) { s->has_user = cur_mm != NULL; @@ -2312,6 +2319,11 @@ static void io_sq_wq_submit_work(struct work_struct *work) unuse_mm(cur_mm); mmput(cur_mm); } + if (old_files) { + struct files_struct *files = current->files; + current->files = old_files; + put_files_struct(files); + } } /* @@ -2413,6 +2425,8 @@ static int __io_queue_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req, s->sqe = sqe_copy; memcpy(&req->submit, s, sizeof(*s)); + if (req->flags & REQ_F_NEED_FILES) + req->submit.files = get_files_struct(current); list = io_async_list_from_sqe(ctx, s->sqe); if (!io_add_to_prev_work(list, req)) { if (list) @@ -2633,6 +2647,7 @@ static bool io_get_sqring(struct io_ring_ctx *ctx, struct sqe_submit *s) 
s->index = head; s->sqe = &ctx->sq_sqes[head]; s->sequence = ctx->cached_sq_head; + s->files = NULL; ctx->cached_sq_head++; return true; }
This is in preparation for adding opcodes that need to modify files in a process file table, either adding new ones or closing old ones. If an opcode needs this, it must set REQ_F_NEED_FILES in the request structure. If work that needs to get punted to async context has this set, it will grab a reference to the process file table. When the work is completed, the reference is dropped again. Signed-off-by: Jens Axboe <axboe@kernel.dk> --- fs/io_uring.c | 21 ++++++++++++++++++--- 1 file changed, 18 insertions(+), 3 deletions(-)