Message ID: 20211019134204.3382645-1-agruenba@redhat.com
Series: gfs2: Fix mmap + page fault deadlocks
On Tue, Oct 19, 2021 at 3:42 AM Andreas Gruenbacher <agruenba@redhat.com> wrote:
>
> From my point of view, the following questions remain:
>
>  * I hope these patches will be merged for v5.16, but what process
>    should I follow for that? The patch queue contains mm and iomap
>    changes, so a pull request from the gfs2 tree would be unusual.

Oh, I'd much rather get these as one pull request from the author and
from the person that actually ended up testing this.

It might be "unusual", but it's certainly not unheard of, and trying to
push different parts of the series through different maintainers would
just cause lots of extra churn.

Yes, normally I'd expect filesystem changes to have a diffstat that
clearly shows that "yes, it's all local to this filesystem", and when I
see anything else it raises red flags.

But it raises red flags not because it would be wrong to have changes
to other parts, but simply because when cross-subsystem development
happens, it needs to be discussed and cleared with people. And you've
done that.

So I'd take this as one pull request from you. You've been doing the
work, you get the questionable glory of being in charge of it all.

You'll get the blame too ;)

>  * Will Catalin Marinas's work for supporting arm64 sub-page faults
>    be queued behind these patches? We have an overlap in
>    fault_in_[pages_]readable and fault_in_[pages_]writeable, so one
>    of the two patch queues will need some adjustments.

I think that on the whole they should be developed separately, I don't
think it's going to be a particularly difficult conflict.

That whole discussion does mean that I suspect that we'll have to
change fault_in_iov_iter_writeable() to do the "every 16 bytes" or
whatever thing, and make it use an actual atomic "add zero" or whatever
rather than walk the page tables. But that's a conceptually separate
discussion from this one, I wouldn't actually want to mix up the two
issues too much.

Sure, they touch the same code, so there is _that_ overlap, but one is
about "the hardware rules are a-changing" and the other is about
filesystem use of - and expansion of - the things we do.

Let's keep them separate until ready, and then fix up the fallout at
that point (either as a merge resolution, or even partly
after-the-fact).

          Linus
On 10/19/21 10:40 AM, Linus Torvalds wrote:
> On Tue, Oct 19, 2021 at 3:42 AM Andreas Gruenbacher <agruenba@redhat.com> wrote:
>>
>> From my point of view, the following questions remain:
>>
>>  * I hope these patches will be merged for v5.16, but what process
>>    should I follow for that? The patch queue contains mm and iomap
>>    changes, so a pull request from the gfs2 tree would be unusual.
>
> Oh, I'd much rather get these as one pull request from the author and
> from the person that actually ended up testing this.

Hi Linus,

FWIW, I've been working with Andreas on this and have tested it quite
extensively, although only with gfs2. I've tested it with numerous
scenarios, both stand-alone (xfstests as well as several other test
programs I have in my collection) and in a cluster with some very heavy
duty cluster coherency tests. My testing is nearly complete, but not
quite.

Regards,

Bob Peterson
GFS2 File System
On Tue, Oct 19, 2021 at 05:40:13AM -1000, Linus Torvalds wrote:
> On Tue, Oct 19, 2021 at 3:42 AM Andreas Gruenbacher <agruenba@redhat.com> wrote:
> >  * Will Catalin Marinas's work for supporting arm64 sub-page faults
> >    be queued behind these patches? We have an overlap in
> >    fault_in_[pages_]readable and fault_in_[pages_]writeable, so one
> >    of the two patch queues will need some adjustments.
>
> I think that on the whole they should be developed separately, I don't
> think it's going to be a particularly difficult conflict.
>
> That whole discussion does mean that I suspect that we'll have to
> change fault_in_iov_iter_writeable() to do the "every 16 bytes" or
> whatever thing, and make it use an actual atomic "add zero" or
> whatever rather than walk the page tables. But that's a conceptually
> separate discussion from this one, I wouldn't actually want to mix up
> the two issues too much.

I agree we shouldn't mix the two at the moment. The MTE fix requires
some more thinking and it's not 5.16 material yet.

The atomic "add zero" trick isn't that simple for MTE since the arm64
atomic or exclusive instructions run with kernel privileges and
therefore with the kernel tag checking mode. We could toggle the mode
to match the user's just for those atomic ops, but that would make this
probing even more expensive (though normally it's done on the slow
path).

The quick/backportable fix for MTE is probably to just disable tag
checking on user addresses during pagefault_disabled(). As I mentioned
in the other thread, a more elaborate fix I think is to change the
uaccess routines to update an error code somewhere, in a similar way to
the arm64 __put_user_error(). But that would require changing lots of
callers.
On Wed, Oct 20, 2021 at 6:37 AM Catalin Marinas <catalin.marinas@arm.com> wrote:
>
> The atomic "add zero" trick isn't that simple for MTE since the arm64
> atomic or exclusive instructions run with kernel privileges and
> therefore with the kernel tag checking mode.

Are there any instructions that are useful for "probe_user_write()"
kind of thing?

We could always just add that as an arch function, with a fallback to
using the futex "add zero" if the architecture doesn't need anything
special.

Although at least for MTE, I think the solution was to do a regular
read, and that checks the tag, and then we could use the gup machinery
for the writability checks.

          Linus
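For reference, a minimal sketch of what such an arch hook with a futex
"add zero" fallback could look like. probe_user_write() is a
hypothetical name, not an existing kernel API, and the fallback assumes
a naturally aligned u32, as the futex code does:

	#include <linux/futex.h>
	#include <asm/futex.h>

	#ifndef probe_user_write
	/*
	 * Hypothetical generic fallback: atomically "add zero" via the
	 * futex machinery, forcing a write-access (and, on MTE, a tag)
	 * check without modifying the data. uaddr must be u32-aligned.
	 */
	static inline int probe_user_write(u32 __user *uaddr)
	{
		int old;

		/* Returns 0 on success, -EFAULT if the write would fault. */
		return arch_futex_atomic_op_inuser(FUTEX_OP_ADD, 0, &old, uaddr);
	}
	#endif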
On Wed, Oct 20, 2021 at 10:11:19AM -1000, Linus Torvalds wrote:
> On Wed, Oct 20, 2021 at 6:37 AM Catalin Marinas <catalin.marinas@arm.com> wrote:
> > The atomic "add zero" trick isn't that simple for MTE since the
> > arm64 atomic or exclusive instructions run with kernel privileges
> > and therefore with the kernel tag checking mode.
>
> Are there any instructions that are useful for "probe_user_write()"
> kind of thing?

If it's on a user address, the only single instruction that works with
MTE is STTR (as in put_user()), but that's destructive. Other "add
zero" constructs require some potentially expensive system register
accesses just to set the tag checking mode of the current task.

A probe_user_write() on the kernel linear address involves reading the
tag from memory and comparing it with the tag in the user pointer. In
addition, it needs to take into account the current task's tag checking
mode and the vma vm_flags. We should have most of the information in
the gup code.

> Although at least for MTE, I think the solution was to do a regular
> read, and that checks the tag, and then we could use the gup machinery
> for the writability checks.

Yes, for MTE this should work. For CHERI I think an "add zero" would do
the trick (it should have atomics that work on capabilities directly).
However, with MTE, doing both a get_user() every 16 bytes and gup can
get pretty expensive. The problematic code is fault_in_safe_writable()
in this series.

I can give this 16-byte probing in gup a try (on top of -next), but
IMHO we unnecessarily overload the fault_in_*() logic with something
the kernel cannot fix up. The only reason we do it is so that we get an
error code and bail out of a loop, but the uaccess routines could be
extended to report the fault type instead. It looks like we pretty much
duplicate the uaccess in the fault_in_*() functions (four accesses per
cache line).
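To make the cost concrete: the probing described above would turn the
page-granular loop of fault_in_readable() in mm/gup.c into something
like the following sketch (hypothetical helper name; 16 bytes is the
MTE tag granule):

	size_t fault_in_subpage_readable(const char __user *uaddr, size_t size)
	{
		const char __user *start = uaddr;
		const char __user *end = uaddr + size;
		volatile char c;

		if (unlikely(size == 0))
			return 0;
		if (!user_read_access_begin(uaddr, size))
			return size;
		/* One probe per 16-byte MTE tag granule, not per page. */
		while (uaddr < end) {
			unsafe_get_user(c, uaddr, out);
			uaddr += 16;
		}
	out:
		user_read_access_end();
		if (size > (size_t)(uaddr - start))
			return size - (uaddr - start);
		return 0;
	}

That is 256 probes per 4K page instead of one, which is where the
benchmark penalty discussed below comes from.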
On Wed, Oct 20, 2021 at 12:44 PM Catalin Marinas <catalin.marinas@arm.com> wrote:
>
> However, with MTE, doing both a get_user() every 16 bytes and gup can
> get pretty expensive.

So I really think that anything that is performance-critical had better
only do the "fault_in_write()" code path in the cold error path where
you took a page fault.

IOW, the loop structure should be

	repeat:
		take_locks();
		pagefault_disable();
		ret = do_things_with_locks();
		pagefault_enable();
		release_locks();

		// Failure path?
		if (unlikely(ret == -EFAULT)) {
			if (!fault_in_writable(..))
				goto repeat;
			return -EFAULT;
		}

rather than have it be some kind of "first do fault_in_writable(), then
do the work" model.

So I wouldn't worry too much about the performance concerns. It simply
shouldn't be a common or hot path.

And yes, I've seen code that does that "fault_in_xyz()" before the
critical operation that cannot take page faults, and does it
unconditionally.

But then it isn't the "fault_in_xyz()" that should be blamed if it is
slow, but the caller that does things the wrong way around.

          Linus
On Wed, Oct 20, 2021 at 08:19:40PM -1000, Linus Torvalds wrote:
> On Wed, Oct 20, 2021 at 12:44 PM Catalin Marinas
> <catalin.marinas@arm.com> wrote:
> >
> > However, with MTE, doing both a get_user() every 16 bytes and gup
> > can get pretty expensive.
>
> So I really think that anything that is performance-critical had
> better only do the "fault_in_write()" code path in the cold error path
> where you took a page fault.
[...]
> So I wouldn't worry too much about the performance concerns. It simply
> shouldn't be a common or hot path.
>
> And yes, I've seen code that does that "fault_in_xyz()" before the
> critical operation that cannot take page faults, and does it
> unconditionally.
>
> But then it isn't the "fault_in_xyz()" that should be blamed if it is
> slow, but the caller that does things the wrong way around.

Some more thinking out loud. I did some unscientific benchmarks on a
Raspberry Pi 4 with the filesystem in a RAM block device and a
"dd if=/dev/zero of=/mnt/test" writing 512MB in 1MB blocks. I changed
fault_in_readable() in linux-next to probe every 16 bytes:

- ext4 drops from around 261MB/s to 246MB/s: 5.7% penalty

- btrfs drops from around 360MB/s to 337MB/s: 6.4% penalty

For generic_perform_write(), Dave Hansen attempted to move the fault-in
after the uaccess in commit 998ef75ddb57 ("fs: do not prefault
sys_write() user buffer pages"). This was reverted as it was exposing
an ext4 bug. I don't know whether it was fixed, but re-applying Dave's
commit avoids the performance drop.

btrfs_buffered_write() has a comment about faulting pages in before
locking them in prepare_pages(). I suspect it's a similar problem and
the fault_in() could be moved, though I can't say I understand this
code well enough.

Probing only the first byte(s) in fault_in() would be ideal, no need to
go through all filesystems and try to change the uaccess/probing order.
On Fri, Oct 22, 2021 at 8:06 AM Catalin Marinas <catalin.marinas@arm.com> wrote:
>
> Probing only the first byte(s) in fault_in() would be ideal, no need
> to go through all filesystems and try to change the uaccess/probing
> order.

Let's try that. Or rather: probing just the first page - since there
are users like that btrfs ioctl, and the direct-io path.

          Linus
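A sketch of that clamping, assuming a hypothetical wrapper around the
existing fault_in_readable():

	/* Hypothetical helper: fault in at most the first page of the range. */
	static inline size_t fault_in_first_page_readable(const char __user *uaddr,
							  size_t size)
	{
		return fault_in_readable(uaddr, min_t(size_t, size, PAGE_SIZE));
	}

A caller would retry the copy whenever the return value is less than
the clamped length, i.e. whenever at least something could be faulted
in.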
On Fri, Oct 22, 2021 at 8:07 PM Catalin Marinas <catalin.marinas@arm.com> wrote:
> On Wed, Oct 20, 2021 at 08:19:40PM -1000, Linus Torvalds wrote:
> > On Wed, Oct 20, 2021 at 12:44 PM Catalin Marinas
> > <catalin.marinas@arm.com> wrote:
> > >
> > > However, with MTE, doing both a get_user() every 16 bytes and gup
> > > can get pretty expensive.
> >
> > So I really think that anything that is performance-critical had
> > better only do the "fault_in_write()" code path in the cold error
> > path where you took a page fault.
> [...]
> > So I wouldn't worry too much about the performance concerns. It
> > simply shouldn't be a common or hot path.
> >
> > And yes, I've seen code that does that "fault_in_xyz()" before the
> > critical operation that cannot take page faults, and does it
> > unconditionally.
> >
> > But then it isn't the "fault_in_xyz()" that should be blamed if it
> > is slow, but the caller that does things the wrong way around.
>
> Some more thinking out loud. I did some unscientific benchmarks on a
> Raspberry Pi 4 with the filesystem in a RAM block device and a
> "dd if=/dev/zero of=/mnt/test" writing 512MB in 1MB blocks. I changed
> fault_in_readable() in linux-next to probe every 16 bytes:
>
> - ext4 drops from around 261MB/s to 246MB/s: 5.7% penalty
>
> - btrfs drops from around 360MB/s to 337MB/s: 6.4% penalty
>
> For generic_perform_write(), Dave Hansen attempted to move the
> fault-in after the uaccess in commit 998ef75ddb57 ("fs: do not
> prefault sys_write() user buffer pages"). This was reverted as it was
> exposing an ext4 bug. I don't know whether it was fixed, but
> re-applying Dave's commit avoids the performance drop.

Interesting. The revert of commit 998ef75ddb57 is in commit
00a3d660cbac. Maybe Dave and Ted can tell us more about what went wrong
in ext4 and whether it's still an issue.

Commit 998ef75ddb57 looks mostly good, except that it should loop
around whenever the fault-in succeeds even partially, so it needs the
semantic change of patch 4 [*] of this series. A copy of the same code
now lives in iomap_write_iter, so the same fix needs to be applied
there.

Finally, it may be worthwhile to check for pagefault_disabled() in
generic_perform_write and iomap_write_iter before trying the fault-in;
this would help gfs2, which will always call into iomap_write_iter with
page faults disabled, and additional callers like that could emerge
relatively soon.

[*] https://lore.kernel.org/lkml/20211019134204.3382645-5-agruenba@redhat.com/

> btrfs_buffered_write() has a comment about faulting pages in before
> locking them in prepare_pages(). I suspect it's a similar problem and
> the fault_in() could be moved, though I can't say I understand this
> code well enough.
>
> Probing only the first byte(s) in fault_in() would be ideal, no need
> to go through all filesystems and try to change the uaccess/probing
> order.

Thanks,
Andreas
On Fri, Oct 22, 2021 at 9:23 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Fri, Oct 22, 2021 at 8:06 AM Catalin Marinas <catalin.marinas@arm.com> wrote:
> > Probing only the first byte(s) in fault_in() would be ideal, no need
> > to go through all filesystems and try to change the uaccess/probing
> > order.
>
> Let's try that. Or rather: probing just the first page - since there
> are users like that btrfs ioctl, and the direct-io path.

For direct I/O, we actually only want to trigger page fault-in so that
we can grab page references with bio_iov_iter_get_pages. Probing for
sub-page error domains will only slow things down. If we hit -EFAULT
during the actual copy-in or copy-out, we know that the error can't be
page fault related.

Similarly, in the buffered I/O case, we only really care about the next
byte, so any probing beyond that is unnecessary.

So maybe we should split the sub-page error domain probing off from the
fault-in functions. Or at least add an argument to the fault-in
functions that specifies the amount of memory to probe.

Thanks,
Andreas
On Mon, Oct 25, 2021 at 08:24:26PM +0200, Andreas Gruenbacher wrote:
> > For generic_perform_write(), Dave Hansen attempted to move the
> > fault-in after the uaccess in commit 998ef75ddb57 ("fs: do not
> > prefault sys_write() user buffer pages"). This was reverted as it
> > was exposing an ext4 bug. I don't know whether it was fixed, but
> > re-applying Dave's commit avoids the performance drop.
>
> Interesting. The revert of commit 998ef75ddb57 is in commit
> 00a3d660cbac. Maybe Dave and Ted can tell us more about what went
> wrong in ext4 and whether it's still an issue.

The context for the revert can be found here[1].

[1] https://lore.kernel.org/lkml/20151005152236.GA8140@thunk.org/

And "what went wrong in ext4" was fixed here[2].

[2] https://lore.kernel.org/lkml/20151005152236.GA8140@thunk.org/

which landed upstream as commit b90197b65518 ("ext4: use private
version of page_zero_new_buffers() for data=journal mode").

So it looks like the original issue which triggered the revert in 2015
should be addressed, and we can easily test it by using generic/208
with data=journal mode.

There also seems to be a related discussion about whether we should
unrevert 998ef75ddb57 here[3]. Hmm, there is a mention on that thread
in [3]:

"Side note: search for "iov_iter_fault_in_writeable()" on lkml for a
gfs2 patch-series that is buggy, exactly because it does *not* use the
atomic user space accesses, and just tries to do the fault-in to hide
the real bug."

I assume that's related to the discussion on this thread?

[3] https://lore.kernel.org/all/3221175.1624375240@warthog.procyon.org.uk/T/#u

     - Ted
Ted,
here's an updated version of Dave Hansen's original commit, but note
that generic/208 won't run on ext4 with data journaling enabled:
$ MOUNT_OPTIONS='-o data=journal' TEST_DIR=/mnt/test TEST_DEV=/dev/vdb ./tests/generic/208
QA output created by 208
208 not run: ext4 data journaling doesn't support O_DIRECT
Thanks,
Andreas
--
Based on commit 998ef75ddb57 ("fs: do not prefault sys_write() user
buffer pages") by Dave Hansen <dave.hansen@linux.intel.com>, but:
* Fix generic_perform_write as well as iomap_write_iter.
* copy_page_from_iter_atomic() doesn't trigger page faults, so there's no need
to disable page faults around it [see commit 9e8c2af96e0d ("callers of
iov_copy_from_user_atomic() don't need pagecache_disable()")].
* If fault_in_iov_iter_readable() fails to fault in the entire buffer,
we still want to read everything up to the fault position. This depends on
commit a6294593e8a1 ("iov_iter: Turn iov_iter_fault_in_readable into
fault_in_iov_iter_readable").
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
---
fs/iomap/buffered-io.c | 20 +++++++-------------
mm/filemap.c | 20 +++++++-------------
2 files changed, 14 insertions(+), 26 deletions(-)
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 1753c26c8e76..d8809cd9ab31 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -744,17 +744,6 @@ static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
if (bytes > length)
bytes = length;
- /*
- * Bring in the user page that we'll copy from _first_.
- * Otherwise there's a nasty deadlock on copying from the
- * same page as we're writing to, without it being marked
- * up-to-date.
- */
- if (unlikely(fault_in_iov_iter_readable(i, bytes))) {
- status = -EFAULT;
- break;
- }
-
status = iomap_write_begin(iter, pos, bytes, &page);
if (unlikely(status))
break;
@@ -777,9 +766,14 @@ static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
* halfway through, might be a race with munmap,
* might be severe memory pressure.
*/
- if (copied)
+ if (copied) {
bytes = copied;
- goto again;
+ goto again;
+ }
+ if (fault_in_iov_iter_readable(i, bytes) != bytes)
+ goto again;
+ status = -EFAULT;
+ break;
}
pos += status;
written += status;
diff --git a/mm/filemap.c b/mm/filemap.c
index 4dd5edcd39fd..467cdb7d086d 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3751,17 +3751,6 @@ ssize_t generic_perform_write(struct file *file,
iov_iter_count(i));
again:
- /*
- * Bring in the user page that we will copy from _first_.
- * Otherwise there's a nasty deadlock on copying from the
- * same page as we're writing to, without it being marked
- * up-to-date.
- */
- if (unlikely(fault_in_iov_iter_readable(i, bytes))) {
- status = -EFAULT;
- break;
- }
-
if (fatal_signal_pending(current)) {
status = -EINTR;
break;
@@ -3794,9 +3783,14 @@ ssize_t generic_perform_write(struct file *file,
* halfway through, might be a race with munmap,
* might be severe memory pressure.
*/
- if (copied)
+ if (copied) {
bytes = copied;
- goto again;
+ goto again;
+ }
+ if (fault_in_iov_iter_readable(i, bytes) != bytes)
+ goto again;
+ status = -EFAULT;
+ break;
}
pos += status;
written += status;
On Mon, Oct 25, 2021 at 09:00:43PM +0200, Andreas Gruenbacher wrote:
> On Fri, Oct 22, 2021 at 9:23 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> > On Fri, Oct 22, 2021 at 8:06 AM Catalin Marinas <catalin.marinas@arm.com> wrote:
> > > Probing only the first byte(s) in fault_in() would be ideal, no
> > > need to go through all filesystems and try to change the
> > > uaccess/probing order.
> >
> > Let's try that. Or rather: probing just the first page - since there
> > are users like that btrfs ioctl, and the direct-io path.
>
> For direct I/O, we actually only want to trigger page fault-in so that
> we can grab page references with bio_iov_iter_get_pages. Probing for
> sub-page error domains will only slow things down. If we hit -EFAULT
> during the actual copy-in or copy-out, we know that the error can't be
> page fault related.
>
> Similarly, in the buffered I/O case, we only really care about the
> next byte, so any probing beyond that is unnecessary.
>
> So maybe we should split the sub-page error domain probing off from
> the fault-in functions. Or at least add an argument to the fault-in
> functions that specifies the amount of memory to probe.

My preferred option is not to touch fault-in for sub-page faults
(though I have some draft patches, they need testing).

All this fault-in and uaccess with pagefault_disabled() is needed to
avoid a deadlock when the uaccess fault handling would take the same
lock. With sub-page faults, the kernel cannot fix it up anyway, so the
arch code won't even attempt to call handle_mm_fault() (it is not an mm
fault). But the problem is the copy_*_user() etc. API that can only
return the number of bytes not copied. That's what I think should be
fixed. fault_in() feels like the wrong place to address this when it's
not an mm fault.

As for fault_in() getting another argument with the amount of sub-page
probing to do, I think the API gets even more confusing. I was also
thinking, with your patches for fault_in() now returning size_t, is the
expectation to be precise in what cannot be copied? We don't have such
a requirement for copy_*_user().

While more intrusive, I'd rather change copy_page_from_iter_atomic()
etc. to take a pointer where to write back an error code. If it's
-EFAULT, retry the loop. If it's -EACCES/EPERM, just bail out. Or maybe
simply a bool set if there was an mm fault to be retried. Yet another
option is to return an -EAGAIN if it could not process the mm fault due
to page faults being disabled.

Happy to give this a try, unless there's a strong preference for the
fault_in() fix-up (well, I can do both options and post them).
On Tue, Oct 26, 2021 at 11:24 AM Catalin Marinas <catalin.marinas@arm.com> wrote:
>
> While more intrusive, I'd rather change copy_page_from_iter_atomic()
> etc. to take a pointer where to write back an error code.

I absolutely hate this model. The thing is, going down that rat-hole,
you'll find that you'll need to add it to *all* the
"copy_to/from_user()" cases, which isn't acceptable. So then you start
doing some duplicate versions with different calling conventions, just
because of things like this.

So no, I really don't want a "pass down a reference to an extra error
code" kind of horror.

That said, the fact that these sub-page faults are always
non-recoverable might be a hint to a solution to the problem: maybe we
could extend the existing return code with actual negative error
numbers.

Because for _most_ cases of "copy_to/from_user()" and friends by far,
the only thing we look for is "zero for success".

We could extend the "number of bytes _not_ copied" semantics to say
"negative means fatal", and because there are fairly few places that
actually look at non-zero values, we could have a coccinelle script
that actually marks those places.

End result: no change in calling conventions, no change to most users,
and in the (relatively few) cases where we look at the "what about
partial results", we just add a

	.. existing code ..
	ret = copy_from_user(..);
+	if (ret < 0)
+		break; // or whatever "fatal error" situation
	.. existing code ..

kind of thing that just stops the re-try.

(The coccinelle script couldn't actually do that, but it could add some
comment marker or something so that it's easy to find and then manually
fix up the places it finds).

          Linus
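Expressed as a wrapper, the proposed convention might look like the
sketch below. This is purely illustrative: mainline copy_from_user()
never returns negative values, and uaccess_fault_was_fatal() is an
assumed arch hook, not an existing function.

	/*
	 * >= 0: bytes not copied, as today.
	 * < 0:  non-recoverable (sub-page) fault; retrying cannot help.
	 */
	static inline long copy_from_user_fatal(void *to,
						const void __user *from,
						unsigned long n)
	{
		unsigned long left = copy_from_user(to, from, n);

		if (left && uaccess_fault_was_fatal())	/* assumed hook */
			return -EFAULT;
		return left;
	}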
On Tue, Oct 26, 2021 at 11:50 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Because for _most_ cases of "copy_to/from_user()" and friends by far,
> the only thing we look for is "zero for success".

Gaah. Looking around, I almost immediately found some really odd
exceptions to this.

Like parse_write_buffer_into_params() in amdgpu_dm_debugfs.c, which
does

	r = copy_from_user(wr_buf_ptr, buf, wr_buf_size);

		/* r is bytes not be copied */
	if (r >= wr_buf_size) {
		DRM_DEBUG_DRIVER("user data not be read\n");
		return -EINVAL;
	}

and allows a partial copy to just silently succeed, because all the
callers have pre-cleared the wr_buf_ptr buffer.

I have no idea why the code does that - it seems to imply that user
space could give an invalid 'size' parameter and mean to write only the
part that didn't succeed.

Adding AMD GPU driver people just to point out that this code not only
has odd whitespace, but that the pattern for "couldn't copy from user
space" should basically always be

	if (copy_from_user(wr_buf_ptr, buf, wr_buf_size))
		return -EFAULT;

because if user space passes in a partially invalid buffer, you
generally really shouldn't say "ok, I got part of it and will use that
part".

There _are_ exceptions. We've had situations where user space only
passes in the pointer to the buffer, but not the size. Bad interface,
but it historically happens for the 'mount()' system call 'data'
pointer. So then we'll copy "up to a page size".

Anyway, there are clearly some crazy users, and converting them all to
also check for negative error returns might be more painful than I
thought it would be.

          Linus
On Tue, Oct 26, 2021 at 11:50:04AM -0700, Linus Torvalds wrote:
> On Tue, Oct 26, 2021 at 11:24 AM Catalin Marinas
> <catalin.marinas@arm.com> wrote:
> > While more intrusive, I'd rather change copy_page_from_iter_atomic()
> > etc. to take a pointer where to write back an error code.
[...]
> That said, the fact that these sub-page faults are always
> non-recoverable might be a hint to a solution to the problem: maybe we
> could extend the existing return code with actual negative error
> numbers.
>
> Because for _most_ cases of "copy_to/from_user()" and friends by far,
> the only thing we look for is "zero for success".
>
> We could extend the "number of bytes _not_ copied" semantics to say
> "negative means fatal", and because there are fairly few places that
> actually look at non-zero values, we could have a coccinelle script
> that actually marks those places.

As you already replied, there are some odd places where the returned
number of uncopied bytes is used. Also for some valid cases like
copy_mount_options(), it's likely that it will fall back to
byte-at-a-time with MTE, since there's a good chance it would hit a
fault in a 4K page (not a fast path though). I'd have to go through all
the cases and check whether the return value is meaningful. The
iov_iter.c functions and their callers also seem to make use of the
bytes copied in case they need to call iov_iter_revert() (though I
suppose the iov_iter_iovec_advance() would skip the update in case of
an error).

As an alternative, you mentioned earlier that a per-thread fault status
was not feasible on x86 due to races. Was this only for the hw poison
case? I think the uaccess is slightly different.

We can add a current->non_recoverable_uaccess variable cleared on
pagefault_disable(), only set by uaccess faults and checked by the fs
code before re-attempting the fault_in(). An interrupt shouldn't do a
uaccess (well, if it does a _nofault one, we can detect in_interrupt()
in the MTE exception handler). Last time I looked at io_uring, it was
running in a separate kernel thread; not sure whether this has changed.
I don't see what else would be racing with such a
current->non_recoverable_uaccess variable.

If that's doable, I think it's the least intrusive approach.
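A sketch of how the fs retry loop would consume such a flag. The
task_struct field and the surrounding names are hypothetical, taken
from the proposal above rather than any existing code:

	static ssize_t copy_chunk_retry(struct page *page, size_t offset,
					size_t bytes, struct iov_iter *i)
	{
		size_t copied;

	again:
		pagefault_disable();	/* assumed to also clear the flag */
		copied = copy_page_from_iter_atomic(page, offset, bytes, i);
		pagefault_enable();

		if (unlikely(!copied)) {
			if (current->non_recoverable_uaccess)	/* assumed field */
				return -EFAULT;	/* sub-page fault: retrying is futile */
			if (fault_in_iov_iter_readable(i, bytes) != bytes)
				goto again;	/* ordinary fault, partly faulted in */
			return -EFAULT;
		}
		return copied;
	}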
On Wed, Oct 27, 2021 at 12:13 PM Catalin Marinas <catalin.marinas@arm.com> wrote:
>
> As an alternative, you mentioned earlier that a per-thread fault
> status was not feasible on x86 due to races. Was this only for the hw
> poison case? I think the uaccess is slightly different.

It's not x86-specific, it's very generic.

If we set some flag in the per-thread status, we'll need to be careful
about not overwriting it if we then have a subsequent NMI that _also_
takes a (completely unrelated) page fault - before we then read the
per-thread flag.

Think 'perf' and fetching backtraces etc.

Note that the NMI page fault can easily also be a pointer coloring
fault on arm64, for exactly the same reason that whatever original
copy_from_user() code was. So this is not a "oh, pointer coloring
faults are different". They have the same re-entrancy issue.

And both the "pagefault_disable" and "fault happens in interrupt
context" cases are also the exact same 'faulthandler_disabled()' thing.
So even at fault time they look very similar.

So we'd have to have some way to separate out only the one we care
about.

          Linus
One of the arguments against Dave Hansen's patch that eliminates the
pre-faulting was that it doubles the number of page faults in the slow
case. This can be avoided by using get_user_pages() to do the
"post-faulting", though. For what it's worth, here's a patch for that
(on top of this series).
Andreas
--
fs: Avoid predictable page faults for sys_write() user buffer pages
Introduce a new fault_in_iov_iter_slow_readable() helper for faulting in
an iterator via get_user_pages() instead of triggering page faults.
This is slower than a simple memory read when the underlying pages are
resident, but avoids the page fault overhead when the underlying pages
need to be faulted in.
Use fault_in_iov_iter_slow_readable() in generic_perform_write and
iomap_write_iter when reading from the user buffer fails.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
---
fs/iomap/buffered-io.c | 2 +-
include/linux/pagemap.h | 3 ++-
include/linux/uio.h | 17 ++++++++++++++++-
lib/iov_iter.c | 10 ++++++----
mm/filemap.c | 2 +-
mm/gup.c | 10 ++++++----
6 files changed, 32 insertions(+), 12 deletions(-)
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index d8809cd9ab31..15a0b4bb9528 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -770,7 +770,7 @@ static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
bytes = copied;
goto again;
}
- if (fault_in_iov_iter_readable(i, bytes) != bytes)
+ if (fault_in_iov_iter_slow_readable(i, bytes) != bytes)
goto again;
status = -EFAULT;
break;
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 2f7dd14083d9..43844ed5675f 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -736,8 +736,9 @@ extern void add_page_wait_queue(struct page *page, wait_queue_entry_t *waiter);
* Fault in userspace address range.
*/
size_t fault_in_writeable(char __user *uaddr, size_t size);
-size_t fault_in_safe_writeable(const char __user *uaddr, size_t size);
size_t fault_in_readable(const char __user *uaddr, size_t size);
+size_t __fault_in_slow(const char __user *uaddr, size_t size,
+ unsigned int flags);
int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
pgoff_t index, gfp_t gfp_mask);
diff --git a/include/linux/uio.h b/include/linux/uio.h
index 6350354f97e9..b071f4445059 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -8,6 +8,7 @@
#include <linux/kernel.h>
#include <linux/thread_info.h>
#include <uapi/linux/uio.h>
+#include <linux/mm.h>
struct page;
struct pipe_inode_info;
@@ -135,7 +136,21 @@ size_t copy_page_from_iter_atomic(struct page *page, unsigned offset,
void iov_iter_advance(struct iov_iter *i, size_t bytes);
void iov_iter_revert(struct iov_iter *i, size_t bytes);
size_t fault_in_iov_iter_readable(const struct iov_iter *i, size_t bytes);
-size_t fault_in_iov_iter_writeable(const struct iov_iter *i, size_t bytes);
+size_t __fault_in_iov_iter_slow(const struct iov_iter *i, size_t bytes,
+ unsigned int flags);
+
+static inline size_t fault_in_iov_iter_slow_readable(const struct iov_iter *i,
+ size_t bytes)
+{
+ return __fault_in_iov_iter_slow(i, bytes, 0);
+}
+
+static inline size_t fault_in_iov_iter_writeable(const struct iov_iter *i,
+ size_t bytes)
+{
+ return __fault_in_iov_iter_slow(i, bytes, FOLL_WRITE);
+}
+
size_t iov_iter_single_seg_count(const struct iov_iter *i);
size_t copy_page_to_iter(struct page *page, size_t offset, size_t bytes,
struct iov_iter *i);
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 66a740e6e153..73789a5409f6 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -468,9 +468,10 @@ size_t fault_in_iov_iter_readable(const struct iov_iter *i, size_t size)
EXPORT_SYMBOL(fault_in_iov_iter_readable);
/*
- * fault_in_iov_iter_writeable - fault in iov iterator for writing
+ * __fault_in_iov_iter_slow - fault in iov iterator for reading/writing
* @i: iterator
* @size: maximum length
+ * @flags: FOLL_* flags (FOLL_WRITE for writing)
*
* Faults in the iterator using get_user_pages(), i.e., without triggering
* hardware page faults. This is primarily useful when we already know that
@@ -481,7 +482,8 @@ EXPORT_SYMBOL(fault_in_iov_iter_readable);
*
* Always returns 0 for non-user-space iterators.
*/
-size_t fault_in_iov_iter_writeable(const struct iov_iter *i, size_t size)
+size_t __fault_in_iov_iter_slow(const struct iov_iter *i, size_t size,
+ unsigned int flags)
{
if (iter_is_iovec(i)) {
size_t count = min(size, iov_iter_count(i));
@@ -495,7 +497,7 @@ size_t fault_in_iov_iter_writeable(const struct iov_iter *i, size_t size)
if (unlikely(!len))
continue;
- ret = fault_in_safe_writeable(p->iov_base + skip, len);
+ ret = __fault_in_slow(p->iov_base + skip, len, flags);
count -= len - ret;
if (ret)
break;
@@ -504,7 +506,7 @@ size_t fault_in_iov_iter_writeable(const struct iov_iter *i, size_t size)
}
return 0;
}
-EXPORT_SYMBOL(fault_in_iov_iter_writeable);
+EXPORT_SYMBOL(__fault_in_iov_iter_slow);
void iov_iter_init(struct iov_iter *i, unsigned int direction,
const struct iovec *iov, unsigned long nr_segs,
diff --git a/mm/filemap.c b/mm/filemap.c
index 467cdb7d086d..7ca76f4aa974 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3787,7 +3787,7 @@ ssize_t generic_perform_write(struct file *file,
bytes = copied;
goto again;
}
- if (fault_in_iov_iter_readable(i, bytes) != bytes)
+ if (fault_in_iov_iter_slow_readable(i, bytes) != bytes)
goto again;
status = -EFAULT;
break;
diff --git a/mm/gup.c b/mm/gup.c
index e1c7e4bde11f..def9f478a621 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1694,9 +1694,10 @@ size_t fault_in_writeable(char __user *uaddr, size_t size)
EXPORT_SYMBOL(fault_in_writeable);
/*
- * fault_in_safe_writeable - fault in an address range for writing
+ * __fault_in_slow - fault in an address range for reading/writing
* @uaddr: start of address range
* @size: length of address range
+ * @flags: FOLL_* flags (FOLL_WRITE for writing)
*
* Faults in an address range using get_user_pages, i.e., without triggering
* hardware page faults. This is primarily useful when we already know that
@@ -1711,7 +1712,8 @@ EXPORT_SYMBOL(fault_in_writeable);
* Returns the number of bytes not faulted in, like copy_to_user() and
* copy_from_user().
*/
-size_t fault_in_safe_writeable(const char __user *uaddr, size_t size)
+size_t __fault_in_slow(const char __user *uaddr, size_t size,
+ unsigned int flags)
{
unsigned long start = (unsigned long)untagged_addr(uaddr);
unsigned long end, nstart, nend;
@@ -1743,7 +1745,7 @@ size_t fault_in_safe_writeable(const char __user *uaddr, size_t size)
nr_pages = (nend - nstart) / PAGE_SIZE;
ret = __get_user_pages_locked(mm, nstart, nr_pages,
NULL, NULL, &locked,
- FOLL_TOUCH | FOLL_WRITE);
+ FOLL_TOUCH | flags);
if (ret <= 0)
break;
nend = nstart + ret * PAGE_SIZE;
@@ -1754,7 +1756,7 @@ size_t fault_in_safe_writeable(const char __user *uaddr, size_t size)
return 0;
return size - min_t(size_t, nstart - start, size);
}
-EXPORT_SYMBOL(fault_in_safe_writeable);
+EXPORT_SYMBOL(__fault_in_slow);
/**
* fault_in_readable - fault in userspace address range for reading
One last try on this path before I switch to the other options.

On Wed, Oct 27, 2021 at 02:14:48PM -0700, Linus Torvalds wrote:
> On Wed, Oct 27, 2021 at 12:13 PM Catalin Marinas
> <catalin.marinas@arm.com> wrote:
> > As an alternative, you mentioned earlier that a per-thread fault
> > status was not feasible on x86 due to races. Was this only for the
> > hw poison case? I think the uaccess is slightly different.
>
> It's not x86-specific, it's very generic.
>
> If we set some flag in the per-thread status, we'll need to be careful
> about not overwriting it if we then have a subsequent NMI that _also_
> takes a (completely unrelated) page fault - before we then read the
> per-thread flag.
>
> Think 'perf' and fetching backtraces etc.
>
> Note that the NMI page fault can easily also be a pointer coloring
> fault on arm64, for exactly the same reason that whatever original
> copy_from_user() code was. So this is not a "oh, pointer coloring
> faults are different". They have the same re-entrancy issue.
>
> And both the "pagefault_disable" and "fault happens in interrupt
> context" cases are also the exact same 'faulthandler_disabled()'
> thing. So even at fault time they look very similar.

They do look fairly similar, but we should have the information in the
fault handler to distinguish: not a page fault (pte permission or p*d
translation), in_task(), user address, fixup handler. But I agree the
logic looks fragile.

I think for nested contexts we can save the uaccess fault state on
exception entry and restore it on return. Or (needs some thinking on
atomicity) save it in a local variable. The high-level API would look
something like:

	unsigned long uaccess_flags;	/* we could use TIF_ flags */

	uaccess_flags = begin_retriable_uaccess();
	copied = copy_page_from_iter_atomic(...);
	retry = end_retriable_uaccess(uaccess_flags);
	...

	if (!retry)
		break;

I think we'd need a TIF flag to mark the retriable region and another
to track whether a non-recoverable fault occurred. It needs
prototyping.

Anyway, if you don't like this approach, I'll look at error codes being
returned, but rather than changing all copy_from_user() etc., introduce
a new API that returns different error codes depending on the fault
(e.g. -EFAULT vs -EACCES). We already have copy_from_user_nofault();
we'd need something for the iov_iter stuff to use in the fs code.
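One possible shape for that prototype, assuming two new thread flags
(TIF_UACCESS_RETRY / TIF_UACCESS_FATAL, neither of which exists today)
that the arch fault handler would set on a sub-page fault inside a
marked region. Note this only handles nested begin/end pairs; as the
follow-up below points out, exception entry would still need its own
save/restore:

	static inline unsigned long begin_retriable_uaccess(void)
	{
		unsigned long old = current_thread_info()->flags;

		set_thread_flag(TIF_UACCESS_RETRY);	/* assumed new flag */
		clear_thread_flag(TIF_UACCESS_FATAL);	/* assumed new flag */
		return old;
	}

	static inline bool end_retriable_uaccess(unsigned long old)
	{
		bool fatal = test_thread_flag(TIF_UACCESS_FATAL);

		/* Restore the outer region's state so nesting composes. */
		if (!(old & _TIF_UACCESS_RETRY))
			clear_thread_flag(TIF_UACCESS_RETRY);
		if (old & _TIF_UACCESS_FATAL)
			set_thread_flag(TIF_UACCESS_FATAL);

		/* true: only ordinary faults, worth faulting in and retrying */
		return !fatal;
	}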
On Thu, Oct 28, 2021 at 10:20:52PM +0100, Catalin Marinas wrote:
> I think for nested contexts we can save the uaccess fault state on
> exception entry and restore it on return. Or (needs some thinking on
> atomicity) save it in a local variable. The high-level API would look
> something like:
>
> 	unsigned long uaccess_flags;	/* we could use TIF_ flags */
>
> 	uaccess_flags = begin_retriable_uaccess();
> 	copied = copy_page_from_iter_atomic(...);
> 	retry = end_retriable_uaccess(uaccess_flags);

It doesn't work with local flags, so the state would need to be saved
on exception entry (interrupt, breakpoint etc.) on the stack and
restored on return. But the API would remain pretty close to the above
(and probably still more complicated than copy_*() returning an error
code).
On Thu, Oct 28, 2021 at 11:21 PM Catalin Marinas
<catalin.marinas@arm.com> wrote:
> One last try on this path before I switch to the other options.
>
> I think for nested contexts we can save the uaccess fault state on
> exception entry, restore it on return. Or (needs some thinking on
> atomicity) save it in a local variable. The high-level API would look
> something like:
>
> 	unsigned long uaccess_flags;	/* we could use TIF_ flags */
>
> 	uaccess_flags = begin_retriable_uaccess();
> 	copied = copy_page_from_iter_atomic(...);
> 	retry = end_retriable_uaccess(uaccess_flags);
> 	...
>
> 	if (!retry)
> 		break;
>
> I think we'd need a TIF flag to mark the retriable region and another
> to track whether a non-recoverable fault occurred. It needs
> prototyping.
>
> Anyway, if you don't like this approach, I'll look at error codes
> being returned, but rather than changing all copy_from_user() etc.,
> introduce a new API that returns different error codes depending on
> the fault (e.g. -EFAULT vs -EACCES). We already have
> copy_from_user_nofault(); we'd need something for the iov_iter stuff
> to use in the fs code.

We won't need any of that on the filesystem read and write paths. The
two cases there are buffered and direct I/O:

 * In the buffered I/O case, the copying happens with page faults
   disabled, at a byte granularity. If that returns a short result, we
   need to enable page faults, check if the exact address that failed
   still fails (in which case we have a sub-page fault), fault in the
   pages, disable page faults again, and repeat. No probing for
   sub-page faults beyond the first byte of the fault-in address is
   needed. Functions fault_in_{readable,writeable} implicitly have
   this behavior; for fault_in_safe_writeable() the choice we have is
   to either add probing of the first byte for sub-page faults to this
   function or force callers to do that probing separately. At this
   point, I'd vote for the former.

 * In the direct I/O case, the copying happens while we're holding
   page references, so the only page faults that can occur during
   copying are sub-page faults.

When iomap_dio_rw or its legacy counterpart is called with page faults
disabled, we need to make sure that the caller can distinguish between
page faults triggered during bio_iov_iter_get_pages() and during the
copying, but that's a separate problem. (At the moment, when
iomap_dio_rw fails with -EFAULT, the caller *cannot* distinguish
between a bio_iov_iter_get_pages failure and a failure during
synchronous copying, but that could be fixed by returning unique error
codes from iomap_dio_rw.)

So as far as I can see, the only problematic case we're left with is
copying bigger-than-byte-size chunks with page faults disabled when we
don't know whether the underlying pages are resident or not. My guess
would be that in this case, if the copying fails, it would be perfectly
acceptable to explicitly probe the entire chunk for sub-page faults.

Thanks,
Andreas
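Put together, the buffered-I/O case described above is the cold-path
retry loop Linus sketched earlier in the thread. Roughly, with names
approximate and loosely modelled on the gfs2/iomap code in this series:

	static ssize_t buffered_write_retry(struct kiocb *iocb,
					    struct iov_iter *from,
					    const struct iomap_ops *ops)
	{
		ssize_t ret;

	retry:
		/* take filesystem locks here */
		pagefault_disable();
		ret = iomap_file_buffered_write(iocb, from, ops);
		pagefault_enable();
		/* release filesystem locks here */

		if (ret == -EFAULT) {
			size_t len = iov_iter_count(from);

			/*
			 * fault_in_iov_iter_readable() starts with a
			 * byte-sized probe, which implicitly catches a
			 * sub-page fault at the exact failing address;
			 * retry only if it made progress.
			 */
			if (fault_in_iov_iter_readable(from, len) != len)
				goto retry;
		}
		return ret;
	}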
On Thu, Oct 28, 2021 at 2:21 PM Catalin Marinas <catalin.marinas@arm.com> wrote:
>
> They do look fairly similar, but we should have the information in the
> fault handler to distinguish: not a page fault (pte permission or p*d
> translation), in_task(), user address, fixup handler. But I agree the
> logic looks fragile.

So thinking about this a bit more, I think I have a possible suggestion
for how to handle this..

The pointer color fault (or whatever some other architecture may do to
generate sub-page faults) is not only not recoverable in the sense that
we can't fix it up, it also ends up being a forced SIGSEGV (ie it can't
be blocked - it has to either be caught or cause the process to be
killed).

And the thing is, I think we could just make the rule be that kernel
code that has this kind of retry loop with fault_in_pages() would force
an EFAULT on a pending SIGSEGV.

IOW, the pending SIGSEGV could effectively be exactly that "thread
flag".

And that means that fault_in_xyz() wouldn't need to worry about this
situation at all: the regular copy_from_user() (or whatever flavor it
is - to/from/iter/whatever) would take the fault. And if it's a regular
page fault, it would act exactly like it does now, so no changes.

If it's a sub-page fault, we'd just make the rule be that we send a
SIGSEGV even if the instruction in question has a user exception fixup.

Then we just need to add the logic somewhere that does "if active
pending SIGSEGV, return -EFAULT".

Of course, that logic might be in fault_in_xyz(), but it might also be
a separate function entirely.

So this does effectively end up being a thread flag, but it's also
slightly more than that - it's that a sub-page fault from kernel mode
has semantics that a regular page fault does not.

The whole "kernel access doesn't cause SIGSEGV, but returns -EFAULT
instead" has always been an odd and somewhat wrong-headed thing. Of
course it should cause a SIGSEGV, but that's not how Unix traditionally
worked. We would just say "color faults always raise a signal, even if
the color fault was triggered in a system call".

(And I didn't check: I say "SIGSEGV" above, but maybe the pointer color
faults are actually SIGBUS? Doesn't change the end result).

          Linus
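The "if active pending SIGSEGV, return -EFAULT" check could be as small
as the sketch below (hypothetical helper; as the follow-ups in this
thread note, the signal machinery makes the real story messier):

	/* Hypothetical: was a forced, thread-targeted SIGSEGV queued? */
	static inline bool fatal_uaccess_signal_pending(void)
	{
		return sigismember(&current->pending.signal, SIGSEGV);
	}

A fault_in_xyz() caller would then bail out with -EFAULT when this
returns true instead of looping.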
On Fri, Oct 29, 2021 at 12:15:55AM +0200, Andreas Grünbacher wrote:
> We won't need any of that on the filesystem read and write paths. The
> two cases there are buffered and direct I/O:

Thanks for the clarification, very useful.

>  * In the buffered I/O case, the copying happens with page faults
>    disabled, at a byte granularity. If that returns a short result,
>    we need to enable page faults, check if the exact address that
>    failed still fails (in which case we have a sub-page fault), fault
>    in the pages, disable page faults again, and repeat. No probing
>    for sub-page faults beyond the first byte of the fault-in address
>    is needed. Functions fault_in_{readable,writeable} implicitly have
>    this behavior; for fault_in_safe_writeable() the choice we have is
>    to either add probing of the first byte for sub-page faults to
>    this function or force callers to do that probing separately. At
>    this point, I'd vote for the former.

This sounds fine to me (and I have some draft patches already on top of
your series).

>  * In the direct I/O case, the copying happens while we're holding
>    page references, so the only page faults that can occur during
>    copying are sub-page faults.

Does holding a page reference guarantee that the user pte pointing to
such a page won't change, for example a pte_mkold()? I assume for
direct I/O, PG_locked is not held. But see below, it may not be
relevant.

> When iomap_dio_rw or its legacy counterpart is called with page faults
> disabled, we need to make sure that the caller can distinguish between
> page faults triggered during bio_iov_iter_get_pages() and during the
> copying, but that's a separate problem. (At the moment, when
> iomap_dio_rw fails with -EFAULT, the caller *cannot* distinguish
> between a bio_iov_iter_get_pages failure and a failure during
> synchronous copying, but that could be fixed by returning unique error
> codes from iomap_dio_rw.)

Since the direct I/O pins the pages in memory, does it even need to do
a uaccess? It could copy the data via the kernel mapping (kmap). For
arm64 MTE, all such accesses are not checked (they use a match-all
pointer tag) since the kernel is not set up to handle such sub-page
faults (no copy_from/to_user but a direct access).

> So as far as I can see, the only problematic case we're left with is
> copying bigger-than-byte-size chunks with page faults disabled when we
> don't know whether the underlying pages are resident or not. My guess
> would be that in this case, if the copying fails, it would be
> perfectly acceptable to explicitly probe the entire chunk for sub-page
> faults.

Yeah, if there are only a couple of places left, we can add the
explicit probing (via some probe_user_writable function).
On Thu, Oct 28, 2021 at 03:32:23PM -0700, Linus Torvalds wrote:
> The pointer color fault (or whatever some other architecture may do to
> generate sub-page faults) is not only not recoverable in the sense
> that we can't fix it up, it also ends up being a forced SIGSEGV (ie it
> can't be blocked - it has to either be caught or cause the process to
> be killed).
>
> And the thing is, I think we could just make the rule be that kernel
> code that has this kind of retry loop with fault_in_pages() would
> force an EFAULT on a pending SIGSEGV.
>
> IOW, the pending SIGSEGV could effectively be exactly that "thread
> flag".
>
> And that means that fault_in_xyz() wouldn't need to worry about this
> situation at all: the regular copy_from_user() (or whatever flavor it
> is - to/from/iter/whatever) would take the fault. And if it's a
> regular page fault, it would act exactly like it does now, so no
> changes.
>
> If it's a sub-page fault, we'd just make the rule be that we send a
> SIGSEGV even if the instruction in question has a user exception
> fixup.
>
> Then we just need to add the logic somewhere that does "if active
> pending SIGSEGV, return -EFAULT".
>
> The whole "kernel access doesn't cause SIGSEGV, but returns -EFAULT
> instead" has always been an odd and somewhat wrong-headed thing. Of
> course it should cause a SIGSEGV, but that's not how Unix
> traditionally worked. We would just say "color faults always raise a
> signal, even if the color fault was triggered in a system call".

It's doable and, at least for MTE, people have asked for a signal even
when the fault was caused by a kernel uaccess. But there are some
potentially confusing aspects to sort out:

First of all, a uaccess in interrupt should not force such a signal, as
it had nothing to do with the interrupted context. I guess we can do an
in_task() check in the fault handler.

Second, is there a chance that we enter the fault-in loop with a
SIGSEGV already pending? Maybe it's not a problem, we just bail out of
the loop early and deliver the signal, though unrelated to the actual
uaccess in the loop.

Third is the sigcontext.pc presented to the signal handler. Normally
for SIGSEGV it points to the address of a load/store instruction, and a
handler could disable MTE and restart from that point. With a syscall
we don't want it to point to the syscall place, as it shouldn't be
restarted in case it copied something. Pointing it to the next
instruction after the syscall is backwards-compatible, but it may
confuse the handler (if it does some reporting). I think we need to add
a new si_code that describes a fault in kernel mode, to differentiate
it from a genuine user access.

There was a discussion back in August on infinite loops with hwpoison,
and Tony said that Andy convinced him that the kernel should not send a
SIGBUS for uaccess:

https://lore.kernel.org/linux-edac/20210823152437.GA1637466@agluck-desk2.amr.corp.intel.com/

I personally like the approach of a SIG{SEGV,BUS} on uaccess and I
don't think the ABI change is significant, but ideally we should have a
unified approach that's not just for MTE.

Adding Andy and Tony (the background is potentially infinite loops with
faults at sub-page granularity: arm64 MTE, hwpoison, sparc ADI).

Thanks.
On Fri, Oct 29, 2021 at 10:50 AM Catalin Marinas <catalin.marinas@arm.com> wrote:
>
> First of all, a uaccess in interrupt should not force such a signal,
> as it had nothing to do with the interrupted context. I guess we can
> do an in_task() check in the fault handler.

Yeah. It ends up being similar to the thread flag in that you still end
up having to protect against NMI and other users of asynchronous page
faults.

So the suggestion was more of a "mindset" difference and modified
version of the task flag rather than anything fundamentally different.

> Second, is there a chance that we enter the fault-in loop with a
> SIGSEGV already pending? Maybe it's not a problem, we just bail out of
> the loop early and deliver the signal, though unrelated to the actual
> uaccess in the loop.

If we ever run in user space with a pending per-thread SIGSEGV, that
would already be a fairly bad bug.

The intent of "force_sig()" is not only to make sure you can't block
the signal, but also that it targets the particular thread that caused
the problem: unlike other random "send signal to process", a SIGSEGV
caused by a bad memory access is really local to that _thread_, not the
signal thread group.

So somebody else sending a SIGSEGV asynchronously is actually very
different - it goes to the thread group (although you can specify
individual threads too - but once you do that you're already outside of
POSIX).

That said, the more I look at it, the more I think I was wrong. I think
the "we have a SIGSEGV pending" could act as the per-thread flag, but
the complexity of the signal handling is probably an argument against
it.

Not because a SIGSEGV could already be pending, but because so many
other situations could be pending.

In particular, the signal code won't send new signals to a thread if
that thread group is already exiting. So another thread may have
already started the exit and core dump sequence, and is in the process
of killing the shared signal threads, and if one of those threads is
now in the kernel and goes through the copy_from_user() dance, that
whole "thread group is exiting" will mean that the signal code won't
add a new SIGSEGV to the queue.

So the signal could conceptually be used as the flag to stop looping,
but it ends up being such a complicated flag that I think it's probably
not worth it after all. Even if it semantically would be fairly nice to
use pre-existing machinery.

Could it be worked around? Sure. That kernel loop probably has to check
for fatal_signal_pending() anyway, so it would all work even in the
presence of the above kinds of issues. But just the fact that I went
and looked at just how exciting the signal code is made me think "ok,
conceptually nice, but we take a lot of locks and we do a lot of
special things even in the 'simple' force_sig() case".

> Third is the sigcontext.pc presented to the signal handler. Normally
> for SIGSEGV it points to the address of a load/store instruction, and
> a handler could disable MTE and restart from that point. With a
> syscall we don't want it to point to the syscall place, as it
> shouldn't be restarted in case it copied something.

I think this is actually independent of the whole "how to return
errors". We'll still need to return an error from the system call, even
if we also have a signal pending.

          Linus