Message ID | 1491506092.9621.2.camel@redhat.com |
---|---|
State | New, archived |
Headers | show |
On Thu, Apr 06, 2017 at 03:14:52PM -0400, Jeff Layton wrote: > @@ -868,6 +869,7 @@ struct file { > struct list_head f_tfile_llink; > #endif /* #ifdef CONFIG_EPOLL */ > struct address_space *f_mapping; > + u32 f_wb_err; > } __attribute__((aligned(4))); /* lest something weird decides that 2 is OK */ > I think we can squeeze that in next to f_flags? > +/** > + * filemap_set_wb_error - set the wb error in the mapping for later reporting > + * @mapping: mapping in which the error should be set > + * @err: error to set. must be negative value but not less than -MAX_ERRNO Do we want to have users call filemap_set_wb_error(mapping, EIO) or filemap_set_wb_error(mapping, -EIO)? Either way, we can assert that it's in the correct range (oh look, we have at least one user of mapping_set_error calling it with a positive errno ...) I've been playing with positive or negative errnos for the xarray, and positive looks better to me, although there's a definite advantage to being able to just call filemap_set_wb_error(mapping, result). #define XAS_ERROR(errno) ((struct xa_node *)((errno << 1) | 1)) static inline int xas_error(const struct xa_state *xas) { unsigned long v = (unsigned long)xas->xa_node; return (v & 1) ? -(v >> 1) : 0; } static inline void xas_set_err(struct xa_state *xas, unsigned long err) { XA_BUG_ON(err > MAX_ERRNO); xas->xa_node = XAS_ERROR(err); } > + /* > + * Ensure the error code actually fits where we want it to go. If it > + * doesn't then just throw a warning and don't record anything. > + */ > + if (unlikely(err > 0 || err < -MAX_ERRNO)) { > + WARN(1, "err=%d\n", err); > + return; > + } Cute trick to make this more succinct: if (WARN(err > 0 || err < -MAX_ERRNO), "err = %d\n", err) return; or even ... if (WARN((unsigned int)-err > MAX_ERRNO), "err = %d\n", err) return; > + /* Clear out error bits and set new error */ > + new = (old & ~MAX_ERRNO) | -err; > + > + /* Only increment if someone has looked at it */ > + if (old & WB_ERR_SEEN) { > + new += WB_ERR_CTR_INC; > + new &= ~WB_ERR_SEEN; > + } Although we always want to clear out the SEEN bit if we're updating ... so new = (old & ~(MAX_ERRNO | WB_ERR_SEEN) | -err; /* Only increment if someone has looked at it */ if (old & WB_ERR_SEEN) new += WB_ERR_CTR_INC; ... and then there's no need to update if it's the same errno and nobody's seen it: if (old == new) break; [...] > + /* > + * We always store values with the "seen" bit set, so if this > + * matches what we already have, then we can call it done. > + * There is nothing to update so just return 0. > + */ > + if (old == file->f_wb_err) > + break; > + > + /* set flag and try to swap it into place */ > + new = old | WB_ERR_SEEN; Again, I think we should avoid the cmpxchg with: if (old == new) break; > + cur = cmpxchg(&mapping->wb_err, old, new); > + > + /* > + * We can quit now if we successfully swapped in the new value > + * or someone else beat us to it with the same value that we > + * were planning to store. > + */ > + if (likely(cur == old || cur == new)) { > + file->f_wb_err = new; > + err = -(new & MAX_ERRNO); > + break; > + } > + > + /* Raced with an update, try again */ > + old = cur; Well ... should we? We're returning an error which is new to this fd anyway. Do we want to return the most recent error by a nanosecond, or should we return the previous one and then see this one next time we call fsync()? I'd lean towards not looping here; not even looking at 'cur'.
On Thu, Apr 06 2017, Jeff Layton wrote: > > I tried to avoid updating things unnecesssarily. I could use some > guidance on how to specify the constants in terms of MAX_ERRNO as well. ilog2() defined in include/linux/log2.h And you have MAX_ERROR in one comment, instead of MAX_ERRNO :-) > > - flush: called by the close(2) system call to flush a file > + flush: called by the close(2) system call to flush a file. Writeback > + errors not previously reported via fsync should be reported > + here as you would for fsync. "could", not "should". I think it is agreed that this is a good idea, is it? > diff --git a/include/linux/fs.h b/include/linux/fs.h > index 7251f7bb45e8..f33857113ff4 100644 > --- a/include/linux/fs.h > +++ b/include/linux/fs.h > @@ -394,6 +394,7 @@ struct address_space { > gfp_t gfp_mask; /* implicit gfp mask for allocations */ > struct list_head private_list; /* ditto */ > void *private_data; /* ditto */ > + u32 wb_err; I would rather this was a wb_err_t or similar, and that the functions which implement it take a pointer to a wb_err_t. Then the 'error' in 'struct nfs_open_context' could become a 'wb_err_t', and nfs could use these functions to do error tracking the way it wants to. Thanks - looking good. NeilBrown
On Thu, 2017-04-06 at 13:05 -0700, Matthew Wilcox wrote: > On Thu, Apr 06, 2017 at 03:14:52PM -0400, Jeff Layton wrote: > > @@ -868,6 +869,7 @@ struct file { > > struct list_head f_tfile_llink; > > #endif /* #ifdef CONFIG_EPOLL */ > > struct address_space *f_mapping; > > + u32 f_wb_err; > > } __attribute__((aligned(4))); /* lest something weird decides that 2 is OK */ > > > > I think we can squeeze that in next to f_flags? > Sure, will do. I meant to look at pahole output and see if there are existing holes. > > +/** > > + * filemap_set_wb_error - set the wb error in the mapping for later reporting > > + * @mapping: mapping in which the error should be set > > + * @err: error to set. must be negative value but not less than -MAX_ERRNO > > Do we want to have users call filemap_set_wb_error(mapping, EIO) > or filemap_set_wb_error(mapping, -EIO)? Either way, we can assert > that it's in the correct range (oh look, we have at least one user of > mapping_set_error calling it with a positive errno ...) > Yeah, I sent a patch for that a while back but I don't think anyone picked it up. Luckily that caller is harmless since EIO just ends up in the default case and gets turned into -EIO. > I've been playing with positive or negative errnos for the xarray, and > positive looks better to me, although there's a definite advantage to > being able to just call filemap_set_wb_error(mapping, result). > That's my main rationale. We generally use negative error codes in the kernel, so let's do what's easiest for most callsites. I say negative error codes here. > #define XAS_ERROR(errno) ((struct xa_node *)((errno << 1) | 1)) > > static inline int xas_error(const struct xa_state *xas) > { > unsigned long v = (unsigned long)xas->xa_node; > return (v & 1) ? -(v >> 1) : 0; > } > > static inline void xas_set_err(struct xa_state *xas, unsigned long err) > { > XA_BUG_ON(err > MAX_ERRNO); > xas->xa_node = XAS_ERROR(err); > } > > > + /* > > + * Ensure the error code actually fits where we want it to go. If it > > + * doesn't then just throw a warning and don't record anything. > > + */ > > + if (unlikely(err > 0 || err < -MAX_ERRNO)) { > > + WARN(1, "err=%d\n", err); > > + return; > > + } > > Cute trick to make this more succinct: > > if (WARN(err > 0 || err < -MAX_ERRNO), "err = %d\n", err) > return; > or even ... > > if (WARN((unsigned int)-err > MAX_ERRNO), "err = %d\n", err) > return; > Nice. I always forget that WARN has a return. Will fix. > > + /* Clear out error bits and set new error */ > > + new = (old & ~MAX_ERRNO) | -err; > > + > > + /* Only increment if someone has looked at it */ > > + if (old & WB_ERR_SEEN) { > > + new += WB_ERR_CTR_INC; > > + new &= ~WB_ERR_SEEN; > > + } > > Although we always want to clear out the SEEN bit if we're updating ... so > > new = (old & ~(MAX_ERRNO | WB_ERR_SEEN) | -err; > > /* Only increment if someone has looked at it */ > if (old & WB_ERR_SEEN) > new += WB_ERR_CTR_INC; > Sure, that is more succinct. > ... and then there's no need to update if it's the same errno and nobody's > seen it: > > if (old == new) > break; > No, we can't do this. The thing could have just been updated by a task that is setting the "seen" bit. We don't want to lose the error here. We always have to do the cmpxchg on the set_wb_error side, I think. > [...] > > > + /* > > + * We always store values with the "seen" bit set, so if this > > + * matches what we already have, then we can call it done. > > + * There is nothing to update so just return 0. > > + */ > > + if (old == file->f_wb_err) > > + break; > > + > > + /* set flag and try to swap it into place */ > > + new = old | WB_ERR_SEEN; > > Again, I think we should avoid the cmpxchg with: > > if (old == new) > break; > Yeah, we may be able to do this one. I had myself convinced otherwise yesterday, but I think you may be right. > > + cur = cmpxchg(&mapping->wb_err, old, new); > > + > > + /* > > + * We can quit now if we successfully swapped in the new value > > + * or someone else beat us to it with the same value that we > > + * were planning to store. > > + */ > > + if (likely(cur == old || cur == new)) { > > + file->f_wb_err = new; > > + err = -(new & MAX_ERRNO); > > + break; > > + } > > + > > + /* Raced with an update, try again */ > > + old = cur; > > Well ... should we? We're returning an error which is new to this fd anyway. > Do we want to return the most recent error by a nanosecond, or should we > return the previous one and then see this one next time we call fsync()? > > I'd lean towards not looping here; not even looking at 'cur'. > Yeah, that might be fine here. Let me think about it a bit more.
On Fri, Apr 07 2017, Jeff Layton wrote: > >> ... and then there's no need to update if it's the same errno and nobody's >> seen it: >> >> if (old == new) >> break; >> > > No, we can't do this. The thing could have just been updated by a task > that is setting the "seen" bit. We don't want to lose the error here. We > always have to do the cmpxchg on the set_wb_error side, I think. I don't follow your logic. If (old == new) then there was a moment since this function started when performing the cmpxchg() would not have changed the contents of memory. So let's pretend it did actually happen at that moment, and not change memory. If we race with a task setting the "seen" bit, then it will have seen the error *after* the new error, that this thread is reporting, actually happened. So the result is still correct. Thanks, NeilBrown
On Mon, 2017-04-10 at 09:15 +1000, NeilBrown wrote: > On Fri, Apr 07 2017, Jeff Layton wrote: > > > > > > ... and then there's no need to update if it's the same errno and nobody's > > > seen it: > > > > > > if (old == new) > > > break; > > > > > > > No, we can't do this. The thing could have just been updated by a task > > that is setting the "seen" bit. We don't want to lose the error here. We > > always have to do the cmpxchg on the set_wb_error side, I think. > > I don't follow your logic. > If (old == new) then there was a moment since this function started when > performing the cmpxchg() would not have changed the contents of memory. > So let's pretend it did actually happen at that moment, and not change > memory. > > If we race with a task setting the "seen" bit, then it will have seen > the error *after* the new error, that this thread is reporting, actually > happened. So the result is still correct. > Ok, that does make sense. I'll plan to do that. There's also a bug in the last patch that I sent. We need to mark the SEEN bit when we sample the value at open time, so we need a filemap_sample_wb_error function to grab the current wb_err_t and mark it SEEN if necessary. That also gives us a way to handle something like filemap_write_and_wait (which doesn't take a struct file). We can sample the wb_err_t prior to starting writeback, and then return an error if anything failed after that point. I think that's probably close enough to how the current code works that we can use it to make drop-in replacements for filemap_write_and_wait* which should keep us from having to change so much existing code here. filemap_check_errors would need to take a previously-sampled wb_err_t argument, but only the lowest-level callers of that and filemap_fdatawait* would need to deal with them directly.
diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt index 569211703721..b2b5e411b340 100644 --- a/Documentation/filesystems/vfs.txt +++ b/Documentation/filesystems/vfs.txt @@ -577,6 +577,11 @@ should clear PG_Dirty and set PG_Writeback. It can be actually written at any point after PG_Dirty is clear. Once it is known to be safe, PG_Writeback is cleared. +If there is an error during writeback, then the address_space should be +marked with an error (typically using filemap_set_wb_error), in order to +ensure that the error can later be reported to the application at fsync +or close. + Writeback makes use of a writeback_control structure... struct address_space_operations @@ -885,11 +890,16 @@ otherwise noted. "private_data" member in the file structure if you want to point to a device structure - flush: called by the close(2) system call to flush a file + flush: called by the close(2) system call to flush a file. Writeback + errors not previously reported via fsync should be reported + here as you would for fsync. release: called when the last reference to an open file is closed - fsync: called by the fsync(2) system call + fsync: called by the fsync(2) system call. Filesystems that use the + pagecache should call filemap_report_wb_error before returning + to ensure that any errors that occurred during writeback are + reported and the file's error sequence advanced. fasync: called by the fcntl(2) system call when asynchronous (non-blocking) mode is enabled for a file diff --git a/fs/open.c b/fs/open.c index 949cef29c3bb..baf82f2c642e 100644 --- a/fs/open.c +++ b/fs/open.c @@ -709,6 +709,9 @@ static int do_dentry_open(struct file *f, f->f_inode = inode; f->f_mapping = inode->i_mapping; + /* Ensure that we skip any errors that predate opening of the file */ + f->f_wb_err = READ_ONCE(inode->i_mapping->wb_err); + if (unlikely(f->f_flags & O_PATH)) { f->f_mode = FMODE_PATH; f->f_op = &empty_fops; diff --git a/include/linux/fs.h b/include/linux/fs.h index 7251f7bb45e8..f33857113ff4 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -394,6 +394,7 @@ struct address_space { gfp_t gfp_mask; /* implicit gfp mask for allocations */ struct list_head private_list; /* ditto */ void *private_data; /* ditto */ + u32 wb_err; } __attribute__((aligned(sizeof(long)))); /* * On most architectures that alignment is already the case; but @@ -868,6 +869,7 @@ struct file { struct list_head f_tfile_llink; #endif /* #ifdef CONFIG_EPOLL */ struct address_space *f_mapping; + u32 f_wb_err; } __attribute__((aligned(4))); /* lest something weird decides that 2 is OK */ struct file_handle { @@ -2521,6 +2523,8 @@ extern int __filemap_fdatawrite_range(struct address_space *mapping, extern int filemap_fdatawrite_range(struct address_space *mapping, loff_t start, loff_t end); extern int filemap_check_errors(struct address_space *mapping); +extern void filemap_set_wb_error(struct address_space *mapping, int err); +extern int filemap_report_wb_error(struct file *file); extern int vfs_fsync_range(struct file *file, loff_t start, loff_t end, int datasync); diff --git a/mm/filemap.c b/mm/filemap.c index 1694623a6289..60b6fa417b98 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -545,6 +545,168 @@ int filemap_write_and_wait_range(struct address_space *mapping, } EXPORT_SYMBOL(filemap_write_and_wait_range); +/* + * The wb_err field in the address_space provides a place to store writeback + * errors. We endeavor to deliver writeback errors to fsync on all open file + * descriptors that were open at the time that the error was caught. We do + * this using a 32-bit value to store the error, with the upper bits as a + * sequence counter. We can store any error up to MAX_ERROR. + * + * Additionally, we reserve one bit to indicate whether any fd has grabbed the + * value to record in its struct file. If nothing has, then we don't really + * need to increment the counter. + */ + +/* This bit is used as a flag to indicate whether the value has been seen */ +#define WB_ERR_SEEN (1 << 12) + +/* Increment the counter by this much to ensure that we don't touch earlier + * values */ +#define WB_ERR_CTR_INC (1 << 13) + +/** + * filemap_set_wb_error - set the wb error in the mapping for later reporting + * @mapping: mapping in which the error should be set + * @err: error to set. must be negative value but not less than -MAX_ERRNO + * + * When an error occurs during writeback of inode data, we must report that + * error during fsync. This function sets the writeback error field in the + * mapping, and increments the sequence counter. When fsync or close is later + * performed, the caller can then check the sequence in the mapping against + * the one in the file to determine whether the error should be reported. + * + * Because there are so few bits for the counter, we try to avoid incrementing + * it unless someone is going to record the value for later comparison. This + * is tracked by a bit in the 32 bit word that we use as a "seen" flag. + * + * Note that we always use the latest writeback error, since POSIX states + * that when there are multiple errors (e.g. -EIO followed by -ENOSPC), + * that any possible error may be returned. + */ +void filemap_set_wb_error(struct address_space *mapping, int err) +{ + u32 old; + + /* + * The above constants rely indirectly on MAX_ERRNO not changing + * since I'm not sure how to take a log at build time. Suggestions + * of better ways to phrase the flag values would be welcome. + */ + BUILD_BUG_ON(MAX_ERRNO + 1 != WB_ERR_SEEN); + + /* Optimize for common case of no error */ + if (likely(!err)) + return; + + /* + * Ensure the error code actually fits where we want it to go. If it + * doesn't then just throw a warning and don't record anything. + */ + if (unlikely(err > 0 || err < -MAX_ERRNO)) { + WARN(1, "err=%d\n", err); + return; + } + + old = READ_ONCE(mapping->wb_err); + for (;;) { + u32 new, cur; + + /* Clear out error bits and set new error */ + new = (old & ~MAX_ERRNO) | -err; + + /* Only increment if someone has looked at it */ + if (old & WB_ERR_SEEN) { + new += WB_ERR_CTR_INC; + new &= ~WB_ERR_SEEN; + } + + /* Try to swap the new value into place */ + cur = cmpxchg(&mapping->wb_err, old, new); + + /* + * Call it success if we did the swap or someone else beat us + * to it for the same value. + */ + if (likely(cur == old || cur == new)) + break; + + /* Raced with an update, try again */ + old = cur; + } +} +EXPORT_SYMBOL(filemap_set_wb_error); + +/** + * filemap_report_wb_error - report wb error (if any) that was previously set + * @file: struct file on which the error is being reported + * + * When userland calls fsync or close (or something like nfsd does the + * equivalent), we want to report any writeback errors that occurred since + * the last fsync (or since the file was opened if there haven't been any). + * + * Grab the wb_err from the mapping. If it matches what we have in the file, + * then just quickly return 0. The file is all caught up. + * + * If it doesn't match, then take the mapping value, set the "seen" flag in + * it and try to swap it into place. If it works, or another task beat us + * to it with the new value, then update the f_wb_err and return the error + * portion. The error at this point _should_ be reported to userland. + * + * While we handle mapping->wb_err with atomic operations, the f_wb_err + * value is protected by the f_lock since we must ensure that it reflects + * the latest value swapped in for this file descriptor. + */ +int filemap_report_wb_error(struct file *file) +{ + int err = 0; + struct address_space *mapping = file->f_mapping; + u32 old; + + old = READ_ONCE(mapping->wb_err); + + /* + * This catches the common case of no errors, and the case where + * nothing has changed since we last checked. + */ + if (old == READ_ONCE(file->f_wb_err)) + goto out; + + spin_lock(&file->f_lock); + for (;;) { + u32 cur, new; + + /* + * We always store values with the "seen" bit set, so if this + * matches what we already have, then we can call it done. + * There is nothing to update so just return 0. + */ + if (old == file->f_wb_err) + break; + + /* set flag and try to swap it into place */ + new = old | WB_ERR_SEEN; + cur = cmpxchg(&mapping->wb_err, old, new); + + /* + * We can quit now if we successfully swapped in the new value + * or someone else beat us to it with the same value that we + * were planning to store. + */ + if (likely(cur == old || cur == new)) { + file->f_wb_err = new; + err = -(new & MAX_ERRNO); + break; + } + + /* Raced with an update, try again */ + old = cur; + } + spin_unlock(&file->f_lock); +out: + return err; +} +EXPORT_SYMBOL(filemap_report_wb_error); + /** * replace_page_cache_page - replace a pagecache page with a new one * @old: page to be replaced