Message ID: 20230412134057.381941-1-bschubert@ddn.com
State: New
Series: [1/2] fs: add FMODE_DIO_PARALLEL_WRITE flag
Sorry, had forgotten to CC Miklos.

On 4/12/23 15:40, Bernd Schubert wrote:
> Miklos, Jens,
>
> could we please also set this flag for fuse?
>
> Thanks,
> Bernd
>
>
> fuse: Set FMODE_DIO_PARALLEL_WRITE flag
>
> From: Bernd Schubert <bschubert@ddn.com>
>
> Fuse can also do parallel DIO writes, if userspace has enabled it.
>
> Signed-off-by: Bernd Schubert <bschubert@ddn.com>
> ---
>  fs/fuse/file.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index 875314ee6f59..46e7f1196fd1 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -215,6 +215,9 @@ void fuse_finish_open(struct inode *inode, struct file *file)
>  	}
>  	if ((file->f_mode & FMODE_WRITE) && fc->writeback_cache)
>  		fuse_link_write_file(file);
> +
> +	if (ff->open_flags & FOPEN_PARALLEL_DIRECT_WRITES)
> +		file->f_mode |= FMODE_DIO_PARALLEL_WRITE;
>  }
>
>  int fuse_open_common(struct inode *inode, struct file *file, bool isdir)
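[For context on "if userspace has enabled it": the daemon opts in per open file handle. A minimal sketch using the libfuse low-level API, assuming a libfuse version that exposes the parallel_direct_writes bit in struct fuse_file_info; the handler name is hypothetical:]

	#include <fuse_lowlevel.h>

	/* Hypothetical open handler: mark this file handle as tolerating
	 * concurrent direct writes. libfuse translates this bit into
	 * FOPEN_PARALLEL_DIRECT_WRITES in the FUSE_OPEN reply, which is
	 * what fuse_finish_open() checks above. */
	static void my_open(fuse_req_t req, fuse_ino_t ino,
			    struct fuse_file_info *fi)
	{
		fi->parallel_direct_writes = 1;
		fuse_reply_open(req, fi);
	}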
On Wed, 12 Apr 2023 at 15:42, Bernd Schubert <bschubert@ddn.com> wrote:
>
> Miklos, Jens,
>
> could we please also set this flag for fuse?
>
> Thanks,
> Bernd
>
>
> fuse: Set FMODE_DIO_PARALLEL_WRITE flag
>
> From: Bernd Schubert <bschubert@ddn.com>
>
> Fuse can also do parallel DIO writes, if userspace has enabled it.
>
> Signed-off-by: Bernd Schubert <bschubert@ddn.com>
> ---
>  fs/fuse/file.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index 875314ee6f59..46e7f1196fd1 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -215,6 +215,9 @@ void fuse_finish_open(struct inode *inode, struct file *file)
>  	}
>  	if ((file->f_mode & FMODE_WRITE) && fc->writeback_cache)
>  		fuse_link_write_file(file);
> +
> +	if (ff->open_flags & FOPEN_PARALLEL_DIRECT_WRITES)
> +		file->f_mode |= FMODE_DIO_PARALLEL_WRITE;

fuse_direct_write_iter():

	bool exclusive_lock =
		!(ff->open_flags & FOPEN_PARALLEL_DIRECT_WRITES) ||
		iocb->ki_flags & IOCB_APPEND ||
		fuse_direct_write_extending_i_size(iocb, from);

If the write is size extending, then it will take the lock exclusive.
OTOH, I guess that it would be unusual for lots of size extending
writes to be done in parallel.

What would be the effect of giving the FMODE_DIO_PARALLEL_WRITE hint
and then still serializing the writes?

Thanks,
Miklos
On 4/13/23 09:40, Miklos Szeredi wrote:
> On Wed, 12 Apr 2023 at 15:42, Bernd Schubert <bschubert@ddn.com> wrote:
>>
>> Miklos, Jens,
>>
>> could we please also set this flag for fuse?
>>
>> Thanks,
>> Bernd
>>
>>
>> fuse: Set FMODE_DIO_PARALLEL_WRITE flag
>>
>> From: Bernd Schubert <bschubert@ddn.com>
>>
>> Fuse can also do parallel DIO writes, if userspace has enabled it.
>>
>> Signed-off-by: Bernd Schubert <bschubert@ddn.com>
>> ---
>>  fs/fuse/file.c | 3 +++
>>  1 file changed, 3 insertions(+)
>>
>> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
>> index 875314ee6f59..46e7f1196fd1 100644
>> --- a/fs/fuse/file.c
>> +++ b/fs/fuse/file.c
>> @@ -215,6 +215,9 @@ void fuse_finish_open(struct inode *inode, struct file *file)
>>  	}
>>  	if ((file->f_mode & FMODE_WRITE) && fc->writeback_cache)
>>  		fuse_link_write_file(file);
>> +
>> +	if (ff->open_flags & FOPEN_PARALLEL_DIRECT_WRITES)
>> +		file->f_mode |= FMODE_DIO_PARALLEL_WRITE;
>
> fuse_direct_write_iter():
>
> 	bool exclusive_lock =
> 		!(ff->open_flags & FOPEN_PARALLEL_DIRECT_WRITES) ||
> 		iocb->ki_flags & IOCB_APPEND ||
> 		fuse_direct_write_extending_i_size(iocb, from);
>
> If the write is size extending, then it will take the lock exclusive.
> OTOH, I guess that it would be unusual for lots of size extending
> writes to be done in parallel.
>
> What would be the effect of giving the FMODE_DIO_PARALLEL_WRITE hint
> and then still serializing the writes?

It is used here:
https://lore.kernel.org/io-uring/20230403-wound-roundworm-c1660e059b8c@brauner/T/#m5f86985d6c67dd1d01a977475dab542c338372dd

fuse_finish_open has its own lock, so letting uring handle requests in
parallel should not hurt?

Is this going like
application -> uring does parallel requests -> fuse.ko -> fuse-daemon

So when the fuse-daemon signals that it can handle parallel DIO, it is
just fuse.ko that might need its own lock to extend the file?

Thanks,
Bernd
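[For reference, the io_uring consumer Bernd links to looks roughly like this — a sketch of the check in io_prep_async_work(); see the linked series for the authoritative patch. `def` is the per-opcode definition, with hash_reg_file set for writes to regular files:]

	if (req->file && (req->flags & REQ_F_ISREG)) {
		bool should_hash = def->hash_reg_file;

		/* don't serialize this request if the fs doesn't need it:
		 * an O_DIRECT write to a file whose fs advertises
		 * FMODE_DIO_PARALLEL_WRITE may run on io-wq in parallel */
		if (should_hash && (req->file->f_flags & O_DIRECT) &&
		    (req->file->f_mode & FMODE_DIO_PARALLEL_WRITE))
			should_hash = false;
		if (should_hash || (ctx->flags & IORING_SETUP_IOPOLL))
			io_wq_hash_work(&req->work, file_inode(req->file));
	}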
On Thu, Apr 13, 2023 at 09:40:29AM +0200, Miklos Szeredi wrote:
> fuse_direct_write_iter():
>
> 	bool exclusive_lock =
> 		!(ff->open_flags & FOPEN_PARALLEL_DIRECT_WRITES) ||
> 		iocb->ki_flags & IOCB_APPEND ||
> 		fuse_direct_write_extending_i_size(iocb, from);
>
> If the write is size extending, then it will take the lock exclusive.
> OTOH, I guess that it would be unusual for lots of size extending
> writes to be done in parallel.
>
> What would be the effect of giving the FMODE_DIO_PARALLEL_WRITE hint
> and then still serializing the writes?

I have no idea how this flag works, but XFS also takes i_rwsem
exclusively for appends, when the positions and size aren't aligned to
the block size, and a few other cases.
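[For context, the XFS dispatch Christoph refers to looks roughly like this in current kernels — a simplified sketch of xfs_file_dio_write(), with the sector-size validation omitted:]

	/* Simplified sketch: block-unaligned DIO writes take the
	 * unaligned path, which holds i_rwsem exclusively; aligned
	 * writes within EOF can run under a shared i_rwsem. */
	static ssize_t
	xfs_file_dio_write(struct kiocb *iocb, struct iov_iter *from)
	{
		struct xfs_inode *ip = XFS_I(file_inode(iocb->ki_filp));
		size_t count = iov_iter_count(from);

		if ((iocb->ki_pos | count) & ip->i_mount->m_blockmask)
			return xfs_file_dio_write_unaligned(ip, iocb, from);
		return xfs_file_dio_write_aligned(ip, iocb, from);
	}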
On Thu, Apr 13, 2023 at 10:11:28PM -0700, Christoph Hellwig wrote:
> On Thu, Apr 13, 2023 at 09:40:29AM +0200, Miklos Szeredi wrote:
>> fuse_direct_write_iter():
>>
>> 	bool exclusive_lock =
>> 		!(ff->open_flags & FOPEN_PARALLEL_DIRECT_WRITES) ||
>> 		iocb->ki_flags & IOCB_APPEND ||
>> 		fuse_direct_write_extending_i_size(iocb, from);
>>
>> If the write is size extending, then it will take the lock exclusive.
>> OTOH, I guess that it would be unusual for lots of size extending
>> writes to be done in parallel.
>>
>> What would be the effect of giving the FMODE_DIO_PARALLEL_WRITE hint
>> and then still serializing the writes?
>
> I have no idea how this flag works, but XFS also takes i_rwsem
> exclusively for appends, when the positions and size aren't aligned to
> the block size, and a few other cases.

IIUC uring wants to avoid the situation where someone sends 300 writes
to the same file, all of which end up in background workers, and all of
which then contend on exclusive i_rwsem. Hence it has some hashing
scheme that executes io requests serially if they hash to the same
value (which iirc is the inode number?) to prevent resource waste.

This flag turns off that hashing behavior on the assumption that each
of those 300 writes won't serialize on the other 299 writes, hence it's
ok to start up 300 workers.

(apologies for precoffee garbled response)

--D
On 4/14/23 9:36 AM, Darrick J. Wong wrote:
> On Thu, Apr 13, 2023 at 10:11:28PM -0700, Christoph Hellwig wrote:
>> On Thu, Apr 13, 2023 at 09:40:29AM +0200, Miklos Szeredi wrote:
>>> fuse_direct_write_iter():
>>>
>>> 	bool exclusive_lock =
>>> 		!(ff->open_flags & FOPEN_PARALLEL_DIRECT_WRITES) ||
>>> 		iocb->ki_flags & IOCB_APPEND ||
>>> 		fuse_direct_write_extending_i_size(iocb, from);
>>>
>>> If the write is size extending, then it will take the lock exclusive.
>>> OTOH, I guess that it would be unusual for lots of size extending
>>> writes to be done in parallel.
>>>
>>> What would be the effect of giving the FMODE_DIO_PARALLEL_WRITE hint
>>> and then still serializing the writes?
>>
>> I have no idea how this flag works, but XFS also takes i_rwsem
>> exclusively for appends, when the positions and size aren't aligned to
>> the block size, and a few other cases.
>
> IIUC uring wants to avoid the situation where someone sends 300 writes
> to the same file, all of which end up in background workers, and all of
> which then contend on exclusive i_rwsem. Hence it has some hashing
> scheme that executes io requests serially if they hash to the same
> value (which iirc is the inode number?) to prevent resource waste.
>
> This flag turns off that hashing behavior on the assumption that each
> of those 300 writes won't serialize on the other 299 writes, hence it's
> ok to start up 300 workers.
>
> (apologies for precoffee garbled response)

Yep, that is pretty much it. If all writes to that inode are serialized
by a lock on the fs side, then we'll get a lot of contention on that
mutex. And since, originally, nothing supported async writes, everything
would get punted to the io-wq workers. io_uring added per-inode hashing
for this, so that any punt to io-wq of a write would get serialized.

IOW, it's an efficiency thing, not a correctness thing.
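[The per-inode hashing Jens describes is keyed on the inode — the inode pointer, hashed, rather than the inode number. The tagging helper in io_uring's io-wq looks roughly like this; work items carrying the same hash bucket are executed serially by the worker pool instead of each getting its own worker:]

	/* Sketch of io_wq_hash_work(): tag a work item with a hash
	 * bucket derived from a key (here, the inode). io-wq runs
	 * items in the same bucket one at a time. */
	void io_wq_hash_work(struct io_wq_work *work, void *val)
	{
		unsigned int bit;

		bit = hash_ptr(val, IO_WQ_HASH_ORDER);
		work->flags |= (IO_WQ_WORK_HASHED | (bit << IO_WQ_HASH_SHIFT));
	}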
On Fri, Apr 14, 2023 at 08:36:12AM -0700, Darrick J. Wong wrote:
> IIUC uring wants to avoid the situation where someone sends 300 writes
> to the same file, all of which end up in background workers, and all of
> which then contend on exclusive i_rwsem. Hence it has some hashing
> scheme that executes io requests serially if they hash to the same
> value (which iirc is the inode number?) to prevent resource waste.
>
> This flag turns off that hashing behavior on the assumption that each
> of those 300 writes won't serialize on the other 299 writes, hence it's
> ok to start up 300 workers.
>
> (apologies for precoffee garbled response)

It might be useful if someone (Jens?) could clearly document the
assumptions for this flag.
On Sat, 15 Apr 2023 at 15:15, Jens Axboe <axboe@kernel.dk> wrote:

> Yep, that is pretty much it. If all writes to that inode are serialized
> by a lock on the fs side, then we'll get a lot of contention on that
> mutex. And since, originally, nothing supported async writes, everything
> would get punted to the io-wq workers. io_uring added per-inode hashing
> for this, so that any punt to io-wq of a write would get serialized.
>
> IOW, it's an efficiency thing, not a correctness thing.

We could still get a performance regression if the majority of writes
still trigger the exclusive locking. The questions are:

- how often does that happen in real life?
- how bad the performance regression would be?

Without first attempting to answer those questions, I'd be reluctant
to add FMODE_DIO_PARALLEL_WRITE to fuse.

Thanks,
Miklos
On 4/18/23 14:42, Miklos Szeredi wrote:
> On Sat, 15 Apr 2023 at 15:15, Jens Axboe <axboe@kernel.dk> wrote:
>
>> Yep, that is pretty much it. If all writes to that inode are serialized
>> by a lock on the fs side, then we'll get a lot of contention on that
>> mutex. And since, originally, nothing supported async writes, everything
>> would get punted to the io-wq workers. io_uring added per-inode hashing
>> for this, so that any punt to io-wq of a write would get serialized.
>>
>> IOW, it's an efficiency thing, not a correctness thing.
>
> We could still get a performance regression if the majority of writes
> still trigger the exclusive locking. The questions are:
>
> - how often does that happen in real life?

Application depending? My personal opinion is that
applications/developers knowing about uring would also know that they
should set the right file size first. Like MPIIO is extending files
persistently and it is hard to fix with all these different MPI stacks
(I can try to notify mpich and mvapich developers). So best would be to
document it somewhere in the uring man page that extending files in
parallel might have negative side effects?

> - how bad the performance regression would be?

I can give it a try with fio and fallocate=none over fuse during the
next days.

> Without first attempting to answer those questions, I'd be reluctant
> to add FMODE_DIO_PARALLEL_WRITE to fuse.

Bernd
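[For the experiment Bernd proposes, a fio job along these lines would compare in-EOF writes against size-extending ones. A sketch only: the mount point, sizes, and job names are placeholders.]

	# Hypothetical fio sketch: async DIO writes over a fuse mount,
	# first within a preallocated file, then as size-extending
	# writes with fallocate=none.
	[global]
	ioengine=io_uring
	direct=1
	bs=4k
	iodepth=32
	directory=/mnt/fuse
	size=1g

	# Preallocated file: writes stay within i_size (shared-lock path).
	[in-eof]
	rw=randwrite
	fallocate=native

	# No preallocation: sequential writes keep extending i_size
	# (exclusive-lock path). stonewall runs this after the job above.
	[extending]
	stonewall
	rw=write
	fallocate=none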
On Tue, Apr 18, 2023 at 12:55:40PM +0000, Bernd Schubert wrote:
> On 4/18/23 14:42, Miklos Szeredi wrote:
>> On Sat, 15 Apr 2023 at 15:15, Jens Axboe <axboe@kernel.dk> wrote:
>>
>>> Yep, that is pretty much it. If all writes to that inode are serialized
>>> by a lock on the fs side, then we'll get a lot of contention on that
>>> mutex. And since, originally, nothing supported async writes, everything
>>> would get punted to the io-wq workers. io_uring added per-inode hashing
>>> for this, so that any punt to io-wq of a write would get serialized.
>>>
>>> IOW, it's an efficiency thing, not a correctness thing.
>>
>> We could still get a performance regression if the majority of writes
>> still trigger the exclusive locking. The questions are:
>>
>> - how often does that happen in real life?
>
> Application depending? My personal opinion is that
> applications/developers knowing about uring would also know that they
> should set the right file size first. Like MPIIO is extending files
> persistently and it is hard to fix with all these different MPI stacks
> (I can try to notify mpich and mvapich developers). So best would be to
> document it somewhere in the uring man page that extending files in
> parallel might have negative side effects?

There are relatively few applications running concurrent async
RWF_APPEND DIO writes. IIRC ScyllaDB was the first we came across a few
years ago. Apps that use RWF_APPEND for individual DIOs expect that it
doesn't cause performance anomalies.

These days XFS will run concurrent append DIO writes and it doesn't
serialise RWF_APPEND IO against other RWF_APPEND IOs. Avoiding data
corruption due to racing append IOs doing file extension has been
delegated to the userspace application, similar to how we delegate the
responsibility for avoiding data corruption due to overlapping
concurrent DIO to userspace.

>> - how bad the performance regression would be?
>
> I can give it a try with fio and fallocate=none over fuse during the
> next days.

It depends on where the lock that triggers serialisation is, and how
bad the contention on it is. rwsems suck for write contention because
of the "spin on owner" "optimisations" for write locking and the long
write holds that occur in the IO path.

In general, it will be no worse than using userspace threads to issue
the exact same IO pattern using concurrent sync IO.

>> Without first attempting to answer those questions, I'd be reluctant
>> to add FMODE_DIO_PARALLEL_WRITE to fuse.

I'd tag it with this anyway - for the majority of apps that are doing
concurrent DIO within EOF, shared locking is a big win. If there's a
corner case that apps trigger that is slow, deal with them when they
are reported....

Cheers,
Dave.
On 4/18/23 4:13 PM, Dave Chinner wrote:
>>> Without first attempting to answer those questions, I'd be reluctant
>>> to add FMODE_DIO_PARALLEL_WRITE to fuse.
>
> I'd tag it with this anyway - for the majority of apps that are doing
> concurrent DIO within EOF, shared locking is a big win. If there's a
> corner case that apps trigger that is slow, deal with them when they
> are reported....

Agree, the common/fast case will be fine, which is really the most
important part.
On 4/15/23 11:54 PM, Christoph Hellwig wrote:
> On Fri, Apr 14, 2023 at 08:36:12AM -0700, Darrick J. Wong wrote:
>> IIUC uring wants to avoid the situation where someone sends 300 writes
>> to the same file, all of which end up in background workers, and all of
>> which then contend on exclusive i_rwsem. Hence it has some hashing
>> scheme that executes io requests serially if they hash to the same
>> value (which iirc is the inode number?) to prevent resource waste.
>>
>> This flag turns off that hashing behavior on the assumption that each
>> of those 300 writes won't serialize on the other 299 writes, hence it's
>> ok to start up 300 workers.
>>
>> (apologies for precoffee garbled response)
>
> It might be useful if someone (Jens?) could clearly document the
> assumptions for this flag.

I guess it can be summed up as: the common case should not be using
exclusive (per file/inode) locking. If file extensions need exclusive
locking, that's less of a concern, as I don't think it's unreasonable
to expect that to require stricter locking.
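[Something along these lines would capture that summary as the documentation Christoph asks for — a suggested wording only, not existing kernel text:]

	/*
	 * Suggested comment for the flag added in patch 1/2 of this
	 * series (a sketch, not merged kernel documentation):
	 *
	 * FMODE_DIO_PARALLEL_WRITE: hint from the filesystem that
	 * O_DIRECT writes to this file do not take an exclusive
	 * per-inode lock in the common case, so io_uring need not
	 * serialize punted writes on the inode. Occasional exclusive
	 * locking (appends, size-extending or unaligned writes) is
	 * still permitted: this is an efficiency hint, not a
	 * correctness guarantee.
	 */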
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 875314ee6f59..46e7f1196fd1 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -215,6 +215,9 @@ void fuse_finish_open(struct inode *inode, struct file *file)
 	}
 	if ((file->f_mode & FMODE_WRITE) && fc->writeback_cache)
 		fuse_link_write_file(file);
+
+	if (ff->open_flags & FOPEN_PARALLEL_DIRECT_WRITES)
+		file->f_mode |= FMODE_DIO_PARALLEL_WRITE;
 }

 int fuse_open_common(struct inode *inode, struct file *file, bool isdir)