Message ID | 20241228014522.2395187-2-yi.zhang@huaweicloud.com |
---|---|
State | New |
Headers | show |
Series | [RFC,1/2] fs: introduce FALLOC_FL_FORCE_ZERO to fallocate | expand |
There's a feature request for something similar on the xfs list, so I guess people are asking for it. That being said this really should not be a modifier but a separate operation, as the logic is very different from FALLOC_FL_ZERO_RANGE, similar to how plain prealloc, hole punch and zero range are different operations despite all of them resulting in reads of zeroes from the range. That will also make it more clear that for files or file systems that require out place writes this operation should fail instead of doing pointless multiple writes. Also please write a man page update clearly specifying the semantics, especially if this should work or not if there is no write zeroes offload in the hardware, or if that offload actually writes physical zeroes to the media or not. Btw, someone really should clean up the ext4 fallocate code to use helper adnd do the switch (mode & FALLOC_FL_MODE_MASK) { } and then use helpers for each mode whih will make these things a lot more obvious.
On Mon, Jan 06, 2025 at 03:27:52AM -0800, Christoph Hellwig wrote: > There's a feature request for something similar on the xfs list, so > I guess people are asking for it. Yeah, I have folks asking for this on the ext4 side as well. The one caution that I've given to them is that there is no guarantee what the performance will be for WRITE SAME or equivalent operations, since the standards documents state that performance is out of scope for the document. So in some cases, WRITE SAME might be fast (if for example it is just adjusing FTL metadata on an SSD, or some similar thing on cloud-emulated block devices such as Google's Persistent Desk or Amazon's Elastic Block Device --- what Darrick has called "software defined storage" for the cloud), but in other hardware deployments, WRITE SAME might be as slow as writing zeros to an HDD. This is technically not the kernel's problem, since we can also use the same mealy-mouth "performance is out of scope and not the kernel's concern", but that just transfers the problem to the application programmers. I could imagine some kind of tunable which we can make the block device pretend that it really doesn't support using WRITE SAME if the performance characteristics are such that it's a Bad Idea to use it, so that there's a single tunable knob that the system adminstrator can reach for as opposed to have different ways for PostgresQL, MySQL, Oracle Enterprise Database, etc have for configuring whether or not to disable WRITE SAME, but that's not something we need to decide right away. > That being said this really should not be a modifier but a separate > operation, as the logic is very different from FALLOC_FL_ZERO_RANGE, > similar to how plain prealloc, hole punch and zero range are different > operations despite all of them resulting in reads of zeroes from the > range. Yes. And we might decide that it should be done using some kind of ioctl, such as BLKDISCARD, as opposed to a new fallocate operation, since it really isn't a filesystem metadata operation, just as BLKDISARD isn't. The other side of the argument is that ioctls are ugly, and maybe all new such operations should be plumbed through via fallocate as opposed to adding a new ioctl. I don't have strong feelings on this, although I *do* belive that whatever interface we use, whether it be fallocate or ioctl, it should be supported by block devices and files in a file system, to make life easier for those databases that want to support running on a raw block device (for full-page advertisements on the back cover of the Businessweek magazine) or on files (which is how 99.9% of all real-world users actually run enterprise databases. :-) - Ted
On Mon, Jan 06, 2025 at 11:17:32AM -0500, Theodore Ts'o wrote: > Yes. And we might decide that it should be done using some kind of > ioctl, such as BLKDISCARD, as opposed to a new fallocate operation, > since it really isn't a filesystem metadata operation, just as > BLKDISARD isn't. The other side of the argument is that ioctls are > ugly, and maybe all new such operations should be plumbed through via > fallocate as opposed to adding a new ioctl. I don't have strong > feelings on this, although I *do* belive that whatever interface we > use, whether it be fallocate or ioctl, it should be supported by block > devices and files in a file system, to make life easier for those > databases that want to support running on a raw block device (for > full-page advertisements on the back cover of the Businessweek > magazine) or on files (which is how 99.9% of all real-world users > actually run enterprise databases. :-) If you want the operation to work for files it needs to be routed through the file system as otherwise you can't make it actually work coherently. While you could add a new ioctl that works on a file fallocate seems like a much better interface. Supporting it on a block device is trivial, as it can mostly (or even entirely depending on the exact definition of the interface) reuse the existing zero range / punch hole code.
On Mon, Jan 06, 2025 at 08:27:49AM -0800, Christoph Hellwig wrote: > On Mon, Jan 06, 2025 at 11:17:32AM -0500, Theodore Ts'o wrote: > > Yes. And we might decide that it should be done using some kind of > > ioctl, such as BLKDISCARD, as opposed to a new fallocate operation, > > since it really isn't a filesystem metadata operation, just as > > BLKDISARD isn't. The other side of the argument is that ioctls are > > ugly, and maybe all new such operations should be plumbed through via > > fallocate as opposed to adding a new ioctl. I don't have strong > > feelings on this, although I *do* belive that whatever interface we > > use, whether it be fallocate or ioctl, it should be supported by block > > devices and files in a file system, to make life easier for those > > databases that want to support running on a raw block device (for > > full-page advertisements on the back cover of the Businessweek > > magazine) or on files (which is how 99.9% of all real-world users > > actually run enterprise databases. :-) > > If you want the operation to work for files it needs to be routed > through the file system as otherwise you can't make it actually > work coherently. While you could add a new ioctl that works on a > file fallocate seems like a much better interface. Supporting it > on a block device is trivial, as it can mostly (or even entirely > depending on the exact definition of the interface) reuse the existing > zero range / punch hole code. I think we should wire it up as a new FALLOC_FL_WRITE_ZEROES mode, document very vigorously that it exists to facilitate pure overwrites (specifically that it returns EOPNOTSUPP for always-cow files), and not add more ioctls. (That said, doesn't BLKZEROOUT already do this for bdevs?) --D
On Mon, Jan 06, 2025 at 09:31:33AM -0800, Darrick J. Wong wrote: > I think we should wire it up as a new FALLOC_FL_WRITE_ZEROES mode, > document very vigorously that it exists to facilitate pure overwrites > (specifically that it returns EOPNOTSUPP for always-cow files), and not > add more ioctls. That goes into a similar direction to what I'd prefer. > (That said, doesn't BLKZEROOUT already do this for bdevs?) Yes. But the same is true for the other fallocate modes on block devices.
On 2025/1/7 0:17, Theodore Ts'o wrote: > On Mon, Jan 06, 2025 at 03:27:52AM -0800, Christoph Hellwig wrote: >> There's a feature request for something similar on the xfs list, so >> I guess people are asking for it. > > Yeah, I have folks asking for this on the ext4 side as well. > > The one caution that I've given to them is that there is no guarantee > what the performance will be for WRITE SAME or equivalent operations, > since the standards documents state that performance is out of scope > for the document. So in some cases, WRITE SAME might be fast (if for > example it is just adjusing FTL metadata on an SSD, or some similar > thing on cloud-emulated block devices such as Google's Persistent Desk > or Amazon's Elastic Block Device --- what Darrick has called "software > defined storage" for the cloud), but in other hardware deployments, > WRITE SAME might be as slow as writing zeros to an HDD. > > This is technically not the kernel's problem, since we can also use > the same mealy-mouth "performance is out of scope and not the kernel's > concern", but that just transfers the problem to the application > programmers. I could imagine some kind of tunable which we can make > the block device pretend that it really doesn't support using WRITE > SAME if the performance characteristics are such that it's a Bad Idea > to use it, so that there's a single tunable knob that the system > adminstrator can reach for as opposed to have different ways for > PostgresQL, MySQL, Oracle Enterprise Database, etc have for > configuring whether or not to disable WRITE SAME, but that's not > something we need to decide right away. Yes, I completely agree with you. At this time, it is not possible to determine whether a disk supports fast write zeros only by checking if the disk supports the write_zero command. Especially for some HDDs, which should submit actual zeros to the disk even if they claim to support the write_zero command, but that is very slow. Therefore, I propose that we add a new feature flag, such as BLK_FEAT_FAST_WRITE_ZERO, to queue->limits.features. This flag should be set by each disk driver if the attached disk supports fast write zeros. For instance, the NVMe SSD driver should set this flag if the given namespace supports NVME_NS_DEAC. Additionally, we can add an entry in sysfs that allows the user to enable and disable this feature manually when the driver don't know the whether the disk supports it or not for some corner cases. > >> That being said this really should not be a modifier but a separate >> operation, as the logic is very different from FALLOC_FL_ZERO_RANGE, >> similar to how plain prealloc, hole punch and zero range are different >> operations despite all of them resulting in reads of zeroes from the >> range. > > Yes. And we might decide that it should be done using some kind of > ioctl, such as BLKDISCARD, as opposed to a new fallocate operation, > since it really isn't a filesystem metadata operation, just as > BLKDISARD isn't. The other side of the argument is that ioctls are > ugly, and maybe all new such operations should be plumbed through via > fallocate as opposed to adding a new ioctl. I don't have strong > feelings on this, although I *do* belive that whatever interface we > use, whether it be fallocate or ioctl, it should be supported by block > devices and files in a file system, to make life easier for those > databases that want to support running on a raw block device (for > full-page advertisements on the back cover of the Businessweek > magazine) or on files (which is how 99.9% of all real-world users > actually run enterprise databases. :-) > For this part, I still think it would be better to use fallocate. Thanks, Yi.
On 2025/1/6 19:27, Christoph Hellwig wrote: > There's a feature request for something similar on the xfs list, so > I guess people are asking for it. > > That being said this really should not be a modifier but a separate > operation, as the logic is very different from FALLOC_FL_ZERO_RANGE, > similar to how plain prealloc, hole punch and zero range are different > operations despite all of them resulting in reads of zeroes from the > range. OK, it seems reasonable to me, and adding a new operation would be better. There is actually no need to mix it with the current FALLOC_FL_ZERO_RANGE. > > That will also make it more clear that for files or file systems that > require out place writes this operation should fail instead of doing > pointless multiple writes. > > Also please write a man page update clearly specifying the semantics, > especially if this should work or not if there is no write zeroes > offload in the hardware, or if that offload actually writes physical > zeroes to the media or not. > Sure. thanks for your advice. Thanks, Yi. > Btw, someone really should clean up the ext4 fallocate code to use > helper adnd do the > > switch (mode & FALLOC_FL_MODE_MASK) { > } > > and then use helpers for each mode whih will make these things a lot > more obvious.
On 2025/1/7 1:31, Darrick J. Wong wrote: > On Mon, Jan 06, 2025 at 08:27:49AM -0800, Christoph Hellwig wrote: >> On Mon, Jan 06, 2025 at 11:17:32AM -0500, Theodore Ts'o wrote: >>> Yes. And we might decide that it should be done using some kind of >>> ioctl, such as BLKDISCARD, as opposed to a new fallocate operation, >>> since it really isn't a filesystem metadata operation, just as >>> BLKDISARD isn't. The other side of the argument is that ioctls are >>> ugly, and maybe all new such operations should be plumbed through via >>> fallocate as opposed to adding a new ioctl. I don't have strong >>> feelings on this, although I *do* belive that whatever interface we >>> use, whether it be fallocate or ioctl, it should be supported by block >>> devices and files in a file system, to make life easier for those >>> databases that want to support running on a raw block device (for >>> full-page advertisements on the back cover of the Businessweek >>> magazine) or on files (which is how 99.9% of all real-world users >>> actually run enterprise databases. :-) >> >> If you want the operation to work for files it needs to be routed >> through the file system as otherwise you can't make it actually >> work coherently. While you could add a new ioctl that works on a >> file fallocate seems like a much better interface. Supporting it >> on a block device is trivial, as it can mostly (or even entirely >> depending on the exact definition of the interface) reuse the existing >> zero range / punch hole code. > > I think we should wire it up as a new FALLOC_FL_WRITE_ZEROES mode, > document very vigorously that it exists to facilitate pure overwrites > (specifically that it returns EOPNOTSUPP for always-cow files), and not > add more ioctls. > Sorry. the "pure overwrites" and "always-cow files" makes me confused, this is mainly used to create a new written file range, but also could be used to zero out an existing range, why you mentioned it exists to facilitate pure overwrites? For the "always-cow files", do you mean reflinked files? Could you please give more details? Thanks, Yi. > (That said, doesn't BLKZEROOUT already do this for bdevs?) > > --D
On Tue, Jan 07, 2025 at 10:05:47PM +0800, Zhang Yi wrote: > Sorry. the "pure overwrites" and "always-cow files" makes me confused, > this is mainly used to create a new written file range, but also could > be used to zero out an existing range, why you mentioned it exists to > facilitate pure overwrites? If you're fine with writes to your file causing block allocations you can already use the hole punch or preallocate fallocate modes. No need to actually send a command to the device. > > For the "always-cow files", do you mean reflinked files? Could you > please give more details? reflinked files will require out of place writes for shared blocks. As will anything on device mapper snapshots. Or any file on file systems that write out of place (btrfs, f2fs, nilfs2, the upcoming zoned xfs mode).
On 2025/1/8 0:42, Christoph Hellwig wrote: > On Tue, Jan 07, 2025 at 10:05:47PM +0800, Zhang Yi wrote: >> Sorry. the "pure overwrites" and "always-cow files" makes me confused, >> this is mainly used to create a new written file range, but also could >> be used to zero out an existing range, why you mentioned it exists to >> facilitate pure overwrites? > > If you're fine with writes to your file causing block allocations you > can already use the hole punch or preallocate fallocate modes. No > need to actually send a command to the device. > Okay, I misunderstood your point earlier. This is indeed prepared for subsequent overwrites. Thanks a lot for explaining. Thanks, Yi. >> >> For the "always-cow files", do you mean reflinked files? Could you >> please give more details? > > reflinked files will require out of place writes for shared blocks. > As will anything on device mapper snapshots. Or any file on > file systems that write out of place (btrfs, f2fs, nilfs2, the > upcoming zoned xfs mode). >
diff --git a/fs/open.c b/fs/open.c index e6911101fe71..d3afaddfcf27 100644 --- a/fs/open.c +++ b/fs/open.c @@ -246,7 +246,7 @@ int vfs_fallocate(struct file *file, int mode, loff_t offset, loff_t len) if (offset < 0 || len <= 0) return -EINVAL; - if (mode & ~(FALLOC_FL_MODE_MASK | FALLOC_FL_KEEP_SIZE)) + if (mode & ~(FALLOC_FL_MODE_MASK | FALLOC_FL_SUPPORT_MASK)) return -EOPNOTSUPP; /* @@ -259,15 +259,23 @@ int vfs_fallocate(struct file *file, int mode, loff_t offset, loff_t len) switch (mode & FALLOC_FL_MODE_MASK) { case FALLOC_FL_ALLOCATE_RANGE: case FALLOC_FL_UNSHARE_RANGE: + if (mode & FALLOC_FL_FORCE_ZERO) + return -EOPNOTSUPP; + break; case FALLOC_FL_ZERO_RANGE: + if ((mode & FALLOC_FL_KEEP_SIZE) && + (mode & FALLOC_FL_FORCE_ZERO)) + return -EOPNOTSUPP; break; case FALLOC_FL_PUNCH_HOLE: - if (!(mode & FALLOC_FL_KEEP_SIZE)) + if (!(mode & FALLOC_FL_KEEP_SIZE) || + (mode & FALLOC_FL_FORCE_ZERO)) return -EOPNOTSUPP; break; case FALLOC_FL_COLLAPSE_RANGE: case FALLOC_FL_INSERT_RANGE: - if (mode & FALLOC_FL_KEEP_SIZE) + if ((mode & FALLOC_FL_KEEP_SIZE) || + (mode & FALLOC_FL_FORCE_ZERO)) return -EOPNOTSUPP; break; default: diff --git a/include/linux/falloc.h b/include/linux/falloc.h index 3f49f3df6af5..75ac063d7eab 100644 --- a/include/linux/falloc.h +++ b/include/linux/falloc.h @@ -29,7 +29,8 @@ struct space_resv { * Mask of all supported fallocate modes. Only one can be set at a time. * * In addition to the mode bit, the mode argument can also encode flags. - * FALLOC_FL_KEEP_SIZE is the only supported flag so far. + * FALLOC_FL_KEEP_SIZE and FALLOC_FL_FORCE_ZERO are the only supported + * flags so far. */ #define FALLOC_FL_MODE_MASK (FALLOC_FL_ALLOCATE_RANGE | \ FALLOC_FL_PUNCH_HOLE | \ @@ -37,6 +38,8 @@ struct space_resv { FALLOC_FL_ZERO_RANGE | \ FALLOC_FL_INSERT_RANGE | \ FALLOC_FL_UNSHARE_RANGE) +#define FALLOC_FL_SUPPORT_MASK (FALLOC_FL_KEEP_SIZE | \ + FALLOC_FL_FORCE_ZERO) /* on ia32 l_start is on a 32-bit boundary */ #if defined(CONFIG_X86_64) diff --git a/include/uapi/linux/falloc.h b/include/uapi/linux/falloc.h index 5810371ed72b..7c12bcdff7d3 100644 --- a/include/uapi/linux/falloc.h +++ b/include/uapi/linux/falloc.h @@ -78,4 +78,16 @@ */ #define FALLOC_FL_UNSHARE_RANGE 0x40 +/* + * FALLOC_FL_FORCE_ZERO should be used in conjunction with FALLOC_FL_ZERO_RANGE, + * it force the file system issuing zero and allocate written extents. The + * zeroing out can speed up with the REQ_OP_WRITE_ZEROES command, and sebsequent + * overwriting over this range can save significant metadata changes, which + * should be contribute to improve the overwrite performance on such + * preallocated range. + * + * This flag cannot be used in conjunction with the FALLOC_FL_KEEP_SIZE. + */ +#define FALLOC_FL_FORCE_ZERO 0x80 + #endif /* _UAPI_FALLOC_H_ */