Message ID | 5559C23A.4070406@redhat.com |
---|---|
State | New |
Headers | show |
On Mon, 2015-05-18 at 12:43 +0200, Florian Weimer wrote: > Another very recent example is here: > > https://lists.fedorahosted.org/pipermail/elfutils-devel/2015-May/004868.html > > > This suggests that people actually rely on the current allocation > > behavior. Combined with my previous analysis that applications will > > start to fail if we remove the fallback and return EINVAL, I now think > > we need to keep the allocation loop. I should point out that the above patch isn't in elfutils yet. It is waiting on how this discussion turns out. At the moment we simply use ftruncate. The problem that is solved by using posix_fallocate is that we are about the write to the memory of an mmapped file. Since ftruncate doesn't guarantee that the backing store is really there we risk getting a SIGBUS if the disk is full and we write to a memory area that hasn't been allocated yet. Since this is in library code, we cannot simply catch the SIGBUS. And we cannot use fallocate since that doesn't guarantee that the backing store is really allocated since it depends on whether the underlying file system support fallocate. As far as I know posix_fallocate is the only way to guarantee that the file is fully allocated without spurious failures depending on where the file resides in the file system. And posix_fallocate has been available since forever with this functionality. We only expect an error if there was a real error, like ENOSPACE, allocating the file. But if there is another way to get that guarantee we would be happy to switch to it. Cheers, Mark
On 05/22/2015 12:18 PM, Mark Wielaard wrote: > On Mon, 2015-05-18 at 12:43 +0200, Florian Weimer wrote: >> Another very recent example is here: >> >> https://lists.fedorahosted.org/pipermail/elfutils-devel/2015-May/004868.html >> >>> This suggests that people actually rely on the current allocation >>> behavior. Combined with my previous analysis that applications will >>> start to fail if we remove the fallback and return EINVAL, I now think >>> we need to keep the allocation loop. > > I should point out that the above patch isn't in elfutils yet. It is > waiting on how this discussion turns out. > > At the moment we simply use ftruncate. The problem that is solved by > using posix_fallocate is that we are about the write to the memory of an > mmapped file. Since ftruncate doesn't guarantee that the backing store > is really there we risk getting a SIGBUS if the disk is full and we > write to a memory area that hasn't been allocated yet. Since this is in > library code, we cannot simply catch the SIGBUS. And we cannot use > fallocate since that doesn't guarantee that the backing store is really > allocated since it depends on whether the underlying file system support > fallocate. posix_fallocate does not guarantee this, either. See my patch with the documentation update. Compression, COW, thin provision all can result in ENOSPC. And obviously, there can be other I/O errors. If you absolutely, truly need to use mmap, we need either have to provide a way to intercept SIGBUS (perhaps à la SHE—the technology is there, it's just not available to C code in a deeply nested library right now), or another mmap flag that prevents the kernel from sending SIGBUS, and some way to tell if a mapping had been subject to write errors (perhaps revive msync(MS_ASYNC)?). > As far as I know posix_fallocate is the only way to guarantee > that the file is fully allocated without spurious failures depending on > where the file resides in the file system. And posix_fallocate has been > available since forever with this functionality. Thin provisioning etc. has become more common lately, so while posix_fallocate did provide this functionality in the past, it doesn't seem to do it right now.
On Tue, 2015-05-26 at 11:12 +0200, Florian Weimer wrote: > On 05/22/2015 12:18 PM, Mark Wielaard wrote: > > At the moment we simply use ftruncate. The problem that is solved by > > using posix_fallocate is that we are about the write to the memory of an > > mmapped file. Since ftruncate doesn't guarantee that the backing store > > is really there we risk getting a SIGBUS if the disk is full and we > > write to a memory area that hasn't been allocated yet. Since this is in > > library code, we cannot simply catch the SIGBUS. And we cannot use > > fallocate since that doesn't guarantee that the backing store is really > > allocated since it depends on whether the underlying file system support > > fallocate. > > posix_fallocate does not guarantee this, either. See my patch with the > documentation update. Compression, COW, thin provision all can result > in ENOSPC. And obviously, there can be other I/O errors. But that is precisely what we want. We want to get an ENOSPC (or some other I/O error as documented for posix_fallocate) before we start poking at the mmap backed memory. It seems that is what the glibc posix_fallocate guarantees, unlike ftruncate (or fallocate which can return an error even though there is space, but the file system just doesn't implement fallocate support). If not posix_allocate, what would you recommend we use to make sure the file has a backing store before we start poking at it? > If you absolutely, truly need to use mmap, we need either have to > provide a way to intercept SIGBUS (perhaps à la SHE—the technology is > there, it's just not available to C code in a deeply nested library > right now), or another mmap flag that prevents the kernel from sending > SIGBUS, and some way to tell if a mapping had been subject to write > errors (perhaps revive msync(MS_ASYNC)?). Yes, having msync report the error instead of having to deal with signals (which is a pain in a library) would definitely be preferred. Or any other mechanism to make sure that the mmap backed file area really exists and/or gets written to the backing store. > Thin provisioning etc. has become more common lately, so while > posix_fallocate did provide this functionality in the past, it doesn't > seem to do it right now. I might not understand precisely what you mean by thin-provisioning. But wouldn't all fancy new filesystems also support fallocate directly anyway? In which case the fallback part doesn't even trigger. And if it does trigger then because it touches and explicitly writes any block that doesn't have a backing store right now, would get it, doesn't it? Thanks, Mark
On 05/26/2015 12:44 PM, Mark Wielaard wrote: >> posix_fallocate does not guarantee this, either. See my patch with the >> documentation update. Compression, COW, thin provision all can result >> in ENOSPC. And obviously, there can be other I/O errors. > > But that is precisely what we want. We want to get an ENOSPC (or some > other I/O error as documented for posix_fallocate) before we start > poking at the mmap backed memory. It seems that is what the glibc > posix_fallocate guarantees, unlike ftruncate (or fallocate which can > return an error even though there is space, but the file system just > doesn't implement fallocate support). Sorry, what I meant is this: Even if you actually *write* zeros now, the increasing depth of the storage stack means that this will not reliable prevent you from getting ENOSPC later (which is reported, in the context of mmap, as SIGBUS). > If not posix_allocate, what would you recommend we use to make sure the > file has a backing store before we start poking at it? Due to the possibility of copy-on-write storage backends, I don't think you can get what you want anymore. >> Thin provisioning etc. has become more common lately, so while >> posix_fallocate did provide this functionality in the past, it doesn't >> seem to do it right now. > > I might not understand precisely what you mean by thin-provisioning. But > wouldn't all fancy new filesystems also support fallocate directly > anyway? Thin provisioning often operates at the block layer. An fallocate call typically only updates metadata, creating unwritten extents instead of actually writing out zeros. > In which case the fallback part doesn't even trigger. And if it > does trigger then because it touches and explicitly writes any block > that doesn't have a backing store right now, would get it, doesn't it? I don't know if anything implements this optimization, but it's possible that if you write zeros to a thin-provisioned hole (think sparse file at the block layer), then the device discards the write operation. It's definitely an issue with compression. A block of zeros likely compresses better than anything you will write later, so you can still get into an ENOSPC situation.
On Tue, May 26, 2015 at 11:12:15AM +0200, Florian Weimer wrote: > On 05/22/2015 12:18 PM, Mark Wielaard wrote: > > On Mon, 2015-05-18 at 12:43 +0200, Florian Weimer wrote: > >> Another very recent example is here: > >> > >> https://lists.fedorahosted.org/pipermail/elfutils-devel/2015-May/004868.html > >> > >>> This suggests that people actually rely on the current allocation > >>> behavior. Combined with my previous analysis that applications will > >>> start to fail if we remove the fallback and return EINVAL, I now think > >>> we need to keep the allocation loop. > > > > I should point out that the above patch isn't in elfutils yet. It is > > waiting on how this discussion turns out. > > > > At the moment we simply use ftruncate. The problem that is solved by > > using posix_fallocate is that we are about the write to the memory of an > > mmapped file. Since ftruncate doesn't guarantee that the backing store > > is really there we risk getting a SIGBUS if the disk is full and we > > write to a memory area that hasn't been allocated yet. Since this is in > > library code, we cannot simply catch the SIGBUS. And we cannot use > > fallocate since that doesn't guarantee that the backing store is really > > allocated since it depends on whether the underlying file system support > > fallocate. > > posix_fallocate does not guarantee this, either. See my patch with the > documentation update. Compression, COW, thin provision all can result > in ENOSPC. These are all buggy, non-conforming implementations. That doesn't mean users can't use them, but when applications malfunction, it's because they're using a low-quality, wrong implementation, not because the application has a bug. When the implementation intentionally cuts corners on correctness, it's not the application's job to make up for it. > And obviously, there can be other I/O errors. That's faulty hardware which is not covered by the specification. Technically, having faulty hardware is non-conforming too. > If you absolutely, truly need to use mmap, we need either have to > provide a way to intercept SIGBUS (perhaps à la SHE—the technology is > there, it's just not available to C code in a deeply nested library > right now), or another mmap flag that prevents the kernel from sending > SIGBUS, and some way to tell if a mapping had been subject to write > errors (perhaps revive msync(MS_ASYNC)?). This would be really nice. Handling SIGBUS is not a viable approach because it's not library-safe and difficult to make thread-safe. Rich
On Tue, 2015-05-26 at 11:13 -0400, Rich Felker wrote: > > posix_fallocate does not guarantee this, either. See my patch with the > > documentation update. Compression, COW, thin provision all can result > > in ENOSPC. > > These are all buggy, non-conforming implementations. That doesn't mean > users can't use them, but when applications malfunction, it's because > they're using a low-quality, wrong implementation, not because the > application has a bug. When the implementation intentionally cuts > corners on correctness, it's not the application's job to make up for > it. Well, if there are issues application/library writers certainly would like to know or have a way to detect them. In general relying on the glibc implementation details is a must, even if there are standards that define some corner cases differently. It looks like the guarantees that are needed in this case are (mostly) guaranteed by the glibc fallback code in those cases that the kernel doesn't have direct support. At least it seems that using posix_fallocate in this particular case is always the better choice over ftruncate. posix_fallocate will at least give a sensible error up front instead of just causing a SIGBUS much later in the code. > > If you absolutely, truly need to use mmap, we need either have to > > provide a way to intercept SIGBUS (perhaps à la SHE—the technology is > > there, it's just not available to C code in a deeply nested library > > right now), or another mmap flag that prevents the kernel from sending > > SIGBUS, and some way to tell if a mapping had been subject to write > > errors (perhaps revive msync(MS_ASYNC)?). > > This would be really nice. Handling SIGBUS is not a viable approach > because it's not library-safe and difficult to make thread-safe. Yes! Thanks, Mark
On 05/18/2015 06:43 AM, Florian Weimer wrote: > On 05/07/2015 09:05 PM, Florian Weimer wrote: >> > On 05/07/2015 08:19 PM, Roland McGrath wrote: >>>> >>> If I'm not mistaken ftruncate could still reduce the file size if it >>>> >>> races with another operation that would extend the file. This is also >>>> >>> a data loss bug. >>> >> >>> >> I concur. >> > >> > It happens with length == 0. We could error out with EINVAL instead of >> > calling ftruncate. >> > >> > Daniel Berrange pointed me to these bugs: >> > >> > https://sourceware.org/bugzilla/show_bug.cgi?id=17322 >> > https://bugzilla.redhat.com/show_bug.cgi?id=1140250 >> > https://bugzilla.redhat.com/show_bug.cgi?id=1077068 > Another very recent example is here: > > > https://lists.fedorahosted.org/pipermail/elfutils-devel/2015-May/004868.html > >> > This suggests that people actually rely on the current allocation >> > behavior. Combined with my previous analysis that applications will >> > start to fail if we remove the fallback and return EINVAL, I now think >> > we need to keep the allocation loop. > This is the patch I currently have. It fixes the avoidable bugs. I > still think we are in a bad situation here, that even a compatibility > symbol cannot fix. > > -- Florian Weimer / Red Hat Product Security OK to checkin. I believe this is the best patch. It fixes the O_APPEND case (real bug) and makes the NFS usages of posix_fallocate sensible. It leaves the fallback code in place for applications that need it and do not wish to write their own allocation loops, as all applications would need to do. Lastly I believe the race conditions outlined in bug 15661 are application bugs. Given that posix_fallocate is not listed in POSIX section 2.9.7 "Thread Interactions with Regular File Operations" the call is not required to be atomic with respect to any other operation. Therefore it is the responsibility of the "other thread" to synchronize the file operations with the thread calling posix_fallocate to ensure posix_fallocate has returned before it starts writing to the file. Despite the fact that posix_fallocate does not explicitly say that it will write to the newly allocated storage, it also doesn't say that it won't, and thus if you are extending the file with posix_fallocate and also writing to the file, it's safest to be conservative and synchronize until and when POSIX clarifies that posix_fallocate does not modify the contents (as it does in at least according to FreeBSD[1]). [1] https://www.freebsd.org/cgi/man.cgi?query=posix_fallocate&sektion=2&manpath=FreeBSD+8.3-RELEASE "... Any existing file data in the specified range is unmodified. ..." > 0001-posix_fallocate-Emulation-fixes-and-documentation-BZ.patch > > > From f25f46c0e0223d9f6e9e8ead27a37caa34f33631 Mon Sep 17 00:00:00 2001 > Message-Id: <f25f46c0e0223d9f6e9e8ead27a37caa34f33631.1431945166.git.fweimer@redhat.com> > From: Florian Weimer <fweimer@redhat.com> > Date: Mon, 18 May 2015 11:32:44 +0100 > Subject: [PATCH] posix_fallocate: Emulation fixes and documentation [BZ > #15661] > To: libc-alpha@sourceware.org > > Handle signed integer overflow correctly. Detect and reject O_APPEND. > Document drawbacks of emulation. > > This does not completely address bug 15661, but improves the situation > somewhat. > --- > ChangeLog | 9 ++++ > manual/filesys.texi | 94 +++++++++++++++++++++++++++++++++++++++ > sysdeps/posix/posix_fallocate.c | 67 ++++++++++++++++++++-------- > sysdeps/posix/posix_fallocate64.c | 67 ++++++++++++++++++++-------- > 4 files changed, 199 insertions(+), 38 deletions(-) > > diff --git a/ChangeLog b/ChangeLog > index 4de8a25..603847b 100644 > --- a/ChangeLog > +++ b/ChangeLog > @@ -1,3 +1,12 @@ > +2015-05-18 Florian Weimer <fweimer@redhat.com> > + > + [BZ #15661] > + * sysdeps/posix/posix_fallocate64.c (__posix_fallocate64_l64): > + Check for overflow properly. Check for O_APPEND. Ignore large > + file system block sizes. Add comments about problems. > + * sysdeps/posix/posix_fallocate.c (posix_fallocate): Likewise. > + * manual/filesys.texi (Storage Allocation): New node. > + > 2015-05-18 Arjun Shankar <arjun.is@lostca.se> > > * include/stdio.h: Define __need_wint_t. > diff --git a/manual/filesys.texi b/manual/filesys.texi > index 7d55b43..0f2e3dc 100644 > --- a/manual/filesys.texi > +++ b/manual/filesys.texi > @@ -1723,6 +1723,7 @@ modify the attributes of a file. > access a file. > * File Times:: About the time attributes of a file. > * File Size:: Manually changing the size of a file. > +* Storage Allocation:: Allocate backing storage for files. OK. > @end menu > > @node Attribute Meanings > @@ -3233,6 +3234,99 @@ is a requirement of @code{mmap}. The program has to keep track of the > real size, and when it has finished a final @code{ftruncate} call should > set the real size of the file. > > +@node Storage Allocation > +@subsection Storage Allocation > +@cindex allocating file storage > +@cindex file allocation > +@cindex storage allocating > + > +@cindex file fragmentation > +@cindex fragmentation of files > +@cindex sparse files > +@cindex files, sparse > +Most file systems support allocating large files in a non-contiguous > +fashion: the file is split into @emph{fragments} which are allocated > +sequentially, but the fragments themselves can be scattered across the > +disk. File systems generally try to avoid such fragmentation because it > +decreases performance, but if a file gradually increases in size, there > +might be no other option than to fragment it. In addition, many file > +systems support @emph{sparse files} with @emph{holes}: regions of null > +bytes for which no backing storage has been allocated by the file > +system. When the holes are finally overwritten with data, fragmentation > +can occur as well. > + > +Explicit allocation of storage for yet-unwritten parts of the file can > +help the system to avoid fragmentation. Additionally, if storage > +pre-allocation fails, it is possible to report the out-of-disk error > +early, often without filling up the entire disk. However, due to > +deduplication, copy-on-write semantics, and file compression, such > +pre-allocation may not reliably prevent the out-of-disk-space error from > +occurring later. Checking for write errors is still required, and > +writes to memory-mapped regions created with @code{mmap} can still > +result in @code{SIGBUS}. > + > +@deftypefun int posix_fallocate (int @var{fd}, off_t @var{offset}, off_t @var{length}) > +@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}} OK. > +@c If the file system does not support allocation, > +@c @code{posix_fallocate} has a race with file extension (if > +@c @var{length} is zero) or with concurrent writes of non-NUL bytes (if > +@c @var{length} is positive). > + > +Allocate backing store for the region of @var{length} bytes starting at > +byte @var{offset} in the file for the descriptor @var{fd}. The file > +length is increased to @samp{@var{length} + @var{offset}} if necessary. > + > +@var{fd} must be a regular file opened for writing, or @code{EBADF} is > +returned. If there is insufficient disk space to fulfill the allocation > +request, @code{ENOSPC} is returned. > + > +@strong{Note:} If @code{fallocate} is not available (because the file > +system does not support it), @code{posix_fallocate} is emulated, which > +has the following drawbacks: > + > +@itemize @bullet > +@item > +It is very inefficient because all file system blocks in the requested > +range need to be examined (even if they have been allocated before) and > +potentially rewritten. In contrast, with proper @code{fallocate} > +support (see below), the file system can examine the internal file > +allocation data structures and eliminate holes directly, maybe even > +using unwritten extents (which are pre-allocated but uninitialized on > +disk). > + > +@item > +There is a race condition if another thread or process modifies the > +underlying file in the to-be-allocated area. Non-null bytes could be > +overwritten with null bytes. > + > +@item > +If @var{fd} has been opened with the @code{O_APPEND} flag, the function > +will fail with an @code{errno} value of @code{EBADF}. > + > +@item > +If @var{length} is zero, @code{ftruncate} is used to increase the file > +size as requested, without allocating file system blocks. There is a > +race condition which means that @code{ftruncate} can accidentally > +truncate the file if it has been extended concurrently. > +@end itemize > + > +On Linux, if an application does not benefit from emulation or if the > +emulation is harmful due to its inherent race conditions, the > +application can use the Linux-specific @code{fallocate} function, with a > +zero flag argument. For the @code{fallocate} function, @theglibc{} does > +not perform allocation emulation if the file system does not support > +allocation. Instead, an @code{EOPNOTSUPP} is returned to the caller. > + > +@end deftypefun OK. > + > +@deftypefun int posix_fallocate64 (int @var{fd}, off64_t @var{length}, off64_t @var{offset}) > +@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}} OK. > + > +This function is a variant of @code{posix_fallocate64} which accepts > +64-bit file offsets on all platforms. > + > +@end deftypefun > + > @node Making Special Files > @section Making Special Files > @cindex creating special files > diff --git a/sysdeps/posix/posix_fallocate.c b/sysdeps/posix/posix_fallocate.c > index d15d603..e7fe201 100644 > --- a/sysdeps/posix/posix_fallocate.c > +++ b/sysdeps/posix/posix_fallocate.c > @@ -18,26 +18,36 @@ > #include <errno.h> > #include <fcntl.h> > #include <unistd.h> > +#include <stdint.h> > +#include <sys/fcntl.h> > #include <sys/stat.h> > #include <sys/statfs.h> > > -/* Reserve storage for the data of the file associated with FD. */ > +/* Reserve storage for the data of the file associated with FD. This > + emulation is far from perfect, but the kernel cannot do not much > + better for network file systems, either. */ > > int > posix_fallocate (int fd, __off_t offset, __off_t len) > { > struct stat64 st; > - struct statfs f; > > - /* `off_t' is a signed type. Therefore we can determine whether > - OFFSET + LEN is too large if it is a negative value. */ > if (offset < 0 || len < 0) > return EINVAL; > - if (offset + len < 0) > + > + /* Perform overflow check. The outer cast relies on a GCC > + extension. */ > + if ((__off_t) ((uint64_t) offset) + ((uint64_t) len) < 0) > return EFBIG; > > - /* First thing we have to make sure is that this is really a regular > - file. */ > + /* pwrite below will not do the right thing in O_APPEND mode. */ > + { > + int flags = __fcntl (fd, F_GETFL, 0); > + if (flags < 0 || (flags & O_APPEND) != 0) > + return EBADF; > + } > + > + /* We have to make sure that this is really a regular file. */ > if (__fxstat64 (_STAT_VER, fd, &st) != 0) > return EBADF; > if (S_ISFIFO (st.st_mode)) > @@ -47,6 +57,8 @@ posix_fallocate (int fd, __off_t offset, __off_t len) > > if (len == 0) > { > + /* This is racy, but there is no good way to satisfy a > + zero-length allocation request. */ > if (st.st_size < offset) > { > int ret = __ftruncate (fd, offset); > @@ -58,19 +70,36 @@ posix_fallocate (int fd, __off_t offset, __off_t len) > return 0; > } > > - /* We have to know the block size of the filesystem to get at least some > - sort of performance. */ > - if (__fstatfs (fd, &f) != 0) > - return errno; > - > - /* Try to play safe. */ > - if (f.f_bsize == 0) > - f.f_bsize = 512; > - > - /* Write something to every block. */ > - for (offset += (len - 1) % f.f_bsize; len > 0; offset += f.f_bsize) > + /* Minimize data transfer for network file systems, by issuing > + single-byte write requests spaced by the file system block size. > + (Most local file systems have fallocate support, so this fallback > + code is not used there.) */ > + > + unsigned increment; > + { > + struct statfs64 f; > + > + if (__fstatfs64 (fd, &f) != 0) > + return errno; > + if (f.f_bsize == 0) > + increment = 512; > + else if (f.f_bsize < 4096) > + increment = f.f_bsize; > + else > + /* NFS does not propagate the block size of the underlying > + storage and may report a much larger value which would still > + leave holes after the loop below, so we cap the increment at > + 4096. */ > + increment = 4096; OK. > + } > + > + /* Write a null byte to every block. This is racy; we currently > + lack a better option. Compare-and-swap against a file mapping > + might additional local races, but requires interposition of a > + signal handler to catch SIGBUS. */ > + for (offset += (len - 1) % increment; len > 0; offset += increment) > { > - len -= f.f_bsize; > + len -= increment; > > if (offset < st.st_size) > { > diff --git a/sysdeps/posix/posix_fallocate64.c b/sysdeps/posix/posix_fallocate64.c > index b845df7..ee32679 100644 > --- a/sysdeps/posix/posix_fallocate64.c > +++ b/sysdeps/posix/posix_fallocate64.c > @@ -18,26 +18,36 @@ > #include <errno.h> > #include <fcntl.h> > #include <unistd.h> > +#include <stdint.h> > +#include <sys/fcntl.h> > #include <sys/stat.h> > #include <sys/statfs.h> > > -/* Reserve storage for the data of the file associated with FD. */ > +/* Reserve storage for the data of the file associated with FD. This > + emulation is far from perfect, but the kernel cannot do not much > + better for network file systems, either. */ > > int > __posix_fallocate64_l64 (int fd, __off64_t offset, __off64_t len) > { > struct stat64 st; > - struct statfs64 f; > > - /* `off64_t' is a signed type. Therefore we can determine whether > - OFFSET + LEN is too large if it is a negative value. */ > if (offset < 0 || len < 0) > return EINVAL; > - if (offset + len < 0) > + > + /* Perform overflow check. The outer cast relies on a GCC > + extension. */ > + if ((__off64_t) ((uint64_t) offset) + ((uint64_t) len) < 0) > return EFBIG; > > - /* First thing we have to make sure is that this is really a regular > - file. */ > + /* pwrite64 below will not do the right thing in O_APPEND mode. */ > + { > + int flags = __fcntl (fd, F_GETFL, 0); > + if (flags < 0 || (flags & O_APPEND) != 0) > + return EBADF; > + } > + > + /* We have to make sure that this is really a regular file. */ > if (__fxstat64 (_STAT_VER, fd, &st) != 0) > return EBADF; > if (S_ISFIFO (st.st_mode)) > @@ -47,6 +57,8 @@ __posix_fallocate64_l64 (int fd, __off64_t offset, __off64_t len) > > if (len == 0) > { > + /* This is racy, but there is no good way to satisfy a > + zero-length allocation request. */ > if (st.st_size < offset) > { > int ret = __ftruncate64 (fd, offset); > @@ -58,19 +70,36 @@ __posix_fallocate64_l64 (int fd, __off64_t offset, __off64_t len) > return 0; > } > > - /* We have to know the block size of the filesystem to get at least some > - sort of performance. */ > - if (__fstatfs64 (fd, &f) != 0) > - return errno; > - > - /* Try to play safe. */ > - if (f.f_bsize == 0) > - f.f_bsize = 512; > - > - /* Write something to every block. */ > - for (offset += (len - 1) % f.f_bsize; len > 0; offset += f.f_bsize) > + /* Minimize data transfer for network file systems, by issuing > + single-byte write requests spaced by the file system block size. > + (Most local file systems have fallocate support, so this fallback > + code is not used there.) */ > + > + unsigned increment; > + { > + struct statfs64 f; > + > + if (__fstatfs64 (fd, &f) != 0) > + return errno; > + if (f.f_bsize == 0) > + increment = 512; > + else if (f.f_bsize < 4096) > + increment = f.f_bsize; > + else > + /* NFS clients do not propagate the block size of the underlying > + storage and may report a much larger value which would still > + leave holes after the loop below, so we cap the increment at > + 4096. */ > + increment = 4096; > + } > + > + /* Write a null byte to every block. This is racy; we currently > + lack a better option. Compare-and-swap against a file mapping > + might address local races, but requires interposition of a signal > + handler to catch SIGBUS. */ > + for (offset += (len - 1) % increment; len > 0; offset += increment) > { > - len -= f.f_bsize; > + len -= increment; > > if (offset < st.st_size) > { > -- 2.1.0 OK. Cheers, Carlos.
On 06/03/2015 06:06 AM, Carlos O'Donell wrote:
> OK to checkin.
Thanks, committed. I did not at bug 15661 to the NEWS file and resolved
it as WONTFIX.
On 06/05/2015 05:52 AM, Florian Weimer wrote: > On 06/03/2015 06:06 AM, Carlos O'Donell wrote: > >> OK to checkin. > > Thanks, committed. I did not at bug 15661 to the NEWS file and resolved > it as WONTFIX. Since this commit fixes a user-visible issue it should have a bug filed and that bug should be marked in the fixed bug list. This way users can reopen if they have problems, or review the bugs fixed in the release. Would you mind opening a bug for the NFS block size issue and then marking it closed, and updating your ChangeLog entry and NEWS? Cheers, Carlos.
On 06/05/2015 03:11 PM, Carlos O'Donell wrote: > On 06/05/2015 05:52 AM, Florian Weimer wrote: >> On 06/03/2015 06:06 AM, Carlos O'Donell wrote: >> >>> OK to checkin. >> >> Thanks, committed. I did not at bug 15661 to the NEWS file and resolved >> it as WONTFIX. > > Since this commit fixes a user-visible issue it should have a bug filed > and that bug should be marked in the fixed bug list. This way users > can reopen if they have problems, or review the bugs fixed in the release. > > Would you mind opening a bug for the NFS block size issue and then > marking it closed, and updating your ChangeLog entry and NEWS? Turns out there was already a bug specifically for the NFS issue, 17322, so used that.
From f25f46c0e0223d9f6e9e8ead27a37caa34f33631 Mon Sep 17 00:00:00 2001 Message-Id: <f25f46c0e0223d9f6e9e8ead27a37caa34f33631.1431945166.git.fweimer@redhat.com> From: Florian Weimer <fweimer@redhat.com> Date: Mon, 18 May 2015 11:32:44 +0100 Subject: [PATCH] posix_fallocate: Emulation fixes and documentation [BZ #15661] To: libc-alpha@sourceware.org Handle signed integer overflow correctly. Detect and reject O_APPEND. Document drawbacks of emulation. This does not completely address bug 15661, but improves the situation somewhat. --- ChangeLog | 9 ++++ manual/filesys.texi | 94 +++++++++++++++++++++++++++++++++++++++ sysdeps/posix/posix_fallocate.c | 67 ++++++++++++++++++++-------- sysdeps/posix/posix_fallocate64.c | 67 ++++++++++++++++++++-------- 4 files changed, 199 insertions(+), 38 deletions(-) diff --git a/ChangeLog b/ChangeLog index 4de8a25..603847b 100644 --- a/ChangeLog +++ b/ChangeLog @@ -1,3 +1,12 @@ +2015-05-18 Florian Weimer <fweimer@redhat.com> + + [BZ #15661] + * sysdeps/posix/posix_fallocate64.c (__posix_fallocate64_l64): + Check for overflow properly. Check for O_APPEND. Ignore large + file system block sizes. Add comments about problems. + * sysdeps/posix/posix_fallocate.c (posix_fallocate): Likewise. + * manual/filesys.texi (Storage Allocation): New node. + 2015-05-18 Arjun Shankar <arjun.is@lostca.se> * include/stdio.h: Define __need_wint_t. diff --git a/manual/filesys.texi b/manual/filesys.texi index 7d55b43..0f2e3dc 100644 --- a/manual/filesys.texi +++ b/manual/filesys.texi @@ -1723,6 +1723,7 @@ modify the attributes of a file. access a file. * File Times:: About the time attributes of a file. * File Size:: Manually changing the size of a file. +* Storage Allocation:: Allocate backing storage for files. @end menu @node Attribute Meanings @@ -3233,6 +3234,99 @@ is a requirement of @code{mmap}. The program has to keep track of the real size, and when it has finished a final @code{ftruncate} call should set the real size of the file. +@node Storage Allocation +@subsection Storage Allocation +@cindex allocating file storage +@cindex file allocation +@cindex storage allocating + +@cindex file fragmentation +@cindex fragmentation of files +@cindex sparse files +@cindex files, sparse +Most file systems support allocating large files in a non-contiguous +fashion: the file is split into @emph{fragments} which are allocated +sequentially, but the fragments themselves can be scattered across the +disk. File systems generally try to avoid such fragmentation because it +decreases performance, but if a file gradually increases in size, there +might be no other option than to fragment it. In addition, many file +systems support @emph{sparse files} with @emph{holes}: regions of null +bytes for which no backing storage has been allocated by the file +system. When the holes are finally overwritten with data, fragmentation +can occur as well. + +Explicit allocation of storage for yet-unwritten parts of the file can +help the system to avoid fragmentation. Additionally, if storage +pre-allocation fails, it is possible to report the out-of-disk error +early, often without filling up the entire disk. However, due to +deduplication, copy-on-write semantics, and file compression, such +pre-allocation may not reliably prevent the out-of-disk-space error from +occurring later. Checking for write errors is still required, and +writes to memory-mapped regions created with @code{mmap} can still +result in @code{SIGBUS}. + +@deftypefun int posix_fallocate (int @var{fd}, off_t @var{offset}, off_t @var{length}) +@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}} +@c If the file system does not support allocation, +@c @code{posix_fallocate} has a race with file extension (if +@c @var{length} is zero) or with concurrent writes of non-NUL bytes (if +@c @var{length} is positive). + +Allocate backing store for the region of @var{length} bytes starting at +byte @var{offset} in the file for the descriptor @var{fd}. The file +length is increased to @samp{@var{length} + @var{offset}} if necessary. + +@var{fd} must be a regular file opened for writing, or @code{EBADF} is +returned. If there is insufficient disk space to fulfill the allocation +request, @code{ENOSPC} is returned. + +@strong{Note:} If @code{fallocate} is not available (because the file +system does not support it), @code{posix_fallocate} is emulated, which +has the following drawbacks: + +@itemize @bullet +@item +It is very inefficient because all file system blocks in the requested +range need to be examined (even if they have been allocated before) and +potentially rewritten. In contrast, with proper @code{fallocate} +support (see below), the file system can examine the internal file +allocation data structures and eliminate holes directly, maybe even +using unwritten extents (which are pre-allocated but uninitialized on +disk). + +@item +There is a race condition if another thread or process modifies the +underlying file in the to-be-allocated area. Non-null bytes could be +overwritten with null bytes. + +@item +If @var{fd} has been opened with the @code{O_APPEND} flag, the function +will fail with an @code{errno} value of @code{EBADF}. + +@item +If @var{length} is zero, @code{ftruncate} is used to increase the file +size as requested, without allocating file system blocks. There is a +race condition which means that @code{ftruncate} can accidentally +truncate the file if it has been extended concurrently. +@end itemize + +On Linux, if an application does not benefit from emulation or if the +emulation is harmful due to its inherent race conditions, the +application can use the Linux-specific @code{fallocate} function, with a +zero flag argument. For the @code{fallocate} function, @theglibc{} does +not perform allocation emulation if the file system does not support +allocation. Instead, an @code{EOPNOTSUPP} is returned to the caller. + +@end deftypefun + +@deftypefun int posix_fallocate64 (int @var{fd}, off64_t @var{length}, off64_t @var{offset}) +@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}} + +This function is a variant of @code{posix_fallocate64} which accepts +64-bit file offsets on all platforms. + +@end deftypefun + @node Making Special Files @section Making Special Files @cindex creating special files diff --git a/sysdeps/posix/posix_fallocate.c b/sysdeps/posix/posix_fallocate.c index d15d603..e7fe201 100644 --- a/sysdeps/posix/posix_fallocate.c +++ b/sysdeps/posix/posix_fallocate.c @@ -18,26 +18,36 @@ #include <errno.h> #include <fcntl.h> #include <unistd.h> +#include <stdint.h> +#include <sys/fcntl.h> #include <sys/stat.h> #include <sys/statfs.h> -/* Reserve storage for the data of the file associated with FD. */ +/* Reserve storage for the data of the file associated with FD. This + emulation is far from perfect, but the kernel cannot do not much + better for network file systems, either. */ int posix_fallocate (int fd, __off_t offset, __off_t len) { struct stat64 st; - struct statfs f; - /* `off_t' is a signed type. Therefore we can determine whether - OFFSET + LEN is too large if it is a negative value. */ if (offset < 0 || len < 0) return EINVAL; - if (offset + len < 0) + + /* Perform overflow check. The outer cast relies on a GCC + extension. */ + if ((__off_t) ((uint64_t) offset) + ((uint64_t) len) < 0) return EFBIG; - /* First thing we have to make sure is that this is really a regular - file. */ + /* pwrite below will not do the right thing in O_APPEND mode. */ + { + int flags = __fcntl (fd, F_GETFL, 0); + if (flags < 0 || (flags & O_APPEND) != 0) + return EBADF; + } + + /* We have to make sure that this is really a regular file. */ if (__fxstat64 (_STAT_VER, fd, &st) != 0) return EBADF; if (S_ISFIFO (st.st_mode)) @@ -47,6 +57,8 @@ posix_fallocate (int fd, __off_t offset, __off_t len) if (len == 0) { + /* This is racy, but there is no good way to satisfy a + zero-length allocation request. */ if (st.st_size < offset) { int ret = __ftruncate (fd, offset); @@ -58,19 +70,36 @@ posix_fallocate (int fd, __off_t offset, __off_t len) return 0; } - /* We have to know the block size of the filesystem to get at least some - sort of performance. */ - if (__fstatfs (fd, &f) != 0) - return errno; - - /* Try to play safe. */ - if (f.f_bsize == 0) - f.f_bsize = 512; - - /* Write something to every block. */ - for (offset += (len - 1) % f.f_bsize; len > 0; offset += f.f_bsize) + /* Minimize data transfer for network file systems, by issuing + single-byte write requests spaced by the file system block size. + (Most local file systems have fallocate support, so this fallback + code is not used there.) */ + + unsigned increment; + { + struct statfs64 f; + + if (__fstatfs64 (fd, &f) != 0) + return errno; + if (f.f_bsize == 0) + increment = 512; + else if (f.f_bsize < 4096) + increment = f.f_bsize; + else + /* NFS does not propagate the block size of the underlying + storage and may report a much larger value which would still + leave holes after the loop below, so we cap the increment at + 4096. */ + increment = 4096; + } + + /* Write a null byte to every block. This is racy; we currently + lack a better option. Compare-and-swap against a file mapping + might additional local races, but requires interposition of a + signal handler to catch SIGBUS. */ + for (offset += (len - 1) % increment; len > 0; offset += increment) { - len -= f.f_bsize; + len -= increment; if (offset < st.st_size) { diff --git a/sysdeps/posix/posix_fallocate64.c b/sysdeps/posix/posix_fallocate64.c index b845df7..ee32679 100644 --- a/sysdeps/posix/posix_fallocate64.c +++ b/sysdeps/posix/posix_fallocate64.c @@ -18,26 +18,36 @@ #include <errno.h> #include <fcntl.h> #include <unistd.h> +#include <stdint.h> +#include <sys/fcntl.h> #include <sys/stat.h> #include <sys/statfs.h> -/* Reserve storage for the data of the file associated with FD. */ +/* Reserve storage for the data of the file associated with FD. This + emulation is far from perfect, but the kernel cannot do not much + better for network file systems, either. */ int __posix_fallocate64_l64 (int fd, __off64_t offset, __off64_t len) { struct stat64 st; - struct statfs64 f; - /* `off64_t' is a signed type. Therefore we can determine whether - OFFSET + LEN is too large if it is a negative value. */ if (offset < 0 || len < 0) return EINVAL; - if (offset + len < 0) + + /* Perform overflow check. The outer cast relies on a GCC + extension. */ + if ((__off64_t) ((uint64_t) offset) + ((uint64_t) len) < 0) return EFBIG; - /* First thing we have to make sure is that this is really a regular - file. */ + /* pwrite64 below will not do the right thing in O_APPEND mode. */ + { + int flags = __fcntl (fd, F_GETFL, 0); + if (flags < 0 || (flags & O_APPEND) != 0) + return EBADF; + } + + /* We have to make sure that this is really a regular file. */ if (__fxstat64 (_STAT_VER, fd, &st) != 0) return EBADF; if (S_ISFIFO (st.st_mode)) @@ -47,6 +57,8 @@ __posix_fallocate64_l64 (int fd, __off64_t offset, __off64_t len) if (len == 0) { + /* This is racy, but there is no good way to satisfy a + zero-length allocation request. */ if (st.st_size < offset) { int ret = __ftruncate64 (fd, offset); @@ -58,19 +70,36 @@ __posix_fallocate64_l64 (int fd, __off64_t offset, __off64_t len) return 0; } - /* We have to know the block size of the filesystem to get at least some - sort of performance. */ - if (__fstatfs64 (fd, &f) != 0) - return errno; - - /* Try to play safe. */ - if (f.f_bsize == 0) - f.f_bsize = 512; - - /* Write something to every block. */ - for (offset += (len - 1) % f.f_bsize; len > 0; offset += f.f_bsize) + /* Minimize data transfer for network file systems, by issuing + single-byte write requests spaced by the file system block size. + (Most local file systems have fallocate support, so this fallback + code is not used there.) */ + + unsigned increment; + { + struct statfs64 f; + + if (__fstatfs64 (fd, &f) != 0) + return errno; + if (f.f_bsize == 0) + increment = 512; + else if (f.f_bsize < 4096) + increment = f.f_bsize; + else + /* NFS clients do not propagate the block size of the underlying + storage and may report a much larger value which would still + leave holes after the loop below, so we cap the increment at + 4096. */ + increment = 4096; + } + + /* Write a null byte to every block. This is racy; we currently + lack a better option. Compare-and-swap against a file mapping + might address local races, but requires interposition of a signal + handler to catch SIGBUS. */ + for (offset += (len - 1) % increment; len > 0; offset += increment) { - len -= f.f_bsize; + len -= increment; if (offset < st.st_size) { -- 2.1.0