diff mbox

[4/4] Remove broken posix_fallocate, posix_falllocate64 fallback code [BZ#15661]

Message ID 5559C23A.4070406@redhat.com
State New
Headers show

Commit Message

Florian Weimer May 18, 2015, 10:43 a.m. UTC
On 05/07/2015 09:05 PM, Florian Weimer wrote:
> On 05/07/2015 08:19 PM, Roland McGrath wrote:
>>> If I'm not mistaken ftruncate could still reduce the file size if it
>>> races with another operation that would extend the file. This is also
>>> a data loss bug.
>>
>> I concur.
> 
> It happens with length == 0.  We could error out with EINVAL instead of
> calling ftruncate.
> 
> Daniel Berrange pointed me to these bugs:
> 
>   https://sourceware.org/bugzilla/show_bug.cgi?id=17322
>   https://bugzilla.redhat.com/show_bug.cgi?id=1140250
>   https://bugzilla.redhat.com/show_bug.cgi?id=1077068

Another very recent example is here:


https://lists.fedorahosted.org/pipermail/elfutils-devel/2015-May/004868.html

> This suggests that people actually rely on the current allocation
> behavior.  Combined with my previous analysis that applications will
> start to fail if we remove the fallback and return EINVAL, I now think
> we need to keep the allocation loop.

This is the patch I currently have.  It fixes the avoidable bugs.  I
still think we are in a bad situation here, that even a compatibility
symbol cannot fix.

Comments

Mark Wielaard May 22, 2015, 10:18 a.m. UTC | #1
On Mon, 2015-05-18 at 12:43 +0200, Florian Weimer wrote:
> Another very recent example is here:
>
> https://lists.fedorahosted.org/pipermail/elfutils-devel/2015-May/004868.html
> 
> > This suggests that people actually rely on the current allocation
> > behavior.  Combined with my previous analysis that applications will
> > start to fail if we remove the fallback and return EINVAL, I now think
> > we need to keep the allocation loop.

I should point out that the above patch isn't in elfutils yet. It is
waiting on how this discussion turns out.

At the moment we simply use ftruncate. The problem that is solved by
using posix_fallocate is that we are about the write to the memory of an
mmapped file. Since ftruncate doesn't guarantee that the backing store
is really there we risk getting a SIGBUS if the disk is full and we
write to a memory area that hasn't been allocated yet. Since this is in
library code, we cannot simply catch the SIGBUS. And we cannot use
fallocate since that doesn't guarantee that the backing store is really
allocated since it depends on whether the underlying file system support
fallocate. As far as I know posix_fallocate is the only way to guarantee
that the file is fully allocated without spurious failures depending on
where the file resides in the file system. And posix_fallocate has been
available since forever with this functionality. We only expect an error
if there was a real error, like ENOSPACE, allocating the file. But if
there is another way to get that guarantee we would be happy to switch
to it.

Cheers,

Mark
Florian Weimer May 26, 2015, 9:12 a.m. UTC | #2
On 05/22/2015 12:18 PM, Mark Wielaard wrote:
> On Mon, 2015-05-18 at 12:43 +0200, Florian Weimer wrote:
>> Another very recent example is here:
>>
>> https://lists.fedorahosted.org/pipermail/elfutils-devel/2015-May/004868.html
>>
>>> This suggests that people actually rely on the current allocation
>>> behavior.  Combined with my previous analysis that applications will
>>> start to fail if we remove the fallback and return EINVAL, I now think
>>> we need to keep the allocation loop.
> 
> I should point out that the above patch isn't in elfutils yet. It is
> waiting on how this discussion turns out.
> 
> At the moment we simply use ftruncate. The problem that is solved by
> using posix_fallocate is that we are about the write to the memory of an
> mmapped file. Since ftruncate doesn't guarantee that the backing store
> is really there we risk getting a SIGBUS if the disk is full and we
> write to a memory area that hasn't been allocated yet. Since this is in
> library code, we cannot simply catch the SIGBUS. And we cannot use
> fallocate since that doesn't guarantee that the backing store is really
> allocated since it depends on whether the underlying file system support
> fallocate.

posix_fallocate does not guarantee this, either.  See my patch with the
documentation update.  Compression, COW, thin provision all can result
in ENOSPC.  And obviously, there can be other I/O errors.

If you absolutely, truly need to use mmap, we need either have to
provide a way to intercept SIGBUS (perhaps à la SHE—the technology is
there, it's just not available to C code in a deeply nested library
right now), or another mmap flag that prevents the kernel from sending
SIGBUS, and some way to tell if a mapping had been subject to write
errors (perhaps revive msync(MS_ASYNC)?).

> As far as I know posix_fallocate is the only way to guarantee
> that the file is fully allocated without spurious failures depending on
> where the file resides in the file system. And posix_fallocate has been
> available since forever with this functionality.

Thin provisioning etc. has become more common lately, so while
posix_fallocate did provide this functionality in the past, it doesn't
seem to do it right now.
Mark Wielaard May 26, 2015, 10:44 a.m. UTC | #3
On Tue, 2015-05-26 at 11:12 +0200, Florian Weimer wrote:
> On 05/22/2015 12:18 PM, Mark Wielaard wrote:
> > At the moment we simply use ftruncate. The problem that is solved by
> > using posix_fallocate is that we are about the write to the memory of an
> > mmapped file. Since ftruncate doesn't guarantee that the backing store
> > is really there we risk getting a SIGBUS if the disk is full and we
> > write to a memory area that hasn't been allocated yet. Since this is in
> > library code, we cannot simply catch the SIGBUS. And we cannot use
> > fallocate since that doesn't guarantee that the backing store is really
> > allocated since it depends on whether the underlying file system support
> > fallocate.
> 
> posix_fallocate does not guarantee this, either.  See my patch with the
> documentation update.  Compression, COW, thin provision all can result
> in ENOSPC.  And obviously, there can be other I/O errors.

But that is precisely what we want. We want to get an ENOSPC (or some
other I/O error as documented for posix_fallocate) before we start
poking at the mmap backed memory. It seems that is what the glibc
posix_fallocate guarantees, unlike ftruncate (or fallocate which can
return an error even though there is space, but the file system just
doesn't implement fallocate support).

If not posix_allocate, what would you recommend we use to make sure the
file has a backing store before we start poking at it?

> If you absolutely, truly need to use mmap, we need either have to
> provide a way to intercept SIGBUS (perhaps à la SHE—the technology is
> there, it's just not available to C code in a deeply nested library
> right now), or another mmap flag that prevents the kernel from sending
> SIGBUS, and some way to tell if a mapping had been subject to write
> errors (perhaps revive msync(MS_ASYNC)?).

Yes, having msync report the error instead of having to deal with
signals (which is a pain in a library) would definitely be preferred. Or
any other mechanism to make sure that the mmap backed file area really
exists and/or gets written to the backing store.

> Thin provisioning etc. has become more common lately, so while
> posix_fallocate did provide this functionality in the past, it doesn't
> seem to do it right now.

I might not understand precisely what you mean by thin-provisioning. But
wouldn't all fancy new filesystems also support fallocate directly
anyway? In which case the fallback part doesn't even trigger. And if it
does trigger then because it touches and explicitly writes any block
that doesn't have a backing store right now, would get it, doesn't it?

Thanks,

Mark
Florian Weimer May 26, 2015, 10:52 a.m. UTC | #4
On 05/26/2015 12:44 PM, Mark Wielaard wrote:

>> posix_fallocate does not guarantee this, either.  See my patch with the
>> documentation update.  Compression, COW, thin provision all can result
>> in ENOSPC.  And obviously, there can be other I/O errors.
> 
> But that is precisely what we want. We want to get an ENOSPC (or some
> other I/O error as documented for posix_fallocate) before we start
> poking at the mmap backed memory. It seems that is what the glibc
> posix_fallocate guarantees, unlike ftruncate (or fallocate which can
> return an error even though there is space, but the file system just
> doesn't implement fallocate support).

Sorry, what I meant is this: Even if you actually *write* zeros now, the
increasing depth of the storage stack means that this will not reliable
prevent you from getting ENOSPC later (which is reported, in the context
of mmap, as SIGBUS).

> If not posix_allocate, what would you recommend we use to make sure the
> file has a backing store before we start poking at it?

Due to the possibility of copy-on-write storage backends, I don't think
you can get what you want anymore.

>> Thin provisioning etc. has become more common lately, so while
>> posix_fallocate did provide this functionality in the past, it doesn't
>> seem to do it right now.
> 
> I might not understand precisely what you mean by thin-provisioning. But
> wouldn't all fancy new filesystems also support fallocate directly
> anyway?

Thin provisioning often operates at the block layer.  An fallocate call
typically only updates metadata, creating unwritten extents instead of
actually writing out zeros.

> In which case the fallback part doesn't even trigger. And if it
> does trigger then because it touches and explicitly writes any block
> that doesn't have a backing store right now, would get it, doesn't it?

I don't know if anything implements this optimization, but it's possible
that if you write zeros to a thin-provisioned hole (think sparse file at
the block layer), then the device discards the write operation.  It's
definitely an issue with compression.  A block of zeros likely
compresses better than anything you will write later, so you can still
get into an ENOSPC situation.
Rich Felker May 26, 2015, 3:13 p.m. UTC | #5
On Tue, May 26, 2015 at 11:12:15AM +0200, Florian Weimer wrote:
> On 05/22/2015 12:18 PM, Mark Wielaard wrote:
> > On Mon, 2015-05-18 at 12:43 +0200, Florian Weimer wrote:
> >> Another very recent example is here:
> >>
> >> https://lists.fedorahosted.org/pipermail/elfutils-devel/2015-May/004868.html
> >>
> >>> This suggests that people actually rely on the current allocation
> >>> behavior.  Combined with my previous analysis that applications will
> >>> start to fail if we remove the fallback and return EINVAL, I now think
> >>> we need to keep the allocation loop.
> > 
> > I should point out that the above patch isn't in elfutils yet. It is
> > waiting on how this discussion turns out.
> > 
> > At the moment we simply use ftruncate. The problem that is solved by
> > using posix_fallocate is that we are about the write to the memory of an
> > mmapped file. Since ftruncate doesn't guarantee that the backing store
> > is really there we risk getting a SIGBUS if the disk is full and we
> > write to a memory area that hasn't been allocated yet. Since this is in
> > library code, we cannot simply catch the SIGBUS. And we cannot use
> > fallocate since that doesn't guarantee that the backing store is really
> > allocated since it depends on whether the underlying file system support
> > fallocate.
> 
> posix_fallocate does not guarantee this, either.  See my patch with the
> documentation update.  Compression, COW, thin provision all can result
> in ENOSPC.

These are all buggy, non-conforming implementations. That doesn't mean
users can't use them, but when applications malfunction, it's because
they're using a low-quality, wrong implementation, not because the
application has a bug. When the implementation intentionally cuts
corners on correctness, it's not the application's job to make up for
it.

> And obviously, there can be other I/O errors.

That's faulty hardware which is not covered by the specification.
Technically, having faulty hardware is non-conforming too.

> If you absolutely, truly need to use mmap, we need either have to
> provide a way to intercept SIGBUS (perhaps à la SHE—the technology is
> there, it's just not available to C code in a deeply nested library
> right now), or another mmap flag that prevents the kernel from sending
> SIGBUS, and some way to tell if a mapping had been subject to write
> errors (perhaps revive msync(MS_ASYNC)?).

This would be really nice. Handling SIGBUS is not a viable approach
because it's not library-safe and difficult to make thread-safe.

Rich
Mark Wielaard May 27, 2015, 3:06 p.m. UTC | #6
On Tue, 2015-05-26 at 11:13 -0400, Rich Felker wrote:
> > posix_fallocate does not guarantee this, either.  See my patch with the
> > documentation update.  Compression, COW, thin provision all can result
> > in ENOSPC.
> 
> These are all buggy, non-conforming implementations. That doesn't mean
> users can't use them, but when applications malfunction, it's because
> they're using a low-quality, wrong implementation, not because the
> application has a bug. When the implementation intentionally cuts
> corners on correctness, it's not the application's job to make up for
> it.

Well, if there are issues application/library writers certainly would
like to know or have a way to detect them. In general relying on the
glibc implementation details is a must, even if there are standards that
define some corner cases differently. It looks like the guarantees that
are needed in this case are (mostly) guaranteed by the glibc fallback
code in those cases that the kernel doesn't have direct support. At
least it seems that using posix_fallocate in this particular case is
always the better choice over ftruncate. posix_fallocate will at least
give a sensible error up front instead of just causing a SIGBUS much
later in the code.

> > If you absolutely, truly need to use mmap, we need either have to
> > provide a way to intercept SIGBUS (perhaps à la SHE—the technology is
> > there, it's just not available to C code in a deeply nested library
> > right now), or another mmap flag that prevents the kernel from sending
> > SIGBUS, and some way to tell if a mapping had been subject to write
> > errors (perhaps revive msync(MS_ASYNC)?).
> 
> This would be really nice. Handling SIGBUS is not a viable approach
> because it's not library-safe and difficult to make thread-safe.

Yes!

Thanks,

Mark
Carlos O'Donell June 3, 2015, 4:06 a.m. UTC | #7
On 05/18/2015 06:43 AM, Florian Weimer wrote:
> On 05/07/2015 09:05 PM, Florian Weimer wrote:
>> > On 05/07/2015 08:19 PM, Roland McGrath wrote:
>>>> >>> If I'm not mistaken ftruncate could still reduce the file size if it
>>>> >>> races with another operation that would extend the file. This is also
>>>> >>> a data loss bug.
>>> >>
>>> >> I concur.
>> > 
>> > It happens with length == 0.  We could error out with EINVAL instead of
>> > calling ftruncate.
>> > 
>> > Daniel Berrange pointed me to these bugs:
>> > 
>> >   https://sourceware.org/bugzilla/show_bug.cgi?id=17322
>> >   https://bugzilla.redhat.com/show_bug.cgi?id=1140250
>> >   https://bugzilla.redhat.com/show_bug.cgi?id=1077068
> Another very recent example is here:
> 
> 
> https://lists.fedorahosted.org/pipermail/elfutils-devel/2015-May/004868.html
> 
>> > This suggests that people actually rely on the current allocation
>> > behavior.  Combined with my previous analysis that applications will
>> > start to fail if we remove the fallback and return EINVAL, I now think
>> > we need to keep the allocation loop.
> This is the patch I currently have.  It fixes the avoidable bugs.  I
> still think we are in a bad situation here, that even a compatibility
> symbol cannot fix.
> 
> -- Florian Weimer / Red Hat Product Security

OK to checkin.

I believe this is the best patch.

It fixes the O_APPEND case (real bug) and makes the NFS usages of
posix_fallocate sensible.

It leaves the fallback code in place for applications that need it
and do not wish to write their own allocation loops, as all
applications would need to do.

Lastly I believe the race conditions outlined in bug 15661 are 
application bugs.

Given that posix_fallocate is not listed in POSIX section 2.9.7 
"Thread Interactions with Regular File Operations" the call is not
required to be atomic with respect to any other operation. Therefore
it is the responsibility of the "other thread" to synchronize the
file operations with the thread calling posix_fallocate to ensure
posix_fallocate has returned before it starts writing to the file.

Despite the fact that posix_fallocate does not explicitly say that
it will write to the newly allocated storage, it also doesn't say
that it won't, and thus if you are extending the file with posix_fallocate
and also writing to the file, it's safest to be conservative and
synchronize until and when POSIX clarifies that posix_fallocate does
not modify the contents (as it does in at least according to FreeBSD[1]).

[1] https://www.freebsd.org/cgi/man.cgi?query=posix_fallocate&sektion=2&manpath=FreeBSD+8.3-RELEASE
"... Any existing file data in the specified range is unmodified. ..."


> 0001-posix_fallocate-Emulation-fixes-and-documentation-BZ.patch
> 
> 
> From f25f46c0e0223d9f6e9e8ead27a37caa34f33631 Mon Sep 17 00:00:00 2001
> Message-Id: <f25f46c0e0223d9f6e9e8ead27a37caa34f33631.1431945166.git.fweimer@redhat.com>
> From: Florian Weimer <fweimer@redhat.com>
> Date: Mon, 18 May 2015 11:32:44 +0100
> Subject: [PATCH] posix_fallocate: Emulation fixes and documentation [BZ
>  #15661]
> To: libc-alpha@sourceware.org
> 
> Handle signed integer overflow correctly.  Detect and reject O_APPEND.
> Document drawbacks of emulation.
> 
> This does not completely address bug 15661, but improves the situation
> somewhat.
> ---
>  ChangeLog                         |  9 ++++
>  manual/filesys.texi               | 94 +++++++++++++++++++++++++++++++++++++++
>  sysdeps/posix/posix_fallocate.c   | 67 ++++++++++++++++++++--------
>  sysdeps/posix/posix_fallocate64.c | 67 ++++++++++++++++++++--------
>  4 files changed, 199 insertions(+), 38 deletions(-)
> 
> diff --git a/ChangeLog b/ChangeLog
> index 4de8a25..603847b 100644
> --- a/ChangeLog
> +++ b/ChangeLog
> @@ -1,3 +1,12 @@
> +2015-05-18  Florian Weimer  <fweimer@redhat.com>
> +
> +	[BZ #15661]
> +	* sysdeps/posix/posix_fallocate64.c (__posix_fallocate64_l64):
> +	Check for overflow properly.  Check for O_APPEND.  Ignore large
> +	file system block sizes.  Add comments about problems.
> +	* sysdeps/posix/posix_fallocate.c (posix_fallocate): Likewise.
> +	* manual/filesys.texi (Storage Allocation): New node.
> +
>  2015-05-18  Arjun Shankar  <arjun.is@lostca.se>
>  
>  	* include/stdio.h: Define __need_wint_t.
> diff --git a/manual/filesys.texi b/manual/filesys.texi
> index 7d55b43..0f2e3dc 100644
> --- a/manual/filesys.texi
> +++ b/manual/filesys.texi
> @@ -1723,6 +1723,7 @@ modify the attributes of a file.
>                                   access a file.
>  * File Times::                  About the time attributes of a file.
>  * File Size::			Manually changing the size of a file.
> +* Storage Allocation::          Allocate backing storage for files.

OK.

>  @end menu
>  
>  @node Attribute Meanings
> @@ -3233,6 +3234,99 @@ is a requirement of @code{mmap}.  The program has to keep track of the
>  real size, and when it has finished a final @code{ftruncate} call should
>  set the real size of the file.
>  
> +@node Storage Allocation
> +@subsection Storage Allocation
> +@cindex allocating file storage
> +@cindex file allocation
> +@cindex storage allocating
> +
> +@cindex file fragmentation
> +@cindex fragmentation of files
> +@cindex sparse files
> +@cindex files, sparse
> +Most file systems support allocating large files in a non-contiguous
> +fashion: the file is split into @emph{fragments} which are allocated
> +sequentially, but the fragments themselves can be scattered across the
> +disk.  File systems generally try to avoid such fragmentation because it
> +decreases performance, but if a file gradually increases in size, there
> +might be no other option than to fragment it.  In addition, many file
> +systems support @emph{sparse files} with @emph{holes}: regions of null
> +bytes for which no backing storage has been allocated by the file
> +system.  When the holes are finally overwritten with data, fragmentation
> +can occur as well.
> +
> +Explicit allocation of storage for yet-unwritten parts of the file can
> +help the system to avoid fragmentation.  Additionally, if storage
> +pre-allocation fails, it is possible to report the out-of-disk error
> +early, often without filling up the entire disk.  However, due to
> +deduplication, copy-on-write semantics, and file compression, such
> +pre-allocation may not reliably prevent the out-of-disk-space error from
> +occurring later.  Checking for write errors is still required, and
> +writes to memory-mapped regions created with @code{mmap} can still
> +result in @code{SIGBUS}.
> +
> +@deftypefun int posix_fallocate (int @var{fd}, off_t @var{offset}, off_t @var{length})
> +@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}

OK.

> +@c If the file system does not support allocation,
> +@c @code{posix_fallocate} has a race with file extension (if
> +@c @var{length} is zero) or with concurrent writes of non-NUL bytes (if
> +@c @var{length} is positive).
> +
> +Allocate backing store for the region of @var{length} bytes starting at
> +byte @var{offset} in the file for the descriptor @var{fd}.  The file
> +length is increased to @samp{@var{length} + @var{offset}} if necessary.
> +
> +@var{fd} must be a regular file opened for writing, or @code{EBADF} is
> +returned.  If there is insufficient disk space to fulfill the allocation
> +request, @code{ENOSPC} is returned.
> +
> +@strong{Note:} If @code{fallocate} is not available (because the file
> +system does not support it), @code{posix_fallocate} is emulated, which
> +has the following drawbacks:
> +
> +@itemize @bullet
> +@item
> +It is very inefficient because all file system blocks in the requested
> +range need to be examined (even if they have been allocated before) and
> +potentially rewritten.  In contrast, with proper @code{fallocate}
> +support (see below), the file system can examine the internal file
> +allocation data structures and eliminate holes directly, maybe even
> +using unwritten extents (which are pre-allocated but uninitialized on
> +disk).
> +
> +@item
> +There is a race condition if another thread or process modifies the
> +underlying file in the to-be-allocated area.  Non-null bytes could be
> +overwritten with null bytes.
> +
> +@item
> +If @var{fd} has been opened with the @code{O_APPEND} flag, the function
> +will fail with an @code{errno} value of @code{EBADF}.
> +
> +@item
> +If @var{length} is zero, @code{ftruncate} is used to increase the file
> +size as requested, without allocating file system blocks.  There is a
> +race condition which means that @code{ftruncate} can accidentally
> +truncate the file if it has been extended concurrently.
> +@end itemize
> +
> +On Linux, if an application does not benefit from emulation or if the
> +emulation is harmful due to its inherent race conditions, the
> +application can use the Linux-specific @code{fallocate} function, with a
> +zero flag argument.  For the @code{fallocate} function, @theglibc{} does
> +not perform allocation emulation if the file system does not support
> +allocation.  Instead, an @code{EOPNOTSUPP} is returned to the caller.
> +
> +@end deftypefun

OK.

> +
> +@deftypefun int posix_fallocate64 (int @var{fd}, off64_t @var{length}, off64_t @var{offset})
> +@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}

OK.

> +
> +This function is a variant of @code{posix_fallocate64} which accepts
> +64-bit file offsets on all platforms.
> +
> +@end deftypefun
> +
>  @node Making Special Files
>  @section Making Special Files
>  @cindex creating special files
> diff --git a/sysdeps/posix/posix_fallocate.c b/sysdeps/posix/posix_fallocate.c
> index d15d603..e7fe201 100644
> --- a/sysdeps/posix/posix_fallocate.c
> +++ b/sysdeps/posix/posix_fallocate.c
> @@ -18,26 +18,36 @@
>  #include <errno.h>
>  #include <fcntl.h>
>  #include <unistd.h>
> +#include <stdint.h>
> +#include <sys/fcntl.h>
>  #include <sys/stat.h>
>  #include <sys/statfs.h>
>  
> -/* Reserve storage for the data of the file associated with FD.  */
> +/* Reserve storage for the data of the file associated with FD.  This
> +   emulation is far from perfect, but the kernel cannot do not much
> +   better for network file systems, either.  */
>  
>  int
>  posix_fallocate (int fd, __off_t offset, __off_t len)
>  {
>    struct stat64 st;
> -  struct statfs f;
>  
> -  /* `off_t' is a signed type.  Therefore we can determine whether
> -     OFFSET + LEN is too large if it is a negative value.  */
>    if (offset < 0 || len < 0)
>      return EINVAL;
> -  if (offset + len < 0)
> +
> +  /* Perform overflow check.  The outer cast relies on a GCC
> +     extension.  */
> +  if ((__off_t) ((uint64_t) offset) + ((uint64_t) len) < 0)
>      return EFBIG;
>  
> -  /* First thing we have to make sure is that this is really a regular
> -     file.  */
> +  /* pwrite below will not do the right thing in O_APPEND mode.  */
> +  {
> +    int flags = __fcntl (fd, F_GETFL, 0);
> +    if (flags < 0 || (flags & O_APPEND) != 0)
> +      return EBADF;
> +  }
> +
> +  /* We have to make sure that this is really a regular file.  */
>    if (__fxstat64 (_STAT_VER, fd, &st) != 0)
>      return EBADF;
>    if (S_ISFIFO (st.st_mode))
> @@ -47,6 +57,8 @@ posix_fallocate (int fd, __off_t offset, __off_t len)
>  
>    if (len == 0)
>      {
> +      /* This is racy, but there is no good way to satisfy a
> +	 zero-length allocation request.  */
>        if (st.st_size < offset)
>  	{
>  	  int ret = __ftruncate (fd, offset);
> @@ -58,19 +70,36 @@ posix_fallocate (int fd, __off_t offset, __off_t len)
>        return 0;
>      }
>  
> -  /* We have to know the block size of the filesystem to get at least some
> -     sort of performance.  */
> -  if (__fstatfs (fd, &f) != 0)
> -    return errno;
> -
> -  /* Try to play safe.  */
> -  if (f.f_bsize == 0)
> -    f.f_bsize = 512;
> -
> -  /* Write something to every block.  */
> -  for (offset += (len - 1) % f.f_bsize; len > 0; offset += f.f_bsize)
> +  /* Minimize data transfer for network file systems, by issuing
> +     single-byte write requests spaced by the file system block size.
> +     (Most local file systems have fallocate support, so this fallback
> +     code is not used there.)  */
> +
> +  unsigned increment;
> +  {
> +    struct statfs64 f;
> +
> +    if (__fstatfs64 (fd, &f) != 0)
> +      return errno;
> +    if (f.f_bsize == 0)
> +      increment = 512;
> +    else if (f.f_bsize < 4096)
> +      increment = f.f_bsize;
> +    else
> +      /* NFS does not propagate the block size of the underlying
> +	 storage and may report a much larger value which would still
> +	 leave holes after the loop below, so we cap the increment at
> +	 4096.  */
> +      increment = 4096;

OK.

> +  }
> +
> +  /* Write a null byte to every block.  This is racy; we currently
> +     lack a better option.  Compare-and-swap against a file mapping
> +     might additional local races, but requires interposition of a
> +     signal handler to catch SIGBUS.  */
> +  for (offset += (len - 1) % increment; len > 0; offset += increment)
>      {
> -      len -= f.f_bsize;
> +      len -= increment;
>  
>        if (offset < st.st_size)
>  	{
> diff --git a/sysdeps/posix/posix_fallocate64.c b/sysdeps/posix/posix_fallocate64.c
> index b845df7..ee32679 100644
> --- a/sysdeps/posix/posix_fallocate64.c
> +++ b/sysdeps/posix/posix_fallocate64.c
> @@ -18,26 +18,36 @@
>  #include <errno.h>
>  #include <fcntl.h>
>  #include <unistd.h>
> +#include <stdint.h>
> +#include <sys/fcntl.h>
>  #include <sys/stat.h>
>  #include <sys/statfs.h>
>  
> -/* Reserve storage for the data of the file associated with FD.  */
> +/* Reserve storage for the data of the file associated with FD.  This
> +   emulation is far from perfect, but the kernel cannot do not much
> +   better for network file systems, either.  */
>  
>  int
>  __posix_fallocate64_l64 (int fd, __off64_t offset, __off64_t len)
>  {
>    struct stat64 st;
> -  struct statfs64 f;
>  
> -  /* `off64_t' is a signed type.  Therefore we can determine whether
> -     OFFSET + LEN is too large if it is a negative value.  */
>    if (offset < 0 || len < 0)
>      return EINVAL;
> -  if (offset + len < 0)
> +
> +  /* Perform overflow check.  The outer cast relies on a GCC
> +     extension.  */
> +  if ((__off64_t) ((uint64_t) offset) + ((uint64_t) len) < 0)
>      return EFBIG;
>  
> -  /* First thing we have to make sure is that this is really a regular
> -     file.  */
> +  /* pwrite64 below will not do the right thing in O_APPEND mode.  */
> +  {
> +    int flags = __fcntl (fd, F_GETFL, 0);
> +    if (flags < 0 || (flags & O_APPEND) != 0)
> +      return EBADF;
> +  }
> +
> +  /* We have to make sure that this is really a regular file.  */
>    if (__fxstat64 (_STAT_VER, fd, &st) != 0)
>      return EBADF;
>    if (S_ISFIFO (st.st_mode))
> @@ -47,6 +57,8 @@ __posix_fallocate64_l64 (int fd, __off64_t offset, __off64_t len)
>  
>    if (len == 0)
>      {
> +      /* This is racy, but there is no good way to satisfy a
> +	 zero-length allocation request.  */
>        if (st.st_size < offset)
>  	{
>  	  int ret = __ftruncate64 (fd, offset);
> @@ -58,19 +70,36 @@ __posix_fallocate64_l64 (int fd, __off64_t offset, __off64_t len)
>        return 0;
>      }
>  
> -  /* We have to know the block size of the filesystem to get at least some
> -     sort of performance.  */
> -  if (__fstatfs64 (fd, &f) != 0)
> -    return errno;
> -
> -  /* Try to play safe.  */
> -  if (f.f_bsize == 0)
> -    f.f_bsize = 512;
> -
> -  /* Write something to every block.  */
> -  for (offset += (len - 1) % f.f_bsize; len > 0; offset += f.f_bsize)
> +  /* Minimize data transfer for network file systems, by issuing
> +     single-byte write requests spaced by the file system block size.
> +     (Most local file systems have fallocate support, so this fallback
> +     code is not used there.)  */
> +
> +  unsigned increment;
> +  {
> +    struct statfs64 f;
> +
> +    if (__fstatfs64 (fd, &f) != 0)
> +      return errno;
> +    if (f.f_bsize == 0)
> +      increment = 512;
> +    else if (f.f_bsize < 4096)
> +      increment = f.f_bsize;
> +    else
> +      /* NFS clients do not propagate the block size of the underlying
> +	 storage and may report a much larger value which would still
> +	 leave holes after the loop below, so we cap the increment at
> +	 4096.  */
> +      increment = 4096;
> +  }
> +
> +  /* Write a null byte to every block.  This is racy; we currently
> +     lack a better option.  Compare-and-swap against a file mapping
> +     might address local races, but requires interposition of a signal
> +     handler to catch SIGBUS.  */
> +  for (offset += (len - 1) % increment; len > 0; offset += increment)
>      {
> -      len -= f.f_bsize;
> +      len -= increment;
>  
>        if (offset < st.st_size)
>  	{
> -- 2.1.0

OK.

Cheers,
Carlos.
Florian Weimer June 5, 2015, 9:52 a.m. UTC | #8
On 06/03/2015 06:06 AM, Carlos O'Donell wrote:

> OK to checkin.

Thanks, committed.  I did not at bug 15661 to the NEWS file and resolved
it as WONTFIX.
Carlos O'Donell June 5, 2015, 1:11 p.m. UTC | #9
On 06/05/2015 05:52 AM, Florian Weimer wrote:
> On 06/03/2015 06:06 AM, Carlos O'Donell wrote:
> 
>> OK to checkin.
> 
> Thanks, committed.  I did not at bug 15661 to the NEWS file and resolved
> it as WONTFIX.

Since this commit fixes a user-visible issue it should have a bug filed
and that bug should be marked in the fixed bug list. This way users
can reopen if they have problems, or review the bugs fixed in the release.

Would you mind opening a bug for the NFS block size issue and then
marking it closed, and updating your ChangeLog entry and NEWS?

Cheers,
Carlos.
Florian Weimer June 5, 2015, 1:22 p.m. UTC | #10
On 06/05/2015 03:11 PM, Carlos O'Donell wrote:
> On 06/05/2015 05:52 AM, Florian Weimer wrote:
>> On 06/03/2015 06:06 AM, Carlos O'Donell wrote:
>>
>>> OK to checkin.
>>
>> Thanks, committed.  I did not at bug 15661 to the NEWS file and resolved
>> it as WONTFIX.
> 
> Since this commit fixes a user-visible issue it should have a bug filed
> and that bug should be marked in the fixed bug list. This way users
> can reopen if they have problems, or review the bugs fixed in the release.
> 
> Would you mind opening a bug for the NFS block size issue and then
> marking it closed, and updating your ChangeLog entry and NEWS?

Turns out there was already a bug specifically for the NFS issue, 17322,
so used that.
diff mbox

Patch

From f25f46c0e0223d9f6e9e8ead27a37caa34f33631 Mon Sep 17 00:00:00 2001
Message-Id: <f25f46c0e0223d9f6e9e8ead27a37caa34f33631.1431945166.git.fweimer@redhat.com>
From: Florian Weimer <fweimer@redhat.com>
Date: Mon, 18 May 2015 11:32:44 +0100
Subject: [PATCH] posix_fallocate: Emulation fixes and documentation [BZ
 #15661]
To: libc-alpha@sourceware.org

Handle signed integer overflow correctly.  Detect and reject O_APPEND.
Document drawbacks of emulation.

This does not completely address bug 15661, but improves the situation
somewhat.
---
 ChangeLog                         |  9 ++++
 manual/filesys.texi               | 94 +++++++++++++++++++++++++++++++++++++++
 sysdeps/posix/posix_fallocate.c   | 67 ++++++++++++++++++++--------
 sysdeps/posix/posix_fallocate64.c | 67 ++++++++++++++++++++--------
 4 files changed, 199 insertions(+), 38 deletions(-)

diff --git a/ChangeLog b/ChangeLog
index 4de8a25..603847b 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,3 +1,12 @@ 
+2015-05-18  Florian Weimer  <fweimer@redhat.com>
+
+	[BZ #15661]
+	* sysdeps/posix/posix_fallocate64.c (__posix_fallocate64_l64):
+	Check for overflow properly.  Check for O_APPEND.  Ignore large
+	file system block sizes.  Add comments about problems.
+	* sysdeps/posix/posix_fallocate.c (posix_fallocate): Likewise.
+	* manual/filesys.texi (Storage Allocation): New node.
+
 2015-05-18  Arjun Shankar  <arjun.is@lostca.se>
 
 	* include/stdio.h: Define __need_wint_t.
diff --git a/manual/filesys.texi b/manual/filesys.texi
index 7d55b43..0f2e3dc 100644
--- a/manual/filesys.texi
+++ b/manual/filesys.texi
@@ -1723,6 +1723,7 @@  modify the attributes of a file.
                                  access a file.
 * File Times::                  About the time attributes of a file.
 * File Size::			Manually changing the size of a file.
+* Storage Allocation::          Allocate backing storage for files.
 @end menu
 
 @node Attribute Meanings
@@ -3233,6 +3234,99 @@  is a requirement of @code{mmap}.  The program has to keep track of the
 real size, and when it has finished a final @code{ftruncate} call should
 set the real size of the file.
 
+@node Storage Allocation
+@subsection Storage Allocation
+@cindex allocating file storage
+@cindex file allocation
+@cindex storage allocating
+
+@cindex file fragmentation
+@cindex fragmentation of files
+@cindex sparse files
+@cindex files, sparse
+Most file systems support allocating large files in a non-contiguous
+fashion: the file is split into @emph{fragments} which are allocated
+sequentially, but the fragments themselves can be scattered across the
+disk.  File systems generally try to avoid such fragmentation because it
+decreases performance, but if a file gradually increases in size, there
+might be no other option than to fragment it.  In addition, many file
+systems support @emph{sparse files} with @emph{holes}: regions of null
+bytes for which no backing storage has been allocated by the file
+system.  When the holes are finally overwritten with data, fragmentation
+can occur as well.
+
+Explicit allocation of storage for yet-unwritten parts of the file can
+help the system to avoid fragmentation.  Additionally, if storage
+pre-allocation fails, it is possible to report the out-of-disk error
+early, often without filling up the entire disk.  However, due to
+deduplication, copy-on-write semantics, and file compression, such
+pre-allocation may not reliably prevent the out-of-disk-space error from
+occurring later.  Checking for write errors is still required, and
+writes to memory-mapped regions created with @code{mmap} can still
+result in @code{SIGBUS}.
+
+@deftypefun int posix_fallocate (int @var{fd}, off_t @var{offset}, off_t @var{length})
+@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
+@c If the file system does not support allocation,
+@c @code{posix_fallocate} has a race with file extension (if
+@c @var{length} is zero) or with concurrent writes of non-NUL bytes (if
+@c @var{length} is positive).
+
+Allocate backing store for the region of @var{length} bytes starting at
+byte @var{offset} in the file for the descriptor @var{fd}.  The file
+length is increased to @samp{@var{length} + @var{offset}} if necessary.
+
+@var{fd} must be a regular file opened for writing, or @code{EBADF} is
+returned.  If there is insufficient disk space to fulfill the allocation
+request, @code{ENOSPC} is returned.
+
+@strong{Note:} If @code{fallocate} is not available (because the file
+system does not support it), @code{posix_fallocate} is emulated, which
+has the following drawbacks:
+
+@itemize @bullet
+@item
+It is very inefficient because all file system blocks in the requested
+range need to be examined (even if they have been allocated before) and
+potentially rewritten.  In contrast, with proper @code{fallocate}
+support (see below), the file system can examine the internal file
+allocation data structures and eliminate holes directly, maybe even
+using unwritten extents (which are pre-allocated but uninitialized on
+disk).
+
+@item
+There is a race condition if another thread or process modifies the
+underlying file in the to-be-allocated area.  Non-null bytes could be
+overwritten with null bytes.
+
+@item
+If @var{fd} has been opened with the @code{O_APPEND} flag, the function
+will fail with an @code{errno} value of @code{EBADF}.
+
+@item
+If @var{length} is zero, @code{ftruncate} is used to increase the file
+size as requested, without allocating file system blocks.  There is a
+race condition which means that @code{ftruncate} can accidentally
+truncate the file if it has been extended concurrently.
+@end itemize
+
+On Linux, if an application does not benefit from emulation or if the
+emulation is harmful due to its inherent race conditions, the
+application can use the Linux-specific @code{fallocate} function, with a
+zero flag argument.  For the @code{fallocate} function, @theglibc{} does
+not perform allocation emulation if the file system does not support
+allocation.  Instead, an @code{EOPNOTSUPP} is returned to the caller.
+
+@end deftypefun
+
+@deftypefun int posix_fallocate64 (int @var{fd}, off64_t @var{length}, off64_t @var{offset})
+@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
+
+This function is a variant of @code{posix_fallocate64} which accepts
+64-bit file offsets on all platforms.
+
+@end deftypefun
+
 @node Making Special Files
 @section Making Special Files
 @cindex creating special files
diff --git a/sysdeps/posix/posix_fallocate.c b/sysdeps/posix/posix_fallocate.c
index d15d603..e7fe201 100644
--- a/sysdeps/posix/posix_fallocate.c
+++ b/sysdeps/posix/posix_fallocate.c
@@ -18,26 +18,36 @@ 
 #include <errno.h>
 #include <fcntl.h>
 #include <unistd.h>
+#include <stdint.h>
+#include <sys/fcntl.h>
 #include <sys/stat.h>
 #include <sys/statfs.h>
 
-/* Reserve storage for the data of the file associated with FD.  */
+/* Reserve storage for the data of the file associated with FD.  This
+   emulation is far from perfect, but the kernel cannot do not much
+   better for network file systems, either.  */
 
 int
 posix_fallocate (int fd, __off_t offset, __off_t len)
 {
   struct stat64 st;
-  struct statfs f;
 
-  /* `off_t' is a signed type.  Therefore we can determine whether
-     OFFSET + LEN is too large if it is a negative value.  */
   if (offset < 0 || len < 0)
     return EINVAL;
-  if (offset + len < 0)
+
+  /* Perform overflow check.  The outer cast relies on a GCC
+     extension.  */
+  if ((__off_t) ((uint64_t) offset) + ((uint64_t) len) < 0)
     return EFBIG;
 
-  /* First thing we have to make sure is that this is really a regular
-     file.  */
+  /* pwrite below will not do the right thing in O_APPEND mode.  */
+  {
+    int flags = __fcntl (fd, F_GETFL, 0);
+    if (flags < 0 || (flags & O_APPEND) != 0)
+      return EBADF;
+  }
+
+  /* We have to make sure that this is really a regular file.  */
   if (__fxstat64 (_STAT_VER, fd, &st) != 0)
     return EBADF;
   if (S_ISFIFO (st.st_mode))
@@ -47,6 +57,8 @@  posix_fallocate (int fd, __off_t offset, __off_t len)
 
   if (len == 0)
     {
+      /* This is racy, but there is no good way to satisfy a
+	 zero-length allocation request.  */
       if (st.st_size < offset)
 	{
 	  int ret = __ftruncate (fd, offset);
@@ -58,19 +70,36 @@  posix_fallocate (int fd, __off_t offset, __off_t len)
       return 0;
     }
 
-  /* We have to know the block size of the filesystem to get at least some
-     sort of performance.  */
-  if (__fstatfs (fd, &f) != 0)
-    return errno;
-
-  /* Try to play safe.  */
-  if (f.f_bsize == 0)
-    f.f_bsize = 512;
-
-  /* Write something to every block.  */
-  for (offset += (len - 1) % f.f_bsize; len > 0; offset += f.f_bsize)
+  /* Minimize data transfer for network file systems, by issuing
+     single-byte write requests spaced by the file system block size.
+     (Most local file systems have fallocate support, so this fallback
+     code is not used there.)  */
+
+  unsigned increment;
+  {
+    struct statfs64 f;
+
+    if (__fstatfs64 (fd, &f) != 0)
+      return errno;
+    if (f.f_bsize == 0)
+      increment = 512;
+    else if (f.f_bsize < 4096)
+      increment = f.f_bsize;
+    else
+      /* NFS does not propagate the block size of the underlying
+	 storage and may report a much larger value which would still
+	 leave holes after the loop below, so we cap the increment at
+	 4096.  */
+      increment = 4096;
+  }
+
+  /* Write a null byte to every block.  This is racy; we currently
+     lack a better option.  Compare-and-swap against a file mapping
+     might additional local races, but requires interposition of a
+     signal handler to catch SIGBUS.  */
+  for (offset += (len - 1) % increment; len > 0; offset += increment)
     {
-      len -= f.f_bsize;
+      len -= increment;
 
       if (offset < st.st_size)
 	{
diff --git a/sysdeps/posix/posix_fallocate64.c b/sysdeps/posix/posix_fallocate64.c
index b845df7..ee32679 100644
--- a/sysdeps/posix/posix_fallocate64.c
+++ b/sysdeps/posix/posix_fallocate64.c
@@ -18,26 +18,36 @@ 
 #include <errno.h>
 #include <fcntl.h>
 #include <unistd.h>
+#include <stdint.h>
+#include <sys/fcntl.h>
 #include <sys/stat.h>
 #include <sys/statfs.h>
 
-/* Reserve storage for the data of the file associated with FD.  */
+/* Reserve storage for the data of the file associated with FD.  This
+   emulation is far from perfect, but the kernel cannot do not much
+   better for network file systems, either.  */
 
 int
 __posix_fallocate64_l64 (int fd, __off64_t offset, __off64_t len)
 {
   struct stat64 st;
-  struct statfs64 f;
 
-  /* `off64_t' is a signed type.  Therefore we can determine whether
-     OFFSET + LEN is too large if it is a negative value.  */
   if (offset < 0 || len < 0)
     return EINVAL;
-  if (offset + len < 0)
+
+  /* Perform overflow check.  The outer cast relies on a GCC
+     extension.  */
+  if ((__off64_t) ((uint64_t) offset) + ((uint64_t) len) < 0)
     return EFBIG;
 
-  /* First thing we have to make sure is that this is really a regular
-     file.  */
+  /* pwrite64 below will not do the right thing in O_APPEND mode.  */
+  {
+    int flags = __fcntl (fd, F_GETFL, 0);
+    if (flags < 0 || (flags & O_APPEND) != 0)
+      return EBADF;
+  }
+
+  /* We have to make sure that this is really a regular file.  */
   if (__fxstat64 (_STAT_VER, fd, &st) != 0)
     return EBADF;
   if (S_ISFIFO (st.st_mode))
@@ -47,6 +57,8 @@  __posix_fallocate64_l64 (int fd, __off64_t offset, __off64_t len)
 
   if (len == 0)
     {
+      /* This is racy, but there is no good way to satisfy a
+	 zero-length allocation request.  */
       if (st.st_size < offset)
 	{
 	  int ret = __ftruncate64 (fd, offset);
@@ -58,19 +70,36 @@  __posix_fallocate64_l64 (int fd, __off64_t offset, __off64_t len)
       return 0;
     }
 
-  /* We have to know the block size of the filesystem to get at least some
-     sort of performance.  */
-  if (__fstatfs64 (fd, &f) != 0)
-    return errno;
-
-  /* Try to play safe.  */
-  if (f.f_bsize == 0)
-    f.f_bsize = 512;
-
-  /* Write something to every block.  */
-  for (offset += (len - 1) % f.f_bsize; len > 0; offset += f.f_bsize)
+  /* Minimize data transfer for network file systems, by issuing
+     single-byte write requests spaced by the file system block size.
+     (Most local file systems have fallocate support, so this fallback
+     code is not used there.)  */
+
+  unsigned increment;
+  {
+    struct statfs64 f;
+
+    if (__fstatfs64 (fd, &f) != 0)
+      return errno;
+    if (f.f_bsize == 0)
+      increment = 512;
+    else if (f.f_bsize < 4096)
+      increment = f.f_bsize;
+    else
+      /* NFS clients do not propagate the block size of the underlying
+	 storage and may report a much larger value which would still
+	 leave holes after the loop below, so we cap the increment at
+	 4096.  */
+      increment = 4096;
+  }
+
+  /* Write a null byte to every block.  This is racy; we currently
+     lack a better option.  Compare-and-swap against a file mapping
+     might address local races, but requires interposition of a signal
+     handler to catch SIGBUS.  */
+  for (offset += (len - 1) % increment; len > 0; offset += increment)
     {
-      len -= f.f_bsize;
+      len -= increment;
 
       if (offset < st.st_size)
 	{
-- 
2.1.0