[v4] Update mmap() flags and errors lists

Message ID xntthvs6qa.fsf@greed.delorie.com
State New

Commit Message

DJ Delorie June 14, 2024, 6:46 p.m. UTC
[v4: tweaked text on MAP_FIXED_NOREPLACE, MAP_POPULATE, MAP_32BIT, and
ENOMEM]

Extend the list of MAP_* macros to include all macros available
to the average program (gcc -E -dM | grep MAP_*)

Extend the list of errno codes.

Comments

Mathieu Desnoyers June 18, 2024, 8:13 p.m. UTC | #1
On 14-Jun-2024 02:46:05 PM, DJ Delorie wrote:
> [v4: tweaked text on MAP_FIXED_NOREPLACE, MAP_POPULATE, MAP_32BIT, and
> ENOMEM]
> 
> Extend the list of MAP_* macros to include all macros available
> to the average program (gcc -E -dM | grep MAP_*)
> 
> Extend the list of errno codes.

I will review this patch based on my understanding of the Linux mmap(2)
manual from Linux man-pages 6.03 2023-02-05. It is very much possible
that your intent is not to match that specific syscall man page, in
which case feel free to dismiss my concerns.

> 
> diff --git a/manual/llio.texi b/manual/llio.texi
> index 0d1a32e3e1..7edec3e8d7 100644
> --- a/manual/llio.texi
> +++ b/manual/llio.texi
> @@ -1574,10 +1574,15 @@ permitted.  They include @code{PROT_READ}, @code{PROT_WRITE}, and
>  of address space for future use.  The @code{mprotect} function can be
>  used to change the protection flags.  @xref{Memory Protection}.
>  
> -@var{flags} contains flags that control the nature of the map.
> -One of @code{MAP_SHARED} or @code{MAP_PRIVATE} must be specified.
> +The @var{flags} parameter contains flags that control the nature of
> +the map.  One of @code{MAP_SHARED}, @code{MAP_SHARED_VALIDATE}, or
> +@code{MAP_PRIVATE} must be specified.  Additional flags may be bitwise
> +OR'd to further define the mapping.

OK

>  
> -They include:
> +Note that, aside from @code{MAP_PRIVATE} and @code{MAP_SHARED}, not
> +all flags are supported on all versions of all operating systems.
> +Consult the kernel-specific documentation for details.  The flags
> +include:
>  
>  @vtable @code
>  @item MAP_PRIVATE
> @@ -1599,9 +1604,19 @@ Note that actual writing may take place at any time.  You need to use
>  @code{msync}, described below, if it is important that other processes
>  using conventional I/O get a consistent view of the file.
>  
> +@item MAP_SHARED_VALIDATE
> +Similar to @code{MAP_SHARED} except that additional flags will be
> +validated by the kernel, and the call will fail if an unrecognized
> +flag is provided.  With @code{MAP_SHARED} using a flag on a kernel
> +that doesn't support it causes the flag to be ignored.
> +@code{MAP_SHARED_VALIDATE} should be used when the behavior of all
> +flags is required.

OK

> +
>  @item MAP_FIXED
>  This forces the system to use the exact mapping address specified in
> -@var{address} and fail if it can't.
> +@var{address} and fail if it can't.  Note that if the new mapping
> +would overlap an existing mapping, the overlapping portion of the
> +existing map is unmapped.
>  
>  @c One of these is official - the other is obviously an obsolete synonym
>  @c Which is which?
> @@ -1642,10 +1657,78 @@ The @code{MAP_HUGETLB} flag is specific to Linux.
>  @c There is a mechanism to select different hugepage sizes; see
>  @c include/uapi/asm-generic/hugetlb_encode.h in the kernel sources.
>  
> -@c Linux has some other MAP_ options, which I have not discussed here.
> -@c MAP_DENYWRITE, MAP_EXECUTABLE and MAP_GROWSDOWN don't seem applicable to
> -@c user programs (and I don't understand the last two).  MAP_LOCKED does
> -@c not appear to be implemented.
> +@item MAP_32BIT
> +Require addresses that can be accessed with a signed 32 bit pointer,
> +i.e., within the first 2 GiB.  Ignored if MAP_FIXED is specified.
> +
> +@item MAP_DENYWRITE
> +@itemx MAP_EXECUTABLE
> +@itemx MAP_FILE
> +
> +Provided for compatibility.  Ignored by the Linux kernel.
> +
> +@item MAP_FIXED_NOREPLACE
> +Similar to @code{MAP_FIXED} except the call will fail with
> +@code{EEXIST} if the new mapping would overwrite an existing mapping.
> +To test for this, specify MAP_FIXED_NOREPLACE without MAP_FIXED, and
> +check the actual address returned.  If it does not match the address
> +passed, then this flag is not supported.

mmap(2) states that older kernels fall back to non-MAP_FIXED behavior if
the mapping would overwrite an existing mapping, which requires careful
handling of the return value. Is this backward-compatibility
handling somehow abstracted within the libc wrapper?
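
For reference, the return-value handling in question could be sketched in
C roughly as follows. This is only an illustration under assumptions, not
what the glibc wrapper does: the helper name map_exactly_at is
hypothetical, the mapping is assumed to be private and anonymous, and
setting errno to EEXIST in the fallback case is a choice made for this
sketch.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical helper: map LEN anonymous bytes at exactly HINT without
   replacing any existing mapping.  Returns the address on success, or
   MAP_FAILED with errno set.  */
void *
map_exactly_at (void *hint, size_t len)
{
  assert (hint != NULL && len > 0);
#ifdef MAP_FIXED_NOREPLACE
  void *p = mmap (hint, len, PROT_READ | PROT_WRITE,
                  MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED_NOREPLACE, -1, 0);
  if (p == MAP_FAILED)
    return MAP_FAILED;   /* e.g. EEXIST on kernels that know the flag */
  if (p != hint)
    {
      /* The kernel ignored the flag and treated HINT as a plain hint.
         Undo the mapping and report the conflict ourselves.  */
      munmap (p, len);
      errno = EEXIST;
      return MAP_FAILED;
    }
  return p;
#else
  /* Headers without the flag: report "not supported" (sketch choice).  */
  errno = ENOTSUP;
  return MAP_FAILED;
#endif
}
```

A caller sees the same failure whether the kernel rejected the request
itself or silently placed the mapping elsewhere.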

> +
> +@item MAP_GROWSDOWN
> +This flag is used to make stacks, and is typically only needed inside
> +the program loader to set up the main stack for the running process.
> +The mapping is created according to the other flags, except an
> +additional page just prior to the mapping is marked as a ``guard
> +page''.  If a write is attempted inside this guard page, that page is
> +mapped, the mapping is extended, and a new guard page is created.
> +Thus, the mapping continues to grow towards lower addresses until it
> +encounters some other mapping.
> +
> +Note that accessing memory beyond the guard page will not trigger this
> +feature.  In gcc, use @code{-fstack-clash-protection} to ensure the
> +guard page is always touched.

OK

> +
> +@item MAP_LOCKED
> +A hint that requests that mapped pages are locked in memory (i.e. not
> +paged out).  Note that this is a request and not a requirement; use
> +@code{mlock} if locking is required.
> +
> +@item MAP_POPULATE
> +@itemx MAP_NONBLOCK
> +These two are opposites.  @code{MAP_POPULATE} is a hint that requests
> +that the kernel read-ahead a file-backed mapping, causing more pages
> +to be mapped before they're needed.  @code{MAP_NONBLOCK} is a hint
> +that requests that the kernel @emph{not} attempt such, only mapping
> +pages when they're actually needed.  Note that neither of these hints
> +affects future paging activity, use @code{mlock} if such needs to be
> +controlled.

This explanation does not match my understanding of the mmap(2) man
page. MAP_NONBLOCK appears to be only meaningful in conjunction _with_
MAP_POPULATE. I suspect the goal here when those are combined is to
opportunistically populate the page table entries when those do not
require read-ahead from a file (AFAIU).
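
To make the combination concrete, here is a minimal C sketch in which
MAP_NONBLOCK only ever modifies a MAP_POPULATE request. The helper name
map_prefault and its "fd < 0 means anonymous" convention are assumptions
made for this illustration. Per mmap(2), since Linux 2.6.23 adding
MAP_NONBLOCK causes MAP_POPULATE to do nothing, so the second flag
currently just turns prefaulting off again.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical helper: create a private mapping of LEN bytes and ask
   the kernel to prefault it.  If FD is negative the mapping is
   anonymous.  If NONBLOCKING is nonzero, MAP_NONBLOCK is added on top
   of MAP_POPULATE rather than used instead of it.  */
void *
map_prefault (int fd, size_t len, int nonblocking)
{
  assert (len > 0);
  int flags = MAP_PRIVATE | MAP_POPULATE;
  if (fd < 0)
    flags |= MAP_ANONYMOUS;
  if (nonblocking)
    flags |= MAP_NONBLOCK;   /* only meaningful together with MAP_POPULATE */
  return mmap (NULL, len, PROT_READ | PROT_WRITE, flags, fd, 0);
}
```

Either way the mapping is usable afterwards; the flags only influence
how much work is done up front.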

> +
> +@item MAP_NORESERVE
> +Asks the kernel to not reserve physical backing for a mapping.

What is "physical backing"? I guess that you mean not backed by a swap
block device (or anything that requires I/O), but I am not sure that
"physical backing" conveys this clearly.

> This
> +would be useful for, for example, a very large but sparsely used
> +mapping which need not be limited in span by available RAM or swap.

I don't understand the meaning of this. How does not reserving swap
have anything to do with the virtual mapping size and its sparseness?
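
The use case the patch text seems to have in mind might look like the C
sketch below: a large anonymous region of which only a few pages are
ever written. The helper name reserve_sparse is hypothetical, and
whether the flag changes the outcome at all depends on the system's
overcommit policy, so treat this as an illustration of the call rather
than a statement about kernel accounting.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical helper: request SPAN bytes of zero-filled address space
   with MAP_NORESERVE.  Returns NULL on failure.  Pages are only backed
   by memory once they are written, and such a write may fault if the
   system cannot provide a page at that point.  */
unsigned char *
reserve_sparse (size_t span)
{
  assert (span > 0);
  void *p = mmap (NULL, span, PROT_READ | PROT_WRITE,
                  MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
  return p == MAP_FAILED ? NULL : p;
}
```

A caller would typically write to a handful of widely separated offsets
and release the region with munmap when done.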

> +Note that writes to such a mapping may cause a @code{SIGSEGV} if the
> +amount of backing required eventualy exceeds system resources.

It could be clarified that here "backing" does _not_ refer to physical
backing.

> +
> +On Linux, this flag's behavior may be overwridden by
> +@file{/proc/sys/vm/overcommit_memory} as documented in the proc(5) man
> +page.

OK

> +
> +@item MAP_STACK
> +Ensures that the resulting mapping is suitable for use as a program
> +stack.  For example, the use of huge pages might be precluded.

OK

> +
> +@item MAP_SYNC
> +This flag is used to map persistent memory devices into the running
> +program in such a way that writes to the mapping are immediately
> +written to the device as well.  Unlike most other flags, this one will
> +fail unless @code{MAP_SHARED_VALIDATE} is also given.

Note that this wording is misleading. Users of persistent memory devices
need to issue explicit "flush" instructions to ensure that writes are
made persistent to the device. MAP_SYNC merely guarantees that
memory mappings within a file on a dax-enabled filesystem will appear
at the same file offset after a crash/reboot. It does not guarantee
anything about write persistence.

> +
> +@item MAP_UNINITIALIZED
> +This flag allows the kernel to map anonymous pages without zeroing
> +them out first.  This is, of course, a security risk, and will only
> +work if the kernel is built to allow it (typically, on single-process
> +embedded systems).

OK

>  
>  @end vtable
>  
> @@ -1656,6 +1739,24 @@ Possible errors include:
>  
>  @table @code
>  
> +@item EACCES
> +
> +@var{filedes} was not open for the type of access specified in @var{protect}.
> +
> +@item EAGAIN
> +
> +The system has temporarily run out of resources.

or file has been locked.

> +
> +@item EBADF
> +
> +The @var{fd} passes is invalid, and a valid file descriptor is

passes -> passed ?

> +required (i.e. MAP_ANONYMOUS was not specified).
> +
> +@item EEXIST
> +
> +@code{MAP_FIXED_NOREPLACE} was specified and an existing mapping was
> +found overlapping the requested address range.
> +
>  @item EINVAL
>  
>  Either @var{address} was unusable (because it is not a multiple of the
> @@ -1664,23 +1765,33 @@ applicable page size), or inconsistent @var{flags} were given.
>  If @code{MAP_HUGETLB} was specified, the file or system does not support
>  large page sizes.
>  
> -@item EACCES
> +@item ENODEV
>  
> -@var{filedes} was not open for the type of access specified in @var{protect}.
> +This file is of a type that doesn't support mapping.
>  
>  @item ENOMEM
>  
> -Either there is not enough memory for the operation, or the process is
> -out of address space.
> -
> -@item ENODEV
> -
> -This file is of a type that doesn't support mapping.
> +There is not enough memory for the operation, the process is out of
> +address space, or there are too many mappings.  On Linux, the maximum
> +number of mappings can be controlled via
> +@file{/proc/sys/vm/max_map_count} or, if your OS supports it, via
> +the @code{vm.max_map_count} @code{sysctl} setting.

Also when the RLIMIT_DATA limit (see getrlimit(2)) would be exceeded, or
when @addr exceeds the virtual address space of the CPU.

>  
>  @item ENOEXEC
>  
>  The file is on a filesystem that doesn't support mapping.
>  
> +@item EPERM
> +
> +@code{PROT_EXEC} was requested but the file is on a filesystem that
> +was mounted with execution denied.

Also: the operation was prevented by a file seal (fcntl(2)).
Also: the MAP_HUGETLB flag was specified, but the caller was not privileged.

> +
> +@item EOVERFLOW
> +
> +Either the offset into the file plus the length of the mapping causes
> +internal page counts to overflow, or the offset requested exceeds the
> +length of the file.
> +

Thanks,

Mathieu


>  @c On Linux, EAGAIN will appear if the file has a conflicting mandatory lock.
>  @c However mandatory locks are not discussed in this manual.
>  @c
Florian Weimer June 19, 2024, 7:16 a.m. UTC | #2
* DJ Delorie:

> +@item MAP_UNINITIALIZED
> +This flag allows the kernel to map anonymous pages without zeroing
> +them out first.  This is, of course, a security risk, and will only
> +work if the kernel is built to allow it (typically, on single-process
> +embedded systems).

This isn't currently part of our headers, I think.

Thanks,
Florian

Patch

diff --git a/manual/llio.texi b/manual/llio.texi
index 0d1a32e3e1..7edec3e8d7 100644
--- a/manual/llio.texi
+++ b/manual/llio.texi
@@ -1574,10 +1574,15 @@  permitted.  They include @code{PROT_READ}, @code{PROT_WRITE}, and
 of address space for future use.  The @code{mprotect} function can be
 used to change the protection flags.  @xref{Memory Protection}.
 
-@var{flags} contains flags that control the nature of the map.
-One of @code{MAP_SHARED} or @code{MAP_PRIVATE} must be specified.
+The @var{flags} parameter contains flags that control the nature of
+the map.  One of @code{MAP_SHARED}, @code{MAP_SHARED_VALIDATE}, or
+@code{MAP_PRIVATE} must be specified.  Additional flags may be bitwise
+OR'd to further define the mapping.
 
-They include:
+Note that, aside from @code{MAP_PRIVATE} and @code{MAP_SHARED}, not
+all flags are supported on all versions of all operating systems.
+Consult the kernel-specific documentation for details.  The flags
+include:
 
 @vtable @code
 @item MAP_PRIVATE
@@ -1599,9 +1604,19 @@  Note that actual writing may take place at any time.  You need to use
 @code{msync}, described below, if it is important that other processes
 using conventional I/O get a consistent view of the file.
 
+@item MAP_SHARED_VALIDATE
+Similar to @code{MAP_SHARED} except that additional flags will be
+validated by the kernel, and the call will fail if an unrecognized
+flag is provided.  With @code{MAP_SHARED} using a flag on a kernel
+that doesn't support it causes the flag to be ignored.
+@code{MAP_SHARED_VALIDATE} should be used when the behavior of all
+flags is required.
+
 @item MAP_FIXED
 This forces the system to use the exact mapping address specified in
-@var{address} and fail if it can't.
+@var{address} and fail if it can't.  Note that if the new mapping
+would overlap an existing mapping, the overlapping portion of the
+existing map is unmapped.
 
 @c One of these is official - the other is obviously an obsolete synonym
 @c Which is which?
@@ -1642,10 +1657,78 @@  The @code{MAP_HUGETLB} flag is specific to Linux.
 @c There is a mechanism to select different hugepage sizes; see
 @c include/uapi/asm-generic/hugetlb_encode.h in the kernel sources.
 
-@c Linux has some other MAP_ options, which I have not discussed here.
-@c MAP_DENYWRITE, MAP_EXECUTABLE and MAP_GROWSDOWN don't seem applicable to
-@c user programs (and I don't understand the last two).  MAP_LOCKED does
-@c not appear to be implemented.
+@item MAP_32BIT
+Require addresses that can be accessed with a signed 32 bit pointer,
+i.e., within the first 2 GiB.  Ignored if MAP_FIXED is specified.
+
+@item MAP_DENYWRITE
+@itemx MAP_EXECUTABLE
+@itemx MAP_FILE
+
+Provided for compatibility.  Ignored by the Linux kernel.
+
+@item MAP_FIXED_NOREPLACE
+Similar to @code{MAP_FIXED} except the call will fail with
+@code{EEXIST} if the new mapping would overwrite an existing mapping.
+To test for this, specify MAP_FIXED_NOREPLACE without MAP_FIXED, and
+check the actual address returned.  If it does not match the address
+passed, then this flag is not supported.
+
+@item MAP_GROWSDOWN
+This flag is used to make stacks, and is typically only needed inside
+the program loader to set up the main stack for the running process.
+The mapping is created according to the other flags, except an
+additional page just prior to the mapping is marked as a ``guard
+page''.  If a write is attempted inside this guard page, that page is
+mapped, the mapping is extended, and a new guard page is created.
+Thus, the mapping continues to grow towards lower addresses until it
+encounters some other mapping.
+
+Note that accessing memory beyond the guard page will not trigger this
+feature.  In gcc, use @code{-fstack-clash-protection} to ensure the
+guard page is always touched.
+
+@item MAP_LOCKED
+A hint that requests that mapped pages are locked in memory (i.e. not
+paged out).  Note that this is a request and not a requirement; use
+@code{mlock} if locking is required.
+
+@item MAP_POPULATE
+@itemx MAP_NONBLOCK
+These two are opposites.  @code{MAP_POPULATE} is a hint that requests
+that the kernel read-ahead a file-backed mapping, causing more pages
+to be mapped before they're needed.  @code{MAP_NONBLOCK} is a hint
+that requests that the kernel @emph{not} attempt such, only mapping
+pages when they're actually needed.  Note that neither of these hints
+affects future paging activity, use @code{mlock} if such needs to be
+controlled.
+
+@item MAP_NORESERVE
+Asks the kernel to not reserve physical backing for a mapping.  This
+would be useful for, for example, a very large but sparsely used
+mapping which need not be limited in span by available RAM or swap.
+Note that writes to such a mapping may cause a @code{SIGSEGV} if the
+amount of backing required eventualy exceeds system resources.
+
+On Linux, this flag's behavior may be overwridden by
+@file{/proc/sys/vm/overcommit_memory} as documented in the proc(5) man
+page.
+
+@item MAP_STACK
+Ensures that the resulting mapping is suitable for use as a program
+stack.  For example, the use of huge pages might be precluded.
+
+@item MAP_SYNC
+This flag is used to map persistent memory devices into the running
+program in such a way that writes to the mapping are immediately
+written to the device as well.  Unlike most other flags, this one will
+fail unless @code{MAP_SHARED_VALIDATE} is also given.
+
+@item MAP_UNINITIALIZED
+This flag allows the kernel to map anonymous pages without zeroing
+them out first.  This is, of course, a security risk, and will only
+work if the kernel is built to allow it (typically, on single-process
+embedded systems).
 
 @end vtable
 
@@ -1656,6 +1739,24 @@  Possible errors include:
 
 @table @code
 
+@item EACCES
+
+@var{filedes} was not open for the type of access specified in @var{protect}.
+
+@item EAGAIN
+
+The system has temporarily run out of resources.
+
+@item EBADF
+
+The @var{fd} passes is invalid, and a valid file descriptor is
+required (i.e. MAP_ANONYMOUS was not specified).
+
+@item EEXIST
+
+@code{MAP_FIXED_NOREPLACE} was specified and an existing mapping was
+found overlapping the requested address range.
+
 @item EINVAL
 
 Either @var{address} was unusable (because it is not a multiple of the
@@ -1664,23 +1765,33 @@  applicable page size), or inconsistent @var{flags} were given.
 If @code{MAP_HUGETLB} was specified, the file or system does not support
 large page sizes.
 
-@item EACCES
+@item ENODEV
 
-@var{filedes} was not open for the type of access specified in @var{protect}.
+This file is of a type that doesn't support mapping.
 
 @item ENOMEM
 
-Either there is not enough memory for the operation, or the process is
-out of address space.
-
-@item ENODEV
-
-This file is of a type that doesn't support mapping.
+There is not enough memory for the operation, the process is out of
+address space, or there are too many mappings.  On Linux, the maximum
+number of mappings can be controlled via
+@file{/proc/sys/vm/max_map_count} or, if your OS supports it, via
+the @code{vm.max_map_count} @code{sysctl} setting.
 
 @item ENOEXEC
 
 The file is on a filesystem that doesn't support mapping.
 
+@item EPERM
+
+@code{PROT_EXEC} was requested but the file is on a filesystem that
+was mounted with execution denied.
+
+@item EOVERFLOW
+
+Either the offset into the file plus the length of the mapping causes
+internal page counts to overflow, or the offset requested exceeds the
+length of the file.
+
 @c On Linux, EAGAIN will appear if the file has a conflicting mandatory lock.
 @c However mandatory locks are not discussed in this manual.
 @c