mbox series

[RFC,v2,0/4] mm: Introduce MAP_BELOW_HINT

Message ID 20240829-patches-below_hint_mmap-v2-0-638a28d9eae0@rivosinc.com
Headers show
Series mm: Introduce MAP_BELOW_HINT | expand

Message

Charlie Jenkins Aug. 29, 2024, 7:15 a.m. UTC
Some applications rely on placing data in free bits addresses allocated
by mmap. Various architectures (eg. x86, arm64, powerpc) restrict the
address returned by mmap to be less than the 48-bit address space,
unless the hint address uses more than 47 bits (the 48th bit is reserved
for the kernel address space).

The riscv architecture needs a way to similarly restrict the virtual
address space. On the riscv port of OpenJDK an error is thrown if
attempted to run on the 57-bit address space, called sv57 [1].  golang
has a comment that sv57 support is not complete, but there are some
workarounds to get it to mostly work [2].

These applications work on x86 because x86 does an implicit 47-bit
restriction of mmap() address that contain a hint address that is less
than 48 bits.

Instead of implicitly restricting the address space on riscv (or any
current/future architecture), a flag would allow users to opt-in to this
behavior rather than opt-out as is done on other architectures. This is
desirable because it is a small class of applications that do pointer
masking.

This flag will also allow seemless compatibility between all
architectures, so applications like Go and OpenJDK that use bits in a
virtual address can request the exact number of bits they need in a
generic way. The flag can be checked inside of vm_unmapped_area() so
that this flag does not have to be handled individually by each
architecture. 

Link:
https://github.com/openjdk/jdk/blob/f080b4bb8a75284db1b6037f8c00ef3b1ef1add1/src/hotspot/cpu/riscv/vm_version_riscv.cpp#L79
[1]
Link:
https://github.com/golang/go/blob/9e8ea567c838574a0f14538c0bbbd83c3215aa55/src/runtime/tagptr_64bit.go#L47
[2]

To: Arnd Bergmann <arnd@arndb.de>
To: Richard Henderson <richard.henderson@linaro.org>
To: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
To: Matt Turner <mattst88@gmail.com>
To: Vineet Gupta <vgupta@kernel.org>
To: Russell King <linux@armlinux.org.uk>
To: Guo Ren <guoren@kernel.org>
To: Huacai Chen <chenhuacai@kernel.org>
To: WANG Xuerui <kernel@xen0n.name>
To: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
To: James E.J. Bottomley <James.Bottomley@HansenPartnership.com>
To: Helge Deller <deller@gmx.de>
To: Michael Ellerman <mpe@ellerman.id.au>
To: Nicholas Piggin <npiggin@gmail.com>
To: Christophe Leroy <christophe.leroy@csgroup.eu>
To: Naveen N Rao <naveen@kernel.org>
To: Alexander Gordeev <agordeev@linux.ibm.com>
To: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
To: Heiko Carstens <hca@linux.ibm.com>
To: Vasily Gorbik <gor@linux.ibm.com>
To: Christian Borntraeger <borntraeger@linux.ibm.com>
To: Sven Schnelle <svens@linux.ibm.com>
To: Yoshinori Sato <ysato@users.sourceforge.jp>
To: Rich Felker <dalias@libc.org>
To: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
To: David S. Miller <davem@davemloft.net>
To: Andreas Larsson <andreas@gaisler.com>
To: Thomas Gleixner <tglx@linutronix.de>
To: Ingo Molnar <mingo@redhat.com>
To: Borislav Petkov <bp@alien8.de>
To: Dave Hansen <dave.hansen@linux.intel.com>
To: x86@kernel.org
To: H. Peter Anvin <hpa@zytor.com>
To: Andy Lutomirski <luto@kernel.org>
To: Peter Zijlstra <peterz@infradead.org>
To: Muchun Song <muchun.song@linux.dev>
To: Andrew Morton <akpm@linux-foundation.org>
To: Liam R. Howlett <Liam.Howlett@oracle.com>
To: Vlastimil Babka <vbabka@suse.cz>
To: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
To: Shuah Khan <shuah@kernel.org>
Cc: linux-arch@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-alpha@vger.kernel.org
Cc: linux-snps-arc@lists.infradead.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-csky@vger.kernel.org
Cc: loongarch@lists.linux.dev
Cc: linux-mips@vger.kernel.org
Cc: linux-parisc@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-s390@vger.kernel.org
Cc: linux-sh@vger.kernel.org
Cc: sparclinux@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: linux-kselftest@vger.kernel.org
Signed-off-by: Charlie Jenkins <charlie@rivosinc.com>

Changes in v2:
- Added much greater detail to cover letter
- Removed all code that touched architecture specific code and was able
  to factor this out into all generic functions, except for flags that
  needed to be added to vm_unmapped_area_info
- Made this an RFC since I have only tested it on riscv and x86
- Link to v1: https://lore.kernel.org/r/20240827-patches-below_hint_mmap-v1-0-46ff2eb9022d@rivosinc.com

---
Charlie Jenkins (4):
      mm: Add MAP_BELOW_HINT
      mm: Add hint and mmap_flags to struct vm_unmapped_area_info
      mm: Support MAP_BELOW_HINT in vm_unmapped_area()
      selftests/mm: Create MAP_BELOW_HINT test

 arch/alpha/kernel/osf_sys.c                  |  2 ++
 arch/arc/mm/mmap.c                           |  3 +++
 arch/arm/mm/mmap.c                           |  7 ++++++
 arch/csky/abiv1/mmap.c                       |  3 +++
 arch/loongarch/mm/mmap.c                     |  3 +++
 arch/mips/mm/mmap.c                          |  3 +++
 arch/parisc/kernel/sys_parisc.c              |  3 +++
 arch/powerpc/mm/book3s64/slice.c             |  7 ++++++
 arch/s390/mm/hugetlbpage.c                   |  4 ++++
 arch/s390/mm/mmap.c                          |  6 ++++++
 arch/sh/mm/mmap.c                            |  6 ++++++
 arch/sparc/kernel/sys_sparc_32.c             |  3 +++
 arch/sparc/kernel/sys_sparc_64.c             |  6 ++++++
 arch/sparc/mm/hugetlbpage.c                  |  4 ++++
 arch/x86/kernel/sys_x86_64.c                 |  6 ++++++
 arch/x86/mm/hugetlbpage.c                    |  4 ++++
 fs/hugetlbfs/inode.c                         |  4 ++++
 include/linux/mm.h                           |  2 ++
 include/uapi/asm-generic/mman-common.h       |  1 +
 mm/mmap.c                                    |  9 ++++++++
 tools/include/uapi/asm-generic/mman-common.h |  1 +
 tools/testing/selftests/mm/Makefile          |  1 +
 tools/testing/selftests/mm/map_below_hint.c  | 32 ++++++++++++++++++++++++++++
 23 files changed, 120 insertions(+)
---
base-commit: 5be63fc19fcaa4c236b307420483578a56986a37
change-id: 20240827-patches-below_hint_mmap-b13d79ae1c55

Comments

Michael Ellerman Aug. 29, 2024, 8:26 a.m. UTC | #1
Charlie Jenkins <charlie@rivosinc.com> writes:
> Some applications rely on placing data in free bits addresses allocated
> by mmap. Various architectures (eg. x86, arm64, powerpc) restrict the
> address returned by mmap to be less than the 48-bit address space,
> unless the hint address uses more than 47 bits (the 48th bit is reserved
> for the kernel address space).
>
> To make this behavior explicit and more versatile across all
> architectures, define a mmap flag that allows users to define an
> arbitrary upper limit on addresses returned by mmap.
>
> Signed-off-by: Charlie Jenkins <charlie@rivosinc.com>
> ---
>  include/uapi/asm-generic/mman-common.h       | 1 +
>  tools/include/uapi/asm-generic/mman-common.h | 1 +
  
You're not meant to update the headers in tools/ directly. There's a
mail somewhere from acme somewhere describing the proper process, but
the tldr is leave it up to him.

> diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
> index 6ce1f1ceb432..03ac13d9aa37 100644
> --- a/include/uapi/asm-generic/mman-common.h
> +++ b/include/uapi/asm-generic/mman-common.h
> @@ -32,6 +32,7 @@
>  
>  #define MAP_UNINITIALIZED 0x4000000	/* For anonymous mmap, memory could be
>  					 * uninitialized */
> +#define MAP_BELOW_HINT	  0x8000000	/* give out address that is below (inclusive) hint address */

IMHO the API would be clearer if this actually forced the address to be
below the hint. That's what the flag name implies after all.

It would also mean the application doesn't need to take into account the
length of the mapping when passing the hint.

cheers
Michal Hocko Aug. 29, 2024, 8:30 a.m. UTC | #2
On Thu 29-08-24 00:15:57, Charlie Jenkins wrote:
> Some applications rely on placing data in free bits addresses allocated
> by mmap. Various architectures (eg. x86, arm64, powerpc) restrict the
> address returned by mmap to be less than the 48-bit address space,
> unless the hint address uses more than 47 bits (the 48th bit is reserved
> for the kernel address space).
> 
> The riscv architecture needs a way to similarly restrict the virtual
> address space. On the riscv port of OpenJDK an error is thrown if
> attempted to run on the 57-bit address space, called sv57 [1].  golang
> has a comment that sv57 support is not complete, but there are some
> workarounds to get it to mostly work [2].
> 
> These applications work on x86 because x86 does an implicit 47-bit
> restriction of mmap() address that contain a hint address that is less
> than 48 bits.
> 
> Instead of implicitly restricting the address space on riscv (or any
> current/future architecture), a flag would allow users to opt-in to this
> behavior rather than opt-out as is done on other architectures. This is
> desirable because it is a small class of applications that do pointer
> masking.

IIRC this has been discussed at length when 5-level page tables support
has been proposed for x86. Sorry I do not have a link handy but lore
should help you. Linus was not really convinced and in the end vetoed it
and prefer that those few applications that benefit from greater address
space would do that explicitly than other way around.
Lorenzo Stoakes Aug. 29, 2024, 8:42 a.m. UTC | #3
On Thu, Aug 29, 2024 at 12:15:57AM GMT, Charlie Jenkins wrote:
> Some applications rely on placing data in free bits addresses allocated
> by mmap. Various architectures (eg. x86, arm64, powerpc) restrict the
> address returned by mmap to be less than the 48-bit address space,
> unless the hint address uses more than 47 bits (the 48th bit is reserved
> for the kernel address space).

I'm still confused as to why, if an mmap flag is desired, and thus programs
are having to be heavily modified and controlled to be able to do this, why
you can't just do an mmap() with PROT_NONE early, around a hinted address
that, sits below the required limit, and then mprotect() or mmap() over it?

Your feature is a major adjustment to mmap(), it needs to be pretty
significantly justified, especially if taking up a new flag.

>
> The riscv architecture needs a way to similarly restrict the virtual
> address space. On the riscv port of OpenJDK an error is thrown if
> attempted to run on the 57-bit address space, called sv57 [1].  golang
> has a comment that sv57 support is not complete, but there are some
> workarounds to get it to mostly work [2].
>
> These applications work on x86 because x86 does an implicit 47-bit
> restriction of mmap() address that contain a hint address that is less
> than 48 bits.

You mean x86 _has_ to limit to physically available bits in a canonical
format :) this will not be the case for 5-page table levels though...

>
> Instead of implicitly restricting the address space on riscv (or any
> current/future architecture), a flag would allow users to opt-in to this
> behavior rather than opt-out as is done on other architectures. This is
> desirable because it is a small class of applications that do pointer
> masking.

I raised this last time and you didn't seem to address it so to be more
blunt:

I don't understand why this needs to be an mmap() flag. From this it seems
the whole process needs allocations to be below a certain limit.

That _could_ be achieved through a 'personality' or similar (though a
personality is on/off, rather than allowing configuration so maybe
something else would be needed).

From what you're saying 57-bit is all you really need right? So maybe
ADDR_LIMIT_57BIT?

I don't see how you're going to actually enforce this in a process either
via an mmap flag, as a library might decide not to use it, so you'd need to
control the allocator, the thread library implementation, and everything
that might allocate.

Liam also raised various points about VMA particulars that I'm not sure are
addressed either.

I just find it hard to believe that everything will fit together.

I'd _really_ need to be convinced that this MAP_ flag is justified, and I"m
just not.

>
> This flag will also allow seemless compatibility between all
> architectures, so applications like Go and OpenJDK that use bits in a
> virtual address can request the exact number of bits they need in a
> generic way. The flag can be checked inside of vm_unmapped_area() so
> that this flag does not have to be handled individually by each
> architecture.

I'm still very unconvinced and feel the bar needs to be high for making
changes like this that carry maintainership burden.

So for me, it's a no really as an overall concept.

Happy to be convinced otherwise, however... (I may be missing details or
context that provide more justification).

>
> Link:
> https://github.com/openjdk/jdk/blob/f080b4bb8a75284db1b6037f8c00ef3b1ef1add1/src/hotspot/cpu/riscv/vm_version_riscv.cpp#L79
> [1]
> Link:
> https://github.com/golang/go/blob/9e8ea567c838574a0f14538c0bbbd83c3215aa55/src/runtime/tagptr_64bit.go#L47
> [2]
>
> To: Arnd Bergmann <arnd@arndb.de>
> To: Richard Henderson <richard.henderson@linaro.org>
> To: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
> To: Matt Turner <mattst88@gmail.com>
> To: Vineet Gupta <vgupta@kernel.org>
> To: Russell King <linux@armlinux.org.uk>
> To: Guo Ren <guoren@kernel.org>
> To: Huacai Chen <chenhuacai@kernel.org>
> To: WANG Xuerui <kernel@xen0n.name>
> To: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
> To: James E.J. Bottomley <James.Bottomley@HansenPartnership.com>
> To: Helge Deller <deller@gmx.de>
> To: Michael Ellerman <mpe@ellerman.id.au>
> To: Nicholas Piggin <npiggin@gmail.com>
> To: Christophe Leroy <christophe.leroy@csgroup.eu>
> To: Naveen N Rao <naveen@kernel.org>
> To: Alexander Gordeev <agordeev@linux.ibm.com>
> To: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
> To: Heiko Carstens <hca@linux.ibm.com>
> To: Vasily Gorbik <gor@linux.ibm.com>
> To: Christian Borntraeger <borntraeger@linux.ibm.com>
> To: Sven Schnelle <svens@linux.ibm.com>
> To: Yoshinori Sato <ysato@users.sourceforge.jp>
> To: Rich Felker <dalias@libc.org>
> To: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
> To: David S. Miller <davem@davemloft.net>
> To: Andreas Larsson <andreas@gaisler.com>
> To: Thomas Gleixner <tglx@linutronix.de>
> To: Ingo Molnar <mingo@redhat.com>
> To: Borislav Petkov <bp@alien8.de>
> To: Dave Hansen <dave.hansen@linux.intel.com>
> To: x86@kernel.org
> To: H. Peter Anvin <hpa@zytor.com>
> To: Andy Lutomirski <luto@kernel.org>
> To: Peter Zijlstra <peterz@infradead.org>
> To: Muchun Song <muchun.song@linux.dev>
> To: Andrew Morton <akpm@linux-foundation.org>
> To: Liam R. Howlett <Liam.Howlett@oracle.com>
> To: Vlastimil Babka <vbabka@suse.cz>
> To: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> To: Shuah Khan <shuah@kernel.org>
> Cc: linux-arch@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: linux-alpha@vger.kernel.org
> Cc: linux-snps-arc@lists.infradead.org
> Cc: linux-arm-kernel@lists.infradead.org
> Cc: linux-csky@vger.kernel.org
> Cc: loongarch@lists.linux.dev
> Cc: linux-mips@vger.kernel.org
> Cc: linux-parisc@vger.kernel.org
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: linux-s390@vger.kernel.org
> Cc: linux-sh@vger.kernel.org
> Cc: sparclinux@vger.kernel.org
> Cc: linux-mm@kvack.org
> Cc: linux-kselftest@vger.kernel.org
> Signed-off-by: Charlie Jenkins <charlie@rivosinc.com>
>
> Changes in v2:
> - Added much greater detail to cover letter
> - Removed all code that touched architecture specific code and was able
>   to factor this out into all generic functions, except for flags that
>   needed to be added to vm_unmapped_area_info
> - Made this an RFC since I have only tested it on riscv and x86
> - Link to v1: https://lore.kernel.org/r/20240827-patches-below_hint_mmap-v1-0-46ff2eb9022d@rivosinc.com
>
> ---
> Charlie Jenkins (4):
>       mm: Add MAP_BELOW_HINT
>       mm: Add hint and mmap_flags to struct vm_unmapped_area_info
>       mm: Support MAP_BELOW_HINT in vm_unmapped_area()
>       selftests/mm: Create MAP_BELOW_HINT test
>
>  arch/alpha/kernel/osf_sys.c                  |  2 ++
>  arch/arc/mm/mmap.c                           |  3 +++
>  arch/arm/mm/mmap.c                           |  7 ++++++
>  arch/csky/abiv1/mmap.c                       |  3 +++
>  arch/loongarch/mm/mmap.c                     |  3 +++
>  arch/mips/mm/mmap.c                          |  3 +++
>  arch/parisc/kernel/sys_parisc.c              |  3 +++
>  arch/powerpc/mm/book3s64/slice.c             |  7 ++++++
>  arch/s390/mm/hugetlbpage.c                   |  4 ++++
>  arch/s390/mm/mmap.c                          |  6 ++++++
>  arch/sh/mm/mmap.c                            |  6 ++++++
>  arch/sparc/kernel/sys_sparc_32.c             |  3 +++
>  arch/sparc/kernel/sys_sparc_64.c             |  6 ++++++
>  arch/sparc/mm/hugetlbpage.c                  |  4 ++++
>  arch/x86/kernel/sys_x86_64.c                 |  6 ++++++
>  arch/x86/mm/hugetlbpage.c                    |  4 ++++
>  fs/hugetlbfs/inode.c                         |  4 ++++
>  include/linux/mm.h                           |  2 ++
>  include/uapi/asm-generic/mman-common.h       |  1 +
>  mm/mmap.c                                    |  9 ++++++++
>  tools/include/uapi/asm-generic/mman-common.h |  1 +
>  tools/testing/selftests/mm/Makefile          |  1 +
>  tools/testing/selftests/mm/map_below_hint.c  | 32 ++++++++++++++++++++++++++++
>  23 files changed, 120 insertions(+)
> ---
> base-commit: 5be63fc19fcaa4c236b307420483578a56986a37
> change-id: 20240827-patches-below_hint_mmap-b13d79ae1c55
> --
> - Charlie
>
Vlastimil Babka Aug. 29, 2024, 9:02 a.m. UTC | #4
Such a large recipient list and no linux-api. CC'd, please include it on
future postings.

On 8/29/24 09:15, Charlie Jenkins wrote:
> Some applications rely on placing data in free bits addresses allocated
> by mmap. Various architectures (eg. x86, arm64, powerpc) restrict the
> address returned by mmap to be less than the 48-bit address space,
> unless the hint address uses more than 47 bits (the 48th bit is reserved
> for the kernel address space).
> 
> The riscv architecture needs a way to similarly restrict the virtual
> address space. On the riscv port of OpenJDK an error is thrown if
> attempted to run on the 57-bit address space, called sv57 [1].  golang
> has a comment that sv57 support is not complete, but there are some
> workarounds to get it to mostly work [2].
> 
> These applications work on x86 because x86 does an implicit 47-bit
> restriction of mmap() address that contain a hint address that is less
> than 48 bits.
> 
> Instead of implicitly restricting the address space on riscv (or any
> current/future architecture), a flag would allow users to opt-in to this
> behavior rather than opt-out as is done on other architectures. This is
> desirable because it is a small class of applications that do pointer
> masking.

I doubt it's desirable to have different behavior depending on architecture.
Also you could say it's a small class of applications that need more than 47
bits.

> This flag will also allow seemless compatibility between all
> architectures, so applications like Go and OpenJDK that use bits in a
> virtual address can request the exact number of bits they need in a
> generic way. The flag can be checked inside of vm_unmapped_area() so
> that this flag does not have to be handled individually by each
> architecture. 
> 
> Link:
> https://github.com/openjdk/jdk/blob/f080b4bb8a75284db1b6037f8c00ef3b1ef1add1/src/hotspot/cpu/riscv/vm_version_riscv.cpp#L79
> [1]
> Link:
> https://github.com/golang/go/blob/9e8ea567c838574a0f14538c0bbbd83c3215aa55/src/runtime/tagptr_64bit.go#L47
> [2]
> 
> To: Arnd Bergmann <arnd@arndb.de>
> To: Richard Henderson <richard.henderson@linaro.org>
> To: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
> To: Matt Turner <mattst88@gmail.com>
> To: Vineet Gupta <vgupta@kernel.org>
> To: Russell King <linux@armlinux.org.uk>
> To: Guo Ren <guoren@kernel.org>
> To: Huacai Chen <chenhuacai@kernel.org>
> To: WANG Xuerui <kernel@xen0n.name>
> To: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
> To: James E.J. Bottomley <James.Bottomley@HansenPartnership.com>
> To: Helge Deller <deller@gmx.de>
> To: Michael Ellerman <mpe@ellerman.id.au>
> To: Nicholas Piggin <npiggin@gmail.com>
> To: Christophe Leroy <christophe.leroy@csgroup.eu>
> To: Naveen N Rao <naveen@kernel.org>
> To: Alexander Gordeev <agordeev@linux.ibm.com>
> To: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
> To: Heiko Carstens <hca@linux.ibm.com>
> To: Vasily Gorbik <gor@linux.ibm.com>
> To: Christian Borntraeger <borntraeger@linux.ibm.com>
> To: Sven Schnelle <svens@linux.ibm.com>
> To: Yoshinori Sato <ysato@users.sourceforge.jp>
> To: Rich Felker <dalias@libc.org>
> To: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
> To: David S. Miller <davem@davemloft.net>
> To: Andreas Larsson <andreas@gaisler.com>
> To: Thomas Gleixner <tglx@linutronix.de>
> To: Ingo Molnar <mingo@redhat.com>
> To: Borislav Petkov <bp@alien8.de>
> To: Dave Hansen <dave.hansen@linux.intel.com>
> To: x86@kernel.org
> To: H. Peter Anvin <hpa@zytor.com>
> To: Andy Lutomirski <luto@kernel.org>
> To: Peter Zijlstra <peterz@infradead.org>
> To: Muchun Song <muchun.song@linux.dev>
> To: Andrew Morton <akpm@linux-foundation.org>
> To: Liam R. Howlett <Liam.Howlett@oracle.com>
> To: Vlastimil Babka <vbabka@suse.cz>
> To: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> To: Shuah Khan <shuah@kernel.org>
> Cc: linux-arch@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: linux-alpha@vger.kernel.org
> Cc: linux-snps-arc@lists.infradead.org
> Cc: linux-arm-kernel@lists.infradead.org
> Cc: linux-csky@vger.kernel.org
> Cc: loongarch@lists.linux.dev
> Cc: linux-mips@vger.kernel.org
> Cc: linux-parisc@vger.kernel.org
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: linux-s390@vger.kernel.org
> Cc: linux-sh@vger.kernel.org
> Cc: sparclinux@vger.kernel.org
> Cc: linux-mm@kvack.org
> Cc: linux-kselftest@vger.kernel.org
> Signed-off-by: Charlie Jenkins <charlie@rivosinc.com>
> 
> Changes in v2:
> - Added much greater detail to cover letter
> - Removed all code that touched architecture specific code and was able
>   to factor this out into all generic functions, except for flags that
>   needed to be added to vm_unmapped_area_info
> - Made this an RFC since I have only tested it on riscv and x86
> - Link to v1: https://lore.kernel.org/r/20240827-patches-below_hint_mmap-v1-0-46ff2eb9022d@rivosinc.com
> 
> ---
> Charlie Jenkins (4):
>       mm: Add MAP_BELOW_HINT
>       mm: Add hint and mmap_flags to struct vm_unmapped_area_info
>       mm: Support MAP_BELOW_HINT in vm_unmapped_area()
>       selftests/mm: Create MAP_BELOW_HINT test
> 
>  arch/alpha/kernel/osf_sys.c                  |  2 ++
>  arch/arc/mm/mmap.c                           |  3 +++
>  arch/arm/mm/mmap.c                           |  7 ++++++
>  arch/csky/abiv1/mmap.c                       |  3 +++
>  arch/loongarch/mm/mmap.c                     |  3 +++
>  arch/mips/mm/mmap.c                          |  3 +++
>  arch/parisc/kernel/sys_parisc.c              |  3 +++
>  arch/powerpc/mm/book3s64/slice.c             |  7 ++++++
>  arch/s390/mm/hugetlbpage.c                   |  4 ++++
>  arch/s390/mm/mmap.c                          |  6 ++++++
>  arch/sh/mm/mmap.c                            |  6 ++++++
>  arch/sparc/kernel/sys_sparc_32.c             |  3 +++
>  arch/sparc/kernel/sys_sparc_64.c             |  6 ++++++
>  arch/sparc/mm/hugetlbpage.c                  |  4 ++++
>  arch/x86/kernel/sys_x86_64.c                 |  6 ++++++
>  arch/x86/mm/hugetlbpage.c                    |  4 ++++
>  fs/hugetlbfs/inode.c                         |  4 ++++
>  include/linux/mm.h                           |  2 ++
>  include/uapi/asm-generic/mman-common.h       |  1 +
>  mm/mmap.c                                    |  9 ++++++++
>  tools/include/uapi/asm-generic/mman-common.h |  1 +
>  tools/testing/selftests/mm/Makefile          |  1 +
>  tools/testing/selftests/mm/map_below_hint.c  | 32 ++++++++++++++++++++++++++++
>  23 files changed, 120 insertions(+)
> ---
> base-commit: 5be63fc19fcaa4c236b307420483578a56986a37
> change-id: 20240827-patches-below_hint_mmap-b13d79ae1c55
Lorenzo Stoakes Aug. 29, 2024, 9:54 a.m. UTC | #5
On Thu, Aug 29, 2024 at 09:42:22AM GMT, Lorenzo Stoakes wrote:
> On Thu, Aug 29, 2024 at 12:15:57AM GMT, Charlie Jenkins wrote:
> > Some applications rely on placing data in free bits addresses allocated
> > by mmap. Various architectures (eg. x86, arm64, powerpc) restrict the
> > address returned by mmap to be less than the 48-bit address space,
> > unless the hint address uses more than 47 bits (the 48th bit is reserved
> > for the kernel address space).
>
> I'm still confused as to why, if an mmap flag is desired, and thus programs
> are having to be heavily modified and controlled to be able to do this, why
> you can't just do an mmap() with PROT_NONE early, around a hinted address
> that, sits below the required limit, and then mprotect() or mmap() over it?
>
> Your feature is a major adjustment to mmap(), it needs to be pretty
> significantly justified, especially if taking up a new flag.
>
> >
> > The riscv architecture needs a way to similarly restrict the virtual
> > address space. On the riscv port of OpenJDK an error is thrown if
> > attempted to run on the 57-bit address space, called sv57 [1].  golang
> > has a comment that sv57 support is not complete, but there are some
> > workarounds to get it to mostly work [2].
> >
> > These applications work on x86 because x86 does an implicit 47-bit
> > restriction of mmap() address that contain a hint address that is less
> > than 48 bits.
>
> You mean x86 _has_ to limit to physically available bits in a canonical
> format :) this will not be the case for 5-page table levels though...
>
> >
> > Instead of implicitly restricting the address space on riscv (or any
> > current/future architecture), a flag would allow users to opt-in to this
> > behavior rather than opt-out as is done on other architectures. This is
> > desirable because it is a small class of applications that do pointer
> > masking.
>
> I raised this last time and you didn't seem to address it so to be more
> blunt:
>
> I don't understand why this needs to be an mmap() flag. From this it seems
> the whole process needs allocations to be below a certain limit.
>
> That _could_ be achieved through a 'personality' or similar (though a
> personality is on/off, rather than allowing configuration so maybe
> something else would be needed).
>
> From what you're saying 57-bit is all you really need right? So maybe
> ADDR_LIMIT_57BIT?
>
> I don't see how you're going to actually enforce this in a process either
> via an mmap flag, as a library might decide not to use it, so you'd need to
> control the allocator, the thread library implementation, and everything
> that might allocate.
>
> Liam also raised various points about VMA particulars that I'm not sure are
> addressed either.
>
> I just find it hard to believe that everything will fit together.
>
> I'd _really_ need to be convinced that this MAP_ flag is justified, and I"m
> just not.
>
> >
> > This flag will also allow seemless compatibility between all
> > architectures, so applications like Go and OpenJDK that use bits in a
> > virtual address can request the exact number of bits they need in a
> > generic way. The flag can be checked inside of vm_unmapped_area() so
> > that this flag does not have to be handled individually by each
> > architecture.
>
> I'm still very unconvinced and feel the bar needs to be high for making
> changes like this that carry maintainership burden.
>
> So for me, it's a no really as an overall concept.
>
> Happy to be convinced otherwise, however... (I may be missing details or
> context that provide more justification).
>

Some more thoughts:

* If you absolutely must keep allocations below a certain limit, you'd
  probably need to actually associate this information with the VMA so the
  memory can't be mremap()'d somewhere invalid (you might not control all
  code so you can't guarantee this won't happen).
* Keeping a map limit associated with a VMA would be horrid and keeping
  VMAs as small as possible is a key aim, so that'd be a no go. VMA flags
  are in limited supply also.
* If we did implement a per-process thing, but it were arbitrary, we'd then
  have to handle all kinds of corner cases forever (this is UAPI, can't
  break it etc.) with crazy-low values, or determine a minimum that might
  vary by arch...
* If we did this we'd absolutely have to implement a check in the brk()
  implementation, which is a very very sensitive bit of code. And of
  course, in mmap() and mremap()... and any arch-specific code that might
  interface with this stuff (these functions are hooked).
* A fixed address limit would make more sense, but it seems difficult to
  know what would work for everybody, and again we'd have to deal with edge
  cases and having a permanent maintenance burden.
* If you did have a map flag what about merging between VMAs above the
  limit and below it? To avoid that you'd need to implement some kind of a
  'VMA flag that has an arbitrary characteristic' or a 'limit' field,
  adjust all the 'can VMA merge' functions and write extensive testing and
  none of that is frankly acceptable.
* We have some 'weird' arches that might have problem with certain virtual
  address ranges or require arbitrary mappings at a certain address range
  that a limit might not be able to account for.

I'm absolutely opposed to a new MAP_ flag for this, but even if you
implemented that, it implies a lot of complexity.

It implies even more complexity if you implement something per-process
except if it were a fixed limit.

And if you implement a fixed limit, it's hard to see that it'll be
acceptable to everybody, and I suspect we'd still run into some possible
weirdness.

So again, I'm struggling to see how this concept can be justified in any
form.
Palmer Dabbelt Aug. 29, 2024, 5 p.m. UTC | #6
On Thu, 29 Aug 2024 02:02:34 PDT (-0700), vbabka@suse.cz wrote:
> Such a large recipient list and no linux-api. CC'd, please include it on
> future postings.
>
> On 8/29/24 09:15, Charlie Jenkins wrote:
>> Some applications rely on placing data in free bits addresses allocated
>> by mmap. Various architectures (eg. x86, arm64, powerpc) restrict the
>> address returned by mmap to be less than the 48-bit address space,
>> unless the hint address uses more than 47 bits (the 48th bit is reserved
>> for the kernel address space).
>>
>> The riscv architecture needs a way to similarly restrict the virtual
>> address space. On the riscv port of OpenJDK an error is thrown if
>> attempted to run on the 57-bit address space, called sv57 [1].  golang
>> has a comment that sv57 support is not complete, but there are some
>> workarounds to get it to mostly work [2].
>>
>> These applications work on x86 because x86 does an implicit 47-bit
>> restriction of mmap() address that contain a hint address that is less
>> than 48 bits.
>>
>> Instead of implicitly restricting the address space on riscv (or any
>> current/future architecture), a flag would allow users to opt-in to this
>> behavior rather than opt-out as is done on other architectures. This is
>> desirable because it is a small class of applications that do pointer
>> masking.
>
> I doubt it's desirable to have different behavior depending on architecture.
> Also you could say it's a small class of applications that need more than 47
> bits.

We're sort of stuck with the architeture-depending behavior here: for 
the first few years RISC-V only had 39-bit VAs, so the defato uABI ended 
up being that userspace can ignore way more bits.  While 48 bits might 
be enough for everyone, 39 doesn't seem to be -- or at least IIRC when 
we tried restricting the default to that, we broke stuff.  There's also 
some other wrinkles like arbitrary bit boundaries in pointer masking and 
vendor-specific paging formats, but at some point we just end up down a 
rabbit hole of insanity there...

FWIW, I think that userspace depending on just tossing some VA bits 
because some kernels happened to never allocate from them is just 
broken, but it seems like other ports worked around the 48->57 bit 
transition and we're trying to do something similar for 39->48 (and that 
works with 49->57, as we'll have to deal with that eventually).

So that's basically how we ended up with this sort of thing: trying to 
do something similar without a flag broke userspace because we were 
trying to jam too much into the hints.  I couldn't really figure out a 
way to satisfy all the userspace constraints by just implicitly 
retrofitting behavior based on the hints, so we figured having an 
explicit flag to control the behavior would be the sanest way to go.

That said: I'm not opposed to just saying "depending on 39-bit VAs is 
broken" and just forcing people to fix it.

>> This flag will also allow seemless compatibility between all
>> architectures, so applications like Go and OpenJDK that use bits in a
>> virtual address can request the exact number of bits they need in a
>> generic way. The flag can be checked inside of vm_unmapped_area() so
>> that this flag does not have to be handled individually by each
>> architecture.
>>
>> Link:
>> https://github.com/openjdk/jdk/blob/f080b4bb8a75284db1b6037f8c00ef3b1ef1add1/src/hotspot/cpu/riscv/vm_version_riscv.cpp#L79
>> [1]
>> Link:
>> https://github.com/golang/go/blob/9e8ea567c838574a0f14538c0bbbd83c3215aa55/src/runtime/tagptr_64bit.go#L47
>> [2]
>>
>> To: Arnd Bergmann <arnd@arndb.de>
>> To: Richard Henderson <richard.henderson@linaro.org>
>> To: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
>> To: Matt Turner <mattst88@gmail.com>
>> To: Vineet Gupta <vgupta@kernel.org>
>> To: Russell King <linux@armlinux.org.uk>
>> To: Guo Ren <guoren@kernel.org>
>> To: Huacai Chen <chenhuacai@kernel.org>
>> To: WANG Xuerui <kernel@xen0n.name>
>> To: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
>> To: James E.J. Bottomley <James.Bottomley@HansenPartnership.com>
>> To: Helge Deller <deller@gmx.de>
>> To: Michael Ellerman <mpe@ellerman.id.au>
>> To: Nicholas Piggin <npiggin@gmail.com>
>> To: Christophe Leroy <christophe.leroy@csgroup.eu>
>> To: Naveen N Rao <naveen@kernel.org>
>> To: Alexander Gordeev <agordeev@linux.ibm.com>
>> To: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
>> To: Heiko Carstens <hca@linux.ibm.com>
>> To: Vasily Gorbik <gor@linux.ibm.com>
>> To: Christian Borntraeger <borntraeger@linux.ibm.com>
>> To: Sven Schnelle <svens@linux.ibm.com>
>> To: Yoshinori Sato <ysato@users.sourceforge.jp>
>> To: Rich Felker <dalias@libc.org>
>> To: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
>> To: David S. Miller <davem@davemloft.net>
>> To: Andreas Larsson <andreas@gaisler.com>
>> To: Thomas Gleixner <tglx@linutronix.de>
>> To: Ingo Molnar <mingo@redhat.com>
>> To: Borislav Petkov <bp@alien8.de>
>> To: Dave Hansen <dave.hansen@linux.intel.com>
>> To: x86@kernel.org
>> To: H. Peter Anvin <hpa@zytor.com>
>> To: Andy Lutomirski <luto@kernel.org>
>> To: Peter Zijlstra <peterz@infradead.org>
>> To: Muchun Song <muchun.song@linux.dev>
>> To: Andrew Morton <akpm@linux-foundation.org>
>> To: Liam R. Howlett <Liam.Howlett@oracle.com>
>> To: Vlastimil Babka <vbabka@suse.cz>
>> To: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>> To: Shuah Khan <shuah@kernel.org>
>> Cc: linux-arch@vger.kernel.org
>> Cc: linux-kernel@vger.kernel.org
>> Cc: linux-alpha@vger.kernel.org
>> Cc: linux-snps-arc@lists.infradead.org
>> Cc: linux-arm-kernel@lists.infradead.org
>> Cc: linux-csky@vger.kernel.org
>> Cc: loongarch@lists.linux.dev
>> Cc: linux-mips@vger.kernel.org
>> Cc: linux-parisc@vger.kernel.org
>> Cc: linuxppc-dev@lists.ozlabs.org
>> Cc: linux-s390@vger.kernel.org
>> Cc: linux-sh@vger.kernel.org
>> Cc: sparclinux@vger.kernel.org
>> Cc: linux-mm@kvack.org
>> Cc: linux-kselftest@vger.kernel.org
>> Signed-off-by: Charlie Jenkins <charlie@rivosinc.com>
>>
>> Changes in v2:
>> - Added much greater detail to cover letter
>> - Removed all code that touched architecture specific code and was able
>>   to factor this out into all generic functions, except for flags that
>>   needed to be added to vm_unmapped_area_info
>> - Made this an RFC since I have only tested it on riscv and x86
>> - Link to v1: https://lore.kernel.org/r/20240827-patches-below_hint_mmap-v1-0-46ff2eb9022d@rivosinc.com
>>
>> ---
>> Charlie Jenkins (4):
>>       mm: Add MAP_BELOW_HINT
>>       mm: Add hint and mmap_flags to struct vm_unmapped_area_info
>>       mm: Support MAP_BELOW_HINT in vm_unmapped_area()
>>       selftests/mm: Create MAP_BELOW_HINT test
>>
>>  arch/alpha/kernel/osf_sys.c                  |  2 ++
>>  arch/arc/mm/mmap.c                           |  3 +++
>>  arch/arm/mm/mmap.c                           |  7 ++++++
>>  arch/csky/abiv1/mmap.c                       |  3 +++
>>  arch/loongarch/mm/mmap.c                     |  3 +++
>>  arch/mips/mm/mmap.c                          |  3 +++
>>  arch/parisc/kernel/sys_parisc.c              |  3 +++
>>  arch/powerpc/mm/book3s64/slice.c             |  7 ++++++
>>  arch/s390/mm/hugetlbpage.c                   |  4 ++++
>>  arch/s390/mm/mmap.c                          |  6 ++++++
>>  arch/sh/mm/mmap.c                            |  6 ++++++
>>  arch/sparc/kernel/sys_sparc_32.c             |  3 +++
>>  arch/sparc/kernel/sys_sparc_64.c             |  6 ++++++
>>  arch/sparc/mm/hugetlbpage.c                  |  4 ++++
>>  arch/x86/kernel/sys_x86_64.c                 |  6 ++++++
>>  arch/x86/mm/hugetlbpage.c                    |  4 ++++
>>  fs/hugetlbfs/inode.c                         |  4 ++++
>>  include/linux/mm.h                           |  2 ++
>>  include/uapi/asm-generic/mman-common.h       |  1 +
>>  mm/mmap.c                                    |  9 ++++++++
>>  tools/include/uapi/asm-generic/mman-common.h |  1 +
>>  tools/testing/selftests/mm/Makefile          |  1 +
>>  tools/testing/selftests/mm/map_below_hint.c  | 32 ++++++++++++++++++++++++++++
>>  23 files changed, 120 insertions(+)
>> ---
>> base-commit: 5be63fc19fcaa4c236b307420483578a56986a37
>> change-id: 20240827-patches-below_hint_mmap-b13d79ae1c55
Charlie Jenkins Aug. 29, 2024, 5:33 p.m. UTC | #7
On Thu, Aug 29, 2024 at 10:30:56AM +0200, Michal Hocko wrote:
> On Thu 29-08-24 00:15:57, Charlie Jenkins wrote:
> > Some applications rely on placing data in free bits addresses allocated
> > by mmap. Various architectures (eg. x86, arm64, powerpc) restrict the
> > address returned by mmap to be less than the 48-bit address space,
> > unless the hint address uses more than 47 bits (the 48th bit is reserved
> > for the kernel address space).
> > 
> > The riscv architecture needs a way to similarly restrict the virtual
> > address space. On the riscv port of OpenJDK an error is thrown if
> > attempted to run on the 57-bit address space, called sv57 [1].  golang
> > has a comment that sv57 support is not complete, but there are some
> > workarounds to get it to mostly work [2].
> > 
> > These applications work on x86 because x86 does an implicit 47-bit
> > restriction of mmap() address that contain a hint address that is less
> > than 48 bits.
> > 
> > Instead of implicitly restricting the address space on riscv (or any
> > current/future architecture), a flag would allow users to opt-in to this
> > behavior rather than opt-out as is done on other architectures. This is
> > desirable because it is a small class of applications that do pointer
> > masking.
> 
> IIRC this has been discussed at length when 5-level page tables support
> has been proposed for x86. Sorry I do not have a link handy but lore
> should help you. Linus was not really convinced and in the end vetoed it
> and prefer that those few applications that benefit from greater address
> space would do that explicitly than other way around.

I believe I found the conversation you were referring to. Ingo Molnar
recommended a flag similar to what I have proposed [1]. Catalin
recommended to make 52-bit opt-in on arm64 [2]. Dave Hansen brought up
MPX [3].

However these conversations are tangential to what I am proposing. arm64
and x86 decided to have the default address space be 48 bits. However
this was done on a per-architecture basis with no way for applications
to have guarantees between architectures. Even this behavior to restrict
to 48 bits does not even appear in the man pages, so would require
reading the kernel source code to understand that this feature is
available. Then to opt-in to larger address spaces, applications have to
know to provide a hint address that is greater than 47 bits, mmap() will
then return an address that contains up to 56 bits on x86 and 52 bits on
arm64. This difference of 4 bits causes inconsistency and is part of the
problem I am trying to solve with this flag.

I am not proposing to change x86 and arm64 away from using their opt-out
feature, I am instead proposing a standard ABI for applications that
need some guarantees of the bits used in pointers.

- Charlie

Link: https://lore.kernel.org/lkml/20161209050130.GC2595@gmail.com/ [1]
Link: https://lore.kernel.org/lkml/20161209105120.GA3705@e104818-lin.cambridge.arm.com/
[2]
Link:
https://lore.kernel.org/lkml/a2f86495-b55f-fda0-40d2-242c45d3c1f3@intel.com/
[3]

> 
> -- 
> Michal Hocko
> SUSE Labs
Charlie Jenkins Aug. 29, 2024, 10:16 p.m. UTC | #8
On Thu, Aug 29, 2024 at 10:54:25AM +0100, Lorenzo Stoakes wrote:
> On Thu, Aug 29, 2024 at 09:42:22AM GMT, Lorenzo Stoakes wrote:
> > On Thu, Aug 29, 2024 at 12:15:57AM GMT, Charlie Jenkins wrote:
> > > Some applications rely on placing data in free bits addresses allocated
> > > by mmap. Various architectures (eg. x86, arm64, powerpc) restrict the
> > > address returned by mmap to be less than the 48-bit address space,
> > > unless the hint address uses more than 47 bits (the 48th bit is reserved
> > > for the kernel address space).
> >
> > I'm still confused as to why, if an mmap flag is desired, and thus programs
> > are having to be heavily modified and controlled to be able to do this, why
> > you can't just do an mmap() with PROT_NONE early, around a hinted address
> > that, sits below the required limit, and then mprotect() or mmap() over it?
> >
> > Your feature is a major adjustment to mmap(), it needs to be pretty
> > significantly justified, especially if taking up a new flag.
> >
> > >
> > > The riscv architecture needs a way to similarly restrict the virtual
> > > address space. On the riscv port of OpenJDK an error is thrown if
> > > attempted to run on the 57-bit address space, called sv57 [1].  golang
> > > has a comment that sv57 support is not complete, but there are some
> > > workarounds to get it to mostly work [2].
> > >
> > > These applications work on x86 because x86 does an implicit 47-bit
> > > restriction of mmap() address that contain a hint address that is less
> > > than 48 bits.
> >
> > You mean x86 _has_ to limit to physically available bits in a canonical
> > format :) this will not be the case for 5-page table levels though...

I might be misunderstanding but I am not talking about pointer masking
or canonical addresses here. I am referring to the pattern of:

1. Getting an address from mmap()
2. Writing data into bits assumed to be unused in the address
3. Using the data stored in the address
4. Clearing the data from the address and sign extending
5. Dereferencing the now sign-extended address to conform to canonical
   addresses

I am just talking about step 1 and 2 here -- getting an address from
mmap() that only uses bits that will allow your application to not
break. How canonicalization happens is a a separate conversation, that
can be handled by LAM for x86, TBI for arm64, or Ssnpm for riscv.
While LAM for x86 is only capable of masking addresses to 48 or 57 bits,
Ssnpm for riscv allow an arbitrary number of bits to be masked out.
A design goal here is to be able to support all of the pointer masking
flavors, and not just x86.

> >
> > >
> > > Instead of implicitly restricting the address space on riscv (or any
> > > current/future architecture), a flag would allow users to opt-in to this
> > > behavior rather than opt-out as is done on other architectures. This is
> > > desirable because it is a small class of applications that do pointer
> > > masking.
> >
> > I raised this last time and you didn't seem to address it so to be more
> > blunt:
> >
> > I don't understand why this needs to be an mmap() flag. From this it seems
> > the whole process needs allocations to be below a certain limit.

Yeah making it per-process does seem logical, as it would help with
pointer masking.

> >
> > That _could_ be achieved through a 'personality' or similar (though a
> > personality is on/off, rather than allowing configuration so maybe
> > something else would be needed).
> >
> > From what you're saying 57-bit is all you really need right? So maybe
> > ADDR_LIMIT_57BIT?

Addresses will always be limited to 57 bits on riscv and x86 (but not
necessarily on other architectures). A flag like that would have no
impact, I do not understand what you are suggesting. This patch is to
have a configurable number of bits be restricted.

If anything, a personality that was ADDR_LIMIT_48BIT would be the
closest to what I am trying to achieve. Since the issue is that
applications fail to work when the address space is greater than 48
bits.

> >
> > I don't see how you're going to actually enforce this in a process either
> > via an mmap flag, as a library might decide not to use it, so you'd need to
> > control the allocator, the thread library implementation, and everything
> > that might allocate.

It is reasonable to change the implementation to be per-process but that
is not the current proposal.

This flag was designed for applications which already directly manage
all of their addresses like OpenJDK and Go.

This flag implementation was an attempt to make this feature as least
invasive as possible to reduce maintainence burden and implementation
complexity.

> >
> > Liam also raised various points about VMA particulars that I'm not sure are
> > addressed either.
> >
> > I just find it hard to believe that everything will fit together.
> >
> > I'd _really_ need to be convinced that this MAP_ flag is justified, and I"m
> > just not.
> >
> > >
> > > This flag will also allow seemless compatibility between all
> > > architectures, so applications like Go and OpenJDK that use bits in a
> > > virtual address can request the exact number of bits they need in a
> > > generic way. The flag can be checked inside of vm_unmapped_area() so
> > > that this flag does not have to be handled individually by each
> > > architecture.
> >
> > I'm still very unconvinced and feel the bar needs to be high for making
> > changes like this that carry maintainership burden.
> >

I may be naive but what is the burden here? It's two lines of code to
check MAP_BELOW_HINT and restrict the address. There are the additional
flags for hint and mmap_addr but those are also trivial to implement.

> > So for me, it's a no really as an overall concept.
> >
> > Happy to be convinced otherwise, however... (I may be missing details or
> > context that provide more justification).
> >
> 
> Some more thoughts:
> 
> * If you absolutely must keep allocations below a certain limit, you'd
>   probably need to actually associate this information with the VMA so the
>   memory can't be mremap()'d somewhere invalid (you might not control all
>   code so you can't guarantee this won't happen).
> * Keeping a map limit associated with a VMA would be horrid and keeping
>   VMAs as small as possible is a key aim, so that'd be a no go. VMA flags
>   are in limited supply also.

Yes that does seem like it would be challenging.

> * If we did implement a per-process thing, but it were arbitrary, we'd then
>   have to handle all kinds of corner cases forever (this is UAPI, can't
>   break it etc.) with crazy-low values, or determine a minimum that might
>   vary by arch...

Throwing an error if the value is determined to be "too low" seems
reasonable.

> * If we did this we'd absolutely have to implement a check in the brk()
>   implementation, which is a very very sensitive bit of code. And of
>   course, in mmap() and mremap()... and any arch-specific code that might
>   interface with this stuff (these functions are hooked).
> * A fixed address limit would make more sense, but it seems difficult to
>   know what would work for everybody, and again we'd have to deal with edge
>   cases and having a permanent maintenance burden.

A fixed value is not ideal, since a single size probably would not be
suffiecient for every application. However if necessary we could fix it
to 48-bits since arm64 and x86 already do that, and that would still
allow a generic way of defining this behavior.

> * If you did have a map flag what about merging between VMAs above the
>   limit and below it? To avoid that you'd need to implement some kind of a
>   'VMA flag that has an arbitrary characteristic' or a 'limit' field,
>   adjust all the 'can VMA merge' functions and write extensive testing and
>   none of that is frankly acceptable.
> * We have some 'weird' arches that might have problem with certain virtual
>   address ranges or require arbitrary mappings at a certain address range
>   that a limit might not be able to account for.
> 
> I'm absolutely opposed to a new MAP_ flag for this, but even if you
> implemented that, it implies a lot of complexity.
> 
> It implies even more complexity if you implement something per-process
> except if it were a fixed limit.
> 
> And if you implement a fixed limit, it's hard to see that it'll be
> acceptable to everybody, and I suspect we'd still run into some possible
> weirdness.
> 
> So again, I'm struggling to see how this concept can be justified in any
> form.

The piece I am missing here is that this idea is already being used by
x86 and arm64. They implicitly force all allocations to be below the
47-bit boundary if the hint address is below 47 bits. This flag is much
less invasive because it is opt-in and will not impact any existing
code. I am not familiar enough with all of the interactions spread
throughout mm to know how these architectures have managed to ensure
that this 48-bit limit is enforced across things like mremap() as well.

Are you against the idea that there should be a standard way for
applications to consistently obtain address that have free bits, or are
you just against this implementation? From your statement I assume you
mean that every architecture should continue to have varying behavior
and separate implementations for supporting larger address spaces.

- Charlie
Charlie Jenkins Aug. 30, 2024, 1:16 a.m. UTC | #9
On Thu, Aug 29, 2024 at 06:26:41PM +1000, Michael Ellerman wrote:
> Charlie Jenkins <charlie@rivosinc.com> writes:
> > Some applications rely on placing data in free bits addresses allocated
> > by mmap. Various architectures (eg. x86, arm64, powerpc) restrict the
> > address returned by mmap to be less than the 48-bit address space,
> > unless the hint address uses more than 47 bits (the 48th bit is reserved
> > for the kernel address space).
> >
> > To make this behavior explicit and more versatile across all
> > architectures, define a mmap flag that allows users to define an
> > arbitrary upper limit on addresses returned by mmap.
> >
> > Signed-off-by: Charlie Jenkins <charlie@rivosinc.com>
> > ---
> >  include/uapi/asm-generic/mman-common.h       | 1 +
> >  tools/include/uapi/asm-generic/mman-common.h | 1 +
>   
> You're not meant to update the headers in tools/ directly. There's a
> mail somewhere from acme somewhere describing the proper process, but
> the tldr is leave it up to him.

Oh okay, thank you.

> 
> > diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
> > index 6ce1f1ceb432..03ac13d9aa37 100644
> > --- a/include/uapi/asm-generic/mman-common.h
> > +++ b/include/uapi/asm-generic/mman-common.h
> > @@ -32,6 +32,7 @@
> >  
> >  #define MAP_UNINITIALIZED 0x4000000	/* For anonymous mmap, memory could be
> >  					 * uninitialized */
> > +#define MAP_BELOW_HINT	  0x8000000	/* give out address that is below (inclusive) hint address */
> 
> IMHO the API would be clearer if this actually forced the address to be
> below the hint. That's what the flag name implies after all.
> 
> It would also mean the application doesn't need to take into account the
> length of the mapping when passing the hint.
> 
> cheers

That's a good point. The reason I did it this way was to allow mmap the
possibility of returning the same address as the hint. If it must be
strictly less than the hint then the hint address can never be returned.
Maybe that doesn't matter though.

- Charlie
Lorenzo Stoakes Aug. 30, 2024, 9:52 a.m. UTC | #10
On Thu, Aug 29, 2024 at 03:16:53PM GMT, Charlie Jenkins wrote:
> On Thu, Aug 29, 2024 at 10:54:25AM +0100, Lorenzo Stoakes wrote:
> > On Thu, Aug 29, 2024 at 09:42:22AM GMT, Lorenzo Stoakes wrote:
> > > On Thu, Aug 29, 2024 at 12:15:57AM GMT, Charlie Jenkins wrote:
> > > > Some applications rely on placing data in free bits addresses allocated
> > > > by mmap. Various architectures (eg. x86, arm64, powerpc) restrict the
> > > > address returned by mmap to be less than the 48-bit address space,
> > > > unless the hint address uses more than 47 bits (the 48th bit is reserved
> > > > for the kernel address space).
> > >
> > > I'm still confused as to why, if an mmap flag is desired, and thus programs
> > > are having to be heavily modified and controlled to be able to do this, why
> > > you can't just do an mmap() with PROT_NONE early, around a hinted address
> > > that, sits below the required limit, and then mprotect() or mmap() over it?
> > >
> > > Your feature is a major adjustment to mmap(), it needs to be pretty
> > > significantly justified, especially if taking up a new flag.
> > >
> > > >
> > > > The riscv architecture needs a way to similarly restrict the virtual
> > > > address space. On the riscv port of OpenJDK an error is thrown if
> > > > attempted to run on the 57-bit address space, called sv57 [1].  golang
> > > > has a comment that sv57 support is not complete, but there are some
> > > > workarounds to get it to mostly work [2].
> > > >
> > > > These applications work on x86 because x86 does an implicit 47-bit
> > > > restriction of mmap() address that contain a hint address that is less
> > > > than 48 bits.
> > >
> > > You mean x86 _has_ to limit to physically available bits in a canonical
> > > format :) this will not be the case for 5-page table levels though...
>
> I might be misunderstanding but I am not talking about pointer masking
> or canonical addresses here. I am referring to the pattern of:
>
> 1. Getting an address from mmap()
> 2. Writing data into bits assumed to be unused in the address
> 3. Using the data stored in the address
> 4. Clearing the data from the address and sign extending
> 5. Dereferencing the now sign-extended address to conform to canonical
>    addresses
>
> I am just talking about step 1 and 2 here -- getting an address from
> mmap() that only uses bits that will allow your application to not
> break. How canonicalization happens is a a separate conversation, that
> can be handled by LAM for x86, TBI for arm64, or Ssnpm for riscv.
> While LAM for x86 is only capable of masking addresses to 48 or 57 bits,
> Ssnpm for riscv allow an arbitrary number of bits to be masked out.
> A design goal here is to be able to support all of the pointer masking
> flavors, and not just x86.

Right I get that, I was just saying that the implicit limitation in x86 is
due to virtual addresses _having_ to be less than 48 bits. So that's why
that is right? I mean perhaps I'm mistaken?

Or is it such that x86 can provide a space for tagging for CPU technology
that supports it (UAI perhaps?).

I agree with what Michal and others said about the decision to default to
the reduced address space size and opt-in for higher bits. Your series
doesn't do this...

>
> > >
> > > >
> > > > Instead of implicitly restricting the address space on riscv (or any
> > > > current/future architecture), a flag would allow users to opt-in to this
> > > > behavior rather than opt-out as is done on other architectures. This is
> > > > desirable because it is a small class of applications that do pointer
> > > > masking.
> > >
> > > I raised this last time and you didn't seem to address it so to be more
> > > blunt:
> > >
> > > I don't understand why this needs to be an mmap() flag. From this it seems
> > > the whole process needs allocations to be below a certain limit.
>
> Yeah making it per-process does seem logical, as it would help with
> pointer masking.

To me it's the only feasible way forward, you can't control all libraries,
a map flag continues to seem a strange way to implement this, and I
understand that your justification is that it is the _least invasive_ way
of doing this, but as I've said below, it's actually pretty invasive if you
think about it, the current implementation seems to me to be insufficient
without having VMA flags etc.

>
> > >
> > > That _could_ be achieved through a 'personality' or similar (though a
> > > personality is on/off, rather than allowing configuration so maybe
> > > something else would be needed).
> > >
> > > From what you're saying 57-bit is all you really need right? So maybe
> > > ADDR_LIMIT_57BIT?
>
> Addresses will always be limited to 57 bits on riscv and x86 (but not
> necessarily on other architectures). A flag like that would have no
> impact, I do not understand what you are suggesting. This patch is to
> have a configurable number of bits be restricted.

I get that, but as I say below, I don't think a customisable limit is
workable.

So I was trying to find a compromise that _might_ be more workable.

>
> If anything, a personality that was ADDR_LIMIT_48BIT would be the
> closest to what I am trying to achieve. Since the issue is that
> applications fail to work when the address space is greater than 48
> bits.

OK so this is at least some possible road forward given there is quite a
bit of push-back to alternatives.

>
> > >
> > > I don't see how you're going to actually enforce this in a process either
> > > via an mmap flag, as a library might decide not to use it, so you'd need to
> > > control the allocator, the thread library implementation, and everything
> > > that might allocate.
>
> It is reasonable to change the implementation to be per-process but that
> is not the current proposal.

I mean maybe I wasn't direct enough - I oppose the current proposal as-is.

>
> This flag was designed for applications which already directly manage
> all of their addresses like OpenJDK and Go.
>
> This flag implementation was an attempt to make this feature as least
> invasive as possible to reduce maintainence burden and implementation
> complexity.

I realise, and as I said below, I don't think your implementation is
correct in this form.

Also if you can control everything + for whatever reason can _absolutely
know_ no program will use a FFI or a 3rd party library or whatever that
mremap()'s, I don't see why you can't use mmap() in a creative way to solve
this rather than adding maintenance burden.

A couple ideas:

1. mmap(high_address - domain_size - buffer, ..., PROT_NONE, MAP_FIXED,
   ...) a vast domain. You will almost certainly get the hint you
   want. Then mprotect() regions to PROT_READ | PROT_WRITE as you use (or
   even mmap() with MAP_FIXED_REPLACE over them), all will have high bits
   clear.

2. (suggested by Liam separately) mmap() with PROT_NONE addresses in the
   higher range, which prevents mmap() or any other means of allocating
   memory from allocating there. Acting as a 'huge guard page'.

Neither require any changes.

You kinda can't have it both ways - if you are absolutely controlling all
allocations with no risk of a 3rd party library doing an allocation outside
of this - then you can just use existing mechanics.

If you don't, then MAP_BELOW_HINT is insufficient.

>
> > >
> > > Liam also raised various points about VMA particulars that I'm not sure are
> > > addressed either.
> > >
> > > I just find it hard to believe that everything will fit together.
> > >
> > > I'd _really_ need to be convinced that this MAP_ flag is justified, and I"m
> > > just not.
> > >
> > > >
> > > > This flag will also allow seemless compatibility between all
> > > > architectures, so applications like Go and OpenJDK that use bits in a
> > > > virtual address can request the exact number of bits they need in a
> > > > generic way. The flag can be checked inside of vm_unmapped_area() so
> > > > that this flag does not have to be handled individually by each
> > > > architecture.
> > >
> > > I'm still very unconvinced and feel the bar needs to be high for making
> > > changes like this that carry maintainership burden.
> > >
>
> I may be naive but what is the burden here? It's two lines of code to
> check MAP_BELOW_HINT and restrict the address. There are the additional
> flags for hint and mmap_addr but those are also trivial to implement.

You're taking up a MAP_ flag (in short supply) which we have to maintain
forever across all arches and have to respect a limited map range.

And everything in this realm has edge cases. I don't see how you can
implement this correctly or usefully without a VMA flag, and see below for
my concerns on that.

This is UAPI (and really UABI) so this is _forever_. The bar is high. To me
this proposal does not hit that, and as you keep saying this isn't even
what you want.

You want something per-process so I think the correct proposal is
per-process.

A configurable per-process thing is horrible in itself, so I think the only
workable proposal is a fixed personality.

>
> > > So for me, it's a no really as an overall concept.
> > >
> > > Happy to be convinced otherwise, however... (I may be missing details or
> > > context that provide more justification).
> > >
> >
> > Some more thoughts:
> >
> > * If you absolutely must keep allocations below a certain limit, you'd
> >   probably need to actually associate this information with the VMA so the
> >   memory can't be mremap()'d somewhere invalid (you might not control all
> >   code so you can't guarantee this won't happen).
> > * Keeping a map limit associated with a VMA would be horrid and keeping
> >   VMAs as small as possible is a key aim, so that'd be a no go. VMA flags
> >   are in limited supply also.
>
> Yes that does seem like it would be challenging.

Right so to me this rules out the MAP_BELOW_HINT. And makes this
implementation invalid.

>
> > * If we did implement a per-process thing, but it were arbitrary, we'd then
> >   have to handle all kinds of corner cases forever (this is UAPI, can't
> >   break it etc.) with crazy-low values, or determine a minimum that might
> >   vary by arch...
>
> Throwing an error if the value is determined to be "too low" seems
> reasonable.

What's "too low"? This will vary by arch too right? Keep in mind this is
'forever'...

>
> > * If we did this we'd absolutely have to implement a check in the brk()
> >   implementation, which is a very very sensitive bit of code. And of
> >   course, in mmap() and mremap()... and any arch-specific code that might
> >   interface with this stuff (these functions are hooked).
> > * A fixed address limit would make more sense, but it seems difficult to
> >   know what would work for everybody, and again we'd have to deal with edge
> >   cases and having a permanent maintenance burden.
>
> A fixed value is not ideal, since a single size probably would not be
> suffiecient for every application. However if necessary we could fix it
> to 48-bits since arm64 and x86 already do that, and that would still
> allow a generic way of defining this behavior.

This is more acceptable. It avoids pretty much all of the rest of the
issues.

>
> > * If you did have a map flag what about merging between VMAs above the
> >   limit and below it? To avoid that you'd need to implement some kind of a
> >   'VMA flag that has an arbitrary characteristic' or a 'limit' field,
> >   adjust all the 'can VMA merge' functions and write extensive testing and
> >   none of that is frankly acceptable.
> > * We have some 'weird' arches that might have problem with certain virtual
> >   address ranges or require arbitrary mappings at a certain address range
> >   that a limit might not be able to account for.
> >
> > I'm absolutely opposed to a new MAP_ flag for this, but even if you
> > implemented that, it implies a lot of complexity.
> >
> > It implies even more complexity if you implement something per-process
> > except if it were a fixed limit.
> >
> > And if you implement a fixed limit, it's hard to see that it'll be
> > acceptable to everybody, and I suspect we'd still run into some possible
> > weirdness.

> >
> > So again, I'm struggling to see how this concept can be justified in any
> > form.
>
> The piece I am missing here is that this idea is already being used by
> x86 and arm64. They implicitly force all allocations to be below the
> 47-bit boundary if the hint address is below 47 bits. This flag is much
> less invasive because it is opt-in and will not impact any existing
> code. I am not familiar enough with all of the interactions spread
> throughout mm to know how these architectures have managed to ensure
> that this 48-bit limit is enforced across things like mremap() as well.
>

I just wrote a bunch above about this and did in the original email.

The 48-bit limit is much more workable and is across-the-board so it's easy
to implement. It's the variable thing or map flag thing that's the problem.

> Are you against the idea that there should be a standard way for
> applications to consistently obtain address that have free bits, or are
> you just against this implementation? From your statement I assume you
> mean that every architecture should continue to have varying behavior
> and separate implementations for supporting larger address spaces.
>

I'm against this implementation, or one with a variable limit.

An ADDR_LIMIT_48BIT personality I think is the most workable form of
this. Others may have opinions on this also, but it avoids pretty much all
of the aforementioned issues and is the least invasive.

> - Charlie
>

Sorry to push back so much on your series, your efforts are appreciated, I
am simply trying to head off issues in the future especially those that end
up exposed to userland! :)
Dave Hansen Aug. 30, 2024, 3:03 p.m. UTC | #11
On 8/29/24 01:42, Lorenzo Stoakes wrote:
>> These applications work on x86 because x86 does an implicit 47-bit
>> restriction of mmap() address that contain a hint address that is less
>> than 48 bits.
> You mean x86 _has_ to limit to physically available bits in a canonical
> format 🙂 this will not be the case for 5-page table levels though...

By "physically available bits" are you referring to the bits that can be
used as a part of the virtual address?  "Physically" may not have been
the best choice of words. ;)

There's a canonical hole in 4-level paging and 5-level paging on x86.
The 5-level canonical hole is just smaller.

Also, I should probably say that the >47-bit mmap() access hint was more
of a crutch than something that we wanted to make ABI forever.  We knew
that high addresses might break some apps and we hoped that the list of
things it would break would go down over time so that we could
eventually just let mmap() access the whole address space by default.

That optimism may have been misplaced.
Lorenzo Stoakes Aug. 30, 2024, 3:18 p.m. UTC | #12
On Fri, Aug 30, 2024 at 08:03:25AM GMT, Dave Hansen wrote:
> On 8/29/24 01:42, Lorenzo Stoakes wrote:
> >> These applications work on x86 because x86 does an implicit 47-bit
> >> restriction of mmap() address that contain a hint address that is less
> >> than 48 bits.
> > You mean x86 _has_ to limit to physically available bits in a canonical
> > format 🙂 this will not be the case for 5-page table levels though...
>
> By "physically available bits" are you referring to the bits that can be
> used as a part of the virtual address?  "Physically" may not have been
> the best choice of words. ;)
>
> There's a canonical hole in 4-level paging and 5-level paging on x86.
> The 5-level canonical hole is just smaller.

Yeah sorry this is what I meant!

>
> Also, I should probably say that the >47-bit mmap() access hint was more
> of a crutch than something that we wanted to make ABI forever.  We knew
> that high addresses might break some apps and we hoped that the list of
> things it would break would go down over time so that we could
> eventually just let mmap() access the whole address space by default.
>
> That optimism may have been misplaced.

Interesting, thanks. This speaks again I think to it being unwise to rely
on these things.

I do think the only workable form of this series is a fixed
personality-based mapping limit.
Charlie Jenkins Aug. 31, 2024, 1:45 a.m. UTC | #13
On Fri, Aug 30, 2024 at 10:52:01AM +0100, Lorenzo Stoakes wrote:
> On Thu, Aug 29, 2024 at 03:16:53PM GMT, Charlie Jenkins wrote:
> > On Thu, Aug 29, 2024 at 10:54:25AM +0100, Lorenzo Stoakes wrote:
> > > On Thu, Aug 29, 2024 at 09:42:22AM GMT, Lorenzo Stoakes wrote:
> > > > On Thu, Aug 29, 2024 at 12:15:57AM GMT, Charlie Jenkins wrote:
> > > > > Some applications rely on placing data in free bits addresses allocated
> > > > > by mmap. Various architectures (eg. x86, arm64, powerpc) restrict the
> > > > > address returned by mmap to be less than the 48-bit address space,
> > > > > unless the hint address uses more than 47 bits (the 48th bit is reserved
> > > > > for the kernel address space).
> > > >
> > > > I'm still confused as to why, if an mmap flag is desired, and thus programs
> > > > are having to be heavily modified and controlled to be able to do this, why
> > > > you can't just do an mmap() with PROT_NONE early, around a hinted address
> > > > that, sits below the required limit, and then mprotect() or mmap() over it?
> > > >
> > > > Your feature is a major adjustment to mmap(), it needs to be pretty
> > > > significantly justified, especially if taking up a new flag.
> > > >
> > > > >
> > > > > The riscv architecture needs a way to similarly restrict the virtual
> > > > > address space. On the riscv port of OpenJDK an error is thrown if
> > > > > attempted to run on the 57-bit address space, called sv57 [1].  golang
> > > > > has a comment that sv57 support is not complete, but there are some
> > > > > workarounds to get it to mostly work [2].
> > > > >
> > > > > These applications work on x86 because x86 does an implicit 47-bit
> > > > > restriction of mmap() address that contain a hint address that is less
> > > > > than 48 bits.
> > > >
> > > > You mean x86 _has_ to limit to physically available bits in a canonical
> > > > format :) this will not be the case for 5-page table levels though...
> >
> > I might be misunderstanding but I am not talking about pointer masking
> > or canonical addresses here. I am referring to the pattern of:
> >
> > 1. Getting an address from mmap()
> > 2. Writing data into bits assumed to be unused in the address
> > 3. Using the data stored in the address
> > 4. Clearing the data from the address and sign extending
> > 5. Dereferencing the now sign-extended address to conform to canonical
> >    addresses
> >
> > I am just talking about step 1 and 2 here -- getting an address from
> > mmap() that only uses bits that will allow your application to not
> > break. How canonicalization happens is a a separate conversation, that
> > can be handled by LAM for x86, TBI for arm64, or Ssnpm for riscv.
> > While LAM for x86 is only capable of masking addresses to 48 or 57 bits,
> > Ssnpm for riscv allow an arbitrary number of bits to be masked out.
> > A design goal here is to be able to support all of the pointer masking
> > flavors, and not just x86.
> 
> Right I get that, I was just saying that the implicit limitation in x86 is
> due to virtual addresses _having_ to be less than 48 bits. So that's why
> that is right? I mean perhaps I'm mistaken?
> 
> Or is it such that x86 can provide a space for tagging for CPU technology
> that supports it (UAI perhaps?).
> 
> I agree with what Michal and others said about the decision to default to
> the reduced address space size and opt-in for higher bits. Your series
> doesn't do this...
> 
> >
> > > >
> > > > >
> > > > > Instead of implicitly restricting the address space on riscv (or any
> > > > > current/future architecture), a flag would allow users to opt-in to this
> > > > > behavior rather than opt-out as is done on other architectures. This is
> > > > > desirable because it is a small class of applications that do pointer
> > > > > masking.
> > > >
> > > > I raised this last time and you didn't seem to address it so to be more
> > > > blunt:
> > > >
> > > > I don't understand why this needs to be an mmap() flag. From this it seems
> > > > the whole process needs allocations to be below a certain limit.
> >
> > Yeah making it per-process does seem logical, as it would help with
> > pointer masking.
> 
> To me it's the only feasible way forward, you can't control all libraries,
> a map flag continues to seem a strange way to implement this, and I
> understand that your justification is that it is the _least invasive_ way
> of doing this, but as I've said below, it's actually pretty invasive if you
> think about it, the current implementation seems to me to be insufficient
> without having VMA flags etc.
> 
> >
> > > >
> > > > That _could_ be achieved through a 'personality' or similar (though a
> > > > personality is on/off, rather than allowing configuration so maybe
> > > > something else would be needed).
> > > >
> > > > From what you're saying 57-bit is all you really need right? So maybe
> > > > ADDR_LIMIT_57BIT?
> >
> > Addresses will always be limited to 57 bits on riscv and x86 (but not
> > necessarily on other architectures). A flag like that would have no
> > impact, I do not understand what you are suggesting. This patch is to
> > have a configurable number of bits be restricted.
> 
> I get that, but as I say below, I don't think a customisable limit is
> workable.
> 
> So I was trying to find a compromise that _might_ be more workable.
> 
> >
> > If anything, a personality that was ADDR_LIMIT_48BIT would be the
> > closest to what I am trying to achieve. Since the issue is that
> > applications fail to work when the address space is greater than 48
> > bits.
> 
> OK so this is at least some possible road forward given there is quite a
> bit of push-back to alternatives.
> 
> >
> > > >
> > > > I don't see how you're going to actually enforce this in a process either
> > > > via an mmap flag, as a library might decide not to use it, so you'd need to
> > > > control the allocator, the thread library implementation, and everything
> > > > that might allocate.
> >
> > It is reasonable to change the implementation to be per-process but that
> > is not the current proposal.
> 
> I mean maybe I wasn't direct enough - I oppose the current proposal as-is.
> 
> >
> > This flag was designed for applications which already directly manage
> > all of their addresses like OpenJDK and Go.
> >
> > This flag implementation was an attempt to make this feature as least
> > invasive as possible to reduce maintainence burden and implementation
> > complexity.
> 
> I realise, and as I said below, I don't think your implementation is
> correct in this form.
> 
> Also if you can control everything + for whatever reason can _absolutely
> know_ no program will use a FFI or a 3rd party library or whatever that
> mremap()'s, I don't see why you can't use mmap() in a creative way to solve
> this rather than adding maintenance burden.
> 
> A couple ideas:
> 
> 1. mmap(high_address - domain_size - buffer, ..., PROT_NONE, MAP_FIXED,
>    ...) a vast domain. You will almost certainly get the hint you
>    want. Then mprotect() regions to PROT_READ | PROT_WRITE as you use (or
>    even mmap() with MAP_FIXED_REPLACE over them), all will have high bits
>    clear.
> 
> 2. (suggested by Liam separately) mmap() with PROT_NONE addresses in the
>    higher range, which prevents mmap() or any other means of allocating
>    memory from allocating there. Acting as a 'huge guard page'.
> 
> Neither require any changes.
> 
> You kinda can't have it both ways - if you are absolutely controlling all
> allocations with no risk of a 3rd party library doing an allocation outside
> of this - then you can just use existing mechanics.
> 
> If you don't, then MAP_BELOW_HINT is insufficient.
> 
> >
> > > >
> > > > Liam also raised various points about VMA particulars that I'm not sure are
> > > > addressed either.
> > > >
> > > > I just find it hard to believe that everything will fit together.
> > > >
> > > > I'd _really_ need to be convinced that this MAP_ flag is justified, and I"m
> > > > just not.
> > > >
> > > > >
> > > > > This flag will also allow seemless compatibility between all
> > > > > architectures, so applications like Go and OpenJDK that use bits in a
> > > > > virtual address can request the exact number of bits they need in a
> > > > > generic way. The flag can be checked inside of vm_unmapped_area() so
> > > > > that this flag does not have to be handled individually by each
> > > > > architecture.
> > > >
> > > > I'm still very unconvinced and feel the bar needs to be high for making
> > > > changes like this that carry maintainership burden.
> > > >
> >
> > I may be naive but what is the burden here? It's two lines of code to
> > check MAP_BELOW_HINT and restrict the address. There are the additional
> > flags for hint and mmap_addr but those are also trivial to implement.
> 
> You're taking up a MAP_ flag (in short supply) which we have to maintain
> forever across all arches and have to respect a limited map range.
> 
> And everything in this realm has edge cases. I don't see how you can
> implement this correctly or usefully without a VMA flag, and see below for
> my concerns on that.
> 
> This is UAPI (and really UABI) so this is _forever_. The bar is high. To me
> this proposal does not hit that, and as you keep saying this isn't even
> what you want.
> 
> You want something per-process so I think the correct proposal is
> per-process.
> 
> A configurable per-process thing is horrible in itself, so I think the only
> workable proposal is a fixed personality.
> 
> >
> > > > So for me, it's a no really as an overall concept.
> > > >
> > > > Happy to be convinced otherwise, however... (I may be missing details or
> > > > context that provide more justification).
> > > >
> > >
> > > Some more thoughts:
> > >
> > > * If you absolutely must keep allocations below a certain limit, you'd
> > >   probably need to actually associate this information with the VMA so the
> > >   memory can't be mremap()'d somewhere invalid (you might not control all
> > >   code so you can't guarantee this won't happen).
> > > * Keeping a map limit associated with a VMA would be horrid and keeping
> > >   VMAs as small as possible is a key aim, so that'd be a no go. VMA flags
> > >   are in limited supply also.
> >
> > Yes that does seem like it would be challenging.
> 
> Right so to me this rules out the MAP_BELOW_HINT. And makes this
> implementation invalid.
> 
> >
> > > * If we did implement a per-process thing, but it were arbitrary, we'd then
> > >   have to handle all kinds of corner cases forever (this is UAPI, can't
> > >   break it etc.) with crazy-low values, or determine a minimum that might
> > >   vary by arch...
> >
> > Throwing an error if the value is determined to be "too low" seems
> > reasonable.
> 
> What's "too low"? This will vary by arch too right? Keep in mind this is
> 'forever'...
> 
> >
> > > * If we did this we'd absolutely have to implement a check in the brk()
> > >   implementation, which is a very very sensitive bit of code. And of
> > >   course, in mmap() and mremap()... and any arch-specific code that might
> > >   interface with this stuff (these functions are hooked).
> > > * A fixed address limit would make more sense, but it seems difficult to
> > >   know what would work for everybody, and again we'd have to deal with edge
> > >   cases and having a permanent maintenance burden.
> >
> > A fixed value is not ideal, since a single size probably would not be
> > suffiecient for every application. However if necessary we could fix it
> > to 48-bits since arm64 and x86 already do that, and that would still
> > allow a generic way of defining this behavior.
> 
> This is more acceptable. It avoids pretty much all of the rest of the
> issues.
> 
> >
> > > * If you did have a map flag what about merging between VMAs above the
> > >   limit and below it? To avoid that you'd need to implement some kind of a
> > >   'VMA flag that has an arbitrary characteristic' or a 'limit' field,
> > >   adjust all the 'can VMA merge' functions and write extensive testing and
> > >   none of that is frankly acceptable.
> > > * We have some 'weird' arches that might have problem with certain virtual
> > >   address ranges or require arbitrary mappings at a certain address range
> > >   that a limit might not be able to account for.
> > >
> > > I'm absolutely opposed to a new MAP_ flag for this, but even if you
> > > implemented that, it implies a lot of complexity.
> > >
> > > It implies even more complexity if you implement something per-process
> > > except if it were a fixed limit.
> > >
> > > And if you implement a fixed limit, it's hard to see that it'll be
> > > acceptable to everybody, and I suspect we'd still run into some possible
> > > weirdness.
> 
> > >
> > > So again, I'm struggling to see how this concept can be justified in any
> > > form.
> >
> > The piece I am missing here is that this idea is already being used by
> > x86 and arm64. They implicitly force all allocations to be below the
> > 47-bit boundary if the hint address is below 47 bits. This flag is much
> > less invasive because it is opt-in and will not impact any existing
> > code. I am not familiar enough with all of the interactions spread
> > throughout mm to know how these architectures have managed to ensure
> > that this 48-bit limit is enforced across things like mremap() as well.
> >
> 
> I just wrote a bunch above about this and did in the original email.
> 
> The 48-bit limit is much more workable and is across-the-board so it's easy
> to implement. It's the variable thing or map flag thing that's the problem.
> 
> > Are you against the idea that there should be a standard way for
> > applications to consistently obtain address that have free bits, or are
> > you just against this implementation? From your statement I assume you
> > mean that every architecture should continue to have varying behavior
> > and separate implementations for supporting larger address spaces.
> >
> 
> I'm against this implementation, or one with a variable limit.
> 
> An ADDR_LIMIT_48BIT personality I think is the most workable form of
> this. Others may have opinions on this also, but it avoids pretty much all
> of the aforementioned issues and is the least invasive.
> 
> > - Charlie
> >
> 
> Sorry to push back so much on your series, your efforts are appreciated, I
> am simply trying to head off issues in the future especially those that end
> up exposed to userland! :)

No I really appreciate it, thank you for your input! Having a variable
limit is not necessary. The motivation is to create a UABI that would be
flexible into the future where an application may want more (or less)
bits reserved than what ADDR_LIMIT_48BIT would provide. However,
current applications are happy to work with 48-bit address spaces so
having the 48-bit personality is sufficient.

There will still need to be some management in the userland software to
ensure that the personality has been set before any allocations, but
that is unavoidable with any solution.

- Charlie
Michal Hocko Sept. 3, 2024, 8:44 a.m. UTC | #14
On Thu 29-08-24 10:33:22, Charlie Jenkins wrote:
> On Thu, Aug 29, 2024 at 10:30:56AM +0200, Michal Hocko wrote:
> > On Thu 29-08-24 00:15:57, Charlie Jenkins wrote:
> > > Some applications rely on placing data in free bits addresses allocated
> > > by mmap. Various architectures (eg. x86, arm64, powerpc) restrict the
> > > address returned by mmap to be less than the 48-bit address space,
> > > unless the hint address uses more than 47 bits (the 48th bit is reserved
> > > for the kernel address space).
> > > 
> > > The riscv architecture needs a way to similarly restrict the virtual
> > > address space. On the riscv port of OpenJDK an error is thrown if
> > > attempted to run on the 57-bit address space, called sv57 [1].  golang
> > > has a comment that sv57 support is not complete, but there are some
> > > workarounds to get it to mostly work [2].
> > > 
> > > These applications work on x86 because x86 does an implicit 47-bit
> > > restriction of mmap() address that contain a hint address that is less
> > > than 48 bits.
> > > 
> > > Instead of implicitly restricting the address space on riscv (or any
> > > current/future architecture), a flag would allow users to opt-in to this
> > > behavior rather than opt-out as is done on other architectures. This is
> > > desirable because it is a small class of applications that do pointer
> > > masking.
> > 
> > IIRC this has been discussed at length when 5-level page tables support
> > has been proposed for x86. Sorry I do not have a link handy but lore
> > should help you. Linus was not really convinced and in the end vetoed it
> > and prefer that those few applications that benefit from greater address
> > space would do that explicitly than other way around.
> 
> I believe I found the conversation you were referring to. Ingo Molnar
> recommended a flag similar to what I have proposed [1]. Catalin
> recommended to make 52-bit opt-in on arm64 [2]. Dave Hansen brought up
> MPX [3].
> 
> However these conversations are tangential to what I am proposing. arm64
> and x86 decided to have the default address space be 48 bits. However
> this was done on a per-architecture basis with no way for applications
> to have guarantees between architectures. Even this behavior to restrict
> to 48 bits does not even appear in the man pages, so would require
> reading the kernel source code to understand that this feature is
> available. Then to opt-in to larger address spaces, applications have to
> know to provide a hint address that is greater than 47 bits, mmap() will
> then return an address that contains up to 56 bits on x86 and 52 bits on
> arm64. This difference of 4 bits causes inconsistency and is part of the
> problem I am trying to solve with this flag.

Yes, I guess I do understand where you are heading. Our existing model
assumes that anybody requiring more address space know what they are
doing and deal with the reality. This is the way Linus has pushed this
and I am not really convinced it is the right way TBH. On the other hand
it is true that this allows a safe(r) transition to larger address
spaces.

> I am not proposing to change x86 and arm64 away from using their opt-out
> feature, I am instead proposing a standard ABI for applications that
> need some guarantees of the bits used in pointers.

Right, but this is not really different from earlier attempts to achieve
this IIRC. Extentind mmap for that purpose seems quite tricky as already
pointed out in other sub-threads. Quite honestly I am not really sure
what is the right and backwards compatible way. I just wanted to make
you aware this has been discussed at lenght in the past.
Kirill A. Shutemov Sept. 5, 2024, 6:47 a.m. UTC | #15
On Thu, Aug 29, 2024 at 12:15:57AM -0700, Charlie Jenkins wrote:
> Some applications rely on placing data in free bits addresses allocated
> by mmap. Various architectures (eg. x86, arm64, powerpc) restrict the
> address returned by mmap to be less than the 48-bit address space,
> unless the hint address uses more than 47 bits (the 48th bit is reserved
> for the kernel address space).
> 
> The riscv architecture needs a way to similarly restrict the virtual
> address space. On the riscv port of OpenJDK an error is thrown if
> attempted to run on the 57-bit address space, called sv57 [1].  golang
> has a comment that sv57 support is not complete, but there are some
> workarounds to get it to mostly work [2].
> 
> These applications work on x86 because x86 does an implicit 47-bit
> restriction of mmap() address that contain a hint address that is less
> than 48 bits.
> 
> Instead of implicitly restricting the address space on riscv (or any
> current/future architecture), a flag would allow users to opt-in to this
> behavior rather than opt-out as is done on other architectures. This is
> desirable because it is a small class of applications that do pointer
> masking.

This argument looks broken to me.

The "small class of applications" is going to be broken unless they got
patched to use your new mmap() flag. You are asking for bugs.

Consider the case when you write, compile and validate a piece of software
on machine that has <=47bit VA. The binary got shipped to customers.
Later, customer gets a new shiny machine that supports larger address
space and your previously working software is broken. Such binaries might
exist today.

It is bad idea to use >47bit VA by default. Most of software got tested on
x86 with 47bit VA.

We can consider more options to opt-in into wider address space like
personality or prctl() handle. But opt-out is no-go from what I see.
Charlie Jenkins Sept. 5, 2024, 5:26 p.m. UTC | #16
On Thu, Sep 05, 2024 at 09:47:47AM +0300, Kirill A. Shutemov wrote:
> On Thu, Aug 29, 2024 at 12:15:57AM -0700, Charlie Jenkins wrote:
> > Some applications rely on placing data in free bits addresses allocated
> > by mmap. Various architectures (eg. x86, arm64, powerpc) restrict the
> > address returned by mmap to be less than the 48-bit address space,
> > unless the hint address uses more than 47 bits (the 48th bit is reserved
> > for the kernel address space).
> > 
> > The riscv architecture needs a way to similarly restrict the virtual
> > address space. On the riscv port of OpenJDK an error is thrown if
> > attempted to run on the 57-bit address space, called sv57 [1].  golang
> > has a comment that sv57 support is not complete, but there are some
> > workarounds to get it to mostly work [2].
> > 
> > These applications work on x86 because x86 does an implicit 47-bit
> > restriction of mmap() address that contain a hint address that is less
> > than 48 bits.
> > 
> > Instead of implicitly restricting the address space on riscv (or any
> > current/future architecture), a flag would allow users to opt-in to this
> > behavior rather than opt-out as is done on other architectures. This is
> > desirable because it is a small class of applications that do pointer
> > masking.
> 
> This argument looks broken to me.
> 
> The "small class of applications" is going to be broken unless they got
> patched to use your new mmap() flag. You are asking for bugs.
> 
> Consider the case when you write, compile and validate a piece of software
> on machine that has <=47bit VA. The binary got shipped to customers.
> Later, customer gets a new shiny machine that supports larger address
> space and your previously working software is broken. Such binaries might
> exist today.
> 
> It is bad idea to use >47bit VA by default. Most of software got tested on
> x86 with 47bit VA.
> 
> We can consider more options to opt-in into wider address space like
> personality or prctl() handle. But opt-out is no-go from what I see.
> 
> -- 
>   Kiryl Shutsemau / Kirill A. Shutemov

riscv is in an interesting state in regards to this because the software
ecosystem is much less mature than other architectures. The existing
riscv hardware supports either 38 or 47 bit userspace VAs, but a lot of
people test on QEMU which defaults to 56 bit. As a result, a lot of
code is tested with the larger address space. Applications that don't
work on the larger address space, like OpenJDK, currently throw an error
and exit.

Since riscv does not currently have the address space default to 47
bits, some applications just don't work on 56 bits. We could change the
kernel so that these applications start working without the need for
them to change their code, but that seems like the kernel is
overstepping and fixing binaries rather than providing users tools to
fix the binaries themselves.

This mmap flag was an attempt to provide a tool for these applications
that work on the existing 47 bit VA hardware to also work on different
hardware that supports a 56 bit VA space. After feedback, it looks like
a better solution than the mmap flag is to use the personality syscall
to set a process wide restriction to 47 bits instead, which matches the
32 bit flag that already exists.

- Charlie
Kirill A. Shutemov Sept. 9, 2024, 9:46 a.m. UTC | #17
On Thu, Sep 05, 2024 at 10:26:52AM -0700, Charlie Jenkins wrote:
> On Thu, Sep 05, 2024 at 09:47:47AM +0300, Kirill A. Shutemov wrote:
> > On Thu, Aug 29, 2024 at 12:15:57AM -0700, Charlie Jenkins wrote:
> > > Some applications rely on placing data in free bits addresses allocated
> > > by mmap. Various architectures (eg. x86, arm64, powerpc) restrict the
> > > address returned by mmap to be less than the 48-bit address space,
> > > unless the hint address uses more than 47 bits (the 48th bit is reserved
> > > for the kernel address space).
> > > 
> > > The riscv architecture needs a way to similarly restrict the virtual
> > > address space. On the riscv port of OpenJDK an error is thrown if
> > > attempted to run on the 57-bit address space, called sv57 [1].  golang
> > > has a comment that sv57 support is not complete, but there are some
> > > workarounds to get it to mostly work [2].

I also saw libmozjs crashing with 57-bit address space on x86.

> > > These applications work on x86 because x86 does an implicit 47-bit
> > > restriction of mmap() address that contain a hint address that is less
> > > than 48 bits.
> > > 
> > > Instead of implicitly restricting the address space on riscv (or any
> > > current/future architecture), a flag would allow users to opt-in to this
> > > behavior rather than opt-out as is done on other architectures. This is
> > > desirable because it is a small class of applications that do pointer
> > > masking.

You reiterate the argument about "small class of applications". But it
makes no sense to me.

With full address space by default, this small class of applications is
going to *broken* unless they would handle RISC-V case specifically.

On other hand, if you limit VA to 128TiB by default (like many
architectures do[1]) everything would work without intervention.
And if an app needs wider address space it would get it with hint opt-in,
because it is required on x86-64 anyway. Again, no RISC-V-specific code.

I see no upside with your approach. Just worse user experience.

[1] See va_high_addr_switch test case in https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/testing/selftests/mm/Makefile#n115