mbox series

[0/8,RFC] Introduce floating point fetch_add builtins

Message ID 20240919131204.3865854-1-mmalcomson@nvidia.com
Headers show
Series Introduce floating point fetch_add builtins | expand

Message

mmalcomson@nvidia.com Sept. 19, 2024, 1:11 p.m. UTC
From: Matthew Malcomson <mmalcomson@nvidia.com>

Hello, this is an RFC for adding an atomic floating point fetch_add builtin
(and variants) to GCC.  The atomic fetch_add operation is defined to work
on the base floating point types in the C++20 standard chapter 31.7.3, and
extended to work for all cv-unqualified floating point types in C++23
chapter 33.5.7.4.

Honestly not sure who to Cc, please do point me to someone else if that's
better.

This is nowhere near complete (for one thing even the tests I've added
don't fully pass), but I think I have a complete enough idea that it's
worth checking if this is something that could be agreed on.

As it stands no target except the nvptx backend would natively support
these operations.

Main questions that I'm looking to resolve with this RFC:
1) Would GCC be OK accepting this implementation even though no backend
   would be implementing these yet?
   - AIUI only the nvptx backend could theoretically implement this.
   - Even without a backend implementing it natively, the ability to use
     this in code (especially libstdc++) enables other compilers to
     generate better code for GPU's using standard C++.
2) Would libstdc++ be OK relying on `__has_builtin(__atomic_fetch_add_fp)`
   (i.e. a check on the resolved builtin rather than the more user-facing
   one) in order to determine whether floating point atomic fetch_add is
   available.
   - N.b. this builtin is actually the builtin working on the "double"
     type, one would have to rely on any compilers implementing that
     particular resolved builtin to also implement the other floating point
     atomic fetch_add builtins that they would want to support in libstdc++
     `atomic<[floating_point_type]>::fetch_add`.

More specific questions about the choice of which builtins to implement and
whether the types are OK:
1) Is it OK to not implement the `__sync_*` versions?
   Since these are deprecated and the `__atomic_*` versions are there to
   match the C/C++ code atomic operations (which is a large part of the
   reason for the new floating point operations).
2) Is it OK to not implement all the `fetch_*` operations?
   None of the bitwise operations are specified for C++ and bitwise
   operations are (AIUI) rarely used on floating point values.
3) Wanting to be able to farm out to libatomic meant that we need constant names
   for the specialised functions.
   - This led to the naming convention based on floating point type.
   - That naming convention *could* be updated to include the special backend
     floating point types if needed.  I have not done this mostly because I
     thought it would not add much, though I have not looked into this very
     closely.
4) Wanting to name the functions based on floating point type rather than size
   meant that the mapping from type passed to the overloaded version to
   specialised builtin was less direct than for the integral versions.
   - Ended up with a hard-coded table in the source to check this.
   - Would very much like some better approach, not certain what better approach
     I could find.
   - Will eventually at least share the hard-coded tables (which isn't
     happening yet because this is at RFC level).
5) Are there any other types that I should use?
   Similarly are there any types that I'm trying to use that I shouldn't?
   I *believe* I've implemented all the types that make sense and are
   general builtin types.  Could easily have missed some (e.g. left
   `_Float128x` alone because AIUI it's larger than 128bits which means we
   don't have any other atomic operations on such data), could also easily
   be misunderstanding the mention in the C++ standards of "extended" types
   (which I believe is the `_Float*` and `bfloat16` types).
6) Anything special about floating point maths that I'm tripping up on?
   (Especially around CAS loop where we expand the operation directly,
   sometimes convert a PLUS into a MINUS of a negative value ...).
   Don't know of anything myself, but also a bit wary of floating point
   maths.

N.b. I know that there's a reasonable amount of work left in:
1) Ensuring that all the places which use atomic types are updated
   (e.g. asan), 
2) Ensuring that all frontends can use these to the level that they could
   use the integral atomics.
3) Ensuring the major backends can still compile libatomic (I had to do
   something in AArch64 libatomic backend to make it compile, seems like
   there might be more to do for others).


Matthew Malcomson (8):
  Define new floating point builtin fetch_add functions
  Add FP types for atomic builtin overload resolution
  Tie the new atomic builtins to the backend
  Have libatomic working as first draft
  Use new builtins in libstdc++
  First attempt at testsuite
  Mention floating point atomic fetch_add etc in docs
  Add demo implementation of one of the operations

 gcc/builtin-types.def                         |   20 +
 gcc/builtins.cc                               |  153 ++-
 gcc/c-family/c-common.cc                      |   88 +-
 gcc/config/aarch64/aarch64.h                  |    2 +
 gcc/config/aarch64/aarch64.opt                |    5 +
 gcc/config/aarch64/atomics.md                 |   15 +
 gcc/doc/extend.texi                           |   12 +
 gcc/fortran/types.def                         |   48 +
 gcc/optabs.cc                                 |  101 +-
 gcc/optabs.def                                |    6 +-
 gcc/optabs.h                                  |    2 +-
 gcc/sync-builtins.def                         |   40 +
 gcc/testsuite/gcc.dg/atomic-op-fp.c           |  204 +++
 gcc/testsuite/gcc.dg/atomic-op-fpf.c          |  204 +++
 gcc/testsuite/gcc.dg/atomic-op-fpf128.c       |  204 +++
 gcc/testsuite/gcc.dg/atomic-op-fpf16.c        |  204 +++
 gcc/testsuite/gcc.dg/atomic-op-fpf16b.c       |  204 +++
 gcc/testsuite/gcc.dg/atomic-op-fpf32.c        |  204 +++
 gcc/testsuite/gcc.dg/atomic-op-fpf32x.c       |  204 +++
 gcc/testsuite/gcc.dg/atomic-op-fpf64.c        |  204 +++
 gcc/testsuite/gcc.dg/atomic-op-fpf64x.c       |  204 +++
 gcc/testsuite/gcc.dg/atomic-op-fpl.c          |  204 +++
 gcc/testsuite/lib/target-supports.exp         |  463 ++++++-
 libatomic/Makefile.am                         |    6 +-
 libatomic/Makefile.in                         |   12 +-
 libatomic/acinclude.m4                        |   49 +
 libatomic/auto-config.h.in                    |   84 +-
 libatomic/config/linux/aarch64/host-config.h  |    2 +
 libatomic/configure                           | 1153 ++++++++++++++++-
 libatomic/configure.ac                        |    4 +
 libatomic/fadd_n.c                            |   23 +
 libatomic/fop_n.c                             |    5 +-
 libatomic/fsub_n.c                            |   23 +
 libatomic/libatomic.map                       |   44 +
 libatomic/libatomic_i.h                       |   58 +
 libatomic/testsuite/Makefile.in               |    1 +
 .../testsuite/libatomic.c/atomic-op-fp.c      |  203 +++
 .../testsuite/libatomic.c/atomic-op-fpf.c     |  203 +++
 .../testsuite/libatomic.c/atomic-op-fpf128.c  |  203 +++
 .../testsuite/libatomic.c/atomic-op-fpf16.c   |  203 +++
 .../testsuite/libatomic.c/atomic-op-fpf16b.c  |  203 +++
 .../testsuite/libatomic.c/atomic-op-fpf32.c   |  203 +++
 .../testsuite/libatomic.c/atomic-op-fpf32x.c  |  203 +++
 .../testsuite/libatomic.c/atomic-op-fpf64.c   |  203 +++
 .../testsuite/libatomic.c/atomic-op-fpf64x.c  |  203 +++
 .../testsuite/libatomic.c/atomic-op-fpl.c     |  203 +++
 libstdc++-v3/include/bits/atomic_base.h       |   16 +
 47 files changed, 6393 insertions(+), 112 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/atomic-op-fp.c
 create mode 100644 gcc/testsuite/gcc.dg/atomic-op-fpf.c
 create mode 100644 gcc/testsuite/gcc.dg/atomic-op-fpf128.c
 create mode 100644 gcc/testsuite/gcc.dg/atomic-op-fpf16.c
 create mode 100644 gcc/testsuite/gcc.dg/atomic-op-fpf16b.c
 create mode 100644 gcc/testsuite/gcc.dg/atomic-op-fpf32.c
 create mode 100644 gcc/testsuite/gcc.dg/atomic-op-fpf32x.c
 create mode 100644 gcc/testsuite/gcc.dg/atomic-op-fpf64.c
 create mode 100644 gcc/testsuite/gcc.dg/atomic-op-fpf64x.c
 create mode 100644 gcc/testsuite/gcc.dg/atomic-op-fpl.c
 create mode 100644 libatomic/testsuite/libatomic.c/atomic-op-fp.c
 create mode 100644 libatomic/testsuite/libatomic.c/atomic-op-fpf.c
 create mode 100644 libatomic/testsuite/libatomic.c/atomic-op-fpf128.c
 create mode 100644 libatomic/testsuite/libatomic.c/atomic-op-fpf16.c
 create mode 100644 libatomic/testsuite/libatomic.c/atomic-op-fpf16b.c
 create mode 100644 libatomic/testsuite/libatomic.c/atomic-op-fpf32.c
 create mode 100644 libatomic/testsuite/libatomic.c/atomic-op-fpf32x.c
 create mode 100644 libatomic/testsuite/libatomic.c/atomic-op-fpf64.c
 create mode 100644 libatomic/testsuite/libatomic.c/atomic-op-fpf64x.c
 create mode 100644 libatomic/testsuite/libatomic.c/atomic-op-fpl.c

Comments

Jonathan Wakely Sept. 19, 2024, 2:47 p.m. UTC | #1
On Thu, 19 Sept 2024 at 14:12, <mmalcomson@nvidia.com> wrote:
>
> From: Matthew Malcomson <mmalcomson@nvidia.com>
>
> Hello, this is an RFC for adding an atomic floating point fetch_add builtin
> (and variants) to GCC.  The atomic fetch_add operation is defined to work
> on the base floating point types in the C++20 standard chapter 31.7.3, and
> extended to work for all cv-unqualified floating point types in C++23
> chapter 33.5.7.4.
>
> Honestly not sure who to Cc, please do point me to someone else if that's
> better.
>
> This is nowhere near complete (for one thing even the tests I've added
> don't fully pass), but I think I have a complete enough idea that it's
> worth checking if this is something that could be agreed on.
>
> As it stands no target except the nvptx backend would natively support
> these operations.
>
> Main questions that I'm looking to resolve with this RFC:
> 1) Would GCC be OK accepting this implementation even though no backend
>    would be implementing these yet?
>    - AIUI only the nvptx backend could theoretically implement this.
>    - Even without a backend implementing it natively, the ability to use
>      this in code (especially libstdc++) enables other compilers to
>      generate better code for GPU's using standard C++.
> 2) Would libstdc++ be OK relying on `__has_builtin(__atomic_fetch_add_fp)`
>    (i.e. a check on the resolved builtin rather than the more user-facing
>    one) in order to determine whether floating point atomic fetch_add is
>    available.

Yes, if that name is what other compilers will also use (have you
discussed this with Clang?)

It looks like PATCH 5/8 only uses the _fp name for fetch_add though,
and just uses fetch_sub etc. for the other functions, is that a
mistake?

>    - N.b. this builtin is actually the builtin working on the "double"

OK, so the library code just calls the generic __atomic_fetch_add that
accepts any types, but then that gets expanded to a more specific form
for float, double etc.?
And the more specific form has to exist at some level, because we need
an extern symbol from libatomic, so either we include the type as an
explicit suffix on the name, or we use some kind of name mangling like
_Z18__atomic_fetch_addPdS_S_, which is obviously nasty.

>      type, one would have to rely on any compilers implementing that
>      particular resolved builtin to also implement the other floating point
>      atomic fetch_add builtins that they would want to support in libstdc++
>      `atomic<[floating_point_type]>::fetch_add`.

This seems a bit concerning. I can imagine somebody implementing these
for float and double first, but leaving long double, _Float64,
_Float32, _Float128 etc. for later. In that case, libstdc++ would not
work if somebody tries to use std::atomic<long double>, or whichever
types aren't supported yet. It's OK if we can be *sure* that won't
happen i.e. that Clang will either implement the new built-in for
*all* FP types, or none.

>
> More specific questions about the choice of which builtins to implement and
> whether the types are OK:
> 1) Is it OK to not implement the `__sync_*` versions?
>    Since these are deprecated and the `__atomic_*` versions are there to
>    match the C/C++ code atomic operations (which is a large part of the
>    reason for the new floating point operations).
> 2) Is it OK to not implement all the `fetch_*` operations?
>    None of the bitwise operations are specified for C++ and bitwise
>    operations are (AIUI) rarely used on floating point values.

That seems OK (entirely correct even) to me.


> 3) Wanting to be able to farm out to libatomic meant that we need constant names
>    for the specialised functions.
>    - This led to the naming convention based on floating point type.
>    - That naming convention *could* be updated to include the special backend
>      floating point types if needed.  I have not done this mostly because I
>      thought it would not add much, though I have not looked into this very
>      closely.
> 4) Wanting to name the functions based on floating point type rather than size
>    meant that the mapping from type passed to the overloaded version to
>    specialised builtin was less direct than for the integral versions.
>    - Ended up with a hard-coded table in the source to check this.
>    - Would very much like some better approach, not certain what better approach
>      I could find.
>    - Will eventually at least share the hard-coded tables (which isn't
>      happening yet because this is at RFC level).
> 5) Are there any other types that I should use?
>    Similarly are there any types that I'm trying to use that I shouldn't?
>    I *believe* I've implemented all the types that make sense and are
>    general builtin types.  Could easily have missed some (e.g. left
>    `_Float128x` alone because AIUI it's larger than 128bits which means we
>    don't have any other atomic operations on such data), could also easily

That seems like a problem though - it means that GCC could be in
exactly the concerning position described above: there could be a FP
type (_Float128x) for which we have no __atomic_fetch_add built-in,
but the __has_builtin(__atomic_fetch_add) check would work, because
the built-in for double exists. So std::atomic<_Float128x> would not
work.

I don't think that's a problem today, because we don't have a
_Float128x type AFAIK. But that suggests to me that the rule for
whether to define the built-in for a FP type needs to be based on
whether the type exists at all, not whether we have other atomic ops
for it.

Maybe the libstdc++ code should have a separate __has_builtin check
for each type, e.g.

         bool __use_builtin = false;
+#if __has_builtin(__atomic_fetch_add_fp)
+       if constexpr (is_same_v<_Tp, double>)
+         __use_builtin = true;
+##endif
+#if __has_builtin(__atomic_fetch_add_fpf)
+       if constexpr (is_same_v<_Tp, float>)
+         __use_builtin = true;
+##endif
+#if __has_builtin(__atomic_fetch_add_fpl)
+       if constexpr (is_same_v<_Tp, long double>)
+         __use_builtin = true;
+##endif
// ...

        if (__use_builtin)
         return __atomic_fatch_add(...);
       // .. current CAS-based implementation

And so on with __has_builtin checks for the function for each type.
This would be a lot more tedious, but would handle the case where the
new built-in is only supported for some types.


>    be misunderstanding the mention in the C++ standards of "extended" types
>    (which I believe is the `_Float*` and `bfloat16` types).

Yes. The C++ standard says there's a typedef std::float32_t which
denotes some implementation-defined type, which is _Float32 in our
implementation. But the _Float32 name is not in the C++ standard.


> 6) Anything special about floating point maths that I'm tripping up on?
>    (Especially around CAS loop where we expand the operation directly,
>    sometimes convert a PLUS into a MINUS of a negative value ...).
>    Don't know of anything myself, but also a bit wary of floating point
>    maths.
>
> N.b. I know that there's a reasonable amount of work left in:
> 1) Ensuring that all the places which use atomic types are updated
>    (e.g. asan),
> 2) Ensuring that all frontends can use these to the level that they could
>    use the integral atomics.
> 3) Ensuring the major backends can still compile libatomic (I had to do
>    something in AArch64 libatomic backend to make it compile, seems like
>    there might be more to do for others).
>
>
> Matthew Malcomson (8):
>   Define new floating point builtin fetch_add functions
>   Add FP types for atomic builtin overload resolution
>   Tie the new atomic builtins to the backend
>   Have libatomic working as first draft
>   Use new builtins in libstdc++
>   First attempt at testsuite
>   Mention floating point atomic fetch_add etc in docs
>   Add demo implementation of one of the operations
>
>  gcc/builtin-types.def                         |   20 +
>  gcc/builtins.cc                               |  153 ++-
>  gcc/c-family/c-common.cc                      |   88 +-
>  gcc/config/aarch64/aarch64.h                  |    2 +
>  gcc/config/aarch64/aarch64.opt                |    5 +
>  gcc/config/aarch64/atomics.md                 |   15 +
>  gcc/doc/extend.texi                           |   12 +
>  gcc/fortran/types.def                         |   48 +
>  gcc/optabs.cc                                 |  101 +-
>  gcc/optabs.def                                |    6 +-
>  gcc/optabs.h                                  |    2 +-
>  gcc/sync-builtins.def                         |   40 +
>  gcc/testsuite/gcc.dg/atomic-op-fp.c           |  204 +++
>  gcc/testsuite/gcc.dg/atomic-op-fpf.c          |  204 +++
>  gcc/testsuite/gcc.dg/atomic-op-fpf128.c       |  204 +++
>  gcc/testsuite/gcc.dg/atomic-op-fpf16.c        |  204 +++
>  gcc/testsuite/gcc.dg/atomic-op-fpf16b.c       |  204 +++
>  gcc/testsuite/gcc.dg/atomic-op-fpf32.c        |  204 +++
>  gcc/testsuite/gcc.dg/atomic-op-fpf32x.c       |  204 +++
>  gcc/testsuite/gcc.dg/atomic-op-fpf64.c        |  204 +++
>  gcc/testsuite/gcc.dg/atomic-op-fpf64x.c       |  204 +++
>  gcc/testsuite/gcc.dg/atomic-op-fpl.c          |  204 +++
>  gcc/testsuite/lib/target-supports.exp         |  463 ++++++-
>  libatomic/Makefile.am                         |    6 +-
>  libatomic/Makefile.in                         |   12 +-
>  libatomic/acinclude.m4                        |   49 +
>  libatomic/auto-config.h.in                    |   84 +-
>  libatomic/config/linux/aarch64/host-config.h  |    2 +
>  libatomic/configure                           | 1153 ++++++++++++++++-
>  libatomic/configure.ac                        |    4 +
>  libatomic/fadd_n.c                            |   23 +
>  libatomic/fop_n.c                             |    5 +-
>  libatomic/fsub_n.c                            |   23 +
>  libatomic/libatomic.map                       |   44 +
>  libatomic/libatomic_i.h                       |   58 +
>  libatomic/testsuite/Makefile.in               |    1 +
>  .../testsuite/libatomic.c/atomic-op-fp.c      |  203 +++
>  .../testsuite/libatomic.c/atomic-op-fpf.c     |  203 +++
>  .../testsuite/libatomic.c/atomic-op-fpf128.c  |  203 +++
>  .../testsuite/libatomic.c/atomic-op-fpf16.c   |  203 +++
>  .../testsuite/libatomic.c/atomic-op-fpf16b.c  |  203 +++
>  .../testsuite/libatomic.c/atomic-op-fpf32.c   |  203 +++
>  .../testsuite/libatomic.c/atomic-op-fpf32x.c  |  203 +++
>  .../testsuite/libatomic.c/atomic-op-fpf64.c   |  203 +++
>  .../testsuite/libatomic.c/atomic-op-fpf64x.c  |  203 +++
>  .../testsuite/libatomic.c/atomic-op-fpl.c     |  203 +++
>  libstdc++-v3/include/bits/atomic_base.h       |   16 +
>  47 files changed, 6393 insertions(+), 112 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.dg/atomic-op-fp.c
>  create mode 100644 gcc/testsuite/gcc.dg/atomic-op-fpf.c
>  create mode 100644 gcc/testsuite/gcc.dg/atomic-op-fpf128.c
>  create mode 100644 gcc/testsuite/gcc.dg/atomic-op-fpf16.c
>  create mode 100644 gcc/testsuite/gcc.dg/atomic-op-fpf16b.c
>  create mode 100644 gcc/testsuite/gcc.dg/atomic-op-fpf32.c
>  create mode 100644 gcc/testsuite/gcc.dg/atomic-op-fpf32x.c
>  create mode 100644 gcc/testsuite/gcc.dg/atomic-op-fpf64.c
>  create mode 100644 gcc/testsuite/gcc.dg/atomic-op-fpf64x.c
>  create mode 100644 gcc/testsuite/gcc.dg/atomic-op-fpl.c
>  create mode 100644 libatomic/testsuite/libatomic.c/atomic-op-fp.c
>  create mode 100644 libatomic/testsuite/libatomic.c/atomic-op-fpf.c
>  create mode 100644 libatomic/testsuite/libatomic.c/atomic-op-fpf128.c
>  create mode 100644 libatomic/testsuite/libatomic.c/atomic-op-fpf16.c
>  create mode 100644 libatomic/testsuite/libatomic.c/atomic-op-fpf16b.c
>  create mode 100644 libatomic/testsuite/libatomic.c/atomic-op-fpf32.c
>  create mode 100644 libatomic/testsuite/libatomic.c/atomic-op-fpf32x.c
>  create mode 100644 libatomic/testsuite/libatomic.c/atomic-op-fpf64.c
>  create mode 100644 libatomic/testsuite/libatomic.c/atomic-op-fpf64x.c
>  create mode 100644 libatomic/testsuite/libatomic.c/atomic-op-fpl.c
>
> --
> 2.43.0
>
Joseph Myers Sept. 19, 2024, 9:38 p.m. UTC | #2
On Thu, 19 Sep 2024, mmalcomson@nvidia.com wrote:

> 6) Anything special about floating point maths that I'm tripping up on?

Correct atomic operations with floating-point operands should ensure that 
exceptions raised exactly correspond to the operands for which the 
operation succeeded, and not to the operands for any previous attempts 
where the compare-exchange failed.  There is a lengthy note in the C 
standard (in C11 it's a footnote in 6.5.16.2, in C17 it's a Note in 
6.5.16.2 and in C23 that subclause has become 6.5.17.3) that discusses 
appropriate code sequences to achieve this.  In GCC the implementation of 
this is in c-typeck.cc:build_atomic_assign, which in turn calls 
targetm.atomic_assign_expand_fenv (note that we have the complication for 
C of not introducing libm dependencies in code that only uses standard 
language features and not <math.h>, <fenv.h> or <complex.h>, so direct use 
of <fenv.h> functions is inappropriate here).

I would expect such built-in functions to follow the same semantics for 
floating-point exceptions as _Atomic compound assignment does.  (Note that 
_Atomic compound assignment is more general in the allowed operands, 
because compound assignment is a heterogeneous operation - for example, 
the special floating-point logic in build_atomic_assign includes the case 
where the LHS of the compound assignment is of atomic integer type but the 
RHS is of floating type.  However, built-in functions allow memory orders 
other than seq_cst to be used, whereas _Atomic compound assignment is 
limited to the seq_cst case.)

So it would seem appropriate for the implementation of such built-in 
functions to make use of targetm.atomic_assign_expand_fenv for 
floating-point environment handling, and for testcases to include tests 
analogous to c11-atomic-exec-5.c that exceptions are being handled 
correctly.

Cf. N2329 which suggested such operations for C in <stdatomic.h> (but 
tried to do to many things in one paper to be accepted into C); it didn't 
go into the floating-point exceptions semantics but simple correctness 
would indicate avoiding spurious exceptions from discarded computations.