
[RFC,rs6000] Add overloaded built-in function support to altivec.h, and re-implement vec_add

Message ID 4fb8f7f2-ff17-6416-3869-a8576c245dde@linux.vnet.ibm.com

Commit Message

Bill Schmidt Oct. 31, 2016, 10:28 p.m. UTC
Hi,

The PowerPC back end loses performance on vector intrinsics, because currently
all of them are treated as calls throughout the middle-end phases and only
expanded when they reach RTL.  Our version of altivec.h currently defines the
public names of overloaded functions (like vec_add) to be #defines for hidden
functions (like __builtin_vec_add), which are recognized in the parser as 
requiring special back-end support.  Tables in rs6000-c.c handle dispatch of
the overloaded functions to specific function calls appropriate to the argument
types.

The Clang version of altivec.h, by contrast, creates static inlines for each
overloaded function variant, relying on a special __attribute__((overloadable))
construct to do the dispatch in the parser itself.  This allows vec_add to be
translated into type-specific addition during parsing, so that the resulting
expressions are subject to all subsequent optimization.
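
For illustration, the Clang-style declarations look roughly like the following
(a sketch following Clang's altivec.h conventions, not something GCC currently
accepts):

  static __inline__ vector signed char
  __attribute__ ((__overloadable__, __always_inline__))
  vec_add (vector signed char __a, vector signed char __b)
  {
    return __a + __b;
  }

  static __inline__ vector unsigned char
  __attribute__ ((__overloadable__, __always_inline__))
  vec_add (vector unsigned char __a, vector unsigned char __b)
  {
    return __a + __b;
  }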

We have opened a PR suggesting that this attribute be supported in GCC as well
(https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71199), but so far there hasn't
been any success in that regard.  While waiting/hoping for the attribute to be
implemented, though, we can use existing mechanisms to create a poor man's
version of overloading dispatch.  This patch is a proof of concept for how
this can be done, and provides support for early expansion of the overloaded
vec_add intrinsic.  If we get this working, then we can gradually add more
intrinsics over time.

The dispatch mechanism is provided in a new header file, overload.h, which is
included in altivec.h.  This is done because the guts of the dispatch
mechanism are pretty ugly to look at.  Overloading is done with a chain of
calls to __builtin_choose_expr and __builtin_types_compatible_p.  Currently
I envision providing a separate dispatch macro for each combination of the
number of arguments and the number of variants to be distinguished.  I also
provide a separate "decl" macro for each number of arguments, used to create
the function decls for each static inline function.  The vec_add intrinsic
takes two input arguments and has 28 variants, so it requires the definition
of OVERLOAD_2ARG_28VAR and OVERLOAD_2ARG_DECL in overload.h.
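
To give a feel for the shape of these macros without the full 28-way
expansion, here is a cut-down, purely illustrative two-variant version
(OVERLOAD_2ARG_2VAR is a hypothetical name; only the 28-variant dispatch
macro is defined in the patch, while OVERLOAD_2ARG_DECL is as in overload.h
below):

  #define OVERLOAD_2ARG_2VAR(NAME, ARG1, ARG2,				\
			     VAR1_ID, VAR1_TYPE1, VAR1_TYPE2,		\
			     VAR2_ID, VAR2_TYPE1, VAR2_TYPE2)		\
    __builtin_choose_expr (						\
      __builtin_types_compatible_p (__typeof__ (ARG1), VAR1_TYPE1)	\
      && __builtin_types_compatible_p (__typeof__ (ARG2), VAR1_TYPE2),	\
      _##NAME##_##VAR1_ID ((VAR1_TYPE1)ARG1, (VAR1_TYPE2)ARG2),	\
    __builtin_choose_expr (						\
      __builtin_types_compatible_p (__typeof__ (ARG1), VAR2_TYPE1)	\
      && __builtin_types_compatible_p (__typeof__ (ARG2), VAR2_TYPE2),	\
      _##NAME##_##VAR2_ID ((VAR2_TYPE1)ARG1, (VAR2_TYPE2)ARG2),	\
      (void)0))

  #define OVERLOAD_2ARG_DECL(NAME, VAR_ID, TYPE0,			\
			     TYPE1, ARG1,				\
			     TYPE2, ARG2)				\
  static __inline__ TYPE0 __attribute__ ((__always_inline__))		\
  _##NAME##_##VAR_ID (TYPE1 ARG1, TYPE2 ARG2)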

These macros are then instantiated in altivec.h.  The dispatch macro for an
overloaded intrinsic is instantiated once, and the decl macro is instantiated
once for each variant, along with the associated inline function body.
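
Using the hypothetical two-variant macro above, an instantiation in altivec.h
for an imaginary vec_xyzzy intrinsic with two variants would look roughly
like this (the real vec_add instantiation appears in the patch below):

  #define vec_xyzzy(a1, a2)						\
    OVERLOAD_2ARG_2VAR(vec_xyzzy, a1, a2,				\
      1, vector signed int, vector signed int,				\
      2, vector unsigned int, vector unsigned int)

  OVERLOAD_2ARG_DECL(vec_xyzzy, 1,					\
		     vector signed int,					\
		     vector signed int, a1,				\
		     vector signed int, a2)
  {
    return a1 + a2;
  }

  OVERLOAD_2ARG_DECL(vec_xyzzy, 2,					\
		     vector unsigned int,				\
		     vector unsigned int, a1,				\
		     vector unsigned int, a2)
  {
    return a1 + a2;
  }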

The dispatch macro may need to vary depending on the supported processor
features.  In the vec_add example, we have some variants that support the
"vector double" and "vector long long" data types.  These only exist when
VSX code generation is supported, so a dispatch table conditioned on
__VSX__ includes these, while a separate one without VSX support does not.
Similarly, __POWER8_VECTOR__ must be defined if we are to support "vector
signed/unsigned __int128".  Because we use a numbering scheme that needs
to be kept consistent, this requires three versions of the dispatch table,
where the more restrictive versions replace the unimplemented entries with
redundant entries.

Note that if and when we get an overloadable attribute in GCC, the machinery
in overload.h will become obsolete: we will remove the dispatch instantiations
and replace the decl instantiations with plain decls that have the
overloadable attribute applied.

There are several complications on top of the basic design:

 * When compiling for C++, the dispatch mechanism is not available, and indeed
   is not necessary.  Thus for C++ we skip the dispatch mechanism, and change
   the definition of OVERLOAD_2ARG_DECL to use standard function overloading.

 * Compiling with -ansi or -std=c11 or the like means the dispatch mechanism
   is unavailable even for C, since GNU extensions are disallowed.
   Regrettably, this means that we can't get rid of the existing late-expansion
   methods altogether.  I don't see any way to avoid this.  Note that this
   would be the case even if we had __attribute__ ((overloadable)), since
   that would also be a GNU extension.  Despite the mess, I think that the 
   performance improvements for non-strict-ANSI code make the dual maintenance
   worthwhile.

 * "#pragma GCC target" is going to cause a lot of trouble.  With the patch
   in its present state, we fail gcc.target/powerpc/ppc-target-4.c, which
   tests the use of "vsx", "altivec,no-vsx", and "no-altivec" target options,
   and happens to use vec_add (float, float) as a testbed.  The test fails
   because altivec.h is #included after the #pragma GCC target("vsx"), which
   allows the interfaces involving vector long long and vector double to be
   produced.  However, when the options are changed to "altivec,no-vsx", the
   subsequent invocation of vec_add expands to a dispatch sequence including
   vector long long, leading to a compile-time error.  (A simplified sketch
   of this failure pattern appears after this list.)

   I can only think of two ways to deal with this, neither of which is
   attractive.  The first idea would be to make altivec.h capable of being
   included more than once.  This essentially requires an #undef before each
   #define.  Once this is done, usage of #pragma GCC target would be 
   supported provided that altivec.h is re-included after each such #pragma,
   so that the dispatch macros would be re-evaluated in the new context.
   The problem with this is that existing code not conforming to this
   requirement would fail to compile, so this is probably off the table.

   The other way would be to require a specific option on the command line
   to use the new dispatch mechanism.  When the option is present, we would
   predefine a macro such as __PPC_FAST_VECTOR__, which would then gate the
   usage in altivec.h and overload.h.  Use of #pragma GCC target to change
   the availability of Altivec, VMX, P8-vector, etc. would also be disallowed
   when the option is present.  This has the advantage of always generating
   correct code, at the cost of requiring a special option before anyone
   can leverage the benefits of early vector expansion.  That's unfortunate,
   but I suspect it's the best we can do.
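
For concreteness, here is a simplified sketch of the failure pattern described
in the first bullet above (not the actual ppc-target-4.c test, just an
illustration of the same situation):

  #pragma GCC target ("vsx")
  #include <altivec.h>   /* dispatch tables instantiated with VSX types */

  #pragma GCC target ("altivec,no-vsx")
  vector float
  add_fp (vector float a, vector float b)
  {
    /* vec_add still expands to the VSX dispatch chain, which names
       vector long long and vector double; those types are invalid
       under no-vsx, so compilation fails.  */
    return vec_add (a, b);
  }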

The current patch is nearly complete, but the #pragma GCC target issue is
not yet resolved.  I'd like to get opinions on the overall approach of the
patch and whether you agree with my assessment of the #pragma issue before
taking the patch forward.  Thanks for reading this far, and thanks in 
advance for your opinions.  We can get some big performance improvements
here eventually, but the road is a bit rocky.

Thanks,
Bill


[gcc]

2016-10-31  Bill Schmidt  <wschmidt@linux.vnet.ibm.com>

	* config/rs6000/altivec.h: Add new include of overload.h; when not
	compiling for C++ or strict ANSI, add new #defines for vec_add in
	terms of OVERLOAD_2ARG_28VAR and OVERLOAD_2ARG_DECL macros; when
	compiling for C++ but not for strict ANSI, use just the
	OVERLOAD_2ARG_DECL macros; when not compiling for strict ANSI,
	remove #define of vec_add in terms of __builtin_vec_add.
	* config/rs6000/overload.h: New file, with #defines of
	OVERLOAD_2ARG_28VAR when not compiling for C++ or strict ANSI, and
	two different flavors of OVERLOAD_2ARG_DECL (C++ and otherwise)
	when not compiling for strict ANSI.
	* config.gcc: For each triple that includes altivec.h in
	extra_headers, also add overload.h.

[gcc/testsuite]

2016-10-31  Bill Schmidt  <wschmidt@linux.vnet.ibm.com>

	* gcc.target/powerpc/overload-add-1.c: New.
	* gcc.target/powerpc/overload-add-2.c: New.
	* gcc.target/powerpc/overload-add-3.c: New.
	* gcc.target/powerpc/overload-add-4.c: New.
	* gcc.target/powerpc/overload-add-5.c: New.
	* gcc.target/powerpc/overload-add-6.c: New.
	* gcc.target/powerpc/overload-add-7.c: New.

Comments

Bill Schmidt Oct. 31, 2016, 11:49 p.m. UTC | #1
> 
> On Oct 31, 2016, at 5:28 PM, Bill Schmidt <wschmidt@linux.vnet.ibm.com> wrote:
> 
>   The other way would be to require a specific option on the command line
>   to use the new dispatch mechanism.  When the option is present, we would
>   predefine a macro such as __PPC_FAST_VECTOR__, which would then gate the
>   usage in altivec.h and overload.h.  Use of #pragma GCC target to change
>   the availability of Altivec, VMX, P8-vector, etc. would also be disallowed
>   when the option is present.  This has the advantage of always generating
>   correct code, at the cost of requiring a special option before anyone
>   can leverage the benefits of early vector expansion.  That's unfortunate,
>   but I suspect it's the best we can do.

Though I suppose we could require the option to turn off the new dispatch
mechanism, and document the change in gcc7/changes.html.  A little irritating
for people already using the pragma support, but I really expect this wouldn't affect
many people at all.

-- Bill

Bill Schmidt, Ph.D.
GCC for Linux on Power
Linux on Power Toolchain
IBM Linux Technology Center
wschmidt@linux.vnet.ibm.com
Jakub Jelinek Nov. 1, 2016, 12:09 a.m. UTC | #2
On Mon, Oct 31, 2016 at 05:28:42PM -0500, Bill Schmidt wrote:
> The PowerPC back end loses performance on vector intrinsics, because currently
> all of them are treated as calls throughout the middle-end phases and only
> expanded when they reach RTL.  Our version of altivec.h currently defines the
> public names of overloaded functions (like vec_add) to be #defines for hidden
> functions (like __builtin_vec_add), which are recognized in the parser as 
> requiring special back-end support.  Tables in rs6000-c.c handle dispatch of
> the overloaded functions to specific function calls appropriate to the argument
> types.

This doesn't look very nice.  If all you care is that the builtins like
__builtin_altivec_vaddubm etc. that __builtin_vec_add overloads into fold
into generic vector operations under certain conditions, just fold those
into whatever you want in targetm.gimple_fold_builtin (gsi).

	Jakub
Michael Meissner Nov. 1, 2016, 12:19 a.m. UTC | #3
On Mon, Oct 31, 2016 at 06:49:20PM -0500, Bill Schmidt wrote:
> > 
> > On Oct 31, 2016, at 5:28 PM, Bill Schmidt <wschmidt@linux.vnet.ibm.com> wrote:
> > 
> >   The other way would be to require a specific option on the command line
> >   to use the new dispatch mechanism.  When the option is present, we would
> >   predefine a macro such as __PPC_FAST_VECTOR__, which would then gate the
> >   usage in altivec.h and overload.h.  Use of #pragma GCC target to change
> >   the availability of Altivec, VMX, P8-vector, etc. would also be disallowed
> >   when the option is present.  This has the advantage of always generating
> >   correct code, at the cost of requiring a special option before anyone
> >   can leverage the benefits of early vector expansion.  That's unfortunate,
> >   but I suspect it's the best we can do.
> 
> Though I suppose we could require the option to turn off the new dispatch
> mechanism, and document the change in gcc7/changes.html.  A little irritating
> for people already using the pragma support, but I really expect this wouldn't affect
> many people at all.

I suspect we may find out how many people are using #pragma GCC target and
altivec vector instructions if we break their code :-(

Even with attribute target instead of pragma, you need to give appropriate
error messages if the user calls a built-in that the current machine doesn't
support.

IIRC, C++ does not support #pragma GCC target nor the target attribute.

One question is how many of the billions and billions (ok, 1,345) of the rs6000
built-ins would be improved by expanding them in gimple time rather than rtl?
Bill Schmidt Nov. 1, 2016, 12:26 a.m. UTC | #4
> On Oct 31, 2016, at 7:09 PM, Jakub Jelinek <jakub@redhat.com> wrote:
> 
> On Mon, Oct 31, 2016 at 05:28:42PM -0500, Bill Schmidt wrote:
>> The PowerPC back end loses performance on vector intrinsics, because currently
>> all of them are treated as calls throughout the middle-end phases and only
>> expanded when they reach RTL.  Our version of altivec.h currently defines the
>> public names of overloaded functions (like vec_add) to be #defines for hidden
>> functions (like __builtin_vec_add), which are recognized in the parser as 
>> requiring special back-end support.  Tables in rs6000-c.c handle dispatch of
>> the overloaded functions to specific function calls appropriate to the argument
>> types.
> 
> This doesn't look very nice.  If all you care is that the builtins like
> __builtin_altivec_vaddubm etc. that __builtin_vec_add overloads into fold
> into generic vector operations under certain conditions, just fold those
> into whatever you want in targetm.gimple_fold_builtin (gsi).
> 
> 	Jakub
> 
Ah, thanks, Jakub.  I wasn't aware of that hook, and that sounds like the best
approach.  I had found previously how difficult it can be to expand some of
these things during the parser hook, but if we can expand them early in
GIMPLE that is probably much easier.  I will look into it.

"This doesn't look very nice" wins the understatement of the year award...
I was getting increasingly unhappy the further I got into it.

Thanks,
Bill
Bill Schmidt Nov. 1, 2016, 12:29 a.m. UTC | #5
On Oct 31, 2016, at 7:19 PM, Michael Meissner <meissner@linux.vnet.ibm.com> wrote:
> 
> One question is how many of the billions and billions (ok, 1,345) of the rs6000
> built-ins would be improved by expanding them in gimple time rather than rtl?
> 
Hundreds and hundreds of them.  All of the basic operators, many of the memory
operations, all of the dozens of flavors of things that are just permutes at heart.
The loads and stores alone are a huge deal that we've seen cause problems in
customer code.

Bill
Marc Glisse Nov. 1, 2016, 9:01 a.m. UTC | #6
Hello,

how far are we from being able to use

#define vec_add(a,b) ((a)+(b))

?

The few tests I tried pass with -flax-vector-conversions, and the only 
ones that require this flag are those involving vector bool XXX. Would it 
make sense to tweak the front-ends to do the right thing for those 
specific vector types (that people probably didn't have in mind when 
developing the C extension)?
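
To illustrate the vector bool point (an illustrative snippet, assuming GCC's
usual rule that the operands of a binary vector operator must have the same
type):

  vector signed int
  f (vector signed int si, vector bool int bi)
  {
    vector signed int a = si + si;  /* fine with the generic operator */
    vector signed int b = bi + si;  /* vector bool int and vector signed int
                                       are distinct types, so this needs
                                       -flax-vector-conversions (or a cast) */
    return a + b;
  }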
Richard Biener Nov. 2, 2016, 9:19 a.m. UTC | #7
On Tue, Nov 1, 2016 at 1:09 AM, Jakub Jelinek <jakub@redhat.com> wrote:
> On Mon, Oct 31, 2016 at 05:28:42PM -0500, Bill Schmidt wrote:
>> The PowerPC back end loses performance on vector intrinsics, because currently
>> all of them are treated as calls throughout the middle-end phases and only
>> expanded when they reach RTL.  Our version of altivec.h currently defines the
>> public names of overloaded functions (like vec_add) to be #defines for hidden
>> functions (like __builtin_vec_add), which are recognized in the parser as
>> requiring special back-end support.  Tables in rs6000-c.c handle dispatch of
>> the overloaded functions to specific function calls appropriate to the argument
>> types.
>
> This doesn't look very nice.  If all you care is that the builtins like
> __builtin_altivec_vaddubm etc. that __builtin_vec_add overloads into fold
> into generic vector operations under certain conditions, just fold those
> into whatever you want in targetm.gimple_fold_builtin (gsi).

Note that traditionally "overloading" with GCC "builtins" is done by using
varargs and the "type generic" attribute.  That doesn't scale to return type
overloading, though, for which we usually added direct support to the parser
(for example for __builtin_shuffle).

The folding trick of course should work just fine.

Richard.

>         Jakub
Jakub Jelinek Nov. 2, 2016, 9:28 a.m. UTC | #8
On Wed, Nov 02, 2016 at 10:19:26AM +0100, Richard Biener wrote:
> On Tue, Nov 1, 2016 at 1:09 AM, Jakub Jelinek <jakub@redhat.com> wrote:
> > On Mon, Oct 31, 2016 at 05:28:42PM -0500, Bill Schmidt wrote:
> >> The PowerPC back end loses performance on vector intrinsics, because currently
> >> all of them are treated as calls throughout the middle-end phases and only
> >> expanded when they reach RTL.  Our version of altivec.h currently defines the
> >> public names of overloaded functions (like vec_add) to be #defines for hidden
> >> functions (like __builtin_vec_add), which are recognized in the parser as
> >> requiring special back-end support.  Tables in rs6000-c.c handle dispatch of
> >> the overloaded functions to specific function calls appropriate to the argument
> >> types.
> >
> > This doesn't look very nice.  If all you care is that the builtins like
> > __builtin_altivec_vaddubm etc. that __builtin_vec_add overloads into fold
> > into generic vector operations under certain conditions, just fold those
> > into whatever you want in targetm.gimple_fold_builtin (gsi).
> 
> Note that traditionally "overloading" with GCC "builtins" is done by
> using varargs
> and the "type generic" attribute.  That doesn't scale to return type overloading
> though for which we usually added direct support to the parser (for example
> for __builtin_shuffle).

My understanding is that rs6000 already does that, it hooks into
resolve_overloaded_builtin which already handles the fully type generic
builtins where not just the arguments, but also the return type can be
picked up.  But it resolves the overloaded builtins into calls to other
builtins that are not type-generic.

So, either that function, instead of returning the specific md builtin calls,
in some cases already returns trees with the generic behavior of the
builtin; or it returns what it does now, and then, in the gimple fold builtin
target hook (note, the normal fold builtin target hook is not right for
that, because it is mostly used for folding builtins into constants; callers
will usually throw away other results), fold those specific md builtins
into whatever GIMPLE you want.  If we want to decrease the amount of folding
in the FEs, the gimple fold builtin hook is probably better.

	Jakub
Bill Schmidt Nov. 2, 2016, 12:16 p.m. UTC | #9
On Nov 2, 2016, at 4:28 AM, Jakub Jelinek <jakub@redhat.com> wrote:
> 
> On Wed, Nov 02, 2016 at 10:19:26AM +0100, Richard Biener wrote:
>> On Tue, Nov 1, 2016 at 1:09 AM, Jakub Jelinek <jakub@redhat.com> wrote:
>>> On Mon, Oct 31, 2016 at 05:28:42PM -0500, Bill Schmidt wrote:
>>>> The PowerPC back end loses performance on vector intrinsics, because currently
>>>> all of them are treated as calls throughout the middle-end phases and only
>>>> expanded when they reach RTL.  Our version of altivec.h currently defines the
>>>> public names of overloaded functions (like vec_add) to be #defines for hidden
>>>> functions (like __builtin_vec_add), which are recognized in the parser as
>>>> requiring special back-end support.  Tables in rs6000-c.c handle dispatch of
>>>> the overloaded functions to specific function calls appropriate to the argument
>>>> types.
>>> 
>>> This doesn't look very nice.  If all you care is that the builtins like
>>> __builtin_altivec_vaddubm etc. that __builtin_vec_add overloads into fold
>>> into generic vector operations under certain conditions, just fold those
>>> into whatever you want in targetm.gimple_fold_builtin (gsi).
>> 
>> Note that traditionally "overloading" with GCC "builtins" is done by
>> using varargs
>> and the "type generic" attribute.  That doesn't scale to return type overloading
>> though for which we usually added direct support to the parser (for example
>> for __builtin_shuffle).
> 
> My understanding is that rs6000 already does that, it hooks into
> resolve_overloaded_builtin which already handles the fully type generic
> builtins where not just the arguments, but also the return type can be
> picked up.  But it resolves the overloaded builtins into calls to other
> builtins that are not type-generic.
> 
> So, either that function, instead of returning the specific md builtin calls,
> in some cases already returns trees with the generic behavior of the
> builtin; or it returns what it does now, and then, in the gimple fold builtin
> target hook (note, the normal fold builtin target hook is not right for
> that, because it is mostly used for folding builtins into constants; callers
> will usually throw away other results), fold those specific md builtins
> into whatever GIMPLE you want.  If we want to decrease the amount of folding
> in the FEs, the gimple fold builtin hook is probably better.
> 
> 	Jakub

Thanks, all.  Using the gimple_fold_builtin target hook works very well and
is exactly what I'm looking for.  I've reworked the patch to the much simpler
https://gcc.gnu.org/ml/gcc-patches/2016-11/msg00104.html.

Much obliged for the help!

Bill
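
For anyone following along, a rough outline of what folding one of these
builtins in the gimple_fold_builtin target hook can look like follows (this is
an illustrative sketch, not the reworked patch linked above; the enumeration
names such as ALTIVEC_BUILTIN_VADDUBM are the existing ones from
rs6000-builtin.def):

  static bool
  rs6000_gimple_fold_builtin (gimple_stmt_iterator *gsi)
  {
    gimple *stmt = gsi_stmt (*gsi);
    tree fndecl = gimple_call_fndecl (stmt);
    enum rs6000_builtins fn_code
      = (enum rs6000_builtins) DECL_FUNCTION_CODE (fndecl);

    switch (fn_code)
      {
      /* Flavors of vec_add that can become a plain vector addition.
	 The V1TImode (vadduqm) case is deliberately left to late
	 expansion, as discussed in the altivec.h comment below.  */
      case ALTIVEC_BUILTIN_VADDUBM:
      case ALTIVEC_BUILTIN_VADDUHM:
      case ALTIVEC_BUILTIN_VADDUWM:
      case P8V_BUILTIN_VADDUDM:
      case ALTIVEC_BUILTIN_VADDFP:
      case VSX_BUILTIN_XVADDDP:
	{
	  tree arg0 = gimple_call_arg (stmt, 0);
	  tree arg1 = gimple_call_arg (stmt, 1);
	  tree lhs = gimple_call_lhs (stmt);
	  /* Replace the call with an ordinary GIMPLE addition so the
	     middle end can optimize it like any other vector add.  */
	  gimple *g = gimple_build_assign (lhs, PLUS_EXPR, arg0, arg1);
	  gimple_set_location (g, gimple_location (stmt));
	  gsi_replace (gsi, g, true);
	  return true;
	}
      default:
	break;
      }

    return false;
  }

The hook is wired up by defining TARGET_GIMPLE_FOLD_BUILTIN to point at this
function in rs6000.c.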

Patch

Index: gcc/config/rs6000/altivec.h
===================================================================
--- gcc/config/rs6000/altivec.h	(revision 241624)
+++ gcc/config/rs6000/altivec.h	(working copy)
@@ -53,6 +53,353 @@ 
 #define __CR6_LT		2
 #define __CR6_LT_REV		3
 
+/* Machinery to support overloaded functions in C.  */
+#include "overload.h"
+
+/* Overloaded function declarations.  Please maintain these in
+   alphabetical order.  */
+
+/* Since __builtin_choose_expr and __builtin_types_compatible_p
+   aren't permitted in C++, we'll need to use standard overloading
+   for those.  Disable this mechanism for C++.  GNU extensions are
+   also unavailable for -ansi, -std=c11, etc.  */
+#ifndef __STRICT_ANSI__
+#ifndef __cplusplus
+
+#ifdef __POWER8_VECTOR__
+#define vec_add(a1, a2)							\
+  OVERLOAD_2ARG_28VAR(vec_add, a1, a2,					\
+    1, vector bool char, vector signed char,				\
+    2, vector signed char, vector bool char,				\
+    3, vector signed char, vector signed char,				\
+    4, vector bool char, vector unsigned char,				\
+    5, vector unsigned char, vector bool char,				\
+    6, vector unsigned char, vector unsigned char,			\
+    7, vector bool short, vector signed short,				\
+    8, vector signed short, vector bool short,				\
+    9, vector signed short, vector signed short,			\
+    10, vector bool short, vector unsigned short,			\
+    11, vector unsigned short, vector bool short,			\
+    12, vector unsigned short, vector unsigned short,			\
+    13, vector bool int, vector signed int,				\
+    14, vector signed int, vector bool int,				\
+    15, vector signed int, vector signed int,				\
+    16, vector bool int, vector unsigned int,				\
+    17, vector unsigned int, vector bool int,				\
+    18, vector unsigned int, vector unsigned int,			\
+    19, vector bool long long, vector signed long long,			\
+    20, vector signed long long, vector bool long long,			\
+    21, vector signed long long, vector signed long long,		\
+    22, vector bool long long, vector unsigned long long,		\
+    23, vector unsigned long long, vector bool long long,		\
+    24, vector unsigned long long, vector unsigned long long,		\
+    25, vector float, vector float,					\
+    26, vector double, vector double,					\
+    27, vector signed __int128, vector signed __int128,			\
+    28, vector unsigned __int128, vector unsigned __int128)
+#elif defined __VSX__
+#define vec_add(a1, a2)							\
+  OVERLOAD_2ARG_28VAR(vec_add, a1, a2,					\
+    1, vector bool char, vector signed char,				\
+    2, vector signed char, vector bool char,				\
+    3, vector signed char, vector signed char,				\
+    4, vector bool char, vector unsigned char,				\
+    5, vector unsigned char, vector bool char,				\
+    6, vector unsigned char, vector unsigned char,			\
+    7, vector bool short, vector signed short,				\
+    8, vector signed short, vector bool short,				\
+    9, vector signed short, vector signed short,			\
+    10, vector bool short, vector unsigned short,			\
+    11, vector unsigned short, vector bool short,			\
+    12, vector unsigned short, vector unsigned short,			\
+    13, vector bool int, vector signed int,				\
+    14, vector signed int, vector bool int,				\
+    15, vector signed int, vector signed int,				\
+    16, vector bool int, vector unsigned int,				\
+    17, vector unsigned int, vector bool int,				\
+    18, vector unsigned int, vector unsigned int,			\
+    19, vector bool long long, vector signed long long,			\
+    20, vector signed long long, vector bool long long,			\
+    21, vector signed long long, vector signed long long,		\
+    22, vector bool long long, vector unsigned long long,		\
+    23, vector unsigned long long, vector bool long long,		\
+    24, vector unsigned long long, vector unsigned long long,		\
+    25, vector float, vector float,					\
+    26, vector double, vector double,					\
+    26, vector double, vector double,					\
+    26, vector double, vector double)
+#else
+#define vec_add(a1, a2)							\
+  OVERLOAD_2ARG_28VAR(vec_add, a1, a2,					\
+    1, vector bool char, vector signed char,				\
+    2, vector signed char, vector bool char,				\
+    3, vector signed char, vector signed char,				\
+    4, vector bool char, vector unsigned char,				\
+    5, vector unsigned char, vector bool char,				\
+    6, vector unsigned char, vector unsigned char,			\
+    7, vector bool short, vector signed short,				\
+    8, vector signed short, vector bool short,				\
+    9, vector signed short, vector signed short,			\
+    10, vector bool short, vector unsigned short,			\
+    11, vector unsigned short, vector bool short,			\
+    12, vector unsigned short, vector unsigned short,			\
+    13, vector bool int, vector signed int,				\
+    14, vector signed int, vector bool int,				\
+    15, vector signed int, vector signed int,				\
+    16, vector bool int, vector unsigned int,				\
+    17, vector unsigned int, vector bool int,				\
+    18, vector unsigned int, vector unsigned int,			\
+    18, vector unsigned int, vector unsigned int,			\
+    18, vector unsigned int, vector unsigned int,			\
+    18, vector unsigned int, vector unsigned int,			\
+    18, vector unsigned int, vector unsigned int,			\
+    18, vector unsigned int, vector unsigned int,			\
+    18, vector unsigned int, vector unsigned int,			\
+    25, vector float, vector float,					\
+    25, vector float, vector float,					\
+    25, vector float, vector float,					\
+    25, vector float, vector float)
+#endif /* __POWER8_VECTOR__ #elif __VSX__ */
+
+#endif /* !__cplusplus */
+
+OVERLOAD_2ARG_DECL(vec_add, 1,						\
+		   vector signed char,					\
+		   vector bool char, a1,				\
+		   vector signed char, a2)
+{
+  return (vector signed char)a1 + a2;
+}
+
+OVERLOAD_2ARG_DECL(vec_add, 2,						\
+		   vector signed char,					\
+		   vector signed char, a1,				\
+		   vector bool char, a2)
+{
+  return a1 + (vector signed char)a2;
+}
+
+OVERLOAD_2ARG_DECL(vec_add, 3,						\
+		   vector signed char,					\
+		   vector signed char, a1,				\
+		   vector signed char, a2)
+{
+  return a1 + a2;
+}
+
+OVERLOAD_2ARG_DECL(vec_add, 4,						\
+		   vector unsigned char,				\
+		   vector bool char, a1,				\
+		   vector unsigned char, a2)
+{
+  return (vector unsigned char)a1 + a2;
+}
+
+OVERLOAD_2ARG_DECL(vec_add, 5,						\
+		   vector unsigned char,				\
+		   vector unsigned char, a1,				\
+		   vector bool char, a2)
+{
+  return a1 + (vector unsigned char)a2;
+}
+
+OVERLOAD_2ARG_DECL(vec_add, 6,						\
+		   vector unsigned char,				\
+		   vector unsigned char, a1,				\
+		   vector unsigned char, a2)
+{
+  return a1 + a2;
+}
+
+OVERLOAD_2ARG_DECL(vec_add, 7,						\
+		   vector signed short,					\
+		   vector bool short, a1,				\
+		   vector signed short, a2)
+{
+  return (vector signed short)a1 + a2;
+}
+
+OVERLOAD_2ARG_DECL(vec_add, 8,						\
+		   vector signed short,					\
+		   vector signed short, a1,				\
+		   vector bool short, a2)
+{
+  return a1 + (vector signed short)a2;
+}
+
+OVERLOAD_2ARG_DECL(vec_add, 9,						\
+		   vector signed short,					\
+		   vector signed short, a1,				\
+		   vector signed short, a2)
+{
+  return a1 + a2;
+}
+
+OVERLOAD_2ARG_DECL(vec_add, 10,						\
+		   vector unsigned short,				\
+		   vector bool short, a1,				\
+		   vector unsigned short, a2)
+{
+  return (vector unsigned short)a1 + a2;
+}
+
+OVERLOAD_2ARG_DECL(vec_add, 11,						\
+		   vector unsigned short,				\
+		   vector unsigned short, a1,				\
+		   vector bool short, a2)
+{
+  return a1 + (vector unsigned short)a2;
+}
+
+OVERLOAD_2ARG_DECL(vec_add, 12,						\
+		   vector unsigned short,				\
+		   vector unsigned short, a1,				\
+		   vector unsigned short, a2)
+{
+  return a1 + a2;
+}
+
+OVERLOAD_2ARG_DECL(vec_add, 13,						\
+		   vector signed int,					\
+		   vector bool int, a1,					\
+		   vector signed int, a2)
+{
+  return (vector signed int)a1 + a2;
+}
+
+OVERLOAD_2ARG_DECL(vec_add, 14,						\
+		   vector signed int,					\
+		   vector signed int, a1,				\
+		   vector bool int, a2)
+{
+  return a1 + (vector signed int)a2;
+}
+
+OVERLOAD_2ARG_DECL(vec_add, 15,						\
+		   vector signed int,					\
+		   vector signed int, a1,				\
+		   vector signed int, a2)
+{
+  return a1 + a2;
+}
+
+OVERLOAD_2ARG_DECL(vec_add, 16,						\
+		   vector unsigned int,					\
+		   vector bool int, a1,					\
+		   vector unsigned int, a2)
+{
+  return (vector unsigned int)a1 + a2;
+}
+
+OVERLOAD_2ARG_DECL(vec_add, 17,						\
+		   vector unsigned int,					\
+		   vector unsigned int, a1,				\
+		   vector bool int, a2)
+{
+  return a1 + (vector unsigned int)a2;
+}
+
+OVERLOAD_2ARG_DECL(vec_add, 18,						\
+		   vector unsigned int,					\
+		   vector unsigned int, a1,				\
+		   vector unsigned int, a2)
+{
+  return a1 + a2;
+}
+
+#ifdef __VSX__
+OVERLOAD_2ARG_DECL(vec_add, 19,						\
+		   vector signed long long,				\
+		   vector bool long long, a1,				\
+		   vector signed long long, a2)
+{
+  return (vector signed long long)a1 + a2;
+}
+
+OVERLOAD_2ARG_DECL(vec_add, 20,						\
+		   vector signed long long,				\
+		   vector signed long long, a1,				\
+		   vector bool long long, a2)
+{
+  return a1 + (vector signed long long)a2;
+}
+
+OVERLOAD_2ARG_DECL(vec_add, 21,						\
+		   vector signed long long,				\
+		   vector signed long long, a1,				\
+		   vector signed long long, a2)
+{
+  return a1 + a2;
+}
+
+OVERLOAD_2ARG_DECL(vec_add, 22,						\
+		   vector unsigned long long,				\
+		   vector bool long long, a1,				\
+		   vector unsigned long long, a2)
+{
+  return (vector unsigned long long)a1 + a2;
+}
+
+OVERLOAD_2ARG_DECL(vec_add, 23,						\
+		   vector unsigned long long,				\
+		   vector unsigned long long, a1,			\
+		   vector bool long long, a2)
+{
+  return a1 + (vector unsigned long long)a2;
+}
+
+OVERLOAD_2ARG_DECL(vec_add, 24,						\
+		   vector unsigned long long,				\
+		   vector unsigned long long, a1,			\
+		   vector unsigned long long, a2)
+{
+  return a1 + a2;
+}
+#endif /* __VSX__ */
+
+OVERLOAD_2ARG_DECL(vec_add, 25,						\
+		   vector float,					\
+		   vector float, a1,					\
+		   vector float, a2)
+{
+  return a1 + a2;
+}
+
+#ifdef __VSX__
+OVERLOAD_2ARG_DECL(vec_add, 26,						\
+		   vector double,					\
+		   vector double, a1,					\
+		   vector double, a2)
+{
+  return a1 + a2;
+}
+#endif /* __VSX__ */
+
+/* Currently we do not early-expand vec_add for vector __int128.  This
+   is because vector lowering in the middle end casts V1TImode to TImode,
+   which is probably appropriate since we have very little support for
+   V1TImode arithmetic.  Late expansion ensures we get the single
+   instruction add.  */
+#ifdef __POWER8_VECTOR__
+OVERLOAD_2ARG_DECL(vec_add, 27,						\
+		   vector signed __int128,				\
+		   vector signed __int128, a1,				\
+		   vector signed __int128, a2)
+{
+  return __builtin_vec_add (a1, a2);
+}
+
+OVERLOAD_2ARG_DECL(vec_add, 28,						\
+		   vector unsigned __int128,				\
+		   vector unsigned __int128, a1,			\
+		   vector unsigned __int128, a2)
+{
+  return __builtin_vec_add (a1, a2);
+}
+#endif /* __POWER8_VECTOR__ */
+
+#endif /* !__STRICT_ANSI__ */
+
 /* Synonyms.  */
 #define vec_vaddcuw vec_addc
 #define vec_vand vec_and
@@ -190,7 +537,9 @@ 
 #define vec_vupklsb __builtin_vec_vupklsb
 #define vec_abs __builtin_vec_abs
 #define vec_abss __builtin_vec_abss
+#ifdef __STRICT_ANSI__
 #define vec_add __builtin_vec_add
+#endif
 #define vec_adds __builtin_vec_adds
 #define vec_and __builtin_vec_and
 #define vec_andc __builtin_vec_andc
Index: gcc/config/rs6000/overload.h
===================================================================
--- gcc/config/rs6000/overload.h	(revision 0)
+++ gcc/config/rs6000/overload.h	(working copy)
@@ -0,0 +1,206 @@ 
+/* Overloaded Built-In Function Support
+   Copyright (C) 2016 Free Software Foundation, Inc.
+
+   This file is part of GCC.
+
+   GCC is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published
+   by the Free Software Foundation; either version 3, or (at your
+   option) any later version.
+
+   GCC is distributed in the hope that it will be useful, but WITHOUT
+   ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
+   or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public
+   License for more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#ifndef _OVERLOAD_H
+#define _OVERLOAD_H 1
+
+/* Since __builtin_choose_expr and __builtin_types_compatible_p
+   aren't permitted in C++, we'll need to use standard overloading
+   for those.  Disable this mechanism for C++.  GNU extensions are
+   also unavailable for -ansi, -std=c11, etc.  */
+#if !defined __cplusplus && !defined __STRICT_ANSI__
+
+/* Macros named OVERLOAD_<N>ARG_<M>VAR provide a dispatch mechanism
+   for built-in functions taking N input arguments and M overloaded
+   variants.  Note that indentation conventions for nested calls to
+   __builtin_choose_expr are violated for practicality.  Please
+   maintain these macros in increasing order by N and M for ease
+   of reuse.  */
+
+#define OVERLOAD_2ARG_28VAR(NAME, ARG1, ARG2,				\
+			    VAR1_ID, VAR1_TYPE1, VAR1_TYPE2,		\
+			    VAR2_ID, VAR2_TYPE1, VAR2_TYPE2,		\
+			    VAR3_ID, VAR3_TYPE1, VAR3_TYPE2,		\
+			    VAR4_ID, VAR4_TYPE1, VAR4_TYPE2,		\
+			    VAR5_ID, VAR5_TYPE1, VAR5_TYPE2,		\
+			    VAR6_ID, VAR6_TYPE1, VAR6_TYPE2,		\
+			    VAR7_ID, VAR7_TYPE1, VAR7_TYPE2,		\
+			    VAR8_ID, VAR8_TYPE1, VAR8_TYPE2,		\
+			    VAR9_ID, VAR9_TYPE1, VAR9_TYPE2,		\
+			    VAR10_ID, VAR10_TYPE1, VAR10_TYPE2,		\
+			    VAR11_ID, VAR11_TYPE1, VAR11_TYPE2,		\
+			    VAR12_ID, VAR12_TYPE1, VAR12_TYPE2,		\
+			    VAR13_ID, VAR13_TYPE1, VAR13_TYPE2,		\
+			    VAR14_ID, VAR14_TYPE1, VAR14_TYPE2,		\
+			    VAR15_ID, VAR15_TYPE1, VAR15_TYPE2,		\
+			    VAR16_ID, VAR16_TYPE1, VAR16_TYPE2,		\
+			    VAR17_ID, VAR17_TYPE1, VAR17_TYPE2,		\
+			    VAR18_ID, VAR18_TYPE1, VAR18_TYPE2,		\
+			    VAR19_ID, VAR19_TYPE1, VAR19_TYPE2,		\
+			    VAR20_ID, VAR20_TYPE1, VAR20_TYPE2,		\
+			    VAR21_ID, VAR21_TYPE1, VAR21_TYPE2,		\
+			    VAR22_ID, VAR22_TYPE1, VAR22_TYPE2,		\
+			    VAR23_ID, VAR23_TYPE1, VAR23_TYPE2,		\
+			    VAR24_ID, VAR24_TYPE1, VAR24_TYPE2,		\
+			    VAR25_ID, VAR25_TYPE1, VAR25_TYPE2,		\
+			    VAR26_ID, VAR26_TYPE1, VAR26_TYPE2,		\
+			    VAR27_ID, VAR27_TYPE1, VAR27_TYPE2,		\
+			    VAR28_ID, VAR28_TYPE1, VAR28_TYPE2)		\
+  __builtin_choose_expr (						\
+    __builtin_types_compatible_p (__typeof__ (ARG1), VAR1_TYPE1)	\
+    && __builtin_types_compatible_p (__typeof__ (ARG2), VAR1_TYPE2),	\
+    _##NAME##_##VAR1_ID ((VAR1_TYPE1)ARG1, (VAR1_TYPE2)ARG2),		\
+  __builtin_choose_expr (						\
+    __builtin_types_compatible_p (__typeof__ (ARG1), VAR2_TYPE1)	\
+    && __builtin_types_compatible_p (__typeof__ (ARG2), VAR2_TYPE2),	\
+    _##NAME##_##VAR2_ID ((VAR2_TYPE1)ARG1, (VAR2_TYPE2)ARG2),		\
+  __builtin_choose_expr (						\
+    __builtin_types_compatible_p (__typeof__ (ARG1), VAR3_TYPE1)	\
+    && __builtin_types_compatible_p (__typeof__ (ARG2), VAR3_TYPE2),	\
+    _##NAME##_##VAR3_ID ((VAR3_TYPE1)ARG1, (VAR3_TYPE2)ARG2),		\
+  __builtin_choose_expr (						\
+    __builtin_types_compatible_p (__typeof__ (ARG1), VAR4_TYPE1)	\
+    && __builtin_types_compatible_p (__typeof__ (ARG2), VAR4_TYPE2),	\
+    _##NAME##_##VAR4_ID ((VAR4_TYPE1)ARG1, (VAR4_TYPE2)ARG2),		\
+  __builtin_choose_expr (						\
+    __builtin_types_compatible_p (__typeof__ (ARG1), VAR5_TYPE1)	\
+    && __builtin_types_compatible_p (__typeof__ (ARG2), VAR5_TYPE2),	\
+    _##NAME##_##VAR5_ID ((VAR5_TYPE1)ARG1, (VAR5_TYPE2)ARG2),		\
+  __builtin_choose_expr (						\
+    __builtin_types_compatible_p (__typeof__ (ARG1), VAR6_TYPE1)	\
+    && __builtin_types_compatible_p (__typeof__ (ARG2), VAR6_TYPE2),	\
+    _##NAME##_##VAR6_ID ((VAR6_TYPE1)ARG1, (VAR6_TYPE2)ARG2),		\
+  __builtin_choose_expr (						\
+    __builtin_types_compatible_p (__typeof__ (ARG1), VAR7_TYPE1)	\
+    && __builtin_types_compatible_p (__typeof__ (ARG2), VAR7_TYPE2),	\
+    _##NAME##_##VAR7_ID ((VAR7_TYPE1)ARG1, (VAR7_TYPE2)ARG2),		\
+  __builtin_choose_expr (						\
+    __builtin_types_compatible_p (__typeof__ (ARG1), VAR8_TYPE1)	\
+    && __builtin_types_compatible_p (__typeof__ (ARG2), VAR8_TYPE2),	\
+    _##NAME##_##VAR8_ID ((VAR8_TYPE1)ARG1, (VAR8_TYPE2)ARG2),		\
+  __builtin_choose_expr (						\
+    __builtin_types_compatible_p (__typeof__ (ARG1), VAR9_TYPE1)	\
+    && __builtin_types_compatible_p (__typeof__ (ARG2), VAR9_TYPE2),	\
+    _##NAME##_##VAR9_ID ((VAR9_TYPE1)ARG1, (VAR9_TYPE2)ARG2),		\
+  __builtin_choose_expr (						\
+    __builtin_types_compatible_p (__typeof__ (ARG1), VAR10_TYPE1)	\
+    && __builtin_types_compatible_p (__typeof__ (ARG2), VAR10_TYPE2),	\
+    _##NAME##_##VAR10_ID ((VAR10_TYPE1)ARG1, (VAR10_TYPE2)ARG2),	\
+  __builtin_choose_expr (						\
+    __builtin_types_compatible_p (__typeof__ (ARG1), VAR11_TYPE1)	\
+    && __builtin_types_compatible_p (__typeof__ (ARG2), VAR11_TYPE2),	\
+    _##NAME##_##VAR11_ID ((VAR11_TYPE1)ARG1, (VAR11_TYPE2)ARG2),	\
+  __builtin_choose_expr (						\
+    __builtin_types_compatible_p (__typeof__ (ARG1), VAR12_TYPE1)	\
+    && __builtin_types_compatible_p (__typeof__ (ARG2), VAR12_TYPE2),	\
+    _##NAME##_##VAR12_ID ((VAR12_TYPE1)ARG1, (VAR12_TYPE2)ARG2),	\
+  __builtin_choose_expr (						\
+    __builtin_types_compatible_p (__typeof__ (ARG1), VAR13_TYPE1)	\
+    && __builtin_types_compatible_p (__typeof__ (ARG2), VAR13_TYPE2),	\
+    _##NAME##_##VAR13_ID ((VAR13_TYPE1)ARG1, (VAR13_TYPE2)ARG2),	\
+  __builtin_choose_expr (						\
+    __builtin_types_compatible_p (__typeof__ (ARG1), VAR14_TYPE1)	\
+    && __builtin_types_compatible_p (__typeof__ (ARG2), VAR14_TYPE2),	\
+    _##NAME##_##VAR14_ID ((VAR14_TYPE1)ARG1, (VAR14_TYPE2)ARG2),	\
+  __builtin_choose_expr (						\
+    __builtin_types_compatible_p (__typeof__ (ARG1), VAR15_TYPE1)	\
+    && __builtin_types_compatible_p (__typeof__ (ARG2), VAR15_TYPE2),	\
+    _##NAME##_##VAR15_ID ((VAR15_TYPE1)ARG1, (VAR15_TYPE2)ARG2),	\
+  __builtin_choose_expr (						\
+    __builtin_types_compatible_p (__typeof__ (ARG1), VAR16_TYPE1)	\
+    && __builtin_types_compatible_p (__typeof__ (ARG2), VAR16_TYPE2),	\
+    _##NAME##_##VAR16_ID ((VAR16_TYPE1)ARG1, (VAR16_TYPE2)ARG2),	\
+  __builtin_choose_expr (						\
+    __builtin_types_compatible_p (__typeof__ (ARG1), VAR17_TYPE1)	\
+    && __builtin_types_compatible_p (__typeof__ (ARG2), VAR17_TYPE2),	\
+    _##NAME##_##VAR17_ID ((VAR17_TYPE1)ARG1, (VAR17_TYPE2)ARG2),	\
+  __builtin_choose_expr (						\
+    __builtin_types_compatible_p (__typeof__ (ARG1), VAR18_TYPE1)	\
+    && __builtin_types_compatible_p (__typeof__ (ARG2), VAR18_TYPE2),	\
+    _##NAME##_##VAR18_ID ((VAR18_TYPE1)ARG1, (VAR18_TYPE2)ARG2),	\
+  __builtin_choose_expr (						\
+    __builtin_types_compatible_p (__typeof__ (ARG1), VAR19_TYPE1)	\
+    && __builtin_types_compatible_p (__typeof__ (ARG2), VAR19_TYPE2),	\
+    _##NAME##_##VAR19_ID ((VAR19_TYPE1)ARG1, (VAR19_TYPE2)ARG2),	\
+  __builtin_choose_expr (						\
+    __builtin_types_compatible_p (__typeof__ (ARG1), VAR20_TYPE1)	\
+    && __builtin_types_compatible_p (__typeof__ (ARG2), VAR20_TYPE2),	\
+    _##NAME##_##VAR20_ID ((VAR20_TYPE1)ARG1, (VAR20_TYPE2)ARG2),	\
+  __builtin_choose_expr (						\
+    __builtin_types_compatible_p (__typeof__ (ARG1), VAR21_TYPE1)	\
+    && __builtin_types_compatible_p (__typeof__ (ARG2), VAR21_TYPE2),	\
+    _##NAME##_##VAR21_ID ((VAR21_TYPE1)ARG1, (VAR21_TYPE2)ARG2),	\
+  __builtin_choose_expr (						\
+    __builtin_types_compatible_p (__typeof__ (ARG1), VAR22_TYPE1)	\
+    && __builtin_types_compatible_p (__typeof__ (ARG2), VAR22_TYPE2),	\
+    _##NAME##_##VAR22_ID ((VAR22_TYPE1)ARG1, (VAR22_TYPE2)ARG2),	\
+  __builtin_choose_expr (						\
+    __builtin_types_compatible_p (__typeof__ (ARG1), VAR23_TYPE1)	\
+    && __builtin_types_compatible_p (__typeof__ (ARG2), VAR23_TYPE2),	\
+    _##NAME##_##VAR23_ID ((VAR23_TYPE1)ARG1, (VAR23_TYPE2)ARG2),	\
+  __builtin_choose_expr (						\
+    __builtin_types_compatible_p (__typeof__ (ARG1), VAR24_TYPE1)	\
+    && __builtin_types_compatible_p (__typeof__ (ARG2), VAR24_TYPE2),	\
+    _##NAME##_##VAR24_ID ((VAR24_TYPE1)ARG1, (VAR24_TYPE2)ARG2),	\
+  __builtin_choose_expr (						\
+    __builtin_types_compatible_p (__typeof__ (ARG1), VAR25_TYPE1)	\
+    && __builtin_types_compatible_p (__typeof__ (ARG2), VAR25_TYPE2),	\
+    _##NAME##_##VAR25_ID ((VAR25_TYPE1)ARG1, (VAR25_TYPE2)ARG2),	\
+  __builtin_choose_expr (						\
+    __builtin_types_compatible_p (__typeof__ (ARG1), VAR26_TYPE1)	\
+    && __builtin_types_compatible_p (__typeof__ (ARG2), VAR26_TYPE2),	\
+    _##NAME##_##VAR26_ID ((VAR26_TYPE1)ARG1, (VAR26_TYPE2)ARG2),	\
+  __builtin_choose_expr (						\
+    __builtin_types_compatible_p (__typeof__ (ARG1), VAR27_TYPE1)	\
+    && __builtin_types_compatible_p (__typeof__ (ARG2), VAR27_TYPE2),	\
+    _##NAME##_##VAR27_ID ((VAR27_TYPE1)ARG1, (VAR27_TYPE2)ARG2),	\
+  __builtin_choose_expr (						\
+    __builtin_types_compatible_p (__typeof__ (ARG1), VAR28_TYPE1)	\
+    && __builtin_types_compatible_p (__typeof__ (ARG2), VAR28_TYPE2),	\
+    _##NAME##_##VAR28_ID ((VAR28_TYPE1)ARG1, (VAR28_TYPE2)ARG2),	\
+    (void)0))))))))))))))))))))))))))))
+
+/* Macros named OVERLOAD_<N>ARG_DECL provide a declaration for one
+   variant of an overloaded built-in function having N arguments.
+   Please maintain these macros in increasing order by N for ease
+   of reuse.  */
+
+#define OVERLOAD_2ARG_DECL(NAME, VAR_ID, TYPE0,				\
+			   TYPE1, ARG1,					\
+			   TYPE2, ARG2)					\
+static __inline__ TYPE0 __attribute__ ((__always_inline__))		\
+_##NAME##_##VAR_ID (TYPE1 ARG1, TYPE2 ARG2)
+
+/* With C++, we can just use function overloading.  */
+#elif defined __cplusplus && !defined __STRICT_ANSI__
+
+#define OVERLOAD_2ARG_DECL(NAME, VAR_ID, TYPE0,				\
+			   TYPE1, ARG1,					\
+			   TYPE2, ARG2)					\
+static __inline__ TYPE0 __attribute__ ((__always_inline__))		\
+NAME (TYPE1 ARG1, TYPE2 ARG2)
+
+#endif /* !__cplusplus && !__STRICT_ANSI__ */
+
+#endif /* _OVERLOAD_H */
Index: gcc/config.gcc
===================================================================
--- gcc/config.gcc	(revision 241624)
+++ gcc/config.gcc	(working copy)
@@ -440,7 +440,7 @@  nvptx-*-*)
 	;;
 powerpc*-*-*)
 	cpu_type=rs6000
-	extra_headers="ppc-asm.h altivec.h spe.h ppu_intrinsics.h paired.h spu2vmx.h vec_types.h si2vmx.h htmintrin.h htmxlintrin.h"
+	extra_headers="ppc-asm.h altivec.h spe.h ppu_intrinsics.h paired.h spu2vmx.h vec_types.h si2vmx.h htmintrin.h htmxlintrin.h overload.h"
 	case x$with_cpu in
 	    xpowerpc64|xdefault64|x6[23]0|x970|xG5|xpower[3456789]|xpower6x|xrs64a|xcell|xa2|xe500mc64|xe5500|xe6500)
 		cpu_is_64bit=yes
@@ -2279,13 +2279,13 @@  powerpc-*-darwin*)
 	    ;;
 	esac
 	tmake_file="${tmake_file} t-slibgcc"
-	extra_headers=altivec.h
+	extra_headers="altivec.h overload.h"
 	;;
 powerpc64-*-darwin*)
 	extra_options="${extra_options} ${cpu_type}/darwin.opt"
 	tmake_file="${tmake_file} ${cpu_type}/t-darwin64 t-slibgcc"
 	tm_file="${tm_file} ${cpu_type}/darwin8.h ${cpu_type}/darwin64.h"
-	extra_headers=altivec.h
+	extra_headers="altivec.h overload.h"
 	;;
 powerpc*-*-freebsd*)
 	tm_file="${tm_file} dbxelf.h elfos.h ${fbsd_tm_file} rs6000/sysv4.h"
@@ -2512,7 +2512,7 @@  rs6000-ibm-aix5.3.* | powerpc-ibm-aix5.3.*)
 	use_collect2=yes
 	thread_file='aix'
 	use_gcc_stdint=wrap
-	extra_headers=altivec.h
+	extra_headers="altivec.h overload.h"
 	;;
 rs6000-ibm-aix6.* | powerpc-ibm-aix6.*)
 	tm_file="${tm_file} rs6000/aix.h rs6000/aix61.h rs6000/xcoff.h rs6000/aix-stdint.h"
@@ -2521,7 +2521,7 @@  rs6000-ibm-aix6.* | powerpc-ibm-aix6.*)
 	use_collect2=yes
 	thread_file='aix'
 	use_gcc_stdint=wrap
-	extra_headers=altivec.h
+	extra_headers="altivec.h overload.h"
 	default_use_cxa_atexit=yes
 	;;
 rs6000-ibm-aix[789].* | powerpc-ibm-aix[789].*)
@@ -2531,7 +2531,7 @@  rs6000-ibm-aix[789].* | powerpc-ibm-aix[789].*)
 	use_collect2=yes
 	thread_file='aix'
 	use_gcc_stdint=wrap
-	extra_headers=altivec.h
+	extra_headers="altivec.h overload.h"
 	default_use_cxa_atexit=yes
 	;;
 rl78-*-elf*)
Index: gcc/testsuite/gcc.target/powerpc/overload-add-1.c
===================================================================
--- gcc/testsuite/gcc.target/powerpc/overload-add-1.c	(revision 0)
+++ gcc/testsuite/gcc.target/powerpc/overload-add-1.c	(working copy)
@@ -0,0 +1,46 @@ 
+/* Verify that overloaded built-ins for vec_add with char
+   inputs produce the right results.  */
+
+/* { dg-do compile } */
+/* { dg-require-effective-target powerpc_altivec_ok } */
+/* { dg-additional-options "-std=gnu11" } */
+
+#include <altivec.h>
+
+vector signed char
+test1 (vector bool char x, vector signed char y)
+{
+  return vec_add (x, y);
+}
+
+vector signed char
+test2 (vector signed char x, vector bool char y)
+{
+  return vec_add (x, y);
+}
+
+vector signed char
+test3 (vector signed char x, vector signed char y)
+{
+  return vec_add (x, y);
+}
+
+vector unsigned char
+test4 (vector bool char x, vector unsigned char y)
+{
+  return vec_add (x, y);
+}
+
+vector unsigned char
+test5 (vector unsigned char x, vector bool char y)
+{
+  return vec_add (x, y);
+}
+
+vector unsigned char
+test6 (vector unsigned char x, vector unsigned char y)
+{
+  return vec_add (x, y);
+}
+
+/* { dg-final { scan-assembler-times "vaddubm" 6 } } */
Index: gcc/testsuite/gcc.target/powerpc/overload-add-2.c
===================================================================
--- gcc/testsuite/gcc.target/powerpc/overload-add-2.c	(revision 0)
+++ gcc/testsuite/gcc.target/powerpc/overload-add-2.c	(working copy)
@@ -0,0 +1,46 @@ 
+/* Verify that overloaded built-ins for vec_add with short
+   inputs produce the right results.  */
+
+/* { dg-do compile } */
+/* { dg-require-effective-target powerpc_altivec_ok } */
+/* { dg-additional-options "-std=gnu11" } */
+
+#include <altivec.h>
+
+vector signed short
+test1 (vector bool short x, vector signed short y)
+{
+  return vec_add (x, y);
+}
+
+vector signed short
+test2 (vector signed short x, vector bool short y)
+{
+  return vec_add (x, y);
+}
+
+vector signed short
+test3 (vector signed short x, vector signed short y)
+{
+  return vec_add (x, y);
+}
+
+vector unsigned short
+test4 (vector bool short x, vector unsigned short y)
+{
+  return vec_add (x, y);
+}
+
+vector unsigned short
+test5 (vector unsigned short x, vector bool short y)
+{
+  return vec_add (x, y);
+}
+
+vector unsigned short
+test6 (vector unsigned short x, vector unsigned short y)
+{
+  return vec_add (x, y);
+}
+
+/* { dg-final { scan-assembler-times "vadduhm" 6 } } */
Index: gcc/testsuite/gcc.target/powerpc/overload-add-3.c
===================================================================
--- gcc/testsuite/gcc.target/powerpc/overload-add-3.c	(revision 0)
+++ gcc/testsuite/gcc.target/powerpc/overload-add-3.c	(working copy)
@@ -0,0 +1,46 @@ 
+/* Verify that overloaded built-ins for vec_add with int
+   inputs produce the right results.  */
+
+/* { dg-do compile } */
+/* { dg-require-effective-target powerpc_altivec_ok } */
+/* { dg-additional-options "-std=gnu11" } */
+
+#include <altivec.h>
+
+vector signed int
+test1 (vector bool int x, vector signed int y)
+{
+  return vec_add (x, y);
+}
+
+vector signed int
+test2 (vector signed int x, vector bool int y)
+{
+  return vec_add (x, y);
+}
+
+vector signed int
+test3 (vector signed int x, vector signed int y)
+{
+  return vec_add (x, y);
+}
+
+vector unsigned int
+test4 (vector bool int x, vector unsigned int y)
+{
+  return vec_add (x, y);
+}
+
+vector unsigned int
+test5 (vector unsigned int x, vector bool int y)
+{
+  return vec_add (x, y);
+}
+
+vector unsigned int
+test6 (vector unsigned int x, vector unsigned int y)
+{
+  return vec_add (x, y);
+}
+
+/* { dg-final { scan-assembler-times "vadduwm" 6 } } */
Index: gcc/testsuite/gcc.target/powerpc/overload-add-4.c
===================================================================
--- gcc/testsuite/gcc.target/powerpc/overload-add-4.c	(revision 0)
+++ gcc/testsuite/gcc.target/powerpc/overload-add-4.c	(working copy)
@@ -0,0 +1,46 @@ 
+/* Verify that overloaded built-ins for vec_add with long long
+   inputs produce the right results.  */
+
+/* { dg-do compile } */
+/* { dg-require-effective-target powerpc_p8vector_ok } */
+/* { dg-additional-options "-std=gnu11" } */
+
+#include <altivec.h>
+
+vector signed long long
+test1 (vector bool long long x, vector signed long long y)
+{
+  return vec_add (x, y);
+}
+
+vector signed long long
+test2 (vector signed long long x, vector bool long long y)
+{
+  return vec_add (x, y);
+}
+
+vector signed long long
+test3 (vector signed long long x, vector signed long long y)
+{
+  return vec_add (x, y);
+}
+
+vector unsigned long long
+test4 (vector bool long long x, vector unsigned long long y)
+{
+  return vec_add (x, y);
+}
+
+vector unsigned long long
+test5 (vector unsigned long long x, vector bool long long y)
+{
+  return vec_add (x, y);
+}
+
+vector unsigned long long
+test6 (vector unsigned long long x, vector unsigned long long y)
+{
+  return vec_add (x, y);
+}
+
+/* { dg-final { scan-assembler-times "vaddudm" 6 } } */
Index: gcc/testsuite/gcc.target/powerpc/overload-add-5.c
===================================================================
--- gcc/testsuite/gcc.target/powerpc/overload-add-5.c	(revision 0)
+++ gcc/testsuite/gcc.target/powerpc/overload-add-5.c	(working copy)
@@ -0,0 +1,16 @@ 
+/* Verify that overloaded built-ins for vec_add with float
+   inputs produce the right results.  */
+
+/* { dg-do compile } */
+/* { dg-require-effective-target powerpc_altivec_ok } */
+/* { dg-additional-options "-std=gnu11 -mno-vsx" } */
+
+#include <altivec.h>
+
+vector float
+test1 (vector float x, vector float y)
+{
+  return vec_add (x, y);
+}
+
+/* { dg-final { scan-assembler-times "vaddfp" 1 } } */
Index: gcc/testsuite/gcc.target/powerpc/overload-add-6.c
===================================================================
--- gcc/testsuite/gcc.target/powerpc/overload-add-6.c	(revision 0)
+++ gcc/testsuite/gcc.target/powerpc/overload-add-6.c	(working copy)
@@ -0,0 +1,23 @@ 
+/* Verify that overloaded built-ins for vec_add with float and
+   double inputs for VSX produce the right results.  */
+
+/* { dg-do compile } */
+/* { dg-require-effective-target powerpc_vsx_ok } */
+/* { dg-additional-options "-std=gnu11" } */
+
+#include <altivec.h>
+
+vector float
+test1 (vector float x, vector float y)
+{
+  return vec_add (x, y);
+}
+
+vector double
+test2 (vector double x, vector double y)
+{
+  return vec_add (x, y);
+}
+
+/* { dg-final { scan-assembler-times "xvaddsp" 1 } } */
+/* { dg-final { scan-assembler-times "xvadddp" 1 } } */
Index: gcc/testsuite/gcc.target/powerpc/overload-add-7.c
===================================================================
--- gcc/testsuite/gcc.target/powerpc/overload-add-7.c	(revision 0)
+++ gcc/testsuite/gcc.target/powerpc/overload-add-7.c	(working copy)
@@ -0,0 +1,22 @@ 
+/* Verify that overloaded built-ins for vec_add with __int128
+   inputs produce the right results.  */
+
+/* { dg-do compile } */
+/* { dg-require-effective-target powerpc_p8vector_ok } */
+/* { dg-additional-options "-std=gnu11 -Wno-pedantic" } */
+
+#include "altivec.h"
+
+vector signed __int128
+test1 (vector signed __int128 x, vector signed __int128 y)
+{
+  return vec_add (x, y);
+}
+
+vector unsigned __int128
+test2 (vector unsigned __int128 x, vector unsigned __int128 y)
+{
+  return vec_add (x, y);
+}
+
+/* { dg-final { scan-assembler-times "vadduqm" 2 } } */