diff mbox series

PR target/104345: Use nvptx "set" instruction for cond ? -1 : 0.

Message ID 010501d81941$20d5a2b0$6280e810$@nextmovesoftware.com
State New
Headers show
Series PR target/104345: Use nvptx "set" instruction for cond ? -1 : 0. | expand

Commit Message

Roger Sayle Feb. 3, 2022, 9 p.m. UTC
This patch addresses the "increased register pressure" regression on
nvptx-none caused by my change to transition the backend to a
STORE_FLAG_VALUE = 1 target.  This improved code generation for the
more common case of producing 0/1 Boolean values, but unfortunately
made things marginally worse when a 0/-1 mask value is desired.
Unfortunately, nvptx kernels are extremely sensitive to changes in
register usage, which was observable in the reported PR.

This patch provides optimizations for -(cond ? 1 : 0), effectively
simplify this into cond ? -1 : 0, where these ternary operators are
provided by nvptx's selp instruction, and for the specific case of
SImode, using (restoring) nvptx's "set" instruction (which avoids
the need for a predicate register).

This patch has been tested on nvptx-none hosted on x86_64-pc-linux-gnu
with a "make" and "make -k check" with no new failures.  Unfortunately,
the exact register usage of a nvptx kernel depends upon the version of
the Cuda drivers being used (and the hardware), but I believe this
change should resolve the PR (for Thomas) by improving code generation
for the cases that regressed.  Ok for mainline?


2022-02-03  Roger Sayle  <roger@nextmovesoftware.com>

gcc/ChangeLog
	PR target/104345
	* config/nvptx/nvptx.md (sel_true<mode>): Fix indentation.
	(sel_false<mode>): Likewise.
	(define_code_iterator eqne): New code iterator for EQ and NE.
	(*selp<mode>_neg_<code>): New define_insn_and_split to optimize
	the negation of a selp instruction.
	(*selp<mode>_not_<code>): New define_insn_and_split to optimize
	the bitwise not of a selp instruction.
	(*setcc_int<mode>): Use set instruction for neg:SI of a selp.

gcc/testsuite/ChangeLog
	PR target/104345
	* gcc.target/nvptx/neg-selp.c: New test case.


Thanks in advance,
Roger
--

Comments

Thomas Schwinge Feb. 4, 2022, 8:23 a.m. UTC | #1
Hi Roger!

On 2022-02-03T21:00:50+0000, "Roger Sayle" <roger@nextmovesoftware.com> wrote:
> This patch

Thanks!

> addresses the "increased register pressure" regression on
> nvptx-none caused by my change to transition the backend to a
> STORE_FLAG_VALUE = 1 target.

Yes, "addresses", but unfortunately doesn't "resolve".  ;-|

> This improved code generation for the
> more common case of producing 0/1 Boolean values, but unfortunately
> made things marginally worse when a 0/-1 mask value is desired.
> Unfortunately, nvptx kernels are extremely sensitive to changes in
> register usage, which was observable in the reported PR.
>
> This patch provides optimizations for -(cond ? 1 : 0), effectively
> simplify this into cond ? -1 : 0, where these ternary operators are
> provided by nvptx's selp instruction, and for the specific case of
> SImode, using (restoring) nvptx's "set" instruction (which avoids
> the need for a predicate register).

I'm confirming the improved code generation (less registers used, less
instructions emitted) in cases where it triggers -- but unfortunately it
doesn't in the PR104345 'libgomp.oacc-c-c++-common/reduction-cplx-dbl.c'
scenario.

> This patch has been tested on nvptx-none hosted on x86_64-pc-linux-gnu
> with a "make" and "make -k check" with no new failures.  Unfortunately,
> the exact register usage of a nvptx kernel depends upon the version of
> the Cuda drivers being used (and the hardware), but I believe this
> change should resolve the PR (for Thomas) by improving code generation
> for the cases that regressed.  Ok for mainline?

So, testing your patch in isolation, it does *not* resolve PR104345,
unfortunately.  I'll next test in combination with your other pending
patches:

  - "nvptx: Expand QI mode operations using SI mode instructions".
  - "nvptx: Fix and use BI mode logic instructions (e.g. and.pred)"


Grüße
 Thomas


> gcc/ChangeLog
>       PR target/104345
>       * config/nvptx/nvptx.md (sel_true<mode>): Fix indentation.
>       (sel_false<mode>): Likewise.
>       (define_code_iterator eqne): New code iterator for EQ and NE.
>       (*selp<mode>_neg_<code>): New define_insn_and_split to optimize
>       the negation of a selp instruction.
>       (*selp<mode>_not_<code>): New define_insn_and_split to optimize
>       the bitwise not of a selp instruction.
>       (*setcc_int<mode>): Use set instruction for neg:SI of a selp.
>
> gcc/testsuite/ChangeLog
>       PR target/104345
>       * gcc.target/nvptx/neg-selp.c: New test case.
>
>
> Thanks in advance,
> Roger
> --
>
> diff --git a/gcc/config/nvptx/nvptx.md b/gcc/config/nvptx/nvptx.md
> index 92768dd..651ba20 100644
> --- a/gcc/config/nvptx/nvptx.md
> +++ b/gcc/config/nvptx/nvptx.md
> @@ -892,7 +892,7 @@
>
>  (define_insn "sel_true<mode>"
>    [(set (match_operand:HSDIM 0 "nvptx_register_operand" "=R")
> -        (if_then_else:HSDIM
> +     (if_then_else:HSDIM
>         (ne (match_operand:BI 1 "nvptx_register_operand" "R") (const_int 0))
>         (match_operand:HSDIM 2 "nvptx_nonmemory_operand" "Ri")
>         (match_operand:HSDIM 3 "nvptx_nonmemory_operand" "Ri")))]
> @@ -901,7 +901,7 @@
>
>  (define_insn "sel_true<mode>"
>    [(set (match_operand:SDFM 0 "nvptx_register_operand" "=R")
> -        (if_then_else:SDFM
> +     (if_then_else:SDFM
>         (ne (match_operand:BI 1 "nvptx_register_operand" "R") (const_int 0))
>         (match_operand:SDFM 2 "nvptx_nonmemory_operand" "RF")
>         (match_operand:SDFM 3 "nvptx_nonmemory_operand" "RF")))]
> @@ -910,7 +910,7 @@
>
>  (define_insn "sel_false<mode>"
>    [(set (match_operand:HSDIM 0 "nvptx_register_operand" "=R")
> -        (if_then_else:HSDIM
> +     (if_then_else:HSDIM
>         (eq (match_operand:BI 1 "nvptx_register_operand" "R") (const_int 0))
>         (match_operand:HSDIM 2 "nvptx_nonmemory_operand" "Ri")
>         (match_operand:HSDIM 3 "nvptx_nonmemory_operand" "Ri")))]
> @@ -919,13 +919,63 @@
>
>  (define_insn "sel_false<mode>"
>    [(set (match_operand:SDFM 0 "nvptx_register_operand" "=R")
> -        (if_then_else:SDFM
> +     (if_then_else:SDFM
>         (eq (match_operand:BI 1 "nvptx_register_operand" "R") (const_int 0))
>         (match_operand:SDFM 2 "nvptx_nonmemory_operand" "RF")
>         (match_operand:SDFM 3 "nvptx_nonmemory_operand" "RF")))]
>    ""
>    "%.\\tselp%t0\\t%0, %3, %2, %1;")
>
> +(define_code_iterator eqne [eq ne])
> +
> +;; Split negation of a predicate into a conditional move.
> +(define_insn_and_split "*selp<mode>_neg_<code>"
> +  [(set (match_operand:HSDIM 0 "nvptx_register_operand" "=R")
> +     (neg:HSDIM (eqne:HSDIM
> +                  (match_operand:BI 1 "nvptx_register_operand" "R")
> +                  (const_int 0))))]
> +  ""
> +  "#"
> +  "&& 1"
> +  [(set (match_dup 0)
> +     (if_then_else:HSDIM
> +       (eqne (match_dup 1) (const_int 0))
> +       (const_int -1)
> +       (const_int 0)))])
> +
> +;; Split bitwise not of a predicate into a conditional move.
> +(define_insn_and_split "*selp<mode>_not_<code>"
> +  [(set (match_operand:HSDIM 0 "nvptx_register_operand" "=R")
> +     (not:HSDIM (eqne:HSDIM
> +                  (match_operand:BI 1 "nvptx_register_operand" "R")
> +                  (const_int 0))))]
> +  ""
> +  "#"
> +  "&& 1"
> +  [(set (match_dup 0)
> +     (if_then_else:HSDIM
> +       (eqne (match_dup 1) (const_int 0))
> +       (const_int -2)
> +       (const_int -1)))])
> +
> +(define_insn "*setcc_int<mode>"
> +  [(set (match_operand:SI 0 "nvptx_register_operand" "=R")
> +     (neg:SI
> +       (match_operator:SI 1 "nvptx_comparison_operator"
> +         [(match_operand:HSDIM 2 "nvptx_register_operand" "R")
> +          (match_operand:HSDIM 3 "nvptx_nonmemory_operand" "Ri")])))]
> +  ""
> +  "%.\\tset%t0%c1\\t%0, %2, %3;")
> +
> +(define_insn "*setcc_int<mode>"
> +  [(set (match_operand:SI 0 "nvptx_register_operand" "=R")
> +     (neg:SI
> +       (match_operator:SI 1 "nvptx_float_comparison_operator"
> +         [(match_operand:SDFM 2 "nvptx_register_operand" "R")
> +          (match_operand:SDFM 3 "nvptx_nonmemory_operand" "RF")])))]
> +  ""
> +  "%.\\tset%t0%c1\\t%0, %2, %3;")
> +
>  (define_insn "setcc_float<mode>"
>    [(set (match_operand:SF 0 "nvptx_register_operand" "=R")
>       (match_operator:SF 1 "nvptx_comparison_operator"
> diff --git a/gcc/testsuite/gcc.target/nvptx/neg-selp.c b/gcc/testsuite/gcc.target/nvptx/neg-selp.c
> new file mode 100644
> index 0000000..a8f0118
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/nvptx/neg-selp.c
> @@ -0,0 +1,17 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2" } */
> +
> +int neg(int x, int y)
> +{
> +  int t = (x == y) ? 1 : 0;
> +  return -t;
> +}
> +
> +int not(int x, int y)
> +{
> +  int t = (x == y) ? 1 : 0;
> +  return ~t;
> +}
> +
> +/* { dg-final { scan-assembler-not "neg.s32" } } */
> +/* { dg-final { scan-assembler-not "not.b32" } } */
-----------------
Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht München, HRB 106955
Thomas Schwinge Feb. 4, 2022, 8:38 a.m. UTC | #2
Hi!

As an aside:

On 2022-02-03T21:00:50+0000, "Roger Sayle" <roger@nextmovesoftware.com> wrote:
> the exact register usage of a nvptx kernel depends upon the version of
> the Cuda drivers being used (and the hardware)

Yeah, that's a "problem" -- or: "challenge"?  ;-)

The GCC/nvptx back end is generating some rather high-level IR (PTX)
targeting a "black hole": not knowing what exactly the Nvidia/CUDA
Driver, PTX -> SASS compiler are going to do with it.  (Well, similar
problem also exists for more traditional ISAs if CPU microcode etc. is
involved, but it's certainly more severe here.)

Five years ago, I asked our then Nvidia PTX contact person about ideas,
"How to generate PTX code to the PTX -> SASS compiler's liking":

| We're currently looking into options for improving the PTX code generated
| by GCC's nvptx back end, and it came up the question about how to
| generate PTX code to the PTX -> SASS compiler's liking?  Is there any
| documentation available regarding this?  (I say "PTX -> SASS compiler" as
| I don't think I know the proper name of it.  For avoidance of doubt, I
| mean the "component" that sits between the PTX code we feed into
| cuLinkAddData, and what actually gets executed on the GPU as SASS code.
| Presumably the same "component" that is part of the "ptxas" tool?)
|
| As always, there are often many different variants for expressing the
| same thing.  A few examples.
|
| [...]
|
| ;-) Any so on, and so forth.  Are there any generic recommendations,
| "best practice"?

The answer was:

| I don't know of any official documentation; in general we have tuned the backend to the PTX that we generate, so following that lead will give you the best results.

So, yeah.  :-\ Understandable and not unexpected, though.


Grüße
 Thomas
-----------------
Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht München, HRB 106955
Roger Sayle Feb. 4, 2022, 10:56 a.m. UTC | #3
Hi Thomas,

Very many thanks for your help investigating this problem.

> > This patch addresses the "increased register pressure" regression
> > on nvptx-none caused by my change to transition the backend to
> > a STORE_FLAG_VALUE = 1 target.
> 
> Yes, "addresses", but unfortunately doesn't "resolve".  ;-|

Doh!

> I'm confirming the improved code generation (less registers used, less
> instructions emitted) in cases where it triggers -- but unfortunately it
> doesn't in the PR104345 'libgomp.oacc-c-c++-common/reduction-cplx-dbl.c'
> scenario.

Looking over the nvptx code currently generated for reduction-cplx-dbl.c,
it appears nearly optimal and it's difficult to see what could have regressed.
[It makes almost no uses of Boolean types, so is relatively unaffected
by a STORE_FLAG_VALUE change].  One remaining possibility is that
the "register usage" regression is not in reduction-cplx-dbl.c itself but
in __muldc3 in libgcc.a.  [I believe kernel resource usage is computed
including all called functions].

Might this be easy to test on your configuration, moving libgcc.a from
one build to another?

If it is __muldc3 regressing, then the other nvptx patches mentioned
previously, and perhaps even improvements to isnan and isinf, may help.

Again apologies that the "using nvptx set.?32 for "cond ? -1: 0" patch, that
catches many of the issues observed in your initial PR analysis, isn't actually
the root cause of this particular case.

Thanks again,
Roger
--
Tom de Vries Feb. 10, 2022, 8:46 a.m. UTC | #4
On 2/3/22 22:00, Roger Sayle wrote:
> 
> This patch addresses the "increased register pressure" regression on
> nvptx-none caused by my change to transition the backend to a
> STORE_FLAG_VALUE = 1 target.  This improved code generation for the
> more common case of producing 0/1 Boolean values, but unfortunately
> made things marginally worse when a 0/-1 mask value is desired.
> Unfortunately, nvptx kernels are extremely sensitive to changes in
> register usage, which was observable in the reported PR.
> 
> This patch provides optimizations for -(cond ? 1 : 0), effectively
> simplify this into cond ? -1 : 0, where these ternary operators are
> provided by nvptx's selp instruction, and for the specific case of
> SImode, using (restoring) nvptx's "set" instruction (which avoids
> the need for a predicate register).
> 
> This patch has been tested on nvptx-none hosted on x86_64-pc-linux-gnu
> with a "make" and "make -k check" with no new failures.  Unfortunately,
> the exact register usage of a nvptx kernel depends upon the version of
> the Cuda drivers being used (and the hardware), but I believe this
> change should resolve the PR (for Thomas) by improving code generation
> for the cases that regressed.  Ok for mainline?
> 
> 

LGTM, applied.

Thanks,
- Tom

> 2022-02-03  Roger Sayle  <roger@nextmovesoftware.com>
> 
> gcc/ChangeLog
> 	PR target/104345
> 	* config/nvptx/nvptx.md (sel_true<mode>): Fix indentation.
> 	(sel_false<mode>): Likewise.
> 	(define_code_iterator eqne): New code iterator for EQ and NE.
> 	(*selp<mode>_neg_<code>): New define_insn_and_split to optimize
> 	the negation of a selp instruction.
> 	(*selp<mode>_not_<code>): New define_insn_and_split to optimize
> 	the bitwise not of a selp instruction.
> 	(*setcc_int<mode>): Use set instruction for neg:SI of a selp.
> 
> gcc/testsuite/ChangeLog
> 	PR target/104345
> 	* gcc.target/nvptx/neg-selp.c: New test case.
> 
> 
> Thanks in advance,
> Roger
> --
>
diff mbox series

Patch

diff --git a/gcc/config/nvptx/nvptx.md b/gcc/config/nvptx/nvptx.md
index 92768dd..651ba20 100644
--- a/gcc/config/nvptx/nvptx.md
+++ b/gcc/config/nvptx/nvptx.md
@@ -892,7 +892,7 @@ 
 
 (define_insn "sel_true<mode>"
   [(set (match_operand:HSDIM 0 "nvptx_register_operand" "=R")
-        (if_then_else:HSDIM
+	(if_then_else:HSDIM
 	  (ne (match_operand:BI 1 "nvptx_register_operand" "R") (const_int 0))
 	  (match_operand:HSDIM 2 "nvptx_nonmemory_operand" "Ri")
 	  (match_operand:HSDIM 3 "nvptx_nonmemory_operand" "Ri")))]
@@ -901,7 +901,7 @@ 
 
 (define_insn "sel_true<mode>"
   [(set (match_operand:SDFM 0 "nvptx_register_operand" "=R")
-        (if_then_else:SDFM
+	(if_then_else:SDFM
 	  (ne (match_operand:BI 1 "nvptx_register_operand" "R") (const_int 0))
 	  (match_operand:SDFM 2 "nvptx_nonmemory_operand" "RF")
 	  (match_operand:SDFM 3 "nvptx_nonmemory_operand" "RF")))]
@@ -910,7 +910,7 @@ 
 
 (define_insn "sel_false<mode>"
   [(set (match_operand:HSDIM 0 "nvptx_register_operand" "=R")
-        (if_then_else:HSDIM
+	(if_then_else:HSDIM
 	  (eq (match_operand:BI 1 "nvptx_register_operand" "R") (const_int 0))
 	  (match_operand:HSDIM 2 "nvptx_nonmemory_operand" "Ri")
 	  (match_operand:HSDIM 3 "nvptx_nonmemory_operand" "Ri")))]
@@ -919,13 +919,63 @@ 
 
 (define_insn "sel_false<mode>"
   [(set (match_operand:SDFM 0 "nvptx_register_operand" "=R")
-        (if_then_else:SDFM
+	(if_then_else:SDFM
 	  (eq (match_operand:BI 1 "nvptx_register_operand" "R") (const_int 0))
 	  (match_operand:SDFM 2 "nvptx_nonmemory_operand" "RF")
 	  (match_operand:SDFM 3 "nvptx_nonmemory_operand" "RF")))]
   ""
   "%.\\tselp%t0\\t%0, %3, %2, %1;")
 
+(define_code_iterator eqne [eq ne])
+
+;; Split negation of a predicate into a conditional move.
+(define_insn_and_split "*selp<mode>_neg_<code>"
+  [(set (match_operand:HSDIM 0 "nvptx_register_operand" "=R")
+	(neg:HSDIM (eqne:HSDIM
+		     (match_operand:BI 1 "nvptx_register_operand" "R")
+		     (const_int 0))))]
+  ""
+  "#"
+  "&& 1"
+  [(set (match_dup 0)
+	(if_then_else:HSDIM
+	  (eqne (match_dup 1) (const_int 0))
+	  (const_int -1)
+	  (const_int 0)))])
+
+;; Split bitwise not of a predicate into a conditional move.
+(define_insn_and_split "*selp<mode>_not_<code>"
+  [(set (match_operand:HSDIM 0 "nvptx_register_operand" "=R")
+	(not:HSDIM (eqne:HSDIM
+		     (match_operand:BI 1 "nvptx_register_operand" "R")
+		     (const_int 0))))]
+  ""
+  "#"
+  "&& 1"
+  [(set (match_dup 0)
+	(if_then_else:HSDIM
+	  (eqne (match_dup 1) (const_int 0))
+	  (const_int -2)
+	  (const_int -1)))])
+
+(define_insn "*setcc_int<mode>"
+  [(set (match_operand:SI 0 "nvptx_register_operand" "=R")
+	(neg:SI
+	  (match_operator:SI 1 "nvptx_comparison_operator"
+	    [(match_operand:HSDIM 2 "nvptx_register_operand" "R")
+	     (match_operand:HSDIM 3 "nvptx_nonmemory_operand" "Ri")])))]
+  ""
+  "%.\\tset%t0%c1\\t%0, %2, %3;")
+
+(define_insn "*setcc_int<mode>"
+  [(set (match_operand:SI 0 "nvptx_register_operand" "=R")
+	(neg:SI
+	  (match_operator:SI 1 "nvptx_float_comparison_operator"
+	    [(match_operand:SDFM 2 "nvptx_register_operand" "R")
+	     (match_operand:SDFM 3 "nvptx_nonmemory_operand" "RF")])))]
+  ""
+  "%.\\tset%t0%c1\\t%0, %2, %3;")
+
 (define_insn "setcc_float<mode>"
   [(set (match_operand:SF 0 "nvptx_register_operand" "=R")
 	(match_operator:SF 1 "nvptx_comparison_operator"
diff --git a/gcc/testsuite/gcc.target/nvptx/neg-selp.c b/gcc/testsuite/gcc.target/nvptx/neg-selp.c
new file mode 100644
index 0000000..a8f0118
--- /dev/null
+++ b/gcc/testsuite/gcc.target/nvptx/neg-selp.c
@@ -0,0 +1,17 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2" } */
+
+int neg(int x, int y)
+{
+  int t = (x == y) ? 1 : 0;
+  return -t;
+}
+
+int not(int x, int y)
+{
+  int t = (x == y) ? 1 : 0;
+  return ~t;
+}
+
+/* { dg-final { scan-assembler-not "neg.s32" } } */
+/* { dg-final { scan-assembler-not "not.b32" } } */