Message ID | 010501d81941$20d5a2b0$6280e810$@nextmovesoftware.com |
---|---|
State | New |
Headers | show |
Series | PR target/104345: Use nvptx "set" instruction for cond ? -1 : 0. | expand |
Hi Roger! On 2022-02-03T21:00:50+0000, "Roger Sayle" <roger@nextmovesoftware.com> wrote: > This patch Thanks! > addresses the "increased register pressure" regression on > nvptx-none caused by my change to transition the backend to a > STORE_FLAG_VALUE = 1 target. Yes, "addresses", but unfortunately doesn't "resolve". ;-| > This improved code generation for the > more common case of producing 0/1 Boolean values, but unfortunately > made things marginally worse when a 0/-1 mask value is desired. > Unfortunately, nvptx kernels are extremely sensitive to changes in > register usage, which was observable in the reported PR. > > This patch provides optimizations for -(cond ? 1 : 0), effectively > simplify this into cond ? -1 : 0, where these ternary operators are > provided by nvptx's selp instruction, and for the specific case of > SImode, using (restoring) nvptx's "set" instruction (which avoids > the need for a predicate register). I'm confirming the improved code generation (less registers used, less instructions emitted) in cases where it triggers -- but unfortunately it doesn't in the PR104345 'libgomp.oacc-c-c++-common/reduction-cplx-dbl.c' scenario. > This patch has been tested on nvptx-none hosted on x86_64-pc-linux-gnu > with a "make" and "make -k check" with no new failures. Unfortunately, > the exact register usage of a nvptx kernel depends upon the version of > the Cuda drivers being used (and the hardware), but I believe this > change should resolve the PR (for Thomas) by improving code generation > for the cases that regressed. Ok for mainline? So, testing your patch in isolation, it does *not* resolve PR104345, unfortunately. I'll next test in combination with your other pending patches: - "nvptx: Expand QI mode operations using SI mode instructions". - "nvptx: Fix and use BI mode logic instructions (e.g. and.pred)" Grüße Thomas > gcc/ChangeLog > PR target/104345 > * config/nvptx/nvptx.md (sel_true<mode>): Fix indentation. > (sel_false<mode>): Likewise. > (define_code_iterator eqne): New code iterator for EQ and NE. > (*selp<mode>_neg_<code>): New define_insn_and_split to optimize > the negation of a selp instruction. > (*selp<mode>_not_<code>): New define_insn_and_split to optimize > the bitwise not of a selp instruction. > (*setcc_int<mode>): Use set instruction for neg:SI of a selp. > > gcc/testsuite/ChangeLog > PR target/104345 > * gcc.target/nvptx/neg-selp.c: New test case. > > > Thanks in advance, > Roger > -- > > diff --git a/gcc/config/nvptx/nvptx.md b/gcc/config/nvptx/nvptx.md > index 92768dd..651ba20 100644 > --- a/gcc/config/nvptx/nvptx.md > +++ b/gcc/config/nvptx/nvptx.md > @@ -892,7 +892,7 @@ > > (define_insn "sel_true<mode>" > [(set (match_operand:HSDIM 0 "nvptx_register_operand" "=R") > - (if_then_else:HSDIM > + (if_then_else:HSDIM > (ne (match_operand:BI 1 "nvptx_register_operand" "R") (const_int 0)) > (match_operand:HSDIM 2 "nvptx_nonmemory_operand" "Ri") > (match_operand:HSDIM 3 "nvptx_nonmemory_operand" "Ri")))] > @@ -901,7 +901,7 @@ > > (define_insn "sel_true<mode>" > [(set (match_operand:SDFM 0 "nvptx_register_operand" "=R") > - (if_then_else:SDFM > + (if_then_else:SDFM > (ne (match_operand:BI 1 "nvptx_register_operand" "R") (const_int 0)) > (match_operand:SDFM 2 "nvptx_nonmemory_operand" "RF") > (match_operand:SDFM 3 "nvptx_nonmemory_operand" "RF")))] > @@ -910,7 +910,7 @@ > > (define_insn "sel_false<mode>" > [(set (match_operand:HSDIM 0 "nvptx_register_operand" "=R") > - (if_then_else:HSDIM > + (if_then_else:HSDIM > (eq (match_operand:BI 1 "nvptx_register_operand" "R") (const_int 0)) > (match_operand:HSDIM 2 "nvptx_nonmemory_operand" "Ri") > (match_operand:HSDIM 3 "nvptx_nonmemory_operand" "Ri")))] > @@ -919,13 +919,63 @@ > > (define_insn "sel_false<mode>" > [(set (match_operand:SDFM 0 "nvptx_register_operand" "=R") > - (if_then_else:SDFM > + (if_then_else:SDFM > (eq (match_operand:BI 1 "nvptx_register_operand" "R") (const_int 0)) > (match_operand:SDFM 2 "nvptx_nonmemory_operand" "RF") > (match_operand:SDFM 3 "nvptx_nonmemory_operand" "RF")))] > "" > "%.\\tselp%t0\\t%0, %3, %2, %1;") > > +(define_code_iterator eqne [eq ne]) > + > +;; Split negation of a predicate into a conditional move. > +(define_insn_and_split "*selp<mode>_neg_<code>" > + [(set (match_operand:HSDIM 0 "nvptx_register_operand" "=R") > + (neg:HSDIM (eqne:HSDIM > + (match_operand:BI 1 "nvptx_register_operand" "R") > + (const_int 0))))] > + "" > + "#" > + "&& 1" > + [(set (match_dup 0) > + (if_then_else:HSDIM > + (eqne (match_dup 1) (const_int 0)) > + (const_int -1) > + (const_int 0)))]) > + > +;; Split bitwise not of a predicate into a conditional move. > +(define_insn_and_split "*selp<mode>_not_<code>" > + [(set (match_operand:HSDIM 0 "nvptx_register_operand" "=R") > + (not:HSDIM (eqne:HSDIM > + (match_operand:BI 1 "nvptx_register_operand" "R") > + (const_int 0))))] > + "" > + "#" > + "&& 1" > + [(set (match_dup 0) > + (if_then_else:HSDIM > + (eqne (match_dup 1) (const_int 0)) > + (const_int -2) > + (const_int -1)))]) > + > +(define_insn "*setcc_int<mode>" > + [(set (match_operand:SI 0 "nvptx_register_operand" "=R") > + (neg:SI > + (match_operator:SI 1 "nvptx_comparison_operator" > + [(match_operand:HSDIM 2 "nvptx_register_operand" "R") > + (match_operand:HSDIM 3 "nvptx_nonmemory_operand" "Ri")])))] > + "" > + "%.\\tset%t0%c1\\t%0, %2, %3;") > + > +(define_insn "*setcc_int<mode>" > + [(set (match_operand:SI 0 "nvptx_register_operand" "=R") > + (neg:SI > + (match_operator:SI 1 "nvptx_float_comparison_operator" > + [(match_operand:SDFM 2 "nvptx_register_operand" "R") > + (match_operand:SDFM 3 "nvptx_nonmemory_operand" "RF")])))] > + "" > + "%.\\tset%t0%c1\\t%0, %2, %3;") > + > (define_insn "setcc_float<mode>" > [(set (match_operand:SF 0 "nvptx_register_operand" "=R") > (match_operator:SF 1 "nvptx_comparison_operator" > diff --git a/gcc/testsuite/gcc.target/nvptx/neg-selp.c b/gcc/testsuite/gcc.target/nvptx/neg-selp.c > new file mode 100644 > index 0000000..a8f0118 > --- /dev/null > +++ b/gcc/testsuite/gcc.target/nvptx/neg-selp.c > @@ -0,0 +1,17 @@ > +/* { dg-do compile } */ > +/* { dg-options "-O2" } */ > + > +int neg(int x, int y) > +{ > + int t = (x == y) ? 1 : 0; > + return -t; > +} > + > +int not(int x, int y) > +{ > + int t = (x == y) ? 1 : 0; > + return ~t; > +} > + > +/* { dg-final { scan-assembler-not "neg.s32" } } */ > +/* { dg-final { scan-assembler-not "not.b32" } } */ ----------------- Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht München, HRB 106955
Hi! As an aside: On 2022-02-03T21:00:50+0000, "Roger Sayle" <roger@nextmovesoftware.com> wrote: > the exact register usage of a nvptx kernel depends upon the version of > the Cuda drivers being used (and the hardware) Yeah, that's a "problem" -- or: "challenge"? ;-) The GCC/nvptx back end is generating some rather high-level IR (PTX) targeting a "black hole": not knowing what exactly the Nvidia/CUDA Driver, PTX -> SASS compiler are going to do with it. (Well, similar problem also exists for more traditional ISAs if CPU microcode etc. is involved, but it's certainly more severe here.) Five years ago, I asked our then Nvidia PTX contact person about ideas, "How to generate PTX code to the PTX -> SASS compiler's liking": | We're currently looking into options for improving the PTX code generated | by GCC's nvptx back end, and it came up the question about how to | generate PTX code to the PTX -> SASS compiler's liking? Is there any | documentation available regarding this? (I say "PTX -> SASS compiler" as | I don't think I know the proper name of it. For avoidance of doubt, I | mean the "component" that sits between the PTX code we feed into | cuLinkAddData, and what actually gets executed on the GPU as SASS code. | Presumably the same "component" that is part of the "ptxas" tool?) | | As always, there are often many different variants for expressing the | same thing. A few examples. | | [...] | | ;-) Any so on, and so forth. Are there any generic recommendations, | "best practice"? The answer was: | I don't know of any official documentation; in general we have tuned the backend to the PTX that we generate, so following that lead will give you the best results. So, yeah. :-\ Understandable and not unexpected, though. Grüße Thomas ----------------- Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht München, HRB 106955
Hi Thomas, Very many thanks for your help investigating this problem. > > This patch addresses the "increased register pressure" regression > > on nvptx-none caused by my change to transition the backend to > > a STORE_FLAG_VALUE = 1 target. > > Yes, "addresses", but unfortunately doesn't "resolve". ;-| Doh! > I'm confirming the improved code generation (less registers used, less > instructions emitted) in cases where it triggers -- but unfortunately it > doesn't in the PR104345 'libgomp.oacc-c-c++-common/reduction-cplx-dbl.c' > scenario. Looking over the nvptx code currently generated for reduction-cplx-dbl.c, it appears nearly optimal and it's difficult to see what could have regressed. [It makes almost no uses of Boolean types, so is relatively unaffected by a STORE_FLAG_VALUE change]. One remaining possibility is that the "register usage" regression is not in reduction-cplx-dbl.c itself but in __muldc3 in libgcc.a. [I believe kernel resource usage is computed including all called functions]. Might this be easy to test on your configuration, moving libgcc.a from one build to another? If it is __muldc3 regressing, then the other nvptx patches mentioned previously, and perhaps even improvements to isnan and isinf, may help. Again apologies that the "using nvptx set.?32 for "cond ? -1: 0" patch, that catches many of the issues observed in your initial PR analysis, isn't actually the root cause of this particular case. Thanks again, Roger --
On 2/3/22 22:00, Roger Sayle wrote: > > This patch addresses the "increased register pressure" regression on > nvptx-none caused by my change to transition the backend to a > STORE_FLAG_VALUE = 1 target. This improved code generation for the > more common case of producing 0/1 Boolean values, but unfortunately > made things marginally worse when a 0/-1 mask value is desired. > Unfortunately, nvptx kernels are extremely sensitive to changes in > register usage, which was observable in the reported PR. > > This patch provides optimizations for -(cond ? 1 : 0), effectively > simplify this into cond ? -1 : 0, where these ternary operators are > provided by nvptx's selp instruction, and for the specific case of > SImode, using (restoring) nvptx's "set" instruction (which avoids > the need for a predicate register). > > This patch has been tested on nvptx-none hosted on x86_64-pc-linux-gnu > with a "make" and "make -k check" with no new failures. Unfortunately, > the exact register usage of a nvptx kernel depends upon the version of > the Cuda drivers being used (and the hardware), but I believe this > change should resolve the PR (for Thomas) by improving code generation > for the cases that regressed. Ok for mainline? > > LGTM, applied. Thanks, - Tom > 2022-02-03 Roger Sayle <roger@nextmovesoftware.com> > > gcc/ChangeLog > PR target/104345 > * config/nvptx/nvptx.md (sel_true<mode>): Fix indentation. > (sel_false<mode>): Likewise. > (define_code_iterator eqne): New code iterator for EQ and NE. > (*selp<mode>_neg_<code>): New define_insn_and_split to optimize > the negation of a selp instruction. > (*selp<mode>_not_<code>): New define_insn_and_split to optimize > the bitwise not of a selp instruction. > (*setcc_int<mode>): Use set instruction for neg:SI of a selp. > > gcc/testsuite/ChangeLog > PR target/104345 > * gcc.target/nvptx/neg-selp.c: New test case. > > > Thanks in advance, > Roger > -- >
diff --git a/gcc/config/nvptx/nvptx.md b/gcc/config/nvptx/nvptx.md index 92768dd..651ba20 100644 --- a/gcc/config/nvptx/nvptx.md +++ b/gcc/config/nvptx/nvptx.md @@ -892,7 +892,7 @@ (define_insn "sel_true<mode>" [(set (match_operand:HSDIM 0 "nvptx_register_operand" "=R") - (if_then_else:HSDIM + (if_then_else:HSDIM (ne (match_operand:BI 1 "nvptx_register_operand" "R") (const_int 0)) (match_operand:HSDIM 2 "nvptx_nonmemory_operand" "Ri") (match_operand:HSDIM 3 "nvptx_nonmemory_operand" "Ri")))] @@ -901,7 +901,7 @@ (define_insn "sel_true<mode>" [(set (match_operand:SDFM 0 "nvptx_register_operand" "=R") - (if_then_else:SDFM + (if_then_else:SDFM (ne (match_operand:BI 1 "nvptx_register_operand" "R") (const_int 0)) (match_operand:SDFM 2 "nvptx_nonmemory_operand" "RF") (match_operand:SDFM 3 "nvptx_nonmemory_operand" "RF")))] @@ -910,7 +910,7 @@ (define_insn "sel_false<mode>" [(set (match_operand:HSDIM 0 "nvptx_register_operand" "=R") - (if_then_else:HSDIM + (if_then_else:HSDIM (eq (match_operand:BI 1 "nvptx_register_operand" "R") (const_int 0)) (match_operand:HSDIM 2 "nvptx_nonmemory_operand" "Ri") (match_operand:HSDIM 3 "nvptx_nonmemory_operand" "Ri")))] @@ -919,13 +919,63 @@ (define_insn "sel_false<mode>" [(set (match_operand:SDFM 0 "nvptx_register_operand" "=R") - (if_then_else:SDFM + (if_then_else:SDFM (eq (match_operand:BI 1 "nvptx_register_operand" "R") (const_int 0)) (match_operand:SDFM 2 "nvptx_nonmemory_operand" "RF") (match_operand:SDFM 3 "nvptx_nonmemory_operand" "RF")))] "" "%.\\tselp%t0\\t%0, %3, %2, %1;") +(define_code_iterator eqne [eq ne]) + +;; Split negation of a predicate into a conditional move. +(define_insn_and_split "*selp<mode>_neg_<code>" + [(set (match_operand:HSDIM 0 "nvptx_register_operand" "=R") + (neg:HSDIM (eqne:HSDIM + (match_operand:BI 1 "nvptx_register_operand" "R") + (const_int 0))))] + "" + "#" + "&& 1" + [(set (match_dup 0) + (if_then_else:HSDIM + (eqne (match_dup 1) (const_int 0)) + (const_int -1) + (const_int 0)))]) + +;; Split bitwise not of a predicate into a conditional move. +(define_insn_and_split "*selp<mode>_not_<code>" + [(set (match_operand:HSDIM 0 "nvptx_register_operand" "=R") + (not:HSDIM (eqne:HSDIM + (match_operand:BI 1 "nvptx_register_operand" "R") + (const_int 0))))] + "" + "#" + "&& 1" + [(set (match_dup 0) + (if_then_else:HSDIM + (eqne (match_dup 1) (const_int 0)) + (const_int -2) + (const_int -1)))]) + +(define_insn "*setcc_int<mode>" + [(set (match_operand:SI 0 "nvptx_register_operand" "=R") + (neg:SI + (match_operator:SI 1 "nvptx_comparison_operator" + [(match_operand:HSDIM 2 "nvptx_register_operand" "R") + (match_operand:HSDIM 3 "nvptx_nonmemory_operand" "Ri")])))] + "" + "%.\\tset%t0%c1\\t%0, %2, %3;") + +(define_insn "*setcc_int<mode>" + [(set (match_operand:SI 0 "nvptx_register_operand" "=R") + (neg:SI + (match_operator:SI 1 "nvptx_float_comparison_operator" + [(match_operand:SDFM 2 "nvptx_register_operand" "R") + (match_operand:SDFM 3 "nvptx_nonmemory_operand" "RF")])))] + "" + "%.\\tset%t0%c1\\t%0, %2, %3;") + (define_insn "setcc_float<mode>" [(set (match_operand:SF 0 "nvptx_register_operand" "=R") (match_operator:SF 1 "nvptx_comparison_operator" diff --git a/gcc/testsuite/gcc.target/nvptx/neg-selp.c b/gcc/testsuite/gcc.target/nvptx/neg-selp.c new file mode 100644 index 0000000..a8f0118 --- /dev/null +++ b/gcc/testsuite/gcc.target/nvptx/neg-selp.c @@ -0,0 +1,17 @@ +/* { dg-do compile } */ +/* { dg-options "-O2" } */ + +int neg(int x, int y) +{ + int t = (x == y) ? 1 : 0; + return -t; +} + +int not(int x, int y) +{ + int t = (x == y) ? 1 : 0; + return ~t; +} + +/* { dg-final { scan-assembler-not "neg.s32" } } */ +/* { dg-final { scan-assembler-not "not.b32" } } */