Message ID | 001654df-a083-c10f-6c0f-dfda0de6e87b@linux.ibm.com |
---|---|
State | New |
Headers | show |
Series | [v3,rs6000] Disable TImode from Bool expanders [PR100694, PR93123] | expand |
Hi! On Mon, Jul 04, 2022 at 02:27:42PM +0800, HAO CHEN GUI wrote: > This patch fails TImode for all 128-bit logical operation expanders. So > TImode splits to two DI registers during expand. Potential optimizations can > be taken after expand pass. Originally, the TImode logical operations are > split after reload pass. It's too late. The test case illustrates it. Out of interest, did you see any performance win? There is a lot of opportunity for this to cause performance *loss*, on newer systems :-( > ChangeLog > 2022-07-04 Haochen Gui <guihaoc@linux.ibm.com> Two spaces both before and after your name, in changelogs. > --- a/gcc/config/rs6000/rs6000.md > +++ b/gcc/config/rs6000/rs6000.md > @@ -7078,27 +7078,38 @@ (define_expand "subti3" > }) > > ;; 128-bit logical operations expanders > +;; Fail TImode in all 128-bit logical operations expanders and split it into > +;; two DI registers. > > (define_expand "and<mode>3" > [(set (match_operand:BOOL_128 0 "vlogical_operand") > (and:BOOL_128 (match_operand:BOOL_128 1 "vlogical_operand") > (match_operand:BOOL_128 2 "vlogical_operand")))] > "" > - "") > +{ > + if (<MODE>mode == TImode) > + FAIL; > +}) It is better to not FAIL it, but simply not have a pattern for the TImode version at all. Does nothing depend on the :TI version to exist? What about the :PTI version? Getting rid of that as well will allow some nice optimisations. Of course we *do* have instructions to do such TImode ops, on newer CPUs, but in vector registers only. It isn't obvious what is faster. > --- /dev/null > +++ b/gcc/testsuite/gcc.target/powerpc/pr100694.c > @@ -0,0 +1,16 @@ > +/* { dg-do compile } */ > +/* { dg-require-effective-target int128 } */ > +/* { dg-options "-O2" } */ > +/* { dg-final { scan-assembler-times {\mstd\M} 2 } } */ > +/* { dg-final { scan-assembler-not {\mli\M} } } */ > +/* { dg-final { scan-assembler-not {\mor\M} } } */ > + > +/* It just needs two std. */ > +void foo (unsigned __int128* res, unsigned long long hi, unsigned long long lo) > +{ > + unsigned __int128 i = hi; > + i <<= 64; > + i |= lo; > + *res = i; > +} You can also just count the number of generated insns: /* { dg-final { scan-assembler-times {(?n)^\s+[a-z]} 3 } } */ (three, not two, because of the blr insn at the end). If possible, we should simply not do :TI ops on older systems at all, and only on the newer systems that have instructions for it (and that does not fix PR100694 btw, the problems there have to be solved, not side-stepped :-( ) Segher
Hi Segher, Thanks for your comments. On 13/7/2022 上午 1:26, Segher Boessenkool wrote: >> --- a/gcc/config/rs6000/rs6000.md >> +++ b/gcc/config/rs6000/rs6000.md >> @@ -7078,27 +7078,38 @@ (define_expand "subti3" >> }) >> >> ;; 128-bit logical operations expanders >> +;; Fail TImode in all 128-bit logical operations expanders and split it into >> +;; two DI registers. >> >> (define_expand "and<mode>3" >> [(set (match_operand:BOOL_128 0 "vlogical_operand") >> (and:BOOL_128 (match_operand:BOOL_128 1 "vlogical_operand") >> (match_operand:BOOL_128 2 "vlogical_operand")))] >> "" >> - "") >> +{ >> + if (<MODE>mode == TImode) >> + FAIL; >> +}) > It is better to not FAIL it, but simply not have a pattern for the > TImode version at all. > > Does nothing depend on the :TI version to exist? > > What about the :PTI version? Getting rid of that as well will allow > some nice optimisations. > > Of course we *do* have instructions to do such TImode ops, on newer > CPUs, but in vector registers only. It isn't obvious what is faster. > During expand, TI mode is split to two registers when it can't match any expands. So I failed TI mode in each expand and expect to be split at expand. TI mode is still in some insn_and_split patterns (e.g. "*and<mode>3_internal"). If later rtl passes generate TI mode logical operations, they still can be matched. Originally, the TI mode is split after reload pass by rs6000_split_logical. It's too late to catch some rtl optimizations. For the PTI, it can't be split to two registers during expand. PTI requires an even/odd register pair. So splitting it after reload can make sure it gets correct registers, I think. From my understanding, it's sub-optimal to use vector logical operation instructions for TI mode if the destination is an integer operand. It needs three instructions (move to vector register, vector logical operation and move from vector register). When splitting TImode, it only needs two logical instructions on two separate registers. Thanks again Gui Haochen
diff --git a/gcc/config/rs6000/rs6000.md b/gcc/config/rs6000/rs6000.md index c55ee7e171a..6e57aac3ebf 100644 --- a/gcc/config/rs6000/rs6000.md +++ b/gcc/config/rs6000/rs6000.md @@ -7078,27 +7078,38 @@ (define_expand "subti3" }) ;; 128-bit logical operations expanders +;; Fail TImode in all 128-bit logical operations expanders and split it into +;; two DI registers. (define_expand "and<mode>3" [(set (match_operand:BOOL_128 0 "vlogical_operand") (and:BOOL_128 (match_operand:BOOL_128 1 "vlogical_operand") (match_operand:BOOL_128 2 "vlogical_operand")))] "" - "") +{ + if (<MODE>mode == TImode) + FAIL; +}) (define_expand "ior<mode>3" [(set (match_operand:BOOL_128 0 "vlogical_operand") (ior:BOOL_128 (match_operand:BOOL_128 1 "vlogical_operand") (match_operand:BOOL_128 2 "vlogical_operand")))] "" - "") +{ + if (<MODE>mode == TImode) + FAIL; +}) (define_expand "xor<mode>3" [(set (match_operand:BOOL_128 0 "vlogical_operand") (xor:BOOL_128 (match_operand:BOOL_128 1 "vlogical_operand") (match_operand:BOOL_128 2 "vlogical_operand")))] "" - "") +{ + if (<MODE>mode == TImode) + FAIL; +}) (define_expand "nor<mode>3" [(set (match_operand:BOOL_128 0 "vlogical_operand") @@ -7106,7 +7117,10 @@ (define_expand "nor<mode>3" (not:BOOL_128 (match_operand:BOOL_128 1 "vlogical_operand")) (not:BOOL_128 (match_operand:BOOL_128 2 "vlogical_operand"))))] "" - "") +{ + if (<MODE>mode == TImode) + FAIL; +}) (define_expand "andc<mode>3" [(set (match_operand:BOOL_128 0 "vlogical_operand") @@ -7114,7 +7128,10 @@ (define_expand "andc<mode>3" (not:BOOL_128 (match_operand:BOOL_128 2 "vlogical_operand")) (match_operand:BOOL_128 1 "vlogical_operand")))] "" - "") +{ + if (<MODE>mode == TImode) + FAIL; +}) ;; Power8 vector logical instructions. (define_expand "eqv<mode>3" @@ -7123,7 +7140,10 @@ (define_expand "eqv<mode>3" (xor:BOOL_128 (match_operand:BOOL_128 1 "vlogical_operand") (match_operand:BOOL_128 2 "vlogical_operand"))))] "<MODE>mode == TImode || <MODE>mode == PTImode || TARGET_P8_VECTOR" - "") +{ + if (<MODE>mode == TImode) + FAIL; +}) ;; Rewrite nand into canonical form (define_expand "nand<mode>3" @@ -7132,7 +7152,10 @@ (define_expand "nand<mode>3" (not:BOOL_128 (match_operand:BOOL_128 1 "vlogical_operand")) (not:BOOL_128 (match_operand:BOOL_128 2 "vlogical_operand"))))] "<MODE>mode == TImode || <MODE>mode == PTImode || TARGET_P8_VECTOR" - "") +{ + if (<MODE>mode == TImode) + FAIL; +}) ;; The canonical form is to have the negated element first, so we need to ;; reverse arguments. @@ -7142,7 +7165,10 @@ (define_expand "orc<mode>3" (not:BOOL_128 (match_operand:BOOL_128 2 "vlogical_operand")) (match_operand:BOOL_128 1 "vlogical_operand")))] "<MODE>mode == TImode || <MODE>mode == PTImode || TARGET_P8_VECTOR" - "") +{ + if (<MODE>mode == TImode) + FAIL; +}) ;; 128-bit logical operations insns and split operations (define_insn_and_split "*and<mode>3_internal" @@ -7394,7 +7420,17 @@ (define_insn_and_split "*eqv<mode>3_internal2" (const_string "16")))]) ;; 128-bit one's complement -(define_insn_and_split "one_cmpl<mode>2" +(define_expand "one_cmpl<mode>2" + [(set (match_operand:BOOL_128 0 "vlogical_operand" "=<BOOL_REGS_OUTPUT>") + (not:BOOL_128 + (match_operand:BOOL_128 1 "vlogical_operand" "<BOOL_REGS_UNARY>")))] + "" +{ + if (<MODE>mode == TImode) + FAIL; +}) + +(define_insn_and_split "*one_cmpl<mode>2" [(set (match_operand:BOOL_128 0 "vlogical_operand" "=<BOOL_REGS_OUTPUT>") (not:BOOL_128 (match_operand:BOOL_128 1 "vlogical_operand" "<BOOL_REGS_UNARY>")))] diff --git a/gcc/testsuite/gcc.target/powerpc/pr100694.c b/gcc/testsuite/gcc.target/powerpc/pr100694.c new file mode 100644 index 00000000000..99dd3ca89ff --- /dev/null +++ b/gcc/testsuite/gcc.target/powerpc/pr100694.c @@ -0,0 +1,16 @@ +/* { dg-do compile } */ +/* { dg-require-effective-target int128 } */ +/* { dg-options "-O2" } */ +/* { dg-final { scan-assembler-times {\mstd\M} 2 } } */ +/* { dg-final { scan-assembler-not {\mli\M} } } */ +/* { dg-final { scan-assembler-not {\mor\M} } } */ + +/* It just needs two std. */ +void foo (unsigned __int128* res, unsigned long long hi, unsigned long long lo) +{ + unsigned __int128 i = hi; + i <<= 64; + i |= lo; + *res = i; +} + diff --git a/gcc/testsuite/gcc.target/powerpc/pr92398.c b/gcc/testsuite/gcc.target/powerpc/pr92398.c new file mode 100644 index 00000000000..7d6201cc5bb --- /dev/null +++ b/gcc/testsuite/gcc.target/powerpc/pr92398.c @@ -0,0 +1,12 @@ +/* { dg-do compile } */ +/* { dg-require-effective-target int128 } */ +/* { dg-options "-O2" } */ +/* { dg-final { scan-assembler-times {\mnot\M} 2 } } */ +/* { dg-final { scan-assembler-times {\mstd\M} 2 } } */ + +/* All platforms should generate the same instructions: not;not;std;std. */ +void bar (__int128_t *dst, __int128_t src) +{ + *dst = ~src; +} + diff --git a/gcc/testsuite/gcc.target/powerpc/pr92398.h b/gcc/testsuite/gcc.target/powerpc/pr92398.h deleted file mode 100644 index 5a4a8bcab80..00000000000 --- a/gcc/testsuite/gcc.target/powerpc/pr92398.h +++ /dev/null @@ -1,17 +0,0 @@ -/* This test code is included into pr92398.p9-.c and pr92398.p9+.c. - The two files have the tests for the number of instructions generated for - P9- versus P9+. - - store generates difference instructions as below: - P9+: mtvsrdd;xxlnot;stxv. - P8/P7/P6 LE: not;not;std;std. - P8 BE: mtvsrd;mtvsrd;xxpermdi;xxlnor;stxvd2x. - P7/P6 BE: std;std;addi;lxvd2x;xxlnor;stxvd2x. - P9+ and P9- LE are expected, P6/P7/P8 BE are unexpected. */ - -void -bar (__int128_t *dst, __int128_t src) -{ - *dst = ~src; -} - diff --git a/gcc/testsuite/gcc.target/powerpc/pr92398.p9+.c b/gcc/testsuite/gcc.target/powerpc/pr92398.p9+.c deleted file mode 100644 index 72dd1d9a274..00000000000 --- a/gcc/testsuite/gcc.target/powerpc/pr92398.p9+.c +++ /dev/null @@ -1,12 +0,0 @@ -/* { dg-do compile { target { lp64 && has_arch_pwr9 } } } */ -/* { dg-require-effective-target powerpc_vsx_ok } */ -/* { dg-options "-O2 -mvsx" } */ - -/* { dg-final { scan-assembler-times {\mmtvsrdd\M} 1 } } */ -/* { dg-final { scan-assembler-times {\mxxlnor\M} 1 } } */ -/* { dg-final { scan-assembler-times {\mstxv\M} 1 } } */ -/* { dg-final { scan-assembler-not {\mld\M} } } */ -/* { dg-final { scan-assembler-not {\mnot\M} } } */ - -/* Source code for the test in pr92398.h */ -#include "pr92398.h" diff --git a/gcc/testsuite/gcc.target/powerpc/pr92398.p9-.c b/gcc/testsuite/gcc.target/powerpc/pr92398.p9-.c deleted file mode 100644 index bd7fa98af51..00000000000 --- a/gcc/testsuite/gcc.target/powerpc/pr92398.p9-.c +++ /dev/null @@ -1,10 +0,0 @@ -/* { dg-do compile { target { lp64 && {! has_arch_pwr9} } } } */ -/* { dg-require-effective-target powerpc_vsx_ok } */ -/* { dg-options "-O2 -mvsx" } */ - -/* { dg-final { scan-assembler-times {\mnot\M} 2 { xfail be } } } */ -/* { dg-final { scan-assembler-times {\mstd\M} 2 { xfail { { {! has_arch_pwr9} && has_arch_pwr8 } && be } } } } */ - -/* Source code for the test in pr92398.h */ -#include "pr92398.h" -