Message ID | 20131217115034.GB28733@bubble.grove.modra.org |
---|---|
State | New |
Headers | show |
On Tue, Dec 17, 2013 at 6:50 AM, Alan Modra <amodra@gmail.com> wrote: > Bootstrapped and regression tested powerpc64-linux. Output of > pr53199.c inspected for sanity with -mcpu=power{6,7} -m{32,64} and > {-mlra,}. OK to apply? > > gcc/ > * config/rs6000/rs6000.md (bswapdi2): Remove one scratch reg. > Modify Z->r bswapdi splitter to use dest in place of scratch. > In r->Z and Z->r bswapdi splitter rename word_high, word_low > to word1, word2 and rearrange logic to suit. > (bswapdi2_64bit): Remove early clobber on Z->r alternative. > (bswapdi2_ldbrx): Likewise. Remove '??' on r->r. > (bswapdi2_32bit): Remove early clobber on Z->r alternative. > Add one '?' on r->r. Modify Z->r splitter to avoid need for > early clobber. > gcc/testsuite/ > * gcc.target/powerpc/pr53199.c: Add extra functions. > @@ -2438,29 +2433,29 @@ > addr2 = gen_rtx_PLUS (Pmode, op2, addr1); > } > > + word1 = change_address (src, SImode, addr1); > + word2 = change_address (src, SImode, addr2); > + > if (BYTES_BIG_ENDIAN) > { > - word_high = change_address (src, SImode, addr1); > - word_low = change_address (src, SImode, addr2); > + emit_insn (gen_bswapsi2 (op3_32, word2)); > + emit_insn (gen_bswapsi2 (dest_32, word1)); > } > else > { > - word_high = change_address (src, SImode, addr2); > - word_low = change_address (src, SImode, addr1); > + emit_insn (gen_bswapsi2 (op3_32, word1)); > + emit_insn (gen_bswapsi2 (dest_32, word2)); > } > > - emit_insn (gen_bswapsi2 (op3_32, word_low)); > - emit_insn (gen_bswapsi2 (op4_32, word_high)); > - emit_insn (gen_ashldi3 (dest, op3, GEN_INT (32))); > - emit_insn (gen_iordi3 (dest, dest, op4)); > + emit_insn (gen_ashldi3 (op3, op3, GEN_INT (32))); > + emit_insn (gen_iordi3 (dest, dest, op3)); > }") Why change the code from swapping the words at the initial change_address() to swapping the words in the call to gen_bswapsi2()? - David
On Wed, Dec 18, 2013 at 09:53:38AM -0500, David Edelsohn wrote: https://gcc.gnu.org/ml/gcc-patches/2013-12/msg01599.html > Why change the code from swapping the words at the initial > change_address() to swapping the words in the call to gen_bswapsi2()? Sorry for dropping this on the floor for so long. I've been prodded back into action by Redhat people and pr63150. I don't recall a compelling technical reason for the change. It was probably to make my life easier in tracking the lifetimes of addr1 and addr2, necessary due to losing one of the scratch registers along with early clobbers. (In the splitter you question, addr1 might be the same register as dest/dest_32.) I suppose it also makes those splitters look a little more like the one for bswapdi2_32bit, so a plus for maintenance. The patch applies with some minor changes (see pr63150) and I've checked for regressions on a current powerpc64le build. OK to apply, and on the branches?
On Tue, Dec 17, 2013 at 6:50 AM, Alan Modra <amodra@gmail.com> wrote: > This patch is aimed at fixing test failures introduced by my > 2013-12-07 change to bswapdi2_32bit: > FAIL: gcc.target/powerpc/pr53199.c scan-assembler-times lwbrx 6 > FAIL: gcc.target/powerpc/pr53199.c scan-assembler-times stwbrx 6 > > The 2013-12-07 change was necessary to make -m32 -mlra generate good > code for pr53199.c:reg_reverse. Too many '?'s on the r->r alternative > result in lra never choosing that option. Instead we get Z->r, > ie. storing the input reg to a stack temp then using lwbrx from there. > That means we have a load-hit-store flush with a severe slowdown. > (See http://gcc.gnu.org/ml/gcc-patches/2013-12/msg00002.html for the > corresponding -m64 result, a 4x slowdown.) > > A similar problem occurs with -m64 -mcpu=power7 -mlra due to > bswapdi2_ldbrx having two '?'s on r->r. > > To fix this I ran into a whole lot of pain. reload and lra are quite > different in their selection of insn alternatives. I could not find > any combination of '?'s that generated the best code for both reload > an lra on pr53199.c. To see why, it's necessary to look (from a great > height) at how reload chooses amongst alternatives. A particular > alternative gets a "loser" score, with the lowest "loser" being > chosen. "loser" is incremented > a) when an operand does not match its constraint. > b) when an alternative has a '?' in the constraint. > c) when a scratch register is required. > d) when an early clobber output clashes with one of the inputs. > > a) is fairly obvious. For example, if we have a MEM when the operand > alternative needs a reg, then we'll require a reload. > b) is also quite obvious. Multiple '?'s accumulate. > c) is a little more subtle. It bites you when alternatives require > differing numbers of scratch registers. Take for example > bswapdi2_64bit, which before this patch has three alternatives > Z->r with 3 scratch regs, (Z is a subset of m) > r->Z with 2 scratch regs, > r->r with 3 scratch regs. > All other things being equal, with reload you could correct this > disparity by adding a '?' to the r->Z alternative. We might want > to do that so that Z->r and r->r are the same "distance" apart > as r->Z is from r->r. With lra it seems that scratch regs are > weighted differently.. > d) is also tricky, and a trap for anyone optimizing insn selection for > functions like some in pr53199.c that have just one rtl insn with > early clobbers. PowerPC generally returns function results in the > same register as the first argument, so these hit the early > clobber. Code elsewhere in larger functions probably won't. > lra penalizes early clobbers differently to reload (a lot less). > > So, putting this all together.. Point (d) test implication is covered > by the additional functions in pr53199.c. Avoiding early clobbers > where possible is good since it reduces differences between reload and > lra insn alternative costing. We also generate better code. I > managed to do this for all the Z->r bswapdi patterns, but stopped > short of changing r->r as well since at that point everything looked > OK! Avoiding extra scratch registers for one alternative is also good. > The op4 r->r scratch wasn't even used, and op4 for Z->r fairly easy to > do without. Renaming word_high/low to word1/2 was to make it a little > easier to track lifetime of addr1/2. > > Bootstrapped and regression tested powerpc64-linux. Output of > pr53199.c inspected for sanity with -mcpu=power{6,7} -m{32,64} and > {-mlra,}. OK to apply? > > gcc/ > * config/rs6000/rs6000.md (bswapdi2): Remove one scratch reg. > Modify Z->r bswapdi splitter to use dest in place of scratch. > In r->Z and Z->r bswapdi splitter rename word_high, word_low > to word1, word2 and rearrange logic to suit. > (bswapdi2_64bit): Remove early clobber on Z->r alternative. > (bswapdi2_ldbrx): Likewise. Remove '??' on r->r. > (bswapdi2_32bit): Remove early clobber on Z->r alternative. > Add one '?' on r->r. Modify Z->r splitter to avoid need for > early clobber. > gcc/testsuite/ > * gcc.target/powerpc/pr53199.c: Add extra functions. Okay. Thanks, David
On Sun, Mar 15, 2015 at 10:51:02PM -0400, David Edelsohn wrote: > On Tue, Dec 17, 2013 at 6:50 AM, Alan Modra <amodra@gmail.com> wrote: PR target/61350 > > gcc/ > > * config/rs6000/rs6000.md (bswapdi2): Remove one scratch reg. > > Modify Z->r bswapdi splitter to use dest in place of scratch. > > In r->Z and Z->r bswapdi splitter rename word_high, word_low > > to word1, word2 and rearrange logic to suit. > > (bswapdi2_64bit): Remove early clobber on Z->r alternative. > > (bswapdi2_ldbrx): Likewise. Remove '??' on r->r. > > (bswapdi2_32bit): Remove early clobber on Z->r alternative. > > Add one '?' on r->r. Modify Z->r splitter to avoid need for > > early clobber. > > gcc/testsuite/ > > * gcc.target/powerpc/pr53199.c: Add extra functions. > > Okay. Committed revision 221445. I'll leave it a few days before applying to gcc-4.9.
Index: gcc/config/rs6000/rs6000.md =================================================================== --- gcc/config/rs6000/rs6000.md (revision 206009) +++ gcc/config/rs6000/rs6000.md (working copy) @@ -2344,8 +2344,7 @@ (bswap:DI (match_operand:DI 1 "reg_or_mem_operand" ""))) (clobber (match_scratch:DI 2 "")) - (clobber (match_scratch:DI 3 "")) - (clobber (match_scratch:DI 4 ""))])] + (clobber (match_scratch:DI 3 ""))])] "" { if (!REG_P (operands[0]) && !REG_P (operands[1])) @@ -2363,11 +2362,10 @@ ;; Power7/cell has ldbrx/stdbrx, so use it directly (define_insn "*bswapdi2_ldbrx" - [(set (match_operand:DI 0 "reg_or_mem_operand" "=&r,Z,??&r") + [(set (match_operand:DI 0 "reg_or_mem_operand" "=r,Z,&r") (bswap:DI (match_operand:DI 1 "reg_or_mem_operand" "Z,r,r"))) (clobber (match_scratch:DI 2 "=X,X,&r")) - (clobber (match_scratch:DI 3 "=X,X,&r")) - (clobber (match_scratch:DI 4 "=X,X,&r"))] + (clobber (match_scratch:DI 3 "=X,X,&r"))] "TARGET_POWERPC64 && TARGET_LDBRX && (REG_P (operands[0]) || REG_P (operands[1]))" "@ @@ -2379,11 +2377,10 @@ ;; Non-power7/cell, fall back to use lwbrx/stwbrx (define_insn "*bswapdi2_64bit" - [(set (match_operand:DI 0 "reg_or_mem_operand" "=&r,Z,&r") + [(set (match_operand:DI 0 "reg_or_mem_operand" "=r,Z,&r") (bswap:DI (match_operand:DI 1 "reg_or_mem_operand" "Z,r,r"))) (clobber (match_scratch:DI 2 "=&b,&b,&r")) - (clobber (match_scratch:DI 3 "=&r,&r,&r")) - (clobber (match_scratch:DI 4 "=&r,X,&r"))] + (clobber (match_scratch:DI 3 "=&r,&r,&r"))] "TARGET_POWERPC64 && !TARGET_LDBRX && (REG_P (operands[0]) || REG_P (operands[1])) && !(MEM_P (operands[0]) && MEM_VOLATILE_P (operands[0])) @@ -2395,8 +2392,7 @@ [(set (match_operand:DI 0 "gpc_reg_operand" "") (bswap:DI (match_operand:DI 1 "indexed_or_indirect_operand" ""))) (clobber (match_operand:DI 2 "gpc_reg_operand" "")) - (clobber (match_operand:DI 3 "gpc_reg_operand" "")) - (clobber (match_operand:DI 4 "gpc_reg_operand" ""))] + (clobber (match_operand:DI 3 "gpc_reg_operand" ""))] "TARGET_POWERPC64 && !TARGET_LDBRX && reload_completed" [(const_int 0)] " @@ -2405,15 +2401,14 @@ rtx src = operands[1]; rtx op2 = operands[2]; rtx op3 = operands[3]; - rtx op4 = operands[4]; rtx op3_32 = simplify_gen_subreg (SImode, op3, DImode, BYTES_BIG_ENDIAN ? 4 : 0); - rtx op4_32 = simplify_gen_subreg (SImode, op4, DImode, - BYTES_BIG_ENDIAN ? 4 : 0); + rtx dest_32 = simplify_gen_subreg (SImode, dest, DImode, + BYTES_BIG_ENDIAN ? 4 : 0); rtx addr1; rtx addr2; - rtx word_high; - rtx word_low; + rtx word1; + rtx word2; addr1 = XEXP (src, 0); if (GET_CODE (addr1) == PLUS) @@ -2438,29 +2433,29 @@ addr2 = gen_rtx_PLUS (Pmode, op2, addr1); } + word1 = change_address (src, SImode, addr1); + word2 = change_address (src, SImode, addr2); + if (BYTES_BIG_ENDIAN) { - word_high = change_address (src, SImode, addr1); - word_low = change_address (src, SImode, addr2); + emit_insn (gen_bswapsi2 (op3_32, word2)); + emit_insn (gen_bswapsi2 (dest_32, word1)); } else { - word_high = change_address (src, SImode, addr2); - word_low = change_address (src, SImode, addr1); + emit_insn (gen_bswapsi2 (op3_32, word1)); + emit_insn (gen_bswapsi2 (dest_32, word2)); } - emit_insn (gen_bswapsi2 (op3_32, word_low)); - emit_insn (gen_bswapsi2 (op4_32, word_high)); - emit_insn (gen_ashldi3 (dest, op3, GEN_INT (32))); - emit_insn (gen_iordi3 (dest, dest, op4)); + emit_insn (gen_ashldi3 (op3, op3, GEN_INT (32))); + emit_insn (gen_iordi3 (dest, dest, op3)); }") (define_split [(set (match_operand:DI 0 "indexed_or_indirect_operand" "") (bswap:DI (match_operand:DI 1 "gpc_reg_operand" ""))) (clobber (match_operand:DI 2 "gpc_reg_operand" "")) - (clobber (match_operand:DI 3 "gpc_reg_operand" "")) - (clobber (match_operand:DI 4 "" ""))] + (clobber (match_operand:DI 3 "gpc_reg_operand" ""))] "TARGET_POWERPC64 && !TARGET_LDBRX && reload_completed" [(const_int 0)] " @@ -2475,8 +2470,8 @@ BYTES_BIG_ENDIAN ? 4 : 0); rtx addr1; rtx addr2; - rtx word_high; - rtx word_low; + rtx word1; + rtx word2; addr1 = XEXP (dest, 0); if (GET_CODE (addr1) == PLUS) @@ -2501,27 +2496,28 @@ addr2 = gen_rtx_PLUS (Pmode, op2, addr1); } + word1 = change_address (dest, SImode, addr1); + word2 = change_address (dest, SImode, addr2); + emit_insn (gen_lshrdi3 (op3, src, GEN_INT (32))); + if (BYTES_BIG_ENDIAN) { - word_high = change_address (dest, SImode, addr1); - word_low = change_address (dest, SImode, addr2); + emit_insn (gen_bswapsi2 (word1, src_si)); + emit_insn (gen_bswapsi2 (word2, op3_si)); } else { - word_high = change_address (dest, SImode, addr2); - word_low = change_address (dest, SImode, addr1); + emit_insn (gen_bswapsi2 (word2, src_si)); + emit_insn (gen_bswapsi2 (word1, op3_si)); } - emit_insn (gen_bswapsi2 (word_high, src_si)); - emit_insn (gen_bswapsi2 (word_low, op3_si)); }") (define_split [(set (match_operand:DI 0 "gpc_reg_operand" "") (bswap:DI (match_operand:DI 1 "gpc_reg_operand" ""))) (clobber (match_operand:DI 2 "gpc_reg_operand" "")) - (clobber (match_operand:DI 3 "gpc_reg_operand" "")) - (clobber (match_operand:DI 4 "" ""))] + (clobber (match_operand:DI 3 "gpc_reg_operand" ""))] "TARGET_POWERPC64 && reload_completed" [(const_int 0)] " @@ -2544,7 +2540,7 @@ }") (define_insn "bswapdi2_32bit" - [(set (match_operand:DI 0 "reg_or_mem_operand" "=&r,Z,&r") + [(set (match_operand:DI 0 "reg_or_mem_operand" "=r,Z,?&r") (bswap:DI (match_operand:DI 1 "reg_or_mem_operand" "Z,r,r"))) (clobber (match_scratch:SI 2 "=&b,&b,X"))] "!TARGET_POWERPC64 && (REG_P (operands[0]) || REG_P (operands[1]))" @@ -2573,7 +2569,8 @@ if (GET_CODE (addr1) == PLUS) { emit_insn (gen_add3_insn (op2, XEXP (addr1, 0), GEN_INT (4))); - if (TARGET_AVOID_XFORM) + if (TARGET_AVOID_XFORM + || REGNO (XEXP (addr1, 1)) == REGNO (dest2)) { emit_insn (gen_add3_insn (op2, XEXP (addr1, 1), op2)); addr2 = op2; @@ -2581,7 +2578,8 @@ else addr2 = gen_rtx_PLUS (SImode, op2, XEXP (addr1, 1)); } - else if (TARGET_AVOID_XFORM) + else if (TARGET_AVOID_XFORM + || REGNO (addr1) == REGNO (dest2)) { emit_insn (gen_add3_insn (op2, addr1, GEN_INT (4))); addr2 = op2; @@ -2596,6 +2594,8 @@ word2 = change_address (src, SImode, addr2); emit_insn (gen_bswapsi2 (dest2, word1)); + /* The REGNO (dest2) tests above ensure that addr2 has not been trashed, + thus allowing us to omit an early clobber on the output. */ emit_insn (gen_bswapsi2 (dest1, word2)); }") Index: gcc/testsuite/gcc.target/powerpc/pr53199.c =================================================================== --- gcc/testsuite/gcc.target/powerpc/pr53199.c (revision 206009) +++ gcc/testsuite/gcc.target/powerpc/pr53199.c (working copy) @@ -1,7 +1,7 @@ /* { dg-do compile { target { powerpc*-*-* } } } */ /* { dg-skip-if "" { powerpc*-*-darwin* } { "*" } { "" } } */ /* { dg-options "-O2 -mcpu=power6 -mavoid-indexed-addresses" } */ -/* { dg-final { scan-assembler-times "lwbrx" 6 } } */ +/* { dg-final { scan-assembler-times "lwbrx" 12 } } */ /* { dg-final { scan-assembler-times "stwbrx" 6 } } */ /* PR 51399: bswap gets an error if -mavoid-indexed-addresses was used in @@ -25,6 +25,24 @@ return __builtin_bswap64 (p[i]); } +long long +load64_reverse_4 (long long dummy __attribute__ ((unused)), long long *p) +{ + return __builtin_bswap64 (*p); +} + +long long +load64_reverse_5 (long long dummy __attribute__ ((unused)), long long *p) +{ + return __builtin_bswap64 (p[1]); +} + +long long +load64_reverse_6 (long long dummy __attribute__ ((unused)), long long *p, int i) +{ + return __builtin_bswap64 (p[i]); +} + void store64_reverse_1 (long long *p, long long x) { @@ -44,7 +62,13 @@ } long long -reg_reverse (long long x) +reg_reverse_1 (long long x) { return __builtin_bswap64 (x); } + +long long +reg_reverse_2 (long long dummy __attribute__ ((unused)), long long x) +{ + return __builtin_bswap64 (x); +}