Message ID | 92676187-37ed-4d1f-aad1-c8eb4c938fa5@linux.ibm.com |
---|---|
State | New |
Headers | show |
Series | [PATCH-2,rs6000] Enable vector mode for by pieces equality compare [PR111449] | expand |
Hi Haochen, on 2023/11/6 10:36, HAO CHEN GUI wrote: > Hi, > This patch enables vector mode for by pieces equality compare. It > adds a new expand pattern - cbrnachv16qi4 and set MOVE_MAX_PIECES > and COMPARE_MAX_PIECES to 16 bytes when P8 vector enabled. The compare > relies both move and compare instructions, so both macro are changed. > The vector load/store might be unaligned, so the 16-byte move and > compare are only enabled when p8 vector enabled (TARGET_VSX + > TARGET_EFFICIENT_UNALIGNED_VSX). > > This patch enables 16 byte by pieces move. As the vector mode is not > enabled for by pieces move, TImode is used for the move. It caused some > regression cases. I drafted the third patch to fix them. Could you also list the failures if the total number is small? > > Bootstrapped and tested on x86 and powerpc64-linux BE and LE with no > regressions. Is this OK for trunk? > > Thanks > Gui Haochen > > > ChangeLog > rs6000: Enable vector mode for by pieces equality compare > > This patch adds a new expand pattern - cbranchv16qi4 to enable vector > mode by pieces equality compare on rs6000. The macro MOVE_MAX_PIECES > (COMPARE_MAX_PIECES) is set to 16 bytes when P8 vector is enabled, > otherwise keeps unchanged. The macro STORE_MAX_PIECES is set to the > same value as MOVE_MAX_PIECES by default, so now it's explicitly > defined and keeps unchanged. > > gcc/ > PR target/111449 > * config/rs6000/altivec.md (cbranchv16qi4): New expand pattern. > * config/rs6000/rs6000.cc (rs6000_generate_compare): Generate > insn sequence for V16QImode equality compare. > * config/rs6000/rs6000.h (MOVE_MAX_PIECES): Define. > (STORE_MAX_PIECES): Define. > > gcc/testsuite/ > PR target/111449 > * gcc.target/powerpc/pr111449-1.c: New. > > patch.diff > diff --git a/gcc/config/rs6000/altivec.md b/gcc/config/rs6000/altivec.md > index e8a596fb7e9..d0937f192d6 100644 > --- a/gcc/config/rs6000/altivec.md > +++ b/gcc/config/rs6000/altivec.md > @@ -2605,6 +2605,45 @@ (define_insn "altivec_vupklpx" > } > [(set_attr "type" "vecperm")]) > > +/* The cbranch_optabs doesn't allow FAIL, so altivec load/store > + instructions are disabled as the cost is high for unaligned > + load/store. */ > +(define_expand "cbranchv16qi4" > + [(use (match_operator 0 "equality_operator" > + [(match_operand:V16QI 1 "reg_or_mem_operand") > + (match_operand:V16QI 2 "reg_or_mem_operand")])) > + (use (match_operand 3))] > + "VECTOR_MEM_VSX_P (V16QImode) > + && TARGET_EFFICIENT_UNALIGNED_VSX" > +{ > + if (!TARGET_P9_VECTOR > + && !BYTES_BIG_ENDIAN Nit: If these two aim to match the below comment "P8 little endian", it seems easier to read it with "TARGET_P8_VECTOR && !BYTES_BIG_ENDIAN". > + && MEM_P (operands[1]) > + && !altivec_indexed_or_indirect_operand (operands[1], V16QImode) Need a comment on why it checks altivec_indexed_or_indirect_operand, it's not obvious. > + && MEM_P (operands[2]) > + && !altivec_indexed_or_indirect_operand (operands[2], V16QImode)) > + { > + /* Use direct move for P8 little endian to skip bswap, as the byte > + order doesn't matter for equality compare. */ > + rtx reg_op1 = gen_reg_rtx (V16QImode); > + rtx reg_op2 = gen_reg_rtx (V16QImode); > + rs6000_emit_le_vsx_permute (reg_op1, operands[1], V16QImode); > + rs6000_emit_le_vsx_permute (reg_op2, operands[2], V16QImode); > + operands[1] = reg_op1; > + operands[2] = reg_op2; > + } > + else > + { > + operands[1] = force_reg (V16QImode, operands[1]); > + operands[2] = force_reg (V16QImode, operands[2]); > + } > + > + rtx_code code = GET_CODE (operands[0]); > + operands[0] = gen_rtx_fmt_ee (code, V16QImode, operands[1], operands[2]); > + rs6000_emit_cbranch (V16QImode, operands); > + DONE; > +}) > + > ;; Compare vectors producing a vector result and a predicate, setting CR6 to > ;; indicate a combined status > (define_insn "altivec_vcmpequ<VI_char>_p" > diff --git a/gcc/config/rs6000/rs6000.cc b/gcc/config/rs6000/rs6000.cc > index cc24dd5301e..10279052636 100644 > --- a/gcc/config/rs6000/rs6000.cc > +++ b/gcc/config/rs6000/rs6000.cc > @@ -15472,6 +15472,18 @@ rs6000_generate_compare (rtx cmp, machine_mode mode) > else > emit_insn (gen_stack_protect_testsi (compare_result, op0, op1b)); > } > + else if (mode == V16QImode) > + { > + gcc_assert (code == EQ || code == NE); > + > + rtx result_vector = gen_reg_rtx (V16QImode); > + rtx cc_bit = gen_reg_rtx (SImode); > + emit_insn (gen_altivec_vcmpequb_p (result_vector, op0, op1)); > + emit_insn (gen_cr6_test_for_lt (cc_bit)); > + emit_insn (gen_rtx_SET (compare_result, > + gen_rtx_COMPARE (comp_mode, cc_bit, > + const1_rtx))); > + } > else > emit_insn (gen_rtx_SET (compare_result, > gen_rtx_COMPARE (comp_mode, op0, op1))); > diff --git a/gcc/config/rs6000/rs6000.h b/gcc/config/rs6000/rs6000.h > index 22595f6ebd7..51441825e20 100644 > --- a/gcc/config/rs6000/rs6000.h > +++ b/gcc/config/rs6000/rs6000.h > @@ -1730,6 +1730,8 @@ typedef struct rs6000_args > in one reasonably fast instruction. */ > #define MOVE_MAX (! TARGET_POWERPC64 ? 4 : 8) > #define MAX_MOVE_MAX 8 > +#define MOVE_MAX_PIECES (TARGET_P8_VECTOR ? 16 : (TARGET_POWERPC64 ? 8 : 4)) But the cbranchv16qi4 condition uses TARGET_EFFICIENT_UNALIGNED_VSX which can be separately disabled but TARGET_P8_VECTOR still enabled? So I think we should keep both consistent. > +#define STORE_MAX_PIECES (TARGET_POWERPC64 ? 8 : 4) > > /* Nonzero if access to memory by bytes is no faster than for words. > Also nonzero if doing byte operations (specifically shifts) in registers > diff --git a/gcc/testsuite/gcc.target/powerpc/pr111449-1.c b/gcc/testsuite/gcc.target/powerpc/pr111449-1.c > new file mode 100644 > index 00000000000..d8583d3303b > --- /dev/null > +++ b/gcc/testsuite/gcc.target/powerpc/pr111449-1.c > @@ -0,0 +1,18 @@ > +/* { dg-do compile { target { has_arch_pwr8 } } } */ Nit: has_arch_pwr8 would make it un-tested on Power7 default env, I'd prefer to remove this "has_arch_pwr8" and append "-mdejagnu-cpu=power8" to dg-options. Otherwise LGTM. BR, Kewen > +/* { dg-require-effective-target powerpc_p8vector_ok } */ > +/* { dg-options "-mvsx -O2" } */ > + > +/* Ensure vector mode is used for 16-byte memory equality compare. */ > + > +int compare1 (const char* s1, const char* s2) > +{ > + return __builtin_memcmp (s1, s2, 16) == 0; > +} > + > +int compare2 (const char* s1) > +{ > + return __builtin_memcmp (s1, "0123456789012345", 16) == 0; > +} > + > +/* { dg-final { scan-assembler-times {\mvcmpequb\.} 2 } } */ > +/* { dg-final { scan-assembler-not {\mcmpd\M} } } */
Hi Kewen,
Thanks for your review comments. Just one question on following
comment.
在 2023/11/7 10:40, Kewen.Lin 写道:
> Nit: has_arch_pwr8 would make it un-tested on Power7 default env, I'd prefer to remove this "has_arch_pwr8" and append "-mdejagnu-cpu=power8" to dg-options.
My original propose is to test the case on p8/p9/p10. Each of them
generate different instruction sequence. If it's assigned
"-mdejagnu-cpu=power8", only p8 instruction sequence is generated.
Does it lost the coverage?
Thanks
Gui Haochen
Hi, on 2023/11/7 11:24, HAO CHEN GUI wrote: > Hi Kewen, > > Thanks for your review comments. Just one question on following > comment. > > 在 2023/11/7 10:40, Kewen.Lin 写道: >> Nit: has_arch_pwr8 would make it un-tested on Power7 default env, I'd prefer to remove this "has_arch_pwr8" and append "-mdejagnu-cpu=power8" to dg-options. > > My original propose is to test the case on p8/p9/p10. Each of them > generate different instruction sequence. If it's assigned But the actual test point is the same for p8/p9/p10: +/* { dg-final { scan-assembler-times {\mvcmpequb\.} 2 } } */ +/* { dg-final { scan-assembler-not {\mcmpd\M} } } */ ,it checks if we generates vector comparison, the p8/p9/p10 difference doesn't matter in this checking at least. IMHO the consistency for vector comparison on different cpus shouldn't be tested in this case either, it should be covered by some other existing test cases dedicated for vector comparison instead. But if you are eager to cover p8/p9/p10, we can separate the test cases with separated -mdejagnu-cpu=, but I don't feel it's worthy. :) BR, Kewen > "-mdejagnu-cpu=power8", only p8 instruction sequence is generated. > Does it lost the coverage? > > Thanks > Gui Haochen
diff --git a/gcc/config/rs6000/altivec.md b/gcc/config/rs6000/altivec.md index e8a596fb7e9..d0937f192d6 100644 --- a/gcc/config/rs6000/altivec.md +++ b/gcc/config/rs6000/altivec.md @@ -2605,6 +2605,45 @@ (define_insn "altivec_vupklpx" } [(set_attr "type" "vecperm")]) +/* The cbranch_optabs doesn't allow FAIL, so altivec load/store + instructions are disabled as the cost is high for unaligned + load/store. */ +(define_expand "cbranchv16qi4" + [(use (match_operator 0 "equality_operator" + [(match_operand:V16QI 1 "reg_or_mem_operand") + (match_operand:V16QI 2 "reg_or_mem_operand")])) + (use (match_operand 3))] + "VECTOR_MEM_VSX_P (V16QImode) + && TARGET_EFFICIENT_UNALIGNED_VSX" +{ + if (!TARGET_P9_VECTOR + && !BYTES_BIG_ENDIAN + && MEM_P (operands[1]) + && !altivec_indexed_or_indirect_operand (operands[1], V16QImode) + && MEM_P (operands[2]) + && !altivec_indexed_or_indirect_operand (operands[2], V16QImode)) + { + /* Use direct move for P8 little endian to skip bswap, as the byte + order doesn't matter for equality compare. */ + rtx reg_op1 = gen_reg_rtx (V16QImode); + rtx reg_op2 = gen_reg_rtx (V16QImode); + rs6000_emit_le_vsx_permute (reg_op1, operands[1], V16QImode); + rs6000_emit_le_vsx_permute (reg_op2, operands[2], V16QImode); + operands[1] = reg_op1; + operands[2] = reg_op2; + } + else + { + operands[1] = force_reg (V16QImode, operands[1]); + operands[2] = force_reg (V16QImode, operands[2]); + } + + rtx_code code = GET_CODE (operands[0]); + operands[0] = gen_rtx_fmt_ee (code, V16QImode, operands[1], operands[2]); + rs6000_emit_cbranch (V16QImode, operands); + DONE; +}) + ;; Compare vectors producing a vector result and a predicate, setting CR6 to ;; indicate a combined status (define_insn "altivec_vcmpequ<VI_char>_p" diff --git a/gcc/config/rs6000/rs6000.cc b/gcc/config/rs6000/rs6000.cc index cc24dd5301e..10279052636 100644 --- a/gcc/config/rs6000/rs6000.cc +++ b/gcc/config/rs6000/rs6000.cc @@ -15472,6 +15472,18 @@ rs6000_generate_compare (rtx cmp, machine_mode mode) else emit_insn (gen_stack_protect_testsi (compare_result, op0, op1b)); } + else if (mode == V16QImode) + { + gcc_assert (code == EQ || code == NE); + + rtx result_vector = gen_reg_rtx (V16QImode); + rtx cc_bit = gen_reg_rtx (SImode); + emit_insn (gen_altivec_vcmpequb_p (result_vector, op0, op1)); + emit_insn (gen_cr6_test_for_lt (cc_bit)); + emit_insn (gen_rtx_SET (compare_result, + gen_rtx_COMPARE (comp_mode, cc_bit, + const1_rtx))); + } else emit_insn (gen_rtx_SET (compare_result, gen_rtx_COMPARE (comp_mode, op0, op1))); diff --git a/gcc/config/rs6000/rs6000.h b/gcc/config/rs6000/rs6000.h index 22595f6ebd7..51441825e20 100644 --- a/gcc/config/rs6000/rs6000.h +++ b/gcc/config/rs6000/rs6000.h @@ -1730,6 +1730,8 @@ typedef struct rs6000_args in one reasonably fast instruction. */ #define MOVE_MAX (! TARGET_POWERPC64 ? 4 : 8) #define MAX_MOVE_MAX 8 +#define MOVE_MAX_PIECES (TARGET_P8_VECTOR ? 16 : (TARGET_POWERPC64 ? 8 : 4)) +#define STORE_MAX_PIECES (TARGET_POWERPC64 ? 8 : 4) /* Nonzero if access to memory by bytes is no faster than for words. Also nonzero if doing byte operations (specifically shifts) in registers diff --git a/gcc/testsuite/gcc.target/powerpc/pr111449-1.c b/gcc/testsuite/gcc.target/powerpc/pr111449-1.c new file mode 100644 index 00000000000..d8583d3303b --- /dev/null +++ b/gcc/testsuite/gcc.target/powerpc/pr111449-1.c @@ -0,0 +1,18 @@ +/* { dg-do compile { target { has_arch_pwr8 } } } */ +/* { dg-require-effective-target powerpc_p8vector_ok } */ +/* { dg-options "-mvsx -O2" } */ + +/* Ensure vector mode is used for 16-byte memory equality compare. */ + +int compare1 (const char* s1, const char* s2) +{ + return __builtin_memcmp (s1, s2, 16) == 0; +} + +int compare2 (const char* s1) +{ + return __builtin_memcmp (s1, "0123456789012345", 16) == 0; +} + +/* { dg-final { scan-assembler-times {\mvcmpequb\.} 2 } } */ +/* { dg-final { scan-assembler-not {\mcmpd\M} } } */