Message ID | alpine.DEB.2.23.453.2008051443320.18411@stedding.saclay.inria.fr |
---|---|
State | New |
Headers | show |
Series | VEC_COND_EXPR optimizations v2 | expand |
On Wed, Aug 5, 2020 at 3:33 PM Marc Glisse <marc.glisse@inria.fr> wrote: > > New version that passed bootstrap+regtest during the night. > > When vector comparisons were forced to use vec_cond_expr, we lost a number of > optimizations (my fault for not adding enough testcases to prevent that). > This patch tries to unwrap vec_cond_expr a bit so some optimizations can > still happen. > > I wasn't planning to add all those transformations together, but adding one > caused a regression, whose fix introduced a second regression, etc. > > Restricting to constant folding would not be sufficient, we also need at > least things like X|0 or X&X. The transformations are quite conservative > with :s and folding only if everything simplifies, we may want to relax > this later. And of course we are going to miss things like a?b:c + a?c:b > -> b+c. > > In terms of number of operations, some transformations turning 2 > VEC_COND_EXPR into VEC_COND_EXPR + BIT_IOR_EXPR + BIT_NOT_EXPR might not look > like a gain... I expect the bit_not disappears in most cases, and > VEC_COND_EXPR looks more costly than a simpler BIT_IOR_EXPR. > > I am a bit confused that with avx512 we get types like "vector(4) > <signed-boolean:2>" with :2 and not :1 (is it a hack so true is 1 and not > -1?), but that doesn't matter for this patch. OK. Thanks, Richard. > 2020-08-05 Marc Glisse <marc.glisse@inria.fr> > > PR tree-optimization/95906 > PR target/70314 > * match.pd ((c ? a : b) op d, (c ? a : b) op (c ? d : e), > (v ? w : 0) ? a : b, c1 ? c2 ? a : b : b): New transformations. > (op (c ? a : b)): Update to match the new transformations. > > * gcc.dg/tree-ssa/andnot-2.c: New file. > * gcc.dg/tree-ssa/pr95906.c: Likewise. > * gcc.target/i386/pr70314.c: Likewise. > > -- > Marc Glisse
Hi, On Wed, 5 Aug 2020 at 16:24, Richard Biener via Gcc-patches <gcc-patches@gcc.gnu.org> wrote: > > On Wed, Aug 5, 2020 at 3:33 PM Marc Glisse <marc.glisse@inria.fr> wrote: > > > > New version that passed bootstrap+regtest during the night. > > > > When vector comparisons were forced to use vec_cond_expr, we lost a number of > > optimizations (my fault for not adding enough testcases to prevent that). > > This patch tries to unwrap vec_cond_expr a bit so some optimizations can > > still happen. > > > > I wasn't planning to add all those transformations together, but adding one > > caused a regression, whose fix introduced a second regression, etc. > > > > Restricting to constant folding would not be sufficient, we also need at > > least things like X|0 or X&X. The transformations are quite conservative > > with :s and folding only if everything simplifies, we may want to relax > > this later. And of course we are going to miss things like a?b:c + a?c:b > > -> b+c. > > > > In terms of number of operations, some transformations turning 2 > > VEC_COND_EXPR into VEC_COND_EXPR + BIT_IOR_EXPR + BIT_NOT_EXPR might not look > > like a gain... I expect the bit_not disappears in most cases, and > > VEC_COND_EXPR looks more costly than a simpler BIT_IOR_EXPR. > > > > I am a bit confused that with avx512 we get types like "vector(4) > > <signed-boolean:2>" with :2 and not :1 (is it a hack so true is 1 and not > > -1?), but that doesn't matter for this patch. > > OK. > > Thanks, > Richard. > > > 2020-08-05 Marc Glisse <marc.glisse@inria.fr> > > > > PR tree-optimization/95906 > > PR target/70314 > > * match.pd ((c ? a : b) op d, (c ? a : b) op (c ? d : e), > > (v ? w : 0) ? a : b, c1 ? c2 ? a : b : b): New transformations. > > (op (c ? a : b)): Update to match the new transformations. > > > > * gcc.dg/tree-ssa/andnot-2.c: New file. > > * gcc.dg/tree-ssa/pr95906.c: Likewise. > > * gcc.target/i386/pr70314.c: Likewise. > > I think this patch is causing several ICEs on arm-none-linux-gnueabihf --with-cpu cortex-a9 --with-fpu neon-fp16: Executed from: gcc.c-torture/compile/compile.exp gcc.c-torture/compile/20160205-1.c -O3 -fomit-frame-pointer -funroll-loops -fpeel-loops -ftracer -finline-functions (internal compiler error) gcc.c-torture/compile/20160205-1.c -O3 -g (internal compiler error) Executed from: gcc.dg/dg.exp gcc.dg/pr87746.c (internal compiler error) Executed from: gcc.dg/tree-ssa/tree-ssa.exp gcc.dg/tree-ssa/ifc-cd.c (internal compiler error) Executed from: gcc.dg/vect/vect.exp gcc.dg/vect/pr59591-1.c (internal compiler error) gcc.dg/vect/pr59591-1.c -flto -ffat-lto-objects (internal compiler error) gcc.dg/vect/pr86927.c (internal compiler error) gcc.dg/vect/pr86927.c -flto -ffat-lto-objects (internal compiler error) gcc.dg/vect/slp-cond-5.c (internal compiler error) gcc.dg/vect/slp-cond-5.c -flto -ffat-lto-objects (internal compiler error) gcc.dg/vect/vect-23.c (internal compiler error) gcc.dg/vect/vect-23.c -flto -ffat-lto-objects (internal compiler error) gcc.dg/vect/vect-24.c (internal compiler error) gcc.dg/vect/vect-24.c -flto -ffat-lto-objects (internal compiler error) gcc.dg/vect/vect-cond-reduc-6.c (internal compiler error) gcc.dg/vect/vect-cond-reduc-6.c -flto -ffat-lto-objects (internal compiler error) Backtrace for gcc.c-torture/compile/20160205-1.c -O3 -fomit-frame-pointer -funroll-loops -fpeel-loops -ftracer -finline-functions during RTL pass: expand /gcc/testsuite/gcc.c-torture/compile/20160205-1.c:2:5: internal compiler error: in do_store_flag, at expr.c:12259 0x8feb26 do_store_flag /gcc/expr.c:12259 0x900201 expand_expr_real_2(separate_ops*, rtx_def*, machine_mode, expand_modifier) /gcc/expr.c:9617 0x908cd0 expand_expr_real_1(tree_node*, rtx_def*, machine_mode, expand_modifier, rtx_def**, bool) /gcc/expr.c:10159 0x91174e expand_expr /gcc/expr.h:282 0x91174e expand_operands(tree_node*, tree_node*, rtx_def*, rtx_def**, rtx_def**, expand_modifier) /gcc/expr.c:8065 0x8ff543 expand_expr_real_2(separate_ops*, rtx_def*, machine_mode, expand_modifier) /gcc/expr.c:9950 0x908cd0 expand_expr_real_1(tree_node*, rtx_def*, machine_mode, expand_modifier, rtx_def**, bool) /gcc/expr.c:10159 0x91174e expand_expr /gcc/expr.h:282 0x91174e expand_operands(tree_node*, tree_node*, rtx_def*, rtx_def**, rtx_def**, expand_modifier) /gcc/expr.c:8065 0x8ff543 expand_expr_real_2(separate_ops*, rtx_def*, machine_mode, expand_modifier) /gcc/expr.c:9950 0x908cd0 expand_expr_real_1(tree_node*, rtx_def*, machine_mode, expand_modifier, rtx_def**, bool) /gcc/expr.c:10159 0x91174e expand_expr /gcc/expr.h:282 0x91174e expand_operands(tree_node*, tree_node*, rtx_def*, rtx_def**, rtx_def**, expand_modifier) /gcc/expr.c:8065 0x8ff543 expand_expr_real_2(separate_ops*, rtx_def*, machine_mode, expand_modifier) /gcc/expr.c:9950 0x908cd0 expand_expr_real_1(tree_node*, rtx_def*, machine_mode, expand_modifier, rtx_def**, bool) /gcc/expr.c:10159 0x91174e expand_expr /gcc/expr.h:282 0x91174e expand_operands(tree_node*, tree_node*, rtx_def*, rtx_def**, rtx_def**, expand_modifier) /gcc/expr.c:8065 0x8ff543 expand_expr_real_2(separate_ops*, rtx_def*, machine_mode, expand_modifier) /gcc/expr.c:9950 0x908cd0 expand_expr_real_1(tree_node*, rtx_def*, machine_mode, expand_modifier, rtx_def**, bool) /gcc/expr.c:10159 0x91174e expand_expr /gcc/expr.h:282 Christophe > > -- > > Marc Glisse
On Thu, 6 Aug 2020, Christophe Lyon wrote: >>> 2020-08-05 Marc Glisse <marc.glisse@inria.fr> >>> >>> PR tree-optimization/95906 >>> PR target/70314 >>> * match.pd ((c ? a : b) op d, (c ? a : b) op (c ? d : e), >>> (v ? w : 0) ? a : b, c1 ? c2 ? a : b : b): New transformations. >>> (op (c ? a : b)): Update to match the new transformations. >>> >>> * gcc.dg/tree-ssa/andnot-2.c: New file. >>> * gcc.dg/tree-ssa/pr95906.c: Likewise. >>> * gcc.target/i386/pr70314.c: Likewise. >>> > > I think this patch is causing several ICEs on arm-none-linux-gnueabihf > --with-cpu cortex-a9 --with-fpu neon-fp16: > Executed from: gcc.c-torture/compile/compile.exp > gcc.c-torture/compile/20160205-1.c -O3 -fomit-frame-pointer > -funroll-loops -fpeel-loops -ftracer -finline-functions (internal > compiler error) > gcc.c-torture/compile/20160205-1.c -O3 -g (internal compiler error) > Executed from: gcc.dg/dg.exp > gcc.dg/pr87746.c (internal compiler error) > Executed from: gcc.dg/tree-ssa/tree-ssa.exp > gcc.dg/tree-ssa/ifc-cd.c (internal compiler error) I tried a cross from x86_64-linux with current master .../configure --target=arm-none-linux-gnueabihf --enable-languages=c,c++ --with-system-zlib --disable-nls --with-cpu=cortex-a9 --with-fpu=neon-fp16 make it stops at some point with an error, but I have xgcc and cc1 in build/gcc. I copied 2 of the testcases and compiled ./xgcc pr87746.c -Ofast -S -B. ./xgcc -O3 -fdump-tree-ifcvt-details-blocks-details ifc-cd.c -S -B. without getting any ICE. Is there a machine on the compile farm where this is easy to reproduce? Or could you attach the .optimized dump that corresponds to the backtrace below? It looks like we end up with a comparison with an unexpected return type. > Executed from: gcc.dg/vect/vect.exp > gcc.dg/vect/pr59591-1.c (internal compiler error) > gcc.dg/vect/pr59591-1.c -flto -ffat-lto-objects (internal compiler error) > gcc.dg/vect/pr86927.c (internal compiler error) > gcc.dg/vect/pr86927.c -flto -ffat-lto-objects (internal compiler error) > gcc.dg/vect/slp-cond-5.c (internal compiler error) > gcc.dg/vect/slp-cond-5.c -flto -ffat-lto-objects (internal compiler error) > gcc.dg/vect/vect-23.c (internal compiler error) > gcc.dg/vect/vect-23.c -flto -ffat-lto-objects (internal compiler error) > gcc.dg/vect/vect-24.c (internal compiler error) > gcc.dg/vect/vect-24.c -flto -ffat-lto-objects (internal compiler error) > gcc.dg/vect/vect-cond-reduc-6.c (internal compiler error) > gcc.dg/vect/vect-cond-reduc-6.c -flto -ffat-lto-objects (internal > compiler error) > > Backtrace for gcc.c-torture/compile/20160205-1.c -O3 > -fomit-frame-pointer -funroll-loops -fpeel-loops -ftracer > -finline-functions > during RTL pass: expand > /gcc/testsuite/gcc.c-torture/compile/20160205-1.c:2:5: internal > compiler error: in do_store_flag, at expr.c:12259 > 0x8feb26 do_store_flag > /gcc/expr.c:12259 > 0x900201 expand_expr_real_2(separate_ops*, rtx_def*, machine_mode, > expand_modifier) > /gcc/expr.c:9617 > 0x908cd0 expand_expr_real_1(tree_node*, rtx_def*, machine_mode, > expand_modifier, rtx_def**, bool) > /gcc/expr.c:10159 > 0x91174e expand_expr > /gcc/expr.h:282 > 0x91174e expand_operands(tree_node*, tree_node*, rtx_def*, rtx_def**, > rtx_def**, expand_modifier) > /gcc/expr.c:8065 > 0x8ff543 expand_expr_real_2(separate_ops*, rtx_def*, machine_mode, > expand_modifier) > /gcc/expr.c:9950 > 0x908cd0 expand_expr_real_1(tree_node*, rtx_def*, machine_mode, > expand_modifier, rtx_def**, bool) > /gcc/expr.c:10159 > 0x91174e expand_expr > /gcc/expr.h:282 > 0x91174e expand_operands(tree_node*, tree_node*, rtx_def*, rtx_def**, > rtx_def**, expand_modifier) > /gcc/expr.c:8065 > 0x8ff543 expand_expr_real_2(separate_ops*, rtx_def*, machine_mode, > expand_modifier) > /gcc/expr.c:9950 > 0x908cd0 expand_expr_real_1(tree_node*, rtx_def*, machine_mode, > expand_modifier, rtx_def**, bool) > /gcc/expr.c:10159 > 0x91174e expand_expr > /gcc/expr.h:282 > 0x91174e expand_operands(tree_node*, tree_node*, rtx_def*, rtx_def**, > rtx_def**, expand_modifier) > /gcc/expr.c:8065 > 0x8ff543 expand_expr_real_2(separate_ops*, rtx_def*, machine_mode, > expand_modifier) > /gcc/expr.c:9950 > 0x908cd0 expand_expr_real_1(tree_node*, rtx_def*, machine_mode, > expand_modifier, rtx_def**, bool) > /gcc/expr.c:10159 > 0x91174e expand_expr > /gcc/expr.h:282 > 0x91174e expand_operands(tree_node*, tree_node*, rtx_def*, rtx_def**, > rtx_def**, expand_modifier) > /gcc/expr.c:8065 > 0x8ff543 expand_expr_real_2(separate_ops*, rtx_def*, machine_mode, > expand_modifier) > /gcc/expr.c:9950 > 0x908cd0 expand_expr_real_1(tree_node*, rtx_def*, machine_mode, > expand_modifier, rtx_def**, bool) > /gcc/expr.c:10159 > 0x91174e expand_expr > /gcc/expr.h:282 > > Christophe
On Thu, Aug 6, 2020 at 10:17 AM Christophe Lyon <christophe.lyon@linaro.org> wrote: > > Hi, > > > On Wed, 5 Aug 2020 at 16:24, Richard Biener via Gcc-patches > <gcc-patches@gcc.gnu.org> wrote: > > > > On Wed, Aug 5, 2020 at 3:33 PM Marc Glisse <marc.glisse@inria.fr> wrote: > > > > > > New version that passed bootstrap+regtest during the night. > > > > > > When vector comparisons were forced to use vec_cond_expr, we lost a number of > > > optimizations (my fault for not adding enough testcases to prevent that). > > > This patch tries to unwrap vec_cond_expr a bit so some optimizations can > > > still happen. > > > > > > I wasn't planning to add all those transformations together, but adding one > > > caused a regression, whose fix introduced a second regression, etc. > > > > > > Restricting to constant folding would not be sufficient, we also need at > > > least things like X|0 or X&X. The transformations are quite conservative > > > with :s and folding only if everything simplifies, we may want to relax > > > this later. And of course we are going to miss things like a?b:c + a?c:b > > > -> b+c. > > > > > > In terms of number of operations, some transformations turning 2 > > > VEC_COND_EXPR into VEC_COND_EXPR + BIT_IOR_EXPR + BIT_NOT_EXPR might not look > > > like a gain... I expect the bit_not disappears in most cases, and > > > VEC_COND_EXPR looks more costly than a simpler BIT_IOR_EXPR. > > > > > > I am a bit confused that with avx512 we get types like "vector(4) > > > <signed-boolean:2>" with :2 and not :1 (is it a hack so true is 1 and not > > > -1?), but that doesn't matter for this patch. > > > > OK. > > > > Thanks, > > Richard. > > > > > 2020-08-05 Marc Glisse <marc.glisse@inria.fr> > > > > > > PR tree-optimization/95906 > > > PR target/70314 > > > * match.pd ((c ? a : b) op d, (c ? a : b) op (c ? d : e), > > > (v ? w : 0) ? a : b, c1 ? c2 ? a : b : b): New transformations. > > > (op (c ? a : b)): Update to match the new transformations. > > > > > > * gcc.dg/tree-ssa/andnot-2.c: New file. > > > * gcc.dg/tree-ssa/pr95906.c: Likewise. > > > * gcc.target/i386/pr70314.c: Likewise. > > > > > I think this patch is causing several ICEs on arm-none-linux-gnueabihf > --with-cpu cortex-a9 --with-fpu neon-fp16: > Executed from: gcc.c-torture/compile/compile.exp > gcc.c-torture/compile/20160205-1.c -O3 -fomit-frame-pointer > -funroll-loops -fpeel-loops -ftracer -finline-functions (internal > compiler error) > gcc.c-torture/compile/20160205-1.c -O3 -g (internal compiler error) > Executed from: gcc.dg/dg.exp > gcc.dg/pr87746.c (internal compiler error) > Executed from: gcc.dg/tree-ssa/tree-ssa.exp > gcc.dg/tree-ssa/ifc-cd.c (internal compiler error) > Executed from: gcc.dg/vect/vect.exp > gcc.dg/vect/pr59591-1.c (internal compiler error) > gcc.dg/vect/pr59591-1.c -flto -ffat-lto-objects (internal compiler error) > gcc.dg/vect/pr86927.c (internal compiler error) > gcc.dg/vect/pr86927.c -flto -ffat-lto-objects (internal compiler error) > gcc.dg/vect/slp-cond-5.c (internal compiler error) > gcc.dg/vect/slp-cond-5.c -flto -ffat-lto-objects (internal compiler error) > gcc.dg/vect/vect-23.c (internal compiler error) > gcc.dg/vect/vect-23.c -flto -ffat-lto-objects (internal compiler error) > gcc.dg/vect/vect-24.c (internal compiler error) > gcc.dg/vect/vect-24.c -flto -ffat-lto-objects (internal compiler error) > gcc.dg/vect/vect-cond-reduc-6.c (internal compiler error) > gcc.dg/vect/vect-cond-reduc-6.c -flto -ffat-lto-objects (internal > compiler error) > > Backtrace for gcc.c-torture/compile/20160205-1.c -O3 > -fomit-frame-pointer -funroll-loops -fpeel-loops -ftracer > -finline-functions > during RTL pass: expand > /gcc/testsuite/gcc.c-torture/compile/20160205-1.c:2:5: internal > compiler error: in do_store_flag, at expr.c:12259 > 0x8feb26 do_store_flag > /gcc/expr.c:12259 > 0x900201 expand_expr_real_2(separate_ops*, rtx_def*, machine_mode, > expand_modifier) > /gcc/expr.c:9617 > 0x908cd0 expand_expr_real_1(tree_node*, rtx_def*, machine_mode, > expand_modifier, rtx_def**, bool) > /gcc/expr.c:10159 > 0x91174e expand_expr > /gcc/expr.h:282 > 0x91174e expand_operands(tree_node*, tree_node*, rtx_def*, rtx_def**, > rtx_def**, expand_modifier) > /gcc/expr.c:8065 > 0x8ff543 expand_expr_real_2(separate_ops*, rtx_def*, machine_mode, > expand_modifier) > /gcc/expr.c:9950 > 0x908cd0 expand_expr_real_1(tree_node*, rtx_def*, machine_mode, > expand_modifier, rtx_def**, bool) > /gcc/expr.c:10159 > 0x91174e expand_expr > /gcc/expr.h:282 > 0x91174e expand_operands(tree_node*, tree_node*, rtx_def*, rtx_def**, > rtx_def**, expand_modifier) > /gcc/expr.c:8065 > 0x8ff543 expand_expr_real_2(separate_ops*, rtx_def*, machine_mode, > expand_modifier) > /gcc/expr.c:9950 > 0x908cd0 expand_expr_real_1(tree_node*, rtx_def*, machine_mode, > expand_modifier, rtx_def**, bool) > /gcc/expr.c:10159 > 0x91174e expand_expr > /gcc/expr.h:282 > 0x91174e expand_operands(tree_node*, tree_node*, rtx_def*, rtx_def**, > rtx_def**, expand_modifier) > /gcc/expr.c:8065 > 0x8ff543 expand_expr_real_2(separate_ops*, rtx_def*, machine_mode, > expand_modifier) > /gcc/expr.c:9950 > 0x908cd0 expand_expr_real_1(tree_node*, rtx_def*, machine_mode, > expand_modifier, rtx_def**, bool) > /gcc/expr.c:10159 > 0x91174e expand_expr > /gcc/expr.h:282 > 0x91174e expand_operands(tree_node*, tree_node*, rtx_def*, rtx_def**, > rtx_def**, expand_modifier) > /gcc/expr.c:8065 > 0x8ff543 expand_expr_real_2(separate_ops*, rtx_def*, machine_mode, > expand_modifier) > /gcc/expr.c:9950 > 0x908cd0 expand_expr_real_1(tree_node*, rtx_def*, machine_mode, > expand_modifier, rtx_def**, bool) > /gcc/expr.c:10159 > 0x91174e expand_expr > /gcc/expr.h:282 Hmm, I guess we might need to verify that the VEC_COND_EXPRs can be RTL expanded, at least if the folding triggers after vector lowering (but needing to lower a previously expandable VEC_COND_EXPR would be similarly bad). So we may need to handle VEC_COND_EXPRs like VEC_PERMs and thus need to check target support. Ick. Richard. > Christophe > > > > > -- > > > Marc Glisse
On Thu, 6 Aug 2020, Richard Biener wrote: > On Thu, Aug 6, 2020 at 10:17 AM Christophe Lyon > <christophe.lyon@linaro.org> wrote: >> >> Hi, >> >> >> On Wed, 5 Aug 2020 at 16:24, Richard Biener via Gcc-patches >> <gcc-patches@gcc.gnu.org> wrote: >>> >>> On Wed, Aug 5, 2020 at 3:33 PM Marc Glisse <marc.glisse@inria.fr> wrote: >>>> >>>> New version that passed bootstrap+regtest during the night. >>>> >>>> When vector comparisons were forced to use vec_cond_expr, we lost a number of >>>> optimizations (my fault for not adding enough testcases to prevent that). >>>> This patch tries to unwrap vec_cond_expr a bit so some optimizations can >>>> still happen. >>>> >>>> I wasn't planning to add all those transformations together, but adding one >>>> caused a regression, whose fix introduced a second regression, etc. >>>> >>>> Restricting to constant folding would not be sufficient, we also need at >>>> least things like X|0 or X&X. The transformations are quite conservative >>>> with :s and folding only if everything simplifies, we may want to relax >>>> this later. And of course we are going to miss things like a?b:c + a?c:b >>>> -> b+c. >>>> >>>> In terms of number of operations, some transformations turning 2 >>>> VEC_COND_EXPR into VEC_COND_EXPR + BIT_IOR_EXPR + BIT_NOT_EXPR might not look >>>> like a gain... I expect the bit_not disappears in most cases, and >>>> VEC_COND_EXPR looks more costly than a simpler BIT_IOR_EXPR. >>>> >>>> I am a bit confused that with avx512 we get types like "vector(4) >>>> <signed-boolean:2>" with :2 and not :1 (is it a hack so true is 1 and not >>>> -1?), but that doesn't matter for this patch. >>> >>> OK. >>> >>> Thanks, >>> Richard. >>> >>>> 2020-08-05 Marc Glisse <marc.glisse@inria.fr> >>>> >>>> PR tree-optimization/95906 >>>> PR target/70314 >>>> * match.pd ((c ? a : b) op d, (c ? a : b) op (c ? d : e), >>>> (v ? w : 0) ? a : b, c1 ? c2 ? a : b : b): New transformations. >>>> (op (c ? a : b)): Update to match the new transformations. >>>> >>>> * gcc.dg/tree-ssa/andnot-2.c: New file. >>>> * gcc.dg/tree-ssa/pr95906.c: Likewise. >>>> * gcc.target/i386/pr70314.c: Likewise. >>>> >> >> I think this patch is causing several ICEs on arm-none-linux-gnueabihf >> --with-cpu cortex-a9 --with-fpu neon-fp16: >> Executed from: gcc.c-torture/compile/compile.exp >> gcc.c-torture/compile/20160205-1.c -O3 -fomit-frame-pointer >> -funroll-loops -fpeel-loops -ftracer -finline-functions (internal >> compiler error) >> gcc.c-torture/compile/20160205-1.c -O3 -g (internal compiler error) >> Executed from: gcc.dg/dg.exp >> gcc.dg/pr87746.c (internal compiler error) >> Executed from: gcc.dg/tree-ssa/tree-ssa.exp >> gcc.dg/tree-ssa/ifc-cd.c (internal compiler error) >> Executed from: gcc.dg/vect/vect.exp >> gcc.dg/vect/pr59591-1.c (internal compiler error) >> gcc.dg/vect/pr59591-1.c -flto -ffat-lto-objects (internal compiler error) >> gcc.dg/vect/pr86927.c (internal compiler error) >> gcc.dg/vect/pr86927.c -flto -ffat-lto-objects (internal compiler error) >> gcc.dg/vect/slp-cond-5.c (internal compiler error) >> gcc.dg/vect/slp-cond-5.c -flto -ffat-lto-objects (internal compiler error) >> gcc.dg/vect/vect-23.c (internal compiler error) >> gcc.dg/vect/vect-23.c -flto -ffat-lto-objects (internal compiler error) >> gcc.dg/vect/vect-24.c (internal compiler error) >> gcc.dg/vect/vect-24.c -flto -ffat-lto-objects (internal compiler error) >> gcc.dg/vect/vect-cond-reduc-6.c (internal compiler error) >> gcc.dg/vect/vect-cond-reduc-6.c -flto -ffat-lto-objects (internal >> compiler error) >> >> Backtrace for gcc.c-torture/compile/20160205-1.c -O3 >> -fomit-frame-pointer -funroll-loops -fpeel-loops -ftracer >> -finline-functions >> during RTL pass: expand >> /gcc/testsuite/gcc.c-torture/compile/20160205-1.c:2:5: internal >> compiler error: in do_store_flag, at expr.c:12259 >> 0x8feb26 do_store_flag >> /gcc/expr.c:12259 >> 0x900201 expand_expr_real_2(separate_ops*, rtx_def*, machine_mode, >> expand_modifier) >> /gcc/expr.c:9617 >> 0x908cd0 expand_expr_real_1(tree_node*, rtx_def*, machine_mode, >> expand_modifier, rtx_def**, bool) >> /gcc/expr.c:10159 >> 0x91174e expand_expr >> /gcc/expr.h:282 >> 0x91174e expand_operands(tree_node*, tree_node*, rtx_def*, rtx_def**, >> rtx_def**, expand_modifier) >> /gcc/expr.c:8065 >> 0x8ff543 expand_expr_real_2(separate_ops*, rtx_def*, machine_mode, >> expand_modifier) >> /gcc/expr.c:9950 >> 0x908cd0 expand_expr_real_1(tree_node*, rtx_def*, machine_mode, >> expand_modifier, rtx_def**, bool) >> /gcc/expr.c:10159 >> 0x91174e expand_expr >> /gcc/expr.h:282 >> 0x91174e expand_operands(tree_node*, tree_node*, rtx_def*, rtx_def**, >> rtx_def**, expand_modifier) >> /gcc/expr.c:8065 >> 0x8ff543 expand_expr_real_2(separate_ops*, rtx_def*, machine_mode, >> expand_modifier) >> /gcc/expr.c:9950 >> 0x908cd0 expand_expr_real_1(tree_node*, rtx_def*, machine_mode, >> expand_modifier, rtx_def**, bool) >> /gcc/expr.c:10159 >> 0x91174e expand_expr >> /gcc/expr.h:282 >> 0x91174e expand_operands(tree_node*, tree_node*, rtx_def*, rtx_def**, >> rtx_def**, expand_modifier) >> /gcc/expr.c:8065 >> 0x8ff543 expand_expr_real_2(separate_ops*, rtx_def*, machine_mode, >> expand_modifier) >> /gcc/expr.c:9950 >> 0x908cd0 expand_expr_real_1(tree_node*, rtx_def*, machine_mode, >> expand_modifier, rtx_def**, bool) >> /gcc/expr.c:10159 >> 0x91174e expand_expr >> /gcc/expr.h:282 >> 0x91174e expand_operands(tree_node*, tree_node*, rtx_def*, rtx_def**, >> rtx_def**, expand_modifier) >> /gcc/expr.c:8065 >> 0x8ff543 expand_expr_real_2(separate_ops*, rtx_def*, machine_mode, >> expand_modifier) >> /gcc/expr.c:9950 >> 0x908cd0 expand_expr_real_1(tree_node*, rtx_def*, machine_mode, >> expand_modifier, rtx_def**, bool) >> /gcc/expr.c:10159 >> 0x91174e expand_expr >> /gcc/expr.h:282 > > Hmm, I guess we might need to verify that the VEC_COND_EXPRs > can be RTL expanded, at least if the folding triggers after vector > lowering (but needing to lower a previously expandable VEC_COND_EXPR > would be similarly bad). So we may need to handle VEC_COND_EXPRs > like VEC_PERMs and thus need to check target support. Ick. Maybe. I'd like to see what the gimple looks like that arm fails to expand, if that's really a limitation in the hardware, or just some simple missing case in the target or the expansion code. Is it that we had (a<b)?-1:0 which arm can handle, and because of the transformation we have to expand a plain c=a<b and arm cannot handle that? If someone can confirm the breakage, please feel free to revert that patch, but also please give some details about how this breaks or provide a simple way to reproduce.
On Thu, 6 Aug 2020 at 11:06, Marc Glisse <marc.glisse@inria.fr> wrote: > > On Thu, 6 Aug 2020, Christophe Lyon wrote: > > >>> 2020-08-05 Marc Glisse <marc.glisse@inria.fr> > >>> > >>> PR tree-optimization/95906 > >>> PR target/70314 > >>> * match.pd ((c ? a : b) op d, (c ? a : b) op (c ? d : e), > >>> (v ? w : 0) ? a : b, c1 ? c2 ? a : b : b): New transformations. > >>> (op (c ? a : b)): Update to match the new transformations. > >>> > >>> * gcc.dg/tree-ssa/andnot-2.c: New file. > >>> * gcc.dg/tree-ssa/pr95906.c: Likewise. > >>> * gcc.target/i386/pr70314.c: Likewise. > >>> > > > > I think this patch is causing several ICEs on arm-none-linux-gnueabihf > > --with-cpu cortex-a9 --with-fpu neon-fp16: > > Executed from: gcc.c-torture/compile/compile.exp > > gcc.c-torture/compile/20160205-1.c -O3 -fomit-frame-pointer > > -funroll-loops -fpeel-loops -ftracer -finline-functions (internal > > compiler error) > > gcc.c-torture/compile/20160205-1.c -O3 -g (internal compiler error) > > Executed from: gcc.dg/dg.exp > > gcc.dg/pr87746.c (internal compiler error) > > Executed from: gcc.dg/tree-ssa/tree-ssa.exp > > gcc.dg/tree-ssa/ifc-cd.c (internal compiler error) > > I tried a cross from x86_64-linux with current master > > .../configure --target=arm-none-linux-gnueabihf --enable-languages=c,c++ --with-system-zlib --disable-nls --with-cpu=cortex-a9 --with-fpu=neon-fp16 > make > > it stops at some point with an error, but I have xgcc and cc1 in > build/gcc. > > I copied 2 of the testcases and compiled > > ./xgcc pr87746.c -Ofast -S -B. > ./xgcc -O3 -fdump-tree-ifcvt-details-blocks-details ifc-cd.c -S -B. > > without getting any ICE. Sorry for the delay, I had to reproduce the problem manually. > > Is there a machine on the compile farm where this is easy to reproduce? I don't think there is any arm machine in the compile farm. > Or could you attach the .optimized dump that corresponds to the > backtrace below? It looks like we end up with a comparison with an > unexpected return type. > I've compiled pr87746.c with -fdump-tree-ifcvt-details-blocks-details, here is the log. Is that what you need? Thanks, Christophe > > Executed from: gcc.dg/vect/vect.exp > > gcc.dg/vect/pr59591-1.c (internal compiler error) > > gcc.dg/vect/pr59591-1.c -flto -ffat-lto-objects (internal compiler error) > > gcc.dg/vect/pr86927.c (internal compiler error) > > gcc.dg/vect/pr86927.c -flto -ffat-lto-objects (internal compiler error) > > gcc.dg/vect/slp-cond-5.c (internal compiler error) > > gcc.dg/vect/slp-cond-5.c -flto -ffat-lto-objects (internal compiler error) > > gcc.dg/vect/vect-23.c (internal compiler error) > > gcc.dg/vect/vect-23.c -flto -ffat-lto-objects (internal compiler error) > > gcc.dg/vect/vect-24.c (internal compiler error) > > gcc.dg/vect/vect-24.c -flto -ffat-lto-objects (internal compiler error) > > gcc.dg/vect/vect-cond-reduc-6.c (internal compiler error) > > gcc.dg/vect/vect-cond-reduc-6.c -flto -ffat-lto-objects (internal > > compiler error) > > > > Backtrace for gcc.c-torture/compile/20160205-1.c -O3 > > -fomit-frame-pointer -funroll-loops -fpeel-loops -ftracer > > -finline-functions > > during RTL pass: expand > > /gcc/testsuite/gcc.c-torture/compile/20160205-1.c:2:5: internal > > compiler error: in do_store_flag, at expr.c:12259 > > 0x8feb26 do_store_flag > > /gcc/expr.c:12259 > > 0x900201 expand_expr_real_2(separate_ops*, rtx_def*, machine_mode, > > expand_modifier) > > /gcc/expr.c:9617 > > 0x908cd0 expand_expr_real_1(tree_node*, rtx_def*, machine_mode, > > expand_modifier, rtx_def**, bool) > > /gcc/expr.c:10159 > > 0x91174e expand_expr > > /gcc/expr.h:282 > > 0x91174e expand_operands(tree_node*, tree_node*, rtx_def*, rtx_def**, > > rtx_def**, expand_modifier) > > /gcc/expr.c:8065 > > 0x8ff543 expand_expr_real_2(separate_ops*, rtx_def*, machine_mode, > > expand_modifier) > > /gcc/expr.c:9950 > > 0x908cd0 expand_expr_real_1(tree_node*, rtx_def*, machine_mode, > > expand_modifier, rtx_def**, bool) > > /gcc/expr.c:10159 > > 0x91174e expand_expr > > /gcc/expr.h:282 > > 0x91174e expand_operands(tree_node*, tree_node*, rtx_def*, rtx_def**, > > rtx_def**, expand_modifier) > > /gcc/expr.c:8065 > > 0x8ff543 expand_expr_real_2(separate_ops*, rtx_def*, machine_mode, > > expand_modifier) > > /gcc/expr.c:9950 > > 0x908cd0 expand_expr_real_1(tree_node*, rtx_def*, machine_mode, > > expand_modifier, rtx_def**, bool) > > /gcc/expr.c:10159 > > 0x91174e expand_expr > > /gcc/expr.h:282 > > 0x91174e expand_operands(tree_node*, tree_node*, rtx_def*, rtx_def**, > > rtx_def**, expand_modifier) > > /gcc/expr.c:8065 > > 0x8ff543 expand_expr_real_2(separate_ops*, rtx_def*, machine_mode, > > expand_modifier) > > /gcc/expr.c:9950 > > 0x908cd0 expand_expr_real_1(tree_node*, rtx_def*, machine_mode, > > expand_modifier, rtx_def**, bool) > > /gcc/expr.c:10159 > > 0x91174e expand_expr > > /gcc/expr.h:282 > > 0x91174e expand_operands(tree_node*, tree_node*, rtx_def*, rtx_def**, > > rtx_def**, expand_modifier) > > /gcc/expr.c:8065 > > 0x8ff543 expand_expr_real_2(separate_ops*, rtx_def*, machine_mode, > > expand_modifier) > > /gcc/expr.c:9950 > > 0x908cd0 expand_expr_real_1(tree_node*, rtx_def*, machine_mode, > > expand_modifier, rtx_def**, bool) > > /gcc/expr.c:10159 > > 0x91174e expand_expr > > /gcc/expr.h:282 > > > > Christophe > > -- > Marc Glisse
On Thu, 6 Aug 2020, Christophe Lyon wrote: > On Thu, 6 Aug 2020 at 11:06, Marc Glisse <marc.glisse@inria.fr> wrote: >> >> On Thu, 6 Aug 2020, Christophe Lyon wrote: >> >>>>> 2020-08-05 Marc Glisse <marc.glisse@inria.fr> >>>>> >>>>> PR tree-optimization/95906 >>>>> PR target/70314 >>>>> * match.pd ((c ? a : b) op d, (c ? a : b) op (c ? d : e), >>>>> (v ? w : 0) ? a : b, c1 ? c2 ? a : b : b): New transformations. >>>>> (op (c ? a : b)): Update to match the new transformations. >>>>> >>>>> * gcc.dg/tree-ssa/andnot-2.c: New file. >>>>> * gcc.dg/tree-ssa/pr95906.c: Likewise. >>>>> * gcc.target/i386/pr70314.c: Likewise. >>>>> >>> >>> I think this patch is causing several ICEs on arm-none-linux-gnueabihf >>> --with-cpu cortex-a9 --with-fpu neon-fp16: >>> Executed from: gcc.c-torture/compile/compile.exp >>> gcc.c-torture/compile/20160205-1.c -O3 -fomit-frame-pointer >>> -funroll-loops -fpeel-loops -ftracer -finline-functions (internal >>> compiler error) >>> gcc.c-torture/compile/20160205-1.c -O3 -g (internal compiler error) >>> Executed from: gcc.dg/dg.exp >>> gcc.dg/pr87746.c (internal compiler error) >>> Executed from: gcc.dg/tree-ssa/tree-ssa.exp >>> gcc.dg/tree-ssa/ifc-cd.c (internal compiler error) >> >> I tried a cross from x86_64-linux with current master >> >> .../configure --target=arm-none-linux-gnueabihf --enable-languages=c,c++ --with-system-zlib --disable-nls --with-cpu=cortex-a9 --with-fpu=neon-fp16 >> make >> >> it stops at some point with an error, but I have xgcc and cc1 in >> build/gcc. >> >> I copied 2 of the testcases and compiled >> >> ./xgcc pr87746.c -Ofast -S -B. >> ./xgcc -O3 -fdump-tree-ifcvt-details-blocks-details ifc-cd.c -S -B. >> >> without getting any ICE. > > Sorry for the delay, I had to reproduce the problem manually. >> >> Is there a machine on the compile farm where this is easy to reproduce? > I don't think there is any arm machine in the compile farm. > >> Or could you attach the .optimized dump that corresponds to the >> backtrace below? It looks like we end up with a comparison with an >> unexpected return type. >> > > I've compiled pr87746.c with -fdump-tree-ifcvt-details-blocks-details, > here is the log. > Is that what you need? Thanks. The one from -fdump-tree-optimized would be closer to the ICE. Though it would also be convenient to know which stmt is being expanded when we ICE, etc. Was I on the right track configuring with --target=arm-none-linux-gnueabihf --with-cpu=cortex-a9 --with-fpu=neon-fp16 then compiling without any special option? > Thanks, > > Christophe > >>> Executed from: gcc.dg/vect/vect.exp >>> gcc.dg/vect/pr59591-1.c (internal compiler error) >>> gcc.dg/vect/pr59591-1.c -flto -ffat-lto-objects (internal compiler error) >>> gcc.dg/vect/pr86927.c (internal compiler error) >>> gcc.dg/vect/pr86927.c -flto -ffat-lto-objects (internal compiler error) >>> gcc.dg/vect/slp-cond-5.c (internal compiler error) >>> gcc.dg/vect/slp-cond-5.c -flto -ffat-lto-objects (internal compiler error) >>> gcc.dg/vect/vect-23.c (internal compiler error) >>> gcc.dg/vect/vect-23.c -flto -ffat-lto-objects (internal compiler error) >>> gcc.dg/vect/vect-24.c (internal compiler error) >>> gcc.dg/vect/vect-24.c -flto -ffat-lto-objects (internal compiler error) >>> gcc.dg/vect/vect-cond-reduc-6.c (internal compiler error) >>> gcc.dg/vect/vect-cond-reduc-6.c -flto -ffat-lto-objects (internal >>> compiler error) >>> >>> Backtrace for gcc.c-torture/compile/20160205-1.c -O3 >>> -fomit-frame-pointer -funroll-loops -fpeel-loops -ftracer >>> -finline-functions >>> during RTL pass: expand >>> /gcc/testsuite/gcc.c-torture/compile/20160205-1.c:2:5: internal >>> compiler error: in do_store_flag, at expr.c:12259 >>> 0x8feb26 do_store_flag >>> /gcc/expr.c:12259 >>> 0x900201 expand_expr_real_2(separate_ops*, rtx_def*, machine_mode, >>> expand_modifier) >>> /gcc/expr.c:9617 >>> 0x908cd0 expand_expr_real_1(tree_node*, rtx_def*, machine_mode, >>> expand_modifier, rtx_def**, bool) >>> /gcc/expr.c:10159 >>> 0x91174e expand_expr >>> /gcc/expr.h:282 >>> 0x91174e expand_operands(tree_node*, tree_node*, rtx_def*, rtx_def**, >>> rtx_def**, expand_modifier) >>> /gcc/expr.c:8065 >>> 0x8ff543 expand_expr_real_2(separate_ops*, rtx_def*, machine_mode, >>> expand_modifier) >>> /gcc/expr.c:9950 >>> 0x908cd0 expand_expr_real_1(tree_node*, rtx_def*, machine_mode, >>> expand_modifier, rtx_def**, bool) >>> /gcc/expr.c:10159 >>> 0x91174e expand_expr >>> /gcc/expr.h:282 >>> 0x91174e expand_operands(tree_node*, tree_node*, rtx_def*, rtx_def**, >>> rtx_def**, expand_modifier) >>> /gcc/expr.c:8065 >>> 0x8ff543 expand_expr_real_2(separate_ops*, rtx_def*, machine_mode, >>> expand_modifier) >>> /gcc/expr.c:9950 >>> 0x908cd0 expand_expr_real_1(tree_node*, rtx_def*, machine_mode, >>> expand_modifier, rtx_def**, bool) >>> /gcc/expr.c:10159 >>> 0x91174e expand_expr >>> /gcc/expr.h:282 >>> 0x91174e expand_operands(tree_node*, tree_node*, rtx_def*, rtx_def**, >>> rtx_def**, expand_modifier) >>> /gcc/expr.c:8065 >>> 0x8ff543 expand_expr_real_2(separate_ops*, rtx_def*, machine_mode, >>> expand_modifier) >>> /gcc/expr.c:9950 >>> 0x908cd0 expand_expr_real_1(tree_node*, rtx_def*, machine_mode, >>> expand_modifier, rtx_def**, bool) >>> /gcc/expr.c:10159 >>> 0x91174e expand_expr >>> /gcc/expr.h:282 >>> 0x91174e expand_operands(tree_node*, tree_node*, rtx_def*, rtx_def**, >>> rtx_def**, expand_modifier) >>> /gcc/expr.c:8065 >>> 0x8ff543 expand_expr_real_2(separate_ops*, rtx_def*, machine_mode, >>> expand_modifier) >>> /gcc/expr.c:9950 >>> 0x908cd0 expand_expr_real_1(tree_node*, rtx_def*, machine_mode, >>> expand_modifier, rtx_def**, bool) >>> /gcc/expr.c:10159 >>> 0x91174e expand_expr >>> /gcc/expr.h:282 >>> >>> Christophe >> >> -- >> Marc Glisse >
On Thu, 6 Aug 2020 at 13:42, Marc Glisse <marc.glisse@inria.fr> wrote: > > On Thu, 6 Aug 2020, Christophe Lyon wrote: > > > On Thu, 6 Aug 2020 at 11:06, Marc Glisse <marc.glisse@inria.fr> wrote: > >> > >> On Thu, 6 Aug 2020, Christophe Lyon wrote: > >> > >>>>> 2020-08-05 Marc Glisse <marc.glisse@inria.fr> > >>>>> > >>>>> PR tree-optimization/95906 > >>>>> PR target/70314 > >>>>> * match.pd ((c ? a : b) op d, (c ? a : b) op (c ? d : e), > >>>>> (v ? w : 0) ? a : b, c1 ? c2 ? a : b : b): New transformations. > >>>>> (op (c ? a : b)): Update to match the new transformations. > >>>>> > >>>>> * gcc.dg/tree-ssa/andnot-2.c: New file. > >>>>> * gcc.dg/tree-ssa/pr95906.c: Likewise. > >>>>> * gcc.target/i386/pr70314.c: Likewise. > >>>>> > >>> > >>> I think this patch is causing several ICEs on arm-none-linux-gnueabihf > >>> --with-cpu cortex-a9 --with-fpu neon-fp16: > >>> Executed from: gcc.c-torture/compile/compile.exp > >>> gcc.c-torture/compile/20160205-1.c -O3 -fomit-frame-pointer > >>> -funroll-loops -fpeel-loops -ftracer -finline-functions (internal > >>> compiler error) > >>> gcc.c-torture/compile/20160205-1.c -O3 -g (internal compiler error) > >>> Executed from: gcc.dg/dg.exp > >>> gcc.dg/pr87746.c (internal compiler error) > >>> Executed from: gcc.dg/tree-ssa/tree-ssa.exp > >>> gcc.dg/tree-ssa/ifc-cd.c (internal compiler error) > >> > >> I tried a cross from x86_64-linux with current master > >> > >> .../configure --target=arm-none-linux-gnueabihf --enable-languages=c,c++ --with-system-zlib --disable-nls --with-cpu=cortex-a9 --with-fpu=neon-fp16 > >> make > >> > >> it stops at some point with an error, but I have xgcc and cc1 in > >> build/gcc. > >> > >> I copied 2 of the testcases and compiled > >> > >> ./xgcc pr87746.c -Ofast -S -B. > >> ./xgcc -O3 -fdump-tree-ifcvt-details-blocks-details ifc-cd.c -S -B. > >> > >> without getting any ICE. > > > > Sorry for the delay, I had to reproduce the problem manually. > >> > >> Is there a machine on the compile farm where this is easy to reproduce? > > I don't think there is any arm machine in the compile farm. > > > >> Or could you attach the .optimized dump that corresponds to the > >> backtrace below? It looks like we end up with a comparison with an > >> unexpected return type. > >> > > > > I've compiled pr87746.c with -fdump-tree-ifcvt-details-blocks-details, > > here is the log. > > Is that what you need? > > Thanks. > The one from -fdump-tree-optimized would be closer to the ICE. Here it is. > Though it would also be convenient to know which stmt is being expanded > when we ICE, etc. I think it's when expanding _96 = _86 | _95; (that the value of "stmt" in expand_gimple_stmt_1 when we enter do_store_flag > Was I on the right track configuring with > --target=arm-none-linux-gnueabihf --with-cpu=cortex-a9 > --with-fpu=neon-fp16 > then compiling without any special option? > Maybe you also need --with-float=hard, I don't remember if it's implied by the 'hf' target suffix (I saw similar problems with arm-none-linux-gnueabi anyway) > > Thanks, > > > > Christophe > > > >>> Executed from: gcc.dg/vect/vect.exp > >>> gcc.dg/vect/pr59591-1.c (internal compiler error) > >>> gcc.dg/vect/pr59591-1.c -flto -ffat-lto-objects (internal compiler error) > >>> gcc.dg/vect/pr86927.c (internal compiler error) > >>> gcc.dg/vect/pr86927.c -flto -ffat-lto-objects (internal compiler error) > >>> gcc.dg/vect/slp-cond-5.c (internal compiler error) > >>> gcc.dg/vect/slp-cond-5.c -flto -ffat-lto-objects (internal compiler error) > >>> gcc.dg/vect/vect-23.c (internal compiler error) > >>> gcc.dg/vect/vect-23.c -flto -ffat-lto-objects (internal compiler error) > >>> gcc.dg/vect/vect-24.c (internal compiler error) > >>> gcc.dg/vect/vect-24.c -flto -ffat-lto-objects (internal compiler error) > >>> gcc.dg/vect/vect-cond-reduc-6.c (internal compiler error) > >>> gcc.dg/vect/vect-cond-reduc-6.c -flto -ffat-lto-objects (internal > >>> compiler error) > >>> > >>> Backtrace for gcc.c-torture/compile/20160205-1.c -O3 > >>> -fomit-frame-pointer -funroll-loops -fpeel-loops -ftracer > >>> -finline-functions > >>> during RTL pass: expand > >>> /gcc/testsuite/gcc.c-torture/compile/20160205-1.c:2:5: internal > >>> compiler error: in do_store_flag, at expr.c:12259 > >>> 0x8feb26 do_store_flag > >>> /gcc/expr.c:12259 > >>> 0x900201 expand_expr_real_2(separate_ops*, rtx_def*, machine_mode, > >>> expand_modifier) > >>> /gcc/expr.c:9617 > >>> 0x908cd0 expand_expr_real_1(tree_node*, rtx_def*, machine_mode, > >>> expand_modifier, rtx_def**, bool) > >>> /gcc/expr.c:10159 > >>> 0x91174e expand_expr > >>> /gcc/expr.h:282 > >>> 0x91174e expand_operands(tree_node*, tree_node*, rtx_def*, rtx_def**, > >>> rtx_def**, expand_modifier) > >>> /gcc/expr.c:8065 > >>> 0x8ff543 expand_expr_real_2(separate_ops*, rtx_def*, machine_mode, > >>> expand_modifier) > >>> /gcc/expr.c:9950 > >>> 0x908cd0 expand_expr_real_1(tree_node*, rtx_def*, machine_mode, > >>> expand_modifier, rtx_def**, bool) > >>> /gcc/expr.c:10159 > >>> 0x91174e expand_expr > >>> /gcc/expr.h:282 > >>> 0x91174e expand_operands(tree_node*, tree_node*, rtx_def*, rtx_def**, > >>> rtx_def**, expand_modifier) > >>> /gcc/expr.c:8065 > >>> 0x8ff543 expand_expr_real_2(separate_ops*, rtx_def*, machine_mode, > >>> expand_modifier) > >>> /gcc/expr.c:9950 > >>> 0x908cd0 expand_expr_real_1(tree_node*, rtx_def*, machine_mode, > >>> expand_modifier, rtx_def**, bool) > >>> /gcc/expr.c:10159 > >>> 0x91174e expand_expr > >>> /gcc/expr.h:282 > >>> 0x91174e expand_operands(tree_node*, tree_node*, rtx_def*, rtx_def**, > >>> rtx_def**, expand_modifier) > >>> /gcc/expr.c:8065 > >>> 0x8ff543 expand_expr_real_2(separate_ops*, rtx_def*, machine_mode, > >>> expand_modifier) > >>> /gcc/expr.c:9950 > >>> 0x908cd0 expand_expr_real_1(tree_node*, rtx_def*, machine_mode, > >>> expand_modifier, rtx_def**, bool) > >>> /gcc/expr.c:10159 > >>> 0x91174e expand_expr > >>> /gcc/expr.h:282 > >>> 0x91174e expand_operands(tree_node*, tree_node*, rtx_def*, rtx_def**, > >>> rtx_def**, expand_modifier) > >>> /gcc/expr.c:8065 > >>> 0x8ff543 expand_expr_real_2(separate_ops*, rtx_def*, machine_mode, > >>> expand_modifier) > >>> /gcc/expr.c:9950 > >>> 0x908cd0 expand_expr_real_1(tree_node*, rtx_def*, machine_mode, > >>> expand_modifier, rtx_def**, bool) > >>> /gcc/expr.c:10159 > >>> 0x91174e expand_expr > >>> /gcc/expr.h:282 > >>> > >>> Christophe > >> > >> -- > >> Marc Glisse > > > > -- > Marc Glisse
On Thu, 6 Aug 2020, Christophe Lyon wrote: >> Was I on the right track configuring with >> --target=arm-none-linux-gnueabihf --with-cpu=cortex-a9 >> --with-fpu=neon-fp16 >> then compiling without any special option? > > Maybe you also need --with-float=hard, I don't remember if it's > implied by the 'hf' target suffix Thanks! That's what I was missing to reproduce the issue. Now I can reproduce it with just typedef unsigned int vec __attribute__((vector_size(16))); typedef int vi __attribute__((vector_size(16))); vi f(vec a,vec b){ return a==5 | b==7; } with -fdisable-tree-forwprop1 -fdisable-tree-forwprop2 -fdisable-tree-forwprop3 -O1 _1 = a_5(D) == { 5, 5, 5, 5 }; _3 = b_6(D) == { 7, 7, 7, 7 }; _9 = _1 | _3; _7 = .VCOND (_9, { 0, 0, 0, 0 }, { -1, -1, -1, -1 }, { 0, 0, 0, 0 }, 107); we fail to expand the equality comparison (expand_vec_cmp_expr_p returns false), while with -fdisable-tree-forwprop4 we do manage to expand _2 = .VCONDU (a_5(D), { 5, 5, 5, 5 }, { -1, -1, -1, -1 }, { 0, 0, 0, 0 }, 112); It doesn't make much sense to me that we can expand the more complicated form and not the simpler form of the same operation (both compare a to 5 and produce a vector of -1 or 0 of the same size), especially when the target has an instruction (vceq) that does just what we want. Introducing boolean vectors was fine, but I think they should be real types, that we can operate on, not be forced to appear only as the first argument of a vcond. I can think of 2 natural ways to improve things: either implement vector comparisons in the ARM backend (possibly by forwarding to their existing code for vcond), or in the generic expansion code try using vcond if the direct comparison opcode is not provided. We can temporarily revert my patch, but I would like it to be temporary. Since aarch64 seems to handle the same code just fine, maybe someone who knows arm could copy the relevant code over? Does my message make sense, do people have comments?
On Thu, Aug 6, 2020 at 8:07 PM Marc Glisse <marc.glisse@inria.fr> wrote: > > On Thu, 6 Aug 2020, Christophe Lyon wrote: > > >> Was I on the right track configuring with > >> --target=arm-none-linux-gnueabihf --with-cpu=cortex-a9 > >> --with-fpu=neon-fp16 > >> then compiling without any special option? > > > > Maybe you also need --with-float=hard, I don't remember if it's > > implied by the 'hf' target suffix > > Thanks! That's what I was missing to reproduce the issue. Now I can > reproduce it with just > > typedef unsigned int vec __attribute__((vector_size(16))); > typedef int vi __attribute__((vector_size(16))); > vi f(vec a,vec b){ > return a==5 | b==7; > } > > with -fdisable-tree-forwprop1 -fdisable-tree-forwprop2 -fdisable-tree-forwprop3 -O1 > > _1 = a_5(D) == { 5, 5, 5, 5 }; > _3 = b_6(D) == { 7, 7, 7, 7 }; > _9 = _1 | _3; > _7 = .VCOND (_9, { 0, 0, 0, 0 }, { -1, -1, -1, -1 }, { 0, 0, 0, 0 }, 107); > > we fail to expand the equality comparison (expand_vec_cmp_expr_p returns > false), while with -fdisable-tree-forwprop4 we do manage to expand > > _2 = .VCONDU (a_5(D), { 5, 5, 5, 5 }, { -1, -1, -1, -1 }, { 0, 0, 0, 0 }, 112); > > It doesn't make much sense to me that we can expand the more complicated > form and not the simpler form of the same operation (both compare a to 5 > and produce a vector of -1 or 0 of the same size), especially when the > target has an instruction (vceq) that does just what we want. > > Introducing boolean vectors was fine, but I think they should be real > types, that we can operate on, not be forced to appear only as the first > argument of a vcond. > > I can think of 2 natural ways to improve things: either implement vector > comparisons in the ARM backend (possibly by forwarding to their existing > code for vcond), or in the generic expansion code try using vcond if the > direct comparison opcode is not provided. > > We can temporarily revert my patch, but I would like it to be temporary. > Since aarch64 seems to handle the same code just fine, maybe someone who > knows arm could copy the relevant code over? > > Does my message make sense, do people have comments? So what complicates things now (and to some extent pre-existed when you used AVX512 which _could_ operate on boolean vectors) is that we have split out the condition from VEC_COND_EXPR to separate stmts but we do not expect backends to be able to code-generate the separate form - instead we rely on the ISEL pass to trasform VEC_COND_EXPRs to .VCOND[U] "merging" the compares again. Now that process breaks down once we have things like _9 = _1 | _3; - at some point I argued that we should handle vector compares [and operations on boolean vectors] as well in ISEL but then when it came up again for some reason I disregarded that again. Thus - we don't want to go back to fixing up the generic expansion code (which looks at one instruction at a time and is restricted by TER single-use restrictions). Instead we want to deal with this in ISEL which should behave more intelligently. In the above case it might involve turning the _1 and _3 defs into .VCOND [with different result type], doing _9 in that type and then somehow dealing with _7 ... but this eventually means undoing the match simplification that introduced the code? Not sure if that helps though. Richard. > -- > Marc Glisse
On Fri, 7 Aug 2020, Richard Biener wrote: > On Thu, Aug 6, 2020 at 8:07 PM Marc Glisse <marc.glisse@inria.fr> wrote: >> >> On Thu, 6 Aug 2020, Christophe Lyon wrote: >> >>>> Was I on the right track configuring with >>>> --target=arm-none-linux-gnueabihf --with-cpu=cortex-a9 >>>> --with-fpu=neon-fp16 >>>> then compiling without any special option? >>> >>> Maybe you also need --with-float=hard, I don't remember if it's >>> implied by the 'hf' target suffix >> >> Thanks! That's what I was missing to reproduce the issue. Now I can >> reproduce it with just >> >> typedef unsigned int vec __attribute__((vector_size(16))); >> typedef int vi __attribute__((vector_size(16))); >> vi f(vec a,vec b){ >> return a==5 | b==7; >> } >> >> with -fdisable-tree-forwprop1 -fdisable-tree-forwprop2 -fdisable-tree-forwprop3 -O1 >> >> _1 = a_5(D) == { 5, 5, 5, 5 }; >> _3 = b_6(D) == { 7, 7, 7, 7 }; >> _9 = _1 | _3; >> _7 = .VCOND (_9, { 0, 0, 0, 0 }, { -1, -1, -1, -1 }, { 0, 0, 0, 0 }, 107); >> >> we fail to expand the equality comparison (expand_vec_cmp_expr_p returns >> false), while with -fdisable-tree-forwprop4 we do manage to expand >> >> _2 = .VCONDU (a_5(D), { 5, 5, 5, 5 }, { -1, -1, -1, -1 }, { 0, 0, 0, 0 }, 112); >> >> It doesn't make much sense to me that we can expand the more complicated >> form and not the simpler form of the same operation (both compare a to 5 >> and produce a vector of -1 or 0 of the same size), especially when the >> target has an instruction (vceq) that does just what we want. >> >> Introducing boolean vectors was fine, but I think they should be real >> types, that we can operate on, not be forced to appear only as the first >> argument of a vcond. >> >> I can think of 2 natural ways to improve things: either implement vector >> comparisons in the ARM backend (possibly by forwarding to their existing >> code for vcond), or in the generic expansion code try using vcond if the >> direct comparison opcode is not provided. >> >> We can temporarily revert my patch, but I would like it to be temporary. >> Since aarch64 seems to handle the same code just fine, maybe someone who >> knows arm could copy the relevant code over? >> >> Does my message make sense, do people have comments? > > So what complicates things now (and to some extent pre-existed when you > used AVX512 which _could_ operate on boolean vectors) is that we > have split out the condition from VEC_COND_EXPR to separate stmts > but we do not expect backends to be able to code-generate the separate > form - instead we rely on the ISEL pass to trasform VEC_COND_EXPRs > to .VCOND[U] "merging" the compares again. Now that process breaks > down once we have things like _9 = _1 | _3; - at some point I argued > that we should handle vector compares [and operations on boolean vectors] > as well in ISEL but then when it came up again for some reason I > disregarded that again. > > Thus - we don't want to go back to fixing up the generic expansion code > (which looks at one instruction at a time and is restricted by TER single-use > restrictions). Here, it would only be handling comparisons left over by ISEL because they do not feed a VEC_COND_EXPR, maybe not so bad. Handling it in ISEL would be more consistent, but then we may have to teach the vector lowering pass about this. > Instead we want to deal with this in ISEL which should > behave more intelligently. In the above case it might involve turning > the _1 and _3 defs into .VCOND [with different result type], doing > _9 in that type and then somehow dealing with _7 ... but this eventually > means undoing the match simplification that introduced the code? For targets that do not have compact boolean vectors, VEC_COND_EXPR<*,-1,0> does nothing, it is just there as a technicality. Removing it to allow more simplifications makes sense, reintroducing it for expansion can make sense as well, I think it is ok if the second one reverses the first, but very locally, without having to propagate a change of type to the uses. So on ARM we could turn _1 and _3 into .VCOND producing bool:32 vectors, and replace _7 with a VIEW_CONVERT_EXPR. Or does using bool vectors like that seem bad? (maybe they aren't guaranteed to be equivalent to signed integer vectors with values 0 and -1 and we need to query the target to know if that is the case, or introduce an extra .VCOND) Also, we only want to replace comparisons with .VCOND if the target doesn't handle the comparison directly. For AVX512, we do want to produce SImode bool vectors (for k* registers) and operate on them directly, that's the whole point of introducing the bool vectors (if bool vectors were only used to feed VEC_COND_EXPR and were all turned into .VCOND before expansion, I don't see the point of specifying different types for bool vectors for AVX512 vs non-AVX512, as it would make no difference on what is passed to the backend). I think targets should provide some number of operations, including for instance bit_and and bit_ior on bool vectors, and be a bit consistent about what they provide, it becomes unmanageable in the middle-end otherwise...
On Fri, Aug 7, 2020 at 10:33 AM Marc Glisse <marc.glisse@inria.fr> wrote: > > On Fri, 7 Aug 2020, Richard Biener wrote: > > > On Thu, Aug 6, 2020 at 8:07 PM Marc Glisse <marc.glisse@inria.fr> wrote: > >> > >> On Thu, 6 Aug 2020, Christophe Lyon wrote: > >> > >>>> Was I on the right track configuring with > >>>> --target=arm-none-linux-gnueabihf --with-cpu=cortex-a9 > >>>> --with-fpu=neon-fp16 > >>>> then compiling without any special option? > >>> > >>> Maybe you also need --with-float=hard, I don't remember if it's > >>> implied by the 'hf' target suffix > >> > >> Thanks! That's what I was missing to reproduce the issue. Now I can > >> reproduce it with just > >> > >> typedef unsigned int vec __attribute__((vector_size(16))); > >> typedef int vi __attribute__((vector_size(16))); > >> vi f(vec a,vec b){ > >> return a==5 | b==7; > >> } > >> > >> with -fdisable-tree-forwprop1 -fdisable-tree-forwprop2 -fdisable-tree-forwprop3 -O1 > >> > >> _1 = a_5(D) == { 5, 5, 5, 5 }; > >> _3 = b_6(D) == { 7, 7, 7, 7 }; > >> _9 = _1 | _3; > >> _7 = .VCOND (_9, { 0, 0, 0, 0 }, { -1, -1, -1, -1 }, { 0, 0, 0, 0 }, 107); > >> > >> we fail to expand the equality comparison (expand_vec_cmp_expr_p returns > >> false), while with -fdisable-tree-forwprop4 we do manage to expand > >> > >> _2 = .VCONDU (a_5(D), { 5, 5, 5, 5 }, { -1, -1, -1, -1 }, { 0, 0, 0, 0 }, 112); > >> > >> It doesn't make much sense to me that we can expand the more complicated > >> form and not the simpler form of the same operation (both compare a to 5 > >> and produce a vector of -1 or 0 of the same size), especially when the > >> target has an instruction (vceq) that does just what we want. > >> > >> Introducing boolean vectors was fine, but I think they should be real > >> types, that we can operate on, not be forced to appear only as the first > >> argument of a vcond. > >> > >> I can think of 2 natural ways to improve things: either implement vector > >> comparisons in the ARM backend (possibly by forwarding to their existing > >> code for vcond), or in the generic expansion code try using vcond if the > >> direct comparison opcode is not provided. > >> > >> We can temporarily revert my patch, but I would like it to be temporary. > >> Since aarch64 seems to handle the same code just fine, maybe someone who > >> knows arm could copy the relevant code over? > >> > >> Does my message make sense, do people have comments? > > > > So what complicates things now (and to some extent pre-existed when you > > used AVX512 which _could_ operate on boolean vectors) is that we > > have split out the condition from VEC_COND_EXPR to separate stmts > > but we do not expect backends to be able to code-generate the separate > > form - instead we rely on the ISEL pass to trasform VEC_COND_EXPRs > > to .VCOND[U] "merging" the compares again. Now that process breaks > > down once we have things like _9 = _1 | _3; - at some point I argued > > that we should handle vector compares [and operations on boolean vectors] > > as well in ISEL but then when it came up again for some reason I > > disregarded that again. > > > > Thus - we don't want to go back to fixing up the generic expansion code > > (which looks at one instruction at a time and is restricted by TER single-use > > restrictions). > > Here, it would only be handling comparisons left over by ISEL because they > do not feed a VEC_COND_EXPR, maybe not so bad. Handling it in ISEL would > be more consistent, but then we may have to teach the vector lowering pass > about this. > > > Instead we want to deal with this in ISEL which should > > behave more intelligently. In the above case it might involve turning > > the _1 and _3 defs into .VCOND [with different result type], doing > > _9 in that type and then somehow dealing with _7 ... but this eventually > > means undoing the match simplification that introduced the code? > > For targets that do not have compact boolean vectors, > VEC_COND_EXPR<*,-1,0> does nothing, it is just there as a technicality. > Removing it to allow more simplifications makes sense, reintroducing it > for expansion can make sense as well, I think it is ok if the second one > reverses the first, but very locally, without having to propagate a change > of type to the uses. So on ARM we could turn _1 and _3 into .VCOND > producing bool:32 vectors, and replace _7 with a VIEW_CONVERT_EXPR. Or > does using bool vectors like that seem bad? (maybe they aren't guaranteed > to be equivalent to signed integer vectors with values 0 and -1 and we > need to query the target to know if that is the case, or introduce an > extra .VCOND) > > Also, we only want to replace comparisons with .VCOND if the target > doesn't handle the comparison directly. For AVX512, we do want to produce > SImode bool vectors (for k* registers) and operate on them directly, > that's the whole point of introducing the bool vectors (if bool vectors > were only used to feed VEC_COND_EXPR and were all turned into .VCOND > before expansion, I don't see the point of specifying different types for > bool vectors for AVX512 vs non-AVX512, as it would make no difference on > what is passed to the backend). > > I think targets should provide some number of operations, including for > instance bit_and and bit_ior on bool vectors, and be a bit consistent > about what they provide, it becomes unmanageable in the middle-end > otherwise... It's currently somewhat of a mess I guess. It might help to restrict the simplification patterns to passes before vector lowering so maybe add something like (or re-use) canonicalize_math[_after_vectorization]_p ()? Richard. > > -- > Marc Glisse
On Fri, 7 Aug 2020, Richard Biener wrote: > On Fri, Aug 7, 2020 at 10:33 AM Marc Glisse <marc.glisse@inria.fr> wrote: >> >> On Fri, 7 Aug 2020, Richard Biener wrote: >> >>> On Thu, Aug 6, 2020 at 8:07 PM Marc Glisse <marc.glisse@inria.fr> wrote: >>>> >>>> On Thu, 6 Aug 2020, Christophe Lyon wrote: >>>> >>>>>> Was I on the right track configuring with >>>>>> --target=arm-none-linux-gnueabihf --with-cpu=cortex-a9 >>>>>> --with-fpu=neon-fp16 >>>>>> then compiling without any special option? >>>>> >>>>> Maybe you also need --with-float=hard, I don't remember if it's >>>>> implied by the 'hf' target suffix >>>> >>>> Thanks! That's what I was missing to reproduce the issue. Now I can >>>> reproduce it with just >>>> >>>> typedef unsigned int vec __attribute__((vector_size(16))); >>>> typedef int vi __attribute__((vector_size(16))); >>>> vi f(vec a,vec b){ >>>> return a==5 | b==7; >>>> } >>>> >>>> with -fdisable-tree-forwprop1 -fdisable-tree-forwprop2 -fdisable-tree-forwprop3 -O1 >>>> >>>> _1 = a_5(D) == { 5, 5, 5, 5 }; >>>> _3 = b_6(D) == { 7, 7, 7, 7 }; >>>> _9 = _1 | _3; >>>> _7 = .VCOND (_9, { 0, 0, 0, 0 }, { -1, -1, -1, -1 }, { 0, 0, 0, 0 }, 107); >>>> >>>> we fail to expand the equality comparison (expand_vec_cmp_expr_p returns >>>> false), while with -fdisable-tree-forwprop4 we do manage to expand >>>> >>>> _2 = .VCONDU (a_5(D), { 5, 5, 5, 5 }, { -1, -1, -1, -1 }, { 0, 0, 0, 0 }, 112); >>>> >>>> It doesn't make much sense to me that we can expand the more complicated >>>> form and not the simpler form of the same operation (both compare a to 5 >>>> and produce a vector of -1 or 0 of the same size), especially when the >>>> target has an instruction (vceq) that does just what we want. >>>> >>>> Introducing boolean vectors was fine, but I think they should be real >>>> types, that we can operate on, not be forced to appear only as the first >>>> argument of a vcond. >>>> >>>> I can think of 2 natural ways to improve things: either implement vector >>>> comparisons in the ARM backend (possibly by forwarding to their existing >>>> code for vcond), or in the generic expansion code try using vcond if the >>>> direct comparison opcode is not provided. >>>> >>>> We can temporarily revert my patch, but I would like it to be temporary. >>>> Since aarch64 seems to handle the same code just fine, maybe someone who >>>> knows arm could copy the relevant code over? >>>> >>>> Does my message make sense, do people have comments? >>> >>> So what complicates things now (and to some extent pre-existed when you >>> used AVX512 which _could_ operate on boolean vectors) is that we >>> have split out the condition from VEC_COND_EXPR to separate stmts >>> but we do not expect backends to be able to code-generate the separate >>> form - instead we rely on the ISEL pass to trasform VEC_COND_EXPRs >>> to .VCOND[U] "merging" the compares again. Now that process breaks >>> down once we have things like _9 = _1 | _3; - at some point I argued >>> that we should handle vector compares [and operations on boolean vectors] >>> as well in ISEL but then when it came up again for some reason I >>> disregarded that again. >>> >>> Thus - we don't want to go back to fixing up the generic expansion code >>> (which looks at one instruction at a time and is restricted by TER single-use >>> restrictions). >> >> Here, it would only be handling comparisons left over by ISEL because they >> do not feed a VEC_COND_EXPR, maybe not so bad. Handling it in ISEL would >> be more consistent, but then we may have to teach the vector lowering pass >> about this. >> >>> Instead we want to deal with this in ISEL which should >>> behave more intelligently. In the above case it might involve turning >>> the _1 and _3 defs into .VCOND [with different result type], doing >>> _9 in that type and then somehow dealing with _7 ... but this eventually >>> means undoing the match simplification that introduced the code? >> >> For targets that do not have compact boolean vectors, >> VEC_COND_EXPR<*,-1,0> does nothing, it is just there as a technicality. >> Removing it to allow more simplifications makes sense, reintroducing it >> for expansion can make sense as well, I think it is ok if the second one >> reverses the first, but very locally, without having to propagate a change >> of type to the uses. So on ARM we could turn _1 and _3 into .VCOND >> producing bool:32 vectors, and replace _7 with a VIEW_CONVERT_EXPR. Or >> does using bool vectors like that seem bad? (maybe they aren't guaranteed >> to be equivalent to signed integer vectors with values 0 and -1 and we >> need to query the target to know if that is the case, or introduce an >> extra .VCOND) >> >> Also, we only want to replace comparisons with .VCOND if the target >> doesn't handle the comparison directly. For AVX512, we do want to produce >> SImode bool vectors (for k* registers) and operate on them directly, >> that's the whole point of introducing the bool vectors (if bool vectors >> were only used to feed VEC_COND_EXPR and were all turned into .VCOND >> before expansion, I don't see the point of specifying different types for >> bool vectors for AVX512 vs non-AVX512, as it would make no difference on >> what is passed to the backend). >> >> I think targets should provide some number of operations, including for >> instance bit_and and bit_ior on bool vectors, and be a bit consistent >> about what they provide, it becomes unmanageable in the middle-end >> otherwise... > > It's currently somewhat of a mess I guess. It might help to > restrict the simplification patterns to passes before vector lowering > so maybe add something like (or re-use) > canonicalize_math[_after_vectorization]_p ()? That's indeed a nice way to gain time to sort out the mess :-) This patch solved the issue on ARM for at least 1 testcase. I have a bootstrap+regtest running on x86_64-linux-gnu, at least the tests added in the previous patch still work. 2020-08-05 Marc Glisse <marc.glisse@inria.fr> * generic-match-head.c (optimize_vectors_before_lowering_p): New function. * gimple-match-head.c (optimize_vectors_before_lowering_p): Likewise. * match.pd ((v ? w : 0) ? a : b, c1 ? c2 ? a : b : b): Use it.
On Fri, Aug 7, 2020 at 2:15 PM Marc Glisse <marc.glisse@inria.fr> wrote: > > On Fri, 7 Aug 2020, Richard Biener wrote: > > > On Fri, Aug 7, 2020 at 10:33 AM Marc Glisse <marc.glisse@inria.fr> wrote: > >> > >> On Fri, 7 Aug 2020, Richard Biener wrote: > >> > >>> On Thu, Aug 6, 2020 at 8:07 PM Marc Glisse <marc.glisse@inria.fr> wrote: > >>>> > >>>> On Thu, 6 Aug 2020, Christophe Lyon wrote: > >>>> > >>>>>> Was I on the right track configuring with > >>>>>> --target=arm-none-linux-gnueabihf --with-cpu=cortex-a9 > >>>>>> --with-fpu=neon-fp16 > >>>>>> then compiling without any special option? > >>>>> > >>>>> Maybe you also need --with-float=hard, I don't remember if it's > >>>>> implied by the 'hf' target suffix > >>>> > >>>> Thanks! That's what I was missing to reproduce the issue. Now I can > >>>> reproduce it with just > >>>> > >>>> typedef unsigned int vec __attribute__((vector_size(16))); > >>>> typedef int vi __attribute__((vector_size(16))); > >>>> vi f(vec a,vec b){ > >>>> return a==5 | b==7; > >>>> } > >>>> > >>>> with -fdisable-tree-forwprop1 -fdisable-tree-forwprop2 -fdisable-tree-forwprop3 -O1 > >>>> > >>>> _1 = a_5(D) == { 5, 5, 5, 5 }; > >>>> _3 = b_6(D) == { 7, 7, 7, 7 }; > >>>> _9 = _1 | _3; > >>>> _7 = .VCOND (_9, { 0, 0, 0, 0 }, { -1, -1, -1, -1 }, { 0, 0, 0, 0 }, 107); > >>>> > >>>> we fail to expand the equality comparison (expand_vec_cmp_expr_p returns > >>>> false), while with -fdisable-tree-forwprop4 we do manage to expand > >>>> > >>>> _2 = .VCONDU (a_5(D), { 5, 5, 5, 5 }, { -1, -1, -1, -1 }, { 0, 0, 0, 0 }, 112); > >>>> > >>>> It doesn't make much sense to me that we can expand the more complicated > >>>> form and not the simpler form of the same operation (both compare a to 5 > >>>> and produce a vector of -1 or 0 of the same size), especially when the > >>>> target has an instruction (vceq) that does just what we want. > >>>> > >>>> Introducing boolean vectors was fine, but I think they should be real > >>>> types, that we can operate on, not be forced to appear only as the first > >>>> argument of a vcond. > >>>> > >>>> I can think of 2 natural ways to improve things: either implement vector > >>>> comparisons in the ARM backend (possibly by forwarding to their existing > >>>> code for vcond), or in the generic expansion code try using vcond if the > >>>> direct comparison opcode is not provided. > >>>> > >>>> We can temporarily revert my patch, but I would like it to be temporary. > >>>> Since aarch64 seems to handle the same code just fine, maybe someone who > >>>> knows arm could copy the relevant code over? > >>>> > >>>> Does my message make sense, do people have comments? > >>> > >>> So what complicates things now (and to some extent pre-existed when you > >>> used AVX512 which _could_ operate on boolean vectors) is that we > >>> have split out the condition from VEC_COND_EXPR to separate stmts > >>> but we do not expect backends to be able to code-generate the separate > >>> form - instead we rely on the ISEL pass to trasform VEC_COND_EXPRs > >>> to .VCOND[U] "merging" the compares again. Now that process breaks > >>> down once we have things like _9 = _1 | _3; - at some point I argued > >>> that we should handle vector compares [and operations on boolean vectors] > >>> as well in ISEL but then when it came up again for some reason I > >>> disregarded that again. > >>> > >>> Thus - we don't want to go back to fixing up the generic expansion code > >>> (which looks at one instruction at a time and is restricted by TER single-use > >>> restrictions). > >> > >> Here, it would only be handling comparisons left over by ISEL because they > >> do not feed a VEC_COND_EXPR, maybe not so bad. Handling it in ISEL would > >> be more consistent, but then we may have to teach the vector lowering pass > >> about this. > >> > >>> Instead we want to deal with this in ISEL which should > >>> behave more intelligently. In the above case it might involve turning > >>> the _1 and _3 defs into .VCOND [with different result type], doing > >>> _9 in that type and then somehow dealing with _7 ... but this eventually > >>> means undoing the match simplification that introduced the code? > >> > >> For targets that do not have compact boolean vectors, > >> VEC_COND_EXPR<*,-1,0> does nothing, it is just there as a technicality. > >> Removing it to allow more simplifications makes sense, reintroducing it > >> for expansion can make sense as well, I think it is ok if the second one > >> reverses the first, but very locally, without having to propagate a change > >> of type to the uses. So on ARM we could turn _1 and _3 into .VCOND > >> producing bool:32 vectors, and replace _7 with a VIEW_CONVERT_EXPR. Or > >> does using bool vectors like that seem bad? (maybe they aren't guaranteed > >> to be equivalent to signed integer vectors with values 0 and -1 and we > >> need to query the target to know if that is the case, or introduce an > >> extra .VCOND) > >> > >> Also, we only want to replace comparisons with .VCOND if the target > >> doesn't handle the comparison directly. For AVX512, we do want to produce > >> SImode bool vectors (for k* registers) and operate on them directly, > >> that's the whole point of introducing the bool vectors (if bool vectors > >> were only used to feed VEC_COND_EXPR and were all turned into .VCOND > >> before expansion, I don't see the point of specifying different types for > >> bool vectors for AVX512 vs non-AVX512, as it would make no difference on > >> what is passed to the backend). > >> > >> I think targets should provide some number of operations, including for > >> instance bit_and and bit_ior on bool vectors, and be a bit consistent > >> about what they provide, it becomes unmanageable in the middle-end > >> otherwise... > > > > It's currently somewhat of a mess I guess. It might help to > > restrict the simplification patterns to passes before vector lowering > > so maybe add something like (or re-use) > > canonicalize_math[_after_vectorization]_p ()? > > That's indeed a nice way to gain time to sort out the mess :-) > > This patch solved the issue on ARM for at least 1 testcase. I have a > bootstrap+regtest running on x86_64-linux-gnu, at least the tests added in > the previous patch still work. So let's go with this for now (assuming bootstrap / testing works). I guess we should have a look at whether the foldings make anything worse. If you hit a case where vector lowering is required with but not without folding then that's likely the case. Now, that would also mainly ask for vector lowering to be improved of course. Richard. > 2020-08-05 Marc Glisse <marc.glisse@inria.fr> > > * generic-match-head.c (optimize_vectors_before_lowering_p): New > function. > * gimple-match-head.c (optimize_vectors_before_lowering_p): > Likewise. > * match.pd ((v ? w : 0) ? a : b, c1 ? c2 ? a : b : b): Use it. > > -- > Marc Glisse
diff --git a/gcc/match.pd b/gcc/match.pd index a052c9e3dbc..f9297fcadbe 100644 --- a/gcc/match.pd +++ b/gcc/match.pd @@ -3436,20 +3436,66 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT) (if (integer_zerop (@0)) @2))) -/* Sink unary operations to constant branches, but only if we do fold it to - constants. */ +#if GIMPLE +/* Sink unary operations to branches, but only if we do fold both. */ (for op (negate bit_not abs absu) (simplify - (op (vec_cond @0 VECTOR_CST@1 VECTOR_CST@2)) - (with - { - tree cst1, cst2; - cst1 = const_unop (op, type, @1); - if (cst1) - cst2 = const_unop (op, type, @2); - } - (if (cst1 && cst2) - (vec_cond @0 { cst1; } { cst2; }))))) + (op (vec_cond:s @0 @1 @2)) + (vec_cond @0 (op! @1) (op! @2)))) + +/* Sink binary operation to branches, but only if we can fold it. */ +(for op (tcc_comparison plus minus mult bit_and bit_ior bit_xor + rdiv trunc_div ceil_div floor_div round_div + trunc_mod ceil_mod floor_mod round_mod min max) +/* (c ? a : b) op (c ? d : e) --> c ? (a op d) : (b op e) */ + (simplify + (op (vec_cond:s @0 @1 @2) (vec_cond:s @0 @3 @4)) + (vec_cond @0 (op! @1 @3) (op! @2 @4))) + +/* (c ? a : b) op d --> c ? (a op d) : (b op d) */ + (simplify + (op (vec_cond:s @0 @1 @2) @3) + (vec_cond @0 (op! @1 @3) (op! @2 @3))) + (simplify + (op @3 (vec_cond:s @0 @1 @2)) + (vec_cond @0 (op! @3 @1) (op! @3 @2)))) +#endif + +/* (v ? w : 0) ? a : b is just (v & w) ? a : b */ +(simplify + (vec_cond (vec_cond:s @0 @3 integer_zerop) @1 @2) + (if (types_match (@0, @3)) + (vec_cond (bit_and @0 @3) @1 @2))) +(simplify + (vec_cond (vec_cond:s @0 integer_all_onesp @3) @1 @2) + (if (types_match (@0, @3)) + (vec_cond (bit_ior @0 @3) @1 @2))) +(simplify + (vec_cond (vec_cond:s @0 integer_zerop @3) @1 @2) + (if (types_match (@0, @3)) + (vec_cond (bit_ior @0 (bit_not @3)) @2 @1))) +(simplify + (vec_cond (vec_cond:s @0 @3 integer_all_onesp) @1 @2) + (if (types_match (@0, @3)) + (vec_cond (bit_and @0 (bit_not @3)) @2 @1))) + +/* c1 ? c2 ? a : b : b --> (c1 & c2) ? a : b */ +(simplify + (vec_cond @0 (vec_cond:s @1 @2 @3) @3) + (if (types_match (@0, @1)) + (vec_cond (bit_and @0 @1) @2 @3))) +(simplify + (vec_cond @0 @2 (vec_cond:s @1 @2 @3)) + (if (types_match (@0, @1)) + (vec_cond (bit_ior @0 @1) @2 @3))) +(simplify + (vec_cond @0 (vec_cond:s @1 @2 @3) @2) + (if (types_match (@0, @1)) + (vec_cond (bit_ior (bit_not @0) @1) @2 @3))) +(simplify + (vec_cond @0 @3 (vec_cond:s @1 @2 @3)) + (if (types_match (@0, @1)) + (vec_cond (bit_and (bit_not @0) @1) @2 @3))) /* Simplification moved from fold_cond_expr_with_comparison. It may also be extended. */ diff --git a/gcc/testsuite/gcc.dg/tree-ssa/andnot-2.c b/gcc/testsuite/gcc.dg/tree-ssa/andnot-2.c new file mode 100644 index 00000000000..e0955ce3ffd --- /dev/null +++ b/gcc/testsuite/gcc.dg/tree-ssa/andnot-2.c @@ -0,0 +1,10 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -fdump-tree-forwprop3-raw -w -Wno-psabi" } */ + +typedef long vec __attribute__((vector_size(16))); +vec f(vec x){ + vec y = x < 10; + return y & (y == 0); +} + +/* { dg-final { scan-tree-dump-not "_expr" "forwprop3" } } */ diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr95906.c b/gcc/testsuite/gcc.dg/tree-ssa/pr95906.c new file mode 100644 index 00000000000..3d820a58e93 --- /dev/null +++ b/gcc/testsuite/gcc.dg/tree-ssa/pr95906.c @@ -0,0 +1,13 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -fdump-tree-forwprop3-raw -w -Wno-psabi" } */ + +// FIXME: this should further optimize to a MAX_EXPR +typedef signed char v16i8 __attribute__((vector_size(16))); +v16i8 f(v16i8 a, v16i8 b) +{ + v16i8 cmp = (a > b); + return (cmp & a) | (~cmp & b); +} + +/* { dg-final { scan-tree-dump-not "bit_(and|ior)_expr" "forwprop3" } } */ +/* { dg-final { scan-tree-dump-times "vec_cond_expr" 1 "forwprop3" } } */ diff --git a/gcc/testsuite/gcc.target/i386/pr70314.c b/gcc/testsuite/gcc.target/i386/pr70314.c new file mode 100644 index 00000000000..aad8dd9b57e --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/pr70314.c @@ -0,0 +1,15 @@ +/* { dg-do compile } */ +/* { dg-options "-march=skylake-avx512 -O2" } */ +/* { dg-final { scan-assembler-times "cmp" 2 } } */ +/* { dg-final { scan-assembler-not "and" } } */ + +typedef long vec __attribute__((vector_size(16))); +vec f(vec x, vec y){ + return (x < 5) & (y < 8); +} + +/* On x86_64, currently + vpcmpq $2, .LC1(%rip), %xmm1, %k1 + vpcmpq $2, .LC0(%rip), %xmm0, %k0{%k1} + vpmovm2q %k0, %xmm0 +*/