Series: [RFC, i386] Autovectorize 8-byte vectors
Message ID: CAFULd4ZqqnAiyXC75NmCo_YbA64PF60PZb3qWJDe+=xu0Oudjg@mail.gmail.com
State: New
On Wed, Jun 26, 2019 at 10:17 AM Uros Bizjak <ubizjak@gmail.com> wrote:
>
> Now that TARGET_MMX_WITH_SSE is implemented, the compiler should be
> able to auto-vectorize:

On a related note, the following slightly changed testcase:

void
foo (char *restrict r, char *restrict a)
{
  for (int i = 0; i < 24; i++)
    r[i] += a[i];
}

compiles to:

foo:
        vmovdqu (%rdi), %xmm1
        vpaddb  (%rsi), %xmm1, %xmm0
        movzbl  16(%rsi), %eax
        addb    %al, 16(%rdi)
        vmovups %xmm0, (%rdi)
        movzbl  17(%rsi), %eax
        addb    %al, 17(%rdi)
        movzbl  18(%rsi), %eax
        addb    %al, 18(%rdi)
        movzbl  19(%rsi), %eax
        addb    %al, 19(%rdi)
        movzbl  20(%rsi), %eax
        addb    %al, 20(%rdi)
        movzbl  21(%rsi), %eax
        addb    %al, 21(%rdi)
        movzbl  22(%rsi), %eax
        addb    %al, 22(%rdi)
        movzbl  23(%rsi), %eax
        addb    %al, 23(%rdi)
        ret

One would expect that the remaining 8-byte array would also get
vectorized, resulting in one 16-byte operation and one 8-byte
operation.

Uros.
On June 26, 2019 10:17:26 AM GMT+02:00, Uros Bizjak <ubizjak@gmail.com> wrote:
>Now that TARGET_MMX_WITH_SSE is implemented, the compiler should be
>able to auto-vectorize:
>
>void
>foo (char *restrict r, char *restrict a)
>{
>  for (int i = 0; i < 8; i++)
>    r[i] += a[i];
>}
>
>Attached patch enables the conversion and produces:
>
>foo:
>        movq    (%rdi), %xmm1
>        movq    (%rsi), %xmm0
>        paddb   %xmm1, %xmm0
>        movq    %xmm0, (%rdi)
>        ret
>
>Please note that the patch regresses
>
>FAIL: gcc.target/i386/sse2-vect-simd-11.c scan-tree-dump-times vect
>"vectorized [1-3] loops" 2
>FAIL: gcc.target/i386/sse2-vect-simd-15.c scan-tree-dump-times vect
>"vectorized [1-3] loops" 2
>
>For some reason, the compiler decides to vectorize with 8-byte
>vectors, resulting in:
>
>missed: not vectorized: relevant stmt not supported: _8 = (short
>unsigned int) _4;
>missed: bad operation or unsupported loop bound.
>missed: couldn't vectorize loop
>
>However, the unpatched compiler is able to vectorize loop using
>16-byte vectors. It looks that the compiler should re-run
>vectorization with wider vectors, if vectorization with narrower
>vectors fails. Jakub, Richard, do you have any insight in this issue?

Double check the ordering of the vector size pushes - it should already
iterate but first successful wins.

Richard.

>2019-06-26  Uroš Bizjak  <ubizjak@gmail.com>
>
>    * config/i386/i386.c (ix86_autovectorize_vector_sizes):
>    Autovectorize 8-byte vectors for TARGET_MMX_WITH_SSE.
>
>testsuite/ChangeLog:
>
>2019-06-26  Uroš Bizjak  <ubizjak@gmail.com>
>
>    * lib/target-supports.exp (available_vector_sizes)
>    <[istarget i?86-*-*] || [istarget x86_64-*-*]>: Add
>    64-bit vectors for !ia32.
>
>The patch was bootstrapped and regression tested on x86_64-linux-gnu
>{,-m32}.
>
>Uros.
On June 26, 2019 10:25:44 AM GMT+02:00, Uros Bizjak <ubizjak@gmail.com> wrote:
>On Wed, Jun 26, 2019 at 10:17 AM Uros Bizjak <ubizjak@gmail.com> wrote:
>>
>> Now that TARGET_MMX_WITH_SSE is implemented, the compiler should be
>> able to auto-vectorize:
>
>On a related note, the following slightly changed testcase:
>
>void
>foo (char *restrict r, char *restrict a)
>{
>  for (int i = 0; i < 24; i++)
>    r[i] += a[i];
>}
>
>compiles to:
>
>foo:
>        vmovdqu (%rdi), %xmm1
>        vpaddb  (%rsi), %xmm1, %xmm0
>        movzbl  16(%rsi), %eax
>        addb    %al, 16(%rdi)
>        vmovups %xmm0, (%rdi)
>        movzbl  17(%rsi), %eax
>        addb    %al, 17(%rdi)
>        movzbl  18(%rsi), %eax
>        addb    %al, 18(%rdi)
>        movzbl  19(%rsi), %eax
>        addb    %al, 19(%rdi)
>        movzbl  20(%rsi), %eax
>        addb    %al, 20(%rdi)
>        movzbl  21(%rsi), %eax
>        addb    %al, 21(%rdi)
>        movzbl  22(%rsi), %eax
>        addb    %al, 22(%rdi)
>        movzbl  23(%rsi), %eax
>        addb    %al, 23(%rdi)
>        ret
>
>One would expect that the remaining 8-byte array would also get
>vectorized, resulting in one 16-byte operation and one 8-byte
>operation.

Try --param vect-epilogue-nomask=1 (or so).

Richard.

>Uros.
On Wed, Jun 26, 2019 at 10:36 AM Richard Biener <rguenther@suse.de> wrote:
>
> On June 26, 2019 10:25:44 AM GMT+02:00, Uros Bizjak <ubizjak@gmail.com> wrote:
> >On Wed, Jun 26, 2019 at 10:17 AM Uros Bizjak <ubizjak@gmail.com> wrote:
> >>
> >> Now that TARGET_MMX_WITH_SSE is implemented, the compiler should be
> >> able to auto-vectorize:
> >
> >On a related note, the following slightly changed testcase:
> >
> >void
> >foo (char *restrict r, char *restrict a)
> >{
> >  for (int i = 0; i < 24; i++)
> >    r[i] += a[i];
> >}
> >
> >compiles to:
> >
> >foo:
> >        vmovdqu (%rdi), %xmm1
> >        vpaddb  (%rsi), %xmm1, %xmm0
> >        movzbl  16(%rsi), %eax
> >        addb    %al, 16(%rdi)
> >        vmovups %xmm0, (%rdi)
> >        movzbl  17(%rsi), %eax
> >        addb    %al, 17(%rdi)
> >        movzbl  18(%rsi), %eax
> >        addb    %al, 18(%rdi)
> >        movzbl  19(%rsi), %eax
> >        addb    %al, 19(%rdi)
> >        movzbl  20(%rsi), %eax
> >        addb    %al, 20(%rdi)
> >        movzbl  21(%rsi), %eax
> >        addb    %al, 21(%rdi)
> >        movzbl  22(%rsi), %eax
> >        addb    %al, 22(%rdi)
> >        movzbl  23(%rsi), %eax
> >        addb    %al, 23(%rdi)
> >        ret
> >
> >One would expect that the remaining 8-byte array would also get
> >vectorized, resulting in one 16-byte operation and one 8-byte
> >operation.
>
> Try --param vect-epilogue-nomask=1 (or so).

Yes, this (--param vect-epilogues-nomask=1) works!

foo:
        movdqu  (%rdi), %xmm0
        movdqu  (%rsi), %xmm2
        movq    16(%rsi), %xmm1
        paddb   %xmm2, %xmm0
        movups  %xmm0, (%rdi)
        movq    16(%rdi), %xmm0
        paddb   %xmm1, %xmm0
        movq    %xmm0, 16(%rdi)
        ret

Thanks,
Uros.
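With the parameter enabled, the 24-byte loop is thus split into a
16-byte vectorized main body and an 8-byte vectorized epilogue.  For
illustration, a hand-written C equivalent of that shape; the unaligned,
may_alias vector typedefs are assumptions made here so the example is
well-defined, and the vectorizer of course uses its own internal types:

typedef char v16qi __attribute__ ((vector_size (16), aligned (1), may_alias));
typedef char v8qi  __attribute__ ((vector_size (8), aligned (1), may_alias));

void
foo_by_hand (char *restrict r, char *restrict a)
{
  /* Bytes 0..15: the movdqu/paddb/movups main body above.  */
  *(v16qi *) r += *(v16qi *) a;
  /* Bytes 16..23: the movq/paddb/movq epilogue above.  */
  *(v8qi *) (r + 16) += *(v8qi *) (a + 16);
}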
On Wed, Jun 26, 2019 at 10:17:26AM +0200, Uros Bizjak wrote:
> Please note that the patch regresses
>
> FAIL: gcc.target/i386/sse2-vect-simd-11.c scan-tree-dump-times vect
> "vectorized [1-3] loops" 2
> FAIL: gcc.target/i386/sse2-vect-simd-15.c scan-tree-dump-times vect
> "vectorized [1-3] loops" 2
>
> For some reason, the compiler decides to vectorize with 8-byte
> vectors, resulting in:
>
> missed: not vectorized: relevant stmt not supported: _8 = (short
> unsigned int) _4;
> missed: bad operation or unsupported loop bound.
> missed: couldn't vectorize loop
>
> However, the unpatched compiler is able to vectorize loop using
> 16-byte vectors. It looks that the compiler should re-run
> vectorization with wider vectors, if vectorization with narrower
> vectors fails. Jakub, Richard, do you have any insight in this issue?
>
> 2019-06-26  Uroš Bizjak  <ubizjak@gmail.com>
>
>     * config/i386/i386.c (ix86_autovectorize_vector_sizes):
>     Autovectorize 8-byte vectors for TARGET_MMX_WITH_SSE.

The patch isn't correct if TARGET_MMX_WITH_SSE, but not TARGET_AVX, because
in that case it will push only that 8 and nothing else, while you really
want to have 16 and 8 in that order, so that it tries to vectorize first
with 16-byte vectors and fall back to 8-byte.  The hook is supposed to
either push nothing at all, then only one vector size is tried,
one derived from preferred_simd_mode, or push all possible vectorization
sizes to be tried.

The following patch fixes the failures:

--- gcc/config/i386/i386.c.jj	2019-06-26 09:15:53.474869259 +0200
+++ gcc/config/i386/i386.c	2019-06-26 10:42:01.354106012 +0200
@@ -21401,6 +21401,11 @@ ix86_autovectorize_vector_sizes (vector_
       sizes->safe_push (16);
       sizes->safe_push (32);
     }
+  else if (TARGET_MMX_WITH_SSE)
+    sizes->safe_push (16);
+
+  if (TARGET_MMX_WITH_SSE)
+    sizes->safe_push (8);
 }
 
 /* Implemenation of targetm.vectorize.get_mask_mode.  */

	Jakub
On Wed, Jun 26, 2019 at 10:47 AM Jakub Jelinek <jakub@redhat.com> wrote:
>
> On Wed, Jun 26, 2019 at 10:17:26AM +0200, Uros Bizjak wrote:
> > Please note that the patch regresses
> >
> > FAIL: gcc.target/i386/sse2-vect-simd-11.c scan-tree-dump-times vect
> > "vectorized [1-3] loops" 2
> > FAIL: gcc.target/i386/sse2-vect-simd-15.c scan-tree-dump-times vect
> > "vectorized [1-3] loops" 2
> >
> > For some reason, the compiler decides to vectorize with 8-byte
> > vectors, resulting in:
> >
> > missed: not vectorized: relevant stmt not supported: _8 = (short
> > unsigned int) _4;
> > missed: bad operation or unsupported loop bound.
> > missed: couldn't vectorize loop
> >
> > However, the unpatched compiler is able to vectorize loop using
> > 16-byte vectors. It looks that the compiler should re-run
> > vectorization with wider vectors, if vectorization with narrower
> > vectors fails. Jakub, Richard, do you have any insight in this issue?
> >
> > 2019-06-26  Uroš Bizjak  <ubizjak@gmail.com>
> >
> >     * config/i386/i386.c (ix86_autovectorize_vector_sizes):
> >     Autovectorize 8-byte vectors for TARGET_MMX_WITH_SSE.
>
> The patch isn't correct if TARGET_MMX_WITH_SSE, but not TARGET_AVX, because
> in that case it will push only that 8 and nothing else, while you really
> want to have 16 and 8 in that order, so that it tries to vectorize first
> with 16-byte vectors and fall back to 8-byte.  The hook is supposed to
> either push nothing at all, then only one vector size is tried,
> one derived from preferred_simd_mode, or push all possible vectorization
> sizes to be tried.

Thanks for the explanation and the patch!

Yes, the patch works OK. I'll regression test it and push it later today.

Thanks,
Uros.

> The following patch fixes the failures:
>
> --- gcc/config/i386/i386.c.jj	2019-06-26 09:15:53.474869259 +0200
> +++ gcc/config/i386/i386.c	2019-06-26 10:42:01.354106012 +0200
> @@ -21401,6 +21401,11 @@ ix86_autovectorize_vector_sizes (vector_
>        sizes->safe_push (16);
>        sizes->safe_push (32);
>      }
> +  else if (TARGET_MMX_WITH_SSE)
> +    sizes->safe_push (16);
> +
> +  if (TARGET_MMX_WITH_SSE)
> +    sizes->safe_push (8);
>  }
>
>  /* Implemenation of targetm.vectorize.get_mask_mode.  */
>
>
> 	Jakub
On Wed, Jun 26, 2019 at 12:19:28PM +0200, Uros Bizjak wrote:
> > The patch isn't correct if TARGET_MMX_WITH_SSE, but not TARGET_AVX, because
> > in that case it will push only that 8 and nothing else, while you really
> > want to have 16 and 8 in that order, so that it tries to vectorize first
> > with 16-byte vectors and fall back to 8-byte.  The hook is supposed to
> > either push nothing at all, then only one vector size is tried,
> > one derived from preferred_simd_mode, or push all possible vectorization
> > sizes to be tried.
>
> Thanks for the explanation and the patch!

It is even documented that way:

"If the mode returned by @code{TARGET_VECTORIZE_PREFERRED_SIMD_MODE} is not\n\
the only one that is worth considering, this hook should add all suitable\n\
vector sizes to @var{sizes}, in order of decreasing preference.  The first\n\
one should be the size of @code{TARGET_VECTORIZE_PREFERRED_SIMD_MODE}.\n\
If @var{all} is true, add suitable vector sizes even when they are generally\n\
not expected to be worthwhile.\n\

	Jakub
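Taken together with the diff above, the fixed hook has roughly the
following shape.  This is a sketch rather than the verbatim i386.c
function: the TARGET_AVX branch stands in for the pre-existing
AVX/AVX-512 cases, which are elided here (they also consume the "all"
parameter), and only the TARGET_MMX_WITH_SSE pushes come from Jakub's
patch:

/* Sizes are pushed in decreasing order of preference, per the
   documentation quoted above.  */
static void
ix86_autovectorize_vector_sizes (vector_sizes *sizes, bool all)
{
  if (TARGET_AVX)      /* Stand-in for the elided AVX/AVX-512 branches.  */
    {
      sizes->safe_push (32);
      sizes->safe_push (16);
    }
  else if (TARGET_MMX_WITH_SSE)
    sizes->safe_push (16);   /* No AVX: try 16-byte vectors first...  */

  if (TARGET_MMX_WITH_SSE)
    sizes->safe_push (8);    /* ...and fall back to 8-byte vectors.  */
}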
On Wed, Jun 26, 2019 at 12:19:28PM +0200, Uros Bizjak wrote:
> Yes, the patch works OK. I'll regression test it and push it later today.
I think it caused
+FAIL: gcc.dg/tree-ssa/pr84512.c scan-tree-dump optimized "return 285;"
which admittedly already is xfailed on various targets.
We now newly vectorize those loops and there is no FRE or similar pass
after vectorization to clean it up, in particular optimize the
a[8] and a[9] loads given the MEM <vector(2) int> [(int *)&a + 32B]
store:
MEM <vector(2) int> [(int *)&a + 32B] = { 64, 81 };
_13 = a[8];
res_6 = _13 + 140;
_18 = a[9];
res_15 = res_6 + _18;
a ={v} {CLOBBER};
return res_15;
Shall we xfail it, or is there a plan to enable FRE after vectorization,
or similar pass that would be able to do similar memory optimizations?
Note, the RTL passes are able to optimize it in the end in this testcase.
Jakub
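To make the missed optimization concrete, here is a reduced,
hypothetical C analogue of the GIMPLE above; the v2si typedef and foo
are illustrative, not taken from pr84512.c, and the may_alias and
aligned (4) attributes are added so the store is well-defined.  A
value-numbering pass running after vectorization should fold the two
scalar loads against the constant vector store and reduce the function
to return 285, i.e. 64 + 140 + 81:

typedef int v2si __attribute__ ((vector_size (__SIZEOF_INT__ * 2),
				 may_alias, aligned (4)));
int a[10];

int
foo (void)
{
  *(v2si *) &a[8] = (v2si) { 64, 81 };	/* The constant vector store.  */
  return a[8] + 140 + a[9];		/* Loads that should fold away.  */
}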
On Thu, Jun 27, 2019 at 8:05 AM Jakub Jelinek <jakub@redhat.com> wrote:
>
> On Wed, Jun 26, 2019 at 12:19:28PM +0200, Uros Bizjak wrote:
> > Yes, the patch works OK. I'll regression test it and push it later today.
>
> I think it caused
> +FAIL: gcc.dg/tree-ssa/pr84512.c scan-tree-dump optimized "return 285;"
> which admittedly already is xfailed on various targets.
> We now newly vectorize those loops and there is no FRE or similar pass
> after vectorization to clean it up, in particular optimize the
> a[8] and a[9] loads given the MEM <vector(2) int> [(int *)&a + 32B]
> store:
> MEM <vector(2) int> [(int *)&a + 32B] = { 64, 81 };
> _13 = a[8];
> res_6 = _13 + 140;
> _18 = a[9];
> res_15 = res_6 + _18;
> a ={v} {CLOBBER};
> return res_15;

Yes, I have seen pr84512.c, but the failure is benign. It is caused by
the fact that we now vectorize the loops of the test.

> Shall we xfail it, or is there a plan to enable FRE after vectorization,
> or similar pass that would be able to do similar memory optimizations?
> Note, the RTL passes are able to optimize it in the end in this testcase.

The testcase failure could be solved by -fno-tree-vectorize, but I
think that the value should be propagated through vectors, and tree
optimizers should optimize the vectorized function in the same way as
scalar function.

Uros.
On Thu, 27 Jun 2019, Uros Bizjak wrote:

> On Thu, Jun 27, 2019 at 8:05 AM Jakub Jelinek <jakub@redhat.com> wrote:
> >
> > On Wed, Jun 26, 2019 at 12:19:28PM +0200, Uros Bizjak wrote:
> > > Yes, the patch works OK. I'll regression test it and push it later today.
> >
> > I think it caused
> > +FAIL: gcc.dg/tree-ssa/pr84512.c scan-tree-dump optimized "return 285;"
> > which admittedly already is xfailed on various targets.
> > We now newly vectorize those loops and there is no FRE or similar pass
> > after vectorization to clean it up, in particular optimize the
> > a[8] and a[9] loads given the MEM <vector(2) int> [(int *)&a + 32B]
> > store:
> > MEM <vector(2) int> [(int *)&a + 32B] = { 64, 81 };
> > _13 = a[8];
> > res_6 = _13 + 140;
> > _18 = a[9];
> > res_15 = res_6 + _18;
> > a ={v} {CLOBBER};
> > return res_15;
>
> Yes, I have seen pr84512.c, but the failure is benign. It is caused by
> the fact that we now vectorize the loops of the test.
>
> > Shall we xfail it, or is there a plan to enable FRE after vectorization,
> > or similar pass that would be able to do similar memory optimizations?
> > Note, the RTL passes are able to optimize it in the end in this testcase.
>
> The testcase failure could be solved by -fno-tree-vectorize, but I
> think that the value should be propagated through vectors, and tree
> optimizers should optimize the vectorized function in the same way as
> scalar function.

FRE needs a simple fix (oops) to handle this case though.  Bootstrap /
regtest running on x86_64-unknown-linux-gnu.

And yes, I think ultimately we want a late FRE...

Richard.

2019-06-27  Richard Biener  <rguenther@suse.de>

	* tree-ssa-sccvn.c (vn_reference_lookup_3): Encode valueized RHS.

	* gcc.dg/tree-ssa/ssa-fre-67.c: New testcase.

Index: gcc/tree-ssa-sccvn.c
===================================================================
--- gcc/tree-ssa-sccvn.c	(revision 272732)
+++ gcc/tree-ssa-sccvn.c	(working copy)
@@ -2242,7 +2242,7 @@ vn_reference_lookup_3 (ao_ref *ref, tree
 	  tree rhs = gimple_assign_rhs1 (def_stmt);
 	  if (TREE_CODE (rhs) == SSA_NAME)
 	    rhs = SSA_VAL (rhs);
-	  len = native_encode_expr (gimple_assign_rhs1 (def_stmt),
+	  len = native_encode_expr (rhs,
				    buffer, sizeof (buffer),
				    (offseti - offset2) / BITS_PER_UNIT);
 	  if (len > 0 && len * BITS_PER_UNIT >= maxsizei)
Index: gcc/testsuite/gcc.dg/tree-ssa/ssa-fre-67.c
===================================================================
--- gcc/testsuite/gcc.dg/tree-ssa/ssa-fre-67.c	(revision 272732)
+++ gcc/testsuite/gcc.dg/tree-ssa/ssa-fre-67.c	(working copy)
@@ -1,16 +1,32 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -fno-tree-ccp -fdump-tree-fre1-stats" } */
+/* { dg-options "-fgimple -O1 -fdump-tree-fre1" } */
 
-int foo()
+int a[10];
+typedef int v2si __attribute__((vector_size(__SIZEOF_INT__*2)));
+int __GIMPLE (ssa,guessed_local(97603132),startwith("fre1"))
+foo ()
 {
-  int i = 0;
-  do
-    {
-      i++;
-    }
-  while (i != 1);
-  return i;
+  int i;
+  int _59;
+  int _44;
+  int _13;
+  int _18;
+  v2si _80;
+  v2si _81;
+  int res;
+
+  __BB(2,guessed_local(97603132)):
+  _59 = 64;
+  i_61 = 9;
+  _44 = i_61 * i_61;
+  _80 = _Literal (v2si) {_59, _44};
+  _81 = _80;
+  __MEM <v2si> ((int *)&a + _Literal (int *) 32) = _81;
+  i_48 = 9;
+  _13 = a[8];
+  _18 = a[i_48];
+  res_15 = _13 + _18;
+  return res_15;
 }
 
-/* { dg-final { scan-tree-dump "RPO iteration over 3 blocks visited 3 blocks" "fre1" } } */
-/* { dg-final { scan-tree-dump "return 1;" "fre1" } } */
+/* { dg-final { scan-tree-dump "return 145;" "fre1" } } */
On 6/27/19 12:05 AM, Jakub Jelinek wrote:
> On Wed, Jun 26, 2019 at 12:19:28PM +0200, Uros Bizjak wrote:
>> Yes, the patch works OK. I'll regression test it and push it later today.
>
> I think it caused
> +FAIL: gcc.dg/tree-ssa/pr84512.c scan-tree-dump optimized "return 285;"
> which admittedly already is xfailed on various targets.
> We now newly vectorize those loops and there is no FRE or similar pass
> after vectorization to clean it up, in particular optimize the
> a[8] and a[9] loads given the MEM <vector(2) int> [(int *)&a + 32B]
> store:
> MEM <vector(2) int> [(int *)&a + 32B] = { 64, 81 };
> _13 = a[8];
> res_6 = _13 + 140;
> _18 = a[9];
> res_15 = res_6 + _18;
> a ={v} {CLOBBER};
> return res_15;
>
> Shall we xfail it, or is there a plan to enable FRE after vectorization,
> or similar pass that would be able to do similar memory optimizations?
> Note, the RTL passes are able to optimize it in the end in this testcase.
I wonder if we could logically break up the vector store within DOM.  If
we did that we'd end up with a[8] and a[9] in DOM's expression hash
table.  That would allow us to replace the loads into _13 and _18 with
constants and the rest should just fall out.

Care to open a BZ?  If so, go ahead and assign it to me.

jeff
On Thu, Jun 27, 2019 at 09:24:58AM -0600, Jeff Law wrote:
> On 6/27/19 12:05 AM, Jakub Jelinek wrote:
> > On Wed, Jun 26, 2019 at 12:19:28PM +0200, Uros Bizjak wrote:
> >> Yes, the patch works OK. I'll regression test it and push it later today.
> >
> > I think it caused
> > +FAIL: gcc.dg/tree-ssa/pr84512.c scan-tree-dump optimized "return 285;"
> > which admittedly already is xfailed on various targets.
> > We now newly vectorize those loops and there is no FRE or similar pass
> > after vectorization to clean it up, in particular optimize the
> > a[8] and a[9] loads given the MEM <vector(2) int> [(int *)&a + 32B]
> > store:
> > MEM <vector(2) int> [(int *)&a + 32B] = { 64, 81 };
> > _13 = a[8];
> > res_6 = _13 + 140;
> > _18 = a[9];
> > res_15 = res_6 + _18;
> > a ={v} {CLOBBER};
> > return res_15;
> >
> > Shall we xfail it, or is there a plan to enable FRE after vectorization,
> > or similar pass that would be able to do similar memory optimizations?
> > Note, the RTL passes are able to optimize it in the end in this testcase.
> I wonder if we could logically break up the vector store within DOM.  If
> we did that we'd end up with a[8] and a[9] in DOM's expression hash
> table.  That would allow us to replace the loads into _13 and _18 with
> constants and the rest should just fall out.
>
> Care to open a BZ?  If so, go ahead and assign it to me.

I think Richi is working on adding fre3 now.

	Jakub
On 6/27/19 9:34 AM, Jakub Jelinek wrote:
> On Thu, Jun 27, 2019 at 09:24:58AM -0600, Jeff Law wrote:
>> On 6/27/19 12:05 AM, Jakub Jelinek wrote:
>>> On Wed, Jun 26, 2019 at 12:19:28PM +0200, Uros Bizjak wrote:
>>>> Yes, the patch works OK. I'll regression test it and push it later today.
>>>
>>> I think it caused
>>> +FAIL: gcc.dg/tree-ssa/pr84512.c scan-tree-dump optimized "return 285;"
>>> which admittedly already is xfailed on various targets.
>>> We now newly vectorize those loops and there is no FRE or similar pass
>>> after vectorization to clean it up, in particular optimize the
>>> a[8] and a[9] loads given the MEM <vector(2) int> [(int *)&a + 32B]
>>> store:
>>> MEM <vector(2) int> [(int *)&a + 32B] = { 64, 81 };
>>> _13 = a[8];
>>> res_6 = _13 + 140;
>>> _18 = a[9];
>>> res_15 = res_6 + _18;
>>> a ={v} {CLOBBER};
>>> return res_15;
>>>
>>> Shall we xfail it, or is there a plan to enable FRE after vectorization,
>>> or similar pass that would be able to do similar memory optimizations?
>>> Note, the RTL passes are able to optimize it in the end in this testcase.
>> I wonder if we could logically break up the vector store within DOM.  If
>> we did that we'd end up with a[8] and a[9] in DOM's expression hash
>> table.  That would allow us to replace the loads into _13 and _18 with
>> constants and the rest should just fall out.
>>
>> Care to open a BZ?  If so, go ahead and assign it to me.
>
> I think Richi is working on adding fre3 now.
Yea, I saw that later.  I think Richi's message indicated he wanted a
late fre pass, so even if DOM were to capture this, it may not eliminate
the desire for a late fre pass.

jeff
diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 1ca1712183dc..24bd0896f137 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -21401,6 +21401,9 @@ ix86_autovectorize_vector_sizes (vector_sizes *sizes, bool all)
       sizes->safe_push (16);
       sizes->safe_push (32);
     }
+
+  if (TARGET_MMX_WITH_SSE)
+    sizes->safe_push (8);
 }
 
 /* Implemenation of targetm.vectorize.get_mask_mode.  */
diff --git a/gcc/testsuite/lib/target-supports.exp b/gcc/testsuite/lib/target-supports.exp
index 1d4aaa2a87ec..285c32f8cebb 100644
--- a/gcc/testsuite/lib/target-supports.exp
+++ b/gcc/testsuite/lib/target-supports.exp
@@ -6603,9 +6603,14 @@ proc available_vector_sizes { } {
     } elseif { [istarget arm*-*-*]
	       && [check_effective_target_arm_neon_ok] } {
	lappend result 128 64
-    } elseif { (([istarget i?86-*-*] || [istarget x86_64-*-*])
-	       && ([check_avx_available] && ![check_prefer_avx128])) } {
-	lappend result 256 128
+    } elseif { [istarget i?86-*-*] || [istarget x86_64-*-*] } {
+	if { [check_avx_available] && ![check_prefer_avx128] } {
+	    lappend result 256
+	}
+	lappend result 128
+	if { ![is-effective-target ia32] } {
+	    lappend result 64
+	}
     } elseif { [istarget sparc*-*-*] } {
	lappend result 64
     } else {