
[4/4] middle-end: create the longest possible zero extend chain after overwidening

Message ID Zwz45D88lAvH2G02@arm.com
State New
Series [1/4] middle-end: support multi-step zero-extends using VEC_PERM_EXPR

Commit Message

Tamar Christina Oct. 14, 2024, 10:56 a.m. UTC
Hi All,

Consider loops such as:

void test9(unsigned char *x, long long *y, int n, unsigned char k) {
    for(int i = 0; i < n; i++) {
        y[i] = k + x[i];
    }
}

where today we generate:

.L5:
        ldr     q29, [x5], 16
        add     x4, x4, 128
        uaddl   v1.8h, v29.8b, v30.8b
        uaddl2  v29.8h, v29.16b, v30.16b
        zip1    v2.8h, v1.8h, v31.8h
        zip1    v0.8h, v29.8h, v31.8h
        zip2    v1.8h, v1.8h, v31.8h
        zip2    v29.8h, v29.8h, v31.8h
        sxtl    v25.2d, v2.2s
        sxtl    v28.2d, v0.2s
        sxtl    v27.2d, v1.2s
        sxtl    v26.2d, v29.2s
        sxtl2   v2.2d, v2.4s
        sxtl2   v0.2d, v0.4s
        sxtl2   v1.2d, v1.4s
        sxtl2   v29.2d, v29.4s
        stp     q25, q2, [x4, -128]
        stp     q27, q1, [x4, -96]
        stp     q28, q0, [x4, -64]
        stp     q26, q29, [x4, -32]
        cmp     x5, x6
        bne     .L5

Note how the zero extend from short to long is halfway through the chain
transformed into a sign extend.  There are two problems with this:

  1. sign extends are typically slower than zero extends on many uArches.
  2. it prevents vectorizable_conversion from attempting to do a single step
     promotion.

These sign extends happen due to the various range reduction optimizations
and patterns we have, such as multiplication widening, etc.

My first attempt to fix this was to just update the patterns so that, when the
original source is a zero extend, they do not add the intermediate sign extend.

However this behavior happens in many other places, and as new
patterns get added the problem can be re-introduced.

Instead I have added a new pattern vect_recog_zero_extend_chain_pattern that
attempts to simplify and extend an existing zero extend over multiple
conversion statements.

As an example, T3 a = (T3)(signed T2)(unsigned T1)x where bitsize T3 > T2 > T1
gets transformed into T3 a = (T3)(signed T2)(unsigned T2)x.

The final cast to signed is kept so the types in the tree still match. It will
be correctly elided later on.
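
As a concrete (made-up) instance, following the comment on the new pattern
below: with unsigned short, int and long standing in for T1, T2 and T3, the
chain

    a = (long)(int)(unsigned short)x;   /* final step done as sign extend */

becomes one zero-extend chain plus the final sign-matching cast:

    a = (long)(unsigned long)x;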

This representation is optimal, as vectorizable_conversion is already
able to decompose a long promotion into multiple steps if the target does not
support it in a single step.  More importantly it allows us to do proper costing
and support conversions such as (double)x, where bitsize(x) < int, in an
efficient manner.
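
For example (a simplified copy of test6 from the new testcase below):

    void test6 (unsigned char *x, double *y, int n) {
        for (int i = 0; i < n; i++)
            y[i] += x[i];   /* unsigned char -> double, several steps.  */
    }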

To do this I have used Ranger's on-demand analysis to check whether an
extension can be removed and widened to a zero extend.  The reason for this
is that the vectorizer introduces several patterns that are not in the IL,  but
also lots of widening IFNs for which handling in a switch wouldn't be very
future proof.
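
In essence the check is the on-demand query that also appears in the patch
below:

    gimple_ranger ranger;
    int_range_max r;
    ranger.range_of_stmt (r, stmt);        /* stmt is a conversion.  */
    wide_int nz = r.get_nonzero_bits ();
    if (wi::neg_p (nz) || wi::clz (nz) == 0)
      /* Top bit not known to be zero; stop extending the chain.  */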

I did try to do it without Ranger, but Ranger had three benefits:

1.  It simplified the handling of the IL changes the vectorizer introduces, and
    makes it future proof.
2.  Ranger has the advantage of doing the transformation in cases where it knows
    that the top bits of the value are zero, which we wouldn't be able to tell
    by looking purely at statements.
3.  Ranger simplified the handling of corner cases.  Without it the handling was
    quite complex and I wasn't very confident in its correctness.

So I think ranger is the right way to go here...  With these changes the above
now generates:

.L5:
        add     x4, x4, 128
        ldr     q26, [x5], 16
        uaddl   v2.8h, v26.8b, v31.8b
        uaddl2  v26.8h, v26.16b, v31.16b
        tbl     v4.16b, {v2.16b}, v30.16b
        tbl     v3.16b, {v2.16b}, v29.16b
        tbl     v24.16b, {v2.16b}, v28.16b
        tbl     v1.16b, {v26.16b}, v30.16b
        tbl     v0.16b, {v26.16b}, v29.16b
        tbl     v25.16b, {v26.16b}, v28.16b
        tbl     v2.16b, {v2.16b}, v27.16b
        tbl     v26.16b, {v26.16b}, v27.16b
        stp     q4, q3, [x4, -128]
        stp     q1, q0, [x4, -64]
        stp     q24, q2, [x4, -96]
        stp     q25, q26, [x4, -32]
        cmp     x5, x6
        bne     .L5

I have also seen similar improvements in codegen on Arm and x86_64, especially
with AVX512.

Bootstrapped Regtested on aarch64-none-linux-gnu, arm-none-linux-gnueabihf,
x86_64-pc-linux-gnu -m32, -m64 and no issues.

Hopefully Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

	* tree-vect-patterns.cc (vect_recog_zero_extend_chain_pattern): New.

gcc/testsuite/ChangeLog:

	* gcc.dg/vect/bb-slp-pattern-1.c: Update tests.
	* gcc.dg/vect/slp-widen-mult-half.c: Likewise.
	* gcc.dg/vect/vect-over-widen-10.c: Likewise.
	* gcc.dg/vect/vect-over-widen-12.c: Likewise.
	* gcc.dg/vect/vect-over-widen-14.c: Likewise.
	* gcc.dg/vect/vect-over-widen-16.c: Likewise.
	* gcc.dg/vect/vect-over-widen-6.c: Likewise.
	* gcc.dg/vect/vect-over-widen-8.c: Likewise.
	* gcc.dg/vect/vect-widen-mult-u16.c: Likewise.
	* gcc.dg/vect/vect-widen-mult-u8-s16-s32.c: Likewise.
	* lib/target-supports.exp
	(check_effective_target_vect_widen_mult_hi_to_si_pattern,
	check_effective_target_vect_widen_mult_si_to_di_pattern): Enable
	AArch64.
	* gcc.target/aarch64/vect-tbl-zero-extend_2.c: New test.

---

--

Comments

Richard Biener Oct. 15, 2024, 12:42 p.m. UTC | #1
On Mon, 14 Oct 2024, Tamar Christina wrote:

> Hi All,
> 
> Consider loops such as:
> 
> void test9(unsigned char *x, long long *y, int n, unsigned char k) {
>     for(int i = 0; i < n; i++) {
>         y[i] = k + x[i];
>     }
> }
> 
> where today we generate:
> 
> .L5:
>         ldr     q29, [x5], 16
>         add     x4, x4, 128
>         uaddl   v1.8h, v29.8b, v30.8b
>         uaddl2  v29.8h, v29.16b, v30.16b
>         zip1    v2.8h, v1.8h, v31.8h
>         zip1    v0.8h, v29.8h, v31.8h
>         zip2    v1.8h, v1.8h, v31.8h
>         zip2    v29.8h, v29.8h, v31.8h
>         sxtl    v25.2d, v2.2s
>         sxtl    v28.2d, v0.2s
>         sxtl    v27.2d, v1.2s
>         sxtl    v26.2d, v29.2s
>         sxtl2   v2.2d, v2.4s
>         sxtl2   v0.2d, v0.4s
>         sxtl2   v1.2d, v1.4s
>         sxtl2   v29.2d, v29.4s
>         stp     q25, q2, [x4, -128]
>         stp     q27, q1, [x4, -96]
>         stp     q28, q0, [x4, -64]
>         stp     q26, q29, [x4, -32]
>         cmp     x5, x6
>         bne     .L5
> 
> Note how the zero extend from short to long is halfway through the chain
> transformed into a sign extend.  There are two problems with this:
> 
>   1. sign extends are typically slower than zero extends on many uArches.
>   2. it prevents vectorizable_conversion from attempting to do a single step
>      promotion.
> 
> These sign extends happen due to the various range reduction optimizations
> and patterns we have, such as multiplication widening, etc.
> 
> My first attempt to fix this was to just update the patterns so that, when the
> original source is a zero extend, they do not add the intermediate sign extend.
> 
> However this behavior happens in many other places, and as new
> patterns get added the problem can be re-introduced.
> 
> Instead I have added a new pattern vect_recog_zero_extend_chain_pattern that
> attempts to simplify and extend an existing zero extend over multiple
> conversion statements.
> 
> As an example, T3 a = (T3)(signed T2)(unsigned T1)x where bitsize T3 > T2 > T1
> gets transformed into T3 a = (T3)(signed T2)(unsigned T2)x.
> 
> The final cast to signed is kept so the types in the tree still match. It will
> be correctly elided later on.
> 
> This representation is optimal, as vectorizable_conversion is already
> able to decompose a long promotion into multiple steps if the target does not
> support it in a single step.  More importantly it allows us to do proper costing
> and support conversions such as (double)x, where bitsize(x) < int, in an
> efficient manner.
> 
> To do this I have used Ranger's on-demand analysis to check whether an
> extension can be removed and widened to a zero extend.  The reason for this
> is that the vectorizer introduces several patterns that are not in the IL,  but
> also lots of widening IFNs for which handling in a switch wouldn't be very
> future proof.
> 
> I did try to do it without Ranger, but Ranger had three benefits:
> 
> 1.  It simplified the handling of the IL changes the vectorizer introduces, and
>     makes it future proof.
> 2.  Ranger has the advantage of doing the transformation in cases where it knows
>     that the top bits of the value are zero, which we wouldn't be able to tell
>     by looking purely at statements.
> 3.  Ranger simplified the handling of corner cases.  Without it the handling was
>     quite complex and I wasn't very confident in its correctness.
> 
> So I think ranger is the right way to go here...  With these changes the above
> now generates:
> 
> .L5:
>         add     x4, x4, 128
>         ldr     q26, [x5], 16
>         uaddl   v2.8h, v26.8b, v31.8b
>         uaddl2  v26.8h, v26.16b, v31.16b
>         tbl     v4.16b, {v2.16b}, v30.16b
>         tbl     v3.16b, {v2.16b}, v29.16b
>         tbl     v24.16b, {v2.16b}, v28.16b
>         tbl     v1.16b, {v26.16b}, v30.16b
>         tbl     v0.16b, {v26.16b}, v29.16b
>         tbl     v25.16b, {v26.16b}, v28.16b
>         tbl     v2.16b, {v2.16b}, v27.16b
>         tbl     v26.16b, {v26.16b}, v27.16b
>         stp     q4, q3, [x4, -128]
>         stp     q1, q0, [x4, -64]
>         stp     q24, q2, [x4, -96]
>         stp     q25, q26, [x4, -32]
>         cmp     x5, x6
>         bne     .L5
> 
> I have also seen similar improvements in codegen on Arm and x86_64, especially
> with AVX512.
> 
> Bootstrapped Regtested on aarch64-none-linux-gnu, arm-none-linux-gnueabihf,
> x86_64-pc-linux-gnu -m32, -m64 and no issues.
> 
> Hopefully Ok for master?

Hohumm.  So I looked at one of the examples and I don't see any
sign-extends in the IL we vectorize.  So your pattern is about
changing int -> double to unsigned int -> double but only so
a required intermediate int -> long conversion is done as
zero-extend?  IMO this doesn't belong to patterns but to
vectorizable_conversion, specifically the step determining the
intermediate types.

I don't quite understand what scalar pattern IL you feed to
the vectorizer in the end; a few comments - also to this effect -
below.

> Thanks,
> Tamar
> 
> gcc/ChangeLog:
> 
> 	* tree-vect-patterns.cc (vect_recog_zero_extend_chain_pattern): New.
> 
> gcc/testsuite/ChangeLog:
> 
> 	* gcc.dg/vect/bb-slp-pattern-1.c: Update tests.
> 	* gcc.dg/vect/slp-widen-mult-half.c: Likewise.
> 	* gcc.dg/vect/vect-over-widen-10.c: Likewise.
> 	* gcc.dg/vect/vect-over-widen-12.c: Likewise.
> 	* gcc.dg/vect/vect-over-widen-14.c: Likewise.
> 	* gcc.dg/vect/vect-over-widen-16.c: Likewise.
> 	* gcc.dg/vect/vect-over-widen-6.c: Likewise.
> 	* gcc.dg/vect/vect-over-widen-8.c: Likewise.
> 	* gcc.dg/vect/vect-widen-mult-u16.c: Likewise.
> 	* gcc.dg/vect/vect-widen-mult-u8-s16-s32.c: Likewise.
> 	* lib/target-supports.exp
> 	(check_effective_target_vect_widen_mult_hi_to_si_pattern,
> 	check_effective_target_vect_widen_mult_si_to_di_pattern): Enable
> 	AArch64.
> 	* gcc.target/aarch64/vect-tbl-zero-extend_2.c: New test.
> 
> ---
> diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-pattern-1.c b/gcc/testsuite/gcc.dg/vect/bb-slp-pattern-1.c
> index 5ae99225273ca5f915f60ecba3a5aaedebe46e96..627de78af4e48581575beda97bf2a0708ac091cb 100644
> --- a/gcc/testsuite/gcc.dg/vect/bb-slp-pattern-1.c
> +++ b/gcc/testsuite/gcc.dg/vect/bb-slp-pattern-1.c
> @@ -52,4 +52,4 @@ int main (void)
>  
>  /* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "slp2" { target { vect_widen_mult_hi_to_si || vect_unpack } } } } */
>  /* { dg-final { scan-tree-dump-times "vect_recog_widen_mult_pattern: detected" 8 "slp2" { target vect_widen_mult_hi_to_si_pattern } } } */
> -/* { dg-final { scan-tree-dump-times "pattern recognized" 8 "slp2" { target vect_widen_mult_hi_to_si_pattern } } } */
> +/* { dg-final { scan-tree-dump-times "widen_mult pattern recognized" 8 "slp2" { target vect_widen_mult_hi_to_si_pattern } } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/slp-widen-mult-half.c b/gcc/testsuite/gcc.dg/vect/slp-widen-mult-half.c
> index b69ade338862cda4f44f5206d195eef1cb5e8d36..aecc085a51c93e0e7bed122df0a77a0a099ad6ef 100644
> --- a/gcc/testsuite/gcc.dg/vect/slp-widen-mult-half.c
> +++ b/gcc/testsuite/gcc.dg/vect/slp-widen-mult-half.c
> @@ -52,5 +52,5 @@ int main (void)
>  /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { target vect_widen_mult_hi_to_si } } } */
>  /* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 2 "vect" { target vect_widen_mult_hi_to_si } } } */
>  /* { dg-final { scan-tree-dump-times "vect_recog_widen_mult_pattern: detected" 2 "vect" { target vect_widen_mult_hi_to_si_pattern } } } */
> -/* { dg-final { scan-tree-dump-times "pattern recognized" 2 "vect" { target vect_widen_mult_hi_to_si_pattern } } } */
> +/* { dg-final { scan-tree-dump-times "pattern recognized" 4 "vect" { target vect_widen_mult_hi_to_si_pattern } } } */
>  
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-over-widen-10.c b/gcc/testsuite/gcc.dg/vect/vect-over-widen-10.c
> index f0140e4ef6d70cd61aa7dbb3ba39b1da142a79b2..bd798fae7e8136975d488206cfef9e39fac2bfea 100644
> --- a/gcc/testsuite/gcc.dg/vect/vect-over-widen-10.c
> +++ b/gcc/testsuite/gcc.dg/vect/vect-over-widen-10.c
> @@ -11,7 +11,7 @@
>  
>  #include "vect-over-widen-9.c"
>  
> -/* { dg-final { scan-tree-dump {Splitting statement} "vect" } } */
> +/* { dg-final { scan-tree-dump {Splitting pattern statement} "vect" } } */
>  /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern: detected:[^\n]* \+ } "vect" } } */
>  /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern: detected:[^\n]* >> 1} "vect" } } */
>  /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern: detected:[^\n]* >> 2} "vect" } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-over-widen-12.c b/gcc/testsuite/gcc.dg/vect/vect-over-widen-12.c
> index ddb3bd8c0d378f0138c8cc7f9c6ea3300744b8a8..8c0544e35c29de60e76759f4ed13206278c72925 100644
> --- a/gcc/testsuite/gcc.dg/vect/vect-over-widen-12.c
> +++ b/gcc/testsuite/gcc.dg/vect/vect-over-widen-12.c
> @@ -11,7 +11,7 @@
>  
>  #include "vect-over-widen-11.c"
>  
> -/* { dg-final { scan-tree-dump {Splitting statement} "vect" } } */
> +/* { dg-final { scan-tree-dump {Splitting pattern statement} "vect" } } */
>  /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern: detected:[^\n]* \+ } "vect" } } */
>  /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern: detected:[^\n]* >> 1} "vect" } } */
>  /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern: detected:[^\n]* >> 2} "vect" } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-over-widen-14.c b/gcc/testsuite/gcc.dg/vect/vect-over-widen-14.c
> index dfa09f5d2cafe329e6d57b5cc681786cc2c7d215..1fe0305c1c4f61d05864ef97789726a1dc6ec8b1 100644
> --- a/gcc/testsuite/gcc.dg/vect/vect-over-widen-14.c
> +++ b/gcc/testsuite/gcc.dg/vect/vect-over-widen-14.c
> @@ -11,7 +11,7 @@
>  
>  #include "vect-over-widen-13.c"
>  
> -/* { dg-final { scan-tree-dump {Splitting statement} "vect" } } */
> +/* { dg-final { scan-tree-dump {Splitting pattern statement} "vect" } } */
>  /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern: detected:[^\n]* \+} "vect" } } */
>  /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern: detected:[^\n]* >> 1} "vect" } } */
>  /* { dg-final { scan-tree-dump {vect_recog_cast_forwprop_pattern: detected:[^\n]* = \(unsigned char\)} "vect" } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-over-widen-16.c b/gcc/testsuite/gcc.dg/vect/vect-over-widen-16.c
> index 4584c586da1e6f13e8c8de4c1291cea0141ebab5..4ecdadf7a035a4f83b1767a063a1b0f47bdd543d 100644
> --- a/gcc/testsuite/gcc.dg/vect/vect-over-widen-16.c
> +++ b/gcc/testsuite/gcc.dg/vect/vect-over-widen-16.c
> @@ -11,7 +11,7 @@
>  
>  #include "vect-over-widen-15.c"
>  
> -/* { dg-final { scan-tree-dump {Splitting statement} "vect" } } */
> +/* { dg-final { scan-tree-dump {Splitting pattern statement} "vect" } } */
>  /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern: detected:[^\n]* \+} "vect" } } */
>  /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern: detected:[^\n]* >> 1} "vect" } } */
>  /* { dg-final { scan-tree-dump-not {vect_recog_cast_forwprop_pattern: detected} "vect" } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-over-widen-6.c b/gcc/testsuite/gcc.dg/vect/vect-over-widen-6.c
> index bda92c965e080dd3f48ec42b6bea16e79d9416cd..6b8c3dfa2c89ce04d7673607ef2d2f14a14eb32f 100644
> --- a/gcc/testsuite/gcc.dg/vect/vect-over-widen-6.c
> +++ b/gcc/testsuite/gcc.dg/vect/vect-over-widen-6.c
> @@ -9,7 +9,7 @@
>  
>  #include "vect-over-widen-5.c"
>  
> -/* { dg-final { scan-tree-dump {Splitting statement} "vect" } } */
> +/* { dg-final { scan-tree-dump {Splitting pattern statement} "vect" } } */
>  /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern: detected:[^\n]* \+ } "vect" } } */
>  /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern: detected:[^\n]* >> 1} "vect" } } */
>  /* { dg-final { scan-tree-dump {vect_recog_cast_forwprop_pattern: detected:[^\n]* \(unsigned char\)} "vect" } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-over-widen-8.c b/gcc/testsuite/gcc.dg/vect/vect-over-widen-8.c
> index 553c0712a79a1d19195dbdab7cbd6fa330685bea..1cf725ff4b7f151097192db1a0b65173c4c83b19 100644
> --- a/gcc/testsuite/gcc.dg/vect/vect-over-widen-8.c
> +++ b/gcc/testsuite/gcc.dg/vect/vect-over-widen-8.c
> @@ -12,7 +12,7 @@
>  
>  #include "vect-over-widen-7.c"
>  
> -/* { dg-final { scan-tree-dump {Splitting statement} "vect" } } */
> +/* { dg-final { scan-tree-dump {Splitting pattern statement} "vect" } } */
>  /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern: detected:[^\n]* \+ } "vect" } } */
>  /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern: detected:[^\n]* >> 2} "vect" } } */
>  /* { dg-final { scan-tree-dump {vect_recog_cast_forwprop_pattern: detected:[^\n]* \(unsigned char\)} "vect" } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-widen-mult-u16.c b/gcc/testsuite/gcc.dg/vect/vect-widen-mult-u16.c
> index 258d253f401459d448d1ae86f56b0c97815d5b61..b5018f855a72534b4d64d2dc2b7ab2ac0deb674b 100644
> --- a/gcc/testsuite/gcc.dg/vect/vect-widen-mult-u16.c
> +++ b/gcc/testsuite/gcc.dg/vect/vect-widen-mult-u16.c
> @@ -47,5 +47,5 @@ int main (void)
>  
>  /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { target { vect_widen_mult_hi_to_si || vect_unpack } } } } */
>  /* { dg-final { scan-tree-dump-times "vect_recog_widen_mult_pattern: detected" 1 "vect" { target vect_widen_mult_hi_to_si_pattern } } } */
> -/* { dg-final { scan-tree-dump-times "pattern recognized" 1 "vect" { target vect_widen_mult_hi_to_si_pattern } } } */
> +/* { dg-final { scan-tree-dump-times "widen_mult pattern recognized" 1 "vect" { target vect_widen_mult_hi_to_si_pattern } } } */
>  
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-widen-mult-u8-s16-s32.c b/gcc/testsuite/gcc.dg/vect/vect-widen-mult-u8-s16-s32.c
> index 3baafca7b548124ae5c48fdf3c2f07c319155967..ab523ca77652e1f1533889fda9c0eb31c987ffe9 100644
> --- a/gcc/testsuite/gcc.dg/vect/vect-widen-mult-u8-s16-s32.c
> +++ b/gcc/testsuite/gcc.dg/vect/vect-widen-mult-u8-s16-s32.c
> @@ -47,5 +47,5 @@ int main (void)
>  
>  /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { target { vect_widen_mult_hi_to_si || vect_unpack } } } } */
>  /* { dg-final { scan-tree-dump-times "vect_recog_widen_mult_pattern: detected" 1 "vect" { target vect_widen_mult_hi_to_si_pattern } } } */
> -/* { dg-final { scan-tree-dump-times "pattern recognized" 1 "vect" { target vect_widen_mult_hi_to_si_pattern } } } */
> +/* { dg-final { scan-tree-dump-times "widen_mult pattern recognized" 1 "vect" { target vect_widen_mult_hi_to_si_pattern } } } */
>  
> diff --git a/gcc/testsuite/gcc.target/aarch64/vect-tbl-zero-extend_2.c b/gcc/testsuite/gcc.target/aarch64/vect-tbl-zero-extend_2.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..1577eacd9dbbb52274d9f86c77406555b7726482
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/vect-tbl-zero-extend_2.c
> @@ -0,0 +1,33 @@
> +/* { dg-do compile } */
> +/* { dg-additional-options "-O3 -std=c99 -march=armv8-a" } */
> +
> +void test6(unsigned char *x, double *y, int n) {
> +    for(int i = 0; i < (n & -8); i++) {
> +        y[i] += x[i];
> +    }
> +}
> +
> +void test7(unsigned char *x, double *y, int n, unsigned char k) {
> +    for(int i = 0; i < (n & -8); i++) {
> +        y[i] += k * x[i];
> +    }
> +}
> +
> +void test8(unsigned char *x, double *y, int n, unsigned char k) {
> +    for(int i = 0; i < (n & -8); i++) {
> +        y[i] = k + x[i];
> +    }
> +}
> +
> +void test9(unsigned char *x, long long *y, int n, unsigned char k) {
> +    for(int i = 0; i < (n & -8); i++) {
> +        y[i] = k + x[i];
> +    }
> +}
> +
> +/* { dg-final { scan-assembler-times {\tuxtl} 1 } } */
> +/* { dg-final { scan-assembler-not {\tuxtl2} } } */
> +/* { dg-final { scan-assembler-not {\tzip1} } } */
> +/* { dg-final { scan-assembler-not {\tzip2} } } */
> +/* { dg-final { scan-assembler-times {\ttbl} 44 } } */
> +/* { dg-final { scan-assembler-times {\.LC[0-9]+:} 12 } } */
> diff --git a/gcc/testsuite/lib/target-supports.exp b/gcc/testsuite/lib/target-supports.exp
> index d113a08dff7b2a8ab5bdfe24386d271bff255afc..feae1b8fcf8cd7ab56a8c76c0cd3034c0a828724 100644
> --- a/gcc/testsuite/lib/target-supports.exp
> +++ b/gcc/testsuite/lib/target-supports.exp
> @@ -8240,6 +8240,7 @@ proc check_effective_target_vect_widen_mult_hi_to_si_pattern { } {
>      return [check_cached_effective_target_indexed vect_widen_mult_hi_to_si_pattern {
>        expr { [istarget powerpc*-*-*]
>  	     || [istarget ia64-*-*]
> +	     || [istarget aarch64*-*-*]
>  	     || [istarget loongarch*-*-*]
>  	     || [istarget i?86-*-*] || [istarget x86_64-*-*]
>  	     || ([is-effective-target arm_neon]
> @@ -8259,6 +8260,7 @@ proc check_effective_target_vect_widen_mult_si_to_di_pattern { } {
>        expr { [istarget ia64-*-*]
>  	     || [istarget i?86-*-*] || [istarget x86_64-*-*]
>  	     || [istarget loongarch*-*-*]
> +	     || [istarget aarch64*-*-*]
>  	     || ([istarget s390*-*-*]
>  		 && [check_effective_target_s390_vx]) }}]
>  }
> diff --git a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc
> index 9bf8526ac995c6c2678b25f5df4316aec41333e0..74c7269a3ab15cba1ee2ef0556d25afda851f7f0 100644
> --- a/gcc/tree-vect-patterns.cc
> +++ b/gcc/tree-vect-patterns.cc
> @@ -5524,6 +5524,122 @@ vect_recog_mixed_size_cond_pattern (vec_info *vinfo,
>    return pattern_stmt;
>  }
>  
> +/* Function vect_recog_zero_extend_chain_pattern
> +
> +   Try to find the following pattern:
> +
> +     type x_t;
> +     TYPE a_T, b_T, c_T;

But a_T, b_T and c_T are different types - what types?

> +   loop:
> +     S1  a_T = (b_T)(c_T)x_t;
> +
> +   where type 'TYPE' is an integral type which has different size
> +   from 'type' and c_T is a zero extend or a sign extend on a value whose top
> +   bit is known to be zero. a_T can be signed or unsigned.
> +
> +   Input:
> +
> +   * STMT_VINFO: The stmt from which the pattern search begins.
> +
> +   Output:
> +
> +   * TYPE_OUT: The type of the output of this pattern.
> +
> +   * Return value: A new stmt that will be used to replace the pattern.
> +	This replaces multiple chained extensions with the longest possible
> +	chain of zero extends and a final convert to the required sign.
> +	
> +	S1  a_T = (a_T)(unsigned a_T)x_t;  */

So for (double)(int)x you produce (double)(int)(unsigned int)x?  It
feels like the above example misses a conversion?

I don't think the pure integer type sequence happens - we do
(and should if not) already perform canonicalization of equivalent
conversion sequences, otherwise you'd see a missed CSE (I couldn't
produce such an example).

You say you found the "issue" to be exposed in several (or found in one,
suspected in several) existing patterns.  Can you elaborate and point
out where this happens?  I don't think it covers the int -> double
case, does it?

Thanks,
Richard.

> +
> +static gimple *
> +vect_recog_zero_extend_chain_pattern (vec_info *vinfo,
> +				      stmt_vec_info stmt_vinfo, tree *type_out)
> +{
> +  gimple *last_stmt = STMT_VINFO_STMT (vect_stmt_to_vectorize (stmt_vinfo));
> +
> +  if (!is_gimple_assign (last_stmt))
> +    return NULL;
> +
> +  tree_code code = gimple_assign_rhs_code (last_stmt);
> +  tree lhs = gimple_assign_lhs (last_stmt);
> +  tree rhs = gimple_assign_rhs1 (last_stmt);
> +  tree lhs_type = TREE_TYPE (lhs);
> +  tree rhs_type = TREE_TYPE (rhs);
> +
> +  if ((code != FLOAT_EXPR && code != NOP_EXPR)
> +      || TYPE_UNSIGNED (lhs_type)
> +      || TREE_CODE (rhs_type) != INTEGER_TYPE
> +      || TREE_CODE (rhs) != SSA_NAME
> +      || STMT_VINFO_DEF_TYPE (stmt_vinfo) != vect_internal_def)
> +    return NULL;
> +
> +  /* Check to see if it's safe to extend the zero extend to the new type.
> +     In general this is safe if the rhs1 type is unsigned or if we know that
> +     the top bits are zero,  this can happen due to all the widening operations
> +     we have.   For instance a widening addition will have top bits zero.  */
> +  if (!TYPE_UNSIGNED (rhs_type))
> +    {
> +      wide_int wcst = get_nonzero_bits (rhs);
> +      if (wi::neg_p (wcst) || wi::clz (wcst) == 0)
> +	return NULL;
> +    }
> +
> +  tree cvt_type = unsigned_type_for (lhs_type);
> +
> +  tree cvt_vectype = get_vectype_for_scalar_type (vinfo, cvt_type);
> +  if (!cvt_vectype || !VECTOR_TYPE_P (cvt_vectype))
> +    return NULL;
> +
> +  tree out_vectype = get_vectype_for_scalar_type (vinfo, lhs_type);
> +  if (!out_vectype || !VECTOR_TYPE_P (out_vectype))
> +    return NULL;
> +
> +  stmt_vec_info irhs;
> +
> +  gimple_ranger ranger;
> +
> +  /* Dig through any existing conversions to see if we can extend the zero
> +     extend chain across multiple converts.  */
> +  while ((irhs = vect_get_internal_def (vinfo, rhs)))
> +    {
> +      gimple *g_irhs = STMT_VINFO_STMT (irhs);
> +      if (!is_gimple_assign (g_irhs)
> +	  || gimple_assign_rhs_code (g_irhs) != NOP_EXPR)
> +	break;
> +
> +      /* See if we can consume the next conversion as well.  To do this it's
> +	 best to use Ranger as it can see through the intermediate IL that the
> +	 vectorizer creates throughout pattern matching.  */
> +      int_range_max r;
> +      ranger.range_of_stmt (r, g_irhs);
> +      wide_int nz = r.get_nonzero_bits ();
> +      if (wi::neg_p (nz) || wi::clz (nz) == 0)
> +	break;
> +
> +      rhs = gimple_assign_rhs1 (g_irhs);
> +    }
> +
> +  /* If the result is a no-op, or we've jumped over a truncate of sort, or if
> +     nothing would change materially just leave it alone.  */
> +  if (TYPE_PRECISION (lhs_type) <= TYPE_PRECISION (TREE_TYPE (rhs))
> +      || (code == FLOAT_EXPR && rhs == gimple_assign_rhs1 (last_stmt)))
> +    return NULL;
> +
> +  vect_pattern_detected ("vect_recog_zero_extend_chain_pattern", last_stmt);
> +
> +  tree cast_var = vect_recog_temp_ssa_var (cvt_type, NULL);
> +  gimple *pattern_stmt = NULL;
> +  pattern_stmt = gimple_build_assign (cast_var, NOP_EXPR, rhs);
> +  append_pattern_def_seq (vinfo, stmt_vinfo, pattern_stmt, cvt_vectype);
> +
> +  tree cvt_var = vect_recog_temp_ssa_var (lhs_type, NULL);
> +  pattern_stmt = gimple_build_assign (cvt_var, code, cast_var);
> +
> +  *type_out = out_vectype;
> +
> +  return pattern_stmt;
> +}
> +
>  
>  /* Helper function of vect_recog_bool_pattern.  Called recursively, return
>     true if bool VAR can and should be optimized that way.  Assume it shouldn't
> @@ -7509,6 +7625,7 @@ static vect_recog_func vect_vect_recog_func_ptrs[] = {
>    { vect_recog_widen_minus_pattern, "widen_minus" },
>    { vect_recog_widen_abd_pattern, "widen_abd" },
>    /* These must come after the double widening ones.  */
> +  { vect_recog_zero_extend_chain_pattern, "zero_extend_chain" },
>  };
>  
>  /* Mark statements that are involved in a pattern.  */
> 
Tamar Christina Oct. 15, 2024, 1:17 p.m. UTC | #2
> -----Original Message-----
> From: Richard Biener <rguenther@suse.de>
> Sent: Tuesday, October 15, 2024 1:42 PM
> To: Tamar Christina <Tamar.Christina@arm.com>
> Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>
> Subject: Re: [PATCH 4/4]middle-end: create the longest possible zero extend chain
> after overwidening
> 
> On Mon, 14 Oct 2024, Tamar Christina wrote:
> 
> > Hi All,
> >
> > Consider loops such as:
> >
> > void test9(unsigned char *x, long long *y, int n, unsigned char k) {
> >     for(int i = 0; i < n; i++) {
> >         y[i] = k + x[i];
> >     }
> > }
> >
> > where today we generate:
> >
> > .L5:
> >         ldr     q29, [x5], 16
> >         add     x4, x4, 128
> >         uaddl   v1.8h, v29.8b, v30.8b
> >         uaddl2  v29.8h, v29.16b, v30.16b
> >         zip1    v2.8h, v1.8h, v31.8h
> >         zip1    v0.8h, v29.8h, v31.8h
> >         zip2    v1.8h, v1.8h, v31.8h
> >         zip2    v29.8h, v29.8h, v31.8h
> >         sxtl    v25.2d, v2.2s
> >         sxtl    v28.2d, v0.2s
> >         sxtl    v27.2d, v1.2s
> >         sxtl    v26.2d, v29.2s
> >         sxtl2   v2.2d, v2.4s
> >         sxtl2   v0.2d, v0.4s
> >         sxtl2   v1.2d, v1.4s
> >         sxtl2   v29.2d, v29.4s
> >         stp     q25, q2, [x4, -128]
> >         stp     q27, q1, [x4, -96]
> >         stp     q28, q0, [x4, -64]
> >         stp     q26, q29, [x4, -32]
> >         cmp     x5, x6
> >         bne     .L5
> >
> > Note how the zero extend from short to long is halfway through the chain
> > transformed into a sign extend.  There are two problems with this:
> >
> >   1. sign extends are typically slower than zero extends on many uArches.
> >   2. it prevents vectorizable_conversion from attempting to do a single step
> >      promotion.
> >
> > These sign extends happen due to the various range reduction optimizations
> > and patterns we have, such as multiplication widening, etc.
> >
> > My first attempt to fix this was to just update the patterns so that, when the
> > original source is a zero extend, they do not add the intermediate sign extend.
> >
> > However this behavior happens in many other places, and as new
> > patterns get added the problem can be re-introduced.
> >
> > Instead I have added a new pattern vect_recog_zero_extend_chain_pattern that
> > attempts to simplify and extend an existing zero extend over multiple
> > conversion statements.
> >
> > As an example, T3 a = (T3)(signed T2)(unsigned T1)x where bitsize T3 > T2 > T1
> > gets transformed into T3 a = (T3)(signed T2)(unsigned T2)x.
> >
> > The final cast to signed is kept so the types in the tree still match. It will
> > be correctly elided later on.
> >
> > This representation is optimal, as vectorizable_conversion is already
> > able to decompose a long promotion into multiple steps if the target does not
> > support it in a single step.  More importantly it allows us to do proper costing
> > and support conversions such as (double)x, where bitsize(x) < int, in an
> > efficient manner.
> >
> > To do this I have used Ranger's on-demand analysis to check whether an
> > extension can be removed and widened to a zero extend.  The reason for this
> > is that the vectorizer introduces several patterns that are not in the IL,  but
> > also lots of widening IFNs for which handling in a switch wouldn't be very
> > future proof.
> >
> > I did try to do it without Ranger, but Ranger had three benefits:
> >
> > 1.  It simplified the handling of the IL changes the vectorizer introduces, and
> >     makes it future proof.
> > 2.  Ranger has the advantage of doing the transformation in cases where it knows
> >     that the top bits of the value are zero, which we wouldn't be able to tell
> >     by looking purely at statements.
> > 3.  Ranger simplified the handling of corner cases.  Without it the handling was
> >     quite complex and I wasn't very confident in its correctness.
> >
> > So I think ranger is the right way to go here...  With these changes the above
> > now generates:
> >
> > .L5:
> >         add     x4, x4, 128
> >         ldr     q26, [x5], 16
> >         uaddl   v2.8h, v26.8b, v31.8b
> >         uaddl2  v26.8h, v26.16b, v31.16b
> >         tbl     v4.16b, {v2.16b}, v30.16b
> >         tbl     v3.16b, {v2.16b}, v29.16b
> >         tbl     v24.16b, {v2.16b}, v28.16b
> >         tbl     v1.16b, {v26.16b}, v30.16b
> >         tbl     v0.16b, {v26.16b}, v29.16b
> >         tbl     v25.16b, {v26.16b}, v28.16b
> >         tbl     v2.16b, {v2.16b}, v27.16b
> >         tbl     v26.16b, {v26.16b}, v27.16b
> >         stp     q4, q3, [x4, -128]
> >         stp     q1, q0, [x4, -64]
> >         stp     q24, q2, [x4, -96]
> >         stp     q25, q26, [x4, -32]
> >         cmp     x5, x6
> >         bne     .L5
> >
> > I have also seen similar improvements in codegen on Arm and x86_64, especially
> > with AVX512.
> >
> > Bootstrapped Regtested on aarch64-none-linux-gnu, arm-none-linux-gnueabihf,
> > x86_64-pc-linux-gnu -m32, -m64 and no issues.
> >
> > Hopefully Ok for master?
> 
> Hohumm.  So I looked at one of the examples and I don't see any
> sign-extends in the IL we vectorize.  So your pattern is about
> changing int -> double to unsigned int -> double but only so
> a required intermediate int -> long conversion is done as
> zero-extend?  IMO this doesn't belong to patterns but to
> vectorizable_conversion, specifically the step determining the
> intermediate types.

There is; it's not in the C source but created by the vectorizer.
So the way I saw it, vectorizable_conversion should be
given the choice of what it wants to do, whether it wants to do
it in one operation or multiple.

If the goal was to just get rid of the final zero extend, yes I would agree.

But that's just part of the goal; the other part is to have the zero extend
explicitly exposed to vectorizable_conversion.  That way the patch that uses
TBL actually sees the long multi-step conversion.

My worry was that if done in vectorizable_conversion, while I can walk the IL,
we'd cost the intermediate casts.  On AArch64 the cost model takes into account
the throughput of sequences, not just latencies, and the TBLs have better
throughput.  So for costing you really do want to see the full thing and not
cost the intermediate conversions.

This is why I used a pattern, since the IL is actually changed from the input.
But see below...

> 
> I don't quite understand what scalar pattern IL you feed to
> the vectorizer in the end, few comments - also to this effect,
> below.

Sure, I'll answer this and the question below in one go:

Lets pick a simple example:

void test8(unsigned char *x, double *y, int n, unsigned char k) {
    for(int i = 0; i < n; i++) {
        y[i] = k + x[i];
    }
}

In GIMPLE this generates:

 _4 = *_3;
  _5 = (int) _4;
  _6 = _5 + _29;
  _9 = (double) _6;

i.e. the unsigned char is widened to int, added to k as int, and then
converted to a double.

When we start vectorizing, overwidening detection runs:

note:   vect_recog_over_widening_pattern: detected: _6 = _5 + _29;
note:   demoting int to unsigned short
note:   Splitting statement: _5 = (int) _4;
note:   into pattern statements: patt_32 = (unsigned short) _4;
note:   and: patt_31 = (int) patt_32;
note:   created pattern stmt: patt_28 = patt_32 + patt_30;
note:   over_widening pattern recognized: patt_27 = (int) patt_28;
note:   extra pattern stmt: patt_28 = patt_32 + patt_30;
note:   vect_is_simple_use: operand (unsigned short) _4, type of def: internal

and it correctly demotes it from int to unsigned short, performs the
operation as unsigned, and then sign extends that to int.

The final IL the vectorizer builds is this:

note:   node 0x473d790 (max_nunits=16, refcnt=2) vector(2) double
note:   op template: *_8 = _9;
note:   	stmt 0 *_8 = _9;
note:   	children 0x473d828
note:   node 0x473d828 (max_nunits=16, refcnt=2) vector(2) double
note:   op template: _9 = (double) _6;
note:   	stmt 0 _9 = (double) _6;
note:   	children 0x473d8c0
note:   node 0x473d8c0 (max_nunits=16, refcnt=2) vector(4) int
note:   op template: patt_27 = (int) patt_28;
note:   	stmt 0 patt_27 = (int) patt_28;
note:   	children 0x473d958
note:   node 0x473d958 (max_nunits=16, refcnt=2) vector(8) unsigned short
note:   op template: patt_28 = .VEC_WIDEN_PLUS (_4, k_14(D));
note:   	stmt 0 patt_28 = .VEC_WIDEN_PLUS (_4, k_14(D));
note:   	children 0x473d9f0 0x473da88
note:   node 0x473d9f0 (max_nunits=16, refcnt=2) vector(16) unsigned char
note:   op template: _4 = *_3;
note:   	stmt 0 _4 = *_3;
note:   	load permutation { 0 }
note:   node (external) 0x473da88 (max_nunits=1, refcnt=1)
note:   	{ k_14(D) }

We already relax the cast to int into a zero extend today.  However the final
promotion is the FLOAT_EXPR.  There it sees a widening from int to long and
assumes that this has to be a sign extend because of the signed input; it
doesn't know that the signed input was created by a zero-extending operation.

What my pattern does is make this explicit in the tree.

My first attempt was to update the code that does:

/app/example.c:2:22: note:   demoting int to unsigned short
/app/example.c:2:22: note:   Splitting statement: _5 = (int) _4;
/app/example.c:2:22: note:   into pattern statements: patt_32 = (unsigned short) _4;
/app/example.c:2:22: note:   and: patt_31 = (int) patt_32;

in vect_split_statement.  But even doing so, the problem is that it splits the
range of the conversions, and so while this fixed some cases, code that uses
the result never knows that the top bits are zero.

So it worked for some cases but missed out on plenty of others.

> So for (double)(int)x you produce (double)(int)(unsigned int)x?  It
> feels like the above example misses a conversion?
> 

No, (double)(int)x is basically (double)(long)(int)x, and the pattern creates
(double)(long)(unsigned long)x.  Basically it just makes the conversions
explicit and extends the zero extend as wide as possible.

> I don't think the pure integer type sequence happens - we do
> (and should if not) already perform canonicalization of equivalent
> conversion sequences, otherwise you'd see a missed CSE (I couldn't
> produce such an example).

It does.  This is a pure integer sequence with the same problem:

void test8(unsigned char *x, long *y, int n, unsigned char k) {
    for(int i = 0; i < n; i++) {
        y[i] = k + x[i];
    }
}
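
With the same over-widening as in the double example above this becomes
roughly (reusing the pattern names from the dump above):

    patt_32 = (unsigned short) _4;   /* zero extend uchar -> ushort  */
    patt_28 = patt_32 + patt_30;     /* done as unsigned short       */
    patt_27 = (int) patt_28;         /* top bits known to be zero    */
    _6      = (long) patt_27;        /* today emitted as sign extend */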

And there is other code that introduces this; for instance, there's a cleanup
specifically after multiplication widening that also tries to split types, and
it will do this as well.
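
The multiplication variant has the same shape (this mirrors test7 in the new
testcase):

    void test7 (unsigned char *x, double *y, int n, unsigned char k) {
        for (int i = 0; i < n; i++)
            y[i] += k * x[i];   /* widening multiply, then extend.  */
    }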

> 
> You say you found the "issue" to be exposed in several (or found in one,
> suspected in several) existing patterns.  Can you elaborate and point
> out where this happens?  I don't think it covers the int -> double
> case, does it?

Not this case on its own.  You have to do *some* operation in between.

Thanks,
Tamar

> 
> > Thanks,
> > Tamar
> >
> > gcc/ChangeLog:
> >
> > 	* tree-vect-patterns.cc (vect_recog_zero_extend_chain_pattern): New.
> >
> > gcc/testsuite/ChangeLog:
> >
> > 	* gcc.dg/vect/bb-slp-pattern-1.c: Update tests.
> > 	* gcc.dg/vect/slp-widen-mult-half.c: Likewise.
> > 	* gcc.dg/vect/vect-over-widen-10.c: Likewise.
> > 	* gcc.dg/vect/vect-over-widen-12.c: Likewise.
> > 	* gcc.dg/vect/vect-over-widen-14.c: Likewise.
> > 	* gcc.dg/vect/vect-over-widen-16.c: Likewise.
> > 	* gcc.dg/vect/vect-over-widen-6.c: Likewise.
> > 	* gcc.dg/vect/vect-over-widen-8.c: Likewise.
> > 	* gcc.dg/vect/vect-widen-mult-u16.c: Likewise.
> > 	* gcc.dg/vect/vect-widen-mult-u8-s16-s32.c: Likewise.
> > 	* lib/target-supports.exp
> > 	(check_effective_target_vect_widen_mult_hi_to_si_pattern,
> > 	check_effective_target_vect_widen_mult_si_to_di_pattern): Enable
> > 	AArch64.
> > 	* gcc.target/aarch64/vect-tbl-zero-extend_2.c: New test.
> >
> > ---
> > diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-pattern-1.c
> b/gcc/testsuite/gcc.dg/vect/bb-slp-pattern-1.c
> > index
> 5ae99225273ca5f915f60ecba3a5aaedebe46e96..627de78af4e48581575beda9
> 7bf2a0708ac091cb 100644
> > --- a/gcc/testsuite/gcc.dg/vect/bb-slp-pattern-1.c
> > +++ b/gcc/testsuite/gcc.dg/vect/bb-slp-pattern-1.c
> > @@ -52,4 +52,4 @@ int main (void)
> >
> >  /* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "slp2" {
> target { vect_widen_mult_hi_to_si || vect_unpack } } } } */
> >  /* { dg-final { scan-tree-dump-times "vect_recog_widen_mult_pattern:
> detected" 8 "slp2" { target vect_widen_mult_hi_to_si_pattern } } } */
> > -/* { dg-final { scan-tree-dump-times "pattern recognized" 8 "slp2" { target
> vect_widen_mult_hi_to_si_pattern } } } */
> > +/* { dg-final { scan-tree-dump-times "widen_mult pattern recognized" 8 "slp2" {
> target vect_widen_mult_hi_to_si_pattern } } } */
> > diff --git a/gcc/testsuite/gcc.dg/vect/slp-widen-mult-half.c
> b/gcc/testsuite/gcc.dg/vect/slp-widen-mult-half.c
> > index
> b69ade338862cda4f44f5206d195eef1cb5e8d36..aecc085a51c93e0e7bed122df
> 0a77a0a099ad6ef 100644
> > --- a/gcc/testsuite/gcc.dg/vect/slp-widen-mult-half.c
> > +++ b/gcc/testsuite/gcc.dg/vect/slp-widen-mult-half.c
> > @@ -52,5 +52,5 @@ int main (void)
> >  /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { target
> vect_widen_mult_hi_to_si } } } */
> >  /* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 2 "vect" {
> target vect_widen_mult_hi_to_si } } } */
> >  /* { dg-final { scan-tree-dump-times "vect_recog_widen_mult_pattern:
> detected" 2 "vect" { target vect_widen_mult_hi_to_si_pattern } } } */
> > -/* { dg-final { scan-tree-dump-times "pattern recognized" 2 "vect" { target
> vect_widen_mult_hi_to_si_pattern } } } */
> > +/* { dg-final { scan-tree-dump-times "pattern recognized" 4 "vect" { target
> vect_widen_mult_hi_to_si_pattern } } } */
> >
> > diff --git a/gcc/testsuite/gcc.dg/vect/vect-over-widen-10.c
> b/gcc/testsuite/gcc.dg/vect/vect-over-widen-10.c
> > index
> f0140e4ef6d70cd61aa7dbb3ba39b1da142a79b2..bd798fae7e8136975d48820
> 6cfef9e39fac2bfea 100644
> > --- a/gcc/testsuite/gcc.dg/vect/vect-over-widen-10.c
> > +++ b/gcc/testsuite/gcc.dg/vect/vect-over-widen-10.c
> > @@ -11,7 +11,7 @@
> >
> >  #include "vect-over-widen-9.c"
> >
> > -/* { dg-final { scan-tree-dump {Splitting statement} "vect" } } */
> > +/* { dg-final { scan-tree-dump {Splitting pattern statement} "vect" } } */
> >  /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern:
> detected:[^\n]* \+ } "vect" } } */
> >  /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern:
> detected:[^\n]* >> 1} "vect" } } */
> >  /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern:
> detected:[^\n]* >> 2} "vect" } } */
> > diff --git a/gcc/testsuite/gcc.dg/vect/vect-over-widen-12.c
> b/gcc/testsuite/gcc.dg/vect/vect-over-widen-12.c
> > index
> ddb3bd8c0d378f0138c8cc7f9c6ea3300744b8a8..8c0544e35c29de60e76759f4
> ed13206278c72925 100644
> > --- a/gcc/testsuite/gcc.dg/vect/vect-over-widen-12.c
> > +++ b/gcc/testsuite/gcc.dg/vect/vect-over-widen-12.c
> > @@ -11,7 +11,7 @@
> >
> >  #include "vect-over-widen-11.c"
> >
> > -/* { dg-final { scan-tree-dump {Splitting statement} "vect" } } */
> > +/* { dg-final { scan-tree-dump {Splitting pattern statement} "vect" } } */
> >  /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern:
> detected:[^\n]* \+ } "vect" } } */
> >  /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern:
> detected:[^\n]* >> 1} "vect" } } */
> >  /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern:
> detected:[^\n]* >> 2} "vect" } } */
> > diff --git a/gcc/testsuite/gcc.dg/vect/vect-over-widen-14.c
> b/gcc/testsuite/gcc.dg/vect/vect-over-widen-14.c
> > index
> dfa09f5d2cafe329e6d57b5cc681786cc2c7d215..1fe0305c1c4f61d05864ef9778
> 9726a1dc6ec8b1 100644
> > --- a/gcc/testsuite/gcc.dg/vect/vect-over-widen-14.c
> > +++ b/gcc/testsuite/gcc.dg/vect/vect-over-widen-14.c
> > @@ -11,7 +11,7 @@
> >
> >  #include "vect-over-widen-13.c"
> >
> > -/* { dg-final { scan-tree-dump {Splitting statement} "vect" } } */
> > +/* { dg-final { scan-tree-dump {Splitting pattern statement} "vect" } } */
> >  /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern:
> detected:[^\n]* \+} "vect" } } */
> >  /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern:
> detected:[^\n]* >> 1} "vect" } } */
> >  /* { dg-final { scan-tree-dump {vect_recog_cast_forwprop_pattern:
> detected:[^\n]* = \(unsigned char\)} "vect" } } */
> > diff --git a/gcc/testsuite/gcc.dg/vect/vect-over-widen-16.c
> b/gcc/testsuite/gcc.dg/vect/vect-over-widen-16.c
> > index
> 4584c586da1e6f13e8c8de4c1291cea0141ebab5..4ecdadf7a035a4f83b1767a06
> 3a1b0f47bdd543d 100644
> > --- a/gcc/testsuite/gcc.dg/vect/vect-over-widen-16.c
> > +++ b/gcc/testsuite/gcc.dg/vect/vect-over-widen-16.c
> > @@ -11,7 +11,7 @@
> >
> >  #include "vect-over-widen-15.c"
> >
> > -/* { dg-final { scan-tree-dump {Splitting statement} "vect" } } */
> > +/* { dg-final { scan-tree-dump {Splitting pattern statement} "vect" } } */
> >  /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern:
> detected:[^\n]* \+} "vect" } } */
> >  /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern:
> detected:[^\n]* >> 1} "vect" } } */
> >  /* { dg-final { scan-tree-dump-not {vect_recog_cast_forwprop_pattern:
> detected} "vect" } } */
> > diff --git a/gcc/testsuite/gcc.dg/vect/vect-over-widen-6.c
> b/gcc/testsuite/gcc.dg/vect/vect-over-widen-6.c
> > index
> bda92c965e080dd3f48ec42b6bea16e79d9416cd..6b8c3dfa2c89ce04d7673607
> ef2d2f14a14eb32f 100644
> > --- a/gcc/testsuite/gcc.dg/vect/vect-over-widen-6.c
> > +++ b/gcc/testsuite/gcc.dg/vect/vect-over-widen-6.c
> > @@ -9,7 +9,7 @@
> >
> >  #include "vect-over-widen-5.c"
> >
> > -/* { dg-final { scan-tree-dump {Splitting statement} "vect" } } */
> > +/* { dg-final { scan-tree-dump {Splitting pattern statement} "vect" } } */
> >  /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern:
> detected:[^\n]* \+ } "vect" } } */
> >  /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern:
> detected:[^\n]* >> 1} "vect" } } */
> >  /* { dg-final { scan-tree-dump {vect_recog_cast_forwprop_pattern:
> detected:[^\n]* \(unsigned char\)} "vect" } } */
> > diff --git a/gcc/testsuite/gcc.dg/vect/vect-over-widen-8.c
> b/gcc/testsuite/gcc.dg/vect/vect-over-widen-8.c
> > index
> 553c0712a79a1d19195dbdab7cbd6fa330685bea..1cf725ff4b7f151097192db1
> a0b65173c4c83b19 100644
> > --- a/gcc/testsuite/gcc.dg/vect/vect-over-widen-8.c
> > +++ b/gcc/testsuite/gcc.dg/vect/vect-over-widen-8.c
> > @@ -12,7 +12,7 @@
> >
> >  #include "vect-over-widen-7.c"
> >
> > -/* { dg-final { scan-tree-dump {Splitting statement} "vect" } } */
> > +/* { dg-final { scan-tree-dump {Splitting pattern statement} "vect" } } */
> >  /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern:
> detected:[^\n]* \+ } "vect" } } */
> >  /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern:
> detected:[^\n]* >> 2} "vect" } } */
> >  /* { dg-final { scan-tree-dump {vect_recog_cast_forwprop_pattern:
> detected:[^\n]* \(unsigned char\)} "vect" } } */
> > diff --git a/gcc/testsuite/gcc.dg/vect/vect-widen-mult-u16.c
> b/gcc/testsuite/gcc.dg/vect/vect-widen-mult-u16.c
> > index
> 258d253f401459d448d1ae86f56b0c97815d5b61..b5018f855a72534b4d64d2
> dc2b7ab2ac0deb674b 100644
> > --- a/gcc/testsuite/gcc.dg/vect/vect-widen-mult-u16.c
> > +++ b/gcc/testsuite/gcc.dg/vect/vect-widen-mult-u16.c
> > @@ -47,5 +47,5 @@ int main (void)
> >
> >  /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { target {
> vect_widen_mult_hi_to_si || vect_unpack } } } } */
> >  /* { dg-final { scan-tree-dump-times "vect_recog_widen_mult_pattern:
> detected" 1 "vect" { target vect_widen_mult_hi_to_si_pattern } } } */
> > -/* { dg-final { scan-tree-dump-times "pattern recognized" 1 "vect" { target
> vect_widen_mult_hi_to_si_pattern } } } */
> > +/* { dg-final { scan-tree-dump-times "widen_mult pattern recognized" 1 "vect" {
> target vect_widen_mult_hi_to_si_pattern } } } */
> >
> > diff --git a/gcc/testsuite/gcc.dg/vect/vect-widen-mult-u8-s16-s32.c
> b/gcc/testsuite/gcc.dg/vect/vect-widen-mult-u8-s16-s32.c
> > index
> 3baafca7b548124ae5c48fdf3c2f07c319155967..ab523ca77652e1f1533889fda9
> c0eb31c987ffe9 100644
> > --- a/gcc/testsuite/gcc.dg/vect/vect-widen-mult-u8-s16-s32.c
> > +++ b/gcc/testsuite/gcc.dg/vect/vect-widen-mult-u8-s16-s32.c
> > @@ -47,5 +47,5 @@ int main (void)
> >
> >  /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { target {
> vect_widen_mult_hi_to_si || vect_unpack } } } } */
> >  /* { dg-final { scan-tree-dump-times "vect_recog_widen_mult_pattern:
> detected" 1 "vect" { target vect_widen_mult_hi_to_si_pattern } } } */
> > -/* { dg-final { scan-tree-dump-times "pattern recognized" 1 "vect" { target
> vect_widen_mult_hi_to_si_pattern } } } */
> > +/* { dg-final { scan-tree-dump-times "widen_mult pattern recognized" 1 "vect" {
> target vect_widen_mult_hi_to_si_pattern } } } */
> >
> > diff --git a/gcc/testsuite/gcc.target/aarch64/vect-tbl-zero-extend_2.c
> b/gcc/testsuite/gcc.target/aarch64/vect-tbl-zero-extend_2.c
> > new file mode 100644
> > index
> 0000000000000000000000000000000000000000..1577eacd9dbbb52274d9f
> 86c77406555b7726482
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/aarch64/vect-tbl-zero-extend_2.c
> > @@ -0,0 +1,33 @@
> > +/* { dg-do compile } */
> > +/* { dg-additional-options "-O3 -std=c99 -march=armv8-a" } */
> > +
> > +void test6(unsigned char *x, double *y, int n) {
> > +    for(int i = 0; i < (n & -8); i++) {
> > +        y[i] += x[i];
> > +    }
> > +}
> > +
> > +void test7(unsigned char *x, double *y, int n, unsigned char k) {
> > +    for(int i = 0; i < (n & -8); i++) {
> > +        y[i] += k * x[i];
> > +    }
> > +}
> > +
> > +void test8(unsigned char *x, double *y, int n, unsigned char k) {
> > +    for(int i = 0; i < (n & -8); i++) {
> > +        y[i] = k + x[i];
> > +    }
> > +}
> > +
> > +void test9(unsigned char *x, long long *y, int n, unsigned char k) {
> > +    for(int i = 0; i < (n & -8); i++) {
> > +        y[i] = k + x[i];
> > +    }
> > +}
> > +
> > +/* { dg-final { scan-assembler-times {\tuxtl} 1 } } */
> > +/* { dg-final { scan-assembler-not {\tuxtl2} } } */
> > +/* { dg-final { scan-assembler-not {\tzip1} } } */
> > +/* { dg-final { scan-assembler-not {\tzip2} } } */
> > +/* { dg-final { scan-assembler-times {\ttbl} 44 } } */
> > +/* { dg-final { scan-assembler-times {\.LC[0-9]+:} 12 } } */
> > diff --git a/gcc/testsuite/lib/target-supports.exp b/gcc/testsuite/lib/target-
> supports.exp
> > index
> d113a08dff7b2a8ab5bdfe24386d271bff255afc..feae1b8fcf8cd7ab56a8c76c0cd3
> 034c0a828724 100644
> > --- a/gcc/testsuite/lib/target-supports.exp
> > +++ b/gcc/testsuite/lib/target-supports.exp
> > @@ -8240,6 +8240,7 @@ proc
> check_effective_target_vect_widen_mult_hi_to_si_pattern { } {
> >      return [check_cached_effective_target_indexed
> vect_widen_mult_hi_to_si_pattern {
> >        expr { [istarget powerpc*-*-*]
> >  	     || [istarget ia64-*-*]
> > +	     || [istarget aarch64*-*-*]
> >  	     || [istarget loongarch*-*-*]
> >  	     || [istarget i?86-*-*] || [istarget x86_64-*-*]
> >  	     || ([is-effective-target arm_neon]
> > @@ -8259,6 +8260,7 @@ proc
> check_effective_target_vect_widen_mult_si_to_di_pattern { } {
> >        expr { [istarget ia64-*-*]
> >  	     || [istarget i?86-*-*] || [istarget x86_64-*-*]
> >  	     || [istarget loongarch*-*-*]
> > +	     || [istarget aarch64*-*-*]
> >  	     || ([istarget s390*-*-*]
> >  		 && [check_effective_target_s390_vx]) }}]
> >  }
> > diff --git a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc
> > index
> 9bf8526ac995c6c2678b25f5df4316aec41333e0..74c7269a3ab15cba1ee2ef055
> 6d25afda851f7f0 100644
> > --- a/gcc/tree-vect-patterns.cc
> > +++ b/gcc/tree-vect-patterns.cc
> > @@ -5524,6 +5524,122 @@ vect_recog_mixed_size_cond_pattern (vec_info
> *vinfo,
> >    return pattern_stmt;
> >  }
> >
> > +/* Function vect_recog_zero_extend_chain_pattern
> > +
> > +   Try to find the following pattern:
> > +
> > +     type x_t;
> > +     TYPE a_T, b_T, c_T;
> 
> But a_T, b_T and c_T are different types - what types?
> 
> > +   loop:
> > +     S1  a_T = (b_T)(c_T)x_t;
> > +
> > +   where type 'TYPE' is an integral type which has different size
> > +   from 'type' and c_T is a zero extend or a sign extend on a value whose top
> > +   bit is known to be zero. a_T can be signed or unsigned.
> > +
> > +   Input:
> > +
> > +   * STMT_VINFO: The stmt from which the pattern search begins.
> > +
> > +   Output:
> > +
> > +   * TYPE_OUT: The type of the output of this pattern.
> > +
> > +   * Return value: A new stmt that will be used to replace the pattern.
> > +	This replaces multiple chained extensions with the longest possible
> > +	chain of zero extends and a final convert to the required sign.
> > +
> > +	S1  a_T = (a_T)(unsigned a_T)x_t;  */
> 
> So for (double)(int)x you produce (double)(int)(unsigned int)x?  It
> feels like the above example misses a conversion?
> 
> I don't think the pure integer type sequence happens - we do
> (and should if not) already perform canonicalization of equivalent
> conversion sequences, otherwise you'd see a missed CSE (I couldn't
> produce such an example).
> 
> You say you found the "issue" to be exposed in several (or found in one,
> suspected in several) existing patterns.  Can you elaborate and point
> out where this happens?  I don't think it covers the int -> double
> case, does it?
> 
> Thanks,
> Richard.
> 
> > +
> > +static gimple *
> > +vect_recog_zero_extend_chain_pattern (vec_info *vinfo,
> > +				      stmt_vec_info stmt_vinfo, tree *type_out)
> > +{
> > +  gimple *last_stmt = STMT_VINFO_STMT (vect_stmt_to_vectorize
> (stmt_vinfo));
> > +
> > +  if (!is_gimple_assign (last_stmt))
> > +    return NULL;
> > +
> > +  tree_code code = gimple_assign_rhs_code (last_stmt);
> > +  tree lhs = gimple_assign_lhs (last_stmt);
> > +  tree rhs = gimple_assign_rhs1 (last_stmt);
> > +  tree lhs_type = TREE_TYPE (lhs);
> > +  tree rhs_type = TREE_TYPE (rhs);
> > +
> > +  if ((code != FLOAT_EXPR && code != NOP_EXPR)
> > +      || TYPE_UNSIGNED (lhs_type)
> > +      || TREE_CODE (rhs_type) != INTEGER_TYPE
> > +      || TREE_CODE (rhs) != SSA_NAME
> > +      || STMT_VINFO_DEF_TYPE (stmt_vinfo) != vect_internal_def)
> > +    return NULL;
> > +
> > +  /* Check to see if it's safe to extend the zero extend to the new type.
> > +     In general this is safe if the rhs1 type is unsigned or if we know that
> > +     the top bits are zero,  this can happen due to all the widening operations
> > +     we have.   For instance a widening addition will have top bits zero.  */
> > +  if (!TYPE_UNSIGNED (rhs_type))
> > +    {
> > +      wide_int wcst = get_nonzero_bits (rhs);
> > +      if (wi::neg_p (wcst) || wi::clz (wcst) == 0)
> > +	return NULL;
> > +    }
> > +
> > +  tree cvt_type = unsigned_type_for (lhs_type);
> > +
> > +  tree cvt_vectype = get_vectype_for_scalar_type (vinfo, cvt_type);
> > +  if (!cvt_vectype || !VECTOR_TYPE_P (cvt_vectype))
> > +    return NULL;
> > +
> > +  tree out_vectype = get_vectype_for_scalar_type (vinfo, lhs_type);
> > +  if (!out_vectype || !VECTOR_TYPE_P (out_vectype))
> > +    return NULL;
> > +
> > +  stmt_vec_info irhs;
> > +
> > +  gimple_ranger ranger;
> > +
> > +  /* Dig through any existing conversions to see if we can extend the zero
> > +     extend chain across multiple converts.  */
> > +  while ((irhs = vect_get_internal_def (vinfo, rhs)))
> > +    {
> > +      gimple *g_irhs = STMT_VINFO_STMT (irhs);
> > +      if (!is_gimple_assign (g_irhs)
> > +	  || gimple_assign_rhs_code (g_irhs) != NOP_EXPR)
> > +	break;
> > +
> > +      /* See if we can consume the next conversion as well.  To do this it's
> > +	 best to use Ranger as it can see through the intermediate IL that the
> > +	 vectorizer creates throughout pattern matching.  */
> > +      int_range_max r;
> > +      ranger.range_of_stmt (r, g_irhs);
> > +      wide_int nz = r.get_nonzero_bits ();
> > +      if (wi::neg_p (nz) || wi::clz (nz) == 0)
> > +	break;
> > +
> > +      rhs = gimple_assign_rhs1 (g_irhs);
> > +    }
> > +
> > +  /* If the result is a no-op, or we've jumped over a truncate of sort, or if
> > +     nothing would change materially just leave it alone.  */
> > +  if (TYPE_PRECISION (lhs_type) <= TYPE_PRECISION (TREE_TYPE (rhs))
> > +      || (code == FLOAT_EXPR && rhs == gimple_assign_rhs1 (last_stmt)))
> > +    return NULL;
> > +
> > +  vect_pattern_detected ("vect_recog_zero_extend_chain_pattern", last_stmt);
> > +
> > +  tree cast_var = vect_recog_temp_ssa_var (cvt_type, NULL);
> > +  gimple *pattern_stmt = NULL;
> > +  pattern_stmt = gimple_build_assign (cast_var, NOP_EXPR, rhs);
> > +  append_pattern_def_seq (vinfo, stmt_vinfo, pattern_stmt, cvt_vectype);
> > +
> > +  tree cvt_var = vect_recog_temp_ssa_var (lhs_type, NULL);
> > +  pattern_stmt = gimple_build_assign (cvt_var, code, cast_var);
> > +
> > +  *type_out = out_vectype;
> > +
> > +  return pattern_stmt;
> > +}
> > +
> >
> >  /* Helper function of vect_recog_bool_pattern.  Called recursively, return
> >     true if bool VAR can and should be optimized that way.  Assume it shouldn't
> > @@ -7509,6 +7625,7 @@ static vect_recog_func vect_vect_recog_func_ptrs[] = {
> >    { vect_recog_widen_minus_pattern, "widen_minus" },
> >    { vect_recog_widen_abd_pattern, "widen_abd" },
> >    /* These must come after the double widening ones.  */
> > +  { vect_recog_zero_extend_chain_pattern, "zero_extend_chain" },
> >  };
> >
> >  /* Mark statements that are involved in a pattern.  */
> >
> >
> >
> >
> >
> 
> --
> Richard Biener <rguenther@suse.de>
> SUSE Software Solutions Germany GmbH,
> Frankenstrasse 146, 90461 Nuernberg, Germany;
> GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
Richard Biener Oct. 17, 2024, 12:59 p.m. UTC | #3
On Tue, 15 Oct 2024, Tamar Christina wrote:

> > -----Original Message-----
> > From: Richard Biener <rguenther@suse.de>
> > Sent: Tuesday, October 15, 2024 1:42 PM
> > To: Tamar Christina <Tamar.Christina@arm.com>
> > Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>
> > Subject: Re: [PATCH 4/4]middle-end: create the longest possible zero extend chain
> > after overwidening
> > 
> > On Mon, 14 Oct 2024, Tamar Christina wrote:
> > 
> > > Hi All,
> > >
> > > Consider loops such as:
> > >
> > > void test9(unsigned char *x, long long *y, int n, unsigned char k) {
> > >     for(int i = 0; i < n; i++) {
> > >         y[i] = k + x[i];
> > >     }
> > > }
> > >
> > > where today we generate:
> > >
> > > .L5:
> > >         ldr     q29, [x5], 16
> > >         add     x4, x4, 128
> > >         uaddl   v1.8h, v29.8b, v30.8b
> > >         uaddl2  v29.8h, v29.16b, v30.16b
> > >         zip1    v2.8h, v1.8h, v31.8h
> > >         zip1    v0.8h, v29.8h, v31.8h
> > >         zip2    v1.8h, v1.8h, v31.8h
> > >         zip2    v29.8h, v29.8h, v31.8h
> > >         sxtl    v25.2d, v2.2s
> > >         sxtl    v28.2d, v0.2s
> > >         sxtl    v27.2d, v1.2s
> > >         sxtl    v26.2d, v29.2s
> > >         sxtl2   v2.2d, v2.4s
> > >         sxtl2   v0.2d, v0.4s
> > >         sxtl2   v1.2d, v1.4s
> > >         sxtl2   v29.2d, v29.4s
> > >         stp     q25, q2, [x4, -128]
> > >         stp     q27, q1, [x4, -96]
> > >         stp     q28, q0, [x4, -64]
> > >         stp     q26, q29, [x4, -32]
> > >         cmp     x5, x6
> > >         bne     .L5
> > >
> > > Note how the zero extend from short to long is halfway through the chain
> > > transformed into a sign extend.  There are two problems with this:
> > >
> > >   1. sign extends are typically slower than zero extends on many uArches.
> > >   2. it prevents vectorizable_conversion from attempting to do a single step
> > >      promotion.
> > >
> > > These sign extends happen due to the various range reduction optimizations
> > > and patterns we have, such as multiplication widening, etc.
> > >
> > > My first attempt to fix this was just updating the patterns so that, when the
> > > original source is a zero extend, they do not add the intermediate sign extend.
> > >
> > > However this behavior happens in many other places, and as new patterns get
> > > added the problem can be re-introduced.
> > >
> > > Instead I have added a new pattern vect_recog_zero_extend_chain_pattern that
> > > attempts to simplify and extend an existing zero extend over multiple
> > > conversion statements.
> > >
> > > As an example, T3 a = (T3)(signed T2)(unsigned T1)x where bitsize T3 > T2 > T1
> > > gets transformed into T3 a = (T3)(signed T2)(unsigned T2)x.
> > >
> > > The final cast to signed is kept so the types in the tree still match.  It will
> > > be correctly elided later on.
> > >
> > > This representation is optimal, as vectorizable_conversion is already
> > > able to decompose a long promotion into multiple steps if the target does not
> > > support it in a single step.  More importantly it allows us to do proper costing
> > > and support such conversions like (double)x, where bitsize(x) < int in an
> > > efficient manner.
> > >
> > > To do this I have used Ranger's on-demand analysis to perform the check to see
> > > if an extension can be removed and widened into a zero extend.  The reason for this
> > > is that the vectorizer introduces several patterns that are not in the IL,  but
> > > also lots of widening IFNs for which handling in a switch wouldn't be very
> > > future proof.
> > >
> > > I did try to do it without Ranger, but Ranger had three benefits:
> > >
> > > 1.  It simplified the handling of the IL changes the vectorizer introduces, and
> > >     makes it future proof.
> > > 2.  Ranger has the advantage of doing the transformation in cases where it
> > >     knows that the top bits of the value are zero, which we wouldn't be able
> > >     to tell by looking purely at statements.
> > > 3.  Ranger simplified the handling of corner cases.  Without it the handling was
> > >     quite complex and I wasn't very confident in its correctness.
> > >
> > > So I think Ranger is the right way to go here...  With these changes the above
> > > now generates:
> > >
> > > .L5:
> > >         add     x4, x4, 128
> > >         ldr     q26, [x5], 16
> > >         uaddl   v2.8h, v26.8b, v31.8b
> > >         uaddl2  v26.8h, v26.16b, v31.16b
> > >         tbl     v4.16b, {v2.16b}, v30.16b
> > >         tbl     v3.16b, {v2.16b}, v29.16b
> > >         tbl     v24.16b, {v2.16b}, v28.16b
> > >         tbl     v1.16b, {v26.16b}, v30.16b
> > >         tbl     v0.16b, {v26.16b}, v29.16b
> > >         tbl     v25.16b, {v26.16b}, v28.16b
> > >         tbl     v2.16b, {v2.16b}, v27.16b
> > >         tbl     v26.16b, {v26.16b}, v27.16b
> > >         stp     q4, q3, [x4, -128]
> > >         stp     q1, q0, [x4, -64]
> > >         stp     q24, q2, [x4, -96]
> > >         stp     q25, q26, [x4, -32]
> > >         cmp     x5, x6
> > >         bne     .L5
> > >
> > > I have also seen similar improvements in codegen on Arm and x86_64, especially
> > > with AVX512.
> > >
> > > Bootstrapped Regtested on aarch64-none-linux-gnu, arm-none-linux-gnueabihf,
> > > x86_64-pc-linux-gnu -m32, -m64 and no issues.
> > >
> > > Hopefully Ok for master?
> > 
> > Hohumm.  So I looked at one of the examples and I don't see any
> > sign-extends in the IL we vectorize.  So your pattern is about
> > changing int -> double to unsigned int -> double but only so
> > a required intermediate int -> long conversion is done as
> > zero-extend?  IMO this doesn't belong in patterns but in
> > vectorizable_conversion, specifically the step determining the
> > intermediate types.
> 
> There is; it's not in the C source but is created by the vectorizer.
> So the way I saw it was that vectorizable_conversion should be
> given the choice of what it wants to do, whether it wants to do
> it in one operation or multiple.
> 
> If the goal was to just get rid of the final zero extend, yes I would agree.
> 
> But that's just part of the goal; the other is to have the zero extend explicitly
> exposed to vectorizable_conversion.  That way the patch that uses TBL actually
> sees the long multi-step conversion.
> 
> My worry was that if done in vectorizable_conversion, while I can walk the IL,
> we'd cost the intermediate casts.  On AArch64 the cost model takes into account
> the throughput of sequences, not just latencies. And the TBLs have better throughput.
> So for costing you really do want to see the full thing and not cost the
> intermediate conversions.
> 
> This is why I used a pattern, since the IL is actually changed from the input.  But see below...
> 
> > 
> > I don't quite understand what scalar pattern IL you feed to
> > the vectorizer in the end, few comments - also to this effect,
> > below.
> 
> Sure, I'll answer this and the question below in one go:
> 
> Let's pick a simple example:
> 
> void test8(unsigned char *x, double *y, int n, unsigned char k) {
>     for(int i = 0; i < n; i++) {
>         y[i] = k + x[i];
>     }
> }
> 
> In GIMPLE this generates:
> 
>   _4 = *_3;
>   _5 = (int) _4;
>   _6 = _5 + _29;
>   _9 = (double) _6;
> 
> i.e. the unsigned char is widened to int, added to k as an int, and then
> converted to a double.
> 
> When we start vectorizing, overwidening detection runs:
> 
> note:   vect_recog_over_widening_pattern: detected: _6 = _5 + _29;
> note:   demoting int to unsigned short
> note:   Splitting statement: _5 = (int) _4;
> note:   into pattern statements: patt_32 = (unsigned short) _4;
> note:   and: patt_31 = (int) patt_32;
> note:   created pattern stmt: patt_28 = patt_32 + patt_30;
> note:   over_widening pattern recognized: patt_27 = (int) patt_28;
> note:   extra pattern stmt: patt_28 = patt_32 + patt_30;
> note:   vect_is_simple_use: operand (unsigned short) _4, type of def: internal
> 
> and it correctly demotes it from int to unsigned short, performs the operation
> as unsigned, and then sign extends that to int.
> 
> The final IL the vectorizer builds is this:
> 
> note:   node 0x473d790 (max_nunits=16, refcnt=2) vector(2) double
> note:   op template: *_8 = _9;
> note:   	stmt 0 *_8 = _9;
> note:   	children 0x473d828
> note:   node 0x473d828 (max_nunits=16, refcnt=2) vector(2) double
> note:   op template: _9 = (double) _6;
> note:   	stmt 0 _9 = (double) _6;
> note:   	children 0x473d8c0
> note:   node 0x473d8c0 (max_nunits=16, refcnt=2) vector(4) int
> note:   op template: patt_27 = (int) patt_28;
> note:   	stmt 0 patt_27 = (int) patt_28;
> note:   	children 0x473d958
> note:   node 0x473d958 (max_nunits=16, refcnt=2) vector(8) unsigned short
> note:   op template: patt_28 = .VEC_WIDEN_PLUS (_4, k_14(D));
> note:   	stmt 0 patt_28 = .VEC_WIDEN_PLUS (_4, k_14(D));
> note:   	children 0x473d9f0 0x473da88
> note:   node 0x473d9f0 (max_nunits=16, refcnt=2) vector(16) unsigned char
> note:   op template: _4 = *_3;
> note:   	stmt 0 _4 = *_3;
> note:   	load permutation { 0 }
> note:   node (external) 0x473da88 (max_nunits=1, refcnt=1)
> note:   	{ k_14(D) }
> 
> We already relax the cast to int into a zero extend today.  However the final promotion is
> the FLOAT_EXPR.  There it sees a widening from int to long.  It assumes that this has to be a
> sign extend because of the signed input; it doesn't know that the signed input was created
> by a zero extending operation.
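> 
> Concretely, as a hand-written sketch of the relevant statements (not
> actual dump output):
> 
>   patt_27 = (int) patt_28;  /* patt_28 is unsigned short, top bits zero  */
>   _9 = (double) patt_27;    /* FLOAT_EXPR; the implied int -> long
>                                widening is assumed to be a sign extend  */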

OK, so it's like I thought then.  The fix should be for
vectorizable_conversion to consider using a zero-extend for the
conversion to the intermediate long type.  I'm not sure how far
we can use stmt-vinfo min_output_precision for such analysis,
maybe Richard can answer this.  But there's the bad (because it's
wrongly implemented for SLP) example of using range info
in supportable_indirect_convert_operation for this already.
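
For illustration, a hand-written sketch of the two possible lowerings,
with made-up SSA names (not compiler output); _i is the int input:

  /* What is chosen today:  */
  _l = (long) _i;                /* sign extend  */
  _d = (double) _l;

  /* What could be chosen when _i is known to be non-negative:  */
  _u = (unsigned int) _i;
  _l = (long) _u;                /* zero extend  */
  _d = (double) _l;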

> What my pattern does is make this explicit in the tree.
> 
> My first attempt was to update the code that does:
> 
> /app/example.c:2:22: note:   demoting int to unsigned short
> /app/example.c:2:22: note:   Splitting statement: _5 = (int) _4;
> /app/example.c:2:22: note:   into pattern statements: patt_32 = (unsigned short) _4;
> /app/example.c:2:22: note:   and: patt_31 = (int) patt_32;
> 
> in vect_split_statement.  But even doing so, the problem is that it splits the range of the
> conversions.  And so while this fixed some cases, code that uses the result never knows that
> the top bits are zero.
> 
> So it worked for some cases but missed plenty of others.

I think the pattern would have to be for the (double) _5 conversion
and we do not want to expose (double) (long) (unsigned) _5 at that
point because some targets (x86!) _can_ do SImode to DFmode float
conversions and don't require an intermediate widening at all
(vectorizable_conversion knows this).

Richard.

> > So for (double)(int)x you produce (double)(int)(unsigned int)x?  It
> > feels like the above example misses a conversion?
> > 
> 
> No, (double)(int)x is basically (double)(long)(int)x
> 
> And the pattern creates
> 
> (double)(long)(unsigned long)x.  Basically it just makes the conversions explicit
> and extends the zero extend as wide as possible.
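> 
> I.e., as a hand-written sketch (made-up names, not actual output):
> 
>   _5 = (int) _4;                 /* _4 is unsigned char  */
>   _9 = (double) _5;              /* implicitly (double)(long)(int)_4  */
> 
> becomes
> 
>   patt_a = (unsigned long) _4;   /* one zero extend, as wide as possible  */
>   patt_b = (long) patt_a;        /* sign-matching cast, elided later  */
>   _9 = (double) patt_b;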
>
> > I don't think the pure integer type sequence happens - we do
> > (and should if not) already perform canonicalization of equivalent
> > conversion sequences, otherwise you'd see a missed CSE (I couldn't
> > produce such an example).
> 
> It does.  This is a pure integer sequence with the same problem
> 
> void test8(unsigned char *x, long *y, int n, unsigned char k) {
>     for(int i = 0; i < n; i++) {
>         y[i] = k + x[i];
>     }
> }
> 
> And there is other code that introduces this; for instance, there's a cleanup
> specifically after multiplication widening that also tries to split types, and
> it will do this as well.
> 
> > 
> > You say you found the "issue" to be exposed in several (or found in one,
> > suspected in several) existing patterns.  Can you elaborate and point
> > out where this happens?  I don't think it covers the int -> double
> > case, does it?
> 
> Not this case on its own.  You have to do *some* operation in between.
> 
> Thanks,
> Tamar
> 
> > 
> > > Thanks,
> > > Tamar
> > >
> > > gcc/ChangeLog:
> > >
> > > 	* tree-vect-patterns.cc (vect_recog_zero_extend_chain_pattern): New.
> > >
> > > gcc/testsuite/ChangeLog:
> > >
> > > 	* gcc.dg/vect/bb-slp-pattern-1.c: Update tests.
> > > 	* gcc.dg/vect/slp-widen-mult-half.c: Likewise.
> > > 	* gcc.dg/vect/vect-over-widen-10.c: Likewise.
> > > 	* gcc.dg/vect/vect-over-widen-12.c: Likewise.
> > > 	* gcc.dg/vect/vect-over-widen-14.c: Likewise.
> > > 	* gcc.dg/vect/vect-over-widen-16.c: Likewise.
> > > 	* gcc.dg/vect/vect-over-widen-6.c: Likewise.
> > > 	* gcc.dg/vect/vect-over-widen-8.c: Likewise.
> > > 	* gcc.dg/vect/vect-widen-mult-u16.c: Likewise.
> > > 	* gcc.dg/vect/vect-widen-mult-u8-s16-s32.c: Likewise.
> > > 	* lib/target-supports.exp
> > > 	(check_effective_target_vect_widen_mult_hi_to_si_pattern,
> > > 	check_effective_target_vect_widen_mult_si_to_di_pattern): Enable
> > > 	AArch64.
> > > 	* gcc.target/aarch64/vect-tbl-zero-extend_2.c: New test.
> > >
> > > ---
> > > diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-pattern-1.c
> > b/gcc/testsuite/gcc.dg/vect/bb-slp-pattern-1.c
> > > index
> > 5ae99225273ca5f915f60ecba3a5aaedebe46e96..627de78af4e48581575beda9
> > 7bf2a0708ac091cb 100644
> > > --- a/gcc/testsuite/gcc.dg/vect/bb-slp-pattern-1.c
> > > +++ b/gcc/testsuite/gcc.dg/vect/bb-slp-pattern-1.c
> > > @@ -52,4 +52,4 @@ int main (void)
> > >
> > >  /* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "slp2" {
> > target { vect_widen_mult_hi_to_si || vect_unpack } } } } */
> > >  /* { dg-final { scan-tree-dump-times "vect_recog_widen_mult_pattern:
> > detected" 8 "slp2" { target vect_widen_mult_hi_to_si_pattern } } } */
> > > -/* { dg-final { scan-tree-dump-times "pattern recognized" 8 "slp2" { target
> > vect_widen_mult_hi_to_si_pattern } } } */
> > > +/* { dg-final { scan-tree-dump-times "widen_mult pattern recognized" 8 "slp2" {
> > target vect_widen_mult_hi_to_si_pattern } } } */
> > > diff --git a/gcc/testsuite/gcc.dg/vect/slp-widen-mult-half.c
> > b/gcc/testsuite/gcc.dg/vect/slp-widen-mult-half.c
> > > index
> > b69ade338862cda4f44f5206d195eef1cb5e8d36..aecc085a51c93e0e7bed122df
> > 0a77a0a099ad6ef 100644
> > > --- a/gcc/testsuite/gcc.dg/vect/slp-widen-mult-half.c
> > > +++ b/gcc/testsuite/gcc.dg/vect/slp-widen-mult-half.c
> > > @@ -52,5 +52,5 @@ int main (void)
> > >  /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { target
> > vect_widen_mult_hi_to_si } } } */
> > >  /* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 2 "vect" {
> > target vect_widen_mult_hi_to_si } } } */
> > >  /* { dg-final { scan-tree-dump-times "vect_recog_widen_mult_pattern:
> > detected" 2 "vect" { target vect_widen_mult_hi_to_si_pattern } } } */
> > > -/* { dg-final { scan-tree-dump-times "pattern recognized" 2 "vect" { target
> > vect_widen_mult_hi_to_si_pattern } } } */
> > > +/* { dg-final { scan-tree-dump-times "pattern recognized" 4 "vect" { target
> > vect_widen_mult_hi_to_si_pattern } } } */
> > >
> > > diff --git a/gcc/testsuite/gcc.dg/vect/vect-over-widen-10.c
> > b/gcc/testsuite/gcc.dg/vect/vect-over-widen-10.c
> > > index
> > f0140e4ef6d70cd61aa7dbb3ba39b1da142a79b2..bd798fae7e8136975d48820
> > 6cfef9e39fac2bfea 100644
> > > --- a/gcc/testsuite/gcc.dg/vect/vect-over-widen-10.c
> > > +++ b/gcc/testsuite/gcc.dg/vect/vect-over-widen-10.c
> > > @@ -11,7 +11,7 @@
> > >
> > >  #include "vect-over-widen-9.c"
> > >
> > > -/* { dg-final { scan-tree-dump {Splitting statement} "vect" } } */
> > > +/* { dg-final { scan-tree-dump {Splitting pattern statement} "vect" } } */
> > >  /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern:
> > detected:[^\n]* \+ } "vect" } } */
> > >  /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern:
> > detected:[^\n]* >> 1} "vect" } } */
> > >  /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern:
> > detected:[^\n]* >> 2} "vect" } } */
> > > diff --git a/gcc/testsuite/gcc.dg/vect/vect-over-widen-12.c
> > b/gcc/testsuite/gcc.dg/vect/vect-over-widen-12.c
> > > index
> > ddb3bd8c0d378f0138c8cc7f9c6ea3300744b8a8..8c0544e35c29de60e76759f4
> > ed13206278c72925 100644
> > > --- a/gcc/testsuite/gcc.dg/vect/vect-over-widen-12.c
> > > +++ b/gcc/testsuite/gcc.dg/vect/vect-over-widen-12.c
> > > @@ -11,7 +11,7 @@
> > >
> > >  #include "vect-over-widen-11.c"
> > >
> > > -/* { dg-final { scan-tree-dump {Splitting statement} "vect" } } */
> > > +/* { dg-final { scan-tree-dump {Splitting pattern statement} "vect" } } */
> > >  /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern:
> > detected:[^\n]* \+ } "vect" } } */
> > >  /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern:
> > detected:[^\n]* >> 1} "vect" } } */
> > >  /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern:
> > detected:[^\n]* >> 2} "vect" } } */
> > > diff --git a/gcc/testsuite/gcc.dg/vect/vect-over-widen-14.c
> > b/gcc/testsuite/gcc.dg/vect/vect-over-widen-14.c
> > > index
> > dfa09f5d2cafe329e6d57b5cc681786cc2c7d215..1fe0305c1c4f61d05864ef9778
> > 9726a1dc6ec8b1 100644
> > > --- a/gcc/testsuite/gcc.dg/vect/vect-over-widen-14.c
> > > +++ b/gcc/testsuite/gcc.dg/vect/vect-over-widen-14.c
> > > @@ -11,7 +11,7 @@
> > >
> > >  #include "vect-over-widen-13.c"
> > >
> > > -/* { dg-final { scan-tree-dump {Splitting statement} "vect" } } */
> > > +/* { dg-final { scan-tree-dump {Splitting pattern statement} "vect" } } */
> > >  /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern:
> > detected:[^\n]* \+} "vect" } } */
> > >  /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern:
> > detected:[^\n]* >> 1} "vect" } } */
> > >  /* { dg-final { scan-tree-dump {vect_recog_cast_forwprop_pattern:
> > detected:[^\n]* = \(unsigned char\)} "vect" } } */
> > > diff --git a/gcc/testsuite/gcc.dg/vect/vect-over-widen-16.c
> > b/gcc/testsuite/gcc.dg/vect/vect-over-widen-16.c
> > > index
> > 4584c586da1e6f13e8c8de4c1291cea0141ebab5..4ecdadf7a035a4f83b1767a06
> > 3a1b0f47bdd543d 100644
> > > --- a/gcc/testsuite/gcc.dg/vect/vect-over-widen-16.c
> > > +++ b/gcc/testsuite/gcc.dg/vect/vect-over-widen-16.c
> > > @@ -11,7 +11,7 @@
> > >
> > >  #include "vect-over-widen-15.c"
> > >
> > > -/* { dg-final { scan-tree-dump {Splitting statement} "vect" } } */
> > > +/* { dg-final { scan-tree-dump {Splitting pattern statement} "vect" } } */
> > >  /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern:
> > detected:[^\n]* \+} "vect" } } */
> > >  /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern:
> > detected:[^\n]* >> 1} "vect" } } */
> > >  /* { dg-final { scan-tree-dump-not {vect_recog_cast_forwprop_pattern:
> > detected} "vect" } } */
> > > diff --git a/gcc/testsuite/gcc.dg/vect/vect-over-widen-6.c
> > b/gcc/testsuite/gcc.dg/vect/vect-over-widen-6.c
> > > index
> > bda92c965e080dd3f48ec42b6bea16e79d9416cd..6b8c3dfa2c89ce04d7673607
> > ef2d2f14a14eb32f 100644
> > > --- a/gcc/testsuite/gcc.dg/vect/vect-over-widen-6.c
> > > +++ b/gcc/testsuite/gcc.dg/vect/vect-over-widen-6.c
> > > @@ -9,7 +9,7 @@
> > >
> > >  #include "vect-over-widen-5.c"
> > >
> > > -/* { dg-final { scan-tree-dump {Splitting statement} "vect" } } */
> > > +/* { dg-final { scan-tree-dump {Splitting pattern statement} "vect" } } */
> > >  /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern:
> > detected:[^\n]* \+ } "vect" } } */
> > >  /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern:
> > detected:[^\n]* >> 1} "vect" } } */
> > >  /* { dg-final { scan-tree-dump {vect_recog_cast_forwprop_pattern:
> > detected:[^\n]* \(unsigned char\)} "vect" } } */
> > > diff --git a/gcc/testsuite/gcc.dg/vect/vect-over-widen-8.c
> > b/gcc/testsuite/gcc.dg/vect/vect-over-widen-8.c
> > > index
> > 553c0712a79a1d19195dbdab7cbd6fa330685bea..1cf725ff4b7f151097192db1
> > a0b65173c4c83b19 100644
> > > --- a/gcc/testsuite/gcc.dg/vect/vect-over-widen-8.c
> > > +++ b/gcc/testsuite/gcc.dg/vect/vect-over-widen-8.c
> > > @@ -12,7 +12,7 @@
> > >
> > >  #include "vect-over-widen-7.c"
> > >
> > > -/* { dg-final { scan-tree-dump {Splitting statement} "vect" } } */
> > > +/* { dg-final { scan-tree-dump {Splitting pattern statement} "vect" } } */
> > >  /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern:
> > detected:[^\n]* \+ } "vect" } } */
> > >  /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern:
> > detected:[^\n]* >> 2} "vect" } } */
> > >  /* { dg-final { scan-tree-dump {vect_recog_cast_forwprop_pattern:
> > detected:[^\n]* \(unsigned char\)} "vect" } } */
> > > diff --git a/gcc/testsuite/gcc.dg/vect/vect-widen-mult-u16.c
> > b/gcc/testsuite/gcc.dg/vect/vect-widen-mult-u16.c
> > > index
> > 258d253f401459d448d1ae86f56b0c97815d5b61..b5018f855a72534b4d64d2
> > dc2b7ab2ac0deb674b 100644
> > > --- a/gcc/testsuite/gcc.dg/vect/vect-widen-mult-u16.c
> > > +++ b/gcc/testsuite/gcc.dg/vect/vect-widen-mult-u16.c
> > > @@ -47,5 +47,5 @@ int main (void)
> > >
> > >  /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { target {
> > vect_widen_mult_hi_to_si || vect_unpack } } } } */
> > >  /* { dg-final { scan-tree-dump-times "vect_recog_widen_mult_pattern:
> > detected" 1 "vect" { target vect_widen_mult_hi_to_si_pattern } } } */
> > > -/* { dg-final { scan-tree-dump-times "pattern recognized" 1 "vect" { target
> > vect_widen_mult_hi_to_si_pattern } } } */
> > > +/* { dg-final { scan-tree-dump-times "widen_mult pattern recognized" 1 "vect" {
> > target vect_widen_mult_hi_to_si_pattern } } } */
> > >
> > > diff --git a/gcc/testsuite/gcc.dg/vect/vect-widen-mult-u8-s16-s32.c
> > b/gcc/testsuite/gcc.dg/vect/vect-widen-mult-u8-s16-s32.c
> > > index
> > 3baafca7b548124ae5c48fdf3c2f07c319155967..ab523ca77652e1f1533889fda9
> > c0eb31c987ffe9 100644
> > > --- a/gcc/testsuite/gcc.dg/vect/vect-widen-mult-u8-s16-s32.c
> > > +++ b/gcc/testsuite/gcc.dg/vect/vect-widen-mult-u8-s16-s32.c
> > > @@ -47,5 +47,5 @@ int main (void)
> > >
> > >  /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { target {
> > vect_widen_mult_hi_to_si || vect_unpack } } } } */
> > >  /* { dg-final { scan-tree-dump-times "vect_recog_widen_mult_pattern:
> > detected" 1 "vect" { target vect_widen_mult_hi_to_si_pattern } } } */
> > > -/* { dg-final { scan-tree-dump-times "pattern recognized" 1 "vect" { target
> > vect_widen_mult_hi_to_si_pattern } } } */
> > > +/* { dg-final { scan-tree-dump-times "widen_mult pattern recognized" 1 "vect" {
> > target vect_widen_mult_hi_to_si_pattern } } } */
> > >
> > > diff --git a/gcc/testsuite/gcc.target/aarch64/vect-tbl-zero-extend_2.c
> > b/gcc/testsuite/gcc.target/aarch64/vect-tbl-zero-extend_2.c
> > > new file mode 100644
> > > index
> > 0000000000000000000000000000000000000000..1577eacd9dbbb52274d9f
> > 86c77406555b7726482
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.target/aarch64/vect-tbl-zero-extend_2.c
> > > @@ -0,0 +1,33 @@
> > > +/* { dg-do compile } */
> > > +/* { dg-additional-options "-O3 -std=c99 -march=armv8-a" } */
> > > +
> > > +void test6(unsigned char *x, double *y, int n) {
> > > +    for(int i = 0; i < (n & -8); i++) {
> > > +        y[i] += x[i];
> > > +    }
> > > +}
> > > +
> > > +void test7(unsigned char *x, double *y, int n, unsigned char k) {
> > > +    for(int i = 0; i < (n & -8); i++) {
> > > +        y[i] += k * x[i];
> > > +    }
> > > +}
> > > +
> > > +void test8(unsigned char *x, double *y, int n, unsigned char k) {
> > > +    for(int i = 0; i < (n & -8); i++) {
> > > +        y[i] = k + x[i];
> > > +    }
> > > +}
> > > +
> > > +void test9(unsigned char *x, long long *y, int n, unsigned char k) {
> > > +    for(int i = 0; i < (n & -8); i++) {
> > > +        y[i] = k + x[i];
> > > +    }
> > > +}
> > > +
> > > +/* { dg-final { scan-assembler-times {\tuxtl} 1 } } */
> > > +/* { dg-final { scan-assembler-not {\tuxtl2} } } */
> > > +/* { dg-final { scan-assembler-not {\tzip1} } } */
> > > +/* { dg-final { scan-assembler-not {\tzip2} } } */
> > > +/* { dg-final { scan-assembler-times {\ttbl} 44 } } */
> > > +/* { dg-final { scan-assembler-times {\.LC[0-9]+:} 12 } } */
> > > diff --git a/gcc/testsuite/lib/target-supports.exp b/gcc/testsuite/lib/target-
> > supports.exp
> > > index
> > d113a08dff7b2a8ab5bdfe24386d271bff255afc..feae1b8fcf8cd7ab56a8c76c0cd3
> > 034c0a828724 100644
> > > --- a/gcc/testsuite/lib/target-supports.exp
> > > +++ b/gcc/testsuite/lib/target-supports.exp
> > > @@ -8240,6 +8240,7 @@ proc
> > check_effective_target_vect_widen_mult_hi_to_si_pattern { } {
> > >      return [check_cached_effective_target_indexed
> > vect_widen_mult_hi_to_si_pattern {
> > >        expr { [istarget powerpc*-*-*]
> > >  	     || [istarget ia64-*-*]
> > > +	     || [istarget aarch64*-*-*]
> > >  	     || [istarget loongarch*-*-*]
> > >  	     || [istarget i?86-*-*] || [istarget x86_64-*-*]
> > >  	     || ([is-effective-target arm_neon]
> > > @@ -8259,6 +8260,7 @@ proc
> > check_effective_target_vect_widen_mult_si_to_di_pattern { } {
> > >        expr { [istarget ia64-*-*]
> > >  	     || [istarget i?86-*-*] || [istarget x86_64-*-*]
> > >  	     || [istarget loongarch*-*-*]
> > > +	     || [istarget aarch64*-*-*]
> > >  	     || ([istarget s390*-*-*]
> > >  		 && [check_effective_target_s390_vx]) }}]
> > >  }
> > > diff --git a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc
> > > index
> > 9bf8526ac995c6c2678b25f5df4316aec41333e0..74c7269a3ab15cba1ee2ef055
> > 6d25afda851f7f0 100644
> > > --- a/gcc/tree-vect-patterns.cc
> > > +++ b/gcc/tree-vect-patterns.cc
> > > @@ -5524,6 +5524,122 @@ vect_recog_mixed_size_cond_pattern (vec_info
> > *vinfo,
> > >    return pattern_stmt;
> > >  }
> > >
> > > +/* Function vect_recog_zero_extend_chain_pattern
> > > +
> > > +   Try to find the following pattern:
> > > +
> > > +     type x_t;
> > > +     TYPE a_T, b_T, c_T;
> > 
> > But a_T, b_T and c_T are different types - what types?
> > 
> > > +   loop:
> > > +     S1  a_T = (b_T)(c_T)x_t;
> > > +
> > > +   where type 'TYPE' is an integral type which has a different size
> > > +   from 'type' and c_T is a zero extend or a sign extend on a value whose top
> > > +   bit is known to be zero. a_T can be signed or unsigned.
> > > +
> > > +   Input:
> > > +
> > > +   * STMT_VINFO: The stmt from which the pattern search begins.
> > > +
> > > +   Output:
> > > +
> > > +   * TYPE_OUT: The type of the output of this pattern.
> > > +
> > > +   * Return value: A new stmt that will be used to replace the pattern.
> > > +	This replaces multiple chained extensions with the longest possible
> > > +	chain of zero extends and a final convert to the required sign.
> > > +
> > > +	S1  a_T = (a_T)(unsigned a_T)x_t;  */
> > 
> > So for (double)(int)x you produce (double)(int)(unsigned int)x?  It
> > feels like the above example misses a conversion?
> > 
> > I don't think the pure integer type sequence happens - we do
> > (and should if not) already perform canonicalization of equivalent
> > conversion sequences, otherwise you'd see a missed CSE (I couldn't
> > produce such an example).
> > 
> > You say you found the "issue" to be exposed in several (or found in one,
> > suspected in several) existing patterns.  Can you elaborate and point
> > out where this happens?  I don't think it covers the int -> double
> > case, does it?
> > 
> > Thanks,
> > Richard.
> > 
> > > +
> > > +static gimple *
> > > +vect_recog_zero_extend_chain_pattern (vec_info *vinfo,
> > > +				      stmt_vec_info stmt_vinfo, tree *type_out)
> > > +{
> > > +  gimple *last_stmt = STMT_VINFO_STMT (vect_stmt_to_vectorize (stmt_vinfo));
> > > +
> > > +  if (!is_gimple_assign (last_stmt))
> > > +    return NULL;
> > > +
> > > +  tree_code code = gimple_assign_rhs_code (last_stmt);
> > > +  tree lhs = gimple_assign_lhs (last_stmt);
> > > +  tree rhs = gimple_assign_rhs1 (last_stmt);
> > > +  tree lhs_type = TREE_TYPE (lhs);
> > > +  tree rhs_type = TREE_TYPE (rhs);
> > > +
> > > +  if ((code != FLOAT_EXPR && code != NOP_EXPR)
> > > +      || TYPE_UNSIGNED (lhs_type)
> > > +      || TREE_CODE (rhs_type) != INTEGER_TYPE
> > > +      || TREE_CODE (rhs) != SSA_NAME
> > > +      || STMT_VINFO_DEF_TYPE (stmt_vinfo) != vect_internal_def)
> > > +    return NULL;
> > > +
> > > +  /* Check to see if it's safe to extend the zero extend to the new type.
> > > +     In general this is safe if the rhs1 type is unsigned or if we know that
> > > +     the top bits are zero; this can happen due to all the widening operations
> > > +     we have.  For instance a widening addition will have top bits zero.  */
> > > +  if (!TYPE_UNSIGNED (rhs_type))
> > > +    {
> > > +      wide_int wcst = get_nonzero_bits (rhs);
> > > +      if (wi::neg_p (wcst) || wi::clz (wcst) == 0)
> > > +	return NULL;
> > > +    }
> > > +
> > > +  tree cvt_type = unsigned_type_for (lhs_type);
> > > +
> > > +  tree cvt_vectype = get_vectype_for_scalar_type (vinfo, cvt_type);
> > > +  if (!cvt_vectype || !VECTOR_TYPE_P (cvt_vectype))
> > > +    return NULL;
> > > +
> > > +  tree out_vectype = get_vectype_for_scalar_type (vinfo, lhs_type);
> > > +  if (!out_vectype || !VECTOR_TYPE_P (out_vectype))
> > > +    return NULL;
> > > +
> > > +  stmt_vec_info irhs;
> > > +
> > > +  gimple_ranger ranger;
> > > +
> > > +  /* Dig through any existing conversions to see if we can extend the zero
> > > +     extend chain across multiple converts.  */
> > > +  while ((irhs = vect_get_internal_def (vinfo, rhs)))
> > > +    {
> > > +      gimple *g_irhs = STMT_VINFO_STMT (irhs);
> > > +      if (!is_gimple_assign (g_irhs)
> > > +	  || gimple_assign_rhs_code (g_irhs) != NOP_EXPR)
> > > +	break;
> > > +
> > > +      /* See if we can consume the next conversion as well.  To do this it's
> > > +	 best to use Ranger as it can see through the intermediate IL that the
> > > +	 vectorizer creates throughout pattern matching.  */
> > > +      int_range_max r;
> > > +      ranger.range_of_stmt (r, g_irhs);
> > > +      wide_int nz = r.get_nonzero_bits ();
> > > +      if (wi::neg_p (nz) || wi::clz (nz) == 0)
> > > +	break;
> > > +
> > > +      rhs = gimple_assign_rhs1 (g_irhs);
> > > +    }
> > > +
> > > +  /* If the result is a no-op, or we've jumped over a truncate of sorts, or if
> > > +     nothing would change materially, just leave it alone.  */
> > > +  if (TYPE_PRECISION (lhs_type) <= TYPE_PRECISION (TREE_TYPE (rhs))
> > > +      || (code == FLOAT_EXPR && rhs == gimple_assign_rhs1 (last_stmt)))
> > > +    return NULL;
> > > +
> > > +  vect_pattern_detected ("vect_recog_zero_extend_chain_pattern", last_stmt);
> > > +
> > > +  tree cast_var = vect_recog_temp_ssa_var (cvt_type, NULL);
> > > +  gimple *pattern_stmt = NULL;
> > > +  pattern_stmt = gimple_build_assign (cast_var, NOP_EXPR, rhs);
> > > +  append_pattern_def_seq (vinfo, stmt_vinfo, pattern_stmt, cvt_vectype);
> > > +
> > > +  tree cvt_var = vect_recog_temp_ssa_var (lhs_type, NULL);
> > > +  pattern_stmt = gimple_build_assign (cvt_var, code, cast_var);
> > > +
> > > +  *type_out = out_vectype;
> > > +
> > > +  return pattern_stmt;
> > > +}
> > > +
> > >
> > >  /* Helper function of vect_recog_bool_pattern.  Called recursively, return
> > >     true if bool VAR can and should be optimized that way.  Assume it shouldn't
> > > @@ -7509,6 +7625,7 @@ static vect_recog_func vect_vect_recog_func_ptrs[] = {
> > >    { vect_recog_widen_minus_pattern, "widen_minus" },
> > >    { vect_recog_widen_abd_pattern, "widen_abd" },
> > >    /* These must come after the double widening ones.  */
> > > +  { vect_recog_zero_extend_chain_pattern, "zero_extend_chain" },
> > >  };
> > >
> > >  /* Mark statements that are involved in a pattern.  */
> > >
> > >
> > >
> > >
> > >
> > 
> > --
> > Richard Biener <rguenther@suse.de>
> > SUSE Software Solutions Germany GmbH,
> > Frankenstrasse 146, 90461 Nuernberg, Germany;
> > GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
>
diff mbox series

Patch

diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-pattern-1.c b/gcc/testsuite/gcc.dg/vect/bb-slp-pattern-1.c
index 5ae99225273ca5f915f60ecba3a5aaedebe46e96..627de78af4e48581575beda97bf2a0708ac091cb 100644
--- a/gcc/testsuite/gcc.dg/vect/bb-slp-pattern-1.c
+++ b/gcc/testsuite/gcc.dg/vect/bb-slp-pattern-1.c
@@ -52,4 +52,4 @@  int main (void)
 
 /* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "slp2" { target { vect_widen_mult_hi_to_si || vect_unpack } } } } */
 /* { dg-final { scan-tree-dump-times "vect_recog_widen_mult_pattern: detected" 8 "slp2" { target vect_widen_mult_hi_to_si_pattern } } } */
-/* { dg-final { scan-tree-dump-times "pattern recognized" 8 "slp2" { target vect_widen_mult_hi_to_si_pattern } } } */
+/* { dg-final { scan-tree-dump-times "widen_mult pattern recognized" 8 "slp2" { target vect_widen_mult_hi_to_si_pattern } } } */
diff --git a/gcc/testsuite/gcc.dg/vect/slp-widen-mult-half.c b/gcc/testsuite/gcc.dg/vect/slp-widen-mult-half.c
index b69ade338862cda4f44f5206d195eef1cb5e8d36..aecc085a51c93e0e7bed122df0a77a0a099ad6ef 100644
--- a/gcc/testsuite/gcc.dg/vect/slp-widen-mult-half.c
+++ b/gcc/testsuite/gcc.dg/vect/slp-widen-mult-half.c
@@ -52,5 +52,5 @@  int main (void)
 /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { target vect_widen_mult_hi_to_si } } } */
 /* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 2 "vect" { target vect_widen_mult_hi_to_si } } } */
 /* { dg-final { scan-tree-dump-times "vect_recog_widen_mult_pattern: detected" 2 "vect" { target vect_widen_mult_hi_to_si_pattern } } } */
-/* { dg-final { scan-tree-dump-times "pattern recognized" 2 "vect" { target vect_widen_mult_hi_to_si_pattern } } } */
+/* { dg-final { scan-tree-dump-times "pattern recognized" 4 "vect" { target vect_widen_mult_hi_to_si_pattern } } } */
 
diff --git a/gcc/testsuite/gcc.dg/vect/vect-over-widen-10.c b/gcc/testsuite/gcc.dg/vect/vect-over-widen-10.c
index f0140e4ef6d70cd61aa7dbb3ba39b1da142a79b2..bd798fae7e8136975d488206cfef9e39fac2bfea 100644
--- a/gcc/testsuite/gcc.dg/vect/vect-over-widen-10.c
+++ b/gcc/testsuite/gcc.dg/vect/vect-over-widen-10.c
@@ -11,7 +11,7 @@ 
 
 #include "vect-over-widen-9.c"
 
-/* { dg-final { scan-tree-dump {Splitting statement} "vect" } } */
+/* { dg-final { scan-tree-dump {Splitting pattern statement} "vect" } } */
 /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern: detected:[^\n]* \+ } "vect" } } */
 /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern: detected:[^\n]* >> 1} "vect" } } */
 /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern: detected:[^\n]* >> 2} "vect" } } */
diff --git a/gcc/testsuite/gcc.dg/vect/vect-over-widen-12.c b/gcc/testsuite/gcc.dg/vect/vect-over-widen-12.c
index ddb3bd8c0d378f0138c8cc7f9c6ea3300744b8a8..8c0544e35c29de60e76759f4ed13206278c72925 100644
--- a/gcc/testsuite/gcc.dg/vect/vect-over-widen-12.c
+++ b/gcc/testsuite/gcc.dg/vect/vect-over-widen-12.c
@@ -11,7 +11,7 @@ 
 
 #include "vect-over-widen-11.c"
 
-/* { dg-final { scan-tree-dump {Splitting statement} "vect" } } */
+/* { dg-final { scan-tree-dump {Splitting pattern statement} "vect" } } */
 /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern: detected:[^\n]* \+ } "vect" } } */
 /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern: detected:[^\n]* >> 1} "vect" } } */
 /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern: detected:[^\n]* >> 2} "vect" } } */
diff --git a/gcc/testsuite/gcc.dg/vect/vect-over-widen-14.c b/gcc/testsuite/gcc.dg/vect/vect-over-widen-14.c
index dfa09f5d2cafe329e6d57b5cc681786cc2c7d215..1fe0305c1c4f61d05864ef97789726a1dc6ec8b1 100644
--- a/gcc/testsuite/gcc.dg/vect/vect-over-widen-14.c
+++ b/gcc/testsuite/gcc.dg/vect/vect-over-widen-14.c
@@ -11,7 +11,7 @@ 
 
 #include "vect-over-widen-13.c"
 
-/* { dg-final { scan-tree-dump {Splitting statement} "vect" } } */
+/* { dg-final { scan-tree-dump {Splitting pattern statement} "vect" } } */
 /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern: detected:[^\n]* \+} "vect" } } */
 /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern: detected:[^\n]* >> 1} "vect" } } */
 /* { dg-final { scan-tree-dump {vect_recog_cast_forwprop_pattern: detected:[^\n]* = \(unsigned char\)} "vect" } } */
diff --git a/gcc/testsuite/gcc.dg/vect/vect-over-widen-16.c b/gcc/testsuite/gcc.dg/vect/vect-over-widen-16.c
index 4584c586da1e6f13e8c8de4c1291cea0141ebab5..4ecdadf7a035a4f83b1767a063a1b0f47bdd543d 100644
--- a/gcc/testsuite/gcc.dg/vect/vect-over-widen-16.c
+++ b/gcc/testsuite/gcc.dg/vect/vect-over-widen-16.c
@@ -11,7 +11,7 @@ 
 
 #include "vect-over-widen-15.c"
 
-/* { dg-final { scan-tree-dump {Splitting statement} "vect" } } */
+/* { dg-final { scan-tree-dump {Splitting pattern statement} "vect" } } */
 /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern: detected:[^\n]* \+} "vect" } } */
 /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern: detected:[^\n]* >> 1} "vect" } } */
 /* { dg-final { scan-tree-dump-not {vect_recog_cast_forwprop_pattern: detected} "vect" } } */
diff --git a/gcc/testsuite/gcc.dg/vect/vect-over-widen-6.c b/gcc/testsuite/gcc.dg/vect/vect-over-widen-6.c
index bda92c965e080dd3f48ec42b6bea16e79d9416cd..6b8c3dfa2c89ce04d7673607ef2d2f14a14eb32f 100644
--- a/gcc/testsuite/gcc.dg/vect/vect-over-widen-6.c
+++ b/gcc/testsuite/gcc.dg/vect/vect-over-widen-6.c
@@ -9,7 +9,7 @@ 
 
 #include "vect-over-widen-5.c"
 
-/* { dg-final { scan-tree-dump {Splitting statement} "vect" } } */
+/* { dg-final { scan-tree-dump {Splitting pattern statement} "vect" } } */
 /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern: detected:[^\n]* \+ } "vect" } } */
 /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern: detected:[^\n]* >> 1} "vect" } } */
 /* { dg-final { scan-tree-dump {vect_recog_cast_forwprop_pattern: detected:[^\n]* \(unsigned char\)} "vect" } } */
diff --git a/gcc/testsuite/gcc.dg/vect/vect-over-widen-8.c b/gcc/testsuite/gcc.dg/vect/vect-over-widen-8.c
index 553c0712a79a1d19195dbdab7cbd6fa330685bea..1cf725ff4b7f151097192db1a0b65173c4c83b19 100644
--- a/gcc/testsuite/gcc.dg/vect/vect-over-widen-8.c
+++ b/gcc/testsuite/gcc.dg/vect/vect-over-widen-8.c
@@ -12,7 +12,7 @@ 
 
 #include "vect-over-widen-7.c"
 
-/* { dg-final { scan-tree-dump {Splitting statement} "vect" } } */
+/* { dg-final { scan-tree-dump {Splitting pattern statement} "vect" } } */
 /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern: detected:[^\n]* \+ } "vect" } } */
 /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern: detected:[^\n]* >> 2} "vect" } } */
 /* { dg-final { scan-tree-dump {vect_recog_cast_forwprop_pattern: detected:[^\n]* \(unsigned char\)} "vect" } } */
diff --git a/gcc/testsuite/gcc.dg/vect/vect-widen-mult-u16.c b/gcc/testsuite/gcc.dg/vect/vect-widen-mult-u16.c
index 258d253f401459d448d1ae86f56b0c97815d5b61..b5018f855a72534b4d64d2dc2b7ab2ac0deb674b 100644
--- a/gcc/testsuite/gcc.dg/vect/vect-widen-mult-u16.c
+++ b/gcc/testsuite/gcc.dg/vect/vect-widen-mult-u16.c
@@ -47,5 +47,5 @@  int main (void)
 
 /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { target { vect_widen_mult_hi_to_si || vect_unpack } } } } */
 /* { dg-final { scan-tree-dump-times "vect_recog_widen_mult_pattern: detected" 1 "vect" { target vect_widen_mult_hi_to_si_pattern } } } */
-/* { dg-final { scan-tree-dump-times "pattern recognized" 1 "vect" { target vect_widen_mult_hi_to_si_pattern } } } */
+/* { dg-final { scan-tree-dump-times "widen_mult pattern recognized" 1 "vect" { target vect_widen_mult_hi_to_si_pattern } } } */
 
diff --git a/gcc/testsuite/gcc.dg/vect/vect-widen-mult-u8-s16-s32.c b/gcc/testsuite/gcc.dg/vect/vect-widen-mult-u8-s16-s32.c
index 3baafca7b548124ae5c48fdf3c2f07c319155967..ab523ca77652e1f1533889fda9c0eb31c987ffe9 100644
--- a/gcc/testsuite/gcc.dg/vect/vect-widen-mult-u8-s16-s32.c
+++ b/gcc/testsuite/gcc.dg/vect/vect-widen-mult-u8-s16-s32.c
@@ -47,5 +47,5 @@  int main (void)
 
 /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { target { vect_widen_mult_hi_to_si || vect_unpack } } } } */
 /* { dg-final { scan-tree-dump-times "vect_recog_widen_mult_pattern: detected" 1 "vect" { target vect_widen_mult_hi_to_si_pattern } } } */
-/* { dg-final { scan-tree-dump-times "pattern recognized" 1 "vect" { target vect_widen_mult_hi_to_si_pattern } } } */
+/* { dg-final { scan-tree-dump-times "widen_mult pattern recognized" 1 "vect" { target vect_widen_mult_hi_to_si_pattern } } } */
 
diff --git a/gcc/testsuite/gcc.target/aarch64/vect-tbl-zero-extend_2.c b/gcc/testsuite/gcc.target/aarch64/vect-tbl-zero-extend_2.c
new file mode 100644
index 0000000000000000000000000000000000000000..1577eacd9dbbb52274d9f86c77406555b7726482
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/vect-tbl-zero-extend_2.c
@@ -0,0 +1,33 @@ 
+/* { dg-do compile } */
+/* { dg-additional-options "-O3 -std=c99 -march=armv8-a" } */
+
+void test6(unsigned char *x, double *y, int n) {
+    for(int i = 0; i < (n & -8); i++) {
+        y[i] += x[i];
+    }
+}
+
+void test7(unsigned char *x, double *y, int n, unsigned char k) {
+    for(int i = 0; i < (n & -8); i++) {
+        y[i] += k * x[i];
+    }
+}
+
+void test8(unsigned char *x, double *y, int n, unsigned char k) {
+    for(int i = 0; i < (n & -8); i++) {
+        y[i] = k + x[i];
+    }
+}
+
+void test9(unsigned char *x, long long *y, int n, unsigned char k) {
+    for(int i = 0; i < (n & -8); i++) {
+        y[i] = k + x[i];
+    }
+}
+
+/* { dg-final { scan-assembler-times {\tuxtl} 1 } } */
+/* { dg-final { scan-assembler-not {\tuxtl2} } } */
+/* { dg-final { scan-assembler-not {\tzip1} } } */
+/* { dg-final { scan-assembler-not {\tzip2} } } */
+/* { dg-final { scan-assembler-times {\ttbl} 44 } } */
+/* { dg-final { scan-assembler-times {\.LC[0-9]+:} 12 } } */
diff --git a/gcc/testsuite/lib/target-supports.exp b/gcc/testsuite/lib/target-supports.exp
index d113a08dff7b2a8ab5bdfe24386d271bff255afc..feae1b8fcf8cd7ab56a8c76c0cd3034c0a828724 100644
--- a/gcc/testsuite/lib/target-supports.exp
+++ b/gcc/testsuite/lib/target-supports.exp
@@ -8240,6 +8240,7 @@  proc check_effective_target_vect_widen_mult_hi_to_si_pattern { } {
     return [check_cached_effective_target_indexed vect_widen_mult_hi_to_si_pattern {
       expr { [istarget powerpc*-*-*]
 	     || [istarget ia64-*-*]
+	     || [istarget aarch64*-*-*]
 	     || [istarget loongarch*-*-*]
 	     || [istarget i?86-*-*] || [istarget x86_64-*-*]
 	     || ([is-effective-target arm_neon]
@@ -8259,6 +8260,7 @@  proc check_effective_target_vect_widen_mult_si_to_di_pattern { } {
       expr { [istarget ia64-*-*]
 	     || [istarget i?86-*-*] || [istarget x86_64-*-*]
 	     || [istarget loongarch*-*-*]
+	     || [istarget aarch64*-*-*]
 	     || ([istarget s390*-*-*]
 		 && [check_effective_target_s390_vx]) }}]
 }
diff --git a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc
index 9bf8526ac995c6c2678b25f5df4316aec41333e0..74c7269a3ab15cba1ee2ef0556d25afda851f7f0 100644
--- a/gcc/tree-vect-patterns.cc
+++ b/gcc/tree-vect-patterns.cc
@@ -5524,6 +5524,122 @@  vect_recog_mixed_size_cond_pattern (vec_info *vinfo,
   return pattern_stmt;
 }
 
+/* Function vect_recog_zero_extend_chain_pattern
+
+   Try to find the following pattern:
+
+     type x_t;
+     TYPE a_T, b_T, c_T;
+   loop:
+     S1  a_T = (b_T)(c_T)x_t;
+
+   where type 'TYPE' is an integral type which has a different size
+   from 'type' and c_T is a zero extend or a sign extend on a value whose top
+   bit is known to be zero. a_T can be signed or unsigned.
+
+   Input:
+
+   * STMT_VINFO: The stmt from which the pattern search begins.
+
+   Output:
+
+   * TYPE_OUT: The type of the output of this pattern.
+
+   * Return value: A new stmt that will be used to replace the pattern.
+	This replaces multiple chained extensions with the longest possible
+	chain of zero extends and a final convert to the required sign.
+	
+	S1  a_T = (a_T)(unsigned a_T)x_t;  */
+
+static gimple *
+vect_recog_zero_extend_chain_pattern (vec_info *vinfo,
+				      stmt_vec_info stmt_vinfo, tree *type_out)
+{
+  gimple *last_stmt = STMT_VINFO_STMT (vect_stmt_to_vectorize (stmt_vinfo));
+
+  if (!is_gimple_assign (last_stmt))
+    return NULL;
+
+  tree_code code = gimple_assign_rhs_code (last_stmt);
+  tree lhs = gimple_assign_lhs (last_stmt);
+  tree rhs = gimple_assign_rhs1 (last_stmt);
+  tree lhs_type = TREE_TYPE (lhs);
+  tree rhs_type = TREE_TYPE (rhs);
+
+  if ((code != FLOAT_EXPR && code != NOP_EXPR)
+      || TYPE_UNSIGNED (lhs_type)
+      || TREE_CODE (rhs_type) != INTEGER_TYPE
+      || TREE_CODE (rhs) != SSA_NAME
+      || STMT_VINFO_DEF_TYPE (stmt_vinfo) != vect_internal_def)
+    return NULL;
+
+  /* Check to see if it's safe to extend the zero extend to the new type.
+     In general this is safe if the rhs1 type is unsigned or if we know that
+     the top bits are zero; this can happen due to all the widening operations
+     we have.  For instance a widening addition will have top bits zero.  */
+  if (!TYPE_UNSIGNED (rhs_type))
+    {
+      wide_int wcst = get_nonzero_bits (rhs);
+      if (wi::neg_p (wcst) || wi::clz (wcst) == 0)
+	return NULL;
+    }
+
+  tree cvt_type = unsigned_type_for (lhs_type);
+
+  tree cvt_vectype = get_vectype_for_scalar_type (vinfo, cvt_type);
+  if (!cvt_vectype || !VECTOR_TYPE_P (cvt_vectype))
+    return NULL;
+
+  tree out_vectype = get_vectype_for_scalar_type (vinfo, lhs_type);
+  if (!out_vectype || !VECTOR_TYPE_P (out_vectype))
+    return NULL;
+
+  stmt_vec_info irhs;
+
+  gimple_ranger ranger;
+
+  /* Dig through any existing conversions to see if we can extend the zero
+     extend chain across multiple converts.  */
+  while ((irhs = vect_get_internal_def (vinfo, rhs)))
+    {
+      gimple *g_irhs = STMT_VINFO_STMT (irhs);
+      if (!is_gimple_assign (g_irhs)
+	  || gimple_assign_rhs_code (g_irhs) != NOP_EXPR)
+	break;
+
+      /* See if we can consume the next conversion as well.  To do this it's
+	 best to use Ranger as it can see through the intermediate IL that the
+	 vectorizer creates throughout pattern matching.  */
+      int_range_max r;
+      ranger.range_of_stmt (r, g_irhs);
+      wide_int nz = r.get_nonzero_bits ();
+      if (wi::neg_p (nz) || wi::clz (nz) == 0)
+	break;
+
+      rhs = gimple_assign_rhs1 (g_irhs);
+    }
+
+  /* If the result is a no-op, or we've jumped over a truncate of sorts, or if
+     nothing would change materially, just leave it alone.  */
+  if (TYPE_PRECISION (lhs_type) <= TYPE_PRECISION (TREE_TYPE (rhs))
+      || (code == FLOAT_EXPR && rhs == gimple_assign_rhs1 (last_stmt)))
+    return NULL;
+
+  vect_pattern_detected ("vect_recog_zero_extend_chain_pattern", last_stmt);
+
+  tree cast_var = vect_recog_temp_ssa_var (cvt_type, NULL);
+  gimple *pattern_stmt = NULL;
+  pattern_stmt = gimple_build_assign (cast_var, NOP_EXPR, rhs);
+  append_pattern_def_seq (vinfo, stmt_vinfo, pattern_stmt, cvt_vectype);
+
+  tree cvt_var = vect_recog_temp_ssa_var (lhs_type, NULL);
+  pattern_stmt = gimple_build_assign (cvt_var, code, cast_var);
+
+  *type_out = out_vectype;
+
+  return pattern_stmt;
+}
+
 
 /* Helper function of vect_recog_bool_pattern.  Called recursively, return
    true if bool VAR can and should be optimized that way.  Assume it shouldn't
@@ -7509,6 +7625,7 @@  static vect_recog_func vect_vect_recog_func_ptrs[] = {
   { vect_recog_widen_minus_pattern, "widen_minus" },
   { vect_recog_widen_abd_pattern, "widen_abd" },
   /* These must come after the double widening ones.  */
+  { vect_recog_zero_extend_chain_pattern, "zero_extend_chain" },
 };
 
 /* Mark statements that are involved in a pattern.  */