Message ID | da469973-d874-1eb6-739f-048831e8a898@linux.ibm.com
---|---
State | New
Series | [RFC/PATCH] vect: Recog mul_highpart pattern
On Tue, Jul 13, 2021 at 10:53 AM Kewen.Lin <linkw@linux.ibm.com> wrote: > > Hi, > > When I added the support for Power10 newly introduced multiply > highpart instrutions, I noticed that currently vectorizer > doesn't try to vectorize multiply highpart pattern, I hope > this isn't intentional? > > This patch is to extend the existing pattern mulhs handlings > to cover multiply highpart. Another alternative seems to > recog mul_highpart operation in a general place applied for > scalar code when the target supports the optab for the scalar > operation, it's based on the assumption that one target which > supports vector version of multiply highpart should have the > scalar version. I noticed that the function can_mult_highpart_p > can check/handle mult_highpart well even without mul_highpart > optab support, I think to recog this pattern in vectorizer > is better. Is it on the right track? I think it's on the right track, using IFN_LAST is a bit awkward in case yet another case pops up so maybe you can use a code_helper instance instead which unifies tree_code, builtin_code and internal_fn? I also notice that can_mult_highpart_p will return true if only vec_widen_[us]mult_{even,odd,hi,lo} are available, but then the result might be less optimal (or even not handled later)? That is, what about adding optab internal functions for [us]mul_highpart instead, much like the existing ones for MULH{R,}S? Richard. > > Bootstrapped & regtested on powerpc64le-linux-gnu P9, > x86_64-redhat-linux and aarch64-linux-gnu. > > BR, > Kewen > ----- > gcc/ChangeLog: > > * tree-vect-patterns.c (vect_recog_mulhs_pattern): Add support to > recog normal multiply highpart. >
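For context, the "multiply highpart" pattern under discussion computes only the upper half of a double-width product. A minimal scalar sketch in C, using the same widen-multiply-shift shape as the test cases later in the thread (the function name is illustrative):

```
/* Highpart multiply of two 16-bit values: widen to 32 bits,
   form the full product, and keep only its upper 16 bits.
   This is the shape the patch teaches vect_recog_mulhs_pattern
   to recognize.  */
short
mul_high (short a, short b)
{
  return (short) (((int) a * (int) b) >> 16);
}
```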
Richard Biener <richard.guenther@gmail.com> writes: > On Tue, Jul 13, 2021 at 10:53 AM Kewen.Lin <linkw@linux.ibm.com> wrote: >> >> Hi, >> >> When I added the support for Power10 newly introduced multiply >> highpart instrutions, I noticed that currently vectorizer >> doesn't try to vectorize multiply highpart pattern, I hope >> this isn't intentional? >> >> This patch is to extend the existing pattern mulhs handlings >> to cover multiply highpart. Another alternative seems to >> recog mul_highpart operation in a general place applied for >> scalar code when the target supports the optab for the scalar >> operation, it's based on the assumption that one target which >> supports vector version of multiply highpart should have the >> scalar version. I noticed that the function can_mult_highpart_p >> can check/handle mult_highpart well even without mul_highpart >> optab support, I think to recog this pattern in vectorizer >> is better. Is it on the right track? > > I think it's on the right track, using IFN_LAST is a bit awkward > in case yet another case pops up so maybe you can use > a code_helper instance instead which unifies tree_code, > builtin_code and internal_fn? > > I also notice that can_mult_highpart_p will return true if > only vec_widen_[us]mult_{even,odd,hi,lo} are available, > but then the result might be less optimal (or even not > handled later)? > > That is, what about adding optab internal functions > for [us]mul_highpart instead, much like the existing > ones for MULH{R,}S? Yeah, that'd be my preference too FWIW. All uses of MULT_HIGHPART_EXPR already have to be guarded by can_mult_highpart_p, so replacing it with a directly-mapped ifn seems like a natural fit. (Then can_mult_highpart_p should be replaced with a direct_internal_fn_supported_p query.) Thanks, Richard
On Tue, Jul 13, 2021 at 11:40 AM Richard Sandiford <richard.sandiford@arm.com> wrote: > > Richard Biener <richard.guenther@gmail.com> writes: > > On Tue, Jul 13, 2021 at 10:53 AM Kewen.Lin <linkw@linux.ibm.com> wrote: > >> > >> Hi, > >> > >> When I added the support for Power10 newly introduced multiply > >> highpart instrutions, I noticed that currently vectorizer > >> doesn't try to vectorize multiply highpart pattern, I hope > >> this isn't intentional? > >> > >> This patch is to extend the existing pattern mulhs handlings > >> to cover multiply highpart. Another alternative seems to > >> recog mul_highpart operation in a general place applied for > >> scalar code when the target supports the optab for the scalar > >> operation, it's based on the assumption that one target which > >> supports vector version of multiply highpart should have the > >> scalar version. I noticed that the function can_mult_highpart_p > >> can check/handle mult_highpart well even without mul_highpart > >> optab support, I think to recog this pattern in vectorizer > >> is better. Is it on the right track? > > > > I think it's on the right track, using IFN_LAST is a bit awkward > > in case yet another case pops up so maybe you can use > > a code_helper instance instead which unifies tree_code, > > builtin_code and internal_fn? > > > > I also notice that can_mult_highpart_p will return true if > > only vec_widen_[us]mult_{even,odd,hi,lo} are available, > > but then the result might be less optimal (or even not > > handled later)? > > > > That is, what about adding optab internal functions > > for [us]mul_highpart instead, much like the existing > > ones for MULH{R,}S? > > Yeah, that's be my preference too FWIW. All uses of MULT_HIGHPART_EXPR > already have to be guarded by can_mult_highpart_p, so replacing it with > a directly-mapped ifn seems like a natural fit. (Then can_mult_highpart_p > should be replaced with a direct_internal_fn_supported_p query.) But note can_mult_highpart_p covers use via vec_widen_[us]mult_{even,odd,hi,lo} but I think this specific pattern should key on [us]mul_highpart only? Because vec_widen_* implies a higher VF (or else we might miss vectorizing?)? Richard. > Thanks, > Richard
Richard Biener <richard.guenther@gmail.com> writes: > On Tue, Jul 13, 2021 at 11:40 AM Richard Sandiford > <richard.sandiford@arm.com> wrote: >> >> Richard Biener <richard.guenther@gmail.com> writes: >> > On Tue, Jul 13, 2021 at 10:53 AM Kewen.Lin <linkw@linux.ibm.com> wrote: >> >> >> >> Hi, >> >> >> >> When I added the support for Power10 newly introduced multiply >> >> highpart instrutions, I noticed that currently vectorizer >> >> doesn't try to vectorize multiply highpart pattern, I hope >> >> this isn't intentional? >> >> >> >> This patch is to extend the existing pattern mulhs handlings >> >> to cover multiply highpart. Another alternative seems to >> >> recog mul_highpart operation in a general place applied for >> >> scalar code when the target supports the optab for the scalar >> >> operation, it's based on the assumption that one target which >> >> supports vector version of multiply highpart should have the >> >> scalar version. I noticed that the function can_mult_highpart_p >> >> can check/handle mult_highpart well even without mul_highpart >> >> optab support, I think to recog this pattern in vectorizer >> >> is better. Is it on the right track? >> > >> > I think it's on the right track, using IFN_LAST is a bit awkward >> > in case yet another case pops up so maybe you can use >> > a code_helper instance instead which unifies tree_code, >> > builtin_code and internal_fn? >> > >> > I also notice that can_mult_highpart_p will return true if >> > only vec_widen_[us]mult_{even,odd,hi,lo} are available, >> > but then the result might be less optimal (or even not >> > handled later)? >> > >> > That is, what about adding optab internal functions >> > for [us]mul_highpart instead, much like the existing >> > ones for MULH{R,}S? >> >> Yeah, that's be my preference too FWIW. All uses of MULT_HIGHPART_EXPR >> already have to be guarded by can_mult_highpart_p, so replacing it with >> a directly-mapped ifn seems like a natural fit. (Then can_mult_highpart_p >> should be replaced with a direct_internal_fn_supported_p query.) > > But note can_mult_highpart_t covers use via vec_widen_[us]mult_{even,odd,hi,lo} > but I think this specific pattern should key on [us]mul_highpart only? > > Because vec_widen_* implies a higher VF (or else we might miss vectorizing?)? But wouldn't it be better to do the existing hi/lo/even/odd conversion in gimple, rather than hide it in expand? (Yes, this is feature creep. :-)) Richard
Hi Richi, Thanks for the comments! on 2021/7/13 5:35 PM, Richard Biener wrote: > On Tue, Jul 13, 2021 at 10:53 AM Kewen.Lin <linkw@linux.ibm.com> wrote: >> >> Hi, >> >> When I added the support for Power10 newly introduced multiply >> highpart instrutions, I noticed that currently vectorizer >> doesn't try to vectorize multiply highpart pattern, I hope >> this isn't intentional? >> >> This patch is to extend the existing pattern mulhs handlings >> to cover multiply highpart. Another alternative seems to >> recog mul_highpart operation in a general place applied for >> scalar code when the target supports the optab for the scalar >> operation, it's based on the assumption that one target which >> supports vector version of multiply highpart should have the >> scalar version. I noticed that the function can_mult_highpart_p >> can check/handle mult_highpart well even without mul_highpart >> optab support, I think to recog this pattern in vectorizer >> is better. Is it on the right track? > > I think it's on the right track, using IFN_LAST is a bit awkward > in case yet another case pops up so maybe you can use > a code_helper instance instead which unifies tree_code, > builtin_code and internal_fn? > If a new requirement pops up which doesn't have/introduce any IFN but does have an existing tree_code, can we add one more field of type tree_code, so that on the IFN_LAST path we can check the different requirements under the guard of that tree_code variable? > I also notice that can_mult_highpart_p will return true if > only vec_widen_[us]mult_{even,odd,hi,lo} are available, > but then the result might be less optimal (or even not > handled later)? > I think it will always be handled? The expander calls rtx expand_mult_highpart (machine_mode mode, rtx op0, rtx op1, rtx target, bool uns_p) which further checks with can_mult_highpart_p. For the case below, #define SHT_CNT 16 __attribute__ ((noipa)) void test () { for (int i = 0; i < N; i++) sh_c[i] = ((SI) sh_a[i] * (SI) sh_b[i]) >> 16; } Without this patch, it uses widen_mult as below: vect__1.5_19 = MEM <vector(8) short int> [(short int *)&sh_a + ivtmp.18_24 * 1]; vect__3.8_14 = MEM <vector(8) short int> [(short int *)&sh_b + ivtmp.18_24 * 1]; vect_patt_22.9_13 = WIDEN_MULT_LO_EXPR <vect__3.8_14, vect__1.5_19>; vect_patt_22.9_9 = WIDEN_MULT_HI_EXPR <vect__3.8_14, vect__1.5_19>; vect__6.10_25 = vect_patt_22.9_13 >> 16; vect__6.10_26 = vect_patt_22.9_9 >> 16; vect__7.11_27 = VEC_PACK_TRUNC_EXPR <vect__6.10_25, vect__6.10_26>; MEM <vector(8) short int> [(short int *)&sh_c + ivtmp.18_24 * 1] = vect__7.11_27; .L2: lxvx 33,7,9 lxvx 32,8,9 vmulosh 13,0,1 // widen mult vmulesh 0,0,1 xxmrglw 33,32,45 // merge xxmrghw 32,32,45 vsraw 1,1,12 // shift vsraw 0,0,12 vpkuwum 0,0,1 // pack stxvx 32,10,9 addi 9,9,16 bdnz .L2 With this patch, it ends up with: vect__1.5_14 = MEM <vector(8) short int> [(short int *)&sh_a + ivtmp.17_24 * 1]; vect__3.8_8 = MEM <vector(8) short int> [(short int *)&sh_b + ivtmp.17_24 * 1]; vect_patt_21.9_25 = vect__3.8_8 h* vect__1.5_14; MEM <vector(8) short int> [(short int *)&sh_c + ivtmp.17_24 * 1] = vect_patt_21.9_25; .L2: lxvx 32,8,9 lxvx 33,10,9 vmulosh 13,0,1 // widen mult vmulesh 0,0,1 vperm 0,0,13,12 // perm on widen mults stxvx 32,7,9 addi 9,9,16 bdnz .L2 > That is, what about adding optab internal functions > for [us]mul_highpart instead, much like the existing > ones for MULH{R,}S? 
> OK, I was considering the IFN way at the beginning, but was worried that it would be easy to draw the objection that it's unnecessary since there is an existing tree_code. :-) Will update it the IFN way. BR, Kewen > Richard. > >> >> Bootstrapped & regtested on powerpc64le-linux-gnu P9, >> x86_64-redhat-linux and aarch64-linux-gnu. >> >> BR, >> Kewen >> ----- >> gcc/ChangeLog: >> >> * tree-vect-patterns.c (vect_recog_mulhs_pattern): Add support to >> recog normal multiply highpart. >>
On Tue, Jul 13, 2021 at 12:25 PM Kewen.Lin <linkw@linux.ibm.com> wrote: > > Hi Richi, > > Thanks for the comments! > > on 2021/7/13 下午5:35, Richard Biener wrote: > > On Tue, Jul 13, 2021 at 10:53 AM Kewen.Lin <linkw@linux.ibm.com> wrote: > >> > >> Hi, > >> > >> When I added the support for Power10 newly introduced multiply > >> highpart instrutions, I noticed that currently vectorizer > >> doesn't try to vectorize multiply highpart pattern, I hope > >> this isn't intentional? > >> > >> This patch is to extend the existing pattern mulhs handlings > >> to cover multiply highpart. Another alternative seems to > >> recog mul_highpart operation in a general place applied for > >> scalar code when the target supports the optab for the scalar > >> operation, it's based on the assumption that one target which > >> supports vector version of multiply highpart should have the > >> scalar version. I noticed that the function can_mult_highpart_p > >> can check/handle mult_highpart well even without mul_highpart > >> optab support, I think to recog this pattern in vectorizer > >> is better. Is it on the right track? > > > > I think it's on the right track, using IFN_LAST is a bit awkward > > in case yet another case pops up so maybe you can use > > a code_helper instance instead which unifies tree_code, > > builtin_code and internal_fn? > > > > If there is one new requirement which doesn't have/introduce IFN > stuffs but have one existing tree_code, can we add one more field > with type tree_code, then for the IFN_LAST path we can check the > different requirements under the guard with that tree_code variable? > > > I also notice that can_mult_highpart_p will return true if > > only vec_widen_[us]mult_{even,odd,hi,lo} are available, > > but then the result might be less optimal (or even not > > handled later)? > > > > I think it will be handled always? The expander calls > > rtx > expand_mult_highpart (machine_mode mode, rtx op0, rtx op1, > rtx target, bool uns_p) > > which will further check with can_mult_highpart_p. > > For the below case, > > #define SHT_CNT 16 > > __attribute__ ((noipa)) void > test () > { > for (int i = 0; i < N; i++) > sh_c[i] = ((SI) sh_a[i] * (SI) sh_b[i]) >> 16; > } > > Without this patch, it use widen_mult like below: > > vect__1.5_19 = MEM <vector(8) short int> [(short int *)&sh_a + ivtmp.18_24 * 1]; > vect__3.8_14 = MEM <vector(8) short int> [(short int *)&sh_b + ivtmp.18_24 * 1]; > vect_patt_22.9_13 = WIDEN_MULT_LO_EXPR <vect__3.8_14, vect__1.5_19>; > vect_patt_22.9_9 = WIDEN_MULT_HI_EXPR <vect__3.8_14, vect__1.5_19>; > vect__6.10_25 = vect_patt_22.9_13 >> 16; > vect__6.10_26 = vect_patt_22.9_9 >> 16; > vect__7.11_27 = VEC_PACK_TRUNC_EXPR <vect__6.10_25, vect__6.10_26>; > MEM <vector(8) short int> [(short int *)&sh_c + ivtmp.18_24 * 1] = vect__7.11_27; > > .L2: > lxvx 33,7,9 > lxvx 32,8,9 > vmulosh 13,0,1 // widen mult > vmulesh 0,0,1 > xxmrglw 33,32,45 // merge > xxmrghw 32,32,45 > vsraw 1,1,12 // shift > vsraw 0,0,12 > vpkuwum 0,0,1 // pack > stxvx 32,10,9 > addi 9,9,16 > bdnz .L2 > > > With this patch, it ends up with: > > vect__1.5_14 = MEM <vector(8) short int> [(short int *)&sh_a + ivtmp.17_24 * 1]; > vect__3.8_8 = MEM <vector(8) short int> [(short int *)&sh_b + ivtmp.17_24 * 1]; > vect_patt_21.9_25 = vect__3.8_8 h* vect__1.5_14; > MEM <vector(8) short int> [(short int *)&sh_c + ivtmp.17_24 * 1] = vect_patt_21.9_25; Yes, so I'm curious what it ends up with/without the patch on x86_64 which can do vec_widen_[us]mult_{even,odd} but not [us]mul_highpart. Richard. 
> .L2: > lxvx 32,8,9 > lxvx 33,10,9 > vmulosh 13,0,1 // widen mult > vmulesh 0,0,1 > vperm 0,0,13,12 // perm on widen mults > stxvx 32,7,9 > addi 9,9,16 > bdnz .L2 > > > > That is, what about adding optab internal functions > > for [us]mul_highpart instead, much like the existing > > ones for MULH{R,}S? > > > > OK, I was thinking the IFN way at the beginning, but was worried > that it's easy to be blamed saying it's not necessary since there > is one existing tree_code. :-) Will update it with IFN way. > > BR, > Kewen > > > Richard. > > > >> > >> Bootstrapped & regtested on powerpc64le-linux-gnu P9, > >> x86_64-redhat-linux and aarch64-linux-gnu. > >> > >> BR, > >> Kewen > >> ----- > >> gcc/ChangeLog: > >> > >> * tree-vect-patterns.c (vect_recog_mulhs_pattern): Add support to > >> recog normal multiply highpart. > >> > >
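Before the x86_64 numbers arrive in the next message, a plain-C model of what the vec_widen_[us]mult_{even,odd} optabs compute may help frame Richard's question. This is an illustrative sketch only (8 x int16 inputs assumed, signed case shown; not GCC code):

```
#include <stdint.h>

/* Illustrative model of signed vec_widen_mult_even/odd on
   8 x int16 inputs: each produces four widened 32-bit products.
   Recovering the in-order high halves afterwards still needs
   shifts and a pack/permute, which is the extra cost at issue.  */
static void
widen_mult_even_odd (const int16_t a[8], const int16_t b[8],
                     int32_t even[4], int32_t odd[4])
{
  for (int i = 0; i < 4; i++)
    {
      even[i] = (int32_t) a[2 * i] * (int32_t) b[2 * i];
      odd[i] = (int32_t) a[2 * i + 1] * (int32_t) b[2 * i + 1];
    }
}
```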
on 2021/7/13 下午8:42, Richard Biener wrote: > On Tue, Jul 13, 2021 at 12:25 PM Kewen.Lin <linkw@linux.ibm.com> wrote: >> >> Hi Richi, >> >> Thanks for the comments! >> >> on 2021/7/13 下午5:35, Richard Biener wrote: >>> On Tue, Jul 13, 2021 at 10:53 AM Kewen.Lin <linkw@linux.ibm.com> wrote: >>>> >>>> Hi, >>>> >>>> When I added the support for Power10 newly introduced multiply >>>> highpart instrutions, I noticed that currently vectorizer >>>> doesn't try to vectorize multiply highpart pattern, I hope >>>> this isn't intentional? >>>> >>>> This patch is to extend the existing pattern mulhs handlings >>>> to cover multiply highpart. Another alternative seems to >>>> recog mul_highpart operation in a general place applied for >>>> scalar code when the target supports the optab for the scalar >>>> operation, it's based on the assumption that one target which >>>> supports vector version of multiply highpart should have the >>>> scalar version. I noticed that the function can_mult_highpart_p >>>> can check/handle mult_highpart well even without mul_highpart >>>> optab support, I think to recog this pattern in vectorizer >>>> is better. Is it on the right track? >>> >>> I think it's on the right track, using IFN_LAST is a bit awkward >>> in case yet another case pops up so maybe you can use >>> a code_helper instance instead which unifies tree_code, >>> builtin_code and internal_fn? >>> >> >> If there is one new requirement which doesn't have/introduce IFN >> stuffs but have one existing tree_code, can we add one more field >> with type tree_code, then for the IFN_LAST path we can check the >> different requirements under the guard with that tree_code variable? >> >>> I also notice that can_mult_highpart_p will return true if >>> only vec_widen_[us]mult_{even,odd,hi,lo} are available, >>> but then the result might be less optimal (or even not >>> handled later)? >>> >> >> I think it will be handled always? The expander calls >> >> rtx >> expand_mult_highpart (machine_mode mode, rtx op0, rtx op1, >> rtx target, bool uns_p) >> >> which will further check with can_mult_highpart_p. 
>> >> For the below case, >> >> #define SHT_CNT 16 >> >> __attribute__ ((noipa)) void >> test () >> { >> for (int i = 0; i < N; i++) >> sh_c[i] = ((SI) sh_a[i] * (SI) sh_b[i]) >> 16; >> } >> >> Without this patch, it use widen_mult like below: >> >> vect__1.5_19 = MEM <vector(8) short int> [(short int *)&sh_a + ivtmp.18_24 * 1]; >> vect__3.8_14 = MEM <vector(8) short int> [(short int *)&sh_b + ivtmp.18_24 * 1]; >> vect_patt_22.9_13 = WIDEN_MULT_LO_EXPR <vect__3.8_14, vect__1.5_19>; >> vect_patt_22.9_9 = WIDEN_MULT_HI_EXPR <vect__3.8_14, vect__1.5_19>; >> vect__6.10_25 = vect_patt_22.9_13 >> 16; >> vect__6.10_26 = vect_patt_22.9_9 >> 16; >> vect__7.11_27 = VEC_PACK_TRUNC_EXPR <vect__6.10_25, vect__6.10_26>; >> MEM <vector(8) short int> [(short int *)&sh_c + ivtmp.18_24 * 1] = vect__7.11_27; >> >> .L2: >> lxvx 33,7,9 >> lxvx 32,8,9 >> vmulosh 13,0,1 // widen mult >> vmulesh 0,0,1 >> xxmrglw 33,32,45 // merge >> xxmrghw 32,32,45 >> vsraw 1,1,12 // shift >> vsraw 0,0,12 >> vpkuwum 0,0,1 // pack >> stxvx 32,10,9 >> addi 9,9,16 >> bdnz .L2 >> >> >> With this patch, it ends up with: >> >> vect__1.5_14 = MEM <vector(8) short int> [(short int *)&sh_a + ivtmp.17_24 * 1]; >> vect__3.8_8 = MEM <vector(8) short int> [(short int *)&sh_b + ivtmp.17_24 * 1]; >> vect_patt_21.9_25 = vect__3.8_8 h* vect__1.5_14; >> MEM <vector(8) short int> [(short int *)&sh_c + ivtmp.17_24 * 1] = vect_patt_21.9_25; > > Yes, so I'm curious what it ends up with/without the patch on x86_64 which > can do vec_widen_[us]mult_{even,odd} but not [us]mul_highpart. > For test case: ``` #define N 32 typedef signed int bigType; typedef signed short smallType; #define SH_CNT 16 extern smallType small_a[N], small_b[N], small_c[N]; __attribute__((noipa)) void test_si(int n) { for (int i = 0; i < n; i++) small_c[i] = ((bigType)small_a[i] * (bigType)small_b[i]) >> SH_CNT; } ``` on x86_64, with option set: -mfpmath=sse -msse2 -O2 -ftree-vectorize 1) without this patch, the pattern isn't recognized, the IR looks like: <bb 4> [local count: 94607391]: bnd.5_34 = niters.4_25 >> 3; _13 = (sizetype) bnd.5_34; _29 = _13 * 16; <bb 5> [local count: 378429566]: # ivtmp.34_4 = PHI <ivtmp.34_5(5), 0(4)> vect__1.10_40 = MEM <vector(8) short int> [(short int *)&small_a + ivtmp.34_4 * 1]; vect__3.13_43 = MEM <vector(8) short int> [(short int *)&small_b + ivtmp.34_4 * 1]; vect_patt_18.14_44 = WIDEN_MULT_LO_EXPR <vect__1.10_40, vect__3.13_43>; vect_patt_18.14_45 = WIDEN_MULT_HI_EXPR <vect__1.10_40, vect__3.13_43>; vect__6.15_46 = vect_patt_18.14_44 >> 16; vect__6.15_47 = vect_patt_18.14_45 >> 16; vect__7.16_48 = VEC_PACK_TRUNC_EXPR <vect__6.15_46, vect__6.15_47>; MEM <vector(8) short int> [(short int *)&small_c + ivtmp.34_4 * 1] = vect__7.16_48; ivtmp.34_5 = ivtmp.34_4 + 16; if (ivtmp.34_5 != _29) goto <bb 5>; [75.00%] else goto <bb 6>; [25.00%] ... 
*asm*: .L4: movdqu small_b(%rax), %xmm3 movdqu small_a(%rax), %xmm1 addq $16, %rax movdqu small_a-16(%rax), %xmm2 pmullw %xmm3, %xmm1 pmulhw %xmm3, %xmm2 movdqa %xmm1, %xmm0 punpcklwd %xmm2, %xmm0 punpckhwd %xmm2, %xmm1 psrad $16, %xmm1 psrad $16, %xmm0 movdqa %xmm0, %xmm2 punpcklwd %xmm1, %xmm0 punpckhwd %xmm1, %xmm2 movdqa %xmm0, %xmm1 punpckhwd %xmm2, %xmm1 punpcklwd %xmm2, %xmm0 punpcklwd %xmm1, %xmm0 movups %xmm0, small_c-16(%rax) cmpq %rdx, %rax jne .L4 movl %edi, %eax andl $-8, %eax testb $7, %dil je .L14 *insn dist in loop* 1 addq 3 movdqa 3 movdqu 1 movups 1 pmulhw 1 pmullw 2 psrad 3 punpckhwd 4 punpcklwd 2) with this patch but making the mul_highpart optab query return false, the IR looks like: <bb 4> [local count: 94607391]: bnd.5_37 = niters.4_22 >> 3; _13 = (sizetype) bnd.5_37; _32 = _13 * 16; <bb 5> [local count: 378429566]: # ivtmp.33_4 = PHI <ivtmp.33_5(5), 0(4)> vect__1.10_43 = MEM <vector(8) short int> [(short int *)&small_a + ivtmp.33_4 * 1]; vect__3.13_46 = MEM <vector(8) short int> [(short int *)&small_b + ivtmp.33_4 * 1]; vect_patt_26.14_47 = vect__1.10_43 h* vect__3.13_46; MEM <vector(8) short int> [(short int *)&small_c + ivtmp.33_4 * 1] = vect_patt_26.14_47; ivtmp.33_5 = ivtmp.33_4 + 16; if (ivtmp.33_5 != _32) goto <bb 5>; [75.00%] else goto <bb 6>; [25.00%] *asm*: .L4: movdqu small_b(%rax), %xmm3 movdqu small_a(%rax), %xmm1 addq $16, %rax movdqu small_a-16(%rax), %xmm2 pmullw %xmm3, %xmm1 pmulhw %xmm3, %xmm2 movdqa %xmm1, %xmm0 punpcklwd %xmm2, %xmm0 punpckhwd %xmm2, %xmm1 movdqa %xmm0, %xmm2 punpcklwd %xmm1, %xmm0 punpckhwd %xmm1, %xmm2 movdqa %xmm0, %xmm1 punpckhwd %xmm2, %xmm1 punpcklwd %xmm2, %xmm0 punpckhwd %xmm1, %xmm0 movups %xmm0, small_c-16(%rax) cmpq %rdx, %rax jne .L4 movl %edi, %eax andl $-8, %eax testb $7, %dil je .L14 *insn dist*: 1 addq 3 movdqa 3 movdqu 1 movups 1 pmulhw 1 pmullw 4 punpckhwd 3 punpcklwd I know little about i386 asm, but this seems better judging from the insn distribution; the one without this patch uses two more psrad insns. 
3) FWIW with this patch as well as existing optab supports, the IR looks like: <bb 4> [local count: 94607391]: bnd.5_40 = niters.4_19 >> 3; _75 = (sizetype) bnd.5_40; _91 = _75 * 16; <bb 5> [local count: 378429566]: # ivtmp.48_53 = PHI <ivtmp.48_45(5), 0(4)> vect__1.10_48 = MEM <vector(8) short int> [(short int *)&small_a + ivtmp.48_53 * 1]; vect__3.13_51 = MEM <vector(8) short int> [(short int *)&small_b + ivtmp.48_53 * 1]; vect_patt_26.14_52 = vect__1.10_48 h* vect__3.13_51; MEM <vector(8) short int> [(short int *)&small_c + ivtmp.48_53 * 1] = vect_patt_26.14_52; ivtmp.48_45 = ivtmp.48_53 + 16; if (ivtmp.48_45 != _91) goto <bb 5>; [75.00%] else goto <bb 6>; [25.00%] <bb 6> [local count: 94607391]: niters_vector_mult_vf.6_41 = niters.4_19 & 4294967288; tmp.7_42 = (int) niters_vector_mult_vf.6_41; _61 = niters.4_19 & 7; if (_61 == 0) goto <bb 12>; [12.50%] else goto <bb 7>; [87.50%] <bb 7> [local count: 93293400]: # i_38 = PHI <tmp.7_42(6), 0(3)> # _44 = PHI <niters_vector_mult_vf.6_41(6), 0(3)> niters.18_60 = niters.4_19 - _44; _76 = niters.18_60 + 4294967295; if (_76 <= 2) goto <bb 9>; [10.00%] else goto <bb 8>; [90.00%] <bb 8> [local count: 167928121]: _85 = (sizetype) _44; _86 = _85 * 2; vectp_small_a.23_84 = &small_a + _86; vectp_small_b.26_90 = &small_b + _86; vectp_small_c.31_98 = &small_c + _86; vect__9.24_89 = MEM <vector(4) short int> [(short int *)vectp_small_a.23_84]; vect__28.27_95 = MEM <vector(4) short int> [(short int *)vectp_small_b.26_90]; vect_patt_23.28_96 = vect__9.24_89 h* vect__28.27_95; MEM <vector(4) short int> [(short int *)vectp_small_c.31_98] = vect_patt_23.28_96; niters_vector_mult_vf.20_80 = niters.18_60 & 4294967292; _82 = (int) niters_vector_mult_vf.20_80; tmp.21_81 = i_38 + _82; _74 = niters.18_60 & 3; if (_74 == 0) goto <bb 11>; [25.00%] else goto <bb 9>; [75.00%] ... *asm*: .L4: movdqu small_a(%rax), %xmm0 movdqu small_b(%rax), %xmm2 addq $16, %rax pmulhw %xmm2, %xmm0 movups %xmm0, small_c-16(%rax) cmpq %rdx, %rax jne .L4 movl %ecx, %edx andl $-8, %edx movl %edx, %eax testb $7, %cl je .L19 .L3: movl %ecx, %esi subl %edx, %esi leal -1(%rsi), %edi cmpl $2, %edi jbe .L6 movq small_a(%rdx,%rdx), %xmm0 movq small_b(%rdx,%rdx), %xmm1 pmulhw %xmm1, %xmm0 movq %xmm0, small_c(%rdx,%rdx) movl %esi, %edx andl $-4, %edx addl %edx, %eax andl $3, %esi je .L1 I guess the proposed IFN would be directly mapped for [us]mul_highpart? or do you expect it will also cover what we support in can_mult_highpart_p? BR, Kewen
On Tue, Jul 13, 2021 at 4:59 PM Kewen.Lin <linkw@linux.ibm.com> wrote: > > on 2021/7/13 下午8:42, Richard Biener wrote: > > On Tue, Jul 13, 2021 at 12:25 PM Kewen.Lin <linkw@linux.ibm.com> wrote: > >> > >> Hi Richi, > >> > >> Thanks for the comments! > >> > >> on 2021/7/13 下午5:35, Richard Biener wrote: > >>> On Tue, Jul 13, 2021 at 10:53 AM Kewen.Lin <linkw@linux.ibm.com> wrote: > >>>> > >>>> Hi, > >>>> > >>>> When I added the support for Power10 newly introduced multiply > >>>> highpart instrutions, I noticed that currently vectorizer > >>>> doesn't try to vectorize multiply highpart pattern, I hope > >>>> this isn't intentional? > >>>> > >>>> This patch is to extend the existing pattern mulhs handlings > >>>> to cover multiply highpart. Another alternative seems to > >>>> recog mul_highpart operation in a general place applied for > >>>> scalar code when the target supports the optab for the scalar > >>>> operation, it's based on the assumption that one target which > >>>> supports vector version of multiply highpart should have the > >>>> scalar version. I noticed that the function can_mult_highpart_p > >>>> can check/handle mult_highpart well even without mul_highpart > >>>> optab support, I think to recog this pattern in vectorizer > >>>> is better. Is it on the right track? > >>> > >>> I think it's on the right track, using IFN_LAST is a bit awkward > >>> in case yet another case pops up so maybe you can use > >>> a code_helper instance instead which unifies tree_code, > >>> builtin_code and internal_fn? > >>> > >> > >> If there is one new requirement which doesn't have/introduce IFN > >> stuffs but have one existing tree_code, can we add one more field > >> with type tree_code, then for the IFN_LAST path we can check the > >> different requirements under the guard with that tree_code variable? > >> > >>> I also notice that can_mult_highpart_p will return true if > >>> only vec_widen_[us]mult_{even,odd,hi,lo} are available, > >>> but then the result might be less optimal (or even not > >>> handled later)? > >>> > >> > >> I think it will be handled always? The expander calls > >> > >> rtx > >> expand_mult_highpart (machine_mode mode, rtx op0, rtx op1, > >> rtx target, bool uns_p) > >> > >> which will further check with can_mult_highpart_p. 
> >> > >> For the below case, > >> > >> #define SHT_CNT 16 > >> > >> __attribute__ ((noipa)) void > >> test () > >> { > >> for (int i = 0; i < N; i++) > >> sh_c[i] = ((SI) sh_a[i] * (SI) sh_b[i]) >> 16; > >> } > >> > >> Without this patch, it use widen_mult like below: > >> > >> vect__1.5_19 = MEM <vector(8) short int> [(short int *)&sh_a + ivtmp.18_24 * 1]; > >> vect__3.8_14 = MEM <vector(8) short int> [(short int *)&sh_b + ivtmp.18_24 * 1]; > >> vect_patt_22.9_13 = WIDEN_MULT_LO_EXPR <vect__3.8_14, vect__1.5_19>; > >> vect_patt_22.9_9 = WIDEN_MULT_HI_EXPR <vect__3.8_14, vect__1.5_19>; > >> vect__6.10_25 = vect_patt_22.9_13 >> 16; > >> vect__6.10_26 = vect_patt_22.9_9 >> 16; > >> vect__7.11_27 = VEC_PACK_TRUNC_EXPR <vect__6.10_25, vect__6.10_26>; > >> MEM <vector(8) short int> [(short int *)&sh_c + ivtmp.18_24 * 1] = vect__7.11_27; > >> > >> .L2: > >> lxvx 33,7,9 > >> lxvx 32,8,9 > >> vmulosh 13,0,1 // widen mult > >> vmulesh 0,0,1 > >> xxmrglw 33,32,45 // merge > >> xxmrghw 32,32,45 > >> vsraw 1,1,12 // shift > >> vsraw 0,0,12 > >> vpkuwum 0,0,1 // pack > >> stxvx 32,10,9 > >> addi 9,9,16 > >> bdnz .L2 > >> > >> > >> With this patch, it ends up with: > >> > >> vect__1.5_14 = MEM <vector(8) short int> [(short int *)&sh_a + ivtmp.17_24 * 1]; > >> vect__3.8_8 = MEM <vector(8) short int> [(short int *)&sh_b + ivtmp.17_24 * 1]; > >> vect_patt_21.9_25 = vect__3.8_8 h* vect__1.5_14; > >> MEM <vector(8) short int> [(short int *)&sh_c + ivtmp.17_24 * 1] = vect_patt_21.9_25; > > > > Yes, so I'm curious what it ends up with/without the patch on x86_64 which > > can do vec_widen_[us]mult_{even,odd} but not [us]mul_highpart. > > > > For test case: > > ``` > #define N 32 > typedef signed int bigType; > typedef signed short smallType; > #define SH_CNT 16 > > extern smallType small_a[N], small_b[N], small_c[N]; > > __attribute__((noipa)) void test_si(int n) { > for (int i = 0; i < n; i++) > small_c[i] = ((bigType)small_a[i] * (bigType)small_b[i]) >> SH_CNT; > } > > ``` > > on x86_64, with option set: -mfpmath=sse -msse2 -O2 -ftree-vectorize > > 1) without this patch, the pattern isn't recognized, the IR looks like: > > <bb 4> [local count: 94607391]: > bnd.5_34 = niters.4_25 >> 3; > _13 = (sizetype) bnd.5_34; > _29 = _13 * 16; > > <bb 5> [local count: 378429566]: > # ivtmp.34_4 = PHI <ivtmp.34_5(5), 0(4)> > vect__1.10_40 = MEM <vector(8) short int> [(short int *)&small_a + ivtmp.34_4 * 1]; > vect__3.13_43 = MEM <vector(8) short int> [(short int *)&small_b + ivtmp.34_4 * 1]; > vect_patt_18.14_44 = WIDEN_MULT_LO_EXPR <vect__1.10_40, vect__3.13_43>; > vect_patt_18.14_45 = WIDEN_MULT_HI_EXPR <vect__1.10_40, vect__3.13_43>; > vect__6.15_46 = vect_patt_18.14_44 >> 16; > vect__6.15_47 = vect_patt_18.14_45 >> 16; > vect__7.16_48 = VEC_PACK_TRUNC_EXPR <vect__6.15_46, vect__6.15_47>; > MEM <vector(8) short int> [(short int *)&small_c + ivtmp.34_4 * 1] = vect__7.16_48; > ivtmp.34_5 = ivtmp.34_4 + 16; > if (ivtmp.34_5 != _29) > goto <bb 5>; [75.00%] > else > goto <bb 6>; [25.00%] > > ... 
> > *asm*: > > .L4: > movdqu small_b(%rax), %xmm3 > movdqu small_a(%rax), %xmm1 > addq $16, %rax > movdqu small_a-16(%rax), %xmm2 > pmullw %xmm3, %xmm1 > pmulhw %xmm3, %xmm2 > movdqa %xmm1, %xmm0 > punpcklwd %xmm2, %xmm0 > punpckhwd %xmm2, %xmm1 > psrad $16, %xmm1 > psrad $16, %xmm0 > movdqa %xmm0, %xmm2 > punpcklwd %xmm1, %xmm0 > punpckhwd %xmm1, %xmm2 > movdqa %xmm0, %xmm1 > punpckhwd %xmm2, %xmm1 > punpcklwd %xmm2, %xmm0 > punpcklwd %xmm1, %xmm0 > movups %xmm0, small_c-16(%rax) > cmpq %rdx, %rax > jne .L4 > movl %edi, %eax > andl $-8, %eax > testb $7, %dil > je .L14 > > *insn dist in loop* > > 1 addq > 3 movdqa > 3 movdqu > 1 movups > 1 pmulhw > 1 pmullw > 2 psrad > 3 punpckhwd > 4 punpcklwd > > > 2) with this patch but make the mul_highpart optab query return false, the IR looks > like: > > <bb 4> [local count: 94607391]: > bnd.5_37 = niters.4_22 >> 3; > _13 = (sizetype) bnd.5_37; > _32 = _13 * 16; > > <bb 5> [local count: 378429566]: > # ivtmp.33_4 = PHI <ivtmp.33_5(5), 0(4)> > vect__1.10_43 = MEM <vector(8) short int> [(short int *)&small_a + ivtmp.33_4 * 1]; > vect__3.13_46 = MEM <vector(8) short int> [(short int *)&small_b + ivtmp.33_4 * 1]; > vect_patt_26.14_47 = vect__1.10_43 h* vect__3.13_46; > MEM <vector(8) short int> [(short int *)&small_c + ivtmp.33_4 * 1] = vect_patt_26.14_47; > ivtmp.33_5 = ivtmp.33_4 + 16; > if (ivtmp.33_5 != _32) > goto <bb 5>; [75.00%] > else > goto <bb 6>; [25.00%] > > *asm*: > > .L4: > movdqu small_b(%rax), %xmm3 > movdqu small_a(%rax), %xmm1 > addq $16, %rax > movdqu small_a-16(%rax), %xmm2 > pmullw %xmm3, %xmm1 > pmulhw %xmm3, %xmm2 > movdqa %xmm1, %xmm0 > punpcklwd %xmm2, %xmm0 > punpckhwd %xmm2, %xmm1 > movdqa %xmm0, %xmm2 > punpcklwd %xmm1, %xmm0 > punpckhwd %xmm1, %xmm2 > movdqa %xmm0, %xmm1 > punpckhwd %xmm2, %xmm1 > punpcklwd %xmm2, %xmm0 > punpckhwd %xmm1, %xmm0 > movups %xmm0, small_c-16(%rax) > cmpq %rdx, %rax > jne .L4 > movl %edi, %eax > andl $-8, %eax > testb $7, %dil > je .L14 > > *insn dist*: > > 1 addq > 3 movdqa > 3 movdqu > 1 movups > 1 pmulhw > 1 pmullw > 4 punpckhwd > 3 punpcklwd > > I know little on i386 asm, but this seems better from insn distribution, > the one without this patch uses two more psrad (s). So the advantage of 1) would be more appropriate costing in the vectorizer (x86 does not have a native vector highpart multiply). 
> 3) FWIW with this patch as well as existing optab supports, the IR looks like: > > <bb 4> [local count: 94607391]: > bnd.5_40 = niters.4_19 >> 3; > _75 = (sizetype) bnd.5_40; > _91 = _75 * 16; > > <bb 5> [local count: 378429566]: > # ivtmp.48_53 = PHI <ivtmp.48_45(5), 0(4)> > vect__1.10_48 = MEM <vector(8) short int> [(short int *)&small_a + ivtmp.48_53 * 1]; > vect__3.13_51 = MEM <vector(8) short int> [(short int *)&small_b + ivtmp.48_53 * 1]; > vect_patt_26.14_52 = vect__1.10_48 h* vect__3.13_51; > MEM <vector(8) short int> [(short int *)&small_c + ivtmp.48_53 * 1] = vect_patt_26.14_52; > ivtmp.48_45 = ivtmp.48_53 + 16; > if (ivtmp.48_45 != _91) > goto <bb 5>; [75.00%] > else > goto <bb 6>; [25.00%] > > <bb 6> [local count: 94607391]: > niters_vector_mult_vf.6_41 = niters.4_19 & 4294967288; > tmp.7_42 = (int) niters_vector_mult_vf.6_41; > _61 = niters.4_19 & 7; > if (_61 == 0) > goto <bb 12>; [12.50%] > else > goto <bb 7>; [87.50%] > > <bb 7> [local count: 93293400]: > # i_38 = PHI <tmp.7_42(6), 0(3)> > # _44 = PHI <niters_vector_mult_vf.6_41(6), 0(3)> > niters.18_60 = niters.4_19 - _44; > _76 = niters.18_60 + 4294967295; > if (_76 <= 2) > goto <bb 9>; [10.00%] > else > goto <bb 8>; [90.00%] > > <bb 8> [local count: 167928121]: > _85 = (sizetype) _44; > _86 = _85 * 2; > vectp_small_a.23_84 = &small_a + _86; > vectp_small_b.26_90 = &small_b + _86; > vectp_small_c.31_98 = &small_c + _86; > vect__9.24_89 = MEM <vector(4) short int> [(short int *)vectp_small_a.23_84]; > vect__28.27_95 = MEM <vector(4) short int> [(short int *)vectp_small_b.26_90]; > vect_patt_23.28_96 = vect__9.24_89 h* vect__28.27_95; > MEM <vector(4) short int> [(short int *)vectp_small_c.31_98] = vect_patt_23.28_96; > niters_vector_mult_vf.20_80 = niters.18_60 & 4294967292; > _82 = (int) niters_vector_mult_vf.20_80; > tmp.21_81 = i_38 + _82; > _74 = niters.18_60 & 3; > if (_74 == 0) > goto <bb 11>; [25.00%] > else > goto <bb 9>; [75.00%] > > ... > > *asm*: > > .L4: > movdqu small_a(%rax), %xmm0 > movdqu small_b(%rax), %xmm2 > addq $16, %rax > pmulhw %xmm2, %xmm0 > movups %xmm0, small_c-16(%rax) > cmpq %rdx, %rax > jne .L4 > movl %ecx, %edx > andl $-8, %edx > movl %edx, %eax > testb $7, %cl > je .L19 > .L3: > movl %ecx, %esi > subl %edx, %esi > leal -1(%rsi), %edi > cmpl $2, %edi > jbe .L6 > movq small_a(%rdx,%rdx), %xmm0 > movq small_b(%rdx,%rdx), %xmm1 > pmulhw %xmm1, %xmm0 > movq %xmm0, small_c(%rdx,%rdx) > movl %esi, %edx > andl $-4, %edx > addl %edx, %eax > andl $3, %esi > je .L1 Oh, so indeed x86 has a vector highpart multiply, not grep-friendly macroized as <s>mul<mode>3_highpart ;) > I guess the proposed IFN would be directly mapped for [us]mul_highpart? Yes. > or do you expect it will also cover what we support in can_mult_highpart_p? See above - since we manage to use widen_mult we should expose this detail for the purpose of costing. So the pattern would directly ask for an optab IFN mapping to [us]mul_highpart. Richard. > BR, > Kewen
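For readers unfamiliar with the instruction behind this exchange: pmulhw, which appears throughout the x86 listings above, is an 8-lane signed 16-bit highpart multiply. An illustrative plain-C emulation of one pmulhw step (assuming SSE2 128-bit vectors and signed semantics):

```
#include <stdint.h>

/* Illustrative emulation of one SSE2 pmulhw step: eight signed
   16-bit highpart multiplies, i.e. exactly the vectorized form
   of the scalar pattern discussed in this thread.  */
static void
pmulhw_model (const int16_t a[8], const int16_t b[8], int16_t c[8])
{
  for (int i = 0; i < 8; i++)
    c[i] = (int16_t) (((int32_t) a[i] * (int32_t) b[i]) >> 16);
}
```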
on 2021/7/14 下午2:38, Richard Biener wrote: > On Tue, Jul 13, 2021 at 4:59 PM Kewen.Lin <linkw@linux.ibm.com> wrote: >> >> on 2021/7/13 下午8:42, Richard Biener wrote: >>> On Tue, Jul 13, 2021 at 12:25 PM Kewen.Lin <linkw@linux.ibm.com> wrote: >>>> >>>> Hi Richi, >>>> >>>> Thanks for the comments! >>>> >>>> on 2021/7/13 下午5:35, Richard Biener wrote: >>>>> On Tue, Jul 13, 2021 at 10:53 AM Kewen.Lin <linkw@linux.ibm.com> wrote: >>>>>> >>>>>> Hi, >>>>>> >>>>>> When I added the support for Power10 newly introduced multiply >>>>>> highpart instrutions, I noticed that currently vectorizer >>>>>> doesn't try to vectorize multiply highpart pattern, I hope >>>>>> this isn't intentional? >>>>>> >>>>>> This patch is to extend the existing pattern mulhs handlings >>>>>> to cover multiply highpart. Another alternative seems to >>>>>> recog mul_highpart operation in a general place applied for >>>>>> scalar code when the target supports the optab for the scalar >>>>>> operation, it's based on the assumption that one target which >>>>>> supports vector version of multiply highpart should have the >>>>>> scalar version. I noticed that the function can_mult_highpart_p >>>>>> can check/handle mult_highpart well even without mul_highpart >>>>>> optab support, I think to recog this pattern in vectorizer >>>>>> is better. Is it on the right track? >>>>> >>>>> I think it's on the right track, using IFN_LAST is a bit awkward >>>>> in case yet another case pops up so maybe you can use >>>>> a code_helper instance instead which unifies tree_code, >>>>> builtin_code and internal_fn? >>>>> >>>> >>>> If there is one new requirement which doesn't have/introduce IFN >>>> stuffs but have one existing tree_code, can we add one more field >>>> with type tree_code, then for the IFN_LAST path we can check the >>>> different requirements under the guard with that tree_code variable? >>>> >>>>> I also notice that can_mult_highpart_p will return true if >>>>> only vec_widen_[us]mult_{even,odd,hi,lo} are available, >>>>> but then the result might be less optimal (or even not >>>>> handled later)? >>>>> >>>> >>>> I think it will be handled always? The expander calls >>>> >>>> rtx >>>> expand_mult_highpart (machine_mode mode, rtx op0, rtx op1, >>>> rtx target, bool uns_p) >>>> >>>> which will further check with can_mult_highpart_p. 
>>>> >>>> For the below case, >>>> >>>> #define SHT_CNT 16 >>>> >>>> __attribute__ ((noipa)) void >>>> test () >>>> { >>>> for (int i = 0; i < N; i++) >>>> sh_c[i] = ((SI) sh_a[i] * (SI) sh_b[i]) >> 16; >>>> } >>>> >>>> Without this patch, it use widen_mult like below: >>>> >>>> vect__1.5_19 = MEM <vector(8) short int> [(short int *)&sh_a + ivtmp.18_24 * 1]; >>>> vect__3.8_14 = MEM <vector(8) short int> [(short int *)&sh_b + ivtmp.18_24 * 1]; >>>> vect_patt_22.9_13 = WIDEN_MULT_LO_EXPR <vect__3.8_14, vect__1.5_19>; >>>> vect_patt_22.9_9 = WIDEN_MULT_HI_EXPR <vect__3.8_14, vect__1.5_19>; >>>> vect__6.10_25 = vect_patt_22.9_13 >> 16; >>>> vect__6.10_26 = vect_patt_22.9_9 >> 16; >>>> vect__7.11_27 = VEC_PACK_TRUNC_EXPR <vect__6.10_25, vect__6.10_26>; >>>> MEM <vector(8) short int> [(short int *)&sh_c + ivtmp.18_24 * 1] = vect__7.11_27; >>>> >>>> .L2: >>>> lxvx 33,7,9 >>>> lxvx 32,8,9 >>>> vmulosh 13,0,1 // widen mult >>>> vmulesh 0,0,1 >>>> xxmrglw 33,32,45 // merge >>>> xxmrghw 32,32,45 >>>> vsraw 1,1,12 // shift >>>> vsraw 0,0,12 >>>> vpkuwum 0,0,1 // pack >>>> stxvx 32,10,9 >>>> addi 9,9,16 >>>> bdnz .L2 >>>> >>>> >>>> With this patch, it ends up with: >>>> >>>> vect__1.5_14 = MEM <vector(8) short int> [(short int *)&sh_a + ivtmp.17_24 * 1]; >>>> vect__3.8_8 = MEM <vector(8) short int> [(short int *)&sh_b + ivtmp.17_24 * 1]; >>>> vect_patt_21.9_25 = vect__3.8_8 h* vect__1.5_14; >>>> MEM <vector(8) short int> [(short int *)&sh_c + ivtmp.17_24 * 1] = vect_patt_21.9_25; >>> >>> Yes, so I'm curious what it ends up with/without the patch on x86_64 which >>> can do vec_widen_[us]mult_{even,odd} but not [us]mul_highpart. >>> >> >> For test case: >> >> ``` >> #define N 32 >> typedef signed int bigType; >> typedef signed short smallType; >> #define SH_CNT 16 >> >> extern smallType small_a[N], small_b[N], small_c[N]; >> >> __attribute__((noipa)) void test_si(int n) { >> for (int i = 0; i < n; i++) >> small_c[i] = ((bigType)small_a[i] * (bigType)small_b[i]) >> SH_CNT; >> } >> >> ``` >> >> on x86_64, with option set: -mfpmath=sse -msse2 -O2 -ftree-vectorize >> >> 1) without this patch, the pattern isn't recognized, the IR looks like: >> >> <bb 4> [local count: 94607391]: >> bnd.5_34 = niters.4_25 >> 3; >> _13 = (sizetype) bnd.5_34; >> _29 = _13 * 16; >> >> <bb 5> [local count: 378429566]: >> # ivtmp.34_4 = PHI <ivtmp.34_5(5), 0(4)> >> vect__1.10_40 = MEM <vector(8) short int> [(short int *)&small_a + ivtmp.34_4 * 1]; >> vect__3.13_43 = MEM <vector(8) short int> [(short int *)&small_b + ivtmp.34_4 * 1]; >> vect_patt_18.14_44 = WIDEN_MULT_LO_EXPR <vect__1.10_40, vect__3.13_43>; >> vect_patt_18.14_45 = WIDEN_MULT_HI_EXPR <vect__1.10_40, vect__3.13_43>; >> vect__6.15_46 = vect_patt_18.14_44 >> 16; >> vect__6.15_47 = vect_patt_18.14_45 >> 16; >> vect__7.16_48 = VEC_PACK_TRUNC_EXPR <vect__6.15_46, vect__6.15_47>; >> MEM <vector(8) short int> [(short int *)&small_c + ivtmp.34_4 * 1] = vect__7.16_48; >> ivtmp.34_5 = ivtmp.34_4 + 16; >> if (ivtmp.34_5 != _29) >> goto <bb 5>; [75.00%] >> else >> goto <bb 6>; [25.00%] >> >> ... 
>> >> *asm*: >> >> .L4: >> movdqu small_b(%rax), %xmm3 >> movdqu small_a(%rax), %xmm1 >> addq $16, %rax >> movdqu small_a-16(%rax), %xmm2 >> pmullw %xmm3, %xmm1 >> pmulhw %xmm3, %xmm2 >> movdqa %xmm1, %xmm0 >> punpcklwd %xmm2, %xmm0 >> punpckhwd %xmm2, %xmm1 >> psrad $16, %xmm1 >> psrad $16, %xmm0 >> movdqa %xmm0, %xmm2 >> punpcklwd %xmm1, %xmm0 >> punpckhwd %xmm1, %xmm2 >> movdqa %xmm0, %xmm1 >> punpckhwd %xmm2, %xmm1 >> punpcklwd %xmm2, %xmm0 >> punpcklwd %xmm1, %xmm0 >> movups %xmm0, small_c-16(%rax) >> cmpq %rdx, %rax >> jne .L4 >> movl %edi, %eax >> andl $-8, %eax >> testb $7, %dil >> je .L14 >> >> *insn dist in loop* >> >> 1 addq >> 3 movdqa >> 3 movdqu >> 1 movups >> 1 pmulhw >> 1 pmullw >> 2 psrad >> 3 punpckhwd >> 4 punpcklwd >> >> >> 2) with this patch but make the mul_highpart optab query return false, the IR looks >> like: >> >> <bb 4> [local count: 94607391]: >> bnd.5_37 = niters.4_22 >> 3; >> _13 = (sizetype) bnd.5_37; >> _32 = _13 * 16; >> >> <bb 5> [local count: 378429566]: >> # ivtmp.33_4 = PHI <ivtmp.33_5(5), 0(4)> >> vect__1.10_43 = MEM <vector(8) short int> [(short int *)&small_a + ivtmp.33_4 * 1]; >> vect__3.13_46 = MEM <vector(8) short int> [(short int *)&small_b + ivtmp.33_4 * 1]; >> vect_patt_26.14_47 = vect__1.10_43 h* vect__3.13_46; >> MEM <vector(8) short int> [(short int *)&small_c + ivtmp.33_4 * 1] = vect_patt_26.14_47; >> ivtmp.33_5 = ivtmp.33_4 + 16; >> if (ivtmp.33_5 != _32) >> goto <bb 5>; [75.00%] >> else >> goto <bb 6>; [25.00%] >> >> *asm*: >> >> .L4: >> movdqu small_b(%rax), %xmm3 >> movdqu small_a(%rax), %xmm1 >> addq $16, %rax >> movdqu small_a-16(%rax), %xmm2 >> pmullw %xmm3, %xmm1 >> pmulhw %xmm3, %xmm2 >> movdqa %xmm1, %xmm0 >> punpcklwd %xmm2, %xmm0 >> punpckhwd %xmm2, %xmm1 >> movdqa %xmm0, %xmm2 >> punpcklwd %xmm1, %xmm0 >> punpckhwd %xmm1, %xmm2 >> movdqa %xmm0, %xmm1 >> punpckhwd %xmm2, %xmm1 >> punpcklwd %xmm2, %xmm0 >> punpckhwd %xmm1, %xmm0 >> movups %xmm0, small_c-16(%rax) >> cmpq %rdx, %rax >> jne .L4 >> movl %edi, %eax >> andl $-8, %eax >> testb $7, %dil >> je .L14 >> >> *insn dist*: >> >> 1 addq >> 3 movdqa >> 3 movdqu >> 1 movups >> 1 pmulhw >> 1 pmullw >> 4 punpckhwd >> 3 punpcklwd >> >> I know little on i386 asm, but this seems better from insn distribution, >> the one without this patch uses two more psrad (s). > > So the advantage of 1) would be more appropriate costing in the vectorizer > (x86 does not have a native vector highpart multiply). 
> >> 3) FWIW with this patch as well as existing optab supports, the IR looks like: >> >> <bb 4> [local count: 94607391]: >> bnd.5_40 = niters.4_19 >> 3; >> _75 = (sizetype) bnd.5_40; >> _91 = _75 * 16; >> >> <bb 5> [local count: 378429566]: >> # ivtmp.48_53 = PHI <ivtmp.48_45(5), 0(4)> >> vect__1.10_48 = MEM <vector(8) short int> [(short int *)&small_a + ivtmp.48_53 * 1]; >> vect__3.13_51 = MEM <vector(8) short int> [(short int *)&small_b + ivtmp.48_53 * 1]; >> vect_patt_26.14_52 = vect__1.10_48 h* vect__3.13_51; >> MEM <vector(8) short int> [(short int *)&small_c + ivtmp.48_53 * 1] = vect_patt_26.14_52; >> ivtmp.48_45 = ivtmp.48_53 + 16; >> if (ivtmp.48_45 != _91) >> goto <bb 5>; [75.00%] >> else >> goto <bb 6>; [25.00%] >> >> <bb 6> [local count: 94607391]: >> niters_vector_mult_vf.6_41 = niters.4_19 & 4294967288; >> tmp.7_42 = (int) niters_vector_mult_vf.6_41; >> _61 = niters.4_19 & 7; >> if (_61 == 0) >> goto <bb 12>; [12.50%] >> else >> goto <bb 7>; [87.50%] >> >> <bb 7> [local count: 93293400]: >> # i_38 = PHI <tmp.7_42(6), 0(3)> >> # _44 = PHI <niters_vector_mult_vf.6_41(6), 0(3)> >> niters.18_60 = niters.4_19 - _44; >> _76 = niters.18_60 + 4294967295; >> if (_76 <= 2) >> goto <bb 9>; [10.00%] >> else >> goto <bb 8>; [90.00%] >> >> <bb 8> [local count: 167928121]: >> _85 = (sizetype) _44; >> _86 = _85 * 2; >> vectp_small_a.23_84 = &small_a + _86; >> vectp_small_b.26_90 = &small_b + _86; >> vectp_small_c.31_98 = &small_c + _86; >> vect__9.24_89 = MEM <vector(4) short int> [(short int *)vectp_small_a.23_84]; >> vect__28.27_95 = MEM <vector(4) short int> [(short int *)vectp_small_b.26_90]; >> vect_patt_23.28_96 = vect__9.24_89 h* vect__28.27_95; >> MEM <vector(4) short int> [(short int *)vectp_small_c.31_98] = vect_patt_23.28_96; >> niters_vector_mult_vf.20_80 = niters.18_60 & 4294967292; >> _82 = (int) niters_vector_mult_vf.20_80; >> tmp.21_81 = i_38 + _82; >> _74 = niters.18_60 & 3; >> if (_74 == 0) >> goto <bb 11>; [25.00%] >> else >> goto <bb 9>; [75.00%] >> >> ... >> >> *asm*: >> >> .L4: >> movdqu small_a(%rax), %xmm0 >> movdqu small_b(%rax), %xmm2 >> addq $16, %rax >> pmulhw %xmm2, %xmm0 >> movups %xmm0, small_c-16(%rax) >> cmpq %rdx, %rax >> jne .L4 >> movl %ecx, %edx >> andl $-8, %edx >> movl %edx, %eax >> testb $7, %cl >> je .L19 >> .L3: >> movl %ecx, %esi >> subl %edx, %esi >> leal -1(%rsi), %edi >> cmpl $2, %edi >> jbe .L6 >> movq small_a(%rdx,%rdx), %xmm0 >> movq small_b(%rdx,%rdx), %xmm1 >> pmulhw %xmm1, %xmm0 >> movq %xmm0, small_c(%rdx,%rdx) >> movl %esi, %edx >> andl $-4, %edx >> addl %edx, %eax >> andl $3, %esi >> je .L1 > > Oh, so indeed x86 has a vector highpart multiply, not grep-friendly > macroized as <s>mul<mode>3_highpart ;) > >> I guess the proposed IFN would be directly mapped for [us]mul_highpart? > > Yes. > Thanks for confirming! The related patch v2 is attached and the testing is ongoing. >> or do you expect it will also cover what we support in can_mult_highpart_p? > > See above - since we manage to use widen_mult we should expose this > detail for the purpose of costing. So the pattern would directly ask for > an optab IFN mapping to [us]mul_highpart. > OK, the costing issue seems already exist since vect_recog_divmod_pattern checks with can_mult_highpart_p and generate mul_highpart in some path which probably ends up with widen_mult_{odd,even,lo,hi}, I guess we can fix the related costing routine by querying can_mult_highpart_p and price it more for mul_highpart input and it has to work with widen_mult_{odd...}. btw, IIUC Richard S. 
proposed to generate these widen_mult_{odd,even,lo,hi} in gimple rather than expand phase, since vector perm is involved it seems we have to generate them during transforming? Otherwise the mul_highpart in scalar code like pattern recog divmod seems hard to be expanded? BR, Kewen ----- gcc/ChangeLog: * internal-fn.c (first_commutative_argument): Add info for IFN_MULH. * internal-fn.def (IFN_MULH): New internal function. * tree-vect-patterns.c (vect_recog_mulhs_pattern): Add support to recog normal multiply highpart as IFN_MULH. --- gcc/internal-fn.c | 1 + gcc/internal-fn.def | 2 ++ gcc/tree-vect-patterns.c | 45 +++++++++++++++++++++++++++------------- 3 files changed, 34 insertions(+), 14 deletions(-) diff --git a/gcc/internal-fn.c b/gcc/internal-fn.c index fb8b43d1ce2..b1b4289357c 100644 --- a/gcc/internal-fn.c +++ b/gcc/internal-fn.c @@ -3703,6 +3703,7 @@ first_commutative_argument (internal_fn fn) case IFN_FNMS: case IFN_AVG_FLOOR: case IFN_AVG_CEIL: + case IFN_MULH: case IFN_MULHS: case IFN_MULHRS: case IFN_FMIN: diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def index c3b8e730960..ed6d7de1680 100644 --- a/gcc/internal-fn.def +++ b/gcc/internal-fn.def @@ -169,6 +169,8 @@ DEF_INTERNAL_SIGNED_OPTAB_FN (AVG_FLOOR, ECF_CONST | ECF_NOTHROW, first, DEF_INTERNAL_SIGNED_OPTAB_FN (AVG_CEIL, ECF_CONST | ECF_NOTHROW, first, savg_ceil, uavg_ceil, binary) +DEF_INTERNAL_SIGNED_OPTAB_FN (MULH, ECF_CONST | ECF_NOTHROW, first, + smul_highpart, umul_highpart, binary) DEF_INTERNAL_SIGNED_OPTAB_FN (MULHS, ECF_CONST | ECF_NOTHROW, first, smulhs, umulhs, binary) DEF_INTERNAL_SIGNED_OPTAB_FN (MULHRS, ECF_CONST | ECF_NOTHROW, first, diff --git a/gcc/tree-vect-patterns.c b/gcc/tree-vect-patterns.c index b2e7fc2cc7a..4581cc6b51c 100644 --- a/gcc/tree-vect-patterns.c +++ b/gcc/tree-vect-patterns.c @@ -1896,8 +1896,15 @@ vect_recog_over_widening_pattern (vec_info *vinfo, 1) Multiply high with scaling TYPE res = ((TYPE) a * (TYPE) b) >> c; + Here, c is bitsize (TYPE) / 2 - 1. + 2) ... or also with rounding TYPE res = (((TYPE) a * (TYPE) b) >> d + 1) >> 1; + Here, d is bitsize (TYPE) / 2 - 2. + + 3) Normal multiply high + TYPE res = ((TYPE) a * (TYPE) b) >> e; + Here, e is bitsize (TYPE) / 2. where only the bottom half of res is used. */ @@ -1942,7 +1949,6 @@ vect_recog_mulhs_pattern (vec_info *vinfo, stmt_vec_info mulh_stmt_info; tree scale_term; internal_fn ifn; - unsigned int expect_offset; /* Check for the presence of the rounding term. */ if (gimple_assign_rhs_code (rshift_input_stmt) == PLUS_EXPR) @@ -1991,25 +1997,37 @@ vect_recog_mulhs_pattern (vec_info *vinfo, /* Get the scaling term. */ scale_term = gimple_assign_rhs2 (plus_input_stmt); + /* Check that the scaling factor is correct. */ + if (TREE_CODE (scale_term) != INTEGER_CST) + return NULL; + + /* Check pattern 2). */ + if (wi::to_widest (scale_term) + target_precision + 2 + != TYPE_PRECISION (lhs_type)) + return NULL; - expect_offset = target_precision + 2; ifn = IFN_MULHRS; } else { mulh_stmt_info = rshift_input_stmt_info; scale_term = gimple_assign_rhs2 (last_stmt); + /* Check that the scaling factor is correct. */ + if (TREE_CODE (scale_term) != INTEGER_CST) + return NULL; - expect_offset = target_precision + 1; - ifn = IFN_MULHS; + /* Check for pattern 1). */ + if (wi::to_widest (scale_term) + target_precision + 1 + == TYPE_PRECISION (lhs_type)) + ifn = IFN_MULHS; + /* Check for pattern 3). 
*/ + else if (wi::to_widest (scale_term) + target_precision + == TYPE_PRECISION (lhs_type)) + ifn = IFN_MULH; + else + return NULL; } - /* Check that the scaling factor is correct. */ - if (TREE_CODE (scale_term) != INTEGER_CST - || wi::to_widest (scale_term) + expect_offset - != TYPE_PRECISION (lhs_type)) - return NULL; - /* Check whether the scaling input term can be seen as two widened inputs multiplied together. */ vect_unpromoted_value unprom_mult[2]; @@ -2030,8 +2048,7 @@ vect_recog_mulhs_pattern (vec_info *vinfo, /* Check for target support. */ tree new_vectype = get_vectype_for_scalar_type (vinfo, new_type); if (!new_vectype - || !direct_internal_fn_supported_p - (ifn, new_vectype, OPTIMIZE_FOR_SPEED)) + || !direct_internal_fn_supported_p (ifn, new_vectype, OPTIMIZE_FOR_SPEED)) return NULL; /* The IR requires a valid vector type for the cast result, even though @@ -2043,8 +2060,8 @@ vect_recog_mulhs_pattern (vec_info *vinfo, /* Generate the IFN_MULHRS call. */ tree new_var = vect_recog_temp_ssa_var (new_type, NULL); tree new_ops[2]; - vect_convert_inputs (vinfo, last_stmt_info, 2, new_ops, new_type, - unprom_mult, new_vectype); + vect_convert_inputs (vinfo, last_stmt_info, 2, new_ops, new_type, unprom_mult, + new_vectype); gcall *mulhrs_stmt = gimple_build_call_internal (ifn, 2, new_ops[0], new_ops[1]); gimple_call_set_lhs (mulhrs_stmt, new_var);
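As a usage sketch for the patch above, here is the test case from earlier in the thread that the new IFN_MULH path handles; with the patch applied, the vectorized loop body reduces to a single `h*` statement, and to a single highpart-multiply instruction on targets that provide the optab:

```
#define N 32
typedef signed int bigType;
typedef signed short smallType;
#define SH_CNT 16

extern smallType small_a[N], small_b[N], small_c[N];

/* With the patch applied, the vectorizer recognizes the
   widen-multiply-shift sequence below as IFN_MULH and emits
   one highpart multiply per vector iteration.  */
__attribute__ ((noipa)) void
test_si (int n)
{
  for (int i = 0; i < n; i++)
    small_c[i] = ((bigType) small_a[i] * (bigType) small_b[i]) >> SH_CNT;
}
```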
"Kewen.Lin" <linkw@linux.ibm.com> writes: > gcc/ChangeLog: > > * internal-fn.c (first_commutative_argument): Add info for IFN_MULH. > * internal-fn.def (IFN_MULH): New internal function. > * tree-vect-patterns.c (vect_recog_mulhs_pattern): Add support to > recog normal multiply highpart as IFN_MULH. LGTM FWIW, although: > @@ -2030,8 +2048,7 @@ vect_recog_mulhs_pattern (vec_info *vinfo, > /* Check for target support. */ > tree new_vectype = get_vectype_for_scalar_type (vinfo, new_type); > if (!new_vectype > - || !direct_internal_fn_supported_p > - (ifn, new_vectype, OPTIMIZE_FOR_SPEED)) > + || !direct_internal_fn_supported_p (ifn, new_vectype, OPTIMIZE_FOR_SPEED)) > return NULL; > > /* The IR requires a valid vector type for the cast result, even though > @@ -2043,8 +2060,8 @@ vect_recog_mulhs_pattern (vec_info *vinfo, > /* Generate the IFN_MULHRS call. */ > tree new_var = vect_recog_temp_ssa_var (new_type, NULL); > tree new_ops[2]; > - vect_convert_inputs (vinfo, last_stmt_info, 2, new_ops, new_type, > - unprom_mult, new_vectype); > + vect_convert_inputs (vinfo, last_stmt_info, 2, new_ops, new_type, unprom_mult, > + new_vectype); > gcall *mulhrs_stmt > = gimple_build_call_internal (ifn, 2, new_ops[0], new_ops[1]); > gimple_call_set_lhs (mulhrs_stmt, new_var); …these changes look like formatting only. (I guess it's down to whether or not the 80th column should be kept free for an “end of line+1” cursor.) Thanks, Richard
Hi Richard, on 2021/7/14 4:38 PM, Richard Sandiford wrote: > "Kewen.Lin" <linkw@linux.ibm.com> writes: >> gcc/ChangeLog: >> >> * internal-fn.c (first_commutative_argument): Add info for IFN_MULH. >> * internal-fn.def (IFN_MULH): New internal function. >> * tree-vect-patterns.c (vect_recog_mulhs_pattern): Add support to >> recog normal multiply highpart as IFN_MULH. > > LGTM FWIW, although: > Thanks for the review! >> @@ -2030,8 +2048,7 @@ vect_recog_mulhs_pattern (vec_info *vinfo, >> /* Check for target support. */ >> tree new_vectype = get_vectype_for_scalar_type (vinfo, new_type); >> if (!new_vectype >> - || !direct_internal_fn_supported_p >> - (ifn, new_vectype, OPTIMIZE_FOR_SPEED)) >> + || !direct_internal_fn_supported_p (ifn, new_vectype, OPTIMIZE_FOR_SPEED)) >> return NULL; >> >> /* The IR requires a valid vector type for the cast result, even though >> @@ -2043,8 +2060,8 @@ vect_recog_mulhs_pattern (vec_info *vinfo, >> /* Generate the IFN_MULHRS call. */ >> tree new_var = vect_recog_temp_ssa_var (new_type, NULL); >> tree new_ops[2]; >> - vect_convert_inputs (vinfo, last_stmt_info, 2, new_ops, new_type, >> - unprom_mult, new_vectype); >> + vect_convert_inputs (vinfo, last_stmt_info, 2, new_ops, new_type, unprom_mult, >> + new_vectype); >> gcall *mulhrs_stmt >> = gimple_build_call_internal (ifn, 2, new_ops[0], new_ops[1]); >> gimple_call_set_lhs (mulhrs_stmt, new_var); > > …these changes look like formatting only. (I guess it's down to whether > or not the 80th column should be kept free for an “end of line+1” cursor.) > Yeah, those were just formatting changes; the formatting tool (clang-format) reformatted them. Thanks for the information on the “end of line+1” cursor, I didn't know about that before. I guess you'd prefer me to keep the original formatting? If so I will drop those hunks when committing. I was wondering whether I should change the ColumnLimit field of my .clang-format to 79 so that this kind of case isn't flagged by the formatting tool again. Hope reviewers won't nit-pick the exact-80-column cases then. :) BR, Kewen
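For reference, a minimal .clang-format fragment matching what Kewen describes above; only the ColumnLimit value comes from this discussion, everything else is left at the project's defaults:

```
# Keep the 80th column free for an "end of line + 1" cursor,
# per the convention discussed in this thread.
ColumnLimit: 79
```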
"Kewen.Lin" <linkw@linux.ibm.com> writes: > Hi Richard, > > on 2021/7/14 下午4:38, Richard Sandiford wrote: >> "Kewen.Lin" <linkw@linux.ibm.com> writes: >>> gcc/ChangeLog: >>> >>> * internal-fn.c (first_commutative_argument): Add info for IFN_MULH. >>> * internal-fn.def (IFN_MULH): New internal function. >>> * tree-vect-patterns.c (vect_recog_mulhs_pattern): Add support to >>> recog normal multiply highpart as IFN_MULH. >> >> LGTM FWIW, although: >> > > Thanks for the review! > >>> @@ -2030,8 +2048,7 @@ vect_recog_mulhs_pattern (vec_info *vinfo, >>> /* Check for target support. */ >>> tree new_vectype = get_vectype_for_scalar_type (vinfo, new_type); >>> if (!new_vectype >>> - || !direct_internal_fn_supported_p >>> - (ifn, new_vectype, OPTIMIZE_FOR_SPEED)) >>> + || !direct_internal_fn_supported_p (ifn, new_vectype, OPTIMIZE_FOR_SPEED)) >>> return NULL; >>> >>> /* The IR requires a valid vector type for the cast result, even though >>> @@ -2043,8 +2060,8 @@ vect_recog_mulhs_pattern (vec_info *vinfo, >>> /* Generate the IFN_MULHRS call. */ >>> tree new_var = vect_recog_temp_ssa_var (new_type, NULL); >>> tree new_ops[2]; >>> - vect_convert_inputs (vinfo, last_stmt_info, 2, new_ops, new_type, >>> - unprom_mult, new_vectype); >>> + vect_convert_inputs (vinfo, last_stmt_info, 2, new_ops, new_type, unprom_mult, >>> + new_vectype); >>> gcall *mulhrs_stmt >>> = gimple_build_call_internal (ifn, 2, new_ops[0], new_ops[1]); >>> gimple_call_set_lhs (mulhrs_stmt, new_var); >> >> …these changes look like formatting only. (I guess it's down to whether >> or not the 80th column should be kept free for an “end of line+1” cursor.) >> > > Yeah, just for formatting, the formatting tool (clang-format) reformatted > them. Thanks for the information on "end of line+1" cursor, I didn't know > that before. I guess you prefer me to keep the original format? If so I > will remove them when committing it. I was thinking whether I should change > field ColumnLimit of my .clang-format to 79 to avoid this kind of case to > be caught by formatting tool again. Hope reviewers won't nit-pick the exact > 80 column cases then. :) TBH, 79 vs. 80 isn't normally something I'd worry about when reviewing new code. But I know in the past people have asked for 79 to be used for the “end+1” reason, so I don't think we should “fix” existing code that honours the 79 limit so that it no longer does, especially when the lines surrounding the code aren't changing. There's also a risk of yo-yo-ing if someone else is using clang-format and does have the limit set to 79 columns. So yeah, I think it'd better to commit without the two hunks above. Thanks, Richard
On Wed, Jul 14, 2021 at 12:32:24PM +0100, Richard Sandiford wrote:
> TBH, 79 vs. 80 isn't normally something I'd worry about when reviewing
> new code.  But I know in the past people have asked for 79 to be used
> for the “end+1” reason, so I don't think we should “fix” existing code
> that honours the 79 limit so that it no longer does, especially when the
> lines surrounding the code aren't changing.

The normal rule is you cannot go over 80.  It is perfectly fine to have
shorter lines, certainly if that is nice for some other reason, so
automatically (by some tool) changing this is Just Wrong.


Segher
on 2021/7/14 下午7:32, Richard Sandiford wrote:
> "Kewen.Lin" <linkw@linux.ibm.com> writes:
>> Hi Richard,
>>
>> on 2021/7/14 下午4:38, Richard Sandiford wrote:
>>> "Kewen.Lin" <linkw@linux.ibm.com> writes:
>>>> gcc/ChangeLog:
>>>>
>>>> 	* internal-fn.c (first_commutative_argument): Add info for IFN_MULH.
>>>> 	* internal-fn.def (IFN_MULH): New internal function.
>>>> 	* tree-vect-patterns.c (vect_recog_mulhs_pattern): Add support to
>>>> 	recog normal multiply highpart as IFN_MULH.
>>>
>>> LGTM FWIW, although:
>>>
>>
>> Thanks for the review!
>>
>>>> @@ -2030,8 +2048,7 @@ vect_recog_mulhs_pattern (vec_info *vinfo,
>>>>    /* Check for target support.  */
>>>>    tree new_vectype = get_vectype_for_scalar_type (vinfo, new_type);
>>>>    if (!new_vectype
>>>> -      || !direct_internal_fn_supported_p
>>>> -	    (ifn, new_vectype, OPTIMIZE_FOR_SPEED))
>>>> +      || !direct_internal_fn_supported_p (ifn, new_vectype, OPTIMIZE_FOR_SPEED))
>>>>      return NULL;
>>>>
>>>>    /* The IR requires a valid vector type for the cast result, even though
>>>> @@ -2043,8 +2060,8 @@ vect_recog_mulhs_pattern (vec_info *vinfo,
>>>>    /* Generate the IFN_MULHRS call.  */
>>>>    tree new_var = vect_recog_temp_ssa_var (new_type, NULL);
>>>>    tree new_ops[2];
>>>> -  vect_convert_inputs (vinfo, last_stmt_info, 2, new_ops, new_type,
>>>> -		       unprom_mult, new_vectype);
>>>> +  vect_convert_inputs (vinfo, last_stmt_info, 2, new_ops, new_type, unprom_mult,
>>>> +		       new_vectype);
>>>>    gcall *mulhrs_stmt
>>>>      = gimple_build_call_internal (ifn, 2, new_ops[0], new_ops[1]);
>>>>    gimple_call_set_lhs (mulhrs_stmt, new_var);
>>>
>>> …these changes look like formatting only.  (I guess it's down to whether
>>> or not the 80th column should be kept free for an “end of line+1” cursor.)
>>>
>>
>> Yeah, just for formatting; the formatting tool (clang-format) reformatted
>> them.  Thanks for the information on the “end of line+1” cursor, I didn't
>> know that before.  I guess you would prefer me to keep the original
>> format?  If so, I will remove these hunks when committing.  I was
>> wondering whether I should change the ColumnLimit field of my
>> .clang-format to 79, to avoid this kind of case being caught by the
>> formatting tool again.  Hope reviewers won't nit-pick the exact 80-column
>> cases then. :)
>
> TBH, 79 vs. 80 isn't normally something I'd worry about when reviewing
> new code.  But I know in the past people have asked for 79 to be used
> for the “end+1” reason, so I don't think we should “fix” existing code
> that honours the 79 limit so that it no longer does, especially when the
> lines surrounding the code aren't changing.
>

Thanks for the explanation!  Agreed.

> There's also a risk of yo-yo-ing if someone else is using clang-format
> and does have the limit set to 79 columns.
>
> So yeah, I think it'd be better to commit without the two hunks above.
>

Will fix them.  Thanks for catching this and for the explanations!

BR,
Kewen
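[For reference, the setting discussed above is a one-line tweak in a
.clang-format file.  The fragment below is only an illustrative sketch of
such a local override, not GCC's shipped contrib/clang-format
configuration:

# Hypothetical local override: leave column 80 free for an
# "end of line + 1" cursor, per the discussion above.
ColumnLimit: 79

With this in place, clang-format would wrap at 79 columns and stop
reflowing lines that already honour the 79-column limit.]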
on 2021/7/15 上午3:32, Segher Boessenkool wrote:
> On Wed, Jul 14, 2021 at 12:32:24PM +0100, Richard Sandiford wrote:
>> TBH, 79 vs. 80 isn't normally something I'd worry about when reviewing
>> new code.  But I know in the past people have asked for 79 to be used
>> for the “end+1” reason, so I don't think we should “fix” existing code
>> that honours the 79 limit so that it no longer does, especially when the
>> lines surrounding the code aren't changing.
>
> The normal rule is you cannot go over 80.  It is perfectly fine to have
> shorter lines, certainly if that is nice for some other reason, so
> automatically (by some tool) changing this is Just Wrong.
>

OK, could this be applied to changelog entries too?  I guess yes?

BR,
Kewen
On Thu, Jul 15, 2021 at 09:40:52AM +0800, Kewen.Lin wrote:
> on 2021/7/15 上午3:32, Segher Boessenkool wrote:
> > The normal rule is you cannot go over 80.  It is perfectly fine to have
> > shorter lines, certainly if that is nice for some other reason, so
> > automatically (by some tool) changing this is Just Wrong.
>
> OK, could this be applied to changelog entries too?  I guess yes?

Yes, lines of length 80 are fine in changelogs.  But try not to make
short lines (that do not end an entry).  A changelog that looks different
from other changelogs is harder to read.  Normally you have a whole bunch
of totally boring entries ("New." or "Likewise." for example), and the
few that are longer naturally stand out then, making it easier to scan
the changelogs (which is what they are used for most of the time: search
for something, and press "n" a lot).

Also try to write less trivial things somewhat briefly in changelogs:
changelogs just say *what* changed, not *why*, and it is okay to leave
out details (this is a tradeoff of course).


Segher
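[As a concrete illustration of the style Segher describes, a typical run
of brief entries looks like the following; the file and function names
here are made up, not taken from this patch:

	* tree-foo.c (foo_analyze): Handle MULT_HIGHPART_EXPR.
	(foo_transform): Likewise.
	* tree-foo.h (foo_has_highpart_p): New.

The short "Likewise."/"New." entries carry no prose, so the rare longer
entry is the one a reader scanning the changelog actually notices.]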
diff --git a/gcc/tree-vect-patterns.c b/gcc/tree-vect-patterns.c
index b2e7fc2cc7a..9253c8088e9 100644
--- a/gcc/tree-vect-patterns.c
+++ b/gcc/tree-vect-patterns.c
@@ -1896,8 +1896,15 @@ vect_recog_over_widening_pattern (vec_info *vinfo,
 
    1) Multiply high with scaling
      TYPE res = ((TYPE) a * (TYPE) b) >> c;
+     Here, c is bitsize (TYPE) / 2 - 1.
+
    2) ... or also with rounding
      TYPE res = (((TYPE) a * (TYPE) b) >> d + 1) >> 1;
+     Here, d is bitsize (TYPE) / 2 - 2.
+
+   3) Normal multiply high
+     TYPE res = ((TYPE) a * (TYPE) b) >> e;
+     Here, e is bitsize (TYPE) / 2.
 
    where only the bottom half of res is used.  */
 
@@ -1942,7 +1949,6 @@ vect_recog_mulhs_pattern (vec_info *vinfo,
   stmt_vec_info mulh_stmt_info;
   tree scale_term;
   internal_fn ifn;
-  unsigned int expect_offset;
 
   /* Check for the presence of the rounding term.  */
   if (gimple_assign_rhs_code (rshift_input_stmt) == PLUS_EXPR)
@@ -1991,25 +1997,37 @@ vect_recog_mulhs_pattern (vec_info *vinfo,
 
       /* Get the scaling term.  */
       scale_term = gimple_assign_rhs2 (plus_input_stmt);
+      /* Check that the scaling factor is correct.  */
+      if (TREE_CODE (scale_term) != INTEGER_CST)
+	return NULL;
+
+      /* Check pattern 2).  */
+      if (wi::to_widest (scale_term) + target_precision + 2
+	  != TYPE_PRECISION (lhs_type))
+	return NULL;
 
-      expect_offset = target_precision + 2;
       ifn = IFN_MULHRS;
     }
   else
     {
       mulh_stmt_info = rshift_input_stmt_info;
       scale_term = gimple_assign_rhs2 (last_stmt);
+      /* Check that the scaling factor is correct.  */
+      if (TREE_CODE (scale_term) != INTEGER_CST)
+	return NULL;
 
-      expect_offset = target_precision + 1;
-      ifn = IFN_MULHS;
+      /* Check for pattern 1).  */
+      if (wi::to_widest (scale_term) + target_precision + 1
+	  == TYPE_PRECISION (lhs_type))
+	ifn = IFN_MULHS;
+      /* Check for pattern 3).  */
+      else if (wi::to_widest (scale_term) + target_precision
+	       == TYPE_PRECISION (lhs_type))
+	ifn = IFN_LAST;
+      else
+	return NULL;
     }
 
-  /* Check that the scaling factor is correct.  */
-  if (TREE_CODE (scale_term) != INTEGER_CST
-      || wi::to_widest (scale_term) + expect_offset
-	 != TYPE_PRECISION (lhs_type))
-    return NULL;
-
   /* Check whether the scaling input term can be seen as two widened
      inputs multiplied together.  */
   vect_unpromoted_value unprom_mult[2];
@@ -2029,9 +2047,14 @@ vect_recog_mulhs_pattern (vec_info *vinfo,
 
   /* Check for target support.  */
   tree new_vectype = get_vectype_for_scalar_type (vinfo, new_type);
-  if (!new_vectype
-      || !direct_internal_fn_supported_p
-	    (ifn, new_vectype, OPTIMIZE_FOR_SPEED))
+  if (!new_vectype)
+    return NULL;
+  if (ifn != IFN_LAST
+      && !direct_internal_fn_supported_p (ifn, new_vectype, OPTIMIZE_FOR_SPEED))
+    return NULL;
+  else if (ifn == IFN_LAST
+	   && !can_mult_highpart_p (TYPE_MODE (new_vectype),
+				    TYPE_UNSIGNED (new_type)))
     return NULL;
 
   /* The IR requires a valid vector type for the cast result, even though
@@ -2040,14 +2063,20 @@ vect_recog_mulhs_pattern (vec_info *vinfo,
   if (!*type_out)
     return NULL;
 
-  /* Generate the IFN_MULHRS call.  */
+  gimple *mulhrs_stmt;
   tree new_var = vect_recog_temp_ssa_var (new_type, NULL);
   tree new_ops[2];
-  vect_convert_inputs (vinfo, last_stmt_info, 2, new_ops, new_type,
-		       unprom_mult, new_vectype);
-  gcall *mulhrs_stmt
-    = gimple_build_call_internal (ifn, 2, new_ops[0], new_ops[1]);
-  gimple_call_set_lhs (mulhrs_stmt, new_var);
+  vect_convert_inputs (vinfo, last_stmt_info, 2, new_ops, new_type, unprom_mult,
+		       new_vectype);
+  if (ifn == IFN_LAST)
+    mulhrs_stmt = gimple_build_assign (new_var, MULT_HIGHPART_EXPR, new_ops[0],
+				       new_ops[1]);
+  else
+    {
+      /* Generate the IFN_MULHRS call.  */
+      mulhrs_stmt = gimple_build_call_internal (ifn, 2, new_ops[0], new_ops[1]);
+      gimple_call_set_lhs (mulhrs_stmt, new_var);
+    }
   gimple_set_location (mulhrs_stmt, gimple_location (last_stmt));
 
   if (dump_enabled_p ())
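[To illustrate what the extended pattern matches: a hypothetical test
case for pattern 3) (not part of the patch) is a loop like the one below.
With the change, the highpart multiply should be recognized and
vectorized when the target supports a vector multiply-highpart:

#include <stdint.h>

/* Pattern 3): normal multiply highpart.  The product is computed in a
   type twice as wide as the operands and shifted right by half that
   type's bitsize (e = 32 / 2 = 16), and only the low half of the result
   is kept.  Patterns 1) and 2) have the same shape, but with shift
   counts 15 (IFN_MULHS) and 14 followed by a rounding step (IFN_MULHRS).  */
void
mulh_loop (int16_t *restrict res, int16_t *restrict a,
	   int16_t *restrict b, int n)
{
  for (int i = 0; i < n; i++)
    res[i] = ((int32_t) a[i] * (int32_t) b[i]) >> 16;
}

Since the shift count here equals target_precision (16), the new
"ifn = IFN_LAST" path is taken and, subject to can_mult_highpart_p, the
statement is rewritten as a MULT_HIGHPART_EXPR.]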