diff mbox series

[take,#2] x86_64: Improve code expanded for highpart multiplications.

Message ID 001b01d7f594$7acfc280$706f4780$@nextmovesoftware.com
State New
Headers show
Series [take,#2] x86_64: Improve code expanded for highpart multiplications. | expand

Commit Message

Roger Sayle Dec. 20, 2021, 11:26 a.m. UTC
Hi Uros,
Many thanks for the review.  Here's a revised patch incorporating your
suggestion to use a single define_insn with a mode iterator instead of two
new near identical define_insns for SImode and DImode.  I initially tried
SWI48, but a testsuite failure of pr82418.c revealed that it's more
efficient on TARGET_64BIT to use a full width multiplication than a
SImode highpart multiplication, i.e. we only want ?mulsi3_highpart on
!TARGET_64BIT, so this revised patch uses a DWIH mode iterator.

This patch has been tested on x86_64-pc-linux-gnu by make bootstrap and
make -k check (with and without RUNTESTFLAGS="--target_board='unix{-m32}'")
with no new failures.  Ok for mainline?


2021-12-20  Roger Sayle  <roger@nextmovesoftware.com>
	    Uroš Bizjak  <ubizjak@gmail.com>

gcc/ChangeLog
	* config/i386/i386.md (any_mul_highpart): New code iterator.
	(sgnprefix, s): Add attribute support for [su]mul_highpart.
	(<s>mul<mode>3_highpart): Delete expander.
	(<s>mul<mode>3_highpart, <s>mulsi32_highpart_zext):
	New define_insn patterns.
	(define_peephole2): Tweak the register allocation for the above
	instructions after reload.

gcc/testsuite/ChangeLog
	* gcc.target/i386/smuldi3_highpart.c: New test case.

Thanks again,
Roger
--

> -----Original Message-----
> From: Uros Bizjak <ubizjak@gmail.com>
> Sent: 13 December 2021 11:53
> To: Roger Sayle <roger@nextmovesoftware.com>
> Cc: GCC Patches <gcc-patches@gcc.gnu.org>
> Subject: Re: [PATCH] x86_64: Improve code expanded for highpart
> multiplications.
> 
> On Fri, Dec 10, 2021 at 12:58 PM Roger Sayle <roger@nextmovesoftware.com>
> wrote:
> >
> >
> > While working on a middle-end patch to more aggressively use highpart
> > multiplications on targets that support them, I noticed that the RTL
> > expanded by the x86 backend interacts poorly with register allocation
> > leading to suboptimal code.
> >
> > For the testcase,
> > typedef int __attribute ((mode(TI))) ti_t; long foo(long x) {
> >   return ((ti_t)x * 19065) >> 64;
> > }
> >
> > we'd like to avoid:
> > foo:    movq    %rdi, %rax
> >         movl    $19065, %edx
> >         imulq   %rdx
> >         movq    %rdx, %rax
> >         ret
> >
> > and would prefer:
> > foo:    movl    $19065, %eax
> >         imulq   %rdi
> >         movq    %rdx, %rax
> >         ret
> >
> > This patch provides a pair of peephole2 transformations to tweak the
> > spills generated by reload, and at the same time replaces the current
> > define_expand with define_insn patterns using the new [su]mul_highpart
> > RTX codes.  I've left the old-style patterns in the machine
> > description for the time being, but plan to remove these once my
> > planned middle-end improvements make them obsolete.
> >
> > This patch has been tested on x86_64-pc-linux-gnu (both with and
> > without the middle-end changes) by make bootstrap and make -k check
> > with no new failures.  The new test case, which currently passes,
> > ensures that the code we generate isn't adversely affected by changes outside
> the backend.
> > Ok for mainline?
> >
> >
> > 2021-12-10  Roger Sayle  <roger@nextmovesoftware.com>
> >
> > gcc/ChangeLog
> >         * config/i386/i386.md (any_mul_highpart): New code iterator.
> >         (sgnprefix, s): Add attribute support for [su]mul_highpart.
> >         (<s>mul<mode>3_highpart): Delete expander.
> >         (<s>muldi3_highpart, <s>mulsi3_highpart, <s>mulsi32_highpart_zext):
> >         New define_insn patterns.
> >         (define_peephole2): Tweak the register allocation for the above
> >         instructions after reload.
> >
> > gcc/testsuite/ChangeLog
> >         * gcc.target/i386/smuldi3_highpart.c: New test case.
> 
> Can you please merge <s>muldi3_highpart and *<s>mulsi3_highpart using
> SWI48 mode iterator?
> 
> Uros.
> 
> >
> >
> > Thanks in advance,
> > Roger
> > --
> >

Comments

Uros Bizjak Dec. 20, 2021, 12:44 p.m. UTC | #1
On Mon, Dec 20, 2021 at 12:26 PM Roger Sayle <roger@nextmovesoftware.com> wrote:
>
>
> Hi Uros,
> Many thanks for the review.  Here's a revised patch incorporating your
> suggestion to use a single define_insn with a mode iterator instead of two
> new near identical define_insns for SImode and DImode.  I initially tried
> SWI48, but a testsuite failure of pr82418.c revealed that it's more
> efficient on TARGET_64BIT to use a full width multiplication than a
> SImode highpart multiplication, i.e. we only want ?mulsi3_highpart on
> !TARGET_64BIT, so this revised patch uses a DWIH mode iterator.
>
> This patch has been tested on x86_64-pc-linux-gnu by make bootstrap and
> make -k check (with and without RUNTESTFLAGS="--target_board='unix{-m32}'")
> with no new failures.  Ok for mainline?
>
>
> 2021-12-20  Roger Sayle  <roger@nextmovesoftware.com>
>             Uroš Bizjak  <ubizjak@gmail.com>
>
> gcc/ChangeLog
>         * config/i386/i386.md (any_mul_highpart): New code iterator.
>         (sgnprefix, s): Add attribute support for [su]mul_highpart.
>         (<s>mul<mode>3_highpart): Delete expander.
>         (<s>mul<mode>3_highpart, <s>mulsi32_highpart_zext):
>         New define_insn patterns.
>         (define_peephole2): Tweak the register allocation for the above
>         instructions after reload.
>
> gcc/testsuite/ChangeLog
>         * gcc.target/i386/smuldi3_highpart.c: New test case.

OK.

Thanks,
Uros.

>
> Thanks again,
> Roger
> --
>
> > -----Original Message-----
> > From: Uros Bizjak <ubizjak@gmail.com>
> > Sent: 13 December 2021 11:53
> > To: Roger Sayle <roger@nextmovesoftware.com>
> > Cc: GCC Patches <gcc-patches@gcc.gnu.org>
> > Subject: Re: [PATCH] x86_64: Improve code expanded for highpart
> > multiplications.
> >
> > On Fri, Dec 10, 2021 at 12:58 PM Roger Sayle <roger@nextmovesoftware.com>
> > wrote:
> > >
> > >
> > > While working on a middle-end patch to more aggressively use highpart
> > > multiplications on targets that support them, I noticed that the RTL
> > > expanded by the x86 backend interacts poorly with register allocation
> > > leading to suboptimal code.
> > >
> > > For the testcase,
> > > typedef int __attribute ((mode(TI))) ti_t; long foo(long x) {
> > >   return ((ti_t)x * 19065) >> 64;
> > > }
> > >
> > > we'd like to avoid:
> > > foo:    movq    %rdi, %rax
> > >         movl    $19065, %edx
> > >         imulq   %rdx
> > >         movq    %rdx, %rax
> > >         ret
> > >
> > > and would prefer:
> > > foo:    movl    $19065, %eax
> > >         imulq   %rdi
> > >         movq    %rdx, %rax
> > >         ret
> > >
> > > This patch provides a pair of peephole2 transformations to tweak the
> > > spills generated by reload, and at the same time replaces the current
> > > define_expand with define_insn patterns using the new [su]mul_highpart
> > > RTX codes.  I've left the old-style patterns in the machine
> > > description for the time being, but plan to remove these once my
> > > planned middle-end improvements make them obsolete.
> > >
> > > This patch has been tested on x86_64-pc-linux-gnu (both with and
> > > without the middle-end changes) by make bootstrap and make -k check
> > > with no new failures.  The new test case, which currently passes,
> > > ensures that the code we generate isn't adversely affected by changes outside
> > the backend.
> > > Ok for mainline?
> > >
> > >
> > > 2021-12-10  Roger Sayle  <roger@nextmovesoftware.com>
> > >
> > > gcc/ChangeLog
> > >         * config/i386/i386.md (any_mul_highpart): New code iterator.
> > >         (sgnprefix, s): Add attribute support for [su]mul_highpart.
> > >         (<s>mul<mode>3_highpart): Delete expander.
> > >         (<s>muldi3_highpart, <s>mulsi3_highpart, <s>mulsi32_highpart_zext):
> > >         New define_insn patterns.
> > >         (define_peephole2): Tweak the register allocation for the above
> > >         instructions after reload.
> > >
> > > gcc/testsuite/ChangeLog
> > >         * gcc.target/i386/smuldi3_highpart.c: New test case.
> >
> > Can you please merge <s>muldi3_highpart and *<s>mulsi3_highpart using
> > SWI48 mode iterator?
> >
> > Uros.
> >
> > >
> > >
> > > Thanks in advance,
> > > Roger
> > > --
> > >
Iain Sandoe Dec. 25, 2021, 5:39 p.m. UTC | #2
Hi Roger,

> On 20 Dec 2021, at 12:44, Uros Bizjak via Gcc-patches <gcc-patches@gcc.gnu.org> wrote:
> 
> On Mon, Dec 20, 2021 at 12:26 PM Roger Sayle <roger@nextmovesoftware.com> wrote:
>> 

>> Many thanks for the review.  Here's a revised patch incorporating your
>> suggestion to use a single define_insn with a mode iterator instead of two
>> new near identical define_insns for SImode and DImode.  I initially tried
>> SWI48, but a testsuite failure of pr82418.c revealed that it's more
>> efficient on TARGET_64BIT to use a full width multiplication than a
>> SImode highpart multiplication, i.e. we only want ?mulsi3_highpart on
>> !TARGET_64BIT, so this revised patch uses a DWIH mode iterator.
>> 
>> This patch has been tested on x86_64-pc-linux-gnu by make bootstrap and
>> make -k check (with and without RUNTESTFLAGS="--target_board='unix{-m32}'")
>> with no new failures.  Ok for mainline?

Did you try this with Ada and / or on a 32b host?

Somehow this is breaking Ada bootstrap on i686-darwin{9,17}.

I fear debugging why might be quite tricky - since the fail does not ocurr until building
the Ada runtime libs, and the faults are essentially memory access constraint errors -
so the compiler is segv-ing while building the runtime (but the stage2 version obv.
succeeds in building the stage3).

In the short-term, to make some progress, I’m going to back the rev out of my local tree.

Iff you did not already test with Ada and on a 32b host I wonder if you would mind seeing
if the problem repeats for you?

(unless it is already obvious to you what it might be, of course)

thanks
Iain
diff mbox series

Patch

diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index 4e9fae8..00361f4 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -992,11 +992,16 @@ 
 ;; Mapping of extend operators
 (define_code_iterator any_extend [sign_extend zero_extend])
 
+;; Mapping of highpart multiply operators
+(define_code_iterator any_mul_highpart [smul_highpart umul_highpart])
+
 ;; Prefix for insn menmonic.
 (define_code_attr sgnprefix [(sign_extend "i") (zero_extend "")
+			     (smul_highpart "i") (umul_highpart "")
 			     (div "i") (udiv "")])
 ;; Prefix for define_insn
-(define_code_attr s [(sign_extend "s") (zero_extend "u")])
+(define_code_attr s [(sign_extend "s") (zero_extend "u")
+		     (smul_highpart "s") (umul_highpart "u")])
 (define_code_attr u [(sign_extend "") (zero_extend "u")
 		     (div "") (udiv "u")])
 (define_code_attr u_bool [(sign_extend "false") (zero_extend "true")
@@ -8426,20 +8431,45 @@ 
    (set_attr "bdver1_decode" "direct")
    (set_attr "mode" "QI")])
 
-(define_expand "<s>mul<mode>3_highpart"
-  [(parallel [(set (match_operand:DWIH 0 "register_operand")
-		   (truncate:DWIH
-		     (lshiftrt:<DWI>
-		       (mult:<DWI>
-			 (any_extend:<DWI>
-			   (match_operand:DWIH 1 "nonimmediate_operand"))
-			 (any_extend:<DWI>
-			   (match_operand:DWIH 2 "register_operand")))
-		       (match_dup 3))))
-	      (clobber (scratch:DWIH))
-	      (clobber (reg:CC FLAGS_REG))])]
+;; Highpart multiplication patterns
+(define_insn "<s>mul<mode>3_highpart"
+  [(set (match_operand:DWIH 0 "register_operand" "=d")
+	(any_mul_highpart:DWIH
+	  (match_operand:DWIH 1 "register_operand" "%a")
+	  (match_operand:DWIH 2 "nonimmediate_operand" "rm")))
+   (clobber (match_scratch:DWIH 3 "=1"))
+   (clobber (reg:CC FLAGS_REG))]
   ""
-  "operands[3] = GEN_INT (GET_MODE_BITSIZE (<MODE>mode));")
+  "<sgnprefix>mul{<imodesuffix>}\t%2"
+  [(set_attr "type" "imul")
+   (set_attr "length_immediate" "0")
+   (set (attr "athlon_decode")
+     (if_then_else (eq_attr "cpu" "athlon")
+	(const_string "vector")
+	(const_string "double")))
+   (set_attr "amdfam10_decode" "double")
+   (set_attr "bdver1_decode" "direct")
+   (set_attr "mode" "<MODE>")])
+
+(define_insn "*<s>mulsi3_highpart_zext"
+  [(set (match_operand:DI 0 "register_operand" "=d")
+	(zero_extend:DI 
+	  (any_mul_highpart:SI
+	    (match_operand:SI 1 "register_operand" "%a")
+	    (match_operand:SI 2 "nonimmediate_operand" "rm"))))
+   (clobber (match_scratch:SI 3 "=1"))
+   (clobber (reg:CC FLAGS_REG))]
+  "TARGET_64BIT"
+  "<sgnprefix>mul{l}\t%2"
+  [(set_attr "type" "imul")
+   (set_attr "length_immediate" "0")
+   (set (attr "athlon_decode")
+     (if_then_else (eq_attr "cpu" "athlon")
+	(const_string "vector")
+	(const_string "double")))
+   (set_attr "amdfam10_decode" "double")
+   (set_attr "bdver1_decode" "direct")
+   (set_attr "mode" "SI")])
 
 (define_insn "*<s>muldi3_highpart_1"
   [(set (match_operand:DI 0 "register_operand" "=d")
@@ -8460,8 +8490,8 @@ 
    (set_attr "length_immediate" "0")
    (set (attr "athlon_decode")
      (if_then_else (eq_attr "cpu" "athlon")
-        (const_string "vector")
-        (const_string "double")))
+	(const_string "vector")
+	(const_string "double")))
    (set_attr "amdfam10_decode" "double")
    (set_attr "bdver1_decode" "direct")
    (set_attr "mode" "DI")])
@@ -8484,8 +8514,8 @@ 
    (set_attr "length_immediate" "0")
    (set (attr "athlon_decode")
      (if_then_else (eq_attr "cpu" "athlon")
-        (const_string "vector")
-        (const_string "double")))
+	(const_string "vector")
+	(const_string "double")))
    (set_attr "amdfam10_decode" "double")
    (set_attr "bdver1_decode" "direct")
    (set_attr "mode" "SI")])
@@ -8508,12 +8538,54 @@ 
    (set_attr "length_immediate" "0")
    (set (attr "athlon_decode")
      (if_then_else (eq_attr "cpu" "athlon")
-        (const_string "vector")
-        (const_string "double")))
+	(const_string "vector")
+	(const_string "double")))
    (set_attr "amdfam10_decode" "double")
    (set_attr "bdver1_decode" "direct")
    (set_attr "mode" "SI")])
 
+;; Highpart multiplication peephole2s to tweak register allocation.
+;; mov %rdx,imm; mov %rax,%rdi; imulq %rdx  ->  mov %rax,imm; imulq %rdi
+(define_peephole2
+  [(set (match_operand:SWI48 0 "general_reg_operand")
+	(match_operand:SWI48 1 "immediate_operand"))
+   (set (match_operand:SWI48 2 "general_reg_operand")
+	(match_operand:SWI48 3 "general_reg_operand"))
+   (parallel [(set (match_operand:SWI48 4 "general_reg_operand")
+		   (any_mul_highpart:SWI48 (match_dup 2) (match_dup 0)))
+	      (clobber (match_dup 2))
+	      (clobber (reg:CC FLAGS_REG))])]
+  "REGNO (operands[0]) != REGNO (operands[2])
+   && REGNO (operands[0]) != REGNO (operands[3])
+   && (REGNO (operands[0]) == REGNO (operands[4])
+       || peep2_reg_dead_p (3, operands[0]))"
+  [(set (match_dup 2) (match_dup 1))
+   (parallel [(set (match_dup 4)
+		   (any_mul_highpart:SWI48 (match_dup 2) (match_dup 3)))
+	      (clobber (match_dup 2))
+	      (clobber (reg:CC FLAGS_REG))])])
+
+(define_peephole2
+  [(set (match_operand:SI 0 "general_reg_operand")
+	(match_operand:SI 1 "immediate_operand"))
+   (set (match_operand:SI 2 "general_reg_operand")
+	(match_operand:SI 3 "general_reg_operand"))
+   (parallel [(set (match_operand:DI 4 "general_reg_operand")
+		   (zero_extend:DI
+		     (any_mul_highpart:SI (match_dup 2) (match_dup 0))))
+	      (clobber (match_dup 2))
+	      (clobber (reg:CC FLAGS_REG))])]
+  "REGNO (operands[0]) != REGNO (operands[2])
+   && REGNO (operands[0]) != REGNO (operands[3])
+   && (REGNO (operands[0]) == REGNO (operands[4])
+       || peep2_reg_dead_p (3, operands[0]))"
+  [(set (match_dup 2) (match_dup 1))
+   (parallel [(set (match_dup 4)
+		   (zero_extend:DI
+		     (any_mul_highpart:SI (match_dup 2) (match_dup 3))))
+	      (clobber (match_dup 2))
+	      (clobber (reg:CC FLAGS_REG))])])
+
 ;; The patterns that match these are at the end of this file.
 
 (define_expand "mulxf3"
diff --git a/gcc/testsuite/gcc.target/i386/smuldi3_highpart.c b/gcc/testsuite/gcc.target/i386/smuldi3_highpart.c
new file mode 100644
index 0000000..8bbd5f5
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/smuldi3_highpart.c
@@ -0,0 +1,11 @@ 
+/* { dg-do compile { target int128 } } */
+/* { dg-options "-O2" } */
+typedef int __attribute ((mode(TI))) ti_t;
+
+long foo(long x)
+{
+  return ((ti_t)x * 19065) >> 72;
+}
+
+/* { dg-final { scan-assembler "movl\[ \\t]+\\\$19065, %eax" } } */
+/* { dg-final { scan-assembler-times "movq" 1 } } */