diff mbox series

[x86] : Additional peephole2 to use lea in round-up integer division.

Message ID 00c401daca40$64626a60$2d273f20$@nextmovesoftware.com
State New
Headers show
Series [x86] : Additional peephole2 to use lea in round-up integer division. | expand

Commit Message

Roger Sayle June 29, 2024, 4:21 p.m. UTC
A common idiom for implementing an integer division that rounds upwards is
to write (x + y - 1) / y.  Conveniently on x86, the two additions to form
the numerator can be performed by a single lea instruction, and indeed gcc
currently generates a lea when x and y are both registers.

int foo(int x, int y) {
  return (x+y-1)/y;
}

generates with -O2:

foo:    leal    -1(%rsi,%rdi), %eax     // 4 bytes
        cltd
        idivl   %esi
        ret

Oddly, however, if x is a memory operand, gcc currently uses two instructions:

int m;
int bar(int y) {
  return (m+y-1)/y;
}

generates:

bar:    movl    m(%rip), %eax
        addl    %edi, %eax              // 2 bytes
        subl    $1, %eax                // 3 bytes
        cltd
        idivl   %edi
        ret

This discrepancy is caused by the late decision (in peephole2) to split
an addition with a memory operand into a load followed by a reg-reg
addition.  This patch improves the situation by adding a peephole2
to recognize consecutive additions and transform them into a lea
when profitable.

My first attempt at fixing this was to use a define_insn_and_split:

(define_insn_and_split "*lea<mode>3_reg_mem_imm"
  [(set (match_operand:SWI48 0 "register_operand")
       (plus:SWI48 (plus:SWI48 (match_operand:SWI48 1 "register_operand")
                               (match_operand:SWI48 2 "memory_operand"))
                   (match_operand:SWI48 3 "x86_64_immediate_operand")))]
  "ix86_pre_reload_split ()"
  "#"
  "&& 1"
  [(set (match_dup 4) (match_dup 2))
   (set (match_dup 0) (plus:SWI48 (plus:SWI48 (match_dup 1) (match_dup 4))
                                 (match_dup 3)))]
  "operands[4] = gen_reg_rtx (<MODE>mode);")

using combine to combine instructions.  Unfortunately, this approach
interferes with (reload's) subtle balance of deciding when to use/avoid lea,
which can be observed as a code size regression in CSiBE.  The peephole2
approach (proposed here) uniformly improves CSiBE results.

This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
and make -k check, both with and without --target_board=unix{-m32}
with no new failures.  Ok for mainline?


2024-06-29  Roger Sayle  <roger@nextmovesoftware.com>

gcc/ChangeLog
        * config/i386/i386.md (peephole2): Transform two consecutive
        additions into a 3-component lea if !TARGET_AVOID_LEA_FOR_ADDR.

gcc/testsuite/ChangeLog
        * gcc.target/i386/lea-3.c: New test case.
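For reference, the new test presumably checks that a single lea is emitted for the memory-operand case; a plausible sketch (hypothetical — the actual lea-3.c in the patch may differ) is:

```c
/* { dg-do compile } */
/* { dg-options "-O2" } */

int m;

int bar (int y)
{
  return (m + y - 1) / y;
}

/* Expect a 3-component lea rather than separate add/sub.  */
/* { dg-final { scan-assembler "lea" } } */
```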


Thanks in advance,
Roger
--

Comments

Uros Bizjak June 30, 2024, 3:53 p.m. UTC | #1
On Sat, Jun 29, 2024 at 6:21 PM Roger Sayle <roger@nextmovesoftware.com> wrote:
>
>
> A common idiom for implementing an integer division that rounds upwards is
> to write (x + y - 1) / y.  Conveniently on x86, the two additions to form
> the numerator can be performed by a single lea instruction, and indeed gcc
> currently generates a lea when x and y are both registers.
>
> int foo(int x, int y) {
>   return (x+y-1)/y;
> }
>
> generates with -O2:
>
> foo:    leal    -1(%rsi,%rdi), %eax     // 4 bytes
>         cltd
>         idivl   %esi
>         ret
>
> Oddly, however, if x is a memory operand, gcc currently uses two instructions:
>
> int m;
> int bar(int y) {
>   return (m+y-1)/y;
> }
>
> generates:
>
> bar:    movl    m(%rip), %eax
>         addl    %edi, %eax              // 2 bytes
>         subl    $1, %eax                // 3 bytes
>         cltd
>         idivl   %edi
>         ret
>
> This discrepancy is caused by the late decision (in peephole2) to split
> an addition with a memory operand into a load followed by a reg-reg
> addition.  This patch improves the situation by adding a peephole2
> to recognize consecutive additions and transform them into a lea
> when profitable.
>
> My first attempt at fixing this was to use a define_insn_and_split:
>
> (define_insn_and_split "*lea<mode>3_reg_mem_imm"
>   [(set (match_operand:SWI48 0 "register_operand")
>        (plus:SWI48 (plus:SWI48 (match_operand:SWI48 1 "register_operand")
>                                (match_operand:SWI48 2 "memory_operand"))
>                    (match_operand:SWI48 3 "x86_64_immediate_operand")))]
>   "ix86_pre_reload_split ()"
>   "#"
>   "&& 1"
>   [(set (match_dup 4) (match_dup 2))
>    (set (match_dup 0) (plus:SWI48 (plus:SWI48 (match_dup 1) (match_dup 4))
>                                  (match_dup 3)))]
>   "operands[4] = gen_reg_rtx (<MODE>mode);")
>
> using combine to combine instructions.  Unfortunately, this approach
> interferes with (reload's) subtle balance of deciding when to use/avoid lea,
> which can be observed as a code size regression in CSiBE.  The peephole2
> approach (proposed here) uniformly improves CSiBE results.
>
> This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> and make -k check, both with and without --target_board=unix{-m32}
> with no new failures.  Ok for mainline?
>
>
> 2024-06-29  Roger Sayle  <roger@nextmovesoftware.com>
>
> gcc/ChangeLog
>         * config/i386/i386.md (peephole2): Transform two consecutive
>         additions into a 3-component lea if !TARGET_AVOID_LEA_FOR_ADDR.
>
> gcc/testsuite/ChangeLog
>         * gcc.target/i386/lea-3.c: New test case.

Is the assumption that one LEA is always faster than two ADD
instructions universally correct for TARGET_AVOID_LEA_FOR_ADDR?

Please note ix86_lea_outperforms predicate and its uses in
ix86_avoid_lea_for_add(), ix86_use_lea_for_mov(),
ix86_avoid_lea_for_addr() and ix86_lea_for_add_ok(). IMO,
!avoid_lea_for_addr() should be used here, but I didn't check it
thoroughly.

The function comment of avoid_lea_for_addr() says:

/* Return true if we need to split lea into a sequence of
   instructions to avoid AGU stalls during peephole2. */

And your peephole tries to reverse the above split.

Uros.

>
>
> Thanks in advance,
> Roger
> --
>
Roger Sayle June 30, 2024, 7:09 p.m. UTC | #2
Hi Uros,
> On Sat, Jun 29, 2024 at 6:21 PM Roger Sayle <roger@nextmovesoftware.com>
> wrote:
> > A common idiom for implementing an integer division that rounds
> > upwards is to write (x + y - 1) / y.  Conveniently on x86, the two
> > additions to form the numerator can be performed by a single lea
> > instruction, and indeed gcc currently generates a lea when x and y are
> > both registers.
> >
> > int foo(int x, int y) {
> >   return (x+y-1)/y;
> > }
> >
> > generates with -O2:
> >
> > foo:    leal    -1(%rsi,%rdi), %eax     // 4 bytes
> >         cltd
> >         idivl   %esi
> >         ret
> >
> > Oddly, however, if x is a memory operand, gcc currently uses two instructions:
> >
> > int m;
> > int bar(int y) {
> >   return (m+y-1)/y;
> > }
> >
> > generates:
> >
> > bar:    movl    m(%rip), %eax
> >         addl    %edi, %eax              // 2 bytes
> >         subl    $1, %eax                // 3 bytes
> >         cltd
> >         idivl   %edi
> >         ret
> >
> > This discrepancy is caused by the late decision (in peephole2) to
> > split an addition with a memory operand into a load followed by a
> > reg-reg addition.  This patch improves the situation by adding a
> > peephole2 to recognize consecutive additions and transform them into
> > a lea when profitable.
> >
> > My first attempt at fixing this was to use a define_insn_and_split:
> >
> > (define_insn_and_split "*lea<mode>3_reg_mem_imm"
> >   [(set (match_operand:SWI48 0 "register_operand")
> >        (plus:SWI48 (plus:SWI48 (match_operand:SWI48 1 "register_operand")
> >                                (match_operand:SWI48 2 "memory_operand"))
> >                    (match_operand:SWI48 3 "x86_64_immediate_operand")))]
> >   "ix86_pre_reload_split ()"
> >   "#"
> >   "&& 1"
> >   [(set (match_dup 4) (match_dup 2))
> >    (set (match_dup 0) (plus:SWI48 (plus:SWI48 (match_dup 1) (match_dup 4))
> >                                  (match_dup 3)))]
> >   "operands[4] = gen_reg_rtx (<MODE>mode);")
> >
> > using combine to combine instructions.  Unfortunately, this approach
> > interferes with (reload's) subtle balance of deciding when to
> > use/avoid lea, which can be observed as a code size regression in
> > CSiBE.  The peephole2 approach (proposed here) uniformly improves CSiBE
> results.
> >
> > This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> > and make -k check, both with and without --target_board=unix{-m32}
> > with no new failures.  Ok for mainline?
> >
> >
> > 2024-06-29  Roger Sayle  <roger@nextmovesoftware.com>
> >
> > gcc/ChangeLog
> >         * config/i386/i386.md (peephole2): Transform two consecutive
> >         additions into a 3-component lea if !TARGET_AVOID_LEA_FOR_ADDR.
> >
> > gcc/testsuite/ChangeLog
> >         * gcc.target/i386/lea-3.c: New test case.
> 
> Is the assumption that one LEA is always faster than two ADD instructions
> universally correct for TARGET_AVOID_LEA_FOR_ADDR?
> 
> Please note ix86_lea_outperforms predicate and its uses in
> ix86_avoid_lea_for_add(), ix86_use_lea_for_mov(),
> ix86_avoid_lea_for_addr() and ix86_lea_for_add_ok(). IMO,
> !avoid_lea_for_addr() should be used here, but I didn't check it thoroughly.
> 
> The function comment of avoid_lea_for_addr() says:
> 
> /* Return true if we need to split lea into a sequence of
>    instructions to avoid AGU stalls during peephole2. */
> 
> And your peephole tries to reverse the above split.

I completely agree that understanding when/why i386.md converts
an lea into a sequence of additions (and avoiding reversing this split)
is vitally important to understanding my patch.  You're quite right that
the logic governing this ultimately calls ix86_lea_outperforms, but, as
I'll explain below, the shape of those APIs (requiring an insn) is not as
convenient for instruction merging as it is for splitting.

The current location in i386.md where gcc decides whether the
lea in the foo example above needs to be split is at line 6293:

(define_peephole2
  [(set (match_operand:SWI48 0 "register_operand")
    (match_operand:SWI48 1 "address_no_seg_operand"))]
  "ix86_hardreg_mov_ok (operands[0], operands[1])
   && peep2_regno_dead_p (0, FLAGS_REG)
   && ix86_avoid_lea_for_addr (peep2_next_insn (0), operands)"
...

Hence, we transform lea->add+add when ix86_avoid_lea_for_addr
returns true, so by symmetry it is not unreasonable to turn add+add->lea
when ix86_avoid_lea_for_addr would return false.  The relevant part
of ix86_avoid_lea_for_addr is then around line 15974 of i386.cc:

  /* Check we need to optimize.  */
  if (!TARGET_AVOID_LEA_FOR_ADDR || optimize_function_for_size_p (cfun))
    return false;

which you'll recognize is precisely the condition under which my
proposed peephole2 fires.  Technically, we also know that this is
a 3-component lea, "parts.disp" is non-zero, the addition is destructive
(overwriting one of the input register operands), and that we're not
using partial register mode (such as HImode).

I won't argue that my peephole2 is "obviously" correct, but the observation
that it correctly reduces code size when -Os is specified indicates that
something is broken/inconsistent/missing in the current logic, and that
this change is an improvement.

Perhaps the TARGET_AVOID_LEA_FOR_ADDR check in ix86_avoid_lea_for_addr
is incorrect; if so, please feel free to file a bugzilla PR.

Ok for mainline?

> Uros.
> > Thanks in advance,
> > Roger
Uros Bizjak June 30, 2024, 7:20 p.m. UTC | #3
On Sun, Jun 30, 2024 at 9:09 PM Roger Sayle <roger@nextmovesoftware.com> wrote:
>
>
> Hi Uros,
> > On Sat, Jun 29, 2024 at 6:21 PM Roger Sayle <roger@nextmovesoftware.com>
> > wrote:
> > > A common idiom for implementing an integer division that rounds
> > > upwards is to write (x + y - 1) / y.  Conveniently on x86, the two
> > > additions to form the numerator can be performed by a single lea
> > > instruction, and indeed gcc currently generates a lea when x and y are
> > > both registers.
> > >
> > > int foo(int x, int y) {
> > >   return (x+y-1)/y;
> > > }
> > >
> > > generates with -O2:
> > >
> > > foo:    leal    -1(%rsi,%rdi), %eax     // 4 bytes
> > >         cltd
> > >         idivl   %esi
> > >         ret
> > >
> > > Oddly, however, if x is a memory operand, gcc currently uses two instructions:
> > >
> > > int m;
> > > int bar(int y) {
> > >   return (m+y-1)/y;
> > > }
> > >
> > > generates:
> > >
> > > bar:    movl    m(%rip), %eax
> > >         addl    %edi, %eax              // 2 bytes
> > >         subl    $1, %eax                // 3 bytes
> > >         cltd
> > >         idivl   %edi
> > >         ret
> > >
> > > This discrepancy is caused by the late decision (in peephole2) to
> > > split an addition with a memory operand into a load followed by a
> > > reg-reg addition.  This patch improves the situation by adding a
> > > peephole2 to recognize consecutive additions and transform them into
> > > a lea when profitable.
> > >
> > > My first attempt at fixing this was to use a define_insn_and_split:
> > >
> > > (define_insn_and_split "*lea<mode>3_reg_mem_imm"
> > >   [(set (match_operand:SWI48 0 "register_operand")
> > >        (plus:SWI48 (plus:SWI48 (match_operand:SWI48 1 "register_operand")
> > >                                (match_operand:SWI48 2 "memory_operand"))
> > >                    (match_operand:SWI48 3 "x86_64_immediate_operand")))]
> > >   "ix86_pre_reload_split ()"
> > >   "#"
> > >   "&& 1"
> > >   [(set (match_dup 4) (match_dup 2))
> > >    (set (match_dup 0) (plus:SWI48 (plus:SWI48 (match_dup 1) (match_dup 4))
> > >                                  (match_dup 3)))]
> > >   "operands[4] = gen_reg_rtx (<MODE>mode);")
> > >
> > > using combine to combine instructions.  Unfortunately, this approach
> > > interferes with (reload's) subtle balance of deciding when to
> > > use/avoid lea, which can be observed as a code size regression in
> > > CSiBE.  The peephole2 approach (proposed here) uniformly improves CSiBE
> > results.
> > >
> > > This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> > > and make -k check, both with and without --target_board=unix{-m32}
> > > with no new failures.  Ok for mainline?
> > >
> > >
> > > 2024-06-29  Roger Sayle  <roger@nextmovesoftware.com>
> > >
> > > gcc/ChangeLog
> > >         * config/i386/i386.md (peephole2): Transform two consecutive
> > >         additions into a 3-component lea if !TARGET_AVOID_LEA_FOR_ADDR.
> > >
> > > gcc/testsuite/ChangeLog
> > >         * gcc.target/i386/lea-3.c: New test case.
> >
> > Is the assumption that one LEA is always faster than two ADD instructions
> > universally correct for TARGET_AVOID_LEA_FOR_ADDR?
> >
> > Please note ix86_lea_outperforms predicate and its uses in
> > ix86_avoid_lea_for_add(), ix86_use_lea_for_mov(),
> > ix86_avoid_lea_for_addr() and ix86_lea_for_add_ok(). IMO,
> > !avoid_lea_for_addr() should be used here, but I didn't check it thoroughly.
> >
> > The function comment of avoid_lea_for_addr() says:
> >
> > /* Return true if we need to split lea into a sequence of
> >    instructions to avoid AGU stalls during peephole2. */
> >
> > And your peephole tries to reverse the above split.
>
> I completely agree that understanding when/why i386.md converts
> an lea into a sequence of additions (and avoiding reversing this split)
> is vitally important to understanding my patch.  You're quite right that
> the logic governing this ultimately calls ix86_lea_outperforms, but as
> I'll explain below the shape of those APIs (requiring an insn) is not as
> convenient for instruction merging as for splitting.
>
> The current location in i386.md where it decides whether the
> lea in the foo example above needs to be split, is at line 6293:
>
> (define_peephole2
>   [(set (match_operand:SWI48 0 "register_operand")
>     (match_operand:SWI48 1 "address_no_seg_operand"))]
>   "ix86_hardreg_mov_ok (operands[0], operands[1])
>    && peep2_regno_dead_p (0, FLAGS_REG)
>    && ix86_avoid_lea_for_addr (peep2_next_insn (0), operands)"
> ...
>
> Hence, we transform lea->add+add when ix86_avoid_lea_for_addr
> returns true, so by symmetry it is not unreasonable to turn add+add->lea
> when ix86_avoid_lea_for_addr would return false.  The relevant part
> of ix86_avoid_lea_for_addr is then around line 15974 of i386.cc:
>
>   /* Check we need to optimize.  */
>   if (!TARGET_AVOID_LEA_FOR_ADDR || optimize_function_for_size_p (cfun))
>     return false;
>
> which you'll recognize is precisely the condition under which my
> proposed peephole2 fires.  Technically, we also know that this is
> a 3-component lea, "parts.disp" is non-zero, the addition is destructive
> (overwriting one of the input register operands), and that we're not
> using partial register mode (such as HImode).
>
> I won't argue that my peephole2 is "obviously" correct, but the observation
> that it correctly reduces code size when -Os is specified indicates that
> something is broken/inconsistent/missing in the current logic, and that
> this change is an improvement.
>
> Perhaps the TARGET_AVOID_LEA_FOR_ADDR check in ix86_avoid_lea_for_addr
> is incorrect; if so, please feel free to file a bugzilla PR.
>
> Ok for mainline?

OK.

Thanks,
Uros.

Patch

diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index fd48e76..66ef234 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -6332,6 +6332,21 @@ 
   "TARGET_APX_NF && reload_completed"
   [(set (match_dup 0) (ashift:SWI48 (match_dup 0) (match_dup 1)))]
   "operands[1] = GEN_INT (exact_log2 (INTVAL (operands[1])));")
+
+;; The peephole2 pass may expose consecutive additions suitable for lea.
+(define_peephole2
+  [(parallel [(set (match_operand:SWI48 0 "register_operand")
+		   (plus:SWI48 (match_dup 0)
+			       (match_operand 1 "register_operand")))
+	      (clobber (reg:CC FLAGS_REG))])
+   (parallel [(set (match_dup 0)
+		   (plus:SWI48 (match_dup 0)
+			       (match_operand 2 "x86_64_immediate_operand")))
+	      (clobber (reg:CC FLAGS_REG))])]
+  "!TARGET_AVOID_LEA_FOR_ADDR || optimize_function_for_size_p (cfun)"
+  [(set (match_dup 0) (plus:SWI48 (plus:SWI48 (match_dup 0)
+					      (match_dup 1))
+				  (match_dup 2)))])
 
 ;; Add instructions