Message ID | 20100913161721.GA18471@intel.com |
---|---|
State | New |
Headers | show |
On Mon, Sep 13, 2010 at 6:17 PM, H.J. Lu <hongjiu.lu@intel.com> wrote: > On some processors like Atom, 32bit integer divide is much slower than > 8bit unsigned integer divide. This patch adds an option, -m8bit-idiv, > for x86. It turns 32bit integer divide into > > if (dividend an divisor are in [0-255]) > use 8bit unsigned integer divide > else > use 32bit integer divide > > I didn't turn it on by default for Atom: > > 1. If inputs are outside of 8bit unsigned integers, extra compare and > branch will always be slow. > 2. For > > --- > int funca( int a, int b ) > { > return (a/b) + (a%b); > } > -- > > This patch generates 2 idivbs since the optimization is done at RTL > expansion. Is there a way to delay this until later when 2 idivls are > optimized into 1 idivl and before IRA since this optimization needs > a scratch register. Splitter with && can_create_pseudo_p () split constraint will limit splits to pre-regalloc passes, or ... > I tried to split idivl into idivb and idivl. It didn't work due > to recursive split. ... use peephole2 pass. It can allocate hard-reg scratch, if it finds a free one. Uros.
On Mon, Sep 13, 2010 at 11:58 AM, Uros Bizjak <ubizjak@gmail.com> wrote: > On Mon, Sep 13, 2010 at 6:17 PM, H.J. Lu <hongjiu.lu@intel.com> wrote: > >> On some processors like Atom, 32bit integer divide is much slower than >> 8bit unsigned integer divide. This patch adds an option, -m8bit-idiv, >> for x86. It turns 32bit integer divide into >> >> if (dividend an divisor are in [0-255]) >> use 8bit unsigned integer divide >> else >> use 32bit integer divide >> >> I didn't turn it on by default for Atom: >> >> 1. If inputs are outside of 8bit unsigned integers, extra compare and >> branch will always be slow. >> 2. For >> >> --- >> int funca( int a, int b ) >> { >> return (a/b) + (a%b); >> } >> -- >> >> This patch generates 2 idivbs since the optimization is done at RTL >> expansion. Is there a way to delay this until later when 2 idivls are >> optimized into 1 idivl and before IRA since this optimization needs >> a scratch register. > > Splitter with && can_create_pseudo_p () split constraint will limit > splits to pre-regalloc passes, or ... try_split doesn't allow any insn of the result matches the original pattern to avoid infinite loop. >> I tried to split idivl into idivb and idivl. It didn't work due >> to recursive split. > > ... use peephole2 pass. It can allocate hard-reg scratch, if it finds > a free one. Unfortunately, peephole2 pass can't find a scratch register in simple case. Also I really want to split (parallel [ (set (reg:SI 66) (div:SI (reg/v:SI 62 [ x ]) (mem/c/i:SI (plus:SI (reg/f:SI 16 argp) (const_int 4 [0x4])) [2 y+0 S4 A32]))) (set (reg:SI 67) (mod:SI (reg/v:SI 62 [ x ]) (mem/c/i:SI (plus:SI (reg/f:SI 16 argp) (const_int 4 [0x4])) [2 y+0 S4 A32]))) (clobber (reg:CC 17 flags)) ])
On Tue, Sep 14, 2010 at 7:05 AM, H.J. Lu <hjl.tools@gmail.com> wrote: >>> This patch generates 2 idivbs since the optimization is done at RTL >>> expansion. Is there a way to delay this until later when 2 idivls are >>> optimized into 1 idivl and before IRA since this optimization needs >>> a scratch register. >> >> Splitter with && can_create_pseudo_p () split constraint will limit >> splits to pre-regalloc passes, or ... > > try_split doesn't allow any insn of the result matches the original pattern > to avoid infinite loop. So, switch the places of div and mod RTXes in the parallel and provide another divl_1 insn pattern that matches this new parallel. Uros.
"H.J. Lu" <hongjiu.lu@intel.com> writes: > On some processors like Atom, 32bit integer divide is much slower than > 8bit unsigned integer divide. This patch adds an option, -m8bit-idiv, > for x86. It turns 32bit integer divide into To be honest I'm not sure the option is really useful -- it's probably hard to figure out for a user that it makes sense to set. > > if (dividend an divisor are in [0-255]) > use 8bit unsigned integer divide > else > use 32bit integer divide > > I didn't turn it on by default for Atom: > > 1. If inputs are outside of 8bit unsigned integers, extra compare and > branch will always be slow. Would be something for profile feedback to determine? -Andi
On Tue, Sep 14, 2010 at 12:54 AM, Andi Kleen <andi@firstfloor.org> wrote: > "H.J. Lu" <hongjiu.lu@intel.com> writes: > >> On some processors like Atom, 32bit integer divide is much slower than >> 8bit unsigned integer divide. This patch adds an option, -m8bit-idiv, >> for x86. It turns 32bit integer divide into > > To be honest I'm not sure the option is really useful -- it's > probably hard to figure out for a user that it makes sense > to set. That is true. >> >> if (dividend an divisor are in [0-255]) >> use 8bit unsigned integer divide >> else >> use 32bit integer divide >> >> I didn't turn it on by default for Atom: >> >> 1. If inputs are outside of 8bit unsigned integers, extra compare and >> branch will always be slow. > > Would be something for profile feedback to determine? > That is an excellent idea. I will take a look in a follow up patch. Thanks,.
On 09/14/2010 09:13 AM, Uros Bizjak wrote: > On Tue, Sep 14, 2010 at 7:05 AM, H.J. Lu<hjl.tools@gmail.com> wrote: > >>>> This patch generates 2 idivbs since the optimization is done at RTL >>>> expansion. Is there a way to delay this until later when 2 idivls are >>>> optimized into 1 idivl and before IRA since this optimization needs >>>> a scratch register. >>> >>> Splitter with&& can_create_pseudo_p () split constraint will limit >>> splits to pre-regalloc passes, or ... >> >> try_split doesn't allow any insn of the result matches the original pattern >> to avoid infinite loop. > > So, switch the places of div and mod RTXes in the parallel and provide > another divl_1 insn pattern that matches this new parallel. I think adding an unspec to the parallel would be less hacky than reversing the div/mod. Paolo
On Wed, Sep 15, 2010 at 10:07 AM, Paolo Bonzini <bonzini@gnu.org> wrote: > On 09/14/2010 09:13 AM, Uros Bizjak wrote: >> >> On Tue, Sep 14, 2010 at 7:05 AM, H.J. Lu<hjl.tools@gmail.com> wrote: >> >>>>> This patch generates 2 idivbs since the optimization is done at RTL >>>>> expansion. Is there a way to delay this until later when 2 idivls are >>>>> optimized into 1 idivl and before IRA since this optimization needs >>>>> a scratch register. >>>> >>>> Splitter with&& can_create_pseudo_p () split constraint will limit >>>> splits to pre-regalloc passes, or ... >>> >>> try_split doesn't allow any insn of the result matches the original >>> pattern >>> to avoid infinite loop. >> >> So, switch the places of div and mod RTXes in the parallel and provide >> another divl_1 insn pattern that matches this new parallel. > > I think adding an unspec to the parallel would be less hacky than reversing > the div/mod. > There are places which check DIV/UDIV. In i386.c, there are case DIV: case UDIV: case MOD: case UMOD: if (SSE_FLOAT_MODE_P (mode) && TARGET_SSE_MATH) /* ??? SSE cost should be used here. */ *total = cost->fdiv; else if (X87_FLOAT_MODE_P (mode)) *total = cost->fdiv; else if (FLOAT_MODE_P (mode)) /* ??? SSE vector cost should be used here. */ *total = cost->fdiv; else *total = cost->divide[MODE_INDEX (mode)]; return false; I'd like to keep them as much as possible.
On 09/15/2010 07:16 PM, H.J. Lu wrote: >> > I think adding an unspec to the parallel would be less hacky than reversing >> > the div/mod. >> > > There are places which check DIV/UDIV. I said "adding", like (parallel [(set ... (div ...)) (set ... (mod ...)) (unspec [] DIV_ALREADY_SPLIT)]) Paolo
On Thu, Sep 16, 2010 at 9:20 AM, Paolo Bonzini <bonzini@gnu.org> wrote: > On 09/15/2010 07:16 PM, H.J. Lu wrote: >>> >>> > I think adding an unspec to the parallel would be less hacky than >>> > reversing >>> > the div/mod. >>> > >> >> There are places which check DIV/UDIV. > > I said "adding", like > > (parallel [(set ... (div ...)) > (set ... (mod ...)) > (unspec [] DIV_ALREADY_SPLIT)]) I agree with Paolo. BTW: You should write: (unspec [(const_int 0)] DIV_ALREADY_SPLIT) otherwise sky will fall down. Uros.
diff --git a/gcc/config/i386/i386-protos.h b/gcc/config/i386/i386-protos.h index 900b424..6b1f447 100644 --- a/gcc/config/i386/i386-protos.h +++ b/gcc/config/i386/i386-protos.h @@ -129,6 +129,7 @@ extern void ix86_split_ashr (rtx *, rtx, enum machine_mode); extern void ix86_split_lshr (rtx *, rtx, enum machine_mode); extern rtx ix86_find_base_term (rtx); extern bool ix86_check_movabs (rtx, int); +extern void ix86_expand_idivmod (enum rtx_code, enum machine_mode, rtx[]); extern rtx assign_386_stack_local (enum machine_mode, enum ix86_stack_slot); extern int ix86_attr_length_immediate_default (rtx, int); diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c index 1d79a18..21d9c0f 100644 --- a/gcc/config/i386/i386.c +++ b/gcc/config/i386/i386.c @@ -1981,6 +1981,7 @@ static bool ix86_expand_vector_init_one_nonzero (bool, enum machine_mode, static void ix86_add_new_builtins (int); static rtx ix86_expand_vec_perm_builtin (tree); static tree ix86_canonical_va_list_type (tree); +static void predict_jump (int); enum ix86_function_specific_strings { @@ -3699,6 +3700,9 @@ override_options (bool main_args_p) #endif } + if (flag_8bit_idiv < 0) + flag_8bit_idiv = 0; + /* Save the initial options in case the user does function specific options */ if (main_args_p) target_option_default_node = target_option_current_node @@ -14647,6 +14651,103 @@ ix86_expand_unary_operator (enum rtx_code code, enum machine_mode mode, emit_move_insn (operands[0], dst); } +/* Expand 32bit divmod with 8bit unsigned divmod if dividend and + divisor are within the the range [0-255]. */ + +void +ix86_expand_idivmod (enum rtx_code code, enum machine_mode mode, + rtx operands[]) +{ + rtx end_label, qimode_label; + rtx insn, div, mod; + rtx tmp0, tmp1, tmp2; + rtx (*gen_divmod4) (rtx, rtx, rtx, rtx); + + switch (mode) + { + case HImode: + gen_divmod4 = code == DIV ? gen_divmodhi4_1 : gen_udivmodhi4_1; + break; + case SImode: + gen_divmod4 = code == DIV ? gen_divmodsi4_1 : gen_udivmodsi4_1; + break; + case DImode: + gen_divmod4 = code == DIV ? gen_divmoddi4_1 : gen_udivmoddi4_1; + break; + default: + gcc_unreachable (); + } + + div = (*gen_divmod4) (operands[0], operands[1], + operands[2], operands[3]); + if (mode != SImode + || !flag_8bit_idiv + || !TARGET_QIMODE_MATH + || optimize_insn_for_size_p ()) + { + emit_insn (div); + return; + } + + end_label = gen_label_rtx (); + qimode_label = gen_label_rtx (); + + tmp0 = gen_reg_rtx (SImode); + + /* Use 8bit unsigned divimod if dividend and divisor are within the + the range [0-255]. */ + emit_move_insn (tmp0, operands[1]); + tmp0 = expand_simple_binop (mode, IOR, tmp0, operands[2], + tmp0, 1, OPTAB_DIRECT); + emit_insn (gen_testsi_ccno_1 (tmp0, GEN_INT (-0x100))); + tmp0 = gen_rtx_REG (CCNOmode, FLAGS_REG); + tmp0 = gen_rtx_EQ (VOIDmode, tmp0, const0_rtx); + tmp0 = gen_rtx_IF_THEN_ELSE (VOIDmode, tmp0, + gen_rtx_LABEL_REF (VOIDmode, qimode_label), + pc_rtx); + insn = emit_jump_insn (gen_rtx_SET (VOIDmode, pc_rtx, tmp0)); + predict_jump (REG_BR_PROB_BASE * 50 / 100); + JUMP_LABEL (insn) = qimode_label; + + /* Generate 32bit signed/unsigned divimod. */ + emit_insn (div); + + /* Branch to the end. */ + emit_jump_insn (gen_jump (end_label)); + emit_barrier (); + + /* Generate 8bit unsigned divide. */ + emit_label (qimode_label); + tmp0 = simplify_gen_subreg (HImode, operands[0], SImode, 0); + tmp1 = simplify_gen_subreg (HImode, operands[1], SImode, 0); + tmp2 = simplify_gen_subreg (QImode, operands[2], SImode, 0); + emit_insn (gen_udivmodhiqi3 (tmp0, tmp1, tmp2)); + + if (code == DIV) + { + div = gen_rtx_DIV (SImode, operands[1], operands[2]); + mod = gen_rtx_MOD (SImode, operands[1], operands[2]); + } + else + { + div = gen_rtx_UDIV (SImode, operands[1], operands[2]); + mod = gen_rtx_UMOD (SImode, operands[1], operands[2]); + } + + /* Extract remainder from AH. */ + tmp1 = gen_rtx_ZERO_EXTRACT (SImode, tmp0, + GEN_INT (8), GEN_INT (8)); + insn = emit_move_insn (operands[3], tmp1); + set_unique_reg_note (insn, REG_EQUAL, mod); + + /* Zero extend quotient from AL. */ + tmp0 = gen_lowpart (QImode, tmp0); + insn = emit_insn (gen_zero_extendqisi2 (operands[0], tmp0)); + set_unique_reg_note (insn, REG_EQUAL, div); + + emit_label (end_label); +} + #define LEA_SEARCH_THRESHOLD 12 /* Search backward for non-agu definition of register number REGNO1 diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md index 9780eef..1ddf0b4 100644 --- a/gcc/config/i386/i386.md +++ b/gcc/config/i386/i386.md @@ -7295,7 +7295,7 @@ [(set_attr "type" "idiv") (set_attr "mode" "QI")]) -(define_expand "divmod<mode>4" +(define_expand "divmod<mode>4_1" [(parallel [(set (match_operand:SWIM248 0 "register_operand" "") (div:SWIM248 (match_operand:SWIM248 1 "register_operand" "") @@ -7306,6 +7306,17 @@ "" "") +(define_expand "divmod<mode>4" + [(parallel [(set (match_operand:SWIM248 0 "register_operand" "") + (div:SWIM248 + (match_operand:SWIM248 1 "register_operand" "") + (match_operand:SWIM248 2 "nonimmediate_operand" ""))) + (set (match_operand:SWIM248 3 "register_operand" "") + (mod:SWIM248 (match_dup 1) (match_dup 2))) + (clobber (reg:CC FLAGS_REG))])] + "" + "ix86_expand_idivmod (DIV, <MODE>mode, operands); DONE;") + (define_insn_and_split "*divmod<mode>4" [(set (match_operand:SWIM248 0 "register_operand" "=a") (div:SWIM248 (match_operand:SWIM248 2 "register_operand" "0") @@ -7354,7 +7365,7 @@ [(set_attr "type" "idiv") (set_attr "mode" "<MODE>")]) -(define_expand "udivmod<mode>4" +(define_expand "udivmod<mode>4_1" [(parallel [(set (match_operand:SWIM248 0 "register_operand" "") (udiv:SWIM248 (match_operand:SWIM248 1 "register_operand" "") @@ -7365,6 +7376,17 @@ "" "") +(define_expand "udivmod<mode>4" + [(parallel [(set (match_operand:SWIM248 0 "register_operand" "") + (udiv:SWIM248 + (match_operand:SWIM248 1 "register_operand" "") + (match_operand:SWIM248 2 "nonimmediate_operand" ""))) + (set (match_operand:SWIM248 3 "register_operand" "") + (umod:SWIM248 (match_dup 1) (match_dup 2))) + (clobber (reg:CC FLAGS_REG))])] + "" + "ix86_expand_idivmod (UDIV, <MODE>mode, operands); DONE;") + (define_insn_and_split "*udivmod<mode>4" [(set (match_operand:SWIM248 0 "register_operand" "=a") (udiv:SWIM248 (match_operand:SWIM248 2 "register_operand" "0") diff --git a/gcc/config/i386/i386.opt b/gcc/config/i386/i386.opt index 5790e76..f7abe75 100644 --- a/gcc/config/i386/i386.opt +++ b/gcc/config/i386/i386.opt @@ -388,3 +388,7 @@ Support F16C built-in functions and code generation mfentry Target Report Var(flag_fentry) Init(-1) Emit profiling counter call at function entry before prologue. + +m8bit-idiv +Target Report Var(flag_8bit_idiv) Init(-1) Save +Expand 32bit integer divide into control flow with 8bit unsigned integer divide diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi index b354382..cb3a756 100644 --- a/gcc/doc/invoke.texi +++ b/gcc/doc/invoke.texi @@ -602,7 +602,7 @@ Objective-C and Objective-C++ Dialects}. -momit-leaf-frame-pointer -mno-red-zone -mno-tls-direct-seg-refs @gol -mcmodel=@var{code-model} -mabi=@var{name} @gol -m32 -m64 -mlarge-data-threshold=@var{num} @gol --msse2avx -mfentry} +-msse2avx -mfentry -m8bit-idiv} @emph{IA-64 Options} @gccoptlist{-mbig-endian -mlittle-endian -mgnu-as -mgnu-ld -mno-pic @gol @@ -12647,6 +12647,13 @@ If profiling is active @option{-pg} put the profiling counter call before prologue. Note: On x86 architectures the attribute @code{ms_hook_prologue} isn't possible at the moment for @option{-mfentry} and @option{-pg}. + +@item -m8bit-idiv +@itemx -mno-8bit-idiv +@opindex 8bit-idiv +This option will enable GCC to expand 32bit integer divide into control +flow with 8bit unsigned integer divide. + @end table These @samp{-m} switches are supported in addition to the above diff --git a/gcc/testsuite/gcc.target/i386/divmod-1.c b/gcc/testsuite/gcc.target/i386/divmod-1.c new file mode 100644 index 0000000..2769a21 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/divmod-1.c @@ -0,0 +1,30 @@ +/* { dg-do run } */ +/* { dg-options "-O2 -m8bit-idiv" } */ + +extern void abort (void); + +void +__attribute__((noinline)) +test (int x, int y, int q, int r) +{ + if ((x / y) != q || (x % y) != r) + abort (); +} + +int +main () +{ + test (7, 6, 1, 1); + test (-7, -6, 1, -1); + test (-7, 6, -1, -1); + test (7, -6, -1, 1); + test (255, 254, 1, 1); + test (256, 254, 1, 2); + test (256, 256, 1, 0); + test (254, 256, 0, 254); + test (254, 255, 0, 254); + test (254, 1, 254, 0); + test (255, 2, 127, 1); + test (1, 256, 0, 1); + return 0; +} diff --git a/gcc/testsuite/gcc.target/i386/divmod-2.c b/gcc/testsuite/gcc.target/i386/divmod-2.c new file mode 100644 index 0000000..dbe40b7 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/divmod-2.c @@ -0,0 +1,10 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -m8bit-idiv" } */ + +int +foo (int x, int y) +{ + return x / y; +} + +/* { dg-final { scan-assembler-times "divb" 1 } } */ diff --git a/gcc/testsuite/gcc.target/i386/divmod-3.c b/gcc/testsuite/gcc.target/i386/divmod-3.c new file mode 100644 index 0000000..4966d7f --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/divmod-3.c @@ -0,0 +1,10 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -m8bit-idiv" } */ + +int +foo (int x, int y) +{ + return x % y; +} + +/* { dg-final { scan-assembler-times "divb" 1 } } */ diff --git a/gcc/testsuite/gcc.target/i386/udivmod-1.c b/gcc/testsuite/gcc.target/i386/udivmod-1.c new file mode 100644 index 0000000..eebd843 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/udivmod-1.c @@ -0,0 +1,31 @@ +/* { dg-do run } */ +/* { dg-options "-O2 -m8bit-idiv" } */ + +extern void abort (void); + +void +__attribute__((noinline)) +test (unsigned int x, unsigned int y, unsigned int q, unsigned int r) +{ + if ((x / y) != q || (x % y) != r) + abort (); +} + +int +main () +{ + test (7, 6, 1, 1); + test (255, 254, 1, 1); + test (256, 254, 1, 2); + test (256, 256, 1, 0); + test (254, 256, 0, 254); + test (254, 255, 0, 254); + test (254, 1, 254, 0); + test (255, 2, 127, 1); + test (1, 256, 0, 1); + test (0x80000000, 0x7fffffff, 1, 1); + test (0x7fffffff, 0x80000000, 0, 0x7fffffff); + test (0x80000000, 0x80000003, 0, 0x80000000); + test (0xfffffffd, 0xfffffffe, 0, 0xfffffffd); + return 0; +} diff --git a/gcc/testsuite/gcc.target/i386/udivmod-2.c b/gcc/testsuite/gcc.target/i386/udivmod-2.c new file mode 100644 index 0000000..e43bbc7 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/udivmod-2.c @@ -0,0 +1,10 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -m8bit-idiv" } */ + +unsigned int +foo (unsigned int x, unsigned int y) +{ + return x / y; +} + +/* { dg-final { scan-assembler-times "divb" 1 } } */ diff --git a/gcc/testsuite/gcc.target/i386/udivmod-3.c b/gcc/testsuite/gcc.target/i386/udivmod-3.c new file mode 100644 index 0000000..9aead9e --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/udivmod-3.c @@ -0,0 +1,10 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -m8bit-idiv" } */ + +unsigned int +foo (unsigned int x, unsigned int y) +{ + return x % y; +} + +/* { dg-final { scan-assembler-times "divb" 1 } } */ diff --git a/gcc/testsuite/gcc.target/i386/umod-1.c b/gcc/testsuite/gcc.target/i386/umod-1.c index 54edf13..a39e75b 100644 --- a/gcc/testsuite/gcc.target/i386/umod-1.c +++ b/gcc/testsuite/gcc.target/i386/umod-1.c @@ -1,5 +1,5 @@ /* { dg-do compile } */ -/* { dg-options "-O2 -mtune=atom" } */ +/* { dg-options "-O2 -m8bit-idiv" } */ unsigned char foo (unsigned char x, unsigned char y) diff --git a/gcc/testsuite/gcc.target/i386/umod-2.c b/gcc/testsuite/gcc.target/i386/umod-2.c index 6fe7384..1ef33bb 100644 --- a/gcc/testsuite/gcc.target/i386/umod-2.c +++ b/gcc/testsuite/gcc.target/i386/umod-2.c @@ -1,5 +1,5 @@ /* { dg-do compile } */ -/* { dg-options "-O2 -mtune=atom" } */ +/* { dg-options "-O2 -m8bit-idiv" } */ extern unsigned char z; diff --git a/gcc/testsuite/gcc.target/i386/umod-3.c b/gcc/testsuite/gcc.target/i386/umod-3.c index 7123bc9..3e5bc40 100644 --- a/gcc/testsuite/gcc.target/i386/umod-3.c +++ b/gcc/testsuite/gcc.target/i386/umod-3.c @@ -1,5 +1,5 @@ /* { dg-do compile } */ -/* { dg-options "-O2 -mtune=atom" } */ +/* { dg-options "-O2 -m8bit-idiv" } */ extern void abort (void); extern void exit (int);