Message ID | 020d01d9fef6$c4fff920$4effeb60$@nextmovesoftware.com |
---|---|
State | New |
Headers | show |
Series | Improved RTL expansion of 1LL << x. | expand |
On 10/14/23 17:32, Roger Sayle wrote: > > This patch improves the initial RTL expanded for double word shifts > on architectures with conditional moves, so that later passes don't > need to clean-up unnecessary and/or unused instructions. > > Consider the general case, x << y, which is expanded well as: > > t1 = y & 32; > t2 = 0; > t3 = x_lo >> 1; > t4 = y ^ ~0; > t5 = t3 >> t4; > tmp_hi = x_hi << y; > tmp_hi |= t5; > tmp_lo = x_lo << y; > out_hi = t1 ? tmp_lo : tmp_hi; > out_lo = t1 ? t2 : tmp_lo; > > which is nearly optimal, the only thing that can be improved is > that using a unary NOT operation "t4 = ~y" is better than XOR > with -1, on targets that support it. [Note the one_cmpl_optab > expander didn't fall back to XOR when this code was originally > written, but has been improved since]. > > Now consider the relatively common idiom of 1LL << y, which > currently produces the RTL equivalent of: > > t1 = y & 32; > t2 = 0; > t3 = 1 >> 1; > t4 = y ^ ~0; > t5 = t3 >> t4; > tmp_hi = 0 << y; > tmp_hi |= t5; > tmp_lo = 1 << y; > out_hi = t1 ? tmp_lo : tmp_hi; > out_lo = t1 ? t2 : tmp_lo; > > Notice here that t3 is always zero, so the assignment of t5 > is a variable shift of zero, which expands to a loop on many > smaller targets, a similar shift by zero in the first tmp_hi > assignment (another loop), that the value of t4 is no longer > required (as t3 is zero), and that the ultimate value of tmp_hi > is always zero. > > Fortunately, for many (but perhaps not all) targets this mess > gets cleaned up by later optimization passes. However, this > patch avoids generating unnecessary RTL at expand time, by > calling simplify_expand_binop instead of expand_binop, and > avoiding generating dead or unnecessary code when intermediate > values are known to be zero. For the 1LL << y test case above, > we now generate: > > t1 = y & 32; > t2 = 0; > tmp_hi = 0; > tmp_lo = 1 << y; > out_hi = t1 ? tmp_lo : tmp_hi; > out_lo = t1 ? t2 : tmp_lo; > > On arc-elf, for example, there are 18 RTL INSN_P instructions > generated by expand before this patch, but only 12 with this patch > (improving both compile-time and memory usage). > > > This patch has been tested on x86_64-pc-linux-gnu with make bootstrap > and make -k check, both with and without --target_board=unix{-m32} > with no new failures. Ok for mainline? > > > 2023-10-15 Roger Sayle <roger@nextmovesoftware.com> > > gcc/ChangeLog > * optabs.cc (expand_subword_shift): Call simplify_expand_binop > instead of expand_binop. Optimize cases (i.e. avoid generating > RTL) when CARRIES or INTO_INPUT is zero. Use one_cmpl_optab > (i.e. NOT) instead of xor_optab with ~0 to calculate ~OP1. OK. FWIW H8 is another one of the targets where variable shifts have to be implemented as a loop. Jeff
diff --git a/gcc/optabs.cc b/gcc/optabs.cc index e1898da..f0a048a 100644 --- a/gcc/optabs.cc +++ b/gcc/optabs.cc @@ -533,15 +533,13 @@ expand_subword_shift (scalar_int_mode op1_mode, optab binoptab, has unknown behavior. Do a single shift first, then shift by the remainder. It's OK to use ~OP1 as the remainder if shift counts are truncated to the mode size. */ - carries = expand_binop (word_mode, reverse_unsigned_shift, - outof_input, const1_rtx, 0, unsignedp, methods); - if (shift_mask == BITS_PER_WORD - 1) - { - tmp = immed_wide_int_const - (wi::minus_one (GET_MODE_PRECISION (op1_mode)), op1_mode); - tmp = simplify_expand_binop (op1_mode, xor_optab, op1, tmp, - 0, true, methods); - } + carries = simplify_expand_binop (word_mode, reverse_unsigned_shift, + outof_input, const1_rtx, 0, + unsignedp, methods); + if (carries == const0_rtx) + tmp = const0_rtx; + else if (shift_mask == BITS_PER_WORD - 1) + tmp = expand_unop (op1_mode, one_cmpl_optab, op1, 0, true); else { tmp = immed_wide_int_const (wi::shwi (BITS_PER_WORD - 1, @@ -552,22 +550,29 @@ expand_subword_shift (scalar_int_mode op1_mode, optab binoptab, } if (tmp == 0 || carries == 0) return false; - carries = expand_binop (word_mode, reverse_unsigned_shift, - carries, tmp, 0, unsignedp, methods); + if (carries != const0_rtx && tmp != const0_rtx) + carries = simplify_expand_binop (word_mode, reverse_unsigned_shift, + carries, tmp, 0, unsignedp, methods); if (carries == 0) return false; - /* Shift INTO_INPUT logically by OP1. This is the last use of INTO_INPUT - so the result can go directly into INTO_TARGET if convenient. */ - tmp = expand_binop (word_mode, unsigned_shift, into_input, op1, - into_target, unsignedp, methods); - if (tmp == 0) - return false; + if (into_input != const0_rtx) + { + /* Shift INTO_INPUT logically by OP1. This is the last use of + INTO_INPUT so the result can go directly into INTO_TARGET if + convenient. */ + tmp = simplify_expand_binop (word_mode, unsigned_shift, into_input, + op1, into_target, unsignedp, methods); + if (tmp == 0) + return false; - /* Now OR in the bits carried over from OUTOF_INPUT. */ - if (!force_expand_binop (word_mode, ior_optab, tmp, carries, - into_target, unsignedp, methods)) - return false; + /* Now OR in the bits carried over from OUTOF_INPUT. */ + if (!force_expand_binop (word_mode, ior_optab, tmp, carries, + into_target, unsignedp, methods)) + return false; + } + else + emit_move_insn (into_target, carries); /* Use a standard word_mode shift for the out-of half. */ if (outof_target != 0)