Message ID: 201006272305.05010.ebotcazou@adacore.com
State: New
On 06/27/2010 11:05 PM, Eric Botcazou wrote:
> +      /* ??? This probably disables the MULT case for several platforms.  */
> +      if (GET_CODE (PATTERN (insn)) == PARALLEL)
> +        return NULL_RTX;

single_set would be a better match, no?

Paolo
> single_set would be a better match, no?
What do you mean exactly? That single_set returns NULL_RTX for PARALLEL?
On 06/28/2010 11:47 AM, Eric Botcazou wrote:
>> single_set would be a better match, no?
>
> What do you mean exactly?  That single_set returns NULL_RTX for PARALLEL?

That it doesn't return NULL_RTX for CLOBBERs, USEs, or SETs whose output
will not be used (like, most of the time, the CC set).

Paolo
> That it doesn't return NULL_RTX for CLOBBERs, USEs, or SETs whose output
> will not be used (like, most of the time, the CC set).

Yes, but that's irrelevant here; single_set is already called a few lines
above.  The problem is that resolve_simple_move doesn't know (yet) how to
reinstantiate the other members of the PARALLEL when it is copying the SRC.
On 06/27/2010 11:05 PM, Eric Botcazou wrote:
> lower-subreg should already be able to handle this but it cannot because
> expand_doubleword_mult generates unnecessarily tangled RTL.  The first
> attached patch is sufficient to untangle it, but this has an unexpected
> side effect on x86 (turning an LEA into ADD + MOVE).

It also seems to generally increase stack frames on ARM on a number of
testcases I looked at and often pessimizes multiplication code:

        ldr ip, [r4, #40]       |       ldr r1, [r4, #40]
        ldr lr, .L351+232       |       ldr ip, .L351+232
        umull r0, r1, ip, lr    |       umull r8, r9, r1, ip
        ldr ip, [r4, #44]       |       ldr r1, [r4, #44]
        mla r1, lr, ip, r1      |       mov r0, r8
                                >       mla r1, ip, r1, r9

I haven't looked at it further.

> Another approach is to teach lower-subreg to handle MULT on the RHS
> specially, as it already does for ASM_OPERANDS.  This is the second
> attached patch.  It has no effect on x86 (and SPARC and others) because
> of the ??? comment.

This produces better results in a number of cases, but it's not a clear
win.  In smallish testcases it often produces extra instructions due to
the general problem IRA currently has with tracking lifetimes of DImode
subwords.  If that gets fixed we might consider the patch, but right now
I don't think we can.

          smull r4, r5, r2, ip
        + mov ip, r4
          ldr r6, [r0, r3]
        - mov ip, r4, lsr #31
        + mov r4, r5
        + mov ip, ip, lsr #31

It also seems somewhat hackish to me to treat some operations specially
for no really good reason.  Why do we care about the type of the source
operand?  Can't we just decompose anything that occurs in a SET_DEST?
The fact that it does nothing on some targets is also a minus IMO.

Bernd
> This produces better results in a number of cases, but it's not a clear
> win.  In smallish testcases it often produces extra instructions due to
> the general problem IRA currently has with tracking lifetimes of DImode
> subwords.  If that gets fixed we might consider the patch, but right now
> I don't think we can.
>
>           smull r4, r5, r2, ip
>         + mov ip, r4
>           ldr r6, [r0, r3]
>         - mov ip, r4, lsr #31
>         + mov r4, r5
>         + mov ip, ip, lsr #31

The original testcase is smallish too, so it would be interesting to
understand why it helps for certain smallish testcases and pessimizes
others.  How does your can_decompose_p patch come into play here?

> It also seems somewhat hackish to me to treat some operations specially
> for no really good reason.  Why do we care about the type of the source
> operand?  Can't we just decompose anything that occurs in a SET_DEST?

We aren't trying to enhance lower-subreg, only to solve a problem related
to double-word multiplication.  The pass has a special provision for
ASM_OPERANDS on the RHS because of a certain x86 pattern; it isn't
unreasonable to extend it to another common pattern if this proves
beneficial in most cases.

> The fact that it does nothing on some targets is also a minus IMO.

This can presumably be easily fixed.
On 06/29/2010 07:38 PM, Eric Botcazou wrote:
>> This produces better results in a number of cases, but it's not a clear
>> win.  In smallish testcases it often produces extra instructions due to
>> the general problem IRA currently has with tracking lifetimes of DImode
>> subwords.  If that gets fixed we might consider the patch, but right now
>> I don't think we can.
>>
>>           smull r4, r5, r2, ip
>>         + mov ip, r4
>>           ldr r6, [r0, r3]
>>         - mov ip, r4, lsr #31
>>         + mov r4, r5
>>         + mov ip, ip, lsr #31
>
> The original testcase is smallish too so it would be interesting to
> understand why it helps for certain smallish testcases and pessimizes
> for others.

As I said, it's the known IRA deficiency.  Your patch creates more
instances of

  (set (reg:DI x) (something))
  (set (reg:SI a) (subreg x 0))
  (set (reg:SI b) (subreg x 1))

where IRA treats x as conflicting with a.  If we can fix this in IRA, we
should retest this patch (or something similar).

> How does your can_decompose_p patch come into play here?

Not at all, I think.  It's a different problem; we're not creating
superfluous DImode pseudos here.

>> It also seems somewhat hackish to me to treat some operations specially
>> for no really good reason.  Why do we care about the type of the source
>> operand?  Can't we just decompose anything that occurs in a SET_DEST?
>
> We aren't trying to enhance lower-subreg,

I think we are, if that's the best solution for the problem.

IMO my mini-DCE is unambiguously beneficial, as it only deletes
instructions; it should be free in the normal case since it's guarded by
the "First see if there are any multi-word pseudo-registers" test, and
cheap even when there are multi-word pseudos.  It's not explicitly tied
to multiplications.  On the other hand, it has very little effect in
general beyond fixing the testcase.

Since we may be able to do better, I'm withdrawing the patch for now,
until we decide whether IRA will be enhanced to cope better with DImode
lifetimes.
Once (if) IRA is fixed, a plausible approach for lower-subreg would be to
decompose DImode regs that are set once in a non-decomposable context.
After the initializing insn, we can move its subregs to new SImode
pseudos, just as we're doing for ASM_OPERANDS now.

Bernd
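[In source-level terms, the transformation sketched above would amount to
something like the following; this is an illustrative C analogy only, not
the actual pass, and all names are invented:]

```c
#include <stdint.h>

/* Analogy for the proposed lowering: a 64-bit value produced by an
   operation lower-subreg cannot decompose (here a double-word multiply)
   is split into two 32-bit "pseudos" immediately after the initializing
   insn, and only the halves are used afterwards, so the DImode
   temporary's lifetime ends right away instead of overlapping its
   subword copies.  */
uint32_t
high_word_of_product (uint32_t a, uint32_t b)
{
  uint64_t wide = (uint64_t) a * b;       /* the one non-decomposable set */
  uint32_t lo = (uint32_t) wide;          /* new SImode pseudo, subword 0 */
  uint32_t hi = (uint32_t) (wide >> 32);  /* new SImode pseudo, subword 1 */
  (void) lo;                              /* only the high half is needed */
  return hi;
}
```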
Index: lower-subreg.c
===================================================================
--- lower-subreg.c	(revision 161426)
+++ lower-subreg.c	(working copy)
@@ -67,6 +67,24 @@ static bitmap non_decomposable_context;
    copy from reg M to reg N.  */
 static VEC(bitmap,heap) *reg_copy_graph;
 
+/* Return nonzero if X is a special source operand for which we may want to
+   decompose the destination.  The value is the number of sub-operands.  */
+
+static int
+special_source_operand (rtx x)
+{
+  /* We handle ASM_OPERANDS as it is beneficial for things like x86 rdtsc
+     which returns a DImode value.  */
+  if (GET_CODE (x) == ASM_OPERANDS)
+    return 1;
+
+  /* We handle MULT as it is beneficial for double-word multiplication.  */
+  if (GET_CODE (x) == MULT)
+    return 2;
+
+  return 0;
+}
+
 /* Return whether X is a simple object which we can take a word_mode
    subreg of.  */
 
@@ -100,11 +118,12 @@ simple_move_operand (rtx x)
 static rtx
 simple_move (rtx insn)
 {
-  rtx x;
-  rtx set;
+  const int n = recog_data.n_operands;
   enum machine_mode mode;
+  rtx set, x;
+  int sso;
 
-  if (recog_data.n_operands != 2)
+  if (n < 2)
     return NULL_RTX;
 
   set = single_set (insn);
@@ -118,13 +137,22 @@ simple_move (rtx insn)
     return NULL_RTX;
 
   x = SET_SRC (set);
-  if (x != recog_data.operand[0] && x != recog_data.operand[1])
-    return NULL_RTX;
-  /* For the src we can handle ASM_OPERANDS, and it is beneficial for
-     things like x86 rdtsc which returns a DImode value.  */
-  if (GET_CODE (x) != ASM_OPERANDS
-      && !simple_move_operand (x))
-    return NULL_RTX;
+  sso = special_source_operand (x);
+  if (sso)
+    {
+      if (n != 1 + sso)
+        return NULL_RTX;
+      /* ??? This probably disables the MULT case for several platforms.  */
+      if (GET_CODE (PATTERN (insn)) == PARALLEL)
+        return NULL_RTX;
+    }
+  else
+    {
+      if (x != recog_data.operand[0] && x != recog_data.operand[1])
+        return NULL_RTX;
+      if (!simple_move_operand (x))
+        return NULL_RTX;
+    }
 
   /* We try to decompose in integer modes, to avoid generating
      inefficient code copying between integer and floating point
@@ -714,16 +742,23 @@ resolve_simple_move (rtx set, rtx insn)
       gcc_assert (acg);
     }
 
+  /* If SRC is a special operand, we need to move via a temporary register.  */
+  if (special_source_operand (src))
+    {
+      int acg;
+      rtx reg = gen_reg_rtx (orig_mode);
+      for_each_rtx (&src, resolve_subreg_use, NULL_RTX);
+      acg = apply_change_group ();
+      gcc_assert (acg);
+      emit_move_insn (reg, src);
+      src = reg;
+    }
+
   /* If SRC is a register which we can't decompose, or has side
      effects, we need to move via a temporary register.  */
-
-  if (!can_decompose_p (src)
-      || side_effects_p (src)
-      || GET_CODE (src) == ASM_OPERANDS)
+  else if (!can_decompose_p (src) || side_effects_p (src))
     {
-      rtx reg;
-
-      reg = gen_reg_rtx (orig_mode);
+      rtx reg = gen_reg_rtx (orig_mode);
       emit_move_insn (reg, src);
       src = reg;
     }
@@ -1104,7 +1139,7 @@ decompose_multiword_subregs (void)
     {
       rtx set;
       enum classify_move_insn cmi;
-      int i, n;
+      int i;
 
       if (!INSN_P (insn)
           || GET_CODE (PATTERN (insn)) == CLOBBER
@@ -1129,25 +1164,11 @@ decompose_multiword_subregs (void)
           cmi = SIMPLE_MOVE;
         }
 
-      n = recog_data.n_operands;
-      for (i = 0; i < n; ++i)
-        {
-          for_each_rtx (&recog_data.operand[i],
-                        find_decomposable_subregs,
-                        &cmi);
-
-          /* We handle ASM_OPERANDS as a special case to support
-             things like x86 rdtsc which returns a DImode value.
-             We can decompose the output, which will certainly be
-             operand 0, but not the inputs.  */
-
-          if (cmi == SIMPLE_MOVE
-              && GET_CODE (SET_SRC (set)) == ASM_OPERANDS)
-            {
-              gcc_assert (i == 0);
-              cmi = NOT_SIMPLE_MOVE;
-            }
-        }
+      for (i = 0; i < recog_data.n_operands; ++i)
+        for_each_rtx (&recog_data.operand[i],
+                      find_decomposable_subregs,
+                      &cmi);
+
     }
 }