[RFC,match.pd] optimize (X & C) == N when C is power of 2

Message ID	55B1F2C3.2000903@arm.com
State	New
Headers	From: Kyrill Tkachov <kyrylo.tkachov@arm.com>
	To: GCC Patches <gcc-patches@gcc.gnu.org>
	CC: Richard Biener <rguenther@suse.de>
	Date: Fri, 24 Jul 2015 09:09:39 +0100
	Subject: [PATCH][RFC][match.pd] optimize (X & C) == N when C is power of 2

Kyrylo Tkachov July 24, 2015, 8:09 a.m. UTC

Hi all,

This transformation folds (X % C) == N into
(X & ((1 << (size - 1)) | (C - 1))) == N
for constants C and N where N is positive and C is a power of 2.

The idea is similar to the existing X % C -> X & (C - 1) transformation
for unsigned values but this time when we have the comparison we use the
C - 1 mask (all 1s for a power-of-2 C) ORed with the sign bit.

At runtime, if X is non-negative then X & ((1 << (size - 1)) | (C - 1))
calculates X % C, which is compared against the positive N.

If X is negative then X & ((1 << (size - 1)) | (C - 1)) doesn't calculate
X % C but since we also catch the set sign bit, it will never compare equal
to N (which is positive), so we still get the right comparison result.
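
A minimal sketch, not part of the posted patch, that compares the two forms in plain C for a 32-bit-style `int`. The helper names (`mod_eq`, `mask_eq`, `forms_agree`) are made up for illustration, and the mask is built in unsigned arithmetic to avoid shifting into the sign bit.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Original form: signed remainder compared against N.  */
static bool mod_eq(int x, int c, int n)
{
  return (x % c) == n;
}

/* Proposed form: mask with the sign bit ORed with C - 1.
   C must be a positive power of 2.  */
static bool mask_eq(int x, int c, int n)
{
  assert(c > 0 && (c & (c - 1)) == 0);
  unsigned sign = 1u << (sizeof(int) * CHAR_BIT - 1);
  unsigned mask = sign | (unsigned)(c - 1);
  return ((unsigned)x & mask) == (unsigned)n;
}

/* Returns true if both forms agree for every x in [lo, hi].
   hi must be below INT_MAX so that x++ cannot overflow.  */
static bool forms_agree(int lo, int hi, int c, int n)
{
  for (int x = lo; x <= hi; x++)
    if (mod_eq(x, c, n) != mask_eq(x, c, n))
      return false;
  return true;
}
```

For example, `forms_agree(-1000, 1000, 8, 3)` holds, while `forms_agree(-16, 16, 8, 0)` does not: for N = 0 a negative multiple of C has a zero remainder but keeps its sign bit under the mask, which is why N has to be strictly positive.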

I don't have much experience with writing match.pd patterns, so I appreciate
any feedback on how to write this properly.

Bootstrapped and tested on arm, aarch64, x86_64.

Thanks,
Kyrill

2015-07-24  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>

     * match.pd ((X % C) == N -> (X & ((1 << (size - 1)) | (C - 1))) == N):
     Transform when N is positive and C is a power of 2.

2015-07-24  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>

     * gcc.dg/fold-mod-cmp-1.c: New test.

Jakub Jelinek July 24, 2015, 8:23 a.m. UTC | #1

On Fri, Jul 24, 2015 at 09:09:39AM +0100, Kyrill Tkachov wrote:
> This transformation folds (X % C) == N into
> X & ((1 << (size - 1)) | (C - 1))) == N
> for constants C and N where N is positive and C is a power of 2.
> 
> The idea is similar to the existing X % C -> X & (C - 1) transformation
> for unsigned values but this time when we have the comparison we use the
> C - 1 mask (all 1s for power of 2 N) orred with the sign bit.
> 
> At runtime, if X is positive then X & ((1 << (size - 1)) | (C - 1)))
> calculates X % C, which is compared against the positive N.
> 
> If X is negative then X & ((1 << (size - 1)) | (C - 1))) doesn't calculate
> X % C but since we also catch the set sign bit, it will never compare equal
> to N (which is positive), so we still get the right comparison result.
> 
> I don't have much experience with writing match.pd patterns, so I appreciate
> any feedback on how to write this properly.
> 
> Bootstrapped and tested on arm, aarch64, x86_64.

I think this is another case that, if at all, should be done during or right
before RTL expansion and should test rtx costs.
Because the ((1 << (size - 1)) | (C - 1)) constant might be very expensive
while C is cheap, and % might not be so expensive compared to & as to offset
that.

E.g. on x86_64, for 32-bit and smaller X the constant is cheap as any other
(well, if we don't take instruction size into account), but 64-bit constant
is at least 3 times more expensive (movabsq is needed with its latency).
In the x86_64 case supposedly the divmod is still more expensive, but there
are many other targets.  On sparc64 for 64-bit constants, you might need
many instructions to create the constants, etc.
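
To make the size of the constant concrete, here is a small sketch that computes the mask for a given type width and checks whether a value fits a sign-extended 32-bit immediate. The helpers `fold_mask` and `fits_simm32` are hypothetical names, and the simm32 test is only a simplified stand-in for the x86_64 encoding rule behind the movabsq remark, not a model of any target's rtx costs.

```c
#include <assert.h>
#include <stdint.h>

/* Mask used by the proposed fold for a signed type `bits` wide:
   the sign bit ORed with C - 1.  C must be a power of 2.  */
static uint64_t fold_mask(unsigned bits, uint64_t c)
{
  assert(bits >= 2 && bits <= 64);
  assert(c != 0 && (c & (c - 1)) == 0);
  return (UINT64_C(1) << (bits - 1)) | (c - 1);
}

/* True if v, viewed as a 64-bit two's complement value, fits in a
   sign-extended 32-bit immediate (assumed limit for 64-bit ALU
   operands in this illustration).  */
static int fits_simm32(uint64_t v)
{
  return v <= UINT64_C(0x7fffffff)
         || v >= UINT64_C(0xffffffff80000000);
}
```

With C = 4, `fold_mask(32, 4)` is 0x80000003 and `fold_mask(64, 4)` is 0x8000000000000003. The 64-bit mask fails `fits_simm32` while C itself passes, which is the asymmetry being pointed out.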

	Jakub

Kyrylo Tkachov July 24, 2015, 8:48 a.m. UTC | #2

On 24/07/15 09:23, Jakub Jelinek wrote:
> On Fri, Jul 24, 2015 at 09:09:39AM +0100, Kyrill Tkachov wrote:
>> This transformation folds (X % C) == N into
>> X & ((1 << (size - 1)) | (C - 1))) == N
>> for constants C and N where N is positive and C is a power of 2.
>>
>> The idea is similar to the existing X % C -> X & (C - 1) transformation
>> for unsigned values but this time when we have the comparison we use the
>> C - 1 mask (all 1s for power of 2 N) orred with the sign bit.
>>
>> At runtime, if X is positive then X & ((1 << (size - 1)) | (C - 1)))
>> calculates X % C, which is compared against the positive N.
>>
>> If X is negative then X & ((1 << (size - 1)) | (C - 1))) doesn't calculate
>> X % C but since we also catch the set sign bit, it will never compare equal
>> to N (which is positive), so we still get the right comparison result.
>>
>> I don't have much experience with writing match.pd patterns, so I appreciate
>> any feedback on how to write this properly.
>>
>> Bootstrapped and tested on arm, aarch64, x86_64.
> I think this is another case that, if at all, should be done during or right
> before RTL expansion and should test rtx costs.

Hmm, where would that be?
expmed.c has a lot of code to expand div or mod by a power of 2.
In what form would a compare with a mod by a power of 2 arrive at
the expansion phase?

> Because, ((1 << (size - 1)) | (C - 1))) constant might be very expensive,
> while C cheap, and % might not be that expensive compared to & to offset
> that.
>
> E.g. on x86_64, for 32-bit and smaller X the constant is cheap as any other
> (well, if we don't take instruction size into account), but 64-bit constant
> is at least 3 times more expensive (movabsq is needed with its latency).
> In the x86_64 case supposedly the divmod is still more expensive, but there
> are many other targets.  On sparc64 for 64-bit constants, you might need
> many instructions to create the constants, etc.

Ok, I am not familiar with sparc64. The constant is just a 1
in the sign bit ORed with a contiguous string of ones.
That's usually cheap on aarch64 but may not be so on other targets.

Thanks,
Kyrill

>
> 	Jakub
>

Richard Biener July 24, 2015, 9 a.m. UTC | #3

On Fri, 24 Jul 2015, Kyrill Tkachov wrote:

> 
> On 24/07/15 09:23, Jakub Jelinek wrote:
> > On Fri, Jul 24, 2015 at 09:09:39AM +0100, Kyrill Tkachov wrote:
> > > This transformation folds (X % C) == N into
> > > X & ((1 << (size - 1)) | (C - 1))) == N
> > > for constants C and N where N is positive and C is a power of 2.
> > > 
> > > The idea is similar to the existing X % C -> X & (C - 1) transformation
> > > for unsigned values but this time when we have the comparison we use the
> > > C - 1 mask (all 1s for power of 2 N) orred with the sign bit.
> > > 
> > > At runtime, if X is positive then X & ((1 << (size - 1)) | (C - 1)))
> > > calculates X % C, which is compared against the positive N.
> > > 
> > > If X is negative then X & ((1 << (size - 1)) | (C - 1))) doesn't calculate
> > > X % C but since we also catch the set sign bit, it will never compare
> > > equal
> > > to N (which is positive), so we still get the right comparison result.
> > > 
> > > I don't have much experience with writing match.pd patterns, so I
> > > appreciate
> > > any feedback on how to write this properly.
> > > 
> > > Bootstrapped and tested on arm, aarch64, x86_64.
> > I think this is another case that, if at all, should be done during or right
> > before RTL expansion and should test rtx costs.
> 
> Hmm, where would that be?
> expmed.c has a lot of code to expand div or mod by a power of 2.
> In what form would a compare with a mod by power of 2 arrive to
> the expansion phase?

It arrives as SSA_NAME == N and you can use get_gimple_for_ssa_name
or get_def_for_expr to get at the defining stmt if that is possible
(it's still unexpanded and thus TERed) and expand a different
expression.

But why can't simplify-rtx via combine handle this?  It should have
access to target costs.

> > Because, ((1 << (size - 1)) | (C - 1))) constant might be very expensive,
> > while C cheap, and % might not be that expensive compared to & to offset
> > that.
> > 
> > E.g. on x86_64, for 32-bit and smaller X the constant is cheap as any other
> > (well, if we don't take instruction size into account), but 64-bit constant
> > is at least 3 times more expensive (movabsq is needed with its latency).
> > In the x86_64 case supposedly the divmod is still more expensive, but there
> > are many other targets.  On sparc64 for 64-bit constants, you might need
> > many instructions to create the constants, etc.
> 
> Ok, I am not familiar with sparc64. The constant is just a 1
> in the sign bit orred with a continuous string of ones.
> That's usually cheap on aarch64 but may not be so on other targets.

On GIMPLE we might still want to canonicalize to one form.  I'd
canonicalize to the form with "smaller" constants if the number
of operations is the same.

Richard.

> Thanks,
> Kyrill
> 
> > 
> > 	Jakub
> > 
> 
>

Kyrylo Tkachov July 24, 2015, 9:04 a.m. UTC | #4

On 24/07/15 10:00, Richard Biener wrote:
> On Fri, 24 Jul 2015, Kyrill Tkachov wrote:
>
>> On 24/07/15 09:23, Jakub Jelinek wrote:
>>> On Fri, Jul 24, 2015 at 09:09:39AM +0100, Kyrill Tkachov wrote:
>>>> This transformation folds (X % C) == N into
>>>> X & ((1 << (size - 1)) | (C - 1))) == N
>>>> for constants C and N where N is positive and C is a power of 2.
>>>>
>>>> The idea is similar to the existing X % C -> X & (C - 1) transformation
>>>> for unsigned values but this time when we have the comparison we use the
>>>> C - 1 mask (all 1s for power of 2 N) orred with the sign bit.
>>>>
>>>> At runtime, if X is positive then X & ((1 << (size - 1)) | (C - 1)))
>>>> calculates X % C, which is compared against the positive N.
>>>>
>>>> If X is negative then X & ((1 << (size - 1)) | (C - 1))) doesn't calculate
>>>> X % C but since we also catch the set sign bit, it will never compare
>>>> equal
>>>> to N (which is positive), so we still get the right comparison result.
>>>>
>>>> I don't have much experience with writing match.pd patterns, so I
>>>> appreciate
>>>> any feedback on how to write this properly.
>>>>
>>>> Bootstrapped and tested on arm, aarch64, x86_64.
>>> I think this is another case that, if at all, should be done during or right
>>> before RTL expansion and should test rtx costs.
>> Hmm, where would that be?
>> expmed.c has a lot of code to expand div or mod by a power of 2.
>> In what form would a compare with a mod by power of 2 arrive to
>> the expansion phase?
> It arrives as SSA_NAME == N and you can use get_gimple_for_ssa_name
> or get_def_for_expr to get at the defining stmt if that is possible
> (it's still unexpanded and thus TERed) and expand a different
> expression.

Thanks, so it's where we expand compares... (what's TERed?)

>
> But why can't simplify-rtx via combine handle this - it should have
> access to target costs.

That would require the target to expand to an SMOD rtx, which would be
somewhat awkward if the target has no direct instruction for it.

Thanks,
Kyrill

>
>>> Because, ((1 << (size - 1)) | (C - 1))) constant might be very expensive,
>>> while C cheap, and % might not be that expensive compared to & to offset
>>> that.
>>>
>>> E.g. on x86_64, for 32-bit and smaller X the constant is cheap as any other
>>> (well, if we don't take instruction size into account), but 64-bit constant
>>> is at least 3 times more expensive (movabsq is needed with its latency).
>>> In the x86_64 case supposedly the divmod is still more expensive, but there
>>> are many other targets.  On sparc64 for 64-bit constants, you might need
>>> many instructions to create the constants, etc.
>> Ok, I am not familiar with sparc64. The constant is just a 1
>> in the sign bit orred with a continuous string of ones.
>> That's usually cheap on aarch64 but may not be so on other targets.
> On GIMPLE we might still want to canonicalize to one form.  I'd
> canonicalize to the form with "smaller" constants if the number
> of operations is the same.
>
> Richard.
>
>> Thanks,
>> Kyrill
>>
>>> 	Jakub
>>>
>>

Jakub Jelinek July 24, 2015, 9:09 a.m. UTC | #5

On Fri, Jul 24, 2015 at 09:48:30AM +0100, Kyrill Tkachov wrote:
> >>Bootstrapped and tested on arm, aarch64, x86_64.
> >I think this is another case that, if at all, should be done during or right
> >before RTL expansion and should test rtx costs.
> 
> Hmm, where would that be?

That is up for discussion; all I'm saying is that doing this in match.pd, which
runs on GENERIC and/or anywhere among the GIMPLE passes that fold stuff, is
undesirable.

In expr.c, with TER you can detect such patterns, in this case when
expanding the comparison, but perhaps we want a *.pd file that would have
rules that would be only GIMPLE and only enabled in a special pass right
before (or very close to) expansion, that would perform such instruction
selection.

> Ok, I am not familiar with sparc64. The constant is just a 1
> in the sign bit orred with a continuous string of ones.
> That's usually cheap on aarch64 but may not be so on other targets.

It has been a long time since I've done anything on sparc64, but you can
e.g. have a look at config/sparc/sparc.c (sparc_emit_set_const64)
to clearly see that not all constants are equal cost, some are much more
expensive.  It seems sparc_rtx_cost does not model this accurately, but that
is a backend bug, so it shouldn't affect the generic decisions.  Sparc does not
have a moddi3 pattern, so your transformation might still be a win there,
except for -Os, where it would be a very bad code size pessimization.

All I want to say is that on GIMPLE we usually perform transformations where
simpler (fewer operations) is considered better, and for the same number of
operations where one sequence might be better on one target and another on
another target, supposedly we want some other infrastructure.

	Jakub

Ramana Radhakrishnan July 24, 2015, 9:11 a.m. UTC | #6

On Fri, Jul 24, 2015 at 10:04 AM, Kyrill Tkachov <kyrylo.tkachov@arm.com> wrote:
>> It arrives as SSA_NAME == N and you can use get_gimple_for_ssa_name
>> or get_def_for_expr to get at the defining stmt if that is possible
>> (it's still unexpanded and thus TERed) and expand a different
>> expression.
>
>
> Thanks, so it's where we expand compares... (what's TERed?)

Temporary Expression Replacement - IIRC something done when you just
come out of SSA.  See tree-ssa-ter.[hc].

Ramana

>
>>
>> But why can't simplify-rtx via combine handle this - it should have
>> access to target costs.
>
>
> That would require for the target to expand to an SMOD rtx
> which, if the target has no direct instruction for would be somewhat
> awkward.
>
> Thanks,
> Kyrill
>
>
>>
>>>> Because, ((1 << (size - 1)) | (C - 1))) constant might be very
>>>> expensive,
>>>> while C cheap, and % might not be that expensive compared to & to offset
>>>> that.
>>>>
>>>> E.g. on x86_64, for 32-bit and smaller X the constant is cheap as any
>>>> other
>>>> (well, if we don't take instruction size into account), but 64-bit
>>>> constant
>>>> is at least 3 times more expensive (movabsq is needed with its latency).
>>>> In the x86_64 case supposedly the divmod is still more expensive, but
>>>> there
>>>> are many other targets.  On sparc64 for 64-bit constants, you might need
>>>> many instructions to create the constants, etc.
>>>
>>> Ok, I am not familiar with sparc64. The constant is just a 1
>>> in the sign bit orred with a continuous string of ones.
>>> That's usually cheap on aarch64 but may not be so on other targets.
>>
>> On GIMPLE we might still want to canonicalize to one form.  I'd
>> canonicalize to the form with "smaller" constants if the number
>> of operations is the same.
>>
>> Richard.
>>
>>> Thanks,
>>> Kyrill
>>>
>>>>         Jakub
>>>>
>>>
>

Kyrylo Tkachov July 24, 2015, 9:23 a.m. UTC | #7

On 24/07/15 10:09, Jakub Jelinek wrote:
> On Fri, Jul 24, 2015 at 09:48:30AM +0100, Kyrill Tkachov wrote:
>>>> Bootstrapped and tested on arm, aarch64, x86_64.
>>> I think this is another case that, if at all, should be done during or right
>>> before RTL expansion and should test rtx costs.
>> Hmm, where would that be?
> That is up to discussions, all I'm saying is that match.pd when run on
> GENERIC and/or anywhere among GIMPLE passes that fold stuff is undesirable.
>
> In expr.c, with TER you can detect such patterns, in this case when
> expanding the comparison, but perhaps we want a *.pd file that would have
> rules that would be only GIMPLE and only enabled in a special pass right
> before (or very close to) expansion, that would perform such instruction
> selection.

Wild idea, but could we consider having target-specific
match.pd files that can be included in the main match.pd?
That way, targets would get the target-specific folding they
benefit from at the very beginning of compilation without
stepping on other targets' toes.

Kyrill

>
>> Ok, I am not familiar with sparc64. The constant is just a 1
>> in the sign bit orred with a continuous string of ones.
>> That's usually cheap on aarch64 but may not be so on other targets.
> It has been a long time since I've done anything on sparc64, but you can
> e.g. have a look at config/sparc/sparc.c (sparc_emit_set_const64)
> to clearly see that not all constants are equal cost, some are much more
> expensive.  Seems sparc_rtx_cost does not model this accurrately, but that
> is a backend bug, so shouldn't affect the generic decisions.  Sparc does not
> have a moddi3 pattern, so your transformation might still be a win there,
> except for -Os where it would be very bad code size pessimization.
>
> All I want to say is that on GIMPLE we usually perform transformations where
> simpler (fewer operations) is considered better, and for the same number of
> operations where one sequence might be better on one target and another on
> another target, supposedly we want some other infrastructure.
>
> 	Jakub
>

Richard Biener July 24, 2015, 9:31 a.m. UTC | #8

On Fri, 24 Jul 2015, Kyrill Tkachov wrote:

> 
> On 24/07/15 10:09, Jakub Jelinek wrote:
> > On Fri, Jul 24, 2015 at 09:48:30AM +0100, Kyrill Tkachov wrote:
> > > > > Bootstrapped and tested on arm, aarch64, x86_64.
> > > > I think this is another case that, if at all, should be done during or
> > > > right
> > > > before RTL expansion and should test rtx costs.
> > > Hmm, where would that be?
> > That is up to discussions, all I'm saying is that match.pd when run on
> > GENERIC and/or anywhere among GIMPLE passes that fold stuff is undesirable.
> > 
> > In expr.c, with TER you can detect such patterns, in this case when
> > expanding the comparison, but perhaps we want a *.pd file that would have
> > rules that would be only GIMPLE and only enabled in a special pass right
> > before (or very close to) expansion, that would perform such instruction
> > selection.

Yes, we can do that - that .pd could also be target specific.

> Wild idea, but could it be considered to have target-specific
> match.pd files that can be included in the main match.pd?
>  That way, targets would get the benefit of getting
> the target-specific folding they benefit from at the very beginning
> of compilation without stepping on other targets toes.

The patterns would interact with those in match.pd, so it adds extra
complication in testing (would need to test all targets) to not
run into infinite recursions for example.

It will also make adding testcases that work on all targets harder
as the IL presented to passes could then wildly differ...

So not sure if we want that (from the very beginning of the compilation).

Richard.

> Kyrill
> 
> > 
> > > Ok, I am not familiar with sparc64. The constant is just a 1
> > > in the sign bit orred with a continuous string of ones.
> > > That's usually cheap on aarch64 but may not be so on other targets.
> > It has been a long time since I've done anything on sparc64, but you can
> > e.g. have a look at config/sparc/sparc.c (sparc_emit_set_const64)
> > to clearly see that not all constants are equal cost, some are much more
> > expensive.  Seems sparc_rtx_cost does not model this accurrately, but that
> > is a backend bug, so shouldn't affect the generic decisions.  Sparc does not
> > have a moddi3 pattern, so your transformation might still be a win there,
> > except for -Os where it would be very bad code size pessimization.
> > 
> > All I want to say is that on GIMPLE we usually perform transformations where
> > simpler (fewer operations) is considered better, and for the same number of
> > operations where one sequence might be better on one target and another on
> > another target, supposedly we want some other infrastructure.
> > 
> > 	Jakub
> > 
> 
>

Jakub Jelinek July 24, 2015, 9:36 a.m. UTC | #9

On Fri, Jul 24, 2015 at 10:23:59AM +0100, Kyrill Tkachov wrote:
> On 24/07/15 10:09, Jakub Jelinek wrote:
> >On Fri, Jul 24, 2015 at 09:48:30AM +0100, Kyrill Tkachov wrote:
> >>>>Bootstrapped and tested on arm, aarch64, x86_64.
> >>>I think this is another case that, if at all, should be done during or right
> >>>before RTL expansion and should test rtx costs.
> >>Hmm, where would that be?
> >That is up to discussions, all I'm saying is that match.pd when run on
> >GENERIC and/or anywhere among GIMPLE passes that fold stuff is undesirable.
> >
> >In expr.c, with TER you can detect such patterns, in this case when
> >expanding the comparison, but perhaps we want a *.pd file that would have
> >rules that would be only GIMPLE and only enabled in a special pass right
> >before (or very close to) expansion, that would perform such instruction
> >selection.
> 
> Wild idea, but could it be considered to have target-specific
> match.pd files that can be included in the main match.pd?
>  That way, targets would get the benefit of getting
> the target-specific folding they benefit from at the very beginning
> of compilation without stepping on other targets toes.

I'd strongly prefer that isn't done.  First of all, you really don't want to
make target-specific foldings during generic folding (yeah, there are
already cases where it is done, but we want to avoid it), or during early
GIMPLE, ideally that should be late GIMPLE only where we introduce gradually
more and more target dependencies.  By making GIMPLE more target specific
earlier, you break e.g. the offloading stuff, but also introduce target
specific bugs more often to GIMPLE optimizers (these days, most GIMPLE
optimizer issues (especially before vectorizer/ivopts and other target
specific changes) affect all targets, or are perhaps ILP32/LP64 etc. related
at most).  Also, by allowing target-specific foldings, we'd end up with
duplication between different targets.  Using rtx_costs, maintaining them more
accurately and performing some generic code selection based on them is IMHO
the desirable approach.

	Jakub

Ramana Radhakrishnan July 24, 2015, 9:44 a.m. UTC | #10

>>
>> In expr.c, with TER you can detect such patterns, in this case when
>> expanding the comparison, but perhaps we want a *.pd file that would have
>> rules that would be only GIMPLE and only enabled in a special pass right
>> before (or very close to) expansion, that would perform such instruction
>> selection.
>
>
> Wild idea, but could it be considered to have target-specific
> match.pd files that can be included in the main match.pd?
>  That way, targets would get the benefit of getting
> the target-specific folding they benefit from at the very beginning
> of compilation without stepping on other targets toes.


The downsides are duplication across targets, potentially fewer "generic"
improvements, and a maintenance headache for the gimple optimizers.

regards
Ramana


>
> Kyrill
>
>
>>
>>> Ok, I am not familiar with sparc64. The constant is just a 1
>>> in the sign bit orred with a continuous string of ones.
>>> That's usually cheap on aarch64 but may not be so on other targets.
>>
>> It has been a long time since I've done anything on sparc64, but you can
>> e.g. have a look at config/sparc/sparc.c (sparc_emit_set_const64)
>> to clearly see that not all constants are equal cost, some are much more
>> expensive.  Seems sparc_rtx_cost does not model this accurrately, but that
>> is a backend bug, so shouldn't affect the generic decisions.  Sparc does
>> not
>> have a moddi3 pattern, so your transformation might still be a win there,
>> except for -Os where it would be very bad code size pessimization.
>>
>> All I want to say is that on GIMPLE we usually perform transformations
>> where
>> simpler (fewer operations) is considered better, and for the same number
>> of
>> operations where one sequence might be better on one target and another on
>> another target, supposedly we want some other infrastructure.
>>
>>         Jakub
>>
>

Jeff Law July 24, 2015, 6:35 p.m. UTC | #11

On 07/24/2015 03:44 AM, Ramana Radhakrishnan wrote:
>>>
>>> In expr.c, with TER you can detect such patterns, in this case when
>>> expanding the comparison, but perhaps we want a *.pd file that would have
>>> rules that would be only GIMPLE and only enabled in a special pass right
>>> before (or very close to) expansion, that would perform such instruction
>>> selection.
>>
>>
>> Wild idea, but could it be considered to have target-specific
>> match.pd files that can be included in the main match.pd?
>>   That way, targets would get the benefit of getting
>> the target-specific folding they benefit from at the very beginning
>> of compilation without stepping on other targets toes.
>
>
> The downside is preventing duplication, potentially reducing "generic"
> improvements and a maintenance headache for gimple optimizers.
So how about wedding the two ideas that have sprouted out of this
discussion?  Specifically, have a pass apply a target-specific
match.pd, but only at the end of the gimple optimization pipeline.

The design goal would (of course) be to change representations in ways 
that allow the gimple->rtl expanders to generate more efficient code for 
the target.

It avoids introducing the target bits early in the gimple pipeline, but 
still gives a clean way for targets to rewrite gimple for the benefit of 
gimple->rtl expansion.

jeff

Segher Boessenkool July 25, 2015, 2:19 a.m. UTC | #12

On Fri, Jul 24, 2015 at 09:09:39AM +0100, Kyrill Tkachov wrote:
> This transformation folds (X % C) == N into
> X & ((1 << (size - 1)) | (C - 1))) == N
> for constants C and N where N is positive and C is a power of 2.

For N = 0 you can transform it to

	((unsigned)X % C) == N

and for 0 < N < C you can transform it to

	X > 0 && ((unsigned)X % C) == N          (or X >= 0)

and for -C < N < 0 it is

	X < 0 && ((unsigned)X % C) == N + C      (or X <= 0)

and for other N it is

	0.

For N not a constant, well, do you really care?  :-)

(That second case might eventually fold to your original expression).
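
The case analysis above can be sketched in C and checked against the plain signed remainder. This is only an illustration under the assumption of two's complement `int`; the function names `mod_eq_cases` and `cases_agree` are invented here and do not come from GCC.

```c
#include <assert.h>
#include <stdbool.h>

/* (x % c) == n rewritten with an unsigned remainder plus a sign test,
   following the four cases listed above.  C must be a positive power
   of 2.  */
static bool mod_eq_cases(int x, int c, int n)
{
  assert(c > 0 && (c & (c - 1)) == 0);
  unsigned ux = (unsigned)x, uc = (unsigned)c;

  if (n == 0)
    return (ux % uc) == 0;
  if (n > 0 && n < c)
    return x > 0 && (ux % uc) == (unsigned)n;
  if (n < 0 && n > -c)
    return x < 0 && (ux % uc) == (unsigned)(n + c);
  return false;  /* |n| >= c: the remainder can never equal n.  */
}

/* Returns true if the rewrite matches x % c == n for every x in
   [lo, hi] and every n in [-2c, 2c].  */
static bool cases_agree(int lo, int hi, int c)
{
  for (int x = lo; x <= hi; x++)
    for (int n = -2 * c; n <= 2 * c; n++)
      if (((x % c) == n) != mod_eq_cases(x, c, n))
        return false;
  return true;
}
```

For instance, with x = -5, C = 8 and N = -5 the third case applies: the low three bits of x are 3, which equals N + C.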


Segher

Richard Biener July 27, 2015, 7:36 a.m. UTC | #13

On Fri, 24 Jul 2015, Jeff Law wrote:

> On 07/24/2015 03:44 AM, Ramana Radhakrishnan wrote:
> > > > 
> > > > In expr.c, with TER you can detect such patterns, in this case when
> > > > expanding the comparison, but perhaps we want a *.pd file that would
> > > > have
> > > > rules that would be only GIMPLE and only enabled in a special pass right
> > > > before (or very close to) expansion, that would perform such instruction
> > > > selection.
> > > 
> > > 
> > > Wild idea, but could it be considered to have target-specific
> > > match.pd files that can be included in the main match.pd?
> > >   That way, targets would get the benefit of getting
> > > the target-specific folding they benefit from at the very beginning
> > > of compilation without stepping on other targets toes.
> > 
> > 
> > The downside is preventing duplication, potentially reducing "generic"
> > improvements and a maintenance headache for gimple optimizers.
> So how about wedding the two ideas that have sprouted out of this discussion.
> Specifically having a pass apply a target specific match.pd, but only do so at
> the end of the gimple optimization pipeline?
> 
> The design goal would (of course) be to change representations in ways that
> allow the gimple->rtl expanders to generate more efficient code for the
> target.
> 
> It avoids introducing the target bits early in the gimple pipeline, but still
> gives a clean way for targets to rewrite gimple for the benefit of gimple->rtl
> expansion.

I think it also aligns with the idea of pushing back RTL expansion and
expose some target specifics after another GIMPLE lowering phase.
I'm also thinking of addressing-mode selection and register promotion.

So at least if we think of that target specific match.pd pass as
containing all RTL expansion tricks done with TER only then it
should be quite simple to make it work.

Richard.

Kyrylo Tkachov July 27, 2015, 8:11 a.m. UTC | #14

On 25/07/15 03:19, Segher Boessenkool wrote:
> On Fri, Jul 24, 2015 at 09:09:39AM +0100, Kyrill Tkachov wrote:
>> This transformation folds (X % C) == N into
>> X & ((1 << (size - 1)) | (C - 1))) == N
>> for constants C and N where N is positive and C is a power of 2.
> For N = 0 you can transform it to
>
> 	((unsigned)X % C) == N
>
> and for 0 < N < C you can transform it to
>
> 	X > 0 && ((unsigned)X % C) == N          (or X >= 0)
>
> and for -C < N < 0 it is
>
> 	X < 0 && ((unsigned)X % C) == N + C      (or X <= 0)
>
> and for other N it is
>
> 	0.
>
> For N not a constant, well, do you really care?  :-)
>
> (That second case might eventually fold to your original expression).

Yeah, these avoid the potentially expensive mask but introduce more operations,
which I believe may not be desirable at this stage.
Unless these transformations are ok for match.pd, I'll try to implement this
transformation at RTL expansion time.

Thanks,
Kyrill

>
>
> Segher
>

Jeff Law July 27, 2015, 3:28 p.m. UTC | #15

On 07/27/2015 01:36 AM, Richard Biener wrote:
>
> I think it also aligns with the idea of pushing back RTL expansion and
> expose some target specifics after another GIMPLE lowering phase.
> I'm also thinking of addressing-mode selection and register promotion.
>
> So at least if we think of that target specific match.pd pass as
> containing all RTL expansion tricks done with TER only then it
> should be quite simple to make it work.
Yea -- I was also thinking it might allow us to clean up some of the 
actual expansion code as well.  It seems promising enough for some 
experimentation.

jeff

Segher Boessenkool July 27, 2015, 3:36 p.m. UTC | #16

On Mon, Jul 27, 2015 at 09:11:12AM +0100, Kyrill Tkachov wrote:
> On 25/07/15 03:19, Segher Boessenkool wrote:
> >On Fri, Jul 24, 2015 at 09:09:39AM +0100, Kyrill Tkachov wrote:
> >>This transformation folds (X % C) == N into
> >>X & ((1 << (size - 1)) | (C - 1))) == N
> >>for constants C and N where N is positive and C is a power of 2.
> >For N = 0 you can transform it to
> >
> >	((unsigned)X % C) == N
> >
> >and for 0 < N < C you can transform it to
> >
> >	X > 0 && ((unsigned)X % C) == N          (or X >= 0)
> >
> >and for -C < N < 0 it is
> >
> >	X < 0 && ((unsigned)X % C) == N + C      (or X <= 0)
> >
> >and for other N it is
> >
> >	0.
> >
> >For N not a constant, well, do you really care?  :-)
> >
> >(That second case might eventually fold to your original expression).
> 
> Yeah, these avoid the potentially expensive mask,

Fun fact: the current code ends up using the exact same mask, for some
targets.

> but introduce more operations,
> which I believe may not be desirable at this stage.

It is getting rid of the (expensive) division/modulo.  In many cases it
could get rid of the sign test, or hoist it to some outer structure, hard
to test here though (at least, I have no idea how to do that).
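
One situation where the sign test disappears entirely, sketched under the assumption that the value being tested is a loop counter known to be non-negative. Both functions and their names are illustrative only; `r` is assumed to satisfy 0 <= r < c.

```c
#include <assert.h>

/* Count i in [0, n) with i % c == r, using the signed remainder.  */
static unsigned count_mod(int n, int c, int r)
{
  unsigned count = 0;
  for (int i = 0; i < n; i++)
    if (i % c == r)
      count++;
  return count;
}

/* Same count, relying on i >= 0 inside the loop: no sign test is
   needed and only the C - 1 mask remains.  */
static unsigned count_masked(int n, int c, int r)
{
  assert(c > 0 && (c & (c - 1)) == 0);
  assert(r >= 0 && r < c);
  unsigned mask = (unsigned)c - 1;
  unsigned count = 0;
  for (int i = 0; i < n; i++)
    if (((unsigned)i & mask) == (unsigned)r)
      count++;
  return count;
}
```

Here the loop bounds play the role of the "outer structure": the range of `i` already excludes negative values, so the cheap mask C - 1 suffices.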

> Unless these transformations are ok for match.pd I'll try to implement this 
> transformation
> at RTL expansion time.

If you have to do conditional jumps, the RTL optimisers will not be able
to do very much :-(


Segher
