Message ID | 55B1F2C3.2000903@arm.com |
---|---|
State | New |
Headers | show |
On Fri, Jul 24, 2015 at 09:09:39AM +0100, Kyrill Tkachov wrote: > This transformation folds (X % C) == N into > X & ((1 << (size - 1)) | (C - 1))) == N > for constants C and N where N is positive and C is a power of 2. > > The idea is similar to the existing X % C -> X & (C - 1) transformation > for unsigned values but this time when we have the comparison we use the > C - 1 mask (all 1s for power of 2 N) orred with the sign bit. > > At runtime, if X is positive then X & ((1 << (size - 1)) | (C - 1))) > calculates X % C, which is compared against the positive N. > > If X is negative then X & ((1 << (size - 1)) | (C - 1))) doesn't calculate > X % C but since we also catch the set sign bit, it will never compare equal > to N (which is positive), so we still get the right comparison result. > > I don't have much experience with writing match.pd patterns, so I appreciate > any feedback on how to write this properly. > > Bootstrapped and tested on arm, aarch64, x86_64. I think this is another case that, if at all, should be done during or right before RTL expansion and should test rtx costs. Because, ((1 << (size - 1)) | (C - 1))) constant might be very expensive, while C cheap, and % might not be that expensive compared to & to offset that. E.g. on x86_64, for 32-bit and smaller X the constant is cheap as any other (well, if we don't take instruction size into account), but 64-bit constant is at least 3 times more expensive (movabsq is needed with its latency). In the x86_64 case supposedly the divmod is still more expensive, but there are many other targets. On sparc64 for 64-bit constants, you might need many instructions to create the constants, etc. Jakub
On 24/07/15 09:23, Jakub Jelinek wrote: > On Fri, Jul 24, 2015 at 09:09:39AM +0100, Kyrill Tkachov wrote: >> This transformation folds (X % C) == N into >> X & ((1 << (size - 1)) | (C - 1))) == N >> for constants C and N where N is positive and C is a power of 2. >> >> The idea is similar to the existing X % C -> X & (C - 1) transformation >> for unsigned values but this time when we have the comparison we use the >> C - 1 mask (all 1s for power of 2 N) orred with the sign bit. >> >> At runtime, if X is positive then X & ((1 << (size - 1)) | (C - 1))) >> calculates X % C, which is compared against the positive N. >> >> If X is negative then X & ((1 << (size - 1)) | (C - 1))) doesn't calculate >> X % C but since we also catch the set sign bit, it will never compare equal >> to N (which is positive), so we still get the right comparison result. >> >> I don't have much experience with writing match.pd patterns, so I appreciate >> any feedback on how to write this properly. >> >> Bootstrapped and tested on arm, aarch64, x86_64. > I think this is another case that, if at all, should be done during or right > before RTL expansion and should test rtx costs. Hmm, where would that be? expmed.c has a lot of code to expand div or mod by a power of 2. In what form would a compare with a mod by power of 2 arrive to the expansion phase? > Because, ((1 << (size - 1)) | (C - 1))) constant might be very expensive, > while C cheap, and % might not be that expensive compared to & to offset > that. > > E.g. on x86_64, for 32-bit and smaller X the constant is cheap as any other > (well, if we don't take instruction size into account), but 64-bit constant > is at least 3 times more expensive (movabsq is needed with its latency). > In the x86_64 case supposedly the divmod is still more expensive, but there > are many other targets. On sparc64 for 64-bit constants, you might need > many instructions to create the constants, etc. Ok, I am not familiar with sparc64. The constant is just a 1 in the sign bit orred with a continuous string of ones. That's usually cheap on aarch64 but may not be so on other targets. Thanks, Kyrill > > Jakub >
On Fri, 24 Jul 2015, Kyrill Tkachov wrote: > > On 24/07/15 09:23, Jakub Jelinek wrote: > > On Fri, Jul 24, 2015 at 09:09:39AM +0100, Kyrill Tkachov wrote: > > > This transformation folds (X % C) == N into > > > X & ((1 << (size - 1)) | (C - 1))) == N > > > for constants C and N where N is positive and C is a power of 2. > > > > > > The idea is similar to the existing X % C -> X & (C - 1) transformation > > > for unsigned values but this time when we have the comparison we use the > > > C - 1 mask (all 1s for power of 2 N) orred with the sign bit. > > > > > > At runtime, if X is positive then X & ((1 << (size - 1)) | (C - 1))) > > > calculates X % C, which is compared against the positive N. > > > > > > If X is negative then X & ((1 << (size - 1)) | (C - 1))) doesn't calculate > > > X % C but since we also catch the set sign bit, it will never compare > > > equal > > > to N (which is positive), so we still get the right comparison result. > > > > > > I don't have much experience with writing match.pd patterns, so I > > > appreciate > > > any feedback on how to write this properly. > > > > > > Bootstrapped and tested on arm, aarch64, x86_64. > > I think this is another case that, if at all, should be done during or right > > before RTL expansion and should test rtx costs. > > Hmm, where would that be? > expmed.c has a lot of code to expand div or mod by a power of 2. > In what form would a compare with a mod by power of 2 arrive to > the expansion phase? It arrives as SSA_NAME == N and you can use get_gimple_for_ssa_name or get_def_for_expr to get at the defining stmt if that is possible (it's still unexpanded and thus TERed) and expand a different expression. But why can't simplify-rtx via combine handle this - it should have access to target costs. > > Because, ((1 << (size - 1)) | (C - 1))) constant might be very expensive, > > while C cheap, and % might not be that expensive compared to & to offset > > that. > > > > E.g. on x86_64, for 32-bit and smaller X the constant is cheap as any other > > (well, if we don't take instruction size into account), but 64-bit constant > > is at least 3 times more expensive (movabsq is needed with its latency). > > In the x86_64 case supposedly the divmod is still more expensive, but there > > are many other targets. On sparc64 for 64-bit constants, you might need > > many instructions to create the constants, etc. > > Ok, I am not familiar with sparc64. The constant is just a 1 > in the sign bit orred with a continuous string of ones. > That's usually cheap on aarch64 but may not be so on other targets. On GIMPLE we might still want to canonicalize to one form. I'd canonicalize to the form with "smaller" constants if the number of operations is the same. Richard. > Thanks, > Kyrill > > > > > Jakub > > > >
On 24/07/15 10:00, Richard Biener wrote: > On Fri, 24 Jul 2015, Kyrill Tkachov wrote: > >> On 24/07/15 09:23, Jakub Jelinek wrote: >>> On Fri, Jul 24, 2015 at 09:09:39AM +0100, Kyrill Tkachov wrote: >>>> This transformation folds (X % C) == N into >>>> X & ((1 << (size - 1)) | (C - 1))) == N >>>> for constants C and N where N is positive and C is a power of 2. >>>> >>>> The idea is similar to the existing X % C -> X & (C - 1) transformation >>>> for unsigned values but this time when we have the comparison we use the >>>> C - 1 mask (all 1s for power of 2 N) orred with the sign bit. >>>> >>>> At runtime, if X is positive then X & ((1 << (size - 1)) | (C - 1))) >>>> calculates X % C, which is compared against the positive N. >>>> >>>> If X is negative then X & ((1 << (size - 1)) | (C - 1))) doesn't calculate >>>> X % C but since we also catch the set sign bit, it will never compare >>>> equal >>>> to N (which is positive), so we still get the right comparison result. >>>> >>>> I don't have much experience with writing match.pd patterns, so I >>>> appreciate >>>> any feedback on how to write this properly. >>>> >>>> Bootstrapped and tested on arm, aarch64, x86_64. >>> I think this is another case that, if at all, should be done during or right >>> before RTL expansion and should test rtx costs. >> Hmm, where would that be? >> expmed.c has a lot of code to expand div or mod by a power of 2. >> In what form would a compare with a mod by power of 2 arrive to >> the expansion phase? > It arrives as SSA_NAME == N and you can use get_gimple_for_ssa_name > or get_def_for_expr to get at the defining stmt if that is possible > (it's still unexpanded and thus TERed) and expand a different > expression. Thanks, so it's where we expand compares... (what's TERed?) > > But why can't simplify-rtx via combine handle this - it should have > access to target costs. That would require for the target to expand to an SMOD rtx which, if the target has no direct instruction for would be somewhat awkward. Thanks, Kyrill > >>> Because, ((1 << (size - 1)) | (C - 1))) constant might be very expensive, >>> while C cheap, and % might not be that expensive compared to & to offset >>> that. >>> >>> E.g. on x86_64, for 32-bit and smaller X the constant is cheap as any other >>> (well, if we don't take instruction size into account), but 64-bit constant >>> is at least 3 times more expensive (movabsq is needed with its latency). >>> In the x86_64 case supposedly the divmod is still more expensive, but there >>> are many other targets. On sparc64 for 64-bit constants, you might need >>> many instructions to create the constants, etc. >> Ok, I am not familiar with sparc64. The constant is just a 1 >> in the sign bit orred with a continuous string of ones. >> That's usually cheap on aarch64 but may not be so on other targets. > On GIMPLE we might still want to canonicalize to one form. I'd > canonicalize to the form with "smaller" constants if the number > of operations is the same. > > Richard. > >> Thanks, >> Kyrill >> >>> Jakub >>> >>
On Fri, Jul 24, 2015 at 09:48:30AM +0100, Kyrill Tkachov wrote: > >>Bootstrapped and tested on arm, aarch64, x86_64. > >I think this is another case that, if at all, should be done during or right > >before RTL expansion and should test rtx costs. > > Hmm, where would that be? That is up to discussions, all I'm saying is that match.pd when run on GENERIC and/or anywhere among GIMPLE passes that fold stuff is undesirable. In expr.c, with TER you can detect such patterns, in this case when expanding the comparison, but perhaps we want a *.pd file that would have rules that would be only GIMPLE and only enabled in a special pass right before (or very close to) expansion, that would perform such instruction selection. > Ok, I am not familiar with sparc64. The constant is just a 1 > in the sign bit orred with a continuous string of ones. > That's usually cheap on aarch64 but may not be so on other targets. It has been a long time since I've done anything on sparc64, but you can e.g. have a look at config/sparc/sparc.c (sparc_emit_set_const64) to clearly see that not all constants are equal cost, some are much more expensive. Seems sparc_rtx_cost does not model this accurrately, but that is a backend bug, so shouldn't affect the generic decisions. Sparc does not have a moddi3 pattern, so your transformation might still be a win there, except for -Os where it would be very bad code size pessimization. All I want to say is that on GIMPLE we usually perform transformations where simpler (fewer operations) is considered better, and for the same number of operations where one sequence might be better on one target and another on another target, supposedly we want some other infrastructure. Jakub
On Fri, Jul 24, 2015 at 10:04 AM, Kyrill Tkachov <kyrylo.tkachov@arm.com> wrote: >> It arrives as SSA_NAME == N and you can use get_gimple_for_ssa_name >> or get_def_for_expr to get at the defining stmt if that is possible >> (it's still unexpanded and thus TERed) and expand a different >> expression. > > > Thanks, so it's where we expand compares... (what's TERed?) Temporary Expression Replacement - IIRC something done when you just come out of ssa . tree-ssa-ter.[hc]. Ramana > >> >> But why can't simplify-rtx via combine handle this - it should have >> access to target costs. > > > That would require for the target to expand to an SMOD rtx > which, if the target has no direct instruction for would be somewhat > awkward. > > Thanks, > Kyrill > > >> >>>> Because, ((1 << (size - 1)) | (C - 1))) constant might be very >>>> expensive, >>>> while C cheap, and % might not be that expensive compared to & to offset >>>> that. >>>> >>>> E.g. on x86_64, for 32-bit and smaller X the constant is cheap as any >>>> other >>>> (well, if we don't take instruction size into account), but 64-bit >>>> constant >>>> is at least 3 times more expensive (movabsq is needed with its latency). >>>> In the x86_64 case supposedly the divmod is still more expensive, but >>>> there >>>> are many other targets. On sparc64 for 64-bit constants, you might need >>>> many instructions to create the constants, etc. >>> >>> Ok, I am not familiar with sparc64. The constant is just a 1 >>> in the sign bit orred with a continuous string of ones. >>> That's usually cheap on aarch64 but may not be so on other targets. >> >> On GIMPLE we might still want to canonicalize to one form. I'd >> canonicalize to the form with "smaller" constants if the number >> of operations is the same. >> >> Richard. >> >>> Thanks, >>> Kyrill >>> >>>> Jakub >>>> >>> >
On 24/07/15 10:09, Jakub Jelinek wrote: > On Fri, Jul 24, 2015 at 09:48:30AM +0100, Kyrill Tkachov wrote: >>>> Bootstrapped and tested on arm, aarch64, x86_64. >>> I think this is another case that, if at all, should be done during or right >>> before RTL expansion and should test rtx costs. >> Hmm, where would that be? > That is up to discussions, all I'm saying is that match.pd when run on > GENERIC and/or anywhere among GIMPLE passes that fold stuff is undesirable. > > In expr.c, with TER you can detect such patterns, in this case when > expanding the comparison, but perhaps we want a *.pd file that would have > rules that would be only GIMPLE and only enabled in a special pass right > before (or very close to) expansion, that would perform such instruction > selection. Wild idea, but could it be considered to have target-specific match.pd files that can be included in the main match.pd? That way, targets would get the benefit of getting the target-specific folding they benefit from at the very beginning of compilation without stepping on other targets toes. Kyrill > >> Ok, I am not familiar with sparc64. The constant is just a 1 >> in the sign bit orred with a continuous string of ones. >> That's usually cheap on aarch64 but may not be so on other targets. > It has been a long time since I've done anything on sparc64, but you can > e.g. have a look at config/sparc/sparc.c (sparc_emit_set_const64) > to clearly see that not all constants are equal cost, some are much more > expensive. Seems sparc_rtx_cost does not model this accurrately, but that > is a backend bug, so shouldn't affect the generic decisions. Sparc does not > have a moddi3 pattern, so your transformation might still be a win there, > except for -Os where it would be very bad code size pessimization. > > All I want to say is that on GIMPLE we usually perform transformations where > simpler (fewer operations) is considered better, and for the same number of > operations where one sequence might be better on one target and another on > another target, supposedly we want some other infrastructure. > > Jakub >
On Fri, 24 Jul 2015, Kyrill Tkachov wrote: > > On 24/07/15 10:09, Jakub Jelinek wrote: > > On Fri, Jul 24, 2015 at 09:48:30AM +0100, Kyrill Tkachov wrote: > > > > > Bootstrapped and tested on arm, aarch64, x86_64. > > > > I think this is another case that, if at all, should be done during or > > > > right > > > > before RTL expansion and should test rtx costs. > > > Hmm, where would that be? > > That is up to discussions, all I'm saying is that match.pd when run on > > GENERIC and/or anywhere among GIMPLE passes that fold stuff is undesirable. > > > > In expr.c, with TER you can detect such patterns, in this case when > > expanding the comparison, but perhaps we want a *.pd file that would have > > rules that would be only GIMPLE and only enabled in a special pass right > > before (or very close to) expansion, that would perform such instruction > > selection. Yes, we can do that - that .pd could also be target specific. > Wild idea, but could it be considered to have target-specific > match.pd files that can be included in the main match.pd? > That way, targets would get the benefit of getting > the target-specific folding they benefit from at the very beginning > of compilation without stepping on other targets toes. The patterns would interact with those in match.pd, so it adds extra complication in testing (would need to test all targets) to not run into infinite recursions for example. It will also make adding testcases that work on all targets harder as the IL presented to passes could then wildly differ... So not sure if we want that (from the very beginning of the compilation). Richard. > Kyrill > > > > > > Ok, I am not familiar with sparc64. The constant is just a 1 > > > in the sign bit orred with a continuous string of ones. > > > That's usually cheap on aarch64 but may not be so on other targets. > > It has been a long time since I've done anything on sparc64, but you can > > e.g. have a look at config/sparc/sparc.c (sparc_emit_set_const64) > > to clearly see that not all constants are equal cost, some are much more > > expensive. Seems sparc_rtx_cost does not model this accurrately, but that > > is a backend bug, so shouldn't affect the generic decisions. Sparc does not > > have a moddi3 pattern, so your transformation might still be a win there, > > except for -Os where it would be very bad code size pessimization. > > > > All I want to say is that on GIMPLE we usually perform transformations where > > simpler (fewer operations) is considered better, and for the same number of > > operations where one sequence might be better on one target and another on > > another target, supposedly we want some other infrastructure. > > > > Jakub > > > >
On Fri, Jul 24, 2015 at 10:23:59AM +0100, Kyrill Tkachov wrote: > On 24/07/15 10:09, Jakub Jelinek wrote: > >On Fri, Jul 24, 2015 at 09:48:30AM +0100, Kyrill Tkachov wrote: > >>>>Bootstrapped and tested on arm, aarch64, x86_64. > >>>I think this is another case that, if at all, should be done during or right > >>>before RTL expansion and should test rtx costs. > >>Hmm, where would that be? > >That is up to discussions, all I'm saying is that match.pd when run on > >GENERIC and/or anywhere among GIMPLE passes that fold stuff is undesirable. > > > >In expr.c, with TER you can detect such patterns, in this case when > >expanding the comparison, but perhaps we want a *.pd file that would have > >rules that would be only GIMPLE and only enabled in a special pass right > >before (or very close to) expansion, that would perform such instruction > >selection. > > Wild idea, but could it be considered to have target-specific > match.pd files that can be included in the main match.pd? > That way, targets would get the benefit of getting > the target-specific folding they benefit from at the very beginning > of compilation without stepping on other targets toes. I'd strongly prefer that isn't done. First of all, you really don't want to make target-specific foldings during generic folding (yeah, there are already cases where it is done, but we want to avoid it), or during early GIMPLE, ideally that should be late GIMPLE only where we introduce gradually more and more target dependencies. By making GIMPLE more target specific earlier, you break e.g. the offloading stuff, but also introduce target specific bugs more often to GIMPLE optimizers (these days, most GIMPLE optimizer issues (especially before vectorizer/ivopts and other target specific changes) affect all targets, or are perhaps ILP32/LP64 etc. related at most). Also, by allowing target-specific foldings, we'd end up with duplications between different target, using rtx_costs, maintaining them more accurrately and performing some generic code selections based on them IMHO is desirable. Jakub
>> >> In expr.c, with TER you can detect such patterns, in this case when >> expanding the comparison, but perhaps we want a *.pd file that would have >> rules that would be only GIMPLE and only enabled in a special pass right >> before (or very close to) expansion, that would perform such instruction >> selection. > > > Wild idea, but could it be considered to have target-specific > match.pd files that can be included in the main match.pd? > That way, targets would get the benefit of getting > the target-specific folding they benefit from at the very beginning > of compilation without stepping on other targets toes. The downside is preventing duplication, potentially reducing "generic" improvements and a maintenance headache for gimple optimizers. regards Ramana > > Kyrill > > >> >>> Ok, I am not familiar with sparc64. The constant is just a 1 >>> in the sign bit orred with a continuous string of ones. >>> That's usually cheap on aarch64 but may not be so on other targets. >> >> It has been a long time since I've done anything on sparc64, but you can >> e.g. have a look at config/sparc/sparc.c (sparc_emit_set_const64) >> to clearly see that not all constants are equal cost, some are much more >> expensive. Seems sparc_rtx_cost does not model this accurrately, but that >> is a backend bug, so shouldn't affect the generic decisions. Sparc does >> not >> have a moddi3 pattern, so your transformation might still be a win there, >> except for -Os where it would be very bad code size pessimization. >> >> All I want to say is that on GIMPLE we usually perform transformations >> where >> simpler (fewer operations) is considered better, and for the same number >> of >> operations where one sequence might be better on one target and another on >> another target, supposedly we want some other infrastructure. >> >> Jakub >> >
On 07/24/2015 03:44 AM, Ramana Radhakrishnan wrote: >>> >>> In expr.c, with TER you can detect such patterns, in this case when >>> expanding the comparison, but perhaps we want a *.pd file that would have >>> rules that would be only GIMPLE and only enabled in a special pass right >>> before (or very close to) expansion, that would perform such instruction >>> selection. >> >> >> Wild idea, but could it be considered to have target-specific >> match.pd files that can be included in the main match.pd? >> That way, targets would get the benefit of getting >> the target-specific folding they benefit from at the very beginning >> of compilation without stepping on other targets toes. > > > The downside is preventing duplication, potentially reducing "generic" > improvements and a maintenance headache for gimple optimizers. So how about wedding the two ideas that have sprouted out of this discussion. Specifically having a pass apply a target specific match.pd, but only do so at the end of the gimple optimization pipeline? The design goal would (of course) be to change representations in ways that allow the gimple->rtl expanders to generate more efficient code for the target. It avoids introducing the target bits early in the gimple pipeline, but still gives a clean way for targets to rewrite gimple for the benefit of gimple->rtl expansion. jeff
On Fri, Jul 24, 2015 at 09:09:39AM +0100, Kyrill Tkachov wrote: > This transformation folds (X % C) == N into > X & ((1 << (size - 1)) | (C - 1))) == N > for constants C and N where N is positive and C is a power of 2. For N = 0 you can transform it to ((unsigned)X % C) == N and for 0 < N < C you can transform it to X > 0 && ((unsigned)X % C) == N (or X >= 0) and for -C < N < 0 it is X < 0 && ((unsigned)X % C) == N + C (or X <= 0) and for other N it is 0. For N not a constant, well, do you really care? :-) (That second case might eventually fold to your original expression). Segher
On Fri, 24 Jul 2015, Jeff Law wrote: > On 07/24/2015 03:44 AM, Ramana Radhakrishnan wrote: > > > > > > > > In expr.c, with TER you can detect such patterns, in this case when > > > > expanding the comparison, but perhaps we want a *.pd file that would > > > > have > > > > rules that would be only GIMPLE and only enabled in a special pass right > > > > before (or very close to) expansion, that would perform such instruction > > > > selection. > > > > > > > > > Wild idea, but could it be considered to have target-specific > > > match.pd files that can be included in the main match.pd? > > > That way, targets would get the benefit of getting > > > the target-specific folding they benefit from at the very beginning > > > of compilation without stepping on other targets toes. > > > > > > The downside is preventing duplication, potentially reducing "generic" > > improvements and a maintenance headache for gimple optimizers. > So how about wedding the two ideas that have sprouted out of this discussion. > Specifically having a pass apply a target specific match.pd, but only do so at > the end of the gimple optimization pipeline? > > The design goal would (of course) be to change representations in ways that > allow the gimple->rtl expanders to generate more efficient code for the > target. > > It avoids introducing the target bits early in the gimple pipeline, but still > gives a clean way for targets to rewrite gimple for the benefit of gimple->rtl > expansion. I think it also aligns with the idea of pushing back RTL expansion and expose some target specifics after another GIMPLE lowering phase. I'm also thinking of addressing-mode selection and register promotion. So at least if we think of that target specific match.pd pass as containing all RTL expansion tricks done with TER only then it should be quite simple to make it work. Richard.
On 25/07/15 03:19, Segher Boessenkool wrote: > On Fri, Jul 24, 2015 at 09:09:39AM +0100, Kyrill Tkachov wrote: >> This transformation folds (X % C) == N into >> X & ((1 << (size - 1)) | (C - 1))) == N >> for constants C and N where N is positive and C is a power of 2. > For N = 0 you can transform it to > > ((unsigned)X % C) == N > > and for 0 < N < C you can transform it to > > X > 0 && ((unsigned)X % C) == N (or X >= 0) > > and for -C < N < 0 it is > > X < 0 && ((unsigned)X % C) == N + C (or X <= 0) > > and for other N it is > > 0. > > For N not a constant, well, do you really care? :-) > > (That second case might eventually fold to your original expression). Yeah, these avoid the potentially expensive mask, but introduce more operations, which I believe may not be desirable at this stage. Unless these transformations are ok for match.pd I'll try to implement this transformation at RTL expansion time. Thanks, Kyrill > > > Segher >
On 07/27/2015 01:36 AM, Richard Biener wrote: > > I think it also aligns with the idea of pushing back RTL expansion and > expose some target specifics after another GIMPLE lowering phase. > I'm also thinking of addressing-mode selection and register promotion. > > So at least if we think of that target specific match.pd pass as > containing all RTL expansion tricks done with TER only then it > should be quite simple to make it work. Yea -- I was also thinking it might allow us to clean up some of the actual expansion code as well. It seems promising enough for some experimentation. jeff
On Mon, Jul 27, 2015 at 09:11:12AM +0100, Kyrill Tkachov wrote: > On 25/07/15 03:19, Segher Boessenkool wrote: > >On Fri, Jul 24, 2015 at 09:09:39AM +0100, Kyrill Tkachov wrote: > >>This transformation folds (X % C) == N into > >>X & ((1 << (size - 1)) | (C - 1))) == N > >>for constants C and N where N is positive and C is a power of 2. > >For N = 0 you can transform it to > > > > ((unsigned)X % C) == N > > > >and for 0 < N < C you can transform it to > > > > X > 0 && ((unsigned)X % C) == N (or X >= 0) > > > >and for -C < N < 0 it is > > > > X < 0 && ((unsigned)X % C) == N + C (or X <= 0) > > > >and for other N it is > > > > 0. > > > >For N not a constant, well, do you really care? :-) > > > >(That second case might eventually fold to your original expression). > > Yeah, these avoid the potentially expensive mask, Fun fact: the current code ends up using the exact same mask, for some targets. > but introduce more operations, > which I believe may not be desirable at this stage. It is getting rid of the (expensive) division/modulo. In many cases it could get rid of the sign test, or hoist it to some outer structure, hard to test here though (at least, I have no idea how to do that). > Unless these transformations are ok for match.pd I'll try to implement this > transformation > at RTL expansion time. If you have to do conditional jumps, the RTL optimisers will not be able to do very much :-( Segher
commit af785bad5cb32ef2bc35503ebbe65414b67ef8b1 Author: Kyrylo Tkachov <kyrylo.tkachov@arm.com> Date: Tue Jul 14 18:20:28 2015 +0100 [match.pd] optimize (X & C) == N when C is power of 2 diff --git a/gcc/match.pd b/gcc/match.pd index 9cf0278..02ad708 100644 --- a/gcc/match.pd +++ b/gcc/match.pd @@ -266,7 +266,9 @@ along with GCC; see the file COPYING3. If not see /* Optimize TRUNC_MOD_EXPR by a power of two into a BIT_AND_EXPR, i.e. "X % C" into "X & (C - 1)", if X and C are positive. Also optimize A % (C << N) where C is a power of 2, - to A & ((C << N) - 1). */ + to A & ((C << N) - 1). + Optimize (X % C) == N into (X & ((1 << (size - 1)) | (C - 1))) == N + when C is signed, a power of 2 and N is positive. */ (match (power_of_two_cand @1) INTEGER_CST@1) (match (power_of_two_cand @1) @@ -278,7 +280,17 @@ along with GCC; see the file COPYING3. If not see || tree_expr_nonnegative_p (@0)) && tree_nop_conversion_p (type, TREE_TYPE (@3)) && integer_pow2p (@2) && tree_int_cst_sgn (@2) > 0) - (bit_and @0 (convert (minus @1 { build_int_cst (TREE_TYPE (@1), 1); })))))) + (bit_and @0 (convert (minus @1 { build_int_cst (TREE_TYPE (@1), 1); }))))) + + (simplify + (eq (mod @0 (INTEGER_CST@1)) INTEGER_CST@2) + (if (integer_pow2p (@1) && tree_expr_nonnegative_p (@2) + && tree_to_uhwi (@2) < tree_to_uhwi (@1)) + (eq (bit_and @0 + (bit_ior (minus @1 { build_int_cst (TREE_TYPE (@1), 1); }) + { build_int_cst (type, 1 << (tree_to_uhwi (TYPE_SIZE (type)) - 1)); }) + ) + @2)))) /* X % Y is smaller than Y. */ (for cmp (lt ge) diff --git a/gcc/testsuite/gcc.dg/fold-mod-cmp-1.c b/gcc/testsuite/gcc.dg/fold-mod-cmp-1.c new file mode 100644 index 0000000..9dcf313 --- /dev/null +++ b/gcc/testsuite/gcc.dg/fold-mod-cmp-1.c @@ -0,0 +1,12 @@ +/* { dg-do compile } */ +/* { dg-options "-fdump-tree-original" } */ + +int +modcmp (int a) +{ + return (a % 32) == 6; +} + +/* The above should be simplified to (a & -2147483617) == 6. */ +/* { dg-final { scan-tree-dump "& -2147483617" "original" } } */ +/* { dg-final { scan-tree-dump "== 6" "original" } } */