Message ID | D4C76825A6780047854A11E93CDE84D02F7759@SAUSEXMBP01.amd.com |
---|---|
State | New |
On Wed, Jun 30, 2010 at 2:06 AM, Fang, Changpeng <Changpeng.Fang@amd.com> wrote:
> Hi,
>
> Attached is the patch that partially fixes bug 44576: testsuite/gfortran.dg/zero_sized_1.f90 with huge compile
> time on prefetching + peeling.
>
> The cost of compute_miss_rate in tree-ssa-loop-prefetch.c is found to be very high, and it results in an exploding
> compilation time when there are thousands of memory references in the loop.
>
> compute_miss_rate computes the probability (miss_rate) that two memory references fall on different cache
> lines.  We insert prefetches for both references if the computed miss rate is greater than the experimentally
> determined ACCEPTABLE_MISS_RATE.
>
> In this patch, compute_miss_rate is renamed to is_miss_rate_acceptable and returns early as soon as the computed
> miss rate exceeds ACCEPTABLE_MISS_RATE, which significantly reduces the cost in that case.  Note that the
> worst-case cost is not affected by this patch.
>
> This patch reduces the compile time of the test case from 1m20s to 1m03s on an amd-linux64 system.
> Note that without -fprefetch-loop-arrays, the compile time on the same system is 0m30s.  The extra 33s is mostly
> spent in "compute_all_dependences (datarefs, &dependences, vloops, true)".  I am not sure whether the cost of
> dependence analysis for a loop should be that high (with several thousand memory references in the loop) or
> whether there is a bug in the dependence computation.

Well, it seems that you compute the same information over and over
by doing

unsigned int
tree_ssa_prefetch_arrays (void)
{
...
  FOR_EACH_LOOP (li, loop, LI_FROM_INNERMOST)
    {
      if (dump_file && (dump_flags & TDF_DETAILS))
        fprintf (dump_file, "Processing loop %d:\n", loop->num);

      unrolled |= loop_prefetch_arrays (loop);

static bool
loop_prefetch_arrays (struct loop *loop)
{
...
  determine_loop_nest_reuse (loop, refs, no_other_refs);

static void
determine_loop_nest_reuse (struct loop *loop, struct mem_ref_group *refs,
                           bool no_other_refs)
{
...
  if (loop->inner)
    return;
...
  /* Find the outermost loop of the loop nest of loop (we require that
     there are no sibling loops inside the nest).  */
  nest = loop;
  while (1)
    {
      aloop = loop_outer (nest);

      if (aloop == current_loops->tree_root
          || aloop->inner->next)
        break;

      nest = aloop;
    }
...
  compute_all_dependences (datarefs, &dependences, vloops, true);


In fact you should iterate _only_ over innermost loops like

  FOR_EACH_LOOP (li, loop, LI_ONLY_INNERMOST)

does that make a difference?

>
> The patch passed bootstrapping and regression tests.
>
> Is this patch OK to commit?

Ok.

Thanks,
Richard.

> Thanks,
>
> Changpeng
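For illustration, a minimal standalone C sketch of the early-return idea described in the patch summary above. The function name miss_rate_acceptable, the simplified types, and the ACCEPTABLE_MISS_RATE value used here are placeholders for this sketch rather than the actual GCC code; the real change is the diff at the end of the thread.

```c
#include <stdbool.h>
#include <stdio.h>

/* Illustrative threshold (per mille); the value is a placeholder.  */
#define ACCEPTABLE_MISS_RATE 50

/* Simplified model of is_miss_rate_acceptable: instead of computing the
   full miss rate and comparing it afterwards, return false as soon as the
   number of missing positions exceeds the acceptable maximum.  */
static bool
miss_rate_acceptable (long cache_line_size, long step, long delta,
                      long distinct_iters, long align_unit)
{
  /* References further apart than a cache line always miss.  */
  if (delta >= cache_line_size)
    return false;

  long total_positions = (cache_line_size / align_unit) * distinct_iters;
  long max_allowed_misses = ACCEPTABLE_MISS_RATE * total_positions / 1000;
  long miss_positions = 0;

  /* Try all alignments of the first reference within its cache line.  */
  for (long align = 0; align < cache_line_size; align += align_unit)
    for (long iter = 0; iter < distinct_iters; iter++)
      {
        long address1 = align + step * iter;
        long address2 = address1 + delta;
        if (address1 / cache_line_size != address2 / cache_line_size)
          {
            miss_positions++;
            if (miss_positions > max_allowed_misses)
              return false;   /* early return: acceptable count exceeded */
          }
      }
  return true;
}

int
main (void)
{
  /* 64-byte cache lines, 256-byte stride, references 8 bytes apart.  */
  printf ("acceptable: %d\n",
          miss_rate_acceptable (64, 256, 8, 4, 8));
  return 0;
}
```

The point is that the old compute_miss_rate always visited all (cache_line_size / align_unit) * distinct_iters positions before comparing against the threshold; with the early return, pairs that clearly miss too often stop after only a few iterations, which is where the 1m20s to 1m03s improvement comes from.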
> FOR_EACH_LOOP (li, loop, LI_ONLY_INNERMOST)

> does that make a difference?

This doesn't help, because "compute_all_dependences" was called the same number of times
as before.

(BTW, should we limit prefetching to innermost loops only?)

In this test case, there are 6 large loops, where each loop has 729 memory references.
It takes 4~5 seconds to "compute_all_dependences" for one such loop.

> The patch passed bootstrapping and regression tests.
>
> Is this patch OK to commit?

> Ok.

Thank you.

Changpeng
On Wed, Jun 30, 2010 at 7:34 PM, Fang, Changpeng <Changpeng.Fang@amd.com> wrote:
>> FOR_EACH_LOOP (li, loop, LI_ONLY_INNERMOST)
>
>> does that make a difference?
>
> This doesn't help, because "compute_all_dependences" was called the same number of times
> as before.
>
> (BTW, should we limit prefetching to innermost loops only?)
>
> In this test case, there are 6 large loops, where each loop has 729 memory references.
> It takes 4~5 seconds to "compute_all_dependences" for one such loop.

It shouldn't take that long.  Can you gather a more detailed profile?

>
>> The patch passed bootstrapping and regression tests.
>>
>> Is this patch OK to commit?
>
>> Ok.
>
> Thank you.
>
> Changpeng
>
Hi,

> On Wed, Jun 30, 2010 at 7:34 PM, Fang, Changpeng <Changpeng.Fang@amd.com> wrote:
> >> FOR_EACH_LOOP (li, loop, LI_ONLY_INNERMOST)
> >
> >> does that make a difference?
> >
> > This doesn't help, because "compute_all_dependences" was called the same number of times
> > as before.
> >
> > (BTW, should we limit prefetching to innermost loops only?)
> >
> > In this test case, there are 6 large loops, where each loop has 729 memory references.
> > It takes 4~5 seconds to "compute_all_dependences" for one such loop.
>
> It shouldn't take that long.  Can you gather a more detailed profile?

actually, compute_all_dependences is quadratic in the number of memory
references, and in more complicated cases it can perform rather complex
computations, so 5 seconds on 729 references does seem like a realistic time.
Of course, we need to either speed up the analysis or add a cut-off to avoid
it on loops with too many memory references (or both),

Zdenek
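To put the quadratic estimate in concrete numbers, here is a quick standalone back-of-the-envelope check using only figures quoted in this thread (729 references per loop, roughly 5 seconds of dependence analysis per loop); the per-pair time is derived, not measured.

```c
#include <stdio.h>

int
main (void)
{
  long refs  = 729;                     /* memory references in one loop   */
  long pairs = refs * (refs - 1) / 2;   /* unordered reference pairs       */
  double seconds_per_loop = 5.0;        /* reported 4~5 s, rough figure    */

  printf ("pairs analysed per loop: %ld\n", pairs);          /* 265356     */
  printf ("approx. time per pair:   %.1f microseconds\n",
          seconds_per_loop / pairs * 1e6);                   /* ~19 us     */
  printf ("total for 6 such loops:  ~%.0f seconds\n",
          6 * seconds_per_loop);        /* close to the extra ~33 s seen   */
  return 0;
}
```

So even a modest cost per pair multiplies into tens of seconds once the pair count reaches the hundreds of thousands, which is consistent with both Zdenek's estimate and the profile posted later in the thread.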
On Wed, Jun 30, 2010 at 8:01 PM, Zdenek Dvorak <rakdver@kam.mff.cuni.cz> wrote:
> Hi,
>
>> On Wed, Jun 30, 2010 at 7:34 PM, Fang, Changpeng <Changpeng.Fang@amd.com> wrote:
>> >> FOR_EACH_LOOP (li, loop, LI_ONLY_INNERMOST)
>> >
>> >> does that make a difference?
>> >
>> > This doesn't help, because "compute_all_dependences" was called the same number of times
>> > as before.
>> >
>> > (BTW, should we limit prefetching to innermost loops only?)
>> >
>> > In this test case, there are 6 large loops, where each loop has 729 memory references.
>> > It takes 4~5 seconds to "compute_all_dependences" for one such loop.
>>
>> It shouldn't take that long.  Can you gather a more detailed profile?
>
> actually, compute_all_dependences is quadratic in the number of memory
> references, and in more complicated cases it can perform rather complex
> computations, so 5 seconds on 729 references does seem like a realistic time.
> Of course, we need to either speed up the analysis or add a cut-off to avoid
> it on loops with too many memory references (or both),

Well, but at -O3 the vectorizer computes dependences as well, and it
doesn't take that much time, so there must be something obvious
going wrong.

Richard.

> Zdenek
>
On Wed, Jun 30, 2010 at 13:05, Richard Guenther
<richard.guenther@gmail.com> wrote:
> On Wed, Jun 30, 2010 at 8:01 PM, Zdenek Dvorak <rakdver@kam.mff.cuni.cz> wrote:
>> Hi,
>>
>>> On Wed, Jun 30, 2010 at 7:34 PM, Fang, Changpeng <Changpeng.Fang@amd.com> wrote:
>>> >> FOR_EACH_LOOP (li, loop, LI_ONLY_INNERMOST)
>>> >
>>> >> does that make a difference?
>>> >
>>> > This doesn't help, because "compute_all_dependences" was called the same number of times
>>> > as before.
>>> >
>>> > (BTW, should we limit prefetching to innermost loops only?)
>>> >
>>> > In this test case, there are 6 large loops, where each loop has 729 memory references.
>>> > It takes 4~5 seconds to "compute_all_dependences" for one such loop.
>>>
>>> It shouldn't take that long.  Can you gather a more detailed profile?
>>
>> actually, compute_all_dependences is quadratic in the number of memory
>> references, and in more complicated cases it can perform rather complex
>> computations, so 5 seconds on 729 references does seem like a realistic time.
>> Of course, we need to either speed up the analysis or add a cut-off to avoid
>> it on loops with too many memory references (or both),
>
> Well, but at -O3 the vectorizer computes dependences as well, and it
> doesn't take that much time, so there must be something obvious
> going wrong.
>

The dependence analysis in the vectorizer is done only in the innermost
loop that is vectorized, whereas prefetch does the analysis of data deps
for every loop.

Sebastian
On Wed, Jun 30, 2010 at 8:10 PM, Sebastian Pop <sebpop@gmail.com> wrote:
> On Wed, Jun 30, 2010 at 13:05, Richard Guenther
> <richard.guenther@gmail.com> wrote:
>> On Wed, Jun 30, 2010 at 8:01 PM, Zdenek Dvorak <rakdver@kam.mff.cuni.cz> wrote:
>>> Hi,
>>>
>>>> On Wed, Jun 30, 2010 at 7:34 PM, Fang, Changpeng <Changpeng.Fang@amd.com> wrote:
>>>> >> FOR_EACH_LOOP (li, loop, LI_ONLY_INNERMOST)
>>>> >
>>>> >> does that make a difference?
>>>> >
>>>> > This doesn't help, because "compute_all_dependences" was called the same number of times
>>>> > as before.
>>>> >
>>>> > (BTW, should we limit prefetching to innermost loops only?)
>>>> >
>>>> > In this test case, there are 6 large loops, where each loop has 729 memory references.
>>>> > It takes 4~5 seconds to "compute_all_dependences" for one such loop.
>>>>
>>>> It shouldn't take that long.  Can you gather a more detailed profile?
>>>
>>> actually, compute_all_dependences is quadratic in the number of memory
>>> references, and in more complicated cases it can perform rather complex
>>> computations, so 5 seconds on 729 references does seem like a realistic time.
>>> Of course, we need to either speed up the analysis or add a cut-off to avoid
>>> it on loops with too many memory references (or both),
>>
>> Well, but at -O3 the vectorizer computes dependences as well, and it
>> doesn't take that much time, so there must be something obvious
>> going wrong.
>>
>
> The dependence analysis in the vectorizer is done only in the innermost
> loop that is vectorized, whereas prefetch does the analysis of data deps
> for every loop.

The vectorizer also does dependence analysis for outer-loop vectorization.
I guess prefetch analysis should restrict itself to data dependences on a
maximum loop nest depth (of say 2 or 3).

Still, dependence analysis looks overly costly here.

Richard.

> Sebastian
>
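A standalone toy sketch of what capping the analysed nest depth could look like, modelled on the outward walk in determine_loop_nest_reuse quoted earlier in the thread. The struct, the field names, the helper outermost_reuse_nest, and MAX_NEST_DEPTH_FOR_REUSE are all invented for illustration; this is not a proposed patch, and the real code works on GCC's struct loop via loop_outer ().

```c
#include <stddef.h>

/* Toy loop tree, only what the walk below needs.  */
struct loop
{
  struct loop *outer;   /* enclosing loop, NULL at the root      */
  struct loop *inner;   /* first child loop, NULL if innermost   */
  struct loop *next;    /* next sibling loop                     */
};

#define MAX_NEST_DEPTH_FOR_REUSE 3   /* hypothetical limit of "2 or 3" */

/* Walk outwards from LOOP as long as the enclosing loop exists, has no
   sibling loops inside the nest, and the depth cap is not reached.  */
static struct loop *
outermost_reuse_nest (struct loop *loop)
{
  struct loop *nest = loop;
  int depth = 1;

  while (depth < MAX_NEST_DEPTH_FOR_REUSE)
    {
      struct loop *aloop = nest->outer;

      if (aloop == NULL || (aloop->inner && aloop->inner->next))
        break;

      nest = aloop;
      depth++;
    }

  return nest;
}
```

The only difference from the walk quoted earlier is the depth counter and its bound; the sibling check and the termination at the root mirror the existing code.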
>>> Well, but at -O3 the vectorizer computes dependences as well, and it
>>> doesn't take that much time, so there must be something obvious
>>> going wrong.
>>>
>>
>> The dependence analysis in the vectorizer is done only in the innermost
>> loop that is vectorized, whereas prefetch does the analysis of data deps
>> for every loop.

> The vectorizer also does dependence analysis for outer-loop vectorization.
> I guess prefetch analysis should restrict itself to data dependences on a
> maximum loop nest depth (of say 2 or 3).

For the vectorizer (and tree-loop-linear), we are just lucky that the loop could
not be vectorized, so dependence analysis was not invoked at all for this loop.
For this test case, prefetching is the only pass that invokes dependence analysis.

> Still, dependence analysis looks overly costly here.

Yes, we need to understand and reduce the dependence analysis cost.

Thanks,

Changpeng
> actually, compute_all_dependences is quadratic in the number of memory
> references, and in more complicated cases it can perform rather complex
> computations, so 5 seconds on 729 references does seem like a realistic time.
> Of course, we need to either speed up the analysis or add a cut-off to avoid
> it on loops with too many memory references (or both),

Actually, these 4~5 seconds are evenly distributed, and I do think they are due
to the large number of memory references.

Let's take a close look at this:

determine_loop_nest_reuse (struct loop *loop, struct mem_ref_group *refs,
                           bool no_other_refs)
{
  ...
  compute_all_dependences (datarefs, &dependences, vloops, true);   /* 3+ seconds */

  for (i = 0; VEC_iterate (ddr_p, dependences, i, dep); i++)        /* 1+ second */
    {
      ...
    }
}

The 4~5 seconds are spent in compute_all_dependences and the loop after it that
extracts the dependence info.  The loop iterates 265356 times and takes 1+ second.
The loop body is simple, so that cost is reasonable.

Now, let's look at compute_all_dependences:

compute_all_dependences (VEC (data_reference_p, heap) *datarefs,
                         VEC (ddr_p, heap) **dependence_relations,
                         VEC (loop_p, heap) *loop_nest,
                         bool compute_self_and_rr)
{
  ...
  for (i = 0; VEC_iterate (data_reference_p, datarefs, i, a); i++)
    for (j = i + 1; VEC_iterate (data_reference_p, datarefs, j, b); j++)
      if (!DR_IS_READ (a) || !DR_IS_READ (b) || compute_self_and_rr)
        {
          ddr = initialize_data_dependence_relation (a, b, loop_nest);   /* 1+ second */
          VEC_safe_push (ddr_p, heap, *dependence_relations, ddr);
          if (loop_nest)
            compute_affine_dependence (ddr,
                                       VEC_index (loop_p, loop_nest, 0));  /* 1+ second */
        }
  ...
}

Each of initialize_data_dependence_relation and compute_affine_dependence is called
265356 times, and each of them costs 1+ second in total over those 265356 calls.
Again, initialize_data_dependence_relation itself is simple.

As a result, I conclude that the large number of memory references is the reason
for the slowdown.  So, it is better to add a cut-off (for prefetching at least).

Thanks,

Changpeng
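A minimal sketch of what such a cut-off could look like, under an assumed budget; the macro name MAX_DEPENDENCE_PAIRS and its value are invented for illustration and are not part of the committed patch (a real change would more likely expose such a limit as a tunable --param).

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical budget: the largest number of reference pairs we are
   willing to hand to compute_all_dependences for a single loop nest.  */
#define MAX_DEPENDENCE_PAIRS 10000

/* The analysis examines n*(n-1)/2 pairs for n memory references; skip it
   (and conservatively assume no reuse) when that exceeds the budget.
   With the value above, the cut-off kicks in around n = 142, far below
   the 729 references per loop seen in this test case.  */
static bool
worth_analyzing_reuse (size_t n_mem_refs)
{
  size_t pairs = n_mem_refs * (n_mem_refs - 1) / 2;
  return pairs <= MAX_DEPENDENCE_PAIRS;
}
```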
On Wed, Jun 30, 2010 at 03:52, Richard Guenther <richard.guenther@gmail.com> wrote:
> On Wed, Jun 30, 2010 at 2:06 AM, Fang, Changpeng <Changpeng.Fang@amd.com> wrote:
>> Hi,
>>
>> Attached is the patch that partially fixes bug 44576: testsuite/gfortran.dg/zero_sized_1.f90 with huge compile
>> time on prefetching + peeling.
>>
>> The cost of compute_miss_rate in tree-ssa-loop-prefetch.c is found to be very high, and it results in an exploding
>> compilation time when there are thousands of memory references in the loop.
>>
>> compute_miss_rate computes the probability (miss_rate) that two memory references fall on different cache
>> lines.  We insert prefetches for both references if the computed miss rate is greater than the experimentally
>> determined ACCEPTABLE_MISS_RATE.
>>
>> In this patch, compute_miss_rate is renamed to is_miss_rate_acceptable and returns early as soon as the computed
>> miss rate exceeds ACCEPTABLE_MISS_RATE, which significantly reduces the cost in that case.  Note that the
>> worst-case cost is not affected by this patch.
>>
>> This patch reduces the compile time of the test case from 1m20s to 1m03s on an amd-linux64 system.
>> Note that without -fprefetch-loop-arrays, the compile time on the same system is 0m30s.  The extra 33s is mostly
>> spent in "compute_all_dependences (datarefs, &dependences, vloops, true)".  I am not sure whether the cost of
>> dependence analysis for a loop should be that high (with several thousand memory references in the loop) or
>> whether there is a bug in the dependence computation.
>
> Well, it seems that you compute the same information over and over
> by doing
>
> unsigned int
> tree_ssa_prefetch_arrays (void)
> {
> ...
>   FOR_EACH_LOOP (li, loop, LI_FROM_INNERMOST)
>     {
>       if (dump_file && (dump_flags & TDF_DETAILS))
>         fprintf (dump_file, "Processing loop %d:\n", loop->num);
>
>       unrolled |= loop_prefetch_arrays (loop);
>
> static bool
> loop_prefetch_arrays (struct loop *loop)
> {
> ...
>   determine_loop_nest_reuse (loop, refs, no_other_refs);
>
> static void
> determine_loop_nest_reuse (struct loop *loop, struct mem_ref_group *refs,
>                            bool no_other_refs)
> {
> ...
>   if (loop->inner)
>     return;
> ...
>   /* Find the outermost loop of the loop nest of loop (we require that
>      there are no sibling loops inside the nest).  */
>   nest = loop;
>   while (1)
>     {
>       aloop = loop_outer (nest);
>
>       if (aloop == current_loops->tree_root
>           || aloop->inner->next)
>         break;
>
>       nest = aloop;
>     }
> ...
>   compute_all_dependences (datarefs, &dependences, vloops, true);
>
>
> In fact you should iterate _only_ over innermost loops like
>
>   FOR_EACH_LOOP (li, loop, LI_ONLY_INNERMOST)
>
> does that make a difference?
>
>>
>> The patch passed bootstrapping and regression tests.
>>
>> Is this patch OK to commit?
>
> Ok.
>

Committed r161728
From 3dbd1a84428979b5e73b2532c6389d31377b929a Mon Sep 17 00:00:00 2001
From: Changpeng Fang <chfang@pathscale.(none)>
Date: Mon, 28 Jun 2010 17:20:10 -0700
Subject: [PATCH 5/5] Reduce the cost in miss rate computation

	* tree-ssa-loop-prefetch.c (compute_miss_rate): Rename to
	is_miss_rate_acceptable.  Pull total_positions computation
	out of the loops.  Early return if miss_positions exceeds
	the acceptable threshold.
	* tree-ssa-loop-prefetch.c (prune_ref_by_group_reuse): Call
	is_miss_rate_acceptable after renaming of compute_miss_rate.

---
 gcc/tree-ssa-loop-prefetch.c |   41 +++++++++++++++++++++--------------------
 1 files changed, 21 insertions(+), 20 deletions(-)

diff --git a/gcc/tree-ssa-loop-prefetch.c b/gcc/tree-ssa-loop-prefetch.c
index 27e2b42..f2c95ac 100644
--- a/gcc/tree-ssa-loop-prefetch.c
+++ b/gcc/tree-ssa-loop-prefetch.c
@@ -640,27 +640,29 @@ ddown (HOST_WIDE_INT x, unsigned HOST_WIDE_INT by)
 
 /* Given a CACHE_LINE_SIZE and two inductive memory references
    with a common STEP greater than CACHE_LINE_SIZE and an address
    difference DELTA, compute the probability that they will fall
-   in different cache lines.  DISTINCT_ITERS is the number of
-   distinct iterations after which the pattern repeats itself.
+   in different cache lines.  Return true if the computed miss rate
+   is not greater than the ACCEPTABLE_MISS_RATE.  DISTINCT_ITERS is the
+   number of distinct iterations after which the pattern repeats itself.
    ALIGN_UNIT is the unit of alignment in bytes.  */
 
-static int
-compute_miss_rate (unsigned HOST_WIDE_INT cache_line_size,
+static bool
+is_miss_rate_acceptable (unsigned HOST_WIDE_INT cache_line_size,
 		   HOST_WIDE_INT step, HOST_WIDE_INT delta,
 		   unsigned HOST_WIDE_INT distinct_iters,
 		   int align_unit)
 {
   unsigned align, iter;
-  int total_positions, miss_positions, miss_rate;
+  int total_positions, miss_positions, max_allowed_miss_positions;
   int address1, address2, cache_line1, cache_line2;
 
   /* It always misses if delta is greater than or equal to the cache
-     line size.  */
-  if (delta >= cache_line_size)
-    return 1000;
+     line size.  */
+  if (delta >= (HOST_WIDE_INT) cache_line_size)
+    return false;
 
-  total_positions = 0;
   miss_positions = 0;
+  total_positions = (cache_line_size / align_unit) * distinct_iters;
+  max_allowed_miss_positions = (ACCEPTABLE_MISS_RATE * total_positions) / 1000;
 
   /* Iterate through all possible alignments of the first
      memory reference within its cache line.  */
@@ -673,12 +675,14 @@ compute_miss_rate (unsigned HOST_WIDE_INT cache_line_size,
       address2 = address1 + delta;
       cache_line1 = address1 / cache_line_size;
       cache_line2 = address2 / cache_line_size;
-      total_positions += 1;
       if (cache_line1 != cache_line2)
-	miss_positions += 1;
+	{
+	  miss_positions += 1;
+	  if (miss_positions > max_allowed_miss_positions)
+	    return false;
+	}
     }
-  miss_rate = 1000 * miss_positions / total_positions;
-  return miss_rate;
+  return true;
 }
 
 /* Prune the prefetch candidate REF using the reuse with BY.
@@ -694,7 +698,6 @@ prune_ref_by_group_reuse (struct mem_ref *ref, struct mem_ref *by,
   HOST_WIDE_INT delta = delta_b - delta_r;
   HOST_WIDE_INT hit_from;
   unsigned HOST_WIDE_INT prefetch_before, prefetch_block;
-  int miss_rate;
   HOST_WIDE_INT reduced_step;
   unsigned HOST_WIDE_INT reduced_prefetch_block;
   tree ref_type;
@@ -793,9 +796,8 @@ prune_ref_by_group_reuse (struct mem_ref *ref, struct mem_ref *by,
   delta %= step;
   ref_type = TREE_TYPE (ref->mem);
   align_unit = TYPE_ALIGN (ref_type) / 8;
-  miss_rate = compute_miss_rate(prefetch_block, step, delta,
-				reduced_prefetch_block, align_unit);
-  if (miss_rate <= ACCEPTABLE_MISS_RATE)
+  if (is_miss_rate_acceptable (prefetch_block, step, delta,
+			       reduced_prefetch_block, align_unit))
     {
       /* Do not reduce prefetch_before if we meet beyond cache size.  */
       if (prefetch_before > L2_CACHE_SIZE_BYTES / PREFETCH_BLOCK)
@@ -809,9 +811,8 @@ prune_ref_by_group_reuse (struct mem_ref *ref, struct mem_ref *by,
       /* Try also the following iteration.  */
       prefetch_before++;
       delta = step - delta;
-      miss_rate = compute_miss_rate(prefetch_block, step, delta,
-				    reduced_prefetch_block, align_unit);
-      if (miss_rate <= ACCEPTABLE_MISS_RATE)
+      if (is_miss_rate_acceptable (prefetch_block, step, delta,
+				   reduced_prefetch_block, align_unit))
 	{
 	  if (prefetch_before < ref->prefetch_before)
 	    ref->prefetch_before = prefetch_before;
-- 
1.6.3.3