Message ID | 2dc2b54e-e737-e03d-e103-541dd4847255@redhat.com |
---|---|
State | New |
Headers | show |
Series | [RFC] Sparse on entry cache for Ranger. | expand |
On Wed, Jun 2, 2021 at 11:15 PM Andrew MacLeod <amacleod@redhat.com> wrote: > > As mentioned earlier, I abstracted the on-entry cache at the beginning > of stage1. This was to make it easier to port future changes back to > GCC11 so we could provide alternate representations to deal with memory > issues, or what have you. > > This patch introduces a sparse representation of the cache which is > used when the number of basic blocks gets too large. > > I commandeered the bitmap code since its efficient and has been working > a long time, and added 2 routines to get and set 4 bits (quads) at a > time. This allows me to use a bitmap like its a sparse array which can > contain a value between 0 and 15, and is conveniently pre-initialized to > values of 0 at no cost :-) This is then used as an index into a small > local table to store ranges for the name. Its limiting in that an > ssa-name will not be able to have more than 15 unique ranges, but this > happens in less than 8% of all cases in the data I collected, and most > of those are switches.. any ranges after the 15 slots are full revert to > VARYING. The values for VARYING and UNDEFINED are pre-populated, and > for pointers, I also pre-populate [0,0] and ~[0, 0]. > > This also adds --param=evrp-sparse-threshold= which allows the > threshold between the full vector and this new sparse representation to > be changed. It defaults to a value of 800. I've done various performance > runs, and this seems to be a reasonably balanced number. In fact its a > 28% improvement in EVRP compile time over 390 files from a gcc bootstrap > with minimal impact on missed opportunities. > > I've also tried to see if using less than 15 values has any significant > effect (The lookup is linear when setting), but it does not appear to. > > I've also bootstrapped with the sparse threshold at 0 to ensure there > aren't any issues. > > My thoughts are I would put this into trunk, and assuming nothing comes > up over the next couple of days, port it back to GCC11 to resolve > 100299 and other excessive memory consumption PRs there as well. given > that its reusing bitmap code for the sparse representation, it seems > like it would be low risk. > > Are we OK with the addition of the bitmap_get_quad and bitmap_set_quad > routines in bitmap.c? It seems like they might be useful to others. > They are simple tweaks of bitmap_set_bit and bitmap_bit_p.. just dealing > with 4 bits at a time. I could make them local if this is a problem, > but i don't have access to the bitmap internals there. I think _quad is a bit too specific - it's aligned chunks so maybe void bitmap_set_aligned_chunk (bitmap, unsigned int chunk, unsigned int chunk_size, BITMAP_WORD chunk_value); and BITMAP_WORD bitmap_get_aligned_chunk (bitmap, unsigned int chunk, unsigned chunk_size); and assert that chunk_size is power-of-two and fits in BITMAP_WORD? (also note using unsigned ints and BITMAP_WORD for the data type) I've been using two-bit representations in a few places (but mostly setting/testing the respective bits independently), I suppose for example static dep_state query_loop_dependence (class loop *loop, im_mem_ref *ref, dep_kind kind) { unsigned first_bit = 6 * loop->num + kind * 2; if (bitmap_bit_p (&ref->dep_loop, first_bit)) return dep_independent; else if (bitmap_bit_p (&ref->dep_loop, first_bit + 1)) return dep_dependent; could use a chunk size of 2 and a single bitmap query. Incidentially this specific code uses 6 bits, so it's not fully aligned ... /* We use six bits per loop in the ref->dep_loop bitmap to record the dep_kind x dep_state combinations. */ enum dep_kind { lim_raw, sm_war, sm_waw }; enum dep_state { dep_unknown, dep_independent, dep_dependent }; ... but there's also at most a single bit set. Anyway, I'm OK with adding API to access aligned power-of-two sized chunks. Even not power-of-two sized unaligned chunks should be quite straight forward to implement if we limit the chunk size to BITMAP_WORD by simply advancing to the next bitmap word / element when necessary. An alternative low-level API would provide accesses to whole BITMAP_WORD entries and the quads could be implemented on top of that (bitmap_set_word/_get_word) Richard. > Bootstraps on x86_64-pc-linux-gnu with no regressions. > > Andrew > > PS in PR10299 we spend a fraction of a second in EVRP now. > > >
On 6/7/21 4:40 AM, Richard Biener wrote: > On Wed, Jun 2, 2021 at 11:15 PM Andrew MacLeod <amacleod@redhat.com> wrote: >> >> My thoughts are I would put this into trunk, and assuming nothing comes >> up over the next couple of days, port it back to GCC11 to resolve >> 100299 and other excessive memory consumption PRs there as well. given >> that its reusing bitmap code for the sparse representation, it seems >> like it would be low risk. >> >> Are we OK with the addition of the bitmap_get_quad and bitmap_set_quad >> routines in bitmap.c? It seems like they might be useful to others. >> They are simple tweaks of bitmap_set_bit and bitmap_bit_p.. just dealing >> with 4 bits at a time. I could make them local if this is a problem, >> but i don't have access to the bitmap internals there. > I think _quad is a bit too specific - it's aligned chunks so maybe > > void bitmap_set_aligned_chunk (bitmap, unsigned int chunk, unsigned > int chunk_size, BITMAP_WORD chunk_value); > > and > > BITMAP_WORD bitmap_get_aligned_chunk (bitmap, unsigned int chunk, > unsigned chunk_size); > > and assert that chunk_size is power-of-two and fits in BITMAP_WORD? > > (also note using unsigned ints and BITMAP_WORD for the data type) > > I've been using two-bit representations in a few places (but mostly > setting/testing the > respective bits independently), I suppose for example That's exactly how this started.. I was using a pair of bits for pointers. UNDEFINED, zero, non-zero and varying... and checking the bits independently. when I decided I needed 3 bits, the whole quad thing evolved since picking up 3 or 4 consecutive bits one at a time seemed too inefficient. > > static dep_state > query_loop_dependence (class loop *loop, im_mem_ref *ref, dep_kind kind) > { > unsigned first_bit = 6 * loop->num + kind * 2; > if (bitmap_bit_p (&ref->dep_loop, first_bit)) > return dep_independent; > else if (bitmap_bit_p (&ref->dep_loop, first_bit + 1)) > return dep_dependent; > > could use a chunk size of 2 and a single bitmap query. Incidentially this > specific code uses 6 bits, so it's not fully aligned ... > > /* We use six bits per loop in the ref->dep_loop bitmap to record > the dep_kind x dep_state combinations. */ > > enum dep_kind { lim_raw, sm_war, sm_waw }; > enum dep_state { dep_unknown, dep_independent, dep_dependent }; > > ... but there's also at most a single bit set. > > Anyway, I'm OK with adding API to access aligned power-of-two sized chunks. > Even not power-of-two sized unaligned chunks should be quite straight > forward to implement if we limit the chunk size to BITMAP_WORD by > simply advancing to the next bitmap word / element when necessary. > > An alternative low-level API would provide accesses to whole BITMAP_WORD > entries and the quads could be implemented on top of that > (bitmap_set_word/_get_word) > > Richard. > I think I'll stick to the power of 2 limitation for now. If someone finds a pressing need or desire, they can enhance it :-) Wanna eyeball this an make sure I'm not doing something unportable.. I just used your original 2 function names, and swapped out the 4 and a couple of constants for computed values. Works fine for me. I also made the self test process 2, 4 and 8 bit quantities. Its going thru a test cycle now. Andrew
On Mon, Jun 7, 2021 at 7:26 PM Andrew MacLeod <amacleod@redhat.com> wrote: > > On 6/7/21 4:40 AM, Richard Biener wrote: > > On Wed, Jun 2, 2021 at 11:15 PM Andrew MacLeod <amacleod@redhat.com> wrote: > >> > >> My thoughts are I would put this into trunk, and assuming nothing comes > >> up over the next couple of days, port it back to GCC11 to resolve > >> 100299 and other excessive memory consumption PRs there as well. given > >> that its reusing bitmap code for the sparse representation, it seems > >> like it would be low risk. > >> > >> Are we OK with the addition of the bitmap_get_quad and bitmap_set_quad > >> routines in bitmap.c? It seems like they might be useful to others. > >> They are simple tweaks of bitmap_set_bit and bitmap_bit_p.. just dealing > >> with 4 bits at a time. I could make them local if this is a problem, > >> but i don't have access to the bitmap internals there. > > I think _quad is a bit too specific - it's aligned chunks so maybe > > > > void bitmap_set_aligned_chunk (bitmap, unsigned int chunk, unsigned > > int chunk_size, BITMAP_WORD chunk_value); > > > > and > > > > BITMAP_WORD bitmap_get_aligned_chunk (bitmap, unsigned int chunk, > > unsigned chunk_size); > > > > and assert that chunk_size is power-of-two and fits in BITMAP_WORD? > > > > (also note using unsigned ints and BITMAP_WORD for the data type) > > > > I've been using two-bit representations in a few places (but mostly > > setting/testing the > > respective bits independently), I suppose for example > > That's exactly how this started.. I was using a pair of bits for > pointers. UNDEFINED, zero, non-zero and varying... and checking the bits > independently. when I decided I needed 3 bits, the whole quad thing > evolved since picking up 3 or 4 consecutive bits one at a time seemed > too inefficient. > > > > > static dep_state > > query_loop_dependence (class loop *loop, im_mem_ref *ref, dep_kind kind) > > { > > unsigned first_bit = 6 * loop->num + kind * 2; > > if (bitmap_bit_p (&ref->dep_loop, first_bit)) > > return dep_independent; > > else if (bitmap_bit_p (&ref->dep_loop, first_bit + 1)) > > return dep_dependent; > > > > could use a chunk size of 2 and a single bitmap query. Incidentially this > > specific code uses 6 bits, so it's not fully aligned ... > > > > /* We use six bits per loop in the ref->dep_loop bitmap to record > > the dep_kind x dep_state combinations. */ > > > > enum dep_kind { lim_raw, sm_war, sm_waw }; > > enum dep_state { dep_unknown, dep_independent, dep_dependent }; > > > > ... but there's also at most a single bit set. > > > > Anyway, I'm OK with adding API to access aligned power-of-two sized chunks. > > Even not power-of-two sized unaligned chunks should be quite straight > > forward to implement if we limit the chunk size to BITMAP_WORD by > > simply advancing to the next bitmap word / element when necessary. > > > > An alternative low-level API would provide accesses to whole BITMAP_WORD > > entries and the quads could be implemented on top of that > > (bitmap_set_word/_get_word) > > > > Richard. > > > I think I'll stick to the power of 2 limitation for now. If someone > finds a pressing need or desire, they can enhance it :-) > > Wanna eyeball this an make sure I'm not doing something unportable.. I > just used your original 2 function names, and swapped out the 4 and a > couple of constants for computed values. Works fine for me. > > I also made the self test process 2, 4 and 8 bit quantities. > > Its going thru a test cycle now. + gcc_checking_assert (__builtin_popcount (chunk_size) == 1); please use pow2p_hwi (chunk_size) instead, __builtin_popcount might not be available with non-GCC host compilers. Otherwise looks good to me. Thanks, Richard. > Andrew > >
From 9408efe462c7a664858d986f69082eb86f2a0007 Mon Sep 17 00:00:00 2001 From: Andrew MacLeod <amacleod@redhat.com> Date: Wed, 2 Jun 2021 15:20:02 -0400 Subject: [PATCH 3/3] Implement a sparse bitmap represenation for Rangers on-entry cache. Use s sparse representation for the on entry cache, and utilize it when the number of basic blocks in the function exceeds param_evrp_sparse_threshold. PR tree-optimization/PR100299 * gimple-range-cache.cc (class sbr_sparse_bitmap): New. (sbr_sparse_bitmap::sbr_sparse_bitmap): New. (sbr_sparse_bitmap::set_bb_range): New. (sbr_sparse_bitmap::get_bb_range): New. (sbr_sparse_bitmap::bb_range_p): New. (block_range_cache::block_range_cache): initialize bitmap obstack. (block_range_cache::~block_range_cache): Destruct obstack. (block_range_cache::set_bb_range): Decide when to utilze the sparse on entry cache. * gimple-range-cache.h (block_range_cache): Add bitmap obstack. * params.opt (-param=evrp-sparse-threshold): New. --- gcc/gimple-range-cache.cc | 126 +++++++++++++++++++++++++++++++++++++- gcc/gimple-range-cache.h | 1 + gcc/params.opt | 4 ++ 3 files changed, 128 insertions(+), 3 deletions(-) diff --git a/gcc/gimple-range-cache.cc b/gcc/gimple-range-cache.cc index 3d1ca3b94ca..658f1bc9342 100644 --- a/gcc/gimple-range-cache.cc +++ b/gcc/gimple-range-cache.cc @@ -236,12 +236,119 @@ sbr_vector::bb_range_p (const basic_block bb) return m_tab[bb->index] != NULL; } +// This class implements the on entry cache via a sparse bitmap. +// It uses the quad bit routines to access 4 bits at a time. +// A value of 0 (the default) means there is no entry, and a value of +// 1 thru SBR_NUM represents an element in the m_range vector. +// Varying is given the first value (1) and pre-cached. +// SBR_NUM + 1 represents the value of UNDEFINED, and is never stored. +// SBR_NUM is the number of values that can be cached. +// Indexes are 1..SBR_NUM and are stored locally at m_range[0..SBR_NUM-1] + +#define SBR_NUM 14 +#define SBR_UNDEF SBR_NUM + 1 +#define SBR_VARYING 1 + +class sbr_sparse_bitmap : public ssa_block_ranges +{ +public: + sbr_sparse_bitmap (tree t, irange_allocator *allocator, bitmap_obstack *bm); + virtual void set_bb_range (const basic_block bb, const irange &r) OVERRIDE; + virtual bool get_bb_range (irange &r, const basic_block bb) OVERRIDE; + virtual bool bb_range_p (const basic_block bb) OVERRIDE; +private: + irange_allocator *m_irange_allocator; + irange *m_range[SBR_NUM]; + bitmap bitvec; + tree m_type; +}; + +// Initialize a block cache for an ssa_name of type T. + +sbr_sparse_bitmap::sbr_sparse_bitmap (tree t, irange_allocator *allocator, + bitmap_obstack *bm) +{ + gcc_checking_assert (TYPE_P (t)); + m_type = t; + bitvec = BITMAP_ALLOC (bm); + m_irange_allocator = allocator; + // Pre-cache varying. + m_range[0] = m_irange_allocator->allocate (2); + m_range[0]->set_varying (t); + // Pre-cache zero and non-zero values for pointers. + if (POINTER_TYPE_P (t)) + { + m_range[1] = m_irange_allocator->allocate (2); + m_range[1]->set_nonzero (t); + m_range[2] = m_irange_allocator->allocate (2); + m_range[2]->set_zero (t); + } + else + m_range[1] = m_range[2] = NULL; + // Clear SBR_NUM entries. + for (int x = 3; x < SBR_NUM; x++) + m_range[x] = 0; +} + +// Set the range on entry to basic block BB to R. + +void +sbr_sparse_bitmap::set_bb_range (const basic_block bb, const irange &r) +{ + if (r.undefined_p ()) + { + bitmap_set_quad (bitvec, bb->index, SBR_UNDEF); + return; + } + + // Loop thru the values to see if R is already present. + for (int x = 0; x < SBR_NUM; x++) + if (!m_range[x] || r == *(m_range[x])) + { + if (!m_range[x]) + m_range[x] = m_irange_allocator->allocate (r); + bitmap_set_quad (bitvec, bb->index, x + 1); + return; + } + // All values are taken, default to VARYING. + bitmap_set_quad (bitvec, bb->index, SBR_VARYING); + return; +} + +// Return the range associated with block BB in R. Return false if +// there is no range. + +bool +sbr_sparse_bitmap::get_bb_range (irange &r, const basic_block bb) +{ + int value = bitmap_get_quad (bitvec, bb->index); + + if (!value) + return false; + + gcc_checking_assert (value <= SBR_UNDEF); + if (value == SBR_UNDEF) + r.set_undefined (); + else + r = *(m_range[value - 1]); + return true; +} + +// Return true if a range is present. + +bool +sbr_sparse_bitmap::bb_range_p (const basic_block bb) +{ + return (bitmap_get_quad (bitvec, bb->index) != 0); +} + // ------------------------------------------------------------------------- // Initialize the block cache. block_range_cache::block_range_cache () { + bitmap_obstack_initialize (&m_bitmaps); m_ssa_ranges.create (0); m_ssa_ranges.safe_grow_cleared (num_ssa_names); m_irange_allocator = new irange_allocator; @@ -254,6 +361,7 @@ block_range_cache::~block_range_cache () delete m_irange_allocator; // Release the vector itself. m_ssa_ranges.release (); + bitmap_obstack_release (&m_bitmaps); } // Set the range for NAME on entry to block BB to R. @@ -269,9 +377,21 @@ block_range_cache::set_bb_range (tree name, const basic_block bb, if (!m_ssa_ranges[v]) { - void *r = m_irange_allocator->get_memory (sizeof (sbr_vector)); - m_ssa_ranges[v] = new (r) sbr_vector (TREE_TYPE (name), - m_irange_allocator); + // Use sparse representation if there are too many basic blocks. + if (last_basic_block_for_fn (cfun) > param_evrp_sparse_threshold) + { + void *r = m_irange_allocator->get_memory (sizeof (sbr_sparse_bitmap)); + m_ssa_ranges[v] = new (r) sbr_sparse_bitmap (TREE_TYPE (name), + m_irange_allocator, + &m_bitmaps); + } + else + { + // Otherwise use the default vector implemntation. + void *r = m_irange_allocator->get_memory (sizeof (sbr_vector)); + m_ssa_ranges[v] = new (r) sbr_vector (TREE_TYPE (name), + m_irange_allocator); + } } m_ssa_ranges[v]->set_bb_range (bb, r); } diff --git a/gcc/gimple-range-cache.h b/gcc/gimple-range-cache.h index 4af461d2aa3..ce4449a08db 100644 --- a/gcc/gimple-range-cache.h +++ b/gcc/gimple-range-cache.h @@ -61,6 +61,7 @@ private: ssa_block_ranges &get_block_ranges (tree name); ssa_block_ranges *query_block_ranges (tree name); irange_allocator *m_irange_allocator; + bitmap_obstack m_bitmaps; }; // This global cache is used with the range engine as markers for what diff --git a/gcc/params.opt b/gcc/params.opt index 0d0dcd216f6..18e6036c4f4 100644 --- a/gcc/params.opt +++ b/gcc/params.opt @@ -126,6 +126,10 @@ Maximum size (in bytes) of objects tracked bytewise by dead store elimination. Common Joined UInteger Var(param_early_inlining_insns) Init(6) Optimization Param Maximal estimated growth of function body caused by early inlining of single call. +-param=evrp-sparse-threshold= +Common Joined UInteger Var(param_evrp_sparse_threshold) Init(800) Optimization Param +Maximum number of basic blocks before EVRP uses a sparse cache. + -param=evrp-mode= Common Joined Var(param_evrp_mode) Enum(evrp_mode) Init(EVRP_MODE_EVRP_FIRST) Param Optimization --param=evrp-mode=[legacy|ranger|legacy-first|ranger-first|ranger-trace|ranger-debug|trace|debug] Specifies the mode Early VRP should operate in. -- 2.17.2