diff mbox series

[RFC] Sparse on entry cache for Ranger.

Message ID 2dc2b54e-e737-e03d-e103-541dd4847255@redhat.com
State New
Headers show
Series [RFC] Sparse on entry cache for Ranger. | expand

Commit Message

Andrew MacLeod June 2, 2021, 9:15 p.m. UTC
As mentioned earlier, I abstracted the on-entry cache at the beginning 
of stage1. This was to make it easier to port future changes back to 
GCC11 so we could provide alternate representations to deal with memory 
issues, or what have you.

This patch introduces a sparse representation of the cache which is 
used  when the number of basic blocks gets too large.

I commandeered the bitmap code since its efficient and has been working 
a long time, and added 2 routines to get and set 4 bits (quads) at a 
time.  This allows me to use a bitmap like its a sparse array which can 
contain a value between 0 and 15, and is conveniently pre-initialized to 
values of 0 at no cost :-)   This is then used as an index into a small 
local table to store ranges for the name.  Its limiting in that an 
ssa-name will not be able to have more than 15 unique ranges, but this 
happens in less than 8% of all cases in the data I collected, and most 
of those are switches.. any ranges after the 15 slots are full revert to 
VARYING.  The values for VARYING and UNDEFINED are pre-populated, and 
for pointers, I also pre-populate [0,0] and ~[0, 0].

This also adds --param=evrp-sparse-threshold=  which allows the 
threshold between the full vector and this new sparse representation to 
be changed. It defaults to a value of 800. I've done various performance 
runs, and this seems to be a reasonably balanced number. In fact its a 
28% improvement in EVRP compile time over 390 files from a gcc bootstrap 
with minimal impact on missed opportunities.

I've also tried to see if using less than 15 values has any significant 
effect (The lookup is linear when setting), but it does not appear to.

I've also bootstrapped with the sparse threshold at 0 to ensure there 
aren't any issues.

My thoughts are I would put this into trunk, and assuming nothing comes 
up  over the next couple of days, port it back to GCC11 to resolve 
100299 and other excessive memory consumption PRs there as well. given 
that its reusing bitmap code for the sparse representation, it seems 
like it would be low risk.

Are we OK with the addition of the bitmap_get_quad and bitmap_set_quad 
routines in bitmap.c?  It seems like they might be useful to others.  
They are simple tweaks of bitmap_set_bit and bitmap_bit_p.. just dealing 
with 4 bits at a time.  I could make them local if this is a problem, 
but i don't have access to the bitmap internals there.

Bootstraps on x86_64-pc-linux-gnu with no regressions.

Andrew

PS in PR10299 we spend a fraction of a second in EVRP now.

Comments

Richard Biener June 7, 2021, 8:40 a.m. UTC | #1
On Wed, Jun 2, 2021 at 11:15 PM Andrew MacLeod <amacleod@redhat.com> wrote:
>
> As mentioned earlier, I abstracted the on-entry cache at the beginning
> of stage1. This was to make it easier to port future changes back to
> GCC11 so we could provide alternate representations to deal with memory
> issues, or what have you.
>
> This patch introduces a sparse representation of the cache which is
> used  when the number of basic blocks gets too large.
>
> I commandeered the bitmap code since its efficient and has been working
> a long time, and added 2 routines to get and set 4 bits (quads) at a
> time.  This allows me to use a bitmap like its a sparse array which can
> contain a value between 0 and 15, and is conveniently pre-initialized to
> values of 0 at no cost :-)   This is then used as an index into a small
> local table to store ranges for the name.  Its limiting in that an
> ssa-name will not be able to have more than 15 unique ranges, but this
> happens in less than 8% of all cases in the data I collected, and most
> of those are switches.. any ranges after the 15 slots are full revert to
> VARYING.  The values for VARYING and UNDEFINED are pre-populated, and
> for pointers, I also pre-populate [0,0] and ~[0, 0].
>
> This also adds --param=evrp-sparse-threshold=  which allows the
> threshold between the full vector and this new sparse representation to
> be changed. It defaults to a value of 800. I've done various performance
> runs, and this seems to be a reasonably balanced number. In fact its a
> 28% improvement in EVRP compile time over 390 files from a gcc bootstrap
> with minimal impact on missed opportunities.
>
> I've also tried to see if using less than 15 values has any significant
> effect (The lookup is linear when setting), but it does not appear to.
>
> I've also bootstrapped with the sparse threshold at 0 to ensure there
> aren't any issues.
>
> My thoughts are I would put this into trunk, and assuming nothing comes
> up  over the next couple of days, port it back to GCC11 to resolve
> 100299 and other excessive memory consumption PRs there as well. given
> that its reusing bitmap code for the sparse representation, it seems
> like it would be low risk.
>
> Are we OK with the addition of the bitmap_get_quad and bitmap_set_quad
> routines in bitmap.c?  It seems like they might be useful to others.
> They are simple tweaks of bitmap_set_bit and bitmap_bit_p.. just dealing
> with 4 bits at a time.  I could make them local if this is a problem,
> but i don't have access to the bitmap internals there.

I think _quad is a bit too specific - it's aligned chunks so maybe

void bitmap_set_aligned_chunk (bitmap, unsigned int chunk, unsigned
int chunk_size, BITMAP_WORD chunk_value);

and

BITMAP_WORD bitmap_get_aligned_chunk (bitmap, unsigned int chunk,
unsigned chunk_size);

and assert that chunk_size is power-of-two and fits in BITMAP_WORD?

(also note using unsigned ints and BITMAP_WORD for the data type)

I've been using two-bit representations in a few places (but mostly
setting/testing the
respective bits independently), I suppose for example

static dep_state
query_loop_dependence (class loop *loop, im_mem_ref *ref, dep_kind kind)
{
  unsigned first_bit = 6 * loop->num + kind * 2;
  if (bitmap_bit_p (&ref->dep_loop, first_bit))
    return dep_independent;
  else if (bitmap_bit_p (&ref->dep_loop, first_bit + 1))
    return dep_dependent;

could use a chunk size of 2 and a single bitmap query.  Incidentially this
specific code uses 6 bits, so it's not fully aligned ...

/* We use six bits per loop in the ref->dep_loop bitmap to record
   the dep_kind x dep_state combinations.  */

enum dep_kind { lim_raw, sm_war, sm_waw };
enum dep_state { dep_unknown, dep_independent, dep_dependent };

... but there's also at most a single bit set.

Anyway, I'm OK with adding API to access aligned power-of-two sized chunks.
Even not power-of-two sized unaligned chunks should be quite straight
forward to implement if we limit the chunk size to BITMAP_WORD by
simply advancing to the next bitmap word / element when necessary.

An alternative low-level API would provide accesses to whole BITMAP_WORD
entries and the quads could be implemented on top of that
(bitmap_set_word/_get_word)

Richard.

> Bootstraps on x86_64-pc-linux-gnu with no regressions.
>
> Andrew
>
> PS in PR10299 we spend a fraction of a second in EVRP now.
>
>
>
Andrew MacLeod June 7, 2021, 5:26 p.m. UTC | #2
On 6/7/21 4:40 AM, Richard Biener wrote:
> On Wed, Jun 2, 2021 at 11:15 PM Andrew MacLeod <amacleod@redhat.com> wrote:
>>
>> My thoughts are I would put this into trunk, and assuming nothing comes
>> up  over the next couple of days, port it back to GCC11 to resolve
>> 100299 and other excessive memory consumption PRs there as well. given
>> that its reusing bitmap code for the sparse representation, it seems
>> like it would be low risk.
>>
>> Are we OK with the addition of the bitmap_get_quad and bitmap_set_quad
>> routines in bitmap.c?  It seems like they might be useful to others.
>> They are simple tweaks of bitmap_set_bit and bitmap_bit_p.. just dealing
>> with 4 bits at a time.  I could make them local if this is a problem,
>> but i don't have access to the bitmap internals there.
> I think _quad is a bit too specific - it's aligned chunks so maybe
>
> void bitmap_set_aligned_chunk (bitmap, unsigned int chunk, unsigned
> int chunk_size, BITMAP_WORD chunk_value);
>
> and
>
> BITMAP_WORD bitmap_get_aligned_chunk (bitmap, unsigned int chunk,
> unsigned chunk_size);
>
> and assert that chunk_size is power-of-two and fits in BITMAP_WORD?
>
> (also note using unsigned ints and BITMAP_WORD for the data type)
>
> I've been using two-bit representations in a few places (but mostly
> setting/testing the
> respective bits independently), I suppose for example

That's exactly how this started.. I was using a pair of bits for 
pointers. UNDEFINED, zero, non-zero and varying... and checking the bits 
independently. when I decided I needed 3 bits, the whole quad thing 
evolved since picking up 3 or 4 consecutive bits one at a time seemed 
too inefficient.

>
> static dep_state
> query_loop_dependence (class loop *loop, im_mem_ref *ref, dep_kind kind)
> {
>    unsigned first_bit = 6 * loop->num + kind * 2;
>    if (bitmap_bit_p (&ref->dep_loop, first_bit))
>      return dep_independent;
>    else if (bitmap_bit_p (&ref->dep_loop, first_bit + 1))
>      return dep_dependent;
>
> could use a chunk size of 2 and a single bitmap query.  Incidentially this
> specific code uses 6 bits, so it's not fully aligned ...
>
> /* We use six bits per loop in the ref->dep_loop bitmap to record
>     the dep_kind x dep_state combinations.  */
>
> enum dep_kind { lim_raw, sm_war, sm_waw };
> enum dep_state { dep_unknown, dep_independent, dep_dependent };
>
> ... but there's also at most a single bit set.
>
> Anyway, I'm OK with adding API to access aligned power-of-two sized chunks.
> Even not power-of-two sized unaligned chunks should be quite straight
> forward to implement if we limit the chunk size to BITMAP_WORD by
> simply advancing to the next bitmap word / element when necessary.
>
> An alternative low-level API would provide accesses to whole BITMAP_WORD
> entries and the quads could be implemented on top of that
> (bitmap_set_word/_get_word)
>
> Richard.
>
I think I'll stick to the power of 2 limitation for now.  If someone 
finds a pressing need or desire, they can enhance it :-)

Wanna eyeball this an make sure I'm not doing something unportable.. I 
just used your original 2 function names, and swapped out the 4 and a 
couple of constants for computed values. Works fine for me.

I also made the self test process 2, 4 and 8 bit quantities.

Its going thru a test cycle now.

Andrew
Richard Biener June 7, 2021, 5:50 p.m. UTC | #3
On Mon, Jun 7, 2021 at 7:26 PM Andrew MacLeod <amacleod@redhat.com> wrote:
>
> On 6/7/21 4:40 AM, Richard Biener wrote:
> > On Wed, Jun 2, 2021 at 11:15 PM Andrew MacLeod <amacleod@redhat.com> wrote:
> >>
> >> My thoughts are I would put this into trunk, and assuming nothing comes
> >> up  over the next couple of days, port it back to GCC11 to resolve
> >> 100299 and other excessive memory consumption PRs there as well. given
> >> that its reusing bitmap code for the sparse representation, it seems
> >> like it would be low risk.
> >>
> >> Are we OK with the addition of the bitmap_get_quad and bitmap_set_quad
> >> routines in bitmap.c?  It seems like they might be useful to others.
> >> They are simple tweaks of bitmap_set_bit and bitmap_bit_p.. just dealing
> >> with 4 bits at a time.  I could make them local if this is a problem,
> >> but i don't have access to the bitmap internals there.
> > I think _quad is a bit too specific - it's aligned chunks so maybe
> >
> > void bitmap_set_aligned_chunk (bitmap, unsigned int chunk, unsigned
> > int chunk_size, BITMAP_WORD chunk_value);
> >
> > and
> >
> > BITMAP_WORD bitmap_get_aligned_chunk (bitmap, unsigned int chunk,
> > unsigned chunk_size);
> >
> > and assert that chunk_size is power-of-two and fits in BITMAP_WORD?
> >
> > (also note using unsigned ints and BITMAP_WORD for the data type)
> >
> > I've been using two-bit representations in a few places (but mostly
> > setting/testing the
> > respective bits independently), I suppose for example
>
> That's exactly how this started.. I was using a pair of bits for
> pointers. UNDEFINED, zero, non-zero and varying... and checking the bits
> independently. when I decided I needed 3 bits, the whole quad thing
> evolved since picking up 3 or 4 consecutive bits one at a time seemed
> too inefficient.
>
> >
> > static dep_state
> > query_loop_dependence (class loop *loop, im_mem_ref *ref, dep_kind kind)
> > {
> >    unsigned first_bit = 6 * loop->num + kind * 2;
> >    if (bitmap_bit_p (&ref->dep_loop, first_bit))
> >      return dep_independent;
> >    else if (bitmap_bit_p (&ref->dep_loop, first_bit + 1))
> >      return dep_dependent;
> >
> > could use a chunk size of 2 and a single bitmap query.  Incidentially this
> > specific code uses 6 bits, so it's not fully aligned ...
> >
> > /* We use six bits per loop in the ref->dep_loop bitmap to record
> >     the dep_kind x dep_state combinations.  */
> >
> > enum dep_kind { lim_raw, sm_war, sm_waw };
> > enum dep_state { dep_unknown, dep_independent, dep_dependent };
> >
> > ... but there's also at most a single bit set.
> >
> > Anyway, I'm OK with adding API to access aligned power-of-two sized chunks.
> > Even not power-of-two sized unaligned chunks should be quite straight
> > forward to implement if we limit the chunk size to BITMAP_WORD by
> > simply advancing to the next bitmap word / element when necessary.
> >
> > An alternative low-level API would provide accesses to whole BITMAP_WORD
> > entries and the quads could be implemented on top of that
> > (bitmap_set_word/_get_word)
> >
> > Richard.
> >
> I think I'll stick to the power of 2 limitation for now.  If someone
> finds a pressing need or desire, they can enhance it :-)
>
> Wanna eyeball this an make sure I'm not doing something unportable.. I
> just used your original 2 function names, and swapped out the 4 and a
> couple of constants for computed values. Works fine for me.
>
> I also made the self test process 2, 4 and 8 bit quantities.
>
> Its going thru a test cycle now.

+  gcc_checking_assert (__builtin_popcount (chunk_size) == 1);

please use pow2p_hwi (chunk_size) instead, __builtin_popcount might
not be available with non-GCC host compilers.

Otherwise looks good to me.

Thanks,
Richard.

> Andrew
>
>
diff mbox series

Patch

From 9408efe462c7a664858d986f69082eb86f2a0007 Mon Sep 17 00:00:00 2001
From: Andrew MacLeod <amacleod@redhat.com>
Date: Wed, 2 Jun 2021 15:20:02 -0400
Subject: [PATCH 3/3] Implement a sparse bitmap represenation for Rangers
 on-entry cache.

Use s sparse representation for the on entry cache, and utilize it when
the number of basic blocks in the function exceeds param_evrp_sparse_threshold.

	PR tree-optimization/PR100299
	* gimple-range-cache.cc (class sbr_sparse_bitmap): New.
	(sbr_sparse_bitmap::sbr_sparse_bitmap): New.
	(sbr_sparse_bitmap::set_bb_range): New.
	(sbr_sparse_bitmap::get_bb_range): New.
	(sbr_sparse_bitmap::bb_range_p): New.
	(block_range_cache::block_range_cache): initialize bitmap obstack.
	(block_range_cache::~block_range_cache): Destruct obstack.
	(block_range_cache::set_bb_range): Decide when to utilze the
	sparse on entry cache.
	* gimple-range-cache.h (block_range_cache): Add bitmap obstack.
	* params.opt (-param=evrp-sparse-threshold): New.
---
 gcc/gimple-range-cache.cc | 126 +++++++++++++++++++++++++++++++++++++-
 gcc/gimple-range-cache.h  |   1 +
 gcc/params.opt            |   4 ++
 3 files changed, 128 insertions(+), 3 deletions(-)

diff --git a/gcc/gimple-range-cache.cc b/gcc/gimple-range-cache.cc
index 3d1ca3b94ca..658f1bc9342 100644
--- a/gcc/gimple-range-cache.cc
+++ b/gcc/gimple-range-cache.cc
@@ -236,12 +236,119 @@  sbr_vector::bb_range_p (const basic_block bb)
   return m_tab[bb->index] != NULL;
 }
 
+// This class implements the on entry cache via a sparse bitmap.
+// It uses the quad bit routines to access 4 bits at a time.
+// A value of 0 (the default) means there is no entry, and a value of
+// 1 thru SBR_NUM represents an element in the m_range vector.
+// Varying is given the first value (1) and pre-cached.
+// SBR_NUM + 1 represents the value of UNDEFINED, and is never stored.
+// SBR_NUM is the number of values that can be cached.
+// Indexes are 1..SBR_NUM and are stored locally at m_range[0..SBR_NUM-1]
+
+#define SBR_NUM		14
+#define SBR_UNDEF	SBR_NUM + 1
+#define SBR_VARYING	1
+
+class sbr_sparse_bitmap : public ssa_block_ranges
+{
+public:
+  sbr_sparse_bitmap (tree t, irange_allocator *allocator, bitmap_obstack *bm);
+  virtual void set_bb_range (const basic_block bb, const irange &r) OVERRIDE;
+  virtual bool get_bb_range (irange &r, const basic_block bb) OVERRIDE;
+  virtual bool bb_range_p (const basic_block bb) OVERRIDE;
+private:
+  irange_allocator *m_irange_allocator;
+  irange *m_range[SBR_NUM];
+  bitmap bitvec;
+  tree m_type;
+};
+
+// Initialize a block cache for an ssa_name of type T.
+
+sbr_sparse_bitmap::sbr_sparse_bitmap (tree t, irange_allocator *allocator,
+				bitmap_obstack *bm)
+{
+  gcc_checking_assert (TYPE_P (t));
+  m_type = t;
+  bitvec = BITMAP_ALLOC (bm);
+  m_irange_allocator = allocator;
+  // Pre-cache varying.
+  m_range[0] = m_irange_allocator->allocate (2);
+  m_range[0]->set_varying (t);
+  // Pre-cache zero and non-zero values for pointers.
+  if (POINTER_TYPE_P (t))
+    {
+      m_range[1] = m_irange_allocator->allocate (2);
+      m_range[1]->set_nonzero (t);
+      m_range[2] = m_irange_allocator->allocate (2);
+      m_range[2]->set_zero (t);
+    }
+  else
+    m_range[1] = m_range[2] = NULL;
+  // Clear SBR_NUM entries.
+  for (int x = 3; x < SBR_NUM; x++)
+    m_range[x] = 0;
+}
+
+// Set the range on entry to basic block BB to R.
+
+void
+sbr_sparse_bitmap::set_bb_range (const basic_block bb, const irange &r)
+{
+  if (r.undefined_p ())
+    {
+      bitmap_set_quad (bitvec, bb->index, SBR_UNDEF);
+      return;
+    }
+
+  // Loop thru the values to see if R is already present.
+  for (int x = 0; x < SBR_NUM; x++)
+    if (!m_range[x] || r == *(m_range[x]))
+      {
+	if (!m_range[x])
+	  m_range[x] = m_irange_allocator->allocate (r);
+	bitmap_set_quad (bitvec, bb->index, x + 1);
+	return;
+      }
+  // All values are taken, default to VARYING.
+  bitmap_set_quad (bitvec, bb->index, SBR_VARYING);
+  return;
+}
+
+// Return the range associated with block BB in R.  Return false if
+// there is no range.
+
+bool
+sbr_sparse_bitmap::get_bb_range (irange &r, const basic_block bb)
+{
+  int value = bitmap_get_quad (bitvec, bb->index);
+
+  if (!value)
+    return false;
+
+  gcc_checking_assert (value <= SBR_UNDEF);
+  if (value == SBR_UNDEF)
+    r.set_undefined ();
+  else
+    r = *(m_range[value - 1]);
+  return true;
+}
+
+// Return true if a range is present.
+
+bool
+sbr_sparse_bitmap::bb_range_p (const basic_block bb)
+{
+  return (bitmap_get_quad (bitvec, bb->index) != 0);
+}
+
 // -------------------------------------------------------------------------
 
 // Initialize the block cache.
 
 block_range_cache::block_range_cache ()
 {
+  bitmap_obstack_initialize (&m_bitmaps);
   m_ssa_ranges.create (0);
   m_ssa_ranges.safe_grow_cleared (num_ssa_names);
   m_irange_allocator = new irange_allocator;
@@ -254,6 +361,7 @@  block_range_cache::~block_range_cache ()
   delete m_irange_allocator;
   // Release the vector itself.
   m_ssa_ranges.release ();
+  bitmap_obstack_release (&m_bitmaps);
 }
 
 // Set the range for NAME on entry to block BB to R.
@@ -269,9 +377,21 @@  block_range_cache::set_bb_range (tree name, const basic_block bb,
 
   if (!m_ssa_ranges[v])
     {
-      void *r = m_irange_allocator->get_memory (sizeof (sbr_vector));
-      m_ssa_ranges[v] = new (r) sbr_vector (TREE_TYPE (name),
-					    m_irange_allocator);
+      // Use sparse representation if there are too many basic blocks.
+      if (last_basic_block_for_fn (cfun) > param_evrp_sparse_threshold)
+	{
+	  void *r = m_irange_allocator->get_memory (sizeof (sbr_sparse_bitmap));
+	  m_ssa_ranges[v] = new (r) sbr_sparse_bitmap (TREE_TYPE (name),
+						       m_irange_allocator,
+						       &m_bitmaps);
+	}
+      else
+	{
+	  // Otherwise use the default vector implemntation.
+	  void *r = m_irange_allocator->get_memory (sizeof (sbr_vector));
+	  m_ssa_ranges[v] = new (r) sbr_vector (TREE_TYPE (name),
+						m_irange_allocator);
+	}
     }
   m_ssa_ranges[v]->set_bb_range (bb, r);
 }
diff --git a/gcc/gimple-range-cache.h b/gcc/gimple-range-cache.h
index 4af461d2aa3..ce4449a08db 100644
--- a/gcc/gimple-range-cache.h
+++ b/gcc/gimple-range-cache.h
@@ -61,6 +61,7 @@  private:
   ssa_block_ranges &get_block_ranges (tree name);
   ssa_block_ranges *query_block_ranges (tree name);
   irange_allocator *m_irange_allocator;
+  bitmap_obstack m_bitmaps;
 };
 
 // This global cache is used with the range engine as markers for what
diff --git a/gcc/params.opt b/gcc/params.opt
index 0d0dcd216f6..18e6036c4f4 100644
--- a/gcc/params.opt
+++ b/gcc/params.opt
@@ -126,6 +126,10 @@  Maximum size (in bytes) of objects tracked bytewise by dead store elimination.
 Common Joined UInteger Var(param_early_inlining_insns) Init(6) Optimization Param
 Maximal estimated growth of function body caused by early inlining of single call.
 
+-param=evrp-sparse-threshold=
+Common Joined UInteger Var(param_evrp_sparse_threshold) Init(800) Optimization Param
+Maximum number of basic blocks before EVRP uses a sparse cache.
+
 -param=evrp-mode=
 Common Joined Var(param_evrp_mode) Enum(evrp_mode) Init(EVRP_MODE_EVRP_FIRST) Param Optimization
 --param=evrp-mode=[legacy|ranger|legacy-first|ranger-first|ranger-trace|ranger-debug|trace|debug] Specifies the mode Early VRP should operate in.
-- 
2.17.2