
[to-be-committed,RISC-V] Aggressively hoist VXRM assignments

Message ID 28a79a14-91e7-4a1a-a8ab-390a2f63e0fb@ventanamicro.com
State New

Commit Message

Jeff Law Oct. 29, 2024, 1:45 p.m. UTC
So a while back I was looking at pixel_avg for RISC-V, where we try to 
use vaaddu for the halfword-ceiling-average step.  The problem with 
vaaddu is that you must set VXRM to a suitable rounding mode, as VXRM 
has an undetermined state at function entry or after a function call.

It turns out some designs will fully flush their pipelines on a write to 
VXRM which you can imagine is incredibly expensive.

VXRM assignments are handled by an LCM-based algorithm that finds 
"optimal" placement points based on which insns in the stream need VXRM 
assignments and the particular mode they need.

Unfortunately, in pixel_avg the LCM algorithm only allows hoisting out 
of the innermost loop, not the outer loop.  The core issue is that LCM 
does not allow any speculation, and there are paths which would bypass 
the inner loop (which don't actually trigger at runtime, IIRC).

The expectation is that VXRM assignments should be exceedingly rare and 
needing more than one mode even rarer.  So hoisting more aggressively 
seems like a reasonable thing to do, but we don't want to burn too much 
time trying to do something fancy.

So what this patch does is scan the IL once collecting any VXRM needs. 
If the current function has precisely one VXRM mode needed, then we 
pretend (for the sake of LCM) that the first instruction in the function 
also has that need.

By doing so the VXRM assignment is essentially anticipated everywhere in
the function.  The standard LCM algorithm is run and has enough 
information to hoist the VXRM assignment more aggressively, most often 
to the prologue.

This helps the BPI in a measurable way (IIRC it was 2-3%).  It probably 
helps some of the SiFive designs, but I've been told they still benefit 
from the longer sequence of shifts & adds, hoisting just isn't enough 
for those designs.  The Ventana design basically doesn't care where the 
VXRM assignment is.  The point is we may want to have a tuning knob for 
the patterns which need VXRM (vaadd[u], vasub[u]) at some point in the 
near future.

Bootstrapped and regression tested on riscv64 and regression tested on 
riscv32-elf and riscv64-elf.  We've been using this internally for a 
while on SPEC as well.  Obviously I'll wait for the pre-commit tester 
to do its thing.

Jeff
gcc/
	* config/riscv/riscv.cc (singleton_vxrm_need): New function.
	(riscv_mode_needed): See if there is a singleton need and if so,
	claim it happens on the first insn in the chain.

Patch

diff --git a/gcc/config/riscv/riscv.cc b/gcc/config/riscv/riscv.cc
index 0cf7ee4904d..c9b48d6fac0 100644
--- a/gcc/config/riscv/riscv.cc
+++ b/gcc/config/riscv/riscv.cc
@@ -11772,6 +11772,65 @@  riscv_frm_mode_needed (rtx_insn *cur_insn, int code)
   return mode;
 }
 
+/* If the current function needs a single VXRM mode, return it.  Else
+   return VXRM_MODE_NONE.
+
+   This is called on the first insn in the chain and scans the full function
+   once to collect VXRM mode settings.  If a single mode is needed, it will
+   often be better to set it once at the start of the function rather than
+   at an anticipation point.  */
+static int
+singleton_vxrm_need (void)
+{
+  /* Only needed for vector code.  */
+  if (!TARGET_VECTOR)
+    return VXRM_MODE_NONE;
+
+  /* If ENTRY has more than one successor, then don't optimize, just to
+     keep things simple.  */
+  if (EDGE_COUNT (ENTRY_BLOCK_PTR_FOR_FN (cfun)->succs) > 1)
+    return VXRM_MODE_NONE;
+
+  /* Walk the IL noting if VXRM is needed and if there's more than one
+     mode needed.  */
+  bool found = false;
+  int saved_vxrm_mode = VXRM_MODE_NONE;
+  for (rtx_insn *insn = get_insns (); insn; insn = NEXT_INSN (insn))
+    {
+      if (!INSN_P (insn) || DEBUG_INSN_P (insn))
+	continue;
+
+      int code = recog_memoized (insn);
+      if (code < 0)
+	continue;
+
+      int vxrm_mode = get_attr_vxrm_mode (insn);
+      if (vxrm_mode == VXRM_MODE_NONE)
+	continue;
+
+      /* If this is the first VXRM need, note it.  */
+      if (!found)
+	{
+	  saved_vxrm_mode = vxrm_mode;
+	  found = true;
+	  continue;
+	}
+
+      /* Not the first VXRM need.  If this is different than
+	 the saved need, then we're not going to be able to
+	 optimize and we can stop scanning now.  */
+      if (saved_vxrm_mode != vxrm_mode)
+	return VXRM_MODE_NONE;
+
+      /* Same mode as we've seen, keep scanning.  */
+    }
+
+  /* If we got here we scanned the whole function.  If we found
+     some VXRM state, then we can optimize.  If we didn't find
+     VXRM state, then there's nothing to optimize.  */
+  return found ? saved_vxrm_mode : VXRM_MODE_NONE;
+}
+
 /* Return mode that entity must be switched into
    prior to the execution of insn.  */
 
@@ -11783,6 +11842,16 @@  riscv_mode_needed (int entity, rtx_insn *insn, HARD_REG_SET)
   switch (entity)
     {
     case RISCV_VXRM:
+      /* If INSN is the first insn in the function, then determine if we
+	 want to signal a need in ENTRY->succs to allow for aggressive
+	 elimination of subsequent sets of VXRM.  */
+      if (insn == get_first_nonnote_insn ())
+	{
+	  int need = singleton_vxrm_need ();
+	  if (need != VXRM_MODE_NONE)
+	    return need;
+	}
+
       return code >= 0 ? get_attr_vxrm_mode (insn) : VXRM_MODE_NONE;
     case RISCV_FRM:
       return riscv_frm_mode_needed (insn, code);