diff mbox series

xtensa: Optimize stack frame adjustment more

Message ID 04a89dbf-c2a3-5dcb-8949-77569a1ad169@yahoo.co.jp
State New
Headers show
Series xtensa: Optimize stack frame adjustment more | expand

Commit Message

Takayuki 'January June' Suwa Jan. 5, 2023, 8:40 a.m. UTC
This patch introduces a convenient helper function for integer immediate
addition with scratch register as needed, that splits and emits either
up to two ADDI/ADDMI machine instructions or an addition by register
following an immediate integer load (which may later be transformed by
constantsynth).

By using the helper function, it makes stack frame adjustment logic
simplified and instruction count less in some cases.

gcc/ChangeLog:

	* config/xtensa/xtensa.cc
	(xtensa_split_imm_two_addends, xtensa_emit_add_imm):
	New helper functions.
	(xtensa_emit_adjust_stack_ptr, xtensa_set_return_address,
	xtensa_output_mi_thunk): Change to use the helper function.
---
 gcc/config/xtensa/xtensa.cc | 139 +++++++++++++++++++++++-------------
 1 file changed, 88 insertions(+), 51 deletions(-)

Comments

Max Filippov Jan. 5, 2023, 9:32 p.m. UTC | #1
Hi Suwa-san,

On Thu, Jan 5, 2023 at 3:57 AM Takayuki 'January June' Suwa
<jjsuwa_sys3175@yahoo.co.jp> wrote:
>
> This patch introduces a convenient helper function for integer immediate
> addition with scratch register as needed, that splits and emits either
> up to two ADDI/ADDMI machine instructions or an addition by register
> following an immediate integer load (which may later be transformed by
> constantsynth).
>
> By using the helper function, it makes stack frame adjustment logic
> simplified and instruction count less in some cases.
>
> gcc/ChangeLog:
>
>         * config/xtensa/xtensa.cc
>         (xtensa_split_imm_two_addends, xtensa_emit_add_imm):
>         New helper functions.
>         (xtensa_emit_adjust_stack_ptr, xtensa_set_return_address,
>         xtensa_output_mi_thunk): Change to use the helper function.
> ---
>  gcc/config/xtensa/xtensa.cc | 139 +++++++++++++++++++++++-------------
>  1 file changed, 88 insertions(+), 51 deletions(-)

This change introduces a bunch of failures in the g++ testsuite,
but the culprit is apparently somewhere in the libstdc++.so, I'm
still looking for it.

I see the following pattern change in the generated epilogue code:

-    4aaf:      b0a192          movi    a9, 0x1b0
-    4ab2:      1f9a            add.n   a1, a15, a9
...
-    4abe:      20c112          addi    a1, a1, 32
-    4ac1:      f00d            ret.n
+    4aaf:      02df12          addmi   a1, a15, 0x200
+    4ab2:      b0c112          addi    a1, a1, -80
...
+    4abf:      20c112          addi    a1, a1, 32
+    4ac2:      f00d            ret.n

I.e. a1 is first moved into the parent stack frame, then back to the right
spot. This does not look correct, especially for bare-metal targets.
Takayuki 'January June' Suwa Jan. 6, 2023, 3:34 a.m. UTC | #2
On 2023/01/06 6:32, Max Filippov wrote:
> Hi Suwa-san,
Hi!

> 
> On Thu, Jan 5, 2023 at 3:57 AM Takayuki 'January June' Suwa
> <jjsuwa_sys3175@yahoo.co.jp> wrote:
>>
>> This patch introduces a convenient helper function for integer immediate
>> addition with scratch register as needed, that splits and emits either
>> up to two ADDI/ADDMI machine instructions or an addition by register
>> following an immediate integer load (which may later be transformed by
>> constantsynth).
>>
>> By using the helper function, it makes stack frame adjustment logic
>> simplified and instruction count less in some cases.
>>
>> gcc/ChangeLog:
>>
>>         * config/xtensa/xtensa.cc
>>         (xtensa_split_imm_two_addends, xtensa_emit_add_imm):
>>         New helper functions.
>>         (xtensa_emit_adjust_stack_ptr, xtensa_set_return_address,
>>         xtensa_output_mi_thunk): Change to use the helper function.
>> ---
>>  gcc/config/xtensa/xtensa.cc | 139 +++++++++++++++++++++++-------------
>>  1 file changed, 88 insertions(+), 51 deletions(-)
> 
> This change introduces a bunch of failures in the g++ testsuite,
> but the culprit is apparently somewhere in the libstdc++.so, I'm
> still looking for it.
> 
> I see the following pattern change in the generated epilogue code:
> 
> -    4aaf:      b0a192          movi    a9, 0x1b0
> -    4ab2:      1f9a            add.n   a1, a15, a9
> ...
> -    4abe:      20c112          addi    a1, a1, 32
> -    4ac1:      f00d            ret.n
> +    4aaf:      02df12          addmi   a1, a15, 0x200
> +    4ab2:      b0c112          addi    a1, a1, -80
> ...
> +    4abf:      20c112          addi    a1, a1, 32
> +    4ac2:      f00d            ret.n
> 
> I.e. a1 is first moved into the parent stack frame, then back to the right
> spot. This does not look correct, especially for bare-metal targets.
> 

On second thought, it cannot be a good idea to split addition/subtraction to the stack pointer.

> -    4aaf:      b0a192          movi    a9, 0x1b0
> -    4ab2:      1f9a            add.n   a1, a15, a9

> +    4aaf:      02df12          addmi   a1, a15, 0x200
> +    4ab2:      b0c112          addi    a1, a1, -80

Because the former is atomic, but the latter is not. (may be interrupted between the two add instructions)

I'll wait for the results of your investigation, but it may be better to withdraw the patch.
Max Filippov Jan. 6, 2023, 6:26 a.m. UTC | #3
On Thu, Jan 5, 2023 at 7:35 PM Takayuki 'January June' Suwa
<jjsuwa_sys3175@yahoo.co.jp> wrote:
> On second thought, it cannot be a good idea to split addition/subtraction to the stack pointer.
>
> > -    4aaf:      b0a192          movi    a9, 0x1b0
> > -    4ab2:      1f9a            add.n   a1, a15, a9
>
> > +    4aaf:      02df12          addmi   a1, a15, 0x200
> > +    4ab2:      b0c112          addi    a1, a1, -80
>
> Because the former is atomic, but the latter is not. (may be interrupted between the two add instructions)

Oh, right, there are two issues: one is interruption in the absence of
detailed stack tracking in the DWARF info, which can be fixed by emitting
a separate note for each a1 change, the other is interruption when
a1 is in the parent frame, which can be fixed by always moving a1
down first, e.g. with the following change:

diff --git a/gcc/config/xtensa/xtensa.cc b/gcc/config/xtensa/xtensa.cc
index 3b8a7bcda371..29cb91fa7de5 100644
--- a/gcc/config/xtensa/xtensa.cc
+++ b/gcc/config/xtensa/xtensa.cc
@@ -2539,7 +2539,10 @@ xtensa_split_imm_two_addends (HOST_WIDE_INT
imm, HOST_WIDE_INT v[2])

  if (xtensa_simm8 (v1) || xtensa_simm8x256 (v1))
    {
-      v[0] = v0, v[1] = v1;
+      if (v0 < 0)
+       v[0] = v0, v[1] = v1;
+      else
+       v[0] = v1, v[1] = v0;
      return true;
    }

Or both can be fixed by using a scratch register in the middle of the
addi/addmi sequence.

> I'll wait for the results of your investigation, but it may be better to withdraw the patch.

The issue was in the unwinding code in the libgcc_s.so. I haven't figured
out the exact mechanism, but found that emitting a separate note for each
a1 change fixes it.
Takayuki 'January June' Suwa Jan. 6, 2023, 6:57 a.m. UTC | #4
On 2023/01/06 15:26, Max Filippov wrote:
> On Thu, Jan 5, 2023 at 7:35 PM Takayuki 'January June' Suwa
> <jjsuwa_sys3175@yahoo.co.jp> wrote:
>> On second thought, it cannot be a good idea to split addition/subtraction to the stack pointer.
>>
>>> -    4aaf:      b0a192          movi    a9, 0x1b0
>>> -    4ab2:      1f9a            add.n   a1, a15, a9
>>
>>> +    4aaf:      02df12          addmi   a1, a15, 0x200
>>> +    4ab2:      b0c112          addi    a1, a1, -80
>>
>> Because the former is atomic, but the latter is not. (may be interrupted between the two add instructions)
> 
> Oh, right, there are two issues: one is interruption in the absence of
> detailed stack tracking in the DWARF info, which can be fixed by emitting
> a separate note for each a1 change, the other is interruption when
> a1 is in the parent frame, which can be fixed by always moving a1
> down first, e.g. with the following change:
> 
> diff --git a/gcc/config/xtensa/xtensa.cc b/gcc/config/xtensa/xtensa.cc
> index 3b8a7bcda371..29cb91fa7de5 100644
> --- a/gcc/config/xtensa/xtensa.cc
> +++ b/gcc/config/xtensa/xtensa.cc
> @@ -2539,7 +2539,10 @@ xtensa_split_imm_two_addends (HOST_WIDE_INT
> imm, HOST_WIDE_INT v[2])
> 
>   if (xtensa_simm8 (v1) || xtensa_simm8x256 (v1))
>     {
> -      v[0] = v0, v[1] = v1;
> +      if (v0 < 0)
> +       v[0] = v0, v[1] = v1;
> +      else
> +       v[0] = v1, v[1] = v0;
>       return true;
>     }
> 
> Or both can be fixed by using a scratch register in the middle of the
> addi/addmi sequence.
> 
>> I'll wait for the results of your investigation, but it may be better to withdraw the patch.
> 
> The issue was in the unwinding code in the libgcc_s.so. I haven't figured
> out the exact mechanism, but found that emitting a separate note for each
> a1 change fixes it.
> 

Oh, thank you very much for your detailed investigation.  I will try to correct what you pointed out ASAP.
Max Filippov Jan. 6, 2023, 8:05 a.m. UTC | #5
On Thu, Jan 5, 2023 at 10:57 PM Takayuki 'January June' Suwa
<jjsuwa_sys3175@yahoo.co.jp> wrote:
> By using the helper function, it makes stack frame adjustment logic
> simplified and instruction count less in some cases.

I've built a couple linux configurations with and without this change and
I observe consistent code size growth, e.g.:

iss_defconfig without the change:
  text    data     bss     dec     hex filename
3014768  164016  115108 3293892  3242c4 vmlinux

iss_defconfig with the change:
  text    data     bss     dec     hex filename
3015296  164008  115108 3294412  3244cc vmlinux

virt_defconfig without the change:
  text    data     bss     dec     hex filename
5498881 2254360  291768 8045009  7ac1d1 vmlinux

virt_defconfig with the change:
  text    data     bss     dec     hex filename
5500389 2254360  291768 8046517  7ac7b5 vmlinux

generic_kc705_defconfig without the change:
  text    data     bss     dec     hex filename
7062530  635340  286400 7984270  79d48e vmlinux

generic_kc705_defconfig with the change:
  text    data     bss     dec     hex filename
7064078  635340  286400 7985818  79da9a vmlinux
Takayuki 'January June' Suwa Jan. 7, 2023, 2:54 a.m. UTC | #6
On 2023/01/06 17:05, Max Filippov wrote:
> On Thu, Jan 5, 2023 at 10:57 PM Takayuki 'January June' Suwa
> <jjsuwa_sys3175@yahoo.co.jp> wrote:
>> By using the helper function, it makes stack frame adjustment logic
>> simplified and instruction count less in some cases.
> 
> I've built a couple linux configurations with and without this change and
> I observe consistent code size growth, e.g.:
> 
> iss_defconfig without the change:
>   text    data     bss     dec     hex filename
> 3014768  164016  115108 3293892  3242c4 vmlinux
> 
> iss_defconfig with the change:
>   text    data     bss     dec     hex filename
> 3015296  164008  115108 3294412  3244cc vmlinux
> 
> virt_defconfig without the change:
>   text    data     bss     dec     hex filename
> 5498881 2254360  291768 8045009  7ac1d1 vmlinux
> 
> virt_defconfig with the change:
>   text    data     bss     dec     hex filename
> 5500389 2254360  291768 8046517  7ac7b5 vmlinux
> 
> generic_kc705_defconfig without the change:
>   text    data     bss     dec     hex filename
> 7062530  635340  286400 7984270  79d48e vmlinux
> 
> generic_kc705_defconfig with the change:
>   text    data     bss     dec     hex filename
> 7064078  635340  286400 7985818  79da9a vmlinux
> 

Probably due to this location:
> +  else if (TARGET_DENSITY && optimize_size && xtensa_simm12b (imm))
                             ^^^^^^^^^^^^^^^^
I omitted it in the new patch, so please check it.
diff mbox series

Patch

diff --git a/gcc/config/xtensa/xtensa.cc b/gcc/config/xtensa/xtensa.cc
index ae44199bc98..3b8a7bcda37 100644
--- a/gcc/config/xtensa/xtensa.cc
+++ b/gcc/config/xtensa/xtensa.cc
@@ -2518,6 +2518,82 @@  xtensa_split_DI_reg_imm (rtx *operands)
 }
 
 
+/* Try to split an integer value into what are suitable for two consecutive
+   immediate addition instructions, ADDI or ADDMI.  */
+
+static bool
+xtensa_split_imm_two_addends (HOST_WIDE_INT imm, HOST_WIDE_INT v[2])
+{
+  HOST_WIDE_INT v0, v1;
+
+  if (imm < -32768)
+    v0 = -32768, v1 = imm + 32768;
+  else if (imm > 32512)
+    v0 = 32512, v1 = imm - 32512;
+  else if (TARGET_DENSITY && optimize_size && xtensa_simm12b (imm))
+    /* A pair of MOVI(.N) and ADD.N is one or two bytes less than two
+       immediate additions if TARGET_DENSITY.  */
+    return false;
+  else
+    v0 = (imm + 128) & ~255L, v1 = imm - v0;
+
+  if (xtensa_simm8 (v1) || xtensa_simm8x256 (v1))
+    {
+      v[0] = v0, v[1] = v1;
+      return true;
+    }
+
+  return false;
+}
+
+
+/* Helper function for integer immediate addition with scratch register
+   as needed, that splits and emits either up to two ADDI/ADDMI machine
+   instructions or an addition by register following an immediate integer
+   load (which may later be transformed by constantsynth).
+
+   If 'scratch' is NULL_RTX but still needed, a new pseudo-register will
+   be allocated.  Thus, after the reload/LRA pass, the specified scratch
+   register must be a hard one.  */
+
+static void
+xtensa_emit_add_imm (rtx dst, rtx src, HOST_WIDE_INT imm, rtx scratch,
+		     bool need_note)
+{
+  HOST_WIDE_INT v[2];
+  rtx_insn *insn0, *insn1;
+
+  if (imm == 0)
+    return;
+
+  if (xtensa_simm8 (imm) || xtensa_simm8x256 (imm))
+    insn0 = emit_insn (gen_addsi3 (dst, src, GEN_INT (imm)));
+  else if (xtensa_split_imm_two_addends (imm, v))
+    {
+      insn0 = emit_insn (gen_addsi3 (dst, src, GEN_INT (v[0])));
+      insn1 = emit_insn (gen_addsi3 (dst, dst, GEN_INT (v[1])));
+      if (need_note)
+	RTX_FRAME_RELATED_P (insn1) = 1;
+    }
+  else
+    {
+      if (scratch)
+	emit_move_insn (scratch, GEN_INT (imm));
+      else
+	scratch = force_reg (SImode, GEN_INT (imm));
+      insn0 = emit_insn (gen_addsi3 (dst, src, scratch));
+    }
+
+  if (need_note)
+    {
+      rtx note_rtx = gen_rtx_SET (dst, plus_constant (Pmode, src, imm));
+
+      RTX_FRAME_RELATED_P (insn0) = 1;
+      add_reg_note (insn0, REG_FRAME_RELATED_EXPR, note_rtx);
+    }
+}
+
+
 /* Implement TARGET_CANNOT_FORCE_CONST_MEM.  */
 
 static bool
@@ -3245,41 +3321,14 @@  xtensa_initial_elimination_offset (int from, int to ATTRIBUTE_UNUSED)
 static void
 xtensa_emit_adjust_stack_ptr (HOST_WIDE_INT offset, int flags)
 {
-  rtx_insn *insn;
-  rtx ptr = (flags & ADJUST_SP_FRAME_PTR) ? hard_frame_pointer_rtx
-					  : stack_pointer_rtx;
-
   if (cfun->machine->inhibit_logues_a1_adjusts)
     return;
 
-  if (xtensa_simm8 (offset)
-      || xtensa_simm8x256 (offset))
-    insn = emit_insn (gen_addsi3 (stack_pointer_rtx, ptr, GEN_INT (offset)));
-  else
-    {
-      rtx tmp_reg = gen_rtx_REG (Pmode, A9_REG);
-
-      if (offset < 0)
-	{
-	  emit_move_insn (tmp_reg, GEN_INT (-offset));
-	  insn = emit_insn (gen_subsi3 (stack_pointer_rtx, ptr, tmp_reg));
-	}
-      else
-	{
-	  emit_move_insn (tmp_reg, GEN_INT (offset));
-	  insn = emit_insn (gen_addsi3 (stack_pointer_rtx, ptr,	tmp_reg));
-	}
-    }
-
-  if (flags & ADJUST_SP_NEED_NOTE)
-    {
-      rtx note_rtx = gen_rtx_SET (stack_pointer_rtx,
-				  plus_constant (Pmode, stack_pointer_rtx,
-						 offset));
-
-      RTX_FRAME_RELATED_P (insn) = 1;
-      add_reg_note (insn, REG_FRAME_RELATED_EXPR, note_rtx);
-    }
+  xtensa_emit_add_imm (stack_pointer_rtx,
+		       (flags & ADJUST_SP_FRAME_PTR)
+			? hard_frame_pointer_rtx : stack_pointer_rtx,
+		       offset, gen_rtx_REG (Pmode, A9_REG),
+		       (flags & ADJUST_SP_NEED_NOTE));
 }
 
 /* minimum frame = reg save area (4 words) plus static chain (1 word)
@@ -3307,8 +3356,9 @@  xtensa_expand_prologue (void)
 	  /* Use a8 as a temporary since a0-a7 may be live.  */
 	  rtx tmp_reg = gen_rtx_REG (Pmode, A8_REG);
 	  emit_insn (gen_entry (GEN_INT (MIN_FRAME_SIZE)));
-	  emit_move_insn (tmp_reg, GEN_INT (total_size - MIN_FRAME_SIZE));
-	  emit_insn (gen_subsi3 (tmp_reg, stack_pointer_rtx, tmp_reg));
+	  xtensa_emit_add_imm (tmp_reg, stack_pointer_rtx,
+			       MIN_FRAME_SIZE - total_size,
+			       tmp_reg, false);
 	  insn = emit_insn (gen_movsi (stack_pointer_rtx, tmp_reg));
 	}
     }
@@ -3540,8 +3590,8 @@  xtensa_set_return_address (rtx address, rtx scratch)
 
   if (total_size > 1024)
     {
-      emit_move_insn (scratch, GEN_INT (total_size - UNITS_PER_WORD));
-      emit_insn (gen_addsi3 (scratch, frame, scratch));
+      xtensa_emit_add_imm (scratch, frame, total_size - UNITS_PER_WORD,
+			   scratch, false);
       a0_addr = scratch;
     }
 
@@ -5101,15 +5151,7 @@  xtensa_output_mi_thunk (FILE *file, tree thunk ATTRIBUTE_UNUSED,
   this_rtx = gen_rtx_REG (Pmode, A0_REG + this_reg_no);
 
   if (delta)
-    {
-      if (xtensa_simm8 (delta))
-	emit_insn (gen_addsi3 (this_rtx, this_rtx, GEN_INT (delta)));
-      else
-	{
-	  emit_move_insn (temp0, GEN_INT (delta));
-	  emit_insn (gen_addsi3 (this_rtx, this_rtx, temp0));
-	}
-    }
+    xtensa_emit_add_imm (this_rtx, this_rtx, delta, temp0, false);
 
   if (vcall_offset)
     {
@@ -5119,13 +5161,8 @@  xtensa_output_mi_thunk (FILE *file, tree thunk ATTRIBUTE_UNUSED,
       emit_move_insn (temp0, gen_rtx_MEM (Pmode, this_rtx));
       if (xtensa_uimm8x4 (vcall_offset))
 	addr = plus_constant (Pmode, temp0, vcall_offset);
-      else if (xtensa_simm8 (vcall_offset))
-	emit_insn (gen_addsi3 (temp1, temp0, GEN_INT (vcall_offset)));
       else
-	{
-	  emit_move_insn (temp1, GEN_INT (vcall_offset));
-	  emit_insn (gen_addsi3 (temp1, temp0, temp1));
-	}
+	xtensa_emit_add_imm (temp1, temp0, vcall_offset, temp1, false);
       emit_move_insn (temp1, gen_rtx_MEM (Pmode, addr));
       emit_insn (gen_add2_insn (this_rtx, temp1));
     }