diff mbox

[2/2,v3] Drop excess size used for run time allocated stack variables.

Message ID 20160525133241.GB6938@linux.vnet.ibm.com
State New
Headers show

Commit Message

Dominik Vogt May 25, 2016, 1:32 p.m. UTC
On Wed, May 25, 2016 at 02:30:54PM +0100, Dominik Vogt wrote:
> On Tue, May 03, 2016 at 03:17:53PM +0100, Dominik Vogt wrote:
> > Version two of the patch including a test case.
> > 
> > On Mon, May 02, 2016 at 09:10:25AM -0600, Jeff Law wrote:
> > > On 04/29/2016 04:12 PM, Dominik Vogt wrote:
> > > >The attached patch removes excess stack space allocation with
> > > >alloca in some situations.  Plese check the commit message in the
> > > >patch for details.
> > 
> > > However, I would strongly recommend some tests, even if they are
> > > target specific.  You can always copy pr36728-1 into the s390x
> > > directory and look at size of the generated stack.  Simliarly for
> > > pr50938 for x86.
> > 
> > However, x86 uses the "else" branch in round_push, i.e. it uses
> > "virtual_preferred_stack_boundary_rtx" to calculate the number of
> > bytes to add for stack alignment.  That value is unknown at the
> > time round_push is called, so the test case fails on such targets,
> > and I've no idea how to fix this properly.
> 
> Third version of the patch with the suggested cleanup in the first
> patch and the functional stuff in the second one.  The first patch
> is based on Jeff's draft with the change suggested by Eric and
> more cleanup added by me.

This is the updated funtional patch.  Re-tested with limited
effort, i.e. tested and bootstrapped on s390x biarch (but did not
look for performance regressions compared to version 2 of the
patch).

Ciao

Dominik ^_^  ^_^

Comments

Dominik Vogt May 25, 2016, 1:41 p.m. UTC | #1
On Wed, May 25, 2016 at 02:32:41PM +0100, Dominik Vogt wrote:
> On Wed, May 25, 2016 at 02:30:54PM +0100, Dominik Vogt wrote:
> > On Tue, May 03, 2016 at 03:17:53PM +0100, Dominik Vogt wrote:
> > > Version two of the patch including a test case.
> > > 
> > > On Mon, May 02, 2016 at 09:10:25AM -0600, Jeff Law wrote:
> > > > On 04/29/2016 04:12 PM, Dominik Vogt wrote:
> > > > >The attached patch removes excess stack space allocation with
> > > > >alloca in some situations.  Plese check the commit message in the
> > > > >patch for details.
> > > 
> > > > However, I would strongly recommend some tests, even if they are
> > > > target specific.  You can always copy pr36728-1 into the s390x
> > > > directory and look at size of the generated stack.  Simliarly for
> > > > pr50938 for x86.
> > > 
> > > However, x86 uses the "else" branch in round_push, i.e. it uses
> > > "virtual_preferred_stack_boundary_rtx" to calculate the number of
> > > bytes to add for stack alignment.  That value is unknown at the
> > > time round_push is called, so the test case fails on such targets,
> > > and I've no idea how to fix this properly.
> > 
> > Third version of the patch with the suggested cleanup in the first
> > patch and the functional stuff in the second one.  The first patch
> > is based on Jeff's draft with the change suggested by Eric and
> > more cleanup added by me.
> 
> This is the updated funtional patch.  Re-tested with limited
> effort, i.e. tested and bootstrapped on s390x biarch (but did not
> look for performance regressions compared to version 2 of the
> patch).

(I won't be able to reply to any comments until 20th of June.)

Ciao

Dominik ^_^  ^_^
Bernd Schmidt June 8, 2016, 11:21 a.m. UTC | #2
On 05/25/2016 03:32 PM, Dominik Vogt wrote:
> 	* explow.c (round_push): Use know adjustment.
> 	(allocate_dynamic_stack_space): Pass known adjustment to round_push.
> gcc/testsuite/ChangeLog
>

I was thinking about whether it would be possible/desirable to eliminate 
the double add entirely, but I couldn't find a way to structure the code 
in a way that seems better than what you have. So, ...

>  /* Round the size of a block to be pushed up to the boundary required
> -   by this machine.  SIZE is the desired size, which need not be constant.  */
> +   by this machine.  SIZE is the desired size, which need not be constant.
> +   ALREADY_ADDED is the number of units that have already been added to SIZE
> +   for other alignment reasons.
> +*/

The */ goes on the last line of the comment.

> +/* PR/50938: Check that alloca () reserves the correct amount of stack space.
> + */

Same here really, even if it's only a test.

Ok with these fixed.


Bernd
Dominik Vogt June 20, 2016, 12:19 p.m. UTC | #3
On Wed, Jun 08, 2016 at 01:21:09PM +0200, Bernd Schmidt wrote:
> On 05/25/2016 03:32 PM, Dominik Vogt wrote:
> >	* explow.c (round_push): Use know adjustment.
> >	(allocate_dynamic_stack_space): Pass known adjustment to round_push.
> >gcc/testsuite/ChangeLog
> >
> 
> I was thinking about whether it would be possible/desirable to
> eliminate the double add entirely, but I couldn't find a way to
> structure the code in a way that seems better than what you have.
> So, ...
> 
> > /* Round the size of a block to be pushed up to the boundary required
> >-   by this machine.  SIZE is the desired size, which need not be constant.  */
> >+   by this machine.  SIZE is the desired size, which need not be constant.
> >+   ALREADY_ADDED is the number of units that have already been added to SIZE
> >+   for other alignment reasons.
> >+*/
> 
> The */ goes on the last line of the comment.

No problem.

> >+/* PR/50938: Check that alloca () reserves the correct amount of stack space.
> >+ */
> 
> Same here really, even if it's only a test.

In this case, the line gets too long with "  */" appended.

Ciao

Dominik ^_^  ^_^
Bernd Schmidt June 20, 2016, 12:20 p.m. UTC | #4
On 06/20/2016 02:19 PM, Dominik Vogt wrote:

>>> +/* PR/50938: Check that alloca () reserves the correct amount of stack space.
>>> + */
>>
>> Same here really, even if it's only a test.
>
> In this case, the line gets too long with "  */" appended.

In that case we wrap before the last word.


Bernd
Jeff Law June 23, 2016, 4:24 a.m. UTC | #5
On 05/25/2016 07:32 AM, Dominik Vogt wrote:
> On Wed, May 25, 2016 at 02:30:54PM +0100, Dominik Vogt wrote:
>> > On Tue, May 03, 2016 at 03:17:53PM +0100, Dominik Vogt wrote:
>>> > > Version two of the patch including a test case.
>>> > >
>>> > > On Mon, May 02, 2016 at 09:10:25AM -0600, Jeff Law wrote:
>>>> > > > On 04/29/2016 04:12 PM, Dominik Vogt wrote:
>>>>> > > > >The attached patch removes excess stack space allocation with
>>>>> > > > >alloca in some situations.  Plese check the commit message in the
>>>>> > > > >patch for details.
>>> > >
>>>> > > > However, I would strongly recommend some tests, even if they are
>>>> > > > target specific.  You can always copy pr36728-1 into the s390x
>>>> > > > directory and look at size of the generated stack.  Simliarly for
>>>> > > > pr50938 for x86.
>>> > >
>>> > > However, x86 uses the "else" branch in round_push, i.e. it uses
>>> > > "virtual_preferred_stack_boundary_rtx" to calculate the number of
>>> > > bytes to add for stack alignment.  That value is unknown at the
>>> > > time round_push is called, so the test case fails on such targets,
>>> > > and I've no idea how to fix this properly.
>> >
>> > Third version of the patch with the suggested cleanup in the first
>> > patch and the functional stuff in the second one.  The first patch
>> > is based on Jeff's draft with the change suggested by Eric and
>> > more cleanup added by me.
> This is the updated funtional patch.  Re-tested with limited
> effort, i.e. tested and bootstrapped on s390x biarch (but did not
> look for performance regressions compared to version 2 of the
> patch).
>
> Ciao
>
> Dominik ^_^  ^_^
>
> -- Dominik Vogt IBM Germany
>
>
> 0002-v3-ChangeLog
>
>
> gcc/ChangeLog
>
> 	* explow.c (round_push): Use know adjustment.
> 	(allocate_dynamic_stack_space): Pass known adjustment to round_push.
> gcc/testsuite/ChangeLog
>
> 	* gcc.dg/pr50938.c: New test.
>
>
> 0002-v3-Drop-excess-size-used-for-run-time-allocated-stack-v.patch
>
>
> From 4296d353e1d153b5b5ee435a44cae6117bf2fff0 Mon Sep 17 00:00:00 2001
> From: Dominik Vogt <vogt@linux.vnet.ibm.com>
> Date: Fri, 29 Apr 2016 08:36:59 +0100
> Subject: [PATCH 2/2] Drop excess size used for run time allocated stack
>  variables.
>
> The present calculation sometimes led to more stack memory being used than
> necessary with alloca.  First, (STACK_BOUNDARY -1) would be added to the
> allocated size:
>
>   size = plus_constant (Pmode, size, extra);
>   size = force_operand (size, NULL_RTX);
>
> Then round_push was called and added another (STACK_BOUNDARY - 1) before
> rounding down to a multiple of STACK_BOUNDARY.  On s390x this resulted in
> adding 14 before rounding down for "x" in the test case pr36728-1.c.
>
> round_push() now takes an argument to inform it about what has already been
> added to size.
> ---
>  gcc/explow.c                   | 45 +++++++++++++++++++++---------------
>  gcc/testsuite/gcc.dg/pr50938.c | 52 ++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 79 insertions(+), 18 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.dg/pr50938.c
>
> diff --git a/gcc/explow.c b/gcc/explow.c
> index 09a0330..85596e2 100644
> --- a/gcc/explow.c
> +++ b/gcc/explow.c
> @@ -949,24 +949,30 @@ anti_adjust_stack (rtx adjust)
>  }
>
>  /* Round the size of a block to be pushed up to the boundary required
> -   by this machine.  SIZE is the desired size, which need not be constant.  */
> +   by this machine.  SIZE is the desired size, which need not be constant.
> +   ALREADY_ADDED is the number of units that have already been added to SIZE
> +   for other alignment reasons.
> +*/
>
>  static rtx
> -round_push (rtx size)
> +round_push (rtx size, int already_added)
>  {
> -  rtx align_rtx, alignm1_rtx;
> +  rtx align_rtx, add_rtx;
>
>    if (!SUPPORTS_STACK_ALIGNMENT
>        || crtl->preferred_stack_boundary == MAX_SUPPORTED_STACK_ALIGNMENT)
>      {
>        int align = crtl->preferred_stack_boundary / BITS_PER_UNIT;
> +      int add;
>
>        if (align == 1)
>  	return size;
>
> +      add = (align > already_added) ? align - already_added - 1 : 0;
> +
>        if (CONST_INT_P (size))
>  	{
> -	  HOST_WIDE_INT new_size = (INTVAL (size) + align - 1) / align * align;
> +	  HOST_WIDE_INT new_size = (INTVAL (size) + add) / align * align;
>
>  	  if (INTVAL (size) != new_size)
>  	    size = GEN_INT (new_size);
So presumably the idea here is when the requested SIZE would require 
allocating additional space to first see if the necessary space is 
already available inside ALREADY_ADDED and use that rather than rounding 
size up to an alignment boundary.

I can see how that works in the sense of avoiding allocating extra 
space.  What I'm struggling with is how do we know the space actually 
allocated is going to have the right alignment?

What am I missing here?

jeff
Dominik Vogt June 23, 2016, 9:57 a.m. UTC | #6
On Wed, Jun 22, 2016 at 10:24:02PM -0600, Jeff Law wrote:
> On 05/25/2016 07:32 AM, Dominik Vogt wrote:
> >On Wed, May 25, 2016 at 02:30:54PM +0100, Dominik Vogt wrote:
> >>> On Tue, May 03, 2016 at 03:17:53PM +0100, Dominik Vogt wrote:
> >>>> > Version two of the patch including a test case.
> >>>> >
> >>>> > On Mon, May 02, 2016 at 09:10:25AM -0600, Jeff Law wrote:
> >>>>> > > On 04/29/2016 04:12 PM, Dominik Vogt wrote:
> >>>>>> > > >The attached patch removes excess stack space allocation with
> >>>>>> > > >alloca in some situations.  Plese check the commit message in the
> >>>>>> > > >patch for details.
> >>>> >
> >>>>> > > However, I would strongly recommend some tests, even if they are
> >>>>> > > target specific.  You can always copy pr36728-1 into the s390x
> >>>>> > > directory and look at size of the generated stack.  Simliarly for
> >>>>> > > pr50938 for x86.
> >>>> >
> >>>> > However, x86 uses the "else" branch in round_push, i.e. it uses
> >>>> > "virtual_preferred_stack_boundary_rtx" to calculate the number of
> >>>> > bytes to add for stack alignment.  That value is unknown at the
> >>>> > time round_push is called, so the test case fails on such targets,
> >>>> > and I've no idea how to fix this properly.
> >>>
> >>> Third version of the patch with the suggested cleanup in the first
> >>> patch and the functional stuff in the second one.  The first patch
> >>> is based on Jeff's draft with the change suggested by Eric and
> >>> more cleanup added by me.
> >This is the updated funtional patch.  Re-tested with limited
> >effort, i.e. tested and bootstrapped on s390x biarch (but did not
> >look for performance regressions compared to version 2 of the
> >patch).
> >
> >Ciao
> >
> >Dominik ^_^  ^_^
> >
> >-- Dominik Vogt IBM Germany
> >
> >
> >0002-v3-ChangeLog
> >
> >
> >gcc/ChangeLog
> >
> >	* explow.c (round_push): Use know adjustment.
> >	(allocate_dynamic_stack_space): Pass known adjustment to round_push.
> >gcc/testsuite/ChangeLog
> >
> >	* gcc.dg/pr50938.c: New test.
> >
> >
> >0002-v3-Drop-excess-size-used-for-run-time-allocated-stack-v.patch
> >
> >
> >From 4296d353e1d153b5b5ee435a44cae6117bf2fff0 Mon Sep 17 00:00:00 2001
> >From: Dominik Vogt <vogt@linux.vnet.ibm.com>
> >Date: Fri, 29 Apr 2016 08:36:59 +0100
> >Subject: [PATCH 2/2] Drop excess size used for run time allocated stack
> > variables.
> >
> >The present calculation sometimes led to more stack memory being used than
> >necessary with alloca.  First, (STACK_BOUNDARY -1) would be added to the
> >allocated size:
> >
> >  size = plus_constant (Pmode, size, extra);
> >  size = force_operand (size, NULL_RTX);
> >
> >Then round_push was called and added another (STACK_BOUNDARY - 1) before
> >rounding down to a multiple of STACK_BOUNDARY.  On s390x this resulted in
> >adding 14 before rounding down for "x" in the test case pr36728-1.c.
> >
> >round_push() now takes an argument to inform it about what has already been
> >added to size.
> >---
> > gcc/explow.c                   | 45 +++++++++++++++++++++---------------
> > gcc/testsuite/gcc.dg/pr50938.c | 52 ++++++++++++++++++++++++++++++++++++++++++
> > 2 files changed, 79 insertions(+), 18 deletions(-)
> > create mode 100644 gcc/testsuite/gcc.dg/pr50938.c
> >
> >diff --git a/gcc/explow.c b/gcc/explow.c
> >index 09a0330..85596e2 100644
> >--- a/gcc/explow.c
> >+++ b/gcc/explow.c
> >@@ -949,24 +949,30 @@ anti_adjust_stack (rtx adjust)
> > }
> >
> > /* Round the size of a block to be pushed up to the boundary required
> >-   by this machine.  SIZE is the desired size, which need not be constant.  */
> >+   by this machine.  SIZE is the desired size, which need not be constant.
> >+   ALREADY_ADDED is the number of units that have already been added to SIZE
> >+   for other alignment reasons.
> >+*/
> >
> > static rtx
> >-round_push (rtx size)
> >+round_push (rtx size, int already_added)
> > {
> >-  rtx align_rtx, alignm1_rtx;
> >+  rtx align_rtx, add_rtx;
> >
> >   if (!SUPPORTS_STACK_ALIGNMENT
> >       || crtl->preferred_stack_boundary == MAX_SUPPORTED_STACK_ALIGNMENT)
> >     {
> >       int align = crtl->preferred_stack_boundary / BITS_PER_UNIT;
> >+      int add;
> >
> >       if (align == 1)
> > 	return size;
> >
> >+      add = (align > already_added) ? align - already_added - 1 : 0;
> >+
> >       if (CONST_INT_P (size))
> > 	{
> >-	  HOST_WIDE_INT new_size = (INTVAL (size) + align - 1) / align * align;
> >+	  HOST_WIDE_INT new_size = (INTVAL (size) + add) / align * align;
> >
> > 	  if (INTVAL (size) != new_size)
> > 	    size = GEN_INT (new_size);
> So presumably the idea here is when the requested SIZE would require
> allocating additional space to first see if the necessary space is
> already available inside ALREADY_ADDED

Yes.

> and use that rather than rounding size up to an alignment boundary.

Not exactly.  Consider the unpatched code.  At the beginning we
have some amount of space to be allocated on the stack at runtime
("SSIZE"), some requested alignment for it ("SALIGN").

get_dynamic_stack_size() first calculates the space needed for run
time alignment:

  SIZE = SSIZE + SALIGN - 1

Then it calls round_push() to add *another* chunk of memory to the
allocation size to be able to align it to the required stack slot
alignment ("SLOTALIGN") at run time.

  SIZE = SIZE + SLOTALIGN - 1
       = SSIZE + (SALIGN - 1) + (SLOTALIGN - 1)

Now it has added two chunks of memory but alignment is only done
once.  With the patch it just adds the maximum of (SALIGN - 1) and
(SLOTALIGN - 1), not both.  Thinking about it, the "round_push"
stuff is a very complicated way of saying "add max(A, B)".

I'd volunteer to clean this up more, but preferrably when the two
pending patches are in.  The current code is a real brain-twister.

> I can see how that works in the sense of avoiding allocating extra
> space.  What I'm struggling with is how do we know the space
> actually allocated is going to have the right alignment?

It doesn't.  That is what the extra space is needed for, i.e. the
data is placed in the (larger) allocated space at an address with
proper alignment, at run time.

Ciao

Dominik ^_^  ^_^
Jeff Law July 21, 2016, 8:07 p.m. UTC | #7
On 06/23/2016 03:57 AM, Dominik Vogt wrote:
>>> 0002-v3-ChangeLog
>>>
>>>
>>> gcc/ChangeLog
>>>
>>> 	* explow.c (round_push): Use know adjustment.
>>> 	(allocate_dynamic_stack_space): Pass known adjustment to round_push.
>>> gcc/testsuite/ChangeLog
>>>
>>> 	* gcc.dg/pr50938.c: New test.
>>>
>>>
>>> 0002-v3-Drop-excess-size-used-for-run-time-allocated-stack-v.patch
>>>
>>>
>> >From 4296d353e1d153b5b5ee435a44cae6117bf2fff0 Mon Sep 17 00:00:00 2001
>>> From: Dominik Vogt <vogt@linux.vnet.ibm.com>
>>> Date: Fri, 29 Apr 2016 08:36:59 +0100
>>> Subject: [PATCH 2/2] Drop excess size used for run time allocated stack
>>> variables.
>>>
>>> The present calculation sometimes led to more stack memory being used than
>>> necessary with alloca.  First, (STACK_BOUNDARY -1) would be added to the
>>> allocated size:
>>>
>>>  size = plus_constant (Pmode, size, extra);
>>>  size = force_operand (size, NULL_RTX);
>>>
>>> Then round_push was called and added another (STACK_BOUNDARY - 1) before
>>> rounding down to a multiple of STACK_BOUNDARY.  On s390x this resulted in
>>> adding 14 before rounding down for "x" in the test case pr36728-1.c.
>>>
>>> round_push() now takes an argument to inform it about what has already been
>>> added to size.
>>> ---
>>> gcc/explow.c                   | 45 +++++++++++++++++++++---------------
>>> gcc/testsuite/gcc.dg/pr50938.c | 52 ++++++++++++++++++++++++++++++++++++++++++
>>> 2 files changed, 79 insertions(+), 18 deletions(-)
>>> create mode 100644 gcc/testsuite/gcc.dg/pr50938.c
>>>
>>> diff --git a/gcc/explow.c b/gcc/explow.c
>>> index 09a0330..85596e2 100644
>>> --- a/gcc/explow.c
>>> +++ b/gcc/explow.c
>>> @@ -949,24 +949,30 @@ anti_adjust_stack (rtx adjust)
>>> }
>>>
>>> /* Round the size of a block to be pushed up to the boundary required
>>> -   by this machine.  SIZE is the desired size, which need not be constant.  */
>>> +   by this machine.  SIZE is the desired size, which need not be constant.
>>> +   ALREADY_ADDED is the number of units that have already been added to SIZE
>>> +   for other alignment reasons.
>>> +*/
>>>
>>> static rtx
>>> -round_push (rtx size)
>>> +round_push (rtx size, int already_added)
>>> {
>>> -  rtx align_rtx, alignm1_rtx;
>>> +  rtx align_rtx, add_rtx;
>>>
>>>   if (!SUPPORTS_STACK_ALIGNMENT
>>>       || crtl->preferred_stack_boundary == MAX_SUPPORTED_STACK_ALIGNMENT)
>>>     {
>>>       int align = crtl->preferred_stack_boundary / BITS_PER_UNIT;
>>> +      int add;
>>>
>>>       if (align == 1)
>>> 	return size;
>>>
>>> +      add = (align > already_added) ? align - already_added - 1 : 0;
>>> +
>>>       if (CONST_INT_P (size))
>>> 	{
>>> -	  HOST_WIDE_INT new_size = (INTVAL (size) + align - 1) / align * align;
>>> +	  HOST_WIDE_INT new_size = (INTVAL (size) + add) / align * align;
>>>
>>> 	  if (INTVAL (size) != new_size)
>>> 	    size = GEN_INT (new_size);
>> So presumably the idea here is when the requested SIZE would require
>> allocating additional space to first see if the necessary space is
>> already available inside ALREADY_ADDED
>
> Yes.
>
>> and use that rather than rounding size up to an alignment boundary.
>
> Not exactly.  Consider the unpatched code.  At the beginning we
> have some amount of space to be allocated on the stack at runtime
> ("SSIZE"), some requested alignment for it ("SALIGN").
>
> get_dynamic_stack_size() first calculates the space needed for run
> time alignment:
>
>   SIZE = SSIZE + SALIGN - 1
>
> Then it calls round_push() to add *another* chunk of memory to the
> allocation size to be able to align it to the required stack slot
> alignment ("SLOTALIGN") at run time.
>
>   SIZE = SIZE + SLOTALIGN - 1
>        = SSIZE + (SALIGN - 1) + (SLOTALIGN - 1)
>
> Now it has added two chunks of memory but alignment is only done
> once.  With the patch it just adds the maximum of (SALIGN - 1) and
> (SLOTALIGN - 1), not both.  Thinking about it, the "round_push"
> stuff is a very complicated way of saying "add max(A, B)".
Now I see it.  Thanks, that helped a ton.

>
> I'd volunteer to clean this up more, but preferrably when the two
> pending patches are in.  The current code is a real brain-twister.
I'd be all for such cleanups after we wrap up the pending patches.  It's 
certainly a rats nest of code right now.

This patch is fine for the trunk.  Thanks for your patience.

jeff
Dominik Vogt July 22, 2016, 9:55 a.m. UTC | #8
On Thu, Jul 21, 2016 at 02:07:05PM -0600, Jeff Law wrote:
> On 06/23/2016 03:57 AM, Dominik Vogt wrote:
> >>and use that rather than rounding size up to an alignment boundary.
> >
> >Not exactly.  Consider the unpatched code.  At the beginning we
> >have some amount of space to be allocated on the stack at runtime
> >("SSIZE"), some requested alignment for it ("SALIGN").
> >
> >get_dynamic_stack_size() first calculates the space needed for run
> >time alignment:
> >
> >  SIZE = SSIZE + SALIGN - 1
> >
> >Then it calls round_push() to add *another* chunk of memory to the
> >allocation size to be able to align it to the required stack slot
> >alignment ("SLOTALIGN") at run time.
> >
> >  SIZE = SIZE + SLOTALIGN - 1
> >       = SSIZE + (SALIGN - 1) + (SLOTALIGN - 1)
> >
> >Now it has added two chunks of memory but alignment is only done
> >once.  With the patch it just adds the maximum of (SALIGN - 1) and
> >(SLOTALIGN - 1), not both.  Thinking about it, the "round_push"
> >stuff is a very complicated way of saying "add max(A, B)".
> Now I see it.  Thanks, that helped a ton.
> 
> >
> >I'd volunteer to clean this up more, but preferrably when the two
> >pending patches are in.  The current code is a real brain-twister.
> I'd be all for such cleanups after we wrap up the pending patches.
> It's certainly a rats nest of code right now.
> 
> This patch is fine for the trunk.  Thanks for your patience.

Actually I was goind to abandon the patch in its current state.
:-)  We talked about it internally and concluded that the problem
is really this:

 * get_dynamic_stack_size is passed a SIZE of a data block (which
   is allocated elsewhere), the SIZE_ALIGN of the SIZE (i.e. the
   alignment of the underlying memory units (e.g. 32 bytes split
   into 4 times 8 bytes = 64 bit alignment) and the
   REQUIRED_ALIGN of the data portion of the allocated memory.
 * Assuming the function is called with SIZE = 2, SIZE_ALIGN = 8
   and REQUIRED_ALIGN = 64 it first adds 7 bytes to SIZE -> 9.
   This is what is needed to have two bytes 8-byte-aligned at some
   memory location without any known alignment.
 * Finally round_push is called to round up SIZE to a multiple of
   the stack slot size.

The key to understanding this is that the function assumes that
STACK_DYNMAIC_OFFSET is completely unknown at the time its called
and therefore it does not make assumptions about the alignment of
STACKPOINTER + STACK_DYNMAIC_OFFSET.  The latest patch simply
hard-codes that SP + SDO is supposed to be aligned to at least
stack slot size (and does that in a very complicated way).  Since
there is no guarantee that this is the case on all targets, the
patch is broken.  It may miscalculate a SIZE that is too small in
some cases.

However, on many targets there is some guarantee about the
alignment of SP + SDO even if the actual value of SDO is unknown.
On s390x it's always 8-byte-aligned (stack slot size).  So the
right fix should be to add knowledge about the target's guaranteed
alignment of SP + SDO to the function.  I'm right now testing a
much simpler patch that uses
REGNO_POINTER_ALIGN(VIRTUAL_STACK_DYNAMIC_REGNUM) as the
alignment.

Ciao

Dominik ^_^  ^_^
diff mbox

Patch

From 4296d353e1d153b5b5ee435a44cae6117bf2fff0 Mon Sep 17 00:00:00 2001
From: Dominik Vogt <vogt@linux.vnet.ibm.com>
Date: Fri, 29 Apr 2016 08:36:59 +0100
Subject: [PATCH 2/2] Drop excess size used for run time allocated stack
 variables.

The present calculation sometimes led to more stack memory being used than
necessary with alloca.  First, (STACK_BOUNDARY -1) would be added to the
allocated size:

  size = plus_constant (Pmode, size, extra);
  size = force_operand (size, NULL_RTX);

Then round_push was called and added another (STACK_BOUNDARY - 1) before
rounding down to a multiple of STACK_BOUNDARY.  On s390x this resulted in
adding 14 before rounding down for "x" in the test case pr36728-1.c.

round_push() now takes an argument to inform it about what has already been
added to size.
---
 gcc/explow.c                   | 45 +++++++++++++++++++++---------------
 gcc/testsuite/gcc.dg/pr50938.c | 52 ++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 79 insertions(+), 18 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/pr50938.c

diff --git a/gcc/explow.c b/gcc/explow.c
index 09a0330..85596e2 100644
--- a/gcc/explow.c
+++ b/gcc/explow.c
@@ -949,24 +949,30 @@  anti_adjust_stack (rtx adjust)
 }
 
 /* Round the size of a block to be pushed up to the boundary required
-   by this machine.  SIZE is the desired size, which need not be constant.  */
+   by this machine.  SIZE is the desired size, which need not be constant.
+   ALREADY_ADDED is the number of units that have already been added to SIZE
+   for other alignment reasons.
+*/
 
 static rtx
-round_push (rtx size)
+round_push (rtx size, int already_added)
 {
-  rtx align_rtx, alignm1_rtx;
+  rtx align_rtx, add_rtx;
 
   if (!SUPPORTS_STACK_ALIGNMENT
       || crtl->preferred_stack_boundary == MAX_SUPPORTED_STACK_ALIGNMENT)
     {
       int align = crtl->preferred_stack_boundary / BITS_PER_UNIT;
+      int add;
 
       if (align == 1)
 	return size;
 
+      add = (align > already_added) ? align - already_added - 1 : 0;
+
       if (CONST_INT_P (size))
 	{
-	  HOST_WIDE_INT new_size = (INTVAL (size) + align - 1) / align * align;
+	  HOST_WIDE_INT new_size = (INTVAL (size) + add) / align * align;
 
 	  if (INTVAL (size) != new_size)
 	    size = GEN_INT (new_size);
@@ -974,7 +980,7 @@  round_push (rtx size)
 	}
 
       align_rtx = GEN_INT (align);
-      alignm1_rtx = GEN_INT (align - 1);
+      add_rtx = (add > 0) ? GEN_INT (add) : const0_rtx;
     }
   else
     {
@@ -983,15 +989,15 @@  round_push (rtx size)
 	 substituted by the right value in vregs pass and optimized
 	 during combine.  */
       align_rtx = virtual_preferred_stack_boundary_rtx;
-      alignm1_rtx = force_operand (plus_constant (Pmode, align_rtx, -1),
-				   NULL_RTX);
+      add_rtx = force_operand (plus_constant (Pmode, align_rtx, -1), NULL_RTX);
     }
 
   /* CEIL_DIV_EXPR needs to worry about the addition overflowing,
      but we know it can't.  So add ourselves and then do
      TRUNC_DIV_EXPR.  */
-  size = expand_binop (Pmode, add_optab, size, alignm1_rtx,
-		       NULL_RTX, 1, OPTAB_LIB_WIDEN);
+  if (add_rtx != const0_rtx)
+    size = expand_binop (Pmode, add_optab, size, add_rtx,
+			 NULL_RTX, 1, OPTAB_LIB_WIDEN);
   size = expand_divmod (0, TRUNC_DIV_EXPR, Pmode, size, align_rtx,
 			NULL_RTX, 1);
   size = expand_mult (Pmode, size, align_rtx, NULL_RTX, 1);
@@ -1174,7 +1180,7 @@  allocate_dynamic_stack_space (rtx size, unsigned size_align,
   HOST_WIDE_INT stack_usage_size = -1;
   rtx_code_label *final_label;
   rtx final_target, target;
-  unsigned extra;
+  unsigned extra = 0;
 
   /* If we're asking for zero bytes, it doesn't matter what we point
      to since we can't dereference it.  But return a reasonable
@@ -1251,15 +1257,18 @@  allocate_dynamic_stack_space (rtx size, unsigned size_align,
      example), so we must preventively align the value.  We leave space
      in SIZE for the hole that might result from the alignment operation.  */
 
-  extra = (required_align - BITS_PER_UNIT) / BITS_PER_UNIT;
-  size = plus_constant (Pmode, size, extra);
-  size = force_operand (size, NULL_RTX);
+  if (required_align > BITS_PER_UNIT)
+    {
+      extra = (required_align - BITS_PER_UNIT) / BITS_PER_UNIT;
+      size = plus_constant (Pmode, size, extra);
+      size = force_operand (size, NULL_RTX);
 
-  if (flag_stack_usage_info)
-    stack_usage_size += extra;
+      if (flag_stack_usage_info)
+	stack_usage_size += extra;
 
-  if (extra && size_align > BITS_PER_UNIT)
-    size_align = BITS_PER_UNIT;
+      if (size_align > BITS_PER_UNIT)
+	size_align = BITS_PER_UNIT;
+    }
 
   /* Round the size to a multiple of the required stack alignment.
      Since the stack if presumed to be rounded before this allocation,
@@ -1276,7 +1285,7 @@  allocate_dynamic_stack_space (rtx size, unsigned size_align,
      momentarily mis-aligning the stack.  */
   if (size_align % MAX_SUPPORTED_STACK_ALIGNMENT != 0)
     {
-      size = round_push (size);
+      size = round_push (size, extra);
 
       if (flag_stack_usage_info)
 	{
diff --git a/gcc/testsuite/gcc.dg/pr50938.c b/gcc/testsuite/gcc.dg/pr50938.c
new file mode 100644
index 0000000..14b7540
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pr50938.c
@@ -0,0 +1,52 @@ 
+/* PR/50938: Check that alloca () reserves the correct amount of stack space.
+ */
+
+/* { dg-do run } */
+/* { dg-options "-O0" } */
+/* { dg-require-effective-target alloca } */
+
+__attribute__ ((noinline))
+static void
+dummy (void)
+{
+}
+
+__attribute__ ((noinline))
+static char *
+get_frame_addr (void *p)
+{
+  void *faddr = __builtin_frame_address (0);
+  /* Call a function to make sure that a stack frame exists.  */
+  dummy ();
+  return faddr;
+}
+
+__attribute__ ((noinline))
+static void *
+stack_var (void)
+{
+  /* 1024 bytes on the stack.  */
+  char s[1024];
+  /* One stack slot.  */
+  void *p = (void *)s;
+  /* Keep stack memory alive by passing it to the function.  */
+  return get_frame_addr (p);
+}
+
+__attribute__ ((noinline))
+static void *
+alloca_var (void)
+{
+  /* 1024 bytes on the stack plus one stack slot for a.  */
+  void *a = __builtin_alloca (1024);
+  return get_frame_addr (a);
+}
+
+int main (void)
+{
+  if (stack_var () != alloca_var ())
+    /* The functions used a different amount of stack space.  */
+    __builtin_abort ();
+
+  return 0;
+}
-- 
2.3.0