diff mbox series

[RFC] nptl: change default stack guard size of threads

Message ID 5A1ECB40.9080801@arm.com
State New
Headers show
Series [RFC] nptl: change default stack guard size of threads | expand

Commit Message

Szabolcs Nagy Nov. 29, 2017, 2:59 p.m. UTC
RFC patch based on discussions about stack probing at
https://gcc.gnu.org/ml/gcc-patches/2017-11/msg02306.html

Comments

Florian Weimer Nov. 29, 2017, 3:18 p.m. UTC | #1
On 11/29/2017 03:59 PM, Szabolcs Nagy wrote:
> The change can be made for aarch64 only

That doesn't seem to be the case, looking at the patch.

So what you intended to do, exactly?

A 64 KiB probe interval on legacy 32-bit architectures is really a 
no-go.  It means we have to increase the guard region size to 64 KiB. 
But we cannot do that: The guard allocation comes out of the overall 
thread stack size, and existing applications do not expect that 60K of 
configured stack suddenly becomes unavailable.  Adding the guard size on 
top of the allocation will break setups which are carefully tuned for a 
maximum number of threads.

And that's what people actually do.  Here's an example:

“
  -Xss128k

Reduces the default maximum thread stack size, which allows more of the 
process' virtual memory address space to be used by the Java heap.
”

<http://www.oracle.com/technetwork/java/tuning-139912.html>

We can likely support 64 KiB probe intervals on 64-bit architectures. 
But given the impact on backwards compatibility, I really don't see the 
benefit on (legacy) 32-bit.

Thanks,
Florian
Carlos O'Donell Nov. 29, 2017, 6:16 p.m. UTC | #2
On 11/29/2017 07:18 AM, Florian Weimer wrote:
> On 11/29/2017 03:59 PM, Szabolcs Nagy wrote:
>> The change can be made for aarch64 only
> 
> That doesn't seem to be the case, looking at the patch.
> 
> So what you intended to do, exactly?
> 
> A 64 KiB probe interval on legacy 32-bit architectures is really a
> no-go.  It means we have to increase the guard region size to 64 KiB.
> But we cannot do that: The guard allocation comes out of the overall
> thread stack size, and existing applications do not expect that 60K
> of configured stack suddenly becomes unavailable.  Adding the guard
> size on top of the allocation will break setups which are carefully
> tuned for a maximum number of threads.

We cannot be held to account for carefully tuned applications, such
applications have to be tuned again for newer glibc.

I think we *could* do this for 64-bit and 32-bit AArch64/ARM, but I
don't see the value in doing it for 32-bit.

> And that's what people actually do.  Here's an example:
> 
> “
>  -Xss128k
> 
> Reduces the default maximum thread stack size, which allows more of the process' virtual memory address space to be used by the Java heap.
> ”
> 
> <http://www.oracle.com/technetwork/java/tuning-139912.html>
> 
> We can likely support 64 KiB probe intervals on 64-bit architectures.
> But given the impact on backwards compatibility, I really don't see
> the benefit on (legacy) 32-bit.

I agree, this is expensive for 32-bit, without much reward.

Even on 64-bit, I would like to see bug 11787 fixed to move the guard
page accounting out of the stack, and then you can make the guard page
as big as you want without impacting the stack accounting.
Rich Felker Nov. 29, 2017, 6:29 p.m. UTC | #3
On Wed, Nov 29, 2017 at 10:16:54AM -0800, Carlos O'Donell wrote:
> On 11/29/2017 07:18 AM, Florian Weimer wrote:
> > On 11/29/2017 03:59 PM, Szabolcs Nagy wrote:
> >> The change can be made for aarch64 only
> > 
> > That doesn't seem to be the case, looking at the patch.
> > 
> > So what you intended to do, exactly?
> > 
> > A 64 KiB probe interval on legacy 32-bit architectures is really a
> > no-go.  It means we have to increase the guard region size to 64 KiB.
> > But we cannot do that: The guard allocation comes out of the overall
> > thread stack size, and existing applications do not expect that 60K
> > of configured stack suddenly becomes unavailable.  Adding the guard
> > size on top of the allocation will break setups which are carefully
> > tuned for a maximum number of threads.
> 
> We cannot be held to account for carefully tuned applications, such
> applications have to be tuned again for newer glibc.
> 
> I think we *could* do this for 64-bit and 32-bit AArch64/ARM, but I
> don't see the value in doing it for 32-bit.

If 64k guard is mandatory for safety against jumping over the guard
zone, then I don't think it's possible to "re-tune" 32-bit apps for
the new requirement. This imposes a relatively small limit on possible
number of threads the process can create.

> > And that's what people actually do.  Here's an example:
> > 
> > “
> >  -Xss128k
> > 
> > Reduces the default maximum thread stack size, which allows more of the process' virtual memory address space to be used by the Java heap.
> > ”
> > 
> > <http://www.oracle.com/technetwork/java/tuning-139912.html>
> > 
> > We can likely support 64 KiB probe intervals on 64-bit architectures.
> > But given the impact on backwards compatibility, I really don't see
> > the benefit on (legacy) 32-bit.
> 
> I agree, this is expensive for 32-bit, without much reward.
> 
> Even on 64-bit, I would like to see bug 11787 fixed to move the guard
> page accounting out of the stack, and then you can make the guard page
> as big as you want without impacting the stack accounting.

I agree completely that guard page should not subtract from the usable
stack size but should be in addition to it. If glibc is not currently
behaving that way, I think it's a conformance bug. But making it big
on 32-bit quickly cuts into total available virtual address space if
you have a moderately large number of threads.

Rich
Szabolcs Nagy Nov. 29, 2017, 6:40 p.m. UTC | #4
On 29/11/17 15:18, Florian Weimer wrote:
> On 11/29/2017 03:59 PM, Szabolcs Nagy wrote:
>> The change can be made for aarch64 only
> 
> That doesn't seem to be the case, looking at the patch.
> 
> So what you intended to do, exactly?

it seems the issue applies to all targets, and since
glibc itself have 64k stack jumps i sent the rfc
patch for all targets to err on the safe side.

some targets will have 4k probe interval in gcc but
even those are not safe with existing binaries and
several targets have no proposed stack probe patch
as far as i understand.

> A 64 KiB probe interval on legacy 32-bit architectures is really a no-go.  It means we have to increase the
> guard region size to 64 KiB. But we cannot do that: The guard allocation comes out of the overall thread stack
> size, and existing applications do not expect that 60K of configured stack suddenly becomes unavailable. 
> Adding the guard size on top of the allocation will break setups which are carefully tuned for a maximum number
> of threads.

i was aware of the address space limitation on 32bit
but e.g. aarch64 ilp32 will need the 64k guardsize too.

(i think there are other 32bit targets that support
>4k page size, those may not mind the increase either,
they have to portably support large page size anyway)

i was not aware of bug 11787 and i agree that should
be fixed before the default guardsize is increased.

ARCH_GUARD_DEFAULT_SIZE is a macro that each target
can override so it can be set to 4k on legacy 32bit
targets or its default value can be changed for
__WORDSIZE == 32.

on aarch64 at least the default guard size should be
increased but it may even be better to enforce a minimum
64k guardsize if it is set to a non-zero value by the user.
i was not sure if that's acceptable (observably changing
what the user requested) hence this patch only touches
the default size.

this has implications to generic glibc behaviour
and that's where i'd like to see feedback, because
if there are issues then we may need to rethink
the aarch64 probing implementation in gcc.

e.g. user allocated stacks (thread or signal) cannot
be fixed, so if it is known common practice to allocate
those with single (or no) guard page then changing
glibc default does not solve the problem.
Florian Weimer Nov. 29, 2017, 8:33 p.m. UTC | #5
On 11/29/2017 07:29 PM, Rich Felker wrote:
> On Wed, Nov 29, 2017 at 10:16:54AM -0800, Carlos O'Donell wrote:
>> On 11/29/2017 07:18 AM, Florian Weimer wrote:
>>> On 11/29/2017 03:59 PM, Szabolcs Nagy wrote:
>>>> The change can be made for aarch64 only
>>>
>>> That doesn't seem to be the case, looking at the patch.
>>>
>>> So what you intended to do, exactly?
>>>
>>> A 64 KiB probe interval on legacy 32-bit architectures is really a
>>> no-go.  It means we have to increase the guard region size to 64 KiB.
>>> But we cannot do that: The guard allocation comes out of the overall
>>> thread stack size, and existing applications do not expect that 60K
>>> of configured stack suddenly becomes unavailable.  Adding the guard
>>> size on top of the allocation will break setups which are carefully
>>> tuned for a maximum number of threads.
>>
>> We cannot be held to account for carefully tuned applications, such
>> applications have to be tuned again for newer glibc.
>>
>> I think we *could* do this for 64-bit and 32-bit AArch64/ARM, but I
>> don't see the value in doing it for 32-bit.
> 
> If 64k guard is mandatory for safety against jumping over the guard
> zone, then I don't think it's possible to "re-tune" 32-bit apps for
> the new requirement. This imposes a relatively small limit on possible
> number of threads the process can create.

With the probing implementations I have seen so far, it is feasible 
technically because a probe will never touch a stack area which could 
not conceivably be touched by a signal handler, too.

But it is still a bad situation when GCC documents that it is only 
providing full protection with guard size X, and you decide to run with 
size X/16 to actually get things going.  This means that your 
configuration is out of scope what your vendor will support 
security-wise, and probably non-compliant with applicable guidelines at 
the user site.

So technically you can run with a smaller guard size, but it's still 
impractical to do so for most organizations.

Thanks,
Florian
Florian Weimer Nov. 29, 2017, 8:44 p.m. UTC | #6
On 11/29/2017 07:40 PM, Szabolcs Nagy wrote:
> On 29/11/17 15:18, Florian Weimer wrote:
>> On 11/29/2017 03:59 PM, Szabolcs Nagy wrote:
>>> The change can be made for aarch64 only
>>
>> That doesn't seem to be the case, looking at the patch.
>>
>> So what you intended to do, exactly?
> 
> it seems the issue applies to all targets, and since
> glibc itself have 64k stack jumps i sent the rfc
> patch for all targets to err on the safe side.

glibc has many arbitrarily large stack jumps (although we've been 
eliminating them manually for a while).

What should guide the default size of the guard is not what glibc needs 
for its own routines, but what the stack probing in GCC needs to be correct.

> some targets will have 4k probe interval in gcc but
> even those are not safe with existing binaries and
> several targets have no proposed stack probe patch
> as far as i understand.

Isn't there are generic variant which covers at least alloca and VLAs?

>> A 64 KiB probe interval on legacy 32-bit architectures is really a no-go.  It means we have to increase the
>> guard region size to 64 KiB. But we cannot do that: The guard allocation comes out of the overall thread stack
>> size, and existing applications do not expect that 60K of configured stack suddenly becomes unavailable.
>> Adding the guard size on top of the allocation will break setups which are carefully tuned for a maximum number
>> of threads.
> 
> i was aware of the address space limitation on 32bit
> but e.g. aarch64 ilp32 will need the 64k guardsize too.

Why?

This is a new feature.  Why make this less usable from the start?

(I don't care about aarc64 ILP32 and page sizes smaller than 64 KiB on 
aarch64 in general, so I wont argue this, and this is just a courtesy 
notification that what you are doing is Very Wrong Indeed.)

> (i think there are other 32bit targets that support
> > 4k page size, those may not mind the increase either,
> they have to portably support large page size anyway)

GCC needs to emit probe intervals for the smallest supported page size 
on the the target architecture.  If it does not do that, we end up in 
trouble on the glibc side.

We can throw new code at this problem and solve it for 64-bit.  For 
32-bit, we simply do not have a universally applicable solution.  My 
understanding was that everywhere except on ARM, GCC was compatible with 
the pioneering glibc/Linux work in this area (the guard page we added to 
thread stacks, and the guard page added by the kernel).  If this isn't 
the case, then I'm really disappointed in the disregard of existing 
practice on the GCC side.

> e.g. user allocated stacks (thread or signal) cannot
> be fixed, so if it is known common practice to allocate
> those with single (or no) guard page then changing
> glibc default does not solve the problem.

We have test cases which do not allocate a guard page for those, but 
with the existing probing algorithms I've seen, this is not actually a 
problem if stack usage is kept low.

Thanks,
Florian
Rich Felker Nov. 29, 2017, 8:51 p.m. UTC | #7
On Wed, Nov 29, 2017 at 09:44:14PM +0100, Florian Weimer wrote:
> On 11/29/2017 07:40 PM, Szabolcs Nagy wrote:
> >On 29/11/17 15:18, Florian Weimer wrote:
> >>On 11/29/2017 03:59 PM, Szabolcs Nagy wrote:
> >>>The change can be made for aarch64 only
> >>
> >>That doesn't seem to be the case, looking at the patch.
> >>
> >>So what you intended to do, exactly?
> >
> >it seems the issue applies to all targets, and since
> >glibc itself have 64k stack jumps i sent the rfc
> >patch for all targets to err on the safe side.
> 
> glibc has many arbitrarily large stack jumps (although we've been
> eliminating them manually for a while).
> 
> What should guide the default size of the guard is not what glibc
> needs for its own routines, but what the stack probing in GCC needs
> to be correct.

Agreed.

> >>A 64 KiB probe interval on legacy 32-bit architectures is really a no-go.  It means we have to increase the
> >>guard region size to 64 KiB. But we cannot do that: The guard allocation comes out of the overall thread stack
> >>size, and existing applications do not expect that 60K of configured stack suddenly becomes unavailable.
> >>Adding the guard size on top of the allocation will break setups which are carefully tuned for a maximum number
> >>of threads.
> >
> >i was aware of the address space limitation on 32bit
> >but e.g. aarch64 ilp32 will need the 64k guardsize too.
> 
> Why?
> 
> This is a new feature.  Why make this less usable from the start?
> 
> (I don't care about aarc64 ILP32 and page sizes smaller than 64 KiB
> on aarch64 in general, so I wont argue this, and this is just a
> courtesy notification that what you are doing is Very Wrong Indeed.)

I'm not sure I follow, but from the standpoint of virtual address
space and what is an acceptable cost in wasted address space, any
ILP32-on-64 ABIs should be considered the same as 32-bit archs. As
such, I think GCC really needs to do the stack probe every 4k, not
64k, and the default (and certainly minimum-supported) guard size
should be kept at 4k, not 64k or anything larger.

> >(i think there are other 32bit targets that support
> >> 4k page size, those may not mind the increase either,
> >they have to portably support large page size anyway)
> 
> GCC needs to emit probe intervals for the smallest supported page
> size on the the target architecture.  If it does not do that, we end
> up in trouble on the glibc side.

Agreed.

> We can throw new code at this problem and solve it for 64-bit.  For
> 32-bit, we simply do not have a universally applicable solution.  My
> understanding was that everywhere except on ARM, GCC was compatible
> with the pioneering glibc/Linux work in this area (the guard page we
> added to thread stacks, and the guard page added by the kernel).  If
> this isn't the case, then I'm really disappointed in the disregard
> of existing practice on the GCC side.

Hm? What are you thinking of that GCC might have gotten wrong?

Rich
Florian Weimer Nov. 29, 2017, 9:02 p.m. UTC | #8
On 11/29/2017 09:51 PM, Rich Felker wrote:

> I'm not sure I follow, but from the standpoint of virtual address
> space and what is an acceptable cost in wasted address space, any
> ILP32-on-64 ABIs should be considered the same as 32-bit archs. As
> such, I think GCC really needs to do the stack probe every 4k, not
> 64k, and the default (and certainly minimum-supported) guard size
> should be kept at 4k, not 64k or anything larger.

Yes, and I expect that we will keep using 4 KiB probing on i386 (and 
s390/s390x).  That's what Jeff gave me for testing.  I do hope the final 
upstream version isn't going to be different in this regard.

But in the end, this is up to the machine maintainers (for gcc and glibc).

>> We can throw new code at this problem and solve it for 64-bit.  For
>> 32-bit, we simply do not have a universally applicable solution.  My
>> understanding was that everywhere except on ARM, GCC was compatible
>> with the pioneering glibc/Linux work in this area (the guard page we
>> added to thread stacks, and the guard page added by the kernel).  If
>> this isn't the case, then I'm really disappointed in the disregard
>> of existing practice on the GCC side.
> 
> Hm? What are you thinking of that GCC might have gotten wrong?

Use 64 KiB probe intervals (almost) everywhere as an optimization.  I 
assumed the original RFC patch was motivated by that.

I knew that ARM would be broken because that's what the gcc ARM 
maintainers want.  I assumed that it was restricted to that, but now I'm 
worried that it's not.

Thanks,
Florian
Wilco Dijkstra Nov. 29, 2017, 10:28 p.m. UTC | #9
Florian Weimer wrote:

> glibc has many arbitrarily large stack jumps (although we've been 
> eliminating them manually for a while).
>
> What should guide the default size of the guard is not what glibc needs 
> for its own routines, but what the stack probing in GCC needs to be correct.

It's not related to what GLIBC needs, but GLIBC, like Linux, must continue to 
run old binaries so a larger guard size is definitely beneficial. No existing code
uses probing, so increasing the guard size means far fewer functions could
jump the guard.  The dropoff is exponential, each doubling of guard page size
halves the number of functions with a stack larger than the guard size. That's
why Linux went for a 1MB guard size. A larger thread guard size could be the
default on 64-bit targets or a per-target choice depending on whether they
want to improve safety.

> > i was aware of the address space limitation on 32bit
> > but e.g. aarch64 ilp32 will need the 64k guardsize too.
>
> Why?
>
> This is a new feature.  Why make this less usable from the start?

Why should it be any different from LP64? Typical page size will still be
64KB, so a 4KB guard would be rounded up to 64KB. An ILP32 system
with 64KB pages could create  ~30000 threads per process. Does that
make it unusable?

> GCC needs to emit probe intervals for the smallest supported page size 
> on the the target architecture.  If it does not do that, we end up in 
> trouble on the glibc side.

Smallest supported page size on Arm is 1KB. Should that dictate
everything? It's an ABI choice, not something directly related to
smallest page size.

> > e.g. user allocated stacks (thread or signal) cannot
> > be fixed, so if it is known common practice to allocate
> > those with single (or no) guard page then changing
> > glibc default does not solve the problem.
>
> We have test cases which do not allocate a guard page for those, but 
> with the existing probing algorithms I've seen, this is not actually a 
> problem if stack usage is kept low.

Probing without a guard page is equivalent to no probing. So for these
cases we're simply unprotected.

Wilco
Carlos O'Donell Nov. 29, 2017, 10:38 p.m. UTC | #10
On 11/29/2017 02:28 PM, Wilco Dijkstra wrote:
> Why should it be any different from LP64? Typical page size will still be
> 64KB, so a 4KB guard would be rounded up to 64KB. An ILP32 system
> with 64KB pages could create  ~30000 threads per process. Does that
> make it unusable?

... and this is already configurable via pthread_attr_setguardsize(), so
you *could* tune a system to have smaller guards against the recommendation
of the runtime authors?
Szabolcs Nagy Nov. 29, 2017, 10:44 p.m. UTC | #11
* Florian Weimer <fweimer@redhat.com> [2017-11-29 21:44:14 +0100]:
> On 11/29/2017 07:40 PM, Szabolcs Nagy wrote:
> > On 29/11/17 15:18, Florian Weimer wrote:
> > > On 11/29/2017 03:59 PM, Szabolcs Nagy wrote:
> > > > The change can be made for aarch64 only
> > > 
> > > That doesn't seem to be the case, looking at the patch.
> > > 
> > > So what you intended to do, exactly?
> > 
> > it seems the issue applies to all targets, and since
> > glibc itself have 64k stack jumps i sent the rfc
> > patch for all targets to err on the safe side.
> 
> glibc has many arbitrarily large stack jumps (although we've been
> eliminating them manually for a while).
> 
> What should guide the default size of the guard is not what glibc needs for
> its own routines, but what the stack probing in GCC needs to be correct.
> 

assuming you dont care about legacy binaries or binaries
where probing was turned off.

> > some targets will have 4k probe interval in gcc but
> > even those are not safe with existing binaries and
> > several targets have no proposed stack probe patch
> > as far as i understand.
> 
> Isn't there are generic variant which covers at least alloca and VLAs?
> 
> > > A 64 KiB probe interval on legacy 32-bit architectures is really a no-go.  It means we have to increase the
> > > guard region size to 64 KiB. But we cannot do that: The guard allocation comes out of the overall thread stack
> > > size, and existing applications do not expect that 60K of configured stack suddenly becomes unavailable.
> > > Adding the guard size on top of the allocation will break setups which are carefully tuned for a maximum number
> > > of threads.
> > 
> > i was aware of the address space limitation on 32bit
> > but e.g. aarch64 ilp32 will need the 64k guardsize too.
> 
> Why?
> 
> This is a new feature.  Why make this less usable from the start?
> 
> (I don't care about aarc64 ILP32 and page sizes smaller than 64 KiB on
> aarch64 in general, so I wont argue this, and this is just a courtesy
> notification that what you are doing is Very Wrong Indeed.)
> 
> > (i think there are other 32bit targets that support
> > > 4k page size, those may not mind the increase either,
> > they have to portably support large page size anyway)
> 
> GCC needs to emit probe intervals for the smallest supported page size on
> the the target architecture.  If it does not do that, we end up in trouble
> on the glibc side.
> 
> We can throw new code at this problem and solve it for 64-bit.  For 32-bit,
> we simply do not have a universally applicable solution.  My understanding
> was that everywhere except on ARM, GCC was compatible with the pioneering
> glibc/Linux work in this area (the guard page we added to thread stacks, and
> the guard page added by the kernel).  If this isn't the case, then I'm
> really disappointed in the disregard of existing practice on the GCC side.

i was not involved in the gcc probing design, but i noticed
the glibc implications that's why i posted a patch, my
understanding is that the efficiency of the probing on a
target depends on

- is there an implicit probe already at calls? (return
address stored on stack vs in register) if not then a
function using the stack must make an assumption about
how far the previous probe might have been (this is a
new call abi)
- is there a way to atomically increment sp and probe it
or access memory below sp?  if not then sp can only be
incremented by guardsize-eps at a time where eps depends
on how signal frame is stored. (and if signal frame is
large and written the wrong way then it may be impossible
to avoid memory writes accross a guard at all times)
- frame layout (incomming/outgoing args, locals, frame
pointer etc) can make probing require significant changes
to function prologues (extra instructinos) or not.

on targets like aarch64 the most conservative design
might have enough performance impact to make users turn
it off.

ilp32 has the same probing design as lp64 so it requires
the same guardsize.

to me it seemed a tradeoff between performance overhead
vs address-space overhead, if there are runtime troubles
because of the larger guardsize requirement then the
tradeoff may need to be revisited.
(with better data on both sides)

i also didnt think 64k guard would be prohibitive on
ilp32 (i know there are usecases where it is a problem,
but i expected those to be very rare cases)
Rich Felker Nov. 29, 2017, 11:02 p.m. UTC | #12
On Wed, Nov 29, 2017 at 10:28:49PM +0000, Wilco Dijkstra wrote:
> > > i was aware of the address space limitation on 32bit
> > > but e.g. aarch64 ilp32 will need the 64k guardsize too.
> >
> > Why?
> >
> > This is a new feature.  Why make this less usable from the start?
> 
> Why should it be any different from LP64? Typical page size will still be
> 64KB, so a 4KB guard would be rounded up to 64KB. An ILP32 system
> with 64KB pages could create  ~30000 threads per process. Does that
> make it unusable?

A large page size like 64k seriously impacts the usefulness of a
32-bit virtual address space. You're limited to at most 64k total
pages (assuming userspace can use the whole 4GB which is likely if
it's 64-bit kernel/hardware, but not if it's 32-bit), essentially zero
bits of randomness for ASLR, etc. I understand that there's a trend to
favor larger page sizes to mitigate TLB pressure in huge software; I
don't think there's been nearly enough consideration of the negative
consequences of this trend.

> > GCC needs to emit probe intervals for the smallest supported page size 
> > on the the target architecture.  If it does not do that, we end up in 
> > trouble on the glibc side.
> 
> Smallest supported page size on Arm is 1KB. Should that dictate
> everything? It's an ABI choice, not something directly related to
> smallest page size.

For the purposes of this discussion, hardware-supported page sizes
that are not compatible with the ABI page size are not relevant. For
example SYS_mmap2 cannot represent 1k pages so they're not a
possibility with the ABI. Also, runtime page sizes larger than the ELF
segment alignment are not compatible with the ABI.

Rich
Szabolcs Nagy Nov. 29, 2017, 11:13 p.m. UTC | #13
* Florian Weimer <fweimer@redhat.com> [2017-11-29 22:02:48 +0100]:
> On 11/29/2017 09:51 PM, Rich Felker wrote:
> > Hm? What are you thinking of that GCC might have gotten wrong?
> 
> Use 64 KiB probe intervals (almost) everywhere as an optimization.  I
> assumed the original RFC patch was motivated by that.
> 
> I knew that ARM would be broken because that's what the gcc ARM maintainers
> want.  I assumed that it was restricted to that, but now I'm worried that
> it's not.

no,
it was just much easier to post the patch without setting a
value for each target separately, i believe that larger guard
is useful on all 64bit targets, but to make the patch depend
on __WORDSIZE is also wrong since i knew i'd need different
setting for aarch64 ilp32
James Greenhalgh Dec. 5, 2017, 10:55 a.m. UTC | #14
On Wed, Nov 29, 2017 at 09:02:48PM +0000, Florian Weimer wrote:
> On 11/29/2017 09:51 PM, Rich Felker wrote:
> 
> > I'm not sure I follow, but from the standpoint of virtual address
> > space and what is an acceptable cost in wasted address space, any
> > ILP32-on-64 ABIs should be considered the same as 32-bit archs. As
> > such, I think GCC really needs to do the stack probe every 4k, not
> > 64k, and the default (and certainly minimum-supported) guard size
> > should be kept at 4k, not 64k or anything larger.
> 
> Yes, and I expect that we will keep using 4 KiB probing on i386 (and 
> s390/s390x).  That's what Jeff gave me for testing.  I do hope the final 
> upstream version isn't going to be different in this regard.
> 
> But in the end, this is up to the machine maintainers (for gcc and glibc).
> 
> >> We can throw new code at this problem and solve it for 64-bit.  For
> >> 32-bit, we simply do not have a universally applicable solution.  My
> >> understanding was that everywhere except on ARM, GCC was compatible
> >> with the pioneering glibc/Linux work in this area (the guard page we
> >> added to thread stacks, and the guard page added by the kernel).  If
> >> this isn't the case, then I'm really disappointed in the disregard
> >> of existing practice on the GCC side.
> > 
> > Hm? What are you thinking of that GCC might have gotten wrong?
> 
> Use 64 KiB probe intervals (almost) everywhere as an optimization.  I 
> assumed the original RFC patch was motivated by that.
> 
> I knew that ARM would be broken because that's what the gcc ARM 
> maintainers want.  I assumed that it was restricted to that, but now I'm 
> worried that it's not.

To be clear here, I'm coming in to the review of the probing support in GCC
late, and with very little context on the design of the feature. I certainly
wouldn't want to cause you worry - I've got no intention of pushing for
optimization to a larger guard page size if it would leaves things broken
for AArch64.

Likewise, I have no real desire for us to emit a bunch of extra operations
if we're not required to for glibc.

If I'm reopening earlier conversations, it is only because I wasn't involved
in them. I have no interest in us doing something "Very Wrong Indeed".

If assuming that 64k probes are sufficient on AArch64 is not going to allow
us a correct implementation, then we can't assume 64k probes on AArch64. My
understanding was that we were safe in this as the kernel was giving us a
generous 1MB to play with, and we could modify glibc to also give us 64k
(I admit, I had not considered ILP32, where you've rightly pointed out we
will eat lots of address space if we make this decision).

> > GCC needs to emit probe intervals for the smallest supported page size 
> > on the the target architecture.  If it does not do that, we end up in 
> > trouble on the glibc side.

This is where I may have a misunderstanding, why would it require probing
at the smallest page size, rather than probing at a multiple of the guard
size? It is very likely I'm missing something here as I don't know the glibc
side of this at all.

Thanks for your advice so far. To reiterate, I'm not pushing any particular
optimization agenda in GCC, but I would like to understand the trade-off
we're discussing.

James
Florian Weimer Dec. 6, 2017, 12:50 p.m. UTC | #15
On 12/05/2017 11:55 AM, James Greenhalgh wrote:

> To be clear here, I'm coming in to the review of the probing support in GCC
> late, and with very little context on the design of the feature. I certainly
> wouldn't want to cause you worry - I've got no intention of pushing for
> optimization to a larger guard page size if it would leaves things broken
> for AArch64.

The situation with aarch64 is perhaps a bit complicated because we are 
faced with different kernel builds with different page sizes (4 KiB on 
the low end, 64 KiB for enterprise workloads).

> If assuming that 64k probes are sufficient on AArch64 is not going to allow
> us a correct implementation, then we can't assume 64k probes on AArch64. My
> understanding was that we were safe in this as the kernel was giving us a
> generous 1MB to play with, and we could modify glibc to also give us 64k
> (I admit, I had not considered ILP32, where you've rightly pointed out we
> will eat lots of address space if we make this decision).
You can't modify existing glibc installations which use one page for the 
guard.  For a 32-bit address space, I hope the consensus is that we 
don't want to switch to multiple guard pages by default.

All this means that it will be very difficult to document the protection 
afforded by the -fstack-clash-protection option.

This problem goes away completely if you probe at intervals of the 
minimum page size the system supports because existing glibc versions 
will interoperate with that (in the sense that they provide the intended 
level of protection).

Thanks,
Florian
Florian Weimer Dec. 6, 2017, 12:53 p.m. UTC | #16
On 11/29/2017 11:38 PM, Carlos O'Donell wrote:
> On 11/29/2017 02:28 PM, Wilco Dijkstra wrote:
>> Why should it be any different from LP64? Typical page size will still be
>> 64KB, so a 4KB guard would be rounded up to 64KB. An ILP32 system
>> with 64KB pages could create  ~30000 threads per process. Does that
>> make it unusable?
> 
> ... and this is already configurable via pthread_attr_setguardsize(), so
> you *could* tune a system to have smaller guards against the recommendation
> of the runtime authors?

Sure, if you recompile your application.  That's not really a solution IMHO.

Florian
Wilco Dijkstra Dec. 6, 2017, 1:10 p.m. UTC | #17
Florian Weimer wrote:
> On 11/29/2017 11:38 PM, Carlos O'Donell wrote:
>> On 11/29/2017 02:28 PM, Wilco Dijkstra wrote:
>>> Why should it be any different from LP64? Typical page size will still be
>>> 64KB, so a 4KB guard would be rounded up to 64KB. An ILP32 system
>>> with 64KB pages could create  ~30000 threads per process. Does that
>>> make it unusable?
>> 
>> ... and this is already configurable via pthread_attr_setguardsize(), so
>> you *could* tune a system to have smaller guards against the recommendation
>> of the runtime authors?
>
> Sure, if you recompile your application.  That's not really a solution IMHO.

You've got to do that already in order to run it on ILP32 - there isn't exactly
a large amount of existing ILP32 binaries around, let alone ones that need more
than 30000 threads per process.

Wilco
Florian Weimer Dec. 6, 2017, 1:12 p.m. UTC | #18
On 12/06/2017 02:10 PM, Wilco Dijkstra wrote:
> Florian Weimer wrote:
>> On 11/29/2017 11:38 PM, Carlos O'Donell wrote:
>>> On 11/29/2017 02:28 PM, Wilco Dijkstra wrote:
>>>> Why should it be any different from LP64? Typical page size will still be
>>>> 64KB, so a 4KB guard would be rounded up to 64KB. An ILP32 system
>>>> with 64KB pages could create  ~30000 threads per process. Does that
>>>> make it unusable?
>>>
>>> ... and this is already configurable via pthread_attr_setguardsize(), so
>>> you *could* tune a system to have smaller guards against the recommendation
>>> of the runtime authors?
>>
>> Sure, if you recompile your application.  That's not really a solution IMHO.
> 
> You've got to do that already in order to run it on ILP32 - there isn't exactly
> a large amount of existing ILP32 binaries around, let alone ones that need more
> than 30000 threads per process.

I wasn't talking about aarch64 ILP32, and I suspect neither was Carlos.

In my world, if you still have 32-bit binaries, it's because you lost 
the sources or the tools or knowledge to build them at some point.

Thanks,
Florian
Florian Weimer Dec. 6, 2017, 1:16 p.m. UTC | #19
On 11/29/2017 11:28 PM, Wilco Dijkstra wrote:
> It's not related to what GLIBC needs, but GLIBC, like Linux, must continue to
> run old binaries so a larger guard size is definitely beneficial. No existing code
> uses probing, so increasing the guard size means far fewer functions could
> jump the guard.  The dropoff is exponential, each doubling of guard page size
> halves the number of functions with a stack larger than the guard size.

That's not actually true in the sense that the math doesn't work out 
that way.  If you have a weird function which skips by a really large 
amount, you can double the guard size many, many times until the number 
of unprotected functions drops further.

And there is definitely a long tail here, due to GNU's affinity to 
variable-length arrays and alloca.

> That's why Linux went for a 1MB guard size.

The original intent was to ship only a kernel patch and not change ld.so 
at all.  Qualys did some math to support their recommendation, but it 
turns out the 1 MiB guard size is insufficient to protect ld.so on some 
architectures.  That's why we had to patch ld.so, too.

So far, I haven't seen a strong argument that 64 KiB is better than 32 
KiB, or that switching to 96 KiB or 128 KiB would not provide additional 
protection.  To me, it's still an arbitrary number.

Based on the ld.so experience, I think it is questionable that existing 
vulnerable applications can be fixed by increasing the guard size.  Our 
expectation is that we have to recompile with -fstack-clash-protection 
to get deterministic crashes (which we are doing with glibc), or to 
patch them to avoid the stack jump (which we did for ld.so because the 
GCC support wasn't available at the time).

Thanks,
Florian
Joseph Myers Dec. 6, 2017, 1:40 p.m. UTC | #20
On Wed, 6 Dec 2017, Florian Weimer wrote:

> Based on the ld.so experience, I think it is questionable that existing
> vulnerable applications can be fixed by increasing the guard size.  Our
> expectation is that we have to recompile with -fstack-clash-protection to get
> deterministic crashes (which we are doing with glibc), or to patch them to
> avoid the stack jump (which we did for ld.so because the GCC support wasn't
> available at the time).

I'd say we should continue to fix any cases of unbounded dynamic stack 
allocations in glibc, as being bugs (whether or not bugs with security 
impact), *and* expect to need to compile glibc and everything else with 
-fstack-clash-protection for safety (there are, after all, some quite 
large but bounded static stack allocations in glibc).
Florian Weimer Dec. 6, 2017, 1:51 p.m. UTC | #21
On 12/06/2017 02:40 PM, Joseph Myers wrote:
> On Wed, 6 Dec 2017, Florian Weimer wrote:
> 
>> Based on the ld.so experience, I think it is questionable that existing
>> vulnerable applications can be fixed by increasing the guard size.  Our
>> expectation is that we have to recompile with -fstack-clash-protection to get
>> deterministic crashes (which we are doing with glibc), or to patch them to
>> avoid the stack jump (which we did for ld.so because the GCC support wasn't
>> available at the time).
> 
> I'd say we should continue to fix any cases of unbounded dynamic stack
> allocations in glibc, as being bugs (whether or not bugs with security
> impact), *and* expect to need to compile glibc and everything else with
> -fstack-clash-protection for safety (there are, after all, some quite
> large but bounded static stack allocations in glibc).

Yes, I was too brief.  Any reachable large stack jump still needs to be 
fixed.  But it's very hard to identify those that matter.  Aldy and I 
actually looked at the llp_tmp jump long before the security impact was 
known, but I dismissed, probably because I wrongly assumed that 
LD_LIBRARY_PATH is ignored in AT_SECURE=1 binaries.

So -fstack-clash-protection is needed for the deterministic crashes, as 
a safety net, but it of course doesn't fix any actual bugs.

Thanks,
Florian
Wilco Dijkstra Dec. 6, 2017, 2:27 p.m. UTC | #22
Florian Weimer wrote:
> On 11/29/2017 11:28 PM, Wilco Dijkstra wrote:
>> It's not related to what GLIBC needs, but GLIBC, like Linux, must continue to
>> run old binaries so a larger guard size is definitely beneficial. No existing code
>> uses probing, so increasing the guard size means far fewer functions could
>> jump the guard.  The dropoff is exponential, each doubling of guard page size
>> halves the number of functions with a stack larger than the guard size.
>
> That's not actually true in the sense that the math doesn't work out 
> that way.  If you have a weird function which skips by a really large 
> amount, you can double the guard size many, many times until the number 
> of unprotected functions drops further.
>
> And there is definitely a long tail here, due to GNU's affinity to 
> variable-length arrays and alloca.

The math works fine for most of the curve. The measurements I did
show that the number of functions with really large stack allocations is
extremely small. So it's a long tail only in terms of maximum stack
allocation, not in number of functions.

> So far, I haven't seen a strong argument that 64 KiB is better than 32 
> KiB, or that switching to 96 KiB or 128 KiB would not provide additional 
> protection.  To me, it's still an arbitrary number.

It's based on actual measurements. Increasing the guard further will indeed
improve protection slightly. However the returns are diminishing once you
hit the tail of the curve. The first few doublings matter the most (in that sense
it's similar to ASLR). In principle using a 1MB thread stack guard would be
slightly better but I think that eats too much address space even on a 64-bit
system, so it doesn't feel like a good tradeoff.

> Based on the ld.so experience, I think it is questionable that existing 
> vulnerable applications can be fixed by increasing the guard size.  Our 
> expectation is that we have to recompile with -fstack-clash-protection 
> to get deterministic crashes (which we are doing with glibc), or to 
> patch them to avoid the stack jump (which we did for ld.so because the 
> GCC support wasn't available at the time).

It was very clear from the start that no guard size can "fix" all existing
applications. Increasing the guard just makes it much harder to find
functions that can jump the guard as there are so few of them. That's
the best we can do for existing binaries.

To get to a safer state, probing must be enabled by default everywhere,
but that also means probing should have a low overhead.

Wilco
Jeff Law Dec. 6, 2017, 2:44 p.m. UTC | #23
On 12/06/2017 06:40 AM, Joseph Myers wrote:
> On Wed, 6 Dec 2017, Florian Weimer wrote:
> 
>> Based on the ld.so experience, I think it is questionable that existing
>> vulnerable applications can be fixed by increasing the guard size.  Our
>> expectation is that we have to recompile with -fstack-clash-protection to get
>> deterministic crashes (which we are doing with glibc), or to patch them to
>> avoid the stack jump (which we did for ld.so because the GCC support wasn't
>> available at the time).
> 
> I'd say we should continue to fix any cases of unbounded dynamic stack 
> allocations in glibc, as being bugs (whether or not bugs with security 
> impact), *and* expect to need to compile glibc and everything else with 
> -fstack-clash-protection for safety (there are, after all, some quite 
> large but bounded static stack allocations in glibc).
Agreed 100%.

An unbound dynamic allocation is a security vulnerability for multiple
reasons.  Treating them as bugs is absolutely the way to go.

I'd personally go a step further and say that a dynamic allocation
managed by a human is also a problem waiting to happen.  We've shown
that repeatedly through the decades.  We just can't seem to get them
right ;(


Compiling critical code such as glibc with -fstack-clash-protection is
also highly desirable -- even on architectures that do not have
stack-clash protected prologues.  You obviously get less protection on
such targets, but the primary vector for stack-jumps is dynamic
allocations in our experience and those are protected with generic code.

jeff
Szabolcs Nagy Dec. 6, 2017, 8:41 p.m. UTC | #24
On 06/12/17 14:27, Wilco Dijkstra wrote:
> Florian Weimer wrote:
>> On 11/29/2017 11:28 PM, Wilco Dijkstra wrote:
>>> It's not related to what GLIBC needs, but GLIBC, like Linux, must continue to
>>> run old binaries so a larger guard size is definitely beneficial. No existing code
>>> uses probing, so increasing the guard size means far fewer functions could
>>> jump the guard.  The dropoff is exponential, each doubling of guard page size
>>> halves the number of functions with a stack larger than the guard size.
>>
>> That's not actually true in the sense that the math doesn't work out 
>> that way.  If you have a weird function which skips by a really large 
>> amount, you can double the guard size many, many times until the number 
>> of unprotected functions drops further.
>>
>> And there is definitely a long tail here, due to GNU's affinity to 
>> variable-length arrays and alloca.
> 
> The math works fine for most of the curve. The measurements I did
> show that the number of functions with really large stack allocations is
> extremely small. So it's a long tail only in terms of maximum stack
> allocation, not in number of functions.

with 4k probe interval about 1% of functions need probes
with 64k probe interval about 0.01% (order of magnitude,
alloca not included), so increasing the default guard can
be useful for existing code.

however i have doubts about mandating 64k guard page size
on aarch64.

in principle 64k guard on aarch64 should not cause problems
since there are systems with 64k page size already,

but there are some glibc issues:

bug 11787 needs fixing (at least the guardsize accounting
part) to avoid wasting stack.

some applications control their own memory use via RLIMIT_AS
which means we cannot arbitrarily waste address space either.

the right logic for pthread_attr_init, __pthread_get_minstack,
pthread_getattr_default_np, pthread_getattr_np, allocate_stack,
__libc_alloca_cutoff is not obvious. and backporting a change
with minimal impact may be difficult.

but more importantly glibc cannot fix everything:

there are user allocated stacks where the user presumably
allocates the guard too and we cannot affect that.

there are other libcs and all of them use single guard page now
(bsd systems, bionic, musl, ..) other language runtimes may
explicitly set single guard page too (openjdk, erlang,..?) and
providing reduced security guarantees on all those systems is
suboptimal. (can we round up the guardsize setting?)

so the question is whether pagesize < 64k systems sometimes
getting reduced security because of small guard is worse or
users turning probing off because of 4k probing overhead.
Adhemerval Zanella Netto Dec. 6, 2017, 9:24 p.m. UTC | #25
On 06/12/2017 18:41, Szabolcs Nagy wrote:
> On 06/12/17 14:27, Wilco Dijkstra wrote:
>> Florian Weimer wrote:
>>> On 11/29/2017 11:28 PM, Wilco Dijkstra wrote:
>>>> It's not related to what GLIBC needs, but GLIBC, like Linux, must continue to
>>>> run old binaries so a larger guard size is definitely beneficial. No existing code
>>>> uses probing, so increasing the guard size means far fewer functions could
>>>> jump the guard.  The dropoff is exponential, each doubling of guard page size
>>>> halves the number of functions with a stack larger than the guard size.
>>>
>>> That's not actually true in the sense that the math doesn't work out 
>>> that way.  If you have a weird function which skips by a really large 
>>> amount, you can double the guard size many, many times until the number 
>>> of unprotected functions drops further.
>>>
>>> And there is definitely a long tail here, due to GNU's affinity to 
>>> variable-length arrays and alloca.
>>
>> The math works fine for most of the curve. The measurements I did
>> show that the number of functions with really large stack allocations is
>> extremely small. So it's a long tail only in terms of maximum stack
>> allocation, not in number of functions.
> 
> with 4k probe interval about 1% of functions need probes
> with 64k probe interval about 0.01% (order of magnitude,
> alloca not included), so increasing the default guard can
> be useful for existing code.

Do you mean the function from glibc itself or is it a generic analysis? 
How did you get these numbers?

> 
> however i have doubts about mandating 64k guard page size
> on aarch64.
> 
> in principle 64k guard on aarch64 should not cause problems
> since there are systems with 64k page size already,
> 
> but there are some glibc issues:
> 
> bug 11787 needs fixing (at least the guardsize accounting
> part) to avoid wasting stack.

I do not see BZ#11787 being a pre-requisite for this issue, although
it does interfere with thread stack allocation. Mainly because a
potential fix for 11787 will require a different way to account for
guard page, which will leads to its own possible issues.

> 
> some applications control their own memory use via RLIMIT_AS
> which means we cannot arbitrarily waste address space either.
> 
> the right logic for pthread_attr_init, __pthread_get_minstack,
> pthread_getattr_default_np, pthread_getattr_np, allocate_stack,
> __libc_alloca_cutoff is not obvious. and backporting a change
> with minimal impact may be difficult.

I really wish we could just get rid of __libc_alloca_cutoff internal
usage, it is very bad API and we have better ones to deal with 
potentially unbounded data (dynarray for instance where I recently posted 
a glob refactor to remove alloca usage).

> 
> but more importantly glibc cannot fix everything:
> 
> there are user allocated stacks where the user presumably
> allocates the guard too and we cannot affect that.
> 
> there are other libcs and all of them use single guard page now
> (bsd systems, bionic, musl, ..) other language runtimes may
> explicitly set single guard page too (openjdk, erlang,..?) and
> providing reduced security guarantees on all those systems is
> suboptimal. (can we round up the guardsize setting?)
> 
> so the question is whether pagesize < 64k systems sometimes
> getting reduced security because of small guard is worse or
> users turning probing off because of 4k probing overhead.
> 

I see the question as whether a slight more security coverage of
64k guard pages pays off the potential disruption it may incur.
Your examples and Florian points also does not give me much 
confidence such potential disruption is the best approach.

I think we can at least work from GLIBC side by prioritizing the
removal of unbounded dynamic stack allocations and deprecate the
GNU affinity for VLAs and alloca.
Rich Felker Dec. 6, 2017, 10:07 p.m. UTC | #26
On Wed, Dec 06, 2017 at 08:41:29PM +0000, Szabolcs Nagy wrote:
> On 06/12/17 14:27, Wilco Dijkstra wrote:
> > Florian Weimer wrote:
> >> On 11/29/2017 11:28 PM, Wilco Dijkstra wrote:
> >>> It's not related to what GLIBC needs, but GLIBC, like Linux, must continue to
> >>> run old binaries so a larger guard size is definitely beneficial. No existing code
> >>> uses probing, so increasing the guard size means far fewer functions could
> >>> jump the guard.  The dropoff is exponential, each doubling of guard page size
> >>> halves the number of functions with a stack larger than the guard size.
> >>
> >> That's not actually true in the sense that the math doesn't work out 
> >> that way.  If you have a weird function which skips by a really large 
> >> amount, you can double the guard size many, many times until the number 
> >> of unprotected functions drops further.
> >>
> >> And there is definitely a long tail here, due to GNU's affinity to 
> >> variable-length arrays and alloca.
> > 
> > The math works fine for most of the curve. The measurements I did
> > show that the number of functions with really large stack allocations is
> > extremely small. So it's a long tail only in terms of maximum stack
> > allocation, not in number of functions.
> 
> with 4k probe interval about 1% of functions need probes
> with 64k probe interval about 0.01% (order of magnitude,
> alloca not included), so increasing the default guard can
> be useful for existing code.

Rather than what % of all functions need probes, I think it would be
more meaningful to ask what % of functions with a useful (not trivial
error) code path less than N instructions (for N=100 or so) need
probes. My hypothesis is that, for either threshold 4k or 64k, the %
is negligible.

Of course this is a bit harder to measure but maybe there's some way
to do it? It would be nice if static analysis tools could report that
(they wouldn't exclude trivial error ones but some additional
heuristic or manual review pass could).

Rich
Szabolcs Nagy Dec. 8, 2017, 6:28 p.m. UTC | #27
On 06/12/17 22:07, Rich Felker wrote:
> On Wed, Dec 06, 2017 at 08:41:29PM +0000, Szabolcs Nagy wrote:
>> On 06/12/17 14:27, Wilco Dijkstra wrote:
>>> Florian Weimer wrote:
>>>> On 11/29/2017 11:28 PM, Wilco Dijkstra wrote:
>>>>> It's not related to what GLIBC needs, but GLIBC, like Linux, must continue to
>>>>> run old binaries so a larger guard size is definitely beneficial. No existing code
>>>>> uses probing, so increasing the guard size means far fewer functions could
>>>>> jump the guard.  The dropoff is exponential, each doubling of guard page size
>>>>> halves the number of functions with a stack larger than the guard size.
>>>>
>>>> That's not actually true in the sense that the math doesn't work out 
>>>> that way.  If you have a weird function which skips by a really large 
>>>> amount, you can double the guard size many, many times until the number 
>>>> of unprotected functions drops further.
>>>>
>>>> And there is definitely a long tail here, due to GNU's affinity to 
>>>> variable-length arrays and alloca.
>>>
>>> The math works fine for most of the curve. The measurements I did
>>> show that the number of functions with really large stack allocations is
>>> extremely small. So it's a long tail only in terms of maximum stack
>>> allocation, not in number of functions.
>>
>> with 4k probe interval about 1% of functions need probes
>> with 64k probe interval about 0.01% (order of magnitude,
>> alloca not included), so increasing the default guard can
>> be useful for existing code.
> 
> Rather than what % of all functions need probes, I think it would be
> more meaningful to ask what % of functions with a useful (not trivial
> error) code path less than N instructions (for N=100 or so) need
> probes. My hypothesis is that, for either threshold 4k or 64k, the %
> is negligible.
> 

that is hard to tell, (because we don't know which path
is hot/cold and a big function can have a short hot path)

previously spec17 was analyzed with gcc instrumentation,
now i disasmd the binaries of about 5000 deb packages

all insn:  448415942
sp:=reg:         452
sp-=reg:       12903
sp-=imm:     4199819
imm>4096-128:  14809, 0.353%
imm>16384-512:  2703, 0.064%
imm>32768-1024: 1370, 0.033%
imm>65536-1024:  809, 0.019%

(the threshold values are not exact power of 2 to leave
space for outgoing args, otherwise every function with
outgoing args would need additional probes)

the >4k stack use is rarer in this set than in spec17.

sp-=reg is an alloca usually, when reg had a constant
value it was counted as sp-=imm, but accounting may
not be perfect here.

mov to sp (sp:=reg) is usually save and restore of the
sp around local vla use (frame pointer is not counted),
but it may be an alloca too (only counted if it seemed
it could be an alloca)

sp-=imm should be a good approximation of number of
functions with stack use, but in some cases there are
two stack adjustments in the same function (this is
used for the % numbers).

largest static stack use (some of them are in libraries):

./usr/bin/qpdldecode: 5246976
./usr/bin/ddstdecode: 4124672
./usr/lib/aarch64-linux-gnu/libgfortran.so.4.0.0: 2098912
./usr/lib/aarch64-linux-gnu/sane/libsane-hp4200.so.1.0.25: 1994752
./usr/bin/foomatic-combo-xml: 1130496
./usr/lib/aarch64-linux-gnu/sane/libsane-umax_pp.so.1.0.25: 1069056
./usr/bin/umax_pp: 1069056
./lib/aarch64-linux-gnu/libcryptsetup.so.4.7.0: 1048672
./usr/lib/aarch64-linux-gnu/libCbcSolver.so.3.9.9: 1000528
./usr/lib/aarch64-linux-gnu/libv4lconvert0/ov511-decomp: 999424
./usr/lib/aarch64-linux-gnu/libvtkRenderingVolume-6.3.so.6.3.0: 800256
./usr/lib/inkscape/libinkscape_base.so: 718768
./usr/lib/aarch64-linux-gnu/libv4lconvert0/ov518-decomp: 696320
./usr/lib/aarch64-linux-gnu/libcdaudio.so.1.0.0: 569344
./usr/lib/aarch64-linux-gnu/openmpi/lib/openmpi/mca_btl_tcp.so: 524672
./usr/lib/apt/methods/rred: 524656
./usr/bin/ttf2tfm: 524656
./usr/lib/aarch64-linux-gnu/libvtkImagingColor-6.3.so.6.3.0: 524608
./usr/bin/luatex: 524592
./usr/lib/libgnustep-base.so.1.25.0: 524464
./usr/bin/ufraw-batch: 524384
./usr/bin/png-fix-itxt: 500128

and top binaries by alloca count:

./usr/lib/python2.7/dist-packages/cryptography/hazmat/bindings/_openssl.aarch64-linux-gnu.so: 1920
./usr/lib/gcc/aarch64-linux-gnu/7/cc1plus: 302
./usr/lib/gcc/aarch64-linux-gnu/6/cc1plus: 290
./usr/lib/aarch64-linux-gnu/s2tc/libtxc_dxtn.so: 282
./usr/lib/gcc/aarch64-linux-gnu/7/cc1: 270
./usr/lib/gcc/aarch64-linux-gnu/6/cc1: 257
./usr/bin/gdb: 247
./usr/lib/libgnustep-base.so.1.25.0: 244
./usr/bin/gdb: 241
./usr/lib/gcc/aarch64-linux-gnu/7/lto1: 237
./usr/lib/gcc/aarch64-linux-gnu/6/lto1: 224
./usr/lib/aarch64-linux-gnu/libgmp.so.10.3.2: 217
./lib/systemd/libsystemd-shared-235.so: 210
./usr/lib/thunderbird/libxul.so: 148
./lib/aarch64-linux-gnu/libc-2.25.so: 148
./sbin/ldconfig: 143
./usr/lib/firefox-esr/libxul.so: 142
./usr/lib/aarch64-linux-gnu/libopus.so.0.6.1: 138
./usr/lib/aarch64-linux-gnu/libgcj.so.17.0.0: 134
./usr/lib/libcln.so.6.0.4: 110
./usr/lib/aarch64-linux-gnu/libspeex.so.1.5.0: 104
./sbin/brltty: 102
Jeff Law Dec. 11, 2017, 11:49 p.m. UTC | #28
On 12/05/2017 03:55 AM, James Greenhalgh wrote:
>>>
>>> Hm? What are you thinking of that GCC might have gotten wrong?
>>
>> Use 64 KiB probe intervals (almost) everywhere as an optimization.  I 
>> assumed the original RFC patch was motivated by that.
>>
>> I knew that ARM would be broken because that's what the gcc ARM 
>> maintainers want.  I assumed that it was restricted to that, but now I'm 
>> worried that it's not.
> 
> To be clear here, I'm coming in to the review of the probing support in GCC
> late, and with very little context on the design of the feature. I certainly
> wouldn't want to cause you worry - I've got no intention of pushing for
> optimization to a larger guard page size if it would leaves things broken
> for AArch64.
No worries.  Richard E. can give you the background on the AArch64 side
of things.  I'll try to channel Richard's request. If I over-simplify or
mis-state something, Richard's view should be taken as "the truth".

AArch64 is at a bit of a disadvantage from an architectural standpoint
relative to x86 and at a disadvantage from an ABI standpoint relative to
PPC.

On x86 the calling sequence itself is an implicit probe as it pushes the
return address onto the stack.  Thus except for a couple oddball cases
we assume at function entry that *sp has been probed and thus an
attacker can't have pushed the stack pointer into the guard page.

On PPC we maintain a backchain pointer at *sp, so the mechanism is
different, but the net result again is that an attacker can't have
pushed the stack pointer into the guard page.

On s390 the caller typically allocates space for the callee to perform
register saves.  So while the attacker could push the stack pointer into
the guard, callee register saves limit how far into the guard the stack
pointer can possibly be -- and the prologue code obviously knows where
those save areas are located.

The closest thing we have on aarch64 is that the caller must have saved
LR into its stack.  But we have no idea the difference between where the
caller saved LR and the value of the stack pointer in the callee.

Thus to fully protect AArch64 we would have to emit many more probes
than on other architectures because we have to make worst case
assumptions at function entry.  This cost was deemed too high.

The key is that the outgoing argument space sits between the caller's LR
slot and the callee's frame.  So it was decided that certain
requirements would be placed on the caller and that the callee would be
able to make certain assumptions WRT whether or not the caller would
write into the outgoing argument area.

After analysis (of spec2k6 I believe) and review of the kernel's
guarantees WRT the guard size it was decided that the size of the guard
on aarch64 would be 64k.  That corresponds to 2 pages for a RHEL kernel.
 It corresponds to 16 pages on a Fedora kernel.

The caller would be responsible for ensuring that it always would
write/probe within 1k of the limit of its stack.  Thus the callee would
be able to allocate up to 63k without probing.  This essentially brings
the cost of probing down into the noise on AArch64.

Once probing, we probe at 4k intervals.  That fits nicely into the 12bit
shifted immediates available on aarch64.  In theory a larger probing
interval would reduce the cost of probing, but you'd have to twiddle the
sequences in the target files to get a scratch register in places were
they don't right now.

We all agreed that there's a bit of a hole where unprotected code
calling protected code could leave the stack pointer somewhere in the
guard page on aarch64 and be a vector for attack.  However, it's more
likely that if those scenarios that the attacker has enough control on
the caller side that they'd just jump the guard in the caller.

So that's why things are the way they are.  Again, if I've gotten
something wrong, I apologize to you and Richard :-)

> 
> Likewise, I have no real desire for us to emit a bunch of extra operations
> if we're not required to for glibc.
Agreed.  I think we all want probing to be low enough overhead that we
just enable it by default everywhere to get protected and nobody notices.

> If assuming that 64k probes are sufficient on AArch64 is not going to allow
> us a correct implementation, then we can't assume 64k probes on AArch64. My
> understanding was that we were safe in this as the kernel was giving us a
> generous 1MB to play with, and we could modify glibc to also give us 64k
> (I admit, I had not considered ILP32, where you've rightly pointed out we
> will eat lots of address space if we make this decision).
Richard E. explicitly took ILP32 off the table during out discussion.  I
believe the conclusion was that if/when the time came that ARM would
address ILP32 independently.  I don't offhand know of anything in the
current implementation that would break on ILP32 -- except for concerns
about eating up address space with large guards around thread stacks.


> 
>>> GCC needs to emit probe intervals for the smallest supported page size 
>>> on the the target architecture.  If it does not do that, we end up in 
>>> trouble on the glibc side.
> 
> This is where I may have a misunderstanding, why would it require probing
> at the smallest page size, rather than probing at a multiple of the guard
> size? It is very likely I'm missing something here as I don't know the glibc
> side of this at all.
I'm not sure where that statement comes from either.  I guess the
concern is someone could boot a kernel with a smaller page size and
perhaps the kernel/glibc create their guards based on # pages rather
than absolute size.  Thus booting a kernel with a smaller pagesize would
code with less protection.

jeff

ps.  I still need to address your questions/comments on the aarch64 GCC
bits.  I'm keen to wrap that up for gcc-8 as it's the only target where
we've got support for custom stack clash protected prologues that hasn't
been merged yet.
Szabolcs Nagy Dec. 12, 2017, 11:42 a.m. UTC | #29
On 11/12/17 23:49, Jeff Law wrote:
> On 12/05/2017 03:55 AM, James Greenhalgh wrote:
>>>> GCC needs to emit probe intervals for the smallest supported page size 
>>>> on the the target architecture.  If it does not do that, we end up in 
>>>> trouble on the glibc side.
>>
>> This is where I may have a misunderstanding, why would it require probing
>> at the smallest page size, rather than probing at a multiple of the guard
>> size? It is very likely I'm missing something here as I don't know the glibc
>> side of this at all.
> I'm not sure where that statement comes from either.  I guess the
> concern is someone could boot a kernel with a smaller page size and
> perhaps the kernel/glibc create their guards based on # pages rather
> than absolute size.  Thus booting a kernel with a smaller pagesize would
> code with less protection.

historically posix required a single page guard at the
end of thread stacks, that's what all existing libcs
implement and glibc internally assumes this at several
places where stack/guard size accounting happens.
(so larger guard is not conform to posix-2004, only
to >=posix-2008)

users can also set the guard size explicitly when creating
threads (at least openjdk and erlang do this for threads
that call c code) and that's not something glibc can change:
it is allowed to round this up to pagesize but not to
some arbitrary larger value.

glibc has another problem that it does stack accounting
incorrectly and thus increasing the guard size can easily
make existing code fail with stack overflow and fixing
this is non-trivial since many users already expect the
buggy behaviour (with glibc specific workarounds)
(i would not expect many breakage in practice because of
this, but it does make the guard size change harder to
agree on on the glibc side and making the logic target
specific has another set of issues)

in short, the existing posix ecosystem assumes a single
page guard all over the place, doing the probing at
larger intervals than the minimum page size will leave
holes, many of which glibc cannot fix and the ones it
can will leave glibc maintenance issues behind.
Rich Felker Dec. 12, 2017, 4:36 p.m. UTC | #30
On Tue, Dec 12, 2017 at 11:42:51AM +0000, Szabolcs Nagy wrote:
> On 11/12/17 23:49, Jeff Law wrote:
> > On 12/05/2017 03:55 AM, James Greenhalgh wrote:
> >>>> GCC needs to emit probe intervals for the smallest supported page size 
> >>>> on the the target architecture.  If it does not do that, we end up in 
> >>>> trouble on the glibc side.
> >>
> >> This is where I may have a misunderstanding, why would it require probing
> >> at the smallest page size, rather than probing at a multiple of the guard
> >> size? It is very likely I'm missing something here as I don't know the glibc
> >> side of this at all.
> > I'm not sure where that statement comes from either.  I guess the
> > concern is someone could boot a kernel with a smaller page size and
> > perhaps the kernel/glibc create their guards based on # pages rather
> > than absolute size.  Thus booting a kernel with a smaller pagesize would
> > code with less protection.
> 
> historically posix required a single page guard at the
> end of thread stacks, that's what all existing libcs
> implement and glibc internally assumes this at several
> places where stack/guard size accounting happens.
> (so larger guard is not conform to posix-2004, only
> to >=posix-2008)

I don't follow your reasoning about how a conformance distinction can
be made here. There is no formal model for "does not have additional
guard pages"; that's just an implementation detail of the memory
layout. As a thought experiment, a hardened kernel might always put
giant randomly-perturbed guard zones before and after every mmap.

If the default size was actually specified to be one page in old POSIX
and pthread_attr_getstacksize exposed a larger size for the default, I
suppose this could be a conformance distinction, but I'm not aware of
such a requirement.

> users can also set the guard size explicitly when creating
> threads (at least openjdk and erlang do this for threads
> that call c code) and that's not something glibc can change:
> it is allowed to round this up to pagesize but not to
> some arbitrary larger value.

Likewise I think this is mistaken, for the above reason. If the
rounding-up happens at pthread_create time rather than when setting
the attribute, it's certainly not observable as a conformance
distinction. Of course it is a QoI distinction in both directions:

- Rounding up reduces the available virtual memory space, possibly
  limiting the size of an application (detriment to QoI).

- Rounding up may limit the impact of stack overflow bugs to a crash
  rather than something more severe (QoI advantage).

There are also other safety improvements that could be made at the
expense of virtual memory space and run time costs. For instance in
musl I've considered a hardening option to put guard pages (maybe
repeating the requested guard size? or just one page?) between the end
of the stack and the TLS area so that stack-based overflows can't
clobber TLS/TCB. But this is rather costly since it doubles the number
of kernel VMAs needed, so I'm very hesitant to do it without evidence
that it would thwart real-world vulns.

> glibc has another problem that it does stack accounting
> incorrectly and thus increasing the guard size can easily
> make existing code fail with stack overflow and fixing
> this is non-trivial since many users already expect the
> buggy behaviour (with glibc specific workarounds)

This indeed needs to be fixed, whatever is done with regards to
default guard size.

Rich
Szabolcs Nagy Dec. 12, 2017, 6:07 p.m. UTC | #31
On 12/12/17 16:36, Rich Felker wrote:
> On Tue, Dec 12, 2017 at 11:42:51AM +0000, Szabolcs Nagy wrote:
>> On 11/12/17 23:49, Jeff Law wrote:
>>> On 12/05/2017 03:55 AM, James Greenhalgh wrote:
>>>>>> GCC needs to emit probe intervals for the smallest supported page size 
>>>>>> on the the target architecture.  If it does not do that, we end up in 
>>>>>> trouble on the glibc side.
>>>>
>>>> This is where I may have a misunderstanding, why would it require probing
>>>> at the smallest page size, rather than probing at a multiple of the guard
>>>> size? It is very likely I'm missing something here as I don't know the glibc
>>>> side of this at all.
>>> I'm not sure where that statement comes from either.  I guess the
>>> concern is someone could boot a kernel with a smaller page size and
>>> perhaps the kernel/glibc create their guards based on # pages rather
>>> than absolute size.  Thus booting a kernel with a smaller pagesize would
>>> code with less protection.
>>
>> historically posix required a single page guard at the
>> end of thread stacks, that's what all existing libcs
>> implement and glibc internally assumes this at several
>> places where stack/guard size accounting happens.
>> (so larger guard is not conform to posix-2004, only
>> to >=posix-2008)
> 
> I don't follow your reasoning about how a conformance distinction can
> be made here. There is no formal model for "does not have additional
> guard pages"; that's just an implementation detail of the memory
> layout. As a thought experiment, a hardened kernel might always put
> giant randomly-perturbed guard zones before and after every mmap.
> 
> If the default size was actually specified to be one page in old POSIX
> and pthread_attr_getstacksize exposed a larger size for the default, I
> suppose this could be a conformance distinction, but I'm not aware of
> such a requirement.
> 

pthread_attr_getguardsize must return the user setting or
the default value (which was required to be 1 page), glibc
has tests for this.

of course it can lie about the guardsize (internally use
large default guard size but only report 1 page)

>> users can also set the guard size explicitly when creating
>> threads (at least openjdk and erlang do this for threads
>> that call c code) and that's not something glibc can change:
>> it is allowed to round this up to pagesize but not to
>> some arbitrary larger value.
> 
> Likewise I think this is mistaken, for the above reason. If the
> rounding-up happens at pthread_create time rather than when setting
> the attribute, it's certainly not observable as a conformance
> distinction. Of course it is a QoI distinction in both directions:
> 
> - Rounding up reduces the available virtual memory space, possibly
>   limiting the size of an application (detriment to QoI).
> 

well glibc has non-standard apis (pthread_getattr_np) that
makes the guardsize of a thread visible (again this can lie)

if rounding up the guardsize to some large value actually break
existing setups then it is problematic even if there is no
conformance distinction.

i think such breakage is very unlikely (running into a limit
because of 64k guard would mean the code was extremely fragile
and would not work on a system with 64k page size).

otoh if the pthread_attr_get/setguardsize apis do not get/set
the actual guard size then these apis don't have much point
(i.e. users who "know what they are doing" and need to set
the guard size can't use them) so i'm not sure if doing something
different internally that is visible via these apis is what
users would want.

> - Rounding up may limit the impact of stack overflow bugs to a crash
>   rather than something more severe (QoI advantage).
> 
> There are also other safety improvements that could be made at the
> expense of virtual memory space and run time costs. For instance in
> musl I've considered a hardening option to put guard pages (maybe
> repeating the requested guard size? or just one page?) between the end
> of the stack and the TLS area so that stack-based overflows can't
> clobber TLS/TCB. But this is rather costly since it doubles the number
> of kernel VMAs needed, so I'm very hesitant to do it without evidence
> that it would thwart real-world vulns.
> 
>> glibc has another problem that it does stack accounting
>> incorrectly and thus increasing the guard size can easily
>> make existing code fail with stack overflow and fixing
>> this is non-trivial since many users already expect the
>> buggy behaviour (with glibc specific workarounds)
> 
> This indeed needs to be fixed, whatever is done with regards to
> default guard size.
> 
> Rich
>
Florian Weimer Dec. 12, 2017, 7:30 p.m. UTC | #32
On 12/12/2017 12:49 AM, Jeff Law wrote:

> No worries.  Richard E. can give you the background on the AArch64 side
> of things.  I'll try to channel Richard's request. If I over-simplify or
> mis-state something, Richard's view should be taken as "the truth".

[snip description of architectures where probing is “cheap”]

> The closest thing we have on aarch64 is that the caller must have saved
> LR into its stack.  But we have no idea the difference between where the
> caller saved LR and the value of the stack pointer in the callee.
> 
> Thus to fully protect AArch64 we would have to emit many more probes
> than on other architectures because we have to make worst case
> assumptions at function entry.  This cost was deemed too high.
> 
> The key is that the outgoing argument space sits between the caller's LR
> slot and the callee's frame.  So it was decided that certain
> requirements would be placed on the caller and that the callee would be
> able to make certain assumptions WRT whether or not the caller would
> write into the outgoing argument area.

Thanks for posting this summary.

> After analysis (of spec2k6 I believe) and review of the kernel's
> guarantees WRT the guard size it was decided that the size of the guard
> on aarch64 would be 64k.  That corresponds to 2 pages for a RHEL kernel.

1 page on Red Hat Enterprise Linux, I think.  At least “getconf 
PAGE_SIZE” returns 65536, and I hope that's accurate. 8-)

(There are also some folks who assume that changing the kernel page size 
wouldn't be an ABI change at this point, something which I do not agree 
with.)

Or do you suggest to a 128 KiB guard area to make the current aarch64 
probing sequence safe in the presence of asynchronous signals?

> It corresponds to 16 pages on a Fedora kernel.

Right.

And we can't switch to 64 KiB pages there because apparently, video 
memory sharing on the Raspberry Pi 3 would then need 512 MiB of RAM (not 
just address space), which is not an option for such a small device.

> The caller would be responsible for ensuring that it always would
> write/probe within 1k of the limit of its stack.  Thus the callee would
> be able to allocate up to 63k without probing.  This essentially brings
> the cost of probing down into the noise on AArch64.

(I assume this has been re-assessed after the change not to probe below 
the stack pointer (which might have added some extra cost).)

I'm a bit surprised that the 1K/3K split wouldn't achieve that (i.e., 
pushing the cost of probing into he noise).  Is this because GCC wasn't 
built for this and has no way to recognize implicit probes which occur 
throughout the regular execution of a function?  Or is the concern that 
we might skip the guard region if a single arrives at an inopportune 
moment?  (But gcc-4.8.5-25.el7.aarch64 is still not async-signal-safe 
because it decreases SP by 67 KiB before starting probing.)

> Once probing, we probe at 4k intervals.  That fits nicely into the 12bit
> shifted immediates available on aarch64.  In theory a larger probing
> interval would reduce the cost of probing, but you'd have to twiddle the
> sequences in the target files to get a scratch register in places were
> they don't right now.
> 
> We all agreed that there's a bit of a hole where unprotected code
> calling protected code could leave the stack pointer somewhere in the
> guard page on aarch64 and be a vector for attack.  However, it's more
> likely that if those scenarios that the attacker has enough control on
> the caller side that they'd just jump the guard in the caller.

Such assumptions still make sense to me.

> So that's why things are the way they are.  Again, if I've gotten
> something wrong, I apologize to you and Richard :-)
> 
>>
>> Likewise, I have no real desire for us to emit a bunch of extra operations
>> if we're not required to for glibc.
> Agreed.  I think we all want probing to be low enough overhead that we
> just enable it by default everywhere to get protected and nobody notices.
> 
>> If assuming that 64k probes are sufficient on AArch64 is not going to allow
>> us a correct implementation, then we can't assume 64k probes on AArch64. My
>> understanding was that we were safe in this as the kernel was giving us a
>> generous 1MB to play with, and we could modify glibc to also give us 64k
>> (I admit, I had not considered ILP32, where you've rightly pointed out we
>> will eat lots of address space if we make this decision).

> Richard E. explicitly took ILP32 off the table during out discussion.

The patch which started this libc-alpha thread applied the 64 KiB gap 
size to 32-bit architectures as well.  This is the primary reason why 
I'm objecting strongly to it.

If aarch64 wants to do their own thing in 64-bit mode, I'm won't 
complain that much.

>>>> GCC needs to emit probe intervals for the smallest supported page size
>>>> on the the target architecture.  If it does not do that, we end up in
>>>> trouble on the glibc side.
>>
>> This is where I may have a misunderstanding, why would it require probing
>> at the smallest page size, rather than probing at a multiple of the guard
>> size? It is very likely I'm missing something here as I don't know the glibc
>> side of this at all.

> I'm not sure where that statement comes from either.  I guess the
> concern is someone could boot a kernel with a smaller page size and
> perhaps the kernel/glibc create their guards based on # pages rather
> than absolute size.  Thus booting a kernel with a smaller pagesize would
> code with less protection.

The existing ecosystem offers only one page size.  The larger main stack 
guard region size provided by the kernel was a stop-gap measure, 
initially with the intent to patch nothing else, but it did not work out 
that way.  I don't think 64 KiB would make a significant dent in terms 
of practical exploitability (all published exploits used larger frames 
or alloca-based frames anyway).

If GCC assumes more than one guard page, a lot of things need to change, 
and it is difficult to communicate under what conditions a binary has 
been properly hardened.  If this is the cost for enabling 
-fstack-clash-protection by default on aarch64, then so be it, but we 
should not make these changes on architectures where they do not bring 
any tangible benefit or actually hurt (due to address space consumption).

Thanks,
Florian
Szabolcs Nagy Dec. 13, 2017, 11:57 a.m. UTC | #33
On 12/12/17 19:30, Florian Weimer wrote:
> I'm a bit surprised that the 1K/3K split wouldn't achieve that (i.e., pushing the cost of probing into he
> noise).  Is this because GCC wasn't built for this and has no way to recognize implicit probes which occur

we don't know the exact overhead, the current
patch does not emit optimal probing sequences,
but >3K stack use is common.

> The patch which started this libc-alpha thread applied the 64 KiB gap size to 32-bit architectures as well. 
> This is the primary reason why I'm objecting strongly to it.

that glibc patch added an arch specific knob.

otoh i don't think we have different code on
the gcc side for ilp32 so it will use the
same probing interval in gcc-8.

> If aarch64 wants to do their own thing in 64-bit mode, I'm won't complain that much.

i can prepare a patch, but it will need to
change generic code in glibc.
(because a lot of stack accounting code
is broken or hard codes 1page guard)

> The existing ecosystem offers only one page size.  The larger main stack guard region size provided by the
> kernel was a stop-gap measure, initially with the intent to patch nothing else, but it did not work out that
> way.  I don't think 64 KiB would make a significant dent in terms of practical exploitability (all published
> exploits used larger frames or alloca-based frames anyway).
> 
> If GCC assumes more than one guard page, a lot of things need to change, and it is difficult to communicate
> under what conditions a binary has been properly hardened.  If this is the cost for enabling
> -fstack-clash-protection by default on aarch64, then so be it, but we should not make these changes on
> architectures where they do not bring any tangible benefit or actually hurt (due to address space consumption).

glibc can force >=64k guard, no matter what the
user setting is, but as i wrote in another mail
i don't think that's a good idea.
James Greenhalgh Dec. 19, 2017, 12:34 p.m. UTC | #34
On Tue, Dec 05, 2017 at 10:55:31AM +0000, James Greenhalgh wrote:
> Thanks for your advice so far. To reiterate, I'm not pushing any particular
> optimization agenda in GCC, but I would like to understand the trade-off
> we're discussing.

Thank you for all the responses in this thread. I've certainly got a much
clearer view of the competing security, performance and correctness concerns
having read through the comments here.

One thing is very clear, we all want to ensure that the stack clash
protection is applied in as many cases as possible to ensure integrity of
the whole system security.

I see our trade-offs on LP64 AArch64 like this:

Option 1: 64k guard pages for LP64 on AArch64.

  * Better for performance
  * Almost all functions (99.981%) can handle their initial stack setup
      with a single probe
    ** 0.019% require multiple probes.
  * Performance impact unlikely to discourage use - GCC can enable by default.
  * Requires either 64k pages on the system, or for glibc to be rewritten
      to accommodate guard pages which are a multiple of the physical page size.
    ** Note: For small (Raspberry Pi sized) systems, 64k pages are considered
          inappropriate by distributions like fedora.
  * Likely to require similar rewrites for other C libraries, which also
      assume physical page-size guard pages.
  * Cannot give 100% guarantees about safety when considering
      backwards-compatibility and custom user-allocated stacks.
  * Divergent from ILP32 where we will likely want 4k.

Option 2: 4k guard pages for LP64 for AArch64

  * Worse for performance
  * 99.647% of functions can handle initial stack setup with a single probe
    ** 0.353% would require multiple probes
  * Performance impact may discourage users from enabling by default
  * No requirement on system (guard page size == minimal physical page size)
  * No requirement to rewrite C functions.
  * No divergence for ILP32

The fundamental disagreement is on what leads to a higher chance of achieving
complete protection - if the performance overhead is high, will our users
choose to disable the protection, if we keep the performance overhead low
do we lose protections for corner-cases like user-allocated stacks and older
C libraries.

And honestly, I don't see an easy way to draw a line here. Both sides have
clear merit and cover a slightly different use case. This should be a
distribution-level choice not a GCC policy.

Our proposal then, having spoken things through with the Arm engineers
here, and taken in to consideration the opinions on this thread, is that
we move to two "blessed" configurations of the GCC support for AArch64.

One would assume 64k guard pages. This would work out-of-the-box on systems
with a 64k page size, and would work with modifications to glibc (and other
libraries as appropriate) on systems with a 4k (or other) page size. The
performance impact will be low. If we take this approach, this will be the
default configuration for GCC.

The other would assume 4k guard pages. This would work everywhere, and
as-good-as guarantee complete coverage. However, it would be inefficient
on systems with a larger page size, or with a glibc upgraded to use
64k guard-pages.

My opinion is that leaving this configurable, with a sensible and low
overhead default is the best way to make forward progress in the diverse
AArch64 ecosystem. A distribution can then always choose to configure GCC to
assume 4k should they wish to. Naturally, this split model has advantages
for our ILP32 support, which could now reasonably default to the 4k mode.

That will still require some work on the glibc side to make the 64k guard
page on a 4k page system work, and will require extra work on the GCC side
(not required for Jeff's patches to be merged, but highly desirable for the
GCC 8 release) to set up the compiler configuration defaults.

I owe Jeff more review on the GCC side, but I hope this addresses one of the
key design differences and allows us to move forwards.

Please do let me know what I've missed and step in or if I'm still pushing
for a Very Bad Thing :-).

Thanks,
James
Florian Weimer Dec. 19, 2017, 1:06 p.m. UTC | #35
On 12/19/2017 01:34 PM, James Greenhalgh wrote:

> Option 1: 64k guard pages for LP64 on AArch64.

> Option 2: 4k guard pages for LP64 for AArch64

> Our proposal then, having spoken things through with the Arm engineers
> here, and taken in to consideration the opinions on this thread, is that
> we move to two "blessed" configurations of the GCC support for AArch64.

Are there any Arm engineers who prefer Option 2, or is that just there 
to accommodate feedback on libc-alpha?

My main concern was the variance in configurations with Option 1 
(compared to Option 2).  To some extent, the variance with Option 1 is 
temporary.  If both Option 1 and 2 are offered, we have permanent 
variance.  From my point of view, that's worth that just going with 
Option 1.

So if this is some sort of consensus proposal, as opposed to actual 
technical requirements which favor Option 2 in some deployments, I think 
that's not a good idea, and we should go with Option 1 instead.

Thanks,
Florian
Szabolcs Nagy Dec. 19, 2017, 6:21 p.m. UTC | #36
On 19/12/17 13:06, Florian Weimer wrote:
> On 12/19/2017 01:34 PM, James Greenhalgh wrote:
> 
>> Option 1: 64k guard pages for LP64 on AArch64.
> 
>> Option 2: 4k guard pages for LP64 for AArch64
> 
>> Our proposal then, having spoken things through with the Arm engineers
>> here, and taken in to consideration the opinions on this thread, is that
>> we move to two "blessed" configurations of the GCC support for AArch64.
> 
> Are there any Arm engineers who prefer Option 2, or is that just there to accommodate feedback on libc-alpha?
> 
> My main concern was the variance in configurations with Option 1 (compared to Option 2).  To some extent, the
> variance with Option 1 is temporary.  If both Option 1 and 2 are offered, we have permanent variance.  From my
> point of view, that's worth that just going with Option 1.
> 
> So if this is some sort of consensus proposal, as opposed to actual technical requirements which favor Option 2
> in some deployments, I think that's not a good idea, and we should go with Option 1 instead.
> 

well glibc can pretend that only Option 1 is available,
my latest patch assumes 64k probe interval:
https://sourceware.org/ml/libc-alpha/2017-12/msg00451.html

however Option 1 requires generic code to be changed
for aarch64 only (in the libc and elsewhere) and we
cannot easily do that on all (non-glibc) systems.

it seems to me if there are systems where Option 1
may not provide guaranteed trap on stack overflow
then gcc should have Option 2 for those systems.
Rich Felker Dec. 19, 2017, 8:34 p.m. UTC | #37
On Tue, Dec 19, 2017 at 06:21:32PM +0000, Szabolcs Nagy wrote:
> On 19/12/17 13:06, Florian Weimer wrote:
> > On 12/19/2017 01:34 PM, James Greenhalgh wrote:
> > 
> >> Option 1: 64k guard pages for LP64 on AArch64.
> > 
> >> Option 2: 4k guard pages for LP64 for AArch64
> > 
> >> Our proposal then, having spoken things through with the Arm engineers
> >> here, and taken in to consideration the opinions on this thread, is that
> >> we move to two "blessed" configurations of the GCC support for AArch64.
> > 
> > Are there any Arm engineers who prefer Option 2, or is that just there to accommodate feedback on libc-alpha?
> > 
> > My main concern was the variance in configurations with Option 1 (compared to Option 2).  To some extent, the
> > variance with Option 1 is temporary.  If both Option 1 and 2 are offered, we have permanent variance.  From my
> > point of view, that's worth that just going with Option 1.
> > 
> > So if this is some sort of consensus proposal, as opposed to actual technical requirements which favor Option 2
> > in some deployments, I think that's not a good idea, and we should go with Option 1 instead.
> > 
> 
> well glibc can pretend that only Option 1 is available,
> my latest patch assumes 64k probe interval:
> https://sourceware.org/ml/libc-alpha/2017-12/msg00451.html
> 
> however Option 1 requires generic code to be changed
> for aarch64 only (in the libc and elsewhere) and we
> cannot easily do that on all (non-glibc) systems.
> 
> it seems to me if there are systems where Option 1
> may not provide guaranteed trap on stack overflow
> then gcc should have Option 2 for those systems.

For what it's worth, I would prefer having the assumed minimum guard
size be 4k for musl targets. Even if we do increase the default guard
to 64k for 64-bit archs (seems likely), applications that manually set
it lower for whatever reason should still be handled safely.

I'm utterly unconvinced that there's any practical measurable
performance difference either way, unless someone can demonstrate an
existing real-world program (not artificial benchmark) where the
difference is measurable.

Rich
Jeff Law Dec. 20, 2017, 4:41 a.m. UTC | #38
On 12/19/2017 01:34 PM, Rich Felker wrote:
> On Tue, Dec 19, 2017 at 06:21:32PM +0000, Szabolcs Nagy wrote:
>> On 19/12/17 13:06, Florian Weimer wrote:
>>> On 12/19/2017 01:34 PM, James Greenhalgh wrote:
>>>
>>>> Option 1: 64k guard pages for LP64 on AArch64.
>>>
>>>> Option 2: 4k guard pages for LP64 for AArch64
>>>
>>>> Our proposal then, having spoken things through with the Arm engineers
>>>> here, and taken in to consideration the opinions on this thread, is that
>>>> we move to two "blessed" configurations of the GCC support for AArch64.
>>>
>>> Are there any Arm engineers who prefer Option 2, or is that just there to accommodate feedback on libc-alpha?
>>>
>>> My main concern was the variance in configurations with Option 1 (compared to Option 2).  To some extent, the
>>> variance with Option 1 is temporary.  If both Option 1 and 2 are offered, we have permanent variance.  From my
>>> point of view, that's worth that just going with Option 1.
>>>
>>> So if this is some sort of consensus proposal, as opposed to actual technical requirements which favor Option 2
>>> in some deployments, I think that's not a good idea, and we should go with Option 1 instead.
>>>
>>
>> well glibc can pretend that only Option 1 is available,
>> my latest patch assumes 64k probe interval:
>> https://sourceware.org/ml/libc-alpha/2017-12/msg00451.html
>>
>> however Option 1 requires generic code to be changed
>> for aarch64 only (in the libc and elsewhere) and we
>> cannot easily do that on all (non-glibc) systems.
>>
>> it seems to me if there are systems where Option 1
>> may not provide guaranteed trap on stack overflow
>> then gcc should have Option 2 for those systems.
> 
> For what it's worth, I would prefer having the assumed minimum guard
> size be 4k for musl targets. Even if we do increase the default guard
> to 64k for 64-bit archs (seems likely), applications that manually set
> it lower for whatever reason should still be handled safely.
> 
> I'm utterly unconvinced that there's any practical measurable
> performance difference either way, unless someone can demonstrate an
> existing real-world program (not artificial benchmark) where the
> difference is measurable.
I've believed all along that stack clash probing as implemented by GCC
is pretty damn cheap -- cheap enough that we ought to just turn it on by
default and move on to other more useful work items.  And I hold that
position regardless of whether or not the guard is 4k or 64k.

64k is marginally better simply because there's less probing,
particularly in a functions that allocate something like char array of
MAXPATHLEN entries.  Based on what I've looked at on a distro-wide
basis, MAXPATHLEN arrays on the stack are the single biggest reasons we
have to probe in prologues.

Jeff
Jeff Law Dec. 20, 2017, 4:43 a.m. UTC | #39
On 12/19/2017 06:06 AM, Florian Weimer wrote:
> On 12/19/2017 01:34 PM, James Greenhalgh wrote:
> 
>> Option 1: 64k guard pages for LP64 on AArch64.
> 
>> Option 2: 4k guard pages for LP64 for AArch64
> 
>> Our proposal then, having spoken things through with the Arm engineers
>> here, and taken in to consideration the opinions on this thread, is that
>> we move to two "blessed" configurations of the GCC support for AArch64.
> 
> Are there any Arm engineers who prefer Option 2, or is that just there
> to accommodate feedback on libc-alpha?
> 
> My main concern was the variance in configurations with Option 1
> (compared to Option 2).  To some extent, the variance with Option 1 is
> temporary.  If both Option 1 and 2 are offered, we have permanent
> variance.  From my point of view, that's worth that just going with
> Option 1.
> 
> So if this is some sort of consensus proposal, as opposed to actual
> technical requirements which favor Option 2 in some deployments, I think
> that's not a good idea, and we should go with Option 1 instead.
Ideally we set the guard size now and never ever decrease it.   I'd
really like to remove the option to make it a compile-time configurable
--param for GCC.  ie, it's baked into the port and never changes.

jeff
Rich Felker Dec. 20, 2017, 4:47 a.m. UTC | #40
On Tue, Dec 19, 2017 at 09:41:03PM -0700, Jeff Law wrote:
> On 12/19/2017 01:34 PM, Rich Felker wrote:
> > On Tue, Dec 19, 2017 at 06:21:32PM +0000, Szabolcs Nagy wrote:
> >> On 19/12/17 13:06, Florian Weimer wrote:
> >>> On 12/19/2017 01:34 PM, James Greenhalgh wrote:
> >>>
> >>>> Option 1: 64k guard pages for LP64 on AArch64.
> >>>
> >>>> Option 2: 4k guard pages for LP64 for AArch64
> >>>
> >>>> Our proposal then, having spoken things through with the Arm engineers
> >>>> here, and taken in to consideration the opinions on this thread, is that
> >>>> we move to two "blessed" configurations of the GCC support for AArch64.
> >>>
> >>> Are there any Arm engineers who prefer Option 2, or is that just there to accommodate feedback on libc-alpha?
> >>>
> >>> My main concern was the variance in configurations with Option 1 (compared to Option 2).  To some extent, the
> >>> variance with Option 1 is temporary.  If both Option 1 and 2 are offered, we have permanent variance.  From my
> >>> point of view, that's worth that just going with Option 1.
> >>>
> >>> So if this is some sort of consensus proposal, as opposed to actual technical requirements which favor Option 2
> >>> in some deployments, I think that's not a good idea, and we should go with Option 1 instead.
> >>>
> >>
> >> well glibc can pretend that only Option 1 is available,
> >> my latest patch assumes 64k probe interval:
> >> https://sourceware.org/ml/libc-alpha/2017-12/msg00451.html
> >>
> >> however Option 1 requires generic code to be changed
> >> for aarch64 only (in the libc and elsewhere) and we
> >> cannot easily do that on all (non-glibc) systems.
> >>
> >> it seems to me if there are systems where Option 1
> >> may not provide guaranteed trap on stack overflow
> >> then gcc should have Option 2 for those systems.
> > 
> > For what it's worth, I would prefer having the assumed minimum guard
> > size be 4k for musl targets. Even if we do increase the default guard
> > to 64k for 64-bit archs (seems likely), applications that manually set
> > it lower for whatever reason should still be handled safely.
> > 
> > I'm utterly unconvinced that there's any practical measurable
> > performance difference either way, unless someone can demonstrate an
> > existing real-world program (not artificial benchmark) where the
> > difference is measurable.
> I've believed all along that stack clash probing as implemented by GCC
> is pretty damn cheap -- cheap enough that we ought to just turn it on by
> default and move on to other more useful work items.  And I hold that
> position regardless of whether or not the guard is 4k or 64k.
> 
> 64k is marginally better simply because there's less probing,
> particularly in a functions that allocate something like char array of
> MAXPATHLEN entries.  Based on what I've looked at on a distro-wide
> basis, MAXPATHLEN arrays on the stack are the single biggest reasons we
> have to probe in prologues.

They're also almost certainly indicative of the function making
syscalls, in which case <10 instructions for a probe are completely
dominated by the syscall time (hundreds if not thousands of times
larger).

Rich
Wilco Dijkstra Dec. 27, 2017, 1:08 p.m. UTC | #41
Rich Felker wrote:
>On Tue, Dec 19, 2017 at 09:41:03PM -0700, Jeff Law wrote:

>> I've believed all along that stack clash probing as implemented by GCC
>> is pretty damn cheap -- cheap enough that we ought to just turn it on by
>> default and move on to other more useful work items.  And I hold that
>> position regardless of whether or not the guard is 4k or 64k.
>> 
>> 64k is marginally better simply because there's less probing,
>> particularly in a functions that allocate something like char array of
>> MAXPATHLEN entries.  Based on what I've looked at on a distro-wide
>> basis, MAXPATHLEN arrays on the stack are the single biggest reasons we
>> have to probe in prologues.
>
> They're also almost certainly indicative of the function making
> syscalls, in which case <10 instructions for a probe are completely
> dominated by the syscall time (hundreds if not thousands of times
> larger).

Note the stats Szabolcs collected showed here were several packages
where the percentage of functions requiring probes at a 4KB probe size
was over 20%.

The probes themselves also cause new pages to be allocated which
might not be used. This is the drawback of allocating all stack space for
the whole function in the prolog and then doing probes. There is also the
issue that shrinkwrapping still doesn't work that well - if there is a single
use of a callee-save register, you have to create a stack even on a fast
exit path that doesn't use the stack.

An option to reduce stack usage would be to automatically use alloca for
large allocations that may not always be used.

Wilco
diff mbox series

Patch

From 84da61fd1e0d80d8bfc4c1e867fb5df3b5148caa Mon Sep 17 00:00:00 2001
From: Szabolcs Nagy <szabolcs.nagy@arm.com>
Date: Tue, 28 Nov 2017 11:49:24 +0000
Subject: [PATCH] nptl: change default stack guard size of threads

There are several compiler implementations that allow large stack
allocations to jump over the guard page at the end of the stack and
corrupt memory beyond that. See CVE-2017-1000364.

The linux kernel increased the default guard size of the main thread
to 1M to mitigate the issue, which fixes some of the known local
privilege escalation exploits using setuid binaries, but does not
affect memory corruption in case of thread stack overflow in
multi-threaded processes.

Compilers can emit code to probe the stack such that a guard page
cannot be skipped, but this is not yet available on many targets and
may introduce larger performance overhead than necessary when the
compiler has to assume the minimum supported page size of the target.

On aarch64 the plan is to assume 64K guard size in the probing code
since this makes the overhead sufficiently small that it can be
turned on by default, but 4K page size is also supported, so the
mitigation only works if the libc guard size is increased.

This patch increases the default guard size to 64K which should
prevent jumping over the guard page in most existing binaries, in
particular larger stack allocation should not really exist in glibc
itself because this is the __MAX_ALLOCA_CUTOFF limit.

Note that POSIX 2008 allows implementation defined default guardsize,
but previous versions required it to be one page.

This patch can be controversial because it is not conform to old
standards, changes the behaviour of pthread_attr_init with respect
to the non-standard and underspecified pthread_setattr_default_np,
it increases the address space usage of existing code and because
security of code using smaller guardsize is not addressed. The
address space usage can be a concern on 32bit targets. The change
can be made for aarch64 only but I think it may be useful on other
targets too.

	nptl/descr.h (ARCH_GUARD_DEFAULT_SIZE): Define.
	nptl/nptl-init.c (__pthread_initialize_minimal_internal): Change
	__default_pthread_attr.guardsize.
	nptl/pthread_create.c (__pthread_create_2_0): Use
	__default_pthread_attr.guardsize.
	nptl/pthread_attr_init.c (__pthread_attr_init_2_1): Likewise.
	nptl/tst-attr2.c (do_test): Don't check default guardsize.
---
 nptl/descr.h             | 5 +++++
 nptl/nptl-init.c         | 3 ++-
 nptl/pthread_attr_init.c | 5 +++--
 nptl/pthread_create.c    | 6 ++++--
 nptl/tst-attr2.c         | 6 ------
 5 files changed, 14 insertions(+), 11 deletions(-)

diff --git a/nptl/descr.h b/nptl/descr.h
index c83b17b674..abb86e917a 100644
--- a/nptl/descr.h
+++ b/nptl/descr.h
@@ -39,6 +39,11 @@ 
 # define TCB_ALIGNMENT	sizeof (double)
 #endif
 
+/* Default guard size will get rounded up to page size.
+   Targets can override the setting in pthreaddef.h.  */
+#ifndef ARCH_GUARD_DEFAULT_SIZE
+# define ARCH_GUARD_DEFAULT_SIZE 65536
+#endif
 
 /* We keep thread specific data in a special data structure, a two-level
    array.  The top-level array contains pointers to dynamically allocated
diff --git a/nptl/nptl-init.c b/nptl/nptl-init.c
index 869e926f17..447913e9ec 100644
--- a/nptl/nptl-init.c
+++ b/nptl/nptl-init.c
@@ -436,7 +436,8 @@  __pthread_initialize_minimal_internal (void)
   limit.rlim_cur = ALIGN_UP (limit.rlim_cur, pagesz);
   lll_lock (__default_pthread_attr_lock, LLL_PRIVATE);
   __default_pthread_attr.stacksize = limit.rlim_cur;
-  __default_pthread_attr.guardsize = GLRO (dl_pagesize);
+  __default_pthread_attr.guardsize = pagesz > ARCH_GUARD_DEFAULT_SIZE
+				     ? pagesz : ARCH_GUARD_DEFAULT_SIZE;
   lll_unlock (__default_pthread_attr_lock, LLL_PRIVATE);
 
 #ifdef SHARED
diff --git a/nptl/pthread_attr_init.c b/nptl/pthread_attr_init.c
index eceaf85dbf..ec063a01ae 100644
--- a/nptl/pthread_attr_init.c
+++ b/nptl/pthread_attr_init.c
@@ -43,8 +43,9 @@  __pthread_attr_init_2_1 (pthread_attr_t *attr)
 
   iattr = (struct pthread_attr *) attr;
 
-  /* Default guard size specified by the standard.  */
-  iattr->guardsize = __getpagesize ();
+  lll_lock (__default_pthread_attr_lock, LLL_PRIVATE);
+  iattr->guardsize = __default_pthread_attr.guardsize;
+  lll_unlock (__default_pthread_attr_lock, LLL_PRIVATE);
 
   return 0;
 }
diff --git a/nptl/pthread_create.c b/nptl/pthread_create.c
index 51ae60dfca..24fdad6271 100644
--- a/nptl/pthread_create.c
+++ b/nptl/pthread_create.c
@@ -869,7 +869,9 @@  __pthread_create_2_0 (pthread_t *newthread, const pthread_attr_t *attr,
   if (attr != NULL)
     {
       struct pthread_attr *iattr = (struct pthread_attr *) attr;
-      size_t ps = __getpagesize ();
+      lll_lock (__default_pthread_attr_lock, LLL_PRIVATE);
+      size_t guardsize = __default_pthread_attr.guardsize;
+      lll_unlock (__default_pthread_attr_lock, LLL_PRIVATE);
 
       /* Copy values from the user-provided attributes.  */
       new_attr.schedparam = iattr->schedparam;
@@ -878,7 +880,7 @@  __pthread_create_2_0 (pthread_t *newthread, const pthread_attr_t *attr,
 
       /* Fill in default values for the fields not present in the old
 	 implementation.  */
-      new_attr.guardsize = ps;
+      new_attr.guardsize = guardsize;
       new_attr.stackaddr = NULL;
       new_attr.stacksize = 0;
       new_attr.cpuset = NULL;
diff --git a/nptl/tst-attr2.c b/nptl/tst-attr2.c
index 9967b69773..a965c3ac1c 100644
--- a/nptl/tst-attr2.c
+++ b/nptl/tst-attr2.c
@@ -90,12 +90,6 @@  default detach state wrong: %d, expected %d (PTHREAD_CREATE_JOINABLE)\n",
       puts ("1st attr_getguardsize failed");
       exit (1);
     }
-  if (g != (size_t) sysconf (_SC_PAGESIZE))
-    {
-      printf ("default guardsize %zu, expected %ld (PAGESIZE)\n",
-	      g, sysconf (_SC_PAGESIZE));
-      exit (1);
-    }
 
   e = pthread_attr_setguardsize (&a, 0);
   if (e != 0)
-- 
2.11.0