Message ID: 002301cff37e$61c12330$25436990$@com
State: New
> On Oct 29, 2014, at 6:43 AM, Wilco Dijkstra <wdijkstr@arm.com> wrote:
>
> This patch aligns allocations of large blocks to a cacheline on ARM and
> AArch64. The main goal is to reduce performance variations due to random
> alignment choices, however it improves performance on several benchmarks
> as well. SPECFP2000 improves by ~1.5%.
>
> Any comments?
>
> [...]
>
> +#define MALLOC_LARGE_BLOCK_ALIGN (64)

Yes that is not the only cache line size on aarch64. Why don't you make it
a global variable and read it from the system register like we do for
memset? ThunderX has a cache line size of 128.

Thanks,
Andrew

> +#define MALLOC_LARGE_BLOCK_SIZE (16384)
>
> [...]
On 29 October 2014 13:43, Wilco Dijkstra <wdijkstr@arm.com> wrote:
> This patch aligns allocations of large blocks to a cacheline on ARM and
> AArch64. The main goal is to reduce performance variations due to random
> alignment choices, however it improves performance on several benchmarks
> as well. SPECFP2000 improves by ~1.5%.
>
> Any comments?

This bug may be relevant to changing malloc alignments:

https://sourceware.org/bugzilla/show_bug.cgi?id=6527

I don't know precisely what the issue is with malloc_get_state so I am
not sure if it applies here.

> ChangeLog:
> 2014-10-29  Wilco Dijkstra  <wdijkstr@arm.com>
>
>     * malloc/malloc.c (__libc_malloc): Add support for aligning large blocks.

I think this should be:

    malloc/malloc.c (__libc_malloc) [MALLOC_LARGE_BLOCK_ALIGN]: Add
    support for aligning large blocks.

Although I am not a GNU ChangeLog expert...

>     * sysdeps/unix/sysv/linux/aarch64/malloc-sysdep.h: New file.
>     New defines (MALLOC_LARGE_BLOCK_ALIGN), (MALLOC_LARGE_BLOCK_SIZE).

"New file." should be sufficient.

>     * sysdeps/unix/sysv/linux/arm/malloc-sysdep.h: Likewise.
>
> [...]
>
> +#ifdef MALLOC_LARGE_BLOCK_ALIGN
> +  if (bytes > MALLOC_LARGE_BLOCK_SIZE)
> +    {
> +      void *address = RETURN_ADDRESS (0);
> +      return _mid_memalign (MALLOC_LARGE_BLOCK_ALIGN, bytes, address);
> +    }
> +#endif
> +

This will change the behaviour of malloc hooks, i.e. the malloc hook
will not get triggered but memalign will.

> [...]
>
> +/* System-specific malloc support functions.  AArch64 version.
> +   Copyright (C) 2012-2014 Free Software Foundation, Inc.

These copyright lines are wrong, but the whole copy of the LGPL could
probably be dropped with the file being so simple.

> [...]
On Wed, Oct 29, 2014 at 2:05 PM, Will Newton <will.newton@linaro.org> wrote:
> On 29 October 2014 13:43, Wilco Dijkstra <wdijkstr@arm.com> wrote:
>> This patch aligns allocations of large blocks to a cacheline on ARM and
>> AArch64. [...]
>
> [...]
>
>> +#ifdef MALLOC_LARGE_BLOCK_ALIGN
>> +  if (bytes > MALLOC_LARGE_BLOCK_SIZE)
>> +    {
>> +      void *address = RETURN_ADDRESS (0);
>> +      return _mid_memalign (MALLOC_LARGE_BLOCK_ALIGN, bytes, address);
>> +    }
>> +#endif
>> +
>
> This will change the behaviour of malloc hooks, i.e. the malloc hook
> will not get triggered but memalign will.

Yeah that was my first reaction too - but read _mid_memalign - that
appears to call the malloc hooks too?

Ramana
> Andrew Pinski wrote:
>> +#define MALLOC_LARGE_BLOCK_ALIGN (64)
>
> Yes that is not the only cache line size on aarch64. Why don't you make
> it a global variable and read it from the system register like we do for
> memset? ThunderX has a cache line size of 128.

We could easily set it to 128 for AArch64 - the amount of wasted memory is
minimal for large block sizes so overaligning doesn't hurt much.

However you raise an interesting point - several functions may want to know
the cacheline size and this would be useful across all targets. So we could
add generic infrastructure that sets the cacheline size at GLIBC startup
rather than having to add complex target specific code to compute it and
cache it on demand in every function that might query the line size.

Wilco
On 29 October 2014 14:16, Ramana Radhakrishnan
<ramana.gcc@googlemail.com> wrote:
> On Wed, Oct 29, 2014 at 2:05 PM, Will Newton <will.newton@linaro.org> wrote:
>> [...]
>>
>> This will change the behaviour of malloc hooks, i.e. the malloc hook
>> will not get triggered but memalign will.
>
> Yeah that was my first reaction too - but read _mid_memalign - that
> appears to call the malloc hooks too?

But _mid_memalign calls the memalign hook rather than the malloc hook,
which is rather surprising given we called malloc. It should simply be
a case of moving the #ifdef block to below the call of the malloc hook
in __libc_malloc to retain the existing behaviour.
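[For illustration, the reordering Will suggests might look roughly like
this inside __libc_malloc, reusing only identifiers from the quoted hunk;
a sketch, not the committed change:]

  void *(*hook) (size_t, const void *)
    = atomic_forced_read (__malloc_hook);
  if (__builtin_expect (hook != NULL, 0))
    return (*hook) (bytes, RETURN_ADDRESS (0));

  /* The malloc hook has had first refusal; only now divert large
     requests to the aligned path.  */
#ifdef MALLOC_LARGE_BLOCK_ALIGN
  if (bytes > MALLOC_LARGE_BLOCK_SIZE)
    {
      void *address = RETURN_ADDRESS (0);
      return _mid_memalign (MALLOC_LARGE_BLOCK_ALIGN, bytes, address);
    }
#endif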
> Will Newton wrote:
>>> This will change the behaviour of malloc hooks, i.e. the malloc hook
>>> will not get triggered but memalign will.
>>
>> Yeah that was my first reaction too - but read _mid_memalign - that
>> appears to call the malloc hooks too?
>
> But _mid_memalign calls the memalign hook rather than the malloc hook,
> which is rather surprising given we called malloc. It should simply be
> a case of moving the #ifdef block to below the call of the malloc hook
> in __libc_malloc to retain the existing behaviour.

Right, it would be trivial to move the code so that the malloc hook is
tried first. However if it isn't set then it will try the memalign hook
as well. I'm unclear what the idea of the hooks is: to hook the original
call only, to hook whatever low-level function you eventually fall into,
or any combination... There is also an overhead in trying several hooks
rather than just one.

One possibility would be to split _mid_memalign, skip all the unnecessary
tests on alignment (which we know is correct) and go directly to the
arena_get part, which is simpler and faster.

Wilco
On Wed, Oct 29, 2014 at 01:43:54PM -0000, Wilco Dijkstra wrote:
> This patch aligns allocations of large blocks to a cacheline on ARM and
> AArch64. The main goal is to reduce performance variations due to random
> alignment choices, however it improves performance on several benchmarks
> as well. SPECFP2000 improves by ~1.5%.
>
> Any comments?

It seems like this would pathologically increase fragmentation: when a
tiny block is split off the beginning to align a large allocation,
isn't it likely that this tiny chunk (rather than some other tiny
chunk already in the appropriate-sized free bin prior to the large
allocation) will be used to satisfy a tiny request, which may end up
being long-lived?

memalign-type operations are known to be a major problem for
fragmentation. See this bug report:

https://sourceware.org/bugzilla/show_bug.cgi?id=14581

A smarter approach might be to round up the _sizes_ of large
allocations so that the next available address after the allocated
block is nicely aligned. This will tend to make _future_ allocations
aligned, even if the first one isn't, and won't leave any free gaps
that could contribute to fragmentation.

Rich
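[A rough, self-contained sketch of the size-rounding idea; LINE_SIZE and
the helper name are invented for the example and do not reflect glibc's
actual chunk layout:]

#include <stdint.h>
#include <stddef.h>

#define LINE_SIZE 64   /* assumed cache-line size */

/* Given where a large block will start, pad its size so that the first
   address past the block falls on a cache-line boundary; the next
   allocation carved from that region then starts aligned, and no free
   gap is split off in front of this one.  */
static size_t
pad_large_request (uintptr_t start, size_t bytes)
{
  uintptr_t end = start + bytes;
  uintptr_t aligned_end = (end + LINE_SIZE - 1) & ~(uintptr_t) (LINE_SIZE - 1);
  return (size_t) (aligned_end - start);
}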
On Wed, Oct 29, 2014 at 9:25 AM, Wilco Dijkstra <wdijkstr@arm.com> wrote:
>> Andrew Pinski wrote:
>>> +#define MALLOC_LARGE_BLOCK_ALIGN (64)
>>
>> Yes that is not the only cache line size on aarch64. Why don't you make
>> it a global variable and read it from the system register like we do for
>> memset? ThunderX has a cache line size of 128.
>
> We could easily set it to 128 for AArch64 - the amount of wasted memory is
> minimal for large block sizes so overaligning doesn't hurt much.
>
> However you raise an interesting point - several functions may want to know
> the cacheline size and this would be useful across all targets. So we could
> add generic infrastructure that sets the cacheline size at GLIBC startup
> rather than having to add complex target specific code to compute it and
> cache it on demand in every function that might query the line size.

This has already been done for powerpc, since the A2 processor can use
different cache-line sizes, so look there for precedent: grep for
__cache_line_size.
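[For illustration: on AArch64 the line size can be read from CTR_EL0 at
startup, much as Wilco proposes (assuming userspace access to the
register, which Linux permits); a minimal sketch, with the variable name
borrowed from the powerpc precedent and the surrounding code hypothetical:]

#include <stdint.h>

static uint64_t __cache_line_size;

/* CTR_EL0.DminLine (bits [19:16]) holds log2 of the smallest data cache
   line size in 4-byte words, so the size in bytes is 4 << DminLine:
   64 on most cores, 128 on ThunderX.  */
static void
init_cache_line_size (void)
{
  uint64_t ctr;
  __asm__ ("mrs %0, ctr_el0" : "=r" (ctr));
  __cache_line_size = 4UL << ((ctr >> 16) & 0xf);
}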
On Wed, Oct 29, 2014 at 2:27 PM, Will Newton <will.newton@linaro.org> wrote:
> On 29 October 2014 14:16, Ramana Radhakrishnan
> <ramana.gcc@googlemail.com> wrote:
>> [...]
>>
>> Yeah that was my first reaction too - but read _mid_memalign - that
>> appears to call the malloc hooks too?
>
> But _mid_memalign calls the memalign hook rather than the malloc hook,
> which is rather surprising given we called malloc. It should simply be
> a case of moving the #ifdef block to below the call of the malloc hook
> in __libc_malloc to retain the existing behaviour.

Indeed - sorry for the noise.

Ramana

> --
> Will Newton
> Toolchain Working Group, Linaro
> Rich Felker wrote:
>
> It seems like this would pathologically increase fragmentation: when a
> tiny block is split off the beginning to align a large allocation,
> isn't it likely that this tiny chunk (rather than some other tiny
> chunk already in the appropriate-sized free bin prior to the large
> allocation) will be used to satisfy a tiny request, which may end up
> being long-lived?

A good implementation of memalign would try to avoid that, for example
by increasing the size of the previous allocated block, or by not adding
tiny split-off chunks to the free list.

> memalign-type operations are known to be a major problem for
> fragmentation. See this bug report:
>
> https://sourceware.org/bugzilla/show_bug.cgi?id=14581

It seems that report is claiming that freeing a block doesn't immediately
merge it with adjacent freed blocks. If that is the case then that is a
very serious bug indeed! The fact that memalign is not fully integrated
into the allocation code doesn't help either - it never tries to reuse a
correctly aligned free block.

> A smarter approach might be to round up the _sizes_ of large
> allocations so that the next available address after the allocated
> block is nicely aligned. This will tend to make _future_ allocations
> aligned, even if the first one isn't, and won't leave any free gaps
> that could contribute to fragmentation.

That might be possible too, I'd have to give it a go. I was trying to
avoid reworking the guts of the malloc code - I think it would be better
to create a new modern allocator from scratch.

Wilco
On Wed, Oct 29, 2014 at 03:31:37PM -0000, Wilco Dijkstra wrote:
> A good implementation of memalign would try to avoid that, for example
> by increasing the size of the previous allocated block, or by not adding
> tiny split-off chunks to the free list.

I don't think modifying the previous allocated block is practical; this
almost certainly has data races unless you impose costly synchronization.
And if you do the split but don't add the split-off block to the free
list, I don't see how you'd track it to free it later.

You could refrain from splitting off the tiny block at all and instead
keep it as part of the allocated block, with a special flag bit on the
header to indicate that the allocator needs to look back to find the
actual beginning of the block when freeing or reallocing it, etc. This
is essentially a more complex, but slightly more optimal, version of my
idea of rounding up the size.

> It seems that report is claiming that freeing a block doesn't immediately
> merge it with adjacent freed blocks. If that is the case then that is a
> very serious bug indeed! The fact that memalign is not fully integrated
> into the allocation code doesn't help either - it never tries to reuse a
> correctly aligned free block.

Yes, there are probably improvements that could be made to memalign to
reduce the problem, but they have tradeoffs. I don't think it's practical
to make memalign reuse correctly aligned free blocks unless you segregate
the free lists not just by size but also by alignment. Otherwise you'd
have to do a non-constant-time search to determine whether a free block
of the desired alignment exists and select it.

> That might be possible too, I'd have to give it a go. I was trying to
> avoid reworking the guts of the malloc code - I think it would be better
> to create a new modern allocator from scratch.

That's quite possible. All the basic algorithms involved are understood
and well documented, so it's certainly practical to redo it without all
the cruft. However, there is quite a big testing burden: even if you know
what you're doing, it's very easy to make stupid mistakes.

Rich
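[The look-back scheme is essentially the classic trick of over-allocating
and stashing an offset the allocator can follow back to the real start;
a freestanding sketch on top of plain malloc, not glibc's chunk layout:]

#include <stdint.h>
#include <stdlib.h>

/* Over-allocate, place the payload at the next aligned address, and
   record how far we moved just before the payload so the free path can
   walk back to the real start of the block.  ALIGN must be a power of
   two and at least sizeof (uint32_t).  */
static void *
memalign_sketch (size_t align, size_t size)
{
  unsigned char *raw = malloc (size + align + sizeof (uint32_t));
  if (raw == NULL)
    return NULL;
  uintptr_t payload = ((uintptr_t) raw + sizeof (uint32_t) + align - 1)
                      & ~(uintptr_t) (align - 1);
  ((uint32_t *) payload)[-1] = (uint32_t) (payload - (uintptr_t) raw);
  return (void *) payload;
}

static void
free_sketch (void *p)
{
  free ((unsigned char *) p - ((uint32_t *) p)[-1]);
}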
On 10/29/2014 09:43 AM, Wilco Dijkstra wrote:
> +#define MALLOC_LARGE_BLOCK_ALIGN (64)
> +#define MALLOC_LARGE_BLOCK_SIZE (16384)

Typo-prone interface. It will generate -Wundef errors. See:

https://sourceware.org/glibc/wiki/Wundef

It will soon become an error to leave these undefined for other arches.

Cheers,
Carlos.
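[A toy illustration of a -Wundef-clean shape: the generic header would
supply a zero default so the macro is always defined and can be tested
with #if; the structure here is hypothetical, not glibc's:]

#include <stdio.h>

/* In glibc this default would live in the generic malloc-sysdep.h, with
   ports such as aarch64 overriding it with a non-zero value.  */
#ifndef MALLOC_LARGE_BLOCK_ALIGN
# define MALLOC_LARGE_BLOCK_ALIGN 0
#endif

int
main (void)
{
#if MALLOC_LARGE_BLOCK_ALIGN > 0
  printf ("aligning large blocks to %d bytes\n", MALLOC_LARGE_BLOCK_ALIGN);
#else
  puts ("no large-block alignment on this port");
#endif
  return 0;
}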
diff --git a/malloc/malloc.c b/malloc/malloc.c
index 6cbe9f3..0b0466e 100644
--- a/malloc/malloc.c
+++ b/malloc/malloc.c
@@ -2878,6 +2878,14 @@ __libc_malloc (size_t bytes)
   mstate ar_ptr;
   void *victim;
 
+#ifdef MALLOC_LARGE_BLOCK_ALIGN
+  if (bytes > MALLOC_LARGE_BLOCK_SIZE)
+    {
+      void *address = RETURN_ADDRESS (0);
+      return _mid_memalign (MALLOC_LARGE_BLOCK_ALIGN, bytes, address);
+    }
+#endif
+
   void *(*hook) (size_t, const void *)
     = atomic_forced_read (__malloc_hook);
   if (__builtin_expect (hook != NULL, 0))
diff --git a/sysdeps/unix/sysv/linux/aarch64/malloc-sysdep.h b/sysdeps/unix/sysv/linux/aarch64/malloc-sysdep.h
new file mode 100644
index 0000000..3fe9f72
--- /dev/null
+++ b/sysdeps/unix/sysv/linux/aarch64/malloc-sysdep.h
@@ -0,0 +1,23 @@
+/* System-specific malloc support functions.  AArch64 version.
+   Copyright (C) 2012-2014 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#define MALLOC_LARGE_BLOCK_ALIGN (64)
+#define MALLOC_LARGE_BLOCK_SIZE (16384)
+
+#include_next <malloc-sysdep.h>
+
diff --git a/sysdeps/unix/sysv/linux/arm/malloc-sysdep.h b/sysdeps/unix/sysv/linux/arm/malloc-sysdep.h
new file mode 100644
index 0000000..3fe9f72
--- /dev/null
+++ b/sysdeps/unix/sysv/linux/arm/malloc-sysdep.h
@@ -0,0 +1,23 @@
+/* System-specific malloc support functions.  AArch64 version.
+   Copyright (C) 2012-2014 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#define MALLOC_LARGE_BLOCK_ALIGN (64)
+#define MALLOC_LARGE_BLOCK_SIZE (16384)
+
+#include_next <malloc-sysdep.h>
+