
ARC: Explicitly set ARCH_SLAB_MINALIGN = 8

Message ID 20190208105519.26750-1-abrodkin@synopsys.com
State New
Series ARC: Explicitly set ARCH_SLAB_MINALIGN = 8

Commit Message

Alexey Brodkin Feb. 8, 2019, 10:55 a.m. UTC
By default ARCH_SLAB_MINALIGN is defined in "include/linux/slab.h" as
"__alignof__(unsigned long long)", which looks fine, but not for ARC.

The ARC tools ABI sets the alignment of "long long" to 4, the same as for
"long", instead of the 8 one might expect.

Thus the slab allocator may easily return a buffer which is only 32-bit
aligned. Most of the time that's OK, until we start dealing with 64-bit
atomics implemented with the special LLOCKD/SCONDD instructions, which (as
opposed to their 32-bit counterparts LLOCK/SCOND) operate on full 64-bit
words, and those words must be 64-bit aligned.
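
As a rough illustration, a tiny user-space probe like the one below (hypothetical,
built with an ARC GCC; "struct obj" is made up) shows the 4-byte ABI alignment that
trips up LLOCKD/SCONDD once such an object lands on a 32-bit boundary:

#include <stddef.h>
#include <stdio.h>

struct obj {
	int       a;
	long long b;	/* ARC ABI: only 4-byte aligned inside the struct */
};

int main(void)
{
	/* With an ARC GCC both lines print 4, not the 8 that LLOCKD/SCONDD need */
	printf("__alignof__(long long)  = %u\n", (unsigned)__alignof__(long long));
	printf("offsetof(struct obj, b) = %u\n", (unsigned)offsetof(struct obj, b));
	return 0;
}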

This fixes a misaligned access fault seen on Ext4 directory removal:
--------------------->8-------------------
[    4.015732] EXT4-fs (mmcblk0p2): re-mounted. Opts: (null)
[    4.167881]
[    4.167881] Misaligned Access
[    4.172356] Path: /bin/busybox.nosuid
[    4.176004] CPU: 2 PID: 171 Comm: rm Not tainted 4.19.14-yocto-standard #1
[    4.182851]
[    4.182851] [ECR   ]: 0x000d0000 => Check Programmer's Manual
[    4.190061] [EFA   ]: 0xbeaec3fc
[    4.190061] [BLINK ]: ext4_delete_entry+0x210/0x234
[    4.190061] [ERET  ]: ext4_delete_entry+0x13e/0x234
[    4.202985] [STAT32]: 0x80080002 : IE K
[    4.207236] BTA: 0x9009329c   SP: 0xbe5b1ec4  FP: 0x00000000
[    4.212790] LPS: 0x9074b118  LPE: 0x9074b120 LPC: 0x00000000
[    4.218348] r00: 0x00000040  r01: 0x00000021 r02: 0x00000001
[    4.218348] r03: 0x00000000  r04: 0x00000002 r05: 0x00000000
[    4.218348] r06: 0x000000c6  r07: 0x00000000 r08: 0x9050f140
[    4.218348] r09: 0x000000c6  r10: 0x0000000a r11: 0x00000000
[    4.218348] r12: 0x90247a9c  r13: 0x9004e574 r14: 0x0008e150
[    4.218348] r15: 0x000989b8  r16: 0x0008cbec r17: 0x0009806c
[    4.218348] r18: 0x0009806c  r19: 0x0008e150 r20: 0x0008f0f8
[    4.218348] r21: 0x000000ab  r22: 0x0008f0f8 r23: 0x00000000
[    4.218348] r24: 0x00000000  r25: 0x00000000
[    4.218348]
[    4.218348]
[    4.270510]
[    4.270510] Stack Trace:
[    4.274510]   ext4_delete_entry+0x13e/0x234
[    4.278695]   ext4_rmdir+0xe0/0x238
[    4.282187]   vfs_rmdir+0x50/0xf0
[    4.285492]   do_rmdir+0x9e/0x154
[    4.288802]   EV_Trap+0x110/0x114
--------------------->8-------------------

Signed-off-by: Alexey Brodkin <abrodkin@synopsys.com>
Cc: <stable@vger.kernel.org> # 4.8+
---
 arch/arc/include/asm/cache.h | 10 ++++++++++
 1 file changed, 10 insertions(+)

Comments

Vineet Gupta Feb. 12, 2019, 5:17 p.m. UTC | #1
On 2/8/19 2:55 AM, Alexey Brodkin wrote:
> By default ARCH_SLAB_MINALIGN is defined in "include/linux/slab.h" as
> "__alignof__(unsigned long long)" which looks fine but not for ARC.

Just for the record, the issue happens because an LLOCKD (exclusive 64-bit load)
was trying to use a 32-bit aligned effective address (for an atomic64_t), which is
not allowed by the ISA (LLOCKD can only take a 64-bit aligned address, even when the
CPU has unaligned access enabled).

This in turn was happening because this word is embedded in some other struct and
happens to be only 4-byte aligned.


> ARC tools ABI sets align of "long long" the same as for "long" = 4
> instead of 8 one may think of.

Right, this was indeed unexpected and not like most other arches. The ARCv2 ISA allows
regular 64-bit loads/stores (LDD/STD) to take 32-bit aligned addresses. Thus the ABI
relaxing the alignment for 64-bit data potentially allows more packing and less
space waste. But on the flip side we need to waste space at arbitrary places like
this.

So this is all good in theory, but I'm not 100% sure how slab alignment helps
here (and whether it is future proof). So the outer struct with the embedded atomic64_t
was allocated via slab, and your patch ensures that the outer struct is 64-bit aligned?

But how does that guarantee that all embedded atomic64_t in there will be 64-bit
aligned (in future say) in the light of ARC ABI and the gcc bug/feature which
Peter alluded to

   https://gcc.gnu.org/bugzilla/show_bug.cgi?id=54188
   https://gcc.gnu.org/bugzilla/show_bug.cgi?id=10360

> Thus slab allocator may easily allocate a buffer which is 32-bit aligned.
> And most of the time it's OK until we start dealing with 64-bit atomics
> with special LLOCKD/SCONDD instructions which (as opposed to their 32-bit
> counterparts LLOCK/SCOND) operate with full 64-bit words but those words
> must be 64-bit aligned.

Some of this text needed to go above to give more context.

> 
> Fixes Ext4 folder removal:
> --------------------->8-------------------
> [    4.015732] EXT4-fs (mmcblk0p2): re-mounted. Opts: (null)
> [    4.167881]

..

> 
> Signed-off-by: Alexey Brodkin <abrodkin@synopsys.com>
> Cc: <stable@vger.kernel.org> # 4.8+
> ---
>  arch/arc/include/asm/cache.h | 10 ++++++++++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/arch/arc/include/asm/cache.h b/arch/arc/include/asm/cache.h
> index f393b663413e..74f8fcaaef5c 100644
> --- a/arch/arc/include/asm/cache.h
> +++ b/arch/arc/include/asm/cache.h
> @@ -52,6 +52,16 @@
>  #define cache_line_size()	SMP_CACHE_BYTES
>  #define ARCH_DMA_MINALIGN	SMP_CACHE_BYTES
>  
> +/*
> + * Make sure slab-allocated buffers are 64-bit aligned.
> + * This is required for llockd/scondd to deal with 64-bit aligned dwords.
> + * By default ARCH_SLAB_MINALIGN is __alignof__(long long) which in
> + * case of ARC is 4 instead of 8!
> + */
> +#ifdef CONFIG_ARC_HAS_LL64
> +#define ARCH_SLAB_MINALIGN	8
> +#endif
> +
>  extern void arc_cache_init(void);
>  extern char *arc_cache_mumbojumbo(int cpu_id, char *buf, int len);
>  extern void read_decode_cache_bcr(void);
>
David Laight Feb. 12, 2019, 5:30 p.m. UTC | #2
From:  Vineet Gupta
> Sent: 12 February 2019 17:17
> 
> On 2/8/19 2:55 AM, Alexey Brodkin wrote:
> > By default ARCH_SLAB_MINALIGN is defined in "include/linux/slab.h" as
> > "__alignof__(unsigned long long)" which looks fine but not for ARC.
> 
> Just for the record, the issue happens because a LLOCKD (exclusive 64-bit load)
> was trying to use a 32-bit aligned effective address (for atomic64_t), not allowed
> by ISA (LLOCKD can only take 64-bit aligned address, even when the CPU has
> unaligned access enabled).
> 
> This in turn was happening because this word is embedded in some other struct and
> happens to be 4 byte aligned
> 
> 
> > ARC tools ABI sets align of "long long" the same as for "long" = 4
> > instead of 8 one may think of.

Right, but __alignof__() doesn't have to return the alignment that would
be used for a data item of the specified type.
(Read the gcc 'bug' info for gory details.)

On i386 __alignof__(long long) is 8, but structure members of type 'long long'
are 4-byte aligned and the alignment of a structure with a 'long long' member
is only 4.
(Although the Microsoft compiler returns 4.)
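
A small probe of that behaviour might look like this (illustration only, built with
e.g. "gcc -m32"; the struct name is made up):

#include <stddef.h>
#include <stdio.h>

struct s {
	char      c;
	long long x;
};

int main(void)
{
	/* On i386 with GCC this reports 8 for the bare type ... */
	printf("__alignof__(long long) = %u\n", (unsigned)__alignof__(long long));
	/* ... while a struct member only gets 4-byte alignment ... */
	printf("offsetof(struct s, x)  = %u\n", (unsigned)offsetof(struct s, x));
	/* ... and the struct itself is only 4-byte aligned too */
	printf("__alignof__(struct s)  = %u\n", (unsigned)__alignof__(struct s));
	return 0;
}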

> Right, this was indeed unexpected and not like most other arches. ARCv2 ISA allows
> regular 64-bit loads/stores (LDD/STD) to take 32-bit aligned addresses. Thus ABI
> relaxing the alignment for 64-bit data potentially causes more packing and less
> space waste. But on the flip side we need to waste space at arbitrary places liek
> this.
> 
> So this is all good and theory, but I'm not 100% sure how slab alignment helps
> here (and is future proof). So the outer struct with embedded atomic64_t was
> allocated via slab and your patch ensures that outer struct is 64-bit aligned ?

Presumably 'atomic64_t' has an alignment attribute to force 8-byte alignment.

> But how does that guarantee that all embedded atomic64_t in there will be 64-bit
> aligned (in future say) in the light of ARC ABI and the gcc bug/feature which
> Peter alluded to
> 
>    https://gcc.gnu.org/bugzilla/show_bug.cgi?id=54188
>    https://gcc.gnu.org/bugzilla/show_bug.cgi?id=10360
> 
> > Thus slab allocator may easily allocate a buffer which is 32-bit aligned.
> > And most of the time it's OK until we start dealing with 64-bit atomics
> > with special LLOCKD/SCONDD instructions which (as opposed to their 32-bit
> > counterparts LLOCK/SCOND) operate with full 64-bit words but those words
> > must be 64-bit aligned.
> 
> Some of this text needed to go above to give more context.

I suspect the slab allocator should be returning 8 byte aligned addresses
on all systems....

	David

Vineet Gupta Feb. 12, 2019, 5:45 p.m. UTC | #3
+CC some folks interested in alignment stuff in the past.

On 2/12/19 9:30 AM, David Laight wrote:
> From:  Vineet Gupta
>> Sent: 12 February 2019 17:17
>>
>> On 2/8/19 2:55 AM, Alexey Brodkin wrote:
>>> By default ARCH_SLAB_MINALIGN is defined in "include/linux/slab.h" as
>>> "__alignof__(unsigned long long)" which looks fine but not for ARC.
>>
>> Just for the record, the issue happens because a LLOCKD (exclusive 64-bit load)
>> was trying to use a 32-bit aligned effective address (for atomic64_t), not allowed
>> by ISA (LLOCKD can only take 64-bit aligned address, even when the CPU has
>> unaligned access enabled).
>>
>> This in turn was happening because this word is embedded in some other struct and
>> happens to be 4 byte aligned
>>
>>
>>> ARC tools ABI sets align of "long long" the same as for "long" = 4
>>> instead of 8 one may think of.
> 
> Right, but __alignof__() doesn't have to return the alignment that would
> be used for a data item of the specified type.
> (Read the gcc 'bug' info for gory details.)
> 
> On i386 __alignof__(long long) is 8, but structure members of type 'long long'
> are 4 byte aligned and the alignment of a structure with a 'long long' member
> is only 4.
> (Although the microsoft compiler returns 4.)

Exactly my point that this fudging of outer alignment is no magic bullet.

> 
>> Right, this was indeed unexpected and not like most other arches. ARCv2 ISA allows
>> regular 64-bit loads/stores (LDD/STD) to take 32-bit aligned addresses. Thus ABI
>> relaxing the alignment for 64-bit data potentially causes more packing and less
>> space waste. But on the flip side we need to waste space at arbitrary places liek
>> this.
>>
>> So this is all good and theory, but I'm not 100% sure how slab alignment helps
>> here (and is future proof). So the outer struct with embedded atomic64_t was
>> allocated via slab and your patch ensures that outer struct is 64-bit aligned ?
> 
> Presumable 'atomic64_t' has an alignment attribute to force 8 byte alignment.

It does for ARC

typedef struct {
        aligned_u64 counter;
} atomic64_t;

But what was your point ?

> 
>> But how does that guarantee that all embedded atomic64_t in there will be 64-bit
>> aligned (in future say) in the light of ARC ABI and the gcc bug/feature which
>> Peter alluded to
>>
>>    https://gcc.gnu.org/bugzilla/show_bug.cgi?id=54188
>>    https://gcc.gnu.org/bugzilla/show_bug.cgi?id=10360
>>
>>> Thus slab allocator may easily allocate a buffer which is 32-bit aligned.
>>> And most of the time it's OK until we start dealing with 64-bit atomics
>>> with special LLOCKD/SCONDD instructions which (as opposed to their 32-bit
>>> counterparts LLOCK/SCOND) operate with full 64-bit words but those words
>>> must be 64-bit aligned.
>>
>> Some of this text needed to go above to give more context.
> 
> I suspect the slab allocator should be returning 8 byte aligned addresses
> on all systems.... 

Why? As I understand it, this is still not foolproof against the expected alignment of
inner members. There ought to be a better way to enforce all this.
Peter Zijlstra Feb. 13, 2019, 12:56 p.m. UTC | #4
On Tue, Feb 12, 2019 at 09:45:53AM -0800, Vineet Gupta wrote:
> +CC some folks interested in alignment stuff in the past.
> 
> On 2/12/19 9:30 AM, David Laight wrote:
> > From:  Vineet Gupta
> >> Sent: 12 February 2019 17:17
> >>
> >> On 2/8/19 2:55 AM, Alexey Brodkin wrote:
> >>> By default ARCH_SLAB_MINALIGN is defined in "include/linux/slab.h" as
> >>> "__alignof__(unsigned long long)" which looks fine but not for ARC.
> >>
> >> Just for the record, the issue happens because a LLOCKD (exclusive 64-bit load)
> >> was trying to use a 32-bit aligned effective address (for atomic64_t), not allowed
> >> by ISA (LLOCKD can only take 64-bit aligned address, even when the CPU has
> >> unaligned access enabled).
> >>
> >> This in turn was happening because this word is embedded in some other struct and
> >> happens to be 4 byte aligned
> >>
> >>
> >>> ARC tools ABI sets align of "long long" the same as for "long" = 4
> >>> instead of 8 one may think of.
> > 
> > Right, but __alignof__() doesn't have to return the alignment that would
> > be used for a data item of the specified type.
> > (Read the gcc 'bug' info for gory details.)
> > 
> > On i386 __alignof__(long long) is 8, but structure members of type 'long long'
> > are 4 byte aligned and the alignment of a structure with a 'long long' member
> > is only 4.
> > (Although the microsoft compiler returns 4.)
> 
> Exactly my point that this fudging of outer alignment is no magic bullet.

IMO (and yes I knew about that i386 thing) this is just plain wrong. Of
course we'll have to live with that crap, but that doesn't make it less
crap.

> >> Right, this was indeed unexpected and not like most other arches. ARCv2 ISA allows
> >> regular 64-bit loads/stores (LDD/STD) to take 32-bit aligned addresses. Thus ABI
> >> relaxing the alignment for 64-bit data potentially causes more packing and less
> >> space waste. But on the flip side we need to waste space at arbitrary places liek
> >> this.
> >>
> >> So this is all good and theory, but I'm not 100% sure how slab alignment helps
> >> here (and is future proof). So the outer struct with embedded atomic64_t was
> >> allocated via slab and your patch ensures that outer struct is 64-bit aligned ?
> > 
> > Presumable 'atomic64_t' has an alignment attribute to force 8 byte alignment.
> 
> It does for ARC
> 
> typedef struct {
>         aligned_u64 counter;
> } atomic64_t;

Note that atomic*_t is signed; also note that it doesn't matter in
practice because of -fno-strict-overflow.

Personally I think u64 and company should already force natural
alignment; but alas. I thought that was part of the reason we have __u64
and co., so that the ABI is invariant to kernel alignment changes.

> >> But how does that guarantee that all embedded atomic64_t in there will be 64-bit
> >> aligned (in future say) in the light of ARC ABI and the gcc bug/feature which
> >> Peter alluded to
> >>
> >>    https://gcc.gnu.org/bugzilla/show_bug.cgi?id=54188
> >>    https://gcc.gnu.org/bugzilla/show_bug.cgi?id=10360

I strongly agree with all those that say __alignof__ is broken and argue
for the C11 _Alignof/alignof semantics.

In particular I think that:

T x;

struct foo {
	T x;
};

alignof(x) == alignof(foo::x)

And:

    Aggregates (structures and arrays) and unions assume the alignment
    of their most strictly aligned component.

Otherwise none of this is remotely usable.
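
For the second property, a minimal C11 sketch (hypothetical, not from this thread)
of what one would expect to hold on any compiler, including with the relaxed ARC ABI:

struct holder {
	int a;
	_Alignas(8) long long b;	/* roughly what aligned_u64 does */
};

/* an aggregate must assume the alignment of its most strictly aligned member */
_Static_assert(_Alignof(struct holder) >= 8,
	       "aggregate assumes alignment of its strictest member");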

> >>> Thus slab allocator may easily allocate a buffer which is 32-bit aligned.
> >>> And most of the time it's OK until we start dealing with 64-bit atomics
> >>> with special LLOCKD/SCONDD instructions which (as opposed to their 32-bit
> >>> counterparts LLOCK/SCOND) operate with full 64-bit words but those words
> >>> must be 64-bit aligned.
> >>
> >> Some of this text needed to go above to give more context.
> > 
> > I suspect the slab allocator should be returning 8 byte aligned addresses
> > on all systems.... 
> 
> why ? As I understand it is still not fool proof against the expected alignment of
> inner members. There ought to be a better way to enforce all this.

I agree that for ARC ARCH_SLAB_MINALIGN should be at least 8.

In the past I've proposed a GCC plugin / checker that would verify the
alignment requirements against the various allocators.

For instance:

struct foo {
	spinlock_t a;
	int b;
} __cacheline_aligned;

struct foo *my_foo = kmalloc(sizeof(struct foo), GFP_KERNEL);

would result in a warning; because obviously kmalloc (as per
ARCH_SLAB_MINALIGN) doesn't respect the cacheline alignment of the type.

Of course; it appears our kmalloc() function definition doesn't even
have a __malloc attribute, so there's plenty work to be done here.
David Laight Feb. 13, 2019, 2:13 p.m. UTC | #5
From: Peter Zijlstra 
> Sent: 13 February 2019 12:57
...
> In the past I've proposed a GCC plugin / checker that would verify the
> alignment requirements against the various allocators.
> 
> For instance:
> 
> struct foo {
> 	spinlock_t a;
> 	int b;
> } __cacheline_aligned;
> 
> struct foo *my_foo = kmalloc(sizeof(struct foo), GFP_KERNEL);
> 
> would result in a warning; because obviously kmalloc (as per
> ARCH_SLAB_MINALIGN) doesn't respect the cacheline alignment of the type.
> 
> Of course; it appears our kmalloc() function definition doesn't even
> have a __malloc attribute, so there's plenty work to be done here.

We could pass the alignment to the allocator by defining something like:

#define do_malloc(x) ((x) = (typeof(x))_do_malloc(sizeof *(x), __alignof__(*(x))))

Although you probably want to compile-time detect alignments that are smaller
than the normal minimal alignment.

If the allocator needs to add a header it would need to use the byte before
the allocated item to find the header.
OTOH adding a header is horrid for page-sized items.
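
Usage of that sketch could then look like the following (hypothetical _do_malloc
allocator from the macro above, reusing Peter's struct):

struct foo {
	spinlock_t a;
	int b;
} __cacheline_aligned;

struct foo *p;

/* expands to: p = (typeof(p))_do_malloc(sizeof(*p), __alignof__(*p)),
   so the allocator sees both the size and the required alignment */
do_malloc(p);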

	David

Vineet Gupta Feb. 13, 2019, 11:23 p.m. UTC | #6
On 2/13/19 4:56 AM, Peter Zijlstra wrote:
>
> Personally I think u64 and company should already force natural
> alignment; but alas.

But there is an ISA/ABI angle here too. e.g. On 32-bit ARC, LDD (load double) is
allowed to take a 32-bit aligned address to load a register pair. Thus all u64
need not be 64-bit aligned (unless attribute aligned 8 etc) hence the relaxation
in ABI (alignment of long long is 4). You could certainly argue that we end up
undoing some of it anyways by defining things like ARCH_KMALLOC_MINALIGN to 8, but
still...

> I though that was part of the reason we have __u64
> and co., so that ABI is invariant to kernel alignment changes.

Apparently not.

>>> I suspect the slab allocator should be returning 8 byte aligned addresses
>>> on all systems.... 
>>
>> why ? As I understand it is still not fool proof against the expected alignment of
>> inner members. There ought to be a better way to enforce all this.
> 
> I agree that for ARC ARCH_SLAB_MINALIGN should be at least 8.

This issue aside, are there other reasons? Because making it 8 on ARC is just
deferring the inevitable.
Alexey Brodkin Feb. 14, 2019, 8:50 a.m. UTC | #7
Hi Vineet, Peter, all,

> -----Original Message-----
> From: Vineet Gupta <vgupta@synopsys.com>
> Sent: Thursday, February 14, 2019 2:24 AM
> To: Peter Zijlstra <peterz@infradead.org>
> Cc: David Laight <David.Laight@ACULAB.COM>; Alexey Brodkin <alexey.brodkin@synopsys.com>; linux-snps-
> arc@lists.infradead.org; Arnd Bergmann <arnd.bergmann@linaro.org>; linux-kernel@vger.kernel.org;
> stable@vger.kernel.org; Mark Rutland <mark.rutland@arm.com>
> Subject: Re: [PATCH] ARC: Explicitly set ARCH_SLAB_MINALIGN = 8
> 
> On 2/13/19 4:56 AM, Peter Zijlstra wrote:
> >
> > Personally I think u64 and company should already force natural
> > alignment; but alas.
> 
> But there is an ISA/ABI angle here too. e.g. On 32-bit ARC, LDD (load double) is
> allowed to take a 32-bit aligned address to load a register pair. Thus all u64
> need not be 64-bit aligned (unless attribute aligned 8 etc) hence the relaxation
> in ABI (alignment of long long is 4). You could certainly argue that we end up
> undoing some of it anyways by defining things like ARCH_KMALLOC_MINALIGN to 8, but
> still...
> 
> > I though that was part of the reason we have __u64
> > and co., so that ABI is invariant to kernel alignment changes.
> 
> Apparently not.
> 
> >>> I suspect the slab allocator should be returning 8 byte aligned addresses
> >>> on all systems....
> >>
> >> why ? As I understand it is still not fool proof against the expected alignment of
> >> inner members. There ought to be a better way to enforce all this.
> >
> > I agree that for ARC ARCH_SLAB_MINALIGN should be at least 8.
> 
> This issue aside, are there other reasons ? Because making it 8 on ARC is just
> pending the eventuality for later.

But that's pretty much the same for other 32-bit arches that have 64-bit atomics
like ARM etc. From what I can see in ARM's documentation for LDREXD/STREXD they
require double-word alignment of data as well.

That said, if for some reason an atomic64_t variable is unaligned, execution on
any (or at least most) 32-bit architectures will lead to a run-time failure,
i.e. we'll know about it and it will be fixed.

And with that change (ARCH_SLAB_MINALIGN=8 for ARC) I'm just
working around a peculiarity of the ARC ABI.

Out of curiosity I checked if there are any other occurrences of "alignof(long long)"
and there seem to be a couple more:
----------------------------------->8-----------------------------
# git grep alignof | grep "long long"

...

kernel/workqueue.c:5693:        WARN_ON(__alignof__(struct pool_workqueue) < __alignof__(long long));
mm/slab.c:155:#define   REDZONE_ALIGN           max(BYTES_PER_WORD, __alignof__(unsigned long long))
mm/slab.c:2034: if (ralign > __alignof__(unsigned long long))
----------------------------------->8-----------------------------

Not really sure how important the "kernel/workqueue.c" part is, but in the case of "mm/slab.c"
shouldn't we use ARCH_SLAB_MINALIGN there instead of that "not very meaningful" __alignof__(long long)?
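
If so, the mm/slab.c bit might become something like this (just a sketch of the idea,
not a tested change):

/* mm/slab.c -- hypothetical follow-up, not part of this patch */
#define REDZONE_ALIGN		max(BYTES_PER_WORD, ARCH_SLAB_MINALIGN)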

-Alexey
Peter Zijlstra Feb. 14, 2019, 10:28 a.m. UTC | #8
On Thu, Feb 14, 2019 at 08:50:43AM +0000, Alexey Brodkin wrote:

> But that's pretty much the same for other 32-bit arches that have 64-bit atomics
> like ARM etc. From what I may see from ARM's documentation for LDREXD/SRREXD they
> require double-word alignment of data as well.
> 
> That said if for some reason atomic64_t variable is unaligned execution on
> any (or at least most) 32-bit architectures will lead to run-time failure,
> i.e. we'll know about it and this will be fixed.

On x86_32 we have cmpxchg8b which 'likes' 8-byte alignment; our atomic64_32
implementation has the explicit alignment in it, however there
__alignof__(unsigned long long) is actually 8, so it all works.

Even though the hardware (obviously) never really requires alignment,
even for atomic ops (although that's coming, yay!).
Peter Zijlstra Feb. 14, 2019, 10:31 a.m. UTC | #9
On Wed, Feb 13, 2019 at 03:23:36PM -0800, Vineet Gupta wrote:
> On 2/13/19 4:56 AM, Peter Zijlstra wrote:
> >
> > Personally I think u64 and company should already force natural
> > alignment; but alas.
> 
> But there is an ISA/ABI angle here too. e.g. On 32-bit ARC, LDD (load double) is
> allowed to take a 32-bit aligned address to load a register pair. Thus all u64
> need not be 64-bit aligned (unless attribute aligned 8 etc) hence the relaxation
> in ABI (alignment of long long is 4). You could certainly argue that we end up
> undoing some of it anyways by defining things like ARCH_KMALLOC_MINALIGN to 8, but
> still...

So what happens if the data is then split across two cachelines; will a
STD vs LDD still be single-copy-atomic? I don't _think_ we rely on that
for > sizeof(unsigned long), with the obvious exception of atomic64_t,
but yuck...

So even though it is allowed by the chip, does it really make sense to
use this?
Alexey Brodkin Feb. 14, 2019, 10:44 a.m. UTC | #10
Hi Peter,

> -----Original Message-----
> From: Peter Zijlstra <peterz@infradead.org>
> Sent: Thursday, February 14, 2019 1:32 PM
> To: Vineet Gupta <vineet.gupta1@synopsys.com>
> Cc: David Laight <David.Laight@ACULAB.COM>; Alexey Brodkin <alexey.brodkin@synopsys.com>; linux-snps-
> arc@lists.infradead.org; Arnd Bergmann <arnd.bergmann@linaro.org>; linux-kernel@vger.kernel.org;
> stable@vger.kernel.org; Mark Rutland <mark.rutland@arm.com>
> Subject: Re: [PATCH] ARC: Explicitly set ARCH_SLAB_MINALIGN = 8
> 
> On Wed, Feb 13, 2019 at 03:23:36PM -0800, Vineet Gupta wrote:
> > On 2/13/19 4:56 AM, Peter Zijlstra wrote:
> > >
> > > Personally I think u64 and company should already force natural
> > > alignment; but alas.
> >
> > But there is an ISA/ABI angle here too. e.g. On 32-bit ARC, LDD (load double) is
> > allowed to take a 32-bit aligned address to load a register pair. Thus all u64
> > need not be 64-bit aligned (unless attribute aligned 8 etc) hence the relaxation
> > in ABI (alignment of long long is 4). You could certainly argue that we end up
> > undoing some of it anyways by defining things like ARCH_KMALLOC_MINALIGN to 8, but
> > still...
> 
> So what happens if the data is then split across two cachelines; will a
> STD vs LDD still be single-copy-atomic? I don't _think_ we rely on that
> for > sizeof(unsigned long), with the obvious exception of atomic64_t,
> but yuck...

STD & LDD are simple store/load instructions, so there's no problem for
their 64-bit data to span 2 adjacent cache lines, or even 2 pages
(if we're that unlucky). Or do you mean something else?

> So even though it is allowed by the chip; does it really make sense to
> use this?

It gives performance benefits when dealing with 64-bit or even
larger buffers; see how we use it in our string routines, like here [1].

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/arc/lib/memset-archs.S#n81

-Alexey
Peter Zijlstra Feb. 14, 2019, 11:08 a.m. UTC | #11
On Thu, Feb 14, 2019 at 10:44:49AM +0000, Alexey Brodkin wrote:
> > On Wed, Feb 13, 2019 at 03:23:36PM -0800, Vineet Gupta wrote:
> > > On 2/13/19 4:56 AM, Peter Zijlstra wrote:
> > > >
> > > > Personally I think u64 and company should already force natural
> > > > alignment; but alas.
> > >
> > > But there is an ISA/ABI angle here too. e.g. On 32-bit ARC, LDD (load double) is
> > > allowed to take a 32-bit aligned address to load a register pair. Thus all u64
> > > need not be 64-bit aligned (unless attribute aligned 8 etc) hence the relaxation
> > > in ABI (alignment of long long is 4). You could certainly argue that we end up
> > > undoing some of it anyways by defining things like ARCH_KMALLOC_MINALIGN to 8, but
> > > still...
> > 
> > So what happens if the data is then split across two cachelines; will a
> > STD vs LDD still be single-copy-atomic? I don't _think_ we rely on that
> > for > sizeof(unsigned long), with the obvious exception of atomic64_t,
> > but yuck...
> 
> STD & LDD are simple store/load instructions so there's no problem for
> their 64-bit data to be from 2 subsequent cache lines as well as 2 pages
> (if we're that unlucky). Or you mean something else?

u64 x;

WRITE_ONCE(x, 0x1111111100000000);
WRITE_ONCE(x, 0x0000000011111111);

vs

t = READ_ONCE(x);

is t allowed to be 0x1111111111111111 ?

If the data is split between two cachelines, the hardware must do
something very funny to avoid that.

single-copy-atomicity requires that to never happen; IOW no load or
store tearing. You must observe 'whole' values, no mixing.

Linux requires READ_ONCE()/WRITE_ONCE() to be single-copy-atomic for
<=sizeof(unsigned long) and atomic*_read()/atomic*_set() for all atomic
types. Your atomic64_t alignment should ensure this is so.

So while I think we're fine, I do find hardware instructions that tear
yuck (yah, I know, x86...)

> > So even though it is allowed by the chip; does it really make sense to
> > use this?
> 
> It gives performance benefits when dealing with either 64-bit or even
> larger buffers, see how we use it in our string routines like here [1].
> 
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/arc/lib/memset-archs.S#n81

That doesn't require the ABI alignment crud.
Alexey Brodkin Feb. 14, 2019, 12:05 p.m. UTC | #12
Hi Peter,

> -----Original Message-----
> From: linux-snps-arc <linux-snps-arc-bounces@lists.infradead.org> On Behalf Of Peter Zijlstra
> Sent: Thursday, February 14, 2019 2:08 PM
> To: Alexey Brodkin <alexey.brodkin@synopsys.com>
> Cc: Mark Rutland <mark.rutland@arm.com>; Vineet Gupta <vineet.gupta1@synopsys.com>; linux-
> kernel@vger.kernel.org; stable@vger.kernel.org; David Laight <David.Laight@ACULAB.COM>; Arnd Bergmann
> <arnd.bergmann@linaro.org>; linux-snps-arc@lists.infradead.org
> Subject: Re: [PATCH] ARC: Explicitly set ARCH_SLAB_MINALIGN = 8
> 
> On Thu, Feb 14, 2019 at 10:44:49AM +0000, Alexey Brodkin wrote:
> > > On Wed, Feb 13, 2019 at 03:23:36PM -0800, Vineet Gupta wrote:
> > > > On 2/13/19 4:56 AM, Peter Zijlstra wrote:
> > > > >
> > > > > Personally I think u64 and company should already force natural
> > > > > alignment; but alas.
> > > >
> > > > But there is an ISA/ABI angle here too. e.g. On 32-bit ARC, LDD (load double) is
> > > > allowed to take a 32-bit aligned address to load a register pair. Thus all u64
> > > > need not be 64-bit aligned (unless attribute aligned 8 etc) hence the relaxation
> > > > in ABI (alignment of long long is 4). You could certainly argue that we end up
> > > > undoing some of it anyways by defining things like ARCH_KMALLOC_MINALIGN to 8, but
> > > > still...
> > >
> > > So what happens if the data is then split across two cachelines; will a
> > > STD vs LDD still be single-copy-atomic? I don't _think_ we rely on that
> > > for > sizeof(unsigned long), with the obvious exception of atomic64_t,
> > > but yuck...
> >
> > STD & LDD are simple store/load instructions so there's no problem for
> > their 64-bit data to be from 2 subsequent cache lines as well as 2 pages
> > (if we're that unlucky). Or you mean something else?
> 
> u64 x;
> 
> WRITE_ONCE(x, 0x1111111100000000);
> WRITE_ONCE(x, 0x0000000011111111);
> 
> vs
> 
> t = READ_ONCE(x);
> 
> is t allowed to be 0x1111111111111111 ?
> 
> If the data is split between two cachelines, the hardware must do
> something very funny to avoid that.
> 
> single-copy-atomicity requires that to never happen; IOW no load or
> store tearing. You must observe 'whole' values, no mixing.
> 
> Linux requires READ_ONCE()/WRITE_ONCE() to be single-copy-atomic for
> <=sizeof(unsigned long) and atomic*_read()/atomic*_set() for all atomic
> types. Your atomic64_t alignment should ensure this is so.

Thanks for explanation!

I'm not completely sure about single-copy atomicity for our LDD/STD instructions
(need to check with the HW guys), but given the above requirement:
---------------------------->8--------------------------
READ_ONCE()/WRITE_ONCE() to be single-copy-atomic for <=sizeof(unsigned long)
---------------------------->8--------------------------
it's OK for them (LDD/STD) not to follow this, right? As they are obviously
longer than "unsigned long".

Though I'm wondering if READ_ONCE()/WRITE_ONCE() could be used on 64-bit data
even on 32-bit arches?

Now as for LLOCKD/SCONDD, which implement single-instruction 64-bit atomics: they
require double-word alignment and so cannot possibly span two cache lines.

So what am I missing here?

> So while I think we're fine, I do find hardware instructions that tear
> yuck (yah, I know, x86...)
> 
> > > So even though it is allowed by the chip; does it really make sense to
> > > use this?
> >
> > It gives performance benefits when dealing with either 64-bit or even
> > larger buffers, see how we use it in our string routines like here [1].
> >
> > [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/arc/lib/memset-archs.S#n81
> 
> That doesn't require the ABI alignment crud.

I'm not saying it has anything to do with our ABI - that's just how we use it.

-Alexey
Peter Zijlstra Feb. 14, 2019, 12:24 p.m. UTC | #13
On Thu, Feb 14, 2019 at 12:05:47PM +0000, Alexey Brodkin wrote:

> > > > So what happens if the data is then split across two cachelines; will a
> > > > STD vs LDD still be single-copy-atomic? I don't _think_ we rely on that
> > > > for > sizeof(unsigned long), with the obvious exception of atomic64_t,
> > > > but yuck...
> > >
> > > STD & LDD are simple store/load instructions so there's no problem for
> > > their 64-bit data to be from 2 subsequent cache lines as well as 2 pages
> > > (if we're that unlucky). Or you mean something else?
> > 
> > u64 x;
> > 
> > WRITE_ONCE(x, 0x1111111100000000);
> > WRITE_ONCE(x, 0x0000000011111111);
> > 
> > vs
> > 
> > t = READ_ONCE(x);
> > 
> > is t allowed to be 0x1111111111111111 ?
> > 
> > If the data is split between two cachelines, the hardware must do
> > something very funny to avoid that.
> > 
> > single-copy-atomicity requires that to never happen; IOW no load or
> > store tearing. You must observe 'whole' values, no mixing.
> > 
> > Linux requires READ_ONCE()/WRITE_ONCE() to be single-copy-atomic for
> > <=sizeof(unsigned long) and atomic*_read()/atomic*_set() for all atomic
> > types. Your atomic64_t alignment should ensure this is so.
> 
> Thanks for explanation!
> 
> I'm not completely sure about single-copy-atomic for our LDD/STD instructions
> (need to check with HW guys) but given above requirement:
> ---------------------------->8--------------------------
> READ_ONCE()/WRITE_ONCE() to be single-copy-atomic for <=sizeof(unsigned long)
> ---------------------------->8--------------------------
> that's OK for them (LDD/STD) to not follow this, right? As they are obviously
> longer than "unsigned long".

Correct.

> Though I'm wondering if READ_ONCE()/WRITE_ONCE() could be used on 64-bit data
> even on 32-bit arches?

Some people were talking about just that a while ago; there are a number
of 32-bit platforms (including ARMv7) that can actually do this, and there
would be some performance benefits.

But this is currently not done (in generic code) afaik. And once they do
this, it would be under some CONFIG knob.

In particular, things like:

  - u64_stats_sync.h
  - most CONFIG_64BIT chunks from kernel/sched/fair.c

and ISTR there were some other cases.
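
For reference, the u64_stats_sync pattern looks roughly like this (a sketch with
made-up stat names, using the existing kernel API):

#include <linux/u64_stats_sync.h>

struct my_stats {
	u64 bytes;
	struct u64_stats_sync syncp;
};

/* writer side, e.g. a per-cpu hot path */
static void my_stats_add(struct my_stats *s, u64 n)
{
	u64_stats_update_begin(&s->syncp);
	s->bytes += n;
	u64_stats_update_end(&s->syncp);
}

/* reader side: retries if a writer was in progress, so a torn
   64-bit read on a 32-bit arch is never observed */
static u64 my_stats_read(struct my_stats *s)
{
	unsigned int start;
	u64 val;

	do {
		start = u64_stats_fetch_begin(&s->syncp);
		val = s->bytes;
	} while (u64_stats_fetch_retry(&s->syncp, start));

	return val;
}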

> Now as for LLOCKD/SCONDD which implement single instruction 64-bit atomics require
> double-word alignment and so cannot possible span between cache lines.
> 
> So what am I missing here?

atomic64_{set,read}() will use STD/LDD respectively and should be
single-copy-atomic vs the LLOCKD/SCONDD. But that again is aligned
properly, so I don't see problems there.

Aside from that, it's just me being curious and paranoid at the same time.
David Laight Feb. 14, 2019, 2:14 p.m. UTC | #14
From: Peter Zijlstra
> Sent: 14 February 2019 11:08
> On Thu, Feb 14, 2019 at 10:44:49AM +0000, Alexey Brodkin wrote:
> > > On Wed, Feb 13, 2019 at 03:23:36PM -0800, Vineet Gupta wrote:
> > > > On 2/13/19 4:56 AM, Peter Zijlstra wrote:
> > > > >
> > > > > Personally I think u64 and company should already force natural
> > > > > alignment; but alas.
> > > >
> > > > But there is an ISA/ABI angle here too. e.g. On 32-bit ARC, LDD (load double) is
> > > > allowed to take a 32-bit aligned address to load a register pair. Thus all u64
> > > > need not be 64-bit aligned (unless attribute aligned 8 etc) hence the relaxation
> > > > in ABI (alignment of long long is 4). You could certainly argue that we end up
> > > > undoing some of it anyways by defining things like ARCH_KMALLOC_MINALIGN to 8, but
> > > > still...
> > >
> > > So what happens if the data is then split across two cachelines; will a
> > > STD vs LDD still be single-copy-atomic? I don't _think_ we rely on that
> > > for > sizeof(unsigned long), with the obvious exception of atomic64_t,
> > > but yuck...
> >
> > STD & LDD are simple store/load instructions so there's no problem for
> > their 64-bit data to be from 2 subsequent cache lines as well as 2 pages
> > (if we're that unlucky). Or you mean something else?
> 
> u64 x;
> 
> WRITE_ONCE(x, 0x1111111100000000);
> WRITE_ONCE(x, 0x0000000011111111);
> 
> vs
> 
> t = READ_ONCE(x);
> 
> is t allowed to be 0x1111111111111111 ?
> 
> If the data is split between two cachelines, the hardware must do
> something very funny to avoid that.

Never mind cachelines, think about separate pages.
You'd have to 'lock' both TLB entries before doing either access.
Not to mention the fact that the two physical locations could be
on entirely different memory cards (or whatever).

	David

Vineet Gupta Feb. 15, 2019, 1:34 a.m. UTC | #15
On 2/14/19 12:50 AM, Alexey Brodkin wrote:

>>>>> I suspect the slab allocator should be returning 8 byte aligned addresses
>>>>> on all systems....
>>>>
>>>> why ? As I understand it is still not fool proof against the expected alignment of
>>>> inner members. There ought to be a better way to enforce all this.
>>>
>>> I agree that for ARC ARCH_SLAB_MINALIGN should be at least 8.
>>
>> This issue aside, are there other reasons ? Because making it 8 on ARC is just
>> pending the eventuality for later.
> 
> But that's pretty much the same for other 32-bit arches that have 64-bit atomics
> like ARM etc. From what I may see from ARM's documentation for LDREXD/SRREXD they
> require double-word alignment of data as well.

Right, LLOCKD/SCONDD (64-bit exclusive load/store) need 64-bit aligned effective
addresses for micro-arch reasons (1 vs 2 cache lines) etc.

So let me try to unpack this. Say we had:

   struct foo {
	int        a;
	atomic64_t b;
   };

The atomic64_t (which for ARC and most others is u64 __attribute__((aligned(8))))
*already ensures* that a 4-byte padding is generated by gcc (I just confirmed
with a simple test case).

   #ifdef DOALIGN__
   #define my_u64	__u64 __attribute__((aligned(8)))
   #else
   #define my_u64	__u64
   #endif

  struct foo on_heap;

  printf("%d", &on_heap.b);

$ arc-linux-gcc -O2 test.c -DDOALIGN__ -c --save-temps

   main:
	mov_s r1,@on_heap+8   <----
	mov_s r0,@.LC0
	b @printf

W/o the alignment attribute (say normal LDD/STD)

$ arc-linux-gcc -O2 test.c -c --save-temps

   main:
	mov_s r1,@on_heap+4
	mov_s r0,@.LC0
	b @printf

So indeed your patch aligns dynamically allocated structs to 64-bit, ensuring any embedded
aligned_u64 is 64-bit aligned as well. Phew!


> That said if for some reason atomic64_t variable is unaligned execution on
> any (or at least most) 32-bit architectures will lead to run-time failure,
> i.e. we'll know about it and this will be fixed.
> 
> And what I'm doing by that change (ARCH_SLAB_MINALIGN=8 for ARC) I'm just
> working-around peculiarity of ARC ABI.

Right.

> 
> Out of curiosity I checked if there're any other occurrences of "alingof(long long)"
> and there seems to be a couple of more:
> ----------------------------------->8-----------------------------
> # git grep alignof | grep "long long"
> 
> ...
> 
> kernel/workqueue.c:5693:        WARN_ON(__alignof__(struct pool_workqueue) < __alignof__(long long));
> mm/slab.c:155:#define   REDZONE_ALIGN           max(BYTES_PER_WORD, __alignof__(unsigned long long))

For ARC it will be max(4,4), so 4;
for other 32-bit arches it will be max(4,8).

So indeed it makes sense to change it.

> mm/slab.c:2034: if (ralign > __alignof__(unsigned long long))
> ----------------------------------->8-----------------------------
> 
> Not really sure how important is "kernel/workqueue.c" part but in case of "mm/slab.c"
> shouldn't we use ARCH_SLAB_MINALIGN there instead of that "not very meaningful" __alignof__(long long)?
Alexey Brodkin Feb. 18, 2019, 8:53 a.m. UTC | #16
Hi Vineet,

> -----Original Message-----
> From: Vineet Gupta <vgupta@synopsys.com>
> Sent: Friday, February 15, 2019 4:34 AM
> To: Alexey Brodkin <alexey.brodkin@synopsys.com>; Peter Zijlstra <peterz@infradead.org>
> Cc: David Laight <David.Laight@ACULAB.COM>; linux-snps-arc@lists.infradead.org; Arnd Bergmann
> <arnd.bergmann@linaro.org>; linux-kernel@vger.kernel.org; stable@vger.kernel.org; Mark Rutland
> <mark.rutland@arm.com>
> Subject: Re: [PATCH] ARC: Explicitly set ARCH_SLAB_MINALIGN = 8
> 
> On 2/14/19 12:50 AM, Alexey Brodkin wrote:
> 
> >>>>> I suspect the slab allocator should be returning 8 byte aligned addresses
> >>>>> on all systems....
> >>>>
> >>>> why ? As I understand it is still not fool proof against the expected alignment of
> >>>> inner members. There ought to be a better way to enforce all this.
> >>>
> >>> I agree that for ARC ARCH_SLAB_MINALIGN should be at least 8.
> >>
> >> This issue aside, are there other reasons ? Because making it 8 on ARC is just
> >> pending the eventuality for later.
> >
> > But that's pretty much the same for other 32-bit arches that have 64-bit atomics
> > like ARM etc. From what I may see from ARM's documentation for LDREXD/SRREXD they
> > require double-word alignment of data as well.
> 
> Right LLOCKD/SCONDD (64-bit exclusive load/store) needs 64-bit aligned effective
> addresses for micro-arch reasons (1 vs 2 cache lines) etc.
> 
> So lets try to unpack this for me. Say we had.
> 
>    struct foo {
> 	int        a;
> 	atomic64_t b;
>    };
> 
> The atomic64_t (which for ARC and most others is u64 __attribute__((aligned(8))
> *already ensures* that there a 4 b padding is generated by gcc (I just confirmed
> with a simple test case).
> 
>    #ifdef DOALIGN__
>    #define my_u64	__u64 __attribute__((aligned(8)))
>    #else
>    #define my_u64	__u64
>    #endif
> 
>   struct foo on_heap;
> 
>   printf(%d", &on_heap.b)
> 
> $ arc-linux-gcc -O2 test.c -DDOALIGN__ -c --save-temps
> 
>    main:
> 	mov_s r1,@on_heap+8   <----
> 	mov_s r0,@.LC0
> 	b @printf
> 
> W/o the alignment attribute (say normal LDD/STD)
> 
> $ arc-linux-gcc -O2 test.c -c --save-temps
> 
>    main:
> 	mov_s r1,@on_heap+4
> 	mov_s r0,@.LC0
> 	b @printf
> 
> So indeed your patch aligns dynamic structs to 64-bit, ensuring any embedded
> aligned_u64 to be 64-bit aligned as well. Phew !
> 
> 
> > That said if for some reason atomic64_t variable is unaligned execution on
> > any (or at least most) 32-bit architectures will lead to run-time failure,
> > i.e. we'll know about it and this will be fixed.
> >
> > And what I'm doing by that change (ARCH_SLAB_MINALIGN=8 for ARC) I'm just
> > working-around peculiarity of ARC ABI.
> 
> Right.

So are you OK with this patch, or should something be done before applying it?

> >
> > Out of curiosity I checked if there're any other occurrences of "alingof(long long)"
> > and there seems to be a couple of more:
> > ----------------------------------->8-----------------------------
> > # git grep alignof | grep "long long"
> >
> > ...
> >
> > kernel/workqueue.c:5693:        WARN_ON(__alignof__(struct pool_workqueue) < __alignof__(long
> long));
> > mm/slab.c:155:#define   REDZONE_ALIGN           max(BYTES_PER_WORD, __alignof__(unsigned long long))
> 
> For ARC, it will be max(4,4) so 4
> for others 32-bit,it will be max(4,8)
> 
> So indeed it makes sense to change it.

I guess that's still a separate change for generic code, right?
I.e. I'll do it in a separate patch.

-Alexey
Vineet Gupta Feb. 19, 2019, 11:30 p.m. UTC | #17
On 2/18/19 12:53 AM, Alexey Brodkin wrote:
> So are you OK with this patch or something should be done before applying?

Your original patch was:

+#ifdef CONFIG_ARC_HAS_LL64
+#define ARCH_SLAB_MINALIGN     8
+#endif

I don't think this issue is related to LL64 at all. A long long on ARC could be 4-byte
aligned (as permitted by the ABI). It could use LDD or 2 separate LDs. This
has no bearing on the fact that LLOCKD has to use a 64-bit aligned address. So
I think we should just do this unconditionally. Agree?
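
Something like this, i.e. the same comment but without the Kconfig guard (a sketch
of the unconditional variant being suggested here, not the patch as posted):

/* arch/arc/include/asm/cache.h -- hypothetical unconditional form */
#define ARCH_SLAB_MINALIGN	8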

Patch

diff --git a/arch/arc/include/asm/cache.h b/arch/arc/include/asm/cache.h
index f393b663413e..74f8fcaaef5c 100644
--- a/arch/arc/include/asm/cache.h
+++ b/arch/arc/include/asm/cache.h
@@ -52,6 +52,16 @@ 
 #define cache_line_size()	SMP_CACHE_BYTES
 #define ARCH_DMA_MINALIGN	SMP_CACHE_BYTES
 
+/*
+ * Make sure slab-allocated buffers are 64-bit aligned.
+ * This is required for llockd/scondd to deal with 64-bit aligned dwords.
+ * By default ARCH_SLAB_MINALIGN is __alignof__(long long) which in
+ * case of ARC is 4 instead of 8!
+ */
+#ifdef CONFIG_ARC_HAS_LL64
+#define ARCH_SLAB_MINALIGN	8
+#endif
+
 extern void arc_cache_init(void);
 extern char *arc_cache_mumbojumbo(int cpu_id, char *buf, int len);
 extern void read_decode_cache_bcr(void);