diff mbox

4.0.0-rc4: panic in free_block

Message ID 550F5852.5020405@oracle.com
State RFC
Delegated to: David Miller
Headers show

Commit Message

David Ahern March 23, 2015, 12:03 a.m. UTC
On 3/22/15 5:54 PM, David Miller wrote:
>> I just put it on 4.0.0-rc4 and ditto -- problem goes away, so it
>> clearly suggests the memcpy or memmove are the root cause.
>
> Thanks, didn't notice that.
>
> So, something is amuck.

to continue to refine the problem ... I modified only the memmove lines 
(not the memcpy) and it works fine. So its the memmove.

I'm sure this will get whitespaced damaged on the copy and paste but to 
be clear this is the patch I am currently running and system is stable. 
On Friday it failed on every single; with this patch I have allyesconfig 
builds with -j 128 in a loop (clean in between) and nothing -- no panics.

   * Locking must be handled by the caller.
@@ -3344,7 +3354,7 @@ free_done:
         spin_unlock(&n->list_lock);
         slabs_destroy(cachep, &list);
         ac->avail -= batchcount;
-       memmove(ac->entry, &(ac->entry[batchcount]), sizeof(void 
*)*ac->avail);
+       move_entries(ac->entry, &(ac->entry[batchcount]), ac->avail);
  }

  /*
@@ -3817,8 +3827,7 @@ static void drain_array(struct kmem_cache *cachep, 
struct kmem_cache_node *n,
                                 tofree = (ac->avail + 1) / 2;
                         free_block(cachep, ac->entry, tofree, node, &list);
                         ac->avail -= tofree;
-                       memmove(ac->entry, &(ac->entry[tofree]),
-                               sizeof(void *) * ac->avail);
+                       move_entries(ac->entry, &(ac->entry[tofree]), 
ac->avail);
                 }
                 spin_unlock_irq(&n->list_lock);
                 slabs_destroy(cachep, &list);


--
To unsubscribe from this list: send the line "unsubscribe sparclinux" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

David Miller March 23, 2015, 2 a.m. UTC | #1
From: David Ahern <david.ahern@oracle.com>
Date: Sun, 22 Mar 2015 18:03:30 -0600

> On 3/22/15 5:54 PM, David Miller wrote:
>>> I just put it on 4.0.0-rc4 and ditto -- problem goes away, so it
>>> clearly suggests the memcpy or memmove are the root cause.
>>
>> Thanks, didn't notice that.
>>
>> So, something is amuck.
> 
> to continue to refine the problem ... I modified only the memmove
> lines (not the memcpy) and it works fine. So its the memmove.
> 
> I'm sure this will get whitespaced damaged on the copy and paste but
> to be clear this is the patch I am currently running and system is
> stable. On Friday it failed on every single; with this patch I have
> allyesconfig builds with -j 128 in a loop (clean in between) and
> nothing -- no panics.

Can you just try calling memcpy(), that should work because
I think we agree that if the memcpy() implementation copies
from low to high it should work.

I wonder if the triggering factor is configuring for a high
number of cpus.  I always have NR_CPUS=128 since that's the
largest machine I have.  I'll give NR_CPUS=1024 a spin.

--
To unsubscribe from this list: send the line "unsubscribe sparclinux" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller March 23, 2015, 2:19 a.m. UTC | #2
Nevermind I think I figured out the problem.

It's the cache initializing stores, we can't do overlapping
copies where dst <= src in all cases because of them.

A store to a address modulo the cache line size (which for
these instructions is 64 bytes), clears that whole line.

But when we're doing these memmove() calls in SLAB/SLUB, we
can clear some bytes at the end of the line before they've
been read in.

And reading over NG4memcpy, this _can_ happen, the main unrolled
loop begins like this:

	load	src + 0x00
	load	src + 0x08
	load	src + 0x10
	load	src + 0x18
	load	src + 0x20
	store	dst + 0x00

Assume dst is 64 byte aligned and let's say that dst is src - 8 for
this memcpy() call, right?  That store at the end there is the one to
the first line in the cache line, thus clearing the whole line, which
thus clobbers "src + 0x28" before it even gets loaded.

I'm pretty sure this is what's happening.

And it's only going to trigger if the memcpy() is 128 bytes or larger.

I'll work on a fix.
--
To unsubscribe from this list: send the line "unsubscribe sparclinux" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/mm/slab.c b/mm/slab.c
index c4b89ea..f5e9716 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -802,6 +802,16 @@  static inline void ac_put_obj(struct kmem_cache 
*cachep, struct array_cache *ac,
         ac->entry[ac->avail++] = objp;
  }

+static void move_entries(void *dest, void *src, int nr)
+{
+       unsigned long *dp = dest;
+       unsigned long *sp = src;
+
+       for (; nr; nr--, dp++, sp++)
+               *dp = *sp;
+}
+
+
  /*
   * Transfer objects in one arraycache to another.