Message ID:   20150323.122530.812870422534676208.davem@davemloft.net
State:        Accepted
Delegated to: David Miller
David> ====================
David> [PATCH] sparc64: Fix several bugs in memmove().
David>
David> Firstly, handle zero length calls properly. Believe it or not there
David> are a few of these happening during early boot.
David>
David> Next, we can't just drop to a memcpy() call in the forward copy case
David> where dst <= src. The reason is that the cache initializing stores
David> used in the Niagara memcpy() implementations can end up clearing out
David> cache lines before we've sourced their original contents completely.
David>
David> For example, considering NG4memcpy, the main unrolled loop begins like
David> this:
David>
David>	load	src + 0x00
David>	load	src + 0x08
David>	load	src + 0x10
David>	load	src + 0x18
David>	load	src + 0x20
David>	store	dst + 0x00
David>
David> Assume dst is 64 byte aligned and let's say that dst is src - 8 for
David> this memcpy() call. That store at the end there is the one to the
David> first line in the cache line, thus clearing the whole line, which thus
David> clobbers "src + 0x28" before it even gets loaded.
David>
David> To avoid this, just fall through to a simple copy only mildly
David> optimized for the case where src and dst are 8 byte aligned and the
David> length is a multiple of 8 as well. We could get fancy and call
David> GENmemcpy() but this is good enough for how this thing is actually
David> used.

Would it make sense to have some memmove()/memcpy() tests on bootup to
catch problems like this? I know this is a strange case, and probably
not too common, but how hard would it be to wire up tests that go
through 1 to 128 byte memmove() on bootup to make sure things work
properly?

This seems like one of those critical, but subtle things to be checked.
And doing it only on bootup wouldn't slow anything down and would
(ideally) automatically get us coverage when people add new archs or
update the code.
John

--
To unsubscribe from this list: send the line "unsubscribe sparclinux" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Mar 23, 2015 at 9:25 AM, David Miller <davem@davemloft.net> wrote:
>
> Ok, here is what I committed.

So I wonder - looking at that assembly, I get the feeling that it isn't
any better code than gcc could generate from simple C code. Would it
perhaps be better to turn memmove() into C?

That's particularly true because if I read this code right, it now seems
to seriously pessimise non-overlapping memmove's, in that it now
*always* uses that slow downward copy if the destination is below the
source.

Now, admittedly, the kernel doesn't use a lot of memmove's, but this
still falls back on the "byte at a time" model for a lot of cases (all
non-64-bit-aligned ones). I could imagine those existing. And some
people (reasonably) hate memcpy because they've been burnt by the
overlapping case and end up using memmove as a "safe alternative", so
it's not necessarily just the overlapping case that might trigger this.

Maybe the code could be something like

  void *memmove(void *dst, const void *src, size_t n);
  {
	// non-overlapping cases
	if (src + n <= dst)
		return memcpy(dst, src, n);
	if (dst + n <= src)
		return memcpy(dst, src, n);

	// overlapping, but we know we
	// (a) copy upwards
	// (b) initialize the result in at most chunks of 64
	if (dst+64 <= src)
		return memcpy(dst, src, n);

	.. do the backwards thing ..
  }

(ok, maybe I got it wrong, but you get the idea).

I *think* gcc should do ok on the above kind of code, and not generate
wildly different code from your handcoded version.

               Linus
On 3/23/15 10:25 AM, David Miller wrote:
> [PATCH] sparc64: Fix several bugs in memmove().
>
> Firstly, handle zero length calls properly. Believe it or not there
> are a few of these happening during early boot.
>
> Next, we can't just drop to a memcpy() call in the forward copy case
> where dst <= src. The reason is that the cache initializing stores
> used in the Niagara memcpy() implementations can end up clearing out
> cache lines before we've sourced their original contents completely.
>
> For example, considering NG4memcpy, the main unrolled loop begins like
> this:
>
>	load	src + 0x00
>	load	src + 0x08
>	load	src + 0x10
>	load	src + 0x18
>	load	src + 0x20
>	store	dst + 0x00
>
> Assume dst is 64 byte aligned and let's say that dst is src - 8 for
> this memcpy() call. That store at the end there is the one to the
> first line in the cache line, thus clearing the whole line, which thus
> clobbers "src + 0x28" before it even gets loaded.
>
> To avoid this, just fall through to a simple copy only mildly
> optimized for the case where src and dst are 8 byte aligned and the
> length is a multiple of 8 as well. We could get fancy and call
> GENmemcpy() but this is good enough for how this thing is actually
> used.
>
> Reported-by: David Ahern <david.ahern@oracle.com>
> Reported-by: Bob Picco <bpicco@meloft.net>
> Signed-off-by: David S. Miller <davem@davemloft.net>

seems like a formality at this point, but this resolves the panic on
the M7-based ldom and baremetal. The T5-8 failed to boot, but it could
be a different problem.

Thanks for the fast turnaround,
David
From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Mon, 23 Mar 2015 10:00:02 -0700

> Maybe the code could be something like
>
>   void *memmove(void *dst, const void *src, size_t n);
>   {
>	// non-overlapping cases
>	if (src + n <= dst)
>		return memcpy(dst, src, n);
>	if (dst + n <= src)
>		return memcpy(dst, src, n);
>
>	// overlapping, but we know we
>	// (a) copy upwards
>	// (b) initialize the result in at most chunks of 64
>	if (dst+64 <= src)
>		return memcpy(dst, src, n);
>
>	.. do the backwards thing ..
>   }
>
> (ok, maybe I got it wrong, but you get the idea).
>
> I *think* gcc should do ok on the above kind of code, and not generate
> wildly different code from your handcoded version.

Sure you could do that in C, but I really want to avoid using memcpy()
if dst and src overlap in any way at all.

Said another way, I don't want to codify that "64" thing. The next
chip could do 128 byte initializing stores.
From: "John Stoffel" <john@stoffel.org>
Date: Mon, 23 Mar 2015 12:51:03 -0400

> Would it make sense to have some memmove()/memcopy() tests on bootup
> to catch problems like this? I know this is a strange case, and
> probably not too common, but how hard would it be to wire up tests
> that go through 1 to 128 byte memmove() on bootup to make sure things
> work properly?
>
> This seems like one of those critical, but subtle things to be
> checked. And doing it only on bootup wouldn't slow anything down and
> would (ideally) automatically get us coverage when people add new
> archs or update the code.

One of two things is already happening.

There have been assembler memcpy/memset development test harnesses
around that most arch developers are using, and those test things
rather extensively.

Also, the memcpy/memset routines on sparc in particular are completely
shared with glibc, we use the same exact code in both trees. So it's
getting tested there too.

memmove() is just not handled this way.
From: David Ahern <david.ahern@oracle.com>
Date: Mon, 23 Mar 2015 11:34:34 -0600

> seems like a formality at this point, but this resolves the panic on
> the M7-based ldom and baremetal. The T5-8 failed to boot, but it could
> be a different problem.

Specifically, does the T5-8 boot without my patch applied?
On Mon, Mar 23, 2015 at 12:08 PM, David Miller <davem@davemloft.net> wrote:
>
> Sure you could do that in C, but I really want to avoid using memcpy()
> if dst and src overlap in any way at all.
>
> Said another way, I don't want to codify that "64" thing. The next
> chip could do 128 byte initializing stores.

But David, THAT IS NOT WHAT YOUR BROKEN ASM DOES ANYWAY!

Read it again. Your asm code does not check for overlap. Look at this:

	cmp	%o0, %o1
	bleu,pt	%xcc, 2f

and ponder. It's wrong.

So even if you don't want to take that "allow overlap more than 64
bytes apart" thing, my C version actually is *better* than the broken
asm version you have.

The new asm version is better than the old one, because the new
breakage is about really bad performance rather than actively breaking,
but still..

               Linus
From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Mon, 23 Mar 2015 12:47:49 -0700

> On Mon, Mar 23, 2015 at 12:08 PM, David Miller <davem@davemloft.net> wrote:
>>
>> Sure you could do that in C, but I really want to avoid using memcpy()
>> if dst and src overlap in any way at all.
>>
>> Said another way, I don't want to codify that "64" thing. The next
>> chip could do 128 byte initializing stores.
>
> But David, THAT IS NOT WHAT YOUR BROKEN ASM DOES ANYWAY!
>
> Read it again. Your asm code does not check for overlap. Look at this:
>
>	cmp	%o0, %o1
>	bleu,pt	%xcc, 2f
>
> and ponder. It's wrong.

Right, it's not checking for overlap. It's checking for "does a
forward copy work?"

That's the standard test for this, and it's what glibc uses in its
generic memmove() implementation FWIW. (granted, I know glibc is not
generally a good source for "right way to do things" :-)

> The new asm version is better than the old one, because the new
> breakage is about really bad performance rather than actively
> breaking, but still..

I accept that it's suboptimal.
>>>>> "David" == David Miller <davem@davemloft.net> writes:

David> From: "John Stoffel" <john@stoffel.org>
David> Date: Mon, 23 Mar 2015 12:51:03 -0400

>> Would it make sense to have some memmove()/memcopy() tests on bootup
>> to catch problems like this? I know this is a strange case, and
>> probably not too common, but how hard would it be to wire up tests
>> that go through 1 to 128 byte memmove() on bootup to make sure things
>> work properly?
>>
>> This seems like one of those critical, but subtle things to be
>> checked. And doing it only on bootup wouldn't slow anything down and
>> would (ideally) automatically get us coverage when people add new
>> archs or update the code.

David> One of two things is already happening.

David> There have been assembler memcpy/memset development test harnesses
David> around that most arch developers are using, and those test things
David> rather extensively.

David> Also, the memcpy/memset routines on sparc in particular are completely
David> shared with glibc, we use the same exact code in both trees. So it's
David> getting tested there too.

That's good to know. I wasn't sure.

David> memmove() is just not handled this way.

Bummers. So why isn't this covered by the glibc tests too? Not
accusing, not at all! Just wondering.

Thanks for all your work David, I've been amazed at your energy here!
On 3/23/15 1:35 PM, David Miller wrote:
> From: David Ahern <david.ahern@oracle.com>
> Date: Mon, 23 Mar 2015 11:34:34 -0600
>
>> seems like a formality at this point, but this resolves the panic on
>> the M7-based ldom and baremetal. The T5-8 failed to boot, but it could
>> be a different problem.
>
> Specifically, does the T5-8 boot without my patch applied?

I am running around in circles with it... it takes 15 minutes after a
hard reset to get logged in, and I forgot that the 2.6.39 can't handle
-j 1024 either (task scheduler problem), and then I wasted time waiting
for sandwich shop to learn how to use mobile app ordering, ...

I'll respond as soon as I can.
From: "John Stoffel" <john@stoffel.org>
Date: Mon, 23 Mar 2015 15:56:02 -0400

>>>>>> "David" == David Miller <davem@davemloft.net> writes:
>
> David> From: "John Stoffel" <john@stoffel.org>
> David> Date: Mon, 23 Mar 2015 12:51:03 -0400
>
>>> Would it make sense to have some memmove()/memcopy() tests on bootup
>>> to catch problems like this? I know this is a strange case, and
>>> probably not too common, but how hard would it be to wire up tests
>>> that go through 1 to 128 byte memmove() on bootup to make sure things
>>> work properly?
>>>
>>> This seems like one of those critical, but subtle things to be
>>> checked. And doing it only on bootup wouldn't slow anything down and
>>> would (ideally) automatically get us coverage when people add new
>>> archs or update the code.
>
> David> One of two things is already happening.
>
> David> There have been assembler memcpy/memset development test harnesses
> David> around that most arch developers are using, and those test things
> David> rather extensively.
>
> David> Also, the memcpy/memset routines on sparc in particular are completely
> David> shared with glibc, we use the same exact code in both trees. So it's
> David> getting tested there too.
>
> Thats' good to know. I wasn't sure.
>
> David> memmove() is just not handled this way.
>
> Bummers. So why isn't this covered by the glibc tests too?

Because the kernel's memmove() is different from the one we use in
glibc on sparc.

In fact, we use the generic C version in glibc which expands to
forward and backward word copies.
On 3/23/15 1:35 PM, David Miller wrote:
> From: David Ahern <david.ahern@oracle.com>
> Date: Mon, 23 Mar 2015 11:34:34 -0600
>
>> seems like a formality at this point, but this resolves the panic on
>> the M7-based ldom and baremetal. The T5-8 failed to boot, but it could
>> be a different problem.
>
> Specifically, does the T5-8 boot without my patch applied?

The T5-8 is having problems; has to be unrelated to this commit.

T5-2 (256 cpus) boots fine, and make -j 256 on an allyesconfig builds
fine.
David Miller wrote: [Mon Mar 23 2015, 12:25:30PM EDT]
> From: David Miller <davem@davemloft.net>
> Date: Sun, 22 Mar 2015 22:19:06 -0400 (EDT)
>
>> I'll work on a fix.
>
> Ok, here is what I committed. David et al., let me know if you still
> see the crashes with this applied.
>
> Of course, I'll queue this up for -stable as well.
>
> Thanks!
>
> ====================
> [PATCH] sparc64: Fix several bugs in memmove().
>
> Firstly, handle zero length calls properly. Believe it or not there
> are a few of these happening during early boot.
>
> Next, we can't just drop to a memcpy() call in the forward copy case
> where dst <= src. The reason is that the cache initializing stores
> used in the Niagara memcpy() implementations can end up clearing out
> cache lines before we've sourced their original contents completely.
>
> For example, considering NG4memcpy, the main unrolled loop begins like
> this:
>
>	load	src + 0x00
>	load	src + 0x08
>	load	src + 0x10
>	load	src + 0x18
>	load	src + 0x20
>	store	dst + 0x00
>
> Assume dst is 64 byte aligned and let's say that dst is src - 8 for
> this memcpy() call. That store at the end there is the one to the
> first line in the cache line, thus clearing the whole line, which thus
> clobbers "src + 0x28" before it even gets loaded.
>
> To avoid this, just fall through to a simple copy only mildly
> optimized for the case where src and dst are 8 byte aligned and the
> length is a multiple of 8 as well. We could get fancy and call
> GENmemcpy() but this is good enough for how this thing is actually
> used.
>
> Reported-by: David Ahern <david.ahern@oracle.com>
> Reported-by: Bob Picco <bpicco@meloft.net>
> Signed-off-by: David S. Miller <davem@davemloft.net>
> ---

Seems solid with 2.6.39 on M7-4. Jalapeño is happy with current
sparc.git.
From: Bob Picco <bpicco@meloft.net>
Date: Tue, 24 Mar 2015 10:57:53 -0400

> Seems solid with 2.6.39 on M7-4. Jalapeño is happy with current
> sparc.git.

Thanks for all the testing, it's been integrated into the -stable
queues as well.
====================
[PATCH] sparc64: Fix several bugs in memmove().

Firstly, handle zero length calls properly. Believe it or not there
are a few of these happening during early boot.

Next, we can't just drop to a memcpy() call in the forward copy case
where dst <= src. The reason is that the cache initializing stores
used in the Niagara memcpy() implementations can end up clearing out
cache lines before we've sourced their original contents completely.

For example, considering NG4memcpy, the main unrolled loop begins like
this:

	load	src + 0x00
	load	src + 0x08
	load	src + 0x10
	load	src + 0x18
	load	src + 0x20
	store	dst + 0x00

Assume dst is 64 byte aligned and let's say that dst is src - 8 for
this memcpy() call. That store at the end there is the one to the
first line in the cache line, thus clearing the whole line, which thus
clobbers "src + 0x28" before it even gets loaded.

To avoid this, just fall through to a simple copy only mildly
optimized for the case where src and dst are 8 byte aligned and the
length is a multiple of 8 as well. We could get fancy and call
GENmemcpy() but this is good enough for how this thing is actually
used.

Reported-by: David Ahern <david.ahern@oracle.com>
Reported-by: Bob Picco <bpicco@meloft.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
---
 arch/sparc/lib/memmove.S | 35 ++++++++++++++++++++++++++++++++---
 1 file changed, 32 insertions(+), 3 deletions(-)

diff --git a/arch/sparc/lib/memmove.S b/arch/sparc/lib/memmove.S
index b7f6334..857ad4f 100644
--- a/arch/sparc/lib/memmove.S
+++ b/arch/sparc/lib/memmove.S
@@ -8,9 +8,11 @@
 	.text
 ENTRY(memmove) /* o0=dst o1=src o2=len */
-	mov		%o0, %g1
+	brz,pn		%o2, 99f
+	 mov		%o0, %g1
+
 	cmp		%o0, %o1
-	bleu,pt		%xcc, memcpy
+	bleu,pt		%xcc, 2f
 	 add		%o1, %o2, %g7
 	cmp		%g7, %o0
 	bleu,pt		%xcc, memcpy
@@ -24,7 +26,34 @@ ENTRY(memmove) /* o0=dst o1=src o2=len */
 	 stb		%g7, [%o0]
 	bne,pt		%icc, 1b
 	 sub		%o0, 1, %o0
-
+99:
 	retl
 	 mov		%g1, %o0
+
+	/* We can't just call memcpy for these memmove cases.  On some
+	 * chips the memcpy uses cache initializing stores and when dst
+	 * and src are close enough, those can clobber the source data
+	 * before we've loaded it in.
+	 */
+2:	or		%o0, %o1, %g7
+	or		%o2, %g7, %g7
+	andcc		%g7, 0x7, %g0
+	bne,pn		%xcc, 4f
+	 nop
+
+3:	ldx		[%o1], %g7
+	add		%o1, 8, %o1
+	subcc		%o2, 8, %o2
+	add		%o0, 8, %o0
+	bne,pt		%icc, 3b
+	 stx		%g7, [%o0 - 0x8]
+	ba,a,pt		%xcc, 99b
+
+4:	ldub		[%o1], %g7
+	add		%o1, 1, %o1
+	subcc		%o2, 1, %o2
+	add		%o0, 1, %o0
+	bne,pt		%icc, 4b
+	 stb		%g7, [%o0 - 0x1]
+	ba,a,pt		%xcc, 99b
 ENDPROC(memmove)