diff mbox series

Don't write beyond destination in __mempcpy_avx512_no_vzeroupper (bug 23196)

Message ID mvm1se4romh.fsf@suse.de
State New
Headers show
Series Don't write beyond destination in __mempcpy_avx512_no_vzeroupper (bug 23196) | expand

Commit Message

Andreas Schwab May 22, 2018, 10:06 a.m. UTC
[BZ #23196]
	CVE-2018-11237
	* sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S
	(L(preloop_large)): Save initial destination pointer in %r11 and
	use it instead of %rax after the loop.
	* string/test-mempcpy.c (MIN_PAGE_SIZE): Define.
---
 string/test-mempcpy.c                                   | 1 +
 sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S | 5 +++--
 2 files changed, 4 insertions(+), 2 deletions(-)

Comments

H.J. Lu May 22, 2018, 11:58 a.m. UTC | #1
On Tue, May 22, 2018 at 3:06 AM, Andreas Schwab <schwab@suse.de> wrote:
>         [BZ #23196]
>         CVE-2018-11237
>         * sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S
>         (L(preloop_large)): Save initial destination pointer in %r11 and
>         use it instead of %rax after the loop.
>         * string/test-mempcpy.c (MIN_PAGE_SIZE): Define.

Please include your analysis in commit message.

>  string/test-mempcpy.c                                   | 1 +
>  sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S | 5 +++--
>  2 files changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/string/test-mempcpy.c b/string/test-mempcpy.c
> index c08fba895e..d98ecdd2d9 100644
> --- a/string/test-mempcpy.c
> +++ b/string/test-mempcpy.c
> @@ -18,6 +18,7 @@
>     <http://www.gnu.org/licenses/>.  */
>
>  #define MEMCPY_RESULT(dst, len) (dst) + (len)
> +#define MIN_PAGE_SIZE 131072
>  #define TEST_MAIN
>  #define TEST_NAME "mempcpy"
>  #include "test-string.h"

The modified test does't fail on Skylake server with unchanged
memmove-avx512-no-vzeroupper.S.   Can you modify the test
so that it fails with the original memmove-avx512-no-vzeroupper.S
on Skylake server?

> diff --git a/sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S b/sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S
> index 23c0f7a9ed..effc3ac2de 100644
> --- a/sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S
> +++ b/sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S
> @@ -336,6 +336,7 @@ L(preloop_large):
>         vmovups (%rsi), %zmm4
>         vmovups 0x40(%rsi), %zmm5
>
> +       mov     %rdi, %r11
>  /* Align destination for access with non-temporal stores in the loop.  */
>         mov     %rdi, %r8
>         and     $-0x80, %rdi
> @@ -366,8 +367,8 @@ L(gobble_256bytes_nt_loop):
>         cmp     $256, %rdx
>         ja      L(gobble_256bytes_nt_loop)
>         sfence
> -       vmovups %zmm4, (%rax)
> -       vmovups %zmm5, 0x40(%rax)
> +       vmovups %zmm4, (%r11)
> +       vmovups %zmm5, 0x40(%r11)
>         jmp     L(check)
>
>  L(preloop_large_bkw):

memmove-vec-unaligned-erms.S supports no vzeroupper:

#ifndef VZEROUPPER
# if VEC_SIZE > 16
#  define VZEROUPPER vzeroupper
# else
#  define VZEROUPPER
# endif
#endif

Should it be used instead?
Andreas Schwab May 22, 2018, 12:17 p.m. UTC | #2
On Mai 22 2018, "H.J. Lu" <hjl.tools@gmail.com> wrote:

> The modified test does't fail on Skylake server with unchanged
> memmove-avx512-no-vzeroupper.S.   Can you modify the test
> so that it fails with the original memmove-avx512-no-vzeroupper.S
> on Skylake server?

How to do that?  I have tested it on a Xeon Phi (whateve that means)
where it failed without the fix.

> memmove-vec-unaligned-erms.S supports no vzeroupper:
>
> #ifndef VZEROUPPER
> # if VEC_SIZE > 16
> #  define VZEROUPPER vzeroupper
> # else
> #  define VZEROUPPER
> # endif
> #endif
>
> Should it be used instead?

I don't understand what that has to do with the bug in question.

Andreas.
H.J. Lu May 22, 2018, 12:25 p.m. UTC | #3
On Tue, May 22, 2018 at 5:17 AM, Andreas Schwab <schwab@suse.de> wrote:
> On Mai 22 2018, "H.J. Lu" <hjl.tools@gmail.com> wrote:
>
>> The modified test does't fail on Skylake server with unchanged
>> memmove-avx512-no-vzeroupper.S.   Can you modify the test
>> so that it fails with the original memmove-avx512-no-vzeroupper.S
>> on Skylake server?
>
> How to do that?  I have tested it on a Xeon Phi (whateve that means)
> where it failed without the fix.

__mempcpy_avx512_no_vzeroupper is also tested on Skylake server:

[hjl@gnu-skx-1 build-x86_64-linux]$ ./string/test-mempcpy
                        simple_mempcpy __mempcpy_avx512_no_vzeroupper
__mempcpy_avx512_unaligned __mempcpy_avx512_unaligned_erms
__mempcpy_avx_unaligned__mempcpy_avx_unaligned_erms
__mempcpy_ssse3_back __mempcpy_ssse3 __mempcpy_sse2_unaligned
__mempcpy_sse2_unaligned_erms __mempcpy_erms
[hjl@gnu-skx-1 build-x86_64-linux]$

It should also fail without the fix.

>> memmove-vec-unaligned-erms.S supports no vzeroupper:
>>
>> #ifndef VZEROUPPER
>> # if VEC_SIZE > 16
>> #  define VZEROUPPER vzeroupper
>> # else
>> #  define VZEROUPPER
>> # endif
>> #endif
>>
>> Should it be used instead?
>
> I don't understand what that has to do with the bug in question.

__mempcpy_avx512_no_vzeroupper can be implemented with
memmove-vec-unaligned-erms.S.  It doesn't have the issue in
memmove-avx512-no-vzeroupper.S.
Andreas Schwab May 22, 2018, 12:38 p.m. UTC | #4
On Mai 22 2018, "H.J. Lu" <hjl.tools@gmail.com> wrote:

> It should also fail without the fix.

Again, tell me how to do that.

> __mempcpy_avx512_no_vzeroupper can be implemented with
> memmove-vec-unaligned-erms.S.

In which way?  There is no definition of __mempcpy_avx512_no_vzeroupper
in memmove-vec-unaligned-erms.S.

Andreas.
Florian Weimer May 22, 2018, 12:47 p.m. UTC | #5
On 05/22/2018 02:17 PM, Andreas Schwab wrote:
> On Mai 22 2018, "H.J. Lu"<hjl.tools@gmail.com>  wrote:
> 
>> The modified test does't fail on Skylake server with unchanged
>> memmove-avx512-no-vzeroupper.S.   Can you modify the test
>> so that it fails with the original memmove-avx512-no-vzeroupper.S
>> on Skylake server?

> How to do that?  I have tested it on a Xeon Phi (whateve that means)
> where it failed without the fix.

It has to do with copy sizes and relation to the cache size.  I had to 
adjust your reproducer to make it fail on KNL, basically increasing the 
copy size by a factor of 10.

Thanks,
Florian
H.J. Lu May 22, 2018, 12:55 p.m. UTC | #6
On Tue, May 22, 2018 at 5:38 AM, Andreas Schwab <schwab@suse.de> wrote:
> On Mai 22 2018, "H.J. Lu" <hjl.tools@gmail.com> wrote:
>
>> It should also fail without the fix.
>
> Again, tell me how to do that.

You need to trigger this condition on all AVX512 machines.

>> __mempcpy_avx512_no_vzeroupper can be implemented with
>> memmove-vec-unaligned-erms.S.
>
> In which way?  There is no definition of __mempcpy_avx512_no_vzeroupper
> in memmove-vec-unaligned-erms.S.
>

memmove-vec-unaligned-erms.S is a template.  It can be used to
implement AVX512 memmove without vzeroupper.
Andreas Schwab May 22, 2018, 1:06 p.m. UTC | #7
On Mai 22 2018, "H.J. Lu" <hjl.tools@gmail.com> wrote:

> On Tue, May 22, 2018 at 5:38 AM, Andreas Schwab <schwab@suse.de> wrote:
>> On Mai 22 2018, "H.J. Lu" <hjl.tools@gmail.com> wrote:
>>
>>> It should also fail without the fix.
>>
>> Again, tell me how to do that.
>
> You need to trigger this condition on all AVX512 machines.

Again, tell me how to do that.

> memmove-vec-unaligned-erms.S is a template.  It can be used to
> implement AVX512 memmove without vzeroupper.

How is that relevant?

Andreas.
Andreas Schwab May 22, 2018, 1:21 p.m. UTC | #8
On Mai 22 2018, Florian Weimer <fweimer@redhat.com> wrote:

> It has to do with copy sizes and relation to the cache size.  I had to
> adjust your reproducer to make it fail on KNL, basically increasing the
> copy size by a factor of 10.

What is the architectural maximum cache size?

Andreas.
Florian Weimer May 22, 2018, 1:53 p.m. UTC | #9
On 05/22/2018 03:21 PM, Andreas Schwab wrote:
> On Mai 22 2018, Florian Weimer <fweimer@redhat.com> wrote:
> 
>> It has to do with copy sizes and relation to the cache size.  I had to
>> adjust your reproducer to make it fail on KNL, basically increasing the
>> copy size by a factor of 10.
> 
> What is the architectural maximum cache size?

My test machine had an L2 cache size of 1 MiB.  I think the cut-over 
happens at half of that (at least for our backported version), which is why

#define N 976999

in your reproducer word for me (along with -O0 -fno-builtin).
Thanks,
Florian
Andreas Schwab May 22, 2018, 2:12 p.m. UTC | #10
On Mai 22 2018, Florian Weimer <fweimer@redhat.com> wrote:

> My test machine had an L2 cache size of 1 MiB.

Then you should be safe, a 64K mempcpy is enough to trigger it.

Andreas.
Florian Weimer May 22, 2018, 2:36 p.m. UTC | #11
On 05/22/2018 04:12 PM, Andreas Schwab wrote:
> On Mai 22 2018, Florian Weimer <fweimer@redhat.com> wrote:
> 
>> My test machine had an L2 cache size of 1 MiB.
> 
> Then you should be safe, a 64K mempcpy is enough to trigger it.

Our memcpy has:


L(512bytesormore):
#ifdef SHARED_CACHE_SIZE_HALF
         mov     $SHARED_CACHE_SIZE_HALF, %r8
#else
         mov     __x86_64_shared_cache_size_half(%rip), %r8
#endif
         cmp     %r8, %rdx
         jae     L(preloop_large)
         cmp     $1024, %rdx
         ja      L(1024bytesormore)

And with the original reproducer, we wouldn't hit preloop_large, where 
the bug is.

Thanks,
Florian
Andreas Schwab May 22, 2018, 3:06 p.m. UTC | #12
On Mai 22 2018, Florian Weimer <fweimer@redhat.com> wrote:

> And with the original reproducer, we wouldn't hit preloop_large, where the
> bug is.

Worksforme.

$ string/test-mempcpy 
                        simple_mempcpy  __mempcpy_avx512_no_vzeroupper  __mempcpy_avx512_unaligned      __mempcpy_avx512_unaligned_erms __mempcpy_avx_unaligned__mempcpy_avx_unaligned_erms     __mempcpy_ssse3_back    __mempcpy_ssse3 __mempcpy_sse2_unaligned        __mempcpy_sse2_unaligned_erms   __mempcpy_erms
/mnt/src/libc/test/string/test-mempcpy: Wrong result in function __mempcpy_avx512_no_vzeroupper dst 0x7f7cf79f6000 "" src 0x7f7cf7a36000 "▒/F]t������,CZq������)@Wn������&=Tk������
                   #:Qh�����     7Ne|�����4Kby�����▒1H_v�����" len 65536
/mnt/src/libc/test/string/test-mempcpy: Wrong result in function __mempcpy_avx512_no_vzeroupper dst 0x7f7cf79f6000 "" src 0x7f7cf7a36010 "▒/F]t������,CZq������)@Wn������&=Tk������
                   #:Qh�����     7Ne|�����4Kby�����▒1H_v�����" len 65536
/mnt/src/libc/test/string/test-mempcpy: Wrong result in function __mempcpy_avx512_no_vzeroupper dst 0x7f7cf79f6010 "" src 0x7f7cf7a36000 "▒/F]t������,CZq������)@Wn������&=Tk������
                   #:Qh�����     7Ne|�����4Kby�����▒1H_v�����" len 65536
/mnt/src/libc/test/string/test-mempcpy: Wrong result in function __mempcpy_avx512_no_vzeroupper dst 0x7f7cf79f6010 "" src 0x7f7cf7a36010 "▒/F]t������,CZq������)@Wn������&=Tk������
                   #:Qh�����     7Ne|�����4Kby�����▒1H_v�����" len 65536

Andreas.
Florian Weimer May 22, 2018, 3:35 p.m. UTC | #13
On 05/22/2018 05:06 PM, Andreas Schwab wrote:
> On Mai 22 2018, Florian Weimer<fweimer@redhat.com>  wrote:
> 
>> And with the original reproducer, we wouldn't hit preloop_large, where the
>> bug is.
> Worksforme.
> 
> $ string/test-mempcpy

I don't doubt that.  Our CPUs probably have different cache sizes, or 
there are differences in how different glibc versions compute the 
effective cache size.

Thanks,
Florian
Andreas Schwab May 22, 2018, 3:38 p.m. UTC | #14
On Mai 22 2018, Florian Weimer <fweimer@redhat.com> wrote:

> On 05/22/2018 05:06 PM, Andreas Schwab wrote:
>> On Mai 22 2018, Florian Weimer<fweimer@redhat.com>  wrote:
>>
>>> And with the original reproducer, we wouldn't hit preloop_large, where the
>>> bug is.
>> Worksforme.
>>
>> $ string/test-mempcpy
>
> I don't doubt that.  Our CPUs probably have different cache sizes,

Same cache size.

Andreas.
H.J. Lu May 22, 2018, 4:33 p.m. UTC | #15
On Tue, May 22, 2018 at 8:38 AM, Andreas Schwab <schwab@suse.de> wrote:
> On Mai 22 2018, Florian Weimer <fweimer@redhat.com> wrote:
>
>> On 05/22/2018 05:06 PM, Andreas Schwab wrote:
>>> On Mai 22 2018, Florian Weimer<fweimer@redhat.com>  wrote:
>>>
>>>> And with the original reproducer, we wouldn't hit preloop_large, where the
>>>> bug is.
>>> Worksforme.
>>>
>>> $ string/test-mempcpy
>>
>> I don't doubt that.  Our CPUs probably have different cache sizes,
>
> Same cache size.

Here is the testcase.   Since both memcpy and memcpy share the
the same body of code, we can't use RAX to refer the beginning of
destination.  Please update your patch with this info together this
tesrcase.

Thanks.
Andreas Schwab May 23, 2018, 7:54 a.m. UTC | #16
I have installed the patch now, please commit your test case.

Andreas.
diff mbox series

Patch

diff --git a/string/test-mempcpy.c b/string/test-mempcpy.c
index c08fba895e..d98ecdd2d9 100644
--- a/string/test-mempcpy.c
+++ b/string/test-mempcpy.c
@@ -18,6 +18,7 @@ 
    <http://www.gnu.org/licenses/>.  */
 
 #define MEMCPY_RESULT(dst, len) (dst) + (len)
+#define MIN_PAGE_SIZE 131072
 #define TEST_MAIN
 #define TEST_NAME "mempcpy"
 #include "test-string.h"
diff --git a/sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S b/sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S
index 23c0f7a9ed..effc3ac2de 100644
--- a/sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S
+++ b/sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S
@@ -336,6 +336,7 @@  L(preloop_large):
 	vmovups	(%rsi), %zmm4
 	vmovups	0x40(%rsi), %zmm5
 
+	mov	%rdi, %r11
 /* Align destination for access with non-temporal stores in the loop.  */
 	mov	%rdi, %r8
 	and	$-0x80, %rdi
@@ -366,8 +367,8 @@  L(gobble_256bytes_nt_loop):
 	cmp	$256, %rdx
 	ja	L(gobble_256bytes_nt_loop)
 	sfence
-	vmovups	%zmm4, (%rax)
-	vmovups	%zmm5, 0x40(%rax)
+	vmovups	%zmm4, (%r11)
+	vmovups	%zmm5, 0x40(%r11)
 	jmp	L(check)
 
 L(preloop_large_bkw):