Message ID | mvm1se4romh.fsf@suse.de |
---|---|
State | New |
Headers | show |
Series | Don't write beyond destination in __mempcpy_avx512_no_vzeroupper (bug 23196) | expand |
On Tue, May 22, 2018 at 3:06 AM, Andreas Schwab <schwab@suse.de> wrote: > [BZ #23196] > CVE-2018-11237 > * sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S > (L(preloop_large)): Save initial destination pointer in %r11 and > use it instead of %rax after the loop. > * string/test-mempcpy.c (MIN_PAGE_SIZE): Define. Please include your analysis in commit message. > string/test-mempcpy.c | 1 + > sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S | 5 +++-- > 2 files changed, 4 insertions(+), 2 deletions(-) > > diff --git a/string/test-mempcpy.c b/string/test-mempcpy.c > index c08fba895e..d98ecdd2d9 100644 > --- a/string/test-mempcpy.c > +++ b/string/test-mempcpy.c > @@ -18,6 +18,7 @@ > <http://www.gnu.org/licenses/>. */ > > #define MEMCPY_RESULT(dst, len) (dst) + (len) > +#define MIN_PAGE_SIZE 131072 > #define TEST_MAIN > #define TEST_NAME "mempcpy" > #include "test-string.h" The modified test does't fail on Skylake server with unchanged memmove-avx512-no-vzeroupper.S. Can you modify the test so that it fails with the original memmove-avx512-no-vzeroupper.S on Skylake server? > diff --git a/sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S b/sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S > index 23c0f7a9ed..effc3ac2de 100644 > --- a/sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S > +++ b/sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S > @@ -336,6 +336,7 @@ L(preloop_large): > vmovups (%rsi), %zmm4 > vmovups 0x40(%rsi), %zmm5 > > + mov %rdi, %r11 > /* Align destination for access with non-temporal stores in the loop. */ > mov %rdi, %r8 > and $-0x80, %rdi > @@ -366,8 +367,8 @@ L(gobble_256bytes_nt_loop): > cmp $256, %rdx > ja L(gobble_256bytes_nt_loop) > sfence > - vmovups %zmm4, (%rax) > - vmovups %zmm5, 0x40(%rax) > + vmovups %zmm4, (%r11) > + vmovups %zmm5, 0x40(%r11) > jmp L(check) > > L(preloop_large_bkw): memmove-vec-unaligned-erms.S supports no vzeroupper: #ifndef VZEROUPPER # if VEC_SIZE > 16 # define VZEROUPPER vzeroupper # else # define VZEROUPPER # endif #endif Should it be used instead?
On Mai 22 2018, "H.J. Lu" <hjl.tools@gmail.com> wrote: > The modified test does't fail on Skylake server with unchanged > memmove-avx512-no-vzeroupper.S. Can you modify the test > so that it fails with the original memmove-avx512-no-vzeroupper.S > on Skylake server? How to do that? I have tested it on a Xeon Phi (whateve that means) where it failed without the fix. > memmove-vec-unaligned-erms.S supports no vzeroupper: > > #ifndef VZEROUPPER > # if VEC_SIZE > 16 > # define VZEROUPPER vzeroupper > # else > # define VZEROUPPER > # endif > #endif > > Should it be used instead? I don't understand what that has to do with the bug in question. Andreas.
On Tue, May 22, 2018 at 5:17 AM, Andreas Schwab <schwab@suse.de> wrote: > On Mai 22 2018, "H.J. Lu" <hjl.tools@gmail.com> wrote: > >> The modified test does't fail on Skylake server with unchanged >> memmove-avx512-no-vzeroupper.S. Can you modify the test >> so that it fails with the original memmove-avx512-no-vzeroupper.S >> on Skylake server? > > How to do that? I have tested it on a Xeon Phi (whateve that means) > where it failed without the fix. __mempcpy_avx512_no_vzeroupper is also tested on Skylake server: [hjl@gnu-skx-1 build-x86_64-linux]$ ./string/test-mempcpy simple_mempcpy __mempcpy_avx512_no_vzeroupper __mempcpy_avx512_unaligned __mempcpy_avx512_unaligned_erms __mempcpy_avx_unaligned__mempcpy_avx_unaligned_erms __mempcpy_ssse3_back __mempcpy_ssse3 __mempcpy_sse2_unaligned __mempcpy_sse2_unaligned_erms __mempcpy_erms [hjl@gnu-skx-1 build-x86_64-linux]$ It should also fail without the fix. >> memmove-vec-unaligned-erms.S supports no vzeroupper: >> >> #ifndef VZEROUPPER >> # if VEC_SIZE > 16 >> # define VZEROUPPER vzeroupper >> # else >> # define VZEROUPPER >> # endif >> #endif >> >> Should it be used instead? > > I don't understand what that has to do with the bug in question. __mempcpy_avx512_no_vzeroupper can be implemented with memmove-vec-unaligned-erms.S. It doesn't have the issue in memmove-avx512-no-vzeroupper.S.
On Mai 22 2018, "H.J. Lu" <hjl.tools@gmail.com> wrote: > It should also fail without the fix. Again, tell me how to do that. > __mempcpy_avx512_no_vzeroupper can be implemented with > memmove-vec-unaligned-erms.S. In which way? There is no definition of __mempcpy_avx512_no_vzeroupper in memmove-vec-unaligned-erms.S. Andreas.
On 05/22/2018 02:17 PM, Andreas Schwab wrote: > On Mai 22 2018, "H.J. Lu"<hjl.tools@gmail.com> wrote: > >> The modified test does't fail on Skylake server with unchanged >> memmove-avx512-no-vzeroupper.S. Can you modify the test >> so that it fails with the original memmove-avx512-no-vzeroupper.S >> on Skylake server? > How to do that? I have tested it on a Xeon Phi (whateve that means) > where it failed without the fix. It has to do with copy sizes and relation to the cache size. I had to adjust your reproducer to make it fail on KNL, basically increasing the copy size by a factor of 10. Thanks, Florian
On Tue, May 22, 2018 at 5:38 AM, Andreas Schwab <schwab@suse.de> wrote: > On Mai 22 2018, "H.J. Lu" <hjl.tools@gmail.com> wrote: > >> It should also fail without the fix. > > Again, tell me how to do that. You need to trigger this condition on all AVX512 machines. >> __mempcpy_avx512_no_vzeroupper can be implemented with >> memmove-vec-unaligned-erms.S. > > In which way? There is no definition of __mempcpy_avx512_no_vzeroupper > in memmove-vec-unaligned-erms.S. > memmove-vec-unaligned-erms.S is a template. It can be used to implement AVX512 memmove without vzeroupper.
On Mai 22 2018, "H.J. Lu" <hjl.tools@gmail.com> wrote: > On Tue, May 22, 2018 at 5:38 AM, Andreas Schwab <schwab@suse.de> wrote: >> On Mai 22 2018, "H.J. Lu" <hjl.tools@gmail.com> wrote: >> >>> It should also fail without the fix. >> >> Again, tell me how to do that. > > You need to trigger this condition on all AVX512 machines. Again, tell me how to do that. > memmove-vec-unaligned-erms.S is a template. It can be used to > implement AVX512 memmove without vzeroupper. How is that relevant? Andreas.
On Mai 22 2018, Florian Weimer <fweimer@redhat.com> wrote: > It has to do with copy sizes and relation to the cache size. I had to > adjust your reproducer to make it fail on KNL, basically increasing the > copy size by a factor of 10. What is the architectural maximum cache size? Andreas.
On 05/22/2018 03:21 PM, Andreas Schwab wrote: > On Mai 22 2018, Florian Weimer <fweimer@redhat.com> wrote: > >> It has to do with copy sizes and relation to the cache size. I had to >> adjust your reproducer to make it fail on KNL, basically increasing the >> copy size by a factor of 10. > > What is the architectural maximum cache size? My test machine had an L2 cache size of 1 MiB. I think the cut-over happens at half of that (at least for our backported version), which is why #define N 976999 in your reproducer word for me (along with -O0 -fno-builtin). Thanks, Florian
On Mai 22 2018, Florian Weimer <fweimer@redhat.com> wrote:
> My test machine had an L2 cache size of 1 MiB.
Then you should be safe, a 64K mempcpy is enough to trigger it.
Andreas.
On 05/22/2018 04:12 PM, Andreas Schwab wrote: > On Mai 22 2018, Florian Weimer <fweimer@redhat.com> wrote: > >> My test machine had an L2 cache size of 1 MiB. > > Then you should be safe, a 64K mempcpy is enough to trigger it. Our memcpy has: L(512bytesormore): #ifdef SHARED_CACHE_SIZE_HALF mov $SHARED_CACHE_SIZE_HALF, %r8 #else mov __x86_64_shared_cache_size_half(%rip), %r8 #endif cmp %r8, %rdx jae L(preloop_large) cmp $1024, %rdx ja L(1024bytesormore) And with the original reproducer, we wouldn't hit preloop_large, where the bug is. Thanks, Florian
On Mai 22 2018, Florian Weimer <fweimer@redhat.com> wrote: > And with the original reproducer, we wouldn't hit preloop_large, where the > bug is. Worksforme. $ string/test-mempcpy simple_mempcpy __mempcpy_avx512_no_vzeroupper __mempcpy_avx512_unaligned __mempcpy_avx512_unaligned_erms __mempcpy_avx_unaligned__mempcpy_avx_unaligned_erms __mempcpy_ssse3_back __mempcpy_ssse3 __mempcpy_sse2_unaligned __mempcpy_sse2_unaligned_erms __mempcpy_erms /mnt/src/libc/test/string/test-mempcpy: Wrong result in function __mempcpy_avx512_no_vzeroupper dst 0x7f7cf79f6000 "" src 0x7f7cf7a36000 "▒/F]t������,CZq������)@Wn������&=Tk������ #:Qh����� 7Ne|�����4Kby�����▒1H_v�����" len 65536 /mnt/src/libc/test/string/test-mempcpy: Wrong result in function __mempcpy_avx512_no_vzeroupper dst 0x7f7cf79f6000 "" src 0x7f7cf7a36010 "▒/F]t������,CZq������)@Wn������&=Tk������ #:Qh����� 7Ne|�����4Kby�����▒1H_v�����" len 65536 /mnt/src/libc/test/string/test-mempcpy: Wrong result in function __mempcpy_avx512_no_vzeroupper dst 0x7f7cf79f6010 "" src 0x7f7cf7a36000 "▒/F]t������,CZq������)@Wn������&=Tk������ #:Qh����� 7Ne|�����4Kby�����▒1H_v�����" len 65536 /mnt/src/libc/test/string/test-mempcpy: Wrong result in function __mempcpy_avx512_no_vzeroupper dst 0x7f7cf79f6010 "" src 0x7f7cf7a36010 "▒/F]t������,CZq������)@Wn������&=Tk������ #:Qh����� 7Ne|�����4Kby�����▒1H_v�����" len 65536 Andreas.
On 05/22/2018 05:06 PM, Andreas Schwab wrote: > On Mai 22 2018, Florian Weimer<fweimer@redhat.com> wrote: > >> And with the original reproducer, we wouldn't hit preloop_large, where the >> bug is. > Worksforme. > > $ string/test-mempcpy I don't doubt that. Our CPUs probably have different cache sizes, or there are differences in how different glibc versions compute the effective cache size. Thanks, Florian
On Mai 22 2018, Florian Weimer <fweimer@redhat.com> wrote: > On 05/22/2018 05:06 PM, Andreas Schwab wrote: >> On Mai 22 2018, Florian Weimer<fweimer@redhat.com> wrote: >> >>> And with the original reproducer, we wouldn't hit preloop_large, where the >>> bug is. >> Worksforme. >> >> $ string/test-mempcpy > > I don't doubt that. Our CPUs probably have different cache sizes, Same cache size. Andreas.
On Tue, May 22, 2018 at 8:38 AM, Andreas Schwab <schwab@suse.de> wrote: > On Mai 22 2018, Florian Weimer <fweimer@redhat.com> wrote: > >> On 05/22/2018 05:06 PM, Andreas Schwab wrote: >>> On Mai 22 2018, Florian Weimer<fweimer@redhat.com> wrote: >>> >>>> And with the original reproducer, we wouldn't hit preloop_large, where the >>>> bug is. >>> Worksforme. >>> >>> $ string/test-mempcpy >> >> I don't doubt that. Our CPUs probably have different cache sizes, > > Same cache size. Here is the testcase. Since both memcpy and memcpy share the the same body of code, we can't use RAX to refer the beginning of destination. Please update your patch with this info together this tesrcase. Thanks.
I have installed the patch now, please commit your test case. Andreas.
diff --git a/string/test-mempcpy.c b/string/test-mempcpy.c index c08fba895e..d98ecdd2d9 100644 --- a/string/test-mempcpy.c +++ b/string/test-mempcpy.c @@ -18,6 +18,7 @@ <http://www.gnu.org/licenses/>. */ #define MEMCPY_RESULT(dst, len) (dst) + (len) +#define MIN_PAGE_SIZE 131072 #define TEST_MAIN #define TEST_NAME "mempcpy" #include "test-string.h" diff --git a/sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S b/sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S index 23c0f7a9ed..effc3ac2de 100644 --- a/sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S +++ b/sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S @@ -336,6 +336,7 @@ L(preloop_large): vmovups (%rsi), %zmm4 vmovups 0x40(%rsi), %zmm5 + mov %rdi, %r11 /* Align destination for access with non-temporal stores in the loop. */ mov %rdi, %r8 and $-0x80, %rdi @@ -366,8 +367,8 @@ L(gobble_256bytes_nt_loop): cmp $256, %rdx ja L(gobble_256bytes_nt_loop) sfence - vmovups %zmm4, (%rax) - vmovups %zmm5, 0x40(%rax) + vmovups %zmm4, (%r11) + vmovups %zmm5, 0x40(%r11) jmp L(check) L(preloop_large_bkw):