Message ID | YSJtKKbBGoDI4hOd@gmail.com |
---|---|
State | New |
Series | PING [PATCH] x86: Update memcpy/memset inline strategies for -mtune=generic |
On Sun, Aug 22, 2021 at 8:28 AM H.J. Lu <hjl.tools@gmail.com> wrote:
>
> On Tue, Mar 23, 2021 at 09:19:38AM +0100, Richard Biener wrote:
> > On Tue, Mar 23, 2021 at 3:41 AM Hongyu Wang <wwwhhhyyy333@gmail.com> wrote:
> > >
> > > > Hongyue, please collect code size differences on SPEC CPU 2017 and
> > > > eembc.
> > >
> > > Here is code size difference for this patch
> >
> > Thanks, nothing too bad although slightly larger impacts than envisioned.
> >
>
> PING.
>
> OK for master branch?
>
> Thanks.
>
> H.J.
> ---
> Simplify memcpy and memset inline strategies to avoid branches for
> -mtune=generic:
>
> 1. With MOVE_RATIO and CLEAR_RATIO == 17, GCC will use integer/vector
>    load and store for up to 16 * 16 (256) bytes when the data size is
>    fixed and known.
> 2. Inline only if data size is known to be <= 256.
>    a. Use "rep movsb/stosb" with simple code sequence if the data size
>       is a constant.
>    b. Use loop if data size is not a constant.
> 3. Use memcpy/memset library function if data size is unknown or > 256.
>

PING:

https://gcc.gnu.org/pipermail/gcc-patches/2021-August/577889.html
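For readers skimming the thread, the three rules in the patch description can be restated as a small decision function. This is only a toy model in C under stated assumptions, not GCC's actual strategy selection code: `pick_strategy`, its parameters, and the enum names are hypothetical, and the by-pieces bound is approximated as (MOVE_RATIO - 1) * `move_bytes`, where `move_bytes` is assumed to be the widest single load/store available (for example 16 with SSE, 8 without it on a 64-bit target).

```c
/* Toy model of the rules in the patch description.  All names here are
   hypothetical; this is not GCC's implementation.  */
enum strategy { BY_PIECES, REP_BYTE, LOOP, LIBCALL };

/* is_constant: nonzero if the size is a compile-time constant.
   max_size:    upper bound on the size known to the compiler,
                or -1 if nothing is known about it.
   move_bytes:  assumed width in bytes of the widest single move.  */
static enum strategy
pick_strategy (int is_constant, long max_size, int move_bytes)
{
  const long inline_limit = 256;  /* Rules 2 and 3: inline only if <= 256.  */
  const int move_ratio = 17;      /* MOVE_RATIO / CLEAR_RATIO in the patch.  */

  /* Rule 3: unknown size or more than 256 bytes goes to the library.  */
  if (max_size < 0 || max_size > inline_limit)
    return LIBCALL;

  /* Rule 1 (approximation): a constant size that fits in fewer than
     MOVE_RATIO moves is expanded as plain loads and stores.  */
  if (is_constant && max_size <= (long) (move_ratio - 1) * move_bytes)
    return BY_PIECES;

  /* Rule 2a: remaining constant sizes use rep movsb/stosb.  */
  if (is_constant)
    return REP_BYTE;

  /* Rule 2b: bounded but non-constant sizes use a loop.  */
  return LOOP;
}
```

Under these assumptions, `pick_strategy (1, 249, 8)` yields `REP_BYTE`, which lines up with the new memcpy-strategy-12.c test (249 bytes with -mno-sse), and `pick_strategy (1, 257, 16)` yields `LIBCALL`, which lines up with memcpy-strategy-13.c.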
On Tue, Sep 7, 2021 at 8:01 PM H.J. Lu <hjl.tools@gmail.com> wrote:
>
> PING:
>
> https://gcc.gnu.org/pipermail/gcc-patches/2021-August/577889.html

PING.  This should fix:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102294
On Mon, Sep 13, 2021 at 6:38 AM H.J. Lu <hjl.tools@gmail.com> wrote:
>
> PING.  This should fix:
>
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102294

PING.
On Mon, Sep 20, 2021 at 10:06 AM H.J. Lu <hjl.tools@gmail.com> wrote:
>
> PING.

Any comments or objections to this patch?
diff --git a/gcc/config/i386/x86-tune-costs.h b/gcc/config/i386/x86-tune-costs.h
index ffe810f2bcb..30e7c3e4261 100644
--- a/gcc/config/i386/x86-tune-costs.h
+++ b/gcc/config/i386/x86-tune-costs.h
@@ -2844,19 +2844,28 @@ struct processor_costs intel_cost = {
   "16",                                 /* Func alignment.  */
 };

-/* Generic should produce code tuned for Core-i7 (and newer chips)
-   and btver1 (and newer chips).  */
+/* Generic should produce code tuned for Haswell (and newer chips)
+   and znver1 (and newer chips).  NB: rep_prefix_1_byte is used only
+   for known size.  */

 static stringop_algs generic_memcpy[2] = {
-  {libcall, {{32, loop, false}, {8192, rep_prefix_4_byte, false},
-             {-1, libcall, false}}},
-  {libcall, {{32, loop, false}, {8192, rep_prefix_8_byte, false},
-             {-1, libcall, false}}}};
+  {libcall,
+   {{256, rep_prefix_1_byte, true},
+    {256, loop, false},
+    {-1, libcall, false}}},
+  {libcall,
+   {{256, rep_prefix_1_byte, true},
+    {256, loop, false},
+    {-1, libcall, false}}}};
 static stringop_algs generic_memset[2] = {
-  {libcall, {{32, loop, false}, {8192, rep_prefix_4_byte, false},
-             {-1, libcall, false}}},
-  {libcall, {{32, loop, false}, {8192, rep_prefix_8_byte, false},
-             {-1, libcall, false}}}};
+  {libcall,
+   {{256, rep_prefix_1_byte, true},
+    {256, loop, false},
+    {-1, libcall, false}}},
+  {libcall,
+   {{256, rep_prefix_1_byte, true},
+    {256, loop, false},
+    {-1, libcall, false}}}};

 static const
 struct processor_costs generic_cost = {
   {
@@ -2913,7 +2922,7 @@ struct processor_costs generic_cost = {
   COSTS_N_INSNS (1),                    /* cost of movzx */
   8,                                    /* "large" insn */
   17,                                   /* MOVE_RATIO */
-  6,                                    /* CLEAR_RATIO */
+  17,                                   /* CLEAR_RATIO */
   {6, 6, 6},                            /* cost of loading integer registers
                                            in QImode, HImode and SImode.
                                            Relative to reg-reg move (2).  */
diff --git a/gcc/config/i386/x86-tune.def b/gcc/config/i386/x86-tune.def
index 8f55da89c92..a9a023f33f5 100644
--- a/gcc/config/i386/x86-tune.def
+++ b/gcc/config/i386/x86-tune.def
@@ -273,7 +273,7 @@ DEF_TUNE (X86_TUNE_SINGLE_STRINGOP, "single_stringop", m_386 | m_P4_NOCONA)
    move/set sequences of bytes with known size.  */
 DEF_TUNE (X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB,
           "prefer_known_rep_movsb_stosb",
-          m_SKYLAKE | m_ALDERLAKE | m_CORE_AVX512)
+          m_SKYLAKE | m_ALDERLAKE | m_CORE_AVX512 | m_GENERIC)

 /* X86_TUNE_MISALIGNED_MOVE_STRING_PRO_EPILOGUES: Enable generation of
    compact prologues and epilogues by issuing a misaligned moves.  This
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-strategy-12.c b/gcc/testsuite/gcc.target/i386/memcpy-strategy-12.c
new file mode 100644
index 00000000000..e9998b70ab2
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-strategy-12.c
@@ -0,0 +1,9 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mtune=generic -mno-sse" } */
+/* { dg-final { scan-assembler "rep movsb" } } */
+
+void
+foo (char *dest, char *src)
+{
+  __builtin_memcpy (dest, src, 249);
+}
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-strategy-13.c b/gcc/testsuite/gcc.target/i386/memcpy-strategy-13.c
new file mode 100644
index 00000000000..109bd675a51
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-strategy-13.c
@@ -0,0 +1,11 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mtune=generic -mno-avx" } */
+/* { dg-final { scan-assembler "jmp\tmemcpy" { target { ! ia32 } } } } */
+/* { dg-final { scan-assembler "call\tmemcpy" { target ia32 } } } */
+/* { dg-final { scan-assembler-not "rep movsb" } } */
+
+void
+foo (char *dest, char *src)
+{
+  __builtin_memcpy (dest, src, 257);
+}
diff --git a/gcc/testsuite/gcc.target/i386/memset-strategy-10.c b/gcc/testsuite/gcc.target/i386/memset-strategy-10.c
new file mode 100644
index 00000000000..685d6e5a5c2
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-strategy-10.c
@@ -0,0 +1,11 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mtune=generic -mno-avx" } */
+/* { dg-final { scan-assembler "jmp\tmemset" { target { ! ia32 } } } } */
+/* { dg-final { scan-assembler "call\tmemset" { target ia32 } } } */
+/* { dg-final { scan-assembler-not "rep stosb" } } */
+
+void
+foo (char *dest)
+{
+  __builtin_memset (dest, 0, 257);
+}
diff --git a/gcc/testsuite/gcc.target/i386/memset-strategy-11.c b/gcc/testsuite/gcc.target/i386/memset-strategy-11.c
new file mode 100644
index 00000000000..61ee463a8cf
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-strategy-11.c
@@ -0,0 +1,9 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mtune=generic -mno-sse" } */
+/* { dg-final { scan-assembler "rep stosb" } } */
+
+void
+foo (char *dest)
+{
+  __builtin_memset (dest, 0, 253);
+}
diff --git a/gcc/testsuite/gcc.target/i386/shrink_wrap_1.c b/gcc/testsuite/gcc.target/i386/shrink_wrap_1.c
index 94dadd6cdbd..44fe7d2836e 100644
--- a/gcc/testsuite/gcc.target/i386/shrink_wrap_1.c
+++ b/gcc/testsuite/gcc.target/i386/shrink_wrap_1.c
@@ -1,5 +1,5 @@
 /* { dg-do compile { target { ! ia32 } } } */
-/* { dg-options "-O2 -fdump-rtl-pro_and_epilogue" } */
+/* { dg-options "-O2 -mmemset-strategy=rep_8byte:-1:align -fdump-rtl-pro_and_epilogue" } */

 enum machine_mode
 {
diff --git a/gcc/testsuite/gcc.target/i386/sw-1.c b/gcc/testsuite/gcc.target/i386/sw-1.c
index a9c89fca4ec..234db0e67c2 100644
--- a/gcc/testsuite/gcc.target/i386/sw-1.c
+++ b/gcc/testsuite/gcc.target/i386/sw-1.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -mtune=generic -fshrink-wrap -fdump-rtl-pro_and_epilogue" } */
+/* { dg-options "-O2 -mtune=generic -mstringop-strategy=rep_byte -fshrink-wrap -fdump-rtl-pro_and_epilogue" } */
 /* { dg-additional-options "-mno-avx" { target ia32 } } */
 /* { dg-skip-if "No shrink-wrapping preformed" { x86_64-*-mingw* } } */
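
To try the new thresholds outside the testsuite, something like the following standalone file may help. It is a sketch that mirrors the sizes used in the new tests (249 and 253 bytes below the 256-byte limit, 257 above it); the function names and the file name `strategy-check.c` are hypothetical. With a compiler carrying this patch, building it with `gcc -O2 -mtune=generic -mno-sse -S strategy-check.c` would be expected to show `rep movsb` / `rep stosb` for the first two functions and a `memcpy` call or tail call for the third, as the dg-final directives in the patch check. Without the patch the output will differ.

```c
/* strategy-check.c (hypothetical file name).
   Inspect the generated assembly, for example with:
     gcc -O2 -mtune=generic -mno-sse -S strategy-check.c  */

/* 249 bytes: constant size at or below the 256-byte inline limit.  */
void
copy_249 (char *dest, const char *src)
{
  __builtin_memcpy (dest, src, 249);
}

/* 253 bytes: constant size at or below the 256-byte inline limit.  */
void
clear_253 (char *dest)
{
  __builtin_memset (dest, 0, 253);
}

/* 257 bytes: above the inline limit, so a library call is expected.  */
void
copy_257 (char *dest, const char *src)
{
  __builtin_memcpy (dest, src, 257);
}
```

Whatever strategy the compiler picks, the observable behaviour of these functions must stay the same; only the emitted instruction sequence changes.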