Message ID | 20210201003014.785099-2-goldstein.w.n@gmail.com |
---|---|
State | New |
Series | [v2,1/2] x86: Refactor and improve performance of strchr-avx2.S |
On Sun, Jan 31, 2021 at 4:30 PM noah <goldstein.w.n@gmail.com> wrote:
>
> This patch adds additional benchmarks for string size of 4096 and
> several benchmarks for string size 256 with different alignments.
>
> [...]

Please make the similar changes in string/test-strchr.c.

Thanks.

diff --git a/benchtests/bench-strchr.c b/benchtests/bench-strchr.c
index bf493fe458..5fd98a5d43 100644
--- a/benchtests/bench-strchr.c
+++ b/benchtests/bench-strchr.c
@@ -100,9 +100,13 @@ do_test (size_t align, size_t pos, size_t len, int seek_char, int max_char)
   size_t i;
   CHAR *result;
   CHAR *buf = (CHAR *) buf1;
-  align &= 15;
+
+  align &= 127;
   if ((align + len) * sizeof (CHAR) >= page_size)
-    return;
+    {
+      return;
+    }
+
 
   for (i = 0; i < len; ++i)
     {
@@ -151,12 +155,24 @@ test_main (void)
       do_test (i, 16 << i, 2048, SMALL_CHAR, MIDDLE_CHAR);
     }
 
+  for (i = 1; i < 8; ++i)
+    {
+      do_test (0, 16 << i, 4096, SMALL_CHAR, MIDDLE_CHAR);
+      do_test (i, 16 << i, 4096, SMALL_CHAR, MIDDLE_CHAR);
+    }
+
   for (i = 1; i < 8; ++i)
     {
       do_test (i, 64, 256, SMALL_CHAR, MIDDLE_CHAR);
       do_test (i, 64, 256, SMALL_CHAR, BIG_CHAR);
     }
 
+  for (i = 0; i < 8; ++i)
+    {
+      do_test (16 * i, 256, 512, SMALL_CHAR, MIDDLE_CHAR);
+      do_test (16 * i, 256, 512, SMALL_CHAR, BIG_CHAR);
+    }
+
   for (i = 0; i < 32; ++i)
     {
       do_test (0, i, i + 1, SMALL_CHAR, MIDDLE_CHAR);
@@ -169,12 +185,24 @@ test_main (void)
       do_test (i, 16 << i, 2048, 0, MIDDLE_CHAR);
     }
 
+  for (i = 1; i < 8; ++i)
+    {
+      do_test (0, 16 << i, 4096, 0, MIDDLE_CHAR);
+      do_test (i, 16 << i, 4096, 0, MIDDLE_CHAR);
+    }
+
   for (i = 1; i < 8; ++i)
     {
       do_test (i, 64, 256, 0, MIDDLE_CHAR);
       do_test (i, 64, 256, 0, BIG_CHAR);
     }
 
+  for (i = 0; i < 8; ++i)
+    {
+      do_test (16 * i, 256, 512, 0, MIDDLE_CHAR);
+      do_test (16 * i, 256, 512, 0, BIG_CHAR);
+    }
+
   for (i = 0; i < 32; ++i)
     {
       do_test (0, i, i + 1, 0, MIDDLE_CHAR);

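The first hunk widens the alignment mask from 15 to 127. The sketch below is an illustration only, not code from the patch: it shows why the new `16 * i` alignment loop would presumably be pointless under the old mask, since every multiple of 16 masks down to 0. The helper names `clamp_align`, `distinct_alignments` and `case_fits` are hypothetical, and `page_size` is passed as a parameter here because the benchmark's real value is set elsewhere in the harness.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: apply the mask that do_test applies to `align`
   (15 before the patch, 127 after).  */
static size_t
clamp_align (size_t align, size_t mask)
{
  return align & mask;
}

/* Hypothetical helper: count the distinct alignments that the loop
   "16 * i for i = 0..7" produces once the mask has been applied.  */
static int
distinct_alignments (size_t mask)
{
  int seen[128] = { 0 };
  int count = 0;

  for (size_t i = 0; i < 8; ++i)
    {
      size_t a = clamp_align (16 * i, mask);
      assert (a < 128);
      if (!seen[a])
        {
          seen[a] = 1;
          ++count;
        }
    }
  return count;
}

/* Hypothetical helper mirroring the guard in do_test: returns 1 when the
   case would run, 0 when it would be skipped.  `page_size` is a parameter
   in this sketch rather than the benchmark's global.  */
static int
case_fits (size_t align, size_t len, size_t char_size, size_t page_size)
{
  return (align + len) * char_size < page_size;
}
```

With a mask of 15, `distinct_alignments` reports a single alignment (0); with 127 it reports all eight (0, 16, ..., 112), which matches the alignments listed for size 256 in the results table.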
This patch adds additional benchmarks for string size of 4096 and
several benchmarks for string size 256 with different alignments.

Signed-off-by: noah <goldstein.w.n@gmail.com>
---
Added 2 additional benchmark sizes:

4096: Just feels like a natural "large" size to test

256 with multiple alignments: This essentially is to test how
expensive the initial work prior to the 4x loop is depending on
different alignments.

results from bench-strchr: All times are in seconds and the median of
100 runs. Old is current strchr-avx2.S implementation. New is this
patch.

Summary: New is definitely faster for medium -> large sizes. Once the
4x loop is hit there is a 10%+ speedup and New always wins out. For
smaller sizes there is more variance as to which is faster and the
differences are small. Generally it seems the New version wins
out. This is likely because 0 - 31 sized strings are the fast path for
new (no jmp).

Benchmarking CPU:
Icelake: Intel(R) Core(TM) i7-1065G7 CPU @ 1.30GHz

size, algn, Old T , New T  -------- Win Dif
0   , 0   , 2.54  , 2.52   -------- New -0.02
1   , 0   , 2.57  , 2.52   -------- New -0.05
2   , 0   , 2.56  , 2.52   -------- New -0.04
3   , 0   , 2.58  , 2.54   -------- New -0.04
4   , 0   , 2.61  , 2.55   -------- New -0.06
5   , 0   , 2.65  , 2.62   -------- New -0.03
6   , 0   , 2.73  , 2.74   -------- Old -0.01
7   , 0   , 2.75  , 2.74   -------- New -0.01
8   , 0   , 2.62  , 2.6    -------- New -0.02
9   , 0   , 2.73  , 2.75   -------- Old -0.02
10  , 0   , 2.74  , 2.74   -------- Eq  N/A
11  , 0   , 2.76  , 2.72   -------- New -0.04
12  , 0   , 2.74  , 2.72   -------- New -0.02
13  , 0   , 2.75  , 2.72   -------- New -0.03
14  , 0   , 2.74  , 2.73   -------- New -0.01
15  , 0   , 2.74  , 2.73   -------- New -0.01
16  , 0   , 2.74  , 2.73   -------- New -0.01
17  , 0   , 2.74  , 2.74   -------- Eq  N/A
18  , 0   , 2.73  , 2.73   -------- Eq  N/A
19  , 0   , 2.73  , 2.73   -------- Eq  N/A
20  , 0   , 2.73  , 2.73   -------- Eq  N/A
21  , 0   , 2.73  , 2.72   -------- New -0.01
22  , 0   , 2.71  , 2.74   -------- Old -0.03
23  , 0   , 2.71  , 2.69   -------- New -0.02
24  , 0   , 2.68  , 2.67   -------- New -0.01
25  , 0   , 2.66  , 2.62   -------- New -0.04
26  , 0   , 2.64  , 2.62   -------- New -0.02
27  , 0   , 2.71  , 2.64   -------- New -0.07
28  , 0   , 2.67  , 2.69   -------- Old -0.02
29  , 0   , 2.72  , 2.72   -------- Eq  N/A
30  , 0   , 2.68  , 2.69   -------- Old -0.01
31  , 0   , 2.68  , 2.68   -------- Eq  N/A
32  , 0   , 3.51  , 3.52   -------- Old -0.01
32  , 1   , 3.52  , 3.51   -------- New -0.01
64  , 0   , 3.97  , 3.93   -------- New -0.04
64  , 2   , 3.95  , 3.9    -------- New -0.05
64  , 1   , 4.0   , 3.93   -------- New -0.07
64  , 3   , 3.97  , 3.88   -------- New -0.09
64  , 4   , 3.95  , 3.89   -------- New -0.06
64  , 5   , 3.94  , 3.9    -------- New -0.04
64  , 6   , 3.97  , 3.9    -------- New -0.07
64  , 7   , 3.97  , 3.91   -------- New -0.06
96  , 0   , 4.74  , 4.52   -------- New -0.22
128 , 0   , 5.29  , 5.19   -------- New -0.1
128 , 2   , 5.29  , 5.15   -------- New -0.14
128 , 3   , 5.31  , 5.22   -------- New -0.09
256 , 0   , 11.19 , 9.81   -------- New -1.38
256 , 3   , 11.19 , 9.84   -------- New -1.35
256 , 4   , 11.2  , 9.88   -------- New -1.32
256 , 16  , 11.21 , 9.79   -------- New -1.42
256 , 32  , 11.39 , 10.34  -------- New -1.05
256 , 48  , 11.88 , 10.56  -------- New -1.32
256 , 64  , 11.82 , 10.83  -------- New -0.99
256 , 80  , 11.85 , 10.86  -------- New -0.99
256 , 96  , 9.56  , 8.76   -------- New -0.8
256 , 112 , 9.55  , 8.9    -------- New -0.65
512 , 0   , 15.76 , 13.72  -------- New -2.04
512 , 4   , 15.72 , 13.74  -------- New -1.98
512 , 5   , 15.73 , 13.74  -------- New -1.99
1024, 0   , 24.85 , 21.33  -------- New -3.52
1024, 5   , 24.86 , 21.27  -------- New -3.59
1024, 6   , 24.87 , 21.32  -------- New -3.55
2048, 0   , 45.75 , 36.7   -------- New -9.05
2048, 6   , 43.91 , 35.42  -------- New -8.49
2048, 7   , 44.43 , 36.37  -------- New -8.06
4096, 0   , 96.94 , 81.34  -------- New -15.6
4096, 7   , 97.01 , 81.32  -------- New -15.69

 benchtests/bench-strchr.c | 32 ++++++++++++++++++++++++++++++--
 1 file changed, 30 insertions(+), 2 deletions(-)
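
The Dif column above is an absolute difference, so the "10%+" figure in the summary has to be worked out from the two timing columns. A minimal sketch of that arithmetic follows; `speedup_percent` is a hypothetical helper and not part of bench-strchr. Applied to the alignment-0 rows from size 256 upward, it comes out at roughly 12% to 20%, while the size 128 row stays under 2%.

```c
#include <assert.h>

/* Hypothetical helper: percentage by which new_t improves on old_t,
   e.g. old_t = 11.19 and new_t = 9.81 for the "256, 0" row.  */
static double
speedup_percent (double old_t, double new_t)
{
  assert (old_t > 0.0);
  return (old_t - new_t) / old_t * 100.0;
}
```

A relative figure like this is easier to compare across sizes than the raw Dif values, which grow with the string length.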