| Message ID | 20150726131622.GA10623@domone |
|---|---|
| State | New |
On Sun, Jul 26, 2015 at 6:16 AM, Ondřej Bílka <neleai@seznam.cz> wrote:
> On Sat, Jul 11, 2015 at 04:50:02PM -0700, H.J. Lu wrote:
>> On Sat, Jul 11, 2015 at 01:27:42PM -0700, H.J. Lu wrote:
>>> On Sat, Jul 11, 2015 at 12:46:54PM +0200, Ondřej Bílka wrote:
>>>> On Thu, Jul 09, 2015 at 09:07:24AM -0700, H.J. Lu wrote:
>>>>> On Thu, Jul 9, 2015 at 7:28 AM, Ondřej Bílka <neleai@seznam.cz> wrote:
>>>>>> On Thu, Jul 09, 2015 at 07:12:24AM -0700, H.J. Lu wrote:
>>>>>>> On Thu, Jul 9, 2015 at 6:37 AM, Zamyatin, Igor <igor.zamyatin@intel.com> wrote:
>>>>>>>>> On Wed, Jul 8, 2015 at 8:56 AM, Zamyatin, Igor <igor.zamyatin@intel.com> wrote:
>>>>>>>>>> Fixed in the attached patch.
>>>>>>>>>
>>>>>>>>> I fixed some typos and updated sysdeps/i386/configure for
>>>>>>>>> HAVE_MPX_SUPPORT. Please verify both with and without
>>>>>>>>> HAVE_MPX_SUPPORT on i386 and x86-64.
>>>>>>>>
>>>>>>>> Done, all works fine.
>>>>>>>
>>>>>>> I checked it in for you.
>>>>>>
>>>>>> These are nice, but you could have the same problem with lazy TLS
>>>>>> allocation. I wrote a patch to merge the trampolines, which now
>>>>>> conflicts. Could you write a similar patch to solve that? The
>>>>>> original purpose was to always save the xmm registers so we could
>>>>>> use sse2 routines, which speeds up lookup time.
>>>>>
>>>>> So we will preserve only xmm0 to xmm7 in _dl_runtime_resolve? How
>>>>> much gain will that give us?
>>>>
>>>> I couldn't measure that without a patch. The gain now would be big,
>>>> since we currently use a byte-by-byte loop to check the symbol name,
>>>> which is slow, especially with C++ name mangling. Would the
>>>> following benchmark be good for measuring the speedup, or do I need
>>>> to measure startup time, which is a bit harder?
>>>
>>> Please try this.
>>
>> We have to use movups instead of movaps due to
>>
>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58066
>>
> Thanks, this looks promising.
>
> I am thinking about how to do a definitive benchmark. So far I have
> evidence that this is likely an improvement, but nothing definite.
>
> The benchmark I intended to use causes too much noise, and I haven't
> gotten anything useful from it yet. It created 1000 functions in a
> library and called them from main, and performance varied between runs
> by a factor of 3 for the same implementation.
>
> I do have indirect evidence. With the attached patch to use sse2
> routines, I decreased the startup time of the binaries run by
> "make bench" by ~6000 cycles, and dlopen time by 4%, on Haswell and
> Ivy Bridge.
>
> See the Haswell results of
>
> LD_DEBUG=statistics make bench &> old_rtld
>
> which are large, so you can browse them here:
>
> http://kam.mff.cuni.cz/~ondra/old_rtld
> http://kam.mff.cuni.cz/~ondra/new_rtld
>
> For the dlopen benchmark I measure, ten times, the performance of
>
> dlsym(RTLD_DEFAULT,"memcpy");
> dlsym(RTLD_DEFAULT,"strlen");
>
> Without the patch I get
>
> 624.49 559.58 556.6 556.04 558.42 557.86 559.46 555.17 556.93 555.32
>
> and with the patch
>
> 604.71 536.74 536.08 535.78 534.11 533.67 534.8 534.8 533.46 536.08
>
> I attached WIP patches; I didn't change memcpy yet.
>
> So if you have an idea how to measure the fixup change directly, it
> would be welcome.

There is a potential performance issue. This won't change parameters
passed in 256-bit/512-bit vector registers, because an SSE load only
updates the lower 128 bits of a 256-bit/512-bit vector register while
preserving the upper bits. But these SSE load operations may not be
fast on all current and future processors. To load the entire
256-bit/512-bit vector registers, we would need to check the CPU
feature in each symbol lookup. On the other hand, we can compile the
x86-64 ld.so with -msse2. I don't know what the final performance
impact would be.
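A standalone sketch of the architectural point above: a legacy
(non-VEX) SSE load such as MOVUPD writes only the low 128 bits of a
vector register and leaves the upper bits untouched. The file name and
build line are illustrative, not part of the patch:

```c
/* Demonstrate that a legacy SSE load preserves the upper half of ymm0.
   Build on x86-64 with an AVX-capable CPU: gcc demo.c && ./a.out  */
#include <stdio.h>

int
main (void)
{
  double src256[4] = { 1.0, 2.0, 3.0, 4.0 };
  double src128[2] = { -1.0, -2.0 };
  double dst[4];

  __asm__ volatile ("vmovupd (%0), %%ymm0\n\t" /* ymm0 = 1 2 3 4 */
                    "movupd (%1), %%xmm0\n\t"  /* legacy SSE: low 128 bits */
                    "vmovupd %%ymm0, (%2)"     /* spill all 256 bits */
                    :
                    : "r" (src256), "r" (src128), "r" (dst)
                    : "xmm0", "memory");

  /* Prints "-1 -2 3 4": the SSE load replaced the low lanes, but the
     upper 128 bits of ymm0 survived.  */
  printf ("%g %g %g %g\n", dst[0], dst[1], dst[2], dst[3]);
  return 0;
}
```

Mixing VEX and legacy SSE instructions like this is also exactly the
case where some microarchitectures pay a state-transition penalty,
which is the performance concern raised above.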
On Sun, Jul 26, 2015 at 12:38:20PM -0700, H.J. Lu wrote:
> [earlier quoted thread snipped]
>
> There is a potential performance issue. This won't change parameters
> passed in 256-bit/512-bit vector registers, because an SSE load only
> updates the lower 128 bits of a 256-bit/512-bit vector register while
> preserving the upper bits. But these SSE load operations may not be
> fast on all current and future processors. To load the entire
> 256-bit/512-bit vector registers, we would need to check the CPU
> feature in each symbol lookup. On the other hand, we can compile the
> x86-64 ld.so with -msse2. I don't know what the final performance
> impact would be.
>
Yes, these should be saved because of the problems with modes. There
could be a problem that saving them takes longer. You don't need to
check CPU features on each call: make _dl_runtime_resolve a function
pointer and initialize it to the correct variant at startup.

It depends on whether the more complicated code is worth the gain. On
the other hand, I read the original code and found it was also about
fixing bug 15128, where an ifunc resolver clobbers xmm registers; there
is a similar problem with LD_AUDIT.

These could be solved with this approach, or by saving the xmm
registers only when an ifunc resolver is called. I don't know what the
most maintainable solution is.
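To make the bug 15128 hazard concrete: an ifunc resolver runs in the
middle of lazy symbol resolution, and nothing stops it from executing
vectorized code. A minimal hypothetical example, with function names
and bodies invented for illustration:

```c
/* The resolver below runs during lazy binding.  Its memset may be
   compiled to SSE stores, so unless _dl_runtime_resolve saves and
   restores the vector registers, it can clobber xmm0-xmm7 while they
   still hold the caller's floating-point arguments.  */
#include <string.h>

static double add_one (double x) { return x + 1.0; }

static void *
resolve_f (void)
{
  char scratch[256];
  /* Stands in for any real work a resolver might do; a compiler may
     vectorize it (or elide it at high optimization levels).  */
  memset (scratch, 0, sizeof scratch);
  return (void *) add_one;
}

/* GCC ifunc syntax: calls to f are dispatched through resolve_f.  */
double f (double) __attribute__ ((ifunc ("resolve_f")));
```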
On Mon, Jul 27, 2015 at 3:10 AM, Ondřej Bílka <neleai@seznam.cz> wrote:
> [earlier quoted thread snipped]
>
> Yes, these should be saved because of the problems with modes. There
> could be a problem that saving them takes longer. You don't need to
> check CPU features on each call: make _dl_runtime_resolve a function
> pointer and initialize it to the correct variant at startup.

One more indirect call.

> It depends on whether the more complicated code is worth the gain. On
> the other hand, I read the original code and found it was also about
> fixing bug 15128, where an ifunc resolver clobbers xmm registers;
> there is a similar problem with LD_AUDIT.
>
> These could be solved with this approach, or by saving the xmm
> registers only when an ifunc resolver is called. I don't know what the
> most maintainable solution is.

That is true. Can you try compiling ld.so with -msse2?
On Mon, Jul 27, 2015 at 06:14:07AM -0700, H.J. Lu wrote:
> [earlier quoted thread snipped]
>
>> You don't need to check CPU features on each call: make
>> _dl_runtime_resolve a function pointer and initialize it to the
>> correct variant at startup.
>
> One more indirect call.
>
No, my proposal is different. We could do this:

void *_dl_runtime_resolve;

void
startup (void)
{
  if (has_avx ())
    _dl_runtime_resolve = _dl_runtime_resolve_avx;
  else
    _dl_runtime_resolve = _dl_runtime_resolve_sse;
}

Then we will assign the correct variant.
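has_avx () is left undefined in the sketch above. A minimal user-space
version might look like the following, assuming GCC's <cpuid.h>
(glibc's real detection lives in its own cpu-feature code):

```c
/* Hypothetical has_avx () for the sketch above.  AVX is usable only if
   the CPU advertises it *and* the OS saves YMM state, which is checked
   through XGETBV.  */
#include <cpuid.h>
#include <stdbool.h>

static bool
has_avx (void)
{
  unsigned int eax, ebx, ecx, edx;

  if (!__get_cpuid (1, &eax, &ebx, &ecx, &edx))
    return false;
  /* CPUID.1:ECX bit 27 = OSXSAVE, bit 28 = AVX.  */
  if ((ecx & (bit_OSXSAVE | bit_AVX)) != (bit_OSXSAVE | bit_AVX))
    return false;

  /* XCR0 bits 1 and 2: the OS saves XMM and YMM state.  */
  unsigned int xcr0_lo, xcr0_hi;
  __asm__ ("xgetbv" : "=a" (xcr0_lo), "=d" (xcr0_hi) : "c" (0));
  (void) xcr0_hi;
  return (xcr0_lo & 6) == 6;
}
```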
On Mon, Jul 27, 2015 at 6:26 AM, Ondřej Bílka <neleai@seznam.cz> wrote:
> [earlier quoted thread snipped]
>
> No, my proposal is different. We could do this:
>
> void *_dl_runtime_resolve;
>
> void
> startup (void)
> {
>   if (has_avx ())
>     _dl_runtime_resolve = _dl_runtime_resolve_avx;
>   else
>     _dl_runtime_resolve = _dl_runtime_resolve_sse;
> }
>
> Then we will assign the correct variant.

Yes, this may work for both _dl_runtime_profile and
_dl_runtime_resolve. I will see what I can do.
On Mon, Jul 27, 2015 at 6:37 AM, H.J. Lu <hjl.tools@gmail.com> wrote:
> [earlier quoted thread snipped]
>
> Yes, this may work for both _dl_runtime_profile and
> _dl_runtime_resolve. I will see what I can do.
>
Please try the hjl/pr18661 branch. I implemented:

0000000000016fd0 t _dl_runtime_profile_avx
0000000000016b50 t _dl_runtime_profile_avx512
0000000000017450 t _dl_runtime_profile_sse
00000000000168d0 t _dl_runtime_resolve_avx
0000000000016780 t _dl_runtime_resolve_avx512
0000000000016a20 t _dl_runtime_resolve_sse
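Selecting among these variants happens once per object when lazy
binding is set up: on x86-64 the PLT jumps through GOT[2] for lazy
resolution. A rough sketch of the idea, where the variant names come
from the listing above but the helper and feature flags are
hypothetical, not the branch's actual code:

```c
/* Sketch: install the chosen lazy-binding trampoline into GOT[2].  */
#include <link.h>   /* for ElfW */

extern void _dl_runtime_resolve_sse (void);
extern void _dl_runtime_resolve_avx (void);
extern void _dl_runtime_resolve_avx512 (void);

static void
select_resolver (ElfW(Addr) *got, int cpu_has_avx, int cpu_has_avx512)
{
  if (cpu_has_avx512)
    got[2] = (ElfW(Addr)) &_dl_runtime_resolve_avx512;
  else if (cpu_has_avx)
    got[2] = (ElfW(Addr)) &_dl_runtime_resolve_avx;
  else
    got[2] = (ElfW(Addr)) &_dl_runtime_resolve_sse;
}
```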
On Tue, Jul 28, 2015 at 1:55 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
> [earlier quoted thread snipped]
>
> Please try the hjl/pr18661 branch. I implemented:
>
> 0000000000016fd0 t _dl_runtime_profile_avx
> 0000000000016b50 t _dl_runtime_profile_avx512
> 0000000000017450 t _dl_runtime_profile_sse
> 00000000000168d0 t _dl_runtime_resolve_avx
> 0000000000016780 t _dl_runtime_resolve_avx512
> 0000000000016a20 t _dl_runtime_resolve_sse

I enabled SSE in ld.so and it works fine.
On Tue, Jul 28, 2015 at 7:03 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
> [earlier quoted thread snipped]
>
> I enabled SSE in ld.so and it works fine.
>
I fully enabled SSE in ld.so on the hjl/pr18661 branch. I didn't notice
any issues on either AVX or non-AVX machines. There may be some
performance improvement, but it is hard to tell.
diff --git a/benchtests/Makefile b/benchtests/Makefile
index 8e615e5..9e82e43 100644
--- a/benchtests/Makefile
+++ b/benchtests/Makefile
@@ -30,7 +30,7 @@ bench-pthread := pthread_once
 bench := $(bench-math) $(bench-pthread)
 
 # String function benchmarks.
-string-bench := bcopy bzero memccpy memchr memcmp memcpy memmem memmove \
+string-bench := dlopen bcopy bzero memccpy memchr memcmp memcpy memmem memmove \
 		mempcpy memset rawmemchr stpcpy stpncpy strcasecmp strcasestr \
 		strcat strchr strchrnul strcmp strcpy strcspn strlen \
 		strncasecmp strncat strncmp strncpy strnlen strpbrk strrchr \
@@ -57,6 +57,7 @@ CFLAGS-bench-ffsll.c += -fno-builtin
 
 bench-malloc := malloc-thread
 
+$(objpfx)bench-dlopen: -ldl
 $(addprefix $(objpfx)bench-,$(bench-math)): $(libm)
 $(addprefix $(objpfx)bench-,$(bench-pthread)): $(shared-thread-library)
 $(objpfx)bench-malloc-thread: $(shared-thread-library)
diff --git a/benchtests/bench-dlopen.c b/benchtests/bench-dlopen.c
new file mode 100644
index 0000000..b47d18b
--- /dev/null
+++ b/benchtests/bench-dlopen.c
@@ -0,0 +1,47 @@
+/* Measure dlsym lookup time.
+   Copyright (C) 2013-2015 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#include <dlfcn.h>
+#include <string.h>
+#include "bench-timing.h"
+
+int
+main (void)
+{
+  long ret = 0;
+  size_t i, j, iters = 100;
+  timing_t start, stop, cur;
+
+  for (j = 0; j < 10; j++)
+    {
+      TIMING_NOW (start);
+      for (i = 0; i < iters; ++i)
+	{
+	  ret += (long) dlsym (RTLD_DEFAULT, "strlen");
+	  ret += (long) dlsym (RTLD_DEFAULT, "memcpy");
+	}
+      TIMING_NOW (stop);
+
+      TIMING_DIFF (cur, start, stop);
+      TIMING_PRINT_MEAN ((double) cur, (double) iters);
+    }
+
+  return ret;
+}