Message ID | 1512418610-84032-2-git-send-email-bhanuprakash.bodireddy@intel.com |
---|---|
State | Superseded |
Series | [ovs-dev,RFC,1/5] compiler: Introduce OVS_PREFETCH variants. |
On Mon, Dec 04, 2017 at 08:16:47PM +0000, Bhanuprakash Bodireddy wrote:
> Processors support prefetch instruction in anticipation of write but
> compilers (gcc) won't use them unless explicitly asked to do so even
> with '-march=native' specified.
>
> [Problem]
>   Case A:
>     OVS_PREFETCH_CACHE(addr, OPCH_HTW)
>       __builtin_prefetch(addr, 1, 3)
>         leaq -112(%rbp), %rax    [Assembly]
>         prefetchw (%rax)
>
>   Case B:
>     OVS_PREFETCH_CACHE(addr, OPCH_LTW)
>       __builtin_prefetch(addr, 1, 1)
>         leaq -112(%rbp), %rax    [Assembly]
>         prefetchw (%rax)         <***problem***>
>
>   In spite of specifying -march=native and using Low Temporal Write (OPCH_LTW),
>   the compiler generates 'prefetchw' instruction instead of 'prefetchwt1'
>   instruction available on processor.
>
> [Solution]
>   Include -mprefetchwt1
>
>   Case B:
>     OVS_PREFETCH_CACHE(addr, OPCH_LTW)
>       __builtin_prefetch(addr, 1, 1)
>         leaq -112(%rbp), %rax    [Assembly]
>         prefetchwt1 (%rax)
>
> [Testing]
>   $ ./boot.sh
>   $ ./configure
>     checking target hint for cgcc... x86_64
>     checking whether gcc accepts -mprefetchwt1... yes
>   $ make -j
>
> Signed-off-by: Bhanuprakash Bodireddy <bhanuprakash.bodireddy@intel.com>

Does this have any effect if the architecture or CPU configured for use does not support prefetchwt1? If it could lead to that situation, then this does not seem like the right thing to do, and we might want to fall back to recommending use of the option when the person building knows that the software will run on a machine with prefetchwt1.
>On Mon, Dec 04, 2017 at 08:16:47PM +0000, Bhanuprakash Bodireddy wrote:
>> [...]
>
>Does this have any effect if the architecture or CPU configured for use does
>not support prefetchwt1?

That's a good question and I spent a reasonable amount of time today figuring this out. I have Haswell, Broadwell and Skylake CPUs and they all support this instruction. But I found that this instruction isn't enabled by default even with -march=native, so it needs to be enabled explicitly.

Coming to your question, there won't be side effects from using OPCH_LTW.
On processors that support neither PREFETCHW nor PREFETCHWT1, the compiler generates a 'prefetcht1' instruction.
On processors that support PREFETCHW, the compiler generates a 'prefetchw' instruction.
On processors that support both PREFETCHW and PREFETCHWT1, the compiler generates a 'prefetchwt1' instruction when -mprefetchwt1 is explicitly enabled.

>If it could lead to that situation, then this does not
>seem like the right thing to do, and we might want to fall back to
>recommending use of the option when the person building knows that the
>software will run on a machine with prefetchwt1.

According to the above, on processors that don't support this instruction a 'prefetcht1' instruction would be generated, which has no side effects. I verified this using https://gcc.godbolt.org/ and carefully checking the instructions generated for different compiler versions and -march flags.

- Bhanuprakash.
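For reference, the prefetch hints discussed above come down to the second and third arguments of GCC's `__builtin_prefetch(addr, rw, locality)`: `rw = 1` asks for a prefetch in anticipation of a write, and `locality` ranges from 0 (no temporal locality) to 3 (high). The following is a minimal sketch only; the `DEMO_*` macros and `demo_store()` are hypothetical stand-ins and are not the actual OVS_PREFETCH_CACHE definition, which is not shown in this thread. Which instruction the compiler finally emits depends on the -march/-m flags in use.

```c
/* Hypothetical stand-ins for the high and low temporal write hints.
 * Arguments to __builtin_prefetch: address, rw (1 = prepare for write),
 * locality (0 = none ... 3 = high). rw and locality must be constants. */
#define DEMO_PREFETCH_HTW(addr) __builtin_prefetch((addr), 1, 3)
#define DEMO_PREFETCH_LTW(addr) __builtin_prefetch((addr), 1, 1)

/* Prefetches the slot for writing, then stores to it. The prefetch is
 * only a hint to the CPU and never changes the program's result. */
static int
demo_store(int *slot, int value)
{
    DEMO_PREFETCH_LTW(slot);
    *slot = value;
    return *slot;
}

/* Same store, using the high temporal locality hint instead. */
static int
demo_store_hot(int *slot, int value)
{
    DEMO_PREFETCH_HTW(slot);
    *slot = value;
    return *slot;
}
```

Compiling a caller of these functions with and without `-mprefetchwt1` and inspecting the output with `objdump -d` is one way to see which prefetch instruction was selected.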
On Mon, Dec 04, 2017 at 08:59:47PM +0000, Bodireddy, Bhanuprakash wrote:
> [...]
> I verified this using https://gcc.godbolt.org/ and carefully checking the instructions generated for different compiler versions and march flags.

OK. That is good reassurance, then, so:

Acked-by: Ben Pfaff <blp@ovn.org>

Beyond the comments I've made already, I don't expect to review this series myself. Thanks for all the work you've put into it.
>>On Mon, Dec 04, 2017 at 08:16:47PM +0000, Bhanuprakash Bodireddy wrote:
>>> [...]
>>
>>Does this have any effect if the architecture or CPU configured for use does
>>not support prefetchwt1?
>
> That's a good question and I spent reasonable time today to figure this out.
> I have Haswell, Broadwell and Skylake CPUs and they all support this instruction.

Hmm. I have 2 different Broadwell machines (Xeon E5 v4 and i7-6800K) and neither of them has the prefetchwt1 instruction according to cpuid:

    PREFETCHWT1 = false

This means that introducing this change will break binary compatibility even between CPUs of the same generation, i.e. I will not be able to run binaries compiled on your system on mine.

If that's true, I prefer not to have this change.

Anyway, adding this change will make it impossible to compile a generic binary for different platforms if your build server supports prefetchwt1. There should be a way to disable this arch-specific compiler flag even if it is supported on my current platform.

Best regards, Ilya Maximets.
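The `PREFETCHWT1 = false` line reported by the cpuid tool can also be checked from C. The sketch below assumes GCC or Clang on x86 with the `<cpuid.h>` helpers `__get_cpuid()` and `__get_cpuid_count()` available (older compilers may lack the latter). The bit positions follow the Intel SDM: PREFETCHWT1 is CPUID.(EAX=07H,ECX=0):ECX bit 0, and PREFETCHW is CPUID.80000001H:ECX bit 8. The `demo_*` function names are made up for illustration.

```c
#include <stdbool.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

/* Returns true if the running CPU advertises PREFETCHWT1
 * (CPUID leaf 7, subleaf 0, ECX bit 0). Always false on non-x86. */
static bool
demo_cpu_has_prefetchwt1(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax, ebx, ecx, edx;

    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)) {
        return false;           /* Leaf 7 is not supported at all. */
    }
    return (ecx & 1u) != 0;
#else
    return false;
#endif
}

/* Returns true if the running CPU advertises PREFETCHW
 * (CPUID extended leaf 0x80000001, ECX bit 8). Always false on non-x86. */
static bool
demo_cpu_has_prefetchw(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax, ebx, ecx, edx;

    if (!__get_cpuid(0x80000001u, &eax, &ebx, &ecx, &edx)) {
        return false;           /* Extended leaf is not supported. */
    }
    return (ecx & (1u << 8)) != 0;
#else
    return false;
#endif
}
```

Note that this reports what the machine running the binary supports, which is exactly what a configure-time check of `-mprefetchwt1` on a build server cannot tell you about the deployment target.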
>>> That's a good question and I spent reasonable time today to figure this out.
>>> I have Haswell, Broadwell and Skylake CPUs and they all support this
>>instruction.
>
>Hmm. I have 2 different Broadwell machines (Xeon E5 v4 and i7-6800K) and
>both of them doesn't have prefetchwt1 instruction according to cpuid:
>
>    PREFETCHWT1 = false

Xeon E5-26XX v4 is a Broadwell workstation/server part, but i7-6800K is a Skylake desktop variant, whereas E3-12XX v5 is the equivalent Skylake workstation/server variant. AFAIK, prefetchwt1 should be available on the above processors; I'm not sure why cpuid reports otherwise.

pmd_thread_main()
-------------------------------------------------------------------------
With OPCH_HTW, we see the prefetchw instruction.

    OVS_PREFETCH_CACHE(&pmd->cachelineC, OPCH_HTW);
    cycles_count_start(pmd);
    for (;;) {
        for (i = 0; i < poll_cnt; i++) {
            process_packets =
                dp_netdev_process_rxq_port(pmd, poll_list[i].rxq->rx,
                                           poll_list[i].port_no);
            cycles_count_intermediate(pmd, poll_list[i].rxq,

    Address    Source Line   Assembly
    0x6e29ef   4,086         movl 0x823ecb(%rip), %edi
    0x6e29f5   4,085         movq 0x50(%rsp), %rax
    0x6e29fa   4,086         test %edi, %edi
    0x6e29fc   4,085         prefetchwz (%rax)
-------------------------------------------------------------------------
With OPCH_LTW, we can see the prefetchwt1b instruction being used (change made to show this).

    OVS_PREFETCH_CACHE(&pmd->cachelineC, OPCH_LTW);
    cycles_count_start(pmd);
    for (;;) {
        for (i = 0; i < poll_cnt; i++) {
    ..........

    Address    Source Line   Assembly
    0x6e29ef   4,086         movl 0x823ecb(%rip), %edi
    0x6e29f5   4,085         movq 0x50(%rsp), %rax
    0x6e29fa   4,086         test %edi, %edi
    0x6e29fc   4,085         prefetchwt1b (%rax)
-------------------------------------------------------------------------

>This means that introducing of this change will break binary compatibility even
>between CPUs of the same generation, i.e. I will not be able to run on my
>system binaries compiled on yours.
>
>If it's true I prefer to not have this change.
>
>Anyway adding of this change will make compiling a generic binary for a
>different platforms impossible if your build server supports prefetchwt1.
>There should be way to disable this arch specific compiler flag even if it
>supported on my current platform.

I see your point: the build server can be more advanced and support the prefetchwt1 instruction, so how do the precompiled binaries behave when I copy them to, and run them on, a server that does not support it?

I'm not sure about this. Maybe Red Hat/Canonical developers can comment on how they handle this kind of case.

I will try to check this on my side.

- Bhanuprakash.
On 05.12.2017 16:54, Bodireddy, Bhanuprakash wrote:
> [...]
> Xeon E5-26XX v4 is Broadwell workstation/server but i7-6800k is Skylake Desktop variant where as E3-12XX v5 is equivalent skylake workstation/server variant.
> AFAIK, prefetchwt1 should be available on above processors, not sure why cpuid displays it otherwise.

That is totally weird. I tried to compile the following simple program:

    int main()
    {
        int c;

        __builtin_prefetch(&c, 1, 1);
        c = 8;

        return c;
    }

on my old Ivy Bridge i7-3770 CPU. It does not support even 'prefetchw':

    PREFETCHWT1 = false
    3DNow! PREFETCH/PREFETCHW instructions = false

Results:

    $ gcc 1.c
    $ objdump -S ./a.out | grep prefetch -A2 -B2
      40055b: 31 c0                   xor    %eax,%eax
      40055d: 48 8d 45 f4             lea    -0xc(%rbp),%rax
      400561: 0f 18 18                prefetcht2 (%rax)
      400564: c7 45 f4 08 00 00 00    movl   $0x8,-0xc(%rbp)
      40056b: 8b 45 f4                mov    -0xc(%rbp),%eax

    $ gcc 1.c -march=native
    $ objdump -S ./a.out | grep prefetch -A2 -B2
      40055b: 31 c0                   xor    %eax,%eax
      40055d: 48 8d 45 f4             lea    -0xc(%rbp),%rax
      400561: 0f 18 18                prefetcht2 (%rax)
      400564: c7 45 f4 08 00 00 00    movl   $0x8,-0xc(%rbp)
      40056b: 8b 45 f4                mov    -0xc(%rbp),%eax

    $ gcc 1.c -march=native -mprefetchwt1
    $ objdump -S ./a.out | grep prefetch -A2 -B2
      40055b: 31 c0                   xor    %eax,%eax
      40055d: 48 8d 45 f4             lea    -0xc(%rbp),%rax
      400561: 0f 0d 10                prefetchwt1 (%rax)
      400564: c7 45 f4 08 00 00 00    movl   $0x8,-0xc(%rbp)
      40056b: 8b 45 f4                mov    -0xc(%rbp),%eax

So, it inserts this instruction even if I have no such instruction in the CPU.

More interesting is that the program still works without any issues. I assume that the CPU just skips that instruction or executes something else.

So, it's really strange, and it's unclear what the CPU really executes when the code contains 'prefetchwt1' but the CPU does not support it.

If the CPU just skips this instruction, we will lose all the prefetching optimizations because all the calls will be replaced by the non-existent 'prefetchwt1'.

How can we be sure that 'prefetchwt1' was really executed?

Best regards, Ilya Maximets.
[...]
>int main()
>{
>    int c;
>
>    __builtin_prefetch(&c, 1, 1);
>    c = 8;
>
>    return c;
>}
>
>on my old Ivy Bridge i7-3770 CPU. It does not support even 'prefetchw':
>
>    PREFETCHWT1 = false
>    3DNow! PREFETCH/PREFETCHW instructions = false
>
>Results:

[Bhanu] I found https://gcc.godbolt.org/ the other day and it's handy for generating code for different targets and compilers.

>$ gcc 1.c
>$ objdump -S ./a.out | grep prefetch -A2 -B2
>  40055b: 31 c0                   xor    %eax,%eax
>  40055d: 48 8d 45 f4             lea    -0xc(%rbp),%rax
>  400561: 0f 18 18                prefetcht2 (%rax)
>  400564: c7 45 f4 08 00 00 00    movl   $0x8,-0xc(%rbp)
>  40056b: 8b 45 f4                mov    -0xc(%rbp),%eax

[Bhanu] Expected; the compiler generates prefetcht2.

>
>$ gcc 1.c -march=native
>$ objdump -S ./a.out | grep prefetch -A2 -B2
>  40055b: 31 c0                   xor    %eax,%eax
>  40055d: 48 8d 45 f4             lea    -0xc(%rbp),%rax
>  400561: 0f 18 18                prefetcht2 (%rax)
>  400564: c7 45 f4 08 00 00 00    movl   $0x8,-0xc(%rbp)
>  40056b: 8b 45 f4                mov    -0xc(%rbp),%eax

[Bhanu] Though -march=native is specified, the processor doesn't have the write-prefetch instructions, so the compiler still generates prefetcht2.

>$ gcc 1.c -march=native -mprefetchwt1
>$ objdump -S ./a.out | grep prefetch -A2 -B2
>  40055b: 31 c0                   xor    %eax,%eax
>  40055d: 48 8d 45 f4             lea    -0xc(%rbp),%rax
>  400561: 0f 0d 10                prefetchwt1 (%rax)
>  400564: c7 45 f4 08 00 00 00    movl   $0x8,-0xc(%rbp)
>  40056b: 8b 45 f4                mov    -0xc(%rbp),%eax

[Bhanu] The compiler inserts the prefetchwt1 instruction as we asked it to.

>
>So, it inserts this instruction even if I have no such instruction in CPU.

[Bhanu] Though the compiler generates this, since the instruction isn't available on the processor it just becomes a multi-byte no-operation (NOP). On processors that don't have prefetchw (Intel) or the 3DNow! feature (AMD), it decodes into a NOP.
http://ref.x86asm.net/coder64.html#x0F0D
  - Click on '0D' in the two-byte opcode index - (16. 0F0D NOP)
  - More information on this can be found in the Intel SW developer's manual (Combined Volumes)

>More interesting is that program still works without any issues.
>I assume that CPU just skips that instruction or executes something else.

[Bhanu] This is mostly what is expected. On processors that support prefetchwt1 it executes, and on others it just becomes a NOP.

>
>So, it's really strange and it's unclear what CPU really executes in case where
>we have 'prefetchwt1' in code but not supported by CPU.

[Bhanu] It's decoded into a NOP, maybe by the pipeline decoding units.

>
>If CPU just skips this instruction we will lost all the prefetching optimizations
>because all the calls will be replaced by non-existent 'prefetchwt1'.

[Bhanu] I would be worried if the core generated an exception, treating it as an illegal instruction. Instead, the pipeline units treat it as a NOP if the processor doesn't support it. So the micro-optimizations don't really do anything on processors that don't support it.

>
>How can we be sure that 'prefetchwt1' was really executed?

[Bhanu] I don't know how we can see this unless we can peek into the instruction queues and decoders of the pipeline :(.

- Bhanuprakash.
On 05.12.2017 19:19, Bodireddy, Bhanuprakash wrote:
> [...]
>> If CPU just skips this instruction we will lost all the prefetching optimizations
>> because all the calls will be replaced by non-existent 'prefetchwt1'.
>
> [Bhanu] I would be worried if core generates an exception treating it as illegal instruction. Instead pipeline units treat this as NOP if it doesn't support it.
> So the micro optimizations doesn't really do any thing on the processors that doesn't support it.

This could be an issue. If someday we have a real performance optimization based on an OPCH_HTW prefetch, we will have prefetchwt1 on systems that support it and a NOP on others, even if they have the usual prefetchw, which could provide a performance improvement too.

As I understand it, checking for '-mprefetchwt1' is equivalent to checking the compiler version. It doesn't check anything about whether the CPU supports this instruction. This could end up with non-working performance optimizations and even degradation on systems that support the usual prefetches but not prefetchwt1 (useless NOPs degrade performance if they are on a hot path).

IMHO, this compiler option should be passed only if the CPU really supports it. I guess the most we can do is add a note to the performance optimization guide that '-mprefetchwt1' can be passed via CFLAGS if the user is sure that it is supported by the target CPU.
> [...]
> IMHO, This compiler option should be passed only if CPU really supports it.
> I guess, the maximum that we can do is add a note into performance optimization
> guide that '-mprefetchwt1' could be passed via CFLAGS if user sure that it
> supported by target CPU.

That is my thinking as well. The people/organizations building OVS packages for deployment have the responsibility to specify the minimum requirements on the target architecture and feed that into the compiler using CFLAGS. That may well be leaning towards the lower end of capabilities to maximize compatibility and sacrifice some performance on high-end CPUs.

The specialized prefetch macros should be mapped to the best available target instructions by the compiler and/or conditional compile directives based on the CFLAGS architecture settings.

We would gather all these target-specific compiler optimization guidelines in the advanced DPDK documentation of OVS.

Of course developers or benchmark testers are free to use -march=native or similar at their discretion in their local test beds for best possible performance.

BR, Jan
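The "conditional compile directives based on the CFLAGS architecture settings" idea can be sketched as follows. This assumes, to the best of my knowledge, that GCC predefines `__PREFETCHWT1__` when `-mprefetchwt1` is in effect and `__PRFCHW__` when `-mprfchw` (or an `-march` that implies it) is in effect; compilers that do not know these options simply leave the macros undefined. All `demo_*` and `DEMO_*` names are hypothetical and are not part of OVS.

```c
/* Record, at compile time, which write-prefetch flavor the CFLAGS in
 * use allow the compiler to emit for a low temporal locality hint. */
#if defined(__PREFETCHWT1__)
#define DEMO_WRITE_PREFETCH_KIND "prefetchwt1"
#elif defined(__PRFCHW__)
#define DEMO_WRITE_PREFETCH_KIND "prefetchw"
#else
#define DEMO_WRITE_PREFETCH_KIND "generic"
#endif

/* Low temporal locality write prefetch. The compiler chooses the
 * instruction from the target settings; no CPU detection happens here. */
static inline void
demo_prefetch_low_temporal_write(const void *addr)
{
    __builtin_prefetch(addr, 1, 1);
}

/* Returns the flavor selected at build time, e.g. for logging. */
static const char *
demo_write_prefetch_kind(void)
{
    return DEMO_WRITE_PREFETCH_KIND;
}
```

With this arrangement the packager's CFLAGS alone decide the outcome: a conservative `-march` yields the generic path, while someone who knows the target supports the instruction can opt in explicitly.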
>> >> If CPU just skips this instruction we will lose all the prefetching
>> >> optimizations because all the calls will be replaced by non-existent 'prefetchwt1'.
>> >
>> > [Bhanu] I would be worried if the core generated an exception treating
>> > it as an illegal instruction. Instead, pipeline units treat this as a NOP
>> > if the processor doesn't support it.
>> > So the micro-optimizations don't really do anything on processors that don't support it.
>>
>> This could be an issue. If someday we have a real performance
>> optimization based on OPCH_HTW prefetch, we will have prefetchwt1 on
>> systems that support it and NOP on others even if they have the usual
>> prefetchw, which could provide a performance improvement too.

[Bhanu] Adding the information below only for future reference (I am going to point to this thread in the commit log).

On systems that have *only* prefetchw and no prefetchwt1 instruction:
  OPCH_LTW - prefetchw
  OPCH_MTW - prefetchw
  OPCH_HTW - prefetchw
  OPCH_NTW - prefetchw

On systems that support both prefetchw and prefetchwt1:
  OPCH_LTW - prefetchwt1
  OPCH_MTW - prefetchwt1
  OPCH_HTW - prefetchw
  OPCH_NTW - prefetchwt1

So OPCH_HTW would always be prefetchw, and LTW/MTW/NTW might turn into NOPs on processors that support prefetchw alone (when compiled with CFLAGS = -march=native -mprefetchwt1).

>> As I understand, checking for '-mprefetchwt1' is equivalent to checking
>> the compiler version. It doesn't check whether the CPU supports this
>> instruction.
>> This could end up with non-working performance optimizations and even
>> degradation on systems that support the usual prefetches but not
>> prefetchwt1 (useless NOPs degrade performance if they are on a hot
>> path).
>>
>> IMHO, this compiler option should be passed only if the CPU really supports it.
>> I guess the most we can do is add a note to the performance
>> optimization guide that '-mprefetchwt1' can be passed via CFLAGS if
>> the user is sure that it is supported by the target CPU.
>That is my thinking as well. The people/organizations building OVS packages
>for deployment have the responsibility to specify the minimum requirements
>on the target architecture and feed that into the compiler using CFLAGS. That
>may well be leaning towards the lower end of capabilities to maximize
>compatibility and sacrifice some performance on high-end CPUs.
>
>The specialized prefetch macros should be mapped to the best available
>target instructions by the compiler and/or conditional compile directives
>based on the CFLAGS architecture settings.
>
>We would gather all these target-specific compiler optimization guidelines in
>the advanced DPDK documentation of OVS.
>
>Of course developers or benchmark testers are free to use -march=native or
>similar at their discretion in their local test beds for best possible performance.

If the general view is to get rid of this flag at compilation and only document it, I am happy with that and can update the documentation. But I still think we are being too defensive here; with a few NOPs, the performance impact isn't even noticeable.

- Bhanuprakash.
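Since the configure check only probes the compiler, someone who wants to confirm what the target CPU advertises before adding `-mprefetchwt1` to CFLAGS could query CPUID. The sketch below is an assumption-laden illustration, not something proposed in this thread. It reads the PREFETCHWT1 feature bit (CPUID leaf 7, subleaf 0, ECX bit 0) through `__get_cpuid_count` from the compiler's `<cpuid.h>`, and it encodes the hint-to-instruction table from this mail as a small function. All `sketch_*` names are hypothetical.

```c
#include <stdbool.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

/* Hypothetical names mirroring the OPCH_*TW hints from the thread. */
enum sketch_hint { SKETCH_NTW, SKETCH_LTW, SKETCH_MTW, SKETCH_HTW };
enum sketch_insn { SKETCH_INSN_PREFETCHW, SKETCH_INSN_PREFETCHWT1 };

/* Encodes the table from this mail for a build targeting a CPU that has
 * prefetchw: HTW always maps to prefetchw; the other hints map to
 * prefetchwt1 only when that instruction is enabled. */
static enum sketch_insn
sketch_expected_insn(enum sketch_hint hint, bool has_prefetchwt1)
{
    if (hint == SKETCH_HTW || !has_prefetchwt1) {
        return SKETCH_INSN_PREFETCHW;
    }
    return SKETCH_INSN_PREFETCHWT1;
}

/* True if the running CPU advertises PREFETCHWT1.  Always false on
 * non-x86 builds or when CPUID leaf 7 is unavailable. */
static bool
sketch_cpu_has_prefetchwt1(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;

    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)) {
        return false;
    }
    return (ecx & 1u) != 0;
#else
    return false;
#endif
}
```

Running `sketch_cpu_has_prefetchwt1()` on each class of deployment machine gives a direct answer to the "does the CPU really support it" question, independent of what `-march=native` happens to enable on the build host.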
diff --git a/configure.ac b/configure.ac
index 6a8113a..8f4fbe2 100644
--- a/configure.ac
+++ b/configure.ac
@@ -171,6 +171,7 @@ OVS_CONDITIONAL_CC_OPTION([-Wno-unused], [HAVE_WNO_UNUSED])
 OVS_CONDITIONAL_CC_OPTION([-Wno-unused-parameter], [HAVE_WNO_UNUSED_PARAMETER])
 OVS_ENABLE_WERROR
 OVS_ENABLE_SPARSE
+OVS_ENABLE_OPTION([-mprefetchwt1])
 OVS_CTAGS_IDENTIFIERS

 AC_ARG_VAR(KARCH, [Kernel Architecture String])
Processors support prefetch instructions in anticipation of a write, but
compilers (gcc) won't use them unless explicitly asked to do so, even
with '-march=native' specified.

[Problem]
  Case A:
    OVS_PREFETCH_CACHE(addr, OPCH_HTW)
       __builtin_prefetch(addr, 1, 3)
         leaq    -112(%rbp), %rax        [Assembly]
         prefetchw  (%rax)

  Case B:
    OVS_PREFETCH_CACHE(addr, OPCH_LTW)
       __builtin_prefetch(addr, 1, 1)
         leaq    -112(%rbp), %rax        [Assembly]
         prefetchw  (%rax)               <***problem***>

  In spite of specifying -march=native and using Low Temporal Write (OPCH_LTW),
  the compiler generates the 'prefetchw' instruction instead of the 'prefetchwt1'
  instruction available on the processor.

[Solution]
  Include -mprefetchwt1

  Case B:
    OVS_PREFETCH_CACHE(addr, OPCH_LTW)
       __builtin_prefetch(addr, 1, 1)
         leaq    -112(%rbp), %rax        [Assembly]
         prefetchwt1  (%rax)

[Testing]
  $ ./boot.sh
  $ ./configure
     checking target hint for cgcc... x86_64
     checking whether gcc accepts -mprefetchwt1... yes
  $ make -j

Signed-off-by: Bhanuprakash Bodireddy <bhanuprakash.bodireddy@intel.com>
---
 configure.ac | 1 +
 1 file changed, 1 insertion(+)
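To reproduce Case A and Case B from the commit message outside the OVS tree, a small standalone file like the one below may be enough. It calls `__builtin_prefetch` directly with the same arguments the commit message shows for OPCH_HTW and OPCH_LTW. The function names are hypothetical, and the increments are there only so each function does observable work.

```c
/* Case A: write intent, high temporal locality (OPCH_HTW in the patch). */
void
sketch_case_a(int *slot)
{
    __builtin_prefetch(slot, 1, 3);
    *slot += 1;
}

/* Case B: write intent, low temporal locality (OPCH_LTW in the patch). */
void
sketch_case_b(int *slot)
{
    __builtin_prefetch(slot, 1, 1);
    *slot += 1;
}
```

Compiling this to assembly with `gcc -O2 -S -march=native`, and then again with `-mprefetchwt1` added, should show whether the instruction chosen for `sketch_case_b` changes. The result depends on the compiler version and on the machine the build runs on.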