| Message ID | bb264395301754f43b77ddec68a16dd34220abb4.1484326337.git.naveen.n.rao@linux.vnet.ibm.com |
|---|---|
| State | Changes Requested, archived |
| Delegated to: | David Miller |
From: Naveen N. Rao
> Sent: 13 January 2017 17:10
> Generate instructions to perform the endian conversion using registers,
> rather than generating two memory accesses.
>
> The "way easier and faster" comment was obviously for the author, not
> the processor.

That rather depends on whether the processor has a store to load forwarder
that will satisfy the read from the store buffer.
I don't know about ppc, but at least some x86 will do that.

	David
On 2017/01/13 05:17PM, David Laight wrote:
> From: Naveen N. Rao
> > Sent: 13 January 2017 17:10
> > Generate instructions to perform the endian conversion using registers,
> > rather than generating two memory accesses.
> >
> > The "way easier and faster" comment was obviously for the author, not
> > the processor.
>
> That rather depends on whether the processor has a store to load forwarder
> that will satisfy the read from the store buffer.
> I don't know about ppc, but at least some x86 will do that.

Interesting - good to know that.

However, I don't think powerpc does that and in-register swap is likely
faster regardless. Note also that gcc prefers this form at higher
optimization levels.

Thanks,
Naveen
On Fri, 2017-01-13 at 23:22 +0530, 'Naveen N. Rao' wrote:
> > That rather depends on whether the processor has a store to load forwarder
> > that will satisfy the read from the store buffer.
> > I don't know about ppc, but at least some x86 will do that.
>
> Interesting - good to know that.
>
> However, I don't think powerpc does that and in-register swap is likely
> faster regardless. Note also that gcc prefers this form at higher
> optimization levels.

Of course powerpc has a load-store forwarder these days. However, I
wouldn't be surprised if the in-register form was still faster on some
implementations, but this needs to be tested.

Ideally, you'd want to try to "optimize" load+swap or swap+store though.

Cheers,
Ben.
On 2017/01/15 09:00AM, Benjamin Herrenschmidt wrote:
> On Fri, 2017-01-13 at 23:22 +0530, 'Naveen N. Rao' wrote:
> > > That rather depends on whether the processor has a store to load forwarder
> > > that will satisfy the read from the store buffer.
> > > I don't know about ppc, but at least some x86 will do that.
> >
> > Interesting - good to know that.
> >
> > However, I don't think powerpc does that and in-register swap is likely
> > faster regardless. Note also that gcc prefers this form at higher
> > optimization levels.
>
> Of course powerpc has a load-store forwarder these days, however, I
> wouldn't be surprised if the in-register form was still faster on some
> implementations, but this needs to be tested.

Thanks for clarifying! To test this, I wrote a simple (perhaps naive) test
that just issues a whole lot of endian swaps, and in _that_ test, it does
look like the load-store forwarder is doing pretty well.

The tests:

bpf-bswap.S:
-----------
	.file	"bpf-bswap.S"
	.abiversion 2
	.section ".text"
	.align 2
	.globl main
	.type main, @function
main:
	mflr	0
	std	0,16(1)
	stdu	1,-32760(1)
	addi	3,1,32
	li	4,0
	li	5,32720
	li	11,32720
	mulli	11,11,8
	li	10,0
	li	7,16
1:	ldx	6,3,4
	stdx	6,1,7
	ldbrx	6,1,7
	stdx	6,3,4
	addi	4,4,8
	cmpd	4,5
	beq	2f
	b	1b
2:	addi	10,10,1
	li	4,0
	cmpd	10,11
	beq	3f
	b	1b
3:	li	3,0
	addi	1,1,32760
	ld	0,16(1)
	mtlr	0
	blr

bpf-bswap-reg.S:
---------------
	.file	"bpf-bswap-reg.S"
	.abiversion 2
	.section ".text"
	.align 2
	.globl main
	.type main, @function
main:
	mflr	0
	std	0,16(1)
	stdu	1,-32760(1)
	addi	3,1,32
	li	4,0
	li	5,32720
	li	11,32720
	mulli	11,11,8
	li	10,0
1:	ldx	6,3,4
	rldicl	7,6,32,32
	rlwinm	8,6,24,0,31
	rlwimi	8,6,8,8,15
	rlwinm	9,7,24,0,31
	rlwimi	8,6,8,24,31
	rlwimi	9,7,8,8,15
	rlwimi	9,7,8,24,31
	rldicr	8,8,32,31
	or	6,8,9
	stdx	6,3,4
	addi	4,4,8
	cmpd	4,5
	beq	2f
	b	1b
2:	addi	10,10,1
	li	4,0
	cmpd	10,11
	beq	3f
	b	1b
3:	li	3,0
	addi	1,1,32760
	ld	0,16(1)
	mtlr	0
	blr

Profiling the two variants:

# perf stat ./bpf-bswap

 Performance counter stats for './bpf-bswap':

       1395.979224      task-clock (msec)         #    0.999 CPUs utilized
                 0      context-switches          #    0.000 K/sec
                 0      cpu-migrations            #    0.000 K/sec
                45      page-faults               #    0.032 K/sec
     4,651,874,673      cycles                    #    3.332 GHz                     (66.87%)
         3,141,186      stalled-cycles-frontend   #    0.07% frontend cycles idle    (50.57%)
     1,117,289,485      stalled-cycles-backend    #   24.02% backend cycles idle     (50.57%)
     8,565,963,861      instructions              #    1.84  insn per cycle
                                                  #    0.13  stalled cycles per insn (67.05%)
     2,174,029,771      branches                  # 1557.351 M/sec                   (49.69%)
           262,656      branch-misses             #    0.01% of all branches         (50.05%)

       1.396893189 seconds time elapsed

# perf stat ./bpf-bswap-reg

 Performance counter stats for './bpf-bswap-reg':

       1819.758102      task-clock (msec)         #    0.999 CPUs utilized
                 3      context-switches          #    0.002 K/sec
                 0      cpu-migrations            #    0.000 K/sec
                44      page-faults               #    0.024 K/sec
     6,034,777,602      cycles                    #    3.316 GHz                     (66.83%)
         2,010,983      stalled-cycles-frontend   #    0.03% frontend cycles idle    (50.47%)
     1,024,975,759      stalled-cycles-backend    #   16.98% backend cycles idle     (50.52%)
    16,043,732,849      instructions              #    2.66  insn per cycle
                                                  #    0.06  stalled cycles per insn (67.01%)
     2,148,710,750      branches                  # 1180.767 M/sec                   (49.57%)
           268,046      branch-misses             #    0.01% of all branches         (49.52%)

       1.821501345 seconds time elapsed

This is all in a POWER8 VM. On POWER7, the in-register variant is around
4 times faster than the ldbrx variant.

So, yes, unless I've missed something, the ldbrx variant seems to perform
better than, if not on par with, the in-register swap variant on POWER8.

> Ideally, you'd want to try to "optimize" load+swap or swap+store
> though.

Agreed. This is already the case with BPF for packet access - those use
skb helpers which issue the appropriate lhbrx/lwbrx/ldbrx. The newer
BPF_FROM_LE/BPF_FROM_BE are for endian operations within BPF programs.
We can probably implement an extra pass to detect use of endian swaps
and try to match them up with a previous load or a subsequent store,
though...

Thanks!
- Naveen
From: 'Naveen N. Rao'
> Sent: 23 January 2017 19:22
> On 2017/01/15 09:00AM, Benjamin Herrenschmidt wrote:
> > On Fri, 2017-01-13 at 23:22 +0530, 'Naveen N. Rao' wrote:
> > > > That rather depends on whether the processor has a store to load forwarder
> > > > that will satisfy the read from the store buffer.
> > > > I don't know about ppc, but at least some x86 will do that.
> > >
> > > Interesting - good to know that.
> > >
> > > However, I don't think powerpc does that and in-register swap is likely
> > > faster regardless. Note also that gcc prefers this form at higher
> > > optimization levels.
> >
> > Of course powerpc has a load-store forwarder these days, however, I
> > wouldn't be surprised if the in-register form was still faster on some
> > implementations, but this needs to be tested.
>
> Thanks for clarifying! To test this, I wrote a simple (perhaps naive)
> test that just issues a whole lot of endian swaps and in _that_ test, it
> does look like the load-store forwarder is doing pretty well.
...
> This is all in a POWER8 vm. On POWER7, the in-register variant is around
> 4 times faster than the ldbrx variant.
...

I wonder which is faster on the little 1GHz embedded ppc we use here.

	David
On 2017/01/24 04:13PM, David Laight wrote:
> From: 'Naveen N. Rao'
> > Sent: 23 January 2017 19:22
> > On 2017/01/15 09:00AM, Benjamin Herrenschmidt wrote:
> > > Of course powerpc has a load-store forwarder these days, however, I
> > > wouldn't be surprised if the in-register form was still faster on some
> > > implementations, but this needs to be tested.
> >
> > Thanks for clarifying! To test this, I wrote a simple (perhaps naive)
> > test that just issues a whole lot of endian swaps and in _that_ test, it
> > does look like the load-store forwarder is doing pretty well.
> ...
> > This is all in a POWER8 vm. On POWER7, the in-register variant is around
> > 4 times faster than the ldbrx variant.
> ...
>
> I wonder which is faster on the little 1GHz embedded ppc we use here.

Worth a test, for sure. FWIW, this patch won't matter there, since the
eBPF JIT is for ppc64.

Thanks,
Naveen
diff --git a/arch/powerpc/net/bpf_jit_comp64.c b/arch/powerpc/net/bpf_jit_comp64.c
index 1e313db..0413a89 100644
--- a/arch/powerpc/net/bpf_jit_comp64.c
+++ b/arch/powerpc/net/bpf_jit_comp64.c
@@ -599,16 +599,22 @@ static int bpf_jit_build_body(struct bpf_prog *fp, u32 *image,
 			break;
 		case 64:
 			/*
-			 * Way easier and faster(?) to store the value
-			 * into stack and then use ldbrx
+			 * We'll split it up into two words, swap those
+			 * independently and then merge them back.
 			 *
-			 * ctx->seen will be reliable in pass2, but
-			 * the instructions generated will remain the
-			 * same across all passes
+			 * First up, let's swap the most-significant word.
 			 */
-			PPC_STD(dst_reg, 1, bpf_jit_stack_local(ctx));
-			PPC_ADDI(b2p[TMP_REG_1], 1, bpf_jit_stack_local(ctx));
-			PPC_LDBRX(dst_reg, 0, b2p[TMP_REG_1]);
+			PPC_RLDICL(b2p[TMP_REG_1], dst_reg, 32, 32);
+			PPC_RLWINM(b2p[TMP_REG_2], b2p[TMP_REG_1], 8, 0, 31);
+			PPC_RLWIMI(b2p[TMP_REG_2], b2p[TMP_REG_1], 24, 0, 7);
+			PPC_RLWIMI(b2p[TMP_REG_2], b2p[TMP_REG_1], 24, 16, 23);
+			/* Then, the second half */
+			PPC_RLWINM(b2p[TMP_REG_1], dst_reg, 8, 0, 31);
+			PPC_RLWIMI(b2p[TMP_REG_1], dst_reg, 24, 0, 7);
+			PPC_RLWIMI(b2p[TMP_REG_1], dst_reg, 24, 16, 23);
+			/* Merge back */
+			PPC_RLDICR(dst_reg, b2p[TMP_REG_1], 32, 31);
+			PPC_OR(dst_reg, dst_reg, b2p[TMP_REG_2]);
 			break;
 		}
 		break;
Generate instructions to perform the endian conversion using registers,
rather than generating two memory accesses.

The "way easier and faster" comment was obviously for the author, not
the processor.

Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
---
 arch/powerpc/net/bpf_jit_comp64.c | 22 ++++++++++++++--------
 1 file changed, 14 insertions(+), 8 deletions(-)