Message ID | 20150604141653.GA23376@domone |
---|---|
State | New |
Headers | show |
On 04/06/15 15:16, Ondřej Bílka wrote: > On Thu, Jun 04, 2015 at 02:44:59PM +0100, Richard Earnshaw wrote: >> On 04/06/15 13:28, Ondřej Bílka wrote: >>> On Thu, Jun 04, 2015 at 11:27:57AM +0100, Richard Earnshaw wrote: >>>> On 25/05/15 12:45, Ondřej Bílka wrote: >>>>> Replaces it with strcpy. One could argue that opposite way to replace >>>>> strcpy with stpcpy is faster. >>>>> >>>>> Reason is register pressure. Strcpy needs extra register to save return >>>>> value while stpcpy has return value already in register used for writing >>>>> terminating zero. >>>> >>>> >>>> Depends on your architecture. On aarch64 we have plenty of spare >>>> registers, so strcpy simply copies the destination register into a >>>> scratch. It then doesn't have to carefully calculate the return value >>>> at the end of the function (making the tail code simpler - there are >>>> multiple return statements, but only one entry point). >>>> >>> Thats correct, main saving you get is from return value is first register, that >>> forces needing extra copy which is suboptimal. >> >> No, look at the AArch64 code. The only time we ever end up with a >> simple MOV instruction to copy the register from one location to another >> is in the stPcpy code. In strcpy it always ends up folded into some >> other operation that we have to do anyway. Once it's been copied to >> that other register we never have to use it elsewhere again. >> Furthermore, the need to handle smallish copies with overlapping stores >> means we need both the original base address /and/ the final result, so >> we'd still need to end up saving it for stpcpy. >> > Wrote too fast, was refering that you would need to copy that on small > address. With dest in different register I could make strcpy and stpcpy > const same instructions in most cases except size 8-15 by adjusting > offsets with some constants. > > Also if I think that you could remove extra instructions for stpcpy loop > with following, which also removes one instruction from strcpy if I read > code correctly. > > diff --git a/sysdeps/aarch64/strcpy.S b/sysdeps/aarch64/strcpy.S > index 28846fb..7ca0412 100644 > --- a/sysdeps/aarch64/strcpy.S > +++ b/sysdeps/aarch64/strcpy.S > @@ -204,6 +204,9 @@ L(bulk_entry): > sub to_align, to_align, #16 > stp data1, data2, [dstin] > sub src, srcin, to_align > +#ifdef BUILD_STPCPY > +# define dst dstin > +#endif > sub dst, dstin, to_align > b L(entry_no_page_cross) > > @@ -243,17 +246,16 @@ L(entry_no_page_cross): > #endif > rev has_nul1, has_nul1 > clz pos, has_nul1 > - add tmp1, pos, #72 > - add pos, pos, #8 > + add tmp1, pos, #64 > csel pos, pos, tmp1, ne > add src, src, pos, lsr #3 > add dst, dst, pos, lsr #3 > - ldp data1, data2, [src, #-32] > - stp data1, data2, [dst, #-16] > + ldp data1, data2, [src, #-31] > + stp data1, data2, [dst, #-15] That's not valid, the offset has to be a multiple of the register size (8 in this case). > + ret > #ifdef BUILD_STPCPY > - sub dstin, dst, #1 > +# undef dst Nor is this, dst is already a #define, so this leaves it unspecified. But since you can't avoid the late subtract, that's irrelevant anyway. R. > #endif > - ret > > L(page_cross): > bic src, srcin, #15 >
diff --git a/sysdeps/aarch64/strcpy.S b/sysdeps/aarch64/strcpy.S index 28846fb..7ca0412 100644 --- a/sysdeps/aarch64/strcpy.S +++ b/sysdeps/aarch64/strcpy.S @@ -204,6 +204,9 @@ L(bulk_entry): sub to_align, to_align, #16 stp data1, data2, [dstin] sub src, srcin, to_align +#ifdef BUILD_STPCPY +# define dst dstin +#endif sub dst, dstin, to_align b L(entry_no_page_cross) @@ -243,17 +246,16 @@ L(entry_no_page_cross): #endif rev has_nul1, has_nul1 clz pos, has_nul1 - add tmp1, pos, #72 - add pos, pos, #8 + add tmp1, pos, #64 csel pos, pos, tmp1, ne add src, src, pos, lsr #3 add dst, dst, pos, lsr #3 - ldp data1, data2, [src, #-32] - stp data1, data2, [dst, #-16] + ldp data1, data2, [src, #-31] + stp data1, data2, [dst, #-15] + ret #ifdef BUILD_STPCPY - sub dstin, dst, #1 +# undef dst #endif - ret L(page_cross): bic src, srcin, #15