From patchwork Thu Jun  4 14:16:53 2015
X-Patchwork-Submitter: Ondřej Bílka
X-Patchwork-Id: 480749
Date: Thu, 4 Jun 2015 16:16:53 +0200
From: Ondřej Bílka
To: Richard Earnshaw
Cc: libc-alpha@sourceware.org, Andrew Pinski
Subject: [RFC] Aarch64: optimize stpcpy a bit.
Message-ID: <20150604141653.GA23376@domone>
References: <20150525101505.GA11233@domone> <20150525114545.GC11233@domone>
 <5570282D.4080509@foss.arm.com> <20150604122837.GA21337@domone>
 <5570565B.8040403@foss.arm.com>
In-Reply-To: <5570565B.8040403@foss.arm.com>

On Thu, Jun 04, 2015 at 02:44:59PM +0100, Richard Earnshaw wrote:
> On 04/06/15 13:28, Ondřej Bílka wrote:
> > On Thu, Jun 04, 2015 at 11:27:57AM +0100, Richard Earnshaw wrote:
> >> On 25/05/15 12:45, Ondřej Bílka wrote:
> >>> Replaces it with strcpy. One could argue that opposite way to replace
> >>> strcpy with stpcpy is faster.
> >>>
> >>> Reason is register pressure. Strcpy needs extra register to save return
> >>> value while stpcpy has return value already in register used for writing
> >>> terminating zero.
> >>
> >>
> >> Depends on your architecture.  On aarch64 we have plenty of spare
> >> registers, so strcpy simply copies the destination register into a
> >> scratch.  It then doesn't have to carefully calculate the return value
> >> at the end of the function (making the tail code simpler - there are
> >> multiple return statements, but only one entry point).
> >>
> > Thats correct, main saving you get is from return value is first register, that
> > forces needing extra copy which is suboptimal.
>
> No, look at the AArch64 code.
> The only time we ever end up with a
> simple MOV instruction to copy the register from one location to another
> is in the stPcpy code.  In strcpy it always ends up folded into some
> other operation that we have to do anyway.  Once it's been copied to
> that other register we never have to use it elsewhere again.
> Furthermore, the need to handle smallish copies with overlapping stores
> means we need both the original base address /and/ the final result, so
> we'd still need to end up saving it for stpcpy.
>
I wrote that too fast; I was referring to the fact that you would need
that copy for small sizes. With dest in a different register I could make
strcpy and stpcpy cost the same instructions in most cases, except for
sizes 8-15, by adjusting the offsets with some constants.

Also, I think you could remove the extra instructions from the stpcpy
loop with the following, which also removes one instruction from strcpy
if I read the code correctly.

diff --git a/sysdeps/aarch64/strcpy.S b/sysdeps/aarch64/strcpy.S
index 28846fb..7ca0412 100644
--- a/sysdeps/aarch64/strcpy.S
+++ b/sysdeps/aarch64/strcpy.S
@@ -204,6 +204,9 @@ L(bulk_entry):
 	sub	to_align, to_align, #16
 	stp	data1, data2, [dstin]
 	sub	src, srcin, to_align
+#ifdef BUILD_STPCPY
+# define dst dstin
+#endif
 	sub	dst, dstin, to_align
 	b	L(entry_no_page_cross)
 
@@ -243,17 +246,16 @@ L(entry_no_page_cross):
 #endif
 	rev	has_nul1, has_nul1
 	clz	pos, has_nul1
-	add	tmp1, pos, #72
-	add	pos, pos, #8
+	add	tmp1, pos, #64
 	csel	pos, pos, tmp1, ne
 	add	src, src, pos, lsr #3
 	add	dst, dst, pos, lsr #3
-	ldp	data1, data2, [src, #-32]
-	stp	data1, data2, [dst, #-16]
+	ldp	data1, data2, [src, #-31]
+	stp	data1, data2, [dst, #-15]
+	ret
 #ifdef BUILD_STPCPY
-	sub	dstin, dst, #1
+# undef dst
 #endif
-	ret
 
 L(page_cross):
 	bic	src, srcin, #15
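
For readers following the second hunk, here is a minimal C sketch of what I understand the rev/clz pair to compute: the bit offset of the first NUL byte in one 8-byte word, so that "pos, lsr #3" is its byte index. It assumes has_nul1 comes from the usual (x - 0x01..01) & ~(x | 0x7f..7f) zero-byte test, which is not part of this hunk, and it emulates a little-endian load. The names load_le64 and first_nul_index are made up for illustration; __builtin_bswap64 and __builtin_clzll are GCC/Clang builtins standing in for rev and clz.

```c
#include <assert.h>
#include <stdint.h>

#define REP8_01 0x0101010101010101ULL
#define REP8_7F 0x7f7f7f7f7f7f7f7fULL

/* Hypothetical helper: emulate a little-endian 8-byte load so the
   sketch gives the same answer on any host byte order.  */
static uint64_t load_le64(const unsigned char *p)
{
    uint64_t w = 0;
    for (int i = 0; i < 8; i++)
        w |= (uint64_t)p[i] << (8 * i);
    return w;
}

/* Hypothetical helper: index (0..7) of the first NUL byte among the
   8 bytes at p, or 8 if none of them is NUL.  */
static unsigned first_nul_index(const unsigned char *p)
{
    assert(p);
    uint64_t data = load_le64(p);

    /* Assumed zero-byte test: sets bit 7 of the first NUL byte's lane
       (lanes above it may also be flagged, which does not matter).  */
    uint64_t has_nul = (data - REP8_01) & ~(data | REP8_7F);
    if (has_nul == 0)
        return 8;

    /* rev: move the lowest-addressed byte to the most significant end.  */
    uint64_t reversed = __builtin_bswap64(has_nul);

    /* clz: leading zero bits == 8 * (index of first NUL byte).  */
    unsigned pos = (unsigned)__builtin_clzll(reversed);

    /* "lsr #3": convert the bit offset to a byte offset.  */
    return pos >> 3;
}
```

With pos no longer bumped by 8 before the shift, src and dst end up pointing at the NUL itself rather than one past it, which is why the load/store offsets in the hunk move by one.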