From patchwork Thu Oct 5 17:16:35 2017 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Siddhesh Poyarekar X-Patchwork-Id: 821972 Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@bilbo.ozlabs.org Authentication-Results: ozlabs.org; spf=pass (mailfrom) smtp.mailfrom=sourceware.org (client-ip=209.132.180.131; helo=sourceware.org; envelope-from=libc-alpha-return-85479-incoming=patchwork.ozlabs.org@sourceware.org; receiver=) Authentication-Results: ozlabs.org; dkim=pass (1024-bit key; secure) header.d=sourceware.org header.i=@sourceware.org header.b="l5ds7KXp"; dkim-atps=neutral Received: from sourceware.org (server1.sourceware.org [209.132.180.131]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by ozlabs.org (Postfix) with ESMTPS id 3y7KFv59YJz9t5b for ; Fri, 6 Oct 2017 04:17:23 +1100 (AEDT) DomainKey-Signature: a=rsa-sha1; c=nofws; d=sourceware.org; h=list-id :list-unsubscribe:list-subscribe:list-archive:list-post :list-help:sender:from:to:cc:subject:date:message-id:in-reply-to :references; q=dns; s=default; b=dIoO7BgaZiXK7wme3utx7bLEe+ZySrp KFDI+XsBG25Cnk0ZChLxbEhuIuRaaiPDYNHx6F+Op0qCK8RENjRYbd8GFGpvz+0o bzIGlwKK/H/HfJVqEt4StgJ2dJAmdQt9C24vOB8ScTk/46/TV8kgjdrrGanvFcTi 4/qSbYXo+130= DKIM-Signature: v=1; a=rsa-sha1; c=relaxed; d=sourceware.org; h=list-id :list-unsubscribe:list-subscribe:list-archive:list-post :list-help:sender:from:to:cc:subject:date:message-id:in-reply-to :references; s=default; bh=t/bD/jrl7CaoL208iLMNPgbprLQ=; b=l5ds7 KXppklagPRbPUt3vrsIgOLbA2TMk7v3QtK2vjle1DnFq0GKpRnTKqJo/bx+yn3OV IM6+QAQoBPnWOJ0Ul6jWm0Aha7CB0Ss/qrTXvKa+3Om86ePHB1OrNZfE/m8mKUsb GMBDyWmKA7CvF8dCD55bvoMSdvTpRY1xnXx0es= Received: (qmail 68490 invoked by alias); 5 Oct 2017 17:17:01 -0000 Mailing-List: contact libc-alpha-help@sourceware.org; run by ezmlm Precedence: bulk List-Id: List-Unsubscribe: List-Subscribe: List-Archive: List-Post: List-Help: , Sender: libc-alpha-owner@sourceware.org Delivered-To: mailing list libc-alpha@sourceware.org Received: (qmail 68384 invoked by uid 89); 5 Oct 2017 17:17:00 -0000 Authentication-Results: sourceware.org; auth=none X-Virus-Found: No X-Spam-SWARE-Status: No, score=-24.8 required=5.0 tests=BAYES_00, GIT_PATCH_0, GIT_PATCH_1, GIT_PATCH_2, GIT_PATCH_3, KAM_ASCII_DIVIDERS, RCVD_IN_DNSWL_NONE, RCVD_IN_SORBS_SPAM, SPF_NEUTRAL autolearn=ham version=3.3.2 spammy=5399, 911, 15.10, 4972 X-HELO: homiemail-a51.g.dreamhost.com From: Siddhesh Poyarekar To: libc-alpha@sourceware.org Cc: adhemerval.zanella@linaro.org, szabolcs.nagy@arm.com Subject: [PATCHv3 2/2] aarch64: Reduce zva prologue to 64 bytes to reduce one instruction Date: Thu, 5 Oct 2017 22:46:35 +0530 Message-Id: <1507223795-4893-2-git-send-email-siddhesh@sourceware.org> In-Reply-To: <1507223795-4893-1-git-send-email-siddhesh@sourceware.org> References: <1507223795-4893-1-git-send-email-siddhesh@sourceware.org> The current zva copy of 64 bytes has a prologue and epilogue of 128 bytes, which is quite suboptimal for falkor as well as mustang (the two arm machines I have access to). Dropping it to 64 bytes, which is the mininmum alignment required for 64 byte zva to work correctly results in a decent gain in performance for falkor as well as mustang. For falkor the timings reduction against generic memset goes up to about 54% in the best case, i.e. over twice as fast compared to 40% in the previous case. Function: memset Variant: walk simple_memset __memset_nozva __memset_zva_64 __memset_generic ======================================================================================================================== length=256, char=0: 35933.10 (-705.25%) 2458.49 ( 44.91%) 2052.48 ( 54.00%) 4462.35 length=257, char=0: 36241.40 (-709.07%) 2449.31 ( 45.32%) 2088.25 ( 53.38%) 4479.40 length=258, char=0: 36506.50 (-707.37%) 2533.37 ( 43.97%) 2073.68 ( 54.14%) 4521.66 length=259, char=0: 36766.80 (-710.04%) 2573.23 ( 43.31%) 2088.39 ( 53.99%) 4538.88 length=260, char=0: 36958.30 (-711.39%) 2638.15 ( 42.08%) 2098.50 ( 53.93%) 4554.95 length=261, char=0: 37384.90 (-717.06%) 2635.56 ( 42.40%) 2112.04 ( 53.84%) 4575.56 length=262, char=0: 37582.20 (-720.77%) 2661.22 ( 41.88%) 2126.52 ( 53.56%) 4578.89 length=263, char=0: 37893.80 (-724.00%) 2705.17 ( 41.18%) 2155.46 ( 53.13%) 4598.76 length=264, char=0: 38126.30 (-720.58%) 2748.52 ( 40.84%) 2148.82 ( 53.75%) 4646.25 length=265, char=0: 38430.50 (-725.42%) 2789.58 ( 40.08%) 2159.59 ( 53.62%) 4655.89 length=266, char=0: 38699.40 (-728.77%) 2850.63 ( 38.95%) 2170.97 ( 53.51%) 4669.48 length=267, char=0: 38956.10 (-733.21%) 2896.94 ( 38.04%) 2182.96 ( 53.31%) 4675.43 length=268, char=0: 39198.60 (-734.03%) 2965.66 ( 36.90%) 2226.63 ( 52.62%) 4699.92 length=269, char=0: 39542.90 (-738.41%) 3069.46 ( 34.92%) 2229.59 ( 52.73%) 4716.43 length=270, char=0: 39822.00 (-737.24%) 3071.14 ( 35.43%) 2221.98 ( 53.28%) 4756.34 length=271, char=0: 40095.60 (-739.78%) 3116.24 ( 34.73%) 2234.95 ( 53.19%) 4774.54 length=512, char=0: 137512.00 (-1279.92%) 8993.82 ( 9.75%) 6227.36 ( 37.51%) 9965.19 length=513, char=0: 138047.00 (-1282.33%) 9076.86 ( 9.11%) 6187.04 ( 38.05%) 9986.55 length=514, char=0: 138744.00 (-1275.79%) 9174.97 ( 9.02%) 6207.71 ( 38.44%) 10084.70 length=515, char=0: 139094.00 (-1276.45%) 9249.32 ( 8.47%) 6225.89 ( 38.39%) 10105.30 length=516, char=0: 139488.00 (-1286.59%) 9387.34 ( 6.68%) 6248.57 ( 37.89%) 10059.80 length=517, char=0: 140146.00 (-1292.00%) 9406.52 ( 6.61%) 6253.62 ( 37.91%) 10072.00 length=518, char=0: 140728.00 (-1284.75%) 9495.81 ( 6.56%) 6274.71 ( 38.26%) 10162.70 length=519, char=0: 141207.00 (-1294.93%) 9570.67 ( 5.46%) 6358.77 ( 37.18%) 10122.90 length=520, char=0: 141825.00 (-1298.09%) 9800.91 ( 3.38%) 6316.39 ( 37.73%) 10144.20 length=521, char=0: 142328.00 (-1299.56%) 9870.21 ( 2.94%) 6336.67 ( 37.69%) 10169.50 length=522, char=0: 142876.00 (-1291.83%) 9936.09 ( 3.21%) 6357.73 ( 38.07%) 10265.30 length=523, char=0: 143333.00 (-1302.71%) 9946.90 ( 2.66%) 6447.33 ( 36.90%) 10218.30 length=524, char=0: 143793.00 (-1304.24%) 10093.50 ( 1.43%) 6403.34 ( 37.47%) 10239.90 length=525, char=0: 144453.00 (-1298.91%) 10122.80 ( 1.97%) 6418.25 ( 37.84%) 10326.10 length=526, char=0: 145077.00 (-1299.05%) 10208.90 ( 1.55%) 6440.72 ( 37.89%) 10369.70 length=527, char=0: 145490.00 (-1310.72%) 10346.70 ( -0.32%) 6461.50 ( 37.35%) 10313.20 length=1024, char=0: 537455.00 (-2112.45%) 34630.10 (-42.56%) 20549.10 ( 15.41%) 24292.30 length=1025, char=0: 538527.00 (-2114.96%) 34632.10 (-42.44%) 20584.50 ( 15.34%) 24313.20 length=1026, char=0: 539582.00 (-2115.53%) 34993.10 (-43.68%) 20620.10 ( 15.33%) 24354.50 length=1027, char=0: 540599.00 (-2115.10%) 35129.10 (-43.94%) 20644.70 ( 15.41%) 24405.20 length=1028, char=0: 541682.00 (-2117.35%) 35523.70 (-45.41%) 20685.00 ( 15.33%) 24429.30 length=1029, char=0: 542909.00 (-2119.46%) 35735.10 (-46.09%) 20741.20 ( 15.21%) 24461.30 length=1030, char=0: 543762.00 (-2119.68%) 35845.00 (-46.32%) 20760.60 ( 15.25%) 24497.30 length=1031, char=0: 544860.00 (-2119.82%) 36079.80 (-46.99%) 20798.00 ( 15.27%) 24545.20 length=1032, char=0: 545893.00 (-2120.69%) 36166.20 (-47.12%) 20829.20 ( 15.27%) 24582.10 length=1033, char=0: 546847.00 (-2121.45%) 36366.10 (-47.73%) 20877.20 ( 15.19%) 24616.70 length=1034, char=0: 548095.00 (-2124.11%) 36520.50 (-48.20%) 20911.00 ( 15.15%) 24643.30 length=1035, char=0: 549395.00 (-2125.90%) 36769.10 (-48.97%) 20940.70 ( 15.16%) 24681.90 length=1036, char=0: 549914.00 (-2124.01%) 36847.90 (-49.02%) 20979.40 ( 15.15%) 24726.20 length=1037, char=0: 551102.00 (-2126.26%) 37040.50 (-49.63%) 21018.10 ( 15.09%) 24754.60 length=1038, char=0: 552181.00 (-2127.43%) 37019.80 (-49.33%) 21052.80 ( 15.08%) 24790.10 length=1039, char=0: 553138.00 (-2127.14%) 37184.30 (-49.72%) 21086.90 ( 15.10%) 24836.20 * sysdeps/aarch64/memset.S (do_zva_64): Set 64 bytes in prologue and epilogue instead of 128 bytes. --- sysdeps/aarch64/memset.S | 16 +++++----------- 1 file changed, 5 insertions(+), 11 deletions(-) diff --git a/sysdeps/aarch64/memset.S b/sysdeps/aarch64/memset.S index 9fea4c2..78ead60 100644 --- a/sysdeps/aarch64/memset.S +++ b/sysdeps/aarch64/memset.S @@ -60,20 +60,14 @@ str q0, [dst, 16] stp q0, q0, [dst, 32] bic dst, dst, 63 - stp q0, q0, [dst, 64] - stp q0, q0, [dst, 96] - sub count, dstend, dst /* Count is now 128 too large. */ - sub count, count, 128+64+64 /* Adjust count and bias for loop. */ - add dst, dst, 128 - nop + add dst, dst, 64 + sub dstend, dstend, 64 1: dc zva, dst add dst, dst, 64 - subs count, count, 64 + cmp dstend, dst b.hi 1b - stp q0, q0, [dst, 0] - stp q0, q0, [dst, 32] - stp q0, q0, [dstend, -64] - stp q0, q0, [dstend, -32] + stp q0, q0, [dstend] + stp q0, q0, [dstend, 32] ret .endm