From patchwork Mon Mar 29 22:57:52 2021
X-Patchwork-Submitter: Noah Goldstein
X-Patchwork-Id: 1459832
To: libc-alpha@sourceware.org
From: Noah Goldstein
Subject: [PATCH v2 1/2] x86: Update large memcpy case in memmove-vec-unaligned-erms.S
Date: Mon, 29 Mar 2021 18:57:52 -0400
Message-Id: <20210329225752.235397-1-goldstein.w.n@gmail.com>

No Bug. This commit updates the large memcpy case (no overlap). The update
is to perform memcpy on either 2 or 4 contiguous pages at once. This
1) helps to alleviate the effects of false memory aliasing when the
destination and source have a close 4k alignment, and 2) is, in most cases
and for most DRAM units, a modestly more efficient access pattern. These
changes are a clear performance improvement for VEC_SIZE=16/32, though more
ambiguous for VEC_SIZE=64. (A rough illustrative C sketch of the new access
pattern follows the patch below.)

test-memcpy, test-memccpy, test-mempcpy, test-memmove, and
tst-memmove-overflow all pass.

Signed-off-by: noah
---
This patch contains an update to memmove-vec-unaligned-erms.S, additions to
test-memmove.c and test-memcpy.c, and additions to bench-memcpy-large.c.

Test changes: these are largely in the vein of increasing the maximum test
size, increasing the range of misalignments, and expanding coverage to both
forward and backward copying.

Bench changes: these increase the range of tested alignments. The relative
alignment of source and destination can have a huge impact on performance
(more below) even when there is no overlap.

Memmove changes: the change was benchmarked on an Icelake and a Skylake
CPU. See below for a CSV of the data. Time is the median of 25 runs of
bench-memcpy-large.c in nanoseconds. "New" is this patch, "Old" is the
current implementation.

The majority of the performance changes are beneficial. The clearest
example is on Icelake, where alleviating the pressure from false memory
aliasing led to more than a 2x performance improvement for certain
alignments with VEC_SIZE=16 and a 1.5x improvement for certain alignments
with VEC_SIZE=32. i.e.:

func ,size ,align1,align2,Old ,New ,% New / Old
sse2 ,1048591 ,0 ,3 ,400336.0 ,173518.0 ,43.3
avx ,1048591 ,0 ,3 ,210664.0 ,146304.0 ,69.4

As well, across the board for larger sizes (starting around size = 2^23)
there was roughly a 0-10% performance improvement. i.e.:

Skylake:
sse2 ,33554439 ,0 ,0 ,4672510.0 ,4391660.0 ,94.0
avx ,33554439 ,0 ,0 ,4849470.0 ,4398720.0 ,90.7

Icelake:
sse2 ,33554439 ,0 ,0 ,5926350.0 ,5588810.0 ,94.3
avx ,33554439 ,0 ,0 ,5582940.0 ,5313320.0 ,95.2
avx512 ,33554439 ,0 ,0 ,5531050.0 ,5292570.0 ,95.7

There were performance degradations, however: medium-large sizes
[2^20, 2^22] saw roughly a 0-6% performance loss on Icelake for
VEC_SIZE=64. This degradation is worst for destination alignment=127.
i.e: avx512 ,1048583 ,0 ,0 ,133915.0 ,136436.0 ,101.9 avx512 ,1048576 ,0 ,127 ,142207.0 ,151144.0 ,106.3 avx512 ,2097159 ,0 ,0 ,267396.0 ,272355.0 ,101.9 avx512 ,2097152 ,0 ,127 ,284003.0 ,303094.0 ,106.7 avx512 ,4194311 ,0 ,0 ,536810.0 ,546741.0 ,101.9 avx512 ,4194304 ,0 ,127 ,570463.0 ,605906.0 ,106.2 Around 2^23 the change becomes neutral - advantageous: avx512 ,8388615 ,0 ,0 ,1136350.0 ,1125880.0 ,99.1 avx512 ,8388608 ,0 ,127 ,1220480.0 ,1225000.0 ,100.4 Across the board, aside from the address aliasing case, the performance difference is roughly in the range of [-6%, 12%] with some extreme [150%, 200%] cases that are heavily dependent on alignment. Its possible these changes should only be made for VEC_SIZE=16/32 or to keep the original forward memcpy for sizes [2^20, 2^22] in the case that there is no address aliasing. Please let me know what you think. Performance Numbers (Skylake Numbers Below): Icelake: Intel(R) Core(TM) i7-1065G7 CPU @ 1.30GHz func ,size ,align1,align2,Old ,New ,% New / Old sse2 ,1048583 ,0 ,0 ,147297.0 ,146234.0 ,99.3 sse2 ,1048591 ,0 ,3 ,400336.0 ,173518.0 ,43.3 sse2 ,1048607 ,3 ,0 ,151488.0 ,150773.0 ,99.5 sse2 ,1048639 ,3 ,5 ,399842.0 ,174222.0 ,43.6 sse2 ,1048576 ,0 ,127 ,356326.0 ,171422.0 ,48.1 sse2 ,1048576 ,0 ,255 ,144145.0 ,152123.0 ,105.5 sse2 ,1048576 ,0 ,256 ,147605.0 ,148005.0 ,100.3 sse2 ,1048576 ,0 ,4064 ,146929.0 ,147812.0 ,100.6 sse2 ,2097159 ,0 ,0 ,293910.0 ,291403.0 ,99.1 sse2 ,2097167 ,0 ,3 ,798920.0 ,346694.0 ,43.4 sse2 ,2097183 ,3 ,0 ,301171.0 ,299606.0 ,99.5 sse2 ,2097215 ,3 ,5 ,799129.0 ,346597.0 ,43.4 sse2 ,2097152 ,0 ,127 ,710256.0 ,341110.0 ,48.0 sse2 ,2097152 ,0 ,255 ,286370.0 ,302553.0 ,105.7 sse2 ,2097152 ,0 ,256 ,293691.0 ,294825.0 ,100.4 sse2 ,2097152 ,0 ,4064 ,292920.0 ,294180.0 ,100.4 sse2 ,4194311 ,0 ,0 ,587894.0 ,586827.0 ,99.8 sse2 ,4194319 ,0 ,3 ,1596340.0 ,694200.0 ,43.5 sse2 ,4194335 ,3 ,0 ,601996.0 ,601342.0 ,99.9 sse2 ,4194367 ,3 ,5 ,1596870.0 ,694562.0 ,43.5 sse2 ,4194304 ,0 ,127 ,1414140.0 ,682856.0 ,48.3 sse2 ,4194304 ,0 ,255 ,573752.0 ,607024.0 ,105.8 sse2 ,4194304 ,0 ,256 ,586961.0 ,591899.0 ,100.8 sse2 ,4194304 ,0 ,4064 ,586618.0 ,591267.0 ,100.8 sse2 ,8388615 ,0 ,0 ,1267450.0 ,1213660.0 ,95.8 sse2 ,8388623 ,0 ,3 ,3204280.0 ,1404460.0 ,43.8 sse2 ,8388639 ,3 ,0 ,1298940.0 ,1245790.0 ,95.9 sse2 ,8388671 ,3 ,5 ,3200790.0 ,1404540.0 ,43.9 sse2 ,8388608 ,0 ,127 ,2843880.0 ,1380490.0 ,48.5 sse2 ,8388608 ,0 ,255 ,1261040.0 ,1259110.0 ,99.8 sse2 ,8388608 ,0 ,256 ,1301120.0 ,1228890.0 ,94.4 sse2 ,8388608 ,0 ,4064 ,1263930.0 ,1233400.0 ,97.6 sse2 ,16777223 ,0 ,0 ,2845260.0 ,2690490.0 ,94.6 sse2 ,16777231 ,0 ,3 ,6424220.0 ,2999980.0 ,46.7 sse2 ,16777247 ,3 ,0 ,2902290.0 ,2764350.0 ,95.2 sse2 ,16777279 ,3 ,5 ,6413600.0 ,2999310.0 ,46.8 sse2 ,16777216 ,0 ,127 ,5704050.0 ,2986650.0 ,52.4 sse2 ,16777216 ,0 ,255 ,2823440.0 ,2790510.0 ,98.8 sse2 ,16777216 ,0 ,256 ,2926150.0 ,2711540.0 ,92.7 sse2 ,16777216 ,0 ,4064 ,2836530.0 ,2738850.0 ,96.6 sse2 ,33554439 ,0 ,0 ,5926350.0 ,5588810.0 ,94.3 sse2 ,33554447 ,0 ,3 ,12850900.0 ,6171500.0 ,48.0 sse2 ,33554463 ,3 ,0 ,6041090.0 ,5731480.0 ,94.9 sse2 ,33554495 ,3 ,5 ,12851100.0 ,6179870.0 ,48.1 sse2 ,33554432 ,0 ,127 ,11381900.0 ,6134130.0 ,53.9 sse2 ,33554432 ,0 ,255 ,5899320.0 ,5792680.0 ,98.2 sse2 ,33554432 ,0 ,256 ,6066220.0 ,5636270.0 ,92.9 sse2 ,33554432 ,0 ,4064 ,5915210.0 ,5688830.0 ,96.2 avx ,1048583 ,0 ,0 ,134392.0 ,136494.0 ,101.6 avx ,1048591 ,0 ,3 ,210664.0 ,146304.0 ,69.4 avx ,1048607 ,3 ,0 ,138559.0 ,138887.0 ,100.2 avx ,1048639 ,3 ,5 ,210655.0 ,146690.0 ,69.6 avx ,1048576 ,0 
,127 ,219819.0 ,155758.0 ,70.9 avx ,1048576 ,0 ,255 ,180740.0 ,146392.0 ,81.0 avx ,1048576 ,0 ,256 ,138448.0 ,142813.0 ,103.2 avx ,1048576 ,0 ,4064 ,133067.0 ,136384.0 ,102.5 avx ,2097159 ,0 ,0 ,268811.0 ,272810.0 ,101.5 avx ,2097167 ,0 ,3 ,419724.0 ,292730.0 ,69.7 avx ,2097183 ,3 ,0 ,277358.0 ,277789.0 ,100.2 avx ,2097215 ,3 ,5 ,421091.0 ,292907.0 ,69.6 avx ,2097152 ,0 ,127 ,439166.0 ,311969.0 ,71.0 avx ,2097152 ,0 ,255 ,359858.0 ,293484.0 ,81.6 avx ,2097152 ,0 ,256 ,276467.0 ,285067.0 ,103.1 avx ,2097152 ,0 ,4064 ,266145.0 ,273049.0 ,102.6 avx ,4194311 ,0 ,0 ,538566.0 ,547454.0 ,101.7 avx ,4194319 ,0 ,3 ,841884.0 ,586111.0 ,69.6 avx ,4194335 ,3 ,0 ,555930.0 ,557857.0 ,100.3 avx ,4194367 ,3 ,5 ,841146.0 ,586329.0 ,69.7 avx ,4194304 ,0 ,127 ,879711.0 ,625865.0 ,71.1 avx ,4194304 ,0 ,255 ,718131.0 ,588442.0 ,81.9 avx ,4194304 ,0 ,256 ,553593.0 ,571956.0 ,103.3 avx ,4194304 ,0 ,4064 ,534461.0 ,547903.0 ,102.5 avx ,8388615 ,0 ,0 ,1145460.0 ,1127430.0 ,98.4 avx ,8388623 ,0 ,3 ,1704200.0 ,1185410.0 ,69.6 avx ,8388639 ,3 ,0 ,1179600.0 ,1145670.0 ,97.1 avx ,8388671 ,3 ,5 ,1702480.0 ,1183410.0 ,69.5 avx ,8388608 ,0 ,127 ,1773750.0 ,1264360.0 ,71.3 avx ,8388608 ,0 ,255 ,1450840.0 ,1189310.0 ,82.0 avx ,8388608 ,0 ,256 ,1179160.0 ,1157490.0 ,98.2 avx ,8388608 ,0 ,4064 ,1135990.0 ,1128150.0 ,99.3 avx ,16777223 ,0 ,0 ,2630160.0 ,2553770.0 ,97.1 avx ,16777231 ,0 ,3 ,3539370.0 ,2667050.0 ,75.4 avx ,16777247 ,3 ,0 ,2671830.0 ,2585550.0 ,96.8 avx ,16777279 ,3 ,5 ,3537460.0 ,2664080.0 ,75.3 avx ,16777216 ,0 ,127 ,3598350.0 ,2784810.0 ,77.4 avx ,16777216 ,0 ,255 ,3012890.0 ,2650420.0 ,88.0 avx ,16777216 ,0 ,256 ,2690480.0 ,2605640.0 ,96.8 avx ,16777216 ,0 ,4064 ,2607870.0 ,2537450.0 ,97.3 avx ,33554439 ,0 ,0 ,5582940.0 ,5313320.0 ,95.2 avx ,33554447 ,0 ,3 ,7208430.0 ,5541330.0 ,76.9 avx ,33554463 ,3 ,0 ,5613760.0 ,5399880.0 ,96.2 avx ,33554495 ,3 ,5 ,7202140.0 ,5547470.0 ,77.0 avx ,33554432 ,0 ,127 ,7287570.0 ,5784590.0 ,79.4 avx ,33554432 ,0 ,255 ,6156640.0 ,5508630.0 ,89.5 avx ,33554432 ,0 ,256 ,5700530.0 ,5441950.0 ,95.5 avx ,33554432 ,0 ,4064 ,5531820.0 ,5302580.0 ,95.9 avx512 ,1048583 ,0 ,0 ,133915.0 ,136436.0 ,101.9 avx512 ,1048591 ,0 ,3 ,142372.0 ,146319.0 ,102.8 avx512 ,1048607 ,3 ,0 ,134629.0 ,139098.0 ,103.3 avx512 ,1048639 ,3 ,5 ,142362.0 ,146405.0 ,102.8 avx512 ,1048576 ,0 ,127 ,142207.0 ,151144.0 ,106.3 avx512 ,1048576 ,0 ,255 ,143736.0 ,147800.0 ,102.8 avx512 ,1048576 ,0 ,256 ,139937.0 ,142958.0 ,102.2 avx512 ,1048576 ,0 ,4064 ,134730.0 ,139222.0 ,103.3 avx512 ,2097159 ,0 ,0 ,267396.0 ,272355.0 ,101.9 avx512 ,2097167 ,0 ,3 ,284152.0 ,293076.0 ,103.1 avx512 ,2097183 ,3 ,0 ,269656.0 ,278215.0 ,103.2 avx512 ,2097215 ,3 ,5 ,284422.0 ,293030.0 ,103.0 avx512 ,2097152 ,0 ,127 ,284003.0 ,303094.0 ,106.7 avx512 ,2097152 ,0 ,255 ,287381.0 ,295503.0 ,102.8 avx512 ,2097152 ,0 ,256 ,280224.0 ,286054.0 ,102.1 avx512 ,2097152 ,0 ,4064 ,270038.0 ,277907.0 ,102.9 avx512 ,4194311 ,0 ,0 ,536810.0 ,546741.0 ,101.9 avx512 ,4194319 ,0 ,3 ,570476.0 ,584715.0 ,102.5 avx512 ,4194335 ,3 ,0 ,539745.0 ,556838.0 ,103.2 avx512 ,4194367 ,3 ,5 ,570148.0 ,586154.0 ,102.8 avx512 ,4194304 ,0 ,127 ,570463.0 ,605906.0 ,106.2 avx512 ,4194304 ,0 ,255 ,576014.0 ,590627.0 ,102.5 avx512 ,4194304 ,0 ,256 ,560921.0 ,572248.0 ,102.0 avx512 ,4194304 ,0 ,4064 ,540550.0 ,557613.0 ,103.2 avx512 ,8388615 ,0 ,0 ,1136350.0 ,1125880.0 ,99.1 avx512 ,8388623 ,0 ,3 ,1218350.0 ,1192400.0 ,97.9 avx512 ,8388639 ,3 ,0 ,1139420.0 ,1144530.0 ,100.4 avx512 ,8388671 ,3 ,5 ,1219760.0 ,1191420.0 ,97.7 avx512 ,8388608 ,0 ,127 ,1220480.0 ,1225000.0 ,100.4 
avx512 ,8388608 ,0 ,255 ,1222290.0 ,1190400.0 ,97.4 avx512 ,8388608 ,0 ,256 ,1194810.0 ,1154410.0 ,96.6 avx512 ,8388608 ,0 ,4064 ,1138850.0 ,1147750.0 ,100.8 avx512 ,16777223 ,0 ,0 ,2601040.0 ,2535500.0 ,97.5 avx512 ,16777231 ,0 ,3 ,2759350.0 ,2674570.0 ,96.9 avx512 ,16777247 ,3 ,0 ,2603500.0 ,2588260.0 ,99.4 avx512 ,16777279 ,3 ,5 ,2743810.0 ,2674870.0 ,97.5 avx512 ,16777216 ,0 ,127 ,2754910.0 ,2726860.0 ,99.0 avx512 ,16777216 ,0 ,255 ,2750980.0 ,2651370.0 ,96.4 avx512 ,16777216 ,0 ,256 ,2707940.0 ,2589660.0 ,95.6 avx512 ,16777216 ,0 ,4064 ,2606760.0 ,2580980.0 ,99.0 avx512 ,33554439 ,0 ,0 ,5531050.0 ,5292570.0 ,95.7 avx512 ,33554447 ,0 ,3 ,5788490.0 ,5574380.0 ,96.3 avx512 ,33554463 ,3 ,0 ,5558950.0 ,5415190.0 ,97.4 avx512 ,33554495 ,3 ,5 ,5775400.0 ,5582390.0 ,96.7 avx512 ,33554432 ,0 ,127 ,5787680.0 ,5659730.0 ,97.8 avx512 ,33554432 ,0 ,255 ,5823500.0 ,5516530.0 ,94.7 avx512 ,33554432 ,0 ,256 ,5678760.0 ,5401000.0 ,95.1 avx512 ,33554432 ,0 ,4064 ,5573540.0 ,5400460.0 ,96.9 Skylake: Intel(R) Core(TM) i7-8565U CPU @ 1.80GHz func ,size ,align1,align2,Old ,New ,% New / Old sse2 ,1048583 ,0 ,0 ,71890.2 ,70626.8 ,98.2 sse2 ,1048591 ,0 ,3 ,72200.5 ,74263.6 ,102.9 sse2 ,1048607 ,3 ,0 ,71360.5 ,70106.5 ,98.2 sse2 ,1048639 ,3 ,5 ,71972.1 ,73468.0 ,102.1 sse2 ,1048576 ,0 ,127 ,81634.2 ,77607.6 ,95.1 sse2 ,1048576 ,0 ,255 ,71575.2 ,71951.5 ,100.5 sse2 ,1048576 ,0 ,256 ,72383.2 ,69610.8 ,96.2 sse2 ,1048576 ,0 ,4064 ,71996.6 ,70941.0 ,98.5 sse2 ,2097159 ,0 ,0 ,143835.0 ,140186.0 ,97.5 sse2 ,2097167 ,0 ,3 ,146347.0 ,147984.0 ,101.1 sse2 ,2097183 ,3 ,0 ,145740.0 ,140317.0 ,96.3 sse2 ,2097215 ,3 ,5 ,147099.0 ,147066.0 ,100.0 sse2 ,2097152 ,0 ,127 ,163712.0 ,157386.0 ,96.1 sse2 ,2097152 ,0 ,255 ,145048.0 ,144970.0 ,99.9 sse2 ,2097152 ,0 ,256 ,144545.0 ,139948.0 ,96.8 sse2 ,2097152 ,0 ,4064 ,143519.0 ,140975.0 ,98.2 sse2 ,4194311 ,0 ,0 ,293848.0 ,283531.0 ,96.5 sse2 ,4194319 ,0 ,3 ,305127.0 ,295478.0 ,96.8 sse2 ,4194335 ,3 ,0 ,299170.0 ,283950.0 ,94.9 sse2 ,4194367 ,3 ,5 ,307419.0 ,293175.0 ,95.4 sse2 ,4194304 ,0 ,127 ,332567.0 ,318276.0 ,95.7 sse2 ,4194304 ,0 ,255 ,304897.0 ,300309.0 ,98.5 sse2 ,4194304 ,0 ,256 ,298929.0 ,284008.0 ,95.0 sse2 ,4194304 ,0 ,4064 ,296282.0 ,286087.0 ,96.6 sse2 ,8388615 ,0 ,0 ,751380.0 ,724191.0 ,96.4 sse2 ,8388623 ,0 ,3 ,775657.0 ,734942.0 ,94.8 sse2 ,8388639 ,3 ,0 ,756674.0 ,712934.0 ,94.2 sse2 ,8388671 ,3 ,5 ,774934.0 ,736895.0 ,95.1 sse2 ,8388608 ,0 ,127 ,781242.0 ,741475.0 ,94.9 sse2 ,8388608 ,0 ,255 ,762849.0 ,725086.0 ,95.0 sse2 ,8388608 ,0 ,256 ,758465.0 ,711665.0 ,93.8 sse2 ,8388608 ,0 ,4064 ,755243.0 ,738092.0 ,97.7 sse2 ,16777223 ,0 ,0 ,2104730.0 ,1954140.0 ,92.8 sse2 ,16777231 ,0 ,3 ,2129590.0 ,1951410.0 ,91.6 sse2 ,16777247 ,3 ,0 ,2102950.0 ,1952530.0 ,92.8 sse2 ,16777279 ,3 ,5 ,2126250.0 ,1952410.0 ,91.8 sse2 ,16777216 ,0 ,127 ,2074290.0 ,1932070.0 ,93.1 sse2 ,16777216 ,0 ,255 ,2060610.0 ,1941860.0 ,94.2 sse2 ,16777216 ,0 ,256 ,2106430.0 ,1952060.0 ,92.7 sse2 ,16777216 ,0 ,4064 ,2100660.0 ,1945610.0 ,92.6 sse2 ,33554439 ,0 ,0 ,4672510.0 ,4391660.0 ,94.0 sse2 ,33554447 ,0 ,3 ,4687860.0 ,4387680.0 ,93.6 sse2 ,33554463 ,3 ,0 ,4655420.0 ,4402580.0 ,94.6 sse2 ,33554495 ,3 ,5 ,4692800.0 ,4386350.0 ,93.5 sse2 ,33554432 ,0 ,127 ,4558620.0 ,4341510.0 ,95.2 sse2 ,33554432 ,0 ,255 ,4545130.0 ,4374230.0 ,96.2 sse2 ,33554432 ,0 ,256 ,4665000.0 ,4390850.0 ,94.1 sse2 ,33554432 ,0 ,4064 ,4666350.0 ,4374400.0 ,93.7 avx ,1048583 ,0 ,0 ,105460.0 ,104097.0 ,98.7 avx ,1048591 ,0 ,3 ,66369.2 ,67306.4 ,101.4 avx ,1048607 ,3 ,0 ,66625.8 ,64741.2 ,97.2 avx ,1048639 ,3 ,5 ,66757.7 
,65796.3 ,98.6 avx ,1048576 ,0 ,127 ,65272.4 ,65130.6 ,99.8 avx ,1048576 ,0 ,255 ,65632.1 ,65678.6 ,100.1 avx ,1048576 ,0 ,256 ,67530.1 ,64841.5 ,96.0 avx ,1048576 ,0 ,4064 ,65955.1 ,66194.8 ,100.4 avx ,2097159 ,0 ,0 ,132883.0 ,131644.0 ,99.1 avx ,2097167 ,0 ,3 ,133825.0 ,132308.0 ,98.9 avx ,2097183 ,3 ,0 ,133567.0 ,129040.0 ,96.6 avx ,2097215 ,3 ,5 ,133856.0 ,132735.0 ,99.2 avx ,2097152 ,0 ,127 ,131219.0 ,129983.0 ,99.1 avx ,2097152 ,0 ,255 ,131450.0 ,131755.0 ,100.2 avx ,2097152 ,0 ,256 ,135219.0 ,132616.0 ,98.1 avx ,2097152 ,0 ,4064 ,131692.0 ,132351.0 ,100.5 avx ,4194311 ,0 ,0 ,278494.0 ,265144.0 ,95.2 avx ,4194319 ,0 ,3 ,282868.0 ,267499.0 ,94.6 avx ,4194335 ,3 ,0 ,275956.0 ,262626.0 ,95.2 avx ,4194367 ,3 ,5 ,283080.0 ,266712.0 ,94.2 avx ,4194304 ,0 ,127 ,270912.0 ,266153.0 ,98.2 avx ,4194304 ,0 ,255 ,266650.0 ,267640.0 ,100.4 avx ,4194304 ,0 ,256 ,276224.0 ,264929.0 ,95.9 avx ,4194304 ,0 ,4064 ,274156.0 ,265264.0 ,96.8 avx ,8388615 ,0 ,0 ,820710.0 ,799313.0 ,97.4 avx ,8388623 ,0 ,3 ,881478.0 ,816087.0 ,92.6 avx ,8388639 ,3 ,0 ,881138.0 ,788571.0 ,89.5 avx ,8388671 ,3 ,5 ,883555.0 ,820020.0 ,92.8 avx ,8388608 ,0 ,127 ,799727.0 ,785502.0 ,98.2 avx ,8388608 ,0 ,255 ,785782.0 ,800006.0 ,101.8 avx ,8388608 ,0 ,256 ,876745.0 ,809691.0 ,92.4 avx ,8388608 ,0 ,4064 ,895120.0 ,809204.0 ,90.4 avx ,16777223 ,0 ,0 ,2138420.0 ,1955110.0 ,91.4 avx ,16777231 ,0 ,3 ,2208590.0 ,1966590.0 ,89.0 avx ,16777247 ,3 ,0 ,2209190.0 ,1968980.0 ,89.1 avx ,16777279 ,3 ,5 ,2207120.0 ,1964830.0 ,89.0 avx ,16777216 ,0 ,127 ,2123460.0 ,1942180.0 ,91.5 avx ,16777216 ,0 ,255 ,2120500.0 ,1951910.0 ,92.0 avx ,16777216 ,0 ,256 ,2193680.0 ,1963540.0 ,89.5 avx ,16777216 ,0 ,4064 ,2196110.0 ,1970050.0 ,89.7 avx ,33554439 ,0 ,0 ,4849470.0 ,4398720.0 ,90.7 avx ,33554447 ,0 ,3 ,4855270.0 ,4402670.0 ,90.7 avx ,33554463 ,3 ,0 ,4877600.0 ,4405480.0 ,90.3 avx ,33554495 ,3 ,5 ,4851190.0 ,4401330.0 ,90.7 avx ,33554432 ,0 ,127 ,4699810.0 ,4324860.0 ,92.0 avx ,33554432 ,0 ,255 ,4676570.0 ,4363830.0 ,93.3 avx ,33554432 ,0 ,256 ,4846720.0 ,4376970.0 ,90.3 avx ,33554432 ,0 ,4064 ,4839810.0 ,4400570.0 ,90.9 .../multiarch/memmove-vec-unaligned-erms.S | 327 ++++++++++++++---- 1 file changed, 257 insertions(+), 70 deletions(-) diff --git a/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S index 897a3d9762..a2d790cf61 100644 --- a/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S +++ b/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S @@ -67,6 +67,35 @@ # endif #endif +#ifndef PAGE_SIZE +# define PAGE_SIZE 4096 +#endif + +#if PAGE_SIZE != 4096 +# error Unsupported PAGE_SIZE +#endif + +#ifndef LOG_PAGE_SIZE +# define LOG_PAGE_SIZE 12 +#endif + +#if PAGE_SIZE != (1 << LOG_PAGE_SIZE) +# error Invalid LOG_PAGE_SIZE +#endif + +/* Byte per page for large_memcpy inner loop. */ +#if VEC_SIZE == 64 +# define LARGE_LOAD_SIZE (VEC_SIZE * 2) +#else +# define LARGE_LOAD_SIZE (VEC_SIZE * 4) +#endif + + +/* Amount to shift rdx by to compare for memcpy_large_4x. */ +#ifndef LOG_4X_MEMCPY_THRESH +# define LOG_4X_MEMCPY_THRESH 4 +#endif + /* Avoid short distance rep movsb only with non-SSE vector. */ #ifndef AVOID_SHORT_DISTANCE_REP_MOVSB # define AVOID_SHORT_DISTANCE_REP_MOVSB (VEC_SIZE > 16) @@ -106,6 +135,28 @@ # error Unsupported PREFETCH_SIZE! #endif +#if LARGE_LOAD_SIZE == (VEC_SIZE * 2) +# define LOAD_ONE_SET(base, offset, vec0, vec1, ...) \ + VMOVU (offset)base, vec0; \ + VMOVU ((offset) + VEC_SIZE)base, vec1; +# define STORE_ONE_SET(base, offset, vec0, vec1, ...) 
\ + VMOVNT vec0, (offset)base; \ + VMOVNT vec1, ((offset) + VEC_SIZE)base; +#elif LARGE_LOAD_SIZE == (VEC_SIZE * 4) +# define LOAD_ONE_SET(base, offset, vec0, vec1, vec2, vec3) \ + VMOVU (offset)base, vec0; \ + VMOVU ((offset) + VEC_SIZE)base, vec1; \ + VMOVU ((offset) + VEC_SIZE * 2)base, vec2; \ + VMOVU ((offset) + VEC_SIZE * 3)base, vec3; +# define STORE_ONE_SET(base, offset, vec0, vec1, vec2, vec3) \ + VMOVNT vec0, (offset)base; \ + VMOVNT vec1, ((offset) + VEC_SIZE)base; \ + VMOVNT vec2, ((offset) + VEC_SIZE * 2)base; \ + VMOVNT vec3, ((offset) + VEC_SIZE * 3)base; +#else +# error Invalid LARGE_LOAD_SIZE +#endif + #ifndef SECTION # error SECTION is not defined! #endif @@ -393,6 +444,15 @@ L(last_4x_vec): VZEROUPPER_RETURN L(more_8x_vec): + /* Check if non-temporal move candidate. */ +#if (defined USE_MULTIARCH || VEC_SIZE == 16) && IS_IN (libc) + /* Check non-temporal store threshold. */ + cmp __x86_shared_non_temporal_threshold(%rip), %RDX_LP + ja L(large_memcpy_2x) +#endif + /* Entry if rdx is greater than non-temporal threshold but there + is overlap. */ +L(more_8x_vec_check): cmpq %rsi, %rdi ja L(more_8x_vec_backward) /* Source == destination is less common. */ @@ -419,11 +479,6 @@ L(more_8x_vec): subq %r8, %rdi /* Adjust length. */ addq %r8, %rdx -#if (defined USE_MULTIARCH || VEC_SIZE == 16) && IS_IN (libc) - /* Check non-temporal store threshold. */ - cmp __x86_shared_non_temporal_threshold(%rip), %RDX_LP - ja L(large_forward) -#endif L(loop_4x_vec_forward): /* Copy 4 * VEC a time forward. */ VMOVU (%rsi), %VEC(0) @@ -447,6 +502,7 @@ L(loop_4x_vec_forward): /* Store the first VEC. */ VMOVU %VEC(4), (%r11) VZEROUPPER_RETURN + ret L(more_8x_vec_backward): /* Load the first 4 * VEC and last VEC to support overlapping @@ -470,11 +526,6 @@ L(more_8x_vec_backward): subq %r8, %r9 /* Adjust length. */ subq %r8, %rdx -#if (defined USE_MULTIARCH || VEC_SIZE == 16) && IS_IN (libc) - /* Check non-temporal store threshold. */ - cmp __x86_shared_non_temporal_threshold(%rip), %RDX_LP - ja L(large_backward) -#endif L(loop_4x_vec_backward): /* Copy 4 * VEC a time backward. */ VMOVU (%rcx), %VEC(0) @@ -498,75 +549,211 @@ L(loop_4x_vec_backward): /* Store the last VEC. */ VMOVU %VEC(8), (%r11) VZEROUPPER_RETURN + ret #if (defined USE_MULTIARCH || VEC_SIZE == 16) && IS_IN (libc) -L(large_forward): - /* Don't use non-temporal store if there is overlap between - destination and source since destination may be in cache - when source is loaded. */ - leaq (%rdi, %rdx), %r10 - cmpq %r10, %rsi - jb L(loop_4x_vec_forward) -L(loop_large_forward): - /* Copy 4 * VEC a time forward with non-temporal stores. */ - PREFETCH_ONE_SET (1, (%rsi), PREFETCHED_LOAD_SIZE * 2) - PREFETCH_ONE_SET (1, (%rsi), PREFETCHED_LOAD_SIZE * 3) +L(large_memcpy_2x): + /* Compute absolute value of difference between source and + destination. */ + movq %rdi, %r9 + subq %rsi, %r9 + movq %r9, %r8 + leaq -1(%r9), %rcx + sarq $63, %r8 + xorq %r8, %r9 + subq %r8, %r9 + /* Don't use non-temporal store if there is overlap between + destination and source since destination may be in cache when + source is loaded. */ + cmpq %r9, %rdx + ja L(more_8x_vec_check) + + /* Cache align destination. First store the first 64 bytes then + adjust alignments. 
*/ + VMOVU (%rsi), %VEC(8) +#if VEC_SIZE < 64 + VMOVU VEC_SIZE(%rsi), %VEC(9) +#if VEC_SIZE < 32 + VMOVU (VEC_SIZE * 2)(%rsi), %VEC(10) + VMOVU (VEC_SIZE * 3)(%rsi), %VEC(11) +#endif +#endif + VMOVU %VEC(8), (%rdi) +#if VEC_SIZE < 64 + VMOVU %VEC(9), VEC_SIZE(%rdi) +#if VEC_SIZE < 32 + VMOVU %VEC(10), (VEC_SIZE * 2)(%rdi) + VMOVU %VEC(11), (VEC_SIZE * 3)(%rdi) +#endif +#endif + /* Adjust source, destination, and size. */ + MOVQ %rdi, %r8 + andq $63, %r8 + /* Get the negative of offset for alignment. */ + subq $64, %r8 + /* Adjust source. */ + subq %r8, %rsi + /* Adjust destination which should be aligned now. */ + subq %r8, %rdi + /* Adjust length. */ + addq %r8, %rdx + + /* Test if source and destination addresses will alias. If they do + the larger pipeline in large_memcpy_4x alleviated the + performance drop. */ + testl $(PAGE_SIZE - VEC_SIZE * 8), %ecx + jz L(large_memcpy_4x) + + movq %rdx, %r10 + shrq $LOG_4X_MEMCPY_THRESH, %r10 + cmp __x86_shared_non_temporal_threshold(%rip), %r10 + jae L(large_memcpy_4x) + + /* edx will store remainder size for copying tail. */ + andl $(PAGE_SIZE * 2 - 1), %edx + /* r10 stores outer loop counter. */ + shrq $((LOG_PAGE_SIZE + 1) - LOG_4X_MEMCPY_THRESH), %r10 + /* Copy 4x VEC at a time from 2 pages. */ + .p2align 5 +L(loop_large_memcpy_2x_outer): + /* ecx stores inner loop counter. */ + movl $(PAGE_SIZE / LARGE_LOAD_SIZE), %ecx +L(loop_large_memcpy_2x_inner): + PREFETCH_ONE_SET(1, (%rsi), PREFETCHED_LOAD_SIZE) + PREFETCH_ONE_SET(1, (%rsi), PREFETCHED_LOAD_SIZE * 2) + PREFETCH_ONE_SET(1, (%rsi), PAGE_SIZE + PREFETCHED_LOAD_SIZE) + PREFETCH_ONE_SET(1, (%rsi), PAGE_SIZE + PREFETCHED_LOAD_SIZE * 2) + /* Load vectors from rsi. */ + LOAD_ONE_SET((%rsi), 0, %VEC(0), %VEC(1), %VEC(2), %VEC(3)) + LOAD_ONE_SET((%rsi), PAGE_SIZE, %VEC(4), %VEC(5), %VEC(6), %VEC(7)) + addq $LARGE_LOAD_SIZE, %rsi + /* Non-temporal store vectors to rdi. */ + STORE_ONE_SET((%rdi), 0, %VEC(0), %VEC(1), %VEC(2), %VEC(3)) + STORE_ONE_SET((%rdi), PAGE_SIZE, %VEC(4), %VEC(5), %VEC(6), %VEC(7)) + addq $LARGE_LOAD_SIZE, %rdi + decl %ecx + jnz L(loop_large_memcpy_2x_inner) + addq $PAGE_SIZE, %rdi + addq $PAGE_SIZE, %rsi + decq %r10 + jne L(loop_large_memcpy_2x_outer) + sfence + + /* Check if only last 4 loads are needed. */ + cmpl $(VEC_SIZE * 4), %edx + jbe L(large_memcpy_2x_end) + + /* Handle the last 2 * PAGE_SIZE bytes. Use temporal stores + here. The region will fit in cache and it should fit user + expectations for the tail of the memcpy region to be hot. */ + .p2align 4 +L(loop_large_memcpy_2x_tail): + /* Copy 4 * VEC a time forward with temporal stores. */ + PREFETCH_ONE_SET (1, (%rsi), PREFETCHED_LOAD_SIZE) + PREFETCH_ONE_SET (1, (%rdi), PREFETCHED_LOAD_SIZE) VMOVU (%rsi), %VEC(0) VMOVU VEC_SIZE(%rsi), %VEC(1) VMOVU (VEC_SIZE * 2)(%rsi), %VEC(2) VMOVU (VEC_SIZE * 3)(%rsi), %VEC(3) - addq $PREFETCHED_LOAD_SIZE, %rsi - subq $PREFETCHED_LOAD_SIZE, %rdx - VMOVNT %VEC(0), (%rdi) - VMOVNT %VEC(1), VEC_SIZE(%rdi) - VMOVNT %VEC(2), (VEC_SIZE * 2)(%rdi) - VMOVNT %VEC(3), (VEC_SIZE * 3)(%rdi) - addq $PREFETCHED_LOAD_SIZE, %rdi - cmpq $PREFETCHED_LOAD_SIZE, %rdx - ja L(loop_large_forward) - sfence + addq $(VEC_SIZE * 4), %rsi + subl $(VEC_SIZE * 4), %edx + VMOVA %VEC(0), (%rdi) + VMOVA %VEC(1), VEC_SIZE(%rdi) + VMOVA %VEC(2), (VEC_SIZE * 2)(%rdi) + VMOVA %VEC(3), (VEC_SIZE * 3)(%rdi) + addq $(VEC_SIZE * 4), %rdi + cmpl $(VEC_SIZE * 4), %edx + ja L(loop_large_memcpy_2x_tail) + +L(large_memcpy_2x_end): /* Store the last 4 * VEC. 
*/ - VMOVU %VEC(5), (%rcx) - VMOVU %VEC(6), -VEC_SIZE(%rcx) - VMOVU %VEC(7), -(VEC_SIZE * 2)(%rcx) - VMOVU %VEC(8), -(VEC_SIZE * 3)(%rcx) - /* Store the first VEC. */ - VMOVU %VEC(4), (%r11) + VMOVU -(VEC_SIZE * 4)(%rsi, %rdx), %VEC(0) + VMOVU -(VEC_SIZE * 3)(%rsi, %rdx), %VEC(1) + VMOVU -(VEC_SIZE * 2)(%rsi, %rdx), %VEC(2) + VMOVU -VEC_SIZE(%rsi, %rdx), %VEC(3) + + VMOVU %VEC(0), -(VEC_SIZE * 4)(%rdi, %rdx) + VMOVU %VEC(1), -(VEC_SIZE * 3)(%rdi, %rdx) + VMOVU %VEC(2), -(VEC_SIZE * 2)(%rdi, %rdx) + VMOVU %VEC(3), -VEC_SIZE(%rdi, %rdx) VZEROUPPER_RETURN - -L(large_backward): - /* Don't use non-temporal store if there is overlap between - destination and source since destination may be in cache - when source is loaded. */ - leaq (%rcx, %rdx), %r10 - cmpq %r10, %r9 - jb L(loop_4x_vec_backward) -L(loop_large_backward): - /* Copy 4 * VEC a time backward with non-temporal stores. */ - PREFETCH_ONE_SET (-1, (%rcx), -PREFETCHED_LOAD_SIZE * 2) - PREFETCH_ONE_SET (-1, (%rcx), -PREFETCHED_LOAD_SIZE * 3) - VMOVU (%rcx), %VEC(0) - VMOVU -VEC_SIZE(%rcx), %VEC(1) - VMOVU -(VEC_SIZE * 2)(%rcx), %VEC(2) - VMOVU -(VEC_SIZE * 3)(%rcx), %VEC(3) - subq $PREFETCHED_LOAD_SIZE, %rcx - subq $PREFETCHED_LOAD_SIZE, %rdx - VMOVNT %VEC(0), (%r9) - VMOVNT %VEC(1), -VEC_SIZE(%r9) - VMOVNT %VEC(2), -(VEC_SIZE * 2)(%r9) - VMOVNT %VEC(3), -(VEC_SIZE * 3)(%r9) - subq $PREFETCHED_LOAD_SIZE, %r9 - cmpq $PREFETCHED_LOAD_SIZE, %rdx - ja L(loop_large_backward) - sfence - /* Store the first 4 * VEC. */ - VMOVU %VEC(4), (%rdi) - VMOVU %VEC(5), VEC_SIZE(%rdi) - VMOVU %VEC(6), (VEC_SIZE * 2)(%rdi) - VMOVU %VEC(7), (VEC_SIZE * 3)(%rdi) - /* Store the last VEC. */ - VMOVU %VEC(8), (%r11) + ret + +L(large_memcpy_4x): + movq %rdx, %r10 + /* edx will store remainder size for copying tail. */ + andl $(PAGE_SIZE * 4 - 1), %edx + /* r10 stores outer loop counter. */ + shrq $(LOG_PAGE_SIZE + 2), %r10 + /* Copy 4x VEC at a time from 4 pages. */ + .p2align 5 +L(loop_large_memcpy_4x_outer): + /* ecx stores inner loop counter. */ + movl $(PAGE_SIZE / LARGE_LOAD_SIZE), %ecx +L(loop_large_memcpy_4x_inner): + /* Only one prefetch set per page as doing 4 pages give more time + for prefetcher to keep up. */ + PREFETCH_ONE_SET(1, (%rsi), PREFETCHED_LOAD_SIZE) + PREFETCH_ONE_SET(1, (%rsi), PAGE_SIZE + PREFETCHED_LOAD_SIZE) + PREFETCH_ONE_SET(1, (%rsi), PAGE_SIZE * 2 + PREFETCHED_LOAD_SIZE) + PREFETCH_ONE_SET(1, (%rsi), PAGE_SIZE * 3 + PREFETCHED_LOAD_SIZE) + /* Load vectors from rsi. */ + LOAD_ONE_SET((%rsi), 0, %VEC(0), %VEC(1), %VEC(2), %VEC(3)) + LOAD_ONE_SET((%rsi), PAGE_SIZE, %VEC(4), %VEC(5), %VEC(6), %VEC(7)) + LOAD_ONE_SET((%rsi), PAGE_SIZE * 2, %VEC(8), %VEC(9), %VEC(10), %VEC(11)) + LOAD_ONE_SET((%rsi), PAGE_SIZE * 3, %VEC(12), %VEC(13), %VEC(14), %VEC(15)) + addq $LARGE_LOAD_SIZE, %rsi + /* Non-temporal store vectors to rdi. */ + STORE_ONE_SET((%rdi), 0, %VEC(0), %VEC(1), %VEC(2), %VEC(3)) + STORE_ONE_SET((%rdi), PAGE_SIZE, %VEC(4), %VEC(5), %VEC(6), %VEC(7)) + STORE_ONE_SET((%rdi), PAGE_SIZE * 2, %VEC(8), %VEC(9), %VEC(10), %VEC(11)) + STORE_ONE_SET((%rdi), PAGE_SIZE * 3, %VEC(12), %VEC(13), %VEC(14), %VEC(15)) + addq $LARGE_LOAD_SIZE, %rdi + decl %ecx + jnz L(loop_large_memcpy_4x_inner) + addq $(PAGE_SIZE * 3), %rdi + addq $(PAGE_SIZE * 3), %rsi + decq %r10 + jne L(loop_large_memcpy_4x_outer) + sfence + + /* Check if only last 4 loads are needed. */ + cmpl $(VEC_SIZE * 4), %edx + jbe L(large_memcpy_4x_end) + + /* Handle the last 4 * PAGE_SIZE bytes. 
*/ + .p2align 4 +L(loop_large_memcpy_4x_tail): + /* Copy 4 * VEC a time forward with temporal stores. */ + PREFETCH_ONE_SET (1, (%rsi), PREFETCHED_LOAD_SIZE) + PREFETCH_ONE_SET (1, (%rdi), PREFETCHED_LOAD_SIZE) + VMOVU (%rsi), %VEC(0) + VMOVU VEC_SIZE(%rsi), %VEC(1) + VMOVU (VEC_SIZE * 2)(%rsi), %VEC(2) + VMOVU (VEC_SIZE * 3)(%rsi), %VEC(3) + addq $(VEC_SIZE * 4), %rsi + subl $(VEC_SIZE * 4), %edx + VMOVA %VEC(0), (%rdi) + VMOVA %VEC(1), VEC_SIZE(%rdi) + VMOVA %VEC(2), (VEC_SIZE * 2)(%rdi) + VMOVA %VEC(3), (VEC_SIZE * 3)(%rdi) + addq $(VEC_SIZE * 4), %rdi + cmpl $(VEC_SIZE * 4), %edx + ja L(loop_large_memcpy_4x_tail) + +L(large_memcpy_4x_end): + /* Store the last 4 * VEC. */ + VMOVU -(VEC_SIZE * 4)(%rsi, %rdx), %VEC(0) + VMOVU -(VEC_SIZE * 3)(%rsi, %rdx), %VEC(1) + VMOVU -(VEC_SIZE * 2)(%rsi, %rdx), %VEC(2) + VMOVU -VEC_SIZE(%rsi, %rdx), %VEC(3) + + VMOVU %VEC(0), -(VEC_SIZE * 4)(%rdi, %rdx) + VMOVU %VEC(1), -(VEC_SIZE * 3)(%rdi, %rdx) + VMOVU %VEC(2), -(VEC_SIZE * 2)(%rdi, %rdx) + VMOVU %VEC(3), -VEC_SIZE(%rdi, %rdx) VZEROUPPER_RETURN + ret #endif END (MEMMOVE_SYMBOL (__memmove, unaligned_erms))
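[Editorial note: to make the copy pattern described in the commit message
concrete, here is a rough, illustrative C sketch -- not the glibc code --
of what the new L(large_memcpy_2x) loop does in the SSE2 case: walk two
4 KiB pages in lockstep, loading 64 bytes from each page per step and
writing them back with non-temporal stores. It assumes a 16-byte-aligned
destination (the real code cache-aligns the destination first) and a size
that is a multiple of 2 * PAGE_SIZE; the helper name is hypothetical.]

#include <emmintrin.h>
#include <stddef.h>

#define PAGE_SIZE 4096

static void
copy_2x_pages_sketch (char *dst, const char *src, size_t size)
{
  /* Outer loop: one iteration per pair of pages.  */
  for (size_t n = size / (2 * PAGE_SIZE); n != 0; n--)
    {
      /* Inner loop: alternate 64-byte chunks between page 0 and page 1,
         which (per the commit message) avoids false 4k-aliasing stalls
         and tends to be a friendlier DRAM access pattern than streaming
         through one page at a time.  */
      for (size_t off = 0; off < PAGE_SIZE; off += 64)
        for (size_t page = 0; page < 2 * PAGE_SIZE; page += PAGE_SIZE)
          {
            const char *s = src + page + off;
            char *d = dst + page + off;
            __m128i v0 = _mm_loadu_si128 ((const __m128i *) (s + 0));
            __m128i v1 = _mm_loadu_si128 ((const __m128i *) (s + 16));
            __m128i v2 = _mm_loadu_si128 ((const __m128i *) (s + 32));
            __m128i v3 = _mm_loadu_si128 ((const __m128i *) (s + 48));
            /* Non-temporal stores: bypass the cache for the huge copy.  */
            _mm_stream_si128 ((__m128i *) (d + 0), v0);
            _mm_stream_si128 ((__m128i *) (d + 16), v1);
            _mm_stream_si128 ((__m128i *) (d + 32), v2);
            _mm_stream_si128 ((__m128i *) (d + 48), v3);
          }
      src += 2 * PAGE_SIZE;
      dst += 2 * PAGE_SIZE;
    }
  _mm_sfence ();  /* Order the non-temporal stores.  */
}

[The L(large_memcpy_4x) path taken for closely 4k-aliased source and
destination does the same over four pages instead of two. The actual
assembly additionally prefetches ahead in each page, handles the
non-multiple tail with ordinary temporal stores, and falls back to the
regular 8x-VEC loop when source and destination overlap.]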
From patchwork Mon Mar 29 22:57:53 2021
X-Patchwork-Submitter: Noah Goldstein
X-Patchwork-Id: 1459833

To: libc-alpha@sourceware.org
From: Noah Goldstein
Subject: [PATCH v2 2/2] x86: Expanding test-memmove.c, test-memcpy.c, bench-memcpy-large.c
Date: Mon, 29 Mar 2021 18:57:53 -0400
Message-Id: <20210329225752.235397-2-goldstein.w.n@gmail.com>
In-Reply-To: <20210329225752.235397-1-goldstein.w.n@gmail.com>
References: <20210329225752.235397-1-goldstein.w.n@gmail.com>

No Bug. This commit expands the range of tests / benchmarks for memmove and
memcpy. The test expansion is mostly in the vein of increasing the maximum
size, increasing the number of unique alignments tested, and testing both
source < destination and vice versa. The benchmark expansion is just to
increase the number of unique alignments.

test-memcpy, test-memccpy, test-mempcpy, test-memmove, and
tst-memmove-overflow all pass.
Signed-off-by: noah --- benchtests/bench-memcpy-large.c | 8 +++- string/test-memcpy.c | 61 ++++++++++++++++------------ string/test-memmove.c | 70 ++++++++++++++++++++------------- 3 files changed, 83 insertions(+), 56 deletions(-) diff --git a/benchtests/bench-memcpy-large.c b/benchtests/bench-memcpy-large.c index 3df1575514..efb9627b1e 100644 --- a/benchtests/bench-memcpy-large.c +++ b/benchtests/bench-memcpy-large.c @@ -57,11 +57,11 @@ do_test (json_ctx_t *json_ctx, size_t align1, size_t align2, size_t len) size_t i, j; char *s1, *s2; - align1 &= 63; + align1 &= 4095; if (align1 + len >= page_size) return; - align2 &= 63; + align2 &= 4095; if (align2 + len >= page_size) return; @@ -113,6 +113,10 @@ test_main (void) do_test (&json_ctx, 0, 3, i + 15); do_test (&json_ctx, 3, 0, i + 31); do_test (&json_ctx, 3, 5, i + 63); + do_test (&json_ctx, 0, 127, i); + do_test (&json_ctx, 0, 255, i); + do_test (&json_ctx, 0, 256, i); + do_test (&json_ctx, 0, 4064, i); } json_array_end (&json_ctx); diff --git a/string/test-memcpy.c b/string/test-memcpy.c index 2e9c6bd099..c9dfc88fed 100644 --- a/string/test-memcpy.c +++ b/string/test-memcpy.c @@ -82,11 +82,11 @@ do_test (size_t align1, size_t align2, size_t len) size_t i, j; char *s1, *s2; - align1 &= 63; + align1 &= 4095; if (align1 + len >= page_size) return; - align2 &= 63; + align2 &= 4095; if (align2 + len >= page_size) return; @@ -213,11 +213,9 @@ do_random_tests (void) } static void -do_test1 (void) +do_test1 (size_t size) { - size_t size = 0x100000; void *large_buf; - large_buf = mmap (NULL, size * 2 + page_size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANON, -1, 0); if (large_buf == MAP_FAILED) @@ -233,27 +231,32 @@ do_test1 (void) uint32_t *dest = large_buf; uint32_t *src = large_buf + size + page_size; size_t i; - - for (i = 0; i < arrary_size; i++) - src[i] = (uint32_t) i; - - FOR_EACH_IMPL (impl, 0) + size_t repeats; + for(repeats = 0; repeats < 2; repeats++) { - memset (dest, -1, size); - CALL (impl, (char *) dest, (char *) src, size); for (i = 0; i < arrary_size; i++) - if (dest[i] != src[i]) - { - error (0, 0, - "Wrong result in function %s dst \"%p\" src \"%p\" offset \"%zd\"", - impl->name, dest, src, i); - ret = 1; - break; - } + src[i] = (uint32_t) i; + + FOR_EACH_IMPL (impl, 0) + { + printf ("\t\tRunning: %s\n", impl->name); + memset (dest, -1, size); + CALL (impl, (char *) dest, (char *) src, size); + for (i = 0; i < arrary_size; i++) + if (dest[i] != src[i]) + { + error (0, 0, + "Wrong result in function %s dst \"%p\" src \"%p\" offset \"%zd\"", + impl->name, dest, src, i); + ret = 1; + munmap ((void *) large_buf, size * 2 + page_size); + return; + } + } + dest = src; + src = large_buf; } - - munmap ((void *) dest, size); - munmap ((void *) src, size); + munmap ((void *) large_buf, size * 2 + page_size); } int @@ -275,7 +278,6 @@ test_main (void) do_test (0, i, 1 << i); do_test (i, i, 1 << i); } - for (i = 0; i < 32; ++i) { do_test (0, 0, i); @@ -294,12 +296,19 @@ test_main (void) do_test (i, i, 16 * i); } + for (i = 19; i <= 25; ++i) + { + do_test (255, 0, 1 << i); + do_test (0, 255, i); + do_test (0, 4000, i); + } + do_test (0, 0, getpagesize ()); do_random_tests (); - do_test1 (); - + do_test1 (0x100000); + do_test1 (0x2000000); return ret; } diff --git a/string/test-memmove.c b/string/test-memmove.c index 2e3ce75b9b..ff8099d12f 100644 --- a/string/test-memmove.c +++ b/string/test-memmove.c @@ -247,7 +247,7 @@ do_random_tests (void) } static void -do_test2 (void) +do_test2 (size_t offset) { size_t size = 0x20000000; uint32_t * 
large_buf; @@ -268,33 +268,45 @@ do_test2 (void) } size_t bytes_move = 0x80000000 - (uintptr_t) large_buf; + if (bytes_move + offset * sizeof (uint32_t) > size) + { + munmap ((void *) large_buf, size); + return; + } size_t arr_size = bytes_move / sizeof (uint32_t); size_t i; - - FOR_EACH_IMPL (impl, 0) - { - for (i = 0; i < arr_size; i++) - large_buf[i] = (uint32_t) i; - - uint32_t * dst = &large_buf[33]; - -#ifdef TEST_BCOPY - CALL (impl, (char *) large_buf, (char *) dst, bytes_move); -#else - CALL (impl, (char *) dst, (char *) large_buf, bytes_move); -#endif - - for (i = 0; i < arr_size; i++) - { - if (dst[i] != (uint32_t) i) - { - error (0, 0, - "Wrong result in function %s dst \"%p\" src \"%p\" offset \"%zd\"", - impl->name, dst, large_buf, i); - ret = 1; - break; - } - } + size_t repeats; + uint32_t * src = large_buf; + uint32_t * dst = &large_buf[offset]; + for (repeats = 0; repeats < 2; ++repeats) + { + FOR_EACH_IMPL (impl, 0) + { + for (i = 0; i < arr_size; i++) + src[i] = (uint32_t) i; + + + #ifdef TEST_BCOPY + CALL (impl, (char *) src, (char *) dst, bytes_move); + #else + CALL (impl, (char *) dst, (char *) src, bytes_move); + #endif + + for (i = 0; i < arr_size; i++) + { + if (dst[i] != (uint32_t) i) + { + error (0, 0, + "Wrong result in function %s dst \"%p\" src \"%p\" offset \"%zd\"", + impl->name, dst, large_buf, i); + ret = 1; + munmap ((void *) large_buf, size); + return; + } + } + } + src = dst; + dst = large_buf; } munmap ((void *) large_buf, size); @@ -340,8 +352,10 @@ test_main (void) do_random_tests (); - do_test2 (); - + do_test2 (33); + do_test2 (0x200000); + do_test2 (0x4000000 - 1); + do_test2 (0x4000000); return ret; }
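[Editorial note: as a standalone illustration of the coverage the expanded
tests aim for -- large sizes, controlled sub-page alignments, and copying
in both directions within one mapping -- here is a minimal sketch. It is
not the glibc test harness; the helper name and buffer layout are made up,
and alignments are assumed to be below 4096.]

#define _DEFAULT_SOURCE
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/* Illustrative sketch, not the actual glibc test.
   Returns 1 on success, 0 on failure.  */
static int
check_copy_both_ways (size_t size, size_t align1, size_t align2)
{
  size_t map_size = 2 * size + 3 * 4096;
  char *buf = mmap (NULL, map_size, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (buf == MAP_FAILED)
    return 0;

  /* Two regions in one mapping so source and destination keep a
     controlled page-relative offset (align1/align2), with a page of
     slack between them.  */
  char *lo = buf + 4096 + align1;
  char *hi = buf + 4096 + size + 4096 + align2;

  int ok = 1;
  for (int dir = 0; dir < 2 && ok; ++dir)
    {
      char *src = dir == 0 ? lo : hi;   /* src < dst, then dst < src.  */
      char *dst = dir == 0 ? hi : lo;
      for (size_t i = 0; i < size; ++i)
        src[i] = (char) i;
      memcpy (dst, src, size);
      ok = memcmp (dst, src, size) == 0;
    }
  munmap (buf, map_size);
  return ok;
}

[The added bench/test cases in the patch iterate this kind of check over
alignment pairs such as (0, 127), (0, 255), (0, 256), and (0, 4064), and
over sizes up to 0x2000000.]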