From patchwork Fri Apr 2 04:15:27 2021
To: libc-alpha@sourceware.org
Subject: [PATCH v5 1/2] x86: Update large memcpy case in memmove-vec-unaligned-erms.S
Date: Fri, 2 Apr 2021 00:15:27 -0400
Message-Id: <20210402041528.3391911-1-goldstein.w.n@gmail.com>
From: Noah Goldstein

From: noah

No Bug. This commit updates the large memcpy case (no overlap). The
update is to perform memcpy on either 2 or 4 contiguous pages at
once. This 1) helps to alleviate the effects of false memory aliasing
when destination and source have a close 4k alignment, and 2) is, in
most cases and for most DRAM units, a modestly more efficient access
pattern. These changes are a clear performance improvement for
VEC_SIZE = 16/32, though more ambiguous for VEC_SIZE = 64.

test-memcpy, test-memccpy, test-mempcpy, test-memmove, and
tst-memmove-overflow all pass.

Signed-off-by: Noah Goldstein
---
 .../multiarch/memmove-vec-unaligned-erms.S    | 325 ++++++++++++++----
 1 file changed, 256 insertions(+), 69 deletions(-)
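For illustration (not part of the patch), the dispatch described in
steps 8 and 9 of the updated comment below reduces to the following
stand-alone C sketch. The macros mirror the assembly's PAGE_SIZE,
VEC_SIZE and LOG_4X_MEMCPY_THRESH; `non_temporal_threshold` stands in
for the CPU-tuned __x86_shared_non_temporal_threshold:

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define PAGE_SIZE 4096
#define VEC_SIZE 32		/* e.g. an AVX2 build.  */
#define LOG_4X_MEMCPY_THRESH 4

/* Return nonzero when the copy should take the 4-page path.  The
   aliasing test mirrors `testl $(PAGE_SIZE - VEC_SIZE * 8), %ecx` on
   (dst - src - 1): the masked bits are zero exactly when
   (dst - src) mod PAGE_SIZE is in [1, 8 * VEC_SIZE], i.e. the two
   streams nearly page-alias.  */
static int
use_large_memcpy_4x (uintptr_t dst, uintptr_t src, size_t size,
		     size_t non_temporal_threshold)
{
  uintptr_t diff = dst - src;
  int page_alias = ((diff - 1) & (PAGE_SIZE - VEC_SIZE * 8)) == 0;
  /* Step 9: size >= 16 * threshold also takes the 4-page path.  */
  int very_large = (size >> LOG_4X_MEMCPY_THRESH) >= non_temporal_threshold;
  return page_alias || very_large;
}

int
main (void)
{
  size_t thresh = 1 << 20;	/* Stand-in for the tuned threshold.  */
  /* dst 64 bytes past src modulo the page size: aliasing, 4x path.  */
  printf ("%d\n", use_large_memcpy_4x (0x70000040, 0x70000000,
				       1 << 22, thresh));
  /* Half a page apart and size < 16 * thresh: 2x path.  */
  printf ("%d\n", use_large_memcpy_4x (0x70000800, 0x70000000,
				       1 << 22, thresh));
  return 0;
}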
diff --git a/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
index 897a3d9762..fbb9df2d09 100644
--- a/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
+++ b/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
@@ -35,7 +35,16 @@
    __x86_rep_movsb_stop_threshold, then REP MOVSB will be used.
    7. If size >= __x86_shared_non_temporal_threshold and there is no
       overlap between destination and source, use non-temporal store
-      instead of aligned store.  */
+      instead of aligned store copying from either 2 or 4 pages at
+      once.
+   8. For point 7) if size < 16 * __x86_shared_non_temporal_threshold
+      and source and destination do not page alias, copy from 2 pages
+      at once using non-temporal stores.  Page aliasing in this case is
+      considered true if destination's page alignment - source's page
+      alignment is less than 8 * VEC_SIZE.
+   9. If size >= 16 * __x86_shared_non_temporal_threshold or source
+      and destination do page alias, copy from 4 pages at once using
+      non-temporal stores.  */

 #include <sysdep.h>

@@ -67,6 +76,34 @@
 # endif
 #endif

+#ifndef PAGE_SIZE
+# define PAGE_SIZE 4096
+#endif
+
+#if PAGE_SIZE != 4096
+# error Unsupported PAGE_SIZE
+#endif
+
+#ifndef LOG_PAGE_SIZE
+# define LOG_PAGE_SIZE 12
+#endif
+
+#if PAGE_SIZE != (1 << LOG_PAGE_SIZE)
+# error Invalid LOG_PAGE_SIZE
+#endif
+
+/* Bytes per page for the large_memcpy inner loop.  */
+#if VEC_SIZE == 64
+# define LARGE_LOAD_SIZE (VEC_SIZE * 2)
+#else
+# define LARGE_LOAD_SIZE (VEC_SIZE * 4)
+#endif
+
+/* Amount to shift rdx by to compare for memcpy_large_4x.  */
+#ifndef LOG_4X_MEMCPY_THRESH
+# define LOG_4X_MEMCPY_THRESH 4
+#endif
+
 /* Avoid short distance rep movsb only with non-SSE vector.  */
 #ifndef AVOID_SHORT_DISTANCE_REP_MOVSB
 # define AVOID_SHORT_DISTANCE_REP_MOVSB (VEC_SIZE > 16)
@@ -106,6 +143,28 @@
 # error Unsupported PREFETCH_SIZE!
 #endif

+#if LARGE_LOAD_SIZE == (VEC_SIZE * 2)
+# define LOAD_ONE_SET(base, offset, vec0, vec1, ...) \
+	VMOVU	(offset)base, vec0; \
+	VMOVU	((offset) + VEC_SIZE)base, vec1;
+# define STORE_ONE_SET(base, offset, vec0, vec1, ...) \
+	VMOVNT	vec0, (offset)base; \
+	VMOVNT	vec1, ((offset) + VEC_SIZE)base;
+#elif LARGE_LOAD_SIZE == (VEC_SIZE * 4)
+# define LOAD_ONE_SET(base, offset, vec0, vec1, vec2, vec3) \
+	VMOVU	(offset)base, vec0; \
+	VMOVU	((offset) + VEC_SIZE)base, vec1; \
+	VMOVU	((offset) + VEC_SIZE * 2)base, vec2; \
+	VMOVU	((offset) + VEC_SIZE * 3)base, vec3;
+# define STORE_ONE_SET(base, offset, vec0, vec1, vec2, vec3) \
+	VMOVNT	vec0, (offset)base; \
+	VMOVNT	vec1, ((offset) + VEC_SIZE)base; \
+	VMOVNT	vec2, ((offset) + VEC_SIZE * 2)base; \
+	VMOVNT	vec3, ((offset) + VEC_SIZE * 3)base;
+#else
+# error Invalid LARGE_LOAD_SIZE
+#endif
+
 #ifndef SECTION
 # error SECTION is not defined!
 #endif
@@ -255,7 +314,7 @@ L(return):
 #endif

 L(movsb):
-	cmp	__x86_rep_movsb_stop_threshold(%rip), %RDX_LP
+	cmp	__x86_rep_movsb_stop_threshold(%rip), %RDX_LP
 	jae	L(more_8x_vec)
 	cmpq	%rsi, %rdi
 	jb	1f
@@ -393,6 +452,15 @@ L(last_4x_vec):
 	VZEROUPPER_RETURN

 L(more_8x_vec):
+	/* Check if non-temporal move candidate.  */
+#if (defined USE_MULTIARCH || VEC_SIZE == 16) && IS_IN (libc)
+	/* Check non-temporal store threshold.  */
+	cmp	__x86_shared_non_temporal_threshold(%rip), %RDX_LP
+	ja	L(large_memcpy_2x)
+#endif
+	/* Entry if rdx is greater than non-temporal threshold but there
+	   is overlap.  */
+L(more_8x_vec_check):
 	cmpq	%rsi, %rdi
 	ja	L(more_8x_vec_backward)
 	/* Source == destination is less common.  */
@@ -419,11 +487,6 @@ L(more_8x_vec):
 	subq	%r8, %rdi
 	/* Adjust length.  */
 	addq	%r8, %rdx
-#if (defined USE_MULTIARCH || VEC_SIZE == 16) && IS_IN (libc)
-	/* Check non-temporal store threshold.  */
-	cmp	__x86_shared_non_temporal_threshold(%rip), %RDX_LP
-	ja	L(large_forward)
-#endif
 L(loop_4x_vec_forward):
 	/* Copy 4 * VEC a time forward.  */
 	VMOVU	(%rsi), %VEC(0)
@@ -470,11 +533,6 @@ L(more_8x_vec_backward):
 	subq	%r8, %r9
 	/* Adjust length.  */
 	subq	%r8, %rdx
-#if (defined USE_MULTIARCH || VEC_SIZE == 16) && IS_IN (libc)
-	/* Check non-temporal store threshold.  */
-	cmp	__x86_shared_non_temporal_threshold(%rip), %RDX_LP
-	ja	L(large_backward)
-#endif
 L(loop_4x_vec_backward):
 	/* Copy 4 * VEC a time backward.  */
 	VMOVU	(%rcx), %VEC(0)
@@ -500,72 +558,201 @@ L(loop_4x_vec_backward):
 	VZEROUPPER_RETURN

 #if (defined USE_MULTIARCH || VEC_SIZE == 16) && IS_IN (libc)
-L(large_forward):
+L(large_memcpy_2x):
+	/* Compute absolute value of difference between source and
+	   destination.  */
+	movq	%rdi, %r9
+	subq	%rsi, %r9
+	movq	%r9, %r8
+	leaq	-1(%r9), %rcx
+	sarq	$63, %r8
+	xorq	%r8, %r9
+	subq	%r8, %r9
 	/* Don't use non-temporal store if there is overlap between
-	   destination and source since destination may be in cache
-	   when source is loaded.  */
-	leaq	(%rdi, %rdx), %r10
-	cmpq	%r10, %rsi
-	jb	L(loop_4x_vec_forward)
-L(loop_large_forward):
-	/* Copy 4 * VEC a time forward with non-temporal stores.  */
-	PREFETCH_ONE_SET (1, (%rsi), PREFETCHED_LOAD_SIZE * 2)
-	PREFETCH_ONE_SET (1, (%rsi), PREFETCHED_LOAD_SIZE * 3)
+	   destination and source since destination may be in cache when
+	   source is loaded.  */
+	cmpq	%r9, %rdx
+	ja	L(more_8x_vec_check)
+
+	/* Cache align destination.  First store the first 64 bytes,
+	   then adjust alignments.  */
+	VMOVU	(%rsi), %VEC(8)
+#if VEC_SIZE < 64
+	VMOVU	VEC_SIZE(%rsi), %VEC(9)
+#if VEC_SIZE < 32
+	VMOVU	(VEC_SIZE * 2)(%rsi), %VEC(10)
+	VMOVU	(VEC_SIZE * 3)(%rsi), %VEC(11)
+#endif
+#endif
+	VMOVU	%VEC(8), (%rdi)
+#if VEC_SIZE < 64
+	VMOVU	%VEC(9), VEC_SIZE(%rdi)
+#if VEC_SIZE < 32
+	VMOVU	%VEC(10), (VEC_SIZE * 2)(%rdi)
+	VMOVU	%VEC(11), (VEC_SIZE * 3)(%rdi)
+#endif
+#endif
+	/* Adjust source, destination, and size.  */
+	movq	%rdi, %r8
+	andq	$63, %r8
+	/* Get the negative of offset for alignment.  */
+	subq	$64, %r8
+	/* Adjust source.  */
+	subq	%r8, %rsi
+	/* Adjust destination which should be aligned now.  */
+	subq	%r8, %rdi
+	/* Adjust length.  */
+	addq	%r8, %rdx
+
+	/* Test if source and destination addresses will alias.  If they
+	   do, the larger pipeline in large_memcpy_4x alleviates the
+	   performance drop.  */
+	testl	$(PAGE_SIZE - VEC_SIZE * 8), %ecx
+	jz	L(large_memcpy_4x)
+
+	movq	%rdx, %r10
+	shrq	$LOG_4X_MEMCPY_THRESH, %r10
+	cmp	__x86_shared_non_temporal_threshold(%rip), %r10
+	jae	L(large_memcpy_4x)
+
+	/* edx will store remainder size for copying tail.  */
+	andl	$(PAGE_SIZE * 2 - 1), %edx
+	/* r10 stores outer loop counter.  */
+	shrq	$((LOG_PAGE_SIZE + 1) - LOG_4X_MEMCPY_THRESH), %r10
+	/* Copy 4x VEC at a time from 2 pages.  */
+L(loop_large_memcpy_2x_outer):
+	/* ecx stores inner loop counter.  */
+	movl	$(PAGE_SIZE / LARGE_LOAD_SIZE), %ecx
+L(loop_large_memcpy_2x_inner):
+	PREFETCH_ONE_SET(1, (%rsi), PREFETCHED_LOAD_SIZE)
+	PREFETCH_ONE_SET(1, (%rsi), PREFETCHED_LOAD_SIZE * 2)
+	PREFETCH_ONE_SET(1, (%rsi), PAGE_SIZE + PREFETCHED_LOAD_SIZE)
+	PREFETCH_ONE_SET(1, (%rsi), PAGE_SIZE + PREFETCHED_LOAD_SIZE * 2)
+	/* Load vectors from rsi.  */
+	LOAD_ONE_SET((%rsi), 0, %VEC(0), %VEC(1), %VEC(2), %VEC(3))
+	LOAD_ONE_SET((%rsi), PAGE_SIZE, %VEC(4), %VEC(5), %VEC(6), %VEC(7))
+	addq	$LARGE_LOAD_SIZE, %rsi
+	/* Non-temporal store vectors to rdi.  */
+	STORE_ONE_SET((%rdi), 0, %VEC(0), %VEC(1), %VEC(2), %VEC(3))
+	STORE_ONE_SET((%rdi), PAGE_SIZE, %VEC(4), %VEC(5), %VEC(6), %VEC(7))
+	addq	$LARGE_LOAD_SIZE, %rdi
+	decl	%ecx
+	jnz	L(loop_large_memcpy_2x_inner)
+	addq	$PAGE_SIZE, %rdi
+	addq	$PAGE_SIZE, %rsi
+	decq	%r10
+	jne	L(loop_large_memcpy_2x_outer)
+	sfence
+
+	/* Check if only last 4 loads are needed.  */
+	cmpl	$(VEC_SIZE * 4), %edx
+	jbe	L(large_memcpy_2x_end)
+
+	/* Handle the last 2 * PAGE_SIZE bytes.  Use temporal stores
+	   here.  The region will fit in cache and it should fit user
+	   expectations for the tail of the memcpy region to be hot.  */
+L(loop_large_memcpy_2x_tail):
+	/* Copy 4 * VEC a time forward with temporal stores.  */
+	PREFETCH_ONE_SET (1, (%rsi), PREFETCHED_LOAD_SIZE)
+	PREFETCH_ONE_SET (1, (%rdi), PREFETCHED_LOAD_SIZE)
 	VMOVU	(%rsi), %VEC(0)
 	VMOVU	VEC_SIZE(%rsi), %VEC(1)
 	VMOVU	(VEC_SIZE * 2)(%rsi), %VEC(2)
 	VMOVU	(VEC_SIZE * 3)(%rsi), %VEC(3)
-	addq	$PREFETCHED_LOAD_SIZE, %rsi
-	subq	$PREFETCHED_LOAD_SIZE, %rdx
-	VMOVNT	%VEC(0), (%rdi)
-	VMOVNT	%VEC(1), VEC_SIZE(%rdi)
-	VMOVNT	%VEC(2), (VEC_SIZE * 2)(%rdi)
-	VMOVNT	%VEC(3), (VEC_SIZE * 3)(%rdi)
-	addq	$PREFETCHED_LOAD_SIZE, %rdi
-	cmpq	$PREFETCHED_LOAD_SIZE, %rdx
-	ja	L(loop_large_forward)
-	sfence
+	addq	$(VEC_SIZE * 4), %rsi
+	subl	$(VEC_SIZE * 4), %edx
+	VMOVA	%VEC(0), (%rdi)
+	VMOVA	%VEC(1), VEC_SIZE(%rdi)
+	VMOVA	%VEC(2), (VEC_SIZE * 2)(%rdi)
+	VMOVA	%VEC(3), (VEC_SIZE * 3)(%rdi)
+	addq	$(VEC_SIZE * 4), %rdi
+	cmpl	$(VEC_SIZE * 4), %edx
+	ja	L(loop_large_memcpy_2x_tail)
+
+L(large_memcpy_2x_end):
 	/* Store the last 4 * VEC.  */
-	VMOVU	%VEC(5), (%rcx)
-	VMOVU	%VEC(6), -VEC_SIZE(%rcx)
-	VMOVU	%VEC(7), -(VEC_SIZE * 2)(%rcx)
-	VMOVU	%VEC(8), -(VEC_SIZE * 3)(%rcx)
-	/* Store the first VEC.  */
-	VMOVU	%VEC(4), (%r11)
+	VMOVU	-(VEC_SIZE * 4)(%rsi, %rdx), %VEC(0)
+	VMOVU	-(VEC_SIZE * 3)(%rsi, %rdx), %VEC(1)
+	VMOVU	-(VEC_SIZE * 2)(%rsi, %rdx), %VEC(2)
+	VMOVU	-VEC_SIZE(%rsi, %rdx), %VEC(3)
+
+	VMOVU	%VEC(0), -(VEC_SIZE * 4)(%rdi, %rdx)
+	VMOVU	%VEC(1), -(VEC_SIZE * 3)(%rdi, %rdx)
+	VMOVU	%VEC(2), -(VEC_SIZE * 2)(%rdi, %rdx)
+	VMOVU	%VEC(3), -VEC_SIZE(%rdi, %rdx)
 	VZEROUPPER_RETURN

-L(large_backward):
-	/* Don't use non-temporal store if there is overlap between
-	   destination and source since destination may be in cache
-	   when source is loaded.  */
-	leaq	(%rcx, %rdx), %r10
-	cmpq	%r10, %r9
-	jb	L(loop_4x_vec_backward)
-L(loop_large_backward):
-	/* Copy 4 * VEC a time backward with non-temporal stores.  */
-	PREFETCH_ONE_SET (-1, (%rcx), -PREFETCHED_LOAD_SIZE * 2)
-	PREFETCH_ONE_SET (-1, (%rcx), -PREFETCHED_LOAD_SIZE * 3)
-	VMOVU	(%rcx), %VEC(0)
-	VMOVU	-VEC_SIZE(%rcx), %VEC(1)
-	VMOVU	-(VEC_SIZE * 2)(%rcx), %VEC(2)
-	VMOVU	-(VEC_SIZE * 3)(%rcx), %VEC(3)
-	subq	$PREFETCHED_LOAD_SIZE, %rcx
-	subq	$PREFETCHED_LOAD_SIZE, %rdx
-	VMOVNT	%VEC(0), (%r9)
-	VMOVNT	%VEC(1), -VEC_SIZE(%r9)
-	VMOVNT	%VEC(2), -(VEC_SIZE * 2)(%r9)
-	VMOVNT	%VEC(3), -(VEC_SIZE * 3)(%r9)
-	subq	$PREFETCHED_LOAD_SIZE, %r9
-	cmpq	$PREFETCHED_LOAD_SIZE, %rdx
-	ja	L(loop_large_backward)
+L(large_memcpy_4x):
+	movq	%rdx, %r10
+	/* edx will store remainder size for copying tail.  */
+	andl	$(PAGE_SIZE * 4 - 1), %edx
+	/* r10 stores outer loop counter.  */
+	shrq	$(LOG_PAGE_SIZE + 2), %r10
+	/* Copy 4x VEC at a time from 4 pages.  */
+L(loop_large_memcpy_4x_outer):
+	/* ecx stores inner loop counter.  */
+	movl	$(PAGE_SIZE / LARGE_LOAD_SIZE), %ecx
+L(loop_large_memcpy_4x_inner):
+	/* Only one prefetch set per page as doing 4 pages gives more
+	   time for the prefetcher to keep up.  */
+	PREFETCH_ONE_SET(1, (%rsi), PREFETCHED_LOAD_SIZE)
+	PREFETCH_ONE_SET(1, (%rsi), PAGE_SIZE + PREFETCHED_LOAD_SIZE)
+	PREFETCH_ONE_SET(1, (%rsi), PAGE_SIZE * 2 + PREFETCHED_LOAD_SIZE)
+	PREFETCH_ONE_SET(1, (%rsi), PAGE_SIZE * 3 + PREFETCHED_LOAD_SIZE)
+	/* Load vectors from rsi.  */
+	LOAD_ONE_SET((%rsi), 0, %VEC(0), %VEC(1), %VEC(2), %VEC(3))
+	LOAD_ONE_SET((%rsi), PAGE_SIZE, %VEC(4), %VEC(5), %VEC(6), %VEC(7))
+	LOAD_ONE_SET((%rsi), PAGE_SIZE * 2, %VEC(8), %VEC(9), %VEC(10), %VEC(11))
+	LOAD_ONE_SET((%rsi), PAGE_SIZE * 3, %VEC(12), %VEC(13), %VEC(14), %VEC(15))
+	addq	$LARGE_LOAD_SIZE, %rsi
+	/* Non-temporal store vectors to rdi.  */
+	STORE_ONE_SET((%rdi), 0, %VEC(0), %VEC(1), %VEC(2), %VEC(3))
+	STORE_ONE_SET((%rdi), PAGE_SIZE, %VEC(4), %VEC(5), %VEC(6), %VEC(7))
+	STORE_ONE_SET((%rdi), PAGE_SIZE * 2, %VEC(8), %VEC(9), %VEC(10), %VEC(11))
+	STORE_ONE_SET((%rdi), PAGE_SIZE * 3, %VEC(12), %VEC(13), %VEC(14), %VEC(15))
+	addq	$LARGE_LOAD_SIZE, %rdi
+	decl	%ecx
+	jnz	L(loop_large_memcpy_4x_inner)
+	addq	$(PAGE_SIZE * 3), %rdi
+	addq	$(PAGE_SIZE * 3), %rsi
+	decq	%r10
+	jne	L(loop_large_memcpy_4x_outer)
 	sfence
-	/* Store the first 4 * VEC.  */
-	VMOVU	%VEC(4), (%rdi)
-	VMOVU	%VEC(5), VEC_SIZE(%rdi)
-	VMOVU	%VEC(6), (VEC_SIZE * 2)(%rdi)
-	VMOVU	%VEC(7), (VEC_SIZE * 3)(%rdi)
-	/* Store the last VEC.  */
-	VMOVU	%VEC(8), (%r11)
+
+	/* Check if only last 4 loads are needed.  */
+	cmpl	$(VEC_SIZE * 4), %edx
+	jbe	L(large_memcpy_4x_end)
+
+	/* Handle the last 4 * PAGE_SIZE bytes.  */
+L(loop_large_memcpy_4x_tail):
+	/* Copy 4 * VEC a time forward with temporal stores.  */
+	PREFETCH_ONE_SET (1, (%rsi), PREFETCHED_LOAD_SIZE)
+	PREFETCH_ONE_SET (1, (%rdi), PREFETCHED_LOAD_SIZE)
+	VMOVU	(%rsi), %VEC(0)
+	VMOVU	VEC_SIZE(%rsi), %VEC(1)
+	VMOVU	(VEC_SIZE * 2)(%rsi), %VEC(2)
+	VMOVU	(VEC_SIZE * 3)(%rsi), %VEC(3)
+	addq	$(VEC_SIZE * 4), %rsi
+	subl	$(VEC_SIZE * 4), %edx
+	VMOVA	%VEC(0), (%rdi)
+	VMOVA	%VEC(1), VEC_SIZE(%rdi)
+	VMOVA	%VEC(2), (VEC_SIZE * 2)(%rdi)
+	VMOVA	%VEC(3), (VEC_SIZE * 3)(%rdi)
+	addq	$(VEC_SIZE * 4), %rdi
+	cmpl	$(VEC_SIZE * 4), %edx
+	ja	L(loop_large_memcpy_4x_tail)
+
+L(large_memcpy_4x_end):
+	/* Store the last 4 * VEC.  */
+	VMOVU	-(VEC_SIZE * 4)(%rsi, %rdx), %VEC(0)
+	VMOVU	-(VEC_SIZE * 3)(%rsi, %rdx), %VEC(1)
+	VMOVU	-(VEC_SIZE * 2)(%rsi, %rdx), %VEC(2)
+	VMOVU	-VEC_SIZE(%rsi, %rdx), %VEC(3)
+
+	VMOVU	%VEC(0), -(VEC_SIZE * 4)(%rdi, %rdx)
+	VMOVU	%VEC(1), -(VEC_SIZE * 3)(%rdi, %rdx)
+	VMOVU	%VEC(2), -(VEC_SIZE * 2)(%rdi, %rdx)
+	VMOVU	%VEC(3), -VEC_SIZE(%rdi, %rdx)
 	VZEROUPPER_RETURN
 #endif
 END (MEMMOVE_SYMBOL (__memmove, unaligned_erms))
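To make the new inner loop's access pattern concrete, here is a rough
C model of L(loop_large_memcpy_2x) using SSE2 intrinsics. This is an
illustration, not code from the patch: it assumes dst is at least
16-byte aligned (the assembly aligns it to 64 bytes first) and that
size is a multiple of 2 * PAGE_SIZE; the real code also prefetches and
handles the remainder and tail.

#include <emmintrin.h>	/* SSE2: _mm_loadu_si128, _mm_stream_si128.  */
#include <stddef.h>

#define PAGE_SIZE 4096
#define LARGE_LOAD_SIZE 64	/* 4 x 16-byte vectors, as for VEC_SIZE == 16.  */

static void
copy_2x_pages (char *dst, const char *src, size_t size)
{
  for (size_t block = 0; block < size; block += 2 * PAGE_SIZE)
    {
      /* Each pass copies LARGE_LOAD_SIZE bytes from the same offset in
	 two consecutive pages, so the stream walks both 4 KiB pages in
	 parallel instead of finishing one before starting the next.  */
      for (size_t off = 0; off < PAGE_SIZE; off += LARGE_LOAD_SIZE)
	for (size_t page = 0; page < 2 * PAGE_SIZE; page += PAGE_SIZE)
	  for (size_t v = 0; v < LARGE_LOAD_SIZE; v += 16)
	    {
	      const char *s = src + block + page + off + v;
	      char *d = dst + block + page + off + v;
	      __m128i t = _mm_loadu_si128 ((const __m128i *) s);
	      /* Non-temporal store; requires 16-byte aligned d.  */
	      _mm_stream_si128 ((__m128i *) d, t);
	    }
    }
  _mm_sfence ();	/* Order the non-temporal stores, as the asm does.  */
}

The 4x variant is the same idea spread across four pages, with one
prefetch set per page.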
From patchwork Fri Apr 2 04:15:28 2021

To: libc-alpha@sourceware.org
Subject: [PATCH v5 2/2] x86: Expanding test-memmove.c, test-memcpy.c, bench-memcpy-large.c
Date: Fri, 2 Apr 2021 00:15:28 -0400
Message-Id: <20210402041528.3391911-2-goldstein.w.n@gmail.com>
In-Reply-To: <20210402041528.3391911-1-goldstein.w.n@gmail.com>
References: <20210402041528.3391911-1-goldstein.w.n@gmail.com>
From: Noah Goldstein

From: noah

No Bug. This commit expands the range of tests / benchmarks for
memmove and memcpy. The test expansion mostly increases the maximum
size, increases the number of unique alignments tested, and tests both
source < destination and vice versa. The benchmark expansion just
increases the number of unique alignments.

test-memcpy, test-memccpy, test-mempcpy, test-memmove, and
tst-memmove-overflow all pass.
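The key enabler for the new cases is widening the alignment mask in
do_test from 63 to 4095 (first hunks below), which lets a requested
alignment land anywhere within a page. A small stand-alone C
illustration (not part of the patch):

#include <stdio.h>

int
main (void)
{
  unsigned align2 = 4064;	/* One of the new test arguments.  */
  /* Under the old mask this offset silently collapsed to 32.  */
  printf ("align2 & 63   = %u\n", align2 & 63);
  /* The new mask keeps it near the end of a 4 KiB page.  */
  printf ("align2 & 4095 = %u\n", align2 & 4095);
  return 0;
}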
Signed-off-by: Noah Goldstein
---
 benchtests/bench-memcpy-large.c |  8 +++-
 string/test-memcpy.c            | 61 ++++++++++++++++------------
 string/test-memmove.c           | 70 ++++++++++++++++++++-------------
 3 files changed, 83 insertions(+), 56 deletions(-)

diff --git a/benchtests/bench-memcpy-large.c b/benchtests/bench-memcpy-large.c
index 3df1575514..efb9627b1e 100644
--- a/benchtests/bench-memcpy-large.c
+++ b/benchtests/bench-memcpy-large.c
@@ -57,11 +57,11 @@ do_test (json_ctx_t *json_ctx, size_t align1, size_t align2, size_t len)
   size_t i, j;
   char *s1, *s2;

-  align1 &= 63;
+  align1 &= 4095;
   if (align1 + len >= page_size)
     return;

-  align2 &= 63;
+  align2 &= 4095;
   if (align2 + len >= page_size)
     return;

@@ -113,6 +113,10 @@ test_main (void)
       do_test (&json_ctx, 0, 3, i + 15);
       do_test (&json_ctx, 3, 0, i + 31);
       do_test (&json_ctx, 3, 5, i + 63);
+      do_test (&json_ctx, 0, 127, i);
+      do_test (&json_ctx, 0, 255, i);
+      do_test (&json_ctx, 0, 256, i);
+      do_test (&json_ctx, 0, 4064, i);
     }

   json_array_end (&json_ctx);

diff --git a/string/test-memcpy.c b/string/test-memcpy.c
index 2e9c6bd099..c9dfc88fed 100644
--- a/string/test-memcpy.c
+++ b/string/test-memcpy.c
@@ -82,11 +82,11 @@ do_test (size_t align1, size_t align2, size_t len)
   size_t i, j;
   char *s1, *s2;

-  align1 &= 63;
+  align1 &= 4095;
   if (align1 + len >= page_size)
     return;

-  align2 &= 63;
+  align2 &= 4095;
   if (align2 + len >= page_size)
     return;

@@ -213,11 +213,9 @@ do_random_tests (void)
 }

 static void
-do_test1 (void)
+do_test1 (size_t size)
 {
-  size_t size = 0x100000;
   void *large_buf;
-
   large_buf = mmap (NULL, size * 2 + page_size, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANON, -1, 0);
   if (large_buf == MAP_FAILED)
@@ -233,27 +231,32 @@ do_test1 (void)
   uint32_t *dest = large_buf;
   uint32_t *src = large_buf + size + page_size;
   size_t i;
-
-  for (i = 0; i < arrary_size; i++)
-    src[i] = (uint32_t) i;
-
-  FOR_EACH_IMPL (impl, 0)
+  size_t repeats;
+  for(repeats = 0; repeats < 2; repeats++)
     {
-      memset (dest, -1, size);
-      CALL (impl, (char *) dest, (char *) src, size);
       for (i = 0; i < arrary_size; i++)
-	if (dest[i] != src[i])
-	  {
-	    error (0, 0,
-		   "Wrong result in function %s dst \"%p\" src \"%p\" offset \"%zd\"",
-		   impl->name, dest, src, i);
-	    ret = 1;
-	    break;
-	  }
+	src[i] = (uint32_t) i;
+
+      FOR_EACH_IMPL (impl, 0)
+	{
+	  printf ("\t\tRunning: %s\n", impl->name);
+	  memset (dest, -1, size);
+	  CALL (impl, (char *) dest, (char *) src, size);
+	  for (i = 0; i < arrary_size; i++)
+	    if (dest[i] != src[i])
+	      {
+		error (0, 0,
+		       "Wrong result in function %s dst \"%p\" src \"%p\" offset \"%zd\"",
+		       impl->name, dest, src, i);
+		ret = 1;
+		munmap ((void *) large_buf, size * 2 + page_size);
+		return;
+	      }
+	}
+      dest = src;
+      src = large_buf;
     }
-
-  munmap ((void *) dest, size);
-  munmap ((void *) src, size);
+  munmap ((void *) large_buf, size * 2 + page_size);
 }

 int
@@ -275,7 +278,6 @@ test_main (void)
       do_test (0, i, 1 << i);
       do_test (i, i, 1 << i);
     }
-
   for (i = 0; i < 32; ++i)
     {
       do_test (0, 0, i);
@@ -294,12 +296,19 @@ test_main (void)
       do_test (i, i, 16 * i);
     }

+  for (i = 19; i <= 25; ++i)
+    {
+      do_test (255, 0, 1 << i);
+      do_test (0, 255, i);
+      do_test (0, 4000, i);
+    }
+
   do_test (0, 0, getpagesize ());

   do_random_tests ();

-  do_test1 ();
-
+  do_test1 (0x100000);
+  do_test1 (0x2000000);
   return ret;
 }

diff --git a/string/test-memmove.c b/string/test-memmove.c
index 2e3ce75b9b..ff8099d12f 100644
--- a/string/test-memmove.c
+++ b/string/test-memmove.c
@@ -247,7 +247,7 @@ do_random_tests (void)
 }

 static void
-do_test2 (void)
+do_test2 (size_t offset)
 {
   size_t size = 0x20000000;
   uint32_t * large_buf;
@@ -268,33 +268,45 @@ do_test2 (void)
     }

   size_t bytes_move = 0x80000000 - (uintptr_t) large_buf;
+  if (bytes_move + offset * sizeof (uint32_t) > size)
+    {
+      munmap ((void *) large_buf, size);
+      return;
+    }
   size_t arr_size = bytes_move / sizeof (uint32_t);
   size_t i;
-
-  FOR_EACH_IMPL (impl, 0)
-    {
-      for (i = 0; i < arr_size; i++)
-        large_buf[i] = (uint32_t) i;
-
-      uint32_t * dst = &large_buf[33];
-
-#ifdef TEST_BCOPY
-      CALL (impl, (char *) large_buf, (char *) dst, bytes_move);
-#else
-      CALL (impl, (char *) dst, (char *) large_buf, bytes_move);
-#endif
-
-      for (i = 0; i < arr_size; i++)
-	{
-	  if (dst[i] != (uint32_t) i)
-	    {
-	      error (0, 0,
-		     "Wrong result in function %s dst \"%p\" src \"%p\" offset \"%zd\"",
-		     impl->name, dst, large_buf, i);
-	      ret = 1;
-	      break;
-	    }
-	}
+  size_t repeats;
+  uint32_t * src = large_buf;
+  uint32_t * dst = &large_buf[offset];
+  for (repeats = 0; repeats < 2; ++repeats)
+    {
+      FOR_EACH_IMPL (impl, 0)
+	{
+	  for (i = 0; i < arr_size; i++)
+	    src[i] = (uint32_t) i;
+
+
+	  #ifdef TEST_BCOPY
+	  CALL (impl, (char *) src, (char *) dst, bytes_move);
+	  #else
+	  CALL (impl, (char *) dst, (char *) src, bytes_move);
+	  #endif
+
+	  for (i = 0; i < arr_size; i++)
+	    {
+	      if (dst[i] != (uint32_t) i)
+		{
+		  error (0, 0,
+			 "Wrong result in function %s dst \"%p\" src \"%p\" offset \"%zd\"",
+			 impl->name, dst, large_buf, i);
+		  ret = 1;
+		  munmap ((void *) large_buf, size);
+		  return;
+		}
+	    }
+	}
+      src = dst;
+      dst = large_buf;
     }

   munmap ((void *) large_buf, size);
@@ -340,8 +352,10 @@ test_main (void)

   do_random_tests ();

-  do_test2 ();
-
+  do_test2 (33);
+  do_test2 (0x200000);
+  do_test2 (0x4000000 - 1);
+  do_test2 (0x4000000);
   return ret;
 }
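For context on the do_test2 numbers: bytes_move is measured up to
address 0x80000000, so the copies appear designed to end exactly at
the 2 GiB boundary, where any accidental 32-bit pointer or index
arithmetic in an implementation would wrap. A small stand-alone
illustration (not part of the test; the address is hypothetical):

#include <stdint.h>
#include <stdio.h>

int
main (void)
{
  uintptr_t buf = 0x7ff00000;			/* Just below 2 GiB.  */
  size_t bytes_move = 0x80000000 - buf;		/* Copy ends at 2 GiB.  */
  uintptr_t end = buf + bytes_move;		/* 0x80000000.  */
  /* Truncated to 32 signed bits this wraps negative on the usual
     two's-complement ABIs -- the class of bug the boundary placement
     is meant to catch.  */
  printf ("end = %#lx, as int32_t = %d\n", (unsigned long) end,
	  (int) (int32_t) end);
  return 0;
}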