From patchwork Sat Apr 17 02:52:16 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Noah Goldstein X-Patchwork-Id: 1467456 Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@bilbo.ozlabs.org Authentication-Results: ozlabs.org; spf=pass (sender SPF authorized) smtp.mailfrom=sourceware.org (client-ip=8.43.85.97; helo=sourceware.org; envelope-from=libc-alpha-bounces@sourceware.org; receiver=) Authentication-Results: ozlabs.org; dkim=pass (1024-bit key; secure) header.d=sourceware.org header.i=@sourceware.org header.a=rsa-sha256 header.s=default header.b=nTzBjE9+; dkim-atps=neutral Received: from sourceware.org (ip-8-43-85-97.sourceware.org [8.43.85.97]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256) (No client certificate requested) by ozlabs.org (Postfix) with ESMTPS id 4FMd1m310Xz9vGh for ; Sat, 17 Apr 2021 12:53:32 +1000 (AEST) Received: from server2.sourceware.org (localhost [IPv6:::1]) by sourceware.org (Postfix) with ESMTP id BE53D3835829; Sat, 17 Apr 2021 02:53:28 +0000 (GMT) DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org BE53D3835829 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=sourceware.org; s=default; t=1618628008; bh=TWuOaglbVM3tEOzxLrgEm35CViSmqpGvYy1rYmNe1pc=; h=To:Subject:Date:List-Id:List-Unsubscribe:List-Archive:List-Post: List-Help:List-Subscribe:From:Reply-To:From; b=nTzBjE9+eNlcLShjOSQrC0yKIcbVH+eDyT29EiD7hHECQOik0tKdg3GIJF7opzrUL TtlWZAm6U9uYEyXzZlkiRerufBkeUUWnK0aIXIXd0OQz1CuSGgaLG4EBMXJIITYlKw VnnFeUIJr7Fm4CdUAAe5+0QPu8aB40m2tat2nR7w= X-Original-To: libc-alpha@sourceware.org Delivered-To: libc-alpha@sourceware.org Received: from mail-qt1-x82c.google.com (mail-qt1-x82c.google.com [IPv6:2607:f8b0:4864:20::82c]) by sourceware.org (Postfix) with ESMTPS id 781B3385782D for ; Sat, 17 Apr 2021 02:53:23 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.3.2 sourceware.org 781B3385782D Received: by mail-qt1-x82c.google.com with SMTP id h7so22298156qtx.3 for ; Fri, 16 Apr 2021 19:53:23 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:mime-version :content-transfer-encoding; bh=TWuOaglbVM3tEOzxLrgEm35CViSmqpGvYy1rYmNe1pc=; b=W8eVNxifCCeKoxy4OGZ+OCc7ZDnoj0lCLR3bDSjAnSfr8BjrpAn9qZ+f5fCubLbg7G /j1CVjd05nW1FllOB4XoY06mr3d4WeCN7bscwse7xfm7zrE94JhQFfGOqQAUzxS0blYC lNtY8fIAj7mN5MZ9Xf2+KHS3+pOBDFVGNjQHVghIAZNqKAhlQkfAmNnH5ap+v9sEcd4o 084uSQ4x8ixUK6SNr1dh+X5ap44EWjzoMEnOr3FI7mG3Uimvgt9FLTIecHY+MjuwA9Q8 c7Tc2MTqL9+cTruV5gQWfBgfoxoBzY6iD3GQxHpjdntnjQP7Tg3ObJce2xyJBZYCL1um az4g== X-Gm-Message-State: AOAM533rQNCTntSB4mf+n39m1PPbOctmS16hdMcVR1zAXtIR3gYvhW+B jmtKzNISV+sCj11pekDMl3EKgyekSVY= X-Google-Smtp-Source: ABdhPJwrHGhkGqv7ghtxcFMMy4KNxv5K7BwMyqDx7q/n8nOur2PKj0B7fwyg7H7Y/s2M8Ez689UnAw== X-Received: by 2002:ac8:7084:: with SMTP id y4mr1894602qto.171.1618628002183; Fri, 16 Apr 2021 19:53:22 -0700 (PDT) Received: from localhost.localdomain (pool-71-245-178-39.pitbpa.fios.verizon.net. [71.245.178.39]) by smtp.googlemail.com with ESMTPSA id j15sm5215960qtl.88.2021.04.16.19.53.21 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 16 Apr 2021 19:53:21 -0700 (PDT) To: libc-alpha@sourceware.org Subject: [PATCH v1 1/2] x86: Optimize strlen-evex.S Date: Fri, 16 Apr 2021 22:52:16 -0400 Message-Id: <20210417025215.874105-1-goldstein.w.n@gmail.com> X-Mailer: git-send-email 2.29.2 MIME-Version: 1.0 X-Spam-Status: No, score=-12.4 required=5.0 tests=BAYES_00, DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, FREEMAIL_FROM, GIT_PATCH_0, RCVD_IN_DNSWL_NONE, SPF_HELO_NONE, SPF_PASS, TXREP autolearn=ham autolearn_force=no version=3.4.2 X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on server2.sourceware.org X-BeenThere: libc-alpha@sourceware.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Libc-alpha mailing list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-Patchwork-Original-From: Noah Goldstein via Libc-alpha From: Noah Goldstein Reply-To: Noah Goldstein Errors-To: libc-alpha-bounces@sourceware.org Sender: "Libc-alpha" No bug. This commit optimizes strlen-evex.S. The optimizations are mostly small things but they add up to roughly 10-30% performance improvement for strlen. The results for strnlen are bit more ambiguous. test-strlen, test-strnlen, test-wcslen, and test-wcsnlen are all passing. Signed-off-by: Noah Goldstein --- Tests where run on the following CPUs: Skylake: https://ark.intel.com/content/www/us/en/ark/products/149091/intel-core-i7-8565u-processor-8m-cache-up-to-4-60-ghz.html Icelake: https://ark.intel.com/content/www/us/en/ark/products/196597/intel-core-i7-1065g7-processor-8m-cache-up-to-3-90-ghz.html Tigerlake: https://ark.intel.com/content/www/us/en/ark/products/208921/intel-core-i7-1165g7-processor-12m-cache-up-to-4-70-ghz-with-ipu.html All times are the geometric mean of N=20. The unit of time is seconds. "Cur" refers to the current implementation "New" refers to this patches implementation The strlen numbers are universal improvements: Results For Skylake strlen-avx2 size, algn, Cur T , New T , Win , Dif 1 , 0 , 4.76 , 4.27 , New , 0.49 2 , 0 , 4.77 , 4.165 , New , 0.6 3 , 0 , 4.617 , 4.095 , New , 0.52 4 , 0 , 4.579 , 4.006 , New , 0.57 5 , 0 , 4.608 , 4.008 , New , 0.6 6 , 0 , 4.655 , 4.086 , New , 0.57 7 , 0 , 4.661 , 4.071 , New , 0.59 8 , 0 , 4.625 , 4.092 , New , 0.53 16 , 0 , 4.608 , 4.021 , New , 0.59 10 , 0 , 4.645 , 4.111 , New , 0.53 32 , 0 , 5.532 , 4.817 , New , 0.71 21 , 0 , 4.636 , 3.775 , New , 0.86 64 , 0 , 5.991 , 5.352 , New , 0.64 42 , 0 , 5.529 , 4.789 , New , 0.74 128 , 0 , 7.042 , 5.473 , New , 1.57 85 , 0 , 6.118 , 5.466 , New , 0.65 256 , 0 , 10.64 , 7.954 , New , 2.69 170 , 0 , 9.918 , 9.585 , New , 0.33 512 , 0 , 12.916, 10.242, New , 2.67 341 , 0 , 10.764, 10.216, New , 0.55 1024, 0 , 18.163, 14.844, New , 3.32 682 , 0 , 15.292, 13.382, New , 1.91 2048, 0 , 38.732, 24.396, New , 14.34 1365, 0 , 22.299, 20.08 , New , 2.22 4096, 0 , 79.054, 68.682, New , 10.37 2730, 0 , 61.47 , 40.705, New , 20.77 Results For Icelake strlen-avx2 size, algn, Cur T , New T , Win , Dif 1 , 0 , 2.681 , 1.99 , New , 0.69 2 , 0 , 2.823 , 2.232 , New , 0.59 3 , 0 , 2.57 , 2.077 , New , 0.49 4 , 0 , 2.659 , 2.128 , New , 0.53 5 , 0 , 2.666 , 2.109 , New , 0.56 6 , 0 , 2.596 , 2.053 , New , 0.54 7 , 0 , 2.623 , 2.152 , New , 0.47 8 , 0 , 2.675 , 2.178 , New , 0.5 16 , 0 , 2.675 , 2.202 , New , 0.47 10 , 0 , 2.672 , 2.2 , New , 0.47 32 , 0 , 3.383 , 2.868 , New , 0.52 21 , 0 , 2.693 , 2.032 , New , 0.66 64 , 0 , 3.404 , 3.056 , New , 0.35 42 , 0 , 3.511 , 2.967 , New , 0.54 128 , 0 , 4.191 , 3.627 , New , 0.56 85 , 0 , 3.559 , 2.922 , New , 0.64 256 , 0 , 6.782 , 5.493 , New , 1.29 170 , 0 , 6.24 , 4.988 , New , 1.25 512 , 0 , 9.305 , 7.308 , New , 2.0 341 , 0 , 7.626 , 6.272 , New , 1.35 1024, 0 , 14.455, 11.544, New , 2.91 682 , 0 , 10.728, 8.738 , New , 1.99 2048, 0 , 24.171, 24.101, New , 0.07 1365, 0 , 17.474, 14.387, New , 3.09 4096, 0 , 57.659, 51.675, New , 5.98 2730, 0 , 44.702, 28.04 , New , 16.66 Results For Tigerlake strlen-avx2 size, algn, Cur T , New T , Win , Dif 1 , 0 , 4.369 , 3.008 , New , 1.36 2 , 0 , 4.054 , 3.231 , New , 0.82 3 , 0 , 4.081 , 3.243 , New , 0.84 4 , 0 , 3.904 , 3.17 , New , 0.73 5 , 0 , 3.915 , 3.178 , New , 0.74 6 , 0 , 3.924 , 3.184 , New , 0.74 7 , 0 , 3.917 , 3.177 , New , 0.74 8 , 0 , 3.889 , 3.209 , New , 0.68 16 , 0 , 3.878 , 3.03 , New , 0.85 10 , 0 , 3.892 , 3.004 , New , 0.89 32 , 0 , 4.957 , 4.162 , New , 0.79 21 , 0 , 3.866 , 3.18 , New , 0.69 64 , 0 , 5.035 , 4.521 , New , 0.51 42 , 0 , 5.039 , 4.276 , New , 0.76 128 , 0 , 6.117 , 5.253 , New , 0.86 85 , 0 , 4.932 , 4.421 , New , 0.51 256 , 0 , 10.019, 8.221 , New , 1.8 170 , 0 , 8.954 , 7.404 , New , 1.55 512 , 0 , 14.071, 10.948, New , 3.12 341 , 0 , 11.177, 9.246 , New , 1.93 1024, 0 , 21.808, 17.034, New , 4.77 682 , 0 , 16.07 , 12.941, New , 3.13 2048, 0 , 37.332, 29.853, New , 7.48 1365, 0 , 26.394, 21.516, New , 4.88 4096, 0 , 87.951, 80.35 , New , 7.6 2730, 0 , 62.768, 44.247, New , 18.52 Results For Icelake strlen-evex size, algn, Cur T , New T , Win , Dif 1 , 0 , 2.681 , 1.99 , New , 0.69 2 , 0 , 2.823 , 2.232 , New , 0.59 3 , 0 , 2.57 , 2.077 , New , 0.49 4 , 0 , 2.659 , 2.128 , New , 0.53 5 , 0 , 2.666 , 2.109 , New , 0.56 6 , 0 , 2.596 , 2.053 , New , 0.54 7 , 0 , 2.623 , 2.152 , New , 0.47 8 , 0 , 2.675 , 2.178 , New , 0.5 16 , 0 , 2.675 , 2.202 , New , 0.47 10 , 0 , 2.672 , 2.2 , New , 0.47 32 , 0 , 3.383 , 2.868 , New , 0.52 21 , 0 , 2.693 , 2.032 , New , 0.66 64 , 0 , 3.404 , 3.056 , New , 0.35 42 , 0 , 3.511 , 2.967 , New , 0.54 128 , 0 , 4.191 , 3.627 , New , 0.56 85 , 0 , 3.559 , 2.922 , New , 0.64 256 , 0 , 6.782 , 5.493 , New , 1.29 170 , 0 , 6.24 , 4.988 , New , 1.25 512 , 0 , 9.305 , 7.308 , New , 2.0 341 , 0 , 7.626 , 6.272 , New , 1.35 1024, 0 , 14.455, 11.544, New , 2.91 682 , 0 , 10.728, 8.738 , New , 1.99 2048, 0 , 24.171, 24.101, New , 0.07 1365, 0 , 17.474, 14.387, New , 3.09 4096, 0 , 57.659, 51.675, New , 5.98 2730, 0 , 44.702, 28.04 , New , 16.66 Results For Tigerlake strlen-evex size, algn, Cur T , New T , Win , Dif 1 , 0 , 4.369 , 3.008 , New , 1.36 2 , 0 , 4.054 , 3.231 , New , 0.82 3 , 0 , 4.081 , 3.243 , New , 0.84 4 , 0 , 3.904 , 3.17 , New , 0.73 5 , 0 , 3.915 , 3.178 , New , 0.74 6 , 0 , 3.924 , 3.184 , New , 0.74 7 , 0 , 3.917 , 3.177 , New , 0.74 8 , 0 , 3.889 , 3.209 , New , 0.68 16 , 0 , 3.878 , 3.03 , New , 0.85 10 , 0 , 3.892 , 3.004 , New , 0.89 32 , 0 , 4.957 , 4.162 , New , 0.79 21 , 0 , 3.866 , 3.18 , New , 0.69 64 , 0 , 5.035 , 4.521 , New , 0.51 42 , 0 , 5.039 , 4.276 , New , 0.76 128 , 0 , 6.117 , 5.253 , New , 0.86 85 , 0 , 4.932 , 4.421 , New , 0.51 256 , 0 , 10.019, 8.221 , New , 1.8 170 , 0 , 8.954 , 7.404 , New , 1.55 512 , 0 , 14.071, 10.948, New , 3.12 341 , 0 , 11.177, 9.246 , New , 1.93 1024, 0 , 21.808, 17.034, New , 4.77 682 , 0 , 16.07 , 12.941, New , 3.13 2048, 0 , 37.332, 29.853, New , 7.48 1365, 0 , 26.394, 21.516, New , 4.88 4096, 0 , 87.951, 80.35 , New , 7.6 2730, 0 , 62.768, 44.247, New , 18.52 The strnlen numbers are a bit more of a mixed bag but I think generally positive. Its possible that the current version should be kept. Let me know. Results For Skylake strnlen-avx2 size, algn, Cur T , Sub T , Win , Dif 1 , 0 , 4.06 , 4.1 , Cur , 0.04 2 , 0 , 4.15 , 4.08 , New , 0.07 3 , 0 , 4.1 , 4.03 , New , 0.07 4 , 0 , 3.95 , 3.91 , New , 0.04 5 , 0 , 4.07 , 3.9 , New , 0.17 6 , 0 , 4.04 , 3.92 , New , 0.12 7 , 0 , 4.03 , 3.89 , New , 0.14 1 , 1 , 3.75 , 3.79 , Cur , 0.04 2 , 2 , 4.0 , 3.91 , New , 0.09 3 , 3 , 4.04 , 3.92 , New , 0.12 4 , 4 , 3.95 , 3.86 , New , 0.09 5 , 5 , 3.97 , 3.91 , New , 0.06 6 , 6 , 3.96 , 3.92 , New , 0.04 7 , 7 , 3.97 , 3.91 , New , 0.06 4 , 1 , 3.76 , 3.83 , Cur , 0.07 8 , 0 , 3.73 , 4.01 , Cur , 0.28 8 , 1 , 3.73 , 3.88 , Cur , 0.15 16 , 0 , 3.68 , 3.84 , Cur , 0.16 16 , 1 , 3.75 , 3.92 , Cur , 0.17 32 , 0 , 5.93 , 5.95 , Cur , 0.02 32 , 1 , 5.95 , 5.98 , Cur , 0.03 64 , 0 , 6.46 , 5.31 , New , 1.15 64 , 1 , 6.66 , 5.43 , New , 1.23 128 , 0 , 7.42 , 6.02 , New , 1.4 128 , 1 , 7.57 , 5.81 , New , 1.76 256 , 0 , 12.02 , 9.89 , New , 2.13 256 , 1 , 11.91 , 9.84 , New , 2.07 512 , 0 , 15.06 , 11.77 , New , 3.29 512 , 1 , 14.79 , 11.75 , New , 3.04 1024, 0 , 23.61 , 16.98 , New , 6.63 1024, 1 , 23.63 , 16.91 , New , 6.72 Results For Icelake strnlen-avx2 size, algn, Cur T , Sub T , Win , Dif 1 , 0 , 2.81 , 2.51 , New , 0.3 2 , 0 , 2.8 , 2.53 , New , 0.27 3 , 0 , 2.7 , 2.57 , New , 0.13 4 , 0 , 2.68 , 2.55 , New , 0.13 5 , 0 , 2.7 , 2.57 , New , 0.13 6 , 0 , 2.73 , 2.6 , New , 0.13 7 , 0 , 2.69 , 2.61 , New , 0.08 1 , 1 , 2.53 , 2.5 , New , 0.03 2 , 2 , 2.67 , 2.6 , New , 0.07 3 , 3 , 2.67 , 2.59 , New , 0.08 4 , 4 , 2.66 , 2.57 , New , 0.09 5 , 5 , 2.65 , 2.56 , New , 0.09 6 , 6 , 2.67 , 2.59 , New , 0.08 7 , 7 , 2.65 , 2.62 , New , 0.03 4 , 1 , 2.65 , 2.41 , New , 0.24 8 , 0 , 2.68 , 2.56 , New , 0.12 8 , 1 , 2.62 , 2.55 , New , 0.07 16 , 0 , 2.66 , 2.56 , New , 0.1 16 , 1 , 2.63 , 2.55 , New , 0.08 32 , 0 , 3.62 , 3.19 , New , 0.43 32 , 1 , 3.74 , 3.45 , New , 0.29 64 , 0 , 3.9 , 3.7 , New , 0.2 64 , 1 , 4.13 , 3.68 , New , 0.45 128 , 0 , 4.34 , 4.17 , New , 0.17 128 , 1 , 4.59 , 4.07 , New , 0.52 256 , 0 , 6.74 , 6.56 , New , 0.18 256 , 1 , 7.34 , 7.13 , New , 0.21 512 , 0 , 9.64 , 8.67 , New , 0.97 512 , 1 , 9.49 , 8.56 , New , 0.93 1024, 0 , 13.57 , 12.35 , New , 1.22 1024, 1 , 13.57 , 12.59 , New , 0.98 Results For Tigerlake strnlen-avx2 size, algn, Cur T , Sub T , Win , Dif 1 , 0 , 4.21 , 3.91 , New , 0.3 2 , 0 , 4.1 , 3.79 , New , 0.31 3 , 0 , 4.02 , 3.81 , New , 0.21 4 , 0 , 4.06 , 3.82 , New , 0.24 5 , 0 , 4.1 , 3.81 , New , 0.29 6 , 0 , 4.08 , 3.82 , New , 0.26 7 , 0 , 4.07 , 3.87 , New , 0.2 1 , 1 , 3.95 , 3.8 , New , 0.15 2 , 2 , 4.11 , 3.88 , New , 0.23 3 , 3 , 4.08 , 3.88 , New , 0.2 4 , 4 , 4.05 , 3.94 , New , 0.11 5 , 5 , 4.02 , 3.89 , New , 0.13 6 , 6 , 4.02 , 3.89 , New , 0.13 7 , 7 , 4.08 , 3.84 , New , 0.24 4 , 1 , 4.07 , 3.7 , New , 0.37 8 , 0 , 4.08 , 3.95 , New , 0.13 8 , 1 , 4.01 , 4.02 , Cur , 0.01 16 , 0 , 4.03 , 4.03 , Eq , 0.0 16 , 1 , 4.05 , 4.0 , New , 0.05 32 , 0 , 5.86 , 5.23 , New , 0.63 32 , 1 , 5.88 , 5.36 , New , 0.52 64 , 0 , 6.38 , 5.73 , New , 0.65 64 , 1 , 6.49 , 5.56 , New , 0.93 128 , 0 , 7.17 , 6.39 , New , 0.78 128 , 1 , 7.1 , 6.41 , New , 0.69 256 , 0 , 11.65 , 11.0 , New , 0.65 256 , 1 , 11.37 , 10.97 , New , 0.4 512 , 0 , 14.86 , 13.43 , New , 1.43 512 , 1 , 14.63 , 13.35 , New , 1.28 1024, 0 , 20.92 , 19.33 , New , 1.59 1024, 1 , 20.85 , 19.38 , New , 1.47 Results For Icelake strnlen-evex size, algn, Cur T , Sub T , Win , Dif 1 , 0 , 2.9 , 2.66 , New , 0.24 2 , 0 , 2.99 , 2.72 , New , 0.27 3 , 0 , 2.93 , 2.64 , New , 0.29 4 , 0 , 2.83 , 2.55 , New , 0.28 5 , 0 , 2.92 , 2.64 , New , 0.28 6 , 0 , 2.95 , 2.64 , New , 0.31 7 , 0 , 2.91 , 2.65 , New , 0.26 1 , 1 , 2.63 , 2.49 , New , 0.14 2 , 2 , 2.89 , 2.6 , New , 0.29 3 , 3 , 2.89 , 2.59 , New , 0.3 4 , 4 , 2.9 , 2.58 , New , 0.32 5 , 5 , 2.87 , 2.57 , New , 0.3 6 , 6 , 2.9 , 2.57 , New , 0.33 7 , 7 , 2.88 , 2.64 , New , 0.24 4 , 1 , 2.65 , 2.39 , New , 0.26 8 , 0 , 2.85 , 2.57 , New , 0.28 8 , 1 , 2.62 , 2.4 , New , 0.22 16 , 0 , 2.83 , 2.56 , New , 0.27 16 , 1 , 2.63 , 2.39 , New , 0.24 32 , 0 , 3.95 , 3.06 , New , 0.89 32 , 1 , 3.95 , 3.15 , New , 0.8 64 , 0 , 3.98 , 3.6 , New , 0.38 64 , 1 , 3.88 , 3.48 , New , 0.4 128 , 0 , 4.45 , 4.19 , New , 0.26 128 , 1 , 4.57 , 4.21 , New , 0.36 256 , 0 , 6.75 , 6.97 , Cur , 0.22 256 , 1 , 7.55 , 7.76 , Cur , 0.21 512 , 0 , 9.75 , 10.09 , Cur , 0.34 512 , 1 , 9.84 , 10.13 , Cur , 0.29 1024, 0 , 14.45 , 14.4 , New , 0.05 1024, 1 , 14.39 , 14.26 , New , 0.13 Results For Tigerlake strnlen-evex size, algn, Cur T , Sub T , Win , Dif 1 , 0 , 3.86 , 3.59 , New , 0.27 2 , 0 , 3.78 , 3.41 , New , 0.37 3 , 0 , 3.69 , 3.4 , New , 0.29 4 , 0 , 3.62 , 3.33 , New , 0.29 5 , 0 , 3.76 , 3.37 , New , 0.39 6 , 0 , 3.73 , 3.39 , New , 0.34 7 , 0 , 3.7 , 3.4 , New , 0.3 1 , 1 , 3.58 , 3.35 , New , 0.23 2 , 2 , 3.75 , 3.34 , New , 0.41 3 , 3 , 3.72 , 3.39 , New , 0.33 4 , 4 , 3.69 , 3.38 , New , 0.31 5 , 5 , 3.69 , 3.37 , New , 0.32 6 , 6 , 3.68 , 3.37 , New , 0.31 7 , 7 , 3.74 , 3.35 , New , 0.39 4 , 1 , 3.39 , 3.27 , New , 0.12 8 , 0 , 3.4 , 3.29 , New , 0.11 8 , 1 , 3.34 , 3.32 , New , 0.02 16 , 0 , 3.36 , 3.34 , New , 0.02 16 , 1 , 3.39 , 3.3 , New , 0.09 32 , 0 , 5.13 , 5.13 , Eq , 0.0 32 , 1 , 5.18 , 5.16 , New , 0.02 64 , 0 , 5.87 , 5.44 , New , 0.43 64 , 1 , 5.97 , 5.44 , New , 0.53 128 , 0 , 7.14 , 6.48 , New , 0.66 128 , 1 , 7.08 , 6.63 , New , 0.45 256 , 0 , 11.68 , 12.57 , Cur , 0.89 256 , 1 , 11.67 , 12.23 , Cur , 0.56 512 , 0 , 15.64 , 15.74 , Cur , 0.1 512 , 1 , 15.52 , 15.69 , Cur , 0.17 1024, 0 , 23.02 , 22.57 , New , 0.45 1024, 1 , 23.0 , 22.8 , New , 0.2 sysdeps/x86_64/multiarch/strlen-evex.S | 569 +++++++++++++------------ 1 file changed, 307 insertions(+), 262 deletions(-) diff --git a/sysdeps/x86_64/multiarch/strlen-evex.S b/sysdeps/x86_64/multiarch/strlen-evex.S index 0583819078..d1aafac76f 100644 --- a/sysdeps/x86_64/multiarch/strlen-evex.S +++ b/sysdeps/x86_64/multiarch/strlen-evex.S @@ -29,11 +29,13 @@ # ifdef USE_AS_WCSLEN # define VPCMP vpcmpd # define VPMINU vpminud -# define SHIFT_REG r9d +# define SHIFT_REG ecx +# define CHAR_SIZE 4 # else # define VPCMP vpcmpb # define VPMINU vpminub -# define SHIFT_REG ecx +# define SHIFT_REG edx +# define CHAR_SIZE 1 # endif # define XMMZERO xmm16 @@ -46,132 +48,169 @@ # define YMM6 ymm22 # define VEC_SIZE 32 +# define PAGE_SIZE 4096 +# define CHAR_PER_VEC (VEC_SIZE / CHAR_SIZE) .section .text.evex,"ax",@progbits ENTRY (STRLEN) # ifdef USE_AS_STRNLEN - /* Check for zero length. */ + /* Check zero length. */ test %RSI_LP, %RSI_LP jz L(zero) -# ifdef USE_AS_WCSLEN - shl $2, %RSI_LP -# elif defined __ILP32__ +# if !defined USE_AS_WCSLEN && defined __ILP32__ /* Clear the upper 32 bits. */ movl %esi, %esi # endif mov %RSI_LP, %R8_LP # endif - movl %edi, %ecx - movq %rdi, %rdx + movl %edi, %eax + sall $20, %eax vpxorq %XMMZERO, %XMMZERO, %XMMZERO - /* Check if we may cross page boundary with one vector load. */ - andl $(2 * VEC_SIZE - 1), %ecx - cmpl $VEC_SIZE, %ecx - ja L(cros_page_boundary) + cmpl $((PAGE_SIZE - VEC_SIZE) << 20), %eax + ja L(cross_page_boundary) - /* Check the first VEC_SIZE bytes. Each bit in K0 represents a - null byte. */ + /* Check the first VEC_SIZE bytes. */ VPCMP $0, (%rdi), %YMMZERO, %k0 kmovd %k0, %eax - testl %eax, %eax - # ifdef USE_AS_STRNLEN - jnz L(first_vec_x0_check) - /* Adjust length and check the end of data. */ - subq $VEC_SIZE, %rsi - jbe L(max) -# else - jnz L(first_vec_x0) + /* If length < VEC_SIZE handle special. */ + cmpq $CHAR_PER_VEC, %rsi + jbe L(first_vec_x0) # endif - - /* Align data for aligned loads in the loop. */ - addq $VEC_SIZE, %rdi - andl $(VEC_SIZE - 1), %ecx - andq $-VEC_SIZE, %rdi - + testl %eax, %eax + jz L(aligned_more) + tzcntl %eax, %eax + ret # ifdef USE_AS_STRNLEN - /* Adjust length. */ - addq %rcx, %rsi +L(zero): + xorl %eax, %eax + ret - subq $(VEC_SIZE * 4), %rsi - jbe L(last_4x_vec_or_less) + .p2align 4 +L(first_vec_x0): + /* Select min of length and position of first null. */ + btsq %rsi, %rax + tzcntl %eax, %eax + ret # endif - jmp L(more_4x_vec) .p2align 4 -L(cros_page_boundary): - andl $(VEC_SIZE - 1), %ecx - andq $-VEC_SIZE, %rdi - -# ifdef USE_AS_WCSLEN - /* NB: Divide shift count by 4 since each bit in K0 represent 4 - bytes. */ - movl %ecx, %SHIFT_REG - sarl $2, %SHIFT_REG +L(first_vec_x1): + tzcntl %eax, %eax + /* Safe to use 32 bit instructions as these are only called for + size = [1, 159]. */ +# ifdef USE_AS_STRNLEN + /* Use ecx which was computed earlier to compute correct value. + */ +# ifdef USE_AS_WCSLEN + sarl $2, %ecx +# endif + leal -(CHAR_PER_VEC * 4 + 1)(%rcx, %rax), %eax +# else + subl %edx, %edi +# ifdef USE_AS_WCSLEN + sarl $2, %edi +# endif + leal CHAR_PER_VEC(%rdi, %rax), %eax # endif - VPCMP $0, (%rdi), %YMMZERO, %k0 - kmovd %k0, %eax + ret - /* Remove the leading bytes. */ - sarxl %SHIFT_REG, %eax, %eax - testl %eax, %eax - jz L(aligned_more) + .p2align 4 +L(first_vec_x2): tzcntl %eax, %eax -# ifdef USE_AS_WCSLEN - /* NB: Multiply wchar_t count by 4 to get the number of bytes. */ - sall $2, %eax -# endif + /* Safe to use 32 bit instructions as these are only called for + size = [1, 159]. */ # ifdef USE_AS_STRNLEN - /* Check the end of data. */ - cmpq %rax, %rsi - jbe L(max) -# endif - addq %rdi, %rax - addq %rcx, %rax - subq %rdx, %rax -# ifdef USE_AS_WCSLEN - shrq $2, %rax + /* Use ecx which was computed earlier to compute correct value. + */ +# ifdef USE_AS_WCSLEN + sarl $2, %ecx +# endif + leal -(CHAR_PER_VEC * 3 + 1)(%rcx, %rax), %eax +# else + subl %edx, %edi +# ifdef USE_AS_WCSLEN + sarl $2, %edi +# endif + leal (CHAR_PER_VEC * 2)(%rdi, %rax), %eax # endif ret .p2align 4 -L(aligned_more): +L(first_vec_x3): + tzcntl %eax, %eax + /* Safe to use 32 bit instructions as these are only called for + size = [1, 159]. */ # ifdef USE_AS_STRNLEN - /* "rcx" is less than VEC_SIZE. Calculate "rdx + rcx - VEC_SIZE" - with "rdx - (VEC_SIZE - rcx)" instead of "(rdx + rcx) - VEC_SIZE" - to void possible addition overflow. */ - negq %rcx - addq $VEC_SIZE, %rcx - - /* Check the end of data. */ - subq %rcx, %rsi - jbe L(max) + /* Use ecx which was computed earlier to compute correct value. + */ +# ifdef USE_AS_WCSLEN + sarl $2, %ecx +# endif + leal -(CHAR_PER_VEC * 2 + 1)(%rcx, %rax), %eax +# else + subl %edx, %edi +# ifdef USE_AS_WCSLEN + sarl $2, %edi +# endif + leal (CHAR_PER_VEC * 3)(%rdi, %rax), %eax # endif + ret - addq $VEC_SIZE, %rdi - + .p2align 4 +L(first_vec_x4): + tzcntl %eax, %eax + /* Safe to use 32 bit instructions as these are only called for + size = [1, 159]. */ # ifdef USE_AS_STRNLEN - subq $(VEC_SIZE * 4), %rsi - jbe L(last_4x_vec_or_less) + /* Use ecx which was computed earlier to compute correct value. + */ +# ifdef USE_AS_WCSLEN + sarl $2, %ecx +# endif + leal -(CHAR_PER_VEC + 1)(%rcx, %rax), %eax +# else + subl %edx, %edi +# ifdef USE_AS_WCSLEN + sarl $2, %edi +# endif + leal (CHAR_PER_VEC * 4)(%rdi, %rax), %eax # endif + ret -L(more_4x_vec): + /* strnlen jumps here. strlen falls through. */ + .p2align 5 +L(aligned_more): + movq %rdi, %rdx + /* Align data to VEC_SIZE. */ + andq $-(VEC_SIZE), %rdi +L(cross_page_continue): /* Check the first 4 * VEC_SIZE. Only one VEC_SIZE at a time since data is only aligned to VEC_SIZE. */ - VPCMP $0, (%rdi), %YMMZERO, %k0 - kmovd %k0, %eax - testl %eax, %eax - jnz L(first_vec_x0) - +# ifdef USE_AS_STRNLEN +# ifdef USE_AS_WCSLEN + salq $2, %rsi +# endif + /* + CHAR_SIZE because it simplies the logic in + last_4x_vec_or_less. */ + leaq (VEC_SIZE * 5 + CHAR_SIZE)(%rdi), %rcx + subq %rdx, %rcx +# endif + /* Load first VEC regardless. */ VPCMP $0, VEC_SIZE(%rdi), %YMMZERO, %k0 +# ifdef USE_AS_STRNLEN + /* Adjust length. If near end handle specially. */ + subq %rcx, %rsi + jb L(last_4x_vec_or_less) +# endif kmovd %k0, %eax testl %eax, %eax jnz L(first_vec_x1) VPCMP $0, (VEC_SIZE * 2)(%rdi), %YMMZERO, %k0 kmovd %k0, %eax - testl %eax, %eax + test %eax, %eax jnz L(first_vec_x2) VPCMP $0, (VEC_SIZE * 3)(%rdi), %YMMZERO, %k0 @@ -179,258 +218,264 @@ L(more_4x_vec): testl %eax, %eax jnz L(first_vec_x3) - addq $(VEC_SIZE * 4), %rdi - -# ifdef USE_AS_STRNLEN - subq $(VEC_SIZE * 4), %rsi - jbe L(last_4x_vec_or_less) -# endif - - /* Align data to 4 * VEC_SIZE. */ - movq %rdi, %rcx - andl $(4 * VEC_SIZE - 1), %ecx - andq $-(4 * VEC_SIZE), %rdi + VPCMP $0, (VEC_SIZE * 4)(%rdi), %YMMZERO, %k0 + kmovd %k0, %eax + testl %eax, %eax + jnz L(first_vec_x4) + addq $VEC_SIZE, %rdi # ifdef USE_AS_STRNLEN - /* Adjust length. */ + /* Check if at last VEC_SIZE * 4 length. */ + cmpq $(VEC_SIZE * 4 - 1), %rsi + jbe L(last_4x_vec_or_less_load) + movl %edi, %ecx + andl $(VEC_SIZE * 4 - 1), %ecx + /* Readjust length. */ addq %rcx, %rsi # endif + /* Align data to VEC_SIZE * 4. */ + andq $-(VEC_SIZE * 4), %rdi + /* Compare 4 * VEC at a time forward. */ .p2align 4 L(loop_4x_vec): - /* Compare 4 * VEC at a time forward. */ - VMOVA (%rdi), %YMM1 - VMOVA VEC_SIZE(%rdi), %YMM2 - VMOVA (VEC_SIZE * 2)(%rdi), %YMM3 - VMOVA (VEC_SIZE * 3)(%rdi), %YMM4 - - VPMINU %YMM1, %YMM2, %YMM5 - VPMINU %YMM3, %YMM4, %YMM6 + /* Load first VEC regardless. */ + VMOVA (VEC_SIZE * 4)(%rdi), %YMM1 +# ifdef USE_AS_STRNLEN + /* Break if at end of length. */ + subq $(VEC_SIZE * 4), %rsi + jb L(last_4x_vec_or_less_cmpeq) +# endif + VPMINU (VEC_SIZE * 5)(%rdi), %YMM1, %YMM2 + VMOVA (VEC_SIZE * 6)(%rdi), %YMM3 + VPMINU (VEC_SIZE * 7)(%rdi), %YMM3, %YMM4 + VPCMP $0, %YMM2, %YMMZERO, %k0 + VPCMP $0, %YMM4, %YMMZERO, %k1 + subq $-(VEC_SIZE * 4), %rdi + kortestd %k0, %k1 + jz L(loop_4x_vec) + + /* Check if end was in first half. */ + kmovd %k0, %eax + subq %rdx, %rdi +# ifdef USE_AS_WCSLEN + shrq $2, %rdi +# endif + testl %eax, %eax + jz L(second_vec_return) - VPMINU %YMM5, %YMM6, %YMM5 - VPCMP $0, %YMM5, %YMMZERO, %k0 - ktestd %k0, %k0 - jnz L(4x_vec_end) + VPCMP $0, %YMM1, %YMMZERO, %k2 + kmovd %k2, %edx + /* Combine YMM1 matches (k2) with YMM2 matches (k0). */ +# ifdef USE_AS_WCSLEN + sall $CHAR_PER_VEC, %eax + orl %edx, %eax + tzcntl %eax, %eax +# else + salq $CHAR_PER_VEC, %rax + orq %rdx, %rax + tzcntq %rax, %rax +# endif + addq %rdi, %rax + ret - addq $(VEC_SIZE * 4), %rdi -# ifndef USE_AS_STRNLEN - jmp L(loop_4x_vec) -# else - subq $(VEC_SIZE * 4), %rsi - ja L(loop_4x_vec) +# ifdef USE_AS_STRNLEN +L(last_4x_vec_or_less_load): + /* Depending on entry adjust rdi / prepare first VEC in YMM1. */ + VMOVA (VEC_SIZE * 4)(%rdi), %YMM1 +L(last_4x_vec_or_less_cmpeq): + VPCMP $0, %YMM1, %YMMZERO, %k0 + addq $(VEC_SIZE * 3), %rdi L(last_4x_vec_or_less): - /* Less than 4 * VEC and aligned to VEC_SIZE. */ - addl $(VEC_SIZE * 2), %esi - jle L(last_2x_vec) - - VPCMP $0, (%rdi), %YMMZERO, %k0 kmovd %k0, %eax - testl %eax, %eax - jnz L(first_vec_x0) + /* If remaining length > VEC_SIZE * 2. */ + testl $(VEC_SIZE * 2), %esi + jnz L(last_4x_vec) - VPCMP $0, VEC_SIZE(%rdi), %YMMZERO, %k0 - kmovd %k0, %eax + /* length may have been negative or positive depending on where + this was called from. This fixes that. */ + andl $(VEC_SIZE * 4 - 1), %esi testl %eax, %eax - jnz L(first_vec_x1) + jnz L(last_vec_x1_check) - VPCMP $0, (VEC_SIZE * 2)(%rdi), %YMMZERO, %k0 - kmovd %k0, %eax - testl %eax, %eax - jnz L(first_vec_x2_check) subl $VEC_SIZE, %esi - jle L(max) + jb L(max) - VPCMP $0, (VEC_SIZE * 3)(%rdi), %YMMZERO, %k0 + VPCMP $0, (VEC_SIZE * 2)(%rdi), %YMMZERO, %k0 kmovd %k0, %eax - testl %eax, %eax - jnz L(first_vec_x3_check) - movq %r8, %rax + tzcntl %eax, %eax # ifdef USE_AS_WCSLEN - shrq $2, %rax + sarl $2, %esi # endif - ret - - .p2align 4 -L(last_2x_vec): - addl $(VEC_SIZE * 2), %esi - - VPCMP $0, (%rdi), %YMMZERO, %k0 - kmovd %k0, %eax - testl %eax, %eax - jnz L(first_vec_x0_check) - subl $VEC_SIZE, %esi - jle L(max) + /* Check the end of data. */ + cmpl %eax, %esi + jb L(max) - VPCMP $0, VEC_SIZE(%rdi), %YMMZERO, %k0 - kmovd %k0, %eax - testl %eax, %eax - jnz L(first_vec_x1_check) - movq %r8, %rax + subq %rdx, %rdi # ifdef USE_AS_WCSLEN - shrq $2, %rax + sarq $2, %rdi # endif + leaq (CHAR_PER_VEC * 2)(%rdi, %rax), %rax + ret +L(max): + movq %r8, %rax ret +# endif .p2align 4 -L(first_vec_x0_check): +L(second_vec_return): + VPCMP $0, %YMM3, %YMMZERO, %k0 + /* Combine YMM3 matches (k0) with YMM4 matches (k1). */ +# ifdef USE_AS_WCSLEN + kunpckbw %k0, %k1, %k0 + kmovd %k0, %eax + tzcntl %eax, %eax +# else + kunpckdq %k0, %k1, %k0 + kmovq %k0, %rax + tzcntq %rax, %rax +# endif + leaq (CHAR_PER_VEC * 2)(%rdi, %rax), %rax + ret + + +# ifdef USE_AS_STRNLEN +L(last_vec_x1_check): tzcntl %eax, %eax # ifdef USE_AS_WCSLEN - /* NB: Multiply wchar_t count by 4 to get the number of bytes. */ - sall $2, %eax + sarl $2, %esi # endif /* Check the end of data. */ - cmpq %rax, %rsi - jbe L(max) - addq %rdi, %rax - subq %rdx, %rax + cmpl %eax, %esi + jb L(max) + subq %rdx, %rdi # ifdef USE_AS_WCSLEN - shrq $2, %rax + sarq $2, %rdi # endif + leaq (CHAR_PER_VEC)(%rdi, %rax), %rax ret .p2align 4 -L(first_vec_x1_check): +L(last_4x_vec): + /* Test first 2x VEC normally. */ + testl %eax, %eax + jnz L(last_vec_x1) + + VPCMP $0, (VEC_SIZE * 2)(%rdi), %YMMZERO, %k0 + kmovd %k0, %eax + testl %eax, %eax + jnz L(last_vec_x2) + + /* Normalize length. */ + andl $(VEC_SIZE * 4 - 1), %esi + VPCMP $0, (VEC_SIZE * 3)(%rdi), %YMMZERO, %k0 + kmovd %k0, %eax + testl %eax, %eax + jnz L(last_vec_x3) + + subl $(VEC_SIZE * 3), %esi + jb L(max) + + VPCMP $0, (VEC_SIZE * 4)(%rdi), %YMMZERO, %k0 + kmovd %k0, %eax tzcntl %eax, %eax # ifdef USE_AS_WCSLEN - /* NB: Multiply wchar_t count by 4 to get the number of bytes. */ - sall $2, %eax + sarl $2, %esi # endif /* Check the end of data. */ - cmpq %rax, %rsi - jbe L(max) - addq $VEC_SIZE, %rax - addq %rdi, %rax - subq %rdx, %rax + cmpl %eax, %esi + jb L(max_end) + + subq %rdx, %rdi # ifdef USE_AS_WCSLEN - shrq $2, %rax + sarq $2, %rdi # endif + leaq (CHAR_PER_VEC * 4)(%rdi, %rax), %rax ret .p2align 4 -L(first_vec_x2_check): +L(last_vec_x1): tzcntl %eax, %eax + subq %rdx, %rdi # ifdef USE_AS_WCSLEN - /* NB: Multiply wchar_t count by 4 to get the number of bytes. */ - sall $2, %eax + sarq $2, %rdi # endif - /* Check the end of data. */ - cmpq %rax, %rsi - jbe L(max) - addq $(VEC_SIZE * 2), %rax - addq %rdi, %rax - subq %rdx, %rax + leaq (CHAR_PER_VEC)(%rdi, %rax), %rax + ret + + .p2align 4 +L(last_vec_x2): + tzcntl %eax, %eax + subq %rdx, %rdi # ifdef USE_AS_WCSLEN - shrq $2, %rax + sarq $2, %rdi # endif + leaq (CHAR_PER_VEC * 2)(%rdi, %rax), %rax ret .p2align 4 -L(first_vec_x3_check): +L(last_vec_x3): tzcntl %eax, %eax # ifdef USE_AS_WCSLEN - /* NB: Multiply wchar_t count by 4 to get the number of bytes. */ - sall $2, %eax + sarl $2, %esi # endif + subl $(CHAR_PER_VEC * 2), %esi /* Check the end of data. */ - cmpq %rax, %rsi - jbe L(max) - addq $(VEC_SIZE * 3), %rax - addq %rdi, %rax - subq %rdx, %rax + cmpl %eax, %esi + jb L(max_end) + subq %rdx, %rdi # ifdef USE_AS_WCSLEN - shrq $2, %rax + sarq $2, %rdi # endif + leaq (CHAR_PER_VEC * 3)(%rdi, %rax), %rax ret - - .p2align 4 -L(max): +L(max_end): movq %r8, %rax -# ifdef USE_AS_WCSLEN - shrq $2, %rax -# endif - ret - - .p2align 4 -L(zero): - xorl %eax, %eax ret # endif + /* Cold case for crossing page with first load. */ .p2align 4 -L(first_vec_x0): - tzcntl %eax, %eax -# ifdef USE_AS_WCSLEN - /* NB: Multiply wchar_t count by 4 to get the number of bytes. */ - sall $2, %eax -# endif - addq %rdi, %rax - subq %rdx, %rax +L(cross_page_boundary): + movq %rdi, %rdx + /* Align data to VEC_SIZE. */ + andq $-VEC_SIZE, %rdi + VPCMP $0, (%rdi), %YMMZERO, %k0 + kmovd %k0, %eax + /* Remove the leading bytes. */ # ifdef USE_AS_WCSLEN - shrq $2, %rax + movl %edx, %ecx + shrl $2, %ecx + andl $(CHAR_PER_VEC - 1), %ecx # endif - ret - - .p2align 4 -L(first_vec_x1): + /* SHIFT_REG is ecx for USE_AS_WCSLEN and edx otherwise. */ + sarxl %SHIFT_REG, %eax, %eax + testl %eax, %eax +# ifndef USE_AS_STRNLEN + jz L(cross_page_continue) tzcntl %eax, %eax -# ifdef USE_AS_WCSLEN - /* NB: Multiply wchar_t count by 4 to get the number of bytes. */ - sall $2, %eax -# endif - addq $VEC_SIZE, %rax - addq %rdi, %rax - subq %rdx, %rax -# ifdef USE_AS_WCSLEN - shrq $2, %rax -# endif ret - - .p2align 4 -L(first_vec_x2): - tzcntl %eax, %eax -# ifdef USE_AS_WCSLEN - /* NB: Multiply wchar_t count by 4 to get the number of bytes. */ - sall $2, %eax -# endif - addq $(VEC_SIZE * 2), %rax - addq %rdi, %rax - subq %rdx, %rax -# ifdef USE_AS_WCSLEN - shrq $2, %rax -# endif +# else + jnz L(cross_page_less_vec) +# ifndef USE_AS_WCSLEN + movl %edx, %ecx + andl $(CHAR_PER_VEC - 1), %ecx +# endif + movl $CHAR_PER_VEC, %eax + subl %ecx, %eax + cmpq %rax, %rsi + ja L(cross_page_continue) + movl %esi, %eax ret - - .p2align 4 -L(4x_vec_end): - VPCMP $0, %YMM1, %YMMZERO, %k0 - kmovd %k0, %eax - testl %eax, %eax - jnz L(first_vec_x0) - VPCMP $0, %YMM2, %YMMZERO, %k1 - kmovd %k1, %eax - testl %eax, %eax - jnz L(first_vec_x1) - VPCMP $0, %YMM3, %YMMZERO, %k2 - kmovd %k2, %eax - testl %eax, %eax - jnz L(first_vec_x2) - VPCMP $0, %YMM4, %YMMZERO, %k3 - kmovd %k3, %eax -L(first_vec_x3): +L(cross_page_less_vec): tzcntl %eax, %eax -# ifdef USE_AS_WCSLEN - /* NB: Multiply wchar_t count by 4 to get the number of bytes. */ - sall $2, %eax -# endif - addq $(VEC_SIZE * 3), %rax - addq %rdi, %rax - subq %rdx, %rax -# ifdef USE_AS_WCSLEN - shrq $2, %rax -# endif + /* Select min of length and position of first null. */ + cmpq %rax, %rsi + cmovb %esi, %eax ret +# endif END (STRLEN) #endif