From patchwork Wed Jul 1 15:35:15 2015 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: =?utf-8?b?T25kxZllaiBCw61sa2E=?= X-Patchwork-Id: 490216 Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@bilbo.ozlabs.org Received: from sourceware.org (server1.sourceware.org [209.132.180.131]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by ozlabs.org (Postfix) with ESMTPS id 350DC1402B2 for ; Thu, 2 Jul 2015 01:36:24 +1000 (AEST) Authentication-Results: ozlabs.org; dkim=pass (1024-bit key; unprotected) header.d=sourceware.org header.i=@sourceware.org header.b=xW+4l0E8; dkim-atps=neutral DomainKey-Signature: a=rsa-sha1; c=nofws; d=sourceware.org; h=list-id :list-unsubscribe:list-subscribe:list-archive:list-post :list-help:sender:date:from:to:cc:subject:message-id :mime-version:content-type; q=dns; s=default; b=axZojTCs83f/FjHu lhGyMS3uMBjO+OTi52QR/HuG77Q3A144+25DOScrpBhh9ot2XwQnHBGObEbhMBa4 SWQjPW0W7b1kcLixTBvn7cn9P+PIvpoKwqwG89gqCai118qE1OcdnRuhPsyg3QsM +zVr2USs6wjxGVdkFAGYxh4fFxc= DKIM-Signature: v=1; a=rsa-sha1; c=relaxed; d=sourceware.org; h=list-id :list-unsubscribe:list-subscribe:list-archive:list-post :list-help:sender:date:from:to:cc:subject:message-id :mime-version:content-type; s=default; bh=a32wattiCD98TXpJTd9qQa ksqqk=; b=xW+4l0E8JgYOmJkSaQe2DHLzY8Y3th0zGdMmb8ddspG/07cOooxvL6 26qp024GObijpLCclehhBwSsUcRXv48MWteVXBYXui/TCdecM8AaDZNaFb5nZ8XZ 73b1dvXxTyuNDzVb4vcNUrBNqWlvF4yeiH6IKUuI2cae6s1wiOqRE= Received: (qmail 13214 invoked by alias); 1 Jul 2015 15:36:10 -0000 Mailing-List: contact libc-alpha-help@sourceware.org; run by ezmlm Precedence: bulk List-Id: List-Unsubscribe: List-Subscribe: List-Archive: List-Post: List-Help: , Sender: libc-alpha-owner@sourceware.org Delivered-To: mailing list libc-alpha@sourceware.org Received: (qmail 13153 invoked by uid 89); 1 Jul 2015 15:36:09 -0000 Authentication-Results: sourceware.org; auth=none X-Virus-Found: No X-Spam-SWARE-Status: No, score=-0.7 required=5.0 tests=AWL, BAYES_00, FREEMAIL_FROM, SPF_NEUTRAL autolearn=no version=3.3.2 X-HELO: popelka.ms.mff.cuni.cz Date: Wed, 1 Jul 2015 17:35:15 +0200 From: =?utf-8?B?T25kxZllaiBCw61sa2E=?= To: Carlos O'Donell Cc: libc-alpha@sourceware.org Subject: [PATCH] Improve generic strcspn and strpbrk Message-ID: <20150701153515.GA31676@domone> MIME-Version: 1.0 Content-Disposition: inline User-Agent: Mutt/1.5.20 (2009-06-14) Hi, As I promised previously to improve sse2 strspn I found that I could convince gcc to generate reasonable code. Its really needed to use goto, otherwise gcc totally messes up a loop which will be 10% slower than current. Generated assembly is relatively ok, gcc only increments two registers by 4 instead one which doesn't affect performance much. So here is a generic strcspn/strpbrk implementation and I will send patch to replace x64 code later. Carlos as usual this is code where you need to make hard decision based on actual profile. This implementation on gcc workload shows 30% improvement. That is by exploiting property of workload that match in 4 bytes is likely. If it isn't a performance will suffer as these checks increase cost of subsequent characters. This is again case where its unavoidable that some patterns will become five times slower. If accept has 128 characters and match is in second character then we spend ages with byte-by-byte checks of accept. But if match is in first character then strchr will find that quickly. Problem is that 99.8% of call have accept less than 8 bytes. So adding just a check if size exceeds 8 will harm performance. See following two graphs. http://kam.mff.cuni.cz/~ondra/benchmark_string/core2/strpbrk_profile/results_gcc/result.html http://kam.mff.cuni.cz/~ondra/benchmark_string/core2/strpbrk_profile/results_rand/result.html I cannot avoid that without my generic string framework where I would first to vectorized check of first 8 bytes and its dubious if that would help 32bit architectures. For x64 I correct solution is use sse2 prolog, hard part is tuning as these break even depending on size. For use in sse2 I need to add LATE_CHECK as it when called from sse42 routine accept size is greater than 16 so early heuristic doesn't make much sense. Main gain versus current sse2 assembly is that I align s to 4 bytes so I could read 4 bytes at start of loop before check. That helps mainly older cpu to execute loads earlier, it improves performance by 50% on core2 large inputs, but newer are better at speculative loads so difference gets smaller until haswell that doesn't show a difference. Tested on x64 by including them in test-strcspn/strpbrk OK to commit? * string/strcspn.c(STRCSPN): Check membership by array lookup. * string/strcspn.c: Include string/strpbrk.c * string/test-strcspn.c (simple_strcspn): Use string version. * string/test-strpbrk.c (simple_strpbrk): Use string version. --- string/strcspn.c | 43 +----------------- string/strpbrk.c | 120 +++++++++++++++++++++++++++++++++++++++++++++----- string/test-strcspn.c | 15 ++----- string/test-strpbrk.c | 14 +----- 4 files changed, 116 insertions(+), 76 deletions(-) diff --git a/string/strcspn.c b/string/strcspn.c index 2694d2a..2a82dd2 100644 --- a/string/strcspn.c +++ b/string/strcspn.c @@ -1,41 +1,2 @@ -/* Copyright (C) 1991-2015 Free Software Foundation, Inc. - This file is part of the GNU C Library. - - The GNU C Library is free software; you can redistribute it and/or - modify it under the terms of the GNU Lesser General Public - License as published by the Free Software Foundation; either - version 2.1 of the License, or (at your option) any later version. - - The GNU C Library is distributed in the hope that it will be useful, - but WITHOUT ANY WARRANTY; without even the implied warranty of - MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU - Lesser General Public License for more details. - - You should have received a copy of the GNU Lesser General Public - License along with the GNU C Library; if not, see - . */ - -#include - -#undef strcspn - -#ifndef STRCSPN -# define STRCSPN strcspn -#endif - -/* Return the length of the maximum initial segment of S - which contains no characters from REJECT. */ -size_t -STRCSPN (const char *s, const char *reject) -{ - size_t count = 0; - - while (*s != '\0') - if (strchr (reject, *s++) == NULL) - ++count; - else - return count; - - return count; -} -libc_hidden_builtin_def (strcspn) +#define AS_STRCSPN +#include "strpbrk.c" diff --git a/string/strpbrk.c b/string/strpbrk.c index 4f1d9b7..62808da 100644 --- a/string/strpbrk.c +++ b/string/strpbrk.c @@ -16,26 +16,124 @@ . */ #include - +#include #undef strpbrk +#undef strcspn + + +#ifdef AS_STRCSPN +# ifndef STRPBRK +# define STRPBRK strcspn +# endif +# define RETURN_TYPE size_t +# define RETURN(c) return c +#else +# define RETURN_TYPE char * +# define RETURN(c) return (char *) (s[c] != '\0' ? s + c : NULL) +#endif #ifndef STRPBRK #define STRPBRK strpbrk #endif + /* Find the first occurrence in S of any character in ACCEPT. */ -char * -STRPBRK (const char *s, const char *accept) +RETURN_TYPE +STRPBRK (const char *_s, const char *_accept) { - while (*s != '\0') + unsigned char *s = (unsigned char *) _s; + unsigned char *a = (unsigned char *) _accept; + +#ifndef LATE_CHECK + /* We need to align s to 4 bytes. We do check now to avoid expensive table + construction. */ + do + { + if (s[0] == *a) + RETURN(0); + } + while (*a++); + a = (unsigned char *) _accept; + + /* We couldn't do these checks in one loop as gcc + messes up register allocation. */ + do + { + if (s[1] == *a) + RETURN(1); + } + while (*a++); + a = (unsigned char *) _accept; + + do + { + if (s[2] == *a) + RETURN(2); + } + while (*a++); + a = (unsigned char *) _accept; + + do + { + if (s[3] == *a) + RETURN(3); + } + while (*a++); + a = (unsigned char *) _accept; + +#endif + + unsigned char table[256]; + memset (table, 0, 256); + do { - const char *a = accept; - while (*a != '\0') - if (*a++ == *s) - return (char *) s; - ++s; + table[*a] = 1; } + while (*a++); + unsigned char s0, s1, s2, s3; + size_t count = 0; +#ifdef LATE_CHECK + s0 = s[count + 0]; + s1 = s[count + 1]; + s2 = s[count + 2]; + s3 = s[count + 3]; + if (table[s0]) + goto ret0; + if (table[s1]) + goto ret1; + if (table[s2]) + goto ret2; + if (table[s3]) + goto ret3; - return NULL; +#endif + + count = 4 - ((uintptr_t) s) % 4; + + while (1) + { + s0 = s[count + 0]; + s1 = s[count + 1]; + s2 = s[count + 2]; + s3 = s[count + 3]; + if (table[s0]) + goto ret0; + if (table[s1]) + goto ret1; + if (table[s2]) + goto ret2; + if (table[s3]) + goto ret3; + count += 4; + } + ret3: + count++; + ret2: + count++; + ret1: + count++; + ret0: + RETURN(count); } -libc_hidden_builtin_def (strpbrk) + +libc_hidden_builtin_def (STRPBRK) diff --git a/string/test-strcspn.c b/string/test-strcspn.c index b60a048..50a06e4 100644 --- a/string/test-strcspn.c +++ b/string/test-strcspn.c @@ -31,18 +31,9 @@ IMPL (stupid_strcspn, 0) IMPL (simple_strcspn, 0) IMPL (strcspn, 1) -size_t -simple_strcspn (const char *s, const char *rej) -{ - const char *r, *str = s; - char c; - - while ((c = *s++) != '\0') - for (r = rej; *r != '\0'; ++r) - if (*r == c) - return s - str - 1; - return s - str - 1; -} +#define AS_STRCSPN +#define STRPBRK simple_strcspn +#include "string/strpbrk.c" size_t stupid_strcspn (const char *s, const char *rej) diff --git a/string/test-strpbrk.c b/string/test-strpbrk.c index b4ac389..f389e9d 100644 --- a/string/test-strpbrk.c +++ b/string/test-strpbrk.c @@ -32,18 +32,8 @@ IMPL (stupid_strpbrk, 0) IMPL (simple_strpbrk, 0) IMPL (strpbrk, 1) -char * -simple_strpbrk (const char *s, const char *rej) -{ - const char *r; - char c; - - while ((c = *s++) != '\0') - for (r = rej; *r != '\0'; ++r) - if (*r == c) - return (char *) s - 1; - return NULL; -} +#define STRPBRK simple_strpbrk +#include "string/strpbrk.c" char * stupid_strpbrk (const char *s, const char *rej)