From patchwork Tue Jul 30 15:41:59 2024
X-Patchwork-Submitter: Andi Kleen
X-Patchwork-Id: 1966612
From: Andi Kleen
To: gcc-patches@gcc.gnu.org
Cc: Andi Kleen
Subject: [PATCH 2/2] Add AVX2 code path to lexer
Date: Tue, 30 Jul 2024 08:41:59 -0700
Message-ID: <20240730154159.3799008-2-ak@linux.intel.com>
In-Reply-To: <20240730154159.3799008-1-ak@linux.intel.com>
References: <20240730154159.3799008-1-ak@linux.intel.com>

AVX2 is widely available on x86 and allows the scanner line check to be
done 32 bytes at a time.  The code is similar to the SSE2 code path,
just using AVX2 with 32 bytes per iteration instead of SSE2's 16.

Also adjust the code to allow inlining when the compiler is built for
an AVX2 host, following what other architectures do.

I see about a 0.6% compile-time improvement when compiling i386
insn-recog.i with -O0.

libcpp/ChangeLog:

	* config.in (HAVE_AVX2): Add.
	* configure: Regenerate.
	* configure.ac: Add HAVE_AVX2 check.
	* lex.cc (repl_chars): Extend to 32 bytes.
	(search_line_avx2): New function to scan line using AVX2.
	(init_vectorized_lexer): Check for AVX2 in CPUID.
---
 libcpp/config.in    |  3 ++
 libcpp/configure    | 17 +++++++++
 libcpp/configure.ac |  3 ++
 libcpp/lex.cc       | 91 +++++++++++++++++++++++++++++++++++++++++++--
 4 files changed, 110 insertions(+), 4 deletions(-)

diff --git a/libcpp/config.in b/libcpp/config.in
index 253ef03a3dea..8fad6bd4b4f5 100644
--- a/libcpp/config.in
+++ b/libcpp/config.in
@@ -213,6 +213,9 @@
 /* Define to 1 if you can assemble SSE4 insns. */
 #undef HAVE_SSE4
 
+/* Define to 1 if you can assemble AVX2 insns. */
+#undef HAVE_AVX2
+
 /* Define to 1 if you have the <stddef.h> header file. */
 #undef HAVE_STDDEF_H
 
diff --git a/libcpp/configure b/libcpp/configure
index 32d6aaa30699..6d9286ac9601 100755
--- a/libcpp/configure
+++ b/libcpp/configure
@@ -9149,6 +9149,23 @@ if ac_fn_c_try_compile "$LINENO"; then :
 
 $as_echo "#define HAVE_SSE4 1" >>confdefs.h
 
+fi
+rm -f core conftest.err conftest.$ac_objext conftest.$ac_ext
+  cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+
+int
+main ()
+{
+asm ("vpcmpeqb %%ymm0, %%ymm4, %%ymm5" : : "i"(0))
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_compile "$LINENO"; then :
+
+$as_echo "#define HAVE_AVX2 1" >>confdefs.h
+
 fi
 rm -f core conftest.err conftest.$ac_objext conftest.$ac_ext
     esac

diff --git a/libcpp/configure.ac b/libcpp/configure.ac
index b883fec776fe..c06609827924 100644
--- a/libcpp/configure.ac
+++ b/libcpp/configure.ac
@@ -200,6 +200,9 @@ case $target in
     AC_TRY_COMPILE([], [asm ("pcmpestri %0, %%xmm0, %%xmm1" : : "i"(0))],
       [AC_DEFINE([HAVE_SSE4], [1],
 		 [Define to 1 if you can assemble SSE4 insns.])])
+    AC_TRY_COMPILE([], [asm ("vpcmpeqb %%ymm0, %%ymm4, %%ymm5" : : "i"(0))],
+      [AC_DEFINE([HAVE_AVX2], [1],
+		 [Define to 1 if you can assemble AVX2 insns.])])
 esac
 
 # Enable --enable-host-shared.

diff --git a/libcpp/lex.cc b/libcpp/lex.cc
index 1591dcdf151a..72f3402aac99 100644
--- a/libcpp/lex.cc
+++ b/libcpp/lex.cc
@@ -278,19 +278,31 @@ search_line_acc_char (const uchar *s, const uchar *end ATTRIBUTE_UNUSED)
 /* Replicated character data to be shared between implementations.
    Recall that outside of a context with vector support we can't
    define compatible vector types, therefore these are all defined
-   in terms of raw characters.  */
-static const char repl_chars[4][16] __attribute__((aligned(16))) = {
+   in terms of raw characters.
+   gcc constant propagates this and usually turns it into a
+   vector broadcast, so it actually disappears.  */
+
+static const char repl_chars[4][32] __attribute__((aligned(32))) = {
   { '\n', '\n', '\n', '\n', '\n', '\n', '\n', '\n',
+    '\n', '\n', '\n', '\n', '\n', '\n', '\n', '\n',
+    '\n', '\n', '\n', '\n', '\n', '\n', '\n', '\n',
     '\n', '\n', '\n', '\n', '\n', '\n', '\n', '\n' },
   { '\r', '\r', '\r', '\r', '\r', '\r', '\r', '\r',
+    '\r', '\r', '\r', '\r', '\r', '\r', '\r', '\r',
+    '\r', '\r', '\r', '\r', '\r', '\r', '\r', '\r',
     '\r', '\r', '\r', '\r', '\r', '\r', '\r', '\r' },
   { '\\', '\\', '\\', '\\', '\\', '\\', '\\', '\\',
+    '\\', '\\', '\\', '\\', '\\', '\\', '\\', '\\',
+    '\\', '\\', '\\', '\\', '\\', '\\', '\\', '\\',
     '\\', '\\', '\\', '\\', '\\', '\\', '\\', '\\' },
   { '?', '?', '?', '?', '?', '?', '?', '?',
+    '?', '?', '?', '?', '?', '?', '?', '?',
+    '?', '?', '?', '?', '?', '?', '?', '?',
     '?', '?', '?', '?', '?', '?', '?', '?' },
 };
 
+#ifndef __AVX2__
 /* A version of the fast scanner using SSE2 vectorized byte compare
    insns.  */
 
 static const uchar *
@@ -343,8 +355,9 @@ search_line_sse2 (const uchar *s, const uchar *end ATTRIBUTE_UNUSED)
   found = __builtin_ctz(found);
   return (const uchar *)p + found;
 }
+#endif
 
-#ifdef HAVE_SSE4
+#if defined(HAVE_SSE4) && !defined(__AVX2__)
 /* A version of the fast scanner using SSE 4.2 vectorized string
    insns.  */
 
 static const uchar *
@@ -425,6 +438,71 @@ search_line_sse42 (const uchar *s, const uchar *end)
 #define search_line_sse42 search_line_sse2
 #endif
 
+#ifdef HAVE_AVX2
+
+/* A version of the fast scanner using AVX2 vectorized byte compare
+   insns.  */
+
+static const uchar *
+#ifndef __AVX2__
+__attribute__((__target__("avx2")))
+#endif
+search_line_avx2 (const uchar *s, const uchar *end ATTRIBUTE_UNUSED)
+{
+  typedef char v32qi __attribute__ ((__vector_size__ (32)));
+
+  const v32qi repl_nl = *(const v32qi *)repl_chars[0];
+  const v32qi repl_cr = *(const v32qi *)repl_chars[1];
+  const v32qi repl_bs = *(const v32qi *)repl_chars[2];
+  const v32qi repl_qm = *(const v32qi *)repl_chars[3];
+
+  unsigned int misalign, found, mask;
+  const v32qi *p;
+  v32qi data, t;
+
+  /* Align the source pointer.  */
+  misalign = (uintptr_t)s & 31;
+  p = (const v32qi *)((uintptr_t)s & -32);
+  data = *p;
+
+  /* Create a mask for the bytes that are valid within the first
+     32-byte block.  The Idea here is that the AND with the mask
+     within the loop is "free", since we need some AND or TEST
+     insn in order to set the flags for the branch anyway.  */
+  mask = -1u << misalign;
+
+  /* Main loop processing 32 bytes at a time.  */
+  goto start;
+  do
+    {
+      data = *++p;
+      mask = -1;
+
+    start:
+      t  = data == repl_nl;
+      t |= data == repl_cr;
+      t |= data == repl_bs;
+      t |= data == repl_qm;
+      found = __builtin_ia32_pmovmskb256 (t);
+      found &= mask;
+    }
+  while (!found);
+
+  /* FOUND contains 1 in bits for which we matched a relevant
+     character.  Conversion to the byte index is trivial.  */
+  found = __builtin_ctz (found);
+  return (const uchar *)p + found;
+}
+
+#else
+#define search_line_avx2 search_line_sse2
+#endif
+
+#ifdef __AVX2__
+/* Avoid indirect calls to encourage inlining if the compiler is built
+   using AVX.  */
+#define search_line_fast search_line_avx2
+#else
+
 /* Check the CPU capabilities.  */
 
 #include "../gcc/config/i386/cpuid.h"
@@ -436,7 +514,7 @@ static search_line_fast_type search_line_fast;
 static inline void
 init_vectorized_lexer (void)
 {
-  unsigned dummy, ecx = 0, edx = 0;
+  unsigned dummy, ecx = 0, edx = 0, ebx = 0;
   search_line_fast_type impl = search_line_acc_char;
   int minimum = 0;
@@ -448,6 +526,10 @@ init_vectorized_lexer (void)
 
   if (minimum == 3)
     impl = search_line_sse42;
+  else if (__get_cpuid_max (0, &dummy) >= 7
+	   && __get_cpuid_count (7, 0, &dummy, &ebx, &dummy, &dummy)
+	   && (ebx & bit_AVX2))
+    impl = search_line_avx2;
   else if (__get_cpuid (1, &dummy, &dummy, &ecx, &edx) || minimum == 2)
     {
       if (minimum == 3 || (ecx & bit_SSE4_2))
@@ -458,6 +540,7 @@ init_vectorized_lexer (void)
 
   search_line_fast = impl;
 }
+#endif /* !__AVX2__ */
 
 #elif (GCC_VERSION >= 4005) && defined(_ARCH_PWR8) && defined(__ALTIVEC__)