From patchwork Tue Jul 30 15:41:59 2024
X-Patchwork-Submitter: Andi Kleen
X-Patchwork-Id: 1966612
From: Andi Kleen
To: gcc-patches@gcc.gnu.org
Cc: Andi Kleen
Subject: [PATCH 2/2] Add AVX2 code path to lexer
Date: Tue, 30 Jul 2024 08:41:59 -0700
Message-ID: <20240730154159.3799008-2-ak@linux.intel.com>
In-Reply-To: <20240730154159.3799008-1-ak@linux.intel.com>
References: <20240730154159.3799008-1-ak@linux.intel.com>

AVX2 is widely available on x86 and allows the scanner line check to be
done 32 bytes at a time.  The code is similar to the SSE2 code path,
just using AVX2 with 32 bytes per iteration instead of SSE2's 16.

Also adjust the code to allow inlining when the compiler is built for
an AVX2 host, following what other architectures do.

I see about a 0.6% compile-time improvement when compiling i386
insn-recog.i with -O0.

libcpp/ChangeLog:

	* config.in (HAVE_AVX2): Add.
	* configure: Regenerate.
	* configure.ac: Add HAVE_AVX2 check.
	* lex.cc (repl_chars): Extend to 32 bytes.
	(search_line_avx2): New function to scan line using AVX2.
	(init_vectorized_lexer): Check for AVX2 in CPUID.
---
 libcpp/config.in    |  3 ++
 libcpp/configure    | 17 +++++++++
 libcpp/configure.ac |  3 ++
 libcpp/lex.cc       | 91 +++++++++++++++++++++++++++++++++++++++++++--
 4 files changed, 110 insertions(+), 4 deletions(-)

diff --git a/libcpp/config.in b/libcpp/config.in
index 253ef03a3dea..8fad6bd4b4f5 100644
--- a/libcpp/config.in
+++ b/libcpp/config.in
@@ -213,6 +213,9 @@
 /* Define to 1 if you can assemble SSE4 insns. */
 #undef HAVE_SSE4
 
+/* Define to 1 if you can assemble AVX2 insns. */
+#undef HAVE_AVX2
+
 /* Define to 1 if you have the <stddef.h> header file. */
 #undef HAVE_STDDEF_H
 
diff --git a/libcpp/configure b/libcpp/configure
index 32d6aaa30699..6d9286ac9601 100755
--- a/libcpp/configure
+++ b/libcpp/configure
@@ -9149,6 +9149,23 @@ if ac_fn_c_try_compile "$LINENO"; then :
 
 $as_echo "#define HAVE_SSE4 1" >>confdefs.h
 
+fi
+rm -f core conftest.err conftest.$ac_objext conftest.$ac_ext
+  cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+
+int
+main ()
+{
+asm ("vpcmpeqb %%ymm0, %%ymm4, %%ymm5" : : "i"(0))
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_compile "$LINENO"; then :
+
+$as_echo "#define HAVE_AVX2 1" >>confdefs.h
+
 fi
 rm -f core conftest.err conftest.$ac_objext conftest.$ac_ext
     esac

diff --git a/libcpp/configure.ac b/libcpp/configure.ac
index b883fec776fe..c06609827924 100644
--- a/libcpp/configure.ac
+++ b/libcpp/configure.ac
@@ -200,6 +200,9 @@ case $target in
     AC_TRY_COMPILE([], [asm ("pcmpestri %0, %%xmm0, %%xmm1" : : "i"(0))],
       [AC_DEFINE([HAVE_SSE4], [1],
 		 [Define to 1 if you can assemble SSE4 insns.])])
+    AC_TRY_COMPILE([], [asm ("vpcmpeqb %%ymm0, %%ymm4, %%ymm5" : : "i"(0))],
+      [AC_DEFINE([HAVE_AVX2], [1],
+		 [Define to 1 if you can assemble AVX2 insns.])])
 esac
 
 # Enable --enable-host-shared.

diff --git a/libcpp/lex.cc b/libcpp/lex.cc
index 1591dcdf151a..72f3402aac99 100644
--- a/libcpp/lex.cc
+++ b/libcpp/lex.cc
@@ -278,19 +278,31 @@ search_line_acc_char (const uchar *s, const uchar *end ATTRIBUTE_UNUSED)
 /* Replicated character data to be shared between implementations.
    Recall that outside of a context with vector support we can't
    define compatible vector types, therefore these are all defined
-   in terms of raw characters.  */
-static const char repl_chars[4][16] __attribute__((aligned(16))) = {
+   in terms of raw characters.
+   gcc constant propagates this and usually turns it into a
+   vector broadcast, so it actually disappears.  */
+
+static const char repl_chars[4][32] __attribute__((aligned(32))) = {
   { '\n', '\n', '\n', '\n', '\n', '\n', '\n', '\n',
+    '\n', '\n', '\n', '\n', '\n', '\n', '\n', '\n',
+    '\n', '\n', '\n', '\n', '\n', '\n', '\n', '\n',
     '\n', '\n', '\n', '\n', '\n', '\n', '\n', '\n' },
   { '\r', '\r', '\r', '\r', '\r', '\r', '\r', '\r',
+    '\r', '\r', '\r', '\r', '\r', '\r', '\r', '\r',
+    '\r', '\r', '\r', '\r', '\r', '\r', '\r', '\r',
     '\r', '\r', '\r', '\r', '\r', '\r', '\r', '\r' },
   { '\\', '\\', '\\', '\\', '\\', '\\', '\\', '\\',
+    '\\', '\\', '\\', '\\', '\\', '\\', '\\', '\\',
+    '\\', '\\', '\\', '\\', '\\', '\\', '\\', '\\',
     '\\', '\\', '\\', '\\', '\\', '\\', '\\', '\\' },
   { '?', '?', '?', '?', '?', '?', '?', '?',
+    '?', '?', '?', '?', '?', '?', '?', '?',
+    '?', '?', '?', '?', '?', '?', '?', '?',
     '?', '?', '?', '?', '?', '?', '?', '?' },
 };
 
+#ifndef __AVX2__
 /* A version of the fast scanner using SSE2 vectorized byte compare
    insns.  */
 
 static const uchar *
@@ -343,8 +355,9 @@ search_line_sse2 (const uchar *s, const uchar *end ATTRIBUTE_UNUSED)
   found = __builtin_ctz(found);
   return (const uchar *)p + found;
 }
+#endif
 
-#ifdef HAVE_SSE4
+#if defined(HAVE_SSE4) && !defined(__AVX2__)
 /* A version of the fast scanner using SSE 4.2 vectorized string
    insns.  */
 
 static const uchar *
@@ -425,6 +438,71 @@ search_line_sse42 (const uchar *s, const uchar *end)
 #define search_line_sse42 search_line_sse2
 #endif
 
+#ifdef HAVE_AVX2
+
+/* A version of the fast scanner using AVX2 vectorized byte compare
+   insns.  */
+
+static const uchar *
+#ifndef __AVX2__
+__attribute__((__target__("avx2")))
+#endif
+search_line_avx2 (const uchar *s, const uchar *end ATTRIBUTE_UNUSED)
+{
+  typedef char v32qi __attribute__ ((__vector_size__ (32)));
+
+  const v32qi repl_nl = *(const v32qi *)repl_chars[0];
+  const v32qi repl_cr = *(const v32qi *)repl_chars[1];
+  const v32qi repl_bs = *(const v32qi *)repl_chars[2];
+  const v32qi repl_qm = *(const v32qi *)repl_chars[3];
+
+  unsigned int misalign, found, mask;
+  const v32qi *p;
+  v32qi data, t;
+
+  /* Align the source pointer.  */
+  misalign = (uintptr_t)s & 31;
+  p = (const v32qi *)((uintptr_t)s & -32);
+  data = *p;
+
+  /* Create a mask for the bytes that are valid within the first
+     32-byte block.  The Idea here is that the AND with the mask
+     within the loop is "free", since we need some AND or TEST
+     insn in order to set the flags for the branch anyway.  */
+  mask = -1u << misalign;
+
+  /* Main loop processing 32 bytes at a time.  */
+  goto start;
+  do
+    {
+      data = *++p;
+      mask = -1;
+
+    start:
+      t  = data == repl_nl;
+      t |= data == repl_cr;
+      t |= data == repl_bs;
+      t |= data == repl_qm;
+      found = __builtin_ia32_pmovmskb256 (t);
+      found &= mask;
+    }
+  while (!found);
+
+  /* FOUND contains 1 in bits for which we matched a relevant
+     character.  Conversion to the byte index is trivial.  */
+  found = __builtin_ctz (found);
+  return (const uchar *)p + found;
+}
+
+#else
+#define search_line_avx2 search_line_sse2
+#endif
+
+#ifdef __AVX2__
+/* Avoid indirect calls to encourage inlining if the compiler is built
+   using AVX.  */
+#define search_line_fast search_line_avx2
+#else
+
 /* Check the CPU capabilities.  */
 
 #include "../gcc/config/i386/cpuid.h"
@@ -436,7 +514,7 @@ static search_line_fast_type search_line_fast;
 static inline void
 init_vectorized_lexer (void)
 {
-  unsigned dummy, ecx = 0, edx = 0;
+  unsigned dummy, ecx = 0, edx = 0, ebx = 0;
   search_line_fast_type impl = search_line_acc_char;
   int minimum = 0;
@@ -448,6 +526,10 @@ init_vectorized_lexer (void)
 
   if (minimum == 3)
     impl = search_line_sse42;
+  else if (__get_cpuid_max (0, &dummy) >= 7
+	   && __get_cpuid_count (7, 0, &dummy, &ebx, &dummy, &dummy)
+	   && (ebx & bit_AVX2))
+    impl = search_line_avx2;
   else if (__get_cpuid (1, &dummy, &dummy, &ecx, &edx) || minimum == 2)
     {
       if (minimum == 3 || (ecx & bit_SSE4_2))
@@ -458,6 +540,7 @@ init_vectorized_lexer (void)
 
   search_line_fast = impl;
 }
+#endif /* !__AVX2__ */
 
 #elif (GCC_VERSION >= 4005) && defined(_ARCH_PWR8) && defined(__ALTIVEC__)