From patchwork Sun May 12 21:57:28 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Roger Sayle X-Patchwork-Id: 1934348 Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@legolas.ozlabs.org Authentication-Results: legolas.ozlabs.org; dkim=fail reason="signature verification failed" (2048-bit key; unprotected) header.d=nextmovesoftware.com header.i=@nextmovesoftware.com header.a=rsa-sha256 header.s=default header.b=QI/1mqOT; dkim-atps=neutral Authentication-Results: legolas.ozlabs.org; spf=pass (sender SPF authorized) smtp.mailfrom=gcc.gnu.org (client-ip=2620:52:3:1:0:246e:9693:128c; helo=server2.sourceware.org; envelope-from=gcc-patches-bounces+incoming=patchwork.ozlabs.org@gcc.gnu.org; receiver=patchwork.ozlabs.org) Received: from server2.sourceware.org (server2.sourceware.org [IPv6:2620:52:3:1:0:246e:9693:128c]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature ECDSA (secp384r1) server-digest SHA384) (No client certificate requested) by legolas.ozlabs.org (Postfix) with ESMTPS id 4VcxLw0v3Cz1yfq for ; Mon, 13 May 2024 07:58:00 +1000 (AEST) Received: from server2.sourceware.org (localhost [IPv6:::1]) by sourceware.org (Postfix) with ESMTP id 1502C384AB57 for ; Sun, 12 May 2024 21:57:58 +0000 (GMT) X-Original-To: gcc-patches@gcc.gnu.org Delivered-To: gcc-patches@gcc.gnu.org Received: from server.nextmovesoftware.com (server.nextmovesoftware.com [69.48.154.134]) by sourceware.org (Postfix) with ESMTPS id 7C7833858D3C for ; Sun, 12 May 2024 21:57:32 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org 7C7833858D3C Authentication-Results: sourceware.org; dmarc=none (p=none dis=none) header.from=nextmovesoftware.com Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=nextmovesoftware.com ARC-Filter: OpenARC Filter v1.0.0 sourceware.org 7C7833858D3C Authentication-Results: server2.sourceware.org; arc=none smtp.remote-ip=69.48.154.134 ARC-Seal: i=1; a=rsa-sha256; d=sourceware.org; s=key; t=1715551057; cv=none; b=jLyoCeBrUw7dGOzP8Tjc6SfRKG6znu7bAiiZyTZ4ri399oe+P8z8JJksuaLwkBOEEyH7R1Eqx9lnQLZugsPk/Bps66QkEsC1zb0IyeAC09OqHUvVq2ENK6RkgdAJdG29DPzjZTQ/guCIrSsglw4RDCqbAUpnAZSEwycrV/p6tbQ= ARC-Message-Signature: i=1; a=rsa-sha256; d=sourceware.org; s=key; t=1715551057; c=relaxed/simple; bh=F01YkbDaLcnoQKKUmx4SdbnPO4aIvXMmc16+65BQeSg=; h=DKIM-Signature:From:To:Subject:Date:Message-ID:MIME-Version; b=EbsQfz4UdTH9aUOmHCiD+4Xfl2ZPx4DDTPWAiLTAbSZwFtdr069PC996uZDX0mOnoACGzUu7TqB2TNP+RDcpsga/XDsXPUA4GWeT/hvSh2vY3Ja2RZDcxz5/RKNb6fHjjz8+CzZ2OH6z5cuncEqaWVUiqNc+vVQtHDQPcowM5xc= ARC-Authentication-Results: i=1; server2.sourceware.org DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=nextmovesoftware.com; s=default; h=Content-Type:MIME-Version:Message-ID: Date:Subject:Cc:To:From:Sender:Reply-To:Content-Transfer-Encoding:Content-ID: Content-Description:Resent-Date:Resent-From:Resent-Sender:Resent-To:Resent-Cc :Resent-Message-ID:In-Reply-To:References:List-Id:List-Help:List-Unsubscribe: List-Subscribe:List-Post:List-Owner:List-Archive; bh=ZoXBeOLKFWfL0ZklaNAXhhKo+UA/OpHFhsvZO4sTias=; b=QI/1mqOT7+KQSKJntwukdOvCud CYz2YEGvyxSHm8bemn7DY7FQLUhzIGIDVLbXPv4vfrU7Lb7Kqyx9mUXDAHpHQnaFQZSA9q8Da6NYd 3EJ2oKcfC9THsvjOA9FoWYnO1ZqPHL082YsXfsUJtdSF5Aq4x1G9UO/fyVZV8SDgvO/89PQhIkJf0 AS9Vsc0hWNiTtdh176Ew8HPDG1TNCSNkkRXcfgNE9bUPBEvC4zGHJEl7d1Nq58rFGyksK0O+8tGVS 5/YxEH7EUTPhzxHyNiNdaSNuSd8mNaIwLMOn6gUkJ5uQTE4/255UIJC9WmlAUqXhr4t5WuX5nTmOU rpXZIecQ==; Received: from [168.86.198.223] (port=62091 helo=Dell) by server.nextmovesoftware.com with esmtpsa (TLS1.2) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96.2) (envelope-from ) id 1s6HCF-009Ybo-1n; Sun, 12 May 2024 17:57:31 -0400 From: "Roger Sayle" To: Cc: "'Hongtao Liu'" , "'Uros Bizjak'" Subject: [x86 SSE] Improve handling of ternlog instructions in i386/sse.md Date: Sun, 12 May 2024 22:57:28 +0100 Message-ID: <001801daa4b7$62d704c0$28850e40$@nextmovesoftware.com> MIME-Version: 1.0 X-Mailer: Microsoft Outlook 16.0 Content-Language: en-gb Thread-Index: AdqktZ0cTFf4DxIFTl6+XjnNHWhkBw== X-AntiAbuse: This header was added to track abuse, please include it with any abuse report X-AntiAbuse: Primary Hostname - server.nextmovesoftware.com X-AntiAbuse: Original Domain - gcc.gnu.org X-AntiAbuse: Originator/Caller UID/GID - [47 12] / [47 12] X-AntiAbuse: Sender Address Domain - nextmovesoftware.com X-Get-Message-Sender-Via: server.nextmovesoftware.com: authenticated_id: roger@nextmovesoftware.com X-Authenticated-Sender: server.nextmovesoftware.com: roger@nextmovesoftware.com X-Source: X-Source-Args: X-Source-Dir: X-Spam-Status: No, score=-9.6 required=5.0 tests=BAYES_00, DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, GIT_PATCH_0, KAM_SHORT, MEDICAL_SUBJECT, SPF_HELO_NONE, SPF_PASS, TXREP autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org X-BeenThere: gcc-patches@gcc.gnu.org X-Mailman-Version: 2.1.30 Precedence: list List-Id: Gcc-patches mailing list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: gcc-patches-bounces+incoming=patchwork.ozlabs.org@gcc.gnu.org This patch improves the way that the x86 backend recognizes and expands AVX512's bitwise ternary logic (vpternlog) instructions. As a motivating example consider the following code which calculates the carry out from a (binary) full adder: typedef unsigned long long v4di __attribute((vector_size(32))); v4di foo(v4di a, v4di b, v4di c) { return (a & b) | ((a ^ b) & c); } with -O2 -march=cascadelake current mainline produces: foo: vpternlogq $96, %ymm0, %ymm1, %ymm2 vmovdqa %ymm0, %ymm3 vmovdqa %ymm2, %ymm0 vpternlogq $248, %ymm3, %ymm1, %ymm0 ret with the patch below, we now generate a single instruction: foo: vpternlogq $232, %ymm2, %ymm1, %ymm0 ret The AVX512 vpternlog[qd] instructions are a very cool addition to the x86 instruction set, that can calculate any Boolean function of three inputs in a single fast instruction. As the truth table for any three-input function has 8 rows, any specific function can be represented by specifying those bits, i.e. by an 8-bit byte, an immediate integer between 0 and 256. Examples of ternary functions and their indices are given below: 0x01 1: ~((b|a)|c) 0x02 2: (~(b|a))&c 0x03 3: ~(b|a) 0x04 4: (~(c|a))&b 0x05 5: ~(c|a) 0x06 6: (c^b)&~a 0x07 7: ~((c&b)|a) 0x08 8: (~a&c)&b (~a&b)&c (c&b)&~a 0x09 9: ~((c^b)|a) 0x0a 10: ~a&c 0x0b 11: ~((~c&b)|a) (~b|c)&~a 0x0c 12: ~a&b 0x0d 13: ~((~b&c)|a) (~c|b)&~a 0x0e 14: (c|b)&~a 0x0f 15: ~a 0x10 16: (~(c|b))&a 0x11 17: ~(c|b) ... 0xf4 244: (~c&b)|a 0xf5 245: ~c|a 0xf6 246: (c^b)|a 0xf7 247: (~(c&b))|a 0xf8 248: (c&b)|a 0xf9 249: (~(c^b))|a 0xfa 250: c|a 0xfb 251: (c|a)|~b (~b|a)|c (~b|c)|a 0xfc 252: b|a 0xfd 253: (b|a)|~c (~c|a)|b (~c|b)|a 0xfe 254: (b|a)|c (c|a)|b (c|b)|a A naive implementation (in many compilers) might be add define_insn patterns for all 256 different functions. The situation is even worse as many of these Boolean functions don't have a "canonical form" (as produced by simplify_rtx) and would each need multiple patterns. See the space-separated equivalent expressions in the table above. This need to provide instruction "templates" might explain why GCC, LLVM and ICC all exhibit similar coverage problems in their ability to recognize x86 ternlog ternary functions. Perhaps a unique feature of GCC's design is that in addition to regular define_insn templates, machine descriptions can also perform pattern matching via a match_operator (and its corresponding predicate). This patch introduces a ternlog_operand predicate that matches a (possibly infinite) set of expression trees, identifying those that have at most three unique operands. This then allows a define_insn_and_split to recognize suitable expressions and then transform them into the appropriate UNSPEC_VTERNLOG as a pre-reload splitter. This design allows combine to smash together arbitrarily complex Boolean expressions, then transform them into an UNSPEC before register allocation. As an "optimization", where possible ix86_expand_ternlog generates a simpler binary operation, using AND, XOR, IOR or ANDN where possible, and in a few cases attempts to "canonicalize" the ternlog, by reordering or duplicating operands, so that later CSE passes have a hope of spotting equivalent values. Another benefit of this patch is that it improves the code generated for PR target/115021 [see comment #1]. This patch leaves the existing ternlog patterns in sse.md (for now), many of which are made obsolete by these changes. In theory we now only need one define_insn for UNSPEC_VTERNLOG. One complication from these previous variants was that they inconsistently used decimal vs. hexadecimal to specify the immediate constant operand in assembly language, making the list of tweaks to the testsuite with this patch larger than it might have been. I propose to remove the vestigial patterns in a follow-up patch, once this approach has baked (proven to be stable) on mainline. This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and make -k check, both with and without --target_board=unix{-m32} with no new failures. Ok for mainline? 2024-05-12 Roger Sayle gcc/ChangeLog PR target/115021 * config/i386/i386-expand.cc (ix86_expand_args_builtin): Call fixup_modeless_constant before testing predicates. Only call copy_to_mode_reg on memory operands (after the first one). (ix86_gen_bcst_mem): Helper function to convert a CONST_VECTOR into a VEC_DUPLICATE if possible. (ix86_ternlog_idx): Convert an RTX expression into a ternlog index between 0 and 255, recording the operands in ARGS, if possible or return -1 if this is not possible/valid. (ix86_ternlog_leaf_p): Helper function to identify "leaves" of a ternlog expression, e.g. REG_P, MEM_P, CONST_VECTOR, etc. (ix86_ternlog_operand_p): Test whether a expression is suitable for and prefered as an UNSPEC_TERNLOG. (ix86_expand_ternlog_binop): Helper function to construct the binary operation corresponding to a sufficiently simple ternlog. (ix86_expand_ternlog_andnot): Helper function to construct a ANDN operation corresponding to a sufficiently simple ternlog. (ix86_expand_ternlog): Expand a 3-operand ternary logic expression, constructing either an UNSPEC_TERNLOG or simpler rtx expression. Called from builtin expanders and pre-reload splitters. * config/i386/i386-protos.h (ix86_ternlog_idx): Prototype here. (ix86_ternlog_operand_p): Likewise. (ix86_expand_ternlog): Likewise. * config/i386/predicates.md (ternlog_operand): New predicate that calls xi86_ternlog_operand_p. * config/i386/sse.md (_vpternlog_0): New define_insn_and_split that recognizes a SET_SRC of ternlog_operand and expands it via ix86_expand_ternlog pre-reload. (_vternlog_mask): Convert from define_insn to define_expand. Use ix86_expand_ternlog if the mask operand is ~0 (or 255 or -1). (*_vternlog_mask): define_insn renamed from above. gcc/testsuite/ChangeLog * gcc.target/i386/avx512f-andn-di-zmm-2.c: Update test case. * gcc.target/i386/avx512f-andn-si-zmm-2.c: Likewise. * gcc.target/i386/avx512f-orn-si-zmm-1.c: Likewise. * gcc.target/i386/avx512f-orn-si-zmm-2.c: Likewise. * gcc.target/i386/avx512f-vpternlogd-1.c: Likewise. * gcc.target/i386/avx512f-vpternlogq-1.c: Likewise. * gcc.target/i386/avx512vl-vpternlogd-1.c: Likewise. * gcc.target/i386/avx512vl-vpternlogq-1.c: Likewise. * gcc.target/i386/pr100711-3.c: Likewise. * gcc.target/i386/pr100711-4.c: Likewise. * gcc.target/i386/pr100711-5.c: Likewise. Thanks in advance, Roger diff --git a/gcc/config/i386/i386-expand.cc b/gcc/config/i386/i386-expand.cc index a613291..be60915 100644 --- a/gcc/config/i386/i386-expand.cc +++ b/gcc/config/i386/i386-expand.cc @@ -11707,6 +11707,8 @@ ix86_expand_args_builtin (const struct builtin_description *d, tree arg = CALL_EXPR_ARG (exp, i); rtx op = expand_normal (arg); machine_mode mode = insn_p->operand[i + 1].mode; + /* Need to fixup modeless constant before testing predicate. */ + op = fixup_modeless_constant (op, mode); bool match = insn_p->operand[i + 1].predicate (op, mode); if (second_arg_count && i == 1) @@ -11873,13 +11875,15 @@ ix86_expand_args_builtin (const struct builtin_description *d, /* If we aren't optimizing, only allow one memory operand to be generated. */ if (memory_operand (op, mode)) - num_memory++; - - op = fixup_modeless_constant (op, mode); + { + num_memory++; + if (!optimize && num_memory > 1) + op = copy_to_mode_reg (mode, op); + } if (GET_MODE (op) == mode || GET_MODE (op) == VOIDmode) { - if (optimize || !match || num_memory > 1) + if (!match) op = copy_to_mode_reg (mode, op); } else @@ -25480,4 +25484,511 @@ ix86_expand_fast_convert_bf_to_sf (rtx val) return ret; } +/* Attempt to convert a CONST_VECTOR into a bcst_mem_operand. + Returns NULL_RTX if X is cannot be expressed as a suitable + VEC_DUPLICATE in mode MODE. */ + +static rtx +ix86_gen_bcst_mem (machine_mode mode, rtx x) +{ + if (!TARGET_AVX512F + || GET_CODE (x) != CONST_VECTOR + || (!TARGET_AVX512VL + && (GET_MODE_SIZE (mode) != 64 || !TARGET_EVEX512)) + || !VALID_BCST_MODE_P (GET_MODE_INNER (mode)) + /* Disallow HFmode broadcast. */ + || GET_MODE_SIZE (GET_MODE_INNER (mode)) < 4) + return NULL_RTX; + + rtx cst = CONST_VECTOR_ELT (x, 0); + if (!CONST_SCALAR_INT_P (cst) + && !CONST_DOUBLE_P (cst) + && !CONST_FIXED_P (cst)) + return NULL_RTX; + + int n_elts = GET_MODE_NUNITS (mode); + if (CONST_VECTOR_NUNITS (x) != n_elts) + return NULL_RTX; + + for (int i = 1; i < n_elts; i++) + if (!rtx_equal_p (cst, CONST_VECTOR_ELT (x, i))) + return NULL_RTX; + + rtx mem = force_const_mem (GET_MODE_INNER (mode), cst); + return gen_rtx_VEC_DUPLICATE (mode, validize_mem (mem)); +} + +/* Determine the ternlog immediate index that implements 3-operand + ternary logic expression OP. This uses and modifies the 3 element + array ARGS to record and check the leaves, either 3 REGs, or 2 REGs + and MEM. Returns an index between 0 and 255 for a valid ternlog, + or -1 if the expression isn't suitable. */ + +int +ix86_ternlog_idx (rtx op, rtx *args) +{ + int idx0, idx1; + + if (!op) + return -1; + + switch (GET_CODE (op)) + { + case REG: + if (!args[0]) + { + args[0] = op; + return 0xf0; + } + if (REGNO (op) == REGNO (args[0])) + return 0xf0; + if (!args[1]) + { + args[1] = op; + return 0xcc; + } + if (REGNO (op) == REGNO (args[1])) + return 0xcc; + if (!args[2]) + { + args[2] = op; + return 0xaa; + } + if (REG_P (args[2]) && REGNO (op) == REGNO (args[2])) + return 0xaa; + return -1; + + case VEC_DUPLICATE: + if (!bcst_mem_operand (op, GET_MODE (op))) + return -1; + /* FALLTHRU */ + + case MEM: + if (MEM_P (op) + && MEM_VOLATILE_P (op) + && !volatile_ok) + return -1; + /* FALLTHRU */ + + case CONST_VECTOR: + if (!args[2]) + { + args[2] = op; + return 0xaa; + } + /* Maximum of one volatile memory reference per expression. */ + if (side_effects_p (op) || side_effects_p (args[2])) + return -1; + if (rtx_equal_p (op, args[2])) + return 0xaa; + /* Check if one CONST_VECTOR is the ones-complement of the other. */ + if (GET_CODE (op) == CONST_VECTOR + && GET_CODE (args[2]) == CONST_VECTOR + && rtx_equal_p (simplify_const_unary_operation (NOT, GET_MODE (op), + op, GET_MODE (op)), + args[2])) + return 0x55; + return -1; + + case SUBREG: + if (!VECTOR_MODE_P (GET_MODE (SUBREG_REG (op))) + || GET_MODE_SIZE (GET_MODE (SUBREG_REG (op))) + != GET_MODE_SIZE (GET_MODE (op))) + return -1; + return ix86_ternlog_idx (SUBREG_REG (op), args); + + case NOT: + idx0 = ix86_ternlog_idx (XEXP (op, 0), args); + return (idx0 >= 0) ? idx0 ^ 0xff : -1; + + case AND: + idx0 = ix86_ternlog_idx (XEXP (op, 0), args); + if (idx0 < 0) + return -1; + idx1 = ix86_ternlog_idx (XEXP (op, 1), args); + return (idx1 >= 0) ? idx0 & idx1 : -1; + + case IOR: + idx0 = ix86_ternlog_idx (XEXP (op, 0), args); + if (idx0 < 0) + return -1; + idx1 = ix86_ternlog_idx (XEXP (op, 1), args); + return (idx1 >= 0) ? idx0 | idx1 : -1; + + case XOR: + idx0 = ix86_ternlog_idx (XEXP (op, 0), args); + if (idx0 < 0) + return -1; + if (vector_all_ones_operand (XEXP (op, 1), GET_MODE (op))) + return idx0 ^ 0xff; + idx1 = ix86_ternlog_idx (XEXP (op, 1), args); + return (idx1 >= 0) ? idx0 ^ idx1 : -1; + + case UNSPEC: + if (XINT (op, 1) != UNSPEC_VTERNLOG + || XVECLEN (op, 0) != 4 + || CONST_INT_P (XVECEXP (op, 0, 3))) + return -1; + + /* TODO: Handle permuted operands. */ + if (ix86_ternlog_idx (XVECEXP (op, 0, 0), args) != 0xf0 + || ix86_ternlog_idx (XVECEXP (op, 0, 1), args) != 0xcc + || ix86_ternlog_idx (XVECEXP (op, 0, 2), args) != 0xaa) + return -1; + return INTVAL (XVECEXP (op, 0, 3)); + + default: + return -1; + } +} + +/* Return TRUE if OP (in mode MODE) is the leaf of a ternary logic + expression, such as a register or a memory reference. */ + +bool +ix86_ternlog_leaf_p (rtx op, machine_mode mode) +{ + /* We can't use memory_operand here, as it may return a different + value before and after reload (for volatile MEMs) which creates + problems splitting instructions. */ + return register_operand (op, mode) + || MEM_P (op) + || GET_CODE (op) == CONST_VECTOR + || bcst_mem_operand (op, mode); +} + +/* Test whether OP is a 3-operand ternary logic expression suitable + for use in a ternlog instruction. */ + +bool +ix86_ternlog_operand_p (rtx op) +{ + rtx op0, op1; + rtx args[3]; + + args[0] = NULL_RTX; + args[1] = NULL_RTX; + args[2] = NULL_RTX; + int idx = ix86_ternlog_idx (op, args); + if (idx < 0) + return false; + + /* Don't match simple (binary or unary) expressions. */ + machine_mode mode = GET_MODE (op); + switch (GET_CODE (op)) + { + case AND: + op0 = XEXP (op, 0); + op1 = XEXP (op, 1); + + /* Prefer pand. */ + if (ix86_ternlog_leaf_p (op0, mode) + && ix86_ternlog_leaf_p (op1, mode)) + return false; + /* Prefer pandn. */ + if (GET_CODE (op0) == NOT + && register_operand (XEXP (op0, 0), mode) + && ix86_ternlog_leaf_p (op1, mode)) + return false; + break; + + case IOR: + /* Prefer por. */ + if (ix86_ternlog_leaf_p (XEXP (op, 0), mode) + && ix86_ternlog_leaf_p (XEXP (op, 1), mode)) + return false; + break; + + case XOR: + op1 = XEXP (op, 1); + /* Prefer pxor. */ + if (ix86_ternlog_leaf_p (XEXP (op, 0), mode) + && (ix86_ternlog_leaf_p (op1, mode) + || vector_all_ones_operand (op1, mode))) + return false; + break; + + default: + break; + } + return true; +} + +/* Helper function for ix86_expand_ternlog. */ +static rtx +ix86_expand_ternlog_binop (enum rtx_code code, machine_mode mode, + rtx op0, rtx op1, rtx target) +{ + if (GET_MODE (op0) != mode) + op0 = gen_lowpart (mode, op0); + if (GET_MODE (op1) != mode) + op1 = gen_lowpart (mode, op1); + + if (GET_CODE (op0) == CONST_VECTOR) + op0 = validize_mem (force_const_mem (mode, op0)); + if (GET_CODE (op1) == CONST_VECTOR) + op1 = validize_mem (force_const_mem (mode, op1)); + + if (memory_operand (op0, mode)) + { + if (memory_operand (op1, mode)) + op0 = force_reg (mode, op0); + else + std::swap (op0, op1); + } + rtx ops[3] = { target, op0, op1 }; + ix86_expand_vector_logical_operator (code, mode, ops); + return target; +} + + +/* Helper function for ix86_expand_ternlog. */ +static rtx +ix86_expand_ternlog_andnot (machine_mode mode, rtx op0, rtx op1, rtx target) +{ + if (GET_MODE (op0) != mode) + op0 = gen_lowpart (mode, op0); + op0 = gen_rtx_NOT (mode, op0); + if (GET_MODE (op1) != mode) + op1 = gen_lowpart (mode, op1); + emit_move_insn (target, gen_rtx_AND (mode, op0, op1)); + return target; +} + +/* Expand a 3-operand ternary logic expression. Return TARGET. */ +rtx +ix86_expand_ternlog (machine_mode mode, rtx op0, rtx op1, rtx op2, int idx, + rtx target) +{ + rtx tmp0, tmp1, tmp2; + + if (!target) + target = gen_reg_rtx (mode); + + /* Canonicalize ternlog index for degenerate (duplicated) operands. */ + if (rtx_equal_p (op0, op1) && rtx_equal_p (op0, op2)) + switch (idx & 0x81) + { + case 0x00: + idx = 0x00; + break; + case 0x01: + idx = 0x0f; + break; + case 0x80: + idx = 0xf0; + break; + case 0x81: + idx = 0xff; + break; + } + + switch (idx & 0xff) + { + case 0x00: + emit_move_insn (target, CONST0_RTX (mode)); + return target; + + case 0x0a: /* ~a&c */ + if ((!op1 || !side_effects_p (op1)) + && register_operand (op0, mode) + && register_operand (op2, mode)) + return ix86_expand_ternlog_andnot (mode, op0, op1, target); + break; + + case 0x0c: /* ~a&b */ + if ((!op2 || !side_effects_p (op2)) + && register_operand (op0, mode) + && register_operand (op1, mode)) + return ix86_expand_ternlog_andnot (mode, op0, op1, target); + break; + + case 0x0f: /* ~a */ + if ((!op1 || !side_effects_p (op1)) + && (!op2 || !side_effects_p (op2))) + { + if (GET_MODE (op0) != mode) + op0 = gen_lowpart (mode, op0); + if (!TARGET_64BIT && !register_operand (op0, mode)) + op0 = force_reg (mode, op0); + emit_move_insn (target, gen_rtx_XOR (mode, op0, CONSTM1_RTX (mode))); + return target; + } + break; + + case 0x22: /* ~b&c */ + if ((!op0 || !side_effects_p (op0)) + && register_operand (op1, mode) + && register_operand (op2, mode)) + return ix86_expand_ternlog_andnot (mode, op1, op2, target); + break; + + case 0x30: /* ~b&a */ + if ((!op2 || !side_effects_p (op2)) + && register_operand (op0, mode) + && register_operand (op1, mode)) + return ix86_expand_ternlog_andnot (mode, op1, op0, target); + break; + + case 0x33: /* ~b */ + if ((!op0 || !side_effects_p (op0)) + && (!op2 || !side_effects_p (op2))) + { + if (GET_MODE (op1) != mode) + op1 = gen_lowpart (mode, op1); + if (!TARGET_64BIT && !register_operand (op1, mode)) + op1 = force_reg (mode, op1); + emit_move_insn (target, gen_rtx_XOR (mode, op1, CONSTM1_RTX (mode))); + return target; + } + break; + + case 0x3c: /* a^b */ + if (!op2 || !side_effects_p (op2)) + return ix86_expand_ternlog_binop (XOR, mode, op0, op1, target); + break; + + case 0x44: /* ~c&b */ + if ((!op0 || !side_effects_p (op0)) + && register_operand (op1, mode) + && register_operand (op2, mode)) + return ix86_expand_ternlog_andnot (mode, op2, op1, target); + break; + + case 0x50: /* ~c&a */ + if ((!op1 || !side_effects_p (op1)) + && register_operand (op0, mode) + && register_operand (op2, mode)) + return ix86_expand_ternlog_andnot (mode, op2, op0, target); + break; + + case 0x55: /* ~c */ + if ((!op0 || !side_effects_p (op0)) + && (!op1 || !side_effects_p (op1))) + { + if (GET_MODE (op2) != mode) + op2 = gen_lowpart (mode, op2); + if (!TARGET_64BIT && !register_operand (op2, mode)) + op2 = force_reg (mode, op2); + emit_move_insn (target, gen_rtx_XOR (mode, op2, CONSTM1_RTX (mode))); + return target; + } + break; + + case 0x5a: /* a^c */ + if (!op1 || !side_effects_p (op1)) + return ix86_expand_ternlog_binop (XOR, mode, op0, op2, target); + break; + + case 0x66: /* b^c */ + if (!op0 || !side_effects_p (op0)) + return ix86_expand_ternlog_binop (XOR, mode, op1, op2, target); + break; + + case 0x88: /* b&c */ + if (!op0 || !side_effects_p (op0)) + return ix86_expand_ternlog_binop (AND, mode, op1, op2, target); + break; + + case 0xa0: /* a&c */ + if (!op1 || !side_effects_p (op1)) + return ix86_expand_ternlog_binop (AND, mode, op0, op2, target); + break; + + case 0xaa: /* c */ + if ((!op0 || !side_effects_p (op0)) + && (!op1 || !side_effects_p (op1))) + { + if (GET_MODE (op2) != mode) + op2 = gen_lowpart (mode, op2); + emit_move_insn (target, op2); + return target; + } + break; + + case 0xc0: /* a&b */ + if (!op2 || !side_effects_p (op2)) + return ix86_expand_ternlog_binop (AND, mode, op0, op1, target); + break; + + case 0xcc: /* b */ + if ((!op0 || !side_effects_p (op0)) + && (!op2 || !side_effects_p (op2))) + { + if (GET_MODE (op1) != mode) + op1 = gen_lowpart (mode, op1); + emit_move_insn (target, op1); + return target; + } + break; + + case 0xee: /* b|c */ + if (!op0 || !side_effects_p (op0)) + return ix86_expand_ternlog_binop (IOR, mode, op1, op2, target); + break; + + case 0xf0: /* a */ + if ((!op1 || !side_effects_p (op1)) + && (!op2 || !side_effects_p (op2))) + { + if (GET_MODE (op0) != mode) + op0 = gen_lowpart (mode, op0); + emit_move_insn (target, op0); + return target; + } + break; + + case 0xfa: /* a|c */ + if (!op1 || !side_effects_p (op1)) + return ix86_expand_ternlog_binop (IOR, mode, op0, op2, target); + break; + + case 0xfc: /* a|b */ + if (!op2 || !side_effects_p (op2)) + return ix86_expand_ternlog_binop (IOR, mode, op0, op1, target); + break; + + case 0xff: + emit_move_insn (target, CONSTM1_RTX (mode)); + return target; + } + + tmp0 = register_operand (op0, mode) ? op0 : force_reg (mode, op0); + if (GET_MODE (tmp0) != mode) + tmp0 = gen_lowpart (mode, tmp0); + + if (!op1 || rtx_equal_p (op0, op1)) + tmp1 = copy_rtx (tmp0); + else if (!register_operand (op1, mode)) + tmp1 = force_reg (mode, op1); + else + tmp1 = op1; + if (GET_MODE (tmp1) != mode) + tmp1 = gen_lowpart (mode, tmp1); + + if (!op2 || rtx_equal_p (op0, op2)) + tmp2 = copy_rtx (tmp0); + else if (rtx_equal_p (op1, op2)) + tmp2 = copy_rtx (tmp1); + else if (GET_CODE (op2) == CONST_VECTOR) + { + if (GET_MODE (op2) != mode) + op2 = gen_lowpart (mode, op2); + tmp2 = ix86_gen_bcst_mem (mode, op2); + if (!tmp2) + tmp2 = validize_mem (force_const_mem (mode, op2)); + } + else + tmp2 = op2; + if (GET_MODE (tmp2) != mode) + tmp2 = gen_lowpart (mode, tmp2); + /* Some memory_operands are not vector_memory_operands. */ + if (!bcst_vector_operand (tmp2, mode)) + tmp2 = force_reg (mode, tmp2); + + rtvec vec = gen_rtvec (4, tmp0, tmp1, tmp2, GEN_INT (idx)); + emit_move_insn (target, gen_rtx_UNSPEC (mode, vec, UNSPEC_VTERNLOG)); + return target; +} + #include "gt-i386-expand.h" diff --git a/gcc/config/i386/i386-protos.h b/gcc/config/i386/i386-protos.h index 46214a6..9a3e183 100644 --- a/gcc/config/i386/i386-protos.h +++ b/gcc/config/i386/i386-protos.h @@ -245,6 +245,11 @@ extern rtx ix86_expand_fast_convert_bf_to_sf (rtx); extern rtx ix86_memtag_untagged_pointer (rtx, rtx); extern bool ix86_memtag_can_tag_addresses (void); +extern int ix86_ternlog_idx (rtx op, rtx *args); +extern bool ix86_ternlog_operand_p (rtx op); +extern rtx ix86_expand_ternlog (machine_mode mode, rtx op0, rtx op1, rtx op2, + int idx, rtx target); + #ifdef TREE_CODE extern void init_cumulative_args (CUMULATIVE_ARGS *, tree, rtx, tree, int); #endif /* TREE_CODE */ diff --git a/gcc/config/i386/predicates.md b/gcc/config/i386/predicates.md index 2a97776..e0f38cd 100644 --- a/gcc/config/i386/predicates.md +++ b/gcc/config/i386/predicates.md @@ -1098,6 +1098,11 @@ (and (match_code "not") (match_test "nonimmediate_operand (XEXP (op, 0), mode)")))) +;; True for expressions valid for 3-operand ternlog instructions. +(define_predicate "ternlog_operand" + (and (match_code "not,and,ior,xor,subreg") + (match_test "ix86_ternlog_operand_p (op)"))) + ;; True if OP is acceptable as operand of DImode shift expander. (define_predicate "shiftdi_operand" (if_then_else (match_test "TARGET_64BIT") diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md index 1bf5072..3148651 100644 --- a/gcc/config/i386/sse.md +++ b/gcc/config/i386/sse.md @@ -12940,6 +12940,26 @@ ;; ;; and so on. +(define_insn_and_split "*_vpternlog_0" + [(set (match_operand:V 0 "register_operand") + (match_operand:V 1 "ternlog_operand"))] + "( == 64 || TARGET_AVX512VL + || (TARGET_AVX512F && TARGET_EVEX512 && !TARGET_PREFER_AVX256)) + && ix86_pre_reload_split ()" + "#" + "&& 1" + [(const_int 0)] +{ + rtx args[3]; + args[0] = NULL_RTX; + args[1] = NULL_RTX; + args[2] = NULL_RTX; + int idx = ix86_ternlog_idx (operands[1], args); + ix86_expand_ternlog (mode, args[0], args[1], args[2], idx, + operands[0]); + DONE; +}) + (define_code_iterator any_logic1 [and ior xor]) (define_code_iterator any_logic2 [and ior xor]) (define_code_attr logic_op [(and "&") (ior "|") (xor "^")]) @@ -13160,7 +13180,33 @@ }) -(define_insn "_vternlog_mask" +(define_expand "_vternlog_mask" + [(set (match_operand:VI48_AVX512VL 0 "register_operand") + (vec_merge:VI48_AVX512VL + (unspec:VI48_AVX512VL + [(match_operand:VI48_AVX512VL 1 "register_operand") + (match_operand:VI48_AVX512VL 2 "register_operand") + (match_operand:VI48_AVX512VL 3 "bcst_vector_operand") + (match_operand:SI 4 "const_0_to_255_operand")] + UNSPEC_VTERNLOG) + (match_dup 1) + (match_operand: 5 "general_operand")))] + "TARGET_AVX512F" +{ + unsigned HOST_WIDE_INT mode_mask = GET_MODE_MASK (mode); + if (CONST_INT_P (operands[5]) + && (UINTVAL (operands[5]) & mode_mask) == mode_mask) + { + ix86_expand_ternlog (mode, operands[1], operands[2], + operands[3], INTVAL (operands[4]), + operands[0]); + DONE; + } + if (!register_operand (operands[5], mode)) + operands[5] = force_reg (mode, operands[5]); +}) + +(define_insn "*_vternlog_mask" [(set (match_operand:VI48_AVX512VL 0 "register_operand" "=v") (vec_merge:VI48_AVX512VL (unspec:VI48_AVX512VL diff --git a/gcc/testsuite/gcc.target/i386/avx512f-andn-di-zmm-2.c b/gcc/testsuite/gcc.target/i386/avx512f-andn-di-zmm-2.c index 4ebb30f..24f3d6c 100644 --- a/gcc/testsuite/gcc.target/i386/avx512f-andn-di-zmm-2.c +++ b/gcc/testsuite/gcc.target/i386/avx512f-andn-di-zmm-2.c @@ -1,6 +1,6 @@ /* { dg-do compile } */ /* { dg-options "-mavx512f -mno-avx512vl -mprefer-vector-width=512 -O2" } */ -/* { dg-final { scan-assembler-times "vpternlogq\[ \\t\]+\\\$0x44, \\(%(?:eax|rdi|edi)\\)\\\{1to\[1-8\]+\\\}, %zmm\[0-9\]+, %zmm0" 1 } } */ +/* { dg-final { scan-assembler-times "vpternlogq\[ \\t\]+\\\$80, \\(%(?:eax|rdi|edi)\\)\\\{1to\[1-8\]+\\\}, %zmm\[0-9\]+, %zmm0" 1 } } */ /* { dg-final { scan-assembler-not "vpbroadcast" } } */ #define type __m512i diff --git a/gcc/testsuite/gcc.target/i386/avx512f-andn-si-zmm-2.c b/gcc/testsuite/gcc.target/i386/avx512f-andn-si-zmm-2.c index 86e7ebe..1f5e72d 100644 --- a/gcc/testsuite/gcc.target/i386/avx512f-andn-si-zmm-2.c +++ b/gcc/testsuite/gcc.target/i386/avx512f-andn-si-zmm-2.c @@ -1,6 +1,6 @@ /* { dg-do compile } */ /* { dg-options "-mavx512f -O2" } */ -/* { dg-final { scan-assembler-times "vpternlogd\[ \\t\]+\\\$0x44, \\(%(?:eax|rdi|edi)\\)\\\{1to\[1-8\]+\\\}, %zmm\[0-9\]+, %zmm0" 1 } } */ +/* { dg-final { scan-assembler-times "vpternlogd\[ \\t\]+\\\$80, \\(%(?:eax|rdi|edi)\\)\\\{1to\[1-8\]+\\\}, %zmm\[0-9\]+, %zmm0" 1 } } */ /* { dg-final { scan-assembler-not "vpbroadcast" } } */ #define type __m512i diff --git a/gcc/testsuite/gcc.target/i386/avx512f-orn-si-zmm-1.c b/gcc/testsuite/gcc.target/i386/avx512f-orn-si-zmm-1.c index 7d02f03..d21f48f 100644 --- a/gcc/testsuite/gcc.target/i386/avx512f-orn-si-zmm-1.c +++ b/gcc/testsuite/gcc.target/i386/avx512f-orn-si-zmm-1.c @@ -1,6 +1,6 @@ /* { dg-do compile } */ /* { dg-options "-mavx512f -mno-avx512vl -mprefer-vector-width=512 -O2" } */ -/* { dg-final { scan-assembler-times "vpternlogd\[ \\t\]+\\\$0xdd, \\(%(?:eax|rdi|edi)\\)\\\{1to\[1-8\]+\\\}, %zmm\[0-9\]+, %zmm0" 1 } } */ +/* { dg-final { scan-assembler-times "vpternlogd\[ \\t\]+\\\$245, \\(%(?:eax|rdi|edi)\\)\\\{1to\[1-8\]+\\\}, %zmm\[0-9\]+, %zmm0" 1 } } */ /* { dg-final { scan-assembler-not "vpbroadcast" } } */ #define type __m512i diff --git a/gcc/testsuite/gcc.target/i386/avx512f-orn-si-zmm-2.c b/gcc/testsuite/gcc.target/i386/avx512f-orn-si-zmm-2.c index c793083..5359200 100644 --- a/gcc/testsuite/gcc.target/i386/avx512f-orn-si-zmm-2.c +++ b/gcc/testsuite/gcc.target/i386/avx512f-orn-si-zmm-2.c @@ -1,6 +1,6 @@ /* { dg-do compile } */ /* { dg-options "-mavx512f -mno-avx512vl -mprefer-vector-width=512 -O2" } */ -/* { dg-final { scan-assembler-times "vpternlogd\[ \\t\]+\\\$0xbb, \\(%(?:eax|rdi|edi)\\)\\\{1to\[1-8\]+\\\}, %zmm\[0-9\]+, %zmm0" 1 } } */ +/* { dg-final { scan-assembler-times "vpternlogd\[ \\t\]+\\\$175, \\(%(?:eax|rdi|edi)\\)\\\{1to\[1-8\]+\\\}, %zmm\[0-9\]+, %zmm0" 1 } } */ /* { dg-final { scan-assembler-not "vpbroadcast" } } */ #define type __m512i diff --git a/gcc/testsuite/gcc.target/i386/avx512f-vpternlogd-1.c b/gcc/testsuite/gcc.target/i386/avx512f-vpternlogd-1.c index a88153a..b098487 100644 --- a/gcc/testsuite/gcc.target/i386/avx512f-vpternlogd-1.c +++ b/gcc/testsuite/gcc.target/i386/avx512f-vpternlogd-1.c @@ -1,6 +1,5 @@ /* { dg-do compile } */ /* { dg-options "-mavx512f -O2" } */ -/* { dg-final { scan-assembler-times "vpternlogd\[ \\t\]+\[^\{\n\]*%zmm\[0-9\]+(?:\n|\[ \\t\]+#)" 1 } } */ /* { dg-final { scan-assembler-times "vpternlogd\[ \\t\]+\[^\{\n\]*%zmm\[0-9\]+\{%k\[1-7\]\}(?:\n|\[ \\t\]+#)" 1 } } */ /* { dg-final { scan-assembler-times "vpternlogd\[ \\t\]+\[^\{\n\]*%zmm\[0-9\]+\{%k\[1-7\]\}\{z\}(?:\n|\[ \\t\]+#)" 1 } } */ diff --git a/gcc/testsuite/gcc.target/i386/avx512f-vpternlogq-1.c b/gcc/testsuite/gcc.target/i386/avx512f-vpternlogq-1.c index ef30246..8e5d22f 100644 --- a/gcc/testsuite/gcc.target/i386/avx512f-vpternlogq-1.c +++ b/gcc/testsuite/gcc.target/i386/avx512f-vpternlogq-1.c @@ -1,6 +1,5 @@ /* { dg-do compile } */ /* { dg-options "-mavx512f -O2" } */ -/* { dg-final { scan-assembler-times "vpternlogq\[ \\t\]+\[^\{\n\]*%zmm\[0-9\]+(?:\n|\[ \\t\]+#)" 1 } } */ /* { dg-final { scan-assembler-times "vpternlogq\[ \\t\]+\[^\{\n\]*%zmm\[0-9\]+\{%k\[1-7\]\}(?:\n|\[ \\t\]+#)" 1 } } */ /* { dg-final { scan-assembler-times "vpternlogq\[ \\t\]+\[^\{\n\]*%zmm\[0-9\]+\{%k\[1-7\]\}\{z\}(?:\n|\[ \\t\]+#)" 1 } } */ diff --git a/gcc/testsuite/gcc.target/i386/avx512vl-vpternlogd-1.c b/gcc/testsuite/gcc.target/i386/avx512vl-vpternlogd-1.c index 045a266..dd53563 100644 --- a/gcc/testsuite/gcc.target/i386/avx512vl-vpternlogd-1.c +++ b/gcc/testsuite/gcc.target/i386/avx512vl-vpternlogd-1.c @@ -1,7 +1,5 @@ /* { dg-do compile } */ /* { dg-options "-mavx512vl -O2" } */ -/* { dg-final { scan-assembler-times "vpternlogd\[ \\t\]+\[^\{\n\]*%ymm\[0-9\]+(?:\n|\[ \\t\]+#)" 1 } } */ -/* { dg-final { scan-assembler-times "vpternlogd\[ \\t\]+\[^\{\n\]*%xmm\[0-9\]+(?:\n|\[ \\t\]+#)" 1 } } */ /* { dg-final { scan-assembler-times "vpternlogd\[ \\t\]+\[^\{\n\]*%ymm\[0-9\]+\{%k\[1-7\]\}(?:\n|\[ \\t\]+#)" 1 } } */ /* { dg-final { scan-assembler-times "vpternlogd\[ \\t\]+\[^\{\n\]*%xmm\[0-9\]+\{%k\[1-7\]\}(?:\n|\[ \\t\]+#)" 1 } } */ /* { dg-final { scan-assembler-times "vpternlogd\[ \\t\]+\[^\{\n\]*%ymm\[0-9\]+\{%k\[1-7\]\}\{z\}(?:\n|\[ \\t\]+#)" 1 } } */ diff --git a/gcc/testsuite/gcc.target/i386/avx512vl-vpternlogq-1.c b/gcc/testsuite/gcc.target/i386/avx512vl-vpternlogq-1.c index 3a6707c..31fec3e 100644 --- a/gcc/testsuite/gcc.target/i386/avx512vl-vpternlogq-1.c +++ b/gcc/testsuite/gcc.target/i386/avx512vl-vpternlogq-1.c @@ -1,7 +1,5 @@ /* { dg-do compile } */ /* { dg-options "-mavx512vl -O2" } */ -/* { dg-final { scan-assembler-times "vpternlogq\[ \\t\]+\[^\{\n\]*%ymm\[0-9\]+(?:\n|\[ \\t\]+#)" 1 } } */ -/* { dg-final { scan-assembler-times "vpternlogq\[ \\t\]+\[^\{\n\]*%xmm\[0-9\]+(?:\n|\[ \\t\]+#)" 1 } } */ /* { dg-final { scan-assembler-times "vpternlogq\[ \\t\]+\[^\{\n\]*%ymm\[0-9\]+\{%k\[1-7\]\}(?:\n|\[ \\t\]+#)" 1 } } */ /* { dg-final { scan-assembler-times "vpternlogq\[ \\t\]+\[^\{\n\]*%xmm\[0-9\]+\{%k\[1-7\]\}(?:\n|\[ \\t\]+#)" 1 } } */ /* { dg-final { scan-assembler-times "vpternlogq\[ \\t\]+\[^\{\n\]*%ymm\[0-9\]+\{%k\[1-7\]\}\{z\}(?:\n|\[ \\t\]+#)" 1 } } */ diff --git a/gcc/testsuite/gcc.target/i386/pr100711-3.c b/gcc/testsuite/gcc.target/i386/pr100711-3.c index 98cc1c3..ea60190 100644 --- a/gcc/testsuite/gcc.target/i386/pr100711-3.c +++ b/gcc/testsuite/gcc.target/i386/pr100711-3.c @@ -39,4 +39,4 @@ v8di foo_v8di (long long a, v8di b) /* { dg-final { scan-assembler-times "vpandn" 4 { target { ! ia32 } } } } */ /* { dg-final { scan-assembler-times "vpandn" 2 { target { ia32 } } } } */ -/* { dg-final { scan-assembler-times "vpternlog\[dq\]\[ \\t\]+\\\$0x44" 2 { target { ia32 } } } } */ +/* { dg-final { scan-assembler-times "vpternlog\[dq\]\[ \\t\]+\\\$80" 2 { target { ia32 } } } } */ diff --git a/gcc/testsuite/gcc.target/i386/pr100711-4.c b/gcc/testsuite/gcc.target/i386/pr100711-4.c index 3ca524f..a33f0a1 100644 --- a/gcc/testsuite/gcc.target/i386/pr100711-4.c +++ b/gcc/testsuite/gcc.target/i386/pr100711-4.c @@ -37,6 +37,6 @@ v8di foo_v8di (long long a, v8di b) return (__extension__ (v8di) {~a, ~a, ~a, ~a, ~a, ~a, ~a, ~a}) | b; } -/* { dg-final { scan-assembler-times "vpternlog\[dq\]\[ \\t\]+\\\$0xbb" 4 { target { ! ia32 } } } } */ -/* { dg-final { scan-assembler-times "vpternlog\[dq\]\[ \\t\]+\\\$0xbb" 2 { target { ia32 } } } } */ -/* { dg-final { scan-assembler-times "vpternlog\[dq\]\[ \\t\]+\\\$0xdd" 2 { target { ia32 } } } } */ +/* { dg-final { scan-assembler-times "vpternlog\[dq\]\[ \\t\]+\\\$207" 4 { target { ! ia32 } } } } */ +/* { dg-final { scan-assembler-times "vpternlog\[dq\]\[ \\t\]+\\\$207" 2 { target { ia32 } } } } */ +/* { dg-final { scan-assembler-times "vpternlog\[dq\]\[ \\t\]+\\\$245" 2 { target { ia32 } } } } */ diff --git a/gcc/testsuite/gcc.target/i386/pr100711-5.c b/gcc/testsuite/gcc.target/i386/pr100711-5.c index 161fbfc..99cafc1 100644 --- a/gcc/testsuite/gcc.target/i386/pr100711-5.c +++ b/gcc/testsuite/gcc.target/i386/pr100711-5.c @@ -37,4 +37,4 @@ v8di foo_v8di (long long a, v8di b) return (__extension__ (v8di) {~a, ~a, ~a, ~a, ~a, ~a, ~a, ~a}) ^ b; } -/* { dg-final { scan-assembler-times "vpternlog\[dq\]\[ \\t\]+\\\$0x99" 4 } } */ +/* { dg-final { scan-assembler-times "vpternlog\[dq\]\[ \\t\]+\\\$1\[69\]5" 4 } } */