From patchwork Mon Mar 22 13:16:34 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "H.J. Lu" X-Patchwork-Id: 1456549 Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@bilbo.ozlabs.org Authentication-Results: ozlabs.org; spf=pass (sender SPF authorized) smtp.mailfrom=gcc.gnu.org (client-ip=2620:52:3:1:0:246e:9693:128c; helo=sourceware.org; envelope-from=gcc-patches-bounces@gcc.gnu.org; receiver=) Authentication-Results: ozlabs.org; dkim=pass (1024-bit key; unprotected) header.d=gcc.gnu.org header.i=@gcc.gnu.org header.a=rsa-sha256 header.s=default header.b=CliN7++A; dkim-atps=neutral Received: from sourceware.org (server2.sourceware.org [IPv6:2620:52:3:1:0:246e:9693:128c]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256) (No client certificate requested) by ozlabs.org (Postfix) with ESMTPS id 4F3w575HJ1z9sjB for ; Tue, 23 Mar 2021 00:16:59 +1100 (AEDT) Received: from server2.sourceware.org (localhost [IPv6:::1]) by sourceware.org (Postfix) with ESMTP id A4623385480F; Mon, 22 Mar 2021 13:16:48 +0000 (GMT) DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org A4623385480F DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gcc.gnu.org; s=default; t=1616419008; bh=ZlaqGTyM1tffUwGVAYiaAC6yrj3SKxru3iGQC6O1Lv4=; h=To:Subject:Date:In-Reply-To:References:List-Id:List-Unsubscribe: List-Archive:List-Post:List-Help:List-Subscribe:From:Reply-To:Cc: From; b=CliN7++A2xP8ddJ165qoqAweTjczwll/b51U/ShP+v4/q1wEKTxzBgsstqjpimnF7 DmL2SjU743Rwt0H2Q0rpPBtmKoVK+z0yahTINIyR23uDx4LnK+fNkMwYq0drp/VOU9 AOVoIByDS83LCDfcvBSISFIOM+nNzB1SKIlYYuOg= X-Original-To: gcc-patches@gcc.gnu.org Delivered-To: gcc-patches@gcc.gnu.org Received: from mail-pj1-x1031.google.com (mail-pj1-x1031.google.com [IPv6:2607:f8b0:4864:20::1031]) by sourceware.org (Postfix) with ESMTPS id A1B733858004 for ; Mon, 22 
Mar 2021 13:16:41 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.3.2 sourceware.org A1B733858004 Received: by mail-pj1-x1031.google.com with SMTP id q6-20020a17090a4306b02900c42a012202so8500856pjg.5 for ; Mon, 22 Mar 2021 06:16:41 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=ZlaqGTyM1tffUwGVAYiaAC6yrj3SKxru3iGQC6O1Lv4=; b=Z66GHwujmBmxmssDnudTrJ2i0gsU4c4LYL3opSMuSBRZxcAyjHpscSAUK0qVP5JNXS VYuYEuKMokIRcxqd/Istc6J9N88fvJet+zofBKutTcxxccSav6/Am5qqb2C9hmUbYg8S lqM6yYz8X4oyX5f8d5q1gAz1W51r2zgwfhyn7ahRi8JVthdc6VIjxUHQBB0mPzUcZSUq reO3hY2XmZHPP/4tltxdrmMgBKrh/Gto0hCYXo+fkbqAYjeLL6BzsQNIvfyjwJTz3yi+ U5XIVUCpNi8lRD909QgGQrX8B/0wbtffYlNOHeg0RCjwjLeHN5TqQ2WZz+3QDDIdSJAz Qj2A== X-Gm-Message-State: AOAM530aA38IMIOJpl43yWPPRhqXdgg2r7B0L+oZYCU0ng5bQ8CMQlwK QyxVWPTWi3jhtLVYxIMIZ2FiZzpGRuo= X-Google-Smtp-Source: ABdhPJyV3yWfTFMDbd1kV3TbrFTwkxGOJbron4G9iO3jGmBWmr8unMEUfncty/+fVCBKbvlv4KD/nw== X-Received: by 2002:a17:902:8218:b029:e6:190e:48e with SMTP id x24-20020a1709028218b02900e6190e048emr27448743pln.33.1616419000110; Mon, 22 Mar 2021 06:16:40 -0700 (PDT) Received: from gnu-cfl-2.localdomain ([172.56.38.37]) by smtp.gmail.com with ESMTPSA id na8sm13593653pjb.2.2021.03.22.06.16.37 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 22 Mar 2021 06:16:37 -0700 (PDT) Received: from gnu-cfl-2.?040none?041 (localhost [IPv6:::1]) by gnu-cfl-2.localdomain (Postfix) with ESMTP id 5E9F41A08EA; Mon, 22 Mar 2021 06:16:36 -0700 (PDT) To: gcc-patches@gcc.gnu.org Subject: [PATCH 1/3] x86: Update memcpy/memset inline strategies for Ice Lake Date: Mon, 22 Mar 2021 06:16:34 -0700 Message-Id: <20210322131636.58461-2-hjl.tools@gmail.com> X-Mailer: git-send-email 2.30.2 In-Reply-To: <20210322131636.58461-1-hjl.tools@gmail.com> References: <20210322131636.58461-1-hjl.tools@gmail.com> MIME-Version: 1.0 
X-Spam-Status: No, score=-3036.3 required=5.0 tests=BAYES_00, DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, FREEMAIL_FROM, GIT_PATCH_0, RCVD_IN_DNSWL_NONE, SPF_HELO_NONE, SPF_PASS, TXREP autolearn=ham autolearn_force=no version=3.4.2
X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on server2.sourceware.org
X-BeenThere: gcc-patches@gcc.gnu.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Gcc-patches mailing list
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-Patchwork-Original-From: "H.J. Lu via Gcc-patches"
From: "H.J. Lu"
Reply-To: "H.J. Lu"
Cc: Jan Hubicka, Hongtao Liu, Hongyu Wang
Errors-To: gcc-patches-bounces@gcc.gnu.org
Sender: "Gcc-patches"

Simplify memcpy and memset inline strategies to avoid branches for
-mtune=icelake:

1. With MOVE_RATIO and CLEAR_RATIO == 17, GCC will use integer/vector
   load and store for up to 16 * 16 (256) bytes when the data size is
   fixed and known.
2. Inline only if data size is known to be <= 256.
   a. Use "rep movsb/stosb" with a simple code sequence if the data size
      is a constant.
   b. Use a loop if the data size is not a constant.
3. Use the memcpy/memset library function if the data size is unknown
   or > 256.

On an Ice Lake processor with -march=native -Ofast -flto,

1. Performance impacts of SPEC CPU 2017 rate are:

500.perlbench_r     -0.93%
502.gcc_r            0.36%
505.mcf_r            0.31%
520.omnetpp_r       -0.07%
523.xalancbmk_r     -0.53%
525.x264_r          -0.09%
531.deepsjeng_r     -0.19%
541.leela_r          0.16%
548.exchange2_r      0.22%
557.xz_r            -1.64%
Geomean             -0.24%

503.bwaves_r        -0.01%
507.cactuBSSN_r      0.00%
508.namd_r           0.12%
510.parest_r         0.07%
511.povray_r         0.29%
519.lbm_r            0.00%
521.wrf_r           -0.38%
526.blender_r        0.16%
527.cam4_r           0.18%
538.imagick_r        0.76%
544.nab_r           -0.84%
549.fotonik3d_r     -0.07%
554.roms_r          -0.01%
Geomean              0.02%

2. Significant impacts on eembc benchmarks are:

eembc/nnet_test         9.90%
eembc/mp2decoddata2    16.42%
eembc/textv2data3      -4.86%
eembc/qos              12.90%

gcc/

	* config/i386/i386-expand.c (expand_set_or_cpymem_via_rep):
	For TARGET_PREFER_KNOWN_REP_MOVSB_STOSB, don't convert QImode
	to SImode.
	(decide_alg): For TARGET_PREFER_KNOWN_REP_MOVSB_STOSB, use
	"rep movsb/stosb" only for known sizes.
	* config/i386/i386-options.c (processor_cost_table): Use Ice
	Lake cost for Cannon Lake, Ice Lake, Tiger Lake, Sapphire
	Rapids and Alder Lake.
	* config/i386/i386.h (TARGET_PREFER_KNOWN_REP_MOVSB_STOSB): New.
	* config/i386/x86-tune-costs.h (icelake_memcpy): New.
	(icelake_memset): Likewise.
	(icelake_cost): Likewise.
	* config/i386/x86-tune.def (X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB):
	New.

gcc/testsuite/

	* gcc.target/i386/memcpy-strategy-5.c: New test.
	* gcc.target/i386/memcpy-strategy-6.c: Likewise.
	* gcc.target/i386/memcpy-strategy-7.c: Likewise.
	* gcc.target/i386/memcpy-strategy-8.c: Likewise.
	* gcc.target/i386/memset-strategy-3.c: Likewise.
	* gcc.target/i386/memset-strategy-4.c: Likewise.
	* gcc.target/i386/memset-strategy-5.c: Likewise.
	* gcc.target/i386/memset-strategy-6.c: Likewise.
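
For readers who want to see how the new strategy table and the known-size filter interact, here is a minimal stand-alone sketch. It is a simplified model, not the code in decide_alg: the names sketch_alg, sketch_entry, sketch_table and sketch_pick are hypothetical, the noalign flag and user-specified strategies are ignored, and the unknown-size case is represented by passing an estimated size together with known_size_p == false. Only the three table rows (256 rep_prefix_1_byte, 256 loop, -1 libcall) and the "skip rep_prefix_1_byte when the size is not known" rule are taken from this patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the algorithms named in icelake_memcpy.  */
enum sketch_alg { SK_LIBCALL, SK_REP_PREFIX_1_BYTE, SK_LOOP };

/* One row of the strategy table: use ALG for sizes up to MAX bytes,
   where MAX == -1 means "no upper bound".  */
struct sketch_entry
{
  long max;
  enum sketch_alg alg;
};

/* Same rows as icelake_memcpy/icelake_memset in this patch.  */
static const struct sketch_entry sketch_table[] = {
  { 256, SK_REP_PREFIX_1_BYTE },
  { 256, SK_LOOP },
  { -1, SK_LIBCALL },
};

/* Walk the table in order and return the first acceptable algorithm.
   PREFER_KNOWN_REP models TARGET_PREFER_KNOWN_REP_MOVSB_STOSB.  */
static enum sketch_alg
sketch_pick (long expected_size, bool known_size_p, bool prefer_known_rep)
{
  assert (expected_size >= 0);

  for (size_t i = 0; i < sizeof sketch_table / sizeof sketch_table[0]; i++)
    {
      const struct sketch_entry *e = &sketch_table[i];

      /* Row does not cover this size.  */
      if (e->max != -1 && e->max < expected_size)
        continue;

      /* The filter added by this patch: "rep movsb/stosb" only for
         known sizes.  */
      if (prefer_known_rep
          && e->alg == SK_REP_PREFIX_1_BYTE
          && !known_size_p)
        continue;

      return e->alg;
    }
  return SK_LIBCALL;
}
```

Under this model a known size of 256 selects the 1-byte rep prefix, a known size of 257 falls through to the library call, and an unknown size with the tuning flag set skips the first row and lands on the loop.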
--- gcc/config/i386/i386-expand.c | 11 +- gcc/config/i386/i386-options.c | 12 +- gcc/config/i386/i386.h | 2 + gcc/config/i386/x86-tune-costs.h | 127 ++++++++++++++++++ gcc/config/i386/x86-tune.def | 7 + .../gcc.target/i386/memcpy-strategy-5.c | 11 ++ .../gcc.target/i386/memcpy-strategy-6.c | 18 +++ .../gcc.target/i386/memcpy-strategy-7.c | 9 ++ .../gcc.target/i386/memcpy-strategy-8.c | 18 +++ .../gcc.target/i386/memset-strategy-3.c | 17 +++ .../gcc.target/i386/memset-strategy-4.c | 17 +++ .../gcc.target/i386/memset-strategy-5.c | 11 ++ .../gcc.target/i386/memset-strategy-6.c | 9 ++ 13 files changed, 260 insertions(+), 9 deletions(-) create mode 100644 gcc/testsuite/gcc.target/i386/memcpy-strategy-5.c create mode 100644 gcc/testsuite/gcc.target/i386/memcpy-strategy-6.c create mode 100644 gcc/testsuite/gcc.target/i386/memcpy-strategy-7.c create mode 100644 gcc/testsuite/gcc.target/i386/memcpy-strategy-8.c create mode 100644 gcc/testsuite/gcc.target/i386/memset-strategy-3.c create mode 100644 gcc/testsuite/gcc.target/i386/memset-strategy-4.c create mode 100644 gcc/testsuite/gcc.target/i386/memset-strategy-5.c create mode 100644 gcc/testsuite/gcc.target/i386/memset-strategy-6.c diff --git a/gcc/config/i386/i386-expand.c b/gcc/config/i386/i386-expand.c index ac69eed4d32..00efe090d97 100644 --- a/gcc/config/i386/i386-expand.c +++ b/gcc/config/i386/i386-expand.c @@ -5976,6 +5976,7 @@ expand_set_or_cpymem_via_rep (rtx destmem, rtx srcmem, /* If possible, it is shorter to use rep movs. TODO: Maybe it is better to move this logic to decide_alg. 
*/ if (mode == QImode && CONST_INT_P (count) && !(INTVAL (count) & 3) + && !TARGET_PREFER_KNOWN_REP_MOVSB_STOSB && (!issetmem || orig_value == const0_rtx)) mode = SImode; @@ -6890,6 +6891,7 @@ decide_alg (HOST_WIDE_INT count, HOST_WIDE_INT expected_size, const struct processor_costs *cost; int i; bool any_alg_usable_p = false; + bool known_size_p = expected_size != -1; *noalign = false; *dynamic_check = -1; @@ -6899,7 +6901,7 @@ decide_alg (HOST_WIDE_INT count, HOST_WIDE_INT expected_size, if (optimize_function_for_size_p (cfun) || (optimize_insn_for_size_p () && (max_size < 256 - || (expected_size != -1 && expected_size < 256)))) + || (known_size_p && expected_size < 256)))) optimize_for_speed = false; else optimize_for_speed = true; @@ -6925,7 +6927,7 @@ decide_alg (HOST_WIDE_INT count, HOST_WIDE_INT expected_size, so inline version is a win, set expected size into the range. */ if (((max > 1 && (unsigned HOST_WIDE_INT) max >= max_size) || max == -1) - && expected_size == -1) + && !known_size_p) expected_size = min_size / 2 + max_size / 2; /* If user specified the algorithm, honor it if possible. 
*/ @@ -6984,7 +6986,10 @@ decide_alg (HOST_WIDE_INT count, HOST_WIDE_INT expected_size, else if (!any_alg_usable_p) break; } - else if (alg_usable_p (candidate, memset, have_as)) + else if (alg_usable_p (candidate, memset, have_as) + && !(TARGET_PREFER_KNOWN_REP_MOVSB_STOSB + && candidate == rep_prefix_1_byte + && !known_size_p)) { *noalign = algs->size[i].noalign; return candidate; diff --git a/gcc/config/i386/i386-options.c b/gcc/config/i386/i386-options.c index b653527d266..bd52ce6ffec 100644 --- a/gcc/config/i386/i386-options.c +++ b/gcc/config/i386/i386-options.c @@ -721,14 +721,14 @@ static const struct processor_costs *processor_cost_table[] = &slm_cost, &skylake_cost, &skylake_cost, + &icelake_cost, + &icelake_cost, + &icelake_cost, &skylake_cost, + &icelake_cost, &skylake_cost, - &skylake_cost, - &skylake_cost, - &skylake_cost, - &skylake_cost, - &skylake_cost, - &skylake_cost, + &icelake_cost, + &icelake_cost, &intel_cost, &geode_cost, &k6_cost, diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h index 058c1cc25b2..b4001d21b70 100644 --- a/gcc/config/i386/i386.h +++ b/gcc/config/i386/i386.h @@ -523,6 +523,8 @@ extern unsigned char ix86_tune_features[X86_TUNE_LAST]; #define TARGET_PROMOTE_QImode ix86_tune_features[X86_TUNE_PROMOTE_QIMODE] #define TARGET_FAST_PREFIX ix86_tune_features[X86_TUNE_FAST_PREFIX] #define TARGET_SINGLE_STRINGOP ix86_tune_features[X86_TUNE_SINGLE_STRINGOP] +#define TARGET_PREFER_KNOWN_REP_MOVSB_STOSB \ + ix86_tune_features[X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB] #define TARGET_MISALIGNED_MOVE_STRING_PRO_EPILOGUES \ ix86_tune_features[X86_TUNE_MISALIGNED_MOVE_STRING_PRO_EPILOGUES] #define TARGET_QIMODE_MATH ix86_tune_features[X86_TUNE_QIMODE_MATH] diff --git a/gcc/config/i386/x86-tune-costs.h b/gcc/config/i386/x86-tune-costs.h index 58b3b81985b..0e00ff99df3 100644 --- a/gcc/config/i386/x86-tune-costs.h +++ b/gcc/config/i386/x86-tune-costs.h @@ -1936,6 +1936,133 @@ struct processor_costs skylake_cost = { "0:0:8", /* Label 
alignment. */ "16", /* Func alignment. */ }; + +/* icelake_cost should produce code tuned for Icelake family of CPUs. + NB: rep_prefix_1_byte is used only for known size. */ + +static stringop_algs icelake_memcpy[2] = { + {libcall, + {{256, rep_prefix_1_byte, true}, + {256, loop, false}, + {-1, libcall, false}}}, + {libcall, + {{256, rep_prefix_1_byte, true}, + {256, loop, false}, + {-1, libcall, false}}}}; + +static stringop_algs icelake_memset[2] = { + {libcall, + {{256, rep_prefix_1_byte, true}, + {256, loop, false}, + {-1, libcall, false}}}, + {libcall, + {{256, rep_prefix_1_byte, true}, + {256, loop, false}, + {-1, libcall, false}}}}; + +static const +struct processor_costs icelake_cost = { + { + /* Start of register allocator costs. integer->integer move cost is 2. */ + 6, /* cost for loading QImode using movzbl */ + {4, 4, 4}, /* cost of loading integer registers + in QImode, HImode and SImode. + Relative to reg-reg move (2). */ + {6, 6, 6}, /* cost of storing integer registers */ + 2, /* cost of reg,reg fld/fst */ + {6, 6, 8}, /* cost of loading fp registers + in SFmode, DFmode and XFmode */ + {6, 6, 10}, /* cost of storing fp registers + in SFmode, DFmode and XFmode */ + 2, /* cost of moving MMX register */ + {6, 6}, /* cost of loading MMX registers + in SImode and DImode */ + {6, 6}, /* cost of storing MMX registers + in SImode and DImode */ + 2, 2, 4, /* cost of moving XMM,YMM,ZMM register */ + {6, 6, 6, 10, 20}, /* cost of loading SSE registers + in 32,64,128,256 and 512-bit */ + {8, 8, 8, 12, 24}, /* cost of storing SSE registers + in 32,64,128,256 and 512-bit */ + 6, 6, /* SSE->integer and integer->SSE moves */ + 5, 5, /* mask->integer and integer->mask moves */ + {8, 8, 8}, /* cost of loading mask register + in QImode, HImode, SImode. */ + {6, 6, 6}, /* cost if storing mask register + in QImode, HImode, SImode. */ + 3, /* cost of moving mask register. */ + /* End of register allocator costs. 
*/ + }, + + COSTS_N_INSNS (1), /* cost of an add instruction */ + COSTS_N_INSNS (1)+1, /* cost of a lea instruction */ + COSTS_N_INSNS (1), /* variable shift costs */ + COSTS_N_INSNS (1), /* constant shift costs */ + {COSTS_N_INSNS (3), /* cost of starting multiply for QI */ + COSTS_N_INSNS (4), /* HI */ + COSTS_N_INSNS (3), /* SI */ + COSTS_N_INSNS (3), /* DI */ + COSTS_N_INSNS (3)}, /* other */ + 0, /* cost of multiply per each bit set */ + /* Expanding div/mod currently doesn't consider parallelism. So the cost + model is not realistic. We compensate by increasing the latencies a bit. */ + {COSTS_N_INSNS (11), /* cost of a divide/mod for QI */ + COSTS_N_INSNS (11), /* HI */ + COSTS_N_INSNS (14), /* SI */ + COSTS_N_INSNS (76), /* DI */ + COSTS_N_INSNS (76)}, /* other */ + COSTS_N_INSNS (1), /* cost of movsx */ + COSTS_N_INSNS (0), /* cost of movzx */ + 8, /* "large" insn */ + 17, /* MOVE_RATIO */ + 17, /* CLEAR_RATIO */ + {4, 4, 4}, /* cost of loading integer registers + in QImode, HImode and SImode. + Relative to reg-reg move (2). */ + {6, 6, 6}, /* cost of storing integer registers */ + {6, 6, 6, 10, 20}, /* cost of loading SSE register + in 32bit, 64bit, 128bit, 256bit and 512bit */ + {8, 8, 8, 12, 24}, /* cost of storing SSE register + in 32bit, 64bit, 128bit, 256bit and 512bit */ + {6, 6, 6, 10, 20}, /* cost of unaligned loads. */ + {8, 8, 8, 8, 16}, /* cost of unaligned stores. */ + 2, 2, 4, /* cost of moving XMM,YMM,ZMM register */ + 6, /* cost of moving SSE register to integer. */ + 20, 8, /* Gather load static, per_elt. */ + 22, 10, /* Gather store static, per_elt. */ + 64, /* size of l1 cache. */ + 512, /* size of l2 cache. */ + 64, /* size of prefetch block */ + 6, /* number of parallel prefetches */ + 3, /* Branch cost */ + COSTS_N_INSNS (3), /* cost of FADD and FSUB insns. */ + COSTS_N_INSNS (4), /* cost of FMUL instruction. */ + COSTS_N_INSNS (20), /* cost of FDIV instruction. */ + COSTS_N_INSNS (1), /* cost of FABS instruction. 
*/ + COSTS_N_INSNS (1), /* cost of FCHS instruction. */ + COSTS_N_INSNS (20), /* cost of FSQRT instruction. */ + + COSTS_N_INSNS (1), /* cost of cheap SSE instruction. */ + COSTS_N_INSNS (4), /* cost of ADDSS/SD SUBSS/SD insns. */ + COSTS_N_INSNS (4), /* cost of MULSS instruction. */ + COSTS_N_INSNS (4), /* cost of MULSD instruction. */ + COSTS_N_INSNS (4), /* cost of FMA SS instruction. */ + COSTS_N_INSNS (4), /* cost of FMA SD instruction. */ + COSTS_N_INSNS (11), /* cost of DIVSS instruction. */ + COSTS_N_INSNS (14), /* cost of DIVSD instruction. */ + COSTS_N_INSNS (12), /* cost of SQRTSS instruction. */ + COSTS_N_INSNS (18), /* cost of SQRTSD instruction. */ + 1, 4, 2, 2, /* reassoc int, fp, vec_int, vec_fp. */ + icelake_memcpy, + icelake_memset, + COSTS_N_INSNS (3), /* cond_taken_branch_cost. */ + COSTS_N_INSNS (1), /* cond_not_taken_branch_cost. */ + "16:11:8", /* Loop alignment. */ + "16:11:8", /* Jump alignment. */ + "0:0:8", /* Label alignment. */ + "16", /* Func alignment. */ +}; + /* BTVER1 has optimized REP instruction for medium sized blocks, but for very small blocks it is better to use loop. For large blocks, libcall can do nontemporary accesses and beat inline considerably. */ diff --git a/gcc/config/i386/x86-tune.def b/gcc/config/i386/x86-tune.def index caebf76736e..134916cc972 100644 --- a/gcc/config/i386/x86-tune.def +++ b/gcc/config/i386/x86-tune.def @@ -269,6 +269,13 @@ DEF_TUNE (X86_TUNE_AVOID_MEM_OPND_FOR_CMOVE, "avoid_mem_opnd_for_cmove", as MOVS and STOS (without a REP prefix) to move/set sequences of bytes. */ DEF_TUNE (X86_TUNE_SINGLE_STRINGOP, "single_stringop", m_386 | m_P4_NOCONA) +/* X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB: Enable use of REP MOVSB/STOSB to + move/set sequences of bytes with known size. 
*/ +DEF_TUNE (X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB, + "prefer_known_rep_movsb_stosb", + m_CANNONLAKE | m_ICELAKE_CLIENT | m_ICELAKE_SERVER | m_TIGERLAKE + | m_ALDERLAKE | m_SAPPHIRERAPIDS) + /* X86_TUNE_MISALIGNED_MOVE_STRING_PRO_EPILOGUES: Enable generation of compact prologues and epilogues by issuing a misaligned moves. This requires target to handle misaligned moves and partial memory stalls diff --git a/gcc/testsuite/gcc.target/i386/memcpy-strategy-5.c b/gcc/testsuite/gcc.target/i386/memcpy-strategy-5.c new file mode 100644 index 00000000000..83c333b551d --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/memcpy-strategy-5.c @@ -0,0 +1,11 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -march=tigerlake -mno-sse" } */ +/* { dg-final { scan-assembler "jmp\tmemcpy" { target { ! ia32 } } } } */ +/* { dg-final { scan-assembler "call\tmemcpy" { target ia32 } } } */ +/* { dg-final { scan-assembler-not "rep movsb" } } */ + +void +foo (char *dest, char *src) +{ + __builtin_memcpy (dest, src, 257); +} diff --git a/gcc/testsuite/gcc.target/i386/memcpy-strategy-6.c b/gcc/testsuite/gcc.target/i386/memcpy-strategy-6.c new file mode 100644 index 00000000000..ed963dec853 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/memcpy-strategy-6.c @@ -0,0 +1,18 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -mtune=generic" } */ +/* { dg-final { scan-assembler-not "jmp\tmemcpy" { target { ! 
ia32 } } } } */ +/* { dg-final { scan-assembler-not "call\tmemcpy" { target ia32 } } } */ +/* { dg-final { scan-assembler-not "rep movsb" } } */ + +typedef unsigned char e_u8; + +#define MAXBC 8 + +void MixColumn(e_u8 a[4][MAXBC], e_u8 BC) +{ + e_u8 b[4][MAXBC]; + int i, j; + + for(i = 0; i < 4; i++) + for(j = 0; j < BC; j++) a[i][j] = b[i][j]; +} diff --git a/gcc/testsuite/gcc.target/i386/memcpy-strategy-7.c b/gcc/testsuite/gcc.target/i386/memcpy-strategy-7.c new file mode 100644 index 00000000000..be66d6b8426 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/memcpy-strategy-7.c @@ -0,0 +1,9 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -march=tigerlake -mno-sse" } */ +/* { dg-final { scan-assembler "rep movsb" } } */ + +void +foo (char *dest, char *src) +{ + __builtin_memcpy (dest, src, 256); +} diff --git a/gcc/testsuite/gcc.target/i386/memcpy-strategy-8.c b/gcc/testsuite/gcc.target/i386/memcpy-strategy-8.c new file mode 100644 index 00000000000..e8fe0a66c98 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/memcpy-strategy-8.c @@ -0,0 +1,18 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -march=tigerlake" } */ +/* { dg-final { scan-assembler-not "jmp\tmemcpy" { target { ! ia32 } } } } */ +/* { dg-final { scan-assembler-not "call\tmemcpy" { target ia32 } } } */ +/* { dg-final { scan-assembler-not "rep movsb" } } */ + +typedef unsigned char e_u8; + +#define MAXBC 8 + +void MixColumn(e_u8 a[4][MAXBC], e_u8 BC) +{ + e_u8 b[4][MAXBC]; + int i, j; + + for(i = 0; i < 4; i++) + for(j = 0; j < BC; j++) a[i][j] = b[i][j]; +} diff --git a/gcc/testsuite/gcc.target/i386/memset-strategy-3.c b/gcc/testsuite/gcc.target/i386/memset-strategy-3.c new file mode 100644 index 00000000000..9ea1e1ae7c2 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/memset-strategy-3.c @@ -0,0 +1,17 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -mtune=generic" } */ +/* { dg-final { scan-assembler-not "jmp\tmemset" { target { ! 
ia32 } } } } */ +/* { dg-final { scan-assembler-not "call\tmemset" { target ia32 } } } */ +/* { dg-final { scan-assembler-not "rep stosb" } } */ + +typedef unsigned char e_u8; + +#define MAXBC 8 + +void MixColumn(e_u8 a[4][MAXBC], e_u8 BC) +{ + int i, j; + + for(i = 0; i < 4; i++) + for(j = 0; j < BC; j++) a[i][j] = 1; +} diff --git a/gcc/testsuite/gcc.target/i386/memset-strategy-4.c b/gcc/testsuite/gcc.target/i386/memset-strategy-4.c new file mode 100644 index 00000000000..00d82f13ff8 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/memset-strategy-4.c @@ -0,0 +1,17 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -march=tigerlake" } */ +/* { dg-final { scan-assembler-not "jmp\tmemset" { target { ! ia32 } } } } */ +/* { dg-final { scan-assembler-not "call\tmemset" { target ia32 } } } */ +/* { dg-final { scan-assembler-not "rep stosb" } } */ + +typedef unsigned char e_u8; + +#define MAXBC 8 + +void MixColumn(e_u8 a[4][MAXBC], e_u8 BC) +{ + int i, j; + + for(i = 0; i < 4; i++) + for(j = 0; j < BC; j++) a[i][j] = 1; +} diff --git a/gcc/testsuite/gcc.target/i386/memset-strategy-5.c b/gcc/testsuite/gcc.target/i386/memset-strategy-5.c new file mode 100644 index 00000000000..dc1de8e79c2 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/memset-strategy-5.c @@ -0,0 +1,11 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -march=tigerlake -mno-sse" } */ +/* { dg-final { scan-assembler "jmp\tmemset" { target { ! 
ia32 } } } } */ +/* { dg-final { scan-assembler "call\tmemset" { target ia32 } } } */ +/* { dg-final { scan-assembler-not "rep stosb" } } */ + +void +foo (char *dest) +{ + __builtin_memset (dest, 0, 257); +} diff --git a/gcc/testsuite/gcc.target/i386/memset-strategy-6.c b/gcc/testsuite/gcc.target/i386/memset-strategy-6.c new file mode 100644 index 00000000000..e51af3b730f --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/memset-strategy-6.c @@ -0,0 +1,9 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -march=tigerlake -mno-sse" } */ +/* { dg-final { scan-assembler "rep stosb" } } */ + +void +foo (char *dest) +{ + __builtin_memset (dest, 0, 256); +} From patchwork Mon Mar 22 13:16:35 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "H.J. Lu" X-Patchwork-Id: 1456547 Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@bilbo.ozlabs.org Authentication-Results: ozlabs.org; spf=pass (sender SPF authorized) smtp.mailfrom=gcc.gnu.org (client-ip=2620:52:3:1:0:246e:9693:128c; helo=sourceware.org; envelope-from=gcc-patches-bounces@gcc.gnu.org; receiver=) Authentication-Results: ozlabs.org; dkim=pass (1024-bit key; unprotected) header.d=gcc.gnu.org header.i=@gcc.gnu.org header.a=rsa-sha256 header.s=default header.b=KF8HT7hJ; dkim-atps=neutral Received: from sourceware.org (server2.sourceware.org [IPv6:2620:52:3:1:0:246e:9693:128c]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256) (No client certificate requested) by ozlabs.org (Postfix) with ESMTPS id 4F3w4z3b6fz9sjB for ; Tue, 23 Mar 2021 00:16:51 +1100 (AEDT) Received: from server2.sourceware.org (localhost [IPv6:::1]) by sourceware.org (Postfix) with ESMTP id 5267D385801D; Mon, 22 Mar 2021 13:16:45 +0000 (GMT) DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 5267D385801D DKIM-Signature: v=1; a=rsa-sha256; 
c=relaxed/relaxed; d=gcc.gnu.org; s=default; t=1616419005; bh=H4fN9BcjNGfwgg4Lx85Bbzzy3xF1I0XSEprpnbrfA38=; h=To:Subject:Date:In-Reply-To:References:List-Id:List-Unsubscribe: List-Archive:List-Post:List-Help:List-Subscribe:From:Reply-To:Cc: From; b=KF8HT7hJKvlGSaoJhR5Xpd0E1a3syKhiHU9Qs4xMLgK5XcdVGCgSqL77VPaX33S6v RZbvqaqZZK33axSYMiGS/bs3YvqQsp3CGiSbML05uALS8RTAe1ZVloYLQrT85hlzit 3iXqRiw7FI+EiUfVhiYrrVpfDlh33S06Ms69KuV8= X-Original-To: gcc-patches@gcc.gnu.org Delivered-To: gcc-patches@gcc.gnu.org Received: from mail-pf1-x42c.google.com (mail-pf1-x42c.google.com [IPv6:2607:f8b0:4864:20::42c]) by sourceware.org (Postfix) with ESMTPS id 8A543385801D for ; Mon, 22 Mar 2021 13:16:40 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.3.2 sourceware.org 8A543385801D Received: by mail-pf1-x42c.google.com with SMTP id c17so4012906pfn.6 for ; Mon, 22 Mar 2021 06:16:40 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=H4fN9BcjNGfwgg4Lx85Bbzzy3xF1I0XSEprpnbrfA38=; b=SK9W3TZ/5PQF2D0nf41znRVrv0K4wEIajUmoZDBn6zOdTutTs3B6xfTePkYv5WDT2y 0Sy1atJ5XEbBJDOcu5KOaeTbgbJIByo2/ie5BH6X5ddPWIWbUmzt9G/VpzTkOsvf79Vr caNj5/v/uEbU9E1LhVnAMV1yfOR3izH9n0YnoeLcB/2g4vw1xlsI7XAWtFSrn+W4bFZP 9ABbE7yWBMMs6i4xripmqUjWnEqbb42olykqfRJpiFfe3aQo9BJcpz8cXRDcY8itCqvc AB1BTYumImcahoYtunTLe4gLr1dluIwwd3h1Dz6Mq65op4VF9LtwZm3sfgoT2SqwhEwY LFIw== X-Gm-Message-State: AOAM530f7PyimWzLI5NVYKlF7uz2itTK+ESRoc50ykiowRp080VqStfW V4uFCYu2xqzgOFzcaq/bR5btpFXqIR4= X-Google-Smtp-Source: ABdhPJxPzxMkekesYNt+rG5ohJ+pj36TBv8dHAV+F+774uQgU/rLP2UQ2Io/EP7Oliho0qI0zRjS8A== X-Received: by 2002:a65:4043:: with SMTP id h3mr22309926pgp.148.1616418998897; Mon, 22 Mar 2021 06:16:38 -0700 (PDT) Received: from gnu-cfl-2.localdomain ([172.56.38.37]) by smtp.gmail.com with ESMTPSA id b24sm12614888pgj.58.2021.03.22.06.16.37 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 
 bits=256/256); Mon, 22 Mar 2021 06:16:37 -0700 (PDT)
Received: from gnu-cfl-2.?040none?041 (localhost [IPv6:::1]) by gnu-cfl-2.localdomain (Postfix) with ESMTP id 6A7F21A09E0; Mon, 22 Mar 2021 06:16:36 -0700 (PDT)
To: gcc-patches@gcc.gnu.org
Subject: [PATCH 2/3] x86: Update memcpy/memset inline strategies for Skylake family CPUs
Date: Mon, 22 Mar 2021 06:16:35 -0700
Message-Id: <20210322131636.58461-3-hjl.tools@gmail.com>
X-Mailer: git-send-email 2.30.2
In-Reply-To: <20210322131636.58461-1-hjl.tools@gmail.com>
References: <20210322131636.58461-1-hjl.tools@gmail.com>
MIME-Version: 1.0
X-Spam-Status: No, score=-3036.3 required=5.0 tests=BAYES_00, DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, FREEMAIL_FROM, GIT_PATCH_0, RCVD_IN_DNSWL_NONE, SPF_HELO_NONE, SPF_PASS, TXREP autolearn=ham autolearn_force=no version=3.4.2
X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on server2.sourceware.org
X-BeenThere: gcc-patches@gcc.gnu.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Gcc-patches mailing list
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-Patchwork-Original-From: "H.J. Lu via Gcc-patches"
From: "H.J. Lu"
Reply-To: "H.J. Lu"
Cc: Jan Hubicka, Hongtao Liu, Hongyu Wang
Errors-To: gcc-patches-bounces@gcc.gnu.org
Sender: "Gcc-patches"

Simplify memcpy and memset inline strategies to avoid branches for
Skylake family CPUs:

1. With MOVE_RATIO and CLEAR_RATIO == 17, GCC will use integer/vector
   load and store for up to 16 * 16 (256) bytes when the data size is
   fixed and known.
2. Inline only if data size is known to be <= 256.
   a. Use "rep movsb/stosb" with a simple code sequence if the data size
      is a constant.
   b. Use a loop if the data size is not a constant.
3. Use the memcpy/memset library function if the data size is unknown
   or > 256.

On a Cascade Lake processor with -march=native -Ofast -flto,

1. Performance impacts of SPEC CPU 2017 rate are:

500.perlbench_r      0.17%
502.gcc_r           -0.36%
505.mcf_r            0.00%
520.omnetpp_r        0.08%
523.xalancbmk_r     -0.62%
525.x264_r           1.04%
531.deepsjeng_r      0.11%
541.leela_r         -1.09%
548.exchange2_r     -0.25%
557.xz_r             0.17%
Geomean             -0.08%

503.bwaves_r         0.00%
507.cactuBSSN_r      0.69%
508.namd_r          -0.07%
510.parest_r         1.12%
511.povray_r         1.82%
519.lbm_r            0.00%
521.wrf_r           -1.32%
526.blender_r       -0.47%
527.cam4_r           0.23%
538.imagick_r       -1.72%
544.nab_r           -0.56%
549.fotonik3d_r      0.12%
554.roms_r           0.43%
Geomean              0.02%

2. Significant impacts on eembc benchmarks are:

eembc/idctrn01       9.23%
eembc/nnet_test     29.26%

gcc/

	* config/i386/x86-tune-costs.h (skylake_memcpy): Updated.
	(skylake_memset): Likewise.
	(skylake_cost): Change CLEAR_RATIO to 17.
	* config/i386/x86-tune.def (X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB):
	Replace m_CANNONLAKE, m_ICELAKE_CLIENT, m_ICELAKE_SERVER,
	m_TIGERLAKE and m_SAPPHIRERAPIDS with m_SKYLAKE and
	m_CORE_AVX512.

gcc/testsuite/

	* gcc.target/i386/memcpy-strategy-9.c: New test.
	* gcc.target/i386/memcpy-strategy-10.c: Likewise.
	* gcc.target/i386/memcpy-strategy-11.c: Likewise.
	* gcc.target/i386/memset-strategy-7.c: Likewise.
	* gcc.target/i386/memset-strategy-8.c: Likewise.
	* gcc.target/i386/memset-strategy-9.c: Likewise.
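
The "16 * 16 (256) bytes" figure in item 1 and the effect of raising CLEAR_RATIO from 6 to 17 can be checked with a little arithmetic. The sketch below assumes the usual reading of MOVE_RATIO/CLEAR_RATIO, namely that a fixed-size copy or clear is expanded by pieces only when it needs fewer than RATIO move instructions; the helper name sketch_by_pieces_limit and the piece sizes passed to it are illustrative assumptions, since the real piece width depends on the target flags in effect.

```c
#include <assert.h>

/* Largest fixed size, in bytes, that fits in fewer than RATIO moves of
   PIECE_BYTES each.  Hypothetical helper for illustration only.  */
static unsigned
sketch_by_pieces_limit (unsigned ratio, unsigned piece_bytes)
{
  assert (ratio > 0);
  return (ratio - 1) * piece_bytes;
}
```

With 16-byte pieces, a ratio of 17 gives 16 * 16 = 256 bytes, matching the commit message, while the previous CLEAR_RATIO of 6 would give 5 * 16 = 80 bytes. If only 8-byte moves were available, a ratio of 17 would cover 128 bytes.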
---
 gcc/config/i386/x86-tune-costs.h              | 27 ++++++++++++-------
 gcc/config/i386/x86-tune.def                  |  3 +--
 .../gcc.target/i386/memcpy-strategy-10.c      | 11 ++++++++
 .../gcc.target/i386/memcpy-strategy-11.c      | 18 +++++++++++++
 .../gcc.target/i386/memcpy-strategy-9.c       |  9 +++++++
 .../gcc.target/i386/memset-strategy-7.c       | 11 ++++++++
 .../gcc.target/i386/memset-strategy-8.c       |  9 +++++++
 .../gcc.target/i386/memset-strategy-9.c       | 17 ++++++++++++
 8 files changed, 93 insertions(+), 12 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/memcpy-strategy-10.c
 create mode 100644 gcc/testsuite/gcc.target/i386/memcpy-strategy-11.c
 create mode 100644 gcc/testsuite/gcc.target/i386/memcpy-strategy-9.c
 create mode 100644 gcc/testsuite/gcc.target/i386/memset-strategy-7.c
 create mode 100644 gcc/testsuite/gcc.target/i386/memset-strategy-8.c
 create mode 100644 gcc/testsuite/gcc.target/i386/memset-strategy-9.c

diff --git a/gcc/config/i386/x86-tune-costs.h b/gcc/config/i386/x86-tune-costs.h
index 0e00ff99df3..ffe810f2bcb 100644
--- a/gcc/config/i386/x86-tune-costs.h
+++ b/gcc/config/i386/x86-tune-costs.h
@@ -1822,17 +1822,24 @@ struct processor_costs znver3_cost = {
 
 /* skylake_cost should produce code tuned for Skylake familly of CPUs.  */
 static stringop_algs skylake_memcpy[2] = {
-  {libcall, {{1024, rep_prefix_4_byte, true}, {-1, libcall, false}}},
-  {libcall, {{16, loop, false}, {512, unrolled_loop, false},
-             {-1, libcall, false}}}};
+  {libcall,
+   {{256, rep_prefix_1_byte, true},
+    {256, loop, false},
+    {-1, libcall, false}}},
+  {libcall,
+   {{256, rep_prefix_1_byte, true},
+    {256, loop, false},
+    {-1, libcall, false}}}};
 
 static stringop_algs skylake_memset[2] = {
-  {libcall, {{6, loop_1_byte, true},
-             {24, loop, true},
-             {8192, rep_prefix_4_byte, true},
-             {-1, libcall, false}}},
-  {libcall, {{24, loop, true}, {512, unrolled_loop, false},
-             {-1, libcall, false}}}};
+  {libcall,
+   {{256, rep_prefix_1_byte, true},
+    {256, loop, false},
+    {-1, libcall, false}}},
+  {libcall,
+   {{256, rep_prefix_1_byte, true},
+    {256, loop, false},
+    {-1, libcall, false}}}};
 
 static const
 struct processor_costs skylake_cost = {
@@ -1889,7 +1896,7 @@ struct processor_costs skylake_cost = {
   COSTS_N_INSNS (0),                    /* cost of movzx */
   8,                                    /* "large" insn */
   17,                                   /* MOVE_RATIO */
-  6,                                    /* CLEAR_RATIO */
+  17,                                   /* CLEAR_RATIO */
   {4, 4, 4},                            /* cost of loading integer registers
                                            in QImode, HImode and SImode.
                                            Relative to reg-reg move (2).  */
diff --git a/gcc/config/i386/x86-tune.def b/gcc/config/i386/x86-tune.def
index 134916cc972..eb057a67750 100644
--- a/gcc/config/i386/x86-tune.def
+++ b/gcc/config/i386/x86-tune.def
@@ -273,8 +273,7 @@ DEF_TUNE (X86_TUNE_SINGLE_STRINGOP, "single_stringop", m_386 | m_P4_NOCONA)
    move/set sequences of bytes with known size.  */
 DEF_TUNE (X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB,
           "prefer_known_rep_movsb_stosb",
-          m_CANNONLAKE | m_ICELAKE_CLIENT | m_ICELAKE_SERVER | m_TIGERLAKE
-          | m_ALDERLAKE | m_SAPPHIRERAPIDS)
+          m_SKYLAKE | m_ALDERLAKE | m_CORE_AVX512)
 
 /* X86_TUNE_MISALIGNED_MOVE_STRING_PRO_EPILOGUES: Enable generation of
    compact prologues and epilogues by issuing a misaligned moves.  This
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-strategy-10.c b/gcc/testsuite/gcc.target/i386/memcpy-strategy-10.c
new file mode 100644
index 00000000000..970aa741971
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-strategy-10.c
@@ -0,0 +1,11 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=skylake -mno-sse" } */
+/* { dg-final { scan-assembler "jmp\tmemcpy" { target { ! ia32 } } } } */
+/* { dg-final { scan-assembler "call\tmemcpy" { target ia32 } } } */
+/* { dg-final { scan-assembler-not "rep movsb" } } */
+
+void
+foo (char *dest, char *src)
+{
+  __builtin_memcpy (dest, src, 257);
+}
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-strategy-11.c b/gcc/testsuite/gcc.target/i386/memcpy-strategy-11.c
new file mode 100644
index 00000000000..b6041944630
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-strategy-11.c
@@ -0,0 +1,18 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=skylake" } */
+/* { dg-final { scan-assembler-not "jmp\tmemcpy" { target { ! ia32 } } } } */
+/* { dg-final { scan-assembler-not "call\tmemcpy" { target ia32 } } } */
+/* { dg-final { scan-assembler-not "rep movsb" } } */
+
+typedef unsigned char e_u8;
+
+#define MAXBC 8
+
+void MixColumn(e_u8 a[4][MAXBC], e_u8 BC)
+{
+  e_u8 b[4][MAXBC];
+  int i, j;
+
+  for(i = 0; i < 4; i++)
+    for(j = 0; j < BC; j++) a[i][j] = b[i][j];
+}
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-strategy-9.c b/gcc/testsuite/gcc.target/i386/memcpy-strategy-9.c
new file mode 100644
index 00000000000..b0dc7484d09
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-strategy-9.c
@@ -0,0 +1,9 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=skylake -mno-sse" } */
+/* { dg-final { scan-assembler "rep movsb" } } */
+
+void
+foo (char *dest, char *src)
+{
+  __builtin_memcpy (dest, src, 256);
+}
diff --git a/gcc/testsuite/gcc.target/i386/memset-strategy-7.c b/gcc/testsuite/gcc.target/i386/memset-strategy-7.c
new file mode 100644
index 00000000000..07c2816910c
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-strategy-7.c
@@ -0,0 +1,11 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=skylake -mno-sse" } */
+/* { dg-final { scan-assembler "jmp\tmemset" { target { ! ia32 } } } } */
+/* { dg-final { scan-assembler "call\tmemset" { target ia32 } } } */
+/* { dg-final { scan-assembler-not "rep stosb" } } */
+
+void
+foo (char *dest)
+{
+  __builtin_memset (dest, 0, 257);
+}
diff --git a/gcc/testsuite/gcc.target/i386/memset-strategy-8.c b/gcc/testsuite/gcc.target/i386/memset-strategy-8.c
new file mode 100644
index 00000000000..52ea882c814
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-strategy-8.c
@@ -0,0 +1,9 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=skylake -mno-sse" } */
+/* { dg-final { scan-assembler "rep stosb" } } */
+
+void
+foo (char *dest)
+{
+  __builtin_memset (dest, 0, 256);
+}
diff --git a/gcc/testsuite/gcc.target/i386/memset-strategy-9.c b/gcc/testsuite/gcc.target/i386/memset-strategy-9.c
new file mode 100644
index 00000000000..d4db031958f
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-strategy-9.c
@@ -0,0 +1,17 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=skylake" } */
+/* { dg-final { scan-assembler-not "jmp\tmemset" { target { ! ia32 } } } } */
+/* { dg-final { scan-assembler-not "call\tmemset" { target ia32 } } } */
+/* { dg-final { scan-assembler-not "rep stosb" } } */
+
+typedef unsigned char e_u8;
+
+#define MAXBC 8
+
+void MixColumn(e_u8 a[4][MAXBC], e_u8 BC)
+{
+  int i, j;
+
+  for(i = 0; i < 4; i++)
+    for(j = 0; j < BC; j++) a[i][j] = 1;
+}

From patchwork Mon Mar 22 13:16:36 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "H.J.
Lu" X-Patchwork-Id: 1456548 Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@bilbo.ozlabs.org Authentication-Results: ozlabs.org; spf=pass (sender SPF authorized) smtp.mailfrom=gcc.gnu.org (client-ip=2620:52:3:1:0:246e:9693:128c; helo=sourceware.org; envelope-from=gcc-patches-bounces@gcc.gnu.org; receiver=) Authentication-Results: ozlabs.org; dkim=pass (1024-bit key; unprotected) header.d=gcc.gnu.org header.i=@gcc.gnu.org header.a=rsa-sha256 header.s=default header.b=sYvYmr5d; dkim-atps=neutral Received: from sourceware.org (server2.sourceware.org [IPv6:2620:52:3:1:0:246e:9693:128c]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256) (No client certificate requested) by ozlabs.org (Postfix) with ESMTPS id 4F3w543QnGz9sjB for ; Tue, 23 Mar 2021 00:16:56 +1100 (AEDT) Received: from server2.sourceware.org (localhost [IPv6:::1]) by sourceware.org (Postfix) with ESMTP id 81CAE385703C; Mon, 22 Mar 2021 13:16:46 +0000 (GMT) DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 81CAE385703C DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gcc.gnu.org; s=default; t=1616419006; bh=DA9FPWIMH3feoYnMmz1rnUVoFZ7H/5FUagMwXYTqqbw=; h=To:Subject:Date:In-Reply-To:References:List-Id:List-Unsubscribe: List-Archive:List-Post:List-Help:List-Subscribe:From:Reply-To:Cc: From; b=sYvYmr5dPEqZSBpouQj/mLM/fDMe9PER6MIJ4a9dlY6yT3aLPn0DEwuZlrzPgYwR6 qMRTf1gUYeFYQyErVf0UhfhSpg3sz9HSVRDgyrGjF3ARGKCw3jIT5xoeb/xRBRj8G/ kBCHTRhMUzXM2L/GQgk82SZrQ5OdISr92mc7r6Bk= X-Original-To: gcc-patches@gcc.gnu.org Delivered-To: gcc-patches@gcc.gnu.org Received: from mail-pf1-x435.google.com (mail-pf1-x435.google.com [IPv6:2607:f8b0:4864:20::435]) by sourceware.org (Postfix) with ESMTPS id F0812385781A for ; Mon, 22 Mar 2021 13:16:40 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.3.2 sourceware.org F0812385781A Received: by mail-pf1-x435.google.com with SMTP id l3so10928636pfc.7 
for ; Mon, 22 Mar 2021 06:16:40 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=DA9FPWIMH3feoYnMmz1rnUVoFZ7H/5FUagMwXYTqqbw=; b=HT+SKZ8E/l7UZ8wZ1lIQK0jfjA7YR/wV3eSF2jC80CHzRhp22m7hZSSdjFY0BLQQ+A N/drYSsuh3r4+c9iy7HmvJJ1mZk8f5He1sfhI7aA5mCwNSPHgQ9sO2sbBpUS4D+MzMfa oyvY9YjrsNGwOVaQmhoyehYzrfeyeYrRXoxxZapvRDkriaYYKjbyr1jCXi4EYYxgaFKq 8FBC8FErxqGqnva2uG51gP95vAkkRidLto2Ql7NvZTz6T6pxnFyoLYIO4x0Ez1WIXTyZ eGiTKUmwusKRbvzDiT+EPPhfefPwAsaOD50NcrL3cu+Kmd3XMP6ZHs9xCHi+cJy7dYeV PFKw== X-Gm-Message-State: AOAM530IcAa3qYFOKP0xZAcd++S+cIQvJxHJeTAopjwcPL13s4TH7KjT Ltxiwrve5wDEs/VwqB/4PO22Ww7mkdY= X-Google-Smtp-Source: ABdhPJwPNF7YaFvkyLkiCQqQw7cJkD1DEEcy7di9WhDkZeC5uy0Kz6rWRWaDb1jlyYemsnAddEA2FA== X-Received: by 2002:a65:4c43:: with SMTP id l3mr22274907pgr.327.1616418999661; Mon, 22 Mar 2021 06:16:39 -0700 (PDT) Received: from gnu-cfl-2.localdomain ([172.56.38.37]) by smtp.gmail.com with ESMTPSA id h68sm3649871pfe.111.2021.03.22.06.16.37 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 22 Mar 2021 06:16:37 -0700 (PDT) Received: from gnu-cfl-2.?040none?041 (localhost [IPv6:::1]) by gnu-cfl-2.localdomain (Postfix) with ESMTP id 75C891A0A33; Mon, 22 Mar 2021 06:16:36 -0700 (PDT) To: gcc-patches@gcc.gnu.org Subject: [PATCH 3/3] x86: Update memcpy/memset inline strategies for -mtune=generic Date: Mon, 22 Mar 2021 06:16:36 -0700 Message-Id: <20210322131636.58461-4-hjl.tools@gmail.com> X-Mailer: git-send-email 2.30.2 In-Reply-To: <20210322131636.58461-1-hjl.tools@gmail.com> References: <20210322131636.58461-1-hjl.tools@gmail.com> MIME-Version: 1.0 X-Spam-Status: No, score=-3036.3 required=5.0 tests=BAYES_00, DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, FREEMAIL_FROM, GIT_PATCH_0, RCVD_IN_DNSWL_NONE, SPF_HELO_NONE, SPF_PASS, TXREP autolearn=ham autolearn_force=no version=3.4.2 
X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on server2.sourceware.org X-BeenThere: gcc-patches@gcc.gnu.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Gcc-patches mailing list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-Patchwork-Original-From: "H.J. Lu via Gcc-patches" From: "H.J. Lu" Reply-To: "H.J. Lu" Cc: Jan Hubicka , Hongtao Liu , Hongyu Wang Errors-To: gcc-patches-bounces@gcc.gnu.org Sender: "Gcc-patches"

Simplify memcpy and memset inline strategies to avoid branches for
-mtune=generic:

1. With MOVE_RATIO and CLEAR_RATIO == 17, GCC will use integer/vector
   load and store for up to 16 * 16 (256) bytes when the data size is
   fixed and known.
2. Inline only if data size is known to be <= 256.
   a. Use "rep movsb/stosb" with simple code sequence if the data size
      is a constant.
   b. Use loop if data size is not a constant.
3. Use memcpy/memset library function if data size is unknown or > 256.

With -mtune=generic -O2,

1. On Ice Lake processor,

Performance impacts on SPEC CPU 2017:

500.perlbench_r  0.51%
502.gcc_r        0.55%
505.mcf_r        0.38%
520.omnetpp_r   -0.74%
523.xalancbmk_r -0.35%
525.x264_r       2.99%
531.deepsjeng_r -0.17%
541.leela_r     -0.98%
548.exchange2_r  0.89%
557.xz_r         0.70%
Geomean          0.37%

503.bwaves_r     0.04%
507.cactuBSSN_r -0.01%
508.namd_r      -0.45%
510.parest_r    -0.09%
511.povray_r    -1.37%
519.lbm_r        0.00%
521.wrf_r       -2.56%
526.blender_r   -0.01%
527.cam4_r      -0.05%
538.imagick_r    0.36%
544.nab_r        0.08%
549.fotonik3d_r -0.06%
554.roms_r       0.05%
Geomean         -0.34%

Significant impacts on eembc benchmarks:

eembc/nnet_test      14.85%
eembc/mp2decoddata2  13.57%

2. On Cascade Lake processor,

Performance impacts on SPEC CPU 2017:

500.perlbench_r -0.02%
502.gcc_r        0.10%
505.mcf_r       -1.14%
520.omnetpp_r   -0.22%
523.xalancbmk_r  0.21%
525.x264_r       0.94%
531.deepsjeng_r -0.37%
541.leela_r     -0.46%
548.exchange2_r -0.40%
557.xz_r         0.60%
Geomean         -0.08%

503.bwaves_r    -0.50%
507.cactuBSSN_r  0.05%
508.namd_r      -0.02%
510.parest_r     0.09%
511.povray_r    -1.35%
519.lbm_r        0.00%
521.wrf_r       -0.03%
526.blender_r   -0.83%
527.cam4_r       1.23%
538.imagick_r    0.97%
544.nab_r       -0.02%
549.fotonik3d_r -0.12%
554.roms_r       0.55%
Geomean          0.00%

Significant impacts on eembc benchmarks:

eembc/nnet_test       9.90%
eembc/mp2decoddata2  16.42%
eembc/textv2data3    -4.86%
eembc/qos            12.90%

3. On Znver3 processor,

Performance impacts on SPEC CPU 2017:

500.perlbench_r -0.96%
502.gcc_r       -1.06%
505.mcf_r       -0.01%
520.omnetpp_r   -1.45%
523.xalancbmk_r  2.89%
525.x264_r       4.98%
531.deepsjeng_r  0.18%
541.leela_r     -1.54%
548.exchange2_r -1.25%
557.xz_r        -0.01%
Geomean          0.16%

503.bwaves_r     0.04%
507.cactuBSSN_r  0.85%
508.namd_r      -0.13%
510.parest_r     0.39%
511.povray_r     0.00%
519.lbm_r        0.00%
521.wrf_r        0.28%
526.blender_r   -0.10%
527.cam4_r      -0.58%
538.imagick_r    0.69%
544.nab_r       -0.04%
549.fotonik3d_r -0.04%
554.roms_r       0.40%
Geomean          0.15%

Significant impacts on eembc benchmarks:

eembc/aifftr01       13.95%
eembc/idctrn01        8.41%
eembc/nnet_test      30.25%
eembc/mp2decoddata2   5.05%
eembc/textv2data3     6.43%
eembc/qos            -5.79%

gcc/

	* config/i386/x86-tune-costs.h (generic_memcpy): Updated.
	(generic_memset): Likewise.
	(generic_cost): Change CLEAR_RATIO to 17.
	* config/i386/x86-tune.def (X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB):
	Add m_GENERIC.

gcc/testsuite/

	* gcc.target/i386/memcpy-strategy-12.c: New test.
	* gcc.target/i386/memcpy-strategy-13.c: Likewise.
	* gcc.target/i386/memset-strategy-10.c: Likewise.
	* gcc.target/i386/memset-strategy-11.c: Likewise.
	* gcc.target/i386/shrink_wrap_1.c: Also pass
	-mmemset-strategy=rep_8byte:-1:align.
	* gcc.target/i386/sw-1.c: Also pass -mstringop-strategy=rep_byte.
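
For readers who want the size-based policy from items 2 and 3 of the
description in executable form, the following is a minimal sketch in C.
It is not GCC code: the enum, the INLINE_LIMIT macro and the
choose_stringop function are hypothetical names invented for this
illustration. It also deliberately ignores the MOVE_RATIO/CLEAR_RATIO
by-pieces expansion from item 1, which may take over for small fixed
sizes before this decision is reached.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical names for illustration only; not GCC's internal API.  */
enum stringop_choice { USE_REP_BYTE, USE_LOOP, USE_LIBCALL };

/* Inline threshold taken from the description above (<= 256 bytes).  */
#define INLINE_LIMIT 256

/* bound_known: the compiler has proved the size is at most max_size.
   is_constant: the size is a compile-time constant.
   Assumption: by-pieces expansion (item 1) is not modelled here.  */
static enum stringop_choice
choose_stringop (bool bound_known, size_t max_size, bool is_constant)
{
  /* Item 3: unknown size, or a size above 256, calls memcpy/memset.  */
  if (!bound_known || max_size > INLINE_LIMIT)
    return USE_LIBCALL;

  /* Item 2a: constant size uses "rep movsb/stosb";
     item 2b: bounded but non-constant size uses a loop.  */
  return is_constant ? USE_REP_BYTE : USE_LOOP;
}
```

Under these assumptions a constant size of 256 selects the rep-prefixed
byte instruction and 257 falls back to the library call, which is the
boundary the new 256/257 tests probe when SSE is disabled.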
---
 gcc/config/i386/x86-tune-costs.h              | 31 ++++++++++++-------
 gcc/config/i386/x86-tune.def                  |  2 +-
 .../gcc.target/i386/memcpy-strategy-12.c      |  9 ++++++
 .../gcc.target/i386/memcpy-strategy-13.c      | 11 +++++++
 .../gcc.target/i386/memset-strategy-10.c      | 11 +++++++
 .../gcc.target/i386/memset-strategy-11.c      |  9 ++++++
 gcc/testsuite/gcc.target/i386/shrink_wrap_1.c |  2 +-
 gcc/testsuite/gcc.target/i386/sw-1.c          |  2 +-
 8 files changed, 63 insertions(+), 14 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/memcpy-strategy-12.c
 create mode 100644 gcc/testsuite/gcc.target/i386/memcpy-strategy-13.c
 create mode 100644 gcc/testsuite/gcc.target/i386/memset-strategy-10.c
 create mode 100644 gcc/testsuite/gcc.target/i386/memset-strategy-11.c

diff --git a/gcc/config/i386/x86-tune-costs.h b/gcc/config/i386/x86-tune-costs.h
index ffe810f2bcb..30e7c3e4261 100644
--- a/gcc/config/i386/x86-tune-costs.h
+++ b/gcc/config/i386/x86-tune-costs.h
@@ -2844,19 +2844,28 @@ struct processor_costs intel_cost = {
   "16",                                 /* Func alignment.  */
 };
 
-/* Generic should produce code tuned for Core-i7 (and newer chips)
-   and btver1 (and newer chips).  */
+/* Generic should produce code tuned for Haswell (and newer chips)
+   and znver1 (and newer chips).  NB: rep_prefix_1_byte is used only
+   for known size.  */
 
 static stringop_algs generic_memcpy[2] = {
-  {libcall, {{32, loop, false}, {8192, rep_prefix_4_byte, false},
-             {-1, libcall, false}}},
-  {libcall, {{32, loop, false}, {8192, rep_prefix_8_byte, false},
-             {-1, libcall, false}}}};
+  {libcall,
+   {{256, rep_prefix_1_byte, true},
+    {256, loop, false},
+    {-1, libcall, false}}},
+  {libcall,
+   {{256, rep_prefix_1_byte, true},
+    {256, loop, false},
+    {-1, libcall, false}}}};
 static stringop_algs generic_memset[2] = {
-  {libcall, {{32, loop, false}, {8192, rep_prefix_4_byte, false},
-             {-1, libcall, false}}},
-  {libcall, {{32, loop, false}, {8192, rep_prefix_8_byte, false},
-             {-1, libcall, false}}}};
+  {libcall,
+   {{256, rep_prefix_1_byte, true},
+    {256, loop, false},
+    {-1, libcall, false}}},
+  {libcall,
+   {{256, rep_prefix_1_byte, true},
+    {256, loop, false},
+    {-1, libcall, false}}}};
 static const
 struct processor_costs generic_cost = {
   {
@@ -2913,7 +2922,7 @@ struct processor_costs generic_cost = {
   COSTS_N_INSNS (1),                    /* cost of movzx */
   8,                                    /* "large" insn */
   17,                                   /* MOVE_RATIO */
-  6,                                    /* CLEAR_RATIO */
+  17,                                   /* CLEAR_RATIO */
   {6, 6, 6},                            /* cost of loading integer registers
                                            in QImode, HImode and SImode.
                                            Relative to reg-reg move (2).  */
diff --git a/gcc/config/i386/x86-tune.def b/gcc/config/i386/x86-tune.def
index eb057a67750..fd9c011a3f5 100644
--- a/gcc/config/i386/x86-tune.def
+++ b/gcc/config/i386/x86-tune.def
@@ -273,7 +273,7 @@ DEF_TUNE (X86_TUNE_SINGLE_STRINGOP, "single_stringop", m_386 | m_P4_NOCONA)
    move/set sequences of bytes with known size.  */
 DEF_TUNE (X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB,
           "prefer_known_rep_movsb_stosb",
-          m_SKYLAKE | m_ALDERLAKE | m_CORE_AVX512)
+          m_SKYLAKE | m_ALDERLAKE | m_CORE_AVX512 | m_GENERIC)
 
 /* X86_TUNE_MISALIGNED_MOVE_STRING_PRO_EPILOGUES: Enable generation of
    compact prologues and epilogues by issuing a misaligned moves.  This
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-strategy-12.c b/gcc/testsuite/gcc.target/i386/memcpy-strategy-12.c
new file mode 100644
index 00000000000..87f03352736
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-strategy-12.c
@@ -0,0 +1,9 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mtune=generic" } */
+/* { dg-final { scan-assembler "rep movsb" } } */
+
+void
+foo (char *dest, char *src)
+{
+  __builtin_memcpy (dest, src, 249);
+}
diff --git a/gcc/testsuite/gcc.target/i386/memcpy-strategy-13.c b/gcc/testsuite/gcc.target/i386/memcpy-strategy-13.c
new file mode 100644
index 00000000000..cfc3cfba623
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memcpy-strategy-13.c
@@ -0,0 +1,11 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mtune=generic" } */
+/* { dg-final { scan-assembler "jmp\tmemcpy" { target { ! ia32 } } } } */
+/* { dg-final { scan-assembler "call\tmemcpy" { target ia32 } } } */
+/* { dg-final { scan-assembler-not "rep movsb" } } */
+
+void
+foo (char *dest, char *src)
+{
+  __builtin_memcpy (dest, src, 257);
+}
diff --git a/gcc/testsuite/gcc.target/i386/memset-strategy-10.c b/gcc/testsuite/gcc.target/i386/memset-strategy-10.c
new file mode 100644
index 00000000000..ade5e8da42c
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-strategy-10.c
@@ -0,0 +1,11 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mtune=generic" } */
+/* { dg-final { scan-assembler "jmp\tmemset" { target { ! ia32 } } } } */
+/* { dg-final { scan-assembler "call\tmemset" { target ia32 } } } */
+/* { dg-final { scan-assembler-not "rep stosb" } } */
+
+void
+foo (char *dest)
+{
+  __builtin_memset (dest, 0, 257);
+}
diff --git a/gcc/testsuite/gcc.target/i386/memset-strategy-11.c b/gcc/testsuite/gcc.target/i386/memset-strategy-11.c
new file mode 100644
index 00000000000..d1b86152474
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/memset-strategy-11.c
@@ -0,0 +1,9 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mtune=generic" } */
+/* { dg-final { scan-assembler "rep stosb" } } */
+
+void
+foo (char *dest)
+{
+  __builtin_memset (dest, 0, 253);
+}
diff --git a/gcc/testsuite/gcc.target/i386/shrink_wrap_1.c b/gcc/testsuite/gcc.target/i386/shrink_wrap_1.c
index 94dadd6cdbd..44fe7d2836e 100644
--- a/gcc/testsuite/gcc.target/i386/shrink_wrap_1.c
+++ b/gcc/testsuite/gcc.target/i386/shrink_wrap_1.c
@@ -1,5 +1,5 @@
 /* { dg-do compile { target { ! ia32 } } } */
-/* { dg-options "-O2 -fdump-rtl-pro_and_epilogue" } */
+/* { dg-options "-O2 -mmemset-strategy=rep_8byte:-1:align -fdump-rtl-pro_and_epilogue" } */
 
 enum machine_mode
 {
diff --git a/gcc/testsuite/gcc.target/i386/sw-1.c b/gcc/testsuite/gcc.target/i386/sw-1.c
index aec095eda62..f61621e42bf 100644
--- a/gcc/testsuite/gcc.target/i386/sw-1.c
+++ b/gcc/testsuite/gcc.target/i386/sw-1.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -mtune=generic -fshrink-wrap -fdump-rtl-pro_and_epilogue" } */
+/* { dg-options "-O2 -mtune=generic -mstringop-strategy=rep_byte -fshrink-wrap -fdump-rtl-pro_and_epilogue" } */
 /* { dg-skip-if "No shrink-wrapping preformed" { x86_64-*-mingw* } } */
 
 #include