From patchwork Tue Dec 8 17:27:53 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Van Haaren, Harry" X-Patchwork-Id: 1412806 Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@bilbo.ozlabs.org Authentication-Results: ozlabs.org; spf=pass (sender SPF authorized) smtp.mailfrom=openvswitch.org (client-ip=140.211.166.138; helo=whitealder.osuosl.org; envelope-from=ovs-dev-bounces@openvswitch.org; receiver=) Authentication-Results: ozlabs.org; dmarc=fail (p=none dis=none) header.from=intel.com Received: from whitealder.osuosl.org (smtp1.osuosl.org [140.211.166.138]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by ozlabs.org (Postfix) with ESMTPS id 4Cr6f94Xnvz9sW1 for ; Wed, 9 Dec 2020 04:30:57 +1100 (AEDT) Received: from localhost (localhost [127.0.0.1]) by whitealder.osuosl.org (Postfix) with ESMTP id 32CA686E33; Tue, 8 Dec 2020 17:30:56 +0000 (UTC) X-Virus-Scanned: amavisd-new at osuosl.org Received: from whitealder.osuosl.org ([127.0.0.1]) by localhost (.osuosl.org [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id DvkDh4Gvuk7s; Tue, 8 Dec 2020 17:30:51 +0000 (UTC) Received: from lists.linuxfoundation.org (lf-lists.osuosl.org [140.211.9.56]) by whitealder.osuosl.org (Postfix) with ESMTP id E39B787625; Tue, 8 Dec 2020 17:29:41 +0000 (UTC) Received: from lf-lists.osuosl.org (localhost [127.0.0.1]) by lists.linuxfoundation.org (Postfix) with ESMTP id C8D63C0FA7; Tue, 8 Dec 2020 17:29:41 +0000 (UTC) X-Original-To: ovs-dev@openvswitch.org Delivered-To: ovs-dev@lists.linuxfoundation.org Received: from silver.osuosl.org (smtp3.osuosl.org [140.211.166.136]) by lists.linuxfoundation.org (Postfix) with ESMTP id B21E3C0FA7 for ; Tue, 8 Dec 2020 17:29:40 +0000 (UTC) Received: from localhost (localhost [127.0.0.1]) by silver.osuosl.org (Postfix) with ESMTP id 9A607274E3 for ; Tue, 8 Dec 2020 17:29:40 +0000 (UTC) X-Virus-Scanned: amavisd-new at osuosl.org Received: from silver.osuosl.org ([127.0.0.1]) by localhost (.osuosl.org [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id dGkEkRe6WFzK for ; Tue, 8 Dec 2020 17:29:31 +0000 (UTC) X-Greylist: domain auto-whitelisted by SQLgrey-1.7.6 Received: from mga07.intel.com (mga07.intel.com [134.134.136.100]) by silver.osuosl.org (Postfix) with ESMTPS id 160362E2BE for ; Tue, 8 Dec 2020 17:28:37 +0000 (UTC) IronPort-SDR: NR5gdW1RWXaKekXwW5qufk3wHcft5cv40M4S2Mj/XhsS+3ep13AtRi7Me8QjtNKkFWeMnyKp2/ 7l620x5K3QsQ== X-IronPort-AV: E=McAfee;i="6000,8403,9829"; a="238039110" X-IronPort-AV: E=Sophos;i="5.78,403,1599548400"; d="scan'208";a="238039110" Received: from orsmga001.jf.intel.com ([10.7.209.18]) by orsmga105.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 08 Dec 2020 09:28:36 -0800 IronPort-SDR: inSy6Cla71YXVId+46cEYNVQOrEqf7XHNE+d57ysN5lTLehjmzH1OcLrCvqL2C4LiV1L32rANr aIvDRioWmgmQ== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.78,403,1599548400"; d="scan'208";a="407699927" Received: from silpixa00400633.ir.intel.com ([10.237.213.44]) by orsmga001.jf.intel.com with ESMTP; 08 Dec 2020 09:28:34 -0800 From: Harry van Haaren To: ovs-dev@openvswitch.org Date: Tue, 8 Dec 2020 17:27:53 +0000 Message-Id: <20201208172753.349273-16-harry.van.haaren@intel.com> X-Mailer: git-send-email 2.25.1 In-Reply-To: <20201208172753.349273-1-harry.van.haaren@intel.com> References: <20201203193747.807120-2-harry.van.haaren@intel.com> <20201208172753.349273-1-harry.van.haaren@intel.com> MIME-Version: 1.0 Cc: i.maximets@ovn.org Subject: [ovs-dev] [PATCH v6 15/15] dpcls-avx512: enabling avx512 vector popcount instruction X-BeenThere: ovs-dev@openvswitch.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: ovs-dev-bounces@openvswitch.org Sender: "dev" This commit enables the AVX512-VPOPCNTDQ Vector Popcount instruction. This instruction is not available on every CPU that supports the AVX512-F Foundation ISA, hence it is enabled only when the additional VPOPCNTDQ ISA check is passed. The vector popcount instruction is used instead of the AVX512 popcount emulation code present in the avx512 optimized DPCLS today. It provides higher performance in the SIMD miniflow processing as that requires the popcount to calculate the miniflow block indexes. Signed-off-by: Harry van Haaren --- v6: - Now that the DPDK 20.11 dependency exists, it is possible to use the RTE_CPUFLAG_* defines to enable the AVX512 vectorized popcount instruction. --- lib/dpdk.c | 1 + lib/dpif-netdev-lookup-avx512-gather.c | 86 ++++++++++++++++++++------ 2 files changed, 69 insertions(+), 18 deletions(-) diff --git a/lib/dpdk.c b/lib/dpdk.c index 703602603..10491866a 100644 --- a/lib/dpdk.c +++ b/lib/dpdk.c @@ -653,6 +653,7 @@ dpdk_get_cpu_has_isa(const char *arch, const char *feature) #if __x86_64__ /* CPU flags only defined for the architecture that support it. */ CHECK_CPU_FEATURE(feature, "avx512f", RTE_CPUFLAG_AVX512F); + CHECK_CPU_FEATURE(feature, "avx512vpopcntdq", RTE_CPUFLAG_AVX512VPOPCNTDQ); CHECK_CPU_FEATURE(feature, "bmi2", RTE_CPUFLAG_BMI2); #endif diff --git a/lib/dpif-netdev-lookup-avx512-gather.c b/lib/dpif-netdev-lookup-avx512-gather.c index 2a70915cc..9d680abe9 100644 --- a/lib/dpif-netdev-lookup-avx512-gather.c +++ b/lib/dpif-netdev-lookup-avx512-gather.c @@ -53,6 +53,15 @@ VLOG_DEFINE_THIS_MODULE(dpif_lookup_avx512_gather); + +/* Wrapper function required to enable ISA. */ +static inline __m512i +__attribute__((__target__("avx512vpopcntdq"))) +_mm512_popcnt_epi64_wrapper(__m512i v_in) +{ + return _mm512_popcnt_epi64(v_in); +} + static inline __m512i _mm512_popcnt_epi64_manual(__m512i v_in) { @@ -125,7 +134,8 @@ avx512_blocks_gather(__m512i v_u0, /* reg of u64 of all u0 bits */ __mmask64 u1_bcast_msk, /* mask of u1 lanes */ const uint64_t pkt_mf_u0_pop, /* num bits in u0 of pkt */ __mmask64 zero_mask, /* maskz if pkt not have mf bit */ - __mmask64 u64_lanes_mask) /* total lane count to use */ + __mmask64 u64_lanes_mask, /* total lane count to use */ + const uint32_t use_vpop) /* use AVX512 vpopcntdq */ { /* Suggest to compiler to load tbl blocks ahead of gather() */ __m512i v_tbl_blocks = _mm512_maskz_loadu_epi64(u64_lanes_mask, @@ -139,8 +149,15 @@ avx512_blocks_gather(__m512i v_u0, /* reg of u64 of all u0 bits */ tbl_mf_masks); __m512i v_masks = _mm512_and_si512(v_pkt_bits, v_tbl_masks); - /* Manual AVX512 popcount for u64 lanes. */ - __m512i v_popcnts = _mm512_popcnt_epi64_manual(v_masks); + /* Calculate AVX512 popcount for u64 lanes using the native instruction + * if available, or using emulation if not available. + */ + __m512i v_popcnts; + if (use_vpop) { + v_popcnts = _mm512_popcnt_epi64_wrapper(v_masks); + } else { + v_popcnts = _mm512_popcnt_epi64_manual(v_masks); + } /* Add popcounts and offset for u1 bits. */ __m512i v_idx_u0_offset = _mm512_maskz_set1_epi64(u1_bcast_msk, @@ -165,8 +182,11 @@ avx512_lookup_impl(struct dpcls_subtable *subtable, const struct netdev_flow_key *keys[], struct dpcls_rule **rules, const uint32_t bit_count_u0, - const uint32_t bit_count_u1) + const uint32_t bit_count_u1, + const uint32_t use_vpop) { + (void)use_vpop; + OVS_ALIGNED_VAR(CACHE_LINE_SIZE)uint64_t block_cache[BLOCKS_CACHE_SIZE]; uint32_t hashes[NETDEV_MAX_BURST]; @@ -217,7 +237,8 @@ avx512_lookup_impl(struct dpcls_subtable *subtable, u1_bcast_mask, pkt_mf_u0_pop, zero_mask, - bit_count_total_mask); + bit_count_total_mask, + use_vpop); _mm512_storeu_si512(&block_cache[i * MF_BLOCKS_PER_PACKET], v_blocks); if (bit_count_total > 8) { @@ -238,7 +259,8 @@ avx512_lookup_impl(struct dpcls_subtable *subtable, u1_bcast_mask_gt8, pkt_mf_u0_pop, zero_mask_gt8, - bit_count_gt8_mask); + bit_count_gt8_mask, + use_vpop); _mm512_storeu_si512(&block_cache[(i * MF_BLOCKS_PER_PACKET) + 8], v_blocks_gt8); } @@ -287,7 +309,11 @@ avx512_lookup_impl(struct dpcls_subtable *subtable, return found_map; } -/* Expand out specialized functions with U0 and U1 bit attributes. */ +/* Expand out specialized functions with U0 and U1 bit attributes. As the + * AVX512 vpopcnt instruction is not supported on all AVX512 capable CPUs, + * create two functions for each miniflow signature. This allows the runtime + * CPU detection in probe() to select the ideal implementation. + */ #define DECLARE_OPTIMIZED_LOOKUP_FUNCTION(U0, U1) \ static uint32_t \ dpcls_avx512_gather_mf_##U0##_##U1(struct dpcls_subtable *subtable, \ @@ -295,7 +321,20 @@ avx512_lookup_impl(struct dpcls_subtable *subtable, const struct netdev_flow_key *keys[], \ struct dpcls_rule **rules) \ { \ - return avx512_lookup_impl(subtable, keys_map, keys, rules, U0, U1); \ + const uint32_t use_vpop = 0; \ + return avx512_lookup_impl(subtable, keys_map, keys, rules, \ + U0, U1, use_vpop); \ + } \ + \ + static uint32_t __attribute__((__target__("avx512vpopcntdq"))) \ + dpcls_avx512_gather_mf_##U0##_##U1##_vpop(struct dpcls_subtable *subtable,\ + uint32_t keys_map, \ + const struct netdev_flow_key *keys[], \ + struct dpcls_rule **rules) \ + { \ + const uint32_t use_vpop = 1; \ + return avx512_lookup_impl(subtable, keys_map, keys, rules, \ + U0, U1, use_vpop); \ } \ DECLARE_OPTIMIZED_LOOKUP_FUNCTION(9, 4) @@ -305,11 +344,18 @@ DECLARE_OPTIMIZED_LOOKUP_FUNCTION(5, 1) DECLARE_OPTIMIZED_LOOKUP_FUNCTION(4, 1) DECLARE_OPTIMIZED_LOOKUP_FUNCTION(4, 0) -/* Check if a specialized function is valid for the required subtable. */ -#define CHECK_LOOKUP_FUNCTION(U0, U1) \ +/* Check if a specialized function is valid for the required subtable. + * The use_vpop variable is used to decide if the VPOPCNT instruction can be + * used or not. + */ +#define CHECK_LOOKUP_FUNCTION(U0, U1, use_vpop) \ ovs_assert((U0 + U1) <= (NUM_U64_IN_ZMM_REG * 2)); \ if (!f && u0_bits == U0 && u1_bits == U1) { \ - f = dpcls_avx512_gather_mf_##U0##_##U1; \ + if (use_vpop) { \ + f = dpcls_avx512_gather_mf_##U0##_##U1##_vpop; \ + } else { \ + f = dpcls_avx512_gather_mf_##U0##_##U1; \ + } \ } static uint32_t @@ -317,9 +363,11 @@ dpcls_avx512_gather_mf_any(struct dpcls_subtable *subtable, uint32_t keys_map, const struct netdev_flow_key *keys[], struct dpcls_rule **rules) { + const uint32_t use_vpop = 0; return avx512_lookup_impl(subtable, keys_map, keys, rules, subtable->mf_bits_set_unit0, - subtable->mf_bits_set_unit1); + subtable->mf_bits_set_unit1, + use_vpop); } dpcls_subtable_lookup_func @@ -333,12 +381,14 @@ dpcls_subtable_avx512_gather_probe(uint32_t u0_bits, uint32_t u1_bits) return NULL; } - CHECK_LOOKUP_FUNCTION(9, 4); - CHECK_LOOKUP_FUNCTION(9, 1); - CHECK_LOOKUP_FUNCTION(5, 3); - CHECK_LOOKUP_FUNCTION(5, 1); - CHECK_LOOKUP_FUNCTION(4, 1); - CHECK_LOOKUP_FUNCTION(4, 0); + int use_vpop = dpdk_get_cpu_has_isa("x86_64", "avx512vpopcntdq"); + + CHECK_LOOKUP_FUNCTION(9, 4, use_vpop); + CHECK_LOOKUP_FUNCTION(9, 1, use_vpop); + CHECK_LOOKUP_FUNCTION(5, 3, use_vpop); + CHECK_LOOKUP_FUNCTION(5, 1, use_vpop); + CHECK_LOOKUP_FUNCTION(4, 1, use_vpop); + CHECK_LOOKUP_FUNCTION(4, 0, use_vpop); /* Check if the _any looping version of the code can perform this miniflow * lookup. Performance gain may be less pronounced due to non-specialized