From patchwork Fri May 7 02:29:00 2021
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: "Kewen.Lin"
X-Patchwork-Id: 1475302
To: GCC Patches
Subject: [PATCH] rs6000: Adjust rs6000_density_test for strided_load
Message-ID: <7b9f9bdf-1ed5-139b-de9c-511ee8454b85@linux.ibm.com>
Date: Fri, 7 May 2021 10:29:00 +0800
From: "Kewen.Lin"
Reply-To: "Kewen.Lin"
Cc: Bill Schmidt, David Edelsohn, Segher Boessenkool

Hi,

We noticed that the SPEC2017 503.bwaves_r run time degrades by about 8%
on P8 and P9 when vectorization is enabled at -O2 with fast-math.

Compared with -Ofast, the compiler does not perform loop interchange on
the innermost loop, so vectorizing that loop is not profitable.
With loop vectorization the loop becomes very dense (the density ratio
is 83): there are many scalar loads whose results are then used to
construct vectors, and the vector CTORs have to wait until the required
loads are ready.

We already have the function rs6000_density_test to check for this kind
of dense loop, but for this case its threshold is too generic and a bit
too high.  This patch tweaks the density heuristics by introducing
additional thresholds for strided_load, which avoids affecting other
benchmarks that may be sensitive to a change in the generic
DENSITY_PCT_THRESHOLD.

Bootstrapped/regtested on powerpc64le-linux-gnu P9.

Nothing remarkable was observed in a full SPEC2017 run on Power9, except
that the bwaves_r degradation is fixed.

Is it ok for trunk?

BR,
Kewen
------
gcc/ChangeLog:

    * config/rs6000/rs6000.c (rs6000_density_test): Add new heuristics
    for strided_load density check.
---
 gcc/config/rs6000/rs6000.c | 88 +++++++++++++++++++++++++++++++++-----
 1 file changed, 77 insertions(+), 11 deletions(-)

diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
index ffdf10098a9..5ae40d6f4ce 100644
--- a/gcc/config/rs6000/rs6000.c
+++ b/gcc/config/rs6000/rs6000.c
@@ -5245,12 +5245,16 @@ rs6000_density_test (rs6000_cost_data *data)
   const int DENSITY_PCT_THRESHOLD = 85;
   const int DENSITY_SIZE_THRESHOLD = 70;
   const int DENSITY_PENALTY = 10;
+  const int DENSITY_LOAD_PCT_THRESHOLD = 80;
+  const int DENSITY_LOAD_FOR_CTOR_PCT_THRESHOLD = 65;
+  const int DENSITY_LOAD_SIZE_THRESHOLD = 20;
   struct loop *loop = data->loop_info;
   basic_block *bbs = get_loop_body (loop);
   int nbbs = loop->num_nodes;
   loop_vec_info loop_vinfo = loop_vec_info_for_loop (data->loop_info);
   int vec_cost = data->cost[vect_body], not_vec_cost = 0;
   int i, density_pct;
+  unsigned int nload_total = 0, nctor_for_strided = 0, nload_for_ctor = 0;
 
   /* Only care about cost of vector version, so exclude scalar version here.  */
   if (LOOP_VINFO_TARGET_COST_DATA (loop_vinfo) != (void *) data)
@@ -5272,21 +5276,83 @@ rs6000_density_test (rs6000_cost_data *data)
           if (!STMT_VINFO_RELEVANT_P (stmt_info)
               && !STMT_VINFO_IN_PATTERN_P (stmt_info))
             not_vec_cost++;
+          else
+            {
+              stmt_vec_info vstmt_info = vect_stmt_to_vectorize (stmt_info);
+              if (STMT_VINFO_DATA_REF (vstmt_info)
+                  && DR_IS_READ (STMT_VINFO_DATA_REF (vstmt_info)))
+                {
+                  if (STMT_VINFO_STRIDED_P (vstmt_info))
+                    {
+                      unsigned int ncopies = 1;
+                      unsigned int nunits = 1;
+                      /* TODO: For VMAT_STRIDED_SLP, the total CTOR can be
+                         fewer due to group access.  Simply handle it here
+                         for now.  */
+                      if (!STMT_SLP_TYPE (vstmt_info))
+                        {
+                          tree vectype = STMT_VINFO_VECTYPE (vstmt_info);
+                          ncopies = vect_get_num_copies (loop_vinfo, vectype);
+                          nunits = vect_nunits_for_cost (vectype);
+                        }
+                      unsigned int nloads = ncopies * nunits;
+                      nload_for_ctor += nloads;
+                      nload_total += nloads;
+                      nctor_for_strided += ncopies;
+                    }
+                  else
+                    nload_total++;
+                }
+            }
         }
     }
-
   free (bbs);
-  density_pct = (vec_cost * 100) / (vec_cost + not_vec_cost);
 
-  if (density_pct > DENSITY_PCT_THRESHOLD
-      && vec_cost + not_vec_cost > DENSITY_SIZE_THRESHOLD)
-    {
-      data->cost[vect_body] = vec_cost * (100 + DENSITY_PENALTY) / 100;
-      if (dump_enabled_p ())
-        dump_printf_loc (MSG_NOTE, vect_location,
-                         "density %d%%, cost %d exceeds threshold, penalizing "
-                         "loop body cost by %d%%", density_pct,
-                         vec_cost + not_vec_cost, DENSITY_PENALTY);
+  if (vec_cost + not_vec_cost > DENSITY_SIZE_THRESHOLD)
+    {
+      density_pct = (vec_cost * 100) / (vec_cost + not_vec_cost);
+      if (density_pct > DENSITY_PCT_THRESHOLD)
+        {
+          data->cost[vect_body] = vec_cost * (100 + DENSITY_PENALTY) / 100;
+          if (dump_enabled_p ())
+            dump_printf_loc (MSG_NOTE, vect_location,
+                             "density %d%%, cost %d exceeds threshold, "
+                             "penalizing loop body cost by %d%%.\n",
+                             density_pct, vec_cost + not_vec_cost,
+                             DENSITY_PENALTY);
+        }
+      /* For one loop which has a large proportion scalar loads of all
+         loads fed into vector construction, if the density is high,
+         the loads will have more stalls than usual, further affect
+         the vector construction.  One typical case is the innermost
+         loop of the hotspot of spec2017 503.bwaves_r without loop
+         interchange.  Here we price more on the related vector
+         construction and penalize the body cost.  */
+      else if (density_pct > DENSITY_LOAD_PCT_THRESHOLD
+               && nload_total > DENSITY_LOAD_SIZE_THRESHOLD)
+        {
+          int load_for_ctor_pct = (nload_for_ctor * 100) / nload_total;
+          /* Large proportion of scalar loads fed to vector CTOR.  */
+          if (load_for_ctor_pct > DENSITY_LOAD_FOR_CTOR_PCT_THRESHOLD)
+            {
+              vec_cost += nctor_for_strided;
+              if (dump_enabled_p ())
+                dump_printf_loc (MSG_NOTE, vect_location,
+                                 "Found high density loop with a large "
+                                 "proportion %d%% of scalar loads fed to "
+                                 "vector ctor, add cost %d.\n",
+                                 load_for_ctor_pct, nctor_for_strided);
+
+              data->cost[vect_body] = vec_cost * (100 + DENSITY_PENALTY) / 100;
+              if (dump_enabled_p ())
+                dump_printf_loc (MSG_NOTE, vect_location,
+                                 "density %d%%, cost %d exceeds threshold, "
+                                 "penalizing loop body cost by %d%% for "
+                                 "load.\n",
+                                 density_pct, vec_cost + not_vec_cost,
+                                 DENSITY_PENALTY);
+            }
+        }
     }
 }
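
Note on the threshold arithmetic: the decision logic of the patched
function can be modelled outside GCC to check how the thresholds
interact.  The sketch below is only an illustration under stated
assumptions.  model_body_cost is a hypothetical helper that is not part
of the patch, and it takes plain integers in place of the counts that
rs6000_density_test gathers from the loop's statements.  The constants
are copied from the patch.

```c
/* Thresholds copied from the patch.  */
enum
{
  DENSITY_PCT_THRESHOLD = 85,
  DENSITY_SIZE_THRESHOLD = 70,
  DENSITY_PENALTY = 10,
  DENSITY_LOAD_PCT_THRESHOLD = 80,
  DENSITY_LOAD_FOR_CTOR_PCT_THRESHOLD = 65,
  DENSITY_LOAD_SIZE_THRESHOLD = 20
};

/* Hypothetical stand-alone model (not GCC code): return the loop body
   cost after applying the density heuristics to the given counts.  */
static int
model_body_cost (int vec_cost, int not_vec_cost,
                 unsigned int nload_total, unsigned int nload_for_ctor,
                 unsigned int nctor_for_strided)
{
  /* Small loops are never penalized.  */
  if (vec_cost + not_vec_cost <= DENSITY_SIZE_THRESHOLD)
    return vec_cost;

  int density_pct = (vec_cost * 100) / (vec_cost + not_vec_cost);

  /* Generic check: very dense loop body.  */
  if (density_pct > DENSITY_PCT_THRESHOLD)
    return vec_cost * (100 + DENSITY_PENALTY) / 100;

  /* Load-specific check: slightly less dense, but with many loads.  */
  if (density_pct > DENSITY_LOAD_PCT_THRESHOLD
      && nload_total > (unsigned int) DENSITY_LOAD_SIZE_THRESHOLD)
    {
      int load_for_ctor_pct = (int) ((nload_for_ctor * 100) / nload_total);
      if (load_for_ctor_pct > DENSITY_LOAD_FOR_CTOR_PCT_THRESHOLD)
        {
          /* Price the vector constructions, then apply the penalty.  */
          vec_cost += (int) nctor_for_strided;
          return vec_cost * (100 + DENSITY_PENALTY) / 100;
        }
    }

  return vec_cost;
}
```

With made-up inputs vec_cost = 83, not_vec_cost = 17, nload_total = 40,
nload_for_ctor = 32 and nctor_for_strided = 16, the density is 83%,
which is below DENSITY_PCT_THRESHOLD but above
DENSITY_LOAD_PCT_THRESHOLD.  80% of the loads feed vector CTORs, so the
modelled cost becomes (83 + 16) * 110 / 100 = 108.  These inputs are
illustrative only and are not measurements from bwaves_r.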