From patchwork Wed Jul 22 15:48:09 2020
X-Patchwork-Submitter: "Kewen.Lin"
X-Patchwork-Id: 1333965
Subject: [PATCH v3] vect/rs6000: Support vector with length cost modeling
To: GCC Patches , richard.sandiford@arm.com
From: "Kewen.Lin"
Cc: Bill Schmidt , Segher Boessenkool
Message-ID: <1aeabdc7-0cf4-055b-a3ec-74c283053cf5@linux.ibm.com>
References: <419f1fad-05be-115c-1a53-cb710ae7b2dc@linux.ibm.com>
Date: Wed, 22 Jul 2020 23:48:09 +0800
List-Id: Gcc-patches mailing list

Hi Richard,

Thanks for the review!
on 2020/7/22 5:11 PM, Richard Sandiford wrote: > "Kewen.Lin" writes: >> - else if (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo)) >> - { >> - peel_iters_prologue = 0; >> - peel_iters_epilogue = 0; >> + if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)) >> + { >> + /* Calculate how many masks we need to generate. */ >> + unsigned int num_masks = 0; >> + rgroup_controls *rgm; >> + unsigned int num_vectors_m1; >> + FOR_EACH_VEC_ELT (LOOP_VINFO_MASKS (loop_vinfo), num_vectors_m1, rgm) >> + if (rgm->type) >> + num_masks += num_vectors_m1 + 1; >> + gcc_assert (num_masks > 0); >> + >> + /* In the worst case, we need to generate each mask in the prologue >> + and in the loop body. One of the loop body mask instructions >> + replaces the comparison in the scalar loop, and since we don't >> + count the scalar comparison against the scalar body, we shouldn't >> + count that vector instruction against the vector body either. >> + >> + Sometimes we can use unpacks instead of generating prologue >> + masks and sometimes the prologue mask will fold to a constant, >> + so the actual prologue cost might be smaller. However, it's >> + simpler and safer to use the worst-case cost; if this ends up >> + being the tie-breaker between vectorizing or not, then it's >> + probably better not to vectorize. */ >> + (void) add_stmt_cost (loop_vinfo, target_cost_data, num_masks, >> + vector_stmt, NULL, NULL_TREE, 0, vect_prologue); >> + (void) add_stmt_cost (loop_vinfo, target_cost_data, num_masks - 1, >> + vector_stmt, NULL, NULL_TREE, 0, vect_body); >> + } >> + else >> + { >> + gcc_assert (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo)); >> + >> + /* Consider cost for LOOP_VINFO_PEELING_FOR_ALIGNMENT. */ >> + if (npeel < 0) >> + { >> + peel_iters_prologue = assumed_vf / 2; >> + /* See below, if peeled iterations are unknown, count a taken >> + branch and a not taken branch per peeled loop. 
*/ >> + (void) add_stmt_cost (loop_vinfo, target_cost_data, 1, >> + cond_branch_taken, NULL, NULL_TREE, 0, >> + vect_prologue); >> + (void) add_stmt_cost (loop_vinfo, target_cost_data, 1, >> + cond_branch_not_taken, NULL, NULL_TREE, 0, >> + vect_prologue); >> + } >> + else >> + { >> + peel_iters_prologue = npeel; >> + if (!LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)) >> + /* See vect_get_known_peeling_cost, if peeled iterations are >> + known but number of scalar loop iterations are unknown, count >> + a taken branch per peeled loop. */ >> + (void) add_stmt_cost (loop_vinfo, target_cost_data, 1, >> + cond_branch_taken, NULL, NULL_TREE, 0, >> + vect_prologue); >> + } > > I think it'd be good to avoid duplicating this. How about the > following structure? > > if (vect_use_loop_mask_for_alignment_p (…)) > { > peel_iters_prologue = 0; > peel_iters_epilogue = 0; > } > else if (npeel < 0) > { > … // A > } > else > { > …vect_get_known_peeling_cost stuff… > } > > but in A and vect_get_known_peeling_cost, set peel_iters_epilogue to: > > LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo) ? 1 : 0 > > for LOOP_VINFO_USING_PARTIAL_VECTORS_P, instead of setting it to > whatever value we'd normally use. Then wrap: > > (void) add_stmt_cost (loop_vinfo, target_cost_data, 1, cond_branch_taken, > NULL, NULL_TREE, 0, vect_epilogue); > (void) add_stmt_cost (loop_vinfo, > target_cost_data, 1, cond_branch_not_taken, > NULL, NULL_TREE, 0, vect_epilogue); > > in !LOOP_VINFO_USING_PARTIAL_VECTORS_P and make the other vect_epilogue > stuff in A conditional on peel_iters_epilogue != 0. > > This will also remove the need for the existing LOOP_VINFO_FULLY_MASKED_P > code: > > if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)) > { > /* We need to peel exactly one iteration. 
*/ > peel_iters_epilogue += 1; > stmt_info_for_cost *si; > int j; > FOR_EACH_VEC_ELT (LOOP_VINFO_SCALAR_ITERATION_COST (loop_vinfo), > j, si) > (void) add_stmt_cost (loop_vinfo, target_cost_data, si->count, > si->kind, si->stmt_info, si->vectype, > si->misalign, vect_epilogue); > } > > Then, after the above, have: > > if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)) > …add costs for mask overhead… > else if (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo)) > …add costs for lengths overhead… > > So we'd have one block of code for estimating the prologue and epilogue > peeling cost, and a separate block of code for the loop control overhead. >

That's a great idea.  Following your subsequent suggestion, the updated v3
(attached) uses the structure:

  - calculate peel_iters_prologue
  - calculate peel_iters_epilogue
  - add costs associated with peel_iters_prologue
  - add costs associated with peel_iters_epilogue
  - add costs related to branch taken/not_taken

It has been bootstrapped/regtested on powerpc64le-linux-gnu (P9) with the
explicit param vect-partial-vector-usage=1; I'll also test it without the
partial vectors setting, and on aarch64 later.

BR,
Kewen

-----

gcc/ChangeLog:

	* config/rs6000/rs6000.c (adjust_vect_cost_per_loop): New function.
	(rs6000_finish_cost): Call adjust_vect_cost_per_loop.
	* tree-vect-loop.c (vect_get_known_peeling_cost): Factor out some code
	to determine peel_iters_epilogue to function ...
	(vect_get_peel_iters_epilogue): ... this.  New function.
	(vect_estimate_min_profitable_iters): Add cost modeling for vector
	with length, refactor cost calculation on peel_iters_prologue and
	peel_iters_epilogue.
diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c index 009afc5f894..d71f2bf1c16 100644 --- a/gcc/config/rs6000/rs6000.c +++ b/gcc/config/rs6000/rs6000.c @@ -5177,6 +5177,34 @@ rs6000_add_stmt_cost (class vec_info *vinfo, void *data, int count, return retval; } +/* For some target specific vectorization cost which can't be handled per stmt, + we check the requisite conditions and adjust the vectorization cost + accordingly if satisfied. One typical example is to model shift cost for + vector with length by counting number of required lengths under condition + LOOP_VINFO_FULLY_WITH_LENGTH_P. */ + +static void +adjust_vect_cost_per_loop (rs6000_cost_data *data) +{ + struct loop *loop = data->loop_info; + gcc_assert (loop); + loop_vec_info loop_vinfo = loop_vec_info_for_loop (loop); + + if (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo)) + { + rgroup_controls *rgc; + unsigned int num_vectors_m1; + unsigned int shift_cnt = 0; + FOR_EACH_VEC_ELT (LOOP_VINFO_LENS (loop_vinfo), num_vectors_m1, rgc) + if (rgc->type) + /* Each length needs one shift to fill into bits 0-7. */ + shift_cnt += (num_vectors_m1 + 1); + + rs6000_add_stmt_cost (loop_vinfo, (void *) data, shift_cnt, scalar_stmt, + NULL, NULL_TREE, 0, vect_body); + } +} + /* Implement targetm.vectorize.finish_cost. */ static void @@ -5186,7 +5214,10 @@ rs6000_finish_cost (void *data, unsigned *prologue_cost, rs6000_cost_data *cost_data = (rs6000_cost_data*) data; if (cost_data->loop_info) - rs6000_density_test (cost_data); + { + adjust_vect_cost_per_loop (cost_data); + rs6000_density_test (cost_data); + } /* Don't vectorize minimum-vectorization-factor, simple copy loops that require versioning for any reason. 
The vectorization is at diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c index e933441b922..c85b3ea1393 100644 --- a/gcc/tree-vect-loop.c +++ b/gcc/tree-vect-loop.c @@ -3474,42 +3474,56 @@ vect_is_simple_reduction (loop_vec_info loop_info, stmt_vec_info phi_info, return NULL; } -/* Calculate cost of peeling the loop PEEL_ITERS_PROLOGUE times. */ -int -vect_get_known_peeling_cost (loop_vec_info loop_vinfo, int peel_iters_prologue, - int *peel_iters_epilogue, - stmt_vector_for_cost *scalar_cost_vec, - stmt_vector_for_cost *prologue_cost_vec, - stmt_vector_for_cost *epilogue_cost_vec) +/* Calculate how many iterations are peeled for the epilogue, given LOOP_VINFO + and PEEL_ITERS_PROLOGUE. */ + +static int +vect_get_peel_iters_epilogue (loop_vec_info loop_vinfo, int peel_iters_prologue) { - int retval = 0; int assumed_vf = vect_vf_for_cost (loop_vinfo); - if (!LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)) { - *peel_iters_epilogue = assumed_vf / 2; if (dump_enabled_p ()) - dump_printf_loc (MSG_NOTE, vect_location, + dump_printf_loc (MSG_NOTE, vect_location, "cost model: epilogue peel iters set to vf/2 " "because loop iterations are unknown .\n"); - - /* If peeled iterations are known but number of scalar loop - iterations are unknown, count a taken branch per peeled loop. */ - retval = record_stmt_cost (prologue_cost_vec, 1, cond_branch_taken, - NULL, NULL_TREE, 0, vect_prologue); - retval += record_stmt_cost (epilogue_cost_vec, 1, cond_branch_taken, - NULL, NULL_TREE, 0, vect_epilogue); + return assumed_vf / 2; } else { int niters = LOOP_VINFO_INT_NITERS (loop_vinfo); - peel_iters_prologue = niters < peel_iters_prologue ? - niters : peel_iters_prologue; - *peel_iters_epilogue = (niters - peel_iters_prologue) % assumed_vf; + int npeel_prol + = niters < peel_iters_prologue ? niters : peel_iters_prologue; + int npeel_epil = (niters - npeel_prol) % assumed_vf; /* If we need to peel for gaps, but no peeling is required, we have to peel VF iterations. 
*/ - if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo) && !*peel_iters_epilogue) - *peel_iters_epilogue = assumed_vf; + if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo) && !npeel_epil) + npeel_epil = assumed_vf; + return npeel_epil; + } +} + +/* Calculate cost of peeling the loop PEEL_ITERS_PROLOGUE times. */ +int +vect_get_known_peeling_cost (loop_vec_info loop_vinfo, int peel_iters_prologue, + int *peel_iters_epilogue, + stmt_vector_for_cost *scalar_cost_vec, + stmt_vector_for_cost *prologue_cost_vec, + stmt_vector_for_cost *epilogue_cost_vec) +{ + int retval = 0; + + *peel_iters_epilogue + = vect_get_peel_iters_epilogue (loop_vinfo, peel_iters_prologue); + + if (!LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)) + { + /* If peeled iterations are known but number of scalar loop + iterations are unknown, count a taken branch per peeled loop. */ + retval = record_stmt_cost (prologue_cost_vec, 1, cond_branch_taken, NULL, + NULL_TREE, 0, vect_prologue); + retval += record_stmt_cost (epilogue_cost_vec, 1, cond_branch_taken, NULL, + NULL_TREE, 0, vect_epilogue); } stmt_info_for_cost *si; @@ -3652,24 +3666,106 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo, TODO: Build an expression that represents peel_iters for prologue and epilogue to be used in a run-time test. */ - if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)) + bool prol_need_br_taken_cost = false; + bool prol_need_br_not_taken_cost = false; + + /* Calculate peel_iters_prologue. */ + if (vect_use_loop_mask_for_alignment_p (loop_vinfo)) + peel_iters_prologue = 0; + else if (npeel < 0) { - peel_iters_prologue = 0; - peel_iters_epilogue = 0; + peel_iters_prologue = assumed_vf / 2; + if (dump_enabled_p ()) + dump_printf (MSG_NOTE, "cost model: " + "prologue peel iters set to vf/2.\n"); - if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)) + /* If peeled iterations are unknown, count a taken branch and a not taken + branch per peeled loop. 
Even if scalar loop iterations are known, + vector iterations are not known since peeled prologue iterations are + not known. Hence guards remain the same. */ + prol_need_br_taken_cost = true; + prol_need_br_not_taken_cost = true; + } + else + peel_iters_prologue = npeel; + + bool epil_need_br_taken_cost = false; + bool epil_need_br_not_taken_cost = false; + + /* Calculate peel_iters_epilogue. */ + if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo)) + /* We need to peel exactly one iteration for gaps. */ + peel_iters_epilogue = LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo) ? 1 : 0; + else if (npeel < 0) + { + /* If peeling for alignment is unknown, loop bound of main loop + becomes unknown. */ + peel_iters_epilogue = assumed_vf / 2; + if (dump_enabled_p ()) + dump_printf (MSG_NOTE, "cost model: " + "epilogue peel iters set to vf/2 because " + "peeling for alignment is unknown.\n"); + + /* See the same reason above in peel_iters_prologue calculation. */ + epil_need_br_taken_cost = true; + epil_need_br_not_taken_cost = true; + } + else + { + peel_iters_epilogue = vect_get_peel_iters_epilogue (loop_vinfo, npeel); + if (!LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)) { - /* We need to peel exactly one iteration. */ - peel_iters_epilogue += 1; - stmt_info_for_cost *si; - int j; - FOR_EACH_VEC_ELT (LOOP_VINFO_SCALAR_ITERATION_COST (loop_vinfo), - j, si) - (void) add_stmt_cost (loop_vinfo, target_cost_data, si->count, - si->kind, si->stmt_info, si->vectype, - si->misalign, vect_epilogue); + /* If peeled iterations are known but number of scalar loop + iterations are unknown, count a taken branch per peeled loop. */ + prol_need_br_taken_cost = true; + epil_need_br_taken_cost = true; } + } + + stmt_info_for_cost *si; + int j; + /* Add costs associated with peel_iters_prologue. 
*/ + if (peel_iters_prologue) + FOR_EACH_VEC_ELT (LOOP_VINFO_SCALAR_ITERATION_COST (loop_vinfo), j, si) + { + (void) add_stmt_cost (loop_vinfo, target_cost_data, + si->count * peel_iters_prologue, si->kind, + si->stmt_info, si->vectype, si->misalign, + vect_prologue); + } + + /* Add costs associated with peel_iters_epilogue. */ + if (peel_iters_epilogue) + FOR_EACH_VEC_ELT (LOOP_VINFO_SCALAR_ITERATION_COST (loop_vinfo), j, si) + { + (void) add_stmt_cost (loop_vinfo, target_cost_data, + si->count * peel_iters_epilogue, si->kind, + si->stmt_info, si->vectype, si->misalign, + vect_epilogue); + } + + /* Add possible cond_branch_taken/cond_branch_not_taken cost. */ + if (prol_need_br_taken_cost) + (void) add_stmt_cost (loop_vinfo, target_cost_data, 1, cond_branch_taken, + NULL, NULL_TREE, 0, vect_prologue); + if (prol_need_br_not_taken_cost) + (void) add_stmt_cost (loop_vinfo, target_cost_data, 1, + cond_branch_not_taken, NULL, NULL_TREE, 0, + vect_prologue); + + if (epil_need_br_taken_cost) + (void) add_stmt_cost (loop_vinfo, target_cost_data, 1, cond_branch_taken, + NULL, NULL_TREE, 0, vect_epilogue); + + if (epil_need_br_not_taken_cost) + (void) add_stmt_cost (loop_vinfo, target_cost_data, 1, + cond_branch_not_taken, NULL, NULL_TREE, 0, + vect_epilogue); + + /* Take care of special costs for rgroup controls of partial vectors. */ + if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)) + { /* Calculate how many masks we need to generate. */ unsigned int num_masks = 0; rgroup_controls *rgm; @@ -3691,93 +3787,79 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo, simpler and safer to use the worst-case cost; if this ends up being the tie-breaker between vectorizing or not, then it's probably better not to vectorize. 
*/ - (void) add_stmt_cost (loop_vinfo, - target_cost_data, num_masks, vector_stmt, - NULL, NULL_TREE, 0, vect_prologue); - (void) add_stmt_cost (loop_vinfo, - target_cost_data, num_masks - 1, vector_stmt, - NULL, NULL_TREE, 0, vect_body); + (void) add_stmt_cost (loop_vinfo, target_cost_data, num_masks, + vector_stmt, NULL, NULL_TREE, 0, vect_prologue); + (void) add_stmt_cost (loop_vinfo, target_cost_data, num_masks - 1, + vector_stmt, NULL, NULL_TREE, 0, vect_body); } else if (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo)) { - peel_iters_prologue = 0; - peel_iters_epilogue = 0; - } - else if (npeel < 0) - { - peel_iters_prologue = assumed_vf / 2; - if (dump_enabled_p ()) - dump_printf (MSG_NOTE, "cost model: " - "prologue peel iters set to vf/2.\n"); - - /* If peeling for alignment is unknown, loop bound of main loop becomes - unknown. */ - peel_iters_epilogue = assumed_vf / 2; - if (dump_enabled_p ()) - dump_printf (MSG_NOTE, "cost model: " - "epilogue peel iters set to vf/2 because " - "peeling for alignment is unknown.\n"); + /* Referring to the functions vect_set_loop_condition_partial_vectors + and vect_set_loop_controls_directly, we need to generate each + length in the prologue and in the loop body if required. Although + there are some possible optimizations, we consider the worst case + here. */ + + /* For now we only operate length-based partial vectors on Power, + which has constant VF all the time; we will need some tweaks below + if that no longer holds in future. */ + gcc_assert (LOOP_VINFO_VECT_FACTOR (loop_vinfo).is_constant ()); + + /* For wrap around checking. 
*/ + tree compare_type = LOOP_VINFO_RGROUP_COMPARE_TYPE (loop_vinfo); + unsigned int compare_precision = TYPE_PRECISION (compare_type); + widest_int iv_limit = vect_iv_limit_for_partial_vectors (loop_vinfo); + + bool niters_known_p = LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo); + bool need_iterate_p + = (!LOOP_VINFO_EPILOGUE_P (loop_vinfo) + && !vect_known_niters_smaller_than_vf (loop_vinfo)); + + /* Init min/max, shift and minus cost relative to single + scalar_stmt. For now we only use length-based partial vectors on + Power, target specific cost tweaking may be needed for other + ports in future. */ + unsigned int min_max_cost = 2; + unsigned int shift_cost = 1, minus_cost = 1; + + /* Init cost relative to single scalar_stmt. */ + unsigned int prol_cnt = 0; + unsigned int body_cnt = 0; + + rgroup_controls *rgc; + unsigned int num_vectors_m1; + FOR_EACH_VEC_ELT (LOOP_VINFO_LENS (loop_vinfo), num_vectors_m1, rgc) + if (rgc->type) + { + unsigned nitems = rgc->max_nscalars_per_iter * rgc->factor; - /* If peeled iterations are unknown, count a taken branch and a not taken - branch per peeled loop. Even if scalar loop iterations are known, - vector iterations are not known since peeled prologue iterations are - not known. Hence guards remain the same. 
*/ - (void) add_stmt_cost (loop_vinfo, target_cost_data, 1, cond_branch_taken, - NULL, NULL_TREE, 0, vect_prologue); - (void) add_stmt_cost (loop_vinfo, - target_cost_data, 1, cond_branch_not_taken, - NULL, NULL_TREE, 0, vect_prologue); - (void) add_stmt_cost (loop_vinfo, target_cost_data, 1, cond_branch_taken, - NULL, NULL_TREE, 0, vect_epilogue); - (void) add_stmt_cost (loop_vinfo, - target_cost_data, 1, cond_branch_not_taken, - NULL, NULL_TREE, 0, vect_epilogue); - stmt_info_for_cost *si; - int j; - FOR_EACH_VEC_ELT (LOOP_VINFO_SCALAR_ITERATION_COST (loop_vinfo), j, si) - { - (void) add_stmt_cost (loop_vinfo, target_cost_data, - si->count * peel_iters_prologue, - si->kind, si->stmt_info, si->vectype, - si->misalign, - vect_prologue); - (void) add_stmt_cost (loop_vinfo, target_cost_data, - si->count * peel_iters_epilogue, - si->kind, si->stmt_info, si->vectype, - si->misalign, - vect_epilogue); - } - } - else - { - stmt_vector_for_cost prologue_cost_vec, epilogue_cost_vec; - stmt_info_for_cost *si; - int j; - void *data = LOOP_VINFO_TARGET_COST_DATA (loop_vinfo); + /* Need one shift for niters_total computation. */ + if (!niters_known_p && nitems != 1) + prol_cnt += shift_cost; - prologue_cost_vec.create (2); - epilogue_cost_vec.create (2); - peel_iters_prologue = npeel; + /* Need to handle wrap around. */ + if (iv_limit == -1 + || (wi::min_precision (iv_limit * nitems, UNSIGNED) + > compare_precision)) + prol_cnt += (min_max_cost + minus_cost); - (void) vect_get_known_peeling_cost (loop_vinfo, peel_iters_prologue, - &peel_iters_epilogue, - &LOOP_VINFO_SCALAR_ITERATION_COST - (loop_vinfo), - &prologue_cost_vec, - &epilogue_cost_vec); + /* Need to handle batch limit excepting for the 1st one. 
*/ + prol_cnt += (min_max_cost + minus_cost) * num_vectors_m1; - FOR_EACH_VEC_ELT (prologue_cost_vec, j, si) - (void) add_stmt_cost (loop_vinfo, - data, si->count, si->kind, si->stmt_info, - si->vectype, si->misalign, vect_prologue); + unsigned int num_vectors = num_vectors_m1 + 1; + /* Need to set up lengths in prologue, only one MIN required + since start index is zero. */ + prol_cnt += min_max_cost * num_vectors; - FOR_EACH_VEC_ELT (epilogue_cost_vec, j, si) - (void) add_stmt_cost (loop_vinfo, - data, si->count, si->kind, si->stmt_info, - si->vectype, si->misalign, vect_epilogue); + /* Need to update lengths in body for next iteration. */ + if (need_iterate_p) + body_cnt += (2 * min_max_cost + minus_cost) * num_vectors; + } - prologue_cost_vec.release (); - epilogue_cost_vec.release (); + (void) add_stmt_cost (loop_vinfo, target_cost_data, prol_cnt, scalar_stmt, + NULL, NULL_TREE, 0, vect_prologue); + (void) add_stmt_cost (loop_vinfo, target_cost_data, body_cnt, scalar_stmt, + NULL, NULL_TREE, 0, vect_body); } /* FORNOW: The scalar outside cost is incremented in one of the @@ -3913,8 +3995,8 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo, } /* ??? The "if" arm is written to handle all cases; see below for what - we would do for !LOOP_VINFO_FULLY_MASKED_P. */ - if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)) + we would do for !LOOP_VINFO_USING_PARTIAL_VECTORS_P. 
*/ + if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo)) { /* Rewriting the condition above in terms of the number of vector iterations (vniters) rather than the number of @@ -3941,7 +4023,7 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo, dump_printf (MSG_NOTE, " Minimum number of vector iterations: %d\n", min_vec_niters); - if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)) + if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo)) { /* Now that we know the minimum number of vector iterations, find the minimum niters for which the scalar cost is larger: @@ -3996,6 +4078,10 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo, && min_profitable_iters < (assumed_vf + peel_iters_prologue)) /* We want the vectorized loop to execute at least once. */ min_profitable_iters = assumed_vf + peel_iters_prologue; + else if (min_profitable_iters < peel_iters_prologue) + /* For LOOP_VINFO_USING_PARTIAL_VECTORS_P, we need to ensure the + vectorized loop to execute at least once. */ + min_profitable_iters = peel_iters_prologue; if (dump_enabled_p ()) dump_printf_loc (MSG_NOTE, vect_location, @@ -4013,7 +4099,7 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo, if (vec_outside_cost <= 0) min_profitable_estimate = 0; - else if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)) + else if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo)) { /* This is a repeat of the code above, but with + SOC rather than - SOC. */ @@ -4025,7 +4111,7 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo, if (outside_overhead > 0) min_vec_niters = outside_overhead / saving_per_viter + 1; - if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)) + if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo)) { int threshold = (vec_inside_cost * min_vec_niters + vec_outside_cost