From patchwork Wed Jul 22 15:48:09 2020
X-Patchwork-Submitter: "Kewen.Lin"
X-Patchwork-Id: 1333965
Subject: [PATCH v3] vect/rs6000: Support vector with length cost modeling
To: GCC Patches , richard.sandiford@arm.com
From: "Kewen.Lin"
Cc: Bill Schmidt , Segher Boessenkool
Message-ID: <1aeabdc7-0cf4-055b-a3ec-74c283053cf5@linux.ibm.com>
References: <419f1fad-05be-115c-1a53-cb710ae7b2dc@linux.ibm.com>
Date: Wed, 22 Jul 2020 23:48:09 +0800
List-Id: Gcc-patches mailing list

Hi Richard,

Thanks for the review!
on 2020/7/22 5:11 PM, Richard Sandiford wrote: > "Kewen.Lin" writes: >> - else if (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo)) >> - { >> - peel_iters_prologue = 0; >> - peel_iters_epilogue = 0; >> + if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)) >> + { >> + /* Calculate how many masks we need to generate. */ >> + unsigned int num_masks = 0; >> + rgroup_controls *rgm; >> + unsigned int num_vectors_m1; >> + FOR_EACH_VEC_ELT (LOOP_VINFO_MASKS (loop_vinfo), num_vectors_m1, rgm) >> + if (rgm->type) >> + num_masks += num_vectors_m1 + 1; >> + gcc_assert (num_masks > 0); >> + >> + /* In the worst case, we need to generate each mask in the prologue >> + and in the loop body. One of the loop body mask instructions >> + replaces the comparison in the scalar loop, and since we don't >> + count the scalar comparison against the scalar body, we shouldn't >> + count that vector instruction against the vector body either. >> + >> + Sometimes we can use unpacks instead of generating prologue >> + masks and sometimes the prologue mask will fold to a constant, >> + so the actual prologue cost might be smaller. However, it's >> + simpler and safer to use the worst-case cost; if this ends up >> + being the tie-breaker between vectorizing or not, then it's >> + probably better not to vectorize. */ >> + (void) add_stmt_cost (loop_vinfo, target_cost_data, num_masks, >> + vector_stmt, NULL, NULL_TREE, 0, vect_prologue); >> + (void) add_stmt_cost (loop_vinfo, target_cost_data, num_masks - 1, >> + vector_stmt, NULL, NULL_TREE, 0, vect_body); >> + } >> + else >> + { >> + gcc_assert (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo)); >> + >> + /* Consider cost for LOOP_VINFO_PEELING_FOR_ALIGNMENT. */ >> + if (npeel < 0) >> + { >> + peel_iters_prologue = assumed_vf / 2; >> + /* See below, if peeled iterations are unknown, count a taken >> + branch and a not taken branch per peeled loop. 
*/ >> + (void) add_stmt_cost (loop_vinfo, target_cost_data, 1, >> + cond_branch_taken, NULL, NULL_TREE, 0, >> + vect_prologue); >> + (void) add_stmt_cost (loop_vinfo, target_cost_data, 1, >> + cond_branch_not_taken, NULL, NULL_TREE, 0, >> + vect_prologue); >> + } >> + else >> + { >> + peel_iters_prologue = npeel; >> + if (!LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)) >> + /* See vect_get_known_peeling_cost, if peeled iterations are >> + known but number of scalar loop iterations are unknown, count >> + a taken branch per peeled loop. */ >> + (void) add_stmt_cost (loop_vinfo, target_cost_data, 1, >> + cond_branch_taken, NULL, NULL_TREE, 0, >> + vect_prologue); >> + } > > I think it'd be good to avoid duplicating this. How about the > following structure? > > if (vect_use_loop_mask_for_alignment_p (…)) > { > peel_iters_prologue = 0; > peel_iters_epilogue = 0; > } > else if (npeel < 0) > { > … // A > } > else > { > …vect_get_known_peeling_cost stuff… > } > > but in A and vect_get_known_peeling_cost, set peel_iters_epilogue to: > > LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo) ? 1 : 0 > > for LOOP_VINFO_USING_PARTIAL_VECTORS_P, instead of setting it to > whatever value we'd normally use. Then wrap: > > (void) add_stmt_cost (loop_vinfo, target_cost_data, 1, cond_branch_taken, > NULL, NULL_TREE, 0, vect_epilogue); > (void) add_stmt_cost (loop_vinfo, > target_cost_data, 1, cond_branch_not_taken, > NULL, NULL_TREE, 0, vect_epilogue); > > in !LOOP_VINFO_USING_PARTIAL_VECTORS_P and make the other vect_epilogue > stuff in A conditional on peel_iters_epilogue != 0. > > This will also remove the need for the existing LOOP_VINFO_FULLY_MASKED_P > code: > > if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)) > { > /* We need to peel exactly one iteration. 
*/ > peel_iters_epilogue += 1; > stmt_info_for_cost *si; > int j; > FOR_EACH_VEC_ELT (LOOP_VINFO_SCALAR_ITERATION_COST (loop_vinfo), > j, si) > (void) add_stmt_cost (loop_vinfo, target_cost_data, si->count, > si->kind, si->stmt_info, si->vectype, > si->misalign, vect_epilogue); > } > > Then, after the above, have: > > if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)) > …add costs for mask overhead… > else if (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo)) > …add costs for lengths overhead… > > So we'd have one block of code for estimating the prologue and epilogue > peeling cost, and a separate block of code for the loop control overhead. >

That's a great idea.  Following your subsequent suggestion, the updated v3
(attached) uses the structure:

  - calculate peel_iters_prologue
  - calculate peel_iters_epilogue
  - add costs associated with peel_iters_prologue
  - add costs associated with peel_iters_epilogue
  - add costs related to branch taken/not_taken

It has been bootstrapped/regtested on powerpc64le-linux-gnu (P9) with the
explicit param vect-partial-vector-usage=1; I'll also test it without the
partial vectors setting, and on aarch64 later.

BR,
Kewen

-----

gcc/ChangeLog:

	* config/rs6000/rs6000.c (adjust_vect_cost_per_loop): New function.
	(rs6000_finish_cost): Call adjust_vect_cost_per_loop.
	* tree-vect-loop.c (vect_get_known_peeling_cost): Factor out some code
	to determine peel_iters_epilogue to function ...
	(vect_get_peel_iters_epilogue): ... this.  New function.
	(vect_estimate_min_profitable_iters): Add cost modeling for vector
	with length, refactor cost calculation on peel_iters_prologue and
	peel_iters_epilogue.
diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c index 009afc5f894..d71f2bf1c16 100644 --- a/gcc/config/rs6000/rs6000.c +++ b/gcc/config/rs6000/rs6000.c @@ -5177,6 +5177,34 @@ rs6000_add_stmt_cost (class vec_info *vinfo, void *data, int count, return retval; } +/* For some target specific vectorization cost which can't be handled per stmt, + we check the requisite conditions and adjust the vectorization cost + accordingly if satisfied. One typical example is to model shift cost for + vector with length by counting number of required lengths under condition + LOOP_VINFO_FULLY_WITH_LENGTH_P. */ + +static void +adjust_vect_cost_per_loop (rs6000_cost_data *data) +{ + struct loop *loop = data->loop_info; + gcc_assert (loop); + loop_vec_info loop_vinfo = loop_vec_info_for_loop (loop); + + if (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo)) + { + rgroup_controls *rgc; + unsigned int num_vectors_m1; + unsigned int shift_cnt = 0; + FOR_EACH_VEC_ELT (LOOP_VINFO_LENS (loop_vinfo), num_vectors_m1, rgc) + if (rgc->type) + /* Each length needs one shift to fill into bits 0-7. */ + shift_cnt += (num_vectors_m1 + 1); + + rs6000_add_stmt_cost (loop_vinfo, (void *) data, shift_cnt, scalar_stmt, + NULL, NULL_TREE, 0, vect_body); + } +} + /* Implement targetm.vectorize.finish_cost. */ static void @@ -5186,7 +5214,10 @@ rs6000_finish_cost (void *data, unsigned *prologue_cost, rs6000_cost_data *cost_data = (rs6000_cost_data*) data; if (cost_data->loop_info) - rs6000_density_test (cost_data); + { + adjust_vect_cost_per_loop (cost_data); + rs6000_density_test (cost_data); + } /* Don't vectorize minimum-vectorization-factor, simple copy loops that require versioning for any reason. 
The vectorization is at diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c index e933441b922..c85b3ea1393 100644 --- a/gcc/tree-vect-loop.c +++ b/gcc/tree-vect-loop.c @@ -3474,42 +3474,56 @@ vect_is_simple_reduction (loop_vec_info loop_info, stmt_vec_info phi_info, return NULL; } -/* Calculate cost of peeling the loop PEEL_ITERS_PROLOGUE times. */ -int -vect_get_known_peeling_cost (loop_vec_info loop_vinfo, int peel_iters_prologue, - int *peel_iters_epilogue, - stmt_vector_for_cost *scalar_cost_vec, - stmt_vector_for_cost *prologue_cost_vec, - stmt_vector_for_cost *epilogue_cost_vec) +/* Calculate how many iterations are peeled for the epilogue, given LOOP_VINFO + and PEEL_ITERS_PROLOGUE. */ + +static int +vect_get_peel_iters_epilogue (loop_vec_info loop_vinfo, int peel_iters_prologue) { - int retval = 0; int assumed_vf = vect_vf_for_cost (loop_vinfo); - if (!LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)) { - *peel_iters_epilogue = assumed_vf / 2; if (dump_enabled_p ()) - dump_printf_loc (MSG_NOTE, vect_location, + dump_printf_loc (MSG_NOTE, vect_location, "cost model: epilogue peel iters set to vf/2 " "because loop iterations are unknown .\n"); - - /* If peeled iterations are known but number of scalar loop - iterations are unknown, count a taken branch per peeled loop. */ - retval = record_stmt_cost (prologue_cost_vec, 1, cond_branch_taken, - NULL, NULL_TREE, 0, vect_prologue); - retval += record_stmt_cost (epilogue_cost_vec, 1, cond_branch_taken, - NULL, NULL_TREE, 0, vect_epilogue); + return assumed_vf / 2; } else { int niters = LOOP_VINFO_INT_NITERS (loop_vinfo); - peel_iters_prologue = niters < peel_iters_prologue ? - niters : peel_iters_prologue; - *peel_iters_epilogue = (niters - peel_iters_prologue) % assumed_vf; + int npeel_prol + = niters < peel_iters_prologue ? niters : peel_iters_prologue; + int npeel_epil = (niters - npeel_prol) % assumed_vf; /* If we need to peel for gaps, but no peeling is required, we have to peel VF iterations. 
*/ - if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo) && !*peel_iters_epilogue) - *peel_iters_epilogue = assumed_vf; + if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo) && !npeel_epil) + npeel_epil = assumed_vf; + return npeel_epil; + } +} + +/* Calculate cost of peeling the loop PEEL_ITERS_PROLOGUE times. */ +int +vect_get_known_peeling_cost (loop_vec_info loop_vinfo, int peel_iters_prologue, + int *peel_iters_epilogue, + stmt_vector_for_cost *scalar_cost_vec, + stmt_vector_for_cost *prologue_cost_vec, + stmt_vector_for_cost *epilogue_cost_vec) +{ + int retval = 0; + + *peel_iters_epilogue + = vect_get_peel_iters_epilogue (loop_vinfo, peel_iters_prologue); + + if (!LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)) + { + /* If peeled iterations are known but number of scalar loop + iterations are unknown, count a taken branch per peeled loop. */ + retval = record_stmt_cost (prologue_cost_vec, 1, cond_branch_taken, NULL, + NULL_TREE, 0, vect_prologue); + retval += record_stmt_cost (epilogue_cost_vec, 1, cond_branch_taken, NULL, + NULL_TREE, 0, vect_epilogue); } stmt_info_for_cost *si; @@ -3652,24 +3666,106 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo, TODO: Build an expression that represents peel_iters for prologue and epilogue to be used in a run-time test. */ - if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)) + bool prol_need_br_taken_cost = false; + bool prol_need_br_not_taken_cost = false; + + /* Calculate peel_iters_prologue. */ + if (vect_use_loop_mask_for_alignment_p (loop_vinfo)) + peel_iters_prologue = 0; + else if (npeel < 0) { - peel_iters_prologue = 0; - peel_iters_epilogue = 0; + peel_iters_prologue = assumed_vf / 2; + if (dump_enabled_p ()) + dump_printf (MSG_NOTE, "cost model: " + "prologue peel iters set to vf/2.\n"); - if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)) + /* If peeled iterations are unknown, count a taken branch and a not taken + branch per peeled loop. 
Even if scalar loop iterations are known, + vector iterations are not known since peeled prologue iterations are + not known. Hence guards remain the same. */ + prol_need_br_taken_cost = true; + prol_need_br_not_taken_cost = true; + } + else + peel_iters_prologue = npeel; + + bool epil_need_br_taken_cost = false; + bool epil_need_br_not_taken_cost = false; + + /* Calculate peel_iters_epilogue. */ + if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo)) + /* We need to peel exactly one iteration for gaps. */ + peel_iters_epilogue = LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo) ? 1 : 0; + else if (npeel < 0) + { + /* If peeling for alignment is unknown, loop bound of main loop + becomes unknown. */ + peel_iters_epilogue = assumed_vf / 2; + if (dump_enabled_p ()) + dump_printf (MSG_NOTE, "cost model: " + "epilogue peel iters set to vf/2 because " + "peeling for alignment is unknown.\n"); + + /* See the same reason above in peel_iters_prologue calculation. */ + epil_need_br_taken_cost = true; + epil_need_br_not_taken_cost = true; + } + else + { + peel_iters_epilogue = vect_get_peel_iters_epilogue (loop_vinfo, npeel); + if (!LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)) { - /* We need to peel exactly one iteration. */ - peel_iters_epilogue += 1; - stmt_info_for_cost *si; - int j; - FOR_EACH_VEC_ELT (LOOP_VINFO_SCALAR_ITERATION_COST (loop_vinfo), - j, si) - (void) add_stmt_cost (loop_vinfo, target_cost_data, si->count, - si->kind, si->stmt_info, si->vectype, - si->misalign, vect_epilogue); + /* If peeled iterations are known but number of scalar loop + iterations are unknown, count a taken branch per peeled loop. */ + prol_need_br_taken_cost = true; + epil_need_br_taken_cost = true; } + } + + stmt_info_for_cost *si; + int j; + /* Add costs associated with peel_iters_prologue. 
*/ + if (peel_iters_prologue) + FOR_EACH_VEC_ELT (LOOP_VINFO_SCALAR_ITERATION_COST (loop_vinfo), j, si) + { + (void) add_stmt_cost (loop_vinfo, target_cost_data, + si->count * peel_iters_prologue, si->kind, + si->stmt_info, si->vectype, si->misalign, + vect_prologue); + } + + /* Add costs associated with peel_iters_epilogue. */ + if (peel_iters_epilogue) + FOR_EACH_VEC_ELT (LOOP_VINFO_SCALAR_ITERATION_COST (loop_vinfo), j, si) + { + (void) add_stmt_cost (loop_vinfo, target_cost_data, + si->count * peel_iters_epilogue, si->kind, + si->stmt_info, si->vectype, si->misalign, + vect_epilogue); + } + + /* Add possible cond_branch_taken/cond_branch_not_taken cost. */ + if (prol_need_br_taken_cost) + (void) add_stmt_cost (loop_vinfo, target_cost_data, 1, cond_branch_taken, + NULL, NULL_TREE, 0, vect_prologue); + if (prol_need_br_not_taken_cost) + (void) add_stmt_cost (loop_vinfo, target_cost_data, 1, + cond_branch_not_taken, NULL, NULL_TREE, 0, + vect_prologue); + + if (epil_need_br_taken_cost) + (void) add_stmt_cost (loop_vinfo, target_cost_data, 1, cond_branch_taken, + NULL, NULL_TREE, 0, vect_epilogue); + + if (epil_need_br_not_taken_cost) + (void) add_stmt_cost (loop_vinfo, target_cost_data, 1, + cond_branch_not_taken, NULL, NULL_TREE, 0, + vect_epilogue); + + /* Take care of special costs for rgroup controls of partial vectors. */ + if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)) + { /* Calculate how many masks we need to generate. */ unsigned int num_masks = 0; rgroup_controls *rgm; @@ -3691,93 +3787,79 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo, simpler and safer to use the worst-case cost; if this ends up being the tie-breaker between vectorizing or not, then it's probably better not to vectorize. 
*/ - (void) add_stmt_cost (loop_vinfo, - target_cost_data, num_masks, vector_stmt, - NULL, NULL_TREE, 0, vect_prologue); - (void) add_stmt_cost (loop_vinfo, - target_cost_data, num_masks - 1, vector_stmt, - NULL, NULL_TREE, 0, vect_body); + (void) add_stmt_cost (loop_vinfo, target_cost_data, num_masks, + vector_stmt, NULL, NULL_TREE, 0, vect_prologue); + (void) add_stmt_cost (loop_vinfo, target_cost_data, num_masks - 1, + vector_stmt, NULL, NULL_TREE, 0, vect_body); } else if (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo)) { - peel_iters_prologue = 0; - peel_iters_epilogue = 0; - } - else if (npeel < 0) - { - peel_iters_prologue = assumed_vf / 2; - if (dump_enabled_p ()) - dump_printf (MSG_NOTE, "cost model: " - "prologue peel iters set to vf/2.\n"); - - /* If peeling for alignment is unknown, loop bound of main loop becomes - unknown. */ - peel_iters_epilogue = assumed_vf / 2; - if (dump_enabled_p ()) - dump_printf (MSG_NOTE, "cost model: " - "epilogue peel iters set to vf/2 because " - "peeling for alignment is unknown.\n"); + /* Referring to the functions vect_set_loop_condition_partial_vectors + and vect_set_loop_controls_directly, we need to generate each + length in the prologue and in the loop body if required. Although + there are some possible optimizations, we consider the worst case + here. */ + + /* For now we only operate length-based partial vectors on Power, + which has constant VF all the time; we will need some tweaks below + if that no longer holds in future. */ + gcc_assert (LOOP_VINFO_VECT_FACTOR (loop_vinfo).is_constant ()); + + /* For wrap around checking. 
*/ + tree compare_type = LOOP_VINFO_RGROUP_COMPARE_TYPE (loop_vinfo); + unsigned int compare_precision = TYPE_PRECISION (compare_type); + widest_int iv_limit = vect_iv_limit_for_partial_vectors (loop_vinfo); + + bool niters_known_p = LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo); + bool need_iterate_p + = (!LOOP_VINFO_EPILOGUE_P (loop_vinfo) + && !vect_known_niters_smaller_than_vf (loop_vinfo)); + + /* Init min/max, shift and minus cost relative to single + scalar_stmt. For now we only use length-based partial vectors on + Power, target specific cost tweaking may be needed for other + ports in future. */ + unsigned int min_max_cost = 2; + unsigned int shift_cost = 1, minus_cost = 1; + + /* Init cost relative to single scalar_stmt. */ + unsigned int prol_cnt = 0; + unsigned int body_cnt = 0; + + rgroup_controls *rgc; + unsigned int num_vectors_m1; + FOR_EACH_VEC_ELT (LOOP_VINFO_LENS (loop_vinfo), num_vectors_m1, rgc) + if (rgc->type) + { + unsigned nitems = rgc->max_nscalars_per_iter * rgc->factor; - /* If peeled iterations are unknown, count a taken branch and a not taken - branch per peeled loop. Even if scalar loop iterations are known, - vector iterations are not known since peeled prologue iterations are - not known. Hence guards remain the same. 
*/ - (void) add_stmt_cost (loop_vinfo, target_cost_data, 1, cond_branch_taken, - NULL, NULL_TREE, 0, vect_prologue); - (void) add_stmt_cost (loop_vinfo, - target_cost_data, 1, cond_branch_not_taken, - NULL, NULL_TREE, 0, vect_prologue); - (void) add_stmt_cost (loop_vinfo, target_cost_data, 1, cond_branch_taken, - NULL, NULL_TREE, 0, vect_epilogue); - (void) add_stmt_cost (loop_vinfo, - target_cost_data, 1, cond_branch_not_taken, - NULL, NULL_TREE, 0, vect_epilogue); - stmt_info_for_cost *si; - int j; - FOR_EACH_VEC_ELT (LOOP_VINFO_SCALAR_ITERATION_COST (loop_vinfo), j, si) - { - (void) add_stmt_cost (loop_vinfo, target_cost_data, - si->count * peel_iters_prologue, - si->kind, si->stmt_info, si->vectype, - si->misalign, - vect_prologue); - (void) add_stmt_cost (loop_vinfo, target_cost_data, - si->count * peel_iters_epilogue, - si->kind, si->stmt_info, si->vectype, - si->misalign, - vect_epilogue); - } - } - else - { - stmt_vector_for_cost prologue_cost_vec, epilogue_cost_vec; - stmt_info_for_cost *si; - int j; - void *data = LOOP_VINFO_TARGET_COST_DATA (loop_vinfo); + /* Need one shift for niters_total computation. */ + if (!niters_known_p && nitems != 1) + prol_cnt += shift_cost; - prologue_cost_vec.create (2); - epilogue_cost_vec.create (2); - peel_iters_prologue = npeel; + /* Need to handle wrap around. */ + if (iv_limit == -1 + || (wi::min_precision (iv_limit * nitems, UNSIGNED) + > compare_precision)) + prol_cnt += (min_max_cost + minus_cost); - (void) vect_get_known_peeling_cost (loop_vinfo, peel_iters_prologue, - &peel_iters_epilogue, - &LOOP_VINFO_SCALAR_ITERATION_COST - (loop_vinfo), - &prologue_cost_vec, - &epilogue_cost_vec); + /* Need to handle batch limit excepting for the 1st one. 
*/ + prol_cnt += (min_max_cost + minus_cost) * num_vectors_m1; - FOR_EACH_VEC_ELT (prologue_cost_vec, j, si) - (void) add_stmt_cost (loop_vinfo, - data, si->count, si->kind, si->stmt_info, - si->vectype, si->misalign, vect_prologue); + unsigned int num_vectors = num_vectors_m1 + 1; + /* Need to set up lengths in prologue, only one MIN required + since start index is zero. */ + prol_cnt += min_max_cost * num_vectors; - FOR_EACH_VEC_ELT (epilogue_cost_vec, j, si) - (void) add_stmt_cost (loop_vinfo, - data, si->count, si->kind, si->stmt_info, - si->vectype, si->misalign, vect_epilogue); + /* Need to update lengths in body for next iteration. */ + if (need_iterate_p) + body_cnt += (2 * min_max_cost + minus_cost) * num_vectors; + } - prologue_cost_vec.release (); - epilogue_cost_vec.release (); + (void) add_stmt_cost (loop_vinfo, target_cost_data, prol_cnt, scalar_stmt, + NULL, NULL_TREE, 0, vect_prologue); + (void) add_stmt_cost (loop_vinfo, target_cost_data, body_cnt, scalar_stmt, + NULL, NULL_TREE, 0, vect_body); } /* FORNOW: The scalar outside cost is incremented in one of the @@ -3913,8 +3995,8 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo, } /* ??? The "if" arm is written to handle all cases; see below for what - we would do for !LOOP_VINFO_FULLY_MASKED_P. */ - if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)) + we would do for !LOOP_VINFO_USING_PARTIAL_VECTORS_P. 
*/ + if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo)) { /* Rewriting the condition above in terms of the number of vector iterations (vniters) rather than the number of @@ -3941,7 +4023,7 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo, dump_printf (MSG_NOTE, " Minimum number of vector iterations: %d\n", min_vec_niters); - if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)) + if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo)) { /* Now that we know the minimum number of vector iterations, find the minimum niters for which the scalar cost is larger: @@ -3996,6 +4078,10 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo, && min_profitable_iters < (assumed_vf + peel_iters_prologue)) /* We want the vectorized loop to execute at least once. */ min_profitable_iters = assumed_vf + peel_iters_prologue; + else if (min_profitable_iters < peel_iters_prologue) + /* For LOOP_VINFO_USING_PARTIAL_VECTORS_P, we need to ensure the + vectorized loop to execute at least once. */ + min_profitable_iters = peel_iters_prologue; if (dump_enabled_p ()) dump_printf_loc (MSG_NOTE, vect_location, @@ -4013,7 +4099,7 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo, if (vec_outside_cost <= 0) min_profitable_estimate = 0; - else if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)) + else if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo)) { /* This is a repeat of the code above, but with + SOC rather than - SOC. */ @@ -4025,7 +4111,7 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo, if (outside_overhead > 0) min_vec_niters = outside_overhead / saving_per_viter + 1; - if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)) + if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo)) { int threshold = (vec_inside_cost * min_vec_niters + vec_outside_cost