From patchwork Wed May 3 19:43:09 2017
X-Patchwork-Submitter: Bill Schmidt
X-Patchwork-Id: 758188
To: GCC Patches
Cc: Segher Boessenkool, David Edelsohn
From: Bill Schmidt
Subject: [PATCH, rs6000] Avoid vectorizing versioned copy loops with vectorization factor 2
Date: Wed, 3 May 2017 14:43:09 -0500

Hi,

We recently became aware of some poor code generation as a result of
unprofitable (for POWER) loop vectorization. When a loop is simply copying
data with 64-bit loads and stores, vectorizing with 128-bit loads and stores
generally does not provide any benefit on modern POWER processors.
Furthermore, if there is a requirement to version the loop for aliasing,
alignment, etc., the cost of the versioning test is almost certainly a
performance loss for such loops. The user code example included such a copy
loop, executed only a few times on average, within an outer loop that was
executed many times on average, causing a tremendous slowdown.

This patch very specifically targets these kinds of loops and no others, and
artificially inflates the vectorization cost to ensure vectorization does not
appear profitable. This is done within the target model cost hooks to avoid
affecting other targets. A new test case is included that demonstrates the
refusal to vectorize.

We've done SPEC performance testing to verify that the patch does not degrade
such workloads. Results were all in the noise range. The customer code
performance loss was verified to have been reversed.

Bootstrapped and tested on powerpc64le-unknown-linux-gnu with no regressions.
Is this ok for trunk?

Thanks,
Bill


[gcc]

2017-05-03  Bill Schmidt

        * config/rs6000/rs6000.c (rs6000_vect_nonmem): New static var.
        (rs6000_init_cost): Initialize rs6000_vect_nonmem.
        (rs6000_add_stmt_cost): Update rs6000_vect_nonmem.
        (rs6000_finish_cost): Avoid vectorizing simple copy loops with
        VF=2 that require versioning.

[gcc/testsuite]

2017-05-03  Bill Schmidt

        * gcc.target/powerpc/versioned-copy-loop.c: New file.


Index: gcc/config/rs6000/rs6000.c
===================================================================
--- gcc/config/rs6000/rs6000.c (revision 247560)
+++ gcc/config/rs6000/rs6000.c (working copy)
@@ -5873,6 +5873,8 @@ rs6000_density_test (rs6000_cost_data *data)
 
 /* Implement targetm.vectorize.init_cost.  */
 
+static bool rs6000_vect_nonmem;
+
 static void *
 rs6000_init_cost (struct loop *loop_info)
 {
@@ -5881,6 +5883,7 @@ rs6000_init_cost (struct loop *loop_info)
   data->cost[vect_prologue] = 0;
   data->cost[vect_body] = 0;
   data->cost[vect_epilogue] = 0;
+  rs6000_vect_nonmem = false;
   return data;
 }
 
@@ -5907,6 +5910,19 @@ rs6000_add_stmt_cost (void *data, int count, enum
 
       retval = (unsigned) (count * stmt_cost);
       cost_data->cost[where] += retval;
+
+      /* Check whether we're doing something other than just a copy loop.
+         Not all such loops may be profitably vectorized; see
+         rs6000_finish_cost.  */
+      if ((where == vect_body
+           && (kind == vector_stmt || kind == vec_to_scalar || kind == vec_perm
+               || kind == vec_promote_demote || kind == vec_construct
+               || kind == scalar_to_vec))
+          || (where != vect_body
+              && (kind == vec_to_scalar || kind == vec_perm
+                  || kind == vec_promote_demote || kind == vec_construct
+                  || kind == scalar_to_vec)))
+        rs6000_vect_nonmem = true;
     }
 
   return retval;
@@ -5923,6 +5939,19 @@ rs6000_finish_cost (void *data, unsigned *prologue
   if (cost_data->loop_info)
     rs6000_density_test (cost_data);
 
+  /* Don't vectorize minimum-vectorization-factor, simple copy loops
+     that require versioning for any reason.  The vectorization is at
+     best a wash inside the loop, and the versioning checks make
+     profitability highly unlikely and potentially quite harmful.  */
+  if (cost_data->loop_info)
+    {
+      loop_vec_info vec_info = loop_vec_info_for_loop (cost_data->loop_info);
+      if (!rs6000_vect_nonmem
+          && LOOP_VINFO_VECT_FACTOR (vec_info) == 2
+          && LOOP_REQUIRES_VERSIONING (vec_info))
+        cost_data->cost[vect_body] += 10000;
+    }
+
   *prologue_cost = cost_data->cost[vect_prologue];
   *body_cost = cost_data->cost[vect_body];
   *epilogue_cost = cost_data->cost[vect_epilogue];
Index: gcc/testsuite/gcc.target/powerpc/versioned-copy-loop.c
===================================================================
--- gcc/testsuite/gcc.target/powerpc/versioned-copy-loop.c (nonexistent)
+++ gcc/testsuite/gcc.target/powerpc/versioned-copy-loop.c (working copy)
@@ -0,0 +1,30 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target powerpc_p8vector_ok } */
+/* { dg-options "-O3 -fdump-tree-vect-details" } */
+
+/* Verify that a pure copy loop with a vectorization factor of two
+   that requires alignment will not be vectorized.  See the cost
+   model hooks in rs6000.c.  */
+
+typedef long unsigned int size_t;
+typedef unsigned char uint8_t;
+
+extern void *memcpy (void *__restrict __dest, const void *__restrict __src,
+                     size_t __n) __attribute__ ((__nothrow__ , __leaf__)) __attribute__ ((__nonnull__ (1, 2)));
+
+void foo (void *dstPtr, const void *srcPtr, void *dstEnd)
+{
+  uint8_t *d = (uint8_t*)dstPtr;
+  const uint8_t *s = (const uint8_t*)srcPtr;
+  uint8_t* const e = (uint8_t*)dstEnd;
+
+  do
+    {
+      memcpy (d, s, 8);
+      d += 8;
+      s += 8;
+    }
+  while (d < e);
+}
+
+/* { dg-final { scan-tree-dump-times "vectorized 0 loops" 1 "vect" } } */
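
For reviewers who want to see the heuristic outside of the hooks, here is a minimal standalone sketch of the same decision logic. It is not part of the patch and does not use any GCC internals: the enums, struct cost_model, and the model_* and example_* functions are hypothetical stand-ins, and the per-statement costs of 1 are made up for illustration. Only the classification of statement kinds, the VF == 2 and versioning conditions, and the 10000 penalty are taken from the patch above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the vectorizer's statement-kind and
   cost-location enums; only the values this sketch needs.  */
enum stmt_kind { vector_load, vector_store, vector_stmt, vec_to_scalar,
                 scalar_to_vec, vec_perm, vec_promote_demote, vec_construct };
enum cost_location { vect_prologue, vect_body, vect_epilogue };

/* Hypothetical model state: accumulated costs plus the "saw something
   other than loads and stores" flag (rs6000_vect_nonmem in the patch).  */
struct cost_model
{
  unsigned cost[3];
  bool nonmem;
};

void
model_init (struct cost_model *m)
{
  m->cost[vect_prologue] = m->cost[vect_body] = m->cost[vect_epilogue] = 0;
  m->nonmem = false;
}

/* Accumulate a statement's cost and note whether it is non-memory work.
   This matches the patch's test: vector_stmt counts only in the loop
   body, the other five kinds count in any location.  */
void
model_add_stmt (struct cost_model *m, int count, enum stmt_kind kind,
                enum cost_location where, int stmt_cost)
{
  assert (where >= vect_prologue && where <= vect_epilogue);
  m->cost[where] += (unsigned) (count * stmt_cost);

  if ((kind == vector_stmt && where == vect_body)
      || kind == vec_to_scalar || kind == vec_perm
      || kind == vec_promote_demote || kind == vec_construct
      || kind == scalar_to_vec)
    m->nonmem = true;
}

/* Return the final body cost, inflated when the loop is a pure copy loop
   with vectorization factor 2 that requires versioning.  */
unsigned
model_finish (struct cost_model *m, int vect_factor, bool requires_versioning)
{
  if (!m->nonmem && vect_factor == 2 && requires_versioning)
    m->cost[vect_body] += 10000;
  return m->cost[vect_body];
}

/* Illustrative only: a loop body of one vector load and one vector store,
   each with a made-up cost of 1.  */
unsigned
example_copy_loop_cost (int vect_factor, bool requires_versioning)
{
  struct cost_model m;
  model_init (&m);
  model_add_stmt (&m, 1, vector_load, vect_body, 1);
  model_add_stmt (&m, 1, vector_store, vect_body, 1);
  return model_finish (&m, vect_factor, requires_versioning);
}
```

In this sketch only the combination of all three conditions triggers the penalty; a copy loop with VF=4, one that needs no versioning, or a loop containing any vector_stmt in its body keeps its ordinary accumulated cost.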