From patchwork Tue May 6 11:27:37 2014 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Evgeny Stupachenko X-Patchwork-Id: 346135 Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@bilbo.ozlabs.org Received: from sourceware.org (server1.sourceware.org [209.132.180.131]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by ozlabs.org (Postfix) with ESMTPS id 3C8F01412FA for ; Tue, 6 May 2014 21:27:59 +1000 (EST) DomainKey-Signature: a=rsa-sha1; c=nofws; d=gcc.gnu.org; h=list-id :list-unsubscribe:list-archive:list-post:list-help:sender :mime-version:in-reply-to:references:date:message-id:subject :from:to:cc:content-type; q=dns; s=default; b=OlGEN7JWmU+Fu8v7rk ljVpwoHdNgd9GDX20ME/I7duJ8NIALhUaiRJWf4I19xDox2cd7NPBT2mzNhxADj2 Cz3vfJQwmCarVolqCsaEN9+HWj3NXGQaBLhm/g0/qtHzGyo7b4MIbkRkMvj9kwOl gIe5m1C0QhHOQd2QHL9+3kkOU= DKIM-Signature: v=1; a=rsa-sha1; c=relaxed; d=gcc.gnu.org; h=list-id :list-unsubscribe:list-archive:list-post:list-help:sender :mime-version:in-reply-to:references:date:message-id:subject :from:to:cc:content-type; s=default; bh=dU0fOiCtCOavBPq/kbhBJIJi ZkI=; b=mx1E13QIw4IqnnV3+Wv2rK2i3qRCKlk5t2m6rzV5bGcF2hAvz8c41yWz ZLJvQ5kv3IjWVKRHtqePgLdjbEj6NYp30X6Dqn1USBqApzDfq8r/DgevuIcqBwLf VZUP8qHJPTj6RDlPxv7c0hPNfR5XaBDDP0ModxDBybckPLb2lfE= Received: (qmail 18119 invoked by alias); 6 May 2014 11:27:46 -0000 Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm Precedence: bulk List-Id: List-Unsubscribe: List-Archive: List-Post: List-Help: Sender: gcc-patches-owner@gcc.gnu.org Delivered-To: mailing list gcc-patches@gcc.gnu.org Received: (qmail 18015 invoked by uid 89); 6 May 2014 11:27:45 -0000 Authentication-Results: sourceware.org; auth=none X-Virus-Found: No X-Spam-SWARE-Status: No, score=-2.2 required=5.0 tests=AWL, BAYES_00, FREEMAIL_FROM, RCVD_IN_DNSWL_LOW, SPF_PASS autolearn=ham version=3.3.2 X-HELO: mail-ob0-f176.google.com Received: from mail-ob0-f176.google.com (HELO mail-ob0-f176.google.com) (209.85.214.176) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with (AES128-SHA encrypted) ESMTPS; Tue, 06 May 2014 11:27:41 +0000 Received: by mail-ob0-f176.google.com with SMTP id wo20so604448obc.7 for ; Tue, 06 May 2014 04:27:38 -0700 (PDT) MIME-Version: 1.0 X-Received: by 10.182.241.67 with SMTP id wg3mr37909869obc.16.1399375658002; Tue, 06 May 2014 04:27:38 -0700 (PDT) Received: by 10.76.170.39 with HTTP; Tue, 6 May 2014 04:27:37 -0700 (PDT) In-Reply-To: References: Date: Tue, 6 May 2014 15:27:37 +0400 Message-ID: Subject: Re: [PATCH, PR52252] Vectorization for load/store groups of size 3. From: Evgeny Stupachenko To: Richard Biener Cc: GCC Patches , Jakub Jelinek , Uros Bizjak X-IsSubscribed: yes The patch on cost model was successfully committed. I've separated the rest part of the patch on loads/stores group into 2: on loads group and on stores group. Below is first part on loads group. Bootstrap and make check passed on x86. Is it ok? ChangeLog: 2014-05-06 Evgeny Stupachenko * tree-vect-data-refs.c (vect_grouped_load_supported): New check for loads group of length 3. (vect_permute_load_chain): New permutations for loads group of length 3. * tree-vect-stmts.c (vect_model_load_cost): Change cost of vec_perm_shuffle for the new permutations. ChangeLog for testsuite: 2014-05-06 Evgeny Stupachenko PR tree-optimization/52252 * gcc.dg/vect/pr52252-ld.c: Test on loads group of size 3. On Wed, Apr 30, 2014 at 6:31 PM, Evgeny Stupachenko wrote: > Ping. > > On Fri, Apr 18, 2014 at 2:05 PM, Evgeny Stupachenko wrote: >> Hi, >> >> Merged with current master the patch passes bootstrap and is giving >> expected gains. >> Patch and new tests are attached. >> >> ChangeLog: >> >> 2014-04-18 Evgeny Stupachenko >> >> * tree-vect-data-refs.c (vect_grouped_store_supported): New >> check for stores group of length 3. >> (vect_permute_store_chain): New permutations for stores group of >> length 3. >> (vect_grouped_load_supported): New check for loads group of length 3. >> (vect_permute_load_chain): New permutations for loads group of length 3. >> * tree-vect-stmts.c (vect_model_store_cost): Change cost >> of vec_perm_shuffle for the new permutations. >> (vect_model_load_cost): Ditto. >> >> ChangeLog for testsuite: >> >> 2014-04-18 Evgeny Stupachenko >> >> PR tree-optimization/52252 >> * gcc.dg/vect/pr52252-ld.c: Test on loads group of size 3. >> * gcc.dg/vect/pr52252-st.c: Test on stores group of size 3. >> >> Evgeny >> >> On Thu, Mar 6, 2014 at 6:44 PM, Evgeny Stupachenko wrote: >>> Missed attachment. >>> >>> On Thu, Mar 6, 2014 at 6:42 PM, Evgeny Stupachenko wrote: >>>> I've separated the patch into 2: cost model tuning and load/store >>>> groups parallelism. >>>> SLM tuning was partially introduced in the patch: >>>> http://gcc.gnu.org/ml/gcc-patches/2014-03/msg00226.html >>>> The patch introducing vectorization for load/store groups of size 3 attached. >>>> >>>> Is it ok for stage1? >>>> >>>> ChangeLog: >>>> >>>> 2014-03-06 Evgeny Stupachenko >>>> >>>> * tree-vect-data-refs.c (vect_grouped_store_supported): New >>>> check for stores group of length 3. >>>> (vect_permute_store_chain): New permutations for stores group of >>>> length 3. >>>> (vect_grouped_load_supported): New check for loads group of length 3. >>>> (vect_permute_load_chain): New permutations for loads group of length 3. >>>> * tree-vect-stmts.c (vect_model_store_cost): Change cost >>>> of vec_perm_shuffle for the new permutations. >>>> (vect_model_load_cost): Ditto. >>>> >>>> >>>> >>>> On Tue, Feb 11, 2014 at 7:19 PM, Richard Biener wrote: >>>>> On Tue, 11 Feb 2014, Evgeny Stupachenko wrote: >>>>> >>>>>> Missed patch attached in plain-text. >>>>>> >>>>>> I have copyright assignment on file with the FSF covering work on GCC. >>>>>> >>>>>> Load/stores groups of length 3 is the most frequent non-power-of-2 >>>>>> case. It is used in RGB image processing (like test case in PR52252). >>>>>> For sure we can extend the patch to length 5 and more. However, this >>>>>> potentially affect performance on some other architectures and >>>>>> requires larger testing. So length 3 it is just first step.The >>>>>> algorithm in the patch could be modified for a general case in several >>>>>> steps. >>>>>> >>>>>> I understand that the patch should wait for the stage 1, however since >>>>>> its ready we can discuss it right now and make some changes (like >>>>>> general size of group). >>>>> >>>>> Other than that I'd like to see a vectorizer hook querying the cost of a >>>>> vec_perm_const expansion instead of adding vec_perm_shuffle >>>>> (thus requires the constant shuffle mask to be passed as well >>>>> as the vector type). That's more useful for other uses that >>>>> would require (arbitrary) shuffles. >>>>> >>>>> Didn't look at the rest of the patch yet - queued in my review >>>>> pipeline. >>>>> >>>>> Thanks, >>>>> Richard. >>>>> >>>>>> Thanks, >>>>>> Evgeny >>>>>> >>>>>> On Tue, Feb 11, 2014 at 5:00 PM, Richard Biener wrote: >>>>>> > >>>>>> > On Tue, 11 Feb 2014, Evgeny Stupachenko wrote: >>>>>> > >>>>>> > > Hi, >>>>>> > > >>>>>> > > The patch gives an expected 3 times gain for the test case in the PR52252 >>>>>> > > (and even 6 times for AVX2). >>>>>> > > It passes make check and bootstrap on x86. >>>>>> > > spec2000/spec2006 got no regressions/gains on x86. >>>>>> > > >>>>>> > > Is this patch ok? >>>>>> > >>>>>> > I've worked on generalizing the permutation support in the light >>>>>> > of the availability of the generic shuffle support in the IL >>>>>> > but hit some road-blocks in the way code-generation works for >>>>>> > group loads with permutations (I don't remember if I posted all patches). >>>>>> > >>>>>> > This patch seems to be to a slightly different place but it again >>>>>> > special-cases a specific permutation. Why's that? Why can't we >>>>>> > support groups of size 7 for example? So - can this be generalized >>>>>> > to support arbitrary non-power-of-two load/store groups? >>>>>> > >>>>>> > Other than that the patch has to wait for stage1 to open again, >>>>>> > of course. And it misses a testcase. >>>>>> > >>>>>> > Btw, do you have a copyright assignment on file with the FSF covering >>>>>> > work on GCC? >>>>>> > >>>>>> > Thanks, >>>>>> > Richard. >>>>>> > >>>>>> > > ChangeLog: >>>>>> > > >>>>>> > > 2014-02-11 Evgeny Stupachenko >>>>>> > > >>>>>> > > * target.h (vect_cost_for_stmt): Defining new cost vec_perm_shuffle. >>>>>> > > * tree-vect-data-refs.c (vect_grouped_store_supported): New >>>>>> > > check for stores group of length 3. >>>>>> > > (vect_permute_store_chain): New permutations for stores group of >>>>>> > > length 3. >>>>>> > > (vect_grouped_load_supported): New check for loads group of length >>>>>> > > 3. >>>>>> > > (vect_permute_load_chain): New permutations for loads group of >>>>>> > > length 3. >>>>>> > > * tree-vect-stmts.c (vect_model_store_cost): New cost >>>>>> > > vec_perm_shuffle >>>>>> > > for the new permutations. >>>>>> > > (vect_model_load_cost): Ditto. >>>>>> > > * config/aarch64/aarch64.c (builtin_vectorization_cost): Adding >>>>>> > > vec_perm_shuffle cost as equvivalent of vec_perm cost. >>>>>> > > * config/arm/arm.c: Ditto. >>>>>> > > * config/rs6000/rs6000.c: Ditto. >>>>>> > > * config/spu/spu.c: Ditto. >>>>>> > > * config/i386/x86-tune.def (TARGET_SLOW_PHUFFB): Target for slow >>>>>> > > byte >>>>>> > > shuffle on some x86 architectures. >>>>>> > > * config/i386/i386.h (processor_costs): Defining pshuffb cost. >>>>>> > > * config/i386/i386.c (processor_costs): Adding pshuffb cost. >>>>>> > > (ix86_builtin_vectorization_cost): Adding cost for the new >>>>>> > > permutations. >>>>>> > > Fixing cost for other permutations. >>>>>> > > (expand_vec_perm_even_odd_1): Avoid byte shuffles when they are >>>>>> > > slow (TARGET_SLOW_PHUFFB). >>>>>> > > (ix86_add_stmt_cost): Adding cost when STMT is WIDEN_MULTIPLY. >>>>>> > > Adding new shuffle cost only when byte shuffle is expected. >>>>>> > > Fixing cost model for Silvermont. >>>>>> > > >>>>>> > > Thanks, >>>>>> > > Evgeny >>>>>> > > >>>>>> > >>>>>> > -- >>>>>> > Richard Biener >>>>>> > SUSE / SUSE Labs >>>>>> > SUSE LINUX Products GmbH - Nuernberg - AG Nuernberg - HRB 16746 >>>>>> > GF: Jeff Hawn, Jennifer Guild, Felix Imend"orffer >>>>>> >>>>> >>>>> -- >>>>> Richard Biener >>>>> SUSE / SUSE Labs >>>>> SUSE LINUX Products GmbH - Nuernberg - AG Nuernberg - HRB 16746 >>>>> GF: Jeff Hawn, Jennifer Guild, Felix Imend"orffer diff --git a/gcc/tree-vect-data-refs.c b/gcc/tree-vect-data-refs.c index 274cdbd..feafb38 100644 --- a/gcc/tree-vect-data-refs.c +++ b/gcc/tree-vect-data-refs.c @@ -4812,36 +4812,74 @@ vect_grouped_load_supported (tree vectype, unsigned HOST_WIDE_INT count) { enum machine_mode mode = TYPE_MODE (vectype); - /* vect_permute_load_chain requires the group size to be a power of two. */ - if (exact_log2 (count) == -1) + /* vect_permute_load_chain requires the group size to be equal to 3 or + be a power of two. */ + if (count != 3 && exact_log2 (count) == -1) { if (dump_enabled_p ()) dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, - "the size of the group of accesses" - " is not a power of 2\n"); + "the size of the group of accesses" + " is not a power of 2 or not eqaul to 3\n"); return false; } /* Check that the permutation is supported. */ if (VECTOR_MODE_P (mode)) { - unsigned int i, nelt = GET_MODE_NUNITS (mode); + unsigned int i, j, nelt = GET_MODE_NUNITS (mode); unsigned char *sel = XALLOCAVEC (unsigned char, nelt); - for (i = 0; i < nelt; i++) - sel[i] = i * 2; - if (can_vec_perm_p (mode, false, sel)) + if (exact_log2 (count) != -1) { for (i = 0; i < nelt; i++) - sel[i] = i * 2 + 1; + sel[i] = i * 2; if (can_vec_perm_p (mode, false, sel)) - return true; + { + for (i = 0; i < nelt; i++) + sel[i] = i * 2 + 1; + if (can_vec_perm_p (mode, false, sel)) + return true; + } + } + else if (count == 3) + { + unsigned int k; + for (k = 0; k < 3; k++) + { + for (i = 0; i < nelt; i++) + if (3 * i + k < 2 * nelt) + sel[i] = 3 * i + k; + else + sel[i] = 0; + if (!can_vec_perm_p (mode, false, sel)) + { + if (dump_enabled_p ()) + dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, + "shuffle of 3 loads is not supported by \ + target\n"); + return false; + } + for (i = 0, j = 0; i < nelt; i++) + if (3 * i + k < 2 * nelt) + sel[i] = i; + else + sel[i] = nelt + ((nelt + k) % 3) + 3 * (j++); + if (!can_vec_perm_p (mode, false, sel)) + { + if (dump_enabled_p ()) + dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, + "shuffle of 3 loads is not supported by \ + target\n"); + return false; + } + } + return true; } } if (dump_enabled_p ()) dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, - "extract even/odd not supported by target\n"); + "extract even/odd not supported by target\n"); return false; } @@ -4859,8 +4897,9 @@ vect_load_lanes_supported (tree vectype, unsigned HOST_WIDE_INT count) /* Function vect_permute_load_chain. Given a chain of interleaved loads in DR_CHAIN of LENGTH that must be - a power of 2, generate extract_even/odd stmts to reorder the input data - correctly. Return the final references for loads in RESULT_CHAIN. + a power of 2 or equal to 3, generate extract_even/odd stmts to reorder + the input data correctly. Return the final references for loads in + RESULT_CHAIN. E.g., LENGTH is 4 and the scalar type is short, i.e., VF is 8. The input is 4 vectors each containing 8 elements. We assign a number to each @@ -4941,6 +4980,7 @@ vect_permute_load_chain (vec dr_chain, { tree data_ref, first_vect, second_vect; tree perm_mask_even, perm_mask_odd; + tree perm3_mask_low, perm3_mask_high; gimple perm_stmt; tree vectype = STMT_VINFO_VECTYPE (vinfo_for_stmt (stmt)); unsigned int i, j, log_length = exact_log2 (length); @@ -4951,45 +4991,99 @@ vect_permute_load_chain (vec dr_chain, memcpy (result_chain->address (), dr_chain.address (), length * sizeof (tree)); - for (i = 0; i < nelt; ++i) - sel[i] = i * 2; - perm_mask_even = vect_gen_perm_mask (vectype, sel); - gcc_assert (perm_mask_even != NULL); + if (log_length != (unsigned int)-1) + { + for (i = 0; i < nelt; ++i) + sel[i] = i * 2; + perm_mask_even = vect_gen_perm_mask (vectype, sel); + gcc_assert (perm_mask_even != NULL); - for (i = 0; i < nelt; ++i) - sel[i] = i * 2 + 1; - perm_mask_odd = vect_gen_perm_mask (vectype, sel); - gcc_assert (perm_mask_odd != NULL); + for (i = 0; i < nelt; ++i) + sel[i] = i * 2 + 1; + perm_mask_odd = vect_gen_perm_mask (vectype, sel); + gcc_assert (perm_mask_odd != NULL); - for (i = 0; i < log_length; i++) - { - for (j = 0; j < length; j += 2) + for (i = 0; i < log_length; i++) { - first_vect = dr_chain[j]; - second_vect = dr_chain[j+1]; + for (j = 0; j < length; j += 2) + { + first_vect = dr_chain[j]; + second_vect = dr_chain[j+1]; + + /* data_ref = permute_even (first_data_ref, second_data_ref); */ + data_ref = make_temp_ssa_name (vectype, NULL, "vect_perm_even"); + perm_stmt = gimple_build_assign_with_ops (VEC_PERM_EXPR, data_ref, + first_vect, second_vect, + perm_mask_even); + vect_finish_stmt_generation (stmt, perm_stmt, gsi); + (*result_chain)[j/2] = data_ref; + + /* data_ref = permute_odd (first_data_ref, second_data_ref); */ + data_ref = make_temp_ssa_name (vectype, NULL, "vect_perm_odd"); + perm_stmt = gimple_build_assign_with_ops (VEC_PERM_EXPR, data_ref, + first_vect, second_vect, + perm_mask_odd); + vect_finish_stmt_generation (stmt, perm_stmt, gsi); + (*result_chain)[j/2+length/2] = data_ref; + } + memcpy (dr_chain.address (), result_chain->address (), + length * sizeof (tree)); + } + } + /* length is not a power of 2. */ + else + { + unsigned int k; - /* data_ref = permute_even (first_data_ref, second_data_ref); */ - data_ref = make_temp_ssa_name (vectype, NULL, "vect_perm_even"); + /* currently only length 3 is supported as most frequent case. */ + gcc_assert (length == 3); + + for (k = 0; k < 3; k++) + { + for (i = 0; i < nelt; i++) + if (3 * i + k < 2 * nelt) + sel[i] = 3 * i + k; + else + sel[i] = 0; + perm3_mask_low = vect_gen_perm_mask (vectype, sel); + gcc_assert (perm3_mask_low != NULL); + + for (i = 0, j = 0; i < nelt; i++) + if (3 * i + k < 2 * nelt) + sel[i] = i; + else + sel[i] = nelt + ((nelt + k) % 3) + 3 * (j++); + + perm3_mask_high = vect_gen_perm_mask (vectype, sel); + gcc_assert (perm3_mask_high != NULL); + + first_vect = dr_chain[0]; + second_vect = dr_chain[1]; + + /* Create interleaving stmt (low part of): + low = VEC_PERM_EXPR */ + data_ref = make_temp_ssa_name (vectype, NULL, "vect_suffle3_low"); perm_stmt = gimple_build_assign_with_ops (VEC_PERM_EXPR, data_ref, first_vect, second_vect, - perm_mask_even); + perm3_mask_low); vect_finish_stmt_generation (stmt, perm_stmt, gsi); - (*result_chain)[j/2] = data_ref; - /* data_ref = permute_odd (first_data_ref, second_data_ref); */ - data_ref = make_temp_ssa_name (vectype, NULL, "vect_perm_odd"); + /* Create interleaving stmt (high part of): + high = VEC_PERM_EXPR */ + first_vect = data_ref; + second_vect = dr_chain[2]; + data_ref = make_temp_ssa_name (vectype, NULL, "vect_suffle3_high"); perm_stmt = gimple_build_assign_with_ops (VEC_PERM_EXPR, data_ref, first_vect, second_vect, - perm_mask_odd); + perm3_mask_high); vect_finish_stmt_generation (stmt, perm_stmt, gsi); - (*result_chain)[j/2+length/2] = data_ref; + (*result_chain)[k] = data_ref; } - memcpy (dr_chain.address (), result_chain->address (), - length * sizeof (tree)); } } - /* Function vect_transform_grouped_load. Given a chain of input interleaved data-refs (in DR_CHAIN), build statements diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c index 1a51d6d..b87c143 100644 --- a/gcc/tree-vect-stmts.c +++ b/gcc/tree-vect-stmts.c @@ -1091,10 +1091,11 @@ vect_model_load_cost (stmt_vec_info stmt_info, int ncopies, include the cost of the permutes. */ if (!load_lanes_p && group_size > 1) { - /* Uses an even and odd extract operations for each needed permute. */ - int nstmts = ncopies * exact_log2 (group_size) * group_size; - inside_cost += record_stmt_cost (body_cost_vec, nstmts, vec_perm, - stmt_info, 0, vect_body); + /* Uses an even and odd extract operations or shuffle operations + for each needed permute. */ + int nstmts = ncopies * ceil_log2 (group_size) * group_size; + inside_cost = record_stmt_cost (body_cost_vec, nstmts, vec_perm, + stmt_info, 0, vect_body); if (dump_enabled_p ()) dump_printf_loc (MSG_NOTE, vect_location, diff --git a/gcc/testsuite/gcc.dg/vect/pr52252-ld.c b/gcc/testsuite/gcc.dg/vect/pr52252-ld.c new file mode 100644 index 0000000..6e3cb52 --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/pr52252-ld.c @@ -0,0 +1,30 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -g -ftree-vectorize -mssse3 -fdump-tree-vect-details" { target { i?86-*-* x86_64-*-* } } } */ + +#define byte unsigned char + +void +matrix_mul (byte *in, byte *out, int size) +{ + int i; + for (i = 0; i < size; i++) + { + byte in0 = in[0]; + byte in1 = in[1]; + byte in2 = in[2]; + byte out0, out1, out2, out3; + out0 = in0 + in1; + out1 = in0 + in2; + out2 = in1 + in2; + out3 = in0 + in1 + in2; + out[0] = out0; + out[1] = out1; + out[2] = out2; + out[3] = out3; + in += 3; + out += 4; + } +} + +/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */ +/* { dg-final { cleanup-tree-dump "vect" } } */