From patchwork Fri Apr 29 14:00:43 2016 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Nathan Sidwell X-Patchwork-Id: 616805 Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@bilbo.ozlabs.org Received: from sourceware.org (server1.sourceware.org [209.132.180.131]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by ozlabs.org (Postfix) with ESMTPS id 3qxFjK6Qzsz9t3p for ; Sat, 30 Apr 2016 00:01:08 +1000 (AEST) Authentication-Results: ozlabs.org; dkim=pass (1024-bit key; unprotected) header.d=gcc.gnu.org header.i=@gcc.gnu.org header.b=cirHWjaC; dkim-atps=neutral DomainKey-Signature: a=rsa-sha1; c=nofws; d=gcc.gnu.org; h=list-id :list-unsubscribe:list-archive:list-post:list-help:sender:to:cc :from:subject:message-id:date:mime-version:content-type; q=dns; s=default; b=phygfhASum2szkEVZHBiyao35eQPNk9beDi7hMBLFJAxExJE8Q EB0g3v/07kdbKxFKLZ5lftMHqd0XMG4c/y7U7aN8QPrGo279oVENYNKSpdibwhuy d/FkMvNIiJSzex2JsryX5RuMe4pUy++zgUj/Ou+rphiUaV5X8wZatHNsU= DKIM-Signature: v=1; a=rsa-sha1; c=relaxed; d=gcc.gnu.org; h=list-id :list-unsubscribe:list-archive:list-post:list-help:sender:to:cc :from:subject:message-id:date:mime-version:content-type; s= default; bh=9LYnPc+doxl0ilEsg/8fhrFq6ds=; b=cirHWjaCsk+ObNvFpq0c by0Wy0OR+SFB+bThRzQn06KERfq2kykaw8ohxH+hxoicU/Vadop14glFzOCcmSSc BAB0vI5tFNJTi22jJLFH8XTUExMcu20LGbpfm5SM76RWHH/Qp2SW64vjAQso5uXW pu1MIlrp1Rz0+0h9emU4cQk= Received: (qmail 28121 invoked by alias); 29 Apr 2016 14:01:00 -0000 Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm Precedence: bulk List-Id: List-Unsubscribe: List-Archive: List-Post: List-Help: Sender: gcc-patches-owner@gcc.gnu.org Delivered-To: mailing list gcc-patches@gcc.gnu.org Received: (qmail 28091 invoked by uid 89); 29 Apr 2016 14:00:57 -0000 Authentication-Results: sourceware.org; auth=none X-Virus-Found: No X-Spam-SWARE-Status: No, score=-1.8 required=5.0 tests=BAYES_00, FREEMAIL_FROM, KAM_ASCII_DIVIDERS, RCVD_IN_DNSWL_LOW, SPF_PASS autolearn=no version=3.3.2 spammy=Auto, Pick, 1187, 118, 7 X-HELO: mail-qk0-f170.google.com Received: from mail-qk0-f170.google.com (HELO mail-qk0-f170.google.com) (209.85.220.170) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with (AES128-GCM-SHA256 encrypted) ESMTPS; Fri, 29 Apr 2016 14:00:47 +0000 Received: by mail-qk0-f170.google.com with SMTP id x7so45576132qkd.3 for ; Fri, 29 Apr 2016 07:00:47 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:sender:to:cc:from:subject:message-id:date :user-agent:mime-version; bh=EuqeDen80ZykNpb9BbVxu+wLBmHWILEciuLqmdpf3f0=; b=U6Htu1xuvZ6YZOe29TPWNovsAoI7YRQlrfQayVstLfgayx5dmR2eV7GoIKwVA//isL cV8Ou5TJpeWoPB+15i2btSrgCOBtgtC+Pyc4+OTvJamGkFVyJ1dVkR+i1186nD92N54T CpZ+xp2K8/MTT7qx0W14ew+jwcjnlqwUhxJ3f/kFeEp5Mi5jZZI/34mRjzA8pjBz2he6 BtLJI9a0L2/BeliwIwHme6cbVKuOeiexLBvcFy6M019ylZddi2qPzpxygYBq1QkFWtzM BSpFIeB3VG5XFyz5XMo3SXzW+8N+5ltIvBTUkatQ7scn8Yi4NNLqtCF1Ut0JDIvjTpwo QLYg== X-Gm-Message-State: AOPr4FUqxLilhWpmHdBa5yEMxLLuRiMpWlXZh22ed4VUMZufEmajfDdi1UxGeU10D/99YA== X-Received: by 10.55.73.211 with SMTP id w202mr20730384qka.39.1461938445285; Fri, 29 Apr 2016 07:00:45 -0700 (PDT) Received: from ?IPv6:2601:181:c003:1930:a2a8:cdff:fe3e:b48? ([2601:181:c003:1930:a2a8:cdff:fe3e:b48]) by smtp.googlemail.com with ESMTPSA id t193sm4525591qke.21.2016.04.29.07.00.44 (version=TLSv1/SSLv3 cipher=OTHER); Fri, 29 Apr 2016 07:00:44 -0700 (PDT) To: Jakub Jelinek Cc: GCC Patches From: Nathan Sidwell Subject: [Openacc] Adjust automatic loop partitioning Message-ID: Date: Fri, 29 Apr 2016 10:00:43 -0400 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.0 MIME-Version: 1.0 Jakub, currently automatic loop partitioning assigns from the innermost loop outwards -- that was the simplest thing to implement. A better algorithm is to assign the outermost loop to the outermost available axis, and then assign from the innermost loop outwards. That way we (generally) get gang partitioning on the outermost loop. Just inside that we'll get non-partitioned loops if the nest is too deep, and the two innermost nested loops will get worker and vector partitioning. This patch has been on the gomp4 branch for a while. ok for trunk? nathan 2016-04-29 Nathan Sidwell gcc/ * omp-low.c (struct oacc_loop): Add 'inner' field. (new_oacc_loop_raw): Initialize it to zero. (oacc_loop_fixed_partitions): Initialize it. (oacc_loop_auto_partitions): Partition outermost loop to outermost available partitioning. gcc/testsuite/ * c-c++-common/goacc/loop-auto-1.c: Adjust expected warnings. libgomp/ * testsuite/libgomp.oacc-c-c++-common/loop-auto-1.c: Adjust expected partitioning. Index: libgomp/testsuite/libgomp.oacc-c-c++-common/loop-auto-1.c =================================================================== --- libgomp/testsuite/libgomp.oacc-c-c++-common/loop-auto-1.c (revision 235511) +++ libgomp/testsuite/libgomp.oacc-c-c++-common/loop-auto-1.c (working copy) @@ -103,9 +103,11 @@ int vector_1 (int *ary, int size) #pragma acc parallel num_workers (32) vector_length(32) copy(ary[0:size]) firstprivate (size) { +#pragma acc loop gang + for (int jx = 0; jx < 1; jx++) #pragma acc loop auto - for (int ix = 0; ix < size; ix++) - ary[ix] = place (); + for (int ix = 0; ix < size; ix++) + ary[ix] = place (); } return check (ary, size, 0, 0, 1); @@ -118,7 +120,7 @@ int vector_2 (int *ary, int size) #pragma acc parallel num_workers (32) vector_length(32) copy(ary[0:size]) firstprivate (size) { #pragma acc loop worker - for (int jx = 0; jx < size / 64; jx++) + for (int jx = 0; jx < size / 64; jx++) #pragma acc loop auto for (int ix = 0; ix < 64; ix++) ary[ix + jx * 64] = place (); @@ -133,30 +135,16 @@ int worker_1 (int *ary, int size) #pragma acc parallel num_workers (32) vector_length(32) copy(ary[0:size]) firstprivate (size) { +#pragma acc loop gang + for (int kx = 0; kx < 1; kx++) #pragma acc loop auto - for (int jx = 0; jx < size / 64; jx++) + for (int jx = 0; jx < size / 64; jx++) #pragma acc loop vector - for (int ix = 0; ix < 64; ix++) - ary[ix + jx * 64] = place (); - } - - return check (ary, size, 0, 1, 1); -} - -int worker_2 (int *ary, int size) -{ - clear (ary, size); - -#pragma acc parallel num_workers (32) vector_length(32) copy(ary[0:size]) firstprivate (size) - { -#pragma acc loop auto - for (int jx = 0; jx < size / 64; jx++) -#pragma acc loop auto - for (int ix = 0; ix < 64; ix++) - ary[ix + jx * 64] = place (); + for (int ix = 0; ix < 64; ix++) + ary[ix + jx * 64] = place (); } - return check (ary, size, 0, 1, 1); + return check (ary, size, 0, 1, 1); } int gang_1 (int *ary, int size) @@ -193,6 +181,22 @@ int gang_2 (int *ary, int size) return check (ary, size, 1, 1, 1); } +int gang_3 (int *ary, int size) +{ + clear (ary, size); + +#pragma acc parallel num_workers (32) vector_length(32) copy(ary[0:size]) firstprivate (size) + { +#pragma acc loop auto + for (int jx = 0; jx < size / 64; jx++) +#pragma acc loop auto + for (int ix = 0; ix < 64; ix++) + ary[ix + jx * 64] = place (); + } + + return check (ary, size, 1, 0, 1); +} + #define N (32*32*32) int main () { @@ -214,13 +218,13 @@ int main () if (worker_1 (ary, N)) return 1; - if (worker_2 (ary, N)) - return 1; if (gang_1 (ary, N)) return 1; if (gang_2 (ary, N)) return 1; + if (gang_3 (ary, N)) + return 1; return 0; } Index: gcc/omp-low.c =================================================================== --- gcc/omp-low.c (revision 235511) +++ gcc/omp-low.c (working copy) @@ -241,6 +241,7 @@ struct oacc_loop tree routine; /* Pseudo-loop enclosing a routine. */ unsigned mask; /* Partitioning mask. */ + unsigned inner; /* Partitioning of inner loops. */ unsigned flags; /* Partitioning flags. */ unsigned ifns; /* Contained loop abstraction functions. */ tree chunk_size; /* Chunk size. */ @@ -18921,7 +18922,7 @@ new_oacc_loop_raw (oacc_loop *parent, lo memset (loop->tails, 0, sizeof (loop->tails)); loop->routine = NULL_TREE; - loop->mask = loop->flags = 0; + loop->mask = loop->flags = loop->inner = 0; loop->ifns = 0; loop->chunk_size = 0; loop->head_end = NULL; @@ -19449,8 +19450,11 @@ oacc_loop_fixed_partitions (oacc_loop *l mask_all |= this_mask; if (loop->child) - mask_all |= oacc_loop_fixed_partitions (loop->child, - outer_mask | this_mask); + { + loop->inner = oacc_loop_fixed_partitions (loop->child, + outer_mask | this_mask); + mask_all |= loop->inner; + } if (loop->sibling) mask_all |= oacc_loop_fixed_partitions (loop->sibling, outer_mask); @@ -19466,7 +19470,7 @@ oacc_loop_fixed_partitions (oacc_loop *l static unsigned oacc_loop_auto_partitions (oacc_loop *loop, unsigned outer_mask) { - unsigned inner_mask = 0; + bool assign = (loop->flags & OLF_AUTO) && (loop->flags & OLF_INDEPENDENT); bool noisy = true; #ifdef ACCEL_COMPILER @@ -19475,16 +19479,33 @@ oacc_loop_auto_partitions (oacc_loop *lo noisy = false; #endif + if (assign && outer_mask < GOMP_DIM_MASK (GOMP_DIM_MAX - 1)) + { + /* Allocate the outermost loop at the outermost available + level. */ + unsigned this_mask = outer_mask + 1; + + if (!(this_mask & loop->inner)) + loop->mask = this_mask; + } + if (loop->child) - inner_mask |= oacc_loop_auto_partitions (loop->child, - outer_mask | loop->mask); + { + unsigned child_mask = outer_mask | loop->mask; + + if (loop->mask || assign) + child_mask |= GOMP_DIM_MASK (GOMP_DIM_MAX); - if ((loop->flags & OLF_AUTO) && (loop->flags & OLF_INDEPENDENT)) + loop->inner = oacc_loop_auto_partitions (loop->child, child_mask); + } + + if (assign && !loop->mask) { + /* Allocate the loop at the innermost available level. */ unsigned this_mask = 0; /* Determine the outermost partitioning used within this loop. */ - this_mask = inner_mask | GOMP_DIM_MASK (GOMP_DIM_MAX); + this_mask = loop->inner | GOMP_DIM_MASK (GOMP_DIM_MAX); this_mask = (this_mask & -this_mask); /* Pick the partitioning just inside that one. */ @@ -19497,17 +19518,20 @@ oacc_loop_auto_partitions (oacc_loop *lo warning_at (loop->loc, 0, "insufficient partitioning available to parallelize loop"); - if (dump_file) - fprintf (dump_file, "Auto loop %s:%d assigned %d\n", - LOCATION_FILE (loop->loc), LOCATION_LINE (loop->loc), - this_mask); - loop->mask = this_mask; } - inner_mask |= loop->mask; + + if (assign && dump_file) + fprintf (dump_file, "Auto loop %s:%d assigned %d\n", + LOCATION_FILE (loop->loc), LOCATION_LINE (loop->loc), + loop->mask); + + unsigned inner_mask = 0; if (loop->sibling) inner_mask |= oacc_loop_auto_partitions (loop->sibling, outer_mask); + + inner_mask |= loop->inner | loop->mask; return inner_mask; } Index: gcc/testsuite/c-c++-common/goacc/loop-auto-1.c =================================================================== --- gcc/testsuite/c-c++-common/goacc/loop-auto-1.c (revision 235511) +++ gcc/testsuite/c-c++-common/goacc/loop-auto-1.c (working copy) @@ -186,10 +186,10 @@ void Worker (void) for (int jx = 0; jx < 10; jx++) {} } -#pragma acc loop auto /* { dg-warning "insufficient partitioning" } */ +#pragma acc loop auto for (int ix = 0; ix < 10; ix++) { -#pragma acc loop auto +#pragma acc loop auto /* { dg-warning "insufficient partitioning" } */ for (int jx = 0; jx < 10; jx++) { #pragma acc loop auto @@ -214,10 +214,10 @@ void Vector (void) #pragma acc loop auto for (int ix = 0; ix < 10; ix++) {} -#pragma acc loop auto /* { dg-warning "insufficient partitioning" } */ +#pragma acc loop auto for (int ix = 0; ix < 10; ix++) { -#pragma acc loop auto +#pragma acc loop auto /* { dg-warning "insufficient partitioning" } */ for (int jx = 0; jx < 10; jx++) {} } }