From patchwork Fri Jan 22 22:11:20 2016 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Nathan Sidwell X-Patchwork-Id: 571824 Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@bilbo.ozlabs.org Received: from sourceware.org (server1.sourceware.org [209.132.180.131]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by ozlabs.org (Postfix) with ESMTPS id BFA731402DD for ; Sat, 23 Jan 2016 09:11:36 +1100 (AEDT) Authentication-Results: ozlabs.org; dkim=pass (1024-bit key; unprotected) header.d=gcc.gnu.org header.i=@gcc.gnu.org header.b=j29LMxtT; dkim-atps=neutral DomainKey-Signature: a=rsa-sha1; c=nofws; d=gcc.gnu.org; h=list-id :list-unsubscribe:list-archive:list-post:list-help:sender:to :from:subject:message-id:date:mime-version:content-type; q=dns; s=default; b=MhuPGWq90tv5UFZ3szjBmNlBSGn3v0D8xckHGs/0ruGFwya6qN +HQZu771tH6KwuLuXfqEqvag/mBSm5wohTg674QYQkDvpobzMOQYVtWCJdhCwOF8 kS8uqRqMZMKgbmhi38WeOhvNT4yH1NRrMKe+Ycd5GqUijK0g4yEuiI9e0= DKIM-Signature: v=1; a=rsa-sha1; c=relaxed; d=gcc.gnu.org; h=list-id :list-unsubscribe:list-archive:list-post:list-help:sender:to :from:subject:message-id:date:mime-version:content-type; s= default; bh=/H8Z4FV6bUY11X8/2gwYebRFocs=; b=j29LMxtTLLBEBL+yG7Ap pjQ3ycCA1bMUxN0qilxiAQltTtnYxLk84Fd3z72JwLOc1Em7SOyMm72Ix12q6gft 5EwfE5iw7iO7HkWklwMY2a1NIjxmcXEszExqwbrZim8R1JDJfm7oVjcpOW3r6nER ABaLSJTuQCn4CM0JkUAys8k= Received: (qmail 18800 invoked by alias); 22 Jan 2016 22:11:25 -0000 Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm Precedence: bulk List-Id: List-Unsubscribe: List-Archive: List-Post: List-Help: Sender: gcc-patches-owner@gcc.gnu.org Delivered-To: mailing list gcc-patches@gcc.gnu.org Received: (qmail 18783 invoked by uid 89); 22 Jan 2016 22:11:24 -0000 Authentication-Results: sourceware.org; auth=none X-Virus-Found: No X-Spam-SWARE-Status: No, score=-1.8 required=5.0 tests=BAYES_00, FREEMAIL_FROM, KAM_ASCII_DIVIDERS, RCVD_IN_DNSWL_LOW, SPF_PASS autolearn=no version=3.3.2 spammy=sk:nathan, sk:nathan@, U*nathan, nathancodesourcerycom X-HELO: mail-qg0-f47.google.com Received: from mail-qg0-f47.google.com (HELO mail-qg0-f47.google.com) (209.85.192.47) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with (AES128-GCM-SHA256 encrypted) ESMTPS; Fri, 22 Jan 2016 22:11:23 +0000 Received: by mail-qg0-f47.google.com with SMTP id o11so68371913qge.2 for ; Fri, 22 Jan 2016 14:11:23 -0800 (PST) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:sender:to:from:subject:message-id:date :user-agent:mime-version:content-type; bh=0qohxLzSq5W9Y8aOiHtBx2ZHLEqJMljDSSOo6nJWLlc=; b=GWgr183zY5vlPX+0+ai0X+JO+uvtFnFs6O48Vs7qGDEjQCc3Ki86+wOhaxEX5GWPau JVmzFoFd0Pby4QvNyy4pq2yTbL3jLL2uvUxPij9mn5rHgW7L09B0yByBQD0Mo2n57h9j ltKgQVH1HV+XmSFlDpYFpG0WVqVcP4LprRgaN0Nzko2O5FVfxzrYJ0tDZmBtKzTqd3O2 qWbtCNewXnjLfVzd2c6W1coFijK7ppUEJj6IQHgNcUAFFjDsLzMZQyblzs+3jWLBi7m5 K0hr/CbXdnFjRau+Q/UXJB1iOfoP8jQ1bx1wRRrWwLc2HIn4lEpoRDQcCjsX36OX8NRp 4nPw== X-Gm-Message-State: AG10YOT8sO2OtJ0KACfouQbAU9ZnRUYAxXg6YCn4970fVn1j6EjbEYWLTTeqTmwULPzySA== X-Received: by 10.140.97.202 with SMTP id m68mr6706730qge.102.1453500681181; Fri, 22 Jan 2016 14:11:21 -0800 (PST) Received: from ?IPv6:2601:181:c000:c497:a2a8:cdff:fe3e:b48? ([2601:181:c000:c497:a2a8:cdff:fe3e:b48]) by smtp.googlemail.com with ESMTPSA id f5sm3638050qkb.30.2016.01.22.14.11.20 (version=TLSv1/SSLv3 cipher=OTHER); Fri, 22 Jan 2016 14:11:20 -0800 (PST) To: GCC Patches From: Nathan Sidwell Subject: [gomp4] gang partitioning Message-ID: <56A2A908.2070001@acm.org> Date: Fri, 22 Jan 2016 17:11:20 -0500 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Thunderbird/38.5.0 MIME-Version: 1.0 I've committed this patch to gomp4 branch. It changes the auto partitioning logic to allocate the outermost loop to the outermost available partitioning. For instance, gang partitioning will be used for the outermost loop of a parallel region. Innermost loops remain partitioned at the innermost available level. This means that if we run out of available partitions, we've parallelized the outer loop and the innermost loops, rather than just parallelized the inner loops. nathan 2016-01-22 Nathan Sidwell gcc/ * omp-low.c (struct oacc_loop): Add 'inner' field. (new_oacc_loop_raw): Initialize it to zero. (oacc_loop_fixed_partitions): Initialize it. (oacc_loop_auto_partitions): Partition outermost loop to outermost available partitioning. gcc/testsuite/ * c-c++-common/goacc/loop-auto-1.c: Adjust expected warnings. libgomp/ * testsuite/libgomp.oacc-c-c++-common/loop-auto-1.c: Adjust expected partitioning. Index: libgomp/testsuite/libgomp.oacc-c-c++-common/loop-auto-1.c =================================================================== --- libgomp/testsuite/libgomp.oacc-c-c++-common/loop-auto-1.c (revision 232749) +++ libgomp/testsuite/libgomp.oacc-c-c++-common/loop-auto-1.c (working copy) @@ -102,9 +102,11 @@ int vector_1 (int *ary, int size) #pragma acc parallel num_workers (32) vector_length(32) copy(ary[0:size]) firstprivate (size) { +#pragma acc loop gang + for (int jx = 0; jx < 1; jx++) #pragma acc loop auto - for (int ix = 0; ix < size; ix++) - ary[ix] = place (); + for (int ix = 0; ix < size; ix++) + ary[ix] = place (); } return check (ary, size, 0, 0, 1); @@ -117,7 +119,7 @@ int vector_2 (int *ary, int size) #pragma acc parallel num_workers (32) vector_length(32) copy(ary[0:size]) firstprivate (size) { #pragma acc loop worker - for (int jx = 0; jx < size / 64; jx++) + for (int jx = 0; jx < size / 64; jx++) #pragma acc loop auto for (int ix = 0; ix < 64; ix++) ary[ix + jx * 64] = place (); @@ -132,30 +134,16 @@ int worker_1 (int *ary, int size) #pragma acc parallel num_workers (32) vector_length(32) copy(ary[0:size]) firstprivate (size) { +#pragma acc loop gang + for (int kx = 0; kx < 1; kx++) #pragma acc loop auto - for (int jx = 0; jx < size / 64; jx++) + for (int jx = 0; jx < size / 64; jx++) #pragma acc loop vector - for (int ix = 0; ix < 64; ix++) - ary[ix + jx * 64] = place (); - } - - return check (ary, size, 0, 1, 1); -} - -int worker_2 (int *ary, int size) -{ - clear (ary, size); - -#pragma acc parallel num_workers (32) vector_length(32) copy(ary[0:size]) firstprivate (size) - { -#pragma acc loop auto - for (int jx = 0; jx < size / 64; jx++) -#pragma acc loop auto - for (int ix = 0; ix < 64; ix++) - ary[ix + jx * 64] = place (); + for (int ix = 0; ix < 64; ix++) + ary[ix + jx * 64] = place (); } - return check (ary, size, 0, 1, 1); + return check (ary, size, 0, 1, 1); } int gang_1 (int *ary, int size) @@ -192,6 +180,22 @@ int gang_2 (int *ary, int size) return check (ary, size, 1, 1, 1); } +int gang_3 (int *ary, int size) +{ + clear (ary, size); + +#pragma acc parallel num_workers (32) vector_length(32) copy(ary[0:size]) firstprivate (size) + { +#pragma acc loop auto + for (int jx = 0; jx < size / 64; jx++) +#pragma acc loop auto + for (int ix = 0; ix < 64; ix++) + ary[ix + jx * 64] = place (); + } + + return check (ary, size, 1, 0, 1); +} + #define N (32*32*32) int main () { @@ -213,13 +217,13 @@ int main () if (worker_1 (ary, N)) return 1; - if (worker_2 (ary, N)) - return 1; if (gang_1 (ary, N)) return 1; if (gang_2 (ary, N)) return 1; + if (gang_3 (ary, N)) + return 1; return 0; } Index: gcc/omp-low.c =================================================================== --- gcc/omp-low.c (revision 232749) +++ gcc/omp-low.c (working copy) @@ -249,8 +249,9 @@ struct oacc_loop tree routine; /* Pseudo-loop enclosing a routine. */ unsigned mask; /* Partitioning mask. */ - unsigned flags; /* Partitioning flags. */ - tree chunk_size; /* Chunk size. */ + unsigned inner; /* Partitioning of inner loops. */ + unsigned flags; /* Partitioning flags. */ + tree chunk_size; /* Chunk size. */ gcall *head_end; /* Final marker of head sequence. */ }; @@ -19434,7 +19435,7 @@ new_oacc_loop_raw (oacc_loop *parent, lo memset (loop->tails, 0, sizeof (loop->tails)); loop->routine = NULL_TREE; - loop->mask = loop->flags = 0; + loop->mask = loop->flags = loop->inner = 0; loop->chunk_size = 0; loop->head_end = NULL; @@ -19941,8 +19942,11 @@ oacc_loop_fixed_partitions (oacc_loop *l mask_all |= this_mask; if (loop->child) - mask_all |= oacc_loop_fixed_partitions (loop->child, - outer_mask | this_mask); + { + loop->inner = oacc_loop_fixed_partitions (loop->child, + outer_mask | this_mask); + mask_all |= loop->inner; + } if (loop->sibling) mask_all |= oacc_loop_fixed_partitions (loop->sibling, outer_mask); @@ -19958,7 +19962,7 @@ oacc_loop_fixed_partitions (oacc_loop *l static unsigned oacc_loop_auto_partitions (oacc_loop *loop, unsigned outer_mask) { - unsigned inner_mask = 0; + bool assign = (loop->flags & OLF_AUTO) && (loop->flags & OLF_INDEPENDENT); bool noisy = true; #ifdef ACCEL_COMPILER @@ -19967,16 +19971,33 @@ oacc_loop_auto_partitions (oacc_loop *lo noisy = false; #endif + if (assign && outer_mask < GOMP_DIM_MASK (GOMP_DIM_MAX - 1)) + { + /* Allocate the outermost loop at the outermost available + level. */ + unsigned this_mask = outer_mask + 1; + + if (!(this_mask & loop->inner)) + loop->mask = this_mask; + } + if (loop->child) - inner_mask |= oacc_loop_auto_partitions (loop->child, - outer_mask | loop->mask); + { + unsigned child_mask = outer_mask | loop->mask; + + if (loop->mask || assign) + child_mask |= GOMP_DIM_MASK (GOMP_DIM_MAX); - if ((loop->flags & OLF_AUTO) && (loop->flags & OLF_INDEPENDENT)) + loop->inner = oacc_loop_auto_partitions (loop->child, child_mask); + } + + if (assign && !loop->mask) { + /* Allocate the loop at the innermost available level. */ unsigned this_mask = 0; /* Determine the outermost partitioning used within this loop. */ - this_mask = inner_mask | GOMP_DIM_MASK (GOMP_DIM_MAX); + this_mask = loop->inner | GOMP_DIM_MASK (GOMP_DIM_MAX); this_mask = (this_mask & -this_mask); /* Pick the partitioning just inside that one. */ @@ -19989,17 +20010,20 @@ oacc_loop_auto_partitions (oacc_loop *lo warning_at (loop->loc, 0, "insufficient partitioning available to parallelize loop"); - if (dump_file) - fprintf (dump_file, "Auto loop %s:%d assigned %d\n", - LOCATION_FILE (loop->loc), LOCATION_LINE (loop->loc), - this_mask); - loop->mask = this_mask; } - inner_mask |= loop->mask; + + if (assign && dump_file) + fprintf (dump_file, "Auto loop %s:%d assigned %d\n", + LOCATION_FILE (loop->loc), LOCATION_LINE (loop->loc), + loop->mask); + + unsigned inner_mask = 0; if (loop->sibling) inner_mask |= oacc_loop_auto_partitions (loop->sibling, outer_mask); + + inner_mask |= loop->inner | loop->mask; return inner_mask; } Index: gcc/testsuite/c-c++-common/goacc/loop-auto-1.c =================================================================== --- gcc/testsuite/c-c++-common/goacc/loop-auto-1.c (revision 232749) +++ gcc/testsuite/c-c++-common/goacc/loop-auto-1.c (working copy) @@ -186,10 +186,10 @@ void Worker (void) for (int jx = 0; jx < 10; jx++) {} } -#pragma acc loop auto /* { dg-warning "insufficient partitioning" } */ +#pragma acc loop auto for (int ix = 0; ix < 10; ix++) { -#pragma acc loop auto +#pragma acc loop auto /* { dg-warning "insufficient partitioning" } */ for (int jx = 0; jx < 10; jx++) { #pragma acc loop auto @@ -214,10 +214,10 @@ void Vector (void) #pragma acc loop auto for (int ix = 0; ix < 10; ix++) {} -#pragma acc loop auto /* { dg-warning "insufficient partitioning" } */ +#pragma acc loop auto for (int ix = 0; ix < 10; ix++) { -#pragma acc loop auto +#pragma acc loop auto /* { dg-warning "insufficient partitioning" } */ for (int jx = 0; jx < 10; jx++) {} } }