From patchwork Fri Sep 18 11:25:27 2020
X-Patchwork-Submitter: Andrew Stubbs
X-Patchwork-Id: 1366831
From: Andrew Stubbs
Subject: [PATCH] amdgcn, nvptx: Disable OMP barriers in nested teams
To: gcc-patches@gcc.gnu.org
Message-ID: <7024bb16-6b91-f5ff-f72e-92c298c73f3b@codesourcery.com>
Date: Fri, 18 Sep 2020 12:25:27 +0100

This patch fixes a problem in which nested OpenMP parallel regions cause
errors if the number of inner teams is not balanced (i.e. the number of
loop iterations is not divisible by the number of physical threads).  A
testcase is included.
On NVPTX the symptom was a fatal error:

  libgomp: cuCtxSynchronize error: an illegal instruction was encountered

This was caused by mismatched "bar.sync" instructions (one waiting for
32 threads while another waited for 256).  The source of the mismatch
was that some threads were still busy while others had run out of work
to do.

On GCN there was no such error (GCN barriers always wait for all
threads), but it worked only by chance: the idle threads were "matching"
different barriers to the busy threads, which was harmless because the
thread function pointer remained NULL.

This patch simply skips barriers when they would "wait" for only one
thread (the current thread).  This means that teams nested inside other
teams now run independently, instead of strictly in lock-step, and is
only valid as long as inner teams are limited to one thread each
(currently the case).  When the inner regions exit, the barriers for the
outer region sync everything up again.

OK to commit?

Andrew

P.S. I can approve the amdgcn portion myself; I'm seeking approval for
the nvptx portion.

libgomp: disable barriers in nested teams

Both GCN and NVPTX allow nested parallel regions, but the barrier
implementation did not allow the nested teams to run independently of
each other (due to hardware limitations).  This patch fixes that, under
the assumption that each thread will create a new subteam of one thread,
by simply not using barriers when there's no other thread to
synchronise.

libgomp/ChangeLog:

	* config/gcn/bar.c (gomp_barrier_wait_end): Skip the barrier if
	the total number of threads is one.
	(gomp_team_barrier_wake): Likewise.
	(gomp_team_barrier_wait_end): Likewise.
	(gomp_team_barrier_wait_cancel_end): Likewise.
	* config/nvptx/bar.c (gomp_barrier_wait_end): Likewise.
	(gomp_team_barrier_wake): Likewise.
	(gomp_team_barrier_wait_end): Likewise.
	(gomp_team_barrier_wait_cancel_end): Likewise.
	* testsuite/libgomp.c-c++-common/nested-parallel-unbalanced.c:
	New test.
diff --git a/libgomp/config/gcn/bar.c b/libgomp/config/gcn/bar.c
index 02fd19710d4..a21529a624b 100644
--- a/libgomp/config/gcn/bar.c
+++ b/libgomp/config/gcn/bar.c
@@ -43,7 +43,8 @@ gomp_barrier_wait_end (gomp_barrier_t *bar, gomp_barrier_state_t state)
       __atomic_store_n (&bar->generation, bar->generation + BAR_INCR,
                         MEMMODEL_RELAXED);
     }
-  asm ("s_barrier" ::: "memory");
+  if (bar->total > 1)
+    asm ("s_barrier" ::: "memory");
 }
 
 void
@@ -71,7 +72,8 @@ gomp_barrier_wait_last (gomp_barrier_t *bar)
 void
 gomp_team_barrier_wake (gomp_barrier_t *bar, int count)
 {
-  asm ("s_barrier" ::: "memory");
+  if (bar->total > 1)
+    asm ("s_barrier" ::: "memory");
 }
 
 void
@@ -97,7 +99,8 @@ gomp_team_barrier_wait_end (gomp_barrier_t *bar, gomp_barrier_state_t state)
           state &= ~BAR_CANCELLED;
           state += BAR_INCR - BAR_WAS_LAST;
           __atomic_store_n (&bar->generation, state, MEMMODEL_RELAXED);
-          asm ("s_barrier" ::: "memory");
+          if (bar->total > 1)
+            asm ("s_barrier" ::: "memory");
           return;
         }
     }
@@ -172,7 +175,8 @@ gomp_team_barrier_wait_cancel_end (gomp_barrier_t *bar,
         {
           state += BAR_INCR - BAR_WAS_LAST;
           __atomic_store_n (&bar->generation, state, MEMMODEL_RELAXED);
-          asm ("s_barrier" ::: "memory");
+          if (bar->total > 1)
+            asm ("s_barrier" ::: "memory");
           return false;
         }
     }
@@ -195,7 +199,8 @@ gomp_team_barrier_wait_cancel_end (gomp_barrier_t *bar,
       abort();
     }
 
-  asm ("s_barrier" ::: "memory");
+  if (bar->total > 1)
+    asm ("s_barrier" ::: "memory");
   gen = __atomic_load_n (&bar->generation, MEMMODEL_RELAXED);
   if (__builtin_expect (gen & BAR_CANCELLED, 0))
     return true;
diff --git a/libgomp/config/nvptx/bar.c b/libgomp/config/nvptx/bar.c
index 125ca3e49ec..0a723087b9e 100644
--- a/libgomp/config/nvptx/bar.c
+++ b/libgomp/config/nvptx/bar.c
@@ -41,7 +41,8 @@ gomp_barrier_wait_end (gomp_barrier_t *bar, gomp_barrier_state_t state)
       __atomic_store_n (&bar->generation, bar->generation + BAR_INCR,
                         MEMMODEL_RELEASE);
     }
-  asm ("bar.sync 1, %0;" : : "r" (32 * bar->total));
+  if (bar->total > 1)
+    asm ("bar.sync 1, %0;" : : "r" (32 * bar->total));
 }
 
 void
@@ -69,7 +70,8 @@ gomp_barrier_wait_last (gomp_barrier_t *bar)
 void
 gomp_team_barrier_wake (gomp_barrier_t *bar, int count)
 {
-  asm ("bar.sync 1, %0;" : : "r" (32 * bar->total));
+  if (bar->total > 1)
+    asm ("bar.sync 1, %0;" : : "r" (32 * bar->total));
 }
 
 void
@@ -95,7 +97,8 @@ gomp_team_barrier_wait_end (gomp_barrier_t *bar, gomp_barrier_state_t state)
           state &= ~BAR_CANCELLED;
           state += BAR_INCR - BAR_WAS_LAST;
           __atomic_store_n (&bar->generation, state, MEMMODEL_RELEASE);
-          asm ("bar.sync 1, %0;" : : "r" (32 * bar->total));
+          if (bar->total > 1)
+            asm ("bar.sync 1, %0;" : : "r" (32 * bar->total));
           return;
         }
     }
@@ -104,7 +107,8 @@ gomp_team_barrier_wait_end (gomp_barrier_t *bar, gomp_barrier_state_t state)
   state &= ~BAR_CANCELLED;
   do
     {
-      asm ("bar.sync 1, %0;" : : "r" (32 * bar->total));
+      if (bar->total > 1)
+        asm ("bar.sync 1, %0;" : : "r" (32 * bar->total));
       gen = __atomic_load_n (&bar->generation, MEMMODEL_ACQUIRE);
       if (__builtin_expect (gen & BAR_TASK_PENDING, 0))
         {
@@ -158,7 +162,8 @@ gomp_team_barrier_wait_cancel_end (gomp_barrier_t *bar,
         {
           state += BAR_INCR - BAR_WAS_LAST;
           __atomic_store_n (&bar->generation, state, MEMMODEL_RELEASE);
-          asm ("bar.sync 1, %0;" : : "r" (32 * bar->total));
+          if (bar->total > 1)
+            asm ("bar.sync 1, %0;" : : "r" (32 * bar->total));
           return false;
         }
     }
@@ -169,7 +174,8 @@ gomp_team_barrier_wait_cancel_end (gomp_barrier_t *bar,
   generation = state;
   do
     {
-      asm ("bar.sync 1, %0;" : : "r" (32 * bar->total));
+      if (bar->total > 1)
+        asm ("bar.sync 1, %0;" : : "r" (32 * bar->total));
       gen = __atomic_load_n (&bar->generation, MEMMODEL_ACQUIRE);
       if (__builtin_expect (gen & BAR_CANCELLED, 0))
         return true;
diff --git a/libgomp/testsuite/libgomp.c-c++-common/nested-parallel-unbalanced.c b/libgomp/testsuite/libgomp.c-c++-common/nested-parallel-unbalanced.c
new file mode 100644
index 00000000000..e777271dde1
--- /dev/null
+++ b/libgomp/testsuite/libgomp.c-c++-common/nested-parallel-unbalanced.c
@@ -0,0 +1,31 @@
+/* Ensure that nested parallel regions work even when the number of loop
+   iterations is not divisible by the number of threads.  */
+
+#include <stdlib.h>
+
+int main() {
+  int A[30][40], B[30][40];
+  size_t n = 30;
+
+  for (size_t i = 0; i < 30; ++i)
+    for (size_t j = 0; j < 40; ++j)
+      A[i][j] = 42;
+
+#pragma omp target map(A[0:30][0:40], B[0:30][0:40])
+  {
+#pragma omp parallel for num_threads(8)
+    for (size_t i = 0; i < n; ++i)
+      {
+#pragma omp parallel for
+        for (size_t j = 0; j < n; ++j)
+          {
+            B[i][j] = A[i][j];
+          }
+      }
+  }
+
+  for (size_t i = 0; i < n; ++i)
+    for (size_t j = 0; j < n; ++j)
+      if (B[i][j] != 42)
+        abort ();
+}