Message ID: 7c4bb857-a078-0c43-d853-6729cfa07221@codesourcery.com
State: New
Series: OpenMP: Disable GPU threads when only teams are used
On Thu, Jul 02, 2020 at 05:15:20PM +0100, Andrew Stubbs wrote:
> This patch, originally by Kwok, auto-adjusts the default OpenMP target
> arguments to set num_threads(1) when there are no parallel regions. There
> may still be multiple teams in this case.
>
> The result is that libgomp will not attempt to launch GPU threads that will
> never get used.
>
> OK to commit?

That doesn't look safe to me.
My understanding of the patch is that it looks for a parallel construct
lexically within the target region, but that isn't sufficient; it is only
valid if the target region cannot encounter a parallel construct at runtime,
i.e. neither in its body nor in any function called from it.

void
foo ()
{
  #pragma omp distribute parallel for simd
  for (int i = 0; i < 10000000; i++)
    do_something;
}

extern void baz (); // function that calls foo, unconditionally or conditionally
#pragma omp declare target to (foo, baz)

void
bar ()
{
  #pragma omp target teams
  baz ();
}

Perhaps one could ignore some builtin calls, but they would need to be ones
where one can assume there will be no OpenMP code in them.

Also, the optimization must be avoided if omp_get_thread_limit () is, or
might indirectly be, called, because if the optimization forces
thread_limit (1), then omp_get_thread_limit () in the region will also
return 1 rather than the expected value.

	Jakub
On 02/07/2020 18:00, Jakub Jelinek wrote:
> On Thu, Jul 02, 2020 at 05:15:20PM +0100, Andrew Stubbs wrote:
>> This patch, originally by Kwok, auto-adjusts the default OpenMP target
>> arguments to set num_threads(1) when there are no parallel regions. There
>> may still be multiple teams in this case.
>>
>> The result is that libgomp will not attempt to launch GPU threads that will
>> never get used.
>>
>> OK to commit?
>
> That doesn't look safe to me.
> My understanding of the patch is that it looks for parallel construct
> lexically in the target region, but that isn't sufficient, one can do that
> only if the target region can't encounter a parallel construct in the target
> region (i.e. the body and all functions that are called from it at runtime).

OpenMP is complicated. :-(

Is it normally expected that the runtime will always launch the maximum
number of threads, just in case? There's a cost to both launching and running
excess threads that it would be nice to avoid, but the real point of the
optimization is that launching fewer threads allows us to launch more teams.

AMD GPUs usually allow us to run 2040 or 2400 wavefronts simultaneously, so
if we're running 15 unused threads for each team then we're limiting
ourselves to 60 or 64 teams. If we limit each team to 1 thread then we can
run the full 2040 or 2400 teams. Potentially, that's a 16x speed improvement
on kernels that happen not to use parallel regions.

I would like to be able to do this, but it appears that the region data is
insufficient for complex cases. Can you suggest a good way to solve this?

> Perhaps one could ignore some builtin calls but it would need to be ones
> where one can assume there will be no OpenMP code in them.
>
> Also, it needs to avoid doing the optimization if there is or might
> indirectly be called omp_get_thread_limit (), because if the optimization
> forces thread_limit (1), that means that omp_get_thread_limit () in the
> region will also return 1 rather than the expected value.

Would that not be the correct answer, if the number of threads actually has
been limited to 1?

Thanks for the prompt review.

Andrew
On Thu, Jul 02, 2020 at 10:16:25PM +0100, Andrew Stubbs wrote:
> On 02/07/2020 18:00, Jakub Jelinek wrote:
>> On Thu, Jul 02, 2020 at 05:15:20PM +0100, Andrew Stubbs wrote:
>>> This patch, originally by Kwok, auto-adjusts the default OpenMP target
>>> arguments to set num_threads(1) when there are no parallel regions. There
>>> may still be multiple teams in this case.
>>>
>>> The result is that libgomp will not attempt to launch GPU threads that will
>>> never get used.
>>>
>>> OK to commit?
>>
>> That doesn't look safe to me.
>> My understanding of the patch is that it looks for parallel construct
>> lexically in the target region, but that isn't sufficient, one can do that
>> only if the target region can't encounter a parallel construct in the target
>> region (i.e. the body and all functions that are called from it at runtime).
>
> OpenMP is complicated. :-(

It is, and it keeps getting worse.

> Is it normally expected that the runtime will always launch the maximum
> number of threads, just in case?

That is an implementation detail; the OpenMP model doesn't require it. The
question is whether, on encountering the parallel construct, you can ask for
more threads or not. On the host, or in the host fallback, that is the case:
we can just pthread_create as many threads as needed. For PTX there is the
theoretical possibility of using dynamic parallelism, but I think it doesn't
really work well and there were major problems with it.

Anyway, I'd expect OpenMP code that only uses teams parallelism and no
parallel parallelism to be very rare in practice. It is true that we probably
have a lot of such tests in our testsuite, but those are artificial tests; if
somebody wants to get as much as possible out of the hardware, one should use
all of teams, parallel and simd parallelism.

If the user put an explicit thread_limit clause, I'd just trust that the user
knows what he is doing. If not, it is implementation defined what the maximum
will be, but I'd say using a maximum of 1 when we don't find a lexically
nested parallel construct is not a good default, even when it can be
conforming, because a reasonable application will have the parallel
parallelism buried in one or more of the functions it calls, or, if not, will
use an explicit thread_limit(1).

If you want to perform some IPA analysis for this and tweak the default
thread_limit based on what it (conservatively) finds out, I have nothing
against that.

	Jakub
OpenMP: Disable GPU threads when only teams are used

gcc/
	* omp-expand.c (contains_threads): New.
	(get_target_arguments): Add region argument.  Set number of
	threads to one if region does not contain threads.
	(expand_omp_target): Add extra argument in call to
	get_target_arguments.

Co-Authored-By: Andrew Stubbs <ams@codesourcery.com>

diff --git a/gcc/omp-expand.c b/gcc/omp-expand.c
index 0f07e51f7e8..6afe18d5ee0 100644
--- a/gcc/omp-expand.c
+++ b/gcc/omp-expand.c
@@ -8461,10 +8461,22 @@ push_target_argument_according_to_value (gimple_stmt_iterator *gsi, int device,
     }
 }
 
+static bool
+contains_threads (struct omp_region *region)
+{
+  if (!region)
+    return false;
+
+  return region->type == GIMPLE_OMP_PARALLEL
+	 || contains_threads (region->inner)
+	 || contains_threads (region->next);
+}
+
 /* Create an array of arguments that is then passed to GOMP_target.  */
 
 static tree
-get_target_arguments (gimple_stmt_iterator *gsi, gomp_target *tgt_stmt)
+get_target_arguments (gimple_stmt_iterator *gsi, gomp_target *tgt_stmt,
+		      struct omp_region *region)
 {
   auto_vec <tree, 6> args;
   tree clauses = gimple_omp_target_clauses (tgt_stmt);
@@ -8481,6 +8493,11 @@ get_target_arguments (gimple_stmt_iterator *gsi, gomp_target *tgt_stmt)
     t = OMP_CLAUSE_THREAD_LIMIT_EXPR (c);
   else
     t = integer_minus_one_node;
+
+  if (tree_int_cst_equal (t, integer_zero_node)
+      && !contains_threads (region->inner))
+    t = integer_one_node;
+
   push_target_argument_according_to_value (gsi, GOMP_TARGET_ARG_DEVICE_ALL,
					   GOMP_TARGET_ARG_THREAD_LIMIT, t,
					   &args);
@@ -8994,7 +9011,7 @@ expand_omp_target (struct omp_region *region)
	depend = build_int_cst (ptr_type_node, 0);
       args.quick_push (depend);
       if (start_ix == BUILT_IN_GOMP_TARGET)
-	args.quick_push (get_target_arguments (&gsi, entry_stmt));
+	args.quick_push (get_target_arguments (&gsi, entry_stmt, region));
       break;
     case BUILT_IN_GOACC_PARALLEL:
       if (lookup_attribute ("oacc serial", DECL_ATTRIBUTES (child_fn)) != NULL)