Stabilize flaky GCN target/offloading testing

Message ID 87il1z7e9m.fsf@euler.schwinge.ddns.net
State New

Commit Message

Thomas Schwinge March 6, 2024, 12:09 p.m. UTC
Hi!

On 2024-02-21T17:32:13+0100, Richard Biener <rguenther@suse.de> wrote:
> On 21.02.2024 at 13:34, Thomas Schwinge <tschwinge@baylibre.com> wrote:
>> [...] per my work on <https://gcc.gnu.org/PR66005>
>> "libgomp make check time is excessive", all execution testing in libgomp
>> is serialized in 'libgomp/testsuite/lib/libgomp.exp:libgomp_load'.  [...]
>> (... with the caveat that execution tests for
>> effective-targets are *not* governed by that, as I've found yesterday.
>> I have a WIP hack for that, too.)

>> What disturbs the testing a lot is, that the GPU may get into a bad
>> state, upon which any use either fails with a
>> 'HSA_STATUS_ERROR_OUT_OF_RESOURCES' error -- or by just hanging, deep in
>> 'libhsa-runtime64.so.1'...
>> 
>> I've now tried to debug the latter case (hang).  When the GPU gets into
>> this bad state (whatever exactly that is),
>> 'hsa_executable_load_code_object' still returns 'HSA_STATUS_SUCCESS', but
>> then GCN target execution ('gcn-run') hangs in 'hsa_executable_freeze'
>> vs. GCN offloading execution ('libgomp-plugin-gcn.so.1') hangs right
>> before 'hsa_executable_freeze', in the GCN heap setup 'hsa_memory_copy'.
>> There it hangs until killed (for example, until DejaGnu's timeout
>> mechanism kills the process -- just that the next GPU-using execution
>> test then runs into the same thing again...).
>> 
>> In this state (and also the 'HSA_STATUS_ERROR_OUT_OF_RESOURCES' state),
>> we're able to recover via:
>> 
>>    $ flock /tmp/gpu.lock sudo cat /sys/kernel/debug/dri/0/amdgpu_gpu_recover
>>    0

At least most of the time.  I've found that -- sometimes... ;-( -- if
you run into 'HSA_STATUS_ERROR_OUT_OF_RESOURCES', then do
'amdgpu_gpu_recover', and then immediately re-execute, you'll again run
into 'HSA_STATUS_ERROR_OUT_OF_RESOURCES'.  That appears to be avoidable
by injecting some artificial "cool-down period"...  (The latter I've not
yet tested extensively.)
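
A minimal sketch of that recover / cool-down / re-execute sequence, written
as a generic retry helper.  All names here ('run_with_recovery',
'attempt_fn', 'struct fake_gpu', 'fake_run', 'fake_recover') are
hypothetical; the 'fake_*' functions only stand in for "execute the test"
and "do 'amdgpu_gpu_recover'", and the cool-down length is an assumption,
not a measured value.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical callback type: returns true on success.  */
typedef bool (*attempt_fn) (void *data);

/* Stand-in for a GPU that fails a given number of times before working.  */
struct fake_gpu
{
  int failures_left;
  int recoveries;
};

/* Stand-in for executing the test once.  */
static bool
fake_run (void *data)
{
  struct fake_gpu *gpu = data;
  if (gpu->failures_left > 0)
    {
      gpu->failures_left--;
      return false;
    }
  return true;
}

/* Stand-in for 'amdgpu_gpu_recover'; just counts invocations.  */
static bool
fake_recover (void *data)
{
  struct fake_gpu *gpu = data;
  gpu->recoveries++;
  return true;
}

/* Run once; on failure, recover, wait COOLDOWN_MS, and re-run, doing at
   most MAX_RETRIES re-runs.  Returns true if any attempt succeeded.  */
static bool
run_with_recovery (attempt_fn run, attempt_fn recover, void *data,
		   int max_retries, long cooldown_ms)
{
  for (int attempt = 0; ; ++attempt)
    {
      if (run (data))
	return true;
      if (attempt >= max_retries || !recover (data))
	return false;
      /* Artificial cool-down period before re-executing.  */
      struct timespec ts
	= { cooldown_ms / 1000, (cooldown_ms % 1000) * 1000000L };
      (void) nanosleep (&ts, NULL);
    }
}
```

In a real setup, the 'run' callback would report failure only for the
'HSA_STATUS_ERROR_OUT_OF_RESOURCES' case (real or watchdog-induced), so
that ordinary test failures are not retried.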

>> This is, obviously, a hack, probably needs a serial lock to not disturb
>> other things, has hard-coded 'dri/0', and as I said in
>> <https://inbox.sourceware.org/87plww8qin.fsf@euler.schwinge.ddns.net>
>> "GCN RDNA2+ vs. GCC SLP vectorizer":
>> 
>> | I've no idea what
>> | 'amdgpu_gpu_recover' would do if the GPU is also used for display.
>
> It ends up terminating your X session…

Eh....  ;'-|

> (there’s some automatic driver recovery that’s also sometimes triggered which sounds like the same thing).

> I need to try using the integrated graphics for X11 to see if that avoids the issue.

A few years ago, I tried that for an Nvidia GPU laptop, and -- if I now
remember correctly -- basically got it to work, via hand-editing
'/etc/X11/xorg.conf' and all that...  But: I couldn't get external HDMI
to work in that setup, and therefore reverted to "standard".

> Guess AMD needs to improve the driver/runtime (or we - it’s open source at least up to the firmware).

>> However, it's very useful in my testing.  :-|
>> 
>> The question is, how to detect the "hang" state without first running
>> into a timeout (and disambiguating such a timeout from a user code
>> timeout)?  Add a watchdog: call 'alarm([a few seconds])' before device
>> initialization, and before the actual GPU kernel launch cancel it with
>> 'alarm(0)'?  (..., and add a handler for 'SIGALRM' to print a distinct
>> error message that we can then react on, like for
>> 'HSA_STATUS_ERROR_OUT_OF_RESOURCES'.)  Probably 'alarm'/'SIGALRM' is a
>> no-go in libgomp -- instead, use a helper thread to similarly implement a
>> watchdog?  ('libgomp/plugin/plugin-gcn.c' already is using pthreads for
>> other purposes.)  Any other clever ideas?  What's a suitable value for
>> "a few seconds"?

I'm attaching my current "GCN: Watchdog for device image load", covering
both 'gcc/config/gcn/gcn-run.cc' and 'libgomp/plugin/plugin-gcn.c'.
(That's using 'timer_create' etc. instead of 'alarm'/'SIGALRM'.)
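
For comparison, the helper-thread variant raised in the quoted text could
look roughly like the following.  This is only a sketch and not the
attached patch: the struct layout, the function names, and the choice to
report expiry via the return value of 'watchdog_stop' are all assumptions
made for illustration.  It waits on a condition variable with an absolute
'CLOCK_MONOTONIC' deadline, so stopping the watchdog wakes the helper
thread immediately.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical watchdog state; not taken from the patch.  */
struct watchdog
{
  pthread_t thread;
  pthread_mutex_t lock;
  pthread_cond_t cond;
  struct timespec deadline;	/* Absolute, on CLOCK_MONOTONIC.  */
  bool stopped;
  bool fired;
  void (*bark) (void *);	/* Called on expiry; may be NULL.  */
  void *bark_arg;
};

static void *
watchdog_thread (void *p)
{
  struct watchdog *w = p;
  int res = 0;
  pthread_mutex_lock (&w->lock);
  /* Loop to tolerate spurious wakeups.  */
  while (!w->stopped && res != ETIMEDOUT)
    res = pthread_cond_timedwait (&w->cond, &w->lock, &w->deadline);
  bool fired = !w->stopped;
  w->fired = fired;
  pthread_mutex_unlock (&w->lock);
  if (fired && w->bark != NULL)
    w->bark (w->bark_arg);
  return NULL;
}

/* Returns 0 on success, like 'pthread_create'.  */
static int
watchdog_start (struct watchdog *w, long timeout_ms,
		void (*bark) (void *), void *bark_arg)
{
  pthread_condattr_t attr;
  w->stopped = w->fired = false;
  w->bark = bark;
  w->bark_arg = bark_arg;
  clock_gettime (CLOCK_MONOTONIC, &w->deadline);
  w->deadline.tv_sec += timeout_ms / 1000;
  w->deadline.tv_nsec += (timeout_ms % 1000) * 1000000L;
  if (w->deadline.tv_nsec >= 1000000000L)
    {
      w->deadline.tv_sec += 1;
      w->deadline.tv_nsec -= 1000000000L;
    }
  pthread_mutex_init (&w->lock, NULL);
  pthread_condattr_init (&attr);
  pthread_condattr_setclock (&attr, CLOCK_MONOTONIC);
  pthread_cond_init (&w->cond, &attr);
  pthread_condattr_destroy (&attr);
  return pthread_create (&w->thread, NULL, watchdog_thread, w);
}

/* Stops the watchdog; returns true if it had already expired.  */
static bool
watchdog_stop (struct watchdog *w)
{
  pthread_mutex_lock (&w->lock);
  w->stopped = true;
  pthread_cond_signal (&w->cond);
  pthread_mutex_unlock (&w->lock);
  pthread_join (w->thread, NULL);
  pthread_cond_destroy (&w->cond);
  pthread_mutex_destroy (&w->lock);
  return w->fired;
}
```

Depending on the C library, this needs to be built with '-pthread'.
Compared to 'timer_create' with 'SIGEV_THREAD', it costs one thread per
armed watchdog, but avoids the 'EAGAIN' handling for timer creation.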

That, plus routing *all* potential GPU usage (in particular: including
execution tests for effective-targets, see above) through a serial lock
('flock', implemented in DejaGnu board file, outside of the
"DejaGnu timeout domain", similar to
'libgomp/testsuite/lib/libgomp.exp:libgomp_load', see above), plus
catching 'HSA_STATUS_ERROR_OUT_OF_RESOURCES' (both the "real" ones and
the "fake" ones via "GCN: Watchdog for device image load") and in that
case 'amdgpu_gpu_recover' and re-execution of the respective executable,
does greatly stabilize flaky GCN target/offloading testing.
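
The serial lock itself is taken with the 'flock' utility from the DejaGnu
board file; purely as an illustration, the same idea expressed in C via
'flock(2)' might look like the sketch below.  The function names
'gpu_lock_acquire' and 'gpu_lock_release' are hypothetical, and the lock
file path is whatever the caller passes in (for example the
'/tmp/gpu.lock' used above).

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

/* Take an exclusive advisory lock on LOCK_PATH, blocking until it is
   available.  Returns a file descriptor holding the lock, or -1.  */
static int
gpu_lock_acquire (const char *lock_path)
{
  int fd = open (lock_path, O_RDWR | O_CREAT, 0666);
  if (fd < 0)
    return -1;
  while (flock (fd, LOCK_EX) != 0)
    if (errno != EINTR)
      {
	close (fd);
	return -1;
      }
  return fd;
}

/* Release the lock taken by 'gpu_lock_acquire'.  */
static void
gpu_lock_release (int fd)
{
  (void) flock (fd, LOCK_UN);
  (void) close (fd);
}
```

Everything that may touch the GPU, including the recovery step and any
re-execution, would then run between 'gpu_lock_acquire' and
'gpu_lock_release', so that a reset never hits another test mid-run.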

Do we have consensus to move forward with this approach, generally?


Regards
 Thomas

Comments

Andrew Stubbs March 6, 2024, 12:39 p.m. UTC | #1
On 06/03/2024 12:09, Thomas Schwinge wrote:
> Hi!
> 
> On 2024-02-21T17:32:13+0100, Richard Biener <rguenther@suse.de> wrote:
>> On 21.02.2024 at 13:34, Thomas Schwinge <tschwinge@baylibre.com> wrote:
>>> [...] per my work on <https://gcc.gnu.org/PR66005>
>>> "libgomp make check time is excessive", all execution testing in libgomp
>>> is serialized in 'libgomp/testsuite/lib/libgomp.exp:libgomp_load'.  [...]
>>> (... with the caveat that execution tests for
>>> effective-targets are *not* governed by that, as I've found yesterday.
>>> I have a WIP hack for that, too.)
> 
>>> What disturbs the testing a lot is, that the GPU may get into a bad
>>> state, upon which any use either fails with a
>>> 'HSA_STATUS_ERROR_OUT_OF_RESOURCES' error -- or by just hanging, deep in
>>> 'libhsa-runtime64.so.1'...
>>>
>>> I've now tried to debug the latter case (hang).  When the GPU gets into
>>> this bad state (whatever exactly that is),
>>> 'hsa_executable_load_code_object' still returns 'HSA_STATUS_SUCCESS', but
>>> then GCN target execution ('gcn-run') hangs in 'hsa_executable_freeze'
>>> vs. GCN offloading execution ('libgomp-plugin-gcn.so.1') hangs right
>>> before 'hsa_executable_freeze', in the GCN heap setup 'hsa_memory_copy'.
>>> There it hangs until killed (for example, until DejaGnu's timeout
>>> mechanism kills the process -- just that the next GPU-using execution
>>> test then runs into the same thing again...).
>>>
>>> In this state (and also the 'HSA_STATUS_ERROR_OUT_OF_RESOURCES' state),
>>> we're able to recover via:
>>>
>>>     $ flock /tmp/gpu.lock sudo cat /sys/kernel/debug/dri/0/amdgpu_gpu_recover
>>>     0
> 
> At least most of the time.  I've found that -- sometimes... ;-( -- if
> you run into 'HSA_STATUS_ERROR_OUT_OF_RESOURCES', then do
> 'amdgpu_gpu_recover', and then immediately re-execute, you'll again run
> into 'HSA_STATUS_ERROR_OUT_OF_RESOURCES'.  That appears to be avoidable
> by injecting some artificial "cool-down period"...  (The latter I've not
> yet tested extensively.)
> 
>>> This is, obviously, a hack, probably needs a serial lock to not disturb
>>> other things, has hard-coded 'dri/0', and as I said in
>>> <https://inbox.sourceware.org/87plww8qin.fsf@euler.schwinge.ddns.net>
>>> "GCN RDNA2+ vs. GCC SLP vectorizer":
>>>
>>> | I've no idea what
>>> | 'amdgpu_gpu_recover' would do if the GPU is also used for display.
>>
>> It ends up terminating your X session…
> 
> Eh....  ;'-|
> 
>> (there’s some automatic driver recovery that’s also sometimes triggered which sounds like the same thing).
> 
>> I need to try using the integrated graphics for X11 to see if that avoids the issue.
> 
> A few years ago, I tried that for an Nvidia GPU laptop, and -- if I now
> remember correctly -- basically got it to work, via hand-editing
> '/etc/X11/xorg.conf' and all that...  But: I couldn't get external HDMI
> to work in that setup, and therefore reverted to "standard".
> 
>> Guess AMD needs to improve the driver/runtime (or we - it’s open source at least up to the firmware).
> 
>>> However, it's very useful in my testing.  :-|
>>>
>>> The question is, how to detect the "hang" state without first running
>>> into a timeout (and disambiguating such a timeout from a user code
>>> timeout)?  Add a watchdog: call 'alarm([a few seconds])' before device
>>> initialization, and before the actual GPU kernel launch cancel it with
>>> 'alarm(0)'?  (..., and add a handler for 'SIGALRM' to print a distinct
>>> error message that we can then react on, like for
>>> 'HSA_STATUS_ERROR_OUT_OF_RESOURCES'.)  Probably 'alarm'/'SIGALRM' is a
>>> no-go in libgomp -- instead, use a helper thread to similarly implement a
>>> watchdog?  ('libgomp/plugin/plugin-gcn.c' already is using pthreads for
>>> other purposes.)  Any other clever ideas?  What's a suitable value for
>>> "a few seconds"?
> 
> I'm attaching my current "GCN: Watchdog for device image load", covering
> both 'gcc/config/gcn/gcn-run.cc' and 'libgomp/plugin/plugin-gcn.c'.
> (That's using 'timer_create' etc. instead of 'alarm'/'SIGALRM'.)
> 
> That, plus routing *all* potential GPU usage (in particular: including
> execution tests for effective-targets, see above) through a serial lock
> ('flock', implemented in DejaGnu board file, outside of the
> "DejaGnu timeout domain", similar to
> 'libgomp/testsuite/lib/libgomp.exp:libgomp_load', see above), plus
> catching 'HSA_STATUS_ERROR_OUT_OF_RESOURCES' (both the "real" ones and
> the "fake" ones via "GCN: Watchdog for device image load") and in that
> case 'amdgpu_gpu_recover' and re-execution of the respective executable,
> does greatly stabilize flaky GCN target/offloading testing.
> 
> Do we have consensus to move forward with this approach, generally?

I've also observed a number of random hangs in host-side code outside 
our control, but after the kernel has exited. In general this watchdog 
approach might help with these. I do feel like it's "papering over the 
cracks", but if we can't fix it.... at the end of the day it's just a 
little extra code.

My only concern is that it might actually cause failures, perhaps on 
heavily loaded systems, or with network filesystems, or during debugging.
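
One conceivable way to limit that risk would be to make the timeout
adjustable at run time, with a value of 0 meaning "no watchdog".  The
sketch below is an assumption for illustration only: the patch hard-codes
10 seconds, and neither the function names nor the environment variable
'GCN_WATCHDOG_TIMEOUT' exist anywhere in GCC.

```c
#include <errno.h>
#include <limits.h>
#include <stddef.h>
#include <stdlib.h>

/* Parse VAL as a non-negative number of seconds.  Returns DEFAULT_S if VAL
   is NULL, empty, malformed, or out of range; 0 means "disabled".  */
static int
parse_watchdog_timeout (const char *val, int default_s)
{
  if (val == NULL || *val == '\0')
    return default_s;
  char *end;
  errno = 0;
  long s = strtol (val, &end, 10);
  if (errno != 0 || *end != '\0' || s < 0 || s > INT_MAX)
    return default_s;
  return (int) s;
}

/* Hypothetical variable name; not an existing GCC or libgomp setting.  */
static int
watchdog_timeout_s (int default_s)
{
  return parse_watchdog_timeout (getenv ("GCN_WATCHDOG_TIMEOUT"), default_s);
}
```

A caller would then skip 'watchdog_start' and 'watchdog_stop' entirely when
'watchdog_timeout_s (10)' returns 0, for example while debugging.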

Andrew

Richard Biener March 6, 2024, 1:29 p.m. UTC | #2
On Wed, 6 Mar 2024, Andrew Stubbs wrote:

> On 06/03/2024 12:09, Thomas Schwinge wrote:
> > Hi!
> > 
> > On 2024-02-21T17:32:13+0100, Richard Biener <rguenther@suse.de> wrote:
> >> On 21.02.2024 at 13:34, Thomas Schwinge <tschwinge@baylibre.com> wrote:
> >>> [...] per my work on <https://gcc.gnu.org/PR66005>
> >>> "libgomp make check time is excessive", all execution testing in libgomp
> >>> is serialized in 'libgomp/testsuite/lib/libgomp.exp:libgomp_load'.  [...]
> >>> (... with the caveat that execution tests for
> >>> effective-targets are *not* governed by that, as I've found yesterday.
> >>> I have a WIP hack for that, too.)
> > 
> >>> What disturbs the testing a lot is, that the GPU may get into a bad
> >>> state, upon which any use either fails with a
> >>> 'HSA_STATUS_ERROR_OUT_OF_RESOURCES' error -- or by just hanging, deep in
> >>> 'libhsa-runtime64.so.1'...
> >>>
> >>> I've now tried to debug the latter case (hang).  When the GPU gets into
> >>> this bad state (whatever exactly that is),
> >>> 'hsa_executable_load_code_object' still returns 'HSA_STATUS_SUCCESS', but
> >>> then GCN target execution ('gcn-run') hangs in 'hsa_executable_freeze'
> >>> vs. GCN offloading execution ('libgomp-plugin-gcn.so.1') hangs right
> >>> before 'hsa_executable_freeze', in the GCN heap setup 'hsa_memory_copy'.
> >>> There it hangs until killed (for example, until DejaGnu's timeout
> >>> mechanism kills the process -- just that the next GPU-using execution
> >>> test then runs into the same thing again...).
> >>>
> >>> In this state (and also the 'HSA_STATUS_ERROR_OUT_OF_RESOURCES' state),
> >>> we're able to recover via:
> >>>
> >>>     $ flock /tmp/gpu.lock sudo cat /sys/kernel/debug/dri/0/amdgpu_gpu_recover
> >>>     0
> > 
> > At least most of the time.  I've found that -- sometimes... ;-( -- if
> > you run into 'HSA_STATUS_ERROR_OUT_OF_RESOURCES', then do
> > 'amdgpu_gpu_recover', and then immediately re-execute, you'll again run
> > into 'HSA_STATUS_ERROR_OUT_OF_RESOURCES'.  That appears to be avoidable
> > by injecting some artificial "cool-down period"...  (The latter I've not
> > yet tested extensively.)
> > 
> >>> This is, obviously, a hack, probably needs a serial lock to not disturb
> >>> other things, has hard-coded 'dri/0', and as I said in
> >>> <https://inbox.sourceware.org/87plww8qin.fsf@euler.schwinge.ddns.net>
> >>> "GCN RDNA2+ vs. GCC SLP vectorizer":
> >>>
> >>> | I've no idea what
> >>> | 'amdgpu_gpu_recover' would do if the GPU is also used for display.
> >>
> >> It ends up terminating your X session…
> > 
> > Eh....  ;'-|
> > 
> >> (there’s some automatic driver recovery that’s also sometimes triggered
> >> which sounds like the same thing).
> > 
> >> I need to try using the integrated graphics for X11 to see if that avoids
> >> the issue.
> > 
> > A few years ago, I tried that for an Nvidia GPU laptop, and -- if I now
> > remember correctly -- basically got it to work, via hand-editing
> > '/etc/X11/xorg.conf' and all that...  But: I couldn't get external HDMI
> > to work in that setup, and therefore reverted to "standard".
> > 
> >> Guess AMD needs to improve the driver/runtime (or we - it’s open source at
> >> least up to the firmware).
> > 
> >>> However, it's very useful in my testing.  :-|
> >>>
> >>> The question is, how to detect the "hang" state without first running
> >>> into a timeout (and disambiguating such a timeout from a user code
> >>> timeout)?  Add a watchdog: call 'alarm([a few seconds])' before device
> >>> initialization, and before the actual GPU kernel launch cancel it with
> >>> 'alarm(0)'?  (..., and add a handler for 'SIGALRM' to print a distinct
> >>> error message that we can then react on, like for
> >>> 'HSA_STATUS_ERROR_OUT_OF_RESOURCES'.)  Probably 'alarm'/'SIGALRM' is a
> >>> no-go in libgomp -- instead, use a helper thread to similarly implement a
> >>> watchdog?  ('libgomp/plugin/plugin-gcn.c' already is using pthreads for
> >>> other purposes.)  Any other clever ideas?  What's a suitable value for
> >>> "a few seconds"?
> > 
> > I'm attaching my current "GCN: Watchdog for device image load", covering
> > both 'gcc/config/gcn/gcn-run.cc' and 'libgomp/plugin/plugin-gcn.c'.
> > (That's using 'timer_create' etc. instead of 'alarm'/'SIGALRM'.)
> > 
> > That, plus routing *all* potential GPU usage (in particular: including
> > execution tests for effective-targets, see above) through a serial lock
> > ('flock', implemented in DejaGnu board file, outside of the
> > "DejaGnu timeout domain", similar to
> > 'libgomp/testsuite/lib/libgomp.exp:libgomp_load', see above), plus
> > catching 'HSA_STATUS_ERROR_OUT_OF_RESOURCES' (both the "real" ones and
> > the "fake" ones via "GCN: Watchdog for device image load") and in that
> > case 'amdgpu_gpu_recover' and re-execution of the respective executable,
> > does greatly stabilize flaky GCN target/offloading testing.
> > 
> > Do we have consensus to move forward with this approach, generally?
> 
> I've also observed a number of random hangs in host-side code outside our
> control, but after the kernel has exited. In general this watchdog approach
> might help with these. I do feel like it's "papering over the cracks", but if
> we can't fix it.... at the end of the day it's just a little extra code.

I wonder whether you have contacts at AMD who are willing
to debug this and improve the driver side of this?  I'm seeing quite
a number of similar reports for the issue I hit in the GitHub tracker,
some multiple years old and some current, so that doesn't seem to be a
good way to get things fixed ...

Richard.

> My only concern is that it might actually cause failures, perhaps on heavily
> loaded systems, or with network filesystems, or during debugging.
>
> Andrew
>

Patch

From 21795353483c263c91a5efa80da41a75a6b2b629 Mon Sep 17 00:00:00 2001
From: Thomas Schwinge <tschwinge@baylibre.com>
Date: Thu, 22 Feb 2024 21:50:45 +0100
Subject: [PATCH] GCN: Watchdog for device image load

---
 gcc/config/gcn/gcn-run.cc   | 76 ++++++++++++++++++++++++++++++++++
 libgomp/plugin/plugin-gcn.c | 81 ++++++++++++++++++++++++++++++++++++-
 2 files changed, 156 insertions(+), 1 deletion(-)

diff --git a/gcc/config/gcn/gcn-run.cc b/gcc/config/gcn/gcn-run.cc
index d45ff3e6c2ba..ab15185af471 100644
--- a/gcc/config/gcn/gcn-run.cc
+++ b/gcc/config/gcn/gcn-run.cc
@@ -33,6 +33,8 @@ 
 #include <unistd.h>
 #include <elf.h>
 #include <signal.h>
+#include <time.h>
+#include <errno.h>
 
 #include "hsa.h"
 #include "../../../libgomp/config/gcn/libgomp-gcn.h"
@@ -616,6 +618,70 @@  run (uint64_t kernel, void *kernargs)
 	"Clean up signal");
 }
 
+/* Watchdog.  */
+
+static void
+watchdog_bark (union sigval sigev_value)
+{
+  const char *msg = sigev_value.sival_ptr;
+  fprintf (stderr, "Watchdog barking %s\n", msg);
+  exit (EXIT_FAILURE);
+}
+
+static void
+watchdog_start (timer_t *restrict timeridp, const int s, const char *msg)
+{
+  if (debug)
+    fprintf (stderr, "Starting watchdog\n");
+
+  struct sigevent sev;
+  sev.sigev_notify = SIGEV_THREAD;
+  sev.sigev_value.sival_ptr = (void *) (uintptr_t) msg;
+  sev.sigev_notify_function = watchdog_bark;
+  sev.sigev_notify_attributes = NULL;
+  int res;
+  /* Backoff in case of 'EAGAIN': waiting 255..534773760 ns in 22 attempts.  */
+  int32_t wait_ns = 255;
+  while ((res = timer_create (CLOCK_MONOTONIC, &sev, timeridp)) != 0
+	 && errno == EAGAIN && wait_ns <= 999999999)
+    {
+      if (debug)
+	fprintf (stderr, "'timer_create': 'EAGAIN'; waiting %d ns\n",
+		 (int) wait_ns);
+      struct timespec wait_ts = { 0, wait_ns };
+      (void) nanosleep (&wait_ts, NULL);
+      wait_ns <<= 1;
+    }
+  if (res != 0)
+    {
+      perror ("'timer_create' FAILED");
+      exit (EXIT_FAILURE);
+    }
+
+  struct itimerspec its = { { 0, 0 }, { s, 0 } };
+  res = timer_settime (*timeridp, 0, &its, NULL);
+  if (res != 0)
+    {
+      perror ("'timer_settime' FAILED");
+      exit (EXIT_FAILURE);
+    }
+}
+
+static void
+watchdog_stop (timer_t timerid)
+{
+  int res;
+  res = timer_delete (timerid);
+  if (res != 0)
+    {
+      perror ("'timer_delete' FAILED");
+      exit (EXIT_FAILURE);
+    }
+
+  if (debug)
+    fprintf (stderr, "Stopped watchdog\n");
+}
+
 int
 main (int argc, char *argv[])
 {
@@ -658,7 +724,17 @@  main (int argc, char *argv[])
   char **kernel_argv = &argv[kernel_arg];
 
   init_device ();
+
+  /* Something's wrong if the device image load doesn't complete quickly;
+     <https://inbox.sourceware.org/87il2ij8sm.fsf@euler.schwinge.ddns.net>
+     "Stabilizing flaky libgomp GCN target/offloading testing".  */
+  timer_t watchdog;
+  static const int watchdog_s = 10;
+  watchdog_start (&watchdog, watchdog_s,
+		  "during device image load; maybe handle similar to"
+		  " 'HSA_STATUS_ERROR_OUT_OF_RESOURCES'?");
   load_image (kernel_argv[0]);
+  watchdog_stop (watchdog);
 
   /* Calculate size of function parameters + argv data.  */
   size_t args_size = 0;
diff --git a/libgomp/plugin/plugin-gcn.c b/libgomp/plugin/plugin-gcn.c
index 2771123252a8..5680d9f5a34a 100644
--- a/libgomp/plugin/plugin-gcn.c
+++ b/libgomp/plugin/plugin-gcn.c
@@ -48,6 +48,8 @@ 
 #include "oacc-plugin.h"
 #include "oacc-int.h"
 #include <assert.h>
+#include <time.h>
+#include <errno.h>
 
 /* These probably won't be in elf.h for a while.  */
 #ifndef R_AMDGPU_NONE
@@ -1371,6 +1373,71 @@  hsa_queue_callback (hsa_status_t status,
   hsa_fatal ("Asynchronous queue error", status);
 }
 
+/* }}}  */
+/* {{{ Watchdog  */
+
+static void
+watchdog_bark (union sigval sigev_value)
+{
+  const char *msg = sigev_value.sival_ptr;
+  GOMP_PLUGIN_error ("GCN fatal error: watchdog barking %s\n", msg);
+  _Exit (EXIT_FAILURE);
+}
+
+static void
+watchdog_start (timer_t *restrict timeridp, const int s, const char *msg)
+{
+  GCN_DEBUG ("Starting watchdog\n");
+
+  struct sigevent sev;
+  sev.sigev_notify = SIGEV_THREAD;
+  sev.sigev_value.sival_ptr = (void *) (uintptr_t) msg;
+  sev.sigev_notify_function = watchdog_bark;
+  sev.sigev_notify_attributes = NULL;
+  int res;
+  /* Backoff in case of 'EAGAIN': waiting 255..534773760 ns in 22 attempts.  */
+  int32_t wait_ns = 255;
+  while ((res = timer_create (CLOCK_MONOTONIC, &sev, timeridp)) != 0
+	 && errno == EAGAIN && wait_ns <= 999999999)
+    {
+      GCN_DEBUG ("'timer_create': 'EAGAIN'; waiting %d ns\n",
+		 (int) wait_ns);
+      struct timespec wait_ts = { 0, wait_ns };
+      (void) nanosleep (&wait_ts, NULL);
+      wait_ns <<= 1;
+    }
+  if (res != 0)
+    {
+      GOMP_PLUGIN_error ("GCN fatal error: 'timer_create' FAILED: %s",
+			 strerror (errno));
+      _Exit (EXIT_FAILURE);
+    }
+
+  struct itimerspec its = { { 0, 0 }, { s, 0 } };
+  res = timer_settime (*timeridp, 0, &its, NULL);
+  if (res != 0)
+    {
+      GOMP_PLUGIN_error ("GCN fatal error: 'timer_settime' FAILED: %s",
+			 strerror (errno));
+      _Exit (EXIT_FAILURE);
+    }
+}
+
+static void
+watchdog_stop (timer_t timerid)
+{
+  int res;
+  res = timer_delete (timerid);
+  if (res != 0)
+    {
+      GOMP_PLUGIN_error ("GCN fatal error: 'timer_delete' FAILED: %s",
+			 strerror (errno));
+      _Exit (EXIT_FAILURE);
+    }
+
+  GCN_DEBUG ("Stopped watchdog\n");
+}
+
 /* }}}  */
 /* {{{ HSA initialization  */
 
@@ -2502,7 +2569,16 @@  create_and_finalize_hsa_program (struct agent_info *agent)
       return false;
     }
   if (agent->prog_finalized)
-    goto final;
+    goto unlock;
+
+  /* Something's wrong if the device image load doesn't complete quickly;
+     <https://inbox.sourceware.org/87il2ij8sm.fsf@euler.schwinge.ddns.net>
+     "Stabilizing flaky libgomp GCN target/offloading testing".  */
+  timer_t watchdog;
+  static const int watchdog_s = 10;
+  watchdog_start (&watchdog, watchdog_s,
+		  "during device image load; maybe handle similar to"
+		  " 'HSA_STATUS_ERROR_OUT_OF_RESOURCES'?");
 
   status
     = hsa_fns.hsa_executable_create_fn (HSA_PROFILE_FULL,
@@ -2581,6 +2657,9 @@  create_and_finalize_hsa_program (struct agent_info *agent)
 final:
   agent->prog_finalized = true;
 
+  watchdog_stop (watchdog);
+
+unlock:
   if (pthread_mutex_unlock (&agent->prog_mutex))
     {
       GOMP_PLUGIN_error ("Could not unlock a GCN agent program mutex");
-- 
2.43.0