Message ID | 87il1z7e9m.fsf@euler.schwinge.ddns.net |
---|---|
State | New |
Series | Stabilize flaky GCN target/offloading testing |
On 06/03/2024 12:09, Thomas Schwinge wrote:
> Hi!
>
> On 2024-02-21T17:32:13+0100, Richard Biener <rguenther@suse.de> wrote:
>> On 21.02.2024 at 13:34, Thomas Schwinge <tschwinge@baylibre.com> wrote:
>>> [...] per my work on <https://gcc.gnu.org/PR66005>
>>> "libgomp make check time is excessive", all execution testing in libgomp
>>> is serialized in 'libgomp/testsuite/lib/libgomp.exp:libgomp_load'. [...]
>>> (... with the caveat that execution tests for
>>> effective-targets are *not* governed by that, as I've found yesterday.
>>> I have a WIP hack for that, too.)
>
>>> What disturbs the testing a lot is, that the GPU may get into a bad
>>> state, upon which any use either fails with a
>>> 'HSA_STATUS_ERROR_OUT_OF_RESOURCES' error -- or by just hanging, deep in
>>> 'libhsa-runtime64.so.1'...
>>>
>>> I've now tried to debug the latter case (hang). When the GPU gets into
>>> this bad state (whatever exactly that is),
>>> 'hsa_executable_load_code_object' still returns 'HSA_STATUS_SUCCESS', but
>>> then GCN target execution ('gcn-run') hangs in 'hsa_executable_freeze'
>>> vs. GCN offloading execution ('libgomp-plugin-gcn.so.1') hangs right
>>> before 'hsa_executable_freeze', in the GCN heap setup 'hsa_memory_copy'.
>>> There it hangs until killed (for example, until DejaGnu's timeout
>>> mechanism kills the process -- just that the next GPU-using execution
>>> test then runs into the same thing again...).
>>>
>>> In this state (and also the 'HSA_STATUS_ERROR_OUT_OF_RESOURCES' state),
>>> we're able to recover via:
>>>
>>>     $ flock /tmp/gpu.lock sudo cat /sys/kernel/debug/dri/0/amdgpu_gpu_recover
>>>     0
>
> At least most of the time. I've found that -- sometimes... ;-( -- if
> you run into 'HSA_STATUS_ERROR_OUT_OF_RESOURCES', then do
> 'amdgpu_gpu_recover', and then immediately re-execute, you'll again run
> into 'HSA_STATUS_ERROR_OUT_OF_RESOURCES'. That appears to be avoidable
> by injecting some artificial "cool-down period"... (The latter I've not
> yet tested extensively.)
>
>>> This is, obviously, a hack, probably needs a serial lock to not disturb
>>> other things, has hard-coded 'dri/0', and as I said in
>>> <https://inbox.sourceware.org/87plww8qin.fsf@euler.schwinge.ddns.net>
>>> "GCN RDNA2+ vs. GCC SLP vectorizer":
>>>
>>> | I've no idea what
>>> | 'amdgpu_gpu_recover' would do if the GPU is also used for display.
>>
>> It ends up terminating your X session…
>
> Eh.... ;'-|
>
>> (there’s some automatic driver recovery that’s also sometimes triggered
>> which sounds like the same thing).
>
>> I need to try using the integrated graphics for X11 to see if that avoids
>> the issue.
>
> A few years ago, I tried that for a Nvidia GPU laptop, and -- if I now
> remember correctly -- basically got it to work, via hand-editing
> '/etc/X11/xorg.conf' and all that... But: I couldn't get external HDMI
> to work in that setup, and therefore reverted to "standard".
>
>> Guess AMD needs to improve the driver/runtime (or we - it’s open source at
>> least up to the firmware).
>
>>> However, it's very useful in my testing. :-|
>>>
>>> The question is, how to detect the "hang" state without first running
>>> into a timeout (and disambiguating such a timeout from a user code
>>> timeout)? Add a watchdog: call 'alarm([a few seconds])' before device
>>> initialization, and before the actual GPU kernel launch cancel it with
>>> 'alarm(0)'? (..., and add a handler for 'SIGALRM' to print a distinct
>>> error message that we can then react on, like for
>>> 'HSA_STATUS_ERROR_OUT_OF_RESOURCES'.) Probably 'alarm'/'SIGALRM' is a
>>> no-go in libgomp -- instead, use a helper thread to similarly implement a
>>> watchdog? ('libgomp/plugin/plugin-gcn.c' already is using pthreads for
>>> other purposes.) Any other clever ideas? What's a suitable value for
>>> "a few seconds"?
>
> I'm attaching my current "GCN: Watchdog for device image load", covering
> both 'gcc/config/gcn/gcn-run.cc' and 'libgomp/plugin/plugin-gcn.c'.
> (That's using 'timer_create' etc. instead of 'alarm'/'SIGALRM'.)
>
> That, plus routing *all* potential GPU usage (in particular: including
> execution tests for effective-targets, see above) through a serial lock
> ('flock', implemented in DejaGnu board file, outside of the
> "DejaGnu timeout domain", similar to
> 'libgomp/testsuite/lib/libgomp.exp:libgomp_load', see above), plus
> catching 'HSA_STATUS_ERROR_OUT_OF_RESOURCES' (both the "real" ones and
> the "fake" ones via "GCN: Watchdog for device image load") and in that
> case 'amdgpu_gpu_recover' and re-execution of the respective executable,
> does greatly stabilize flaky GCN target/offloading testing.
>
> Do we have consensus to move forward with this approach, generally?

I've also observed a number of random hangs in host-side code outside our
control, but after the kernel has exited. In general this watchdog approach
might help with these. I do feel like it's "papering over the cracks", but if
we can't fix it.... at the end of the day it's just a little extra code.

My only concern is that it might actually cause failures, perhaps on heavily
loaded systems, or with network filesystems, or during debugging.

Andrew
On Wed, 6 Mar 2024, Andrew Stubbs wrote:

> On 06/03/2024 12:09, Thomas Schwinge wrote:
> > [...]
> > Do we have consensus to move forward with this approach, generally?
>
> I've also observed a number of random hangs in host-side code outside our
> control, but after the kernel has exited. In general this watchdog approach
> might help with these. I do feel like it's "papering over the cracks", but if
> we can't fix it.... at the end of the day it's just a little extra code.

I wonder if you maybe have contacts at AMD who are willing to debug
this and improve the driver side of it? I'm seeing quite a number of
similar reports in the GitHub tracker for the issue I hit, some several
years old and some current, so reporting there doesn't seem to be a good
way to get things fixed ...

Richard.

> My only concern is that it might actually cause failures, perhaps on heavily
> loaded systems, or with network filesystems, or during debugging.
>
> Andrew
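The workaround described in the quoted text (serial lock, detect the error marker, recover, cool down, re-execute) can be sketched roughly as below. This is only an illustration under assumptions: the thread says the real logic lives in a DejaGnu board file, which is not shown, so every name here (`gpu_lock_acquire`, `gpu_lock_release`, `gpu_next_action`, `enum gpu_action`) is hypothetical. The sketch relies on one detail that is visible in the attached patch: the watchdog message itself contains the text 'HSA_STATUS_ERROR_OUT_OF_RESOURCES', so a single substring match covers both the "real" and the "fake" errors.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/file.h>
#include <unistd.h>

/* All names are hypothetical; this is not the board file from the thread.  */

enum gpu_action { GPU_DONE, GPU_RECOVER_AND_RETRY, GPU_GIVE_UP };

/* Take an exclusive lock on LOCK_PATH, blocking until it is available.
   Returns a descriptor for gpu_lock_release, or -1 on error.  */
static int
gpu_lock_acquire (const char *lock_path)
{
  assert (lock_path != NULL);
  int fd = open (lock_path, O_RDWR | O_CREAT, 0666);
  if (fd < 0)
    return -1;
  if (flock (fd, LOCK_EX) != 0)
    {
      close (fd);
      return -1;
    }
  return fd;
}

/* Drop the lock taken by gpu_lock_acquire.  */
static void
gpu_lock_release (int fd)
{
  flock (fd, LOCK_UN);
  close (fd);
}

/* Decide what to do after attempt number ATTEMPT (1-based), given the
   captured OUTPUT of the test executable.  On GPU_RECOVER_AND_RETRY the
   caller would trigger 'amdgpu_gpu_recover', wait for a cool-down
   period, and run the executable again.  */
static enum gpu_action
gpu_next_action (const char *output, int attempt, int max_attempts)
{
  if (strstr (output, "HSA_STATUS_ERROR_OUT_OF_RESOURCES") == NULL)
    return GPU_DONE;
  return attempt < max_attempts ? GPU_RECOVER_AND_RETRY : GPU_GIVE_UP;
}
```

A caller would hold the lock around each execution and around the recovery step, so that a recovery triggered for one test cannot disturb another test that is using the GPU at the same time. The length of the cool-down period is left open here, as it is in the thread.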
From 21795353483c263c91a5efa80da41a75a6b2b629 Mon Sep 17 00:00:00 2001
From: Thomas Schwinge <tschwinge@baylibre.com>
Date: Thu, 22 Feb 2024 21:50:45 +0100
Subject: [PATCH] GCN: Watchdog for device image load

---
 gcc/config/gcn/gcn-run.cc   | 76 ++++++++++++++++++++++++++++++++++
 libgomp/plugin/plugin-gcn.c | 81 ++++++++++++++++++++++++++++++++++++-
 2 files changed, 156 insertions(+), 1 deletion(-)

diff --git a/gcc/config/gcn/gcn-run.cc b/gcc/config/gcn/gcn-run.cc
index d45ff3e6c2ba..ab15185af471 100644
--- a/gcc/config/gcn/gcn-run.cc
+++ b/gcc/config/gcn/gcn-run.cc
@@ -33,6 +33,8 @@
 #include <unistd.h>
 #include <elf.h>
 #include <signal.h>
+#include <time.h>
+#include <errno.h>
 
 #include "hsa.h"
 #include "../../../libgomp/config/gcn/libgomp-gcn.h"
@@ -616,6 +618,70 @@ run (uint64_t kernel, void *kernargs)
         "Clean up signal");
 }
 
+/* Watchdog.  */
+
+static void
+watchdog_bark (union sigval sigev_value)
+{
+  const char *msg = sigev_value.sival_ptr;
+  fprintf (stderr, "Watchdog barking %s\n", msg);
+  exit (EXIT_FAILURE);
+}
+
+static void
+watchdog_start (timer_t *restrict timeridp, const int s, const char *msg)
+{
+  if (debug)
+    fprintf (stderr, "Starting watchdog\n");
+
+  struct sigevent sev;
+  sev.sigev_notify = SIGEV_THREAD;
+  sev.sigev_value.sival_ptr = (void *) (uintptr_t) msg;
+  sev.sigev_notify_function = watchdog_bark;
+  sev.sigev_notify_attributes = NULL;
+  int res;
+  /* Backoff in case of 'EAGAIN': waiting 255..534773760 ns in 22 attempts.  */
+  int32_t wait_ns = 255;
+  while ((res = timer_create (CLOCK_MONOTONIC, &sev, timeridp)) == EAGAIN
+         && wait_ns <= 999999999)
+    {
+      if (debug)
+        fprintf (stderr, "'timer_create': 'EAGAIN'; waiting %d ns\n",
+                 (int) wait_ns);
+      struct timespec wait_ts = { 0, wait_ns };
+      (void) nanosleep (&wait_ts, NULL);
+      wait_ns <<= 1;
+    }
+  if (res != 0)
+    {
+      perror ("'timer_create' FAILED");
+      exit (EXIT_FAILURE);
+    }
+
+  struct itimerspec its = { { 0, 0 }, { s, 0 } };
+  res = timer_settime (*timeridp, 0, &its, NULL);
+  if (res != 0)
+    {
+      perror ("'timer_settime' FAILED");
+      exit (EXIT_FAILURE);
+    }
+}
+
+static void
+watchdog_stop (timer_t timerid)
+{
+  int res;
+  res = timer_delete (timerid);
+  if (res != 0)
+    {
+      perror ("'timer_delete' FAILED");
+      exit (EXIT_FAILURE);
+    }
+
+  if (debug)
+    fprintf (stderr, "Stopped watchdog\n");
+}
+
 int
 main (int argc, char *argv[])
 {
@@ -658,7 +724,17 @@ main (int argc, char *argv[])
   char **kernel_argv = &argv[kernel_arg];
 
   init_device ();
+
+  /* Something's wrong if the device image load doesn't complete quickly;
+     <https://inbox.sourceware.org/87il2ij8sm.fsf@euler.schwinge.ddns.net>
+     "Stabilizing flaky libgomp GCN target/offloading testing".  */
+  timer_t watchdog;
+  static const int watchdog_s = 10;
+  watchdog_start (&watchdog, watchdog_s,
+                  "during device image load; maybe handle similar to"
+                  " 'HSA_STATUS_ERROR_OUT_OF_RESOURCES'?");
   load_image (kernel_argv[0]);
+  watchdog_stop (watchdog);
 
   /* Calculate size of function parameters + argv data.  */
   size_t args_size = 0;
diff --git a/libgomp/plugin/plugin-gcn.c b/libgomp/plugin/plugin-gcn.c
index 2771123252a8..5680d9f5a34a 100644
--- a/libgomp/plugin/plugin-gcn.c
+++ b/libgomp/plugin/plugin-gcn.c
@@ -48,6 +48,8 @@
 #include "oacc-plugin.h"
 #include "oacc-int.h"
 #include <assert.h>
+#include <time.h>
+#include <errno.h>
 
 /* These probably won't be in elf.h for a while.  */
 #ifndef R_AMDGPU_NONE
@@ -1371,6 +1373,71 @@ hsa_queue_callback (hsa_status_t status,
   hsa_fatal ("Asynchronous queue error", status);
 }
 
+/* }}}  */
+/* {{{ Watchdog  */
+
+static void
+watchdog_bark (union sigval sigev_value)
+{
+  const char *msg = sigev_value.sival_ptr;
+  GOMP_PLUGIN_error ("GCN fatal error: watchdog barking %s\n", msg);
+  _Exit (EXIT_FAILURE);
+}
+
+static void
+watchdog_start (timer_t *restrict timeridp, const int s, const char *msg)
+{
+  GCN_DEBUG ("Starting watchdog\n");
+
+  struct sigevent sev;
+  sev.sigev_notify = SIGEV_THREAD;
+  sev.sigev_value.sival_ptr = (void *) (uintptr_t) msg;
+  sev.sigev_notify_function = watchdog_bark;
+  sev.sigev_notify_attributes = NULL;
+  int res;
+  /* Backoff in case of 'EAGAIN': waiting 255..534773760 ns in 22 attempts.  */
+  int32_t wait_ns = 255;
+  while ((res = timer_create (CLOCK_MONOTONIC, &sev, timeridp)) == EAGAIN
+         && wait_ns <= 999999999)
+    {
+      GCN_DEBUG ("'timer_create': 'EAGAIN'; waiting %d ns\n",
+                 (int) wait_ns);
+      struct timespec wait_ts = { 0, wait_ns };
+      (void) nanosleep (&wait_ts, NULL);
+      wait_ns <<= 1;
+    }
+  if (res != 0)
+    {
+      GOMP_PLUGIN_error ("GCN fatal error: 'timer_create' FAILED: %s",
+                         strerror (errno));
+      _Exit (EXIT_FAILURE);
+    }
+
+  struct itimerspec its = { { 0, 0 }, { s, 0 } };
+  res = timer_settime (*timeridp, 0, &its, NULL);
+  if (res != 0)
+    {
+      GOMP_PLUGIN_error ("GCN fatal error: 'timer_settime' FAILED: %s",
+                         strerror (errno));
+      _Exit (EXIT_FAILURE);
+    }
+}
+
+static void
+watchdog_stop (timer_t timerid)
+{
+  int res;
+  res = timer_delete (timerid);
+  if (res != 0)
+    {
+      GOMP_PLUGIN_error ("GCN fatal error: 'timer_delete' FAILED: %s",
+                         strerror (errno));
+      _Exit (EXIT_FAILURE);
+    }
+
+  GCN_DEBUG ("Stopped watchdog\n");
+}
+
 /* }}}  */
 /* {{{ HSA initialization  */
 
@@ -2502,7 +2569,16 @@ create_and_finalize_hsa_program (struct agent_info *agent)
       return false;
     }
   if (agent->prog_finalized)
-    goto final;
+    goto unlock;
+
+  /* Something's wrong if the device image load doesn't complete quickly;
+     <https://inbox.sourceware.org/87il2ij8sm.fsf@euler.schwinge.ddns.net>
+     "Stabilizing flaky libgomp GCN target/offloading testing".  */
+  timer_t watchdog;
+  static const int watchdog_s = 10;
+  watchdog_start (&watchdog, watchdog_s,
+                  "during device image load; maybe handle similar to"
+                  " 'HSA_STATUS_ERROR_OUT_OF_RESOURCES'?");
 
   status
     = hsa_fns.hsa_executable_create_fn (HSA_PROFILE_FULL,
@@ -2581,6 +2657,9 @@ create_and_finalize_hsa_program (struct agent_info *agent)
 
 final:
   agent->prog_finalized = true;
+  watchdog_stop (watchdog);
+
+unlock:
   if (pthread_mutex_unlock (&agent->prog_mutex))
     {
       GOMP_PLUGIN_error ("Could not unlock a GCN agent program mutex");
-- 
2.43.0
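For readers who want to try the watchdog mechanism in isolation, here is a minimal standalone sketch of the same 'timer_create'/'SIGEV_THREAD' pattern. It is not the patch's code and makes several assumptions: the names 'watchdog_fired' and 'watchdog_start_ms' are made up for illustration, the timeout is given in milliseconds, and the callback sets a flag instead of terminating the process, so that the behavior can be observed. One point worth noting: POSIX specifies that 'timer_create' returns -1 on failure and reports 'EAGAIN' through 'errno', so this sketch tests 'errno' in the backoff loop rather than comparing the return value against 'EAGAIN'. It also zero-initializes the 'struct sigevent' before filling it in.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stdatomic.h>
#include <string.h>
#include <time.h>

/* Set to 1 by the watchdog thread when the timer expires.  */
static atomic_int watchdog_fired;

/* Runs in a helper thread created by the C library (SIGEV_THREAD).  */
static void
watchdog_bark (union sigval sv)
{
  (void) sv;
  atomic_store (&watchdog_fired, 1);
}

/* Arm a one-shot watchdog that fires after TIMEOUT_MS milliseconds.
   Returns 0 on success, -1 with 'errno' set on failure.  */
static int
watchdog_start_ms (timer_t *timeridp, long timeout_ms)
{
  assert (timeridp != NULL && timeout_ms > 0);

  struct sigevent sev;
  memset (&sev, 0, sizeof sev);
  sev.sigev_notify = SIGEV_THREAD;
  sev.sigev_notify_function = watchdog_bark;

  /* Exponential backoff while 'timer_create' fails with 'EAGAIN'.  */
  long wait_ns = 255;
  int res;
  while ((res = timer_create (CLOCK_MONOTONIC, &sev, timeridp)) != 0
         && errno == EAGAIN && wait_ns <= 999999999L)
    {
      struct timespec wait_ts = { 0, wait_ns };
      (void) nanosleep (&wait_ts, NULL);
      wait_ns <<= 1;
    }
  if (res != 0)
    return -1;

  /* Zero interval: the timer expires once and is not re-armed.  */
  struct itimerspec its
    = { { 0, 0 }, { timeout_ms / 1000, (timeout_ms % 1000) * 1000000L } };
  if (timer_settime (*timeridp, 0, &its, NULL) != 0)
    {
      int saved_errno = errno;
      (void) timer_delete (*timeridp);
      errno = saved_errno;
      return -1;
    }
  return 0;
}

/* Cancel the watchdog.  Returns the result of 'timer_delete'.  */
static int
watchdog_stop (timer_t timerid)
{
  return timer_delete (timerid);
}
```

A caller brackets the guarded operation with 'watchdog_start_ms' and 'watchdog_stop', as the patch does around the image load. Because the timeout is a parameter, a caller could pass a larger value, or skip arming the watchdog, on heavily loaded systems or while debugging. With glibc versions before 2.34 the timer functions live in librt, so linking may need '-lrt'.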