From patchwork Tue Apr 19 14:01:06 2016 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: Thomas Schwinge X-Patchwork-Id: 612164 Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@bilbo.ozlabs.org Received: from sourceware.org (server1.sourceware.org [209.132.180.131]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by ozlabs.org (Postfix) with ESMTPS id 3qq6BS3j4Nz9t3l for ; Wed, 20 Apr 2016 00:01:35 +1000 (AEST) Authentication-Results: ozlabs.org; dkim=pass (1024-bit key; unprotected) header.d=gcc.gnu.org header.i=@gcc.gnu.org header.b=Xn80O1HG; dkim-atps=neutral DomainKey-Signature: a=rsa-sha1; c=nofws; d=gcc.gnu.org; h=list-id :list-unsubscribe:list-archive:list-post:list-help:sender:from :to:cc:subject:in-reply-to:references:date:message-id :mime-version:content-type:content-transfer-encoding; q=dns; s= default; b=RD1bkBlqp6Qy37ZMaaWfg7Jilgm/TyTdBKn5O8BlwQSId4QUU0oV/ qTNs3dwSO13yrUeeORdBgrNyVWGqZ2qtM5G79lrC8mKPZycPDudKKx97w7uUPMSL jnlCnakoquN4VObI0YQwIsqtpEQL2oTCow9dtnU1MEdURIXnacWAQw= DKIM-Signature: v=1; a=rsa-sha1; c=relaxed; d=gcc.gnu.org; h=list-id :list-unsubscribe:list-archive:list-post:list-help:sender:from :to:cc:subject:in-reply-to:references:date:message-id :mime-version:content-type:content-transfer-encoding; s=default; bh=37bHulWzk9qsQDJyUrmdgr6qbXM=; b=Xn80O1HGCsTZ9bcItJsj5o/KwX/A Oju5YMbDgb2N3303sw1RG4NbRxLuVCQEeftqzuH2mc2puX5B7Aj+alRYEQvfy0O0 V95Xn4Fl9/eOYb/bbejE1iQvUsVAiFr+1lVhYaj0nWDXFrrd6133os/Pq0YtPSae wZkTmOC4+pBgtdA= Received: (qmail 28958 invoked by alias); 19 Apr 2016 14:01:27 -0000 Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm Precedence: bulk List-Id: List-Unsubscribe: List-Archive: List-Post: List-Help: Sender: gcc-patches-owner@gcc.gnu.org Delivered-To: mailing list gcc-patches@gcc.gnu.org Received: (qmail 28941 invoked by uid 89); 19 Apr 2016 14:01:26 -0000 Authentication-Results: sourceware.org; auth=none X-Virus-Found: No X-Spam-SWARE-Status: No, score=-1.9 required=5.0 tests=AWL, BAYES_00, RCVD_IN_DNSWL_NONE, SPF_PASS autolearn=ham version=3.3.2 spammy=nvidia, offload, sk:GOMP_PL, NVIDIA X-HELO: relay1.mentorg.com Received: from relay1.mentorg.com (HELO relay1.mentorg.com) (192.94.38.131) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with (AES256-GCM-SHA384 encrypted) ESMTPS; Tue, 19 Apr 2016 14:01:16 +0000 Received: from nat-ies.mentorg.com ([192.94.31.2] helo=SVR-IES-FEM-01.mgc.mentorg.com) by relay1.mentorg.com with esmtp id 1asWDH-0000JC-NR from Thomas_Schwinge@mentor.com ; Tue, 19 Apr 2016 07:01:12 -0700 Received: from hertz.schwinge.homeip.net (137.202.0.76) by SVR-IES-FEM-01.mgc.mentorg.com (137.202.0.104) with Microsoft SMTP Server id 14.3.224.2; Tue, 19 Apr 2016 15:01:10 +0100 From: Thomas Schwinge To: Jakub Jelinek , Bernd Schmidt CC: Ilya Verbin , Chung-Lin Tang , James Norris , , Kirill Yukhin , Alexander Monakov Subject: Re: gomp_target_fini In-Reply-To: <20160122101607.GN3017@tucnak.redhat.com> References: <20151201081800.GN5675@tucnak.redhat.com> <20151201131559.GU5675@tucnak.redhat.com> <20151201172927.GA7692@msticlxl57.ims.intel.com> <20151201190504.GY5675@tucnak.redhat.com> <20151208144559.GB14238@msticlxl57.ims.intel.com> <20151211172713.GF5675@tucnak.redhat.com> <20151214164736.GA63018@msticlxl57.ims.intel.com> <87r3impode.fsf@kepler.schwinge.homeip.net> <56A0F83E.3040808@redhat.com> <20160122101607.GN3017@tucnak.redhat.com> User-Agent: Notmuch/0.9-101-g81dad07 (http://notmuchmail.org) Emacs/24.4.1 (x86_64-pc-linux-gnu) Date: Tue, 19 Apr 2016 16:01:06 +0200 Message-ID: <87d1pl3dgd.fsf@hertz.schwinge.homeip.net> MIME-Version: 1.0 Hi! On Fri, 22 Jan 2016 11:16:07 +0100, Jakub Jelinek wrote: > On Thu, Jan 21, 2016 at 04:24:46PM +0100, Bernd Schmidt wrote: > > On 12/16/2015 01:30 PM, Thomas Schwinge wrote: > > >Now, with the above change installed, GOMP_PLUGIN_fatal will trigger the > > >atexit handler, gomp_target_fini, which, with the device lock held, will > > >call back into the plugin, GOMP_OFFLOAD_fini_device, which will try to > > >clean up. > > > > > >Because of the earlier CUDA_ERROR_LAUNCH_FAILED, the associated CUDA > > >context is now in an inconsistent state > > > > >Thus, any cuMemFreeHost invocations that are run during clean-up will now > > >also/still return CUDA_ERROR_LAUNCH_FAILED, due to which we'll again call > > >GOMP_PLUGIN_fatal, which again will trigger the same or another > > >(GOMP_offload_unregister_ver) atexit handler, which will then deadlock > > >trying to lock the device again, which is still locked. (... causing "WARNING: program timed out" for the affected libgomp test cases, as well as deadlocks for any such user code, too.) > > > libgomp/ > > > * error.c (gomp_vfatal): Call _exit instead of exit. > > > > It seems unfortunate to disable the atexit handlers for everything for what > > seems purely an nvptx problem. [...] > I agree, _exit is just wrong, there could be important atexit hooks from the > application. You can set some flag that the libgomp or nvptx plugin atexit > hooks should not do anything, or should do things differently. But > bypassing all atexit handlers is risky. Well, I certainly had done at least some thinking before proposing this: we're talking about the libgomp "fatal exit" function, called when something has gone very wrong, and we're about to terminate the process, because there's no hope to recover. In this situation/consideration it didn't seem important to me to have atexit handlers called. Just like these are also not called when we run into a SIGSEGV, or the kernel kills the process for other reasons. So I'm not completely convinced by your assessment that calling "_exit is just wrong". Anyway, I can certainly accept that my understanding of the seriousness of a libgomp "fatal exit" has been too pessimistic, and that we can do better than my proposed _exit solution. Two other solutions have been proposed in the past months: Chung-Lin's patches with subject: "Adjust offload plugin interface for avoiding deadlock on exit", later: "Resolve libgomp plugin deadlock on exit", later: "Resolve deadlock on plugin exit" (still pending review/approval), and Alexander's much smaller patch with subject: "libgomp plugin: make cuMemFreeHost error non-fatal", . (Both of which I have not reviewed in detail.) Assuming that Chung-Lin's patches are considered too invasive for gcc-6-branch, can we at least get Alexander's patch committed to gcc-6-branch as well as on trunk, please? commit d86a582bd9c21451dc888695ee6ecef37b5fb6ac Author: Alexander Monakov Date: Fri Mar 11 15:31:33 2016 +0300 libgomp plugin: make cuMemFreeHost error non-fatal Unlike cuMemFree and other resource-releasing functions called on exit, cuMemFreeHost appears to re-report errors encountered in kernel launch. This leads to a deadlock after GOMP_PLUGIN_fatal is reentered. While the behavior on libgomp side is suboptimal (there's no need to call resource-releasing functions if we're about to destroy the CUDA context anyway), this behavior on cuMemFreeHost part is not useful and just makes error "recovery" harder. This was reported to NVIDIA (bug ref. 1737876), but we can work around it by simply reporting the error without making it fatal. * plugin/plugin-nvptx.c (map_fini): Make cuMemFreeHost error non-fatal. --- libgomp/ChangeLog.gomp-nvptx | 4 ++++ libgomp/plugin/plugin-nvptx.c | 2 +- 2 files changed, 5 insertions(+), 1 deletion(-) Grüße Thomas diff --git libgomp/ChangeLog.gomp-nvptx libgomp/ChangeLog.gomp-nvptx index 7eefe0b..6bd9e5e 100644 --- libgomp/ChangeLog.gomp-nvptx +++ libgomp/ChangeLog.gomp-nvptx @@ -1,3 +1,7 @@ +2016-03-11 Alexander Monakov + + * plugin/plugin-nvptx.c (map_fini): Make cuMemFreeHost error non-fatal. + 2016-03-04 Alexander Monakov * config/nvptx/bar.c: Remove wrong invocation of diff --git libgomp/plugin/plugin-nvptx.c libgomp/plugin/plugin-nvptx.c index adf57b1..4e44242 100644 --- libgomp/plugin/plugin-nvptx.c +++ libgomp/plugin/plugin-nvptx.c @@ -135,7 +135,7 @@ map_fini (struct ptx_stream *s) r = cuMemFreeHost (s->h); if (r != CUDA_SUCCESS) - GOMP_PLUGIN_fatal ("cuMemFreeHost error: %s", cuda_error (r)); + GOMP_PLUGIN_error ("cuMemFreeHost error: %s", cuda_error (r)); } static void