From patchwork Tue Jun 4 19:53:40 2024
Content-Type: text/plain; charset="utf-8"
X-Patchwork-Submitter: Thomas Schwinge
X-Patchwork-Id: 1943556
From: Thomas Schwinge
To: gcc-patches@gcc.gnu.org
Cc: Tom de Vries
Subject: nvptx: Make 'nvptx_uniform_warp_check' fit for non-full-warp execution, via 'vote.all.pred' (was: nvptx: Make 'nvptx_uniform_warp_check' fit for non-full-warp execution (was: [committed][nvptx] Add uniform_warp_check insn))
In-Reply-To: <87a63ofrpf.fsf@euler.schwinge.homeip.net>
References: <20220201183125.GA4286@delia.home> <87a63ofrpf.fsf@euler.schwinge.homeip.net>
User-Agent: Notmuch/0.30+8~g47a4bad (https://notmuchmail.org) Emacs/29.3 (x86_64-pc-linux-gnu)
Date: Tue, 04 Jun 2024 21:53:40 +0200
Message-ID: <87v82oeb7f.fsf@euler.schwinge.ddns.net>
List-Id: Gcc-patches mailing list

Hi!

On 2022-12-15T19:27:08+0100, I wrote:
> First "a bit" of context; skip to "the proposed patch" if you'd like to
> see just that.

I'm not repeating all of the context here; see the previous email if
necessary.

> My following discussion is about the implementation of
> 'nvptx_uniform_warp_check', originally introduced as follows:
>
> On 2022-02-01T19:31:27+0100, Tom de Vries via Gcc-patches wrote:
>> --- a/gcc/config/nvptx/nvptx.md
>> +++ b/gcc/config/nvptx/nvptx.md
>> +(define_insn "nvptx_uniform_warp_check"
>> +  [(unspec_volatile [(const_int 0)] UNSPECV_UNIFORM_WARP_CHECK)]
>> +  ""
>> +  {
>> +    output_asm_insn ("{", NULL);
>> +    output_asm_insn ("\\t" ".reg.b32" "\\t" "act;", NULL);
>> +    output_asm_insn ("\\t" "vote.ballot.b32" "\\t" "act,1;", NULL);
>> +    output_asm_insn ("\\t" ".reg.pred" "\\t" "uni;", NULL);
>> +    output_asm_insn ("\\t" "setp.eq.b32" "\\t" "uni,act,0xffffffff;",
>> +                     NULL);
>> +    output_asm_insn ("@ !uni\\t" "trap;", NULL);
>> +    output_asm_insn ("@ !uni\\t" "exit;", NULL);
>> +    output_asm_insn ("}", NULL);
>> +    return "";
>> +  }
>> +  [(set_attr "predicable" "false")])
>
> Later adjusted, but the fundamental idea is still the same.

> Now, "the proposed patch".  I'd like to make 'nvptx_uniform_warp_check'
> fit for non-full-warp execution.  For example, to be able to execute such
> code in single-threaded 'cuLaunchKernel' for execution of global
> constructors/destructors, where those may, for example, call into nvptx
> target libraries compiled with '-mgomp' (thus, '-muniform-simt').
>
> OK to push (after proper testing, and with TODO markers adjusted/removed)
> the attached
> "nvptx: Make 'nvptx_uniform_warp_check' fit for non-full-warp execution"?

> --- a/gcc/config/nvptx/nvptx.md
> +++ b/gcc/config/nvptx/nvptx.md
> @@ -2282,10 +2282,24 @@
>        "{",
>        "\\t" ".reg.b32" "\\t" "%%r_act;",
>        "%.\\t" "vote.ballot.b32" "\\t" "%%r_act,1;",
> +      /* For '%r_exp', we essentially need 'activemask.b32', but that is "Introduced in PTX ISA version 6.2", and this code here is used only 'if (!TARGET_PTX_6_0)'.  Thus, emulate it.
> +         TODO Is that actually correct?  Wouldn't 'activemask.b32' rather replace our 'vote.ballot.b32' given that it registers the *currently active threads*?  */
> +      /* Compute the "membermask" of all threads of the warp that are expected to be converged here.
> +         For OpenACC, '%ntid.x' is 'vector_length', which per 'nvptx_goacc_validate_dims' always is a multiple of 32.
> +         For OpenMP, '%ntid.x' always is 32.
> +         Thus, this is typically 0xffffffff, but additionally always for the case that not all 32 threads of the warp have been launched.
> +         This assume that lane IDs are assigned in ascending order.  */
> +      //TODO Can we rely on '1 << 32 == 0', and '0 - 1 = 0xffffffff'?
> +      //TODO https://developer.nvidia.com/blog/using-cuda-warp-level-primitives/
> +      //TODO https://stackoverflow.com/questions/54055195/activemask-vs-ballot-sync
> +      "\\t" ".reg.b32" "\\t" "%%r_exp;",
> +      "%.\\t" "mov.b32" "\\t" "%%r_exp, %%ntid.x;",
> +      "%.\\t" "shl.b32" "\\t" "%%r_exp, 1, %%r_exp;",
> +      "%.\\t" "sub.u32" "\\t" "%%r_exp, %%r_exp, 1;",
>        "\\t" ".reg.pred" "\\t" "%%r_do_abort;",
>        "\\t" "mov.pred" "\\t" "%%r_do_abort,0;",
>        "%.\\t" "setp.ne.b32" "\\t" "%%r_do_abort,%%r_act,"
> -                                  "0xffffffff;",
> +                                  "%%r_exp;",
>        "@ %%r_do_abort\\t" "trap;",
>        "@ %%r_do_abort\\t" "exit;",
>        "}",

Turns out, there is a simpler way, via 'vote.all.pred'.  :-)

Unless there are any comments, I intend to soon push the attached
"nvptx: Make 'nvptx_uniform_warp_check' fit for non-full-warp execution, via 'vote.all.pred'".

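The bitmask arithmetic from the 2022 proposal quoted above can be modeled on the host. The following is only a minimal C sketch, not part of either patch: the function names are hypothetical, and it assumes that PTX 'shl.b32' clamps shift amounts above 32 to 32 (so that a shift by 32 or more yields 0), which is exactly the point the quoted TODO asks about.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical host-side model of the PTX sequence
     mov.b32 %r_exp, %ntid.x;
     shl.b32 %r_exp, 1, %r_exp;
     sub.u32 %r_exp, %r_exp, 1;
   ASSUMPTION: 'shl.b32' clamps shift amounts above 32 to 32, so that a
   shift by 32 or more yields 0.  */
uint32_t
expected_mask (uint32_t ntid_x)
{
  /* In C, shifting a 32-bit value by 32 or more is undefined, so the
     assumed clamping is modeled explicitly.  */
  uint32_t shifted = ntid_x >= 32 ? 0u : (uint32_t) 1 << ntid_x;
  /* Unsigned subtraction wraps around: 0 - 1 == 0xffffffff.  */
  return shifted - 1u;
}

/* Original check: abort unless the ballot has all 32 bits set.  */
bool
fixed_mask_check_aborts (uint32_t ballot)
{
  return ballot != 0xffffffffu;
}

/* 2022 proposal: abort unless the ballot equals the computed mask.  */
bool
computed_mask_check_aborts (uint32_t ballot, uint32_t ntid_x)
{
  return ballot != expected_mask (ntid_x);
}
```

In this model, a single-threaded launch (ballot 0x1, '%ntid.x' of 1) makes 'fixed_mask_check_aborts' return true while 'computed_mask_check_aborts' returns false; for a full warp (ballot 0xffffffff, '%ntid.x' of 32 or 64) neither check aborts.
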
Grüße
 Thomas


From f7f4a20ca14761d39822e9d79cb3ac711df45b90 Mon Sep 17 00:00:00 2001
From: Thomas Schwinge
Date: Fri, 10 May 2024 12:50:23 +0200
Subject: [PATCH] nvptx: Make 'nvptx_uniform_warp_check' fit for non-full-warp
 execution, via 'vote.all.pred'

For example, this allows for '-muniform-simt' code to be executed
single-threaded, which currently fails (device-side 'trap'): the '0xffffffff'
bitmask isn't correct if not all 32 threads of a warp are active.  The same
issue/fix, I suppose but have not verified, would apply if we were to allow for
OpenACC 'vector_length' smaller than 32, for example for OpenACC 'serial'.

We use 'nvptx_uniform_warp_check' only for PTX ISA version less than 6.0.
Otherwise we're using 'nvptx_warpsync', which emits 'bar.warp.sync 0xffffffff',
which evidently does the right thing.  (I've tested '-muniform-simt' code
executing single-threaded.)

The change that I proposed on 2022-12-15 was to emit PTX code to calculate
'(1 << %ntid.x) - 1' as the actual bitmask to use instead of '0xffffffff'.
This works, but the PTX JIT generates SASS code to do this computation.

In turn, this change now uses PTX 'vote.all.pred' -- which even simplifies upon
the original code a little bit, see the following exemplary SASS 'diff' before
vs. after this change:

    [...]
              /*[...]*/                   SYNC (*"BRANCH_TARGETS .L_x_332"*) }
      .L_x_332:
    -         /*[...]*/                   VOTE.ANY R9, PT, PT ;
    +         /*[...]*/                   VOTE.ALL P1, PT ;
    -         /*[...]*/                   ISETP.NE.U32.AND P1, PT, R9, -0x1, PT ;
    -         /*[...]*/              @!P1 BRA `(.L_x_333) ;
    +         /*[...]*/               @P1 BRA `(.L_x_333) ;
              /*[...]*/                   BPT.TRAP 0x1 ;
      .L_x_333:
    -         /*[...]*/               @P1 EXIT ;
    +         /*[...]*/              @!P1 EXIT ;
    [...]

	gcc/
	* config/nvptx/nvptx.md (nvptx_uniform_warp_check): Make fit for
	non-full-warp execution, via 'vote.all.pred'.
	gcc/testsuite/
	* gcc.target/nvptx/nvptx.exp
	(check_effective_target_default_ptx_isa_version_at_least_6_0): New.
	* gcc.target/nvptx/uniform-simt-2.c: Adjust.
	* gcc.target/nvptx/uniform-simt-5.c: New.
---
 gcc/config/nvptx/nvptx.md                     | 13 ++++-----
 gcc/testsuite/gcc.target/nvptx/nvptx.exp      |  5 ++++
 .../gcc.target/nvptx/uniform-simt-2.c         |  2 +-
 .../gcc.target/nvptx/uniform-simt-5.c         | 28 +++++++++++++++++++
 4 files changed, 39 insertions(+), 9 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/nvptx/uniform-simt-5.c

diff --git a/gcc/config/nvptx/nvptx.md b/gcc/config/nvptx/nvptx.md
index ef7e3fb00fa..7878a3b6f09 100644
--- a/gcc/config/nvptx/nvptx.md
+++ b/gcc/config/nvptx/nvptx.md
@@ -2316,14 +2316,11 @@
   {
     const char *insns[] = {
       "{",
-      "\\t" ".reg.b32" "\\t" "%%r_act;",
-      "%.\\t" "vote.ballot.b32" "\\t" "%%r_act,1;",
-      "\\t" ".reg.pred" "\\t" "%%r_do_abort;",
-      "\\t" "mov.pred" "\\t" "%%r_do_abort,0;",
-      "%.\\t" "setp.ne.b32" "\\t" "%%r_do_abort,%%r_act,"
-                                  "0xffffffff;",
-      "@ %%r_do_abort\\t" "trap;",
-      "@ %%r_do_abort\\t" "exit;",
+      "\\t" ".reg.pred" "\\t" "%%r_sync;",
+      "\\t" "mov.pred" "\\t" "%%r_sync, 1;",
+      "%.\\t" "vote.all.pred" "\\t" "%%r_sync, 1;",
+      "@!%%r_sync\\t" "trap;",
+      "@!%%r_sync\\t" "exit;",
       "}",
       NULL
     };
diff --git a/gcc/testsuite/gcc.target/nvptx/nvptx.exp b/gcc/testsuite/gcc.target/nvptx/nvptx.exp
index 97aa7ae0852..3151381f51a 100644
--- a/gcc/testsuite/gcc.target/nvptx/nvptx.exp
+++ b/gcc/testsuite/gcc.target/nvptx/nvptx.exp
@@ -49,6 +49,11 @@ proc check_effective_target_default_ptx_isa_version_at_least { major minor } {
     return $res
 }
 
+# Return 1 if code by default compiles for at least PTX ISA version 6.0.
+proc check_effective_target_default_ptx_isa_version_at_least_6_0 { } {
+    return [check_effective_target_default_ptx_isa_version_at_least 6 0]
+}
+
 # Return 1 if code with PTX ISA version major.minor or higher can be run.
 proc check_effective_target_runtime_ptx_isa_version_at_least { major minor } {
     set name runtime_ptx_isa_version_${major}_${minor}
diff --git a/gcc/testsuite/gcc.target/nvptx/uniform-simt-2.c b/gcc/testsuite/gcc.target/nvptx/uniform-simt-2.c
index b1eee0d618f..1d83c49a44b 100644
--- a/gcc/testsuite/gcc.target/nvptx/uniform-simt-2.c
+++ b/gcc/testsuite/gcc.target/nvptx/uniform-simt-2.c
@@ -17,4 +17,4 @@ f (void)
 
 /* { dg-final { scan-assembler-times "@%r\[0-9\]*\tatom.global.cas" 1 } } */
 /* { dg-final { scan-assembler-times "shfl.idx.b32" 1 } } */
-/* { dg-final { scan-assembler-times "vote.ballot.b32" 1 } } */
+/* { dg-final { scan-assembler-times "vote.all.pred" 1 } } */
diff --git a/gcc/testsuite/gcc.target/nvptx/uniform-simt-5.c b/gcc/testsuite/gcc.target/nvptx/uniform-simt-5.c
new file mode 100644
index 00000000000..cd6ea82d293
--- /dev/null
+++ b/gcc/testsuite/gcc.target/nvptx/uniform-simt-5.c
@@ -0,0 +1,28 @@
+/* Verify that '-muniform-simt' code may be executed single-threaded.
+
+   { dg-do run }
+   { dg-options {-save-temps -O2 -muniform-simt} } */
+
+enum memmodel
+{
+  MEMMODEL_RELAXED = 0
+};
+
+unsigned long long int v64;
+unsigned long long int *p64 = &v64;
+
+int
+main()
+{
+  /* Trigger uniform-SIMT processing.  */
+  __atomic_fetch_add (p64, v64, MEMMODEL_RELAXED);
+
+  return 0;
+}
+
+/* Per 'omp_simt_exit':
+    - 'nvptx_warpsync'
+      { dg-final { scan-assembler-times {bar\.warp\.sync\t0xffffffff;} 1 { target default_ptx_isa_version_at_least_6_0 } } }
+    - 'nvptx_uniform_warp_check'
+      { dg-final { scan-assembler-times {vote\.all\.pred\t%r_sync, 1;} 1 { target { ! default_ptx_isa_version_at_least_6_0 } } } }
+*/
-- 
2.34.1