From patchwork Tue Aug 30 14:53:45 2016
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Nathan Sidwell
X-Patchwork-Id: 664166
To: GCC Patches
From: Nathan Sidwell
Subject: [gomp4] runtime default compute dimensions
Message-ID: <8b489f0c-0159-80e5-28a8-1460029326d3@acm.org>
Date: Tue, 30 Aug 2016 10:53:45 -0400

This patch interrogates the target device to determine default geometry at runtime.
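
For readers who want the arithmetic in isolation, here is a minimal sketch of how the default geometry falls out of the four device attributes the patch queries. It only mirrors the three assignments in the patch; `struct geometry` and `default_geometry` are hypothetical names introduced for illustration and are not part of the plugin, and the attribute values are passed in as plain integers rather than read through `cuDeviceGetAttribute`.

```c
/* Hypothetical container for the three launch dimensions; not a
   libgomp type.  */
struct geometry
{
  int gang;
  int worker;
  int vector;
};

/* Derive default dimensions from device attributes, following the
   arithmetic in the patch:
     warp_size  - threads per warp
     block_size - maximum threads per block
     dev_size   - number of multiprocessors on the device
     cpu_size   - maximum resident threads per multiprocessor
   This helper is an illustration only and assumes all inputs are
   positive.  */
static struct geometry
default_geometry (int warp_size, int block_size, int dev_size, int cpu_size)
{
  struct geometry g;

  /* Full-size blocks that fit on one multiprocessor, times the number
     of multiprocessors.  */
  g.gang = (cpu_size / block_size) * dev_size;
  /* Warps per maximum-size block.  */
  g.worker = block_size / warp_size;
  /* One vector lane per thread in a warp.  */
  g.vector = warp_size;
  return g;
}
```

With made-up attribute values (not measured on any particular board) of warp_size=32, block_size=1024, dev_size=15 and cpu_size=2048, `default_geometry (32, 1024, 15, 2048)` gives 30 gangs, 32 workers and a vector length of 32.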
Querying at runtime makes the greatest difference to gang partitioning, where there's a noticeable sawtooth in the relationship between number of gangs and execution time. Picking the number of gangs as an exact multiple of the number of physical multiprocessors gets the best performance. Picking one more than that gives a step increase in execution time. The sawtooth gets blunter as the multiplication factor increases, as one might expect when scheduling smaller and smaller parcels of work onto a limited set of physical multiprocessors.

nathan

2016-08-30  Nathan Sidwell

        * plugin/plugin-nvptx.c (nvptx_exec): Interrogate board attributes
        to determine default geometry.

Index: plugin/plugin-nvptx.c
===================================================================
--- plugin/plugin-nvptx.c (revision 239862)
+++ plugin/plugin-nvptx.c (working copy)
@@ -938,14 +938,42 @@ nvptx_exec (void (*fn), size_t mapnum, v
             }
         }
 
-      /* Do some sanity checking.  The CUDA API doesn't appear to
-         provide queries to determine these limits.  */
+      int warp_size, block_size, dev_size, cpu_size;
+      CUdevice dev = nvptx_thread()->ptx_dev->dev;
+      /* 32 is the default for known hardware.  */
+      int gang = 0, worker = 32, vector = 32;
+
+      if (CUDA_SUCCESS == cuDeviceGetAttribute
+          (&block_size, CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_BLOCK, dev)
+          && CUDA_SUCCESS == cuDeviceGetAttribute
+          (&warp_size, CU_DEVICE_ATTRIBUTE_WARP_SIZE, dev)
+          && CUDA_SUCCESS == cuDeviceGetAttribute
+          (&dev_size, CU_DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT, dev)
+          && CUDA_SUCCESS == cuDeviceGetAttribute
+          (&cpu_size, CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_MULTIPROCESSOR, dev))
+        {
+          GOMP_PLUGIN_debug (0, " warp_size=%d, block_size=%d,"
+                             " dev_size=%d, cpu_size=%d\n",
+                             warp_size, block_size, dev_size, cpu_size);
+          gang = (cpu_size / block_size) * dev_size;
+          worker = block_size / warp_size;
+          vector = warp_size;
+        }
+
+      /* There is no upper bound on the gang size.  The best size
+         matches the hardware configuration.  Logical gangs are
+         scheduled onto physical hardware.  To maximize usage, we
+         should guess a large number.  */
       if (default_dims[GOMP_DIM_GANG] < 1)
-        default_dims[GOMP_DIM_GANG] = 32;
+        default_dims[GOMP_DIM_GANG] = gang ? gang : 1024;
+      /* The worker size must not exceed the hardware.  */
       if (default_dims[GOMP_DIM_WORKER] < 1
-          || default_dims[GOMP_DIM_WORKER] > 32)
-        default_dims[GOMP_DIM_WORKER] = 32;
-      default_dims[GOMP_DIM_VECTOR] = 32;
+          || (default_dims[GOMP_DIM_WORKER] > worker && gang))
+        default_dims[GOMP_DIM_WORKER] = worker;
+      /* The vector size must exactly match the hardware.  */
+      if (default_dims[GOMP_DIM_VECTOR] < 1
+          || (default_dims[GOMP_DIM_VECTOR] != vector && gang))
+        default_dims[GOMP_DIM_VECTOR] = vector;
 
       GOMP_PLUGIN_debug (0, " default dimensions [%d,%d,%d]\n",
                          default_dims[GOMP_DIM_GANG],