[PR 81616] Deferring FMA transformations in tight loops

Hello,

the patch below prevents creation of fused multiply-and-add instructions
in the widening_mul gimple pass on Zen-based AMD CPUs and, as a result,
fixes regressions of native znver1 tuning when compared to generic
tuning in:

  - the matrix.c testcase of PR 81616 (straightforward matrix
    multiplication) at -O2 and -O3, where the slowdown is currently
    60% (!),

  - SPEC 2006 454.calculix at -O2, where it is currently over 20%, and

  - SPEC 2017 510.parest at -O2 and -Ofast, where it is currently also
    about 20% in both cases.

The basic idea is to detect loops in the following form:

    <bb 6>
    # accumulator_111 = PHI <0.0(5), accumulator_66(6)>
    ...
    _65 = _14 * _16;
    accumulator_66 = _65 + accumulator_111;

and to avoid creating an FMA for it.  Because at least the parest and
calculix cases require it, the patch also handles chains of more than
one FMA candidate in which each candidate feeds the next one's addend:

    <bb 6>
    # accumulator_111 = PHI <0.0(5), accumulator_66(6)>
    ...
    _65 = _14 * _16;
    accumulator_55 = _65 + accumulator_111;
    _67 = _24 * _36;
    accumulator_66 = _67 + accumulator_55;
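
For orientation, source loops of roughly the following shape are what
end up as the two GIMPLE patterns above.  This is only an illustrative
sketch; the function names dot1 and dot2 and their array arguments are
made up for this example and do not come from the testcases.

```c
#include <stddef.h>

/* One multiply-add per iteration: the single-candidate loop.  If the
   multiply and the add are fused, the whole FMA sits on the loop-carried
   dependency through acc; kept separate, only the add does and the
   multiply can be computed off that chain.  */
double
dot1 (const double *a, const double *b, size_t n)
{
  double acc = 0.0;
  for (size_t i = 0; i < n; i++)
    acc += a[i] * b[i];
  return acc;
}

/* Two multiply-adds per iteration where the result of the first is the
   addend of the second: the chain of two candidates.  */
double
dot2 (const double *a, const double *b, const double *c, const double *d,
      size_t n)
{
  double acc = 0.0;
  for (size_t i = 0; i < n; i++)
    {
      acc += a[i] * b[i];
      acc += c[i] * d[i];
    }
  return acc;
}
```

In both functions every FMA candidate in the loop body belongs to the one
chain that ends in the accumulator PHI, which is the situation the
deferring is aimed at.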

Unfortunately, to really get rid of the calculix regression, the
algorithm cannot just look at one BB at a time but also has to work for
cases like the following:

     1  void mult(void)
     2  {
     3      int i, j, k, l;
     4  
     5     for(i=0; i<SIZE; ++i)
     6     {
     7        for(j=0; j<SIZE; ++j)
     8        {
     9           for(l=0; l<SIZE; l+=10)
    10           {
    11               c[i][j] += a[i][l] * b[k][l];
    12               for(k=1; k<10; ++k)
    13               {
    14                   c[i][j] += a[i][l+k] * b[k][l+k];
    15               }
    16  
    17           }
    18        }
    19     }
    20  }

where the FMA on line 14 feeds into the one on line 11 in an
encompassing loop.  Therefore I have changed the structure of the pass
to work in reverse dominance order, and it keeps a hash set of the
results of rejected FMA candidates, which it checks when looking at PHI
nodes of the current BB.  Without this reorganization, calculix was
still 8% slower with native tuning than with generic tuning.

When the deferring mechanism realizes that the FMA candidates in the
current BB do not all form a single-chain tight loop like in the
examples above, it goes back to all the previously deferred candidates
(in the current BB only) and performs the transformation.
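
The decision described above can be modelled roughly as follows.  This is
a toy sketch and not the code in the patch: the cand type, the function
chain_is_tight_loop and the use of plain integers as stand-ins for SSA
names are all invented for illustration, and it only looks at a single
BB.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for one FMA candidate "result = mul + addend"; the ints
   play the role of SSA name versions.  Hypothetical type, not from the
   patch.  */
struct cand
{
  int addend;
  int result;
};

/* Return true if the candidates of one BB form a single chain that
   starts at the PHI result and whose last result is the PHI argument on
   the back edge, i.e. the case in which the FMAs stay deferred.  A false
   return models the point where deferring is cancelled and all
   candidates seen so far in the BB are converted after all.  */
bool
chain_is_tight_loop (const struct cand *c, size_t n, int phi_result,
                     int phi_backedge_arg)
{
  if (n == 0)
    return false;

  int expected_addend = phi_result;
  for (size_t i = 0; i < n; i++)
    {
      if (c[i].addend != expected_addend)
        return false;
      expected_addend = c[i].result;
    }
  return c[n - 1].result == phi_backedge_arg;
}
```

With the SSA versions from the examples above, the candidate list
{111 -> 66} and the list {111 -> 55, 55 -> 66} both count as tight
loops, whereas a list in which the second addend is some unrelated name
does not.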

The main reason for this is to keep the patch conservative (and also
simple), but it also means that the following function is not affected
and is still 20% slower when compiled with native march and tuning
compared to the generic one:

    void mult(struct s *p1, struct s *p2)
    {
       int i, j, k;

       for(i=0; i<SIZE; ++i)
       {
          for(j=0; j<SIZE; ++j)
          {
             for(k=0; k<SIZE; ++k)
             {
                p1->c[i][j] += p1->a[i][k] * p1->b[k][j];
                p2->c[i][j] += p2->a[i][k] * p2->b[k][j];
             }
          }
       }
    }

I suppose that the best optimization for the above would be to split the
loops, but one could probably construct at least an artificial testcase
where the FMAs would keep enough locality that splitting is not the
better choice.  The mechanism can easily be extended to keep track of
not just one chain but a few, preferably as a followup, if people think
it makes sense.
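
To make the loop-splitting remark concrete, this is roughly what the
split form of the function above could look like when written by hand.
The definitions of SIZE and struct s are assumptions made so that the
sketch compiles (the real ones are not shown in this message), and
mult_one and mult_split are names made up here.

```c
#define SIZE 4  /* Assumed; a small value just for illustration.  */

/* Assumed layout; the real struct is not shown in this message.  */
struct s
{
  double a[SIZE][SIZE], b[SIZE][SIZE], c[SIZE][SIZE];
};

/* A single accumulator chain per innermost loop, i.e. the shape that
   the deferring mechanism recognizes.  */
static void
mult_one (struct s *p)
{
  for (int i = 0; i < SIZE; ++i)
    for (int j = 0; j < SIZE; ++j)
      for (int k = 0; k < SIZE; ++k)
        p->c[i][j] += p->a[i][k] * p->b[k][j];
}

/* Split version of the two-chain loop.  Doing the work as two separate
   passes assumes that *p1 and *p2 do not overlap.  */
void
mult_split (struct s *p1, struct s *p2)
{
  mult_one (p1);
  mult_one (p2);
}
```

After the split, each innermost loop contains exactly one chain again,
at the cost of walking the loop nest twice.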

An interesting observation is that the matrix multiplication does not
suffer the penalty when compiled with -O3 -mprefer-vector-width=256.
Apparently the 256-bit vector processing can hide the latency penalty
when it is internally split into two halves.  The same goes for 512-bit
vectors.  That is why the patch leaves those alone; there is a param
for the threshold, which is set to zero for everybody but znver1.  If
maintainers of any other architecture suspect that their FMAs might
suffer from a similar latency problem, they can easily try tweaking that
parameter and see what happens with the matrix multiplication example.
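
The role of the threshold can be pictured with the following minimal
sketch.  It is an assumption about the shape of the check rather than a
copy of the patch: fma_deferring_applies is a made-up name, type_bits
stands for the size of the type the FMA would operate on, and max_bits
stands for the value of the new param.

```c
#include <stdbool.h>

/* max_bits models the param: zero switches the deferring off entirely,
   otherwise only operations on types of at most max_bits bits are
   considered for deferring.  Hypothetical helper, for illustration.  */
bool
fma_deferring_applies (unsigned type_bits, unsigned max_bits)
{
  return max_bits != 0 && type_bits <= max_bits;
}
```

Under this reading, and assuming a threshold of 128 for znver1 (which is
what the name X86_TUNE_AVOID_128FMA_CHAINS suggests), scalar and 128-bit
vector FMAs are subject to deferring while 256-bit and 512-bit ones are
not.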

I have bootstrapped and tested the patch on x86_64-linux (as it is and
also with the param set to 256 by default to make it trigger).  I have
also measured run-times of all benchmarks in SPEC 2006 FP and SPEC 2017
FPrate and the only changes are the big improvements of calculix and
parest.

After I address any comments and/or suggestions, would it be OK for
trunk?

Thanks,

Martin

2017-12-13  Martin Jambor  <mjambor@suse.cz>

	PR target/81616
	* params.def: New parameter PARAM_AVOID_FMA_MAX_BITS.
	* tree-ssa-math-opts.c: Include domwalk.h.
	(convert_mult_to_fma_1): New function.
	(fma_transformation_info): New type.
	(fma_deferring_state): Likewise.
	(cancel_fma_deferring): New function.
	(result_of_phi): Likewise.
	(last_fma_candidate_feeds_initial_phi): Likewise.
	(convert_mult_to_fma): Added deferring logic, split actual
	transformation to convert_mult_to_fma_1.
	(math_opts_dom_walker): New type.
	(math_opts_dom_walker::after_dom_children): New method, body moved
	here from pass_optimize_widening_mul::execute, added deferring logic
	bits.
	(pass_optimize_widening_mul::execute): Moved most of code to
	math_opts_dom_walker::after_dom_children.
	* config/i386/x86-tune.def (X86_TUNE_AVOID_128FMA_CHAINS): New.
	* config/i386/i386.c (ix86_option_override_internal): Added
	maybe_setting of PARAM_AVOID_FMA_MAX_BITS.
---
 gcc/config/i386/i386.c       |   5 +
 gcc/config/i386/x86-tune.def |   4 +
 gcc/params.def               |   5 +
 gcc/tree-ssa-math-opts.c     | 521 ++++++++++++++++++++++++++++++++-----------
 4 files changed, 407 insertions(+), 128 deletions(-)

Message ID	ri6mv2knkeg.fsf@suse.cz
State	New
From	Martin Jambor <mjambor@suse.cz>
To	GCC Patches <gcc-patches@gcc.gnu.org>
Date	Fri, 15 Dec 2017 15:19:35 +0100