From patchwork Fri Aug 20 20:43:57 2010
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: "H.J. Lu"
X-Patchwork-Id: 62316
Return-Path:
X-Original-To: incoming@patchwork.ozlabs.org
Delivered-To: patchwork-incoming@bilbo.ozlabs.org
Received: from sourceware.org (server1.sourceware.org [209.132.180.131])
 by ozlabs.org (Postfix) with SMTP id 4EDBAB6F10
 for ; Sat, 21 Aug 2010 06:44:08 +1000 (EST)
Received: (qmail 6969 invoked by alias); 20 Aug 2010 20:44:05 -0000
Received: (qmail 6961 invoked by uid 22791); 20 Aug 2010 20:44:04 -0000
X-SWARE-Spam-Status: No, hits=-1.8 required=5.0 tests=AWL, BAYES_00,
 DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, FREEMAIL_FROM, RCVD_IN_DNSWL_NONE
X-Spam-Check-By: sourceware.org
Received: from mail-vw0-f47.google.com (HELO mail-vw0-f47.google.com)
 (209.85.212.47) by sourceware.org (qpsmtpd/0.43rc1) with ESMTP;
 Fri, 20 Aug 2010 20:44:00 +0000
Received: by vws13 with SMTP id 13so3572696vws.20 for ;
 Fri, 20 Aug 2010 13:43:58 -0700 (PDT)
MIME-Version: 1.0
Received: by 10.220.158.9 with SMTP id d9mr1248965vcx.33.1282337037933;
 Fri, 20 Aug 2010 13:43:57 -0700 (PDT)
Received: by 10.220.164.142 with HTTP; Fri, 20 Aug 2010 13:43:57 -0700 (PDT)
In-Reply-To: <4C6EE072.4070802@codesourcery.com>
References: <4C6EE072.4070802@codesourcery.com>
Date: Fri, 20 Aug 2010 13:43:57 -0700
Message-ID:
Subject: Re: Core 2 and Core i7 tuning
From: "H.J. Lu"
To: Bernd Schmidt
Cc: GCC Patches, Maxim Kuvyrkov, Paul Brook
X-IsSubscribed: yes
Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Id:
List-Unsubscribe:
List-Archive:
List-Post:
List-Help:
Sender: gcc-patches-owner@gcc.gnu.org
Delivered-To: mailing list gcc-patches@gcc.gnu.org

On Fri, Aug 20, 2010 at 1:07 PM, Bernd Schmidt wrote:
> Here's something I've been working on for a while.  This adds a corei7
> processor type, a Core 2/Core i7 scheduling description, and twiddles a
> few of the x86 tuning flags.
> I'm not terribly happy with it yet due to
> the relatively small performance improvement, but I'd promised some
> folks I'd post it this week, so...
>
> The scheduling description is heavily based on ppro.md.  There seems to
> be no publicly available, detailed information from Intel about the Core
> 2 pipeline, so this work is based on Agner Fog's manuals.  It should be
> correct in the essentials, at least as well as ppro.md (we aren't really
> able to do a good job with the execution ports since we have no concept
> of the out-of-order core).  I have not tried to implement latencies or
> port reservations for every last MMX or SSE instruction, since who knows
> whether the information is totally accurate anyway.
>
> The i386 port has a lot of tuning flags, and I've mostly been running
> SPEC2000 benchmarks for the last few weeks, trying to find a set of them
> that works well on these processors.  This is slightly tricky since
> there's some inherent noise in the results.
>
> Not using the LEAVE instruction seemed to make a difference on my Penryn
> laptop in 64 bit mode, but that's probably moot now that
> -fomit-frame-pointer is the default.  I've changed a few others, but
> mostly these attempts resulted in lower or unchanged performance, for
> example:
>
>  * using push/pop insns more often (there are about six of these tuning
>    flags).  I would have expected this to be a win.
>  * reusing the PentiumPro code in ix86_adjust_cost for Core 2 and i7
>  * upping the branch cost to 5; initial results looked good for Core i7
>    but in a full SPEC2000 run it seemed to be a slight loss, and a large
>    loss on Core 2
>  * using different string algorithms (from tune_generic)
>  * enabling SPLIT_LONG_MOVES
>  * enabling the flags related to partial reg stalls
>  * reducing code alignments (based on a comment in Agner's manual that
>    they aren't important anymore)
>
> I've implemented a new tuning flag, X86_TUNE_PROMOTE_HI_CONSTANTS, based
> on the recommendation in Agner's manual not to use operand size prefixes
> when they change the length of the instruction (i.e. if there's an
> immediate operand).  That happens in the second of the following four
> instructions, and is said to cause a decoder stall:
>
> $ as
> orl $32768,%eax
> orw $32768,%ax
> orl $8,%eax
> orw $8,%ax
>
>    0:   0d 00 80 00 00          or     $0x8000,%eax
>    5:   66 0d 00 80             or     $0x8000,%ax
>    9:   83 c8 08                or     $0x8,%eax
>    c:   66 83 c8 08             or     $0x8,%ax
>
> This didn't seem to have a large impact either however.
>
> On my last test run, I had
> SPECfp2000:
>  -mtune=generic  3023
>  -mtune=core2    3036
> SPECint2000:
>  -mtune=generic  2774
>  -mtune=core2    2794
>
> This is a Westmere Xeon, i.e. essentially a Core i7, in 32 bit mode.
> SPEC was locked to core 0 with schedtool, core 0 set to 3.2GHz manually
> with cpufreq-set (1 step below maximum, which seems to avoid turbo mode
> effectively).
> Compile flags were -O3 -mpc64 -frename-registers.  The tree is a few
> weeks old so it doesn't have -fomit-frame-pointer by default.  I also
> had -mtune=corei7 numbers, but they were a little lower since I was
> using that run for an experiment with higher branch costs.
>
> These numbers pretty much match the differences I was seeing on the Core
> 2 laptop during development.
> I'd welcome if other people would also run
> benchmarks.
>
> Comments?  Is this OK?
>

Please also include this patch.

Thanks.

diff --git a/gcc/config/i386/driver-i386.c b/gcc/config/i386/driver-i386.c
index 8a76857..998214b 100644
--- a/gcc/config/i386/driver-i386.c
+++ b/gcc/config/i386/driver-i386.c
@@ -554,21 +554,21 @@ const char *host_detect_local_cpu (int argc, const char **argv)
 	case 0x1e:
 	case 0x1f:
 	case 0x2e:
-	  /* FIXME: Optimize for Nehalem.  */
-	  cpu = "core2";
+	  /* Nehalem.  */
+	  cpu = "corei7";
 	  break;
 	case 0x25:
 	case 0x2f:
-	  /* FIXME: Optimize for Westmere.  */
-	  cpu = "core2";
+	  /* Westmere.  */
+	  cpu = "corei7";
 	  break;
 	case 0x17:
 	case 0x1d:
-	  /* Penryn.  FIXME: -mtune=core2 is slower than -mtune=generic  */
+	  /* Penryn.  */
 	  cpu = "core2";
 	  break;
 	case 0x0f:
-	  /* Merom.  FIXME: -mtune=core2 is slower than -mtune=generic  */
+	  /* Merom.  */
 	  cpu = "core2";
 	  break;
 	default:
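
For reference, the mapping this hunk leaves in place can be sketched as a
standalone function.  This is only an illustration: tune_name_for_model is
a hypothetical helper that does not exist in driver-i386.c, it covers only
the model numbers visible in the hunk above, and returning NULL for
everything else is an assumption of the sketch, not what
host_detect_local_cpu does in its default case.

```c
#include <stddef.h>

/* Hypothetical helper, not part of driver-i386.c: return the -mtune
   name that the patched switch selects for a CPU model number.  Only
   the models visible in the hunk are handled; any other model yields
   NULL, which is an assumption of this sketch.  */
static const char *
tune_name_for_model (unsigned int model)
{
  switch (model)
    {
    case 0x1e:
    case 0x1f:
    case 0x2e:
      /* Nehalem.  */
      return "corei7";
    case 0x25:
    case 0x2f:
      /* Westmere.  */
      return "corei7";
    case 0x17:
    case 0x1d:
      /* Penryn.  */
      return "core2";
    case 0x0f:
      /* Merom.  */
      return "core2";
    default:
      /* Not covered by the hunk.  */
      return NULL;
    }
}
```

With the patch applied, Nehalem and Westmere models select corei7, while
Penryn and Merom stay on core2.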