From patchwork Fri Aug 20 20:43:57 2010
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: "H.J. Lu" <hjl.tools@gmail.com>
X-Patchwork-Id: 62316
Return-Path: 
 <gcc-patches-return-271075-incoming=patchwork.ozlabs.org@gcc.gnu.org>
X-Original-To: incoming@patchwork.ozlabs.org
Delivered-To: patchwork-incoming@bilbo.ozlabs.org
Received: from sourceware.org (server1.sourceware.org [209.132.180.131])
	by ozlabs.org (Postfix) with SMTP id 4EDBAB6F10
	for <incoming@patchwork.ozlabs.org>;
	Sat, 21 Aug 2010 06:44:08 +1000 (EST)
Received: (qmail 6969 invoked by alias); 20 Aug 2010 20:44:05 -0000
Received: (qmail 6961 invoked by uid 22791); 20 Aug 2010 20:44:04 -0000
X-SWARE-Spam-Status: No, hits=-1.8 required=5.0	tests=AWL, BAYES_00,
	DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, FREEMAIL_FROM,
	RCVD_IN_DNSWL_NONE
X-Spam-Check-By: sourceware.org
Received: from mail-vw0-f47.google.com (HELO mail-vw0-f47.google.com)
	(209.85.212.47) by sourceware.org (qpsmtpd/0.43rc1) with
	ESMTP; Fri, 20 Aug 2010 20:44:00 +0000
Received: by vws13 with SMTP id 13so3572696vws.20 for
	<gcc-patches@gcc.gnu.org>; Fri, 20 Aug 2010 13:43:58 -0700 (PDT)
Received: by 10.220.158.9 with SMTP id d9mr1248965vcx.33.1282337037933;
	Fri, 20 Aug 2010 13:43:57 -0700 (PDT)
Received: by 10.220.164.142 with HTTP; Fri, 20 Aug 2010 13:43:57 -0700 (PDT)
In-Reply-To: <4C6EE072.4070802@codesourcery.com>
References: <4C6EE072.4070802@codesourcery.com>
Date: Fri, 20 Aug 2010 13:43:57 -0700
Message-ID: <AANLkTinJX=DF7yvQgzRE1tNqcYvPXLUBLbaZJX_W4WTN@mail.gmail.com>
Subject: Re: Core 2 and Core i7 tuning
From: "H.J. Lu" <hjl.tools@gmail.com>
To: Bernd Schmidt <bernds@codesourcery.com>
Cc: GCC Patches <gcc-patches@gcc.gnu.org>,
	Maxim Kuvyrkov <maxim@codesourcery.com>,
	Paul Brook <paul@codesourcery.com>
X-IsSubscribed: yes
Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Id: <gcc-patches.gcc.gnu.org>
List-Unsubscribe: 
 <mailto:gcc-patches-unsubscribe-incoming=patchwork.ozlabs.org@gcc.gnu.org>
List-Archive: <http://gcc.gnu.org/ml/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-help@gcc.gnu.org>
Sender: gcc-patches-owner@gcc.gnu.org
Delivered-To: mailing list gcc-patches@gcc.gnu.org

On Fri, Aug 20, 2010 at 1:07 PM, Bernd Schmidt <bernds@codesourcery.com> wrote:
> Here's something I've been working on for a while.  This adds a corei7
> processor type, a Core 2/Core i7 scheduling description, and twiddles a
> few of the x86 tuning flags.  I'm not terribly happy with it yet due to
> the relatively small performance improvement, but I'd promised some
> folks I'd post it this week, so...
>
> The scheduling description is heavily based on ppro.md.  There seems to
> be no publicly available, detailed information from Intel about the Core
> 2 pipeline, so this work is based on Agner Fog's manuals.  It should be
> correct in the essentials, at least as well as ppro.md (we aren't really
> able to do a good job with the execution ports since we have no concept
> of the out-of-order core).  I have not tried to implement latencies or
> port reservations for every last MMX or SSE instruction, since who knows
> whether the information is totally accurate anyway.
>
> The i386 port has a lot of tuning flags, and I've mostly been running
> SPEC2000 benchmarks for the last few weeks, trying to find a set of them
> that works well on these processors.  This is slightly tricky since
> there's some inherent noise in the results.
>
> Not using the LEAVE instruction seemed to make a difference on my Penryn
> laptop in 64 bit mode, but that's probably moot now that
> -fomit-frame-pointer is the default.  I've changed a few others, but
> mostly these attempts resulted in lower or unchanged performance, for
> example:
>
>  * using push/pop insns more often (there are about six of these tuning
>   flags).  I would have expected this to be a win.
>  * reusing the PentiumPro code in ix86_adjust_cost for Core 2 and i7
>  * upping the branch cost to 5; initial results looked good for Core i7
>   but in a full SPEC2000 run it seemed to be a slight loss, and a large
>   loss on Core 2
>  * using different string algorithms (from tune_generic)
>  * enabling SPLIT_LONG_MOVES
>  * enabling the flags related to partial reg stalls
>  * reducing code alignments (based on a comment in Agner's manual that
>   they aren't important anymore)
>
> I've implemented a new tuning flag, X86_TUNE_PROMOTE_HI_CONSTANTS, based
> on the recommendation in Agner's manual not to use operand size prefixes
> when they change the length of the instruction (i.e. if there's an
> immediate operand).  That happens in the second of the following four
> instructions, and is said to cause a decoder stall:
>
> $ as
> orl $32768,%eax
> orw $32768,%ax
> orl $8,%eax
> orw $8,%ax
>
>   0:   0d 00 80 00 00          or     $0x8000,%eax
>   5:   66 0d 00 80             or     $0x8000,%ax
>   9:   83 c8 08                or     $0x8,%eax
>   c:   66 83 c8 08             or     $0x8,%ax
>
> This didn't seem to have a large impact either, however.
>
> On my last test run, I had
> SPECfp2000:
>  -mtune=generic  3023
>  -mtune=core2    3036
> SPECint2000:
>  -mtune=generic  2774
>  -mtune=core2    2794
>
> This is a Westmere Xeon, i.e. essentially a Core i7, in 32 bit mode.
> SPEC was locked to core 0 with schedtool, core 0 set to 3.2GHz manually
> with cpufreq-set (1 step below maximum, which seems to avoid turbo mode
> effectively).
> Compile flags were -O3 -mpc64 -frename-registers.  The tree is a few
> weeks old so it doesn't have -fomit-frame-pointer by default.  I also
> had -mtune=corei7 numbers, but they were a little lower since I was
> using that run for an experiment with higher branch costs.
>
> These numbers pretty much match the differences I was seeing on the Core
> 2 laptop during development.  I'd welcome it if other people would also
> run benchmarks.
>
> Comments?  Is this OK?
>
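
On the X86_TUNE_PROMOTE_HI_CONSTANTS point quoted above: judging by the
four encodings shown, the operand size prefix only changes the instruction
length when the immediate does not fit the sign-extended 8-bit form.  A
rough sketch of that test follows.  The names hi_imm_changes_length and
check_mail_examples are made up for illustration and are not taken from
the patch or from GCC.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper, not GCC code.  Returns true when a 16-bit ALU
   operation with this immediate has to use the imm16 encoding, i.e. when
   the 0x66 prefix changes the length of the instruction.  */
bool
hi_imm_changes_length (int32_t imm)
{
  /* Only the low 16 bits reach a 16-bit operation.  */
  uint32_t low = (uint32_t) imm & 0xffffu;

  /* Reinterpret those 16 bits as a signed value.  */
  int32_t v = low >= 0x8000u ? (int32_t) low - 0x10000 : (int32_t) low;

  /* Values in [-128, 127] fit the sign-extended imm8 form, where the
     prefix leaves the instruction length unchanged.  */
  return v < -128 || v > 127;
}

/* The two 16-bit cases from the quoted disassembly.  */
void
check_mail_examples (void)
{
  assert (hi_imm_changes_length (0x8000));   /* 66 0d 00 80 */
  assert (!hi_imm_changes_length (8));       /* 66 83 c8 08 */
}
```

Under this reading, only immediates for which the helper returns true
would be worth promoting to a 32-bit operation.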

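Please also include this patch.  It makes host_detect_local_cpu return
"corei7" instead of "core2" for the Nehalem and Westmere models, and
drops the FIXME comments that your tuning work makes obsolete.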
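
For reference, the mapping this hunk ends up with can be written as a
stand-alone function.  This is only a sketch: tune_for_model and
check_hunk_models are hypothetical names, the real switch lives inside
host_detect_local_cpu, and only the case labels visible in the hunk are
listed.  Returning NULL for everything else is my assumption, since the
default branch is not part of the hunk.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper mirroring the switch in the hunk below.  */
const char *
tune_for_model (unsigned int model)
{
  switch (model)
    {
    case 0x1e:
    case 0x1f:
    case 0x2e:
      /* Nehalem.  */
    case 0x25:
    case 0x2f:
      /* Westmere.  */
      return "corei7";
    case 0x17:
    case 0x1d:
      /* Penryn.  */
    case 0x0f:
      /* Merom.  */
      return "core2";
    default:
      /* Assumption: models outside the hunk are not covered here.  */
      return NULL;
    }
}

/* Spot checks against the case labels in the hunk.  */
void
check_hunk_models (void)
{
  assert (strcmp (tune_for_model (0x2e), "corei7") == 0);
  assert (strcmp (tune_for_model (0x17), "core2") == 0);
}
```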

Thanks.

diff --git a/gcc/config/i386/driver-i386.c b/gcc/config/i386/driver-i386.c
index 8a76857..998214b 100644
--- a/gcc/config/i386/driver-i386.c
+++ b/gcc/config/i386/driver-i386.c
@@ -554,21 +554,21 @@ const char *host_detect_local_cpu (int argc, const char **argv)
 	case 0x1e:
 	case 0x1f:
 	case 0x2e:
-	  /* FIXME: Optimize for Nehalem.  */
-	  cpu = "core2";
+	  /* Nehalem.  */
+	  cpu = "corei7";
 	  break;
 	case 0x25:
 	case 0x2f:
-	  /* FIXME: Optimize for Westmere.  */
-	  cpu = "core2";
+	  /* Westmere.  */
+	  cpu = "corei7";
 	  break;
 	case 0x17:
 	case 0x1d:
-	  /* Penryn.  FIXME: -mtune=core2 is slower than -mtune=generic  */
+	  /* Penryn.  */
 	  cpu = "core2";
 	  break;
 	case 0x0f:
-	  /* Merom.  FIXME: -mtune=core2 is slower than -mtune=generic  */
+	  /* Merom.  */
 	  cpu = "core2";
 	  break;
 	default: