===================================================================
@@ -11937,6 +11937,9 @@ SSE2 and SSE3 instruction set support.
@item core2
Intel Core2 CPU with 64-bit extensions, MMX, SSE, SSE2, SSE3 and SSSE3
instruction set support.
+@item corei7
+Intel Core i7 CPU with 64-bit extensions, MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1
+and SSE4.2 instruction set support.
@item atom
Intel Atom CPU with 64-bit extensions, MMX, SSE, SSE2, SSE3 and SSSE3
instruction set support.
===================================================================
@@ -239,6 +239,7 @@ extern const struct processor_costs ix86
#define TARGET_ATHLON_K8 (TARGET_K8 || TARGET_ATHLON)
#define TARGET_NOCONA (ix86_tune == PROCESSOR_NOCONA)
#define TARGET_CORE2 (ix86_tune == PROCESSOR_CORE2)
+#define TARGET_COREI7 (ix86_tune == PROCESSOR_COREI7)
#define TARGET_GENERIC32 (ix86_tune == PROCESSOR_GENERIC32)
#define TARGET_GENERIC64 (ix86_tune == PROCESSOR_GENERIC64)
#define TARGET_GENERIC (TARGET_GENERIC32 || TARGET_GENERIC64)
@@ -274,6 +275,7 @@ enum ix86_tune_indices {
X86_TUNE_HIMODE_MATH,
X86_TUNE_PROMOTE_QI_REGS,
X86_TUNE_PROMOTE_HI_REGS,
+ X86_TUNE_PROMOTE_HI_CONSTANTS,
X86_TUNE_ADD_ESP_4,
X86_TUNE_ADD_ESP_8,
X86_TUNE_SUB_ESP_4,
@@ -348,6 +350,8 @@ extern unsigned char ix86_tune_features[
#define TARGET_HIMODE_MATH ix86_tune_features[X86_TUNE_HIMODE_MATH]
#define TARGET_PROMOTE_QI_REGS ix86_tune_features[X86_TUNE_PROMOTE_QI_REGS]
#define TARGET_PROMOTE_HI_REGS ix86_tune_features[X86_TUNE_PROMOTE_HI_REGS]
+#define TARGET_PROMOTE_HI_CONSTANTS \
+ ix86_tune_features[X86_TUNE_PROMOTE_HI_CONSTANTS]
#define TARGET_ADD_ESP_4 ix86_tune_features[X86_TUNE_ADD_ESP_4]
#define TARGET_ADD_ESP_8 ix86_tune_features[X86_TUNE_ADD_ESP_8]
#define TARGET_SUB_ESP_4 ix86_tune_features[X86_TUNE_SUB_ESP_4]
@@ -597,6 +601,7 @@ enum target_cpu_default
TARGET_CPU_DEFAULT_prescott,
TARGET_CPU_DEFAULT_nocona,
TARGET_CPU_DEFAULT_core2,
+ TARGET_CPU_DEFAULT_corei7,
TARGET_CPU_DEFAULT_atom,
TARGET_CPU_DEFAULT_geode,
@@ -2139,6 +2144,7 @@ enum processor_type
PROCESSOR_K8,
PROCESSOR_NOCONA,
PROCESSOR_CORE2,
+ PROCESSOR_COREI7,
PROCESSOR_GENERIC32,
PROCESSOR_GENERIC64,
PROCESSOR_AMDFAM10,
===================================================================
@@ -349,8 +349,8 @@ (define_constants
;; Processor type.
-(define_attr "cpu" "none,pentium,pentiumpro,geode,k6,athlon,k8,core2,atom,
- generic64,amdfam10,bdver1"
+(define_attr "cpu" "none,pentium,pentiumpro,geode,k6,athlon,k8,core2,corei7,
+ atom,generic64,amdfam10,bdver1"
(const (symbol_ref "ix86_schedule")))
;; A basic instruction type. Refinements due to arguments to be
@@ -388,6 +388,10 @@ (define_attr "unit" "integer,i387,sse,mm
(const_string "unknown")]
(const_string "integer")))
+;; For integer multiply insns, the number of operands.
+(define_attr "mul_operands" ""
+ (const_int 2))
+
;; The (bounding maximum) length of an instruction immediate.
(define_attr "length_immediate" ""
(cond [(eq_attr "type" "incdec,setcc,icmov,str,lea,other,multi,idiv,leave,
@@ -919,6 +923,7 @@ (define_mode_iterator P [(SI "Pmode == S
(include "athlon.md")
(include "geode.md")
(include "atom.md")
+(include "core2.md")
;; Operand and operator predicates and constraints
@@ -7010,6 +7015,7 @@ (define_insn "*mul<mode>3_1"
imul{<imodesuffix>}\t{%2, %1, %0|%0, %1, %2}
imul{<imodesuffix>}\t{%2, %0|%0, %2}"
[(set_attr "type" "imul")
+ (set_attr "mul_operands" "3,2,2")
(set_attr "prefix_0f" "0,0,1")
(set (attr "athlon_decode")
(cond [(eq_attr "cpu" "athlon")
@@ -7040,6 +7046,7 @@ (define_insn "*mulsi3_1_zext"
imul{l}\t{%2, %1, %k0|%k0, %1, %2}
imul{l}\t{%2, %k0|%k0, %2}"
[(set_attr "type" "imul")
+ (set_attr "mul_operands" "3,3,2")
(set_attr "prefix_0f" "0,0,1")
(set (attr "athlon_decode")
(cond [(eq_attr "cpu" "athlon")
@@ -7077,6 +7084,7 @@ (define_insn "*mulhi3_1"
imul{w}\t{%2, %1, %0|%0, %1, %2}
imul{w}\t{%2, %0|%0, %2}"
[(set_attr "type" "imul")
+ (set_attr "mul_operands" "3,3,2")
(set_attr "prefix_0f" "0,0,1")
(set (attr "athlon_decode")
(cond [(eq_attr "cpu" "athlon")
@@ -7103,6 +7111,7 @@ (define_insn "*mulqi3_1"
&& !(MEM_P (operands[1]) && MEM_P (operands[2]))"
"mul{b}\t%2"
[(set_attr "type" "imul")
+ (set_attr "mul_operands" "1")
(set_attr "length_immediate" "0")
(set (attr "athlon_decode")
(if_then_else (eq_attr "cpu" "athlon")
@@ -7144,6 +7153,7 @@ (define_insn "*<u>mul<mode><dwi>3_1"
"!(MEM_P (operands[1]) && MEM_P (operands[2]))"
"<sgnprefix>mul{<imodesuffix>}\t%2"
[(set_attr "type" "imul")
+ (set_attr "mul_operands" "1")
(set_attr "length_immediate" "0")
(set (attr "athlon_decode")
(if_then_else (eq_attr "cpu" "athlon")
@@ -7164,6 +7174,7 @@ (define_insn "*<u>mulqihi3_1"
&& !(MEM_P (operands[1]) && MEM_P (operands[2]))"
"<sgnprefix>mul{b}\t%2"
[(set_attr "type" "imul")
+ (set_attr "mul_operands" "1")
(set_attr "length_immediate" "0")
(set (attr "athlon_decode")
(if_then_else (eq_attr "cpu" "athlon")
@@ -7203,6 +7214,7 @@ (define_insn "*<s>muldi3_highpart_1"
&& !(MEM_P (operands[1]) && MEM_P (operands[2]))"
"<sgnprefix>mul{q}\t%2"
[(set_attr "type" "imul")
+ (set_attr "mul_operands" "1")
(set_attr "length_immediate" "0")
(set (attr "athlon_decode")
(if_then_else (eq_attr "cpu" "athlon")
@@ -7226,6 +7238,7 @@ (define_insn "*<s>mulsi3_highpart_1"
"!(MEM_P (operands[1]) && MEM_P (operands[2]))"
"<sgnprefix>mul{l}\t%2"
[(set_attr "type" "imul")
+ (set_attr "mul_operands" "1")
(set_attr "length_immediate" "0")
(set (attr "athlon_decode")
(if_then_else (eq_attr "cpu" "athlon")
@@ -7249,6 +7262,7 @@ (define_insn "*<s>mulsi3_highpart_zext"
&& !(MEM_P (operands[1]) && MEM_P (operands[2]))"
"<sgnprefix>mul{l}\t%2"
[(set_attr "type" "imul")
+ (set_attr "mul_operands" "1")
(set_attr "length_immediate" "0")
(set (attr "athlon_decode")
(if_then_else (eq_attr "cpu" "athlon")
===================================================================
@@ -0,0 +1,744 @@
+;; Scheduling for Core 2 and derived processors.
+;; Copyright (C) 2004, 2005, 2007, 2008, 2010 Free Software Foundation, Inc.
+;;
+;; This file is part of GCC.
+;;
+;; GCC is free software; you can redistribute it and/or modify
+;; it under the terms of the GNU General Public License as published by
+;; the Free Software Foundation; either version 3, or (at your option)
+;; any later version.
+;;
+;; GCC is distributed in the hope that it will be useful,
+;; but WITHOUT ANY WARRANTY; without even the implied warranty of
+;; MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+;; GNU General Public License for more details.
+;;
+;; You should have received a copy of the GNU General Public License
+;; along with GCC; see the file COPYING3. If not see
+;; <http://www.gnu.org/licenses/>. */
+
+;; The scheduling description in this file is based on the one in ppro.md,
+;; with additional information obtained from
+;;
+;; "How to optimize for the Pentium family of microprocessors",
+;; by Agner Fog, PhD.
+;;
+;; The major difference from the P6 pipeline is one extra decoder, and
+;; one extra execute unit. Due to micro-op fusion, many insns no longer
+;; need to be decoded in decoder 0, but can be handled by all of them.
+
+;; The core2_idiv, core2_fdiv and core2_ssediv automata are used to
+;; model issue latencies of idiv, fdiv and ssediv type insns.
+(define_automaton "core2_decoder,core2_core,core2_idiv,core2_fdiv,core2_ssediv,core2_load,core2_store")
+
+;; The CPU domain, used for Core i7 bypass latencies
+(define_attr "i7_domain" "int,float,simd"
+ (cond [(eq_attr "type" "fmov,fop,fsgn,fmul,fdiv,fpspc,fcmov,fcmp,fxch,fistp,fisttp,frndint")
+ (const_string "float")
+ (eq_attr "type" "sselog,sselog1,sseiadd,sseiadd1,sseishft,sseishft1,sseimul,
+ sse,ssemov,sseadd,ssemul,ssecmp,ssecomi,ssecvt,
+ ssecvt1,sseicvt,ssediv,sseins,ssemuladd,sse4arg")
+ (cond [(eq_attr "mode" "V4DF,V8SF,V2DF,V4SF,SF,DF")
+ (const_string "float")
+ (eq_attr "mode" "SI")
+ (const_string "int")]
+ (const_string "simd"))
+ (eq_attr "type" "mmx,mmxmov,mmxadd,mmxmul,mmxcmp,mmxcvt,mmxshft")
+ (const_string "simd")]
+ (const_string "int")))
+
+;; As for the Pentium Pro,
+;; - an instruction with 1 uop can be decoded by any of the four
+;; decoders in one cycle.
+;; - an instruction with 1 to 4 uops can be decoded only by decoder 0
+;; but still in only one cycle.
+;; - a complex (microcode) instruction can also only be decoded by
+;; decoder 0, and this takes an unspecified number of cycles.
+;;
+;; The goal is to schedule such that we have a few-one-one uops sequence
+;; in each cycle, to decode as many instructions per cycle as possible.
+(define_cpu_unit "c2_decoder0" "core2_decoder")
+(define_cpu_unit "c2_decoder1" "core2_decoder")
+(define_cpu_unit "c2_decoder2" "core2_decoder")
+(define_cpu_unit "c2_decoder3" "core2_decoder")
+
+;; We first wish to find an instruction for c2_decoder0, so exclude
+;; c2_decoder1, c2_decoder2 and c2_decoder3 from being reserved until
+;; c2_decoder0 is reserved.
+(presence_set "c2_decoder1" "c2_decoder0")
+(presence_set "c2_decoder2" "c2_decoder0")
+(presence_set "c2_decoder3" "c2_decoder0")
+
+;; Most instructions can be decoded on any of the four decoders.
+(define_reservation "c2_decodern" "(c2_decoder0|c2_decoder1|c2_decoder2|c2_decoder3)")
+
+;; The out-of-order core has six pipelines. These are similar to the
+;; Pentium Pro's five pipelines. Port 2 is responsible for memory loads,
+;; port 3 for store address calculations, port 4 for memory stores, and
+;; ports 0, 1 and 5 for everything else.
+
+(define_cpu_unit "c2_p0,c2_p1,c2_p5" "core2_core")
+(define_cpu_unit "c2_p2" "core2_load")
+(define_cpu_unit "c2_p3,c2_p4" "core2_store")
+(define_cpu_unit "c2_idiv" "core2_idiv")
+(define_cpu_unit "c2_fdiv" "core2_fdiv")
+(define_cpu_unit "c2_ssediv" "core2_ssediv")
+
+;; Only the irregular instructions have to be modeled here. A load
+;; increases the latency by 2 or 3, or by nothing if the manual gives
+;; a latency already. Store latencies are not accounted for.
+;;
+;; The simple instructions follow a very regular pattern of 1 uop per
+;; reg-reg operation, 1 uop per load on port 2. and 2 uops per store
+;; on port 4 and port 3. These instructions are modelled at the bottom
+;; of this file.
+;;
+;; For microcoded instructions we don't know how many uops are produced.
+;; These instructions are the "complex" ones in the Intel manuals. All
+;; we _do_ know is that they typically produce four or more uops, so
+;; they can only be decoded on c2_decoder0. Modelling their latencies
+;; doesn't make sense because we don't know how these instructions are
+;; executed in the core. So we just model that they can only be decoded
+;; on decoder 0, and say that it takes a little while before the result
+;; is available.
+(define_insn_reservation "c2_complex_insn" 6
+ (and (eq_attr "cpu" "core2,corei7")
+ (eq_attr "type" "other,multi,str"))
+ "c2_decoder0")
+
+(define_insn_reservation "c2_call" 1
+ (and (eq_attr "cpu" "core2,corei7")
+ (eq_attr "type" "call,callv"))
+ "c2_decoder0")
+
+;; imov with memory operands does not use the integer units.
+;; imovx always decodes to one uop, and also doesn't use the integer
+;; units if it has memory operands.
+(define_insn_reservation "c2_imov" 1
+ (and (eq_attr "cpu" "core2,corei7")
+ (and (eq_attr "memory" "none")
+ (eq_attr "type" "imov,imovx")))
+ "c2_decodern,(c2_p0|c2_p1|c2_p5)")
+
+(define_insn_reservation "c2_imov_load" 4
+ (and (eq_attr "cpu" "core2,corei7")
+ (and (eq_attr "memory" "load")
+ (eq_attr "type" "imov,imovx")))
+ "c2_decodern,c2_p2")
+
+(define_insn_reservation "c2_imov_store" 1
+ (and (eq_attr "cpu" "core2,corei7")
+ (and (eq_attr "memory" "store")
+ (eq_attr "type" "imov")))
+ "c2_decodern,c2_p4+c2_p3")
+
+(define_insn_reservation "c2_icmov" 2
+ (and (eq_attr "cpu" "core2,corei7")
+ (and (eq_attr "memory" "none")
+ (eq_attr "type" "icmov")))
+ "c2_decoder0,(c2_p0|c2_p1|c2_p5)*2")
+
+(define_insn_reservation "c2_icmov_load" 2
+ (and (eq_attr "cpu" "core2,corei7")
+ (and (eq_attr "memory" "load")
+ (eq_attr "type" "icmov")))
+ "c2_decoder0,c2_p2,(c2_p0|c2_p1|c2_p5)*2")
+
+(define_insn_reservation "c2_push_reg" 1
+ (and (eq_attr "cpu" "core2,corei7")
+ (and (eq_attr "memory" "store")
+ (eq_attr "type" "push")))
+ "c2_decodern,c2_p4+c2_p3")
+
+(define_insn_reservation "c2_push_mem" 1
+ (and (eq_attr "cpu" "core2,corei7")
+ (and (eq_attr "memory" "both")
+ (eq_attr "type" "push")))
+ "c2_decoder0,c2_p2,c2_p4+c2_p3")
+
+;; lea executes on port 0 with latency one and throughput 1.
+(define_insn_reservation "c2_lea" 1
+ (and (eq_attr "cpu" "core2,corei7")
+ (and (eq_attr "memory" "none")
+ (eq_attr "type" "lea")))
+ "c2_decodern,c2_p0")
+
+;; Shift and rotate decode as two uops which can go to port 0 or 5.
+;; The load and store units need to be reserved when memory operands
+;; are involved.
+(define_insn_reservation "c2_shift_rotate" 1
+ (and (eq_attr "cpu" "core2,corei7")
+ (and (eq_attr "memory" "none")
+ (eq_attr "type" "ishift,ishift1,rotate,rotate1")))
+ "c2_decodern,(c2_p0|c2_p5)")
+
+(define_insn_reservation "c2_shift_rotate_mem" 4
+ (and (eq_attr "cpu" "core2,corei7")
+ (and (eq_attr "memory" "!none")
+ (eq_attr "type" "ishift,ishift1,rotate,rotate1")))
+ "c2_decoder0,c2_p2,(c2_p0|c2_p5),c2_p4+c2_p3")
+
+;; See comments in ppro.md for the corresponding reservation.
+(define_insn_reservation "c2_branch" 1
+ (and (eq_attr "cpu" "core2,corei7")
+ (and (eq_attr "memory" "none")
+ (eq_attr "type" "ibr")))
+ "c2_decodern,c2_p5")
+
+;; ??? Indirect branches probably have worse latency than this.
+(define_insn_reservation "c2_indirect_branch" 6
+ (and (eq_attr "cpu" "core2,corei7")
+ (and (eq_attr "memory" "!none")
+ (eq_attr "type" "ibr")))
+ "c2_decoder0,c2_p2+c2_p5")
+
+(define_insn_reservation "c2_leave" 4
+ (and (eq_attr "cpu" "core2,corei7")
+ (eq_attr "type" "leave"))
+ "c2_decoder0,c2_p2+(c2_p0|c2_p1),(c2_p0|c2_p1)")
+
+;; mul and imul with two/three operands only execute on port 1 for HImode
+;; and SImode, port 0 for DImode.
+(define_insn_reservation "c2_imul_hisi" 3
+ (and (eq_attr "cpu" "core2,corei7")
+ (and (eq_attr "memory" "none")
+ (and (eq_attr "mode" "HI,SI")
+ (and (eq_attr "type" "imul")
+ (eq_attr "mul_operands" "2,3")))))
+ "c2_decodern,c2_p1")
+
+(define_insn_reservation "c2_imul_hisi_mem" 3
+ (and (eq_attr "cpu" "core2,corei7")
+ (and (eq_attr "memory" "!none")
+ (and (eq_attr "mode" "HI,SI")
+ (and (eq_attr "type" "imul")
+ (eq_attr "mul_operands" "2,3")))))
+ "c2_decoder0,c2_p2+c2_p1")
+
+(define_insn_reservation "c2_imul_di" 5
+ (and (eq_attr "cpu" "core2,corei7")
+ (and (eq_attr "memory" "none")
+ (and (eq_attr "mode" "DI")
+ (and (eq_attr "type" "imul")
+ (eq_attr "mul_operands" "2,3")))))
+ "c2_decodern,c2_p0")
+
+(define_insn_reservation "c2_imul_di_mem" 5
+ (and (eq_attr "cpu" "core2,corei7")
+ (and (eq_attr "memory" "!none")
+ (and (eq_attr "mode" "DI")
+ (and (eq_attr "type" "imul")
+ (eq_attr "mul_operands" "2,3")))))
+ "c2_decoder0,c2_p2+c2_p0")
+
+(define_insn_reservation "c2_imul_qi1" 3
+ (and (eq_attr "cpu" "core2,corei7")
+ (and (eq_attr "memory" "none")
+ (and (eq_attr "mode" "QI")
+ (and (eq_attr "type" "imul")
+ (eq_attr "mul_operands" "1")))))
+ "c2_decodern,c2_p1")
+
+(define_insn_reservation "c2_imul_qi1_mem" 3
+  (and (eq_attr "cpu" "core2,corei7")
+       (and (eq_attr "memory" "!none")
+            (and (eq_attr "mode" "QI")
+                 (and (eq_attr "type" "imul")
+                      (eq_attr "mul_operands" "1")))))
+  "c2_decoder0,c2_p2+c2_p1")
+
+(define_insn_reservation "c2_imul_hisi1" 5
+ (and (eq_attr "cpu" "core2,corei7")
+ (and (eq_attr "memory" "none")
+ (and (eq_attr "mode" "HI,SI")
+ (and (eq_attr "type" "imul")
+ (eq_attr "mul_operands" "1")))))
+ "c2_decoder0,c2_p1")
+
+(define_insn_reservation "c2_imul_hisi1_mem" 5
+  (and (eq_attr "cpu" "core2,corei7")
+       (and (eq_attr "memory" "!none")
+            (and (eq_attr "mode" "HI,SI")
+                 (and (eq_attr "type" "imul")
+                      (eq_attr "mul_operands" "1")))))
+  "c2_decoder0,c2_p2+c2_p1")
+
+(define_insn_reservation "c2_imul_di1" 7
+ (and (eq_attr "cpu" "core2,corei7")
+ (and (eq_attr "memory" "none")
+ (and (eq_attr "mode" "DI")
+ (and (eq_attr "type" "imul")
+ (eq_attr "mul_operands" "1")))))
+ "c2_decoder0,c2_p0")
+
+(define_insn_reservation "c2_imul_di1_mem" 7
+  (and (eq_attr "cpu" "core2,corei7")
+       (and (eq_attr "memory" "!none")
+            (and (eq_attr "mode" "DI")
+                 (and (eq_attr "type" "imul")
+                      (eq_attr "mul_operands" "1")))))
+  "c2_decoder0,c2_p2+c2_p0")
+
+;; div and idiv are very similar, so we model them the same.
+;; QI, HI, and SI have issue latency 12, 21, and 37, respectively.
+;; These issue latencies are modelled via the c2_idiv automaton.
+(define_insn_reservation "c2_idiv_QI" 19
+ (and (eq_attr "cpu" "core2,corei7")
+ (and (eq_attr "memory" "none")
+ (and (eq_attr "mode" "QI")
+ (eq_attr "type" "idiv"))))
+ "c2_decoder0,(c2_p0+c2_idiv)*2,(c2_p0|c2_p1)+c2_idiv,c2_idiv*9")
+
+(define_insn_reservation "c2_idiv_QI_load" 19
+ (and (eq_attr "cpu" "core2,corei7")
+ (and (eq_attr "memory" "load")
+ (and (eq_attr "mode" "QI")
+ (eq_attr "type" "idiv"))))
+ "c2_decoder0,c2_p2+c2_p0+c2_idiv,c2_p0+c2_idiv,(c2_p0|c2_p1)+c2_idiv,c2_idiv*9")
+
+(define_insn_reservation "c2_idiv_HI" 23
+ (and (eq_attr "cpu" "core2,corei7")
+ (and (eq_attr "memory" "none")
+ (and (eq_attr "mode" "HI")
+ (eq_attr "type" "idiv"))))
+ "c2_decoder0,(c2_p0+c2_idiv)*3,(c2_p0|c2_p1)+c2_idiv,c2_idiv*17")
+
+(define_insn_reservation "c2_idiv_HI_load" 23
+ (and (eq_attr "cpu" "core2,corei7")
+ (and (eq_attr "memory" "load")
+ (and (eq_attr "mode" "HI")
+ (eq_attr "type" "idiv"))))
+ "c2_decoder0,c2_p2+c2_p0+c2_idiv,c2_p0+c2_idiv,(c2_p0|c2_p1)+c2_idiv,c2_idiv*18")
+
+(define_insn_reservation "c2_idiv_SI" 39
+ (and (eq_attr "cpu" "core2,corei7")
+ (and (eq_attr "memory" "none")
+ (and (eq_attr "mode" "SI")
+ (eq_attr "type" "idiv"))))
+ "c2_decoder0,(c2_p0+c2_idiv)*3,(c2_p0|c2_p1)+c2_idiv,c2_idiv*33")
+
+(define_insn_reservation "c2_idiv_SI_load" 39
+ (and (eq_attr "cpu" "core2,corei7")
+ (and (eq_attr "memory" "load")
+ (and (eq_attr "mode" "SI")
+ (eq_attr "type" "idiv"))))
+ "c2_decoder0,c2_p2+c2_p0+c2_idiv,c2_p0+c2_idiv,(c2_p0|c2_p1)+c2_idiv,c2_idiv*34")
+
+;; x87 floating point operations.
+
+(define_insn_reservation "c2_fxch" 0
+ (and (eq_attr "cpu" "core2,corei7")
+ (eq_attr "type" "fxch"))
+ "c2_decodern")
+
+(define_insn_reservation "c2_fop" 3
+ (and (eq_attr "cpu" "core2,corei7")
+ (and (eq_attr "memory" "none,unknown")
+ (eq_attr "type" "fop")))
+ "c2_decodern,c2_p1")
+
+(define_insn_reservation "c2_fop_load" 5
+ (and (eq_attr "cpu" "core2,corei7")
+ (and (eq_attr "memory" "load")
+ (eq_attr "type" "fop")))
+ "c2_decoder0,c2_p2+c2_p1,c2_p1")
+
+(define_insn_reservation "c2_fop_store" 3
+ (and (eq_attr "cpu" "core2,corei7")
+ (and (eq_attr "memory" "store")
+ (eq_attr "type" "fop")))
+ "c2_decoder0,c2_p0,c2_p0,c2_p0+c2_p4+c2_p3")
+
+(define_insn_reservation "c2_fop_both" 5
+ (and (eq_attr "cpu" "core2,corei7")
+ (and (eq_attr "memory" "both")
+ (eq_attr "type" "fop")))
+ "c2_decoder0,c2_p2+c2_p0,c2_p0+c2_p4+c2_p3")
+
+(define_insn_reservation "c2_fsgn" 1
+ (and (eq_attr "cpu" "core2,corei7")
+ (eq_attr "type" "fsgn"))
+ "c2_decodern,c2_p0")
+
+(define_insn_reservation "c2_fistp" 5
+ (and (eq_attr "cpu" "core2,corei7")
+ (eq_attr "type" "fistp"))
+ "c2_decoder0,c2_p0*2,c2_p4+c2_p3")
+
+(define_insn_reservation "c2_fcmov" 2
+ (and (eq_attr "cpu" "core2,corei7")
+ (eq_attr "type" "fcmov"))
+ "c2_decoder0,c2_p0*2")
+
+(define_insn_reservation "c2_fcmp" 1
+ (and (eq_attr "cpu" "core2,corei7")
+ (and (eq_attr "memory" "none")
+ (eq_attr "type" "fcmp")))
+ "c2_decodern,c2_p1")
+
+(define_insn_reservation "c2_fcmp_load" 4
+ (and (eq_attr "cpu" "core2,corei7")
+ (and (eq_attr "memory" "load")
+ (eq_attr "type" "fcmp")))
+ "c2_decoder0,c2_p2+c2_p1")
+
+(define_insn_reservation "c2_fmov" 1
+ (and (eq_attr "cpu" "core2,corei7")
+ (and (eq_attr "memory" "none")
+ (eq_attr "type" "fmov")))
+ "c2_decodern,c2_p0")
+
+(define_insn_reservation "c2_fmov_load" 1
+ (and (eq_attr "cpu" "core2,corei7")
+ (and (eq_attr "memory" "load")
+ (and (eq_attr "mode" "!XF")
+ (eq_attr "type" "fmov"))))
+ "c2_decodern,c2_p2")
+
+(define_insn_reservation "c2_fmov_XF_load" 3
+ (and (eq_attr "cpu" "core2,corei7")
+ (and (eq_attr "memory" "load")
+ (and (eq_attr "mode" "XF")
+ (eq_attr "type" "fmov"))))
+ "c2_decoder0,(c2_p2+c2_p0)*2")
+
+(define_insn_reservation "c2_fmov_store" 1
+ (and (eq_attr "cpu" "core2,corei7")
+ (and (eq_attr "memory" "store")
+ (and (eq_attr "mode" "!XF")
+ (eq_attr "type" "fmov"))))
+ "c2_decodern,c2_p3+c2_p4")
+
+(define_insn_reservation "c2_fmov_XF_store" 3
+ (and (eq_attr "cpu" "core2,corei7")
+ (and (eq_attr "memory" "store")
+ (and (eq_attr "mode" "XF")
+ (eq_attr "type" "fmov"))))
+ "c2_decoder0,(c2_p3+c2_p4),(c2_p3+c2_p4)")
+
+;; fmul executes on port 0 with latency 5. It has issue latency 2,
+;; but we don't model this.
+(define_insn_reservation "c2_fmul" 5
+ (and (eq_attr "cpu" "core2,corei7")
+ (and (eq_attr "memory" "none")
+ (eq_attr "type" "fmul")))
+ "c2_decoder0,c2_p0*2")
+
+(define_insn_reservation "c2_fmul_load" 6
+ (and (eq_attr "cpu" "core2,corei7")
+ (and (eq_attr "memory" "load")
+ (eq_attr "type" "fmul")))
+ "c2_decoder0,c2_p2+c2_p0,c2_p0")
+
+;; fdiv latencies depend on the mode of the operands.  XFmode gives
+;; a latency of 38 cycles, DFmode gives 32, and SFmode gives latency 18.
+;; Division by a power of 2 takes only 9 cycles, but we cannot model
+;; that.  Throughput is equal to latency - 1, which we model using the
+;; c2_fdiv automaton.
+(define_insn_reservation "c2_fdiv_SF" 18
+ (and (eq_attr "cpu" "core2,corei7")
+ (and (eq_attr "memory" "none")
+ (and (eq_attr "mode" "SF")
+ (eq_attr "type" "fdiv,fpspc"))))
+ "c2_decodern,c2_p0+c2_fdiv,c2_fdiv*16")
+
+(define_insn_reservation "c2_fdiv_SF_load" 19
+ (and (eq_attr "cpu" "core2,corei7")
+ (and (eq_attr "memory" "load")
+ (and (eq_attr "mode" "SF")
+ (eq_attr "type" "fdiv,fpspc"))))
+ "c2_decoder0,c2_p2+c2_p0+c2_fdiv,c2_fdiv*16")
+
+(define_insn_reservation "c2_fdiv_DF" 32
+ (and (eq_attr "cpu" "core2,corei7")
+ (and (eq_attr "memory" "none")
+ (and (eq_attr "mode" "DF")
+ (eq_attr "type" "fdiv,fpspc"))))
+ "c2_decodern,c2_p0+c2_fdiv,c2_fdiv*30")
+
+(define_insn_reservation "c2_fdiv_DF_load" 33
+ (and (eq_attr "cpu" "core2,corei7")
+ (and (eq_attr "memory" "load")
+ (and (eq_attr "mode" "DF")
+ (eq_attr "type" "fdiv,fpspc"))))
+ "c2_decoder0,c2_p2+c2_p0+c2_fdiv,c2_fdiv*30")
+
+(define_insn_reservation "c2_fdiv_XF" 38
+ (and (eq_attr "cpu" "core2,corei7")
+ (and (eq_attr "memory" "none")
+ (and (eq_attr "mode" "XF")
+ (eq_attr "type" "fdiv,fpspc"))))
+ "c2_decodern,c2_p0+c2_fdiv,c2_fdiv*36")
+
+(define_insn_reservation "c2_fdiv_XF_load" 39
+ (and (eq_attr "cpu" "core2,corei7")
+ (and (eq_attr "memory" "load")
+ (and (eq_attr "mode" "XF")
+ (eq_attr "type" "fdiv,fpspc"))))
+ "c2_decoder0,c2_p2+c2_p0+c2_fdiv,c2_fdiv*36")
+
+;; MMX instructions.
+
+(define_insn_reservation "c2_mmx_add" 1
+ (and (eq_attr "cpu" "core2,corei7")
+ (and (eq_attr "memory" "none")
+ (eq_attr "type" "mmxadd,sseiadd")))
+ "c2_decodern,c2_p0|c2_p5")
+
+(define_insn_reservation "c2_mmx_add_load" 2
+ (and (eq_attr "cpu" "core2,corei7")
+ (and (eq_attr "memory" "load")
+ (eq_attr "type" "mmxadd,sseiadd")))
+ "c2_decodern,c2_p2+c2_p0|c2_p5")
+
+(define_insn_reservation "c2_mmx_shft" 1
+ (and (eq_attr "cpu" "core2,corei7")
+ (and (eq_attr "memory" "none")
+ (eq_attr "type" "mmxshft")))
+ "c2_decodern,c2_p0|c2_p5")
+
+(define_insn_reservation "c2_mmx_shft_load" 2
+ (and (eq_attr "cpu" "core2,corei7")
+ (and (eq_attr "memory" "load")
+ (eq_attr "type" "mmxshft")))
+ "c2_decoder0,c2_p2+c2_p1")
+
+(define_insn_reservation "c2_mmx_sse_shft" 1
+ (and (eq_attr "cpu" "core2,corei7")
+ (and (eq_attr "memory" "none")
+ (and (eq_attr "type" "sseishft")
+ (eq_attr "length_immediate" "!0"))))
+ "c2_decodern,c2_p1")
+
+(define_insn_reservation "c2_mmx_sse_shft_load" 2
+ (and (eq_attr "cpu" "core2,corei7")
+ (and (eq_attr "memory" "load")
+ (and (eq_attr "type" "sseishft")
+ (eq_attr "length_immediate" "!0"))))
+ "c2_decodern,c2_p1")
+
+(define_insn_reservation "c2_mmx_sse_shft1" 2
+ (and (eq_attr "cpu" "core2,corei7")
+ (and (eq_attr "memory" "none")
+ (and (eq_attr "type" "sseishft")
+ (eq_attr "length_immediate" "0"))))
+ "c2_decodern,c2_p1")
+
+(define_insn_reservation "c2_mmx_sse_shft1_load" 3
+ (and (eq_attr "cpu" "core2,corei7")
+ (and (eq_attr "memory" "load")
+ (and (eq_attr "type" "sseishft")
+ (eq_attr "length_immediate" "0"))))
+ "c2_decodern,c2_p1")
+
+(define_insn_reservation "c2_mmx_mul" 3
+ (and (eq_attr "cpu" "core2,corei7")
+ (and (eq_attr "memory" "none")
+ (eq_attr "type" "mmxmul,sseimul")))
+ "c2_decodern,c2_p1")
+
+(define_insn_reservation "c2_mmx_mul_load" 3
+  (and (eq_attr "cpu" "core2,corei7")
+       (and (eq_attr "memory" "load")
+            (eq_attr "type" "mmxmul,sseimul")))
+  "c2_decoder0,c2_p2+c2_p1")
+
+(define_insn_reservation "c2_sse_mmxcvt" 4
+ (and (eq_attr "cpu" "core2,corei7")
+ (and (eq_attr "mode" "DI")
+ (eq_attr "type" "mmxcvt")))
+ "c2_decodern,c2_p1")
+
+;; FIXME: These are Pentium III only, but we cannot tell here if
+;; we're generating code for PentiumPro/Pentium II or Pentium III
+;; (define_insn_reservation "c2_sse_mmxshft" 2
+;; (and (eq_attr "cpu" "core2,corei7")
+;; (and (eq_attr "mode" "TI")
+;; (eq_attr "type" "mmxshft")))
+;; "c2_decodern,c2_p0")
+
+;; The sfence instruction.
+(define_insn_reservation "c2_sse_sfence" 3
+ (and (eq_attr "cpu" "core2,corei7")
+ (and (eq_attr "memory" "unknown")
+ (eq_attr "type" "sse")))
+ "c2_decoder0,c2_p4+c2_p3")
+
+;; FIXME: This reservation is all wrong when we're scheduling sqrtss.
+(define_insn_reservation "c2_sse_SFDF" 3
+ (and (eq_attr "cpu" "core2,corei7")
+ (and (eq_attr "mode" "SF,DF")
+ (eq_attr "type" "sse")))
+ "c2_decodern,c2_p0")
+
+(define_insn_reservation "c2_sse_V4SF" 4
+ (and (eq_attr "cpu" "core2,corei7")
+ (and (eq_attr "mode" "V4SF")
+ (eq_attr "type" "sse")))
+ "c2_decoder0,c2_p1*2")
+
+(define_insn_reservation "c2_sse_addcmp" 3
+ (and (eq_attr "cpu" "core2,corei7")
+ (and (eq_attr "memory" "none")
+ (eq_attr "type" "sseadd,ssecmp,ssecomi")))
+ "c2_decodern,c2_p1")
+
+(define_insn_reservation "c2_sse_addcmp_load" 3
+ (and (eq_attr "cpu" "core2,corei7")
+ (and (eq_attr "memory" "load")
+ (eq_attr "type" "sseadd,ssecmp,ssecomi")))
+ "c2_decodern,c2_p2+c2_p1")
+
+(define_insn_reservation "c2_sse_mul_SF" 4
+ (and (eq_attr "cpu" "core2,corei7")
+ (and (eq_attr "memory" "none")
+ (and (eq_attr "mode" "SF,V4SF")
+ (eq_attr "type" "ssemul"))))
+ "c2_decodern,c2_p0")
+
+(define_insn_reservation "c2_sse_mul_SF_load" 4
+ (and (eq_attr "cpu" "core2,corei7")
+ (and (eq_attr "memory" "load")
+ (and (eq_attr "mode" "SF,V4SF")
+ (eq_attr "type" "ssemul"))))
+ "c2_decodern,c2_p2+c2_p0")
+
+(define_insn_reservation "c2_sse_mul_DF" 5
+ (and (eq_attr "cpu" "core2,corei7")
+ (and (eq_attr "memory" "none")
+ (and (eq_attr "mode" "DF,V2DF")
+ (eq_attr "type" "ssemul"))))
+ "c2_decodern,c2_p0")
+
+(define_insn_reservation "c2_sse_mul_DF_load" 5
+ (and (eq_attr "cpu" "core2,corei7")
+ (and (eq_attr "memory" "load")
+ (and (eq_attr "mode" "DF,V2DF")
+ (eq_attr "type" "ssemul"))))
+ "c2_decodern,c2_p2+c2_p0")
+
+(define_insn_reservation "c2_sse_div_SF" 18
+ (and (eq_attr "cpu" "core2,corei7")
+ (and (eq_attr "memory" "none")
+ (and (eq_attr "mode" "SF,V4SF")
+ (eq_attr "type" "ssediv"))))
+ "c2_decodern,c2_p0,c2_ssediv*17")
+
+(define_insn_reservation "c2_sse_div_SF_load" 18
+  (and (eq_attr "cpu" "core2,corei7")
+       (and (eq_attr "memory" "load")
+            (and (eq_attr "mode" "SF,V4SF")
+                 (eq_attr "type" "ssediv"))))
+  "c2_decodern,(c2_p2+c2_p0),c2_ssediv*17")
+
+(define_insn_reservation "c2_sse_div_DF" 32
+ (and (eq_attr "cpu" "core2,corei7")
+ (and (eq_attr "memory" "none")
+ (and (eq_attr "mode" "DF,V2DF")
+ (eq_attr "type" "ssediv"))))
+ "c2_decodern,c2_p0,c2_ssediv*31")
+
+(define_insn_reservation "c2_sse_div_DF_load" 32
+  (and (eq_attr "cpu" "core2,corei7")
+       (and (eq_attr "memory" "load")
+            (and (eq_attr "mode" "DF,V2DF")
+                 (eq_attr "type" "ssediv"))))
+  "c2_decodern,(c2_p2+c2_p0),c2_ssediv*31")
+
+;; FIXME: these have limited throughput
+(define_insn_reservation "c2_sse_icvt_SF" 4
+ (and (eq_attr "cpu" "core2,corei7")
+ (and (eq_attr "memory" "none")
+ (and (eq_attr "mode" "SF")
+ (eq_attr "type" "sseicvt"))))
+ "c2_decodern,c2_p1")
+
+(define_insn_reservation "c2_sse_icvt_SF_load" 4
+ (and (eq_attr "cpu" "core2,corei7")
+ (and (eq_attr "memory" "!none")
+ (and (eq_attr "mode" "SF")
+ (eq_attr "type" "sseicvt"))))
+ "c2_decodern,c2_p2+c2_p1")
+
+(define_insn_reservation "c2_sse_icvt_DF" 4
+ (and (eq_attr "cpu" "core2,corei7")
+ (and (eq_attr "memory" "none")
+ (and (eq_attr "mode" "DF")
+ (eq_attr "type" "sseicvt"))))
+ "c2_decoder0,c2_p0+c2_p1")
+
+(define_insn_reservation "c2_sse_icvt_DF_load" 4
+ (and (eq_attr "cpu" "core2,corei7")
+ (and (eq_attr "memory" "!none")
+ (and (eq_attr "mode" "DF")
+ (eq_attr "type" "sseicvt"))))
+ "c2_decoder0,(c2_p2+c2_p1)")
+
+(define_insn_reservation "c2_sse_icvt_SI" 3
+ (and (eq_attr "cpu" "core2,corei7")
+ (and (eq_attr "memory" "none")
+ (and (eq_attr "mode" "SI")
+ (eq_attr "type" "sseicvt"))))
+ "c2_decodern,c2_p1")
+
+(define_insn_reservation "c2_sse_icvt_SI_load" 3
+ (and (eq_attr "cpu" "core2,corei7")
+ (and (eq_attr "memory" "!none")
+ (and (eq_attr "mode" "SI")
+ (eq_attr "type" "sseicvt"))))
+ "c2_decodern,(c2_p2+c2_p1)")
+
+(define_insn_reservation "c2_sse_mov" 1
+ (and (eq_attr "cpu" "core2,corei7")
+ (and (eq_attr "memory" "none")
+ (eq_attr "type" "ssemov")))
+ "c2_decodern,(c2_p0|c2_p1|c2_p5)")
+
+(define_insn_reservation "c2_sse_mov_load" 2
+ (and (eq_attr "cpu" "core2,corei7")
+ (and (eq_attr "memory" "load")
+ (eq_attr "type" "ssemov")))
+ "c2_decodern,c2_p2")
+
+(define_insn_reservation "c2_sse_mov_store" 1
+ (and (eq_attr "cpu" "core2,corei7")
+ (and (eq_attr "memory" "store")
+ (eq_attr "type" "ssemov")))
+ "c2_decodern,c2_p4+c2_p3")
+
+;; All other instructions are modelled as simple instructions.
+;; We have already modelled all i387 floating point instructions, so all
+;; other instructions execute on either port 0, 1 or 5. This includes
+;; the ALU units, and the MMX units.
+;;
+;; reg-reg instructions produce 1 uop so they can be decoded on any of
+;; the four decoders.  Loads benefit from micro-op fusion and can be
+;; treated in the same way.
+(define_insn_reservation "c2_insn" 1
+ (and (eq_attr "cpu" "core2,corei7")
+ (and (eq_attr "memory" "none,unknown")
+ (eq_attr "type" "alu,alu1,negnot,incdec,icmp,test,setcc,sseishft1,mmx,mmxcmp")))
+ "c2_decodern,(c2_p0|c2_p1|c2_p5)")
+
+(define_insn_reservation "c2_insn_load" 4
+ (and (eq_attr "cpu" "core2,corei7")
+ (and (eq_attr "memory" "load")
+ (eq_attr "type" "alu,alu1,negnot,incdec,icmp,test,setcc,pop,sseishft1,mmx,mmxcmp")))
+ "c2_decodern,c2_p2,(c2_p0|c2_p1|c2_p5)")
+
+;; register-memory instructions have three uops, so they have to be
+;; decoded on c2_decoder0.
+(define_insn_reservation "c2_insn_store" 1
+ (and (eq_attr "cpu" "core2,corei7")
+ (and (eq_attr "memory" "store")
+ (eq_attr "type" "alu,alu1,negnot,incdec,icmp,test,setcc,sseishft1,mmx,mmxcmp")))
+ "c2_decoder0,(c2_p0|c2_p1|c2_p5),c2_p4+c2_p3")
+
+;; read-modify-store instructions produce 4 uops so they have to be
+;; decoded on c2_decoder0 as well.
+(define_insn_reservation "c2_insn_both" 4
+ (and (eq_attr "cpu" "core2,corei7")
+ (and (eq_attr "memory" "both")
+ (eq_attr "type" "alu,alu1,negnot,incdec,icmp,test,setcc,pop,sseishft1,mmx,mmxcmp")))
+ "c2_decoder0,c2_p2,(c2_p0|c2_p1|c2_p5),c2_p4+c2_p3")
+
===================================================================
@@ -122,6 +122,10 @@ ix86_target_macros_internal (int isa_fla
def_or_undef (parse_in, "__core2");
def_or_undef (parse_in, "__core2__");
break;
+ case PROCESSOR_COREI7:
+ def_or_undef (parse_in, "__corei7");
+ def_or_undef (parse_in, "__corei7__");
+ break;
case PROCESSOR_ATOM:
def_or_undef (parse_in, "__atom");
def_or_undef (parse_in, "__atom__");
@@ -197,6 +201,9 @@ ix86_target_macros_internal (int isa_fla
case PROCESSOR_CORE2:
def_or_undef (parse_in, "__tune_core2__");
break;
+ case PROCESSOR_COREI7:
+ def_or_undef (parse_in, "__tune_corei7__");
+ break;
case PROCESSOR_ATOM:
def_or_undef (parse_in, "__tune_atom__");
break;
===================================================================
@@ -1124,6 +1124,79 @@ struct processor_costs core2_cost = {
};
static const
+struct processor_costs corei7_cost = {
+ COSTS_N_INSNS (1), /* cost of an add instruction */
+ COSTS_N_INSNS (1) + 1, /* cost of a lea instruction */
+ COSTS_N_INSNS (1), /* variable shift costs */
+ COSTS_N_INSNS (1), /* constant shift costs */
+ {COSTS_N_INSNS (3), /* cost of starting multiply for QI */
+ COSTS_N_INSNS (3), /* HI */
+ COSTS_N_INSNS (3), /* SI */
+ COSTS_N_INSNS (3), /* DI */
+ COSTS_N_INSNS (3)}, /* other */
+ 0, /* cost of multiply per each bit set (0: multiply cost is independent of operand bit pattern) */
+ {COSTS_N_INSNS (22), /* cost of a divide/mod for QI */
+ COSTS_N_INSNS (22), /* HI */
+ COSTS_N_INSNS (22), /* SI */
+ COSTS_N_INSNS (22), /* DI */
+ COSTS_N_INSNS (22)}, /* other */
+ COSTS_N_INSNS (1), /* cost of movsx */
+ COSTS_N_INSNS (1), /* cost of movzx */
+ 8, /* "large" insn */
+ 16, /* MOVE_RATIO */
+ 2, /* cost for loading QImode using movzbl */
+ {6, 6, 6}, /* cost of loading integer registers
+ in QImode, HImode and SImode.
+ Relative to reg-reg move (2). */
+ {4, 4, 4}, /* cost of storing integer registers */
+ 2, /* cost of reg,reg fld/fst */
+ {6, 6, 6}, /* cost of loading fp registers
+ in SFmode, DFmode and XFmode */
+ {4, 4, 4}, /* cost of storing fp registers
+ in SFmode, DFmode and XFmode */
+ 2, /* cost of moving MMX register */
+ {6, 6}, /* cost of loading MMX registers
+ in SImode and DImode */
+ {4, 4}, /* cost of storing MMX registers
+ in SImode and DImode */
+ 2, /* cost of moving SSE register */
+ {6, 6, 6}, /* cost of loading SSE registers
+ in SImode, DImode and TImode */
+ {4, 4, 4}, /* cost of storing SSE registers
+ in SImode, DImode and TImode */
+ 2, /* MMX or SSE register to integer */
+ 32, /* size of l1 cache. */
+ 256, /* size of l2 cache (the shared last-level cache has no field in this struct). */
+ 128, /* size of prefetch block */
+ 8, /* number of parallel prefetches */
+ 3, /* Branch cost */
+ COSTS_N_INSNS (3), /* cost of FADD and FSUB insns. */
+ COSTS_N_INSNS (5), /* cost of FMUL instruction. */
+ COSTS_N_INSNS (32), /* cost of FDIV instruction. */
+ COSTS_N_INSNS (1), /* cost of FABS instruction. */
+ COSTS_N_INSNS (1), /* cost of FCHS instruction. */
+ COSTS_N_INSNS (58), /* cost of FSQRT instruction. */
+ {{libcall, {{11, loop}, {-1, rep_prefix_4_byte}}},
+ {libcall, {{32, loop}, {64, rep_prefix_4_byte},
+ {8192, rep_prefix_8_byte}, {-1, libcall}}}},
+ {{libcall, {{8, loop}, {15, unrolled_loop},
+ {2048, rep_prefix_4_byte}, {-1, libcall}}},
+ {libcall, {{24, loop}, {32, unrolled_loop},
+ {8192, rep_prefix_8_byte}, {-1, libcall}}}},
+ 1, /* scalar_stmt_cost. */
+ 1, /* scalar load_cost. */
+ 1, /* scalar_store_cost. */
+ 1, /* vec_stmt_cost. */
+ 1, /* vec_to_scalar_cost. */
+ 1, /* scalar_to_vec_cost. */
+ 1, /* vec_align_load_cost. */
+ 2, /* vec_unalign_load_cost. */
+ 1, /* vec_store_cost. */
+ 3, /* cond_taken_branch_cost. */
+ 1, /* cond_not_taken_branch_cost. */
+};
+
+static const
struct processor_costs atom_cost = {
COSTS_N_INSNS (1), /* cost of an add instruction */
COSTS_N_INSNS (1) + 1, /* cost of a lea instruction */
@@ -1355,6 +1428,8 @@ const struct processor_costs *ix86_cost
#define m_PENT4 (1<<PROCESSOR_PENTIUM4)
#define m_NOCONA (1<<PROCESSOR_NOCONA)
#define m_CORE2 (1<<PROCESSOR_CORE2)
+#define m_COREI7 (1<<PROCESSOR_COREI7)
+#define m_CORE2I7 (m_CORE2 | m_COREI7) /* tunings shared by Core 2 and Core i7 */
#define m_ATOM (1<<PROCESSOR_ATOM)
#define m_GEODE (1<<PROCESSOR_GEODE)
@@ -1384,18 +1459,18 @@ static unsigned int initial_ix86_tune_fe
negatively, so enabling for Generic64 seems like good code size
tradeoff.  We can't enable it for 32bit generic because it does not
work well with PPro base chips. */
- m_386 | m_K6_GEODE | m_AMD_MULTIPLE | m_CORE2 | m_GENERIC64,
+ m_386 | m_K6_GEODE | m_AMD_MULTIPLE | m_GENERIC64,
/* X86_TUNE_PUSH_MEMORY */
m_386 | m_K6_GEODE | m_AMD_MULTIPLE | m_PENT4
- | m_NOCONA | m_CORE2 | m_GENERIC,
+ | m_NOCONA | m_CORE2I7 | m_GENERIC,
/* X86_TUNE_ZERO_EXTEND_WITH_AND */
m_486 | m_PENT,
/* X86_TUNE_UNROLL_STRLEN */
m_486 | m_PENT | m_ATOM | m_PPRO | m_AMD_MULTIPLE | m_K6
- | m_CORE2 | m_GENERIC,
+ | m_CORE2I7 | m_GENERIC,
/* X86_TUNE_DEEP_BRANCH_PREDICTION */
m_ATOM | m_PPRO | m_K6_GEODE | m_AMD_MULTIPLE | m_PENT4 | m_GENERIC,
@@ -1411,12 +1486,12 @@ static unsigned int initial_ix86_tune_fe
/* X86_TUNE_USE_SAHF */
m_ATOM | m_PPRO | m_K6_GEODE | m_K8 | m_AMDFAM10 | m_BDVER1 | m_PENT4
- | m_NOCONA | m_CORE2 | m_GENERIC,
+ | m_NOCONA | m_CORE2I7 | m_GENERIC,
/* X86_TUNE_MOVX: Enable to zero extend integer registers to avoid
partial dependencies. */
m_AMD_MULTIPLE | m_ATOM | m_PPRO | m_PENT4 | m_NOCONA
- | m_CORE2 | m_GENERIC | m_GEODE /* m_386 | m_K6 */,
+ | m_CORE2I7 | m_GENERIC | m_GEODE /* m_386 | m_K6 */,
/* X86_TUNE_PARTIAL_REG_STALL: We probably ought to watch for partial
register stalls on Generic32 compilation setting as well.  However
@@ -1429,19 +1504,19 @@ static unsigned int initial_ix86_tune_fe
m_PPRO,
/* X86_TUNE_PARTIAL_FLAG_REG_STALL */
- m_CORE2 | m_GENERIC,
+ m_CORE2I7 | m_GENERIC,
/* X86_TUNE_USE_HIMODE_FIOP */
m_386 | m_486 | m_K6_GEODE,
/* X86_TUNE_USE_SIMODE_FIOP */
- ~(m_PPRO | m_AMD_MULTIPLE | m_PENT | m_ATOM | m_CORE2 | m_GENERIC),
+ ~(m_PPRO | m_AMD_MULTIPLE | m_PENT | m_ATOM | m_CORE2I7 | m_GENERIC),
/* X86_TUNE_USE_MOV0 */
m_K6,
/* X86_TUNE_USE_CLTD */
- ~(m_PENT | m_ATOM | m_K6 | m_CORE2 | m_GENERIC),
+ ~(m_PENT | m_ATOM | m_K6 | m_CORE2I7 | m_GENERIC),
/* X86_TUNE_USE_XCHGB: Use xchgb %rh,%rl instead of rolw/rorw $8,rx. */
m_PENT4,
@@ -1457,7 +1532,7 @@ static unsigned int initial_ix86_tune_fe
/* X86_TUNE_PROMOTE_QIMODE */
m_K6_GEODE | m_PENT | m_ATOM | m_386 | m_486 | m_AMD_MULTIPLE
- | m_CORE2 | m_GENERIC /* | m_PENT4 ? */,
+ | m_CORE2I7 | m_GENERIC /* | m_PENT4 ? */,
/* X86_TUNE_FAST_PREFIX */
~(m_PENT | m_486 | m_386),
@@ -1478,31 +1553,34 @@ static unsigned int initial_ix86_tune_fe
0,
/* X86_TUNE_PROMOTE_HI_REGS */
- m_PPRO,
+ m_PPRO | m_CORE2I7,
+
+ /* X86_TUNE_PROMOTE_HI_CONSTANTS: force large HImode immediates into registers (via SImode) instead of emitting slow 16-bit immediate operands.  */
+ m_PPRO | m_CORE2I7,
/* X86_TUNE_ADD_ESP_4: Enable if add/sub is preferred over 1/2 push/pop. */
m_ATOM | m_AMD_MULTIPLE | m_K6_GEODE | m_PENT4 | m_NOCONA
- | m_CORE2 | m_GENERIC,
+ | m_CORE2I7 | m_GENERIC,
/* X86_TUNE_ADD_ESP_8 */
m_AMD_MULTIPLE | m_ATOM | m_PPRO | m_K6_GEODE | m_386
- | m_486 | m_PENT4 | m_NOCONA | m_CORE2 | m_GENERIC,
+ | m_486 | m_PENT4 | m_NOCONA | m_CORE2I7 | m_GENERIC,
/* X86_TUNE_SUB_ESP_4 */
- m_AMD_MULTIPLE | m_ATOM | m_PPRO | m_PENT4 | m_NOCONA | m_CORE2
+ m_AMD_MULTIPLE | m_ATOM | m_PPRO | m_PENT4 | m_NOCONA | m_CORE2I7
| m_GENERIC,
/* X86_TUNE_SUB_ESP_8 */
m_AMD_MULTIPLE | m_ATOM | m_PPRO | m_386 | m_486
- | m_PENT4 | m_NOCONA | m_CORE2 | m_GENERIC,
+ | m_PENT4 | m_NOCONA | m_CORE2I7 | m_GENERIC,
/* X86_TUNE_INTEGER_DFMODE_MOVES: Enable if integer moves are preferred
for DFmode copies */
- ~(m_AMD_MULTIPLE | m_ATOM | m_PENT4 | m_NOCONA | m_PPRO | m_CORE2
+ ~(m_AMD_MULTIPLE | m_ATOM | m_PENT4 | m_NOCONA | m_PPRO | m_CORE2I7
| m_GENERIC | m_GEODE),
/* X86_TUNE_PARTIAL_REG_DEPENDENCY */
- m_AMD_MULTIPLE | m_ATOM | m_PENT4 | m_NOCONA | m_CORE2 | m_GENERIC,
+ m_AMD_MULTIPLE | m_ATOM | m_PENT4 | m_NOCONA | m_CORE2I7 | m_GENERIC,
/* X86_TUNE_SSE_PARTIAL_REG_DEPENDENCY: In the Generic model we have a
conflict here in between PPro/Pentium4 based chips that thread 128bit
@@ -1513,7 +1591,7 @@ static unsigned int initial_ix86_tune_fe
shows that disabling this option on P4 brings over 20% SPECfp regression,
while enabling it on K8 brings roughly 2.4% regression that can be partly
masked by careful scheduling of moves. */
- m_ATOM | m_PENT4 | m_NOCONA | m_PPRO | m_CORE2 | m_GENERIC
+ m_ATOM | m_PENT4 | m_NOCONA | m_PPRO | m_CORE2I7 | m_GENERIC
| m_AMDFAM10 | m_BDVER1,
/* X86_TUNE_SSE_UNALIGNED_LOAD_OPTIMAL */
@@ -1538,13 +1616,13 @@ static unsigned int initial_ix86_tune_fe
m_PPRO | m_PENT4 | m_NOCONA,
/* X86_TUNE_MEMORY_MISMATCH_STALL */
- m_AMD_MULTIPLE | m_ATOM | m_PENT4 | m_NOCONA | m_CORE2 | m_GENERIC,
+ m_AMD_MULTIPLE | m_ATOM | m_PENT4 | m_NOCONA | m_CORE2I7 | m_GENERIC,
/* X86_TUNE_PROLOGUE_USING_MOVE */
- m_ATHLON_K8 | m_ATOM | m_PPRO | m_CORE2 | m_GENERIC,
+ m_ATHLON_K8 | m_ATOM | m_PPRO | m_CORE2I7 | m_GENERIC,
/* X86_TUNE_EPILOGUE_USING_MOVE */
- m_ATHLON_K8 | m_ATOM | m_PPRO | m_CORE2 | m_GENERIC,
+ m_ATHLON_K8 | m_ATOM | m_PPRO | m_CORE2I7 | m_GENERIC,
/* X86_TUNE_SHIFT1 */
~m_486,
@@ -1560,25 +1638,25 @@ static unsigned int initial_ix86_tune_fe
/* X86_TUNE_FOUR_JUMP_LIMIT: Some CPU cores are not able to predict more
than 4 branch instructions in the 16 byte window. */
- m_ATOM | m_PPRO | m_AMD_MULTIPLE | m_PENT4 | m_NOCONA | m_CORE2
+ m_ATOM | m_PPRO | m_AMD_MULTIPLE | m_PENT4 | m_NOCONA | m_CORE2I7
| m_GENERIC,
/* X86_TUNE_SCHEDULE */
- m_PPRO | m_AMD_MULTIPLE | m_K6_GEODE | m_PENT | m_ATOM | m_CORE2
+ m_PPRO | m_AMD_MULTIPLE | m_K6_GEODE | m_PENT | m_ATOM | m_CORE2I7
| m_GENERIC,
/* X86_TUNE_USE_BT */
- m_AMD_MULTIPLE | m_ATOM | m_CORE2 | m_GENERIC,
+ m_AMD_MULTIPLE | m_ATOM | m_CORE2I7 | m_GENERIC,
/* X86_TUNE_USE_INCDEC */
- ~(m_PENT4 | m_NOCONA | m_GENERIC | m_ATOM),
+ ~(m_PENT4 | m_NOCONA | m_GENERIC | m_CORE2I7 | m_ATOM),
/* X86_TUNE_PAD_RETURNS */
- m_AMD_MULTIPLE | m_CORE2 | m_GENERIC,
+ m_AMD_MULTIPLE | m_GENERIC,
/* X86_TUNE_EXT_80387_CONSTANTS */
m_K6_GEODE | m_ATHLON_K8 | m_ATOM | m_PENT4 | m_NOCONA | m_PPRO
- | m_CORE2 | m_GENERIC,
+ | m_CORE2I7 | m_GENERIC,
/* X86_TUNE_SHORTEN_X87_SSE */
~m_K8,
@@ -1622,7 +1700,7 @@ static unsigned int initial_ix86_tune_fe
/* X86_TUNE_FUSE_CMP_AND_BRANCH: Fuse a compare or test instruction
with a subsequent conditional jump instruction into a single
compare-and-branch uop. */
- m_CORE2 | m_BDVER1,
+ m_CORE2I7 | m_BDVER1,
/* X86_TUNE_OPT_AGU: Optimize for Address Generation Unit. This flag
will impact LEA instruction selection. */
@@ -1652,12 +1730,12 @@ static unsigned int initial_ix86_arch_fe
};
static const unsigned int x86_accumulate_outgoing_args
- = m_AMD_MULTIPLE | m_ATOM | m_PENT4 | m_NOCONA | m_PPRO | m_CORE2
+ = m_AMD_MULTIPLE | m_ATOM | m_PENT4 | m_NOCONA | m_PPRO | m_CORE2I7
| m_GENERIC;
static const unsigned int x86_arch_always_fancy_math_387
= m_PENT | m_ATOM | m_PPRO | m_AMD_MULTIPLE | m_PENT4
- | m_NOCONA | m_CORE2 | m_GENERIC;
+ | m_NOCONA | m_CORE2I7 | m_GENERIC;
static enum stringop_alg stringop_alg = no_stringop;
@@ -2173,6 +2251,7 @@ static const struct ptt processor_target
{&k8_cost, 16, 7, 16, 7, 16},
{&nocona_cost, 0, 0, 0, 0, 0},
{&core2_cost, 16, 10, 16, 10, 16},
+ {&corei7_cost, 16, 10, 16, 10, 16}, /* alignment parameters match the core2 entry above */
{&generic32_cost, 16, 7, 16, 7, 16},
{&generic64_cost, 16, 10, 16, 10, 16},
{&amdfam10_cost, 32, 24, 32, 7, 32},
@@ -2195,6 +2274,7 @@ static const char *const cpu_names[TARGE
"prescott",
"nocona",
"core2",
+ "corei7", /* must stay in the TARGET_CPU_DEFAULT_* enum order */
"atom",
"geode",
"k6",
@@ -2889,6 +2969,9 @@ override_options (bool main_args_p)
{"core2", PROCESSOR_CORE2, CPU_CORE2,
PTA_64BIT | PTA_MMX | PTA_SSE | PTA_SSE2 | PTA_SSE3
| PTA_SSSE3 | PTA_CX16},
+ {"corei7", PROCESSOR_COREI7, CPU_COREI7,
+ PTA_64BIT | PTA_MMX | PTA_SSE | PTA_SSE2 | PTA_SSE3
+ | PTA_SSSE3 | PTA_SSE4_1 | PTA_SSE4_2 | PTA_CX16}, /* NOTE(review): Nehalem also has POPCNT -- confirm SSE4.2 implies POPCNT in the ISA option handling, otherwise PTA_POPCNT is missing here.  */
{"atom", PROCESSOR_ATOM, CPU_ATOM,
PTA_64BIT | PTA_MMX | PTA_SSE | PTA_SSE2 | PTA_SSE3
| PTA_SSSE3 | PTA_CX16 | PTA_MOVBE},
@@ -14291,6 +14374,24 @@ ix86_fixup_binary_operands (enum rtx_cod
   if (MEM_P (src1) && !rtx_equal_p (dst, src1))
     src1 = force_reg (mode, src1);
 
+  /* When TARGET_PROMOTE_HI_CONSTANTS, do HImode arithmetic with a large
+     immediate in an SImode register instead: a 16-bit immediate operand
+     needs an operand-size prefix, which is slow to decode on these CPUs.
+     Immediates in [-128, 127] use the sign-extended 8-bit encoding, and
+     AND masks 0xff/0xff00 have cheap byte forms, so those stay as-is.
+     Two fixes over the earlier revision: use CONST_INT_P rather than
+     CONSTANT_P, since INTVAL is only valid on a CONST_INT; and test the
+     masked low 16 bits, since a canonical HImode CONST_INT is stored
+     sign-extended (0xff00 is -256, so the old "!= -65281" test could
+     never fire).  Keep this condition identical to the one in
+     ix86_binary_operator_ok.  */
+  if (TARGET_PROMOTE_HI_CONSTANTS && mode == HImode && CONST_INT_P (src2)
+      && (INTVAL (src2) < -128 || INTVAL (src2) > 127)
+      && (code != AND
+	  || ((INTVAL (src2) & 0xffff) != 0xff
+	      && (INTVAL (src2) & 0xffff) != 0xff00)))
+    src2 = gen_lowpart (HImode, force_reg (SImode, src2));
+
   operands[1] = src1;
   operands[2] = src2;
   return dst;
@@ -14377,6 +14466,17 @@ ix86_binary_operator_ok (enum rtx_code c
   if (MEM_P (src1) && !rtx_equal_p (dst, src1))
     return 0;
 
+  /* Reject the HImode immediates that ix86_fixup_binary_operands
+     promotes to SImode registers, so that later passes cannot
+     reintroduce them.  This condition must stay identical to the
+     one above.  */
+  if (TARGET_PROMOTE_HI_CONSTANTS && mode == HImode && CONST_INT_P (src2)
+      && (INTVAL (src2) < -128 || INTVAL (src2) > 127)
+      && (code != AND
+	  || ((INTVAL (src2) & 0xffff) != 0xff
+	      && (INTVAL (src2) & 0xffff) != 0xff00)))
+    return 0;
+
   return 1;
 }
@@ -20495,6 +20590,7 @@ ix86_issue_rate (void)
return 3;
case PROCESSOR_CORE2:
+ case PROCESSOR_COREI7: /* like Core 2, issues up to 4 insns per cycle */
return 4;
default:
@@ -20569,6 +20665,7 @@ ix86_adjust_cost (rtx insn, rtx link, rt
 {
   enum attr_type insn_type, dep_insn_type;
   enum attr_memory memory;
+  enum attr_i7_domain domain1, domain2;
   rtx set, set2;
   int dep_insn_code_number;
@@ -20711,6 +20808,24 @@ ix86_adjust_cost (rtx insn, rtx link, rt
       else
 	cost = 0;
     }
+      break;
+
+    case PROCESSOR_COREI7:
+      /* Charge the forwarding (bypass) delay between execution domains:
+	 a result produced in one domain and consumed in another pays
+	 extra latency.  Address-generation dependencies are excluded --
+	 they are not a bypass between execution units.  INT<->SIMD
+	 forwarding costs one extra cycle; any other domain crossing
+	 costs two.  (A dead "memory = get_attr_memory (insn);" from the
+	 earlier revision was removed: its value was never used.)  */
+      domain1 = get_attr_i7_domain (insn);
+      domain2 = get_attr_i7_domain (dep_insn);
+      if (domain1 != domain2
+	  && !ix86_agi_dependent (dep_insn, insn))
+	cost += ((domain1 == I7_DOMAIN_SIMD && domain2 == I7_DOMAIN_INT)
+		 || (domain1 == I7_DOMAIN_INT && domain2 == I7_DOMAIN_SIMD)
+		 ? 1 : 2);
+      break;
 
     default:
       break;