[0/2] Align tight loops to solve cross cacheline issue

Message ID

20240515030429.2575440-1-haochen.jiang@intel.com

Headers

DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org 214723858C3A
From: Haochen Jiang <haochen.jiang@intel.com>
To: gcc-patches@gcc.gnu.org
Cc: hongtao.liu@intel.com,
	ubizjak@gmail.com
Subject: [PATCH 0/2] Align tight loops to solve cross cacheline issue
Date: Wed, 15 May 2024 11:04:27 +0800
Message-Id: <20240515030429.2575440-1-haochen.jiang@intel.com>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Precedence: list
Errors-To: gcc-patches-bounces+incoming=patchwork.ozlabs.org@gcc.gnu.org

Series

[1/2] Adjust generic loop alignment from 16:11:8 to 16 for Intel processors

Message

Jiang, Haochen May 15, 2024, 3:04 a.m. UTC

Hi all,

Recently, we have encountered several seemingly random performance
regressions in benchmarks from commit to commit. They are caused by
tight loops crossing cache line boundaries.

We are trying to solve the issue with two patches. One adjusts the
loop alignment for the generic tuning; the other aligns tight and hot
loops more aggressively.
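
To illustrate the first patch, here is a small sketch (not code from the series). The helper names are hypothetical, the 64-byte cache line is an assumption, and the reading of an alignment spec `n:m:n2` (align to `n` if that skips at most `m-1` bytes, otherwise fall back to `n2`) follows the documented -falign-loops semantics as I understand them. It shows how a 12-byte loop can end up straddling a cache line under 16:11:8 but not under 16.

```python
CACHE_LINE = 64  # assumed cache line size in bytes


def parse_align(spec):
    """Parse 'n[:m[:n2[:m2]]]' into [(n, m), ...]; a missing m defaults to n."""
    parts = [int(p) for p in spec.split(":")]
    pairs = []
    for i in range(0, len(parts), 2):
        n = parts[i]
        m = parts[i + 1] if i + 1 < len(parts) else n
        pairs.append((n, m))
    return pairs


def aligned_start(addr, spec):
    """Return where the loop starts after padding according to spec.

    Assumes every n is a power of two.
    """
    for n, m in parse_align(spec):
        skip = -addr % n          # padding bytes needed to reach an n-byte boundary
        if skip <= m - 1:         # only pad if it costs at most m-1 bytes
            return addr + skip
    return addr                   # no alignment applied


def crosses_cache_line(start, size):
    """True if [start, start + size) spans two cache lines."""
    return start // CACHE_LINE != (start + size - 1) // CACHE_LINE


addr, size = 0x1033, 12           # hypothetical loop position and length
old = aligned_start(addr, "16:11:8")
new = aligned_start(addr, "16")
print(hex(old), crosses_cache_line(old, size))   # 0x1038 True
print(hex(new), crosses_cache_line(new, size))   # 0x1040 False
```

With 16:11:8 the 16-byte boundary is 13 bytes away, which exceeds the allowed skip, so the loop is only 8-byte aligned and spans 0x1038..0x1043. With plain 16 it always starts on a 16-byte boundary, so any loop of 16 bytes or fewer stays inside one cache line.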

For SPECint rate, we get a 0.85% overall improvement with
-O2 -march=x86-64-v3 -mtune=generic on Emerald Rapids (EMR).

BenchMarks      EMR Rates
500.perlbench_r -1.21%
502.gcc_r       0.78%
505.mcf_r       0.00%
520.omnetpp_r   0.41%
523.xalancbmk_r 1.33%
525.x264_r      2.83%
531.deepsjeng_r 1.11%
541.leela_r     0.00%
548.exchange2_r 2.36%
557.xz_r        0.98%
Geomean-int     0.85%

As a side effect, code size increases by 1.40%.

BenchMarks      EMR Codesize
500.perlbench_r 0.70%
502.gcc_r       0.67%
505.mcf_r       3.26%
520.omnetpp_r   0.31%
523.xalancbmk_r 1.15%
525.x264_r      1.11%
531.deepsjeng_r 1.40%
541.leela_r     1.31%
548.exchange2_r 3.06%
557.xz_r        1.04%
Geomean-int     1.40%
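
As a quick sanity check on the two summary rows (a sketch, with a hypothetical helper name), the per-benchmark percentages from the tables can be converted to ratios and combined with a geometric mean. The only assumption is that "Geomean-int" is the geometric mean of `1 + delta/100` over the ten benchmarks.

```python
import math


def geomean_percent(deltas):
    """Geometric mean of percentage changes, returned as a percentage."""
    ratios = [1 + d / 100 for d in deltas]
    return (math.prod(ratios) ** (1 / len(ratios)) - 1) * 100


# Values copied from the two tables, in benchmark order.
RATES = [-1.21, 0.78, 0.00, 0.41, 1.33, 2.83, 1.11, 0.00, 2.36, 0.98]
CODESIZE = [0.70, 0.67, 3.26, 0.31, 1.15, 1.11, 1.40, 1.31, 3.06, 1.04]

print(f"rates:    {geomean_percent(RATES):.2f}%")     # 0.85%
print(f"codesize: {geomean_percent(CODESIZE):.2f}%")  # 1.40%
```

Both results match the Geomean-int rows above.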

Bootstrapped and regtested on x86_64-pc-linux-gnu.

Once the patches have been on trunk for a month without anything
unexpected happening, we plan to backport them to GCC 14.2.

Thx,
Haochen

Haochen Jiang (1):
  Adjust generic loop alignment from 16:11:8 to 16 for Intel processors

liuhongt (1):
  Align tight&hot loop without considering max skipping bytes.

 gcc/config/i386/i386.cc          | 148 ++++++++++++++++++++++++++++++-
 gcc/config/i386/i386.md          |  10 ++-
 gcc/config/i386/x86-tune-costs.h |   2 +-
 3 files changed, 154 insertions(+), 6 deletions(-)

Comments

Jiang, Haochen May 15, 2024, 3:30 a.m. UTC | #1

Also cc'ing Honza and Richard since we touched the generic tuning.

Thx,
Haochen

Hongtao Liu May 20, 2024, 3:15 a.m. UTC | #2

On Wed, May 15, 2024 at 11:30 AM Jiang, Haochen <haochen.jiang@intel.com> wrote:
> > Haochen Jiang (1):
> >   Adjust generic loop alignment from 16:11:8 to 16 for Intel processors
For this one, the current znver{1,2,3,4,5}_cost already set the loop
alignment to 16, so I think it should be fine to set it in generic_cost too.
> >
> > liuhongt (1):
> >   Align tight&hot loop without considering max skipping bytes.
For this one, although we have seen similar growth on AMD's
processors, it would still be nice to have someone from AMD look at
this to see if it's what they need.

Hongtao Liu May 27, 2024, 1:33 a.m. UTC | #3

On Mon, May 20, 2024 at 11:15 AM Hongtao Liu <crazylht@gmail.com> wrote:
> > > Bootstrapped and regtested on x86_64-pc-linux-gnu.
OK for this if there are no objections within 48 hours.

Jiang, Haochen May 29, 2024, 3:30 a.m. UTC | #4

> > > > Bootstrapped and regtested on x86_64-pc-linux-gnu.
> Ok for this if there's no objection in 48 hours.
> > > >
> > > > After we committed into trunk for a month, if there isn't any
> > > > unexpected happen. We planned to backport it to GCC14.2.

I accidentally backported it to GCC 14.2 already, since I did not realize
that my local branch was on GCC 14 rather than trunk.

If anything unexpected shows up on trunk, I will revert the patches on
GCC 14.

Thx,
Haochen
