mbox series

[v7,0/4] ARM: Add support for Hisilicon Kunpeng L3 cache controller

Message ID 20210202071648.1776-1-thunder.leizhen@huawei.com
Headers show
Series ARM: Add support for Hisilicon Kunpeng L3 cache controller | expand

Message

Leizhen (ThunderTown) Feb. 2, 2021, 7:16 a.m. UTC
v6 --> v7:
1. Change all readl()/writel() to _relaxed(), add the corresponding description to the code.
2. Delete the unnecessary spinlock protection in l3cache_init().

v5 --> v6:
1. Use raw_spin_lock_irqsave() instead of spin_lock_irqsave()
2. Move the macros defined in cache-kunpeng-l3.h into cache-kunpeng-l3.c, and delete that header file.
3. In some places, replace readl()/writel() with readl_relaxed()/writel_relaxed() to improve performance without affecting functions.
4. Returns 0 instead of an error code when Kunpeng L3 Cache matching failed.

Thank you for Arnd's review comments and Russell's help.

v4 --> v5:
1. Add SoC macro ARCH_KUNPENG50X, and the Kunpeng L3 cache controller only enabled
   on that platform.
2. Require the compatible string of the Kunpeng L3 cache controller must have a
   relevant name on a specific SoC. For example:
   compatible = "hisilicon,kunpeng509-l3cache", "hisilicon,kunpeng-l3cache";

v3 --> v4:
Rename the compatible string from "hisilicon,l3cache" to "hisilicon,kunpeng-l3cache".
Then adjust the file name, configuration option name, and description accordingly.

v2 --> v3:
Add Hisilicon L3 cache controller driver and its document. That's: patch 2-3.

v1 --> v2:
Discard the middle-tier functions and do silent narrowing cast in the outcache
hook functions. For example:
-static void l2c220_inv_range(unsigned long start, unsigned long end)
+static void l2c220_inv_range(phys_addr_t pa_start, phys_addr_t pa_end)
 {
+	unsigned long start = pa_start;
+	unsigned long end = pa_end;


v1:
Do cast phys_addr_t to unsigned long by adding a middle-tier function.
For example:
-static void l2c220_inv_range(unsigned long start, unsigned long end)
+static void __l2c220_inv_range(unsigned long start, unsigned long end)
 {
 	...
 }
+static void l2c220_inv_range(phys_addr_t start, phys_addr_t end)
+{
+  __l2c220_inv_range(start, end);
+}

Zhen Lei (4):
  ARM: LPAE: Use phys_addr_t instead of unsigned long in outercache
    hooks
  ARM: hisi: add support for Kunpeng50x SoC
  dt-bindings: arm: hisilicon: Add binding for Kunpeng L3 cache
    controller
  ARM: Add support for Hisilicon Kunpeng L3 cache controller

 .../arm/hisilicon/kunpeng-l3cache.yaml        |  40 ++++
 arch/arm/include/asm/outercache.h             |   6 +-
 arch/arm/mach-hisi/Kconfig                    |   6 +
 arch/arm/mm/Kconfig                           |  10 +
 arch/arm/mm/Makefile                          |   1 +
 arch/arm/mm/cache-feroceon-l2.c               |  15 +-
 arch/arm/mm/cache-kunpeng-l3.c                | 178 ++++++++++++++++++
 arch/arm/mm/cache-l2x0.c                      |  50 +++--
 arch/arm/mm/cache-tauros2.c                   |  15 +-
 arch/arm/mm/cache-uniphier.c                  |   6 +-
 arch/arm/mm/cache-xsc3l2.c                    |  12 +-
 11 files changed, 310 insertions(+), 29 deletions(-)
 create mode 100644 Documentation/devicetree/bindings/arm/hisilicon/kunpeng-l3cache.yaml
 create mode 100644 arch/arm/mm/cache-kunpeng-l3.c

Comments

Arnd Bergmann Feb. 2, 2021, 8:44 a.m. UTC | #1
On Tue, Feb 2, 2021 at 8:16 AM Zhen Lei <thunder.leizhen@huawei.com> wrote:
> +
> +/*
> + * All read and write operations on L3 cache registers are protected by the
> + * spinlock, except for l3cache_init(). Each time the L3 cache operation is
> + * performed, all related information is filled into its registers. Therefore,
> + * there is no memory order problem when only _relaxed() functions are used.

Thank you for including the text.

I don't think the explanation with the spin_lock() explains why this
can be considered safe though, as spin_lock() only contains serialization
against other CPUs (smp_mb()) rather than the stronger DMA barriers
implied by readl and writel. As Russell previously explained, these
barriers are the L1 cache operations (e.g. v7_dma_inv_range) do
include stronger barriers, so it would be better to come up with a
justification based on those.

> + * This can help us achieve some performance improvement:
> + * 1) The readl_relaxed() is about 20ns faster than readl().
> + * 2) The writel_relaxed() is about 123ns faster than writel().

These are not really the performance numbers I asked for, as a
low-level benchmark comparing the instructions is rather meaningless.
The time spent waiting for the barrier depends on what else is going
on around the barrier. Also, most of the time would likely be
spent spinning in the loop around readl() while the cache operations
are in progress, so the latency of a single readl() is not necessarily
significant.

To have a more useful performance number, try mentioning the
most performance sensitive non-coherent DMA master on one
of the chips that has this cache controller, and a high-level
performance number such as "1.2% more network packets per
second" if that is something you can measure easily.

Of course, if all high-speed DMA masters on this chip are
cache coherent, there is no need for performance numbers, just
mention that we don't care about speed in that case.

        Arnd
Leizhen (ThunderTown) Feb. 2, 2021, 12:18 p.m. UTC | #2
On 2021/2/2 16:44, Arnd Bergmann wrote:
> On Tue, Feb 2, 2021 at 8:16 AM Zhen Lei <thunder.leizhen@huawei.com> wrote:
>> +
>> +/*
>> + * All read and write operations on L3 cache registers are protected by the
>> + * spinlock, except for l3cache_init(). Each time the L3 cache operation is
>> + * performed, all related information is filled into its registers. Therefore,
>> + * there is no memory order problem when only _relaxed() functions are used.
> 
> Thank you for including the text.
> 
> I don't think the explanation with the spin_lock() explains why this
> can be considered safe though, as spin_lock() only contains serialization
> against other CPUs (smp_mb()) rather than the stronger DMA barriers
> implied by readl and writel. As Russell previously explained, these
> barriers are the L1 cache operations (e.g. v7_dma_inv_range) do
> include stronger barriers, so it would be better to come up with a
> justification based on those.

Okay, I'll correct the description.

> 
>> + * This can help us achieve some performance improvement:
>> + * 1) The readl_relaxed() is about 20ns faster than readl().
>> + * 2) The writel_relaxed() is about 123ns faster than writel().
> 
> These are not really the performance numbers I asked for, as a
> low-level benchmark comparing the instructions is rather meaningless.
> The time spent waiting for the barrier depends on what else is going
> on around the barrier. Also, most of the time would likely be
> spent spinning in the loop around readl() while the cache operations
> are in progress, so the latency of a single readl() is not necessarily
> significant.
> 
> To have a more useful performance number, try mentioning the
> most performance sensitive non-coherent DMA master on one
> of the chips that has this cache controller, and a high-level
> performance number such as "1.2% more network packets per
> second" if that is something you can measure easily.

It's not easy. My board only have debugging NIC, only the downstream
products have high-speed service NIC. Software needs to be packaged
layer by layer.

> 
> Of course, if all high-speed DMA masters on this chip are
> cache coherent, there is no need for performance numbers, just
> mention that we don't care about speed in that case.

It's not cache coherent, otherwise, the L3 cache does not need to be
operated.

> 
>         Arnd
> 
> .
>
Arnd Bergmann Feb. 2, 2021, 3:54 p.m. UTC | #3
On Tue, Feb 2, 2021 at 1:18 PM Leizhen (ThunderTown)
<thunder.leizhen@huawei.com> wrote:
> On 2021/2/2 16:44, Arnd Bergmann wrote:
> >
> > To have a more useful performance number, try mentioning the
> > most performance sensitive non-coherent DMA master on one
> > of the chips that has this cache controller, and a high-level
> > performance number such as "1.2% more network packets per
> > second" if that is something you can measure easily.
>
> It's not easy. My board only have debugging NIC, only the downstream
> products have high-speed service NIC. Software needs to be packaged
> layer by layer.
>
> >
> > Of course, if all high-speed DMA masters on this chip are
> > cache coherent, there is no need for performance numbers, just
> > mention that we don't care about speed in that case.
>
> It's not cache coherent, otherwise, the L3 cache does not need to be
> operated.

Ok, I see. In this case, just explain that the high-speed NIC is not
cache-coherent, so this is expected to make a difference, even if you
can't quantify it exactly.

       Arnd