Message ID | 20210202071648.1776-1-thunder.leizhen@huawei.com |
---|---|
Headers | show |
Series | ARM: Add support for Hisilicon Kunpeng L3 cache controller | expand |
On Tue, Feb 2, 2021 at 8:16 AM Zhen Lei <thunder.leizhen@huawei.com> wrote: > + > +/* > + * All read and write operations on L3 cache registers are protected by the > + * spinlock, except for l3cache_init(). Each time the L3 cache operation is > + * performed, all related information is filled into its registers. Therefore, > + * there is no memory order problem when only _relaxed() functions are used. Thank you for including the text. I don't think the explanation with the spin_lock() explains why this can be considered safe though, as spin_lock() only contains serialization against other CPUs (smp_mb()) rather than the stronger DMA barriers implied by readl and writel. As Russell previously explained, these barriers are the L1 cache operations (e.g. v7_dma_inv_range) do include stronger barriers, so it would be better to come up with a justification based on those. > + * This can help us achieve some performance improvement: > + * 1) The readl_relaxed() is about 20ns faster than readl(). > + * 2) The writel_relaxed() is about 123ns faster than writel(). These are not really the performance numbers I asked for, as a low-level benchmark comparing the instructions is rather meaningless. The time spent waiting for the barrier depends on what else is going on around the barrier. Also, most of the time would likely be spent spinning in the loop around readl() while the cache operations are in progress, so the latency of a single readl() is not necessarily significant. To have a more useful performance number, try mentioning the most performance sensitive non-coherent DMA master on one of the chips that has this cache controller, and a high-level performance number such as "1.2% more network packets per second" if that is something you can measure easily. Of course, if all high-speed DMA masters on this chip are cache coherent, there is no need for performance numbers, just mention that we don't care about speed in that case. Arnd
On 2021/2/2 16:44, Arnd Bergmann wrote: > On Tue, Feb 2, 2021 at 8:16 AM Zhen Lei <thunder.leizhen@huawei.com> wrote: >> + >> +/* >> + * All read and write operations on L3 cache registers are protected by the >> + * spinlock, except for l3cache_init(). Each time the L3 cache operation is >> + * performed, all related information is filled into its registers. Therefore, >> + * there is no memory order problem when only _relaxed() functions are used. > > Thank you for including the text. > > I don't think the explanation with the spin_lock() explains why this > can be considered safe though, as spin_lock() only contains serialization > against other CPUs (smp_mb()) rather than the stronger DMA barriers > implied by readl and writel. As Russell previously explained, these > barriers are the L1 cache operations (e.g. v7_dma_inv_range) do > include stronger barriers, so it would be better to come up with a > justification based on those. Okay, I'll correct the description. > >> + * This can help us achieve some performance improvement: >> + * 1) The readl_relaxed() is about 20ns faster than readl(). >> + * 2) The writel_relaxed() is about 123ns faster than writel(). > > These are not really the performance numbers I asked for, as a > low-level benchmark comparing the instructions is rather meaningless. > The time spent waiting for the barrier depends on what else is going > on around the barrier. Also, most of the time would likely be > spent spinning in the loop around readl() while the cache operations > are in progress, so the latency of a single readl() is not necessarily > significant. > > To have a more useful performance number, try mentioning the > most performance sensitive non-coherent DMA master on one > of the chips that has this cache controller, and a high-level > performance number such as "1.2% more network packets per > second" if that is something you can measure easily. It's not easy. My board only have debugging NIC, only the downstream products have high-speed service NIC. Software needs to be packaged layer by layer. > > Of course, if all high-speed DMA masters on this chip are > cache coherent, there is no need for performance numbers, just > mention that we don't care about speed in that case. It's not cache coherent, otherwise, the L3 cache does not need to be operated. > > Arnd > > . >
On Tue, Feb 2, 2021 at 1:18 PM Leizhen (ThunderTown) <thunder.leizhen@huawei.com> wrote: > On 2021/2/2 16:44, Arnd Bergmann wrote: > > > > To have a more useful performance number, try mentioning the > > most performance sensitive non-coherent DMA master on one > > of the chips that has this cache controller, and a high-level > > performance number such as "1.2% more network packets per > > second" if that is something you can measure easily. > > It's not easy. My board only have debugging NIC, only the downstream > products have high-speed service NIC. Software needs to be packaged > layer by layer. > > > > > Of course, if all high-speed DMA masters on this chip are > > cache coherent, there is no need for performance numbers, just > > mention that we don't care about speed in that case. > > It's not cache coherent, otherwise, the L3 cache does not need to be > operated. Ok, I see. In this case, just explain that the high-speed NIC is not cache-coherent, so this is expected to make a difference, even if you can't quantify it exactly. Arnd