Message ID: 1541113468-22097-1-git-send-email-atish.patra@wdc.com
Series: Add RISC-V cpu topology
Hello All, Στις 2018-11-02 01:04, Atish Patra έγραψε: > This patch series adds the cpu topology for RISC-V. It contains > both the DT binding and actual source code. It has been tested on > QEMU & Unleashed board. > > The idea is based on cpu-map in ARM with changes related to how > we define SMT systems. The reason for adopting a similar approach > to ARM as I feel it provides a very clear way of defining the > topology compared to parsing cache nodes to figure out which cpus > share the same package or core. I am open to any other idea to > implement cpu-topology as well. > I was also about to start a discussion about CPU topology on RISC-V after the last swtools group meeting. The goal is to provide the scheduler with hints on how to distribute tasks more efficiently between harts, by populating the scheduling domain topology levels (https://elixir.bootlin.com/linux/v4.19/ident/sched_domain_topology_level). What we want to do is define cpu groups and assign them to scheduling domains with the appropriate SD_ flags (https://github.com/torvalds/linux/blob/master/include/linux/sched/topology.h#L16). So the cores that belong to a scheduling domain may share: CPU capacity (SD_SHARE_CPUCAPACITY / SD_ASYM_CPUCAPACITY) Package resources -e.g. caches, units etc- (SD_SHARE_PKG_RESOURCES) Power domain (SD_SHARE_POWERDOMAIN) In this context I believe using words like "core", "package", "socket" etc can be misleading. For example the sample topology you use on the documentation says that there are 4 cores that are part of a package, however "package" has a different meaning to the scheduler. Also we don't say anything in case they share a power domain or if they have the same capacity or not. This mapping deals only with cache hierarchy or other shared resources. How about defining a dt scheme to describe the scheduler domain topology levels instead ? e.g: 2 sets (or clusters if you prefer) of 2 SMT cores, each set with a different capacity and power domain: sched_topology { level0 { // SMT shared = "power", "capacity", "resources"; group0 { members = <&hart0>, <&hart1>; } group1 { members = <&hart2>, <&hart3>; } group2 { members = <&hart4>, <&hart5>; } group3 { members = <&hart6>, <&hart7>; } } level1 { // MC shared = "power", "capacity" group0 { members = <&hart0>, <&hart1>, <&hart2>, <&hart3>; } group1 { members = <&hart4>, <&hart5>, <&hart6>, <&hart7>; } } top_level { // A group with all harts in it shared = "" // There is nothing common for ALL harts, we could have capacity here } } Regards, Nick
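For readers unfamiliar with the structure Nick links to, the sketch below (hypothetical C, not part of the patch series; the riscv_* names are made up) shows how levels like the level0/level1 nodes above could be expressed as sched_domain_topology_level entries and registered with set_sched_topology(). cpu_smt_mask() and cpu_coregroup_mask() are the same mask helpers quoted elsewhere in this thread.

#include <linux/init.h>
#include <linux/sched/topology.h>
#include <linux/topology.h>

/*
 * Hypothetical sketch, not part of the patch series: each proposed level
 * becomes a sched_domain_topology_level whose mask callback returns the
 * group's cpumask and whose flags callback returns the SD_ flags shared
 * at that level.
 */
static int riscv_smt_flags(void)
{
	/* level0: shared "power", "capacity", "resources" */
	return SD_SHARE_CPUCAPACITY | SD_SHARE_PKG_RESOURCES |
	       SD_SHARE_POWERDOMAIN;
}

static int riscv_mc_flags(void)
{
	/* level1: shared "power", "capacity" */
	return SD_SHARE_POWERDOMAIN;
}

static struct sched_domain_topology_level riscv_topology[] = {
	{ cpu_smt_mask, riscv_smt_flags, SD_INIT_NAME(SMT) },
	{ cpu_coregroup_mask, riscv_mc_flags, SD_INIT_NAME(MC) },
	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },	/* top_level: nothing shared */
	{ NULL, },
};

static void __init riscv_register_sched_topology(void)
{
	/* would replace default_topology[], e.g. called from smp setup */
	set_sched_topology(riscv_topology);
}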
On 11/2/18 11:59 AM, Nick Kossifidis wrote: > Hello All, > > Στις 2018-11-02 01:04, Atish Patra έγραψε: >> This patch series adds the cpu topology for RISC-V. It contains >> both the DT binding and actual source code. It has been tested on >> QEMU & Unleashed board. >> >> The idea is based on cpu-map in ARM with changes related to how >> we define SMT systems. The reason for adopting a similar approach >> to ARM as I feel it provides a very clear way of defining the >> topology compared to parsing cache nodes to figure out which cpus >> share the same package or core. I am open to any other idea to >> implement cpu-topology as well. >> > > I was also about to start a discussion about CPU topology on RISC-V > after the last swtools group meeting. The goal is to provide the > scheduler with hints on how to distribute tasks more efficiently > between harts, by populating the scheduling domain topology levels > (https://elixir.bootlin.com/linux/v4.19/ident/sched_domain_topology_level). > What we want to do is define cpu groups and assign them to > scheduling domains with the appropriate SD_ flags > (https://github.com/torvalds/linux/blob/master/include/linux/sched/topology.h#L16). > Scheduler domain topology is already getting all the hints in the following way. static struct sched_domain_topology_level default_topology[] = { #ifdef CONFIG_SCHED_SMT { cpu_smt_mask, cpu_smt_flags, SD_INIT_NAME(SMT) }, #endif #ifdef CONFIG_SCHED_MC { cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) }, #endif { cpu_cpu_mask, SD_INIT_NAME(DIE) }, { NULL, }, }; #ifdef CONFIG_SCHED_SMT static inline const struct cpumask *cpu_smt_mask(int cpu) { return topology_sibling_cpumask(cpu); } #endif const struct cpumask *cpu_coregroup_mask(int cpu) { return &cpu_topology[cpu].core_sibling; } > So the cores that belong to a scheduling domain may share: > CPU capacity (SD_SHARE_CPUCAPACITY / SD_ASYM_CPUCAPACITY) > Package resources -e.g. caches, units etc- (SD_SHARE_PKG_RESOURCES) > Power domain (SD_SHARE_POWERDOMAIN) > > In this context I believe using words like "core", "package", > "socket" etc can be misleading. For example the sample topology you > use on the documentation says that there are 4 cores that are part > of a package, however "package" has a different meaning to the > scheduler. Also we don't say anything in case they share a power > domain or if they have the same capacity or not. This mapping deals > only with cache hierarchy or other shared resources. > > How about defining a dt scheme to describe the scheduler domain > topology levels instead ? e.g: > > 2 sets (or clusters if you prefer) of 2 SMT cores, each set with > a different capacity and power domain: > > sched_topology { > level0 { // SMT > shared = "power", "capacity", "resources"; > group0 { > members = <&hart0>, <&hart1>; > } > group1 { > members = <&hart2>, <&hart3>; > } > group2 { > members = <&hart4>, <&hart5>; > } > group3 { > members = <&hart6>, <&hart7>; > } > } > level1 { // MC > shared = "power", "capacity" > group0 { > members = <&hart0>, <&hart1>, <&hart2>, <&hart3>; > } > group1 { > members = <&hart4>, <&hart5>, <&hart6>, <&hart7>; > } > } > top_level { // A group with all harts in it > shared = "" // There is nothing common for ALL harts, we could have > capacity here > } > } > I agree that naming could have been better in the past. But it is what it is now. 
I don't see any big advantages in this approach compared to the existing approach, where the DT specifies what the hardware looks like and the scheduler sets up its domains based on the different cpumasks. Regards, Atish > Regards, > Nick
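The "existing approach" described here -- the DT describes the hardware, the scheduler builds its domains from cpumasks -- works roughly as sketched below. This is a simplified illustration loosely modeled on the arm64 code; the struct and function names are made up, not the actual implementation.

#include <linux/cpumask.h>
#include <linux/threads.h>

/*
 * Simplified illustration of the existing flow: a DT parse fills in
 * per-cpu topology ids taken from cpu-map, and the sibling masks the
 * scheduler consumes (SMT and MC levels) are derived from those ids.
 */
struct cpu_topo {
	int thread_id;
	int core_id;
	int package_id;
	cpumask_t thread_sibling;	/* feeds the SMT level */
	cpumask_t core_sibling;		/* feeds the MC level */
};

static struct cpu_topo cpu_topo[NR_CPUS];

static void update_siblings_masks(unsigned int cpu)
{
	struct cpu_topo *t = &cpu_topo[cpu];
	unsigned int other;

	for_each_online_cpu(other) {
		struct cpu_topo *o = &cpu_topo[other];

		if (t->package_id != o->package_id)
			continue;
		cpumask_set_cpu(other, &t->core_sibling);
		if (t->core_id == o->core_id)
			cpumask_set_cpu(other, &t->thread_sibling);
	}
}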
Στις 2018-11-02 23:14, Atish Patra έγραψε: > On 11/2/18 11:59 AM, Nick Kossifidis wrote: >> Hello All, >> >> Στις 2018-11-02 01:04, Atish Patra έγραψε: >>> This patch series adds the cpu topology for RISC-V. It contains >>> both the DT binding and actual source code. It has been tested on >>> QEMU & Unleashed board. >>> >>> The idea is based on cpu-map in ARM with changes related to how >>> we define SMT systems. The reason for adopting a similar approach >>> to ARM as I feel it provides a very clear way of defining the >>> topology compared to parsing cache nodes to figure out which cpus >>> share the same package or core. I am open to any other idea to >>> implement cpu-topology as well. >>> >> >> I was also about to start a discussion about CPU topology on RISC-V >> after the last swtools group meeting. The goal is to provide the >> scheduler with hints on how to distribute tasks more efficiently >> between harts, by populating the scheduling domain topology levels >> (https://elixir.bootlin.com/linux/v4.19/ident/sched_domain_topology_level). >> What we want to do is define cpu groups and assign them to >> scheduling domains with the appropriate SD_ flags >> (https://github.com/torvalds/linux/blob/master/include/linux/sched/topology.h#L16). >> > > Scheduler domain topology is already getting all the hints in the > following way. > > static struct sched_domain_topology_level default_topology[] = { > #ifdef CONFIG_SCHED_SMT > { cpu_smt_mask, cpu_smt_flags, SD_INIT_NAME(SMT) }, > #endif > #ifdef CONFIG_SCHED_MC > { cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) }, > #endif > { cpu_cpu_mask, SD_INIT_NAME(DIE) }, > { NULL, }, > }; > > #ifdef CONFIG_SCHED_SMT > static inline const struct cpumask *cpu_smt_mask(int cpu) > { > return topology_sibling_cpumask(cpu); > } > #endif > > const struct cpumask *cpu_coregroup_mask(int cpu) > { > return &cpu_topology[cpu].core_sibling; > } > > That's a static definition of two scheduling domains that only deal with SMT and MC, the only difference between them is the SD_SHARE_PKG_RESOURCES flag. You can't even have multiple levels of shared resources this way, whatever you have larger than a core is ignored (it just goes to the MC domain). There is also no handling of SD_SHARE_POWERDOMAIN or SD_SHARE_CPUCAPACITY. >> So the cores that belong to a scheduling domain may share: >> CPU capacity (SD_SHARE_CPUCAPACITY / SD_ASYM_CPUCAPACITY) >> Package resources -e.g. caches, units etc- (SD_SHARE_PKG_RESOURCES) >> Power domain (SD_SHARE_POWERDOMAIN) >> >> In this context I believe using words like "core", "package", >> "socket" etc can be misleading. For example the sample topology you >> use on the documentation says that there are 4 cores that are part >> of a package, however "package" has a different meaning to the >> scheduler. Also we don't say anything in case they share a power >> domain or if they have the same capacity or not. This mapping deals >> only with cache hierarchy or other shared resources. >> >> How about defining a dt scheme to describe the scheduler domain >> topology levels instead ? 
e.g: >> >> 2 sets (or clusters if you prefer) of 2 SMT cores, each set with >> a different capacity and power domain: >> >> sched_topology { >> level0 { // SMT >> shared = "power", "capacity", "resources"; >> group0 { >> members = <&hart0>, <&hart1>; >> } >> group1 { >> members = <&hart2>, <&hart3>; >> } >> group2 { >> members = <&hart4>, <&hart5>; >> } >> group3 { >> members = <&hart6>, <&hart7>; >> } >> } >> level1 { // MC >> shared = "power", "capacity" >> group0 { >> members = <&hart0>, <&hart1>, <&hart2>, <&hart3>; >> } >> group1 { >> members = <&hart4>, <&hart5>, <&hart6>, <&hart7>; >> } >> } >> top_level { // A group with all harts in it >> shared = "" // There is nothing common for ALL harts, we could >> have >> capacity here >> } >> } >> > > I agree that naming could have been better in the past. But it is what > it is now. I don't see any big advantages in this approach compared to > the existing approach where DT specifies what hardware looks like and > scheduler sets up it's domain based on different cpumasks. > It is what it is on ARM, it doesn't have to be the same on RISC-V, anyway the name is a minor issue. The advantage of this approach is that you define the scheduling domains on the device tree without needing a "translation" of a topology map to scheduling domains. It can handle any scenario the scheduler can handle, using all the available flags. In your approach no matter what gets put to the device tree, the only hint the scheduler will get is one level of SMT, one level of MC and the rest of the system. No power domain sharing, no asymmetric scheduling, no multiple levels possible. Many features of the scheduler remain unused. This approach can also get extended more easily to e.g. support NUMA nodes and associate memory regions with groups. Regards, Nick
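For reference, the flag callbacks behind those two static levels, as found in include/linux/sched/topology.h in v4.19-era kernels, are reproduced below. The SMT and MC levels both carry SD_SHARE_PKG_RESOURCES and differ in SD_SHARE_CPUCAPACITY; neither carries power-domain or capacity-asymmetry information unless the arch installs its own table via set_sched_topology().

/* From include/linux/sched/topology.h (v4.19-era), quoted for reference. */
#ifdef CONFIG_SCHED_SMT
static inline int cpu_smt_flags(void)
{
	return SD_SHARE_CPUCAPACITY | SD_SHARE_PKG_RESOURCES;
}
#endif

#ifdef CONFIG_SCHED_MC
static inline int cpu_core_flags(void)
{
	return SD_SHARE_PKG_RESOURCES;
}
#endif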
On Fri, Nov 02, 2018 at 08:58:39PM +0200, Nick Kossifidis wrote: > Hello All, > > Στις 2018-11-02 01:04, Atish Patra έγραψε: > > This patch series adds the cpu topology for RISC-V. It contains > > both the DT binding and actual source code. It has been tested on > > QEMU & Unleashed board. > > > > The idea is based on cpu-map in ARM with changes related to how > > we define SMT systems. The reason for adopting a similar approach > > to ARM as I feel it provides a very clear way of defining the > > topology compared to parsing cache nodes to figure out which cpus > > share the same package or core. I am open to any other idea to > > implement cpu-topology as well. > > > > I was also about to start a discussion about CPU topology on RISC-V > after the last swtools group meeting. The goal is to provide the > scheduler with hints on how to distribute tasks more efficiently > between harts, by populating the scheduling domain topology levels > (https://elixir.bootlin.com/linux/v4.19/ident/sched_domain_topology_level). > What we want to do is define cpu groups and assign them to > scheduling domains with the appropriate SD_ flags > (https://github.com/torvalds/linux/blob/master/include/linux/sched/topology.h#L16). > OK are we defining a CPU topology binding for Linux scheduler ? NACK for all the approaches that assumes any knowledge of OS scheduler. > So the cores that belong to a scheduling domain may share: > CPU capacity (SD_SHARE_CPUCAPACITY / SD_ASYM_CPUCAPACITY) > Package resources -e.g. caches, units etc- (SD_SHARE_PKG_RESOURCES) > Power domain (SD_SHARE_POWERDOMAIN) > Too Linux kernel/scheduler specific to be part of $subject > In this context I believe using words like "core", "package", > "socket" etc can be misleading. For example the sample topology you > use on the documentation says that there are 4 cores that are part > of a package, however "package" has a different meaning to the > scheduler. Also we don't say anything in case they share a power > domain or if they have the same capacity or not. This mapping deals > only with cache hierarchy or other shared resources. > {Un,}fortunately those are terms used by hardware people. > How about defining a dt scheme to describe the scheduler domain > topology levels instead ? e.g: > NACK as already mentioned above. -- Regards, Sudeep
Στις 2018-11-06 16:13, Sudeep Holla έγραψε: > On Fri, Nov 02, 2018 at 08:58:39PM +0200, Nick Kossifidis wrote: >> Hello All, >> >> Στις 2018-11-02 01:04, Atish Patra έγραψε: >> > This patch series adds the cpu topology for RISC-V. It contains >> > both the DT binding and actual source code. It has been tested on >> > QEMU & Unleashed board. >> > >> > The idea is based on cpu-map in ARM with changes related to how >> > we define SMT systems. The reason for adopting a similar approach >> > to ARM as I feel it provides a very clear way of defining the >> > topology compared to parsing cache nodes to figure out which cpus >> > share the same package or core. I am open to any other idea to >> > implement cpu-topology as well. >> > >> >> I was also about to start a discussion about CPU topology on RISC-V >> after the last swtools group meeting. The goal is to provide the >> scheduler with hints on how to distribute tasks more efficiently >> between harts, by populating the scheduling domain topology levels >> (https://elixir.bootlin.com/linux/v4.19/ident/sched_domain_topology_level). >> What we want to do is define cpu groups and assign them to >> scheduling domains with the appropriate SD_ flags >> (https://github.com/torvalds/linux/blob/master/include/linux/sched/topology.h#L16). >> > > OK are we defining a CPU topology binding for Linux scheduler ? > NACK for all the approaches that assumes any knowledge of OS scheduler. > Is there any standard regarding CPU topology on the device tree spec ? As far as I know there is none. We are talking about a Linux-specific Device Tree binding so I don't see why defining a binding for the Linux scheduler is out of scope. Do you have cpu-map on other OSes as well ? >> So the cores that belong to a scheduling domain may share: >> CPU capacity (SD_SHARE_CPUCAPACITY / SD_ASYM_CPUCAPACITY) >> Package resources -e.g. caches, units etc- (SD_SHARE_PKG_RESOURCES) >> Power domain (SD_SHARE_POWERDOMAIN) >> > > Too Linux kernel/scheduler specific to be part of $subject > All lists on the cc list are Linux specific, again I don't see your point here are we talking about defining a standard CPU topology scheme for the device tree spec or a Linux-specific CPU topology binding such as cpu-map ? Even on this case your point is not valid, the information of two harts sharing a common power domain or having the same or not capacity/max frequency (or maybe capabilities/extensions in the future), is not Linux specific. I just used the Linux specific macros used by the Linux scheduler to point out the code path. Even on other OSes we still need a way to include this information on the CPU topology, and currently cpu-map doesn't. Also the Linux implementation of cpu-map ignores multiple levels of shared resources, we only get one level for SMT and one level for MC last time I checked. >> In this context I believe using words like "core", "package", >> "socket" etc can be misleading. For example the sample topology you >> use on the documentation says that there are 4 cores that are part >> of a package, however "package" has a different meaning to the >> scheduler. Also we don't say anything in case they share a power >> domain or if they have the same capacity or not. This mapping deals >> only with cache hierarchy or other shared resources. >> > > {Un,}fortunately those are terms used by hardware people. > And they are wrong, how the harts are physically packed doesn't imply their actual topology. 
In general the "translation" is not always straightforward; there are assumptions in place. We could use a "cluster" of harts or a "group" of harts instead, since those terms are more abstract. Regards, Nick
On Tue, Nov 06, 2018 at 05:26:01PM +0200, Nick Kossifidis wrote: > Στις 2018-11-06 16:13, Sudeep Holla έγραψε: > > On Fri, Nov 02, 2018 at 08:58:39PM +0200, Nick Kossifidis wrote: > > > Hello All, > > > > > > Στις 2018-11-02 01:04, Atish Patra έγραψε: > > > > This patch series adds the cpu topology for RISC-V. It contains > > > > both the DT binding and actual source code. It has been tested on > > > > QEMU & Unleashed board. > > > > > > > > The idea is based on cpu-map in ARM with changes related to how > > > > we define SMT systems. The reason for adopting a similar approach > > > > to ARM as I feel it provides a very clear way of defining the > > > > topology compared to parsing cache nodes to figure out which cpus > > > > share the same package or core. I am open to any other idea to > > > > implement cpu-topology as well. > > > > > > > > > > I was also about to start a discussion about CPU topology on RISC-V > > > after the last swtools group meeting. The goal is to provide the > > > scheduler with hints on how to distribute tasks more efficiently > > > between harts, by populating the scheduling domain topology levels > > > (https://elixir.bootlin.com/linux/v4.19/ident/sched_domain_topology_level). > > > What we want to do is define cpu groups and assign them to > > > scheduling domains with the appropriate SD_ flags > > > (https://github.com/torvalds/linux/blob/master/include/linux/sched/topology.h#L16). > > > > > > > OK are we defining a CPU topology binding for Linux scheduler ? > > NACK for all the approaches that assumes any knowledge of OS scheduler. > > > > Is there any standard regarding CPU topology on the device tree spec ? As > far as I know there is none. We are talking about a Linux-specific Device > Tree binding so I don't see why defining a binding for the Linux scheduler > is out of scope. There may not be much on CPU topology in device tree spec, but that doesn't mean we are defining something Linux specific here just because there's bunch of LKML are cc-ed. We do have dedicated device tree ML for a reason. > Do you have cpu-map on other OSes as well ? > Nothing prevents them not to. I have seen increase in the projects relying on DT these days. > > > So the cores that belong to a scheduling domain may share: > > > CPU capacity (SD_SHARE_CPUCAPACITY / SD_ASYM_CPUCAPACITY) > > > Package resources -e.g. caches, units etc- (SD_SHARE_PKG_RESOURCES) > > > Power domain (SD_SHARE_POWERDOMAIN) > > > > > > > Too Linux kernel/scheduler specific to be part of $subject > > > > All lists on the cc list are Linux specific, again I don't see your point > here are we talking about defining a standard CPU topology scheme for the > device tree spec or a Linux-specific CPU topology binding such as cpu-map ? > See above. > Even on this case your point is not valid, the information of two harts > sharing a common power domain or having the same or not capacity/max > frequency (or maybe capabilities/extensions in the future), is not Linux > specific. I just used the Linux specific macros used by the Linux scheduler > to point out the code path. The CPU topology can be different from the frequency or the power domains and we do have specific bindings to provide those information. So let's try to keep that out of this discussion. > Even on other OSes we still need a way to include this information on the > CPU topology, and currently cpu-map doesn't. 
Also the Linux implementation > of cpu-map ignores multiple levels of shared resources, we only get one > level for SMT and one level for MC last time I checked. > But that doesn't make it any easy if you influence the bindings based on Linux scheduler. So just define how hardware is and allow each OS to choose it's own way to utilise that information. That's how most of the generic DT bindings are defined. > > > In this context I believe using words like "core", "package", > > > "socket" etc can be misleading. For example the sample topology you > > > use on the documentation says that there are 4 cores that are part > > > of a package, however "package" has a different meaning to the > > > scheduler. Also we don't say anything in case they share a power > > > domain or if they have the same capacity or not. This mapping deals > > > only with cache hierarchy or other shared resources. > > > > > > > {Un,}fortunately those are terms used by hardware people. > > > > And they are wrong, how the harts are physically packed doesn't imply > their actual topology. In general the "translation" is not always straight > forward, there are assumptions in place. We could use "cluster" of harts or > "group" of harts instead, they are more abstract. > Indeed. I agree those terminologies may not be best, but they are already used one. We need to map on to those generic ones, though the translations may not be simple. We do have the same issues on ARM. If we try to build in such information into DT, then it becomes more configuration file for OS than platform description IMO. -- Regards, Sudeep
On Tue, Nov 06, 2018 at 05:26:01PM +0200, Nick Kossifidis wrote: > Στις 2018-11-06 16:13, Sudeep Holla έγραψε: > > On Fri, Nov 02, 2018 at 08:58:39PM +0200, Nick Kossifidis wrote: > > > Στις 2018-11-02 01:04, Atish Patra έγραψε: > > > > This patch series adds the cpu topology for RISC-V. It contains > > > > both the DT binding and actual source code. It has been tested on > > > > QEMU & Unleashed board. > > > > > > > > The idea is based on cpu-map in ARM with changes related to how > > > > we define SMT systems. The reason for adopting a similar approach > > > > to ARM as I feel it provides a very clear way of defining the > > > > topology compared to parsing cache nodes to figure out which cpus > > > > share the same package or core. I am open to any other idea to > > > > implement cpu-topology as well. > > > > > > I was also about to start a discussion about CPU topology on RISC-V > > > after the last swtools group meeting. The goal is to provide the > > > scheduler with hints on how to distribute tasks more efficiently > > > between harts, by populating the scheduling domain topology levels > > > (https://elixir.bootlin.com/linux/v4.19/ident/sched_domain_topology_level). > > > What we want to do is define cpu groups and assign them to > > > scheduling domains with the appropriate SD_ flags > > > (https://github.com/torvalds/linux/blob/master/include/linux/sched/topology.h#L16). > > > > OK are we defining a CPU topology binding for Linux scheduler ? > > NACK for all the approaches that assumes any knowledge of OS scheduler. > > Is there any standard regarding CPU topology on the device tree spec ? > As far as I know there is none. We are talking about a Linux-specific > Device Tree binding so I don't see why defining a binding for the > Linux scheduler is out of scope. Speaking as a DT binding maintainer, please avoid OS-specific DT bindings wherever possible. While DT bindings live in the kernel tree, they are not intended to be Linux-specific, and other OSs (e.g. FreeBSD, zephyr) are aiming to support the same bindings. In general, targeting a specific OS is a bad idea, because the implementation details of that OS change over time, or the bindings end up being tailored to one specific use-case. Exposing details to the OS such that the OS can make decisions at runtime is typically better. > Do you have cpu-map on other OSes as well ? There is nothing OS-specific about cpu-map, and it may be of use to other OSs. > > > So the cores that belong to a scheduling domain may share: > > > CPU capacity (SD_SHARE_CPUCAPACITY / SD_ASYM_CPUCAPACITY) > > > Package resources -e.g. caches, units etc- (SD_SHARE_PKG_RESOURCES) > > > Power domain (SD_SHARE_POWERDOMAIN) > > > > > > > Too Linux kernel/scheduler specific to be part of $subject > > All lists on the cc list are Linux specific, again I don't see your > point here are we talking about defining a standard CPU topology > scheme for the device tree spec or a Linux-specific CPU topology > binding such as cpu-map ? The cpu-map binding is not intended to be Linux specific, and avoids Linux-specific terminology. While the cpu-map binding documentation is in the Linux source tree, the binding itseld is not intended to be Linux-specific, and it deliberately avoids Linux implementation details. > Even on this case your point is not valid, the information of two > harts sharing a common power domain or having the same or not > capacity/max frequency (or maybe capabilities/extensions in the > future), is not Linux specific. 
> I just used the Linux specific macros > used by the Linux scheduler to point out the code path. Even on other > OSes we still need a way to include this information on the CPU > topology, and currently cpu-map doesn't. Also the Linux implementation > of cpu-map ignores multiple levels of shared resources, we only get > one level for SMT and one level for MC last time I checked. Given clusters can be nested, as in the very first example, I don't see what prevents multiple levels of shared resources. Can you please give an example of the topology you're considering? Does that share some resources across clusters at some level? We are certainly open to improving the cpu-map binding. Thanks, Mark.
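As an aside on the nesting Mark mentions: the parsing side of cpu-map is naturally recursive. The sketch below is a simplified, hypothetical walk, loosely modeled on arch/arm64/kernel/topology.c with id assignment and error handling omitted, showing that clusterN nodes can be descended into to arbitrary depth.

#include <linux/kernel.h>
#include <linux/of.h>

static void parse_core(struct device_node *core)
{
	struct device_node *t, *cpu;
	char name[16];
	int i;

	for (i = 0; ; i++) {
		snprintf(name, sizeof(name), "thread%d", i);
		t = of_get_child_by_name(core, name);
		if (!t)
			break;
		cpu = of_parse_phandle(t, "cpu", 0);	/* SMT leaf */
		of_node_put(cpu);
		of_node_put(t);
	}

	cpu = of_parse_phandle(core, "cpu", 0);		/* non-SMT leaf */
	of_node_put(cpu);
}

static void parse_cluster(struct device_node *cluster, int depth)
{
	struct device_node *child;
	char name[16];
	int i;

	/* clusters may nest arbitrarily deep */
	for (i = 0; ; i++) {
		snprintf(name, sizeof(name), "cluster%d", i);
		child = of_get_child_by_name(cluster, name);
		if (!child)
			break;
		parse_cluster(child, depth + 1);
		of_node_put(child);
	}

	for (i = 0; ; i++) {
		snprintf(name, sizeof(name), "core%d", i);
		child = of_get_child_by_name(cluster, name);
		if (!child)
			break;
		parse_core(child);
		of_node_put(child);
	}
}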
Στις 2018-11-06 18:20, Mark Rutland έγραψε: > On Tue, Nov 06, 2018 at 05:26:01PM +0200, Nick Kossifidis wrote: >> Στις 2018-11-06 16:13, Sudeep Holla έγραψε: >> > On Fri, Nov 02, 2018 at 08:58:39PM +0200, Nick Kossifidis wrote: >> > > Στις 2018-11-02 01:04, Atish Patra έγραψε: >> > > > This patch series adds the cpu topology for RISC-V. It contains >> > > > both the DT binding and actual source code. It has been tested on >> > > > QEMU & Unleashed board. >> > > > >> > > > The idea is based on cpu-map in ARM with changes related to how >> > > > we define SMT systems. The reason for adopting a similar approach >> > > > to ARM as I feel it provides a very clear way of defining the >> > > > topology compared to parsing cache nodes to figure out which cpus >> > > > share the same package or core. I am open to any other idea to >> > > > implement cpu-topology as well. >> > > >> > > I was also about to start a discussion about CPU topology on RISC-V >> > > after the last swtools group meeting. The goal is to provide the >> > > scheduler with hints on how to distribute tasks more efficiently >> > > between harts, by populating the scheduling domain topology levels >> > > (https://elixir.bootlin.com/linux/v4.19/ident/sched_domain_topology_level). >> > > What we want to do is define cpu groups and assign them to >> > > scheduling domains with the appropriate SD_ flags >> > > (https://github.com/torvalds/linux/blob/master/include/linux/sched/topology.h#L16). >> > >> > OK are we defining a CPU topology binding for Linux scheduler ? >> > NACK for all the approaches that assumes any knowledge of OS scheduler. >> >> Is there any standard regarding CPU topology on the device tree spec ? >> As far as I know there is none. We are talking about a Linux-specific >> Device Tree binding so I don't see why defining a binding for the >> Linux scheduler is out of scope. > > Speaking as a DT binding maintainer, please avoid OS-specific DT > bindings wherever possible. > > While DT bindings live in the kernel tree, they are not intended to be > Linux-specific, and other OSs (e.g. FreeBSD, zephyr) are aiming to > support the same bindings. > > In general, targeting a specific OS is a bad idea, because the > implementation details of that OS change over time, or the bindings end > up being tailored to one specific use-case. Exposing details to the OS > such that the OS can make decisions at runtime is typically better. > >> Do you have cpu-map on other OSes as well ? > > There is nothing OS-specific about cpu-map, and it may be of use to > other OSs. > >> > > So the cores that belong to a scheduling domain may share: >> > > CPU capacity (SD_SHARE_CPUCAPACITY / SD_ASYM_CPUCAPACITY) >> > > Package resources -e.g. caches, units etc- (SD_SHARE_PKG_RESOURCES) >> > > Power domain (SD_SHARE_POWERDOMAIN) >> > > >> > >> > Too Linux kernel/scheduler specific to be part of $subject >> >> All lists on the cc list are Linux specific, again I don't see your >> point here are we talking about defining a standard CPU topology >> scheme for the device tree spec or a Linux-specific CPU topology >> binding such as cpu-map ? > > The cpu-map binding is not intended to be Linux specific, and avoids > Linux-specific terminology. > > While the cpu-map binding documentation is in the Linux source tree, > the > binding itseld is not intended to be Linux-specific, and it > deliberately > avoids Linux implementation details. 
> >> Even on this case your point is not valid, the information of two >> harts sharing a common power domain or having the same or not >> capacity/max frequency (or maybe capabilities/extensions in the >> future), is not Linux specific. I just used the Linux specific macros >> used by the Linux scheduler to point out the code path. Even on other >> OSes we still need a way to include this information on the CPU >> topology, and currently cpu-map doesn't. Also the Linux implementation >> of cpu-map ignores multiple levels of shared resources, we only get >> one level for SMT and one level for MC last time I checked. > > Given clusters can be nested, as in the very first example, I don't see > what prevents multiple levels of shared resources. > > Can you please given an example of the topology your considering? Does > that share some resources across clusters at some level? > > We are certainly open to improving the cpu-map binding. > > Thanks, > Mark. Mark and Sundeep thanks a lot for your feedback, I guess you convinced me that having a device tree binding for the scheduler is not a correct approach. It's not a device after all and I agree that the device tree shouldn't become an OS configuration file. Regarding multiple levels of shared resources my point is that since cpu-map doesn't contain any information of what is shared among the cluster/core members it's not easy to do any further translation. Last time I checked the arm code that uses cpu-map, it only defines one domain for SMT, one for MC and then everything else is ignored. No matter how many clusters have been defined, anything above the core level is the same (and then I guess you started talking about adding "packages" on the representation side). The reason I proposed to have a binding for the scheduler directly is not only because it's simpler and closer to what really happens in the code, it also makes more sense to me than the combination of cpu-map with all the related mappings e.g. for numa or caches or power domains etc. However you are right we could definitely augment cpu-map to include support for what I'm saying and clean things up, and since you are open about improving it here is a proposal that I hope you find interesting: At first let's get rid of the <thread> nodes, they don't make sense: thread0 { cpu = <&CPU0>; }; A thread node can't have more than one cpu entry and any properties should be on the cpu node itself, so it doesn't / can't add any more information. We could just have an array of cpu nodes on the <core> node, it's much cleaner this way. core0 { members = <&CPU0>, <&CPU1>; }; Then let's allow the cluster and core nodes to accept attributes that are common for the cpus they contain. Right now this is considered invalid. For power domains we have a generic binding described on Documentation/devicetree/bindings/power/power_domain.txt which basically says that we need to put power-domains = <power domain specifiers> attribute on each of the cpu nodes. The same happens with the capacity binding specified for arm on Documentation/devicetree/bindings/arm/cpu-capacity.txt which says we should add the capacity-dmips-mhz on each of the cpu nodes. The same also happens with the generic numa binding on Documentation/devicetree/bindings/numa.txt which says we should add the nuna-node-id on each of the cpu nodes. We could allow for these attributes to exist on cluster and core nodes as well so that we can represent their properties better. 
It shouldn't be a big deal and it can be done in a backwards-compatible way (if we don't find them on the cpu node, climb up the topology hierarchy until we find them / not find them at all). All I'm saying is that I prefer this: cpus { cpu@0 { ... }; cpu@1 { ... }; cpu@2 { ... }; cpu@3 { ... }; }; cluster0 { cluster0 { core0 { power-domains = <&pdc 0>; numa-node-id = <0>; capacity-dmips-mhz = <578>; members = <&cpu0>, <&cpu1>; } }; cluster1 { capacity-dmips-mhz = <1024>; core0 { power-domains = <&pdc 1>; numa-node-id = <1>; members = <&cpu2>; }; core1 { power-domains = <&pdc 2>; numa-node-id = <2>; members = <&cpu3>; }; }; } over this: cpus { cpu@0 { ... power-domains = <&pdc 0>; capacity-dmips-mhz = <578>; numa-node-id = <0>; ... }; cpu@1 { ... power-domains = <&pdc 0>; capacity-dmips-mhz = <578>; numa-node-id = <0>; ... }; cpu@2 { ... power-domains = <&pdc 1>; capacity-dmips-mhz = <1024>; numa-node-id = <1>; ... }; cpu@3 { ... power-domains = <&pdc 2>; capacity-dmips-mhz = <1024>; numa-node-id = <2>; ... }; }; cluster0 { cluster0 { core0 { members = <&cpu0>, <&cpu1>; } }; cluster1 { core0 { members = <&cpu2>; } }; cluster2 { core0 { members = <&cpu3>; } }; } When it comes to shared resources, the standard dt mappings we have are for caches and are on the device spec standard (coming from power pc's ePAPR standard I think). The below comes from HiFive unleashed's device tree (U540Config.dts) that follows the spec: cpus { cpu@1 { ... next-level-cache = <&L24 &L0>; ... }; cpu@2 { ... next-level-cache = <&L24 &L0>; ... }; cpu@3 { ... next-level-cache = <&L24 &L0>; ... }; cpu@4 { ... next-level-cache = <&L24 &L0>; ... }; }; L2: soc { L0: cache-controller@2010000 { cache-block-size = <64>; cache-level = <2>; cache-sets = <2048>; cache-size = <2097152>; cache-unified; compatible = "sifive,ccache0", "cache"; ... }; } Note that the cache-controller node that's common between the 4 cores can exist anywhere BUT the cluster node ! However it's a property of the cluster. A quick search through the tree got me r8a77980.dtsi that defines the cache on the cpus node and I'm sure there are other similar cases. Wouldn't this be better ? cluster0 { core0 { cache-controller@2010000 { cache-block-size = <64>; cache-level = <2>; cache-sets = <2048>; cache-size = <2097152>; cache-unified; compatible = "sifive,ccache0", "cache"; ... }; members = <&cpu0>, <&cpu1>, <&cpu2>, <&cpu3>; }; }; We could even remove next-level-cache from the cpu nodes and infer it from the topology (search the topology upwards until we get a node that's "cache"-compatible), we can again make this backwards-compatible. Finally from the examples above I'd like to stress out that the distinction between a cluster and a core doesn't make much sense and it also makes the representation more complicated. To begin with, how would you call the setup on HiFive Unleashed ? A cluster of 4 cores that share the same L3 cache ? One core with 4 harts that share the same L3 cache ? We could represent it like this instead: cluster0 { cache-controller@2010000 { cache-block-size = <64>; cache-level = <2>; cache-sets = <2048>; cache-size = <2097152>; cache-unified; compatible = "sifive,ccache0", "cache"; ... }; core0 { members = <&cpu0>; }; core1 { members = <&cpu1>; }; core2 { members = <&cpu2>; }; core3 { members = <&cpu3>; }; }; We could e.g. keep only cluster nodes and allow them to contain either an array of harts or other cluster sub-nodes + optionally a set of attributes, common to the members/sub-nodes of the cluster. 
This way we'll get in the first example: cluster0 { cluster0 { power-domains = <&pdc 0>; numa-node-id = <0>; capacity-dmips-mhz = <578>; members = <&cpu0>, <&cpu1>; }; cluster1 { capacity-dmips-mhz = <1024>; cluster0 { power-domains = <&pdc 1>; numa-node-id = <1>; members = <&cpu2>; }; cluster1 { power-domains = <&pdc 2>; numa-node-id = <2>; members = <&cpu3>; }; }; } and in the second example: cluster0 { cache-controller@2010000 { cache-block-size = <64>; cache-level = <2>; cache-sets = <2048>; cache-size = <2097152>; cache-unified; compatible = "sifive,ccache0", "cache"; ... }; members = <&cpu0>, <&cpu1>, <&cpu2>, <&cpu3>; }; Thank you for your time ! Regards, Nick
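A minimal sketch of the backwards-compatible lookup proposed above ("climb up the topology hierarchy until we find them"): the helper name is made up, and it assumes the core/cluster node listing the cpu in its "members" array has already been located. It works for properties such as "numa-node-id" or "capacity-dmips-mhz".

#include <linux/errno.h>
#include <linux/of.h>
#include <linux/types.h>

static int topology_read_u32(struct device_node *cpu_node,
			     struct device_node *core_node,
			     const char *prop, u32 *val)
{
	struct device_node *np;

	/* a property on the cpu node itself always wins */
	if (!of_property_read_u32(cpu_node, prop, val))
		return 0;

	/* otherwise climb the core/cluster hierarchy towards the root */
	for (np = of_node_get(core_node); np; np = of_get_next_parent(np)) {
		if (!of_property_read_u32(np, prop, val)) {
			of_node_put(np);
			return 0;
		}
	}

	return -ENODATA;
}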
On Wed, Nov 07, 2018 at 04:31:34AM +0200, Nick Kossifidis wrote: > Mark and Sundeep thanks a lot for your feedback, I guess you convinced > me that having a device tree binding for the scheduler is not a > correct approach. It's not a device after all and I agree that the > device tree shouldn't become an OS configuration file. Good to hear. > Regarding multiple levels of shared resources my point is that since > cpu-map doesn't contain any information of what is shared among the > cluster/core members it's not easy to do any further translation. Last > time I checked the arm code that uses cpu-map, it only defines one > domain for SMT, one for MC and then everything else is ignored. No > matter how many clusters have been defined, anything above the core > level is the same (and then I guess you started talking about adding > "packages" on the representation side). While cpu-map doesn't contain that information today, we can *add* that information to the cpu-map binding if necessary. > The reason I proposed to have a binding for the scheduler directly is > not only because it's simpler and closer to what really happens in the > code, it also makes more sense to me than the combination of cpu-map > with all the related mappings e.g. for numa or caches or power > domains etc. > > However you are right we could definitely augment cpu-map to include > support for what I'm saying and clean things up, and since you are > open about improving it here is a proposal that I hope you find > interesting: > > At first let's get rid of the <thread> nodes, they don't make sense: > > thread0 { > cpu = <&CPU0>; > }; > > A thread node can't have more than one cpu entry and any properties > should be on the cpu node itself, so it doesn't / can't add any > more information. We could just have an array of cpu nodes on the > <core> node, it's much cleaner this way. > > core0 { > members = <&CPU0>, <&CPU1>; > }; Hold on. Rather than reinventing things from first principles, can we please discuss what you want to *achieve*, i.e. what information you need? Having a node is not a significant cost, and there are reasons we may want thread nodes. For example, it means that we can always refer to any level of topology with a phandle, and we might want to describe thread-affine devices in future. There are a tonne of existing bindings that are ugly, but re-inventing them for taste reasons alone is more costly to the ecosystem than simply using the existing bindings. We avoid re-inventing bindings unless there is a functional problem e.g. cases which they cannot possibly describe. > Then let's allow the cluster and core nodes to accept attributes that are > common for the cpus they contain. Right now this is considered invalid. > > For power domains we have a generic binding described on > Documentation/devicetree/bindings/power/power_domain.txt > which basically says that we need to put power-domains = <power domain > specifiers> > attribute on each of the cpu nodes. FWIW, given this is arguably topological, I'm not personally averse to describing this in the cpu-map, if that actually gains us more than the complexity require to support it. Given we don't do this for device power domains, I suspect that it's simpler to stick with the existing binding. > The same happens with the capacity binding specified for arm on > Documentation/devicetree/bindings/arm/cpu-capacity.txt > which says we should add the capacity-dmips-mhz on each of the cpu nodes. 
The cpu-map was intended to expose topological details, and this isn't really a topological property. For example, Arm DynamIQ systems can have heterogeneous CPUs within clusters. I do not think it's worth moving this, tbh. > The same also happens with the generic numa binding on > Documentation/devicetree/bindings/numa.txt > which says we should add the numa-node-id on each of the cpu nodes. Is there a strong gain from moving this? [...] > Finally from the examples above I'd like to stress out that the distinction > between a cluster and a core doesn't make much sense and it also makes the > representation more complicated. To begin with, how would you call the setup > on HiFive Unleashed ? A cluster of 4 cores that share the same L3 cache ? Not knowing much about the hardware, I can't really say. I'm not sure I follow why the distinction between a cluster and a core is non-sensical. A cluster is always a collection of cores. A hart could be a core in its own right, or it could be a thread under a core, which shares functional units with other harts within that core. Arguably, we could have mandated that the topology always needed to describe down to a thread, even if a core only had a single thread. That ship has sailed, however. Thanks, Mark.
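For context, the per-cpu capacity binding Mark refers to is consumed roughly as sketched below. This is a simplified illustration, not the actual drivers/base/arch_topology.c code: raw capacity-dmips-mhz values are read from each cpu node and later normalized against the largest one.

#include <linux/cpumask.h>
#include <linux/of.h>
#include <linux/threads.h>

static u32 raw_capacity[NR_CPUS];
static unsigned long cpu_scale[NR_CPUS];

static void parse_cpu_capacity(struct device_node *cpu_node, int cpu)
{
	if (of_property_read_u32(cpu_node, "capacity-dmips-mhz",
				 &raw_capacity[cpu]))
		raw_capacity[cpu] = 0;	/* property absent: no asymmetry info */
}

static void normalize_cpu_capacities(void)
{
	unsigned long max_raw = 1;
	int cpu;

	for_each_possible_cpu(cpu)
		if (raw_capacity[cpu] > max_raw)
			max_raw = raw_capacity[cpu];

	/* scale so the biggest CPU ends up at SCHED_CAPACITY_SCALE (1024) */
	for_each_possible_cpu(cpu)
		cpu_scale[cpu] = raw_capacity[cpu] * 1024UL / max_raw;
}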
On Wed, Nov 07, 2018 at 04:31:34AM +0200, Nick Kossifidis wrote: [...] > > Mark and Sudeep thanks a lot for your feedback, I guess you convinced me > that having a device tree binding for the scheduler is not a correct > approach. > Thanks :) > It's not a device after all and I agree that the device tree shouldn't > become an OS configuration file. Regarding multiple levels of shared > resources my point is that since cpu-map doesn't contain any information of > what is shared among the cluster/core members it's not easy to do any > further translation. Last time I checked the arm code that uses cpu-map, it > only defines one domain for SMT, one for MC and then everything else is > ignored. No matter how many clusters have been defined, anything above the > core level is the same (and then I guess you started talking about adding > "packages" on the representation side). > Correct. > The reason I proposed to have a binding for the scheduler directly is not > only because it's simpler and closer to what really happens in the code, it > also makes more sense to me than the combination of cpu-map with all the > related mappings e.g. for numa or caches or power domains etc. > Again you are just looking at it with Linux kernel perspective. > However you are right we could definitely augment cpu-map to include support > for what I'm saying and clean things up, and since you are open about > improving it here is a proposal that I hope you find interesting: At first > let's get rid of the <thread> nodes, they don't make sense: > > thread0 { > cpu = <&CPU0>; > }; > Do you have any strong reasons to do so ? Since it's already there for some time, I believe we can't remove it for backward compatibility reasons. > A thread node can't have more than one cpu entry and any properties > should be on the cpu node itself, so it doesn't / can't add any > more information. We could just have an array of cpu nodes on the > <core> node, it's much cleaner this way. > > core0 { > members = <&CPU0>, <&CPU1>; > }; > I agree, but we have kernel code using it(arm64/kernel/topology.c). It's too late to remove it. But we can always keep to optional if we move the ARM64 binding as generic to start with and mandate it for only ARM64. > Then let's allow the cluster and core nodes to accept attributes that are > common for the cpus they contain. Right now this is considered invalid. > Yes, we have discussed in the past and decided not to. I am fine if we need to change it, but assuming the topology implies other information could be wrong. On ARM platforms we have learnt it, so we kept any information away from topology. I assume same with RISC-V, different vendors implement in different ways, so it's better to consider those factors. > For power domains we have a generic binding described on > Documentation/devicetree/bindings/power/power_domain.txt > which basically says that we need to put power-domains = <power domain > specifiers> attribute on each of the cpu nodes. > OK, but what's wrong with that. I gives full flexibility. > The same happens with the capacity binding specified for arm on > Documentation/devicetree/bindings/arm/cpu-capacity.txt > which says we should add the capacity-dmips-mhz on each of the cpu nodes. > Ditto, we may need this for our single cluster DSU systems. > The same also happens with the generic numa binding on > Documentation/devicetree/bindings/numa.txt > which says we should add the nuna-node-id on each of the cpu nodes. > Yes, but again what's the problem ? 
> We could allow for these attributes to exist on cluster and core nodes > as well so that we can represent their properties better. It shouldn't > be a big deal and it can be done in a backwards-compatible way (if we > don't find them on the cpu node, climb up the topology hierarchy until > we find them / not find them at all). All I'm saying is that I prefer this: > [...] > > cluster0 { > cluster0 { > core0 { > power-domains = <&pdc 0>; > numa-node-id = <0>; > capacity-dmips-mhz = <578>; > members = <&cpu0>, <&cpu1>; > } > }; > cluster1 { > capacity-dmips-mhz = <1024>; > core0 { > power-domains = <&pdc 1>; > numa-node-id = <1>; > members = <&cpu2>; > }; > core1 { > power-domains = <&pdc 2>; > numa-node-id = <2>; > members = <&cpu3>; > }; > }; > } > Why are you so keen on optimising the representation ? If you are worried about large systems, generate one instead of handcrafted. > When it comes to shared resources, the standard dt mappings we have are for > caches and are on the device spec standard (coming from power pc's ePAPR > standard I think). The below comes from HiFive unleashed's device tree > (U540Config.dts) that follows the spec: > I don't understand what you are trying to explain, ePAPR does specify per CPU entries. [...] > Note that the cache-controller node that's common between the 4 cores can > exist anywhere BUT the cluster node ! However it's a property of the > cluster. > A quick search through the tree got me r8a77980.dtsi that defines the cache > on the cpus node and I'm sure there are other similar cases. Wouldn't this > be better ? > > cluster0 { > core0 { > cache-controller@2010000 { > cache-block-size = <64>; > cache-level = <2>; > cache-sets = <2048>; > cache-size = <2097152>; > cache-unified; > compatible = "sifive,ccache0", "cache"; > ... > }; > members = <&cpu0>, <&cpu1>, <&cpu2>, <&cpu3>; Not a good idea IMO. > We could even remove next-level-cache from the cpu nodes and infer it from > the topology (search the topology upwards until we get a node that's > "cache"-compatible), we can again make this backwards-compatible. > Why are you assuming that they *have* to be so aligned with topology ? How do you deal with different kind of systems ? > > Finally from the examples above I'd like to stress out that the distinction > between a cluster and a core doesn't make much sense and it also makes the > representation more complicated. To begin with, how would you call the setup > on HiFive Unleashed ? A cluster of 4 cores that share the same L3 cache ? > One core with 4 harts that share the same L3 cache ? We could represent it > like this instead: > Just represent each physical cache and get the list of CPUs sharing it. Doesn't matter what it is: cluster or cluster of clusters or cluster of harts, blah, blah. It really doesn't matter. > > We could e.g. keep only cluster nodes and allow them to contain either an > array of harts or other cluster sub-nodes + optionally a set of attributes, > common to the members/sub-nodes of the cluster. This way we'll get in the > first example: > All these fancy new ideas you are proposing are good if vendors follow some things religiously, but I really doubt if that's true. So just have maximum flexibility so that we can represent maximum number of systems without having to redefine the bindings again and again for the same thing. So instead of finding ways to optimise, you should really come up with list of shortcomings in the existing bindings so that we cover more platforms with generic bindings. 
IMO you are too concerned with optimising the DT representation, which may defeat the purpose of generic bindings. -- Regards, Sudeep
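The flexibility Sudeep refers to comes from the generic power-domains specifier being attached to each cpu node; whether two cpus share a power domain can then be derived directly from the cpu nodes, along the lines of the hedged sketch below (helper name made up, only the first specifier of each node is compared).

#include <linux/of.h>
#include <linux/string.h>
#include <linux/types.h>

static bool cpus_share_power_domain(struct device_node *a,
				    struct device_node *b)
{
	struct of_phandle_args pa, pb;
	bool shared;

	/* "power-domains" as defined in bindings/power/power_domain.txt */
	if (of_parse_phandle_with_args(a, "power-domains",
				       "#power-domain-cells", 0, &pa))
		return false;
	if (of_parse_phandle_with_args(b, "power-domains",
				       "#power-domain-cells", 0, &pb)) {
		of_node_put(pa.np);
		return false;
	}

	shared = pa.np == pb.np &&
		 pa.args_count == pb.args_count &&
		 !memcmp(pa.args, pb.args, pa.args_count * sizeof(pa.args[0]));

	of_node_put(pa.np);
	of_node_put(pb.np);
	return shared;
}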
Στις 2018-11-07 14:06, Mark Rutland έγραψε: > On Wed, Nov 07, 2018 at 04:31:34AM +0200, Nick Kossifidis wrote: >> Mark and Sundeep thanks a lot for your feedback, I guess you convinced >> me that having a device tree binding for the scheduler is not a >> correct approach. It's not a device after all and I agree that the >> device tree shouldn't become an OS configuration file. > > Good to hear. > >> Regarding multiple levels of shared resources my point is that since >> cpu-map doesn't contain any information of what is shared among the >> cluster/core members it's not easy to do any further translation. Last >> time I checked the arm code that uses cpu-map, it only defines one >> domain for SMT, one for MC and then everything else is ignored. No >> matter how many clusters have been defined, anything above the core >> level is the same (and then I guess you started talking about adding >> "packages" on the representation side). > > While cpu-map doesn't contain that information today, we can *add* that > information to the cpu-map binding if necessary. > >> The reason I proposed to have a binding for the scheduler directly is >> not only because it's simpler and closer to what really happens in the >> code, it also makes more sense to me than the combination of cpu-map >> with all the related mappings e.g. for numa or caches or power >> domains etc. >> >> However you are right we could definitely augment cpu-map to include >> support for what I'm saying and clean things up, and since you are >> open about improving it here is a proposal that I hope you find >> interesting: >> >> At first let's get rid of the <thread> nodes, they don't make sense: >> >> thread0 { >> cpu = <&CPU0>; >> }; >> >> A thread node can't have more than one cpu entry and any properties >> should be on the cpu node itself, so it doesn't / can't add any >> more information. We could just have an array of cpu nodes on the >> <core> node, it's much cleaner this way. >> >> core0 { >> members = <&CPU0>, <&CPU1>; >> }; > > Hold on. Rather than reinventing things from first principles, can we > please discuss what you want to *achieve*, i.e. what information you > need? > > Having a node is not a significant cost, and there are reasons we may > want thread nodes. For example, it means that we can always refer to > any > level of topology with a phandle, and we might want to describe > thread-affine devices in future. > You can use the phandle of the cpu node, the thread node doesn't add anything more than complexity to the representation. > There are a tonne of existing bindings that are ugly, but re-inventing > them for taste reasons alone is more costly to the ecosystem than > simply > using the existing bindings. We avoid re-inventing bindings unless > there > is a functional problem e.g. cases which they cannot possibly describe. > We are talking about using something for RISC-V and possibly common across different archs and, I don't see why we should keep the ugliness of a binding spec plus in this case the <trhead> node can't possibly describe anything else than a cpu=<node> alias, it's redundant. >> Then let's allow the cluster and core nodes to accept attributes that >> are >> common for the cpus they contain. Right now this is considered >> invalid. >> >> For power domains we have a generic binding described on >> Documentation/devicetree/bindings/power/power_domain.txt >> which basically says that we need to put power-domains = <power domain >> specifiers> >> attribute on each of the cpu nodes. 
> > FWIW, given this is arguably topological, I'm not personally averse to > describing this in the cpu-map, if that actually gains us more than the > complexity require to support it. > > Given we don't do this for device power domains, I suspect that it's > simpler to stick with the existing binding. > >> The same happens with the capacity binding specified for arm on >> Documentation/devicetree/bindings/arm/cpu-capacity.txt >> which says we should add the capacity-dmips-mhz on each of the cpu >> nodes. > > The cpu-map was intended to expose topological dtails, and this isn't > really a topological property. For example, Arm DynamIQ systems can > have > heterogeneous CPUs within clusters. > > I do not think it's worth moving this, tbh. > >> The same also happens with the generic numa binding on >> Documentation/devicetree/bindings/numa.txt >> which says we should add the nuna-node-id on each of the cpu nodes. > > Is there a strong gain from moving this? > > [...] > Right now with the device tree spec and the above bindings we can use the information from cpu nodes to infer the cache topology, the memory topology, the power domain topology etc. We have of_find_next_cache_node and of_find_last_cache_level for example that use "next-level-cache" and are used on powerpc (where this spec comes from), that we can further build on top of them (since this is now part of the device tree spec anyway), to e.g. populate the levels of cache, the levels of shared cache and finally create cpu masks for the different cache sharing levels. This is standards-compliant, arch-independent (they are on of/base.c), and it provides a concrete hint on CPU topology rather grouping harts to classes like "core", "package", "socket", "cluster" etc that don't mean anything specific and we assume to map to specific levels of cache sharing. The same goes for the power domain topology, numa topology, or for understanding how the cpus are grouped in terms of capacity, we can always just use the above bindings on cpu nodes and be done with it. The way I see it cpu-map doesn't add anything new to the spec at this point, it's just there for convenience so that people don't have to add all these infos on the cpu nodes. Instead of adding cache nodes and use the next-level-cache property like the device tree spec says, we have cpu-map that presents how the harts are "packed" inside the chip, assuming that their packing also aligns on how they share their caches (they say nothing about how they share their power domain or other attributes). In a sense it goes against your rule of "re-inventing them for taste reasons alone is more costly to the ecosystem than simply using the existing bindings", I fail to see anything else than "taste reasons" when it comes to cpu-map, since there are existing bindings for inferring the CPU topology already, they are just not that convenient. I'm proposing to add the option (not enforce) of adding those attributes that are meant to be used on cpu nodes, on the various groups/classes of the cpu-map, this way cpu-map will provide something more meaningful and at least improve the representation side of things. On the implementation side it might save us the burden of infering the topology from parsing all cpu nodes each time. It's also backwards compatible with what you already have, the only thing that's not backwards compatible is the removal of the <thread> node, which I don't think is such a big deal to fix. 
>> Finally from the examples above I'd like to stress out that the >> distinction >> between a cluster and a core doesn't make much sense and it also makes >> the >> representation more complicated. To begin with, how would you call the >> setup >> on HiFive Unleashed ? A cluster of 4 cores that share the same L3 >> cache ? > > Not knowing much about the hardware, I can't really say. > > I'm not sure I follow why the distinction between a cluster and a core > is non-sensical. A cluster is always a collection of cores. > > A hart could be a core in its own right, or it could be a thread under > a > core, which shares functional units with other harts within that core. > > Arguably, we could have mandated that the topology always needed to > describe down to a thread, even if a core only had a single thread. > That > ship has sailed, however. > So we agree, the "core" doesn't say anything useful regarding the topology, why keep using this distinction on a binding for RISC-V and possibly other archs (since we are also talking on an arch-independent approach) ? Regards, Nick
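A rough sketch (not existing kernel code) of the approach Nick describes: derive "shares a cache at level N" cpumasks straight from the standard next-level-cache chain using the existing of_find_next_cache_node() helper, instead of inferring sharing from cpu-map groupings.

#include <linux/cpumask.h>
#include <linux/of.h>

static struct device_node *cache_node_at_level(unsigned int cpu, int level)
{
	struct device_node *np = of_get_cpu_node(cpu, NULL);
	int l = 1;	/* the cpu node itself describes the L1 caches */

	while (np && l < level) {
		struct device_node *next = of_find_next_cache_node(np);

		of_node_put(np);
		np = next;
		l++;
	}
	return np;
}

static void build_cache_sharing_mask(unsigned int cpu, int level,
				     struct cpumask *mask)
{
	struct device_node *mine = cache_node_at_level(cpu, level);
	unsigned int other;

	cpumask_clear(mask);
	for_each_possible_cpu(other) {
		struct device_node *theirs = cache_node_at_level(other, level);

		/* same cache node reached at this level => shared cache */
		if (mine && theirs == mine)
			cpumask_set_cpu(other, mask);
		of_node_put(theirs);
	}
	of_node_put(mine);
}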
Στις 2018-11-07 14:28, Sudeep Holla έγραψε: > On Wed, Nov 07, 2018 at 04:31:34AM +0200, Nick Kossifidis wrote: > [...] > >> >> Mark and Sudeep thanks a lot for your feedback, I guess you convinced >> me >> that having a device tree binding for the scheduler is not a correct >> approach. >> > > Thanks :) > >> It's not a device after all and I agree that the device tree shouldn't >> become an OS configuration file. Regarding multiple levels of shared >> resources my point is that since cpu-map doesn't contain any >> information of >> what is shared among the cluster/core members it's not easy to do any >> further translation. Last time I checked the arm code that uses >> cpu-map, it >> only defines one domain for SMT, one for MC and then everything else >> is >> ignored. No matter how many clusters have been defined, anything above >> the >> core level is the same (and then I guess you started talking about >> adding >> "packages" on the representation side). >> > > Correct. > >> The reason I proposed to have a binding for the scheduler directly is >> not >> only because it's simpler and closer to what really happens in the >> code, it >> also makes more sense to me than the combination of cpu-map with all >> the >> related mappings e.g. for numa or caches or power domains etc. >> > > Again you are just looking at it with Linux kernel perspective. > >> However you are right we could definitely augment cpu-map to include >> support >> for what I'm saying and clean things up, and since you are open about >> improving it here is a proposal that I hope you find interesting: At >> first >> let's get rid of the <thread> nodes, they don't make sense: >> >> thread0 { >> cpu = <&CPU0>; >> }; >> > > Do you have any strong reasons to do so ? > Since it's already there for some time, I believe we can't remove it > for > backward compatibility reasons. > >> A thread node can't have more than one cpu entry and any properties >> should be on the cpu node itself, so it doesn't / can't add any >> more information. We could just have an array of cpu nodes on the >> <core> node, it's much cleaner this way. >> >> core0 { >> members = <&CPU0>, <&CPU1>; >> }; >> > > I agree, but we have kernel code using it(arm64/kernel/topology.c). > It's > too late to remove it. But we can always keep to optional if we move > the > ARM64 binding as generic to start with and mandate it for only ARM64. > That's my point as well, if we are going to define something to be used by everybody and in this case, at least for RISC-V, there is no need to carry this from the ARM64 binding. It shouldn't be that hard to fix this in the future for ARM64 as well, we may give the new mapping another name, maybe cpu-map2 or cpu-topology to slowly move to the new one. Changing the dts files shouldn't be this hard, we can provide a script for it. We can even contain some compatibility code that also understands <thread> nodes and e.g. merges them together on a core node. >> Then let's allow the cluster and core nodes to accept attributes that >> are >> common for the cpus they contain. Right now this is considered >> invalid. >> > > Yes, we have discussed in the past and decided not to. I am fine if we > need to change it, but assuming the topology implies other information > could be wrong. On ARM platforms we have learnt it, so we kept any > information away from topology. I assume same with RISC-V, different > vendors implement in different ways, so it's better to consider those > factors. 
> >> For power domains we have a generic binding described on >> Documentation/devicetree/bindings/power/power_domain.txt >> which basically says that we need to put power-domains = <power domain >> specifiers> attribute on each of the cpu nodes. >> > > OK, but what's wrong with that. I gives full flexibility. > >> The same happens with the capacity binding specified for arm on >> Documentation/devicetree/bindings/arm/cpu-capacity.txt >> which says we should add the capacity-dmips-mhz on each of the cpu >> nodes. >> > > Ditto, we may need this for our single cluster DSU systems. > >> The same also happens with the generic numa binding on >> Documentation/devicetree/bindings/numa.txt >> which says we should add the nuna-node-id on each of the cpu nodes. >> > > Yes, but again what's the problem ? > There is no problem with the above bindings, the problem is that we have to put them on each cpu node which is messy. We could instead put them (optionally) on the various groupings used on cpu-map. This would allow cpu-map to be more specific of what is shared across the members of each group (core/cluster/whatever). As I wrote on my answer to Mark previously, the bindings for infering the cache topology, numa topology, power domain topology etc are already there, they are in the devicet tree spec and provide very specific informations we can use. Much "stronger" hints of what's going on at the hw level. The cpu-map doesn't provide such information, it just provides a view of how the various harts/threads are "packed" on the chip, not what they share inside each level of "packing". It's useful because it saves people from having to define a bunch of cache nodes and describe the cache hierarchy on the device tree using the standard spec. So since cpu-map is there for convenience let's make it more convenient ! What I'm saying is that cpu-map could be a more compact way of using the existing bindings for adding properties on groups of harts instead of putting them on each hart individually. It will simplify the representation and may also optimize the implementation a bit (we may get the information we need faster). I don't see any other reason for using cpu-map on RISC-V or for making it global across archs. >> We could allow for these attributes to exist on cluster and core nodes >> as well so that we can represent their properties better. It shouldn't >> be a big deal and it can be done in a backwards-compatible way (if we >> don't find them on the cpu node, climb up the topology hierarchy until >> we find them / not find them at all). All I'm saying is that I prefer >> this: >> > > [...] > >> >> cluster0 { >> cluster0 { >> core0 { >> power-domains = <&pdc 0>; >> numa-node-id = <0>; >> capacity-dmips-mhz = <578>; >> members = <&cpu0>, <&cpu1>; >> } >> }; >> cluster1 { >> capacity-dmips-mhz = <1024>; >> core0 { >> power-domains = <&pdc 1>; >> numa-node-id = <1>; >> members = <&cpu2>; >> }; >> core1 { >> power-domains = <&pdc 2>; >> numa-node-id = <2>; >> members = <&cpu3>; >> }; >> }; >> } >> > > Why are you so keen on optimising the representation ? > If you are worried about large systems, generate one instead of > handcrafted. > I don't see a reason not to try to optimize it, since we are talking about a binding to be used by RISC-V and potentially everybody, I think it makes sens to improve upon what we already have. 
>> When it comes to shared resources, the standard dt mappings we have >> are for >> caches and are on the device spec standard (coming from power pc's >> ePAPR >> standard I think). The below comes from HiFive unleashed's device tree >> (U540Config.dts) that follows the spec: >> > > I don't understand what you are trying to explain, ePAPR does specify > per CPU entries. > > [...] > I'm saying we could allow moving these properties on the groupings of cpu-map to simplify the representation and possibly optimize the implementation. This way I believe cpu-map will be much more useful than it is now. >> Note that the cache-controller node that's common between the 4 cores >> can >> exist anywhere BUT the cluster node ! However it's a property of the >> cluster. >> A quick search through the tree got me r8a77980.dtsi that defines the >> cache >> on the cpus node and I'm sure there are other similar cases. Wouldn't >> this >> be better ? >> >> cluster0 { >> core0 { >> cache-controller@2010000 { >> cache-block-size = <64>; >> cache-level = <2>; >> cache-sets = <2048>; >> cache-size = <2097152>; >> cache-unified; >> compatible = "sifive,ccache0", "cache"; >> ... >> }; >> members = <&cpu0>, <&cpu1>, <&cpu2>, <&cpu3>; > > Not a good idea IMO. > Where in your opinion should this cache node go ? On the first example from a production dts it goes to the SoC node and on another production dts it goes to the cpus/ node. I think it makes more sense to go to the core node or the cluster node, depending on how it's shared. It makes it also easier for the implementation to understand the levels of shared caches and creating cpu masks. >> We could even remove next-level-cache from the cpu nodes and infer it >> from >> the topology (search the topology upwards until we get a node that's >> "cache"-compatible), we can again make this backwards-compatible. >> > > Why are you assuming that they *have* to be so aligned with topology ? > How do you deal with different kind of systems ? > It depends on how you define topology. I believe that the cache hierarchy is a much "stronger" definition of the CPU topology than grouping harts/threads on classes like "core", "socket", "packet" etc that don't provide any information of what's going on inside the hardware. You can think of the question the other way around, why are you assuming that the current topology as specified on cpu-map is aligned with how the harts and cores share their "resources" ? Because that's what's going on at the implementation side of cpu-map. >> >> Finally from the examples above I'd like to stress out that the >> distinction >> between a cluster and a core doesn't make much sense and it also makes >> the >> representation more complicated. To begin with, how would you call the >> setup >> on HiFive Unleashed ? A cluster of 4 cores that share the same L3 >> cache ? >> One core with 4 harts that share the same L3 cache ? We could >> represent it >> like this instead: >> > > Just represent each physical cache and get the list of CPUs sharing it. > Doesn't matter what it is: cluster or cluster of clusters or cluster of > harts, blah, blah. It really doesn't matter. > I totally agree with you, but in this case what's the reason of having cpu-map ? We can infer the topology from what we already have, at least cpu-map adds some convenience (but if we go all the way down the "convinience" path we'll end up on something like my initial approach which was as we both agree now, wrong). >> >> We could e.g. 
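For comparison, the standard-spec way of expressing the same sharing keeps the cache node wherever the vendor placed it (here under /soc, as in the first production dts mentioned above) and has every cpu point at it with next-level-cache, so the sharing set has to be reconstructed by walking those phandles. A sketch reusing the property values quoted above from the Unleashed dts, with the cpu node contents trimmed:

soc {
	l2: cache-controller@2010000 {
		compatible = "sifive,ccache0", "cache";
		cache-level = <2>;
		cache-unified;
		cache-size = <2097152>;
		cache-sets = <2048>;
		cache-block-size = <64>;
	};
};

cpus {
	cpu0: cpu@0 {
		/* ... */
		next-level-cache = <&l2>;
	};
	cpu1: cpu@1 {
		/* ... */
		next-level-cache = <&l2>;
	};
	/* ...and likewise for the other harts that share it */
};

The spec places no constraint on where the cache node itself sits; the sharing information lives entirely in the next-level-cache phandles, which is why the alternative sketched above moves the node under the cluster/core that shares it.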
keep only cluster nodes and allow them to contain either >> an >> array of harts or other cluster sub-nodes + optionally a set of >> attributes, >> common to the members/sub-nodes of the cluster. This way we'll get in >> the >> first example: >> > > All these fancy new ideas you are proposing are good if vendors follow > some things religiously, but I really doubt if that's true. So just > have maximum flexibility so that we can represent maximum number of > systems without having to redefine the bindings again and again for the > same thing. > > So instead of finding ways to optimise, you should really come up with > list of shortcomings in the existing bindings so that we cover more > platforms with generic bindings. IMO you are too concerned on > optimisation > of DT representation which may defeat the purpose of generic bindings. > I get your point, but if that's the case, isn't cpu-map an optimization over the generic bindings for cpu nodes anyway ? As for the flexibility, what I'm saying is to allow the flexibility of adding the properties that we would otherwise add to cpu nodes, on the groups we use on cpu-map. Optionaly so that we are also backwards compatible. I see this as more flexible, not less flexible. The same goes for removing the distinction between "core", "cluster" etc, having specific names for each of the topology levels is IMHO less flexible than using abstract names. Also what's the point of defining a binding if vendors don't follow it ? I think it would be easier for the vendors to just use what's there instead of defining a new binding and writing the code for it. On the other hand you are probably more experienced on this than I am, I haven't dealt with various different vendors and their different ways. Regards, Nick
On Thu, Nov 08, 2018 at 03:45:36PM +0200, Nick Kossifidis wrote: > Στις 2018-11-07 14:06, Mark Rutland έγραψε: > > On Wed, Nov 07, 2018 at 04:31:34AM +0200, Nick Kossifidis wrote: > > > Mark and Sundeep thanks a lot for your feedback, I guess you convinced > > > me that having a device tree binding for the scheduler is not a > > > correct approach. It's not a device after all and I agree that the > > > device tree shouldn't become an OS configuration file. > > > > Good to hear. > > > > > Regarding multiple levels of shared resources my point is that since > > > cpu-map doesn't contain any information of what is shared among the > > > cluster/core members it's not easy to do any further translation. Last > > > time I checked the arm code that uses cpu-map, it only defines one > > > domain for SMT, one for MC and then everything else is ignored. No > > > matter how many clusters have been defined, anything above the core > > > level is the same (and then I guess you started talking about adding > > > "packages" on the representation side). > > > > While cpu-map doesn't contain that information today, we can *add* that > > information to the cpu-map binding if necessary. > > > > > The reason I proposed to have a binding for the scheduler directly is > > > not only because it's simpler and closer to what really happens in the > > > code, it also makes more sense to me than the combination of cpu-map > > > with all the related mappings e.g. for numa or caches or power > > > domains etc. > > > > > > However you are right we could definitely augment cpu-map to include > > > support for what I'm saying and clean things up, and since you are > > > open about improving it here is a proposal that I hope you find > > > interesting: > > > > > > At first let's get rid of the <thread> nodes, they don't make sense: > > > > > > thread0 { > > > cpu = <&CPU0>; > > > }; > > > > > > A thread node can't have more than one cpu entry and any properties > > > should be on the cpu node itself, so it doesn't / can't add any > > > more information. We could just have an array of cpu nodes on the > > > <core> node, it's much cleaner this way. > > > > > > core0 { > > > members = <&CPU0>, <&CPU1>; > > > }; > > > > Hold on. Rather than reinventing things from first principles, can we > > please discuss what you want to *achieve*, i.e. what information you > > need? > > > > Having a node is not a significant cost, and there are reasons we may > > want thread nodes. For example, it means that we can always refer to any > > level of topology with a phandle, and we might want to describe > > thread-affine devices in future. > > You can use the phandle of the cpu node, the thread node doesn't add > anything more than complexity to the representation. That adds complexity elsewhere, because the phandle of a CPU node is not a child of the cpu-map node, and you can't simply walk up the tree hierarchy to find parent nodes. I see no reason to change this part of the binding. Given it's already out there, with existing parsing code, changing it only serves to add complexity to any common code which has to parse this. Can we please drop the idea of dropping the thread node? > > There are a tonne of existing bindings that are ugly, but re-inventing > > them for taste reasons alone is more costly to the ecosystem than simply > > using the existing bindings. We avoid re-inventing bindings unless there > > is a functional problem e.g. cases which they cannot possibly describe. 
> > We are talking about using something for RISC-V and possibly common across > different archs and, I don't see why we should keep the ugliness of a > binding spec plus in this case the <trhead> node can't possibly describe > anything else than a cpu=<node> alias, it's redundant. While it may be ugly, removing it only serves to add complexity for common parsing code, and removes flexibility that we may want in future. Its presence does not cause any technical problem. For better or worse we're all sharing the same codebase here. The common case matters, as does the complexity of the codebase as a whole. I realise it can be frustrating to have to use something you feel is sub-optimal, but putting up with a few nodes which you personally feel are redundant is of much lower cost to the ecosystem than having incompatible bindings where we could have one. If you put up with that, you can focus your efforts on more worthwhile technical exercises. > > > Then let's allow the cluster and core nodes to accept attributes > > > that are > > > common for the cpus they contain. Right now this is considered > > > invalid. > > > > > > For power domains we have a generic binding described on > > > Documentation/devicetree/bindings/power/power_domain.txt > > > which basically says that we need to put power-domains = <power domain > > > specifiers> > > > attribute on each of the cpu nodes. > > > > FWIW, given this is arguably topological, I'm not personally averse to > > describing this in the cpu-map, if that actually gains us more than the > > complexity require to support it. > > > > Given we don't do this for device power domains, I suspect that it's > > simpler to stick with the existing binding. > > > > > The same happens with the capacity binding specified for arm on > > > Documentation/devicetree/bindings/arm/cpu-capacity.txt > > > which says we should add the capacity-dmips-mhz on each of the cpu > > > nodes. > > > > The cpu-map was intended to expose topological dtails, and this isn't > > really a topological property. For example, Arm DynamIQ systems can have > > heterogeneous CPUs within clusters. > > > > I do not think it's worth moving this, tbh. > > > > > The same also happens with the generic numa binding on > > > Documentation/devicetree/bindings/numa.txt > > > which says we should add the nuna-node-id on each of the cpu nodes. > > > > Is there a strong gain from moving this? > > > > [...] > > Right now with the device tree spec and the above bindings we can > use the information from cpu nodes to infer the cache topology, > the memory topology, the power domain topology etc. > > We have of_find_next_cache_node and of_find_last_cache_level for example > that use "next-level-cache" and are used on powerpc (where this spec comes > from), that we can further build on top of them (since this is now part > of the device tree spec anyway), to e.g. populate the levels of cache, > the levels of shared cache and finally create cpu masks for the different > cache sharing levels. The cpu-map binding does not describe cache topology. I never suggested that it did. > This is standards-compliant, arch-independent (they are on of/base.c), and > it provides a concrete hint on CPU topology rather grouping harts to classes > like "core", "package", "socket", "cluster" etc that don't mean anything > specific and we assume to map to specific levels of cache sharing. Please note that I have never suggested "package", and I'm not sure what that's in response to. 
Please also not that while the cache topology and CPU topology are somewhat related, in general (this is at least true on ARM systems) there's no strict mapping between the two, which is why we have separate bindings in the first place. Distinct CPU topologies could look identical in cache topology, and vice-versa. > The same goes for the power domain topology, numa topology, or for > understanding how the cpus are grouped in terms of capacity, we can > always just use the above bindings on cpu nodes and be done with it. > > The way I see it cpu-map doesn't add anything new to the spec at this > point, it's just there for convenience so that people don't have to add all > these infos on the cpu nodes. Instead of adding cache nodes and use the > next-level-cache property like the device tree spec says, we have cpu-map > that presents how the harts are "packed" inside the chip, assuming that > their packing also aligns on how they share their caches (they say > nothing about how they share their power domain or other attributes). The cpu-map and cache bindings describe separate topological details, and neither is intended to replace the other. There are plenty of DT files with both. > In a sense it goes against your rule of "re-inventing them for taste > reasons alone is more costly to the ecosystem than simply using the > existing bindings", I fail to see anything else than "taste reasons" > when it comes to cpu-map, since there are existing bindings for > inferring the CPU topology already, they are just not that convenient. If you believe that's the case, then you can perfectly use the existing cache and NUMA bindings, and not describe the cpu topology at all, no need to change cpu-map or come up with a new binding. > I'm proposing to add the option (not enforce) of adding those > attributes that are meant to be used on cpu nodes, on the various > groups/classes of the cpu-map, this way cpu-map will provide something > more meaningful and at least improve the representation side of > things. On the implementation side it might save us the burden of > infering the topology from parsing all cpu nodes each time. It's also > backwards compatible with what you already have, the only thing that's > not backwards compatible is the removal of the <thread> node, which I > don't think is such a big deal to fix. Sorry, but as I have stated repeatedly, this complicates the common code for what you admit is a matter of taste. I would rather maintain the simplicity of this binding and existing parsing code, and not potentially break other parsers, if the only cost is a few nodes which you personally consider to be redundant. > > > Finally from the examples above I'd like to stress out that the > > > distinction between a cluster and a core doesn't make much sense > > > and it also makes the representation more complicated. To begin > > > with, how would you call the setup on HiFive Unleashed ? A cluster > > > of 4 cores that share the same L3 cache ? > > > > Not knowing much about the hardware, I can't really say. > > > > I'm not sure I follow why the distinction between a cluster and a core > > is non-sensical. A cluster is always a collection of cores. > > > > A hart could be a core in its own right, or it could be a thread under a > > core, which shares functional units with other harts within that core. > > > > Arguably, we could have mandated that the topology always needed to > > describe down to a thread, even if a core only had a single thread. That > > ship has sailed, however. 
> > So we agree, the "core" doesn't say anything useful regarding the > topology, why keep using this distinction on a binding for RISC-V and > possibly other archs (since we are also talking on an arch-independent > approach) ? We keep it because the binding has already been finalised and is in use in practice. The existing binding and parsing code is *sufficient* for your needs. Personal taste and minor savings in the number of nodes in a DT are not compelling reasons to change this when the cost is complexity that we have to maintain *forever*. Thanks, Mark.
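For reference, the finalised binding being defended here (Documentation/devicetree/bindings/arm/topology.txt) nests thread leaves under cores, and because every level is a node, any level can be labelled and referenced by phandle; a trimmed sketch:

cpus {
	cpu-map {
		cluster0 {
			core0 {
				core0_thread0: thread0 {
					cpu = <&CPU0>;
				};
				thread1 {
					cpu = <&CPU1>;
				};
			};
			core1 {
				thread0 {
					cpu = <&CPU2>;
				};
				thread1 {
					cpu = <&CPU3>;
				};
			};
		};
	};
};

A hypothetical future binding for a thread-affine device could then refer to &core0_thread0 directly; with a flat members array on the core node that handle would not exist, which is the flexibility Mark alludes to above.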
On Thu, Nov 08, 2018 at 04:52:30PM +0200, Nick Kossifidis wrote: > Στις 2018-11-07 14:28, Sudeep Holla έγραψε: > > > > I agree, but we have kernel code using it(arm64/kernel/topology.c). It's > > too late to remove it. But we can always keep to optional if we move the > > ARM64 binding as generic to start with and mandate it for only ARM64. > > > > That's my point as well, if we are going to define something to be used > by everybody and in this case, at least for RISC-V, there is no need to > carry this from the ARM64 binding. Sure, whatever you don't need in RISC-V you can see if they can be made optional. I don't think that should be a problem. > It shouldn't be that hard to fix this > in the future for ARM64 as well, we may give the new mapping another name, > maybe cpu-map2 or cpu-topology to slowly move to the new one. No, we have it and we will continue to support it. It's not broken to fix on ARM64. Why do you think that it's broken on ARM64 ? > Changing the > dts files shouldn't be this hard, we can provide a script for it. We can > even contain some compatibility code that also understands <thread> nodes > and e.g. merges them together on a core node. > Sure, hang on this idea of scripting, we can make a better use of it. Details later further in the mail. [...] > > > The same also happens with the generic numa binding on > > > Documentation/devicetree/bindings/numa.txt > > > which says we should add the nuna-node-id on each of the cpu nodes. > > > > > > > Yes, but again what's the problem ? > > > > There is no problem with the above bindings, the problem is that we have > to put them on each cpu node which is messy. We could instead put them > (optionally) on the various groupings used on cpu-map. This would allow > cpu-map to be more specific of what is shared across the members of each > group (core/cluster/whatever). > I think Mark has already explain why/how generic bindings are useful. If you still have concerns, take it up separately and see how you can build *perfect* bindings for RISC-V to avoid any legacy baggage. We have reasons why we can't assume information about cache or power domain topology from CPU topology. I have summarised them already and we are not discussing. There may not be perfect bindings but there are already supported and nothing is broken here to fix. DT bindings are *not* same as code to fix it with a patch to the bindings themselves. Once agreed and merged, they need to be treated like user ABI. > As I wrote on my answer to Mark previously, the bindings for infering > the cache topology, numa topology, power domain topology etc are already > there, they are in the devicet tree spec and provide very specific > informations we can use. Much "stronger" hints of what's going on at > the hw level. The cpu-map doesn't provide such information, it just > provides a view of how the various harts/threads are "packed" on the chip, > not what they share inside each level of "packing". It's useful because > it saves people from having to define a bunch of cache nodes and describe > the cache hierarchy on the device tree using the standard spec. > Ah, here comes. If you want to save people's time or whatever, you can use your scripting magic you have mentioned above to define those bunch of nodes you want to avoid. > So since cpu-map is there for convenience let's make it more convenient ! 
> What I'm saying is that cpu-map could be a more compact way of using the > existing bindings for adding properties on groups of harts instead of > putting them on each hart individually. It will simplify the representation > and may also optimize the implementation a bit (we may get the information > we need faster). I don't see any other reason for using cpu-map on RISC-V > or for making it global across archs. > Sure, I don't have strong opinions there. Just stop mentioning that this is the only solution and all existing ones are broken. They are not, and they need to be supported until they are explicitly deprecated, become obsolete and are finally removed. [...] > > > > Why are you so keen on optimising the representation ? > > If you are worried about large systems, generate one instead of > > handcrafted. > > > > I don't see a reason not to try to optimize it, since we are talking > about a binding to be used by RISC-V and potentially everybody, I think > it makes sense to improve upon what we already have. > Sure, you can, as long as you stop treating the existing ones as broken. I have already said that DT bindings are not *normal code*. You can't just replace existing ones with new optimised ones; you can only add the new (*optimised*) ones alongside the existing ones. You *need* to understand that concept first, otherwise there's no point in this endless discussion IMO. I will stop here as I will have to repeat whatever I have already mentioned to comment on your arguments below. In summary, I am not against improving the bindings if you think it's possible, but I don't see how it's more beneficial, especially if we are going to support the existing ones as well. Mark has already given all the details. -- Regards, Sudeep
Στις 2018-11-08 18:48, Sudeep Holla έγραψε: > On Thu, Nov 08, 2018 at 04:52:30PM +0200, Nick Kossifidis wrote: >> Στις 2018-11-07 14:28, Sudeep Holla έγραψε: >> > >> > I agree, but we have kernel code using it(arm64/kernel/topology.c). It's >> > too late to remove it. But we can always keep to optional if we move the >> > ARM64 binding as generic to start with and mandate it for only ARM64. >> > >> >> That's my point as well, if we are going to define something to be >> used >> by everybody and in this case, at least for RISC-V, there is no need >> to >> carry this from the ARM64 binding. > > Sure, whatever you don't need in RISC-V you can see if they can be made > optional. I don't think that should be a problem. > >> It shouldn't be that hard to fix this >> in the future for ARM64 as well, we may give the new mapping another >> name, >> maybe cpu-map2 or cpu-topology to slowly move to the new one. > > No, we have it and we will continue to support it. It's not broken to > fix on ARM64. Why do you think that it's broken on ARM64 ? > I never said it's broken, I just assumed that if this binding becomes common across archs you'd probably like to switch to it, so it should either be backwards compatible with what you have (or as you said "keep them optional") or take steps for the transition. I'm ok either way, It's not such a serious thing to have the <thread> nodes supported or keep the distinction between "cores", "clusters" or whatever, I'm sorry if it looked that way. I'm just saying that I'd like to have something cleaner for RISC-V and/or a binding that's common for all, we can support both and be backwards compatible, or we can keep it as is and mandate the <thread> nodes only for ARM64 as you suggested. >> Changing the >> dts files shouldn't be this hard, we can provide a script for it. We >> can >> even contain some compatibility code that also understands <thread> >> nodes >> and e.g. merges them together on a core node. >> > > Sure, hang on this idea of scripting, we can make a better use of it. > Details later further in the mail. > > [...] > >> > > The same also happens with the generic numa binding on >> > > Documentation/devicetree/bindings/numa.txt >> > > which says we should add the nuna-node-id on each of the cpu nodes. >> > > >> > >> > Yes, but again what's the problem ? >> > >> >> There is no problem with the above bindings, the problem is that we >> have >> to put them on each cpu node which is messy. We could instead put them >> (optionally) on the various groupings used on cpu-map. This would >> allow >> cpu-map to be more specific of what is shared across the members of >> each >> group (core/cluster/whatever). >> > > I think Mark has already explain why/how generic bindings are useful. > If you still have concerns, take it up separately and see how you can > build *perfect* bindings for RISC-V to avoid any legacy baggage. > > We have reasons why we can't assume information about cache or power > domain topology from CPU topology. I have summarised them already and > we are not discussing. But that's what you do on arch/arm/kernel/topology.c, you assume information on power domain and cache topology based on the CPU topology when you define the scheduling domain topology levels. 
PowerPC also does the same on arch/powerpc/kernel/smp.c and basically if you track the usage of struct sched_domain_topology_level you'll see that every arch that uses it to instruct the scheduler about the scheduling domains, does it based on its CPU topology, which I think we both agree is wrong. > There may not be perfect bindings but there are > already supported and nothing is broken here to fix. DT bindings are > *not* same as code to fix it with a patch to the bindings themselves. > Once agreed and merged, they need to be treated like user ABI. > The only use of the devicetree spec bindings for caches if I'm not mistaken are used on your cacheinfo driver for exporting them to userspace through sysfs (thank you for cleaning this up btw). Even if we have the cache topology using the generic bindings, we are not doing anything with that information but export it to userspace. As for the power domain bindings, we have drivers/base/power/domain.c that handles the generic bindings and it's being used by some drivers to put devices to power domains (something used by the pm code), but from a quick search it's not used for cpu nodes and I didn't see this info being used for providing hints to the scheduler either. Even with the generic power domain bindings in place, for CPU idle states, on ARM we have another binding that's basically the same with the addition of "wakeup-latency-us". For having different capacity there is no generic binding but I guess we could use ARM's. Of the generic bindings that relate to the scheduler's behavior, only the numa binding is used as expected and only by ARM64, powerpc and sparc use their own stuff. So there may be nothing broken regarding the generic / already existing bindings, but their support needs fixing. The way they are now we can't use them on RISC-V for configuring the scheduler and supporting SMT and MC properly + I'm not in favor of using CPU topology blindly as the other archs do. >> As I wrote on my answer to Mark previously, the bindings for infering >> the cache topology, numa topology, power domain topology etc are >> already >> there, they are in the devicet tree spec and provide very specific >> informations we can use. Much "stronger" hints of what's going on at >> the hw level. The cpu-map doesn't provide such information, it just >> provides a view of how the various harts/threads are "packed" on the >> chip, >> not what they share inside each level of "packing". It's useful >> because >> it saves people from having to define a bunch of cache nodes and >> describe >> the cache hierarchy on the device tree using the standard spec. >> > > Ah, here comes. If you want to save people's time or whatever, you can > use > your scripting magic you have mentioned above to define those bunch of > nodes > you want to avoid. > I mentioned the script as a transitioning method, but you are right, it goes both ways. >> So since cpu-map is there for convenience let's make it more >> convenient ! >> What I'm saying is that cpu-map could be a more compact way of using >> the >> existing bindings for adding properties on groups of harts instead of >> putting them on each hart individually. It will simplify the >> representation >> and may also optimize the implementation a bit (we may get the >> information >> we need faster). I don't see any other reason for using cpu-map on >> RISC-V >> or for making it global across archs. >> > > Sure, I don't have strong opinions there. 
Just stop mentioning that > this > is the only solution and all existing ones are broken. They are not and > needs to be supported until they are explicitly deprecated, becomes > obsolete and finally removed. > > [...] > But I never said that's "the only solution and everything else is broken" ! I'm just proposing an extension to cpu-map since Mark suggested that it's possible. My goal is to try and consolidate cpu-map with what Atish proposed for RISC-V (so that we can use the same code) and point out some issues on how we use and define the CPU topology. >> > >> > Why are you so keen on optimising the representation ? >> > If you are worried about large systems, generate one instead of >> > handcrafted. >> > >> >> I don't see a reason not to try to optimize it, since we are talking >> about a binding to be used by RISC-V and potentially everybody, I >> think >> it makes sens to improve upon what we already have. >> > > Sure, you can always unless you stop treating existing ones are broken. > I have already told DT bindings are not *normal code*. You can just > replace existing ones with new optimised ones. You can only add the new > (*optimised*) ones to the existing ones. You *need* to understand that > concept first, otherwise there's not point in this endless discussion > IMO. > > I will stop here as I will have to repeat whatever I have already > mentioned to comment on your arguments below. > > In summary, I am not against improving the bindings if you think it's > possible, but I don't see how it's more beneficially especially if we > are going to support existing ones also. Mark has already given all the > details. > ACK, thanks a lot for your time and the discussion so far, I really appreciate it. Regards, Nick
Στις 2018-11-08 17:54, Mark Rutland έγραψε: > On Thu, Nov 08, 2018 at 03:45:36PM +0200, Nick Kossifidis wrote: >> Στις 2018-11-07 14:06, Mark Rutland έγραψε: >> > On Wed, Nov 07, 2018 at 04:31:34AM +0200, Nick Kossifidis wrote: >> > > Mark and Sundeep thanks a lot for your feedback, I guess you convinced >> > > me that having a device tree binding for the scheduler is not a >> > > correct approach. It's not a device after all and I agree that the >> > > device tree shouldn't become an OS configuration file. >> > >> > Good to hear. >> > >> > > Regarding multiple levels of shared resources my point is that since >> > > cpu-map doesn't contain any information of what is shared among the >> > > cluster/core members it's not easy to do any further translation. Last >> > > time I checked the arm code that uses cpu-map, it only defines one >> > > domain for SMT, one for MC and then everything else is ignored. No >> > > matter how many clusters have been defined, anything above the core >> > > level is the same (and then I guess you started talking about adding >> > > "packages" on the representation side). >> > >> > While cpu-map doesn't contain that information today, we can *add* that >> > information to the cpu-map binding if necessary. >> > >> > > The reason I proposed to have a binding for the scheduler directly is >> > > not only because it's simpler and closer to what really happens in the >> > > code, it also makes more sense to me than the combination of cpu-map >> > > with all the related mappings e.g. for numa or caches or power >> > > domains etc. >> > > >> > > However you are right we could definitely augment cpu-map to include >> > > support for what I'm saying and clean things up, and since you are >> > > open about improving it here is a proposal that I hope you find >> > > interesting: >> > > >> > > At first let's get rid of the <thread> nodes, they don't make sense: >> > > >> > > thread0 { >> > > cpu = <&CPU0>; >> > > }; >> > > >> > > A thread node can't have more than one cpu entry and any properties >> > > should be on the cpu node itself, so it doesn't / can't add any >> > > more information. We could just have an array of cpu nodes on the >> > > <core> node, it's much cleaner this way. >> > > >> > > core0 { >> > > members = <&CPU0>, <&CPU1>; >> > > }; >> > >> > Hold on. Rather than reinventing things from first principles, can we >> > please discuss what you want to *achieve*, i.e. what information you >> > need? >> > >> > Having a node is not a significant cost, and there are reasons we may >> > want thread nodes. For example, it means that we can always refer to any >> > level of topology with a phandle, and we might want to describe >> > thread-affine devices in future. >> >> You can use the phandle of the cpu node, the thread node doesn't add >> anything more than complexity to the representation. > > That adds complexity elsewhere, because the phandle of a CPU node is > not > a child of the cpu-map node, and you can't simply walk up the tree > hierarchy to find parent nodes. > > I see no reason to change this part of the binding. Given it's already > out there, with existing parsing code, changing it only serves to add > complexity to any common code which has to parse this. > > Can we please drop the idea of dropping the thread node? > >> > There are a tonne of existing bindings that are ugly, but re-inventing >> > them for taste reasons alone is more costly to the ecosystem than simply >> > using the existing bindings. 
We avoid re-inventing bindings unless there >> > is a functional problem e.g. cases which they cannot possibly describe. >> >> We are talking about using something for RISC-V and possibly common >> across >> different archs and, I don't see why we should keep the ugliness of a >> binding spec plus in this case the <trhead> node can't possibly >> describe >> anything else than a cpu=<node> alias, it's redundant. > > While it may be ugly, removing it only serves to add complexity for > common parsing code, and removes flexibility that we may want in > future. > Its presence does not cause any technical problem. > > For better or worse we're all sharing the same codebase here. The > common > case matters, as does the complexity of the codebase as a whole. > > I realise it can be frustrating to have to use something you feel is > sub-optimal, but putting up with a few nodes which you personally feel > are redundant is of much lower cost to the ecosystem than having > incompatible bindings where we could have one. If you put up with that, > you can focus your efforts on more worthwhile technical exercises. > The only reason I mentioned the issue with the <thread> node is because you said that you are open for improving cpu-map. I don't think it's such a big deal and even if we decide against it, it's not a big deal to be backwards compatible with what's already there. I fully understand what you say about the impact on the ecosystem and agree with you. >> > > Then let's allow the cluster and core nodes to accept attributes >> > > that are >> > > common for the cpus they contain. Right now this is considered >> > > invalid. >> > > >> > > For power domains we have a generic binding described on >> > > Documentation/devicetree/bindings/power/power_domain.txt >> > > which basically says that we need to put power-domains = <power domain >> > > specifiers> >> > > attribute on each of the cpu nodes. >> > >> > FWIW, given this is arguably topological, I'm not personally averse to >> > describing this in the cpu-map, if that actually gains us more than the >> > complexity require to support it. >> > >> > Given we don't do this for device power domains, I suspect that it's >> > simpler to stick with the existing binding. >> > >> > > The same happens with the capacity binding specified for arm on >> > > Documentation/devicetree/bindings/arm/cpu-capacity.txt >> > > which says we should add the capacity-dmips-mhz on each of the cpu >> > > nodes. >> > >> > The cpu-map was intended to expose topological dtails, and this isn't >> > really a topological property. For example, Arm DynamIQ systems can have >> > heterogeneous CPUs within clusters. >> > >> > I do not think it's worth moving this, tbh. >> > >> > > The same also happens with the generic numa binding on >> > > Documentation/devicetree/bindings/numa.txt >> > > which says we should add the nuna-node-id on each of the cpu nodes. >> > >> > Is there a strong gain from moving this? >> > >> > [...] >> >> Right now with the device tree spec and the above bindings we can >> use the information from cpu nodes to infer the cache topology, >> the memory topology, the power domain topology etc. >> >> We have of_find_next_cache_node and of_find_last_cache_level for >> example >> that use "next-level-cache" and are used on powerpc (where this spec >> comes >> from), that we can further build on top of them (since this is now >> part >> of the device tree spec anyway), to e.g. 
populate the levels of >> cache, >> the levels of shared cache and finally create cpu masks for the >> different >> cache sharing levels. > > The cpu-map binding does not describe cache topology. I never suggested > that it did. > Sorry for the misunderstanding Mark ! On ARM the CPU topology is used for configuring the scheduling domain topology (arch/arm/kernel/topology.c) and blindly sets the SD_SHARE_PKG_RESOURCES and SD_SHARE_POWERDOMAIN flags for a core without taking into account the cache hierarchy (hence validating that SD_SHARE_PKG_RESOURCES is correct) or the power domains (but ok, I doubt it's possible to have two threads on the same core that are powered independently). It also supports only two scheduling domains, one for SMT and another one for MC (hence my comments on multiple levels of shared resources not being supported). Since according to the docs cpu-map is a binding common between ARM and ARM64 for describing the CPU topology, I assumed it was a part of that process. Turns out it's only used on ARM64, where set_sched_topology() is not used for configuring the scheduling domain topology (there is a commend there that implies that you pass on the clusters to the scheduler that also led me to believe the issue is also present there). So my assumption was wrong and cpu-map has nothing to do with configuring the scheduler and the issues I mentioned above. >> This is standards-compliant, arch-independent (they are on of/base.c), >> and >> it provides a concrete hint on CPU topology rather grouping harts to >> classes >> like "core", "package", "socket", "cluster" etc that don't mean >> anything >> specific and we assume to map to specific levels of cache sharing. > > Please note that I have never suggested "package", and I'm not sure > what > that's in response to. > https://lkml.org/lkml/2018/11/7/918 [...] >> In a sense it goes against your rule of "re-inventing them for taste >> reasons alone is more costly to the ecosystem than simply using the >> existing bindings", I fail to see anything else than "taste reasons" >> when it comes to cpu-map, since there are existing bindings for >> inferring the CPU topology already, they are just not that convenient. > > If you believe that's the case, then you can perfectly use the existing > cache and NUMA bindings, and not describe the cpu topology at all, no > need to change cpu-map or come up with a new binding. > The problem is that right now nobody is using the generic bindings for configuring the scheduler domain topology. Only the numa bindings are used as expected on ARM64. Since cpu-map is only there for representing the cpu topology/layout, allowing them to exist there is obviously not the right approach. >> I'm proposing to add the option (not enforce) of adding those >> attributes that are meant to be used on cpu nodes, on the various >> groups/classes of the cpu-map, this way cpu-map will provide something >> more meaningful and at least improve the representation side of >> things. On the implementation side it might save us the burden of >> infering the topology from parsing all cpu nodes each time. It's also >> backwards compatible with what you already have, the only thing that's >> not backwards compatible is the removal of the <thread> node, which I >> don't think is such a big deal to fix. > > Sorry, but as I have stated repeatedly, this complicates the common > code > for what you admit is a matter of taste. 
I would rather maintain the > simplicity of this binding and existing parsing code, and not > potentially break other parsers, if the only cost is a few nodes which > you personally consider to be redundant. > [...] >> So we agree, the "core" doesn't say anything useful regarding the >> topology, why keep using this distinction on a binding for RISC-V and >> possibly other archs (since we are also talking on an arch-independent >> approach) ? > > We keep it because the binding has already been finalised and is use in > practice. The existing binding and parsing code is *sufficient* for > your > needs. Personal taste and minor savings in the number of nodes in a DT > are not compelling reasons to change this when the cost is complexity > that we have to maintain *forever*. > Totally fine with that, as I mentioned above I pointed these out since I consider it an improvement. Forgive me if I gave you the impression that was a deal-breaker or something. Regards, Nick
On Fri, Nov 09, 2018 at 04:36:19AM +0200, Nick Kossifidis wrote: > Στις 2018-11-08 18:48, Sudeep Holla έγραψε: [...] > we can keep it as is and mandate the <thread> nodes only for ARM64 as you > suggested. > Fine by me. [...] > > We have reasons why we can't assume information about cache or power > > domain topology from CPU topology. I have summarised them already and > > we are not discussing. > > But that's what you do on arch/arm/kernel/topology.c, you assume > information on power domain and cache topology based on the CPU topology > when you define the scheduling domain topology levels. > NO, we don't assume any knowledge about power domain and cache topology based on the CPU topology. We have updated scheduling domains for ACPI based systems and plan to do the same for DT. The current code may not be optimal. > PowerPC also does the same on arch/powerpc/kernel/smp.c and basically if > you track the usage of struct sched_domain_topology_level you'll see that > every arch that uses it to instruct the scheduler about the scheduling > domains, does it based on its CPU topology, which I think we both agree > is wrong. > Again, that's arch specific, so I can't comment on anything other than ARM. > > There may not be perfect bindings but there are > > already supported and nothing is broken here to fix. DT bindings are > > *not* same as code to fix it with a patch to the bindings themselves. > > Once agreed and merged, they need to be treated like user ABI. > > > > The only use of the devicetree spec bindings for caches if I'm not > mistaken are used on your cacheinfo driver for exporting them to userspace > through sysfs (thank you for cleaning this up btw). Even if we have the > cache topology using the generic bindings, we are not doing anything with > that information but export it to userspace. > OK, we may use LLC for sched domains in future, we do for ACPI systems. > As for the power domain bindings, we have drivers/base/power/domain.c that > handles the generic bindings and it's being used by some drivers to put > devices to power domains (something used by the pm code), but from a quick > search it's not used for cpu nodes and I didn't see this info being used > for providing hints to the scheduler either. Even with the generic power > domain bindings in place, for CPU idle states, on ARM we have another > binding that's basically the same with the addition of "wakeup-latency-us". > OK, it's just an extension. > For having different capacity there is no generic binding but I guess we > could use ARM's. > Better. > Of the generic bindings that relate to the scheduler's behavior, only the > numa binding is used as expected and only by ARM64, powerpc and sparc use > their own stuff. > Yes, PPC and SPARC has lots of Open Firmware base which is different from DT. > So there may be nothing broken regarding the generic / already existing > bindings, but their support needs fixing. The way they are now we can't > use them on RISC-V for configuring the scheduler and supporting SMT and > MC properly + I'm not in favor of using CPU topology blindly as the > other archs do. > Sure, if you want to improve support for the existing DT bindings that's fine. Happy to see the patches. DT bindings can be generic, you can still interpret and manage the sched domains in the best way for RISC-V. [...] > I mentioned the script as a transitioning method, but you are right, it goes > both ways. > Thanks :) > But I never said that's "the only solution and everything else is broken" ! 
> I'm just proposing an extension to cpu-map since Mark suggested that it's > possible. My goal is to try and consolidate cpu-map with what Atish proposed > for RISC-V (so that we can use the same code) and point out some issues > on how we use and define the CPU topology. > Good, we are in agreement here. > ACK, thanks a lot for your time and the discussion so far, I really > appreciate it. > No worries, glad if it helps. -- Regards, Sudeep