[GIT,PULL,1/7] soc/tegra: Changes for v5.20-rc1

Message ID 20220708185608.676474-2-thierry.reding@gmail.com
State New
Series NVIDIA Tegra changes for v5.20-rc1

Pull-request

git://git.kernel.org/pub/scm/linux/kernel/git/tegra/linux.git tags/tegra-for-5.20-soc

Message

Thierry Reding July 8, 2022, 6:56 p.m. UTC
Hi ARM SoC maintainers,

The following changes since commit f2906aa863381afb0015a9eb7fefad885d4e5a56:

  Linux 5.19-rc1 (2022-06-05 17:18:54 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/tegra/linux.git tags/tegra-for-5.20-soc

for you to fetch changes up to 4773d1c739e22101a92f89c0ae0983190ddbe112:

  soc/tegra: fuse: Add missing of_node_put() (2022-07-08 17:27:26 +0200)

Thanks,
Thierry

----------------------------------------------------------------
soc/tegra: Changes for v5.20-rc1

The bulk of these changes is the new CBB driver which is used to provide
(a lot of) information about SErrors when things go wrong, instead of
the kernel just crashing or hanging.

In addition more SoC information is exposed to sysfs and various minor
issues are fixed.

----------------------------------------------------------------
Bitan Biswas (1):
      soc/tegra: fuse: Expose Tegra production status

Liang He (2):
      soc/tegra: fuse: Add missing of_node_put() in tegra_init_fuse()
      soc/tegra: fuse: Add missing of_node_put()

Sumit Gupta (4):
      soc/tegra: Set ERD bit to mask inband errors
      soc/tegra: cbb: Add CBB 1.0 driver for Tegra194
      soc/tegra: cbb: Add driver for Tegra234 CBB 2.0
      soc/tegra: cbb: Add support for Tegra241 (Grace)

YueHaibing (1):
      soc/tegra: fuse: Add missing DMADEVICES dependency

 drivers/soc/tegra/Kconfig              |   11 +-
 drivers/soc/tegra/Makefile             |    1 +
 drivers/soc/tegra/cbb/Makefile         |    9 +
 drivers/soc/tegra/cbb/tegra-cbb.c      |  190 +++
 drivers/soc/tegra/cbb/tegra194-cbb.c   | 2365 ++++++++++++++++++++++++++++++++
 drivers/soc/tegra/cbb/tegra234-cbb.c   | 1114 +++++++++++++++
 drivers/soc/tegra/fuse/fuse-tegra.c    |   16 +
 drivers/soc/tegra/fuse/tegra-apbmisc.c |   36 +-
 include/soc/tegra/fuse.h               |    7 +
 include/soc/tegra/tegra-cbb.h          |   47 +
 10 files changed, 3791 insertions(+), 5 deletions(-)
 create mode 100644 drivers/soc/tegra/cbb/Makefile
 create mode 100644 drivers/soc/tegra/cbb/tegra-cbb.c
 create mode 100644 drivers/soc/tegra/cbb/tegra194-cbb.c
 create mode 100644 drivers/soc/tegra/cbb/tegra234-cbb.c
 create mode 100644 include/soc/tegra/tegra-cbb.h

Comments

Arnd Bergmann July 12, 2022, 1:27 p.m. UTC | #1
On Fri, Jul 8, 2022 at 8:56 PM Thierry Reding <thierry.reding@gmail.com> wrote:
>   git://git.kernel.org/pub/scm/linux/kernel/git/tegra/linux.git tags/tegra-for-5.20-soc
...
> ----------------------------------------------------------------
> soc/tegra: Changes for v5.20-rc1
>
> The bulk of these changes is the new CBB driver which is used to provide
> (a lot of) information about SErrors when things go wrong, instead of
> the kernel just crashing or hanging.
>
> In addition more SoC information is exposed to sysfs and various minor
> issues are fixed.
>

Hi Thierry,

I fear I'm going to skip this for the current merge window. It looks like
the CBB driver you add here would fit into the existing drivers/edac/
subsystem, or at the minimum should have been reviewed by the
corresponding maintainers (added to Cc)  to decide whether it goes
there or not.

I had not previously seen this driver, but I'll let them have a look first.

For the other patches, I found two more problems:

> Bitan Biswas (1):
>       soc/tegra: fuse: Expose Tegra production status

Please don't just add random attributes in the soc device infrastructure.
This one has a completely generic name but a SoC specific
meaning, and it lacks a description in Documentation/ABI.
Not sure what the right ABI is here, but this is something that needs
to be discussed more broadly when you send a new version.

I see there are already some custom attributes in the same device,
we should probably not have added those either, but I suppose
we are stuck with those, so please add the missing documentation.

> YueHaibing (1):
>      soc/tegra: fuse: Add missing DMADEVICES dependency

This one fixes the warning the wrong way: we don't 'select' random
drivers from other subsystems, and selecting the entire
subsystem makes it worse. Just drop the 'select' here and
enable the drivers in the defconfig.

         Arnd
Thierry Reding July 13, 2022, 10:58 a.m. UTC | #2
On Tue, Jul 12, 2022 at 03:27:16PM +0200, Arnd Bergmann wrote:
> On Fri, Jul 8, 2022 at 8:56 PM Thierry Reding <thierry.reding@gmail.com> wrote:
> >   git://git.kernel.org/pub/scm/linux/kernel/git/tegra/linux.git tags/tegra-for-5.20-soc
> ...
> > ----------------------------------------------------------------
> > soc/tegra: Changes for v5.20-rc1
> >
> > The bulk of these changes is the new CBB driver which is used to provide
> > (a lot of) information about SErrors when things go wrong, instead of
> > the kernel just crashing or hanging.
> >
> > In addition more SoC information is exposed to sysfs and various minor
> > issues are fixed.
> >
> 
> Hi Thierry,
> 
> I fear I'm going to skip this for the current merge window. It looks like
> the CBB driver you add here would fit into the existing drivers/edac/
> subsystem, or at the minimum should have been reviewed by the
> corresponding maintainers (added to Cc)  to decide whether it goes
> there or not.
> 
> I had not previously seen this driver, but I'll let them have a look first.

EDAC looks like it's used primarily for memory controllers, which this
is not. But then I also see explicit references to non-memory-controller
hardware in the infrastructure, so perhaps this does fit in there. The
CBB driver is primarily a means to provide additional information about
runtime errors, so it's not directly a means of discovering the errors
(they would be detected anyway and cause a crash) and I don't think we
have a means of correcting any of these errors.

I'll ask Sumit to work with the EDAC maintainers on this.

> For the other patches, I found two more problems:
> 
> > Bitan Biswas (1):
> >       soc/tegra: fuse: Expose Tegra production status
> 
> Please don't just add random attributes in the soc device infrastructure.
> This one has a completely generic name but a SoC specific
> meaning, and it lacks a description in Documentation/ABI.
> Not sure what the right ABI is here, but this is something that needs
> to be discussed more broadly when you send a new version.

I wasn't aware that the SoC device infrastructure was restricted to only
standardized attributes. Looks like there are a few other outliers that
add custom attributes: UX500, ARM Integrator and RealView, and OMAP2.

Do we have some other place where this kind of thing can be exposed? Or
do we just need to come up with some better way of namespacing these?
Perhaps it would also be sufficient if all of these were better
documented so that people know what to look for on their platform of
interest.

> I see there are already some custom attributes in the same device,
> we should probably not have added those either, but I suppose
> we are stuck with those, so please add the missing documentation.

Yeah, that's a good point. These should definitely be documented
properly.

> 
> > YueHaibing (1):
> >      soc/tegra: fuse: Add missing DMADEVICES dependency
> 
> This one fixes the warning the wrong way: we don't 'select' random
> drivers from other subsystems, and selecting the entire
> subsystem makes it worse. Just drop the 'select' here and
> enable the drivers in the defconfig.

This doesn't actually select the DMADEVICES property. It adds a
dependency on DMADEVICES and if that is met it will select
TEGRA20_APB_DMA.

Thierry
Arnd Bergmann July 13, 2022, 12:14 p.m. UTC | #3
On Wed, Jul 13, 2022 at 12:58 PM Thierry Reding
<thierry.reding@gmail.com> wrote:
> On Tue, Jul 12, 2022 at 03:27:16PM +0200, Arnd Bergmann wrote:
> > On Fri, Jul 8, 2022 at 8:56 PM Thierry Reding <thierry.reding@gmail.com> wrote:
> >
> > I fear I'm going to skip this for the current merge window. It looks like
> > the CBB driver you add here would fit into the existing drivers/edac/
> > subsystem, or at the minimum should have been reviewed by the
> > corresponding maintainers (added to Cc)  to decide whether it goes
> > there or not.
> >
> > I had not previously seen this driver, but I'll let them have a look first.
>
> EDAC looks like it's used primarily for memory controllers, which this
> is not. But then I also see explicit references to non-memory-controller
> hardware in the infrastructure, so perhaps this does fit in there. The
> CBB driver is primarily a means to provide additional information about
> runtime errors, so it's not directly a means of discovering the errors
> (they would be detected anyway and cause a crash) and I don't think we
> have a means of correcting any of these errors.

I think this is just a reflection of what other hardware can do:
most machines only detect memory errors, but the EDAC subsystem
can work with any type in principle. There are also a lot of
conditions elsewhere that can be detected but not corrected.

> I'll ask Sumit to work with the EDAC maintainers on this.

Thanks

> > For the other patches, I found two more problems:
> >
> > > Bitan Biswas (1):
> > >       soc/tegra: fuse: Expose Tegra production status
> >
> > Please don't just add random attributes in the soc device infrastructure.
> > This one has a completely generic name but a SoC specific
> > meaning, and it lacks a description in Documentation/ABI.
> > Not sure what the right ABI is here, but this is something that needs
> > to be discussed more broadly when you send a new version.
>
> I wasn't aware that the SoC device infrastructure was restricted to only
> standardized attributes. Looks like there are a few other outliers that
> add custom attributes: UX500, ARM Integrator and RealView, and OMAP2.
>
> Do we have some other place where this kind of thing can be exposed? Or
> do we just need to come up with some better way of namespacing these?
> Perhaps it would also be sufficient if all of these were better
> documented so that people know what to look for on their platform of
> interest.

It's not a 100% strict rule, I've just tried to limit it as much as possible,
and sometimes missed drivers doing it anyway. My main goal here is
to make things consistent between SoC families, so if one piece of
information is provided by a number of them, I'd rather have a standard
attribute, or a common way of encoding this in the existing attributes
than to have too many custom attributes with similar names.

> > > YueHaibing (1):
> > >      soc/tegra: fuse: Add missing DMADEVICES dependency
> >
> > This one fixes the warning the wrong way: we don't 'select' random
> > drivers from other subsystems, and selecting the entire
> > subsystem makes it worse. Just drop the 'select' here and
> > enable the drivers in the defconfig.
>
> This doesn't actually select the DMADEVICES property. It adds a
> dependency on DMADEVICES and if that is met it will select
> TEGRA20_APB_DMA.

My mistake. However, I still think it's wrong to select
TEGRA20_APB_DMA here, unless there is a build-time
dependency that prevents it from being compiled otherwise.

The dmaengine subsystem is meant to abstract the relation
between the drivers using DMA and those providing the feature,
the same way we abstract all the other subsystems. The
fuse driver may only be used on machines that use
TEGRA20_APB_DMA, but neither the driver code nor
Kconfig should care about that.

        Arnd
Jon Hunter July 13, 2022, 12:19 p.m. UTC | #4
On 13/07/2022 13:14, Arnd Bergmann wrote:

...

>>> For the other patches, I found two more problems:
>>>
>>>> Bitan Biswas (1):
>>>>        soc/tegra: fuse: Expose Tegra production status
>>>
>>> Please don't just add random attributes in the soc device infrastructure.
>>> This one has a completely generic name but a SoC specific
>>> meaning, and it lacks a description in Documentation/ABI.
>>> Not sure what the right ABI is here, but this is something that needs
>>> to be discussed more broadly when you send a new version.
>>
>> I wasn't aware that the SoC device infrastructure was restricted to only
>> standardized attributes. Looks like there are a few other outliers that
>> add custom attributes: UX500, ARM Integrator and RealView, and OMAP2.
>>
>> Do we have some other place where this kind of thing can be exposed? Or
>> do we just need to come up with some better way of namespacing these?
>> Perhaps it would also be sufficient if all of these were better
>> documented so that people know what to look for on their platform of
>> interest.
> 
> It's not a 100% strict rule, I've just tried to limit it as much as possible,
> and sometimes missed drivers doing it anyway. My main goal here is
> to make things consistent between SoC families, so if one piece of
> information is provided by a number of them, I'd rather have a standard
> attribute, or a common way of encoding this in the existing attributes
> than to have too many custom attributes with similar names.


Makes sense. Any recommendations for this specific attribute? I could 
imagine other vendors may have engineering devices and production 
versions. This is slightly different from the silicon version.

Cheers
Jon
Arnd Bergmann July 13, 2022, 12:36 p.m. UTC | #5
On Wed, Jul 13, 2022 at 2:19 PM Jon Hunter <jonathanh@nvidia.com> wrote:
> On 13/07/2022 13:14, Arnd Bergmann wrote:
> >>> For the other patches, I found two more problems:
> >>>
> >>>> Bitan Biswas (1):
> >>>>        soc/tegra: fuse: Expose Tegra production status
> >>>
> >>> Please don't just add random attributes in the soc device infrastructure.
> >>> This one has a completely generic name but a SoC specific
> >>> meaning, and it lacks a description in Documentation/ABI.
> >>> Not sure what the right ABI is here, but this is something that needs
> >>> to be discussed more broadly when you send a new version.
> >>
> >> I wasn't aware that the SoC device infrastructure was restricted to only
> >> standardized attributes. Looks like there are a few other outliers that
> >> add custom attributes: UX500, ARM Integrator and RealView, and OMAP2.
> >>
> >> Do we have some other place where this kind of thing can be exposed? Or
> >> do we just need to come up with some better way of namespacing these?
> >> Perhaps it would also be sufficient if all of these were better
> >> documented so that people know what to look for on their platform of
> >> interest.
> >
> > It's not a 100% strict rule, I've just tried to limit it as much as possible,
> > and sometimes missed drivers doing it anyway. My main goal here is
> > to make things consistent between SoC families, so if one piece of
> > information is provided by a number of them, I'd rather have a standard
> > attribute, or a common way of encoding this in the existing attributes
> > than to have too many custom attributes with similar names.
>
>
> Makes sense. Any recommendations for this specific attribute? I could
> imagine other vendors may have engineering devices and production
> versions. This is slightly different from the silicon version.

Not sure, I haven't seen this one referenced elsewhere so far.

What is the actual information this encodes in your case? Is this fused
down in a way that production devices lose access to certain features
that could be security critical but are useful for development?

         Arnd
Thierry Reding July 13, 2022, 8:22 p.m. UTC | #6
On Wed, Jul 13, 2022 at 02:14:27PM +0200, Arnd Bergmann wrote:
> On Wed, Jul 13, 2022 at 12:58 PM Thierry Reding
> <thierry.reding@gmail.com> wrote:
> > On Tue, Jul 12, 2022 at 03:27:16PM +0200, Arnd Bergmann wrote:
> > > On Fri, Jul 8, 2022 at 8:56 PM Thierry Reding <thierry.reding@gmail.com> wrote:
> > >
> > > I fear I'm going to skip this for the current merge window. It looks like
> > > the CBB driver you add here would fit into the existing drivers/edac/
> > > subsystem, or at the minimum should have been reviewed by the
> > > corresponding maintainers (added to Cc)  to decide whether it goes
> > > there or not.
> > >
> > > I had not previously seen this driver, but I'll let them have a look first.
> >
> > EDAC looks like it's used primarily for memory controllers, which this
> > is not. But then I also see explicit references to non-memory-controller
> > hardware in the infrastructure, so perhaps this does fit in there. The
> > CBB driver is primarily a means to provide additional information about
> > runtime errors, so it's not directly a means of discovering the errors
> > (they would be detected anyway and cause a crash) and I don't think we
> > have a means of correcting any of these errors.
> 
> I think this is just a reflection of what other hardware can do:
> most machines only detect memory errors, but the EDAC subsystem
> can work with any type in principle. There are also a lot of
> conditions elsewhere that can be detected but not corrected.
> 
> > I'll ask Sumit to work with the EDAC maintainers on this.
> 
> Thanks
> 
> > > For the other patches, I found two more problems:
> > >
> > > > Bitan Biswas (1):
> > > >       soc/tegra: fuse: Expose Tegra production status
> > >
> > > Please don't just add random attributes in the soc device infrastructure.
> > > This one has a completely generic name but a SoC specific
> > > meaning, and it lacks a description in Documentation/ABI.
> > > Not sure what the right ABI is here, but this is something that needs
> > > to be discussed more broadly when you send a new version.
> >
> > I wasn't aware that the SoC device infrastructure was restricted to only
> > standardized attributes. Looks like there are a few other outliers that
> > add custom attributes: UX500, ARM Integrator and RealView, and OMAP2.
> >
> > Do we have some other place where this kind of thing can be exposed? Or
> > do we just need to come up with some better way of namespacing these?
> > Perhaps it would also be sufficient if all of these were better
> > documented so that people know what to look for on their platform of
> > interest.
> 
> It's not a 100% strict rule, I've just tried to limit it as much as possible,
> and sometimes missed drivers doing it anyway. My main goal here is
> to make things consistent between SoC families, so if one piece of
> information is provided by a number of them, I'd rather have a standard
> attribute, or a common way of encoding this in the existing attributes
> than to have too many custom attributes with similar names.

The major/minor attributes that we have on Tegra SoCs should be easy to
standardize. It seems like those could be fairly common. The other one
that we have is the "platform" one, which I suppose is not as easy to
standardize. I don't recall the exact details, but I think we're mostly
interested in whether or not the platform is simulation or silicon. The
exact simulation value is not something that userspace scripts will look
at, as far as I recall.
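
For illustration, a userspace script could collect the attributes of the SoC device roughly as sketched below. This is only a sketch: it assumes the device shows up as /sys/devices/soc0, the helper name read_soc_attributes is made up for this example, and it deliberately does not interpret the values, since the meaning of the "platform" numbers is not spelled out in this thread.

```python
import os


def read_soc_attributes(soc_dir="/sys/devices/soc0"):
    """Return {attribute name: value} for the regular files in soc_dir.

    The default path is an assumption; pass another directory if the
    SoC device is registered under a different index.
    """
    attrs = {}
    for name in sorted(os.listdir(soc_dir)):
        path = os.path.join(soc_dir, name)
        # Skip subdirectories and links to directories (power/, subsystem).
        if not os.path.isfile(path):
            continue
        try:
            with open(path) as f:
                attrs[name] = f.read().strip()
        except OSError:
            # Some sysfs files are write-only or restricted; ignore them.
            continue
    return attrs


if __name__ == "__main__" and os.path.isdir("/sys/devices/soc0"):
    for key, value in read_soc_attributes().items():
        print(f"{key}: {value}")
```

A script that only cares about "simulation or silicon" would compare a single entry of the returned dictionary against whatever value the platform documents for silicon.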

Jon, correct me if I'm wrong.

Perhaps this can be deprecated in favour of a more standardized property
that can more easily be implemented on other SoCs.

The production mode is something that is read from a fuse and we expose
those via the nvmem subsystem already. Currently nvmem exposes only a
binary attribute in sysfs that userspace would need to parse and ideally
we'd have something a little easier to work with, but perhaps nvmem can
be enhanced to expose individual cells as separate attributes in some
standard format. We also have some other values in the fuses that we
want to make available to userspace (IDs and that sort of thing), so
it's good that you noticed this now before we would've added even more.
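
To make the "userspace would need to parse" point concrete, here is a rough sketch of what such parsing looks like today. The function names, the byte offset, the bit position and the little-endian word layout are all assumptions for the sake of the example; none of them are taken from the Tegra fuse map.

```python
def read_cell(nvmem_path, offset, length):
    """Read `length` bytes at `offset` from a binary nvmem sysfs file."""
    with open(nvmem_path, "rb") as f:
        f.seek(offset)
        data = f.read(length)
    if len(data) != length:
        raise ValueError("cell extends past the end of the nvmem file")
    return data


def read_bit(nvmem_path, offset, bit):
    """Return one bit of the 32-bit word at `offset`.

    Assumes the word is stored little-endian, which is an assumption
    of this sketch and not something the thread states.
    """
    word = int.from_bytes(read_cell(nvmem_path, offset, 4), "little")
    return (word >> bit) & 1


# Hypothetical usage; the device name, offset and bit are placeholders:
#   read_bit("/sys/bus/nvmem/devices/<device>/nvmem", 0x100, 0)
```

Every consumer has to hard-code the offset and layout this way, which is the burden that per-cell attributes in a standard format would remove.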

> > > > YueHaibing (1):
> > > >      soc/tegra: fuse: Add missing DMADEVICES dependency
> > >
> > > This one fixes the warning the wrong way: we don't 'select' random
> > > drivers from other subsystems, and selecting the entire
> > > subsystem makes it worse. Just drop the 'select' here and
> > > enable the drivers in the defconfig.
> >
> > This doesn't actually select the DMADEVICES property. It adds a
> > dependency on DMADEVICES and if that is met it will select
> > TEGRA20_APB_DMA.
> 
> My mistake. However, I still think it's wrong to select
> TEGRA20_APB_DMA here, unless there is a build-time
> dependency that prevents it from being compiled otherwise.
> 
> The dmaengine subsystem is meant to abstract the relation
> between the drivers using DMA and those providing the feature,
> the same way we abstract all the other subsystems. The
> fuse driver may only be used on machines that use
> TEGRA20_APB_DMA, but neither the driver code nor
> Kconfig should care about that.

This dependency has existed for quite a while and my recollection is
that we wanted to make this very explicit because the lack of the
TEGRA20_APB_DMA driver makes the FUSE driver completely useless on
Tegra20 and that in turn has a very negative impact on the rest of the
system, so we deemed a default configuration change insufficient.

Perhaps a better way to solve this would be to make TEGRA20_APB_DMA
default to "y" if ARCH_TEGRA_2x_SOC. And then perhaps make the FUSE
driver depend on DMADEVICES. That still wouldn't ensure that we get
SOC_TEGRA_FUSE enabled automatically all the time, but perhaps it'd
document the dependency a bit more explicitly.

Thierry
Jon Hunter July 14, 2022, 6:30 a.m. UTC | #7
On 13/07/2022 21:22, Thierry Reding wrote:

...

>>>>> Bitan Biswas (1):
>>>>>        soc/tegra: fuse: Expose Tegra production status
>>>>
>>>> Please don't just add random attributes in the soc device infrastructure.
>>>> This one has a completely generic name but a SoC specific
>>>> meaning, and it lacks a description in Documentation/ABI.
>>>> Not sure what the right ABI is here, but this is something that needs
>>>> to be discussed more broadly when you send a new version.
>>>
>>> I wasn't aware that the SoC device infrastructure was restricted to only
>>> standardized attributes. Looks like there are a few other outliers that
>>> add custom attributes: UX500, ARM Integrator and RealView, and OMAP2.
>>>
>>> Do we have some other place where this kind of thing can be exposed? Or
>>> do we just need to come up with some better way of namespacing these?
>>> Perhaps it would also be sufficient if all of these were better
>>> documented so that people know what to look for on their platform of
>>> interest.
>>
>> It's not a 100% strict rule, I've just tried to limit it as much as possible,
>> and sometimes missed drivers doing it anyway. My main goal here is
>> to make things consistent between SoC families, so if one piece of
>> information is provided by a number of them, I'd rather have a standard
>> attribute, or a common way of encoding this in the existing attributes
>> than to have too many custom attributes with similar names.
> 
> The major/minor attributes that we have on Tegra SoCs should be easy to
> standardize. It seems like those could be fairly common. The other one
> that we have is the "platform" one, which I suppose is not as easy to
> standardize. I don't recall the exact details, but I think we're mostly
> interested in whether or not the platform is simulation or silicon. The
> exact simulation value is not something that userspace scripts will look
> at, as far as I recall.
> 
> Jon, correct me if I'm wrong.

There are a few different simulation types and I have seen some userspace 
code convert the value and display the actual type. However, in reality 
I am not sure how much this is used, but yes at least identifying that 
this is silicon is used widely from what I have seen.

Jon
Jon Hunter July 14, 2022, 6:49 a.m. UTC | #8
On 13/07/2022 13:36, Arnd Bergmann wrote:
> On Wed, Jul 13, 2022 at 2:19 PM Jon Hunter <jonathanh@nvidia.com> wrote:
>> On 13/07/2022 13:14, Arnd Bergmann wrote:
>>>>> For the other patches, I found two more problems:
>>>>>
>>>>>> Bitan Biswas (1):
>>>>>>         soc/tegra: fuse: Expose Tegra production status
>>>>>
>>>>> Please don't just add random attributes in the soc device infrastructure.
>>>>> This one has a completely generic name but a SoC specific
>>>>> meaning, and it lacks a description in Documentation/ABI.
>>>>> Not sure what the right ABI is here, but this is something that needs
>>>>> to be discussed more broadly when you send a new version.
>>>>
>>>> I wasn't aware that the SoC device infrastructure was restricted to only
>>>> standardized attributes. Looks like there are a few other outliers that
>>>> add custom attributes: UX500, ARM Integrator and RealView, and OMAP2.
>>>>
>>>> Do we have some other place where this kind of thing can be exposed? Or
>>>> do we just need to come up with some better way of namespacing these?
>>>> Perhaps it would also be sufficient if all of these were better
>>>> documented so that people know what to look for on their platform of
>>>> interest.
>>>
>>> It's not a 100% strict rule, I've just tried to limit it as much as possible,
>>> and sometimes missed drivers doing it anyway. My main goal here is
>>> to make things consistent between SoC families, so if one piece of
>>> information is provided by a number of them, I'd rather have a standard
>>> attribute, or a common way of encoding this in the existing attributes
>>> than to have too many custom attributes with similar names.
>>
>>
>> Makes sense. Any recommendations for this specific attribute? I could
>> imagine other vendors may have engineering devices and production
>> versions. This is slightly different from the silicon version.
> 
> Not sure, I haven't seen this one referenced elsewhere so far.
> 
> What is the actual information this encodes in your case? Is this fused
> down in a way that production devices lose access to certain features
> that could be security critical but are useful for development?

Yes I believe it is precisely that. Exact details I am not clear on, but 
I see a lot of references to this throughout our userspace and testing 
code.

Jon
Borislav Petkov July 14, 2022, 1:31 p.m. UTC | #9
On Wed, Jul 13, 2022 at 02:14:27PM +0200, Arnd Bergmann wrote:
> I think this is just a reflection of what other hardware can do:
> most machines only detect memory errors, but the EDAC subsystem
> can work with any type in principle. There are also a lot of
> conditions elsewhere that can be detected but not corrected.

Just a couple of thoughts from looking at this:

So the EDAC thing reports *hardware* errors by using the RAS
capabilities built into an IP block. So it started with memory
controllers but it is getting extended to other blocks. AMD are looking
at how to integrate GPU hw errors reporting into it, for example.

Looking at that CBB thing, it looks like it is supposed to report not
so much hardware errors but operational errors. Some of the hw errors
reported by RAS hw are also operation-related but not the majority.

Then, EDAC has these counters exposed in:

$ grep -r . /sys/devices/system/edac/
/sys/devices/system/edac/power/runtime_active_time:0
/sys/devices/system/edac/power/runtime_status:unsupported
/sys/devices/system/edac/power/runtime_suspended_time:0
/sys/devices/system/edac/power/control:auto
/sys/devices/system/edac/pci/edac_pci_log_pe:1
/sys/devices/system/edac/pci/pci0/pe_count:0
/sys/devices/system/edac/pci/pci0/npe_count:0
/sys/devices/system/edac/pci/pci_parity_count:0
/sys/devices/system/edac/pci/pci_nonparity_count:0
/sys/devices/system/edac/pci/edac_pci_log_npe:1
/sys/devices/system/edac/pci/edac_pci_panic_on_pe:0
/sys/devices/system/edac/pci/check_pci_errors:0
/sys/devices/system/edac/mc/power/runtime_active_time:0
/sys/devices/system/edac/mc/power/runtime_status:unsupported
...

with the respective hierarchy: memory controllers, PCI errors, etc.
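
As a rough illustration of how a monitoring script might consume that hierarchy, the sketch below walks a directory tree and collects every file whose name ends in "_count". The helper name collect_counts and the "_count" suffix filter are choices made for this example only; they simply mirror the file names in the listing above.

```python
import os


def collect_counts(root="/sys/devices/system/edac"):
    """Return {path relative to root: integer value} for *_count files."""
    counts = {}
    # os.walk does not follow symlinks by default, which avoids sysfs loops.
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith("_count"):
                continue
            path = os.path.join(dirpath, name)
            try:
                with open(path) as f:
                    value = int(f.read().strip())
            except (OSError, ValueError):
                # Unreadable or non-numeric files are skipped.
                continue
            counts[os.path.relpath(path, root)] = value
    return counts


if __name__ == "__main__" and os.path.isdir("/sys/devices/system/edac"):
    for rel_path, value in sorted(collect_counts().items()):
        print(f"{rel_path}: {value}")
```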

So the main question is, does it make sense for you to fit this into the
EDAC hierarchy and what would even be the advantage of making it part of
EDAC?

HTH.
Arnd Bergmann July 14, 2022, 2:45 p.m. UTC | #10
On Wed, Jul 13, 2022 at 10:22 PM Thierry Reding
<thierry.reding@gmail.com> wrote:
> On Wed, Jul 13, 2022 at 02:14:27PM +0200, Arnd Bergmann wrote:
> >
> > It's not a 100% strict rule, I've just tried to limit it as much as possible,
> > and sometimes missed drivers doing it anyway. My main goal here is
> > to make things consistent between SoC families, so if one piece of
> > information is provided by a number of them, I'd rather have a standard
> > attribute, or a common way of encoding this in the existing attributes
> > than to have too many custom attributes with similar names.
>
> The major/minor attributes that we have on Tegra SoCs should be easy to
> standardize. It seems like those could be fairly common.

I think these can just be folded into one of the other attributes, probably
either revision or soc_id depending on what they actually refer to.

These properties are intentionally free-text fields that you can match
using wildcards with the soc_device_match() function. If I read this
part right, the information is already available in the soc_id field,
so we don't even need to change anything here.
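
The wildcard matching can be pictured with the userspace analogy below. It is not the kernel implementation: it uses Python's fnmatch module, whose glob rules are similar to but not guaranteed identical with the kernel's, and the attribute values and helper names (soc_matches, first_match) are invented for the example.

```python
from fnmatch import fnmatchcase


def soc_matches(attrs, pattern):
    """True if every field named in `pattern` glob-matches `attrs`.

    Fields absent from `pattern` are ignored; a field that is named in
    `pattern` but missing from `attrs` counts as a mismatch.
    """
    for key, glob in pattern.items():
        value = attrs.get(key)
        if value is None or not fnmatchcase(value, glob):
            return False
    return True


def first_match(attrs, table):
    """Return the first pattern in `table` that matches, else None."""
    for pattern in table:
        if soc_matches(attrs, pattern):
            return pattern
    return None


# Hypothetical attribute values, not read from real hardware.
attrs = {"family": "Tegra", "soc_id": "35", "revision": "A02"}
table = [
    {"soc_id": "99"},                         # does not match
    {"family": "Tegra", "revision": "A0*"},   # matches any A0x revision
]
print(first_match(attrs, table))
```

Because the fields are free text, folding extra information into revision or soc_id keeps it matchable this way without adding a new attribute.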

> The other one
> that we have is the "platform" one, which I suppose is not as easy to
> standardize. I don't recall the exact details, but I think we're mostly
> interested in whether or not the platform is simulation or silicon. The
> exact simulation value is not something that userspace scripts will look
> at, as far as I recall.

This also looks like it's part of the chip_id.

> > > > > YueHaibing (1):
> > > > >      soc/tegra: fuse: Add missing DMADEVICES dependency
> > > >
> > > > This one fixes the warning the wrong way: we don't 'select' random
> > > > drivers from other subsystems, and selecting the entire
> > > > subsystem makes it worse. Just drop the 'select' here and
> > > > enable the drivers in the defconfig.
> > >
> > > This doesn't actually select the DMADEVICES property. It adds a
> > > dependency on DMADEVICES and if that is met it will select
> > > TEGRA20_APB_DMA.
> >
> > My mistake. However, I still think it's wrong to select
> > TEGRA20_APB_DMA here, unless there is a build-time
> > dependency that prevents it from being compiled otherwise.
> >
> > The dmaengine subsystem is meant to abstract the relation
> > between the drivers using DMA and those providing the feature,
> > the same way we abstract all the other subsystems. The
> > fuse driver may only be used on machines that use
> > TEGRA20_APB_DMA, but neither the driver code nor
> > Kconfig should care about that.
>
> This dependency has existed for quite a while and my recollection is
> that we wanted to make this very explicit because the lack of the
> TEGRA20_APB_DMA driver makes the FUSE driver completely useless on
> Tegra20 and that in turn has a very negative impact on the rest of the
> system, so we deemed a default configuration change insufficient.
>
> Perhaps a better way to solve this would be to make TEGRA20_APB_DMA
> default to "y" if ARCH_TEGRA_2x_SOC. And then perhaps make the FUSE
> driver depend on DMADEVICES. That still wouldn't ensure that we get
> SOC_TEGRA_FUSE enabled automatically all the time, but perhaps it'd
> document the dependency a bit more explicitly.

Ok, this sounds good to me.

          Arnd
Sumit Gupta July 15, 2022, 8:06 a.m. UTC | #11
Hi Arnd, Boris,

Thank you for your inputs.

>> I think this is just a reflection of what other hardware can do:
>> most machines only detect memory errors, but the EDAC subsystem
>> can work with any type in principle. There are also a lot of
>> conditions elsewhere that can be detected but not corrected.
> 
> Just a couple of thoughts from looking at this:
> 
> So the EDAC thing reports *hardware* errors by using the RAS
> capabilities built into an IP block. So it started with memory
> controllers but it is getting extended to other blocks. AMD are looking
> at how to integrate GPU hw errors reporting into it, for example.
> 
> Looking at that CBB thing, it looks like it is supposed to report not
> so much hardware errors but operational errors. Some of the hw errors
> reported by RAS hw are also operation-related but not the majority.
> 

The CBB driver reports errors caused by bad MMIO accesses from software.
The vast majority of CBB errors are programming errors in setting up
address windows, which lead to decode errors.

> Then, EDAC has these counters exposed in:
> 
> $ grep -r . /sys/devices/system/edac/
> /sys/devices/system/edac/power/runtime_active_time:0
> /sys/devices/system/edac/power/runtime_status:unsupported
> /sys/devices/system/edac/power/runtime_suspended_time:0
> /sys/devices/system/edac/power/control:auto
> /sys/devices/system/edac/pci/edac_pci_log_pe:1
> /sys/devices/system/edac/pci/pci0/pe_count:0
> /sys/devices/system/edac/pci/pci0/npe_count:0
> /sys/devices/system/edac/pci/pci_parity_count:0
> /sys/devices/system/edac/pci/pci_nonparity_count:0
> /sys/devices/system/edac/pci/edac_pci_log_npe:1
> /sys/devices/system/edac/pci/edac_pci_panic_on_pe:0
> /sys/devices/system/edac/pci/check_pci_errors:0
> /sys/devices/system/edac/mc/power/runtime_active_time:0
> /sys/devices/system/edac/mc/power/runtime_status:unsupported
> ...
> 
> with the respective hierarchy: memory controllers, PCI errors, etc.
> 
> So the main question is, does it make sense for you to fit this into the
> EDAC hierarchy and what would even be the advantage of making it part of
> EDAC?
> 

I also think this doesn't fit with the errors reported by EDAC,
which are mainly hardware errors, as Boris explained.
Please share your thoughts on whether we can merge the patches as they are.

> HTH.
> 
> --
> Regards/Gruss,
>      Boris.
> 
> https://people.kernel.org/tglx/notes-about-netiquette
Thierry Reding July 28, 2022, 5:34 p.m. UTC | #12
On Fri, Jul 15, 2022 at 01:36:16PM +0530, Sumit Gupta wrote:
> Hi Arnd, Boris,
> 
> Thank you for your inputs.
> 
> > > I think this is just a reflection of what other hardware can do:
> > > most machines only detect memory errors, but the EDAC subsystem
> > > can work with any type in principle. There are also a lot of
> > > conditions elsewhere that can be detected but not corrected.
> > 
> > Just a couple of thoughts from looking at this:
> > 
> > So the EDAC thing reports *hardware* errors by using the RAS
> > capabilities built into an IP block. So it started with memory
> > controllers but it is getting extended to other blocks. AMD are looking
> > at how to integrate GPU hw errors reporting into it, for example.
> > 
> > Looking at that CBB thing, it looks like it is supposed to report not
> > so much hardware errors but operational errors. Some of the hw errors
> > reported by RAS hw are also operation-related but not the majority.
> > 
> 
> The CBB driver reports errors caused by bad MMIO accesses from software.
> The vast majority of CBB errors are programming errors in setting up
> address windows, which lead to decode errors.
> 
> > Then, EDAC has these counters exposed in:
> > 
> > $ grep -r . /sys/devices/system/edac/
> > /sys/devices/system/edac/power/runtime_active_time:0
> > /sys/devices/system/edac/power/runtime_status:unsupported
> > /sys/devices/system/edac/power/runtime_suspended_time:0
> > /sys/devices/system/edac/power/control:auto
> > /sys/devices/system/edac/pci/edac_pci_log_pe:1
> > /sys/devices/system/edac/pci/pci0/pe_count:0
> > /sys/devices/system/edac/pci/pci0/npe_count:0
> > /sys/devices/system/edac/pci/pci_parity_count:0
> > /sys/devices/system/edac/pci/pci_nonparity_count:0
> > /sys/devices/system/edac/pci/edac_pci_log_npe:1
> > /sys/devices/system/edac/pci/edac_pci_panic_on_pe:0
> > /sys/devices/system/edac/pci/check_pci_errors:0
> > /sys/devices/system/edac/mc/power/runtime_active_time:0
> > /sys/devices/system/edac/mc/power/runtime_status:unsupported
> > ...
> > 
> > with the respective hierarchy: memory controllers, PCI errors, etc.
> > 
> > So the main question is, does it make sense for you to fit this into the
> > EDAC hierarchy and what would even be the advantage of making it part of
> > EDAC?
> > 
> 
> I also think this doesn't fit with the errors reported by EDAC,
> which are mainly hardware errors, as Boris explained.
> Please share your thoughts on whether we can merge the patches as they are.

Arnd,

any more thoughts on this? Looks like there is no consensus on where
this should go. If it's okay for this to go in via ARM SoC after all,
I could prepare another pull request including only the CBB changes
along with some of the reference count fixes. I could possibly also
rework the DMADEVICES dependency patch as discussed, or we could defer
it if it's too risky at this point.

Thierry
Sumit Gupta Aug. 22, 2022, 9:31 a.m. UTC | #13
> On Fri, Jul 15, 2022 at 01:36:16PM +0530, Sumit Gupta wrote:
>> Hi Arnd, Boris,
>>
>> Thank you for your inputs.
>>
>>>> I think this is just a reflection of what other hardware can do:
>>>> most machines only detect memory errors, but the EDAC subsystem
>>>> can work with any type in principle. There are also a lot of
>>>> conditions elsewhere that can be detected but not corrected.
>>> Just a couple of thoughts from looking at this:
>>>
>>> So the EDAC thing reports *hardware* errors by using the RAS
>>> capabilities built into an IP block. So it started with memory
>>> controllers but it is getting extended to other blocks. AMD are looking
>>> at how to integrate GPU hw errors reporting into it, for example.
>>>
>>> Looking at that CBB thing, it looks like it is supposed to report not
>>> so much hardware errors but operational errors. Some of the hw errors
>>> reported by RAS hw are also operation-related but not the majority.
>>>
>> The CBB driver reports errors caused by bad MMIO accesses from software.
>> The vast majority of CBB errors are programming errors in setting up
>> address windows, which lead to decode errors.
>>
>>> Then, EDAC has these counters exposed in:
>>>
>>> $ grep -r . /sys/devices/system/edac/
>>> /sys/devices/system/edac/power/runtime_active_time:0
>>> /sys/devices/system/edac/power/runtime_status:unsupported
>>> /sys/devices/system/edac/power/runtime_suspended_time:0
>>> /sys/devices/system/edac/power/control:auto
>>> /sys/devices/system/edac/pci/edac_pci_log_pe:1
>>> /sys/devices/system/edac/pci/pci0/pe_count:0
>>> /sys/devices/system/edac/pci/pci0/npe_count:0
>>> /sys/devices/system/edac/pci/pci_parity_count:0
>>> /sys/devices/system/edac/pci/pci_nonparity_count:0
>>> /sys/devices/system/edac/pci/edac_pci_log_npe:1
>>> /sys/devices/system/edac/pci/edac_pci_panic_on_pe:0
>>> /sys/devices/system/edac/pci/check_pci_errors:0
>>> /sys/devices/system/edac/mc/power/runtime_active_time:0
>>> /sys/devices/system/edac/mc/power/runtime_status:unsupported
>>> ...
>>>
>>> with the respective hierarchy: memory controllers, PCI errors, etc.
>>>
>>> So the main question is, does it make sense for you to fit this into the
>>> EDAC hierarchy and what would even be the advantage of making it part of
>>> EDAC?
>>>
>> I also think this doesn't fit with the errors reported by EDAC,
>> which are mainly hardware errors, as Boris explained.
>> Please share your thoughts on whether we can merge the patches as they are.
> Arnd,
> 
> any more thoughts on this? Looks like there is no consensus on where
> this should go. If it's okay for this to go in via ARM SoC after all,
> I could prepare another pull request including only the CBB changes
> along with some of the reference count fixes. I could possibly also
> rework the DMADEVICES dependency patch as discussed, or we could defer
> it if it's too risky at this point.
> 
> Thierry

Hi Arnd, Thierry,
Gentle ping.

If we are OK with the reasoning, can we please queue the patch
series for 6.1?

Thank you,
Sumit
Thierry Reding Sept. 27, 2022, 4 p.m. UTC | #14
On Thu, Jul 14, 2022 at 03:31:07PM +0200, Borislav Petkov wrote:
> On Wed, Jul 13, 2022 at 02:14:27PM +0200, Arnd Bergmann wrote:
> > I think this is just a reflection of what other hardware can do:
> > most machines only detect memory errors, but the EDAC subsystem
> > can work with any type in principle. There are also a lot of
> > conditions elsewhere that can be detected but not corrected.
> 
> Just a couple of thoughts from looking at this:
> 
> So the EDAC thing reports *hardware* errors by using the RAS
> capabilities built into an IP block. So it started with memory
> controllers but it is getting extended to other blocks. AMD are looking
> at how to integrate GPU hw errors reporting into it, for example.
> 
> Looking at that CBB thing, it looks like it is supposed to report not
> so much hardware errors but operational errors. Some of the hw errors
> reported by RAS hw are also operation-related but not the majority.
> 
> Then, EDAC has these counters exposed in:
> 
> $ grep -r . /sys/devices/system/edac/
> /sys/devices/system/edac/power/runtime_active_time:0
> /sys/devices/system/edac/power/runtime_status:unsupported
> /sys/devices/system/edac/power/runtime_suspended_time:0
> /sys/devices/system/edac/power/control:auto
> /sys/devices/system/edac/pci/edac_pci_log_pe:1
> /sys/devices/system/edac/pci/pci0/pe_count:0
> /sys/devices/system/edac/pci/pci0/npe_count:0
> /sys/devices/system/edac/pci/pci_parity_count:0
> /sys/devices/system/edac/pci/pci_nonparity_count:0
> /sys/devices/system/edac/pci/edac_pci_log_npe:1
> /sys/devices/system/edac/pci/edac_pci_panic_on_pe:0
> /sys/devices/system/edac/pci/check_pci_errors:0
> /sys/devices/system/edac/mc/power/runtime_active_time:0
> /sys/devices/system/edac/mc/power/runtime_status:unsupported
> ...
> 
> with the respective hierarchy: memory controllers, PCI errors, etc.
> 
> So the main question is, does it make sense for you to fit this into the
> EDAC hierarchy and what would even be the advantage of making it part of
> EDAC?

Closing the loop on this: we've decided to keep this in drivers/soc for
now, with the option of re-evaluating when we encounter similar
functionality on other hardware.

I'm also going to hijack the thread because something else came up
recently that fits the audience here and it's up the same alley: on
Tegra234 a mechanism, called FSI (Functional Safety Island), exists
to report failures to an external MCU that's monitoring the system.

Special hardware exists in the SoC that can send these errors to the
MCU via different transports, and the idea is to report software-
detected failures from kernel drivers such as I2C or PCI via this
mechanism, so appropriate action can be taken. So essentially we're
looking at adding some new API, preferably something generic, to these
bus drivers along with "provider" drivers that get notified of these
reports so that they can be forwarded to the FSI (and then the MCU).
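
To make the idea a bit more concrete, here is a rough userspace sketch of
the shape such an API could take. None of these names exist anywhere:
fault_report, fault_provider, fault_provider_register() and
fault_report_submit() are made up purely for illustration, and the
"fsi-mock" provider only prints the report instead of talking to any
hardware. A real implementation would at least need locking and would
probably build on an existing mechanism such as notifier chains.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical sources of software-detected failures. */
enum fault_source { FAULT_SRC_I2C, FAULT_SRC_PCI };

/* Hypothetical report handed from a bus driver to the providers. */
struct fault_report {
	enum fault_source source;
	unsigned int code;	/* driver-specific error code */
};

/* A "provider" forwards reports to some transport, e.g. towards the FSI. */
struct fault_provider {
	const char *name;
	int (*notify)(struct fault_provider *p, const struct fault_report *r);
	unsigned int forwarded;	/* number of reports accepted so far */
	struct fault_provider *next;
};

/* Singly linked list of registered providers (no locking in this sketch). */
static struct fault_provider *providers;

/* Add a provider to the head of the list. */
void fault_provider_register(struct fault_provider *p)
{
	p->next = providers;
	providers = p;
}

/* Called by a bus driver; returns how many providers accepted the report. */
int fault_report_submit(const struct fault_report *r)
{
	struct fault_provider *p;
	int delivered = 0;

	for (p = providers; p; p = p->next)
		if (p->notify(p, r) == 0)
			delivered++;

	return delivered;
}

/* Stand-in for a provider that would hand the report to the FSI. */
static int fsi_notify(struct fault_provider *p, const struct fault_report *r)
{
	printf("%s: source=%d code=%#x\n", p->name, (int)r->source, r->code);
	p->forwarded++;
	return 0;
}

struct fault_provider fsi_provider = {
	.name = "fsi-mock",
	.notify = fsi_notify,
};
```

In this sketch a bus driver would call fault_report_submit() from its
error path, and the return value tells it how many providers took the
report, so it can tell whether anybody is listening at all.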

This again doesn't seem to be a great fit for EDAC as it is today, but
I also can't find anything better when looking around the kernel. So I'm
wondering if this is something that others have encountered and might
have solved already and I just haven't found it, or if this is something
that would be worth creating a new subsystem for. Or perhaps this could
be integrated into EDAC somehow? I'm a bit reluctant to add yet another
custom infrastructure for this, given that it's functionality that
likely exists in other SoCs as well.

Any thoughts on this?

Thierry