
[RFC,net-next,00/19] Devlink health reporting and recovery system

Message ID 1546266733-9512-1-git-send-email-eranbe@mellanox.com

Message

Eran Ben Elisha Dec. 31, 2018, 2:31 p.m. UTC
The health mechanism targets real-time alerting, in order to know when
something bad has happened to a PCI device, and to:
- Provide alert debug information
- Self heal
- If a problem needs vendor support, provide a way to gather all needed
  debugging information.

The main idea is to unify and centralize driver health reports in the
generic devlink instance and allow the user to set different
attributes of the health reporting and recovery procedures.

The devlink health reporter:
The device driver creates a "health reporter" for each error/health type.
An error/health type can be known/generic (e.g. PCI error, FW error, RX/TX
error) or unknown (driver specific).
For each registered health reporter, a driver can issue error/health reports
asynchronously. All health report handling is done by devlink.
The device driver can provide specific callbacks for each "health reporter",
e.g.:
 - Recovery procedures
 - Diagnostics and object dump procedures
 - OOB initial attributes
Different parts of the driver can register different types of health reporters
with different handlers.

Once an error is reported, devlink health performs the following actions:
  * A log is sent to the kernel trace events buffer
  * Health status and statistics are updated for the reporter instance
  * An object dump is taken and saved at the reporter instance (as long as
    no other objdump is already stored)
  * An auto recovery attempt is made, depending on:
    - The auto-recovery configuration
    - The grace period vs. the time passed since the last recovery

The user interface:
The user can access/change each reporter's attributes and driver-specific
callbacks via devlink, per error type (per health reporter), e.g.:
 - Configure the reporter's generic attributes (e.g. disable/enable auto
   recovery)
 - Invoke recovery procedure
 - Run diagnostics
 - Object dump

The devlink health interface (via netlink):
DEVLINK_CMD_HEALTH_REPORTER_GET
  Retrieves status and configuration info per DEV and reporter.
DEVLINK_CMD_HEALTH_REPORTER_SET
  Allows reporter-related configuration setting.
DEVLINK_CMD_HEALTH_REPORTER_RECOVER
  Triggers a reporter's recovery procedure.
DEVLINK_CMD_HEALTH_REPORTER_DIAGNOSE
  Retrieves diagnostics data from a reporter on a device.
DEVLINK_CMD_HEALTH_REPORTER_OBJDUMP_GET
  Retrieves the last stored objdump. Devlink health
  saves a single objdump. If an objdump is not already stored by devlink
  for this reporter, devlink generates a new objdump.
  Objdump output is defined by the reporter.
DEVLINK_CMD_HEALTH_REPORTER_OBJDUMP_CLEAR
  Clears the last saved objdump file for the specified reporter.


                                               netlink
                                      +--------------------------+
                                      |                          |
                                      |            +             |
                                      |            |             |
                                      +--------------------------+
                                                   |request for ops
                                                   |(diagnose,
 mlx5_core                             devlink     |recover,
                                                   |dump)
+--------+                            +--------------------------+
|        |                            |    reporter|             |
|        |                            |  +---------v----------+  |
|        |   ops execution            |  |                    |  |
|     <----------------------------------+                    |  |
|        |                            |  |                    |  |
|        |                            |  + ^------------------+  |
|        |                            |    | request for ops     |
|        |                            |    | (recover, dump)     |
|        |                            |    |                     |
|        |                            |  +-+------------------+  |
|        |     health report          |  | health handler     |  |
|        +------------------------------->                    |  |
|        |                            |  +--------------------+  |
|        |     health reporter create |                          |
|        +---------------------------->                          |
+--------+                            +--------------------------+

Available reporters:
In this patchset, three reporters of the mlx5e driver are included. The FW
reporters implement devlink_health_reporter diagnostic, object dump and
recovery procedures. The TX reporter implements devlink_health_reporter
diagnostic and recovery procedures.

This RFC is based on the same concepts as a previous RFC I sent earlier
this year: "[RFC PATCH iproute2-next] System specification health API".
The API has changed, and the devlink code and mlx5e reporters were not
part of the previous RFC.

Aya Levin (1):
  devlink: Add Documentation/networking/devlink-health.txt

Eran Ben Elisha (11):
  devlink: Add health buffer support
  devlink: Add health reporter create/destroy functionality
  devlink: Add health report functionality
  devlink: Add health get command
  devlink: Add health set command
  devlink: Add health recover command
  devlink: Add health diagnose command
  devlink: Add health objdump {get,clear} commands
  net/mlx5e: Add TX reporter support
  net/mlx5e: Add TX timeout support for mlx5e TX reporter
  net/mlx5: Move all devlink related functions calls to devlink.c

Moshe Shemesh (7):
  net/mlx5: Refactor print health info
  net/mlx5: Create FW devlink_health_reporter
  net/mlx5: Add core dump register access functions
  net/mlx5: Add support for FW reporter objdump
  net/mlx5: Report devlink health on FW issues
  net/mlx5: Add FW fatal devlink_health_reporter
  net/mlx5: Report devlink health on FW fatal issues

 Documentation/networking/devlink-health.txt   |   86 ++
 .../net/ethernet/mellanox/mlx5/core/Makefile  |    4 +-
 .../net/ethernet/mellanox/mlx5/core/devlink.c |  310 +++++
 .../net/ethernet/mellanox/mlx5/core/devlink.h |   22 +
 .../mellanox/mlx5/core/diag/fw_tracer.c       |   75 ++
 .../mellanox/mlx5/core/diag/fw_tracer.h       |   13 +
 drivers/net/ethernet/mellanox/mlx5/core/en.h  |   18 +-
 .../ethernet/mellanox/mlx5/core/en/reporter.h |   15 +
 .../mellanox/mlx5/core/en/reporter_tx.c       |  356 ++++++
 .../net/ethernet/mellanox/mlx5/core/en_main.c |  186 +--
 .../net/ethernet/mellanox/mlx5/core/en_tx.c   |    2 +-
 .../net/ethernet/mellanox/mlx5/core/health.c  |   79 +-
 .../net/ethernet/mellanox/mlx5/core/main.c    |   10 +-
 .../ethernet/mellanox/mlx5/core/mlx5_core.h   |    7 +
 include/linux/mlx5/device.h                   |    1 +
 include/linux/mlx5/driver.h                   |    5 +
 include/linux/mlx5/mlx5_ifc.h                 |   21 +-
 include/net/devlink.h                         |  142 +++
 include/trace/events/devlink.h                |   62 +
 include/uapi/linux/devlink.h                  |   25 +
 net/core/devlink.c                            | 1037 +++++++++++++++++
 21 files changed, 2265 insertions(+), 211 deletions(-)
 create mode 100644 Documentation/networking/devlink-health.txt
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/devlink.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/devlink.h
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en/reporter.h
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en/reporter_tx.c

Comments

Jakub Kicinski Jan. 1, 2019, 1:47 a.m. UTC | #1
On Mon, 31 Dec 2018 16:31:54 +0200, Eran Ben Elisha wrote:
> The health mechanism is targeted for Real Time Alerting, in order to know when
> something bad had happened to a PCI device
> - Provide alert debug information
> - Self healing
> - If problem needs vendor support, provide a way to gather all needed debugging
>   information.
> 
> The main idea is to unify and centralize driver health reports in the
> generic devlink instance and allow the user to set different
> attributes of the health reporting and recovery procedures.
> 
> The devlink health reporter:
> Device driver creates a "health reporter" per each error/health type.
> Error/Health type can be a known/generic (eg pci error, fw error, rx/tx error)
> or unknown (driver specific).
> For each registered health reporter a driver can issue error/health reports
> asynchronously. All health reports handling is done by devlink.
> Device driver can provide specific callbacks for each "health reporter", e.g.
>  - Recovery procedures
>  - Diagnostics and object dump procedures
>  - OOB initial attributes
> Different parts of the driver can register different types of health reporters
> with different handlers.
> 
> Once an error is reported, devlink health will do the following actions:
>   * A log is being send to the kernel trace events buffer
>   * Health status and statistics are being updated for the reporter instance
>   * Object dump is being taken and saved at the reporter instance (as long as
>     there is no other Objdump which is already stored)
>   * Auto recovery attempt is being done. Depends on:
>     - Auto-recovery configuration
>     - Grace period vs. time passed since last recover
> 
> The user interface:
> User can access/change each reporter attributes and driver specific callbacks
> via devlink, e.g per error type (per health reporter)
>  - Configure reporter's generic attributes (like: Disable/enable auto recovery)
>  - Invoke recovery procedure
>  - Run diagnostics
>  - Object dump

Thanks for continuing this work!

> The devlink health interface (via netlink):
> DEVLINK_CMD_HEALTH_REPORTER_GET
>   Retrieves status and configuration info per DEV and reporter.
> DEVLINK_CMD_HEALTH_REPORTER_SET
>   Allows reporter-related configuration setting.
> DEVLINK_CMD_HEALTH_REPORTER_RECOVER
>   Triggers a reporter's recovery procedure.
> DEVLINK_CMD_HEALTH_REPORTER_DIAGNOSE
>   Retrieves diagnostics data from a reporter on a device.
> DEVLINK_CMD_HEALTH_REPORTER_OBJDUMP_GET

The addition of "objdump" and its marshalling is a bit disappointing.
It seemed to me when region snapshots were added that they would serve
this exact purpose.  Taking a region snapshot or "core dump", if you
will, after something went south.  Now it's confusing to have two
mechanisms serving the same purpose.

It's unclear from a quick reading of the code how/if the buffers get
timestamped.  Can you report more than one?

About the marshalling, I'm not sure it belongs in the kernel.  There is
precedent for adding interpretation of FW blobs in user space (ethtool).
IMHO it's a more scalable approach, if we want to avoid having 100 kLoC
drivers.  The amount of code you add to print the simple example from the
last patch is not inspiring confidence.

And on the bike shedding side :) -> I think you should steer clear of
calling this objdump, as it has very little to do with the
functionality of the well-known objdump tool.  It's not even clear what
object the name is referring to.

Long story short the overlap with region snapshots makes it unclear
what purpose either serves, and IMHO we should avoid the marshalling by
teaching user space how to interpret snapshots.  Preferably we only
have one dump mechanism, and user space can be taught the interpretation
once.

>   Retrieves the last stored objdump. Devlink health
>   saves a single objdump. If an objdump is not already stored by the devlink
>   for this reporter, devlink generates a new objdump.
>   Objdump output is defined by the reporter.
> DEVLINK_CMD_HEALTH_REPORTER_OBJDUMP_CLEAR
>   Clears the last saved objdump file for the specified reporter.
> 
> 
>                                                netlink
>                                       +--------------------------+
>                                       |                          |
>                                       |            +             |
>                                       |            |             |
>                                       +--------------------------+
>                                                    |request for ops
>                                                    |(diagnose,
>  mlx5_core                             devlink     |recover,
>                                                    |dump)
> +--------+                            +--------------------------+
> |        |                            |    reporter|             |
> |        |                            |  +---------v----------+  |
> |        |   ops execution            |  |                    |  |
> |     <----------------------------------+                    |  |
> |        |                            |  |                    |  |
> |        |                            |  + ^------------------+  |
> |        |                            |    | request for ops     |
> |        |                            |    | (recover, dump)     |
> |        |                            |    |                     |
> |        |                            |  +-+------------------+  |
> |        |     health report          |  | health handler     |  |
> |        +------------------------------->                    |  |
> |        |                            |  +--------------------+  |
> |        |     health reporter create |                          |
> |        +---------------------------->                          |
> +--------+                            +--------------------------+
> 
> Available reporters:
> In this patchset, three reporters of mlx5e driver are included. The FW
> reporters implement devlink_health_reporter diagnostic, object dump and
> recovery procedures. The TX reporter implements devlink_health_reporter
> diagnostic and recovery procedures.
> 
> This RFC is based on the same concepts as previous RFC I sent earlier this
> year: "[RFC PATCH iproute2-next] System specification health API". The API was
> changed, also devlink code and mlx5e reporters were not available at the
> previous RFC.
Eran Ben Elisha Jan. 1, 2019, 9:58 a.m. UTC | #2
On 1/1/2019 3:47 AM, Jakub Kicinski wrote:
> On Mon, 31 Dec 2018 16:31:54 +0200, Eran Ben Elisha wrote:
>> The health mechanism is targeted for Real Time Alerting, in order to know when
>> something bad had happened to a PCI device
>> - Provide alert debug information
>> - Self healing
>> - If problem needs vendor support, provide a way to gather all needed debugging
>>    information.
>>
>> The main idea is to unify and centralize driver health reports in the
>> generic devlink instance and allow the user to set different
>> attributes of the health reporting and recovery procedures.
>>
>> The devlink health reporter:
>> Device driver creates a "health reporter" per each error/health type.
>> Error/Health type can be a known/generic (eg pci error, fw error, rx/tx error)
>> or unknown (driver specific).
>> For each registered health reporter a driver can issue error/health reports
>> asynchronously. All health reports handling is done by devlink.
>> Device driver can provide specific callbacks for each "health reporter", e.g.
>>   - Recovery procedures
>>   - Diagnostics and object dump procedures
>>   - OOB initial attributes
>> Different parts of the driver can register different types of health reporters
>> with different handlers.
>>
>> Once an error is reported, devlink health will do the following actions:
>>    * A log is being send to the kernel trace events buffer
>>    * Health status and statistics are being updated for the reporter instance
>>    * Object dump is being taken and saved at the reporter instance (as long as
>>      there is no other Objdump which is already stored)
>>    * Auto recovery attempt is being done. Depends on:
>>      - Auto-recovery configuration
>>      - Grace period vs. time passed since last recover
>>
>> The user interface:
>> User can access/change each reporter attributes and driver specific callbacks
>> via devlink, e.g per error type (per health reporter)
>>   - Configure reporter's generic attributes (like: Disable/enable auto recovery)
>>   - Invoke recovery procedure
>>   - Run diagnostics
>>   - Object dump
> 
> Thanks for continuing this work!
:)
> 
>> The devlink health interface (via netlink):
>> DEVLINK_CMD_HEALTH_REPORTER_GET
>>    Retrieves status and configuration info per DEV and reporter.
>> DEVLINK_CMD_HEALTH_REPORTER_SET
>>    Allows reporter-related configuration setting.
>> DEVLINK_CMD_HEALTH_REPORTER_RECOVER
>>    Triggers a reporter's recovery procedure.
>> DEVLINK_CMD_HEALTH_REPORTER_DIAGNOSE
>>    Retrieves diagnostics data from a reporter on a device.
>> DEVLINK_CMD_HEALTH_REPORTER_OBJDUMP_GET
> 
> The addition of "objdump" and its marshalling is a bit disappointing.
> It seemed to me when region snapshots were added that they would serve
> this exact purpose.  Taking a region snapshot or "core dump", if you
> will, after something went south.  Now it's confusing to have two
> mechanisms serving the same purpose.

The motivation here was that the driver can provide reporters for its
sub-modules, such that each reporter is able to provide all the needed
info and recovery methods to handle run-time errors.

The implementation of the objdump function is in the hands of the
reporter developer, who can dump whatever they think is needed.
Keep in mind that a driver can have many reporters (TX, RX, FW, command
interface, etc.). For most of the reporters there is important
information in the driver which cannot be fetched with a region snapshot
(intended for memory fetching only).
For SW info, the driver builds the info and sends it interpreted (unlike
all the dump/region mechanisms available so far).
I plan to extend the TX reporter with an objdump method which will pass
many SQ SW attributes that can be very handy in a debug session.

> 
> It's unclear from quick reading of the code how if the buffers get
> timestamped.  Can you report more than one?

The timestamp which devlink health reports is the timestamp at which
it got the buffers filled by the driver. Every dump/diagnose has one
timestamp.
Per reporter it is possible to store up to one dump. Only the clear
command can remove it and make the reporter ready to fetch a new objdump.

> 
> About the marshalling, I'm not sure it belongs in the kernel.  There is
> precedent for adding interpretation of FW blobs in user space (ethtool).
> IMHO it's a more scalable approach, if we want to avoid having 100 kLoC
> drivers.  Amount of code you add to print the simple example from last
> patch is not inspiring confidence.

The idea was to give the developer the ability to create a "tree-like"
structure of information; it is needed when you want to describe complex
objects. This resulted in longer code, but I don't think we are even
close to the scale you are talking about.
We can remove the tree flexibility and move to an array format; that
would make the driver's data-storing code shorter, but we would lose the
ability to have the data in a tree-like format.

And again, regarding ethtool: it is a tool for dumping HW/FW, so this
could have been an argument for the region snapshot, not for devlink
health...

> 
> And on the bike shedding side :) -> I think you should steer clear of
> calling this objdump, as it has very little to do with the
> functionality of well-known objdump tool.  Its not even clear what the
> object the name is referring to is.
Let's agree on the concept; we can change the name to "dump".
Reporter->dump is very clear when you know what the reporter is.
> 
> Long story short the overlap with region snapshots makes it unclear
> what purpose either serves, and IMHO we should avoid the marshalling by
> teaching user space how to interpret snapshots.  Preferably we only
> have one dump mechanism, and user space can be taught the interpretation
> once.
Forcing SW reporters to use the region snapshot mechanism sounds like a
bad idea.

To summarize: in order to have the health system working properly, it
must have a way to objdump/dump itself and provide that to the developer
for debug. Removing the objdump part would make it useless for run-time
debug.

I think this is a powerful tool, and we should not ask user-level
scripts to fetch info from many other partial tools. It should all be
focused in one place (recover, diagnose, objdump, statistics).

> 
>>    Retrieves the last stored objdump. Devlink health
>>    saves a single objdump. If an objdump is not already stored by the devlink
>>    for this reporter, devlink generates a new objdump.
>>    Objdump output is defined by the reporter.
>> DEVLINK_CMD_HEALTH_REPORTER_OBJDUMP_CLEAR
>>    Clears the last saved objdump file for the specified reporter.
>>
>>
>>                                                 netlink
>>                                        +--------------------------+
>>                                        |                          |
>>                                        |            +             |
>>                                        |            |             |
>>                                        +--------------------------+
>>                                                     |request for ops
>>                                                     |(diagnose,
>>   mlx5_core                             devlink     |recover,
>>                                                     |dump)
>> +--------+                            +--------------------------+
>> |        |                            |    reporter|             |
>> |        |                            |  +---------v----------+  |
>> |        |   ops execution            |  |                    |  |
>> |     <----------------------------------+                    |  |
>> |        |                            |  |                    |  |
>> |        |                            |  + ^------------------+  |
>> |        |                            |    | request for ops     |
>> |        |                            |    | (recover, dump)     |
>> |        |                            |    |                     |
>> |        |                            |  +-+------------------+  |
>> |        |     health report          |  | health handler     |  |
>> |        +------------------------------->                    |  |
>> |        |                            |  +--------------------+  |
>> |        |     health reporter create |                          |
>> |        +---------------------------->                          |
>> +--------+                            +--------------------------+
>>
>> Available reporters:
>> In this patchset, three reporters of mlx5e driver are included. The FW
>> reporters implement devlink_health_reporter diagnostic, object dump and
>> recovery procedures. The TX reporter implements devlink_health_reporter
>> diagnostic and recovery procedures.
>>
>> This RFC is based on the same concepts as previous RFC I sent earlier this
>> year: "[RFC PATCH iproute2-next] System specification health API". The API was
>> changed, also devlink code and mlx5e reporters were not available at the
>> previous RFC.
Jakub Kicinski Jan. 2, 2019, 10:46 p.m. UTC | #3
On Tue, 1 Jan 2019 09:58:30 +0000, Eran Ben Elisha wrote:
> On 1/1/2019 3:47 AM, Jakub Kicinski wrote:
> > The addition of "objdump" and its marshalling is a bit disappointing.
> > It seemed to me when region snapshots were added that they would serve
> > this exact purpose.  Taking a region snapshot or "core dump", if you
> > will, after something went south.  Now it's confusing to have two
> > mechanisms serving the same purpose.  
> 
> The motivation here was that the driver can provide reporters to its 
> sub-modules, such that each reporter will be able to provide all needed 
> info and recover methods to face run time errors.
> 
> The implementation of the objdump function is in the hands of the 
> reporter developer, and he can dump whatever he thinks it is needed.
> Keep in mind that a driver can have many reporters (TX, RX, FW, command 
> interface, etc). For most of the reporters, there is important 
> information in the driver which can not be fetched with region snapshot 
> (intended for memory fetching only).
> For SW info, driver shall build the info and send it interpreted (unlike 
> all dumps / region available mechanisms)
> I have in plans to extend the TX reporter to have objdump method in 
> which I will pass many SQ SW attributes that can be very handy in a 
> debug session.

My feeling is that instead of duplicating this infrastructure we should
try to grow region snapshots beyond the "HW memory dumper".  In a
debugging session you may want to have dumps as well as read the state
live.  Region snapshot API was built with this promise in mind.  The
information in reporter dump is not easily available other than when
the dump happens, which is not great in a debugging session.  Driver
will have to expose it via debugfs/region dump/ethtool dump or some
such for live debug.

> > It's unclear from quick reading of the code how if the buffers get
> > timestamped.  Can you report more than one?  
> 
> The timestamp which devlink health reports on, is the timestamp in which 
> it got the buffers filled by the driver. Every dump/diagnose has one ts.
> Per reporter, it is possible to store up to one dump. only clear command 
> can remove it and makes the reporter ready to fetch a new objdump.

Region snapshots support collecting multiple snapshots IIRC, no?

> > About the marshalling, I'm not sure it belongs in the kernel.  There is
> > precedent for adding interpretation of FW blobs in user space (ethtool).
> > IMHO it's a more scalable approach, if we want to avoid having 100 kLoC
> > drivers.  Amount of code you add to print the simple example from last
> > patch is not inspiring confidence.  
> 
> The idea was to provide the developer the ability to create "tree-like" 
> of information, it is needed when you want to describe complex objects. 
> This caused a longer coding, but I don't think we are even close to the 
> scale you are talking about.
> We can remove the tree flexibility, and move to array format, it will 
> make the code of storing data by the driver to be shorter, but we will 
> lose the ability to have it in tree-like format.

To be clear I slightly oppose the marshalling in the first place.  It
may be better to just dump the data as is, and have user space know
what the interpretation is.

> And again, regarding ethtool, it is a tool for dumping HW/FW, this could 
> have been an argument for the region snapshot, not for the devlink health...

I've seen drivers dumping ring state and other SW info via ethtool.
All debugging APIs end up "mixed source" in my experience.

> > And on the bike shedding side :) -> I think you should steer clear of
> > calling this objdump, as it has very little to do with the
> > functionality of well-known objdump tool.  Its not even clear what the
> > object the name is referring to is.  
> Let's agree on concept, we can change name to dump. Reporter->dump is 
> very clear when you know what the reporter is.

Thanks!

> > Long story short the overlap with region snapshots makes it unclear
> > what purpose either serves, and IMHO we should avoid the marshalling by
> > teaching user space how to interpret snapshots.  Preferably we only
> > have one dump mechanism, and user space can be taught the interpretation
> > once.  
> Forcing SW reporters to use region snapshot mechanism sounds like a bad 
> idea.

I'm not super excited about reusing the region API for SW info.  But I
like it more than having a multitude of debug APIs for drivers to
implement with largely overlapping functionality and semantics.

> To summarize, In order to have the health system working properly, it 
> must have a way to objdump/dump itself and provide it to the developer 
> for debug. Removing the objdump part will make it useless for run-time 
> debug.
> 
> I think this is a powerful tool and we shall not ask the user level 
> scripts to fetch info from many other partial tools. It shall all be 
> focused in one place (recover, diagnose, objdump, statistics).

I don't think having reporter API refer to snapshot IDs is "many other
partial tools" if that's the suggestion.  Seems like you want to focus
all the reporter stuff in one place, and I want to focus the debug APIs
a little :)
Eran Ben Elisha Jan. 3, 2019, 1:31 p.m. UTC | #4
On 1/3/2019 12:46 AM, Jakub Kicinski wrote:
> On Tue, 1 Jan 2019 09:58:30 +0000, Eran Ben Elisha wrote:
>> On 1/1/2019 3:47 AM, Jakub Kicinski wrote:
>>> The addition of "objdump" and its marshalling is a bit disappointing.
>>> It seemed to me when region snapshots were added that they would serve
>>> this exact purpose.  Taking a region snapshot or "core dump", if you
>>> will, after something went south.  Now it's confusing to have two
>>> mechanisms serving the same purpose.
>>
>> The motivation here was that the driver can provide reporters to its
>> sub-modules, such that each reporter will be able to provide all needed
>> info and recover methods to face run time errors.
>>
>> The implementation of the objdump function is in the hands of the
>> reporter developer, and he can dump whatever he thinks it is needed.
>> Keep in mind that a driver can have many reporters (TX, RX, FW, command
>> interface, etc). For most of the reporters, there is important
>> information in the driver which can not be fetched with region snapshot
>> (intended for memory fetching only).
>> For SW info, driver shall build the info and send it interpreted (unlike
>> all dumps / region available mechanisms)
>> I have in plans to extend the TX reporter to have objdump method in
>> which I will pass many SQ SW attributes that can be very handy in a
>> debug session.
> 
> My feeling is that instead of duplicating this infrastructure we should
> try to grow region snapshots beyond the "HW memory dumper".  In a
> debugging session you may want to have dumps as well as read the state
> live.  Region snapshot API was built with this promise in mind.  The
> information in reporter dump is not easily available other than when
> the dump happens, which is not great in a debugging session.  Driver
> will have to expose it via debugfs/region dump/ethtool dump or some
> such for live debug.

Architecture-wise those are two different features which we shouldn't mix.
The region dump aims at dumping information for monitoring "HW memory"
in real time; it is more like a dumb channel providing memory chunks
from the HW to the user.

Health is a user-facing tool which provides a centralized view of a
device's health status, with diagnose, recover and dump options divided
across sub-reporters in real time, and it can save critical data
automatically in case of a reported error.

Regarding the need for another tool for live debug, that is not true.
The devlink health objdump command provides the latest stored dump, as
well as dumping now for live debug.
It is even better, as the driver's reporters already contain the needed
HW and SW buffers according to the device and the error, unlike region
dump, which requires the administrator to set regions even though the
administrator is not an expert in mapping between the required info, the
device model and the reported error.

> 
>>> It's unclear from quick reading of the code how if the buffers get
>>> timestamped.  Can you report more than one?
>>
>> The timestamp which devlink health reports on, is the timestamp in which
>> it got the buffers filled by the driver. Every dump/diagnose has one ts.
>> Per reporter, it is possible to store up to one dump. only clear command
>> can remove it and makes the reporter ready to fetch a new objdump.
> 
> Region snapshots support collecting multiple snapshots IIRC, no?

It does allow multiple snapshots. This could easily be added to devlink
health if we wished to; I didn't see a current need for it.

> 
>>> About the marshalling, I'm not sure it belongs in the kernel.  There is
>>> precedent for adding interpretation of FW blobs in user space (ethtool).
>>> IMHO it's a more scalable approach, if we want to avoid having 100 kLoC
>>> drivers.  Amount of code you add to print the simple example from last
>>> patch is not inspiring confidence.
>>
>> The idea was to provide the developer the ability to create "tree-like"
>> of information, it is needed when you want to describe complex objects.
>> This caused a longer coding, but I don't think we are even close to the
>> scale you are talking about.
>> We can remove the tree flexibility, and move to array format, it will
>> make the code of storing data by the driver to be shorter, but we will
>> lose the ability to have it in tree-like format.
> 
> To be clear I slightly oppose the marshalling in the first place.  It
> may be better to just dump the data as is, and have user space know
> what the interpretation is.

We provide a way to store the data in nested layers. In internal
discussions with Jiri, we decided that this is the correct approach.
However, if one insists, a driver can fill the buffers with raw binary
and label it as such.

> 
>> And again, regarding ethtool, it is a tool for dumping HW/FW, this could
>> have been an argument for the region snapshot, not for the devlink health...
> 
> I've seen drivers dumping ring state and other SW info via ethtool.
> All debugging APIs end up "mixed source" in my experience.
> 
>>> And on the bike shedding side :) -> I think you should steer clear of
>>> calling this objdump, as it has very little to do with the
>>> functionality of well-known objdump tool.  Its not even clear what the
>>> object the name is referring to is.
>> Let's agree on concept, we can change name to dump. Reporter->dump is
>> very clear when you know what the reporter is.
> 
> Thanks!
> 
>>> Long story short the overlap with region snapshots makes it unclear
>>> what purpose either serves, and IMHO we should avoid the marshalling by
>>> teaching user space how to interpret snapshots.  Preferably we only
>>> have one dump mechanism, and user space can be taught the interpretation
>>> once.
>> Forcing SW reporters to use region snapshot mechanism sounds like a bad
>> idea.
> 
> I'm not super excited about reusing region API for SW info.  But I like
> it more than having multitude of debug APIs for drivers to implement
> with largely overlapping functionality and semantics.

Dumping HW information is just a small portion of devlink health. If a
driver developer thinks the existing region API can be used to fetch some
data into a reporter, they can do so in the reporter's dump method.

> 
>> To summarize, In order to have the health system working properly, it
>> must have a way to objdump/dump itself and provide it to the developer
>> for debug. Removing the objdump part will make it useless for run-time
>> debug.
>>
>> I think this is a powerful tool and we shall not ask the user level
>> scripts to fetch info from many other partial tools. It shall all be
>> focused in one place (recover, diagnose, objdump, statistics).
> 
> I don't think having reporter API refer to snapshot IDs is "many other
> partial tools" if that's the suggestion.  Seems like you want to focus
> all the reporter stuff in one place, and I want to focus the debug APIs
> a little :)
> 

As I see it, this tool is an envelope for all health functionality and
should provide a good, easy-to-use interface.
It should contain all health-related dumps (unfortunately, since the
subsystem has been around for a long time, it cannot be the sole provider
of them, but it aspires to be the leading one).

With devlink health dump you can get:
1. Run-time dumps.
2. Automatic error-related dumps, captured at the time the error happened
and stored in memory, waiting to be fetched.
3. Driver-parsed data as well as raw data (up to the developer to
decide; the API supports both).
4. OOB-configured data marked by the driver developers as relevant for
the failure.

None of the existing tools provides this set of features, and adding them
to an existing tool does not seem possible or correct, even in theory.

Removing the dump option from devlink health for the sake of avoiding a
duplicated HW memory dump would kill this feature, and as a driver
developer I really think the networking subsystem needs it.
Jakub Kicinski Jan. 3, 2019, 10:28 p.m. UTC | #5
On Thu, 3 Jan 2019 13:31:59 +0000, Eran Ben Elisha wrote:
> Arch wise those are two different features which we shouldn't mix.
> The region dump is aiming at dumping of information for monitoring of 
> "HW memory" at real time, more like a dumb channel to provide memory 
> chunks from HW to user.

The "real time read" part of the region dump was not even implemented.
And it was the part that made most sense to me.

Region snapshots were described as a tool for gathering crash dumps.
See bedc989b0c98 ("net/mlx4_core: Add Crdump FW snapshot support").

The "chunks from HW" is also incorrect as (1) current implementation of
regions seem to mostly revolve around FW state and (2) there is nothing
in the man page etc. that says HW.

I'm not saying region snapshots fit the bill perfectly for you, I'm
saying you guys are adding a second facility to do a very similar thing
in the span of 6 months - how is it unreasonable of me to ask to
consolidate?

But I'm not gonna fight you any more on this, if nobody else cares.
Jiri Pirko Jan. 4, 2019, 8:57 a.m. UTC | #6
Tue, Jan 01, 2019 at 02:47:27AM CET, jakub.kicinski@netronome.com wrote:
>On Mon, 31 Dec 2018 16:31:54 +0200, Eran Ben Elisha wrote:
>> The health mechanism is targeted for Real Time Alerting, in order to know when
>> something bad had happened to a PCI device
>> - Provide alert debug information
>> - Self healing
>> - If problem needs vendor support, provide a way to gather all needed debugging
>>   information.
>> 
>> The main idea is to unify and centralize driver health reports in the
>> generic devlink instance and allow the user to set different
>> attributes of the health reporting and recovery procedures.
>> 
>> The devlink health reporter:
>> Device driver creates a "health reporter" per each error/health type.
>> Error/Health type can be a known/generic (eg pci error, fw error, rx/tx error)
>> or unknown (driver specific).
>> For each registered health reporter a driver can issue error/health reports
>> asynchronously. All health reports handling is done by devlink.
>> Device driver can provide specific callbacks for each "health reporter", e.g.
>>  - Recovery procedures
>>  - Diagnostics and object dump procedures
>>  - OOB initial attributes
>> Different parts of the driver can register different types of health reporters
>> with different handlers.
>> 
>> Once an error is reported, devlink health will do the following actions:
>>   * A log is being send to the kernel trace events buffer
>>   * Health status and statistics are being updated for the reporter instance
>>   * Object dump is being taken and saved at the reporter instance (as long as
>>     there is no other Objdump which is already stored)
>>   * Auto recovery attempt is being done. Depends on:
>>     - Auto-recovery configuration
>>     - Grace period vs. time passed since last recover
>> 
>> The user interface:
>> User can access/change each reporter attributes and driver specific callbacks
>> via devlink, e.g per error type (per health reporter)
>>  - Configure reporter's generic attributes (like: Disable/enable auto recovery)
>>  - Invoke recovery procedure
>>  - Run diagnostics
>>  - Object dump
>
>Thanks for continuing this work!
>
>> The devlink health interface (via netlink):
>> DEVLINK_CMD_HEALTH_REPORTER_GET
>>   Retrieves status and configuration info per DEV and reporter.
>> DEVLINK_CMD_HEALTH_REPORTER_SET
>>   Allows reporter-related configuration setting.
>> DEVLINK_CMD_HEALTH_REPORTER_RECOVER
>>   Triggers a reporter's recovery procedure.
>> DEVLINK_CMD_HEALTH_REPORTER_DIAGNOSE
>>   Retrieves diagnostics data from a reporter on a device.
>> DEVLINK_CMD_HEALTH_REPORTER_OBJDUMP_GET
>
>The addition of "objdump" and its marshalling is a bit disappointing.
>It seemed to me when region snapshots were added that they would serve
>this exact purpose.  Taking a region snapshot or "core dump", if you
>will, after something went south.  Now it's confusing to have two
>mechanisms serving the same purpose.
>
>It's unclear from quick reading of the code how if the buffers get
>timestamped.  Can you report more than one?
>
>About the marshalling, I'm not sure it belongs in the kernel.  There is
>precedent for adding interpretation of FW blobs in user space (ethtool).
>IMHO it's a more scalable approach, if we want to avoid having 100 kLoC
>drivers.  Amount of code you add to print the simple example from last
>patch is not inspiring confidence.
>
>And on the bike shedding side :) -> I think you should steer clear of
>calling this objdump, as it has very little to do with the
>functionality of well-known objdump tool.  Its not even clear what the
>object the name is referring to is.
>
>Long story short the overlap with region snapshots makes it unclear
>what purpose either serves, and IMHO we should avoid the marshalling by
>teaching user space how to interpret snapshots.  Preferably we only
>have one dump mechanism, and user space can be taught the interpretation
>once.

Unlike regions, which are binary blobs, the "objdump" we have here is a
tree of values (JSON-like). It is not taken from FW/HW; it is generated
in the driver and passed up to user space to be shown. Originally, Eran
just pushed that info into a string buffer and the user had to parse it
again. I asked to format this as Netlink attributes, JSON-style, so the
user gets the full hierarchy without any need for parsing and can easily
do Netlink->human_stdout or Netlink->JSON.
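Jiri's point can be illustrated with a rough user-space sketch: a driver-built
tree of values survives a netlink-style nested TLV encoding intact, so user
space can render it without any driver-specific parser. The attribute types
and layout below are invented for this example and are not the actual devlink
netlink format:

```python
import struct

# Invented attribute types for this example only (not the real devlink ones).
ATTR_NEST, ATTR_NAME, ATTR_VALUE = 1, 2, 3

def put_attr(atype, payload):
    # netlink-style TLV: 4-byte header (len, type), payload padded to 4 bytes
    hdr = struct.pack("HH", 4 + len(payload), atype)
    pad = b"\0" * ((4 - len(payload) % 4) % 4)
    return hdr + payload + pad

def encode(tree):
    """Encode a {name: value-or-subtree} dict as nested attributes."""
    out = b""
    for name, val in tree.items():
        out += put_attr(ATTR_NAME, name.encode())
        if isinstance(val, dict):
            out += put_attr(ATTR_NEST, encode(val))  # hierarchy kept as nesting
        else:
            out += put_attr(ATTR_VALUE, str(val).encode())
    return out

def decode(buf):
    """Walk the attributes back into a dict; no driver-specific parser needed."""
    tree, name, i = {}, None, 0
    while i < len(buf):
        alen, atype = struct.unpack_from("HH", buf, i)
        payload = buf[i + 4:i + alen]
        if atype == ATTR_NAME:
            name = payload.decode()
        elif atype == ATTR_NEST:
            tree[name] = decode(payload)
        else:
            tree[name] = payload.decode()
        i += (alen + 3) & ~3  # advance past padding to the next attribute
    return tree

sq_state = {"SQ": {"sqn": "0x1c", "HW state": "1", "cc": "10", "pc": "12"}}
assert decode(encode(sq_state)) == sq_state  # round-trips, hierarchy intact
```

A flat string buffer, by contrast, would force every user-space consumer to
re-parse the driver's ad-hoc text format to recover the same structure.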



>
>>   Retrieves the last stored objdump. Devlink health
>>   saves a single objdump. If an objdump is not already stored by the devlink
>>   for this reporter, devlink generates a new objdump.
>>   Objdump output is defined by the reporter.
>> DEVLINK_CMD_HEALTH_REPORTER_OBJDUMP_CLEAR
>>   Clears the last saved objdump file for the specified reporter.
>> 
>> 
>>                                                netlink
>>                                       +--------------------------+
>>                                       |                          |
>>                                       |            +             |
>>                                       |            |             |
>>                                       +--------------------------+
>>                                                    |request for ops
>>                                                    |(diagnose,
>>  mlx5_core                             devlink     |recover,
>>                                                    |dump)
>> +--------+                            +--------------------------+
>> |        |                            |    reporter|             |
>> |        |                            |  +---------v----------+  |
>> |        |   ops execution            |  |                    |  |
>> |     <----------------------------------+                    |  |
>> |        |                            |  |                    |  |
>> |        |                            |  + ^------------------+  |
>> |        |                            |    | request for ops     |
>> |        |                            |    | (recover, dump)     |
>> |        |                            |    |                     |
>> |        |                            |  +-+------------------+  |
>> |        |     health report          |  | health handler     |  |
>> |        +------------------------------->                    |  |
>> |        |                            |  +--------------------+  |
>> |        |     health reporter create |                          |
>> |        +---------------------------->                          |
>> +--------+                            +--------------------------+
>> 
>> Available reporters:
>> In this patchset, three reporters of mlx5e driver are included. The FW
>> reporters implement devlink_health_reporter diagnostic, object dump and
>> recovery procedures. The TX reporter implements devlink_health_reporter
>> diagnostic and recovery procedures.
>> 
>> This RFC is based on the same concepts as previous RFC I sent earlier this
>> year: "[RFC PATCH iproute2-next] System specification health API". The API was
>> changed, also devlink code and mlx5e reporters were not available at the
>> previous RFC.
Jiri Pirko Jan. 4, 2019, 9:01 a.m. UTC | #7
Thu, Jan 03, 2019 at 02:31:59PM CET, eranbe@mellanox.com wrote:
>
>
>On 1/3/2019 12:46 AM, Jakub Kicinski wrote:
>> On Tue, 1 Jan 2019 09:58:30 +0000, Eran Ben Elisha wrote:
>>> On 1/1/2019 3:47 AM, Jakub Kicinski wrote:

[...]

>
>> 
>>>> About the marshalling, I'm not sure it belongs in the kernel.  There is
>>>> precedent for adding interpretation of FW blobs in user space (ethtool).
>>>> IMHO it's a more scalable approach, if we want to avoid having 100 kLoC
>>>> drivers.  Amount of code you add to print the simple example from last
>>>> patch is not inspiring confidence.
>>>
>>> The idea was to provide the developer the ability to create "tree-like"
>>> of information, it is needed when you want to describe complex objects.
>>> This caused a longer coding, but I don't think we are even close to the
>>> scale you are talking about.
>>> We can remove the tree flexibility, and move to array format, it will
>>> make the code of storing data by the driver to be shorter, but we will
>>> lose the ability to have it in tree-like format.
>> 
>> To be clear I slightly oppose the marshalling in the first place.  It
>> may be better to just dump the data as is, and have user space know
>> what the interpretation is.
>
>We provides a way to store the data in nested layers. In internal 
>discussions with Jiri, we decided that this is the correct approach.
>However, if one insists, it can fill the buffers with raw binary and 
>label it as such.

Again, the data is generated by the driver. The driver knows the objects,
so it can assemble a tree of values accordingly. To me it seems wrong to
squash the info into a binary blob, push it to user space, and have it
parsed and unsquashed back there.
Jiri Pirko Jan. 4, 2019, 9:12 a.m. UTC | #8
Thu, Jan 03, 2019 at 11:28:34PM CET, jakub.kicinski@netronome.com wrote:
>On Thu, 3 Jan 2019 13:31:59 +0000, Eran Ben Elisha wrote:
>> Arch wise those are two different features which we shouldn't mix.
>> The region dump is aiming at dumping of information for monitoring of 
>> "HW memory" at real time, more like a dumb channel to provide memory 
>> chunks from HW to user.
>
>The "real time read" part of the region dump was not even implemented.
>And it was the part that made most sense to me.

Agreed. I believe that it was planned to be used for mlx5.


>
>Region snapshots were described as a tool for gathering crash dumps.
>See bedc989b0c98 ("net/mlx4_core: Add Crdump FW snapshot support").
>
>The "chunks from HW" is also incorrect as (1) current implementation of
>regions seem to mostly revolve around FW state and (2) there is nothing
>in the man page etc. that says HW.
>
>I'm not saying region snapshots fit the bill perfectly for you, I'm
>saying you guys are adding a second facility to do a very similar thing
>in the span of 6 months - how is it unreasonable of me to ask to
>consolidate?

If we needed to push a binary blob, yes. But as I described in another
email, that is not the case.

>
>But I'm not gonna fight you any more on this, if nobody else cares.