[net-next,0/9] devlink: Add support for region access

Message ID	1522339672-18273-1-git-send-email-valex@mellanox.com
Headers	show Return-Path: <netdev-owner@vger.kernel.org> From: Alex Vesker <valex@mellanox.com> To: "David S. Miller" <davem@davemloft.net> Cc: netdev@vger.kernel.org, Tariq Toukan <tariqt@mellanox.com>, Jiri Pirko <jiri@mellanox.com>, Alex Vesker <valex@mellanox.com> Subject: [PATCH net-next 0/9] devlink: Add support for region access Date: Thu, 29 Mar 2018 19:07:43 +0300 Message-Id: <1522339672-18273-1-git-send-email-valex@mellanox.com> Sender: netdev-owner@vger.kernel.org Precedence: bulk
Series	devlink: Add support for region access \| expand [net-next,0/9] devlink: Add support for region access [net-next,1/9] devlink: Add support for creating and destroying regions [net-next,2/9] devlink: Add callback to query for snapshot id before snapshot create [net-next,3/9] devlink: Add support for creating region snapshots [net-next,4/9] devlink: Add support for region get command [net-next,5/9] devlink: Extend the support querying for region snapshot IDs [net-next,6/9] devlink: Add support for region snapshot delete command [net-next,7/9] devlink: Add support for region snapshot read command [net-next,8/9] net/mlx4_core: Add health buffer address capability [net-next,9/9] net/mlx4_core: Add Crdump FW snapshot support

Alex Vesker March 29, 2018, 4:07 p.m. UTC

This is a proposal which will allow access to driver defined address
regions using devlink. Each device can create its supported address
regions and register them. A device which exposes a region will allow
access to it using devlink.

The suggested implementation will allow exposing regions to the user,
reading and dumping snapshots taken from different regions. 
A snapshot represents a memory image of a region taken by the driver.

If a device collects a snapshot of an address region it can be later
exposed using devlink region read or dump commands.
This functionality allows for future analyses on the snapshots to be
done.

The major benefit of this support is not only to provide access to
internal address regions which were inaccessible to the user but also
to provide an additional way to debug complex error states using the
region snapshots.

Implemented commands:
$ devlink region help
$ devlink region show [ DEV/REGION ]
$ devlink region del DEV/REGION snapshot SNAPSHOT_ID
$ devlink region dump DEV/REGION [ snapshot SNAPSHOT_ID ]
$ devlink region read DEV/REGION [ snapshot SNAPSHOT_ID ]
	address ADDRESS length length

Show all of the exposed regions with region sizes:
$ devlink region show
pci/0000:00:05.0/cr-space: size 1048576 snapshot [1 2]
pci/0000:00:05.0/fw-health: size 64 snapshot [1 2]

Delete a snapshot using:
$ devlink region del pci/0000:00:05.0/cr-space snapshot 1

Dump a snapshot:
$ devlink region dump pci/0000:00:05.0/fw-health snapshot 1
0000000000000000 0014 95dc 0014 9514 0035 1670 0034 db30
0000000000000010 0000 0000 ffff ff04 0029 8c00 0028 8cc8
0000000000000020 0016 0bb8 0016 1720 0000 0000 c00f 3ffc
0000000000000030 bada cce5 bada cce5 bada cce5 bada cce5

Read a specific part of a snapshot:
$ devlink region read pci/0000:00:05.0/fw-health snapshot 1 address 0 
	length 16
0000000000000000 0014 95dc 0014 9514 0035 1670 0034 db30

For more information you can check devlink-region.8 man page

Future:
There is a plan to extend the support to include a write command
as well as performing read and dump live region

Alex Vesker (9):
  devlink: Add support for creating and destroying regions
  devlink: Add callback to query for snapshot id before snapshot create
  devlink: Add support for creating region snapshots
  devlink: Add support for region get command
  devlink: Extend the support querying for region snapshot IDs
  devlink: Add support for region snapshot delete command
  devlink: Add support for region snapshot read command
  net/mlx4_core: Add health buffer address capability
  net/mlx4_core: Add Crdump FW snapshot support

 drivers/net/ethernet/mellanox/mlx4/Makefile |   2 +-
 drivers/net/ethernet/mellanox/mlx4/catas.c  |   6 +-
 drivers/net/ethernet/mellanox/mlx4/crdump.c | 224 ++++++++++
 drivers/net/ethernet/mellanox/mlx4/fw.c     |   5 +-
 drivers/net/ethernet/mellanox/mlx4/fw.h     |   1 +
 drivers/net/ethernet/mellanox/mlx4/main.c   |  11 +-
 drivers/net/ethernet/mellanox/mlx4/mlx4.h   |   4 +
 include/linux/mlx4/device.h                 |   7 +
 include/net/devlink.h                       |  39 ++
 include/uapi/linux/devlink.h                |  18 +
 net/core/devlink.c                          | 646 ++++++++++++++++++++++++++++
 11 files changed, 958 insertions(+), 5 deletions(-)
 create mode 100644 drivers/net/ethernet/mellanox/mlx4/crdump.c

Andrew Lunn March 29, 2018, 5:13 p.m. UTC | #1

On Thu, Mar 29, 2018 at 07:07:43PM +0300, Alex Vesker wrote:
> This is a proposal which will allow access to driver defined address
> regions using devlink. Each device can create its supported address
> regions and register them. A device which exposes a region will allow
> access to it using devlink.
> 
> The suggested implementation will allow exposing regions to the user,
> reading and dumping snapshots taken from different regions. 
> A snapshot represents a memory image of a region taken by the driver.
> 
> If a device collects a snapshot of an address region it can be later
> exposed using devlink region read or dump commands.
> This functionality allows for future analyses on the snapshots to be
> done.

Hi Alex

So the device is in change of making a snapshot? A user cannot
initiate it?

Seems like if i'm trying to debug something, i want to take a snapshot
in the good state, issue the command which breaks things, and then
take another snapshot. Looking at the diff then gives me an idea what
happened.

> Show all of the exposed regions with region sizes:
> $ devlink region show
> pci/0000:00:05.0/cr-space: size 1048576 snapshot [1 2]

So you have 2Mbytes of snapshot data. Is this held in the device, or
kernel memory?

> Dump a snapshot:
> $ devlink region dump pci/0000:00:05.0/fw-health snapshot 1
> 0000000000000000 0014 95dc 0014 9514 0035 1670 0034 db30
> 0000000000000010 0000 0000 ffff ff04 0029 8c00 0028 8cc8
> 0000000000000020 0016 0bb8 0016 1720 0000 0000 c00f 3ffc
> 0000000000000030 bada cce5 bada cce5 bada cce5 bada cce5
> 
> Read a specific part of a snapshot:
> $ devlink region read pci/0000:00:05.0/fw-health snapshot 1 address 0 
> 	length 16
> 0000000000000000 0014 95dc 0014 9514 0035 1670 0034 db30

Why a separate command? It seems to be just a subset of dump.

    Andrew

Andrew Lunn March 29, 2018, 6:23 p.m. UTC | #2

On Thu, Mar 29, 2018 at 07:07:43PM +0300, Alex Vesker wrote:
> This is a proposal which will allow access to driver defined address
> regions using devlink. Each device can create its supported address
> regions and register them. A device which exposes a region will allow
> access to it using devlink.

Hi Alex

Did you see the work Rahul Lakkireddy has been doing?

https://patchwork.kernel.org/patch/10305935/

It seems like these are similar, or at least overlapping. We probably
want one solution for both.

     Andrew

Alex Vesker March 29, 2018, 6:59 p.m. UTC | #3

On 3/29/2018 8:13 PM, Andrew Lunn wrote:
> On Thu, Mar 29, 2018 at 07:07:43PM +0300, Alex Vesker wrote:
>> This is a proposal which will allow access to driver defined address
>> regions using devlink. Each device can create its supported address
>> regions and register them. A device which exposes a region will allow
>> access to it using devlink.
>>
>> The suggested implementation will allow exposing regions to the user,
>> reading and dumping snapshots taken from different regions.
>> A snapshot represents a memory image of a region taken by the driver.
>>
>> If a device collects a snapshot of an address region it can be later
>> exposed using devlink region read or dump commands.
>> This functionality allows for future analyses on the snapshots to be
>> done.
> Hi Alex
>
> So the device is in change of making a snapshot? A user cannot
> initiate it?
Hi,
Correct, currently the user cannot initiate saving a snapshot but
as I said in the cover letter, planned support is for dumping "live" 
regions.

> Seems like if i'm trying to debug something, i want to take a snapshot
> in the good state, issue the command which breaks things, and then
> take another snapshot. Looking at the diff then gives me an idea what
> happened.
>
>> Show all of the exposed regions with region sizes:
>> $ devlink region show
>> pci/0000:00:05.0/cr-space: size 1048576 snapshot [1 2]
> So you have 2Mbytes of snapshot data. Is this held in the device, or
> kernel memory?
This is allocated in devlink, the maximum number of snapshots is set by 
the driver.

>> Dump a snapshot:
>> $ devlink region dump pci/0000:00:05.0/fw-health snapshot 1
>> 0000000000000000 0014 95dc 0014 9514 0035 1670 0034 db30
>> 0000000000000010 0000 0000 ffff ff04 0029 8c00 0028 8cc8
>> 0000000000000020 0016 0bb8 0016 1720 0000 0000 c00f 3ffc
>> 0000000000000030 bada cce5 bada cce5 bada cce5 bada cce5
>>
>> Read a specific part of a snapshot:
>> $ devlink region read pci/0000:00:05.0/fw-health snapshot 1 address 0
>> 	length 16
>> 0000000000000000 0014 95dc 0014 9514 0035 1670 0034 db30
> Why a separate command? It seems to be just a subset of dump.

This is useful when debugging values on specific addresses, this also
brings the API one step closer for a read and write API.

>
>      Andrew

Andrew Lunn March 29, 2018, 7:51 p.m. UTC | #4

> >>Show all of the exposed regions with region sizes:
> >>$ devlink region show
> >>pci/0000:00:05.0/cr-space: size 1048576 snapshot [1 2]
> >So you have 2Mbytes of snapshot data. Is this held in the device, or
> >kernel memory?
> This is allocated in devlink, the maximum number of snapshots is set by the
> driver.

And it seems to want contiguous pages. How well does that work after
the system has been running for a while and memory is fragmented?

> >>Dump a snapshot:
> >>$ devlink region dump pci/0000:00:05.0/fw-health snapshot 1
> >>0000000000000000 0014 95dc 0014 9514 0035 1670 0034 db30
> >>0000000000000010 0000 0000 ffff ff04 0029 8c00 0028 8cc8
> >>0000000000000020 0016 0bb8 0016 1720 0000 0000 c00f 3ffc
> >>0000000000000030 bada cce5 bada cce5 bada cce5 bada cce5
> >>
> >>Read a specific part of a snapshot:
> >>$ devlink region read pci/0000:00:05.0/fw-health snapshot 1 address 0
> >>	length 16
> >>0000000000000000 0014 95dc 0014 9514 0035 1670 0034 db30
> >Why a separate command? It seems to be just a subset of dump.
> 
> This is useful when debugging values on specific addresses, this also
> brings the API one step closer for a read and write API.

The functionality is useful, yes. But why two commands? Why not one
command, dump, which takes optional parameters?

Also, i doubt write support will be accepted. That sounds like the
start of an API to allow a user space driver.

      Andrew

Alex Vesker March 30, 2018, 5:28 a.m. UTC | #5

On 3/29/2018 10:51 PM, Andrew Lunn wrote:
>>>> Show all of the exposed regions with region sizes:
>>>> $ devlink region show
>>>> pci/0000:00:05.0/cr-space: size 1048576 snapshot [1 2]
>>> So you have 2Mbytes of snapshot data. Is this held in the device, or
>>> kernel memory?
>> This is allocated in devlink, the maximum number of snapshots is set by the
>> driver.
> And it seems to want contiguous pages. How well does that work after
> the system has been running for a while and memory is fragmented?

The allocation can be changed, there is no read need for contiguous pages.
It is important to note that we the amount of snapshots is limited by 
the driver
this can be based on the dump size or expected frequency of collection.
I also prefer not to pre-allocate this memory.
>>>> Dump a snapshot:
>>>> $ devlink region dump pci/0000:00:05.0/fw-health snapshot 1
>>>> 0000000000000000 0014 95dc 0014 9514 0035 1670 0034 db30
>>>> 0000000000000010 0000 0000 ffff ff04 0029 8c00 0028 8cc8
>>>> 0000000000000020 0016 0bb8 0016 1720 0000 0000 c00f 3ffc
>>>> 0000000000000030 bada cce5 bada cce5 bada cce5 bada cce5
>>>>
>>>> Read a specific part of a snapshot:
>>>> $ devlink region read pci/0000:00:05.0/fw-health snapshot 1 address 0
>>>> 	length 16
>>>> 0000000000000000 0014 95dc 0014 9514 0035 1670 0034 db30
>>> Why a separate command? It seems to be just a subset of dump.
>> This is useful when debugging values on specific addresses, this also
>> brings the API one step closer for a read and write API.
> The functionality is useful, yes. But why two commands? Why not one
> command, dump, which takes optional parameters?

Dump in devlink means provide all the data, saying dump address x length 
y sounds
confusing.  Do you see this as a critical issue?

> Also, i doubt write support will be accepted. That sounds like the
> start of an API to allow a user space driver.

If this will be an issue we will stay with read access only.

>
>        Andrew

Rahul Lakkireddy March 30, 2018, 9:51 a.m. UTC | #6

On Thursday, March 03/29/18, 2018 at 23:53:43 +0530, Andrew Lunn wrote:
> On Thu, Mar 29, 2018 at 07:07:43PM +0300, Alex Vesker wrote:
> > This is a proposal which will allow access to driver defined address
> > regions using devlink. Each device can create its supported address
> > regions and register them. A device which exposes a region will allow
> > access to it using devlink.
> 
> Hi Alex
> 
> Did you see the work Rahul Lakkireddy has been doing?
> 
> https://patchwork.kernel.org/patch/10305935/
> 
> It seems like these are similar, or at least overlapping. We probably
> want one solution for both.
> 

We're already collecting hardware snapshots when system is live with
ethtool --getdump (which devlink tool is now trying to do).

We are now in the process of adding support to collect hardware
snapshots during kernel panic.

Thanks,
Rahul

Jiri Pirko March 30, 2018, 10:21 a.m. UTC | #7

Thu, Mar 29, 2018 at 09:51:54PM CEST, andrew@lunn.ch wrote:
>> >>Show all of the exposed regions with region sizes:
>> >>$ devlink region show
>> >>pci/0000:00:05.0/cr-space: size 1048576 snapshot [1 2]
>> >So you have 2Mbytes of snapshot data. Is this held in the device, or
>> >kernel memory?
>> This is allocated in devlink, the maximum number of snapshots is set by the
>> driver.
>
>And it seems to want contiguous pages. How well does that work after
>the system has been running for a while and memory is fragmented?
>
>> >>Dump a snapshot:
>> >>$ devlink region dump pci/0000:00:05.0/fw-health snapshot 1
>> >>0000000000000000 0014 95dc 0014 9514 0035 1670 0034 db30
>> >>0000000000000010 0000 0000 ffff ff04 0029 8c00 0028 8cc8
>> >>0000000000000020 0016 0bb8 0016 1720 0000 0000 c00f 3ffc
>> >>0000000000000030 bada cce5 bada cce5 bada cce5 bada cce5
>> >>
>> >>Read a specific part of a snapshot:
>> >>$ devlink region read pci/0000:00:05.0/fw-health snapshot 1 address 0
>> >>	length 16
>> >>0000000000000000 0014 95dc 0014 9514 0035 1670 0034 db30
>> >Why a separate command? It seems to be just a subset of dump.
>> 
>> This is useful when debugging values on specific addresses, this also
>> brings the API one step closer for a read and write API.
>
>The functionality is useful, yes. But why two commands? Why not one
>command, dump, which takes optional parameters?

Between userspace and kernel, this is implemented as a single command.
So this is just a userspace wrapper. I think it is nice to provide clear
commands to the user so he is not confused about what is he doing. Also,
as Alex mentioned, we plan to have write command which will have same
command like args as read. These 2 should be in sync.

>
>Also, i doubt write support will be accepted. That sounds like the
>start of an API to allow a user space driver.

We discussed that on netconf in Seoul and was agreed it is needed.
We have 2 options: Some out of tree crap utils and access via /dev/mem,
or something which is well defined and implemented by in-tree drivers.
Writes will serve for debugging purposes, even tuning and bug hunting
on in production. For that, we need a standard way to do it.

Jiri Pirko March 30, 2018, 10:24 a.m. UTC | #8

Fri, Mar 30, 2018 at 11:51:57AM CEST, rahul.lakkireddy@chelsio.com wrote:
>On Thursday, March 03/29/18, 2018 at 23:53:43 +0530, Andrew Lunn wrote:
>> On Thu, Mar 29, 2018 at 07:07:43PM +0300, Alex Vesker wrote:
>> > This is a proposal which will allow access to driver defined address
>> > regions using devlink. Each device can create its supported address
>> > regions and register them. A device which exposes a region will allow
>> > access to it using devlink.
>> 
>> Hi Alex
>> 
>> Did you see the work Rahul Lakkireddy has been doing?
>> 
>> https://patchwork.kernel.org/patch/10305935/
>> 
>> It seems like these are similar, or at least overlapping. We probably
>> want one solution for both.
>> 
>
>We're already collecting hardware snapshots when system is live with
>ethtool --getdump (which devlink tool is now trying to do).

Ethtool is definitelly a wrong tool for this. It uses netdev as a
handle, but the dumps happen on a parent device - represented by a
devlink instance.
Also, in devlink we have notifications so a deamon can actually listen
on a socket to see if there is new dump available due to a critical
event etc.


>
>We are now in the process of adding support to collect hardware
>snapshots during kernel panic.
>
>Thanks,
>Rahul

Andrew Lunn March 30, 2018, 2:34 p.m. UTC | #9

> >And it seems to want contiguous pages. How well does that work after
> >the system has been running for a while and memory is fragmented?
> 
> The allocation can be changed, there is no read need for contiguous pages.
> It is important to note that we the amount of snapshots is limited by the
> driver
> this can be based on the dump size or expected frequency of collection.
> I also prefer not to pre-allocate this memory.

The driver code also asks for a 1MB contiguous chunk of memory!  You
really should think about this API, how can you avoid double memory
allocations. And can kvmalloc be used. But then you get into the
problem for DMA'ing the memory from the device...

This API also does not scale. 1MB is actually quite small. I'm sure
there is firmware running on CPUs with a lot more than 1MB of RAM.
How well does with API work with 64MB? Say i wanted to snapshot my
GPU? Or the MC/BMC?

> Dump in devlink means provide all the data, saying dump address x length y
> sounds
> confusing.  Do you see this as a critical issue?

No, i don't. But nearly every set of tools i've used has one command.
eg uboot, coreboot, gdb, od, hexdump. Even ethtool has [offset N] [length N]

How many tools can you name which have two different command, rather
than one with options?

      Andrew

David Ahern March 30, 2018, 4:57 p.m. UTC | #10

On 3/30/18 8:34 AM, Andrew Lunn wrote:
>>> And it seems to want contiguous pages. How well does that work after
>>> the system has been running for a while and memory is fragmented?
>>
>> The allocation can be changed, there is no read need for contiguous pages.
>> It is important to note that we the amount of snapshots is limited by the
>> driver
>> this can be based on the dump size or expected frequency of collection.
>> I also prefer not to pre-allocate this memory.
> 
> The driver code also asks for a 1MB contiguous chunk of memory!  You
> really should think about this API, how can you avoid double memory
> allocations. And can kvmalloc be used. But then you get into the
> problem for DMA'ing the memory from the device...
> 
> This API also does not scale. 1MB is actually quite small. I'm sure
> there is firmware running on CPUs with a lot more than 1MB of RAM.
> How well does with API work with 64MB? Say i wanted to snapshot my
> GPU? Or the MC/BMC?
> 

That and the drivers control the number of snapshots. The user should be
able to control the number of snapshots, and an option to remove all
snapshots to free up that memory.

David Miller March 30, 2018, 6:07 p.m. UTC | #11

From: Alex Vesker <valex@mellanox.com>
Date: Fri, 30 Mar 2018 08:28:39 +0300

> On 3/29/2018 10:51 PM, Andrew Lunn wrote:
>> Also, i doubt write support will be accepted. That sounds like the
>> start of an API to allow a user space driver.
> 
> If this will be an issue we will stay with read access only.

Because of registers which are accessed indirectly, it's hard to avoid
supporting write support in some way.

This interface is not for providing a way to do userland drivers, it's
for diagnostics only.  And indeed we did discuss this at netconf and
we had broad concensus on this matter at the time.

David Miller March 30, 2018, 6:07 p.m. UTC | #12

From: Andrew Lunn <andrew@lunn.ch>
Date: Thu, 29 Mar 2018 21:51:54 +0200

> And it seems to want contiguous pages. How well does that work after
> the system has been running for a while and memory is fragmented?

Indeed this will be a problem.

Alex Vesker March 30, 2018, 7:39 p.m. UTC | #13

On 3/30/2018 7:57 PM, David Ahern wrote:
> On 3/30/18 8:34 AM, Andrew Lunn wrote:
>>>> And it seems to want contiguous pages. How well does that work after
>>>> the system has been running for a while and memory is fragmented?
>>> The allocation can be changed, there is no read need for contiguous pages.
>>> It is important to note that we the amount of snapshots is limited by the
>>> driver
>>> this can be based on the dump size or expected frequency of collection.
>>> I also prefer not to pre-allocate this memory.
>> The driver code also asks for a 1MB contiguous chunk of memory!  You
>> really should think about this API, how can you avoid double memory
>> allocations. And can kvmalloc be used. But then you get into the
>> problem for DMA'ing the memory from the device...
>>
>> This API also does not scale. 1MB is actually quite small. I'm sure
>> there is firmware running on CPUs with a lot more than 1MB of RAM.
>> How well does with API work with 64MB? Say i wanted to snapshot my
>> GPU? Or the MC/BMC?
>>
> That and the drivers control the number of snapshots. The user should be
> able to control the number of snapshots, and an option to remove all
> snapshots to free up that memory.

There is an option to free up this memory, using a delete command.
The reason I added the option to control the number of snapshots from
the driver side only is because the driver knows the size of the snapshots
and when/why they will be taken.
For example in our mlx4 driver the snapshots are taken on rare failures,
the snapshot is quite large and from past analyses the first dump is usually
the important one, this means that 8 is more than enough in my case.
If a user wants more than that he can always monitor notification read
the snapshot and delete once backup-ed, there is no reason for keeping
all of this data in the kernel.

David Ahern March 30, 2018, 10:26 p.m. UTC | #14

On 3/30/18 1:39 PM, Alex Vesker wrote:
> 
> 
> On 3/30/2018 7:57 PM, David Ahern wrote:
>> On 3/30/18 8:34 AM, Andrew Lunn wrote:
>>>>> And it seems to want contiguous pages. How well does that work after
>>>>> the system has been running for a while and memory is fragmented?
>>>> The allocation can be changed, there is no read need for contiguous
>>>> pages.
>>>> It is important to note that we the amount of snapshots is limited
>>>> by the
>>>> driver
>>>> this can be based on the dump size or expected frequency of collection.
>>>> I also prefer not to pre-allocate this memory.
>>> The driver code also asks for a 1MB contiguous chunk of memory!  You
>>> really should think about this API, how can you avoid double memory
>>> allocations. And can kvmalloc be used. But then you get into the
>>> problem for DMA'ing the memory from the device...
>>>
>>> This API also does not scale. 1MB is actually quite small. I'm sure
>>> there is firmware running on CPUs with a lot more than 1MB of RAM.
>>> How well does with API work with 64MB? Say i wanted to snapshot my
>>> GPU? Or the MC/BMC?
>>>
>> That and the drivers control the number of snapshots. The user should be
>> able to control the number of snapshots, and an option to remove all
>> snapshots to free up that memory.
> 
> There is an option to free up this memory, using a delete command.
> The reason I added the option to control the number of snapshots from
> the driver side only is because the driver knows the size of the snapshots
> and when/why they will be taken.
> For example in our mlx4 driver the snapshots are taken on rare failures,
> the snapshot is quite large and from past analyses the first dump is
> usually
> the important one, this means that 8 is more than enough in my case.
> If a user wants more than that he can always monitor notification read
> the snapshot and delete once backup-ed, there is no reason for keeping
> all of this data in the kernel.
> 
> 

I was thinking less. ie., a user says keep only 1 or 2 snapshots or
disable snapshots altogether.

Alex Vesker March 31, 2018, 6:11 a.m. UTC | #15

On 3/31/2018 1:26 AM, David Ahern wrote:
> On 3/30/18 1:39 PM, Alex Vesker wrote:
>>
>> On 3/30/2018 7:57 PM, David Ahern wrote:
>>> On 3/30/18 8:34 AM, Andrew Lunn wrote:
>>>>>> And it seems to want contiguous pages. How well does that work after
>>>>>> the system has been running for a while and memory is fragmented?
>>>>> The allocation can be changed, there is no read need for contiguous
>>>>> pages.
>>>>> It is important to note that we the amount of snapshots is limited
>>>>> by the
>>>>> driver
>>>>> this can be based on the dump size or expected frequency of collection.
>>>>> I also prefer not to pre-allocate this memory.
>>>> The driver code also asks for a 1MB contiguous chunk of memory!  You
>>>> really should think about this API, how can you avoid double memory
>>>> allocations. And can kvmalloc be used. But then you get into the
>>>> problem for DMA'ing the memory from the device...
>>>>
>>>> This API also does not scale. 1MB is actually quite small. I'm sure
>>>> there is firmware running on CPUs with a lot more than 1MB of RAM.
>>>> How well does with API work with 64MB? Say i wanted to snapshot my
>>>> GPU? Or the MC/BMC?
>>>>
>>> That and the drivers control the number of snapshots. The user should be
>>> able to control the number of snapshots, and an option to remove all
>>> snapshots to free up that memory.
>> There is an option to free up this memory, using a delete command.
>> The reason I added the option to control the number of snapshots from
>> the driver side only is because the driver knows the size of the snapshots
>> and when/why they will be taken.
>> For example in our mlx4 driver the snapshots are taken on rare failures,
>> the snapshot is quite large and from past analyses the first dump is
>> usually
>> the important one, this means that 8 is more than enough in my case.
>> If a user wants more than that he can always monitor notification read
>> the snapshot and delete once backup-ed, there is no reason for keeping
>> all of this data in the kernel.
>>
>>
> I was thinking less. ie., a user says keep only 1 or 2 snapshots or
> disable snapshots altogether.
Devlink configuration is not persistent if the driver is reloaded, currently
there is no way to sync this. One or two might not be enough time to
read, delete and make room for the next one, as I said each driver should
do its calculations here based on frequency, size and even the time it takes
capturing it. The user can't know if one snapshot is enough for debug
I saw cases in which debug requires more than one snapshot to make
sure a health clock is incremented and the FW is alive.

I want to be able to login to a customer and accessing this snapshot
without any previous configuration from the user and not asking for
enabling the feature and then waiting for a repro...this will help
debugging issues that are hard to reproduce, I don't see any reason
to disable this.

Andrew Lunn March 31, 2018, 3:53 p.m. UTC | #16

> I want to be able to login to a customer and accessing this snapshot
> without any previous configuration from the user and not asking for
> enabling the feature and then waiting for a repro...this will help
> debugging issues that are hard to reproduce, I don't see any reason
> to disable this.

The likely reality is 99.9% of these snapshots will never be seen or
used. But they take up memory sitting there doing nothing. And if the
snapshot is 2GB, that is a lot of memory. I expect a system admin
wants to be able to choose to enable this feature or not, because of
that memory. You should also consider implementing the memory pressure
callbacks, so you can discard snapshots, rather than OOM the machine.

	   Andrew

David Ahern March 31, 2018, 5:21 p.m. UTC | #17

On 3/31/18 9:53 AM, Andrew Lunn wrote:
>> I want to be able to login to a customer and accessing this snapshot
>> without any previous configuration from the user and not asking for
>> enabling the feature and then waiting for a repro...this will help
>> debugging issues that are hard to reproduce, I don't see any reason
>> to disable this.
> 
> The likely reality is 99.9% of these snapshots will never be seen or
> used. But they take up memory sitting there doing nothing. And if the
> snapshot is 2GB, that is a lot of memory. I expect a system admin
> wants to be able to choose to enable this feature or not, because of
> that memory. You should also consider implementing the memory pressure
> callbacks, so you can discard snapshots, rather than OOM the machine.
> 

That is exactly my point. Nobody wants one rogue device triggering
snapshots, consuming system resources and with no options to disable it.

Alex Vesker April 4, 2018, 11:07 a.m. UTC | #18

On 3/31/2018 8:21 PM, David Ahern wrote:
> On 3/31/18 9:53 AM, Andrew Lunn wrote:
>>> I want to be able to login to a customer and accessing this snapshot
>>> without any previous configuration from the user and not asking for
>>> enabling the feature and then waiting for a repro...this will help
>>> debugging issues that are hard to reproduce, I don't see any reason
>>> to disable this.
>> The likely reality is 99.9% of these snapshots will never be seen or
>> used. But they take up memory sitting there doing nothing. And if the
>> snapshot is 2GB, that is a lot of memory. I expect a system admin
>> wants to be able to choose to enable this feature or not, because of
>> that memory. You should also consider implementing the memory pressure
>> callbacks, so you can discard snapshots, rather than OOM the machine.
>>
> That is exactly my point. Nobody wants one rogue device triggering
> snapshots, consuming system resources and with no options to disable it.


OK, currently there is a task to support persistent/permanent configuration
to devlink. Once this support is in I will add my code on top of it.
This will allow a user to enable the snapshot functionality on the driver.
Regarding the double continuous allocation of memory, I will fix to a single
vmalloc on the driver in case of adding a snapshot. Tell me what you think.

[net-next,0/9] devlink: Add support for region access

Message

Comments