Message ID: 20200817125059.193242-1-idosch@idosch.org
Series: devlink: Add device metric support
On Mon, 17 Aug 2020 15:50:53 +0300 Ido Schimmel wrote:
> From: Ido Schimmel <idosch@nvidia.com>
>
> This patch set extends devlink to allow device drivers to expose device
> metrics to user space in a standard and extensible fashion, as opposed
> to the driver-specific debugfs approach.

I feel like all those loose hardware interfaces are a huge maintenance burden. I don't know what the solution is, but the status quo is not great.

I spend way too much time patrolling ethtool -S outputs already. I've done my absolute best to make sure that something as simple as updating device firmware can be done in a vendor-agnostic fashion, and I'm not sure I've succeeded. Every single vendor comes up with their own twists.

Long story short I'm extremely unexcited about another interface where drivers expose random strings of their picking. Maybe udp_tunnel module can have "global" stats?

This would be a good topic for netconf, or LPC hallway discussion :(
On 8/18/20 6:24 PM, Jakub Kicinski wrote:
> On Mon, 17 Aug 2020 15:50:53 +0300 Ido Schimmel wrote:
>> From: Ido Schimmel <idosch@nvidia.com>
>>
>> This patch set extends devlink to allow device drivers to expose device
>> metrics to user space in a standard and extensible fashion, as opposed
>> to the driver-specific debugfs approach.
>
> I feel like all those loose hardware interfaces are a huge maintenance
> burden. I don't know what the solution is, but the status quo is not
> great.

I don't agree with the 'loose' characterization. Ido and team are pushing what is arguably a modern version of `ethtool -S`, so it provides a better API for retrieving data.

> I spend way too much time patrolling ethtool -S outputs already.

But that's the nature of detailed stats which are often essential to ensuring the system is operating as expected or debugging some problem. Commonality is certainly desired in names when relevant to be able to build tooling around the stats.

As an example, per-queue stats have been essential to me for recent investigations. ethq has been really helpful in crossing NIC vendors and viewing those stats as it handles the per-vendor naming differences, but it requires changes to show anything else - errors per queue, xdp stats, drops, etc. This part could be simpler.

As for this set, I believe the metrics exposed here are more unique to switch ASICs. At least one company I know of has built a business model around exposing detailed telemetry of switch ASICs, so clearly some find them quite valuable.
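The per-vendor glue a tool like ethq has to carry can be sketched in a few lines. The driver name patterns below are purely illustrative (real ethtool -S strings vary even more than this); the point is that the normalization table has to grow for every vendor and every new counter:

```python
import re

# Illustrative per-vendor spellings of the same per-queue counter in
# ethtool -S output; real driver strings vary even more than this.
VENDOR_PATTERNS = [
    re.compile(r"^rx_queue_(?P<q>\d+)_packets$"),  # one common scheme
    re.compile(r"^rx(?P<q>\d+)_packets$"),         # another common scheme
    re.compile(r"^rx-(?P<q>\d+)\.packets$"),       # hypothetical third scheme
]

def normalize(stat_name):
    """Map a vendor-specific per-queue rx packet counter name to a
    canonical (queue, metric) pair, or None if unrecognized."""
    for pat in VENDOR_PATTERNS:
        m = pat.match(stat_name)
        if m:
            return (int(m.group("q")), "rx_packets")
    return None

print(normalize("rx_queue_3_packets"))  # (3, 'rx_packets')
print(normalize("rx7_packets"))         # (7, 'rx_packets')
print(normalize("vendor_magic_42"))     # None
```

Any counter the table does not know about ("errors per queue, xdp stats, drops, etc.") falls through to None, which is exactly the "requires changes to show anything else" problem described above.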
On Tue, 18 Aug 2020 20:43:11 -0600 David Ahern wrote:
> On 8/18/20 6:24 PM, Jakub Kicinski wrote:
> > On Mon, 17 Aug 2020 15:50:53 +0300 Ido Schimmel wrote:
> >> From: Ido Schimmel <idosch@nvidia.com>
> >>
> >> This patch set extends devlink to allow device drivers to expose device
> >> metrics to user space in a standard and extensible fashion, as opposed
> >> to the driver-specific debugfs approach.
> >
> > I feel like all those loose hardware interfaces are a huge maintenance
> > burden. I don't know what the solution is, but the status quo is not
> > great.
>
> I don't agree with the 'loose' characterization.

Loose as in not bound by any standard or best practices.

> Ido and team are pushing what is arguably a modern version of
> `ethtool -S`, so it provides a better API for retrieving data.

ethtool -S is absolutely terrible. Everybody comes up with their own names for IEEE stats, and dumps stats which clearly have corresponding fields in rtnl_link_stats64 there. We don't need a modern ethtool -S, we need to get away from that mess.

> > I spend way too much time patrolling ethtool -S outputs already.
>
> But that's the nature of detailed stats which are often essential to
> ensuring the system is operating as expected or debugging some problem.
> Commonality is certainly desired in names when relevant to be able to
> build tooling around the stats.

There are stats which are clearly detailed and device specific, but what ends up happening is that people expose very much not implementation specific stats through the free form interfaces, because it's the easiest.

And users are left picking up the pieces, having to ask vendors what each stat means, and trying to create abstractions in their user space glue.

> As an example, per-queue stats have been
> essential to me for recent investigations. ethq has been really helpful
> in crossing NIC vendors and viewing those stats as it handles the
> per-vendor naming differences, but it requires changes to show anything
> else - errors per queue, xdp stats, drops, etc. This part could be simpler.

Sounds like you're agreeing with me?

> As for this set, I believe the metrics exposed here are more unique to
> switch ASICs.

This is the list from patch 6:

 * - ``nve_vxlan_encap``
 * - ``nve_vxlan_decap``
 * - ``nve_vxlan_decap_errors``
 * - ``nve_vxlan_decap_discards``

What's so unique?

> At least one company I know of has built a business model
> around exposing detailed telemetry of switch ASICs, so clearly some find
> them quite valuable.

It's a question of interface, not the value of exposed data.

If I have to download vendor documentation and tooling, or adapt my own scripts for every new vendor, I could have as well downloaded an SDK.
On 8/18/2020 8:35 PM, Jakub Kicinski wrote:
> On Tue, 18 Aug 2020 20:43:11 -0600 David Ahern wrote:
>> On 8/18/20 6:24 PM, Jakub Kicinski wrote:
>>> On Mon, 17 Aug 2020 15:50:53 +0300 Ido Schimmel wrote:
>>>> From: Ido Schimmel <idosch@nvidia.com>
>>>>
>>>> This patch set extends devlink to allow device drivers to expose device
>>>> metrics to user space in a standard and extensible fashion, as opposed
>>>> to the driver-specific debugfs approach.
>>>
>>> I feel like all those loose hardware interfaces are a huge maintenance
>>> burden. I don't know what the solution is, but the status quo is not
>>> great.
>>
>> I don't agree with the 'loose' characterization.
>
> Loose as in not bound by any standard or best practices.
>
>> Ido and team are pushing what is arguably a modern version of
>> `ethtool -S`, so it provides a better API for retrieving data.
>
> ethtool -S is absolutely terrible. Everybody comes up with their own
> names for IEEE stats, and dumps stats which clearly have corresponding
> fields in rtnl_link_stats64 there. We don't need a modern ethtool -S,
> we need to get away from that mess.
>
>>> I spend way too much time patrolling ethtool -S outputs already.
>>
>> But that's the nature of detailed stats which are often essential to
>> ensuring the system is operating as expected or debugging some problem.
>> Commonality is certainly desired in names when relevant to be able to
>> build tooling around the stats.
>
> There are stats which are clearly detailed and device specific,
> but what ends up happening is that people expose very much not
> implementation specific stats through the free form interfaces,
> because it's the easiest.
>
> And users are left picking up the pieces, having to ask vendors what
> each stat means, and trying to create abstractions in their user space
> glue.

Should we require vendors to either provide a Documentation/ entry for each statistic they have (and be guaranteed that it will be outdated unless someone notices), or would you rather have the statistics description be part of the devlink interface itself? Should we define namespaces such that standard metrics should be under the standard namespace and the vendor namespace is the wild west?

>> As an example, per-queue stats have been
>> essential to me for recent investigations. ethq has been really helpful
>> in crossing NIC vendors and viewing those stats as it handles the
>> per-vendor naming differences, but it requires changes to show anything
>> else - errors per queue, xdp stats, drops, etc. This part could be simpler.
>
> Sounds like you're agreeing with me?
>
>> As for this set, I believe the metrics exposed here are more unique to
>> switch ASICs.
>
> This is the list from patch 6:
>
> * - ``nve_vxlan_encap``
> * - ``nve_vxlan_decap``
> * - ``nve_vxlan_decap_errors``
> * - ``nve_vxlan_decap_discards``
>
> What's so unique?
>
>> At least one company I know of has built a business model
>> around exposing detailed telemetry of switch ASICs, so clearly some find
>> them quite valuable.
>
> It's a question of interface, not the value of exposed data.
>
> If I have to download vendor documentation and tooling, or adapt my own
> scripts for every new vendor, I could have as well downloaded an SDK.

Aren't you being a bit overdramatic here with your example? At least you can run the same command to obtain the stats regardless of the driver and vendor, so from that perspective Linux continues to be the abstraction, and that is not broken.
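The namespace idea above could be sketched roughly like this. The "standard/" and "vendor/" prefix convention and the metric names are hypothetical illustrations, not anything the series or the thread actually defines:

```python
# Hypothetical two-namespace convention: metrics under "standard/" must
# come from a vetted list; anything else must be explicitly "vendor/".
STANDARD_METRICS = {
    "standard/vxlan_encap",
    "standard/vxlan_decap",
    "standard/vxlan_decap_errors",
}

def classify(name):
    """Return which namespace a metric belongs to, rejecting
    free-form names that dodge both namespaces."""
    if name in STANDARD_METRICS:
        return "standard"
    if name.startswith("vendor/"):
        return "vendor"
    raise ValueError(f"unnamespaced metric: {name}")

print(classify("standard/vxlan_decap"))    # standard
print(classify("vendor/spc1_cache_miss"))  # vendor
```

The enforcement point is the ValueError: a metric that is neither vetted as standard nor explicitly marked vendor-specific would be rejected at registration time rather than by a reviewer.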
On Tue, 18 Aug 2020 21:30:16 -0700 Florian Fainelli wrote:
> >>> I spend way too much time patrolling ethtool -S outputs already.
> >>
> >> But that's the nature of detailed stats which are often essential to
> >> ensuring the system is operating as expected or debugging some problem.
> >> Commonality is certainly desired in names when relevant to be able to
> >> build tooling around the stats.
> >
> > There are stats which are clearly detailed and device specific,
> > but what ends up happening is that people expose very much not
> > implementation specific stats through the free form interfaces,
> > because it's the easiest.
> >
> > And users are left picking up the pieces, having to ask vendors what
> > each stat means, and trying to create abstractions in their user space
> > glue.
>
> Should we require vendors to either provide a Documentation/ entry for
> each statistic they have (and be guaranteed that it will be outdated
> unless someone notices), or would you rather have the statistics
> description be part of the devlink interface itself? Should we define
> namespaces such that standard metrics should be under the standard
> namespace and the vendor namespace is the wild west?

I'm trying to find a solution which will not require a policeman to constantly monitor compliance. Please see my effort to ensure drivers document and use the same ethtool -S stats in the TLS offload implementations. I've been trying to improve this situation for a long time, and it's getting old.

Please focus on the stats this set adds, instead of fantasizing about what could be. These are absolutely not implementation specific!

> > If I have to download vendor documentation and tooling, or adapt my own
> > scripts for every new vendor, I could have as well downloaded an SDK.
>
> Aren't you being a bit overdramatic here with your example?

I hope not. It's very hard/impossible today to run a fleet of Linux machines without resorting to vendor tooling.

> At least you can run the same command to obtain the stats regardless
> of the driver and vendor, so from that perspective Linux continues to
> be the abstraction and that is not broken.

Format of the data is no abstraction.
On 8/19/20 9:18 AM, Jakub Kicinski wrote:
> On Tue, 18 Aug 2020 21:30:16 -0700 Florian Fainelli wrote:
>>>>> I spend way too much time patrolling ethtool -S outputs already.
>>>>
>>>> But that's the nature of detailed stats which are often essential to
>>>> ensuring the system is operating as expected or debugging some problem.
>>>> Commonality is certainly desired in names when relevant to be able to
>>>> build tooling around the stats.
>>>
>>> There are stats which are clearly detailed and device specific,
>>> but what ends up happening is that people expose very much not
>>> implementation specific stats through the free form interfaces,
>>> because it's the easiest.
>>>
>>> And users are left picking up the pieces, having to ask vendors what
>>> each stat means, and trying to create abstractions in their user space
>>> glue.
>>
>> Should we require vendors to either provide a Documentation/ entry for
>> each statistic they have (and be guaranteed that it will be outdated
>> unless someone notices), or would you rather have the statistics
>> description be part of the devlink interface itself? Should we define
>> namespaces such that standard metrics should be under the standard
>> namespace and the vendor namespace is the wild west?
>
> I'm trying to find a solution which will not require a policeman to
> constantly monitor compliance. Please see my effort to ensure
> drivers document and use the same ethtool -S stats in the TLS offload
> implementations. I've been trying to improve this situation for a long
> time, and it's getting old.

Which is why I am genuinely asking what you think should be done besides doing more code reviews. It does not seem to me that there is an easy way to catch new stats being added with tools/scripts/whatever and then determine what they are about, right?

> Please focus on the stats this set adds, instead of fantasizing about what
> could be. These are absolutely not implementation specific!

Not sure "fantasizing" is quite the word I would use. I am just pointing out that, given the inability to standardize on statistics, maybe we should have namespaces and try our best to have everything fit into the standard namespace along with a standard set of names, and push back whenever we see vendor stats being added (or, more pragmatically, ask what they are). But maybe this very idea is moot.

>>> If I have to download vendor documentation and tooling, or adapt my own
>>> scripts for every new vendor, I could have as well downloaded an SDK.
>>
>> Aren't you being a bit overdramatic here with your example?
>
> I hope not. It's very hard/impossible today to run a fleet of Linux
> machines without resorting to vendor tooling.

Your argument was putting resorting to vendor tooling to extract meaningful statistics/counters on the same level as using an SDK to operate the hardware (this is how I understood it), and I do not believe this is fair.

>> At least you can run the same command to obtain the stats regardless
>> of the driver and vendor, so from that perspective Linux continues to
>> be the abstraction and that is not broken.
>
> Format of the data is no abstraction.
On Wed, 19 Aug 2020 10:20:08 -0700 Florian Fainelli wrote:
> > I'm trying to find a solution which will not require a policeman to
> > constantly monitor compliance. Please see my effort to ensure
> > drivers document and use the same ethtool -S stats in the TLS offload
> > implementations. I've been trying to improve this situation for a long
> > time, and it's getting old.
>
> Which is why I am genuinely asking what you think should be done
> besides doing more code reviews. It does not seem to me that there is an
> easy way to catch new stats being added with tools/scripts/whatever and
> then determine what they are about, right?

I don't have a great way forward in mind, sadly. All I can think of is that we should try to create more well-defined interfaces and steer away from free-form ones.

Example, here if the stats are vxlan decap/encap/error - we should expose that from the vxlan module. That way the vxlan module defines one set of stats for everyone.

In general, unless we attach stats to the object they relate to, we will end up building parallel structures for exposing statistics from the drivers. I posted a set once which was implementing hierarchical stats, but I've abandoned it for this reason.

> > Please focus on the stats this set adds, instead of fantasizing about what
> > could be. These are absolutely not implementation specific!
>
> Not sure "fantasizing" is quite the word I would use. I am just pointing
> out that, given the inability to standardize on statistics, maybe we
> should have namespaces and try our best to have everything fit into the
> standard namespace along with a standard set of names, and push back
> whenever we see vendor stats being added (or, more pragmatically, ask
> what they are). But maybe this very idea is moot.

IDK. I just don't feel like this is going to fly; see how many names people invented for the CRC error statistic in ethtool -S, even tho there is a standard stat for that! And users are actually parsing the output of ethtool -S to get CRC stats because (a) it became the go-to place for NIC stats and (b) some drivers forget to report in the standard place.

The cover letter says this set replaces the bad debugfs with a good, standard API. It may look good and standard for _vendors_ because they will know where to dump their counters, but it makes very little difference for _users_. If I have to parse names for every vendor I use, I can as well add a per-vendor debugfs path to my script.

The bar for implementation-specific driver stats has to be high.

> >>> If I have to download vendor documentation and tooling, or adapt my own
> >>> scripts for every new vendor, I could have as well downloaded an SDK.
> >>
> >> Aren't you being a bit overdramatic here with your example?
> >
> > I hope not. It's very hard/impossible today to run a fleet of Linux
> > machines without resorting to vendor tooling.
>
> Your argument was putting resorting to vendor tooling to extract
> meaningful statistics/counters on the same level as using an SDK to
> operate the hardware (this is how I understood it), and I do not believe
> this is fair.

Okay, fair. I just think that in datacenter deployments we are way closer to the SDK model than people may want to admit.
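One reading of "the vxlan module defines one set of stats for everyone" is that the common module, not each driver, owns the counter schema, so a driver can fill the fields but cannot invent new names. The four counter names below come from patch 6 of the series; the structure and helper are a hypothetical sketch, not anything the kernel defines:

```python
from dataclasses import dataclass

# Sketch: the common vxlan code owns a fixed offload-counter schema; a
# driver can only fill these fields, it cannot invent new names. The
# four counter names come from patch 6; the structure is hypothetical.
@dataclass
class VxlanHwStats:
    encap: int = 0
    decap: int = 0
    decap_errors: int = 0
    decap_discards: int = 0

def report(stats):
    # One spelling for every vendor: user tooling parses a single schema.
    return {f"vxlan_{k}": v for k, v in vars(stats).items()}

print(report(VxlanHwStats(encap=10, decap=8, decap_errors=1)))
```

Under such a scheme the "many names for the CRC error statistic" problem cannot arise for these counters, because a driver that wants to report a counter the schema lacks has to extend the schema, in one place, for everyone.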
On 8/19/20 12:07 PM, Jakub Kicinski wrote:
> On Wed, 19 Aug 2020 10:20:08 -0700 Florian Fainelli wrote:
>>> I'm trying to find a solution which will not require a policeman to
>>> constantly monitor compliance. Please see my effort to ensure
>>> drivers document and use the same ethtool -S stats in the TLS offload
>>> implementations. I've been trying to improve this situation for a long
>>> time, and it's getting old.
>>
>> Which is why I am genuinely asking what you think should be done
>> besides doing more code reviews. It does not seem to me that there is an
>> easy way to catch new stats being added with tools/scripts/whatever and
>> then determine what they are about, right?
>
> I don't have a great way forward in mind, sadly. All I can think of is
> that we should try to create more well-defined interfaces and steer
> away from free-form ones.

There is a lot of value in free-form too.

> Example, here if the stats are vxlan decap/encap/error - we should
> expose that from the vxlan module. That way the vxlan module defines one
> set of stats for everyone.
>
> In general, unless we attach stats to the object they relate to, we will
> end up building parallel structures for exposing statistics from the
> drivers. I posted a set once which was implementing hierarchical stats,
> but I've abandoned it for this reason.
>
>> Not sure "fantasizing" is quite the word I would use. I am just pointing
>> out that, given the inability to standardize on statistics, maybe we
>> should have namespaces and try our best to have everything fit into the
>> standard namespace along with a standard set of names, and push back
>> whenever we see vendor stats being added (or, more pragmatically, ask
>> what they are). But maybe this very idea is moot.
>
> IDK. I just don't feel like this is going to fly; see how many names
> people invented for the CRC error statistic in ethtool -S, even tho
> there is a standard stat for that! And users are actually parsing the
> output of ethtool -S to get CRC stats because (a) it became the go-to
> place for NIC stats and (b) some drivers forget to report in the
> standard place.
>
> The cover letter says this set replaces the bad debugfs with a good,
> standard API. It may look good and standard for _vendors_ because they
> will know where to dump their counters, but it makes very little
> difference for _users_. If I have to parse names for every vendor I use,
> I can as well add a per-vendor debugfs path to my script.
>
> The bar for implementation-specific driver stats has to be high.

My takeaway from this is that you do not like the names - the strings side of it. Do you object to the netlink API? The netlink API via devlink?

'perf' has json files to describe and document counters (tools/perf/pmu-events). Would something like that be acceptable as a form of in-tree documentation of counters? (vs Documentation/networking or URLs like https://community.mellanox.com/s/article/understanding-mlx5-ethtool-counters)

>>>>> If I have to download vendor documentation and tooling, or adapt my own
>>>>> scripts for every new vendor, I could have as well downloaded an SDK.
>>>>
>>>> Aren't you being a bit overdramatic here with your example?
>>>
>>> I hope not. It's very hard/impossible today to run a fleet of Linux
>>> machines without resorting to vendor tooling.
>>
>> Your argument was putting resorting to vendor tooling to extract
>> meaningful statistics/counters on the same level as using an SDK to
>> operate the hardware (this is how I understood it), and I do not believe
>> this is fair.
>
> Okay, fair. I just think that in datacenter deployments we are way
> closer to the SDK model than people may want to admit.

I do not agree with that; the SDK model means you *must* use vendor code to make something work. Your argument here is about labels for stats and an understanding of their meaning.
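The pmu-events idea above might look like the following for the counters in patch 6. The JSON field names are borrowed loosely from the tools/perf/pmu-events style and the counter descriptions are illustrative, not taken from the series:

```python
# Hypothetical perf-style description entries for the counters in
# patch 6, loosely modelled on tools/perf/pmu-events JSON. The field
# names and the descriptions here are illustrative, not authoritative.
COUNTER_DOCS = [
    {"MetricName": "nve_vxlan_decap_errors",
     "BriefDescription": "VXLAN packets dropped due to a decapsulation error."},
    {"MetricName": "nve_vxlan_decap_discards",
     "BriefDescription": "Decapsulated VXLAN packets discarded by the VTEP."},
]

def describe(name, docs=COUNTER_DOCS):
    """Look up the human-readable description of a counter;
    None flags an undocumented counter that review should catch."""
    for entry in docs:
        if entry["MetricName"] == name:
            return entry["BriefDescription"]
    return None

print(describe("nve_vxlan_decap_errors"))
print(describe("totally_undocumented"))  # None
```

A file like this living in-tree next to the driver would at least make "undocumented counter" mechanically detectable, which Documentation/ prose and vendor URLs are not.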
On Thu, 20 Aug 2020 08:35:25 -0600 David Ahern wrote:
> On 8/19/20 12:07 PM, Jakub Kicinski wrote:
> > I don't have a great way forward in mind, sadly. All I can think of is
> > that we should try to create more well-defined interfaces and steer
> > away from free-form ones.
>
> There is a lot of value in free-form too.

On Tue, 18 Aug 2020 20:35:01 -0700 Jakub Kicinski wrote:
> It's a question of interface, not the value of exposed data.

> > Example, here if the stats are vxlan decap/encap/error - we should
> > expose that from the vxlan module. That way the vxlan module defines one
> > set of stats for everyone.
> >
> > In general, unless we attach stats to the object they relate to, we will
> > end up building parallel structures for exposing statistics from the
> > drivers. I posted a set once which was implementing hierarchical stats,
> > but I've abandoned it for this reason.
>
> [...]
>
> > IDK. I just don't feel like this is going to fly; see how many names
> > people invented for the CRC error statistic in ethtool -S, even tho
> > there is a standard stat for that! And users are actually parsing the
> > output of ethtool -S to get CRC stats because (a) it became the go-to
> > place for NIC stats and (b) some drivers forget to report in the
> > standard place.
> >
> > The cover letter says this set replaces the bad debugfs with a good,
> > standard API. It may look good and standard for _vendors_ because they
> > will know where to dump their counters, but it makes very little
> > difference for _users_. If I have to parse names for every vendor I use,
> > I can as well add a per-vendor debugfs path to my script.
> >
> > The bar for implementation-specific driver stats has to be high.
>
> My takeaway from this is that you do not like the names - the strings side
> of it.
>
> Do you object to the netlink API? The netlink API via devlink?
>
> 'perf' has json files to describe and document counters
> (tools/perf/pmu-events). Would something like that be acceptable as a
> form of in-tree documentation of counters? (vs Documentation/networking
> or URLs like
> https://community.mellanox.com/s/article/understanding-mlx5-ethtool-counters)

Please refer to what I said twice now about the definition of the stats exposed here belonging with the VxLAN code, not the driver.

> > Okay, fair. I just think that in datacenter deployments we are way
> > closer to the SDK model than people may want to admit.
>
> I do not agree with that; the SDK model means you *must* use vendor code
> to make something work. Your argument here is about labels for stats and
> an understanding of their meaning.

Sure, no "must" for passing packets, but you "must" use vendor tooling to operate a fleet.

Since everybody already has vendor tools, what value does this API add? I still need per-vendor logic. Let's try to build APIs which will actually make users' lives easier, which users will want to switch to.
On Thu, Aug 20, 2020 at 09:09:42AM -0700, Jakub Kicinski wrote:
> On Thu, 20 Aug 2020 08:35:25 -0600 David Ahern wrote:
> > On 8/19/20 12:07 PM, Jakub Kicinski wrote:
> > > I don't have a great way forward in mind, sadly. All I can think of is
> > > that we should try to create more well-defined interfaces and steer
> > > away from free-form ones.
> >
> > There is a lot of value in free-form too.
>
> On Tue, 18 Aug 2020 20:35:01 -0700 Jakub Kicinski wrote:
> > It's a question of interface, not the value of exposed data.
>
> > > Example, here if the stats are vxlan decap/encap/error - we should
> > > expose that from the vxlan module. That way the vxlan module defines one
> > > set of stats for everyone.
> > >
> > > In general, unless we attach stats to the object they relate to, we will
> > > end up building parallel structures for exposing statistics from the
> > > drivers. I posted a set once which was implementing hierarchical stats,
> > > but I've abandoned it for this reason.
> >
> > [...]
> >
> > > IDK. I just don't feel like this is going to fly; see how many names
> > > people invented for the CRC error statistic in ethtool -S, even tho
> > > there is a standard stat for that! And users are actually parsing the
> > > output of ethtool -S to get CRC stats because (a) it became the go-to
> > > place for NIC stats and (b) some drivers forget to report in the
> > > standard place.
> > >
> > > The cover letter says this set replaces the bad debugfs with a good,
> > > standard API. It may look good and standard for _vendors_ because they
> > > will know where to dump their counters, but it makes very little
> > > difference for _users_. If I have to parse names for every vendor I use,
> > > I can as well add a per-vendor debugfs path to my script.
> > >
> > > The bar for implementation-specific driver stats has to be high.
> >
> > My takeaway from this is that you do not like the names - the strings side
> > of it.
> >
> > Do you object to the netlink API? The netlink API via devlink?
> >
> > 'perf' has json files to describe and document counters
> > (tools/perf/pmu-events). Would something like that be acceptable as a
> > form of in-tree documentation of counters? (vs Documentation/networking
> > or URLs like
> > https://community.mellanox.com/s/article/understanding-mlx5-ethtool-counters)
>
> Please refer to what I said twice now about the definition of the stats
> exposed here belonging with the VxLAN code, not the driver.

Please refer to the changelog:

"The Spectrum ASICs have a single hardware VTEP that is able to perform VXLAN encapsulation and decapsulation. The VTEP is logically mapped by mlxsw to the multiple VXLAN netdevs that are using it. Exposing the counters of this VTEP via the multiple VXLAN netdevs that are using it would be both inaccurate and confusing for users.

Instead, expose the counters of the VTEP via devlink-metric. Note that Spectrum-1 supports a different set of counters compared to newer ASICs in the Spectrum family."

Hardware implementations will rarely fit 1:1 to the nice and discrete software implementations that they try to accelerate. The purpose of this API is exposing metrics specific to these hardware implementations. This results in better visibility, which can be leveraged for faster debugging and more thorough testing.

The reason I came up with this interface is not the specific VXLAN metrics that bother you, but a new platform we are working on. It uses the ASIC as a cache that refers lookups to an external device in case of cache misses. It is completely transparent to user space (you get better scale), but the driver is very much aware of this stuff as it needs to insert objects (e.g., routes) in a way that will minimize cache misses. Just checking that ping works is hardly enough. We must be able to read the cache counters to ensure we do not see cache misses when we do not expect them.

As another example, consider the algorithmic TCAM implementation we have in Spectrum-2 for ACLs [1]. While a user simply adds / deletes filters, the driver needs to jump through multiple hoops in order to program them in a way that will result in a better scale and reduced latency. We currently do not have an interface through which we can expose metrics related to this specific implementation.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=756cd36626f773e9a72a39c1dd12da4deacfacdf

> > > Okay, fair. I just think that in datacenter deployments we are way
> > > closer to the SDK model than people may want to admit.
> >
> > I do not agree with that; the SDK model means you *must* use vendor code
> > to make something work. Your argument here is about labels for stats and
> > an understanding of their meaning.
>
> Sure, no "must" for passing packets, but you "must" use vendor tooling
> to operate a fleet.
>
> Since everybody already has vendor tools, what value does this API add?

We don't have any "vendor tools" to get this information. Our team is doing everything it possibly can in order to move away from such an approach.

> I still need per-vendor logic. Let's try to build APIs which will
> actually make users' lives easier, which users will want to switch to.

Developers are also users, and they should be able to read whatever information they need from the device in order to help them do their work. You have a multitude of tools (e.g., kprobes, tracepoints) to get better visibility into the software data path. Commonality is not a reason to be blind as a bat when looking into the hardware data path.
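The testing use case described above, exercising the data path and then reading the cache counters, might look like this. The JSON layout below is an assumption about what a JSON-mode dump of the proposed devlink-metric interface could emit, not the actual output of the series:

```python
import json

# Assumed JSON shape for a devlink-metric dump; the real output of the
# proposed `devlink metric` command may differ.
sample = json.loads("""
{"metric": [
  {"name": "cache_miss", "type": "counter", "value": 0},
  {"name": "cache_hit",  "type": "counter", "value": 14321}
]}
""")

def metric(dump, name):
    """Fetch one named metric value from the parsed dump."""
    for m in dump["metric"]:
        if m["name"] == name:
            return m["value"]
    raise KeyError(name)

# The test described above: after exercising the data path, assert that
# no unexpected cache misses were recorded.
assert metric(sample, "cache_miss") == 0
print("cache hits:", metric(sample, "cache_hit"))
```

This is the "just checking that ping works is hardly enough" point in executable form: the selftest passes only if the counter the driver exposes stays at zero, something no amount of black-box traffic testing can verify.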
On Fri, 21 Aug 2020 13:30:21 +0300 Ido Schimmel wrote: > On Thu, Aug 20, 2020 at 09:09:42AM -0700, Jakub Kicinski wrote: > > On Thu, 20 Aug 2020 08:35:25 -0600 David Ahern wrote: > > > On 8/19/20 12:07 PM, Jakub Kicinski wrote: > > > > I don't have a great way forward in mind, sadly. All I can think of is > > > > that we should try to create more well defined interfaces and steer > > > > away from free-form ones. > > > > > > There is a lot of value in free-form too. > > > > On Tue, 18 Aug 2020 20:35:01 -0700 Jakub Kicinski wrote: > > > It's a question of interface, not the value of exposed data. > > > > > > Example, here if the stats are vxlan decap/encap/error - we should > > > > expose that from the vxlan module. That way vxlan module defines one > > > > set of stats for everyone. > > > > > > > > In general unless we attach stats to the object they relate to, we will > > > > end up building parallel structures for exposing statistics from the > > > > drivers. I posted a set once which was implementing hierarchical stats, > > > > but I've abandoned it for this reason. > > > > > [...] > > > > > > > > IDK. I just don't feel like this is going to fly, see how many names > > > > people invented for the CRC error statistic in ethtool -S, even tho > > > > there is a standard stat for that! And users are actually parsing the > > > > output of ethtool -S to get CRC stats because (a) it became the go-to > > > > place for NIC stats and (b) some drivers forget to report in the > > > > standard place. > > > > > > > > The cover letter says this set replaces the bad debugfs with a good, > > > > standard API. It may look good and standard for _vendors_ because they > > > > will know where to dump their counters, but it makes very little > > > > difference for _users_. If I have to parse names for every vendor I use, > > > > I can as well add a per-vendor debugfs path to my script. > > > > > > > > The bar for implementation-specific driver stats has to be high. 
> > > > > > My take away from this is you do not like the names - the strings side > > > of it. > > > > > > Do you object to the netlink API? The netlink API via devlink? > > > > > > 'perf' has json files to describe and document counters > > > (tools/perf/pmu-events). Would something like that be acceptable as a > > > form of in-tree documentation of counters? (vs Documentation/networking > > > or URLs like > > > https://community.mellanox.com/s/article/understanding-mlx5-ethtool-counters) > > > > Please refer to what I said twice now about the definition of the stats > > exposed here belonging with the VxLAN code, not the driver. > > Please refer to the changelog: > > "The Spectrum ASICs have a single hardware VTEP that is able to perform > VXLAN encapsulation and decapsulation. The VTEP is logically mapped by > mlxsw to the multiple VXLAN netdevs that are using it. Exposing the > counters of this VTEP via the multiple VXLAN netdevs that are using it > would be both inaccurate and confusing for users. > > Instead, expose the counters of the VTEP via devlink-metric. Note that > Spectrum-1 supports a different set of counters compared to newer ASICs > in the Spectrum family." I read your cover letter, I didn't say the metrics have to be reported on particular netdevs. > Hardware implementations will rarely fit 1:1 to the nice and discrete > software implementations that they try to accelerate. The purpose of > this API is exposing metrics specific to these hardware implementations. > This results in better visibility which can be leveraged for faster > debugging and more thorough testing. > > The reason I came up with this interface is not the specific VXLAN > metrics that bother you, but a new platform we are working on. It uses > the ASIC as a cache that refers lookups to an external device in case of > cache misses. 
It is completely transparent to user space (you get better > scale), but the driver is very much aware of this stuff as it needs to > insert objects (e.g., routes) in a way that will minimize cache misses. > Just checking that ping works is hardly enough. We must be able to read > the cache counters to ensure we do not see cache misses when we do not > expect them. > > As another example, consider the algorithmic TCAM implementation we have > in Spectrum-2 for ACLs [1]. While a user simply adds / deletes filters, > the driver needs to jump through multiple hoops in order to program them > in a way that will result in a better scale and reduced latency. We > currently do not have an interface through which we can expose metrics > related to this specific implementation. > > [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=756cd36626f773e9a72a39c1dd12da4deacfacdf > > > > > > > Okay, fair. I just think that in datacenter deployments we are way > > > > closer to the SDK model than people may want to admit. > > > > > > I do not agree with that; the SDK model means you *must* use vendor code > > > to make something work. Your argument here is about labels for stats and > > > an understanding of their meaning. > > > > Sure, no "must" for passing packets, but you "must" use vendor tooling > > to operate a fleet. > > > > Since everybody already has vendor tools what value does this API add? > > We don't have any "vendor tools" to get this information. Our team is > doing everything it possibly can in order to move away from such an > approach. For clarity I wasn't speaking about your switches, sadly I have no experience with those. > > I still need per vendor logic. Let's try to build APIs which will > > actually make user's life easier, which users will want to switch to. > > Developers are also users and they should be able to read whatever > information they need from the device in order to help them do their > work.
You have a multitude of tools (e.g., kprobes, tracepoints) to get > better visibility into the software data path. Commonality is not a > reason to be blind as a bat when looking into the hardware data path. How many times do I have to say that I'm not arguing against the value of the data? If you open up this interface either someone will police it, or it will become a dumpster.
On 8/21/20 10:53 AM, Jakub Kicinski wrote: > How many times do I have to say that I'm not arguing against the value > of the data? > > If you open up this interface either someone will police it, or it will > become a dumpster. I am not following what you are proposing as a solution. You do not like Ido's idea of stats going through devlink, but you are not being clear on what you think is a better way. You say vxlan stats belong in the vxlan driver, but the stats do not have to be reported on particular netdevs. How then do h/w stats get exposed via vxlan code?
On Fri, 21 Aug 2020 13:12:59 -0600 David Ahern wrote: > On 8/21/20 10:53 AM, Jakub Kicinski wrote: > > How many times do I have to say that I'm not arguing against the value > > of the data? > > > > If you open up this interface either someone will police it, or it will > > become a dumpster. > > I am not following what you are proposing as a solution. You do not like > Ido's idea of stats going through devlink, but you are not being clear > on what you think is a better way. > > You say vxlan stats belong in the vxlan driver, but the stats do not > have to be reported on particular netdevs. How then do h/w stats get > exposed via vxlan code? No strong preference, for TLS I've done: # cat /proc/net/tls_stat TlsCurrTxSw 0 TlsCurrRxSw 0 TlsCurrTxDevice 0 TlsCurrRxDevice 0 TlsTxSw 0 TlsRxSw 0 TlsTxDevice 0 TlsRxDevice 0 TlsDecryptError 0 TlsRxDeviceResync 0 We can add something over netlink, I opted for simplicity since global stats don't have to scale with number of interfaces.
On 8/21/20 5:50 PM, Jakub Kicinski wrote: > On Fri, 21 Aug 2020 13:12:59 -0600 David Ahern wrote: >> On 8/21/20 10:53 AM, Jakub Kicinski wrote: >>> How many times do I have to say that I'm not arguing against the value >>> of the data? >>> >>> If you open up this interface either someone will police it, or it will >>> become a dumpster. >> >> I am not following what you are proposing as a solution. You do not like >> Ido's idea of stats going through devlink, but you are not being clear >> on what you think is a better way. >> >> You say vxlan stats belong in the vxlan driver, but the stats do not >> have to be reported on particular netdevs. How then do h/w stats get >> exposed via vxlan code? > > No strong preference, for TLS I've done: But you clearly *do* have a strong preference. > > # cat /proc/net/tls_stat I do not agree with adding files under /proc/net for this. > TlsCurrTxSw 0 > TlsCurrRxSw 0 > TlsCurrTxDevice 0 > TlsCurrRxDevice 0 > TlsTxSw 0 > TlsRxSw 0 > TlsTxDevice 0 > TlsRxDevice 0 > TlsDecryptError 0 > TlsRxDeviceResync 0 > > We can add something over netlink, I opted for simplicity since global > stats don't have to scale with number of interfaces. > IMHO, netlink is the right "channel" to move data from kernel to userspace, and opting in to *specific* stats is a must have feature. I think devlink is the right framework given that the stats are device based but not specific to any particular netdev instance. Further, this allows discrimination of hardware stats from software stats which if tied to vxlan as a protocol and somehow pulled from the vxlan driver those would be combined into one (at least how my mind is thinking of this). #### Let's say the direction is for these specific stats (as opposed to the general problem that Ido and others are considering) to be pulled from the vxlan driver. How does that driver get access to hardware stats? vxlan is a protocol and not tied to devices. How should the connection be made?
On Fri, 21 Aug 2020 17:59:57 -0600 David Ahern wrote: > On 8/21/20 5:50 PM, Jakub Kicinski wrote: > > On Fri, 21 Aug 2020 13:12:59 -0600 David Ahern wrote: > >> I am not following what you are proposing as a solution. You do not like > >> Ido's idea of stats going through devlink, but you are not being clear > >> on what you think is a better way. > >> > >> You say vxlan stats belong in the vxlan driver, but the stats do not > >> have to be reported on particular netdevs. How then do h/w stats get > >> exposed via vxlan code? > > > > No strong preference, for TLS I've done: > > But you clearly *do* have a strong preference. I'm answering your question. The question is "How then do h/w stats get exposed via vxlan code?" Please note that the question includes "via vxlan code". So no, I have no preference as long as it's "via vxlan code", and not directly from the driver with a vendor-invented name. > > # cat /proc/net/tls_stat > > I do not agree with adding files under /proc/net for this. Yeah it's not the best, with higher LoC a better solution should be within reach. > > TlsCurrTxSw 0 > > TlsCurrRxSw 0 > > TlsCurrTxDevice 0 > > TlsCurrRxDevice 0 > > TlsTxSw 0 > > TlsRxSw 0 > > TlsTxDevice 0 > > TlsRxDevice 0 > > TlsDecryptError 0 > > TlsRxDeviceResync 0 > > > > We can add something over netlink, I opted for simplicity since global > > stats don't have to scale with number of interfaces. > > IMHO, netlink is the right "channel" to move data from kernel to > userspace, and opting in to *specific* stats is a must have feature. > > I think devlink is the right framework given that the stats are device > based but not specific to any particular netdev instance. I'd be careful with the "not specific to any particular netdev instance". 
A perfect API would be flexible when it comes to scoping :) > Further, this > allows discrimination of hardware stats from software stats which if > tied to vxlan as a protocol and somehow pulled from the vxlan driver > those would be combined into one (at least how my mind is thinking of this). Right, for tls the stats which have "Device" in the name are hardware. But netlink will have better ways of separating the two. > #### > > Let's say the direction is for these specific stats (as opposed to the > general problem that Ido and others are considering) to be pulled from > the vxlan driver. How does that driver get access to hardware stats? > vxlan is a protocol and not tied to devices. How should the connection > be made? Drivers which offload VxLAN already have a dependency on it, right? They can just register with it and get queried on dump. Or if we want scoping we can piggyback on whatever object stats are scoped to. *If* we scope on HW objects do we need to worry about some user some day wanting to have stats per vxlan netdev and per HW instance?
On 8/21/20 6:37 PM, Jakub Kicinski wrote: >>> # cat /proc/net/tls_stat >> >> I do not agree with adding files under /proc/net for this. > > Yeah it's not the best, with higher LoC a better solution should be > within reach. The duplicity here is mind-boggling. Tls stats from hardware is on par with Ido's *example* of vxlan stats from an ASIC. You agree that /proc/net files are wrong, but you did it anyway and now you want the next person to solve the problem you did not want to tackle but have strong opinions on. Ido has a history of thinking through problems and solutions in a proper Linux Way. netlink is the right API, and devlink was created for 'device' stuff versus 'netdev' stuff. Hence, I agree with this *framework* for extracting asic stats.
On Fri, 21 Aug 2020 19:18:37 -0600 David Ahern wrote: > On 8/21/20 6:37 PM, Jakub Kicinski wrote: > >>> # cat /proc/net/tls_stat > >> > >> I do not agree with adding files under /proc/net for this. > > > > Yeah it's not the best, with higher LoC a better solution should be > > within reach. > > The duplicity here is mind-boggling. Tls stats from hardware is on par > with Ido's *example* of vxlan stats from an ASIC. You agree that > /proc/net files are wrong, I didn't say /proc/net was wrong, I'm just trying to be agreeable. Maybe I need to improve my command of the English language. AFAIK /proc/net is where protocol stats are. > but you did it anyway and now you want the > next person to solve the problem you did not want to tackle but have > strong opinions on. I have no need and interest in vxlan stats. > Ido has a history of thinking through problems and solutions in a proper > Linux Way. netlink is the right API, and devlink was created for > 'device' stuff versus 'netdev' stuff. Hence, I agree with this > *framework* for extracting asic stats. You seem to focus on less relevant points. I primarily care about the statistics being defined and identified by Linux, not every vendor for themselves. No question about Ido's ability and contributions, but then again (from the cover letter): > This is joint work [...] during a two-day company hackathon.
On Sat, Aug 22, 2020 at 09:27:39AM -0700, Jakub Kicinski wrote: > On Fri, 21 Aug 2020 19:18:37 -0600 David Ahern wrote: > > On 8/21/20 6:37 PM, Jakub Kicinski wrote: > > >>> # cat /proc/net/tls_stat > > >> > > >> I do not agree with adding files under /proc/net for this. > > > > > > Yeah it's not the best, with higher LoC a better solution should be > > > within reach. > > > > The duplicity here is mind-boggling. Tls stats from hardware is on par > > with Ido's *example* of vxlan stats from an ASIC. You agree that > > /proc/net files are wrong, > > I didn't say /proc/net was wrong, I'm just trying to be agreeable. > Maybe I need to improve my command of the English language. > > AFAIK /proc/net is where protocol stats are. > > > but you did it anyway and now you want the > > next person to solve the problem you did not want to tackle but have > > strong opinions on. > > I have no need and interest in vxlan stats. > > > Ido has a history of thinking through problems and solutions in a proper > > Linux Way. netlink is the right API, and devlink was created for > > 'device' stuff versus 'netdev' stuff. Hence, I agree with this > > *framework* for extracting asic stats. > > You seem to focus on less relevant points. I primarily care about the > statistics being defined and identified by Linux, not every vendor for > themselves. Trying to understand how we can move this forward. The issue is with the specific VXLAN metrics, but you generally agree with the need for the framework? See my two other examples: Cache counters and algorithmic TCAM counters. > No question about Ido's ability and contributions, but then again > (from the cover letter): > > > This a joint work [...] during a two-day company hackathon. The *implementation* was done during the hackathon. The *design* was done before. I sent to the team three design proposals (CLI / netlink API / device drivers API) before we landed on this. 
Started thinking about this idea a few months ago and the hackathon was simply a good opportunity to implement and showcase it.
On Sun, 23 Aug 2020 10:04:34 +0300 Ido Schimmel wrote: > > You seem to focus on less relevant points. I primarily care about the > > statistics being defined and identified by Linux, not every vendor for > > themselves. > > Trying to understand how we can move this forward. The issue is with the > specific VXLAN metrics, but you generally agree with the need for the > framework? See my two other examples: Cache counters and algorithmic > TCAM counters. Yes, we will likely need a way to report design-specific performance counters no matter what. That said I would prefer to pave the way for exposing standardized stats first, so the reviewers (e.g. myself) have a clear place to point folks to. My last attempt was to just try to standardize the strings for the per-netdev TLS offload stats (those are in addition to the /proc stats), and document them in Documentation/. It turned out to have quite a high review overhead, and the convergence is not satisfactory. The only strong use I have right now is FEC stats, and I'm planning to add IEEE-based counters to devlink ports. The scoping of MAC/PHY counters to dl-port is, I hope, reasonable, although it remains to be seen what phy folks think about it. As I previously said - I think that protocol stats are best exported from the protocol driver, otherwise the API may need to grow parallel hierarchies. E.g. semantics of per-queue NIC counters get confusing unless they are reported with the information about the queues - sadly no API for that exists. In particular the lifetime of objects is hard to match with the lifetime of statistics. Similar thing with low granularity counters related to traffic classification. Long story short, it's a complicated topic, IDK how much of it I can expect you to tackle. At the minimum I'd like it if we had a clear separation between Linux/standard stats that drivers should share, and justifiably implementation specific values. 
The DEVLINK_..GENERIC identifiers or trying to standardize on strings are not working for me as a reviewer, and as an infrastructure engineer.
From: Ido Schimmel <idosch@nvidia.com> This patch set extends devlink to allow device drivers to expose device metrics to user space in a standard and extensible fashion, as opposed to the driver-specific debugfs approach. This is joint work with Amit Cohen and Danielle Ratson during a two-day company hackathon. Motivation ========== Certain devices have metrics (e.g., diagnostic counters, histograms) that are useful to expose to user space for testing and debugging purposes. Currently, there is no standardized interface through which these metrics can be exposed and most drivers resort to debugfs, which is not very welcome in the networking subsystem and for good reason. For one, it is not a stable interface on which users can rely. Secondly, it results in duplicated code and inconsistent interfaces in case drivers are implementing similar functionality. While Ethernet drivers can expose per-port counters to user space via ethtool, they cannot expose device-wide metrics or configurable metrics such as counters that can be enabled / disabled or histogram agents. Solution overview ================= Currently, the only supported metric type is a counter, but histograms will be added in the future. The current interface is: devlink dev metric show [ DEV metric METRIC | group GROUP ] devlink dev metric set DEV metric METRIC [ group GROUP ] Device drivers can dynamically register or unregister their metrics with devlink by calling devlink_metric_counter_create() / devlink_metric_destroy(). Grouping allows user space to group certain metrics together so that they can be queried from the kernel using one request and retrieved in a single response filtered by the kernel (i.e., kernel sets NLM_F_DUMP_FILTERED). 
Example ======= Instantiate two netdevsim devices: # echo "10 1" > /sys/bus/netdevsim/new_device # echo "20 1" > /sys/bus/netdevsim/new_device Dump all available metrics: # devlink -s dev metric show netdevsim/netdevsim10: metric dummy_counter type counter group 0 value 2 netdevsim/netdevsim20: metric dummy_counter type counter group 0 value 2 Dump a specific metric: # devlink -s dev metric show netdevsim/netdevsim10 metric dummy_counter netdevsim/netdevsim10: metric dummy_counter type counter group 0 value 3 Set a metric to a different group: # devlink dev metric set netdevsim/netdevsim10 metric dummy_counter group 10 Dump all metrics in a specific group: # devlink -s dev metric show group 10 netdevsim/netdevsim10: metric dummy_counter type counter group 10 value 4 Future extensions ================= 1. Enablement and disablement of metrics. This is useful in case the metric adds latency when enabled or consumes limited resources (e.g., counters or histogram agents). It is up to the device driver to decide if a metric is enabled by default or not. Proposed interface: devlink dev metric set DEV metric METRIC [ group GROUP ] [ enable { true | false } ] 2. Histogram metrics. Some devices have the ability to calculate histograms in hardware by sampling a specific parameter multiple times per second. For example, the transmission queue depth of a port. This enables the debugging of microbursts which would otherwise be invisible. While this can be achieved in software using BPF, it is not applicable when the data plane is offloaded as the CPU does not see the traffic. Proposed interface: devlink dev metric set DEV metric METRIC [ group GROUP ] [ enable { true | false } ] [ hist_type { linear | exp } ] [ hist_sample_interval SAMPLE ] [ hist_min MIN ] [ hist_max MAX ] [ hist_buckets BUCKETS ] 3. Per-port metrics. 
While all the metrics can be exposed as global and namespaced as per-port by naming them accordingly, there is value in allowing user space to dump all metrics related to a certain port. Proposed interface: devlink port metric set DEV/PORT_INDEX metric METRIC [ group GROUP ] [ enable { true | false } ] [ hist_type { linear | exp } ] [ hist_sample_interval SAMPLE ] [ hist_min MIN ] [ hist_max MAX ] [ hist_buckets BUCKETS ] devlink port metric show [ DEV/PORT_INDEX metric METRIC | group GROUP ] To avoid duplicating ethtool functionality we can decide to expose via this interface only: 1. Configurable metrics 2. Metrics that are not only relevant to Ethernet ports TODO ==== 1. Add devlink-metric man page 2. Add selftests over mlxsw Ido Schimmel (6): devlink: Add device metric infrastructure netdevsim: Add devlink metric support selftests: netdevsim: Add devlink metric tests mlxsw: reg: Add Tunneling NVE Counters Register mlxsw: reg: Add Tunneling NVE Counters Register Version 2 mlxsw: spectrum_nve: Expose VXLAN counters via devlink-metric .../networking/devlink/devlink-metric.rst | 37 ++ Documentation/networking/devlink/index.rst | 1 + Documentation/networking/devlink/mlxsw.rst | 36 ++ drivers/net/ethernet/mellanox/mlxsw/reg.h | 104 ++++++ .../ethernet/mellanox/mlxsw/spectrum_nve.h | 10 + .../mellanox/mlxsw/spectrum_nve_vxlan.c | 285 +++++++++++++++ drivers/net/netdevsim/dev.c | 92 ++++- drivers/net/netdevsim/netdevsim.h | 1 + include/net/devlink.h | 18 + include/uapi/linux/devlink.h | 19 + net/core/devlink.c | 346 ++++++++++++++++++ .../drivers/net/netdevsim/devlink.sh | 49 ++- 12 files changed, 995 insertions(+), 3 deletions(-) create mode 100644 Documentation/networking/devlink/devlink-metric.rst