[V2,net,00/11] Extending the ena driver to support new features and enhance performance

Message ID 20190603144329.16366-1-sameehj@amazon.com

Message

Jubran, Samih June 3, 2019, 2:43 p.m. UTC
From: Sameeh Jubran <sameehj@amazon.com>

This patchset introduces the following:

* add support for changing the inline header size (max_header_size) for applications
  with overlay and nested headers
* enable automatic fallback to polling mode for the admin queue when an interrupt is not
  available or is missed
* add good checksum counter for Rx ethtool statistics
* update ena.txt
* some minor code clean-up
* some performance enhancements with doorbell calculations

Differences from V1:

* net: ena: add handling of llq max tx burst size (1/11):
 * fixed christmas tree issue

* net: ena: ethtool: add extra properties retrieval via get_priv_flags (2/11):
 * replaced snprintf with strlcpy
 * dropped confusing error message
 * added more details to the commit message

Arthur Kiyanovski (1):
  net: ena: ethtool: add extra properties retrieval via get_priv_flags

Sameeh Jubran (10):
  net: ena: add handling of llq max tx burst size
  net: ena: replace free_tx/rx_ids union with single free_ids field in
    ena_ring
  net: ena: arrange ena_probe() function variables in reverse christmas
    tree
  net: ena: add newline at the end of pr_err prints
  net: ena: documentation: update ena.txt
  net: ena: allow automatic fallback to polling mode
  net: ena: add support for changing max_header_size in LLQ mode
  net: ena: optimise calculations for CQ doorbell
  net: ena: add good checksum counter
  net: ena: use dev_info_once instead of static variable

 .../networking/device_drivers/amazon/ena.txt  |   5 +-
 .../net/ethernet/amazon/ena/ena_admin_defs.h  |  21 +++
 drivers/net/ethernet/amazon/ena/ena_com.c     | 123 ++++++++++++++----
 drivers/net/ethernet/amazon/ena/ena_com.h     |  48 +++++++
 drivers/net/ethernet/amazon/ena/ena_eth_com.c |  28 ++--
 drivers/net/ethernet/amazon/ena/ena_eth_com.h |  73 +++++++++--
 drivers/net/ethernet/amazon/ena/ena_ethtool.c |  78 +++++++++--
 drivers/net/ethernet/amazon/ena/ena_netdev.c  |  86 +++++++-----
 drivers/net/ethernet/amazon/ena/ena_netdev.h  |  16 +--
 9 files changed, 376 insertions(+), 102 deletions(-)

Comments

David Miller June 3, 2019, 8:30 p.m. UTC | #1
From: <sameehj@amazon.com>
Date: Mon, 3 Jun 2019 17:43:18 +0300

> From: Sameeh Jubran <sameehj@amazon.com>
> 
> This patchset introduces the following:
> 
> * add support for changing the inline header size (max_header_size) for applications
>   with overlay and nested headers
> * enable automatic fallback to polling mode for admin queue when interrupt is not
>   available or missed
> * add good checksum counter for Rx ethtool statistics
> * update ena.txt
> * some minor code clean-up
> * some performance enhancements with doorbell calculations
> 
> Differences from V1:
> 
> * net: ena: add handling of llq max tx burst size (1/11):
>  * fixed christmas tree issue
> 
> * net: ena: ethtool: add extra properties retrieval via get_priv_flags (2/11):
>  * replaced snprintf with strlcpy
>  * dropped confusing error message
>  * added more details to  the commit message

Series applied.
Jakub Kicinski June 3, 2019, 9:32 p.m. UTC | #2
On Mon, 3 Jun 2019 17:43:18 +0300, sameehj@amazon.com wrote:
> * net: ena: ethtool: add extra properties retrieval via get_priv_flags (2/11):
>  * replaced snprintf with strlcpy
>  * dropped confusing error message
>  * added more details to  the commit message

I asked you to clearly state that you are blindly passing
this info from the FW to user space.  Stating that you "retrieve" it
is misleading.

IMHO it's a dangerous precedent, you're extending kernel's uAPI down to
the proprietary FW of the device.  Now we have no idea what the flags
are, so we can't refactor and create common APIs among drivers, or even
use the same names.  We have no idea what you're exposing.
Jakub Kicinski June 3, 2019, 11:03 p.m. UTC | #3
On Mon, 3 Jun 2019 22:27:28 +0000, Woodhouse, David wrote:
> On Mon, 2019-06-03 at 14:32 -0700, Jakub Kicinski wrote:
> > On Mon, 3 Jun 2019 17:43:18 +0300, sameehj@amazon.com wrote:  
> > > * net: ena: ethtool: add extra properties retrieval via get_priv_flags (2/11):
> > >   * replaced snprintf with strlcpy
> > >   * dropped confusing error message
> > >   * added more details to  the commit message  
> > 
> > I asked you to clearly state that you are blindly passing
> > this info from the FW to user space.  Stating that you "retrieve" it
> > is misleading.
> 
> Yes, we should be very clear about that.
> 
> > IMHO it's a dangerous precedent, you're extending kernel's uAPI down to
> > the proprietary FW of the device.  Now we have no idea what the flags
> > are, so we can't refactor and create common APIs among drivers, or even
> > use the same names.  We have no idea what you're exposing.  
> 
> Yes, that should be documented very clearly too, and we should
> absolutely make sure that anything that makes sense for other devices
> is considered for making a common API.
> 
> However, the deployment environment for ENA is kind of weird — we get
> to upgrade the *firmware*, while guest kernels might get updated only
> once a decade. So the passthrough you're objecting to, actually gives
> us fairly much the only viable way to offer many users the tuning
> options they need — or at least to experiment and establish whether
> they *do* need them or not, and thus if we should turn them into a
> "real" generic property.
> 
> You do have a valid concern, but I think we can handle it, and I
> suspect that the approach taken here does make sense. Let's just make
> it explicitly clear, and also document the properties that we expect to
> be exposing, for visibility.

Thank you.

It's generally easier to push FW updates, also to enterprises, rather
than rolling out full new kernels.  People are less concerned with FW
updates, because the only thing those can break is the NIC.  Some
customers also run their own modified kernels.  I'd dispute the fact
that Amazon is so very special.

I don't dispute that this is convenient for Amazon and your FW team.
However, what's best for the vendors differs here from what's good for
Open Source (the FW is proprietary) and IMHO health of Linux ecosystem
in general.

Any "SmartNIC" vendor has the temptation of a uAPI-level hand-off to the
firmware (including my employer); we all run pretty beefy processors
inside "the NIC" after all.  The device-centric ethtool configuration
can be implemented by just forwarding the uAPI structures as they are
to the FW.  I'm sure Andrew and others who would like to see Linux
take more control over PHYs etc. would not like this scenario, either.

To address your specific points, let's be realistic, after initial
submission it's unlikely the features will be updated in a timely
manner anywhere else than perhaps some Amazon's own docs.  Can you
be more specific on what those tuning options are today?  There are
users who don't care to update their kernels for 10 years, yet they
care about some minor tuning option of the NIC?
Andrew Lunn June 4, 2019, 1:50 a.m. UTC | #4
> Any "SmartNIC" vendor has temptation of uAPI-level hand off to the
> firmware (including my employer), we all run pretty beefy processors
> inside "the NIC" after all.  The device centric ethtool configuration
> can be implemented by just forwarding the uAPI structures as they are
> to the FW.  I'm sure Andrew and others who would like to see Linux
> takes more control over PHYs etc. would not like this scenario, either.

No, i would not. There are a few good examples of both firmware and
open drivers being used to control the same PHY, on different
boards. The PHY driver was developed by the community, and has more
features than the firmware driver. And it keeps gaining features. The
firmware i stuck, no updates. The community driver can be debugged,
the firmware is a black box, no chance of the community fixing any
bugs in it.

And PHYs are commodity devices. I doubt there is any value add in the
firmware for a PHY, any real IPR which makes the product better, magic
sauce related to the PHY. So just save the cost of writing and
maintaining firmware, export the MDIO bus, and let Linux control it.
Concentrate the engineers on the interesting parts of the NIC, the
Smart parts, where there can be real IPR.

And i would say this is true for any NIC. Let Linux control the PHY.

      Andrew
Bshara, Nafea June 4, 2019, 2:15 a.m. UTC | #5
Andrew,

On Jun 3, 2019, at 6:52 PM, Andrew Lunn <andrew@lunn.ch> wrote:

>> Any "SmartNIC" vendor has temptation of uAPI-level hand off to the
>> firmware (including my employer), we all run pretty beefy processors
>> inside "the NIC" after all.  The device centric ethtool configuration
>> can be implemented by just forwarding the uAPI structures as they are
>> to the FW.  I'm sure Andrew and others who would like to see Linux
>> takes more control over PHYs etc. would not like this scenario, either.
> 
> No, i would not. There are a few good examples of both firmware and
> open drivers being used to control the same PHY, on different
> boards. The PHY driver was developed by the community, and has more
> features than the firmware driver. And it keeps gaining features. The
> firmware is stuck, no updates. The community driver can be debugged,
> the firmware is a black box, no chance of the community fixing any
> bugs in it.
> 
> And PHYs are commodity devices. I doubt there is any value add in the
> firmware for a PHY, any real IPR which makes the product better, magic
> sauce related to the PHY. So just save the cost of writing and
> maintaining firmware, export the MDIO bus, and let Linux control it.
> Concentrate the engineers on the interesting parts of the NIC, the
> Smart parts, where there can be real IPR.
> 
> And i would say this is true for any NIC. Let Linux control the PHY.
> 
>      Andrew
> 

It may be true for old GbE PHYs where it’s a discrete chip from the likes of Marvell or Broadcom.

But at 25/50/100G, the PHY is actually part of the NIC. It’s a very complex SERDES.  Cloud providers like us spend an enormous amount of time testing the PHY across process and voltage variations, all cable types, lengths and manufacturing variations, and against all the switches we use.  Community drivers won’t be able to validate and tune all this.

Plus we would need the exact same settings for Linux, including all distributions, even 10-year-old ones like RHEL6, for all Windows, ESX, DPDK and FreeBSD, and support millions of different customers with different sets of machine images.

In this case, there is no practical choice but to have the firmware manage the PHY.
David Woodhouse June 4, 2019, 6:57 a.m. UTC | #6
On Tue, 2019-06-04 at 02:15 +0000, Bshara, Nafea wrote:
> Andrew,
> 
> Sent from my iPhone
> 
> On Jun 3, 2019, at 6:52 PM, Andrew Lunn <andrew@lunn.ch> wrote:
> 
> > > Any "SmartNIC" vendor has temptation of uAPI-level hand off to the
> > > firmware (including my employer), we all run pretty beefy processors
> > > inside "the NIC" after all.  The device centric ethtool configuration
> > > can be implemented by just forwarding the uAPI structures as they are
> > > to the FW.  I'm sure Andrew and others who would like to see Linux
> > > takes more control over PHYs etc. would not like this scenario, either.
> > 
> > No, i would not. There are a few good examples of both firmware and
> > open drivers being used to control the same PHY, on different
> > boards. The PHY driver was developed by the community, and has more
> > features than the firmware driver. And it keeps gaining features. The
> > firmware is stuck, no updates. The community driver can be debugged,
> > the firmware is a black box, no chance of the community fixing any
> > bugs in it.
> > 
> > And PHYs are commodity devices. I doubt there is any value add in the
> > firmware for a PHY, any real IPR which makes the product better, magic
> > sauce related to the PHY. So just save the cost of writing and
> > maintaining firmware, export the MDIO bus, and let Linux control it.
> > Concentrate the engineers on the interesting parts of the NIC, the
> > Smart parts, where there can be real IPR.
> > 
> > And i would say this is true for any NIC. Let Linux control the PHY.
> > 
> >      Andrew
> > 
> 
> It may be true for old GbE PHYs where it’s a discrete chip from the
> likes of Marvell or broadcom
> 
> But at 25/50/100G, the PHy is actually part of the nic. It’s a very
> complex SERDES.  Cloud providers like us spend enormous amount of
> time testing the PHY across process and voltage variations, all cable
> types, length and manufacturing variations, and against all switches
> we use.  Community drivers won’t be able to validate and tune all
> this.
> 
> Plus we would need exact same setting for Linux, including all
> distributions even 10year old like RHEL6, for all Windows, ESX, DPDK,
> FreeBSD,  and support millions of different customers with different
> sets of Machine images. 
> 
> In this case, there is no practical choice but to have the firmware
> manage the PHY.

I don't quite know why we're talking about PHYs in this context.
ENA is basically a virtio NIC. It has no PHY.
Jakub Kicinski June 4, 2019, 5:24 p.m. UTC | #7
On Tue, 04 Jun 2019 07:57:48 +0100, David Woodhouse wrote:
> On Tue, 2019-06-04 at 02:15 +0000, Bshara, Nafea wrote:
> > On Jun 3, 2019, at 6:52 PM, Andrew Lunn <andrew@lunn.ch> wrote:
> > > > Any "SmartNIC" vendor has temptation of uAPI-level hand off to the
> > > > firmware (including my employer), we all run pretty beefy processors
> > > > inside "the NIC" after all.  The device centric ethtool configuration
> > > > can be implemented by just forwarding the uAPI structures as they are
> > > > to the FW.  I'm sure Andrew and others who would like to see Linux
> > > > takes more control over PHYs etc. would not like this scenario, either.  
> > > 
> > > No, i would not. There are a few good examples of both firmware and
> > > open drivers being used to control the same PHY, on different
> > > boards. The PHY driver was developed by the community, and has more
> > > features than the firmware driver. And it keeps gaining features. The
> > > firmware is stuck, no updates. The community driver can be debugged,
> > > the firmware is a black box, no chance of the community fixing any
> > > bugs in it.
> > > 
> > > And PHYs are commodity devices. I doubt there is any value add in the
> > > firmware for a PHY, any real IPR which makes the product better, magic
> > > sauce related to the PHY. So just save the cost of writing and
> > > maintaining firmware, export the MDIO bus, and let Linux control it.
> > > Concentrate the engineers on the interesting parts of the NIC, the
> > > Smart parts, where there can be real IPR.
> > > 
> > > And i would say this is true for any NIC. Let Linux control the PHY.
> > 
> > It may be true for old GbE PHYs where it’s a discrete chip from the
> > likes of Marvell or broadcom
> > 
> > But at 25/50/100G, the PHy is actually part of the nic. It’s a very
> > complex SERDES.  Cloud providers like us spend enormous amount of
> > time testing the PHY across process and voltage variations, all cable
> > types, length and manufacturing variations, and against all switches
> > we use.  Community drivers won’t be able to validate and tune all
> > this.
> > 
> > Plus we would need exact same setting for Linux, including all
> > distributions even 10year old like RHEL6, for all Windows, ESX, DPDK,
> > FreeBSD,  and support millions of different customers with different
> > sets of Machine images. 
> > 
> > In this case, there is no practical choice but to have the firmware
> > manage the PHY.
> 
> I don't quite know why we're talking about PHYs in this context.
> ENA is basically a virtio NIC. It has no PHY.

I brought it up as an example, to illustrate that we'd rather see less
trust and control being blindly handed over to the firmware.

Would you mind answering what are the examples of very important flags
you need to expose to users with 10yo kernels?  Or any examples for that
matter..
Jubran, Samih June 6, 2019, 10:23 a.m. UTC | #8
As of today there are no flags exposed by the ENA NIC device; however, we are planning to use them in the near future.
We want to provide customers with extra methods to identify (and differentiate) multiple network interfaces that can be attached to a single VM.
Currently, customers can identify a specific network interface (ENI) only by MAC address, and later look up this MAC among the other ENIs that they have.
In some cases this might not be convenient. Using these flags will let us inform the customer about an ENI's specific properties. It's not finalized,
but tentatively it can look like this:
primary-eni: on /* Differentiate between primary and secondary ENIs */
has-associated-efa: on /* Will specify the ENA device has another associated EFA device */

> -----Original Message-----
> From: Jakub Kicinski <jakub.kicinski@netronome.com>
> Sent: Tuesday, June 4, 2019 8:24 PM
> To: David Woodhouse <dwmw2@infradead.org>
> Cc: Bshara, Nafea <nafea@amazon.com>; Andrew Lunn <andrew@lunn.ch>;
> Jubran, Samih <sameehj@amazon.com>; Kiyanovski, Arthur
> <akiyano@amazon.com>; Bshara, Saeed <saeedb@amazon.com>; Tzalik,
> Guy <gtzalik@amazon.com>; Matushevsky, Alexander
> <matua@amazon.com>; Liguori, Anthony <aliguori@amazon.com>; Saidi, Ali
> <alisaidi@amazon.com>; Machulsky, Zorik <zorik@amazon.com>;
> netdev@vger.kernel.org; Wilson, Matt <msw@amazon.com>;
> davem@davemloft.net; Belgazal, Netanel <netanel@amazon.com>;
> Herrenschmidt, Benjamin <benh@amazon.com>
> Subject: Re: [PATCH V2 net 00/11] Extending the ena driver to support new
> features and enhance performance
> 
> On Tue, 04 Jun 2019 07:57:48 +0100, David Woodhouse wrote:
> > On Tue, 2019-06-04 at 02:15 +0000, Bshara, Nafea wrote:
> > > On Jun 3, 2019, at 6:52 PM, Andrew Lunn <andrew@lunn.ch> wrote:
> > > > > Any "SmartNIC" vendor has temptation of uAPI-level hand off to
> > > > > the firmware (including my employer), we all run pretty beefy
> > > > > processors inside "the NIC" after all.  The device centric
> > > > > ethtool configuration can be implemented by just forwarding the
> > > > > uAPI structures as they are to the FW.  I'm sure Andrew and
> > > > > others who would like to see Linux takes more control over PHYs etc.
> would not like this scenario, either.
> > > >
> > > > No, i would not. There are a few good examples of both firmware
> > > > and open drivers being used to control the same PHY, on different
> > > > boards. The PHY driver was developed by the community, and has
> > > > more features than the firmware driver. And it keeps gaining
> > > > features. The firmware is stuck, no updates. The community driver
> > > > can be debugged, the firmware is a black box, no chance of the
> > > > community fixing any bugs in it.
> > > >
> > > > And PHYs are commodity devices. I doubt there is any value add in
> > > > the firmware for a PHY, any real IPR which makes the product
> > > > better, magic sauce related to the PHY. So just save the cost of
> > > > writing and maintaining firmware, export the MDIO bus, and let Linux
> control it.
> > > > Concentrate the engineers on the interesting parts of the NIC, the
> > > > Smart parts, where there can be real IPR.
> > > >
> > > > And i would say this is true for any NIC. Let Linux control the PHY.
> > >
> > > It may be true for old GbE PHYs where it’s a discrete chip from the
> > > likes of Marvell or broadcom
> > >
> > > But at 25/50/100G, the PHy is actually part of the nic. It’s a very
> > > complex SERDES.  Cloud providers like us spend enormous amount of
> > > time testing the PHY across process and voltage variations, all
> > > cable types, length and manufacturing variations, and against all
> > > switches we use.  Community drivers won’t be able to validate and
> > > tune all this.
> > >
> > > Plus we would need exact same setting for Linux, including all
> > > distributions even 10year old like RHEL6, for all Windows, ESX,
> > > DPDK, FreeBSD,  and support millions of different customers with
> > > different sets of Machine images.
> > >
> > > In this case, there is no practical choice but to have the firmware
> > > manage the PHY.
> >
> > I don't quite know why we're talking about PHYs in this context.
> > ENA is basically a virtio NIC. It has no PHY.
> 
> I brought it up as an example, to illustrate that we'd rather see less trust and
> control being blindly handed over to the firmware.
> 
> Would you mind answering what are the examples of very important flags
> you need to expose to users with 10yo kernels?  Or any examples for that
> matter..
Jakub Kicinski June 6, 2019, 5:16 p.m. UTC | #9
Hi Samih!

Please don't top post on Linux kernel mailing lists.

On Thu, 6 Jun 2019 10:23:40 +0000, Jubran, Samih wrote:
> As of today there are no flags exposed by ENA NIC device, however, we
> are planning to use them in the near future. We want to provide
> customers with extra methods to identify (and differentiate) multiple
> network interfaces that can be attached to a single VM. Currently,
> customers can identify a specific network interface (ENI) only by MAC
> address, and later look up this MAC among other multiple ENIs that
> they have. In some cases it might not be convenient. Using these
> flags will let us inform the customer about ENI`s specific
> properties.

Oh no :(  You're using private _feature_ flags as labels or tags.

> It's not finalized, but tentatively it can look like this: 
> primary-eni: on /* Differentiate between primary and secondary ENIs */

Did you consider using phys_port_name for this use case?

If the intent is to have those interfaces bonded, even better would
be to extend the net_failover module or work on user space ABI for VM
failover.  That might be a bigger effort, though.

> has-associated-efa: on /* Will specify ENA device has another associated EFA device */

IDK how your ENA/EFA thing works, but sounds like something that should
be solved in the device model.
Bshara, Nafea June 6, 2019, 9:40 p.m. UTC | #10
On 6/6/19, 10:16 AM, "Jakub Kicinski" <jakub.kicinski@netronome.com> wrote:

    Hi Samih!
    
    Please don't top post on Linux kernel mailing lists.
    
    On Thu, 6 Jun 2019 10:23:40 +0000, Jubran, Samih wrote:
    > As of today there are no flags exposed by ENA NIC device, however, we
    > are planning to use them in the near future. We want to provide
    > customers with extra methods to identify (and differentiate) multiple
    > network interfaces that can be attached to a single VM. Currently,
    > customers can identify a specific network interface (ENI) only by MAC
    > address, and later look up this MAC among other multiple ENIs that
    > they have. In some cases it might not be convenient. Using these
    > flags will let us inform the customer about ENI`s specific
    > properties.
    
    Oh no :(  You're using private _feature_ flags as labels or tags.
    
    > It's not finalized, but tentatively it can look like this: 
    > primary-eni: on /* Differentiate between primary and secondary ENIs */
    
    Did you consider using phys_port_name for this use case?
    
    If the intent is to have those interfaces bonded, even better would
    be to extend the net_failover module or work on user space ABI for VM
    failover.  That might be a bigger effort, though.
    
    > has-associated-efa: on /* Will specify ENA device has another associated EFA device */
    
    IDK how your ENA/EFA thing works, but sounds like something that should
    be solved in the device model.

[NB] ENA is a netdev, EFA is an ib_dev.  Both share the same physical network.
All previous attempts to make them share at the device level led to a lot of complexity at the OS level, with inter-dependencies that are prone to error.
Not to mention that OS distributions that backport from mainline would backport different subsets of each driver; besides, the two drivers belong to two subtrees with two different maintainers, and we don’t want to be in a place where we have to coordinate releases between two subgroups.

As such, we selected a much safer path that makes the operation, development and deployment of the two different drivers (netdev vs ib_dev) much easier, by exposing the NIC as two different physical functions/devices.

The only cost we have is that we need this extra property to help the user identify which two devices are related, hence Samih's comments.
Jakub Kicinski June 6, 2019, 10:04 p.m. UTC | #11
On Thu, 6 Jun 2019 21:40:19 +0000, Bshara, Nafea wrote:
> On 6/6/19, 10:16 AM, "Jakub Kicinski" <jakub.kicinski@netronome.com> wrote:
> 
>     Hi Samih!
>     
>     Please don't top post on Linux kernel mailing lists.
>     
>     On Thu, 6 Jun 2019 10:23:40 +0000, Jubran, Samih wrote:
>     > As of today there are no flags exposed by ENA NIC device, however, we
>     > are planning to use them in the near future. We want to provide
>     > customers with extra methods to identify (and differentiate) multiple
>     > network interfaces that can be attached to a single VM. Currently,
>     > customers can identify a specific network interface (ENI) only by MAC
>     > address, and later look up this MAC among other multiple ENIs that
>     > they have. In some cases it might not be convenient. Using these
>     > flags will let us inform the customer about ENI`s specific
>     > properties.  
>     
>     Oh no :(  You're using private _feature_ flags as labels or tags.
>     
>     > It's not finalized, but tentatively it can look like this: 
>     > primary-eni: on /* Differentiate between primary and secondary ENIs */  
>     
>     Did you consider using phys_port_name for this use case?
>     
>     If the intent is to have those interfaces bonded, even better would
>     be to extend the net_failover module or work on user space ABI for VM
>     failover.  That might be a bigger effort, though.

Someone else will address this point?

>     > has-associated-efa: on /* Will specify ENA device has another associated EFA device */  
>     
>     IDK how your ENA/EFA thing works, but sounds like something that should
>     be solved in the device model.
> 
> [NB] ENA is a netDev,  EFA is an ib_dev.   Both share the same physical network
> All previous attempt to make them share at device level leads to a
> lot of complexity at the OS level, with inter-dependencies that are
> prone to error.  Not to mention that OS distribution that backport
> from mainline would backport different subset of each driver,  let
> alone they belong to two subtrees with two different maintainers and
> we don’t want to be in a place where we have to coordinate releases
> between two subgroups
>
> As such, we selected a much safer path, that operational, development and deployment of two different drivers (netdev vs ib_dev) much easier but exposing the nic as two different Physical functions/devices
> 
> Only cost we have is that we need this extra property to help user identify which two devices are related hence Samih's comments

I think you're arguing with a different point than the one I made :)
Do you think I said they should use the same PCI device?  I haven't.

IIUC you are exposing a "tag" through the feature API on the netdev to
indicate that there is another PCI device present on the system?

What I said is if there is a relation between PCI devices the best
level to expose this relation at is at the device model level.  IOW
have some form of "link" in sysfs, not in a random place on a netdev.

Having said that, it's entirely unclear to me what the user scenario is
here.  You say "which two devices related", yet you only have one bit,
so it can indicate that there is another device, not _which_ device is
related.  Information you can full well get from running lspci 🤷
Do the devices have the same PCI ID/vendor:model?
Bshara, Nafea June 6, 2019, 10:57 p.m. UTC | #12
> On Jun 6, 2019, at 3:05 PM, Jakub Kicinski <jakub.kicinski@netronome.com> wrote:
> 
>> On Thu, 6 Jun 2019 21:40:19 +0000, Bshara, Nafea wrote:
>> On 6/6/19, 10:16 AM, "Jakub Kicinski" <jakub.kicinski@netronome.com> wrote:
>> 
>>    Hi Samih!
>> 
>>    Please don't top post on Linux kernel mailing lists.
>> 
>>>    On Thu, 6 Jun 2019 10:23:40 +0000, Jubran, Samih wrote:
>>> As of today there are no flags exposed by ENA NIC device, however, we
>>> are planning to use them in the near future. We want to provide
>>> customers with extra methods to identify (and differentiate) multiple
>>> network interfaces that can be attached to a single VM. Currently,
>>> customers can identify a specific network interface (ENI) only by MAC
>>> address, and later look up this MAC among other multiple ENIs that
>>> they have. In some cases it might not be convenient. Using these
>>> flags will let us inform the customer about ENI`s specific
>>> properties.  
>> 
>>    Oh no :(  You're using private _feature_ flags as labels or tags.
>> 
>>> It's not finalized, but tentatively it can look like this: 
>>> primary-eni: on /* Differentiate between primary and secondary ENIs */  
>> 
>>    Did you consider using phys_port_name for this use case?
>> 
>>    If the intent is to have those interfaces bonded, even better would
>>    be to extend the net_failover module or work on user space ABI for VM
>>    failover.  That might be a bigger effort, though.
> 
> Someone else will address this point?
> 
>>> has-associated-efa: on /* Will specify ENA device has another associated EFA device */  
>> 
>>    IDK how your ENA/EFA thing works, but sounds like something that should
>>    be solved in the device model.
>> 
>> [NB] ENA is a netDev,  EFA is an ib_dev.   Both share the same physical network
>> All previous attempt to make them share at device level leads to a
>> lot of complexity at the OS level, with inter-dependencies that are
>> prone to error. Not to mention that OS distribution that backport
>> from mainline would backport different subset of each driver,  let
>> alone they belong to two subtrees with two different maintainers and
>> we don’t want to be in a place where we have to coordinate releases
>> between two subgroups
>> 
>> As such, we selected a much safer path, that operational, development and deployment of two different drivers (netdev vs ib_dev) much easier but exposing the nic as two different Physical functions/devices
>> 
>> Only cost we have is that we need this extra property to help user identify which two devices are related hence Samih's comments
> 
> I think you're arguing with a different point than the one I made :)
> Do you think I said they should use the same PCI device?  I haven't.
> 
> IIUC you are exposing a "tag" through the feature API on the netdev to
> indicate that there is another PCI device present on the system?
> 
> What I said is if there is a relation between PCI devices the best
> level to expose this relation at is at the device model level.  IOW
> have some form of "link" in sysfs, not in a random place on a netdev.
> 
> Having said that, it's entirely unclear to me what the user scenario is
> here.  You say "which two devices related", yet you only have one bit,
> so it can indicate that there is another device, not _which_ device is
> related.  Information you can full well get from running lspci 🤷
> Do the devices have the same PCI ID/vendor:model?

Different model id
Will look into sysfs
Jakub Kicinski June 6, 2019, 11:07 p.m. UTC | #13
On Thu, 6 Jun 2019 22:57:21 +0000, Bshara, Nafea wrote:
> > Having said that, it's entirely unclear to me what the user scenario is
> > here.  You say "which two devices related", yet you only have one bit,
> > so it can indicate that there is another device, not _which_ device is
> > related.  Information you can full well get from running lspci 🤷
> > Do the devices have the same PCI ID/vendor:model?  
> 
> Different model id

Okay, then you know which one is which.  Are there multiple ENAs but
one EFA?

> Will look into sysfs 

I still don't understand what is the problem you're trying to solve,
perhaps phys_port_id is the way to go...


The larger point here is that we can't guide you to the right API
unless we know what you're trying to achieve.  And we don't have
the slightest clue of what you're trying to achieve if the uAPI is
forwarded to the device.

Honestly this is worse, and way more basic than I thought, I think
315c28d2b714 ("net: ena: ethtool: add extra properties retrieval via get_priv_flags")
needs to be reverted.
Bshara, Nafea June 6, 2019, 11:21 p.m. UTC | #14
Sent from my iPhone

> On Jun 6, 2019, at 4:08 PM, Jakub Kicinski <jakub.kicinski@netronome.com> wrote:
> 
> On Thu, 6 Jun 2019 22:57:21 +0000, Bshara, Nafea wrote:
>>> Having said that, it's entirely unclear to me what the user scenario is
>>> here.  You say "which two devices related", yet you only have one bit,
>>> so it can indicate that there is another device, not _which_ device is
>>> related.  Information you can full well get from running lspci 🤷
>>> Do the devices have the same PCI ID/vendor:model?  
>> 
>> Different model id
> 
> Okay, then you know which one is which.  Are there multiple ENAs but
> one EFA?
> 

Yes, very possible. Very common.

The typical use case is that instances have one ena for the control plane, one internet-facing, and one 100G ena that also has efa capabilities

>> Will look into sysfs 
> 
> I still don't understand what is the problem you're trying to solve,
> perhaps phys_port_id is the way to go...
> 
> 
> The larger point here is that we can't guide you to the right API
> unless we know what you're trying to achieve.  And we don't have 
> the slightest clue of what're trying to achieve if uAPI is forwarded 
> to the device.  
> 
> Honestly this is worse, and way more basic than I thought, I think
> 315c28d2b714 ("net: ena: ethtool: add extra properties retrieval via get_priv_flags")
> needs to be reverted.

Let’s not do that until we finish this discussion and explain the various use cases
Jakub Kicinski June 6, 2019, 11:42 p.m. UTC | #15
On Thu, 6 Jun 2019 23:21:25 +0000, Bshara, Nafea wrote:
> > On Jun 6, 2019, at 4:08 PM, Jakub Kicinski <jakub.kicinski@netronome.com> wrote:
> > On Thu, 6 Jun 2019 22:57:21 +0000, Bshara, Nafea wrote:  
> >>> Having said that, it's entirely unclear to me what the user scenario is
> >>> here.  You say "which two devices related", yet you only have one bit,
> >>> so it can indicate that there is another device, not _which_ device is
> >>> related.  Information you can full well get from running lspci 🤷
> >>> Do the devices have the same PCI ID/vendor:model?    
> >> 
> >> Different model id  
> > 
> > Okay, then you know which one is which.  Are there multiple ENAs but
> > one EFA?
> 
> Yes,  very possible. Very common
> 
> Typical use case that instances have one ena for control plane, one
> for internet facing , and one 100G ena that also have efa capabilities

I see, and those are PCI devices..  Some form of platform data would
seem like the best fit to me.  There is something called:

/sys/bus/pci/${dbdf}/label

It seems to come from some ACPI table - DSM maybe?  I think you can put
whatever string you want there 🤔
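[Editor's note: the label lookup described above is just a file read per PCI device. A minimal sketch follows; the device addresses and the `ena-internet` label string are made up for illustration, and a temporary mock directory stands in for `/sys/bus/pci/devices` so the sketch runs on any machine.]

```python
import os
import tempfile

def pci_labels(sysfs_pci_root):
    """Collect the optional firmware-provided label for each PCI device.

    Returns a dict mapping D:B:D.F address -> label string, covering only
    the devices that expose a 'label' attribute (many do not).
    """
    labels = {}
    for dbdf in sorted(os.listdir(sysfs_pci_root)):
        label_path = os.path.join(sysfs_pci_root, dbdf, "label")
        if os.path.isfile(label_path):
            with open(label_path) as f:
                labels[dbdf] = f.read().strip()
    return labels

# Demo against a mock sysfs tree so the sketch is self-contained;
# on a real system you would pass "/sys/bus/pci/devices" instead.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "0000:00:05.0"))
os.makedirs(os.path.join(root, "0000:00:06.0"))
with open(os.path.join(root, "0000:00:05.0", "label"), "w") as f:
    f.write("ena-internet\n")

print(pci_labels(root))  # only the labeled device shows up
```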

> >> Will look into sysfs   
> > 
> > I still don't understand what is the problem you're trying to solve,
> > perhaps phys_port_id is the way to go...
> > 
> > 
> > The larger point here is that we can't guide you to the right API
> > unless we know what you're trying to achieve.  And we don't have 
> > the slightest clue of what're trying to achieve if uAPI is forwarded 
> > to the device.  
> > 
> > Honestly this is worse, and way more basic than I thought, I think
> > 315c28d2b714 ("net: ena: ethtool: add extra properties retrieval via get_priv_flags")
> > needs to be reverted.  
> 
> Let’s not do that until we finish this discussion and explain the various use cases

Whatever we decide is the right API for tagging interfaces in a virtual
environment, it's definitely not going to be private feature flags.
Bshara, Nafea June 7, 2019, 1:04 a.m. UTC | #16
Sent from my iPhone

On Jun 6, 2019, at 4:43 PM, Jakub Kicinski <jakub.kicinski@netronome.com> wrote:

>>> Okay, then you know which one is which.  Are there multiple ENAs but
>>> one EFA?
>> 
>> Yes,  very possible. Very common
>> 
>> Typical use case that instances have one ena for control plane, one
>> for internet facing , and one 100G ena that also have efa capabilities
> 
> I see, and those are PCI devices..  Some form of platform data would
> seem like the best fit to me.  There is something called:
> 
> /sys/bus/pci/${dbdf}/label
> 
> It seems to come from some ACPI table - DSM maybe?  I think you can put
> whatever string you want there 🤔

The ACPI path won't work; many of these interfaces are hot-attached using native PCIe hotplug, and ACPI won't be involved.
Jakub Kicinski June 7, 2019, 1:14 a.m. UTC | #17
On Fri, 7 Jun 2019 01:04:14 +0000, Bshara, Nafea wrote:
> On Jun 6, 2019, at 4:43 PM, Jakub Kicinski wrote:
> >>> Okay, then you know which one is which.  Are there multiple ENAs but
> >>> one EFA?
> >> 
> >> Yes,  very possible. Very common
> >> 
> >> Typical use case that instances have one ena for control plane, one
> >> for internet facing , and one 100G ena that also have efa capabilities
> > 
> > I see, and those are PCI devices..  Some form of platform data would
> > seem like the best fit to me.  There is something called:
> > 
> > /sys/bus/pci/${dbdf}/label
> > 
> > It seems to come from some ACPI table - DSM maybe?  I think you can put
> > whatever string you want there 🤔  
> 
> Acpi path won’t work, much of thee interface are hot attached, using
> native pcie hot plug and acpi won’t be involved. 

Perhaps hotplug breaks DSM, I won't pretend to be an ACPI expert.  So you
can find a way to stuff the label into that file from another source.
There's also VPD, or custom PCI caps, but platform data generally seems
like a better idea.
Jakub Kicinski June 7, 2019, 9:27 p.m. UTC | #18
On Thu, 6 Jun 2019 18:14:16 -0700, Jakub Kicinski wrote:
> On Fri, 7 Jun 2019 01:04:14 +0000, Bshara, Nafea wrote:
> > On Jun 6, 2019, at 4:43 PM, Jakub Kicinski wrote:  
> > >>> Okay, then you know which one is which.  Are there multiple ENAs but
> > >>> one EFA?  
> > >> 
> > >> Yes,  very possible. Very common
> > >> 
> > >> Typical use case that instances have one ena for control plane, one
> > >> for internet facing , and one 100G ena that also have efa capabilities  
> > > 
> > > I see, and those are PCI devices..  Some form of platform data would
> > > seem like the best fit to me.  There is something called:
> > > 
> > > /sys/bus/pci/${dbdf}/label
> > > 
> > > It seems to come from some ACPI table - DSM maybe?  I think you can put
> > > whatever string you want there 🤔    
> > 
> > Acpi path won’t work, much of thee interface are hot attached, using
> > native pcie hot plug and acpi won’t be involved.   
> 
> Perhaps hotplug break DSM, I won't pretend to be an ACPI expert.  So you
> can find a way to stuff the label into that file from another source.
> There's also VPD, or custom PCI caps, but platform data generally seems
> like a better idea.

Jiri, do you have any thoughts about using phys_port_name for exposing
topology in virtual environments?  Perhaps that's the best fit on the
netdev side.  It's not like we want any other port name in such a case,
and having it automatically appended to the netdev name may be useful.
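[Editor's note: phys_port_name surfaces in userspace as `/sys/class/net/<ifname>/phys_port_name`, which udev/systemd can key interface renames off. A sketch of the userspace read side; the role strings (`ctrl`, `inet`) are hypothetical, since no naming was settled in this thread, and a mock tree stands in for `/sys/class/net` so the sketch runs anywhere.]

```python
import os
import tempfile

def phys_port_names(sys_class_net):
    """Map each netdev to its phys_port_name, if the driver reports one.

    On a real system, reading the attribute fails (an OSError here) when
    the driver does not implement the phys_port_name callback; a missing
    file in the mock tree models that case.
    """
    names = {}
    for ifname in sorted(os.listdir(sys_class_net)):
        path = os.path.join(sys_class_net, ifname, "phys_port_name")
        try:
            with open(path) as f:
                names[ifname] = f.read().strip()
        except OSError:
            names[ifname] = None
    return names

# Mock /sys/class/net with made-up role names so the sketch is
# self-contained; eth2 has no phys_port_name at all.
root = tempfile.mkdtemp()
for ifname, port in [("eth0", "ctrl"), ("eth1", "inet"), ("eth2", None)]:
    os.makedirs(os.path.join(root, ifname))
    if port is not None:
        with open(os.path.join(root, ifname, "phys_port_name"), "w") as f:
            f.write(port + "\n")

print(phys_port_names(root))
```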
Bshara, Nafea June 7, 2019, 9:34 p.m. UTC | #19
On 6/7/19, 2:27 PM, "Jakub Kicinski" <jakub.kicinski@netronome.com> wrote:

    On Thu, 6 Jun 2019 18:14:16 -0700, Jakub Kicinski wrote:
    > On Fri, 7 Jun 2019 01:04:14 +0000, Bshara, Nafea wrote:
    > > On Jun 6, 2019, at 4:43 PM, Jakub Kicinski wrote:  
    > > >>> Okay, then you know which one is which.  Are there multiple ENAs but
    > > >>> one EFA?  
    > > >> 
    > > >> Yes,  very possible. Very common
    > > >> 
    > > >> Typical use case that instances have one ena for control plane, one
    > > >> for internet facing , and one 100G ena that also have efa capabilities  
    > > > 
    > > > I see, and those are PCI devices..  Some form of platform data would
    > > > seem like the best fit to me.  There is something called:
    > > > 
    > > > /sys/bus/pci/${dbdf}/label
    > > > 
    > > > It seems to come from some ACPI table - DSM maybe?  I think you can put
    > > > whatever string you want there 🤔    
    > > 
    > > Acpi path won’t work, much of thee interface are hot attached, using
    > > native pcie hot plug and acpi won’t be involved.   
    > 
    > Perhaps hotplug break DSM, I won't pretend to be an ACPI expert.  So you
    > can find a way to stuff the label into that file from another source.
    > There's also VPD, or custom PCI caps, but platform data generally seems
    > like a better idea.
    
    Jiri, do you have any thoughts about using phys_port_name for exposing
    topology in virtual environments.  Perhaps that's the best fit on the
    netdev side.  It's not like we want any other port name in such case,
    and having it automatically appended to the netdev name may be useful.
    
any preference for that vs /sysfs ?
Jakub Kicinski June 7, 2019, 9:54 p.m. UTC | #20
On Fri, 7 Jun 2019 21:34:00 +0000, Bshara, Nafea wrote:
> On 6/7/19, 2:27 PM, "Jakub Kicinski" <jakub.kicinski@netronome.com> wrote:
> 
>     On Thu, 6 Jun 2019 18:14:16 -0700, Jakub Kicinski wrote:
>     > On Fri, 7 Jun 2019 01:04:14 +0000, Bshara, Nafea wrote:  
>     > > On Jun 6, 2019, at 4:43 PM, Jakub Kicinski wrote:    
>     > > >>> Okay, then you know which one is which.  Are there multiple ENAs but
>     > > >>> one EFA?    
>     > > >> 
>     > > >> Yes,  very possible. Very common
>     > > >> 
>     > > >> Typical use case that instances have one ena for control plane, one
>     > > >> for internet facing , and one 100G ena that also have efa capabilities    
>     > > > 
>     > > > I see, and those are PCI devices..  Some form of platform data would
>     > > > seem like the best fit to me.  There is something called:
>     > > > 
>     > > > /sys/bus/pci/${dbdf}/label
>     > > > 
>     > > > It seems to come from some ACPI table - DSM maybe?  I think you can put
>     > > > whatever string you want there 🤔      
>     > > 
>     > > Acpi path won’t work, much of thee interface are hot attached, using
>     > > native pcie hot plug and acpi won’t be involved.     
>     > 
>     > Perhaps hotplug break DSM, I won't pretend to be an ACPI expert.  So you
>     > can find a way to stuff the label into that file from another source.
>     > There's also VPD, or custom PCI caps, but platform data generally seems
>     > like a better idea.  
>     
>     Jiri, do you have any thoughts about using phys_port_name for exposing
>     topology in virtual environments.  Perhaps that's the best fit on the
>     netdev side.  It's not like we want any other port name in such case,
>     and having it automatically appended to the netdev name may be useful.
>     
> any preference for that vs /sysfs ?

I think so.  For the topology (control, internal, external) I feel like
it's a reasonable fit, and well supported by udev/systemd.  We use it for
appending port names of the switch today (p0, p1, p2, etc).  Jiri was
working on making the core kernel generate those automatically, so let's
give him a chance to speak up.

For the ENA<>EFA link I think you may still need to resort to sysfs,
unless it can be somehow implied by the topology label?