
[net-next,00/13] resource management using devlink reload

Message ID 20221114125755.13659-1-michal.swiatkowski@linux.intel.com
Series resource management using devlink reload

Message

Michal Swiatkowski Nov. 14, 2022, 12:57 p.m. UTC
Currently the default number of PF vectors is the number of CPUs.
Because of that there are cases when all vectors are used for the PF
and the user can't create more VFs. It is hard to pick a default that
is right for all the different use cases, so instead allow the user to
choose how many vectors should be used for the various features. After
subdevices are implemented, this mechanism will also be used to set the
number of vectors for subfunctions.

The idea is to set the vectors for eth or VFs using the devlink
resource API. The new vector values are used after a devlink reload.
Example commands:
$ sudo devlink resource set pci/0000:31:00.0 path msix/msix_eth size 16
$ sudo devlink dev reload pci/0000:31:00.0
After the reload the driver will use 16 vectors for eth instead of
num_cpus.
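
For reference, the configured sizes can also be inspected through the
devlink resource API (a sketch only; the msix/msix_eth path follows the
example above and the exact resource tree depends on the driver):
$ sudo devlink resource show pci/0000:31:00.0
Until the reload is performed, a changed size should show up there as a
pending (size_new) value.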

The default number of queues is implicitly derived from the interrupt
vectors and can later be changed by ethtool. To decrease the queues
used on eth, the user can decrease the eth vectors; the result is the
same. The user can still change the number of queues using ethtool:
$ sudo ethtool -L enp24s0f0 tx 72 rx 72
but the maximum number of queues is limited by the number of vectors.
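
For example, the resulting limit is visible in the ethtool channel
parameters (illustrative only; the interface name follows the example
above):
$ ethtool -l enp24s0f0
The "Pre-set maximums" reported there correspond to the number of
vectors assigned to eth.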

Most of this patchset is about implementing the driver reload
mechanism. Parts of the code from probe and rebuild are reused to avoid
duplication. To allow this reuse, the probe and rebuild paths are split
into smaller functions.

Patch "ice: split ice_vsi_setup into smaller functions" changes
boolean variable in function call to integer and adds define
for it. Instead of having the function called with true/false now it
can be called with readable defines ICE_VSI_FLAG_INIT or
ICE_VSI_FLAG_NO_INIT. It was suggested by Jacob Keller and probably this
mechanism will be implemented across ice driver in follow up patchset.

Patch 1 - 10    -> clean up code to reuse most of the already
                   implemented functions in the reload path
Patch 11        -> implement the devlink reload API
Patch 12        -> prepare interrupt reservation, make irdma see a
                   changeable vector count
Patch 13        -> change the number of vectors


Jacob Keller (1):
  ice: stop hard coding the ICE_VSI_CTRL location

Michal Kubiak (1):
  devlink, ice: add MSIX vectors as devlink resource

Michal Swiatkowski (11):
  ice: move RDMA init to ice_idc.c
  ice: alloc id for RDMA using xa_array
  ice: cleanup in VSI config/deconfig code
  ice: split ice_vsi_setup into smaller functions
  ice: split probe into smaller functions
  ice: sync netdev filters after clearing VSI
  ice: move VSI delete outside deconfig
  ice: update VSI instead of init in some case
  ice: implement devlink reinit action
  ice: introduce eswitch capable flag
  ice, irdma: prepare reservation of MSI-X to reload

 .../networking/devlink/devlink-resource.rst   |   10 +
 drivers/infiniband/hw/irdma/main.c            |    2 +-
 drivers/net/ethernet/intel/ice/ice.h          |   23 +-
 drivers/net/ethernet/intel/ice/ice_common.c   |   11 +-
 drivers/net/ethernet/intel/ice/ice_devlink.c  |  263 +++-
 drivers/net/ethernet/intel/ice/ice_devlink.h  |    2 +
 drivers/net/ethernet/intel/ice/ice_eswitch.c  |    6 +
 drivers/net/ethernet/intel/ice/ice_ethtool.c  |    6 +-
 drivers/net/ethernet/intel/ice/ice_fltr.c     |    5 +
 drivers/net/ethernet/intel/ice/ice_idc.c      |   57 +-
 drivers/net/ethernet/intel/ice/ice_lib.c      |  789 +++++-----
 drivers/net/ethernet/intel/ice/ice_lib.h      |    8 +-
 drivers/net/ethernet/intel/ice/ice_main.c     | 1354 ++++++++++-------
 drivers/net/ethernet/intel/ice/ice_sriov.c    |    3 +-
 drivers/net/ethernet/intel/ice/ice_vf_lib.c   |    2 +-
 include/net/devlink.h                         |   14 +
 16 files changed, 1517 insertions(+), 1038 deletions(-)

Comments

Leon Romanovsky Nov. 14, 2022, 1:23 p.m. UTC | #1
On Mon, Nov 14, 2022 at 01:57:42PM +0100, Michal Swiatkowski wrote:
> Currently the default value for number of PF vectors is number of CPUs.
> Because of that there are cases when all vectors are used for PF
> and user can't create more VFs. It is hard to set default number of
> CPUs right for all different use cases. Instead allow user to choose
> how many vectors should be used for various features. After implementing
> subdevices this mechanism will be also used to set number of vectors
> for subfunctions.
> 
> The idea is to set vectors for eth or VFs using devlink resource API.
> New value of vectors will be used after devlink reinit. Example
> commands:
> $ sudo devlink resource set pci/0000:31:00.0 path msix/msix_eth size 16
> $ sudo devlink dev reload pci/0000:31:00.0
> After reload driver will work with 16 vectors used for eth instead of
> num_cpus.

By saying "vectors", are you referring to MSI-X vectors?
If yes, there is already a specific interface for that:
https://lore.kernel.org/linux-pci/20210314124256.70253-1-leon@kernel.org/

Thanks
Samudrala, Sridhar Nov. 14, 2022, 3:31 p.m. UTC | #2
On 11/14/2022 7:23 AM, Leon Romanovsky wrote:
> On Mon, Nov 14, 2022 at 01:57:42PM +0100, Michal Swiatkowski wrote:
>> Currently the default value for number of PF vectors is number of CPUs.
>> Because of that there are cases when all vectors are used for PF
>> and user can't create more VFs. It is hard to set default number of
>> CPUs right for all different use cases. Instead allow user to choose
>> how many vectors should be used for various features. After implementing
>> subdevices this mechanism will be also used to set number of vectors
>> for subfunctions.
>>
>> The idea is to set vectors for eth or VFs using devlink resource API.
>> New value of vectors will be used after devlink reinit. Example
>> commands:
>> $ sudo devlink resource set pci/0000:31:00.0 path msix/msix_eth size 16
>> $ sudo devlink dev reload pci/0000:31:00.0
>> After reload driver will work with 16 vectors used for eth instead of
>> num_cpus.
> By saying "vectors", are you referring to MSI-X vectors?
> If yes, you have specific interface for that.
> https://lore.kernel.org/linux-pci/20210314124256.70253-1-leon@kernel.org/

This patch series is exposing a resources API to split the device-level
MSI-X vectors across the different functions supported by the device (PF,
RDMA, SR-IOV VFs and in future subfunctions). Today this is all hidden in
a policy implemented within the PF driver.

The patch you are referring to seems to provide an interface to change the
MSI-X count for a particular VF. This patch series provides an interface to
set the total MSI-X count for all the possible VFs from the available
device-level pool of MSI-X vectors.
Keller, Jacob E Nov. 14, 2022, 4:58 p.m. UTC | #3
> -----Original Message-----
> From: Samudrala, Sridhar <sridhar.samudrala@intel.com>
> Sent: Monday, November 14, 2022 7:31 AM
> To: Leon Romanovsky <leon@kernel.org>; Michal Swiatkowski
> <michal.swiatkowski@linux.intel.com>
> Cc: netdev@vger.kernel.org; davem@davemloft.net; kuba@kernel.org;
> pabeni@redhat.com; edumazet@google.com; intel-wired-lan@lists.osuosl.org;
> jiri@nvidia.com; Nguyen, Anthony L <anthony.l.nguyen@intel.com>; Lobakin,
> Alexandr <alexandr.lobakin@intel.com>; Drewek, Wojciech
> <wojciech.drewek@intel.com>; Czapnik, Lukasz <lukasz.czapnik@intel.com>;
> Saleem, Shiraz <shiraz.saleem@intel.com>; Brandeburg, Jesse
> <jesse.brandeburg@intel.com>; Ismail, Mustafa <mustafa.ismail@intel.com>;
> Kitszel, Przemyslaw <przemyslaw.kitszel@intel.com>; Raczynski, Piotr
> <piotr.raczynski@intel.com>; Keller, Jacob E <jacob.e.keller@intel.com>; Ertman,
> David M <david.m.ertman@intel.com>; Kaliszczuk, Leszek
> <leszek.kaliszczuk@intel.com>
> Subject: Re: [PATCH net-next 00/13] resource management using devlink reload
> 
> On 11/14/2022 7:23 AM, Leon Romanovsky wrote:
> > On Mon, Nov 14, 2022 at 01:57:42PM +0100, Michal Swiatkowski wrote:
> >> Currently the default value for number of PF vectors is number of CPUs.
> >> Because of that there are cases when all vectors are used for PF
> >> and user can't create more VFs. It is hard to set default number of
> >> CPUs right for all different use cases. Instead allow user to choose
> >> how many vectors should be used for various features. After implementing
> >> subdevices this mechanism will be also used to set number of vectors
> >> for subfunctions.
> >>
> >> The idea is to set vectors for eth or VFs using devlink resource API.
> >> New value of vectors will be used after devlink reinit. Example
> >> commands:
> >> $ sudo devlink resource set pci/0000:31:00.0 path msix/msix_eth size 16
> >> $ sudo devlink dev reload pci/0000:31:00.0
> >> After reload driver will work with 16 vectors used for eth instead of
> >> num_cpus.
> > By saying "vectors", are you referring to MSI-X vectors?
> > If yes, you have specific interface for that.
> > https://lore.kernel.org/linux-pci/20210314124256.70253-1-leon@kernel.org/
> 
> This patch series is exposing a resources API to split the device level MSI-X vectors
> across the different functions supported by the device (PF, RDMA, SR-IOV VFs
> and
> in future subfunctions). Today this is all hidden in a policy implemented within
> the PF driver.
> 
> The patch you are referring to seems to be providing an interface to change the
> msix count for a particular VF. This patch is providing a interface to set the total
> msix count for all the possible VFs from the available device level pool of
> msix-vectors.
> 

It looks like we should implement both: resources to configure the "pool" of available vectors for each VF, and the sysfs VF interface to allow configuring individual VFs.
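
As an illustration of how the two interfaces could complement each other
(a sketch only: the devlink path comes from the cover letter, the sysfs
attributes are the ones proposed in the series Leon referenced, and the
VF address is hypothetical):

# device-level pool split via devlink resources
$ sudo devlink resource set pci/0000:31:00.0 path msix/msix_eth size 16
$ sudo devlink dev reload pci/0000:31:00.0
# per-VF MSI-X count via the referenced sysfs interface
$ cat /sys/bus/pci/devices/0000:31:00.0/sriov_vf_total_msix
$ echo 8 > /sys/bus/pci/devices/0000:31:01.0/sriov_vf_msix_count   # hypothetical VF address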

Thanks,
Jake
Leon Romanovsky Nov. 14, 2022, 5:07 p.m. UTC | #4
On Mon, Nov 14, 2022 at 09:31:11AM -0600, Samudrala, Sridhar wrote:
> On 11/14/2022 7:23 AM, Leon Romanovsky wrote:
> > On Mon, Nov 14, 2022 at 01:57:42PM +0100, Michal Swiatkowski wrote:
> > > Currently the default value for number of PF vectors is number of CPUs.
> > > Because of that there are cases when all vectors are used for PF
> > > and user can't create more VFs. It is hard to set default number of
> > > CPUs right for all different use cases. Instead allow user to choose
> > > how many vectors should be used for various features. After implementing
> > > subdevices this mechanism will be also used to set number of vectors
> > > for subfunctions.
> > > 
> > > The idea is to set vectors for eth or VFs using devlink resource API.
> > > New value of vectors will be used after devlink reinit. Example
> > > commands:
> > > $ sudo devlink resource set pci/0000:31:00.0 path msix/msix_eth size 16
> > > $ sudo devlink dev reload pci/0000:31:00.0
> > > After reload driver will work with 16 vectors used for eth instead of
> > > num_cpus.
> > By saying "vectors", are you referring to MSI-X vectors?
> > If yes, you have specific interface for that.
> > https://lore.kernel.org/linux-pci/20210314124256.70253-1-leon@kernel.org/
> 
> This patch series is exposing a resources API to split the device level MSI-X vectors
> across the different functions supported by the device (PF, RDMA, SR-IOV VFs and
> in future subfunctions). Today this is all hidden in a policy implemented within
> the PF driver.

Maybe we are talking about different VFs, but if you refer to PCI VFs,
the number of MSI-X vectors comes from the PCI config space for that
specific VF.

You shouldn't set any value through netdev as it will cause a difference
in output between lspci (which doesn't require any driver) and your
newly set number.

Also, in the RDMA case it is not clear what you will achieve with this
setting either.

Thanks
Leon Romanovsky Nov. 14, 2022, 5:09 p.m. UTC | #5
On Mon, Nov 14, 2022 at 04:58:57PM +0000, Keller, Jacob E wrote:
> 
> 
> > -----Original Message-----
> > From: Samudrala, Sridhar <sridhar.samudrala@intel.com>
> > Sent: Monday, November 14, 2022 7:31 AM
> > To: Leon Romanovsky <leon@kernel.org>; Michal Swiatkowski
> > <michal.swiatkowski@linux.intel.com>
> > Cc: netdev@vger.kernel.org; davem@davemloft.net; kuba@kernel.org;
> > pabeni@redhat.com; edumazet@google.com; intel-wired-lan@lists.osuosl.org;
> > jiri@nvidia.com; Nguyen, Anthony L <anthony.l.nguyen@intel.com>; Lobakin,
> > Alexandr <alexandr.lobakin@intel.com>; Drewek, Wojciech
> > <wojciech.drewek@intel.com>; Czapnik, Lukasz <lukasz.czapnik@intel.com>;
> > Saleem, Shiraz <shiraz.saleem@intel.com>; Brandeburg, Jesse
> > <jesse.brandeburg@intel.com>; Ismail, Mustafa <mustafa.ismail@intel.com>;
> > Kitszel, Przemyslaw <przemyslaw.kitszel@intel.com>; Raczynski, Piotr
> > <piotr.raczynski@intel.com>; Keller, Jacob E <jacob.e.keller@intel.com>; Ertman,
> > David M <david.m.ertman@intel.com>; Kaliszczuk, Leszek
> > <leszek.kaliszczuk@intel.com>
> > Subject: Re: [PATCH net-next 00/13] resource management using devlink reload
> > 
> > On 11/14/2022 7:23 AM, Leon Romanovsky wrote:
> > > On Mon, Nov 14, 2022 at 01:57:42PM +0100, Michal Swiatkowski wrote:
> > >> Currently the default value for number of PF vectors is number of CPUs.
> > >> Because of that there are cases when all vectors are used for PF
> > >> and user can't create more VFs. It is hard to set default number of
> > >> CPUs right for all different use cases. Instead allow user to choose
> > >> how many vectors should be used for various features. After implementing
> > >> subdevices this mechanism will be also used to set number of vectors
> > >> for subfunctions.
> > >>
> > >> The idea is to set vectors for eth or VFs using devlink resource API.
> > >> New value of vectors will be used after devlink reinit. Example
> > >> commands:
> > >> $ sudo devlink resource set pci/0000:31:00.0 path msix/msix_eth size 16
> > >> $ sudo devlink dev reload pci/0000:31:00.0
> > >> After reload driver will work with 16 vectors used for eth instead of
> > >> num_cpus.
> > > By saying "vectors", are you referring to MSI-X vectors?
> > > If yes, you have specific interface for that.
> > > https://lore.kernel.org/linux-pci/20210314124256.70253-1-leon@kernel.org/
> > 
> > This patch series is exposing a resources API to split the device level MSI-X vectors
> > across the different functions supported by the device (PF, RDMA, SR-IOV VFs
> > and
> > in future subfunctions). Today this is all hidden in a policy implemented within
> > the PF driver.
> > 
> > The patch you are referring to seems to be providing an interface to change the
> > msix count for a particular VF. This patch is providing a interface to set the total
> > msix count for all the possible VFs from the available device level pool of
> > msix-vectors.
> > 
> 
> It looks like we should implement both: resources to configure the "pool" of available vectors for each VF, and the sysfs VF Interface to allow configuring individual VFs.

Yes, to be aligned with the PCI spec and to see coherent lspci output for VFs.

Thanks

> 
> Thanks,
> Jake
Michal Swiatkowski Nov. 15, 2022, 7 a.m. UTC | #6
On Mon, Nov 14, 2022 at 07:09:36PM +0200, Leon Romanovsky wrote:
> On Mon, Nov 14, 2022 at 04:58:57PM +0000, Keller, Jacob E wrote:
> > 
> > 
> > > -----Original Message-----
> > > From: Samudrala, Sridhar <sridhar.samudrala@intel.com>
> > > Sent: Monday, November 14, 2022 7:31 AM
> > > To: Leon Romanovsky <leon@kernel.org>; Michal Swiatkowski
> > > <michal.swiatkowski@linux.intel.com>
> > > Cc: netdev@vger.kernel.org; davem@davemloft.net; kuba@kernel.org;
> > > pabeni@redhat.com; edumazet@google.com; intel-wired-lan@lists.osuosl.org;
> > > jiri@nvidia.com; Nguyen, Anthony L <anthony.l.nguyen@intel.com>; Lobakin,
> > > Alexandr <alexandr.lobakin@intel.com>; Drewek, Wojciech
> > > <wojciech.drewek@intel.com>; Czapnik, Lukasz <lukasz.czapnik@intel.com>;
> > > Saleem, Shiraz <shiraz.saleem@intel.com>; Brandeburg, Jesse
> > > <jesse.brandeburg@intel.com>; Ismail, Mustafa <mustafa.ismail@intel.com>;
> > > Kitszel, Przemyslaw <przemyslaw.kitszel@intel.com>; Raczynski, Piotr
> > > <piotr.raczynski@intel.com>; Keller, Jacob E <jacob.e.keller@intel.com>; Ertman,
> > > David M <david.m.ertman@intel.com>; Kaliszczuk, Leszek
> > > <leszek.kaliszczuk@intel.com>
> > > Subject: Re: [PATCH net-next 00/13] resource management using devlink reload
> > > 
> > > On 11/14/2022 7:23 AM, Leon Romanovsky wrote:
> > > > On Mon, Nov 14, 2022 at 01:57:42PM +0100, Michal Swiatkowski wrote:
> > > >> Currently the default value for number of PF vectors is number of CPUs.
> > > >> Because of that there are cases when all vectors are used for PF
> > > >> and user can't create more VFs. It is hard to set default number of
> > > >> CPUs right for all different use cases. Instead allow user to choose
> > > >> how many vectors should be used for various features. After implementing
> > > >> subdevices this mechanism will be also used to set number of vectors
> > > >> for subfunctions.
> > > >>
> > > >> The idea is to set vectors for eth or VFs using devlink resource API.
> > > >> New value of vectors will be used after devlink reinit. Example
> > > >> commands:
> > > >> $ sudo devlink resource set pci/0000:31:00.0 path msix/msix_eth size 16
> > > >> $ sudo devlink dev reload pci/0000:31:00.0
> > > >> After reload driver will work with 16 vectors used for eth instead of
> > > >> num_cpus.
> > > > By saying "vectors", are you referring to MSI-X vectors?
> > > > If yes, you have specific interface for that.
> > > > https://lore.kernel.org/linux-pci/20210314124256.70253-1-leon@kernel.org/
> > > 
> > > This patch series is exposing a resources API to split the device level MSI-X vectors
> > > across the different functions supported by the device (PF, RDMA, SR-IOV VFs
> > > and
> > > in future subfunctions). Today this is all hidden in a policy implemented within
> > > the PF driver.
> > > 
> > > The patch you are referring to seems to be providing an interface to change the
> > > msix count for a particular VF. This patch is providing a interface to set the total
> > > msix count for all the possible VFs from the available device level pool of
> > > msix-vectors.
> > > 
> > 
> > It looks like we should implement both: resources to configure the "pool" of available vectors for each VF, and the sysfs VF Interface to allow configuring individual VFs.
> 
> Yes, to be aligned with PCI spec and see coherent lspci output for VFs.
> 
> Thanks
> 

Thanks, I will implement this sysfs interface to configure individual
VFs.

> > 
> > Thanks,
> > Jake
Michal Swiatkowski Nov. 15, 2022, 7:12 a.m. UTC | #7
On Mon, Nov 14, 2022 at 07:07:54PM +0200, Leon Romanovsky wrote:
> On Mon, Nov 14, 2022 at 09:31:11AM -0600, Samudrala, Sridhar wrote:
> > On 11/14/2022 7:23 AM, Leon Romanovsky wrote:
> > > On Mon, Nov 14, 2022 at 01:57:42PM +0100, Michal Swiatkowski wrote:
> > > > Currently the default value for number of PF vectors is number of CPUs.
> > > > Because of that there are cases when all vectors are used for PF
> > > > and user can't create more VFs. It is hard to set default number of
> > > > CPUs right for all different use cases. Instead allow user to choose
> > > > how many vectors should be used for various features. After implementing
> > > > subdevices this mechanism will be also used to set number of vectors
> > > > for subfunctions.
> > > > 
> > > > The idea is to set vectors for eth or VFs using devlink resource API.
> > > > New value of vectors will be used after devlink reinit. Example
> > > > commands:
> > > > $ sudo devlink resource set pci/0000:31:00.0 path msix/msix_eth size 16
> > > > $ sudo devlink dev reload pci/0000:31:00.0
> > > > After reload driver will work with 16 vectors used for eth instead of
> > > > num_cpus.
> > > By saying "vectors", are you referring to MSI-X vectors?
> > > If yes, you have specific interface for that.
> > > https://lore.kernel.org/linux-pci/20210314124256.70253-1-leon@kernel.org/
> > 
> > This patch series is exposing a resources API to split the device level MSI-X vectors
> > across the different functions supported by the device (PF, RDMA, SR-IOV VFs and
> > in future subfunctions). Today this is all hidden in a policy implemented within
> > the PF driver.
> 
> Maybe we are talking about different VFs, but if you refer to PCI VFs,
> the amount of MSI-X comes from PCI config space for that specific VF.
> 
> You shouldn't set any value through netdev as it will cause to
> difference in output between lspci (which doesn't require any driver)
> and your newly set number.

If I understand correctly, lspci shows the MSI-X number for an
individual VF. The value set via devlink is the total number of MSI-X
vectors that can be used when creating VFs. As Jake said, I will fix the
code to track both values. Thanks for pointing out the patch.

> 
> Also in RDMA case, it is not clear what will you achieve by this
> setting too.
>

We have a limited number of MSI-X vectors (1024) in the device. Because
of that, the number of MSI-X vectors for each feature is set to the best
values: half for ethernet, half for RDMA. This patchset allows the user
to change these values. If more MSI-X vectors are wanted for ethernet,
the MSI-X count for RDMA can be decreased.
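
As a rough illustration of that rebalancing (assuming a hypothetical
msix/msix_rdma resource path alongside the msix/msix_eth path from the
cover letter; the real resource names come from the driver):
# shrink the RDMA reservation to give ethernet more vectors (illustrative sizes)
$ sudo devlink resource set pci/0000:31:00.0 path msix/msix_eth size 96
$ sudo devlink resource set pci/0000:31:00.0 path msix/msix_rdma size 32
$ sudo devlink dev reload pci/0000:31:00.0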

> Thanks

Thanks for reviewing
Leon Romanovsky Nov. 15, 2022, 8:11 a.m. UTC | #8
On Tue, Nov 15, 2022 at 08:12:52AM +0100, Michal Swiatkowski wrote:
> On Mon, Nov 14, 2022 at 07:07:54PM +0200, Leon Romanovsky wrote:
> > On Mon, Nov 14, 2022 at 09:31:11AM -0600, Samudrala, Sridhar wrote:
> > > On 11/14/2022 7:23 AM, Leon Romanovsky wrote:
> > > > On Mon, Nov 14, 2022 at 01:57:42PM +0100, Michal Swiatkowski wrote:
> > > > > Currently the default value for number of PF vectors is number of CPUs.
> > > > > Because of that there are cases when all vectors are used for PF
> > > > > and user can't create more VFs. It is hard to set default number of
> > > > > CPUs right for all different use cases. Instead allow user to choose
> > > > > how many vectors should be used for various features. After implementing
> > > > > subdevices this mechanism will be also used to set number of vectors
> > > > > for subfunctions.
> > > > > 
> > > > > The idea is to set vectors for eth or VFs using devlink resource API.
> > > > > New value of vectors will be used after devlink reinit. Example
> > > > > commands:
> > > > > $ sudo devlink resource set pci/0000:31:00.0 path msix/msix_eth size 16
> > > > > $ sudo devlink dev reload pci/0000:31:00.0
> > > > > After reload driver will work with 16 vectors used for eth instead of
> > > > > num_cpus.
> > > > By saying "vectors", are you referring to MSI-X vectors?
> > > > If yes, you have specific interface for that.
> > > > https://lore.kernel.org/linux-pci/20210314124256.70253-1-leon@kernel.org/
> > > 
> > > This patch series is exposing a resources API to split the device level MSI-X vectors
> > > across the different functions supported by the device (PF, RDMA, SR-IOV VFs and
> > > in future subfunctions). Today this is all hidden in a policy implemented within
> > > the PF driver.
> > 
> > Maybe we are talking about different VFs, but if you refer to PCI VFs,
> > the amount of MSI-X comes from PCI config space for that specific VF.
> > 
> > You shouldn't set any value through netdev as it will cause to
> > difference in output between lspci (which doesn't require any driver)
> > and your newly set number.
> 
> If I understand correctly, lspci shows the MSI-X number for individual
> VF. Value set via devlink is the total number of MSI-X that can be used
> when creating VFs. 

Yes and no, lspci shows how many MSI-X vectors exist from the HW point
of view. The driver can use fewer than that. It is exactly like your
proposed devlink interface.
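
One way to see that difference on a running system (illustrative; the
address follows the example used earlier in the thread):
$ lspci -vs 0000:31:00.0 | grep MSI-X                      # count advertised by the HW capability
$ ls /sys/bus/pci/devices/0000:31:00.0/msi_irqs | wc -l    # vectors the driver actually allocated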


> As Jake said I will fix the code to track both values. Thanks for pointing the patch.
> 
> > 
> > Also in RDMA case, it is not clear what will you achieve by this
> > setting too.
> >
> 
> We have limited number of MSI-X (1024) in the device. Because of that
> the amount of MSI-X for each feature is set to the best values. Half for
> ethernet, half for RDMA. This patchset allow user to change this values.
> If he wants more MSI-X for ethernet, he can decrease MSI-X for RDMA.

RDMA devices don't have PCI logic and everything is controlled through
your main core module. It means that when you create the RDMA auxiliary
device, it will be connected to the netdev (RoCE and iWARP) and that
netdev should deal with the vectors. So I still don't understand what
"half for RDMA" means.

Thanks
Michal Swiatkowski Nov. 15, 2022, 9:04 a.m. UTC | #9
On Tue, Nov 15, 2022 at 10:11:10AM +0200, Leon Romanovsky wrote:
> On Tue, Nov 15, 2022 at 08:12:52AM +0100, Michal Swiatkowski wrote:
> > On Mon, Nov 14, 2022 at 07:07:54PM +0200, Leon Romanovsky wrote:
> > > On Mon, Nov 14, 2022 at 09:31:11AM -0600, Samudrala, Sridhar wrote:
> > > > On 11/14/2022 7:23 AM, Leon Romanovsky wrote:
> > > > > On Mon, Nov 14, 2022 at 01:57:42PM +0100, Michal Swiatkowski wrote:
> > > > > > Currently the default value for number of PF vectors is number of CPUs.
> > > > > > Because of that there are cases when all vectors are used for PF
> > > > > > and user can't create more VFs. It is hard to set default number of
> > > > > > CPUs right for all different use cases. Instead allow user to choose
> > > > > > how many vectors should be used for various features. After implementing
> > > > > > subdevices this mechanism will be also used to set number of vectors
> > > > > > for subfunctions.
> > > > > > 
> > > > > > The idea is to set vectors for eth or VFs using devlink resource API.
> > > > > > New value of vectors will be used after devlink reinit. Example
> > > > > > commands:
> > > > > > $ sudo devlink resource set pci/0000:31:00.0 path msix/msix_eth size 16
> > > > > > $ sudo devlink dev reload pci/0000:31:00.0
> > > > > > After reload driver will work with 16 vectors used for eth instead of
> > > > > > num_cpus.
> > > > > By saying "vectors", are you referring to MSI-X vectors?
> > > > > If yes, you have specific interface for that.
> > > > > https://lore.kernel.org/linux-pci/20210314124256.70253-1-leon@kernel.org/
> > > > 
> > > > This patch series is exposing a resources API to split the device level MSI-X vectors
> > > > across the different functions supported by the device (PF, RDMA, SR-IOV VFs and
> > > > in future subfunctions). Today this is all hidden in a policy implemented within
> > > > the PF driver.
> > > 
> > > Maybe we are talking about different VFs, but if you refer to PCI VFs,
> > > the amount of MSI-X comes from PCI config space for that specific VF.
> > > 
> > > You shouldn't set any value through netdev as it will cause to
> > > difference in output between lspci (which doesn't require any driver)
> > > and your newly set number.
> > 
> > If I understand correctly, lspci shows the MSI-X number for individual
> > VF. Value set via devlink is the total number of MSI-X that can be used
> > when creating VFs. 
> 
> Yes and no, lspci shows how much MSI-X vectors exist from HW point of
> view. Driver can use less than that. It is exactly as your proposed
> devlink interface.
> 
> 

Ok, I have to take a closer look at it. So, are you saying that we
should drop this devlink solution and use the sysfs interface for VFs,
or are you fine with having both? What about MSI-X allocation for
subfunctions?

> > As Jake said I will fix the code to track both values. Thanks for pointing the patch.
> > 
> > > 
> > > Also in RDMA case, it is not clear what will you achieve by this
> > > setting too.
> > >
> > 
> > We have limited number of MSI-X (1024) in the device. Because of that
> > the amount of MSI-X for each feature is set to the best values. Half for
> > ethernet, half for RDMA. This patchset allow user to change this values.
> > If he wants more MSI-X for ethernet, he can decrease MSI-X for RDMA.
> 
> RDMA devices doesn't have PCI logic and everything is controlled through
> you main core module. It means that when you create RDMA auxiliary device,
> it will be connected to netdev (RoCE and iWARP) and that netdev should
> deal with vectors. So I still don't understand what does it mean "half
> for RDMA".
>

Yes, it is controlled by the module, but during probe the MSI-X vectors
for RDMA are reserved and can't be used by ethernet. For example, I have
64 CPUs; when loading, I get 64 vectors from HW for ethernet and 64 for
RDMA. The vectors for RDMA will be consumed by the irdma driver, so I
won't be able to use them in ethernet and vice versa.

By saying they can't be used I mean that the irdma driver received the
MSI-X vector numbers and is using them (connected them with RDMA
interrupts).

The devlink resource is a way to change the number of MSI-X vectors that
will be reserved for RDMA. You wrote that the netdev should deal with
the vectors, but how will the netdev know how many vectors should go to
the RDMA aux device? Is there an interface for setting the number of
vectors for the RDMA device?

Thanks

> Thanks
Leon Romanovsky Nov. 15, 2022, 9:32 a.m. UTC | #10
On Tue, Nov 15, 2022 at 10:04:49AM +0100, Michal Swiatkowski wrote:
> On Tue, Nov 15, 2022 at 10:11:10AM +0200, Leon Romanovsky wrote:
> > On Tue, Nov 15, 2022 at 08:12:52AM +0100, Michal Swiatkowski wrote:
> > > On Mon, Nov 14, 2022 at 07:07:54PM +0200, Leon Romanovsky wrote:
> > > > On Mon, Nov 14, 2022 at 09:31:11AM -0600, Samudrala, Sridhar wrote:
> > > > > On 11/14/2022 7:23 AM, Leon Romanovsky wrote:
> > > > > > On Mon, Nov 14, 2022 at 01:57:42PM +0100, Michal Swiatkowski wrote:
> > > > > > > Currently the default value for number of PF vectors is number of CPUs.
> > > > > > > Because of that there are cases when all vectors are used for PF
> > > > > > > and user can't create more VFs. It is hard to set default number of
> > > > > > > CPUs right for all different use cases. Instead allow user to choose
> > > > > > > how many vectors should be used for various features. After implementing
> > > > > > > subdevices this mechanism will be also used to set number of vectors
> > > > > > > for subfunctions.
> > > > > > > 
> > > > > > > The idea is to set vectors for eth or VFs using devlink resource API.
> > > > > > > New value of vectors will be used after devlink reinit. Example
> > > > > > > commands:
> > > > > > > $ sudo devlink resource set pci/0000:31:00.0 path msix/msix_eth size 16
> > > > > > > $ sudo devlink dev reload pci/0000:31:00.0
> > > > > > > After reload driver will work with 16 vectors used for eth instead of
> > > > > > > num_cpus.
> > > > > > By saying "vectors", are you referring to MSI-X vectors?
> > > > > > If yes, you have specific interface for that.
> > > > > > https://lore.kernel.org/linux-pci/20210314124256.70253-1-leon@kernel.org/
> > > > > 
> > > > > This patch series is exposing a resources API to split the device level MSI-X vectors
> > > > > across the different functions supported by the device (PF, RDMA, SR-IOV VFs and
> > > > > in future subfunctions). Today this is all hidden in a policy implemented within
> > > > > the PF driver.
> > > > 
> > > > Maybe we are talking about different VFs, but if you refer to PCI VFs,
> > > > the amount of MSI-X comes from PCI config space for that specific VF.
> > > > 
> > > > You shouldn't set any value through netdev as it will cause to
> > > > difference in output between lspci (which doesn't require any driver)
> > > > and your newly set number.
> > > 
> > > If I understand correctly, lspci shows the MSI-X number for individual
> > > VF. Value set via devlink is the total number of MSI-X that can be used
> > > when creating VFs. 
> > 
> > Yes and no, lspci shows how much MSI-X vectors exist from HW point of
> > view. Driver can use less than that. It is exactly as your proposed
> > devlink interface.
> > 
> > 
> 
> Ok, I have to take a closer look at it. So, are You saing that we should
> drop this devlink solution and use sysfs interface fo VFs or are You
> fine with having both? What with MSI-X allocation for subfunction?

You should drop it for VFs and PFs and keep it for SFs only.

> 
> > > As Jake said I will fix the code to track both values. Thanks for pointing the patch.
> > > 
> > > > 
> > > > Also in RDMA case, it is not clear what will you achieve by this
> > > > setting too.
> > > >
> > > 
> > > We have limited number of MSI-X (1024) in the device. Because of that
> > > the amount of MSI-X for each feature is set to the best values. Half for
> > > ethernet, half for RDMA. This patchset allow user to change this values.
> > > If he wants more MSI-X for ethernet, he can decrease MSI-X for RDMA.
> > 
> > RDMA devices doesn't have PCI logic and everything is controlled through
> > you main core module. It means that when you create RDMA auxiliary device,
> > it will be connected to netdev (RoCE and iWARP) and that netdev should
> > deal with vectors. So I still don't understand what does it mean "half
> > for RDMA".
> >
> 
> Yes, it is controlled by module, but during probe, MSI-X vectors for RDMA
> are reserved and can't be used by ethernet. For example I have
> 64 CPUs, when loading I get 64 vectors from HW for ethernet and 64 for
> RDMA. The vectors for RDMA will be consumed by irdma driver, so I won't
> be able to use it in ethernet and vice versa.
> 
> By saing it can't be used I mean that irdma driver received the MSI-X
> vectors number and it is using them (connected them with RDMA interrupts).
> 
> Devlink resource is a way to change the number of MSI-X vectors that
> will be reserved for RDMA. You wrote that netdev should deal with
> vectors, but how netdev will know how many vectors should go to RDMA aux
> device? Does there an interface for setting the vectors amount for RDMA
> device?

When the RDMA core adds a device, it calls irdma_init_rdma_device() and
num_comp_vectors is actually the number of MSI-X vectors which you want
to give to that device.

I'm trying to say that you probably shouldn't reserve specific vectors
for ethernet and RDMA, and should simply share the same vectors. RDMA
applications that care about performance set comp_vector through
user-space verbs anyway.

Thanks

> 
> Thanks
> 
> > Thanks
Michal Swiatkowski Nov. 15, 2022, 10:16 a.m. UTC | #11
On Tue, Nov 15, 2022 at 11:32:14AM +0200, Leon Romanovsky wrote:
> On Tue, Nov 15, 2022 at 10:04:49AM +0100, Michal Swiatkowski wrote:
> > On Tue, Nov 15, 2022 at 10:11:10AM +0200, Leon Romanovsky wrote:
> > > On Tue, Nov 15, 2022 at 08:12:52AM +0100, Michal Swiatkowski wrote:
> > > > On Mon, Nov 14, 2022 at 07:07:54PM +0200, Leon Romanovsky wrote:
> > > > > On Mon, Nov 14, 2022 at 09:31:11AM -0600, Samudrala, Sridhar wrote:
> > > > > > On 11/14/2022 7:23 AM, Leon Romanovsky wrote:
> > > > > > > On Mon, Nov 14, 2022 at 01:57:42PM +0100, Michal Swiatkowski wrote:
> > > > > > > > Currently the default value for number of PF vectors is number of CPUs.
> > > > > > > > Because of that there are cases when all vectors are used for PF
> > > > > > > > and user can't create more VFs. It is hard to set default number of
> > > > > > > > CPUs right for all different use cases. Instead allow user to choose
> > > > > > > > how many vectors should be used for various features. After implementing
> > > > > > > > subdevices this mechanism will be also used to set number of vectors
> > > > > > > > for subfunctions.
> > > > > > > > 
> > > > > > > > The idea is to set vectors for eth or VFs using devlink resource API.
> > > > > > > > New value of vectors will be used after devlink reinit. Example
> > > > > > > > commands:
> > > > > > > > $ sudo devlink resource set pci/0000:31:00.0 path msix/msix_eth size 16
> > > > > > > > $ sudo devlink dev reload pci/0000:31:00.0
> > > > > > > > After reload driver will work with 16 vectors used for eth instead of
> > > > > > > > num_cpus.
> > > > > > > By saying "vectors", are you referring to MSI-X vectors?
> > > > > > > If yes, you have specific interface for that.
> > > > > > > https://lore.kernel.org/linux-pci/20210314124256.70253-1-leon@kernel.org/
> > > > > > 
> > > > > > This patch series is exposing a resources API to split the device level MSI-X vectors
> > > > > > across the different functions supported by the device (PF, RDMA, SR-IOV VFs and
> > > > > > in future subfunctions). Today this is all hidden in a policy implemented within
> > > > > > the PF driver.
> > > > > 
> > > > > Maybe we are talking about different VFs, but if you refer to PCI VFs,
> > > > > the amount of MSI-X comes from PCI config space for that specific VF.
> > > > > 
> > > > > You shouldn't set any value through netdev as it will cause to
> > > > > difference in output between lspci (which doesn't require any driver)
> > > > > and your newly set number.
> > > > 
> > > > If I understand correctly, lspci shows the MSI-X number for individual
> > > > VF. Value set via devlink is the total number of MSI-X that can be used
> > > > when creating VFs. 
> > > 
> > > Yes and no, lspci shows how much MSI-X vectors exist from HW point of
> > > view. Driver can use less than that. It is exactly as your proposed
> > > devlink interface.
> > > 
> > > 
> > 
> > Ok, I have to take a closer look at it. So, are You saing that we should
> > drop this devlink solution and use sysfs interface fo VFs or are You
> > fine with having both? What with MSI-X allocation for subfunction?
> 
> You should drop for VFs and PFs and keep it for SFs only.
> 

I understand that MSI-X for VFs can be set via the sysfs interface, but
what about PFs? Should we always allow the max MSI-X for PFs, i.e.
hw_max - used - sfs? Is it safe to always call pci_enable_msix with the
max number of vectors supported by the device?

I added the value for PFs because we faced a problem with MSI-X
allocation on an 8-port device. Our default value (num_cpus) was too big
(not enough vectors in HW). Changing the number of vectors that can be
used on the PFs solved the issue.

Let me write an example. The default MSI-X for a PF is set to num_cpus,
the platform has 128 CPUs, we have an 8-port device installed there and
1024 vectors in HW (simplified, because I don't count the additional
interrupts). We run out of vectors and there are 0 vectors left for VFs.
Sure, it is easy to handle: we can divide the PF interrupts by 2 and end
up with 512 vectors for VFs. I assume that with the current sysfs
interface, MSI-X for VFs can then be set from 0 to 512? What if the user
wants more? If there is a PF MSI-X value which can be set by the user,
the user can decrease it and use more vectors for VFs. Is that possible
with the current VF sysfs interface? I mean, setting the VF MSI-X
vectors to a value that requires decreasing the MSI-X for the PFs.
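
Spelling out the arithmetic of that example (numbers from this thread,
counting only data vectors):

  1024 vectors available in HW
  8 PFs * 128 vectors (num_cpus) = 1024  ->   0 vectors left for VFs
  8 PFs *  64 vectors            =  512  -> 512 vectors left for VFs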

> > 
> > > > As Jake said I will fix the code to track both values. Thanks for pointing the patch.
> > > > 
> > > > > 
> > > > > Also in RDMA case, it is not clear what will you achieve by this
> > > > > setting too.
> > > > >
> > > > 
> > > > We have limited number of MSI-X (1024) in the device. Because of that
> > > > the amount of MSI-X for each feature is set to the best values. Half for
> > > > ethernet, half for RDMA. This patchset allow user to change this values.
> > > > If he wants more MSI-X for ethernet, he can decrease MSI-X for RDMA.
> > > 
> > > RDMA devices doesn't have PCI logic and everything is controlled through
> > > you main core module. It means that when you create RDMA auxiliary device,
> > > it will be connected to netdev (RoCE and iWARP) and that netdev should
> > > deal with vectors. So I still don't understand what does it mean "half
> > > for RDMA".
> > >
> > 
> > Yes, it is controlled by module, but during probe, MSI-X vectors for RDMA
> > are reserved and can't be used by ethernet. For example I have
> > 64 CPUs, when loading I get 64 vectors from HW for ethernet and 64 for
> > RDMA. The vectors for RDMA will be consumed by irdma driver, so I won't
> > be able to use it in ethernet and vice versa.
> > 
> > By saing it can't be used I mean that irdma driver received the MSI-X
> > vectors number and it is using them (connected them with RDMA interrupts).
> > 
> > Devlink resource is a way to change the number of MSI-X vectors that
> > will be reserved for RDMA. You wrote that netdev should deal with
> > vectors, but how netdev will know how many vectors should go to RDMA aux
> > device? Does there an interface for setting the vectors amount for RDMA
> > device?
> 
> When RDMA core adds device, it calls to irdma_init_rdma_device() and
> num_comp_vectors is actually the number of MSI-X vectors which you want
> to give to that device.
> 
> I'm trying to say that probably you shouldn't reserve specific vectors
> for both ethernet and RDMA and simply share same vectors. RDMA applications
> that care about performance set comp_vector through user space verbs anyway.
> 

Thanks for the explanation, I appreciate it. In our driver,
num_comp_vectors for RDMA is set during driver probe. Do we have any
interface to change num_comp_vectors while the driver is running?

Sorry, I am not fully familiar with RDMA. Can a user RDMA app set
comp_vector to any value, or only up to the max, which is the
num_comp_vectors given to RDMA while creating the aux device?

Assuming that I gave 64 MSI-X vectors to RDMA by setting
num_comp_vectors to 64, how will I know whether I can or can't use these
vectors in ethernet?

Thanks

> Thanks
> 
> > 
> > Thanks
> > 
> > > Thanks
Leon Romanovsky Nov. 15, 2022, 12:12 p.m. UTC | #12
On Tue, Nov 15, 2022 at 11:16:58AM +0100, Michal Swiatkowski wrote:
> On Tue, Nov 15, 2022 at 11:32:14AM +0200, Leon Romanovsky wrote:
> > On Tue, Nov 15, 2022 at 10:04:49AM +0100, Michal Swiatkowski wrote:
> > > On Tue, Nov 15, 2022 at 10:11:10AM +0200, Leon Romanovsky wrote:
> > > > On Tue, Nov 15, 2022 at 08:12:52AM +0100, Michal Swiatkowski wrote:
> > > > > On Mon, Nov 14, 2022 at 07:07:54PM +0200, Leon Romanovsky wrote:
> > > > > > On Mon, Nov 14, 2022 at 09:31:11AM -0600, Samudrala, Sridhar wrote:
> > > > > > > On 11/14/2022 7:23 AM, Leon Romanovsky wrote:
> > > > > > > > On Mon, Nov 14, 2022 at 01:57:42PM +0100, Michal Swiatkowski wrote:
> > > > > > > > > Currently the default value for number of PF vectors is number of CPUs.
> > > > > > > > > Because of that there are cases when all vectors are used for PF
> > > > > > > > > and user can't create more VFs. It is hard to set default number of
> > > > > > > > > CPUs right for all different use cases. Instead allow user to choose
> > > > > > > > > how many vectors should be used for various features. After implementing
> > > > > > > > > subdevices this mechanism will be also used to set number of vectors
> > > > > > > > > for subfunctions.
> > > > > > > > > 
> > > > > > > > > The idea is to set vectors for eth or VFs using devlink resource API.
> > > > > > > > > New value of vectors will be used after devlink reinit. Example
> > > > > > > > > commands:
> > > > > > > > > $ sudo devlink resource set pci/0000:31:00.0 path msix/msix_eth size 16
> > > > > > > > > $ sudo devlink dev reload pci/0000:31:00.0
> > > > > > > > > After reload driver will work with 16 vectors used for eth instead of
> > > > > > > > > num_cpus.
> > > > > > > > By saying "vectors", are you referring to MSI-X vectors?
> > > > > > > > If yes, you have specific interface for that.
> > > > > > > > https://lore.kernel.org/linux-pci/20210314124256.70253-1-leon@kernel.org/
> > > > > > > 
> > > > > > > This patch series is exposing a resources API to split the device level MSI-X vectors
> > > > > > > across the different functions supported by the device (PF, RDMA, SR-IOV VFs and
> > > > > > > in future subfunctions). Today this is all hidden in a policy implemented within
> > > > > > > the PF driver.
> > > > > > 
> > > > > > Maybe we are talking about different VFs, but if you refer to PCI VFs,
> > > > > > the amount of MSI-X comes from PCI config space for that specific VF.
> > > > > > 
> > > > > > You shouldn't set any value through netdev as it will cause to
> > > > > > difference in output between lspci (which doesn't require any driver)
> > > > > > and your newly set number.
> > > > > 
> > > > > If I understand correctly, lspci shows the MSI-X number for individual
> > > > > VF. Value set via devlink is the total number of MSI-X that can be used
> > > > > when creating VFs. 
> > > > 
> > > > Yes and no, lspci shows how much MSI-X vectors exist from HW point of
> > > > view. Driver can use less than that. It is exactly as your proposed
> > > > devlink interface.
> > > > 
> > > > 
> > > 
> > > Ok, I have to take a closer look at it. So, are You saing that we should
> > > drop this devlink solution and use sysfs interface fo VFs or are You
> > > fine with having both? What with MSI-X allocation for subfunction?
> > 
> > You should drop for VFs and PFs and keep it for SFs only.
> > 
> 
> I understand that MSI-X for VFs can be set via sysfs interface, but what
> with PFs? 

PFs are even trickier than VFs, as you are changing that number while
the driver is bound. This makes me wonder what the lspci output will be,
as you will need to show the right number before the driver starts to
load.

You need to present the right value if the user decides to unbind the
driver from the PF too.
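
(For context, the unbind referred to here would be the standard sysfs
one, e.g., assuming the ice driver discussed in this thread:
$ echo 0000:31:00.0 > /sys/bus/pci/drivers/ice/unbind
and lspci still has to report a sensible MSI-X count for the function
afterwards.)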

> Should we always allow max MSI-X for PFs? So hw_max - used -
> sfs? Is it save to call pci_enable_msix always with max vectors
> supported by device?

I'm not sure. I think that it won't give you much if you enable
more than num_online_cpu().

> 
> I added the value for PFs, because we faced a problem with MSI-X
> allocation on 8 port device. Our default value (num_cpus) was too big
> (not enough vectors in hw). Changing the amount of vectors that can be
> used on PFs was solving the issue.

We had something similar for mlx5 SFs, where we don't have enough vectors.
Our solution is simply to move to automatic shared MSI-X mode. I would
advocate for that for you as well. 

> 
> Let me write an example. As default MSI-X for PF is set to num_cpus, the
> platform have 128 CPUs, we have 8 port device installed there and still
> have 1024 vectors in HW (I simplified because I don't count additional
> interrupts). We run out of vectors, there is 0 vectors that can be used
> for VFs. Sure, it is easy to handle, we can divide PFs interrupts by 2
> and will end with 512 vectors for VFs. I assume that with current sysfs
> interface in this situation MSI-X for VFs can be set from 0 to 512? What
> if user wants more? If there is a PFs MSI-X value which can be set by
> user, user can decrease the value and use more vectors for VFs. Is it
> possible in current VFs sysfs interface? I mean, setting VFs MSI-X
> vectors to value that will need to decrease MSI-X for PFs.

You can't do it, and this limitation exists because the PF is bound.
You can't change that number while the driver is running. AFAIR, such a
change would be a PCI spec violation.

> 
> > > 
> > > > > As Jake said I will fix the code to track both values. Thanks for pointing the patch.
> > > > > 
> > > > > > 
> > > > > > Also in RDMA case, it is not clear what will you achieve by this
> > > > > > setting too.
> > > > > >
> > > > > 
> > > > > We have limited number of MSI-X (1024) in the device. Because of that
> > > > > the amount of MSI-X for each feature is set to the best values. Half for
> > > > > ethernet, half for RDMA. This patchset allow user to change this values.
> > > > > If he wants more MSI-X for ethernet, he can decrease MSI-X for RDMA.
> > > > 
> > > > RDMA devices doesn't have PCI logic and everything is controlled through
> > > > you main core module. It means that when you create RDMA auxiliary device,
> > > > it will be connected to netdev (RoCE and iWARP) and that netdev should
> > > > deal with vectors. So I still don't understand what does it mean "half
> > > > for RDMA".
> > > >
> > > 
> > > Yes, it is controlled by module, but during probe, MSI-X vectors for RDMA
> > > are reserved and can't be used by ethernet. For example I have
> > > 64 CPUs, when loading I get 64 vectors from HW for ethernet and 64 for
> > > RDMA. The vectors for RDMA will be consumed by irdma driver, so I won't
> > > be able to use it in ethernet and vice versa.
> > > 
> > > By saing it can't be used I mean that irdma driver received the MSI-X
> > > vectors number and it is using them (connected them with RDMA interrupts).
> > > 
> > > Devlink resource is a way to change the number of MSI-X vectors that
> > > will be reserved for RDMA. You wrote that netdev should deal with
> > > vectors, but how netdev will know how many vectors should go to RDMA aux
> > > device? Does there an interface for setting the vectors amount for RDMA
> > > device?
> > 
> > When RDMA core adds device, it calls to irdma_init_rdma_device() and
> > num_comp_vectors is actually the number of MSI-X vectors which you want
> > to give to that device.
> > 
> > I'm trying to say that probably you shouldn't reserve specific vectors
> > for both ethernet and RDMA and simply share same vectors. RDMA applications
> > that care about performance set comp_vector through user space verbs anyway.
> > 
> 
> Thanks for explanation, appriciate that. In our driver num_comp_vectors for
> RDMA is set during driver probe. Do we have any interface to change
> num_comp_vectors while driver is working?

No, as this number is indirectly exposed to the user space.

> 
> Sorry, I am not fully familiar with RDMA. Can user app for RDMA set
> comp_vector to any value or only to max which is num_comp_vectors given
> for RDMA while creating aux device?

comp_vector is logically equal to an IRQ number, and this is how RDMA
applications control to which interrupt completions are delivered.

" The CQ will use the completion vector comp_vector for signaling
  completion events; it must be at least zero and less than
  context->num_comp_vectors. "
https://man7.org/linux/man-pages/man3/ibv_create_cq.3.html

> 
> Assuming that I gave 64 MSI-X for RDMA by setting num_comp_vectors to
> 64, how I will know if I can or can't use these vectors in ethernet?

Why should you need to know? Vectors are not exclusive and they can be
used by many applications at the same time. The thing is that it is
far-fetched to expect that high-performance RDMA applications and
high-performance ethernet can coexist on the same device at the same
time.

Thanks

> 
> Thanks
> 
> > Thanks
> > 
> > > 
> > > Thanks
> > > 
> > > > Thanks
Michal Swiatkowski Nov. 15, 2022, 2:02 p.m. UTC | #13
On Tue, Nov 15, 2022 at 02:12:12PM +0200, Leon Romanovsky wrote:
> On Tue, Nov 15, 2022 at 11:16:58AM +0100, Michal Swiatkowski wrote:
> > On Tue, Nov 15, 2022 at 11:32:14AM +0200, Leon Romanovsky wrote:
> > > On Tue, Nov 15, 2022 at 10:04:49AM +0100, Michal Swiatkowski wrote:
> > > > On Tue, Nov 15, 2022 at 10:11:10AM +0200, Leon Romanovsky wrote:
> > > > > On Tue, Nov 15, 2022 at 08:12:52AM +0100, Michal Swiatkowski wrote:
> > > > > > On Mon, Nov 14, 2022 at 07:07:54PM +0200, Leon Romanovsky wrote:
> > > > > > > On Mon, Nov 14, 2022 at 09:31:11AM -0600, Samudrala, Sridhar wrote:
> > > > > > > > On 11/14/2022 7:23 AM, Leon Romanovsky wrote:
> > > > > > > > > On Mon, Nov 14, 2022 at 01:57:42PM +0100, Michal Swiatkowski wrote:
> > > > > > > > > > Currently the default value for number of PF vectors is number of CPUs.
> > > > > > > > > > Because of that there are cases when all vectors are used for PF
> > > > > > > > > > and user can't create more VFs. It is hard to set default number of
> > > > > > > > > > CPUs right for all different use cases. Instead allow user to choose
> > > > > > > > > > how many vectors should be used for various features. After implementing
> > > > > > > > > > subdevices this mechanism will be also used to set number of vectors
> > > > > > > > > > for subfunctions.
> > > > > > > > > > 
> > > > > > > > > > The idea is to set vectors for eth or VFs using devlink resource API.
> > > > > > > > > > New value of vectors will be used after devlink reinit. Example
> > > > > > > > > > commands:
> > > > > > > > > > $ sudo devlink resource set pci/0000:31:00.0 path msix/msix_eth size 16
> > > > > > > > > > $ sudo devlink dev reload pci/0000:31:00.0
> > > > > > > > > > After reload driver will work with 16 vectors used for eth instead of
> > > > > > > > > > num_cpus.
> > > > > > > > > By saying "vectors", are you referring to MSI-X vectors?
> > > > > > > > > If yes, you have specific interface for that.
> > > > > > > > > https://lore.kernel.org/linux-pci/20210314124256.70253-1-leon@kernel.org/
> > > > > > > > 
> > > > > > > > This patch series is exposing a resources API to split the device level MSI-X vectors
> > > > > > > > across the different functions supported by the device (PF, RDMA, SR-IOV VFs and
> > > > > > > > in future subfunctions). Today this is all hidden in a policy implemented within
> > > > > > > > the PF driver.
> > > > > > > 
> > > > > > > Maybe we are talking about different VFs, but if you refer to PCI VFs,
> > > > > > > the amount of MSI-X comes from PCI config space for that specific VF.
> > > > > > > 
> > > > > > > You shouldn't set any value through netdev as it will cause to
> > > > > > > difference in output between lspci (which doesn't require any driver)
> > > > > > > and your newly set number.
> > > > > > 
> > > > > > If I understand correctly, lspci shows the MSI-X number for individual
> > > > > > VF. Value set via devlink is the total number of MSI-X that can be used
> > > > > > when creating VFs. 
> > > > > 
> > > > > Yes and no, lspci shows how much MSI-X vectors exist from HW point of
> > > > > view. Driver can use less than that. It is exactly as your proposed
> > > > > devlink interface.
> > > > > 
> > > > > 
> > > > 
> > > > Ok, I have to take a closer look at it. So, are You saing that we should
> > > > drop this devlink solution and use sysfs interface fo VFs or are You
> > > > fine with having both? What with MSI-X allocation for subfunction?
> > > 
> > > You should drop for VFs and PFs and keep it for SFs only.
> > > 
> > 
> > I understand that MSI-X for VFs can be set via sysfs interface, but what
> > with PFs? 
> 
> PFs are even more tricker than VFs, as you are changing that number
> while driver is bound. This makes me wonder what will be lspci output,
> as you will need to show right number before driver starts to load.
> 
> You need to present right value if user decided to unbind driver from PF too.
>

In the case of the ice driver, lspci -vs shows:
Capabilities: [70] MSI-X: Enable+ Count=1024 Masked

so all the vectors that the HW supports (PFs, VFs, misc, etc.). Because
of that, the total number of MSI-X vectors in the devlink example from
the cover letter is 1024.

I see that Mellanox shows:
Capabilities: [9c] MSI-X: Enable+ Count=64 Masked

I assume that 64 is in this case the MSI-X count only for this one PF
(which makes sense). To be honest, I don't know why we show the maximum
MSI-X for the whole device there, but because of that the value will be
the same after changing the allocation of MSI-X across features.

Isn't the MSI-X capability read from a HW register?

> > Should we always allow max MSI-X for PFs? So hw_max - used -
> > sfs? Is it save to call pci_enable_msix always with max vectors
> > supported by device?
> 
> I'm not sure. I think that it won't give you much if you enable
> more than num_online_cpu().
> 

Oh, yes, correct, I missed that.

> > 
> > I added the value for PFs, because we faced a problem with MSI-X
> > allocation on 8 port device. Our default value (num_cpus) was too big
> > (not enough vectors in hw). Changing the amount of vectors that can be
> > used on PFs was solving the issue.
> 
> We had something similar for mlx5 SFs, where we don't have enough vectors.
> Our solution is simply to move to automatic shared MSI-X mode. I would
> advocate for that for you as well. 
>

Thanks for proposing a solution, I will take a look at how this works in mlx5.

> > 
> > Let me write an example. As default MSI-X for PF is set to num_cpus, the
> > platform have 128 CPUs, we have 8 port device installed there and still
> > have 1024 vectors in HW (I simplified because I don't count additional
> > interrupts). We run out of vectors, there is 0 vectors that can be used
> > for VFs. Sure, it is easy to handle, we can divide PFs interrupts by 2
> > and will end with 512 vectors for VFs. I assume that with current sysfs
> > interface in this situation MSI-X for VFs can be set from 0 to 512? What
> > if user wants more? If there is a PFs MSI-X value which can be set by
> > user, user can decrease the value and use more vectors for VFs. Is it
> > possible in current VFs sysfs interface? I mean, setting VFs MSI-X
> > vectors to value that will need to decrease MSI-X for PFs.
> 
> You can't do it and this limitation is because PF is bound. You can't change
> that number while driver is running. AFAIR, such change will be PCI spec
> violation.
>

As in the previous comment, we have a different value in the MSI-X
capabilities field (the max number of vectors for the whole device), so
I won't allow the user to change this value. I don't know why it is done
like that (the MSI-X count for the whole device in the PCI caps). I will
try to dig into it.

> > 
> > > > 
> > > > > > As Jake said I will fix the code to track both values. Thanks for pointing the patch.
> > > > > > 
> > > > > > > 
> > > > > > > Also in RDMA case, it is not clear what will you achieve by this
> > > > > > > setting too.
> > > > > > >
> > > > > > 
> > > > > > We have limited number of MSI-X (1024) in the device. Because of that
> > > > > > the amount of MSI-X for each feature is set to the best values. Half for
> > > > > > ethernet, half for RDMA. This patchset allow user to change this values.
> > > > > > If he wants more MSI-X for ethernet, he can decrease MSI-X for RDMA.
> > > > > 
> > > > > RDMA devices doesn't have PCI logic and everything is controlled through
> > > > > you main core module. It means that when you create RDMA auxiliary device,
> > > > > it will be connected to netdev (RoCE and iWARP) and that netdev should
> > > > > deal with vectors. So I still don't understand what does it mean "half
> > > > > for RDMA".
> > > > >
> > > > 
> > > > Yes, it is controlled by module, but during probe, MSI-X vectors for RDMA
> > > > are reserved and can't be used by ethernet. For example I have
> > > > 64 CPUs, when loading I get 64 vectors from HW for ethernet and 64 for
> > > > RDMA. The vectors for RDMA will be consumed by irdma driver, so I won't
> > > > be able to use it in ethernet and vice versa.
> > > > 
> > > > By saing it can't be used I mean that irdma driver received the MSI-X
> > > > vectors number and it is using them (connected them with RDMA interrupts).
> > > > 
> > > > Devlink resource is a way to change the number of MSI-X vectors that
> > > > will be reserved for RDMA. You wrote that netdev should deal with
> > > > vectors, but how netdev will know how many vectors should go to RDMA aux
> > > > device? Does there an interface for setting the vectors amount for RDMA
> > > > device?
> > > 
> > > When RDMA core adds device, it calls to irdma_init_rdma_device() and
> > > num_comp_vectors is actually the number of MSI-X vectors which you want
> > > to give to that device.
> > > 
> > > I'm trying to say that probably you shouldn't reserve specific vectors
> > > for both ethernet and RDMA and simply share same vectors. RDMA applications
> > > that care about performance set comp_vector through user space verbs anyway.
> > > 
> > 
> > Thanks for explanation, appriciate that. In our driver num_comp_vectors for
> > RDMA is set during driver probe. Do we have any interface to change
> > num_comp_vectors while driver is working?
> 
> No, as this number is indirectly exposed to the user space.
> 

Ok, thanks.

> > 
> > Sorry, I am not fully familiar with RDMA. Can user app for RDMA set
> > comp_vector to any value or only to max which is num_comp_vectors given
> > for RDMA while creating aux device?
> 
> comp_vector logically equal to IRQ number and this is how RDMA
> applications controls to which interrupt deliver completions.
> 
> " The CQ will use the completion vector comp_vector for signaling
>   completion events; it must be at least zero and less than
>   context->num_comp_vectors. "
> https://man7.org/linux/man-pages/man3/ibv_create_cq.3.html
>

Thank You

> > 
> > Assuming that I gave 64 MSI-X for RDMA by setting num_comp_vectors to
> > 64, how I will know if I can or can't use these vectors in ethernet?
> 
> Why should you need to know? Vectors are not exclusive and they can be
> used by many applications at the same time. The thing is that it is far
> fetch to except that high performance RDMA applications and high
> performance ethernet can coexist on same device at the same time.
>

Yes, but after loading the aux device a part of the vectors
(num_comp_vectors) is reserved for the RDMA use case only (at least in
the ice driver). We thought that the devlink resource interface can be a
good way to change num_comp_vectors in this case.

Maybe I am wrong, but I think that the vectors we reserved for RDMA
can't be used by ethernet (in the ice driver). We sent specific MSI-X
entries to the rdma driver, so I don't think that the same entries can
be reused by ethernet. I am talking only about the ice + irdma case, am
I wrong (probably a question to Intel devs)?

Huge thanks for Your comments here, I understand more now.

> Thanks
> 
> > 
> > Thanks
> > 
> > > Thanks
> > > 
> > > > 
> > > > Thanks
> > > > 
> > > > > Thanks
Leon Romanovsky Nov. 15, 2022, 5:57 p.m. UTC | #14
On Tue, Nov 15, 2022 at 03:02:40PM +0100, Michal Swiatkowski wrote:
> On Tue, Nov 15, 2022 at 02:12:12PM +0200, Leon Romanovsky wrote:
> > On Tue, Nov 15, 2022 at 11:16:58AM +0100, Michal Swiatkowski wrote:
> > > On Tue, Nov 15, 2022 at 11:32:14AM +0200, Leon Romanovsky wrote:
> > > > On Tue, Nov 15, 2022 at 10:04:49AM +0100, Michal Swiatkowski wrote:
> > > > > On Tue, Nov 15, 2022 at 10:11:10AM +0200, Leon Romanovsky wrote:
> > > > > > On Tue, Nov 15, 2022 at 08:12:52AM +0100, Michal Swiatkowski wrote:
> > > > > > > On Mon, Nov 14, 2022 at 07:07:54PM +0200, Leon Romanovsky wrote:
> > > > > > > > On Mon, Nov 14, 2022 at 09:31:11AM -0600, Samudrala, Sridhar wrote:
> > > > > > > > > On 11/14/2022 7:23 AM, Leon Romanovsky wrote:
> > > > > > > > > > On Mon, Nov 14, 2022 at 01:57:42PM +0100, Michal Swiatkowski wrote:
> > > > > > > > > > > Currently the default value for number of PF vectors is number of CPUs.
> > > > > > > > > > > Because of that there are cases when all vectors are used for PF
> > > > > > > > > > > and user can't create more VFs. It is hard to set default number of
> > > > > > > > > > > CPUs right for all different use cases. Instead allow user to choose
> > > > > > > > > > > how many vectors should be used for various features. After implementing
> > > > > > > > > > > subdevices this mechanism will be also used to set number of vectors
> > > > > > > > > > > for subfunctions.
> > > > > > > > > > > 
> > > > > > > > > > > The idea is to set vectors for eth or VFs using devlink resource API.
> > > > > > > > > > > New value of vectors will be used after devlink reinit. Example
> > > > > > > > > > > commands:
> > > > > > > > > > > $ sudo devlink resource set pci/0000:31:00.0 path msix/msix_eth size 16
> > > > > > > > > > > $ sudo devlink dev reload pci/0000:31:00.0
> > > > > > > > > > > After reload driver will work with 16 vectors used for eth instead of
> > > > > > > > > > > num_cpus.
> > > > > > > > > > By saying "vectors", are you referring to MSI-X vectors?
> > > > > > > > > > If yes, you have specific interface for that.
> > > > > > > > > > https://lore.kernel.org/linux-pci/20210314124256.70253-1-leon@kernel.org/
> > > > > > > > > 
> > > > > > > > > This patch series is exposing a resources API to split the device level MSI-X vectors
> > > > > > > > > across the different functions supported by the device (PF, RDMA, SR-IOV VFs and
> > > > > > > > > in future subfunctions). Today this is all hidden in a policy implemented within
> > > > > > > > > the PF driver.
> > > > > > > > 
> > > > > > > > Maybe we are talking about different VFs, but if you refer to PCI VFs,
> > > > > > > > the amount of MSI-X comes from PCI config space for that specific VF.
> > > > > > > > 
> > > > > > > > You shouldn't set any value through netdev as it will cause to
> > > > > > > > difference in output between lspci (which doesn't require any driver)
> > > > > > > > and your newly set number.
> > > > > > > 
> > > > > > > If I understand correctly, lspci shows the MSI-X number for individual
> > > > > > > VF. Value set via devlink is the total number of MSI-X that can be used
> > > > > > > when creating VFs. 
> > > > > > 
> > > > > > Yes and no, lspci shows how much MSI-X vectors exist from HW point of
> > > > > > view. Driver can use less than that. It is exactly as your proposed
> > > > > > devlink interface.
> > > > > > 
> > > > > > 
> > > > > 
> > > > > Ok, I have to take a closer look at it. So, are You saing that we should
> > > > > drop this devlink solution and use sysfs interface fo VFs or are You
> > > > > fine with having both? What with MSI-X allocation for subfunction?
> > > > 
> > > > You should drop for VFs and PFs and keep it for SFs only.
> > > > 
> > > 
> > > I understand that MSI-X for VFs can be set via sysfs interface, but what
> > > with PFs? 
> > 
> > PFs are even more tricker than VFs, as you are changing that number
> > while driver is bound. This makes me wonder what will be lspci output,
> > as you will need to show right number before driver starts to load.
> > 
> > You need to present right value if user decided to unbind driver from PF too.
> >
> 
> In case of ice driver lspci -vs shows:
> Capabilities: [70] MSI-X: Enable+ Count=1024 Masked
> 
> so all vectors that hw supports (PFs, VFs, misc, etc). Because of that
> total number of MSI-X in the devlink example from cover letter is 1024.
> 
> I see that mellanox shows:
> Capabilities: [9c] MSI-X: Enable+ Count=64 Masked
> 
> I assume that 64 is in this case MSI-X ony for this one PF (it make
> sense).

Yes, and the PF MSI-X count can be changed through the FW configuration
tool, as we need to write the new value while the driver is unbound and
we need it to be persistent. Users expect to see a "stable" number any
time they reboot the server. It is not the case for VFs, as they are
explicitly created after reboots and start "fresh" after every boot.
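For the mlx5 case the flow is roughly the following (a sketch; the
device path is illustrative and the parameter name is from memory, so
it may differ per FW/tool version):
$ mlxconfig -d /dev/mst/mt4125_pciconf0 query | grep MSIX
$ mlxconfig -d /dev/mst/mt4125_pciconf0 set NUM_PF_MSIX=63
The new value is applied on the next FW reset/reboot.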

So we set a large enough, but not too large, value as the default for
PFs. If you find a sane model of how to change it through the kernel,
you can count on our support.

> To be honest I don't know why we show maximum MSI-X for the device
> there, but because of that the value will be the same afer changing
> allocation of MSI-X across features.
> 
> Isn't the MSI-X capabilities read from HW register?

Yes, it comes from Message Control register.
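It can also be read directly, with no driver bound, e.g. (BDF
illustrative):
$ lspci -s 31:00.0 -vv | grep MSI-X
$ setpci -s 31:00.0 CAP_MSIX+2.w   # Table Size = (value & 0x7ff) + 1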

> 
> > > Should we always allow max MSI-X for PFs? So hw_max - used -
> > > sfs? Is it save to call pci_enable_msix always with max vectors
> > > supported by device?
> > 
> > I'm not sure. I think that it won't give you much if you enable
> > more than num_online_cpu().
> > 
> 
> Oh, yes, correct, I missed that.
> 
> > > 
> > > I added the value for PFs, because we faced a problem with MSI-X
> > > allocation on 8 port device. Our default value (num_cpus) was too big
> > > (not enough vectors in hw). Changing the amount of vectors that can be
> > > used on PFs was solving the issue.
> > 
> > We had something similar for mlx5 SFs, where we don't have enough vectors.
> > Our solution is simply to move to automatic shared MSI-X mode. I would
> > advocate for that for you as well. 
> >
> 
> Thanks for proposing solution, I will take a look how this work in mlx5.

BTW, PCI spec talks about something like that in short paragraph
"Handling MSI-X Vector Shortages".

<...>

> > > 
> > > Assuming that I gave 64 MSI-X for RDMA by setting num_comp_vectors to
> > > 64, how I will know if I can or can't use these vectors in ethernet?
> > 
> > Why should you need to know? Vectors are not exclusive and they can be
> > used by many applications at the same time. The thing is that it is far
> > fetch to except that high performance RDMA applications and high
> > performance ethernet can coexist on same device at the same time.
> >
> 
> Yes, but after loading aux device part of vectors (num_comp_vectors) are
> reserved for only RDMA use case (at least in ice driver). We though that
> devlink resource interface can be a good way to change the
> num_comp_vectors in this case.

It can be, but it is much better to spare users the devlink magic. :)

Thanks
Samudrala, Sridhar Nov. 16, 2022, 1:59 a.m. UTC | #15
On 11/15/2022 11:57 AM, Leon Romanovsky wrote:
> On Tue, Nov 15, 2022 at 03:02:40PM +0100, Michal Swiatkowski wrote:
>> On Tue, Nov 15, 2022 at 02:12:12PM +0200, Leon Romanovsky wrote:
>>> On Tue, Nov 15, 2022 at 11:16:58AM +0100, Michal Swiatkowski wrote:
>>>> On Tue, Nov 15, 2022 at 11:32:14AM +0200, Leon Romanovsky wrote:
>>>>> On Tue, Nov 15, 2022 at 10:04:49AM +0100, Michal Swiatkowski wrote:
>>>>>> On Tue, Nov 15, 2022 at 10:11:10AM +0200, Leon Romanovsky wrote:
>>>>>>> On Tue, Nov 15, 2022 at 08:12:52AM +0100, Michal Swiatkowski wrote:
>>>>>>>> On Mon, Nov 14, 2022 at 07:07:54PM +0200, Leon Romanovsky wrote:
>>>>>>>>> On Mon, Nov 14, 2022 at 09:31:11AM -0600, Samudrala, Sridhar wrote:
>>>>>>>>>> On 11/14/2022 7:23 AM, Leon Romanovsky wrote:
>>>>>>>>>>> On Mon, Nov 14, 2022 at 01:57:42PM +0100, Michal Swiatkowski wrote:
>>>>>>>>>>>> Currently the default value for number of PF vectors is number of CPUs.
>>>>>>>>>>>> Because of that there are cases when all vectors are used for PF
>>>>>>>>>>>> and user can't create more VFs. It is hard to set default number of
>>>>>>>>>>>> CPUs right for all different use cases. Instead allow user to choose
>>>>>>>>>>>> how many vectors should be used for various features. After implementing
>>>>>>>>>>>> subdevices this mechanism will be also used to set number of vectors
>>>>>>>>>>>> for subfunctions.
>>>>>>>>>>>>
>>>>>>>>>>>> The idea is to set vectors for eth or VFs using devlink resource API.
>>>>>>>>>>>> New value of vectors will be used after devlink reinit. Example
>>>>>>>>>>>> commands:
>>>>>>>>>>>> $ sudo devlink resource set pci/0000:31:00.0 path msix/msix_eth size 16
>>>>>>>>>>>> $ sudo devlink dev reload pci/0000:31:00.0
>>>>>>>>>>>> After reload driver will work with 16 vectors used for eth instead of
>>>>>>>>>>>> num_cpus.
>>>>>>>>>>> By saying "vectors", are you referring to MSI-X vectors?
>>>>>>>>>>> If yes, you have specific interface for that.
>>>>>>>>>>> https://lore.kernel.org/linux-pci/20210314124256.70253-1-leon@kernel.org/
>>>>>>>>>> This patch series is exposing a resources API to split the device level MSI-X vectors
>>>>>>>>>> across the different functions supported by the device (PF, RDMA, SR-IOV VFs and
>>>>>>>>>> in future subfunctions). Today this is all hidden in a policy implemented within
>>>>>>>>>> the PF driver.
>>>>>>>>> Maybe we are talking about different VFs, but if you refer to PCI VFs,
>>>>>>>>> the amount of MSI-X comes from PCI config space for that specific VF.
>>>>>>>>>
>>>>>>>>> You shouldn't set any value through netdev as it will cause to
>>>>>>>>> difference in output between lspci (which doesn't require any driver)
>>>>>>>>> and your newly set number.
>>>>>>>> If I understand correctly, lspci shows the MSI-X number for individual
>>>>>>>> VF. Value set via devlink is the total number of MSI-X that can be used
>>>>>>>> when creating VFs.
>>>>>>> Yes and no, lspci shows how much MSI-X vectors exist from HW point of
>>>>>>> view. Driver can use less than that. It is exactly as your proposed
>>>>>>> devlink interface.
>>>>>>>
>>>>>>>
>>>>>> Ok, I have to take a closer look at it. So, are You saing that we should
>>>>>> drop this devlink solution and use sysfs interface fo VFs or are You
>>>>>> fine with having both? What with MSI-X allocation for subfunction?
>>>>> You should drop for VFs and PFs and keep it for SFs only.
>>>>>
>>>> I understand that MSI-X for VFs can be set via sysfs interface, but what
>>>> with PFs?
>>> PFs are even more tricker than VFs, as you are changing that number
>>> while driver is bound. This makes me wonder what will be lspci output,
>>> as you will need to show right number before driver starts to load.
>>>
>>> You need to present right value if user decided to unbind driver from PF too.
>>>
>> In case of ice driver lspci -vs shows:
>> Capabilities: [70] MSI-X: Enable+ Count=1024 Masked
>>
>> so all vectors that hw supports (PFs, VFs, misc, etc). Because of that
>> total number of MSI-X in the devlink example from cover letter is 1024.
>>
>> I see that mellanox shows:
>> Capabilities: [9c] MSI-X: Enable+ Count=64 Masked
>>
>> I assume that 64 is in this case MSI-X ony for this one PF (it make
>> sense).
> Yes and PF MSI-X count can be changed through FW configuration tool, as
> we need to write new value when the driver is unbound and we need it to
> be persistent. Users are expecting to see "stable" number any time they
> reboot the server. It is not the case for VFs, as they are explicitly
> created after reboots and start "fresh" after every boot.
>
> So we set large enough but not too large value as a default for PFs.
> If you find sane model of how to change it through kernel, you can count
> on our support.

I guess one main difference is that in case of ice, the PF driver manages resources
for all its associated functions, not the FW. So the MSI-X count reported for the PF
shows the total vectors (PF netdev, VFs, rdma, SFs). VFs talk to the PF over a mailbox
to get their MSI-X vector information.
Leon Romanovsky Nov. 16, 2022, 6:04 a.m. UTC | #16
On Tue, Nov 15, 2022 at 07:59:06PM -0600, Samudrala, Sridhar wrote:
> On 11/15/2022 11:57 AM, Leon Romanovsky wrote:

<...>

> > > In case of ice driver lspci -vs shows:
> > > Capabilities: [70] MSI-X: Enable+ Count=1024 Masked
> > > 
> > > so all vectors that hw supports (PFs, VFs, misc, etc). Because of that
> > > total number of MSI-X in the devlink example from cover letter is 1024.
> > > 
> > > I see that mellanox shows:
> > > Capabilities: [9c] MSI-X: Enable+ Count=64 Masked
> > > 
> > > I assume that 64 is in this case MSI-X ony for this one PF (it make
> > > sense).
> > Yes and PF MSI-X count can be changed through FW configuration tool, as
> > we need to write new value when the driver is unbound and we need it to
> > be persistent. Users are expecting to see "stable" number any time they
> > reboot the server. It is not the case for VFs, as they are explicitly
> > created after reboots and start "fresh" after every boot.
> > 
> > So we set large enough but not too large value as a default for PFs.
> > If you find sane model of how to change it through kernel, you can count
> > on our support.
> 
> I guess one main difference is that in case of ice, PF driver manager resources
> for all its associated functions, not the FW. So the MSI-X count reported for PF
> shows the total vectors(PF netdev, VFs, rdma, SFs). VFs talk to PF over a mailbox
> to get their MSI-X vector information.

What is the output of lspci for ice VF when the driver is not bound?

Thanks

> 
> 
>
Michal Swiatkowski Nov. 16, 2022, 12:04 p.m. UTC | #17
On Wed, Nov 16, 2022 at 08:04:56AM +0200, Leon Romanovsky wrote:
> On Tue, Nov 15, 2022 at 07:59:06PM -0600, Samudrala, Sridhar wrote:
> > On 11/15/2022 11:57 AM, Leon Romanovsky wrote:
> 
> <...>
> 
> > > > In case of ice driver lspci -vs shows:
> > > > Capabilities: [70] MSI-X: Enable+ Count=1024 Masked
> > > > 
> > > > so all vectors that hw supports (PFs, VFs, misc, etc). Because of that
> > > > total number of MSI-X in the devlink example from cover letter is 1024.
> > > > 
> > > > I see that mellanox shows:
> > > > Capabilities: [9c] MSI-X: Enable+ Count=64 Masked
> > > > 
> > > > I assume that 64 is in this case MSI-X ony for this one PF (it make
> > > > sense).
> > > Yes and PF MSI-X count can be changed through FW configuration tool, as
> > > we need to write new value when the driver is unbound and we need it to
> > > be persistent. Users are expecting to see "stable" number any time they
> > > reboot the server. It is not the case for VFs, as they are explicitly
> > > created after reboots and start "fresh" after every boot.
> > > 
> > > So we set large enough but not too large value as a default for PFs.
> > > If you find sane model of how to change it through kernel, you can count
> > > on our support.
> > 
> > I guess one main difference is that in case of ice, PF driver manager resources
> > for all its associated functions, not the FW. So the MSI-X count reported for PF
> > shows the total vectors(PF netdev, VFs, rdma, SFs). VFs talk to PF over a mailbox
> > to get their MSI-X vector information.
> 
> What is the output of lspci for ice VF when the driver is not bound?
> 
> Thanks
> 

It is the same after creating and after unbinding:
Capabilities: [70] MSI-X: Enable- Count=17 Masked-

17, because 16 for traffic and 1 for mailbox.

Thanks
> > 
> > 
> >
Leon Romanovsky Nov. 16, 2022, 5:59 p.m. UTC | #18
On Wed, Nov 16, 2022 at 01:04:36PM +0100, Michal Swiatkowski wrote:
> On Wed, Nov 16, 2022 at 08:04:56AM +0200, Leon Romanovsky wrote:
> > On Tue, Nov 15, 2022 at 07:59:06PM -0600, Samudrala, Sridhar wrote:
> > > On 11/15/2022 11:57 AM, Leon Romanovsky wrote:
> > 
> > <...>
> > 
> > > > > In case of ice driver lspci -vs shows:
> > > > > Capabilities: [70] MSI-X: Enable+ Count=1024 Masked
> > > > > 
> > > > > so all vectors that hw supports (PFs, VFs, misc, etc). Because of that
> > > > > total number of MSI-X in the devlink example from cover letter is 1024.
> > > > > 
> > > > > I see that mellanox shows:
> > > > > Capabilities: [9c] MSI-X: Enable+ Count=64 Masked
> > > > > 
> > > > > I assume that 64 is in this case MSI-X ony for this one PF (it make
> > > > > sense).
> > > > Yes and PF MSI-X count can be changed through FW configuration tool, as
> > > > we need to write new value when the driver is unbound and we need it to
> > > > be persistent. Users are expecting to see "stable" number any time they
> > > > reboot the server. It is not the case for VFs, as they are explicitly
> > > > created after reboots and start "fresh" after every boot.
> > > > 
> > > > So we set large enough but not too large value as a default for PFs.
> > > > If you find sane model of how to change it through kernel, you can count
> > > > on our support.
> > > 
> > > I guess one main difference is that in case of ice, PF driver manager resources
> > > for all its associated functions, not the FW. So the MSI-X count reported for PF
> > > shows the total vectors(PF netdev, VFs, rdma, SFs). VFs talk to PF over a mailbox
> > > to get their MSI-X vector information.
> > 
> > What is the output of lspci for ice VF when the driver is not bound?
> > 
> > Thanks
> > 
> 
> It is the same after creating and after unbonding:
> Capabilities: [70] MSI-X: Enable- Count=17 Masked-
> 
> 17, because 16 for traffic and 1 for mailbox.

Interesting, I think that your PF violates the PCI spec, as the spec
always uses the word "function" and not "device" when talking about
MSI-X related registers.

Thanks

> 
> Thanks
> > > 
> > > 
> > >
Michal Swiatkowski Nov. 17, 2022, 11:10 a.m. UTC | #19
On Wed, Nov 16, 2022 at 07:59:43PM +0200, Leon Romanovsky wrote:
> On Wed, Nov 16, 2022 at 01:04:36PM +0100, Michal Swiatkowski wrote:
> > On Wed, Nov 16, 2022 at 08:04:56AM +0200, Leon Romanovsky wrote:
> > > On Tue, Nov 15, 2022 at 07:59:06PM -0600, Samudrala, Sridhar wrote:
> > > > On 11/15/2022 11:57 AM, Leon Romanovsky wrote:
> > > 
> > > <...>
> > > 
> > > > > > In case of ice driver lspci -vs shows:
> > > > > > Capabilities: [70] MSI-X: Enable+ Count=1024 Masked
> > > > > > 
> > > > > > so all vectors that hw supports (PFs, VFs, misc, etc). Because of that
> > > > > > total number of MSI-X in the devlink example from cover letter is 1024.
> > > > > > 
> > > > > > I see that mellanox shows:
> > > > > > Capabilities: [9c] MSI-X: Enable+ Count=64 Masked
> > > > > > 
> > > > > > I assume that 64 is in this case MSI-X ony for this one PF (it make
> > > > > > sense).
> > > > > Yes and PF MSI-X count can be changed through FW configuration tool, as
> > > > > we need to write new value when the driver is unbound and we need it to
> > > > > be persistent. Users are expecting to see "stable" number any time they
> > > > > reboot the server. It is not the case for VFs, as they are explicitly
> > > > > created after reboots and start "fresh" after every boot.
> > > > > 
> > > > > So we set large enough but not too large value as a default for PFs.
> > > > > If you find sane model of how to change it through kernel, you can count
> > > > > on our support.
> > > > 
> > > > I guess one main difference is that in case of ice, PF driver manager resources
> > > > for all its associated functions, not the FW. So the MSI-X count reported for PF
> > > > shows the total vectors(PF netdev, VFs, rdma, SFs). VFs talk to PF over a mailbox
> > > > to get their MSI-X vector information.
> > > 
> > > What is the output of lspci for ice VF when the driver is not bound?
> > > 
> > > Thanks
> > > 
> > 
> > It is the same after creating and after unbonding:
> > Capabilities: [70] MSI-X: Enable- Count=17 Masked-
> > 
> > 17, because 16 for traffic and 1 for mailbox.
> 
> Interesting, I think that your PF violates PCI spec as it always
> uses word "function" and not "device" while talks about MSI-X related
> registers.
> 
> Thanks
> 

I made a mistake in one comment. 1024 isn't the MSI-X amount for the
device. On ice we have 2048 for the whole device. On a two-port card
each PF has 1024 MSI-X. Our control register mapping to the internal
space looks like this (assuming a two-port card; one VF with 5 MSI-X
created):
INT[PF0].FIRST	0
		1
		2
		
		.
		.
		.

		1019	INT[VF0].FIRST	__
		1020			  | interrupts used
		1021			  | by VF on PF0
		1022			  |
INT[PF0].LAST	1023	INT[VF0].LAST	__|
INT[PF1].FIRST	1024
		1025
		1026

		.
		.
		.
		
INT[PF1].LAST	2047

The MSI-X entry table size for PF0 is 1024, but the entry table for the
VF is a part of PF0's physical space.

Do You mean that "sharing" the entries between PF and VF is a violation
of the PCI spec? That the sum of the MSI-X count on all functions within
the device shouldn't be greater than 2048? It is hard to find something
about this in the spec. I only read that: "MSI-X supports a maximum
table size of 2048 entries". I will continue searching for information
about that.

I don't think that from the driver's perspective we can change the table
size located in the Message Control register.

I assume that in the Mellanox case the tool that You mentioned can
modify this table size?

Thanks

> > 
> > Thanks
> > > > 
> > > > 
> > > >
Leon Romanovsky Nov. 17, 2022, 11:45 a.m. UTC | #20
On Thu, Nov 17, 2022 at 12:10:21PM +0100, Michal Swiatkowski wrote:
> On Wed, Nov 16, 2022 at 07:59:43PM +0200, Leon Romanovsky wrote:
> > On Wed, Nov 16, 2022 at 01:04:36PM +0100, Michal Swiatkowski wrote:
> > > On Wed, Nov 16, 2022 at 08:04:56AM +0200, Leon Romanovsky wrote:
> > > > On Tue, Nov 15, 2022 at 07:59:06PM -0600, Samudrala, Sridhar wrote:
> > > > > On 11/15/2022 11:57 AM, Leon Romanovsky wrote:
> > > > 
> > > > <...>
> > > > 
> > > > > > > In case of ice driver lspci -vs shows:
> > > > > > > Capabilities: [70] MSI-X: Enable+ Count=1024 Masked
> > > > > > > 
> > > > > > > so all vectors that hw supports (PFs, VFs, misc, etc). Because of that
> > > > > > > total number of MSI-X in the devlink example from cover letter is 1024.
> > > > > > > 
> > > > > > > I see that mellanox shows:
> > > > > > > Capabilities: [9c] MSI-X: Enable+ Count=64 Masked
> > > > > > > 
> > > > > > > I assume that 64 is in this case MSI-X ony for this one PF (it make
> > > > > > > sense).
> > > > > > Yes and PF MSI-X count can be changed through FW configuration tool, as
> > > > > > we need to write new value when the driver is unbound and we need it to
> > > > > > be persistent. Users are expecting to see "stable" number any time they
> > > > > > reboot the server. It is not the case for VFs, as they are explicitly
> > > > > > created after reboots and start "fresh" after every boot.
> > > > > > 
> > > > > > So we set large enough but not too large value as a default for PFs.
> > > > > > If you find sane model of how to change it through kernel, you can count
> > > > > > on our support.
> > > > > 
> > > > > I guess one main difference is that in case of ice, PF driver manager resources
> > > > > for all its associated functions, not the FW. So the MSI-X count reported for PF
> > > > > shows the total vectors(PF netdev, VFs, rdma, SFs). VFs talk to PF over a mailbox
> > > > > to get their MSI-X vector information.
> > > > 
> > > > What is the output of lspci for ice VF when the driver is not bound?
> > > > 
> > > > Thanks
> > > > 
> > > 
> > > It is the same after creating and after unbonding:
> > > Capabilities: [70] MSI-X: Enable- Count=17 Masked-
> > > 
> > > 17, because 16 for traffic and 1 for mailbox.
> > 
> > Interesting, I think that your PF violates PCI spec as it always
> > uses word "function" and not "device" while talks about MSI-X related
> > registers.
> > 
> > Thanks
> > 
> 
> I made mistake in one comment. 1024 isn't MSI-X amount for device. On
> ice we have 2048 for the whole device. On two ports card each PF have
> 1024 MSI-X. Our control register mapping to the internal space looks
> like that (Assuming two port card; one VF with 5 MSI-X created):
> INT[PF0].FIRST	0
> 		1
> 		2
> 		
> 		.
> 		.
> 		.
> 
> 		1019	INT[VF0].FIRST	__
> 		1020			  | interrupts used
> 		1021			  | by VF on PF0
> 		1022			  |
> INT[PF0].LAST	1023	INT[VF0].LAST	__|
> INT[PF1].FIRST	1024
> 		1025
> 		1026
> 
> 		.
> 		.
> 		.
> 		
> INT[PF1].LAST	2047
> 
> MSI-X entry table size for PF0 is 1024, but entry table for VF is a part
> of PF0 physical space.
> 
> Do You mean that "sharing" the entry between PF and VF is a violation of
> PCI spec? 

You should consult with your PCI specification experts. It was my
spec interpretation, which can be wrong.


> Sum of MSI-X count on all function within device shouldn't be
> grater than 2048? 

No, it is 2K per-control message/per-function.


> It is hard to find sth about this in spec. I only read
> that: "MSI-X supports a maximum table size of 2048 entries". I will
> continue searching for information about that.
> 
> I don't think that from driver perspective we can change the table size
> located in message control register.

No, you can't, unless you decide explicitly violate spec.

> 
> I assume in mlnx the tool that You mentioned can modify this table size?

Yes, it is FW configuration tool.

Thanks

> 
> Thanks
> 
> > > 
> > > Thanks
> > > > > 
> > > > > 
> > > > >
Michal Swiatkowski Nov. 17, 2022, 1:39 p.m. UTC | #21
On Thu, Nov 17, 2022 at 01:45:38PM +0200, Leon Romanovsky wrote:
> On Thu, Nov 17, 2022 at 12:10:21PM +0100, Michal Swiatkowski wrote:
> > On Wed, Nov 16, 2022 at 07:59:43PM +0200, Leon Romanovsky wrote:
> > > On Wed, Nov 16, 2022 at 01:04:36PM +0100, Michal Swiatkowski wrote:
> > > > On Wed, Nov 16, 2022 at 08:04:56AM +0200, Leon Romanovsky wrote:
> > > > > On Tue, Nov 15, 2022 at 07:59:06PM -0600, Samudrala, Sridhar wrote:
> > > > > > On 11/15/2022 11:57 AM, Leon Romanovsky wrote:
> > > > > 
> > > > > <...>
> > > > > 
> > > > > > > > In case of ice driver lspci -vs shows:
> > > > > > > > Capabilities: [70] MSI-X: Enable+ Count=1024 Masked
> > > > > > > > 
> > > > > > > > so all vectors that hw supports (PFs, VFs, misc, etc). Because of that
> > > > > > > > total number of MSI-X in the devlink example from cover letter is 1024.
> > > > > > > > 
> > > > > > > > I see that mellanox shows:
> > > > > > > > Capabilities: [9c] MSI-X: Enable+ Count=64 Masked
> > > > > > > > 
> > > > > > > > I assume that 64 is in this case MSI-X ony for this one PF (it make
> > > > > > > > sense).
> > > > > > > Yes and PF MSI-X count can be changed through FW configuration tool, as
> > > > > > > we need to write new value when the driver is unbound and we need it to
> > > > > > > be persistent. Users are expecting to see "stable" number any time they
> > > > > > > reboot the server. It is not the case for VFs, as they are explicitly
> > > > > > > created after reboots and start "fresh" after every boot.
> > > > > > > 
> > > > > > > So we set large enough but not too large value as a default for PFs.
> > > > > > > If you find sane model of how to change it through kernel, you can count
> > > > > > > on our support.
> > > > > > 
> > > > > > I guess one main difference is that in case of ice, PF driver manager resources
> > > > > > for all its associated functions, not the FW. So the MSI-X count reported for PF
> > > > > > shows the total vectors(PF netdev, VFs, rdma, SFs). VFs talk to PF over a mailbox
> > > > > > to get their MSI-X vector information.
> > > > > 
> > > > > What is the output of lspci for ice VF when the driver is not bound?
> > > > > 
> > > > > Thanks
> > > > > 
> > > > 
> > > > It is the same after creating and after unbonding:
> > > > Capabilities: [70] MSI-X: Enable- Count=17 Masked-
> > > > 
> > > > 17, because 16 for traffic and 1 for mailbox.
> > > 
> > > Interesting, I think that your PF violates PCI spec as it always
> > > uses word "function" and not "device" while talks about MSI-X related
> > > registers.
> > > 
> > > Thanks
> > > 
> > 
> > I made mistake in one comment. 1024 isn't MSI-X amount for device. On
> > ice we have 2048 for the whole device. On two ports card each PF have
> > 1024 MSI-X. Our control register mapping to the internal space looks
> > like that (Assuming two port card; one VF with 5 MSI-X created):
> > INT[PF0].FIRST	0
> > 		1
> > 		2
> > 		
> > 		.
> > 		.
> > 		.
> > 
> > 		1019	INT[VF0].FIRST	__
> > 		1020			  | interrupts used
> > 		1021			  | by VF on PF0
> > 		1022			  |
> > INT[PF0].LAST	1023	INT[VF0].LAST	__|
> > INT[PF1].FIRST	1024
> > 		1025
> > 		1026
> > 
> > 		.
> > 		.
> > 		.
> > 		
> > INT[PF1].LAST	2047
> > 
> > MSI-X entry table size for PF0 is 1024, but entry table for VF is a part
> > of PF0 physical space.
> > 
> > Do You mean that "sharing" the entry between PF and VF is a violation of
> > PCI spec? 
> 
> You should consult with your PCI specification experts. It was my
> spec interpretation, which can be wrong.
> 

Sure, I will

> 
> > Sum of MSI-X count on all function within device shouldn't be
> > grater than 2048? 
> 
> No, it is 2K per-control message/per-function.
> 
> 
> > It is hard to find sth about this in spec. I only read
> > that: "MSI-X supports a maximum table size of 2048 entries". I will
> > continue searching for information about that.
> > 
> > I don't think that from driver perspective we can change the table size
> > located in message control register.
> 
> No, you can't, unless you decide explicitly violate spec.
> 
> > 
> > I assume in mlnx the tool that You mentioned can modify this table size?
> 
> Yes, it is FW configuration tool.
>

Thanks for clarification.
In summary, if this really is a violation of the PCI spec we will for
sure have to fix that, and the resource management implemented in this
patchset will not make sense (as the config for the PF MSI-X amount will
have to happen in FW).

If it isn't, is it still a NACK from You? I mean, if we implement the
devlink resource management (maybe not a generic one, only defined in
ice) and sysfs for setting VF MSI-X, will You be ok with that? Or is
using devlink to manage MSI-X resources not a good solution in general?
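(By sysfs for VF MSI-X I mean the interface from the series You linked,
used roughly like below; BDFs are illustrative and the VF must have no
driver bound:)
$ cat /sys/bus/pci/devices/0000:31:00.0/sriov_vf_total_msix
$ echo 32 > /sys/bus/pci/devices/0000:31:01.0/sriov_vf_msix_count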

Our motivation here is to allow the user to configure the MSI-X
allocation for his own use case. For example giving only 1 MSI-X to
ethernet and RDMA and spending the rest on VFs. If I understand
correctly, on a Mellanox card the user can set PF MSI-X to 1 in the FW
configuration tool and achieve the same. Currently we don't have any
other mechanism to change the default number of MSI-X on ethernet.

> Thanks
> 
> > 
> > Thanks
> > 
> > > > 
> > > > Thanks
> > > > > > 
> > > > > > 
> > > > > >
Leon Romanovsky Nov. 17, 2022, 5:38 p.m. UTC | #22
On Thu, Nov 17, 2022 at 02:39:50PM +0100, Michal Swiatkowski wrote:
> On Thu, Nov 17, 2022 at 01:45:38PM +0200, Leon Romanovsky wrote:
> > On Thu, Nov 17, 2022 at 12:10:21PM +0100, Michal Swiatkowski wrote:
> > > On Wed, Nov 16, 2022 at 07:59:43PM +0200, Leon Romanovsky wrote:
> > > > On Wed, Nov 16, 2022 at 01:04:36PM +0100, Michal Swiatkowski wrote:
> > > > > On Wed, Nov 16, 2022 at 08:04:56AM +0200, Leon Romanovsky wrote:
> > > > > > On Tue, Nov 15, 2022 at 07:59:06PM -0600, Samudrala, Sridhar wrote:
> > > > > > > On 11/15/2022 11:57 AM, Leon Romanovsky wrote:

<...>

> 
> Thanks for clarification.
> In summary, if this is really violation of PCI spec we for sure will
> have to fix that and resource managment implemented in this patchset
> will not make sense (as config for PF MSI-X amount will have to happen
> in FW).
> 
> If it isn't, is it still NACK from You? I mean, if we implement the
> devlink resources managment (maybe not generic one, only define in ice)
> and sysfs for setting VF MSI-X, will You be ok with that? Or using
> devlink to manage MSI-X resources isn't in general good solution?

NACK/ACK question is not for me, it is not my decision :)

I don't think that management of PCI specific parameters in devlink is
the right idea. PCI has its own subsystem with rules and assumptions;
netdev shouldn't mangle them. In the MSI-X case, this is even more
troublesome as users see these values through lspci without a driver
attached to the device.

For example (a very popular case), a user creates VFs, unbinds the
driver from all functions (PF and VFs) and passes them to VMs (binds
them to the vfio driver).
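Roughly (driver names and BDFs illustrative):
$ echo 0000:31:01.0 > /sys/bus/pci/drivers/iavf/unbind
$ echo vfio-pci > /sys/bus/pci/devices/0000:31:01.0/driver_override
$ echo 0000:31:01.0 > /sys/bus/pci/drivers_probe
$ lspci -s 31:01.0 -vv | grep MSI-X   # must still show the right Count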

Your solution, being in the wrong layer, won't work there.

Thanks
Jakub Kicinski Nov. 18, 2022, 3:36 a.m. UTC | #23
On Thu, 17 Nov 2022 19:38:48 +0200 Leon Romanovsky wrote:
> I don't think that management of PCI specific parameters in devlink is
> right idea. PCI has his own subsystem with rules and assumptions, netdev
> shouldn't mangle them.

Not netdev, devlink, which covers netdev, RDMA and others.

> In MSI-X case, this even more troublesome as users
> sees these values through lspci without driver attached to it.

I'm no PCI expert either but FWIW to me the patch set seems reasonable.
Whether management FW policies are configured via core PCI sysfs or
subsystem-specific-ish APIs is a toss up. Adding Bjorn and please CC him
on the next rev.

Link to the series:
https://lore.kernel.org/all/20221114125755.13659-1-michal.swiatkowski@linux.intel.com/
Leon Romanovsky Nov. 18, 2022, 6:20 a.m. UTC | #24
On Thu, Nov 17, 2022 at 07:36:18PM -0800, Jakub Kicinski wrote:
> On Thu, 17 Nov 2022 19:38:48 +0200 Leon Romanovsky wrote:
> > I don't think that management of PCI specific parameters in devlink is
> > right idea. PCI has his own subsystem with rules and assumptions, netdev
> > shouldn't mangle them.
> 
> Not netdev, devlink, which covers netdev, RDMA and others.

devlink is located in net/*, it is netdev.
Regarding RDMA, that is not fully accurate. We use devlink to convey
information to the FW through the PCI device registered in netdev. Some
of these parameters are RDMA related. However, we don't configure RDMA
properties through devlink; there is a special tool for that (rdmatool).

> 
> > In MSI-X case, this even more troublesome as users
> > sees these values through lspci without driver attached to it.
> 
> I'm no PCI expert either but FWIW to me the patch set seems reasonable.
> Whether management FW policies are configured via core PCI sysfs or
> subsystem-specific-ish APIs is a toss up. Adding Bjorn and please CC him
> on the next rev.

The thing is that it must be managed by FW. Earlier in this thread, I
pointed out a pretty normal use case where you need to have a valid PCI
structure without any driver involvement.

> 
> Link to the series:
> https://lore.kernel.org/all/20221114125755.13659-1-michal.swiatkowski@linux.intel.com/
Saleem, Shiraz Nov. 18, 2022, 2:23 p.m. UTC | #25
> Subject: Re: [PATCH net-next 00/13] resource management using devlink reload
> 
> On Thu, Nov 17, 2022 at 07:36:18PM -0800, Jakub Kicinski wrote:
> > On Thu, 17 Nov 2022 19:38:48 +0200 Leon Romanovsky wrote:
> > > I don't think that management of PCI specific parameters in devlink
> > > is right idea. PCI has his own subsystem with rules and assumptions,
> > > netdev shouldn't mangle them.
> >
> > Not netdev, devlink, which covers netdev, RDMA and others.
> 
> devlink is located in net/*, it is netdev.
> Regarding RDMA, it is not fully accurate. We use devlink to convey information to
> FW through pci device located in netdev. Some of such parameters are RDMA
> related. However, we don't configure RDMA properties through devlink, there is a
> special tool for that (rdmatool).

rdmatool though is usable only once the rdma driver probe() completes and the ib_device is registered,
and it cannot be used for any configuration at driver init time.

Don't we already have PCI specific parameters managed through devlink today?

https://docs.kernel.org/networking/devlink/devlink-params.html
enable_sriov
ignore_ari
msix_vec_per_pf_max
msix_vec_per_pf_min

How are these in a different bracket from what Michal is trying to do? Or were these ones a bad idea in hindsight?
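For reference, these are driven today through the params interface,
along these lines (value and cmode illustrative):
$ devlink dev param show pci/0000:31:00.0 name msix_vec_per_pf_max
$ devlink dev param set pci/0000:31:00.0 name msix_vec_per_pf_max value 128 cmode permanent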

Shiraz
Leon Romanovsky Nov. 18, 2022, 5:31 p.m. UTC | #26
On Fri, Nov 18, 2022 at 02:23:50PM +0000, Saleem, Shiraz wrote:
> > Subject: Re: [PATCH net-next 00/13] resource management using devlink reload
> > 
> > On Thu, Nov 17, 2022 at 07:36:18PM -0800, Jakub Kicinski wrote:
> > > On Thu, 17 Nov 2022 19:38:48 +0200 Leon Romanovsky wrote:
> > > > I don't think that management of PCI specific parameters in devlink
> > > > is right idea. PCI has his own subsystem with rules and assumptions,
> > > > netdev shouldn't mangle them.
> > >
> > > Not netdev, devlink, which covers netdev, RDMA and others.
> > 
> > devlink is located in net/*, it is netdev.
> > Regarding RDMA, it is not fully accurate. We use devlink to convey information to
> > FW through pci device located in netdev. Some of such parameters are RDMA
> > related. However, we don't configure RDMA properties through devlink, there is a
> > special tool for that (rdmatool).
> 
> rdmatool though is usable only once the rdma driver probe() completes and the ib_device is registered.
> And cannot be used for any configurations at driver init time.

Like I said, we use devlink to configure the FW and the "core" device to
which the ib_device is connected. We don't configure RDMA specific
properties, only device specific ones.

> 
> Don't we already have PCI specific parameters managed through devlink today?
> 
> https://docs.kernel.org/networking/devlink/devlink-params.html
> enable_sriov
> ignore_ari
> msix_vec_per_pf_max
> msix_vec_per_pf_min
> 
> How are these in a different bracket from what Michal is trying to do? Or were these ones a bad idea in hindsight?

Yes, as it doesn't belong in net/* and exists there for just one reason:
the ability to write to the FW of a specific device.

At least for ARI, I don't see that the bnxt driver masks the ARI
Extended Capability and informs the PCI subsystem about it, so that the
PCI core would recreate the device.

Thanks
Samudrala, Sridhar Nov. 20, 2022, 10:24 p.m. UTC | #27
On 11/18/2022 11:31 AM, Leon Romanovsky wrote:
> On Fri, Nov 18, 2022 at 02:23:50PM +0000, Saleem, Shiraz wrote:
>>> Subject: Re: [PATCH net-next 00/13] resource management using devlink reload
>>>
>>> On Thu, Nov 17, 2022 at 07:36:18PM -0800, Jakub Kicinski wrote:
>>>> On Thu, 17 Nov 2022 19:38:48 +0200 Leon Romanovsky wrote:
>>>>> I don't think that management of PCI specific parameters in devlink
>>>>> is right idea. PCI has his own subsystem with rules and assumptions,
>>>>> netdev shouldn't mangle them.
>>>> Not netdev, devlink, which covers netdev, RDMA and others.
>>> devlink is located in net/*, it is netdev.
>>> Regarding RDMA, it is not fully accurate. We use devlink to convey information to
>>> FW through pci device located in netdev. Some of such parameters are RDMA
>>> related. However, we don't configure RDMA properties through devlink, there is a
>>> special tool for that (rdmatool).
>> rdmatool though is usable only once the rdma driver probe() completes and the ib_device is registered.
>> And cannot be used for any configurations at driver init time.
> Like I said, we use devlink to configure FW and "core" device to which
> ib_device is connected. We don't configure RDMA specific properties, but
> only device specific ones.

I don't think this patch series is configuring any PCI specific parameters OR per-PCI device parameters.
The FW splits the device level MSI-X vectors across its PFs (1 per port).
Each PF manages its own MSI-X vectors and distributes them across the different functions associated with that PF.
So far the PF driver has been doing this with a hard-coded policy implemented within the driver.

We are exposing that policy via devlink and allowing a host admin to split the MSI-X vectors that are
assigned to a PF by the FW across its different functions (PF, all its VFs, all its SFs and RDMA) based
on the use cases/workload running on the host.
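With this series the split would become visible and adjustable per PF
along these lines (a sketch of the devlink resource view; apart from
msix/msix_eth from the cover letter, the resource names and all sizes
are illustrative):
$ devlink resource show pci/0000:31:00.0
pci/0000:31:00.0:
  name msix size 1024 unit entry
    resources:
      name msix_misc size 4 unit entry
      name msix_eth size 64 unit entry size_min 1 size_max 512 size_gran 1
      name msix_vf size 128 unit entry
      name msix_rdma size 64 unit entry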