Message ID: 20200915171022.10561-1-oded.gabbay@gmail.com
Series: Adding GAUDI NIC code to habanalabs driver
On Tue, 15 Sep 2020 20:10:08 +0300 Oded Gabbay wrote:
> Hello,
>
> This is the second version of the patch-set to upstream the GAUDI NIC code
> into the habanalabs driver.
>
> The only modification from v2 is in the ethtool patch (patch 12). Details
> are in that patch's commit message.

You keep reposting this, yet this SDK shim^W^W driver is still living in drivers/misc. If you want to make it look like a NIC, the code belongs where NIC drivers are.

Then again, is it a NIC? Why do you have those custom IOCTLs? That's far from normal.

Please make sure to CC linux-rdma. You clearly stated that the device does RDMA-like transfers.
From: Oded Gabbay <oded.gabbay@gmail.com>
Date: Tue, 15 Sep 2020 20:10:08 +0300

> This is the second version of the patch-set to upstream the GAUDI NIC code
> into the habanalabs driver.
>
> The only modification from v2 is in the ethtool patch (patch 12). Details
> are in that patch's commit message.
>
> Link to v2 cover letter:
> https://lkml.org/lkml/2020/9/12/201

I agree with Jakub, this driver definitely can't go in as it is currently structured and designed. And because of the RDMA'ness of it, the RDMA folks have to be CC:'d and have a chance to review this.
On Tue, Sep 15, 2020 at 11:35 PM Jakub Kicinski <kuba@kernel.org> wrote:
> You keep reposting this, yet this SDK shim^W^W driver is still living in
> drivers/misc. If you want to make it look like a NIC, the code belongs
> where NIC drivers are.
>
> Then again, is it a NIC? Why do you have those custom IOCTLs? That's far
> from normal.

Hi Jakub,

I'm sorry, but from your question it seems as if you didn't read my cover letter at all, as I went to great lengths in explaining exactly what our device is and why we use custom IOCTLs.

TL;DR: We have an accelerator for deep learning (GAUDI) which uses RDMA as infrastructure for communication between multiple accelerators. Just as Nvidia uses NVLink, we use the RDMA engine we have inside our ASIC. The RDMA implementation we did does NOT support some basic RDMA IB verbs (such as MR and PD), and therefore we can't use the rdma-core library or connect to the RDMA infrastructure in the kernel. We wanted to do it, but when we analyzed it, we saw we wouldn't be able to support basic operations, and therefore we had to resort to our own IOCTLs.

To sum it up: because our NIC is used for intra-communication, we don't expose it, nor intend users to use it, as a NIC per se. However, to be able to get statistics and manage the ports in a standard way, and to support a control plane over Ethernet, we do register each port with the net subsystem (i.e. create a netdev per port).

I hope this short summary explains this better.

As for your request that this code live in the net subsystem, I think that would only make it more complicated to upstream and maintain. I see there are other examples (e.g. sgi-xp) that contain networking driver code in misc, so I don't understand this objection.

> Please make sure to CC linux-rdma. You clearly stated that the device
> does RDMA-like transfers.

We don't use the RDMA infrastructure in the kernel, and we can't connect to it due to the lack of H/W support we have, so I don't see why we need to CC linux-rdma.

Oded
On Tue, Sep 15, 2020 at 11:42 PM David Miller <davem@davemloft.net> wrote:
> I agree with Jakub, this driver definitely can't go in as it is currently
> structured and designed.

Why is that? Can you please point to the things that bother you or don't work correctly? I can't really fix the driver if I don't know what's wrong. In addition, please read my reply to Jakub with the explanation of why we designed this driver as is.

> And because of the RDMA'ness of it, the RDMA
> folks have to be CC:'d and have a chance to review this.

As I said to Jakub, the driver doesn't use the RDMA infrastructure in the kernel, and we can't connect to it due to the lack of H/W support we have. Therefore, I don't see why we need to CC linux-rdma. I understood why Greg asked me to CC you, because we do connect to the netdev and standard Ethernet infrastructure, but regarding the RDMA, it's not really the same.

Thanks,
Oded
On Tue, 15 Sep 2020 23:46:58 +0300 Oded Gabbay wrote:
> TL;DR: We have an accelerator for deep learning (GAUDI) which uses RDMA
> as infrastructure for communication between multiple accelerators.
> [...]
> To sum it up: because our NIC is used for intra-communication, we don't
> expose it, nor intend users to use it, as a NIC per se. However, to be
> able to get statistics and manage the ports in a standard way, and to
> support a control plane over Ethernet, we do register each port with
> the net subsystem (i.e. create a netdev per port).
>
> I hope this short summary explains this better.

I read your cover letter. Networking drivers don't get to define random IOCTLs as they please. You have to take that part out of the "NIC" driver.

> As for your request that this code live in the net subsystem, I think
> that would only make it more complicated to upstream and maintain.
> I see there are other examples (e.g. sgi-xp) that contain networking
> driver code in misc, so I don't understand this objection.

The maintenance structure and CI systems for the kernel depend on the directory layout. If you don't understand that, I don't know how to help you.

> > Please make sure to CC linux-rdma. You clearly stated that the device
> > does RDMA-like transfers.
>
> We don't use the RDMA infrastructure in the kernel, and we can't
> connect to it due to the lack of H/W support we have, so I don't see
> why we need to CC linux-rdma.

You have it backward. You don't get to pick and choose which parts of the infrastructure you use, and therefore who reviews your drivers. The device uses RDMA under the hood, so Linux RDMA experts must very much be okay with it getting merged. That's how we ensure Linux interfaces are consistent and of good quality.
On Wed, Sep 16, 2020 at 12:04 AM Jakub Kicinski <kuba@kernel.org> wrote:
> I read your cover letter. Networking drivers don't get to define random
> IOCTLs as they please. You have to take that part out of the "NIC"
> driver.

The IOCTLs are not for the Ethernet part. They are strictly for the RDMA operations. RDMA drivers also have IOCTLs as interfaces in the drivers/infiniband area, so I don't think I'm doing something different here. And my driver is not a networking driver; it is an accelerator which has some network ports. Btw, this is only a single new IOCTL call; the rest of the IOCTLs are already upstreamed and cover the rest of the ASIC's compute functionality. What I'm trying to say is that it's very common to define IOCTLs for accelerators.

> The maintenance structure and CI systems for the kernel depend on the
> directory layout. If you don't understand that, I don't know how to
> help you.

I completely understand, but you didn't answer my question. How come there are drivers which create netdev objects, and specifically sgi-xp in misc (but I also saw it in usb drivers), that live outside drivers/net? Why doesn't your request apply to them as well? When we wrote the code, we saw those examples and therefore assumed it was fine.

> You have it backward. You don't get to pick and choose which parts of
> the infrastructure you use, and therefore who reviews your drivers.
> The device uses RDMA under the hood, so Linux RDMA experts must very
> much be okay with it getting merged. That's how we ensure Linux
> interfaces are consistent and of good quality.

I understand your point of view, but if my H/W doesn't support the basic requirements of the RDMA infrastructure and interfaces, then there is really nothing I can do about it. I can't use them. I wish I were able to use that infrastructure, but I can't. That's why we wrote the IOCTLs in our accelerator driver.

Thanks,
Oded
> I completely understand but you didn't answer my question. How come
> there are drivers which create netdev objects, and specifically sgi-xp
> in misc (but I also saw it in usb drivers), that live outside
> drivers/net? Why doesn't your request apply to them as well?
> When we wrote the code, we saw those examples and therefore assumed it
> was fine.

commit 45d9ca492e4bd1522d1b5bd125c2908f1cee3d4a
Author: Dean Nelson <dcn@sgi.com>
Date: Tue Apr 22 14:46:56 2008 -0500

    [IA64] move XP and XPC to drivers/misc/sgi-xp

    Move XPC and XPNET from arch/ia64/sn/kernel to drivers/misc/sgi-xp.

    Signed-off-by: Dean Nelson <dcn@sgi.com>
    Signed-off-by: Tony Luck <tony.luck@intel.com>

It has been there a long time, and no networking person was involved in its move.

drivers/usb/gadget/function/f_ncm.c:

commit 00a2430ff07d4e0e0e7e24e02fd8adede333b797
Author: Andrzej Pietrasiewicz <andrzej.p@samsung.com>
Date: Tue Jul 15 13:09:46 2014 +0200

    usb: gadget: Gadget directory cleanup - group usb functions

    The drivers/usb/gadget directory contains many files.
    Files which are related can be distributed into separate directories.
    This patch moves the USB functions implementations into a separate directory.

    Signed-off-by: Andrzej Pietrasiewicz <andrzej.p@samsung.com>
    Signed-off-by: Felipe Balbi <balbi@ti.com>

Again, old. Can you find an example of a network driver added in the last couple of years outside of drivers/net?

> I understand your point of view, but if my H/W doesn't support the
> basic requirements of the RDMA infrastructure and interfaces, then
> there is really nothing I can do about it. I can't use them.

It is up to the RDMA people to say that. They might see how the RDMA core can be made to work for your hardware.

Andrew
On Wed, Sep 16, 2020 at 12:37 AM Andrew Lunn <andrew@lunn.ch> wrote:
> Again, old. Can you find an example of a network driver added in the
> last couple of years outside of drivers/net?

I honestly don't know, and I admit we didn't look at the dates of when these drivers were introduced.

Oded

> It is up to the RDMA people to say that. They might see how the RDMA
> core can be made to work for your hardware.
>
> Andrew
From: Oded Gabbay <oded.gabbay@gmail.com>
Date: Wed, 16 Sep 2020 00:20:12 +0300

> I completely understand but you didn't answer my question. How come
> there are drivers which create netdev objects, and specifically sgi-xp
> in misc (but I also saw it in usb drivers) that live outside
> drivers/net? Why doesn't your request apply to them as well?

Don't use examples of drivers doing the wrong thing as an excuse for you to repeat the mistake. Ok?

That kind of argument doesn't work here.
From: Oded Gabbay <oded.gabbay@gmail.com>
Date: Wed, 16 Sep 2020 00:43:00 +0300

> I honestly don't know and I admit we didn't look at the dates of when
> these drivers were introduced.

Please do research when you make claims in the future, thank you.
From: Andrew Lunn <andrew@lunn.ch>
Date: Tue, 15 Sep 2020 23:37:35 +0200

>> I understand your point of view but if my H/W doesn't support the
>> basic requirements of the RDMA infrastructure and interfaces, then
>> really there is nothing I can do about it. I can't use them.
>
> It is up to the RDMA people to say that. They might see how the RDMA
> core can be made to work for your hardware.

+1
On Wed, Sep 16, 2020 at 1:34 AM David Miller <davem@davemloft.net> wrote:
> Don't use examples of drivers doing the wrong thing as an excuse for
> you to repeat the mistake.
>
> Ok?

Well, it's not like there is a big red warning near those drivers saying "this is wrong"... How could I have known that in advance?

> That kind of argument doesn't work here.

I know that; I just didn't know those drivers did "the wrong thing".

Oded
On Tue, Sep 15, 2020 at 11:49:12PM +0300, Oded Gabbay wrote:
> Why is that? Can you please point to the things that bother you or
> don't work correctly? I can't really fix the driver if I don't know
> what's wrong.
> [...]
> As I said to Jakub, the driver doesn't use the RDMA infrastructure in
> the kernel, and we can't connect to it due to the lack of H/W support
> we have. Therefore, I don't see why we need to CC linux-rdma.
> I understood why Greg asked me to CC you, because we do connect to the
> netdev and standard Ethernet infrastructure, but regarding the RDMA,
> it's not really the same.

Ok, to do this "right" it needs to be split up into separate drivers, hopefully using the "virtual bus" code that some day Intel will resubmit again, which will solve this issue.

That will allow you to put the network driver portion in drivers/net/ and split the code up into the proper different pieces more easily.

I recommend grabbing the virtual bus code from the archives and looking at that for how this can be done. Now that you are part of Intel, I'm sure that the internal-Intel-Linux-kernel-review-process can kick in and those developers can help you out. If not, let me know, so I can go kick them :)

As for the RDMA stuff, yeah, you should look at the current RDMA interfaces and verify that those really do not work for you here, and then document why that is in your patch submission.

thanks,

greg k-h
On Wed, Sep 16, 2020 at 9:25 AM Greg Kroah-Hartman <gregkh@linuxfoundation.org> wrote:
> Ok, to do this "right" it needs to be split up into separate drivers,
> hopefully using the "virtual bus" code that some day Intel will
> resubmit again, which will solve this issue.

Hi Greg,

Can I suggest an alternative for the short/medium term?

In an earlier email, Jakub said: "Is it not possible to move the files and still build them into a single module?"

I thought maybe that's a good way to progress here. First, split the content into the Ethernet and RDMA parts. Then move the Ethernet part to drivers/net but build it as part of habanalabs.ko. As for the RDMA code, upstream/review it in a different patch-set (maybe they will want me to put the files elsewhere).

What do you think?

> That will allow you to put the network driver portion in drivers/net/
> and split the code up into the proper different pieces more easily.
>
> I recommend grabbing the virtual bus code from the archives and looking
> at that for how this can be done. [...]
>
> As for the RDMA stuff, yeah, you should look at the current RDMA
> interfaces and verify that those really do not work for you here, and
> then document why that is in your patch submission.

Ok, will do that.

Thanks,
Oded
On Wed, Sep 16, 2020 at 09:36:23AM +0300, Oded Gabbay wrote:
> In an earlier email, Jakub said: "Is it not possible to move the files
> and still build them into a single module?"
>
> I thought maybe that's a good way to progress here?

Cross-directory builds of a single module are crazy. Yes, they work, but really, that's a mess, and I would never suggest doing that.

> First, split the content into the Ethernet and RDMA parts.
> Then move the Ethernet part to drivers/net but build it as part of
> habanalabs.ko. As for the RDMA code, upstream/review it in a different
> patch-set (maybe they will want me to put the files elsewhere).
>
> What do you think?

I think you are asking for more work there than just splitting it out into separate modules :)

thanks,

greg k-h
On Wed, Sep 16, 2020 at 10:41 AM Greg Kroah-Hartman <gregkh@linuxfoundation.org> wrote:
> Cross-directory builds of a single module are crazy. Yes, they work,
> but really, that's a mess, and I would never suggest doing that.
> [...]
> I think you are asking for more work there than just splitting it out
> into separate modules :)

Hi Greg,

If cross-directory building is out of the question, what about splitting into separate modules and using cross-module notifiers/calls? I did that with amdkfd and amdgpu/radeon a couple of years back. It worked (that's the best thing I can say about it).

The main problem with this "virtual bus" thing is that I'm not familiar with it at all, and from my experience I imagine it would take considerable time and effort to upstream this infrastructure work. This could delay the NIC code for a couple of years, by which time it won't be relevant at all. So I'm trying to find some middle ground here on how to proceed.

Thanks,
Oded
On Wed, Sep 16, 2020 at 11:02:39AM +0300, Oded Gabbay wrote: > On Wed, Sep 16, 2020 at 10:41 AM Greg Kroah-Hartman > <gregkh@linuxfoundation.org> wrote: > > > > On Wed, Sep 16, 2020 at 09:36:23AM +0300, Oded Gabbay wrote: > > > On Wed, Sep 16, 2020 at 9:25 AM Greg Kroah-Hartman > > > <gregkh@linuxfoundation.org> wrote: > > > > > > > > On Tue, Sep 15, 2020 at 11:49:12PM +0300, Oded Gabbay wrote: > > > > > On Tue, Sep 15, 2020 at 11:42 PM David Miller <davem@davemloft.net> wrote: > > > > > > > > > > > > From: Oded Gabbay <oded.gabbay@gmail.com> > > > > > > Date: Tue, 15 Sep 2020 20:10:08 +0300 > > > > > > > > > > > > > This is the second version of the patch-set to upstream the GAUDI NIC code > > > > > > > into the habanalabs driver. > > > > > > > > > > > > > > The only modification from v2 is in the ethtool patch (patch 12). Details > > > > > > > are in that patch's commit message. > > > > > > > > > > > > > > Link to v2 cover letter: > > > > > > > https://lkml.org/lkml/2020/9/12/201 > > > > > > > > > > > > I agree with Jakub, this driver definitely can't go-in as it is currently > > > > > > structured and designed. > > > > > Why is that ? > > > > > Can you please point to the things that bother you or not working correctly? > > > > > I can't really fix the driver if I don't know what's wrong. > > > > > > > > > > In addition, please read my reply to Jakub with the explanation of why > > > > > we designed this driver as is. > > > > > > > > > > And because of the RDMA'ness of it, the RDMA > > > > > > folks have to be CC:'d and have a chance to review this. > > > > > As I said to Jakub, the driver doesn't use the RDMA infrastructure in > > > > > the kernel and we can't connect to it due to the lack of H/W support > > > > > we have > > > > > Therefore, I don't see why we need to CC linux-rdma. 
> > > > > I understood why Greg asked me to CC you because we do connect to the > > > > > netdev and standard eth infrastructure, but regarding the RDMA, it's > > > > > not really the same. > > > > > > > > Ok, to do this "right" it needs to be split up into separate drivers, > > > > hopefully using the "virtual bus" code that some day Intel will resubmit > > > > again that will solve this issue. > > > Hi Greg, > > > Can I suggest an alternative for the short/medium term ? > > > > > > In an earlier email, Jakub said: > > > "Is it not possible to move the files and still build them into a single > > > module?" > > > > > > I thought maybe that's a good way to progress here ? > > > > Cross-directory builds of a single module are crazy. Yes, they work, > > but really, that's a mess, and would never suggest doing that. > > > > > First, split the content to Ethernet and RDMA. > > > Then move the Ethernet part to drivers/net but build it as part of > > > habanalabs.ko. > > > Regarding the RDMA code, upstream/review it in a different patch-set > > > (maybe they will want me to put the files elsewhere). > > > > > > What do you think ? > > > > I think you are asking for more work there than just splitting out into > > separate modules :) > > > > thanks, > > > > greg k-h > Hi Greg, > > If cross-directory building is out of the question, what about > splitting into separate modules ? And use cross-module notifiers/calls > ? I did that with amdkfd and amdgpu/radeon a couple of years back. It > worked (that's the best thing I can say about it). That's fine with me. > The main problem with this "virtual bus" thing is that I'm not > familiar with it at all and from my experience I imagine it would take > a considerable time and effort to upstream this infrastructure work. It shouldn't be taking that long, but for some unknown reason, the original author of that code is sitting on it and not resending it. 
Go poke them through internal Intel channels to find out what the
problem is, as I have no clue why a 200-300 line bus module is taking
so long to get "right" :(

I'm _ALMOST_ at the point where I would just do that work myself, but
due to my current status with Intel, I'll let them do it as I have
enough other things on my plate...

> This could delay the NIC code for a couple of years, which by then
> this won't be relevant at all.

Why wouldn't this code be relevant in a year? It's going to be 2+ years
before any of this shows up in an "enterprise distro" based on their
release cycles anyway :)

thanks,

greg k-h
On Wed, Sep 16, 2020 at 11:21 AM Greg Kroah-Hartman <gregkh@linuxfoundation.org> wrote: > > On Wed, Sep 16, 2020 at 11:02:39AM +0300, Oded Gabbay wrote: > > On Wed, Sep 16, 2020 at 10:41 AM Greg Kroah-Hartman > > <gregkh@linuxfoundation.org> wrote: > > > > > > On Wed, Sep 16, 2020 at 09:36:23AM +0300, Oded Gabbay wrote: > > > > On Wed, Sep 16, 2020 at 9:25 AM Greg Kroah-Hartman > > > > <gregkh@linuxfoundation.org> wrote: > > > > > > > > > > On Tue, Sep 15, 2020 at 11:49:12PM +0300, Oded Gabbay wrote: > > > > > > On Tue, Sep 15, 2020 at 11:42 PM David Miller <davem@davemloft.net> wrote: > > > > > > > > > > > > > > From: Oded Gabbay <oded.gabbay@gmail.com> > > > > > > > Date: Tue, 15 Sep 2020 20:10:08 +0300 > > > > > > > > > > > > > > > This is the second version of the patch-set to upstream the GAUDI NIC code > > > > > > > > into the habanalabs driver. > > > > > > > > > > > > > > > > The only modification from v2 is in the ethtool patch (patch 12). Details > > > > > > > > are in that patch's commit message. > > > > > > > > > > > > > > > > Link to v2 cover letter: > > > > > > > > https://lkml.org/lkml/2020/9/12/201 > > > > > > > > > > > > > > I agree with Jakub, this driver definitely can't go-in as it is currently > > > > > > > structured and designed. > > > > > > Why is that ? > > > > > > Can you please point to the things that bother you or not working correctly? > > > > > > I can't really fix the driver if I don't know what's wrong. > > > > > > > > > > > > In addition, please read my reply to Jakub with the explanation of why > > > > > > we designed this driver as is. > > > > > > > > > > > > And because of the RDMA'ness of it, the RDMA > > > > > > > folks have to be CC:'d and have a chance to review this. > > > > > > As I said to Jakub, the driver doesn't use the RDMA infrastructure in > > > > > > the kernel and we can't connect to it due to the lack of H/W support > > > > > > we have > > > > > > Therefore, I don't see why we need to CC linux-rdma. 
> > > > > > I understood why Greg asked me to CC you because we do connect to the > > > > > > netdev and standard eth infrastructure, but regarding the RDMA, it's > > > > > > not really the same. > > > > > > > > > > Ok, to do this "right" it needs to be split up into separate drivers, > > > > > hopefully using the "virtual bus" code that some day Intel will resubmit > > > > > again that will solve this issue. > > > > Hi Greg, > > > > Can I suggest an alternative for the short/medium term ? > > > > > > > > In an earlier email, Jakub said: > > > > "Is it not possible to move the files and still build them into a single > > > > module?" > > > > > > > > I thought maybe that's a good way to progress here ? > > > > > > Cross-directory builds of a single module are crazy. Yes, they work, > > > but really, that's a mess, and would never suggest doing that. > > > > > > > First, split the content to Ethernet and RDMA. > > > > Then move the Ethernet part to drivers/net but build it as part of > > > > habanalabs.ko. > > > > Regarding the RDMA code, upstream/review it in a different patch-set > > > > (maybe they will want me to put the files elsewhere). > > > > > > > > What do you think ? > > > > > > I think you are asking for more work there than just splitting out into > > > separate modules :) > > > > > > thanks, > > > > > > greg k-h > > Hi Greg, > > > > If cross-directory building is out of the question, what about > > splitting into separate modules ? And use cross-module notifiers/calls > > ? I did that with amdkfd and amdgpu/radeon a couple of years back. It > > worked (that's the best thing I can say about it). > > That's fine with me. > > > The main problem with this "virtual bus" thing is that I'm not > > familiar with it at all and from my experience I imagine it would take > > a considerable time and effort to upstream this infrastructure work. 
> > It shouldn't be taking that long, but for some unknown reason, the > original author of that code is sitting on it and not resending it. Go > poke them through internal Intel channels to find out what the problem > is, as I have no clue why a 200-300 line bus module is taking so long to > get "right" :( > > I'm _ALMOST_ at the point where I would just do that work myself, but > due to my current status with Intel, I'll let them do it as I have > enough other things on my plate... > > > This could delay the NIC code for a couple of years, which by then > > this won't be relevant at all. > > Why wouldn't this code be relevant in a year? It's going to be 2+ years > before any of this shows up in an "enterprise distro" based on their > release cycles anyway :) > > thanks, > > greg k-h Hi Greg, ok, I'll take a look. Do you happen to have the name of the patch-set / author ? Regarding the RDMA stuff, I'll do some work internally to separate it from the Ethernet code and then will send that code only to RDMA people with more detailed explanations. Thanks, Oded
On Wed, Sep 16, 2020 at 11:47:58AM +0300, Oded Gabbay wrote: > On Wed, Sep 16, 2020 at 11:21 AM Greg Kroah-Hartman > <gregkh@linuxfoundation.org> wrote: > > > > On Wed, Sep 16, 2020 at 11:02:39AM +0300, Oded Gabbay wrote: > > > On Wed, Sep 16, 2020 at 10:41 AM Greg Kroah-Hartman > > > <gregkh@linuxfoundation.org> wrote: > > > > > > > > On Wed, Sep 16, 2020 at 09:36:23AM +0300, Oded Gabbay wrote: > > > > > On Wed, Sep 16, 2020 at 9:25 AM Greg Kroah-Hartman > > > > > <gregkh@linuxfoundation.org> wrote: > > > > > > > > > > > > On Tue, Sep 15, 2020 at 11:49:12PM +0300, Oded Gabbay wrote: > > > > > > > On Tue, Sep 15, 2020 at 11:42 PM David Miller <davem@davemloft.net> wrote: > > > > > > > > > > > > > > > > From: Oded Gabbay <oded.gabbay@gmail.com> > > > > > > > > Date: Tue, 15 Sep 2020 20:10:08 +0300 > > > > > > > > > > > > > > > > > This is the second version of the patch-set to upstream the GAUDI NIC code > > > > > > > > > into the habanalabs driver. > > > > > > > > > > > > > > > > > > The only modification from v2 is in the ethtool patch (patch 12). Details > > > > > > > > > are in that patch's commit message. > > > > > > > > > > > > > > > > > > Link to v2 cover letter: > > > > > > > > > https://lkml.org/lkml/2020/9/12/201 > > > > > > > > > > > > > > > > I agree with Jakub, this driver definitely can't go-in as it is currently > > > > > > > > structured and designed. > > > > > > > Why is that ? > > > > > > > Can you please point to the things that bother you or not working correctly? > > > > > > > I can't really fix the driver if I don't know what's wrong. > > > > > > > > > > > > > > In addition, please read my reply to Jakub with the explanation of why > > > > > > > we designed this driver as is. > > > > > > > > > > > > > > And because of the RDMA'ness of it, the RDMA > > > > > > > > folks have to be CC:'d and have a chance to review this. 
> > > > > > > As I said to Jakub, the driver doesn't use the RDMA infrastructure in > > > > > > > the kernel and we can't connect to it due to the lack of H/W support > > > > > > > we have > > > > > > > Therefore, I don't see why we need to CC linux-rdma. > > > > > > > I understood why Greg asked me to CC you because we do connect to the > > > > > > > netdev and standard eth infrastructure, but regarding the RDMA, it's > > > > > > > not really the same. > > > > > > > > > > > > Ok, to do this "right" it needs to be split up into separate drivers, > > > > > > hopefully using the "virtual bus" code that some day Intel will resubmit > > > > > > again that will solve this issue. > > > > > Hi Greg, > > > > > Can I suggest an alternative for the short/medium term ? > > > > > > > > > > In an earlier email, Jakub said: > > > > > "Is it not possible to move the files and still build them into a single > > > > > module?" > > > > > > > > > > I thought maybe that's a good way to progress here ? > > > > > > > > Cross-directory builds of a single module are crazy. Yes, they work, > > > > but really, that's a mess, and would never suggest doing that. > > > > > > > > > First, split the content to Ethernet and RDMA. > > > > > Then move the Ethernet part to drivers/net but build it as part of > > > > > habanalabs.ko. > > > > > Regarding the RDMA code, upstream/review it in a different patch-set > > > > > (maybe they will want me to put the files elsewhere). > > > > > > > > > > What do you think ? > > > > > > > > I think you are asking for more work there than just splitting out into > > > > separate modules :) > > > > > > > > thanks, > > > > > > > > greg k-h > > > Hi Greg, > > > > > > If cross-directory building is out of the question, what about > > > splitting into separate modules ? And use cross-module notifiers/calls > > > ? I did that with amdkfd and amdgpu/radeon a couple of years back. It > > > worked (that's the best thing I can say about it). > > > > That's fine with me. 
> > > > > The main problem with this "virtual bus" thing is that I'm not > > > familiar with it at all and from my experience I imagine it would take > > > a considerable time and effort to upstream this infrastructure work. > > > > It shouldn't be taking that long, but for some unknown reason, the > > original author of that code is sitting on it and not resending it. Go > > poke them through internal Intel channels to find out what the problem > > is, as I have no clue why a 200-300 line bus module is taking so long to > > get "right" :( > > > > I'm _ALMOST_ at the point where I would just do that work myself, but > > due to my current status with Intel, I'll let them do it as I have > > enough other things on my plate... > > > > > This could delay the NIC code for a couple of years, which by then > > > this won't be relevant at all. > > > > Why wouldn't this code be relevant in a year? It's going to be 2+ years > > before any of this shows up in an "enterprise distro" based on their > > release cycles anyway :) > > > > thanks, > > > > greg k-h > > Hi Greg, > ok, I'll take a look. Do you happen to have the name of the patch-set / author ? Here's at least one copy: https://lore.kernel.org/linux-rdma/20200520070227.3392100-2-jeffrey.t.kirsher@intel.com/ there might have been a newer one, can't remember, sorry. thanks, greg k-h
On Wed, 2020-09-16 at 10:22 +0200, Greg Kroah-Hartman wrote: > On Wed, Sep 16, 2020 at 11:02:39AM +0300, Oded Gabbay wrote: > > On Wed, Sep 16, 2020 at 10:41 AM Greg Kroah-Hartman > > <gregkh@linuxfoundation.org> wrote: > > > On Wed, Sep 16, 2020 at 09:36:23AM +0300, Oded Gabbay wrote: > > > > On Wed, Sep 16, 2020 at 9:25 AM Greg Kroah-Hartman > > > > <gregkh@linuxfoundation.org> wrote: > > > > > On Tue, Sep 15, 2020 at 11:49:12PM +0300, Oded Gabbay wrote: > > > > > > On Tue, Sep 15, 2020 at 11:42 PM David Miller < > > > > > > davem@davemloft.net> wrote: > > > > > > > From: Oded Gabbay <oded.gabbay@gmail.com> > > > > > > > Date: Tue, 15 Sep 2020 20:10:08 +0300 > > > > > > > > > > > > > > > This is the second version of the patch-set to upstream > > > > > > > > the GAUDI NIC code > > > > > > > > into the habanalabs driver. > > > > > > > > > > > > > > > > The only modification from v2 is in the ethtool patch > > > > > > > > (patch 12). Details > > > > > > > > are in that patch's commit message. > > > > > > > > > > > > > > > > Link to v2 cover letter: > > > > > > > > https://lkml.org/lkml/2020/9/12/201 > > > > > > > > > > > > > > I agree with Jakub, this driver definitely can't go-in as > > > > > > > it is currently > > > > > > > structured and designed. > > > > > > Why is that ? > > > > > > Can you please point to the things that bother you or not > > > > > > working correctly? > > > > > > I can't really fix the driver if I don't know what's wrong. > > > > > > > > > > > > In addition, please read my reply to Jakub with the > > > > > > explanation of why > > > > > > we designed this driver as is. > > > > > > > > > > > > And because of the RDMA'ness of it, the RDMA > > > > > > > folks have to be CC:'d and have a chance to review this. 
> > > > > > As I said to Jakub, the driver doesn't use the RDMA > > > > > > infrastructure in > > > > > > the kernel and we can't connect to it due to the lack of > > > > > > H/W support > > > > > > we have > > > > > > Therefore, I don't see why we need to CC linux-rdma. > > > > > > I understood why Greg asked me to CC you because we do > > > > > > connect to the > > > > > > netdev and standard eth infrastructure, but regarding the > > > > > > RDMA, it's > > > > > > not really the same. > > > > > > > > > > Ok, to do this "right" it needs to be split up into separate > > > > > drivers, > > > > > hopefully using the "virtual bus" code that some day Intel > > > > > will resubmit > > > > > again that will solve this issue. > > > > Hi Greg, > > > > Can I suggest an alternative for the short/medium term ? > > > > > > > > In an earlier email, Jakub said: > > > > "Is it not possible to move the files and still build them into > > > > a single > > > > module?" > > > > > > > > I thought maybe that's a good way to progress here ? > > > > > > Cross-directory builds of a single module are crazy. Yes, they > > > work, > > > but really, that's a mess, and would never suggest doing that. > > > > > > > First, split the content to Ethernet and RDMA. > > > > Then move the Ethernet part to drivers/net but build it as part > > > > of > > > > habanalabs.ko. > > > > Regarding the RDMA code, upstream/review it in a different > > > > patch-set > > > > (maybe they will want me to put the files elsewhere). > > > > > > > > What do you think ? > > > > > > I think you are asking for more work there than just splitting > > > out into > > > separate modules :) > > > > > > thanks, > > > > > > greg k-h > > Hi Greg, > > > > If cross-directory building is out of the question, what about > > splitting into separate modules ? And use cross-module > > notifiers/calls > > ? I did that with amdkfd and amdgpu/radeon a couple of years back. > > It > > worked (that's the best thing I can say about it). 
> That's fine with me.
>
> > The main problem with this "virtual bus" thing is that I'm not
> > familiar with it at all and from my experience I imagine it would
> > take a considerable time and effort to upstream this infrastructure
> > work.
>
> It shouldn't be taking that long, but for some unknown reason, the
> original author of that code is sitting on it and not resending it.
> Go poke them through internal Intel channels to find out what the
> problem is, as I have no clue why a 200-300 line bus module is taking
> so long to get "right" :(

It turns out that they were caught between being deeply respectful of
your request to get another senior kernel developer to look at it
before sending it out, and deeply respectful of not disclosing that I
was out on bonding leave.  It just happened that I left before they
could get the latest version over to review.

> I'm _ALMOST_ at the point where I would just do that work myself, but
> due to my current status with Intel, I'll let them do it as I have
> enough other things on my plate...

I'm back now, let's get this thing moving.

/me goes to review.
On Tue, Sep 15, 2020 at 11:46:58PM +0300, Oded Gabbay wrote:
> infrastructure for communication between multiple accelerators. Same
> as Nvidia uses NVlink, we use RDMA that we have inside our ASIC.
> The RDMA implementation we did does NOT support some basic RDMA
> IBverbs (such as MR and PD) and therefore, we can't use the rdma-core
> library or to connect to the rdma infrastructure in the kernel.

You can't create a parallel RDMA subsystem in netdev, or in misc, and
you can't add random device offloads as IOCTLs to netdevs.

RDMA is the proper home for all the networking offloads that don't fit
into netdev.

EFA was able to fit into rdma-core/etc and it isn't even RoCE at
all. I'm sure this can too.

> wanted to do it but when we analyzed it, we saw we wouldn't be able to
> support basic stuff and therefore we had to revert to our IOCTLs.

Try again. Ask for help. Your patches add CQs, WQs, and other RDMA
objects. This is very clearly not appropriate functionality for
netdev.

> To sum it up, because our NIC is used for intra-communication, we
> don't expose nor intend users to use it as a NIC per-se. However, to
> be able to get statistics and manage them in a standard way, and
> support control plane over Ethernet, we do register each port to the
> net subsystem (i.e. create netdev per port).

Sure, the basic ethernet side is conceptually fine.

> > Please make sure to CC linux-rdma. You clearly stated that the device
> > does RDMA-like transfers.
>
> We don't use the RDMA infrastructure in the kernel and we can't
> connect to it due to the lack of H/W support we have so I don't see
> why we need to CC linux-rdma.

Because you can't put RDMA-like concepts under net.

Jakub, NAK from me on this series.

Jason
On 17/09/2020 20:18, Jason Gunthorpe wrote:
> On Tue, Sep 15, 2020 at 11:46:58PM +0300, Oded Gabbay wrote:
>> infrastructure for communication between multiple accelerators. Same
>> as Nvidia uses NVlink, we use RDMA that we have inside our ASIC.
>> The RDMA implementation we did does NOT support some basic RDMA
>> IBverbs (such as MR and PD) and therefore, we can't use the rdma-core
>> library or to connect to the rdma infrastructure in the kernel.
>
> You can't create a parallel RDMA subsystem in netdev, or in misc, and
> you can't add random device offloads as IOCTL to nedevs.
>
> RDMA is the proper home for all the networking offloads that don't fit
> into netdev.
>
> EFA was able to fit into rdma-core/etc and it isn't even RoCE at
> all. I'm sure this can too.

Well, EFA wasn't welcomed to the RDMA subsystem with open arms ;),
initially it was suggested to go through the vfio subsystem instead.

I think this comes back to the discussion we had when EFA was
upstreamed, which is what's the bar to get accepted to the RDMA
subsystem.
IIRC, what we eventually agreed on is having a userspace rdma-core
provider and ibv_{ud,rc}_pingpong working (or just supporting one of
the IB spec's QP types?).

Does GAUDI fit these requirements? If not, should it be in a different
subsystem or should we open the "what qualifies as an RDMA device"
question again?
On Fri, Sep 18, 2020 at 02:36:10PM +0300, Gal Pressman wrote:
> On 17/09/2020 20:18, Jason Gunthorpe wrote:
> > On Tue, Sep 15, 2020 at 11:46:58PM +0300, Oded Gabbay wrote:
> >> infrastructure for communication between multiple accelerators. Same
> >> as Nvidia uses NVlink, we use RDMA that we have inside our ASIC.
> >> The RDMA implementation we did does NOT support some basic RDMA
> >> IBverbs (such as MR and PD) and therefore, we can't use the rdma-core
> >> library or to connect to the rdma infrastructure in the kernel.
> >
> > You can't create a parallel RDMA subsystem in netdev, or in misc, and
> > you can't add random device offloads as IOCTL to nedevs.
> >
> > RDMA is the proper home for all the networking offloads that don't fit
> > into netdev.
> >
> > EFA was able to fit into rdma-core/etc and it isn't even RoCE at
> > all. I'm sure this can too.
>
> Well, EFA wasn't welcomed to the RDMA subsystem with open arms ;),
> initially it was suggested to go through the vfio subsystem instead.
>
> I think this comes back to the discussion we had when EFA was
> upstreamed, which is what's the bar to get accepted to the RDMA
> subsystem.
> IIRC, what we eventually agreed on is having a userspace rdma-core
> provider and ibv_{ud,rc}_pingpong working (or just supporting one of
> the IB spec's QP types?).
>
> Does GAUDI fit these requirements? If not, should it be in a different
> subsystem or should we open the "what qualifies as an RDMA device"
> question again?

I want to remind you that the rdma-core requirement came to make sure
that anything exposed from RDMA to userspace is strict with proper
UAPI header hygiene.

I doubt that Habana's ioctls are backed by anything like this.

Thanks
On Fri, Sep 18, 2020 at 02:36:10PM +0300, Gal Pressman wrote:
> On 17/09/2020 20:18, Jason Gunthorpe wrote:
> > On Tue, Sep 15, 2020 at 11:46:58PM +0300, Oded Gabbay wrote:
> >> infrastructure for communication between multiple accelerators. Same
> >> as Nvidia uses NVlink, we use RDMA that we have inside our ASIC.
> >> The RDMA implementation we did does NOT support some basic RDMA
> >> IBverbs (such as MR and PD) and therefore, we can't use the rdma-core
> >> library or to connect to the rdma infrastructure in the kernel.
> >
> > You can't create a parallel RDMA subsystem in netdev, or in misc, and
> > you can't add random device offloads as IOCTL to nedevs.
> >
> > RDMA is the proper home for all the networking offloads that don't fit
> > into netdev.
> >
> > EFA was able to fit into rdma-core/etc and it isn't even RoCE at
> > all. I'm sure this can too.
>
> Well, EFA wasn't welcomed to the RDMA subsystem with open arms ;),
> initially it was suggested to go through the vfio subsystem instead.
>
> I think this comes back to the discussion we had when EFA was
> upstreamed, which is what's the bar to get accepted to the RDMA
> subsystem.
> IIRC, what we eventually agreed on is having a userspace rdma-core
> provider and ibv_{ud,rc}_pingpong working (or just supporting one of
> the IB spec's QP types?).

That is more or less where we ended up, yes.

I'm most worried about this lack of PD and MR.

Kernel must provide security for apps doing user DMA, PD and MR do
this. If the device doesn't have PD/MR then it is hard to see how a WQ
could ever be exposed directly to userspace, regardless of subsystem.

Jason
On Fri, Sep 18, 2020 at 2:52 PM Leon Romanovsky <leon@kernel.org> wrote: > > On Fri, Sep 18, 2020 at 02:36:10PM +0300, Gal Pressman wrote: > > On 17/09/2020 20:18, Jason Gunthorpe wrote: > > > On Tue, Sep 15, 2020 at 11:46:58PM +0300, Oded Gabbay wrote: > > >> infrastructure for communication between multiple accelerators. Same > > >> as Nvidia uses NVlink, we use RDMA that we have inside our ASIC. > > >> The RDMA implementation we did does NOT support some basic RDMA > > >> IBverbs (such as MR and PD) and therefore, we can't use the rdma-core > > >> library or to connect to the rdma infrastructure in the kernel. > > > > > > You can't create a parallel RDMA subsystem in netdev, or in misc, and > > > you can't add random device offloads as IOCTL to nedevs. > > > > > > RDMA is the proper home for all the networking offloads that don't fit > > > into netdev. > > > > > > EFA was able to fit into rdma-core/etc and it isn't even RoCE at > > > all. I'm sure this can too. > > > > Well, EFA wasn't welcomed to the RDMA subsystem with open arms ;), initially it > > was suggested to go through the vfio subsystem instead. > > > > I think this comes back to the discussion we had when EFA was upstreamed, which > > is what's the bar to get accepted to the RDMA subsystem. > > IIRC, what we eventually agreed on is having a userspace rdma-core provider and > > ibv_{ud,rc}_pingpong working (or just supporting one of the IB spec's QP types?). > > > > Does GAUDI fit these requirements? If not, should it be in a different subsystem > > or should we open the "what qualifies as an RDMA device" question again? > > I want to remind you that rdma-core requirement came to make sure that > anything exposed from the RDMA to the userspace is strict with proper > UAPI header hygiene. > > I doubt that Havana's ioctls are backed by anything like this. > > Thanks Why do you doubt that ? Have you looked at our code ? 
Our uapi and IOCTL interface is based on the drm subsystem's uapi
interface and it is very safe and protected. Otherwise Greg would have
never allowed me to go upstream in the first place.

We have a single function which is the entry point for all the IOCTLs
of our driver (only one IOCTL is RDMA related, all the others are
compute related). That function is almost a 1:1 copy of the function
in drm.

Thanks,
Oded
On Fri, Sep 18, 2020 at 2:56 PM Jason Gunthorpe <jgg@ziepe.ca> wrote: > > On Fri, Sep 18, 2020 at 02:36:10PM +0300, Gal Pressman wrote: > > On 17/09/2020 20:18, Jason Gunthorpe wrote: > > > On Tue, Sep 15, 2020 at 11:46:58PM +0300, Oded Gabbay wrote: > > >> infrastructure for communication between multiple accelerators. Same > > >> as Nvidia uses NVlink, we use RDMA that we have inside our ASIC. > > >> The RDMA implementation we did does NOT support some basic RDMA > > >> IBverbs (such as MR and PD) and therefore, we can't use the rdma-core > > >> library or to connect to the rdma infrastructure in the kernel. > > > > > > You can't create a parallel RDMA subsystem in netdev, or in misc, and > > > you can't add random device offloads as IOCTL to nedevs. > > > > > > RDMA is the proper home for all the networking offloads that don't fit > > > into netdev. > > > > > > EFA was able to fit into rdma-core/etc and it isn't even RoCE at > > > all. I'm sure this can too. > > > > Well, EFA wasn't welcomed to the RDMA subsystem with open arms ;), initially it > > was suggested to go through the vfio subsystem instead. > > > > I think this comes back to the discussion we had when EFA was upstreamed, which > > is what's the bar to get accepted to the RDMA subsystem. > > IIRC, what we eventually agreed on is having a userspace rdma-core provider and > > ibv_{ud,rc}_pingpong working (or just supporting one of the IB spec's QP types?). > > That is more or less where we ended up, yes. > > I'm most worried about this lack of PD and MR. > > Kernel must provide security for apps doing user DMA, PD and MR do > this. If the device doesn't have PD/MR then it is hard to see how a WQ > could ever be exposed directly to userspace, regardless of subsystem. > > Jason Hi Jason, What you say here is very true and we handle that with different mechanisms. I will start working on a dedicated patch-set of the RDMA code in the next few weeks with MUCH MORE details in the commit messages. 
That will explain exactly how we expose things and how we protect them.
For example, regarding isolation between applications, we only support
a single application opening our file descriptor. Another example is
that submission to a WQ is done through our QMAN mechanism and is NOT
mapped to userspace (due to the restrictions you mentioned above and
other restrictions).

But again, I want to send something organized and with proper
explanations. I hope to have something in a couple of weeks.

Thanks,
Oded
On Tue, Sep 15, 2020 at 08:10:08PM +0300, Oded Gabbay wrote:
> Hello,
>
> This is the second version of the patch-set to upstream the GAUDI NIC code
> into the habanalabs driver.
>
> The only modification from v2 is in the ethtool patch (patch 12). Details
> are in that patch's commit message.
>
> Link to v2 cover letter:
> https://lkml.org/lkml/2020/9/12/201
>
> 1. The NIC functionality is NOT exposed as different PCI Physical
>    Functions. There is a single PF which is used for compute and
>    networking, as the main goal of the NIC ports is to be used as
>    intra-communication and not as standard network interfaces. This
>    implies we can't connect different drivers to handle the networking
>    ports because it is the same device, from the kernel POV, as the
>    compute. Therefore, we must integrate the networking code into the
>    main habanalabs driver.

No, this means you need to use the virtual bus/ancillary bus that your
other Intel colleagues have been working on with Greg.

It is specifically intended as the way to split a single PCI function
across multiple subsystems, e.g. drivers/misc/habanalabs would be the
pci_driver and drivers/net/ethernet/habanalabs would be the
'virtual/ancillary' driver. Probably one per port.

Jason
On Fri, Sep 18, 2020 at 3:00 PM Jason Gunthorpe <jgg@nvidia.com> wrote: > > > On Tue, Sep 15, 2020 at 08:10:08PM +0300, Oded Gabbay wrote: > > Hello, > > > > This is the second version of the patch-set to upstream the GAUDI NIC code > > into the habanalabs driver. > > > > The only modification from v2 is in the ethtool patch (patch 12). Details > > are in that patch's commit message. > > > > Link to v2 cover letter: > > https://lkml.org/lkml/2020/9/12/201 > > >1. The NIC functionality is NOT exposed as different PCI Physical > > Functions. There is a single PF which is used for compute and > > networking, as the main goal of the NIC ports is to be used as > > intra-communication and not as standard network interfaces. This > > implies we can't connect different drivers to handle the networking > > ports because it is the same device, from the kernel POV, as the > > compute. Therefore, we must integrate the networking code into the > > main habanalabs driver. > > No, this means you need to use virtual bus/ancillary bus that your > other Intel colleagues have been working on with Greg. > > It is specificaly intended as the way to split a single PCI function > across multiple subsystems. eg drivers/misc/habanalabs would be the > pci_driver and drivers/net/ethernet/habanadalabs would be the > 'virtual/ancillary' driver. Probably one per port. > > Jasno Understood. We are doing a refactor of the code according to those guidelines and will send an updated patch-set in a couple of weeks. Thanks, Oded
On Fri, Sep 18, 2020 at 02:56:09PM +0300, Oded Gabbay wrote:
> On Fri, Sep 18, 2020 at 2:52 PM Leon Romanovsky <leon@kernel.org> wrote:
> >
> > On Fri, Sep 18, 2020 at 02:36:10PM +0300, Gal Pressman wrote:
> > > On 17/09/2020 20:18, Jason Gunthorpe wrote:
> > > > On Tue, Sep 15, 2020 at 11:46:58PM +0300, Oded Gabbay wrote:
> > > > > infrastructure for communication between multiple accelerators. Same
> > > > > as Nvidia uses NVlink, we use RDMA that we have inside our ASIC.
> > > > > The RDMA implementation we did does NOT support some basic RDMA
> > > > > IBverbs (such as MR and PD) and therefore, we can't use the rdma-core
> > > > > library or connect to the rdma infrastructure in the kernel.
> > > >
> > > > You can't create a parallel RDMA subsystem in netdev, or in misc, and
> > > > you can't add random device offloads as IOCTLs to netdevs.
> > > >
> > > > RDMA is the proper home for all the networking offloads that don't fit
> > > > into netdev.
> > > >
> > > > EFA was able to fit into rdma-core/etc and it isn't even RoCE at
> > > > all. I'm sure this can too.
> > >
> > > Well, EFA wasn't welcomed to the RDMA subsystem with open arms ;), initially it
> > > was suggested to go through the vfio subsystem instead.
> > >
> > > I think this comes back to the discussion we had when EFA was upstreamed, which
> > > is what's the bar to get accepted to the RDMA subsystem.
> > > IIRC, what we eventually agreed on is having a userspace rdma-core provider and
> > > ibv_{ud,rc}_pingpong working (or just supporting one of the IB spec's QP types?).
> > >
> > > Does GAUDI fit these requirements? If not, should it be in a different subsystem
> > > or should we open the "what qualifies as an RDMA device" question again?
> >
> > I want to remind you that the rdma-core requirement came to make sure that
> > anything exposed from RDMA to userspace is strict, with proper
> > UAPI header hygiene.
> >
> > I doubt that Habana's ioctls are backed by anything like this.
> >
> > Thanks
>
> Why do you doubt that? Have you looked at our code?
> Our uapi and IOCTLs interface is based on the drm subsystem's uapi interface
> and it is very safe and protected.

Yes, I looked and didn't find open-source users of your UAPI headers.
It is not related to being safe or protected, but to the common request
to present userspace that relies on those exported interfaces.

> Otherwise Greg would have never allowed me to go upstream in the first place.

Nice, can we get a link?

> We have a single function which is the entry point for all the IOCTLs
> of our drivers (only one IOCTL is RDMA related, all the others are
> compute related).
> That function is almost a 1:1 copy of the function in drm.

DRM has the same rules as RDMA: no kernel code will be merged without seeing
open-source userspace.

Thanks

> Thanks,
> Oded
On Fri, Sep 18, 2020 at 3:03 PM Leon Romanovsky <leon@kernel.org> wrote:
>
> On Fri, Sep 18, 2020 at 02:56:09PM +0300, Oded Gabbay wrote:
> > Why do you doubt that? Have you looked at our code?
> > Our uapi and IOCTLs interface is based on the drm subsystem's uapi interface
> > and it is very safe and protected.
>
> Yes, I looked and didn't find open-source users of your UAPI headers.
> It is not related to being safe or protected, but to the common request
> to present userspace that relies on those exported interfaces.
>
> > Otherwise Greg would have never allowed me to go upstream in the first place.
>
> Nice, can we get a link?
>
> > We have a single function which is the entry point for all the IOCTLs
> > of our drivers (only one IOCTL is RDMA related, all the others are
> > compute related).
> > That function is almost a 1:1 copy of the function in drm.
>
> DRM has the same rules as RDMA: no kernel code will be merged without seeing
> open-source userspace.
>
> Thanks

So we do have an open-source library called hl-thunk, which uses our
driver, and indeed that was part of the requirement.
It is similar to libdrm.
Here is the link:
https://github.com/HabanaAI/hl-thunk

That library also comes with a comprehensive suite of tests which shows
how to use the accelerator, and we have many NIC tests which show how to
use the NIC. All the rest of the user-space code in Habana goes through
that library.

Currently, you won't find the NIC code there because we didn't upstream
it as the driver code wasn't ready, but I'll push it there in a private
branch if you want to take a look.

Thanks,
Oded
On Fri, Sep 18, 2020 at 2:36 PM Gal Pressman <galpress@amazon.com> wrote:
>
> On 17/09/2020 20:18, Jason Gunthorpe wrote:
> > On Tue, Sep 15, 2020 at 11:46:58PM +0300, Oded Gabbay wrote:
> > > infrastructure for communication between multiple accelerators. Same
> > > as Nvidia uses NVlink, we use RDMA that we have inside our ASIC.
> > > The RDMA implementation we did does NOT support some basic RDMA
> > > IBverbs (such as MR and PD) and therefore, we can't use the rdma-core
> > > library or connect to the rdma infrastructure in the kernel.
> >
> > You can't create a parallel RDMA subsystem in netdev, or in misc, and
> > you can't add random device offloads as IOCTLs to netdevs.
> >
> > RDMA is the proper home for all the networking offloads that don't fit
> > into netdev.
> >
> > EFA was able to fit into rdma-core/etc and it isn't even RoCE at
> > all. I'm sure this can too.
>
> Well, EFA wasn't welcomed to the RDMA subsystem with open arms ;), initially it
> was suggested to go through the vfio subsystem instead.
>
> I think this comes back to the discussion we had when EFA was upstreamed, which
> is what's the bar to get accepted to the RDMA subsystem.
> IIRC, what we eventually agreed on is having a userspace rdma-core provider and
> ibv_{ud,rc}_pingpong working (or just supporting one of the IB spec's QP types?).
>
> Does GAUDI fit these requirements? If not, should it be in a different subsystem
> or should we open the "what qualifies as an RDMA device" question again?

Hi Itay,
Please see the above comments/questions.

Thanks,
Oded
On Fri, Sep 18, 2020 at 02:59:28PM +0300, Oded Gabbay wrote:
> On Fri, Sep 18, 2020 at 2:56 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> >
> > On Fri, Sep 18, 2020 at 02:36:10PM +0300, Gal Pressman wrote:
> > > I think this comes back to the discussion we had when EFA was upstreamed, which
> > > is what's the bar to get accepted to the RDMA subsystem.
> > > IIRC, what we eventually agreed on is having a userspace rdma-core provider and
> > > ibv_{ud,rc}_pingpong working (or just supporting one of the IB spec's QP types?).
> >
> > That is more or less where we ended up, yes.
> >
> > I'm most worried about this lack of PD and MR.
> >
> > Kernel must provide security for apps doing user DMA; PD and MR do
> > this. If the device doesn't have PD/MR then it is hard to see how a WQ
> > could ever be exposed directly to userspace, regardless of subsystem.
>
> Hi Jason,
> What you say here is very true and we handle that with different
> mechanisms. I will start working on a dedicated patch-set of the RDMA
> code in the next few weeks with MUCH MORE details in the commit
> messages. That will explain exactly how we expose stuff and protect.
>
> For example, regarding isolating between applications, we only support
> a single application opening our file descriptor.

Then the driver has a special PD create that requires the misc file
descriptor to authorize RDMA access to the resources in that security
context.

> Another example is that the submission of WQ is done through our QMAN
> mechanism and is NOT mapped to userspace (due to the restrictions you
> mentioned above and other restrictions).

Sure, other RDMA drivers also require a kernel ioctl for command
execution.

In this model the MR can be a software construct, again representing a
security authorization:

- A 'full process' MR, in which case the kernel command execution
  handles dma map and pinning at command execution time
- A 'normal' MR, in which case the DMA list is pre-created and the
  command execution just re-uses this data

The general requirement for RDMA is the same as DRM: you must provide
enough code in rdma-core to show how the device works, and minimally
test it. EFA uses ibv_ud_pingpong, and some pyverbs tests IIRC.

So you'll want to arrange something where the default MR and PD
mechanisms do something workable on this device, like auto-open the
misc FD when building the PD, and support the 'normal' MR flow for
command execution.

Jason
On Fri, Sep 18, 2020 at 03:07:19PM +0300, Oded Gabbay wrote:
> So we do have an open-source library called hl-thunk, which uses our
> driver and indeed that was part of the requirement.
> It is similar to libdrm.
> Here is the link:
> https://github.com/HabanaAI/hl-thunk

Are you kidding?

This is a mirror of some internal repository that looks like a dumpster,
with ChangeIds and internal bug tracker numbers, not part of major OS
distributions.

It is not an open-source library, and it shows very clearly why you chose
to upstream your driver through the drivers/misc/ tree.

Thanks
On Fri, Sep 18, 2020 at 3:19 PM Leon Romanovsky <leon@kernel.org> wrote:
>
> On Fri, Sep 18, 2020 at 03:07:19PM +0300, Oded Gabbay wrote:
> > So we do have an open-source library called hl-thunk, which uses our
> > driver and indeed that was part of the requirement.
> > It is similar to libdrm.
> > Here is the link:
> > https://github.com/HabanaAI/hl-thunk
>
> Are you kidding?
>
> This is a mirror of some internal repository that looks like a dumpster,
> with ChangeIds and internal bug tracker numbers, not part of major OS
> distributions.
>
> It is not an open-source library, and it shows very clearly why you chose
> to upstream your driver through the drivers/misc/ tree.
>
> Thanks

Adding Olof here.

No, usually not. But are you kidding? What exactly did you expect to
find? Is there an open-source project somewhere that encapsulates
deep-learning accelerators which I could connect to?
AFAIK, the only thing remotely relevant is CUDA, and that is
closed-source (strange to hear lectures about open source from NVIDIA
people here...).

So we are trying to give the community such an open-source library, or
at least an example. Hopefully one day, when more companies upstream
their drivers for deep-learning accelerators, we could do something like
libdrm or rdma-core, but for now, it's just our driver.

I have been in this community since 2013, with AMD and then Red Hat, and
I come with good intentions and a desire to open-source and upstream as
much as I can. I don't think I deserve this kind of response.

The bottom line is that we had this discussion with Greg and Olof and
the DRM people almost 2 years ago, and if there was some open-source
project in user-space or some subsystem in the kernel we could connect
to, we would have done that instead of what we did; but the fact of the
matter is there isn't such a thing. Olof tried and is trying to create a
h/w accelerator subsystem, but it still hasn't gotten off the ground
yet.

Oded
On Fri, Sep 18, 2020 at 3:16 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> In this model the MR can be a software construct, again representing a
> security authorization:
>
> - A 'full process' MR, in which case the kernel command execution
>   handles dma map and pinning at command execution time
> - A 'normal' MR, in which case the DMA list is pre-created and the
>   command execution just re-uses this data
>
> So you'll want to arrange something where the default MR and PD
> mechanisms do something workable on this device, like auto-open the
> misc FD when building the PD, and support the 'normal' MR flow for
> command execution.
>
> Jason

I don't know how we can support MR, because we can't support arbitrary
virtual addresses on the host; our internal MMU doesn't support 64 bits.
We investigated in the past and very much wanted to use IBverbs, but
didn't figure out how to make it work.
I'm adding Itay here and he can also shed more details on that.

Oded
On Fri, Sep 18, 2020 at 03:34:54PM +0300, Oded Gabbay wrote:
> I don't know how we can support MR, because we can't support arbitrary
> virtual addresses on the host; our internal MMU doesn't support 64 bits.
> We investigated in the past and very much wanted to use IBverbs, but
> didn't figure out how to make it work.
> I'm adding Itay here and he can also shed more details on that.

I'm not sure what that means; if the driver intends to DMA from process
memory then it certainly has an MR concept.

MRs can control the IOVA directly, so if you say the HW needs an MR IOVA
< 2**32 then that is still OK.

Jason
On Fri, Sep 18, 2020 at 3:50 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> I'm not sure what that means; if the driver intends to DMA from process
> memory then it certainly has an MR concept.
>
> MRs can control the IOVA directly, so if you say the HW needs an MR IOVA
> < 2**32 then that is still OK.
>
> Jason

Hi Jason,
I'll try to explain, but please bear with me because it requires some
understanding of our H/W architecture.

Our ASIC has 32 GB of HBM memory (similar to GPUs). The problem is that
the HBM memory is accessed by our ASIC's engines (DMA, NIC, etc.) with
physical addressing, which is mapped inside our device between 0x0 and
0x8_0000_0000.

Now, if a user performs a malloc and then maps that memory to our device
(using our memory MAP ioctl, similar to how a GPU works), he will get a
new virtual address, which is in the range 0x80_0000_0000 - (2^50 - 1).
Then, he can use that new VA in our device with the different engines
(DMA, NIC, compute). That way, addresses that represent host memory do
not overlap addresses that represent HBM memory.

The problem with MR is that the API doesn't let us return a new VA; it
forces us to use the original VA that the host OS allocated. What will
we do if that VA is in the range of our HBM addresses? The device won't
be able to distinguish between them. The transaction that is generated
by an engine inside our device will go to the HBM instead of going to
the PCI controller and then to the host.

That's the crux of the problem and why we didn't use MR.
If that's not clear, I'll be happy to explain more.

Thanks,
Oded
On Fri, Sep 18, 2020 at 03:31:51PM +0300, Oded Gabbay wrote: > On Fri, Sep 18, 2020 at 3:19 PM Leon Romanovsky <leon@kernel.org> wrote: > > > > On Fri, Sep 18, 2020 at 03:07:19PM +0300, Oded Gabbay wrote: > > > On Fri, Sep 18, 2020 at 3:03 PM Leon Romanovsky <leon@kernel.org> wrote: > > > > > > > > On Fri, Sep 18, 2020 at 02:56:09PM +0300, Oded Gabbay wrote: > > > > > On Fri, Sep 18, 2020 at 2:52 PM Leon Romanovsky <leon@kernel.org> wrote: > > > > > > > > > > > > On Fri, Sep 18, 2020 at 02:36:10PM +0300, Gal Pressman wrote: > > > > > > > On 17/09/2020 20:18, Jason Gunthorpe wrote: > > > > > > > > On Tue, Sep 15, 2020 at 11:46:58PM +0300, Oded Gabbay wrote: > > > > > > > >> infrastructure for communication between multiple accelerators. Same > > > > > > > >> as Nvidia uses NVlink, we use RDMA that we have inside our ASIC. > > > > > > > >> The RDMA implementation we did does NOT support some basic RDMA > > > > > > > >> IBverbs (such as MR and PD) and therefore, we can't use the rdma-core > > > > > > > >> library or to connect to the rdma infrastructure in the kernel. > > > > > > > > > > > > > > > > You can't create a parallel RDMA subsystem in netdev, or in misc, and > > > > > > > > you can't add random device offloads as IOCTL to nedevs. > > > > > > > > > > > > > > > > RDMA is the proper home for all the networking offloads that don't fit > > > > > > > > into netdev. > > > > > > > > > > > > > > > > EFA was able to fit into rdma-core/etc and it isn't even RoCE at > > > > > > > > all. I'm sure this can too. > > > > > > > > > > > > > > Well, EFA wasn't welcomed to the RDMA subsystem with open arms ;), initially it > > > > > > > was suggested to go through the vfio subsystem instead. > > > > > > > > > > > > > > I think this comes back to the discussion we had when EFA was upstreamed, which > > > > > > > is what's the bar to get accepted to the RDMA subsystem. 
> > > > > > > IIRC, what we eventually agreed on is having a userspace rdma-core provider and > > > > > > > ibv_{ud,rc}_pingpong working (or just supporting one of the IB spec's QP types?). > > > > > > > > > > > > > > Does GAUDI fit these requirements? If not, should it be in a different subsystem > > > > > > > or should we open the "what qualifies as an RDMA device" question again? > > > > > > > > > > > > I want to remind you that rdma-core requirement came to make sure that > > > > > > anything exposed from the RDMA to the userspace is strict with proper > > > > > > UAPI header hygiene. > > > > > > > > > > > > I doubt that Havana's ioctls are backed by anything like this. > > > > > > > > > > > > Thanks > > > > > > > > > > Why do you doubt that ? Have you looked at our code ? > > > > > Our uapi and IOCTLs interface is based on drm subsystem uapi interface > > > > > and it is very safe and protected. > > > > > > > > Yes, I looked and didn't find open-source users of your UAPI headers. > > > > It is not related to being safe or protected by to the common request > > > > to present userspace that relies on those exported interfaces. > > > > > > > > > Otherwise Greg would have never allowed me to go upstream in the first place. > > > > > > > > Nice, can we get a link? > > > > > > > > > > > > > > We have a single function which is the entry point for all the IOCTLs > > > > > of our drivers (only one IOCTL is RDMA related, all the others are > > > > > compute related). > > > > > That function is almost 1:1 copy of the function in drm. > > > > > > > > DRM has same rules as RDMA, no kernel code will be merged without seeing > > > > open-source userspace. > > > > > > > > Thanks > > > > > > > > > > > > > > Thanks, > > > > > Oded > > > > > > So we do have an open-source library called hl-thunk, which uses our > > > driver and indeed that was part of the requirement. > > > It is similar to libdrm. 
> > > Here is the link: > > > https://github.com/HabanaAI/hl-thunk > > > > Are you kidding? > > > > This is mirror of some internal repository that looks like dumpster > > with ChangeId, internal bug tracker numbers, not part of major OS > > distributions. > > > > It is not open-source library and shows very clear why you chose > > to upstream your driver through driver/misc/ tree. > > > > Thanks > > Adding Olof here. > > No, usually not. > But are you kidding ? > What did you exactly expect to find ? Is there an open-source project > somewhere that encapsulates Deep-learning accelerators which I could > connect to ? I would expect certain level of code quality, collaboration and review that distros require for inclusion. It is not the case for the github repo you presented. > AFAIK, the only thing remotely relevant is CUDA and that is > closed-source (strange to hear lectures about open-source from NVIDIA > people here...) Please check git log statistics to estimate Nvidia/Mellanox/Cumulus contributions to the Linux kernel and the open-source. You will be surprised. > > So we are trying to give to the community such an open source library, > or at least an example. Hopefully one day, when more companies > upstream their drivers for deep-learning accelerators we could do > something like libdrm or rdma-core, but for now, it's just our driver. AFAIR, your driver is not unique, HiSilicon tried to submit something similar years ago (warpdrive) and they are not alone. > > I have been in this community since 2013 with AMD and then RedHat, and > I come with good intentions and a desire to open source and upstream > as much as I can. I don't think I deserve this kind of response. There is no need to take it personal. It was you who posted a link to the github repo. What did you expect? 
> > The bottom line is that we had this discussion with Greg and Olof and > DRM people almost 2 years ago and if there was some open-source > project in user-space or some subsystem in the kernel we could connect > to, we would have done that instead of what we did, but the fact of > the matter is there isn't such a thing. Olof tried and is trying to create > a h/w accelerator subsystem but it still hasn't gotten off the ground > yet. Maybe it is time to do it right. > > Oded
On Fri, Sep 18, 2020 at 04:02:24PM +0300, Oded Gabbay wrote: > The problem with MR is that the API doesn't let us return a new VA. It > forces us to use the original VA that the Host OS allocated. If using the common MR API you'd have to assign a unique linear range in the single device address map and record both the IOVA and the MMU VA in the kernel struct. Then when submitting work using that MR lkey the kernel will adjust the work VA using the equation (WORK_VA - IOVA) + MMU_VA before forwarding to HW. EFA doesn't support rkeys, so they are not required to be emulated. It would have to create rkeys using some gaudidv_reg_mr_rkey(). It is important to understand that the usual way we support these non-RDMA devices is to insist that they use SW to construct a minimal standards based RDMA API, and then allow the device to have a 'dv' API to access a faster, highly device specific, SW bypass path. So for instance you might have some gaudidv_post_work(qp) that doesn't use lkeys and works directly on the MMU_VA. A gaudidv_get_mmu_va(mr) would return the required HW VA from the kernel. Usually the higher level communication library (UCX, MPI, etc) forms the dv primitives into something application usable. > we do if that VA is in the range of our HBM addresses ? The device > won't be able to distinguish between them. The transaction that is > generated by an engine inside our device will go to the HBM instead of > going to the PCI controller and then to the host. > > That's the crux of the problem and why we didn't use MR. No, the problem with the device is that it doesn't have an lkey/rkey, so it is stuck with a single translation domain. RoCE compliant devices are required to have multiple translation domains - each lkey/rkey specifies a unique translation. The MR concept is a region of process VA mapped into the device for device access, and this device *clearly* has that. Jason
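The slow-path fix-up Jason describes (record both the IOVA and the device-MMU VA at registration time, then rewrite each work-request VA as (WORK_VA - IOVA) + MMU_VA before it reaches HW) can be sketched in a few lines. This is illustrative C only; the struct and function names are invented for the example and do not come from any driver:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative only: per-MR record of the IOVA range handed back at
 * registration time and the VA actually programmed into the device MMU. */
struct mr_rec {
    uint64_t iova;    /* start of the unique linear range in the device map */
    uint64_t mmu_va;  /* corresponding base VA in the device MMU */
    uint64_t length;  /* length of the registered region */
};

/* Slow path: rewrite a work-request VA before forwarding to HW, using
 * the equation from the mail: (WORK_VA - IOVA) + MMU_VA.
 * Returns 0 on success, -1 if the VA falls outside the MR. */
int adjust_work_va(const struct mr_rec *mr, uint64_t work_va, uint64_t *hw_va)
{
    if (work_va < mr->iova || work_va >= mr->iova + mr->length)
        return -1;
    *hw_va = (work_va - mr->iova) + mr->mmu_va;
    return 0;
}
```

The per-submission subtraction and bounds check is what Oded objects to below as a throughput cost; the 'dv' fast path in Jason's proposal exists precisely to skip it.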
On Fri, Sep 18, 2020 at 4:26 PM Jason Gunthorpe <jgg@ziepe.ca> wrote: > > On Fri, Sep 18, 2020 at 04:02:24PM +0300, Oded Gabbay wrote: > > > The problem with MR is that the API doesn't let us return a new VA. It > > forces us to use the original VA that the Host OS allocated. > > If using the common MR API you'd have to assign a unique linear range > in the single device address map and record both the IOVA and the MMU > VA in the kernel struct. > > Then when submitting work using that MR lkey the kernel will adjust > the work VA using the equation (WORK_VA - IOVA) + MMU_VA before > forwarding to HW. > We can't do that. That will kill the performance. If for every submission I need to modify the packet's contents, the throughput will go downhill. Also, submissions to our RDMA qmans are coupled with submissions to our DMA/Compute QMANs. We can't separate those to different API calls. That will also kill performance and in addition, will prevent us from synchronizing all the engines. I also have to say, it troubles me that you keep referring to our device as an RDMA device. It is not an RDMA device. It is a deep-learning accelerator which uses RDMA as a way to interconnect multiple devices. We don't intend to replace General-Purpose RDMA devices. We know we don't support that. Therefore, I still fail to see why we need to support all the above... Our work submission is not to just "send/receive packets". Sending packets is part of a general recipe to do DMA, perform compute on data and send/receive data. All together, in a synchronized fashion. The way you try to force me to go is to separate that into different functionality, as if I have different ASICs, which is very counter-productive in terms of performance and simplicity. i.e. have one method of submitting work to DMA/compute and another way to RDMA ports. I know this is how the kernel is structured now - subsystems for devices that belong to a single domain (graphics, net, storage). 
But I fear that you will soon see this paradigm doesn't work with new devices in AI, which combine multiple domains into a single ASIC. Greg, I would love to hear your opinion here. Am I totally wrong ? Is treating a single ASIC that belongs to multiple domains as if it were multiple ASICs a good thing ? Don't you think it will hurt the performance ? Oded > EFA doesn't support rkeys, so they are not required to be emulated. It > would have to create rkeys using some gaudidv_reg_mr_rkey(). > > It is important to understand that the usual way we support these > non-RDMA devices is to insist that they use SW to construct a minimal > standards based RDMA API, and then allow the device to have a 'dv' API > to access a faster, highly device specific, SW bypass path. > > So for instance you might have some gaudidv_post_work(qp) that doesn't > use lkeys and works directly on the MMU_VA. A gaudidv_get_mmu_va(mr) > would return the required HW VA from the kernel. > > Usually the higher level communication library (UCX, MPI, etc) forms > the dv primitives into something application usable. > > > we do if that VA is in the range of our HBM addresses ? The device > > won't be able to distinguish between them. The transaction that is > > generated by an engine inside our device will go to the HBM instead of > > going to the PCI controller and then to the host. > > > > That's the crux of the problem and why we didn't use MR. > > No, the problem with the device is that it doesn't have an lkey/rkey, > so it is stuck with a single translation domain. RoCE compliant > devices are required to have multiple translation domains - each > lkey/rkey specifies a unique translation. > > The MR concept is a region of process VA mapped into the device for > device access, and this device *clearly* has that. > > Jason
On Fri, Sep 18, 2020 at 04:49:25PM +0300, Oded Gabbay wrote: > On Fri, Sep 18, 2020 at 4:26 PM Jason Gunthorpe <jgg@ziepe.ca> wrote: > > > > On Fri, Sep 18, 2020 at 04:02:24PM +0300, Oded Gabbay wrote: > > > > > The problem with MR is that the API doesn't let us return a new VA. It > > > forces us to use the original VA that the Host OS allocated. > > > > If using the common MR API you'd have to assign a unique linear range > > in the single device address map and record both the IOVA and the MMU > > VA in the kernel struct. > > > > Then when submitting work using that MR lkey the kernel will adjust > > the work VA using the equation (WORK_VA - IOVA) + MMU_VA before > > forwarding to HW. > > > We can't do that. That will kill the performance. If for every > submission I need to modify the packet's contents, the throughput will > go downhill. You clearly didn't read where I explained there is a fast path and slow path expectation. > Also, submissions to our RDMA qmans are coupled with submissions to > our DMA/Compute QMANs. We can't separate those to different API calls. > That will also kill performance and in addition, will prevent us from > synchronizing all the engines. Not sure I see why this is a problem. I already explained the fast device specific path. As long as the kernel maintains proper security when it processes submissions the driver can allow objects to cross between the two domains. Jason
On Fri, Sep 18, 2020 at 4:59 PM Jason Gunthorpe <jgg@ziepe.ca> wrote: > > On Fri, Sep 18, 2020 at 04:49:25PM +0300, Oded Gabbay wrote: > > On Fri, Sep 18, 2020 at 4:26 PM Jason Gunthorpe <jgg@ziepe.ca> wrote: > > > > > > On Fri, Sep 18, 2020 at 04:02:24PM +0300, Oded Gabbay wrote: > > > > > > > The problem with MR is that the API doesn't let us return a new VA. It > > > > forces us to use the original VA that the Host OS allocated. > > > > > > If using the common MR API you'd have to assign a unique linear range > > > in the single device address map and record both the IOVA and the MMU > > > VA in the kernel struct. > > > > > > Then when submitting work using that MR lkey the kernel will adjust > > > the work VA using the equation (WORK_VA - IOVA) + MMU_VA before > > > forwarding to HW. > > > > > We can't do that. That will kill the performance. If for every > > submission I need to modify the packet's contents, the throughput will > > go downhill. > > You clearly didn't read where I explained there is a fast path and > slow path expectation. > > > Also, submissions to our RDMA qmans are coupled with submissions to > > our DMA/Compute QMANs. We can't separate those to different API calls. > > That will also kill performance and in addition, will prevent us from > > synchronizing all the engines. > > Not sure I see why this is a problem. I already explained the fast > device specific path. > > As long as the kernel maintains proper security when it processes > submissions the driver can allow objects to cross between the two > domains. Can you please explain what you mean by "two domains" ? You mean the RDMA and compute domains ? Or something else ? What I was trying to say is that I don't want the application to split its submissions to different system calls. Currently we perform submissions through the CS_IOCTL that is defined in our driver. 
It is a single IOCTL which allows the user to submit work to all queues, without regard to the underlying engine of each queue. If I need to split that to different system calls it will have major implications. I don't even want to start thinking about all the synchronization at the host (userspace) level that I will need to do. That's what I meant by saying that you force me to treat my device as if it were multiple devices. The whole point of our ASIC is to combine multiple IPs on the same ASIC. What will happen when we will add a third domain to our device (e.g. storage, video decoding, encryption engine, whatever). Will I then need to separate submissions to 3 different system calls ? In 3 different subsystems ? This doesn't scale. And I strongly say that it will kill the performance of the device. Not because of the driver. Because of the complications to the user-space. Oded > > Jason
On Fri, Sep 18, 2020 at 05:12:04PM +0300, Oded Gabbay wrote: > On Fri, Sep 18, 2020 at 4:59 PM Jason Gunthorpe <jgg@ziepe.ca> wrote: > > > > On Fri, Sep 18, 2020 at 04:49:25PM +0300, Oded Gabbay wrote: > > > On Fri, Sep 18, 2020 at 4:26 PM Jason Gunthorpe <jgg@ziepe.ca> wrote: > > > > > > > > On Fri, Sep 18, 2020 at 04:02:24PM +0300, Oded Gabbay wrote: > > > > > > > > > The problem with MR is that the API doesn't let us return a new VA. It > > > > > forces us to use the original VA that the Host OS allocated. > > > > > > > > If using the common MR API you'd have to assign a unique linear range > > > > in the single device address map and record both the IOVA and the MMU > > > > VA in the kernel struct. > > > > > > > > Then when submitting work using that MR lkey the kernel will adjust > > > > the work VA using the equation (WORK_VA - IOVA) + MMU_VA before > > > > forwarding to HW. > > > > > > > We can't do that. That will kill the performance. If for every > > > submission I need to modify the packet's contents, the throughput will > > > go downhill. > > > > You clearly didn't read where I explained there is a fast path and > > slow path expectation. > > > > > Also, submissions to our RDMA qmans are coupled with submissions to > > > our DMA/Compute QMANs. We can't separate those to different API calls. > > > That will also kill performance and in addition, will prevent us from > > > synchronizing all the engines. > > > > Not sure I see why this is a problem. I already explained the fast > > device specific path. > > > > As long as the kernel maintains proper security when it processes > > submissions the driver can allow objects to cross between the two > > domains. > Can you please explain what you mean by "two domains" ? > You mean the RDMA and compute domains ? Or something else ? Yes > What I was trying to say is that I don't want the application to split > its submissions to different system calls. If you can manage the security then you can cross them. 
E.g. since the RDMA PD would be created on top of the /dev/misc char dev then it is fine for the /dev/misc char dev to access the RDMA objects as a 'dv fast path'. But now that you say everything is interconnected, I'm wondering, without HW security how do you keep netdev isolated from userspace? Can I issue commands to /dev/misc and write to kernel memory (does the kernel put any pages into the single MMU?) or corrupt the netdev driver operations in any way? Jason
On Fri, Sep 18, 2020 at 5:19 PM Jason Gunthorpe <jgg@ziepe.ca> wrote: > > On Fri, Sep 18, 2020 at 05:12:04PM +0300, Oded Gabbay wrote: > > On Fri, Sep 18, 2020 at 4:59 PM Jason Gunthorpe <jgg@ziepe.ca> wrote: > > > > > > On Fri, Sep 18, 2020 at 04:49:25PM +0300, Oded Gabbay wrote: > > > > On Fri, Sep 18, 2020 at 4:26 PM Jason Gunthorpe <jgg@ziepe.ca> wrote: > > > > > > > > > > On Fri, Sep 18, 2020 at 04:02:24PM +0300, Oded Gabbay wrote: > > > > > > > > > > > The problem with MR is that the API doesn't let us return a new VA. It > > > > > > forces us to use the original VA that the Host OS allocated. > > > > > > > > > > If using the common MR API you'd have to assign a unique linear range > > > > > in the single device address map and record both the IOVA and the MMU > > > > > VA in the kernel struct. > > > > > > > > > > Then when submitting work using that MR lkey the kernel will adjust > > > > > the work VA using the equation (WORK_VA - IOVA) + MMU_VA before > > > > > forwarding to HW. > > > > > > > > > We can't do that. That will kill the performance. If for every > > > > submission I need to modify the packet's contents, the throughput will > > > > go downhill. > > > > > > You clearly didn't read where I explained there is a fast path and > > > slow path expectation. > > > > > > > Also, submissions to our RDMA qmans are coupled with submissions to > > > > our DMA/Compute QMANs. We can't separate those to different API calls. > > > > That will also kill performance and in addition, will prevent us from > > > > synchronizing all the engines. > > > > > > Not sure I see why this is a problem. I already explained the fast > > > device specific path. > > > > > > As long as the kernel maintains proper security when it processes > > > submissions the driver can allow objects to cross between the two > > > domains. > > Can you please explain what you mean by "two domains" ? > > You mean the RDMA and compute domains ? Or something else ? 
> > Yes > > > What I was trying to say is that I don't want the application to split > > its submissions to different system calls. > > If you can manage the security then you can cross them. E.g. since the > RDMA PD would be created on top of the /dev/misc char dev then it is > fine for the /dev/misc char dev to access the RDMA objects as a 'dv > fast path'. > > But now that you say everything is interconnected, I'm wondering, > without HW security how do you keep netdev isolated from userspace? > > Can I issue commands to /dev/misc and write to kernel memory (does the > kernel put any pages into the single MMU?) or corrupt the netdev > driver operations in any way? > > Jason No, no, no. Please give me more credit :) btw, our kernel interface was scrutinized when we upstreamed the driver and it was under review by the Intel security team. To explain our security mechanism will require some time. It is detailed in the driver, but it is hard to understand without some background. I wonder where to start... First of all, we support open, close, mmap and IOCTLs to /dev/misc/hlX. We don't support read/write system calls. A user never gets direct access to kernel memory. Only through standard mmap. The only thing we allow to mmap is a command buffer (which is used to submit work to certain DMA queues on our device) and to a memory region we use for "CQ" for the RDMA. That's it. Any access by the device's engines to the host memory is done via our device's MMU. Our MMU supports multiple ASIDs - Address Space IDs. The kernel driver is assigned ASID 0, while the user is assigned ASID 1. We can support up to 1024 ASIDs, but because we limit the user to have a single application, we only use ASID 0 and 1. The above means a user can't program an engine (DMA, NIC, compute) to access memory he didn't first map into our device's MMU. 
The mapping is done via one of our IOCTLs and the kernel driver makes sure (using standard kernel internal APIs) the host memory truly belongs to the user process. All those mappings are done using ASID 1. If the driver needs to map kernel pages into the device's MMU, then this is done using ASID 0. This is how we take care of separation between kernel memory and user memory. Each transaction our engines create that is going to the host first passes through our MMU. The transaction comes with its ASID value. According to that, the MMU knows which page tables to do the walk on. Specifically regarding RDMA, the user prepares a WQE on the host memory in an area which is mapped into our MMU using ASID 1. The user uses the NIC control IOCTL to give the kernel driver the virtual base address of the WQ and the driver programs it to the H/W. Then, the user can submit the WQE by submitting a command buffer to the NIC QMAN. The command buffer contains a message to the QMAN that tells it to ring the doorbell of the relevant NIC port. The user can't do it from userspace. For regular Ethernet traffic, we don't have any IOCTLs of course. All Ethernet operations are done via the standard networking subsystem (sockets, etc.). There are more details of course. I don't know how much you want me to go deeper. If you have specific questions I'll be happy to answer. Oded
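The ASID separation Oded describes can be modeled as one translation table per ASID, with every transaction tagged by the ASID whose tables the MMU walks. The following is a toy C model for illustration only; none of these structures or names exist in the habanalabs driver, and the real hardware walks multi-level page tables rather than a flat list:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_ASIDS    2   /* ASID 0 = kernel driver, ASID 1 = user, per the mail */
#define MAX_MAPPINGS 16

struct asid_mapping {
    uint64_t va, pa, len;
};

/* One set of "page tables" (here just a flat mapping list) per ASID.
 * A transaction arrives tagged with its ASID and the walk only consults
 * that ASID's tables, so an engine acting for the user (ASID 1) can
 * never resolve a VA that was mapped for the kernel (ASID 0). */
static struct asid_mapping tables[NUM_ASIDS][MAX_MAPPINGS];
static size_t n_maps[NUM_ASIDS];

int mmu_map(unsigned int asid, uint64_t va, uint64_t pa, uint64_t len)
{
    if (asid >= NUM_ASIDS || n_maps[asid] == MAX_MAPPINGS)
        return -1;
    tables[asid][n_maps[asid]++] = (struct asid_mapping){ va, pa, len };
    return 0;
}

/* Translate a VA for a transaction tagged with 'asid'; -1 means fault. */
int mmu_translate(unsigned int asid, uint64_t va, uint64_t *pa)
{
    if (asid >= NUM_ASIDS)
        return -1;
    for (size_t i = 0; i < n_maps[asid]; i++) {
        const struct asid_mapping *m = &tables[asid][i];
        if (va >= m->va && va < m->va + m->len) {
            *pa = m->pa + (va - m->va);
            return 0;
        }
    }
    return -1;
}
```

In this model a user-tagged transaction targeting a kernel-only VA simply faults, which is the isolation property the security review relied on.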
On Fri, Sep 18, 2020 at 05:45:21PM +0300, Oded Gabbay wrote: > Any access by the device's engines to the host memory is done via our > device's MMU. Our MMU supports multiple ASIDs - Address Space IDs. The > kernel driver is assigned ASID 0, while the user is assigned ASID 1. > We can support up to 1024 ASIDs, but because we limit the user to have > a single application, we only use ASID 0 and 1. If the QP/WQ/etc is HW bound to an ASID then that binding is called a PD and the ASID is acting in the PD role. If the ASID is translating from on the wire IOVA to DMA PA, then it is acting in the MR role as well. Bundling those two things together is not as flexible as standards based RDMA, but it is not as far away as you are making things out to be. Jason
On Fri, Sep 18, 2020 at 6:07 PM Jason Gunthorpe <jgg@ziepe.ca> wrote: > > On Fri, Sep 18, 2020 at 05:45:21PM +0300, Oded Gabbay wrote: > > > Any access by the device's engines to the host memory is done via our > > device's MMU. Our MMU supports multiple ASIDs - Address Space IDs. The > > kernel driver is assigned ASID 0, while the user is assigned ASID 1. > > We can support up to 1024 ASIDs, but because we limit the user to have > > a single application, we only use ASID 0 and 1. > > If the QP/WQ/etc is HW bound to an ASID then that binding is called a > PD and the ASID is acting in the PD role. > > If the ASID is translating from on the wire IOVA to DMA PA, then it is > acting in the MR role as well. > > Bundling those two things together is not as flexible as standards > based RDMA, but it is not as far away as you are making things out to > be. > > Jason But Jason, why do I need to use RDMA definitions in my common code ? RDMA is such a small part of our ASIC. We also have an ASIC called GOYA for inference, which is handled by the same driver, but doesn't have RDMA ports at all. Why would I need to use RDMA definitions for that ? I'm sorry, but you won't be able to convince me here that I need to "enslave" my entire code to RDMA, just because my ASIC "also" has some RDMA ports. On the same weight, the GPU people tried and failed to say that my device is a GPU. And I think the reasoning that we applied back then, and Greg and Olof agreed with it, applies here as well. I want to play along, but it has to be something that won't make my entire device's driver into an RDMA driver. And it has to be something that doesn't hurt performance. All other things can and will be changed according to your inputs. Thanks, Oded
On Fri, Sep 18, 2020 at 06:15:52PM +0300, Oded Gabbay wrote: > I'm sorry, but you won't be able to convince me here that I need to > "enslave" my entire code to RDMA, just because my ASIC "also" has some > RDMA ports. You can't recreate common shared subsystems in a driver just because you don't want to work with the subsystem. I don't care what else the ASIC has. In Linux the netdev part is exposed through netdev, the RDMA part through RDMA, the totally-not-a-GPU part through drivers/misc. It has always been this way. Chelsio didn't get to rebuild the SCSI stack in their driver just because "storage is a small part of their device". Drivers are not allowed to re-implement I2C/SPI/etc without re-using the common code for that just because "I2C is a small part of their device". Exposing to userspace the creation of RoCE QPs and their related objects is unambiguously an RDMA subsystem task. I don't even know how you think you can argue it is not. It is your company proudly claiming the device has 100G RoCE ports in all the marketing literature, after all. It is too bad the device has a non-standards compliant implementation of RoCE so this will be a bit hard for you. Oh well. Jason
On Fri, Sep 18, 2020 at 03:19:05PM +0300, Leon Romanovsky wrote: > > So we do have an open-source library called hl-thunk, which uses our > > driver and indeed that was part of the requirement. > > It is similar to libdrm. > > Here is the link: > > https://github.com/HabanaAI/hl-thunk > > Are you kidding? > > This is mirror of some internal repository that looks like dumpster > with ChangeId, internal bug tracker numbers, not part of major OS > distributions. > > It is not open-source library and shows very clear why you chose > to upstream your driver through driver/misc/ tree. It is an open source library, as per the license and the code availability. What more is expected here? No distro has to pick it up, that's not a requirement for kernel code, we have many kernel helper programs that are not in distros. Heck, udev took a long time to get into distros, does that mean the kernel side of that interface should never have been merged? I don't understand your complaint here, it's not our place to judge the code quality of userspace libraries, otherwise we would never get any real-work done :) thanks, greg k-h
On Sat, Sep 19, 2020 at 08:40:20AM +0200, Greg Kroah-Hartman wrote: > On Fri, Sep 18, 2020 at 03:19:05PM +0300, Leon Romanovsky wrote: > > > So we do have an open-source library called hl-thunk, which uses our > > > driver and indeed that was part of the requirement. > > > It is similar to libdrm. > > > Here is the link: > > > https://github.com/HabanaAI/hl-thunk > > > > Are you kidding? > > > > This is mirror of some internal repository that looks like dumpster > > with ChangeId, internal bug tracker numbers, not part of major OS > > distributions. > > > > It is not open-source library and shows very clear why you chose > > to upstream your driver through driver/misc/ tree. > > It is an open source library, as per the license and the code > availability. What more is expected here? So can I fork iproute2, add bunch of new custom netlink UAPIs and expect Dave to merge it after I throw it on github? > > No distro has to pick it up, that's not a requirement for kernel code, > we have many kernel helper programs that are not in distros. Heck, udev > took a long time to get into distros, does that mean the kernel side of > that interface should never have been merged? > > I don't understand your complaint here, it's not our place to judge the > code quality of userspace libraries, otherwise we would never get any > real-work done :) My main complaint is that you can't imagine merging code into large subsystems (netdev, RDMA, DRM? e.t.c) without being civil open-source citizen. It means use of existing user-space libraries/tools and/or providing new ones that will be usable for everyone. In this case, we have some custom char device with library that is not usable for anyone else and this is why drivers/misc/ is right place. While we are talking about real-work, it is our benefit to push companies to make investment into ecosystem and not letting them to find an excuse for not doing it. Thanks > > thanks, > > greg k-h
On Sat, Sep 19, 2020 at 11:20:03AM +0300, Leon Romanovsky wrote: > On Sat, Sep 19, 2020 at 08:40:20AM +0200, Greg Kroah-Hartman wrote: > > On Fri, Sep 18, 2020 at 03:19:05PM +0300, Leon Romanovsky wrote: > > > > So we do have an open-source library called hl-thunk, which uses our > > > > driver and indeed that was part of the requirement. > > > > It is similar to libdrm. > > > > Here is the link: > > > > https://github.com/HabanaAI/hl-thunk > > > > > > Are you kidding? > > > > > > This is mirror of some internal repository that looks like dumpster > > > with ChangeId, internal bug tracker numbers, not part of major OS > > > distributions. > > > > > > It is not open-source library and shows very clear why you chose > > > to upstream your driver through driver/misc/ tree. > > > > It is an open source library, as per the license and the code > > availability. What more is expected here? > > So can I fork iproute2, add bunch of new custom netlink UAPIs and expect > Dave to merge it after I throw it on github? Don't be silly, that's not the case here at all and you know that. > > No distro has to pick it up, that's not a requirement for kernel code, > > we have many kernel helper programs that are not in distros. Heck, udev > > took a long time to get into distros, does that mean the kernel side of > > that interface should never have been merged? > > > > I don't understand your complaint here, it's not our place to judge the > > code quality of userspace libraries, otherwise we would never get any > > real-work done :) > > My main complaint is that you can't imagine merging code into large > subsystems (netdev, RDMA, DRM? e.t.c) without being civil open-source > citizen. It means use of existing user-space libraries/tools and/or > providing new ones that will be usable for everyone. Agreed. > In this case, we have some custom char device with library that is not > usable for anyone else and this is why drivers/misc/ is right place. Also agreed. 
> While we are talking about real-work, it is our benefit to push companies > to make investment into ecosystem and not letting them to find an excuse > for not doing it. So why are you complaining about a stand-alone driver that does not have any shared subsystems' userspace code to control that driver? Yes, when integrating into other subsystems (i.e. networking and rdma), they should use those common subsystems interfaces, no one is arguing that at all. totally lost, greg k-h
On Sat, Sep 19, 2020 at 10:30:12AM +0200, Greg Kroah-Hartman wrote: > On Sat, Sep 19, 2020 at 11:20:03AM +0300, Leon Romanovsky wrote: > > On Sat, Sep 19, 2020 at 08:40:20AM +0200, Greg Kroah-Hartman wrote: > > > On Fri, Sep 18, 2020 at 03:19:05PM +0300, Leon Romanovsky wrote: > > > > > So we do have an open-source library called hl-thunk, which uses our > > > > > driver and indeed that was part of the requirement. > > > > > It is similar to libdrm. > > > > > Here is the link: > > > > > https://github.com/HabanaAI/hl-thunk > > > > > > > > Are you kidding? > > > > > > > > This is mirror of some internal repository that looks like dumpster > > > > with ChangeId, internal bug tracker numbers, not part of major OS > > > > distributions. > > > > > > > > It is not open-source library and shows very clear why you chose > > > > to upstream your driver through driver/misc/ tree. > > > > > > It is an open source library, as per the license and the code > > > availability. What more is expected here? > > > > So can I fork iproute2, add bunch of new custom netlink UAPIs and expect > > Dave to merge it after I throw it on github? > > Don't be silly, that's not the case here at all and you know that. It was far-fetched example. > > > > No distro has to pick it up, that's not a requirement for kernel code, > > > we have many kernel helper programs that are not in distros. Heck, udev > > > took a long time to get into distros, does that mean the kernel side of > > > that interface should never have been merged? > > > > > > I don't understand your complaint here, it's not our place to judge the > > > code quality of userspace libraries, otherwise we would never get any > > > real-work done :) > > > > My main complaint is that you can't imagine merging code into large > > subsystems (netdev, RDMA, DRM? e.t.c) without being civil open-source > > citizen. It means use of existing user-space libraries/tools and/or > > providing new ones that will be usable for everyone. > > Agreed. 
> > > In this case, we have some custom char device with library that is not > > usable for anyone else and this is why drivers/misc/ is right place. > > Also agreed. > > > While we are talking about real-work, it is our benefit to push companies > > to make investment into ecosystem and not letting them to find an excuse > > for not doing it. > > So why are you complaining about a stand-alone driver that does not have > any shared subsystems's userspace code to control that driver? I didn't, everything started when I explained to Gal why RDMA subsystem requires rdma-core counterpart for any UAPI code. https://lore.kernel.org/linux-rdma/CAFCwf12B4vCCwmfA7+VTUYUgJ9EHAtvg6F0bMYnsSCUBST+aWA@mail.gmail.com/T/#m17d52d61adadf54c12bfecf1af5db40f5d829ac3 And expressed my view on the quality of the library that was presented as open-source example. https://lore.kernel.org/linux-rdma/CAFCwf12B4vCCwmfA7+VTUYUgJ9EHAtvg6F0bMYnsSCUBST+aWA@mail.gmail.com/T/#m9059c5a9405ba932d9ffb731195a43b27443d265 > > Yes, when integrating into other subsystems (i.e. networking and rdma), > they should use those common subsystems interfaces, no one is arguing > that at all. > > totally lost, And here comes my request to do it right https://lore.kernel.org/linux-rdma/CAFCwf12B4vCCwmfA7+VTUYUgJ9EHAtvg6F0bMYnsSCUBST+aWA@mail.gmail.com/T/#ma1fa6fe63666f630674eb668f1c00e6a672db85b All that I asked from Oded is to do UAPI/libraries right, while all the responses can be summarized to one sentence - "it is too hard, we don't want to do it." Thanks > > greg k-h
On Sat, Sep 19, 2020 at 11:30 AM Greg Kroah-Hartman
<gregkh@linuxfoundation.org> wrote:
>
> On Sat, Sep 19, 2020 at 11:20:03AM +0300, Leon Romanovsky wrote:
> > On Sat, Sep 19, 2020 at 08:40:20AM +0200, Greg Kroah-Hartman wrote:
> > > On Fri, Sep 18, 2020 at 03:19:05PM +0300, Leon Romanovsky wrote:
> > > > > So we do have an open-source library called hl-thunk, which uses our
> > > > > driver and indeed that was part of the requirement.
> > > > > It is similar to libdrm.
> > > > > Here is the link:
> > > > > https://github.com/HabanaAI/hl-thunk
> > > >
> > > > Are you kidding?
> > > >
> > > > This is a mirror of some internal repository that looks like a dumpster
> > > > with ChangeId, internal bug tracker numbers, not part of major OS
> > > > distributions.
> > > >
> > > > It is not an open-source library, and it shows very clearly why you chose
> > > > to upstream your driver through the driver/misc/ tree.
> > >
> > > It is an open source library, as per the license and the code
> > > availability. What more is expected here?
> >
> > So can I fork iproute2, add a bunch of new custom netlink UAPIs and expect
> > Dave to merge it after I throw it on github?
>
> Don't be silly, that's not the case here at all and you know that.
>
> > > No distro has to pick it up, that's not a requirement for kernel code,
> > > we have many kernel helper programs that are not in distros. Heck, udev
> > > took a long time to get into distros, does that mean the kernel side of
> > > that interface should never have been merged?
> > >
> > > I don't understand your complaint here, it's not our place to judge the
> > > code quality of userspace libraries, otherwise we would never get any
> > > real work done :)
> >
> > My main complaint is that you can't imagine merging code into large
> > subsystems (netdev, RDMA, DRM, etc.) without being a civil open-source
> > citizen. That means use of existing user-space libraries/tools and/or
> > providing new ones that will be usable for everyone.
>
> Agreed.
>
> > In this case, we have some custom char device with a library that is not
> > usable for anyone else, and this is why drivers/misc/ is the right place.
>
> Also agreed.
>
> > While we are talking about real work, it is to our benefit to push companies
> > to make an investment into the ecosystem and not let them find an excuse
> > for not doing it.
>
> So why are you complaining about a stand-alone driver that does not have
> any shared subsystem's userspace code to control that driver?
>
> Yes, when integrating into other subsystems (i.e. networking and rdma),
> they should use those common subsystems' interfaces, no one is arguing
> that at all.

Hi Greg,

It's probably heresy, but why do I need to integrate into the RDMA subsystem?

I understand your reasoning about networking (Ethernet), as the driver
connects to the kernel networking stack (netdev), but with RDMA the driver
doesn't use or connect to anything in that stack. If I were to support
IBverbs and declare that I support it, then of course I would need to
integrate into the RDMA subsystem and add my backend to rdma-core. But we
don't do that, so why am I being forced to support IBverbs?

Forcing GAUDI to use the RDMA stack and IBverbs is like swatting flies with
a sledgehammer. I do hope that in future devices we will support it
natively, and of course then we will integrate as requested, but for GAUDI
it is just a huge overkill IMHO.

Thanks,
Oded

> totally lost,
>
> greg k-h
On Sat, Sep 19, 2020 at 07:43:28PM +0300, Oded Gabbay wrote:
> On Sat, Sep 19, 2020 at 11:30 AM Greg Kroah-Hartman
> <gregkh@linuxfoundation.org> wrote:
> >
> > On Sat, Sep 19, 2020 at 11:20:03AM +0300, Leon Romanovsky wrote:
> > > On Sat, Sep 19, 2020 at 08:40:20AM +0200, Greg Kroah-Hartman wrote:
> > > > On Fri, Sep 18, 2020 at 03:19:05PM +0300, Leon Romanovsky wrote:
> > > > > > So we do have an open-source library called hl-thunk, which uses our
> > > > > > driver and indeed that was part of the requirement.
> > > > > > It is similar to libdrm.
> > > > > > Here is the link:
> > > > > > https://github.com/HabanaAI/hl-thunk
> > > > >
> > > > > Are you kidding?
> > > > >
> > > > > This is a mirror of some internal repository that looks like a dumpster
> > > > > with ChangeId, internal bug tracker numbers, not part of major OS
> > > > > distributions.
> > > > >
> > > > > It is not an open-source library, and it shows very clearly why you chose
> > > > > to upstream your driver through the driver/misc/ tree.
> > > >
> > > > It is an open source library, as per the license and the code
> > > > availability. What more is expected here?
> > >
> > > So can I fork iproute2, add a bunch of new custom netlink UAPIs and expect
> > > Dave to merge it after I throw it on github?
> >
> > Don't be silly, that's not the case here at all and you know that.
> >
> > > > No distro has to pick it up, that's not a requirement for kernel code,
> > > > we have many kernel helper programs that are not in distros. Heck, udev
> > > > took a long time to get into distros, does that mean the kernel side of
> > > > that interface should never have been merged?
> > > >
> > > > I don't understand your complaint here, it's not our place to judge the
> > > > code quality of userspace libraries, otherwise we would never get any
> > > > real work done :)
> > >
> > > My main complaint is that you can't imagine merging code into large
> > > subsystems (netdev, RDMA, DRM, etc.) without being a civil open-source
> > > citizen. That means use of existing user-space libraries/tools and/or
> > > providing new ones that will be usable for everyone.
> >
> > Agreed.
> >
> > > In this case, we have some custom char device with a library that is not
> > > usable for anyone else, and this is why drivers/misc/ is the right place.
> >
> > Also agreed.
> >
> > > While we are talking about real work, it is to our benefit to push companies
> > > to make an investment into the ecosystem and not let them find an excuse
> > > for not doing it.
> >
> > So why are you complaining about a stand-alone driver that does not have
> > any shared subsystem's userspace code to control that driver?
> >
> > Yes, when integrating into other subsystems (i.e. networking and rdma),
> > they should use those common subsystems' interfaces, no one is arguing
> > that at all.
>
> Hi Greg,
> It's probably heresy, but why do I need to integrate into the RDMA subsystem?
> I understand your reasoning about networking (Ethernet), as the driver
> connects to the kernel networking stack (netdev), but with RDMA the
> driver doesn't use or connect to anything in that stack. If I were to
> support IBverbs and declare that I support it, then of course I would
> need to integrate into the RDMA subsystem and add my backend to
> rdma-core.

IBverbs are horrid and I would not wish them on anyone. Seriously.

> But we don't do that, so why am I being forced to support IBverbs?

You shouldn't.

> Forcing GAUDI to use the RDMA stack and IBverbs is like swatting flies
> with a sledgehammer.
> I do hope that in future devices we will support it natively and of
> course then we will integrate as requested, but for GAUDI it is just a
> huge overkill IMHO.

I think the general rdma apis are the key here, not the userspace api.
Note, I do not know exactly what they are, but no, IBverbs are not ok.

Ick.

greg k-h
On Sat, Sep 19, 2020 at 07:43:28PM +0300, Oded Gabbay wrote:
> It's probably heresy, but why do I need to integrate into the RDMA subsystem ?
Hi Oded
I don't know the RDMA subsystem at all, so I will give a more generic
answer. Are you reinventing things which a subsystem core already has?
The subsystem core will be well tested, since lots of devices use
it. Because of this, subsystem cores generally have a lower bug count
per line of code than driver code. Using core code means drivers are
smaller, and smaller code has less bugs by definition.
We as maintainers have to assume you are going to abandon the driver
at some point, while the hardware still exists, and leave the
community to maintain it. So a smaller driver, which makes heavy use
of the core is much easier to maintain.
By making use of core code, you also get freebies. Somebody adds new
functionality to the core, your driver automatically gets it.
Look at this from the opposite perspective. Say every driver
implemented their own TCP/IP stack? Or DMA engine? SPI infrastructure?
How big a nightmare would it be to maintain?
In your case, some parts of your hardware look a bit like RDMA? So you
ideally want to use the core code from the RDMA subsystem. Maybe you
just need some of the lower layers? Maybe you need to refactor some of
the RDMA core to make it a library you can pick and choose the bits
useful to you? What you really want to avoid is re-implementing stuff
in your driver which is already in the core.
Andrew
On Sat, Sep 19, 2020 at 07:27:30PM +0200, Greg Kroah-Hartman wrote:
> > It's probably heresy, but why do I need to integrate into the RDMA subsystem?
> > I understand your reasoning about networking (Ethernet), as the driver
> > connects to the kernel networking stack (netdev), but with RDMA the
> > driver doesn't use or connect to anything in that stack. If I were to
> > support IBverbs and declare that I support it, then of course I would
> > need to integrate into the RDMA subsystem and add my backend to
> > rdma-core.
>
> IBverbs are horrid and I would not wish them on anyone. Seriously.

I'm curious what drives this opinion? Did you have it since you reviewed
the initial submission all those years ago?

> I think the general rdma apis are the key here, not the userspace api.

Are you proposing that habana should have uAPI in drivers/misc and present
a standard rdma-core userspace for it? This is the only userspace
programming interface for RoCE HW. I think that would be much more work.

If not, what open source userspace are you going to ask them to present to
merge the kernel side into misc?

> Note, I do not know exactly what they are, but no, IBverbs are not ok.

Should we stop merging new drivers and abandon the RDMA subsystem? Is
there something you'd like to see fixed?

Don't really understand your position, sorry.

Jason
On Sat, Sep 19, 2020 at 04:22:35PM -0300, Jason Gunthorpe wrote:
> On Sat, Sep 19, 2020 at 07:27:30PM +0200, Greg Kroah-Hartman wrote:
> > > It's probably heresy, but why do I need to integrate into the RDMA subsystem?
> > > I understand your reasoning about networking (Ethernet), as the driver
> > > connects to the kernel networking stack (netdev), but with RDMA the
> > > driver doesn't use or connect to anything in that stack. If I were to
> > > support IBverbs and declare that I support it, then of course I would
> > > need to integrate into the RDMA subsystem and add my backend to
> > > rdma-core.
> >
> > IBverbs are horrid and I would not wish them on anyone. Seriously.
>
> I'm curious what drives this opinion? Did you have it since you
> reviewed the initial submission all those years ago?

As I learned more about that interface, yes, I like it less and less :)

But that's the userspace api you all are stuck with, for various reasons,
so my opinion doesn't matter here.

> > I think the general rdma apis are the key here, not the userspace api.
>
> Are you proposing that habana should have uAPI in drivers/misc and
> present a standard rdma-core userspace for it? This is the only
> userspace programming interface for RoCE HW. I think that would be
> much more work.
>
> If not, what open source userspace are you going to ask them to
> present to merge the kernel side into misc?

I don't think that they have a userspace api to their rdma feature from
what I understand, but I could be totally wrong as I do not know their
hardware at all, so I'll let them answer this question.

> > Note, I do not know exactly what they are, but no, IBverbs are not ok.
>
> Should we stop merging new drivers and abandon the RDMA subsystem? Is
> there something you'd like to see fixed?
>
> Don't really understand your position, sorry.

For anything that _has_ to have a userspace RDMA interface, sure, ibverbs
are the ones we are stuck with, but I didn't think that was the issue here
at all, which is why I wrote the above comments.

thanks,

greg k-h
On Wed, Sep 16, 2020 at 02:00:54PM +0200, Greg Kroah-Hartman wrote:
> On Wed, Sep 16, 2020 at 11:47:58AM +0300, Oded Gabbay wrote:
> > On Wed, Sep 16, 2020 at 11:21 AM Greg Kroah-Hartman
> > <gregkh@linuxfoundation.org> wrote:
> > >
> > > On Wed, Sep 16, 2020 at 11:02:39AM +0300, Oded Gabbay wrote:
> > > > On Wed, Sep 16, 2020 at 10:41 AM Greg Kroah-Hartman
> > > > <gregkh@linuxfoundation.org> wrote:
> > > > >
> > > > > On Wed, Sep 16, 2020 at 09:36:23AM +0300, Oded Gabbay wrote:
> > > > > > On Wed, Sep 16, 2020 at 9:25 AM Greg Kroah-Hartman
> > > > > > <gregkh@linuxfoundation.org> wrote:
> > > > > > >
> > > > > > > On Tue, Sep 15, 2020 at 11:49:12PM +0300, Oded Gabbay wrote:
> > > > > > > > On Tue, Sep 15, 2020 at 11:42 PM David Miller <davem@davemloft.net> wrote:
> > > > > > > > >
> > > > > > > > > From: Oded Gabbay <oded.gabbay@gmail.com>
> > > > > > > > > Date: Tue, 15 Sep 2020 20:10:08 +0300
> > > > > > > > >
> > > > > > > > > > This is the second version of the patch-set to upstream the GAUDI NIC code
> > > > > > > > > > into the habanalabs driver.
> > > > > > > > > >
> > > > > > > > > > The only modification from v2 is in the ethtool patch (patch 12). Details
> > > > > > > > > > are in that patch's commit message.
> > > > > > > > > >
> > > > > > > > > > Link to v2 cover letter:
> > > > > > > > > > https://lkml.org/lkml/2020/9/12/201
> > > > > > > > >
> > > > > > > > > I agree with Jakub, this driver definitely can't go in as it is currently
> > > > > > > > > structured and designed.
> > > > > > > >
> > > > > > > > Why is that?
> > > > > > > > Can you please point to the things that bother you or are not working correctly?
> > > > > > > > I can't really fix the driver if I don't know what's wrong.
> > > > > > > >
> > > > > > > > In addition, please read my reply to Jakub with the explanation of why
> > > > > > > > we designed this driver as is.
> > > > > > > >
> > > > > > > > > And because of the RDMA'ness of it, the RDMA
> > > > > > > > > folks have to be CC:'d and have a chance to review this.
> > > > > > > >
> > > > > > > > As I said to Jakub, the driver doesn't use the RDMA infrastructure in
> > > > > > > > the kernel and we can't connect to it due to the lack of H/W support
> > > > > > > > we have.
> > > > > > > > Therefore, I don't see why we need to CC linux-rdma.
> > > > > > > > I understood why Greg asked me to CC you because we do connect to the
> > > > > > > > netdev and standard eth infrastructure, but regarding the RDMA, it's
> > > > > > > > not really the same.
> > > > > > >
> > > > > > > Ok, to do this "right" it needs to be split up into separate drivers,
> > > > > > > hopefully using the "virtual bus" code that some day Intel will resubmit
> > > > > > > again that will solve this issue.
> > > > > >
> > > > > > Hi Greg,
> > > > > > Can I suggest an alternative for the short/medium term?
> > > > > >
> > > > > > In an earlier email, Jakub said:
> > > > > > "Is it not possible to move the files and still build them into a single
> > > > > > module?"
> > > > > >
> > > > > > I thought maybe that's a good way to progress here?
> > > > >
> > > > > Cross-directory builds of a single module are crazy. Yes, they work,
> > > > > but really, that's a mess, and I would never suggest doing that.
> > > > >
> > > > > > First, split the content into Ethernet and RDMA.
> > > > > > Then move the Ethernet part to drivers/net but build it as part of
> > > > > > habanalabs.ko.
> > > > > > Regarding the RDMA code, upstream/review it in a different patch-set
> > > > > > (maybe they will want me to put the files elsewhere).
> > > > > >
> > > > > > What do you think?
> > > > >
> > > > > I think you are asking for more work there than just splitting out into
> > > > > separate modules :)
> > > > >
> > > > > thanks,
> > > > >
> > > > > greg k-h
> > > >
> > > > Hi Greg,
> > > >
> > > > If cross-directory building is out of the question, what about
> > > > splitting into separate modules? And using cross-module notifiers/calls?
> > > > I did that with amdkfd and amdgpu/radeon a couple of years back. It
> > > > worked (that's the best thing I can say about it).
> > >
> > > That's fine with me.
> > >
> > > > The main problem with this "virtual bus" thing is that I'm not
> > > > familiar with it at all, and from my experience I imagine it would take
> > > > considerable time and effort to upstream this infrastructure work.
> > >
> > > It shouldn't be taking that long, but for some unknown reason, the
> > > original author of that code is sitting on it and not resending it. Go
> > > poke them through internal Intel channels to find out what the problem
> > > is, as I have no clue why a 200-300 line bus module is taking so long to
> > > get "right" :(
> > >
> > > I'm _ALMOST_ at the point where I would just do that work myself, but
> > > due to my current status with Intel, I'll let them do it as I have
> > > enough other things on my plate...
> > >
> > > > This could delay the NIC code for a couple of years, by which time
> > > > this won't be relevant at all.
> > >
> > > Why wouldn't this code be relevant in a year? It's going to be 2+ years
> > > before any of this shows up in an "enterprise distro" based on their
> > > release cycles anyway :)
> > >
> > > thanks,
> > >
> > > greg k-h
> >
> > Hi Greg,
> > ok, I'll take a look. Do you happen to have the name of the patch-set / author?
>
> Here's at least one copy:
> https://lore.kernel.org/linux-rdma/20200520070227.3392100-2-jeffrey.t.kirsher@intel.com/
>
> there might have been a newer one, can't remember, sorry.
Maybe I'm missing something, or maybe the in-tree code we have already
should be refactored to use more buses and drivers, but
drivers/base/component.c is made for this. We use it to glue all kinds of
things across all kinds of subsystems already.

Of course it really should only be used for one-off problems; as soon as
you have a standard interface/interaction, there should be some kind of
standard lookup way to get at your thing (and the driver behind it). E.g.
in drivers/gpu we're now building up drm_bridge and trying to get away
from component.c for these things.

Cheers, Daniel
On Sun, Sep 20, 2020 at 11:47 AM Greg Kroah-Hartman
<gregkh@linuxfoundation.org> wrote:
>
> On Sat, Sep 19, 2020 at 04:22:35PM -0300, Jason Gunthorpe wrote:
> > On Sat, Sep 19, 2020 at 07:27:30PM +0200, Greg Kroah-Hartman wrote:
> > > > It's probably heresy, but why do I need to integrate into the RDMA subsystem?
> > > > I understand your reasoning about networking (Ethernet), as the driver
> > > > connects to the kernel networking stack (netdev), but with RDMA the
> > > > driver doesn't use or connect to anything in that stack. If I were to
> > > > support IBverbs and declare that I support it, then of course I would
> > > > need to integrate into the RDMA subsystem and add my backend to
> > > > rdma-core.
> > >
> > > IBverbs are horrid and I would not wish them on anyone. Seriously.
> >
> > I'm curious what drives this opinion? Did you have it since you
> > reviewed the initial submission all those years ago?
>
> As I learned more about that interface, yes, I like it less and less :)
>
> But that's the userspace api you all are stuck with, for various
> reasons, my opinion doesn't matter here.
>
> > > I think the general rdma apis are the key here, not the userspace api.
> >
> > Are you proposing that habana should have uAPI in drivers/misc and
> > present a standard rdma-core userspace for it? This is the only
> > userspace programming interface for RoCE HW. I think that would be
> > much more work.
> >
> > If not, what open source userspace are you going to ask them to
> > present to merge the kernel side into misc?
>
> I don't think that they have a userspace api to their rdma feature from
> what I understand, but I could be totally wrong as I do not know their
> hardware at all, so I'll let them answer this question.

Hi Greg,

We do expose a new IOCTL to enable the user to configure connections
between multiple GAUDI devices.

Having said that, we restrict this IOCTL to be used only by the same user
who is doing the compute on our device, as opposed to a real RDMA device,
where any number of applications can send and receive. In addition, this
IOCTL limits the user to connect ONLY to another GAUDI device and not to a
3rd-party RDMA device.

It is true that GAUDI supports RDMA data movement, but the data movement is
NOT done by the user. It is done by our compute engines, i.e. the compute
engines perform "send" and "receive" without going to the host (aka no
support for ibv_post_send, ibv_post_recv). The only thing that is
controlled by the user is to say which GAUDI is connected to which. After
that, the command submissions the user performs to operate our compute
engines will cause them to send and receive RDMA packets.

Moreover, as opposed to smart NICs, where the networking is the main focus
and the compute is only secondary, in our device the compute is our major
focus and the networking is a slave to it.

The hl-thunk userspace library will have wrappers around this single IOCTL
(like all our driver's IOCTLs) and will also contain demos that show how to
use it.

> > > Note, I do not know exactly what they are, but no, IBverbs are not ok.
> >
> > Should we stop merging new drivers and abandon the RDMA subsystem? Is
> > there something you'd like to see fixed?
> >
> > Don't really understand your position, sorry.
>
> For anything that _has_ to have a userspace RDMA interface, sure ibverbs
> are the ones we are stuck with, but I didn't think that was the issue
> here at all, which is why I wrote the above comments.

To emphasize again, we don't want to expose a userspace RDMA interface. We
just want to allow our single compute user to configure a connection to
another GAUDI.

Thanks,
Oded

> thanks,
>
> greg k-h
On Sun, Sep 20, 2020 at 10:05:39PM +0300, Oded Gabbay wrote:
> On Sun, Sep 20, 2020 at 11:47 AM Greg Kroah-Hartman
> <gregkh@linuxfoundation.org> wrote:
> >
> > On Sat, Sep 19, 2020 at 04:22:35PM -0300, Jason Gunthorpe wrote:
> > > On Sat, Sep 19, 2020 at 07:27:30PM +0200, Greg Kroah-Hartman wrote:
> > > > > It's probably heresy, but why do I need to integrate into the RDMA subsystem?
> > > > > I understand your reasoning about networking (Ethernet), as the driver
> > > > > connects to the kernel networking stack (netdev), but with RDMA the
> > > > > driver doesn't use or connect to anything in that stack. If I were to
> > > > > support IBverbs and declare that I support it, then of course I would
> > > > > need to integrate into the RDMA subsystem and add my backend to
> > > > > rdma-core.
> > > >
> > > > IBverbs are horrid and I would not wish them on anyone. Seriously.
> > >
> > > I'm curious what drives this opinion? Did you have it since you
> > > reviewed the initial submission all those years ago?
> >
> > As I learned more about that interface, yes, I like it less and less :)
> >
> > But that's the userspace api you all are stuck with, for various
> > reasons, my opinion doesn't matter here.
> >
> > > > I think the general rdma apis are the key here, not the userspace api.
> > >
> > > Are you proposing that habana should have uAPI in drivers/misc and
> > > present a standard rdma-core userspace for it? This is the only
> > > userspace programming interface for RoCE HW. I think that would be
> > > much more work.
> > >
> > > If not, what open source userspace are you going to ask them to
> > > present to merge the kernel side into misc?
> >
> > I don't think that they have a userspace api to their rdma feature from
> > what I understand, but I could be totally wrong as I do not know their
> > hardware at all, so I'll let them answer this question.
>
> Hi Greg,
> We do expose a new IOCTL to enable the user to configure connections
> between multiple GAUDI devices.

How is it different from RDMA QP configuration?

> Having said that, we restrict this IOCTL to be used only by the same
> user who is doing the compute on our device, as opposed to a real RDMA
> device where any number of applications can send and receive.

The ability to support multiple applications is not an RDMA requirement,
but an implementation choice. For example, MPI jobs are a single user of an
RDMA device.

> In addition, this IOCTL limits the user to connect ONLY to another
> GAUDI device and not to a 3rd party RDMA device.

I don't see how it is different from EFA with their SRD QP type, or mlx5
devices with DC QPs, which you can connect only to similar devices (no
interoperability).

Thanks
On 18/09/2020 18:28, Jason Gunthorpe wrote:
> On Fri, Sep 18, 2020 at 06:15:52PM +0300, Oded Gabbay wrote:
>
>> I'm sorry, but you won't be able to convince me here that I need to
>> "enslave" my entire code to RDMA, just because my ASIC "also" has some
>> RDMA ports.
>
> You can't recreate common shared subsystems in a driver just because
> you don't want to work with the subsystem.
>
> I don't care what else the ASIC has. In Linux the netdev part is
> exposed through netdev, the RDMA part through RDMA, the
> totally-not-a-GPU part through drivers/misc.
>
> It has always been this way. Chelsio didn't get to rebuild the SCSI
> stack in their driver just because "storage is a small part of their
> device".
>
> Drivers are not allowed to re-implement I2C/SPI/etc. without re-using
> the common code for that just because "I2C is a small part of their
> device".
>
> Exposing to userspace the creation of RoCE QPs and their related
> objects is unambiguously an RDMA subsystem task. I don't even know how
> you think you can argue it is not. It is your company proudly claiming
> the device has 100G RoCE ports in all the marketing literature, after
> all.
>
> It is too bad the device has a non-standards-compliant implementation
> of RoCE, so this will be a bit hard for you. Oh well.

What is considered a RoCE port in this case if it's not compliant with
RoCE? Sounds like it's an implementation of RDMA over Ethernet, not RoCE.

Does GAUDI support UD/RC/.. QPs? Is it using a proprietary wire protocol?
(BTW, Oded claims it's similar to nvlink; how is nvlink's implementation
exposed? Or is it closed source?)

Jason, how do you imagine GAUDI in the RDMA subsystem? Userspace control
path verbs (used by hl-thunk?) and all data path verbs exposed as kverbs
(used by the habanalabs driver)? So neither userspace verbs apps nor
kernel ULPs could use it?
On Mon, Sep 21, 2020 at 02:22:02PM +0300, Gal Pressman wrote:
> On 18/09/2020 18:28, Jason Gunthorpe wrote:
> > On Fri, Sep 18, 2020 at 06:15:52PM +0300, Oded Gabbay wrote:
> >
> >> I'm sorry, but you won't be able to convince me here that I need to
> >> "enslave" my entire code to RDMA, just because my ASIC "also" has some
> >> RDMA ports.
> >
> > You can't recreate common shared subsystems in a driver just because
> > you don't want to work with the subsystem.
> >
> > I don't care what else the ASIC has. In Linux the netdev part is
> > exposed through netdev, the RDMA part through RDMA, the
> > totally-not-a-GPU part through drivers/misc.
> >
> > It has always been this way. Chelsio didn't get to rebuild the SCSI
> > stack in their driver just because "storage is a small part of their
> > device".
> >
> > Drivers are not allowed to re-implement I2C/SPI/etc. without re-using
> > the common code for that just because "I2C is a small part of their
> > device".
> >
> > Exposing to userspace the creation of RoCE QPs and their related
> > objects is unambiguously an RDMA subsystem task. I don't even know how
> > you think you can argue it is not. It is your company proudly claiming
> > the device has 100G RoCE ports in all the marketing literature, after
> > all.
> >
> > It is too bad the device has a non-standards-compliant implementation
> > of RoCE, so this will be a bit hard for you. Oh well.
>
> What is considered a RoCE port in this case if it's not compliant with RoCE?

They claim that it is RoCE v2.
https://www.hotchips.org/hc31/HC31_1.14_HabanaLabs.Eitan_Medina.v9.pdf

Thanks
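[Editorial aside, not part of the original thread: for readers unfamiliar
with the term, RoCE v2 carries the InfiniBand transport headers inside a
UDP payload, using the IANA-assigned destination port 4791. A minimal
sketch of the 12-byte Base Transport Header (BTH) that follows the UDP
header in every RoCE v2 packet; the field values below are illustrative,
not taken from any GAUDI material:]

```python
import struct

ROCE_V2_UDP_DPORT = 4791  # IANA-assigned well-known UDP port for RoCE v2

def pack_bth(opcode, pkey, dest_qp, psn, solicited=False, ack_req=False):
    """Pack an InfiniBand Base Transport Header (BTH), 12 bytes.

    Layout: OpCode(8) | SE(1) M(1) PadCnt(2) TVer(4) | P_Key(16) |
            rsvd(8) DestQP(24) | AckReq(1) rsvd(7) PSN(24)
    """
    byte1 = int(solicited) << 7            # SE bit; MigReq/PadCnt/TVer left 0
    dqp_word = dest_qp & 0xFFFFFF          # 24-bit destination QP number
    psn_word = (int(ack_req) << 31) | (psn & 0xFFFFFF)
    return struct.pack(">BBHII", opcode, byte1, pkey, dqp_word, psn_word)

# Illustrative values: 0x04 is the RC SEND-only opcode, 0xFFFF the default P_Key.
bth = pack_bth(opcode=0x04, pkey=0xFFFF, dest_qp=0x12345, psn=7)
assert len(bth) == 12
```

On the wire this header sits after the Ethernet/IP/UDP headers and is
followed by the payload and a 4-byte ICRC.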
On Sun, Sep 20, 2020 at 10:47:02AM +0200, Greg Kroah-Hartman wrote:
> > If not, what open source userspace are you going to ask them to
> > present to merge the kernel side into misc?
>
> I don't think that they have a userspace api to their rdma feature from
> what I understand, but I could be totally wrong as I do not know their
> hardware at all, so I'll let them answer this question.

I thought Oded was pretty clear: the goal of this series is to expose
their RDMA HW to userspace. This problem space requires co-mingling
networking and compute at extremely high speed/low overhead. This is all
done in userspace.

We are specifically talking about this in include/uapi/misc/habanalabs.h:

/*
 * NIC
 *
 * This IOCTL allows the user to manage and configure the device's NIC ports.
 * The following operations are available:
 * - Create a completion queue
 * - Destroy a completion queue
 * - Wait on completion queue
 * - Poll a completion queue
 * - Update consumed completion queue entries
 * - Set a work queue
 * - Unset a work queue
 *
 * For all operations, the user should provide a pointer to an input structure
 * with the context parameters. Some of the operations also require a pointer to
 * driver regarding how many of the available CQEs were actually
 * processed/consumed. Only then the driver will override them with newer
 * entries.
 * The set WQ operation should provide the device virtual address of the WQ with
 * a matching size for the number of WQs and entries per WQ.
 *
 */
#define HL_IOCTL_NIC _IOWR('H', 0x07, struct hl_nic_args)

Which is ibv_create_qp, ibv_create_cq, ibv_poll_cq, etc, etc.

Habana has repeatedly described their HW as having multiple 100G RoCE
ports. RoCE is one of the common industry standards that ibverbs
unambiguously is responsible for. I would be much less annoyed if they
were not actively marketing their product as RoCE RDMA.

Sure, there is some argument that their RoCE isn't spec compliant, but I
don't think that excuses the basic principle of our subsystem: RDMA HW
needs to demonstrate some basic functionality using the standard open
source userspace software stack.

I don't like this idea of backdooring a bunch of proprietary closed-source
RDMA userspace through drivers/misc, and if you don't have a clear idea
how to get something equivalent for drivers/misc, you should not accept
HL_IOCTL_NIC.

Plus RoCE is complicated; there is a bunch of interaction with netdev, and
the rules related to that really need to be respected.

> For anything that _has_ to have a userspace RDMA interface, sure ibverbs
> are the ones we are stuck with, but I didn't think that was the issue
> here at all, which is why I wrote the above comments.

I think you should look at patches #8 through #11:
https://lore.kernel.org/lkml/20200915171022.10561-9-oded.gabbay@gmail.com/

Jason
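[Editorial aside, not part of the thread: the op-for-op overlap Jason
points at can be made explicit. The table below maps the HL_IOCTL_NIC
operations listed in the quoted uapi comment to standard libibverbs entry
points. The mapping is my own approximation, not something stated by either
side in the thread; the `ibv_*` names on the right are real verbs APIs.]

```python
# Approximate correspondence between the custom HL_IOCTL_NIC operations
# (from the uapi comment quoted above) and standard libibverbs calls.
# This mapping is editorial guesswork, not taken from the patches.
HL_NIC_TO_VERBS = {
    "create completion queue":  "ibv_create_cq",
    "destroy completion queue": "ibv_destroy_cq",
    "wait on completion queue": "ibv_get_cq_event",   # blocking CQ wait
    "poll completion queue":    "ibv_poll_cq",
    "update consumed CQEs":     "ibv_ack_cq_events",  # closest analogue
    "set work queue":           "ibv_create_qp",      # or ibv_create_wq
    "unset work queue":         "ibv_destroy_qp",     # or ibv_destroy_wq
}

# Every custom operation has a standard-verbs analogue:
assert all(v.startswith("ibv_") for v in HL_NIC_TO_VERBS.values())
```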
On Mon, 21 Sep 2020 08:52:39 -0300 Jason Gunthorpe wrote:
> I don't like this idea of backdooring a bunch of proprietary closed
> source RDMA userspace through drivers/misc, and if you don't have a
> clear idea how to get something equal for drivers/misc you should not
> accept the HL_IOCTL_NIC.
>
> Plus RoCE is complicated, there is a bunch of interaction with netdev
> and rules related to that that really needs to be respected.

+1

To me this code quite clearly fits the description of a vendor SDK which
runs proprietary stuff on top. It's such a vendor-SDK thing to do to pick
the parts of our infrastructure it likes and "simplify the rest" with its
own re-implementation.

I'd wager the only reason you expose the netdevs at all is for link
settings, stats, packet capture and debug. You'd never run TCP traffic
over those links. And you're fighting against using Linux APIs for the
only real traffic that runs on those links - RDMA(ish) traffic.

Greg - I'm probably the least experienced of the folks involved in this
conversation - could you ELI5 what's the benefit to the community from
merging this code?
On Mon, Sep 21, 2020 at 02:22:02PM +0300, Gal Pressman wrote:
> What is considered a RoCE port in this case if it's not compliant with RoCE?
> Sounds like it's an implementation of RDMA over ethernet, not RoCE.
> Does GAUDI support UD/RC/.. QPs? Is it using a proprietary wire protocol?
> (BTW, Oded claims it's similar to nvlink, how is nvlink's implementation
> exposed? Or is it closed source?)

I think Oded was drawing a parallel to how nvlink is integral with the
compute element. From Oded's descriptions I don't think it is much like
nvlink at all.

> Jason, how do you imagine GAUDI in the RDMA subsystem? Userspace control path
> verbs (used by hl-thunk?) and all data path verbs exposed as kverbs (used by
> habanalabs driver)?
> So neither any userspace verbs apps could use it nor kernel ULPs?

Based on what Oded described, it seems like a reasonable RDMA device with
some limitations around MR IOVA.

Looks like the desire is to create an RDMA WR and CQ ring in userspace,
and then co-mingle that with the compute side of the device.

So instead of doing the special IOCTL and mmap against the compute FD, it
would create an RDMA QP and RDMA CQ, use dv to access the raw internals,
and the proprietary stack would have exactly the same stuff it would have
had with the misc ioctl.

But, completely separately, they'd also have to implement enough of verbs
to serve as the open source userspace showing how this HW works. What that
is depends largely on what their HW can do, and whether they want to
connect to UCX/MPI/libfabric/etc.

A bunch of ioctl stubs or a few tests is far below our standard in RDMA.

There may have been some argument that the compute side of this device has
no industry standards so it should be in drivers/misc, but HPC networking
*does* have extensive standards and extensive open source software stacks.
It is very hard for me to see how a device in this market could be
competitive without integrating with that stuff.

Jason
On Mon, Sep 21, 2020 at 02:20:53PM -0700, Jakub Kicinski wrote:
> I'd wager the only reason you expose the netdevs at all is for link
> settings, stats, packet capture and debug. You'd never run TCP traffic
> over those links. And you're fighting against using Linux APIs for the
> only real traffic that runs on those links - RDMA(ish) traffic.

The usual working flow is to use something like TCP to exchange connection
information, then pivot to RDMA for the actual data flow. This is why a
driver like this can get away with such a low-performance implementation
for a 100G NIC; it is just application boot metadata being exchanged.

Sniffing probably won't work, as typically the HW will capture the RoCE
traffic before it reaches Linux - and the Linux driver couldn't handle a
100G flow anyhow. Stats might not work either.

As far as the "usual rules" go, we do require that accelerator devices
sharing a netdev are secure in the netdev userspace security sense. They
can access the assigned RoCE v2 UDP port but cannot do things like forge
src IP/MAC addresses, violate VLANs, reach outside net namespaces, capture
arbitrary traffic, etc. This stuff is tricky and generally requires HW
support.

Someone has to audit all of this and ensure it meets the netdev security
requirements too, otherwise it will need CAP_NET_RAW to function.
Obviously this requires seeing enough of a userspace implementation to
understand how the design approaches verbs 'Address Handles' and so forth.

RDMA HW has had errors before, and when discovered it was blocked with
CAP_NET_RAW until new chip revs came out; this is something I take very
seriously.

Jason
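[Editorial illustration, not from the thread: the "exchange connection
information over TCP, then pivot to RDMA" flow Jason describes is the
classic out-of-band handshake. The sketch below shows only the metadata
exchange step; the QPN/PSN/GID values are made up, and a socketpair stands
in for the real TCP connection between two hosts.]

```python
import socket
import struct

# Each side advertises the identifiers its peer needs in order to address
# its queue pair: QP number, starting packet sequence number, and a
# 16-byte GID. Wire format: big-endian u32, u32, 16 raw bytes (24 bytes).
def send_qp_info(sock, qpn, psn, gid):
    sock.sendall(struct.pack(">II16s", qpn, psn, gid))

def recv_qp_info(sock):
    qpn, psn, gid = struct.unpack(">II16s", sock.recv(24))
    return {"qpn": qpn, "psn": psn, "gid": gid}

# socketpair stands in for the TCP connection between the two hosts.
a, b = socket.socketpair()
send_qp_info(a, 0x11, 1000, b"A" * 16)   # each side advertises its own QP...
send_qp_info(b, 0x22, 2000, b"B" * 16)
peer_of_a = recv_qp_info(a)              # ...and learns its peer's
peer_of_b = recv_qp_info(b)

# After this point a real application would program the peer's values into
# its own QP (e.g. transition it to RTR/RTS) and stop using the TCP side
# for anything but teardown - all bulk data then flows over RDMA.
```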
On 22/09/2020 14:41, Jason Gunthorpe wrote:
> On Mon, Sep 21, 2020 at 02:22:02PM +0300, Gal Pressman wrote:
>
>> What is considered a RoCE port in this case if it's not compliant with RoCE?
>> Sounds like it's an implementation of RDMA over ethernet, not RoCE.
>> Does GAUDI support UD/RC/.. QPs? Is it using a proprietary wire protocol?
>> (BTW, Oded claims it's similar to nvlink, how is nvlink's implementation
>> exposed? Or is it closed source?)
>
> I think Oded was drawing a parallel to how nvlink is integral with the
> compute element. From Oded's descriptions I don't think it is much
> like nvlink at all.
>
>> Jason, how do you imagine GAUDI in the RDMA subsystem? Userspace control path
>> verbs (used by hl-thunk?) and all data path verbs exposed as kverbs (used by
>> habanalabs driver)?
>> So neither any userspace verbs apps could use it nor kernel ULPs?
>
> Based on what Oded described it seems like a reasonable RDMA device
> with some limitations around MR IOVA.
>
> Looks like the desire is to create an RDMA WR and CQ ring in userspace,
> and then co-mingle that with the compute side of the device.
>
> So instead of doing the special IOCTL and mmap against the compute FD
> it would create an RDMA QP and RDMA CQ, use dv to access the raw
> internals, and the proprietary stack would have exactly the same stuff
> it would have had with the misc ioctl.
>
> But, completely separately, they'd also have to implement some of the
> verbs which serve as the open source userspace showing how this HW
> works. What that is depends largely on what their HW can do, and
> whether they want to connect to UCX/MPI/libfabric/etc.
>
> A bunch of ioctl stubs or a few tests is far below our standard in
> RDMA.
>
> There may have been some argument that the compute side of this device
> has no industry standards so should be a drivers/misc, but HPC
> networking *does* have extensive standards and extensive open source
> software stacks. It is very hard for me to see how a device in this
> market could be competitive without integrating with that stuff.

I agree, that makes sense.
But assuming Oded actually goes and implements all the needed verbs to
get a basic functional libibverbs provider (assuming their HW can do it
somehow), is it really useful if no one is going to use it? It doesn't
sound like habanalabs want people to use GAUDI as an RDMA adapter, and
I'm assuming the only real-world use case is going to be using the hl
stack, which means we're left with a lot of dead code that's not
used/tested by anyone.

Genuine question: wouldn't it be better if they only implement what's
actually going to be used and tested by their customers?
On Tue, Sep 22, 2020 at 03:46:29PM +0300, Gal Pressman wrote:

> I agree, that makes sense.
> But assuming Oded actually goes and implements all the needed verbs to get a
> basic functional libibverbs provider (assuming their HW can do it somehow), is
> it really useful if no one is going to use it?
> It doesn't sound like habanalabs want people to use GAUDI as an RDMA adapter,
> and I'm assuming the only real world use case is going to be using the hl stack,
> which means we're left with a lot of dead code that's not used/tested by anyone.
>
> Genuine question, wouldn't it be better if they only implement what's actually
> going to be used and tested by their customers?

The general standard for this 'accel' hardware, both in DRM and RDMA, is
to present an open source userspace. Companies are encouraged to use that
as their main interface, but I suppose are free to carry the cost of dual
APIs, and the community's wrath, if they want.

At least for RDMA this is guided by the founding event of Linux RDMA,
where all customers demanded that the madness of every supplier having a
unique software stack from the kernel down stop. Since then the low-level
stack has been cross-vendor and uniform.

Jason
On 22/09/2020 19:14, Jason Gunthorpe wrote:
> On Tue, Sep 22, 2020 at 03:46:29PM +0300, Gal Pressman wrote:
>
>> I agree, that makes sense.
>> But assuming Oded actually goes and implements all the needed verbs to get a
>> basic functional libibverbs provider (assuming their HW can do it somehow), is
>> it really useful if no one is going to use it?
>> It doesn't sound like habanalabs want people to use GAUDI as an RDMA adapter,
>> and I'm assuming the only real world use case is going to be using the hl stack,
>> which means we're left with a lot of dead code that's not used/tested by anyone.
>>
>> Genuine question, wouldn't it be better if they only implement what's actually
>> going to be used and tested by their customers?
>
> The general standard for this 'accel' hardware, both in DRM and RDMA
> is to present an open source userspace. Companies are encouraged to
> use that as their main interface but I suppose are free to carry the
> cost of dual APIs, and the community's wrath if they want.

I didn't mean they should maintain two interfaces.
The question is whether they should implement libibverbs support that
covers the cases used by their stack, or should implement all "mandatory"
verbs so they could run libibverbs' examples/perftest/pyverbs as well,
even though these will likely be the only apps covering those verbs.
On Tue, Sep 22, 2020 at 07:30:32PM +0300, Gal Pressman wrote:

> On 22/09/2020 19:14, Jason Gunthorpe wrote:
>> On Tue, Sep 22, 2020 at 03:46:29PM +0300, Gal Pressman wrote:
>>
>>> I agree, that makes sense.
>>> But assuming Oded actually goes and implements all the needed verbs to get a
>>> basic functional libibverbs provider (assuming their HW can do it somehow), is
>>> it really useful if no one is going to use it?
>>> It doesn't sound like habanalabs want people to use GAUDI as an RDMA adapter,
>>> and I'm assuming the only real world use case is going to be using the hl stack,
>>> which means we're left with a lot of dead code that's not used/tested by anyone.
>>>
>>> Genuine question, wouldn't it be better if they only implement what's actually
>>> going to be used and tested by their customers?
>>
>> The general standard for this 'accel' hardware, both in DRM and RDMA
>> is to present an open source userspace. Companies are encouraged to
>> use that as their main interface but I suppose are free to carry the
>> cost of dual APIs, and the community's wrath if they want.
>
> I didn't mean they should maintain two interfaces.
> The question is whether they should implement libibverbs support that covers the
> cases used by their stack, or should they implement all "mandatory" verbs so
> they could be able to run libibverbs' examples/perftest/pyverbs as well, even
> though these will likely be the only apps covering these verbs.

As I said, the minimum standard is an open source userspace that will
operate the NIC. For EFA we decided that was ibv_ud_pingpong, and now
parts of pyverbs. A similar decision would be needed here too. It is a
conversation that should start with a proposal from Oded.

The *point* is to have the open userspace, so I really don't care what
their proprietary universe does, and shrinking the open source side
because it is "redundant" is completely backwards to what we want to see.

Jason