Message ID: 20200915171022.10561-1-oded.gabbay@gmail.com
Series: Adding GAUDI NIC code to habanalabs driver
On Tue, 15 Sep 2020 20:10:08 +0300 Oded Gabbay wrote:
> Hello,
>
> This is the second version of the patch-set to upstream the GAUDI NIC code
> into the habanalabs driver.
>
> The only modification from v2 is in the ethtool patch (patch 12). Details
> are in that patch's commit message.

You keep reposting this, yet this SDK shim^W^W driver is still living in drivers/misc. If you want to make it look like a NIC, the code belongs where NIC drivers are.

Then again, is it a NIC? Why do you have those custom IOCTLs? That's far from normal.

Please make sure to CC linux-rdma. You clearly stated that the device does RDMA-like transfers.
From: Oded Gabbay <oded.gabbay@gmail.com>
Date: Tue, 15 Sep 2020 20:10:08 +0300

> This is the second version of the patch-set to upstream the GAUDI NIC code
> into the habanalabs driver.
>
> The only modification from v2 is in the ethtool patch (patch 12). Details
> are in that patch's commit message.
>
> Link to v2 cover letter:
> https://lkml.org/lkml/2020/9/12/201

I agree with Jakub, this driver definitely can't go in as it is currently structured and designed. And because of the RDMA'ness of it, the RDMA folks have to be CC:'d and have a chance to review this.
On Tue, Sep 15, 2020 at 11:35 PM Jakub Kicinski <kuba@kernel.org> wrote:
> You keep reposting this, yet this SDK shim^W^W driver is still living in
> drivers/misc. If you want to make it look like a NIC, the code belongs
> where NIC drivers are.
>
> Then again, is it a NIC? Why do you have those custom IOCTLs? That's far
> from normal.

Hi Jakub,

I'm sorry, but from your question it seems as if you didn't read my cover letter at all, as I went to great lengths in explaining exactly what our device is and why we use custom IOCTLs.

TL;DR: We have an accelerator for deep learning (GAUDI) which uses RDMA as infrastructure for communication between multiple accelerators. Just as Nvidia uses NVLink, we use the RDMA engine we have inside our ASIC. The RDMA implementation we did does NOT support some basic RDMA IB verbs (such as MR and PD), and therefore we can't use the rdma-core library or connect to the RDMA infrastructure in the kernel. We wanted to do it, but when we analyzed it, we saw we wouldn't be able to support basic operations, and therefore we had to resort to our own IOCTLs.

To sum it up: because our NIC is used for intra-communication, we don't expose it, nor intend users to use it, as a NIC per se. However, to be able to get statistics and manage the ports in a standard way, and to support a control plane over Ethernet, we do register each port with the net subsystem (i.e. create a netdev per port).

I hope this short summary explains this better.

As for your request that this code live in the net subsystem, I think that would only make it more complicated to upstream and maintain. I see there are other examples (e.g. sgi-xp) that contain networking driver code in misc, so I don't understand this objection.

> Please make sure to CC linux-rdma. You clearly stated that the device
> does RDMA-like transfers.

We don't use the RDMA infrastructure in the kernel, and we can't connect to it due to the lack of H/W support we have, so I don't see why we need to CC linux-rdma.

Oded
On Tue, Sep 15, 2020 at 11:42 PM David Miller <davem@davemloft.net> wrote:
> I agree with Jakub, this driver definitely can't go in as it is currently
> structured and designed.

Why is that? Can you please point to the things that bother you or don't work correctly? I can't really fix the driver if I don't know what's wrong. In addition, please read my reply to Jakub with the explanation of why we designed this driver as is.

> And because of the RDMA'ness of it, the RDMA
> folks have to be CC:'d and have a chance to review this.

As I said to Jakub, the driver doesn't use the RDMA infrastructure in the kernel, and we can't connect to it due to the lack of H/W support we have. Therefore, I don't see why we need to CC linux-rdma. I understood why Greg asked me to CC you, because we do connect to the netdev and standard Ethernet infrastructure, but regarding the RDMA, it's not really the same.

Thanks,
Oded
On Tue, 15 Sep 2020 23:46:58 +0300 Oded Gabbay wrote:
> TL;DR: We have an accelerator for deep learning (GAUDI) which uses RDMA
> as infrastructure for communication between multiple accelerators.
> [...]
> To sum it up: because our NIC is used for intra-communication, we don't
> expose it, nor intend users to use it, as a NIC per se. However, to be
> able to get statistics and manage the ports in a standard way, and to
> support a control plane over Ethernet, we do register each port with
> the net subsystem (i.e. create a netdev per port).
>
> I hope this short summary explains this better.

I read your cover letter. Networking drivers don't get to define random IOCTLs as they please. You have to take that part out of the "NIC" driver.

> As for your request that this code live in the net subsystem, I think
> that would only make it more complicated to upstream and maintain.
> I see there are other examples (e.g. sgi-xp) that contain networking
> driver code in misc, so I don't understand this objection.

The maintenance structure and CI systems for the kernel depend on the directory layout. If you don't understand that, I don't know how to help you.

> > Please make sure to CC linux-rdma. You clearly stated that the device
> > does RDMA-like transfers.
>
> We don't use the RDMA infrastructure in the kernel, and we can't
> connect to it due to the lack of H/W support we have, so I don't see
> why we need to CC linux-rdma.

You have it backward. You don't get to pick and choose which parts of the infrastructure you use, and therefore who reviews your drivers. The device uses RDMA under the hood, so Linux RDMA experts must very much be okay with it getting merged. That's how we ensure Linux interfaces are consistent and of good quality.
On Wed, Sep 16, 2020 at 12:04 AM Jakub Kicinski <kuba@kernel.org> wrote:
> I read your cover letter. Networking drivers don't get to define random
> IOCTLs as they please. You have to take that part out of the "NIC"
> driver.

The IOCTLs are not for the Ethernet part. They are strictly for the RDMA operations. RDMA drivers also have IOCTLs as interfaces in the drivers/infiniband area, so I don't think I'm doing something different here. And my driver is not a networking driver; it is an accelerator which has some network ports. Btw, this is only a single new IOCTL call; the rest of the IOCTLs are already upstreamed and cover the rest of the ASIC's compute functionality. What I'm trying to say is that it's very common to define IOCTLs for accelerators.

> The maintenance structure and CI systems for the kernel depend on the
> directory layout. If you don't understand that, I don't know how to
> help you.

I completely understand, but you didn't answer my question. How come there are drivers which create netdev objects, and specifically sgi-xp in misc (but I also saw it in usb drivers), that live outside drivers/net? Why doesn't your request apply to them as well? When we wrote the code, we saw those examples and therefore assumed it was fine.

> You have it backward. You don't get to pick and choose which parts of
> the infrastructure you use, and therefore who reviews your drivers.
> The device uses RDMA under the hood, so Linux RDMA experts must very
> much be okay with it getting merged. That's how we ensure Linux
> interfaces are consistent and of good quality.

I understand your point of view, but if my H/W doesn't support the basic requirements of the RDMA infrastructure and interfaces, then there is really nothing I can do about it. I can't use them. I wish I were able to use that infrastructure, but I can't. That's why we wrote the IOCTLs in our accelerator driver.

Thanks,
Oded
> I completely understand but you didn't answer my question. How come
> there are drivers which create netdev objects, and specifically sgi-xp
> in misc (but I also saw it in usb drivers), that live outside
> drivers/net? Why doesn't your request apply to them as well?
> When we wrote the code, we saw those examples and therefore assumed it
> was fine.

commit 45d9ca492e4bd1522d1b5bd125c2908f1cee3d4a
Author: Dean Nelson <dcn@sgi.com>
Date: Tue Apr 22 14:46:56 2008 -0500

    [IA64] move XP and XPC to drivers/misc/sgi-xp

    Move XPC and XPNET from arch/ia64/sn/kernel to drivers/misc/sgi-xp.

    Signed-off-by: Dean Nelson <dcn@sgi.com>
    Signed-off-by: Tony Luck <tony.luck@intel.com>

It has been there a long time, and no networking person was involved in its move.

drivers/usb/gadget/function/f_ncm.c:

commit 00a2430ff07d4e0e0e7e24e02fd8adede333b797
Author: Andrzej Pietrasiewicz <andrzej.p@samsung.com>
Date: Tue Jul 15 13:09:46 2014 +0200

    usb: gadget: Gadget directory cleanup - group usb functions

    The drivers/usb/gadget directory contains many files.
    Files which are related can be distributed into separate directories.
    This patch moves the USB functions implementations into a separate directory.

    Signed-off-by: Andrzej Pietrasiewicz <andrzej.p@samsung.com>
    Signed-off-by: Felipe Balbi <balbi@ti.com>

Again, old. Can you find an example of a network driver added in the last couple of years outside of drivers/net?

> I understand your point of view, but if my H/W doesn't support the
> basic requirements of the RDMA infrastructure and interfaces, then
> there is really nothing I can do about it. I can't use them.

It is up to the RDMA people to say that. They might see how the RDMA core can be made to work for your hardware.

Andrew
On Wed, Sep 16, 2020 at 12:37 AM Andrew Lunn <andrew@lunn.ch> wrote:
> Again, old. Can you find an example of a network driver added in the
> last couple of years outside of drivers/net?

I honestly don't know, and I admit we didn't look at the dates of when these drivers were introduced.

Oded

> It is up to the RDMA people to say that. They might see how the RDMA
> core can be made to work for your hardware.
>
> Andrew
From: Oded Gabbay <oded.gabbay@gmail.com>
Date: Wed, 16 Sep 2020 00:20:12 +0300

> I completely understand but you didn't answer my question. How come
> there are drivers which create netdev objects, and specifically sgi-xp
> in misc (but I also saw it in usb drivers) that live outside
> drivers/net? Why doesn't your request apply to them as well?

Don't use examples of drivers doing the wrong thing as an excuse for you to repeat the mistake. Ok?

That kind of argument doesn't work here.
From: Oded Gabbay <oded.gabbay@gmail.com>
Date: Wed, 16 Sep 2020 00:43:00 +0300

> I honestly don't know and I admit we didn't look at the dates of when
> these drivers were introduced.

Please do research when you make claims in the future, thank you.
From: Andrew Lunn <andrew@lunn.ch>
Date: Tue, 15 Sep 2020 23:37:35 +0200

>> I understand your point of view but if my H/W doesn't support the
>> basic requirements of the RDMA infrastructure and interfaces, then
>> really there is nothing I can do about it. I can't use them.
>
> It is up to the RDMA people to say that. They might see how the RDMA
> core can be made to work for your hardware.

+1
On Wed, Sep 16, 2020 at 1:34 AM David Miller <davem@davemloft.net> wrote:
> Don't use examples of drivers doing the wrong thing as an excuse for
> you to repeat the mistake.
>
> Ok?

Well, it's not like there is a big red warning near those drivers saying "this is wrong"... How could I have known that in advance?

> That kind of argument doesn't work here.

I know that; I just didn't know those drivers did "the wrong thing".

Oded
On Tue, Sep 15, 2020 at 11:49:12PM +0300, Oded Gabbay wrote:
> Why is that? Can you please point to the things that bother you or
> don't work correctly? I can't really fix the driver if I don't know
> what's wrong.
> [...]
> As I said to Jakub, the driver doesn't use the RDMA infrastructure in
> the kernel, and we can't connect to it due to the lack of H/W support
> we have. Therefore, I don't see why we need to CC linux-rdma.
> I understood why Greg asked me to CC you, because we do connect to the
> netdev and standard Ethernet infrastructure, but regarding the RDMA,
> it's not really the same.

Ok, to do this "right" it needs to be split up into separate drivers, hopefully using the "virtual bus" code that some day Intel will resubmit again, which will solve this issue.

That will allow you to put the network driver portion in drivers/net/ and split the code up into the proper different pieces more easily.

I recommend grabbing the virtual bus code from the archives and looking at that for how this can be done. Now that you are part of Intel, I'm sure that the internal-Intel-Linux-kernel-review-process can kick in and those developers can help you out. If not, let me know, so I can go kick them :)

As for the RDMA stuff, yeah, you should look at the current RDMA interfaces and verify that those really do not work for you here, and then document why that is in your patch submission.

thanks,

greg k-h
On Wed, Sep 16, 2020 at 9:25 AM Greg Kroah-Hartman <gregkh@linuxfoundation.org> wrote:
> Ok, to do this "right" it needs to be split up into separate drivers,
> hopefully using the "virtual bus" code that some day Intel will
> resubmit again, which will solve this issue.

Hi Greg,

Can I suggest an alternative for the short/medium term?

In an earlier email, Jakub said: "Is it not possible to move the files and still build them into a single module?"

I thought maybe that's a good way to progress here. First, split the content into the Ethernet and RDMA parts. Then move the Ethernet part to drivers/net but build it as part of habanalabs.ko. As for the RDMA code, upstream/review it in a different patch-set (maybe they will want me to put the files elsewhere).

What do you think?

> That will allow you to put the network driver portion in drivers/net/
> and split the code up into the proper different pieces more easily.
>
> I recommend grabbing the virtual bus code from the archives and looking
> at that for how this can be done. [...]
>
> As for the RDMA stuff, yeah, you should look at the current RDMA
> interfaces and verify that those really do not work for you here, and
> then document why that is in your patch submission.

Ok, will do that.

Thanks,
Oded
On Wed, Sep 16, 2020 at 09:36:23AM +0300, Oded Gabbay wrote:
> In an earlier email, Jakub said: "Is it not possible to move the files
> and still build them into a single module?"
>
> I thought maybe that's a good way to progress here?

Cross-directory builds of a single module are crazy. Yes, they work, but really, that's a mess, and I would never suggest doing that.

> First, split the content into the Ethernet and RDMA parts.
> Then move the Ethernet part to drivers/net but build it as part of
> habanalabs.ko. As for the RDMA code, upstream/review it in a different
> patch-set (maybe they will want me to put the files elsewhere).
>
> What do you think?

I think you are asking for more work there than just splitting it out into separate modules :)

thanks,

greg k-h
On Wed, Sep 16, 2020 at 10:41 AM Greg Kroah-Hartman <gregkh@linuxfoundation.org> wrote:
> Cross-directory builds of a single module are crazy. Yes, they work,
> but really, that's a mess, and I would never suggest doing that.
> [...]
> I think you are asking for more work there than just splitting it out
> into separate modules :)

Hi Greg,

If cross-directory building is out of the question, what about splitting into separate modules and using cross-module notifiers/calls? I did that with amdkfd and amdgpu/radeon a couple of years back. It worked (that's the best thing I can say about it).

The main problem with this "virtual bus" thing is that I'm not familiar with it at all, and from my experience I imagine it would take considerable time and effort to upstream this infrastructure work. This could delay the NIC code for a couple of years, by which time it won't be relevant at all. So I'm trying to find some middle ground here on how to proceed.

Thanks,
Oded
On Wed, Sep 16, 2020 at 11:02:39AM +0300, Oded Gabbay wrote: > On Wed, Sep 16, 2020 at 10:41 AM Greg Kroah-Hartman > <gregkh@linuxfoundation.org> wrote: > > > > On Wed, Sep 16, 2020 at 09:36:23AM +0300, Oded Gabbay wrote: > > > On Wed, Sep 16, 2020 at 9:25 AM Greg Kroah-Hartman > > > <gregkh@linuxfoundation.org> wrote: > > > > > > > > On Tue, Sep 15, 2020 at 11:49:12PM +0300, Oded Gabbay wrote: > > > > > On Tue, Sep 15, 2020 at 11:42 PM David Miller <davem@davemloft.net> wrote: > > > > > > > > > > > > From: Oded Gabbay <oded.gabbay@gmail.com> > > > > > > Date: Tue, 15 Sep 2020 20:10:08 +0300 > > > > > > > > > > > > > This is the second version of the patch-set to upstream the GAUDI NIC code > > > > > > > into the habanalabs driver. > > > > > > > > > > > > > > The only modification from v2 is in the ethtool patch (patch 12). Details > > > > > > > are in that patch's commit message. > > > > > > > > > > > > > > Link to v2 cover letter: > > > > > > > https://lkml.org/lkml/2020/9/12/201 > > > > > > > > > > > > I agree with Jakub, this driver definitely can't go-in as it is currently > > > > > > structured and designed. > > > > > Why is that ? > > > > > Can you please point to the things that bother you or not working correctly? > > > > > I can't really fix the driver if I don't know what's wrong. > > > > > > > > > > In addition, please read my reply to Jakub with the explanation of why > > > > > we designed this driver as is. > > > > > > > > > > And because of the RDMA'ness of it, the RDMA > > > > > > folks have to be CC:'d and have a chance to review this. > > > > > As I said to Jakub, the driver doesn't use the RDMA infrastructure in > > > > > the kernel and we can't connect to it due to the lack of H/W support > > > > > we have > > > > > Therefore, I don't see why we need to CC linux-rdma. 
> > > > > I understood why Greg asked me to CC you because we do connect to the > > > > > netdev and standard eth infrastructure, but regarding the RDMA, it's > > > > > not really the same. > > > > > > > > Ok, to do this "right" it needs to be split up into separate drivers, > > > > hopefully using the "virtual bus" code that some day Intel will resubmit > > > > again that will solve this issue. > > > Hi Greg, > > > Can I suggest an alternative for the short/medium term ? > > > > > > In an earlier email, Jakub said: > > > "Is it not possible to move the files and still build them into a single > > > module?" > > > > > > I thought maybe that's a good way to progress here ? > > > > Cross-directory builds of a single module are crazy. Yes, they work, > > but really, that's a mess, and would never suggest doing that. > > > > > First, split the content to Ethernet and RDMA. > > > Then move the Ethernet part to drivers/net but build it as part of > > > habanalabs.ko. > > > Regarding the RDMA code, upstream/review it in a different patch-set > > > (maybe they will want me to put the files elsewhere). > > > > > > What do you think ? > > > > I think you are asking for more work there than just splitting out into > > separate modules :) > > > > thanks, > > > > greg k-h > Hi Greg, > > If cross-directory building is out of the question, what about > splitting into separate modules ? And use cross-module notifiers/calls > ? I did that with amdkfd and amdgpu/radeon a couple of years back. It > worked (that's the best thing I can say about it). That's fine with me. > The main problem with this "virtual bus" thing is that I'm not > familiar with it at all and from my experience I imagine it would take > a considerable time and effort to upstream this infrastructure work. It shouldn't be taking that long, but for some unknown reason, the original author of that code is sitting on it and not resending it. 
Go poke them through internal Intel channels to find out what the
problem is, as I have no clue why a 200-300 line bus module is taking
so long to get "right" :(

I'm _ALMOST_ at the point where I would just do that work myself, but
due to my current status with Intel, I'll let them do it as I have
enough other things on my plate...

> This could delay the NIC code for a couple of years, which by then
> this won't be relevant at all.

Why wouldn't this code be relevant in a year? It's going to be 2+ years
before any of this shows up in an "enterprise distro" based on their
release cycles anyway :)

thanks,

greg k-h
On Wed, Sep 16, 2020 at 11:21 AM Greg Kroah-Hartman <gregkh@linuxfoundation.org> wrote: > > On Wed, Sep 16, 2020 at 11:02:39AM +0300, Oded Gabbay wrote: > > On Wed, Sep 16, 2020 at 10:41 AM Greg Kroah-Hartman > > <gregkh@linuxfoundation.org> wrote: > > > > > > On Wed, Sep 16, 2020 at 09:36:23AM +0300, Oded Gabbay wrote: > > > > On Wed, Sep 16, 2020 at 9:25 AM Greg Kroah-Hartman > > > > <gregkh@linuxfoundation.org> wrote: > > > > > > > > > > On Tue, Sep 15, 2020 at 11:49:12PM +0300, Oded Gabbay wrote: > > > > > > On Tue, Sep 15, 2020 at 11:42 PM David Miller <davem@davemloft.net> wrote: > > > > > > > > > > > > > > From: Oded Gabbay <oded.gabbay@gmail.com> > > > > > > > Date: Tue, 15 Sep 2020 20:10:08 +0300 > > > > > > > > > > > > > > > This is the second version of the patch-set to upstream the GAUDI NIC code > > > > > > > > into the habanalabs driver. > > > > > > > > > > > > > > > > The only modification from v2 is in the ethtool patch (patch 12). Details > > > > > > > > are in that patch's commit message. > > > > > > > > > > > > > > > > Link to v2 cover letter: > > > > > > > > https://lkml.org/lkml/2020/9/12/201 > > > > > > > > > > > > > > I agree with Jakub, this driver definitely can't go-in as it is currently > > > > > > > structured and designed. > > > > > > Why is that ? > > > > > > Can you please point to the things that bother you or not working correctly? > > > > > > I can't really fix the driver if I don't know what's wrong. > > > > > > > > > > > > In addition, please read my reply to Jakub with the explanation of why > > > > > > we designed this driver as is. > > > > > > > > > > > > And because of the RDMA'ness of it, the RDMA > > > > > > > folks have to be CC:'d and have a chance to review this. > > > > > > As I said to Jakub, the driver doesn't use the RDMA infrastructure in > > > > > > the kernel and we can't connect to it due to the lack of H/W support > > > > > > we have > > > > > > Therefore, I don't see why we need to CC linux-rdma. 
> > > > > > I understood why Greg asked me to CC you because we do connect to the > > > > > > netdev and standard eth infrastructure, but regarding the RDMA, it's > > > > > > not really the same. > > > > > > > > > > Ok, to do this "right" it needs to be split up into separate drivers, > > > > > hopefully using the "virtual bus" code that some day Intel will resubmit > > > > > again that will solve this issue. > > > > Hi Greg, > > > > Can I suggest an alternative for the short/medium term ? > > > > > > > > In an earlier email, Jakub said: > > > > "Is it not possible to move the files and still build them into a single > > > > module?" > > > > > > > > I thought maybe that's a good way to progress here ? > > > > > > Cross-directory builds of a single module are crazy. Yes, they work, > > > but really, that's a mess, and would never suggest doing that. > > > > > > > First, split the content to Ethernet and RDMA. > > > > Then move the Ethernet part to drivers/net but build it as part of > > > > habanalabs.ko. > > > > Regarding the RDMA code, upstream/review it in a different patch-set > > > > (maybe they will want me to put the files elsewhere). > > > > > > > > What do you think ? > > > > > > I think you are asking for more work there than just splitting out into > > > separate modules :) > > > > > > thanks, > > > > > > greg k-h > > Hi Greg, > > > > If cross-directory building is out of the question, what about > > splitting into separate modules ? And use cross-module notifiers/calls > > ? I did that with amdkfd and amdgpu/radeon a couple of years back. It > > worked (that's the best thing I can say about it). > > That's fine with me. > > > The main problem with this "virtual bus" thing is that I'm not > > familiar with it at all and from my experience I imagine it would take > > a considerable time and effort to upstream this infrastructure work. 
> > It shouldn't be taking that long, but for some unknown reason, the > original author of that code is sitting on it and not resending it. Go > poke them through internal Intel channels to find out what the problem > is, as I have no clue why a 200-300 line bus module is taking so long to > get "right" :( > > I'm _ALMOST_ at the point where I would just do that work myself, but > due to my current status with Intel, I'll let them do it as I have > enough other things on my plate... > > > This could delay the NIC code for a couple of years, which by then > > this won't be relevant at all. > > Why wouldn't this code be relevant in a year? It's going to be 2+ years > before any of this shows up in an "enterprise distro" based on their > release cycles anyway :) > > thanks, > > greg k-h Hi Greg, ok, I'll take a look. Do you happen to have the name of the patch-set / author ? Regarding the RDMA stuff, I'll do some work internally to separate it from the Ethernet code and then will send that code only to RDMA people with more detailed explanations. Thanks, Oded
On Wed, Sep 16, 2020 at 11:47:58AM +0300, Oded Gabbay wrote: > On Wed, Sep 16, 2020 at 11:21 AM Greg Kroah-Hartman > <gregkh@linuxfoundation.org> wrote: > > > > On Wed, Sep 16, 2020 at 11:02:39AM +0300, Oded Gabbay wrote: > > > On Wed, Sep 16, 2020 at 10:41 AM Greg Kroah-Hartman > > > <gregkh@linuxfoundation.org> wrote: > > > > > > > > On Wed, Sep 16, 2020 at 09:36:23AM +0300, Oded Gabbay wrote: > > > > > On Wed, Sep 16, 2020 at 9:25 AM Greg Kroah-Hartman > > > > > <gregkh@linuxfoundation.org> wrote: > > > > > > > > > > > > On Tue, Sep 15, 2020 at 11:49:12PM +0300, Oded Gabbay wrote: > > > > > > > On Tue, Sep 15, 2020 at 11:42 PM David Miller <davem@davemloft.net> wrote: > > > > > > > > > > > > > > > > From: Oded Gabbay <oded.gabbay@gmail.com> > > > > > > > > Date: Tue, 15 Sep 2020 20:10:08 +0300 > > > > > > > > > > > > > > > > > This is the second version of the patch-set to upstream the GAUDI NIC code > > > > > > > > > into the habanalabs driver. > > > > > > > > > > > > > > > > > > The only modification from v2 is in the ethtool patch (patch 12). Details > > > > > > > > > are in that patch's commit message. > > > > > > > > > > > > > > > > > > Link to v2 cover letter: > > > > > > > > > https://lkml.org/lkml/2020/9/12/201 > > > > > > > > > > > > > > > > I agree with Jakub, this driver definitely can't go-in as it is currently > > > > > > > > structured and designed. > > > > > > > Why is that ? > > > > > > > Can you please point to the things that bother you or not working correctly? > > > > > > > I can't really fix the driver if I don't know what's wrong. > > > > > > > > > > > > > > In addition, please read my reply to Jakub with the explanation of why > > > > > > > we designed this driver as is. > > > > > > > > > > > > > > And because of the RDMA'ness of it, the RDMA > > > > > > > > folks have to be CC:'d and have a chance to review this. 
> > > > > > > As I said to Jakub, the driver doesn't use the RDMA infrastructure in > > > > > > > the kernel and we can't connect to it due to the lack of H/W support > > > > > > > we have > > > > > > > Therefore, I don't see why we need to CC linux-rdma. > > > > > > > I understood why Greg asked me to CC you because we do connect to the > > > > > > > netdev and standard eth infrastructure, but regarding the RDMA, it's > > > > > > > not really the same. > > > > > > > > > > > > Ok, to do this "right" it needs to be split up into separate drivers, > > > > > > hopefully using the "virtual bus" code that some day Intel will resubmit > > > > > > again that will solve this issue. > > > > > Hi Greg, > > > > > Can I suggest an alternative for the short/medium term ? > > > > > > > > > > In an earlier email, Jakub said: > > > > > "Is it not possible to move the files and still build them into a single > > > > > module?" > > > > > > > > > > I thought maybe that's a good way to progress here ? > > > > > > > > Cross-directory builds of a single module are crazy. Yes, they work, > > > > but really, that's a mess, and would never suggest doing that. > > > > > > > > > First, split the content to Ethernet and RDMA. > > > > > Then move the Ethernet part to drivers/net but build it as part of > > > > > habanalabs.ko. > > > > > Regarding the RDMA code, upstream/review it in a different patch-set > > > > > (maybe they will want me to put the files elsewhere). > > > > > > > > > > What do you think ? > > > > > > > > I think you are asking for more work there than just splitting out into > > > > separate modules :) > > > > > > > > thanks, > > > > > > > > greg k-h > > > Hi Greg, > > > > > > If cross-directory building is out of the question, what about > > > splitting into separate modules ? And use cross-module notifiers/calls > > > ? I did that with amdkfd and amdgpu/radeon a couple of years back. It > > > worked (that's the best thing I can say about it). > > > > That's fine with me. 
> > > > > The main problem with this "virtual bus" thing is that I'm not > > > familiar with it at all and from my experience I imagine it would take > > > a considerable time and effort to upstream this infrastructure work. > > > > It shouldn't be taking that long, but for some unknown reason, the > > original author of that code is sitting on it and not resending it. Go > > poke them through internal Intel channels to find out what the problem > > is, as I have no clue why a 200-300 line bus module is taking so long to > > get "right" :( > > > > I'm _ALMOST_ at the point where I would just do that work myself, but > > due to my current status with Intel, I'll let them do it as I have > > enough other things on my plate... > > > > > This could delay the NIC code for a couple of years, which by then > > > this won't be relevant at all. > > > > Why wouldn't this code be relevant in a year? It's going to be 2+ years > > before any of this shows up in an "enterprise distro" based on their > > release cycles anyway :) > > > > thanks, > > > > greg k-h > > Hi Greg, > ok, I'll take a look. Do you happen to have the name of the patch-set / author ? Here's at least one copy: https://lore.kernel.org/linux-rdma/20200520070227.3392100-2-jeffrey.t.kirsher@intel.com/ there might have been a newer one, can't remember, sorry. thanks, greg k-h
On Wed, 2020-09-16 at 10:22 +0200, Greg Kroah-Hartman wrote: > On Wed, Sep 16, 2020 at 11:02:39AM +0300, Oded Gabbay wrote: > > On Wed, Sep 16, 2020 at 10:41 AM Greg Kroah-Hartman > > <gregkh@linuxfoundation.org> wrote: > > > On Wed, Sep 16, 2020 at 09:36:23AM +0300, Oded Gabbay wrote: > > > > On Wed, Sep 16, 2020 at 9:25 AM Greg Kroah-Hartman > > > > <gregkh@linuxfoundation.org> wrote: > > > > > On Tue, Sep 15, 2020 at 11:49:12PM +0300, Oded Gabbay wrote: > > > > > > On Tue, Sep 15, 2020 at 11:42 PM David Miller < > > > > > > davem@davemloft.net> wrote: > > > > > > > From: Oded Gabbay <oded.gabbay@gmail.com> > > > > > > > Date: Tue, 15 Sep 2020 20:10:08 +0300 > > > > > > > > > > > > > > > This is the second version of the patch-set to upstream > > > > > > > > the GAUDI NIC code > > > > > > > > into the habanalabs driver. > > > > > > > > > > > > > > > > The only modification from v2 is in the ethtool patch > > > > > > > > (patch 12). Details > > > > > > > > are in that patch's commit message. > > > > > > > > > > > > > > > > Link to v2 cover letter: > > > > > > > > https://lkml.org/lkml/2020/9/12/201 > > > > > > > > > > > > > > I agree with Jakub, this driver definitely can't go-in as > > > > > > > it is currently > > > > > > > structured and designed. > > > > > > Why is that ? > > > > > > Can you please point to the things that bother you or not > > > > > > working correctly? > > > > > > I can't really fix the driver if I don't know what's wrong. > > > > > > > > > > > > In addition, please read my reply to Jakub with the > > > > > > explanation of why > > > > > > we designed this driver as is. > > > > > > > > > > > > And because of the RDMA'ness of it, the RDMA > > > > > > > folks have to be CC:'d and have a chance to review this. 
> > > > > > As I said to Jakub, the driver doesn't use the RDMA > > > > > > infrastructure in > > > > > > the kernel and we can't connect to it due to the lack of > > > > > > H/W support > > > > > > we have > > > > > > Therefore, I don't see why we need to CC linux-rdma. > > > > > > I understood why Greg asked me to CC you because we do > > > > > > connect to the > > > > > > netdev and standard eth infrastructure, but regarding the > > > > > > RDMA, it's > > > > > > not really the same. > > > > > > > > > > Ok, to do this "right" it needs to be split up into separate > > > > > drivers, > > > > > hopefully using the "virtual bus" code that some day Intel > > > > > will resubmit > > > > > again that will solve this issue. > > > > Hi Greg, > > > > Can I suggest an alternative for the short/medium term ? > > > > > > > > In an earlier email, Jakub said: > > > > "Is it not possible to move the files and still build them into > > > > a single > > > > module?" > > > > > > > > I thought maybe that's a good way to progress here ? > > > > > > Cross-directory builds of a single module are crazy. Yes, they > > > work, > > > but really, that's a mess, and would never suggest doing that. > > > > > > > First, split the content to Ethernet and RDMA. > > > > Then move the Ethernet part to drivers/net but build it as part > > > > of > > > > habanalabs.ko. > > > > Regarding the RDMA code, upstream/review it in a different > > > > patch-set > > > > (maybe they will want me to put the files elsewhere). > > > > > > > > What do you think ? > > > > > > I think you are asking for more work there than just splitting > > > out into > > > separate modules :) > > > > > > thanks, > > > > > > greg k-h > > Hi Greg, > > > > If cross-directory building is out of the question, what about > > splitting into separate modules ? And use cross-module > > notifiers/calls > > ? I did that with amdkfd and amdgpu/radeon a couple of years back. > > It > > worked (that's the best thing I can say about it). 
> That's fine with me.
>
> > The main problem with this "virtual bus" thing is that I'm not
> > familiar with it at all and from my experience I imagine it would
> > take a considerable time and effort to upstream this infrastructure
> > work.
>
> It shouldn't be taking that long, but for some unknown reason, the
> original author of that code is sitting on it and not resending it.
> Go poke them through internal Intel channels to find out what the
> problem is, as I have no clue why a 200-300 line bus module is taking
> so long to get "right" :(

It turns out that they were caught between being deeply respectful of
your request to get another senior kernel developer to look at it
before sending it out, and deeply respectful of not disclosing that I
was out on bonding leave.  It just happened that I left before they
could get the latest version over to review.

> I'm _ALMOST_ at the point where I would just do that work myself, but
> due to my current status with Intel, I'll let them do it as I have
> enough other things on my plate...

I'm back now, let's get this thing moving.

/me goes to review.
On Tue, Sep 15, 2020 at 11:46:58PM +0300, Oded Gabbay wrote:
> infrastructure for communication between multiple accelerators. Same
> as Nvidia uses NVlink, we use RDMA that we have inside our ASIC.
> The RDMA implementation we did does NOT support some basic RDMA
> IBverbs (such as MR and PD) and therefore, we can't use the rdma-core
> library or to connect to the rdma infrastructure in the kernel.

You can't create a parallel RDMA subsystem in netdev, or in misc, and
you can't add random device offloads as IOCTLs to netdevs.

RDMA is the proper home for all the networking offloads that don't fit
into netdev.

EFA was able to fit into rdma-core/etc and it isn't even RoCE at
all. I'm sure this can too.

> wanted to do it but when we analyzed it, we saw we wouldn't be able to
> support basic stuff and therefore we had to revert to our IOCTLs.

Try again. Ask for help. Your patches add CQs, WQs, and other RDMA
objects. This is very clearly not appropriate functionality for
netdev.

> To sum it up, because our NIC is used for intra-communication, we
> don't expose nor intend users to use it as a NIC per-se. However, to
> be able to get statistics and manage them in a standard way, and
> support control plane over Ethernet, we do register each port to the
> net subsystem (i.e. create netdev per port).

Sure, the basic ethernet side is conceptually fine.

> > Please make sure to CC linux-rdma. You clearly stated that the device
> > does RDMA-like transfers.
>
> We don't use the RDMA infrastructure in the kernel and we can't
> connect to it due to the lack of H/W support we have so I don't see
> why we need to CC linux-rdma.

Because you can't put RDMA-like concepts under net.

Jakub, NAK from me on this series.

Jason
On 17/09/2020 20:18, Jason Gunthorpe wrote:
> On Tue, Sep 15, 2020 at 11:46:58PM +0300, Oded Gabbay wrote:
>> infrastructure for communication between multiple accelerators. Same
>> as Nvidia uses NVlink, we use RDMA that we have inside our ASIC.
>> The RDMA implementation we did does NOT support some basic RDMA
>> IBverbs (such as MR and PD) and therefore, we can't use the rdma-core
>> library or to connect to the rdma infrastructure in the kernel.
>
> You can't create a parallel RDMA subsystem in netdev, or in misc, and
> you can't add random device offloads as IOCTL to nedevs.
>
> RDMA is the proper home for all the networking offloads that don't fit
> into netdev.
>
> EFA was able to fit into rdma-core/etc and it isn't even RoCE at
> all. I'm sure this can too.

Well, EFA wasn't welcomed to the RDMA subsystem with open arms ;),
initially it was suggested to go through the vfio subsystem instead.

I think this comes back to the discussion we had when EFA was
upstreamed, which is what's the bar to get accepted to the RDMA
subsystem.
IIRC, what we eventually agreed on is having a userspace rdma-core
provider and ibv_{ud,rc}_pingpong working (or just supporting one of
the IB spec's QP types?).

Does GAUDI fit these requirements? If not, should it be in a different
subsystem or should we open the "what qualifies as an RDMA device"
question again?
On Fri, Sep 18, 2020 at 02:36:10PM +0300, Gal Pressman wrote:
> On 17/09/2020 20:18, Jason Gunthorpe wrote:
> > On Tue, Sep 15, 2020 at 11:46:58PM +0300, Oded Gabbay wrote:
> >> infrastructure for communication between multiple accelerators. Same
> >> as Nvidia uses NVlink, we use RDMA that we have inside our ASIC.
> >> The RDMA implementation we did does NOT support some basic RDMA
> >> IBverbs (such as MR and PD) and therefore, we can't use the rdma-core
> >> library or to connect to the rdma infrastructure in the kernel.
> >
> > You can't create a parallel RDMA subsystem in netdev, or in misc, and
> > you can't add random device offloads as IOCTL to nedevs.
> >
> > RDMA is the proper home for all the networking offloads that don't fit
> > into netdev.
> >
> > EFA was able to fit into rdma-core/etc and it isn't even RoCE at
> > all. I'm sure this can too.
>
> Well, EFA wasn't welcomed to the RDMA subsystem with open arms ;),
> initially it was suggested to go through the vfio subsystem instead.
>
> I think this comes back to the discussion we had when EFA was
> upstreamed, which is what's the bar to get accepted to the RDMA
> subsystem.
> IIRC, what we eventually agreed on is having a userspace rdma-core
> provider and ibv_{ud,rc}_pingpong working (or just supporting one of
> the IB spec's QP types?).
>
> Does GAUDI fit these requirements? If not, should it be in a different
> subsystem or should we open the "what qualifies as an RDMA device"
> question again?

I want to remind you that the rdma-core requirement came to make sure
that anything exposed from RDMA to userspace is strict with proper
UAPI header hygiene.

I doubt that Habana's ioctls are backed by anything like this.

Thanks
On Fri, Sep 18, 2020 at 02:36:10PM +0300, Gal Pressman wrote:
> On 17/09/2020 20:18, Jason Gunthorpe wrote:
> > On Tue, Sep 15, 2020 at 11:46:58PM +0300, Oded Gabbay wrote:
> >> infrastructure for communication between multiple accelerators. Same
> >> as Nvidia uses NVlink, we use RDMA that we have inside our ASIC.
> >> The RDMA implementation we did does NOT support some basic RDMA
> >> IBverbs (such as MR and PD) and therefore, we can't use the rdma-core
> >> library or to connect to the rdma infrastructure in the kernel.
> >
> > You can't create a parallel RDMA subsystem in netdev, or in misc, and
> > you can't add random device offloads as IOCTL to nedevs.
> >
> > RDMA is the proper home for all the networking offloads that don't fit
> > into netdev.
> >
> > EFA was able to fit into rdma-core/etc and it isn't even RoCE at
> > all. I'm sure this can too.
>
> Well, EFA wasn't welcomed to the RDMA subsystem with open arms ;),
> initially it was suggested to go through the vfio subsystem instead.
>
> I think this comes back to the discussion we had when EFA was
> upstreamed, which is what's the bar to get accepted to the RDMA
> subsystem.
> IIRC, what we eventually agreed on is having a userspace rdma-core
> provider and ibv_{ud,rc}_pingpong working (or just supporting one of
> the IB spec's QP types?).

That is more or less where we ended up, yes.

I'm most worried about this lack of PD and MR.

Kernel must provide security for apps doing user DMA, PD and MR do
this. If the device doesn't have PD/MR then it is hard to see how a WQ
could ever be exposed directly to userspace, regardless of subsystem.

Jason
On Fri, Sep 18, 2020 at 2:52 PM Leon Romanovsky <leon@kernel.org> wrote: > > On Fri, Sep 18, 2020 at 02:36:10PM +0300, Gal Pressman wrote: > > On 17/09/2020 20:18, Jason Gunthorpe wrote: > > > On Tue, Sep 15, 2020 at 11:46:58PM +0300, Oded Gabbay wrote: > > >> infrastructure for communication between multiple accelerators. Same > > >> as Nvidia uses NVlink, we use RDMA that we have inside our ASIC. > > >> The RDMA implementation we did does NOT support some basic RDMA > > >> IBverbs (such as MR and PD) and therefore, we can't use the rdma-core > > >> library or to connect to the rdma infrastructure in the kernel. > > > > > > You can't create a parallel RDMA subsystem in netdev, or in misc, and > > > you can't add random device offloads as IOCTL to nedevs. > > > > > > RDMA is the proper home for all the networking offloads that don't fit > > > into netdev. > > > > > > EFA was able to fit into rdma-core/etc and it isn't even RoCE at > > > all. I'm sure this can too. > > > > Well, EFA wasn't welcomed to the RDMA subsystem with open arms ;), initially it > > was suggested to go through the vfio subsystem instead. > > > > I think this comes back to the discussion we had when EFA was upstreamed, which > > is what's the bar to get accepted to the RDMA subsystem. > > IIRC, what we eventually agreed on is having a userspace rdma-core provider and > > ibv_{ud,rc}_pingpong working (or just supporting one of the IB spec's QP types?). > > > > Does GAUDI fit these requirements? If not, should it be in a different subsystem > > or should we open the "what qualifies as an RDMA device" question again? > > I want to remind you that rdma-core requirement came to make sure that > anything exposed from the RDMA to the userspace is strict with proper > UAPI header hygiene. > > I doubt that Havana's ioctls are backed by anything like this. > > Thanks Why do you doubt that ? Have you looked at our code ? 
Our uapi and IOCTL interface is based on the drm subsystem's uapi
interface and it is very safe and protected. Otherwise Greg would have
never allowed me to go upstream in the first place.

We have a single function which is the entry point for all the IOCTLs
of our driver (only one IOCTL is RDMA related, all the others are
compute related). That function is almost a 1:1 copy of the function
in drm.

Thanks,
Oded
On Fri, Sep 18, 2020 at 2:56 PM Jason Gunthorpe <jgg@ziepe.ca> wrote: > > On Fri, Sep 18, 2020 at 02:36:10PM +0300, Gal Pressman wrote: > > On 17/09/2020 20:18, Jason Gunthorpe wrote: > > > On Tue, Sep 15, 2020 at 11:46:58PM +0300, Oded Gabbay wrote: > > >> infrastructure for communication between multiple accelerators. Same > > >> as Nvidia uses NVlink, we use RDMA that we have inside our ASIC. > > >> The RDMA implementation we did does NOT support some basic RDMA > > >> IBverbs (such as MR and PD) and therefore, we can't use the rdma-core > > >> library or to connect to the rdma infrastructure in the kernel. > > > > > > You can't create a parallel RDMA subsystem in netdev, or in misc, and > > > you can't add random device offloads as IOCTL to nedevs. > > > > > > RDMA is the proper home for all the networking offloads that don't fit > > > into netdev. > > > > > > EFA was able to fit into rdma-core/etc and it isn't even RoCE at > > > all. I'm sure this can too. > > > > Well, EFA wasn't welcomed to the RDMA subsystem with open arms ;), initially it > > was suggested to go through the vfio subsystem instead. > > > > I think this comes back to the discussion we had when EFA was upstreamed, which > > is what's the bar to get accepted to the RDMA subsystem. > > IIRC, what we eventually agreed on is having a userspace rdma-core provider and > > ibv_{ud,rc}_pingpong working (or just supporting one of the IB spec's QP types?). > > That is more or less where we ended up, yes. > > I'm most worried about this lack of PD and MR. > > Kernel must provide security for apps doing user DMA, PD and MR do > this. If the device doesn't have PD/MR then it is hard to see how a WQ > could ever be exposed directly to userspace, regardless of subsystem. > > Jason Hi Jason, What you say here is very true and we handle that with different mechanisms. I will start working on a dedicated patch-set of the RDMA code in the next few weeks with MUCH MORE details in the commit messages. 
That will explain exactly how we expose things and how we protect them.
For example, regarding isolation between applications, we only support
a single application opening our file descriptor. Another example is
that submission to a WQ is done through our QMAN mechanism and is NOT
mapped to userspace (due to the restrictions you mentioned above and
other restrictions).

But again, I want to send something organized and with proper
explanations. I hope to have something in a couple of weeks.

Thanks,
Oded
On Tue, Sep 15, 2020 at 08:10:08PM +0300, Oded Gabbay wrote:
> Hello,
>
> This is the second version of the patch-set to upstream the GAUDI NIC code
> into the habanalabs driver.
>
> The only modification from v2 is in the ethtool patch (patch 12). Details
> are in that patch's commit message.
>
> Link to v2 cover letter:
> https://lkml.org/lkml/2020/9/12/201
>
> 1. The NIC functionality is NOT exposed as different PCI Physical
>    Functions. There is a single PF which is used for compute and
>    networking, as the main goal of the NIC ports is to be used as
>    intra-communication and not as standard network interfaces. This
>    implies we can't connect different drivers to handle the networking
>    ports because it is the same device, from the kernel POV, as the
>    compute. Therefore, we must integrate the networking code into the
>    main habanalabs driver.

No, this means you need to use the virtual bus/ancillary bus that your
other Intel colleagues have been working on with Greg.

It is specifically intended as the way to split a single PCI function
across multiple subsystems, e.g. drivers/misc/habanalabs would be the
pci_driver and drivers/net/ethernet/habanalabs would be the
'virtual/ancillary' driver. Probably one per port.

Jason
On Fri, Sep 18, 2020 at 3:00 PM Jason Gunthorpe <jgg@nvidia.com> wrote: > > > On Tue, Sep 15, 2020 at 08:10:08PM +0300, Oded Gabbay wrote: > > Hello, > > > > This is the second version of the patch-set to upstream the GAUDI NIC code > > into the habanalabs driver. > > > > The only modification from v2 is in the ethtool patch (patch 12). Details > > are in that patch's commit message. > > > > Link to v2 cover letter: > > https://lkml.org/lkml/2020/9/12/201 > > >1. The NIC functionality is NOT exposed as different PCI Physical > > Functions. There is a single PF which is used for compute and > > networking, as the main goal of the NIC ports is to be used as > > intra-communication and not as standard network interfaces. This > > implies we can't connect different drivers to handle the networking > > ports because it is the same device, from the kernel POV, as the > > compute. Therefore, we must integrate the networking code into the > > main habanalabs driver. > > No, this means you need to use virtual bus/ancillary bus that your > other Intel colleagues have been working on with Greg. > > It is specificaly intended as the way to split a single PCI function > across multiple subsystems. eg drivers/misc/habanalabs would be the > pci_driver and drivers/net/ethernet/habanadalabs would be the > 'virtual/ancillary' driver. Probably one per port. > > Jasno Understood. We are doing a refactor of the code according to those guidelines and will send an updated patch-set in a couple of weeks. Thanks, Oded
On Fri, Sep 18, 2020 at 02:56:09PM +0300, Oded Gabbay wrote:
> On Fri, Sep 18, 2020 at 2:52 PM Leon Romanovsky <leon@kernel.org> wrote:
> >
> > On Fri, Sep 18, 2020 at 02:36:10PM +0300, Gal Pressman wrote:
> > > On 17/09/2020 20:18, Jason Gunthorpe wrote:
> > > > On Tue, Sep 15, 2020 at 11:46:58PM +0300, Oded Gabbay wrote:
> > > > > infrastructure for communication between multiple accelerators. Same
> > > > > as Nvidia uses NVlink, we use RDMA that we have inside our ASIC.
> > > > > The RDMA implementation we did does NOT support some basic RDMA
> > > > > IBverbs (such as MR and PD) and therefore, we can't use the rdma-core
> > > > > library or connect to the rdma infrastructure in the kernel.
> > > >
> > > > You can't create a parallel RDMA subsystem in netdev, or in misc, and
> > > > you can't add random device offloads as IOCTLs to netdevs.
> > > >
> > > > RDMA is the proper home for all the networking offloads that don't fit
> > > > into netdev.
> > > >
> > > > EFA was able to fit into rdma-core/etc and it isn't even RoCE at
> > > > all. I'm sure this can too.
> > >
> > > Well, EFA wasn't welcomed to the RDMA subsystem with open arms ;), initially it
> > > was suggested to go through the vfio subsystem instead.
> > >
> > > I think this comes back to the discussion we had when EFA was upstreamed, which
> > > is what's the bar to get accepted to the RDMA subsystem.
> > > IIRC, what we eventually agreed on is having a userspace rdma-core provider and
> > > ibv_{ud,rc}_pingpong working (or just supporting one of the IB spec's QP types?).
> > >
> > > Does GAUDI fit these requirements? If not, should it be in a different subsystem
> > > or should we open the "what qualifies as an RDMA device" question again?
> >
> > I want to remind you that the rdma-core requirement came to make sure that
> > anything exposed from RDMA to userspace is strict, with proper
> > UAPI header hygiene.
> >
> > I doubt that Habana's ioctls are backed by anything like this.
> >
> > Thanks
>
> Why do you doubt that? Have you looked at our code?
> Our uapi and IOCTLs interface is based on the drm subsystem's uapi interface
> and it is very safe and protected.

Yes, I looked and didn't find open-source users of your UAPI headers.
It is not related to being safe or protected, but to the common request
to present userspace that relies on those exported interfaces.

> Otherwise Greg would have never allowed me to go upstream in the first place.

Nice, can we get a link?

> We have a single function which is the entry point for all the IOCTLs
> of our drivers (only one IOCTL is RDMA related, all the others are
> compute related).
> That function is almost a 1:1 copy of the function in drm.

DRM has the same rules as RDMA: no kernel code will be merged without seeing
open-source userspace.

Thanks

> Thanks,
> Oded
On Fri, Sep 18, 2020 at 3:03 PM Leon Romanovsky <leon@kernel.org> wrote:
>
> On Fri, Sep 18, 2020 at 02:56:09PM +0300, Oded Gabbay wrote:
> > Why do you doubt that? Have you looked at our code?
> > Our uapi and IOCTLs interface is based on the drm subsystem's uapi interface
> > and it is very safe and protected.
>
> Yes, I looked and didn't find open-source users of your UAPI headers.
> It is not related to being safe or protected, but to the common request
> to present userspace that relies on those exported interfaces.
>
> > Otherwise Greg would have never allowed me to go upstream in the first place.
>
> Nice, can we get a link?
>
> > We have a single function which is the entry point for all the IOCTLs
> > of our drivers (only one IOCTL is RDMA related, all the others are
> > compute related).
> > That function is almost a 1:1 copy of the function in drm.
>
> DRM has the same rules as RDMA: no kernel code will be merged without seeing
> open-source userspace.
>
> Thanks

So we do have an open-source library called hl-thunk, which uses our
driver, and indeed that was part of the requirement.
It is similar to libdrm.
Here is the link:
https://github.com/HabanaAI/hl-thunk

That library also comes with a comprehensive suite of tests which shows
how to use the accelerator, and we have many NIC tests which show how to
use the NIC. All the rest of the user-space code in Habana goes through
that library.

Currently, you won't find the NIC code there because we didn't upstream
it as the driver code wasn't ready, but I'll push it there in a private
branch if you want to take a look.

Thanks,
Oded
On Fri, Sep 18, 2020 at 2:36 PM Gal Pressman <galpress@amazon.com> wrote:
>
> On 17/09/2020 20:18, Jason Gunthorpe wrote:
> > On Tue, Sep 15, 2020 at 11:46:58PM +0300, Oded Gabbay wrote:
> > > infrastructure for communication between multiple accelerators. Same
> > > as Nvidia uses NVlink, we use RDMA that we have inside our ASIC.
> > > The RDMA implementation we did does NOT support some basic RDMA
> > > IBverbs (such as MR and PD) and therefore, we can't use the rdma-core
> > > library or connect to the rdma infrastructure in the kernel.
> >
> > You can't create a parallel RDMA subsystem in netdev, or in misc, and
> > you can't add random device offloads as IOCTLs to netdevs.
> >
> > RDMA is the proper home for all the networking offloads that don't fit
> > into netdev.
> >
> > EFA was able to fit into rdma-core/etc and it isn't even RoCE at
> > all. I'm sure this can too.
>
> Well, EFA wasn't welcomed to the RDMA subsystem with open arms ;), initially it
> was suggested to go through the vfio subsystem instead.
>
> I think this comes back to the discussion we had when EFA was upstreamed, which
> is what's the bar to get accepted to the RDMA subsystem.
> IIRC, what we eventually agreed on is having a userspace rdma-core provider and
> ibv_{ud,rc}_pingpong working (or just supporting one of the IB spec's QP types?).
>
> Does GAUDI fit these requirements? If not, should it be in a different subsystem
> or should we open the "what qualifies as an RDMA device" question again?

Hi Itay,
Please see the above comments/questions.

Thanks,
Oded
On Fri, Sep 18, 2020 at 02:59:28PM +0300, Oded Gabbay wrote:
> On Fri, Sep 18, 2020 at 2:56 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> >
> > On Fri, Sep 18, 2020 at 02:36:10PM +0300, Gal Pressman wrote:
> > > I think this comes back to the discussion we had when EFA was upstreamed, which
> > > is what's the bar to get accepted to the RDMA subsystem.
> > > IIRC, what we eventually agreed on is having a userspace rdma-core provider and
> > > ibv_{ud,rc}_pingpong working (or just supporting one of the IB spec's QP types?).
> >
> > That is more or less where we ended up, yes.
> >
> > I'm most worried about this lack of PD and MR.
> >
> > Kernel must provide security for apps doing user DMA; PD and MR do
> > this. If the device doesn't have PD/MR then it is hard to see how a WQ
> > could ever be exposed directly to userspace, regardless of subsystem.
>
> Hi Jason,
> What you say here is very true and we handle that with different
> mechanisms. I will start working on a dedicated patch-set of the RDMA
> code in the next few weeks with MUCH MORE details in the commit
> messages. That will explain exactly how we expose stuff and protect.
>
> For example, regarding isolating between applications, we only support
> a single application opening our file descriptor.

Then the driver has a special PD create that requires the misc file
descriptor to authorize RDMA access to the resources in that security
context.

> Another example is that the submission of WQ is done through our QMAN
> mechanism and is NOT mapped to userspace (due to the restrictions you
> mentioned above and other restrictions).

Sure, other RDMA drivers also require a kernel ioctl for command
execution.

In this model the MR can be a software construct, again representing a
security authorization:

- A 'full process' MR, in which case the kernel command execution
  handles dma map and pinning at command execution time
- A 'normal' MR, in which case the DMA list is pre-created and the
  command execution just re-uses this data

The general requirement for RDMA is the same as DRM: you must provide
enough code in rdma-core to show how the device works, and minimally
test it. EFA uses ibv_ud_pingpong, and some pyverbs tests IIRC.

So you'll want to arrange something where the default MR and PD
mechanisms do something workable on this device, like auto-open the
misc FD when building the PD, and support the 'normal' MR flow for
command execution.

Jason
On Fri, Sep 18, 2020 at 03:07:19PM +0300, Oded Gabbay wrote:
> So we do have an open-source library called hl-thunk, which uses our
> driver and indeed that was part of the requirement.
> It is similar to libdrm.
> Here is the link:
> https://github.com/HabanaAI/hl-thunk

Are you kidding?

This is a mirror of some internal repository that looks like a dumpster,
with ChangeIds and internal bug tracker numbers, not part of major OS
distributions.

It is not an open-source library, and it shows very clearly why you chose
to upstream your driver through the drivers/misc/ tree.

Thanks
On Fri, Sep 18, 2020 at 3:19 PM Leon Romanovsky <leon@kernel.org> wrote:
>
> On Fri, Sep 18, 2020 at 03:07:19PM +0300, Oded Gabbay wrote:
> > So we do have an open-source library called hl-thunk, which uses our
> > driver and indeed that was part of the requirement.
> > It is similar to libdrm.
> > Here is the link:
> > https://github.com/HabanaAI/hl-thunk
>
> Are you kidding?
>
> This is a mirror of some internal repository that looks like a dumpster,
> with ChangeIds and internal bug tracker numbers, not part of major OS
> distributions.
>
> It is not an open-source library, and it shows very clearly why you chose
> to upstream your driver through the drivers/misc/ tree.
>
> Thanks

Adding Olof here.

No, usually not. But are you kidding? What exactly did you expect to
find? Is there an open-source project somewhere that encapsulates
deep-learning accelerators which I could connect to?
AFAIK, the only thing remotely relevant is CUDA, and that is
closed-source (strange to hear lectures about open source from NVIDIA
people here...).

So we are trying to give the community such an open-source library, or
at least an example. Hopefully one day, when more companies upstream
their drivers for deep-learning accelerators, we could do something like
libdrm or rdma-core, but for now, it's just our driver.

I have been in this community since 2013, with AMD and then Red Hat, and
I come with good intentions and a desire to open-source and upstream as
much as I can. I don't think I deserve this kind of response.

The bottom line is that we had this discussion with Greg and Olof and
the DRM people almost 2 years ago, and if there was some open-source
project in user-space or some subsystem in the kernel we could connect
to, we would have done that instead of what we did; but the fact of the
matter is there isn't such a thing. Olof tried and is trying to create a
h/w accelerator subsystem, but it still hasn't gotten off the ground
yet.

Oded
On Fri, Sep 18, 2020 at 3:16 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> In this model the MR can be a software construct, again representing a
> security authorization:
>
> - A 'full process' MR, in which case the kernel command execution
>   handles dma map and pinning at command execution time
> - A 'normal' MR, in which case the DMA list is pre-created and the
>   command execution just re-uses this data
>
> So you'll want to arrange something where the default MR and PD
> mechanisms do something workable on this device, like auto-open the
> misc FD when building the PD, and support the 'normal' MR flow for
> command execution.
>
> Jason

I don't know how we can support MR, because we can't support arbitrary
virtual addresses on the host; our internal MMU doesn't support 64 bits.
We investigated in the past and very much wanted to use IBverbs, but
didn't figure out how to make it work.
I'm adding Itay here and he can also shed more details on that.

Oded
On Fri, Sep 18, 2020 at 03:34:54PM +0300, Oded Gabbay wrote:
> I don't know how we can support MR, because we can't support arbitrary
> virtual addresses on the host; our internal MMU doesn't support 64 bits.
> We investigated in the past and very much wanted to use IBverbs, but
> didn't figure out how to make it work.
> I'm adding Itay here and he can also shed more details on that.

I'm not sure what that means; if the driver intends to DMA from process
memory then it certainly has an MR concept.

MRs can control the IOVA directly, so if you say the HW needs an MR IOVA
< 2**32 then that is still OK.

Jason
On Fri, Sep 18, 2020 at 3:50 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> I'm not sure what that means; if the driver intends to DMA from process
> memory then it certainly has an MR concept.
>
> MRs can control the IOVA directly, so if you say the HW needs an MR IOVA
> < 2**32 then that is still OK.
>
> Jason

Hi Jason,
I'll try to explain, but please bear with me because it requires some
understanding of our H/W architecture.

Our ASIC has 32 GB of HBM memory (similar to GPUs). The problem is that
the HBM memory is accessed by our ASIC's engines (DMA, NIC, etc.) with
physical addressing, which is mapped inside our device between 0x0 and
0x8_0000_0000.

Now, if a user performs a malloc and then maps that memory to our device
(using our memory MAP ioctl, similar to how a GPU works), he will get a
new virtual address, which is in the range 0x80_0000_0000 - (2^50 - 1).
Then, he can use that new VA in our device with the different engines
(DMA, NIC, compute). That way, addresses that represent host memory do
not overlap addresses that represent HBM memory.

The problem with MR is that the API doesn't let us return a new VA; it
forces us to use the original VA that the host OS allocated. What will
we do if that VA is in the range of our HBM addresses? The device won't
be able to distinguish between them. The transaction that is generated
by an engine inside our device will go to the HBM instead of going to
the PCI controller and then to the host.

That's the crux of the problem and why we didn't use MR.
If that's not clear, I'll be happy to explain more.

Thanks,
Oded
On Fri, Sep 18, 2020 at 03:31:51PM +0300, Oded Gabbay wrote: > On Fri, Sep 18, 2020 at 3:19 PM Leon Romanovsky <leon@kernel.org> wrote: > > > > On Fri, Sep 18, 2020 at 03:07:19PM +0300, Oded Gabbay wrote: > > > On Fri, Sep 18, 2020 at 3:03 PM Leon Romanovsky <leon@kernel.org> wrote: > > > > > > > > On Fri, Sep 18, 2020 at 02:56:09PM +0300, Oded Gabbay wrote: > > > > > On Fri, Sep 18, 2020 at 2:52 PM Leon Romanovsky <leon@kernel.org> wrote: > > > > > > > > > > > > On Fri, Sep 18, 2020 at 02:36:10PM +0300, Gal Pressman wrote: > > > > > > > On 17/09/2020 20:18, Jason Gunthorpe wrote: > > > > > > > > On Tue, Sep 15, 2020 at 11:46:58PM +0300, Oded Gabbay wrote: > > > > > > > >> infrastructure for communication between multiple accelerators. Same > > > > > > > >> as Nvidia uses NVlink, we use RDMA that we have inside our ASIC. > > > > > > > >> The RDMA implementation we did does NOT support some basic RDMA > > > > > > > >> IBverbs (such as MR and PD) and therefore, we can't use the rdma-core > > > > > > > >> library or to connect to the rdma infrastructure in the kernel. > > > > > > > > > > > > > > > > You can't create a parallel RDMA subsystem in netdev, or in misc, and > > > > > > > > you can't add random device offloads as IOCTL to nedevs. > > > > > > > > > > > > > > > > RDMA is the proper home for all the networking offloads that don't fit > > > > > > > > into netdev. > > > > > > > > > > > > > > > > EFA was able to fit into rdma-core/etc and it isn't even RoCE at > > > > > > > > all. I'm sure this can too. > > > > > > > > > > > > > > Well, EFA wasn't welcomed to the RDMA subsystem with open arms ;), initially it > > > > > > > was suggested to go through the vfio subsystem instead. > > > > > > > > > > > > > > I think this comes back to the discussion we had when EFA was upstreamed, which > > > > > > > is what's the bar to get accepted to the RDMA subsystem. 
> > > > > > > IIRC, what we eventually agreed on is having a userspace rdma-core provider and > > > > > > > ibv_{ud,rc}_pingpong working (or just supporting one of the IB spec's QP types?). > > > > > > > > > > > > > > Does GAUDI fit these requirements? If not, should it be in a different subsystem > > > > > > > or should we open the "what qualifies as an RDMA device" question again? > > > > > > > > > > > > I want to remind you that rdma-core requirement came to make sure that > > > > > > anything exposed from the RDMA to the userspace is strict with proper > > > > > > UAPI header hygiene. > > > > > > > > > > > > I doubt that Havana's ioctls are backed by anything like this. > > > > > > > > > > > > Thanks > > > > > > > > > > Why do you doubt that ? Have you looked at our code ? > > > > > Our uapi and IOCTLs interface is based on drm subsystem uapi interface > > > > > and it is very safe and protected. > > > > > > > > Yes, I looked and didn't find open-source users of your UAPI headers. > > > > It is not related to being safe or protected by to the common request > > > > to present userspace that relies on those exported interfaces. > > > > > > > > > Otherwise Greg would have never allowed me to go upstream in the first place. > > > > > > > > Nice, can we get a link? > > > > > > > > > > > > > > We have a single function which is the entry point for all the IOCTLs > > > > > of our drivers (only one IOCTL is RDMA related, all the others are > > > > > compute related). > > > > > That function is almost 1:1 copy of the function in drm. > > > > > > > > DRM has same rules as RDMA, no kernel code will be merged without seeing > > > > open-source userspace. > > > > > > > > Thanks > > > > > > > > > > > > > > Thanks, > > > > > Oded > > > > > > So we do have an open-source library called hl-thunk, which uses our > > > driver and indeed that was part of the requirement. > > > It is similar to libdrm. 
> > > Here is the link: > > > https://github.com/HabanaAI/hl-thunk > > > > Are you kidding? > > > > This is mirror of some internal repository that looks like dumpster > > with ChangeId, internal bug tracker numbers, not part of major OS > > distributions. > > > > It is not open-source library and shows very clear why you chose > > to upstream your driver through driver/misc/ tree. > > > > Thanks > > Adding Olof here. > > No, usually not. > But are you kidding ? > What did you exactly expect to find ? Is there an open-source project > somewhere that encapsulates Deep-learning accelerators which I could > connect to ? I would expect certain level of code quality, collaboration and review that distros require for inclusion. It is not the case for the github repo you presented. > AFAIK, the only thing remotely relevant is CUDA and that is > closed-source (strange to hear lectures about open-source from NVIDIA > people here...) Please check git log statistics to estimate Nvidia/Mellanox/Cumulus contributions to the Linux kernel and the open-source. You will be surprised. > > So we are trying to give to the community such an open source library, > or at least an example. Hopefully one day, when more companies > upstream their drivers for deep-learning accelerators we could do > something like libdrm or rdma-core, but for now, it's just our driver. AFAIR, your driver is not unique, HiSilicon tried to submit something similar years ago (warpdrive) and they are not alone. > > I have been in this community since 2013 with AMD and then RedHat, and > I come with good intentions and a desire to open source and upstream > as much as I can. I don't think I deserve this kind of response. There is no need to take it personal. It was you who posted a link to the github repo. What did you expect? 
> > The bottom line is that we had this discussion with Greg and Olof and > DRM people almost 2 years ago and if there was some open-source > project in user-space or some subsystem in the kernel we could connect > to, we would have done that instead of what we did, but the fact of > the matter is there isn't such a thing. Olof tried and is trying to create > a h/w accelerator subsystem but it still hasn't gotten off the ground > yet. Maybe it is time to do it right. > > Oded
On Fri, Sep 18, 2020 at 04:02:24PM +0300, Oded Gabbay wrote: > The problem with MR is that the API doesn't let us return a new VA. It > forces us to use the original VA that the Host OS allocated. If using the common MR API you'd have to assign a unique linear range in the single device address map and record both the IOVA and the MMU VA in the kernel struct. Then when submitting work using that MR lkey the kernel will adjust the work VA using the equation (WORK_VA - IOVA) + MMU_VA before forwarding to HW. EFA doesn't support rkeys, so they are not required to be emulated. It would have to create rkeys using some gaudidv_reg_mr_rkey(). It is important to understand that the usual way we support these non-RDMA devices is to insist that they use SW to construct a minimal standards based RDMA API, and then allow the device to have a 'dv' API to access a faster, highly device specific, SW bypass path. So for instance you might have some gaudidv_post_work(qp) that doesn't use lkeys and works directly on the MMU_VA. A gaudidv_get_mmu_va(mr) would return the required HW VA from the kernel. Usually the higher level communication library (UCX, MPI, etc) forms the dv primitives into something application usable. > we do if that VA is in the range of our HBM addresses ? The device > won't be able to distinguish between them. The transaction that is > generated by an engine inside our device will go to the HBM instead of > going to the PCI controller and then to the host. > > That's the crux of the problem and why we didn't use MR. No, the problem with the device is that it doesn't have an lkey/rkey, so it is stuck with a single translation domain. RoCE compliant devices are required to have multiple translation domains - each lkey/rkey specifies a unique translation. The MR concept is a region of process VA mapped into the device for device access, and this device *clearly* has that. Jason
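The slow-path fix-up Jason describes (record both the IOVA and the device-MMU VA at registration time, then rewrite each work-request VA as (WORK_VA - IOVA) + MMU_VA before it reaches HW) can be sketched in a few lines. This is illustrative C only; the struct and function names are invented for the example and do not come from any driver:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative only: per-MR record of the IOVA range handed back at
 * registration time and the VA actually programmed into the device MMU. */
struct mr_rec {
    uint64_t iova;    /* start of the unique linear range in the device map */
    uint64_t mmu_va;  /* corresponding base VA in the device MMU */
    uint64_t length;  /* length of the registered region */
};

/* Slow path: rewrite a work-request VA before forwarding to HW, using
 * the equation from the mail: (WORK_VA - IOVA) + MMU_VA.
 * Returns 0 on success, -1 if the VA falls outside the MR. */
int adjust_work_va(const struct mr_rec *mr, uint64_t work_va, uint64_t *hw_va)
{
    if (work_va < mr->iova || work_va >= mr->iova + mr->length)
        return -1;
    *hw_va = (work_va - mr->iova) + mr->mmu_va;
    return 0;
}
```

The per-submission subtraction and bounds check is what Oded objects to below as a throughput cost; the 'dv' fast path in Jason's proposal exists precisely to skip it.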
On Fri, Sep 18, 2020 at 4:26 PM Jason Gunthorpe <jgg@ziepe.ca> wrote: > > On Fri, Sep 18, 2020 at 04:02:24PM +0300, Oded Gabbay wrote: > > > The problem with MR is that the API doesn't let us return a new VA. It > > forces us to use the original VA that the Host OS allocated. > > If using the common MR API you'd have to assign a unique linear range > in the single device address map and record both the IOVA and the MMU > VA in the kernel struct. > > Then when submitting work using that MR lkey the kernel will adjust > the work VA using the equation (WORK_VA - IOVA) + MMU_VA before > forwarding to HW. > We can't do that. That will kill the performance. If for every submission I need to modify the packet's contents, the throughput will go downhill. Also, submissions to our RDMA qmans are coupled with submissions to our DMA/Compute QMANs. We can't separate those to different API calls. That will also kill performance and in addition, will prevent us from synchronizing all the engines. I also have to say, it troubles me that you keep referring to our device as an RDMA device. It is not an RDMA device. It is a deep-learning accelerator which uses RDMA as a way to interconnect multiple devices. We don't intend to replace General-Purpose RDMA devices. We know we don't support that. Therefore, I still fail to see why we need to support all the above... Our work submission is not to just "send/receive packets". Sending packets is part of a general recipe to do DMA, perform compute on data and send/receive data. All together, in a synchronized fashion. The way you try to force me to go is to separate that into different functionality, as if I have different ASICs, which is very counter-productive in terms of performance and simplicity. i.e. have one method of submitting work to DMA/compute and another way to RDMA ports. I know this is how the kernel is structured now - subsystems for devices that belong to a single domain (graphics, net, storage). 
But I fear that you will soon see this paradigm doesn't work with new devices in AI, which combine multiple domains into a single ASIC. Greg, I would love to hear your opinion here. Am I totally wrong ? Is treating a single ASIC that belongs to multiple domains as if it were multiple ASICs a good thing ? Don't you think it will hurt the performance ? Oded > EFA doesn't support rkeys, so they are not required to be emulated. It > would have to create rkeys using some gaudidv_reg_mr_rkey(). > > It is important to understand that the usual way we support these > non-RDMA devices is to insist that they use SW to construct a minimal > standards based RDMA API, and then allow the device to have a 'dv' API > to access a faster, highly device specific, SW bypass path. > > So for instance you might have some gaudidv_post_work(qp) that doesn't > use lkeys and works directly on the MMU_VA. A gaudidv_get_mmu_va(mr) > would return the required HW VA from the kernel. > > Usually the higher level communication library (UCX, MPI, etc) forms > the dv primitives into something application usable. > > > we do if that VA is in the range of our HBM addresses ? The device > > won't be able to distinguish between them. The transaction that is > > generated by an engine inside our device will go to the HBM instead of > > going to the PCI controller and then to the host. > > > > That's the crux of the problem and why we didn't use MR. > > No, the problem with the device is that it doesn't have an lkey/rkey, > so it is stuck with a single translation domain. RoCE compliant > devices are required to have multiple translation domains - each > lkey/rkey specifies a unique translation. > > The MR concept is a region of process VA mapped into the device for > device access, and this device *clearly* has that. > > Jason
On Fri, Sep 18, 2020 at 04:49:25PM +0300, Oded Gabbay wrote: > On Fri, Sep 18, 2020 at 4:26 PM Jason Gunthorpe <jgg@ziepe.ca> wrote: > > > > On Fri, Sep 18, 2020 at 04:02:24PM +0300, Oded Gabbay wrote: > > > > > The problem with MR is that the API doesn't let us return a new VA. It > > > forces us to use the original VA that the Host OS allocated. > > > > If using the common MR API you'd have to assign a unique linear range > > in the single device address map and record both the IOVA and the MMU > > VA in the kernel struct. > > > > Then when submitting work using that MR lkey the kernel will adjust > > the work VA using the equation (WORK_VA - IOVA) + MMU_VA before > > forwarding to HW. > > > We can't do that. That will kill the performance. If for every > submission I need to modify the packet's contents, the throughput will > go downhill. You clearly didn't read where I explained there is a fast path and slow path expectation. > Also, submissions to our RDMA qmans are coupled with submissions to > our DMA/Compute QMANs. We can't separate those to different API calls. > That will also kill performance and in addition, will prevent us from > synchronizing all the engines. Not sure I see why this is a problem. I already explained the fast device specific path. As long as the kernel maintains proper security when it processes submissions the driver can allow objects to cross between the two domains. Jason
On Fri, Sep 18, 2020 at 4:59 PM Jason Gunthorpe <jgg@ziepe.ca> wrote: > > On Fri, Sep 18, 2020 at 04:49:25PM +0300, Oded Gabbay wrote: > > On Fri, Sep 18, 2020 at 4:26 PM Jason Gunthorpe <jgg@ziepe.ca> wrote: > > > > > > On Fri, Sep 18, 2020 at 04:02:24PM +0300, Oded Gabbay wrote: > > > > > > > The problem with MR is that the API doesn't let us return a new VA. It > > > > forces us to use the original VA that the Host OS allocated. > > > > > > If using the common MR API you'd have to assign a unique linear range > > > in the single device address map and record both the IOVA and the MMU > > > VA in the kernel struct. > > > > > > Then when submitting work using that MR lkey the kernel will adjust > > > the work VA using the equation (WORK_VA - IOVA) + MMU_VA before > > > forwarding to HW. > > > > > We can't do that. That will kill the performance. If for every > > submission I need to modify the packet's contents, the throughput will > > go downhill. > > You clearly didn't read where I explained there is a fast path and > slow path expectation. > > > Also, submissions to our RDMA qmans are coupled with submissions to > > our DMA/Compute QMANs. We can't separate those to different API calls. > > That will also kill performance and in addition, will prevent us from > > synchronizing all the engines. > > Not sure I see why this is a problem. I already explained the fast > device specific path. > > As long as the kernel maintains proper security when it processes > submissions the driver can allow objects to cross between the two > domains. Can you please explain what you mean by "two domains" ? You mean the RDMA and compute domains ? Or something else ? What I was trying to say is that I don't want the application to split its submissions to different system calls. Currently we perform submissions through the CS_IOCTL that is defined in our driver. 
It is a single IOCTL which allows the user to submit work to all queues, without regard to the underlying engine of each queue. If I need to split that to different system calls it will have major implications. I don't even want to start thinking about all the synchronization at the host (userspace) level that I will need to do. That's what I meant by saying that you force me to treat my device as if it were multiple devices. The whole point of our ASIC is to combine multiple IPs on the same ASIC. What will happen when we will add a third domain to our device (e.g. storage, video decoding, encryption engine, whatever). Will I then need to separate submissions to 3 different system calls ? In 3 different subsystems ? This doesn't scale. And I strongly say that it will kill the performance of the device. Not because of the driver. Because of the complications to the user-space. Oded > > Jason
On Fri, Sep 18, 2020 at 05:12:04PM +0300, Oded Gabbay wrote: > On Fri, Sep 18, 2020 at 4:59 PM Jason Gunthorpe <jgg@ziepe.ca> wrote: > > > > On Fri, Sep 18, 2020 at 04:49:25PM +0300, Oded Gabbay wrote: > > > On Fri, Sep 18, 2020 at 4:26 PM Jason Gunthorpe <jgg@ziepe.ca> wrote: > > > > > > > > On Fri, Sep 18, 2020 at 04:02:24PM +0300, Oded Gabbay wrote: > > > > > > > > > The problem with MR is that the API doesn't let us return a new VA. It > > > > > forces us to use the original VA that the Host OS allocated. > > > > > > > > If using the common MR API you'd have to assign a unique linear range > > > > in the single device address map and record both the IOVA and the MMU > > > > VA in the kernel struct. > > > > > > > > Then when submitting work using that MR lkey the kernel will adjust > > > > the work VA using the equation (WORK_VA - IOVA) + MMU_VA before > > > > forwarding to HW. > > > > > > > We can't do that. That will kill the performance. If for every > > > submission I need to modify the packet's contents, the throughput will > > > go downhill. > > > > You clearly didn't read where I explained there is a fast path and > > slow path expectation. > > > > > Also, submissions to our RDMA qmans are coupled with submissions to > > > our DMA/Compute QMANs. We can't separate those to different API calls. > > > That will also kill performance and in addition, will prevent us from > > > synchronizing all the engines. > > > > Not sure I see why this is a problem. I already explained the fast > > device specific path. > > > > As long as the kernel maintains proper security when it processes > > submissions the driver can allow objects to cross between the two > > domains. > Can you please explain what you mean by "two domains" ? > You mean the RDMA and compute domains ? Or something else ? Yes > What I was trying to say is that I don't want the application to split > its submissions to different system calls. If you can manage the security then you can cross them. 
E.g. since the RDMA PD would be created on top of the /dev/misc char dev then it is fine for the /dev/misc char dev to access the RDMA objects as a 'dv fast path'. But now that you say everything is interconnected, I'm wondering, without HW security how do you keep netdev isolated from userspace? Can I issue commands to /dev/misc and write to kernel memory (does the kernel put any pages into the single MMU?) or corrupt the netdev driver operations in any way? Jason
On Fri, Sep 18, 2020 at 5:19 PM Jason Gunthorpe <jgg@ziepe.ca> wrote: > > On Fri, Sep 18, 2020 at 05:12:04PM +0300, Oded Gabbay wrote: > > On Fri, Sep 18, 2020 at 4:59 PM Jason Gunthorpe <jgg@ziepe.ca> wrote: > > > > > > On Fri, Sep 18, 2020 at 04:49:25PM +0300, Oded Gabbay wrote: > > > > On Fri, Sep 18, 2020 at 4:26 PM Jason Gunthorpe <jgg@ziepe.ca> wrote: > > > > > > > > > > On Fri, Sep 18, 2020 at 04:02:24PM +0300, Oded Gabbay wrote: > > > > > > > > > > > The problem with MR is that the API doesn't let us return a new VA. It > > > > > > forces us to use the original VA that the Host OS allocated. > > > > > > > > > > If using the common MR API you'd have to assign a unique linear range > > > > > in the single device address map and record both the IOVA and the MMU > > > > > VA in the kernel struct. > > > > > > > > > > Then when submitting work using that MR lkey the kernel will adjust > > > > > the work VA using the equation (WORK_VA - IOVA) + MMU_VA before > > > > > forwarding to HW. > > > > > > > > > We can't do that. That will kill the performance. If for every > > > > submission I need to modify the packet's contents, the throughput will > > > > go downhill. > > > > > > You clearly didn't read where I explained there is a fast path and > > > slow path expectation. > > > > > > > Also, submissions to our RDMA qmans are coupled with submissions to > > > > our DMA/Compute QMANs. We can't separate those to different API calls. > > > > That will also kill performance and in addition, will prevent us from > > > > synchronizing all the engines. > > > > > > Not sure I see why this is a problem. I already explained the fast > > > device specific path. > > > > > > As long as the kernel maintains proper security when it processes > > > submissions the driver can allow objects to cross between the two > > > domains. > > Can you please explain what you mean by "two domains" ? > > You mean the RDMA and compute domains ? Or something else ? 
> > Yes > > > What I was trying to say is that I don't want the application to split > > its submissions to different system calls. > > If you can manage the security then you can cross them. E.g. since the > RDMA PD would be created on top of the /dev/misc char dev then it is > fine for the /dev/misc char dev to access the RDMA objects as a 'dv > fast path'. > > But now that you say everything is interconnected, I'm wondering, > without HW security how do you keep netdev isolated from userspace? > > Can I issue commands to /dev/misc and write to kernel memory (does the > kernel put any pages into the single MMU?) or corrupt the netdev > driver operations in any way? > > Jason No, no, no. Please give me more credit :) btw, our kernel interface was scrutinized when we upstreamed the driver and it was under review by the Intel security team. To explain our security mechanism will require some time. It is detailed in the driver, but it is hard to understand without some background. I wonder where to start... First of all, we support open, close, mmap and IOCTLs to /dev/misc/hlX. We don't support read/write system calls. A user never gets direct access to kernel memory. Only through standard mmap. The only thing we allow to mmap is a command buffer (which is used to submit work to certain DMA queues on our device) and to a memory region we use for "CQ" for the RDMA. That's it. Any access by the device's engines to the host memory is done via our device's MMU. Our MMU supports multiple ASIDs - Address Space IDs. The kernel driver is assigned ASID 0, while the user is assigned ASID 1. We can support up to 1024 ASIDs, but because we limit the user to have a single application, we only use ASID 0 and 1. The above means a user can't program an engine (DMA, NIC, compute) to access memory he didn't first map into our device's MMU. 
The mapping is done via one of our IOCTLs and the kernel driver makes sure (using standard kernel internal APIs) the host memory truly belongs to the user process. All those mappings are done using ASID 1. If the driver needs to map kernel pages into the device's MMU, then this is done using ASID 0. This is how we take care of separation between kernel memory and user memory. Each transaction our engines create that is going to the host first passes through our MMU. The transaction comes with its ASID value. According to that, the MMU knows which page tables to do the walk on. Specifically regarding RDMA, the user prepares a WQE on the host memory in an area which is mapped into our MMU using ASID 1. The user uses the NIC control IOCTL to give the kernel driver the virtual base address of the WQ and the driver programs it to the H/W. Then, the user can submit the WQE by submitting a command buffer to the NIC QMAN. The command buffer contains a message to the QMAN that tells it to ring the doorbell of the relevant NIC port. The user can't do it from userspace. For regular Ethernet traffic, we don't have any IOCTLs of course. All Ethernet operations are done via the standard networking subsystem (sockets, etc.). There are more details of course. I don't know how much you want me to go deeper. If you have specific questions I'll be happy to answer. Oded
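The ASID separation Oded describes can be modeled as one translation table per ASID, with every transaction tagged by the ASID whose tables the MMU walks. The following is a toy C model for illustration only; none of these structures or names exist in the habanalabs driver, and the real hardware walks multi-level page tables rather than a flat list:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_ASIDS    2   /* ASID 0 = kernel driver, ASID 1 = user, per the mail */
#define MAX_MAPPINGS 16

struct asid_mapping {
    uint64_t va, pa, len;
};

/* One set of "page tables" (here just a flat mapping list) per ASID.
 * A transaction arrives tagged with its ASID and the walk only consults
 * that ASID's tables, so an engine acting for the user (ASID 1) can
 * never resolve a VA that was mapped for the kernel (ASID 0). */
static struct asid_mapping tables[NUM_ASIDS][MAX_MAPPINGS];
static size_t n_maps[NUM_ASIDS];

int mmu_map(unsigned int asid, uint64_t va, uint64_t pa, uint64_t len)
{
    if (asid >= NUM_ASIDS || n_maps[asid] == MAX_MAPPINGS)
        return -1;
    tables[asid][n_maps[asid]++] = (struct asid_mapping){ va, pa, len };
    return 0;
}

/* Translate a VA for a transaction tagged with 'asid'; -1 means fault. */
int mmu_translate(unsigned int asid, uint64_t va, uint64_t *pa)
{
    if (asid >= NUM_ASIDS)
        return -1;
    for (size_t i = 0; i < n_maps[asid]; i++) {
        const struct asid_mapping *m = &tables[asid][i];
        if (va >= m->va && va < m->va + m->len) {
            *pa = m->pa + (va - m->va);
            return 0;
        }
    }
    return -1;
}
```

In this model a user-tagged transaction targeting a kernel-only VA simply faults, which is the isolation property the security review relied on.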
On Fri, Sep 18, 2020 at 05:45:21PM +0300, Oded Gabbay wrote: > Any access by the device's engines to the host memory is done via our > device's MMU. Our MMU supports multiple ASIDs - Address Space IDs. The > kernel driver is assigned ASID 0, while the user is assigned ASID 1. > We can support up to 1024 ASIDs, but because we limit the user to have > a single application, we only use ASID 0 and 1. If the QP/WQ/etc is HW bound to an ASID then that binding is called a PD and the ASID is acting in the PD role. If the ASID is translating from on the wire IOVA to DMA PA, then it is acting in the MR role as well. Bundling those two things together is not as flexible as standards based RDMA, but it is not as far away as you are making things out to be. Jason
On Fri, Sep 18, 2020 at 6:07 PM Jason Gunthorpe <jgg@ziepe.ca> wrote: > > On Fri, Sep 18, 2020 at 05:45:21PM +0300, Oded Gabbay wrote: > > > Any access by the device's engines to the host memory is done via our > > device's MMU. Our MMU supports multiple ASIDs - Address Space IDs. The > > kernel driver is assigned ASID 0, while the user is assigned ASID 1. > > We can support up to 1024 ASIDs, but because we limit the user to have > > a single application, we only use ASID 0 and 1. > > If the QP/WQ/etc is HW bound to an ASID then that binding is called a > PD and the ASID is acting in the PD role. > > If the ASID is translating from on the wire IOVA to DMA PA, then it is > acting in the MR role as well. > > Bundling those two things together is not as flexible as standards > based RDMA, but it is not as far away as you are making things out to > be. > > Jason But Jason, why do I need to use RDMA definitions in my common code ? RDMA is such a small part of our ASIC. We also have an ASIC called GOYA for inference, which is handled by the same driver, but doesn't have RDMA ports at all. Why would I need to use RDMA definitions for that ? I'm sorry, but you won't be able to convince me here that I need to "enslave" my entire code to RDMA, just because my ASIC "also" has some RDMA ports. On the same weight, the GPU people tried and failed to say that my device is a GPU. And I think the reasoning that we applied back then, and Greg and Olof agreed with it, applies here as well. I want to play along, but it has to be something that won't make my entire device's driver into an RDMA driver. And it has to be something that doesn't hurt performance. All other things can and will be changed according to your inputs. Thanks, Oded
On Fri, Sep 18, 2020 at 06:15:52PM +0300, Oded Gabbay wrote: > I'm sorry, but you won't be able to convince me here that I need to > "enslave" my entire code to RDMA, just because my ASIC "also" has some > RDMA ports. You can't recreate common shared subsystems in a driver just because you don't want to work with the subsystem. I don't care what else the ASIC has. In Linux the netdev part is exposed through netdev, the RDMA part through RDMA, the totally-not-a-GPU part through drivers/misc. It has always been this way. Chelsio didn't get to rebuild the SCSI stack in their driver just because "storage is a small part of their device". Drivers are not allowed to re-implement I2C/SPI/etc without re-using the common code for that just because "I2C is a small part of their device". Exposing to userspace the creation of RoCE QPs and their related objects is unambiguously an RDMA subsystem task. I don't even know how you think you can argue it is not. It is your company proudly claiming the device has 100G RoCE ports in all the marketing literature, after all. It is too bad the device has a non-standards compliant implementation of RoCE so this will be a bit hard for you. Oh well. Jason
On Fri, Sep 18, 2020 at 03:19:05PM +0300, Leon Romanovsky wrote: > > So we do have an open-source library called hl-thunk, which uses our > > driver and indeed that was part of the requirement. > > It is similar to libdrm. > > Here is the link: > > https://github.com/HabanaAI/hl-thunk > > Are you kidding? > > This is mirror of some internal repository that looks like dumpster > with ChangeId, internal bug tracker numbers, not part of major OS > distributions. > > It is not open-source library and shows very clear why you chose > to upstream your driver through driver/misc/ tree. It is an open source library, as per the license and the code availability. What more is expected here? No distro has to pick it up, that's not a requirement for kernel code, we have many kernel helper programs that are not in distros. Heck, udev took a long time to get into distros, does that mean the kernel side of that interface should never have been merged? I don't understand your complaint here, it's not our place to judge the code quality of userspace libraries, otherwise we would never get any real-work done :) thanks, greg k-h
On Sat, Sep 19, 2020 at 08:40:20AM +0200, Greg Kroah-Hartman wrote: > On Fri, Sep 18, 2020 at 03:19:05PM +0300, Leon Romanovsky wrote: > > > So we do have an open-source library called hl-thunk, which uses our > > > driver and indeed that was part of the requirement. > > > It is similar to libdrm. > > > Here is the link: > > > https://github.com/HabanaAI/hl-thunk > > > > Are you kidding? > > > > This is mirror of some internal repository that looks like dumpster > > with ChangeId, internal bug tracker numbers, not part of major OS > > distributions. > > > > It is not open-source library and shows very clear why you chose > > to upstream your driver through driver/misc/ tree. > > It is an open source library, as per the license and the code > availability. What more is expected here? So can I fork iproute2, add bunch of new custom netlink UAPIs and expect Dave to merge it after I throw it on github? > > No distro has to pick it up, that's not a requirement for kernel code, > we have many kernel helper programs that are not in distros. Heck, udev > took a long time to get into distros, does that mean the kernel side of > that interface should never have been merged? > > I don't understand your complaint here, it's not our place to judge the > code quality of userspace libraries, otherwise we would never get any > real-work done :) My main complaint is that you can't imagine merging code into large subsystems (netdev, RDMA, DRM? e.t.c) without being civil open-source citizen. It means use of existing user-space libraries/tools and/or providing new ones that will be usable for everyone. In this case, we have some custom char device with library that is not usable for anyone else and this is why drivers/misc/ is right place. While we are talking about real-work, it is our benefit to push companies to make investment into ecosystem and not letting them to find an excuse for not doing it. Thanks > > thanks, > > greg k-h
On Sat, Sep 19, 2020 at 11:20:03AM +0300, Leon Romanovsky wrote: > On Sat, Sep 19, 2020 at 08:40:20AM +0200, Greg Kroah-Hartman wrote: > > On Fri, Sep 18, 2020 at 03:19:05PM +0300, Leon Romanovsky wrote: > > > > So we do have an open-source library called hl-thunk, which uses our > > > > driver and indeed that was part of the requirement. > > > > It is similar to libdrm. > > > > Here is the link: > > > > https://github.com/HabanaAI/hl-thunk > > > > > > Are you kidding? > > > > > > This is mirror of some internal repository that looks like dumpster > > > with ChangeId, internal bug tracker numbers, not part of major OS > > > distributions. > > > > > > It is not open-source library and shows very clear why you chose > > > to upstream your driver through driver/misc/ tree. > > > > It is an open source library, as per the license and the code > > availability. What more is expected here? > > So can I fork iproute2, add bunch of new custom netlink UAPIs and expect > Dave to merge it after I throw it on github? Don't be silly, that's not the case here at all and you know that. > > No distro has to pick it up, that's not a requirement for kernel code, > > we have many kernel helper programs that are not in distros. Heck, udev > > took a long time to get into distros, does that mean the kernel side of > > that interface should never have been merged? > > > > I don't understand your complaint here, it's not our place to judge the > > code quality of userspace libraries, otherwise we would never get any > > real-work done :) > > My main complaint is that you can't imagine merging code into large > subsystems (netdev, RDMA, DRM? e.t.c) without being civil open-source > citizen. It means use of existing user-space libraries/tools and/or > providing new ones that will be usable for everyone. Agreed. > In this case, we have some custom char device with library that is not > usable for anyone else and this is why drivers/misc/ is right place. Also agreed. 
> While we are talking about real-work, it is our benefit to push companies > to make investment into ecosystem and not letting them to find an excuse > for not doing it. So why are you complaining about a stand-alone driver that does not have any shared subsystems' userspace code to control that driver? Yes, when integrating into other subsystems (i.e. networking and rdma), they should use those common subsystems interfaces, no one is arguing that at all. totally lost, greg k-h
On Sat, Sep 19, 2020 at 10:30:12AM +0200, Greg Kroah-Hartman wrote: > On Sat, Sep 19, 2020 at 11:20:03AM +0300, Leon Romanovsky wrote: > > On Sat, Sep 19, 2020 at 08:40:20AM +0200, Greg Kroah-Hartman wrote: > > > On Fri, Sep 18, 2020 at 03:19:05PM +0300, Leon Romanovsky wrote: > > > > > So we do have an open-source library called hl-thunk, which uses our > > > > > driver and indeed that was part of the requirement. > > > > > It is similar to libdrm. > > > > > Here is the link: > > > > > https://github.com/HabanaAI/hl-thunk > > > > > > > > Are you kidding? > > > > > > > > This is mirror of some internal repository that looks like dumpster > > > > with ChangeId, internal bug tracker numbers, not part of major OS > > > > distributions. > > > > > > > > It is not open-source library and shows very clear why you chose > > > > to upstream your driver through driver/misc/ tree. > > > > > > It is an open source library, as per the license and the code > > > availability. What more is expected here? > > > > So can I fork iproute2, add bunch of new custom netlink UAPIs and expect > > Dave to merge it after I throw it on github? > > Don't be silly, that's not the case here at all and you know that. It was far-fetched example. > > > > No distro has to pick it up, that's not a requirement for kernel code, > > > we have many kernel helper programs that are not in distros. Heck, udev > > > took a long time to get into distros, does that mean the kernel side of > > > that interface should never have been merged? > > > > > > I don't understand your complaint here, it's not our place to judge the > > > code quality of userspace libraries, otherwise we would never get any > > > real-work done :) > > > > My main complaint is that you can't imagine merging code into large > > subsystems (netdev, RDMA, DRM? e.t.c) without being civil open-source > > citizen. It means use of existing user-space libraries/tools and/or > > providing new ones that will be usable for everyone. > > Agreed. 
> > > In this case, we have some custom char device with library that is not > > usable for anyone else and this is why drivers/misc/ is right place. > > Also agreed. > > > While we are talking about real-work, it is our benefit to push companies > > to make investment into ecosystem and not letting them to find an excuse > > for not doing it. > > So why are you complaining about a stand-alone driver that does not have > any shared subsystems's userspace code to control that driver? I didn't, everything started when I explained to Gal why RDMA subsystem requires rdma-core counterpart for any UAPI code. https://lore.kernel.org/linux-rdma/CAFCwf12B4vCCwmfA7+VTUYUgJ9EHAtvg6F0bMYnsSCUBST+aWA@mail.gmail.com/T/#m17d52d61adadf54c12bfecf1af5db40f5d829ac3 And expressed my view on the quality of the library that was presented as open-source example. https://lore.kernel.org/linux-rdma/CAFCwf12B4vCCwmfA7+VTUYUgJ9EHAtvg6F0bMYnsSCUBST+aWA@mail.gmail.com/T/#m9059c5a9405ba932d9ffb731195a43b27443d265 > > Yes, when integrating into other subsystems (i.e. networking and rdma), > they should use those common subsystems interfaces, no one is arguing > that at all. > > totally lost, And here comes my request to do it right https://lore.kernel.org/linux-rdma/CAFCwf12B4vCCwmfA7+VTUYUgJ9EHAtvg6F0bMYnsSCUBST+aWA@mail.gmail.com/T/#ma1fa6fe63666f630674eb668f1c00e6a672db85b All that I asked from Oded is to do UAPI/libraries right, while all the responses can be summarized to one sentence - "it is too hard, we don't want to do it." Thanks > > greg k-h
On Sat, Sep 19, 2020 at 11:30 AM Greg Kroah-Hartman
<gregkh@linuxfoundation.org> wrote:
>
> On Sat, Sep 19, 2020 at 11:20:03AM +0300, Leon Romanovsky wrote:
> > On Sat, Sep 19, 2020 at 08:40:20AM +0200, Greg Kroah-Hartman wrote:
> > > On Fri, Sep 18, 2020 at 03:19:05PM +0300, Leon Romanovsky wrote:
> > > > > So we do have an open-source library called hl-thunk, which uses our
> > > > > driver and indeed that was part of the requirement.
> > > > > It is similar to libdrm.
> > > > > Here is the link:
> > > > > https://github.com/HabanaAI/hl-thunk
> > > >
> > > > Are you kidding?
> > > >
> > > > This is a mirror of some internal repository that looks like a dumpster
> > > > with ChangeId, internal bug tracker numbers, not part of major OS
> > > > distributions.
> > > >
> > > > It is not an open-source library, and it shows very clearly why you chose
> > > > to upstream your driver through the driver/misc/ tree.
> > >
> > > It is an open source library, as per the license and the code
> > > availability. What more is expected here?
> >
> > So can I fork iproute2, add a bunch of new custom netlink UAPIs and expect
> > Dave to merge it after I throw it on github?
>
> Don't be silly, that's not the case here at all and you know that.
>
> > > No distro has to pick it up, that's not a requirement for kernel code,
> > > we have many kernel helper programs that are not in distros. Heck, udev
> > > took a long time to get into distros, does that mean the kernel side of
> > > that interface should never have been merged?
> > >
> > > I don't understand your complaint here, it's not our place to judge the
> > > code quality of userspace libraries, otherwise we would never get any
> > > real work done :)
> >
> > My main complaint is that you can't imagine merging code into large
> > subsystems (netdev, RDMA, DRM, etc.) without being a civil open-source
> > citizen. That means use of existing user-space libraries/tools and/or
> > providing new ones that will be usable for everyone.
>
> Agreed.
>
> > In this case, we have some custom char device with a library that is not
> > usable for anyone else, and this is why drivers/misc/ is the right place.
>
> Also agreed.
>
> > While we are talking about real work, it is to our benefit to push companies
> > to make an investment into the ecosystem and not let them find an excuse
> > for not doing it.
>
> So why are you complaining about a stand-alone driver that does not have
> any shared subsystem's userspace code to control that driver?
>
> Yes, when integrating into other subsystems (i.e. networking and rdma),
> they should use those common subsystems' interfaces, no one is arguing
> that at all.

Hi Greg,

It's probably heresy, but why do I need to integrate into the RDMA subsystem?

I understand your reasoning about networking (Ethernet), as the driver
connects to the kernel networking stack (netdev), but with RDMA the driver
doesn't use or connect to anything in that stack. If I were to support
IBverbs and declare that I support it, then of course I would need to
integrate into the RDMA subsystem and add my backend to rdma-core. But we
don't do that, so why am I being forced to support IBverbs?

Forcing GAUDI to use the RDMA stack and IBverbs is like swatting flies with
a sledgehammer. I do hope that in future devices we will support it
natively, and of course then we will integrate as requested, but for GAUDI
it is just a huge overkill IMHO.

Thanks,
Oded

> totally lost,
>
> greg k-h
On Sat, Sep 19, 2020 at 07:43:28PM +0300, Oded Gabbay wrote:
> On Sat, Sep 19, 2020 at 11:30 AM Greg Kroah-Hartman
> <gregkh@linuxfoundation.org> wrote:
> >
> > On Sat, Sep 19, 2020 at 11:20:03AM +0300, Leon Romanovsky wrote:
> > > On Sat, Sep 19, 2020 at 08:40:20AM +0200, Greg Kroah-Hartman wrote:
> > > > On Fri, Sep 18, 2020 at 03:19:05PM +0300, Leon Romanovsky wrote:
> > > > > > So we do have an open-source library called hl-thunk, which uses our
> > > > > > driver and indeed that was part of the requirement.
> > > > > > It is similar to libdrm.
> > > > > > Here is the link:
> > > > > > https://github.com/HabanaAI/hl-thunk
> > > > >
> > > > > Are you kidding?
> > > > >
> > > > > This is a mirror of some internal repository that looks like a dumpster
> > > > > with ChangeId, internal bug tracker numbers, not part of major OS
> > > > > distributions.
> > > > >
> > > > > It is not an open-source library, and it shows very clearly why you chose
> > > > > to upstream your driver through the driver/misc/ tree.
> > > >
> > > > It is an open source library, as per the license and the code
> > > > availability. What more is expected here?
> > >
> > > So can I fork iproute2, add a bunch of new custom netlink UAPIs and expect
> > > Dave to merge it after I throw it on github?
> >
> > Don't be silly, that's not the case here at all and you know that.
> >
> > > > No distro has to pick it up, that's not a requirement for kernel code,
> > > > we have many kernel helper programs that are not in distros. Heck, udev
> > > > took a long time to get into distros, does that mean the kernel side of
> > > > that interface should never have been merged?
> > > >
> > > > I don't understand your complaint here, it's not our place to judge the
> > > > code quality of userspace libraries, otherwise we would never get any
> > > > real work done :)
> > >
> > > My main complaint is that you can't imagine merging code into large
> > > subsystems (netdev, RDMA, DRM, etc.) without being a civil open-source
> > > citizen. That means use of existing user-space libraries/tools and/or
> > > providing new ones that will be usable for everyone.
> >
> > Agreed.
> >
> > > In this case, we have some custom char device with a library that is not
> > > usable for anyone else, and this is why drivers/misc/ is the right place.
> >
> > Also agreed.
> >
> > > While we are talking about real work, it is to our benefit to push companies
> > > to make an investment into the ecosystem and not let them find an excuse
> > > for not doing it.
> >
> > So why are you complaining about a stand-alone driver that does not have
> > any shared subsystem's userspace code to control that driver?
> >
> > Yes, when integrating into other subsystems (i.e. networking and rdma),
> > they should use those common subsystems' interfaces, no one is arguing
> > that at all.
>
> Hi Greg,
> It's probably heresy, but why do I need to integrate into the RDMA subsystem?
> I understand your reasoning about networking (Ethernet), as the driver
> connects to the kernel networking stack (netdev), but with RDMA the
> driver doesn't use or connect to anything in that stack. If I were to
> support IBverbs and declare that I support it, then of course I would
> need to integrate into the RDMA subsystem and add my backend to
> rdma-core.

IBverbs are horrid and I would not wish them on anyone. Seriously.

> But we don't do that, so why am I being forced to support IBverbs?

You shouldn't.

> Forcing GAUDI to use the RDMA stack and IBverbs is like swatting flies
> with a sledgehammer.
> I do hope that in future devices we will support it natively and of
> course then we will integrate as requested, but for GAUDI it is just a
> huge overkill IMHO.

I think the general rdma apis are the key here, not the userspace api.
Note, I do not know exactly what they are, but no, IBverbs are not ok.

Ick.

greg k-h
On Sat, Sep 19, 2020 at 07:43:28PM +0300, Oded Gabbay wrote:
> It's probably heresy, but why do I need to integrate into the RDMA subsystem ?
Hi Oded
I don't know the RDMA subsystem at all, so I will give a more generic
answer. Are you reinventing things which a subsystem core already has?
The subsystem core will be well tested, since lots of devices use
it. Because of this, subsystem cores generally have a lower bug count
per line of code than driver code. Using core code means drivers are
smaller, and smaller code has less bugs by definition.
We as maintainers have to assume you are going to abandon the driver
at some point, while the hardware still exists, and leave the
community to maintain it. So a smaller driver, which makes heavy use
of the core is much easier to maintain.
By making use of core code, you also get freebies. Somebody adds new
functionality to the core, your driver automatically gets it.
Look at this from the opposite perspective. Say every driver
implemented their own TCP/IP stack? Or DMA engine? SPI infrastructure?
How big a nightmare would it be to maintain?
In your case, some parts of your hardware look a bit like RDMA? So you
ideally want to use the core code from the RDMA subsystem. Maybe you
just need some of the lower layers? Maybe you need to refactor some of
the RDMA core to make it a library you can pick and choose the bits
useful to you? What you really want to avoid is re-implementing stuff
in your driver which is already in the core.
Andrew
On Sat, Sep 19, 2020 at 07:27:30PM +0200, Greg Kroah-Hartman wrote:
> > It's probably heresy, but why do I need to integrate into the RDMA subsystem?
> > I understand your reasoning about networking (Ethernet), as the driver
> > connects to the kernel networking stack (netdev), but with RDMA the
> > driver doesn't use or connect to anything in that stack. If I were to
> > support IBverbs and declare that I support it, then of course I would
> > need to integrate into the RDMA subsystem and add my backend to
> > rdma-core.
>
> IBverbs are horrid and I would not wish them on anyone. Seriously.

I'm curious what drives this opinion? Did you have it since you reviewed
the initial submission all those years ago?

> I think the general rdma apis are the key here, not the userspace api.

Are you proposing that habana should have uAPI in drivers/misc and present
a standard rdma-core userspace for it? This is the only userspace
programming interface for RoCE HW. I think that would be much more work.

If not, what open source userspace are you going to ask them to present to
merge the kernel side into misc?

> Note, I do not know exactly what they are, but no, IBverbs are not ok.

Should we stop merging new drivers and abandon the RDMA subsystem? Is
there something you'd like to see fixed?

Don't really understand your position, sorry.

Jason
On Sat, Sep 19, 2020 at 04:22:35PM -0300, Jason Gunthorpe wrote:
> On Sat, Sep 19, 2020 at 07:27:30PM +0200, Greg Kroah-Hartman wrote:
> > > It's probably heresy, but why do I need to integrate into the RDMA subsystem?
> > > I understand your reasoning about networking (Ethernet), as the driver
> > > connects to the kernel networking stack (netdev), but with RDMA the
> > > driver doesn't use or connect to anything in that stack. If I were to
> > > support IBverbs and declare that I support it, then of course I would
> > > need to integrate into the RDMA subsystem and add my backend to
> > > rdma-core.
> >
> > IBverbs are horrid and I would not wish them on anyone. Seriously.
>
> I'm curious what drives this opinion? Did you have it since you
> reviewed the initial submission all those years ago?

As I learned more about that interface, yes, I like it less and less :)

But that's the userspace api you all are stuck with, for various reasons,
so my opinion doesn't matter here.

> > I think the general rdma apis are the key here, not the userspace api.
>
> Are you proposing that habana should have uAPI in drivers/misc and
> present a standard rdma-core userspace for it? This is the only
> userspace programming interface for RoCE HW. I think that would be
> much more work.
>
> If not, what open source userspace are you going to ask them to
> present to merge the kernel side into misc?

I don't think that they have a userspace api to their rdma feature from
what I understand, but I could be totally wrong as I do not know their
hardware at all, so I'll let them answer this question.

> > Note, I do not know exactly what they are, but no, IBverbs are not ok.
>
> Should we stop merging new drivers and abandon the RDMA subsystem? Is
> there something you'd like to see fixed?
>
> Don't really understand your position, sorry.

For anything that _has_ to have a userspace RDMA interface, sure, ibverbs
are the ones we are stuck with, but I didn't think that was the issue here
at all, which is why I wrote the above comments.

thanks,

greg k-h
On Wed, Sep 16, 2020 at 02:00:54PM +0200, Greg Kroah-Hartman wrote:
> On Wed, Sep 16, 2020 at 11:47:58AM +0300, Oded Gabbay wrote:
> > On Wed, Sep 16, 2020 at 11:21 AM Greg Kroah-Hartman
> > <gregkh@linuxfoundation.org> wrote:
> > >
> > > On Wed, Sep 16, 2020 at 11:02:39AM +0300, Oded Gabbay wrote:
> > > > On Wed, Sep 16, 2020 at 10:41 AM Greg Kroah-Hartman
> > > > <gregkh@linuxfoundation.org> wrote:
> > > > >
> > > > > On Wed, Sep 16, 2020 at 09:36:23AM +0300, Oded Gabbay wrote:
> > > > > > On Wed, Sep 16, 2020 at 9:25 AM Greg Kroah-Hartman
> > > > > > <gregkh@linuxfoundation.org> wrote:
> > > > > > >
> > > > > > > On Tue, Sep 15, 2020 at 11:49:12PM +0300, Oded Gabbay wrote:
> > > > > > > > On Tue, Sep 15, 2020 at 11:42 PM David Miller <davem@davemloft.net> wrote:
> > > > > > > > >
> > > > > > > > > From: Oded Gabbay <oded.gabbay@gmail.com>
> > > > > > > > > Date: Tue, 15 Sep 2020 20:10:08 +0300
> > > > > > > > >
> > > > > > > > > > This is the second version of the patch-set to upstream the GAUDI NIC code
> > > > > > > > > > into the habanalabs driver.
> > > > > > > > > >
> > > > > > > > > > The only modification from v2 is in the ethtool patch (patch 12). Details
> > > > > > > > > > are in that patch's commit message.
> > > > > > > > > >
> > > > > > > > > > Link to v2 cover letter:
> > > > > > > > > > https://lkml.org/lkml/2020/9/12/201
> > > > > > > > >
> > > > > > > > > I agree with Jakub, this driver definitely can't go in as it is currently
> > > > > > > > > structured and designed.
> > > > > > > >
> > > > > > > > Why is that?
> > > > > > > > Can you please point to the things that bother you or are not working correctly?
> > > > > > > > I can't really fix the driver if I don't know what's wrong.
> > > > > > > >
> > > > > > > > In addition, please read my reply to Jakub with the explanation of why
> > > > > > > > we designed this driver as is.
> > > > > > > >
> > > > > > > > > And because of the RDMA'ness of it, the RDMA
> > > > > > > > > folks have to be CC:'d and have a chance to review this.
> > > > > > > >
> > > > > > > > As I said to Jakub, the driver doesn't use the RDMA infrastructure in
> > > > > > > > the kernel and we can't connect to it due to the lack of H/W support
> > > > > > > > we have.
> > > > > > > > Therefore, I don't see why we need to CC linux-rdma.
> > > > > > > > I understood why Greg asked me to CC you because we do connect to the
> > > > > > > > netdev and standard eth infrastructure, but regarding the RDMA, it's
> > > > > > > > not really the same.
> > > > > > >
> > > > > > > Ok, to do this "right" it needs to be split up into separate drivers,
> > > > > > > hopefully using the "virtual bus" code that some day Intel will resubmit
> > > > > > > again that will solve this issue.
> > > > > >
> > > > > > Hi Greg,
> > > > > > Can I suggest an alternative for the short/medium term?
> > > > > >
> > > > > > In an earlier email, Jakub said:
> > > > > > "Is it not possible to move the files and still build them into a single
> > > > > > module?"
> > > > > >
> > > > > > I thought maybe that's a good way to progress here?
> > > > >
> > > > > Cross-directory builds of a single module are crazy. Yes, they work,
> > > > > but really, that's a mess, and I would never suggest doing that.
> > > > >
> > > > > > First, split the content into Ethernet and RDMA.
> > > > > > Then move the Ethernet part to drivers/net but build it as part of
> > > > > > habanalabs.ko.
> > > > > > Regarding the RDMA code, upstream/review it in a different patch-set
> > > > > > (maybe they will want me to put the files elsewhere).
> > > > > >
> > > > > > What do you think?
> > > > >
> > > > > I think you are asking for more work there than just splitting out into
> > > > > separate modules :)
> > > > >
> > > > > thanks,
> > > > >
> > > > > greg k-h
> > > >
> > > > Hi Greg,
> > > >
> > > > If cross-directory building is out of the question, what about
> > > > splitting into separate modules? And using cross-module notifiers/calls?
> > > > I did that with amdkfd and amdgpu/radeon a couple of years back. It
> > > > worked (that's the best thing I can say about it).
> > >
> > > That's fine with me.
> > >
> > > > The main problem with this "virtual bus" thing is that I'm not
> > > > familiar with it at all, and from my experience I imagine it would take
> > > > considerable time and effort to upstream this infrastructure work.
> > >
> > > It shouldn't be taking that long, but for some unknown reason, the
> > > original author of that code is sitting on it and not resending it. Go
> > > poke them through internal Intel channels to find out what the problem
> > > is, as I have no clue why a 200-300 line bus module is taking so long to
> > > get "right" :(
> > >
> > > I'm _ALMOST_ at the point where I would just do that work myself, but
> > > due to my current status with Intel, I'll let them do it as I have
> > > enough other things on my plate...
> > >
> > > > This could delay the NIC code for a couple of years, by which time
> > > > this won't be relevant at all.
> > >
> > > Why wouldn't this code be relevant in a year? It's going to be 2+ years
> > > before any of this shows up in an "enterprise distro" based on their
> > > release cycles anyway :)
> > >
> > > thanks,
> > >
> > > greg k-h
> >
> > Hi Greg,
> > ok, I'll take a look. Do you happen to have the name of the patch-set / author?
>
> Here's at least one copy:
> https://lore.kernel.org/linux-rdma/20200520070227.3392100-2-jeffrey.t.kirsher@intel.com/
>
> there might have been a newer one, can't remember, sorry.
Maybe I'm missing something, or maybe the in-tree code we have already
should be refactored to use more buses and drivers, but
drivers/base/component.c is made for this. We use it to glue all kinds of
things across all kinds of subsystems already.

Of course it really should only be used for one-off problems; as soon as
you have a standard interface/interaction, there should be some kind of
standard lookup way to get at your thing (and the driver behind it). E.g.
in drivers/gpu we're now building up drm_bridge and trying to get away
from component.c for these things.

Cheers, Daniel
On Sun, Sep 20, 2020 at 11:47 AM Greg Kroah-Hartman
<gregkh@linuxfoundation.org> wrote:
>
> On Sat, Sep 19, 2020 at 04:22:35PM -0300, Jason Gunthorpe wrote:
> > On Sat, Sep 19, 2020 at 07:27:30PM +0200, Greg Kroah-Hartman wrote:
> > > > It's probably heresy, but why do I need to integrate into the RDMA subsystem?
> > > > I understand your reasoning about networking (Ethernet), as the driver
> > > > connects to the kernel networking stack (netdev), but with RDMA the
> > > > driver doesn't use or connect to anything in that stack. If I were to
> > > > support IBverbs and declare that I support it, then of course I would
> > > > need to integrate into the RDMA subsystem and add my backend to
> > > > rdma-core.
> > >
> > > IBverbs are horrid and I would not wish them on anyone. Seriously.
> >
> > I'm curious what drives this opinion? Did you have it since you
> > reviewed the initial submission all those years ago?
>
> As I learned more about that interface, yes, I like it less and less :)
>
> But that's the userspace api you all are stuck with, for various
> reasons, my opinion doesn't matter here.
>
> > > I think the general rdma apis are the key here, not the userspace api.
> >
> > Are you proposing that habana should have uAPI in drivers/misc and
> > present a standard rdma-core userspace for it? This is the only
> > userspace programming interface for RoCE HW. I think that would be
> > much more work.
> >
> > If not, what open source userspace are you going to ask them to
> > present to merge the kernel side into misc?
>
> I don't think that they have a userspace api to their rdma feature from
> what I understand, but I could be totally wrong as I do not know their
> hardware at all, so I'll let them answer this question.

Hi Greg,

We do expose a new IOCTL to enable the user to configure connections
between multiple GAUDI devices.

Having said that, we restrict this IOCTL to be used only by the same user
who is doing the compute on our device, as opposed to a real RDMA device,
where any number of applications can send and receive. In addition, this
IOCTL limits the user to connect ONLY to another GAUDI device and not to a
3rd-party RDMA device.

It is true that GAUDI supports RDMA data movement, but the data movement is
NOT done by the user. It is done by our compute engines, i.e. the compute
engines perform "send" and "receive" without going to the host (aka no
support for ibv_post_send, ibv_post_recv). The only thing that is
controlled by the user is to say which GAUDI is connected to which. After
that, the command submissions the user performs to operate our compute
engines will cause them to send and receive RDMA packets.

Moreover, as opposed to smart NICs, where the networking is the main focus
and the compute is only secondary, in our device the compute is our major
focus and the networking is a slave to it.

The hl-thunk userspace library will have wrappers around this single IOCTL
(like all our driver's IOCTLs) and will also contain demos that show how to
use it.

> > > Note, I do not know exactly what they are, but no, IBverbs are not ok.
> >
> > Should we stop merging new drivers and abandon the RDMA subsystem? Is
> > there something you'd like to see fixed?
> >
> > Don't really understand your position, sorry.
>
> For anything that _has_ to have a userspace RDMA interface, sure ibverbs
> are the ones we are stuck with, but I didn't think that was the issue
> here at all, which is why I wrote the above comments.

To emphasize again, we don't want to expose a userspace RDMA interface. We
just want to allow our single compute user to configure a connection to
another GAUDI.

Thanks,
Oded

> thanks,
>
> greg k-h
On Sun, Sep 20, 2020 at 10:05:39PM +0300, Oded Gabbay wrote:
> On Sun, Sep 20, 2020 at 11:47 AM Greg Kroah-Hartman
> <gregkh@linuxfoundation.org> wrote:
> >
> > On Sat, Sep 19, 2020 at 04:22:35PM -0300, Jason Gunthorpe wrote:
> > > On Sat, Sep 19, 2020 at 07:27:30PM +0200, Greg Kroah-Hartman wrote:
> > > > > It's probably heresy, but why do I need to integrate into the RDMA subsystem?
> > > > > I understand your reasoning about networking (Ethernet), as the driver
> > > > > connects to the kernel networking stack (netdev), but with RDMA the
> > > > > driver doesn't use or connect to anything in that stack. If I were to
> > > > > support IBverbs and declare that I support it, then of course I would
> > > > > need to integrate into the RDMA subsystem and add my backend to
> > > > > rdma-core.
> > > >
> > > > IBverbs are horrid and I would not wish them on anyone. Seriously.
> > >
> > > I'm curious what drives this opinion? Did you have it since you
> > > reviewed the initial submission all those years ago?
> >
> > As I learned more about that interface, yes, I like it less and less :)
> >
> > But that's the userspace api you all are stuck with, for various
> > reasons, my opinion doesn't matter here.
> >
> > > > I think the general rdma apis are the key here, not the userspace api.
> > >
> > > Are you proposing that habana should have uAPI in drivers/misc and
> > > present a standard rdma-core userspace for it? This is the only
> > > userspace programming interface for RoCE HW. I think that would be
> > > much more work.
> > >
> > > If not, what open source userspace are you going to ask them to
> > > present to merge the kernel side into misc?
> >
> > I don't think that they have a userspace api to their rdma feature from
> > what I understand, but I could be totally wrong as I do not know their
> > hardware at all, so I'll let them answer this question.
>
> Hi Greg,
> We do expose a new IOCTL to enable the user to configure connections
> between multiple GAUDI devices.

How is it different from RDMA QP configuration?

> Having said that, we restrict this IOCTL to be used only by the same
> user who is doing the compute on our device, as opposed to a real RDMA
> device where any number of applications can send and receive.

The ability to support multiple applications is not an RDMA requirement,
but an implementation choice. For example, MPI jobs are a single user of an
RDMA device.

> In addition, this IOCTL limits the user to connect ONLY to another
> GAUDI device and not to a 3rd party RDMA device.

I don't see how it is different from EFA with their SRD QP type, or mlx5
devices with DC QPs, which you can connect only to similar devices (no
interoperability).

Thanks
On 18/09/2020 18:28, Jason Gunthorpe wrote:
> On Fri, Sep 18, 2020 at 06:15:52PM +0300, Oded Gabbay wrote:
>
>> I'm sorry, but you won't be able to convince me here that I need to
>> "enslave" my entire code to RDMA, just because my ASIC "also" has some
>> RDMA ports.
>
> You can't recreate common shared subsystems in a driver just because
> you don't want to work with the subsystem.
>
> I don't care what else the ASIC has. In Linux the netdev part is
> exposed through netdev, the RDMA part through RDMA, the
> totally-not-a-GPU part through drivers/misc.
>
> It has always been this way. Chelsio didn't get to rebuild the SCSI
> stack in their driver just because "storage is a small part of their
> device".
>
> Drivers are not allowed to re-implement I2C/SPI/etc. without re-using
> the common code for that just because "I2C is a small part of their
> device".
>
> Exposing to userspace the creation of RoCE QPs and their related
> objects is unambiguously an RDMA subsystem task. I don't even know how
> you think you can argue it is not. It is your company proudly claiming
> the device has 100G RoCE ports in all the marketing literature, after
> all.
>
> It is too bad the device has a non-standards-compliant implementation
> of RoCE, so this will be a bit hard for you. Oh well.

What is considered a RoCE port in this case if it's not compliant with
RoCE? Sounds like it's an implementation of RDMA over Ethernet, not RoCE.

Does GAUDI support UD/RC/.. QPs? Is it using a proprietary wire protocol?
(BTW, Oded claims it's similar to nvlink; how is nvlink's implementation
exposed? Or is it closed source?)

Jason, how do you imagine GAUDI in the RDMA subsystem? Userspace control
path verbs (used by hl-thunk?) and all data path verbs exposed as kverbs
(used by the habanalabs driver)? So neither userspace verbs apps nor
kernel ULPs could use it?
On Mon, Sep 21, 2020 at 02:22:02PM +0300, Gal Pressman wrote:
> On 18/09/2020 18:28, Jason Gunthorpe wrote:
> > On Fri, Sep 18, 2020 at 06:15:52PM +0300, Oded Gabbay wrote:
> >
> >> I'm sorry, but you won't be able to convince me here that I need to
> >> "enslave" my entire code to RDMA, just because my ASIC "also" has some
> >> RDMA ports.
> >
> > You can't recreate common shared subsystems in a driver just because
> > you don't want to work with the subsystem.
> >
> > I don't care what else the ASIC has. In Linux the netdev part is
> > exposed through netdev, the RDMA part through RDMA, the
> > totally-not-a-GPU part through drivers/misc.
> >
> > It has always been this way. Chelsio didn't get to rebuild the SCSI
> > stack in their driver just because "storage is a small part of their
> > device".
> >
> > Drivers are not allowed to re-implement I2C/SPI/etc. without re-using
> > the common code for that just because "I2C is a small part of their
> > device".
> >
> > Exposing to userspace the creation of RoCE QPs and their related
> > objects is unambiguously an RDMA subsystem task. I don't even know how
> > you think you can argue it is not. It is your company proudly claiming
> > the device has 100G RoCE ports in all the marketing literature, after
> > all.
> >
> > It is too bad the device has a non-standards-compliant implementation
> > of RoCE, so this will be a bit hard for you. Oh well.
>
> What is considered a RoCE port in this case if it's not compliant with RoCE?

They claim that it is RoCE v2.
https://www.hotchips.org/hc31/HC31_1.14_HabanaLabs.Eitan_Medina.v9.pdf

Thanks
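[Editorial aside, not part of the original thread: for readers unfamiliar
with the term, RoCE v2 carries the InfiniBand transport headers inside a
UDP payload, using the IANA-assigned destination port 4791. A minimal
sketch of the 12-byte Base Transport Header (BTH) that follows the UDP
header in every RoCE v2 packet; the field values below are illustrative,
not taken from any GAUDI material:]

```python
import struct

ROCE_V2_UDP_DPORT = 4791  # IANA-assigned well-known UDP port for RoCE v2

def pack_bth(opcode, pkey, dest_qp, psn, solicited=False, ack_req=False):
    """Pack an InfiniBand Base Transport Header (BTH), 12 bytes.

    Layout: OpCode(8) | SE(1) M(1) PadCnt(2) TVer(4) | P_Key(16) |
            rsvd(8) DestQP(24) | AckReq(1) rsvd(7) PSN(24)
    """
    byte1 = int(solicited) << 7            # SE bit; MigReq/PadCnt/TVer left 0
    dqp_word = dest_qp & 0xFFFFFF          # 24-bit destination QP number
    psn_word = (int(ack_req) << 31) | (psn & 0xFFFFFF)
    return struct.pack(">BBHII", opcode, byte1, pkey, dqp_word, psn_word)

# Illustrative values: 0x04 is the RC SEND-only opcode, 0xFFFF the default P_Key.
bth = pack_bth(opcode=0x04, pkey=0xFFFF, dest_qp=0x12345, psn=7)
assert len(bth) == 12
```

On the wire this header sits after the Ethernet/IP/UDP headers and is
followed by the payload and a 4-byte ICRC.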
On Sun, Sep 20, 2020 at 10:47:02AM +0200, Greg Kroah-Hartman wrote:
> > If not, what open source userspace are you going to ask them to
> > present to merge the kernel side into misc?
>
> I don't think that they have a userspace api to their rdma feature from
> what I understand, but I could be totally wrong as I do not know their
> hardware at all, so I'll let them answer this question.

I thought Oded was pretty clear: the goal of this series is to expose
their RDMA HW to userspace. This problem space requires co-mingling
networking and compute at extremely high speed/low overhead. This is all
done in userspace.

We are specifically talking about this in include/uapi/misc/habanalabs.h:

/*
 * NIC
 *
 * This IOCTL allows the user to manage and configure the device's NIC ports.
 * The following operations are available:
 * - Create a completion queue
 * - Destroy a completion queue
 * - Wait on completion queue
 * - Poll a completion queue
 * - Update consumed completion queue entries
 * - Set a work queue
 * - Unset a work queue
 *
 * For all operations, the user should provide a pointer to an input structure
 * with the context parameters. Some of the operations also require a pointer to
 * driver regarding how many of the available CQEs were actually
 * processed/consumed. Only then the driver will override them with newer
 * entries.
 * The set WQ operation should provide the device virtual address of the WQ with
 * a matching size for the number of WQs and entries per WQ.
 *
 */
#define HL_IOCTL_NIC _IOWR('H', 0x07, struct hl_nic_args)

Which is ibv_create_qp, ibv_create_cq, ibv_poll_cq, etc, etc.

Habana has repeatedly described their HW as having multiple 100G RoCE
ports. RoCE is one of the common industry standards that ibverbs
unambiguously is responsible for. I would be much less annoyed if they
were not actively marketing their product as RoCE RDMA.

Sure, there is some argument that their RoCE isn't spec compliant, but I
don't think that excuses the basic principle of our subsystem: RDMA HW
needs to demonstrate some basic functionality using the standard open
source userspace software stack.

I don't like this idea of backdooring a bunch of proprietary closed-source
RDMA userspace through drivers/misc, and if you don't have a clear idea
how to get something equivalent for drivers/misc, you should not accept
HL_IOCTL_NIC.

Plus RoCE is complicated; there is a bunch of interaction with netdev, and
the rules related to that really need to be respected.

> For anything that _has_ to have a userspace RDMA interface, sure ibverbs
> are the ones we are stuck with, but I didn't think that was the issue
> here at all, which is why I wrote the above comments.

I think you should look at patches #8 through #11:
https://lore.kernel.org/lkml/20200915171022.10561-9-oded.gabbay@gmail.com/

Jason
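[Editorial aside, not part of the thread: the op-for-op overlap Jason
points at can be made explicit. The table below maps the HL_IOCTL_NIC
operations listed in the quoted uapi comment to standard libibverbs entry
points. The mapping is my own approximation, not something stated by either
side in the thread; the `ibv_*` names on the right are real verbs APIs.]

```python
# Approximate correspondence between the custom HL_IOCTL_NIC operations
# (from the uapi comment quoted above) and standard libibverbs calls.
# This mapping is editorial guesswork, not taken from the patches.
HL_NIC_TO_VERBS = {
    "create completion queue":  "ibv_create_cq",
    "destroy completion queue": "ibv_destroy_cq",
    "wait on completion queue": "ibv_get_cq_event",   # blocking CQ wait
    "poll completion queue":    "ibv_poll_cq",
    "update consumed CQEs":     "ibv_ack_cq_events",  # closest analogue
    "set work queue":           "ibv_create_qp",      # or ibv_create_wq
    "unset work queue":         "ibv_destroy_qp",     # or ibv_destroy_wq
}

# Every custom operation has a standard-verbs analogue:
assert all(v.startswith("ibv_") for v in HL_NIC_TO_VERBS.values())
```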
On Mon, 21 Sep 2020 08:52:39 -0300 Jason Gunthorpe wrote:
> I don't like this idea of backdooring a bunch of proprietary closed
> source RDMA userspace through drivers/misc, and if you don't have a
> clear idea how to get something equal for drivers/misc you should not
> accept the HL_IOCTL_NIC.
>
> Plus RoCE is complicated, there is a bunch of interaction with netdev
> and rules related to that that really needs to be respected.

+1

To me this code quite clearly fits the description of a vendor SDK which
runs proprietary stuff on top. It's such a vendor-SDK thing to do to pick
the parts of our infrastructure it likes and "simplify the rest" with its
own re-implementation.

I'd wager the only reason you expose the netdevs at all is for link
settings, stats, packet capture and debug. You'd never run TCP traffic
over those links. And you're fighting against using Linux APIs for the
only real traffic that runs on those links - RDMA(ish) traffic.

Greg - I'm probably the least experienced of the folks involved in this
conversation - could you ELI5 what's the benefit to the community from
merging this code?
On Mon, Sep 21, 2020 at 02:22:02PM +0300, Gal Pressman wrote:
> What is considered a RoCE port in this case if it's not compliant with RoCE?
> Sounds like it's an implementation of RDMA over ethernet, not RoCE.
> Does GAUDI support UD/RC/.. QPs? Is it using a proprietary wire protocol?
> (BTW, Oded claims it's similar to nvlink, how is nvlink's implementation
> exposed? Or is it closed source?)

I think Oded was drawing a parallel to how nvlink is integral with the
compute element. From Oded's descriptions I don't think it is much like
nvlink at all.

> Jason, how do you imagine GAUDI in the RDMA subsystem? Userspace control path
> verbs (used by hl-thunk?) and all data path verbs exposed as kverbs (used by
> habanalabs driver)?
> So neither any userspace verbs apps could use it nor kernel ULPs?

Based on what Oded described, it seems like a reasonable RDMA device with
some limitations around MR IOVA.

Looks like the desire is to create an RDMA WR and CQ ring in userspace,
and then co-mingle that with the compute side of the device.

So instead of doing the special IOCTL and mmap against the compute FD, it
would create an RDMA QP and RDMA CQ, use dv to access the raw internals,
and the proprietary stack would have exactly the same stuff it would have
had with the misc ioctl.

But, completely separately, they'd also have to implement enough of verbs
to serve as the open source userspace showing how this HW works. What that
is depends largely on what their HW can do, and whether they want to
connect to UCX/MPI/libfabric/etc.

A bunch of ioctl stubs or a few tests is far below our standard in RDMA.

There may have been some argument that the compute side of this device has
no industry standards so it should be in drivers/misc, but HPC networking
*does* have extensive standards and extensive open source software stacks.
It is very hard for me to see how a device in this market could be
competitive without integrating with that stuff.

Jason
On Mon, Sep 21, 2020 at 02:20:53PM -0700, Jakub Kicinski wrote:
> I'd wager the only reason you expose the netdevs at all is for link
> settings, stats, packet capture and debug. You'd never run TCP traffic
> over those links. And you're fighting against using Linux APIs for the
> only real traffic that runs on those links - RDMA(ish) traffic.

The usual working flow is to use something like TCP to exchange connection
information, then pivot to RDMA for the actual data flow. This is why a
driver like this can get away with such a low-performance implementation
for a 100G NIC; it is just application boot metadata being exchanged.

Sniffing probably won't work, as typically the HW will capture the RoCE
traffic before it reaches Linux - and the Linux driver couldn't handle a
100G flow anyhow. Stats might not work either.

As far as the "usual rules" go, we do require that accelerator devices
sharing a netdev are secure in the netdev userspace security sense. They
can access the assigned RoCE v2 UDP port but cannot do things like forge
src IP/MAC addresses, violate VLANs, reach outside net namespaces, capture
arbitrary traffic, etc. This stuff is tricky and generally requires HW
support.

Someone has to audit all of this and ensure it meets the netdev security
requirements too, otherwise it will need CAP_NET_RAW to function.
Obviously this requires seeing enough of a userspace implementation to
understand how the design approaches verbs 'Address Handles' and so forth.

RDMA HW has had errors before, and when discovered it was blocked with
CAP_NET_RAW until new chip revs came out; this is something I take very
seriously.

Jason
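[Editorial illustration, not from the thread: the "exchange connection
information over TCP, then pivot to RDMA" flow Jason describes is the
classic out-of-band handshake. The sketch below shows only the metadata
exchange step; the QPN/PSN/GID values are made up, and a socketpair stands
in for the real TCP connection between two hosts.]

```python
import socket
import struct

# Each side advertises the identifiers its peer needs in order to address
# its queue pair: QP number, starting packet sequence number, and a
# 16-byte GID. Wire format: big-endian u32, u32, 16 raw bytes (24 bytes).
def send_qp_info(sock, qpn, psn, gid):
    sock.sendall(struct.pack(">II16s", qpn, psn, gid))

def recv_qp_info(sock):
    qpn, psn, gid = struct.unpack(">II16s", sock.recv(24))
    return {"qpn": qpn, "psn": psn, "gid": gid}

# socketpair stands in for the TCP connection between the two hosts.
a, b = socket.socketpair()
send_qp_info(a, 0x11, 1000, b"A" * 16)   # each side advertises its own QP...
send_qp_info(b, 0x22, 2000, b"B" * 16)
peer_of_a = recv_qp_info(a)              # ...and learns its peer's
peer_of_b = recv_qp_info(b)

# After this point a real application would program the peer's values into
# its own QP (e.g. transition it to RTR/RTS) and stop using the TCP side
# for anything but teardown - all bulk data then flows over RDMA.
```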
On 22/09/2020 14:41, Jason Gunthorpe wrote:
> On Mon, Sep 21, 2020 at 02:22:02PM +0300, Gal Pressman wrote:
>
>> What is considered a RoCE port in this case if it's not compliant with RoCE?
>> Sounds like it's an implementation of RDMA over ethernet, not RoCE.
>> Does GAUDI support UD/RC/.. QPs? Is it using a proprietary wire protocol?
>> (BTW, Oded claims it's similar to nvlink, how is nvlink's implementation
>> exposed? Or is it closed source?)
>
> I think Oded was drawing a parallel to how nvlink is integral with the
> compute element. From Oded's descriptions I don't think it is much
> like nvlink at all.
>
>> Jason, how do you imagine GAUDI in the RDMA subsystem? Userspace control path
>> verbs (used by hl-thunk?) and all data path verbs exposed as kverbs (used by
>> habanalabs driver)?
>> So neither any userspace verbs apps could use it nor kernel ULPs?
>
> Based on what Oded described it seems like a reasonable RDMA device
> with some limitations around MR IOVA.
>
> Looks like the desire is to create an RDMA WR and CQ ring in userspace,
> and then co-mingle that with the compute side of the device.
>
> So instead of doing the special IOCTL and mmap against the compute FD
> it would create an RDMA QP and RDMA CQ, use dv to access the raw
> internals, and the proprietary stack would have exactly the same stuff
> it would have had with the misc ioctl.
>
> But, completely separately, they'd also have to implement some of the
> verbs which serve as the open source userspace showing how this HW
> works. What that is depends largely on what their HW can do, and
> whether they want to connect to UCX/MPI/libfabric/etc.
>
> A bunch of ioctl stubs or a few tests is far below our standard in
> RDMA.
>
> There may have been some argument that the compute side of this device
> has no industry standards so should be a drivers/misc, but HPC
> networking *does* have extensive standards and extensive open source
> software stacks. It is very hard for me to see how a device in this
> market could be competitive without integrating with that stuff.

I agree, that makes sense.
But assuming Oded actually goes and implements all the needed verbs to
get a basic functional libibverbs provider (assuming their HW can do it
somehow), is it really useful if no one is going to use it? It doesn't
sound like habanalabs want people to use GAUDI as an RDMA adapter, and
I'm assuming the only real-world use case is going to be using the hl
stack, which means we're left with a lot of dead code that's not
used/tested by anyone.

Genuine question: wouldn't it be better if they only implement what's
actually going to be used and tested by their customers?
On Tue, Sep 22, 2020 at 03:46:29PM +0300, Gal Pressman wrote:

> I agree, that makes sense.
> But assuming Oded actually goes and implements all the needed verbs to get a
> basic functional libibverbs provider (assuming their HW can do it somehow), is
> it really useful if no one is going to use it?
> It doesn't sound like habanalabs want people to use GAUDI as an RDMA adapter,
> and I'm assuming the only real world use case is going to be using the hl stack,
> which means we're left with a lot of dead code that's not used/tested by anyone.
>
> Genuine question, wouldn't it be better if they only implement what's actually
> going to be used and tested by their customers?

The general standard for this 'accel' hardware, both in DRM and RDMA, is
to present an open source userspace. Companies are encouraged to use that
as their main interface, but I suppose are free to carry the cost of dual
APIs, and the community's wrath, if they want.

At least for RDMA this is guided by the founding event of Linux RDMA,
where all customers demanded that the madness of every supplier having a
unique software stack from the kernel down stop. Since then the low-level
stack has been cross-vendor and uniform.

Jason
On 22/09/2020 19:14, Jason Gunthorpe wrote:
> On Tue, Sep 22, 2020 at 03:46:29PM +0300, Gal Pressman wrote:
>
>> I agree, that makes sense.
>> But assuming Oded actually goes and implements all the needed verbs to get a
>> basic functional libibverbs provider (assuming their HW can do it somehow), is
>> it really useful if no one is going to use it?
>> It doesn't sound like habanalabs want people to use GAUDI as an RDMA adapter,
>> and I'm assuming the only real world use case is going to be using the hl stack,
>> which means we're left with a lot of dead code that's not used/tested by anyone.
>>
>> Genuine question, wouldn't it be better if they only implement what's actually
>> going to be used and tested by their customers?
>
> The general standard for this 'accel' hardware, both in DRM and RDMA
> is to present an open source userspace. Companies are encouraged to
> use that as their main interface but I suppose are free to carry the
> cost of dual APIs, and the community's wrath if they want.

I didn't mean they should maintain two interfaces.
The question is whether they should implement libibverbs support that
covers the cases used by their stack, or should implement all "mandatory"
verbs so they could run libibverbs' examples/perftest/pyverbs as well,
even though these will likely be the only apps covering those verbs.
On Tue, Sep 22, 2020 at 07:30:32PM +0300, Gal Pressman wrote:

> On 22/09/2020 19:14, Jason Gunthorpe wrote:
>> On Tue, Sep 22, 2020 at 03:46:29PM +0300, Gal Pressman wrote:
>>
>>> I agree, that makes sense.
>>> But assuming Oded actually goes and implements all the needed verbs to get a
>>> basic functional libibverbs provider (assuming their HW can do it somehow), is
>>> it really useful if no one is going to use it?
>>> It doesn't sound like habanalabs want people to use GAUDI as an RDMA adapter,
>>> and I'm assuming the only real world use case is going to be using the hl stack,
>>> which means we're left with a lot of dead code that's not used/tested by anyone.
>>>
>>> Genuine question, wouldn't it be better if they only implement what's actually
>>> going to be used and tested by their customers?
>>
>> The general standard for this 'accel' hardware, both in DRM and RDMA
>> is to present an open source userspace. Companies are encouraged to
>> use that as their main interface but I suppose are free to carry the
>> cost of dual APIs, and the community's wrath if they want.
>
> I didn't mean they should maintain two interfaces.
> The question is whether they should implement libibverbs support that covers the
> cases used by their stack, or should they implement all "mandatory" verbs so
> they could be able to run libibverbs' examples/perftest/pyverbs as well, even
> though these will likely be the only apps covering these verbs.

As I said, the minimum standard is an open source userspace that will
operate the NIC. For EFA we decided that was ibv_ud_pingpong, and now
parts of pyverbs. A similar decision would be needed here too. It is a
conversation that should start with a proposal from Oded.

The *point* is to have the open userspace, so I really don't care what
their proprietary universe does, and shrinking the open source side
because it is "redundant" is completely backwards to what we want to see.

Jason