mbox series

[net-next,v2,0/3] VGT+ support

Message ID 1572551213-9022-1-git-send-email-lariel@mellanox.com
Headers show
Series VGT+ support | expand

Message

Ariel Levkovich Oct. 31, 2019, 7:47 p.m. UTC
The following series introduces VGT+ support for SRIOV vf
devices.

VGT+ is an extention of the VGT (Virtual Guest Tagging)
where the guest is in charge of vlan tagging the packets
only with VGT+ the admin can limit the allowed vlan ids
the guest can use to a specific vlan trunk list.

The patches introduce the API for admin users to set and
query these vlan trunk lists on the vfs using netlink
commands.

changes from v1 to v2:
- Fixed indentation of RTEXT_FILTER_SKIP_STATS.
- Changed vf_ext param to bool.
- Check if VF num exceeds the opened VFs range and return without
adding the vfinfo.

Ariel Levkovich (3):
  net: Support querying specific VF properties
  net: Add SRIOV VGT+ support
  net/mlx5: Add SRIOV VGT+ support

 drivers/net/ethernet/mellanox/mlx5/core/en_main.c  |  30 ++
 drivers/net/ethernet/mellanox/mlx5/core/eswitch.c  | 600 ++++++++++++++++-----
 drivers/net/ethernet/mellanox/mlx5/core/eswitch.h  |  27 +-
 .../ethernet/mellanox/mlx5/core/eswitch_offloads.c |   8 +-
 include/linux/if_link.h                            |   3 +
 include/linux/netdevice.h                          |  12 +
 include/uapi/linux/if_link.h                       |  35 ++
 include/uapi/linux/rtnetlink.h                     |   3 +-
 net/core/rtnetlink.c                               | 173 ++++--
 9 files changed, 717 insertions(+), 174 deletions(-)

Comments

David Miller Oct. 31, 2019, 8:31 p.m. UTC | #1
The previous posted version was also v2, what are you doing?
Ariel Levkovich Oct. 31, 2019, 10:20 p.m. UTC | #2
On 10/31/19, 4:31 PM, "David Miller" <davem@davemloft.net> wrote:

    
>    The previous posted version was also v2, what are you doing?

I restarted this series since my first submission had a mistake in the subject prefix.
This is the 2nd version of that new submission while previous has a different subject
and can be ignored.
David Miller Oct. 31, 2019, 10:58 p.m. UTC | #3
From: Ariel Levkovich <lariel@mellanox.com>
Date: Thu, 31 Oct 2019 22:20:55 +0000

> 
> 
> On 10/31/19, 4:31 PM, "David Miller" <davem@davemloft.net> wrote:
> 
>     
>>    The previous posted version was also v2, what are you doing?
> 
> I restarted this series since my first submission had a mistake in the subject prefix.
> This is the 2nd version of that new submission while previous has a different subject
> and can be ignored.  

Always increment the version number when you post a patch series anew.

Otherwise it is ambiguous to me which one is the latest.
Jakub Kicinski Nov. 1, 2019, 12:23 a.m. UTC | #4
On Thu, 31 Oct 2019 19:47:25 +0000, Ariel Levkovich wrote:
> The following series introduces VGT+ support for SRIOV vf
> devices.
> 
> VGT+ is an extention of the VGT (Virtual Guest Tagging)
> where the guest is in charge of vlan tagging the packets
> only with VGT+ the admin can limit the allowed vlan ids
> the guest can use to a specific vlan trunk list.
> 
> The patches introduce the API for admin users to set and
> query these vlan trunk lists on the vfs using netlink
> commands.
> 
> changes from v1 to v2:
> - Fixed indentation of RTEXT_FILTER_SKIP_STATS.
> - Changed vf_ext param to bool.
> - Check if VF num exceeds the opened VFs range and return without
> adding the vfinfo.

If you repost something without addressing feedback you received you
_have_ _to_ describe what that feedback was and why it was ignored in 
the cover letter of the new posting, please and thank you.
Ariel Levkovich Nov. 1, 2019, 2:55 p.m. UTC | #5
On 10/31/19 6:58 PM, "David Miller" <davem@davemloft.net> wrote:
> From: Ariel Levkovich <lariel@mellanox.com>
> Date: Thu, 31 Oct 2019 22:20:55 +0000
>
>>
>> On 10/31/19, 4:31 PM, "David Miller" <davem@davemloft.net> wrote:
>>
>>      
>>>     The previous posted version was also v2, what are you doing?
>> I restarted this series since my first submission had a mistake in the subject prefix.
>> This is the 2nd version of that new submission while previous has a different subject
>> and can be ignored.
> Always increment the version number when you post a patch series anew.
>
> Otherwise it is ambiguous to me which one is the latest.


Understood. Will make sure of that.
Saeed Mahameed Nov. 1, 2019, 9:28 p.m. UTC | #6
On Fri, 2019-11-01 at 19:44 +0000, Ariel Levkovich wrote:
> On 10/31/19 8:23 PM, Jakub Kicinski <jakub.kicinski@netronome.com>
> wrote:
> > On Thu, 31 Oct 2019 19:47:25 +0000, Ariel Levkovich wrote:
> > > The following series introduces VGT+ support for SRIOV vf
> > > devices.
> > > 
> > > VGT+ is an extention of the VGT (Virtual Guest Tagging)
> > > where the guest is in charge of vlan tagging the packets
> > > only with VGT+ the admin can limit the allowed vlan ids
> > > the guest can use to a specific vlan trunk list.
> > > 
> > > The patches introduce the API for admin users to set and
> > > query these vlan trunk lists on the vfs using netlink
> > > commands.
> > > 
> > > changes from v1 to v2:
> > > - Fixed indentation of RTEXT_FILTER_SKIP_STATS.
> > > - Changed vf_ext param to bool.
> > > - Check if VF num exceeds the opened VFs range and return without
> > > adding the vfinfo.
> > 
> > If you repost something without addressing feedback you received
> > you
> > _have_ _to_ describe what that feedback was and why it was ignored
> > in 
> > the cover letter of the new posting, please and thank you.
> 
> Right, I must've missed that.
> 
> I re posted the patches to address the code related feedback from
> Stephen while
> 
> we continue the discussion on the concept of legacy mode features.
> 
> On 10/30/19 you wrote:
> 
>     "
> 
>     The "we don't want any more legacy VF ndos" policy which I think
> we
>     wanted to follow is much easier to stick to than "we don't want
> any
>     more legacy VF ndos, unless..".
>    
>     There's nothing here that can't be done in switchdev mode
> (perhaps
>     bridge offload would actually be more suitable than just flower),
>     and the uAPI extension is not an insignificant one.
>    
>     I don't think we should be growing both legacy and switchdev
> APIs, at
>     some point we got to pick one. The switchdev extension to set
> hwaddr
>     for which patches were posted recently had been implemented
> through
>     legacy API a while ago (by Chelsio IIRC) so it's not that we're
> looking
>     towards switchdev where legacy API is impossible to extend. It's
> purely
>     a policy decision to pick one and deprecate the other.
>    
>     Only if we freeze the legacy API completely will the
> orchestration
>     layers have an incentive to support switchdev. And we can save
> the few
>     hundred lines of code per feature in every driver..
>     "
> 
> 
> Unfortunately, like Saeed and yourself mentioned, we still face
> customers that are
> 
> refusing to move to switchdev while requiring feature and
> capabilities
> 
> in virtual environments.
> 
> 
> 
> We have several configurations available today for legacy that
> include vlan settings
> for the VF.
> I know that we decided to refrain from adding new legacy ndos but I
> also think we
> left a gap in the VLAN control for legacy mode while this feature
> fills that gap.
> Today user can either set a single VLAN id to a VF or none there's
> nothing in between.
> This feature complements  the missing part of the puzzle and we don't
> see any further
> future development in this area after this.
> 
> We have tried to submit and close this gap before (
> https://patchwork.ozlabs.org/patch/806213)
> but messed up there and at the time we decided not to proceed the
> effort
> due to several reasons so we are trying to close this now.
> 
> 
> 

Jakub, since Ariel is still working on his upstream mailing list skills
:), i would like to emphasis and summarize his point in text style ;-)
the way we like it.

Bottom line, we tried to push this feature a couple of years ago, and
due to some internal issues this submission ignored for a while, now as
the legacy sriov customers are moving towards upstream, which is for me
a great progress I think this feature worth the shot, also as Ariel
pointed out, VF vlan filter is really a gap that should be closed.

For all other features it is true that the user must consider moving to
witchdev mode or find a another community for support.

Our policy is still strong regarding obsoleting legacy mode and pushing
all new feature to switchdev mode, but looking at the facts here  I do
think there is a point here and ROI to close this gap in legacy mode.

I hope this all make sense. 

Thanks,
Saeed.
Jakub Kicinski Nov. 2, 2019, 12:21 a.m. UTC | #7
On Fri, 1 Nov 2019 21:28:22 +0000, Saeed Mahameed wrote:
> Jakub, since Ariel is still working on his upstream mailing list skills
> :), i would like to emphasis and summarize his point in text style ;-)
> the way we like it.

Thanks :)

> Bottom line, we tried to push this feature a couple of years ago, and
> due to some internal issues this submission ignored for a while, now as
> the legacy sriov customers are moving towards upstream, which is for me
> a great progress I think this feature worth the shot, also as Ariel
> pointed out, VF vlan filter is really a gap that should be closed.
> 
> For all other features it is true that the user must consider moving to
> witchdev mode or find a another community for support.
> 
> Our policy is still strong regarding obsoleting legacy mode and pushing
> all new feature to switchdev mode, but looking at the facts here  I do
> think there is a point here and ROI to close this gap in legacy mode.
> 
> I hope this all make sense. 

I understand and sympathize, you know full well the benefits of working
upstream-first...

I won't reiterate the entire response from my previous email, but the
bottom line for me is that we haven't added a single legacy VF NDO
since 2016, I was hoping we never will add more and I was trying to
stop anyone who tried.

Muxing the VF info through linkinfo is getting pretty ugly with the
tricks we have to play to make it fit message size.

And you can't really say you considered your options carefully here and
reviewed the code, since the patch adds this nugget to a uAPI header:

diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
index 797e214..35ab210 100644
--- a/include/uapi/linux/if_link.h
+++ b/include/uapi/linux/if_link.h
@@ -180,6 +180,8 @@ enum {
 #ifndef __KERNEL__
 #define IFLA_RTA(r)  ((struct rtattr*)(((char*)(r)) + NLMSG_ALIGN(sizeof(struct ifinfomsg))))
 #define IFLA_PAYLOAD(n) NLMSG_PAYLOAD(n,sizeof(struct ifinfomsg))
+#define BITS_PER_BYTE 8
+#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))
 #endif
 
 enum {
Saeed Mahameed Nov. 5, 2019, 1:38 a.m. UTC | #8
On Fri, 2019-11-01 at 17:21 -0700, Jakub Kicinski wrote:
> On Fri, 1 Nov 2019 21:28:22 +0000, Saeed Mahameed wrote:
> > Jakub, since Ariel is still working on his upstream mailing list
> > skills
> > :), i would like to emphasis and summarize his point in text style
> > ;-)
> > the way we like it.
> 
> Thanks :)
> 
> > Bottom line, we tried to push this feature a couple of years ago,
> > and
> > due to some internal issues this submission ignored for a while,
> > now as
> > the legacy sriov customers are moving towards upstream, which is
> > for me
> > a great progress I think this feature worth the shot, also as Ariel
> > pointed out, VF vlan filter is really a gap that should be closed.
> > 
> > For all other features it is true that the user must consider
> > moving to
> > witchdev mode or find a another community for support.
> > 
> > Our policy is still strong regarding obsoleting legacy mode and
> > pushing
> > all new feature to switchdev mode, but looking at the facts here  I
> > do
> > think there is a point here and ROI to close this gap in legacy
> > mode.
> > 
> > I hope this all make sense. 
> 
> I understand and sympathize, you know full well the benefits of
> working
> upstream-first...
> 
> I won't reiterate the entire response from my previous email, but the
> bottom line for me is that we haven't added a single legacy VF NDO
> since 2016, I was hoping we never will add more and I was trying to
> stop anyone who tried.
> 

The NDO is not the problem here, we can perfectly extend the current
set_vf_vlan_ndo to achieve the same goal with minimal or even NO kernel
changes, but first you have to look at this from my angel, i have been
doing lots of research and there are many points for why this should be
added to legacy mode:

1) Switchdev mode can't replace legacy mode with a press of a button,
many missing pieces.

2) Upstream Legacy SRIOV is incomplete since it goes together with
flexible vf vlan configuration, most of mlx5 legacy sriov users are
using customized kernels and external drivers, since upstream is
lacking this one basic vlan filtering feature, and many users decline
switching to upstream kernel due to this missing features.

3) Many other vendors have this feature in customized drivers/kernels,
and many vendors/drivers don't even support switchdev mode (mlx4 for
example), we can't just tell the users of such device we are not
supporting basic sriov legacy mode features in upstream kernel.

4) the motivation for this is to slowly move sriov users to upstream
and eventually to switchdev mode.

This has been a large gap and disadvantage in upstream and 
this feature has been long time coming and we have been laying the
grounds for it since 2016 (IFLA_VF_VLAN_LIST), it is not reasonable
that we have APIs to set all kinds of vf vlan attributes but not vlan
filters, I really believe this feature should be IN.

I still believe in the switchdev only policy, but vlan filter is a gray
area, Legacy sriov is all about VF vlans.

my opinion is: VF Vlan filter  is a very basic requirement for SRIOV,
not all HW support switchdev mode and many customers and usecases are
meant to use legacy mode only, it is unfair to block this feature, and
this is a great loss of users and use cases to the upstream kernel.

Now if the only remaining problem is the uAPI, we can minimize kernel
impact or even make no kernel changes at all, only ip route2 and
drivers, by reusing the current set_vf_vlan_ndo.
David Ahern Nov. 5, 2019, 1:47 a.m. UTC | #9
On 11/4/19 6:38 PM, Saeed Mahameed wrote:
> On Fri, 2019-11-01 at 17:21 -0700, Jakub Kicinski wrote:
>> On Fri, 1 Nov 2019 21:28:22 +0000, Saeed Mahameed wrote:
>>> Jakub, since Ariel is still working on his upstream mailing list
>>> skills
>>> :), i would like to emphasis and summarize his point in text style
>>> ;-)
>>> the way we like it.
>>
>> Thanks :)
>>
>>> Bottom line, we tried to push this feature a couple of years ago,
>>> and
>>> due to some internal issues this submission ignored for a while,
>>> now as
>>> the legacy sriov customers are moving towards upstream, which is
>>> for me
>>> a great progress I think this feature worth the shot, also as Ariel
>>> pointed out, VF vlan filter is really a gap that should be closed.
>>>
>>> For all other features it is true that the user must consider
>>> moving to
>>> witchdev mode or find a another community for support.
>>>
>>> Our policy is still strong regarding obsoleting legacy mode and
>>> pushing
>>> all new feature to switchdev mode, but looking at the facts here  I
>>> do
>>> think there is a point here and ROI to close this gap in legacy
>>> mode.
>>>
>>> I hope this all make sense. 
>>
>> I understand and sympathize, you know full well the benefits of
>> working
>> upstream-first...
>>
>> I won't reiterate the entire response from my previous email, but the
>> bottom line for me is that we haven't added a single legacy VF NDO
>> since 2016, I was hoping we never will add more and I was trying to
>> stop anyone who tried.
>>
> 
> The NDO is not the problem here, we can perfectly extend the current
> set_vf_vlan_ndo to achieve the same goal with minimal or even NO kernel
> changes, but first you have to look at this from my angel, i have been
> doing lots of research and there are many points for why this should be
> added to legacy mode:
> 
> 1) Switchdev mode can't replace legacy mode with a press of a button,
> many missing pieces.
> 
> 2) Upstream Legacy SRIOV is incomplete since it goes together with
> flexible vf vlan configuration, most of mlx5 legacy sriov users are
> using customized kernels and external drivers, since upstream is
> lacking this one basic vlan filtering feature, and many users decline
> switching to upstream kernel due to this missing features.
> 
> 3) Many other vendors have this feature in customized drivers/kernels,
> and many vendors/drivers don't even support switchdev mode (mlx4 for
> example), we can't just tell the users of such device we are not
> supporting basic sriov legacy mode features in upstream kernel.
> 
> 4) the motivation for this is to slowly move sriov users to upstream
> and eventually to switchdev mode.

If the legacy freeze started in 2016 and we are at the end of 2019, what
is the migration path?


> 
> Now if the only remaining problem is the uAPI, we can minimize kernel
> impact or even make no kernel changes at all, only ip route2 and
> drivers, by reusing the current set_vf_vlan_ndo.

And this caught my eye as well -- iproute2 does not need the baggage either.

Is there any reason this continued support for legacy sriov can not be
done out of tree?
Jakub Kicinski Nov. 5, 2019, 2:35 a.m. UTC | #10
On Mon, 4 Nov 2019 18:47:32 -0700, David Ahern wrote:
> On 11/4/19 6:38 PM, Saeed Mahameed wrote:
> > On Fri, 2019-11-01 at 17:21 -0700, Jakub Kicinski wrote:  
> >> On Fri, 1 Nov 2019 21:28:22 +0000, Saeed Mahameed wrote:  
> >>> Bottom line, we tried to push this feature a couple of years ago,
> >>> and
> >>> due to some internal issues this submission ignored for a while,
> >>> now as
> >>> the legacy sriov customers are moving towards upstream, which is
> >>> for me
> >>> a great progress I think this feature worth the shot, also as Ariel
> >>> pointed out, VF vlan filter is really a gap that should be closed.
> >>>
> >>> For all other features it is true that the user must consider
> >>> moving to
> >>> witchdev mode or find a another community for support.
> >>>
> >>> Our policy is still strong regarding obsoleting legacy mode and
> >>> pushing
> >>> all new feature to switchdev mode, but looking at the facts here  I
> >>> do
> >>> think there is a point here and ROI to close this gap in legacy
> >>> mode.
> >>>
> >>> I hope this all make sense.   
> >>
> >> I understand and sympathize, you know full well the benefits of
> >> working
> >> upstream-first...
> >>
> >> I won't reiterate the entire response from my previous email, but the
> >> bottom line for me is that we haven't added a single legacy VF NDO
> >> since 2016, I was hoping we never will add more and I was trying to
> >> stop anyone who tried.
> >>  
> > 
> > The NDO is not the problem here, we can perfectly extend the current
> > set_vf_vlan_ndo to achieve the same goal with minimal or even NO kernel
> > changes, but first you have to look at this from my angel, i have been
> > doing lots of research and there are many points for why this should be
> > added to legacy mode:
> > 
> > 1) Switchdev mode can't replace legacy mode with a press of a button,
> > many missing pieces.
> > 
> > 2) Upstream Legacy SRIOV is incomplete since it goes together with
> > flexible vf vlan configuration, most of mlx5 legacy sriov users are
> > using customized kernels and external drivers, since upstream is
> > lacking this one basic vlan filtering feature, and many users decline
> > switching to upstream kernel due to this missing features.
> > 
> > 3) Many other vendors have this feature in customized drivers/kernels,
> > and many vendors/drivers don't even support switchdev mode (mlx4 for
> > example), we can't just tell the users of such device we are not
> > supporting basic sriov legacy mode features in upstream kernel.
> > 
> > 4) the motivation for this is to slowly move sriov users to upstream
> > and eventually to switchdev mode.  
> 
> If the legacy freeze started in 2016 and we are at the end of 2019, what
> is the migration path?

The migration path is to implement basic bridge offload which can take
care of this trivially.

Problem is people equate switchdev with OvS offload today, so bridge
offload is not actually implemented. It's really hard to convince
product marketing that it's work worth taking on.

Adding incremental features to legacy API is always going to be cheaper
than migrating controllers to switchdev.

IDK if you remember the net_failover argument about in-kernel VF to
virtio bonding. The controllers are doing the bare minimum and it's 
hard for HW vendors to justify the expense of moving forward all parts 
of the stack.

Which means SR-IOV is either stuck in pure-L2 middle ages or requires
all the SDN complexity and overhead. Switchdev bridge offload can be
trivially extended to support eVPN and simplify so many deployments,
sigh..

> > Now if the only remaining problem is the uAPI, we can minimize kernel
> > impact or even make no kernel changes at all, only ip route2 and
> > drivers, by reusing the current set_vf_vlan_ndo.  
> 
> And this caught my eye as well -- iproute2 does not need the baggage either.
> 
> Is there any reason this continued support for legacy sriov can not be
> done out of tree?

Exactly. Moving to upstream is only valuable if it doesn't require
brining all the out-of-tree baggage.
Saeed Mahameed Nov. 5, 2019, 8:10 p.m. UTC | #11
On Mon, 2019-11-04 at 18:35 -0800, Jakub Kicinski wrote:
> On Mon, 4 Nov 2019 18:47:32 -0700, David Ahern wrote:
> > On 11/4/19 6:38 PM, Saeed Mahameed wrote:
> > > On Fri, 2019-11-01 at 17:21 -0700, Jakub Kicinski wrote:  
> > > > On Fri, 1 Nov 2019 21:28:22 +0000, Saeed Mahameed wrote:  
> > > > > Bottom line, we tried to push this feature a couple of years
> > > > > ago,
> > > > > and
> > > > > due to some internal issues this submission ignored for a
> > > > > while,
> > > > > now as
> > > > > the legacy sriov customers are moving towards upstream, which
> > > > > is
> > > > > for me
> > > > > a great progress I think this feature worth the shot, also as
> > > > > Ariel
> > > > > pointed out, VF vlan filter is really a gap that should be
> > > > > closed.
> > > > > 
> > > > > For all other features it is true that the user must consider
> > > > > moving to
> > > > > witchdev mode or find a another community for support.
> > > > > 
> > > > > Our policy is still strong regarding obsoleting legacy mode
> > > > > and
> > > > > pushing
> > > > > all new feature to switchdev mode, but looking at the facts
> > > > > here  I
> > > > > do
> > > > > think there is a point here and ROI to close this gap in
> > > > > legacy
> > > > > mode.
> > > > > 
> > > > > I hope this all make sense.   
> > > > 
> > > > I understand and sympathize, you know full well the benefits of
> > > > working
> > > > upstream-first...
> > > > 
> > > > I won't reiterate the entire response from my previous email,
> > > > but the
> > > > bottom line for me is that we haven't added a single legacy VF
> > > > NDO
> > > > since 2016, I was hoping we never will add more and I was
> > > > trying to
> > > > stop anyone who tried.
> > > >  
> > > 
> > > The NDO is not the problem here, we can perfectly extend the
> > > current
> > > set_vf_vlan_ndo to achieve the same goal with minimal or even NO
> > > kernel
> > > changes, but first you have to look at this from my angel, i have
> > > been
> > > doing lots of research and there are many points for why this
> > > should be
> > > added to legacy mode:
> > > 
> > > 1) Switchdev mode can't replace legacy mode with a press of a
> > > button,
> > > many missing pieces.
> > > 
> > > 2) Upstream Legacy SRIOV is incomplete since it goes together
> > > with
> > > flexible vf vlan configuration, most of mlx5 legacy sriov users
> > > are
> > > using customized kernels and external drivers, since upstream is
> > > lacking this one basic vlan filtering feature, and many users
> > > decline
> > > switching to upstream kernel due to this missing features.
> > > 
> > > 3) Many other vendors have this feature in customized
> > > drivers/kernels,
> > > and many vendors/drivers don't even support switchdev mode (mlx4
> > > for
> > > example), we can't just tell the users of such device we are not
> > > supporting basic sriov legacy mode features in upstream kernel.
> > > 
> > > 4) the motivation for this is to slowly move sriov users to
> > > upstream
> > > and eventually to switchdev mode.  
> > 
> > If the legacy freeze started in 2016 and we are at the end of 2019,
> > what
> > is the migration path?
> 
> The migration path is to implement basic bridge offload which can
> take
> care of this trivially.
> 

it is not about implementation, it is also the user echo system that
needs to be migrated.
so this needs to happen in baby steps, you can't just ask ask someone
to move to something that is not even cooked yet, and will require
months or years of validation and deployment.

> Problem is people equate switchdev with OvS offload today, so bridge
> offload is not actually implemented. It's really hard to convince
> product marketing that it's work worth taking on.
> 
> Adding incremental features to legacy API is always going to be
> cheaper
> than migrating controllers to switchdev.
> 

Yes this feature is 1000 times cheaper than implementing the full
offload, and yet very trivial and necessary, the ROI is so big that it
worth doing it, more exposure to upstream.

> IDK if you remember the net_failover argument about in-kernel VF to
> virtio bonding. The controllers are doing the bare minimum and it's 
> hard for HW vendors to justify the expense of moving forward all
> parts 
> of the stack.
> 
> Which means SR-IOV is either stuck in pure-L2 middle ages or requires
> all the SDN complexity and overhead. Switchdev bridge offload can be
> trivially extended to support eVPN and simplify so many deployments,
> sigh..
> 
> > > Now if the only remaining problem is the uAPI, we can minimize
> > > kernel
> > > impact or even make no kernel changes at all, only ip route2 and
> > > drivers, by reusing the current set_vf_vlan_ndo.  
> > 
> > And this caught my eye as well -- iproute2 does not need the
> > baggage either.
> > 
> > Is there any reason this continued support for legacy sriov can not
> > be
> > done out of tree?
> 
> Exactly. Moving to upstream is only valuable if it doesn't require
> brining all the out-of-tree baggage.

this baggage is a very essential part for eth sriov, it is a missing
feature in both switchdev mode (bridge offloads) and legacy.

Guys, I need to know my options here and make some effort assessment.

1) implement bridge offloads: months of development, years for
deployment and migration
2) Close this gap in legacy mode: days.

I am all IN for bridge offloads, but you have to understand why i pick
2, not because it is cheaper, but because it is more realistic for my
current users. Saying no to this just because switchdev mode is the de
facto standard isn't fair and there should be an active clear
transition plan, with something available to work with ... not just
ideas.

Your claims are valid only when we are truly ready for migration. we
are simply not and no one has a clear plan in the horizon, so i don't
get this total freeze attitude of legacy mode, it should be just like
ethtool we want to to replace it but we know we are not there yet, so
we carefully add only necessary things with lots of auditing, same
should go here.
Jakub Kicinski Nov. 5, 2019, 9:55 p.m. UTC | #12
On Tue, 5 Nov 2019 20:10:02 +0000, Saeed Mahameed wrote:
> > > > Now if the only remaining problem is the uAPI, we can minimize
> > > > kernel impact or even make no kernel changes at all, only ip
> > > > route2 and drivers, by reusing the current set_vf_vlan_ndo.    
> > > 
> > > And this caught my eye as well -- iproute2 does not need the
> > > baggage either.
> > > 
> > > Is there any reason this continued support for legacy sriov can
> > > not be done out of tree?  
> > 
> > Exactly. Moving to upstream is only valuable if it doesn't require
> > brining all the out-of-tree baggage.  
> 
> this baggage is a very essential part for eth sriov, it is a missing
> feature in both switchdev mode (bridge offloads) and legacy.

AFAIK from uAPI perspective nothing is missing in switchdev mode.

> Guys, I need to know my options here and make some effort assessment.
> 
> 1) implement bridge offloads: months of development, years for
> deployment and migration
> 2) Close this gap in legacy mode: days.
> 
> I am all IN for bridge offloads, but you have to understand why i pick
> 2, not because it is cheaper, but because it is more realistic for my
> current users. Saying no to this just because switchdev mode is the de
> facto standard isn't fair and there should be an active clear
> transition plan, with something available to work with ... not just
> ideas.

I understand your perspective. It is cheaper for you.

> Your claims are valid only when we are truly ready for migration. we
> are simply not and no one has a clear plan in the horizon, so i don't
> get this total freeze attitude of legacy mode, 

There will never be any L2 plan unless we say no to legacy extensions.

> it should be just like ethtool we want to to replace it but we know
> we are not there yet, so we carefully add only necessary things with
> lots of auditing, same should go here.

Worked out amazingly for ethtool, right?
Saeed Mahameed Nov. 5, 2019, 10:52 p.m. UTC | #13
On Tue, 2019-11-05 at 13:55 -0800, Jakub Kicinski wrote:
> On Tue, 5 Nov 2019 20:10:02 +0000, Saeed Mahameed wrote:
> > > > > Now if the only remaining problem is the uAPI, we can
> > > > > minimize
> > > > > kernel impact or even make no kernel changes at all, only ip
> > > > > route2 and drivers, by reusing the current
> > > > > set_vf_vlan_ndo.    
> > > > 
> > > > And this caught my eye as well -- iproute2 does not need the
> > > > baggage either.
> > > > 
> > > > Is there any reason this continued support for legacy sriov can
> > > > not be done out of tree?  
> > > 
> > > Exactly. Moving to upstream is only valuable if it doesn't
> > > require
> > > brining all the out-of-tree baggage.  
> > 
> > this baggage is a very essential part for eth sriov, it is a
> > missing
> > feature in both switchdev mode (bridge offloads) and legacy.
> 
> AFAIK from uAPI perspective nothing is missing in switchdev mode.
> 

bridge offloads is not on the road map. same as for netlink ethtool
adapter which is still WIP since 2017.

> > Guys, I need to know my options here and make some effort
> > assessment.
> > 
> > 1) implement bridge offloads: months of development, years for
> > deployment and migration
> > 2) Close this gap in legacy mode: days.
> > 
> > I am all IN for bridge offloads, but you have to understand why i
> > pick
> > 2, not because it is cheaper, but because it is more realistic for
> > my
> > current users. Saying no to this just because switchdev mode is the
> > de
> > facto standard isn't fair and there should be an active clear
> > transition plan, with something available to work with ... not just
> > ideas.
> 
> I understand your perspective. It is cheaper for you.
> 

Again, not because cheaper, considering the circumstances and the
basicness of this feature, this is the only right way to go for this
particular case.

> > Your claims are valid only when we are truly ready for migration.
> > we
> > are simply not and no one has a clear plan in the horizon, so i
> > don't
> > get this total freeze attitude of legacy mode, 
> 
> There will never be any L2 plan unless we say no to legacy
> extensions.
> 

And picking on basic/simple extensions will get you there ? i doubt it.

Being strict is one thing and I totally agree with the motivation, but
I strongly disagree with this total shutdown policy with no
alternatives and no real grounds other than the hope someone would
eventually listen and implement the migration path, meanwhile ALL users
MUST suffer regardless how trivial or basic their request is.

> > it should be just like ethtool we want to to replace it but we know
> > we are not there yet, so we carefully add only necessary things
> > with
> > lots of auditing, same should go here.
> 
> Worked out amazingly for ethtool, right?

Obviously, no one would have agreed to such total shutdown for ethtool,
we eventually decided not to block ethtool unless we have the netlink
adapter working .. legacy mode should get the same treatment.

Bottom line for the same reason we decided that ethtool is not totally
dead until ethtool netlink interface is complete, we should still
support selective and basic sriov legacy mode extensions until bridge
offloads is complete.
Jakub Kicinski Nov. 5, 2019, 11:10 p.m. UTC | #14
On Tue, 5 Nov 2019 22:52:18 +0000, Saeed Mahameed wrote:
> > > it should be just like ethtool we want to to replace it but we know
> > > we are not there yet, so we carefully add only necessary things
> > > with
> > > lots of auditing, same should go here.  
> > 
> > Worked out amazingly for ethtool, right?  
> 
> Obviously, no one would have agreed to such total shutdown for ethtool,
> we eventually decided not to block ethtool unless we have the netlink
> adapter working .. legacy mode should get the same treatment.
> 
> Bottom line for the same reason we decided that ethtool is not totally
> dead until ethtool netlink interface is complete, we should still
> support selective and basic sriov legacy mode extensions until bridge
> offloads is complete. 

But switchdev _is_ _here_. _Today_. From uAPI perspective it's done,
and ready. We're missing the driver and user space parts, but no core
and uAPI extensions. It's just L2 switching and there's quite a few
switch drivers upstream, as I'm sure you know :/ 

ethtool is not ready from uAPI perspective, still.

It'd be more accurate to compare to legacy IOCTLs, like arguing that 
we need a small tweak to IOCTLs when Netlink is already there..
Saeed Mahameed Nov. 5, 2019, 11:48 p.m. UTC | #15
On Tue, 2019-11-05 at 15:10 -0800, Jakub Kicinski wrote:
> On Tue, 5 Nov 2019 22:52:18 +0000, Saeed Mahameed wrote:
> > > > it should be just like ethtool we want to to replace it but we
> > > > know
> > > > we are not there yet, so we carefully add only necessary things
> > > > with
> > > > lots of auditing, same should go here.  
> > > 
> > > Worked out amazingly for ethtool, right?  
> > 
> > Obviously, no one would have agreed to such total shutdown for
> > ethtool,
> > we eventually decided not to block ethtool unless we have the
> > netlink
> > adapter working .. legacy mode should get the same treatment.
> > 
> > Bottom line for the same reason we decided that ethtool is not
> > totally
> > dead until ethtool netlink interface is complete, we should still
> > support selective and basic sriov legacy mode extensions until
> > bridge
> > offloads is complete. 
> 
> But switchdev _is_ _here_. _Today_. From uAPI perspective it's done,
> and ready. We're missing the driver and user space parts, but no core
> and uAPI extensions. It's just L2 switching and there's quite a few
> switch drivers upstream, as I'm sure you know :/ 
> 

I can say the same about netlink, it also was there, the missing part
was the netlink ethtool connection and userspace parts .. 

Just because switchdev uAPI is powerful enough to do anything it
doesn't mean we are ready, you said it, user space and drivers are not
ready, and frankly it is not on the road map, and we all know that it
could take years before we can sit back and relax that we got our L2
switching .. Just like what is happening now with ethtool, it been
years you know.. 

> ethtool is not ready from uAPI perspective, still.
> 

ethool is by definition a uAPI only tool, you can't compare the
technical details and challenges of both issue, each one has different
challenges, we need to compare complexity and deprecation policy impact
on consumers.

> It'd be more accurate to compare to legacy IOCTLs, like arguing that 
> we need a small tweak to IOCTLs when Netlink is already there..

Well, no, The right analog here would be:

netlink == switchdev mode
ethtool == legacy mode
netlink-ethtool-interface == L2 bridge offloads 

you do the math.

It is not about uAPI, uAPI could be ready for switchdev mode, maybe, it
just no one even gave this a real serious thought, so i can't accept
that bridge offloads is just around the corner just because switchdev
uAPI feels like ready to do anything, it is seriously not.
Jakub Kicinski Nov. 6, 2019, 1:38 a.m. UTC | #16
On Tue, 5 Nov 2019 23:48:15 +0000, Saeed Mahameed wrote:
> On Tue, 2019-11-05 at 15:10 -0800, Jakub Kicinski wrote:
> > But switchdev _is_ _here_. _Today_. From uAPI perspective it's done,
> > and ready. We're missing the driver and user space parts, but no core
> > and uAPI extensions. It's just L2 switching and there's quite a few
> > switch drivers upstream, as I'm sure you know :/ 
> 
> I can say the same about netlink, it also was there, the missing part
> was the netlink ethtool connection and userspace parts .. 

uAPI is the part that matters. No driver implements all the APIs. 
I'm telling you that the API for what you're trying to configure
already exists, and your driver should use it. Driver's technical 
debt is not my concern.

> Just because switchdev uAPI is powerful enough to do anything it
> doesn't mean we are ready, you said it, user space and drivers are not
> ready, and frankly it is not on the road map, 

I bet it's not on the road map. Product marketing sees only legacy
SR-IOV (table stakes) and OvS offload == switchdev (value add). 
L2 switchdev will never be implemented with that mind set.

In the upstream community, however, we care about the technical aspects.

> and we all know that it could take years before we can sit back and
> relax that we got our L2 switching .. 

Let's not be dramatic. It shouldn't take years to implement basic L2
switching offload.

> Just like what is happening now with ethtool, it been years you know..

Exactly my point!!! Nobody is going to lift a finger unless there is a
loud and resounding "no".

> > ethtool is not ready from uAPI perspective, still.
> 
> ethool is by definition a uAPI only tool, you can't compare the
> technical details and challenges of both issue, each one has different
> challenges, we need to compare complexity and deprecation policy impact
> on consumers.

Well, you started comparing the two.

API deprecation is always painful. The further you go with the legacy
API the more painful it will be to deprecate.

> > It'd be more accurate to compare to legacy IOCTLs, like arguing that 
> > we need a small tweak to IOCTLs when Netlink is already there..  
> 
> Well, no, The right analog here would be:
> 
> netlink == switchdev mode
> ethtool == legacy mode
> netlink-ethtool-interface == L2 bridge offloads 
> 
> you do the math.

My reading of the situation is that your maintenance cost goes down if
you manage to push some out of tree uAPI upstream, while the upstream
has no need for that uAPI. In fact it'd be far better for the project
to encourage people to work on SR-IOV offloads via normal kernel
constructs and not either some special case legacy VF ops or franken tc
OvS.

> It is not about uAPI, uAPI could be ready for switchdev mode, maybe, it
> just no one even gave this a real serious thought, so i can't accept
> that bridge offloads is just around the corner just because switchdev
> uAPI feels like ready to do anything, it is seriously not.

I had given switchdev L2 some thought. IDK what you'd call serious, 
I don't have code. We are doing some ridiculously complex OvS offloads
while most customers just need L2 plus maybe VXLAN encap and basic
ACLs. Which switchdev can do very nicely thanks to Cumulus folks.
Saeed Mahameed Nov. 6, 2019, 10:21 p.m. UTC | #17
On Tue, 2019-11-05 at 17:38 -0800, Jakub Kicinski wrote:
> On Tue, 5 Nov 2019 23:48:15 +0000, Saeed Mahameed wrote:
> > On Tue, 2019-11-05 at 15:10 -0800, Jakub Kicinski wrote:
> > > But switchdev _is_ _here_. _Today_. From uAPI perspective it's
> > > done,
> > > and ready. We're missing the driver and user space parts, but no
> > > core
> > > and uAPI extensions. It's just L2 switching and there's quite a
> > > few
> > > switch drivers upstream, as I'm sure you know :/ 
> > 
> > I can say the same about netlink, it also was there, the missing
> > part
> > was the netlink ethtool connection and userspace parts .. 
> 
> uAPI is the part that matters. No driver implements all the APIs. 
> I'm telling you that the API for what you're trying to configure
> already exists, and your driver should use it. Driver's technical 
> debt is not my concern.
> 
> > Just because switchdev uAPI is powerful enough to do anything it
> > doesn't mean we are ready, you said it, user space and drivers are
> > not
> > ready, and frankly it is not on the road map, 
> 
> I bet it's not on the road map. Product marketing sees only legacy
> SR-IOV (table stakes) and OvS offload == switchdev (value add). 
> L2 switchdev will never be implemented with that mind set.
> 
> In the upstream community, however, we care about the technical
> aspects.
> 
> > and we all know that it could take years before we can sit back and
> > relax that we got our L2 switching .. 
> 
> Let's not be dramatic. It shouldn't take years to implement basic L2
> switching offload.
> 
> > Just like what is happening now with ethtool, it been years you
> > know..
> 
> Exactly my point!!! Nobody is going to lift a finger unless there is
> a
> loud and resounding "no".
> 

Ok then, "no" new uAPI, although i still think there should be some
special cases to be allowed, but ... your call.

In the meanwhile i will figure out something to be driver only as
intermediate solution until we have full l2 offload, then i can ask
every one to move to full switchdev mode with a press of a button.
Jiri Pirko Nov. 7, 2019, 10:24 a.m. UTC | #18
Wed, Nov 06, 2019 at 11:21:37PM CET, saeedm@mellanox.com wrote:
>On Tue, 2019-11-05 at 17:38 -0800, Jakub Kicinski wrote:
>> On Tue, 5 Nov 2019 23:48:15 +0000, Saeed Mahameed wrote:
>> > On Tue, 2019-11-05 at 15:10 -0800, Jakub Kicinski wrote:
>> > > But switchdev _is_ _here_. _Today_. From uAPI perspective it's
>> > > done,
>> > > and ready. We're missing the driver and user space parts, but no
>> > > core
>> > > and uAPI extensions. It's just L2 switching and there's quite a
>> > > few
>> > > switch drivers upstream, as I'm sure you know :/ 
>> > 
>> > I can say the same about netlink, it also was there, the missing
>> > part
>> > was the netlink ethtool connection and userspace parts .. 
>> 
>> uAPI is the part that matters. No driver implements all the APIs. 
>> I'm telling you that the API for what you're trying to configure
>> already exists, and your driver should use it. Driver's technical 
>> debt is not my concern.
>> 
>> > Just because switchdev uAPI is powerful enough to do anything it
>> > doesn't mean we are ready, you said it, user space and drivers are
>> > not
>> > ready, and frankly it is not on the road map, 
>> 
>> I bet it's not on the road map. Product marketing sees only legacy
>> SR-IOV (table stakes) and OvS offload == switchdev (value add). 
>> L2 switchdev will never be implemented with that mind set.
>> 
>> In the upstream community, however, we care about the technical
>> aspects.
>> 
>> > and we all know that it could take years before we can sit back and
>> > relax that we got our L2 switching .. 
>> 
>> Let's not be dramatic. It shouldn't take years to implement basic L2
>> switching offload.
>> 
>> > Just like what is happening now with ethtool, it been years you
>> > know..
>> 
>> Exactly my point!!! Nobody is going to lift a finger unless there is
>> a
>> loud and resounding "no".
>> 
>
>Ok then, "no" new uAPI, although i still think there should be some
>special cases to be allowed, but ... your call.
>
>In the meanwhile i will figure out something to be driver only as

"something to be driver only". I'm curious what do you mean by that...


>intermediate solution until we have full l2 offload, then i can ask
>every one to move to full switchdev mode with a press of a button.
>
Keller, Jacob E Nov. 13, 2019, 10:55 p.m. UTC | #19
Hi Jakub,

On Tue, 2019-11-05 at 17:38 -0800, Jakub Kicinski wrote:
> In the upstream community, however, we care about the technical
> aspects.
> 
> > and we all know that it could take years before we can sit back and
> > relax that we got our L2 switching .. 
> 
> Let's not be dramatic. It shouldn't take years to implement basic L2
> switching offload.

I had meant to send something earlier in this thread, but never got
around to it. I wanted to ask your opinion and get some feedback.

We (Intel) have recently been investigating use of port representors
for enabling introspection and control of VFs in the host system after
they've been assigned to a virtual machine.

I had originally been thinking of adding these port representor netdevs
before we fully implement switchdev with the e-switch offloads. The
idea was to migrate to using port representors in either case.

However, from what it looks like on this thread, you'd rather propose
that we implement switchdev with basic L2 offload?

I'm not too familiar with switchdev, (trying to read and learn about so
that we can begin actually implementing it in our network drivers).

Based on your comments and feedback in this thread, it sounds like our
original idea to have a "legacy with port representors" mode is not
really something you'd like, because it would continue to encourage
avoiding migrating from legacy stack to switchdev model.

But, instead of trying to go fully towards implementing switchdev with
complicated OvS offloads, we could do a simpler approach that only
supports L2 offloads initially, and from these comments it seems this
is the direction you'd rather upstream persue?

> I had given switchdev L2 some thought. IDK what you'd call serious, 
> I don't have code. We are doing some ridiculously complex OvS
> offloads
> while most customers just need L2 plus maybe VXLAN encap and basic
> ACLs. Which switchdev can do very nicely thanks to Cumulus folks.

Based on this, it sounds like the switchdev API can do this L2
offloading and drivers simply need to enable it. If I understand
correctly, it  requires the system administrator to place the VF devies
into a bridge, rather than simply having the bridging hidden inside the
device.

Thanks,
Jake
Jakub Kicinski Nov. 14, 2019, 2:25 a.m. UTC | #20
On Wed, 13 Nov 2019 22:55:16 +0000, Keller, Jacob E wrote:
> Hi Jakub,
> 
> On Tue, 2019-11-05 at 17:38 -0800, Jakub Kicinski wrote:
> > In the upstream community, however, we care about the technical
> > aspects.
> >   
> > > and we all know that it could take years before we can sit back and
> > > relax that we got our L2 switching ..   
> > 
> > Let's not be dramatic. It shouldn't take years to implement basic L2
> > switching offload.  
> 
> I had meant to send something earlier in this thread, but never got
> around to it. I wanted to ask your opinion and get some feedback.
> 
> We (Intel) have recently been investigating use of port representors
> for enabling introspection and control of VFs in the host system after
> they've been assigned to a virtual machine.

Cool!

> I had originally been thinking of adding these port representor netdevs
> before we fully implement switchdev with the e-switch offloads. The
> idea was to migrate to using port representors in either case.
> 
> However, from what it looks like on this thread, you'd rather propose
> that we implement switchdev with basic L2 offload?
> 
> I'm not too familiar with switchdev, (trying to read and learn about so
> that we can begin actually implementing it in our network drivers).

So switchdev mode for SR-IOV NICs basically means that all ports are
represented by a netdev and no implicit switching happens in HW, if
packet is received on a port, be it VF or uplink - it's sent up to the
representor. That's pretty much it. Then you can install rules to
forward internally in the device.

Perhaps an obvious suggestion but did you consider converting
Documentation/networking/switchdev.txt to ReST and updating it as you
explore the code? ;)

> Based on your comments and feedback in this thread, it sounds like our
> original idea to have a "legacy with port representors" mode is not
> really something you'd like, because it would continue to encourage
> avoiding migrating from legacy stack to switchdev model.

Not at this point, no. 

I think this was all discussed before with Alex still strongly in the
netdev loops at Intel. I was initially accommodating to some partial
implementations like what you mention, since it'd had been good to have
Intel come on board with switchdev, and Fortville reportedly couldn't
disable leaking the packets which missed filters to the uplink.

IIRC at that point Or Gerlitz strongly drew the line at preserving
switchdev behaviour as described above - default to everything falls
back to host.

Today since many months have passed I don't think we should walk back
on that decision.

> But, instead of trying to go fully towards implementing switchdev with
> complicated OvS offloads, we could do a simpler approach that only
> supports L2 offloads initially, and from these comments it seems this
> is the direction you'd rather upstream persue?

Yes, I think simple L2 offload that supports common cases would be a
pretty cool starting point for a new switchdev implementation.

> > I had given switchdev L2 some thought. IDK what you'd call serious, 
> > I don't have code. We are doing some ridiculously complex OvS
> > offloads
> > while most customers just need L2 plus maybe VXLAN encap and basic
> > ACLs. Which switchdev can do very nicely thanks to Cumulus folks.  
> 
> Based on this, it sounds like the switchdev API can do this L2
> offloading and drivers simply need to enable it. If I understand
> correctly, it  requires the system administrator to place the VF devies
> into a bridge, rather than simply having the bridging hidden inside the
> device.

Yup. You may want to support only offloading of certain configuration
of the bridge to simplify your life, e.g. disable learning and flood
only to uplink..