Message ID: 20200525230556.1455927-1-idosch@idosch.org
Series: mlxsw: Various trap changes - part 2
On Tue, 26 May 2020 02:05:42 +0300 Ido Schimmel wrote:
> From: Ido Schimmel <idosch@mellanox.com>
>
> This patch set contains another set of small changes in mlxsw trap
> configuration. It is the last set before exposing control traps (e.g.,
> IGMP query, ARP request) via devlink-trap.

When traps were introduced my understanding was that they are for
reporting frames which hit an exception on the datapath. IOW, the
primary use for them was troubleshooting.

Now, if I'm following things correctly, we have explicit DHCP, IGMP,
ARP, ND, BFD, etc. traps. Are we still in the troubleshooting realm?
On Tue, May 26, 2020 at 03:14:37PM -0700, Jakub Kicinski wrote:
> When traps were introduced my understanding was that they are for
> reporting frames which hit an exception on the datapath. IOW the
> primary use for them was troubleshooting.
>
> Now, if I'm following things correctly we have explicit DHCP, IGMP,
> ARP, ND, BFD etc. traps. Are we still in the troubleshooting realm?

First of all, we always had them. This patch set mainly performs some
cleanups in mlxsw.

Second, I don't understand how you got the impression that the primary
use of devlink-trap is troubleshooting. I was very clear and transparent
about the scope of the work from the very beginning and I don't wish to
be portrayed as if I wasn't.

The first two paragraphs from the kernel documentation [1] explicitly
mention the ability to trap control packets to the CPU:

"Devices capable of offloading the kernel's datapath and perform
functions such as bridging and routing must also be able to send
specific packets to the kernel (i.e., the CPU) for processing.

For example, a device acting as a multicast-aware bridge must be able to
send IGMP membership reports to the kernel for processing by the bridge
module. Without processing such packets, the bridge module could never
populate its MDB."

In my reply to you from almost a year ago I outlined the entire plan for
devlink-trap [2]:

"Switch ASICs have dedicated traps for specific packets. Usually, these
packets are control packets (e.g., ARP, BGP) which are required for the
correct functioning of the control plane. You can see this in the SAI
interface, which is an abstraction layer over vendors' SDKs:

https://github.com/opencomputeproject/SAI/blob/master/inc/saihostif.h#L157

We need to be able to configure the hardware policers of these traps and
read their statistics to understand how many packets they dropped. We
currently do not have a way to do any of that and we rely on hardcoded
defaults in the driver which do not fit every use case (from
experience):

https://elixir.bootlin.com/linux/v5.2/source/drivers/net/ethernet/mellanox/mlxsw/spectrum.c#L4103

We plan to extend the devlink-trap mechanism to cover all these use
cases. I hope you agree that this functionality belongs in devlink given
it is a device-specific configuration and not a netdev-specific one.

That being said, in its current form, this mechanism is focused on traps
that correlate to packets the device decided to drop as this is very
useful for debugging."

In the last cycle, when I added the ability to configure trap policers
[3] I again mentioned under "Future plans" that I plan to "Add more
packet traps. For example, for control packets (e.g., IGMP)".

If you worry that packets received via control traps will be somehow
tunneled to user space via drop_monitor, then I can assure you this is
not the case. You can refer to this commit [4] from the next submission.

Thanks

[1] https://www.kernel.org/doc/html/latest/networking/devlink/devlink-trap.html
[2] https://lore.kernel.org/netdev/20190709123844.GA27309@splinter/
[3] https://lwn.net/Articles/815948/
[4] https://github.com/jpirko/linux_mlxsw/commit/4a5f0a8034f0f10301680fe68559e3debacf534d
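The trap and policer configuration described here is exposed through the
devlink CLI from iproute2. A minimal sketch, assuming a switch at the
illustrative address pci/0000:01:00.0:

```shell
# List all traps along with their type (control / exception / drop),
# current action and trap group binding.
devlink trap show

# Same listing, plus per-trap statistics: how many packets each trap
# delivered to the CPU (the -s flag enables statistics).
devlink -s trap show

# Tune the hardware policer bound to a trap group: 2000 packets per
# second with bursts of up to 512 packets (policer ID and values are
# illustrative).
devlink trap policer set pci/0000:01:00.0 policer 1 rate 2000 burst 512
```

These commands only affect the path from the offloaded data plane to the
CPU; forwarded traffic is untouched.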
On Wed, 27 May 2020 02:19:05 +0300 Ido Schimmel wrote:
> Second, I don't understand how you got the impression that the primary
> use of devlink-trap is troubleshooting. I was very clear and transparent
> about the scope of the work from the very beginning and I don't wish to
> be portrayed as if I wasn't.

Perhaps the troubleshooting is how I justified the necessity of having
devlink traps to myself. It's much harder to get visibility into what
HW is doing and therefore we need this special interface.

But we should be able to configure a DHCP daemon without any special
sauce. What purpose does the DHCP trap serve?

What's the packet flow for BFD? How does the HW case differ from a SW
router?
From: Ido Schimmel <idosch@idosch.org>
Date: Tue, 26 May 2020 02:05:42 +0300

> From: Ido Schimmel <idosch@mellanox.com>
>
> This patch set contains another set of small changes in mlxsw trap
> configuration. It is the last set before exposing control traps (e.g.,
> IGMP query, ARP request) via devlink-trap.
>
> Tested with existing devlink-trap selftests. Please see individual
> patches for a detailed changelog.

Since this is just a cleanup of existing code, series applied, as it is
independent of the discussion Ido is having with Jakub.
On Tue, May 26, 2020 at 04:43:23PM -0700, Jakub Kicinski wrote:
> Perhaps the troubleshooting is how I justified the necessity of having
> devlink traps to myself. It's much harder to get visibility into what
> HW is doing and therefore we need this special interface.
>
> But we should be able to configure a DHCP daemon without any special
> sauce. What purpose does the DHCP trap serve?
>
> What's the packet flow for BFD? How does the HW case differ from a SW
> router?

There is no special sauce required to get a DHCP daemon working, nor
BFD. It is supposed to Just Work. Same for IGMP / MLD snooping, STP,
etc. This is enabled by the ASIC trapping the required packets to the
CPU.

However, having a 3.2/6.4/12.8 Tbps ASIC (it keeps growing all the time)
send traffic to the CPU can very easily result in denial of service. You
need to have hardware policers and classification to different traffic
classes, ensuring the system remains functional regardless of the havoc
happening in the offloaded data path.

This control plane policy has been hard coded in mlxsw for a few years
now (based on sane defaults), but it obviously does not fit everyone's
needs. Different users have different use cases and different CPUs
connected to the ASIC. Some have Celeron / Atom CPUs while others have
more high-end Xeon CPUs, which are obviously capable of handling more
packets per second. You also have zero visibility into how many packets
were dropped by these hardware policers.

By exposing these traps we allow users to tune these policers and get
visibility into how many packets they dropped. In the future, also
changing their traffic class, so that (for example) packets hitting
local routes are scheduled towards the CPU before packets dropped due to
an ingress VLAN filter.

If you don't have any special needs you are probably OK with the
defaults, in which case you don't need to do anything (no special
sauce).
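The tuning and visibility described above map onto a handful of devlink
trap commands; a sketch, with an illustrative device address, policer ID,
and the generic "local_delivery" trap group (whether a given group and ID
exist depends on the device and kernel version):

```shell
# Bind the trap group carrying locally delivered packets to a
# dedicated policer.
devlink trap group set pci/0000:01:00.0 group local_delivery policer 2

# Tune that policer for a beefier CPU: 10000 packets per second,
# bursts of up to 1024 packets.
devlink trap policer set pci/0000:01:00.0 policer 2 rate 10000 burst 1024

# Read back how many packets the hardware policer dropped on the way
# to the CPU (the -s flag enables statistics).
devlink -s trap policer show pci/0000:01:00.0 policer 2
```

A system with a slower CPU would instead lower the rate, trading control
plane throughput for protection from being overwhelmed.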
On Wed, 27 May 2020 10:38:57 +0300 Ido Schimmel wrote:
> There is no special sauce required to get a DHCP daemon working nor BFD.
> It is supposed to Just Work. Same for IGMP / MLD snooping, STP etc. This
> is enabled by the ASIC trapping the required packets to the CPU.
>
> However, having a 3.2/6.4/12.8 Tbps ASIC (it keeps growing all the time)
> send traffic to the CPU can very easily result in denial of service. You
> need to have hardware policers and classification to different traffic
> classes ensuring the system remains functional regardless of the havoc
> happening in the offloaded data path.

I don't see how that's only applicable to a switch ASIC, though.
Ingress classification and rate limiting apply to any network system.

> This control plane policy has been hard coded in mlxsw for a few years
> now (based on sane defaults), but it obviously does not fit everyone's
> needs. Different users have different use cases and different CPUs
> connected to the ASIC. Some have Celeron / Atom while others have more
> high-end Xeon CPUs, which are obviously capable of handling more packets
> per second. You also have zero visibility into how many packets were
> dropped by these hardware policers.

There are embedded Atom systems out there with multi-gig interfaces;
they obviously can't ingest peak traffic. It doesn't matter whether they
are connected to a switch ASIC or a NIC.

> By exposing these traps we allow users to tune these policers and get
> visibility into how many packets they dropped. In the future also
> changing their traffic class, so that (for example), packets hitting
> local routes are scheduled towards the CPU before packets dropped due to
> ingress VLAN filter.
>
> If you don't have any special needs you are probably OK with the
> defaults, in which case you don't need to do anything (no special
> sauce).

As much as traps which forward traffic to the CPU fit the switch
programming model, we'd rather see a solution that offloads constructs
which are also applicable to the software world.

Sniffing dropped frames to troubleshoot is one thing, but IMHO traps
which default to "trap" are a bad smell.
On Wed, May 27, 2020 at 12:50:17PM -0700, Jakub Kicinski wrote:
> I don't see how that's only applicable to a switch ASIC, though.
> Ingress classification, and rate limiting applies to any network
> system.

This is not about ingress classification and rate limiting.

The classification does not happen at ingress. It happens at different
points throughout the pipeline, via hard-coded checks meant to identify
packets of interest. These checks look at both state (e.g., neighbour
miss, route miss) and packet fields (e.g., a BGP packet that hit a local
route).

Similarly, the rate limiting does not happen at ingress. It only applies
to packets that your offloaded data path decided should go to the
attached host (the control plane). You cannot perform the rate limiting
at ingress for the simple reason that you still do not know if the
packet should reach the control plane.

> There are embedded Atom systems out there with multi-gig interfaces,
> they obviously can't ingest peak traffic, doesn't matter whether they
> are connected to a switch ASIC or a NIC.

Not the same thing. Every packet received by such systems should reach
the attached host. The control plane and the data plane are the same.
The whole point of this work is to rate limit packets coming from your
offloaded data plane to the control plane.

> As much as traps which forward traffic to the CPU fit the switch
> programming model, we'd rather see a solution that offloads constructs
> which are also applicable to the software world.

In the software world the data plane and the control plane are the same.
The CPU sees every packet. IGMP packets trigger MDB modifications,
packets that incurred a neighbour miss trigger an ARP / ND, etc. These
are all control plane operations.

Once you separate your control plane from the data plane and offload the
latter to capable hardware (e.g., a switch ASIC), you create a need to
limit the packets coming from your data plane to the control plane. This
is a hardware-specific problem.

> Sniffing dropped frames to troubleshoot is one thing, but IMHO traps
> which default to "trap" are a bad smell.

These traps exist today. They are programmed by mlxsw during
initialization. Without them, basic stuff like DHCP/ARP/STP would not
work and you would need the "special sauce" you previously mentioned.

By exposing them via devlink-trap we allow users to configure their rate
from the offloaded data plane towards the control plane running on the
attached host. This is the only set operation you can do. Nothing else.

Anyway, I don't know how to argue with "bad smell". I held off on
sending the next patch set because this discussion was on-going, but at
this point I don't think it's possible for me to explain the problem and
solution in a clearer fashion, so I'll go ahead and send the patches.

Thanks
From: Ido Schimmel <idosch@mellanox.com>

This patch set contains another set of small changes in mlxsw trap
configuration. It is the last set before exposing control traps (e.g.,
IGMP query, ARP request) via devlink-trap.

Tested with existing devlink-trap selftests. Please see individual
patches for a detailed changelog.

Ido Schimmel (14):
  mlxsw: spectrum: Use dedicated trap group for ACL trap
  mlxsw: spectrum: Use same switch case for identical groups
  mlxsw: spectrum: Rename IPv6 ND trap group
  mlxsw: spectrum: Use same trap group for various IPv6 packets
  mlxsw: spectrum: Use separate trap group for FID miss
  mlxsw: spectrum: Use same trap group for local routes and link-local
    destination
  mlxsw: spectrum: Reduce priority of locally delivered packets
  mlxsw: switchx2: Move SwitchX-2 trap groups out of main enum
  mlxsw: spectrum_trap: Do not hard code "thin" policer identifier
  mlxsw: reg: Move all trap groups under the same enum
  mlxsw: spectrum: Share one group for all locally delivered packets
  mlxsw: spectrum: Treat IPv6 link-local SIP as an exception
  mlxsw: spectrum: Add packet traps for BFD packets
  mlxsw: spectrum_router: Allow programming link-local prefix routes

 drivers/net/ethernet/mellanox/mlxsw/reg.h     | 17 ++++----
 .../net/ethernet/mellanox/mlxsw/spectrum.c    | 40 +++++++++++--------
 .../ethernet/mellanox/mlxsw/spectrum_router.c |  6 ++-
 .../ethernet/mellanox/mlxsw/spectrum_trap.c   | 17 +++++---
 .../ethernet/mellanox/mlxsw/spectrum_trap.h   |  2 +
 .../net/ethernet/mellanox/mlxsw/switchx2.c    |  5 +++
 drivers/net/ethernet/mellanox/mlxsw/trap.h    |  2 +
 .../drivers/net/mlxsw/sharedbuffer.sh         |  2 +-
 8 files changed, 56 insertions(+), 35 deletions(-)