Message ID: 20200517003354.233373-1-marex@denx.de
Series: net: ks8851: Unify KS8851 SPI and MLL drivers
From: Marek Vasut <marex@denx.de>
Date: Sun, 17 May 2020 02:33:34 +0200

> The KS8851SNL/SNLI and KS8851-16MLL/MLLI/MLLU are very much the same
> pieces of silicon, except the former has an SPI interface, while the
> latter has a parallel bus interface. Thus far, Linux has two separate
> drivers for each and they are diverging considerably.
>
> This series unifies them into a single driver with small SPI and
> parallel bus specific parts. The approach here is to first separate
> out the SPI specific parts into a separate file, then add parallel bus
> accessors in another separate file and then finally remove the old
> parallel bus driver. The reason for replacing the old parallel bus
> driver is because the SPI bus driver is much higher quality.

What strikes me in these changes is all of the new indirect jumps in
the fast paths of TX and RX packet processing. It's just too much for
my eyes. :-)

Especially in the presence of Spectre mitigations, these costs are
quite non-trivial.

Seriously, I would recommend that instead of having these small
indirect helpers, just inline the differences into two instances of
the RX interrupt and the TX handler.
On 5/17/20 4:02 AM, David Miller wrote:
> From: Marek Vasut <marex@denx.de>
> Date: Sun, 17 May 2020 02:33:34 +0200
>
>> The KS8851SNL/SNLI and KS8851-16MLL/MLLI/MLLU are very much the same
>> pieces of silicon, except the former has an SPI interface, while the
>> latter has a parallel bus interface. Thus far, Linux has two separate
>> drivers for each and they are diverging considerably.
>>
>> This series unifies them into a single driver with small SPI and
>> parallel bus specific parts. The approach here is to first separate
>> out the SPI specific parts into a separate file, then add parallel
>> bus accessors in another separate file and then finally remove the
>> old parallel bus driver. The reason for replacing the old parallel
>> bus driver is because the SPI bus driver is much higher quality.
>
> What strikes me in these changes is all of the new indirect jumps in
> the fast paths of TX and RX packet processing. It's just too much for
> my eyes. :-)
>
> Especially in the presence of Spectre mitigations, these costs are
> quite non-trivial.
>
> Seriously, I would recommend that instead of having these small
> indirect helpers, just inline the differences into two instances of
> the RX interrupt and the TX handler.

So I was already led into reworking the entire series to do this
inlining once, after V1. It then turned out to be a horrible mess to
get everything to compile as modules and built-in, and then also with
only the parallel or only the SPI part as a module and the other way
around.

Since Lukas was very adamant about the performance impact of these
indirect calls, I did measurements, and it turned out the impact is
basically nonexistent. That is pretty much to be expected, since this
is a 100 Mbit Ethernet controller either on an SPI bus limited to
40 MHz (so effectively 40 Mbit Ethernet at most) or on a parallel bus.
The parallel bus option is the "more performant" one, and even that one
shows no difference in iperf with and without this series, either in
throughput or in latency.

So I spent more time and undid those changes again. And here is V6 of
the series.
On Sat, May 16, 2020 at 07:02:25PM -0700, David Miller wrote:
> > The KS8851SNL/SNLI and KS8851-16MLL/MLLI/MLLU are very much the same
> > pieces of silicon, except the former has an SPI interface, while the
> > latter has a parallel bus interface. Thus far, Linux has two separate
> > drivers for each and they are diverging considerably.
> >
> > This series unifies them into a single driver with small SPI and
> > parallel bus specific parts. The approach here is to first separate
> > out the SPI specific parts into a separate file, then add parallel
> > bus accessors in another separate file and then finally remove the
> > old parallel bus driver. The reason for replacing the old parallel
> > bus driver is because the SPI bus driver is much higher quality.
>
> What strikes me in these changes is all of the new indirect jumps in
> the fast paths of TX and RX packet processing. It's just too much for
> my eyes. :-)
>
> Especially in the presence of Spectre mitigations, these costs are
> quite non-trivial.
>
> Seriously, I would recommend that instead of having these small
> indirect helpers, just inline the differences into two instances of
> the RX interrupt and the TX handler.

I agree.

However, in terms of performance there's a bigger problem:

Previously, ks8851.c (the SPI driver) had 8-bit and 32-bit register
accessors. The present series drops them and performs a 32-bit access
as two 16-bit accesses and an 8-bit access as one 16-bit access,
because that's what ks8851_mll.c (the 16-bit parallel bus driver) does.
That has a real, measurable performance impact: in the case of 8-bit
accesses, another 8 bits need to be transferred over the SPI bus, and
in the case of 32-bit accesses, *two* SPI transfers need to be
performed.

The 8-bit and 32-bit accesses happen in ks8851_rx_pkts(), i.e. in the
RX hotpath.

I've provided numbers for the performance impact and even a patch to
solve them, but it was dismissed and not included in the present
series:

https://lore.kernel.org/netdev/20200420140700.6632hztejwcgjwsf@wunner.de/

The reason given for the dismissal was that I had performed the
measurements on 4.19, which is allegedly "long dead" (in Andrew Lunn's
words). However, I can assure you that performing two SPI transfers has
not magically become as fast as performing one SPI transfer since 4.19,
so the argument is nonsense.

Nevertheless, I was going to repeat the performance measurements on a
recent kernel but haven't gotten around to that yet, because the
measurements need to be performed with CONFIG_PREEMPT_RT_FULL to be
reliable (a vanilla kernel is too jittery). So I have to create a new
branch with RT patches on the test machine, which is fairly involved
and time consuming.

I think it's fair that the two drivers are unified, but the performance
of the SPI variant shouldn't be unnecessarily diminished in the
process.

Thanks,

Lukas
On 5/17/20 9:13 AM, Lukas Wunner wrote:
> On Sat, May 16, 2020 at 07:02:25PM -0700, David Miller wrote:
>>> The KS8851SNL/SNLI and KS8851-16MLL/MLLI/MLLU are very much the same
>>> pieces of silicon, except the former has an SPI interface, while the
>>> latter has a parallel bus interface. Thus far, Linux has two
>>> separate drivers for each and they are diverging considerably.
>>>
>>> This series unifies them into a single driver with small SPI and
>>> parallel bus specific parts. The approach here is to first separate
>>> out the SPI specific parts into a separate file, then add parallel
>>> bus accessors in another separate file and then finally remove the
>>> old parallel bus driver. The reason for replacing the old parallel
>>> bus driver is because the SPI bus driver is much higher quality.
>>
>> What strikes me in these changes is all of the new indirect jumps in
>> the fast paths of TX and RX packet processing. It's just too much for
>> my eyes. :-)
>>
>> Especially in the presence of Spectre mitigations, these costs are
>> quite non-trivial.
>>
>> Seriously, I would recommend that instead of having these small
>> indirect helpers, just inline the differences into two instances of
>> the RX interrupt and the TX handler.
>
> I agree.

I do not.

> However in terms of performance there's a bigger problem:
>
> Previously ks8851.c (SPI driver) had 8-bit and 32-bit register
> accessors. The present series drops them and performs a 32-bit access
> as two 16-bit accesses and an 8-bit access as one 16-bit access
> because that's what ks8851_mll.c (16-bit parallel bus driver) does.
> That has a real, measurable performance impact because in the case of
> 8-bit accesses, another 8 bits need to be transferred over the SPI
> bus, and in the case of 32-bit accesses, *two* SPI transfers need to
> be performed.
>
> The 8-bit and 32-bit accesses happen in ks8851_rx_pkts(), i.e. in the
> RX hotpath.
>
> I've provided numbers for the performance impact and even a patch to
> solve them but it was dismissed and not included in the present
> series:
>
> https://lore.kernel.org/netdev/20200420140700.6632hztejwcgjwsf@wunner.de/
>
> The reason given for the dismissal was that I had performed the
> measurements on 4.19 which is allegedly "long dead" (in Andrew Lunn's
> words). However I can assure you that performing two SPI transfers
> has not magically become as fast as performing one SPI transfer since
> 4.19. So the argument is nonsense.

I invested time and even obtained the SPI variant of the card to
perform actual comparative measurements on linux-next, on both the SPI
and the parallel variant, with iperf, for both latency and throughput,
and I do not observe this problem.

A month ago I even provided you a branch with all the patches and the
DT patch for the RPi3 (the platform you claim to use for these tests,
so I used the same), so you could perform the same test as I did, with
the same hardware and the same software. It should have been trivial to
reproduce the tests I did and their results.

> Nevertheless I was going to repeat the performance measurements on a
> recent kernel but haven't gotten around to that yet because the
> measurements need to be performed with CONFIG_PREEMPT_RT_FULL to
> be reliable (a vanilla kernel is too jittery), so I have to create
> a new branch with RT patches on the test machine, which is fairly
> involved and time consuming.
>
> I think it's fair that the two drivers are unified, but the
> performance for the SPI variant shouldn't be unnecessarily diminished
> in the process.

Could it be that your problem is related to this huge out-of-tree patch
you use, then?
> However in terms of performance there's a bigger problem:
>
> Previously ks8851.c (SPI driver) had 8-bit and 32-bit register
> accessors. The present series drops them and performs a 32-bit access
> as two 16-bit accesses and an 8-bit access as one 16-bit access
> because that's what ks8851_mll.c (16-bit parallel bus driver) does.
> That has a real, measurable performance impact because in the case of
> 8-bit accesses, another 8 bits need to be transferred over the SPI
> bus, and in the case of 32-bit accesses, *two* SPI transfers need to
> be performed.

How often does this happen on a per-packet basis? Packets are generally
a mixture of 50 bytes, 576 bytes and 1500 bytes in size, with the
majority being 576 bytes. Does an extra 8 or 16 bits per packet really
make that much difference? Or is the real problem the overhead of doing
the transaction, not the number of bytes transferred? If so, maybe the
abstraction needs to be slightly higher: not register access, but basic
functionality.

> Nevertheless I was going to repeat the performance measurements on a
> recent kernel but haven't gotten around to that yet because the
> measurements need to be performed with CONFIG_PREEMPT_RT_FULL to
> be reliable (a vanilla kernel is too jittery), so I have to create
> a new branch with RT patches on the test machine, which is fairly
> involved and time consuming.

I assume you will then mainline the changes, so you don't need to do it
again? That is the problem with doing development work on a dead
kernel.

	Andrew
> So I was already led into reworking the entire series to do this
> inlining once, after V1. It then turned out it's a horrible mess to
> get everything to compile as modules and built-in and then also only
> the parallel/SPI as a module and then the other way around.

Maybe consider some trade-offs. Have both sets of accessors in the
core, and then thin wrappers around it to probe on each bus type. You
bloat the core, but avoid the indirection. You can also have the core
as a standalone module which exports symbols for the wrappers to use.
It does take some Kconfig work to get built-in vs. modules correct, but
there are people who can help. It is also not considered a regression
if you reduce the options in terms of module vs. built-in.

	Andrew
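Andrew's layout could look roughly like the following Kconfig sketch
(the KS8851_COMMON symbol is hypothetical here; only KS8851 and
KS8851_MLL existed at the time): a silent tristate core that both bus
front-ends select, so each front-end stays an independently buildable
thin module.

```kconfig
# Hypothetical sketch: shared core plus thin per-bus front-ends.
config KS8851_COMMON
	tristate

config KS8851
	tristate "Micrel KS8851 SNL/SNLI (SPI)"
	depends on SPI
	select KS8851_COMMON

config KS8851_MLL
	tristate "Micrel KS8851 MLL/MLLI/MLLU (parallel bus)"
	depends on HAS_IOMEM
	select KS8851_COMMON
```

With `select`, the core is promoted to built-in whenever either
front-end is built in and can stay a module when both front-ends are
modules; the core then exports the accessor and probe helpers the
wrappers call.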
From: Andrew Lunn <andrew@lunn.ch>
Date: Sun, 17 May 2020 21:16:35 +0200

>> Nevertheless I was going to repeat the performance measurements on a
>> recent kernel but haven't gotten around to that yet because the
>> measurements need to be performed with CONFIG_PREEMPT_RT_FULL to
>> be reliable (a vanilla kernel is too jittery), so I have to create
>> a new branch with RT patches on the test machine, which is fairly
>> involved and time consuming.
>
> I assume you will then mainline the changes, so you don't need to do
> it again? That is the problem with doing development work on a dead
> kernel.

I think the limitation is outside of his control as not all of the RT
patches are in mainline yet.
On 5/17/20 9:26 PM, Andrew Lunn wrote:
>> So I was already led into reworking the entire series to do this
>> inlining once, after V1. It then turned out it's a horrible mess to
>> get everything to compile as modules and built-in and then also only
>> the parallel/SPI as a module and then the other way around.
>
> Maybe consider some trade offs. Have both sets of accessors in the
> core, and then thin wrappers around it to probe on each bus type. You
> bloat the core, but avoid the indirection. You can also have the core
> as a standalone module, which exports symbols for the wrappers to
> use. It does take some Kconfig work to get built in vs modules
> correct, but there are people who can help. It is also not considered
> a regression if you reduce the options in terms of module vs built
> in.

I think this is what we attempted with V1/V2/V3 already, except back
then it was to improve performance, which turned out to be a non-issue:
the performance is the same with or without the indirect accessors. Of
course it is, since the interface is so slow that the indirect
accessors make no difference, and on top of that the driver
communicates with the NIC through the SPI core and SPI controller
drivers, which are full of this indirection already. Or do I
misunderstand something?
On 5/17/20 9:30 PM, David Miller wrote:
> From: Andrew Lunn <andrew@lunn.ch>
> Date: Sun, 17 May 2020 21:16:35 +0200
>
>>> Nevertheless I was going to repeat the performance measurements on a
>>> recent kernel but haven't gotten around to that yet because the
>>> measurements need to be performed with CONFIG_PREEMPT_RT_FULL to
>>> be reliable (a vanilla kernel is too jittery), so I have to create
>>> a new branch with RT patches on the test machine, which is fairly
>>> involved and time consuming.
>>
>> I assume you will then mainline the changes, so you don't need to do
>> it again? That is the problem with doing development work on a dead
>> kernel.
>
> I think the limitation is outside of his control as not all of the RT
> patches are in mainline yet.

So how shall we proceed here? Some basic measurement results are here,
for reference [1].

[1] https://www.spinics.net/lists/netdev/msg643099.html