mbox series

[V6,00/20] net: ks8851: Unify KS8851 SPI and MLL drivers

Message ID 20200517003354.233373-1-marex@denx.de
Headers show
Series net: ks8851: Unify KS8851 SPI and MLL drivers | expand

Message

Marek Vasut May 17, 2020, 12:33 a.m. UTC
The KS8851SNL/SNLI and KS8851-16MLL/MLLI/MLLU are very much the same pieces
of silicon, except the former has an SPI interface, while the later has a
parallel bus interface. Thus far, Linux has two separate drivers for each
and they are diverging considerably.

This series unifies them into a single driver with small SPI and parallel
bus specific parts. The approach here is to first separate out the SPI
specific parts into a separate file, then add parallel bus accessors in
another separate file and then finally remove the old parallel bus driver.
The reason for replacing the old parallel bus driver is because the SPI
bus driver is much higher quality.

Marek Vasut (20):
  net: ks8851: Factor out spi->dev in probe()/remove()
  net: ks8851: Rename ndev to netdev in probe
  net: ks8851: Replace dev_err() with netdev_err() in IRQ handler
  net: ks8851: Pass device node into ks8851_init_mac()
  net: ks8851: Use devm_alloc_etherdev()
  net: ks8851: Use dev_{get,set}_drvdata()
  net: ks8851: Remove ks8851_rdreg32()
  net: ks8851: Use 16-bit writes to program MAC address
  net: ks8851: Use 16-bit read of RXFC register
  net: ks8851: Factor out bus lock handling
  net: ks8851: Factor out SKB receive function
  net: ks8851: Split out SPI specific entries in struct ks8851_net
  net: ks8851: Split out SPI specific code from probe() and remove()
  net: ks8851: Factor out TX work flush function
  net: ks8851: Permit overridding interrupt enable register
  net: ks8851: Implement register, FIFO, lock accessor callbacks
  net: ks8851: Separate SPI operations into separate file
  net: ks8851: Implement Parallel bus operations
  net: ks8851: Remove ks8851_mll.c
  net: ks8851: Drop define debug and pr_fmt()

 drivers/net/ethernet/micrel/Kconfig           |    2 +
 drivers/net/ethernet/micrel/Makefile          |    2 +
 drivers/net/ethernet/micrel/ks8851.h          |  142 +-
 .../micrel/{ks8851.c => ks8851_common.c}      |  698 ++-------
 drivers/net/ethernet/micrel/ks8851_mll.c      | 1393 -----------------
 drivers/net/ethernet/micrel/ks8851_par.c      |  351 +++++
 drivers/net/ethernet/micrel/ks8851_spi.c      |  481 ++++++
 7 files changed, 1116 insertions(+), 1953 deletions(-)
 rename drivers/net/ethernet/micrel/{ks8851.c => ks8851_common.c} (62%)
 delete mode 100644 drivers/net/ethernet/micrel/ks8851_mll.c
 create mode 100644 drivers/net/ethernet/micrel/ks8851_par.c
 create mode 100644 drivers/net/ethernet/micrel/ks8851_spi.c

Cc: David S. Miller <davem@davemloft.net>
Cc: Lukas Wunner <lukas@wunner.de>
Cc: Petr Stetiar <ynezz@true.cz>
Cc: YueHaibing <yuehaibing@huawei.com>

Comments

David Miller May 17, 2020, 2:02 a.m. UTC | #1
From: Marek Vasut <marex@denx.de>
Date: Sun, 17 May 2020 02:33:34 +0200

> The KS8851SNL/SNLI and KS8851-16MLL/MLLI/MLLU are very much the same pieces
> of silicon, except the former has an SPI interface, while the later has a
> parallel bus interface. Thus far, Linux has two separate drivers for each
> and they are diverging considerably.
> 
> This series unifies them into a single driver with small SPI and parallel
> bus specific parts. The approach here is to first separate out the SPI
> specific parts into a separate file, then add parallel bus accessors in
> another separate file and then finally remove the old parallel bus driver.
> The reason for replacing the old parallel bus driver is because the SPI
> bus driver is much higher quality.

What strikes me in these changes is all of the new indirect jumps in
the fast paths of TX and RX packet processing.  It's just too much for
my eyes. :-)

Especially in the presence of Spectre mitigations, these costs are
quite non-trivial.

Seriously, I would recommend that instead of having these small
indirect helpers, just inline the differences into two instances of
the RX interrupt and the TX handler.
Marek Vasut May 17, 2020, 2:27 a.m. UTC | #2
On 5/17/20 4:02 AM, David Miller wrote:
> From: Marek Vasut <marex@denx.de>
> Date: Sun, 17 May 2020 02:33:34 +0200
> 
>> The KS8851SNL/SNLI and KS8851-16MLL/MLLI/MLLU are very much the same pieces
>> of silicon, except the former has an SPI interface, while the later has a
>> parallel bus interface. Thus far, Linux has two separate drivers for each
>> and they are diverging considerably.
>>
>> This series unifies them into a single driver with small SPI and parallel
>> bus specific parts. The approach here is to first separate out the SPI
>> specific parts into a separate file, then add parallel bus accessors in
>> another separate file and then finally remove the old parallel bus driver.
>> The reason for replacing the old parallel bus driver is because the SPI
>> bus driver is much higher quality.
> 
> What strikes me in these changes is all of the new indirect jumps in
> the fast paths of TX and RX packet processing.  It's just too much for
> my eyes. :-)
> 
> Especially in the presence of Spectre mitigations, these costs are
> quite non-trivial.
> 
> Seriously, I would recommend that instead of having these small
> indirect helpers, just inline the differences into two instances of
> the RX interrupt and the TX handler.

So I was already led into reworking the entire series to do this
inlining once, after V1. It then turned out it's a horrible mess to get
everything to compile as modules and built-in and then also only the
parallel/SPI as a module and then the other way around.

Since Lukas was very adamant about the performance impact of these
indirect calls, I did measurements and it turned out the impact is
basically nonexistent. And that's pretty much to be expected, since this
is a 100 Mbit ethernet on SPI bus limited to 40 MHz (so effectively a 40
Mbit ethernet tops) or on parallel bus. The parallel bus option is the
"more performant" one and even that one shows no difference with and
without this series in iperf, either in throughput or latency.

So I spent more time and undid those changes again. And here is the V6
of the series.
Lukas Wunner May 17, 2020, 7:13 a.m. UTC | #3
On Sat, May 16, 2020 at 07:02:25PM -0700, David Miller wrote:
> > The KS8851SNL/SNLI and KS8851-16MLL/MLLI/MLLU are very much the same pieces
> > of silicon, except the former has an SPI interface, while the later has a
> > parallel bus interface. Thus far, Linux has two separate drivers for each
> > and they are diverging considerably.
> > 
> > This series unifies them into a single driver with small SPI and parallel
> > bus specific parts. The approach here is to first separate out the SPI
> > specific parts into a separate file, then add parallel bus accessors in
> > another separate file and then finally remove the old parallel bus driver.
> > The reason for replacing the old parallel bus driver is because the SPI
> > bus driver is much higher quality.
> 
> What strikes me in these changes is all of the new indirect jumps in
> the fast paths of TX and RX packet processing.  It's just too much for
> my eyes. :-)
> 
> Especially in the presence of Spectre mitigations, these costs are
> quite non-trivial.
> 
> Seriously, I would recommend that instead of having these small
> indirect helpers, just inline the differences into two instances of
> the RX interrupt and the TX handler.

I agree.

However in terms of performance there's a bigger problem:

Previously ks8851.c (SPI driver) had 8-bit and 32-bit register accessors.
The present series drops them and performs a 32-bit access as two 16-bit
accesses and an 8-bit access as one 16-bit access because that's what
ks8851_mll.c (16-bit parallel bus driver) does.  That has a real,
measurable performance impact because in the case of 8-bit accesses,
another 8 bits need to be transferred over the SPI bus, and in the case
of 32-bit accesses, *two* SPI transfers need to be performed.

The 8-bit and 32-bit accesses happen in ks8851_rx_pkts(), i.e. in the
RX hotpath.  I've provided numbers for the performance impact and even
a patch to solve them but it was dismissed and not included in the
present series:

https://lore.kernel.org/netdev/20200420140700.6632hztejwcgjwsf@wunner.de/

The reason given for the dismissal was that I had performed the measurements
on 4.19 which is allegedly "long dead" (in Andrew Lunn's words).
However I can assure you that performing two SPI transfers has not
magically become as fast as performing one SPI transfer since 4.19.
So the argument is nonsense.

Nevertheless I was going to repeat the performance measurements on a
recent kernel but haven't gotten around to that yet because the
measurements need to be performed with CONFIG_PREEMPT_RT_FULL to
be reliable (a vanilla kernel is too jittery), so I have to create
a new branch with RT patches on the test machine, which is fairly
involved and time consuming.

I think it's fair that the two drivers are unified, but the performance
for the SPI variant shouldn't be unnecessarily diminished in the process.

Thanks,

Lukas
Marek Vasut May 17, 2020, 12:36 p.m. UTC | #4
On 5/17/20 9:13 AM, Lukas Wunner wrote:
> On Sat, May 16, 2020 at 07:02:25PM -0700, David Miller wrote:
>>> The KS8851SNL/SNLI and KS8851-16MLL/MLLI/MLLU are very much the same pieces
>>> of silicon, except the former has an SPI interface, while the later has a
>>> parallel bus interface. Thus far, Linux has two separate drivers for each
>>> and they are diverging considerably.
>>>
>>> This series unifies them into a single driver with small SPI and parallel
>>> bus specific parts. The approach here is to first separate out the SPI
>>> specific parts into a separate file, then add parallel bus accessors in
>>> another separate file and then finally remove the old parallel bus driver.
>>> The reason for replacing the old parallel bus driver is because the SPI
>>> bus driver is much higher quality.
>>
>> What strikes me in these changes is all of the new indirect jumps in
>> the fast paths of TX and RX packet processing.  It's just too much for
>> my eyes. :-)
>>
>> Especially in the presence of Spectre mitigations, these costs are
>> quite non-trivial.
>>
>> Seriously, I would recommend that instead of having these small
>> indirect helpers, just inline the differences into two instances of
>> the RX interrupt and the TX handler.
> 
> I agree.

I do not.

> However in terms of performance there's a bigger problem:
> 
> Previously ks8851.c (SPI driver) had 8-bit and 32-bit register accessors.
> The present series drops them and performs a 32-bit access as two 16-bit
> accesses and an 8-bit access as one 16-bit access because that's what
> ks8851_mll.c (16-bit parallel bus driver) does.  That has a real,
> measurable performance impact because in the case of 8-bit accesses,
> another 8 bits need to be transferred over the SPI bus, and in the case
> of 32-bit accesses, *two* SPI transfers need to be performed.
> 
> The 8-bit and 32-bit accesses happen in ks8851_rx_pkts(), i.e. in the
> RX hotpath.  I've provided numbers for the performance impact and even
> a patch to solve them but it was dismissed and not included in the
> present series:
> 
> https://lore.kernel.org/netdev/20200420140700.6632hztejwcgjwsf@wunner.de/
> 
> The reason given for the dismissal was that I had performed the measurements
> on 4.19 which is allegedly "long dead" (in Andrew Lunn's words).
> However I can assure you that performing two SPI transfers has not
> magically become as fast as performing one SPI transfer since 4.19.
> So the argument is nonsense.

I invested time and even obtained the SPI variant of the card to perform
actual comparative measurements on linux-next both on the SPI and
parallel variant with iperf, both for latency and throughput, and I do
not observe this problem.

A month ago, I even provided you a branch with all the patches and the
DT patch for RPi3 (the platform you claim to use for these tests, so I
used the same) so you can perform the same test as I did, with the same
hardware and the same software. So it should have been trivial to
reproduce the tests I did and their results.

> Nevertheless I was going to repeat the performance measurements on a
> recent kernel but haven't gotten around to that yet because the
> measurements need to be performed with CONFIG_PREEMPT_RT_FULL to
> be reliable (a vanilla kernel is too jittery), so I have to create
> a new branch with RT patches on the test machine, which is fairly
> involved and time consuming.
> 
> I think it's fair that the two drivers are unified, but the performance
> for the SPI variant shouldn't be unnecessarily diminished in the process.

Could it be that your problem is related to this huge out-of-tree patch
you use then ?
Andrew Lunn May 17, 2020, 7:16 p.m. UTC | #5
> However in terms of performance there's a bigger problem:
> 
> Previously ks8851.c (SPI driver) had 8-bit and 32-bit register accessors.
> The present series drops them and performs a 32-bit access as two 16-bit
> accesses and an 8-bit access as one 16-bit access because that's what
> ks8851_mll.c (16-bit parallel bus driver) does.  That has a real,
> measurable performance impact because in the case of 8-bit accesses,
> another 8 bits need to be transferred over the SPI bus, and in the case
> of 32-bit accesses, *two* SPI transfers need to be performed.

How often does this happen on a per packet basis? Packets are
generally a mixture of 50bytes, 576bytes and 1500bytes in size, with
the majority being 576 bytes. Does an extra 8 or 16 bits per packet
really make that much difference? Or is the real problem the overheads
of doing the transaction, not the number of bytes transferred? If so,
maybe the abstractions needs to be slightly higher, not register
access, but basic functionality.

> Nevertheless I was going to repeat the performance measurements on a
> recent kernel but haven't gotten around to that yet because the
> measurements need to be performed with CONFIG_PREEMPT_RT_FULL to
> be reliable (a vanilla kernel is too jittery), so I have to create
> a new branch with RT patches on the test machine, which is fairly
> involved and time consuming.

I assume you will then mainline the changes, so you don't need to do
it again? That is the problem with doing development work on a dead
kernel.

	Andrew
Andrew Lunn May 17, 2020, 7:26 p.m. UTC | #6
> So I was already led into reworking the entire series to do this
> inlining once, after V1. It then turned out it's a horrible mess to get
> everything to compile as modules and built-in and then also only the
> parallel/SPI as a module and then the other way around.

Maybe consider some trade offs. Have both sets of accessors in the
core, and then thin wrappers around it to probe on each bus type. You
bloat the core, but avoid the indirection. You can also have the core
as a standalone module, which exports symbols for the wrappers to
use. It does take some Kconfig work to get built in vs modules
correct, but there are people who can help. It is also not considered
a regression if you reduce the options in terms of module vs built in.

   Andrew
David Miller May 17, 2020, 7:30 p.m. UTC | #7
From: Andrew Lunn <andrew@lunn.ch>
Date: Sun, 17 May 2020 21:16:35 +0200

>> Nevertheless I was going to repeat the performance measurements on a
>> recent kernel but haven't gotten around to that yet because the
>> measurements need to be performed with CONFIG_PREEMPT_RT_FULL to
>> be reliable (a vanilla kernel is too jittery), so I have to create
>> a new branch with RT patches on the test machine, which is fairly
>> involved and time consuming.
> 
> I assume you will then mainline the changes, so you don't need to do
> it again? That is the problem with doing development work on a dead
> kernel.

I think the limitation is outside of his control as not all of the RT
patches are in mainline yet.
Marek Vasut May 17, 2020, 8:08 p.m. UTC | #8
On 5/17/20 9:26 PM, Andrew Lunn wrote:
>> So I was already led into reworking the entire series to do this
>> inlining once, after V1. It then turned out it's a horrible mess to get
>> everything to compile as modules and built-in and then also only the
>> parallel/SPI as a module and then the other way around.
> 
> Maybe consider some trade offs. Have both sets of accessors in the
> core, and then thin wrappers around it to probe on each bus type. You
> bloat the core, but avoid the indirection. You can also have the core
> as a standalone module, which exports symbols for the wrappers to
> use. It does take some Kconfig work to get built in vs modules
> correct, but there are people who can help. It is also not considered
> a regression if you reduce the options in terms of module vs built in.

I think this is what we attempted with V1/V2/V3 already, except back
then it was to improve performance, which turned out to be a no-issue,
as the performance is the same with or without the indirect accessors
(of course it is, the interface is so slow the indirect accessors make
no difference, plus add into it that it's communicating with the NIC
through SPI core and SPI drivers, which are full of this indirection
already).

Or do I misunderstand something?
Marek Vasut May 21, 2020, 1:11 p.m. UTC | #9
On 5/17/20 9:30 PM, David Miller wrote:
> From: Andrew Lunn <andrew@lunn.ch>
> Date: Sun, 17 May 2020 21:16:35 +0200
> 
>>> Nevertheless I was going to repeat the performance measurements on a
>>> recent kernel but haven't gotten around to that yet because the
>>> measurements need to be performed with CONFIG_PREEMPT_RT_FULL to
>>> be reliable (a vanilla kernel is too jittery), so I have to create
>>> a new branch with RT patches on the test machine, which is fairly
>>> involved and time consuming.
>>
>> I assume you will then mainline the changes, so you don't need to do
>> it again? That is the problem with doing development work on a dead
>> kernel.
> 
> I think the limitation is outside of his control as not all of the RT
> patches are in mainline yet.

So how shall we proceed here?

Some basic measurement results are here, for the reference [1].

[1] https://www.spinics.net/lists/netdev/msg643099.html