[RFT,v3,00/27] Apple M1 SoC platform bring-up

Message ID 20210304213902.83903-1-marcan@marcan.st

Message

Hector Martin March 4, 2021, 9:38 p.m. UTC
This series brings up initial support for the Apple M1 SoC, used in the
2020 Mac Mini, MacBook Pro, and MacBook Air models.

The following features are supported in this initial port:

- UART (samsung-style) with earlycon support
- Interrupts, including affinity and IPIs (Apple Interrupt Controller)
- SMP (through standard spin-table support)
- simplefb-based framebuffer
- Devicetree for the Mac Mini (should work for the others too at this
  stage)

See below for an overview of changes since v2.

== Merge notes ==

This patchset depends on both the nVHE changes that are already in
5.12-rc1, as well as the FIQ support work currently being reviewed
at [1]. A tree containing this patchset on top of the required
dependencies is available at [2][3]. Alternatively, you may apply
this series on top of Mark's tree at the arm64-fiq-20210302 tag [4][5].

This series is expected to be merged by Arnd via the SoC tree.
Maintainers, please ack if you are happy with the patches. We will
coordinate with Arnd and Mark on the FIQ series to make sure all
that goes smoothly.

This is marked RFT as it needs some basic testing on other ARM
platforms; in particular the Samsung UART driver changes need to be
tested, and the ioremap()/timer changes could use at least a smoke test
on some other arm64 platform. I would appreciate any test reports.

[1] https://lore.kernel.org/linux-arm-kernel/20210302101211.2328-1-mark.rutland@arm.com/T/#t
[2] git://github.com/AsahiLinux/linux.git upstream-bringup-v3
[3] https://github.com/AsahiLinux/linux/tree/upstream-bringup-v3
[4] git://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git arm64-fiq-20210302
[5] https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/tag/?h=arm64-fiq-20210302

== Testing notes ==

This has been tested on an Apple M1 Mac Mini booting to a framebuffer
and serial console, with SMP and (newly tested since v2) KASLR, with
an arm64 defconfig (+ CONFIG_FB_SIMPLE for the fb).

An arm32 build has been smoke-tested in qemu with the smdkc210 machine,
to test the samsung_tty changes on non-Apple (emulated) hardware.

== Patch overview ==

- 01    Cope with CPUs stuck in VHE mode
        This is a follow-up to the recent nVHE/cpufeature changes in
        mainline by Mark, adding a workaround for the M1's lack of
        nVHE mode.
- 02-03 Core platform DT bindings
- 04-05 CPU DT bindings and MIDR defines
- 06-07 Add interrupt-names support to the ARM timer driver
        This is used to cleanly express the lack of a secure timer;
        platforms in the past have used various hacks like dummy
        IRQs here.
- 08-12 ioremap_np() (nGnRnE) support
        The fabric in these SoCs only supports nGnRnE accesses for
        standard MMIO, except for PCI ranges which use nGnRE. Linux
        currently defaults to the latter on ARM64, so this adds a new
        ioremap type and DT properties to automatically select it for
        drivers using OF and devm abstractions, under buses specified
        in DT (a brief usage sketch follows this overview).
- 13-16 AIC (Apple Interrupt Controller) driver and support defines
        This also embeds FIQ handling for this platform.
- 17    Introduce CONFIG_ARCH_APPLE & add it to defconfig
- 18-25 Add Apple SoC support to the samsung_tty driver
        This includes several refactoring patches that let the new
        support be introduced cleanly, as well as a switch to
        devm_ioremap_resource so the driver can use the nGnRnE support
        introduced above. Earlycon support is included with a
        special-case nGnRnE hack, as earlycon is too early to use the
        generic infrastructure.
- 26    simple-framebuffer bindings for Apple (trivial)
- 27    Add the initial M1 Mac Mini (j274) devicetree
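
For readers unfamiliar with the new API, here is a minimal,
illustrative sketch of how a driver picks up the non-posted MMIO
support from patches 08-12. This is not code from the series; the
resource/register names are made up, and most drivers get the right
mapping for free via devm_ioremap_resource() once the devicetree marks
the enclosing bus as non-posted:

	/* Preferred path: the devm helper selects nGnRnE automatically
	 * when the resource is flagged as non-posted from the DT. */
	base = devm_ioremap_resource(&pdev->dev, res);
	if (IS_ERR(base))
		return PTR_ERR(base);

	/* Explicit path: request nGnRnE directly; ioremap_np() returns
	 * NULL on architectures that do not implement it, so a caller
	 * can fall back to the default mapping. */
	base = ioremap_np(res->start, resource_size(res));
	if (!base)
		base = ioremap(res->start, resource_size(res));
	if (!base)
		return -ENOMEM;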

== About the hardware ==

These machines officially support booting unsigned/user-provided
XNU-like kernels, with a very different boot protocol and devicetree
format. We are developing an initial bootloader, m1n1 [1], to take care
of as many hardware peculiarities as possible and present a standard
Linux arm64 boot protocol and device tree. In the future, I expect that
production setups will add U-Boot and perhaps GRUB into the boot chain,
to make the boot process similar to other ARM64 platforms.

The machines expose their debug UART over USB Type C, triggered with
vendor-specific USB-PD commands. Currently, the easiest way to get a
serial console on these machines is to use a second M1 box and a simple
USB C cable [2]. You can also build a DIY interface using an Arduino, a
FUSB302 chip or board, and a 1.2V UART-TTL adapter [3]. In the coming
weeks we will be designing an open hardware project to provide
serial/debug connectivity to these machines (and, hopefully, also
support other UART-over-Type C setups from other vendors). Please
contact me privately if you are interested in getting an early prototype
version of one of these devices.

A quickstart guide to booting Linux kernels on these machines is
available at [4], and we are documenting the hardware at [5].

[1] https://github.com/AsahiLinux/m1n1/
[2] https://github.com/AsahiLinux/macvdmtool/
[3] https://github.com/AsahiLinux/vdmtool/
[4] https://github.com/AsahiLinux/docs/wiki/Developer-Quickstart
[5] https://github.com/AsahiLinux/docs/wiki

== Project Blurb ==

Asahi Linux is an open community project dedicated to developing and
maintaining mainline support for Apple Silicon on Linux. Feel free to
drop by #asahi and #asahi-dev on freenode to chat with us, or check
our website for more information on the project:

https://asahilinux.org/

== Changes since v2 ==

* Removed FIQ support patches, as this is now being handled as a
  separate series.
* Added nVHE workaround patch from Marc
* Renamed dts(i) files to better match conventions used in other
  platforms
* Renamed 'm1' in compatible/dts names to 't8103', to be better
  prepared for the chance of multiple SoCs being released under the
  same marketing name.
* Reworded device tree binding text for the platform.
* Changed the default ioremap_np() implementation to return NULL,
  like ioremap_uc().
* Added general documentation for ioremap() variants, including the
  newly introduced one.
* Reworked virtual IPI support in the AIC driver, and attempted
  to thoroughly shave the memory ordering yak.
* Moved GIC registers to sysregs.h instead of including that in the AIC
  driver.
* Added _EL1 suffixes to Apple sysregs.
* Addressed further review comments and feedback.


Arnd Bergmann (1):
  docs: driver-api: device-io: Document I/O access functions

Hector Martin (25):
  dt-bindings: vendor-prefixes: Add apple prefix
  dt-bindings: arm: apple: Add bindings for Apple ARM platforms
  dt-bindings: arm: cpus: Add apple,firestorm & icestorm compatibles
  arm64: cputype: Add CPU implementor & types for the Apple M1 cores
  dt-bindings: timer: arm,arch_timer: Add interrupt-names support
  arm64: arch_timer: implement support for interrupt-names
  asm-generic/io.h:  Add a non-posted variant of ioremap()
  docs: driver-api: device-io: Document ioremap() variants & access
    funcs
  arm64: Implement ioremap_np() to map MMIO as nGnRnE
  of/address: Add infrastructure to declare MMIO as non-posted
  arm64: Add Apple vendor-specific system registers
  arm64: move ICH_ sysreg bits from arm-gic-v3.h to sysreg.h
  dt-bindings: interrupt-controller: Add DT bindings for apple-aic
  irqchip/apple-aic: Add support for the Apple Interrupt Controller
  arm64: Kconfig: Introduce CONFIG_ARCH_APPLE
  tty: serial: samsung_tty: Separate S3C64XX ops structure
  tty: serial: samsung_tty: Add ucon_mask parameter
  tty: serial: samsung_tty: Add s3c24xx_port_type
  tty: serial: samsung_tty: IRQ rework
  tty: serial: samsung_tty: Use devm_ioremap_resource
  dt-bindings: serial: samsung: Add apple,s5l-uart compatible
  tty: serial: samsung_tty: Add support for Apple UARTs
  tty: serial: samsung_tty: Add earlycon support for Apple UARTs
  dt-bindings: display: Add apple,simple-framebuffer
  arm64: apple: Add initial Apple Mac mini (M1, 2020) devicetree

Marc Zyngier (1):
  arm64: Cope with CPUs stuck in VHE mode

 .../devicetree/bindings/arm/apple.yaml        |  64 ++
 .../devicetree/bindings/arm/cpus.yaml         |   2 +
 .../bindings/display/simple-framebuffer.yaml  |   5 +
 .../interrupt-controller/apple,aic.yaml       |  88 +++
 .../bindings/serial/samsung_uart.yaml         |   4 +-
 .../bindings/timer/arm,arch_timer.yaml        |  14 +
 .../devicetree/bindings/vendor-prefixes.yaml  |   2 +
 Documentation/driver-api/device-io.rst        | 356 +++++++++
 .../driver-api/driver-model/devres.rst        |   1 +
 MAINTAINERS                                   |  15 +
 arch/arm64/Kconfig.platforms                  |   8 +
 arch/arm64/boot/dts/Makefile                  |   1 +
 arch/arm64/boot/dts/apple/Makefile            |   2 +
 arch/arm64/boot/dts/apple/t8103-j274.dts      |  45 ++
 arch/arm64/boot/dts/apple/t8103.dtsi          | 135 ++++
 arch/arm64/configs/defconfig                  |   1 +
 arch/arm64/include/asm/cputype.h              |   6 +
 arch/arm64/include/asm/io.h                   |   1 +
 arch/arm64/include/asm/sysreg.h               |  60 ++
 arch/arm64/include/asm/sysreg_apple.h         |  69 ++
 arch/arm64/kernel/head.S                      |  33 +-
 arch/arm64/kernel/hyp-stub.S                  |  28 +-
 arch/sparc/include/asm/io_64.h                |   4 +
 drivers/clocksource/arm_arch_timer.c          |  24 +-
 drivers/irqchip/Kconfig                       |   8 +
 drivers/irqchip/Makefile                      |   1 +
 drivers/irqchip/irq-apple-aic.c               | 710 ++++++++++++++++++
 drivers/of/address.c                          |  72 +-
 drivers/tty/serial/Kconfig                    |   2 +-
 drivers/tty/serial/samsung_tty.c              | 496 +++++++++---
 include/asm-generic/io.h                      |  22 +-
 include/asm-generic/iomap.h                   |   9 +
 include/clocksource/arm_arch_timer.h          |   1 +
 .../interrupt-controller/apple-aic.h          |  15 +
 include/linux/cpuhotplug.h                    |   1 +
 include/linux/io.h                            |   2 +
 include/linux/ioport.h                        |   1 +
 include/linux/irqchip/arm-gic-v3.h            |  56 --
 include/linux/of_address.h                    |   1 +
 include/linux/serial_s3c.h                    |  16 +
 lib/devres.c                                  |  22 +
 41 files changed, 2228 insertions(+), 175 deletions(-)
 create mode 100644 Documentation/devicetree/bindings/arm/apple.yaml
 create mode 100644 Documentation/devicetree/bindings/interrupt-controller/apple,aic.yaml
 create mode 100644 arch/arm64/boot/dts/apple/Makefile
 create mode 100644 arch/arm64/boot/dts/apple/t8103-j274.dts
 create mode 100644 arch/arm64/boot/dts/apple/t8103.dtsi
 create mode 100644 arch/arm64/include/asm/sysreg_apple.h
 create mode 100644 drivers/irqchip/irq-apple-aic.c
 create mode 100644 include/dt-bindings/interrupt-controller/apple-aic.h

--
2.30.0

Comments

Hector Martin March 5, 2021, 10:11 a.m. UTC | #1
On 05/03/2021 06.38, Hector Martin wrote:
> == Merge notes ==
> 
> This patchset depends on both the nVHE changes that are already in
> 5.12-rc1, as well as the FIQ support work currently being reviewed
> at [1]. A tree containing this patchset on top of the required
> dependencies is available at [2][3]. Alternatively, you may apply
> this series on top of Mark's tree at the arm64-fiq-20210302 tag [4][5].

Important warning: these trees are all based on v5.12-rc1, which has a 
bad bug that causes your filesystems to go kaboom if you use a swap file 
[1].

This doesn't affect M1 since we don't *have* storage, but for folks 
testing for regressions on e.g. Samsung or other ARM boards, please 
make sure you don't use swap files.

[1] 
https://lore.kernel.org/lkml/CAHk-=wjnzdLSP3oDxhf9eMTYo7GF-QjaNLBUH1Zk3c4A7X75YA@mail.gmail.com/
Linus Walleij March 5, 2021, 10:19 a.m. UTC | #2
On Thu, Mar 4, 2021 at 10:40 PM Hector Martin <marcan@marcan.st> wrote:

> This allows the devicetree to correctly represent the available set of
> timers, which varies from device to device, without the need for fake
> dummy interrupts for unavailable slots.
>
> Also add the hyp-virt timer/PPI, which is not currently used, but worth
> representing.
>
> Signed-off-by: Hector Martin <marcan@marcan.st>
> Reviewed-by: Tony Lindgren <tony@atomide.com>

This is the right solution.
Reviewed-by: Linus Walleij <linus.walleij@linaro.org>

Yours,
Linus Walleij
Linus Walleij March 5, 2021, 10:22 a.m. UTC | #3
On Thu, Mar 4, 2021 at 10:40 PM Hector Martin <marcan@marcan.st> wrote:

> From: Arnd Bergmann <arnd@arndb.de>
>
> This adds more detailed descriptions of the various read/write
> primitives available for use with I/O memory/ports.
>
> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
> Signed-off-by: Hector Martin <marcan@marcan.st>

Excellent work!
Reviewed-by: Linus Walleij <linus.walleij@linaro.org>

Yours,
Linus Walleij
Linus Walleij March 5, 2021, 10:25 a.m. UTC | #4
On Thu, Mar 4, 2021 at 10:40 PM Hector Martin <marcan@marcan.st> wrote:

> This documents the newly introduced ioremap_np() along with all the
> other common ioremap() variants, and some higher-level abstractions
> available.
>
> Signed-off-by: Hector Martin <marcan@marcan.st>

I like this, I just want one change:

Put the common ioremap() on top in all paragraphs, so the norm
comes before the exceptions.

I.e. it is weird to mention ioremap_np() before mentioning ioremap().

With that change:
Reviewed-by: Linus Walleij <linus.walleij@linaro.org>

Yours,
Linus Walleij
Krzysztof Kozlowski March 5, 2021, 10:30 a.m. UTC | #5
On 04/03/2021 22:38, Hector Martin wrote:
> Instead of patching a single global ops structure depending on the port
> type, use a separate s3c64xx_serial_ops for the S3C64XX type. This
> allows us to mark the structures as const.
> 
> Also split out s3c64xx_serial_shutdown into a separate function now that
> we have a separate ops structure; this avoids excessive branching
> control flow and mirrors s3c64xx_serial_startup. tx_claimed and
> rx_claimed are only used in the S3C24XX functions.
> 
> Signed-off-by: Hector Martin <marcan@marcan.st>
> ---
>  drivers/tty/serial/samsung_tty.c | 71 ++++++++++++++++++++++++--------
>  1 file changed, 54 insertions(+), 17 deletions(-)


Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
Tested-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>

Best regards,
Krzysztof
Krzysztof Kozlowski March 5, 2021, 10:34 a.m. UTC | #6
On 04/03/2021 22:38, Hector Martin wrote:
> This simplifies the code by removing the only distinction between the
> S3C2410 and S3C2440 codepaths.
> 
> Signed-off-by: Hector Martin <marcan@marcan.st>
> ---
>  drivers/tty/serial/samsung_tty.c | 11 ++++-------
>  1 file changed, 4 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/tty/serial/samsung_tty.c b/drivers/tty/serial/samsung_tty.c
> index 78dc6e9240fb..33b421dbeb83 100644
> --- a/drivers/tty/serial/samsung_tty.c
> +++ b/drivers/tty/serial/samsung_tty.c
> @@ -70,6 +70,7 @@ struct s3c24xx_uart_info {
>  	unsigned long		num_clks;
>  	unsigned long		clksel_mask;
>  	unsigned long		clksel_shift;
> +	unsigned long		ucon_mask;
>  
>  	/* uart port features */
>  
> @@ -1736,14 +1737,9 @@ static void s3c24xx_serial_resetport(struct uart_port *port,
>  {
>  	struct s3c24xx_uart_info *info = s3c24xx_port_to_info(port);
>  	unsigned long ucon = rd_regl(port, S3C2410_UCON);
> -	unsigned int ucon_mask;
>  
> -	ucon_mask = info->clksel_mask;
> -	if (info->type == PORT_S3C2440)
> -		ucon_mask |= S3C2440_UCON0_DIVMASK;
> -
> -	ucon &= ucon_mask;
> -	wr_regl(port, S3C2410_UCON,  ucon | cfg->ucon);
> +	ucon &= (info->clksel_mask | info->ucon_mask);
> +	wr_regl(port, S3C2410_UCON, ucon | cfg->ucon);

This line (wr_regl()) is not related; please split it out as a separate
white-space cleanup.

With the change:

Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
Tested-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>


Best regards,
Krzysztof
Krzysztof Kozlowski March 5, 2021, 10:49 a.m. UTC | #7
On 04/03/2021 22:38, Hector Martin wrote:
> This decouples the TTY layer PORT_ types, which are exposed to
> userspace, from the driver-internal flag of what kind of port this is.

s/This decouples/Decouple/

> 
> This removes s3c24xx_serial_has_interrupt_mask, which was just checking

s/This removes/Remove/

https://elixir.bootlin.com/linux/latest/source/Documentation/process/submitting-patches.rst#L89

This actually also applies to your patches 19 and 21... and maybe more.

> for a specific type anyway.
> 
> Signed-off-by: Hector Martin <marcan@marcan.st>
> ---
>  drivers/tty/serial/samsung_tty.c | 112 +++++++++++++++++++------------
>  1 file changed, 70 insertions(+), 42 deletions(-)
> 
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
Tested-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>

Best regards,
Krzysztof
Krzysztof Kozlowski March 5, 2021, 10:51 a.m. UTC | #8
On 04/03/2021 22:38, Hector Martin wrote:
> * Split out s3c24xx_serial_tx_chars from s3c24xx_serial_tx_irq,
>   where only the latter acquires the port lock. This will be necessary
>   on platforms which have edge-triggered IRQs, as we need to call
>   s3c24xx_serial_tx_chars to kick off transmission from outside IRQ
>   context, with the port lock held.
> 
> * Rename s3c24xx_serial_rx_chars to s3c24xx_serial_rx_irq for
>   consistency with the above. All it does now is call two other
>   functions anyway.
> 
> Signed-off-by: Hector Martin <marcan@marcan.st>
> ---
>  drivers/tty/serial/samsung_tty.c | 34 +++++++++++++++++++-------------
>  1 file changed, 20 insertions(+), 14 deletions(-)
> 

Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
Tested-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>

Best regards,
Krzysztof
Krzysztof Kozlowski March 5, 2021, 10:54 a.m. UTC | #9
On 04/03/2021 22:38, Hector Martin wrote:
> This picks up the non-posted I/O mode needed for Apple platforms to
> work properly.
> 
> This removes the request/release functions, which are no longer
> necessary, since devm_ioremap_resource takes care of that already. Most
> other drivers already do it this way, anyway.
> 
> Signed-off-by: Hector Martin <marcan@marcan.st>
> ---
>  drivers/tty/serial/samsung_tty.c | 25 +++----------------------
>  1 file changed, 3 insertions(+), 22 deletions(-)

Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
Tested-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>

Best regards,
Krzysztof
Krzysztof Kozlowski March 5, 2021, 10:55 a.m. UTC | #10
On 04/03/2021 22:39, Hector Martin wrote:
> Earlycon support is identical to S3C2410, but Apple SoCs also need
> MMIO mapped as nGnRnE. This is handled generically for normal drivers
> including the normal UART path here, but earlycon uses fixmap and
> runs before that scaffolding is ready.
> 
> Since this is the only case where we need this fix, it makes more
> sense to do it here in the UART driver instead of introducing a
> whole fdt nonposted-mmio resolver just for earlycon/fixmap.
> 
> Suggested-by: Arnd Bergmann <arnd@arndb.de>
> Signed-off-by: Hector Martin <marcan@marcan.st>
> ---
>  drivers/tty/serial/samsung_tty.c | 17 +++++++++++++++++
>  1 file changed, 17 insertions(+)
> 

Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>

Best regards,
Krzysztof
Krzysztof Kozlowski March 5, 2021, 10:58 a.m. UTC | #11
On 04/03/2021 22:38, Hector Martin wrote:
> Apple SoCs are a distant descendant of Samsung designs and use yet
> another variant of their UART style, with different interrupt handling.
> 
> In particular, this variant has the following differences with existing
> ones:
> 
> * It includes a built-in interrupt controller with different registers,
>   using only a single platform IRQ
> 
> * Internal interrupt sources are treated as edge-triggered, even though
>   the IRQ output is level-triggered. This chiefly affects the TX IRQ
>   path: the driver can no longer rely on the TX buffer empty IRQ
>   immediately firing after TX is enabled, but instead must prime the
>   FIFO with data directly.
> 
> Signed-off-by: Hector Martin <marcan@marcan.st>
> ---
>  drivers/tty/serial/Kconfig       |   2 +-
>  drivers/tty/serial/samsung_tty.c | 238 +++++++++++++++++++++++++++++--
>  include/linux/serial_s3c.h       |  16 +++
>  3 files changed, 247 insertions(+), 9 deletions(-)
> 

Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
Tested-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>

Best regards,
Krzysztof
Krzysztof Kozlowski March 5, 2021, 11:03 a.m. UTC | #12
On 04/03/2021 22:39, Hector Martin wrote:
> This currently supports:
> 
> * SMP (via spin-tables)
> * AIC IRQs
> * Serial (with earlycon)
> * Framebuffer
> 
> A number of properties are dynamic, and based on system firmware
> decisions that vary from version to version. These are expected
> to be filled in by the loader.
> 
> Signed-off-by: Hector Martin <marcan@marcan.st>
> ---
>  MAINTAINERS                              |   1 +
>  arch/arm64/boot/dts/Makefile             |   1 +
>  arch/arm64/boot/dts/apple/Makefile       |   2 +
>  arch/arm64/boot/dts/apple/t8103-j274.dts |  45 ++++++++
>  arch/arm64/boot/dts/apple/t8103.dtsi     | 135 +++++++++++++++++++++++
>  5 files changed, 184 insertions(+)
>  create mode 100644 arch/arm64/boot/dts/apple/Makefile
>  create mode 100644 arch/arm64/boot/dts/apple/t8103-j274.dts
>  create mode 100644 arch/arm64/boot/dts/apple/t8103.dtsi
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 28bd46f4f7a7..d5e4d93a536a 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -1647,6 +1647,7 @@ C:	irc://chat.freenode.net/asahi-dev
>  T:	git https://github.com/AsahiLinux/linux.git
>  F:	Documentation/devicetree/bindings/arm/apple.yaml
>  F:	Documentation/devicetree/bindings/interrupt-controller/apple,aic.yaml
> +F:	arch/arm64/boot/dts/apple/
>  F:	arch/arm64/include/asm/sysreg_apple.h
>  F:	drivers/irqchip/irq-apple-aic.c
>  F:	include/dt-bindings/interrupt-controller/apple-aic.h
> diff --git a/arch/arm64/boot/dts/Makefile b/arch/arm64/boot/dts/Makefile
> index f1173cd93594..639e01a4d855 100644
> --- a/arch/arm64/boot/dts/Makefile
> +++ b/arch/arm64/boot/dts/Makefile
> @@ -6,6 +6,7 @@ subdir-y += amazon
>  subdir-y += amd
>  subdir-y += amlogic
>  subdir-y += apm
> +subdir-y += apple
>  subdir-y += arm
>  subdir-y += bitmain
>  subdir-y += broadcom
> diff --git a/arch/arm64/boot/dts/apple/Makefile b/arch/arm64/boot/dts/apple/Makefile
> new file mode 100644
> index 000000000000..cbbd701ebf05
> --- /dev/null
> +++ b/arch/arm64/boot/dts/apple/Makefile
> @@ -0,0 +1,2 @@
> +# SPDX-License-Identifier: GPL-2.0
> +dtb-$(CONFIG_ARCH_APPLE) += t8103-j274.dtb
> diff --git a/arch/arm64/boot/dts/apple/t8103-j274.dts b/arch/arm64/boot/dts/apple/t8103-j274.dts
> new file mode 100644
> index 000000000000..8afc2ed70361
> --- /dev/null
> +++ b/arch/arm64/boot/dts/apple/t8103-j274.dts
> @@ -0,0 +1,45 @@
> +// SPDX-License-Identifier: GPL-2.0+ OR MIT
> +/*
> + * Apple Mac mini (M1, 2020)
> + *
> + * target-type: J274
> + *
> + * Copyright The Asahi Linux Contributors
> + */
> +
> +/dts-v1/;
> +
> +#include "t8103.dtsi"
> +
> +/ {
> +	compatible = "apple,j274", "apple,t8103", "apple,arm-platform";
> +	model = "Apple Mac mini (M1, 2020)";
> +
> +	aliases {
> +		serial0 = &serial0;
> +	};
> +
> +	chosen {
> +		#address-cells = <2>;
> +		#size-cells = <2>;
> +		ranges;
> +
> +		stdout-path = "serial0";
> +
> +		framebuffer0: framebuffer@0 {
> +			compatible = "apple,simple-framebuffer", "simple-framebuffer";
> +			reg = <0 0 0 0>; /* To be filled by loader */
> +			/* Format properties will be added by loader */
> +			status = "disabled";
> +		};
> +	};
> +
> +	memory@800000000 {
> +		device_type = "memory";
> +		reg = <0x8 0 0x2 0>; /* To be filled by loader */

Shouldn't this be 0x800000000 with ~0x80000000 length (or whatever is
more common)? Or did I miss some ranges?

Best regards,
Krzysztof
Hector Martin March 5, 2021, 11:14 a.m. UTC | #13
On 05/03/2021 20.03, Krzysztof Kozlowski wrote:
>> +	memory@800000000 {
>> +		device_type = "memory";
>> +		reg = <0x8 0 0x2 0>; /* To be filled by loader */
> 
> Shouldn't this be 0x800000000 with ~0x80000000 length (or whatever is
> more common)? Or did I miss some ranges?

The base model has 8GB of RAM, and RAM always starts at 0x800000000, 
hence that reg property.

It's not actually useful to try to boot Linux like this, because it'll 
step all over device carveouts on both ends and break, but since those 
are potentially dynamic it doesn't really make sense to use a more 
specific example for the dts.

E.g. on my system, with my current firmware version, this ends up 
getting patched to:

reg = <0x8 0x0134c000 0x1 0xda294000>

Thanks,
Krzysztof Kozlowski March 5, 2021, 11:45 a.m. UTC | #14
On 05/03/2021 12:14, Hector Martin wrote:
> On 05/03/2021 20.03, Krzysztof Kozlowski wrote:
>>> +	memory@800000000 {
>>> +		device_type = "memory";
>>> +		reg = <0x8 0 0x2 0>; /* To be filled by loader */
>>
>> Shouldn't this be 0x800000000 with ~0x80000000 length (or whatever is
>> more common)? Or did I miss some ranges?
> 
> The base model has 8GB of RAM, and RAM always starts at 0x800000000, 
> hence that reg property.

Ah, I messed up the unit addressing and number of zeros... it's OK.

Best regards,
Krzysztof
Andy Shevchenko March 5, 2021, 2:45 p.m. UTC | #15
On Thu, Mar 4, 2021 at 11:40 PM Hector Martin <marcan@marcan.st> wrote:
>
> ARM64 currently defaults to posted MMIO (nGnRnE), but some devices
> require the use of non-posted MMIO (nGnRE). Introduce a new ioremap()
> variant to handle this case. ioremap_np() is aliased to ioremap() by
> default on arches that do not implement this variant.

Hmm... But isn't it basically a requirement for those device drivers to
use readX()/writeX() instead of readX_relaxed() / writeX_relaxed()?

...

>  #define IORESOURCE_MEM_32BIT           (3<<3)
>  #define IORESOURCE_MEM_SHADOWABLE      (1<<5)  /* dup: IORESOURCE_SHADOWABLE */
>  #define IORESOURCE_MEM_EXPANSIONROM    (1<<6)
> +#define IORESOURCE_MEM_NONPOSTED       (1<<7)

Not sure it's the right location (in a bit field) for this flag.
Andy Shevchenko March 5, 2021, 3:05 p.m. UTC | #16
On Thu, Mar 4, 2021 at 11:41 PM Hector Martin <marcan@marcan.st> wrote:
>
> This is the root interrupt controller used on Apple ARM SoCs such as the
> M1. This irqchip driver performs multiple functions:
>
> * Handles both IRQs and FIQs
>
> * Drives the AIC peripheral itself (which handles IRQs)
>
> * Dispatches FIQs to downstream hard-wired clients (currently the ARM
>   timer).
>
> * Implements a virtual IPI multiplexer to funnel multiple Linux IPIs
>   into a single hardware IPI

...

> + *   - <0 nr flags> - hwirq #nr
> + *   - <1 nr flags> - FIQ #nr
> + *     - nr=0  Physical HV timer
> + *     - nr=1  Virtual HV timer
> + *     - nr=2  Physical guest timer
> + *     - nr=3  Virtual guest timer

> + *

Unneeded blank line.

> + */

...

> +#define pr_fmt(fmt) "%s: " fmt, __func__

This is not needed, really, if you have unique / distinguishable
messages in the first place.
Rather, people include module names, which may be useful.

...

> +#define MASK_REG(x)            (4 * ((x) >> 5))
> +#define MASK_BIT(x)            BIT((x) & 0x1f)

GENMASK(4,0)

...

> +/*
> + * Max 31 bits in IPI SEND register (top bit is self).
> + * >=32-core chips will need code changes anyway.
> + */
> +#define AIC_MAX_CPUS           31

I would put it as (32 - 1) to show that the register is actually 32 bits wide.

...

> +static atomic_t aic_vipi_flag[AIC_MAX_CPUS];
> +static atomic_t aic_vipi_enable[AIC_MAX_CPUS];

Isn't it easier to handle these when they are full width, i.e. 32
items per the array?

...

> +static int aic_irq_set_affinity(struct irq_data *d,
> +                               const struct cpumask *mask_val, bool force)
> +{
> +       irq_hw_number_t hwirq = irqd_to_hwirq(d);
> +       struct aic_irq_chip *ic = irq_data_get_irq_chip_data(d);
> +       int cpu;
> +
> +       if (hwirq > ic->nr_hw)

>= ?

> +               return -EINVAL;
> +
> +       if (force)
> +               cpu = cpumask_first(mask_val);
> +       else
> +               cpu = cpumask_any_and(mask_val, cpu_online_mask);
> +
> +       aic_ic_write(ic, AIC_TARGET_CPU + hwirq * 4, BIT(cpu));
> +       irq_data_update_effective_affinity(d, cpumask_of(cpu));
> +
> +       return IRQ_SET_MASK_OK;
> +}

...

> +static void aic_fiq_mask(struct irq_data *d)
> +{
> +       /* Only the guest timers have real mask bits, unfortunately. */
> +       switch (d->hwirq) {
> +       case AIC_TMR_GUEST_PHYS:
> +               sysreg_clear_set_s(SYS_APL_VM_TMR_FIQ_ENA_EL1, VM_TMR_FIQ_ENABLE_P, 0);
> +               break;
> +       case AIC_TMR_GUEST_VIRT:
> +               sysreg_clear_set_s(SYS_APL_VM_TMR_FIQ_ENA_EL1, VM_TMR_FIQ_ENABLE_V, 0);
> +               break;

default case? // some compilers may not be happy
Ditto for all similar places in the series.

> +       }
> +}

...

> +#define TIMER_FIRING(x)                                                        \
> +       (((x) & (ARCH_TIMER_CTRL_ENABLE | ARCH_TIMER_CTRL_IT_MASK |            \
> +                ARCH_TIMER_CTRL_IT_STAT)) ==                                  \
> +        (ARCH_TIMER_CTRL_ENABLE | ARCH_TIMER_CTRL_IT_STAT))

It's a bit hard to read. Perhaps

#define FOO_MASK  (_ENABLE | _STAT)
#define _FIRING ... (FOO_MASK | _MASK == FOO_MASK)

?
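
For illustration, one concrete way to read that suggestion (the macro
names here are invented, not from the patch):

#define TIMER_FIRING_BITS	(ARCH_TIMER_CTRL_ENABLE | ARCH_TIMER_CTRL_IT_STAT)
#define TIMER_CHECK_MASK	(TIMER_FIRING_BITS | ARCH_TIMER_CTRL_IT_MASK)
#define TIMER_FIRING(x)		(((x) & TIMER_CHECK_MASK) == TIMER_FIRING_BITS)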

...

> +       if ((read_sysreg_s(SYS_APL_PMCR0_EL1) & (PMCR0_IMODE | PMCR0_IACT))
> +                       == (FIELD_PREP(PMCR0_IMODE, PMCR0_IMODE_FIQ) | PMCR0_IACT)) {

It's better to have == on the previous line.

...

> +       for_each_set_bit(i, &firing, AIC_NR_SWIPI) {
> +               handle_domain_irq(aic_irqc->ipi_domain, i, regs);
> +       }

No {} needed.

...

> +static int aic_init_smp(struct aic_irq_chip *irqc, struct device_node *node)
> +{
> +       int base_ipi;

Introducing a temporary variable may help with readability

...  *d = irqc->hw_domain;

> +       irqc->ipi_domain = irq_domain_create_linear(irqc->hw_domain->fwnode, AIC_NR_SWIPI,
> +                                                   &aic_ipi_domain_ops, irqc);
> +       if (WARN_ON(!irqc->ipi_domain))
> +               return -ENODEV;
> +
> +       irqc->ipi_domain->flags |= IRQ_DOMAIN_FLAG_IPI_SINGLE;
> +       irq_domain_update_bus_token(irqc->ipi_domain, DOMAIN_BUS_IPI);
> +
> +       base_ipi = __irq_domain_alloc_irqs(irqc->ipi_domain, -1, AIC_NR_SWIPI,
> +                                          NUMA_NO_NODE, NULL, false, NULL);
> +
> +       if (WARN_ON(!base_ipi)) {
> +               irq_domain_remove(irqc->ipi_domain);
> +               return -ENODEV;
> +       }
> +
> +       set_smp_ipi_range(base_ipi, AIC_NR_SWIPI);
> +
> +       return 0;
> +}

...

> +       return 0;
> +

Extra blank line.

...

> +       irqc->hw_domain = irq_domain_create_linear(of_node_to_fwnode(node),
> +                                                  irqc->nr_hw + AIC_NR_FIQ,
> +                                                  &aic_irq_domain_ops, irqc);

If you are sure it will always be OF-only, why not use
irq_domain_add_linear()?

...

> +       for (i = 0; i < BITS_TO_U32(irqc->nr_hw); i++)
> +               aic_ic_write(irqc, AIC_MASK_SET + i * 4, ~0);
> +       for (i = 0; i < BITS_TO_U32(irqc->nr_hw); i++)
> +               aic_ic_write(irqc, AIC_SW_CLR + i * 4, ~0);

~0 is a beast when it suddenly gets into > int size.

I would recommend using either GENMASK() if it's a bit field, or
type_MAX values if it's a plain number.
Andy Shevchenko March 5, 2021, 3:09 p.m. UTC | #17
On Fri, Mar 5, 2021 at 12:25 PM Linus Walleij <linus.walleij@linaro.org> wrote:
> On Thu, Mar 4, 2021 at 10:40 PM Hector Martin <marcan@marcan.st> wrote:
>
> > This documents the newly introduced ioremap_np() along with all the
> > other common ioremap() variants, and some higher-level abstractions
> > available.
> >
> > Signed-off-by: Hector Martin <marcan@marcan.st>
>
> I like this, I just want one change:
>
> Put the common ioremap() on top in all paragraphs, so the norm
> comes before the exceptions.
>
> I.e. it is weird to mention ioremap_np() before mentioning ioremap().

+1 here. That is what I have stumbled upon reading carefully.
Andy Shevchenko March 5, 2021, 3:17 p.m. UTC | #18
On Thu, Mar 4, 2021 at 11:41 PM Hector Martin <marcan@marcan.st> wrote:
>
> * Split out s3c24xx_serial_tx_chars from s3c24xx_serial_tx_irq,
>   where only the latter acquires the port lock. This will be necessary
>   on platforms which have edge-triggered IRQs, as we need to call
>   s3c24xx_serial_tx_chars to kick off transmission from outside IRQ
>   context, with the port lock held.
>
> * Rename s3c24xx_serial_rx_chars to s3c24xx_serial_rx_irq for
>   consistency with the above. All it does now is call two other
>   functions anyway.
>
> Signed-off-by: Hector Martin <marcan@marcan.st>
> ---
>  drivers/tty/serial/samsung_tty.c | 34 +++++++++++++++++++-------------
>  1 file changed, 20 insertions(+), 14 deletions(-)
>
> diff --git a/drivers/tty/serial/samsung_tty.c b/drivers/tty/serial/samsung_tty.c
> index 39b2eb165bdc..7106eb238d8c 100644
> --- a/drivers/tty/serial/samsung_tty.c
> +++ b/drivers/tty/serial/samsung_tty.c
> @@ -827,7 +827,7 @@ static irqreturn_t s3c24xx_serial_rx_chars_pio(void *dev_id)
>         return IRQ_HANDLED;
>  }
>
> -static irqreturn_t s3c24xx_serial_rx_chars(int irq, void *dev_id)
> +static irqreturn_t s3c24xx_serial_rx_irq(int irq, void *dev_id)
>  {
>         struct s3c24xx_uart_port *ourport = dev_id;
>
> @@ -836,16 +836,12 @@ static irqreturn_t s3c24xx_serial_rx_chars(int irq, void *dev_id)
>         return s3c24xx_serial_rx_chars_pio(dev_id);
>  }
>
> -static irqreturn_t s3c24xx_serial_tx_chars(int irq, void *id)
> +static void s3c24xx_serial_tx_chars(struct s3c24xx_uart_port *ourport)
>  {
> -       struct s3c24xx_uart_port *ourport = id;
>         struct uart_port *port = &ourport->port;
>         struct circ_buf *xmit = &port->state->xmit;
> -       unsigned long flags;
>         int count, dma_count = 0;
>
> -       spin_lock_irqsave(&port->lock, flags);
> -
>         count = CIRC_CNT_TO_END(xmit->head, xmit->tail, UART_XMIT_SIZE);
>
>         if (ourport->dma && ourport->dma->tx_chan &&
> @@ -862,7 +858,7 @@ static irqreturn_t s3c24xx_serial_tx_chars(int irq, void *id)
>                 wr_reg(port, S3C2410_UTXH, port->x_char);
>                 port->icount.tx++;
>                 port->x_char = 0;
> -               goto out;
> +               return;
>         }
>
>         /* if there isn't anything more to transmit, or the uart is now
> @@ -871,7 +867,7 @@ static irqreturn_t s3c24xx_serial_tx_chars(int irq, void *id)
>
>         if (uart_circ_empty(xmit) || uart_tx_stopped(port)) {
>                 s3c24xx_serial_stop_tx(port);
> -               goto out;
> +               return;
>         }
>
>         /* try and drain the buffer... */
> @@ -893,7 +889,7 @@ static irqreturn_t s3c24xx_serial_tx_chars(int irq, void *id)
>
>         if (!count && dma_count) {
>                 s3c24xx_serial_start_tx_dma(ourport, dma_count);
> -               goto out;
> +               return;
>         }
>
>         if (uart_circ_chars_pending(xmit) < WAKEUP_CHARS) {
> @@ -904,8 +900,18 @@ static irqreturn_t s3c24xx_serial_tx_chars(int irq, void *id)
>
>         if (uart_circ_empty(xmit))
>                 s3c24xx_serial_stop_tx(port);
> +}
> +
> +static irqreturn_t s3c24xx_serial_tx_irq(int irq, void *id)
> +{
> +       struct s3c24xx_uart_port *ourport = id;
> +       struct uart_port *port = &ourport->port;
> +       unsigned long flags;
> +


> +       spin_lock_irqsave(&port->lock, flags);
> +
> +       s3c24xx_serial_tx_chars(ourport);
>
> -out:
>         spin_unlock_irqrestore(&port->lock, flags);

Add a separate change that removes flags from the spin lock in the IRQ handler.

>         return IRQ_HANDLED;
>  }
> @@ -919,11 +925,11 @@ static irqreturn_t s3c64xx_serial_handle_irq(int irq, void *id)
>         irqreturn_t ret = IRQ_HANDLED;
>
>         if (pend & S3C64XX_UINTM_RXD_MSK) {
> -               ret = s3c24xx_serial_rx_chars(irq, id);
> +               ret = s3c24xx_serial_rx_irq(irq, id);
>                 wr_regl(port, S3C64XX_UINTP, S3C64XX_UINTM_RXD_MSK);
>         }
>         if (pend & S3C64XX_UINTM_TXD_MSK) {
> -               ret = s3c24xx_serial_tx_chars(irq, id);
> +               ret = s3c24xx_serial_tx_irq(irq, id);
>                 wr_regl(port, S3C64XX_UINTP, S3C64XX_UINTM_TXD_MSK);
>         }
>         return ret;
> @@ -1155,7 +1161,7 @@ static int s3c24xx_serial_startup(struct uart_port *port)
>
>         ourport->rx_enabled = 1;
>
> -       ret = request_irq(ourport->rx_irq, s3c24xx_serial_rx_chars, 0,
> +       ret = request_irq(ourport->rx_irq, s3c24xx_serial_rx_irq, 0,
>                           s3c24xx_serial_portname(port), ourport);
>
>         if (ret != 0) {
> @@ -1169,7 +1175,7 @@ static int s3c24xx_serial_startup(struct uart_port *port)
>
>         ourport->tx_enabled = 1;
>
> -       ret = request_irq(ourport->tx_irq, s3c24xx_serial_tx_chars, 0,
> +       ret = request_irq(ourport->tx_irq, s3c24xx_serial_tx_irq, 0,
>                           s3c24xx_serial_portname(port), ourport);
>
>         if (ret) {
> --
> 2.30.0
>
Andy Shevchenko March 5, 2021, 3:19 p.m. UTC | #19
On Fri, Mar 5, 2021 at 12:55 PM Krzysztof Kozlowski
<krzysztof.kozlowski@canonical.com> wrote:
>
> On 04/03/2021 22:38, Hector Martin wrote:
> > This picks up the non-posted I/O mode needed for Apple platforms to
> > work properly.
> >
> > This removes the request/release functions, which are no longer
> > necessary, since devm_ioremap_resource takes care of that already. Most
> > other drivers already do it this way, anyway.
> >

For the patches 18-22, with Krzysztof's and mine comments addressed
Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>

> > Signed-off-by: Hector Martin <marcan@marcan.st>
> > ---
> >  drivers/tty/serial/samsung_tty.c | 25 +++----------------------
> >  1 file changed, 3 insertions(+), 22 deletions(-)
>
> Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
> Tested-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
>
> Best regards,
> Krzysztof
Hector Martin March 5, 2021, 3:19 p.m. UTC | #20
On 05/03/2021 23.45, Andy Shevchenko wrote:
> On Thu, Mar 4, 2021 at 11:40 PM Hector Martin <marcan@marcan.st> wrote:
>>
>> ARM64 currently defaults to posted MMIO (nGnRnE), but some devices
>> require the use of non-posted MMIO (nGnRE). Introduce a new ioremap()
>> variant to handle this case. ioremap_np() is aliased to ioremap() by
>> default on arches that do not implement this variant.
> 
> Hmm... But isn't it basically a requirement to those device drivers to
> use readX()/writeX() instead of readX_relaxed() / writeX_relaxed()?

No, the write ops used do not matter. It's just that on these Apple SoCs 
the fabric requires the mappings to be nGnRnE, else it just throws 
SErrors on all writes and ignores them.

The difference between _relaxed and not is barrier behavior with regards 
to DMA/memory accesses; this applies regardless of whether the writes 
are E or nE. You can have relaxed accesses with nGnRnE and then you 
would still have race conditions if you do not have a barrier between 
the MMIO and accessing DMA memory. What nGnRnE buys you (on 
platforms/buses where it works properly) is that you do not need a dummy 
read after a write to ensure completion.
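
To make the point concrete (a generic example, not code from this
series; all names are made up): ordering against DMA memory is
orthogonal to the mapping type, so even on an nGnRnE mapping a
_relaxed accessor still needs its own barrier:

	/* descriptor lives in DMA-coherent memory */
	desc->addr = cpu_to_le64(buf_dma);

	dma_wmb();					/* order the descriptor write...  */
	writel_relaxed(CTRL_GO, base + REG_CTRL);	/* ...before the MMIO doorbell */

	/* or use the non-relaxed accessor, which already includes the
	 * barrier, regardless of nGnRE vs nGnRnE: */
	writel(CTRL_GO, base + REG_CTRL);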

All of this is to some extent moot on these SoCs; it's not that we need 
the drivers to use nGnRnE for some correctness reason, it's that the 
SoCs force us to use it or else everything breaks, which was the 
motivation for this change. But since on most other SoCs both are valid 
options, this does allow some other drivers/platforms to opt into nGnRnE 
if they have a good reason to do so.

Though you just made me notice two mistakes in the commit description: 
first, it describes the old v2 version; for v3 I made ioremap_np() just 
return NULL on arches that don't implement it. Second, nGnRnE and nGnRE 
are backwards. Oops. I'll fix it for the next version.

>>   #define IORESOURCE_MEM_32BIT           (3<<3)
>>   #define IORESOURCE_MEM_SHADOWABLE      (1<<5)  /* dup: IORESOURCE_SHADOWABLE */
>>   #define IORESOURCE_MEM_EXPANSIONROM    (1<<6)
>> +#define IORESOURCE_MEM_NONPOSTED       (1<<7)
> 
> Not sure it's the right location (in a bit field) for this flag.

Do you have a better suggestion? It seemed logical to put it here, as a 
flag on memory-type I/O resources.
Andy Shevchenko March 5, 2021, 3:28 p.m. UTC | #21
On Thu, Mar 4, 2021 at 11:42 PM Hector Martin <marcan@marcan.st> wrote:
>
> Apple SoCs are a distant descendant of Samsung designs and use yet
> another variant of their UART style, with different interrupt handling.
>
> In particular, this variant has the following differences with existing
> ones:
>
> * It includes a built-in interrupt controller with different registers,
>   using only a single platform IRQ
>
> * Internal interrupt sources are treated as edge-triggered, even though
>   the IRQ output is level-triggered. This chiefly affects the TX IRQ
>   path: the driver can no longer rely on the TX buffer empty IRQ
>   immediately firing after TX is enabled, but instead must prime the
>   FIFO with data directly.

...

> +       case TYPE_APPLE_S5L:
> +               WARN_ON(1); // No DMA

Oh, no, please use the ONCE variant.

...

> +       /* Apple types use these bits for IRQ masks */
> +       if (ourport->info->type != TYPE_APPLE_S5L) {
> +               ucon &= ~(S3C64XX_UCON_TIMEOUT_MASK |
> +                               S3C64XX_UCON_EMPTYINT_EN |
> +                               S3C64XX_UCON_DMASUS_EN |
> +                               S3C64XX_UCON_TIMEOUT_EN);
> +               ucon |= 0xf << S3C64XX_UCON_TIMEOUT_SHIFT |

Can you spell 0xf with named constant(s), please?

In case they are repetitive via the code, introduce either a temporary
variable (in case it scoped to one function only), or define it as a
constant.

> +                               S3C64XX_UCON_TIMEOUT_EN;
> +       }

...

> +/* interrupt handler for Apple SoC's.*/
> +static irqreturn_t apple_serial_handle_irq(int irq, void *id)
> +{
> +       struct s3c24xx_uart_port *ourport = id;
> +       struct uart_port *port = &ourport->port;
> +       unsigned int pend = rd_regl(port, S3C2410_UTRSTAT);

> +       irqreturn_t ret = IRQ_NONE;

Redundant. You may return directly.

> +
> +       if (pend & (APPLE_S5L_UTRSTAT_RXTHRESH | APPLE_S5L_UTRSTAT_RXTO)) {
> +               wr_regl(port, S3C2410_UTRSTAT,
> +                       APPLE_S5L_UTRSTAT_RXTHRESH | APPLE_S5L_UTRSTAT_RXTO);
> +               ret = s3c24xx_serial_rx_irq(irq, id);
> +       }
> +       if (pend & APPLE_S5L_UTRSTAT_TXTHRESH) {
> +               wr_regl(port, S3C2410_UTRSTAT, APPLE_S5L_UTRSTAT_TXTHRESH);
> +               ret = s3c24xx_serial_tx_irq(irq, id);
> +       }

No IO serialization?

> +       return ret;
> +}

...

> +static void apple_s5l_serial_shutdown(struct uart_port *port)
> +{
> +       struct s3c24xx_uart_port *ourport = to_ourport(port);

> +

Extra blank line (check your entire series for a such)

> +       unsigned int ucon;

> +       ourport->tx_in_progress = 0;
> +}

...

> +       ourport->rx_enabled = 1;
> +       ourport->tx_enabled = 0;

How are these protected against race?

...

> +               case TYPE_APPLE_S5L: {
> +                       unsigned int ucon;
> +                       int ret;
> +
> +                       ret = clk_prepare_enable(ourport->clk);
> +                       if (ret) {
> +                               dev_err(dev, "clk_enable clk failed: %d\n", ret);
> +                               return ret;
> +                       }
> +                       if (!IS_ERR(ourport->baudclk)) {
> +                               ret = clk_prepare_enable(ourport->baudclk);
> +                               if (ret) {
> +                                       dev_err(dev, "clk_enable baudclk failed: %d\n", ret);
> +                                       clk_disable_unprepare(ourport->clk);
> +                                       return ret;
> +                               }
> +                       }

Wouldn't it be better to use CLK bulk API?

> +                       ucon = rd_regl(port, S3C2410_UCON);
> +
> +                       ucon &= ~(APPLE_S5L_UCON_TXTHRESH_ENA_MSK |
> +                                 APPLE_S5L_UCON_RXTHRESH_ENA_MSK |
> +                                 APPLE_S5L_UCON_RXTO_ENA_MSK);
> +
> +                       if (ourport->tx_enabled)
> +                               ucon |= APPLE_S5L_UCON_TXTHRESH_ENA_MSK;
> +                       if (ourport->rx_enabled)
> +                               ucon |= APPLE_S5L_UCON_RXTHRESH_ENA_MSK |
> +                                       APPLE_S5L_UCON_RXTO_ENA_MSK;
> +
> +                       wr_regl(port, S3C2410_UCON, ucon);
> +
> +                       if (!IS_ERR(ourport->baudclk))
> +                               clk_disable_unprepare(ourport->baudclk);
> +                       clk_disable_unprepare(ourport->clk);
> +                       break;
> +               }

...

> +#ifdef CONFIG_ARCH_APPLE

Why? Wouldn't you like the one kernel to work on many SoCs?

> +static struct s3c24xx_serial_drv_data s5l_serial_drv_data = {
> +       .info = &(struct s3c24xx_uart_info) {
> +               .name           = "Apple S5L UART",
> +               .type           = TYPE_APPLE_S5L,
> +               .port_type      = PORT_8250,
> +               .fifosize       = 16,
> +               .rx_fifomask    = S3C2410_UFSTAT_RXMASK,
> +               .rx_fifoshift   = S3C2410_UFSTAT_RXSHIFT,
> +               .rx_fifofull    = S3C2410_UFSTAT_RXFULL,
> +               .tx_fifofull    = S3C2410_UFSTAT_TXFULL,
> +               .tx_fifomask    = S3C2410_UFSTAT_TXMASK,
> +               .tx_fifoshift   = S3C2410_UFSTAT_TXSHIFT,
> +               .def_clk_sel    = S3C2410_UCON_CLKSEL0,
> +               .num_clks       = 1,
> +               .clksel_mask    = 0,
> +               .clksel_shift   = 0,
> +       },
> +       .def_cfg = &(struct s3c2410_uartcfg) {
> +               .ucon           = APPLE_S5L_UCON_DEFAULT,
> +               .ufcon          = S3C2410_UFCON_DEFAULT,
> +       },
> +};
> +#define S5L_SERIAL_DRV_DATA ((kernel_ulong_t)&s5l_serial_drv_data)
> +#else
> +#define S5L_SERIAL_DRV_DATA ((kernel_ulong_t)NULL)
> +#endif

...

> +#define APPLE_S5L_UCON_RXTO_ENA_MSK    (1 << APPLE_S5L_UCON_RXTO_ENA)
> +#define APPLE_S5L_UCON_RXTHRESH_ENA_MSK        (1 << APPLE_S5L_UCON_RXTHRESH_ENA)
> +#define APPLE_S5L_UCON_TXTHRESH_ENA_MSK        (1 << APPLE_S5L_UCON_TXTHRESH_ENA)

BIT() ?

...

> +#define APPLE_S5L_UCON_DEFAULT         (S3C2410_UCON_TXIRQMODE | \
> +                                        S3C2410_UCON_RXIRQMODE | \
> +                                        S3C2410_UCON_RXFIFO_TOI)

Indentation level is too high. Hint: start a value of the definition
on the new line.

...

> +#define APPLE_S5L_UTRSTAT_RXTHRESH     (1<<4)
> +#define APPLE_S5L_UTRSTAT_TXTHRESH     (1<<5)
> +#define APPLE_S5L_UTRSTAT_RXTO         (1<<9)
> +#define APPLE_S5L_UTRSTAT_ALL_FLAGS    (0x3f0)

BIT() ?
Arnd Bergmann March 5, 2021, 3:51 p.m. UTC | #22
On Fri, Mar 5, 2021 at 4:09 PM Andy Shevchenko
<andy.shevchenko@gmail.com> wrote:
> On Fri, Mar 5, 2021 at 12:25 PM Linus Walleij <linus.walleij@linaro.org> wrote:
> > On Thu, Mar 4, 2021 at 10:40 PM Hector Martin <marcan@marcan.st> wrote:
> >
> > > This documents the newly introduced ioremap_np() along with all the
> > > other common ioremap() variants, and some higher-level abstractions
> > > available.
> > >
> > > Signed-off-by: Hector Martin <marcan@marcan.st>
> >
> > I like this, I just want one change:
> >
> > Put the common ioremap() on top in all paragraphs, so the norm
> > comes before the exceptions.
> >
> > I.e. it is weird to mention ioremap_np() before mentioning ioremap().
>
> +1 here. That is what I have stumbled upon reading carefully.

In that case, the order should probably be:

ioremap
ioremap_wc
ioremap_wt
ioremap_np
ioremap_uc
ioremap_cache

Going from most common to least common, rather than going from
strongest to weakest.

       Arnd
Mark Kettenis March 5, 2021, 3:59 p.m. UTC | #23
> From: Hector Martin <marcan@marcan.st>
> Date: Fri, 5 Mar 2021 20:14:10 +0900
> 
> On 05/03/2021 20.03, Krzysztof Kozlowski wrote:
> >> +	memory@800000000 {
> >> +		device_type = "memory";
> >> +		reg = <0x8 0 0x2 0>; /* To be filled by loader */
> > 
> > Shouldn't this be 0x800000000 with ~0x80000000 length (or whatever is
> > more common)? Or did I miss some ranges?
> 
> The base model has 8GB of RAM, and RAM always starts at 0x800000000, 
> hence that reg property.
> 
> It's not actually useful to try to boot Linux like this, because it'll 
> step all over device carveouts on both ends and break, but since those 
> are potentially dynamic it doesn't really make sense to use a more 
> specific example for the dts.
> 
> E.g. on my system, with my current firmware version, this ends up 
> getting patched to:
> 
> reg = <0x8 0x0134c000 0x1 0xda294000>

It may be better to handle the memory reserved by the firmware using a
"/reserved-memory" node.  I think the benefit of that could be that it
communicates the entire range of physical memory to the kernel, which
means it could use large mappings in the page tables.  Unless the
"/reserved-memory" node defines a region that has the "no-map"
property of course.

That doesn't really have an impact on this diff though.  The
firmware/bootloader would still have to modify the property on 16GB
models.
Hector Martin March 5, 2021, 4:16 p.m. UTC | #24
On 06/03/2021 00.17, Andy Shevchenko wrote:
> Add a separate change that removes flags from the spin lock in the IRQ handler.

This commit should have no functional changes; I am just splitting an 
existing function into two, where one takes the lock and the other does 
the work. Do you mean using a different locking function? I'm not 
entirely sure what you're suggesting.
Andy Shevchenko March 5, 2021, 4:20 p.m. UTC | #25
On Fri, Mar 5, 2021 at 6:16 PM Hector Martin <marcan@marcan.st> wrote:
>
> On 06/03/2021 00.17, Andy Shevchenko wrote:
> > Add a separate change that removes flags from the spin lock in the IRQ handler.
>
> This commit should have no functional changes;

Exactly my point, which is why I'm suggesting a _separate_ change!

> I am just splitting an
> existing function into two, where one takes the lock and the other does
> the work. Do you mean using a different locking function? I'm not
> entirely sure what you're suggesting.

Yes, as a prerequisite

spin_lock_irqsave -> spin_lock().
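
For illustration, a minimal sketch of that prerequisite as applied to
the handler in this patch (hardirq handlers already run with local
interrupts disabled, so the irqsave/irqrestore pair is redundant):

static irqreturn_t s3c24xx_serial_tx_irq(int irq, void *id)
{
	struct s3c24xx_uart_port *ourport = id;
	struct uart_port *port = &ourport->port;

	spin_lock(&port->lock);
	s3c24xx_serial_tx_chars(ourport);
	spin_unlock(&port->lock);

	return IRQ_HANDLED;
}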
Hector Martin March 5, 2021, 4:29 p.m. UTC | #26
On 06/03/2021 01.20, Andy Shevchenko wrote:
>> I am just splitting an
>> existing function into two, where one takes the lock and the other does
>> the work. Do you mean using a different locking function? I'm not
>> entirely sure what you're suggesting.
> 
> Yes, as a prerequisite
> 
> spin_lock_irqsave -> spin_lock().

Krzysztof, is this something you want in this series? I was trying to 
avoid logic changes to the non-Apple paths.
Hector Martin March 5, 2021, 4:50 p.m. UTC | #27
On 06/03/2021 00.59, Mark Kettenis wrote:
> It may be better to handle the memory reserved by the firmware using a
> "/reserved-memory" node.  I think the benefit of that could be that it
> communicates the entire range of physical memory to the kernel, which
> means it could use large mappings in the page tables.  Unless the
> "/reserved-memory" node defines a region that has the "no-map"
> property of course.

We actually need no-map, because otherwise the CPU could speculate its 
way into these carveouts (it's not just firmware, there's stuff in here 
the CPU really can't be allowed to touch, e.g. the SEP carveout). It 
also breaks simplefb mapping the framebuffer. I thought of the 
reserved-memory approach, but then figured it wouldn't buy us anything 
for this reason.
Hector Martin March 5, 2021, 5:04 p.m. UTC | #28
On 06/03/2021 00.28, Andy Shevchenko wrote:
>> +       case TYPE_APPLE_S5L:
>> +               WARN_ON(1); // No DMA
> 
> Oh, no, please use the ONCE variant.

Thanks, changing this for v4.

> 
> ...
> 
>> +       /* Apple types use these bits for IRQ masks */
>> +       if (ourport->info->type != TYPE_APPLE_S5L) {
>> +               ucon &= ~(S3C64XX_UCON_TIMEOUT_MASK |
>> +                               S3C64XX_UCON_EMPTYINT_EN |
>> +                               S3C64XX_UCON_DMASUS_EN |
>> +                               S3C64XX_UCON_TIMEOUT_EN);
>> +               ucon |= 0xf << S3C64XX_UCON_TIMEOUT_SHIFT |
> 
> Can you spell 0xf with named constant(s), please?
> 
> In case they are repetitive via the code, introduce either a temporary
> variable (in case it scoped to one function only), or define it as a
> constant.

I'm just moving this code; as far as I can tell this is a timeout value 
(so just an integer), but I don't know if there is any special meaning 
to 0xf here. Note that this codepath is for *non-Apple* chips, as the 
Apple ones don't even have this field (at least not here).

>> +       irqreturn_t ret = IRQ_NONE;
> 
> Redundant. You may return directly.

What if both interrupts are pending?

> No IO serialization?

There is no DMA on the Apple variants (as far as I know; it's not 
implemented anyway), so there is no need for serializing IO with DMA. In 
any case, dealing with that is the DMA code's job, the interrupt handler 
shouldn't need to care.

If you mean serializing IO with the IRQ: CPU-wise, I would hope that's 
the irqchip's job (AIC does this with a readl on the event). If you mean 
ensuring all writes are complete (i.e. posted write issue), on the Apple 
chips everything is non-posted as explained in the previous patches.

> Extra blank line (check your entire series for a such)

Thanks, noted. I'll check the declaration blocks in other patches.

>> +       ourport->rx_enabled = 1;
>> +       ourport->tx_enabled = 0;
> 
> How are these protected against race?

The serial core should be holding the port mutex for pretty much every 
call into the driver, as far as I can tell.

> 
> ...
> 
>> +               case TYPE_APPLE_S5L: {
>> +                       unsigned int ucon;
>> +                       int ret;
>> +
>> +                       ret = clk_prepare_enable(ourport->clk);
>> +                       if (ret) {
>> +                               dev_err(dev, "clk_enable clk failed: %d\n", ret);
>> +                               return ret;
>> +                       }
>> +                       if (!IS_ERR(ourport->baudclk)) {
>> +                               ret = clk_prepare_enable(ourport->baudclk);
>> +                               if (ret) {
>> +                                       dev_err(dev, "clk_enable baudclk failed: %d\n", ret);
>> +                                       clk_disable_unprepare(ourport->clk);
>> +                                       return ret;
>> +                               }
>> +                       }
> 
> Wouldn't it be better to use CLK bulk API?

Ah, I guess that could save a line or two of code here, even though it 
requires setting up the array. I'll give it a shot.
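
Roughly, the bulk form would look like this (a sketch only; the clock
names are illustrative, and the optional baudclk handling may still
need special-casing):

	struct clk_bulk_data clks[] = {
		{ .id = "uart" },
		{ .id = "clk_uart_baud0" },
	};
	int ret;

	ret = devm_clk_bulk_get(dev, ARRAY_SIZE(clks), clks);
	if (ret)
		return ret;

	ret = clk_bulk_prepare_enable(ARRAY_SIZE(clks), clks);
	if (ret)
		return ret;
	/* ... */
	clk_bulk_disable_unprepare(ARRAY_SIZE(clks), clks);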

>> +#ifdef CONFIG_ARCH_APPLE
> 
> Why? Wouldn't you like the one kernel to work on many SoCs?

This *adds* Apple support, it is not mutually exclusive with all the 
other SoCs. You can enable all of those options and get a driver that 
works on all of them. This is the same pattern used throughout the 
driver for all the other Samsung variants. There is no reason to have 
Apple SoC support in the samsung driver if the rest of the kernel 
doesn't have Apple SoC support either, of course.

>> +#define APPLE_S5L_UCON_RXTO_ENA_MSK    (1 << APPLE_S5L_UCON_RXTO_ENA)
>> +#define APPLE_S5L_UCON_RXTHRESH_ENA_MSK        (1 << APPLE_S5L_UCON_RXTHRESH_ENA)
>> +#define APPLE_S5L_UCON_TXTHRESH_ENA_MSK        (1 << APPLE_S5L_UCON_TXTHRESH_ENA)
> 
> BIT() ?

I'm trying to keep the style of the rest of the file here, which doesn't 
use BIT() anywhere. I agree this header could use some work though... I 
wonder if I've met my required quota of cleanups to this driver for this 
patchset ;-)

>> +#define APPLE_S5L_UCON_DEFAULT         (S3C2410_UCON_TXIRQMODE | \
>> +                                        S3C2410_UCON_RXIRQMODE | \
>> +                                        S3C2410_UCON_RXFIFO_TOI)
> 
> Indentation level is too high. Hint: start a value of the definition
> on the new line.

Is it that bad? It's within 80 cols, putting one bit per line is more 
readable than several on one line, and this is how the rest of the 
header is written. Is it really better to do

#define APPLE_S5L_UCON_DEFAULT \
	(S3C2410_UCON_TXIRQMODE | S3C2410_UCON_RXIRQMODE | \
	 S3C2410_UCON_RXFIFO_TOI)

or

#define APPLE_S5L_UCON_DEFAULT \
		(S3C2410_UCON_TXIRQMODE | \
		 S3C2410_UCON_RXIRQMODE | \
		 S3C2410_UCON_RXFIFO_TOI)

here? Those don't look like an obvious improvement to me, I'd even say 
overlapping the bits and the macro name in the same columns makes it 
less readable to my eyes.

>> +#define APPLE_S5L_UTRSTAT_RXTHRESH     (1<<4)
>> +#define APPLE_S5L_UTRSTAT_TXTHRESH     (1<<5)
>> +#define APPLE_S5L_UTRSTAT_RXTO         (1<<9)
>> +#define APPLE_S5L_UTRSTAT_ALL_FLAGS    (0x3f0)
> 
> BIT() ?

See above.
Krzysztof Kozlowski March 7, 2021, 11:34 a.m. UTC | #29
On 05/03/2021 17:29, Hector Martin wrote:
> On 06/03/2021 01.20, Andy Shevchenko wrote:
>>> I am just splitting an
>>> existing function into two, where one takes the lock and the other does
>>> the work. Do you mean using a different locking function? I'm not
>>> entirely sure what you're suggesting.
>>
>> Yes, as a prerequisite
>>
>> spin_lock_irqsave -> spin_lock().
> 
> Krzysztof, is this something you want in this series? I was trying to 
> avoid logic changes to the non-Apple paths.

I don't quite get the need for such a change (the code will still be
called in the interrupt handler, right?), but assuming the "why?" is
properly documented, it can be a separate patch here.

Best regards,
Krzysztof
Krzysztof Kozlowski March 7, 2021, 11:40 a.m. UTC | #30
On 05/03/2021 18:04, Hector Martin wrote:
> On 06/03/2021 00.28, Andy Shevchenko wrote:
>>> +       case TYPE_APPLE_S5L:
>>> +               WARN_ON(1); // No DMA
>>
>> Oh, no, please use the ONCE variant.
> 
> Thanks, changing this for v4.
> 
>>
>> ...
>>
>>> +       /* Apple types use these bits for IRQ masks */
>>> +       if (ourport->info->type != TYPE_APPLE_S5L) {
>>> +               ucon &= ~(S3C64XX_UCON_TIMEOUT_MASK |
>>> +                               S3C64XX_UCON_EMPTYINT_EN |
>>> +                               S3C64XX_UCON_DMASUS_EN |
>>> +                               S3C64XX_UCON_TIMEOUT_EN);
>>> +               ucon |= 0xf << S3C64XX_UCON_TIMEOUT_SHIFT |
>>
>> Can you spell 0xf with named constant(s), please?
>>
>> In case they are repetitive via the code, introduce either a temporary
>> variable (in case it scoped to one function only), or define it as a
>> constant.
> 
> I'm just moving this code; as far as I can tell this is a timeout value 
> (so just an integer), but I don't know if there is any special meaning 
> to 0xf here. Note that this codepath is for *non-Apple* chips, as the 
> Apple ones don't even have this field (at least not here).

I agree with Hector here. Andy, you are proposing an unrelated change
here (which, without documentation, might not be doable by Hector).

> 
>>> +       irqreturn_t ret = IRQ_NONE;
>>
>> Redundant. You may return directly.
> 
> What if both interrupts are pending?
> 
>> No IO serialization?
> 
> There is no DMA on the Apple variants (as far as I know; it's not 
> implemented anyway), so there is no need for serializing IO with DMA. In 
> any case, dealing with that is the DMA code's job, the interrupt handler 
> shouldn't need to care.
> 
> If you mean serializing IO with the IRQ: CPU-wise, I would hope that's 
> the irqchip's job (AIC does this with a readl on the event). If you mean 
> ensuring all writes are complete (i.e. posted write issue), on the Apple 
> chips everything is non-posted as explained in the previous patches.
> 
>> Extra blank line (check your entire series for such cases)
> 
> Thanks, noted. I'll check the declaration blocks in other patches.
> 
>>> +       ourport->rx_enabled = 1;
>>> +       ourport->tx_enabled = 0;
>>
>> How are these protected against race?
> 
> The serial core should be holding the port mutex for pretty much every 
> call into the driver, as far as I can tell.
> 
>>
>> ...
>>
>>> +               case TYPE_APPLE_S5L: {
>>> +                       unsigned int ucon;
>>> +                       int ret;
>>> +
>>> +                       ret = clk_prepare_enable(ourport->clk);
>>> +                       if (ret) {
>>> +                               dev_err(dev, "clk_enable clk failed: %d\n", ret);
>>> +                               return ret;
>>> +                       }
>>> +                       if (!IS_ERR(ourport->baudclk)) {
>>> +                               ret = clk_prepare_enable(ourport->baudclk);
>>> +                               if (ret) {
>>> +                                       dev_err(dev, "clk_enable baudclk failed: %d\n", ret);
>>> +                                       clk_disable_unprepare(ourport->clk);
>>> +                                       return ret;
>>> +                               }
>>> +                       }
>>
>> Wouldn't it be better to use CLK bulk API?
> 
> Ah, I guess that could save a line or two of code here, even though it 
> requires setting up the array. I'll give it a shot.
> 
>>> +#ifdef CONFIG_ARCH_APPLE
>>
>> Why? Wouldn't you like the one kernel to work on many SoCs?
> 
> This *adds* Apple support, it is not mutually exclusive with all the 
> other SoCs. You can enable all of those options and get a driver that 
> works on all of them. This is the same pattern used throughout the 
> driver for all the other Samsung variants. There is no reason to have 
> Apple SoC support in the samsung driver if the rest of the kernel 
> doesn't have Apple SoC support either, of course.

How does an ifdef on ARCH_APPLE make it non-working on many SoCs? All new
platforms are multi-platform... The true question is: do the ifdefs in the
code make it more difficult to read/review?

Best regards,
Krzysztof
Arnd Bergmann March 7, 2021, 4:01 p.m. UTC | #31
On Sun, Mar 7, 2021 at 12:34 PM Krzysztof Kozlowski
<krzysztof.kozlowski@canonical.com> wrote:
> On 05/03/2021 17:29, Hector Martin wrote:
> > On 06/03/2021 01.20, Andy Shevchenko wrote:
> >>> I am just splitting an
> >>> existing function into two, where one takes the lock and the other does
> >>> the work. Do you mean using a different locking function? I'm not
> >>> entirely sure what you're suggesting.
> >>
> >> Yes, as a prerequisite
> >>
> >> spin_lock_irqsave -> spin_lock().
> >
> > Krzysztof, is this something you want in this series? I was trying to
> > avoid logic changes to the non-Apple paths.
>
> I don't quite get the need for such a change (the code will still be
> called in the interrupt handler, right?), but assuming the "why?" is
> properly documented, it can be a separate patch here.

This is only for readability: the common rule is to not disable
interrupts when they are already disabled, so a reader might wonder
if this instance of the handler is special in some case that it might
be called with interrupts enabled.

There is also a small overhead in accessing the global irq mask
register on some architectures, but for a uart that does not make
any difference of course.
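
In concrete terms, the suggestion is just that the handler itself can
take the lock without saving the interrupt state, e.g. (sketch only,
assuming the usual port lock):

	spin_lock(&ourport->port.lock);
	/* ... existing interrupt handling ... */
	spin_unlock(&ourport->port.lock);

instead of the spin_lock_irqsave()/spin_unlock_irqrestore() pair, since
a hardirq handler always runs with interrupts disabled anyway.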

While I'm generally in favor of that kind of cleanup, I'd also
prefer to leave it out of this series -- once you get into details
like this the series gets harder to review.

        Arnd
Krzysztof Kozlowski March 7, 2021, 7:51 p.m. UTC | #32
On 07/03/2021 17:01, Arnd Bergmann wrote:
> On Sun, Mar 7, 2021 at 12:34 PM Krzysztof Kozlowski
> <krzysztof.kozlowski@canonical.com> wrote:
>> On 05/03/2021 17:29, Hector Martin wrote:
>>> On 06/03/2021 01.20, Andy Shevchenko wrote:
>>>>> I am just splitting an
>>>>> existing function into two, where one takes the lock and the other does
>>>>> the work. Do you mean using a different locking function? I'm not
>>>>> entirely sure what you're suggesting.
>>>>
>>>> Yes, as a prerequisite
>>>>
>>>> spin_lock_irqsave -> spin_lock().
>>>
>>> Krzysztof, is this something you want in this series? I was trying to
>>> avoid logic changes to the non-Apple paths.
>>
>> I don't quite get the need for such a change (the code will still be
>> called in the interrupt handler, right?), but assuming the "why?" is
>> properly documented, it can be a separate patch here.
> 
> This is only for readability: the common rule is to not disable
> interrupts when they are already disabled, so a reader might wonder
> if this instance of the handler is special in some case that it might
> be called with interrupts enabled.
> 
> There is also a small overhead in accessing the global irq mask
> register on some architectures, but for a uart that does not make
> any difference of course.
> 
> While I'm generally in favor of that kind of cleanup, I'd also
> prefer to leave it out of this series -- once you get into details
> like this the series gets harder to review.

So it's only about the spinlock in the IRQ handler (which does not need
to disable the IRQs). Makes sense but not related at all to the topic of
bringing up Apple M1, therefore should not stop the review/merging.

Best regards,
Krzysztof
Marc Zyngier March 8, 2021, 11:13 a.m. UTC | #33
On Thu, 04 Mar 2021 21:38:42 +0000,
Hector Martin <marcan@marcan.st> wrote:
> 
> This allows the devicetree to correctly represent the available set of
> timers, which varies from device to device, without the need for fake
> dummy interrupts for unavailable slots.
> 
> Also add the hyp-virt timer/PPI, which is not currently used, but worth
> representing.
> 
> Signed-off-by: Hector Martin <marcan@marcan.st>
> Reviewed-by: Tony Lindgren <tony@atomide.com>

Reviewed-by: Marc Zyngier <maz@kernel.org>

Thanks,

	M.
Marc Zyngier March 8, 2021, 11:20 a.m. UTC | #34
On Thu, 04 Mar 2021 21:38:43 +0000,
Hector Martin <marcan@marcan.st> wrote:
> 
> ARM64 currently defaults to posted MMIO (nGnRE), but some devices
> require the use of non-posted MMIO (nGnRnE). Introduce a new ioremap()
> variant to handle this case. ioremap_np() is aliased to ioremap() by
> default on arches that do not implement this variant.
> 
> sparc64 is the only architecture that needs to be touched directly,
> because it includes neither of the generic io.h or iomap.h headers.
> 
> This adds the IORESOURCE_MEM_NONPOSTED flag, which maps to this
> variant and marks a given resource as requiring non-posted mappings.
> This is implemented in the resource system because it is a SoC-level
> requirement, so existing drivers do not need special-case code to pick
> this ioremap variant.
> 
> Then this is implemented in devres by introducing devm_ioremap_np(),
> and making devm_ioremap_resource() automatically select this variant
> when the resource has the IORESOURCE_MEM_NONPOSTED flag set.
> 
> Signed-off-by: Hector Martin <marcan@marcan.st>

Acked-by: Marc Zyngier <maz@kernel.org>

	M.
Marc Zyngier March 8, 2021, 11:22 a.m. UTC | #35
On Thu, 04 Mar 2021 21:38:46 +0000,
Hector Martin <marcan@marcan.st> wrote:
> 
> This is used on Apple ARM platforms, which require most MMIO
> (except PCI devices) to be mapped as nGnRnE.
> 
> Signed-off-by: Hector Martin <marcan@marcan.st>

Acked-by: Marc Zyngier <maz@kernel.org>

	M.
Marc Zyngier March 8, 2021, 11:39 a.m. UTC | #36
On Thu, 04 Mar 2021 21:38:49 +0000,
Hector Martin <marcan@marcan.st> wrote:
> 
> These definitions are in arm-gic-v3.h for historical reasons which no
> longer apply. Move them to sysreg.h so the AIC driver can use them, as
> it needs to peek into vGIC registers to deal with the GIC maintentance
> interrupt.
> 
> Signed-off-by: Hector Martin <marcan@marcan.st>

Acked-by: Marc Zyngier <maz@kernel.org>

	M.
Marc Zyngier March 8, 2021, 11:50 a.m. UTC | #37
On Fri, 05 Mar 2021 15:05:08 +0000,
Andy Shevchenko <andy.shevchenko@gmail.com> wrote:

[...]

> > +#define TIMER_FIRING(x)                                                        \
> > +       (((x) & (ARCH_TIMER_CTRL_ENABLE | ARCH_TIMER_CTRL_IT_MASK |            \
> > +                ARCH_TIMER_CTRL_IT_STAT)) ==                                  \
> > +        (ARCH_TIMER_CTRL_ENABLE | ARCH_TIMER_CTRL_IT_STAT))
> 
> It's a bit hard to read. Perhaps
> 
> #define FOO_MASK  (_ENABLE | _STAT)
> #define _FIRING ... (FOO_MASK | _MASK == FOO_MASK)

The expression above is a direct translation of the architecture
reference manual, and I'd rather not have that hidden behind a bunch
of obscure macros.

[...]

> > +       irqc->hw_domain = irq_domain_create_linear(of_node_to_fwnode(node),
> > +                                                  irqc->nr_hw + AIC_NR_FIQ,
> > +                                                  &aic_irq_domain_ops, irqc);
> 
> If you are sure it will be always OF-only, why not to use
> irq_domain_add_linear()?

The OF-only API is deprecated, and there is no point in using it for
*new* code, especially when things like IPI allocation require the use
of the modern API. For arm64 root controllers, that's the way to go.

Thanks,

	M.
Andy Shevchenko March 8, 2021, 12:02 p.m. UTC | #38
On Mon, Mar 8, 2021 at 1:50 PM Marc Zyngier <maz@kernel.org> wrote:
> On Fri, 05 Mar 2021 15:05:08 +0000,
> Andy Shevchenko <andy.shevchenko@gmail.com> wrote:

...

> > > +#define TIMER_FIRING(x)                                                        \
> > > +       (((x) & (ARCH_TIMER_CTRL_ENABLE | ARCH_TIMER_CTRL_IT_MASK |            \
> > > +                ARCH_TIMER_CTRL_IT_STAT)) ==                                  \
> > > +        (ARCH_TIMER_CTRL_ENABLE | ARCH_TIMER_CTRL_IT_STAT))
> >
> > It's a bit hard to read. Perhaps
> >
> > #define FOO_MASK  (_ENABLE | _STAT)
> > #define _FIRING ... (FOO_MASK | _MASK == FOO_MASK)
>
> The expression above is a direct translation of the architecture
> reference manual, and I'd rather not have that hidden behind a bunch
> of obscure macros.

OK!

...

> > > +       irqc->hw_domain = irq_domain_create_linear(of_node_to_fwnode(node),
> > > +                                                  irqc->nr_hw + AIC_NR_FIQ,
> > > +                                                  &aic_irq_domain_ops, irqc);
> >
> > If you are sure it will be always OF-only, why not to use
> > irq_domain_add_linear()?
>
> The OF-only API is deprecated, and there is no point in using it for
> *new* code, especially when things like IPI allocation require the use
> of the modern API. For arm64 root controllers, that's the way to go.

Good to know, thanks!
Marc Zyngier March 8, 2021, 1:31 p.m. UTC | #39
On Thu, 04 Mar 2021 21:38:51 +0000,
Hector Martin <marcan@marcan.st> wrote:
> 
> This is the root interrupt controller used on Apple ARM SoCs such as the
> M1. This irqchip driver performs multiple functions:
> 
> * Handles both IRQs and FIQs
> 
> * Drives the AIC peripheral itself (which handles IRQs)
> 
> * Dispatches FIQs to downstream hard-wired clients (currently the ARM
>   timer).
> 
> * Implements a virtual IPI multiplexer to funnel multiple Linux IPIs
>   into a single hardware IPI
>

[...]

> Signed-off-by: Hector Martin <marcan@marcan.st>
> +static void __exception_irq_entry aic_handle_irq(struct pt_regs *regs)
> +{
> +	struct aic_irq_chip *ic = aic_irqc;
> +	u32 event, type, irq;
> +
> +	do {
> +		/*
> +		 * We cannot use a relaxed read here, as DMA needs to be
> +		 * ordered with respect to the IRQ firing.
> +		 */
> +		event = readl(ic->base + AIC_EVENT);
> +		type = FIELD_GET(AIC_EVENT_TYPE, event);
> +		irq = FIELD_GET(AIC_EVENT_NUM, event);
> +
> +		if (type == AIC_EVENT_TYPE_HW)
> +			handle_domain_irq(aic_irqc->hw_domain, irq, regs);
> +		else if (type == AIC_EVENT_TYPE_IPI && irq == 1)
> +			aic_handle_ipi(regs);
> +		else if (event != 0)
> +			pr_err("Unknown IRQ event %d, %d\n", type, irq);
> +	} while (event);
> +
> +	/*
> +	 * vGIC maintenance interrupts end up here too, so we need to check
> +	 * for them separately. Just report and disable vGIC for now, until
> +	 * we implement this properly.
> +	 */
> +	if ((read_sysreg_s(SYS_ICH_HCR_EL2) & ICH_HCR_EN) &&
> +		read_sysreg_s(SYS_ICH_MISR_EL2) != 0) {
> +		pr_err("vGIC IRQ fired, disabling.\n");

Please add a _ratelimited here. Whilst debugging KVM on this machine,
I ended up with this firing at such a rate that it was impossible to
do anything useful. Ratelimiting it allowed me to pinpoint the
problem.
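
i.e. something like:

	pr_err_ratelimited("vGIC IRQ fired, disabling.\n");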

[...]

> +/*
> + * FIQ irqchip
> + */
> +
> +static void aic_fiq_mask(struct irq_data *d)
> +{
> +	/* Only the guest timers have real mask bits, unfortunately. */
> +	switch (d->hwirq) {
> +	case AIC_TMR_GUEST_PHYS:
> +		sysreg_clear_set_s(SYS_APL_VM_TMR_FIQ_ENA_EL1, VM_TMR_FIQ_ENABLE_P, 0);
> +		break;
> +	case AIC_TMR_GUEST_VIRT:
> +		sysreg_clear_set_s(SYS_APL_VM_TMR_FIQ_ENA_EL1, VM_TMR_FIQ_ENABLE_V, 0);
> +		break;
> +	}
> +}
> +
> +static void aic_fiq_unmask(struct irq_data *d)
> +{
> +	switch (d->hwirq) {
> +	case AIC_TMR_GUEST_PHYS:
> +		sysreg_clear_set_s(SYS_APL_VM_TMR_FIQ_ENA_EL1, 0, VM_TMR_FIQ_ENABLE_P);
> +		break;
> +	case AIC_TMR_GUEST_VIRT:
> +		sysreg_clear_set_s(SYS_APL_VM_TMR_FIQ_ENA_EL1, 0, VM_TMR_FIQ_ENABLE_V);
> +		break;
> +	}
> +}
> +
> +static void aic_fiq_eoi(struct irq_data *d)
> +{
> +	/* We mask to ack (where we can), so we need to unmask at EOI. */
> +	if (!irqd_irq_disabled(d) && !irqd_irq_masked(d))

Ah, be careful here: irqd_irq_masked() doesn't do what you think it
does for per-CPU interrupts. It's been on my list to fix for the rVIC
implementation, but I never got around to doing it, and all decent ICs
hide this from SW by having a HW-managed mask, similar to what is on
the IRQ side.

I can see two possibilities:

- you can track the masked state directly and use that instead of
  these predicates

- you can just drop the masking altogether as this is only useful to a
  hosted hypervisor (KVM), which will have to do its own masking
  behind the scenes anyway

> +		aic_fiq_unmask(d);
> +}
> +

The rest looks good to me.

Thanks,

	M.
Marc Zyngier March 8, 2021, 3:35 p.m. UTC | #40
On Thu, 04 Mar 2021 21:38:52 +0000,
Hector Martin <marcan@marcan.st> wrote:
> 
> This adds a Kconfig option to toggle support for Apple ARM SoCs.
> At this time this targets the M1 and later "Apple Silicon" Mac SoCs.
> 
> Signed-off-by: Hector Martin <marcan@marcan.st>
> ---
>  arch/arm64/Kconfig.platforms | 8 ++++++++
>  arch/arm64/configs/defconfig | 1 +
>  2 files changed, 9 insertions(+)
> 
> diff --git a/arch/arm64/Kconfig.platforms b/arch/arm64/Kconfig.platforms
> index cdfd5fed457f..c2b5791e3d69 100644
> --- a/arch/arm64/Kconfig.platforms
> +++ b/arch/arm64/Kconfig.platforms
> @@ -36,6 +36,14 @@ config ARCH_ALPINE
>  	  This enables support for the Annapurna Labs Alpine
>  	  Soc family.
>  
> +config ARCH_APPLE
> +	bool "Apple Silicon SoC family"
> +	select APPLE_AIC
> +	select ARM64_FIQ_SUPPORT

Do we still need this FIQ symbol? I thought it was now gone...

Thanks,

	M.
Hector Martin March 9, 2021, 8:29 p.m. UTC | #41
On 06/03/2021 00.51, Arnd Bergmann wrote:
> On Fri, Mar 5, 2021 at 4:09 PM Andy Shevchenko
> <andy.shevchenko@gmail.com> wrote:
>> On Fri, Mar 5, 2021 at 12:25 PM Linus Walleij <linus.walleij@linaro.org> wrote:
>>> On Thu, Mar 4, 2021 at 10:40 PM Hector Martin <marcan@marcan.st> wrote:
>>>
>>>> This documents the newly introduced ioremap_np() along with all the
>>>> other common ioremap() variants, and some higher-level abstractions
>>>> available.
>>>>
>>>> Signed-off-by: Hector Martin <marcan@marcan.st>
>>>
>>> I like this, I just want one change:
>>>
>>> Put the common ioremap() on top in all paragraphs, so the norm
>>> comes before the exceptions.
>>>
>>> I.e. it is weird to mention ioremap_np() before mentioning ioremap().
>>
>> +1 here. That is what I have stumbled upon reading carefully.
> 
> In that case, the order should probably be:
> 
> ioremap
> ioremap_wc
> ioremap_wt
> ioremap_np
> ioremap_uc
> ioremap_cache
> 
> Going from most common to least common, rather than going from
> strongest to weakest.

Yeah, I was dwelling on the issue of ioremap_np being first when I wrote 
that... this alternative works for me, I'll sort it like this then. 
It'll just need some re-wording to make it all flow properly.
Hector Martin March 9, 2021, 8:30 p.m. UTC | #42
On 09/03/2021 00.35, Marc Zyngier wrote:
> On Thu, 04 Mar 2021 21:38:52 +0000,
> Hector Martin <marcan@marcan.st> wrote:
>>
>> This adds a Kconfig option to toggle support for Apple ARM SoCs.
>> At this time this targets the M1 and later "Apple Silicon" Mac SoCs.
>>
>> Signed-off-by: Hector Martin <marcan@marcan.st>
>> ---
>>   arch/arm64/Kconfig.platforms | 8 ++++++++
>>   arch/arm64/configs/defconfig | 1 +
>>   2 files changed, 9 insertions(+)
>>
>> diff --git a/arch/arm64/Kconfig.platforms b/arch/arm64/Kconfig.platforms
>> index cdfd5fed457f..c2b5791e3d69 100644
>> --- a/arch/arm64/Kconfig.platforms
>> +++ b/arch/arm64/Kconfig.platforms
>> @@ -36,6 +36,14 @@ config ARCH_ALPINE
>>   	  This enables support for the Annapurna Labs Alpine
>>   	  Soc family.
>>   
>> +config ARCH_APPLE
>> +	bool "Apple Silicon SoC family"
>> +	select APPLE_AIC
>> +	select ARM64_FIQ_SUPPORT
> 
> Do we still need this FIQ symbol? I thought it was now gone...

Whoops! Thanks for the catch, this can go away.
Linus Walleij March 10, 2021, 11:11 p.m. UTC | #43
On Thu, Mar 4, 2021 at 10:42 PM Hector Martin <marcan@marcan.st> wrote:

> Earlycon support is identical to S3C2410, but Apple SoCs also need
> MMIO mapped as nGnRnE. This is handled generically for normal drivers
> including the normal UART path here, but earlycon uses fixmap and
> runs before that scaffolding is ready.
>
> Since this is the only case where we need this fix, it makes more
> sense to do it here in the UART driver instead of introducing a
> whole fdt nonposted-mmio resolver just for earlycon/fixmap.
>
> Suggested-by: Arnd Bergmann <arnd@arndb.de>
> Signed-off-by: Hector Martin <marcan@marcan.st>

This is as elegant as it gets!
Reviewed-by: Linus Walleij <linus.walleij@linaro.org>

Yours,
Linus Walleij
Will Deacon March 24, 2021, 6:05 p.m. UTC | #44
On Fri, Mar 05, 2021 at 06:38:36AM +0900, Hector Martin wrote:
> From: Marc Zyngier <maz@kernel.org>
> 
> It seems that the CPU known as Apple M1 has the terrible habit
> of being stuck with HCR_EL2.E2H==1, in violation of the architecture.
> 
> Try and work around this deplorable state of affairs by detecting
> the stuck bit early and short-circuit the nVHE dance. It is still
> unknown whether there are many more such nuggets to be found...
> 
> Reported-by: Hector Martin <marcan@marcan.st>
> Signed-off-by: Marc Zyngier <maz@kernel.org>
> ---
>  arch/arm64/kernel/head.S     | 33 ++++++++++++++++++++++++++++++---
>  arch/arm64/kernel/hyp-stub.S | 28 ++++++++++++++++++++++++----
>  2 files changed, 54 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/arm64/kernel/head.S b/arch/arm64/kernel/head.S
> index 66b0e0b66e31..673002b11865 100644
> --- a/arch/arm64/kernel/head.S
> +++ b/arch/arm64/kernel/head.S
> @@ -477,14 +477,13 @@ EXPORT_SYMBOL(kimage_vaddr)
>   * booted in EL1 or EL2 respectively.
>   */
>  SYM_FUNC_START(init_kernel_el)
> -	mov_q	x0, INIT_SCTLR_EL1_MMU_OFF
> -	msr	sctlr_el1, x0
> -
>  	mrs	x0, CurrentEL
>  	cmp	x0, #CurrentEL_EL2
>  	b.eq	init_el2
>  
>  SYM_INNER_LABEL(init_el1, SYM_L_LOCAL)
> +	mov_q	x0, INIT_SCTLR_EL1_MMU_OFF
> +	msr	sctlr_el1, x0
>  	isb
>  	mov_q	x0, INIT_PSTATE_EL1
>  	msr	spsr_el1, x0
> @@ -504,6 +503,34 @@ SYM_INNER_LABEL(init_el2, SYM_L_LOCAL)
>  	msr	vbar_el2, x0
>  	isb
>  
> +	/*
> +	 * Fruity CPUs seem to have HCR_EL2.E2H set to RES1,
> +	 * making it impossible to start in nVHE mode. Is that
> +	 * compliant with the architecture? Absolutely not!
> +	 */
> +	mrs	x0, hcr_el2
> +	and	x0, x0, #HCR_E2H
> +	cbz	x0, 1f
> +
> +	/* Switching to VHE requires a sane SCTLR_EL1 as a start */
> +	mov_q	x0, INIT_SCTLR_EL1_MMU_OFF
> +	msr_s	SYS_SCTLR_EL12, x0
> +
> +	/*
> +	 * Force an eret into a helper "function", and let it return
> +	 * to our original caller... This makes sure that we have
> +	 * initialised the basic PSTATE state.
> +	 */
> +	mov	x0, #INIT_PSTATE_EL2
> +	msr	spsr_el1, x0
> +	adr_l	x0, stick_to_vhe
> +	msr	elr_el1, x0
> +	eret
> +
> +1:
> +	mov_q	x0, INIT_SCTLR_EL1_MMU_OFF
> +	msr	sctlr_el1, x0
> +
>  	msr	elr_el2, lr
>  	mov	w0, #BOOT_CPU_MODE_EL2
>  	eret
> diff --git a/arch/arm64/kernel/hyp-stub.S b/arch/arm64/kernel/hyp-stub.S
> index 5eccbd62fec8..c7601030ee82 100644
> --- a/arch/arm64/kernel/hyp-stub.S
> +++ b/arch/arm64/kernel/hyp-stub.S
> @@ -27,12 +27,12 @@ SYM_CODE_START(__hyp_stub_vectors)
>  	ventry	el2_fiq_invalid			// FIQ EL2t
>  	ventry	el2_error_invalid		// Error EL2t
>  
> -	ventry	el2_sync_invalid		// Synchronous EL2h
> +	ventry	elx_sync			// Synchronous EL2h
>  	ventry	el2_irq_invalid			// IRQ EL2h
>  	ventry	el2_fiq_invalid			// FIQ EL2h
>  	ventry	el2_error_invalid		// Error EL2h
>  
> -	ventry	el1_sync			// Synchronous 64-bit EL1
> +	ventry	elx_sync			// Synchronous 64-bit EL1
>  	ventry	el1_irq_invalid			// IRQ 64-bit EL1
>  	ventry	el1_fiq_invalid			// FIQ 64-bit EL1
>  	ventry	el1_error_invalid		// Error 64-bit EL1
> @@ -45,7 +45,7 @@ SYM_CODE_END(__hyp_stub_vectors)
>  
>  	.align 11
>  
> -SYM_CODE_START_LOCAL(el1_sync)
> +SYM_CODE_START_LOCAL(elx_sync)
>  	cmp	x0, #HVC_SET_VECTORS
>  	b.ne	1f
>  	msr	vbar_el2, x1
> @@ -71,7 +71,7 @@ SYM_CODE_START_LOCAL(el1_sync)
>  
>  9:	mov	x0, xzr
>  	eret
> -SYM_CODE_END(el1_sync)
> +SYM_CODE_END(elx_sync)
>  
>  // nVHE? No way! Give me the real thing!
>  SYM_CODE_START_LOCAL(mutate_to_vhe)
> @@ -243,3 +243,23 @@ SYM_FUNC_START(switch_to_vhe)
>  #endif
>  	ret
>  SYM_FUNC_END(switch_to_vhe)
> +
> +SYM_FUNC_START(stick_to_vhe)
> +	/*
> +	 * Make sure the switch to VHE cannot fail, by overriding the
> +	 * override. This is hilarious.
> +	 */
> +	adr_l	x1, id_aa64mmfr1_override
> +	add	x1, x1, #FTR_OVR_MASK_OFFSET
> +	dc 	civac, x1
> +	dsb	sy
> +	isb

Why do we need an ISB here?

> +	ldr	x0, [x1]
> +	bic	x0, x0, #(0xf << ID_AA64MMFR1_VHE_SHIFT)
> +	str	x0, [x1]

I find it a bit bizarre doing this here, as for the primary CPU we're still
a way away from parsing the early parameters and for secondary CPUs this
doesn't need to be done for each one. Furthermore, this same code is run
on the resume path, which can probably then race with itself.

Is it possible to do it later on the boot CPU only, e.g. in
init_feature_override()? We should probably also log a warning that we're
ignoring the option because nVHE is not available.

Will
Will Deacon March 24, 2021, 6:12 p.m. UTC | #45
On Fri, Mar 05, 2021 at 06:38:43AM +0900, Hector Martin wrote:
> ARM64 currently defaults to posted MMIO (nGnRE), but some devices
> require the use of non-posted MMIO (nGnRnE). Introduce a new ioremap()
> variant to handle this case. ioremap_np() is aliased to ioremap() by
> default on arches that do not implement this variant.
> 
> sparc64 is the only architecture that needs to be touched directly,
> because it includes neither of the generic io.h or iomap.h headers.
> 
> This adds the IORESOURCE_MEM_NONPOSTED flag, which maps to this
> variant and marks a given resource as requiring non-posted mappings.
> This is implemented in the resource system because it is a SoC-level
> requirement, so existing drivers do not need special-case code to pick
> this ioremap variant.
> 
> Then this is implemented in devres by introducing devm_ioremap_np(),
> and making devm_ioremap_resource() automatically select this variant
> when the resource has the IORESOURCE_MEM_NONPOSTED flag set.
> 
> Signed-off-by: Hector Martin <marcan@marcan.st>
> ---
>  .../driver-api/driver-model/devres.rst        |  1 +
>  arch/sparc/include/asm/io_64.h                |  4 ++++
>  include/asm-generic/io.h                      | 22 ++++++++++++++++++-
>  include/asm-generic/iomap.h                   |  9 ++++++++
>  include/linux/io.h                            |  2 ++
>  include/linux/ioport.h                        |  1 +
>  lib/devres.c                                  | 22 +++++++++++++++++++
>  7 files changed, 60 insertions(+), 1 deletion(-)
> 
> diff --git a/Documentation/driver-api/driver-model/devres.rst b/Documentation/driver-api/driver-model/devres.rst
> index cd8b6e657b94..2f45877a539d 100644
> --- a/Documentation/driver-api/driver-model/devres.rst
> +++ b/Documentation/driver-api/driver-model/devres.rst
> @@ -309,6 +309,7 @@ IOMAP
>    devm_ioremap()
>    devm_ioremap_uc()
>    devm_ioremap_wc()
> +  devm_ioremap_np()
>    devm_ioremap_resource() : checks resource, requests memory region, ioremaps
>    devm_ioremap_resource_wc()
>    devm_platform_ioremap_resource() : calls devm_ioremap_resource() for platform device
> diff --git a/arch/sparc/include/asm/io_64.h b/arch/sparc/include/asm/io_64.h
> index 9bb27e5c22f1..9fbfc9574432 100644
> --- a/arch/sparc/include/asm/io_64.h
> +++ b/arch/sparc/include/asm/io_64.h
> @@ -409,6 +409,10 @@ static inline void __iomem *ioremap(unsigned long offset, unsigned long size)
>  #define ioremap_uc(X,Y)			ioremap((X),(Y))
>  #define ioremap_wc(X,Y)			ioremap((X),(Y))
>  #define ioremap_wt(X,Y)			ioremap((X),(Y))
> +static inline void __iomem *ioremap_np(unsigned long offset, unsigned long size)
> +{
> +	return NULL;
> +}
>  
>  static inline void iounmap(volatile void __iomem *addr)
>  {
> diff --git a/include/asm-generic/io.h b/include/asm-generic/io.h
> index c6af40ce03be..082e0c96db6e 100644
> --- a/include/asm-generic/io.h
> +++ b/include/asm-generic/io.h
> @@ -942,7 +942,9 @@ static inline void *phys_to_virt(unsigned long address)
>   *
>   * ioremap_wc() and ioremap_wt() can provide more relaxed caching attributes
>   * for specific drivers if the architecture choses to implement them.  If they
> - * are not implemented we fall back to plain ioremap.
> + * are not implemented we fall back to plain ioremap. Conversely, ioremap_np()
> + * can provide stricter non-posted write semantics if the architecture
> + * implements them.
>   */
>  #ifndef CONFIG_MMU
>  #ifndef ioremap
> @@ -993,6 +995,24 @@ static inline void __iomem *ioremap_uc(phys_addr_t offset, size_t size)
>  {
>  	return NULL;
>  }
> +
> +/*
> + * ioremap_np needs an explicit architecture implementation, as it
> + * requests stronger semantics than regular ioremap(). Portable drivers
> + * should instead use one of the higher-level abstractions, like
> + * devm_ioremap_resource(), to choose the correct variant for any given
> + * device and bus. Portable drivers with a good reason to want non-posted
> + * write semantics should always provide an ioremap() fallback in case
> + * ioremap_np() is not available.
> + */
> +#ifndef ioremap_np
> +#define ioremap_np ioremap_np
> +static inline void __iomem *ioremap_np(phys_addr_t offset, size_t size)
> +{
> +	return NULL;
> +}
> +#endif

Can we implement the generic pci_remap_cfgspace() in terms of ioremap_np()
if it is supported by the architecture? That way, we could avoid defining
both on arm64.

Will
Will Deacon March 24, 2021, 6:13 p.m. UTC | #46
On Fri, Mar 05, 2021 at 06:38:40AM +0900, Hector Martin wrote:
> The implementor will be used to condition the FIQ support quirk.
> 
> The specific CPU types are not used at the moment, but let's add them
> for documentation purposes.
> 
> Signed-off-by: Hector Martin <marcan@marcan.st>
> ---
>  arch/arm64/include/asm/cputype.h | 6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/arch/arm64/include/asm/cputype.h b/arch/arm64/include/asm/cputype.h
> index ef5b040dee44..6231e1f0abe7 100644
> --- a/arch/arm64/include/asm/cputype.h
> +++ b/arch/arm64/include/asm/cputype.h
> @@ -59,6 +59,7 @@
>  #define ARM_CPU_IMP_NVIDIA		0x4E
>  #define ARM_CPU_IMP_FUJITSU		0x46
>  #define ARM_CPU_IMP_HISI		0x48
> +#define ARM_CPU_IMP_APPLE		0x61
>  
>  #define ARM_CPU_PART_AEM_V8		0xD0F
>  #define ARM_CPU_PART_FOUNDATION		0xD00
> @@ -99,6 +100,9 @@
>  
>  #define HISI_CPU_PART_TSV110		0xD01
>  
> +#define APPLE_CPU_PART_M1_ICESTORM	0x022
> +#define APPLE_CPU_PART_M1_FIRESTORM	0x023
> +
>  #define MIDR_CORTEX_A53 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A53)
>  #define MIDR_CORTEX_A57 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A57)
>  #define MIDR_CORTEX_A72 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A72)
> @@ -127,6 +131,8 @@
>  #define MIDR_NVIDIA_CARMEL MIDR_CPU_MODEL(ARM_CPU_IMP_NVIDIA, NVIDIA_CPU_PART_CARMEL)
>  #define MIDR_FUJITSU_A64FX MIDR_CPU_MODEL(ARM_CPU_IMP_FUJITSU, FUJITSU_CPU_PART_A64FX)
>  #define MIDR_HISI_TSV110 MIDR_CPU_MODEL(ARM_CPU_IMP_HISI, HISI_CPU_PART_TSV110)
> +#define MIDR_APPLE_M1_ICESTORM MIDR_CPU_MODEL(ARM_CPU_IMP_APPLE, APPLE_CPU_PART_M1_ICESTORM)
> +#define MIDR_APPLE_M1_FIRESTORM MIDR_CPU_MODEL(ARM_CPU_IMP_APPLE, APPLE_CPU_PART_M1_FIRESTORM)

We usually only merge these when they're needed, but this SoC seems broken
enough that I can see the value in having them from the start :(

Acked-by: Will Deacon <will@kernel.org>

Will
Will Deacon March 24, 2021, 6:18 p.m. UTC | #47
On Fri, Mar 05, 2021 at 06:38:46AM +0900, Hector Martin wrote:
> This is used on Apple ARM platforms, which require most MMIO
> (except PCI devices) to be mapped as nGnRnE.
> 
> Signed-off-by: Hector Martin <marcan@marcan.st>
> ---
>  arch/arm64/include/asm/io.h | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/arch/arm64/include/asm/io.h b/arch/arm64/include/asm/io.h
> index 5ea8656a2030..953b8703af60 100644
> --- a/arch/arm64/include/asm/io.h
> +++ b/arch/arm64/include/asm/io.h
> @@ -169,6 +169,7 @@ extern void __iomem *ioremap_cache(phys_addr_t phys_addr, size_t size);
>  
>  #define ioremap(addr, size)		__ioremap((addr), (size), __pgprot(PROT_DEVICE_nGnRE))
>  #define ioremap_wc(addr, size)		__ioremap((addr), (size), __pgprot(PROT_NORMAL_NC))
> +#define ioremap_np(addr, size)		__ioremap((addr), (size), __pgprot(PROT_DEVICE_nGnRnE))

Probably worth noting that whether or not this actually results in a
non-posted mapping depends on the system architecture, but this is the
best we can do, so:

Acked-by: Will Deacon <will@kernel.org>

I would personally prefer that drivers didn't have to care about this, and
ioremap on arm64 figured out the right attributes based on the region being
mapped, but I haven't followed the discussion closely so I won't die on that
hill.

Will
Will Deacon March 24, 2021, 6:23 p.m. UTC | #48
On Fri, Mar 05, 2021 at 06:38:49AM +0900, Hector Martin wrote:
> These definitions are in arm-gic-v3.h for historical reasons which no
> longer apply. Move them to sysreg.h so the AIC driver can use them, as
> it needs to peek into vGIC registers to deal with the GIC maintenance
> interrupt.
> 
> Signed-off-by: Hector Martin <marcan@marcan.st>
> ---
>  arch/arm64/include/asm/sysreg.h    | 60 ++++++++++++++++++++++++++++++
>  include/linux/irqchip/arm-gic-v3.h | 56 ----------------------------
>  2 files changed, 60 insertions(+), 56 deletions(-)

This would be much easier to review if you just moved the definitions,
rather than reordering them and changing the comments at the same time!

But I *think* nothing had changed, so:

Acked-by: Will Deacon <will@kernel.org>

Will
Will Deacon March 24, 2021, 6:38 p.m. UTC | #49
On Fri, Mar 05, 2021 at 06:38:48AM +0900, Hector Martin wrote:
> Apple ARM64 SoCs have a ton of vendor-specific registers we're going to
> have to deal with, and those don't really belong in sysreg.h with all
> the architectural registers. Make a new home for them, and add some
> registers which are useful for early bring-up.
> 
> Signed-off-by: Hector Martin <marcan@marcan.st>
> ---
>  MAINTAINERS                           |  1 +
>  arch/arm64/include/asm/sysreg_apple.h | 69 +++++++++++++++++++++++++++
>  2 files changed, 70 insertions(+)
>  create mode 100644 arch/arm64/include/asm/sysreg_apple.h
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index aec14fbd61b8..3a352c687d4b 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -1646,6 +1646,7 @@ B:	https://github.com/AsahiLinux/linux/issues
>  C:	irc://chat.freenode.net/asahi-dev
>  T:	git https://github.com/AsahiLinux/linux.git
>  F:	Documentation/devicetree/bindings/arm/apple.yaml
> +F:	arch/arm64/include/asm/sysreg_apple.h

(this isn't needed with my suggestion below).

>  ARM/ARTPEC MACHINE SUPPORT
>  M:	Jesper Nilsson <jesper.nilsson@axis.com>
> diff --git a/arch/arm64/include/asm/sysreg_apple.h b/arch/arm64/include/asm/sysreg_apple.h
> new file mode 100644
> index 000000000000..48347a51d564
> --- /dev/null
> +++ b/arch/arm64/include/asm/sysreg_apple.h

I doubt apple are the only folks doing this, so can we instead have
sysreg-impdef.h please, and then have an Apple section in there for these
registers? That way, we could also have an imp_sys_reg() macro to limit
CRn to 11 or 15, which is the reserved encoding space for these registers.
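
Something like this, perhaps (only a sketch; the use of
BUILD_BUG_ON_ZERO() to enforce the check at compile time is an
assumption, not an existing helper for this):

#define imp_sys_reg(op0, op1, crn, crm, op2) \
	(BUILD_BUG_ON_ZERO((crn) != 11 && (crn) != 15) + \
	 sys_reg(op0, op1, crn, crm, op2))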

We'll cc you for any patches touching the Apple parts, as we don't have
the first clue about what's hiding in there.

> @@ -0,0 +1,69 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +/*
> + * Apple SoC vendor-defined system register definitions
> + *
> + * Copyright The Asahi Linux Contributors
> +
> + * This file contains only well-understood registers that are useful to
> + * Linux. If you are looking for things to add here, you should visit:
> + *
> + * https://github.com/AsahiLinux/docs/wiki/HW:ARM-System-Registers
> + */
> +
> +#ifndef __ASM_SYSREG_APPLE_H
> +#define __ASM_SYSREG_APPLE_H
> +
> +#include <asm/sysreg.h>
> +#include <linux/bits.h>
> +#include <linux/bitfield.h>
> +
> +/*
> + * Keep these registers in encoding order, except for register arrays;
> + * those should be listed in array order starting from the position of
> + * the encoding of the first register.
> + */
> +
> +#define SYS_APL_PMCR0_EL1		sys_reg(3, 1, 15, 0, 0)
> +#define PMCR0_IMODE			GENMASK(10, 8)
> +#define PMCR0_IMODE_OFF			0
> +#define PMCR0_IMODE_PMI			1
> +#define PMCR0_IMODE_AIC			2
> +#define PMCR0_IMODE_HALT		3
> +#define PMCR0_IMODE_FIQ			4
> +#define PMCR0_IACT			BIT(11)

The Arm ARM says this about imp-def sysregs:

  | The Arm architecture guarantees not to define any register name
  | prefixed with IMP_ as part of the standard Arm architecture.
  |
  | Note
  | Arm strongly recommends that any register names created in the
  | IMPLEMENTATION DEFINED register spaces be prefixed with IMP_ and
  | postfixed with _ELx, where appropriate.

and it seems like we could follow that here without much effort, if you
don't mind.
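
i.e. something like:

#define SYS_IMP_APL_PMCR0_EL1		sys_reg(3, 1, 15, 0, 0)

for the register quoted above, and similarly for the rest.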

Will
Mark Rutland March 24, 2021, 6:59 p.m. UTC | #50
On Wed, Mar 24, 2021 at 06:38:18PM +0000, Will Deacon wrote:
> On Fri, Mar 05, 2021 at 06:38:48AM +0900, Hector Martin wrote:
> > Apple ARM64 SoCs have a ton of vendor-specific registers we're going to
> > have to deal with, and those don't really belong in sysreg.h with all
> > the architectural registers. Make a new home for them, and add some
> > registers which are useful for early bring-up.
> > 
> > Signed-off-by: Hector Martin <marcan@marcan.st>
> > ---
> >  MAINTAINERS                           |  1 +
> >  arch/arm64/include/asm/sysreg_apple.h | 69 +++++++++++++++++++++++++++
> >  2 files changed, 70 insertions(+)
> >  create mode 100644 arch/arm64/include/asm/sysreg_apple.h
> > 
> > diff --git a/MAINTAINERS b/MAINTAINERS
> > index aec14fbd61b8..3a352c687d4b 100644
> > --- a/MAINTAINERS
> > +++ b/MAINTAINERS
> > @@ -1646,6 +1646,7 @@ B:	https://github.com/AsahiLinux/linux/issues
> >  C:	irc://chat.freenode.net/asahi-dev
> >  T:	git https://github.com/AsahiLinux/linux.git
> >  F:	Documentation/devicetree/bindings/arm/apple.yaml
> > +F:	arch/arm64/include/asm/sysreg_apple.h
> 
> (this isn't needed with my suggestion below).
> 
> >  ARM/ARTPEC MACHINE SUPPORT
> >  M:	Jesper Nilsson <jesper.nilsson@axis.com>
> > diff --git a/arch/arm64/include/asm/sysreg_apple.h b/arch/arm64/include/asm/sysreg_apple.h
> > new file mode 100644
> > index 000000000000..48347a51d564
> > --- /dev/null
> > +++ b/arch/arm64/include/asm/sysreg_apple.h
> 
> I doubt apple are the only folks doing this, so can we instead have
> sysreg-impdef.h please, and then have an Apple section in there for these
> registers? That way, we could also have an imp_sys_reg() macro to limit
> CRn to 11 or 15, which is the reserved encoding space for these registers.
> 
> We'll cc you for any patches touching the Apple parts, as we don't have
> the first clue about what's hiding in there.

For existing IMP-DEF sysregs (e.g. the Kryo L2 control registers), we've
put the definitions in the drivers, rather than collating
non-architectural bits under arch/arm64/.

So far we've kept arch/arm64/ largely devoid of IMP-DEF bits, and it
seems a shame to add something with the sole purpose of collating that,
especially given arch code shouldn't need to touch these if FW and
bootloader have done their jobs right.

Can we put the definitions in the relevant drivers? That would sidestep
any pain with MAINTAINERS, too.

Thanks,
Mark.
Will Deacon March 24, 2021, 7:04 p.m. UTC | #51
On Wed, Mar 24, 2021 at 06:59:21PM +0000, Mark Rutland wrote:
> On Wed, Mar 24, 2021 at 06:38:18PM +0000, Will Deacon wrote:
> > On Fri, Mar 05, 2021 at 06:38:48AM +0900, Hector Martin wrote:
> > > Apple ARM64 SoCs have a ton of vendor-specific registers we're going to
> > > have to deal with, and those don't really belong in sysreg.h with all
> > > the architectural registers. Make a new home for them, and add some
> > > registers which are useful for early bring-up.
> > > 
> > > Signed-off-by: Hector Martin <marcan@marcan.st>
> > > ---
> > >  MAINTAINERS                           |  1 +
> > >  arch/arm64/include/asm/sysreg_apple.h | 69 +++++++++++++++++++++++++++
> > >  2 files changed, 70 insertions(+)
> > >  create mode 100644 arch/arm64/include/asm/sysreg_apple.h
> > > 
> > > diff --git a/MAINTAINERS b/MAINTAINERS
> > > index aec14fbd61b8..3a352c687d4b 100644
> > > --- a/MAINTAINERS
> > > +++ b/MAINTAINERS
> > > @@ -1646,6 +1646,7 @@ B:	https://github.com/AsahiLinux/linux/issues
> > >  C:	irc://chat.freenode.net/asahi-dev
> > >  T:	git https://github.com/AsahiLinux/linux.git
> > >  F:	Documentation/devicetree/bindings/arm/apple.yaml
> > > +F:	arch/arm64/include/asm/sysreg_apple.h
> > 
> > (this isn't needed with my suggestion below).
> > 
> > >  ARM/ARTPEC MACHINE SUPPORT
> > >  M:	Jesper Nilsson <jesper.nilsson@axis.com>
> > > diff --git a/arch/arm64/include/asm/sysreg_apple.h b/arch/arm64/include/asm/sysreg_apple.h
> > > new file mode 100644
> > > index 000000000000..48347a51d564
> > > --- /dev/null
> > > +++ b/arch/arm64/include/asm/sysreg_apple.h
> > 
> > I doubt apple are the only folks doing this, so can we instead have
> > sysreg-impdef.h please, and then have an Apple section in there for these
> > registers? That way, we could also have an imp_sys_reg() macro to limit
> > CRn to 11 or 15, which is the reserved encoding space for these registers.
> > 
> > We'll cc you for any patches touching the Apple parts, as we don't have
> > the first clue about what's hiding in there.
> 
> For existing IMP-DEF sysregs (e.g. the Kryo L2 control registers), we've
> put the definitions in the drivers, rather than collating
> non-architectural bits under arch/arm64/.

Yeah, but we could include those here as well.

> So far we've kept arch/arm64/ largely devoid of IMP-DEF bits, and it
> seems a shame to add something with the sole purpose of collating that,
> especially given arch code shouldn't need to touch these if FW and
> bootloader have done their jobs right.
> 
> Can we put the definitions in the relevant drivers? That would sidestep
> any pain with MAINTAINERS, too.

If we can genuinely ignore these in arch code, then sure. I just don't know
how long that is going to be the case, and ending up in a situation where
these are scattered randomly throughout the tree sounds horrible to me.

Will
Arnd Bergmann March 24, 2021, 7:09 p.m. UTC | #52
On Wed, Mar 24, 2021 at 7:12 PM Will Deacon <will@kernel.org> wrote:
>
> > +/*
> > + * ioremap_np needs an explicit architecture implementation, as it
> > + * requests stronger semantics than regular ioremap(). Portable drivers
> > + * should instead use one of the higher-level abstractions, like
> > + * devm_ioremap_resource(), to choose the correct variant for any given
> > + * device and bus. Portable drivers with a good reason to want non-posted
> > + * write semantics should always provide an ioremap() fallback in case
> > + * ioremap_np() is not available.
> > + */
> > +#ifndef ioremap_np
> > +#define ioremap_np ioremap_np
> > +static inline void __iomem *ioremap_np(phys_addr_t offset, size_t size)
> > +{
> > +     return NULL;
> > +}
> > +#endif
>
> Can we implement the generic pci_remap_cfgspace() in terms of ioremap_np()
> if it is supported by the architecture? That way, we could avoid defining
> both on arm64.

Good idea. It needs a fallback in case the ioremap_np() fails on most
architectures, but that sounds easy enough.

Since pci_remap_cfgspace() only has custom implementations, it sounds like
we can actually make the generic implementation unconditional in the end,
but that requires adding ioremap_np() on 32-bit as well, and I would keep
that separate from this series.
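
Roughly (an untested sketch of what the generic helper with a fallback
could look like):

static inline void __iomem *pci_remap_cfgspace(phys_addr_t offset,
					       size_t size)
{
	/* Prefer non-posted mappings where the architecture provides them */
	void __iomem *ret = ioremap_np(offset, size);

	if (!ret)
		ret = ioremap(offset, size);
	return ret;
}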

        Arnd
Will Deacon March 24, 2021, 7:57 p.m. UTC | #53
Hi Hector,

Sorry it took me so long to get to this. Some comments below.

On Fri, Mar 05, 2021 at 06:38:51AM +0900, Hector Martin wrote:
> This is the root interrupt controller used on Apple ARM SoCs such as the
> M1. This irqchip driver performs multiple functions:
> 
> * Handles both IRQs and FIQs
> 
> * Drives the AIC peripheral itself (which handles IRQs)
> 
> * Dispatches FIQs to downstream hard-wired clients (currently the ARM
>   timer).
> 
> * Implements a virtual IPI multiplexer to funnel multiple Linux IPIs
>   into a single hardware IPI
> 
> Signed-off-by: Hector Martin <marcan@marcan.st>
> ---
>  MAINTAINERS                     |   2 +
>  drivers/irqchip/Kconfig         |   8 +
>  drivers/irqchip/Makefile        |   1 +
>  drivers/irqchip/irq-apple-aic.c | 710 ++++++++++++++++++++++++++++++++
>  include/linux/cpuhotplug.h      |   1 +
>  5 files changed, 722 insertions(+)
>  create mode 100644 drivers/irqchip/irq-apple-aic.c

[...]

> + * Implementation notes:
> + *
> + * - This driver creates two IRQ domains, one for HW IRQs and internal FIQs,
> + *   and one for IPIs.
> + * - Since Linux needs more than 2 IPIs, we implement a software IRQ controller
> + *   and funnel all IPIs into one per-CPU IPI (the second "self" IPI is unused).
> + * - FIQ hwirq numbers are assigned after true hwirqs, and are per-cpu.
> + * - DT bindings use 3-cell form (like GIC):
> + *   - <0 nr flags> - hwirq #nr
> + *   - <1 nr flags> - FIQ #nr
> + *     - nr=0  Physical HV timer
> + *     - nr=1  Virtual HV timer
> + *     - nr=2  Physical guest timer
> + *     - nr=3  Virtual guest timer
> + *
> + */
> +
> +#define pr_fmt(fmt) "%s: " fmt, __func__

General nit: but I suspect many of the prints in here probably want to be
using the *_ratelimited variants.

> +static void __exception_irq_entry aic_handle_irq(struct pt_regs *regs)
> +{
> +	struct aic_irq_chip *ic = aic_irqc;
> +	u32 event, type, irq;
> +
> +	do {
> +		/*
> +		 * We cannot use a relaxed read here, as DMA needs to be
> +		 * ordered with respect to the IRQ firing.
> +		 */

I think this could be a bit clearer: the readl() doesn't order any DMA
accesses, but instead means that subsequent reads by the CPU (which may
be from a buffer which was DMA'd to) are ordered after the read of the
MMIO register.

> +		event = readl(ic->base + AIC_EVENT);
> +		type = FIELD_GET(AIC_EVENT_TYPE, event);
> +		irq = FIELD_GET(AIC_EVENT_NUM, event);
> +
> +		if (type == AIC_EVENT_TYPE_HW)
> +			handle_domain_irq(aic_irqc->hw_domain, irq, regs);
> +		else if (type == AIC_EVENT_TYPE_IPI && irq == 1)
> +			aic_handle_ipi(regs);
> +		else if (event != 0)
> +			pr_err("Unknown IRQ event %d, %d\n", type, irq);
> +	} while (event);
> +
> +	/*
> +	 * vGIC maintenance interrupts end up here too, so we need to check
> +	 * for them separately. Just report and disable vGIC for now, until
> +	 * we implement this properly.
> +	 */
> +	if ((read_sysreg_s(SYS_ICH_HCR_EL2) & ICH_HCR_EN) &&
> +		read_sysreg_s(SYS_ICH_MISR_EL2) != 0) {
> +		pr_err("vGIC IRQ fired, disabling.\n");
> +		sysreg_clear_set_s(SYS_ICH_HCR_EL2, ICH_HCR_EN, 0);
> +	}

What prevents all these system register accesses being speculated up before
the handler?

> +}
> +
> +static int aic_irq_set_affinity(struct irq_data *d,
> +				const struct cpumask *mask_val, bool force)
> +{
> +	irq_hw_number_t hwirq = irqd_to_hwirq(d);
> +	struct aic_irq_chip *ic = irq_data_get_irq_chip_data(d);
> +	int cpu;
> +
> +	if (hwirq > ic->nr_hw)
> +		return -EINVAL;
> +
> +	if (force)
> +		cpu = cpumask_first(mask_val);
> +	else
> +		cpu = cpumask_any_and(mask_val, cpu_online_mask);
> +
> +	aic_ic_write(ic, AIC_TARGET_CPU + hwirq * 4, BIT(cpu));
> +	irq_data_update_effective_affinity(d, cpumask_of(cpu));
> +
> +	return IRQ_SET_MASK_OK;
> +}
> +
> +static int aic_irq_set_type(struct irq_data *d, unsigned int type)
> +{
> +	return (type == IRQ_TYPE_LEVEL_HIGH) ? 0 : -EINVAL;
> +}
> +
> +static struct irq_chip aic_chip = {
> +	.name = "AIC",
> +	.irq_mask = aic_irq_mask,
> +	.irq_unmask = aic_irq_unmask,

I know these are driven by the higher-level irq chip code, but I'm a bit
confused as to what provides ordering if, e.g. something ends up calling:

	aic_chip.irq_mask(d);
	...
	aic_chip.irq_unmask(d);

I can't see any ISBs in here and they're writing to two different registers,
so can we end up with the IRQ masked after this sequence?

> +/*
> + * IPI irqchip
> + */
> +
> +static void aic_ipi_mask(struct irq_data *d)
> +{
> +	u32 irq_bit = BIT(irqd_to_hwirq(d));
> +	int this_cpu = smp_processor_id();
> +
> +	/* No specific ordering requirements needed here. */
> +	atomic_andnot(irq_bit, &aic_vipi_enable[this_cpu]);
> +}

Why not use a per-cpu variable here instead of an array of atomics? The pcpu
API has things for atomic updates (e.g. or, and, xchg).
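
e.g. (sketch only; since each CPU only ever updates its own enable mask,
the this_cpu ops should be enough):

static DEFINE_PER_CPU(unsigned long, aic_vipi_enable);

	/* in aic_ipi_mask() */
	this_cpu_and(aic_vipi_enable, ~irq_bit);

	/* in aic_ipi_unmask() */
	this_cpu_or(aic_vipi_enable, irq_bit);

with other CPUs reading it via per_cpu(aic_vipi_enable, cpu) in
aic_ipi_send_mask().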

> +static void aic_ipi_unmask(struct irq_data *d)
> +{
> +	struct aic_irq_chip *ic = irq_data_get_irq_chip_data(d);
> +	u32 irq_bit = BIT(irqd_to_hwirq(d));
> +	int this_cpu = smp_processor_id();
> +
> +	/*
> +	 * This must complete before the atomic_read_acquire() below to avoid
> +	 * racing aic_ipi_send_mask(). Use a dummy fetch op with release
> +	 * semantics for this. This is arch-specific: ARMv8 B2.3.3 specifies
> +	 * that writes with Release semantics are Barrier-ordered-before reads
> +	 * with Acquire semantics, even though the Linux arch-independent
> +	 * definition of these atomic ops does not.
> +	 */

I think a more idiomatic (and portable) way to do this would be to use
the relaxed accessors, but with smp_mb__after_atomic() between them. Do you
have a good reason for _not_ doing it like that?
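
i.e. one way to read that suggestion (sketch):

	atomic_or(irq_bit, &aic_vipi_enable[this_cpu]);
	smp_mb__after_atomic();
	if (atomic_read(&aic_vipi_flag[this_cpu]) & irq_bit)
		aic_ic_write(ic, AIC_IPI_SEND, AIC_IPI_SEND_CPU(this_cpu));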

> +	(void)atomic_fetch_or_release(irq_bit, &aic_vipi_enable[this_cpu]);
> +
> +	/*
> +	 * If a pending vIPI was unmasked, raise a HW IPI to ourselves.
> +	 * No barriers needed here since this is a self-IPI.
> +	 */
> +	if (atomic_read_acquire(&aic_vipi_flag[this_cpu]) & irq_bit)

"No barriers needed here" right before an acquire is confusing ;)

> +		aic_ic_write(ic, AIC_IPI_SEND, AIC_IPI_SEND_CPU(this_cpu));
> +}
> +
> +static void aic_ipi_send_mask(struct irq_data *d, const struct cpumask *mask)
> +{
> +	struct aic_irq_chip *ic = irq_data_get_irq_chip_data(d);
> +	u32 irq_bit = BIT(irqd_to_hwirq(d));
> +	u32 send = 0;
> +	int cpu;
> +	unsigned long pending;
> +
> +	for_each_cpu(cpu, mask) {
> +		/*
> +		 * This sequence is the mirror of the one in aic_ipi_unmask();
> +		 * see the comment there. Additionally, release semantics
> +		 * ensure that the vIPI flag set is ordered after any shared
> +		 * memory accesses that precede it. This therefore also pairs
> +		 * with the atomic_fetch_andnot in aic_handle_ipi().
> +		 */
> +		pending = atomic_fetch_or_release(irq_bit, &aic_vipi_flag[cpu]);

(same here)

> +		if (!(pending & irq_bit) && (atomic_read_acquire(&aic_vipi_enable[cpu]) & irq_bit))
> +			send |= AIC_IPI_SEND_CPU(cpu);
> +	}
> +
> +	/*
> +	 * The flag writes must complete before the physical IPI is issued
> +	 * to another CPU. This is implied by the control dependency on
> +	 * the result of atomic_read_acquire() above, which is itself
> +	 * already ordered after the vIPI flag write.
> +	 */
> +	if (send)
> +		aic_ic_write(ic, AIC_IPI_SEND, send);
> +}
> +
> +static struct irq_chip ipi_chip = {
> +	.name = "AIC-IPI",
> +	.irq_mask = aic_ipi_mask,
> +	.irq_unmask = aic_ipi_unmask,
> +	.ipi_send_mask = aic_ipi_send_mask,
> +};
> +
> +/*
> + * IPI IRQ domain
> + */
> +
> +static void aic_handle_ipi(struct pt_regs *regs)
> +{
> +	int this_cpu = smp_processor_id();
> +	int i;
> +	unsigned long enabled, firing;
> +
> +	/*
> +	 * Ack the IPI. We need to order this after the AIC event read, but
> +	 * that is enforced by normal MMIO ordering guarantees.
> +	 */
> +	aic_ic_write(aic_irqc, AIC_IPI_ACK, AIC_IPI_OTHER);
> +
> +	/*
> +	 * The mask read does not need to be ordered. Only we can change
> +	 * our own mask anyway, so no races are possible here, as long as
> +	 * we are properly in the interrupt handler (which is covered by
> +	 * the barrier that is part of the top-level AIC handler's readl()).
> +	 */
> +	enabled = atomic_read(&aic_vipi_enable[this_cpu]);
> +
> +	/*
> +	 * Clear the IPIs we are about to handle. This pairs with the
> +	 * atomic_fetch_or_release() in aic_ipi_send_mask(), and needs to be
> +	 * ordered after the aic_ic_write() above (to avoid dropping vIPIs) and
> +	 * before IPI handling code (to avoid races handling vIPIs before they
> +	 * are signaled). The former is taken care of by the release semantics
> +	 * of the write portion, while the latter is taken care of by the
> +	 * acquire semantics of the read portion.
> +	 */
> +	firing = atomic_fetch_andnot(enabled, &aic_vipi_flag[this_cpu]) & enabled;

Does this also need to be ordered after the Ack? For example, if we have
something like:

CPU 0						CPU 1
						<some other IPI>
aic_ipi_send_mask()
						atomic_fetch_andnot(flag)
	atomic_fetch_or_release(flag)
	aic_ic_write(AIC_IPI_SEND)
						aic_ic_write(AIC_IPI_ACK)

sorry if it's a stupid question, I'm just not sure about the cases in which
the hardware will pend things for you.

Will
Hector Martin March 25, 2021, 2:07 p.m. UTC | #54
On 25/03/2021 04.09, Arnd Bergmann wrote:
> On Wed, Mar 24, 2021 at 7:12 PM Will Deacon <will@kernel.org> wrote:
>>
>>> +/*
>>> + * ioremap_np needs an explicit architecture implementation, as it
>>> + * requests stronger semantics than regular ioremap(). Portable drivers
>>> + * should instead use one of the higher-level abstractions, like
>>> + * devm_ioremap_resource(), to choose the correct variant for any given
>>> + * device and bus. Portable drivers with a good reason to want non-posted
>>> + * write semantics should always provide an ioremap() fallback in case
>>> + * ioremap_np() is not available.
>>> + */
>>> +#ifndef ioremap_np
>>> +#define ioremap_np ioremap_np
>>> +static inline void __iomem *ioremap_np(phys_addr_t offset, size_t size)
>>> +{
>>> +     return NULL;
>>> +}
>>> +#endif
>>
>> Can we implement the generic pci_remap_cfgspace() in terms of ioremap_np()
>> if it is supported by the architecture? That way, we could avoid defining
>> both on arm64.
> 
> Good idea. It needs a fallback in case the ioremap_np() fails on most
> architectures, but that sounds easy enough.
> 
> Since pci_remap_cfgspace() only has custom implementations, it sounds like
> we can actually make the generic implementation unconditional in the end,
> but that requires adding ioremap_np() on 32-bit as well, and I would keep
> that separate from this series.

Sounds good; I'm adding a patch to adjust the generic implementation and 
remove the arm64 one in v4, and we can then complete the cleanup for 
other arches later.
Will Deacon March 25, 2021, 2:49 p.m. UTC | #55
On Thu, Mar 25, 2021 at 11:07:40PM +0900, Hector Martin wrote:
> On 25/03/2021 04.09, Arnd Bergmann wrote:
> > On Wed, Mar 24, 2021 at 7:12 PM Will Deacon <will@kernel.org> wrote:
> > > 
> > > > +/*
> > > > + * ioremap_np needs an explicit architecture implementation, as it
> > > > + * requests stronger semantics than regular ioremap(). Portable drivers
> > > > + * should instead use one of the higher-level abstractions, like
> > > > + * devm_ioremap_resource(), to choose the correct variant for any given
> > > > + * device and bus. Portable drivers with a good reason to want non-posted
> > > > + * write semantics should always provide an ioremap() fallback in case
> > > > + * ioremap_np() is not available.
> > > > + */
> > > > +#ifndef ioremap_np
> > > > +#define ioremap_np ioremap_np
> > > > +static inline void __iomem *ioremap_np(phys_addr_t offset, size_t size)
> > > > +{
> > > > +     return NULL;
> > > > +}
> > > > +#endif
> > > 
> > > Can we implement the generic pci_remap_cfgspace() in terms of ioremap_np()
> > > if it is supported by the architecture? That way, we could avoid defining
> > > both on arm64.
> > 
> > Good idea. It needs a fallback in case the ioremap_np() fails on most
> > architectures, but that sounds easy enough.
> > 
> > Since pci_remap_cfgspace() only has custom implementations, it sounds like
> > we can actually make the generic implementation unconditional in the end,
> > but that requires adding ioremap_np() on 32-bit as well, and I would keep
> > that separate from this series.
> 
> Sounds good; I'm adding a patch to adjust the generic implementation and
> remove the arm64 one in v4, and we can then complete the cleanup for other
> arches later.

Cheers. Don't forget to update the comment in the generic code which
complains about the lack of an ioremap() API for non-posted writes ;)

Will
Hector Martin March 26, 2021, 6:23 a.m. UTC | #56
On 25/03/2021 04.04, Will Deacon wrote:
> On Wed, Mar 24, 2021 at 06:59:21PM +0000, Mark Rutland wrote:
>> So far we've kept arch/arm64/ largely devoid of IMP-DEF bits, and it
>> seems a shame to add something with the sole purpose of collating that,
>> especially given arch code shouldn't need to touch these if FW and
>> bootloader have done their jobs right.
>>
>> Can we put the definitions in the relevant drivers? That would sidestep
>> any pain with MAINTAINERS, too.
> 
> If we can genuinely ignore these in arch code, then sure. I just don't know
> how long that is going to be the case, and ending up in a situation where
> these are scattered randomly throughout the tree sounds horrible to me.

I thought we would need some in KVM code, but given the direction Marc's 
series ended up in, it seems we won't. So I'm happy keeping these in the 
respective drivers; if this ends up being messy in the future it 
shouldn't be a big deal to refactor it all into one file again.
Hector Martin March 26, 2021, 7:54 a.m. UTC | #57
On 25/03/2021 05.00, Marc Zyngier wrote:
> I've come up with this on top of the original patch, spitting a
> warning when the right conditions are met. It's pretty ugly, but hey,
> so is the HW this runs on.

[...]

Folded into v4 and tested; works fine with `kvm-arm.mode=nvhe`, throwing 
the ugly WARN.
Hector Martin March 26, 2021, 7:57 a.m. UTC | #58
On 08/03/2021 22.31, Marc Zyngier wrote:
>> +	if ((read_sysreg_s(SYS_ICH_HCR_EL2) & ICH_HCR_EN) &&
>> +		read_sysreg_s(SYS_ICH_MISR_EL2) != 0) {
>> +		pr_err("vGIC IRQ fired, disabling.\n");
> 
> Please add a _ratelimited here. Whilst debugging KVM on this machine,
> I ended up with this firing at such a rate that it was impossible to
> do anything useful. Ratelimiting it allowed me to pinpoint the
> problem.

Ouch. Done for v4.

>> +static void aic_fiq_eoi(struct irq_data *d)
>> +{
>> +	/* We mask to ack (where we can), so we need to unmask at EOI. */
>> +	if (!irqd_irq_disabled(d) && !irqd_irq_masked(d))
> 
> Ah, be careful here: irqd_irq_masked() doesn't do what you think it
> does for per-CPU interrupts. It's been on my list to fix for the rVIC
> implementation, but I never got around to doing it, and all decent ICs
> hide this from SW by having a HW-managed mask, similar to what is on
> the IRQ side.
> 
> I can see two possibilities:
> 
> - you can track the masked state directly and use that instead of
>    these predicates
> 
> - you can just drop the masking altogether as this is only useful to a
>    hosted hypervisor (KVM), which will have to do its own masking
>    behind the scenes anyway
> 

Since you're using the masking in KVM after all, I'm tracking the mask 
state in a percpu variable now. Also folded in your two minor bugfixes 
from the KVM series. Cheers!
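
Something like this for the bookkeeping (just a sketch, not the actual
v4 code; the IMP-DEF sysreg accesses that do the real masking are
elided):

static DEFINE_PER_CPU(unsigned long, aic_fiq_unmasked);

static void aic_fiq_mask(struct irq_data *d)
{
	__this_cpu_and(aic_fiq_unmasked, ~BIT(irqd_to_hwirq(d)));
	/* ... mask the source in the relevant sysreg, where one exists ... */
}

static void aic_fiq_unmask(struct irq_data *d)
{
	__this_cpu_or(aic_fiq_unmasked, BIT(irqd_to_hwirq(d)));
	/* ... unmask the source in the relevant sysreg, where one exists ... */
}

static void aic_fiq_eoi(struct irq_data *d)
{
	/* We mask to ack (where we can), so we need to unmask at EOI. */
	if (__this_cpu_read(aic_fiq_unmasked) & BIT(irqd_to_hwirq(d)))
		aic_fiq_unmask(d);
}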
Hector Martin March 26, 2021, 8:58 a.m. UTC | #59
Hi Will,

No worries, I was busy until a couple days ago anyway.

On 25/03/2021 04.57, Will Deacon wrote:
>> +#define pr_fmt(fmt) "%s: " fmt, __func__
> 
> General nit: but I suspect many of the prints in here probably want to be
> using the *_ratelimited variants.

You're right, everything complaining about spurious/unsupported stuff is 
probably safer rate-limited. I also made them all pr_err_ratelimited, 
since nothing should be trying to use that hardware before we get around 
to supporting it in this driver.

>> +static void __exception_irq_entry aic_handle_irq(struct pt_regs *regs)
>> +{
>> +	struct aic_irq_chip *ic = aic_irqc;
>> +	u32 event, type, irq;
>> +
>> +	do {
>> +		/*
>> +		 * We cannot use a relaxed read here, as DMA needs to be
>> +		 * ordered with respect to the IRQ firing.
>> +		 */
> 
> I think this could be a bit clearer: the readl() doesn't order any DMA
> accesses, but instead means that subsequent reads by the CPU (which may
> be from a buffer which was DMA'd to) are ordered after the read of the
> MMIO register.

Good point, reworded it for v4:

/*
  * We cannot use a relaxed read here, as reads from DMA buffers
  * need to be ordered after the IRQ fires.
  */

> 
>> +		event = readl(ic->base + AIC_EVENT);
>> +		type = FIELD_GET(AIC_EVENT_TYPE, event);
>> +		irq = FIELD_GET(AIC_EVENT_NUM, event);
>> +
>> +		if (type == AIC_EVENT_TYPE_HW)
>> +			handle_domain_irq(aic_irqc->hw_domain, irq, regs);
>> +		else if (type == AIC_EVENT_TYPE_IPI && irq == 1)
>> +			aic_handle_ipi(regs);
>> +		else if (event != 0)
>> +			pr_err("Unknown IRQ event %d, %d\n", type, irq);
>> +	} while (event);
>> +
>> +	/*
>> +	 * vGIC maintenance interrupts end up here too, so we need to check
>> +	 * for them separately. Just report and disable vGIC for now, until
>> +	 * we implement this properly.
>> +	 */
>> +	if ((read_sysreg_s(SYS_ICH_HCR_EL2) & ICH_HCR_EN) &&
>> +		read_sysreg_s(SYS_ICH_MISR_EL2) != 0) {
>> +		pr_err("vGIC IRQ fired, disabling.\n");
>> +		sysreg_clear_set_s(SYS_ICH_HCR_EL2, ICH_HCR_EN, 0);
>> +	}
> 
> What prevents all these system register accesses being speculated up before
> the handler?

Nothing, but that's not a problem, is it? If the condition is met, it 
means the vGIC IRQ *is* firing and needs clearing. We don't particularly 
care if this happens before, after, or during the rest of the IRQ handling.

I changed the message to this, because we actually should never hit this 
path with correctly-working KVM code (it takes care of it before this 
handler runs):

pr_err_ratelimited("vGIC IRQ fired and not handled by KVM, disabling.\n");

>> +static struct irq_chip aic_chip = {
>> +	.name = "AIC",
>> +	.irq_mask = aic_irq_mask,
>> +	.irq_unmask = aic_irq_unmask,
> 
> I know these are driven by the higher-level irq chip code, but I'm a bit
> confused as to what provides ordering if, e.g. something ends up calling:
> 
> 	aic_chip.irq_mask(d);
> 	...
> 	aic_chip.irq_unmask(d);
> 
> I can't see any ISBs in here and they're writing to two different registers,
> so can we end up with the IRQ masked after this sequence?

Wait, aren't MMIO writes to the same peripheral using device-nGnRnE 
memory modes always ordered with respect to each other? I thought the 
_relaxed versions were only trouble when mixed with memory/DMA buffers, 
and MMIO for any given peripheral always takes effect in program order.

>> +static void aic_ipi_mask(struct irq_data *d)
>> +{
>> +	u32 irq_bit = BIT(irqd_to_hwirq(d));
>> +	int this_cpu = smp_processor_id();
>> +
>> +	/* No specific ordering requirements needed here. */
>> +	atomic_andnot(irq_bit, &aic_vipi_enable[this_cpu]);
>> +}
> 
> Why not use a per-cpu variable here instead of an array of atomics? The pcpu
> API has things for atomic updates (e.g. or, and, xchg).

One CPU still needs to be able to mutate the flags of another CPU to 
fire an IPI; AIUI the per-cpu ops are *not* atomic for concurrent access 
by multiple CPUs, and in fact there is no API for that, only for "this CPU".

>> +static void aic_ipi_unmask(struct irq_data *d)
>> +{
>> +	struct aic_irq_chip *ic = irq_data_get_irq_chip_data(d);
>> +	u32 irq_bit = BIT(irqd_to_hwirq(d));
>> +	int this_cpu = smp_processor_id();
>> +
>> +	/*
>> +	 * This must complete before the atomic_read_acquire() below to avoid
>> +	 * racing aic_ipi_send_mask(). Use a dummy fetch op with release
>> +	 * semantics for this. This is arch-specific: ARMv8 B2.3.3 specifies
>> +	 * that writes with Release semantics are Barrier-ordered-before reads
>> +	 * with Acquire semantics, even though the Linux arch-independent
>> +	 * definition of these atomic ops does not.
>> +	 */
> 
> I think a more idiomatic (and portable) way to do this would be to use
> the relaxed accessors, but with smp_mb__after_atomic() between them. Do you
> have a good reason for _not_ doing it like that?

Not particularly, other than symmetry with the case below.

>> +		/*
>> +		 * This sequence is the mirror of the one in aic_ipi_unmask();
>> +		 * see the comment there. Additionally, release semantics
>> +		 * ensure that the vIPI flag set is ordered after any shared
>> +		 * memory accesses that precede it. This therefore also pairs
>> +		 * with the atomic_fetch_andnot in aic_handle_ipi().
>> +		 */
>> +		pending = atomic_fetch_or_release(irq_bit, &aic_vipi_flag[cpu]);

We do need the return data here, and the release semantics (or another 
barrier before it). But the read below can be made relaxed and a barrier 
used instead, and then the same pattern as above except with a plain 
atomic_or().

>> +		if (!(pending & irq_bit) && (atomic_read_acquire(&aic_vipi_enable[cpu]) & irq_bit))
>> +			send |= AIC_IPI_SEND_CPU(cpu);
>> +	}

[...]

>> +	/*
>> +	 * Clear the IPIs we are about to handle. This pairs with the
>> +	 * atomic_fetch_or_release() in aic_ipi_send_mask(), and needs to be
>> +	 * ordered after the aic_ic_write() above (to avoid dropping vIPIs) and
>> +	 * before IPI handling code (to avoid races handling vIPIs before they
>> +	 * are signaled). The former is taken care of by the release semantics
>> +	 * of the write portion, while the latter is taken care of by the
>> +	 * acquire semantics of the read portion.
>> +	 */
>> +	firing = atomic_fetch_andnot(enabled, &aic_vipi_flag[this_cpu]) & enabled;
> 
> Does this also need to be ordered after the Ack? For example, if we have
> something like:
> 
> CPU 0						CPU 1
> 						<some other IPI>
> aic_ipi_send_mask()
> 						atomic_fetch_andnot(flag)
> 	atomic_fetch_or_release(flag)
> 	aic_ic_write(AIC_IPI_SEND)
> 						aic_ic_write(AIC_IPI_ACK)
> 
> sorry if it's a stupid question, I'm just not sure about the cases in which
> the hardware will pend things for you.

It is ordered, right? As the comment says, it "needs to be ordered after 
the aic_ic_write() above". atomic_fetch_andnot() is *supposed* to be 
fully ordered and that should include against the writel_relaxed() on 
AIC_IPI_FLAG. On ARM it turns out it's not quite fully ordered, but the 
acquire semantics of the read half are sufficient for this case, as they 
guarantee the flags are always read after the FIQ has been ACKed.
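
To make the sequence concrete (abridged and illustrative; AIC_IPI_OTHER
is the "ack IPIs from other CPUs" value, the rest as in the patch):

	/* Ack the HW IPI; MMIO ordering covers the event read that got us here. */
	aic_ic_write(ic, AIC_IPI_ACK, AIC_IPI_OTHER);

	enabled = atomic_read(&aic_vipi_enable[this_cpu]);

	/*
	 * Clear the vIPIs we are about to handle. The acquire half of this
	 * RmW orders the flag read after the Ack write above, and it pairs
	 * with the atomic_fetch_or_release() in aic_ipi_send_mask().
	 */
	firing = atomic_fetch_andnot(enabled, &aic_vipi_flag[this_cpu]) & enabled;

	/* ... then dispatch each set bit in 'firing' as a vIPI ... */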

Cheeers,
Hector Martin March 26, 2021, 1:40 p.m. UTC | #60
On 06/03/2021 00.05, Andy Shevchenko wrote:
>> +#define pr_fmt(fmt) "%s: " fmt, __func__
> 
> This is not needed, really, if you have unique / distinguishable
> messages in the first place.
> Rather people include module names, which may be useful.

Makes sense, I'll switch to KBUILD_MODNAME.

>> +#define MASK_BIT(x)            BIT((x) & 0x1f)
> 
> GENMASK(4,0)

It's not really a register bitmask, but rather extracting the low bits 
of an index... but sure, GENMASK also expresses that. Changed.
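
i.e. something like:

#define MASK_BIT(x)	BIT((x) & GENMASK(4, 0))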

>> +static atomic_t aic_vipi_flag[AIC_MAX_CPUS];
>> +static atomic_t aic_vipi_enable[AIC_MAX_CPUS];
> 
> Isn't it easier to handle these when they are full width, i.e. 32
> items per the array?

I don't think so, it doesn't really buy us anything. It's just a maximum 
beyond which the driver doesn't work in its current state anyway (if the 
number were much larger it'd make sense to dynamically allocate these, 
but not at this point).

>> +static int aic_irq_set_affinity(struct irq_data *d,
>> +                               const struct cpumask *mask_val, bool force)
>> +{
>> +       irq_hw_number_t hwirq = irqd_to_hwirq(d);
>> +       struct aic_irq_chip *ic = irq_data_get_irq_chip_data(d);
>> +       int cpu;
>> +
>> +       if (hwirq > ic->nr_hw)
> 
> >= ?

Good catch, but this is actually obsolete. Higher IRQs go into the FIQ 
irqchip, so this should never happen (it's a leftover from when they 
were a single irqchip). I'll remove it.

Ack on the other comments, thanks!
Will Deacon March 29, 2021, 12:04 p.m. UTC | #61
Hi Hector,

On Fri, Mar 26, 2021 at 05:58:15PM +0900, Hector Martin wrote:
> On 25/03/2021 04.57, Will Deacon wrote:
> > > +		event = readl(ic->base + AIC_EVENT);
> > > +		type = FIELD_GET(AIC_EVENT_TYPE, event);
> > > +		irq = FIELD_GET(AIC_EVENT_NUM, event);
> > > +
> > > +		if (type == AIC_EVENT_TYPE_HW)
> > > +			handle_domain_irq(aic_irqc->hw_domain, irq, regs);
> > > +		else if (type == AIC_EVENT_TYPE_IPI && irq == 1)
> > > +			aic_handle_ipi(regs);
> > > +		else if (event != 0)
> > > +			pr_err("Unknown IRQ event %d, %d\n", type, irq);
> > > +	} while (event);
> > > +
> > > +	/*
> > > +	 * vGIC maintenance interrupts end up here too, so we need to check
> > > +	 * for them separately. Just report and disable vGIC for now, until
> > > +	 * we implement this properly.
> > > +	 */
> > > +	if ((read_sysreg_s(SYS_ICH_HCR_EL2) & ICH_HCR_EN) &&
> > > +		read_sysreg_s(SYS_ICH_MISR_EL2) != 0) {
> > > +		pr_err("vGIC IRQ fired, disabling.\n");
> > > +		sysreg_clear_set_s(SYS_ICH_HCR_EL2, ICH_HCR_EN, 0);
> > > +	}
> > 
> > What prevents all these system register accesses being speculated up before
> > the handler?
> 
> Nothing, but that's not a problem, is it? If the condition is met, it means
> the vGIC IRQ *is* firing and needs clearing. We don't particularly care if
> this happens before, after, or during the rest of the IRQ handling.
> 
> I changed the message to this, because we actually should never hit this
> path with correctly-working KVM code (it takes care of it before this
> handler runs):
> 
> pr_err_ratelimited("vGIC IRQ fired and not handled by KVM, disabling.\n");

Looks good.

> > > +static struct irq_chip aic_chip = {
> > > +	.name = "AIC",
> > > +	.irq_mask = aic_irq_mask,
> > > +	.irq_unmask = aic_irq_unmask,
> > 
> > I know these are driven by the higher-level irq chip code, but I'm a bit
> > confused as to what provides ordering if, e.g. something ends up calling:
> > 
> > 	aic_chip.irq_mask(d);
> > 	...
> > 	aic_chip.irq_unmask(d);
> > 
> > I can't see any ISBs in here and they're writing to two different registers,
> > so can we end up with the IRQ masked after this sequence?
> 
> Wait, aren't MMIO writes to the same peripheral using device-nGnRnE memory
> modes always ordered with respect to each other? I thought the _relaxed
> versions were only trouble when mixed with memory/DMA buffers, and MMIO for
> any given peripheral always takes effect in program order.

Sorry, this was my mistake -- I seem to have mixed up the MMIO parts with
the system register parts. In this case, aic_irq_[un]mask() are MMIO writes,
so things work out. It's the FIQ mask/unmask code that needs the ISBs.

> > > +static void aic_ipi_mask(struct irq_data *d)
> > > +{
> > > +	u32 irq_bit = BIT(irqd_to_hwirq(d));
> > > +	int this_cpu = smp_processor_id();
> > > +
> > > +	/* No specific ordering requirements needed here. */
> > > +	atomic_andnot(irq_bit, &aic_vipi_enable[this_cpu]);
> > > +}
> > 
> > Why not use a per-cpu variable here instead of an array of atomics? The pcpu
> > API has things for atomic updates (e.g. or, and, xchg).
> 
> One CPU still needs to be able to mutate the flags of another CPU to fire an
> IPI; AIUI the per-cpu ops are *not* atomic for concurrent access by multiple
> CPUs, and in fact there is no API for that, only for "this CPU".

Huh, I really thought we had an API for that, but you're right. Oh well! But
I'd still suggest a per-cpu atomic_t in that case, rather than the array.

> > > +static void aic_ipi_unmask(struct irq_data *d)
> > > +{
> > > +	struct aic_irq_chip *ic = irq_data_get_irq_chip_data(d);
> > > +	u32 irq_bit = BIT(irqd_to_hwirq(d));
> > > +	int this_cpu = smp_processor_id();
> > > +
> > > +	/*
> > > +	 * This must complete before the atomic_read_acquire() below to avoid
> > > +	 * racing aic_ipi_send_mask(). Use a dummy fetch op with release
> > > +	 * semantics for this. This is arch-specific: ARMv8 B2.3.3 specifies
> > > +	 * that writes with Release semantics are Barrier-ordered-before reads
> > > +	 * with Acquire semantics, even though the Linux arch-independent
> > > +	 * definition of these atomic ops does not.
> > > +	 */
> > 
> > I think a more idiomatic (and portable) way to do this would be to use
> > the relaxed accessors, but with smp_mb__after_atomic() between them. Do you
> > have a good reason for _not_ doing it like that?
> 
> Not particularly, other than symmetry with the case below.

I think it would be better not to rely on arm64-specific ordering unless
there's a good reason to.

> > > +		/*
> > > +		 * This sequence is the mirror of the one in aic_ipi_unmask();
> > > +		 * see the comment there. Additionally, release semantics
> > > +		 * ensure that the vIPI flag set is ordered after any shared
> > > +		 * memory accesses that precede it. This therefore also pairs
> > > +		 * with the atomic_fetch_andnot in aic_handle_ipi().
> > > +		 */
> > > +		pending = atomic_fetch_or_release(irq_bit, &aic_vipi_flag[cpu]);
> 
> We do need the return data here, and the release semantics (or another
> barrier before it). But the read below can be made relaxed and a barrier
> used instead, and then the same pattern as above except with a plain
> atomic_or().

Yes, I think using atomic_fetch_or() followed by atomic_read() would be
best (obviously with the relevant comments!)

> 
> > > +		if (!(pending & irq_bit) && (atomic_read_acquire(&aic_vipi_enable[cpu]) & irq_bit))
> > > +			send |= AIC_IPI_SEND_CPU(cpu);
> > > +	}
> 
> [...]
> 
> > > +	/*
> > > +	 * Clear the IPIs we are about to handle. This pairs with the
> > > +	 * atomic_fetch_or_release() in aic_ipi_send_mask(), and needs to be
> > > +	 * ordered after the aic_ic_write() above (to avoid dropping vIPIs) and
> > > +	 * before IPI handling code (to avoid races handling vIPIs before they
> > > +	 * are signaled). The former is taken care of by the release semantics
> > > +	 * of the write portion, while the latter is taken care of by the
> > > +	 * acquire semantics of the read portion.
> > > +	 */
> > > +	firing = atomic_fetch_andnot(enabled, &aic_vipi_flag[this_cpu]) & enabled;
> > 
> > Does this also need to be ordered after the Ack? For example, if we have
> > something like:
> > 
> > CPU 0						CPU 1
> > 						<some other IPI>
> > aic_ipi_send_mask()
> > 						atomic_fetch_andnot(flag)
> > 	atomic_fetch_or_release(flag)
> > 	aic_ic_write(AIC_IPI_SEND)
> > 						aic_ic_write(AIC_IPI_ACK)
> > 
> > sorry if it's a stupid question, I'm just not sure about the cases in which
> > the hardware will pend things for you.
> 
> It is ordered, right? As the comment says, it "needs to be ordered after the
> aic_ic_write() above". atomic_fetch_andnot() is *supposed* to be fully
> ordered and that should include against the writel_relaxed() on
> AIC_IPI_FLAG. On ARM it turns out it's not quite fully ordered, but the
> acquire semantics of the read half are sufficient for this case, as they
> guarantee the flags are always read after the FIQ has been ACKed.

Sorry, I missed that the answer to my question was already written in the
comment. However, I'm still a bit unsure about whether the memory barriers
give you what you need here. The barrier in atomic_fetch_andnot() will
order the previous aic_ic_write(AIC_IPI_ACK) for the purposes of other
CPUs reading those locations, but it doesn't say anything about when the
interrupt controller actually changes state after the Ack.

Given that the AIC is mapped Device-nGnRnE, the Arm ARM offers:

  | Additionally, for Device-nGnRnE memory, a read or write of a Location
  | in a Memory-mapped peripheral that exhibits side-effects is complete
  | only when the read or write both:
  |
  | * Can begin to affect the state of the Memory-mapped peripheral.
  | * Can trigger all associated side-effects, whether they affect other
  |   peripheral devices, PEs, or memory.

so without AIC documentation I can't tell whether completion of the Ack write
just begins the process of an Ack (in which case we might need something like
a read-back), or whether the write response back from the AIC only occurs once
the Ack has taken effect. Any ideas?

> Cheeers,

No prooblem :)

Will
Hector Martin April 1, 2021, 1:16 p.m. UTC | #62
Hi Will,

On 29/03/2021 21.04, Will Deacon wrote:
>> One CPU still needs to be able to mutate the flags of another CPU to fire an
>> IPI; AIUI the per-cpu ops are *not* atomic for concurrent access by multiple
>> CPUs, and in fact there is no API for that, only for "this CPU".
> 
> Huh, I really thought we had an API for that, but you're right. Oh well! But
> I'd still suggest a per-cpu atomic_t in that case, rather than the array.

Yeah, after digging into the per-cpu stuff earlier and understanding how 
it works, I agree that a per-cpu atomic makes sense here. Switched it to 
that (which simplified out a bunch of smp_processor_id() calls too). Thanks!
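
In other words (sketch, not the final v4 code):

static DEFINE_PER_CPU(atomic_t, aic_vipi_flag);
static DEFINE_PER_CPU(atomic_t, aic_vipi_enable);

static void aic_ipi_mask(struct irq_data *d)
{
	u32 irq_bit = BIT(irqd_to_hwirq(d));

	/* No specific ordering requirements needed here. */
	atomic_andnot(irq_bit, this_cpu_ptr(&aic_vipi_enable));
}

Other CPUs can still poke a given CPU's flags via
per_cpu_ptr(&aic_vipi_flag, cpu), so the cross-CPU send path keeps
working.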

>>> I think a more idiomatic (and portable) way to do this would be to use
>>> the relaxed accessors, but with smp_mb__after_atomic() between them. Do you
>>> have a good reason for _not_ doing it like that?
>>
>> Not particularly, other than symmetry with the case below.
> 
> I think it would be better not to rely on arm64-specific ordering unless
> there's a good reason to.

Sounds reasonable, I'll switch to the barrier version.
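
For the unmask path that means roughly (sketch on top of the per-cpu
atomic_t conversion; the self-IPI details stay as they are):

static void aic_ipi_unmask(struct irq_data *d)
{
	u32 irq_bit = BIT(irqd_to_hwirq(d));

	atomic_or(irq_bit, this_cpu_ptr(&aic_vipi_enable));

	/*
	 * The atomic_or() above must complete before the atomic_read()
	 * below to avoid racing aic_ipi_send_mask(); a full barrier
	 * replaces the release/acquire pairing from v3.
	 */
	smp_mb__after_atomic();

	if (atomic_read(this_cpu_ptr(&aic_vipi_flag)) & irq_bit) {
		/* ... raise a self-IPI for the already-pending vIPI ... */
	}
}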

>> We do need the return data here, and the release semantics (or another
>> barrier before it). But the read below can be made relaxed and a barrier
>> used instead, and then the same patern above except with a plain
>> atomic_or().
> 
> Yes, I think using atomic_fetch_or() followed by atomic_read() would be
> best (obviously with the relevant comments!)

atomic_fetch_or_release is sufficient here (atomic_fetch_or is stronger; 
atomic_fetch_or_relaxed would not be strong enough as this needs to be 
ordered after any writes prior to sending the IPI; in this case release 
semantics also make logical sense).
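
So the send loop becomes roughly (sketch on top of the per-cpu
conversion; 'send' accumulates the AIC_IPI_SEND bits as before):

	for_each_cpu(cpu, mask) {
		/*
		 * Release orders the vIPI flag set after any shared memory
		 * writes that precede the IPI; pairs with the
		 * atomic_fetch_andnot() in aic_handle_ipi().
		 */
		pending = atomic_fetch_or_release(irq_bit,
						  per_cpu_ptr(&aic_vipi_flag, cpu));

		/* Order the flag set before the enable read below. */
		smp_mb__after_atomic();

		if (!(pending & irq_bit) &&
		    (atomic_read(per_cpu_ptr(&aic_vipi_enable, cpu)) & irq_bit))
			send |= AIC_IPI_SEND_CPU(cpu);
	}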

>> It is ordered, right? As the comment says, it "needs to be ordered after the
>> aic_ic_write() above". atomic_fetch_andnot() is *supposed* to be fully
>> ordered and that should include against the writel_relaxed() on
>> AIC_IPI_FLAG. On ARM it turns out it's not quite fully ordered, but the
>> acquire semantics of the read half are sufficient for this case, as they
>> guarantee the flags are always read after the FIQ has been ACKed.
> 
> Sorry, I missed that the answer to my question was already written in the
> comment. However, I'm still a bit unsure about whether the memory barriers
> give you what you need here. The barrier in atomic_fetch_andnot() will
> order the previous aic_ic_write(AIC_IPI_ACK) for the purposes of other
> CPUs reading those locations, but it doesn't say anything about when the
> interrupt controller actually changes state after the Ack.
> 
> Given that the AIC is mapped Device-nGnRnE, the Arm ARM offers:
> 
>    | Additionally, for Device-nGnRnE memory, a read or write of a Location
>    | in a Memory-mapped peripheral that exhibits side-effects is complete
>    | only when the read or write both:
>    |
>    | * Can begin to affect the state of the Memory-mapped peripheral.
>    | * Can trigger all associated side-effects, whether they affect other
>    |   peripheral devices, PEs, or memory.
> 
> so without AIC documentation I can't tell whether completion of the Ack write
> just begins the process of an Ack (in which case we might need something like
> a read-back), or whether the write response back from the AIC only occurs once
> the Ack has taken effect. Any ideas?

Ahh, you're talking about latency within AIC itself... I obviously don't 
have an authoritative answer to this, though the hardware designer in me 
wants to say this really ought to be single-cycle type stuff that isn't 
internally pipelined in a way that would create races.

I tried to set up an SMP test case for the atomic-to-AIC sequence in 
m1n1, but unfortunately I couldn't hit the race window in deliberately 
racy code (i.e. ack after clearing flags) without widening it even 
further with at least one dummy load in between, and of course I didn't 
experience any races with the proper code either.

What I can say is that a simple set IPI; ack IPI (in adjacent str 
instructions) sequence always yields a cleared IPI, and the converse 
always yields a set IPI. So if there is latency to the operations it 
seems it would at least be the same for sets and acks and would imply 
readbacks block, which should still yield equivalently correct results. 
But of course this is a single-CPU test, so it is not fully 
representative of what could happen in an SMP scenario.

At this point all I can say is I'm inclined to shrug and say we have no 
evidence of this being something that can happen, and it shouldn't in 
sane hardware, and hope for the best :-)
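
If evidence ever does turn up, a read-back after the Ack would be a
cheap belt-and-braces fix, something like (illustrative only; assumes
AIC_IPI_FLAG reads back sanely and aic_ic_read() is the driver's
readl helper):

	aic_ic_write(ic, AIC_IPI_ACK, AIC_IPI_OTHER);
	/*
	 * A read over the same Device-nGnRnE mapping cannot complete until
	 * the Ack write has reached the AIC, so the vIPI flag clear cannot
	 * overtake the Ack taking effect.
	 */
	aic_ic_read(ic, AIC_IPI_FLAG);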

Thanks,