[0/15] Enable CXL PCIe port protocol error handling and logging

Message ID 20241008221657.1130181-1-terry.bowman@amd.com

Message

Bowman, Terry Oct. 8, 2024, 10:16 p.m. UTC
This is a continuation of the CXL port error handling RFC from earlier.[1]
The RFC resulted in the decision to add CXL PCIe port error handling to
the existing RCH downstream port handling. This patchset adds the CXL PCIe
port handling and logging.

The first 7 patches update the existing AER service driver to support CXL
PCIe port protocol error handling and reporting. This includes AER service
driver changes for adding correctable and uncorrectable error support, CXL
specific recovery handling, and addition of CXL driver callback handlers.

The following 8 patches address CXL driver support for CXL PCIe port
protocol errors. This includes the following changes to the CXL drivers:
mapping CXL port and downstream port RAS registers, interface updates for
common RCH and VH, adding port specific error handlers, and protocol error
logging.

[1] - https://lore.kernel.org/linux-cxl/20240617200411.1426554-1-terry.bowman@amd.com/

Testing:

Below are test results for this patchset. This is using Qemu with a root
port (0c:00.0), upstream switch port (0d:00.0), and downstream switch port
(0e:00.0).

This was tested using aer-inject updated to support CE and UCE internal
error injection. CXL RAS was set using a test patch (not upstreamed).

    Root port UCE:
    root@tbowman-cxl:~/aer-inject# ./root-uce-inject.sh
    [   27.318920] pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0c:00.0
    [   27.320164] pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0c:00.0
    [   27.321518] pcieport 0000:0c:00.0: PCIe Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
    [   27.322483] pcieport 0000:0c:00.0:   device [8086:7075] error status/mask=00400000/02000000
    [   27.323243] pcieport 0000:0c:00.0:    [22] UncorrIntErr
    [   27.325584] aer_event: 0000:0c:00.0 PCIe Bus Error: severity=Fatal, Uncorrectable Internal Error, TLP Header=Not available
    [   27.325584]
    [   27.327171] cxl_port_aer_uncorrectable_error: device=0000:0c:00.0 host=pci0000:0c status: 'Memory Address Parity Error'
    first_error: 'Memory Address Parity Error'
    [   27.333277] Kernel panic - not syncing: CXL cachemem error. Invoking panic
    [   27.333872] CPU: 12 UID: 0 PID: 122 Comm: irq/24-aerdrv Not tainted 6.11.0-rc1-port-error-g1fb9097c3728 #3857
    [   27.334761] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
    [   27.335716] Call Trace:
    [   27.335985]  <TASK>
    [   27.336226]  panic+0x2ed/0x320
    [   27.336547]  ? __pfx_cxl_report_normal_detected+0x10/0x10
    [   27.337037]  ? __pfx_aer_root_reset+0x10/0x10
    [   27.337453]  cxl_do_recovery+0x304/0x310
    [   27.337833]  aer_isr+0x3fd/0x700
    [   27.338154]  ? __pfx_irq_thread_fn+0x10/0x10
    [   27.338572]  irq_thread_fn+0x1f/0x60
    [   27.338923]  irq_thread+0x102/0x1b0
    [   27.339267]  ? __pfx_irq_thread_dtor+0x10/0x10
    [   27.339683]  ? __pfx_irq_thread+0x10/0x10
    [   27.340059]  kthread+0xcd/0x100
    [   27.340387]  ? __pfx_kthread+0x10/0x10
    [   27.340748]  ret_from_fork+0x2f/0x50
    [   27.341100]  ? __pfx_kthread+0x10/0x10
    [   27.341466]  ret_from_fork_asm+0x1a/0x30
    [   27.341842]  </TASK>
    [   27.342281] Kernel Offset: 0x1ba00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
    [   27.343221] ---[ end Kernel panic - not syncing: CXL cachemem error. Invoking panic ]---

    Root port CE:
    root@tbowman-cxl:~/aer-inject# ./root-ce-inject.sh
    [   19.444339] pcieport 0000:0c:00.0: aer_inject: Injecting errors 00004000/00000000 into device 0000:0c:00.0
    [   19.445530] pcieport 0000:0c:00.0: AER: Correctable error message received from 0000:0c:00.0
    [   19.446750] pcieport 0000:0c:00.0: PCIe Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
    [   19.447742] pcieport 0000:0c:00.0:   device [8086:7075] error status/mask=00004000/0000a000
    [   19.448549] pcieport 0000:0c:00.0:    [14] CorrIntErr
    [   19.449223] aer_event: 0000:0c:00.0 PCIe Bus Error: severity=Corrected, Corrected Internal Error, TLP Header=Not available
    [   19.449223]
    [   19.451415] cxl_port_aer_correctable_error: device=0000:0c:00.0 host=pci0000:0c status='Received Error From Physical Layer'

    Upstream switch port UCE:
    root@tbowman-cxl:~/aer-inject# ./us-uce-inject.sh
    [   45.236853] pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0d:00.0
    [   45.238101] pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0d:00.0
    [   45.239416] pcieport 0000:0d:00.0: PCIe Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
    [   45.240412] pcieport 0000:0d:00.0:   device [19e5:a128] error status/mask=00400000/02000000
    [   45.241159] pcieport 0000:0d:00.0:    [22] UncorrIntErr
    [   45.242448] aer_event: 0000:0d:00.0 PCIe Bus Error: severity=Fatal, Uncorrectable Internal Error, TLP Header=Not available
    [   45.242448]
    [   45.244008] cxl_port_aer_uncorrectable_error: device=0000:0d:00.0 host=0000:0c:00.0 status: 'Memory Address Parity Error'
    first_error: 'Memory Address Parity Error'
    [   45.249129] Kernel panic - not syncing: CXL cachemem error. Invoking panic
    [   45.249800] CPU: 12 UID: 0 PID: 122 Comm: irq/24-aerdrv Not tainted 6.11.0-rc1-port-error-g1fb9097c3728 #3855
    [   45.250795] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
    [   45.251907] Call Trace:
    [   45.253284]  <TASK>
    [   45.253564]  panic+0x2ed/0x320
    [   45.253909]  ? __pfx_cxl_report_normal_detected+0x10/0x10
    [   45.255455]  ? __pfx_aer_root_reset+0x10/0x10
    [   45.255915]  cxl_do_recovery+0x304/0x310
    [   45.257219]  aer_isr+0x3fd/0x700
    [   45.257572]  ? __pfx_irq_thread_fn+0x10/0x10
    [   45.258006]  irq_thread_fn+0x1f/0x60
    [   45.258383]  irq_thread+0x102/0x1b0
    [   45.258748]  ? __pfx_irq_thread_dtor+0x10/0x10
    [   45.259196]  ? __pfx_irq_thread+0x10/0x10
    [   45.259605]  kthread+0xcd/0x100
    [   45.259956]  ? __pfx_kthread+0x10/0x10
    [   45.260386]  ret_from_fork+0x2f/0x50
    [   45.260879]  ? __pfx_kthread+0x10/0x10
    [   45.261418]  ret_from_fork_asm+0x1a/0x30
    [   45.261936]  </TASK>
    [   45.262451] Kernel Offset: 0xc600000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
    [   45.263467] ---[ end Kernel panic - not syncing: CXL cachemem error. Invoking panic ]---

    Upstream switch port CE:
    root@tbowman-cxl:~/aer-inject# ./us-ce-inject.sh
    [   37.504029] pcieport 0000:0c:00.0: aer_inject: Injecting errors 00004000/00000000 into device 0000:0d:00.0
    [   37.506076] pcieport 0000:0c:00.0: AER: Correctable error message received from 0000:0d:00.0
    [   37.507599] pcieport 0000:0d:00.0: PCIe Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
    [   37.508759] pcieport 0000:0d:00.0:   device [19e5:a128] error status/mask=00004000/0000a000
    [   37.509574] pcieport 0000:0d:00.0:    [14] CorrIntErr
    [   37.510180] aer_event: 0000:0d:00.0 PCIe Bus Error: severity=Corrected, Corrected Internal Error, TLP Header=Not available
    [   37.510180] 
    [   37.512057] cxl_port_aer_correctable_error: device=0000:0d:00.0 host=0000:0c:00.0 status='Received Error From Physical Layer'

    Downstream switch port UCE:
    root@tbowman-cxl:~/aer-inject# ./ds-uce-inject.sh
    [   29.421532] pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0e:00.0
    [   29.422812] pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0e:00.0
    [   29.424551] pcieport 0000:0e:00.0: PCIe Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
    [   29.425670] pcieport 0000:0e:00.0:   device [19e5:a129] error status/mask=00400000/02000000
    [   29.426487] pcieport 0000:0e:00.0:    [22] UncorrIntErr
    [   29.427111] aer_event: 0000:0e:00.0 PCIe Bus Error: severity=Fatal, Uncorrectable Internal Error, TLP Header=Not available
    [   29.427111]
    [   29.428688] cxl_port_aer_uncorrectable_error: device=0000:0e:00.0 host=0000:0d:00.0 status: 'Memory Address Parity Error'
    first_error: 'Memory Address Parity Error'
    [   29.430173] Kernel panic - not syncing: CXL cachemem error. Invoking panic
    [   29.430862] CPU: 12 UID: 0 PID: 122 Comm: irq/24-aerdrv Not tainted 6.11.0-rc1-port-error-g844fd2319372 #3851
    [   29.431874] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
    [   29.433031] Call Trace:
    [   29.433354]  <TASK>
    [   29.433631]  panic+0x2ed/0x320
    [   29.434010]  ? __pfx_cxl_report_normal_detected+0x10/0x10
    [   29.434653]  ? __pfx_aer_root_reset+0x10/0x10
    [   29.435179]  cxl_do_recovery+0x304/0x310
    [   29.435626]  aer_isr+0x3fd/0x700
    [   29.436027]  ? __pfx_irq_thread_fn+0x10/0x10
    [   29.436507]  irq_thread_fn+0x1f/0x60
    [   29.436898]  irq_thread+0x102/0x1b0
    [   29.437293]  ? __pfx_irq_thread_dtor+0x10/0x10
    [   29.437758]  ? __pfx_irq_thread+0x10/0x10
    [   29.438189]  kthread+0xcd/0x100
    [   29.438551]  ? __pfx_kthread+0x10/0x10
    [   29.438959]  ret_from_fork+0x2f/0x50
    [   29.439362]  ? __pfx_kthread+0x10/0x10
    [   29.439771]  ret_from_fork_asm+0x1a/0x30
    [   29.440221]  </TASK>
    [   29.440738] Kernel Offset: 0x10a00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
    [   29.441812] ---[ end Kernel panic - not syncing: CXL cachemem error. Invoking panic ]---

    Downstream switch port CE:
    root@tbowman-cxl:~/aer-inject# ./ds-ce-inject.sh
    [  177.114442] pcieport 0000:0c:00.0: aer_inject: Injecting errors 00004000/00000000 into device 0000:0e:00.0
    [  177.115602] pcieport 0000:0c:00.0: AER: Correctable error message received from 0000:0e:00.0
    [  177.116973] pcieport 0000:0e:00.0: PCIe Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
    [  177.117985] pcieport 0000:0e:00.0:   device [19e5:a129] error status/mask=00004000/0000a000
    [  177.118809] pcieport 0000:0e:00.0:    [14] CorrIntErr
    [  177.119521] aer_event: 0000:0e:00.0 PCIe Bus Error: severity=Corrected, Corrected Internal Error, TLP Header=Not available
    [  177.119521]
    [  177.122037] cxl_port_aer_correctable_error: device=0000:0e:00.0 host=0000:0d:00.0 status='Received Error From Physical Layer'
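
As a reading aid for the logs above (not part of the patchset): the two hex values on each "Injecting errors" line appear to be the correctable and uncorrectable status masks, in that order, judging from which half is non-zero in the CE and UCE runs. The sketch below decodes such a pair into the bit names the logs print. The function and table names are hypothetical, and only the two bits seen in these logs are named.

```python
# Bit positions are taken from the "[22] UncorrIntErr" and "[14] CorrIntErr"
# log lines; the tables and function name are illustrative assumptions.
UNCORRECTABLE_BITS = {22: "UncorrIntErr"}
CORRECTABLE_BITS = {14: "CorrIntErr"}


def decode_injected(pair):
    """Split a 'COR/UNCOR' hex pair and return the names of the set bits."""
    cor_hex, uncor_hex = pair.split("/")
    cor, uncor = int(cor_hex, 16), int(uncor_hex, 16)
    names = []
    for bit in range(32):
        if cor & (1 << bit):
            names.append(CORRECTABLE_BITS.get(bit, f"cor bit {bit}"))
        if uncor & (1 << bit):
            names.append(UNCORRECTABLE_BITS.get(bit, f"uncor bit {bit}"))
    return names


print(decode_injected("00000000/00400000"))  # UCE runs
print(decode_injected("00004000/00000000"))  # CE runs
```

The same value shows up again in the "error status/mask=" line of each run, so the decode can be cross-checked against the bit index the kernel prints.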

Changes RFC->v1:
 [Dan] Rename cxl_rch_handle_error() to cxl_handle_error()
 [Dan] Add cxl_do_recovery()
 [Jonathan] Flatten cxl_setup_parent_uport()
 [Jonathan] Use cxl_component_regs instead of struct cxl_regs regs
 [Jonathan] Rename cxl_dev_is_pci_type()
 [Ming] bus_find_device(&cxl_bus_type, NULL, &pdev->dev, match_uport) can
 replace these find_cxl_port() and device_find_child().
 [Jonathan] Compact call to cxl_port_map_regs() in cxl_setup_parent_uport()
 [Ming] Don't use endpoint as host to cxl_map_component_regs()
 [Bjorn] Use "PCIe UIR/CIE" instead of "AER UI/CIE"
 [TODO][Bjorn] Don't use Kconfig to enable/disable a CXL external interface

Terry Bowman (15):
  cxl/aer/pci: Add CXL PCIe port error handler callbacks in AER service
    driver
  cxl/aer/pci: Update is_internal_error() to be callable w/o
    CONFIG_PCIEAER_CXL
  cxl/aer/pci: Refactor AER driver's existing interfaces to support CXL
    PCIe ports
  cxl/aer/pci: Add CXL PCIe port correctable error support in AER
    service driver
  cxl/aer/pci: Update AER driver to read UCE fatal status for all CXL
    PCIe port devices
  cxl/aer/pci: Introduce PCI_ERS_RESULT_PANIC to pci_ers_result type
  cxl/aer/pci: Add CXL PCIe port uncorrectable error recovery in AER
    service driver
  cxl/pci: Change find_cxl_ports() to be non-static
  cxl/pci: Map CXL PCIe downstream port RAS registers
  cxl/pci: Map CXL PCIe upstream port RAS registers
  cxl/pci: Update RAS handler interfaces to support CXL PCIe ports
  cxl/pci: Add error handler for CXL PCIe port RAS errors
  cxl/pci: Add trace logging for CXL PCIe port RAS errors
  cxl/aer/pci: Export pci_aer_unmask_internal_errors()
  cxl/pci: Enable internal CE/UCE interrupts for CXL PCIe port devices

 drivers/cxl/core/core.h  |   3 +
 drivers/cxl/core/pci.c   | 172 +++++++++++++++++++++++++++++++--------
 drivers/cxl/core/port.c  |   4 +-
 drivers/cxl/core/trace.h |  47 +++++++++++
 drivers/cxl/cxl.h        |  14 +++-
 drivers/cxl/mem.c        |  30 ++++++-
 drivers/cxl/pci.c        |   8 ++
 drivers/pci/pci.h        |   5 ++
 drivers/pci/pcie/aer.c   | 123 ++++++++++++++++++++--------
 drivers/pci/pcie/err.c   | 150 ++++++++++++++++++++++++++++++++++
 include/linux/aer.h      |  16 ++++
 include/linux/pci.h      |   3 +
 12 files changed, 503 insertions(+), 72 deletions(-)


base-commit: f7982d85e136ba7e26b31a725c1841373f81f84a

Comments

Bjorn Helgaas Oct. 10, 2024, 7:07 p.m. UTC | #1
On Tue, Oct 08, 2024 at 05:16:42PM -0500, Terry Bowman wrote:
> This is a continuation of the CXL port error handling RFC from earlier.[1]
> The RFC resulted in the decision to add CXL PCIe port error handling to
> the existing RCH downstream port handling. This patchset adds the CXL PCIe
> port handling and logging.
> 
> The first 7 patches update the existing AER service driver to support CXL
> PCIe port protocol error handling and reporting. This includes AER service
> driver changes for adding correctable and uncorrectable error support, CXL
> specific recovery handling, and addition of CXL driver callback handlers.
> 
> The following 8 patches address CXL driver support for CXL PCIe port
> protocol errors. This includes the following changes to the CXL drivers:
> mapping CXL port and downstream port RAS registers, interface updates for
> common RCH and VH, adding port specific error handlers, and protocol error
> logging.
> 
> [1] - https://lore.kernel.org/linux-cxl/20240617200411.1426554
> -1-terry.bowman@amd.com/

Makes life easier if URLs are all on one line so they still work.

> Testing:
> 
> Below are test results for this patchset. This is using Qemu with a root
> port (0c:00.0), upstream switch port (0d:00.0), and downstream switch port
> (0e:00.0).
> 
> This was tested using aer-inject updated to support CE and UCE internal
> error injection. CXL RAS was set using a test patch (not upstreamed).
> 
>     Root port UCE:
>     root@tbowman-cxl:~/aer-inject# ./root-uce-inject.sh
>     [   27.318920] pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0c:00.0
>     [   27.320164] pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0c:00.0
>     [   27.321518] pcieport 0000:0c:00.0: PCIe Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
>     [   27.322483] pcieport 0000:0c:00.0:   device [8086:7075] error status/mask=00400000/02000000
>     [   27.323243] pcieport 0000:0c:00.0:    [22] UncorrIntErr
>     [   27.325584] aer_event: 0000:0c:00.0 PCIe Bus Error: severity=Fatal, Uncorrectable Internal Error, TLP Header=Not available
>     [   27.325584]
>     [   27.327171] cxl_port_aer_uncorrectable_error: device=0000:0c:00.0 host=pci0000:0c status: 'Memory Address Parity Error'
>     first_error: 'Memory Address Parity Error'
>     [   27.333277] Kernel panic - not syncing: CXL cachemem error. Invoking panic
>     [   27.333872] CPU: 12 UID: 0 PID: 122 Comm: irq/24-aerdrv Not tainted 6.11.0-rc1-port-error-g1fb9097c3728 #3857
>     [   27.334761] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
>     [   27.335716] Call Trace:
>     [   27.335985]  <TASK>
>     [   27.336226]  panic+0x2ed/0x320
>     [   27.336547]  ? __pfx_cxl_report_normal_detected+0x10/0x10
>     [   27.337037]  ? __pfx_aer_root_reset+0x10/0x10
>     [   27.337453]  cxl_do_recovery+0x304/0x310
>     [   27.337833]  aer_isr+0x3fd/0x700
>     [   27.338154]  ? __pfx_irq_thread_fn+0x10/0x10
>     [   27.338572]  irq_thread_fn+0x1f/0x60
>     [   27.338923]  irq_thread+0x102/0x1b0
>     [   27.339267]  ? __pfx_irq_thread_dtor+0x10/0x10
>     [   27.339683]  ? __pfx_irq_thread+0x10/0x10
>     [   27.340059]  kthread+0xcd/0x100
>     [   27.340387]  ? __pfx_kthread+0x10/0x10
>     [   27.340748]  ret_from_fork+0x2f/0x50
>     [   27.341100]  ? __pfx_kthread+0x10/0x10
>     [   27.341466]  ret_from_fork_asm+0x1a/0x30
>     [   27.341842]  </TASK>
>     [   27.342281] Kernel Offset: 0x1ba00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
>     [   27.343221] ---[ end Kernel panic - not syncing: CXL cachemem error. Invoking panic ]---
> 
>     Root port CE:
>     root@tbowman-cxl:~/aer-inject# ./root-ce-inject.sh
>     [   19.444339] pcieport 0000:0c:00.0: aer_inject: Injecting errors 00004000/00000000 into device 0000:0c:00.0
>     [   19.445530] pcieport 0000:0c:00.0: AER: Correctable error message received from 0000:0c:00.0
>     [   19.446750] pcieport 0000:0c:00.0: PCIe Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
>     [   19.447742] pcieport 0000:0c:00.0:   device [8086:7075] error status/mask=00004000/0000a000
>     [   19.448549] pcieport 0000:0c:00.0:    [14] CorrIntErr
>     [   19.449223] aer_event: 0000:0c:00.0 PCIe Bus Error: severity=Corrected, Corrected Internal Error, TLP Header=Not available
>     [   19.449223]
>     [   19.451415] cxl_port_aer_correctable_error: device=0000:0c:00.0 host=pci0000:0c status='Received Error From Physical Layer'
> 
>     Upstream switch port UCE:
>     root@tbowman-cxl:~/aer-inject# ./us-uce-inject.sh
>     [   45.236853] pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0d:00.0
>     [   45.238101] pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0d:00.0
>     [   45.239416] pcieport 0000:0d:00.0: PCIe Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
>     [   45.240412] pcieport 0000:0d:00.0:   device [19e5:a128] error status/mask=00400000/02000000
>     [   45.241159] pcieport 0000:0d:00.0:    [22] UncorrIntErr
>     [   45.242448] aer_event: 0000:0d:00.0 PCIe Bus Error: severity=Fatal, Uncorrectable Internal Error, TLP Header=Not available
>     [   45.242448]
>     [   45.244008] cxl_port_aer_uncorrectable_error: device=0000:0d:00.0 host=0000:0c:00.0 status: 'Memory Address Parity Error'
>     first_error: 'Memory Address Parity Error'
>     [   45.249129] Kernel panic - not syncing: CXL cachemem error. Invoking panic
>     [   45.249800] CPU: 12 UID: 0 PID: 122 Comm: irq/24-aerdrv Not tainted 6.11.0-rc1-port-error-g1fb9097c3728 #3855
>     [   45.250795] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
>     [   45.251907] Call Trace:
>     [   45.253284]  <TASK>
>     [   45.253564]  panic+0x2ed/0x320
>     [   45.253909]  ? __pfx_cxl_report_normal_detected+0x10/0x10
>     [   45.255455]  ? __pfx_aer_root_reset+0x10/0x10
>     [   45.255915]  cxl_do_recovery+0x304/0x310
>     [   45.257219]  aer_isr+0x3fd/0x700
>     [   45.257572]  ? __pfx_irq_thread_fn+0x10/0x10
>     [   45.258006]  irq_thread_fn+0x1f/0x60
>     [   45.258383]  irq_thread+0x102/0x1b0
>     [   45.258748]  ? __pfx_irq_thread_dtor+0x10/0x10
>     [   45.259196]  ? __pfx_irq_thread+0x10/0x10
>     [   45.259605]  kthread+0xcd/0x100
>     [   45.259956]  ? __pfx_kthread+0x10/0x10
>     [   45.260386]  ret_from_fork+0x2f/0x50
>     [   45.260879]  ? __pfx_kthread+0x10/0x10
>     [   45.261418]  ret_from_fork_asm+0x1a/0x30
>     [   45.261936]  </TASK>
>     [   45.262451] Kernel Offset: 0xc600000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
>     [   45.263467] ---[ end Kernel panic - not syncing: CXL cachemem error. Invoking panic ]---
> 
>     Upstream switch port CE:
>     root@tbowman-cxl:~/aer-inject# ./us-ce-inject.sh 
>     [   37.504029] pcieport 0000:0c:00.0: aer_inject: Injecting errors 00004000/00000000 into device 0000:0d:00.0
>     [   37.506076] pcieport 0000:0c:00.0: AER: Correctable error message received from 0000:0d:00.0
>     [   37.507599] pcieport 0000:0d:00.0: PCIe Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
>     [   37.508759] pcieport 0000:0d:00.0:   device [19e5:a128] error status/mask=00004000/0000a000
>     [   37.509574] pcieport 0000:0d:00.0:    [14] CorrIntErr            
>     [   37.510180] aer_event: 0000:0d:00.0 PCIe Bus Error: severity=Corrected, Corrected Internal Error, TLP Header=Not available
>     [   37.510180] 
>     [   37.512057] cxl_port_aer_correctable_error: device=0000:0d:00.0 host=0000:0c:00.0 status='Received Error From Physical Layer'
> 
>     Downstream switch port UCE:
>     root@tbowman-cxl:~/aer-inject# ./ds-uce-inject.sh
>     [   29.421532] pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0e:00.0
>     [   29.422812] pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0e:00.0
>     [   29.424551] pcieport 0000:0e:00.0: PCIe Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
>     [   29.425670] pcieport 0000:0e:00.0:   device [19e5:a129] error status/mask=00400000/02000000
>     [   29.426487] pcieport 0000:0e:00.0:    [22] UncorrIntErr
>     [   29.427111] aer_event: 0000:0e:00.0 PCIe Bus Error: severity=Fatal, Uncorrectable Internal Error, TLP Header=Not available
>     [   29.427111]
>     [   29.428688] cxl_port_aer_uncorrectable_error: device=0000:0e:00.0 host=0000:0d:00.0 status: 'Memory Address Parity Error'
>     first_error: 'Memory Address Parity Error'
>     [   29.430173] Kernel panic - not syncing: CXL cachemem error. Invoking panic
>     [   29.430862] CPU: 12 UID: 0 PID: 122 Comm: irq/24-aerdrv Not tainted 6.11.0-rc1-port-error-g844fd2319372 #3851
>     [   29.431874] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
>     [   29.433031] Call Trace:
>     [   29.433354]  <TASK>
>     [   29.433631]  panic+0x2ed/0x320
>     [   29.434010]  ? __pfx_cxl_report_normal_detected+0x10/0x10
>     [   29.434653]  ? __pfx_aer_root_reset+0x10/0x10
>     [   29.435179]  cxl_do_recovery+0x304/0x310
>     [   29.435626]  aer_isr+0x3fd/0x700
>     [   29.436027]  ? __pfx_irq_thread_fn+0x10/0x10
>     [   29.436507]  irq_thread_fn+0x1f/0x60
>     [   29.436898]  irq_thread+0x102/0x1b0
>     [   29.437293]  ? __pfx_irq_thread_dtor+0x10/0x10
>     [   29.437758]  ? __pfx_irq_thread+0x10/0x10
>     [   29.438189]  kthread+0xcd/0x100
>     [   29.438551]  ? __pfx_kthread+0x10/0x10
>     [   29.438959]  ret_from_fork+0x2f/0x50
>     [   29.439362]  ? __pfx_kthread+0x10/0x10
>     [   29.439771]  ret_from_fork_asm+0x1a/0x30
>     [   29.440221]  </TASK>
>     [   29.440738] Kernel Offset: 0x10a00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
>     [   29.441812] ---[ end Kernel panic - not syncing: CXL cachemem error. Invoking panic ]---
> 
>     Downstream switch port CE:
>     root@tbowman-cxl:~/aer-inject# ./ds-ce-inject.sh
>     [  177.114442] pcieport 0000:0c:00.0: aer_inject: Injecting errors 00004000/00000000 into device 0000:0e:00.0
>     [  177.115602] pcieport 0000:0c:00.0: AER: Correctable error message received from 0000:0e:00.0
>     [  177.116973] pcieport 0000:0e:00.0: PCIe Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
>     [  177.117985] pcieport 0000:0e:00.0:   device [19e5:a129] error status/mask=00004000/0000a000
>     [  177.118809] pcieport 0000:0e:00.0:    [14] CorrIntErr
>     [  177.119521] aer_event: 0000:0e:00.0 PCIe Bus Error: severity=Corrected, Corrected Internal Error, TLP Header=Not available
>     [  177.119521]
>     [  177.122037] cxl_port_aer_correctable_error: device=0000:0e:00.0 host=0000:0d:00.0 status='Received Error From Physical Layer'

Thanks for the hints about how to test this; it's helpful to have
those in the email archives.  Remove the timestamps and non-relevant
call trace entries unless they add useful information.  AFAICT they're
just distractions in this case.
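
A minimal sketch of that cleanup, assuming the entries to drop are the "? "-prefixed call trace lines (which the kernel marks as unreliable) and the timestamp-only lines. `trim_log` is a hypothetical helper, not part of aer-inject or this series:

```python
import re

# Matches a leading dmesg timestamp such as "[   27.318920] ".
TIMESTAMP = re.compile(r"^\[\s*\d+\.\d+\] ?")


def trim_log(lines):
    """Drop timestamps, '? ' call trace entries, and lines left empty."""
    out = []
    for line in lines:
        text = TIMESTAMP.sub("", line.strip())
        if not text.strip() or text.lstrip().startswith("? "):
            continue
        out.append(text)
    return out


sample = [
    "    [   27.335716] Call Trace:",
    "    [   27.336226]  panic+0x2ed/0x320",
    "    [   27.336547]  ? __pfx_cxl_report_normal_detected+0x10/0x10",
    "    [   27.337453]  cxl_do_recovery+0x304/0x310",
    "    [   27.325584]",
]
for entry in trim_log(sample):
    print(entry)
```

Which remaining entries count as relevant is still a judgment call for the submitter; the filter only removes the mechanical noise.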

> Changes RFC->v1:
>  [Dan] Rename cxl_rch_handle_error() to cxl_handle_error()
>  [Dan] Add cxl_do_recovery()
>  [Jonathan] Flatten cxl_setup_parent_uport()
>  [Jonathan] Use cxl_component_regs instead of struct cxl_regs regs
>  [Jonathan] Rename cxl_dev_is_pci_type()
>  [Ming] bus_find_device(&cxl_bus_type, NULL, &pdev->dev, match_uport) can
>  replace these find_cxl_port() and device_find_child().
>  [Jonathan] Compact call to cxl_port_map_regs() in cxl_setup_parent_uport()
>  [Ming] Don't use endpoint as host to cxl_map_component_regs()
>  [Bjorn] Use "PCIe UIR/CIE" instead of "AER UI/CIE"
>  [TODO][Bjorn] Don't use Kconfig to enable/disable a CXL external interface
> 
> Terry Bowman (15):
>   cxl/aer/pci: Add CXL PCIe port error handler callbacks in AER service
>     driver
>   cxl/aer/pci: Update is_internal_error() to be callable w/o
>     CONFIG_PCIEAER_CXL
>   cxl/aer/pci: Refactor AER driver's existing interfaces to support CXL
>     PCIe ports
>   cxl/aer/pci: Add CXL PCIe port correctable error support in AER
>     service driver
>   cxl/aer/pci: Update AER driver to read UCE fatal status for all CXL
>     PCIe port devices
>   cxl/aer/pci: Introduce PCI_ERS_RESULT_PANIC to pci_ers_result type
>   cxl/aer/pci: Add CXL PCIe port uncorrectable error recovery in AER
>     service driver

I had to look at the patches to learn that all the above only touch
drivers/pci, aer.h, and pci.h.  Can you use the PCI subject line
conventions (e.g., "PCI/AER: ...") to make this more obvious?  Almost
all already include "CXL", so I don't think we'd really lose any
information.

>   cxl/pci: Change find_cxl_ports() to be non-static
>   cxl/pci: Map CXL PCIe downstream port RAS registers
>   cxl/pci: Map CXL PCIe upstream port RAS registers
>   cxl/pci: Update RAS handler interfaces to support CXL PCIe ports
>   cxl/pci: Add error handler for CXL PCIe port RAS errors
>   cxl/pci: Add trace logging for CXL PCIe port RAS errors
>   cxl/aer/pci: Export pci_aer_unmask_internal_errors()

Ditto here, and add something about CXL in the subject since this
doesn't export universally.

>   cxl/pci: Enable internal CE/UCE interrupts for CXL PCIe port devices
> 
>  drivers/cxl/core/core.h  |   3 +
>  drivers/cxl/core/pci.c   | 172 +++++++++++++++++++++++++++++++--------
>  drivers/cxl/core/port.c  |   4 +-
>  drivers/cxl/core/trace.h |  47 +++++++++++
>  drivers/cxl/cxl.h        |  14 +++-
>  drivers/cxl/mem.c        |  30 ++++++-
>  drivers/cxl/pci.c        |   8 ++
>  drivers/pci/pci.h        |   5 ++
>  drivers/pci/pcie/aer.c   | 123 ++++++++++++++++++++--------
>  drivers/pci/pcie/err.c   | 150 ++++++++++++++++++++++++++++++++++
>  include/linux/aer.h      |  16 ++++
>  include/linux/pci.h      |   3 +
>  12 files changed, 503 insertions(+), 72 deletions(-)
> 
> 
> base-commit: f7982d85e136ba7e26b31a725c1841373f81f84a

This doesn't apply cleanly on v6.12-rc1, and
f7982d85e136ba7e26b31a725c1841373f81f84a isn't upstream yet.  Where
is it?  I guess it relies on some other series that hasn't been merged
yet?

Bjorn

Bowman, Terry Oct. 14, 2024, 5:22 p.m. UTC | #2
Hi Bjorn,

Thanks for taking the time to review. I added comments below.

On 10/10/24 14:07, Bjorn Helgaas wrote:
> On Tue, Oct 08, 2024 at 05:16:42PM -0500, Terry Bowman wrote:
>> This is a continuation of the CXL port error handling RFC from earlier.[1]
>> The RFC resulted in the decision to add CXL PCIe port error handling to
>> the existing RCH downstream port handling. This patchset adds the CXL PCIe
>> port handling and logging.
>>
>> The first 7 patches update the existing AER service driver to support CXL
>> PCIe port protocol error handling and reporting. This includes AER service
>> driver changes for adding correctable and uncorrectable error support, CXL
>> specific recovery handling, and addition of CXL driver callback handlers.
>>
>> The following 8 patches address CXL driver support for CXL PCIe port
>> protocol errors. This includes the following changes to the CXL drivers:
>> mapping CXL port and downstream port RAS registers, interface updates for
>> common RCH and VH, adding port specific error handlers, and protocol error
>> logging.
>>
>> [1] - https://lore.kernel.org/linux-cxl/20240617200411.1426554
>> -1-terry.bowman@amd.com/
> 
> Makes life easier if URLs are all on one line so they still work.
> 

Ok.

>> Testing:
>>
>> Below are test results for this patchset. This is using Qemu with a root
>> port (0c:00.0), upstream switch port (0d:00.0), and downstream switch port
>> (0e:00.0).
>>
>> This was tested using aer-inject updated to support CE and UCE internal
>> error injection. CXL RAS was set using a test patch (not upstreamed).
>>
>>     Root port UCE:
>>     root@tbowman-cxl:~/aer-inject# ./root-uce-inject.sh
>>     [   27.318920] pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0c:00.0
>>     [   27.320164] pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0c:00.0
>>     [   27.321518] pcieport 0000:0c:00.0: PCIe Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
>>     [   27.322483] pcieport 0000:0c:00.0:   device [8086:7075] error status/mask=00400000/02000000
>>     [   27.323243] pcieport 0000:0c:00.0:    [22] UncorrIntErr
>>     [   27.325584] aer_event: 0000:0c:00.0 PCIe Bus Error: severity=Fatal, Uncorrectable Internal Error, TLP Header=Not available
>>     [   27.325584]
>>     [   27.327171] cxl_port_aer_uncorrectable_error: device=0000:0c:00.0 host=pci0000:0c status: 'Memory Address Parity Error'
>>     first_error: 'Memory Address Parity Error'
>>     [   27.333277] Kernel panic - not syncing: CXL cachemem error. Invoking panic
>>     [   27.333872] CPU: 12 UID: 0 PID: 122 Comm: irq/24-aerdrv Not tainted 6.11.0-rc1-port-error-g1fb9097c3728 #3857
>>     [   27.334761] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
>>     [   27.335716] Call Trace:
>>     [   27.335985]  <TASK>
>>     [   27.336226]  panic+0x2ed/0x320
>>     [   27.336547]  ? __pfx_cxl_report_normal_detected+0x10/0x10
>>     [   27.337037]  ? __pfx_aer_root_reset+0x10/0x10
>>     [   27.337453]  cxl_do_recovery+0x304/0x310
>>     [   27.337833]  aer_isr+0x3fd/0x700
>>     [   27.338154]  ? __pfx_irq_thread_fn+0x10/0x10
>>     [   27.338572]  irq_thread_fn+0x1f/0x60
>>     [   27.338923]  irq_thread+0x102/0x1b0
>>     [   27.339267]  ? __pfx_irq_thread_dtor+0x10/0x10
>>     [   27.339683]  ? __pfx_irq_thread+0x10/0x10
>>     [   27.340059]  kthread+0xcd/0x100
>>     [   27.340387]  ? __pfx_kthread+0x10/0x10
>>     [   27.340748]  ret_from_fork+0x2f/0x50
>>     [   27.341100]  ? __pfx_kthread+0x10/0x10
>>     [   27.341466]  ret_from_fork_asm+0x1a/0x30
>>     [   27.341842]  </TASK>
>>     [   27.342281] Kernel Offset: 0x1ba00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
>>     [   27.343221] ---[ end Kernel panic - not syncing: CXL cachemem error. Invoking panic ]---
>>
>>     Root port CE:
>>     root@tbowman-cxl:~/aer-inject# ./root-ce-inject.sh
>>     [   19.444339] pcieport 0000:0c:00.0: aer_inject: Injecting errors 00004000/00000000 into device 0000:0c:00.0
>>     [   19.445530] pcieport 0000:0c:00.0: AER: Correctable error message received from 0000:0c:00.0
>>     [   19.446750] pcieport 0000:0c:00.0: PCIe Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
>>     [   19.447742] pcieport 0000:0c:00.0:   device [8086:7075] error status/mask=00004000/0000a000
>>     [   19.448549] pcieport 0000:0c:00.0:    [14] CorrIntErr
>>     [   19.449223] aer_event: 0000:0c:00.0 PCIe Bus Error: severity=Corrected, Corrected Internal Error, TLP Header=Not available
>>     [   19.449223]
>>     [   19.451415] cxl_port_aer_correctable_error: device=0000:0c:00.0 host=pci0000:0c status='Received Error From Physical Layer'
>>
>>     Upstream switch port UCE:
>>     root@tbowman-cxl:~/aer-inject# ./us-uce-inject.sh
>>     [   45.236853] pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0d:00.0
>>     [   45.238101] pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0d:00.0
>>     [   45.239416] pcieport 0000:0d:00.0: PCIe Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
>>     [   45.240412] pcieport 0000:0d:00.0:   device [19e5:a128] error status/mask=00400000/02000000
>>     [   45.241159] pcieport 0000:0d:00.0:    [22] UncorrIntErr
>>     [   45.242448] aer_event: 0000:0d:00.0 PCIe Bus Error: severity=Fatal, Uncorrectable Internal Error, TLP Header=Not available
>>     [   45.242448]
>>     [   45.244008] cxl_port_aer_uncorrectable_error: device=0000:0d:00.0 host=0000:0c:00.0 status: 'Memory Address Parity Error'
>>     first_error: 'Memory Address Parity Error'
>>     [   45.249129] Kernel panic - not syncing: CXL cachemem error. Invoking panic
>>     [   45.249800] CPU: 12 UID: 0 PID: 122 Comm: irq/24-aerdrv Not tainted 6.11.0-rc1-port-error-g1fb9097c3728 #3855
>>     [   45.250795] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
>>     [   45.251907] Call Trace:
>>     [   45.253284]  <TASK>
>>     [   45.253564]  panic+0x2ed/0x320
>>     [   45.253909]  ? __pfx_cxl_report_normal_detected+0x10/0x10
>>     [   45.255455]  ? __pfx_aer_root_reset+0x10/0x10
>>     [   45.255915]  cxl_do_recovery+0x304/0x310
>>     [   45.257219]  aer_isr+0x3fd/0x700
>>     [   45.257572]  ? __pfx_irq_thread_fn+0x10/0x10
>>     [   45.258006]  irq_thread_fn+0x1f/0x60
>>     [   45.258383]  irq_thread+0x102/0x1b0
>>     [   45.258748]  ? __pfx_irq_thread_dtor+0x10/0x10
>>     [   45.259196]  ? __pfx_irq_thread+0x10/0x10
>>     [   45.259605]  kthread+0xcd/0x100
>>     [   45.259956]  ? __pfx_kthread+0x10/0x10
>>     [   45.260386]  ret_from_fork+0x2f/0x50
>>     [   45.260879]  ? __pfx_kthread+0x10/0x10
>>     [   45.261418]  ret_from_fork_asm+0x1a/0x30
>>     [   45.261936]  </TASK>
>>     [   45.262451] Kernel Offset: 0xc600000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
>>     [   45.263467] ---[ end Kernel panic - not syncing: CXL cachemem error. Invoking panic ]---
>>
>>     Upstream switch port CE:
>>     root@tbowman-cxl:~/aer-inject# ./us-ce-inject.sh 
>>     [   37.504029] pcieport 0000:0c:00.0: aer_inject: Injecting errors 00004000/00000000 into device 0000:0d:00.0
>>     [   37.506076] pcieport 0000:0c:00.0: AER: Correctable error message received from 0000:0d:00.0
>>     [   37.507599] pcieport 0000:0d:00.0: PCIe Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
>>     [   37.508759] pcieport 0000:0d:00.0:   device [19e5:a128] error status/mask=00004000/0000a000
>>     [   37.509574] pcieport 0000:0d:00.0:    [14] CorrIntErr            
>>     [   37.510180] aer_event: 0000:0d:00.0 PCIe Bus Error: severity=Corrected, Corrected Internal Error, TLP Header=Not available
>>     [   37.510180] 
>>     [   37.512057] cxl_port_aer_correctable_error: device=0000:0d:00.0 host=0000:0c:00.0 status='Received Error From Physical Layer'
>>
>>     Downstream switch port UCE:
>>     root@tbowman-cxl:~/aer-inject# ./ds-uce-inject.sh
>>     [   29.421532] pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0e:00.0
>>     [   29.422812] pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0e:00.0
>>     [   29.424551] pcieport 0000:0e:00.0: PCIe Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
>>     [   29.425670] pcieport 0000:0e:00.0:   device [19e5:a129] error status/mask=00400000/02000000
>>     [   29.426487] pcieport 0000:0e:00.0:    [22] UncorrIntErr
>>     [   29.427111] aer_event: 0000:0e:00.0 PCIe Bus Error: severity=Fatal, Uncorrectable Internal Error, TLP Header=Not available
>>     [   29.427111]
>>     [   29.428688] cxl_port_aer_uncorrectable_error: device=0000:0e:00.0 host=0000:0d:00.0 status: 'Memory Address Parity Error'
>>     first_error: 'Memory Address Parity Error'
>>     [   29.430173] Kernel panic - not syncing: CXL cachemem error. Invoking panic
>>     [   29.430862] CPU: 12 UID: 0 PID: 122 Comm: irq/24-aerdrv Not tainted 6.11.0-rc1-port-error-g844fd2319372 #3851
>>     [   29.431874] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
>>     [   29.433031] Call Trace:
>>     [   29.433354]  <TASK>
>>     [   29.433631]  panic+0x2ed/0x320
>>     [   29.434010]  ? __pfx_cxl_report_normal_detected+0x10/0x10
>>     [   29.434653]  ? __pfx_aer_root_reset+0x10/0x10
>>     [   29.435179]  cxl_do_recovery+0x304/0x310
>>     [   29.435626]  aer_isr+0x3fd/0x700
>>     [   29.436027]  ? __pfx_irq_thread_fn+0x10/0x10
>>     [   29.436507]  irq_thread_fn+0x1f/0x60
>>     [   29.436898]  irq_thread+0x102/0x1b0
>>     [   29.437293]  ? __pfx_irq_thread_dtor+0x10/0x10
>>     [   29.437758]  ? __pfx_irq_thread+0x10/0x10
>>     [   29.438189]  kthread+0xcd/0x100
>>     [   29.438551]  ? __pfx_kthread+0x10/0x10
>>     [   29.438959]  ret_from_fork+0x2f/0x50
>>     [   29.439362]  ? __pfx_kthread+0x10/0x10
>>     [   29.439771]  ret_from_fork_asm+0x1a/0x30
>>     [   29.440221]  </TASK>
>>     [   29.440738] Kernel Offset: 0x10a00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
>>     [   29.441812] ---[ end Kernel panic - not syncing: CXL cachemem error. Invoking panic ]---
>>
>>     Downstream switch port CE:
>>     root@tbowman-cxl:~/aer-inject# ./ds-ce-inject.sh
>>     [  177.114442] pcieport 0000:0c:00.0: aer_inject: Injecting errors 00004000/00000000 into device 0000:0e:00.0
>>     [  177.115602] pcieport 0000:0c:00.0: AER: Correctable error message received from 0000:0e:00.0
>>     [  177.116973] pcieport 0000:0e:00.0: PCIe Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
>>     [  177.117985] pcieport 0000:0e:00.0:   device [19e5:a129] error status/mask=00004000/0000a000
>>     [  177.118809] pcieport 0000:0e:00.0:    [14] CorrIntErr
>>     [  177.119521] aer_event: 0000:0e:00.0 PCIe Bus Error: severity=Corrected, Corrected Internal Error, TLP Header=Not available
>>     [  177.119521]
>>     [  177.122037] cxl_port_aer_correctable_error: device=0000:0e:00.0 host=0000:0d:00.0 status='Received Error From Physical Layer'
> 
> Thanks for the hints about how to test this; it's helpful to have
> those in the email archives.  Remove the timestamps and non-relevant
> call trace entries unless they add useful information.  AFAICT they're
> just distractions in this case.
> 

I'll remove the test logging and details from the cover sheet. I'm unable to find how to
attach using git tools. Instead of an attachment, I can host the log files and details
on a public github. Let me know if this is not acceptable.

>> Changes RFC->v1:
>>  [Dan] Rename cxl_rch_handle_error() to cxl_handle_error()
>>  [Dan] Add cxl_do_recovery()
>>  [Jonathan] Flatten cxl_setup_parent_uport()
>>  [Jonathan] Use cxl_component_regs instead of struct cxl_regs regs
>>  [Jonathan] Rename cxl_dev_is_pci_type()
>>  [Ming] bus_find_device(&cxl_bus_type, NULL, &pdev->dev, match_uport) can
>>  replace these find_cxl_port() and device_find_child().
>>  [Jonathan] Compact call to cxl_port_map_regs() in cxl_setup_parent_uport()
>>  [Ming] Don't use endpoint as host to cxl_map_component_regs()
>>  [Bjorn] Use "PCIe UIR/CIE" instead of "AER UI/CIE"
>>  [TODO][Bjorn] Don't use Kconfig to enable/disable a CXL external interface
>>
>> Terry Bowman (15):
>>   cxl/aer/pci: Add CXL PCIe port error handler callbacks in AER service
>>     driver
>>   cxl/aer/pci: Update is_internal_error() to be callable w/o
>>     CONFIG_PCIEAER_CXL
>>   cxl/aer/pci: Refactor AER driver's existing interfaces to support CXL
>>     PCIe ports
>>   cxl/aer/pci: Add CXL PCIe port correctable error support in AER
>>     service driver
>>   cxl/aer/pci: Update AER driver to read UCE fatal status for all CXL
>>     PCIe port devices
>>   cxl/aer/pci: Introduce PCI_ERS_RESULT_PANIC to pci_ers_result type
>>   cxl/aer/pci: Add CXL PCIe port uncorrectable error recovery in AER
>>     service driver
> 
> I had to look at the patches to learn that all the above only touch
> drivers/pci, aer.h, and pci.h.  Can you use the PCI subject line
> conventions (e.g., "PCI/AER: ...") to make this more obvious?  Almost
> all already include "CXL", so I don't think we'd really lose any
> information.
> 

Yes, I'll change the patch subject lines to use the capitalized "PCI/AER" prefix.

>>   cxl/pci: Change find_cxl_ports() to be non-static
>>   cxl/pci: Map CXL PCIe downstream port RAS registers
>>   cxl/pci: Map CXL PCIe upstream port RAS registers
>>   cxl/pci: Update RAS handler interfaces to support CXL PCIe ports
>>   cxl/pci: Add error handler for CXL PCIe port RAS errors
>>   cxl/pci: Add trace logging for CXL PCIe port RAS errors
>>   cxl/aer/pci: Export pci_aer_unmask_internal_errors()
> 
> Ditto here, and add something about CXL in the subject since this
> doesn't export universally.
> 

Ok.

>>   cxl/pci: Enable internal CE/UCE interrupts for CXL PCIe port devices
>>
>>  drivers/cxl/core/core.h  |   3 +
>>  drivers/cxl/core/pci.c   | 172 +++++++++++++++++++++++++++++++--------
>>  drivers/cxl/core/port.c  |   4 +-
>>  drivers/cxl/core/trace.h |  47 +++++++++++
>>  drivers/cxl/cxl.h        |  14 +++-
>>  drivers/cxl/mem.c        |  30 ++++++-
>>  drivers/cxl/pci.c        |   8 ++
>>  drivers/pci/pci.h        |   5 ++
>>  drivers/pci/pcie/aer.c   | 123 ++++++++++++++++++++--------
>>  drivers/pci/pcie/err.c   | 150 ++++++++++++++++++++++++++++++++++
>>  include/linux/aer.h      |  16 ++++
>>  include/linux/pci.h      |   3 +
>>  12 files changed, 503 insertions(+), 72 deletions(-)
>>
>>
>> base-commit: f7982d85e136ba7e26b31a725c1841373f81f84a
> 
> This doesn't apply cleanly on v6.12-rc1, and
> f7982d85e136ba7e26b31a725c1841373f81f84a isn't upstream yet.  Where
> is it?  I guess it relies on some other series that hasn't been merged
> yet?
> 
> Bjorn

Hmm, I thought I was using a 6.11-rc7 commit. I will rebase onto either 6.12-rc1 or 6.12-rc2.

Regards,
Terry
Bjorn Helgaas Oct. 14, 2024, 5:29 p.m. UTC | #3
On Mon, Oct 14, 2024 at 12:22:08PM -0500, Terry Bowman wrote:
> On 10/10/24 14:07, Bjorn Helgaas wrote:
> > On Tue, Oct 08, 2024 at 05:16:42PM -0500, Terry Bowman wrote:
> >> This is a continuation of the CXL port error handling RFC from earlier.[1]
> >> The RFC resulted in the decision to add CXL PCIe port error handling to
> >> the existing RCH downstream port handling. This patchset adds the CXL PCIe
> >> port handling and logging.
> ...

> >>     Downstream switch port CE:
> >>     root@tbowman-cxl:~/aer-inject# ./ds-ce-inject.sh
> >>     [  177.114442] pcieport 0000:0c:00.0: aer_inject: Injecting errors 00004000/00000000 into device 0000:0e:00.0
> >>     [  177.115602] pcieport 0000:0c:00.0: AER: Correctable error message received from 0000:0e:00.0
> >>     [  177.116973] pcieport 0000:0e:00.0: PCIe Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
> >>     [  177.117985] pcieport 0000:0e:00.0:   device [19e5:a129] error status/mask=00004000/0000a000
> >>     [  177.118809] pcieport 0000:0e:00.0:    [14] CorrIntErr
> >>     [  177.119521] aer_event: 0000:0e:00.0 PCIe Bus Error: severity=Corrected, Corrected Internal Error, TLP Header=Not available
> >>     [  177.119521]
> >>     [  177.122037] cxl_port_aer_correctable_error: device=0000:0e:00.0 host=0000:0d:00.0 status='Received Error From Physical Layer'
> > 
> > Thanks for the hints about how to test this; it's helpful to have
> > those in the email archives.  Remove the timestamps and non-relevant
> > call trace entries unless they add useful information.  AFAICT they're
> > just distractions in this case.
> 
> I'll remove the test logging and details from the cover sheet. I'm
> unable to find how to attach using git tools. Instead of an
> attachment, I can host the log files and details on a public
> github. Let me know if this is not acceptable.

It's fine to keep this in the cover sheet, and I'd rather have it
there, where lore will archive it reliably forever, than to have a
pointer to some other github that may eventually disappear even though
it's public today.

I just meant to remove irrelevant information like the timestamps.

Bjorn
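
For reference, the timestamp cleanup requested above could be scripted
rather than done by hand. Below is a minimal sketch in C, assuming each
log line has already had its leading indentation removed.
strip_timestamp() is a hypothetical helper name and not part of any
existing tool. It returns a pointer just past a leading
"[ seconds.fraction] " printk timestamp, or the line unchanged when no
such prefix is present.

```c
#include <ctype.h>

/*
 * Hypothetical helper, for illustration only.
 * Skips a leading printk timestamp of the form "[  177.118809] ".
 * Lines without that prefix (e.g. "first_error: ...") are returned as-is.
 */
static const char *strip_timestamp(const char *line)
{
	const char *p = line;

	if (*p != '[')
		return line;
	p++;
	while (*p == ' ')			/* padding before the seconds */
		p++;
	if (!isdigit((unsigned char)*p))
		return line;
	while (isdigit((unsigned char)*p))	/* seconds */
		p++;
	if (*p != '.')				/* "[14]" style bit indexes stop here */
		return line;
	p++;
	if (!isdigit((unsigned char)*p))
		return line;
	while (isdigit((unsigned char)*p))	/* fractional part */
		p++;
	if (*p != ']')
		return line;
	p++;
	if (*p == ' ')				/* single separator after ']' */
		p++;
	return p;
}
```

Requiring the '.' between the two digit runs is what keeps AER bit
indexes such as "[14] CorrIntErr" from being mistaken for a timestamp.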
Bowman, Terry Oct. 14, 2024, 5:33 p.m. UTC | #4
On 10/14/24 12:29, Bjorn Helgaas wrote:
> On Mon, Oct 14, 2024 at 12:22:08PM -0500, Terry Bowman wrote:
>> On 10/10/24 14:07, Bjorn Helgaas wrote:
>>> On Tue, Oct 08, 2024 at 05:16:42PM -0500, Terry Bowman wrote:
>>>> This is a continuation of the CXL port error handling RFC from earlier.[1]
>>>> The RFC resulted in the decision to add CXL PCIe port error handling to
>>>> the existing RCH downstream port handling. This patchset adds the CXL PCIe
>>>> port handling and logging.
>> ...
> 
>>>>     Downstream switch port CE:
>>>>     root@tbowman-cxl:~/aer-inject# ./ds-ce-inject.sh
>>>>     [  177.114442] pcieport 0000:0c:00.0: aer_inject: Injecting errors 00004000/00000000 into device 0000:0e:00.0
>>>>     [  177.115602] pcieport 0000:0c:00.0: AER: Correctable error message received from 0000:0e:00.0
>>>>     [  177.116973] pcieport 0000:0e:00.0: PCIe Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
>>>>     [  177.117985] pcieport 0000:0e:00.0:   device [19e5:a129] error status/mask=00004000/0000a000
>>>>     [  177.118809] pcieport 0000:0e:00.0:    [14] CorrIntErr
>>>>     [  177.119521] aer_event: 0000:0e:00.0 PCIe Bus Error: severity=Corrected, Corrected Internal Error, TLP Header=Not available
>>>>     [  177.119521]
>>>>     [  177.122037] cxl_port_aer_correctable_error: device=0000:0e:00.0 host=0000:0d:00.0 status='Received Error From Physical Layer'
>>>
>>> Thanks for the hints about how to test this; it's helpful to have
>>> those in the email archives.  Remove the timestamps and non-relevant
>>> call trace entries unless they add useful information.  AFAICT they're
>>> just distractions in this case.
>>
>> I'll remove the test logging and details from the cover sheet. I'm
>> unable to find how to attach using git tools. Instead of an
>> attachment, I can host the log files and details on a public
>> github. Let me know if this is not acceptable.
> 
> It's fine to keep this in the cover sheet, and I'd rather have it
> there, where lore will archive it reliably forever, than to have a
> pointer to some other github that may eventually disappear even though
> it's public today.
> 
> I just meant to remove irrelevant information like the timestamps.
> 
> Bjorn

Ok, I'll clean it up and leave it in the cover sheet. Thanks.

Regards,
Terry
Fan Ni Oct. 17, 2024, 4:34 p.m. UTC | #5
On Tue, Oct 08, 2024 at 05:16:42PM -0500, Terry Bowman wrote:
> This is a continuation of the CXL port error handling RFC from earlier.[1]
> The RFC resulted in the decision to add CXL PCIe port error handling to
> the existing RCH downstream port handling. This patchset adds the CXL PCIe
> port handling and logging.
> 
> The first 7 patches update the existing AER service driver to support CXL
> PCIe port protocol error handling and reporting. This includes AER service
> driver changes for adding correctable and uncorrectable error support, CXL
> specific recovery handling, and addition of CXL driver callback handlers.
> 
> The following 8 patches address CXL driver support for CXL PCIe port
> protocol errors. This includes the following changes to the CXL drivers:
> mapping CXL port and downstream port RAS registers, interface updates for
> common RCH and VH, adding port specific error handlers, and protocol error
> logging.
> 
> [1] - https://lore.kernel.org/linux-cxl/20240617200411.1426554
> -1-terry.bowman@amd.com/
> 
> Testing:
> 
> Below are test results for this patchset. This is using Qemu with a root
> port (0c:00.0), upstream switch port (0d:00.0),and downstream switch port
> (0e:00.0).
> 
> This was tested using aer-inject updated to support CE and UCE internal
> error injection. CXL RAS was set using a test patch (not upstreamed).

Hi Terry,
Can you share the aer-inject repo for the testing or the test patch?

Fan
Bowman, Terry Oct. 17, 2024, 5:27 p.m. UTC | #6
Hi Fan,

On 10/17/2024 11:34 AM, Fan Ni wrote:
> On Tue, Oct 08, 2024 at 05:16:42PM -0500, Terry Bowman wrote:
>> This is a continuation of the CXL port error handling RFC from earlier.[1]
>> The RFC resulted in the decision to add CXL PCIe port error handling to
>> the existing RCH downstream port handling. This patchset adds the CXL PCIe
>> port handling and logging.
>>
>> The first 7 patches update the existing AER service driver to support CXL
>> PCIe port protocol error handling and reporting. This includes AER service
>> driver changes for adding correctable and uncorrectable error support, CXL
>> specific recovery handling, and addition of CXL driver callback handlers.
>>
>> The following 8 patches address CXL driver support for CXL PCIe port
>> protocol errors. This includes the following changes to the CXL drivers:
>> mapping CXL port and downstream port RAS registers, interface updates for
>> common RCH and VH, adding port specific error handlers, and protocol error
>> logging.
>>
>> [1] - https://lore.kernel.org/linux-cxl/20240617200411.1426554-1-terry.bowman@amd.com/
>>
>> Testing:
>>
>> Below are test results for this patchset. This is using QEMU with a root
>> port (0c:00.0), upstream switch port (0d:00.0), and downstream switch port
>> (0e:00.0).
>>
>> This was tested using aer-inject updated to support CE and UCE internal
>> error injection. CXL RAS was set using a test patch (not upstreamed).
> 
> Hi Terry,
> Can you share the aer-inject repo for the testing or the test patch?
> 
> Fan

Sure, but it's easiest to attach the patch here.

Origin was https://github.com/jderrick/aer-inject.git
Base is 81701cbb30e35a1a76c3876f55692f91bdb9751b

Regards,
Terry
From ca9277866b506723f46f3acd7b264ffa80c37276 Mon Sep 17 00:00:00 2001
From: Terry Bowman <terry.bowman@amd.com>
Date: Thu, 17 Oct 2024 12:12:58 -0500
Subject: [PATCH] aer-inject: Add internal error injection

Add corrected (CE) and uncorrected (UCE) AER internal error injection
support.

Signed-off-by: Terry Bowman <terry.bowman@amd.com>
---
 aer.h   | 2 ++
 aer.lex | 2 ++
 aer.y   | 8 ++++----
 3 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/aer.h b/aer.h
index a0ad152..e55a731 100644
--- a/aer.h
+++ b/aer.h
@@ -30,11 +30,13 @@ struct aer_error_inj
 #define  PCI_ERR_UNC_MALF_TLP	0x00040000	/* Malformed TLP */
 #define  PCI_ERR_UNC_ECRC	0x00080000	/* ECRC Error Status */
 #define  PCI_ERR_UNC_UNSUP	0x00100000	/* Unsupported Request */
+#define  PCI_ERR_UNC_INTERNAL	0x00400000	/* Uncorrectable Internal Error */
 #define  PCI_ERR_COR_RCVR	0x00000001	/* Receiver Error Status */
 #define  PCI_ERR_COR_BAD_TLP	0x00000040	/* Bad TLP Status */
 #define  PCI_ERR_COR_BAD_DLLP	0x00000080	/* Bad DLLP Status */
 #define  PCI_ERR_COR_REP_ROLL	0x00000100	/* REPLAY_NUM Rollover */
 #define  PCI_ERR_COR_REP_TIMER	0x00001000	/* Replay Timer Timeout */
+#define  PCI_ERR_COR_CINTERNAL	0x00004000	/* Internal error */
 
 extern void init_aer(struct aer_error_inj *err);
 extern void submit_aer(struct aer_error_inj *err);
diff --git a/aer.lex b/aer.lex
index 6121e4e..4fadd0e 100644
--- a/aer.lex
+++ b/aer.lex
@@ -82,11 +82,13 @@ static struct key {
 	KEYVAL(MALF_TLP, PCI_ERR_UNC_MALF_TLP),
 	KEYVAL(ECRC, PCI_ERR_UNC_ECRC),
 	KEYVAL(UNSUP, PCI_ERR_UNC_UNSUP),
+	KEYVAL(INTERNAL, PCI_ERR_UNC_INTERNAL),
 	KEYVAL(RCVR, PCI_ERR_COR_RCVR),
 	KEYVAL(BAD_TLP, PCI_ERR_COR_BAD_TLP),
 	KEYVAL(BAD_DLLP, PCI_ERR_COR_BAD_DLLP),
 	KEYVAL(REP_ROLL, PCI_ERR_COR_REP_ROLL),
 	KEYVAL(REP_TIMER, PCI_ERR_COR_REP_TIMER),
+	KEYVAL(CINTERNAL, PCI_ERR_COR_CINTERNAL),
 };
 
 static int cmp_key(const void *av, const void *bv)
diff --git a/aer.y b/aer.y
index e5ecc7d..500dc97 100644
--- a/aer.y
+++ b/aer.y
@@ -34,8 +34,8 @@ static void init(void);
 
 %token AER DOMAIN BUS DEV FN PCI_ID UNCOR_STATUS COR_STATUS HEADER_LOG
 %token <num> TRAIN DLP POISON_TLP FCP COMP_TIME COMP_ABORT UNX_COMP RX_OVER
-%token <num> MALF_TLP ECRC UNSUP
-%token <num> RCVR BAD_TLP BAD_DLLP REP_ROLL REP_TIMER
+%token <num> MALF_TLP ECRC UNSUP INTERNAL
+%token <num> RCVR BAD_TLP BAD_DLLP REP_ROLL REP_TIMER CINTERNAL
 %token <num> SYMBOL NUMBER
 %token <str> PCI_ID_STR
 
@@ -77,14 +77,14 @@ uncor_status_list: /* empty */			{ $$ = 0; }
 	;
 
 uncor_status: TRAIN | DLP | POISON_TLP | FCP | COMP_TIME | COMP_ABORT
-	| UNX_COMP | RX_OVER | MALF_TLP | ECRC | UNSUP | NUMBER
+	| UNX_COMP | RX_OVER | MALF_TLP | ECRC | UNSUP | INTERNAL | NUMBER
 	;
 
 cor_status_list: /* empty */			{ $$ = 0; }
 	| cor_status_list cor_status		{ $$ = $1 | $2; }
 	;
 
-cor_status: RCVR | BAD_TLP | BAD_DLLP | REP_ROLL | REP_TIMER | NUMBER
+cor_status: RCVR | BAD_TLP | BAD_DLLP | REP_ROLL | REP_TIMER | CINTERNAL | NUMBER
 	;
 
 %%
Bjorn Helgaas Oct. 18, 2024, 11:22 p.m. UTC | #7
On Tue, Oct 08, 2024 at 05:16:42PM -0500, Terry Bowman wrote:
> This is a continuation of the CXL port error handling RFC from earlier.[1]
> The RFC resulted in the decision to add CXL PCIe port error handling to
> the existing RCH downstream port handling. This patchset adds the CXL PCIe
> port handling and logging.
> 
> The first 7 patches update the existing AER service driver to support CXL
> PCIe port protocol error handling and reporting. This includes AER service
> driver changes for adding correctable and uncorrectable error support, CXL
> specific recovery handling, and addition of CXL driver callback handlers.
> 
> The following 8 patches address CXL driver support for CXL PCIe port
> protocol errors. This includes the following changes to the CXL drivers:
> mapping CXL port and downstream port RAS registers, interface updates for
> common RCH and VH, adding port specific error handlers, and protocol error
> logging.

Looks like all my comments at
https://lore.kernel.org/r/20241010190726.GA570880@bhelgaas still
apply.

URL broken across lines, distracting timestamps, patch subjects,
no clue about the base commit.
Bowman, Terry Oct. 21, 2024, 7:22 p.m. UTC | #8
Hi Bjorn,

I added a response below.

On 10/18/24 18:22, Bjorn Helgaas wrote:
> On Tue, Oct 08, 2024 at 05:16:42PM -0500, Terry Bowman wrote:
>> This is a continuation of the CXL port error handling RFC from earlier.[1]
>> The RFC resulted in the decision to add CXL PCIe port error handling to
>> the existing RCH downstream port handling. This patchset adds the CXL PCIe
>> port handling and logging.
>>
>> The first 7 patches update the existing AER service driver to support CXL
>> PCIe port protocol error handling and reporting. This includes AER service
>> driver changes for adding correctable and uncorrectable error support, CXL
>> specific recovery handling, and addition of CXL driver callback handlers.
>>
>> The following 8 patches address CXL driver support for CXL PCIe port
>> protocol errors. This includes the following changes to the CXL drivers:
>> mapping CXL port and downstream port RAS registers, interface updates for
>> common RCH and VH, adding port specific error handlers, and protocol error
>> logging.
> 
> Looks like all my comments at
> https://lore.kernel.org/r/20241010190726.GA570880@bhelgaas still
> apply.
> 
> URL broken across lines, distracting timestamps, patch subjects,
> no clue about the base commit.

I added changes for code reuse in pcie_do_recovery() as recommended. I am
finishing testing now and will post v2 shortly.

Regards,
Terry
Fan Ni Oct. 21, 2024, 10:19 p.m. UTC | #9
On Thu, Oct 17, 2024 at 12:27:04PM -0500, Bowman, Terry wrote:
> Hi Fan,
> 
> On 10/17/2024 11:34 AM, Fan Ni wrote:
> > On Tue, Oct 08, 2024 at 05:16:42PM -0500, Terry Bowman wrote:
> > > This is a continuation of the CXL port error handling RFC from earlier.[1]
> > > The RFC resulted in the decision to add CXL PCIe port error handling to
> > > the existing RCH downstream port handling. This patchset adds the CXL PCIe
> > > port handling and logging.
> > > 
> > > The first 7 patches update the existing AER service driver to support CXL
> > > PCIe port protocol error handling and reporting. This includes AER service
> > > driver changes for adding correctable and uncorrectable error support, CXL
> > > specific recovery handling, and addition of CXL driver callback handlers.
> > > 
> > > The following 8 patches address CXL driver support for CXL PCIe port
> > > protocol errors. This includes the following changes to the CXL drivers:
> > > mapping CXL port and downstream port RAS registers, interface updates for
> > > common RCH and VH, adding port specific error handlers, and protocol error
> > > logging.
> > > 
> > > [1] - https://lore.kernel.org/linux-cxl/20240617200411.1426554
> > > -1-terry.bowman@amd.com/
> > > 
> > > Testing:
> > > 
> > > Below are test results for this patchset. This is using Qemu with a root
> > > port (0c:00.0), upstream switch port (0d:00.0),and downstream switch port
> > > (0e:00.0).
> > > 
> > > This was tested using aer-inject updated to support CE and UCE internal
> > > error injection. CXL RAS was set using a test patch (not upstreamed).
> > 
> > Hi Terry,
> > Can you share the aer-inject repo for the testing or the test patch?

Hi Terry,

Could you tell me which code base you use for this patch set?
I hit a lot of issues when trying to apply it on top of "fixes" or
"next" branches.

Fan

> > 
> > Fan
> 
> Sure, but it's easiest to attach the patch here.
> 
> Origin was https://github.com/jderrick/aer-inject.git
> Base is 81701cbb30e35a1a76c3876f55692f91bdb9751b
> 
> Regards,
> Terry

> From ca9277866b506723f46f3acd7b264ffa80c37276 Mon Sep 17 00:00:00 2001
> From: Terry Bowman <terry.bowman@amd.com>
> Date: Thu, 17 Oct 2024 12:12:58 -0500
> Subject: [PATCH] aer-inject: Add internal error injection
> 
> Add corrected (CE) and uncorrected (UCE) AER internal error injection
> support.
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> ---
>  aer.h   | 2 ++
>  aer.lex | 2 ++
>  aer.y   | 8 ++++----
>  3 files changed, 8 insertions(+), 4 deletions(-)
> 
> diff --git a/aer.h b/aer.h
> index a0ad152..e55a731 100644
> --- a/aer.h
> +++ b/aer.h
> @@ -30,11 +30,13 @@ struct aer_error_inj
>  #define  PCI_ERR_UNC_MALF_TLP	0x00040000	/* Malformed TLP */
>  #define  PCI_ERR_UNC_ECRC	0x00080000	/* ECRC Error Status */
>  #define  PCI_ERR_UNC_UNSUP	0x00100000	/* Unsupported Request */
> +#define  PCI_ERR_UNC_INTERNAL   0x00400000      /* Internal error */
>  #define  PCI_ERR_COR_RCVR	0x00000001	/* Receiver Error Status */
>  #define  PCI_ERR_COR_BAD_TLP	0x00000040	/* Bad TLP Status */
>  #define  PCI_ERR_COR_BAD_DLLP	0x00000080	/* Bad DLLP Status */
>  #define  PCI_ERR_COR_REP_ROLL	0x00000100	/* REPLAY_NUM Rollover */
>  #define  PCI_ERR_COR_REP_TIMER	0x00001000	/* Replay Timer Timeout */
> +#define  PCI_ERR_COR_CINTERNAL	0x00004000	/* Internal error */
>  
>  extern void init_aer(struct aer_error_inj *err);
>  extern void submit_aer(struct aer_error_inj *err);
> diff --git a/aer.lex b/aer.lex
> index 6121e4e..4fadd0e 100644
> --- a/aer.lex
> +++ b/aer.lex
> @@ -82,11 +82,13 @@ static struct key {
>  	KEYVAL(MALF_TLP, PCI_ERR_UNC_MALF_TLP),
>  	KEYVAL(ECRC, PCI_ERR_UNC_ECRC),
>  	KEYVAL(UNSUP, PCI_ERR_UNC_UNSUP),
> +	KEYVAL(INTERNAL, PCI_ERR_UNC_INTERNAL),
>  	KEYVAL(RCVR, PCI_ERR_COR_RCVR),
>  	KEYVAL(BAD_TLP, PCI_ERR_COR_BAD_TLP),
>  	KEYVAL(BAD_DLLP, PCI_ERR_COR_BAD_DLLP),
>  	KEYVAL(REP_ROLL, PCI_ERR_COR_REP_ROLL),
>  	KEYVAL(REP_TIMER, PCI_ERR_COR_REP_TIMER),
> +	KEYVAL(CINTERNAL, PCI_ERR_COR_CINTERNAL),
>  };
>  
>  static int cmp_key(const void *av, const void *bv)
> diff --git a/aer.y b/aer.y
> index e5ecc7d..500dc97 100644
> --- a/aer.y
> +++ b/aer.y
> @@ -34,8 +34,8 @@ static void init(void);
>  
>  %token AER DOMAIN BUS DEV FN PCI_ID UNCOR_STATUS COR_STATUS HEADER_LOG
>  %token <num> TRAIN DLP POISON_TLP FCP COMP_TIME COMP_ABORT UNX_COMP RX_OVER
> -%token <num> MALF_TLP ECRC UNSUP
> -%token <num> RCVR BAD_TLP BAD_DLLP REP_ROLL REP_TIMER
> +%token <num> MALF_TLP ECRC UNSUP INTERNAL
> +%token <num> RCVR BAD_TLP BAD_DLLP REP_ROLL REP_TIMER CINTERNAL
>  %token <num> SYMBOL NUMBER
>  %token <str> PCI_ID_STR
>  
> @@ -77,14 +77,14 @@ uncor_status_list: /* empty */			{ $$ = 0; }
>  	;
>  
>  uncor_status: TRAIN | DLP | POISON_TLP | FCP | COMP_TIME | COMP_ABORT
> -	| UNX_COMP | RX_OVER | MALF_TLP | ECRC | UNSUP | NUMBER
> +	| UNX_COMP | RX_OVER | MALF_TLP | ECRC | UNSUP | INTERNAL | NUMBER
>  	;
>  
>  cor_status_list: /* empty */			{ $$ = 0; }
>  	| cor_status_list cor_status		{ $$ = $1 | $2; }
>  	;
>  
> -cor_status: RCVR | BAD_TLP | BAD_DLLP | REP_ROLL | REP_TIMER | NUMBER
> +cor_status: RCVR | BAD_TLP | BAD_DLLP | REP_ROLL | REP_TIMER | CINTERNAL | NUMBER
>  	;
>  
>  %% 
> -- 
> 2.34.1
>
Dan Williams Oct. 22, 2024, 1:43 a.m. UTC | #10
Terry Bowman wrote:
[..]
> Testing:
> 
> Below are test results for this patchset. This is using Qemu with a root
> port (0c:00.0), upstream switch port (0d:00.0),and downstream switch port
> (0e:00.0).
> 
> This was tested using aer-inject updated to support CE and UCE internal
> error injection. CXL RAS was set using a test patch (not upstreamed).

Thanks for these test outputs!

> 
>     Root port UCE:
>     root@tbowman-cxl:~/aer-inject# ./root-uce-inject.sh
>     [   27.318920] pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0c:00.0
>     [   27.320164] pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0c:00.0
>     [   27.321518] pcieport 0000:0c:00.0: PCIe Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
>     [   27.322483] pcieport 0000:0c:00.0:   device [8086:7075] error status/mask=00400000/02000000
>     [   27.323243] pcieport 0000:0c:00.0:    [22] UncorrIntErr
>     [   27.325584] aer_event: 0000:0c:00.0 PCIe Bus Error: severity=Fatal, Uncorrectable Internal Error, TLP Header=Not available

It strikes me that by this point the code knows that it is a "CXL Bus"
error and no longer a "PCIe Bus" error. Given the divergent responses
to Fatal errors based on bus, I think it would help to clarify that the
kernel is panicking due to "CXL Bus", not "PCIe Bus", errors.

>     [   27.325584]
>     [   27.327171] cxl_port_aer_uncorrectable_error: device=0000:0c:00.0 host=pci0000:0c status: 'Memory Address Parity Error'

...i.e. someone may not notice that this is a "cxl" reference in the
backtrace.
Bowman, Terry Oct. 22, 2024, 1:29 p.m. UTC | #11
Hi Dan,

On 10/21/24 20:43, Dan Williams wrote:
> Terry Bowman wrote:
> [..]
>> Testing:
>>
>> Below are test results for this patchset. This is using Qemu with a root
>> port (0c:00.0), upstream switch port (0d:00.0),and downstream switch port
>> (0e:00.0).
>>
>> This was tested using aer-inject updated to support CE and UCE internal
>> error injection. CXL RAS was set using a test patch (not upstreamed).
> 
> Thanks for these test outputs!
> 
>>
>>     Root port UCE:
>>     root@tbowman-cxl:~/aer-inject# ./root-uce-inject.sh
>>     [   27.318920] pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0c:00.0
>>     [   27.320164] pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0c:00.0
>>     [   27.321518] pcieport 0000:0c:00.0: PCIe Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
>>     [   27.322483] pcieport 0000:0c:00.0:   device [8086:7075] error status/mask=00400000/02000000
>>     [   27.323243] pcieport 0000:0c:00.0:    [22] UncorrIntErr
>>     [   27.325584] aer_event: 0000:0c:00.0 PCIe Bus Error: severity=Fatal, Uncorrectable Internal Error, TLP Header=Not available
> 
> It strikes me that by this point the code knows that it is a "CXL Bus"
> error and no longer a "PCIe Bus" error. Given the divergent responses
> to Fatal errors based on bus, I think it would help to clarify that the
> kernel is panicking due to "CXL Bus", not "PCIe Bus", errors.
> 
>>     [   27.325584]
>>     [   27.327171] cxl_port_aer_uncorrectable_error: device=0000:0c:00.0 host=pci0000:0c status: 'Memory Address Parity Error'
> 
> ...i.e. someone may not notice that this is a "cxl" reference in the
> backtrace.

Good idea. I'll add logic to print 'CXL' as the bus type when the erroring
device is a CXL device.

Regards,
Terry