Message ID | 20220119092200.35823-3-sr@denx.de |
---|---|
State | New |
Series | Fully enable AER |
On Wednesday 19 January 2022 10:22:00 Stefan Roese wrote:
> With this change, AER is now enabled on all PCIe devices, also when the
> PCIe device is hot-plugged.
>
> Please note that this change is quite invasive, as with this patch
> applied, AER now will be enabled in the Device Control registers of all
> available PCIe Endpoints, which currently is not the case.
>
> When "pci=noaer" is selected, AER stays disabled of course.

Hello Stefan! I was thinking more about this change and I'm not sure
what happens if AER-capable PCIe device is hotplugged into some PCIe
switch connected in the PCIe hierarchy where Root Port is not
AER-capable (e.g. current linux implementation of pci-aardvark.c and
pci-mvebu.c). My feeling is that in this case AER should not be enabled
as there is nobody who can deliver AER interrupt to the OS. But I really
do not know what is supposed from kernel AER driver, so lets wait for
Bjorn reply.

And when you opened this issue with hotplugging, another thing for
followup changes in future is calling pcie_set_ecrc_checking() function
to align ECRC state of newly hotplugged device with "pci=ecrc=..."
cmdline option. As currently it is done only at that function
set_device_error_reporting().

> Signed-off-by: Stefan Roese <sr@denx.de>
> Cc: Bjorn Helgaas <helgaas@kernel.org>
> Cc: Pali Rohár <pali@kernel.org>
> Cc: Bharat Kumar Gogada <bharat.kumar.gogada@xilinx.com>
> Cc: Michal Simek <michal.simek@xilinx.com>
> Cc: Yao Hongbo <yaohongbo@linux.alibaba.com>
> Cc: Naveen Naidu <naveennaidu479@gmail.com>
> ---
> v3:
> - New patch, replacing the "old" 2/2 patch
>   Now enabling of AER for each PCIe device is done in pci_aer_init(),
>   which also makes sure that AER is enabled in each PCIe device even when
>   it's hot-plugged.
>
>  drivers/pci/pcie/aer.c | 4 ++++
>  1 file changed, 4 insertions(+)
>
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index 9fa1f97e5b27..01a25e4a5168 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -387,6 +387,10 @@ void pci_aer_init(struct pci_dev *dev)
>  	pci_add_ext_cap_save_buffer(dev, PCI_EXT_CAP_ID_ERR, sizeof(u32) * n);
>
>  	pci_aer_clear_status(dev);
> +
> +	/* Enable AER if requested */
> +	if (pci_aer_available())
> +		pci_enable_pcie_error_reporting(dev);
>  }
>
>  void pci_aer_exit(struct pci_dev *dev)
> --
> 2.34.1
>

On Wed, Jan 19, 2022 at 10:22:00AM +0100, Stefan Roese wrote:
> @@ -387,6 +387,10 @@ void pci_aer_init(struct pci_dev *dev)
>  	pci_add_ext_cap_save_buffer(dev, PCI_EXT_CAP_ID_ERR, sizeof(u32) * n);
>
>  	pci_aer_clear_status(dev);
> +
> +	/* Enable AER if requested */
> +	if (pci_aer_available())
> +		pci_enable_pcie_error_reporting(dev);
>  }

Hasn't it always been the device specific driver's responsibility to
call this function?

On Wed, Jan 19, 2022 at 10:25:50AM -0800, Keith Busch wrote:
> On Wed, Jan 19, 2022 at 10:22:00AM +0100, Stefan Roese wrote:
> > @@ -387,6 +387,10 @@ void pci_aer_init(struct pci_dev *dev)
> >  	pci_add_ext_cap_save_buffer(dev, PCI_EXT_CAP_ID_ERR, sizeof(u32) * n);
> >
> >  	pci_aer_clear_status(dev);
> > +
> > +	/* Enable AER if requested */
> > +	if (pci_aer_available())
> > +		pci_enable_pcie_error_reporting(dev);
> >  }
>
> Hasn't it always been the device specific driver's responsibility to
> call this function?

So far it has been done by the driver, because the PCI core doesn't do
it. But is there a reason it should be done by the driver? It
doesn't seem necessarily device-specific.

Bjorn

On Wed, Jan 19, 2022 at 03:00:02PM -0600, Bjorn Helgaas wrote:
> On Wed, Jan 19, 2022 at 10:25:50AM -0800, Keith Busch wrote:
> > On Wed, Jan 19, 2022 at 10:22:00AM +0100, Stefan Roese wrote:
> > > @@ -387,6 +387,10 @@ void pci_aer_init(struct pci_dev *dev)
> > >  	pci_add_ext_cap_save_buffer(dev, PCI_EXT_CAP_ID_ERR, sizeof(u32) * n);
> > >
> > >  	pci_aer_clear_status(dev);
> > > +
> > > +	/* Enable AER if requested */
> > > +	if (pci_aer_available())
> > > +		pci_enable_pcie_error_reporting(dev);
> > >  }
> >
> > Hasn't it always been the device specific driver's responsibility to
> > call this function?
>
> So far it has been done by the driver, because the PCI core doesn't do
> it. But is there a reason it should be done by the driver? It
> doesn't seem necessarily device-specific.

I was thinking the device driver knows if it provides .err_handler
callbacks in order to respond to AER handling, so it would know if it is
ready for its device to enable error reporting. But I guess it doesn't
really matter if the driver provides callbacks anyway.

On 1/19/22 11:37, Pali Rohár wrote:
> On Wednesday 19 January 2022 10:22:00 Stefan Roese wrote:
>> With this change, AER is now enabled on all PCIe devices, also when the
>> PCIe device is hot-plugged.
>>
>> Please note that this change is quite invasive, as with this patch
>> applied, AER now will be enabled in the Device Control registers of all
>> available PCIe Endpoints, which currently is not the case.
>>
>> When "pci=noaer" is selected, AER stays disabled of course.
>
> Hello Stefan! I was thinking more about this change and I'm not sure
> what happens if AER-capable PCIe device is hotplugged into some PCIe
> switch connected in the PCIe hierarchy where Root Port is not
> AER-capable (e.g. current linux implementation of pci-aardvark.c and
> pci-mvebu.c). My feeling is that in this case AER should not be enabled
> as there is nobody who can deliver AER interrupt to the OS. But I really
> do not know what is supposed from kernel AER driver, so lets wait for
> Bjorn reply.

But what happens right now, when a device driver like the NVMe driver
calls pci_enable_pcie_error_reporting() ? There is also no checking,
if the connected Root Port or some switch / bridge in-between supports
AER or not. IIUTC, this is identical to what this patch here does.
Enable AER in the device and if the upstream infrastructure does not
support AER, then the AER event will just not be received by the
Kernel. Which is most likely not worse than not enabling AER at all
on this device. Or am I missing something?

> And when you opened this issue with hotplugging, another thing for
> followup changes in future is calling pcie_set_ecrc_checking() function
> to align ECRC state of newly hotplugged device with "pci=ecrc=..."
> cmdline option. As currently it is done only at that function
> set_device_error_reporting().

Agreed, this is another area to look into. Not sure if it's okay to
address this, once this patch-set has been accepted (if it will be).

Thanks,
Stefan

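
For illustration only, the check that Pali asks about (and that, per Stefan's reply, neither the drivers nor this patch perform) could look roughly like the userspace sketch below. It is not kernel code: `struct model_port`, `model_find_root()` and `model_root_can_signal_aer()` are hypothetical names invented for this example, and the parent pointer merely stands in for the upstream bridge of a device.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a node in the PCIe hierarchy (not struct pci_dev). */
struct model_port {
	bool has_aer_cap;                /* node exposes the AER capability */
	const struct model_port *parent; /* upstream node, NULL at the Root Port */
};

/* Walk upstream until the node without a parent (the Root Port) is reached. */
static const struct model_port *model_find_root(const struct model_port *dev)
{
	assert(dev != NULL);
	while (dev->parent != NULL)
		dev = dev->parent;
	return dev;
}

/*
 * True if the Root Port above dev is AER-capable, i.e. something could
 * deliver an AER interrupt to the OS. Intermediate switches are ignored
 * in this simplified model.
 */
static bool model_root_can_signal_aer(const struct model_port *dev)
{
	return model_find_root(dev)->has_aer_cap;
}
```

In the scenario from the mail (AER-capable endpoint behind a switch, Root Port without AER), `model_root_can_signal_aer()` returns false for the endpoint. As discussed above, enabling AER in the endpoint anyway is most likely harmless: the error message is simply never received by the kernel.
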
On 1/19/22 22:18, Keith Busch wrote:
> On Wed, Jan 19, 2022 at 03:00:02PM -0600, Bjorn Helgaas wrote:
>> On Wed, Jan 19, 2022 at 10:25:50AM -0800, Keith Busch wrote:
>>> On Wed, Jan 19, 2022 at 10:22:00AM +0100, Stefan Roese wrote:
>>>> @@ -387,6 +387,10 @@ void pci_aer_init(struct pci_dev *dev)
>>>>  	pci_add_ext_cap_save_buffer(dev, PCI_EXT_CAP_ID_ERR, sizeof(u32) * n);
>>>>
>>>>  	pci_aer_clear_status(dev);
>>>> +
>>>> +	/* Enable AER if requested */
>>>> +	if (pci_aer_available())
>>>> +		pci_enable_pcie_error_reporting(dev);
>>>>  }
>>>
>>> Hasn't it always been the device specific driver's responsibility to
>>> call this function?
>>
>> So far it has been done by the driver, because the PCI core doesn't do
>> it. But is there a reason it should be done by the driver? It
>> doesn't seem necessarily device-specific.
>
> I was thinking the device driver knows if it provides .err_handler
> callbacks in order to respond to AER handling, so it would know if it is
> ready for its device to enable error reporting. But I guess it doesn't
> really matter if the driver provides callbacks anyway.

That's my understanding as well.

Thanks,
Stefan

On Thursday 20 January 2022 08:31:31 Stefan Roese wrote:
> On 1/19/22 11:37, Pali Rohár wrote:
> > On Wednesday 19 January 2022 10:22:00 Stefan Roese wrote:
> > > With this change, AER is now enabled on all PCIe devices, also when the
> > > PCIe device is hot-plugged.
> > >
> > > Please note that this change is quite invasive, as with this patch
> > > applied, AER now will be enabled in the Device Control registers of all
> > > available PCIe Endpoints, which currently is not the case.
> > >
> > > When "pci=noaer" is selected, AER stays disabled of course.
> >
> > Hello Stefan! I was thinking more about this change and I'm not sure
> > what happens if AER-capable PCIe device is hotplugged into some PCIe
> > switch connected in the PCIe hierarchy where Root Port is not
> > AER-capable (e.g. current linux implementation of pci-aardvark.c and
> > pci-mvebu.c). My feeling is that in this case AER should not be enabled
> > as there is nobody who can deliver AER interrupt to the OS. But I really
> > do not know what is supposed from kernel AER driver, so lets wait for
> > Bjorn reply.
>
> But what happens right now, when a device driver like the NVMe driver
> calls pci_enable_pcie_error_reporting() ? There is also no checking,
> if the connected Root Port or some switch / bridge in-between supports
> AER or not. IIUTC, this is identical to what this patch here does.
> Enable AER in the device and if the upstream infrastructure does not
> support AER, then the AER event will just not be received by the
> Kernel. Which is most likely not worse than not enabling AER at all
> on this device. Or am I missing something?

You are right! Seems that AER code has lot of candidates for followup
fixes/cleanups...

> > And when you opened this issue with hotplugging, another thing for
> > followup changes in future is calling pcie_set_ecrc_checking() function
> > to align ECRC state of newly hotplugged device with "pci=ecrc=..."
> > cmdline option. As currently it is done only at that function
> > set_device_error_reporting().
>
> Agreed, this is another area to look into. Not sure if it's okay to
> address this, once this patch-set has been accepted (if it will be).
>
> Thanks,
> Stefan

On Thu, Jan 20, 2022 at 08:31:31AM +0100, Stefan Roese wrote:
> On 1/19/22 11:37, Pali Rohár wrote:

> > And when you opened this issue with hotplugging, another thing for
> > followup changes in future is calling pcie_set_ecrc_checking() function
> > to align ECRC state of newly hotplugged device with "pci=ecrc=..."
> > cmdline option. As currently it is done only at that function
> > set_device_error_reporting().
>
> Agreed, this is another area to look into. Not sure if it's okay to
> address this, once this patch-set has been accepted (if it will be).

ECRC might be something that could be peeled off first to reduce the
complexity of AER itself.

The ECRC capability and enable bits are in the AER Capability, so I
think it should be moved to pci_aer_init() so it happens for every
device as we enumerate it.

As far as I can tell, there is no requirement that every device in the
path support ECRC, so it can be enabled independently for each device.
I think devices that don't support ECRC checking must handle TLPs with
ECRC without error.

Per Table 6-5, ECRC check failures result in a device logging the
prefix/header of the TLP and sending ERR_NONFATAL or ERR_COR. I think
this is useful regardless of whether AER interrupts are enabled
because error information is logged where the ECRC failure was
detected.

Bjorn

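
The per-device nature of ECRC described above can be sketched in userspace as follows. This is a simplified model, not the kernel's implementation: the bit values are those of the Advanced Error Capabilities and Control register as defined in Linux's `pci_regs.h`, while `model_enable_ecrc()` is a hypothetical helper that operates on a plain register value instead of config space.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bits of the AER "Advanced Error Capabilities and Control" register. */
#define PCI_ERR_CAP_ECRC_GENC 0x00000020u /* ECRC Generation Capable */
#define PCI_ERR_CAP_ECRC_GENE 0x00000040u /* ECRC Generation Enable */
#define PCI_ERR_CAP_ECRC_CHKC 0x00000080u /* ECRC Check Capable */
#define PCI_ERR_CAP_ECRC_CHKE 0x00000100u /* ECRC Check Enable */

/*
 * Hypothetical helper: set each enable bit only when the matching
 * capability bit is present, so every device is handled independently
 * of what its neighbours in the path support.
 */
static void model_enable_ecrc(uint32_t *err_cap)
{
	assert(err_cap != NULL);
	if (*err_cap & PCI_ERR_CAP_ECRC_GENC)
		*err_cap |= PCI_ERR_CAP_ECRC_GENE;
	if (*err_cap & PCI_ERR_CAP_ECRC_CHKC)
		*err_cap |= PCI_ERR_CAP_ECRC_CHKE;
}
```

A device that reports neither capability bit is left untouched, which matches the expectation stated above that such devices simply handle TLPs carrying an ECRC without error.
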
On 1/20/22 16:46, Bjorn Helgaas wrote:
> On Thu, Jan 20, 2022 at 08:31:31AM +0100, Stefan Roese wrote:
>> On 1/19/22 11:37, Pali Rohár wrote:
>
>>> And when you opened this issue with hotplugging, another thing for
>>> followup changes in future is calling pcie_set_ecrc_checking() function
>>> to align ECRC state of newly hotplugged device with "pci=ecrc=..."
>>> cmdline option. As currently it is done only at that function
>>> set_device_error_reporting().
>>
>> Agreed, this is another area to look into. Not sure if it's okay to
>> address this, once this patch-set has been accepted (if it will be).
>
> ECRC might be something that could be peeled off first to reduce the
> complexity of AER itself.
>
> The ECRC capability and enable bits are in the AER Capability, so I
> think it should be moved to pci_aer_init() so it happens for every
> device as we enumerate it.

Just that there is no misunderstanding: You are thinking about something
like this:

diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 9fa1f97e5b27..5585fefc4d0e 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -387,6 +387,9 @@ void pci_aer_init(struct pci_dev *dev)
 	pci_add_ext_cap_save_buffer(dev, PCI_EXT_CAP_ID_ERR, sizeof(u32) * n);

 	pci_aer_clear_status(dev);
+
+	/* Enable ECRC checking if enabled and configured */
+	pcie_set_ecrc_checking(dev);
 }

 void pci_aer_exit(struct pci_dev *dev)
@@ -1223,9 +1226,6 @@ static int set_device_error_reporting(struct pci_dev *dev, void *data)
 		pci_disable_pcie_error_reporting(dev);
 	}

-	if (enable)
-		pcie_set_ecrc_checking(dev);
-
 	return 0;
 }

Perhaps as patch 1/3 in this patch series? Or as some completely
separate patch?

Thanks,
Stefan

> As far as I can tell, there is no requirement that every device in the
> path support ECRC, so it can be enabled independently for each device.
> I think devices that don't support ECRC checking must handle TLPs with
> ECRC without error.
>
> Per Table 6-5, ECRC check failures result in a device logging the
> prefix/header of the TLP and sending ERR_NONFATAL or ERR_COR. I think
> this is useful regardless of whether AER interrupts are enabled
> because error information is logged where the ECRC failure was
> detected.
>
> Bjorn
>

Best regards,
Stefan Roese

On Thu, Jan 20, 2022 at 05:59:22PM +0100, Stefan Roese wrote:
> On 1/20/22 16:46, Bjorn Helgaas wrote:
> > On Thu, Jan 20, 2022 at 08:31:31AM +0100, Stefan Roese wrote:
> > > On 1/19/22 11:37, Pali Rohár wrote:
> >
> > > > And when you opened this issue with hotplugging, another thing for
> > > > followup changes in future is calling pcie_set_ecrc_checking() function
> > > > to align ECRC state of newly hotplugged device with "pci=ecrc=..."
> > > > cmdline option. As currently it is done only at that function
> > > > set_device_error_reporting().
> > >
> > > Agreed, this is another area to look into. Not sure if it's okay to
> > > address this, once this patch-set has been accepted (if it will be).
> >
> > ECRC might be something that could be peeled off first to reduce the
> > complexity of AER itself.
> >
> > The ECRC capability and enable bits are in the AER Capability, so I
> > think it should be moved to pci_aer_init() so it happens for every
> > device as we enumerate it.
>
> Just that there is no misunderstanding: You are thinking about something
> like this:
>
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index 9fa1f97e5b27..5585fefc4d0e 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -387,6 +387,9 @@ void pci_aer_init(struct pci_dev *dev)
>  	pci_add_ext_cap_save_buffer(dev, PCI_EXT_CAP_ID_ERR, sizeof(u32) * n);
>
>  	pci_aer_clear_status(dev);
> +
> +	/* Enable ECRC checking if enabled and configured */
> +	pcie_set_ecrc_checking(dev);
>  }
>
>  void pci_aer_exit(struct pci_dev *dev)
> @@ -1223,9 +1226,6 @@ static int set_device_error_reporting(struct pci_dev *dev, void *data)
>  		pci_disable_pcie_error_reporting(dev);
>  	}
>
> -	if (enable)
> -		pcie_set_ecrc_checking(dev);
> -
>  	return 0;
>  }
>
> Perhaps as patch 1/3 in this patch series? Or as some completely
> separate patch?

Yes. Probably as 1/3, since subsequent patches may depend on this one,
or at least may not apply cleanly without this one.

Bjorn

diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 9fa1f97e5b27..01a25e4a5168 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -387,6 +387,10 @@ void pci_aer_init(struct pci_dev *dev)
 	pci_add_ext_cap_save_buffer(dev, PCI_EXT_CAP_ID_ERR, sizeof(u32) * n);

 	pci_aer_clear_status(dev);
+
+	/* Enable AER if requested */
+	if (pci_aer_available())
+		pci_enable_pcie_error_reporting(dev);
 }

 void pci_aer_exit(struct pci_dev *dev)

With this change, AER is now enabled on all PCIe devices, also when the
PCIe device is hot-plugged.

Please note that this change is quite invasive, as with this patch
applied, AER now will be enabled in the Device Control registers of all
available PCIe Endpoints, which currently is not the case.

When "pci=noaer" is selected, AER stays disabled of course.

Signed-off-by: Stefan Roese <sr@denx.de>
Cc: Bjorn Helgaas <helgaas@kernel.org>
Cc: Pali Rohár <pali@kernel.org>
Cc: Bharat Kumar Gogada <bharat.kumar.gogada@xilinx.com>
Cc: Michal Simek <michal.simek@xilinx.com>
Cc: Yao Hongbo <yaohongbo@linux.alibaba.com>
Cc: Naveen Naidu <naveennaidu479@gmail.com>
---
v3:
- New patch, replacing the "old" 2/2 patch
  Now enabling of AER for each PCIe device is done in pci_aer_init(),
  which also makes sure that AER is enabled in each PCIe device even when
  it's hot-plugged.

 drivers/pci/pcie/aer.c | 4 ++++
 1 file changed, 4 insertions(+)
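
To make concrete what "enabled in the Device Control registers" means, here is a minimal userspace model of the patched flow. It is a sketch under assumptions, not the kernel code: `struct model_dev` and `model_aer_init()` are hypothetical, the `aer_available` parameter stands in for the result of `pci_aer_available()`, and only the four error-reporting enable bits of Device Control (values as in Linux's `pci_regs.h`) are modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Error-reporting enable bits of the PCIe Device Control register. */
#define PCI_EXP_DEVCTL_CERE  0x0001 /* Correctable Error Reporting Enable */
#define PCI_EXP_DEVCTL_NFERE 0x0002 /* Non-Fatal Error Reporting Enable */
#define PCI_EXP_DEVCTL_FERE  0x0004 /* Fatal Error Reporting Enable */
#define PCI_EXP_DEVCTL_URRE  0x0008 /* Unsupported Request Reporting Enable */

#define MODEL_AER_BITS (PCI_EXP_DEVCTL_CERE | PCI_EXP_DEVCTL_NFERE | \
			PCI_EXP_DEVCTL_FERE | PCI_EXP_DEVCTL_URRE)

/* Hypothetical stand-in for struct pci_dev; not the kernel type. */
struct model_dev {
	bool has_aer_cap; /* device exposes the AER extended capability */
	uint16_t devctl;  /* modelled Device Control register value */
};

/*
 * Models the patched init path: devices without the AER capability are
 * skipped, and the reporting bits are set only when AER is available
 * (for example, not disabled via "pci=noaer").
 */
static void model_aer_init(struct model_dev *dev, bool aer_available)
{
	assert(dev != NULL);
	if (!dev->has_aer_cap)
		return;
	if (aer_available)
		dev->devctl |= MODEL_AER_BITS;
}
```

Because the model is called once per device at initialisation time, it applies equally to devices present at boot and to hot-plugged ones, which is the point of moving the enable step into pci_aer_init().
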