Message ID | 0a443323ab64ba8c0fc6caa03ca56ecd4d038ea3.1633453452.git.naveennaidu479@gmail.com (mailing list archive) |
---|---|
State | Handled Elsewhere, archived |
Headers | show |
Series | Fix long standing AER Error Handling Issues | expand |
[+cc Keith, Sinan, Oza] On Tue, Oct 05, 2021 at 10:48:12PM +0530, Naveen Naidu wrote: > In the EDR path, AER registers are cleared *after* DPC error event is > processed. The process stack in EDR is: > > edr_handle_event() > dpc_process_error() > pci_aer_raw_clear_status() > pcie_do_recovery() > > But in DPC path, AER status registers are cleared *while* processing > the error. The process stack in DPC is: > > dpc_handler() > dpc_process_error() > pci_aer_clear_status() > pcie_do_recovery() These are accurate but they both include dpc_process_error(), so we need a hint to show why the one here is different from the one in the EDR path, e.g., dpc_handler dpc_process_error if (reason == 0) pci_aer_clear_status # uncorrectable errors only pcie_do_recovery > In EDR path, AER status registers are cleared irrespective of whether > the error was an RP PIO or unmasked uncorrectable error. But in DPC, the > AER status registers are cleared only when it's an unmasked uncorrectable > error. > > This leads to two different behaviours for the same task (handling of > DPC errors) in FFS systems and when native OS has control. FFS? I'd really like to have a specific example of how a user would observe this difference. I know you probably don't have two systems to compare like that, but maybe we can work it out manually. I guess you're saying the problem is in the native DPC handling, and we don't clear the AER status registers for ERR_NONFATAL, ERR_NONFATAL, etc., right? I think the current behavior is from 8aefa9b0d910 ("PCI/DPC: Print AER status in DPC event handling"), where Keith explicitly mentions those cases. The commit log here should connect back to that and explain whether something has changed. I cc'd Keith and the reviewers of that change in case any of them have time to dig into this again. > Bring the same semantics for clearing the AER status register in EDR > path and DPC path. > > Signed-off-by: Naveen Naidu <naveennaidu479@gmail.com> > --- > drivers/pci/pcie/dpc.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c > index faf4a1e77fab..68899a3db126 100644 > --- a/drivers/pci/pcie/dpc.c > +++ b/drivers/pci/pcie/dpc.c > @@ -288,7 +288,6 @@ void dpc_process_error(struct pci_dev *pdev) > dpc_get_aer_uncorrect_severity(pdev, &info) && > aer_get_device_error_info(pdev, &info)) { > aer_print_error(pdev, &info); > - pci_aer_clear_status(pdev); > } > } > > @@ -297,6 +296,7 @@ static irqreturn_t dpc_handler(int irq, void *context) > struct pci_dev *pdev = context; > > dpc_process_error(pdev); > + pci_aer_clear_status(pdev); > > /* We configure DPC so it only triggers on ERR_FATAL */ > pcie_do_recovery(pdev, pci_channel_io_frozen, dpc_reset_link); > -- > 2.25.1 > > _______________________________________________ > Linux-kernel-mentees mailing list > Linux-kernel-mentees@lists.linuxfoundation.org > https://lists.linuxfoundation.org/mailman/listinfo/linux-kernel-mentees
On 20/10, Bjorn Helgaas wrote: > [+cc Keith, Sinan, Oza] > > On Tue, Oct 05, 2021 at 10:48:12PM +0530, Naveen Naidu wrote: > > In the EDR path, AER registers are cleared *after* DPC error event is > > processed. The process stack in EDR is: > > > > edr_handle_event() > > dpc_process_error() > > pci_aer_raw_clear_status() > > pcie_do_recovery() > > > > But in DPC path, AER status registers are cleared *while* processing > > the error. The process stack in DPC is: > > > > dpc_handler() > > dpc_process_error() > > pci_aer_clear_status() > > pcie_do_recovery() > > These are accurate but they both include dpc_process_error(), so we > need a hint to show why the one here is different from the one in the > EDR path, e.g., > > dpc_handler > dpc_process_error > if (reason == 0) > pci_aer_clear_status # uncorrectable errors only > pcie_do_recovery > > > In EDR path, AER status registers are cleared irrespective of whether > > the error was an RP PIO or unmasked uncorrectable error. But in DPC, the > > AER status registers are cleared only when it's an unmasked uncorrectable > > error. > > > > This leads to two different behaviours for the same task (handling of > > DPC errors) in FFS systems and when native OS has control. > > FFS? > Firmware First Systems > I'd really like to have a specific example of how a user would observe > this difference. I know you probably don't have two systems to > compare like that, but maybe we can work it out manually. > Apologies again! Reading through the code again and the specification, I realize that my understanding was very incorrect at the time of making this patch. I grossly oversimplified EDR and DPC when I was learning about it. I'll drop this patch when I send the v5 for the series. Apologies again ^^' > I guess you're saying the problem is in the native DPC handling, and > we don't clear the AER status registers for ERR_NONFATAL, > ERR_NONFATAL, etc., right? > But yes, I did have this question though (I wasn't able to find the answers to it when reading the spec). Why do we not clear the entire ERR_NONFATAL and ERR_FATAL registers in the DPC path just like EDR does using the pci_aer_raw_clear_status() before going to pcie_do_recovery() I am sure I might have missed something in the spec. I guess I'll look/re-read these bits again. Thanks for the review :) > I think the current behavior is from 8aefa9b0d910 ("PCI/DPC: Print AER > status in DPC event handling"), where Keith explicitly mentions those > cases. The commit log here should connect back to that and explain > whether something has changed. > > I cc'd Keith and the reviewers of that change in case any of them have > time to dig into this again. > > > Bring the same semantics for clearing the AER status register in EDR > > path and DPC path. > > > > Signed-off-by: Naveen Naidu <naveennaidu479@gmail.com> > > --- > > drivers/pci/pcie/dpc.c | 2 +- > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c > > index faf4a1e77fab..68899a3db126 100644 > > --- a/drivers/pci/pcie/dpc.c > > +++ b/drivers/pci/pcie/dpc.c > > @@ -288,7 +288,6 @@ void dpc_process_error(struct pci_dev *pdev) > > dpc_get_aer_uncorrect_severity(pdev, &info) && > > aer_get_device_error_info(pdev, &info)) { > > aer_print_error(pdev, &info); > > - pci_aer_clear_status(pdev); > > } > > } > > > > @@ -297,6 +296,7 @@ static irqreturn_t dpc_handler(int irq, void *context) > > struct pci_dev *pdev = context; > > > > dpc_process_error(pdev); > > + pci_aer_clear_status(pdev); > > > > /* We configure DPC so it only triggers on ERR_FATAL */ > > pcie_do_recovery(pdev, pci_channel_io_frozen, dpc_reset_link); > > -- > > 2.25.1 > > > > _______________________________________________ > > Linux-kernel-mentees mailing list > > Linux-kernel-mentees@lists.linuxfoundation.org > > https://lists.linuxfoundation.org/mailman/listinfo/linux-kernel-mentees
On Thu, Oct 21, 2021 at 10:23:30PM +0530, Naveen Naidu wrote: > On 20/10, Bjorn Helgaas wrote: > > On Tue, Oct 05, 2021 at 10:48:12PM +0530, Naveen Naidu wrote: > > > In EDR path, AER status registers are cleared irrespective of whether > > > the error was an RP PIO or unmasked uncorrectable error. But in DPC, the > > > AER status registers are cleared only when it's an unmasked uncorrectable > > > error. > > > > > > This leads to two different behaviours for the same task (handling of > > > DPC errors) in FFS systems and when native OS has control. > > > > FFS? > > Firmware First Systems I assumed that's what it was, but it's helpful to use the same terms used by the specs to make things easier to find. I don't think it's actually the case that "Firmware First" necessary applies to the entire system, since the ACPI FIRMWARE_FIRST flag is a per-error source thing, not a per-system thing.
diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c index faf4a1e77fab..68899a3db126 100644 --- a/drivers/pci/pcie/dpc.c +++ b/drivers/pci/pcie/dpc.c @@ -288,7 +288,6 @@ void dpc_process_error(struct pci_dev *pdev) dpc_get_aer_uncorrect_severity(pdev, &info) && aer_get_device_error_info(pdev, &info)) { aer_print_error(pdev, &info); - pci_aer_clear_status(pdev); } } @@ -297,6 +296,7 @@ static irqreturn_t dpc_handler(int irq, void *context) struct pci_dev *pdev = context; dpc_process_error(pdev); + pci_aer_clear_status(pdev); /* We configure DPC so it only triggers on ERR_FATAL */ pcie_do_recovery(pdev, pci_channel_io_frozen, dpc_reset_link);
In the EDR path, AER registers are cleared *after* DPC error event is processed. The process stack in EDR is: edr_handle_event() dpc_process_error() pci_aer_raw_clear_status() pcie_do_recovery() But in DPC path, AER status registers are cleared *while* processing the error. The process stack in DPC is: dpc_handler() dpc_process_error() pci_aer_clear_status() pcie_do_recovery() In EDR path, AER status registers are cleared irrespective of whether the error was an RP PIO or unmasked uncorrectable error. But in DPC, the AER status registers are cleared only when it's an unmasked uncorrectable error. This leads to two different behaviours for the same task (handling of DPC errors) in FFS systems and when native OS has control. Bring the same semantics for clearing the AER status register in EDR path and DPC path. Signed-off-by: Naveen Naidu <naveennaidu479@gmail.com> --- drivers/pci/pcie/dpc.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)