Message ID | 1576237474-32021-1-git-send-email-yangyicong@hisilicon.com |
---|---|
State | New |
Headers | show |
Series | PCI:AER:Notify which device has no error_detected callback | expand |
On Fri, Dec 13, 2019 at 07:44:34PM +0800, Yicong Yang wrote: > The PCI error recovery will fail if any device under > root port doesn't have an error_detected callback. > Currently only failure result is printed, which is > not enough to determine which device leads to the > failure and the detailed failure reason. > > Add print information if certain device under root > port has no error_detected callback. > > Signed-off-by: Yicong Yang <yangyicong@hisilicon.com> Applied to pci/aer for v5.6, thanks! I also added a trivial follow-on patch to factor out the "AER: " prefix (attached below). This code is now used by DPC as well as AER, so "AER: " might not be exactly the correct prefix, but I didn't try to untangle that. > --- > drivers/pci/pcie/err.c | 4 +++- > 1 file changed, 3 insertions(+), 1 deletion(-) > > diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c > index b0e6048..ec37c33 100644 > --- a/drivers/pci/pcie/err.c > +++ b/drivers/pci/pcie/err.c > @@ -61,8 +61,10 @@ static int report_error_detected(struct pci_dev *dev, > * error callbacks of "any" device in the subtree, and will > * exit in the disconnected error state. > */ > - if (dev->hdr_type != PCI_HEADER_TYPE_BRIDGE) > + if (dev->hdr_type != PCI_HEADER_TYPE_BRIDGE) { > vote = PCI_ERS_RESULT_NO_AER_DRIVER; > + pci_info(dev, "AER: Device has no error_detected callback\n"); > + } > else > vote = PCI_ERS_RESULT_NONE; > } else { > -- > 2.8.1 > commit 9694ef043ea4 ("PCI/AER: Factor message prefixes with dev_fmt()") Author: Bjorn Helgaas <bhelgaas@google.com> Date: Fri Dec 13 16:46:05 2019 -0600 PCI/AER: Factor message prefixes with dev_fmt() Define dev_fmt() with the common prefix of log messages so we don't have to repeat it in every printk. No functional change intended. Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c index abde5e000f5d..747ef8b41d1f 100644 --- a/drivers/pci/pcie/err.c +++ b/drivers/pci/pcie/err.c @@ -10,6 +10,8 @@ * Zhang Yanmin (yanmin.zhang@intel.com) */ +#define dev_fmt(fmt) "AER: " fmt + #include <linux/pci.h> #include <linux/module.h> #include <linux/kernel.h> @@ -64,7 +66,7 @@ static int report_error_detected(struct pci_dev *dev, */ if (dev->hdr_type != PCI_HEADER_TYPE_BRIDGE) { vote = PCI_ERS_RESULT_NO_AER_DRIVER; - pci_info(dev, "AER: Driver has no error_detected callback\n"); + pci_info(dev, "driver has no error_detected callback\n"); } else { vote = PCI_ERS_RESULT_NONE; } @@ -236,12 +238,12 @@ void pcie_do_recovery(struct pci_dev *dev, enum pci_channel_state state, pci_aer_clear_device_status(dev); pci_cleanup_aer_uncorrect_error_status(dev); - pci_info(dev, "AER: Device recovery successful\n"); + pci_info(dev, "device recovery successful\n"); return; failed: pci_uevent_ers(dev, PCI_ERS_RESULT_DISCONNECT); /* TODO: Should kernel panic here? */ - pci_info(dev, "AER: Device recovery failed\n"); + pci_info(dev, "device recovery failed\n"); }
On 2019/12/14 6:57, Bjorn Helgaas wrote: > On Fri, Dec 13, 2019 at 07:44:34PM +0800, Yicong Yang wrote: >> The PCI error recovery will fail if any device under >> root port doesn't have an error_detected callback. >> Currently only failure result is printed, which is >> not enough to determine which device leads to the >> failure and the detailed failure reason. >> >> Add print information if certain device under root >> port has no error_detected callback. >> >> Signed-off-by: Yicong Yang <yangyicong@hisilicon.com> > Applied to pci/aer for v5.6, thanks! > > I also added a trivial follow-on patch to factor out the "AER: " > prefix (attached below). This code is now used by DPC as well as AER, > so "AER: " might not be exactly the correct prefix, but I didn't try > to untangle that. > >> --- >> drivers/pci/pcie/err.c | 4 +++- >> 1 file changed, 3 insertions(+), 1 deletion(-) >> >> diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c >> index b0e6048..ec37c33 100644 >> --- a/drivers/pci/pcie/err.c >> +++ b/drivers/pci/pcie/err.c >> @@ -61,8 +61,10 @@ static int report_error_detected(struct pci_dev *dev, >> * error callbacks of "any" device in the subtree, and will >> * exit in the disconnected error state. >> */ >> - if (dev->hdr_type != PCI_HEADER_TYPE_BRIDGE) >> + if (dev->hdr_type != PCI_HEADER_TYPE_BRIDGE) { >> vote = PCI_ERS_RESULT_NO_AER_DRIVER; >> + pci_info(dev, "AER: Device has no error_detected callback\n"); >> + } >> else >> vote = PCI_ERS_RESULT_NONE; >> } else { >> -- >> 2.8.1 >> > commit 9694ef043ea4 ("PCI/AER: Factor message prefixes with dev_fmt()") > Author: Bjorn Helgaas <bhelgaas@google.com> > Date: Fri Dec 13 16:46:05 2019 -0600 > > PCI/AER: Factor message prefixes with dev_fmt() > > Define dev_fmt() with the common prefix of log messages so we don't have to > repeat it in every printk. No functional change intended. > > Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> > > diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c > index abde5e000f5d..747ef8b41d1f 100644 > --- a/drivers/pci/pcie/err.c > +++ b/drivers/pci/pcie/err.c > @@ -10,6 +10,8 @@ > * Zhang Yanmin (yanmin.zhang@intel.com) > */ > > +#define dev_fmt(fmt) "AER: " fmt > + > #include <linux/pci.h> > #include <linux/module.h> > #include <linux/kernel.h> > @@ -64,7 +66,7 @@ static int report_error_detected(struct pci_dev *dev, > */ > if (dev->hdr_type != PCI_HEADER_TYPE_BRIDGE) { > vote = PCI_ERS_RESULT_NO_AER_DRIVER; > - pci_info(dev, "AER: Driver has no error_detected callback\n"); > + pci_info(dev, "driver has no error_detected callback\n"); I think use "device" here is more proper. Sometimes the device is even not bind to the driver, which will also lead to the recovery failure. > } else { > vote = PCI_ERS_RESULT_NONE; > } > @@ -236,12 +238,12 @@ void pcie_do_recovery(struct pci_dev *dev, enum pci_channel_state state, > > pci_aer_clear_device_status(dev); > pci_cleanup_aer_uncorrect_error_status(dev); > - pci_info(dev, "AER: Device recovery successful\n"); > + pci_info(dev, "device recovery successful\n"); > return; > > failed: > pci_uevent_ers(dev, PCI_ERS_RESULT_DISCONNECT); > > /* TODO: Should kernel panic here? */ > - pci_info(dev, "AER: Device recovery failed\n"); > + pci_info(dev, "device recovery failed\n"); > } > > . >
[+cc Alex] On Mon, Dec 16, 2019 at 06:30:11PM +0800, Yicong Yang wrote: > On 2019/12/14 6:57, Bjorn Helgaas wrote: > > On Fri, Dec 13, 2019 at 07:44:34PM +0800, Yicong Yang wrote: > >> The PCI error recovery will fail if any device under > >> root port doesn't have an error_detected callback. > >> Currently only failure result is printed, which is > >> not enough to determine which device leads to the > >> failure and the detailed failure reason. > >> > >> Add print information if certain device under root > >> port has no error_detected callback. > >> > >> Signed-off-by: Yicong Yang <yangyicong@hisilicon.com> > > Applied to pci/aer for v5.6, thanks! > > > > I also added a trivial follow-on patch to factor out the "AER: " > > prefix (attached below). This code is now used by DPC as well as AER, > > so "AER: " might not be exactly the correct prefix, but I didn't try > > to untangle that. > > > >> --- > >> drivers/pci/pcie/err.c | 4 +++- > >> 1 file changed, 3 insertions(+), 1 deletion(-) > >> > >> diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c > >> index b0e6048..ec37c33 100644 > >> --- a/drivers/pci/pcie/err.c > >> +++ b/drivers/pci/pcie/err.c > >> @@ -61,8 +61,10 @@ static int report_error_detected(struct pci_dev *dev, > >> * error callbacks of "any" device in the subtree, and will > >> * exit in the disconnected error state. > >> */ > >> - if (dev->hdr_type != PCI_HEADER_TYPE_BRIDGE) > >> + if (dev->hdr_type != PCI_HEADER_TYPE_BRIDGE) { > >> vote = PCI_ERS_RESULT_NO_AER_DRIVER; > >> + pci_info(dev, "AER: Device has no error_detected callback\n"); > >> + } > >> else > >> vote = PCI_ERS_RESULT_NONE; > >> } else { > >> -- > >> 2.8.1 > >> > > commit 9694ef043ea4 ("PCI/AER: Factor message prefixes with dev_fmt()") > > Author: Bjorn Helgaas <bhelgaas@google.com> > > Date: Fri Dec 13 16:46:05 2019 -0600 > > > > PCI/AER: Factor message prefixes with dev_fmt() > > > > Define dev_fmt() with the common prefix of log messages so we don't have to > > repeat it in every printk. No functional change intended. > > > > Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> > > > > diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c > > index abde5e000f5d..747ef8b41d1f 100644 > > --- a/drivers/pci/pcie/err.c > > +++ b/drivers/pci/pcie/err.c > > @@ -10,6 +10,8 @@ > > * Zhang Yanmin (yanmin.zhang@intel.com) > > */ > > > > +#define dev_fmt(fmt) "AER: " fmt > > + > > #include <linux/pci.h> > > #include <linux/module.h> > > #include <linux/kernel.h> > > @@ -64,7 +66,7 @@ static int report_error_detected(struct pci_dev *dev, > > */ > > if (dev->hdr_type != PCI_HEADER_TYPE_BRIDGE) { > > vote = PCI_ERS_RESULT_NO_AER_DRIVER; > > - pci_info(dev, "AER: Driver has no error_detected callback\n"); > > + pci_info(dev, "driver has no error_detected callback\n"); > > I think use "device" here is more proper. Sometimes the device is > even not bind to the driver, which will also lead to the recovery > failure. Hmmm, maybe so, although the error_detected callback is definitely a property of the *driver*, not the device. What would you think of something like this? pci_info(dev, dev->driver ? "driver not capable of error recovery\n" : "no driver\n"); I am curious about the larger question of why we fail if the device has no driver. I understand why we fail the recovery for a device with a driver that has no err_handlers: we can't just reset the device out from under the driver. But why do we fail if a device has no driver at all? Shouldn't it be safe to reset such a device? The commit log of 918b4053184c ("PCI/AER: Report success only when every device has AER-aware driver") suggests that it has something to do with KVM and the pci-stub driver, but I don't understand it. > > } else { > > vote = PCI_ERS_RESULT_NONE; > > } > > @@ -236,12 +238,12 @@ void pcie_do_recovery(struct pci_dev *dev, enum pci_channel_state state, > > > > pci_aer_clear_device_status(dev); > > pci_cleanup_aer_uncorrect_error_status(dev); > > - pci_info(dev, "AER: Device recovery successful\n"); > > + pci_info(dev, "device recovery successful\n"); > > return; > > > > failed: > > pci_uevent_ers(dev, PCI_ERS_RESULT_DISCONNECT); > > > > /* TODO: Should kernel panic here? */ > > - pci_info(dev, "AER: Device recovery failed\n"); > > + pci_info(dev, "device recovery failed\n"); > > }
On Mon, 16 Dec 2019 11:58:32 -0600 Bjorn Helgaas <helgaas@kernel.org> wrote: > [+cc Alex] > > On Mon, Dec 16, 2019 at 06:30:11PM +0800, Yicong Yang wrote: > > On 2019/12/14 6:57, Bjorn Helgaas wrote: > > > On Fri, Dec 13, 2019 at 07:44:34PM +0800, Yicong Yang wrote: > > >> The PCI error recovery will fail if any device under > > >> root port doesn't have an error_detected callback. > > >> Currently only failure result is printed, which is > > >> not enough to determine which device leads to the > > >> failure and the detailed failure reason. > > >> > > >> Add print information if certain device under root > > >> port has no error_detected callback. > > >> > > >> Signed-off-by: Yicong Yang <yangyicong@hisilicon.com> > > > Applied to pci/aer for v5.6, thanks! > > > > > > I also added a trivial follow-on patch to factor out the "AER: " > > > prefix (attached below). This code is now used by DPC as well as AER, > > > so "AER: " might not be exactly the correct prefix, but I didn't try > > > to untangle that. > > > > > >> --- > > >> drivers/pci/pcie/err.c | 4 +++- > > >> 1 file changed, 3 insertions(+), 1 deletion(-) > > >> > > >> diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c > > >> index b0e6048..ec37c33 100644 > > >> --- a/drivers/pci/pcie/err.c > > >> +++ b/drivers/pci/pcie/err.c > > >> @@ -61,8 +61,10 @@ static int report_error_detected(struct pci_dev *dev, > > >> * error callbacks of "any" device in the subtree, and will > > >> * exit in the disconnected error state. > > >> */ > > >> - if (dev->hdr_type != PCI_HEADER_TYPE_BRIDGE) > > >> + if (dev->hdr_type != PCI_HEADER_TYPE_BRIDGE) { > > >> vote = PCI_ERS_RESULT_NO_AER_DRIVER; > > >> + pci_info(dev, "AER: Device has no error_detected callback\n"); > > >> + } > > >> else > > >> vote = PCI_ERS_RESULT_NONE; > > >> } else { > > >> -- > > >> 2.8.1 > > >> > > > commit 9694ef043ea4 ("PCI/AER: Factor message prefixes with dev_fmt()") > > > Author: Bjorn Helgaas <bhelgaas@google.com> > > > Date: Fri Dec 13 16:46:05 2019 -0600 > > > > > > PCI/AER: Factor message prefixes with dev_fmt() > > > > > > Define dev_fmt() with the common prefix of log messages so we don't have to > > > repeat it in every printk. No functional change intended. > > > > > > Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> > > > > > > diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c > > > index abde5e000f5d..747ef8b41d1f 100644 > > > --- a/drivers/pci/pcie/err.c > > > +++ b/drivers/pci/pcie/err.c > > > @@ -10,6 +10,8 @@ > > > * Zhang Yanmin (yanmin.zhang@intel.com) > > > */ > > > > > > +#define dev_fmt(fmt) "AER: " fmt > > > + > > > #include <linux/pci.h> > > > #include <linux/module.h> > > > #include <linux/kernel.h> > > > @@ -64,7 +66,7 @@ static int report_error_detected(struct pci_dev *dev, > > > */ > > > if (dev->hdr_type != PCI_HEADER_TYPE_BRIDGE) { > > > vote = PCI_ERS_RESULT_NO_AER_DRIVER; > > > - pci_info(dev, "AER: Driver has no error_detected callback\n"); > > > + pci_info(dev, "driver has no error_detected callback\n"); > > > > I think use "device" here is more proper. Sometimes the device is > > even not bind to the driver, which will also lead to the recovery > > failure. > > Hmmm, maybe so, although the error_detected callback is definitely a > property of the *driver*, not the device. What would you think of > something like this? > > pci_info(dev, dev->driver ? "driver not capable of error recovery\n" : > "no driver\n"); > > I am curious about the larger question of why we fail if the device > has no driver. I understand why we fail the recovery for a device > with a driver that has no err_handlers: we can't just reset the device > out from under the driver. > > But why do we fail if a device has no driver at all? Shouldn't it be > safe to reset such a device? The commit log of 918b4053184c > ("PCI/AER: Report success only when every device has AER-aware > driver") suggests that it has something to do with KVM and the > pci-stub driver, but I don't understand it. That commit is from 2012, when legacy KVM device assignment existed in the kernel and was able to drive a device without really binding to it as a driver. Sane users of this mechanism would at least bind the device to pci-stub to prevent it from being used by both a host driver and KVM simultaneously. In any case, legacy KVM device assignment no longer exists in the kernel, so if that was the justification to not reset driver-less devices, we can probably fill that gap now. Thanks, Alex
On 2019/12/17 2:28, Alex Williamson wrote: > On Mon, 16 Dec 2019 11:58:32 -0600 > Bjorn Helgaas <helgaas@kernel.org> wrote: > >> [+cc Alex] >> >> On Mon, Dec 16, 2019 at 06:30:11PM +0800, Yicong Yang wrote: >>> On 2019/12/14 6:57, Bjorn Helgaas wrote: >>>> On Fri, Dec 13, 2019 at 07:44:34PM +0800, Yicong Yang wrote: >>>>> The PCI error recovery will fail if any device under >>>>> root port doesn't have an error_detected callback. >>>>> Currently only failure result is printed, which is >>>>> not enough to determine which device leads to the >>>>> failure and the detailed failure reason. >>>>> >>>>> Add print information if certain device under root >>>>> port has no error_detected callback. >>>>> >>>>> Signed-off-by: Yicong Yang <yangyicong@hisilicon.com> >>>> Applied to pci/aer for v5.6, thanks! >>>> >>>> I also added a trivial follow-on patch to factor out the "AER: " >>>> prefix (attached below). This code is now used by DPC as well as AER, >>>> so "AER: " might not be exactly the correct prefix, but I didn't try >>>> to untangle that. >>>> >>>>> --- >>>>> drivers/pci/pcie/err.c | 4 +++- >>>>> 1 file changed, 3 insertions(+), 1 deletion(-) >>>>> >>>>> diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c >>>>> index b0e6048..ec37c33 100644 >>>>> --- a/drivers/pci/pcie/err.c >>>>> +++ b/drivers/pci/pcie/err.c >>>>> @@ -61,8 +61,10 @@ static int report_error_detected(struct pci_dev *dev, >>>>> * error callbacks of "any" device in the subtree, and will >>>>> * exit in the disconnected error state. >>>>> */ >>>>> - if (dev->hdr_type != PCI_HEADER_TYPE_BRIDGE) >>>>> + if (dev->hdr_type != PCI_HEADER_TYPE_BRIDGE) { >>>>> vote = PCI_ERS_RESULT_NO_AER_DRIVER; >>>>> + pci_info(dev, "AER: Device has no error_detected callback\n"); >>>>> + } >>>>> else >>>>> vote = PCI_ERS_RESULT_NONE; >>>>> } else { >>>>> -- >>>>> 2.8.1 >>>>> >>>> commit 9694ef043ea4 ("PCI/AER: Factor message prefixes with dev_fmt()") >>>> Author: Bjorn Helgaas <bhelgaas@google.com> >>>> Date: Fri Dec 13 16:46:05 2019 -0600 >>>> >>>> PCI/AER: Factor message prefixes with dev_fmt() >>>> >>>> Define dev_fmt() with the common prefix of log messages so we don't have to >>>> repeat it in every printk. No functional change intended. >>>> >>>> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> >>>> >>>> diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c >>>> index abde5e000f5d..747ef8b41d1f 100644 >>>> --- a/drivers/pci/pcie/err.c >>>> +++ b/drivers/pci/pcie/err.c >>>> @@ -10,6 +10,8 @@ >>>> * Zhang Yanmin (yanmin.zhang@intel.com) >>>> */ >>>> >>>> +#define dev_fmt(fmt) "AER: " fmt >>>> + >>>> #include <linux/pci.h> >>>> #include <linux/module.h> >>>> #include <linux/kernel.h> >>>> @@ -64,7 +66,7 @@ static int report_error_detected(struct pci_dev *dev, >>>> */ >>>> if (dev->hdr_type != PCI_HEADER_TYPE_BRIDGE) { >>>> vote = PCI_ERS_RESULT_NO_AER_DRIVER; >>>> - pci_info(dev, "AER: Driver has no error_detected callback\n"); >>>> + pci_info(dev, "driver has no error_detected callback\n"); >>> I think use "device" here is more proper. Sometimes the device is >>> even not bind to the driver, which will also lead to the recovery >>> failure. >> Hmmm, maybe so, although the error_detected callback is definitely a >> property of the *driver*, not the device. What would you think of >> something like this? >> >> pci_info(dev, dev->driver ? "driver not capable of error recovery\n" : >> "no driver\n"); well, it seems like a better way. maybe :-) pci_info(dev, dev->driver ? "driver has no error_detected callback\n" : "device has no driver\n") >> >> I am curious about the larger question of why we fail if the device >> has no driver. I understand why we fail the recovery for a device >> with a driver that has no err_handlers: we can't just reset the device >> out from under the driver. >> >> But why do we fail if a device has no driver at all? Shouldn't it be >> safe to reset such a device? The commit log of 918b4053184c >> ("PCI/AER: Report success only when every device has AER-aware >> driver") suggests that it has something to do with KVM and the >> pci-stub driver, but I don't understand it. > That commit is from 2012, when legacy KVM device assignment existed in > the kernel and was able to drive a device without really binding to it > as a driver. Sane users of this mechanism would at least bind the > device to pci-stub to prevent it from being used by both a host driver > and KVM simultaneously. In any case, legacy KVM device assignment no > longer exists in the kernel, so if that was the justification to not > reset driver-less devices, we can probably fill that gap now. Thanks, > > Alex > I think the reason why we fail if a device has no driver is the same as the driver has no error_detected callback. Under both conditions, the next step process is uncertain. If the error is from other devices, probably we can just ignore the unbind devices. But I'm not certain whether errors will occur at devices with no drivers and results to ignore them. If it's okay, maybe just ignore the unbind devices and return PCI_ERS_RESULT_NONE. however, I tried to inject errors on an unbind device using aer_inject and the recovery routine is invoked. is there real condition like this? > . >
diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c index b0e6048..ec37c33 100644 --- a/drivers/pci/pcie/err.c +++ b/drivers/pci/pcie/err.c @@ -61,8 +61,10 @@ static int report_error_detected(struct pci_dev *dev, * error callbacks of "any" device in the subtree, and will * exit in the disconnected error state. */ - if (dev->hdr_type != PCI_HEADER_TYPE_BRIDGE) + if (dev->hdr_type != PCI_HEADER_TYPE_BRIDGE) { vote = PCI_ERS_RESULT_NO_AER_DRIVER; + pci_info(dev, "AER: Device has no error_detected callback\n"); + } else vote = PCI_ERS_RESULT_NONE; } else {
The PCI error recovery will fail if any device under root port doesn't have an error_detected callback. Currently only failure result is printed, which is not enough to determine which device leads to the failure and the detailed failure reason. Add print information if certain device under root port has no error_detected callback. Signed-off-by: Yicong Yang <yangyicong@hisilicon.com> --- drivers/pci/pcie/err.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) -- 2.8.1