Message ID | 20250108075703.410961-1-bijie.xu@corigine.com (mailing list archive) |
---|---|
State | Handled Elsewhere |
Headers | show |
Series | PCI/AER: Add kernel.aer_print_skip_mask to control aer log | expand |
[+cc Jon, Karolina] On Wed, Jan 08, 2025 at 03:57:03PM +0800, Bijie Xu wrote: > Sometimes certain PCIE devices installed on some servers occasionally > produce large number of AER correctable error logs, which is quite > annoying. Add this sysctl parameter kernel.aer_print_skip_mask to > skip printing AER errors of certain severity. > > The AER severity can be 0(NONFATAL), 1(FATAL), 2(CORRECTABLE). The 3 > low bits of the mask are used to skip these 3 severities. Set bit 0 > can skip printing NONFATAL AER errors, and set bit 1 can skip printing > FATAL AER errors, set bit 2 can skip printing CORRECTABLE AER errors. > And multiple bits can be set to skip multiple severities. This is definitely annoying, actually MORE than annoying in some cases. I'm hoping the correctable error rate-limiting work can reduce the annoyance to an tolerable level: https://lore.kernel.org/r/20250214023543.992372-1-pandoh@google.com Can you take a look at this and see if it's going the right direction for you, or if it needs extensions to do what you need? > Signed-off-by: Bijie Xu <bijie.xu@corigine.com> > --- > drivers/pci/pcie/aer.c | 23 +++++++++++++++++++++++ > 1 file changed, 23 insertions(+) > > diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c > index 80c5ba8d8296..b46973526bcf 100644 > --- a/drivers/pci/pcie/aer.c > +++ b/drivers/pci/pcie/aer.c > @@ -698,6 +698,7 @@ static void __aer_print_error(struct pci_dev *dev, > pci_dev_aer_stats_incr(dev, info); > } > > +unsigned int aer_print_skip_mask __read_mostly; > void aer_print_error(struct pci_dev *dev, struct aer_err_info *info) > { > int layer, agent; > @@ -710,6 +711,9 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info) > goto out; > } > > + if ((1 << info->severity) & aer_print_skip_mask) > + goto out; > + > layer = AER_GET_LAYER_ERROR(info->severity, info->status); > agent = AER_GET_AGENT(info->severity, info->status); > > @@ -1596,3 +1600,22 @@ int __init pcie_aer_init(void) > return -ENXIO; > return pcie_port_service_register(&aerdriver); > } > + > +static const struct ctl_table aer_print_skip_mask_sysctls[] = { > + { > + .procname = "aer_print_skip_mask", > + .data = &aer_print_skip_mask, > + .maxlen = sizeof(unsigned int), > + .mode = 0644, > + .proc_handler = &proc_douintvec, > + }, > + {} > +}; > + > +static int __init aer_print_skip_mask_sysctl_init(void) > +{ > + register_sysctl_init("kernel", aer_print_skip_mask_sysctls); > + return 0; > +} > + > +late_initcall(aer_print_skip_mask_sysctl_init); > -- > 2.25.1 >
On Tue, 4 Mar 2025 17:22:30 -0600, Bjorn Helgaas wrote: > Can you take a look at this and see if it's going the right direction > for you, or if it needs extensions to do what you need? Thanks for your suggestion. I've taken sometime to review that patch you suggested. It solves part of the problem. And it can set ratelimit on a single device, which is good. But this patch solves the problem in a different way. 1. Some users are very nervous to notice this kind of error logs. This patch can give them an option to disable these logs entirely on the whole system level instead of just set a ratelimit on a specific device. 2. The sysctl configuration can be persisted after a system reboot. Users may dislike these AER logs appearing again after a system reboot. Regards, Bijie Xu
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c index 80c5ba8d8296..b46973526bcf 100644 --- a/drivers/pci/pcie/aer.c +++ b/drivers/pci/pcie/aer.c @@ -698,6 +698,7 @@ static void __aer_print_error(struct pci_dev *dev, pci_dev_aer_stats_incr(dev, info); } +unsigned int aer_print_skip_mask __read_mostly; void aer_print_error(struct pci_dev *dev, struct aer_err_info *info) { int layer, agent; @@ -710,6 +711,9 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info) goto out; } + if ((1 << info->severity) & aer_print_skip_mask) + goto out; + layer = AER_GET_LAYER_ERROR(info->severity, info->status); agent = AER_GET_AGENT(info->severity, info->status); @@ -1596,3 +1600,22 @@ int __init pcie_aer_init(void) return -ENXIO; return pcie_port_service_register(&aerdriver); } + +static const struct ctl_table aer_print_skip_mask_sysctls[] = { + { + .procname = "aer_print_skip_mask", + .data = &aer_print_skip_mask, + .maxlen = sizeof(unsigned int), + .mode = 0644, + .proc_handler = &proc_douintvec, + }, + {} +}; + +static int __init aer_print_skip_mask_sysctl_init(void) +{ + register_sysctl_init("kernel", aer_print_skip_mask_sysctls); + return 0; +} + +late_initcall(aer_print_skip_mask_sysctl_init);
Sometimes certain PCIE devices installed on some servers occasionally produce large number of AER correctable error logs, which is quite annoying. Add this sysctl parameter kernel.aer_print_skip_mask to skip printing AER errors of certain severity. The AER severity can be 0(NONFATAL), 1(FATAL), 2(CORRECTABLE). The 3 low bits of the mask are used to skip these 3 severities. Set bit 0 can skip printing NONFATAL AER errors, and set bit 1 can skip printing FATAL AER errors, set bit 2 can skip printing CORRECTABLE AER errors. And multiple bits can be set to skip multiple severities. Signed-off-by: Bijie Xu <bijie.xu@corigine.com> --- drivers/pci/pcie/aer.c | 23 +++++++++++++++++++++++ 1 file changed, 23 insertions(+)