diff mbox series

PCI/AER: Add kernel.aer_print_skip_mask to control aer log

Message ID 20250108075703.410961-1-bijie.xu@corigine.com (mailing list archive)
State Handled Elsewhere
Headers show
Series PCI/AER: Add kernel.aer_print_skip_mask to control aer log | expand

Commit Message

Bijie Xu Jan. 8, 2025, 7:57 a.m. UTC
Sometimes certain PCIE devices installed on some servers occasionally
produce large number of AER correctable error logs, which is quite
annoying. Add this sysctl parameter kernel.aer_print_skip_mask to
skip printing AER errors of certain severity.

The AER severity can be 0(NONFATAL), 1(FATAL), 2(CORRECTABLE). The 3
low bits of the mask are used to skip these 3 severities. Set bit 0
can skip printing NONFATAL AER errors, and set bit 1 can skip printing
FATAL AER errors, set bit 2 can skip printing CORRECTABLE AER errors.
And multiple bits can be set to skip multiple severities.

Signed-off-by: Bijie Xu <bijie.xu@corigine.com>
---
 drivers/pci/pcie/aer.c | 23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)

Comments

Bjorn Helgaas March 4, 2025, 11:22 p.m. UTC | #1
[+cc Jon, Karolina]

On Wed, Jan 08, 2025 at 03:57:03PM +0800, Bijie Xu wrote:
> Sometimes certain PCIE devices installed on some servers occasionally
> produce large number of AER correctable error logs, which is quite
> annoying. Add this sysctl parameter kernel.aer_print_skip_mask to
> skip printing AER errors of certain severity.
> 
> The AER severity can be 0(NONFATAL), 1(FATAL), 2(CORRECTABLE). The 3
> low bits of the mask are used to skip these 3 severities. Set bit 0
> can skip printing NONFATAL AER errors, and set bit 1 can skip printing
> FATAL AER errors, set bit 2 can skip printing CORRECTABLE AER errors.
> And multiple bits can be set to skip multiple severities.

This is definitely annoying, actually MORE than annoying in some
cases.

I'm hoping the correctable error rate-limiting work can reduce the
annoyance to an tolerable level:

  https://lore.kernel.org/r/20250214023543.992372-1-pandoh@google.com

Can you take a look at this and see if it's going the right direction
for you, or if it needs extensions to do what you need?

> Signed-off-by: Bijie Xu <bijie.xu@corigine.com>
> ---
>  drivers/pci/pcie/aer.c | 23 +++++++++++++++++++++++
>  1 file changed, 23 insertions(+)
> 
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index 80c5ba8d8296..b46973526bcf 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -698,6 +698,7 @@ static void __aer_print_error(struct pci_dev *dev,
>  	pci_dev_aer_stats_incr(dev, info);
>  }
>  
> +unsigned int aer_print_skip_mask __read_mostly;
>  void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
>  {
>  	int layer, agent;
> @@ -710,6 +711,9 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
>  		goto out;
>  	}
>  
> +	if ((1 << info->severity) & aer_print_skip_mask)
> +		goto out;
> +
>  	layer = AER_GET_LAYER_ERROR(info->severity, info->status);
>  	agent = AER_GET_AGENT(info->severity, info->status);
>  
> @@ -1596,3 +1600,22 @@ int __init pcie_aer_init(void)
>  		return -ENXIO;
>  	return pcie_port_service_register(&aerdriver);
>  }
> +
> +static const struct ctl_table aer_print_skip_mask_sysctls[] = {
> +	{
> +		.procname       = "aer_print_skip_mask",
> +		.data           = &aer_print_skip_mask,
> +		.maxlen         = sizeof(unsigned int),
> +		.mode           = 0644,
> +		.proc_handler   = &proc_douintvec,
> +	},
> +	{}
> +};
> +
> +static int __init aer_print_skip_mask_sysctl_init(void)
> +{
> +	register_sysctl_init("kernel", aer_print_skip_mask_sysctls);
> +	return 0;
> +}
> +
> +late_initcall(aer_print_skip_mask_sysctl_init);
> -- 
> 2.25.1
>
Bijie Xu March 6, 2025, 8:25 a.m. UTC | #2
On Tue, 4 Mar 2025 17:22:30 -0600, Bjorn Helgaas wrote:
> Can you take a look at this and see if it's going the right direction
> for you, or if it needs extensions to do what you need?
Thanks for your suggestion. I've taken sometime to review that patch you suggested. 
It solves part of the problem. And it can set ratelimit on a single device, which
is good. 

But this patch solves the problem in a different way. 

1. Some users are very nervous to notice this kind of error logs. This patch can
give them an option to disable these logs entirely on the whole system level 
instead of just set a ratelimit on a specific device.

2. The sysctl configuration can be persisted after a system reboot. Users may dislike
these AER logs appearing again after a system reboot.

Regards,
Bijie Xu
diff mbox series

Patch

diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 80c5ba8d8296..b46973526bcf 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -698,6 +698,7 @@  static void __aer_print_error(struct pci_dev *dev,
 	pci_dev_aer_stats_incr(dev, info);
 }
 
+unsigned int aer_print_skip_mask __read_mostly;
 void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
 {
 	int layer, agent;
@@ -710,6 +711,9 @@  void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
 		goto out;
 	}
 
+	if ((1 << info->severity) & aer_print_skip_mask)
+		goto out;
+
 	layer = AER_GET_LAYER_ERROR(info->severity, info->status);
 	agent = AER_GET_AGENT(info->severity, info->status);
 
@@ -1596,3 +1600,22 @@  int __init pcie_aer_init(void)
 		return -ENXIO;
 	return pcie_port_service_register(&aerdriver);
 }
+
+static const struct ctl_table aer_print_skip_mask_sysctls[] = {
+	{
+		.procname       = "aer_print_skip_mask",
+		.data           = &aer_print_skip_mask,
+		.maxlen         = sizeof(unsigned int),
+		.mode           = 0644,
+		.proc_handler   = &proc_douintvec,
+	},
+	{}
+};
+
+static int __init aer_print_skip_mask_sysctl_init(void)
+{
+	register_sysctl_init("kernel", aer_print_skip_mask_sysctls);
+	return 0;
+}
+
+late_initcall(aer_print_skip_mask_sysctl_init);