i40e: Disable i40e PCIe AER on system reboot

Message ID 20241227035459.90602-1-yue.zhao@shopee.com
State Deferred
Series i40e: Disable i40e PCIe AER on system reboot

Commit Message

Yue Zhao Dec. 27, 2024, 3:54 a.m. UTC
Disable PCIe AER on the i40e device on system reboot on a limited
list of Dell PowerEdge systems. This prevents a fatal PCIe AER event
on the i40e device during the ACPI _PTS (prepare to sleep) method for
S5 on those systems. The _PTS is invoked by acpi_enter_sleep_state_prep()
as part of the kernel's reboot sequence as a result of commit
38f34dba806a ("PM: ACPI: reboot: Reinstate S5 for reboot").

We first noticed this abnormal reboot issue on a tg3 device, for which
a similar patch disables PCIe AER to avoid the hardware error during
reboot. The hardware error on the tg3 device went away after we applied
the patch below.

https://lore.kernel.org/lkml/20241129203640.54492-1-lszubowi@redhat.com/T/

So we disable PCIe AER on the i40e device in the same way.

Hardware crash dmesg log:

ACPI: PM: Preparing to enter system sleep state S5
{1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 5
{1}[Hardware Error]: event severity: fatal
{1}[Hardware Error]:  Error 0, type: fatal
{1}[Hardware Error]:   section_type: PCIe error
{1}[Hardware Error]:   port_type: 0, PCIe end point
{1}[Hardware Error]:   version: 3.0
{1}[Hardware Error]:   command: 0x0006, status: 0x0010
{1}[Hardware Error]:   device_id: 0000:05:00.1
{1}[Hardware Error]:   slot: 0
{1}[Hardware Error]:   secondary_bus: 0x00
{1}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x1572
{1}[Hardware Error]:   class_code: 020000
{1}[Hardware Error]:   aer_uncor_status: 0x00100000, aer_uncor_mask: 0x00018000
{1}[Hardware Error]:   aer_uncor_severity: 0x000ef030
{1}[Hardware Error]:   TLP Header: 40000001 0000000f 90028090 00000000
Kernel panic - not syncing: Fatal hardware error!
Hardware name: Dell Inc. PowerEdge C4140/08Y2GR, BIOS 2.21.1 12/12/2023
Call Trace:
 <NMI>
 dump_stack_lvl+0x48/0x70
 dump_stack+0x10/0x20
 panic+0x1b4/0x3a0
 __ghes_panic+0x6c/0x70
 ghes_in_nmi_queue_one_entry.constprop.0+0x1ee/0x2c0
 ghes_notify_nmi+0x5e/0xe0
 nmi_handle+0x62/0x160
 default_do_nmi+0x4c/0x150
 exc_nmi+0x140/0x1f0
 end_repeat_nmi+0x16/0x67
RIP: 0010:intel_idle_irq+0x70/0xf0
 </NMI>
 <TASK>
 cpuidle_enter_state+0x91/0x6f0
 cpuidle_enter+0x2e/0x50
 call_cpuidle+0x23/0x60
 cpuidle_idle_call+0x11d/0x190
 do_idle+0x82/0xf0
 cpu_startup_entry+0x2a/0x30
 rest_init+0xc2/0xf0
 arch_call_rest_init+0xe/0x30
 start_kernel+0x34f/0x440
 x86_64_start_reservations+0x18/0x30
 x86_64_start_kernel+0xbf/0x110
 secondary_startup_64_no_verify+0x18f/0x19b
 </TASK>
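
For readers decoding the log above, the following is a minimal user-space sketch of how the AER register values can be read. It is not part of the patch. The bit layout follows the PCIe AER Uncorrectable Error Status register (the mask and severity registers share it); the helper names `aer_uncor_first_name`, `aer_uncor_print` and `aer_uncor_is_fatal` are hypothetical and exist only for this illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* One named bit of the AER uncorrectable status/mask/severity layout. */
struct aer_bit {
	uint32_t mask;
	const char *name;
};

static const struct aer_bit aer_uncor_bits[] = {
	{ 0x00000010u, "Data Link Protocol Error" },
	{ 0x00000020u, "Surprise Down Error" },
	{ 0x00001000u, "Poisoned TLP" },
	{ 0x00002000u, "Flow Control Protocol Error" },
	{ 0x00004000u, "Completion Timeout" },
	{ 0x00008000u, "Completer Abort" },
	{ 0x00010000u, "Unexpected Completion" },
	{ 0x00020000u, "Receiver Overflow" },
	{ 0x00040000u, "Malformed TLP" },
	{ 0x00080000u, "ECRC Error" },
	{ 0x00100000u, "Unsupported Request Error" },
	{ 0x00200000u, "ACS Violation" },
};

#define AER_UNCOR_NBITS (sizeof(aer_uncor_bits) / sizeof(aer_uncor_bits[0]))

/* Hypothetical helper: name of the lowest known bit set, or NULL. */
const char *aer_uncor_first_name(uint32_t value)
{
	size_t i;

	for (i = 0; i < AER_UNCOR_NBITS; i++)
		if (value & aer_uncor_bits[i].mask)
			return aer_uncor_bits[i].name;
	return NULL;
}

/* Hypothetical helper: print every known bit set in a register value. */
void aer_uncor_print(const char *label, uint32_t value)
{
	size_t i;

	printf("%s: 0x%08x\n", label, (unsigned int)value);
	for (i = 0; i < AER_UNCOR_NBITS; i++)
		if (value & aer_uncor_bits[i].mask)
			printf("  %s\n", aer_uncor_bits[i].name);
}

/* Nonzero if any pending error is marked fatal in the device's own
 * severity register.
 */
int aer_uncor_is_fatal(uint32_t status, uint32_t severity)
{
	return (status & severity) != 0;
}
```

Calling `aer_uncor_print("aer_uncor_status", 0x00100000)` names the pending error as an Unsupported Request Error, and the mask value 0x00018000 covers Completer Abort and Unexpected Completion. Bit 20 is clear in aer_uncor_severity (0x000ef030), so the device's own severity register does not mark this error as fatal, even though the firmware-generated APEI record reports the event severity as fatal.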

Fixes: 38f34dba806a ("PM: ACPI: reboot: Reinstate S5 for reboot")
Signed-off-by: Yue Zhao <yue.zhao@shopee.com>
---
 drivers/net/ethernet/intel/i40e/i40e_main.c | 64 +++++++++++++++++++++
 1 file changed, 64 insertions(+)

Comments

Tony Nguyen March 6, 2025, 12:47 a.m. UTC | #1
On 12/26/2024 7:54 PM, Yue Zhao wrote:
> Disable PCIe AER on the i40e device on system reboot on a limited
> list of Dell PowerEdge systems. This prevents a fatal PCIe AER event
> on the i40e device during the ACPI _PTS (prepare to sleep) method for
> S5 on those systems. The _PTS is invoked by acpi_enter_sleep_state_prep()
> as part of the kernel's reboot sequence as a result of commit
> 38f34dba806a ("PM: ACPI: reboot: Reinstate S5 for reboot").

Hi Yue,

We've contacted Dell to try to root cause the issue and find the proper 
fix. It would help if we could provide more information about the 
problem and circumstances. Have you reported the issue to Dell? If so, 
could you provide that to me (here or privately) so that we can pass 
that along to help the investigation?

Thank you,
Tony

Yue Zhao March 6, 2025, 4:28 a.m. UTC | #2
Hi Tony,

Our Dell servers are all out of warranty, so I have no further
information from the vendor to share.
Is there a possible fix via upgrading firmware or other components?

Thanks,
Best Regards

Yue

Yue Zhao March 6, 2025, 6:40 a.m. UTC | #3
Hi Tony,

I am resending this because I realized I sent it in Rich Text instead
of Plain Text. Apologies to anyone who received a duplicate email.

Our Dell servers are all out of warranty, so I have no further
information from the vendor to share.
Is there a possible fix via upgrading firmware or other components?

Thanks,
Best Regards

Yue



Patch

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 0e1d9e2fbf38..80e66e4e90f7 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -8,6 +8,7 @@ 
 #include <linux/module.h>
 #include <net/pkt_cls.h>
 #include <net/xdp_sock_drv.h>
+#include <linux/dmi.h>
 
 /* Local includes */
 #include "i40e.h"
@@ -16608,6 +16609,56 @@  static void i40e_pci_error_resume(struct pci_dev *pdev)
 	i40e_io_resume(pf);
 }
 
+/* Systems where ACPI _PTS (Prepare To Sleep) S5 will result in a fatal
+ * PCIe AER event on the i40e device if the i40e device is not, or cannot
+ * be, powered down.
+ */
+static const struct dmi_system_id i40e_restart_aer_quirk_table[] = {
+	{
+		.matches = {
+			DMI_MATCH(DMI_SYS_VENDOR, "Dell Inc."),
+			DMI_MATCH(DMI_PRODUCT_NAME, "PowerEdge C4140"),
+		},
+	},
+	{
+		.matches = {
+			DMI_MATCH(DMI_SYS_VENDOR, "Dell Inc."),
+			DMI_MATCH(DMI_PRODUCT_NAME, "PowerEdge R440"),
+		},
+	},
+	{
+		.matches = {
+			DMI_MATCH(DMI_SYS_VENDOR, "Dell Inc."),
+			DMI_MATCH(DMI_PRODUCT_NAME, "PowerEdge R540"),
+		},
+	},
+	{
+		.matches = {
+			DMI_MATCH(DMI_SYS_VENDOR, "Dell Inc."),
+			DMI_MATCH(DMI_PRODUCT_NAME, "PowerEdge R640"),
+		},
+	},
+	{
+		.matches = {
+			DMI_MATCH(DMI_SYS_VENDOR, "Dell Inc."),
+			DMI_MATCH(DMI_PRODUCT_NAME, "PowerEdge R650"),
+		},
+	},
+	{
+		.matches = {
+			DMI_MATCH(DMI_SYS_VENDOR, "Dell Inc."),
+			DMI_MATCH(DMI_PRODUCT_NAME, "PowerEdge R740"),
+		},
+	},
+	{
+		.matches = {
+			DMI_MATCH(DMI_SYS_VENDOR, "Dell Inc."),
+			DMI_MATCH(DMI_PRODUCT_NAME, "PowerEdge R750"),
+		},
+	},
+	{}
+};
+
 /**
  * i40e_shutdown - PCI callback for shutting down
  * @pdev: PCI device information struct
@@ -16654,6 +16705,19 @@  static void i40e_shutdown(struct pci_dev *pdev)
 	i40e_clear_interrupt_scheme(pf);
 	rtnl_unlock();
 
+	if (system_state == SYSTEM_RESTART &&
+	    dmi_first_match(i40e_restart_aer_quirk_table) &&
+	    pdev->current_state <= PCI_D3hot) {
+		/* Disable PCIe AER on the i40e to avoid a fatal
+		 * error during this system restart.
+		 */
+		pcie_capability_clear_word(pdev, PCI_EXP_DEVCTL,
+					   PCI_EXP_DEVCTL_CERE |
+					   PCI_EXP_DEVCTL_NFERE |
+					   PCI_EXP_DEVCTL_FERE |
+					   PCI_EXP_DEVCTL_URRE);
+	}
+
 	if (system_state == SYSTEM_POWER_OFF) {
 		pci_wake_from_d3(pdev, pf->wol_en);
 		pci_set_power_state(pdev, PCI_D3hot);
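
The two pieces of logic in the patch above, the DMI gate and the Device Control update, can be modelled in user space as a rough sketch. This is an illustration under stated assumptions, not kernel code. The `PCI_EXP_DEVCTL_*` values are those defined in include/uapi/linux/pci_regs.h; `quirk_products`, `product_is_quirked` and `devctl_disable_err_reporting` are hypothetical names. The sketch relies on two facts: `DMI_MATCH()` performs a substring comparison (`DMI_EXACT_MATCH()` is the exact variant), and `pcie_capability_clear_word()` is a read-modify-write that clears the given bits.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Device Control error-reporting enable bits (include/uapi/linux/pci_regs.h) */
#define PCI_EXP_DEVCTL_CERE	0x0001	/* Correctable Error Reporting Enable */
#define PCI_EXP_DEVCTL_NFERE	0x0002	/* Non-Fatal Error Reporting Enable */
#define PCI_EXP_DEVCTL_FERE	0x0004	/* Fatal Error Reporting Enable */
#define PCI_EXP_DEVCTL_URRE	0x0008	/* Unsupported Request Reporting Enable */

#define DEVCTL_ERR_REPORTING	(PCI_EXP_DEVCTL_CERE | PCI_EXP_DEVCTL_NFERE | \
				 PCI_EXP_DEVCTL_FERE | PCI_EXP_DEVCTL_URRE)

/* Hypothetical stand-in for i40e_restart_aer_quirk_table: the product
 * name substrings listed in the patch.
 */
const char *const quirk_products[] = {
	"PowerEdge C4140", "PowerEdge R440", "PowerEdge R540",
	"PowerEdge R640", "PowerEdge R650", "PowerEdge R740",
	"PowerEdge R750",
};

/* Model of the DMI gate: both fields are matched as substrings, which is
 * how DMI_MATCH() compares (unlike DMI_EXACT_MATCH()).
 */
bool product_is_quirked(const char *vendor, const char *product)
{
	size_t i;

	if (!strstr(vendor, "Dell Inc."))
		return false;
	for (i = 0; i < sizeof(quirk_products) / sizeof(quirk_products[0]); i++)
		if (strstr(product, quirk_products[i]))
			return true;
	return false;
}

/* Model of the register update: clear the four reporting enables and
 * leave every other Device Control bit unchanged.
 */
uint16_t devctl_disable_err_reporting(uint16_t devctl)
{
	return (uint16_t)(devctl & ~DEVCTL_ERR_REPORTING);
}
```

With a sample Device Control value of 0x2937 (chosen only for illustration), `devctl_disable_err_reporting()` returns 0x2930: only the low four bits change. Because the match is a substring match, a product string such as "PowerEdge R740xd" would also pass the gate, which may or may not be intended for the quirk list.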