Message ID | 20180409220444.6632-1-keith.busch@intel.com |
---|---|
Headers | show |
Series | PCI/AER: Use-after-free fix | expand |
在 2018/4/10 6:04, Keith Busch 写道: > AER error handling walks the PCI topology below a root port, saving > pointers of the pci_dev structs affected by the error along the way. > > At the same time, the pcie hotplug driver could be freeing those very > same structures, causing the AER driver to reference freed memory. I also have met such issue. The details see the below link. https://www.spinics.net/lists/linux-pci/msg70614.html It seems do reset_link() will trigger hotplug driver to remove/rescan device when met AER fatal error. do_recovery() --- reset_link() --- pci_reset_secondary_bus() //then trigger link down/up. So if the root port support hotplug, it will remove/rescan device. I still have a question, since AER driver have already done recovery, it seems no need to trigger hotplug to remove and rescan the device. Thanks, Dongdong > > This series fixes this by synchroniziing the aer driver with the pci > hotplug driver during. The final patch is the one that ultimately provides > the locking by having AER lock the same pci lock rescan/remove mutex as > the pciehp driver. The first three patches are prepping to make it safe > for the aer bottom half handler to hold that lock. > > Keith Busch (4): > PCI/AER: Remove unused parameters > PCI/AER: Replace struct pcie_device with pci_dev > PCI/AER: Reference count aer structures > PCI/AER: Lock pci topology when scanning errors > > drivers/pci/pcie/aer/aerdrv.c | 28 +++++++++++++++++++++------- > drivers/pci/pcie/aer/aerdrv.h | 9 +++------ > drivers/pci/pcie/aer/aerdrv_core.c | 38 +++++++++++++++++--------------------- > 3 files changed, 41 insertions(+), 34 deletions(-) >
On Thu, Apr 12, 2018 at 05:06:05PM +0000, Alex_Gagniuc@Dellteam.com wrote: > From: Keith Busch [mailto:keith.busch@intel.com] > > > AER error handling walks the PCI topology below a root port, saving pointers of the pci_dev structs affected by the error along the way. > > Hi Keith, > > I've been trying to do an ABA test to confirm that your change eliminates the use-after-free issue we've seen. The race seems to be quite elusive, so I can't reliably reproduce it. Your changes have not been forgotten; I have them staged for further testing. > > Alex If you need help triggering the race you can add a sleep/microsleep here: aer_isr_one_error() between the find_source_device and process err device: sbauer@sbauer-Z170X-UD5:~/nvme_code/upstream_jens/linux-block$ git diff drivers/pci/pcie/aer/aerdrv_core.c diff --git a/drivers/pci/pcie/aer/aerdrv_core.c b/drivers/pci/pcie/aer/aerdrv_core.c index a4bfea52e7d4..5ca0c07b1d05 100644 --- a/drivers/pci/pcie/aer/aerdrv_core.c +++ b/drivers/pci/pcie/aer/aerdrv_core.c @@ -22,6 +22,7 @@ #include <linux/delay.h> #include <linux/slab.h> #include <linux/kfifo.h> +#include <linux/delay.h> #include "aerdrv.h" #define PCI_EXP_AER_FLAGS (PCI_EXP_DEVCTL_CERE | PCI_EXP_DEVCTL_NFERE | \ @@ -740,8 +741,10 @@ static void aer_isr_one_error(struct pcie_device *p_device, aer_print_port_info(p_device->port, e_info); - if (find_source_device(p_device->port, e_info)) + if (find_source_device(p_device->port, e_info)) { + msleep(350); aer_process_err_devices(p_device, e_info); + } } if (e_src->status & PCI_ERR_ROOT_UNCOR_RCV) { @@ -759,8 +762,10 @@ static void aer_isr_one_error(struct pcie_device *p_device, aer_print_port_info(p_device->port, e_info); - if (find_source_device(p_device->port, e_info)) + if (find_source_device(p_device->port, e_info)) { + msleep(350); aer_process_err_devices(p_device, e_info); + } } }
From: Keith Busch [mailto:keith.busch@intel.com] > AER error handling walks the PCI topology below a root port, saving pointers of the pci_dev structs affected by the error along the way. Hi Keith, I've been trying to do an ABA test to confirm that your change eliminates the use-after-free issue we've seen. The race seems to be quite elusive, so I can't reliably reproduce it. Your changes have not been forgotten; I have them staged for further testing. Alex
On Thu, Apr 12, 2018 at 05:06:05PM +0000, Alex_Gagniuc@Dellteam.com wrote: > From: Keith Busch [mailto:keith.busch@intel.com] > > > AER error handling walks the PCI topology below a root port, saving > > pointers of the pci_dev structs affected by the error along the way. > > Hi Keith, > > I've been trying to do an ABA test to confirm that your change > eliminates the use-after-free issue we've seen. The race seems to be > quite elusive, so I can't reliably reproduce it. Your changes have not > been forgotten; I have them staged for further testing. > > Alex No worries. This is essentially just a "safe" version of the patch that you had tested, so if that one was successful, this one should be too.
I got the cold chills when I realized you called for a delay of 350ms. It's because 350ms is around the delay I've observed to be caused by FFS. First run KASANed with the extra delay, so hopefully, I'll have more cement test results by EOB today. Alex -----Original Message----- From: Scott Bauer [mailto:scott.bauer@intel.com] Sent: Thursday, April 12, 2018 11:47 AM To: Gagniuc, Alexandru - Dell Team Cc: keith.busch@intel.com; linux-pci@vger.kernel.org; bhelgaas@google.com Subject: Re: [PATCH 0/4] PCI/AER: Use-after-free fix On Thu, Apr 12, 2018 at 05:06:05PM +0000, Alex_Gagniuc@Dellteam.com wrote: > From: Keith Busch [mailto:keith.busch@intel.com] > > > AER error handling walks the PCI topology below a root port, saving pointers of the pci_dev structs affected by the error along the way. > > Hi Keith, > > I've been trying to do an ABA test to confirm that your change eliminates the use-after-free issue we've seen. The race seems to be quite elusive, so I can't reliably reproduce it. Your changes have not been forgotten; I have them staged for further testing. > > Alex If you need help triggering the race you can add a sleep/microsleep here: aer_isr_one_error() between the find_source_device and process err device: sbauer@sbauer-Z170X-UD5:~/nvme_code/upstream_jens/linux-block$ git diff drivers/pci/pcie/aer/aerdrv_core.c diff --git a/drivers/pci/pcie/aer/aerdrv_core.c b/drivers/pci/pcie/aer/aerdrv_core.c index a4bfea52e7d4..5ca0c07b1d05 100644 --- a/drivers/pci/pcie/aer/aerdrv_core.c +++ b/drivers/pci/pcie/aer/aerdrv_core.c @@ -22,6 +22,7 @@ #include <linux/delay.h> #include <linux/slab.h> #include <linux/kfifo.h> +#include <linux/delay.h> #include "aerdrv.h" #define PCI_EXP_AER_FLAGS (PCI_EXP_DEVCTL_CERE | PCI_EXP_DEVCTL_NFERE | \ @@ -740,8 +741,10 @@ static void aer_isr_one_error(struct pcie_device *p_device, aer_print_port_info(p_device->port, e_info); - if (find_source_device(p_device->port, e_info)) + if (find_source_device(p_device->port, e_info)) { + msleep(350); aer_process_err_devices(p_device, e_info); + } } if (e_src->status & PCI_ERR_ROOT_UNCOR_RCV) { @@ -759,8 +762,10 @@ static void aer_isr_one_error(struct pcie_device *p_device, aer_print_port_info(p_device->port, e_info); - if (find_source_device(p_device->port, e_info)) + if (find_source_device(p_device->port, e_info)) { + msleep(350); aer_process_err_devices(p_device, e_info); + } } }
From: Scott Bauer [mailto:scott.bauer@intel.com] On Thu, Apr 12, 2018 at 05:06:05PM +0000, Alex_Gagniuc@Dellteam.com wrote: > From: Keith Busch [mailto:keith.busch@intel.com] > > > AER error handling walks the PCI topology below a root port, saving pointers of the pci_dev structs affected by the error along the way. > > Hi Keith, > > I've been trying to do an ABA test to confirm that your change eliminates the use-after-free issue we've seen. The race seems to be quite elusive, so I can't reliably reproduce it. Your changes have not been forgotten; I have them staged for further testing. > > Alex > If you need help triggering the race you can add a sleep/microsleep here: Tested-by: Alexandru Gagniuc
On Mon, Apr 09, 2018 at 04:04:40PM -0600, Keith Busch wrote: > AER error handling walks the PCI topology below a root port, saving > pointers of the pci_dev structs affected by the error along the way. > > At the same time, the pcie hotplug driver could be freeing those very > same structures, causing the AER driver to reference freed memory. > > This series fixes this by synchroniziing the aer driver with the pci > hotplug driver during. The final patch is the one that ultimately provides > the locking by having AER lock the same pci lock rescan/remove mutex as > the pciehp driver. The first three patches are prepping to make it safe > for the aer bottom half handler to hold that lock. > > Keith Busch (4): > PCI/AER: Remove unused parameters > PCI/AER: Replace struct pcie_device with pci_dev > PCI/AER: Reference count aer structures > PCI/AER: Lock pci topology when scanning errors > > drivers/pci/pcie/aer/aerdrv.c | 28 +++++++++++++++++++++------- > drivers/pci/pcie/aer/aerdrv.h | 9 +++------ > drivers/pci/pcie/aer/aerdrv_core.c | 38 +++++++++++++++++--------------------- > 3 files changed, 41 insertions(+), 34 deletions(-) I applied the first two to pci/aer for v4.18, thanks! I think the last two might need some rework to adapt after Oza's patches.