Message ID | 20190531171216.20532-3-logang@deltatee.com |
---|---|
State | New |
Delegated to: | Bjorn Helgaas |
Headers | show |
Series | Fix a pair of setup bus bugs | expand |
On Fri, May 31, 2019 at 11:12:16AM -0600, Logan Gunthorpe wrote: > One odd quirk of PLX switches is that their upstream bridge port has > 256K of space allocated behind its BAR0 (most other bridge > implementations do not report any BAR space). Somewhat unusual, but completely legal, of course. If a bridge has memory BARs, AFAIK it is impossible to enable a memory window without also enabling the BARs, so if we want to use the bridge at all, we *must* allocate space for its BARs, just like for any other device. > The lspci for such device > looks like: > > 04:00.0 PCI bridge: PLX Technology, Inc. PEX 8724 24-Lane, 6-Port PCI > Express Gen 3 (8 GT/s) Switch, 19 x 19mm FCBGA (rev ca) > (prog-if 00 [Normal decode]) > Physical Slot: 1 > Flags: bus master, fast devsel, latency 0, IRQ 30, NUMA node 0 > Memory at 90a00000 (32-bit, non-prefetchable) [size=256K] > Bus: primary=04, secondary=05, subordinate=0a, sec-latency=0 > I/O behind bridge: 00002000-00003fff > Memory behind bridge: 90000000-909fffff > Prefetchable memory behind bridge: 0000380000800000-0000380000bfffff > Kernel driver in use: pcieport > > It's not clear what the purpose of the memory at 0x90a00000 is, and > currently the kernel never actually uses it for anything. In most cases, > it's safely ignored and does not cause a problem. > > However, when the kernel assigns the resource addresses (with the > pci=realloc command line parameter, for example) it can inadvertently > disable the struct resource corresponding to the bar. When this happens, > lspci will report this memory as ignored: > > Region 0: Memory at <ignored> (32-bit, non-prefetchable) [size=256K] > > This is because the kernel reports a zero start address and zero flags > in the corresponding sysfs resource file and in /proc/bus/pci/devices. > Investigation with 'lspci -x', however shows the bios-assigned address > will still be programmed in the device's BAR registers. Ugh, yep. It took me a while to trace through this, but "lspci -v" by default shows the kernel view of addresses, i.e., the pdev->resource[] values, which it gets from the sysfs "resource" file (resource_show()) or "/proc/bus/pci/devices" (show_device()). But "lspci -x" shows the config space values (I think "lspci -bv" should also show these) from the sysfs "config" file (pci_read_config()). It's definitely a kernel bug that we lost track of what's in config space. > In many cases, this still isn't a problem. Nothing uses the memory, > so nothing is affected. However, a big problem shows up when an IOMMU > is in use: the IOMMU will not reserve this space in the IOVA because the > kernel no longer thinks the range is valid. (See > dmar_init_reserved_ranges() for the Intel implementation of this.) > > Without the proper reserved range, we have a situation where a DMA > mapping may occasionally allocate an IOVA which the PCI bus will actually > route to a BAR in the PLX switch. This will result in some random DMA > writes not actually writing to the RAM they are supposed to, or random > DMA reads returning all FFs from the PLX BAR when it's supposed to have > read from RAM. > > The problem is caused in pci_assign_unassigned_root_bus_resources(). > When any resource from a bridge device fails to get assigned, the code > sets the resource's flags to zero. This makes sense for bridge resources, > as they will be re-enabled later, but for regular BARs, it disables them > permanently. To fix the problem, we only set the flags to zero for > bridge resources and treat any other resources like non-bridge devices. > > Reported-by: Kit Chow <kchow@gigaio.com> > Fixes: da7822e5ad71 ("PCI: update bridge resources to get more big ranges when allocating space (again)") > Signed-off-by: Logan Gunthorpe <logang@deltatee.com> > Cc: Bjorn Helgaas <bhelgaas@google.com> > Cc: Yinghai Lu <yinghai@kernel.org> > --- > drivers/pci/setup-bus.c | 7 ++++++- > 1 file changed, 6 insertions(+), 1 deletion(-) > > diff --git a/drivers/pci/setup-bus.c b/drivers/pci/setup-bus.c > index 0eb40924169b..7adbd4bedd16 100644 > --- a/drivers/pci/setup-bus.c > +++ b/drivers/pci/setup-bus.c > @@ -1784,11 +1784,16 @@ void pci_assign_unassigned_root_bus_resources(struct pci_bus *bus) > /* restore size and flags */ > list_for_each_entry(fail_res, &fail_head, list) { > struct resource *res = fail_res->res; > + int idx; > > res->start = fail_res->start; > res->end = fail_res->end; > res->flags = fail_res->flags; > - if (fail_res->dev->subordinate) > + > + idx = res - &fail_res->dev->resource[0]; > + if (fail_res->dev->subordinate && > + idx >= PCI_BRIDGE_RESOURCES && > + idx <= PCI_BRIDGE_RESOURCE_END) > res->flags = 0; In my ideal world we wouldn't zap the flags of any resource. I think we should derive the flags from the device's config space *once* during enumeration and remember them for the life of the device. This patch preserves res->flags for bridge BARs just like for any other device, so I think this is definitely a step in the right direction. I'm not sure the "dev->subordinate" test is really correct, though. I think the original intent of this code was to clear res->flags for bridge windows under the assumptions that (a) we can identify bridges by "dev->subordinate" being non-zero, and (b) bridges only have windows and didn't have BARs. This patch fixes assumption (b), but I think (a) is false, and we should fix it as well. One can imagine a bridge device without a subordinate bus (maybe we ran out of bus numbers), so I don't think we should test dev->subordinate. We could test something like pci_is_bridge(), although testing for idx being in the PCI_BRIDGE_RESOURCES range should be sufficient because I don't think we use those resource for anything other than windows. > } > free_list(&fail_head); > -- > 2.20.1 >
On 2019-06-17 7:53 a.m., Bjorn Helgaas wrote: >> diff --git a/drivers/pci/setup-bus.c b/drivers/pci/setup-bus.c >> index 0eb40924169b..7adbd4bedd16 100644 >> --- a/drivers/pci/setup-bus.c >> +++ b/drivers/pci/setup-bus.c >> @@ -1784,11 +1784,16 @@ void pci_assign_unassigned_root_bus_resources(struct pci_bus *bus) >> /* restore size and flags */ >> list_for_each_entry(fail_res, &fail_head, list) { >> struct resource *res = fail_res->res; >> + int idx; >> >> res->start = fail_res->start; >> res->end = fail_res->end; >> res->flags = fail_res->flags; >> - if (fail_res->dev->subordinate) >> + >> + idx = res - &fail_res->dev->resource[0]; >> + if (fail_res->dev->subordinate && >> + idx >= PCI_BRIDGE_RESOURCES && >> + idx <= PCI_BRIDGE_RESOURCE_END) >> res->flags = 0; > > In my ideal world we wouldn't zap the flags of any resource. I think > we should derive the flags from the device's config space *once* > during enumeration and remember them for the life of the device. Yes, I agree. The fact that this code seems to be constantly modifying everything makes it difficult to follow. When it clears the flags like this it's not clear if/where/how it will ever put them back. > This patch preserves res->flags for bridge BARs just like for any > other device, so I think this is definitely a step in the right > direction. > > I'm not sure the "dev->subordinate" test is really correct, though. > I think the original intent of this code was to clear res->flags for > bridge windows under the assumptions that (a) we can identify bridges > by "dev->subordinate" being non-zero, and (b) bridges only have > windows and didn't have BARs. Yes, I was also unsure of the reasoning behind the dev->subordinate test as well. But given that I didn't fully understand it, and it wasn't itself causing any problems, I elected to just change around it only for the bug I was trying to fix. > This patch fixes assumption (b), but I think (a) is false, and we > should fix it as well. One can imagine a bridge device without a > subordinate bus (maybe we ran out of bus numbers), so I don't think we > should test dev->subordinate. > > We could test something like pci_is_bridge(), although testing for idx > being in the PCI_BRIDGE_RESOURCES range should be sufficient because I > don't think we use those resource for anything other than windows. Ok, yes, there are a couple possibilities here and I'm unsure of the best thing to do. I agree that, right now, testing the idx for the range is probably sufficient. So logically we could probably just remove the dev->subordinate test. Assuming nobody decides to reuse the bridge indices for something else (which is probably a safe assumption). Though, testing for pci_is_bridge() would definitely be an improvement in terms of readability and the issues you point out. One way or another I can add a third patch to do this next time I submit this series. Thanks, Logan
On Mon, 2019-06-17 at 08:53 -0500, Bjorn Helgaas wrote: > On Fri, May 31, 2019 at 11:12:16AM -0600, Logan Gunthorpe wrote: > > One odd quirk of PLX switches is that their upstream bridge port has > > 256K of space allocated behind its BAR0 (most other bridge > > implementations do not report any BAR space). > > Somewhat unusual, but completely legal, of course. Ah yes, I've seen these. They have an MMIO path to their internal registers in addition to cfg. Can be annoying. > If a bridge has memory BARs, AFAIK it is impossible to enable a memory > window without also enabling the BARs, so if we want to use the bridge > at all, we *must* allocate space for its BARs, just like for any other > device. Right. .../... (agreeing violently) > In my ideal world we wouldn't zap the flags of any resource. I think > we should derive the flags from the device's config space *once* > during enumeration and remember them for the life of the device. Amen brother. It will take a little while to get there. One thing we should do is have a clearer way to mark a resource that failed to assign/allocate (though technically parent=NULL is it really, as long as all archs these days claim properly, it used not to be the case). We do wipe *bridge* windows (nor BARs) all over the place, that is less of an issue I suppose though I would be more comfortable if we also wrote to the bridge to close those windows as we do so... The problem of course is how much old weird quirky will break due to subtle assumptions as we "fix" these things :-) > This patch preserves res->flags for bridge BARs just like for any > other device, so I think this is definitely a step in the right > direction. > > I'm not sure the "dev->subordinate" test is really correct, though. Right, shouldn't it be pci_is_bridge() ? > I think the original intent of this code was to clear res->flags for > bridge windows under the assumptions that (a) we can identify bridges > by "dev->subordinate" being non-zero, and (b) bridges only have > windows and didn't have BARs. > > This patch fixes assumption (b), but I think (a) is false, and we > should fix it as well. One can imagine a bridge device without a > subordinate bus (maybe we ran out of bus numbers), so I don't think we > should test dev->subordinate. Yup. > We could test something like pci_is_bridge(), although testing for idx > being in the PCI_BRIDGE_RESOURCES range should be sufficient because I > don't think we use those resource for anything other than windows. Yeah quite possibly. Cheers, Ben.
diff --git a/drivers/pci/setup-bus.c b/drivers/pci/setup-bus.c index 0eb40924169b..7adbd4bedd16 100644 --- a/drivers/pci/setup-bus.c +++ b/drivers/pci/setup-bus.c @@ -1784,11 +1784,16 @@ void pci_assign_unassigned_root_bus_resources(struct pci_bus *bus) /* restore size and flags */ list_for_each_entry(fail_res, &fail_head, list) { struct resource *res = fail_res->res; + int idx; res->start = fail_res->start; res->end = fail_res->end; res->flags = fail_res->flags; - if (fail_res->dev->subordinate) + + idx = res - &fail_res->dev->resource[0]; + if (fail_res->dev->subordinate && + idx >= PCI_BRIDGE_RESOURCES && + idx <= PCI_BRIDGE_RESOURCE_END) res->flags = 0; } free_list(&fail_head);
One odd quirk of PLX switches is that their upstream bridge port has 256K of space allocated behind its BAR0 (most other bridge implementations do not report any BAR space). The lspci for such device looks like: 04:00.0 PCI bridge: PLX Technology, Inc. PEX 8724 24-Lane, 6-Port PCI Express Gen 3 (8 GT/s) Switch, 19 x 19mm FCBGA (rev ca) (prog-if 00 [Normal decode]) Physical Slot: 1 Flags: bus master, fast devsel, latency 0, IRQ 30, NUMA node 0 Memory at 90a00000 (32-bit, non-prefetchable) [size=256K] Bus: primary=04, secondary=05, subordinate=0a, sec-latency=0 I/O behind bridge: 00002000-00003fff Memory behind bridge: 90000000-909fffff Prefetchable memory behind bridge: 0000380000800000-0000380000bfffff Kernel driver in use: pcieport It's not clear what the purpose of the memory at 0x90a00000 is, and currently the kernel never actually uses it for anything. In most cases, it's safely ignored and does not cause a problem. However, when the kernel assigns the resource addresses (with the pci=realloc command line parameter, for example) it can inadvertently disable the struct resource corresponding to the bar. When this happens, lspci will report this memory as ignored: Region 0: Memory at <ignored> (32-bit, non-prefetchable) [size=256K] This is because the kernel reports a zero start address and zero flags in the corresponding sysfs resource file and in /proc/bus/pci/devices. Investigation with 'lspci -x', however shows the bios-assigned address will still be programmed in the device's BAR registers. In many cases, this still isn't a problem. Nothing uses the memory, so nothing is affected. However, a big problem shows up when an IOMMU is in use: the IOMMU will not reserve this space in the IOVA because the kernel no longer thinks the range is valid. (See dmar_init_reserved_ranges() for the Intel implementation of this.) Without the proper reserved range, we have a situation where a DMA mapping may occasionally allocate an IOVA which the PCI bus will actually route to a BAR in the PLX switch. This will result in some random DMA writes not actually writing to the RAM they are supposed to, or random DMA reads returning all FFs from the PLX BAR when it's supposed to have read from RAM. The problem is caused in pci_assign_unassigned_root_bus_resources(). When any resource from a bridge device fails to get assigned, the code sets the resource's flags to zero. This makes sense for bridge resources, as they will be re-enabled later, but for regular BARs, it disables them permanently. To fix the problem, we only set the flags to zero for bridge resources and treat any other resources like non-bridge devices. Reported-by: Kit Chow <kchow@gigaio.com> Fixes: da7822e5ad71 ("PCI: update bridge resources to get more big ranges when allocating space (again)") Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Yinghai Lu <yinghai@kernel.org> --- drivers/pci/setup-bus.c | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-)