Message ID | 20190312065724.28583-1-oohall@gmail.com |
---|---|
State | New |
Headers | show |
Series | [RFC] pci: Change bus number assignment policy | expand |
Context | Check | Description |
---|---|---|
snowpatch_ozlabs/apply_patch | success | Successfully applied on branch master (261ca8e779e5138869a45f174caa49be6a274501) |
snowpatch_ozlabs/snowpatch_job_snowpatch-skiboot | success | Test snowpatch/job/snowpatch-skiboot on branch master |
snowpatch_ozlabs/snowpatch_job_snowpatch-skiboot-dco | fail | Signed-off-by missing |
On Tue, Mar 12, 2019 at 5:57 PM Oliver O'Halloran <oohall@gmail.com> wrote: > > Change the way we assign bus numbers so that rather than assigning > bus numbers densely, we spread the available bus numbers across the > downstream bridges of a given topology. > > For PCIe topologies with a wide-fan out this allows hot-plugging > large numbers of devices into a downstream port near the top of the > PCIe topology. > > This patch changes the bus number assignment policy to evenly split the > available bus numbers across the bridges on bus. This allows us to > support hot-adding of large numbers of devices near the root of the > topology. Currently we only assign a handful of spare buses to each > hotplug port so if a large number of devices are added at the root > we won't have enough spare bus numbers for them. I really should have proof-read that. Ah well.
Hello Oliver, On 3/12/19 9:57 AM, Oliver O'Halloran wrote: > Change the way we assign bus numbers so that rather than assigning > bus numbers densely, we spread the available bus numbers across the > downstream bridges of a given topology. > > For PCIe topologies with a wide-fan out this allows hot-plugging > large numbers of devices into a downstream port near the top of the > PCIe topology. > > This patch changes the bus number assignment policy to evenly split the > available bus numbers across the bridges on bus. This allows us to > support hot-adding of large numbers of devices near the root of the > topology. Currently we only assign a handful of spare buses to each > hotplug port so if a large number of devices are added at the root > we won't have enough spare bus numbers for them. > > e.g. On a zaius system with a two-switch (HBA switch and drawer switch) topology: > > Before: > > -+-[0030:00]---00.0-[01-40]----00.0-[02-40]--+-04.0-[03-1e]----00.0-[04-1e]--+-08.0-[05]----00.0 > | | +-09.0-[06]----00.0 > | | +-0a.0-[07-0b]-- > | | +-0b.0-[0c-10]-- > | | +-10.0-[11]----00.0 > | | +-11.0-[12]----00.0 > | | +-12.0-[13-17]-- > | | +-13.0-[18-1c]-- > | | +-15.0-[1d]----00.0 > | | \-16.0-[1e]----00.0 > | +-05.0-[1f-36]----00.0-[20-36]--+-04.0-[21]----00.0 > | | +-05.0-[22]----00.0 > | | +-06.0-[23]----00.0 > | | +-07.0-[24]----00.0 > | | +-0c.0-[25-29]-- > | | +-0d.0-[2a]----00.0 > | | +-0e.0-[2b-2f]-- > | | +-0f.0-[30]----00.0 > | | +-14.0-[31-35]-- > | | \-17.0-[36]----00.0 > | +-06.0-[37-3b]-- > | \-07.0-[3c-40]-- > > After: > > -+-[0030:00]---00.0-[01-fe]----00.0-[02-fd]--+-04.0-[03-41]----00.0-[04-40]--+-08.0-[05-0a]----00.0 > | | +-09.0-[0b-10]----00.0 > | | +-0a.0-[11-16]-- > | | +-0b.0-[17-1c]-- > | | +-10.0-[1d-22]----00.0 > | | +-11.0-[23-28]----00.0 > | | +-12.0-[29-2e]-- > | | +-13.0-[2f-34]-- > | | +-15.0-[35-3a]----00.0 > | | \-16.0-[3b-40]----00.0 > | +-05.0-[42-80]----00.0-[43-7f]--+-04.0-[44-49]----00.0 > | | +-05.0-[4a-4f]-- > | | +-06.0-[50-55]----00.0 > | | +-07.0-[56-5b]----00.0 > | | +-0c.0-[5c-61]-- > | | +-0d.0-[62-67]----00.0 > | | +-0e.0-[68-6d]-- > | | +-0f.0-[6e-73]----00.0 > | | +-14.0-[74-79]-- > | | \-17.0-[7a-7f]----00.0 > | +-06.0-[81-bf]-- > | \-07.0-[c0-fe]-- > > This does however have the disadvantage that we can't really support > deep rather than wide topologies. In the above example you can see that > when we hit the lowest level there is only 5 buses available per port, > so if we had an architecture where downstream storage was daisy-chained > we would run out bus numbers pretty quickly. > > These are solvable problems, but I figure I should see what people think > before spending a lot of time on this. > > Not-Signed-off-by: Oliver O'Halloran <oohall@gmail.com> > Cc: Sergey Miroshnichenko <s.miroshnichenko@yadro.com> > --- > Sergey, would something like this work for the NVMe drawers you've been > working with? I think we'll need to support bus-number reassignment at > eventually, but if we could kick that can down the road a bit it'd be > helpful. Reserving more bus numbers per downstream port, covering all the 0-0xff range, would be definitely helpful for the current state of Linux kernel. But with the "pci=realloc" command line argument the kernel will reassign the bus numbers during boot: compactly or based on the "hpbussize" argument. I have a patchset for Linux that handles a bridge hot-plugged in the middle of a dense PCIe tree, we are actively using it internally for about a year on our Vesnin servers and also Xeons, but we was going to propose upstreaming it later - after the "Movable BARs" serie. These patches increment bus numbers of a "tail" of the tree, freeing a space for the new bridge, while all the working drivers remain working. The hardest part was rebuilding procfs and sysfs entries and simlinks. Best regards, Serge > --- > core/pci.c | 44 ++++++++++++++++---------------------------- > 1 file changed, 16 insertions(+), 28 deletions(-) > > diff --git a/core/pci.c b/core/pci.c > index 454b50102e59..d5537b6a376b 100644 > --- a/core/pci.c > +++ b/core/pci.c > @@ -743,9 +743,10 @@ uint8_t pci_scan_bus(struct phb *phb, uint8_t bus, uint8_t max_bus, > bool scan_downstream) > { > struct pci_device *pd = NULL, *rc = NULL; > - uint8_t dev, fn, next_bus, max_sub, save_max; > + uint8_t dev, fn, next_bus, max_sub; > uint32_t scan_map; > bool use_max; > + int bridges = 0, buses_per_bridge; > > /* Decide what to scan */ > scan_map = parent ? parent->scan_map : phb->scan_map; > @@ -810,7 +811,18 @@ uint8_t pci_scan_bus(struct phb *phb, uint8_t bus, uint8_t max_bus, > > next_bus = bus + 1; > max_sub = bus; > - save_max = max_bus; > + > + list_for_each(list, pd, link) > + if (pd->is_bridge) > + bridges++; > + > + buses_per_bridge = max_bus - next_bus - 1; > + if (bridges) > + buses_per_bridge /= bridges; > + > + PCIERR(phb, pd->bdfn, "found %d [%x:%x] downstream bridges, %sscanning down, %d\n", > + bridges, next_bus, max_bus, scan_downstream ? "" : "not ", > + buses_per_bridge); > > /* Scan down bridges */ > list_for_each(list, pd, link) { > @@ -819,32 +831,8 @@ uint8_t pci_scan_bus(struct phb *phb, uint8_t bus, uint8_t max_bus, > if (!pd->is_bridge) > continue; > > - /* We need to figure out a new bus number to start from. > - * > - * This can be tricky due to our HW constraints which differ > - * from bridge to bridge so we are going to let the phb > - * driver decide what to do. This can return us a maximum > - * bus number to assign as well > - * > - * This function will: > - * > - * - Return the bus number to use as secondary for the > - * bridge or 0 for a failure > - * > - * - "max_bus" will be adjusted to represent the max > - * subordinate that can be associated with the downstream > - * device > - * > - * - "use_max" will be set to true if the returned max_bus > - * *must* be used as the subordinate bus number of that > - * bridge (when we need to give aligned powers of two's > - * on P7IOC). If is is set to false, we just adjust the > - * subordinate bus number based on what we probed. > - * > - */ > - max_bus = save_max; > - next_bus = phb->ops->choose_bus(phb, pd, next_bus, > - &max_bus, &use_max); > + use_max = 1; > + max_bus = next_bus + buses_per_bridge; > > /* Configure the bridge with the returned values */ > if (next_bus <= bus) { >
diff --git a/core/pci.c b/core/pci.c index 454b50102e59..d5537b6a376b 100644 --- a/core/pci.c +++ b/core/pci.c @@ -743,9 +743,10 @@ uint8_t pci_scan_bus(struct phb *phb, uint8_t bus, uint8_t max_bus, bool scan_downstream) { struct pci_device *pd = NULL, *rc = NULL; - uint8_t dev, fn, next_bus, max_sub, save_max; + uint8_t dev, fn, next_bus, max_sub; uint32_t scan_map; bool use_max; + int bridges = 0, buses_per_bridge; /* Decide what to scan */ scan_map = parent ? parent->scan_map : phb->scan_map; @@ -810,7 +811,18 @@ uint8_t pci_scan_bus(struct phb *phb, uint8_t bus, uint8_t max_bus, next_bus = bus + 1; max_sub = bus; - save_max = max_bus; + + list_for_each(list, pd, link) + if (pd->is_bridge) + bridges++; + + buses_per_bridge = max_bus - next_bus - 1; + if (bridges) + buses_per_bridge /= bridges; + + PCIERR(phb, pd->bdfn, "found %d [%x:%x] downstream bridges, %sscanning down, %d\n", + bridges, next_bus, max_bus, scan_downstream ? "" : "not ", + buses_per_bridge); /* Scan down bridges */ list_for_each(list, pd, link) { @@ -819,32 +831,8 @@ uint8_t pci_scan_bus(struct phb *phb, uint8_t bus, uint8_t max_bus, if (!pd->is_bridge) continue; - /* We need to figure out a new bus number to start from. - * - * This can be tricky due to our HW constraints which differ - * from bridge to bridge so we are going to let the phb - * driver decide what to do. This can return us a maximum - * bus number to assign as well - * - * This function will: - * - * - Return the bus number to use as secondary for the - * bridge or 0 for a failure - * - * - "max_bus" will be adjusted to represent the max - * subordinate that can be associated with the downstream - * device - * - * - "use_max" will be set to true if the returned max_bus - * *must* be used as the subordinate bus number of that - * bridge (when we need to give aligned powers of two's - * on P7IOC). If is is set to false, we just adjust the - * subordinate bus number based on what we probed. - * - */ - max_bus = save_max; - next_bus = phb->ops->choose_bus(phb, pd, next_bus, - &max_bus, &use_max); + use_max = 1; + max_bus = next_bus + buses_per_bridge; /* Configure the bridge with the returned values */ if (next_bus <= bus) {