Message ID | 20171218050253.13478.49457.stgit@gimli.home |
---|---|
State | New |
Series | vfio/pci: MSI-X MMIO relocation |
On 18/12/17 16:02, Alex Williamson wrote:
> With recently proposed kernel side vfio-pci changes, the MSI-X vector
> table area can be mmap'd from userspace, allowing direct access to
> non-MSI-X registers within the host page size of this area.  However,
> we only get that direct access if QEMU isn't also emulating MSI-X
> within that same page.  For x86/64 host, the system page size is 4K
> and the PCI spec recommends a minimum of 4K to 8K alignment to
> separate MSI-X from non-MSI-X registers, therefore only devices which
> don't honor this recommendation would see any improvement from this
> option.  The real targets for this feature are hosts where the page
> size exceeds the PCI spec recommended alignment, such as ARM64 systems
> with 64K pages.
>
> This new x-msix-relocation option accepts the following options:
>
>   off: Disable MSI-X relocation, use native device config (default)
>   auto: Automatically relocate MSI-X MMIO to another BAR or offset
>         based on minimum additional MMIO requirement
>   bar0..bar5: Specify the target BAR, which will either be extended
>               if the BAR exists or added if the BAR slot is available.


While I am digesting the patchset, here are some test results.
This is the device:

00:00.0 Serial Attached SCSI controller: LSI Logic / Symbios Logic SAS3008
PCI-Express Fusion-MPT SAS-3 (rev 02)
        Memory at 210000000000 (64-bit, non-prefetchable) [size=64K]
        Memory at 210000040000 (64-bit, non-prefetchable) [size=256K]

        Capabilities: [c0] MSI-X: Enable+ Count=96 Masked-
                Vector table: BAR=1 offset=0000e000
                PBA: BAR=1 offset=0000f000


Test #1: x-msix-relocation = "off":

FlatView #1
 AS "memory", root: system
 AS "cpu-memory", root: system
 Root memory region: system
  0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram
  0000210000000000-000021000000dfff (prio 0, i/o): 0001:03:00.0 BAR 1
  000021000000e000-000021000000e5ff (prio 0, i/o): msix-table
  000021000000e600-000021000000ffff (prio 0, i/o): 0001:03:00.0 BAR 1 @000000000000e600
  0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]

Ok, works.


Test #2: x-msix-relocation = "auto":

FlatView #2
 AS "memory", root: system
 AS "cpu-memory", root: system
 Root memory region: system
  0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram
  0000200080000000-00002000800005ff (prio 0, i/o): msix-table
  0000200080000600-000020008000ffff (prio 1, i/o): 0001:03:00.0 base BAR 0 @0000000000000600
  0000210000000000-000021000000ffff (prio 0, i/o): 0001:03:00.0 BAR 1
  0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]


The guest fails probing because the first 64bit BAR is broken.
lspci:

        Region 0: Memory at 200080000000 (32-bit, prefetchable) [size=64K]
        Region 1: Memory at 210000000000 (64-bit, non-prefetchable) [size=64K]
        Region 3: Memory at 210000040000 (64-bit, non-prefetchable) [size=256K]

        Capabilities: [c0] MSI-X: Enable- Count=96 Masked-
                Vector table: BAR=0 offset=00000000
                PBA: BAR=0 offset=00000600


Test #3: x-msix-relocation = "bar1"

FlatView #1
 AS "memory", root: system
 AS "cpu-memory", root: system
 Root memory region: system
  0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram
  0000210000000000-000021000000ffff (prio 0, i/o): 0001:03:00.0 BAR 1
  0000210000010000-00002100000105ff (prio 0, i/o): msix-table
  0000210000010600-000021000001ffff (prio 1, i/o): 0001:03:00.0 base BAR 1 @0000000000010600
  0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]

Ok, works. BAR1 became 128K. However no part of BAR1 was mapped, i.e.
appear as "ramd" in flatview, should it have appeared?

This is "mtree":

memory-region: pci@800000020000000.mmio
  0000000000000000-ffffffffffffffff (prio 0, i/o): pci@800000020000000.mmio
    0000210000000000-000021000001ffff (prio 1, i/o): 0001:03:00.0 base BAR 1
      0000210000000000-000021000000ffff (prio 0, i/o): 0001:03:00.0 BAR 1
      0000210000010000-00002100000105ff (prio 0, i/o): msix-table
      0000210000010600-000021000001060f (prio 0, i/o): msix-pba [disabled]
    0000210000040000-000021000007ffff (prio 1, i/o): 0001:03:00.0 base BAR 3
      0000210000040000-000021000007ffff (prio 0, i/o): 0001:03:00.0 BAR 3
        0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]


Test #4: x-msix-relocation = "bar5"

The same net result as test #3: it works but BAR1 is not mapped:

        Region 1: Memory at 210000000000 (64-bit, non-prefetchable) [size=64K]
        Region 3: Memory at 210000040000 (64-bit, non-prefetchable) [size=256K]
        Region 5: Memory at 200080000000 (32-bit, prefetchable) [size=64K]

        Capabilities: [c0] MSI-X: Enable+ Count=96 Masked-
                Vector table: BAR=5 offset=00000000
                PBA: BAR=5 offset=00000600
FlatView #0
 AS "memory", root: system
 AS "cpu-memory", root: system
 Root memory region: system
  0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram
  0000200080000000-00002000800005ff (prio 0, i/o): msix-table
  0000200080000600-000020008000ffff (prio 1, i/o): 0001:03:00.0 base BAR 5 @0000000000000600
  0000210000000000-000021000000ffff (prio 0, i/o): 0001:03:00.0 BAR 1
  0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]

memory-region: pci@800000020000000.mmio
  0000000000000000-ffffffffffffffff (prio 0, i/o): pci@800000020000000.mmio
    0000000080000000-000000008000ffff (prio 1, i/o): 0001:03:00.0 base BAR 5
      0000000080000000-00000000800005ff (prio 0, i/o): msix-table
      0000000080000600-000000008000060f (prio 0, i/o): msix-pba [disabled]
    0000210000000000-000021000000ffff (prio 1, i/o): 0001:03:00.0 base BAR 1
      0000210000000000-000021000000ffff (prio 0, i/o): 0001:03:00.0 BAR 1
    0000210000040000-000021000007ffff (prio 1, i/o): 0001:03:00.0 base BAR 3
      0000210000040000-000021000007ffff (prio 0, i/o): 0001:03:00.0 BAR 3
        0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]


and there is also one minor comment below.
>
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> ---
>  hw/vfio/pci.c        |  102 ++++++++++++++++++++++++++++++++++++++++++++++++++
>  hw/vfio/pci.h        |    1
>  hw/vfio/trace-events |    2 +
>  3 files changed, 104 insertions(+), 1 deletion(-)
>
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index c383b842da20..b4426abf297a 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -1352,6 +1352,101 @@ static void vfio_pci_fixup_msix_region(VFIOPCIDevice *vdev)
>      }
>  }
>
> +static void vfio_pci_relocate_msix(VFIOPCIDevice *vdev)
> +{
> +    int target_bar = -1;
> +    size_t msix_sz;
> +
> +    if (!vdev->msix || vdev->msix_relo == OFF_AUTOPCIBAR_OFF) {
> +        return;
> +    }
> +
> +    /* The actual minimum size of MSI-X structures */
> +    msix_sz = (vdev->msix->entries * PCI_MSIX_ENTRY_SIZE) +
> +              (QEMU_ALIGN_UP(vdev->msix->entries, 64) / 8);
> +    /* Round up to host pages, we don't want to share a page */
> +    msix_sz = REAL_HOST_PAGE_ALIGN(msix_sz);
> +    /* PCI BARs must be a power of 2 */
> +    msix_sz = pow2ceil(msix_sz);
> +
> +    /* Auto: pick the BAR that incurs the least additional MMIO space */
> +    if (vdev->msix_relo == OFF_AUTOPCIBAR_AUTO) {
> +        int i;
> +        size_t best = UINT64_MAX;
> +
> +        for (i = 0; i < PCI_ROM_SLOT; i++) {
> +            size_t size;
> +
> +            if (vdev->bars[i].ioport) {
> +                continue;
> +            }
> +
> +            /* MSI-X MMIO must reside within first 32bit offset of BAR */
> +            if (vdev->bars[i].size > (UINT32_MAX / 2))
> +                continue;
> +
> +            /*
> +             * Must be pow2, so larger of double existing or double msix_sz,
> +             * or if BAR unimplemented, msix_sz
> +             */
> +            size = MAX(vdev->bars[i].size * 2,
> +                       vdev->bars[i].size ? msix_sz * 2 : msix_sz);
> +
> +            trace_vfio_msix_relo_cost(vdev->vbasedev.name, i, size);
> +
> +            if (size < best) {
> +                best = size;
> +                target_bar = i;
> +            }
> +
> +            if (vdev->bars[i].mem64) {
> +                i++;
> +            }
> +        }
> +    } else {
> +        target_bar = (int)(vdev->msix_relo - OFF_AUTOPCIBAR_BAR0);
> +    }
> +
> +    if (target_bar < 0 || vdev->bars[target_bar].ioport ||
> +        (!vdev->bars[target_bar].size &&
> +         target_bar > 0 && vdev->bars[target_bar - 1].mem64)) {
> +        return; /* Go BOOM?  Plumb Error */
> +    }
> +
> +    /*
> +     * If adding a new BAR, test if we can make it 64bit.  We make it
> +     * prefetchable since QEMU MSI-X emulation has no read side effects
> +     * and doing so makes mapping more flexible.
> +     */
> +    if (!vdev->bars[target_bar].size) {
> +        if (target_bar < (PCI_ROM_SLOT - 1) &&
> +            !vdev->bars[target_bar + 1].size) {
> +            vdev->bars[target_bar].mem64 = true;
> +            vdev->bars[target_bar].type = PCI_BASE_ADDRESS_MEM_TYPE_64;
> +        }
> +        vdev->bars[target_bar].type |= PCI_BASE_ADDRESS_MEM_PREFETCH;
> +        vdev->bars[target_bar].size = msix_sz;
> +        vdev->msix->table_offset = 0;
> +    } else {
> +        vdev->bars[target_bar].size = MAX(vdev->bars[target_bar].size * 2,
> +                                          msix_sz * 2);
> +        /*
> +         * Due to above size calc, MSI-X always starts halfway into the BAR,
> +         * which will always be a separate host page.
> +         */
> +        vdev->msix->table_offset = vdev->bars[target_bar].size / 2;
> +    }
> +
> +    vdev->msix->table_bar = target_bar;
> +    vdev->msix->pba_bar = target_bar;
> +    /* Requires 8-byte alignment, but PCI_MSIX_ENTRY_SIZE guarantees that */
> +    vdev->msix->pba_offset = vdev->msix->table_offset +
> +                             (vdev->msix->entries * PCI_MSIX_ENTRY_SIZE);
> +
> +    trace_vfio_msix_relo(vdev->vbasedev.name,
> +                         vdev->msix->table_bar, vdev->msix->table_offset);
> +}
> +
>  /*
>   * We don't have any control over how pci_add_capability() inserts
>   * capabilities into the chain.  In order to setup MSI-X we need a
> @@ -1430,6 +1525,8 @@ static void vfio_msix_early_setup(VFIOPCIDevice *vdev, Error **errp)
>      vdev->msix = msix;
>
>      vfio_pci_fixup_msix_region(vdev);
> +
> +    vfio_pci_relocate_msix(vdev);
>  }
>
>  static int vfio_msix_setup(VFIOPCIDevice *vdev, int pos, Error **errp)
> @@ -2845,13 +2942,14 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
>
>      vfio_pci_size_rom(vdev);
>
> +    vfio_bars_prepare(vdev);
> +
>      vfio_msix_early_setup(vdev, &err);
>      if (err) {
>          error_propagate(errp, err);
>          goto error;
>      }
>
> -    vfio_bars_prepare(vdev);


This could be in 2/5.


>      vfio_bars_register(vdev);
>
>      ret = vfio_add_capabilities(vdev, errp);
> @@ -3041,6 +3139,8 @@ static Property vfio_pci_dev_properties[] = {
>      DEFINE_PROP_UNSIGNED_NODEFAULT("x-nv-gpudirect-clique", VFIOPCIDevice,
>                                     nv_gpudirect_clique,
>                                     qdev_prop_nv_gpudirect_clique, uint8_t),
> +    DEFINE_PROP_OFF_AUTO_PCIBAR("x-msix-relocation", VFIOPCIDevice, msix_relo,
> +                                OFF_AUTOPCIBAR_OFF),
>      /*
>       * TODO - support passed fds... is this necessary?
>       * DEFINE_PROP_STRING("vfiofd", VFIOPCIDevice, vfiofd_name),
> diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
> index dcdb1a806769..588381f201b4 100644
> --- a/hw/vfio/pci.h
> +++ b/hw/vfio/pci.h
> @@ -135,6 +135,7 @@ typedef struct VFIOPCIDevice {
>      (1 << VFIO_FEATURE_ENABLE_IGD_OPREGION_BIT)
>      int32_t bootindex;
>      uint32_t igd_gms;
> +    OffAutoPCIBAR msix_relo;
>      uint8_t pm_cap;
>      uint8_t nv_gpudirect_clique;
>      bool pci_aer;
> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> index fae096c0724f..437ccdd29053 100644
> --- a/hw/vfio/trace-events
> +++ b/hw/vfio/trace-events
> @@ -16,6 +16,8 @@ vfio_msix_pba_disable(const char *name) " (%s)"
>  vfio_msix_pba_enable(const char *name) " (%s)"
>  vfio_msix_disable(const char *name) " (%s)"
>  vfio_msix_fixup(const char *name, int bar, uint64_t start, uint64_t end) " (%s) MSI-X region %d mmap fixup [0x%"PRIx64" - 0x%"PRIx64"]"
> +vfio_msix_relo_cost(const char *name, int bar, uint64_t cost) " (%s) BAR %d cost 0x%"PRIx64""
> +vfio_msix_relo(const char *name, int bar, uint64_t offset) " (%s) BAR %d offset 0x%"PRIx64""
>  vfio_msi_enable(const char *name, int nr_vectors) " (%s) Enabled %d MSI vectors"
>  vfio_msi_disable(const char *name) " (%s)"
>  vfio_pci_load_rom(const char *name, unsigned long size, unsigned long offset, unsigned long flags) "Device %s ROM:\n size: 0x%lx, offset: 0x%lx, flags: 0x%lx"
>
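Editorial sketch: the size arithmetic in the quoted vfio_pci_relocate_msix() may be easier to follow outside QEMU. The standalone C below is only an illustration under stated assumptions. The helpers align_up(), pow2_ceil(), msix_mmio_size() and relocate_layout(), and the struct msix_layout, are hypothetical stand-ins for QEMU_ALIGN_UP, REAL_HOST_PAGE_ALIGN, pow2ceil and the vdev fields in the patch. The host page size is passed as a parameter here rather than queried from the host.

```c
#include <assert.h>
#include <stdint.h>

#define MSIX_ENTRY_SIZE 16u /* bytes per vector table entry (PCI_MSIX_ENTRY_SIZE) */

/* Round x up to a multiple of align; align must be a power of two. */
static uint64_t align_up(uint64_t x, uint64_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (x + align - 1) & ~(align - 1);
}

/* Smallest power of two that is >= x. */
static uint64_t pow2_ceil(uint64_t x)
{
    uint64_t p = 1;

    assert(x != 0 && x <= (UINT64_C(1) << 63));
    while (p < x) {
        p <<= 1;
    }
    return p;
}

/* Vector table plus PBA, rounded to whole host pages, then to a power of two. */
static uint64_t msix_mmio_size(uint32_t entries, uint64_t host_page)
{
    uint64_t sz = (uint64_t)entries * MSIX_ENTRY_SIZE /* vector table */
                  + align_up(entries, 64) / 8;       /* one pending bit per vector */

    return pow2_ceil(align_up(sz, host_page));
}

/* Hypothetical result type: where the MSI-X structures end up. */
struct msix_layout {
    uint64_t bar_size;     /* size of the target BAR after relocation */
    uint64_t table_offset; /* vector table offset within that BAR */
    uint64_t pba_offset;   /* PBA offset within that BAR */
};

/* bar_size == 0 means the target slot is unimplemented and a new BAR is added. */
static struct msix_layout relocate_layout(uint64_t bar_size, uint32_t entries,
                                          uint64_t host_page)
{
    uint64_t msix_sz = msix_mmio_size(entries, host_page);
    struct msix_layout l;

    if (bar_size == 0) {
        l.bar_size = msix_sz;
        l.table_offset = 0;
    } else {
        uint64_t dbl_bar = bar_size * 2;
        uint64_t dbl_msix = msix_sz * 2;

        /* Keep the BAR a power of two; MSI-X lands in the upper half. */
        l.bar_size = dbl_bar > dbl_msix ? dbl_bar : dbl_msix;
        l.table_offset = l.bar_size / 2;
    }
    l.pba_offset = l.table_offset + (uint64_t)entries * MSIX_ENTRY_SIZE;
    return l;
}
```

For the SAS3008 in the tests above (Count=96), and assuming 64K host pages, the table is 0x600 bytes and the PBA 0x10 bytes, which rounds up to a single 64K page. Extending the 64K BAR1 then gives a 128K BAR with the table at 0x10000 and the PBA at 0x10600, in line with the mtree output of test #3.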
On Mon, 18 Dec 2017 20:04:23 +1100
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

> On 18/12/17 16:02, Alex Williamson wrote:
> > [...]
>
>
> While I am digesting the patchset, here are some test results.

Thanks for testing!

> This is the device:
>
> 00:00.0 Serial Attached SCSI controller: LSI Logic / Symbios Logic SAS3008
> PCI-Express Fusion-MPT SAS-3 (rev 02)

BAR1:

>         Memory at 210000000000 (64-bit, non-prefetchable) [size=64K]

BAR3:

>         Memory at 210000040000 (64-bit, non-prefetchable) [size=256K]
>
>         Capabilities: [c0] MSI-X: Enable+ Count=96 Masked-
>                 Vector table: BAR=1 offset=0000e000
>                 PBA: BAR=1 offset=0000f000
>
>
> Test #1: x-msix-relocation = "off":
>
> FlatView #1
>  AS "memory", root: system
>  AS "cpu-memory", root: system
>  Root memory region: system
>   0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram
>   0000210000000000-000021000000dfff (prio 0, i/o): 0001:03:00.0 BAR 1
>   000021000000e000-000021000000e5ff (prio 0, i/o): msix-table
>   000021000000e600-000021000000ffff (prio 0, i/o): 0001:03:00.0 BAR 1
> @000000000000e600
>   0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
>
> Ok, works.
>
>
> Test #2: x-msix-relocation = "auto":
>
> FlatView #2
>  AS "memory", root: system
>  AS "cpu-memory", root: system
>  Root memory region: system
>   0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram
>   0000200080000000-00002000800005ff (prio 0, i/o): msix-table
>   0000200080000600-000020008000ffff (prio 1, i/o): 0001:03:00.0 base BAR 0
> @0000000000000600
>   0000210000000000-000021000000ffff (prio 0, i/o): 0001:03:00.0 BAR 1
>   0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
>
>
> The guest fails probing because the first 64bit BAR is broken.
>
> lspci:
>
>         Region 0: Memory at 200080000000 (32-bit, prefetchable) [size=64K]
>         Region 1: Memory at 210000000000 (64-bit, non-prefetchable) [size=64K]
>         Region 3: Memory at 210000040000 (64-bit, non-prefetchable) [size=256K]
>
>         Capabilities: [c0] MSI-X: Enable- Count=96 Masked-
>                 Vector table: BAR=0 offset=00000000
>                 PBA: BAR=0 offset=00000600

Why do you suppose it's broken?  The added BAR0 is 32bit, it cannot be
64bit since BAR1 is implemented.
I don't see anything fundamentally
different between this and the working BAR5 test below.

> Test #3: x-msix-relocation = "bar1"
>
>
> FlatView #1
>  AS "memory", root: system
>  AS "cpu-memory", root: system
>  Root memory region: system
>   0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram
>   0000210000000000-000021000000ffff (prio 0, i/o): 0001:03:00.0 BAR 1
>   0000210000010000-00002100000105ff (prio 0, i/o): msix-table
>   0000210000010600-000021000001ffff (prio 1, i/o): 0001:03:00.0 base BAR 1
> @0000000000010600
>   0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
>
> Ok, works. BAR1 became 128K. However no part of BAR1 was mapped, i.e.
> appear as "ramd" in flatview, should it have appeared?
>
> This is "mtree":
>
> memory-region: pci@800000020000000.mmio
>   0000000000000000-ffffffffffffffff (prio 0, i/o): pci@800000020000000.mmio
>     0000210000000000-000021000001ffff (prio 1, i/o): 0001:03:00.0 base BAR 1
>       0000210000000000-000021000000ffff (prio 0, i/o): 0001:03:00.0 BAR 1
>       0000210000010000-00002100000105ff (prio 0, i/o): msix-table
>       0000210000010600-000021000001060f (prio 0, i/o): msix-pba [disabled]
>     0000210000040000-000021000007ffff (prio 1, i/o): 0001:03:00.0 base BAR 3
>       0000210000040000-000021000007ffff (prio 0, i/o): 0001:03:00.0 BAR 3
>         0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR
> 3 mmaps[0]

Did you disable vfio_pci_fixup_msix_region() as noted in 0/5?  This
series doesn't do anything about consuming the new MSI-X mappable flag
that you introduced in the kernel, so vfio_pci_fixup_msix_region() will
continue to exclude mmap'ing the 64K page overlapping the actual BAR.

> Test #4: x-msix-relocation = "bar5"
>
> The same net result as test #3: it works but BAR1 is not mapped:
>
>
>         Region 1: Memory at 210000000000 (64-bit, non-prefetchable) [size=64K]
>         Region 3: Memory at 210000040000 (64-bit, non-prefetchable) [size=256K]
>         Region 5: Memory at 200080000000 (32-bit, prefetchable) [size=64K]
>
>         Capabilities: [c0] MSI-X: Enable+ Count=96 Masked-
>                 Vector table: BAR=5 offset=00000000
>                 PBA: BAR=5 offset=00000600
>
> FlatView #0
>  AS "memory", root: system
>  AS "cpu-memory", root: system
>  Root memory region: system
>   0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram
>   0000200080000000-00002000800005ff (prio 0, i/o): msix-table
>   0000200080000600-000020008000ffff (prio 1, i/o): 0001:03:00.0 base BAR 5
> @0000000000000600
>   0000210000000000-000021000000ffff (prio 0, i/o): 0001:03:00.0 BAR 1
>   0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
>
>
> memory-region: pci@800000020000000.mmio
>   0000000000000000-ffffffffffffffff (prio 0, i/o): pci@800000020000000.mmio
>     0000000080000000-000000008000ffff (prio 1, i/o): 0001:03:00.0 base BAR 5
>       0000000080000000-00000000800005ff (prio 0, i/o): msix-table
>       0000000080000600-000000008000060f (prio 0, i/o): msix-pba [disabled]
>     0000210000000000-000021000000ffff (prio 1, i/o): 0001:03:00.0 base BAR 1
>       0000210000000000-000021000000ffff (prio 0, i/o): 0001:03:00.0 BAR 1
>     0000210000040000-000021000007ffff (prio 1, i/o): 0001:03:00.0 base BAR 3
>       0000210000040000-000021000007ffff (prio 0, i/o): 0001:03:00.0 BAR 3
>         0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR
> 3 mmaps[0]

As above, you won't get the mmap without disabling the implicit page
exclusion.  The real question for this case is why does it work while
'auto' came up with a nearly identical layout, swapping BAR5 for BAR0
and it did not work.  The placement of the BARs is even the same.

> and there is also one minor comment below.
> >
> > @@ -2845,13 +2942,14 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
> >
> >      vfio_pci_size_rom(vdev);
> >
> > +    vfio_bars_prepare(vdev);
> > +
> >      vfio_msix_early_setup(vdev, &err);
> >      if (err) {
> >          error_propagate(errp, err);
> >          goto error;
> >      }
> >
> > -    vfio_bars_prepare(vdev);
>
>
> This could be in 2/5.

It could, but 2/5 was attempting to add the base BAR MemoryRegion and
split vfio_bars_setup() into vfio_bars_prepare() and
vfio_bars_register() without otherwise changing the ordering.  It's
only when we want to modify BARs between prepare and register that we
need to make this change, thus it's done here.  Thanks,

Alex
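Editorial sketch: the "auto" result in test #2 can be traced by hand through the cost loop of the quoted patch. The C below is a simplified model, not the QEMU code. struct bar_info and pick_msix_bar() are hypothetical names, and the BAR table in the usage note is taken from the guest-visible lspci output above. One deliberate difference: a BAR that is skipped for being too large still has its upper slot stepped over here, whereas the quoted loop, as I read it, would `continue` before reaching the mem64 increment.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_BARS 6 /* BAR0..BAR5 */

/* Hypothetical, simplified view of one BAR slot. */
struct bar_info {
    uint64_t size; /* 0 means the slot is unimplemented */
    bool ioport;   /* I/O port BAR, never a candidate */
    bool mem64;    /* 64-bit BAR, also occupies the next slot */
};

static uint64_t max_u64(uint64_t a, uint64_t b)
{
    return a > b ? a : b;
}

/*
 * Return the BAR index that needs the least MMIO after relocation,
 * or -1 if no slot qualifies. Ties go to the lowest index because
 * only a strictly smaller cost replaces the current best.
 */
static int pick_msix_bar(const struct bar_info bars[NUM_BARS], uint64_t msix_sz)
{
    uint64_t best = UINT64_MAX;
    int target = -1;

    for (int i = 0; i < NUM_BARS; i++) {
        const struct bar_info *bar = &bars[i];
        /* MSI-X offsets are 32 bit, so the doubled BAR must stay below 4G. */
        bool usable = !bar->ioport && bar->size <= UINT32_MAX / 2;

        if (usable) {
            /* New BAR: just msix_sz. Existing BAR: double it, at least 2 * msix_sz. */
            uint64_t cost = bar->size
                                ? max_u64(bar->size * 2, msix_sz * 2)
                                : msix_sz;
            if (cost < best) {
                best = cost;
                target = i;
            }
        }
        if (bar->mem64) {
            i++; /* skip the slot holding the upper 32 bits */
        }
    }
    return target;
}
```

With a 64K 64-bit BAR1, a 256K 64-bit BAR3 and msix_sz of 64K, the costs are 64K for BAR0, 128K for BAR1, 512K for BAR3 and 64K for BAR5. BAR0 wins the tie with BAR5 on index order, which would explain why "auto" and "bar5" produce the same placement but in different slots.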
On 19/12/17 00:28, Alex Williamson wrote:
> On Mon, 18 Dec 2017 20:04:23 +1100
> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>
>> [...]
>>
>> Test #2: x-msix-relocation = "auto":
>>
>> [...]
>>
>> The guest fails probing because the first 64bit BAR is broken.
>>
>> lspci:
>>
>>         Region 0: Memory at 200080000000 (32-bit, prefetchable) [size=64K]
>>         Region 1: Memory at 210000000000 (64-bit, non-prefetchable) [size=64K]
>>         Region 3: Memory at 210000040000 (64-bit, non-prefetchable) [size=256K]
>>
>>         Capabilities: [c0] MSI-X: Enable- Count=96 Masked-
>>                 Vector table: BAR=0 offset=00000000
>>                 PBA: BAR=0 offset=00000600
>
> Why do you suppose it's broken?  The added BAR0 is 32bit, it cannot be
> 64bit since BAR1 is implemented.  I don't see anything fundamentally
> different between this and the working BAR5 test below.


BAR1 (0x14..0x17) uses BAR0 (0x10..0x13) as upper 32bits when it is 64bit
BAR, no?


>
>> Test #3: x-msix-relocation = "bar1"
>>
>> [...]
>>
>> Ok, works. BAR1 became 128K. However no part of BAR1 was mapped, i.e.
>> appear as "ramd" in flatview, should it have appeared?
>>
>> [...]
>
> Did you disable vfio_pci_fixup_msix_region() as noted in 0/5?  This
> series doesn't do anything about consuming the new MSI-X mappable flag
> that you introduced in the kernel, so vfio_pci_fixup_msix_region() will
> continue to exclude mmap'ing the 64K page overlapping the actual BAR.


Ah, my bad, I've read this but when I got to testing - forgot. Sorry for
the noise, tests 3 and 4 mmap as expected with fixup disabled.
On Tue, 19 Dec 2017 00:55:32 +1100
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

> On 19/12/17 00:28, Alex Williamson wrote:
> > On Mon, 18 Dec 2017 20:04:23 +1100
> > Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> >
> >> [...]
> >>
> >> The guest fails probing because the first 64bit BAR is broken.
> >>
> >> lspci:
> >>
> >>         Region 0: Memory at 200080000000 (32-bit, prefetchable) [size=64K]
> >>         Region 1: Memory at 210000000000 (64-bit, non-prefetchable) [size=64K]
> >>         Region 3: Memory at 210000040000 (64-bit, non-prefetchable) [size=256K]
> >>
> >>         Capabilities: [c0] MSI-X: Enable- Count=96 Masked-
> >>                 Vector table: BAR=0 offset=00000000
> >>                 PBA: BAR=0 offset=00000600
> >
> > Why do you suppose it's broken?  The added BAR0 is 32bit, it cannot be
> > 64bit since BAR1 is implemented.  I don't see anything fundamentally
> > different between this and the working BAR5 test below.
>
>
> BAR1 (0x14..0x17) uses BAR0 (0x10..0x13) as upper 32bits when it is 64bit
> BAR, no?

AIUI, if BAR1 is 64bit, it consumes 0x14-0x17 for the lower 32bits and
0x18-0x1b for the upper 32bits, i.e. it consumes BAR1 + BAR2.  Likewise
the 64bit BAR3 also consumes BAR4.  See for instance the 82576
datasheet:

https://www.intel.com/content/dam/www/public/us/en/documents/datasheets/82576eb-gigabit-ethernet-controller-datasheet.pdf

9.4.11.2 shows the BAR configuration in 64bit mode, 64bit BAR0 consumes
BAR0 (lower) + BAR1 (upper), 64bit BAR2 consumes BAR2 (lower) + BAR3
(upper), and the MSI-X BAR becomes 64bit at BAR4, consuming BAR4
(lower) + BAR5 (upper).  lspci would show this as Region 0, 2, 4.  The
layout of your SAS card does seem poorly thought out in that they've
essentially precluded a 3rd 64bit BAR by starting with BAR1, but
perhaps it's for compatibility with an equally poorly designed 32bit
version of the device.  Thanks,

Alex
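Editorial sketch: the slot accounting described above (a 64-bit BAR at index n holds its low dword at 0x10 + 4n and its high dword in the next slot) can be illustrated with a small decoder. This is a sketch with hypothetical names (struct bar_slot, bar_cfg_offset(), decode_bars()), not code from QEMU or the kernel. It only inspects raw register values, so an all-zero dword is treated as "nothing decodable" and unimplemented slots are not told apart from unassigned 32-bit BARs.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_BARS 6 /* BAR0..BAR5 */

/* Hypothetical decoded view of one BAR slot. */
struct bar_slot {
    bool present;  /* slot starts a BAR (false for the upper half of a 64-bit BAR) */
    bool io;       /* bit 0 set: I/O space */
    bool mem64;    /* type field (bits 2:1) == 10b: 64-bit memory BAR */
    bool prefetch; /* bit 3 set: prefetchable memory */
    uint64_t addr; /* base address with the flag bits masked off */
};

/* Config space offset of BAR n. */
static unsigned bar_cfg_offset(unsigned n)
{
    return 0x10u + 4u * n;
}

static void decode_bars(const uint32_t raw[NUM_BARS],
                        struct bar_slot out[NUM_BARS])
{
    for (int i = 0; i < NUM_BARS; i++) {
        struct bar_slot s = {0};
        uint32_t v = raw[i];

        if (v & 0x1) {
            s.present = true;
            s.io = true;
            s.addr = v & ~UINT32_C(0x3);
        } else if (v != 0) {
            s.present = true;
            s.mem64 = ((v >> 1) & 0x3) == 0x2;
            s.prefetch = (v & 0x8) != 0;
            s.addr = v & ~UINT32_C(0xf);
        }
        out[i] = s;

        if (s.mem64 && i + 1 < NUM_BARS) {
            /* The following slot carries the upper 32 address bits. */
            out[i].addr |= (uint64_t)raw[i + 1] << 32;
            out[i + 1] = (struct bar_slot){0};
            i++;
        }
    }
}
```

Fed the dwords 0x00000001, 0x80140004, 0, 0x80100004, 0, 0, this reports an I/O BAR in slot 0, 64-bit memory BARs in slots 1 and 3, and slots 2 and 4 consumed as their upper halves, so lspci numbering them Region 1 and Region 3 is what one would expect.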
On 19/12/17 01:28, Alex Williamson wrote: > On Tue, 19 Dec 2017 00:55:32 +1100 > Alexey Kardashevskiy <aik@ozlabs.ru> wrote: > >> On 19/12/17 00:28, Alex Williamson wrote: >>> On Mon, 18 Dec 2017 20:04:23 +1100 >>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote: >>> >>>> On 18/12/17 16:02, Alex Williamson wrote: >>>>> With recently proposed kernel side vfio-pci changes, the MSI-X vector >>>>> table area can be mmap'd from userspace, allowing direct access to >>>>> non-MSI-X registers within the host page size of this area. However, >>>>> we only get that direct access if QEMU isn't also emulating MSI-X >>>>> within that same page. For x86/64 host, the system page size is 4K >>>>> and the PCI spec recommends a minimum of 4K to 8K alignment to >>>>> separate MSI-X from non-MSI-X registers, therefore only devices which >>>>> don't honor this recommendation would see any improvement from this >>>>> option. The real targets for this feature are hosts where the page >>>>> size exceeds the PCI spec recommended alignment, such as ARM64 systems >>>>> with 64K pages. >>>>> >>>>> This new x-msix-relocation option accepts the following options: >>>>> >>>>> off: Disable MSI-X relocation, use native device config (default) >>>>> auto: Automaically relocate MSI-X MMIO to another BAR or offset >>>>> based on minimum additional MMIO requirement >>>>> bar0..bar5: Specify the target BAR, which will either be extended >>>>> if the BAR exists or added if the BAR slot is available. >>>> >>>> >>>> While I am digesting the patchset, here are some test results. >>> >>> Thanks for testing! 
>>> >>>> This is the device: >>>> >>>> 00:00.0 Serial Attached SCSI controller: LSI Logic / Symbios Logic SAS3008 >>>> PCI-Express Fusion-MPT SAS-3 (rev 02) >>> >>> BAR1: >>> >>>> Memory at 210000000000 (64-bit, non-prefetchable) [size=64K] >>> >>> BAR3: >>> >>>> Memory at 210000040000 (64-bit, non-prefetchable) [size=256K] >>>> >>>> Capabilities: [c0] MSI-X: Enable+ Count=96 Masked- >>>> Vector table: BAR=1 offset=0000e000 >>>> PBA: BAR=1 offset=0000f000 >>>> >>>> >>>> Test #1: x-msix-relocation = "off": >>>> >>>> FlatView #1 >>>> AS "memory", root: system >>>> AS "cpu-memory", root: system >>>> Root memory region: system >>>> 0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram >>>> 0000210000000000-000021000000dfff (prio 0, i/o): 0001:03:00.0 BAR 1 >>>> 000021000000e000-000021000000e5ff (prio 0, i/o): msix-table >>>> 000021000000e600-000021000000ffff (prio 0, i/o): 0001:03:00.0 BAR 1 >>>> @000000000000e600 >>>> 0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0] >>>> >>>> Ok, works. >>>> >>>> >>>> Test #2: x-msix-relocation = "auto": >>>> >>>> FlatView #2 >>>> AS "memory", root: system >>>> AS "cpu-memory", root: system >>>> Root memory region: system >>>> 0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram >>>> 0000200080000000-00002000800005ff (prio 0, i/o): msix-table >>>> 0000200080000600-000020008000ffff (prio 1, i/o): 0001:03:00.0 base BAR 0 >>>> @0000000000000600 >>>> 0000210000000000-000021000000ffff (prio 0, i/o): 0001:03:00.0 BAR 1 >>>> 0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0] >>>> >>>> >>>> The guest fails probing because the first 64bit BAR is broken. 
>>>>
>>>> lspci:
>>>>
>>>> Region 0: Memory at 200080000000 (32-bit, prefetchable) [size=64K]
>>>> Region 1: Memory at 210000000000 (64-bit, non-prefetchable) [size=64K]
>>>> Region 3: Memory at 210000040000 (64-bit, non-prefetchable) [size=256K]
>>>>
>>>> Capabilities: [c0] MSI-X: Enable- Count=96 Masked-
>>>>     Vector table: BAR=0 offset=00000000
>>>>     PBA: BAR=0 offset=00000600
>>>
>>> Why do you suppose it's broken?  The added BAR0 is 32bit, it cannot be
>>> 64bit since BAR1 is implemented.  I don't see anything fundamentally
>>> different between this and the working BAR5 test below.
>>
>>
>> BAR1 (0x14..0x17) uses BAR0 (0x10..0x13) as upper 32bits when it is 64bit
>> BAR, no?
>
> AIUI, if BAR1 is 64bit, it consumes 0x14-0x17 for the lower 32bits and
> 0x18-0x1b for the upper 32bits, ie. it consumes BAR1 + BAR2.  Likewise
> the 64bit BAR3 also consumes BAR4.  See for instance the 82576
> datasheet:
>
> https://www.intel.com/content/dam/www/public/us/en/documents/datasheets/82576eb-gigabit-ethernet-controller-datasheet.pdf
>
> 9.4.11.2 shows the BAR configuration in 64bit mode, 64bit BAR0 consumes
> BAR0 (lower) + BAR1 (upper), 64bit BAR2 consumes BAR2 (lower) + BAR3
> (upper), and the MSI-X BAR becomes 64bit at BAR4, consuming BAR4
> (lower) + BAR5 (upper).  lspci would show this as Region 0, 2, 4.  The
> layout of your SAS card does seem poorly thought out in that they've
> essentially precluded a 3rd 64bit BAR by starting with BAR1, but
> perhaps it's for compatibility with an equally poorly designed 32bit
> version of the device.  Thanks,

Ah, makes sense, I just never saw 64bit BARs starting from an odd offset.
My card is weird^Wunusual then:

aik@stratton2:~$ lspci -vbxs 0001:03:00.0
0001:03:00.0 Serial Attached SCSI controller: LSI Logic / Symbios Logic SAS3008 PCI-Express Fusion-MPT SAS-3 (rev 02)
        Subsystem: Super Micro Computer Inc SAS3008 PCI-Express Fusion-MPT SAS-3
        Flags: bus master, fast devsel, latency 0
        I/O ports at <unassigned> [disabled]
        Memory at 80140000 (64-bit, non-prefetchable)
        Memory at 80100000 (64-bit, non-prefetchable)
        Capabilities: <access denied>
        Kernel driver in use: vfio-pci
        Kernel modules: mpt3sas
00: 00 10 97 00 46 05 10 00 02 00 07 01 00 00 00 00
10: 01 00 00 00 04 00 14 80 00 00 00 00 04 00 10 80
20: 00 00 00 00 00 00 00 00 00 00 00 00 d9 15 08 08
30: 00 00 00 00 50 00 00 00 00 00 00 00 00 01 00 00

The mpt3sas driver is funny too - it fails probing with MSIX in bar0 but
succeeds with bar5.

Region 1: Memory at 210000000000 (64-bit, non-prefetchable)
Region 3: Memory at 210000040000 (64-bit, non-prefetchable)
Region 5: Memory at 80000000 (32-bit, prefetchable)

Capabilities: [c0] MSI-X: Enable+ Count=96 Masked-
        Vector table: BAR=5 offset=00000000
        PBA: BAR=5 offset=00000600

vs.

Region 0: Memory at 80000000 (32-bit, prefetchable)
Region 1: Memory at 210000000000 (64-bit, non-prefetchable)
Region 3: Memory at 210000040000 (64-bit, non-prefetchable)

Capabilities: [c0] MSI-X: Enable- Count=96 Masked-
        Vector table: BAR=0 offset=00000000
        PBA: BAR=0 offset=00000600

Here is why:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/scsi/mpt3sas/mpt3sas_base.c?h=v4.15-rc4#n2608

It looks for the first MMIO BAR and assumes that is the one implementing
the basic registers, including the doorbell.

I am not so sure this is that unusual.
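The BAR layout discussed above can be read straight out of the hex dump. The following is a minimal sketch, not code from QEMU or the kernel: the names `bar_kind`, `read_bar`, `classify_bars` and `sas3008_cfg` are made up for illustration, and treating an all-zero register as "unused" is only a heuristic (real sizing writes all-ones to the BAR and reads it back).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_BARS 6

typedef enum { BAR_UNUSED, BAR_IO, BAR_MEM32, BAR_MEM64, BAR_UPPER } bar_kind;

/* First 0x40 bytes of config space, copied from the lspci -x dump above. */
static const uint8_t sas3008_cfg[0x40] = {
    0x00, 0x10, 0x97, 0x00, 0x46, 0x05, 0x10, 0x00,
    0x02, 0x00, 0x07, 0x01, 0x00, 0x00, 0x00, 0x00,
    0x01, 0x00, 0x00, 0x00, 0x04, 0x00, 0x14, 0x80,
    0x00, 0x00, 0x00, 0x00, 0x04, 0x00, 0x10, 0x80,
    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
    0x00, 0x00, 0x00, 0x00, 0xd9, 0x15, 0x08, 0x08,
    0x00, 0x00, 0x00, 0x00, 0x50, 0x00, 0x00, 0x00,
    0x00, 0x00, 0x00, 0x00, 0x00, 0x01, 0x00, 0x00,
};

/* BAR n is the little-endian dword at config offset 0x10 + 4 * n. */
static uint32_t read_bar(const uint8_t *cfg, int n)
{
    const uint8_t *p = cfg + 0x10 + 4 * n;

    return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
           (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

/* Classify all six slots; a 64-bit memory BAR also consumes the next slot. */
static void classify_bars(const uint8_t *cfg, bar_kind kind[NUM_BARS])
{
    assert(cfg != NULL && kind != NULL);

    for (int i = 0; i < NUM_BARS; i++) {
        uint32_t v = read_bar(cfg, i);

        if (v & 0x1) {                      /* bit 0 set: I/O space */
            kind[i] = BAR_IO;
        } else if ((v & 0x6) == 0x4) {      /* bits 2:1 == 10b: 64-bit memory */
            kind[i] = BAR_MEM64;
            if (i + 1 < NUM_BARS) {
                kind[++i] = BAR_UPPER;      /* upper 32 bits live here */
            }
        } else if (v != 0) {
            kind[i] = BAR_MEM32;
        } else {
            kind[i] = BAR_UNUSED;           /* heuristic, see lead-in */
        }
    }
}
```

Calling `classify_bars(sas3008_cfg, kind)` marks slot 0 as I/O, slots 1 and 3 as 64-bit memory with slots 2 and 4 holding their upper halves, and slot 5 as unused, which agrees with the explanation that a 64-bit BAR1 consumes BAR1 + BAR2 rather than BAR0.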
On 18/12/17 16:02, Alex Williamson wrote:
> With recently proposed kernel side vfio-pci changes, the MSI-X vector
> table area can be mmap'd from userspace, allowing direct access to
> non-MSI-X registers within the host page size of this area.  However,
> we only get that direct access if QEMU isn't also emulating MSI-X
> within that same page.  For x86/64 host, the system page size is 4K
> and the PCI spec recommends a minimum of 4K to 8K alignment to
> separate MSI-X from non-MSI-X registers, therefore only devices which
> don't honor this recommendation would see any improvement from this
> option.  The real targets for this feature are hosts where the page
> size exceeds the PCI spec recommended alignment, such as ARM64 systems
> with 64K pages.
>
> This new x-msix-relocation option accepts the following options:
>
>   off: Disable MSI-X relocation, use native device config (default)
>   auto: Automatically relocate MSI-X MMIO to another BAR or offset
>         based on minimum additional MMIO requirement
>   bar0..bar5: Specify the target BAR, which will either be extended
>               if the BAR exists or added if the BAR slot is available.
> > Signed-off-by: Alex Williamson <alex.williamson@redhat.com> > --- > hw/vfio/pci.c | 102 ++++++++++++++++++++++++++++++++++++++++++++++++++ > hw/vfio/pci.h | 1 > hw/vfio/trace-events | 2 + > 3 files changed, 104 insertions(+), 1 deletion(-) > > diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c > index c383b842da20..b4426abf297a 100644 > --- a/hw/vfio/pci.c > +++ b/hw/vfio/pci.c > @@ -1352,6 +1352,101 @@ static void vfio_pci_fixup_msix_region(VFIOPCIDevice *vdev) > } > } > > +static void vfio_pci_relocate_msix(VFIOPCIDevice *vdev) > +{ > + int target_bar = -1; > + size_t msix_sz; > + > + if (!vdev->msix || vdev->msix_relo == OFF_AUTOPCIBAR_OFF) { > + return; > + } > + > + /* The actual minimum size of MSI-X structures */ > + msix_sz = (vdev->msix->entries * PCI_MSIX_ENTRY_SIZE) + > + (QEMU_ALIGN_UP(vdev->msix->entries, 64) / 8); > + /* Round up to host pages, we don't want to share a page */ > + msix_sz = REAL_HOST_PAGE_ALIGN(msix_sz); > + /* PCI BARs must be a power of 2 */ > + msix_sz = pow2ceil(msix_sz); > + > + /* Auto: pick the BAR that incurs the least additional MMIO space */ > + if (vdev->msix_relo == OFF_AUTOPCIBAR_AUTO) { > + int i; > + size_t best = UINT64_MAX; > + > + for (i = 0; i < PCI_ROM_SLOT; i++) { I belieive that going from the other end is safer approach for "auto", especially after discovering how mpt3sas works. Or you could add "autoreverse" switch... > + size_t size; > + > + if (vdev->bars[i].ioport) { > + continue; > + } > + > + /* MSI-X MMIO must reside within first 32bit offset of BAR */ > + if (vdev->bars[i].size > (UINT32_MAX / 2)) > + continue; > + > + /* > + * Must be pow2, so larger of double existing or double msix_sz, > + * or if BAR unimplemented, msix_sz > + */ > + size = MAX(vdev->bars[i].size * 2, > + vdev->bars[i].size ? 
msix_sz * 2 : msix_sz); > + > + trace_vfio_msix_relo_cost(vdev->vbasedev.name, i, size); > + > + if (size < best) { > + best = size; > + target_bar = i; > + } > + > + if (vdev->bars[i].mem64) { > + i++; > + } > + } > + } else { > + target_bar = (int)(vdev->msix_relo - OFF_AUTOPCIBAR_BAR0); > + } > + > + if (target_bar < 0 || vdev->bars[target_bar].ioport || > + (!vdev->bars[target_bar].size && > + target_bar > 0 && vdev->bars[target_bar - 1].mem64)) { > + return; /* Go BOOM? Plumb Error */ > + } This "if" only seems to make sense for the non-auto branch... > + > + /* > + * If adding a new BAR, test if we can make it 64bit. We make it > + * prefetchable since QEMU MSI-X emulation has no read side effects > + * and doing so makes mapping more flexible. > + */ > + if (!vdev->bars[target_bar].size) { > + if (target_bar < (PCI_ROM_SLOT - 1) && > + !vdev->bars[target_bar + 1].size) { > + vdev->bars[target_bar].mem64 = true; > + vdev->bars[target_bar].type = PCI_BASE_ADDRESS_MEM_TYPE_64; > + } > + vdev->bars[target_bar].type |= PCI_BASE_ADDRESS_MEM_PREFETCH; > + vdev->bars[target_bar].size = msix_sz; > + vdev->msix->table_offset = 0; > + } else { > + vdev->bars[target_bar].size = MAX(vdev->bars[target_bar].size * 2, > + msix_sz * 2); > + /* > + * Due to above size calc, MSI-X always starts halfway into the BAR, > + * which will always be a separate host page. > + */ > + vdev->msix->table_offset = vdev->bars[target_bar].size / 2; > + } > + > + vdev->msix->table_bar = target_bar; > + vdev->msix->pba_bar = target_bar; Ah, here is how I got confused that commenting vfio_pci_fixup_msix_region() out was not necessary at the time but I missed that it is called before vfio_pci_relocate_msix(), when simply swapped - BARs get mapped. 
Ok, thanks, > + /* Requires 8-byte alignment, but PCI_MSIX_ENTRY_SIZE guarantees that */ > + vdev->msix->pba_offset = vdev->msix->table_offset + > + (vdev->msix->entries * PCI_MSIX_ENTRY_SIZE); > + > + trace_vfio_msix_relo(vdev->vbasedev.name, > + vdev->msix->table_bar, vdev->msix->table_offset); > +} > + > /* > * We don't have any control over how pci_add_capability() inserts > * capabilities into the chain. In order to setup MSI-X we need a > @@ -1430,6 +1525,8 @@ static void vfio_msix_early_setup(VFIOPCIDevice *vdev, Error **errp) > vdev->msix = msix; > > vfio_pci_fixup_msix_region(vdev); > + > + vfio_pci_relocate_msix(vdev); > } > > static int vfio_msix_setup(VFIOPCIDevice *vdev, int pos, Error **errp) > @@ -2845,13 +2942,14 @@ static void vfio_realize(PCIDevice *pdev, Error **errp) > > vfio_pci_size_rom(vdev); > > + vfio_bars_prepare(vdev); > + > vfio_msix_early_setup(vdev, &err); > if (err) { > error_propagate(errp, err); > goto error; > } > > - vfio_bars_prepare(vdev); > vfio_bars_register(vdev); > > ret = vfio_add_capabilities(vdev, errp); > @@ -3041,6 +3139,8 @@ static Property vfio_pci_dev_properties[] = { > DEFINE_PROP_UNSIGNED_NODEFAULT("x-nv-gpudirect-clique", VFIOPCIDevice, > nv_gpudirect_clique, > qdev_prop_nv_gpudirect_clique, uint8_t), > + DEFINE_PROP_OFF_AUTO_PCIBAR("x-msix-relocation", VFIOPCIDevice, msix_relo, > + OFF_AUTOPCIBAR_OFF), > /* > * TODO - support passed fds... is this necessary? 
> * DEFINE_PROP_STRING("vfiofd", VFIOPCIDevice, vfiofd_name), > diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h > index dcdb1a806769..588381f201b4 100644 > --- a/hw/vfio/pci.h > +++ b/hw/vfio/pci.h > @@ -135,6 +135,7 @@ typedef struct VFIOPCIDevice { > (1 << VFIO_FEATURE_ENABLE_IGD_OPREGION_BIT) > int32_t bootindex; > uint32_t igd_gms; > + OffAutoPCIBAR msix_relo; > uint8_t pm_cap; > uint8_t nv_gpudirect_clique; > bool pci_aer; > diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events > index fae096c0724f..437ccdd29053 100644 > --- a/hw/vfio/trace-events > +++ b/hw/vfio/trace-events > @@ -16,6 +16,8 @@ vfio_msix_pba_disable(const char *name) " (%s)" > vfio_msix_pba_enable(const char *name) " (%s)" > vfio_msix_disable(const char *name) " (%s)" > vfio_msix_fixup(const char *name, int bar, uint64_t start, uint64_t end) " (%s) MSI-X region %d mmap fixup [0x%"PRIx64" - 0x%"PRIx64"]" > +vfio_msix_relo_cost(const char *name, int bar, uint64_t cost) " (%s) BAR %d cost 0x%"PRIx64"" > +vfio_msix_relo(const char *name, int bar, uint64_t offset) " (%s) BAR %d offset 0x%"PRIx64"" > vfio_msi_enable(const char *name, int nr_vectors) " (%s) Enabled %d MSI vectors" > vfio_msi_disable(const char *name) " (%s)" > vfio_pci_load_rom(const char *name, unsigned long size, unsigned long offset, unsigned long flags) "Device %s ROM:\n size: 0x%lx, offset: 0x%lx, flags: 0x%lx" > >
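The size computation at the top of vfio_pci_relocate_msix() can be checked against the numbers reported for the SAS3008 (Count=96, PBA at offset 0x600, a 64K relocated BAR). This is a standalone sketch of the same arithmetic: `align_up`, `pow2_ceil` and the `msix_*` helpers are illustrative stand-ins for QEMU_ALIGN_UP, pow2ceil and REAL_HOST_PAGE_ALIGN, and the 64K host page size is an assumption inferred from the 64K BAR seen in the "auto" test.

```c
#include <assert.h>
#include <stdint.h>

#define MSIX_ENTRY_SIZE 16u   /* bytes per MSI-X table entry */

/* Round v up to a multiple of a; a must be a power of two. */
static uint64_t align_up(uint64_t v, uint64_t a)
{
    assert(a != 0 && (a & (a - 1)) == 0);
    return (v + a - 1) & ~(a - 1);
}

/* Smallest power of two that is >= v. */
static uint64_t pow2_ceil(uint64_t v)
{
    uint64_t p = 1;

    assert(v <= (UINT64_C(1) << 63));
    while (p < v) {
        p <<= 1;
    }
    return p;
}

/* Vector table: one 16-byte entry per vector. */
static uint64_t msix_table_bytes(uint32_t entries)
{
    return (uint64_t)entries * MSIX_ENTRY_SIZE;
}

/* Pending bit array: one bit per vector, padded to whole 64-bit words. */
static uint64_t msix_pba_bytes(uint32_t entries)
{
    return align_up(entries, 64) / 8;
}

/* Table + PBA, rounded to a host page, then to a power of two (BAR rule). */
static uint64_t msix_relo_size(uint32_t entries, uint64_t host_page)
{
    uint64_t sz = msix_table_bytes(entries) + msix_pba_bytes(entries);

    assert(entries > 0);
    sz = align_up(sz, host_page);
    return pow2_ceil(sz);
}
```

For 96 vectors the table is 0x600 bytes and the PBA 16 bytes, so the raw requirement is 1552 bytes. Page alignment dominates: the result is one full page, 64K on a 64K-page host and 4K on a 4K-page host.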
On Tue, 19 Dec 2017 14:07:13 +1100 Alexey Kardashevskiy <aik@ozlabs.ru> wrote: > On 18/12/17 16:02, Alex Williamson wrote: > > With recently proposed kernel side vfio-pci changes, the MSI-X vector > > table area can be mmap'd from userspace, allowing direct access to > > non-MSI-X registers within the host page size of this area. However, > > we only get that direct access if QEMU isn't also emulating MSI-X > > within that same page. For x86/64 host, the system page size is 4K > > and the PCI spec recommends a minimum of 4K to 8K alignment to > > separate MSI-X from non-MSI-X registers, therefore only devices which > > don't honor this recommendation would see any improvement from this > > option. The real targets for this feature are hosts where the page > > size exceeds the PCI spec recommended alignment, such as ARM64 systems > > with 64K pages. > > > > This new x-msix-relocation option accepts the following options: > > > > off: Disable MSI-X relocation, use native device config (default) > > auto: Automaically relocate MSI-X MMIO to another BAR or offset > > based on minimum additional MMIO requirement > > bar0..bar5: Specify the target BAR, which will either be extended > > if the BAR exists or added if the BAR slot is available. 
> > > > Signed-off-by: Alex Williamson <alex.williamson@redhat.com> > > --- > > hw/vfio/pci.c | 102 ++++++++++++++++++++++++++++++++++++++++++++++++++ > > hw/vfio/pci.h | 1 > > hw/vfio/trace-events | 2 + > > 3 files changed, 104 insertions(+), 1 deletion(-) > > > > diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c > > index c383b842da20..b4426abf297a 100644 > > --- a/hw/vfio/pci.c > > +++ b/hw/vfio/pci.c > > @@ -1352,6 +1352,101 @@ static void vfio_pci_fixup_msix_region(VFIOPCIDevice *vdev) > > } > > } > > > > +static void vfio_pci_relocate_msix(VFIOPCIDevice *vdev) > > +{ > > + int target_bar = -1; > > + size_t msix_sz; > > + > > + if (!vdev->msix || vdev->msix_relo == OFF_AUTOPCIBAR_OFF) { > > + return; > > + } > > + > > + /* The actual minimum size of MSI-X structures */ > > + msix_sz = (vdev->msix->entries * PCI_MSIX_ENTRY_SIZE) + > > + (QEMU_ALIGN_UP(vdev->msix->entries, 64) / 8); > > + /* Round up to host pages, we don't want to share a page */ > > + msix_sz = REAL_HOST_PAGE_ALIGN(msix_sz); > > + /* PCI BARs must be a power of 2 */ > > + msix_sz = pow2ceil(msix_sz); > > + > > + /* Auto: pick the BAR that incurs the least additional MMIO space */ > > + if (vdev->msix_relo == OFF_AUTOPCIBAR_AUTO) { > > + int i; > > + size_t best = UINT64_MAX; > > + > > + for (i = 0; i < PCI_ROM_SLOT; i++) { > > > I belieive that going from the other end is safer approach for "auto", > especially after discovering how mpt3sas works. Or you could add > "autoreverse" switch... Or is extending the smallest BAR really a safer option? I wonder how many drivers go through and fill fixed sized arrays with BAR info, expecting only the device implemented number of BARs. Maybe they wouldn't notice if the BAR was simply bigger than expected. On the other hand there are probably drivers dumb enough to index registers from the end for the BAR as well. I don't think there exists an auto algorithm that will fit every device, but a higher hit rate than we have so far would be nice. 
We could also implement MemoryRegionOps
for the base BAR with some error reporting if it gets called.  That
might make the problem more obvious than unassigned_mem_ops silently
eating those accesses.

> > +            size_t size;
> > +
> > +            if (vdev->bars[i].ioport) {
> > +                continue;
> > +            }
> > +
> > +            /* MSI-X MMIO must reside within first 32bit offset of BAR */
> > +            if (vdev->bars[i].size > (UINT32_MAX / 2))

Adding a '|| !vdev->bars[i].size' here would make auto only extend BARs.

NB, the existing test here needs a bit of work too: 32bit BARs max out
at 2G, not 4G, so maybe we need separate tests here, >1G for 32bit
BARs and >2G for 64bit BARs.  Hmm, do we have the option of promoting
32bit BARs to 64bit?  It's all virtual addresses anyway, right?  We're
in real trouble if we're extending BARs where this is an issue though.

> > +                continue;
> > +
> > +            /*
> > +             * Must be pow2, so larger of double existing or double msix_sz,
> > +             * or if BAR unimplemented, msix_sz
> > +             */
> > +            size = MAX(vdev->bars[i].size * 2,
> > +                       vdev->bars[i].size ? msix_sz * 2 : msix_sz);
> > +
> > +            trace_vfio_msix_relo_cost(vdev->vbasedev.name, i, size);
> > +
> > +            if (size < best) {
> > +                best = size;
> > +                target_bar = i;
> > +            }
> > +
> > +            if (vdev->bars[i].mem64) {
> > +                i++;
> > +            }
> > +        }
> > +    } else {
> > +        target_bar = (int)(vdev->msix_relo - OFF_AUTOPCIBAR_BAR0);
> > +    }
> > +
> > +    if (target_bar < 0 || vdev->bars[target_bar].ioport ||
> > +        (!vdev->bars[target_bar].size &&
> > +         target_bar > 0 && vdev->bars[target_bar - 1].mem64)) {
> > +        return; /* Go BOOM?  Plumb Error */
> > +    }
>
>
> This "if" only seems to make sense for the non-auto branch...

Most of it, yes, but it's still possible for a device to exist where
the auto loop would come up empty.  Imagine if each BAR was
sufficiently large that we couldn't extend it and still give the MSI-X
MMIO areas a 32-bit offset within the BAR.  That is exceptionally unlikely,
but it doesn't hurt to test all the corner cases.
I also missed the case of testing that the BAR isn't too large already here. > > + > > + /* > > + * If adding a new BAR, test if we can make it 64bit. We make it > > + * prefetchable since QEMU MSI-X emulation has no read side effects > > + * and doing so makes mapping more flexible. > > + */ > > + if (!vdev->bars[target_bar].size) { > > + if (target_bar < (PCI_ROM_SLOT - 1) && > > + !vdev->bars[target_bar + 1].size) { > > + vdev->bars[target_bar].mem64 = true; > > + vdev->bars[target_bar].type = PCI_BASE_ADDRESS_MEM_TYPE_64; > > + } > > + vdev->bars[target_bar].type |= PCI_BASE_ADDRESS_MEM_PREFETCH; > > + vdev->bars[target_bar].size = msix_sz; > > + vdev->msix->table_offset = 0; > > + } else { > > + vdev->bars[target_bar].size = MAX(vdev->bars[target_bar].size * 2, > > + msix_sz * 2); > > + /* > > + * Due to above size calc, MSI-X always starts halfway into the BAR, > > + * which will always be a separate host page. > > + */ > > + vdev->msix->table_offset = vdev->bars[target_bar].size / 2; > > + } > > + > > + vdev->msix->table_bar = target_bar; > > + vdev->msix->pba_bar = target_bar; > > > Ah, here is how I got confused that commenting vfio_pci_fixup_msix_region() out > was not necessary at the time but I missed that it is called before > vfio_pci_relocate_msix(), when simply swapped - BARs get mapped. Ok, thanks, For a kernel that allows mapping the MSI-X region, yes, but if you ran that on an older kernel I think QEMU would break when it can't mmap the entire region. We can't only support new kernels. Thanks, Alex > > + /* Requires 8-byte alignment, but PCI_MSIX_ENTRY_SIZE guarantees that */ > > + vdev->msix->pba_offset = vdev->msix->table_offset + > > + (vdev->msix->entries * PCI_MSIX_ENTRY_SIZE); > > + > > + trace_vfio_msix_relo(vdev->vbasedev.name, > > + vdev->msix->table_bar, vdev->msix->table_offset); > > +} > > + > > /* > > * We don't have any control over how pci_add_capability() inserts > > * capabilities into the chain. 
In order to setup MSI-X we need a > > @@ -1430,6 +1525,8 @@ static void vfio_msix_early_setup(VFIOPCIDevice *vdev, Error **errp) > > vdev->msix = msix; > > > > vfio_pci_fixup_msix_region(vdev); > > + > > + vfio_pci_relocate_msix(vdev); > > } > > > > static int vfio_msix_setup(VFIOPCIDevice *vdev, int pos, Error **errp) > > @@ -2845,13 +2942,14 @@ static void vfio_realize(PCIDevice *pdev, Error **errp) > > > > vfio_pci_size_rom(vdev); > > > > + vfio_bars_prepare(vdev); > > + > > vfio_msix_early_setup(vdev, &err); > > if (err) { > > error_propagate(errp, err); > > goto error; > > } > > > > - vfio_bars_prepare(vdev); > > vfio_bars_register(vdev); > > > > ret = vfio_add_capabilities(vdev, errp); > > @@ -3041,6 +3139,8 @@ static Property vfio_pci_dev_properties[] = { > > DEFINE_PROP_UNSIGNED_NODEFAULT("x-nv-gpudirect-clique", VFIOPCIDevice, > > nv_gpudirect_clique, > > qdev_prop_nv_gpudirect_clique, uint8_t), > > + DEFINE_PROP_OFF_AUTO_PCIBAR("x-msix-relocation", VFIOPCIDevice, msix_relo, > > + OFF_AUTOPCIBAR_OFF), > > /* > > * TODO - support passed fds... is this necessary? 
> > * DEFINE_PROP_STRING("vfiofd", VFIOPCIDevice, vfiofd_name), > > diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h > > index dcdb1a806769..588381f201b4 100644 > > --- a/hw/vfio/pci.h > > +++ b/hw/vfio/pci.h > > @@ -135,6 +135,7 @@ typedef struct VFIOPCIDevice { > > (1 << VFIO_FEATURE_ENABLE_IGD_OPREGION_BIT) > > int32_t bootindex; > > uint32_t igd_gms; > > + OffAutoPCIBAR msix_relo; > > uint8_t pm_cap; > > uint8_t nv_gpudirect_clique; > > bool pci_aer; > > diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events > > index fae096c0724f..437ccdd29053 100644 > > --- a/hw/vfio/trace-events > > +++ b/hw/vfio/trace-events > > @@ -16,6 +16,8 @@ vfio_msix_pba_disable(const char *name) " (%s)" > > vfio_msix_pba_enable(const char *name) " (%s)" > > vfio_msix_disable(const char *name) " (%s)" > > vfio_msix_fixup(const char *name, int bar, uint64_t start, uint64_t end) " (%s) MSI-X region %d mmap fixup [0x%"PRIx64" - 0x%"PRIx64"]" > > +vfio_msix_relo_cost(const char *name, int bar, uint64_t cost) " (%s) BAR %d cost 0x%"PRIx64"" > > +vfio_msix_relo(const char *name, int bar, uint64_t offset) " (%s) BAR %d offset 0x%"PRIx64"" > > vfio_msi_enable(const char *name, int nr_vectors) " (%s) Enabled %d MSI vectors" > > vfio_msi_disable(const char *name) " (%s)" > > vfio_pci_load_rom(const char *name, unsigned long size, unsigned long offset, unsigned long flags) "Device %s ROM:\n size: 0x%lx, offset: 0x%lx, flags: 0x%lx" > > > > > >
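The two suggestions made in this reply (only extend existing BARs in auto mode, and use separate size limits for 32-bit and 64-bit BARs) could be combined into one predicate. This is only a sketch of that idea under the limits stated above; `bar_can_be_extended` is a hypothetical helper, not a function in QEMU, and it does not reflect whatever was eventually merged.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define GiB (1ULL << 30)

/*
 * Can an existing BAR of cur_size bytes be doubled so that the MSI-X
 * structures start at the old end of the BAR?
 *
 *  - An unimplemented BAR (size 0) is rejected: "auto" would only extend.
 *  - A 32-bit BAR is skipped above 1G, since 32-bit BARs max out at 2G.
 *  - A 64-bit BAR is skipped above 2G, so that the MSI-X offset (the old
 *    size) still fits in 32 bits.
 */
static bool bar_can_be_extended(uint64_t cur_size, bool mem64)
{
    /* BAR sizes are powers of two (or zero when unimplemented). */
    assert(cur_size == 0 || (cur_size & (cur_size - 1)) == 0);

    if (cur_size == 0) {
        return false;
    }
    return cur_size <= (mem64 ? 2 * GiB : 1 * GiB);
}
```

With this predicate a 1G 32-bit BAR is still eligible (it grows to 2G), while a 2G 32-bit BAR is not. Promoting a 32-bit BAR to 64-bit, raised as an open question above, is deliberately left out.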
On 19/12/17 14:40, Alex Williamson wrote: > On Tue, 19 Dec 2017 14:07:13 +1100 > Alexey Kardashevskiy <aik@ozlabs.ru> wrote: > >> On 18/12/17 16:02, Alex Williamson wrote: >>> With recently proposed kernel side vfio-pci changes, the MSI-X vector >>> table area can be mmap'd from userspace, allowing direct access to >>> non-MSI-X registers within the host page size of this area. However, >>> we only get that direct access if QEMU isn't also emulating MSI-X >>> within that same page. For x86/64 host, the system page size is 4K >>> and the PCI spec recommends a minimum of 4K to 8K alignment to >>> separate MSI-X from non-MSI-X registers, therefore only devices which >>> don't honor this recommendation would see any improvement from this >>> option. The real targets for this feature are hosts where the page >>> size exceeds the PCI spec recommended alignment, such as ARM64 systems >>> with 64K pages. >>> >>> This new x-msix-relocation option accepts the following options: >>> >>> off: Disable MSI-X relocation, use native device config (default) >>> auto: Automaically relocate MSI-X MMIO to another BAR or offset >>> based on minimum additional MMIO requirement >>> bar0..bar5: Specify the target BAR, which will either be extended >>> if the BAR exists or added if the BAR slot is available. 
>>> >>> Signed-off-by: Alex Williamson <alex.williamson@redhat.com> >>> --- >>> hw/vfio/pci.c | 102 ++++++++++++++++++++++++++++++++++++++++++++++++++ >>> hw/vfio/pci.h | 1 >>> hw/vfio/trace-events | 2 + >>> 3 files changed, 104 insertions(+), 1 deletion(-) >>> >>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c >>> index c383b842da20..b4426abf297a 100644 >>> --- a/hw/vfio/pci.c >>> +++ b/hw/vfio/pci.c >>> @@ -1352,6 +1352,101 @@ static void vfio_pci_fixup_msix_region(VFIOPCIDevice *vdev) >>> } >>> } >>> >>> +static void vfio_pci_relocate_msix(VFIOPCIDevice *vdev) >>> +{ >>> + int target_bar = -1; >>> + size_t msix_sz; >>> + >>> + if (!vdev->msix || vdev->msix_relo == OFF_AUTOPCIBAR_OFF) { >>> + return; >>> + } >>> + >>> + /* The actual minimum size of MSI-X structures */ >>> + msix_sz = (vdev->msix->entries * PCI_MSIX_ENTRY_SIZE) + >>> + (QEMU_ALIGN_UP(vdev->msix->entries, 64) / 8); >>> + /* Round up to host pages, we don't want to share a page */ >>> + msix_sz = REAL_HOST_PAGE_ALIGN(msix_sz); >>> + /* PCI BARs must be a power of 2 */ >>> + msix_sz = pow2ceil(msix_sz); >>> + >>> + /* Auto: pick the BAR that incurs the least additional MMIO space */ >>> + if (vdev->msix_relo == OFF_AUTOPCIBAR_AUTO) { >>> + int i; >>> + size_t best = UINT64_MAX; >>> + >>> + for (i = 0; i < PCI_ROM_SLOT; i++) { >> >> >> I belieive that going from the other end is safer approach for "auto", >> especially after discovering how mpt3sas works. Or you could add >> "autoreverse" switch... > > Or is extending the smallest BAR really a safer option? I wonder how > many drivers go through and fill fixed sized arrays with BAR info, > expecting only the device implemented number of BARs. Maybe they > wouldn't notice if the BAR was simply bigger than expected. On the > other hand there are probably drivers dumb enough to index registers > from the end for the BAR as well. 
I don't think there exists an > auto algorithm that will fit every device, but a higher hit rate than > we have so far would be nice. Everything is possible :( I do not know if there are many users for this relocation though. So far only one device has the problem (in 5 years or so) and it is fixed by moving msix to bar5, I'd suggest start with this for now. In general, I think we still need a way to simply disable that msix_table region anyway if we find a device driver which uses all BARs, does not tolerate changes to the default set of BARs, etc. > We could also implement MemoryRegionOps > for the base BAR with some error reporting if it gets called. That > might make the problem more obvious than unassigned_mem_ops silently > eating those accesses. Makes sense. > >>> + size_t size; >>> + >>> + if (vdev->bars[i].ioport) { >>> + continue; >>> + } >>> + >>> + /* MSI-X MMIO must reside within first 32bit offset of BAR */ >>> + if (vdev->bars[i].size > (UINT32_MAX / 2)) > > Adding a '|| !vdev->bars[i].size' here would make auto only extend BARs. > > NB, the existing test here needs a bit of work too, 32bit BARs max out > at 2G not 4G, so maybe we need separate tests here. >1G for 32bit > BARs, >2G for 64bit BARs. Hmm, do we have the option of promoting > 32bit BARs to 64bit? It's all virtual addresses anyway, right. We're > in real trouble if were extending BARs where this is an issue though. until you get a driver like this :) https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/rapidio/devices/tsi721.c?h=v4.15-rc4#n2782 > >>> + continue; >>> + >>> + /* >>> + * Must be pow2, so larger of double existing or double msix_sz, >>> + * or if BAR unimplemented, msix_sz >>> + */ >>> + size = MAX(vdev->bars[i].size * 2, >>> + vdev->bars[i].size ? 
msix_sz * 2 : msix_sz); >>> + >>> + trace_vfio_msix_relo_cost(vdev->vbasedev.name, i, size); >>> + >>> + if (size < best) { >>> + best = size; >>> + target_bar = i; >>> + } >>> + >>> + if (vdev->bars[i].mem64) { >>> + i++; >>> + } >>> + } >>> + } else { >>> + target_bar = (int)(vdev->msix_relo - OFF_AUTOPCIBAR_BAR0); >>> + } >>> + >>> + if (target_bar < 0 || vdev->bars[target_bar].ioport || >>> + (!vdev->bars[target_bar].size && >>> + target_bar > 0 && vdev->bars[target_bar - 1].mem64)) { >>> + return; /* Go BOOM? Plumb Error */ >>> + } >> >> >> This "if" only seems to make sense for the non-auto branch... > > Most of it, yes, but it's still possible for a device to exist where > the auto loop would come up empty. Imagine if each BAR was > sufficiently large that we couldn't extend it and still give the MSI-X > MMIO areas a 32-bit offset within the BAR. Exceptionally unlikely, it > doesn't hurt to test all the corner cases. I also missed the case of > testing that the BAR isn't too large already here. Fair enough. > >>> + >>> + /* >>> + * If adding a new BAR, test if we can make it 64bit. We make it >>> + * prefetchable since QEMU MSI-X emulation has no read side effects >>> + * and doing so makes mapping more flexible. >>> + */ >>> + if (!vdev->bars[target_bar].size) { >>> + if (target_bar < (PCI_ROM_SLOT - 1) && >>> + !vdev->bars[target_bar + 1].size) { >>> + vdev->bars[target_bar].mem64 = true; >>> + vdev->bars[target_bar].type = PCI_BASE_ADDRESS_MEM_TYPE_64; >>> + } >>> + vdev->bars[target_bar].type |= PCI_BASE_ADDRESS_MEM_PREFETCH; >>> + vdev->bars[target_bar].size = msix_sz; >>> + vdev->msix->table_offset = 0; >>> + } else { >>> + vdev->bars[target_bar].size = MAX(vdev->bars[target_bar].size * 2, >>> + msix_sz * 2); >>> + /* >>> + * Due to above size calc, MSI-X always starts halfway into the BAR, >>> + * which will always be a separate host page. 
>>> + */ >>> + vdev->msix->table_offset = vdev->bars[target_bar].size / 2; >>> + } >>> + >>> + vdev->msix->table_bar = target_bar; >>> + vdev->msix->pba_bar = target_bar; >> >> >> Ah, here is how I got confused that commenting vfio_pci_fixup_msix_region() out >> was not necessary at the time but I missed that it is called before >> vfio_pci_relocate_msix(), when simply swapped - BARs get mapped. Ok, thanks, > > For a kernel that allows mapping the MSI-X region, yes, but if you ran > that on an older kernel I think QEMU would break when it can't mmap the > entire region. We can't only support new kernels. Thanks, Sure, I am not suggesting changing this.
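The placement logic in the quoted hunk (new BAR versus extended BAR) can be exercised in isolation. The sketch below mirrors the arithmetic of the patch but is not the patch itself: `msix_layout` and `msix_place` are illustrative names, and `msix_sz` is assumed to be the page-aligned, power-of-two size computed earlier in the function (64K in the SAS3008 tests).

```c
#include <assert.h>
#include <stdint.h>

#define MSIX_ENTRY_SIZE 16u   /* bytes per MSI-X table entry */

typedef struct {
    uint64_t bar_size;       /* resulting size of the target BAR */
    uint64_t table_offset;   /* vector table offset within the BAR */
    uint64_t pba_offset;     /* pending bit array offset within the BAR */
} msix_layout;

/*
 * cur_bar_size == 0 means the slot is unimplemented and a new BAR of
 * msix_sz bytes is created with the table at offset 0.  Otherwise the BAR
 * grows to the larger of twice its size or twice msix_sz, and the table
 * starts exactly halfway in, which is always a separate host page.
 */
static msix_layout msix_place(uint64_t cur_bar_size, uint64_t msix_sz,
                              uint32_t entries)
{
    msix_layout l;

    assert(msix_sz != 0 && (msix_sz & (msix_sz - 1)) == 0);

    if (cur_bar_size == 0) {
        l.bar_size = msix_sz;
        l.table_offset = 0;
    } else {
        uint64_t doubled_bar = cur_bar_size * 2;
        uint64_t doubled_msix = msix_sz * 2;

        l.bar_size = doubled_bar > doubled_msix ? doubled_bar : doubled_msix;
        l.table_offset = l.bar_size / 2;
    }
    /* The PBA follows the table; 16-byte entries keep it 8-byte aligned. */
    l.pba_offset = l.table_offset + (uint64_t)entries * MSIX_ENTRY_SIZE;
    return l;
}
```

With 96 vectors and a 64K msix_sz, a new BAR gives table offset 0 and PBA offset 0x600, matching the lspci output earlier in the thread. Extending the existing 64K BAR1 would instead yield a 128K BAR with the table at 0x10000.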
On Tue, 19 Dec 2017 17:02:59 +1100 Alexey Kardashevskiy <aik@ozlabs.ru> wrote: > On 19/12/17 14:40, Alex Williamson wrote: > > On Tue, 19 Dec 2017 14:07:13 +1100 > > Alexey Kardashevskiy <aik@ozlabs.ru> wrote: > > > >> On 18/12/17 16:02, Alex Williamson wrote: > >>> With recently proposed kernel side vfio-pci changes, the MSI-X vector > >>> table area can be mmap'd from userspace, allowing direct access to > >>> non-MSI-X registers within the host page size of this area. However, > >>> we only get that direct access if QEMU isn't also emulating MSI-X > >>> within that same page. For x86/64 host, the system page size is 4K > >>> and the PCI spec recommends a minimum of 4K to 8K alignment to > >>> separate MSI-X from non-MSI-X registers, therefore only devices which > >>> don't honor this recommendation would see any improvement from this > >>> option. The real targets for this feature are hosts where the page > >>> size exceeds the PCI spec recommended alignment, such as ARM64 systems > >>> with 64K pages. > >>> > >>> This new x-msix-relocation option accepts the following options: > >>> > >>> off: Disable MSI-X relocation, use native device config (default) > >>> auto: Automaically relocate MSI-X MMIO to another BAR or offset > >>> based on minimum additional MMIO requirement > >>> bar0..bar5: Specify the target BAR, which will either be extended > >>> if the BAR exists or added if the BAR slot is available. 
> >>>
> >>> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> >>> ---
> >>>  hw/vfio/pci.c        |  102 ++++++++++++++++++++++++++++++++++++++++++++++++++
> >>>  hw/vfio/pci.h        |    1
> >>>  hw/vfio/trace-events |    2 +
> >>>  3 files changed, 104 insertions(+), 1 deletion(-)
> >>>
> >>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> >>> index c383b842da20..b4426abf297a 100644
> >>> --- a/hw/vfio/pci.c
> >>> +++ b/hw/vfio/pci.c
> >>> @@ -1352,6 +1352,101 @@ static void vfio_pci_fixup_msix_region(VFIOPCIDevice *vdev)
> >>>      }
> >>>  }
> >>>
> >>> +static void vfio_pci_relocate_msix(VFIOPCIDevice *vdev)
> >>> +{
> >>> +    int target_bar = -1;
> >>> +    size_t msix_sz;
> >>> +
> >>> +    if (!vdev->msix || vdev->msix_relo == OFF_AUTOPCIBAR_OFF) {
> >>> +        return;
> >>> +    }
> >>> +
> >>> +    /* The actual minimum size of MSI-X structures */
> >>> +    msix_sz = (vdev->msix->entries * PCI_MSIX_ENTRY_SIZE) +
> >>> +              (QEMU_ALIGN_UP(vdev->msix->entries, 64) / 8);
> >>> +    /* Round up to host pages, we don't want to share a page */
> >>> +    msix_sz = REAL_HOST_PAGE_ALIGN(msix_sz);
> >>> +    /* PCI BARs must be a power of 2 */
> >>> +    msix_sz = pow2ceil(msix_sz);
> >>> +
> >>> +    /* Auto: pick the BAR that incurs the least additional MMIO space */
> >>> +    if (vdev->msix_relo == OFF_AUTOPCIBAR_AUTO) {
> >>> +        int i;
> >>> +        size_t best = UINT64_MAX;
> >>> +
> >>> +        for (i = 0; i < PCI_ROM_SLOT; i++) {
> >>
> >>
> >> I believe that going from the other end is a safer approach for "auto",
> >> especially after discovering how mpt3sas works. Or you could add
> >> "autoreverse" switch...
> >
> > Or is extending the smallest BAR really a safer option? I wonder how
> > many drivers go through and fill fixed sized arrays with BAR info,
> > expecting only the device implemented number of BARs. Maybe they
> > wouldn't notice if the BAR was simply bigger than expected. On the
> > other hand there are probably drivers dumb enough to index registers
> > from the end for the BAR as well. I don't think there exists an
> > auto algorithm that will fit every device, but a higher hit rate than
> > we have so far would be nice.
>
> Everything is possible :(
>
> I do not know if there are many users for this relocation though. So far
> only one device has the problem (in 5 years or so) and it is fixed by
> moving msix to bar5, I'd suggest start with this for now.

Interesting, I would have thought it to be more common.

> In general, I think we still need a way to simply disable that msix_table
> region anyway if we find a device driver which uses all BARs, does not
> tolerate changes to the default set of BARs, etc.

Only SPAPR can do that. In fact, I'm somewhat surprised by your
interest in this series as I positioned it as a way for other
platforms, which require interaction with MSI-X MMIO space for
programming interrupts.

> > We could also implement MemoryRegionOps
> > for the base BAR with some error reporting if it gets called. That
> > might make the problem more obvious than unassigned_mem_ops silently
> > eating those accesses.
>
> Makes sense.
>
>
> >
> >>> +            size_t size;
> >>> +
> >>> +            if (vdev->bars[i].ioport) {
> >>> +                continue;
> >>> +            }
> >>> +
> >>> +            /* MSI-X MMIO must reside within first 32bit offset of BAR */
> >>> +            if (vdev->bars[i].size > (UINT32_MAX / 2))
> >
> > Adding a '|| !vdev->bars[i].size' here would make auto only extend BARs.
> >
> > NB, the existing test here needs a bit of work too, 32bit BARs max out
> > at 2G not 4G, so maybe we need separate tests here. >1G for 32bit
> > BARs, >2G for 64bit BARs. Hmm, do we have the option of promoting
> > 32bit BARs to 64bit? It's all virtual addresses anyway, right. We're
> > in real trouble if we're extending BARs where this is an issue though.
>
> until you get a driver like this :)
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/rapidio/devices/tsi721.c?h=v4.15-rc4#n2782

Right, a diametric opposite of the SAS driver, verifying all the
attributes it can of specific BARs rather than assuming the first BAR
it finds must be the one to use. Is it even worthwhile to try to have
any automatic selection? I suppose this driver is another point
towards a reverse search rather than extended BAR. Thanks,

Alex
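
The separate size tests suggested above (skip 32-bit BARs larger than 1G and 64-bit BARs larger than 2G) could look roughly like the sketch below. This is only one reading of that suggestion, not code from the patch: `bar_can_be_doubled` and its parameters are hypothetical names, and the 64-bit bound assumes the relocated table lands at the old BAR size, which has to fit the 32-bit MSI-X table offset.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define GiB (1024ULL * 1024 * 1024)

/*
 * Hypothetical helper, not part of the patch: report whether an existing
 * memory BAR of bar_size bytes may be doubled to host the MSI-X table.
 *
 * - A 32-bit BAR tops out at 2G, so only BARs up to 1G can be doubled.
 * - A 64-bit BAR can grow further, but after doubling the table starts
 *   at the old size, and that offset must fit in 32 bits, so the old
 *   size may be at most 2G.
 */
static bool bar_can_be_doubled(uint64_t bar_size, bool mem64)
{
    /* BAR sizes are powers of two; zero would mean "unimplemented". */
    assert(bar_size != 0 && (bar_size & (bar_size - 1)) == 0);

    return bar_size <= (mem64 ? 2 * GiB : 1 * GiB);
}
```

With these bounds a 2G 32-bit BAR is rejected while a 2G 64-bit BAR is still a candidate, which is the distinction the single `UINT32_MAX / 2` test in the patch does not make.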
On 19/12/17 17:56, Alex Williamson wrote:
> On Tue, 19 Dec 2017 17:02:59 +1100
> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>
>> On 19/12/17 14:40, Alex Williamson wrote:
>>> On Tue, 19 Dec 2017 14:07:13 +1100
>>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>>>
>>>> On 18/12/17 16:02, Alex Williamson wrote:
>>>>> With recently proposed kernel side vfio-pci changes, the MSI-X vector
>>>>> table area can be mmap'd from userspace, allowing direct access to
>>>>> non-MSI-X registers within the host page size of this area. However,
>>>>> we only get that direct access if QEMU isn't also emulating MSI-X
>>>>> within that same page. For x86/64 host, the system page size is 4K
>>>>> and the PCI spec recommends a minimum of 4K to 8K alignment to
>>>>> separate MSI-X from non-MSI-X registers, therefore only devices which
>>>>> don't honor this recommendation would see any improvement from this
>>>>> option. The real targets for this feature are hosts where the page
>>>>> size exceeds the PCI spec recommended alignment, such as ARM64 systems
>>>>> with 64K pages.
>>>>>
>>>>> This new x-msix-relocation option accepts the following options:
>>>>>
>>>>>   off: Disable MSI-X relocation, use native device config (default)
>>>>>   auto: Automatically relocate MSI-X MMIO to another BAR or offset
>>>>>         based on minimum additional MMIO requirement
>>>>>   bar0..bar5: Specify the target BAR, which will either be extended
>>>>>               if the BAR exists or added if the BAR slot is available.
>>>>>
>>>>> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
>>>>> ---
>>>>>  hw/vfio/pci.c        |  102 ++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>  hw/vfio/pci.h        |    1
>>>>>  hw/vfio/trace-events |    2 +
>>>>>  3 files changed, 104 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>>>>> index c383b842da20..b4426abf297a 100644
>>>>> --- a/hw/vfio/pci.c
>>>>> +++ b/hw/vfio/pci.c
>>>>> @@ -1352,6 +1352,101 @@ static void vfio_pci_fixup_msix_region(VFIOPCIDevice *vdev)
>>>>>      }
>>>>>  }
>>>>>
>>>>> +static void vfio_pci_relocate_msix(VFIOPCIDevice *vdev)
>>>>> +{
>>>>> +    int target_bar = -1;
>>>>> +    size_t msix_sz;
>>>>> +
>>>>> +    if (!vdev->msix || vdev->msix_relo == OFF_AUTOPCIBAR_OFF) {
>>>>> +        return;
>>>>> +    }
>>>>> +
>>>>> +    /* The actual minimum size of MSI-X structures */
>>>>> +    msix_sz = (vdev->msix->entries * PCI_MSIX_ENTRY_SIZE) +
>>>>> +              (QEMU_ALIGN_UP(vdev->msix->entries, 64) / 8);
>>>>> +    /* Round up to host pages, we don't want to share a page */
>>>>> +    msix_sz = REAL_HOST_PAGE_ALIGN(msix_sz);
>>>>> +    /* PCI BARs must be a power of 2 */
>>>>> +    msix_sz = pow2ceil(msix_sz);
>>>>> +
>>>>> +    /* Auto: pick the BAR that incurs the least additional MMIO space */
>>>>> +    if (vdev->msix_relo == OFF_AUTOPCIBAR_AUTO) {
>>>>> +        int i;
>>>>> +        size_t best = UINT64_MAX;
>>>>> +
>>>>> +        for (i = 0; i < PCI_ROM_SLOT; i++) {
>>>>
>>>>
>>>> I believe that going from the other end is a safer approach for "auto",
>>>> especially after discovering how mpt3sas works. Or you could add
>>>> "autoreverse" switch...
>>>
>>> Or is extending the smallest BAR really a safer option? I wonder how
>>> many drivers go through and fill fixed sized arrays with BAR info,
>>> expecting only the device implemented number of BARs. Maybe they
>>> wouldn't notice if the BAR was simply bigger than expected. On the
>>> other hand there are probably drivers dumb enough to index registers
>>> from the end for the BAR as well. I don't think there exists an
>>> auto algorithm that will fit every device, but a higher hit rate than
>>> we have so far would be nice.
>>
>> Everything is possible :(
>>
>> I do not know if there are many users for this relocation though. So far
>> only one device has the problem (in 5 years or so) and it is fixed by
>> moving msix to bar5, I'd suggest start with this for now.
>
> Interesting, I would have thought it to be more common.

Just to clarify - only one device has a performance issue because of msix
emulation; non-64k-aligned msix data is not that unusual.

>
>> In general, I think we still need a way to simply disable that msix_table
>> region anyway if we find a device driver which uses all BARs, does not
>> tolerate changes to the default set of BARs, etc.
>
> Only SPAPR can do that. In fact, I'm somewhat surprised by your
> interest in this series as I positioned it as a way for other
> platforms, which require interaction with MSI-X MMIO space for
> programming interrupts.

Well, it moves the guest-visible msix section away from the BAR causing
performance issues, so I figured it might work for SPAPR eventually :)

>>> We could also implement MemoryRegionOps
>>> for the base BAR with some error reporting if it gets called. That
>>> might make the problem more obvious than unassigned_mem_ops silently
>>> eating those accesses.
>>
>> Makes sense.
>>
>>
>>>
>>>>> +            size_t size;
>>>>> +
>>>>> +            if (vdev->bars[i].ioport) {
>>>>> +                continue;
>>>>> +            }
>>>>> +
>>>>> +            /* MSI-X MMIO must reside within first 32bit offset of BAR */
>>>>> +            if (vdev->bars[i].size > (UINT32_MAX / 2))
>>>
>>> Adding a '|| !vdev->bars[i].size' here would make auto only extend BARs.
>>>
>>> NB, the existing test here needs a bit of work too, 32bit BARs max out
>>> at 2G not 4G, so maybe we need separate tests here. >1G for 32bit
>>> BARs, >2G for 64bit BARs. Hmm, do we have the option of promoting
>>> 32bit BARs to 64bit? It's all virtual addresses anyway, right. We're
>>> in real trouble if we're extending BARs where this is an issue though.
>>
>> until you get a driver like this :)
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/rapidio/devices/tsi721.c?h=v4.15-rc4#n2782
>
> Right, a diametric opposite of the SAS driver, verifying all the
> attributes it can of specific BARs rather than assuming the first BAR
> it finds must be the one to use. Is it even worthwhile to try to have
> any automatic selection? I suppose this driver is another point
> towards a reverse search rather than extended BAR. Thanks,

Well, guessing like this may fail occasionally, while simply allowing MSIX
mapping won't fail on SPAPR; I do not really know if it is going to be
very useful anywhere other than SPAPR.

And I guess if we go the automatic selection path, then extending a BAR
does not have much benefit over using the last BAR, because it seems quite
unlikely that a device 1) does not have any unused BARs and 2) has no BAR
that is MSIX-only. If this is the case, though, I am not sure which guess
would be safer.

I looked nearby, for example:

001e:80:00.2 Ethernet controller: Broadcom Corporation NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)
        Region 0: Memory at 3fc2c0250000 (64-bit, prefetchable) [size=64K]
        Region 2: Memory at 3fc2c0240000 (64-bit, prefetchable) [size=64K]
        Region 4: Memory at 3fc2c0230000 (64-bit, prefetchable) [size=64K]
        Capabilities: [a0] MSI-X: Enable- Count=17 Masked-
                Vector table: BAR=4 offset=00000000
                PBA: BAR=4 offset=00000120

It is fully packed and it *seems* that BAR4 is MSIX only, but who knows why
it is 64K - can be anything...

This one looks more convincing but still no guarantee:

0001:09:00.0 USB controller: Texas Instruments TUSB73x0 SuperSpeed USB 3.0 xHCI Host Controller (rev 02)
        Region 0: Memory at 3fe080800000 (64-bit, non-prefetchable) [size=64K]
        Region 2: Memory at 3fe080810000 (64-bit, non-prefetchable) [size=8K]
        Capabilities: [c0] MSI-X: Enable+ Count=8 Masked-
                Vector table: BAR=2 offset=00000000
                PBA: BAR=2 offset=00001000

A funny thing - my thinkpad x1 does not have a single msix-capable device,
many are MSI and "Express (v2) Endpoint, MSI 00". Hmmm. Xeon and POWER8
boxes do have MSIX.
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index c383b842da20..b4426abf297a 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -1352,6 +1352,101 @@ static void vfio_pci_fixup_msix_region(VFIOPCIDevice *vdev)
     }
 }
 
+static void vfio_pci_relocate_msix(VFIOPCIDevice *vdev)
+{
+    int target_bar = -1;
+    size_t msix_sz;
+
+    if (!vdev->msix || vdev->msix_relo == OFF_AUTOPCIBAR_OFF) {
+        return;
+    }
+
+    /* The actual minimum size of MSI-X structures */
+    msix_sz = (vdev->msix->entries * PCI_MSIX_ENTRY_SIZE) +
+              (QEMU_ALIGN_UP(vdev->msix->entries, 64) / 8);
+    /* Round up to host pages, we don't want to share a page */
+    msix_sz = REAL_HOST_PAGE_ALIGN(msix_sz);
+    /* PCI BARs must be a power of 2 */
+    msix_sz = pow2ceil(msix_sz);
+
+    /* Auto: pick the BAR that incurs the least additional MMIO space */
+    if (vdev->msix_relo == OFF_AUTOPCIBAR_AUTO) {
+        int i;
+        size_t best = UINT64_MAX;
+
+        for (i = 0; i < PCI_ROM_SLOT; i++) {
+            size_t size;
+
+            if (vdev->bars[i].ioport) {
+                continue;
+            }
+
+            /* MSI-X MMIO must reside within first 32bit offset of BAR */
+            if (vdev->bars[i].size > (UINT32_MAX / 2))
+                continue;
+
+            /*
+             * Must be pow2, so larger of double existing or double msix_sz,
+             * or if BAR unimplemented, msix_sz
+             */
+            size = MAX(vdev->bars[i].size * 2,
+                       vdev->bars[i].size ? msix_sz * 2 : msix_sz);
+
+            trace_vfio_msix_relo_cost(vdev->vbasedev.name, i, size);
+
+            if (size < best) {
+                best = size;
+                target_bar = i;
+            }
+
+            if (vdev->bars[i].mem64) {
+                i++;
+            }
+        }
+    } else {
+        target_bar = (int)(vdev->msix_relo - OFF_AUTOPCIBAR_BAR0);
+    }
+
+    if (target_bar < 0 || vdev->bars[target_bar].ioport ||
+        (!vdev->bars[target_bar].size &&
+         target_bar > 0 && vdev->bars[target_bar - 1].mem64)) {
+        return; /* Go BOOM?  Plumb Error */
+    }
+
+    /*
+     * If adding a new BAR, test if we can make it 64bit.  We make it
+     * prefetchable since QEMU MSI-X emulation has no read side effects
+     * and doing so makes mapping more flexible.
+     */
+    if (!vdev->bars[target_bar].size) {
+        if (target_bar < (PCI_ROM_SLOT - 1) &&
+            !vdev->bars[target_bar + 1].size) {
+            vdev->bars[target_bar].mem64 = true;
+            vdev->bars[target_bar].type = PCI_BASE_ADDRESS_MEM_TYPE_64;
+        }
+        vdev->bars[target_bar].type |= PCI_BASE_ADDRESS_MEM_PREFETCH;
+        vdev->bars[target_bar].size = msix_sz;
+        vdev->msix->table_offset = 0;
+    } else {
+        vdev->bars[target_bar].size = MAX(vdev->bars[target_bar].size * 2,
+                                          msix_sz * 2);
+        /*
+         * Due to above size calc, MSI-X always starts halfway into the BAR,
+         * which will always be a separate host page.
+         */
+        vdev->msix->table_offset = vdev->bars[target_bar].size / 2;
+    }
+
+    vdev->msix->table_bar = target_bar;
+    vdev->msix->pba_bar = target_bar;
+    /* Requires 8-byte alignment, but PCI_MSIX_ENTRY_SIZE guarantees that */
+    vdev->msix->pba_offset = vdev->msix->table_offset +
+                             (vdev->msix->entries * PCI_MSIX_ENTRY_SIZE);
+
+    trace_vfio_msix_relo(vdev->vbasedev.name,
+                         vdev->msix->table_bar, vdev->msix->table_offset);
+}
+
 /*
  * We don't have any control over how pci_add_capability() inserts
  * capabilities into the chain.  In order to setup MSI-X we need a
@@ -1430,6 +1525,8 @@ static void vfio_msix_early_setup(VFIOPCIDevice *vdev, Error **errp)
     vdev->msix = msix;
 
     vfio_pci_fixup_msix_region(vdev);
+
+    vfio_pci_relocate_msix(vdev);
 }
 
 static int vfio_msix_setup(VFIOPCIDevice *vdev, int pos, Error **errp)
@@ -2845,13 +2942,14 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
 
     vfio_pci_size_rom(vdev);
 
+    vfio_bars_prepare(vdev);
+
     vfio_msix_early_setup(vdev, &err);
     if (err) {
         error_propagate(errp, err);
         goto error;
     }
 
-    vfio_bars_prepare(vdev);
     vfio_bars_register(vdev);
 
     ret = vfio_add_capabilities(vdev, errp);
@@ -3041,6 +3139,8 @@ static Property vfio_pci_dev_properties[] = {
     DEFINE_PROP_UNSIGNED_NODEFAULT("x-nv-gpudirect-clique", VFIOPCIDevice,
                                    nv_gpudirect_clique,
                                    qdev_prop_nv_gpudirect_clique, uint8_t),
+    DEFINE_PROP_OFF_AUTO_PCIBAR("x-msix-relocation", VFIOPCIDevice, msix_relo,
+                                OFF_AUTOPCIBAR_OFF),
     /*
      * TODO - support passed fds... is this necessary?
      * DEFINE_PROP_STRING("vfiofd", VFIOPCIDevice, vfiofd_name),
diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
index dcdb1a806769..588381f201b4 100644
--- a/hw/vfio/pci.h
+++ b/hw/vfio/pci.h
@@ -135,6 +135,7 @@ typedef struct VFIOPCIDevice {
     (1 << VFIO_FEATURE_ENABLE_IGD_OPREGION_BIT)
     int32_t bootindex;
     uint32_t igd_gms;
+    OffAutoPCIBAR msix_relo;
     uint8_t pm_cap;
     uint8_t nv_gpudirect_clique;
     bool pci_aer;
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index fae096c0724f..437ccdd29053 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -16,6 +16,8 @@ vfio_msix_pba_disable(const char *name) " (%s)"
 vfio_msix_pba_enable(const char *name) " (%s)"
 vfio_msix_disable(const char *name) " (%s)"
 vfio_msix_fixup(const char *name, int bar, uint64_t start, uint64_t end) " (%s) MSI-X region %d mmap fixup [0x%"PRIx64" - 0x%"PRIx64"]"
+vfio_msix_relo_cost(const char *name, int bar, uint64_t cost) " (%s) BAR %d cost 0x%"PRIx64""
+vfio_msix_relo(const char *name, int bar, uint64_t offset) " (%s) BAR %d offset 0x%"PRIx64""
 vfio_msi_enable(const char *name, int nr_vectors) " (%s) Enabled %d MSI vectors"
 vfio_msi_disable(const char *name) " (%s)"
 vfio_pci_load_rom(const char *name, unsigned long size, unsigned long offset, unsigned long flags) "Device %s ROM:\n  size: 0x%lx, offset: 0x%lx, flags: 0x%lx"
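
As a worked example of the size and cost arithmetic in vfio_pci_relocate_msix(), here is a minimal standalone sketch. It mirrors the patch's formulas but is not the QEMU code: `align_up`, `pow2_ceil`, `msix_reloc_size` and `msix_reloc_cost` are stand-in names for QEMU_ALIGN_UP/REAL_HOST_PAGE_ALIGN, pow2ceil and the inline expressions, and the host page size is passed in as a parameter rather than queried.

```c
#include <assert.h>
#include <stdint.h>

#define MSIX_ENTRY_SIZE 16u  /* bytes per MSI-X table entry */

/* Round x up to a multiple of align; align must be a power of two. */
static uint64_t align_up(uint64_t x, uint64_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (x + align - 1) & ~(align - 1);
}

/* Smallest power of two that is >= x. */
static uint64_t pow2_ceil(uint64_t x)
{
    uint64_t p = 1;

    assert(x >= 1 && x <= (1ULL << 63));
    while (p < x) {
        p <<= 1;
    }
    return p;
}

/*
 * Table (16 bytes per vector) plus PBA (one bit per vector, in 64-bit
 * words), rounded up to a whole host page and then to a power of two.
 */
static uint64_t msix_reloc_size(unsigned entries, uint64_t host_page_size)
{
    uint64_t table = (uint64_t)entries * MSIX_ENTRY_SIZE;
    uint64_t pba = align_up(entries, 64) / 8;

    return pow2_ceil(align_up(table + pba, host_page_size));
}

/*
 * Resulting BAR size if MSI-X is placed in a BAR of bar_size bytes:
 * an unimplemented BAR (size 0) becomes msix_sz, an existing one becomes
 * the larger of twice its size and twice msix_sz.
 */
static uint64_t msix_reloc_cost(uint64_t bar_size, uint64_t msix_sz)
{
    uint64_t doubled_bar = bar_size * 2;
    uint64_t doubled_msix = msix_sz * 2;

    if (bar_size == 0) {
        return msix_sz;
    }
    return doubled_bar > doubled_msix ? doubled_bar : doubled_msix;
}
```

For the 96-vector SAS3008 from the test report, the table is 96 * 16 = 1536 bytes and the PBA is 128 / 8 = 16 bytes, 1552 bytes in total. On a 4K host `msix_reloc_size(96, 4096)` gives 4096; on a 64K host `msix_reloc_size(96, 65536)` gives 65536. With the 64K figure, a free slot costs 64K, the existing 64K BAR would grow to 128K and the 256K BAR to 512K, so "auto" as posted prefers a free slot whenever one exists and otherwise the lowest-numbered BAR among equal costs.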
With recently proposed kernel side vfio-pci changes, the MSI-X vector
table area can be mmap'd from userspace, allowing direct access to
non-MSI-X registers within the host page size of this area. However,
we only get that direct access if QEMU isn't also emulating MSI-X
within that same page. For x86/64 host, the system page size is 4K
and the PCI spec recommends a minimum of 4K to 8K alignment to
separate MSI-X from non-MSI-X registers, therefore only devices which
don't honor this recommendation would see any improvement from this
option. The real targets for this feature are hosts where the page
size exceeds the PCI spec recommended alignment, such as ARM64 systems
with 64K pages.

This new x-msix-relocation option accepts the following options:

  off: Disable MSI-X relocation, use native device config (default)
  auto: Automatically relocate MSI-X MMIO to another BAR or offset
        based on minimum additional MMIO requirement
  bar0..bar5: Specify the target BAR, which will either be extended
              if the BAR exists or added if the BAR slot is available.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---
 hw/vfio/pci.c        |  102 ++++++++++++++++++++++++++++++++++++++++++++++++++
 hw/vfio/pci.h        |    1
 hw/vfio/trace-events |    2 +
 3 files changed, 104 insertions(+), 1 deletion(-)