
target/i386: Support up to 32768 CPUs without IRQ remapping

Message ID 78097f9218300e63e751e077a0a5ca029b56ba46.camel@infradead.org
State New

Commit Message

David Woodhouse Oct. 5, 2020, 2:18 p.m. UTC
The IOAPIC has an 'Extended Destination ID' field in its RTE, which maps
to bits 11-4 of the MSI address. Since those address bits fall within a
given 4KiB page they were historically non-trivial to use on real hardware.

The Intel IOMMU uses the lowest bit to indicate a remappable format MSI,
and then the remaining 7 bits are part of the index.

Where the remappable format bit isn't set, we can actually use the other
seven to allow external (IOAPIC and MSI) interrupts to reach up to 32768
CPUs instead of just the 255 permitted on bare metal.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 hw/i386/kvm/apic.c                          |  7 ++
 hw/i386/pc.c                                | 16 ++---
 include/standard-headers/asm-x86/kvm_para.h |  1 +
 target/i386/cpu.c                           |  5 +-
 target/i386/kvm.c                           | 74 +++++++++++++++------
 target/i386/kvm_i386.h                      |  2 +
 6 files changed, 75 insertions(+), 30 deletions(-)
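
To illustrate the layout the commit message describes, here is a minimal C
sketch of packing a 15-bit destination ID into a compatibility-format MSI
address. The macro and function names (MSI_ADDR_BASE, msi_addr_encode,
msi_addr_decode) are made up for illustration and do not appear in the patch.
The bit positions follow the description above: destination ID bits 7-0 in
address bits 19-12, the seven extra bits in address bits 11-5, and bit 4 left
clear so the message is not treated as remappable format.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative constants; names are assumptions, not taken from QEMU. */
#define MSI_ADDR_BASE         0xfee00000u
#define MSI_ADDR_DEST_SHIFT   12          /* address bits 19-12: dest ID 7-0  */
#define MSI_ADDR_EXT_SHIFT    5           /* address bits 11-5:  dest ID 14-8 */
#define MSI_ADDR_REMAP_FORMAT (1u << 4)   /* address bit 4: remappable format */

/* Build the low 32 bits of an MSI address for a 15-bit destination ID. */
uint32_t msi_addr_encode(uint32_t dest_id)
{
    assert(dest_id < (1u << 15));         /* 2^15 = 32768 possible IDs */
    return MSI_ADDR_BASE |
           ((dest_id & 0xff) << MSI_ADDR_DEST_SHIFT) |
           ((dest_id >> 8) << MSI_ADDR_EXT_SHIFT);
}

/* Recover the 15-bit destination ID from a compatibility-format address. */
uint32_t msi_addr_decode(uint32_t addr)
{
    assert(!(addr & MSI_ADDR_REMAP_FORMAT));
    return ((addr >> MSI_ADDR_DEST_SHIFT) & 0xff) |
           (((addr >> MSI_ADDR_EXT_SHIFT) & 0x7f) << 8);
}
```

With 15 bits available the largest encodable ID is 32767, hence 32768 CPUs.
msi_addr_encode(256) yields 0xfee00020, the first ID that needs the extended
field.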

Comments

David Woodhouse Oct. 7, 2020, 12:24 p.m. UTC | #1
On Mon, 2020-10-05 at 15:18 +0100, David Woodhouse wrote:
> The IOAPIC has an 'Extended Destination ID' field in its RTE, which maps
> to bits 11-4 of the MSI address. Since those address bits fall within a
> given 4KiB page they were historically non-trivial to use on real hardware.
> 
> The Intel IOMMU uses the lowest bit to indicate a remappable format MSI,
> and then the remaining 7 bits are part of the index.
> 
> Where the remappable format bit isn't set, we can actually use the other
> seven to allow external (IOAPIC and MSI) interrupts to reach up to 32768
> CPUs instead of just the 255 permitted on bare metal.
> 
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>

The corresponding kernel patch is at
https://patchwork.kernel.org/patch/11820535/
Paolo Bonzini Oct. 8, 2020, 6:56 a.m. UTC | #2
On 05/10/20 16:18, David Woodhouse wrote:
> +        if (kvm_irqchip_is_split()) {
> +            ret |= 1U << KVM_FEATURE_MSI_EXT_DEST_ID;
> +        }

IIUC this is because in-kernel IOAPIC still doesn't work; and when it
does, KVM will advertise the feature itself so no other QEMU changes
will be needed.

I queued this, though of course it has to wait for the corresponding
kernel patches to be accepted (or separated into doc and non-KVM parts;
we'll see).

Paolo
David Woodhouse Oct. 8, 2020, 7:29 a.m. UTC | #3
On Thu, 2020-10-08 at 08:56 +0200, Paolo Bonzini wrote:
> On 05/10/20 16:18, David Woodhouse wrote:
> > +        if (kvm_irqchip_is_split()) {
> > +            ret |= 1U << KVM_FEATURE_MSI_EXT_DEST_ID;
> > +        }
> 
> IIUC this is because in-kernel IOAPIC still doesn't work; and when it
> does, KVM will advertise the feature itself so no other QEMU changes
> will be needed.

More the MSI handling than the IOAPIC. I haven't actually worked out
*what* handles cycles to addresses in the 0xFEExxxxx range for the in-
kernel irqchip and turns them into interrupts (after putting them
through interrupt remapping, if/when the kernel learns to do that).

Ideally the IOAPIC would just swizzle the bits in its RTE to create an
MSI message and pass it on to the same code to be (translated and)
delivered.

You'll note my qemu patch didn't touch IOAPIC code at all, because
qemu's IOAPIC really does just that.
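
The RTE-to-MSI conversion mentioned above can be sketched roughly as follows.
This is an illustration, not QEMU's actual IOAPIC code: the struct and
function names are hypothetical, and the field positions (vector in RTE bits
7-0, delivery mode in 10-8, destination mode in 11, trigger mode in 15,
Extended Destination ID in 55-48, destination ID in 63-56) are the
conventional IOAPIC and MSI layout, assumed here rather than taken from the
patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for an MSI message. */
struct msi_msg {
    uint64_t address;
    uint32_t data;
};

/* Build a fixed-delivery, edge-triggered, physical-mode RTE for a 15-bit ID. */
uint64_t ioapic_rte_for_dest(uint8_t vector, uint32_t dest_id)
{
    assert(dest_id < (1u << 15));
    return (uint64_t)vector |
           ((uint64_t)(dest_id >> 8) << 49) |    /* RTE bits 55-49: ID 14-8 */
           ((uint64_t)(dest_id & 0xff) << 56);   /* RTE bits 63-56: ID 7-0  */
}

/* Rearrange the RTE fields into an MSI address/data pair. */
struct msi_msg ioapic_rte_to_msi(uint64_t rte)
{
    uint32_t vector        = rte & 0xff;
    uint32_t delivery_mode = (rte >> 8) & 0x7;
    uint32_t dest_mode     = (rte >> 11) & 0x1;
    uint32_t trigger_mode  = (rte >> 15) & 0x1;
    uint32_t ext_dest      = (rte >> 48) & 0xff; /* bit 48 = remappable format */
    uint32_t dest          = (rte >> 56) & 0xff;
    struct msi_msg msg;

    msg.address = 0xfee00000u | (dest << 12) | (ext_dest << 4) |
                  (dest_mode << 2);
    msg.data = vector | (delivery_mode << 8) | (trigger_mode << 15);
    return msg;
}
```

For example, ioapic_rte_to_msi(ioapic_rte_for_dest(0x30, 0x1234)) gives
address 0xfee34240 and data 0x30: the extended bits land in address bits 11-5
without the IOAPIC needing to know what they mean.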

> I queued this, though of course it has to wait for the corresponding
> kernel patches to be accepted (or separated into doc and non-KVM
> parts; we'll see).

Thanks.
Paolo Bonzini Oct. 8, 2020, 7:53 a.m. UTC | #4
On 08/10/20 09:29, David Woodhouse wrote:
> On Thu, 2020-10-08 at 08:56 +0200, Paolo Bonzini wrote:
>> On 05/10/20 16:18, David Woodhouse wrote:
>>> +        if (kvm_irqchip_is_split()) {
>>> +            ret |= 1U << KVM_FEATURE_MSI_EXT_DEST_ID;
>>> +        }
>>
>> IIUC this is because in-kernel IOAPIC still doesn't work; and when it
>> does, KVM will advertise the feature itself so no other QEMU changes
>> will be needed.
> 
> More the MSI handling than the IOAPIC. I haven't actually worked out
> *what* handles cycles to addresses in the 0xFEExxxxx range for the in-
> kernel irqchip and turns them into interrupts (after putting them
> through interrupt remapping, if/when the kernel learns to do that).

That's easy: it's QEMU. :)  See kvm_apic_mem_write in hw/i386/kvm/apic.c
(note that this memory region is never used when the CPU accesses
0xFEExxxxx, only when QEMU does).

Conversion from the IOAPIC and MSI formats to struct kvm_lapic_irq is
completely separate in KVM: it happens in ioapic_service and
kvm_set_msi_irq respectively.  Both of them prepare a struct
kvm_lapic_irq, but they're two different paths.

> Ideally the IOAPIC would just swizzle the bits in its RTE to create an
> MSI message and pass it on to the same code to be (translated and)
> delivered.
> 
> You'll note my qemu patch didn't touch IOAPIC code at all, because
> qemu's IOAPIC really does just that.

Indeed the nice thing about irqchip=split is that the handling of device
interrupts is entirely confined within QEMU, no matter if they're IOAPIC
or MSI.  And because we had to implement interrupt remapping, the IOAPIC
is effectively using MSIs to deliver its interrupts.

There's still the hack to communicate IOAPIC routes to KVM and have it
set the EOI exit bitmap correctly, though.  The code is in
kvm_scan_ioapic_routes and it uses kvm_set_msi_irq (with irqchip=split
everything is also an MSI within the kernel).  I think you're not
handling that correctly for CPUs >255, so after all we _do_ need some
kernel support.

Paolo

>> I queued this, though of course it has to wait for the corresponding
>> kernel patches to be accepted (or separated into doc and non-KVM
>> parts; we'll see).
> 
> Thanks.
>
David Woodhouse Oct. 19, 2020, 12:21 p.m. UTC | #5
On Thu, 2020-10-08 at 09:53 +0200, Paolo Bonzini wrote:
> Indeed the nice thing about irqchip=split is that the handling of device
> interrupts is entirely confined within QEMU, no matter if they're IOAPIC
> or MSI.  And because we had to implement interrupt remapping, the IOAPIC
> is effectively using MSIs to deliver its interrupts.
> 
> There's still the hack to communicate IOAPIC routes to KVM and have it
> set the EOI exit bitmap correctly, though.  The code is in
> kvm_scan_ioapic_routes and it uses kvm_set_msi_irq (with irqchip=split
> everything is also an MSI within the kernel).  I think you're not
> handling that correctly for CPUs >255, so after all we _do_ need some
> kernel support.

I think that works out OK.

In QEMU's ioapic_update_kvm_routes() it calls ioapic_entry_parse()
which generates the actual "bus" MSI with the extended dest ID in bits
11-5 of the address.

That MSI message is passed to kvm_irqchip_update_msi_route() which
passes it through translation —  which does interrupt remapping and
shifting the ext bits up into ->address_hi as the KVM X2APIC API
expects.

So when the kernel's kvm_scan_ioapic_routes() goes looking,
kvm_set_msi_irq() fills 'irq' in with the correct dest_id, and
kvm_apic_match_dest() does the right thing.

No?
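
The address flow described above can be checked with a small standalone
sketch. swizzle_ext_dest_id() mirrors the bit manipulation in the patch's
kvm_swizzle_msi_ext_dest_id() but leaves out the CPU feature check.
x2apic_dest_id() is a hypothetical stand-in for how the KVM x2APIC MSI format
is assumed to be decoded (destination bits 7-0 from address_lo bits 19-12,
bits 31-8 from address_hi bits 31-8); it is not a copy of the kernel's
kvm_set_msi_irq().

```c
#include <assert.h>
#include <stdint.h>

#define EXT_DEST_MASK (0xffULL << 4)  /* address bits 11-4 */
#define REMAP_FORMAT  (1ULL << 4)     /* address bit 4 */

/* Move the extended destination bits 11-5 up into address bits 46-40. */
uint64_t swizzle_ext_dest_id(uint64_t address)
{
    uint64_t ext_id = address & EXT_DEST_MASK;

    /* Nothing to do if there are no extended bits, the message is in
     * remappable format, or the high word is already populated. */
    if (!ext_id || (ext_id & REMAP_FORMAT) || (address >> 32)) {
        return address;
    }
    address &= ~ext_id;
    address |= ext_id << 35;
    return address;
}

/* Assumed x2APIC-format decode of a 64-bit MSI address. */
uint32_t x2apic_dest_id(uint64_t address)
{
    uint32_t lo = (uint32_t)address;
    uint32_t hi = (uint32_t)(address >> 32);

    return ((lo >> 12) & 0xff) | (hi & 0xffffff00u);
}

/* Worked example: a "bus" MSI aimed at APIC ID 0x1234. */
void swizzle_example(void)
{
    uint64_t bus = 0xfee34240ULL;
    uint64_t kvm = swizzle_ext_dest_id(bus);

    assert(kvm == 0x00001200fee34000ULL);
    assert(x2apic_dest_id(kvm) == 0x1234);
}
```

An address with bit 4 set, such as 0xfee34250, is returned unchanged, which
is how remappable-format messages stay out of this path.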


As far as I can tell, we *do* have a QEMU bug — not related to the ext
dest ID — because for MSIs of assigned devices we don't update the KVM
IRQ routing table when the Interrupt Remapping IEC cache is flushed.

> Paolo
> 
> > > I queued this, though of course it has to wait for the corresponding
> > > kernel patches to be accepted (or separated into doc and non-KVM
> > > parts; we'll see).
> > 
> > Thanks.

So... it'll hit the tip.git tree and thus linux-next as soon as Linus
releases 5.10-rc1, and it'll then get merged into 5.11-rc1 and be in
the 5.11 release.

At which of those three points in time would you be happy to merge it
to QEMU? If it's either of the latter two, maybe it *is* worth doing a
patch which *only* reserves the feature bit, and trying to slip it into
5.10?
Paolo Bonzini Oct. 19, 2020, 1:55 p.m. UTC | #6
On 19/10/20 14:21, David Woodhouse wrote:
> On Thu, 2020-10-08 at 09:53 +0200, Paolo Bonzini wrote:
>> I think you're not
>> handling that correctly for CPUs >255, so after all we _do_ need some
>> kernel support.
> 
> I think that works out OK.
> 
> In QEMU's ioapic_update_kvm_routes() it calls ioapic_entry_parse()
> which generates the actual "bus" MSI with the extended dest ID in bits
> 11-5 of the address.
> 
> That MSI message is passed to kvm_irqchip_update_msi_route() which
> passes it through translation —  which does interrupt remapping and
> shifting the ext bits up into ->address_hi as the KVM X2APIC API
> expects.
> 
> So when the kernel's kvm_scan_ioapic_routes() goes looking,
> kvm_set_msi_irq() fills 'irq' in with the correct dest_id, and
> kvm_apic_match_dest() does the right thing.
> 
> No?

Yeah, that seems fine.

> As far as I can tell, we *do* have a QEMU bug — not related to the ext
> dest ID — because for MSIs of assigned devices we don't update the KVM
> IRQ routing table when the Interrupt Remapping IEC cache is flushed.

> So... it'll hit the tip.git tree and thus linux-next as soon as Linus
> releases 5.10-rc1, and it'll then get merged into 5.11-rc1 and be in
> the 5.11 release.
> 
> At which of those three points in time would you be happy to merge it
> to QEMU? If it's either of the latter two, maybe it *is* worth doing a
> patch which *only* reserves the feature bit, and trying to slip it into
> 5.10?

It would be 5.11-rc1 because of the KVM_FEATURE_MSI_EXT_DEST_ID
definition which would not be in your patch but rather synchronized from
the Linux tree by scripts/update-linux-headers.sh.

If you send me the doc patch any time before 5.10-rc7, it will be in 5.10.

Paolo

Patch

diff --git a/hw/i386/kvm/apic.c b/hw/i386/kvm/apic.c
index 4eb2d77b87..aeb3366ae8 100644
--- a/hw/i386/kvm/apic.c
+++ b/hw/i386/kvm/apic.c
@@ -183,6 +183,13 @@  static void kvm_send_msi(MSIMessage *msg)
 {
     int ret;
 
+    /*
+     * The message has already passed through interrupt remapping if enabled,
+     * but the legacy extended destination ID in low bits still needs to be
+     * handled.
+     */
+    msg->address = kvm_swizzle_msi_ext_dest_id(msg->address);
+
     ret = kvm_irqchip_send_msi(kvm_state, *msg);
     if (ret < 0) {
         fprintf(stderr, "KVM: injection failed, MSI lost (%s)\n",
diff --git a/hw/i386/pc.c b/hw/i386/pc.c
index e87be5d29a..a06c091227 100644
--- a/hw/i386/pc.c
+++ b/hw/i386/pc.c
@@ -99,6 +99,7 @@ 
 
 GlobalProperty pc_compat_5_1[] = {
     { "ICH9-LPC", "x-smi-cpu-hotplug", "off" },
+    { TYPE_X86_CPU, "kvm-msi-ext-dest-id", "off" },
 };
 const size_t pc_compat_5_1_len = G_N_ELEMENTS(pc_compat_5_1);
 
@@ -807,17 +808,12 @@  void pc_machine_done(Notifier *notifier, void *data)
         fw_cfg_modify_i16(x86ms->fw_cfg, FW_CFG_NB_CPUS, x86ms->boot_cpus);
     }
 
-    if (x86ms->apic_id_limit > 255 && !xen_enabled()) {
-        IntelIOMMUState *iommu = INTEL_IOMMU_DEVICE(x86_iommu_get_default());
 
-        if (!iommu || !x86_iommu_ir_supported(X86_IOMMU_DEVICE(iommu)) ||
-            iommu->intr_eim != ON_OFF_AUTO_ON) {
-            error_report("current -smp configuration requires "
-                         "Extended Interrupt Mode enabled. "
-                         "You can add an IOMMU using: "
-                         "-device intel-iommu,intremap=on,eim=on");
-            exit(EXIT_FAILURE);
-        }
+    if (x86ms->apic_id_limit > 255 && !xen_enabled() &&
+        !kvm_irqchip_in_kernel()) {
+        error_report("current -smp configuration requires kernel "
+                     "irqchip support.");
+        exit(EXIT_FAILURE);
     }
 }
 
diff --git a/include/standard-headers/asm-x86/kvm_para.h b/include/standard-headers/asm-x86/kvm_para.h
index 07877d3295..215d01b4ec 100644
--- a/include/standard-headers/asm-x86/kvm_para.h
+++ b/include/standard-headers/asm-x86/kvm_para.h
@@ -32,6 +32,7 @@ 
 #define KVM_FEATURE_POLL_CONTROL	12
 #define KVM_FEATURE_PV_SCHED_YIELD	13
 #define KVM_FEATURE_ASYNC_PF_INT	14
+#define KVM_FEATURE_MSI_EXT_DEST_ID	15
 
 #define KVM_HINTS_REALTIME      0
 
diff --git a/target/i386/cpu.c b/target/i386/cpu.c
index f37eb7b675..a93f50a6a7 100644
--- a/target/i386/cpu.c
+++ b/target/i386/cpu.c
@@ -799,7 +799,7 @@  static FeatureWordInfo feature_word_info[FEATURE_WORDS] = {
             "kvmclock", "kvm-nopiodelay", "kvm-mmu", "kvmclock",
             "kvm-asyncpf", "kvm-steal-time", "kvm-pv-eoi", "kvm-pv-unhalt",
             NULL, "kvm-pv-tlb-flush", NULL, "kvm-pv-ipi",
-            "kvm-poll-control", "kvm-pv-sched-yield", "kvm-asyncpf-int", NULL,
+            "kvm-poll-control", "kvm-pv-sched-yield", "kvm-asyncpf-int", "kvm-msi-ext-dest-id",
             NULL, NULL, NULL, NULL,
             NULL, NULL, NULL, NULL,
             "kvmclock-stable-bit", NULL, NULL, NULL,
@@ -4109,6 +4109,7 @@  static PropValue kvm_default_props[] = {
     { "kvm-pv-eoi", "on" },
     { "kvmclock-stable-bit", "on" },
     { "x2apic", "on" },
+    { "kvm-msi-ext-dest-id", "off" },
     { "acpi", "off" },
     { "monitor", "off" },
     { "svm", "off" },
@@ -5132,6 +5133,8 @@  static void x86_cpu_load_model(X86CPU *cpu, X86CPUModel *model)
     if (kvm_enabled()) {
         if (!kvm_irqchip_in_kernel()) {
             x86_cpu_change_kvm_default("x2apic", "off");
+        } else if (kvm_irqchip_is_split() && kvm_enable_x2apic()) {
+            x86_cpu_change_kvm_default("kvm-msi-ext-dest-id", "on");
         }
 
         x86_cpu_apply_props(cpu, kvm_default_props);
diff --git a/target/i386/kvm.c b/target/i386/kvm.c
index f6dae4cfb6..90952cae7c 100644
--- a/target/i386/kvm.c
+++ b/target/i386/kvm.c
@@ -420,6 +420,9 @@  uint32_t kvm_arch_get_supported_cpuid(KVMState *s, uint32_t function,
         if (!kvm_irqchip_in_kernel()) {
             ret &= ~(1U << KVM_FEATURE_PV_UNHALT);
         }
+        if (kvm_irqchip_is_split()) {
+            ret |= 1U << KVM_FEATURE_MSI_EXT_DEST_ID;
+        }
     } else if (function == KVM_CPUID_FEATURES && reg == R_EDX) {
         ret |= 1U << KVM_HINTS_REALTIME;
     }
@@ -4583,38 +4586,71 @@  int kvm_arch_irqchip_create(KVMState *s)
     }
 }
 
+uint64_t kvm_swizzle_msi_ext_dest_id(uint64_t address)
+{
+        CPUX86State *env = &X86_CPU(first_cpu)->env;
+        uint64_t ext_id;
+
+        if (!first_cpu ||
+            !(env->features[FEAT_KVM] & (1 << KVM_FEATURE_MSI_EXT_DEST_ID))) {
+            return address;
+        }
+
+        /*
+         * If the remappable format bit is set, or the upper bits are
+         * already set in address_hi, or the low extended bits aren't
+         * there anyway, do nothing.
+         */
+        ext_id = address & (0xff << MSI_ADDR_DEST_IDX_SHIFT);
+        if (!ext_id || (ext_id & (1 << MSI_ADDR_DEST_IDX_SHIFT)) ||
+            (address >> 32))
+            return address;
+
+        address &= ~ext_id;
+        address |= ext_id << 35;
+        return address;
+}
+
 int kvm_arch_fixup_msi_route(struct kvm_irq_routing_entry *route,
                              uint64_t address, uint32_t data, PCIDevice *dev)
 {
     X86IOMMUState *iommu = x86_iommu_get_default();
 
     if (iommu) {
-        int ret;
-        MSIMessage src, dst;
         X86IOMMUClass *class = X86_IOMMU_DEVICE_GET_CLASS(iommu);
 
-        if (!class->int_remap) {
-            return 0;
-        }
+        if (class->int_remap) {
+            int ret;
+            MSIMessage src, dst;
 
-        src.address = route->u.msi.address_hi;
-        src.address <<= VTD_MSI_ADDR_HI_SHIFT;
-        src.address |= route->u.msi.address_lo;
-        src.data = route->u.msi.data;
+            src.address = route->u.msi.address_hi;
+            src.address <<= VTD_MSI_ADDR_HI_SHIFT;
+            src.address |= route->u.msi.address_lo;
+            src.data = route->u.msi.data;
 
-        ret = class->int_remap(iommu, &src, &dst, dev ? \
-                               pci_requester_id(dev) : \
-                               X86_IOMMU_SID_INVALID);
-        if (ret) {
-            trace_kvm_x86_fixup_msi_error(route->gsi);
-            return 1;
-        }
+            ret = class->int_remap(iommu, &src, &dst, dev ?     \
+                                   pci_requester_id(dev) :      \
+                                   X86_IOMMU_SID_INVALID);
+            if (ret) {
+                trace_kvm_x86_fixup_msi_error(route->gsi);
+                return 1;
+            }
+
+            /*
+             * Handle untranslated compatibility format interrupt with
+             * extended destination ID in the low bits 11-5. */
+            dst.address = kvm_swizzle_msi_ext_dest_id(dst.address);
 
-        route->u.msi.address_hi = dst.address >> VTD_MSI_ADDR_HI_SHIFT;
-        route->u.msi.address_lo = dst.address & VTD_MSI_ADDR_LO_MASK;
-        route->u.msi.data = dst.data;
+            route->u.msi.address_hi = dst.address >> VTD_MSI_ADDR_HI_SHIFT;
+            route->u.msi.address_lo = dst.address & VTD_MSI_ADDR_LO_MASK;
+            route->u.msi.data = dst.data;
+            return 0;
+        }
     }
 
+    address = kvm_swizzle_msi_ext_dest_id(address);
+    route->u.msi.address_hi = address >> VTD_MSI_ADDR_HI_SHIFT;
+    route->u.msi.address_lo = address & VTD_MSI_ADDR_LO_MASK;
     return 0;
 }
 
diff --git a/target/i386/kvm_i386.h b/target/i386/kvm_i386.h
index 0fce4e51d2..ede94760ae 100644
--- a/target/i386/kvm_i386.h
+++ b/target/i386/kvm_i386.h
@@ -49,4 +49,6 @@  bool kvm_has_waitpkg(void);
 
 bool kvm_hv_vpindex_settable(void);
 
+uint64_t kvm_swizzle_msi_ext_dest_id(uint64_t address);
+
 #endif