Message ID | 20160426073426.GD28545@pxdev.xzpeter.org |
---|---|
State | New |
On 2016-04-26 09:34, Peter Xu wrote:
> On Mon, Apr 25, 2016 at 09:24:12AM +0200, Jan Kiszka wrote:
>> On 2016-04-25 09:18, Peter Xu wrote:
>>> On Mon, Apr 25, 2016 at 07:16:19AM +0200, Jan Kiszka wrote:
>>>> On 2016-04-19 10:38, Peter Xu wrote:
>>>
>>> [...]
>>>
>>>>> By default, IR is disabled to be better compatible with current
>>>>> QEMU. To enable IR, we can using the following command to boot a
>>>>> IR-supported VM with virtio-net device with vhost (still do not
>>>>> support kvm-ioapic, so we need to specify kernel-irqchip={split|off}
>>>>> here):
>>>>>
>>>>> $ qemu-system-x86_64 -M q35,iommu=on,intr=on,kernel-irqchip=split \
>>>>
>>>> "intr" sounds a bit too much like "interrupt", not "interrupt
>>>> remapping". Why not use the kernel's form, "intremap"?
>>>
>>> Sure. It sounds nice to be aligned with the kernel one. Let me take
>>> it in v5.
>>>
>>>>
>>>>>      -enable-kvm -m 1024 \
>>>>>      -netdev tap,id=net0,vhost=on \
>>>>>      -device virtio-net-pci,netdev=user.0 \
>>>>>      -monitor telnet::3333,server,nowait \
>>>>>      /var/lib/libvirt/images/vm1.qcow2
>>>>>
>>>>> When guest boots, we can verify whether IR enabled by grepping the
>>>>> dmesg like:
>>>>>
>>>>> [root@localhost ~]# journalctl -k | grep "DMAR-IR"
>>>>> Feb 19 11:21:23 localhost.localdomain kernel: DMAR-IR: IOAPIC id 0 under DRHD base 0xfed90000 IOMMU 0
>>>>> Feb 19 11:21:23 localhost.localdomain kernel: DMAR-IR: Enabled IRQ remapping in xapic mode
>>>>>
>>>>> Currently supported devices:
>>>>>
>>>>> - Emulated/Splitted irqchip
>>>>> - Generic PCI Devices
>>>>> - vhost devices
>>>>> - pass through device support? Not tested, but suppose it should work.
>>>>
>>>> I've tested this series against my Jailhouse setup, and it works pretty
>>>> well! Actually considering to move my test setup over this branch.
>>>
>>> This is really encouraging feedback! Btw, thanks for all kinds of
>>> help on this patchset. :-)
>>>
>>>>
>>>> However, split irqchip still has some issues: When I boot a q35 machine
>>>> with Linux, the e1000 network adapter only gets a single IRQ delivered.
>>>> Interestingly, other IOAPIC IRQs like the keyboard work all the time. I
>>>> didn't debug this in details yet.
>>>
>>> I reproduced this problem. It seems that it fails even with
>>> kernel-irqchip=off. Will try to dig it out.
>>
>> Very good. Hope it can be easily fixed.
>
> Hi, Jan,
>
> The above issue should be caused by EOI missing of level-triggered
> interrupts. Before that, I was always using edge-triggered
> interrupts for test, so didn't encounter this one. Would you please
> help try below patch? It can be applied directly onto the series,
> and should solve the issue (it works on my test vm, and I'll take it
> in v5 as well if it also works for you):
>

Works here as well. I even made EIM working with some hack, though
Jailhouse spits out strange warnings, despite it works fine (x2apic
mode, split irqchip).

> -------------------------
>
> diff --git a/hw/intc/ioapic.c b/hw/intc/ioapic.c
> index b41ab89..de6a8cf 100644
> --- a/hw/intc/ioapic.c
> +++ b/hw/intc/ioapic.c
> @@ -281,6 +281,36 @@ ioapic_mem_read(void *opaque, hwaddr addr, unsigned int size)
>      return val;
>  }
>
> +/*
> + * This is to satisfy the hack in Linux kernel. One hack of it is to
> + * simulate clearing the Remote IRR bit of IOAPIC entry using the
> + * following:
> + *
> + * "For IO-APIC's with EOI register, we use that to do an explicit EOI.
> + * Otherwise, we simulate the EOI message manually by changing the trigger
> + * mode to edge and then back to level, with RTE being masked during
> + * this."
> + *
> + * (See linux kernel __eoi_ioapic_pin() comment in commit c0205701)
> + *
> + * This is based on the assumption that, Remote IRR bit will be
> + * cleared by IOAPIC hardware for edge-triggered interrupts (I
> + * believe that's what the IOAPIC version 0x1X hardware does). So
> + * if we are emulating it, we'd better do it the same here, so that
> + * the guest kernel hack will work as well on QEMU.
> + *
> + * Without this, level-triggered interrupts in IR mode might fail to
> + * work correctly.
> + */
> +static inline void
> +ioapic_fix_edge_remote_irr(uint64_t *entry)
> +{
> +    if (*entry & IOAPIC_LVT_TRIGGER_MODE) {
> +        /* Level triggered interrupts, make sure remote IRR is zero */
> +        *entry &= ~((uint64_t)IOAPIC_LVT_REMOTE_IRR);
> +    }
> +}
> +
>  static void
>  ioapic_mem_write(void *opaque, hwaddr addr, uint64_t val,
>                   unsigned int size)
> @@ -314,6 +344,7 @@ ioapic_mem_write(void *opaque, hwaddr addr, uint64_t val,
>                  s->ioredtbl[index] &= ~0xffffffffULL;
>                  s->ioredtbl[index] |= val;
>              }
> +            ioapic_fix_edge_remote_irr(&s->ioredtbl[index]);
>              ioapic_service(s);
>          }
>      }
>
> ------------------------
>
> I am still looking into guest part codes. Although the above patch
> should solve the issue, there are still issues in guest codes when
> IR is enabled:
>
> - mismatched "vector" in IOAPIC entry and IRTE entry (this is
>   required in vt-d spec 5.1.5.1, and required to correctly deliver
>   EOI broadcast I guess). See intel_irq_remapping_prepare_irte():
>
>       ...
>       /*
>        * IO-APIC RTE will be configured with virtual vector.
>        * irq handler will do the explicit EOI to the io-apic.
>        */
>       entry->vector = info->ioapic_pin;
>       ...
>
> - I encountered that level-triggered entries in IOAPIC is marked as
>   edge-triggered interrupt in APIC (which is strange)... This will
>   also affect correct delivery of EOI broadcast. I still need time
>   to figure out why.
>
> If EOI broadcast can work, e1000 issue would be solved as
> well even without above patch.
>
> [...]

I don't remember details in this area, but maybe it's worth to look how
my hacks dealt with these cause (or made Linux to not create such weird
configurations).

Jan

On 2016-04-26 09:57, Jan Kiszka wrote:
> On 2016-04-26 09:34, Peter Xu wrote:
>> [...]
>>
>> Hi, Jan,
>>
>> The above issue should be caused by EOI missing of level-triggered
>> interrupts. Before that, I was always using edge-triggered
>> interrupts for test, so didn't encounter this one. Would you please
>> help try below patch? It can be applied directly onto the series,
>> and should solve the issue (it works on my test vm, and I'll take it
>> in v5 as well if it also works for you):
>>
>
> Works here as well. I even made EIM working with some hack, though
> Jailhouse spits out strange warnings, despite it works fine (x2apic
> mode, split irqchip).

Corrections: the warnings are issued by qemu, not Jailhouse, e.g.

qemu-system-x86_64: VT-d Failed to remap interrupt for gsi 22.

I suspect that comes from the hand-over phase of Jailhouse, when it
mutes all interrupts in the system while reconfiguring IR and IOAPIC.

Please convert this error (in kvm_arch_fixup_msi_route) into a trace
point. It shall not annoy the host. Also check if you have more of such
guest-triggerable error messages.

Jan

On Tue, Apr 26, 2016 at 10:15:46AM +0200, Jan Kiszka wrote:
> On 2016-04-26 09:57, Jan Kiszka wrote:
> > [...]
> >
> > Works here as well. I even made EIM working with some hack, though
> > Jailhouse spits out strange warnings, despite it works fine (x2apic
> > mode, split irqchip).
>
> Corrections: the warnings are issued by qemu, not Jailhouse, e.g.
>
> qemu-system-x86_64: VT-d Failed to remap interrupt for gsi 22.
>
> I suspect that comes from the hand-over phase of Jailhouse, when it
> mutes all interrupts in the system while reconfiguring IR and IOAPIC.
>
> Please convert this error (in kvm_arch_fixup_msi_route) into a trace
> point. It shall not annoy the host. Also check if you have more of such
> guest-triggerable error messages.

Okay. This should be the only one. I can use a trace point instead.

Meanwhile, I still suppose we should not see it even with
error_report(). Would this happen when booting e.g. generic kernels?

-- peterx

2016-04-26 15:34+0800, Peter Xu:
> Hi, Jan,
>
> The above issue should be caused by EOI missing of level-triggered
> interrupts. Before that, I was always using edge-triggered
> interrupts for test, so didn't encounter this one. Would you please
> help try below patch? It can be applied directly onto the series,
> and should solve the issue (it works on my test vm, and I'll take it
> in v5 as well if it also works for you):
>
> -------------------------
>
> diff --git a/hw/intc/ioapic.c b/hw/intc/ioapic.c
> @@ -281,6 +281,36 @@ ioapic_mem_read(void *opaque, hwaddr addr, unsigned int size)
> +/*
> + * This is to satisfy the hack in Linux kernel. One hack of it is to
> + * simulate clearing the Remote IRR bit of IOAPIC entry using the
> + * following:
> + *
> + * "For IO-APIC's with EOI register, we use that to do an explicit EOI.
> + * Otherwise, we simulate the EOI message manually by changing the trigger
> + * mode to edge and then back to level, with RTE being masked during
> + * this."
> + *
> + * (See linux kernel __eoi_ioapic_pin() comment in commit c0205701)
> + *
> + * This is based on the assumption that, Remote IRR bit will be
> + * cleared by IOAPIC hardware for edge-triggered interrupts (I
> + * believe that's what the IOAPIC version 0x1X hardware does).

I thought that Linux doesn't use explicit "EOI" to IO-APIC, but relies
on EOI broadcast from LAPIC -- does that change with IR?

> + * So
> + * if we are emulating it, we'd better do it the same here, so that
> + * the guest kernel hack will work as well on QEMU.

Totally.

> + * Without this, level-triggered interrupts in IR mode might fail to
> + * work correctly.

(I don't really understand why it worked before.)

> + */
> +static inline void
> +ioapic_fix_edge_remote_irr(uint64_t *entry)
> +{
> +    if (*entry & IOAPIC_LVT_TRIGGER_MODE) {
> +        /* Level triggered interrupts, make sure remote IRR is zero */
> +        *entry &= ~((uint64_t)IOAPIC_LVT_REMOTE_IRR);

(You can just unconditionally zero it, edge doesn't care.)

> +    }
> +}
> +
> @@ -314,6 +344,7 @@ ioapic_mem_write(void *opaque, hwaddr addr, uint64_t val,
>                  s->ioredtbl[index] &= ~0xffffffffULL;
>                  s->ioredtbl[index] |= val;
>              }
> +            ioapic_fix_edge_remote_irr(&s->ioredtbl[index]);

I think this can be done only in the else branch of (s->ioregsel & 1).
(If the guest kernel does level->edge->level, then remote_irr probably
should be cleared only on edge->level transition and not on
level->level, but I haven't seen that in the spec ...)

>              ioapic_service(s);
> ------------------------
>
> I am still looking into guest part codes. Although the above patch
> should solve the issue, there are still issues in guest codes when
> IR is enabled:
>
> - mismatched "vector" in IOAPIC entry and IRTE entry (this is
>   required in vt-d spec 5.1.5.1, and required to correctly deliver
>   EOI broadcast I guess). See intel_irq_remapping_prepare_irte():

"required" is a way of saying that the opposite is undefined.
No need to think about it in IOMMU.

> - I encountered that level-triggered entries in IOAPIC is marked as
>   edge-triggered interrupt in APIC (which is strange)...

What/where do you mean?
(The only difference I know of is that level triggered vectors in LAPIC
have their respective TMR bit set while edge do not.)

Thanks.

On Tue, Apr 26, 2016 at 04:19:00PM +0200, Radim Krčmář wrote:
> 2016-04-26 15:34+0800, Peter Xu:
> > Hi, Jan,
> >
> > The above issue should be caused by EOI missing of level-triggered
> > interrupts. Before that, I was always using edge-triggered
> > interrupts for test, so didn't encounter this one. Would you please
> > help try below patch? It can be applied directly onto the series,
> > and should solve the issue (it works on my test vm, and I'll take it
> > in v5 as well if it also works for you):
> >
> > -------------------------
> >
> > diff --git a/hw/intc/ioapic.c b/hw/intc/ioapic.c
> > @@ -281,6 +281,36 @@ ioapic_mem_read(void *opaque, hwaddr addr, unsigned int size)
> > +/*
> > + * This is to satisfy the hack in Linux kernel. One hack of it is to
> > + * simulate clearing the Remote IRR bit of IOAPIC entry using the
> > + * following:
> > + *
> > + * "For IO-APIC's with EOI register, we use that to do an explicit EOI.
> > + * Otherwise, we simulate the EOI message manually by changing the trigger
> > + * mode to edge and then back to level, with RTE being masked during
> > + * this."
> > + *
> > + * (See linux kernel __eoi_ioapic_pin() comment in commit c0205701)
> > + *
> > + * This is based on the assumption that, Remote IRR bit will be
> > + * cleared by IOAPIC hardware for edge-triggered interrupts (I
> > + * believe that's what the IOAPIC version 0x1X hardware does).
>
> I thought that Linux doesn't use explicit "EOI" to IO-APIC, but relies
> on EOI broadcast from LAPIC -- does that change with IR?

IIUC, ioapic_ack_level() should be the one to handle EOI when IR is
disabled. And, the EOI broadcast should be happening at:

    ack_APIC_irq();

While, after that, we can see some more lines:

    /*
     * Tail end of clearing remote IRR bit (either by delivering the EOI
     * message via io-apic EOI register write or simulating it using
     * mask+edge followed by unnask+level logic) manually when the
     * level triggered interrupt is seen as the edge triggered interrupt
     * at the cpu.
     */
    if (!(v & (1 << (i & 0x1f)))) {
        atomic_inc(&irq_mis_count);
        eoi_ioapic_pin(cfg->vector, irq_data->chip_data);
    }

What I understand the above is that: first of all, we will do EOI
broadcast. However, if we found that one level-triggered interrupt
is treated as edge-triggered interrupt (that is exactly what I have
encountered below), we will do one more explicit EOI in
eoi_ioapic_pin(), in which we played the edge-mask/level-unmask
trick for IOAPIC with version 0x1X.

For IR enabled case, we just do both without checking (see
ioapic_ir_ack_level()).

So that's why I think this should not happen if either way works... Or
say, if without this patch, both "EOI broadcast" and "explicit EOI
(hacky version)" are not working for IR case. And I am still looking
for the reason for previous one (this patch fix the latter one).

>
> > + * So
> > + * if we are emulating it, we'd better do it the same here, so that
> > + * the guest kernel hack will work as well on QEMU.
>
> Totally.
>
> > + * Without this, level-triggered interrupts in IR mode might fail to
> > + * work correctly.
>
> (I don't really understand why it worked before.)

Yes, actually what I want to try is to have one IOMMU hardware
machine, plug e1000 (I mean real hardware) into it, and see whether
current Linux kernel IOMMU driver can cope well with level-triggered
devices (I suppose this scenario is rarely used, since
level-triggered interrupts are most legacy IIUC).

>
> > + */
> > +static inline void
> > +ioapic_fix_edge_remote_irr(uint64_t *entry)
> > +{
> > +    if (*entry & IOAPIC_LVT_TRIGGER_MODE) {
> > +        /* Level triggered interrupts, make sure remote IRR is zero */
> > +        *entry &= ~((uint64_t)IOAPIC_LVT_REMOTE_IRR);
>
> (You can just unconditionally zero it, edge doesn't care.)

Ah! I made a mistake. I suppose what I really want is:

+    if (!(*entry & IOAPIC_LVT_TRIGGER_MODE)) {
+        /* Edge-triggered interrupts, make sure remote IRR is zero */
+        *entry &= ~((uint64_t)IOAPIC_LVT_REMOTE_IRR);
+    }

Though both should help do the trick, I should be using this new
one in v5.

>
> > +    }
> > +}
> > +
> > @@ -314,6 +344,7 @@ ioapic_mem_write(void *opaque, hwaddr addr, uint64_t val,
> >                  s->ioredtbl[index] &= ~0xffffffffULL;
> >                  s->ioredtbl[index] |= val;
> >              }
> > +            ioapic_fix_edge_remote_irr(&s->ioredtbl[index]);
>
> I think this can be done only in the else branch of (s->ioregsel & 1).

Yes. I can move it there, but there will be hidden assumption (or
say, truth...) that these magic bits are inside entry bits 31-0, and
people might be confused if we do not know that. IMHO, for better
readability of code, I would still prefer to put it here (it means
"we need to make sure the entry satisfy some kind of rule, but we do
not need to know further about what the rule is"). If you still
insist, I'd like to take your advice though. :)

>
> (If the guest kernel does level->edge->level, then remote_irr probably
> should be cleared only on edge->level transition and not on
> level->level, but I haven't seen that in the spec ...)

Agree. That's what my above diff is trying to fix. Thanks to point
out.

>
> >              ioapic_service(s);
> > ------------------------
> >
> > I am still looking into guest part codes. Although the above patch
> > should solve the issue, there are still issues in guest codes when
> > IR is enabled:
> >
> > - mismatched "vector" in IOAPIC entry and IRTE entry (this is
> >   required in vt-d spec 5.1.5.1, and required to correctly deliver
> >   EOI broadcast I guess). See intel_irq_remapping_prepare_irte():
>
> "required" is a way of saying that the opposite is undefined.
> No need to think about it in IOMMU.

Why? Without correct vector information, IOAPIC will not be able to
know which entry to clear the Remote IRR bit (please check
ioapic_eoi_broadcast())?

>
> > - I encountered that level-triggered entries in IOAPIC is marked as
> >   edge-triggered interrupt in APIC (which is strange)...
>
> What/where do you mean?
> (The only difference I know of is that level triggered vectors in LAPIC
> have their respective TMR bit set while edge do not.)

Exactly. Here is what I mean:

static void apic_eoi(APICCommonState *s)
{
    int isrv;
    isrv = get_highest_priority_int(s->isr);
    if (isrv < 0)
        return;
    apic_reset_bit(s->isr, isrv);
    if (!(s->spurious_vec & APIC_SV_DIRECTED_IO) && apic_get_bit(s->tmr, isrv)) {
        ioapic_eoi_broadcast(isrv);
    }
    apic_sync_vapic(s, SYNC_FROM_VAPIC | SYNC_TO_VAPIC);
    apic_update_irq(s);
}

APIC will notify IOAPIC only if the corresponding vector in TMR bit
is set (in "apic_get_bit(s->tmr, isrv)", or say, it's a
level-triggered interrupt in APIC registers). What I have traced is
that, the EOI broadcast is missing because this bit is cleared in
APIC TMR while it should be set. I need some more tests to double
confirm this though, in case I made any mistake.

(P.S. Actually I saw some similiar comments in kernel codes around,
please check the long comments in ioapic_ack_level(). Not sure
whether these are related.)

Thanks!

-- peterx

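The TMR gating described in the message above can be illustrated with a small standalone C model. This is only a sketch and not QEMU code: struct lapic_model, its fields and the helper names are hypothetical, and the suppress_broadcast flag merely stands in for the APIC_SV_DIRECTED_IO check in the quoted apic_eoi(). It shows that an EOI for a vector whose TMR bit is clear never reaches the IOAPIC, so Remote IRR would stay set for that entry.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the LAPIC state relevant to EOI handling. */
struct lapic_model {
    uint32_t tmr[8];          /* 256-bit Trigger Mode Register, one bit per vector */
    bool suppress_broadcast;  /* models the APIC_SV_DIRECTED_IO check (assumption) */
};

/* Test one vector's bit in a 256-bit register stored as 8 x 32-bit words. */
static bool lapic_get_bit(const uint32_t *tab, int vector)
{
    assert(vector >= 0 && vector < 256);
    return (tab[vector >> 5] >> (vector & 0x1f)) & 1u;
}

/* Mark a vector as level-triggered in the TMR. */
static void lapic_set_bit(uint32_t *tab, int vector)
{
    assert(vector >= 0 && vector < 256);
    tab[vector >> 5] |= 1u << (vector & 0x1f);
}

/*
 * Returns true if an EOI for this vector would be broadcast to the IOAPIC.
 * The broadcast happens only for vectors the LAPIC recorded as level-triggered.
 */
static bool eoi_reaches_ioapic(const struct lapic_model *s, int vector)
{
    return !s->suppress_broadcast && lapic_get_bit(s->tmr, vector);
}
```

In this model, a level-triggered IOAPIC pin whose vector was delivered to the LAPIC as edge (TMR bit clear) gets no broadcast, which is the symptom traced above.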
2016-04-27 15:29+0800, Peter Xu:
> On Tue, Apr 26, 2016 at 04:19:00PM +0200, Radim Krčmář wrote:
>> 2016-04-26 15:34+0800, Peter Xu:
>> > diff --git a/hw/intc/ioapic.c b/hw/intc/ioapic.c
>> > @@ -281,6 +281,36 @@ ioapic_mem_read(void *opaque, hwaddr addr, unsigned int size)
>> > +/*
>> > + * This is to satisfy the hack in Linux kernel. One hack of it is to
>> > + * simulate clearing the Remote IRR bit of IOAPIC entry using the
>> > + * following:
>> > + *
>> > + * "For IO-APIC's with EOI register, we use that to do an explicit EOI.
>> > + * Otherwise, we simulate the EOI message manually by changing the trigger
>> > + * mode to edge and then back to level, with RTE being masked during
>> > + * this."
>> > + *
>> > + * (See linux kernel __eoi_ioapic_pin() comment in commit c0205701)
>> > + *
>> > + * This is based on the assumption that, Remote IRR bit will be
>> > + * cleared by IOAPIC hardware for edge-triggered interrupts (I
>> > + * believe that's what the IOAPIC version 0x1X hardware does).
>>
>> I thought that Linux doesn't use explicit "EOI" to IO-APIC, but relies
>> on EOI broadcast from LAPIC -- does that change with IR?
>
> IIUC, ioapic_ack_level() should be the one to handle EOI when IR is
> disabled. And, the EOI broadcast should be happening at:
>
>     ack_APIC_irq();
>
> While, after that, we can see some more lines:
>
>     /*
>      * Tail end of clearing remote IRR bit (either by delivering the EOI
>      * message via io-apic EOI register write or simulating it using
>      * mask+edge followed by unnask+level logic) manually when the
>      * level triggered interrupt is seen as the edge triggered interrupt
>      * at the cpu.
>      */
>     if (!(v & (1 << (i & 0x1f)))) {
>         atomic_inc(&irq_mis_count);
>         eoi_ioapic_pin(cfg->vector, irq_data->chip_data);
>     }
>
> What I understand the above is that: first of all, we will do EOI
> broadcast. However, if we found that one level-triggered interrupt
> is treated as edge-triggered interrupt (that is exactly what I have
> encountered below), we will do one more explicit EOI in
> eoi_ioapic_pin(), in which we played the edge-mask/level-unmask
> trick for IOAPIC with version 0x1X.

Indeed, thanks for the explanation.

> For IR enabled case, we just do both without checking (see
> ioapic_ir_ack_level()).

(IR with IO-APIC below version 0x20 probably does not exist in the wild.
I don't find any reason why the interaction would bug, though.)

>> > + */
>> > +static inline void
>> > +ioapic_fix_edge_remote_irr(uint64_t *entry)
>> > +{
>> > +    if (*entry & IOAPIC_LVT_TRIGGER_MODE) {
>> > +        /* Level triggered interrupts, make sure remote IRR is zero */
>> > +        *entry &= ~((uint64_t)IOAPIC_LVT_REMOTE_IRR);
>>
>> (You can just unconditionally zero it, edge doesn't care.)
>
> Ah! I made a mistake. I suppose what I really want is:
>
> +    if (!(*entry & IOAPIC_LVT_TRIGGER_MODE)) {
> +        /* Edge-triggered interrupts, make sure remote IRR is zero */
> +        *entry &= ~((uint64_t)IOAPIC_LVT_REMOTE_IRR);
> +    }
>
> Though both should help do the trick, I should be using this new
> one in v5.

(You'd need to look at the old value for this to work.)

>> > @@ -314,6 +344,7 @@ ioapic_mem_write(void *opaque, hwaddr addr, uint64_t val,
>> >                  s->ioredtbl[index] &= ~0xffffffffULL;
>> >                  s->ioredtbl[index] |= val;
>> >              }
>> > +            ioapic_fix_edge_remote_irr(&s->ioredtbl[index]);
>>
>> I think this can be done only in the else branch of (s->ioregsel & 1).
>
> Yes. I can move it there, but there will be hidden assumption (or
> say, truth...) that these magic bits are inside entry bits 31-0, and
> people might be confused if we do not know that. IMHO, for better
> readability of code, I would still prefer to put it here (it means
> "we need to make sure the entry satisfy some kind of rule, but we do
> not need to know further about what the rule is"). If you still
> insist, I'd like to take your advice though. :)

I don't. If you clear it only on edge->level transition, then those two
also behave the same.

>> > I am still looking into guest part codes. Although the above patch
>> > should solve the issue, there are still issues in guest codes when
>> > IR is enabled:
>> >
>> > - mismatched "vector" in IOAPIC entry and IRTE entry (this is
>> >   required in vt-d spec 5.1.5.1, and required to correctly deliver
>> >   EOI broadcast I guess). See intel_irq_remapping_prepare_irte():
>>
>> "required" is a way of saying that the opposite is undefined.
>> No need to think about it in IOMMU.
>
> Why? Without correct vector information, IOAPIC will not be able to
> know which entry to clear the Remote IRR bit (please check
> ioapic_eoi_broadcast())?

IOAPIC won't get correct EOI and Intel made it into an OS bug, because
there was no good action that the hardware could take. (We have a lot
more freedom, but I think that partially fixing something that doesn't
work on real hardware is a wasted effort.)

Or did you mean that mismatched vector is a possible source of the fixed
bug? (I originally dismissed it, because real hardware works.)

>> > - I encountered that level-triggered entries in IOAPIC is marked as
>> >   edge-triggered interrupt in APIC (which is strange)...
>>
>> What/where do you mean?
>> (The only difference I know of is that level triggered vectors in LAPIC
>> have their respective TMR bit set while edge do not.)
>
> Exactly. Here is what I mean:
>
> static void apic_eoi(APICCommonState *s)
> {
>     int isrv;
>     isrv = get_highest_priority_int(s->isr);
>     if (isrv < 0)
>         return;
>     apic_reset_bit(s->isr, isrv);
>     if (!(s->spurious_vec & APIC_SV_DIRECTED_IO) && apic_get_bit(s->tmr, isrv)) {
>         ioapic_eoi_broadcast(isrv);
>     }
>     apic_sync_vapic(s, SYNC_FROM_VAPIC | SYNC_TO_VAPIC);
>     apic_update_irq(s);
> }
>
> APIC will notify IOAPIC only if the corresponding vector in TMR bit
> is set (in "apic_get_bit(s->tmr, isrv)", or say, it's a
> level-triggered interrupt in APIC registers). What I have traced is
> that, the EOI broadcast is missing because this bit is cleared in
> APIC TMR while it should be set. I need some more tests to double
> confirm this though, in case I made any mistake.

(There are two "legal" situations where TMR can be 0 and IOAPIC sets
remote IRR -- if edge and level interrupts are assigned to the same
vector and if IOAPIC is level while IR and OS edge, both would bug on
real hardware too ...)

Does QEMU bug with TCG?

> (P.S. Actually I saw some similiar comments in kernel codes around,
> please check the long comments in ioapic_ack_level(). Not sure
> whether these are related.)

I hope we didn't emulate the hardware bug. :)

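The remark that the old value must be examined can be made concrete with a short C sketch. It is an illustration under stated assumptions, not the code from the series: the RTE_* macros and the function name are hypothetical, and the bit positions (Remote IRR at bit 14, trigger mode at bit 15) follow the 82093AA redirection table layout. The helper clears Remote IRR only when a write moves the entry from edge to level.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Redirection table entry bits (82093AA layout); local names are assumptions. */
#define RTE_REMOTE_IRR    (1ULL << 14)
#define RTE_TRIGGER_LEVEL (1ULL << 15)

/*
 * Clear Remote IRR only on an edge->level transition.
 * old_entry is the entry before the guest write, *entry the value after it.
 * A level->level rewrite leaves Remote IRR untouched.
 */
static void rte_clear_irr_on_edge_to_level(uint64_t old_entry, uint64_t *entry)
{
    assert(entry != NULL);

    bool was_edge = !(old_entry & RTE_TRIGGER_LEVEL);
    bool is_level = (*entry & RTE_TRIGGER_LEVEL) != 0;

    if (was_edge && is_level) {
        *entry &= ~RTE_REMOTE_IRR;
    }
}
```

Checking only the new value cannot distinguish these two cases, which is why the pre-write entry has to be passed in.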
On Wed, Apr 27, 2016 at 04:31:13PM +0200, Radim Krčmář wrote:

[...]

> >> > I am still looking into guest part codes. Although the above patch
> >> > should solve the issue, there are still issues in guest codes when
> >> > IR is enabled:
> >> >
> >> > - mismatched "vector" in IOAPIC entry and IRTE entry (this is
> >> >   required in vt-d spec 5.1.5.1, and required to correctly deliver
> >> >   EOI broadcast I guess). See intel_irq_remapping_prepare_irte():
> >>
> >> "required" is a way of saying that the opposite is undefined.
> >> No need to think about it in IOMMU.
> >
> > Why? Without correct vector information, IOAPIC will not be able to
> > know which entry to clear the Remote IRR bit (please check
> > ioapic_eoi_broadcast())?
>
> IOAPIC won't get correct EOI and Intel made it into an OS bug, because
> there was no good action that the hardware could take. (We have a lot
> more freedom, but I think that partially fixing something that doesn't
> work on real hardware is a wasted effort.)

To make sure I understand this correctly: do you mean that real
IOAPIC hardware will not handle this EOI broadcast correctly even if
we fill in a vector in the IOAPIC entry that matches the IRTE one
(when IR is enabled)? I'd appreciate it if there is any link or
anything that can provide me more background on this matter. TIA.

>
> Or did you mean that mismatched vector is a possible source of the fixed
> bug? (I originally dismissed it, because real hardware works.)

Nope. The above patch fixes the hack for "explicit IOAPIC EOI", and I
suppose the mismatched vector issue will cause the "EOI broadcast"
problem. But IIUC from your above comment, we can temporarily skip
this "issue" for now, if it won't work even on real hardware and
even when vectors are matched.

Anyway, as long as the explicit EOI works, we can survive. And this
gives me the reason to send v5 first.

>
> >> > - I encountered that level-triggered entries in IOAPIC is marked as
> >> >   edge-triggered interrupt in APIC (which is strange)...
> >>
> >> What/where do you mean?
> >> (The only difference I know of is that level triggered vectors in LAPIC
> >> have their respective TMR bit set while edge do not.)
> >
> > Exactly. Here is what I mean:
> >
> > static void apic_eoi(APICCommonState *s)
> > {
> >     int isrv;
> >     isrv = get_highest_priority_int(s->isr);
> >     if (isrv < 0)
> >         return;
> >     apic_reset_bit(s->isr, isrv);
> >     if (!(s->spurious_vec & APIC_SV_DIRECTED_IO) && apic_get_bit(s->tmr, isrv)) {
> >         ioapic_eoi_broadcast(isrv);
> >     }
> >     apic_sync_vapic(s, SYNC_FROM_VAPIC | SYNC_TO_VAPIC);
> >     apic_update_irq(s);
> > }
> >
> > APIC will notify IOAPIC only if the corresponding vector in TMR bit
> > is set (in "apic_get_bit(s->tmr, isrv)", or say, it's a
> > level-triggered interrupt in APIC registers). What I have traced is
> > that, the EOI broadcast is missing because this bit is cleared in
> > APIC TMR while it should be set. I need some more tests to double
> > confirm this though, in case I made any mistake.
>
> (There are two "legal" situations where TMR can be 0 and IOAPIC sets
> remote IRR -- if edge and level interrupts are assigned to the same
> vector and if IOAPIC is level while IR and OS edge, both would bug on
> real hardware too ...)
>
> Does QEMU bug with TCG?

Gave it a shot today. It happens as well.

Thanks,

-- peterx

On Wed, Apr 27, 2016 at 04:31:13PM +0200, Radim Krčmář wrote:
> >> > + */
> >> > +static inline void
> >> > +ioapic_fix_edge_remote_irr(uint64_t *entry)
> >> > +{
> >> > +    if (*entry & IOAPIC_LVT_TRIGGER_MODE) {
> >> > +        /* Level triggered interrupts, make sure remote IRR is zero */
> >> > +        *entry &= ~((uint64_t)IOAPIC_LVT_REMOTE_IRR);
> >>
> >> (You can just unconditionally zero it, edge doesn't care.)
> >
> > Ah! I made a mistake. I suppose what I really want is:
> >
> > +    if (!(*entry & IOAPIC_LVT_TRIGGER_MODE)) {
> > +        /* Edge-triggered interrupts, make sure remote IRR is zero */
> > +        *entry &= ~((uint64_t)IOAPIC_LVT_REMOTE_IRR);
> > +    }
> >
> > Though both should help do the trick, I should be using this new
> > one in v5.
>
> (You'd need to look at the old value for this to work.)

Yes, you are right. The problem is that the guest actually has RW
permission for the Remote IRR bit in the emulated IOAPIC. Given that,
I'd rather take the original version and unconditionally zero it, as
you advised (I will also fix up the comments so that they match).

-- peterx
On Thu, Apr 28, 2016 at 02:06:17PM +0800, Peter Xu wrote:
> On Wed, Apr 27, 2016 at 04:31:13PM +0200, Radim Krčmář wrote:
> > >> > + */
> > >> > +static inline void
> > >> > +ioapic_fix_edge_remote_irr(uint64_t *entry)
> > >> > +{
> > >> > +    if (*entry & IOAPIC_LVT_TRIGGER_MODE) {
> > >> > +        /* Level triggered interrupts, make sure remote IRR is zero */
> > >> > +        *entry &= ~((uint64_t)IOAPIC_LVT_REMOTE_IRR);
> > >>
> > >> (You can just unconditionally zero it, edge doesn't care.)
> > >
> > > Ah! I made a mistake. I suppose what I really want is:
> > >
> > > +    if (!(*entry & IOAPIC_LVT_TRIGGER_MODE)) {
> > > +        /* Edge-triggered interrupts, make sure remote IRR is zero */
> > > +        *entry &= ~((uint64_t)IOAPIC_LVT_REMOTE_IRR);
> > > +    }
> > >
> > > Though both should help do the trick, I should be using this new
> > > one in v5.
> >
> > (You'd need to look at the old value for this to work.)
>
> Yes, you are right. The problem is that, we actually have RW
> permission for remote IRR bit in emulated IOAPIC. If so, I'd rather
> take the original version, and unconditionally zero it, as you have
> advised (also, will fix up the comments to get them aligned).

On second thought, a better idea (though it may need several more
lines of code) is to make sure the RO bits in the IOAPIC entry are
read-only (I mean, "real" read-only) before applying the above hack.
I suppose this matches real hardware behavior more closely. Let me
send v5 directly so that you can see the code.

Thanks,

-- peterx
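The idea in the last paragraph (keep the RO bits truly read-only, then clear Remote IRR for edge-triggered entries) could look roughly like the sketch below. This is an illustration under assumptions, not the v5 patch: all SK_* macros and both functions are hypothetical names, and the bit positions (12 = Delivery Status, 14 = Remote IRR, 15 = Trigger Mode, 16 = Mask) follow the standard IOAPIC redirection-table layout.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names; bit positions per the IOAPIC redirection entry. */
#define SK_DELIV_STATUS  (UINT64_C(1) << 12)  /* read-only */
#define SK_REMOTE_IRR    (UINT64_C(1) << 14)  /* read-only */
#define SK_TRIGGER_LEVEL (UINT64_C(1) << 15)  /* 1 = level, 0 = edge */
#define SK_MASKED        (UINT64_C(1) << 16)
#define SK_RO_BITS       (SK_DELIV_STATUS | SK_REMOTE_IRR)

/* Applies a guest write `val` on top of the current entry `old`. */
static uint64_t sketch_entry_write(uint64_t old, uint64_t val)
{
    /* The guest may only change RW bits; RO bits keep their old value. */
    uint64_t entry = (val & ~SK_RO_BITS) | (old & SK_RO_BITS);

    /* Remote IRR is meaningless for edge-triggered entries: drop it. */
    if (!(entry & SK_TRIGGER_LEVEL)) {
        entry &= ~SK_REMOTE_IRR;
    }
    return entry;
}

/*
 * Replays the guest sequence quoted in the patch comment: switch a
 * level-triggered entry to edge while masked, then restore it.
 */
static uint64_t sketch_replay_kernel_hack(uint64_t orig)
{
    assert(orig & SK_TRIGGER_LEVEL);
    uint64_t cur = sketch_entry_write(orig,
                                      (orig | SK_MASKED) & ~SK_TRIGGER_LEVEL);
    return sketch_entry_write(cur, orig);
}
```

In this sketch a direct guest write cannot clear Remote IRR on a level-triggered entry, while the edge-then-level sequence does clear it, because the decision is made on the entry after the write has been merged with the old RO bits.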
2016-04-28 13:27+0800, Peter Xu:
> On Wed, Apr 27, 2016 at 04:31:13PM +0200, Radim Krčmář wrote:
>
> [...]
>
>> >> > I am still looking into guest part codes. Although the above patch
>> >> > should solve the issue, there are still issues in guest codes when
>> >> > IR is enabled:
>> >> >
>> >> > - mismatched "vector" in IOAPIC entry and IRTE entry (this is
>> >> >   required in vt-d spec 5.1.5.1, and required to correctly deliver
>> >> >   EOI broadcast I guess). See intel_irq_remapping_prepare_irte():
>> >>
>> >> "required" is a way of saying that the opposite is undefined.
>> >> No need to think about it in IOMMU.
>> >
>> > Why? Without correct vector information, IOAPIC will not be able to
>> > know which entry to clear the Remote IRR bit (please check
>> > ioapic_eoi_broadcast())?
>>
>> IOAPIC won't get correct EOI and Intel made it into an OS bug, because
>> there was no good action that the hardware could take. (We have a lot
>> more freedom, but I think that partially fixing something that doesn't
>> work on real hardware is a wasted effort.)
>
> To make sure I understand this correctly... Do you mean that real
> IOAPIC hardware will not handle this EOI broadcast correctly even if
> we fill in matched vector in the IOAPIC entry with IRTE one (when IR
> is enabled)?

No, if the OS configures the same vector in IR and IOAPIC, then EOI
broadcast will work just fine. My point was that the OS *must* do it
that way. If the OS doesn't, then the hardware's behavior is undefined,
i.e. everything that happens is correct. QEMU/KVM just shouldn't bug.
I think that QEMU even behaves pretty much like real hardware here, so
doing nothing now is the best choice.

> I'd appreciate if there is any link or anything that can provide me
> more background on this matter.. TIA.

Hm, I only read the specs ...
LAPIC EOI broadcast doesn't distinguish whether IOAPIC or IR injected
the interrupt and notifies IOAPICs with the vector in ISR.
The vector doesn't provide enough information for a unique mapping
between IOAPIC and IR entries, so IOAPIC just clears the Remote IRR bits
of entries with that vector. There is no nice solution if you allow
different vectors, so the hardware doesn't.

>> Or did you mean that mismatched vector is a possible source of the fixed
>> bug? (I originally dismissed it, because real hardware works.)
>
> Nop. The above patch fixes the hack for "explicit IOAPIC EOI", and I
> suppose mismatched vector issue will cause "EOI broadcast" problem.
> But IIUC from your above comment, we can temporarily skip this
> "issue" for now, if it won't work even on real hardwares and even
> vectors are matched.
>
> Anyway, as long as the explicit EOI works, we can survive. And this
> gives me the reason to send v5 first.

Yep. EOI broadcast has to work in some cases, though; I'm sorry if
I said the opposite.
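The vector-only matching described above can be sketched as follows. This is a hedged illustration, not the actual ioapic_eoi_broadcast() from QEMU: the SK_* macros and sketch_eoi_broadcast() are hypothetical names, and the 24-pin table size and bit positions (vector in bits 0-7, Remote IRR in bit 14, Trigger Mode in bit 15) are assumptions taken from the standard IOAPIC redirection-table layout.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names; layout per the IOAPIC redirection entry. */
#define SK_NUM_PINS      24
#define SK_VECTOR_MASK   UINT64_C(0xff)
#define SK_REMOTE_IRR    (UINT64_C(1) << 14)
#define SK_TRIGGER_LEVEL (UINT64_C(1) << 15)

/*
 * Handles an EOI broadcast carrying `vector`: clears Remote IRR on
 * every level-triggered entry programmed with that vector. Returns
 * the number of entries that were cleared.
 */
static int sketch_eoi_broadcast(uint64_t *redtbl, int vector)
{
    int cleared = 0;

    assert(vector >= 0 && vector <= 0xff);
    for (int i = 0; i < SK_NUM_PINS; i++) {
        uint64_t e = redtbl[i];

        if ((e & SK_VECTOR_MASK) == (uint64_t)vector &&
            (e & SK_TRIGGER_LEVEL) && (e & SK_REMOTE_IRR)) {
            redtbl[i] = e & ~SK_REMOTE_IRR;
            cleared++;
        }
    }
    return cleared;
}
```

If the IRTE delivers a different vector than the one stored in the IOAPIC entry, the broadcast carries the IRTE vector, nothing in the table matches, and Remote IRR stays set. That is why the OS has to program the same vector in both places.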
diff --git a/hw/intc/ioapic.c b/hw/intc/ioapic.c
index b41ab89..de6a8cf 100644
--- a/hw/intc/ioapic.c
+++ b/hw/intc/ioapic.c
@@ -281,6 +281,36 @@ ioapic_mem_read(void *opaque, hwaddr addr, unsigned int size)
     return val;
 }
 
+/*
+ * This is to satisfy the hack in Linux kernel. One hack of it is to
+ * simulate clearing the Remote IRR bit of IOAPIC entry using the
+ * following:
+ *
+ * "For IO-APIC's with EOI register, we use that to do an explicit EOI.
+ * Otherwise, we simulate the EOI message manually by changing the trigger
+ * mode to edge and then back to level, with RTE being masked during
+ * this."
+ *
+ * (See linux kernel __eoi_ioapic_pin() comment in commit c0205701)
+ *
+ * This is based on the assumption that, Remote IRR bit will be
+ * cleared by IOAPIC hardware for edge-triggered interrupts (I
+ * believe that's what the IOAPIC version 0x1X hardware does). So
+ * if we are emulating it, we'd better do it the same here, so that
+ * the guest kernel hack will work as well on QEMU.
+ *
+ * Without this, level-triggered interrupts in IR mode might fail to
+ * work correctly.
+ */
+static inline void
+ioapic_fix_edge_remote_irr(uint64_t *entry)
+{
+    if (*entry & IOAPIC_LVT_TRIGGER_MODE) {
+        /* Level triggered interrupts, make sure remote IRR is zero */
+        *entry &= ~((uint64_t)IOAPIC_LVT_REMOTE_IRR);
+    }
+}
+
 static void
 ioapic_mem_write(void *opaque, hwaddr addr, uint64_t val,
                  unsigned int size)
@@ -314,6 +344,7 @@ ioapic_mem_write(void *opaque, hwaddr addr, uint64_t val,
                 s->ioredtbl[index] &= ~0xffffffffULL;
                 s->ioredtbl[index] |= val;
             }
+            ioapic_fix_edge_remote_irr(&s->ioredtbl[index]);
             ioapic_service(s);
         }
     }