Message ID: 20170215025243.GA3988@pxdev.xzpeter.org
State: New
On Tue, Feb 14, 2017 at 9:52 PM, Peter Xu <peterx@redhat.com> wrote:
> On Tue, Feb 14, 2017 at 07:50:39AM -0500, Jintack Lim wrote:
>
> [...]
>
> > > > > > I misunderstood what you said?
> > > > >
> > > > > I failed to understand why a vIOMMU could help boost performance. :(
> > > > > Could you provide your command line here so that I can try to
> > > > > reproduce?
> > > >
> > > > Sure. This is the command line to launch the L1 VM:
> > > >
> > > > qemu-system-x86_64 -M q35,accel=kvm,kernel-irqchip=split \
> > > >  -m 12G -device intel-iommu,intremap=on,eim=off,caching-mode=on \
> > > >  -drive file=/mydata/guest0.img,format=raw --nographic -cpu host \
> > > >  -smp 4,sockets=4,cores=1,threads=1 \
> > > >  -device vfio-pci,host=08:00.0,id=net0
> > > >
> > > > And this is for the L2 VM:
> > > >
> > > > ./qemu-system-x86_64 -M q35,accel=kvm \
> > > >  -m 8G \
> > > >  -drive file=/vm/l2guest.img,format=raw --nographic -cpu host \
> > > >  -device vfio-pci,host=00:03.0,id=net0
> > >
> > > ... these look like the command lines for the L1/L2 guests, rather
> > > than for the L1 guest with/without vIOMMU?
> >
> > That's right. I thought you were asking about the command lines for
> > the L1/L2 guests :(.
> > I think I caused the confusion; as I said above, I didn't mean to
> > talk about the performance of the L1 guest with/without vIOMMU.
> > We can move on!
>
> I see. Sure! :-)
>
> [...]
>
> > > Then, I *think* the above assertion you encountered would fail only
> > > if prev == 0 here, but I'm still not quite sure why that was
> > > happening. Btw, could you paste me your "lspci -vvv -s 00:03.0"
> > > result from your L1 guest?
> >
> > Sure. This is from my L1 guest.
>
> Hmm... I think I found the problem...
>
> > root@guest0:~# lspci -vvv -s 00:03.0
> > 00:03.0 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3]
> > 	Subsystem: Mellanox Technologies Device 0050
> > 	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx+
> > 	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
> > 	Latency: 0, Cache Line Size: 64 bytes
> > 	Interrupt: pin A routed to IRQ 23
> > 	Region 0: Memory at fe900000 (64-bit, non-prefetchable) [size=1M]
> > 	Region 2: Memory at fe000000 (64-bit, prefetchable) [size=8M]
> > 	Expansion ROM at fea00000 [disabled] [size=1M]
> > 	Capabilities: [40] Power Management version 3
> > 		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
> > 		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
> > 	Capabilities: [48] Vital Product Data
> > 		Product Name: CX354A - ConnectX-3 QSFP
> > 		Read-only fields:
> > 			[PN] Part number: MCX354A-FCBT
> > 			[EC] Engineering changes: A4
> > 			[SN] Serial number: MT1346X00791
> > 			[V0] Vendor specific: PCIe Gen3 x8
> > 			[RV] Reserved: checksum good, 0 byte(s) reserved
> > 		Read/write fields:
> > 			[V1] Vendor specific: N/A
> > 			[YA] Asset tag: N/A
> > 			[RW] Read-write area: 105 byte(s) free
> > 			[RW] Read-write area: 253 byte(s) free
> > 			[RW] Read-write area: 253 byte(s) free
> > 			[RW] Read-write area: 253 byte(s) free
> > 			[RW] Read-write area: 253 byte(s) free
> > 			[RW] Read-write area: 253 byte(s) free
> > 			[RW] Read-write area: 253 byte(s) free
> > 			[RW] Read-write area: 253 byte(s) free
> > 			[RW] Read-write area: 253 byte(s) free
> > 			[RW] Read-write area: 253 byte(s) free
> > 			[RW] Read-write area: 253 byte(s) free
> > 			[RW] Read-write area: 253 byte(s) free
> > 			[RW] Read-write area: 253 byte(s) free
> > 			[RW] Read-write area: 253 byte(s) free
> > 			[RW] Read-write area: 253 byte(s) free
> > 			[RW] Read-write area: 252 byte(s) free
> > 		End
> > 	Capabilities: [9c] MSI-X: Enable+ Count=128 Masked-
> > 		Vector table: BAR=0 offset=0007c000
> > 		PBA: BAR=0 offset=0007d000
> > 	Capabilities: [60] Express (v2) Root Complex Integrated Endpoint, MSI 00
> > 		DevCap:	MaxPayload 256 bytes, PhantFunc 0
> > 			ExtTag- RBE+
> > 		DevCtl:	Report errors: Correctable- Non-Fatal+ Fatal+ Unsupported+
> > 			RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
> > 			MaxPayload 256 bytes, MaxReadReq 4096 bytes
> > 		DevSta:	CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
> > 		DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR-, OBFF Not Supported
> > 		DevCtl2: Completion Timeout: 65ms to 210ms, TimeoutDis-, LTR-, OBFF Disabled
> > 	Capabilities: [100 v0] #00
>
> Here we have the head of the extended capability list with cap_id == 0.
> When we boot the L2 guest with the same device, we first copy this
> cap_id == 0 cap; then, when adding the 2nd ecap, we'll probably hit the
> problem, since pcie_find_capability_list() will think there is no cap
> at all (cap_id == 0 is skipped).
>
> Do you want to try this "hacky patch" to see whether it works for you?

Thanks for following this up!

I just tried this, and I got a different message this time:

qemu-system-x86_64: vfio: Cannot reset device 0000:00:03.0, no available
reset mechanism.
qemu-system-x86_64: vfio: Cannot reset device 0000:00:03.0, no available
reset mechanism.

Thanks,
Jintack

> ------8<-------
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index 332f41d..bacd302 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -1925,11 +1925,6 @@ static void vfio_add_ext_cap(VFIOPCIDevice *vdev)
>  
>      }
>  
> -    /* Cleanup chain head ID if necessary */
> -    if (pci_get_word(pdev->config + PCI_CONFIG_SPACE_SIZE) == 0xFFFF) {
> -        pci_set_word(pdev->config + PCI_CONFIG_SPACE_SIZE, 0);
> -    }
> -
>      g_free(config);
>      return;
>  }
> ------>8-------
>
> I don't think it's a good solution (it just uses 0xffff instead of 0x0
> for the masked cap_id, so the L2 guest has to cope with that), but it
> should work around the problem temporarily. I'll try to think of a
> better one later and post it when it's ready.
>
> (Alex, please leave a comment if you have any better suggestion before
> mine :)
>
> Thanks,
>
> -- peterx
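To make the diagnosis above concrete, here is a minimal, self-contained
sketch of the extended-capability walk Peter is describing. It is modeled on
QEMU's pcie_find_capability_list(), but the find_ext_cap() and cfg_read32()
names and the local constants belong to this sketch, not to QEMU's actual
code:

#include <assert.h>
#include <stdint.h>
#include <string.h>

#define PCI_CONFIG_SPACE_SIZE  0x100

/* Extended capability header layout: bits 15:0 = capability ID,
 * bits 19:16 = version, bits 31:20 = next pointer (dword aligned). */
#define EXT_CAP_ID(header)    ((header) & 0xffff)
#define EXT_CAP_NEXT(header)  (((header) >> 20) & 0xffc)

static uint32_t cfg_read32(const uint8_t *config, uint16_t off)
{
    uint32_t val;

    memcpy(&val, config + off, sizeof(val));
    return val;
}

/* Find cap_id in the extended capability chain, recording the offset of
 * the preceding capability in *prev_p.  The crucial detail: an all-zero
 * header dword at 0x100 -- exactly what the "[100 v0] #00" entry in the
 * L1 lspci dump above amounts to -- reads as "no extended capabilities
 * at all", so prev stays 0 and an append path that asserts
 * prev >= PCI_CONFIG_SPACE_SIZE will trip. */
static uint16_t find_ext_cap(const uint8_t *config, uint16_t cap_id,
                             uint16_t *prev_p)
{
    uint16_t prev = 0, next;
    uint32_t header = cfg_read32(config, PCI_CONFIG_SPACE_SIZE);

    if (!header) {
        next = 0;                       /* chain treated as empty */
        goto out;
    }
    for (next = PCI_CONFIG_SPACE_SIZE; next;
         prev = next, next = EXT_CAP_NEXT(header)) {
        header = cfg_read32(config, next);
        if (EXT_CAP_ID(header) == cap_id) {
            break;
        }
    }
out:
    if (prev_p) {
        *prev_p = prev;
    }
    return next;
}

int main(void)
{
    uint8_t config[0x1000] = { 0 };     /* header at 0x100 left all-zero */
    uint16_t prev = 0xdead;

    find_ext_cap(config, 0xffff, &prev);
    assert(prev == 0);  /* appending a 2nd ecap behind prev == 0 is what
                           fails in the L2 case discussed above */
    return 0;
}

With the chain head's masked ID left as 0xffff instead (which is what the
hacky patch does), the header dword is nonzero, the walk proceeds, and prev
ends up >= 0x100, so the append no longer trips.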
On Wed, 15 Feb 2017 17:05:35 -0500
Jintack Lim <jintack@cs.columbia.edu> wrote:

> [...]
>
> Thanks for following this up!
>
> I just tried this, and I got a different message this time:
>
> qemu-system-x86_64: vfio: Cannot reset device 0000:00:03.0, no available
> reset mechanism.
> qemu-system-x86_64: vfio: Cannot reset device 0000:00:03.0, no available
> reset mechanism.
Possibly very true. It might affect the reliability of the device in
the L2 guest, but it shouldn't prevent the device from being assigned.
What's the reset mechanism on the physical device? (lspci -vvv from the
host, please.)

Thanks,
Alex
On Wed, Feb 15, 2017 at 5:50 PM, Alex Williamson
<alex.williamson@redhat.com> wrote:

> [...]
>
> Possibly very true. It might affect the reliability of the device in
> the L2 guest, but it shouldn't prevent the device from being assigned.
> What's the reset mechanism on the physical device? (lspci -vvv from the
> host, please.)

Thanks, Alex.
This is from the host (L0):

08:00.0 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3]
	Subsystem: Mellanox Technologies Device 0050
	Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Interrupt: pin A routed to IRQ 31
	Region 0: Memory at d9f00000 (64-bit, non-prefetchable) [disabled] [size=1M]
	Region 2: Memory at d5000000 (64-bit, prefetchable) [disabled] [size=8M]
	Expansion ROM at d9000000 [disabled] [size=1M]
	Capabilities: [40] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D3 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [48] Vital Product Data
		Product Name: CX354A - ConnectX-3 QSFP
		Read-only fields:
			[PN] Part number: MCX354A-FCBT
			[EC] Engineering changes: A4
			[SN] Serial number: MT1346X00624
			[V0] Vendor specific: PCIe Gen3 x8
			[RV] Reserved: checksum good, 0 byte(s) reserved
		Read/write fields:
			[V1] Vendor specific: N/A
			[YA] Asset tag: N/A
			[RW] Read-write area: 105 byte(s) free
			[RW] Read-write area: 253 byte(s) free
			[RW] Read-write area: 253 byte(s) free
			[RW] Read-write area: 253 byte(s) free
			[RW] Read-write area: 253 byte(s) free
			[RW] Read-write area: 253 byte(s) free
			[RW] Read-write area: 253 byte(s) free
			[RW] Read-write area: 253 byte(s) free
			[RW] Read-write area: 253 byte(s) free
			[RW] Read-write area: 253 byte(s) free
			[RW] Read-write area: 253 byte(s) free
			[RW] Read-write area: 253 byte(s) free
			[RW] Read-write area: 253 byte(s) free
			[RW] Read-write area: 253 byte(s) free
			[RW] Read-write area: 252 byte(s) free
		End
	Capabilities: [9c] MSI-X: Enable- Count=128 Masked-
		Vector table: BAR=0 offset=0007c000
		PBA: BAR=0 offset=0007d000
	Capabilities: [60] Express (v2) Endpoint, MSI 00
		DevCap:	MaxPayload 512 bytes, PhantFunc 0, Latency L0s <64ns, L1 unlimited
			ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
		DevCtl:	Report errors: Correctable- Non-Fatal+ Fatal+ Unsupported+
			RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
			MaxPayload 256 bytes, MaxReadReq 4096 bytes
		DevSta:	CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
		LnkCap:	Port #8, Speed 8GT/s, Width x8, ASPM L0s, Exit Latency L0s unlimited, L1 unlimited
			ClockPM- Surprise- LLActRep- BwNot-
		LnkCtl:	ASPM Disabled; RCB 64 bytes Disabled- CommClk+
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 8GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
		DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR-, OBFF Not Supported
		DevCtl2: Completion Timeout: 65ms to 210ms, TimeoutDis-, LTR-, OBFF Disabled
		LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-
			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			 Compliance De-emphasis: -6dB
		LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+, EqualizationPhase1+
			 EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
	Capabilities: [c0] Vendor Specific Information: Len=18 <?>
	Capabilities: [100 v1] Alternative Routing-ID Interpretation (ARI)
		ARICap:	MFVC- ACS-, Next Function: 0
		ARICtl:	MFVC- ACS-, Function Group: 0
	Capabilities: [148 v1] Device Serial Number f4-52-14-03-00-15-51-10
	Capabilities: [154 v2] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt+ UnxCmplt+ RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UESvrt:	DLP+ SDES- TLP+ FCP+ CmpltTO+ CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC+ UnsupReq- ACSViol-
		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
		CEMsk:	RxErr+ BadTLP+ BadDLLP+ Rollover+ Timeout+ NonFatalErr+
		AERCap:	First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
	Capabilities: [18c v1] #19
	Kernel driver in use: vfio-pci

Thanks,
On Wed, 15 Feb 2017 18:25:26 -0500
Jintack Lim <jintack@cs.columbia.edu> wrote:

> [...]
>
> This is from the host (L0):
>
> 08:00.0 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3]
>
> [...]
>
> 	Capabilities: [40] Power Management version 3
> 		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
> 		Status: D3 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-

Does not support reset on the D3->D0 transition.

> 	Capabilities: [60] Express (v2) Endpoint, MSI 00
> 		DevCap:	MaxPayload 512 bytes, PhantFunc 0, Latency L0s <64ns, L1 unlimited
> 			ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset-

Does not support PCIe FLR. No AF capability.

Looks right to me: the only reset mechanism available to the host is a
bus reset, which isn't available to the VM. If you were to configure
the device downstream of a root port, the VM might think it could
reset the device, but I'm pretty sure it cannot.

Thanks,
Alex
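For reference, the two config-space checks Alex just did by eye can be
written down mechanically. In this sketch the macro and function names are
local inventions; the bit masks are the PMCSR NoSoftRst bit and the PCIe
Device Capabilities FLR bit, and the values fed to report_reset() are
transcribed from the host dump quoted above:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define PM_CTRL_NO_SOFT_RESET  0x0008      /* PMCSR NoSoftRst bit */
#define EXP_DEVCAP_FLR         0x10000000  /* PCIe DevCap FLReset bit */

/* A function-level reset needs either a PM reset (NoSoftRst clear, so
 * the D3hot->D0 transition resets the device) or FLR.  With neither --
 * and no AF FLR capability on this device -- only a secondary bus
 * reset remains, which is host-side only. */
static void report_reset(uint16_t pm_ctrl, uint32_t devcap)
{
    bool pm_reset = !(pm_ctrl & PM_CTRL_NO_SOFT_RESET);
    bool flr = (devcap & EXP_DEVCAP_FLR) != 0;

    printf("PM reset (D3hot->D0): %s\n", pm_reset ? "yes" : "no");
    printf("PCIe FLR:             %s\n", flr ? "yes" : "no");
    if (!pm_reset && !flr) {
        printf("no function-level reset; bus reset only\n");
    }
}

int main(void)
{
    /* "Status: D3 NoSoftRst+" => NoSoftRst set; "DevCap: ... FLReset-"
     * => FLR bit clear (other register bits omitted for illustration). */
    report_reset(0x0008, 0x00000000);
    return 0;
}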
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 332f41d..bacd302 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -1925,11 +1925,6 @@ static void vfio_add_ext_cap(VFIOPCIDevice *vdev)
 
     }
 
-    /* Cleanup chain head ID if necessary */
-    if (pci_get_word(pdev->config + PCI_CONFIG_SPACE_SIZE) == 0xFFFF) {
-        pci_set_word(pdev->config + PCI_CONFIG_SPACE_SIZE, 0);
-    }
-
     g_free(config);
     return;
 }
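A tiny illustration of what the patch above changes (the surrounding layout
is assumed for illustration, and it presumes a little-endian host, matching
PCI config-space byte order): once the hidden chain head keeps ID 0xffff
instead of being zeroed, the header dword at offset 0x100 can never read as
all zeroes, so an ecap walk in the L2 QEMU still sees a non-empty chain it
can append behind.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define PCI_CONFIG_SPACE_SIZE  0x100

int main(void)
{
    uint8_t config[0x1000] = { 0 };
    uint16_t masked_id = 0xffff;   /* what the patched code leaves behind */
    uint32_t header;

    memcpy(config + PCI_CONFIG_SPACE_SIZE, &masked_id, sizeof(masked_id));
    memcpy(&header, config + PCI_CONFIG_SPACE_SIZE, sizeof(header));

    printf("ecap header @ 0x100 = 0x%08x -> %s\n", (unsigned)header,
           header ? "chain present, safe to append"
                  : "chain looks empty, append asserts");
    return 0;
}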