Message ID | 20191217210658.73144-1-leonardo@linux.ibm.com (mailing list archive) |
---|---|
State | Changes Requested |
Series | [1/1] kvm/book3s_64: Fixes crash caused by not cleaning vhost IOTLB |
Context | Check | Description |
---|---|---|
snowpatch_ozlabs/apply_patch | success | Successfully applied on branch powerpc/merge (270c0c3e491684893e7250f6c32f4f2eb2e4c3b2) |
snowpatch_ozlabs/build-ppc64le | success | Build succeeded |
snowpatch_ozlabs/build-ppc64be | success | Build succeeded |
snowpatch_ozlabs/build-ppc64e | success | Build succeeded |
snowpatch_ozlabs/build-pmac32 | success | Build succeeded |
snowpatch_ozlabs/checkpatch | success | total: 0 errors, 0 warnings, 0 checks, 9 lines checked |
On 18/12/2019 08:06, Leonardo Bras wrote:
> Fixes a bug that happens when a virtual machine is created without DDW,
> with vhost supporting a virtio-net device.
>
> In this scenario, an IOMMU with a 32-bit DMA window will possibly map
> IOVAs to different memory addresses.
>
> As the code works today, the H_STUFF_TCE hypercall is handled only by
> KVM code, which does not invalidate the IOTLB entry in vhost, meaning
> that at some point an old entry can cause an access to a previous
> memory address that the IOVA pointed to.
>
> Example:
> - virtio-net passes IOVA N to vhost, which points to M1
> - vhost tries the IOTLB, but misses
> - vhost translates IOVA N and stores the result in the IOTLB
> - vhost writes to M1
> - (some IOMMU usage)
> - virtio-net passes IOVA N to vhost, which now points to M2
> - vhost tries the IOTLB, and translates IOVA N to M1
> - vhost writes to M1 <error, should write to M2>
>
> The reason this error was not so evident is probably that the IOTLB
> was small enough to almost always miss by the time an IOVA was reused.
> Raising the IOTLB size to 32k (a module parameter that defaults to 2k)
> is enough to reproduce the bug in more than 90% of the runs. It usually
> takes less than 10 seconds of netperf to trigger it.
>
> A few minutes after reproducing this bug, the guest usually crashes.
>
> Fixing this bug involves removing an IOVA entry from the IOTLB. The
> guest kernel triggers this by doing an H_STUFF_TCE hypercall with
> tce_value == 0.
>
> This change fixes the bug by returning H_TOO_HARD from kvmppc_h_stuff_tce
> when tce_value == 0, which causes KVM to let QEMU deal with the hypercall.
> In that case, QEMU does free the vhost IOTLB entry, which fixes the bug.
>
> Signed-off-by: Leonardo Bras <leonardo@linux.ibm.com>
> ---
>  arch/powerpc/kvm/book3s_64_vio.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
> index 883a66e76638..841eff3f6392 100644
> --- a/arch/powerpc/kvm/book3s_64_vio.c
> +++ b/arch/powerpc/kvm/book3s_64_vio.c
> @@ -710,6 +710,9 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
> 	if (ret != H_SUCCESS)
> 		return ret;
>
> +	if (tce_value == 0)

H_STUFF_TCE is always called with 0. Well, maybe some AIX somewhere calls it with a value other than zero, and I probably saw some other value somewhere, but in the QEMU/KVM case it is 0, so you effectively disable in-kernel acceleration of H_STUFF_TCE, which is undesirable.

For now we should disable the in-kernel H_STUFF_TCE/... handlers in QEMU, just like we do for VFIO on older host kernels:

https://git.qemu.org/?p=qemu.git;a=blob;f=hw/ppc/spapr_iommu.c;h=3d3bcc86496a5277d62f7855fbb09c013c015f27;hb=HEAD#l208

I am not sure what a proper solution would be; something like an eventfd, with KVM's kvmppc_h_stuff_tce() signalling vhost that the latter needs to invalidate IOTLBs. Or we can just say that we do not allow KVM acceleration if there is vhost+IOMMU on the same liobn (== vPHB, pretty much).

Thanks,

> +		return H_TOO_HARD;
> +
> 	/* Check permission bits only to allow userspace poison TCE for debug */
> 	if (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ))
> 		return H_PARAMETER;
On Wed, 2019-12-18 at 15:53 +1100, Alexey Kardashevskiy wrote:
> H_STUFF_TCE is always called with 0. Well, maybe some AIX somewhere
> calls it with a value other than zero, and I probably saw some other
> value somewhere, but in the QEMU/KVM case it is 0, so you effectively
> disable in-kernel acceleration of H_STUFF_TCE, which is undesirable.

Thanks for the feedback!

> For now we should disable the in-kernel H_STUFF_TCE/... handlers in QEMU,
> just like we do for VFIO on older host kernels:
>
> https://git.qemu.org/?p=qemu.git;a=blob;f=hw/ppc/spapr_iommu.c;h=3d3bcc86496a5277d62f7855fbb09c013c015f27;hb=HEAD#l208

I am still reading into this temporary solution; I could not yet understand how it works.

> I am not sure what a proper solution would be; something like an eventfd,
> with KVM's kvmppc_h_stuff_tce() signalling vhost that the latter needs to
> invalidate IOTLBs. Or we can just say that we do not allow KVM
> acceleration if there is vhost+IOMMU on the same liobn (== vPHB, pretty
> much). Thanks,

I am not used to eventfd, but I agree it's a valid solution to talk to QEMU and then use it to send a message via /dev/vhost:
KVM -> QEMU -> vhost

But I can't get my mind off another solution: doing it in kernelspace. I am not sure how that would work, though.

If I understood correctly, there is one vhost IOTLB per vhost_dev, and H_STUFF_TCE is not called in the 64-bit DMA case (for tce_value == 0, at least), which makes sense, given that it doesn't need to invalidate entries in the IOTLB.

So, we would need to somehow replace `return H_TOO_HARD` in this patch with code that could call vhost_process_iotlb_msg() with VHOST_IOTLB_INVALIDATE.

For that, I would need to know the vhost_dev's of the calling process, which I don't know if it's currently possible (or safe at all) to do.

I am thinking of linking all vhost_dev's in a list (list.h) that could be searched, comparing the `mm_struct *` of the calling task with each vhost_dev's, and removing the entry from every IOTLB that hits.

Not sure if that's the best approach to find the related vhost_dev's.

What do you think?

Best regards,
Leonardo Bras
On 19/12/2019 10:28, Leonardo Bras wrote:
> On Wed, 2019-12-18 at 15:53 +1100, Alexey Kardashevskiy wrote:
>> H_STUFF_TCE is always called with 0. Well, maybe some AIX somewhere
>> calls it with a value other than zero, and I probably saw some other
>> value somewhere, but in the QEMU/KVM case it is 0, so you effectively
>> disable in-kernel acceleration of H_STUFF_TCE, which is undesirable.
>
> Thanks for the feedback!
>
>> For now we should disable the in-kernel H_STUFF_TCE/... handlers in QEMU,
>> just like we do for VFIO on older host kernels:
>>
>> https://git.qemu.org/?p=qemu.git;a=blob;f=hw/ppc/spapr_iommu.c;h=3d3bcc86496a5277d62f7855fbb09c013c015f27;hb=HEAD#l208
>
> I am still reading into this temporary solution; I could not yet
> understand how it works.
>
>> I am not sure what a proper solution would be; something like an eventfd,
>> with KVM's kvmppc_h_stuff_tce() signalling vhost that the latter needs to
>> invalidate IOTLBs. Or we can just say that we do not allow KVM
>> acceleration if there is vhost+IOMMU on the same liobn (== vPHB, pretty
>> much). Thanks,
>
> I am not used to eventfd, but I agree it's a valid solution to talk to
> QEMU and then use it to send a message via /dev/vhost:
> KVM -> QEMU -> vhost
>
> But I can't get my mind off another solution: doing it in kernelspace.
> I am not sure how that would work, though.
>
> If I understood correctly, there is one vhost IOTLB per vhost_dev, and
> H_STUFF_TCE is not called in the 64-bit DMA case (for tce_value == 0,
> at least), which makes sense, given that it doesn't need to invalidate
> entries in the IOTLB.
>
> So, we would need to somehow replace `return H_TOO_HARD` in this patch
> with code that could call vhost_process_iotlb_msg() with
> VHOST_IOTLB_INVALIDATE.
>
> For that, I would need to know the vhost_dev's of the calling process,
> which I don't know if it's currently possible (or safe at all) to do.
>
> I am thinking of linking all vhost_dev's in a list (list.h) that could
> be searched, comparing the `mm_struct *` of the calling task with each
> vhost_dev's, and removing the entry from every IOTLB that hits.
>
> Not sure if that's the best approach to find the related vhost_dev's.
>
> What do you think?

As discussed on Slack, we need to do the same thing we do with physical devices when we invalidate hardware IOMMU translation caches via tbl->it_ops->tce_kill. The problem to solve now is how to tell KVM/PPC about vhost/iotlb (is there an fd?), something similar to the existing KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE. I guess x86 handles all the mappings in QEMU, and therefore they do not have this problem.

Thanks,
diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
index 883a66e76638..841eff3f6392 100644
--- a/arch/powerpc/kvm/book3s_64_vio.c
+++ b/arch/powerpc/kvm/book3s_64_vio.c
@@ -710,6 +710,9 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
 	if (ret != H_SUCCESS)
 		return ret;
 
+	if (tce_value == 0)
+		return H_TOO_HARD;
+
 	/* Check permission bits only to allow userspace poison TCE for debug */
 	if (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ))
 		return H_PARAMETER;
Fixes a bug that happens when a virtual machine is created without DDW, with vhost supporting a virtio-net device.

In this scenario, an IOMMU with a 32-bit DMA window will possibly map IOVAs to different memory addresses.

As the code works today, the H_STUFF_TCE hypercall is handled only by KVM code, which does not invalidate the IOTLB entry in vhost, meaning that at some point an old entry can cause an access to a previous memory address that the IOVA pointed to.

Example:
- virtio-net passes IOVA N to vhost, which points to M1
- vhost tries the IOTLB, but misses
- vhost translates IOVA N and stores the result in the IOTLB
- vhost writes to M1
- (some IOMMU usage)
- virtio-net passes IOVA N to vhost, which now points to M2
- vhost tries the IOTLB, and translates IOVA N to M1
- vhost writes to M1 <error, should write to M2>

The reason this error was not so evident is probably that the IOTLB was small enough to almost always miss by the time an IOVA was reused. Raising the IOTLB size to 32k (a module parameter that defaults to 2k) is enough to reproduce the bug in more than 90% of the runs. It usually takes less than 10 seconds of netperf to trigger it.

A few minutes after reproducing this bug, the guest usually crashes.

Fixing this bug involves removing an IOVA entry from the IOTLB. The guest kernel triggers this by doing an H_STUFF_TCE hypercall with tce_value == 0.

This change fixes the bug by returning H_TOO_HARD from kvmppc_h_stuff_tce when tce_value == 0, which causes KVM to let QEMU deal with the hypercall. In that case, QEMU does free the vhost IOTLB entry, which fixes the bug.

Signed-off-by: Leonardo Bras <leonardo@linux.ibm.com>
---
 arch/powerpc/kvm/book3s_64_vio.c | 3 +++
 1 file changed, 3 insertions(+)