Message ID: 20210624035749.4054934-1-stevensd@google.com
Series: KVM: Remove uses of struct page from x86 and arm64 MMU
On 24/06/21 05:57, David Stevens wrote:
> KVM supports mapping VM_IO and VM_PFNMAP memory into the guest by using
> follow_pte in gfn_to_pfn. However, the resolved pfns may not have
> associated struct pages, so they should not be passed to pfn_to_page.
> This series removes such calls from the x86 and arm64 secondary MMU. To
> do this, this series modifies gfn_to_pfn to return a struct page in
> addition to a pfn, if the hva was resolved by gup. This allows the
> caller to call put_page only when necessitated by gup.
>
> This series provides a helper function that unwraps the new return type
> of gfn_to_pfn to provide behavior identical to the old behavior. As I
> have no hardware to test powerpc/mips changes, the function is used
> there for minimally invasive changes. Additionally, as gfn_to_page and
> gfn_to_pfn_cache are not integrated with the mmu notifier, they cannot
> be easily changed over to only use pfns.
>
> This addresses CVE-2021-22543 on x86 and arm64.

Thank you very much for this. I agree that it makes sense to have a
minimal change; I had similar changes almost ready, but was stuck with
deadlocks in the gfn_to_pfn_cache case. In retrospect I should have
posted something similar to your patches.

I have started reviewing the patches, and they look good. I will try to
include them in 5.13.

Paolo
Excerpts from David Stevens's message of June 24, 2021 1:57 pm:
> KVM supports mapping VM_IO and VM_PFNMAP memory into the guest by using
> follow_pte in gfn_to_pfn. [...]
>
> This addresses CVE-2021-22543 on x86 and arm64.

Does this fix the problem? (Untested; I don't have a PoC setup at hand,
but at least in concept.)

I have no problem with improving the API, and the direction of your
series is probably good. But there seems to be a lot of unfixed arch
code and broken APIs remaining to deal with after your series too. This
might be the most suitable fix to backport, and a base for your series,
which can take more time to convert to the new APIs.

Thanks,
Nick

---

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 6a6bc7af0e28..e208c279d903 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2104,13 +2104,21 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
 	 * Whoever called remap_pfn_range is also going to call e.g.
 	 * unmap_mapping_range before the underlying pages are freed,
 	 * causing a call to our MMU notifier.
+	 *
+	 * Certain IO or PFNMAP mappings can be backed with valid
+	 * struct pages, but be allocated without refcounting e.g.,
+	 * tail pages of non-compound higher order allocations, which
+	 * would then underflow the refcount when the caller does the
+	 * required put_page. Don't allow those pages here.
 	 */
-	kvm_get_pfn(pfn);
+	if (!kvm_try_get_pfn(pfn))
+		r = -EFAULT;
 
 out:
 	pte_unmap_unlock(ptep, ptl);
 	*p_pfn = pfn;
-	return 0;
+
+	return r;
 }
 
 /*
@@ -2487,6 +2495,13 @@ void kvm_set_pfn_accessed(kvm_pfn_t pfn)
 }
 EXPORT_SYMBOL_GPL(kvm_set_pfn_accessed);
 
+static int kvm_try_get_pfn(kvm_pfn_t pfn)
+{
+	if (kvm_is_reserved_pfn(pfn))
+		return 1;
+	return get_page_unless_zero(pfn_to_page(pfn));
+}
+
 void kvm_get_pfn(kvm_pfn_t pfn)
 {
 	if (!kvm_is_reserved_pfn(pfn))
Excerpts from Nicholas Piggin's message of June 24, 2021 8:34 pm:
> Excerpts from David Stevens's message of June 24, 2021 1:57 pm:
>> KVM supports mapping VM_IO and VM_PFNMAP memory into the guest by using
>> follow_pte in gfn_to_pfn. [...]
>>
>> This addresses CVE-2021-22543 on x86 and arm64.
>
> Does this fix the problem? (Untested; I don't have a PoC setup at hand,
> but at least in concept.)

This one actually compiles, at least. Unfortunately I don't have much
time in the near future to test, and I only just found out about this
CVE a few hours ago.

---

It's possible to create a region which maps valid but non-refcounted
pages (e.g., tail pages of non-compound higher-order allocations). These
host pages can then be returned by the gfn_to_page, gfn_to_pfn, etc.
family of APIs, which take a reference to the page, taking it from 0 to
1. When the reference is dropped, this will free the page incorrectly.

Fix this by only taking a reference on the page if it was non-zero,
which indicates it is participating in normal refcounting (and can be
released with put_page).
---
 virt/kvm/kvm_main.c | 19 +++++++++++++++++--
 1 file changed, 17 insertions(+), 2 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 6a6bc7af0e28..46fb042837d2 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2055,6 +2055,13 @@ static bool vma_is_valid(struct vm_area_struct *vma, bool write_fault)
 	return true;
 }
 
+static int kvm_try_get_pfn(kvm_pfn_t pfn)
+{
+	if (kvm_is_reserved_pfn(pfn))
+		return 1;
+	return get_page_unless_zero(pfn_to_page(pfn));
+}
+
 static int hva_to_pfn_remapped(struct vm_area_struct *vma,
 			       unsigned long addr, bool *async,
 			       bool write_fault, bool *writable,
@@ -2104,13 +2111,21 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
 	 * Whoever called remap_pfn_range is also going to call e.g.
 	 * unmap_mapping_range before the underlying pages are freed,
 	 * causing a call to our MMU notifier.
+	 *
+	 * Certain IO or PFNMAP mappings can be backed with valid
+	 * struct pages, but be allocated without refcounting e.g.,
+	 * tail pages of non-compound higher order allocations, which
+	 * would then underflow the refcount when the caller does the
+	 * required put_page. Don't allow those pages here.
 	 */
-	kvm_get_pfn(pfn);
+	if (!kvm_try_get_pfn(pfn))
+		r = -EFAULT;
 
 out:
 	pte_unmap_unlock(ptep, ptl);
 	*p_pfn = pfn;
-	return 0;
+
+	return r;
 }
 
 /*
On 24/06/21 13:42, Nicholas Piggin wrote:
> +static int kvm_try_get_pfn(kvm_pfn_t pfn)
> +{
> +	if (kvm_is_reserved_pfn(pfn))
> +		return 1;

So !pfn_valid would always return true. Yeah, this should work and is
certainly appealing!

Paolo

> +	return get_page_unless_zero(pfn_to_page(pfn));
> +}
> [...]
On 24/06/21 13:42, Nicholas Piggin wrote:
> Excerpts from Nicholas Piggin's message of June 24, 2021 8:34 pm:
>> Excerpts from David Stevens's message of June 24, 2021 1:57 pm:
>>> KVM supports mapping VM_IO and VM_PFNMAP memory into the guest by
>>> using follow_pte in gfn_to_pfn. [...]
>>>
>>> This addresses CVE-2021-22543 on x86 and arm64.
>>
>> Does this fix the problem? (Untested; I don't have a PoC setup at
>> hand, but at least in concept.)
>
> This one actually compiles, at least. Unfortunately I don't have much
> time in the near future to test, and I only just found out about this
> CVE a few hours ago.

And it also works (the reproducer gets an infinite stream of userspace
exits and especially does not crash). We can still go for David's
solution later, since MMU notifiers are able to deal with these pages,
but it's a very nice patch for stable kernels.

If you provide a Signed-off-by, I can integrate it.

Paolo

> It's possible to create a region which maps valid but non-refcounted
> pages (e.g., tail pages of non-compound higher order allocations).
> [...]
Excerpts from Paolo Bonzini's message of June 24, 2021 10:41 pm:
> On 24/06/21 13:42, Nicholas Piggin wrote:
>> Excerpts from Nicholas Piggin's message of June 24, 2021 8:34 pm:
>>> Excerpts from David Stevens's message of June 24, 2021 1:57 pm:
>>>> KVM supports mapping VM_IO and VM_PFNMAP memory into the guest by
>>>> using follow_pte in gfn_to_pfn. [...]
>>>>
>>>> This addresses CVE-2021-22543 on x86 and arm64.
>>>
>>> Does this fix the problem? (Untested; I don't have a PoC setup at
>>> hand, but at least in concept.)
>>
>> This one actually compiles, at least. Unfortunately I don't have much
>> time in the near future to test, and I only just found out about this
>> CVE a few hours ago.
>
> And it also works (the reproducer gets an infinite stream of userspace
> exits and especially does not crash). We can still go for David's
> solution later, since MMU notifiers are able to deal with these pages,
> but it's a very nice patch for stable kernels.

Oh nice, thanks for testing. How's this?
Thanks,
Nick

---

KVM: Fix page ref underflow for regions with valid but non-refcounted pages

It's possible to create a region which maps valid but non-refcounted
pages (e.g., tail pages of non-compound higher-order allocations). These
host pages can then be returned by the gfn_to_page, gfn_to_pfn, etc.
family of APIs, which take a reference to the page, taking it from 0 to
1. When the reference is dropped, this will free the page incorrectly.

Fix this by only taking a reference on the page if it was non-zero,
which indicates it is participating in normal refcounting (and can be
released with put_page).

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 virt/kvm/kvm_main.c | 19 +++++++++++++++++--
 1 file changed, 17 insertions(+), 2 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 6a6bc7af0e28..46fb042837d2 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2055,6 +2055,13 @@ static bool vma_is_valid(struct vm_area_struct *vma, bool write_fault)
 	return true;
 }
 
+static int kvm_try_get_pfn(kvm_pfn_t pfn)
+{
+	if (kvm_is_reserved_pfn(pfn))
+		return 1;
+	return get_page_unless_zero(pfn_to_page(pfn));
+}
+
 static int hva_to_pfn_remapped(struct vm_area_struct *vma,
 			       unsigned long addr, bool *async,
 			       bool write_fault, bool *writable,
@@ -2104,13 +2111,21 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
 	 * Whoever called remap_pfn_range is also going to call e.g.
 	 * unmap_mapping_range before the underlying pages are freed,
 	 * causing a call to our MMU notifier.
+	 *
+	 * Certain IO or PFNMAP mappings can be backed with valid
+	 * struct pages, but be allocated without refcounting e.g.,
+	 * tail pages of non-compound higher order allocations, which
+	 * would then underflow the refcount when the caller does the
+	 * required put_page. Don't allow those pages here.
 	 */
-	kvm_get_pfn(pfn);
+	if (!kvm_try_get_pfn(pfn))
+		r = -EFAULT;
 
 out:
 	pte_unmap_unlock(ptep, ptl);
 	*p_pfn = pfn;
-	return 0;
+
+	return r;
 }
 
 /*
On 24/06/21 14:57, Nicholas Piggin wrote:
> KVM: Fix page ref underflow for regions with valid but non-refcounted pages

It doesn't really fix the underflow; it disallows mapping such pages in
the first place. Since in principle things can break, I'd rather be
explicit, so let's go with "KVM: do not allow mapping valid but
non-reference-counted pages".

> It's possible to create a region which maps valid but non-refcounted
> pages (e.g., tail pages of non-compound higher order allocations).
> [...]
>
> Fix this by only taking a reference on the page if it was non-zero,

s/on the page/on valid pages/ (makes clear that invalid pages are fine
without refcounting).

Thank you *so* much, I'm awful at Linux mm.

Paolo

> which indicates it is participating in normal refcounting (and can be
> released with put_page).
>
> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Excerpts from Paolo Bonzini's message of June 25, 2021 1:35 am:
> On 24/06/21 14:57, Nicholas Piggin wrote:
>> KVM: Fix page ref underflow for regions with valid but non-refcounted pages
>
> It doesn't really fix the underflow; it disallows mapping such pages in
> the first place. Since in principle things can break, I'd rather be
> explicit, so let's go with "KVM: do not allow mapping valid but
> non-reference-counted pages".
>
>> Fix this by only taking a reference on the page if it was non-zero,
>
> s/on the page/on valid pages/ (makes clear that invalid pages are fine
> without refcounting).

That seems okay; you can adjust the title or changelog as you like.

> Thank you *so* much, I'm awful at Linux mm.

Glad to help. It's easy to see why you were taking this approach: the
API really does need to be improved, and even a subsystem as intertwined
with mm as KVM shouldn't _really_ be doing this kind of trick (and it
should go away when the old API is removed).

Thanks,
Nick
On 24.06.21 14:57, Nicholas Piggin wrote:
> Excerpts from Paolo Bonzini's message of June 24, 2021 10:41 pm:
>> [...]
>>
>> And it also works (the reproducer gets an infinite stream of userspace
>> exits and especially does not crash). We can still go for David's
>> solution later, since MMU notifiers are able to deal with these pages,
>> but it's a very nice patch for stable kernels.
>
> Oh nice, thanks for testing. How's this?
>
> [...]
>
> 	 */
> -	kvm_get_pfn(pfn);
> +	if (!kvm_try_get_pfn(pfn))
> +		r = -EFAULT;

Right. That should also take care of s390 (pin_guest_page in vsie.c,
which calls gfn_to_page).

FWIW, the current API is really hard to follow, as it does not tell
which functions take a reference and which don't.

Anyway, this patch (with cc: stable?)

Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>

> out:
> 	pte_unmap_unlock(ptep, ptl);
> 	*p_pfn = pfn;
> -	return 0;
> +
> +	return r;
> }