Message ID: 20201217012557.118906-2-cascardo@canonical.com
State: New
Series: CVE-2020-29374
On Wed, Dec 16, 2020 at 10:25:57PM -0300, Thadeu Lima de Souza Cascardo wrote:
> From: Linus Torvalds <torvalds@linux-foundation.org>
>
> Doing a "get_user_pages()" on a copy-on-write page for reading can be
> ambiguous: the page can be COW'ed at any time afterwards, and the
> direction of a COW event isn't defined.
>
> Yes, whoever writes to it will generally do the COW, but if the thread
> that did the get_user_pages() unmapped the page before the write (and
> that could happen due to memory pressure in addition to any outright
> action), the writer could also just take over the old page instead.
>
> End result: the get_user_pages() call might result in a page pointer
> that is no longer associated with the original VM, and is associated
> with - and controlled by - another VM having taken it over instead.
>
> So when doing a get_user_pages() on a COW mapping, the only really safe
> thing to do would be to break the COW when getting the page, even when
> only getting it for reading.
>
> At the same time, some users simply don't even care.
>
> For example, the perf code wants to look up the page not because it
> cares about the page, but because the code simply wants to look up the
> physical address of the access for informational purposes, and doesn't
> really care about races when a page might be unmapped and remapped
> elsewhere.
>
> This adds logic to force a COW event by setting FOLL_WRITE on any
> copy-on-write mapping when FOLL_GET (or FOLL_PIN) is used to get a page
> pointer as a result.
>
> The current semantics end up being:
>
>  - __get_user_pages_fast(): no change. If you don't ask for a write,
>    you won't break COW. You'd better know what you're doing.
>
>  - get_user_pages_fast(): the fast-case "look it up in the page tables
>    without anything getting mmap_sem" now refuses to follow a read-only
>    page, since it might need COW breaking. Which happens in the slow
>    path - the fast path doesn't know if the memory might be COW or not.
>
>  - get_user_pages() (including the slow-path fallback for gup_fast()):
>    for a COW mapping, turn on FOLL_WRITE for FOLL_GET/FOLL_PIN, with
>    very similar semantics to FOLL_FORCE.
>
> If it turns out that we want finer granularity (ie "only break COW when
> it might actually matter" - things like the zero page are special and
> don't need to be broken) we might need to push these semantics deeper
> into the lookup fault path. So if people care enough, it's possible
> that we might end up adding a new internal FOLL_BREAK_COW flag to go
> with the internal FOLL_COW flag we already have for tracking "I had a
> COW".
>
> Alternatively, if it turns out that different callers might want to
> explicitly control the forced COW break behavior, we might even want to
> make such a flag visible to the users of get_user_pages() instead of
> using the above default semantics.
>
> But for now, this is mostly commentary on the issue (this commit message
> being a lot bigger than the patch, and that patch in turn is almost all
> comments), with that minimal "enable COW breaking early" logic using the
> existing FOLL_WRITE behavior.
>
> [ It might be worth noting that we've always had this ambiguity, and it
>   could arguably be seen as a user-space issue.
>
>   You only get private COW mappings that could break either way in
>   situations where user space is doing cooperative things (ie fork()
>   before an execve() etc), but it _is_ surprising and very subtle, and
>   fork() is supposed to give you independent address spaces.
>
>   So let's treat this as a kernel issue and make the semantics of
>   get_user_pages() easier to understand. Note that obviously a true
>   shared mapping will still get a page that can change under us, so this
>   does _not_ mean that get_user_pages() somehow returns any "stable"
>   page ]
>
> Reported-by: Jann Horn <jannh@google.com>
> Tested-by: Christoph Hellwig <hch@lst.de>
> Acked-by: Oleg Nesterov <oleg@redhat.com>
> Acked-by: Kirill Shutemov <kirill@shutemov.name>
> Acked-by: Jan Kara <jack@suse.cz>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Matthew Wilcox <willy@infradead.org>
> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
> (backported from commit 17839856fd588f4ab6b789f482ed3ffd7c403e1f)
> [cascardo: fixed conflict due to missing commit ad415db817964e96df824e8bb1a861527f8012b6]
> [cascardo: fixed conflict due to missing commit b798bec4741bdd80224214fdd004c8e52698e425]
> [cascardo: fixed minor conflict on __get_user_pages_fast comment]
> [cascardo: fixup for absence of FOLL_PIN]
> [cascardo: use write=1 on s390's get_user_pages_fast too]
> CVE-2020-29374
> Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
> ---
>  arch/s390/mm/gup.c                      |  9 ++++-
>  drivers/gpu/drm/i915/i915_gem_userptr.c |  8 +++++
>  mm/gup.c                                | 44 +++++++++++++++++++++----
>  mm/huge_memory.c                        |  7 ++--
>  4 files changed, 57 insertions(+), 11 deletions(-)
>
> diff --git a/arch/s390/mm/gup.c b/arch/s390/mm/gup.c
> index 9bce54eac0b0..713cf80a740e 100644
> --- a/arch/s390/mm/gup.c
> +++ b/arch/s390/mm/gup.c
> @@ -283,9 +283,16 @@ int get_user_pages_fast(unsigned long start, int nr_pages, int write,
>  {
>  	int nr, ret;
>  
> +	/*
> +	 * The FAST_GUP case requires FOLL_WRITE even for pure reads,
> +	 * because get_user_pages() may need to cause an early COW in
> +	 * order to avoid confusing the normal COW routines. So only
> +	 * targets that are already writable are safe to do by just
> +	 * looking at the page tables.
> +	 */
>  	might_sleep();
>  	start &= PAGE_MASK;
> -	nr = __get_user_pages_fast(start, nr_pages, write, pages);
> +	nr = __get_user_pages_fast(start, nr_pages, 1, pages);
>  	if (nr == nr_pages)
>  		return nr;
>
> diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c b/drivers/gpu/drm/i915/i915_gem_userptr.c
> index f1bd66b5f006..d87beef8ceb3 100644
> --- a/drivers/gpu/drm/i915/i915_gem_userptr.c
> +++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
> @@ -640,6 +640,14 @@ static int i915_gem_userptr_get_pages(struct drm_i915_gem_object *obj)
>  					      GFP_KERNEL |
>  					      __GFP_NORETRY |
>  					      __GFP_NOWARN);
> +		/*
> +		 * Using __get_user_pages_fast() with a read-only
> +		 * access is questionable. A read-only page may be
> +		 * COW-broken, and then this might end up giving
> +		 * the wrong side of the COW..
> +		 *
> +		 * We may or may not care.
> +		 */
>  		if (pvec) /* defer to worker if malloc fails */
>  			pinned = __get_user_pages_fast(obj->userptr.ptr,
>  						       num_pages,
> diff --git a/mm/gup.c b/mm/gup.c
> index 12b9626b1a9e..ebf4a3482dee 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -61,13 +61,22 @@ static int follow_pfn_pte(struct vm_area_struct *vma, unsigned long address,
>  }
>  
>  /*
> - * FOLL_FORCE can write to even unwritable pte's, but only
> - * after we've gone through a COW cycle and they are dirty.
> + * FOLL_FORCE or a forced COW break can write even to unwritable pte's,
> + * but only after we've gone through a COW cycle and they are dirty.
>   */
>  static inline bool can_follow_write_pte(pte_t pte, unsigned int flags)
>  {
> -	return pte_write(pte) ||
> -		((flags & FOLL_FORCE) && (flags & FOLL_COW) && pte_dirty(pte));
> +	return pte_write(pte) || ((flags & FOLL_COW) && pte_dirty(pte));
> +}
> +
> +/*
> + * A (separate) COW fault might break the page the other way and
> + * get_user_pages() would return the page from what is now the wrong
> + * VM. So we need to force a COW break at GUP time even for reads.
> + */
> +static inline bool should_force_cow_break(struct vm_area_struct *vma, unsigned int flags)
> +{
> +	return is_cow_mapping(vma->vm_flags) && (flags & (FOLL_GET));
>  }
>  
>  static struct page *follow_page_pte(struct vm_area_struct *vma,
> @@ -694,12 +703,18 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
>  		if (!vma || check_vma_flags(vma, gup_flags))
>  			return i ? : -EFAULT;
>  		if (is_vm_hugetlb_page(vma)) {
> +			if (should_force_cow_break(vma, foll_flags))
> +				foll_flags |= FOLL_WRITE;
>  			i = follow_hugetlb_page(mm, vma, pages, vmas,
>  					&start, &nr_pages, i,
> -					gup_flags, nonblocking);
> +					foll_flags, nonblocking);
>  			continue;
>  		}
>  	}
> +
> +	if (should_force_cow_break(vma, foll_flags))
> +		foll_flags |= FOLL_WRITE;
> +
>  retry:
>  	/*
>  	 * If we have a pending SIGKILL, don't keep faulting pages and
> @@ -1796,6 +1811,10 @@ bool gup_fast_permitted(unsigned long start, int nr_pages, int write)
>  /*
>   * Like get_user_pages_fast() except it's IRQ-safe in that it won't fall back to
>   * the regular GUP. It will only return non-negative values.
> + *
> + * Careful, careful! COW breaking can go either way, so a non-write
> + * access can get ambiguous page results. If you call this function without
> + * 'write' set, you'd better be sure that you're ok with that ambiguity.
>   */
>  int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
>  			  struct page **pages)
> @@ -1823,6 +1842,12 @@ int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
>  	 *
>  	 * We do not adopt an rcu_read_lock(.) here as we also want to
>  	 * block IPIs that come from THPs splitting.
> +	 *
> +	 * NOTE! We allow read-only gup_fast() here, but you'd better be
> +	 * careful about possible COW pages. You'll get _a_ COW page, but
> +	 * not necessarily the one you intended to get depending on what
> +	 * COW event happens after this. COW may break the page copy in a
> +	 * random direction.
>  	 */
>  
>  	if (gup_fast_permitted(start, nr_pages, write)) {
> @@ -1868,9 +1893,16 @@ int get_user_pages_fast(unsigned long start, int nr_pages, int write,
>  					(void __user *)start, len)))
>  		return -EFAULT;
>  
> +	/*
> +	 * The FAST_GUP case requires FOLL_WRITE even for pure reads,
> +	 * because get_user_pages() may need to cause an early COW in
> +	 * order to avoid confusing the normal COW routines. So only
> +	 * targets that are already writable are safe to do by just
> +	 * looking at the page tables.
> +	 */
>  	if (gup_fast_permitted(start, nr_pages, write)) {
>  		local_irq_disable();
> -		gup_pgd_range(addr, end, write, pages, &nr);
> +		gup_pgd_range(addr, end, 1, pages, &nr);
>  		local_irq_enable();
>  		ret = nr;
>  	}
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index b0a6ba0833e4..9f8fbb504d15 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1425,13 +1425,12 @@ int do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd)
>  }
>  
>  /*
> - * FOLL_FORCE can write to even unwritable pmd's, but only
> - * after we've gone through a COW cycle and they are dirty.
> + * FOLL_FORCE or a forced COW break can write even to unwritable pmd's,
> + * but only after we've gone through a COW cycle and they are dirty.
>   */
>  static inline bool can_follow_write_pmd(pmd_t pmd, unsigned int flags)
>  {
> -	return pmd_write(pmd) ||
> -		((flags & FOLL_FORCE) && (flags & FOLL_COW) && pmd_dirty(pmd));
> +	return pmd_write(pmd) || ((flags & FOLL_COW) && pmd_dirty(pmd));

The comment above this function mentions FOLL_FORCE, but I see it
removed here. Is that removal intentional?

William Breathitt Gray

>  }
>  
>  struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
> --
> 2.27.0
>
>
> --
> kernel-team mailing list
> kernel-team@lists.ubuntu.com
> https://lists.ubuntu.com/mailman/listinfo/kernel-team
On Wed, Jan 13, 2021 at 05:54:06AM -0500, William Breathitt Gray wrote:
> On Wed, Dec 16, 2020 at 10:25:57PM -0300, Thadeu Lima de Souza Cascardo wrote:
> > From: Linus Torvalds <torvalds@linux-foundation.org>
> [...]
> >  static inline bool can_follow_write_pmd(pmd_t pmd, unsigned int flags)
> >  {
> > -	return pmd_write(pmd) ||
> > -		((flags & FOLL_FORCE) && (flags & FOLL_COW) && pmd_dirty(pmd));
> > +	return pmd_write(pmd) || ((flags & FOLL_COW) && pmd_dirty(pmd));
>
> The comment above this function mentions FOLL_FORCE, but I see it
> removed here. Is that removal intentional?

Yep, that comes from the upstream code.

Cascardo.
On Thu, Jan 21, 2021 at 05:40:10PM -0300, Thadeu Lima de Souza Cascardo wrote:
> On Wed, Jan 13, 2021 at 05:54:06AM -0500, William Breathitt Gray wrote:
> > On Wed, Dec 16, 2020 at 10:25:57PM -0300, Thadeu Lima de Souza Cascardo wrote:
> > > From: Linus Torvalds <torvalds@linux-foundation.org>
> > [...]
> > The comment above this function mentions FOLL_FORCE, but I see it
> > removed here. Is that removal intentional?
>
> Yep, that comes from the upstream code.
>
> Cascardo.

You're right, upstream patch removes this.

Acked-by: William Breathitt Gray <william.gray@canonical.com>
diff --git a/arch/s390/mm/gup.c b/arch/s390/mm/gup.c
index 9bce54eac0b0..713cf80a740e 100644
--- a/arch/s390/mm/gup.c
+++ b/arch/s390/mm/gup.c
@@ -283,9 +283,16 @@ int get_user_pages_fast(unsigned long start, int nr_pages, int write,
 {
 	int nr, ret;
 
+	/*
+	 * The FAST_GUP case requires FOLL_WRITE even for pure reads,
+	 * because get_user_pages() may need to cause an early COW in
+	 * order to avoid confusing the normal COW routines. So only
+	 * targets that are already writable are safe to do by just
+	 * looking at the page tables.
+	 */
 	might_sleep();
 	start &= PAGE_MASK;
-	nr = __get_user_pages_fast(start, nr_pages, write, pages);
+	nr = __get_user_pages_fast(start, nr_pages, 1, pages);
 	if (nr == nr_pages)
 		return nr;
 
diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c b/drivers/gpu/drm/i915/i915_gem_userptr.c
index f1bd66b5f006..d87beef8ceb3 100644
--- a/drivers/gpu/drm/i915/i915_gem_userptr.c
+++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
@@ -640,6 +640,14 @@ static int i915_gem_userptr_get_pages(struct drm_i915_gem_object *obj)
 					       GFP_KERNEL |
 					       __GFP_NORETRY |
 					       __GFP_NOWARN);
+		/*
+		 * Using __get_user_pages_fast() with a read-only
+		 * access is questionable. A read-only page may be
+		 * COW-broken, and then this might end up giving
+		 * the wrong side of the COW..
+		 *
+		 * We may or may not care.
+		 */
 		if (pvec) /* defer to worker if malloc fails */
 			pinned = __get_user_pages_fast(obj->userptr.ptr,
 						       num_pages,
diff --git a/mm/gup.c b/mm/gup.c
index 12b9626b1a9e..ebf4a3482dee 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -61,13 +61,22 @@ static int follow_pfn_pte(struct vm_area_struct *vma, unsigned long address,
 }
 
 /*
- * FOLL_FORCE can write to even unwritable pte's, but only
- * after we've gone through a COW cycle and they are dirty.
+ * FOLL_FORCE or a forced COW break can write even to unwritable pte's,
+ * but only after we've gone through a COW cycle and they are dirty.
  */
 static inline bool can_follow_write_pte(pte_t pte, unsigned int flags)
 {
-	return pte_write(pte) ||
-		((flags & FOLL_FORCE) && (flags & FOLL_COW) && pte_dirty(pte));
+	return pte_write(pte) || ((flags & FOLL_COW) && pte_dirty(pte));
+}
+
+/*
+ * A (separate) COW fault might break the page the other way and
+ * get_user_pages() would return the page from what is now the wrong
+ * VM. So we need to force a COW break at GUP time even for reads.
+ */
+static inline bool should_force_cow_break(struct vm_area_struct *vma, unsigned int flags)
+{
+	return is_cow_mapping(vma->vm_flags) && (flags & (FOLL_GET));
 }
 
 static struct page *follow_page_pte(struct vm_area_struct *vma,
@@ -694,12 +703,18 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 		if (!vma || check_vma_flags(vma, gup_flags))
 			return i ? : -EFAULT;
 		if (is_vm_hugetlb_page(vma)) {
+			if (should_force_cow_break(vma, foll_flags))
+				foll_flags |= FOLL_WRITE;
 			i = follow_hugetlb_page(mm, vma, pages, vmas,
 					&start, &nr_pages, i,
-					gup_flags, nonblocking);
+					foll_flags, nonblocking);
 			continue;
 		}
 	}
+
+	if (should_force_cow_break(vma, foll_flags))
+		foll_flags |= FOLL_WRITE;
+
 retry:
 	/*
 	 * If we have a pending SIGKILL, don't keep faulting pages and
@@ -1796,6 +1811,10 @@ bool gup_fast_permitted(unsigned long start, int nr_pages, int write)
 /*
  * Like get_user_pages_fast() except it's IRQ-safe in that it won't fall back to
  * the regular GUP. It will only return non-negative values.
+ *
+ * Careful, careful! COW breaking can go either way, so a non-write
+ * access can get ambiguous page results. If you call this function without
+ * 'write' set, you'd better be sure that you're ok with that ambiguity.
  */
 int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
 			  struct page **pages)
@@ -1823,6 +1842,12 @@ int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
 	 *
 	 * We do not adopt an rcu_read_lock(.) here as we also want to
 	 * block IPIs that come from THPs splitting.
+	 *
+	 * NOTE! We allow read-only gup_fast() here, but you'd better be
+	 * careful about possible COW pages. You'll get _a_ COW page, but
+	 * not necessarily the one you intended to get depending on what
+	 * COW event happens after this. COW may break the page copy in a
+	 * random direction.
 	 */
 
 	if (gup_fast_permitted(start, nr_pages, write)) {
@@ -1868,9 +1893,16 @@ int get_user_pages_fast(unsigned long start, int nr_pages, int write,
 					(void __user *)start, len)))
 		return -EFAULT;
 
+	/*
+	 * The FAST_GUP case requires FOLL_WRITE even for pure reads,
+	 * because get_user_pages() may need to cause an early COW in
+	 * order to avoid confusing the normal COW routines. So only
+	 * targets that are already writable are safe to do by just
+	 * looking at the page tables.
+	 */
 	if (gup_fast_permitted(start, nr_pages, write)) {
 		local_irq_disable();
-		gup_pgd_range(addr, end, write, pages, &nr);
+		gup_pgd_range(addr, end, 1, pages, &nr);
 		local_irq_enable();
 		ret = nr;
 	}
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index b0a6ba0833e4..9f8fbb504d15 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1425,13 +1425,12 @@ int do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd)
 }
 
 /*
- * FOLL_FORCE can write to even unwritable pmd's, but only
- * after we've gone through a COW cycle and they are dirty.
+ * FOLL_FORCE or a forced COW break can write even to unwritable pmd's,
+ * but only after we've gone through a COW cycle and they are dirty.
  */
 static inline bool can_follow_write_pmd(pmd_t pmd, unsigned int flags)
 {
-	return pmd_write(pmd) ||
-		((flags & FOLL_FORCE) && (flags & FOLL_COW) && pmd_dirty(pmd));
+	return pmd_write(pmd) || ((flags & FOLL_COW) && pmd_dirty(pmd));
 }
 
 struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,