| Message ID | 1496244325-180257-6-git-send-email-pasha.tatashin@oracle.com |
|---|---|
| State | Accepted |
| Delegated to: | David Miller |
From: Pavel Tatashin <pasha.tatashin@oracle.com>
Date: Wed, 31 May 2017 11:25:24 -0400

> +	for_each_online_cpu(cpu) {
> +		/*
> +		 * If a new mm is stored after we took this mm from the array,
> +		 * it will go into get_new_mmu_context() path, because we
> +		 * already bumped the version in tlb_context_cache.
> +		 */
> +		mm = per_cpu(per_cpu_secondary_mm, cpu);
> +
> +		if (unlikely(!mm || mm == &init_mm))
> +			continue;
> +
> +		old_ctx = mm->context.sparc64_ctx_val;
> +		if (likely((old_ctx & CTX_VERSION_MASK) == old_ver)) {
> +			new_ctx = (old_ctx & ~CTX_VERSION_MASK) | new_ver;
> +			set_bit(new_ctx & CTX_NR_MASK, mmu_context_bmap);
> +			mm->context.sparc64_ctx_val = new_ctx;

I wonder if there is a potential use after free here.

What synchronizes the per-cpu mm pointers with free_mm()?

For example, what stops another cpu from exiting a thread and dropping
the mm between when you do the per_cpu() read of the 'mm' pointer and
the tests and sets you do a few lines later?
On 2017-06-04 22:01, David Miller wrote:
> From: Pavel Tatashin <pasha.tatashin@oracle.com>
> Date: Wed, 31 May 2017 11:25:24 -0400
>
>> +	for_each_online_cpu(cpu) {
>> +		/*
>> +		 * If a new mm is stored after we took this mm from the array,
>> +		 * it will go into get_new_mmu_context() path, because we
>> +		 * already bumped the version in tlb_context_cache.
>> +		 */
>> +		mm = per_cpu(per_cpu_secondary_mm, cpu);
>> +
>> +		if (unlikely(!mm || mm == &init_mm))
>> +			continue;
>> +
>> +		old_ctx = mm->context.sparc64_ctx_val;
>> +		if (likely((old_ctx & CTX_VERSION_MASK) == old_ver)) {
>> +			new_ctx = (old_ctx & ~CTX_VERSION_MASK) | new_ver;
>> +			set_bit(new_ctx & CTX_NR_MASK, mmu_context_bmap);
>> +			mm->context.sparc64_ctx_val = new_ctx;
>
> I wonder if there is a potential use after free here.
>
> What synchronizes the per-cpu mm pointers with free_mm()?
>
> For example, what stops another cpu from exiting a thread
> and dropping the mm between when you do the per_cpu() read
> of the 'mm' pointer and the tests and sets you do a few
> lines later?

Hi Dave,

The ctx_alloc_lock taken in destroy_context() synchronizes the wrap with
free_mm(). By the time destroy_context() is called, we know that the last
user of the mm is freeing it, and we take the same lock that is held
during the wrap.

I had the following asserts in destroy_context() for testing purposes:

	for_each_cpu(cpu, mm_cpumask(mm)) {
		mmp = per_cpu_ptr(&per_cpu_secondary_mm, cpu);
		if (*mmp == mm) {
			cmpxchg(mmp, mm, NULL);
			panic("BUG: mm[%p] in per_cpu_secondary_mm CPU[%d, %d]",
			      mm, smp_processor_id(), cpu);
		}
	}

And I never hit it.

Pasha
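For reference, the destroy_context() path described above lives in arch/sparc/mm/tsb.c and releases the mm's context number under the same ctx_alloc_lock. The sketch below is an abridged reconstruction of what that function looked like in kernels of this era, not the exact source, so details may differ:

```c
/* Abridged sketch of sparc64 destroy_context(): the mm's context number
 * is released while holding ctx_alloc_lock, the same lock taken by the
 * wrap code, so a wrap in progress cannot race with the final free of
 * this mm.
 */
void destroy_context(struct mm_struct *mm)
{
	unsigned long flags, i;

	for (i = 0; i < MM_NUM_TSBS; i++)
		tsb_destroy_one(&mm->context.tsb_block[i]);

	spin_lock_irqsave(&ctx_alloc_lock, flags);

	if (CTX_VALID(mm->context)) {
		unsigned long nr = CTX_NRBITS(mm->context);

		mmu_context_bmap[nr >> 6] &= ~(1UL << (nr & 63));
	}

	spin_unlock_irqrestore(&ctx_alloc_lock, flags);
}
```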
From: Pasha Tatashin <pasha.tatashin@oracle.com>
Date: Sun, 4 Jun 2017 23:50:55 -0400

> On 2017-06-04 22:01, David Miller wrote:
>> From: Pavel Tatashin <pasha.tatashin@oracle.com>
>> Date: Wed, 31 May 2017 11:25:24 -0400
>>
>>> +	for_each_online_cpu(cpu) {
>>> +		/*
>>> +		 * If a new mm is stored after we took this mm from the array,
>>> +		 * it will go into get_new_mmu_context() path, because we
>>> +		 * already bumped the version in tlb_context_cache.
>>> +		 */
>>> +		mm = per_cpu(per_cpu_secondary_mm, cpu);
>>> +
>>> +		if (unlikely(!mm || mm == &init_mm))
>>> +			continue;
>>> +
>>> +		old_ctx = mm->context.sparc64_ctx_val;
>>> +		if (likely((old_ctx & CTX_VERSION_MASK) == old_ver)) {
>>> +			new_ctx = (old_ctx & ~CTX_VERSION_MASK) | new_ver;
>>> +			set_bit(new_ctx & CTX_NR_MASK, mmu_context_bmap);
>>> +			mm->context.sparc64_ctx_val = new_ctx;
>>
>> I wonder if there is a potential use after free here.
>>
>> What synchronizes the per-cpu mm pointers with free_mm()?
>>
>> For example, what stops another cpu from exiting a thread
>> and dropping the mm between when you do the per_cpu() read
>> of the 'mm' pointer and the tests and sets you do a few
>> lines later?
>
> Hi Dave,
>
> The ctx_alloc_lock taken in destroy_context() synchronizes the wrap
> with free_mm(). By the time destroy_context() is called, we know that
> the last user of the mm is freeing it, and we take the same lock that
> is held during the wrap.
>
> I had the following asserts in destroy_context() for testing purposes:
>
> 	for_each_cpu(cpu, mm_cpumask(mm)) {
> 		mmp = per_cpu_ptr(&per_cpu_secondary_mm, cpu);
> 		if (*mmp == mm) {
> 			cmpxchg(mmp, mm, NULL);
> 			panic("BUG: mm[%p] in per_cpu_secondary_mm CPU[%d, %d]",
> 			      mm, smp_processor_id(), cpu);
> 		}
> 	}
>
> And I never hit it.

Ok, thanks for the details.

I'm not too happy about scanning NCPUS 'mm' pointers on every context
wrap; that can be up to 4096, so the cost is not exactly trivial.

However, I don't have an alternative suggestion at this time, and such
quick wraps only happen in extreme situations, so your changes
definitely improve the situation.

So I'll apply this series and queue it up for -stable, thank you.
>
> Ok, thanks for the details.
>
> I'm not too happy about scanning NCPUS 'mm' pointers on every context
> wrap; that can be up to 4096, so the cost is not exactly trivial.

Hi Dave,

Thank you for accepting the changes.

Going through an NCPU-proportional loop, whether for xcalls or for the
'mm' pointers, is a fundamental limitation when there is only one global
context counter per system. Solving it would be a separate project: we
would need multiple context counters per system, up to one per core.
Bob Picco is working on adding this support. In that case the loop
sizes are going to be bounded by:

	NCPU / (# context domains)

Pasha
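To put rough numbers on that bound (the figures here are purely illustrative, not from the thread): if a 4096-strand machine were organized as 512 cores of 8 strands each, and each core became its own context domain, the wrap-time scan would shrink from 4096 'mm' pointers to 4096 / 512 = 8 per domain.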
From: Pasha Tatashin <pasha.tatashin@oracle.com>
Date: Mon, 5 Jun 2017 16:10:04 -0400

> Going through an NCPU-proportional loop, whether for xcalls or for the
> 'mm' pointers, is a fundamental limitation when there is only one
> global context counter per system. Solving it would be a separate
> project: we would need multiple context counters per system, up to one
> per core. Bob Picco is working on adding this support. In that case
> the loop sizes are going to be bounded by:
> 	NCPU / (# context domains)

I see. I think MIPS uses per-cpu context IDs; perhaps you can see how
they handle this.

Thanks.
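For readers unfamiliar with the scheme referred to here, a per-cpu context ID (ASID) allocator generally works like the sketch below. This is a generic illustration loosely modeled on the MIPS/ARM style, not actual MIPS code; every identifier in it (cpu_asid_cache, mm->context.asid[], the ASID_* macros, local_flush_tlb_all) is made up for the example:

```c
/*
 * Generic per-CPU ASID/generation sketch (illustrative names only).
 * Each CPU owns its own counter: the low bits are the hardware ASID,
 * the high bits are a per-CPU generation.  A wrap only flushes the
 * local TLB, so there is no cross-CPU scan like the sparc64 one above.
 */
#define ASID_BITS		13
#define ASID_MASK		((1UL << ASID_BITS) - 1)
#define ASID_FIRST_VERSION	(1UL << ASID_BITS)

static DEFINE_PER_CPU(unsigned long, cpu_asid_cache) = ASID_FIRST_VERSION;

static void get_new_cpu_asid(struct mm_struct *mm, int cpu)
{
	unsigned long asid = per_cpu(cpu_asid_cache, cpu) + 1;

	if (!(asid & ASID_MASK)) {
		/* This CPU ran out of ASIDs: start a new generation and
		 * flush only the local TLB.
		 */
		local_flush_tlb_all();
		if (!asid)			/* counter wrapped to zero */
			asid = ASID_FIRST_VERSION;
	}
	per_cpu(cpu_asid_cache, cpu) = asid;
	mm->context.asid[cpu] = asid;
}

static void check_cpu_asid(struct mm_struct *mm, int cpu)
{
	/* A generation mismatch means this mm's ASID on this CPU predates
	 * the last local wrap and must be reallocated before use.
	 */
	if ((mm->context.asid[cpu] ^ per_cpu(cpu_asid_cache, cpu)) & ~ASID_MASK)
		get_new_cpu_asid(mm, cpu);
}
```

The trade-off is that each mm then carries a context ID per CPU (or per context domain) instead of a single system-wide one, which is essentially what the NCPU / (# context domains) split discussed above would give on sparc64.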
```diff
diff --git a/arch/sparc/mm/init_64.c b/arch/sparc/mm/init_64.c
index 5b4f030..bfb7672 100644
--- a/arch/sparc/mm/init_64.c
+++ b/arch/sparc/mm/init_64.c
@@ -712,6 +712,53 @@ void __flush_dcache_range(unsigned long start, unsigned long end)
 DECLARE_BITMAP(mmu_context_bmap, MAX_CTX_NR);
 DEFINE_PER_CPU(struct mm_struct *, per_cpu_secondary_mm) = {0};
 
+static void mmu_context_wrap(void)
+{
+	unsigned long old_ver = tlb_context_cache & CTX_VERSION_MASK;
+	unsigned long new_ver, new_ctx, old_ctx;
+	struct mm_struct *mm;
+	int cpu;
+
+	bitmap_zero(mmu_context_bmap, 1 << CTX_NR_BITS);
+
+	/* Reserve kernel context */
+	set_bit(0, mmu_context_bmap);
+
+	new_ver = (tlb_context_cache & CTX_VERSION_MASK) + CTX_FIRST_VERSION;
+	if (unlikely(new_ver == 0))
+		new_ver = CTX_FIRST_VERSION;
+	tlb_context_cache = new_ver;
+
+	/*
+	 * Make sure that any new mm that are added into per_cpu_secondary_mm,
+	 * are going to go through get_new_mmu_context() path.
+	 */
+	mb();
+
+	/*
+	 * Updated versions to current on those CPUs that had valid secondary
+	 * contexts
+	 */
+	for_each_online_cpu(cpu) {
+		/*
+		 * If a new mm is stored after we took this mm from the array,
+		 * it will go into get_new_mmu_context() path, because we
+		 * already bumped the version in tlb_context_cache.
+		 */
+		mm = per_cpu(per_cpu_secondary_mm, cpu);
+
+		if (unlikely(!mm || mm == &init_mm))
+			continue;
+
+		old_ctx = mm->context.sparc64_ctx_val;
+		if (likely((old_ctx & CTX_VERSION_MASK) == old_ver)) {
+			new_ctx = (old_ctx & ~CTX_VERSION_MASK) | new_ver;
+			set_bit(new_ctx & CTX_NR_MASK, mmu_context_bmap);
+			mm->context.sparc64_ctx_val = new_ctx;
+		}
+	}
+}
+
 /* Caller does TLB context flushing on local CPU if necessary.
  * The caller also ensures that CTX_VALID(mm->context) is false.
  *
@@ -726,50 +773,30 @@ void get_new_mmu_context(struct mm_struct *mm)
 {
 	unsigned long ctx, new_ctx;
 	unsigned long orig_pgsz_bits;
-	int new_version;
 
 	spin_lock(&ctx_alloc_lock);
+retry:
+	/* wrap might have happened, test again if our context became valid */
+	if (unlikely(CTX_VALID(mm->context)))
+		goto out;
 	orig_pgsz_bits = (mm->context.sparc64_ctx_val & CTX_PGSZ_MASK);
 	ctx = (tlb_context_cache + 1) & CTX_NR_MASK;
 	new_ctx = find_next_zero_bit(mmu_context_bmap, 1 << CTX_NR_BITS, ctx);
-	new_version = 0;
 	if (new_ctx >= (1 << CTX_NR_BITS)) {
 		new_ctx = find_next_zero_bit(mmu_context_bmap, ctx, 1);
 		if (new_ctx >= ctx) {
-			int i;
-			new_ctx = (tlb_context_cache & CTX_VERSION_MASK) +
-				CTX_FIRST_VERSION + 1;
-			if (new_ctx == 1)
-				new_ctx = CTX_FIRST_VERSION + 1;
-
-			/* Don't call memset, for 16 entries that's just
-			 * plain silly...
-			 */
-			mmu_context_bmap[0] = 3;
-			mmu_context_bmap[1] = 0;
-			mmu_context_bmap[2] = 0;
-			mmu_context_bmap[3] = 0;
-			for (i = 4; i < CTX_BMAP_SLOTS; i += 4) {
-				mmu_context_bmap[i + 0] = 0;
-				mmu_context_bmap[i + 1] = 0;
-				mmu_context_bmap[i + 2] = 0;
-				mmu_context_bmap[i + 3] = 0;
-			}
-			new_version = 1;
-			goto out;
+			mmu_context_wrap();
+			goto retry;
 		}
 	}
 	if (mm->context.sparc64_ctx_val)
 		cpumask_clear(mm_cpumask(mm));
 	mmu_context_bmap[new_ctx>>6] |= (1UL << (new_ctx & 63));
 	new_ctx |= (tlb_context_cache & CTX_VERSION_MASK);
-out:
 	tlb_context_cache = new_ctx;
 	mm->context.sparc64_ctx_val = new_ctx | orig_pgsz_bits;
+out:
 	spin_unlock(&ctx_alloc_lock);
-
-	if (unlikely(new_version))
-		smp_new_mmu_context_version();
 }
 
 static int numa_enabled = 1;
```