| Message ID | 20210118122609.1447366-1-npiggin@gmail.com |
|---|---|
| State | New |
| Series | [1/2] KVM: PPC: Book3S HV: Remove shared-TLB optimisation from vCPU TLB coherency logic |
On Mon, Jan 18, 2021 at 10:26:08PM +1000, Nicholas Piggin wrote:
> Processors that implement ISA v3.0 or later don't necessarily have
> threads in a core sharing all translations, and/or TLBIEL does not
> necessarily invalidate translations on all other threads (the
> architecture talks only about the effect on translations for "the thread
> executing the tlbiel instruction".

It seems to me that to have an implementation where TLB entries
created on one thread (say T0) are visible to and usable by another
thread (T1), but a tlbiel on thread T0 does not result in the entry
being removed from visibility/usability on T1, is a pretty insane
implementation. I'm not sure that the architecture envisaged allowing
this kind of implementation, though perhaps the language doesn't
absolutely prohibit it.

This kind of implementation is what you are allowing for in this
patch, isn't it?

The sane implementations would be ones where either (a) TLB entries
are private to each thread and tlbiel only works on the local thread,
or (b) TLB entries can be shared and tlbiel works across all threads.
I think this is the conclusion we collectively came to when working on
that bug we worked on towards the end of last year.

> While this worked for POWER9, it may not for future implementations, so
> remove it. A POWER9 specific optimisation would have to have a specific
> CPU feature to check, if it were to be re-added.

Did you do any measurements of how much performance impact this has on
POWER9? I don't believe this patch will actually be necessary on
POWER10, so it seems like this patch is just to allow for some
undefined possible future CPU. It may still be worth putting in for
the sake of strict architecture compliance if the performance impact
is minimal.

Paul.
Excerpts from Paul Mackerras's message of February 9, 2021 5:19 pm:
> On Mon, Jan 18, 2021 at 10:26:08PM +1000, Nicholas Piggin wrote:
>> Processors that implement ISA v3.0 or later don't necessarily have
>> threads in a core sharing all translations, and/or TLBIEL does not
>> necessarily invalidate translations on all other threads (the
>> architecture talks only about the effect on translations for "the thread
>> executing the tlbiel instruction".
>
> It seems to me that to have an implementation where TLB entries
> created on one thread (say T0) are visible to and usable by another
> thread (T1), but a tlbiel on thread T0 does not result in the entry
> being removed from visibility/usability on T1, is a pretty insane
> implementation. I'm not sure that the architecture envisaged allowing
> this kind of implementation, though perhaps the language doesn't
> absolutely prohibit it.
>
> This kind of implementation is what you are allowing for in this
> patch, isn't it?

Not intentionally, and patch 2 removes the possibility.

The main thing it allows is an implementation where TLB entries created
by T1 which are visible only to T1 do not get removed by TLBIEL on T0.

I also have some concern with ordering of in-flight operations (ptesync,
memory ordering, etc) which are mostly avoided with this.

> The sane implementations would be ones where either (a) TLB entries
> are private to each thread and tlbiel only works on the local thread,
> or (b) TLB entries can be shared and tlbiel works across all threads.
> I think this is the conclusion we collectively came to when working on
> that bug we worked on towards the end of last year.

I think an implementation could have both types. So it's hard to get
away from flushing all threads.

If the vCPU ran on T0 then migrated to another core, then before
running a vCPU from the same LPAR on this core again, we could flush
_just_ T0 and that should be fine, but if the vCPU was scheduled onto
T1 first, we would have to flush to catch shared xlates. And if we
flush on T1, then we would still have to flush T0 when that ran the
LPID again to catch the private xlates (but probably would not have to
flush T2 or T3).

We could catch that and optimise it with a shared mask and a private
mask, but it seems like a lot of complexity and memory ordering issues
for unclear gain.

>> While this worked for POWER9, it may not for future implementations, so
>> remove it. A POWER9 specific optimisation would have to have a specific
>> CPU feature to check, if it were to be re-added.
>
> Did you do any measurements of how much performance impact this has on
> POWER9?

I don't think I've really been able to generate enough CPU migrations
to cause a blip. That's not to say it doesn't have an impact, just that
I don't know how to create a worst-case load (short of spamming vCPUs
with sched_setaffinity calls).

> I don't believe this patch will actually be necessary on
> POWER10, so it seems like this patch is just to allow for some
> undefined possible future CPU. It may still be worth putting in for
> the sake of strict architecture compliance if the performance impact
> is minimal.

AFAIK we still have an issue upstream, even with the other known fixes
in, that was fixed by this patch.

Thanks,
Nick
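[Editor's note] To make the conservative scheme under discussion concrete, here is a minimal illustrative sketch, not the patch itself, of marking every hardware thread of the vCPU's previous core as needing a flush instead of collapsing the whole core onto a single bit. The function name and the bare need_tlb_flush parameter are invented for this example; cpu_first_thread_sibling(), threads_per_core and the cpumask helpers are existing kernel interfaces.

```c
/*
 * Illustrative sketch only: before a vCPU from this LPAR next runs on
 * prev_cpu's core, mark every hardware thread of that core as needing
 * a TLB flush.  The helper name is made up for this example.
 */
static void mark_core_needs_flush(cpumask_t *need_tlb_flush, int prev_cpu)
{
        int first = cpu_first_thread_sibling(prev_cpu);
        int i;

        for (i = 0; i < threads_per_core; ++i)
                cpumask_set_cpu(first + i, need_tlb_flush);
}
```

Each thread then tests and clears only its own bit when it next enters the guest, which is what makes the scheme safe whether translations are shared across threads or private to each one.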
On Tue, Feb 09, 2021 at 06:44:53PM +1000, Nicholas Piggin wrote:
> Excerpts from Paul Mackerras's message of February 9, 2021 5:19 pm:
> > On Mon, Jan 18, 2021 at 10:26:08PM +1000, Nicholas Piggin wrote:
> >> Processors that implement ISA v3.0 or later don't necessarily have
> >> threads in a core sharing all translations, and/or TLBIEL does not
> >> necessarily invalidate translations on all other threads (the
> >> architecture talks only about the effect on translations for "the thread
> >> executing the tlbiel instruction".
> >
> > It seems to me that to have an implementation where TLB entries
> > created on one thread (say T0) are visible to and usable by another
> > thread (T1), but a tlbiel on thread T0 does not result in the entry
> > being removed from visibility/usability on T1, is a pretty insane
> > implementation. I'm not sure that the architecture envisaged allowing
> > this kind of implementation, though perhaps the language doesn't
> > absolutely prohibit it.
> >
> > This kind of implementation is what you are allowing for in this
> > patch, isn't it?
>
> Not intentionally, and patch 2 removes the possibility.
>
> The main thing it allows is an implementation where TLB entries created
> by T1 which are visble only to T1 do not get removed by TLBIEL on T0.

I could understand this patch as trying to accommodate both those
implementations where TLB entries are private to each thread, and
those implementations where TLB entries are shared, without needing to
distinguish between them, at the expense of doing unnecessary
invalidations on both kinds of implementation.

> I also have some concern with ordering of in-flight operations (ptesync,
> memory ordering, etc) which are mostly avoided with this.
>
> > The sane implementations would be ones where either (a) TLB entries
> > are private to each thread and tlbiel only works on the local thread,
> > or (b) TLB entries can be shared and tlbiel works across all threads.
> > I think this is the conclusion we collectively came to when working on
> > that bug we worked on towards the end of last year.
>
> I think an implementation could have both types. So it's hard to get
> away from flushing all threads.

Having both private and shared TLB entries in the same implementation
would seem very odd to me. What would determine whether a given entry
is shared or private?

Paul.
Excerpts from Paul Mackerras's message of February 10, 2021 10:39 am:
> On Tue, Feb 09, 2021 at 06:44:53PM +1000, Nicholas Piggin wrote:
>> Excerpts from Paul Mackerras's message of February 9, 2021 5:19 pm:
>> > On Mon, Jan 18, 2021 at 10:26:08PM +1000, Nicholas Piggin wrote:
>> >> Processors that implement ISA v3.0 or later don't necessarily have
>> >> threads in a core sharing all translations, and/or TLBIEL does not
>> >> necessarily invalidate translations on all other threads (the
>> >> architecture talks only about the effect on translations for "the thread
>> >> executing the tlbiel instruction".
>> >
>> > It seems to me that to have an implementation where TLB entries
>> > created on one thread (say T0) are visible to and usable by another
>> > thread (T1), but a tlbiel on thread T0 does not result in the entry
>> > being removed from visibility/usability on T1, is a pretty insane
>> > implementation. I'm not sure that the architecture envisaged allowing
>> > this kind of implementation, though perhaps the language doesn't
>> > absolutely prohibit it.
>> >
>> > This kind of implementation is what you are allowing for in this
>> > patch, isn't it?
>>
>> Not intentionally, and patch 2 removes the possibility.
>>
>> The main thing it allows is an implementation where TLB entries created
>> by T1 which are visble only to T1 do not get removed by TLBIEL on T0.
>
> I could understand this patch as trying to accommodate both those
> implementations where TLB entries are private to each thread, and
> those implementations where TLB entries are shared, without needing to
> distinguish between them, at the expense of doing unnecessary
> invalidations on both kinds of implementation.

That's exactly what it is. Existing code accommodates shared TLBs, this
patch additionally allows for private.

>> I also have some concern with ordering of in-flight operations (ptesync,
>> memory ordering, etc) which are mostly avoided with this.
>>
>> > The sane implementations would be ones where either (a) TLB entries
>> > are private to each thread and tlbiel only works on the local thread,
>> > or (b) TLB entries can be shared and tlbiel works across all threads.
>> > I think this is the conclusion we collectively came to when working on
>> > that bug we worked on towards the end of last year.
>>
>> I think an implementation could have both types. So it's hard to get
>> away from flushing all threads.
>
> Having both private and shared TLB entries in the same implementation
> would seem very odd to me. What would determine whether a given entry
> is shared or private?

Example: an ERAT or L1 TLB per-thread, and a shared L2 TLB behind that.
The L1 may not be PID/LPID tagged so you don't want to cross-invalidate
other threads every time, say.

Thanks,
Nick
On Wed, Feb 10, 2021 at 11:44:54AM +1000, Nicholas Piggin wrote:
> Excerpts from Paul Mackerras's message of February 10, 2021 10:39 am:
> > On Tue, Feb 09, 2021 at 06:44:53PM +1000, Nicholas Piggin wrote:
> >> Excerpts from Paul Mackerras's message of February 9, 2021 5:19 pm:
> >> > On Mon, Jan 18, 2021 at 10:26:08PM +1000, Nicholas Piggin wrote:
> >> >> Processors that implement ISA v3.0 or later don't necessarily have
> >> >> threads in a core sharing all translations, and/or TLBIEL does not
> >> >> necessarily invalidate translations on all other threads (the
> >> >> architecture talks only about the effect on translations for "the thread
> >> >> executing the tlbiel instruction".
> >> >
> >> > It seems to me that to have an implementation where TLB entries
> >> > created on one thread (say T0) are visible to and usable by another
> >> > thread (T1), but a tlbiel on thread T0 does not result in the entry
> >> > being removed from visibility/usability on T1, is a pretty insane
> >> > implementation. I'm not sure that the architecture envisaged allowing
> >> > this kind of implementation, though perhaps the language doesn't
> >> > absolutely prohibit it.
> >> >
> >> > This kind of implementation is what you are allowing for in this
> >> > patch, isn't it?
> >>
> >> Not intentionally, and patch 2 removes the possibility.
> >>
> >> The main thing it allows is an implementation where TLB entries created
> >> by T1 which are visble only to T1 do not get removed by TLBIEL on T0.
> >
> > I could understand this patch as trying to accommodate both those
> > implementations where TLB entries are private to each thread, and
> > those implementations where TLB entries are shared, without needing to
> > distinguish between them, at the expense of doing unnecessary
> > invalidations on both kinds of implementation.
>
> That's exactly what it is. Existing code accommodates shared TLBs, this
> patch additionally allows for private.
>
> >> I also have some concern with ordering of in-flight operations (ptesync,
> >> memory ordering, etc) which are mostly avoided with this.
> >>
> >> > The sane implementations would be ones where either (a) TLB entries
> >> > are private to each thread and tlbiel only works on the local thread,
> >> > or (b) TLB entries can be shared and tlbiel works across all threads.
> >> > I think this is the conclusion we collectively came to when working on
> >> > that bug we worked on towards the end of last year.
> >>
> >> I think an implementation could have both types. So it's hard to get
> >> away from flushing all threads.
> >
> > Having both private and shared TLB entries in the same implementation
> > would seem very odd to me. What would determine whether a given entry
> > is shared or private?
>
> Example: an ERAT or L1 TLB per-thread, and a shared L2 TLB behind that.
> The L1 may not be PID/LPID tagged so you don't want to cross-invalidate
> other threads every time, say.

That's the insane implementation I referred to above, so we're back to
saying we need to allow for that kind of implementation.

Paul.
Excerpts from Paul Mackerras's message of February 10, 2021 12:46 pm:
> On Wed, Feb 10, 2021 at 11:44:54AM +1000, Nicholas Piggin wrote:
>> Excerpts from Paul Mackerras's message of February 10, 2021 10:39 am:
>> > On Tue, Feb 09, 2021 at 06:44:53PM +1000, Nicholas Piggin wrote:
>> >> Excerpts from Paul Mackerras's message of February 9, 2021 5:19 pm:
>> >> > On Mon, Jan 18, 2021 at 10:26:08PM +1000, Nicholas Piggin wrote:
>> >> >> Processors that implement ISA v3.0 or later don't necessarily have
>> >> >> threads in a core sharing all translations, and/or TLBIEL does not
>> >> >> necessarily invalidate translations on all other threads (the
>> >> >> architecture talks only about the effect on translations for "the thread
>> >> >> executing the tlbiel instruction".
>> >> >
>> >> > It seems to me that to have an implementation where TLB entries
>> >> > created on one thread (say T0) are visible to and usable by another
>> >> > thread (T1), but a tlbiel on thread T0 does not result in the entry
>> >> > being removed from visibility/usability on T1, is a pretty insane
>> >> > implementation. I'm not sure that the architecture envisaged allowing
>> >> > this kind of implementation, though perhaps the language doesn't
>> >> > absolutely prohibit it.
>> >> >
>> >> > This kind of implementation is what you are allowing for in this
>> >> > patch, isn't it?
>> >>
>> >> Not intentionally, and patch 2 removes the possibility.
>> >>
>> >> The main thing it allows is an implementation where TLB entries created
>> >> by T1 which are visble only to T1 do not get removed by TLBIEL on T0.
>> >
>> > I could understand this patch as trying to accommodate both those
>> > implementations where TLB entries are private to each thread, and
>> > those implementations where TLB entries are shared, without needing to
>> > distinguish between them, at the expense of doing unnecessary
>> > invalidations on both kinds of implementation.
>>
>> That's exactly what it is. Existing code accommodates shared TLBs, this
>> patch additionally allows for private.
>>
>> >> I also have some concern with ordering of in-flight operations (ptesync,
>> >> memory ordering, etc) which are mostly avoided with this.
>> >>
>> >> > The sane implementations would be ones where either (a) TLB entries
>> >> > are private to each thread and tlbiel only works on the local thread,
>> >> > or (b) TLB entries can be shared and tlbiel works across all threads.
>> >> > I think this is the conclusion we collectively came to when working on
>> >> > that bug we worked on towards the end of last year.
>> >>
>> >> I think an implementation could have both types. So it's hard to get
>> >> away from flushing all threads.
>> >
>> > Having both private and shared TLB entries in the same implementation
>> > would seem very odd to me. What would determine whether a given entry
>> > is shared or private?
>>
>> Example: an ERAT or L1 TLB per-thread, and a shared L2 TLB behind that.
>> The L1 may not be PID/LPID tagged so you don't want to cross-invalidate
>> other threads every time, say.
>
> That's the insane implementation I referred to above, so we're back to
> saying we need to allow for that kind of implementation.

I misunderstood you then. The really insane implementation we were
talking about a couple of months ago is one where a translation can be
brought in by T0, then invalidated with TLBIEL on T0, but it is later
visible to T1 if T1 switches to that context.

A combination of private and shared TLBs I don't see as being insane,
so long as (there is the appearance that):

1. translations are not created on threads that are not running that
   context, and

2. tlbiel invalidates translations visible to that thread.

Thanks,
Nick
```diff
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 2d8627dbd9f6..752daf43f780 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -2588,22 +2588,34 @@ static void radix_flush_cpu(struct kvm *kvm, int cpu, struct kvm_vcpu *vcpu)
 {
         struct kvm_nested_guest *nested = vcpu->arch.nested;
         cpumask_t *cpu_in_guest;
+        cpumask_t *need_tlb_flush;
         int i;
 
-        cpu = cpu_first_thread_sibling(cpu);
         if (nested) {
-                cpumask_set_cpu(cpu, &nested->need_tlb_flush);
+                need_tlb_flush = &nested->need_tlb_flush;
                 cpu_in_guest = &nested->cpu_in_guest;
         } else {
-                cpumask_set_cpu(cpu, &kvm->arch.need_tlb_flush);
+                need_tlb_flush = &kvm->arch.need_tlb_flush;
                 cpu_in_guest = &kvm->arch.cpu_in_guest;
         }
+
+        cpu = cpu_first_thread_sibling(cpu);
+        for (i = 0; i < threads_per_core; ++i)
+                cpumask_set_cpu(cpu + i, need_tlb_flush);
+
         /*
-         * Make sure setting of bit in need_tlb_flush precedes
+         * Make sure setting of bits in need_tlb_flush precedes
          * testing of cpu_in_guest bits.  The matching barrier on
          * the other side is the first smp_mb() in kvmppc_run_core().
          */
         smp_mb();
+
+        /*
+         * Pull vcpus out of guests if necessary, such that they'll notice
+         * the need_tlb_flush bit when they re-enter the guest. If this was
+         * ever a performance concern, it would be interesting to compare
+         * with performance of using TLBIE.
+         */
         for (i = 0; i < threads_per_core; ++i)
                 if (cpumask_test_cpu(cpu + i, cpu_in_guest))
                         smp_call_function_single(cpu + i, do_nothing, NULL, 1);
@@ -2632,18 +2644,18 @@ static void kvmppc_prepare_radix_vcpu(struct kvm_vcpu *vcpu, int pcpu)
          * can move around between pcpus.  To cope with this, when
          * a vcpu moves from one pcpu to another, we need to tell
          * any vcpus running on the same core as this vcpu previously
-         * ran to flush the TLB.  The TLB is shared between threads,
-         * so we use a single bit in .need_tlb_flush for all 4 threads.
+         * ran to flush the TLB.
          */
         if (prev_cpu != pcpu) {
-                if (prev_cpu >= 0 &&
-                    cpu_first_thread_sibling(prev_cpu) !=
-                    cpu_first_thread_sibling(pcpu))
-                        radix_flush_cpu(kvm, prev_cpu, vcpu);
                 if (nested)
                         nested->prev_cpu[vcpu->arch.nested_vcpu_id] = pcpu;
                 else
                         vcpu->arch.prev_cpu = pcpu;
+
+                if (prev_cpu < 0)
+                        return; /* first run */
+
+                radix_flush_cpu(kvm, prev_cpu, vcpu);
         }
 }
 
diff --git a/arch/powerpc/kvm/book3s_hv_builtin.c b/arch/powerpc/kvm/book3s_hv_builtin.c
index f3d3183249fe..dad118760a4e 100644
--- a/arch/powerpc/kvm/book3s_hv_builtin.c
+++ b/arch/powerpc/kvm/book3s_hv_builtin.c
@@ -789,15 +789,6 @@ void kvmppc_check_need_tlb_flush(struct kvm *kvm, int pcpu,
 {
         cpumask_t *need_tlb_flush;
 
-        /*
-         * On POWER9, individual threads can come in here, but the
-         * TLB is shared between the 4 threads in a core, hence
-         * invalidating on one thread invalidates for all.
-         * Thus we make all 4 threads use the same bit.
-         */
-        if (cpu_has_feature(CPU_FTR_ARCH_300))
-                pcpu = cpu_first_thread_sibling(pcpu);
-
         if (nested)
                 need_tlb_flush = &nested->need_tlb_flush;
         else
diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
index 88da2764c1bb..f87237927096 100644
--- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c
+++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
@@ -62,12 +62,6 @@ static int global_invalidates(struct kvm *kvm)
                 smp_wmb();
                 cpumask_setall(&kvm->arch.need_tlb_flush);
                 cpu = local_paca->kvm_hstate.kvm_vcore->pcpu;
-                /*
-                 * On POWER9, threads are independent but the TLB is shared,
-                 * so use the bit for the first thread to represent the core.
-                 */
-                if (cpu_has_feature(CPU_FTR_ARCH_300))
-                        cpu = cpu_first_thread_sibling(cpu);
                 cpumask_clear_cpu(cpu, &kvm->arch.need_tlb_flush);
         }
 
```
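[Editor's note] For context, the consumer side of need_tlb_flush after this change looks roughly like the sketch below: each thread entering the guest tests and clears only its own bit, so no cpu_first_thread_sibling() remapping is needed. This is a simplified rendition of kvmppc_check_need_tlb_flush rather than a verbatim copy of the tree; flush_guest_tlb() stands for the existing local TLBIEL-based flush helper.

```c
void kvmppc_check_need_tlb_flush(struct kvm *kvm, int pcpu,
                                 struct kvm_nested_guest *nested)
{
        cpumask_t *need_tlb_flush;

        if (nested)
                need_tlb_flush = &nested->need_tlb_flush;
        else
                need_tlb_flush = &kvm->arch.need_tlb_flush;

        /* Each thread tests and clears only its own bit. */
        if (cpumask_test_cpu(pcpu, need_tlb_flush)) {
                flush_guest_tlb(kvm);   /* TLBIEL loop on this thread only */

                /* Clear the bit after the TLB flush. */
                smp_mb();
                cpumask_clear_cpu(pcpu, need_tlb_flush);
        }
}
```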
Processors that implement ISA v3.0 or later don't necessarily have
threads in a core sharing all translations, and/or TLBIEL does not
necessarily invalidate translations on all other threads (the
architecture talks only about the effect on translations for "the thread
executing the tlbiel instruction").

While this worked for POWER9, it may not for future implementations, so
remove the shared-TLB optimisation. A POWER9-specific optimisation would
have to have a specific CPU feature to check, if it were to be re-added.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 arch/powerpc/kvm/book3s_hv.c         | 32 ++++++++++++++++++++++++----------
 arch/powerpc/kvm/book3s_hv_builtin.c |  9 ---------
 arch/powerpc/kvm/book3s_hv_rm_mmu.c  |  6 ------
 3 files changed, 22 insertions(+), 25 deletions(-)
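[Editor's note] On the last point of the commit message: if the optimisation were ever re-added for POWER9 only, it would presumably be gated on a dedicated feature bit rather than CPU_FTR_ARCH_300, which also matches later processors. A hypothetical sketch, where CPU_FTR_P9_TLB_SHARED and the helper are invented purely for illustration:

```c
/*
 * Hypothetical sketch only: choose which CPU's need_tlb_flush bit to
 * use.  CPU_FTR_P9_TLB_SHARED is a made-up feature bit; the removed
 * code checked CPU_FTR_ARCH_300, which is also set on processors newer
 * than POWER9.
 */
static int flush_target_cpu(int cpu)
{
        if (cpu_has_feature(CPU_FTR_P9_TLB_SHARED))
                return cpu_first_thread_sibling(cpu);
        return cpu;
}
```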