Message ID | 20240328053131.2604454-1-npiggin@gmail.com |
---|---|
Headers | show |
Series | target/ppc: fix tlb flushing race | expand |
On Thu Mar 28, 2024 at 3:31 PM AEST, Nicholas Piggin wrote: > ppc broadcast tlb flushes should be synchronised with other vCPUs, > like all other architectures that support such operations seem to > be doing. > > Fixing ppc removes the last caller of the non-synced TLB flush > variants, we can remove some dead code. I'd like to merge patch 1 > for 9.0, and hold patches 2 and 3 until 9.1 to avoid churn (unless > someone prefers to remove the dead code asap). Hmm, turns out to not be so simple, this in parts reverts the fix in commit 4ddc104689b. Do other architectures that use the _synced TLB flush variants have that same problem with the TLB flush not actually flushing until the TB ends, I wonder? AFAIKS it seems like the right fix would be to use _synced, but force a new TB at the end of the TLB flush instruction so the flush will take effect on all CPUs before the next instruction? In any case this is tricky enough and I only hit it with a test program, so I'll leave it out of 9.0. Thanks, Nick
On Thu Mar 28, 2024 at 6:12 PM AEST, Nicholas Piggin wrote: > On Thu Mar 28, 2024 at 3:31 PM AEST, Nicholas Piggin wrote: > > ppc broadcast tlb flushes should be synchronised with other vCPUs, > > like all other architectures that support such operations seem to > > be doing. > > > > Fixing ppc removes the last caller of the non-synced TLB flush > > variants, we can remove some dead code. I'd like to merge patch 1 > > for 9.0, and hold patches 2 and 3 until 9.1 to avoid churn (unless > > someone prefers to remove the dead code asap). > > Hmm, turns out to not be so simple, this in parts reverts > the fix in commit 4ddc104689b. Do other architectures > that use the _synced TLB flush variants have that same problem > with the TLB flush not actually flushing until the TB ends, > I wonder? Huh, I can reproduce that original problem with a little test case (which I will upstream into kvm-unit-tests). async_run_on_cpu(this_cpu) seems to flush before the next TB, but async_safe_run_on_cpu(this_cpu) does not? How does it execute it without exiting from the TB? In any case, patch 1 to make it _synced, plus the following, seems to close both races. Thanks, Nick --- diff --git a/target/ppc/translate.c b/target/ppc/translate.c index 93ffec787c..c44e0ce687 100644 --- a/target/ppc/translate.c +++ b/target/ppc/translate.c @@ -3495,6 +3495,7 @@ static inline void gen_check_tlb_flush(DisasContext *ctx, bool global) gen_helper_check_tlb_flush_local(tcg_env); } gen_set_label(l); + ctx->base.is_jmp = DISAS_EXIT_UPDATE; } #else static inline void gen_check_tlb_flush(DisasContext *ctx, bool global) { } diff --git a/target/ppc/translate/storage-ctrl-impl.c.inc b/target/ppc/translate/storage-ctrl-impl.c.inc index 74c23a4191..673e754404 100644 --- a/target/ppc/translate/storage-ctrl-impl.c.inc +++ b/target/ppc/translate/storage-ctrl-impl.c.inc @@ -224,6 +224,9 @@ static bool do_tlbie(DisasContext *ctx, arg_X_tlbie *a, bool local) a->prs << TLBIE_F_PRS_SHIFT | a->r << TLBIE_F_R_SHIFT | local << TLBIE_F_LOCAL_SHIFT)); + if (!local) { + ctx->base.is_jmp = DISAS_EXIT_UPDATE; + } return true; #endif
On Thu Mar 28, 2024 at 8:15 PM AEST, Nicholas Piggin wrote: > On Thu Mar 28, 2024 at 6:12 PM AEST, Nicholas Piggin wrote: > > On Thu Mar 28, 2024 at 3:31 PM AEST, Nicholas Piggin wrote: > > > ppc broadcast tlb flushes should be synchronised with other vCPUs, > > > like all other architectures that support such operations seem to > > > be doing. > > > > > > Fixing ppc removes the last caller of the non-synced TLB flush > > > variants, we can remove some dead code. I'd like to merge patch 1 > > > for 9.0, and hold patches 2 and 3 until 9.1 to avoid churn (unless > > > someone prefers to remove the dead code asap). > > > > Hmm, turns out to not be so simple, this in parts reverts > > the fix in commit 4ddc104689b. Do other architectures > > that use the _synced TLB flush variants have that same problem > > with the TLB flush not actually flushing until the TB ends, > > I wonder? > > Huh, I can reproduce that original problem with a little test > case (which I will upstream into kvm-unit-tests). > > async_run_on_cpu(this_cpu) seems to flush before the next TB, but > async_safe_run_on_cpu(this_cpu) does not? How does it execute it > without exiting from the TB? Duh, it's because the non-_synced tlb flush variants don't use that for running on this CPU, they just call it directly. Okay that all makes sense now. I think this series plus the below are good then. Also it's possible some other archs that use _all_cpus_synced() (arm, riscv, s390x) _may_ be racy. I had a quick look at sfence.vma and ipte, and AFAIKS they're supposed to take immediate effect after they execute. Thanks, Nick
On 28/3/24 11:37, Nicholas Piggin wrote: > On Thu Mar 28, 2024 at 8:15 PM AEST, Nicholas Piggin wrote: >> On Thu Mar 28, 2024 at 6:12 PM AEST, Nicholas Piggin wrote: >>> On Thu Mar 28, 2024 at 3:31 PM AEST, Nicholas Piggin wrote: >>>> ppc broadcast tlb flushes should be synchronised with other vCPUs, >>>> like all other architectures that support such operations seem to >>>> be doing. >>>> >>>> Fixing ppc removes the last caller of the non-synced TLB flush >>>> variants, we can remove some dead code. I'd like to merge patch 1 >>>> for 9.0, and hold patches 2 and 3 until 9.1 to avoid churn (unless >>>> someone prefers to remove the dead code asap). >>> >>> Hmm, turns out to not be so simple, this in parts reverts >>> the fix in commit 4ddc104689b. Please mention that in the patch. > Do other architectures >>> that use the _synced TLB flush variants have that same problem >>> with the TLB flush not actually flushing until the TB ends, >>> I wonder? >> >> Huh, I can reproduce that original problem with a little test >> case (which I will upstream into kvm-unit-tests). >> >> async_run_on_cpu(this_cpu) seems to flush before the next TB, but >> async_safe_run_on_cpu(this_cpu) does not? How does it execute it >> without exiting from the TB? > > Duh, it's because the non-_synced tlb flush variants don't use > that for running on this CPU, they just call it directly. > > Okay that all makes sense now. I think this series plus the > below are good then. Also it's possible some other archs that > use _all_cpus_synced() (arm, riscv, s390x) _may_ be racy. I > had a quick look at sfence.vma and ipte, and AFAIKS they're > supposed to take immediate effect after they execute. > > Thanks, > Nick >