Message ID | cover.1716763435.git.balaton@eik.bme.hu |
---|---|
Series | Remaining MMU clean up patches |
On Mon, 27 May 2024, BALATON Zoltan wrote:
> This is the rest of the MMU clean up series the first part of which
> was merged. Here are the remaining patches rebased and some more added.

Besides cleaning up this code my other goal with this and previous
already merged series was trying to optimise this a bit. Here are some
numbers I've got with 9.0 compared to after this series. I've got these
running old benchmarks that do a lot of memory accesses under AmigaOS on
sam460ex and amigaone machines. The speed up is not much but measurable.

*** sam460ex:

QEMU v9.0.0
===========

Sieve of Eratosthenes (Scaled to 10 Iterations)
Version 1.2, 03 April 1992

Array Size    Number Last Prime     Linear     RunTime    MIPS
   (Bytes) of Primes             Time(sec)       (Sec)
      8191      1899      16381   0.001068    0.001068  1552.2
     10000      2261      19997   0.001304    0.001373  1479.5
     20000      4202      39989   0.002608    0.002747  1498.7
     40000      7836      79999   0.005216    0.005493  1517.0
     80000     14683     160001   0.010432    0.010681  1578.4
    160000     27607     319993   0.020864    0.241089   141.4
    320000     52073     639997   0.041728    0.799561    86.2
    640000     98609    1279997   0.083457    2.019043    68.9
   1280000    187133    2559989   0.166913    4.824219    58.2
   2560000    356243    5119997   0.333827   11.142578    50.9
   5120000    679460   10239989   0.667654   25.488281    44.9
  10240000   1299068   20479999   1.335307   55.898438    41.2
  20480000   2488465   40960001   2.670614  125.625000    37.0

Average RunTime = 0.020496 (sec)
High MIPS = 1578.4
Low MIPS = 37.0

STREAM version $Revision: 5.10 $
Your clock granularity/precision appears to be 3 microseconds.
Each test below will take on the order of 228749 microseconds.
   (= 76249 clock ticks)

Function    Best Rate MB/s    Avg time    Min time    Max time
Copy:               1703.0    0.097710    0.093951    0.107887
Scale:               706.9    0.233270    0.226353    0.244723
Add:                 862.8    0.288949    0.278165    0.302902
Triad:               763.4    0.318530    0.314394    0.324167

QEMU master + this series:
==========================

Sieve of Eratosthenes (Scaled to 10 Iterations)
Version 1.2, 03 April 1992

Array Size    Number Last Prime     Linear     RunTime    MIPS
   (Bytes) of Primes             Time(sec)       (Sec)
      8191      1899      16381   0.001068    0.001068  1552.2
     10000      2261      19997   0.001304    0.001259  1614.0
     20000      4202      39989   0.002608    0.002518  1635.0
     40000      7836      79999   0.005216    0.005493  1517.0
     80000     14683     160001   0.010432    0.011597  1453.8
    160000     27607     319993   0.020864    0.218506   156.0
    320000     52073     639997   0.041728    0.747070    92.2
    640000     98609    1279997   0.083457    1.887207    73.7
   1280000    187133    2559989   0.166913    4.633789    60.6
   2560000    356243    5119997   0.333827   10.449219    54.2
   5120000    679460   10239989   0.667654   23.750000    48.1
  10240000   1299068   20479999   1.335307   52.109375    44.2
  20480000   2488465   40960001   2.670614  117.031250    39.7

Relative to 10 Iterations and the 8191 Array Size:
Average RunTime = 0.019190 (sec)
High MIPS = 1635.0
Low MIPS = 39.7

STREAM version $Revision: 5.10 $
Your clock granularity/precision appears to be 3 microseconds.
Each test below will take on the order of 189202 microseconds.
   (= 63067 clock ticks)

Function    Best Rate MB/s    Avg time    Min time    Max time
Copy:               1730.8    0.096015    0.092444    0.107926
Scale:               784.4    0.208925    0.203969    0.217610
Add:                1002.6    0.243805    0.239389    0.254176
Triad:               768.9    0.322526    0.312154    0.345777

*** amigaone:

QEMU v9.0.0
===========

Sieve of Eratosthenes (Scaled to 10 Iterations)
Version 1.2, 03 April 1992

Array Size    Number Last Prime     Linear     RunTime    MIPS
   (Bytes) of Primes             Time(sec)       (Sec)
      8191      1899      16381   0.001068    0.001068  1552.2
     10000      2261      19997   0.001304    0.001259  1614.0
     20000      4202      39989   0.002608    0.002747  1498.7
     40000      7836      79999   0.005216    0.005493  1517.0
     80000     14683     160001   0.010432    0.010986  1534.6
    160000     27607     319993   0.020864    0.023193  1469.7
    320000     52073     639997   0.041728    0.047607  1446.9
    640000     98609    1279997   0.083457    0.100098  1389.9
   1280000    187133    2559989   0.166913    0.200195  1403.0
   2560000    356243    5119997   0.333827    0.400391  1415.6
   5120000    679460   10239989   0.667654    1.484375   770.2
  10240000   1299068   20479999   1.335307    1.679688  1372.5
  20480000   2488465   40960001   2.670614    6.796875   683.7

Relative to 10 Iterations and the 8191 Array Size:
Average RunTime = 0.001397 (sec)
High MIPS = 1614.0
Low MIPS = 683.7

STREAM version $Revision: 5.10 $
Your clock granularity/precision appears to be 2 microseconds.
Each test below will take on the order of 203076 microseconds.
   (= 101538 clock ticks)

Function    Best Rate MB/s    Avg time    Min time    Max time
Copy:               2529.4    0.067538    0.063255    0.079943
Scale:               885.4    0.187032    0.180708    0.194940
Add:                1119.5    0.226545    0.214384    0.246212
Triad:               959.5    0.260417    0.250131    0.281227

QEMU master + this series:
==========================

Sieve of Eratosthenes (Scaled to 10 Iterations)
Version 1.2, 03 April 1992

Array Size    Number Last Prime     Linear     RunTime    MIPS
   (Bytes) of Primes             Time(sec)       (Sec)
      8191      1899      16381   0.001068    0.001068  1552.2
     10000      2261      19997   0.001304    0.001373  1479.5
     20000      4202      39989   0.002608    0.002518  1635.0
     40000      7836      79999   0.005216    0.005798  1437.2
     80000     14683     160001   0.010432    0.010986  1534.6
    160000     27607     319993   0.020864    0.021973  1551.3
    320000     52073     639997   0.041728    0.046387  1485.0
    640000     98609    1279997   0.083457    0.100098  1389.9
   1280000    187133    2559989   0.166913    0.200195  1403.0
   2560000    356243    5119997   0.333827    0.400391  1415.6
   5120000    679460   10239989   0.667654    0.859375  1330.4
  10240000   1299068   20479999   1.335307    3.085938   747.1
  20480000   2488465   40960001   2.670614    6.562500   708.1

Relative to 10 Iterations and the 8191 Array Size:
Average RunTime = 0.001397 (sec)
High MIPS = 1635.0
Low MIPS = 708.1

STREAM version $Revision: 5.10 $
Your clock granularity/precision appears to be 2 microseconds.
Each test below will take on the order of 168572 microseconds.
   (= 84286 clock ticks)

Function    Best Rate MB/s    Avg time    Min time    Max time
Copy:               2410.2    0.076613    0.066384    0.127486
Scale:              1007.6    0.164015    0.158791    0.175446
Add:                1236.3    0.203815    0.194123    0.216319
Triad:               967.6    0.262833    0.248042    0.281844

There is some variation between results between multiple runs but the
optimised version seems to run a bit faster and it should be more
readable code than it was before. It could be possible to improve it
further but I stop here for now.
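
As a rough aid to reading the STREAM numbers above, the relative change in "Best Rate MB/s" between QEMU v9.0.0 and master + this series can be worked out with a few lines of Python. This is only a sketch and not part of the original measurements: the names `STREAM_MBPS`, `percent_change` and `summarize` are made up for illustration, and the figures are copied by hand from the output above. Since the results vary between runs, the percentages should be read as indicative only.

```python
# STREAM "Best Rate MB/s" figures copied from the benchmark output above.
# Each tuple is (QEMU v9.0.0, QEMU master + this series).
STREAM_MBPS = {
    "sam460ex": {
        "Copy": (1703.0, 1730.8),
        "Scale": (706.9, 784.4),
        "Add": (862.8, 1002.6),
        "Triad": (763.4, 768.9),
    },
    "amigaone": {
        "Copy": (2529.4, 2410.2),
        "Scale": (885.4, 1007.6),
        "Add": (1119.5, 1236.3),
        "Triad": (959.5, 967.6),
    },
}


def percent_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


def summarize(results):
    """Map machine -> STREAM function -> percent change, rounded to 0.1."""
    return {
        machine: {
            func: round(percent_change(before, after), 1)
            for func, (before, after) in rows.items()
        }
        for machine, rows in results.items()
    }


if __name__ == "__main__":
    for machine, rows in summarize(STREAM_MBPS).items():
        for func, pct in rows.items():
            # Positive values mean a higher rate after the series.
            print(f"{machine:9s} {func:6s} {pct:+6.1f}%")
```

A positive percentage means the rate went up with the series applied; a negative one (as for Copy on amigaone in this particular run) means it went down.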

The sam460ex seems to be slower because TLB misses generate an exception
on embedded PPC, and exceptions are slow on QEMU (not only because of
needing to go through guest code but generally; I've also seen this with
workloads that do a lot of syscalls, but I don't have measurements of
that now). The amigaone with a G4 CPU uses the hash MMU, which can access
the needed data from guest memory without an exception, so it can keep
running faster with TLB misses.

Regards,
BALATON Zoltan

On Mon, 27 May 2024, BALATON Zoltan wrote:
> This is the rest of the MMU clean up series the first part of which
> was merged. Here are the remaining patches rebased and some more added.

Ping?

> Regards,
> BALATON Zoltan
>
> BALATON Zoltan (43):
>   target/ppc: Reorganise and rename ppc_hash32_pp_prot()
>   target/ppc/mmu_common.c: Remove local name for a constant
>   target/ppc/mmu_common.c: Remove single use local variable
>   target/ppc/mmu_common.c: Remove single use local variable
>   target/ppc/mmu_common.c: Remove another single use local variable
>   target/ppc/mmu_common.c: Remove yet another single use local variable
>   target/ppc/mmu_common.c: Return directly in ppc6xx_tlb_pte_check()
>   target/ppc/mmu_common.c: Simplify ppc6xx_tlb_pte_check()
>   target/ppc/mmu_common.c: Remove unused field from mmu_ctx_t
>   target/ppc/mmu_common.c: Remove hash field from mmu_ctx_t
>   target/ppc/mmu_common.c: Remove pte_update_flags()
>   target/ppc/mmu_common.c: Remove nx field from mmu_ctx_t
>   target/ppc/mmu_common.c: Convert local variable to bool
>   target/ppc/mmu_common.c: Remove single use local variable
>   target/ppc/mmu_common.c: Simplify a switch statement
>   target/ppc/mmu_common.c: Inline and remove ppc6xx_tlb_pte_check()
>   target/ppc/mmu_common.c: Remove ptem field from mmu_ctx_t
>   target/ppc: Add function to get protection key for hash32 MMU
>   target/ppc/mmu-hash32.c: Inline and remove ppc_hash32_pte_prot()
>   target/ppc/mmu_common.c: Init variable in function that relies on it
>   target/ppc/mmu_common.c: Remove key field from mmu_ctx_t
>   target/ppc/mmu_common.c: Stop using ctx in ppc6xx_tlb_check()
>   target/ppc/mmu_common.c: Rename function parameter
>   target/ppc/mmu_common.c: Use defines instead of numeric constants
>   target/ppc: Remove bat_size_prot()
>   target/ppc/mmu_common.c: Stop using ctx in get_bat_6xx_tlb()
>   target/ppc/mmu_common.c: Remove mmu_ctx_t
>   target/ppc/mmu-hash32.c: Inline and remove ppc_hash32_pte_raddr()
>   target/ppc/mmu-hash32.c: Move get_pteg_offset32() to the header
>   target/ppc: Unexport some functions from mmu-book3s-v3.h
>   target/ppc/mmu-radix64: Remove externally unused parts from header
>   target/ppc: Remove includes from mmu-book3s-v3.h
>   target/ppc: Remove single use static inline function
>   target/ppc/internal.h: Consolidate ifndef CONFIG_USER_ONLY blocks
>   target/ppc/mmu-hash32.c: Change parameter type of
>     ppc_hash32_bat_lookup()
>   target/ppc/mmu-hash32: Remove some static inlines from header
>   target/ppc/mmu-hash32.c: Return and use pte address instead of base +
>     offset
>   target/ppc/mmu-hash32.c: Use pte address as parameter instead of
>     offset
>   target/ppc: Change parameter type of some inline functions
>   target/ppc: Change parameter type of ppc64_v3_radix()
>   target/ppc: Change MMU xlate functions to take CPUState
>   target/ppc/mmu-hash32.c: Change parameter type of ppc_hash32_set_[rc]
>   target/ppc/mmu-hash32.c: Change parameter type of
>     ppc_hash32_direct_store
>
>  hw/ppc/spapr_rtas.c                 |   2 +-
>  hw/ppc/spapr_vhyp_mmu.c             |  21 +-
>  target/ppc/internal.h               |  34 +--
>  target/ppc/mmu-book3s-v3.c          |   1 -
>  target/ppc/mmu-book3s-v3.h          |  47 +---
>  target/ppc/mmu-booke.c              |   5 +-
>  target/ppc/mmu-booke.h              |   2 +-
>  target/ppc/mmu-hash32.c             | 165 ++++--------
>  target/ppc/mmu-hash32.h             |  86 +++---
>  target/ppc/mmu-hash64.c             |  54 +++-
>  target/ppc/mmu-hash64.h             |   3 +-
>  target/ppc/mmu-radix64.c            |  57 +++-
>  target/ppc/mmu-radix64.h            |  55 +---
>  target/ppc/mmu_common.c             | 405 ++++++++++------------------
>  target/ppc/mmu_helper.c             |   9 +-
>  target/ppc/translate/vsx-impl.c.inc |   6 +-
>  16 files changed, 376 insertions(+), 576 deletions(-)
>