Message ID: 57C38D0E.1090006@huawei.com
State: New
On Mon, Aug 29, 2016 at 09:17:02AM +0800, Longpeng (Mike) wrote:
> This patch presents virtual L3 cache info for virtual cpus.

Just changing the L3 cache size in the CPUID code will make guests
see a different cache topology after upgrading QEMU (even on live
migration). If you want to change the default, you need to keep the
old values on old machine-types.

In other words, we need to make the cache size configurable, and set
compat_props on PC_COMPAT_2_7.

Other comments below:

>
> Some software algorithms are based on the hardware's cache info. For
> example, in the x86 Linux kernel, when cpu1 wants to wake up a task
> on cpu2, cpu1 will trigger a resched IPI and tell cpu2 to do the
> wakeup if they don't share a low-level cache; conversely, cpu1 will
> access cpu2's runqueue directly if they share the LLC. The relevant
> linux-kernel code is below:
>
>     static void ttwu_queue(struct task_struct *p, int cpu)
>     {
>         struct rq *rq = cpu_rq(cpu);
>         ......
>         if (... && !cpus_share_cache(smp_processor_id(), cpu)) {
>             ......
>             ttwu_queue_remote(p, cpu); /* will trigger RES IPI */
>             return;
>         }
>         ......
>         ttwu_do_activate(rq, p, 0); /* access target's rq directly */
>         ......
>     }
>
> In real hardware, the cpus on the same socket share the L3 cache, so
> one won't trigger resched IPIs when waking up a task on another. But
> QEMU doesn't present virtual L3 cache info to the VM, so the Linux
> guest will trigger lots of RES IPIs under some workloads even if the
> virtual cpus belong to the same virtual socket.
>
> For KVM, this degrades performance, because there will be lots of
> vmexits due to the guest sending IPIs.
>
> The workload is a SAP HANA testsuite; we run it for one round (about
> 40 minutes) and observe the (Suse11sp3) guest's counts of RES IPIs
> triggered during the period:
>
>             No-L3       With-L3 (this patch applied)
> cpu0:       363890      44582
> cpu1:       373405      43109
> cpu2:       340783      43797
> cpu3:       333854      43409
> cpu4:       327170      40038
> cpu5:       325491      39922
> cpu6:       319129      42391
> cpu7:       306480      41035
> cpu8:       161139      32188
> cpu9:       164649      31024
> cpu10:      149823      30398
> cpu11:      149823      32455
> cpu12:      164830      35143
> cpu13:      172269      35805
> cpu14:      179979      33898
> cpu15:      194505      32754
> avg:        268963.6    40129.8
>
> The VM's topology is "1*socket 8*cores 2*threads".
> After presenting virtual L3 cache info to the VM, the number of RES
> IPIs in the guest drops by 85%.

What happens to overall system performance if the VCPU threads
actually run on separate sockets in the host?

Other questions about the code below:

>
> Signed-off-by: Longpeng(Mike) <longpeng2@huawei.com>
> ---
>  target-i386/cpu.c | 34 +++++++++++++++++++++++++++-------
>  1 file changed, 27 insertions(+), 7 deletions(-)
>
> diff --git a/target-i386/cpu.c b/target-i386/cpu.c
> index 6a1afab..5a5fd06 100644
> --- a/target-i386/cpu.c
> +++ b/target-i386/cpu.c
> @@ -57,6 +57,7 @@
>  #define CPUID_2_L1D_32KB_8WAY_64B 0x2c
>  #define CPUID_2_L1I_32KB_8WAY_64B 0x30
>  #define CPUID_2_L2_2MB_8WAY_64B   0x7d
> +#define CPUID_2_L3_12MB_24WAY_64B 0xea
>
>
>  /* CPUID Leaf 4 constants: */
> @@ -131,11 +132,15 @@
>  #define L2_LINES_PER_TAG       1
>  #define L2_SIZE_KB_AMD       512
>
> -/* No L3 cache: */
> -#define L3_SIZE_KB             0 /* disabled */
> -#define L3_ASSOCIATIVITY       0 /* disabled */
> -#define L3_LINES_PER_TAG       0 /* disabled */
> -#define L3_LINE_SIZE           0 /* disabled */
> +/* Level 3 unified cache: */
> +#define L3_LINE_SIZE          64
> +#define L3_ASSOCIATIVITY      24
> +#define L3_SETS             8192
> +#define L3_PARTITIONS          1
> +#define L3_DESCRIPTOR CPUID_2_L3_12MB_24WAY_64B
> +/* FIXME: CPUID leaf 0x80000006 is inconsistent with leaves 2 & 4 */

Why are you intentionally introducing a bug?
> +#define L3_LINES_PER_TAG       1
> +#define L3_SIZE_KB_AMD      1024
>
>  /* TLB definitions: */
>
> @@ -2328,7 +2333,7 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index,
> uint32_t count,

I believe Thunderbird line wrapping broke the patch. Git can't apply
it.

>          }
>          *eax = 1; /* Number of CPUID[EAX=2] calls required */
>          *ebx = 0;
> -        *ecx = 0;
> +        *ecx = (L3_DESCRIPTOR);
>          *edx = (L1D_DESCRIPTOR << 16) | \
>                 (L1I_DESCRIPTOR << 8) | \
>                 (L2_DESCRIPTOR);
> @@ -2374,6 +2379,21 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index,
> uint32_t count,
>          *ecx = L2_SETS - 1;
>          *edx = CPUID_4_NO_INVD_SHARING;
>          break;
> +    case 3: /* L3 cache info */
> +        *eax |= CPUID_4_TYPE_UNIFIED | \
> +                CPUID_4_LEVEL(3) | \
> +                CPUID_4_SELF_INIT_LEVEL;
> +        /*
> +         * According to qemu's APIC ID generating rule, this can make
> +         * sure vcpus on the same vsocket get the same llc_id.
> +         */
> +        *eax |= (cs->nr_cores * cs->nr_threads - 1) << 14;

The above comment doesn't seem to be true.

For example: if nr_cores=9, nr_threads=3, then:

SMT_Mask_Width
  = Log2(RoundToNearestPof2(CPUID.1:EBX[23:16]) /
         (CPUID.(EAX=4,ECX=0):EAX[31:26] + 1))
  = Log2(RoundToNearestPof2(nr_cores * nr_threads) / ((nr_cores - 1) + 1))
  = Log2(RoundToNearestPof2(27) / 9)
  = Log2(32 / 9) = Log2(3) = 2

CoreOnly_Mask_Width
  = Log2(1 + CPUID.(EAX=4,ECX=0):EAX[31:26])
  = Log2(1 + (nr_cores - 1)) = Log2(nr_cores)
  = Log2(9) = 4

CorePlus_Mask_Width = CoreOnly_Mask_Width + SMT_Mask_Width = 4 + 2 = 6

But:

Cache_Mask_Width[3]
  = Log2(RoundToNearestPof2(1 + CPUID.(EAX=4,ECX=n):EAX[25:14]))
  = Log2(RoundToNearestPof2(1 + CPUID.(EAX=4,ECX=3):EAX[25:14]))
  = Log2(RoundToNearestPof2(1 + (nr_cores * nr_threads - 1)))
  = Log2(RoundToNearestPof2(nr_cores * nr_threads))
  = Log2(RoundToNearestPof2(27))
  = Log2(32) = 5

That means the VCPU at package 1, core 8, thread 2 (APIC ID 1100010b)
has Package_ID=1 but its L3 Cache_ID would be 3.
> +        *ebx = (L3_LINE_SIZE - 1) | \
> +               ((L3_PARTITIONS - 1) << 12) | \
> +               ((L3_ASSOCIATIVITY - 1) << 22);
> +        *ecx = L3_SETS - 1;
> +        *edx = CPUID_4_INCLUSIVE | CPUID_4_COMPLEX_IDX;
> +        break;
>      default: /* end of info */
>          *eax = 0;
>          *ebx = 0;
> @@ -2585,7 +2605,7 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index,
> uint32_t count,
>          *ecx = (L2_SIZE_KB_AMD << 16) | \
>                 (AMD_ENC_ASSOC(L2_ASSOCIATIVITY) << 12) | \
>                 (L2_LINES_PER_TAG << 8) | (L2_LINE_SIZE);
> -        *edx = ((L3_SIZE_KB/512) << 18) | \
> +        *edx = ((L3_SIZE_KB_AMD / 512) << 18) | \
>                 (AMD_ENC_ASSOC(L3_ASSOCIATIVITY) << 12) | \
>                 (L3_LINES_PER_TAG << 8) | (L3_LINE_SIZE);
>          break;
> --
> 1.8.3.1
>
Hi Eduardo,

On 2016/8/30 22:25, Eduardo Habkost wrote:
> On Mon, Aug 29, 2016 at 09:17:02AM +0800, Longpeng (Mike) wrote:
>> This patch presents virtual L3 cache info for virtual cpus.
>
> Just changing the L3 cache size in the CPUID code will make guests
> see a different cache topology after upgrading QEMU (even on live
> migration). If you want to change the default, you need to keep the
> old values on old machine-types.
>
> In other words, we need to make the cache size configurable, and set
> compat_props on PC_COMPAT_2_7.
>

Thanks for your suggestions, I will fix it later.

> Other comments below:
>
>> [...]
>> The workload is a SAP HANA testsuite; we run it for one round
>> (about 40 minutes) and observe the (Suse11sp3) guest's counts of
>> RES IPIs triggered during the period:
>>
>>             No-L3       With-L3 (this patch applied)
>> [...]
>> avg:        268963.6    40129.8
>>
>> The VM's topology is "1*socket 8*cores 2*threads".
>> After presenting virtual L3 cache info to the VM, the number of RES
>> IPIs in the guest drops by 85%.
>
> What happens to overall system performance if the VCPU threads
> actually run on separate sockets in the host?

On separate sockets, the worst case is accessing remote memory, but I
think the overhead of the vmexits is heavier. I will do some tests and
give the data in the next version.
>
> Other questions about the code below:
>
>> [...]
>> -/* No L3 cache: */
>> -#define L3_SIZE_KB             0 /* disabled */
>> -#define L3_ASSOCIATIVITY       0 /* disabled */
>> -#define L3_LINES_PER_TAG       0 /* disabled */
>> -#define L3_LINE_SIZE           0 /* disabled */
>> +/* Level 3 unified cache: */
>> +#define L3_LINE_SIZE          64
>> +#define L3_ASSOCIATIVITY      24
>> +#define L3_SETS             8192
>> +#define L3_PARTITIONS          1
>> +#define L3_DESCRIPTOR CPUID_2_L3_12MB_24WAY_64B
>> +/* FIXME: CPUID leaf 0x80000006 is inconsistent with leaves 2 & 4 */
>
> Why are you intentionally introducing a bug?

Please forgive my foolishness; I will fix it.

By the way, the L1/L2 code has the same legacy bugs. Do we need to fix
them too?

>
>> [...]
>> @@ -2328,7 +2333,7 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index,
>> uint32_t count,
>
> I believe Thunderbird line wrapping broke the patch. Git can't apply
> it.
>
>> [...]
>> +    case 3: /* L3 cache info */
>> +        *eax |= CPUID_4_TYPE_UNIFIED | \
>> +                CPUID_4_LEVEL(3) | \
>> +                CPUID_4_SELF_INIT_LEVEL;
>> +        /*
>> +         * According to qemu's APIC ID generating rule, this can make
>> +         * sure vcpus on the same vsocket get the same llc_id.
>> +         */
>> +        *eax |= (cs->nr_cores * cs->nr_threads - 1) << 14;
>
> The above comment doesn't seem to be true.
>
> [...]
>
> That means the VCPU at package 1, core 8, thread 2 (APIC ID
> 1100010b) has Package_ID=1 but its L3 Cache_ID would be 3.
>

Nice! I will fix it.
>
>> [...]
>> --
>> 1.8.3.1
>>
>

Thanks!
On Wed, Aug 31, 2016 at 08:59:41AM +0800, Longpeng (Mike) wrote:
[...]
> >> -/* No L3 cache: */
> >> -#define L3_SIZE_KB             0 /* disabled */
> >> -#define L3_ASSOCIATIVITY       0 /* disabled */
> >> -#define L3_LINES_PER_TAG       0 /* disabled */
> >> -#define L3_LINE_SIZE           0 /* disabled */
> >> +/* Level 3 unified cache: */
> >> +#define L3_LINE_SIZE          64
> >> +#define L3_ASSOCIATIVITY      24
> >> +#define L3_SETS             8192
> >> +#define L3_PARTITIONS          1
> >> +#define L3_DESCRIPTOR CPUID_2_L3_12MB_24WAY_64B
> >> +/* FIXME: CPUID leaf 0x80000006 is inconsistent with leaves 2 & 4 */
> >
> > Why are you intentionally introducing a bug?
>
> Please forgive my foolishness; I will fix it.
>
> By the way, the L1/L2 code has the same legacy bugs. Do we need to
> fix them too?

We want to fix them, but fixing them also requires a mechanism to
configure cache sizes (so we keep compatibility on old machine-types).
This patch presents virtual L3 cache info for virtual cpus.

Some software algorithms are based on the hardware's cache info. For
example, in the x86 Linux kernel, when cpu1 wants to wake up a task on
cpu2, cpu1 will trigger a resched IPI and tell cpu2 to do the wakeup if
they don't share a low-level cache; conversely, cpu1 will access cpu2's
runqueue directly if they share the LLC. The relevant linux-kernel code
is below:

    static void ttwu_queue(struct task_struct *p, int cpu)
    {
        struct rq *rq = cpu_rq(cpu);
        ......
        if (... && !cpus_share_cache(smp_processor_id(), cpu)) {
            ......
            ttwu_queue_remote(p, cpu); /* will trigger RES IPI */
            return;
        }
        ......
        ttwu_do_activate(rq, p, 0); /* access target's rq directly */
        ......
    }

In real hardware, the cpus on the same socket share the L3 cache, so one
won't trigger resched IPIs when waking up a task on another. But QEMU
doesn't present virtual L3 cache info to the VM, so the Linux guest will
trigger lots of RES IPIs under some workloads even if the virtual cpus
belong to the same virtual socket.

For KVM, this degrades performance, because there will be lots of
vmexits due to the guest sending IPIs.

The workload is a SAP HANA testsuite; we run it for one round (about 40
minutes) and observe the (Suse11sp3) guest's counts of RES IPIs
triggered during the period:

            No-L3       With-L3 (this patch applied)
cpu0:       363890      44582
cpu1:       373405      43109
cpu2:       340783      43797
cpu3:       333854      43409
cpu4:       327170      40038
cpu5:       325491      39922
cpu6:       319129      42391
cpu7:       306480      41035
cpu8:       161139      32188
cpu9:       164649      31024
cpu10:      149823      30398
cpu11:      149823      32455
cpu12:      164830      35143
cpu13:      172269      35805
cpu14:      179979      33898
cpu15:      194505      32754
avg:        268963.6    40129.8

The VM's topology is "1*socket 8*cores 2*threads".
After presenting virtual L3 cache info to the VM, the number of RES IPIs
in the guest drops by 85%.

Signed-off-by: Longpeng(Mike) <longpeng2@huawei.com>
---
 target-i386/cpu.c | 34 +++++++++++++++++++++++++++-------
 1 file changed, 27 insertions(+), 7 deletions(-)

diff --git a/target-i386/cpu.c b/target-i386/cpu.c
index 6a1afab..5a5fd06 100644
--- a/target-i386/cpu.c
+++ b/target-i386/cpu.c
@@ -57,6 +57,7 @@
 #define CPUID_2_L1D_32KB_8WAY_64B 0x2c
 #define CPUID_2_L1I_32KB_8WAY_64B 0x30
 #define CPUID_2_L2_2MB_8WAY_64B   0x7d
+#define CPUID_2_L3_12MB_24WAY_64B 0xea


 /* CPUID Leaf 4 constants: */
@@ -131,11 +132,15 @@
 #define L2_LINES_PER_TAG       1
 #define L2_SIZE_KB_AMD       512

-/* No L3 cache: */
-#define L3_SIZE_KB             0 /* disabled */
-#define L3_ASSOCIATIVITY       0 /* disabled */
-#define L3_LINES_PER_TAG       0 /* disabled */
-#define L3_LINE_SIZE           0 /* disabled */
+/* Level 3 unified cache: */
+#define L3_LINE_SIZE          64
+#define L3_ASSOCIATIVITY      24
+#define L3_SETS             8192
+#define L3_PARTITIONS          1
+#define L3_DESCRIPTOR CPUID_2_L3_12MB_24WAY_64B
+/* FIXME: CPUID leaf 0x80000006 is inconsistent with leaves 2 & 4 */
+#define L3_LINES_PER_TAG       1
+#define L3_SIZE_KB_AMD      1024

 /* TLB definitions: */

@@ -2328,7 +2333,7 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count,
         }
         *eax = 1; /* Number of CPUID[EAX=2] calls required */
         *ebx = 0;
-        *ecx = 0;
+        *ecx = (L3_DESCRIPTOR);
         *edx = (L1D_DESCRIPTOR << 16) | \
                (L1I_DESCRIPTOR << 8) | \
                (L2_DESCRIPTOR);
@@ -2374,6 +2379,21 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count,
         *ecx = L2_SETS - 1;
         *edx = CPUID_4_NO_INVD_SHARING;
         break;
+    case 3: /* L3 cache info */
+        *eax |= CPUID_4_TYPE_UNIFIED | \
+                CPUID_4_LEVEL(3) | \
+                CPUID_4_SELF_INIT_LEVEL;
+        /*
+         * According to qemu's APIC ID generating rule, this can make
+         * sure vcpus on the same vsocket get the same llc_id.
+         */
+        *eax |= (cs->nr_cores * cs->nr_threads - 1) << 14;
+        *ebx = (L3_LINE_SIZE - 1) | \
+               ((L3_PARTITIONS - 1) << 12) | \
+               ((L3_ASSOCIATIVITY - 1) << 22);
+        *ecx = L3_SETS - 1;
+        *edx = CPUID_4_INCLUSIVE | CPUID_4_COMPLEX_IDX;
+        break;
     default: /* end of info */
         *eax = 0;
         *ebx = 0;
@@ -2585,7 +2605,7 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count,
         *ecx = (L2_SIZE_KB_AMD << 16) | \
                (AMD_ENC_ASSOC(L2_ASSOCIATIVITY) << 12) | \
                (L2_LINES_PER_TAG << 8) | (L2_LINE_SIZE);
-        *edx = ((L3_SIZE_KB/512) << 18) | \
+        *edx = ((L3_SIZE_KB_AMD / 512) << 18) | \
                (AMD_ENC_ASSOC(L3_ASSOCIATIVITY) << 12) | \
                (L3_LINES_PER_TAG << 8) | (L3_LINE_SIZE);
         break;
--
1.8.3.1