Message ID: 1242233551-3369-1-git-send-email-hong.pham@windriver.com
State: Changes Requested
Delegated to: David Miller
From: "Hong H. Pham" <hong.pham@windriver.com>
Date: Wed, 13 May 2009 12:52:31 -0400

> irq_choose_cpu() should compare the affinity mask against cpu_online_map
> rather than CPU_MASK_ALL, since irq_select_affinity() sets the interrupt's
> affinity mask to cpu_online_map "and" CPU_MASK_ALL (which ends up being
> just cpu_online_map).  The mask comparison in irq_choose_cpu() will always
> fail since the two masks are not the same.  So the CPU chosen is the first
> CPU in the intersection of cpu_online_map and CPU_MASK_ALL, which is
> always CPU0.  That means all interrupts are reassigned to CPU0...
>
> Distributing interrupts to CPUs in a linearly increasing round robin
> fashion is not optimal for the UltraSPARC T1/T2.  Also, the irq_rover in
> irq_choose_cpu() causes an interrupt to be assigned to a different
> processor each time the interrupt is allocated and released.  This may
> lead to an unbalanced distribution over time.
>
> A static mapping of interrupts to processors is done to optimize and
> balance interrupt distribution.  For the T1/T2, interrupts are spread to
> different cores first, and then to strands within a core.
>
> The following are benchmarks showing the effects of interrupt
> distribution on a T2.  The test was done with iperf using a pair of
> T5220 boxes, each with a 10GbE NIU (XAUI) connected back to back.
>
>  TCP     | Stock      Linear RR IRQ  Optimized IRQ
>  Streams | 2.6.30-rc5 Distribution   Distribution
>          | GBits/sec  GBits/sec      GBits/sec
>  --------+-----------------------------------------
>      1     0.839      0.862          0.868
>      8     1.16       4.96           5.88
>     16     1.15       6.40           8.04
>    100     1.09       7.28           8.68
>
> Signed-off-by: Hong H. Pham <hong.pham@windriver.com>

I like this patch a lot, but it's going to do the wrong thing on
virtualized guests.

There is absolutely no connection between virtual cpu numbers and the
hierarchy in which they sit in the cores and higher level hierarchy of
the processor.  So you can't just say (cpu_id / 4) is the core number
or anything like that.
You must use the machine description to determine this kind of
information, just as we do in arch/sparc/kernel/mdesc.c to figure out
the CPU scheduler grouping maps (see mark_proc_ids() and
mark_core_ids()).

This will also allow your code to transparently work on ROCK and other
future cpus without any changes.

I'm happy to apply this patch once you change it to use the MDESC
properly to probe the cpu hierarchy information.
--
To unsubscribe from this list: send the line "unsubscribe sparclinux" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
From: David Miller <davem@davemloft.net>
Date: Thu, 21 May 2009 17:14:24 -0700 (PDT)

> I'm happy to apply this patch once you change it to use the MDESC
> properly to probe the cpu hierarchy information.

BTW, you could also use the precomputed scheduler grouping cpu masks
in your distribution table building too.
Hi,

Here's a revised patch to fix and optimize interrupt distribution.  The
major change since the last patch is that a tree representation of the
CPU hierarchy is built from the per CPU cpu_data.  Each iteration
through the CPU tree picks the next optimal CPU.  The following are
example CPU distribution maps for various Niagara2/2+ machines.

T5220 (64 cpus)
{  0  8 16 24 32 40 48 56   4 12 20 28 36 44 52 60
   1  9 17 25 33 41 49 57   5 13 21 29 37 45 53 61
   2 10 18 26 34 42 50 58   6 14 22 30 38 46 54 62
   3 11 19 27 35 43 51 59   7 15 23 31 39 47 55 63 }

T5440 (2 way, 96 cpus)
{  0  8 16 24 32 40  72  80  88  96 104 112
   4 12 20 28 36 44  76  84  92 100 108 116
   1  9 17 25 33 41  73  81  89  97 105 113
   5 13 21 29 37 45  77  85  93 101 109 117
   2 10 18 26 34 42  74  82  90  98 106 114
   6 14 22 30 38 46  78  86  94 102 110 118
   3 11 19 27 35 43  75  83  91  99 107 115
   7 15 23 31 39 47  79  87  95 103 111 119 }

LDOM (on a T5220)
{ 0 3 1 4 2 5 0 6 }

An assumption used when building the CPU tree is that cpu_data is
sorted by node, core_id, and proc_id (in order of significance).  This
is the case for the Niagara2 machines I have available.  If this isn't
true for all sparc64 machines, a copy of cpu_data would need to be
sorted prior to building the CPU tree.

Regards,
Hong
From: "Hong H. Pham" <hong.pham@windriver.com>
Date: Wed, 3 Jun 2009 12:41:01 -0400

> Here's a revised patch to fix and optimize interrupt distribution.  The
> major change since the last patch is that a tree representation of the
> CPU hierarchy is built from the per CPU cpu_data.  Each iteration
> through the CPU tree picks the next optimal CPU.  The following are
> example CPU distribution maps for various Niagara2/2+ machines.
>
> T5220 (64 cpus)
> {  0  8 16 24 32 40 48 56   4 12 20 28 36 44 52 60
>    1  9 17 25 33 41 49 57   5 13 21 29 37 45 53 61
>    2 10 18 26 34 42 50 58   6 14 22 30 38 46 54 62
>    3 11 19 27 35 43 51 59   7 15 23 31 39 47 55 63 }
>
> T5440 (2 way, 96 cpus)
> {  0  8 16 24 32 40  72  80  88  96 104 112
>    4 12 20 28 36 44  76  84  92 100 108 116
>    1  9 17 25 33 41  73  81  89  97 105 113
>    5 13 21 29 37 45  77  85  93 101 109 117
>    2 10 18 26 34 42  74  82  90  98 106 114
>    6 14 22 30 38 46  78  86  94 102 110 118
>    3 11 19 27 35 43  75  83  91  99 107 115
>    7 15 23 31 39 47  79  87  95 103 111 119 }
>
> LDOM (on a T5220)
> { 0 3 1 4 2 5 0 6 }

This looks great!

> An assumption used when building the CPU tree is that cpu_data is
> sorted by node, core_id, and proc_id (in order of significance).  This
> the case for the Niagara2 machines I have available.  If this isn't
> true for all sparc64 machines, a copy of cpu_data would need to be
> sorted prior to building the CPU tree.

The MDESC and OF cpu scanners allocate the node, core_id, and proc_ids
linearly as the cpu's are scanned linearly, so this should be OK at
least for now.
diff --git a/arch/sparc/kernel/Makefile b/arch/sparc/kernel/Makefile
index 54742e5..47029c6 100644
--- a/arch/sparc/kernel/Makefile
+++ b/arch/sparc/kernel/Makefile
@@ -53,8 +53,9 @@
 obj-$(CONFIG_SPARC64) += hvapi.o
 obj-$(CONFIG_SPARC64) += sstate.o
 obj-$(CONFIG_SPARC64) += mdesc.o
 obj-$(CONFIG_SPARC64) += pcr.o
 obj-$(CONFIG_SPARC64) += nmi.o
+obj-$(CONFIG_SPARC64_SMP) += cpumap.o
 # sparc32 do not use GENERIC_HARDIRQS but uses the generic devres implementation
 obj-$(CONFIG_SPARC32) += devres.o
 devres-y := ../../../kernel/irq/devres.o
diff --git a/arch/sparc/kernel/cpumap.c b/arch/sparc/kernel/cpumap.c
new file mode 100644
index 0000000..0b1dce7
--- /dev/null
+++ b/arch/sparc/kernel/cpumap.c
@@ -0,0 +1,110 @@
+/* cpumap.c: used for optimizing CPU assignment
+ *
+ * Copyright (C) 2009 Hong H. Pham <hong.pham@windriver.com>
+ */
+
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/init.h>
+#include <linux/cpumask.h>
+#include <linux/spinlock.h>
+#include "cpumap.h"
+
+
+static u16 cpu_distribution_map[NR_CPUS];
+static int cpu_map_entries = 0;
+static DEFINE_SPINLOCK(cpu_map_lock);
+
+
+static int strands_per_core(void)
+{
+	int n;
+
+	switch (sun4v_chip_type) {
+	case SUN4V_CHIP_NIAGARA1:
+		n = 4;
+		break;
+
+	case SUN4V_CHIP_NIAGARA2:
+		n = 8;
+		break;
+
+	default:
+		n = 1;
+		break;
+	}
+	return n;
+}
+
+static int iterate_cpu(unsigned int index)
+{
+	static unsigned int num_cpus = 0;
+	static unsigned int num_cores = 0;
+	unsigned int strand, s_per_core;
+
+	s_per_core = strands_per_core();
+
+	/* num_cpus must be a multiple of strands_per_core. */
+	if (unlikely(num_cores == 0)) {
+		num_cpus = num_possible_cpus();
+		num_cores = ((num_cpus / s_per_core) +
+			     (num_cpus % s_per_core ? 1 : 0));
+		num_cpus = num_cores * s_per_core;
+	}
+
+	strand = (index * s_per_core) / num_cpus;
+
+	/* Optimize for the T2.  Each core in the T2 has two instruction
+	 * pipelines.  Stagger the CPU distribution across different cores
+	 * first, and then across different pipelines.
+	 */
+	if (sun4v_chip_type == SUN4V_CHIP_NIAGARA2) {
+		if ((index / num_cores) & 0x01)
+			strand = s_per_core - strand;
+	}
+
+	return ((index * s_per_core) % num_cpus) + strand;
+}
+
+void cpu_map_init(void)
+{
+	int i, cpu, cpu_rover = 0;
+	unsigned long flag;
+
+	spin_lock_irqsave(&cpu_map_lock, flag);
+	for (i = 0; i < num_online_cpus(); i++) {
+		do {
+			cpu = iterate_cpu(cpu_rover++);
+		} while (!cpu_online(cpu));
+
+		cpu_distribution_map[i] = cpu;
+	}
+	cpu_map_entries = i;
+	spin_unlock_irqrestore(&cpu_map_lock, flag);
+}
+
+int map_to_cpu(unsigned int index)
+{
+	unsigned int mapped_cpu;
+	unsigned long flag;
+
+	spin_lock_irqsave(&cpu_map_lock, flag);
+	if (unlikely(cpu_map_entries != num_online_cpus())) {
+		spin_unlock_irqrestore(&cpu_map_lock, flag);
+		cpu_map_init();
+		spin_lock_irqsave(&cpu_map_lock, flag);
+	}
+
+	mapped_cpu = cpu_distribution_map[index % cpu_map_entries];
+#ifdef CONFIG_HOTPLUG_CPU
+	while (!cpu_online(mapped_cpu)) {
+		spin_unlock_irqrestore(&cpu_map_lock, flag);
+		cpu_map_init();
+		spin_lock_irqsave(&cpu_map_lock, flag);
+		mapped_cpu = cpu_distribution_map[index % cpu_map_entries];
+	}
+#endif /* CONFIG_HOTPLUG_CPU */
+	spin_unlock_irqrestore(&cpu_map_lock, flag);
+	return mapped_cpu;
+}
+EXPORT_SYMBOL(map_to_cpu);
diff --git a/arch/sparc/kernel/cpumap.h b/arch/sparc/kernel/cpumap.h
new file mode 100644
index 0000000..524b207
--- /dev/null
+++ b/arch/sparc/kernel/cpumap.h
@@ -0,0 +1,15 @@
+#ifndef _CPUMAP_H
+#define _CPUMAP_H
+
+#ifdef CONFIG_SMP
+extern void cpu_map_init(void);
+extern int map_to_cpu(unsigned int index);
+#else
+#define cpu_map_init() do {} while (0)
+static inline int map_to_cpu(unsigned int index)
+{
+	return raw_smp_processor_id();
+}
+#endif
+
+#endif
diff --git a/arch/sparc/kernel/irq_64.c b/arch/sparc/kernel/irq_64.c
index 5deabe9..b68386d 100644
--- a/arch/sparc/kernel/irq_64.c
+++ b/arch/sparc/kernel/irq_64.c
@@ -44,8 +44,9 @@
 #include <asm/hypervisor.h>
 #include <asm/cacheflush.h>
 
 #include "entry.h"
+#include "cpumap.h"
 
 #define NUM_IVECS	(IMAP_INR + 1)
 
 struct ino_bucket *ivector_table;
@@ -255,37 +256,15 @@ static int irq_choose_cpu(unsigned int virt_irq)
 	cpumask_t mask;
 	int cpuid;
 
 	cpumask_copy(&mask, irq_desc[virt_irq].affinity);
-	if (cpus_equal(mask, CPU_MASK_ALL)) {
-		static int irq_rover;
-		static DEFINE_SPINLOCK(irq_rover_lock);
-		unsigned long flags;
-
-		/* Round-robin distribution... */
-	do_round_robin:
-		spin_lock_irqsave(&irq_rover_lock, flags);
-
-		while (!cpu_online(irq_rover)) {
-			if (++irq_rover >= nr_cpu_ids)
-				irq_rover = 0;
-		}
-		cpuid = irq_rover;
-		do {
-			if (++irq_rover >= nr_cpu_ids)
-				irq_rover = 0;
-		} while (!cpu_online(irq_rover));
-
-		spin_unlock_irqrestore(&irq_rover_lock, flags);
+	if (cpus_equal(mask, cpu_online_map)) {
+		cpuid = map_to_cpu(virt_irq);
 	} else {
 		cpumask_t tmp;
 
 		cpus_and(tmp, cpu_online_map, mask);
-
-		if (cpus_empty(tmp))
-			goto do_round_robin;
-
-		cpuid = first_cpu(tmp);
+		cpuid = cpus_empty(tmp) ? map_to_cpu(virt_irq) : first_cpu(tmp);
 	}
 
 	return cpuid;
 }
diff --git a/arch/sparc/kernel/smp_64.c b/arch/sparc/kernel/smp_64.c
index f7642e5..54906aa 100644
--- a/arch/sparc/kernel/smp_64.c
+++ b/arch/sparc/kernel/smp_64.c
@@ -1314,8 +1314,10 @@ int __cpu_disable(void)
 	ipi_call_lock();
 	cpu_clear(cpu, cpu_online_map);
 	ipi_call_unlock();
 
+	cpu_map_init();
+
 	return 0;
 }
 
 void __cpu_die(unsigned int cpu)
irq_choose_cpu() should compare the affinity mask against cpu_online_map
rather than CPU_MASK_ALL, since irq_select_affinity() sets the
interrupt's affinity mask to cpu_online_map "and" CPU_MASK_ALL (which
ends up being just cpu_online_map).  The mask comparison in
irq_choose_cpu() will always fail since the two masks are not the same,
so the CPU chosen is the first CPU in the intersection of cpu_online_map
and CPU_MASK_ALL, which is always CPU0.  That means all interrupts are
reassigned to CPU0...

Distributing interrupts to CPUs in a linearly increasing round robin
fashion is not optimal for the UltraSPARC T1/T2.  Also, the irq_rover
in irq_choose_cpu() causes an interrupt to be assigned to a different
processor each time the interrupt is allocated and released.  This may
lead to an unbalanced distribution over time.

A static mapping of interrupts to processors is done to optimize and
balance interrupt distribution.  For the T1/T2, interrupts are spread
to different cores first, and then to strands within a core.

The following are benchmarks showing the effects of interrupt
distribution on a T2.  The test was done with iperf using a pair of
T5220 boxes, each with a 10GbE NIU (XAUI) connected back to back.

 TCP     | Stock      Linear RR IRQ  Optimized IRQ
 Streams | 2.6.30-rc5 Distribution   Distribution
         | GBits/sec  GBits/sec      GBits/sec
 --------+-----------------------------------------
     1     0.839      0.862          0.868
     8     1.16       4.96           5.88
    16     1.15       6.40           8.04
   100     1.09       7.28           8.68

Signed-off-by: Hong H. Pham <hong.pham@windriver.com>
---
 arch/sparc/kernel/Makefile |    1 +
 arch/sparc/kernel/cpumap.c |  110 ++++++++++++++++++++++++++++++++++++++++++++
 arch/sparc/kernel/cpumap.h |   15 ++++++
 arch/sparc/kernel/irq_64.c |   29 ++----
 arch/sparc/kernel/smp_64.c |    2 +
 5 files changed, 132 insertions(+), 25 deletions(-)
 create mode 100644 arch/sparc/kernel/cpumap.c
 create mode 100644 arch/sparc/kernel/cpumap.h