Message ID | 4ABA2DE2.6000601@kernel.org (mailing list archive) |
---|---|
State | Not Applicable |
Tejun Heo wrote:
> Can you please apply the attached patch and see whether anything
> interesting shows up in the kernel log?
>
Thanks Tejun for the debug patch. Attached here are the relevant logs.
The only messages related to percpu in the logs are

<6>PERCPU: Embedded 2 pages/cpu @c000000001200000 s100232 r0 d30840 u524288
<7>pcpu-alloc: s100232 r0 d30840 u524288 alloc=1*1048576
<7>pcpu-alloc: [0] 0 1

The captured logs are with latest git.

Thanks
-Sachin

<4>Crash kernel location must be 0x2000000
<6>Reserving 256MB of memory at 32MB for crashkernel (System RAM: 4096MB)
<6>Phyp-dump disabled at boot time
<6>Using pSeries machine description
<7>Page orders: linear mapping = 16, virtual = 16, io = 12
<6>Using 1TB segments
<4>Found initrd at 0xc000000003500000:0xc000000003ccdf60
<6>bootconsole [udbg0] enabled
<6>Partition configured for 2 cpus.
<6>CPU maps initialized for 2 threads per core
<7> (thread shift is 1)
<4>Starting Linux PPC64 #2 SMP Thu Sep 24 12:59:21 IST 2009
<4>-----------------------------------------------------
<4>ppc64_pft_size = 0x1a
<4>physicalMemorySize = 0x100000000
<4>htab_hash_mask = 0x7ffff
<4>-----------------------------------------------------
<6>Initializing cgroup subsys cpuset
<6>Initializing cgroup subsys cpu
<5>Linux version 2.6.31-git13-autotest (root@mpower6lp5) (gcc version 4.3.2 [gcc-4_3-branch revision 141291] (SUSE Linux) ) #2 SMP Thu Sep 24 12:59:21 IST 2009
<4>[boot]0012 Setup Arch
<7>Node 0 Memory:
<7>Node 2 Memory: 0x0-0xe0000000
<7>Node 3 Memory: 0xe0000000-0x100000000
<4>EEH: No capable adapters found
<6>PPC64 nvram contains 15360 bytes
<7>Using shared processor idle loop
<4>Zone PFN ranges:
<4> DMA 0x00000000 -> 0x00010000
<4> Normal 0x00010000 -> 0x00010000
<4>Movable zone start PFN for each node
<4>early_node_map[2] active PFN ranges
<4> 2: 0x00000000 -> 0x0000e000
<4> 3: 0x0000e000 -> 0x00010000
<4>Could not find start_pfn for node 0
<7>On node 0 totalpages: 0
<7>On node 2 totalpages: 57344
<7> DMA zone: 56 pages used for memmap
<7> DMA zone: 0 pages reserved
<7> DMA zone: 57288 pages, LIFO batch:1
<7>On node 3 totalpages: 8192
<7> DMA zone: 8 pages used for memmap
<7> DMA zone: 0 pages reserved
<7> DMA zone: 8184 pages, LIFO batch:0
<4>[boot]0015 Setup Done
<6>PERCPU: Embedded 2 pages/cpu @c000000001200000 s100232 r0 d30840 u524288
<7>pcpu-alloc: s100232 r0 d30840 u524288 alloc=1*1048576
<7>pcpu-alloc: [0] 0 1
<4>Built 3 zonelists in Node order, mobility grouping on. Total pages: 65472
<4>Policy zone: DMA
<5>Kernel command line: root=/dev/sda3 sysrq=8 insmod=sym53c8xx insmod=ipr crashkernel=512M-:256M xmon=early
<6>PID hash table entries: 4096 (order: -1, 32768 bytes)
<4>freeing bootmem node 2
<4>freeing bootmem node 3
<6>Memory: 3896832k/4194304k available (9728k kernel code, 297472k reserved, 3072k data, 4291k bss, 576k init)
<6>SLUB: Genslabs=18, HWalign=128, Order=0-3, MinObjects=0, CPUs=2, Nodes=16
<6>Hierarchical RCU implementation.
<6>RCU-based detection of stalled CPUs is enabled.
<6>NR_IRQS:512
<4>[boot]0020 XICS Init
<4>[boot]0021 XICS Done
<7>pic: no ISA interrupt controller
<7>time_init: decrementer frequency = 512.000000 MHz
<7>time_init: processor frequency = 4704.000000 MHz
<6>clocksource: timebase mult[7d0000] shift[22] registered
<7>clockevent: decrementer mult[83126e97] shift[32] cpu[0]
<4>Console: colour dummy device 80x25
<6>console [hvc0] enabled, bootconsole disabled
<6>allocated 2621440 bytes of page_cgroup
<6>please try 'cgroup_disable=memory' option if you don't want memory cgroups
<6>Security Framework initialized
<6>SELinux: Disabled at boot.
<6>Dentry cache hash table entries: 524288 (order: 6, 4194304 bytes)
<6>Inode-cache hash table entries: 262144 (order: 5, 2097152 bytes)
<4>Mount-cache hash table entries: 4096
<6>Initializing cgroup subsys ns
<6>Initializing cgroup subsys cpuacct
<6>Initializing cgroup subsys memory
<6>Initializing cgroup subsys devices
<6>Initializing cgroup subsys freezer
<7>irq: irq 2 on host null mapped to virtual irq 16
<7>clockevent: decrementer mult[83126e97] shift[32] cpu[1]
<4>Processor 1 found.
<6>Brought up 2 CPUs
<7>Node 0 CPUs: 0-1
<7>Node 2 CPUs:
<7>Node 3 CPUs:
<7>CPU0 attaching sched-domain:
<7> domain 0: span 0-1 level SIBLING
<7> groups: 0 (cpu_power = 589) 1 (cpu_power = 589)
<7> domain 1: span 0-1 level CPU
<7> groups: 0-1 (cpu_power = 1178)
<7>CPU1 attaching sched-domain:
<7> domain 0: span 0-1 level SIBLING
<7> groups: 1 (cpu_power = 589) 0 (cpu_power = 589)
<7> domain 1: span 0-1 level CPU
<7> groups: 0-1 (cpu_power = 1178)
<6>NET: Registered protocol family 16
<6>IBM eBus Device Driver
<6>POWER6 performance monitor hardware support registered
<6>PCI: Probing PCI hardware
<7>PCI: Probing PCI hardware done
<4>bio: create slab <bio-0> at 0
<6>vgaarb: loaded
<6>usbcore: registered new interface driver usbfs
<6>usbcore: registered new interface driver hub
<6>usbcore: registered new device driver usb
<6>Switching to clocksource timebase
<6>NET: Registered protocol family 2
<6>IP route cache hash table entries: 32768 (order: 2, 262144 bytes)
<6>TCP established hash table entries: 131072 (order: 5, 2097152 bytes)
<6>TCP bind hash table entries: 65536 (order: 5, 2097152 bytes)
<6>TCP: Hash tables configured (established 131072 bind 65536)
<6>TCP reno registered
<6>NET: Registered protocol family 1
<6>Unpacking initramfs...
<7>Switched to high resolution mode on CPU 0
<7>Switched to high resolution mode on CPU 1
<7>irq: irq 655360 on host null mapped to virtual irq 17
<7>irq: irq 655367 on host null mapped to virtual irq 18
<6>IOMMU table initialized, virtual merging enabled
<7>irq: irq 589825 on host null mapped to virtual irq 19
<7>RTAS daemon started
<6>audit: initializing netlink socket (disabled)
<5>type=2000 audit(1253778214.210:1): initialized
<6>Kprobe smoke test started
<6>Kprobe smoke test passed successfully
<6>HugeTLB registered 16 MB page size, pre-allocated 0 pages
<6>HugeTLB registered 16 GB page size, pre-allocated 0 pages
<5>VFS: Disk quotas dquot_6.5.2
<4>Dquot-cache hash table entries: 8192 (order 0, 65536 bytes)
<6>Btrfs loaded
<6>msgmni has been set to 7608
<6>alg: No test for stdrng (krng)
<6>Block layer SCSI generic (bsg) driver version 0.4 loaded (major 254)
<6>io scheduler noop registered
<6>io scheduler anticipatory registered
<6>io scheduler deadline registered
<6>io scheduler cfq registered (default)
<6>pci_hotplug: PCI Hot Plug PCI Core version: 0.5
<6>rpaphp: RPA HOT Plug PCI Controller Driver version: 0.1
<7>vio_register_driver: driver hvc_console registering
<7>HVSI: registered 0 devices
<6>Generic RTC Driver v1.07
<6>Serial: 8250/16550 driver, 4 ports, IRQ sharing disabled
<6>pmac_zilog: 0.6 (Benjamin Herrenschmidt <benh@kernel.crashing.org>)
<6>input: Macintosh mouse button emulation as /devices/virtual/input/input0
<6>Uniform Multi-Platform E-IDE driver
<6>ide-gd driver 1.18
<6>IBM eHEA ethernet device driver (Release EHEA_0102)
<7>irq: irq 590088 on host null mapped to virtual irq 264
<6>ehea: eth0: Jumbo frames are disabled
<6>ehea: eth0 -> logical port id #2
<6>ehci_hcd: USB 2.0 'Enhanced' Host Controller (EHCI) Driver
<6>ohci_hcd: USB 1.1 'Open' Host Controller (OHCI) Driver
<6>mice: PS/2 mouse device common for all mice
<6>EDAC MC: Ver: 2.1.0 Sep 24 2009
<6>usbcore: registered new interface driver hiddev
<6>usbcore: registered new interface driver usbhid
<6>usbhid: v2.6:USB HID core driver
<6>TCP cubic registered
<6>NET: Registered protocol family 15
<4>registered taskstats version 1
<4>Freeing unused kernel memory: 576k freed
<6>SysRq : Changing Loglevel
<4>Loglevel set to 8
<5>SCSI subsystem initialized
<7>vio_register_driver: driver ibmvscsi registering
<6>ibmvscsi 30000007: SRP_VERSION: 16.a
<6>scsi0 : IBM POWER Virtual SCSI Adapter 1.5.8
<6>ibmvscsi 30000007: partner initialization complete
<6>ibmvscsi 30000007: host srp version: 16.a, host partition VIO Server (1), OS 3, max io 1048576
<6>ibmvscsi 30000007: Client reserve enabled
<6>ibmvscsi 30000007: sent SRP login
<6>ibmvscsi 30000007: SRP_LOGIN succeeded
<5>scsi 0:0:1:0: Direct-Access AIX VDASD 0001 PQ: 0 ANSI: 3
<5>scsi 0:0:2:0: CD-ROM AIX VOPTA PQ: 0 ANSI: 4
<6>udevd version 128 started
<5>sd 0:0:1:0: [sda] 146800640 512-byte logical blocks: (75.1 GB/70.0 GiB)
<5>sd 0:0:1:0: [sda] Write Protect is off
<7>sd 0:0:1:0: [sda] Mode Sense: 17 00 00 08
<5>sd 0:0:1:0: [sda] Cache data unavailable
<3>sd 0:0:1:0: [sda] Assuming drive cache: write through
<5>sd 0:0:1:0: [sda] Cache data unavailable
<3>sd 0:0:1:0: [sda] Assuming drive cache: write through
<6> sda: sda1 sda2 sda3
<5>sd 0:0:1:0: [sda] Cache data unavailable
<3>sd 0:0:1:0: [sda] Assuming drive cache: write through
<5>sd 0:0:1:0: [sda] Attached SCSI disk
<6>kjournald starting. Commit interval 5 seconds
<6>EXT3 FS on sda3, internal journal
<6>EXT3-fs: mounted filesystem with writeback data mode.
<6>udevd version 128 started
<5>sd 0:0:1:0: Attached scsi generic sg0 type 0
<5>scsi 0:0:2:0: Attached scsi generic sg1 type 5
<4>sr0: scsi-1 drive
<6>Uniform CD-ROM driver Revision: 3.20
<7>sr 0:0:2:0: Attached scsi CD-ROM sr0
<6>Adding 2096320k swap on /dev/sda2. Priority:-1 extents:1 across:2096320k
<6>device-mapper: uevent: version 1.0.3
<6>device-mapper: ioctl: 4.15.0-ioctl (2009-04-01) initialised: dm-devel@redhat.com
<6>loop: module loaded
<6>fuse init (API version 7.13)
<7>irq: irq 33539 on host null mapped to virtual irq 259
<6>ehea: eth0: Physical port up
<6>ehea: External switch port is backup port
<7>irq: irq 33540 on host null mapped to virtual irq 260
<6>NET: Registered protocol family 10
<3>INFO: RCU detected CPU 0 stall (t=1000 jiffies)

Sachin Sant wrote:
> Tejun Heo wrote:
>> Can you please apply the attached patch and see whether anything
>> interesting shows up in the kernel log?
>>
> Thanks Tejun for the debug patch. Attached here are the relevant logs.
> The only messages related to percpu in the logs are
>
> <6>PERCPU: Embedded 2 pages/cpu @c000000001200000 s100232 r0 d30840 u524288
> <7>pcpu-alloc: s100232 r0 d30840 u524288 alloc=1*1048576
> <7>pcpu-alloc: [0] 0 1
> The captured logs are with latest git.

Hmm... that means it wasn't caused by rogue percpu pointer access.
Please wait a bit. I'll try to reproduce it.

Thanks.
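
The three percpu lines quoted above can be sanity-checked with a little arithmetic. The sketch below assumes that the `s`, `r`, `d` and `u` fields are the static, reserved, dynamic and per-unit sizes in bytes (an assumption about the message format, not something stated in this thread) and that pages are 64K, as the `Page orders: ... virtual = 16` line in the boot log suggests. `struct demo_pcpu_sizes`, `demo_parse_percpu()` and `demo_used_per_cpu()` are names made up for this illustration.

```c
#include <stdio.h>

/* Hypothetical holder for the numbers in the "PERCPU: Embedded" line. */
struct demo_pcpu_sizes {
	unsigned long pages;        /* pages per CPU */
	unsigned long static_size;  /* assumed meaning of "s" */
	unsigned long reserved;     /* assumed meaning of "r" */
	unsigned long dynamic;      /* assumed meaning of "d" */
	unsigned long unit;         /* assumed meaning of "u" */
};

/* Parse the message text (without the <6> level prefix); 0 on success. */
static int demo_parse_percpu(const char *line, struct demo_pcpu_sizes *out)
{
	unsigned long long base; /* address after '@', not used further */
	int n = sscanf(line,
		       "PERCPU: Embedded %lu pages/cpu @%llx s%lu r%lu d%lu u%lu",
		       &out->pages, &base, &out->static_size,
		       &out->reserved, &out->dynamic, &out->unit);

	return n == 6 ? 0 : -1;
}

/* Bytes used per CPU under the assumption above: static + reserved + dynamic. */
static unsigned long demo_used_per_cpu(const struct demo_pcpu_sizes *s)
{
	return s->static_size + s->reserved + s->dynamic;
}
```

For the line in the log, the three sizes add up to 100232 + 0 + 30840 = 131072 bytes, which is exactly 2 pages of 65536 bytes, and the single 1048576-byte allocation holds two 524288-byte units, one per CPU. Under these assumptions the first-chunk layout looks self-consistent.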

Tejun Heo wrote:
> Sachin Sant wrote:
>
>> Tejun Heo wrote:
>>
>>> Can you please apply the attached patch and see whether anything
>>> interesting shows up in the kernel log?
>>>
>>>
>> Thanks Tejun for the debug patch. Attached here are the relevant logs.
>> The only messages related to percpu in the logs are
>>
>> <6>PERCPU: Embedded 2 pages/cpu @c000000001200000 s100232 r0 d30840 u524288
>> <7>pcpu-alloc: s100232 r0 d30840 u524288 alloc=1*1048576
>> <7>pcpu-alloc: [0] 0 1
>> The captured logs are with latest git.
>>
>
> Hmm... that means it wasn't caused by rogue percpu pointer access.
> Please wait a bit. I'll try to reproduce it.
>
I was able to reproduce the hang in a different way. (I still had
IPV6 disabled in my config). I executed the network namespace container
tests from LTP and could reproduce a similar hang. The top three
function calls were the same as with IPV6. Here are the traces
using xmon debugger.

Oops: System Reset, sig: 6 [#4]
SMP NR_CPUS=1024 DEBUG_PAGEALLOC NUMA pSeries
Modules linked in: quota_v2 quota_tree fuse loop dm_mod sg sd_mod crc_t10dif ibmvscsic scsi_transport_srp scsi_tgt scsi_mod
NIP: c00000000003c310 LR: c0000000000055d0 CTR: 0000000000000040
REGS: c0000000fc90f340 TRAP: 0100 Tainted: G D (2.6.31-git13-autotest)
MSR: 8000000000081032 <ME,IR,DR> CR: 28004420 XER: 20000001
TASK = c00000002c408890[8753] 'check_netns_ena' THREAD: c0000000fc90c000 CPU: 2
GPR00: 00000fffffffffff c0000000fc90f5c0 c000000000b8c2a8 d00007fffff00000
GPR04: 0000000000000201 0000000000000300 d00007fffff00000 d00007fffff00000
GPR08: 0000000000000000 000007fffff00000 0000000000000000 0000000000000000
GPR12: 8000000000009032 c000000000c82a00 0000000000000001 c0000000fc90f924
GPR16: 0000000000000300 0000000000000001 c0000000fa8e2380 0000000000000000
GPR20: 0000000000010000 0000000000000001 0000000000000000 0000000000000000
GPR24: c0000000fa9c09c8 0000000000000001 0000000000000001 c0000000faef6f60
GPR28: c000000000c6b620 0000000000000000 c000000000af2aa0 c000000000c6d1b0
NIP [c00000000003c310] .hash_page+0x24/0x4bc
LR [c0000000000055d0] .do_hash_page+0x50/0x6c
Call Trace:
[c0000000fc90f5c0] [c0000000000055d0] .do_hash_page+0x50/0x6c (unreliable)
--- Exception: 301 at .memset+0x60/0xfc
    LR = .pcpu_alloc+0x718/0x8fc
[c0000000fc90f8b0] [c0000000001700dc] .pcpu_alloc+0x6a8/0x8fc (unreliable)
[c0000000fc90f9d0] [c000000000614648] .snmp_mib_init+0x54/0x9c
[c0000000fc90fa60] [c000000000614764] .ipv4_mib_init_net+0xd4/0x1e0
[c0000000fc90fb10] [c0000000005a839c] .setup_net+0x68/0x124
[c0000000fc90fbb0] [c0000000005a8ad0] .copy_net_ns+0x88/0x130
[c0000000fc90fc40] [c0000000000bd5ac] .create_new_namespaces+0x110/0x1d0
[c0000000fc90fce0] [c0000000000bd874] .unshare_nsproxy_namespaces+0x6c/0xe8
[c0000000fc90fd80] [c000000000091ee8] .SyS_unshare+0x13c/0x318
[c0000000fc90fe30] [c0000000000085b4] syscall_exit+0x0/0x40
Instruction dump:
7c0803a6 ebe1fff8 4e800020 78690100 7c0802a6 f8010010 3800ffff fa01ff80
7cb02b78 78000500 fa21ff88 fb61ffd8 <7c912378> fa41ff90 7c7b1b78 fa61ff98

As you can see, the call trace is the same as far as the top three function
calls are concerned [snmp_mib_init(), pcpu_alloc() and memset()].

The snmp_mib_init() function is:

int snmp_mib_init(void *ptr[2], size_t mibsize)
{
	BUG_ON(ptr == NULL);
	ptr[0] = __alloc_percpu(mibsize, __alignof__(unsigned long long));
	if (!ptr[0])
		goto err0;
	ptr[1] = __alloc_percpu(mibsize, __alignof__(unsigned long long));
	if (!ptr[1])
		goto err1;
	return 0;
.....

Maybe this will help.

Thanks
-Sachin

On Thu, 2009-09-24 at 18:53 +0530, Sachin Sant wrote:
> Tejun Heo wrote:
> > Sachin Sant wrote:
> >
> >> Tejun Heo wrote:
> >>
> >>> Can you please apply the attached patch and see whether anything
> >>> interesting shows up in the kernel log?
> >>>
> >>>
> >> Thanks Tejun for the debug patch. Attached here are the relevant logs.
> >> The only messages related to percpu in the logs are
> >>
> >> <6>PERCPU: Embedded 2 pages/cpu @c000000001200000 s100232 r0 d30840 u524288
> >> <7>pcpu-alloc: s100232 r0 d30840 u524288 alloc=1*1048576
> >> <7>pcpu-alloc: [0] 0 1
> >> The captured logs are with latest git.
> >>
> >
> > Hmm... that means it wasn't caused by rogue percpu pointer access.
> > Please wait a bit. I'll try to reproduce it.
> >
> I was able to reproduce the hang in a different way. (I still had
> IPV6 disabled in my config). I executed the network namespace container
> tests from LTP and could reproduce a similar hang. The top three
> function calls were the same as with IPV6. Here are the traces
> using xmon debugger.
>
>
> Oops: System Reset, sig: 6 [#4]
> SMP NR_CPUS=1024 DEBUG_PAGEALLOC NUMA pSeries
> Modules linked in: quota_v2 quota_tree fuse loop dm_mod sg sd_mod crc_t10dif ibmvscsic scsi_transport_srp scsi_tgt scsi_mod
> NIP: c00000000003c310 LR: c0000000000055d0 CTR: 0000000000000040
> REGS: c0000000fc90f340 TRAP: 0100 Tainted: G D (2.6.31-git13-autotest)
> MSR: 8000000000081032 <ME,IR,DR> CR: 28004420 XER: 20000001
> TASK = c00000002c408890[8753] 'check_netns_ena' THREAD: c0000000fc90c000 CPU: 2
> GPR00: 00000fffffffffff c0000000fc90f5c0 c000000000b8c2a8 d00007fffff00000
> GPR04: 0000000000000201 0000000000000300 d00007fffff00000 d00007fffff00000
> GPR08: 0000000000000000 000007fffff00000 0000000000000000 0000000000000000
> GPR12: 8000000000009032 c000000000c82a00 0000000000000001 c0000000fc90f924
> GPR16: 0000000000000300 0000000000000001 c0000000fa8e2380 0000000000000000
> GPR20: 0000000000010000 0000000000000001 0000000000000000 0000000000000000
> GPR24: c0000000fa9c09c8 0000000000000001 0000000000000001 c0000000faef6f60
> GPR28: c000000000c6b620 0000000000000000 c000000000af2aa0 c000000000c6d1b0
> NIP [c00000000003c310] .hash_page+0x24/0x4bc
> LR [c0000000000055d0] .do_hash_page+0x50/0x6c
> Call Trace:
> [c0000000fc90f5c0] [c0000000000055d0] .do_hash_page+0x50/0x6c (unreliable)
> --- Exception: 301 at .memset+0x60/0xfc
>     LR = .pcpu_alloc+0x718/0x8fc

So it's memsetting something that causes it to hash_page(), i.e. faulting
in pages (vmalloc space?). So far nothing obviously wrong...

> [c0000000fc90f8b0] [c0000000001700dc] .pcpu_alloc+0x6a8/0x8fc (unreliable)
> [c0000000fc90f9d0] [c000000000614648] .snmp_mib_init+0x54/0x9c
> [c0000000fc90fa60] [c000000000614764] .ipv4_mib_init_net+0xd4/0x1e0
> [c0000000fc90fb10] [c0000000005a839c] .setup_net+0x68/0x124
> [c0000000fc90fbb0] [c0000000005a8ad0] .copy_net_ns+0x88/0x130
> [c0000000fc90fc40] [c0000000000bd5ac] .create_new_namespaces+0x110/0x1d0
> [c0000000fc90fce0] [c0000000000bd874] .unshare_nsproxy_namespaces+0x6c/0xe8
> [c0000000fc90fd80] [c000000000091ee8] .SyS_unshare+0x13c/0x318
> [c0000000fc90fe30] [c0000000000085b4] syscall_exit+0x0/0x40
> Instruction dump:
> 7c0803a6 ebe1fff8 4e800020 78690100 7c0802a6 f8010010 3800ffff fa01ff80
> 7cb02b78 78000500 fa21ff88 fb61ffd8 <7c912378> fa41ff90 7c7b1b78 fa61ff98
>
> As you can see, the call trace is the same as far as the top three function
> calls are concerned [snmp_mib_init(), pcpu_alloc() and memset()].
>
> The snmp_mib_init() function is:
>
> int snmp_mib_init(void *ptr[2], size_t mibsize)
> {
> 	BUG_ON(ptr == NULL);
> 	ptr[0] = __alloc_percpu(mibsize, __alignof__(unsigned long long));
> 	if (!ptr[0])
> 		goto err0;
> 	ptr[1] = __alloc_percpu(mibsize, __alignof__(unsigned long long));
> 	if (!ptr[1])
> 		goto err1;
> 	return 0;
> .....
>
> Maybe this will help.
>
> Thanks
> -Sachin
>
>
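
A recurring change in the debug patch below is moving the array index outside the accessor, e.g. `per_cpu(priv_arr[type], cpu)` becomes `per_cpu(priv_arr, cpu)[type]`. One visible consequence is that the pointer handed to the verification hook is then the start of the per-cpu variable rather than the address of one element. The user-space sketch below imitates that with GNU C statement expressions (so it needs gcc or clang). `demo_storage`, `demo_verify()`, `demo_offset()` and `demo_per_cpu()` are invented for illustration and are not kernel interfaces; the layout is a deliberate simplification.

```c
#include <stddef.h>

#define DEMO_NR_CPUS 2
#define DEMO_SLOTS   4

/* Row 0 plays the role of the template copy, rows 1..N the per-CPU copies. */
static int demo_storage[1 + DEMO_NR_CPUS][DEMO_SLOTS];
#define per_cpu__demo_arr (demo_storage[0])

/* Stand-in for a verification hook: just remember what it was given. */
static const void *demo_last_checked;

static void demo_verify(const void *ptr)
{
	demo_last_checked = ptr;
}

/* Byte distance from the template copy to the copy of the given CPU. */
static size_t demo_offset(unsigned int cpu)
{
	return (cpu + 1) * sizeof(demo_storage[0]);
}

/*
 * Same shape as the patched per_cpu(): take the address of the named
 * variable, show it to the hook, then shift it to the requested CPU.
 */
#define demo_per_cpu(var, cpu) (*({					\
	__typeof__(&per_cpu__##var) p__ = &per_cpu__##var;		\
	demo_verify(p__);						\
	(__typeof__(p__))((char *)p__ + demo_offset(cpu));		\
}))
```

With this macro, `demo_per_cpu(demo_arr[2], 1)` and `demo_per_cpu(demo_arr, 1)[2]` reach the same integer, but only the second form shows the hook the base address of the array.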

Index: work/arch/ia64/include/asm/sn/arch.h
===================================================================
--- work.orig/arch/ia64/include/asm/sn/arch.h
+++ work/arch/ia64/include/asm/sn/arch.h
@@ -71,8 +71,8 @@ DECLARE_PER_CPU(struct sn_hub_info_s, __
  * Compact node ID to nasid mappings kept in the per-cpu data areas of each
  * cpu.
  */
-DECLARE_PER_CPU(short, __sn_cnodeid_to_nasid[MAX_COMPACT_NODES]);
-#define sn_cnodeid_to_nasid	(&__get_cpu_var(__sn_cnodeid_to_nasid[0]))
+DECLARE_PER_CPU(short [MAX_COMPACT_NODES], __sn_cnodeid_to_nasid);
+#define sn_cnodeid_to_nasid	(&__get_cpu_var(__sn_cnodeid_to_nasid)[0])
 
 extern u8 sn_partition_id;
Index: work/arch/powerpc/mm/stab.c
===================================================================
--- work.orig/arch/powerpc/mm/stab.c
+++ work/arch/powerpc/mm/stab.c
@@ -138,7 +138,7 @@ static int __ste_allocate(unsigned long
 	if (!is_kernel_addr(ea)) {
 		offset = __get_cpu_var(stab_cache_ptr);
 		if (offset < NR_STAB_CACHE_ENTRIES)
-			__get_cpu_var(stab_cache[offset++]) = stab_entry;
+			__get_cpu_var(stab_cache)[offset++] = stab_entry;
 		else
 			offset = NR_STAB_CACHE_ENTRIES+1;
 		__get_cpu_var(stab_cache_ptr) = offset;
@@ -185,7 +185,7 @@ void switch_stab(struct task_struct *tsk
 		int i;
 
 		for (i = 0; i < offset; i++) {
-			ste = stab + __get_cpu_var(stab_cache[i]);
+			ste = stab + __get_cpu_var(stab_cache)[i];
 			ste->esid_data = 0; /* invalidate entry */
 		}
 	} else {
Index: work/arch/x86/kernel/cpu/cpu_debug.c
===================================================================
--- work.orig/arch/x86/kernel/cpu/cpu_debug.c
+++ work/arch/x86/kernel/cpu/cpu_debug.c
@@ -531,7 +531,7 @@ static int cpu_create_file(unsigned cpu,
 
 	/* Already intialized */
 	if (file == CPU_INDEX_BIT)
-		if (per_cpu(cpu_arr[type].init, cpu))
+		if (per_cpu(cpu_arr, cpu)[type].init)
 			return 0;
 
 	priv = kzalloc(sizeof(*priv), GFP_KERNEL);
@@ -543,7 +543,7 @@ static int cpu_create_file(unsigned cpu,
 	priv->reg = reg;
 	priv->file = file;
 	mutex_lock(&cpu_debug_lock);
-	per_cpu(priv_arr[type], cpu) = priv;
+	per_cpu(priv_arr, cpu)[type] = priv;
 	per_cpu(cpu_priv_count, cpu)++;
 	mutex_unlock(&cpu_debug_lock);
 
@@ -552,10 +552,10 @@ static int cpu_create_file(unsigned cpu,
 				    dentry, (void *)priv, &cpu_fops);
 	else {
 		debugfs_create_file(cpu_base[type].name, S_IRUGO,
-				    per_cpu(cpu_arr[type].dentry, cpu),
+				    per_cpu(cpu_arr, cpu)[type].dentry,
 				    (void *)priv, &cpu_fops);
 		mutex_lock(&cpu_debug_lock);
-		per_cpu(cpu_arr[type].init, cpu) = 1;
+		per_cpu(cpu_arr, cpu)[type].init = 1;
 		mutex_unlock(&cpu_debug_lock);
 	}
 
@@ -615,7 +615,7 @@ static int cpu_init_allreg(unsigned cpu,
 		if (!is_typeflag_valid(cpu, cpu_base[type].flag))
 			continue;
 		cpu_dentry = debugfs_create_dir(cpu_base[type].name, dentry);
-		per_cpu(cpu_arr[type].dentry, cpu) = cpu_dentry;
+		per_cpu(cpu_arr, cpu)[type].dentry = cpu_dentry;
 
 		if (type < CPU_TSS_BIT)
 			err = cpu_init_msr(cpu, type, cpu_dentry);
@@ -677,7 +677,7 @@ static void __exit cpu_debug_exit(void)
 
 	for (cpu = 0; cpu < nr_cpu_ids; cpu++)
 		for (i = 0; i < per_cpu(cpu_priv_count, cpu); i++)
-			kfree(per_cpu(priv_arr[i], cpu));
+			kfree(per_cpu(priv_arr, cpu)[i]);
 }
 
 module_init(cpu_debug_init);
Index: work/arch/x86/kernel/cpu/perf_event.c
===================================================================
--- work.orig/arch/x86/kernel/cpu/perf_event.c
+++ work/arch/x86/kernel/cpu/perf_event.c
@@ -1253,7 +1253,7 @@ x86_perf_event_set_period(struct perf_ev
 	if (left > x86_pmu.max_period)
 		left = x86_pmu.max_period;
 
-	per_cpu(pmc_prev_left[idx], smp_processor_id()) = left;
+	per_cpu(pmc_prev_left, smp_processor_id())[idx] = left;
 
 	/*
 	 * The hw event starts counting from this event offset,
@@ -1470,7 +1470,7 @@ void perf_event_print_debug(void)
 		rdmsrl(x86_pmu.eventsel + idx, pmc_ctrl);
 		rdmsrl(x86_pmu.perfctr + idx, pmc_count);
 
-		prev_left = per_cpu(pmc_prev_left[idx], cpu);
+		prev_left = per_cpu(pmc_prev_left, cpu)[idx];
 
 		pr_info("CPU#%d: gen-PMC%d ctrl: %016llx\n",
 			cpu, idx, pmc_ctrl);
Index: work/include/asm-generic/percpu.h
===================================================================
--- work.orig/include/asm-generic/percpu.h
+++ work/include/asm-generic/percpu.h
@@ -49,13 +49,22 @@ extern unsigned long __per_cpu_offset[NR
  * established ways to produce a usable pointer from the percpu variable
  * offset.
  */
-#define per_cpu(var, cpu) \
-	(*SHIFT_PERCPU_PTR(&per_cpu_var(var), per_cpu_offset(cpu)))
-#define __get_cpu_var(var) \
-	(*SHIFT_PERCPU_PTR(&per_cpu_var(var), my_cpu_offset))
-#define __raw_get_cpu_var(var) \
-	(*SHIFT_PERCPU_PTR(&per_cpu_var(var), __my_cpu_offset))
-
+#define per_cpu(var, cpu) (*({ \
+	typeof(&per_cpu_var(var)) __pcpu_ptr__ = &per_cpu_var(var); \
+	unsigned int __pcpu_cpu__ = (cpu); \
+	pcpu_verify_access(__pcpu_ptr__, __pcpu_cpu__); \
+	SHIFT_PERCPU_PTR(__pcpu_ptr__, per_cpu_offset(__pcpu_cpu__)); \
+}))
+#define __get_cpu_var(var) (*({ \
+	typeof(&per_cpu_var(var)) __pcpu_ptr__ = &per_cpu_var(var); \
+	pcpu_verify_access(__pcpu_ptr__, NR_CPUS); \
+	SHIFT_PERCPU_PTR(__pcpu_ptr__, my_cpu_offset); \
+}))
+#define __raw_get_cpu_var(var) (*({ \
+	typeof(&per_cpu_var(var)) __pcpu_ptr__ = &per_cpu_var(var); \
+	pcpu_verify_access(__pcpu_ptr__, NR_CPUS); \
+	SHIFT_PERCPU_PTR(__pcpu_ptr__, __my_cpu_offset); \
+}))
 
 #ifdef CONFIG_HAVE_SETUP_PER_CPU_AREA
 extern void setup_per_cpu_areas(void);
Index: work/include/linux/percpu-defs.h
===================================================================
--- work.orig/include/linux/percpu-defs.h
+++ work/include/linux/percpu-defs.h
@@ -7,6 +7,12 @@
  */
 #define per_cpu_var(var) per_cpu__##var
 
+#ifdef CONFIG_DEBUG_VERIFY_PER_CPU
+extern void pcpu_verify_access(void *ptr, unsigned int cpu);
+#else
+#define pcpu_verify_access(ptr, cpu)	do {} while (0)
+#endif
+
 /*
  * Base implementations of per-CPU variable declarations and definitions, where
  * the section in which the variable is to be placed is provided by the
Index: work/include/linux/percpu.h
===================================================================
--- work.orig/include/linux/percpu.h
+++ work/include/linux/percpu.h
@@ -127,7 +127,12 @@ extern int __init pcpu_page_first_chunk(
  * dynamically allocated. Non-atomic access to the current CPU's
  * version should probably be combined with get_cpu()/put_cpu().
  */
-#define per_cpu_ptr(ptr, cpu)	SHIFT_PERCPU_PTR((ptr), per_cpu_offset((cpu)))
+#define per_cpu_ptr(ptr, cpu) ({ \
+	typeof(ptr) __pcpu_ptr__ = (ptr); \
+	unsigned int __pcpu_cpu__ = (cpu); \
+	pcpu_verify_access(__pcpu_ptr__, __pcpu_cpu__); \
+	SHIFT_PERCPU_PTR(__pcpu_ptr__, per_cpu_offset((__pcpu_cpu__))); \
+})
 
 extern void *__alloc_reserved_percpu(size_t size, size_t align);
 
Index: work/kernel/softirq.c
===================================================================
--- work.orig/kernel/softirq.c
+++ work/kernel/softirq.c
@@ -560,7 +560,7 @@ EXPORT_PER_CPU_SYMBOL(softirq_work_list)
 
 static void __local_trigger(struct call_single_data *cp, int softirq)
 {
-	struct list_head *head = &__get_cpu_var(softirq_work_list[softirq]);
+	struct list_head *head = &__get_cpu_var(softirq_work_list)[softirq];
 
 	list_add_tail(&cp->list, head);
 
@@ -656,13 +656,13 @@ static int __cpuinit remote_softirq_cpu_
 
 		local_irq_disable();
 		for (i = 0; i < NR_SOFTIRQS; i++) {
-			struct list_head *head = &per_cpu(softirq_work_list[i], cpu);
+			struct list_head *head = &per_cpu(softirq_work_list, cpu)[i];
 			struct list_head *local_head;
 
 			if (list_empty(head))
 				continue;
 
-			local_head = &__get_cpu_var(softirq_work_list[i]);
+			local_head = &__get_cpu_var(softirq_work_list)[i];
 			list_splice_init(head, local_head);
 			raise_softirq_irqoff(i);
 		}
@@ -688,7 +688,7 @@ void __init softirq_init(void)
 		per_cpu(tasklet_hi_vec, cpu).tail =
 			&per_cpu(tasklet_hi_vec, cpu).head;
 		for (i = 0; i < NR_SOFTIRQS; i++)
-			INIT_LIST_HEAD(&per_cpu(softirq_work_list[i], cpu));
+			INIT_LIST_HEAD(&per_cpu(softirq_work_list, cpu)[i]);
 	}
 
 	register_hotcpu_notifier(&remote_softirq_cpu_notifier);
Index: work/lib/Kconfig.debug
===================================================================
--- work.orig/lib/Kconfig.debug
+++ work/lib/Kconfig.debug
@@ -805,6 +805,21 @@ config DEBUG_BLOCK_EXT_DEVT
 
 	  Say N if you are unsure.
 
+config DEBUG_VERIFY_PER_CPU
+	bool "Verify per-cpu accesses"
+	depends on DEBUG_KERNEL
+	depends on SMP
+	help
+
+	  This option makes percpu access macros to verify the
+	  specified processor and percpu variable offset on each
+	  access. This helps catching percpu variable access bugs
+	  which may cause corruption on unrelated memory region making
+	  it very difficult to catch at the cost of making percpu
+	  accesses considerably slow.
+
+	  Say N if you are unsure.
+
 config DEBUG_FORCE_WEAK_PER_CPU
 	bool "Force weak per-cpu definitions"
 	depends on DEBUG_KERNEL
@@ -820,6 +835,8 @@ config DEBUG_FORCE_WEAK_PER_CPU
 	  To ensure that generic code follows the above rules, this
 	  option forces all percpu variables to be defined as weak.
 
+	  Say N if you are unsure.
+
 config LKDTM
 	tristate "Linux Kernel Dump Test Tool Module"
 	depends on DEBUG_KERNEL
Index: work/mm/percpu.c
===================================================================
--- work.orig/mm/percpu.c
+++ work/mm/percpu.c
@@ -1241,6 +1241,118 @@ void free_percpu(void *ptr)
 }
 EXPORT_SYMBOL_GPL(free_percpu);
 
+#ifdef CONFIG_DEBUG_VERIFY_PER_CPU
+static struct pcpu_chunk *pcpu_verify_match_chunk(void *addr)
+{
+	void *first_start = pcpu_first_chunk->base_addr;
+	struct pcpu_chunk *chunk;
+	int slot;
+
+	/* is it in the first chunk? */
+	if (addr >= first_start && addr < first_start + pcpu_unit_size) {
+		/* is it in the reserved area? */
+		if (addr < first_start + pcpu_reserved_chunk_limit)
+			return pcpu_reserved_chunk;
+		return pcpu_first_chunk;
+	}
+
+	/* walk each dynamic chunk */
+	for (slot = 0; slot < pcpu_nr_slots; slot++)
+		list_for_each_entry(chunk, &pcpu_slot[slot], list)
+			if (addr >= chunk->base_addr &&
+			    addr < chunk->base_addr + pcpu_unit_size)
+				return chunk;
+	return NULL;
+}
+
+void pcpu_verify_access(void *ptr, unsigned int cpu)
+{
+	static bool verifying[NR_CPUS];
+	static int warn_limit = 10;
+	char cbuf[80], obuf[160];
+	void *addr = __pcpu_ptr_to_addr(ptr);
+	bool is_static = false;
+	struct pcpu_chunk *chunk;
+	unsigned long flags;
+	int i, addr_off, off, len, end;
+
+	/* not been initialized yet or whined enough already */
+	if (unlikely(!pcpu_first_chunk || !warn_limit))
+		return;
+
+	/* don't re-enter */
+	preempt_disable();
+	if (verifying[raw_smp_processor_id()]) {
+		preempt_enable_no_resched();
+		return;
+	}
+	verifying[raw_smp_processor_id()] = true;
+
+	cbuf[0] = '\0';
+	obuf[0] = '\0';
+
+	if (unlikely(cpu < NR_CPUS && !cpu_possible(cpu)) && warn_limit)
+		snprintf(cbuf, sizeof(cbuf), "invalid cpu %u", cpu);
+
+	/*
+	 * We can enter this function from weird places and have no
+	 * way to reliably avoid deadlock. If lock is available, grab
+	 * it and verify. If not, just let it go through.
+	 */
+	if (!spin_trylock_irqsave(&pcpu_lock, flags))
+		goto out;
+
+	chunk = pcpu_verify_match_chunk(addr);
+	if (!chunk) {
+		snprintf(obuf, sizeof(obuf),
+			 "no matching chunk ptr=%p addr=%p", ptr, addr);
+		goto out_unlock;
+	}
+
+	addr_off = addr - chunk->base_addr;
+	if (chunk->base_addr == pcpu_first_chunk->base_addr)
+		if (chunk == pcpu_reserved_chunk || addr_off < -chunk->map[0])
+			is_static = true;
+
+	for (i = 0, off = 0; i < chunk->map_used; i++, off = end) {
+		len = chunk->map[i];
+		end = off + abs(len);
+
+		if (addr_off == off) {
+			if (unlikely(len > 0))
+				snprintf(obuf, sizeof(obuf),
+					 "free area accessed ptr=%p addr=%p "
+					 "off=%d len=%d", ptr, addr, off, len);
+			break;
+		}
+		if (!is_static && off < addr_off && addr_off < end) {
+			snprintf(obuf, sizeof(obuf),
+				 "%sarea accessed in the middle ptr=%p "
+				 "addr=%p:%d off=%d len=%d",
+				 len > 0 ? "free " : "",
+				 ptr, addr, addr_off, off, abs(len));
+			break;
+		}
+	}
+
+out_unlock:
+	spin_unlock_irqrestore(&pcpu_lock, flags);
+out:
+	if (unlikely(cbuf[0] || obuf[0])) {
+		printk(KERN_ERR "PERCPU: %s%s%s\n",
+		       cbuf, cbuf[0] ? ", " : "", obuf);
+		dump_stack();
+		if (!--warn_limit)
+			printk(KERN_WARNING "PERCPU: access warning limit "
+			       "reached, turning off access validation\n");
+	}
+
+	verifying[raw_smp_processor_id()] = false;
+	preempt_enable_no_resched();
+}
+EXPORT_SYMBOL_GPL(pcpu_verify_access);
+#endif /* CONFIG_DEBUG_VERIFY_PER_CPU */
+
 static inline size_t pcpu_calc_fc_sizes(size_t static_size,
 					 size_t reserved_size,
 					 ssize_t *dyn_sizep)
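
The core of `pcpu_verify_access()` in the patch above is the walk over the chunk's area map, where (judging from the `len > 0` test that reports "free area accessed") a positive entry is a free area and a negative entry an allocated one. The following user-space sketch isolates just that walk. `enum demo_verdict` and `demo_classify()` are hypothetical names, and unlike the patch the sketch has no `is_static` special case and returns an explicit verdict when the offset lies beyond the map, where the patch simply reports nothing.

```c
#include <stdlib.h>

enum demo_verdict {
	DEMO_OK,            /* start of an allocated area */
	DEMO_FREE_START,    /* start of a free area */
	DEMO_ALLOC_MIDDLE,  /* inside an allocated area */
	DEMO_FREE_MIDDLE,   /* inside a free area */
	DEMO_NO_AREA        /* offset lies beyond every area in the map */
};

/*
 * Classify a byte offset against an area map in which a positive entry is
 * a free area of that many bytes and a negative entry an allocated one.
 */
static enum demo_verdict demo_classify(const int *map, int map_used,
				       int addr_off)
{
	int i, off, end = 0;

	for (i = 0, off = 0; i < map_used; i++, off = end) {
		int len = map[i];

		end = off + abs(len);
		if (addr_off == off)
			return len > 0 ? DEMO_FREE_START : DEMO_OK;
		if (off < addr_off && addr_off < end)
			return len > 0 ? DEMO_FREE_MIDDLE : DEMO_ALLOC_MIDDLE;
	}
	return DEMO_NO_AREA;
}
```

For a map of `{-64, 32, -16}` (64 bytes allocated, 32 free, 16 allocated), offsets 0 and 96 are valid area starts, 64 and 80 fall in the free area, 100 points into the middle of the last allocation, and 112 is past the end of the map.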