Message ID:   1232617672.14549.25.camel@penberg-laptop
State:        Not Applicable, archived
Delegated to: David Miller
On Thu, 2009-01-22 at 11:47 +0200, Pekka Enberg wrote:
> On Thu, 2009-01-22 at 17:28 +0800, Zhang, Yanmin wrote:
> > On Thu, 2009-01-22 at 11:15 +0200, Pekka Enberg wrote:
> > > On Thu, 2009-01-22 at 16:36 +0800, Zhang, Yanmin wrote:
> > > > On Wed, 2009-01-21 at 18:58 -0500, Christoph Lameter wrote:
> > > > > On Tue, 20 Jan 2009, Zhang, Yanmin wrote:
> > > > > >
> > > > > > kmem_cache skbuff_head_cache's object size is just 256, so it
> > > > > > shares the kmem_cache with :0000256. Their order is 1, which means
> > > > > > every slab consists of 2 physical pages.
> > > > >
> > > > > That order can be changed. Try specifying slub_max_order=0 on the
> > > > > kernel command line to force an order 0 alloc.
> > > >
> > > > I tried slub_max_order=0 and there is no improvement on this UDP-U-4k
> > > > issue. Both get_page_from_freelist and __free_pages_ok's cpu time are
> > > > still very high.
> > > >
> > > > I checked my instrumentation in the kernel and found it's caused by
> > > > large object allocation/free whose size is more than PAGE_SIZE. Here
> > > > its order is 1.
> > > >
> > > > The right free callchain is __kfree_skb => skb_release_all =>
> > > > skb_release_data.
> > > >
> > > > So this case isn't the issue that a batch of allocations/frees might
> > > > erase the partial page functionality.
> > >
> > > So is this the kfree(skb->head) in skb_release_data() or the put_page()
> > > calls in the same function in a loop?
> > It's kfree(skb->head).
> >
> > > If it's the former, with a big enough size passed to __alloc_skb(), the
> > > networking code might be taking a hit from the SLUB page allocator
> > > pass-through.
>
> Do we know what kind of size is being passed to __alloc_skb() in this
> case?
In function __alloc_skb, the original parameter size=4155,
SKB_DATA_ALIGN(size)=4224, sizeof(struct skb_shared_info)=472, so
__kmalloc_track_caller's parameter size=4696.

> Maybe we want to do something like this.
>
>			Pekka
>
> SLUB: revert page allocator pass-through
This patch almost fixes the netperf UDP-U-4k issue.

#slabinfo -AD
Name                   Objects    Alloc     Free   %Fast
:0000256                  1658 70350463 70348946  99  99
kmalloc-8192                31 70322309 70322293  99  99
:0000168                  2592   143154   140684  93  28
:0004096                  1456    91072    89644  99  96
:0000192                  3402    63838    60491  89  11
:0000064                  6177    49635    43743  98  77

So kmalloc-8192 appears. Without the patch, kmalloc-8192 is hidden.
kmalloc-8192's default order on my 8-core Stoakley is 2.

1) If I start CPU_NUM clients and servers, SLUB's result is about 2% better
   than SLQB's;
2) If I start 1 client and 1 server, and bind them to different physical
   cpus, SLQB's result is about 10% better than SLUB's.

I don't know why there is still a 10% difference with item 2). Maybe cache
misses cause it?

> This is a revert of commit aadb4bc4a1f9108c1d0fbd121827c936c2ed4217 ("SLUB:
> direct pass through of page size or higher kmalloc requests").
> ---
>
> diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
> index 2f5c16b..3bd3662 100644
> --- a/include/linux/slub_def.h
> +++ b/include/linux/slub_def.h

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
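[Editorial aside, not part of the original mail: the 4696-byte skb
allocations show up in the kmalloc-8192 cache because SLUB sizes its
generic caches in powers of two. With the pass-through reverted, get_slab()
in the patch below computes index = fls(size - 1). A shell sketch of that
rounding:]

```shell
# Sketch of SLUB's power-of-two kmalloc sizing: index = fls(size - 1),
# so a 4696-byte request is served from the kmalloc-8192 cache.
size=4696
n=$((size - 1)); index=0
while [ "$n" -gt 0 ]; do index=$((index + 1)); n=$((n >> 1)); done
echo "kmalloc-$((1 << index))"    # prints kmalloc-8192
```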
Zhang, Yanmin wrote:
> > > > If it's the former, with a big enough size passed to __alloc_skb(),
> > > > the networking code might be taking a hit from the SLUB page
> > > > allocator pass-through.
> > Do we know what kind of size is being passed to __alloc_skb() in this
> > case?
> In function __alloc_skb, the original parameter size=4155,
> SKB_DATA_ALIGN(size)=4224, sizeof(struct skb_shared_info)=472, so
> __kmalloc_track_caller's parameter size=4696.

OK, so all allocations go straight to the page allocator.

> > Maybe we want to do something like this.
> >
> > SLUB: revert page allocator pass-through
> This patch almost fixes the netperf UDP-U-4k issue.
>
> #slabinfo -AD
> Name                   Objects    Alloc     Free   %Fast
> :0000256                  1658 70350463 70348946  99  99
> kmalloc-8192                31 70322309 70322293  99  99
> :0000168                  2592   143154   140684  93  28
> :0004096                  1456    91072    89644  99  96
> :0000192                  3402    63838    60491  89  11
> :0000064                  6177    49635    43743  98  77
>
> So kmalloc-8192 appears. Without the patch, kmalloc-8192 is hidden.
> kmalloc-8192's default order on my 8-core Stoakley is 2.

Christoph, should we merge my patch as-is or do you have an alternative fix
in mind? We could, of course, increase the kmalloc() caches one level up,
to 8192 or higher.

> 1) If I start CPU_NUM clients and servers, SLUB's result is about 2%
>    better than SLQB's;
> 2) If I start 1 client and 1 server, and bind them to different physical
>    cpus, SLQB's result is about 10% better than SLUB's.
>
> I don't know why there is still a 10% difference with item 2). Maybe
> cache misses cause it?

Maybe we can use the perfstat and/or kerneltop utilities of the new perf
counters patch to diagnose this:

  http://lkml.org/lkml/2009/1/21/273

And do oprofile, of course. Thanks!

			Pekka
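[Editorial aside: the size arithmetic above can be reproduced in shell.
SKB_DATA_ALIGN rounds the data size up to SMP_CACHE_BYTES; a value of 128
is assumed here because it is what reproduces the 4155 -> 4224 rounding
reported in the thread (a 128-byte cacheline kernel config).]

```shell
# SKB_DATA_ALIGN(size) rounds up to a multiple of SMP_CACHE_BYTES
# (assumed 128 here); skb_shared_info (472 bytes above) is then appended.
SMP_CACHE_BYTES=128
size=4155
aligned=$(( (size + SMP_CACHE_BYTES - 1) & ~(SMP_CACHE_BYTES - 1) ))
shinfo=472   # sizeof(struct skb_shared_info) reported above
total=$(( aligned + shinfo ))
echo "$aligned $total"   # 4224 4696 -- total > PAGE_SIZE (4096)
```

Since 4696 is larger than PAGE_SIZE, the pass-through sends every one of
these allocations to the page allocator.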
On Fri, 2009-01-23 at 08:52 +0200, Pekka Enberg wrote:
> > 1) If I start CPU_NUM clients and servers, SLUB's result is about 2%
> >    better than SLQB's;
> > 2) If I start 1 client and 1 server, and bind them to different physical
> >    cpus, SLQB's result is about 10% better than SLUB's.
> >
> > I don't know why there is still a 10% difference with item 2). Maybe
> > cache misses cause it?
>
> Maybe we can use the perfstat and/or kerneltop utilities of the new perf
> counters patch to diagnose this:
>
>   http://lkml.org/lkml/2009/1/21/273
>
> And do oprofile, of course. Thanks!

I assume binding the client and the server to different physical CPUs also
means that the SKB is always allocated on CPU 1 and freed on CPU 2? If so,
we will be taking the __slab_free() slow path all the time on kfree(),
which will cause cache effects, no doubt.

But there's another potential performance hit we're taking because the
object size of the cache is so big. As allocations from CPU 1 keep coming
in, we need to allocate new pages and unfreeze the per-cpu page. That in
turn causes __slab_free() to be more eager to discard the slab (see the
PageSlubFrozen check there).

So before going for cache profiling, I'd really like to see an oprofile
report. I suspect we're still going to see much more page allocator
activity there than with SLAB or SLQB, which is why we're still behaving so
badly here.

			Pekka
On Fri, 2009-01-23 at 10:06 +0200, Pekka Enberg wrote:
> I assume binding the client and the server to different physical CPUs
> also means that the SKB is always allocated on CPU 1 and freed on CPU 2?
> If so, we will be taking the __slab_free() slow path all the time on
> kfree(), which will cause cache effects, no doubt.
>
> But there's another potential performance hit we're taking because the
> object size of the cache is so big. As allocations from CPU 1 keep
> coming in, we need to allocate new pages and unfreeze the per-cpu page.
> That in turn causes __slab_free() to be more eager to discard the slab
> (see the PageSlubFrozen check there).
>
> So before going for cache profiling, I'd really like to see an oprofile
> report. I suspect we're still going to see much more page allocator
> activity
Theoretically, it should, but oprofile doesn't show that.

> there than with SLAB or SLQB, which is why we're still behaving
> so badly here.

oprofile output with 2.6.29-rc2-slubrevertlarge:
CPU: Core 2, speed 2666.71 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit
mask of 0x00 (Unhalted core cycles) count 100000
samples  %        app name  symbol name
132779   32.9951  vmlinux   copy_user_generic_string
 25334    6.2954  vmlinux   schedule
 21032    5.2264  vmlinux   tg_shares_up
 17175    4.2679  vmlinux   __skb_recv_datagram
  9091    2.2591  vmlinux   sock_def_readable
  8934    2.2201  vmlinux   mwait_idle
  8796    2.1858  vmlinux   try_to_wake_up
  6940    1.7246  vmlinux   __slab_free

#slabinfo -AD
Name                   Objects   Alloc     Free   %Fast
:0000256                  1643 5215544  5214027  94   0
kmalloc-8192                28 5189576  5189560   0   0
:0000168                  2631  141466   138976  92  28
:0004096                  1452   88697    87269  99  96
:0000192                  3402   63050    59732  89  11
:0000064                  6265   46611    40721  98  82
:0000128                  1895   30429    28654  93  32

oprofile output with kernel 2.6.29-rc2-slqb0121:
CPU: Core 2, speed 2666.76 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit
mask of 0x00 (Unhalted core cycles) count 100000
samples  %        image name  app name  symbol name
114793   28.7163  vmlinux     vmlinux   copy_user_generic_string
 27880    6.9744  vmlinux     vmlinux   tg_shares_up
 22218    5.5580  vmlinux     vmlinux   schedule
 12238    3.0614  vmlinux     vmlinux   mwait_idle
  7395    1.8499  vmlinux     vmlinux   task_rq_lock
  7348    1.8382  vmlinux     vmlinux   sock_def_readable
  7202    1.8016  vmlinux     vmlinux   sched_clock_cpu
  6981    1.7464  vmlinux     vmlinux   __skb_recv_datagram
  6566    1.6425  vmlinux     vmlinux   udp_queue_rcv_skb
On Friday 23 January 2009 14:02:53 Zhang, Yanmin wrote:
> 1) If I start CPU_NUM clients and servers, SLUB's result is about 2%
>    better than SLQB's;

I'll have to look into this too. Could be evidence of the possible TLB
improvement from using bigger pages and/or a page-specific freelist, I
suppose.

Do you have a script used to start netperf in that configuration?
On Fri, 2009-01-23 at 16:30 +0800, Zhang, Yanmin wrote:
> > I assume binding the client and the server to different physical CPUs
> > also means that the SKB is always allocated on CPU 1 and freed on CPU
> > 2? If so, we will be taking the __slab_free() slow path all the time on
> > kfree(), which will cause cache effects, no doubt.
> >
> > But there's another potential performance hit we're taking because the
> > object size of the cache is so big. As allocations from CPU 1 keep
> > coming in, we need to allocate new pages and unfreeze the per-cpu page.
> > That in turn causes __slab_free() to be more eager to discard the slab
> > (see the PageSlubFrozen check there).
> >
> > So before going for cache profiling, I'd really like to see an oprofile
> > report. I suspect we're still going to see much more page allocator
> > activity
> Theoretically, it should, but oprofile doesn't show that.
>
> > there than with SLAB or SLQB, which is why we're still behaving
> > so badly here.
>
> oprofile output with 2.6.29-rc2-slubrevertlarge:
> CPU: Core 2, speed 2666.71 MHz (estimated)
> Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a
> unit mask of 0x00 (Unhalted core cycles) count 100000
> samples  %        app name  symbol name
> 132779   32.9951  vmlinux   copy_user_generic_string
>  25334    6.2954  vmlinux   schedule
>  21032    5.2264  vmlinux   tg_shares_up
>  17175    4.2679  vmlinux   __skb_recv_datagram
>   9091    2.2591  vmlinux   sock_def_readable
>   8934    2.2201  vmlinux   mwait_idle
>   8796    2.1858  vmlinux   try_to_wake_up
>   6940    1.7246  vmlinux   __slab_free
>
> #slabinfo -AD
> Name                   Objects   Alloc     Free   %Fast
> :0000256                  1643 5215544  5214027  94   0
> kmalloc-8192                28 5189576  5189560   0   0
                                                   ^^^^^^
This looks a bit funny. Hmm.

> :0000168                  2631  141466   138976  92  28
> :0004096                  1452   88697    87269  99  96
> :0000192                  3402   63050    59732  89  11
> :0000064                  6265   46611    40721  98  82
> :0000128                  1895   30429    28654  93  32
On Fri, 2009-01-23 at 19:33 +1100, Nick Piggin wrote:
> On Friday 23 January 2009 14:02:53 Zhang, Yanmin wrote:
> > 1) If I start CPU_NUM clients and servers, SLUB's result is about 2%
> >    better than SLQB's;
>
> I'll have to look into this too. Could be evidence of the possible
> TLB improvement from using bigger pages and/or a page-specific freelist,
> I suppose.
>
> Do you have a script used to start netperf in that configuration?
See the attachment.

Steps to run the testing:
1) compile netperf;
2) change PROG_DIR to path/to/netperf/src;
3) ./start_netperf_udp_v4.sh 8    # assume your machine has 8 logical cpus.
> 3) ./start_netperf_udp_v4.sh 8    # assume your machine has 8 logical cpus.

Some comments on the script:

> #!/bin/sh
>
> PROG_DIR=/home/ymzhang/test/netperf/src
> date=`date +%H%M%N`
> #PROG_DIR=/root/netperf/netperf/src
> client_num=$1
> pin_cpu=$2
>
> start_port_server=12384
> start_port_client=15888
>
> killall netserver
> ${PROG_DIR}/netserver
> sleep 2

Any particular reason for killing off the netserver daemon?

> if [ ! -d result ]; then
>     mkdir result
> fi
>
> all_result_files=""
> for i in `seq 1 ${client_num}`; do
>     if [ "${pin_cpu}" == "pin" ]; then
>         pin_param="-T ${i} ${i}"

The -T option takes arguments of the form:

  N    - bind both netperf and netserver to core N
  N,   - bind only netperf to core N, float netserver
  ,M   - float netperf, bind only netserver to core M
  N,M  - bind netperf to core N and netserver to core M

Without a comma between N and M, knuth only knows what the command line
parser will do :)

>     fi
>     result_file=result/netperf_${start_port_client}.${date}
>     #./netperf -t UDP_STREAM -l 60 -H 127.0.0.1 -- -P 15895 12391 -s 32768 -S 32768 -m 4096
>     #./netperf -t UDP_STREAM -l 60 -H 127.0.0.1 -i 50 3 -I 99 5 -- -P 12384 12888 -s 32768 -S 32768 -m 4096
>     #${PROG_DIR}/netperf -p ${port_num} -t TCP_RR -l 60 -H 127.0.0.1 ${pin_param} -- -r 1,1 >${result_file} &
>     ${PROG_DIR}/netperf -t UDP_STREAM -l 60 -H 127.0.0.1 ${pin_param} -- -P ${start_port_client} ${start_port_server} -s 32768 -S 32768 -m 4096 >${result_file} &

Same thing here for the -P option - there needs to be a comma between the
two port numbers; otherwise, the best case is that the second port number
is ignored. Worst case is that netperf starts doing knuth only knows what.

To get quick profiles, that form of aggregate netperf is OK - just the one
iteration with background processes using a moderately long run time.
However, for result reporting, it is best to (ab)use the confidence
intervals functionality to try to avoid skew errors. I tend to add in a
global -i 30 option to get each netperf to repeat its measurements 30
times. That way one is reasonably confident that skew issues are minimized.

http://www.netperf.org/svn/netperf2/trunk/doc/netperf.html#Using-Netperf-to-Measure-Aggregate-Performance

And I would probably add the -c and -C options to have netperf report
service demands.

>     sub_pid="${sub_pid} `echo $!`"
>     port_num=$((${port_num}+1))
>     all_result_files="${all_result_files} ${result_file}"
>     start_port_server=$((${start_port_server}+1))
>     start_port_client=$((${start_port_client}+1))
> done;
>
> wait ${sub_pid}
> killall netserver
>
> result="0"
> for i in `echo ${all_result_files}`; do
>     sub_result=`awk '/Throughput/ {getline; getline; getline; print " "$6}' ${i}`
>     result=`echo "${result}+${sub_result}"|bc`
> done;

The documented-only-in-source :( "omni" tests in top-of-trunk netperf:

  http://www.netperf.org/svn/netperf2/trunk

  ./configure --enable-omni

allow one to specify which result values one wants, in which order, either
as more or less traditional netperf output (test-specific -O), CSV
(test-specific -o) or keyval (test-specific -k). All three take an optional
filename as an argument, with the file containing a list of desired output
values. You can give a "filename" of '?' to get the list of output values
known to that version of netperf.

Might help simplify parsing and whatnot.

happy benchmarking,

rick jones

> echo $result
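[Editorial illustration of Rick's point about the comma separators: a
sketch of the script's netperf line with corrected -T and -P arguments.
The values are the script's own; this only builds the argument strings
rather than invoking netperf.]

```shell
# Corrected comma-separated forms of the -T and -P arguments used in the
# script discussed in this thread (netperf itself is not run here).
i=1
start_port_client=15888
start_port_server=12384
pin_param="-T ${i},${i}"                                   # netperf on core N, netserver on core M
port_param="-P ${start_port_client},${start_port_server}"  # local,remote data ports
echo "netperf -t UDP_STREAM -l 60 -H 127.0.0.1 ${pin_param} -- ${port_param} -s 32768 -S 32768 -m 4096"
```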
On Fri, Jan 23, 2009 at 10:40 AM, Rick Jones <rick.jones2@hp.com> wrote:
...
> And I would probably add the -c and -C options to have netperf report
> service demands.

For performance analysis, the service demand is often more interesting
than the absolute performance (which typically only varies a few Mb/s for
gigE NICs). I strongly encourage adding -c and -C.

grant
On Fri, 2009-01-23 at 10:40 -0800, Rick Jones wrote:
> > 3) ./start_netperf_udp_v4.sh 8    # assume your machine has 8 logical cpus.
>
> Some comments on the script:
Thanks. I wanted to run the testing to get results quickly, as long as the
results have no big fluctuation.

> > #!/bin/sh
> >
> > PROG_DIR=/home/ymzhang/test/netperf/src
> > date=`date +%H%M%N`
> > #PROG_DIR=/root/netperf/netperf/src
> > client_num=$1
> > pin_cpu=$2
> >
> > start_port_server=12384
> > start_port_client=15888
> >
> > killall netserver
> > ${PROG_DIR}/netserver
> > sleep 2
>
> Any particular reason for killing off the netserver daemon?
I'm not sure whether a prior run might leave any impact on later runs, so I
just kill netserver.

> > if [ ! -d result ]; then
> >     mkdir result
> > fi
> >
> > all_result_files=""
> > for i in `seq 1 ${client_num}`; do
> >     if [ "${pin_cpu}" == "pin" ]; then
> >         pin_param="-T ${i} ${i}"
>
> The -T option takes arguments of the form:
>
>   N    - bind both netperf and netserver to core N
>   N,   - bind only netperf to core N, float netserver
>   ,M   - float netperf, bind only netserver to core M
>   N,M  - bind netperf to core N and netserver to core M
>
> Without a comma between N and M, knuth only knows what the command line
> parser will do :)
>
> >     fi
> >     result_file=result/netperf_${start_port_client}.${date}
> >     #./netperf -t UDP_STREAM -l 60 -H 127.0.0.1 -- -P 15895 12391 -s 32768 -S 32768 -m 4096
> >     #./netperf -t UDP_STREAM -l 60 -H 127.0.0.1 -i 50 3 -I 99 5 -- -P 12384 12888 -s 32768 -S 32768 -m 4096
> >     #${PROG_DIR}/netperf -p ${port_num} -t TCP_RR -l 60 -H 127.0.0.1 ${pin_param} -- -r 1,1 >${result_file} &
> >     ${PROG_DIR}/netperf -t UDP_STREAM -l 60 -H 127.0.0.1 ${pin_param} -- -P ${start_port_client} ${start_port_server} -s 32768 -S 32768 -m 4096 >${result_file} &
>
> Same thing here for the -P option - there needs to be a comma between the
> two port numbers; otherwise, the best case is that the second port number
> is ignored. Worst case is that netperf starts doing knuth only knows what.
Thanks.

> To get quick profiles, that form of aggregate netperf is OK - just the
> one iteration with background processes using a moderately long run time.
> However, for result reporting, it is best to (ab)use the confidence
> intervals functionality to try to avoid skew errors.
Yes. My formal testing uses -i 50. I just wanted a quick test. If I need
finer tuning or investigation, I turn on more options.

> I tend to add in a global -i 30 option to get each netperf to repeat its
> measurements 30 times. That way one is reasonably confident that skew
> issues are minimized.
>
> http://www.netperf.org/svn/netperf2/trunk/doc/netperf.html#Using-Netperf-to-Measure-Aggregate-Performance
>
> And I would probably add the -c and -C options to have netperf report
> service demands.
Yes. That's good. I'm used to starting vmstat or mpstat to monitor cpu
utilization in real time.

> >     sub_pid="${sub_pid} `echo $!`"
> >     port_num=$((${port_num}+1))
> >     all_result_files="${all_result_files} ${result_file}"
> >     start_port_server=$((${start_port_server}+1))
> >     start_port_client=$((${start_port_client}+1))
> > done;
> >
> > wait ${sub_pid}
> > killall netserver
> >
> > result="0"
> > for i in `echo ${all_result_files}`; do
> >     sub_result=`awk '/Throughput/ {getline; getline; getline; print " "$6}' ${i}`
> >     result=`echo "${result}+${sub_result}"|bc`
> > done;
>
> The documented-only-in-source :( "omni" tests in top-of-trunk netperf:
>
>   http://www.netperf.org/svn/netperf2/trunk
>
>   ./configure --enable-omni
>
> allow one to specify which result values one wants, in which order,
> either as more or less traditional netperf output (test-specific -O),
> CSV (test-specific -o) or keyval (test-specific -k). All three take an
> optional filename as an argument, with the file containing a list of
> desired output values. You can give a "filename" of '?' to get the list
> of output values known to that version of netperf.
>
> Might help simplify parsing and whatnot.
Yes, it does.

> happy benchmarking,
>
> rick jones
Thanks again. I learned a lot.

> > echo $result
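[Editorial sketch of the parsing simplification discussed above: with the
omni keyval output (test-specific -k), each result file contains lines like
THROUGHPUT=<value>, so the getline-based awk in the script reduces to a
simple sum. The files and throughput values below are made up for
illustration.]

```shell
# Hypothetical keyval result files as netperf's omni -k output would
# produce them; summing THROUGHPUT across files replaces the
# awk-getline/bc pipeline in the script.
printf 'THROUGHPUT=941.23\nTHROUGHPUT_UNITS=10^6bits/s\n' > /tmp/netperf_r1
printf 'THROUGHPUT=938.10\nTHROUGHPUT_UNITS=10^6bits/s\n' > /tmp/netperf_r2
awk -F= '/^THROUGHPUT=/ { sum += $2 } END { printf "%.2f\n", sum }' \
    /tmp/netperf_r1 /tmp/netperf_r2
```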
>> To get quick profiles, that form of aggregate netperf is OK - just the
>> one iteration with background processes using a moderately long run
>> time. However, for result reporting, it is best to (ab)use the
>> confidence intervals functionality to try to avoid skew errors.
>
> Yes. My formal testing uses -i 50. I just wanted a quick test. If I need
> finer tuning or investigation, I turn on more options.

Netperf will silently clip that to 30, as that is all the built-in tables
know.

> Thanks again. I learned a lot.

Feel free to wander over to netperf-talk over at netperf.org if you want
to talk some more about the care and feeding of netperf.

happy benchmarking,

rick jones
diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
index 2f5c16b..3bd3662 100644
--- a/include/linux/slub_def.h
+++ b/include/linux/slub_def.h
@@ -124,7 +124,7 @@ struct kmem_cache {
  * We keep the general caches in an array of slab caches that are used for
  * 2^x bytes of allocations.
  */
-extern struct kmem_cache kmalloc_caches[PAGE_SHIFT + 1];
+extern struct kmem_cache kmalloc_caches[KMALLOC_SHIFT_HIGH + 1];
 
 /*
  * Sorry that the following has to be that ugly but some versions of GCC
@@ -135,6 +135,9 @@ static __always_inline int kmalloc_index(size_t size)
 	if (!size)
 		return 0;
 
+	if (size > KMALLOC_MAX_SIZE)
+		return -1;
+
 	if (size <= KMALLOC_MIN_SIZE)
 		return KMALLOC_SHIFT_LOW;
 
@@ -154,10 +157,6 @@ static __always_inline int kmalloc_index(size_t size)
 	if (size <= 1024) return 10;
 	if (size <= 2 * 1024) return 11;
 	if (size <= 4 * 1024) return 12;
-/*
- * The following is only needed to support architectures with a larger page
- * size than 4k.
- */
 	if (size <= 8 * 1024) return 13;
 	if (size <= 16 * 1024) return 14;
 	if (size <= 32 * 1024) return 15;
@@ -167,6 +166,10 @@ static __always_inline int kmalloc_index(size_t size)
 	if (size <= 512 * 1024) return 19;
 	if (size <= 1024 * 1024) return 20;
 	if (size <= 2 * 1024 * 1024) return 21;
+	if (size <= 4 * 1024 * 1024) return 22;
+	if (size <= 8 * 1024 * 1024) return 23;
+	if (size <= 16 * 1024 * 1024) return 24;
+	if (size <= 32 * 1024 * 1024) return 25;
 	return -1;
 
 /*
@@ -191,6 +194,19 @@ static __always_inline struct kmem_cache *kmalloc_slab(size_t size)
 	if (index == 0)
 		return NULL;
 
+	/*
+	 * This function only gets expanded if __builtin_constant_p(size), so
+	 * testing it here shouldn't be needed. But some versions of gcc need
+	 * help.
+	 */
+	if (__builtin_constant_p(size) && index < 0) {
+		/*
+		 * Generate a link failure. Would be great if we could
+		 * do something to stop the compile here.
+		 */
+		extern void __kmalloc_size_too_large(void);
+		__kmalloc_size_too_large();
+	}
 	return &kmalloc_caches[index];
 }
 
@@ -204,17 +220,9 @@ static __always_inline struct kmem_cache *kmalloc_slab(size_t size)
 void *kmem_cache_alloc(struct kmem_cache *, gfp_t);
 void *__kmalloc(size_t size, gfp_t flags);
 
-static __always_inline void *kmalloc_large(size_t size, gfp_t flags)
-{
-	return (void *)__get_free_pages(flags | __GFP_COMP, get_order(size));
-}
-
 static __always_inline void *kmalloc(size_t size, gfp_t flags)
 {
 	if (__builtin_constant_p(size)) {
-		if (size > PAGE_SIZE)
-			return kmalloc_large(size, flags);
-
 		if (!(flags & SLUB_DMA)) {
 			struct kmem_cache *s = kmalloc_slab(size);
diff --git a/mm/slub.c b/mm/slub.c
index 6392ae5..8fad23f 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2475,7 +2475,7 @@ EXPORT_SYMBOL(kmem_cache_destroy);
 *	Kmalloc subsystem
 *******************************************************************/
 
-struct kmem_cache kmalloc_caches[PAGE_SHIFT + 1] __cacheline_aligned;
+struct kmem_cache kmalloc_caches[KMALLOC_SHIFT_HIGH + 1] __cacheline_aligned;
 EXPORT_SYMBOL(kmalloc_caches);
 
 static int __init setup_slub_min_order(char *str)
@@ -2537,7 +2537,7 @@ panic:
 }
 
 #ifdef CONFIG_ZONE_DMA
-static struct kmem_cache *kmalloc_caches_dma[PAGE_SHIFT + 1];
+static struct kmem_cache *kmalloc_caches_dma[KMALLOC_SHIFT_HIGH + 1];
 
 static void sysfs_add_func(struct work_struct *w)
 {
@@ -2643,8 +2643,12 @@ static struct kmem_cache *get_slab(size_t size, gfp_t flags)
 			return ZERO_SIZE_PTR;
 
 		index = size_index[(size - 1) / 8];
-	} else
+	} else {
+		if (size > KMALLOC_MAX_SIZE)
+			return NULL;
+
 		index = fls(size - 1);
+	}
 
 #ifdef CONFIG_ZONE_DMA
 	if (unlikely((flags & SLUB_DMA)))
@@ -2658,9 +2662,6 @@ void *__kmalloc(size_t size, gfp_t flags)
 {
 	struct kmem_cache *s;
 
-	if (unlikely(size > PAGE_SIZE))
-		return kmalloc_large(size, flags);
-
 	s = get_slab(size, flags);
 
 	if (unlikely(ZERO_OR_NULL_PTR(s)))
@@ -2670,25 +2671,11 @@ void *__kmalloc(size_t size, gfp_t flags)
 }
 EXPORT_SYMBOL(__kmalloc);
 
-static void *kmalloc_large_node(size_t size, gfp_t flags, int node)
-{
-	struct page *page = alloc_pages_node(node, flags | __GFP_COMP,
-						get_order(size));
-
-	if (page)
-		return page_address(page);
-	else
-		return NULL;
-}
-
 #ifdef CONFIG_NUMA
 void *__kmalloc_node(size_t size, gfp_t flags, int node)
 {
 	struct kmem_cache *s;
 
-	if (unlikely(size > PAGE_SIZE))
-		return kmalloc_large_node(size, flags, node);
-
 	s = get_slab(size, flags);
 
 	if (unlikely(ZERO_OR_NULL_PTR(s)))
@@ -2746,11 +2733,8 @@ void kfree(const void *x)
 		return;
 
 	page = virt_to_head_page(x);
-	if (unlikely(!PageSlab(page))) {
-		BUG_ON(!PageCompound(page));
-		put_page(page);
+	if (unlikely(WARN_ON(!PageSlab(page)))) /* XXX */
 		return;
-	}
 	slab_free(page->slab, page, object, _RET_IP_);
 }
 EXPORT_SYMBOL(kfree);
@@ -2985,7 +2969,7 @@ void __init kmem_cache_init(void)
 		caches++;
 	}
 
-	for (i = KMALLOC_SHIFT_LOW; i <= PAGE_SHIFT; i++) {
+	for (i = KMALLOC_SHIFT_LOW; i <= KMALLOC_SHIFT_HIGH; i++) {
 		create_kmalloc_cache(&kmalloc_caches[i], "kmalloc", 1 << i, GFP_KERNEL);
 		caches++;
@@ -3022,7 +3006,7 @@ void __init kmem_cache_init(void)
 	slab_state = UP;
 
 	/* Provide the correct kmalloc names now that the caches are up */
-	for (i = KMALLOC_SHIFT_LOW; i <= PAGE_SHIFT; i++)
+	for (i = KMALLOC_SHIFT_LOW; i <= KMALLOC_SHIFT_HIGH; i++)
 		kmalloc_caches[i].name =
 			kasprintf(GFP_KERNEL, "kmalloc-%d", 1 << i);
 
@@ -3222,9 +3206,6 @@ void *__kmalloc_track_caller(size_t size, gfp_t gfpflags, unsigned long caller)
 {
 	struct kmem_cache *s;
 
-	if (unlikely(size > PAGE_SIZE))
-		return kmalloc_large(size, gfpflags);
-
 	s = get_slab(size, gfpflags);
 
 	if (unlikely(ZERO_OR_NULL_PTR(s)))
@@ -3238,9 +3219,6 @@ void *__kmalloc_node_track_caller(size_t size, gfp_t gfpflags,
 {
 	struct kmem_cache *s;
 
-	if (unlikely(size > PAGE_SIZE))
-		return kmalloc_large_node(size, gfpflags, node);
-
 	s = get_slab(size, gfpflags);
 
 	if (unlikely(ZERO_OR_NULL_PTR(s)))