Message ID: 1526555193-7242-1-git-send-email-ldufour@linux.vnet.ibm.com (mailing list archive)
Series: Speculative page faults
Some regressions and improvements were found by LKP-tools (Linux Kernel
Performance) on the V9 patch series, tested on an Intel 4s Skylake platform.
The regression results are sorted by the metric will-it-scale.per_thread_ops.

Branch: Laurent-Dufour/Speculative-page-faults/20180316-151833 (V9 patch series)
Commit id:
    base commit: d55f34411b1b126429a823d06c3124c16283231f
    head commit: 0355322b3577eeab7669066df42c550a56801110
Benchmark suite: will-it-scale
Download link: https://github.com/antonblanchard/will-it-scale/tree/master/tests
Metrics:
    will-it-scale.per_process_ops=processes/nr_cpu
    will-it-scale.per_thread_ops=threads/nr_cpu
test box: lkp-skl-4sp1 (nr_cpu=192, memory=768G)
THP: enable / disable
nr_task: 100%

1. Regressions:

a) THP enabled:
testcase                        base      change    head      metric
page_fault3/ enable THP         10092     -17.5%    8323      will-it-scale.per_thread_ops
page_fault2/ enable THP         8300      -17.2%    6869      will-it-scale.per_thread_ops
brk1/ enable THP                957.67    -7.6%     885       will-it-scale.per_thread_ops
page_fault3/ enable THP         172821    -5.3%     163692    will-it-scale.per_process_ops
signal1/ enable THP             9125      -3.2%     8834      will-it-scale.per_process_ops

b) THP disabled:
testcase                        base      change    head      metric
page_fault3/ disable THP        10107     -19.1%    8180      will-it-scale.per_thread_ops
page_fault2/ disable THP        8432      -17.8%    6931      will-it-scale.per_thread_ops
context_switch1/ disable THP    215389    -6.8%     200776    will-it-scale.per_thread_ops
brk1/ disable THP               939.67    -6.6%     877.33    will-it-scale.per_thread_ops
page_fault3/ disable THP        173145    -4.7%     165064    will-it-scale.per_process_ops
signal1/ disable THP            9162      -3.9%     8802      will-it-scale.per_process_ops

2. Improvements:

a) THP enabled:
testcase                        base      change     head      metric
malloc1/ enable THP             66.33     +469.8%    383.67    will-it-scale.per_thread_ops
writeseek3/ enable THP          2531      +4.5%      2646      will-it-scale.per_thread_ops
signal1/ enable THP             989.33    +2.8%      1016      will-it-scale.per_thread_ops

b) THP disabled:
testcase                        base      change     head      metric
malloc1/ disable THP            90.33     +417.3%    467.33    will-it-scale.per_thread_ops
read2/ disable THP              58934     +39.2%     82060     will-it-scale.per_thread_ops
page_fault1/ disable THP        8607      +36.4%     11736     will-it-scale.per_thread_ops
read1/ disable THP              314063    +12.7%     353934    will-it-scale.per_thread_ops
writeseek3/ disable THP         2452      +12.5%     2759      will-it-scale.per_thread_ops
signal1/ disable THP            971.33    +5.5%      1024      will-it-scale.per_thread_ops

Notes: for the values in the "change" column above, a higher value means the
testcase result on the head commit is better than on the base commit for this
benchmark.

Best regards,
Haiyan Song
On 28/05/2018 07:23, Song, HaiyanX wrote:
>
> Some regressions and improvements were found by LKP-tools (Linux Kernel
> Performance) on the V9 patch series, tested on an Intel 4s Skylake platform.

Hi,

Thanks for reporting these benchmark results, but you mentioned the "V9 patch
series" while responding to the v11 header series...
Were these tests done on v9 or v11?

Cheers,
Laurent.

> [...]
>
> ________________________________________
> From: owner-linux-mm@kvack.org [owner-linux-mm@kvack.org] on behalf of Laurent Dufour [ldufour@linux.vnet.ibm.com]
> Sent: Thursday, May 17, 2018 7:06 PM
> To: akpm@linux-foundation.org; mhocko@kernel.org; peterz@infradead.org; kirill@shutemov.name; ak@linux.intel.com; dave@stgolabs.net; jack@suse.cz; Matthew Wilcox; khandual@linux.vnet.ibm.com; aneesh.kumar@linux.vnet.ibm.com; benh@kernel.crashing.org; mpe@ellerman.id.au; paulus@samba.org; Thomas Gleixner; Ingo Molnar; hpa@zytor.com; Will Deacon; Sergey Senozhatsky; sergey.senozhatsky.work@gmail.com; Andrea Arcangeli; Alexei Starovoitov; Wang, Kemi; Daniel Jordan; David Rientjes; Jerome Glisse; Ganesh Mahendran; Minchan Kim; Punit Agrawal; vinayak menon; Yang Shi
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org; haren@linux.vnet.ibm.com; npiggin@gmail.com; bsingharora@gmail.com; paulmck@linux.vnet.ibm.com; Tim Chen; linuxppc-dev@lists.ozlabs.org; x86@kernel.org
> Subject: [PATCH v11 00/26] Speculative page faults
>
> This is a port on kernel 4.17 of the work done by Peter Zijlstra to handle
> page faults without holding the mm semaphore [1].
>
> The idea is to try to handle user space page faults without holding the
> mmap_sem. This should allow better concurrency for massively threaded
> processes, since the page fault handler will not wait for other threads'
> memory layout changes to complete, assuming that the change is happening in
> another part of the process's memory space. This type of page fault is
> named a speculative page fault. If the speculative page fault fails,
> because a concurrent change is detected or because the underlying PMD or
> PTE tables are not yet allocated, its processing is abandoned and a
> classic page fault is then tried.
>
> The speculative page fault (SPF) has to look for the VMA matching the
> fault address without holding the mmap_sem; this is done by introducing a
> rwlock which protects the access to the mm_rb tree. Previously this was
> done using SRCU, but it was introducing a lot of scheduling to process the
> VMA freeing operations, which was hitting the performance by 20% as
> reported by Kemi Wang [2]. Using a rwlock to protect access to the mm_rb
> tree limits the locking contention to these operations, which are expected
> to be O(log n). In addition, to ensure that the VMA is not freed behind
> our back, a reference count is added and two services (get_vma() and
> put_vma()) are introduced to handle the reference count. Once a VMA is
> fetched from the RB tree using get_vma(), it must later be released using
> put_vma(). With this scheme, I no longer see the overhead I got with the
> will-it-scale benchmark.
>
> The VMA's attributes checked during the speculative page fault processing
> have to be protected against parallel changes. This is done by using a per
> VMA sequence lock. This sequence lock allows the speculative page fault
> handler to quickly check for parallel changes in progress and to abort the
> speculative page fault in that case.
>
> Once the VMA has been found, the speculative page fault handler checks the
> VMA's attributes to verify whether the page fault can be handled this way.
> Thus, the VMA is protected through a sequence lock which allows fast
> detection of concurrent VMA changes. If such a change is detected, the
> speculative page fault is aborted and a *classic* page fault is tried
> instead. VMA sequence locking is added where the VMA attributes which are
> checked during the page fault are modified.
>
> When the PTE is fetched, the VMA is checked to see if it has been changed,
> so once the page table is locked the VMA is known to be valid; any other
> change touching this PTE will need to lock the page table, so no parallel
> change is possible at this time.
>
> The locking of the PTE is done with interrupts disabled; this allows
> checking the PMD to ensure that there is no ongoing collapsing operation.
> Since khugepaged first sets the PMD to pmd_none and then waits for the
> other CPUs to have caught the IPI interrupt, if the pmd is valid at the
> time the PTE is locked, we have the guarantee that the collapsing
> operation will have to wait on the PTE lock to move forward. This allows
> the SPF handler to map the PTE safely. If the PMD value is different from
> the one recorded at the beginning of the SPF operation, the classic page
> fault handler will be called to handle the operation while holding the
> mmap_sem. As the PTE lock is taken with interrupts disabled, it is taken
> using spin_trylock() to avoid deadlock when handling a page fault while a
> TLB invalidate is requested by another CPU holding the PTE.
>
> In pseudo code, this could be seen as:
>     speculative_page_fault()
>     {
>             vma = get_vma()
>             check vma sequence count
>             check vma's support
>             disable interrupt
>                     check pgd,p4d,...,pte
>                     save pmd and pte in vmf
>                     save vma sequence counter in vmf
>             enable interrupt
>             check vma sequence count
>             handle_pte_fault(vma)
>                     ..
>                     page = alloc_page()
>                     pte_map_lock()
>                             disable interrupt
>                                     abort if sequence counter has changed
>                                     abort if pmd or pte has changed
>                                     pte map and lock
>                             enable interrupt
>                     if abort
>                             free page
>                             abort
>                     ...
>     }
>
>     arch_fault_handler()
>     {
>             if (speculative_page_fault(&vma))
>                     goto done
>     again:
>             lock(mmap_sem)
>             vma = find_vma();
>             handle_pte_fault(vma);
>             if retry
>                     unlock(mmap_sem)
>                     goto again;
>     done:
>             handle fault error
>     }
>
> Support for THP is not done because when checking for the PMD, we can be
> confused by an in-progress collapsing operation done by khugepaged. The
> issue is that pmd_none() could be true either if the PMD is not yet
> populated or if the underlying PTEs are about to be collapsed. So we
> cannot safely allocate a PMD if pmd_none() is true.
>
> This series adds a new software performance event named
> 'speculative-faults' or 'spf'. It counts the number of page fault events
> handled speculatively. When recording 'faults,spf' events, the 'faults'
> event counts the total number of page fault events while 'spf' only counts
> the part of the faults processed speculatively.
>
> There are some trace events introduced by this series. They allow
> identifying why the page faults were not processed speculatively. This
> doesn't take into account the faults generated by a monothreaded process,
> which are directly processed while holding the mmap_sem. These trace
> events are grouped in a system named 'pagefault'; they are:
>  - pagefault:spf_vma_changed : the VMA has been changed behind our back
>  - pagefault:spf_vma_noanon : the vma->anon_vma field was not yet set
>  - pagefault:spf_vma_notsup : the VMA's type is not supported
>  - pagefault:spf_vma_access : the VMA's access rights are not respected
>  - pagefault:spf_pmd_changed : the upper PMD pointer has changed behind
>    our back
>
> To record all the related events, the easiest way is to run perf with the
> following arguments:
>     $ perf stat -e 'faults,spf,pagefault:*' <command>
>
> There is also a dedicated vmstat counter showing the number of successful
> page faults handled speculatively. It can be seen this way:
>     $ grep speculative_pgfault /proc/vmstat
>
> This series builds on top of v4.16-mmotm-2018-04-13-17-28 and is
> functional on x86, PowerPC and arm64.
>
> ---------------------
> Real Workload results
>
> As mentioned in a previous email, we did non-official runs using a
> "popular in-memory multithreaded database product" on a 176-core SMT8
> Power system which showed a 30% improvement in the number of transactions
> processed per second. This run was done on the v6 series, but the changes
> introduced in this new version should not impact the performance boost
> seen.
>
> Here are the perf data captured during 2 of these runs on top of the v8
> series:
>                 vanilla         spf
> faults          89.418          101.364         +13%
> spf                n/a           97.989
>
> With the SPF kernel, most of the page faults were processed in a
> speculative way.
>
> Ganesh Mahendran had backported the series on top of a 4.9 kernel and gave
> it a try on an Android device. He reported that the application launch
> time was improved on average by 6%, and for large applications (~100
> threads) by 20%.
>
> Here are the launch times Ganesh measured on Android 8.0 on top of a Qcom
> MSM845 (8 cores) with 6GB (lower is better):
>
> Application                             4.9     4.9+spf  delta
> com.tencent.mm                          416     389      -7%
> com.eg.android.AlipayGphone             1135    986      -13%
> com.tencent.mtt                         455     454      0%
> com.qqgame.hlddz                        1497    1409     -6%
> com.autonavi.minimap                    711     701      -1%
> com.tencent.tmgp.sgame                  788     748      -5%
> com.immomo.momo                         501     487      -3%
> com.tencent.peng                        2145    2112     -2%
> com.smile.gifmaker                      491     461      -6%
> com.baidu.BaiduMap                      479     366      -23%
> com.taobao.taobao                       1341    1198     -11%
> com.baidu.searchbox                     333     314      -6%
> com.tencent.mobileqq                    394     384      -3%
> com.sina.weibo                          907     906      0%
> com.youku.phone                         816     731      -11%
> com.happyelements.AndroidAnimal.qq      763     717      -6%
> com.UCMobile                            415     411      -1%
> com.tencent.tmgp.ak                     1464    1431     -2%
> com.tencent.qqmusic                     336     329      -2%
> com.sankuai.meituan                     1661    1302     -22%
> com.netease.cloudmusic                  1193    1200     1%
> air.tv.douyu.android                    4257    4152     -2%
>
> ------------------
> Benchmarks results
>
> Base kernel is v4.17.0-rc4-mm1
> SPF is BASE + this series
>
> Kernbench:
> ----------
> Here are the results on a 16-CPU x86 guest using kernbench on a 4.15
> kernel (the kernel is built 5 times):
>
> Average Half load -j 8
>                         Run (std deviation)
>                         BASE                    SPF
> Elapsed Time            1448.65 (5.72312)       1455.84 (4.84951)       0.50%
> User Time               10135.4 (30.3699)       10148.8 (31.1252)       0.13%
> System Time             900.47 (2.81131)        923.28 (7.52779)        2.53%
> Percent CPU             761.4 (1.14018)         760.2 (0.447214)        -0.16%
> Context Switches        85380 (3419.52)         84748 (1904.44)         -0.74%
> Sleeps                  105064 (1240.96)        105074 (337.612)        0.01%
>
> Average Optimal load -j 16
>                         Run (std deviation)
>                         BASE                    SPF
> Elapsed Time            920.528 (10.1212)       927.404 (8.91789)       0.75%
> User Time               11064.8 (981.142)       11085 (990.897)         0.18%
> System Time             979.904 (84.0615)       1001.14 (82.5523)       2.17%
> Percent CPU             1089.5 (345.894)        1086.1 (343.545)        -0.31%
> Context Switches        159488 (78156.4)        158223 (77472.1)        -0.79%
> Sleeps                  110566 (5877.49)        110388 (5617.75)        -0.16%
>
> During a run on the SPF, perf events were captured:
>     Performance counter stats for '../kernbench -M':
>          526743764      faults
>                210      spf
>                  3      pagefault:spf_vma_changed
>                  0      pagefault:spf_vma_noanon
>               2278      pagefault:spf_vma_notsup
>                  0      pagefault:spf_vma_access
>                  0      pagefault:spf_pmd_changed
>
> Very few speculative page faults were recorded as most of the processes
> involved are monothreaded (it seems that on this architecture some threads
> were created during the kernel build processing).
>
> Here are the kernbench results on an 80-CPU Power8 system:
>
> Average Half load -j 40
>                         Run (std deviation)
>                         BASE                    SPF
> Elapsed Time            117.152 (0.774642)      117.166 (0.476057)      0.01%
> User Time               4478.52 (24.7688)       4479.76 (9.08555)       0.03%
> System Time             131.104 (0.720056)      134.04 (0.708414)       2.24%
> Percent CPU             3934 (19.7104)          3937.2 (19.0184)        0.08%
> Context Switches        92125.4 (576.787)       92581.6 (198.622)       0.50%
> Sleeps                  317923 (652.499)        318469 (1255.59)        0.17%
>
> Average Optimal load -j 80
>                         Run (std deviation)
>                         BASE                    SPF
> Elapsed Time            107.73 (0.632416)       107.31 (0.584936)       -0.39%
> User Time               5869.86 (1466.72)       5871.71 (1467.27)       0.03%
> System Time             153.728 (23.8573)       157.153 (24.3704)       2.23%
> Percent CPU             5418.6 (1565.17)        5436.7 (1580.91)        0.33%
> Context Switches        223861 (138865)         225032 (139632)         0.52%
> Sleeps                  330529 (13495.1)        332001 (14746.2)        0.45%
>
> During a run on the SPF, perf events were captured:
>     Performance counter stats for '../kernbench -M':
>          116730856      faults
>                  0      spf
>                  3      pagefault:spf_vma_changed
>                  0      pagefault:spf_vma_noanon
>                476      pagefault:spf_vma_notsup
>                  0      pagefault:spf_vma_access
>                  0      pagefault:spf_pmd_changed
>
> Most of the processes involved are monothreaded, so SPF is not activated,
> but there is no impact on the performance.
>
> Ebizzy:
> -------
> The test counts the number of records per second it can manage. I run it
> like this: 'ebizzy -mTt <nrcpus>'. To get a consistent result I repeated
> the test 100 times and measured the average result.
> The number reported is records processed per second; higher is better.
>
>                         BASE            SPF             delta
> 16 CPUs x86 VM          742.57          1490.24         100.69%
> 80 CPUs P8 node         13105.4         24174.23        84.46%
>
> Here are the performance counters read during a run on a 16-CPU x86 VM:
>     Performance counter stats for './ebizzy -mTt 16':
>            1706379      faults
>            1674599      spf
>              30588      pagefault:spf_vma_changed
>                  0      pagefault:spf_vma_noanon
>                363      pagefault:spf_vma_notsup
>                  0      pagefault:spf_vma_access
>                  0      pagefault:spf_pmd_changed
>
> And the ones captured during a run on an 80-CPU Power node:
>     Performance counter stats for './ebizzy -mTt 80':
>            1874773      faults
>            1461153      spf
>             413293      pagefault:spf_vma_changed
>                  0      pagefault:spf_vma_noanon
>                200      pagefault:spf_vma_notsup
>                  0      pagefault:spf_vma_access
>                  0      pagefault:spf_pmd_changed
>
> In ebizzy's case most of the page faults were handled in a speculative
> way, leading to the ebizzy performance boost.
>
> ------------------
> Changes since v10 (https://lkml.org/lkml/2018/4/17/572):
>  - Accounted for all review feedback from Punit Agrawal, Ganesh Mahendran
>    and Minchan Kim, hopefully.
>  - Removed an unneeded check on CONFIG_SPECULATIVE_PAGE_FAULT in
>    __do_page_fault().
>  - Loop in pte_spinlock() and pte_map_lock() when the pte try-lock fails
>    instead of aborting the speculative page fault handling. Dropped the
>    now useless trace event pagefault:spf_pte_lock.
>  - No more trying to reuse the fetched VMA during the speculative page
>    fault handling when retrying is needed. This added a lot of complexity
>    and additional tests done didn't show a significant performance
>    improvement.
>  - Converted IS_ENABLED(CONFIG_NUMA) back to #ifdef due to a build error.
>
> [1] http://linux-kernel.2935.n7.nabble.com/RFC-PATCH-0-6-Another-go-at-speculative-page-faults-tt965642.html#none
> [2] https://patchwork.kernel.org/patch/9999687/
>
> Laurent Dufour (20):
>   mm: introduce CONFIG_SPECULATIVE_PAGE_FAULT
>   x86/mm: define ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT
>   powerpc/mm: set ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT
>   mm: introduce pte_spinlock for FAULT_FLAG_SPECULATIVE
>   mm: make pte_unmap_same compatible with SPF
>   mm: introduce INIT_VMA()
>   mm: protect VMA modifications using VMA sequence count
>   mm: protect mremap() against SPF hanlder
>   mm: protect SPF handler against anon_vma changes
>   mm: cache some VMA fields in the vm_fault structure
>   mm/migrate: Pass vm_fault pointer to migrate_misplaced_page()
>   mm: introduce __lru_cache_add_active_or_unevictable
>   mm: introduce __vm_normal_page()
>   mm: introduce __page_add_new_anon_rmap()
>   mm: protect mm_rb tree with a rwlock
>   mm: adding speculative page fault failure trace events
>   perf: add a speculative page fault sw event
>   perf tools: add support for the SPF perf event
>   mm: add speculative page fault vmstats
>   powerpc/mm: add speculative page fault
>
> Mahendran Ganesh (2):
>   arm64/mm: define ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT
>   arm64/mm: add speculative page fault
>
> Peter Zijlstra (4):
>   mm: prepare for FAULT_FLAG_SPECULATIVE
>   mm: VMA sequence count
>   mm: provide speculative fault infrastructure
>   x86/mm: add speculative pagefault handling
>
>  arch/arm64/Kconfig                    |   1 +
>  arch/arm64/mm/fault.c                 |  12 +
>  arch/powerpc/Kconfig                  |   1 +
>  arch/powerpc/mm/fault.c               |  16 +
>  arch/x86/Kconfig                      |   1 +
>  arch/x86/mm/fault.c                   |  27 +-
>  fs/exec.c                             |   2 +-
>  fs/proc/task_mmu.c                    |   5 +-
>  fs/userfaultfd.c                      |  17 +-
>  include/linux/hugetlb_inline.h        |   2 +-
>  include/linux/migrate.h               |   4 +-
>  include/linux/mm.h                    | 136 +++++++-
>  include/linux/mm_types.h              |   7 +
>  include/linux/pagemap.h               |   4 +-
>  include/linux/rmap.h                  |  12 +-
>  include/linux/swap.h                  |  10 +-
>  include/linux/vm_event_item.h         |   3 +
>  include/trace/events/pagefault.h      |  80 +++++
>  include/uapi/linux/perf_event.h       |   1 +
>  kernel/fork.c                         |   5 +-
>  mm/Kconfig                            |  22 ++
>  mm/huge_memory.c                      |   6 +-
>  mm/hugetlb.c                          |   2 +
>  mm/init-mm.c                          |   3 +
>  mm/internal.h                         |  20 ++
>  mm/khugepaged.c                       |   5 +
>  mm/madvise.c                          |   6 +-
>  mm/memory.c                           | 612 +++++++++++++++++++++++++++++-----
>  mm/mempolicy.c                        |  51 ++-
>  mm/migrate.c                          |   6 +-
>  mm/mlock.c                            |  13 +-
>  mm/mmap.c                             | 229 ++++++++++---
>  mm/mprotect.c                         |   4 +-
>  mm/mremap.c                           |  13 +
>  mm/nommu.c                            |   2 +-
>  mm/rmap.c                             |   5 +-
>  mm/swap.c                             |   6 +-
>  mm/swap_state.c                       |   8 +-
>  mm/vmstat.c                           |   5 +-
>  tools/include/uapi/linux/perf_event.h |   1 +
>  tools/perf/util/evsel.c               |   1 +
>  tools/perf/util/parse-events.c        |   4 +
>  tools/perf/util/parse-events.l        |   1 +
>  tools/perf/util/python.c              |   1 +
>  44 files changed, 1161 insertions(+), 211 deletions(-)
>  create mode 100644 include/trace/events/pagefault.h
>
> --
> 2.7.4
Hi Laurent, Yes, these tests are done on V9 patch. Best regards, Haiyan Song On Mon, May 28, 2018 at 09:51:34AM +0200, Laurent Dufour wrote: > On 28/05/2018 07:23, Song, HaiyanX wrote: > > > > Some regression and improvements is found by LKP-tools(linux kernel performance) on V9 patch series > > tested on Intel 4s Skylake platform. > > Hi, > > Thanks for reporting this benchmark results, but you mentioned the "V9 patch > series" while responding to the v11 header series... > Were these tests done on v9 or v11 ? > > Cheers, > Laurent. > > > > > The regression result is sorted by the metric will-it-scale.per_thread_ops. > > Branch: Laurent-Dufour/Speculative-page-faults/20180316-151833 (V9 patch series) > > Commit id: > > base commit: d55f34411b1b126429a823d06c3124c16283231f > > head commit: 0355322b3577eeab7669066df42c550a56801110 > > Benchmark suite: will-it-scale > > Download link: > > https://github.com/antonblanchard/will-it-scale/tree/master/tests > > Metrics: > > will-it-scale.per_process_ops=processes/nr_cpu > > will-it-scale.per_thread_ops=threads/nr_cpu > > test box: lkp-skl-4sp1(nr_cpu=192,memory=768G) > > THP: enable / disable > > nr_task: 100% > > > > 1. 
Regressions: > > a) THP enabled: > > testcase base change head metric > > page_fault3/ enable THP 10092 -17.5% 8323 will-it-scale.per_thread_ops > > page_fault2/ enable THP 8300 -17.2% 6869 will-it-scale.per_thread_ops > > brk1/ enable THP 957.67 -7.6% 885 will-it-scale.per_thread_ops > > page_fault3/ enable THP 172821 -5.3% 163692 will-it-scale.per_process_ops > > signal1/ enable THP 9125 -3.2% 8834 will-it-scale.per_process_ops > > > > b) THP disabled: > > testcase base change head metric > > page_fault3/ disable THP 10107 -19.1% 8180 will-it-scale.per_thread_ops > > page_fault2/ disable THP 8432 -17.8% 6931 will-it-scale.per_thread_ops > > context_switch1/ disable THP 215389 -6.8% 200776 will-it-scale.per_thread_ops > > brk1/ disable THP 939.67 -6.6% 877.33 will-it-scale.per_thread_ops > > page_fault3/ disable THP 173145 -4.7% 165064 will-it-scale.per_process_ops > > signal1/ disable THP 9162 -3.9% 8802 will-it-scale.per_process_ops > > > > 2. Improvements: > > a) THP enabled: > > testcase base change head metric > > malloc1/ enable THP 66.33 +469.8% 383.67 will-it-scale.per_thread_ops > > writeseek3/ enable THP 2531 +4.5% 2646 will-it-scale.per_thread_ops > > signal1/ enable THP 989.33 +2.8% 1016 will-it-scale.per_thread_ops > > > > b) THP disabled: > > testcase base change head metric > > malloc1/ disable THP 90.33 +417.3% 467.33 will-it-scale.per_thread_ops > > read2/ disable THP 58934 +39.2% 82060 will-it-scale.per_thread_ops > > page_fault1/ disable THP 8607 +36.4% 11736 will-it-scale.per_thread_ops > > read1/ disable THP 314063 +12.7% 353934 will-it-scale.per_thread_ops > > writeseek3/ disable THP 2452 +12.5% 2759 will-it-scale.per_thread_ops > > signal1/ disable THP 971.33 +5.5% 1024 will-it-scale.per_thread_ops > > > > Notes: for above values in column "change", the higher value means that the related testcase result > > on head commit is better than that on base commit for this benchmark. 
> > > > > > Best regards > > Haiyan Song > > > > ________________________________________ > > From: owner-linux-mm@kvack.org [owner-linux-mm@kvack.org] on behalf of Laurent Dufour [ldufour@linux.vnet.ibm.com] > > Sent: Thursday, May 17, 2018 7:06 PM > > To: akpm@linux-foundation.org; mhocko@kernel.org; peterz@infradead.org; kirill@shutemov.name; ak@linux.intel.com; dave@stgolabs.net; jack@suse.cz; Matthew Wilcox; khandual@linux.vnet.ibm.com; aneesh.kumar@linux.vnet.ibm.com; benh@kernel.crashing.org; mpe@ellerman.id.au; paulus@samba.org; Thomas Gleixner; Ingo Molnar; hpa@zytor.com; Will Deacon; Sergey Senozhatsky; sergey.senozhatsky.work@gmail.com; Andrea Arcangeli; Alexei Starovoitov; Wang, Kemi; Daniel Jordan; David Rientjes; Jerome Glisse; Ganesh Mahendran; Minchan Kim; Punit Agrawal; vinayak menon; Yang Shi > > Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org; haren@linux.vnet.ibm.com; npiggin@gmail.com; bsingharora@gmail.com; paulmck@linux.vnet.ibm.com; Tim Chen; linuxppc-dev@lists.ozlabs.org; x86@kernel.org > > Subject: [PATCH v11 00/26] Speculative page faults > > > > This is a port on kernel 4.17 of the work done by Peter Zijlstra to handle > > page fault without holding the mm semaphore [1]. > > > > The idea is to try to handle user space page faults without holding the > > mmap_sem. This should allow better concurrency for massively threaded > > process since the page fault handler will not wait for other threads memory > > layout change to be done, assuming that this change is done in another part > > of the process's memory space. This type page fault is named speculative > > page fault. If the speculative page fault fails because of a concurrency is > > detected or because underlying PMD or PTE tables are not yet allocating, it > > is failing its processing and a classic page fault is then tried. 
> > > > The speculative page fault (SPF) has to look for the VMA matching the fault > > address without holding the mmap_sem, this is done by introducing a rwlock > > which protects the access to the mm_rb tree. Previously this was done using > > SRCU but it was introducing a lot of scheduling to process the VMA's > > freeing operation which was hitting the performance by 20% as reported by > > Kemi Wang [2]. Using a rwlock to protect access to the mm_rb tree is > > limiting the locking contention to these operations which are expected to > > be in a O(log n) order. In addition to ensure that the VMA is not freed in > > our back a reference count is added and 2 services (get_vma() and > > put_vma()) are introduced to handle the reference count. Once a VMA is > > fetched from the RB tree using get_vma(), it must be later freed using > > put_vma(). I can't see anymore the overhead I got while will-it-scale > > benchmark anymore. > > > > The VMA's attributes checked during the speculative page fault processing > > have to be protected against parallel changes. This is done by using a per > > VMA sequence lock. This sequence lock allows the speculative page fault > > handler to fast check for parallel changes in progress and to abort the > > speculative page fault in that case. > > > > Once the VMA has been found, the speculative page fault handler would check > > for the VMA's attributes to verify that the page fault has to be handled > > correctly or not. Thus, the VMA is protected through a sequence lock which > > allows fast detection of concurrent VMA changes. If such a change is > > detected, the speculative page fault is aborted and a *classic* page fault > > is tried. VMA sequence lockings are added when VMA attributes which are > > checked during the page fault are modified. 
> > > > When the PTE is fetched, the VMA is checked to see if it has been changed, > > so once the page table is locked, the VMA is valid, so any other changes > > leading to touching this PTE will need to lock the page table, so no > > parallel change is possible at this time. > > > > The locking of the PTE is done with interrupts disabled, this allows > > checking for the PMD to ensure that there is not an ongoing collapsing > > operation. Since khugepaged is firstly set the PMD to pmd_none and then is > > waiting for the other CPU to have caught the IPI interrupt, if the pmd is > > valid at the time the PTE is locked, we have the guarantee that the > > collapsing operation will have to wait on the PTE lock to move forward. > > This allows the SPF handler to map the PTE safely. If the PMD value is > > different from the one recorded at the beginning of the SPF operation, the > > classic page fault handler will be called to handle the operation while > > holding the mmap_sem. As the PTE lock is done with the interrupts disabled, > > the lock is done using spin_trylock() to avoid dead lock when handling a > > page fault while a TLB invalidate is requested by another CPU holding the > > PTE. > > > > In pseudo code, this could be seen as: > > speculative_page_fault() > > { > > vma = get_vma() > > check vma sequence count > > check vma's support > > disable interrupt > > check pgd,p4d,...,pte > > save pmd and pte in vmf > > save vma sequence counter in vmf > > enable interrupt > > check vma sequence count > > handle_pte_fault(vma) > > .. > > page = alloc_page() > > pte_map_lock() > > disable interrupt > > abort if sequence counter has changed > > abort if pmd or pte has changed > > pte map and lock > > enable interrupt > > if abort > > free page > > abort > > ... 
> > } > > > > arch_fault_handler() > > { > > if (speculative_page_fault(&vma)) > > goto done > > again: > > lock(mmap_sem) > > vma = find_vma(); > > handle_pte_fault(vma); > > if retry > > unlock(mmap_sem) > > goto again; > > done: > > handle fault error > > } > > > > Support for THP is not done because when checking for the PMD, we can be > > confused by an in progress collapsing operation done by khugepaged. The > > issue is that pmd_none() could be true either if the PMD is not already > > populated or if the underlying PTE are in the way to be collapsed. So we > > cannot safely allocate a PMD if pmd_none() is true. > > > > This series add a new software performance event named 'speculative-faults' > > or 'spf'. It counts the number of successful page fault event handled > > speculatively. When recording 'faults,spf' events, the faults one is > > counting the total number of page fault events while 'spf' is only counting > > the part of the faults processed speculatively. > > > > There are some trace events introduced by this series. They allow > > identifying why the page faults were not processed speculatively. This > > doesn't take in account the faults generated by a monothreaded process > > which directly processed while holding the mmap_sem. This trace events are > > grouped in a system named 'pagefault', they are: > > - pagefault:spf_vma_changed : if the VMA has been changed in our back > > - pagefault:spf_vma_noanon : the vma->anon_vma field was not yet set. > > - pagefault:spf_vma_notsup : the VMA's type is not supported > > - pagefault:spf_vma_access : the VMA's access right are not respected > > - pagefault:spf_pmd_changed : the upper PMD pointer has changed in our > > back. > > > > To record all the related events, the easier is to run perf with the > > following arguments : > > $ perf stat -e 'faults,spf,pagefault:*' <command> > > > > There is also a dedicated vmstat counter showing the number of successful > > page fault handled speculatively. 
> > It can be seen this way:
> >  $ grep speculative_pgfault /proc/vmstat
> >
> > This series builds on top of v4.16-mmotm-2018-04-13-17-28 and is
> > functional on x86, PowerPC and arm64.
> >
> > ---------------------
> > Real Workload results
> >
> > As mentioned in a previous email, we did unofficial runs using a
> > "popular in-memory multithreaded database product" on a 176-core SMT8
> > Power system which showed a 30% improvement in the number of
> > transactions processed per second. This run was done on the v6 series,
> > but the changes introduced in this new version should not impact the
> > performance boost seen.
> >
> > Here are the perf data captured during 2 of these runs on top of the
> > v8 series:
> >                 vanilla         spf
> > faults          89.418          101.364         +13%
> > spf                n/a           97.989
> >
> > With the SPF kernel, most of the page faults were processed in a
> > speculative way.
> >
> > Ganesh Mahendran backported the series on top of a 4.9 kernel and gave
> > it a try on an Android device. He reported that the application launch
> > time was improved on average by 6%, and for large applications (~100
> > threads) by 20%.
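[Editor's note] The "check vma sequence count" steps in the pseudo code above are the classic sequence-count read/retry pattern: a reader snapshots the counter, does its speculative work, and aborts if the counter moved or was odd (a write in progress). Here is a minimal user-space C11 sketch of that pattern; all names (`vma_sketch`, `vma_read_retry`, ...) are hypothetical stand-ins, not the kernel's seqcount API.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a VMA protected by a sequence count. */
struct vma_sketch {
        atomic_uint seq;        /* even = stable, odd = write in progress */
        unsigned long vm_flags; /* example attribute being protected */
};

/* Writer side: bump to odd before modifying, back to even after. */
static void vma_write_begin(struct vma_sketch *v)
{
        atomic_fetch_add_explicit(&v->seq, 1, memory_order_release);
}

static void vma_write_end(struct vma_sketch *v)
{
        atomic_fetch_add_explicit(&v->seq, 1, memory_order_release);
}

/* Reader side: snapshot the counter before the speculative work... */
static unsigned vma_read_begin(struct vma_sketch *v)
{
        return atomic_load_explicit(&v->seq, memory_order_acquire);
}

/* ...and abort if the snapshot was odd or the counter has moved. */
static bool vma_read_retry(struct vma_sketch *v, unsigned snap)
{
        return (snap & 1) ||
               atomic_load_explicit(&v->seq, memory_order_acquire) != snap;
}
```

In the SPF handler this check runs twice (before and after walking the page tables), so any concurrent VMA modification forces a fall back to the classic, mmap_sem-protected path.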
> >
> > Here are the launch times Ganesh measured on Android 8.0 on top of a
> > Qcom MSM845 (8 cores) with 6GB of memory (lower is better):
> >
> > Application                          4.9     4.9+spf delta
> > com.tencent.mm                       416     389     -7%
> > com.eg.android.AlipayGphone          1135    986     -13%
> > com.tencent.mtt                      455     454     0%
> > com.qqgame.hlddz                     1497    1409    -6%
> > com.autonavi.minimap                 711     701     -1%
> > com.tencent.tmgp.sgame               788     748     -5%
> > com.immomo.momo                      501     487     -3%
> > com.tencent.peng                     2145    2112    -2%
> > com.smile.gifmaker                   491     461     -6%
> > com.baidu.BaiduMap                   479     366     -23%
> > com.taobao.taobao                    1341    1198    -11%
> > com.baidu.searchbox                  333     314     -6%
> > com.tencent.mobileqq                 394     384     -3%
> > com.sina.weibo                       907     906     0%
> > com.youku.phone                      816     731     -11%
> > com.happyelements.AndroidAnimal.qq   763     717     -6%
> > com.UCMobile                         415     411     -1%
> > com.tencent.tmgp.ak                  1464    1431    -2%
> > com.tencent.qqmusic                  336     329     -2%
> > com.sankuai.meituan                  1661    1302    -22%
> > com.netease.cloudmusic               1193    1200    1%
> > air.tv.douyu.android                 4257    4152    -2%
> >
> > ------------------
> > Benchmarks results
> >
> > Base kernel is v4.17.0-rc4-mm1
> > SPF is BASE + this series
> >
> > Kernbench:
> > ----------
> > Here are the results on a 16-CPU x86 guest using kernbench on a 4.15
> > kernel (the kernel is built 5 times):
> >
> > Average Half load -j 8
> >                      Run    (std deviation)
> >                      BASE                   SPF
> > Elapsed Time         1448.65 (5.72312)      1455.84 (4.84951)      0.50%
> > User Time            10135.4 (30.3699)      10148.8 (31.1252)      0.13%
> > System Time          900.47  (2.81131)      923.28  (7.52779)      2.53%
> > Percent CPU          761.4   (1.14018)      760.2   (0.447214)     -0.16%
> > Context Switches     85380   (3419.52)      84748   (1904.44)      -0.74%
> > Sleeps               105064  (1240.96)      105074  (337.612)      0.01%
> >
> > Average Optimal load -j 16
> >                      Run    (std deviation)
> >                      BASE                   SPF
> > Elapsed Time         920.528 (10.1212)      927.404 (8.91789)      0.75%
> > User Time            11064.8 (981.142)      11085   (990.897)      0.18%
> > System Time          979.904 (84.0615)      1001.14 (82.5523)      2.17%
> > Percent CPU          1089.5  (345.894)      1086.1  (343.545)      -0.31%
> > Context Switches     159488  (78156.4)      158223  (77472.1)
> >                                                            -0.79%
> > Sleeps               110566  (5877.49)      110388  (5617.75)      -0.16%
> >
> > During a run on the SPF kernel, perf events were captured:
> >  Performance counter stats for '../kernbench -M':
> >          526743764      faults
> >                210      spf
> >                  3      pagefault:spf_vma_changed
> >                  0      pagefault:spf_vma_noanon
> >               2278      pagefault:spf_vma_notsup
> >                  0      pagefault:spf_vma_access
> >                  0      pagefault:spf_pmd_changed
> >
> > Very few speculative page faults were recorded, as most of the
> > processes involved are monothreaded (it seems that on this
> > architecture some threads were created during the kernel build
> > processing).
> >
> > Here are the kernbench results on an 80-CPU Power8 system:
> >
> > Average Half load -j 40
> >                      Run    (std deviation)
> >                      BASE                   SPF
> > Elapsed Time         117.152 (0.774642)     117.166 (0.476057)     0.01%
> > User Time            4478.52 (24.7688)      4479.76 (9.08555)      0.03%
> > System Time          131.104 (0.720056)     134.04  (0.708414)     2.24%
> > Percent CPU          3934    (19.7104)      3937.2  (19.0184)      0.08%
> > Context Switches     92125.4 (576.787)      92581.6 (198.622)      0.50%
> > Sleeps               317923  (652.499)      318469  (1255.59)      0.17%
> >
> > Average Optimal load -j 80
> >                      Run    (std deviation)
> >                      BASE                   SPF
> > Elapsed Time         107.73  (0.632416)     107.31  (0.584936)     -0.39%
> > User Time            5869.86 (1466.72)      5871.71 (1467.27)      0.03%
> > System Time          153.728 (23.8573)      157.153 (24.3704)      2.23%
> > Percent CPU          5418.6  (1565.17)      5436.7  (1580.91)      0.33%
> > Context Switches     223861  (138865)       225032  (139632)       0.52%
> > Sleeps               330529  (13495.1)      332001  (14746.2)      0.45%
> >
> > During a run on the SPF kernel, perf events were captured:
> >  Performance counter stats for '../kernbench -M':
> >          116730856      faults
> >                  0      spf
> >                  3      pagefault:spf_vma_changed
> >                  0      pagefault:spf_vma_noanon
> >                476      pagefault:spf_vma_notsup
> >                  0      pagefault:spf_vma_access
> >                  0      pagefault:spf_pmd_changed
> >
> > Most of the processes involved are monothreaded, so SPF is not
> > activated, but there is no impact on the performance.
> >
> > Ebizzy:
> > -------
> > The test counts the number of records per second it can manage; higher
> > is better. I run it like this: 'ebizzy -mTt <nrcpus>'. To get a
> > consistent result I repeated the test 100 times and measured the
> > average.
> >
> >                 BASE            SPF             delta
> > 16 CPUs x86 VM  742.57          1490.24         100.69%
> > 80 CPUs P8 node 13105.4         24174.23        84.46%
> >
> > Here are the performance counters read during a run on a 16-CPU x86 VM:
> >  Performance counter stats for './ebizzy -mTt 16':
> >            1706379      faults
> >            1674599      spf
> >              30588      pagefault:spf_vma_changed
> >                  0      pagefault:spf_vma_noanon
> >                363      pagefault:spf_vma_notsup
> >                  0      pagefault:spf_vma_access
> >                  0      pagefault:spf_pmd_changed
> >
> > And the ones captured during a run on an 80-CPU Power node:
> >  Performance counter stats for './ebizzy -mTt 80':
> >            1874773      faults
> >            1461153      spf
> >             413293      pagefault:spf_vma_changed
> >                  0      pagefault:spf_vma_noanon
> >                200      pagefault:spf_vma_notsup
> >                  0      pagefault:spf_vma_access
> >                  0      pagefault:spf_pmd_changed
> >
> > In ebizzy's case most of the page faults were handled in a speculative
> > way, leading to the ebizzy performance boost.
> >
> > ------------------
> > Changes since v10 (https://lkml.org/lkml/2018/4/17/572):
> >  - Accounted for all review feedback from Punit Agrawal, Ganesh
> >    Mahendran and Minchan Kim, hopefully.
> >  - Remove unneeded check on CONFIG_SPECULATIVE_PAGE_FAULT in
> >    __do_page_fault().
> >  - Loop in pte_spinlock() and pte_map_lock() when the pte try lock
> >    fails instead of aborting the speculative page fault handling,
> >    dropping the now useless trace event pagefault:spf_pte_lock.
> >  - No more trying to reuse the fetched VMA during the speculative page
> >    fault handling when retrying is needed. This adds a lot of
> >    complexity and additional tests done didn't show a significant
> >    performance improvement.
> > - Convert IS_ENABLED(CONFIG_NUMA) back to #ifdef due to build error. > > > > [1] http://linux-kernel.2935.n7.nabble.com/RFC-PATCH-0-6-Another-go-at-speculative-page-faults-tt965642.html#none > > [2] https://patchwork.kernel.org/patch/9999687/ > > > > > > Laurent Dufour (20): > > mm: introduce CONFIG_SPECULATIVE_PAGE_FAULT > > x86/mm: define ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT > > powerpc/mm: set ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT > > mm: introduce pte_spinlock for FAULT_FLAG_SPECULATIVE > > mm: make pte_unmap_same compatible with SPF > > mm: introduce INIT_VMA() > > mm: protect VMA modifications using VMA sequence count > > mm: protect mremap() against SPF hanlder > > mm: protect SPF handler against anon_vma changes > > mm: cache some VMA fields in the vm_fault structure > > mm/migrate: Pass vm_fault pointer to migrate_misplaced_page() > > mm: introduce __lru_cache_add_active_or_unevictable > > mm: introduce __vm_normal_page() > > mm: introduce __page_add_new_anon_rmap() > > mm: protect mm_rb tree with a rwlock > > mm: adding speculative page fault failure trace events > > perf: add a speculative page fault sw event > > perf tools: add support for the SPF perf event > > mm: add speculative page fault vmstats > > powerpc/mm: add speculative page fault > > > > Mahendran Ganesh (2): > > arm64/mm: define ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT > > arm64/mm: add speculative page fault > > > > Peter Zijlstra (4): > > mm: prepare for FAULT_FLAG_SPECULATIVE > > mm: VMA sequence count > > mm: provide speculative fault infrastructure > > x86/mm: add speculative pagefault handling > > > > arch/arm64/Kconfig | 1 + > > arch/arm64/mm/fault.c | 12 + > > arch/powerpc/Kconfig | 1 + > > arch/powerpc/mm/fault.c | 16 + > > arch/x86/Kconfig | 1 + > > arch/x86/mm/fault.c | 27 +- > > fs/exec.c | 2 +- > > fs/proc/task_mmu.c | 5 +- > > fs/userfaultfd.c | 17 +- > > include/linux/hugetlb_inline.h | 2 +- > > include/linux/migrate.h | 4 +- > > include/linux/mm.h | 136 +++++++- > > 
include/linux/mm_types.h | 7 + > > include/linux/pagemap.h | 4 +- > > include/linux/rmap.h | 12 +- > > include/linux/swap.h | 10 +- > > include/linux/vm_event_item.h | 3 + > > include/trace/events/pagefault.h | 80 +++++ > > include/uapi/linux/perf_event.h | 1 + > > kernel/fork.c | 5 +- > > mm/Kconfig | 22 ++ > > mm/huge_memory.c | 6 +- > > mm/hugetlb.c | 2 + > > mm/init-mm.c | 3 + > > mm/internal.h | 20 ++ > > mm/khugepaged.c | 5 + > > mm/madvise.c | 6 +- > > mm/memory.c | 612 +++++++++++++++++++++++++++++----- > > mm/mempolicy.c | 51 ++- > > mm/migrate.c | 6 +- > > mm/mlock.c | 13 +- > > mm/mmap.c | 229 ++++++++++--- > > mm/mprotect.c | 4 +- > > mm/mremap.c | 13 + > > mm/nommu.c | 2 +- > > mm/rmap.c | 5 +- > > mm/swap.c | 6 +- > > mm/swap_state.c | 8 +- > > mm/vmstat.c | 5 +- > > tools/include/uapi/linux/perf_event.h | 1 + > > tools/perf/util/evsel.c | 1 + > > tools/perf/util/parse-events.c | 4 + > > tools/perf/util/parse-events.l | 1 + > > tools/perf/util/python.c | 1 + > > 44 files changed, 1161 insertions(+), 211 deletions(-) > > create mode 100644 include/trace/events/pagefault.h > > > > -- > > 2.7.4 > > > > >
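[Editor's note] The cover letter above explains that the speculative path takes the PTE lock with spin_trylock() while interrupts are disabled, aborting to the classic handler on contention rather than spinning (another CPU may hold that lock while waiting for our IPI ack). The shape of that "trylock or fall back" scheme can be sketched in user-space C11; `ptl`, `speculative_fault()` and friends are hypothetical stand-ins, not the kernel's API.

```c
#include <stdatomic.h>
#include <stdbool.h>

static atomic_flag ptl = ATOMIC_FLAG_INIT;  /* stand-in for the PTE lock */
static int faults_handled;

static void do_fault(void)
{
        faults_handled++;       /* placeholder for the actual fault work */
}

/* Speculative path: trylock only, never wait (interrupts are "off"). */
static bool speculative_fault(void)
{
        if (atomic_flag_test_and_set_explicit(&ptl, memory_order_acquire))
                return false;   /* lock held elsewhere: abort, no deadlock */
        do_fault();
        atomic_flag_clear_explicit(&ptl, memory_order_release);
        return true;
}

/* Classic path: allowed to wait for the lock (here it spins; the kernel
 * version instead sleeps while holding the mmap_sem). */
static void classic_fault(void)
{
        while (atomic_flag_test_and_set_explicit(&ptl, memory_order_acquire))
                ;
        do_fault();
        atomic_flag_clear_explicit(&ptl, memory_order_release);
}

/* Mirrors the arch_fault_handler() pseudo code: speculative first. */
static void handle_fault(void)
{
        if (!speculative_fault())
                classic_fault();
}
```

The key property is that the speculative path can never block on a lock whose holder might be waiting on us, which is exactly why spin_trylock() is used under disabled interrupts.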
On 28/05/2018 10:22, Haiyan Song wrote:
> Hi Laurent,
>
> Yes, these tests are done on the V9 patch.

Do you plan to give this V11 a run?

>
> Best regards,
> Haiyan Song
>
> On Mon, May 28, 2018 at 09:51:34AM +0200, Laurent Dufour wrote:
>> On 28/05/2018 07:23, Song, HaiyanX wrote:
>>>
>>> Some regressions and improvements are found by LKP-tools (Linux
>>> Kernel Performance) on the V9 patch series tested on an Intel 4S
>>> Skylake platform.
>>
>> Hi,
>>
>> Thanks for reporting these benchmark results, but you mentioned the
>> "V9 patch series" while responding to the v11 header series...
>> Were these tests done on v9 or v11 ?
>>
>> Cheers,
>> Laurent.
>>
>>>
>>> The regression result is sorted by the metric will-it-scale.per_thread_ops.
>>> Branch: Laurent-Dufour/Speculative-page-faults/20180316-151833 (V9 patch series)
>>> Commit id:
>>>     base commit: d55f34411b1b126429a823d06c3124c16283231f
>>>     head commit: 0355322b3577eeab7669066df42c550a56801110
>>> Benchmark suite: will-it-scale
>>> Download link:
>>> https://github.com/antonblanchard/will-it-scale/tree/master/tests
>>> Metrics:
>>>     will-it-scale.per_process_ops=processes/nr_cpu
>>>     will-it-scale.per_thread_ops=threads/nr_cpu
>>> test box: lkp-skl-4sp1(nr_cpu=192,memory=768G)
>>> THP: enable / disable
>>> nr_task: 100%
>>>
>>> 1.
Regressions:
>>> a) THP enabled:
>>> testcase                     base        change     head       metric
>>> page_fault3/ enable THP      10092       -17.5%     8323       will-it-scale.per_thread_ops
>>> page_fault2/ enable THP      8300        -17.2%     6869       will-it-scale.per_thread_ops
>>> brk1/ enable THP             957.67      -7.6%      885        will-it-scale.per_thread_ops
>>> page_fault3/ enable THP      172821      -5.3%      163692     will-it-scale.per_process_ops
>>> signal1/ enable THP          9125        -3.2%      8834       will-it-scale.per_process_ops
>>>
>>> b) THP disabled:
>>> testcase                     base        change     head       metric
>>> page_fault3/ disable THP     10107       -19.1%     8180       will-it-scale.per_thread_ops
>>> page_fault2/ disable THP     8432        -17.8%     6931       will-it-scale.per_thread_ops
>>> context_switch1/ disable THP 215389      -6.8%      200776     will-it-scale.per_thread_ops
>>> brk1/ disable THP            939.67      -6.6%      877.33     will-it-scale.per_thread_ops
>>> page_fault3/ disable THP     173145      -4.7%      165064     will-it-scale.per_process_ops
>>> signal1/ disable THP         9162        -3.9%      8802       will-it-scale.per_process_ops
>>>
>>> 2. Improvements:
>>> a) THP enabled:
>>> testcase                     base        change     head       metric
>>> malloc1/ enable THP          66.33       +469.8%    383.67     will-it-scale.per_thread_ops
>>> writeseek3/ enable THP       2531        +4.5%      2646       will-it-scale.per_thread_ops
>>> signal1/ enable THP          989.33      +2.8%      1016       will-it-scale.per_thread_ops
>>>
>>> b) THP disabled:
>>> testcase                     base        change     head       metric
>>> malloc1/ disable THP         90.33       +417.3%    467.33     will-it-scale.per_thread_ops
>>> read2/ disable THP           58934       +39.2%     82060      will-it-scale.per_thread_ops
>>> page_fault1/ disable THP     8607        +36.4%     11736      will-it-scale.per_thread_ops
>>> read1/ disable THP           314063      +12.7%     353934     will-it-scale.per_thread_ops
>>> writeseek3/ disable THP      2452        +12.5%     2759       will-it-scale.per_thread_ops
>>> signal1/ disable THP         971.33      +5.5%      1024       will-it-scale.per_thread_ops
>>>
>>> Notes: for the above values in the column "change", a higher value
>>> means that the related testcase result on the head commit is better
>>> than that on the base commit for this benchmark.
>>>
>>> Best regards
>>> Haiyan Song
>>>
>>> ________________________________________
>>> From: owner-linux-mm@kvack.org [owner-linux-mm@kvack.org] on behalf of Laurent Dufour [ldufour@linux.vnet.ibm.com]
>>> Sent: Thursday, May 17, 2018 7:06 PM
>>> To: akpm@linux-foundation.org; mhocko@kernel.org; peterz@infradead.org; kirill@shutemov.name; ak@linux.intel.com; dave@stgolabs.net; jack@suse.cz; Matthew Wilcox; khandual@linux.vnet.ibm.com; aneesh.kumar@linux.vnet.ibm.com; benh@kernel.crashing.org; mpe@ellerman.id.au; paulus@samba.org; Thomas Gleixner; Ingo Molnar; hpa@zytor.com; Will Deacon; Sergey Senozhatsky; sergey.senozhatsky.work@gmail.com; Andrea Arcangeli; Alexei Starovoitov; Wang, Kemi; Daniel Jordan; David Rientjes; Jerome Glisse; Ganesh Mahendran; Minchan Kim; Punit Agrawal; vinayak menon; Yang Shi
>>> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org; haren@linux.vnet.ibm.com; npiggin@gmail.com; bsingharora@gmail.com; paulmck@linux.vnet.ibm.com; Tim Chen; linuxppc-dev@lists.ozlabs.org; x86@kernel.org
>>> Subject: [PATCH v11 00/26] Speculative page faults
>>>
>>> This is a port on kernel 4.17 of the work done by Peter Zijlstra to
>>> handle page faults without holding the mm semaphore [1].
>>>
>>> The idea is to try to handle user space page faults without holding
>>> the mmap_sem. This should allow better concurrency for massively
>>> threaded processes since the page fault handler will not wait for
>>> other threads' memory layout changes to be done, assuming that such a
>>> change is done in another part of the process's memory space. This
>>> type of page fault is named a speculative page fault. If the
>>> speculative page fault fails because a concurrent change is detected
>>> or because the underlying PMD or PTE tables are not yet allocated,
>>> the speculative handling fails and a classic page fault is then
>>> tried.
>>>
>>> The speculative page fault (SPF) handler has to look for the VMA
>>> matching the fault address without holding the mmap_sem; this is done
>>> by introducing a rwlock which protects the access to the mm_rb tree.
>>> Previously this was done using SRCU, but it was introducing a lot of
>>> scheduling to process the VMA's freeing operation, which was hurting
>>> the performance by 20% as reported by Kemi Wang [2]. Using a rwlock
>>> to protect access to the mm_rb tree limits the locking contention to
>>> these operations, which are expected to be in O(log n) order. In
>>> addition, to ensure that the VMA is not freed behind our back, a
>>> reference count is added and 2 services (get_vma() and put_vma()) are
>>> introduced to handle the reference count. Once a VMA is fetched from
>>> the RB tree using get_vma(), it must later be freed using put_vma().
>>> I can no longer see the overhead I previously got with the
>>> will-it-scale benchmark.
>>>
>>> The VMA's attributes checked during the speculative page fault
>>> processing have to be protected against parallel changes. This is
>>> done by using a per VMA sequence lock. This sequence lock allows the
>>> speculative page fault handler to quickly check for parallel changes
>>> in progress and to abort the speculative page fault in that case.
>>>
>>> Once the VMA has been found, the speculative page fault handler
>>> checks the VMA's attributes to verify that the page fault can be
>>> handled correctly. Thus, the VMA is protected through a sequence lock
>>> which allows fast detection of concurrent VMA changes. If such a
>>> change is detected, the speculative page fault is aborted and a
>>> *classic* page fault is tried instead. VMA sequence lockings are
>>> added where the VMA attributes which are checked during the page
>>> fault are modified.
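[Editor's note] The get_vma()/put_vma() scheme quoted above amounts to plain reference counting: the lookup takes a reference so the VMA cannot be freed behind the speculative handler's back, and the last put frees it. A minimal user-space sketch, with hypothetical names rather than the kernel's implementation:

```c
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical stand-in for a refcounted VMA. */
struct vma_ref {
        atomic_int refcnt;
};

/* Called under the mm_rb rwlock (read side): pin the VMA. */
static struct vma_ref *get_vma(struct vma_ref *v)
{
        atomic_fetch_add_explicit(&v->refcnt, 1, memory_order_relaxed);
        return v;
}

/* Drop a reference; the last put frees the VMA. Returns 1 if freed. */
static int put_vma(struct vma_ref *v)
{
        if (atomic_fetch_sub_explicit(&v->refcnt, 1,
                                      memory_order_acq_rel) == 1) {
                free(v);
                return 1;
        }
        return 0;
}
```

Compared with the earlier SRCU approach, freeing happens synchronously at the last put_vma(), so no deferred-reclaim scheduling is involved.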
>>>
>>> When the PTE is fetched, the VMA is checked to see whether it has
>>> been changed. So once the page table is locked, the VMA is valid: any
>>> other change touching this PTE will need to lock the page table, so
>>> no parallel change is possible at this time.
>>>
>>> The locking of the PTE is done with interrupts disabled; this allows
>>> checking the PMD to ensure that there is no ongoing collapse
>>> operation. Since khugepaged first sets the PMD to pmd_none and then
>>> waits for the other CPUs to have caught the IPI interrupt, if the PMD
>>> is valid at the time the PTE is locked, we have the guarantee that
>>> the collapse operation will have to wait on the PTE lock to move
>>> forward. This allows the SPF handler to map the PTE safely. If the
>>> PMD value is different from the one recorded at the beginning of the
>>> SPF operation, the classic page fault handler will be called to
>>> handle the operation while holding the mmap_sem. As the PTE lock is
>>> taken with interrupts disabled, the lock is taken using
>>> spin_trylock() to avoid deadlock when handling a page fault while a
>>> TLB invalidate is requested by another CPU holding the PTE.
>>>
>>> In pseudo code, this could be seen as:
>>>     speculative_page_fault()
>>>     {
>>>             vma = get_vma()
>>>             check vma sequence count
>>>             check vma's support
>>>             disable interrupt
>>>                   check pgd,p4d,...,pte
>>>                   save pmd and pte in vmf
>>>                   save vma sequence counter in vmf
>>>             enable interrupt
>>>             check vma sequence count
>>>             handle_pte_fault(vma)
>>>                     ..
>>>                     page = alloc_page()
>>>                     pte_map_lock()
>>>                             disable interrupt
>>>                                     abort if sequence counter has changed
>>>                                     abort if pmd or pte has changed
>>>                                     pte map and lock
>>>                             enable interrupt
>>>                     if abort
>>>                             free page
>>>                             abort
>>>                     ...
>>>     }
>>>
>>>     arch_fault_handler()
>>>     {
>>>             if (speculative_page_fault(&vma))
>>>                     goto done
>>>     again:
>>>             lock(mmap_sem)
>>>             vma = find_vma();
>>>             handle_pte_fault(vma);
>>>             if retry
>>>                     unlock(mmap_sem)
>>>                     goto again;
>>>     done:
>>>             handle fault error
>>>     }
>>>
>>> Support for THP is not done because when checking the PMD we can be
>>> confused by an in-progress collapse operation done by khugepaged. The
>>> issue is that pmd_none() could be true either if the PMD is not yet
>>> populated or if the underlying PTEs are in the process of being
>>> collapsed. So we cannot safely allocate a PMD if pmd_none() is true.
>>>
>>> This series adds a new software performance event named
>>> 'speculative-faults' or 'spf'. It counts the number of page fault
>>> events handled speculatively. When recording 'faults,spf' events, the
>>> 'faults' one counts the total number of page fault events while 'spf'
>>> counts only the part of the faults processed speculatively.
>>>
>>> There are also some trace events introduced by this series. They
>>> allow identifying why the page faults were not processed
>>> speculatively. This doesn't take into account the faults generated by
>>> a monothreaded process, which are directly processed while holding
>>> the mmap_sem. These trace events are grouped in a system named
>>> 'pagefault':
>>>  - pagefault:spf_vma_changed : the VMA has been changed behind our back
>>>  - pagefault:spf_vma_noanon  : the vma->anon_vma field was not yet set
>>>  - pagefault:spf_vma_notsup  : the VMA's type is not supported
>>>  - pagefault:spf_vma_access  : the VMA's access rights are not respected
>>>  - pagefault:spf_pmd_changed : the upper PMD pointer has changed
>>>    behind our back
>>>
>>> To record all the related events, the easiest way is to run perf with
>>> the following arguments:
>>>  $ perf stat -e 'faults,spf,pagefault:*' <command>
>>>
>>> There is also a dedicated vmstat counter showing the number of
>>> successful page faults handled speculatively.
>>> It can be seen this way:
>>>  $ grep speculative_pgfault /proc/vmstat
>>>
>>> This series builds on top of v4.16-mmotm-2018-04-13-17-28 and is
>>> functional on x86, PowerPC and arm64.
>>>
>>> ---------------------
>>> Real Workload results
>>>
>>> As mentioned in a previous email, we did unofficial runs using a
>>> "popular in-memory multithreaded database product" on a 176-core SMT8
>>> Power system which showed a 30% improvement in the number of
>>> transactions processed per second. This run was done on the v6
>>> series, but the changes introduced in this new version should not
>>> impact the performance boost seen.
>>>
>>> Here are the perf data captured during 2 of these runs on top of the
>>> v8 series:
>>>                 vanilla         spf
>>> faults          89.418          101.364         +13%
>>> spf                n/a           97.989
>>>
>>> With the SPF kernel, most of the page faults were processed in a
>>> speculative way.
>>>
>>> Ganesh Mahendran backported the series on top of a 4.9 kernel and
>>> gave it a try on an Android device. He reported that the application
>>> launch time was improved on average by 6%, and for large applications
>>> (~100 threads) by 20%.
>>>
>>> Here are the launch times Ganesh measured on Android 8.0 on top of a
>>> Qcom MSM845 (8 cores) with 6GB of memory (lower is better):
>>>
>>> Application                          4.9     4.9+spf delta
>>> com.tencent.mm                       416     389     -7%
>>> com.eg.android.AlipayGphone          1135    986     -13%
>>> com.tencent.mtt                      455     454     0%
>>> com.qqgame.hlddz                     1497    1409    -6%
>>> com.autonavi.minimap                 711     701     -1%
>>> com.tencent.tmgp.sgame               788     748     -5%
>>> com.immomo.momo                      501     487     -3%
>>> com.tencent.peng                     2145    2112    -2%
>>> com.smile.gifmaker                   491     461     -6%
>>> com.baidu.BaiduMap                   479     366     -23%
>>> com.taobao.taobao                    1341    1198    -11%
>>> com.baidu.searchbox                  333     314     -6%
>>> com.tencent.mobileqq                 394     384     -3%
>>> com.sina.weibo                       907     906     0%
>>> com.youku.phone                      816     731     -11%
>>> com.happyelements.AndroidAnimal.qq   763     717     -6%
>>> com.UCMobile                         415     411     -1%
>>> com.tencent.tmgp.ak                  1464    1431    -2%
>>> com.tencent.qqmusic                  336     329     -2%
>>> com.sankuai.meituan                  1661    1302    -22%
>>> com.netease.cloudmusic               1193    1200    1%
>>> air.tv.douyu.android                 4257    4152    -2%
>>>
>>> ------------------
>>> Benchmarks results
>>>
>>> Base kernel is v4.17.0-rc4-mm1
>>> SPF is BASE + this series
>>>
>>> Kernbench:
>>> ----------
>>> Here are the results on a 16-CPU x86 guest using kernbench on a 4.15
>>> kernel (the kernel is built 5 times):
>>>
>>> Average Half load -j 8
>>>                      Run    (std deviation)
>>>                      BASE                   SPF
>>> Elapsed Time         1448.65 (5.72312)      1455.84 (4.84951)      0.50%
>>> User Time            10135.4 (30.3699)      10148.8 (31.1252)      0.13%
>>> System Time          900.47  (2.81131)      923.28  (7.52779)      2.53%
>>> Percent CPU          761.4   (1.14018)      760.2   (0.447214)     -0.16%
>>> Context Switches     85380   (3419.52)      84748   (1904.44)      -0.74%
>>> Sleeps               105064  (1240.96)      105074  (337.612)      0.01%
>>>
>>> Average Optimal load -j 16
>>>                      Run    (std deviation)
>>>                      BASE                   SPF
>>> Elapsed Time         920.528 (10.1212)      927.404 (8.91789)      0.75%
>>> User Time            11064.8 (981.142)      11085   (990.897)      0.18%
>>> System Time          979.904 (84.0615)      1001.14 (82.5523)      2.17%
>>> Percent CPU          1089.5  (345.894)      1086.1  (343.545)      -0.31%
>>> Context Switches     159488  (78156.4)      158223  (77472.1)
>>>                                                            -0.79%
>>> Sleeps               110566  (5877.49)      110388  (5617.75)      -0.16%
>>>
>>> During a run on the SPF kernel, perf events were captured:
>>>  Performance counter stats for '../kernbench -M':
>>>          526743764      faults
>>>                210      spf
>>>                  3      pagefault:spf_vma_changed
>>>                  0      pagefault:spf_vma_noanon
>>>               2278      pagefault:spf_vma_notsup
>>>                  0      pagefault:spf_vma_access
>>>                  0      pagefault:spf_pmd_changed
>>>
>>> Very few speculative page faults were recorded, as most of the
>>> processes involved are monothreaded (it seems that on this
>>> architecture some threads were created during the kernel build
>>> processing).
>>>
>>> Here are the kernbench results on an 80-CPU Power8 system:
>>>
>>> Average Half load -j 40
>>>                      Run    (std deviation)
>>>                      BASE                   SPF
>>> Elapsed Time         117.152 (0.774642)     117.166 (0.476057)     0.01%
>>> User Time            4478.52 (24.7688)      4479.76 (9.08555)      0.03%
>>> System Time          131.104 (0.720056)     134.04  (0.708414)     2.24%
>>> Percent CPU          3934    (19.7104)      3937.2  (19.0184)      0.08%
>>> Context Switches     92125.4 (576.787)      92581.6 (198.622)      0.50%
>>> Sleeps               317923  (652.499)      318469  (1255.59)      0.17%
>>>
>>> Average Optimal load -j 80
>>>                      Run    (std deviation)
>>>                      BASE                   SPF
>>> Elapsed Time         107.73  (0.632416)     107.31  (0.584936)     -0.39%
>>> User Time            5869.86 (1466.72)      5871.71 (1467.27)      0.03%
>>> System Time          153.728 (23.8573)      157.153 (24.3704)      2.23%
>>> Percent CPU          5418.6  (1565.17)      5436.7  (1580.91)      0.33%
>>> Context Switches     223861  (138865)       225032  (139632)       0.52%
>>> Sleeps               330529  (13495.1)      332001  (14746.2)      0.45%
>>>
>>> During a run on the SPF kernel, perf events were captured:
>>>  Performance counter stats for '../kernbench -M':
>>>          116730856      faults
>>>                  0      spf
>>>                  3      pagefault:spf_vma_changed
>>>                  0      pagefault:spf_vma_noanon
>>>                476      pagefault:spf_vma_notsup
>>>                  0      pagefault:spf_vma_access
>>>                  0      pagefault:spf_pmd_changed
>>>
>>> Most of the processes involved are monothreaded, so SPF is not
>>> activated, but there is no impact on the performance.
>>>
>>> Ebizzy:
>>> -------
>>> The test counts the number of records per second it can manage;
>>> higher is better. I run it like this: 'ebizzy -mTt <nrcpus>'. To get
>>> a consistent result I repeated the test 100 times and measured the
>>> average.
>>>
>>>                 BASE            SPF             delta
>>> 16 CPUs x86 VM  742.57          1490.24         100.69%
>>> 80 CPUs P8 node 13105.4         24174.23        84.46%
>>>
>>> Here are the performance counters read during a run on a 16-CPU x86 VM:
>>>  Performance counter stats for './ebizzy -mTt 16':
>>>            1706379      faults
>>>            1674599      spf
>>>              30588      pagefault:spf_vma_changed
>>>                  0      pagefault:spf_vma_noanon
>>>                363      pagefault:spf_vma_notsup
>>>                  0      pagefault:spf_vma_access
>>>                  0      pagefault:spf_pmd_changed
>>>
>>> And the ones captured during a run on an 80-CPU Power node:
>>>  Performance counter stats for './ebizzy -mTt 80':
>>>            1874773      faults
>>>            1461153      spf
>>>             413293      pagefault:spf_vma_changed
>>>                  0      pagefault:spf_vma_noanon
>>>                200      pagefault:spf_vma_notsup
>>>                  0      pagefault:spf_vma_access
>>>                  0      pagefault:spf_pmd_changed
>>>
>>> In ebizzy's case most of the page faults were handled in a
>>> speculative way, leading to the ebizzy performance boost.
>>>
>>> ------------------
>>> Changes since v10 (https://lkml.org/lkml/2018/4/17/572):
>>>  - Accounted for all review feedback from Punit Agrawal, Ganesh
>>>    Mahendran and Minchan Kim, hopefully.
>>>  - Remove unneeded check on CONFIG_SPECULATIVE_PAGE_FAULT in
>>>    __do_page_fault().
>>>  - Loop in pte_spinlock() and pte_map_lock() when the pte try lock
>>>    fails instead of aborting the speculative page fault handling,
>>>    dropping the now useless trace event pagefault:spf_pte_lock.
>>>  - No more trying to reuse the fetched VMA during the speculative
>>>    page fault handling when retrying is needed. This adds a lot of
>>>    complexity and additional tests done didn't show a significant
>>>    performance improvement.
>>> - Convert IS_ENABLED(CONFIG_NUMA) back to #ifdef due to build error. >>> >>> [1] http://linux-kernel.2935.n7.nabble.com/RFC-PATCH-0-6-Another-go-at-speculative-page-faults-tt965642.html#none >>> [2] https://patchwork.kernel.org/patch/9999687/ >>> >>> >>> Laurent Dufour (20): >>> mm: introduce CONFIG_SPECULATIVE_PAGE_FAULT >>> x86/mm: define ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT >>> powerpc/mm: set ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT >>> mm: introduce pte_spinlock for FAULT_FLAG_SPECULATIVE >>> mm: make pte_unmap_same compatible with SPF >>> mm: introduce INIT_VMA() >>> mm: protect VMA modifications using VMA sequence count >>> mm: protect mremap() against SPF hanlder >>> mm: protect SPF handler against anon_vma changes >>> mm: cache some VMA fields in the vm_fault structure >>> mm/migrate: Pass vm_fault pointer to migrate_misplaced_page() >>> mm: introduce __lru_cache_add_active_or_unevictable >>> mm: introduce __vm_normal_page() >>> mm: introduce __page_add_new_anon_rmap() >>> mm: protect mm_rb tree with a rwlock >>> mm: adding speculative page fault failure trace events >>> perf: add a speculative page fault sw event >>> perf tools: add support for the SPF perf event >>> mm: add speculative page fault vmstats >>> powerpc/mm: add speculative page fault >>> >>> Mahendran Ganesh (2): >>> arm64/mm: define ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT >>> arm64/mm: add speculative page fault >>> >>> Peter Zijlstra (4): >>> mm: prepare for FAULT_FLAG_SPECULATIVE >>> mm: VMA sequence count >>> mm: provide speculative fault infrastructure >>> x86/mm: add speculative pagefault handling >>> >>> arch/arm64/Kconfig | 1 + >>> arch/arm64/mm/fault.c | 12 + >>> arch/powerpc/Kconfig | 1 + >>> arch/powerpc/mm/fault.c | 16 + >>> arch/x86/Kconfig | 1 + >>> arch/x86/mm/fault.c | 27 +- >>> fs/exec.c | 2 +- >>> fs/proc/task_mmu.c | 5 +- >>> fs/userfaultfd.c | 17 +- >>> include/linux/hugetlb_inline.h | 2 +- >>> include/linux/migrate.h | 4 +- >>> include/linux/mm.h | 136 +++++++- >>> 
include/linux/mm_types.h | 7 + >>> include/linux/pagemap.h | 4 +- >>> include/linux/rmap.h | 12 +- >>> include/linux/swap.h | 10 +- >>> include/linux/vm_event_item.h | 3 + >>> include/trace/events/pagefault.h | 80 +++++ >>> include/uapi/linux/perf_event.h | 1 + >>> kernel/fork.c | 5 +- >>> mm/Kconfig | 22 ++ >>> mm/huge_memory.c | 6 +- >>> mm/hugetlb.c | 2 + >>> mm/init-mm.c | 3 + >>> mm/internal.h | 20 ++ >>> mm/khugepaged.c | 5 + >>> mm/madvise.c | 6 +- >>> mm/memory.c | 612 +++++++++++++++++++++++++++++----- >>> mm/mempolicy.c | 51 ++- >>> mm/migrate.c | 6 +- >>> mm/mlock.c | 13 +- >>> mm/mmap.c | 229 ++++++++++--- >>> mm/mprotect.c | 4 +- >>> mm/mremap.c | 13 + >>> mm/nommu.c | 2 +- >>> mm/rmap.c | 5 +- >>> mm/swap.c | 6 +- >>> mm/swap_state.c | 8 +- >>> mm/vmstat.c | 5 +- >>> tools/include/uapi/linux/perf_event.h | 1 + >>> tools/perf/util/evsel.c | 1 + >>> tools/perf/util/parse-events.c | 4 + >>> tools/perf/util/parse-events.l | 1 + >>> tools/perf/util/python.c | 1 + >>> 44 files changed, 1161 insertions(+), 211 deletions(-) >>> create mode 100644 include/trace/events/pagefault.h >>> >>> -- >>> 2.7.4 >>> >>> >> >
A full run would take one or two weeks depending on the resources
available to us. Could you pick some of them up, e.g. those showing a
performance regression?

-----Original Message-----
From: owner-linux-mm@kvack.org [mailto:owner-linux-mm@kvack.org] On Behalf Of Laurent Dufour
Sent: Monday, May 28, 2018 4:55 PM
To: Song, HaiyanX <haiyanx.song@intel.com>
Cc: akpm@linux-foundation.org; mhocko@kernel.org; peterz@infradead.org; kirill@shutemov.name; ak@linux.intel.com; dave@stgolabs.net; jack@suse.cz; Matthew Wilcox <willy@infradead.org>; khandual@linux.vnet.ibm.com; aneesh.kumar@linux.vnet.ibm.com; benh@kernel.crashing.org; mpe@ellerman.id.au; paulus@samba.org; Thomas Gleixner <tglx@linutronix.de>; Ingo Molnar <mingo@redhat.com>; hpa@zytor.com; Will Deacon <will.deacon@arm.com>; Sergey Senozhatsky <sergey.senozhatsky@gmail.com>; sergey.senozhatsky.work@gmail.com; Andrea Arcangeli <aarcange@redhat.com>; Alexei Starovoitov <alexei.starovoitov@gmail.com>; Wang, Kemi <kemi.wang@intel.com>; Daniel Jordan <daniel.m.jordan@oracle.com>; David Rientjes <rientjes@google.com>; Jerome Glisse <jglisse@redhat.com>; Ganesh Mahendran <opensource.ganesh@gmail.com>; Minchan Kim <minchan@kernel.org>; Punit Agrawal <punitagrawal@gmail.com>; vinayak menon <vinayakm.list@gmail.com>; Yang Shi <yang.shi@linux.alibaba.com>; linux-kernel@vger.kernel.org; linux-mm@kvack.org; haren@linux.vnet.ibm.com; npiggin@gmail.com; bsingharora@gmail.com; paulmck@linux.vnet.ibm.com; Tim Chen <tim.c.chen@linux.intel.com>; linuxppc-dev@lists.ozlabs.org; x86@kernel.org
Subject: Re: [PATCH v11 00/26] Speculative page faults

On 28/05/2018 10:22, Haiyan Song wrote:
> Hi Laurent,
>
> Yes, these tests are done on the V9 patch.

Do you plan to give this V11 a run?
> > > Best regards, > Haiyan Song > > On Mon, May 28, 2018 at 09:51:34AM +0200, Laurent Dufour wrote: >> On 28/05/2018 07:23, Song, HaiyanX wrote: >>> >>> Some regression and improvements is found by LKP-tools(linux kernel >>> performance) on V9 patch series tested on Intel 4s Skylake platform. >> >> Hi, >> >> Thanks for reporting this benchmark results, but you mentioned the >> "V9 patch series" while responding to the v11 header series... >> Were these tests done on v9 or v11 ? >> >> Cheers, >> Laurent. >> >>> >>> The regression result is sorted by the metric will-it-scale.per_thread_ops. >>> Branch: Laurent-Dufour/Speculative-page-faults/20180316-151833 (V9 >>> patch series) Commit id: >>> base commit: d55f34411b1b126429a823d06c3124c16283231f >>> head commit: 0355322b3577eeab7669066df42c550a56801110 >>> Benchmark suite: will-it-scale >>> Download link: >>> https://github.com/antonblanchard/will-it-scale/tree/master/tests >>> Metrics: >>> will-it-scale.per_process_ops=processes/nr_cpu >>> will-it-scale.per_thread_ops=threads/nr_cpu >>> test box: lkp-skl-4sp1(nr_cpu=192,memory=768G) >>> THP: enable / disable >>> nr_task: 100% >>> >>> 1. 
Regressions: >>> a) THP enabled: >>> testcase base change head metric >>> page_fault3/ enable THP 10092 -17.5% 8323 will-it-scale.per_thread_ops >>> page_fault2/ enable THP 8300 -17.2% 6869 will-it-scale.per_thread_ops >>> brk1/ enable THP 957.67 -7.6% 885 will-it-scale.per_thread_ops >>> page_fault3/ enable THP 172821 -5.3% 163692 will-it-scale.per_process_ops >>> signal1/ enable THP 9125 -3.2% 8834 will-it-scale.per_process_ops >>> >>> b) THP disabled: >>> testcase base change head metric >>> page_fault3/ disable THP 10107 -19.1% 8180 will-it-scale.per_thread_ops >>> page_fault2/ disable THP 8432 -17.8% 6931 will-it-scale.per_thread_ops >>> context_switch1/ disable THP 215389 -6.8% 200776 will-it-scale.per_thread_ops >>> brk1/ disable THP 939.67 -6.6% 877.33 will-it-scale.per_thread_ops >>> page_fault3/ disable THP 173145 -4.7% 165064 will-it-scale.per_process_ops >>> signal1/ disable THP 9162 -3.9% 8802 will-it-scale.per_process_ops >>> >>> 2. Improvements: >>> a) THP enabled: >>> testcase base change head metric >>> malloc1/ enable THP 66.33 +469.8% 383.67 will-it-scale.per_thread_ops >>> writeseek3/ enable THP 2531 +4.5% 2646 will-it-scale.per_thread_ops >>> signal1/ enable THP 989.33 +2.8% 1016 will-it-scale.per_thread_ops >>> >>> b) THP disabled: >>> testcase base change head metric >>> malloc1/ disable THP 90.33 +417.3% 467.33 will-it-scale.per_thread_ops >>> read2/ disable THP 58934 +39.2% 82060 will-it-scale.per_thread_ops >>> page_fault1/ disable THP 8607 +36.4% 11736 will-it-scale.per_thread_ops >>> read1/ disable THP 314063 +12.7% 353934 will-it-scale.per_thread_ops >>> writeseek3/ disable THP 2452 +12.5% 2759 will-it-scale.per_thread_ops >>> signal1/ disable THP 971.33 +5.5% 1024 will-it-scale.per_thread_ops >>> >>> Notes: for above values in column "change", the higher value means >>> that the related testcase result on head commit is better than that on base commit for this benchmark. 
>>> >>> >>> Best regards >>> Haiyan Song >>> >>> ________________________________________ >>> From: owner-linux-mm@kvack.org [owner-linux-mm@kvack.org] on behalf >>> of Laurent Dufour [ldufour@linux.vnet.ibm.com] >>> Sent: Thursday, May 17, 2018 7:06 PM >>> To: akpm@linux-foundation.org; mhocko@kernel.org; >>> peterz@infradead.org; kirill@shutemov.name; ak@linux.intel.com; >>> dave@stgolabs.net; jack@suse.cz; Matthew Wilcox; >>> khandual@linux.vnet.ibm.com; aneesh.kumar@linux.vnet.ibm.com; >>> benh@kernel.crashing.org; mpe@ellerman.id.au; paulus@samba.org; >>> Thomas Gleixner; Ingo Molnar; hpa@zytor.com; Will Deacon; Sergey >>> Senozhatsky; sergey.senozhatsky.work@gmail.com; Andrea Arcangeli; >>> Alexei Starovoitov; Wang, Kemi; Daniel Jordan; David Rientjes; >>> Jerome Glisse; Ganesh Mahendran; Minchan Kim; Punit Agrawal; vinayak >>> menon; Yang Shi >>> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org; >>> haren@linux.vnet.ibm.com; npiggin@gmail.com; bsingharora@gmail.com; >>> paulmck@linux.vnet.ibm.com; Tim Chen; linuxppc-dev@lists.ozlabs.org; >>> x86@kernel.org >>> Subject: [PATCH v11 00/26] Speculative page faults >>> >>> This is a port on kernel 4.17 of the work done by Peter Zijlstra to >>> handle page faults without holding the mm semaphore [1]. >>> >>> The idea is to try to handle user space page faults without holding >>> the mmap_sem. This should allow better concurrency for massively >>> threaded processes, since the page fault handler will not wait for >>> memory layout changes made by other threads to complete, assuming that >>> such a change is done in another part of the process's memory space. This >>> type of page fault is named a speculative page fault. If the speculative >>> page fault fails, because concurrency is detected or because the >>> underlying PMD or PTE tables are not yet allocated, its processing is aborted and a classic page fault is tried instead.
>>> >>> The speculative page fault (SPF) handler has to look for the VMA matching >>> the fault address without holding the mmap_sem; this is done by >>> introducing a rwlock which protects access to the mm_rb tree. Previously this was done using SRCU, but it introduced a lot of >>> scheduling to process the VMA freeing operations, which hurt >>> performance by 20% as reported by Kemi Wang [2]. Using a rwlock >>> to protect access to the mm_rb tree limits the locking >>> contention to these operations, which are expected to be >>> O(log n). In addition, to ensure that the VMA is not freed behind our >>> back, a reference count is added and two services (get_vma() and >>> put_vma()) are introduced to handle it. Once a VMA >>> is fetched from the RB tree using get_vma(), it must later be released >>> using put_vma(). With this scheme I can no longer see the overhead I previously got while >>> running the will-it-scale benchmark. >>> >>> The VMA attributes checked during the speculative page fault >>> processing have to be protected against parallel changes. This is >>> done by using a per-VMA sequence lock. This sequence lock allows the >>> speculative page fault handler to quickly check for parallel changes in >>> progress and to abort the speculative page fault in that case. >>> >>> Once the VMA has been found, the speculative page fault handler >>> checks the VMA's attributes to verify whether the page fault >>> can be handled correctly. Thus, the VMA is protected >>> through a sequence lock which allows fast detection of concurrent >>> VMA changes. If such a change is detected, the speculative page >>> fault is aborted and a *classic* page fault is tried. VMA sequence >>> locking is added wherever VMA attributes which are checked during the page fault are modified.
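[Editorial sketch] The get_vma()/put_vma() reference counting described above can be modeled in plain userspace C. This is only an illustrative sketch: struct vma_ref, its freed flag, and the chosen memory orders are assumptions of this example, not the kernel's actual implementation.

```c
#include <stdatomic.h>

/* Model of a VMA whose lifetime is pinned by a reference count: the count
 * is 1 while the VMA is linked in the mm_rb tree, and each speculative
 * fault handler takes an extra reference for the duration of its work. */
struct vma_ref {
    atomic_int refcount;   /* 1 while the VMA is linked in the tree */
    int freed;             /* stands in for the real freeing path */
};

/* Called right after the tree lookup, with the mm_rb rwlock held for read. */
static void get_vma(struct vma_ref *vma)
{
    atomic_fetch_add_explicit(&vma->refcount, 1, memory_order_relaxed);
}

/* Drops a reference; only the last put actually frees the VMA. */
static void put_vma(struct vma_ref *vma)
{
    if (atomic_fetch_sub_explicit(&vma->refcount, 1,
                                  memory_order_acq_rel) == 1)
        vma->freed = 1;
}
```

The key property is that a VMA fetched under the rwlock stays alive until the speculative handler's put_vma(), even if a concurrent unmap drops the tree's reference first.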
>>> >>> When the PTE is fetched, the VMA is checked to see whether it has >>> changed; hence, once the page table is locked, the VMA is valid, and any >>> other change touching this PTE will need to lock the >>> page table, so no parallel change is possible at this time. >>> >>> The locking of the PTE is done with interrupts disabled; this allows >>> checking the PMD to ensure that there is no ongoing >>> collapse operation. Since khugepaged first sets the PMD to >>> pmd_none and then waits for the other CPUs to have caught the >>> IPI interrupt, if the PMD is valid at the time the PTE is locked, we >>> have the guarantee that the collapse operation will have to wait on the PTE lock to move forward. >>> This allows the SPF handler to map the PTE safely. If the PMD value >>> is different from the one recorded at the beginning of the SPF >>> operation, the classic page fault handler will be called to handle >>> the operation while holding the mmap_sem. As the PTE lock is taken >>> with interrupts disabled, the lock is taken using spin_trylock() >>> to avoid a deadlock when handling a page fault while a TLB invalidate >>> is requested by another CPU holding the PTE lock. >>>
>>> In pseudo code, this could be seen as:
>>>   speculative_page_fault()
>>>   {
>>>           vma = get_vma()
>>>           check vma sequence count
>>>           check vma's support
>>>           disable interrupt
>>>                   check pgd,p4d,...,pte
>>>                   save pmd and pte in vmf
>>>                   save vma sequence counter in vmf
>>>           enable interrupt
>>>           check vma sequence count
>>>           handle_pte_fault(vma)
>>>                   ..
>>>                   page = alloc_page()
>>>                   pte_map_lock()
>>>                           disable interrupt
>>>                                   abort if sequence counter has changed
>>>                                   abort if pmd or pte has changed
>>>                                   pte map and lock
>>>                           enable interrupt
>>>                   if abort
>>>                           free page
>>>                           abort
>>>           ...
>>>   }
>>>
>>>   arch_fault_handler()
>>>   {
>>>           if (speculative_page_fault(&vma))
>>>                   goto done
>>>   again:
>>>           lock(mmap_sem)
>>>           vma = find_vma();
>>>           handle_pte_fault(vma);
>>>           if retry
>>>                   unlock(mmap_sem)
>>>                   goto again;
>>>   done:
>>>           handle fault error
>>>   }
>>>
>>> Support for THP is not done because, when checking the PMD, we >>> can be confused by an in-progress collapse operation done by >>> khugepaged. The issue is that pmd_none() could be true either if the >>> PMD is not yet populated or if the underlying PTEs are about >>> to be collapsed. So we cannot safely allocate a PMD if pmd_none() is true. >>> >>> This series adds a new software performance event named 'speculative-faults' >>> or 'spf'. It counts the number of page fault events >>> handled speculatively. When recording 'faults,spf' events, the >>> 'faults' one counts the total number of page fault events while >>> 'spf' only counts the part of the faults processed speculatively. >>> >>> There are also some trace events introduced by this series. They allow >>> identifying why a page fault was not processed speculatively. >>> This doesn't take into account the faults generated by a monothreaded >>> process, which are directly processed while holding the mmap_sem. These >>> trace events are grouped in a system named 'pagefault'; they are: >>> - pagefault:spf_vma_changed : the VMA has been changed behind our >>> back >>> - pagefault:spf_vma_noanon : the vma->anon_vma field was not yet set >>> - pagefault:spf_vma_notsup : the VMA's type is not supported >>> - pagefault:spf_vma_access : the VMA's access rights are not >>> respected >>> - pagefault:spf_pmd_changed : the upper PMD pointer has changed behind our >>> back.
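[Editorial sketch] The "check vma sequence count" / "abort if sequence counter has changed" steps in the pseudo code above follow the classic sequence-count pattern. Below is a minimal userspace sketch; the names (vma_seq, vma_read_begin, vma_read_retry) are illustrative, not the kernel's actual API.

```c
#include <stdatomic.h>

/* Writers make the counter odd while changing VMA attributes; a speculative
 * reader snapshots it and aborts if it was odd or has since moved. */
struct vma_seq {
    atomic_uint seq;               /* even = stable, odd = write in progress */
    unsigned long vm_start, vm_end, vm_flags;
};

static unsigned int vma_read_begin(struct vma_seq *vma)
{
    return atomic_load_explicit(&vma->seq, memory_order_acquire);
}

/* Non-zero means the speculative path must abort and fall back. */
static int vma_read_retry(struct vma_seq *vma, unsigned int snap)
{
    return (snap & 1) ||
           atomic_load_explicit(&vma->seq, memory_order_acquire) != snap;
}

static void vma_write_begin(struct vma_seq *vma)
{
    atomic_fetch_add_explicit(&vma->seq, 1, memory_order_release); /* -> odd */
}

static void vma_write_end(struct vma_seq *vma)
{
    atomic_fetch_add_explicit(&vma->seq, 1, memory_order_release); /* -> even */
}
```

A writer (e.g. an mprotect() changing vm_flags) keeps the counter odd for the whole update, so a reader whose snapshot is odd or stale knows the VMA attributes it checked may no longer hold and falls back to the classic page fault path.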
>>> >>> To record all the related events, the easiest way is to run perf with the >>> following arguments: >>> $ perf stat -e 'faults,spf,pagefault:*' <command> >>> >>> There is also a dedicated vmstat counter showing the number of >>> page faults handled speculatively. It can be seen this way: >>> $ grep speculative_pgfault /proc/vmstat >>> >>> This series builds on top of v4.16-mmotm-2018-04-13-17-28 and is >>> functional on x86, PowerPC and arm64. >>> >>> --------------------- >>> Real Workload results >>> >>> As mentioned in a previous email, we did unofficial runs using a >>> "popular in-memory multithreaded database product" on a 176-core SMT8 >>> Power system, which showed a 30% improvement in the number of >>> transactions processed per second. This run was done on the v6 >>> series, but changes introduced in this new version should not impact the performance boost seen. >>> >>> Here are the perf data captured during 2 of these runs on top of the >>> v8 >>> series: >>> vanilla spf >>> faults 89.418 101.364 +13% >>> spf n/a 97.989 >>> >>> With the SPF kernel, most of the page faults were processed in a >>> speculative way. >>> >>> Ganesh Mahendran backported the series on top of a 4.9 kernel >>> and gave it a try on an Android device. He reported that the >>> application launch time was improved on average by 6%, and for large >>> applications (~100 threads) by 20%.
>>> >>> Here are the launch times Ganesh measured on Android 8.0 on top of a >>> Qcom >>> MSM845 (8 cores) with 6GB (lower is better): >>> >>> Application 4.9 4.9+spf delta >>> com.tencent.mm 416 389 -7% >>> com.eg.android.AlipayGphone 1135 986 -13% >>> com.tencent.mtt 455 454 0% >>> com.qqgame.hlddz 1497 1409 -6% >>> com.autonavi.minimap 711 701 -1% >>> com.tencent.tmgp.sgame 788 748 -5% >>> com.immomo.momo 501 487 -3% >>> com.tencent.peng 2145 2112 -2% >>> com.smile.gifmaker 491 461 -6% >>> com.baidu.BaiduMap 479 366 -23% >>> com.taobao.taobao 1341 1198 -11% >>> com.baidu.searchbox 333 314 -6% >>> com.tencent.mobileqq 394 384 -3% >>> com.sina.weibo 907 906 0% >>> com.youku.phone 816 731 -11% >>> com.happyelements.AndroidAnimal.qq 763 717 -6% >>> com.UCMobile 415 411 -1% >>> com.tencent.tmgp.ak 1464 1431 -2% >>> com.tencent.qqmusic 336 329 -2% >>> com.sankuai.meituan 1661 1302 -22% >>> com.netease.cloudmusic 1193 1200 1% >>> air.tv.douyu.android 4257 4152 -2% >>> >>> ------------------ >>> Benchmark results >>> >>> Base kernel is v4.17.0-rc4-mm1 >>> SPF is BASE + this series >>> >>> Kernbench: >>> ---------- >>> Here are the results on a 16-CPU x86 guest using kernbench on a >>> 4.15 kernel (the kernel is built 5 times): >>> >>> Average Half load -j 8 >>> Run (std deviation) >>> BASE SPF >>> Elapsed Time 1448.65 (5.72312) 1455.84 (4.84951) 0.50% >>> User Time 10135.4 (30.3699) 10148.8 (31.1252) 0.13% >>> System Time 900.47 (2.81131) 923.28 (7.52779) 2.53% >>> Percent CPU 761.4 (1.14018) 760.2 (0.447214) -0.16% >>> Context Switches 85380 (3419.52) 84748 (1904.44) -0.74% >>> Sleeps 105064 (1240.96) 105074 (337.612) 0.01% >>> >>> Average Optimal load -j 16 >>> Run (std deviation) >>> BASE SPF >>> Elapsed Time 920.528 (10.1212) 927.404 (8.91789) 0.75% >>> User Time 11064.8 (981.142) 11085 (990.897) 0.18% >>> System Time 979.904 (84.0615) 1001.14 (82.5523) 2.17% >>> Percent CPU 1089.5 (345.894) 1086.1 (343.545) -0.31% >>> Context Switches 159488 (78156.4) 158223
(77472.1) -0.79% >>> Sleeps 110566 (5877.49) 110388 (5617.75) -0.16% >>> >>> >>> During a run on the SPF kernel, perf events were captured: >>> Performance counter stats for '../kernbench -M': >>> 526743764 faults >>> 210 spf >>> 3 pagefault:spf_vma_changed >>> 0 pagefault:spf_vma_noanon >>> 2278 pagefault:spf_vma_notsup >>> 0 pagefault:spf_vma_access >>> 0 pagefault:spf_pmd_changed >>> >>> Very few speculative page faults were recorded, as most of the >>> processes involved are monothreaded (it seems that on this >>> architecture some threads were created during the kernel build process). >>> >>> Here are the kernbench results on an 80-CPU Power8 system: >>> >>> Average Half load -j 40 >>> Run (std deviation) >>> BASE SPF >>> Elapsed Time 117.152 (0.774642) 117.166 (0.476057) 0.01% >>> User Time 4478.52 (24.7688) 4479.76 (9.08555) 0.03% >>> System Time 131.104 (0.720056) 134.04 (0.708414) 2.24% >>> Percent CPU 3934 (19.7104) 3937.2 (19.0184) 0.08% >>> Context Switches 92125.4 (576.787) 92581.6 (198.622) 0.50% >>> Sleeps 317923 (652.499) 318469 (1255.59) 0.17% >>> >>> Average Optimal load -j 80 >>> Run (std deviation) >>> BASE SPF >>> Elapsed Time 107.73 (0.632416) 107.31 (0.584936) -0.39% >>> User Time 5869.86 (1466.72) 5871.71 (1467.27) 0.03% >>> System Time 153.728 (23.8573) 157.153 (24.3704) 2.23% >>> Percent CPU 5418.6 (1565.17) 5436.7 (1580.91) 0.33% >>> Context Switches 223861 (138865) 225032 (139632) 0.52% >>> Sleeps 330529 (13495.1) 332001 (14746.2) 0.45% >>> >>> During a run on the SPF kernel, perf events were captured: >>> Performance counter stats for '../kernbench -M': >>> 116730856 faults >>> 0 spf >>> 3 pagefault:spf_vma_changed >>> 0 pagefault:spf_vma_noanon >>> 476 pagefault:spf_vma_notsup >>> 0 pagefault:spf_vma_access >>> 0 pagefault:spf_pmd_changed >>> >>> Most of the processes involved are monothreaded, so SPF is not >>> activated, but there is no impact on performance.
>>> >>> Ebizzy: >>> ------- >>> The test counts the number of records per second it can manage; >>> higher is better. I ran it like this: 'ebizzy -mTt <nrcpus>'. >>> To get consistent results I repeated the test 100 times and measured >>> the average. The number is the records processed per second; >>> higher is better. >>> >>> BASE SPF delta >>> 16 CPUs x86 VM 742.57 1490.24 100.69% >>> 80 CPUs P8 node 13105.4 24174.23 84.46% >>> >>> Here are the performance counters read during a run on a 16-CPU x86 VM: >>> Performance counter stats for './ebizzy -mTt 16': >>> 1706379 faults >>> 1674599 spf >>> 30588 pagefault:spf_vma_changed >>> 0 pagefault:spf_vma_noanon >>> 363 pagefault:spf_vma_notsup >>> 0 pagefault:spf_vma_access >>> 0 pagefault:spf_pmd_changed >>> >>> And the ones captured during a run on an 80-CPU Power node: >>> Performance counter stats for './ebizzy -mTt 80': >>> 1874773 faults >>> 1461153 spf >>> 413293 pagefault:spf_vma_changed >>> 0 pagefault:spf_vma_noanon >>> 200 pagefault:spf_vma_notsup >>> 0 pagefault:spf_vma_access >>> 0 pagefault:spf_pmd_changed >>> >>> In ebizzy's case most of the page faults were handled >>> speculatively, leading to the ebizzy performance boost. >>> >>> ------------------ >>> Changes since v10 (https://lkml.org/lkml/2018/4/17/572): >>> - Accounted for all review feedback from Punit Agrawal, Ganesh Mahendran >>> and Minchan Kim, hopefully. >>> - Remove unneeded check on CONFIG_SPECULATIVE_PAGE_FAULT in >>> __do_page_fault(). >>> - Loop in pte_spinlock() and pte_map_lock() when the pte trylock fails >>> instead >>> of aborting the speculative page fault handling, dropping the now >>> useless >>> trace event pagefault:spf_pte_lock. >>> - No longer try to reuse the fetched VMA during the speculative page fault >>> handling when retrying is needed. This added a lot of complexity and >>> additional tests done didn't show a significant performance improvement.
>>> - Convert IS_ENABLED(CONFIG_NUMA) back to #ifdef due to build error. >>> >>> [1] >>> http://linux-kernel.2935.n7.nabble.com/RFC-PATCH-0-6-Another-go-at-s >>> peculative-page-faults-tt965642.html#none >>> [2] https://patchwork.kernel.org/patch/9999687/ >>> >>> >>> Laurent Dufour (20): >>> mm: introduce CONFIG_SPECULATIVE_PAGE_FAULT >>> x86/mm: define ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT >>> powerpc/mm: set ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT >>> mm: introduce pte_spinlock for FAULT_FLAG_SPECULATIVE >>> mm: make pte_unmap_same compatible with SPF >>> mm: introduce INIT_VMA() >>> mm: protect VMA modifications using VMA sequence count >>> mm: protect mremap() against SPF hanlder >>> mm: protect SPF handler against anon_vma changes >>> mm: cache some VMA fields in the vm_fault structure >>> mm/migrate: Pass vm_fault pointer to migrate_misplaced_page() >>> mm: introduce __lru_cache_add_active_or_unevictable >>> mm: introduce __vm_normal_page() >>> mm: introduce __page_add_new_anon_rmap() >>> mm: protect mm_rb tree with a rwlock >>> mm: adding speculative page fault failure trace events >>> perf: add a speculative page fault sw event >>> perf tools: add support for the SPF perf event >>> mm: add speculative page fault vmstats >>> powerpc/mm: add speculative page fault >>> >>> Mahendran Ganesh (2): >>> arm64/mm: define ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT >>> arm64/mm: add speculative page fault >>> >>> Peter Zijlstra (4): >>> mm: prepare for FAULT_FLAG_SPECULATIVE >>> mm: VMA sequence count >>> mm: provide speculative fault infrastructure >>> x86/mm: add speculative pagefault handling >>> >>> arch/arm64/Kconfig | 1 + >>> arch/arm64/mm/fault.c | 12 + >>> arch/powerpc/Kconfig | 1 + >>> arch/powerpc/mm/fault.c | 16 + >>> arch/x86/Kconfig | 1 + >>> arch/x86/mm/fault.c | 27 +- >>> fs/exec.c | 2 +- >>> fs/proc/task_mmu.c | 5 +- >>> fs/userfaultfd.c | 17 +- >>> include/linux/hugetlb_inline.h | 2 +- >>> include/linux/migrate.h | 4 +- >>> include/linux/mm.h | 136 +++++++- >>> 
include/linux/mm_types.h | 7 + >>> include/linux/pagemap.h | 4 +- >>> include/linux/rmap.h | 12 +- >>> include/linux/swap.h | 10 +- >>> include/linux/vm_event_item.h | 3 + >>> include/trace/events/pagefault.h | 80 +++++ >>> include/uapi/linux/perf_event.h | 1 + >>> kernel/fork.c | 5 +- >>> mm/Kconfig | 22 ++ >>> mm/huge_memory.c | 6 +- >>> mm/hugetlb.c | 2 + >>> mm/init-mm.c | 3 + >>> mm/internal.h | 20 ++ >>> mm/khugepaged.c | 5 + >>> mm/madvise.c | 6 +- >>> mm/memory.c | 612 +++++++++++++++++++++++++++++----- >>> mm/mempolicy.c | 51 ++- >>> mm/migrate.c | 6 +- >>> mm/mlock.c | 13 +- >>> mm/mmap.c | 229 ++++++++++--- >>> mm/mprotect.c | 4 +- >>> mm/mremap.c | 13 + >>> mm/nommu.c | 2 +- >>> mm/rmap.c | 5 +- >>> mm/swap.c | 6 +- >>> mm/swap_state.c | 8 +- >>> mm/vmstat.c | 5 +- >>> tools/include/uapi/linux/perf_event.h | 1 + >>> tools/perf/util/evsel.c | 1 + >>> tools/perf/util/parse-events.c | 4 + >>> tools/perf/util/parse-events.l | 1 + >>> tools/perf/util/python.c | 1 + >>> 44 files changed, 1161 insertions(+), 211 deletions(-) create mode >>> 100644 include/trace/events/pagefault.h >>> >>> -- >>> 2.7.4 >>> >>> >> >
Hi Laurent, Regression tests for the v11 patch series have been run; some regressions were found by LKP-tools (Linux kernel performance) on the Intel 4s Skylake platform. This time only the cases that had been run and showed regressions on the v9 patch series were tested. The regression result is sorted by the metric will-it-scale.per_thread_ops. branch: Laurent-Dufour/Speculative-page-faults/20180520-045126 commit id: head commit : a7a8993bfe3ccb54ad468b9f1799649e4ad1ff12 base commit : ba98a1cdad71d259a194461b3a61471b49b14df1 Benchmark: will-it-scale Download link: https://github.com/antonblanchard/will-it-scale/tree/master Metrics: will-it-scale.per_process_ops=processes/nr_cpu will-it-scale.per_thread_ops=threads/nr_cpu test box: lkp-skl-4sp1(nr_cpu=192,memory=768G) THP: enable / disable nr_task:100% 1. Regressions: a). Enable THP testcase base change head metric page_fault3/enable THP 10519 -20.5% 836 will-it-scale.per_thread_ops page_fault2/enable THP 8281 -18.8% 6728 will-it-scale.per_thread_ops brk1/enable THP 998475 -2.2% 976893 will-it-scale.per_process_ops context_switch1/enable THP 223910 -1.3% 220930 will-it-scale.per_process_ops context_switch1/enable THP 233722 -1.0% 231288 will-it-scale.per_thread_ops b). Disable THP page_fault3/disable THP 10856 -23.1% 8344 will-it-scale.per_thread_ops page_fault2/disable THP 8147 -18.8% 6613 will-it-scale.per_thread_ops brk1/disable THP 957 -7.9% 881 will-it-scale.per_thread_ops context_switch1/disable THP 237006 -2.2% 231907 will-it-scale.per_thread_ops brk1/disable THP 997317 -2.0% 977778 will-it-scale.per_process_ops page_fault3/disable THP 467454 -1.8% 459251 will-it-scale.per_process_ops context_switch1/disable THP 224431 -1.3% 221567 will-it-scale.per_process_ops Note: for the above test results, higher is better. 2. Improvements: no improvement was found on the selected test cases. Best regards Haiyan Song
Hi Haiyan, I don't have access to the same hardware you ran the tests on, but I gave those tests a try on a Power8 system (2 sockets, 5 cores/socket, 8 threads/core, 80 CPUs, 32G). I ran each will-it-scale test 10 times and computed the average. test THP enabled 4.17.0-rc4-mm1 spf delta page_fault3_threads 2697.7 2683.5 -0.53% page_fault2_threads 170660.6 169574.1 -0.64% context_switch1_threads 6915269.2 6877507.3 -0.55% context_switch1_processes 6478076.2 6529493.5 0.79% brk1 243391.2 238527.5 -2.00% Tests were launched with the arguments '-t 80 -s 5'; only the average report is taken into account. Note that the page size is 64K by default on ppc64. It would be nice if you could capture some perf data to figure out why page_fault2/3 are showing such a performance regression. Thanks, Laurent. On 11/06/2018 09:49, Song, HaiyanX wrote: > Hi Laurent, > > Regression test for v11 patch serials have been run, some regression is found by LKP-tools (linux kernel performance) > tested on Intel 4s skylake platform. This time only test the cases which have been run and found regressions on > V9 patch serials. > > The regression result is sorted by the metric will-it-scale.per_thread_ops. > branch: Laurent-Dufour/Speculative-page-faults/20180520-045126 > commit id: > head commit : a7a8993bfe3ccb54ad468b9f1799649e4ad1ff12 > base commit : ba98a1cdad71d259a194461b3a61471b49b14df1 > Benchmark: will-it-scale > Download link: https://github.com/antonblanchard/will-it-scale/tree/master > > Metrics: > will-it-scale.per_process_ops=processes/nr_cpu > will-it-scale.per_thread_ops=threads/nr_cpu > test box: lkp-skl-4sp1(nr_cpu=192,memory=768G) > THP: enable / disable > nr_task:100% > > 1. Regressions: > > a).
Enable THP > testcase base change head metric > page_fault3/enable THP 10519 -20.5% 836 will-it-scale.per_thread_ops > page_fault2/enalbe THP 8281 -18.8% 6728 will-it-scale.per_thread_ops > brk1/eanble THP 998475 -2.2% 976893 will-it-scale.per_process_ops > context_switch1/enable THP 223910 -1.3% 220930 will-it-scale.per_process_ops > context_switch1/enable THP 233722 -1.0% 231288 will-it-scale.per_thread_ops > > b). Disable THP > page_fault3/disable THP 10856 -23.1% 8344 will-it-scale.per_thread_ops > page_fault2/disable THP 8147 -18.8% 6613 will-it-scale.per_thread_ops > brk1/disable THP 957 -7.9% 881 will-it-scale.per_thread_ops > context_switch1/disable THP 237006 -2.2% 231907 will-it-scale.per_thread_ops > brk1/disable THP 997317 -2.0% 977778 will-it-scale.per_process_ops > page_fault3/disable THP 467454 -1.8% 459251 will-it-scale.per_process_ops > context_switch1/disable THP 224431 -1.3% 221567 will-it-scale.per_process_ops > > Notes: for the above values of test result, the higher is better. > > 2. Improvement: not found improvement based on the selected test cases. 
> > > Best regards > Haiyan Song > ________________________________________ > From: owner-linux-mm@kvack.org [owner-linux-mm@kvack.org] on behalf of Laurent Dufour [ldufour@linux.vnet.ibm.com] > Sent: Monday, May 28, 2018 4:54 PM > To: Song, HaiyanX > Cc: akpm@linux-foundation.org; mhocko@kernel.org; peterz@infradead.org; kirill@shutemov.name; ak@linux.intel.com; dave@stgolabs.net; jack@suse.cz; Matthew Wilcox; khandual@linux.vnet.ibm.com; aneesh.kumar@linux.vnet.ibm.com; benh@kernel.crashing.org; mpe@ellerman.id.au; paulus@samba.org; Thomas Gleixner; Ingo Molnar; hpa@zytor.com; Will Deacon; Sergey Senozhatsky; sergey.senozhatsky.work@gmail.com; Andrea Arcangeli; Alexei Starovoitov; Wang, Kemi; Daniel Jordan; David Rientjes; Jerome Glisse; Ganesh Mahendran; Minchan Kim; Punit Agrawal; vinayak menon; Yang Shi; linux-kernel@vger.kernel.org; linux-mm@kvack.org; haren@linux.vnet.ibm.com; npiggin@gmail.com; bsingharora@gmail.com; paulmck@linux.vnet.ibm.com; Tim Chen; linuxppc-dev@lists.ozlabs.org; x86@kernel.org > Subject: Re: [PATCH v11 00/26] Speculative page faults > > On 28/05/2018 10:22, Haiyan Song wrote: >> Hi Laurent, >> >> Yes, these tests are done on V9 patch. > > Do you plan to give this V11 a run ? > >> >> >> Best regards, >> Haiyan Song >> >> On Mon, May 28, 2018 at 09:51:34AM +0200, Laurent Dufour wrote: >>> On 28/05/2018 07:23, Song, HaiyanX wrote: >>>> >>>> Some regression and improvements is found by LKP-tools(linux kernel performance) on V9 patch series >>>> tested on Intel 4s Skylake platform. >>> >>> Hi, >>> >>> Thanks for reporting this benchmark results, but you mentioned the "V9 patch >>> series" while responding to the v11 header series... >>> Were these tests done on v9 or v11 ? >>> >>> Cheers, >>> Laurent. >>> >>>> >>>> The regression result is sorted by the metric will-it-scale.per_thread_ops. 
>>>> Branch: Laurent-Dufour/Speculative-page-faults/20180316-151833 (V9 patch series) >>>> Commit id: >>>> base commit: d55f34411b1b126429a823d06c3124c16283231f >>>> head commit: 0355322b3577eeab7669066df42c550a56801110 >>>> Benchmark suite: will-it-scale >>>> Download link: >>>> https://github.com/antonblanchard/will-it-scale/tree/master/tests >>>> Metrics: >>>> will-it-scale.per_process_ops=processes/nr_cpu >>>> will-it-scale.per_thread_ops=threads/nr_cpu >>>> test box: lkp-skl-4sp1(nr_cpu=192,memory=768G) >>>> THP: enable / disable >>>> nr_task: 100% >>>> >>>> 1. Regressions: >>>> a) THP enabled: >>>> testcase base change head metric >>>> page_fault3/ enable THP 10092 -17.5% 8323 will-it-scale.per_thread_ops >>>> page_fault2/ enable THP 8300 -17.2% 6869 will-it-scale.per_thread_ops >>>> brk1/ enable THP 957.67 -7.6% 885 will-it-scale.per_thread_ops >>>> page_fault3/ enable THP 172821 -5.3% 163692 will-it-scale.per_process_ops >>>> signal1/ enable THP 9125 -3.2% 8834 will-it-scale.per_process_ops >>>> >>>> b) THP disabled: >>>> testcase base change head metric >>>> page_fault3/ disable THP 10107 -19.1% 8180 will-it-scale.per_thread_ops >>>> page_fault2/ disable THP 8432 -17.8% 6931 will-it-scale.per_thread_ops >>>> context_switch1/ disable THP 215389 -6.8% 200776 will-it-scale.per_thread_ops >>>> brk1/ disable THP 939.67 -6.6% 877.33 will-it-scale.per_thread_ops >>>> page_fault3/ disable THP 173145 -4.7% 165064 will-it-scale.per_process_ops >>>> signal1/ disable THP 9162 -3.9% 8802 will-it-scale.per_process_ops >>>> >>>> 2. 
Improvements: >>>> a) THP enabled: >>>> testcase base change head metric >>>> malloc1/ enable THP 66.33 +469.8% 383.67 will-it-scale.per_thread_ops >>>> writeseek3/ enable THP 2531 +4.5% 2646 will-it-scale.per_thread_ops >>>> signal1/ enable THP 989.33 +2.8% 1016 will-it-scale.per_thread_ops >>>> >>>> b) THP disabled: >>>> testcase base change head metric >>>> malloc1/ disable THP 90.33 +417.3% 467.33 will-it-scale.per_thread_ops >>>> read2/ disable THP 58934 +39.2% 82060 will-it-scale.per_thread_ops >>>> page_fault1/ disable THP 8607 +36.4% 11736 will-it-scale.per_thread_ops >>>> read1/ disable THP 314063 +12.7% 353934 will-it-scale.per_thread_ops >>>> writeseek3/ disable THP 2452 +12.5% 2759 will-it-scale.per_thread_ops >>>> signal1/ disable THP 971.33 +5.5% 1024 will-it-scale.per_thread_ops >>>> >>>> Notes: for above values in column "change", the higher value means that the related testcase result >>>> on head commit is better than that on base commit for this benchmark. >>>> >>>> >>>> Best regards >>>> Haiyan Song >>>> >>>> ________________________________________ >>>> From: owner-linux-mm@kvack.org [owner-linux-mm@kvack.org] on behalf of Laurent Dufour [ldufour@linux.vnet.ibm.com] >>>> Sent: Thursday, May 17, 2018 7:06 PM >>>> To: akpm@linux-foundation.org; mhocko@kernel.org; peterz@infradead.org; kirill@shutemov.name; ak@linux.intel.com; dave@stgolabs.net; jack@suse.cz; Matthew Wilcox; khandual@linux.vnet.ibm.com; aneesh.kumar@linux.vnet.ibm.com; benh@kernel.crashing.org; mpe@ellerman.id.au; paulus@samba.org; Thomas Gleixner; Ingo Molnar; hpa@zytor.com; Will Deacon; Sergey Senozhatsky; sergey.senozhatsky.work@gmail.com; Andrea Arcangeli; Alexei Starovoitov; Wang, Kemi; Daniel Jordan; David Rientjes; Jerome Glisse; Ganesh Mahendran; Minchan Kim; Punit Agrawal; vinayak menon; Yang Shi >>>> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org; haren@linux.vnet.ibm.com; npiggin@gmail.com; bsingharora@gmail.com; paulmck@linux.vnet.ibm.com; Tim Chen; 
linuxppc-dev@lists.ozlabs.org; x86@kernel.org >>>> Subject: [PATCH v11 00/26] Speculative page faults >>>> >>>> This is a port on kernel 4.17 of the work done by Peter Zijlstra to handle >>>> page fault without holding the mm semaphore [1]. >>>> >>>> The idea is to try to handle user space page faults without holding the >>>> mmap_sem. This should allow better concurrency for massively threaded >>>> process since the page fault handler will not wait for other threads memory >>>> layout change to be done, assuming that this change is done in another part >>>> of the process's memory space. This type page fault is named speculative >>>> page fault. If the speculative page fault fails because of a concurrency is >>>> detected or because underlying PMD or PTE tables are not yet allocating, it >>>> is failing its processing and a classic page fault is then tried. >>>> >>>> The speculative page fault (SPF) has to look for the VMA matching the fault >>>> address without holding the mmap_sem, this is done by introducing a rwlock >>>> which protects the access to the mm_rb tree. Previously this was done using >>>> SRCU but it was introducing a lot of scheduling to process the VMA's >>>> freeing operation which was hitting the performance by 20% as reported by >>>> Kemi Wang [2]. Using a rwlock to protect access to the mm_rb tree is >>>> limiting the locking contention to these operations which are expected to >>>> be in a O(log n) order. In addition to ensure that the VMA is not freed in >>>> our back a reference count is added and 2 services (get_vma() and >>>> put_vma()) are introduced to handle the reference count. Once a VMA is >>>> fetched from the RB tree using get_vma(), it must be later freed using >>>> put_vma(). I can't see anymore the overhead I got while will-it-scale >>>> benchmark anymore. >>>> >>>> The VMA's attributes checked during the speculative page fault processing >>>> have to be protected against parallel changes. 
This is done by using a per
>>>> VMA sequence lock. This sequence lock allows the speculative page fault
>>>> handler to quickly check for parallel changes in progress and to abort the
>>>> speculative page fault in that case.
>>>>
>>>> Once the VMA has been found, the speculative page fault handler checks the
>>>> VMA's attributes to verify whether the page fault can be handled
>>>> correctly. Thus, the VMA is protected through a sequence lock which
>>>> allows fast detection of concurrent VMA changes. If such a change is
>>>> detected, the speculative page fault is aborted and a *classic* page fault
>>>> is tried. VMA sequence lockings are added when VMA attributes which are
>>>> checked during the page fault are modified.
>>>>
>>>> When the PTE is fetched, the VMA is checked to see if it has been changed,
>>>> so once the page table is locked, the VMA is valid, and any other change
>>>> leading to touching this PTE will need to lock the page table, so no
>>>> parallel change is possible at this time.
>>>>
>>>> The locking of the PTE is done with interrupts disabled; this allows
>>>> checking the PMD to ensure that there is no ongoing collapsing
>>>> operation. Since khugepaged first sets the PMD to pmd_none and then
>>>> waits for the other CPUs to have caught the IPI interrupt, if the pmd is
>>>> valid at the time the PTE is locked, we have the guarantee that the
>>>> collapsing operation will have to wait on the PTE lock to move forward.
>>>> This allows the SPF handler to map the PTE safely. If the PMD value is
>>>> different from the one recorded at the beginning of the SPF operation, the
>>>> classic page fault handler will be called to handle the operation while
>>>> holding the mmap_sem. As the PTE lock is taken with interrupts disabled,
>>>> the lock is taken using spin_trylock() to avoid deadlock when handling a
>>>> page fault while a TLB invalidate is requested by another CPU holding the
>>>> PTE.
>>>>
>>>> In pseudo code, this could be seen as:
>>>>
>>>>     speculative_page_fault()
>>>>     {
>>>>         vma = get_vma()
>>>>         check vma sequence count
>>>>         check vma's support
>>>>         disable interrupt
>>>>             check pgd,p4d,...,pte
>>>>             save pmd and pte in vmf
>>>>             save vma sequence counter in vmf
>>>>         enable interrupt
>>>>         check vma sequence count
>>>>         handle_pte_fault(vma)
>>>>             ..
>>>>             page = alloc_page()
>>>>             pte_map_lock()
>>>>                 disable interrupt
>>>>                     abort if sequence counter has changed
>>>>                     abort if pmd or pte has changed
>>>>                     pte map and lock
>>>>                 enable interrupt
>>>>             if abort
>>>>                 free page
>>>>                 abort
>>>>             ...
>>>>     }
>>>>
>>>>     arch_fault_handler()
>>>>     {
>>>>         if (speculative_page_fault(&vma))
>>>>             goto done
>>>>     again:
>>>>         lock(mmap_sem)
>>>>         vma = find_vma();
>>>>         handle_pte_fault(vma);
>>>>         if retry
>>>>             unlock(mmap_sem)
>>>>             goto again;
>>>>     done:
>>>>         handle fault error
>>>>     }
>>>>
>>>> Support for THP is not done because when checking for the PMD, we can be
>>>> confused by an in-progress collapsing operation done by khugepaged. The
>>>> issue is that pmd_none() could be true either if the PMD is not already
>>>> populated or if the underlying PTEs are about to be collapsed. So we
>>>> cannot safely allocate a PMD if pmd_none() is true.
>>>>
>>>> This series adds a new software performance event named 'speculative-faults'
>>>> or 'spf'. It counts the number of page fault events successfully handled
>>>> speculatively. When recording 'faults,spf' events, 'faults' counts the
>>>> total number of page fault events while 'spf' only counts
>>>> the part of the faults processed speculatively.
>>>>
>>>> There are also some trace events introduced by this series. They allow
>>>> identifying why the page faults were not processed speculatively. This
>>>> doesn't take into account the faults generated by a monothreaded process,
>>>> which are directly processed while holding the mmap_sem.
These trace events are
>>>> grouped in a system named 'pagefault'; they are:
>>>> - pagefault:spf_vma_changed : the VMA has been changed behind our back
>>>> - pagefault:spf_vma_noanon : the vma->anon_vma field was not yet set
>>>> - pagefault:spf_vma_notsup : the VMA's type is not supported
>>>> - pagefault:spf_vma_access : the VMA's access rights are not respected
>>>> - pagefault:spf_pmd_changed : the upper PMD pointer has changed behind our
>>>> back
>>>>
>>>> To record all the related events, the easiest is to run perf with the
>>>> following arguments:
>>>> $ perf stat -e 'faults,spf,pagefault:*' <command>
>>>>
>>>> There is also a dedicated vmstat counter showing the number of successful
>>>> page faults handled speculatively. It can be seen this way:
>>>> $ grep speculative_pgfault /proc/vmstat
>>>>
>>>> This series builds on top of v4.16-mmotm-2018-04-13-17-28 and is functional
>>>> on x86, PowerPC and arm64.
>>>>
>>>> ---------------------
>>>> Real Workload results
>>>>
>>>> As mentioned in a previous email, we did unofficial runs using a "popular
>>>> in-memory multithreaded database product" on a 176-core SMT8 Power system
>>>> which showed a 30% improvement in the number of transactions processed per
>>>> second. This run was done on the v6 series, but the changes introduced in
>>>> this new version should not impact the performance boost seen.
>>>>
>>>> Here are the perf data captured during 2 of these runs on top of the v8
>>>> series:
>>>>             vanilla     spf
>>>> faults      89.418      101.364     +13%
>>>> spf         n/a         97.989
>>>>
>>>> With the SPF kernel, most of the page faults were processed in a speculative
>>>> way.
>>>>
>>>> Ganesh Mahendran had backported the series on top of a 4.9 kernel and gave
>>>> it a try on an Android device. He reported that the application launch time
>>>> was improved on average by 6%, and for large applications (~100 threads) by
>>>> 20%.
>>>>
>>>> Here are the launch times Ganesh measured on Android 8.0 on top of a Qcom
>>>> MSM845 (8 cores) with 6GB (lower is better):
>>>>
>>>> Application                          4.9     4.9+spf  delta
>>>> com.tencent.mm                       416     389      -7%
>>>> com.eg.android.AlipayGphone          1135    986      -13%
>>>> com.tencent.mtt                      455     454      0%
>>>> com.qqgame.hlddz                     1497    1409     -6%
>>>> com.autonavi.minimap                 711     701      -1%
>>>> com.tencent.tmgp.sgame               788     748      -5%
>>>> com.immomo.momo                      501     487      -3%
>>>> com.tencent.peng                     2145    2112     -2%
>>>> com.smile.gifmaker                   491     461      -6%
>>>> com.baidu.BaiduMap                   479     366      -23%
>>>> com.taobao.taobao                    1341    1198     -11%
>>>> com.baidu.searchbox                  333     314      -6%
>>>> com.tencent.mobileqq                 394     384      -3%
>>>> com.sina.weibo                       907     906      0%
>>>> com.youku.phone                      816     731      -11%
>>>> com.happyelements.AndroidAnimal.qq   763     717      -6%
>>>> com.UCMobile                         415     411      -1%
>>>> com.tencent.tmgp.ak                  1464    1431     -2%
>>>> com.tencent.qqmusic                  336     329      -2%
>>>> com.sankuai.meituan                  1661    1302     -22%
>>>> com.netease.cloudmusic               1193    1200     1%
>>>> air.tv.douyu.android                 4257    4152     -2%
>>>>
>>>> ------------------
>>>> Benchmarks results
>>>>
>>>> Base kernel is v4.17.0-rc4-mm1
>>>> SPF is BASE + this series
>>>>
>>>> Kernbench:
>>>> ----------
>>>> Here are the results on a 16-CPU X86 guest using kernbench on a 4.15
>>>> kernel (the kernel is built 5 times):
>>>>
>>>> Average Half load -j 8
>>>>                  Run (std deviation)
>>>>                  BASE                  SPF
>>>> Elapsed Time     1448.65 (5.72312)     1455.84 (4.84951)    0.50%
>>>> User Time        10135.4 (30.3699)     10148.8 (31.1252)    0.13%
>>>> System Time      900.47 (2.81131)      923.28 (7.52779)     2.53%
>>>> Percent CPU      761.4 (1.14018)       760.2 (0.447214)     -0.16%
>>>> Context Switches 85380 (3419.52)       84748 (1904.44)      -0.74%
>>>> Sleeps           105064 (1240.96)      105074 (337.612)     0.01%
>>>>
>>>> Average Optimal load -j 16
>>>>                  Run (std deviation)
>>>>                  BASE                  SPF
>>>> Elapsed Time     920.528 (10.1212)     927.404 (8.91789)    0.75%
>>>> User Time        11064.8 (981.142)     11085 (990.897)      0.18%
>>>> System Time      979.904 (84.0615)     1001.14 (82.5523)    2.17%
>>>> Percent CPU      1089.5 (345.894)      1086.1 (343.545)     -0.31%
>>>> Context Switches 159488 (78156.4)      158223 (77472.1)     -0.79%
>>>> Sleeps           110566 (5877.49)      110388 (5617.75)     -0.16%
>>>>
>>>> During a run on the SPF kernel, perf events were captured:
>>>> Performance counter stats for '../kernbench -M':
>>>>     526743764   faults
>>>>           210   spf
>>>>             3   pagefault:spf_vma_changed
>>>>             0   pagefault:spf_vma_noanon
>>>>          2278   pagefault:spf_vma_notsup
>>>>             0   pagefault:spf_vma_access
>>>>             0   pagefault:spf_pmd_changed
>>>>
>>>> Very few speculative page faults were recorded, as most of the processes
>>>> involved are monothreaded (it seems that on this architecture some threads
>>>> were created during the kernel build processing).
>>>>
>>>> Here are the kernbench results on an 80-CPU Power8 system:
>>>>
>>>> Average Half load -j 40
>>>>                  Run (std deviation)
>>>>                  BASE                  SPF
>>>> Elapsed Time     117.152 (0.774642)    117.166 (0.476057)   0.01%
>>>> User Time        4478.52 (24.7688)     4479.76 (9.08555)    0.03%
>>>> System Time      131.104 (0.720056)    134.04 (0.708414)    2.24%
>>>> Percent CPU      3934 (19.7104)        3937.2 (19.0184)     0.08%
>>>> Context Switches 92125.4 (576.787)     92581.6 (198.622)    0.50%
>>>> Sleeps           317923 (652.499)      318469 (1255.59)     0.17%
>>>>
>>>> Average Optimal load -j 80
>>>>                  Run (std deviation)
>>>>                  BASE                  SPF
>>>> Elapsed Time     107.73 (0.632416)     107.31 (0.584936)    -0.39%
>>>> User Time        5869.86 (1466.72)     5871.71 (1467.27)    0.03%
>>>> System Time      153.728 (23.8573)     157.153 (24.3704)    2.23%
>>>> Percent CPU      5418.6 (1565.17)      5436.7 (1580.91)     0.33%
>>>> Context Switches 223861 (138865)       225032 (139632)      0.52%
>>>> Sleeps           330529 (13495.1)      332001 (14746.2)     0.45%
>>>>
>>>> During a run on the SPF kernel, perf events were captured:
>>>> Performance counter stats for '../kernbench -M':
>>>>     116730856   faults
>>>>             0   spf
>>>>             3   pagefault:spf_vma_changed
>>>>             0   pagefault:spf_vma_noanon
>>>>           476   pagefault:spf_vma_notsup
>>>>             0   pagefault:spf_vma_access
>>>>             0   pagefault:spf_pmd_changed
>>>>
>>>> Most of the processes involved are monothreaded so SPF is not activated, but
>>>> there is no impact on the
performance.
>>>>
>>>> Ebizzy:
>>>> -------
>>>> The test counts the number of records per second it can manage; higher is
>>>> better. I ran it like this: 'ebizzy -mTt <nrcpus>'. To get consistent
>>>> results I repeated the test 100 times and measured the average. The number
>>>> reported is records processed per second.
>>>>
>>>>                  BASE        SPF         delta
>>>> 16 CPUs x86 VM   742.57      1490.24     100.69%
>>>> 80 CPUs P8 node  13105.4     24174.23    84.46%
>>>>
>>>> Here are the performance counters read during a run on a 16-CPU x86 VM:
>>>> Performance counter stats for './ebizzy -mTt 16':
>>>>       1706379   faults
>>>>       1674599   spf
>>>>         30588   pagefault:spf_vma_changed
>>>>             0   pagefault:spf_vma_noanon
>>>>           363   pagefault:spf_vma_notsup
>>>>             0   pagefault:spf_vma_access
>>>>             0   pagefault:spf_pmd_changed
>>>>
>>>> And the ones captured during a run on an 80-CPU Power node:
>>>> Performance counter stats for './ebizzy -mTt 80':
>>>>       1874773   faults
>>>>       1461153   spf
>>>>        413293   pagefault:spf_vma_changed
>>>>             0   pagefault:spf_vma_noanon
>>>>           200   pagefault:spf_vma_notsup
>>>>             0   pagefault:spf_vma_access
>>>>             0   pagefault:spf_pmd_changed
>>>>
>>>> In ebizzy's case most of the page faults were handled in a speculative way,
>>>> leading to the ebizzy performance boost.
>>>>
>>>> ------------------
>>>> Changes since v10 (https://lkml.org/lkml/2018/4/17/572):
>>>>  - Accounted for all review feedback from Punit Agrawal, Ganesh Mahendran
>>>>    and Minchan Kim, hopefully.
>>>>  - Remove unneeded check on CONFIG_SPECULATIVE_PAGE_FAULT in
>>>>    __do_page_fault().
>>>>  - Loop in pte_spinlock() and pte_map_lock() when the pte try lock fails
>>>>    instead of aborting the speculative page fault handling. Dropping the
>>>>    now useless trace event pagefault:spf_pte_lock.
>>>>  - No longer try to reuse the fetched VMA during the speculative page fault
>>>>    handling when retrying is needed.
This adds a lot of complexity and
>>>>    additional tests done didn't show a significant performance improvement.
>>>>  - Convert IS_ENABLED(CONFIG_NUMA) back to #ifdef due to build error.
>>>>
>>>> [1] http://linux-kernel.2935.n7.nabble.com/RFC-PATCH-0-6-Another-go-at-speculative-page-faults-tt965642.html#none
>>>> [2] https://patchwork.kernel.org/patch/9999687/
>>>>
>>>> Laurent Dufour (20):
>>>>   mm: introduce CONFIG_SPECULATIVE_PAGE_FAULT
>>>>   x86/mm: define ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT
>>>>   powerpc/mm: set ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT
>>>>   mm: introduce pte_spinlock for FAULT_FLAG_SPECULATIVE
>>>>   mm: make pte_unmap_same compatible with SPF
>>>>   mm: introduce INIT_VMA()
>>>>   mm: protect VMA modifications using VMA sequence count
>>>>   mm: protect mremap() against SPF hanlder
>>>>   mm: protect SPF handler against anon_vma changes
>>>>   mm: cache some VMA fields in the vm_fault structure
>>>>   mm/migrate: Pass vm_fault pointer to migrate_misplaced_page()
>>>>   mm: introduce __lru_cache_add_active_or_unevictable
>>>>   mm: introduce __vm_normal_page()
>>>>   mm: introduce __page_add_new_anon_rmap()
>>>>   mm: protect mm_rb tree with a rwlock
>>>>   mm: adding speculative page fault failure trace events
>>>>   perf: add a speculative page fault sw event
>>>>   perf tools: add support for the SPF perf event
>>>>   mm: add speculative page fault vmstats
>>>>   powerpc/mm: add speculative page fault
>>>>
>>>> Mahendran Ganesh (2):
>>>>   arm64/mm: define ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT
>>>>   arm64/mm: add speculative page fault
>>>>
>>>> Peter Zijlstra (4):
>>>>   mm: prepare for FAULT_FLAG_SPECULATIVE
>>>>   mm: VMA sequence count
>>>>   mm: provide speculative fault infrastructure
>>>>   x86/mm: add speculative pagefault handling
>>>>
>>>>  arch/arm64/Kconfig                    |   1 +
>>>>  arch/arm64/mm/fault.c                 |  12 +
>>>>  arch/powerpc/Kconfig                  |   1 +
>>>>  arch/powerpc/mm/fault.c               |  16 +
>>>>  arch/x86/Kconfig                      |   1 +
>>>>  arch/x86/mm/fault.c                   |  27 +-
>>>>  fs/exec.c                             |   2 +-
>>>>  fs/proc/task_mmu.c                    |   5
+-
>>>>  fs/userfaultfd.c                      |  17 +-
>>>>  include/linux/hugetlb_inline.h        |   2 +-
>>>>  include/linux/migrate.h               |   4 +-
>>>>  include/linux/mm.h                    | 136 +++++++-
>>>>  include/linux/mm_types.h              |   7 +
>>>>  include/linux/pagemap.h               |   4 +-
>>>>  include/linux/rmap.h                  |  12 +-
>>>>  include/linux/swap.h                  |  10 +-
>>>>  include/linux/vm_event_item.h         |   3 +
>>>>  include/trace/events/pagefault.h      |  80 +++++
>>>>  include/uapi/linux/perf_event.h       |   1 +
>>>>  kernel/fork.c                         |   5 +-
>>>>  mm/Kconfig                            |  22 ++
>>>>  mm/huge_memory.c                      |   6 +-
>>>>  mm/hugetlb.c                          |   2 +
>>>>  mm/init-mm.c                          |   3 +
>>>>  mm/internal.h                         |  20 ++
>>>>  mm/khugepaged.c                       |   5 +
>>>>  mm/madvise.c                          |   6 +-
>>>>  mm/memory.c                           | 612 +++++++++++++++++++++++++++++-----
>>>>  mm/mempolicy.c                        |  51 ++-
>>>>  mm/migrate.c                          |   6 +-
>>>>  mm/mlock.c                            |  13 +-
>>>>  mm/mmap.c                             | 229 ++++++++++---
>>>>  mm/mprotect.c                         |   4 +-
>>>>  mm/mremap.c                           |  13 +
>>>>  mm/nommu.c                            |   2 +-
>>>>  mm/rmap.c                             |   5 +-
>>>>  mm/swap.c                             |   6 +-
>>>>  mm/swap_state.c                       |   8 +-
>>>>  mm/vmstat.c                           |   5 +-
>>>>  tools/include/uapi/linux/perf_event.h |   1 +
>>>>  tools/perf/util/evsel.c               |   1 +
>>>>  tools/perf/util/parse-events.c        |   4 +
>>>>  tools/perf/util/parse-events.l        |   1 +
>>>>  tools/perf/util/python.c              |   1 +
>>>>  44 files changed, 1161 insertions(+), 211 deletions(-)
>>>>  create mode 100644 include/trace/events/pagefault.h
>>>>
>>>> --
>>>> 2.7.4
On Mon, Jun 11, 2018 at 05:15:22PM +0200, Laurent Dufour wrote:

Hi Laurent,

For the perf data tested on the Intel 4s Skylake platform, here attached the compare
result between base and head commit, which includes the perf-profile comparison
information. Also attached some perf-profile.json captured from the test results for
page_fault2 and page_fault3 for checking the regression, thanks.

Best regards,
Haiyan Song

> Hi Haiyan,
>
> I don't have access to the same hardware you ran the test on, but I gave those
> tests a try on a Power8 system (2 sockets, 5 cores/s, 8 threads/c, 80 CPUs, 32G).
> I ran each will-it-scale test 10 times and computed the average.
>
> test THP enabled             4.17.0-rc4-mm1    spf          delta
> page_fault3_threads          2697.7            2683.5       -0.53%
> page_fault2_threads          170660.6          169574.1     -0.64%
> context_switch1_threads      6915269.2         6877507.3    -0.55%
> context_switch1_processes    6478076.2         6529493.5    0.79%
> brk1                         243391.2          238527.5     -2.00%
>
> Tests were launched with the arguments '-t 80 -s 5'; only the average report is
> taken into account. Note that the page size is 64K by default on ppc64.
>
> It would be nice if you could capture some perf data to figure out why
> page_fault2/3 are showing such a performance regression.
>
> Thanks,
> Laurent.
>
> On 11/06/2018 09:49, Song, HaiyanX wrote:
> > Hi Laurent,
> >
> > Regression tests for the v11 patch series have been run; some regressions were found by LKP-tools (linux kernel performance)
> > tested on the Intel 4s Skylake platform. This time only the cases which had been run and had shown regressions on the
> > v9 patch series were tested.
> >
> > The regression result is sorted by the metric will-it-scale.per_thread_ops.
> > branch: Laurent-Dufour/Speculative-page-faults/20180520-045126
> > commit id:
> >   head commit : a7a8993bfe3ccb54ad468b9f1799649e4ad1ff12
> >   base commit : ba98a1cdad71d259a194461b3a61471b49b14df1
> > Benchmark: will-it-scale
> > Download link: https://github.com/antonblanchard/will-it-scale/tree/master
> >
> > Metrics:
> >   will-it-scale.per_process_ops=processes/nr_cpu
> >   will-it-scale.per_thread_ops=threads/nr_cpu
> > test box: lkp-skl-4sp1(nr_cpu=192,memory=768G)
> > THP: enable / disable
> > nr_task: 100%
> >
> > 1. Regressions:
> >
> > a). Enable THP
> > testcase                       base      change    head      metric
> > page_fault3/enable THP         10519     -20.5%    8368      will-it-scale.per_thread_ops
> > page_fault2/enable THP         8281      -18.8%    6728      will-it-scale.per_thread_ops
> > brk1/enable THP                998475    -2.2%     976893    will-it-scale.per_process_ops
> > context_switch1/enable THP     223910    -1.3%     220930    will-it-scale.per_process_ops
> > context_switch1/enable THP     233722    -1.0%     231288    will-it-scale.per_thread_ops
> >
> > b). Disable THP
> > page_fault3/disable THP        10856     -23.1%    8344      will-it-scale.per_thread_ops
> > page_fault2/disable THP        8147      -18.8%    6613      will-it-scale.per_thread_ops
> > brk1/disable THP               957       -7.9%     881       will-it-scale.per_thread_ops
> > context_switch1/disable THP    237006    -2.2%     231907    will-it-scale.per_thread_ops
> > brk1/disable THP               997317    -2.0%     977778    will-it-scale.per_process_ops
> > page_fault3/disable THP        467454    -1.8%     459251    will-it-scale.per_process_ops
> > context_switch1/disable THP    224431    -1.3%     221567    will-it-scale.per_process_ops
> >
> > Notes: for the above test results, higher is better.
> >
> > 2. Improvements: no improvement was found for the selected test cases.
> >
> >
> > Best regards
> > Haiyan Song
> > ________________________________________
> > From: owner-linux-mm@kvack.org [owner-linux-mm@kvack.org] on behalf of Laurent Dufour [ldufour@linux.vnet.ibm.com]
> > Sent: Monday, May 28, 2018 4:54 PM
> > To: Song, HaiyanX
> > Cc: akpm@linux-foundation.org; mhocko@kernel.org; peterz@infradead.org; kirill@shutemov.name; ak@linux.intel.com; dave@stgolabs.net; jack@suse.cz; Matthew Wilcox; khandual@linux.vnet.ibm.com; aneesh.kumar@linux.vnet.ibm.com; benh@kernel.crashing.org; mpe@ellerman.id.au; paulus@samba.org; Thomas Gleixner; Ingo Molnar; hpa@zytor.com; Will Deacon; Sergey Senozhatsky; sergey.senozhatsky.work@gmail.com; Andrea Arcangeli; Alexei Starovoitov; Wang, Kemi; Daniel Jordan; David Rientjes; Jerome Glisse; Ganesh Mahendran; Minchan Kim; Punit Agrawal; vinayak menon; Yang Shi; linux-kernel@vger.kernel.org; linux-mm@kvack.org; haren@linux.vnet.ibm.com; npiggin@gmail.com; bsingharora@gmail.com; paulmck@linux.vnet.ibm.com; Tim Chen; linuxppc-dev@lists.ozlabs.org; x86@kernel.org
> > Subject: Re: [PATCH v11 00/26] Speculative page faults
> >
> > On 28/05/2018 10:22, Haiyan Song wrote:
> >> Hi Laurent,
> >>
> >> Yes, these tests were done on the V9 patch.
> >
> > Do you plan to give this V11 a run ?
> >
> >>
> >> Best regards,
> >> Haiyan Song
> >>
> >> On Mon, May 28, 2018 at 09:51:34AM +0200, Laurent Dufour wrote:
> >>> On 28/05/2018 07:23, Song, HaiyanX wrote:
> >>>>
> >>>> Some regressions and improvements were found by LKP-tools (linux kernel performance) on the V9 patch series
> >>>> tested on the Intel 4s Skylake platform.
> >>>
> >>> Hi,
> >>>
> >>> Thanks for reporting these benchmark results, but you mentioned the "V9 patch
> >>> series" while responding to the v11 header series...
> >>> Were these tests done on v9 or v11 ?
> >>>
> >>> Cheers,
> >>> Laurent.
> >>>
> >>>>
> >>>> The regression result is sorted by the metric will-it-scale.per_thread_ops.
> >>>> Branch: Laurent-Dufour/Speculative-page-faults/20180316-151833 (V9 patch series) > >>>> Commit id: > >>>> base commit: d55f34411b1b126429a823d06c3124c16283231f > >>>> head commit: 0355322b3577eeab7669066df42c550a56801110 > >>>> Benchmark suite: will-it-scale > >>>> Download link: > >>>> https://github.com/antonblanchard/will-it-scale/tree/master/tests > >>>> Metrics: > >>>> will-it-scale.per_process_ops=processes/nr_cpu > >>>> will-it-scale.per_thread_ops=threads/nr_cpu > >>>> test box: lkp-skl-4sp1(nr_cpu=192,memory=768G) > >>>> THP: enable / disable > >>>> nr_task: 100% > >>>> > >>>> 1. Regressions: > >>>> a) THP enabled: > >>>> testcase base change head metric > >>>> page_fault3/ enable THP 10092 -17.5% 8323 will-it-scale.per_thread_ops > >>>> page_fault2/ enable THP 8300 -17.2% 6869 will-it-scale.per_thread_ops > >>>> brk1/ enable THP 957.67 -7.6% 885 will-it-scale.per_thread_ops > >>>> page_fault3/ enable THP 172821 -5.3% 163692 will-it-scale.per_process_ops > >>>> signal1/ enable THP 9125 -3.2% 8834 will-it-scale.per_process_ops > >>>> > >>>> b) THP disabled: > >>>> testcase base change head metric > >>>> page_fault3/ disable THP 10107 -19.1% 8180 will-it-scale.per_thread_ops > >>>> page_fault2/ disable THP 8432 -17.8% 6931 will-it-scale.per_thread_ops > >>>> context_switch1/ disable THP 215389 -6.8% 200776 will-it-scale.per_thread_ops > >>>> brk1/ disable THP 939.67 -6.6% 877.33 will-it-scale.per_thread_ops > >>>> page_fault3/ disable THP 173145 -4.7% 165064 will-it-scale.per_process_ops > >>>> signal1/ disable THP 9162 -3.9% 8802 will-it-scale.per_process_ops > >>>> > >>>> 2. 
Improvements: > >>>> a) THP enabled: > >>>> testcase base change head metric > >>>> malloc1/ enable THP 66.33 +469.8% 383.67 will-it-scale.per_thread_ops > >>>> writeseek3/ enable THP 2531 +4.5% 2646 will-it-scale.per_thread_ops > >>>> signal1/ enable THP 989.33 +2.8% 1016 will-it-scale.per_thread_ops > >>>> > >>>> b) THP disabled: > >>>> testcase base change head metric > >>>> malloc1/ disable THP 90.33 +417.3% 467.33 will-it-scale.per_thread_ops > >>>> read2/ disable THP 58934 +39.2% 82060 will-it-scale.per_thread_ops > >>>> page_fault1/ disable THP 8607 +36.4% 11736 will-it-scale.per_thread_ops > >>>> read1/ disable THP 314063 +12.7% 353934 will-it-scale.per_thread_ops > >>>> writeseek3/ disable THP 2452 +12.5% 2759 will-it-scale.per_thread_ops > >>>> signal1/ disable THP 971.33 +5.5% 1024 will-it-scale.per_thread_ops > >>>> > >>>> Notes: for above values in column "change", the higher value means that the related testcase result > >>>> on head commit is better than that on base commit for this benchmark. 
> >>>> > >>>> > >>>> Best regards > >>>> Haiyan Song > >>>> > >>>> ________________________________________ > >>>> From: owner-linux-mm@kvack.org [owner-linux-mm@kvack.org] on behalf of Laurent Dufour [ldufour@linux.vnet.ibm.com] > >>>> Sent: Thursday, May 17, 2018 7:06 PM > >>>> To: akpm@linux-foundation.org; mhocko@kernel.org; peterz@infradead.org; kirill@shutemov.name; ak@linux.intel.com; dave@stgolabs.net; jack@suse.cz; Matthew Wilcox; khandual@linux.vnet.ibm.com; aneesh.kumar@linux.vnet.ibm.com; benh@kernel.crashing.org; mpe@ellerman.id.au; paulus@samba.org; Thomas Gleixner; Ingo Molnar; hpa@zytor.com; Will Deacon; Sergey Senozhatsky; sergey.senozhatsky.work@gmail.com; Andrea Arcangeli; Alexei Starovoitov; Wang, Kemi; Daniel Jordan; David Rientjes; Jerome Glisse; Ganesh Mahendran; Minchan Kim; Punit Agrawal; vinayak menon; Yang Shi > >>>> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org; haren@linux.vnet.ibm.com; npiggin@gmail.com; bsingharora@gmail.com; paulmck@linux.vnet.ibm.com; Tim Chen; linuxppc-dev@lists.ozlabs.org; x86@kernel.org > >>>> Subject: [PATCH v11 00/26] Speculative page faults > >>>> > >>>> This is a port on kernel 4.17 of the work done by Peter Zijlstra to handle > >>>> page fault without holding the mm semaphore [1]. > >>>> > >>>> The idea is to try to handle user space page faults without holding the > >>>> mmap_sem. This should allow better concurrency for massively threaded > >>>> process since the page fault handler will not wait for other threads memory > >>>> layout change to be done, assuming that this change is done in another part > >>>> of the process's memory space. This type page fault is named speculative > >>>> page fault. If the speculative page fault fails because of a concurrency is > >>>> detected or because underlying PMD or PTE tables are not yet allocating, it > >>>> is failing its processing and a classic page fault is then tried. 
> >>>> > >>>> The speculative page fault (SPF) has to look for the VMA matching the fault > >>>> address without holding the mmap_sem, this is done by introducing a rwlock > >>>> which protects the access to the mm_rb tree. Previously this was done using > >>>> SRCU but it was introducing a lot of scheduling to process the VMA's > >>>> freeing operation which was hitting the performance by 20% as reported by > >>>> Kemi Wang [2]. Using a rwlock to protect access to the mm_rb tree is > >>>> limiting the locking contention to these operations which are expected to > >>>> be in a O(log n) order. In addition to ensure that the VMA is not freed in > >>>> our back a reference count is added and 2 services (get_vma() and > >>>> put_vma()) are introduced to handle the reference count. Once a VMA is > >>>> fetched from the RB tree using get_vma(), it must be later freed using > >>>> put_vma(). I can't see anymore the overhead I got while will-it-scale > >>>> benchmark anymore. > >>>> > >>>> The VMA's attributes checked during the speculative page fault processing > >>>> have to be protected against parallel changes. This is done by using a per > >>>> VMA sequence lock. This sequence lock allows the speculative page fault > >>>> handler to fast check for parallel changes in progress and to abort the > >>>> speculative page fault in that case. > >>>> > >>>> Once the VMA has been found, the speculative page fault handler would check > >>>> for the VMA's attributes to verify that the page fault has to be handled > >>>> correctly or not. Thus, the VMA is protected through a sequence lock which > >>>> allows fast detection of concurrent VMA changes. If such a change is > >>>> detected, the speculative page fault is aborted and a *classic* page fault > >>>> is tried. VMA sequence lockings are added when VMA attributes which are > >>>> checked during the page fault are modified. 
> >>>> > >>>> When the PTE is fetched, the VMA is checked to see if it has been changed, > >>>> so once the page table is locked, the VMA is valid, so any other changes > >>>> leading to touching this PTE will need to lock the page table, so no > >>>> parallel change is possible at this time. > >>>> > >>>> The locking of the PTE is done with interrupts disabled, this allows > >>>> checking for the PMD to ensure that there is not an ongoing collapsing > >>>> operation. Since khugepaged is firstly set the PMD to pmd_none and then is > >>>> waiting for the other CPU to have caught the IPI interrupt, if the pmd is > >>>> valid at the time the PTE is locked, we have the guarantee that the > >>>> collapsing operation will have to wait on the PTE lock to move forward. > >>>> This allows the SPF handler to map the PTE safely. If the PMD value is > >>>> different from the one recorded at the beginning of the SPF operation, the > >>>> classic page fault handler will be called to handle the operation while > >>>> holding the mmap_sem. As the PTE lock is done with the interrupts disabled, > >>>> the lock is done using spin_trylock() to avoid dead lock when handling a > >>>> page fault while a TLB invalidate is requested by another CPU holding the > >>>> PTE. > >>>> > >>>> In pseudo code, this could be seen as: > >>>> speculative_page_fault() > >>>> { > >>>> vma = get_vma() > >>>> check vma sequence count > >>>> check vma's support > >>>> disable interrupt > >>>> check pgd,p4d,...,pte > >>>> save pmd and pte in vmf > >>>> save vma sequence counter in vmf > >>>> enable interrupt > >>>> check vma sequence count > >>>> handle_pte_fault(vma) > >>>> .. > >>>> page = alloc_page() > >>>> pte_map_lock() > >>>> disable interrupt > >>>> abort if sequence counter has changed > >>>> abort if pmd or pte has changed > >>>> pte map and lock > >>>> enable interrupt > >>>> if abort > >>>> free page > >>>> abort > >>>> ... 
> >>>> } > >>>> > >>>> arch_fault_handler() > >>>> { > >>>> if (speculative_page_fault(&vma)) > >>>> goto done > >>>> again: > >>>> lock(mmap_sem) > >>>> vma = find_vma(); > >>>> handle_pte_fault(vma); > >>>> if retry > >>>> unlock(mmap_sem) > >>>> goto again; > >>>> done: > >>>> handle fault error > >>>> } > >>>> > >>>> Support for THP is not done because when checking for the PMD, we can be > >>>> confused by an in progress collapsing operation done by khugepaged. The > >>>> issue is that pmd_none() could be true either if the PMD is not already > >>>> populated or if the underlying PTE are in the way to be collapsed. So we > >>>> cannot safely allocate a PMD if pmd_none() is true. > >>>> > >>>> This series add a new software performance event named 'speculative-faults' > >>>> or 'spf'. It counts the number of successful page fault event handled > >>>> speculatively. When recording 'faults,spf' events, the faults one is > >>>> counting the total number of page fault events while 'spf' is only counting > >>>> the part of the faults processed speculatively. > >>>> > >>>> There are some trace events introduced by this series. They allow > >>>> identifying why the page faults were not processed speculatively. This > >>>> doesn't take in account the faults generated by a monothreaded process > >>>> which directly processed while holding the mmap_sem. This trace events are > >>>> grouped in a system named 'pagefault', they are: > >>>> - pagefault:spf_vma_changed : if the VMA has been changed in our back > >>>> - pagefault:spf_vma_noanon : the vma->anon_vma field was not yet set. > >>>> - pagefault:spf_vma_notsup : the VMA's type is not supported > >>>> - pagefault:spf_vma_access : the VMA's access right are not respected > >>>> - pagefault:spf_pmd_changed : the upper PMD pointer has changed in our > >>>> back. 
> >>>> > >>>> To record all the related events, the easier is to run perf with the > >>>> following arguments : > >>>> $ perf stat -e 'faults,spf,pagefault:*' <command> > >>>> > >>>> There is also a dedicated vmstat counter showing the number of successful > >>>> page fault handled speculatively. I can be seen this way: > >>>> $ grep speculative_pgfault /proc/vmstat > >>>> > >>>> This series builds on top of v4.16-mmotm-2018-04-13-17-28 and is functional > >>>> on x86, PowerPC and arm64. > >>>> > >>>> --------------------- > >>>> Real Workload results > >>>> > >>>> As mentioned in previous email, we did non official runs using a "popular > >>>> in memory multithreaded database product" on 176 cores SMT8 Power system > >>>> which showed a 30% improvements in the number of transaction processed per > >>>> second. This run has been done on the v6 series, but changes introduced in > >>>> this new version should not impact the performance boost seen. > >>>> > >>>> Here are the perf data captured during 2 of these runs on top of the v8 > >>>> series: > >>>> vanilla spf > >>>> faults 89.418 101.364 +13% > >>>> spf n/a 97.989 > >>>> > >>>> With the SPF kernel, most of the page fault were processed in a speculative > >>>> way. > >>>> > >>>> Ganesh Mahendran had backported the series on top of a 4.9 kernel and gave > >>>> it a try on an android device. He reported that the application launch time > >>>> was improved in average by 6%, and for large applications (~100 threads) by > >>>> 20%. 
> >>>>
> >>>> Here are the launch times Ganesh measured on Android 8.0 on top of a Qcom
> >>>> MSM845 (8 cores) with 6GB (lower is better):
> >>>>
> >>>> Application                          4.9    4.9+spf  delta
> >>>> com.tencent.mm                       416    389      -7%
> >>>> com.eg.android.AlipayGphone          1135   986      -13%
> >>>> com.tencent.mtt                      455    454      0%
> >>>> com.qqgame.hlddz                     1497   1409     -6%
> >>>> com.autonavi.minimap                 711    701      -1%
> >>>> com.tencent.tmgp.sgame               788    748      -5%
> >>>> com.immomo.momo                      501    487      -3%
> >>>> com.tencent.peng                     2145   2112     -2%
> >>>> com.smile.gifmaker                   491    461      -6%
> >>>> com.baidu.BaiduMap                   479    366      -23%
> >>>> com.taobao.taobao                    1341   1198     -11%
> >>>> com.baidu.searchbox                  333    314      -6%
> >>>> com.tencent.mobileqq                 394    384      -3%
> >>>> com.sina.weibo                       907    906      0%
> >>>> com.youku.phone                      816    731      -11%
> >>>> com.happyelements.AndroidAnimal.qq   763    717      -6%
> >>>> com.UCMobile                         415    411      -1%
> >>>> com.tencent.tmgp.ak                  1464   1431     -2%
> >>>> com.tencent.qqmusic                  336    329      -2%
> >>>> com.sankuai.meituan                  1661   1302     -22%
> >>>> com.netease.cloudmusic               1193   1200     1%
> >>>> air.tv.douyu.android                 4257   4152     -2%
> >>>>
> >>>> ------------------
> >>>> Benchmark results
> >>>>
> >>>> Base kernel is v4.17.0-rc4-mm1
> >>>> SPF is BASE + this series
> >>>>
> >>>> Kernbench:
> >>>> ----------
> >>>> Here are the results on a 16-CPU x86 guest using kernbench on a 4.15
> >>>> kernel (the kernel is built 5 times):
> >>>>
> >>>>                  Average Half load -j 8
> >>>>                  Run (std deviation)
> >>>>                  BASE                   SPF
> >>>> Elapsed Time     1448.65 (5.72312)      1455.84 (4.84951)     0.50%
> >>>> User Time        10135.4 (30.3699)      10148.8 (31.1252)     0.13%
> >>>> System Time      900.47 (2.81131)       923.28 (7.52779)      2.53%
> >>>> Percent CPU      761.4 (1.14018)        760.2 (0.447214)      -0.16%
> >>>> Context Switches 85380 (3419.52)        84748 (1904.44)       -0.74%
> >>>> Sleeps           105064 (1240.96)       105074 (337.612)      0.01%
> >>>>
> >>>>                  Average Optimal load -j 16
> >>>>                  Run (std deviation)
> >>>>                  BASE                   SPF
> >>>> Elapsed Time     920.528 (10.1212)      927.404 (8.91789)     0.75%
> >>>> User Time        11064.8 (981.142)      11085 (990.897)       0.18%
> >>>> System Time      979.904 (84.0615)      1001.14 (82.5523)     2.17%
> >>>> Percent CPU      1089.5 (345.894)       1086.1 (343.545)      -0.31%
> >>>> Context Switches 159488 (78156.4)       158223 (77472.1)      -0.79%
> >>>> Sleeps           110566 (5877.49)       110388 (5617.75)      -0.16%
> >>>>
> >>>>
> >>>> During a run on the SPF kernel, perf events were captured:
> >>>>  Performance counter stats for '../kernbench -M':
> >>>>      526743764      faults
> >>>>            210      spf
> >>>>              3      pagefault:spf_vma_changed
> >>>>              0      pagefault:spf_vma_noanon
> >>>>           2278      pagefault:spf_vma_notsup
> >>>>              0      pagefault:spf_vma_access
> >>>>              0      pagefault:spf_pmd_changed
> >>>>
> >>>> Very few speculative page faults were recorded as most of the processes
> >>>> involved are single-threaded (it seems that on this architecture some
> >>>> threads were created during the kernel build process).
> >>>>
> >>>> Here are the kernbench results on an 80-CPU Power8 system:
> >>>>
> >>>>                  Average Half load -j 40
> >>>>                  Run (std deviation)
> >>>>                  BASE                   SPF
> >>>> Elapsed Time     117.152 (0.774642)     117.166 (0.476057)    0.01%
> >>>> User Time        4478.52 (24.7688)      4479.76 (9.08555)     0.03%
> >>>> System Time      131.104 (0.720056)     134.04 (0.708414)     2.24%
> >>>> Percent CPU      3934 (19.7104)         3937.2 (19.0184)      0.08%
> >>>> Context Switches 92125.4 (576.787)      92581.6 (198.622)     0.50%
> >>>> Sleeps           317923 (652.499)       318469 (1255.59)      0.17%
> >>>>
> >>>>                  Average Optimal load -j 80
> >>>>                  Run (std deviation)
> >>>>                  BASE                   SPF
> >>>> Elapsed Time     107.73 (0.632416)      107.31 (0.584936)     -0.39%
> >>>> User Time        5869.86 (1466.72)      5871.71 (1467.27)     0.03%
> >>>> System Time      153.728 (23.8573)      157.153 (24.3704)     2.23%
> >>>> Percent CPU      5418.6 (1565.17)       5436.7 (1580.91)      0.33%
> >>>> Context Switches 223861 (138865)        225032 (139632)       0.52%
> >>>> Sleeps           330529 (13495.1)       332001 (14746.2)      0.45%
> >>>>
> >>>> During a run on the SPF kernel, perf events were captured:
> >>>>  Performance counter stats for '../kernbench -M':
> >>>>      116730856      faults
> >>>>              0      spf
> >>>>              3      pagefault:spf_vma_changed
> >>>>              0      pagefault:spf_vma_noanon
> >>>>            476      pagefault:spf_vma_notsup
> >>>>              0      pagefault:spf_vma_access
> >>>>              0      pagefault:spf_pmd_changed
> >>>>
> >>>> Most of the processes involved are single-threaded so SPF is not
> >>>> activated, but there is no impact on performance.
> >>>>
> >>>> Ebizzy:
> >>>> -------
> >>>> The test counts the number of records per second it can manage; higher is
> >>>> better. I ran it like this: 'ebizzy -mTt <nrcpus>'. To get consistent
> >>>> results I repeated the test 100 times and measured the average. The number
> >>>> is the records processed per second; higher is better.
> >>>>
> >>>>                  BASE       SPF        delta
> >>>> 16 CPUs x86 VM   742.57     1490.24    100.69%
> >>>> 80 CPUs P8 node  13105.4    24174.23   84.46%
> >>>>
> >>>> Here are the performance counters read during a run on a 16-CPU x86 VM:
> >>>>  Performance counter stats for './ebizzy -mTt 16':
> >>>>        1706379      faults
> >>>>        1674599      spf
> >>>>          30588      pagefault:spf_vma_changed
> >>>>              0      pagefault:spf_vma_noanon
> >>>>            363      pagefault:spf_vma_notsup
> >>>>              0      pagefault:spf_vma_access
> >>>>              0      pagefault:spf_pmd_changed
> >>>>
> >>>> And the ones captured during a run on an 80-CPU Power node:
> >>>>  Performance counter stats for './ebizzy -mTt 80':
> >>>>        1874773      faults
> >>>>        1461153      spf
> >>>>         413293      pagefault:spf_vma_changed
> >>>>              0      pagefault:spf_vma_noanon
> >>>>            200      pagefault:spf_vma_notsup
> >>>>              0      pagefault:spf_vma_access
> >>>>              0      pagefault:spf_pmd_changed
> >>>>
> >>>> In ebizzy's case most of the page faults were handled in a speculative
> >>>> way, leading to the ebizzy performance boost.
> >>>>
> >>>> ------------------
> >>>> Changes since v10 (https://lkml.org/lkml/2018/4/17/572):
> >>>>  - Accounted for all review feedback from Punit Agrawal, Ganesh Mahendran
> >>>>    and Minchan Kim, hopefully.
> >>>>  - Removed an unneeded check on CONFIG_SPECULATIVE_PAGE_FAULT in
> >>>>    __do_page_fault().
> >>>>  - Loop in pte_spinlock() and pte_map_lock() when the pte try lock fails
> >>>>    instead of aborting the speculative page fault handling. Dropped the
> >>>>    now useless trace event pagefault:spf_pte_lock.
> >>>>  - No longer try to reuse the fetched VMA during the speculative page
> >>>>    fault handling when retrying is needed. This added a lot of complexity
> >>>>    and additional tests didn't show a significant performance improvement.
> >>>>  - Converted IS_ENABLED(CONFIG_NUMA) back to #ifdef due to a build error.
> >>>>
> >>>> [1] http://linux-kernel.2935.n7.nabble.com/RFC-PATCH-0-6-Another-go-at-speculative-page-faults-tt965642.html#none
> >>>> [2] https://patchwork.kernel.org/patch/9999687/
> >>>>
> >>>>
> >>>> Laurent Dufour (20):
> >>>>   mm: introduce CONFIG_SPECULATIVE_PAGE_FAULT
> >>>>   x86/mm: define ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT
> >>>>   powerpc/mm: set ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT
> >>>>   mm: introduce pte_spinlock for FAULT_FLAG_SPECULATIVE
> >>>>   mm: make pte_unmap_same compatible with SPF
> >>>>   mm: introduce INIT_VMA()
> >>>>   mm: protect VMA modifications using VMA sequence count
> >>>>   mm: protect mremap() against SPF hanlder
> >>>>   mm: protect SPF handler against anon_vma changes
> >>>>   mm: cache some VMA fields in the vm_fault structure
> >>>>   mm/migrate: Pass vm_fault pointer to migrate_misplaced_page()
> >>>>   mm: introduce __lru_cache_add_active_or_unevictable
> >>>>   mm: introduce __vm_normal_page()
> >>>>   mm: introduce __page_add_new_anon_rmap()
> >>>>   mm: protect mm_rb tree with a rwlock
> >>>>   mm: adding speculative page fault failure trace events
> >>>>   perf: add a speculative page fault sw event
> >>>>   perf tools: add support for the SPF perf event
> >>>>   mm: add speculative page fault vmstats
> >>>>   powerpc/mm: add speculative page fault
> >>>>
> >>>> Mahendran Ganesh (2):
> >>>>   arm64/mm: define ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT
> >>>>   arm64/mm: add speculative page fault
> >>>>
> >>>> Peter Zijlstra (4):
> >>>>   mm: prepare for FAULT_FLAG_SPECULATIVE
> >>>>   mm: VMA sequence count
> >>>>   mm: provide speculative fault infrastructure
> >>>>   x86/mm: add speculative pagefault handling
> >>>>
> >>>>  arch/arm64/Kconfig                    |   1 +
> >>>>  arch/arm64/mm/fault.c                 |  12 +
> >>>>  arch/powerpc/Kconfig                  |   1 +
> >>>>  arch/powerpc/mm/fault.c               |  16 +
> >>>>  arch/x86/Kconfig                      |   1 +
> >>>>  arch/x86/mm/fault.c                   |  27 +-
> >>>>  fs/exec.c                             |   2 +-
> >>>>  fs/proc/task_mmu.c                    |   5 +-
> >>>>  fs/userfaultfd.c                      |  17 +-
> >>>>  include/linux/hugetlb_inline.h        |   2 +-
> >>>>  include/linux/migrate.h               |   4 +-
> >>>>  include/linux/mm.h                    | 136 +++++++-
> >>>>  include/linux/mm_types.h              |   7 +
> >>>>  include/linux/pagemap.h               |   4 +-
> >>>>  include/linux/rmap.h                  |  12 +-
> >>>>  include/linux/swap.h                  |  10 +-
> >>>>  include/linux/vm_event_item.h         |   3 +
> >>>>  include/trace/events/pagefault.h      |  80 +++++
> >>>>  include/uapi/linux/perf_event.h       |   1 +
> >>>>  kernel/fork.c                         |   5 +-
> >>>>  mm/Kconfig                            |  22 ++
> >>>>  mm/huge_memory.c                      |   6 +-
> >>>>  mm/hugetlb.c                          |   2 +
> >>>>  mm/init-mm.c                          |   3 +
> >>>>  mm/internal.h                         |  20 ++
> >>>>  mm/khugepaged.c                       |   5 +
> >>>>  mm/madvise.c                          |   6 +-
> >>>>  mm/memory.c                           | 612 +++++++++++++++++++++++++++++-----
> >>>>  mm/mempolicy.c                        |  51 ++-
> >>>>  mm/migrate.c                          |   6 +-
> >>>>  mm/mlock.c                            |  13 +-
> >>>>  mm/mmap.c                             | 229 ++++++++++---
> >>>>  mm/mprotect.c                         |   4 +-
> >>>>  mm/mremap.c                           |  13 +
> >>>>  mm/nommu.c                            |   2 +-
> >>>>  mm/rmap.c                             |   5 +-
> >>>>  mm/swap.c                             |   6 +-
> >>>>  mm/swap_state.c                       |   8 +-
> >>>>  mm/vmstat.c                           |   5 +-
> >>>>  tools/include/uapi/linux/perf_event.h |   1 +
> >>>>  tools/perf/util/evsel.c               |   1 +
> >>>>  tools/perf/util/parse-events.c        |   4 +
> >>>>  tools/perf/util/parse-events.l        |   1 +
> >>>>  tools/perf/util/python.c              |   1 +
> >>>>  44 files changed, 1161 insertions(+), 211 deletions(-)
> >>>>  create mode 100644 include/trace/events/pagefault.h
> >>>>
> >>>> --
> >>>> 2.7.4
> >>>>
> >>>>
> >>>
> >>
> >
>
=========================================================================================
tbox_group/testcase/rootfs/kconfig/compiler/nr_task/thp_enabled/test/cpufreq_governor: lkp-skl-4sp1/will-it-scale/debian-x86_64-2018-04-03.cgz/x86_64-rhel-7.2/gcc-7/100%/always/page_fault3/performance commit: ba98a1cdad71d259a194461b3a61471b49b14df1 a7a8993bfe3ccb54ad468b9f1799649e4ad1ff12 ba98a1cdad71d259 a7a8993bfe3ccb54ad468b9f17 ---------------- -------------------------- fail:runs %reproduction fail:runs | | | 44:3 -13% 43:3 perf-profile.calltrace.cycles-pp.error_entry 22:3 -6% 22:3 perf-profile.calltrace.cycles-pp.sync_regs.error_entry 44:3 -13% 44:3 perf-profile.children.cycles-pp.error_entry 21:3 -7% 21:3 perf-profile.self.cycles-pp.error_entry %stddev %change %stddev \ | \ 10519 ± 3% -20.5% 8368 ± 6% will-it-scale.per_thread_ops 118098 +11.2% 131287 ± 2% will-it-scale.time.involuntary_context_switches 6.084e+08 ± 3% -20.4% 4.845e+08 ± 6% will-it-scale.time.minor_page_faults 7467 +5.0% 7841 will-it-scale.time.percent_of_cpu_this_job_got 44922 +5.0% 47176 will-it-scale.time.system_time 7126337 ± 3% -15.4% 6025689 ± 6% will-it-scale.time.voluntary_context_switches 91905646 -1.3% 90673935 will-it-scale.workload 27.15 ± 6% -8.7% 24.80 ± 10% boot-time.boot 2516213 ± 6% +8.3% 2726303 interrupts.CAL:Function_call_interrupts 388.00 ± 9% +60.2% 621.67 ± 20% irq_exception_noise.softirq_nr 11.28 ± 2% -1.9 9.37 ± 4% mpstat.cpu.idle% 10065 ±140% +243.4% 34559 ± 4% numa-numastat.node0.other_node 18739 -11.6% 16573 ± 3% uptime.idle 29406 ± 2% -11.8% 25929 ± 5% vmstat.system.cs 329614 ± 8% +17.0% 385618 ± 10% meminfo.DirectMap4k 237851 +21.2% 288160 ± 5% meminfo.Inactive 237615 +21.2% 287924 ± 5% meminfo.Inactive(anon) 7917847 -10.7% 7071860 softirqs.RCU 4784181 ± 3% -14.5% 4089039 ± 4% softirqs.SCHED 45666107 ± 7% +12.9% 51535472 ± 3% softirqs.TIMER 2.617e+09 ± 2% -13.9% 2.253e+09 ± 6% cpuidle.C1E.time 6688774 ± 2% -12.8% 5835101 ± 5% cpuidle.C1E.usage 1.022e+10 ± 2% -18.0% 8.376e+09 ± 3% cpuidle.C6.time 13440993 ± 2% -16.3% 11243794 ± 4% cpuidle.C6.usage 54781 ± 16% 
+37.5% 75347 ± 12% numa-meminfo.node0.Inactive 54705 ± 16% +37.7% 75347 ± 12% numa-meminfo.node0.Inactive(anon) 52522 +35.0% 70886 ± 6% numa-meminfo.node2.Inactive 52443 +34.7% 70653 ± 6% numa-meminfo.node2.Inactive(anon) 31046 ± 6% +30.3% 40457 ± 11% numa-meminfo.node2.SReclaimable 58563 +21.1% 70945 ± 6% proc-vmstat.nr_inactive_anon 58564 +21.1% 70947 ± 6% proc-vmstat.nr_zone_inactive_anon 69701118 -1.2% 68842151 proc-vmstat.pgalloc_normal 2.765e+10 -1.3% 2.729e+10 proc-vmstat.pgfault 69330418 -1.2% 68466824 proc-vmstat.pgfree 118098 +11.2% 131287 ± 2% time.involuntary_context_switches 6.084e+08 ± 3% -20.4% 4.845e+08 ± 6% time.minor_page_faults 7467 +5.0% 7841 time.percent_of_cpu_this_job_got 44922 +5.0% 47176 time.system_time 7126337 ± 3% -15.4% 6025689 ± 6% time.voluntary_context_switches 13653 ± 16% +33.5% 18225 ± 12% numa-vmstat.node0.nr_inactive_anon 13651 ± 16% +33.5% 18224 ± 12% numa-vmstat.node0.nr_zone_inactive_anon 13069 ± 3% +30.1% 17001 ± 4% numa-vmstat.node2.nr_inactive_anon 134.67 ± 42% -49.5% 68.00 ± 31% numa-vmstat.node2.nr_mlock 7758 ± 6% +30.4% 10112 ± 11% numa-vmstat.node2.nr_slab_reclaimable 13066 ± 3% +30.1% 16998 ± 4% numa-vmstat.node2.nr_zone_inactive_anon 1039 ± 11% -17.5% 857.33 slabinfo.Acpi-ParseExt.active_objs 1039 ± 11% -17.5% 857.33 slabinfo.Acpi-ParseExt.num_objs 2566 ± 6% -8.8% 2340 ± 5% slabinfo.biovec-64.active_objs 2566 ± 6% -8.8% 2340 ± 5% slabinfo.biovec-64.num_objs 898.33 ± 3% -9.5% 813.33 ± 3% slabinfo.kmem_cache_node.active_objs 1066 ± 2% -8.0% 981.33 ± 3% slabinfo.kmem_cache_node.num_objs 1940 +2.3% 1984 turbostat.Avg_MHz 6679037 ± 2% -12.7% 5830270 ± 5% turbostat.C1E 2.25 ± 2% -0.3 1.94 ± 6% turbostat.C1E% 13418115 -16.3% 11234510 ± 4% turbostat.C6 8.75 ± 2% -1.6 7.18 ± 3% turbostat.C6% 5.99 ± 2% -14.4% 5.13 ± 4% turbostat.CPU%c1 5.01 ± 3% -20.1% 4.00 ± 4% turbostat.CPU%c6 1.77 ± 3% -34.7% 1.15 turbostat.Pkg%pc2 1.378e+13 +1.2% 1.394e+13 perf-stat.branch-instructions 0.98 -0.0 0.94 perf-stat.branch-miss-rate% 1.344e+11 
-2.3% 1.313e+11 perf-stat.branch-misses 1.076e+11 -1.8% 1.057e+11 perf-stat.cache-misses 2.258e+11 -2.1% 2.21e+11 perf-stat.cache-references 17788064 ± 2% -11.9% 15674207 ± 6% perf-stat.context-switches 2.241e+14 +2.4% 2.294e+14 perf-stat.cpu-cycles 1.929e+13 +2.2% 1.971e+13 perf-stat.dTLB-loads 4.01 -0.2 3.83 perf-stat.dTLB-store-miss-rate% 4.519e+11 -1.3% 4.461e+11 perf-stat.dTLB-store-misses 1.082e+13 +3.6% 1.121e+13 perf-stat.dTLB-stores 3.02e+10 +23.2% 3.721e+10 ± 3% perf-stat.iTLB-load-misses 2.721e+08 ± 8% -8.8% 2.481e+08 ± 3% perf-stat.iTLB-loads 6.985e+13 +1.8% 7.111e+13 perf-stat.instructions 2313 -17.2% 1914 ± 3% perf-stat.instructions-per-iTLB-miss 2.764e+10 -1.3% 2.729e+10 perf-stat.minor-faults 1.421e+09 ± 2% -16.4% 1.188e+09 ± 9% perf-stat.node-load-misses 1.538e+10 -9.3% 1.395e+10 perf-stat.node-loads 9.75 +1.4 11.10 perf-stat.node-store-miss-rate% 3.012e+09 +14.1% 3.437e+09 perf-stat.node-store-misses 2.789e+10 -1.3% 2.753e+10 perf-stat.node-stores 2.764e+10 -1.3% 2.729e+10 perf-stat.page-faults 760059 +3.2% 784235 perf-stat.path-length 193545 ± 25% -57.8% 81757 ± 46% sched_debug.cfs_rq:/.MIN_vruntime.avg 26516863 ± 19% -49.7% 13338070 ± 33% sched_debug.cfs_rq:/.MIN_vruntime.max 2202271 ± 21% -53.2% 1029581 ± 38% sched_debug.cfs_rq:/.MIN_vruntime.stddev 193545 ± 25% -57.8% 81757 ± 46% sched_debug.cfs_rq:/.max_vruntime.avg 26516863 ± 19% -49.7% 13338070 ± 33% sched_debug.cfs_rq:/.max_vruntime.max 2202271 ± 21% -53.2% 1029581 ± 38% sched_debug.cfs_rq:/.max_vruntime.stddev 0.32 ± 70% +253.2% 1.14 ± 54% sched_debug.cfs_rq:/.removed.load_avg.avg 4.44 ± 70% +120.7% 9.80 ± 27% sched_debug.cfs_rq:/.removed.load_avg.stddev 14.90 ± 70% +251.0% 52.31 ± 53% sched_debug.cfs_rq:/.removed.runnable_sum.avg 205.71 ± 70% +119.5% 451.60 ± 27% sched_debug.cfs_rq:/.removed.runnable_sum.stddev 0.16 ± 70% +237.9% 0.54 ± 50% sched_debug.cfs_rq:/.removed.util_avg.avg 2.23 ± 70% +114.2% 4.77 ± 24% sched_debug.cfs_rq:/.removed.util_avg.stddev 573.70 ± 5% -9.7% 518.06 ± 6% 
sched_debug.cfs_rq:/.util_avg.min 114.87 ± 8% +14.1% 131.04 ± 10% sched_debug.cfs_rq:/.util_est_enqueued.avg 64.42 ± 54% -63.9% 23.27 ± 68% sched_debug.cpu.cpu_load[1].max 5.05 ± 48% -55.2% 2.26 ± 51% sched_debug.cpu.cpu_load[1].stddev 57.58 ± 59% -60.3% 22.88 ± 70% sched_debug.cpu.cpu_load[2].max 21019 ± 3% -15.1% 17841 ± 6% sched_debug.cpu.nr_switches.min 20797 ± 3% -15.0% 17670 ± 6% sched_debug.cpu.sched_count.min 10287 ± 3% -15.1% 8736 ± 6% sched_debug.cpu.sched_goidle.avg 13693 ± 2% -10.7% 12233 ± 5% sched_debug.cpu.sched_goidle.max 9976 ± 3% -16.0% 8381 ± 7% sched_debug.cpu.sched_goidle.min 0.00 ± 26% +98.9% 0.00 ± 28% sched_debug.rt_rq:/.rt_time.min 4230 ±141% -100.0% 0.00 latency_stats.avg.trace_module_notify.notifier_call_chain.blocking_notifier_call_chain.do_init_module.load_module.__do_sys_finit_module.do_syscall_64.entry_SYSCALL_64_after_hwframe 28498 ±141% -100.0% 0.00 latency_stats.avg.perf_event_alloc.__do_sys_perf_event_open.do_syscall_64.entry_SYSCALL_64_after_hwframe 4065 ±138% -92.2% 315.33 ± 91% latency_stats.avg.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_proc_lookup.nfs_lookup_revalidate.lookup_fast.walk_component.link_path_walk.path_lookupat.filename_lookup 0.00 +3.6e+105% 3641 ±141% latency_stats.avg.down.console_lock.console_device.tty_lookup_driver.tty_open.chrdev_open.do_dentry_open.path_openat.do_filp_open.do_sys_open.do_syscall_64.entry_SYSCALL_64_after_hwframe 0.00 +2.5e+106% 25040 ±141% latency_stats.avg.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_proc_lookup.nfs_lookup.path_openat.do_filp_open.do_sys_open.do_syscall_64.entry_SYSCALL_64_after_hwframe 0.00 +3.4e+106% 34015 ±141% latency_stats.avg.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_get_acl.get_acl.posix_acl_create.nfs3_proc_create.nfs_create.path_openat.do_filp_open.do_sys_open 0.00 +4.8e+106% 47686 ±141% 
latency_stats.avg.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_do_create.nfs3_proc_create.nfs_create.path_openat.do_filp_open.do_sys_open.do_syscall_64 4230 ±141% -100.0% 0.00 latency_stats.max.trace_module_notify.notifier_call_chain.blocking_notifier_call_chain.do_init_module.load_module.__do_sys_finit_module.do_syscall_64.entry_SYSCALL_64_after_hwframe 28498 ±141% -100.0% 0.00 latency_stats.max.perf_event_alloc.__do_sys_perf_event_open.do_syscall_64.entry_SYSCALL_64_after_hwframe 4065 ±138% -92.2% 315.33 ± 91% latency_stats.max.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_proc_lookup.nfs_lookup_revalidate.lookup_fast.walk_component.link_path_walk.path_lookupat.filename_lookup 4254 ±134% -88.0% 511.67 ± 90% latency_stats.max.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_proc_getattr.__nfs_revalidate_inode.nfs_do_access.nfs_permission.inode_permission.link_path_walk.path_lookupat 43093 ± 35% +76.6% 76099 ±115% latency_stats.max.blk_execute_rq.scsi_execute.ioctl_internal_command.scsi_set_medium_removal.cdrom_release.[cdrom].sr_block_release.[sr_mod].__blkdev_put.blkdev_close.__fput.task_work_run.exit_to_usermode_loop.do_syscall_64 24139 ± 70% +228.5% 79285 ±105% latency_stats.max.blk_execute_rq.scsi_execute.scsi_test_unit_ready.sr_check_events.[sr_mod].cdrom_check_events.[cdrom].sr_block_check_events.[sr_mod].disk_check_events.disk_clear_events.check_disk_change.sr_block_open.[sr_mod].__blkdev_get.blkdev_get 0.00 +3.6e+105% 3641 ±141% latency_stats.max.down.console_lock.console_device.tty_lookup_driver.tty_open.chrdev_open.do_dentry_open.path_openat.do_filp_open.do_sys_open.do_syscall_64.entry_SYSCALL_64_after_hwframe 0.00 +2.5e+106% 25040 ±141% latency_stats.max.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_proc_lookup.nfs_lookup.path_openat.do_filp_open.do_sys_open.do_syscall_64.entry_SYSCALL_64_after_hwframe 
0.00 +3.4e+106% 34015 ±141% latency_stats.max.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_get_acl.get_acl.posix_acl_create.nfs3_proc_create.nfs_create.path_openat.do_filp_open.do_sys_open 0.00 +6.5e+106% 64518 ±141% latency_stats.max.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_do_create.nfs3_proc_create.nfs_create.path_openat.do_filp_open.do_sys_open.do_syscall_64 4230 ±141% -100.0% 0.00 latency_stats.sum.trace_module_notify.notifier_call_chain.blocking_notifier_call_chain.do_init_module.load_module.__do_sys_finit_module.do_syscall_64.entry_SYSCALL_64_after_hwframe 28498 ±141% -100.0% 0.00 latency_stats.sum.perf_event_alloc.__do_sys_perf_event_open.do_syscall_64.entry_SYSCALL_64_after_hwframe 4065 ±138% -92.2% 315.33 ± 91% latency_stats.sum.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_proc_lookup.nfs_lookup_revalidate.lookup_fast.walk_component.link_path_walk.path_lookupat.filename_lookup 57884 ± 9% +47.3% 85264 ±118% latency_stats.sum.blk_execute_rq.scsi_execute.ioctl_internal_command.scsi_set_medium_removal.cdrom_release.[cdrom].sr_block_release.[sr_mod].__blkdev_put.blkdev_close.__fput.task_work_run.exit_to_usermode_loop.do_syscall_64 0.00 +3.6e+105% 3641 ±141% latency_stats.sum.down.console_lock.console_device.tty_lookup_driver.tty_open.chrdev_open.do_dentry_open.path_openat.do_filp_open.do_sys_open.do_syscall_64.entry_SYSCALL_64_after_hwframe 0.00 +2.5e+106% 25040 ±141% latency_stats.sum.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_proc_lookup.nfs_lookup.path_openat.do_filp_open.do_sys_open.do_syscall_64.entry_SYSCALL_64_after_hwframe 0.00 +3.4e+106% 34015 ±141% latency_stats.sum.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_get_acl.get_acl.posix_acl_create.nfs3_proc_create.nfs_create.path_openat.do_filp_open.do_sys_open 0.00 +9.5e+106% 95373 ±141% 
latency_stats.sum.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_do_create.nfs3_proc_create.nfs_create.path_openat.do_filp_open.do_sys_open.do_syscall_64 11.70 -11.7 0.00 perf-profile.calltrace.cycles-pp.__do_fault.__handle_mm_fault.handle_mm_fault.__do_page_fault.do_page_fault 11.52 -11.5 0.00 perf-profile.calltrace.cycles-pp.shmem_fault.__do_fault.__handle_mm_fault.handle_mm_fault.__do_page_fault 10.44 -10.4 0.00 perf-profile.calltrace.cycles-pp.shmem_getpage_gfp.shmem_fault.__do_fault.__handle_mm_fault.handle_mm_fault 9.83 -9.8 0.00 perf-profile.calltrace.cycles-pp.find_lock_entry.shmem_getpage_gfp.shmem_fault.__do_fault.__handle_mm_fault 9.55 -9.5 0.00 perf-profile.calltrace.cycles-pp.finish_fault.__handle_mm_fault.handle_mm_fault.__do_page_fault.do_page_fault 9.35 -9.3 0.00 perf-profile.calltrace.cycles-pp.alloc_set_pte.finish_fault.__handle_mm_fault.handle_mm_fault.__do_page_fault 6.81 -6.8 0.00 perf-profile.calltrace.cycles-pp.page_add_file_rmap.alloc_set_pte.finish_fault.__handle_mm_fault.handle_mm_fault 7.71 -0.3 7.45 perf-profile.calltrace.cycles-pp.find_get_entry.find_lock_entry.shmem_getpage_gfp.shmem_fault.__do_fault 0.59 ± 7% -0.2 0.35 ± 70% perf-profile.calltrace.cycles-pp.smp_apic_timer_interrupt.apic_timer_interrupt.__do_page_fault.do_page_fault.page_fault 0.59 ± 7% -0.2 0.35 ± 70% perf-profile.calltrace.cycles-pp.apic_timer_interrupt.__do_page_fault.do_page_fault.page_fault 10.41 -0.2 10.24 perf-profile.calltrace.cycles-pp.native_irq_return_iret 7.68 -0.1 7.60 perf-profile.calltrace.cycles-pp.swapgs_restore_regs_and_return_to_usermode 0.76 -0.1 0.70 perf-profile.calltrace.cycles-pp.down_read_trylock.__do_page_fault.do_page_fault.page_fault 1.38 -0.0 1.34 perf-profile.calltrace.cycles-pp.do_page_fault 1.05 -0.0 1.02 perf-profile.calltrace.cycles-pp.trace_graph_entry.do_page_fault 0.92 +0.0 0.94 perf-profile.calltrace.cycles-pp.find_vma.__do_page_fault.do_page_fault.page_fault 0.91 +0.0 0.93 
perf-profile.calltrace.cycles-pp.vmacache_find.find_vma.__do_page_fault.do_page_fault.page_fault 0.65 +0.0 0.67 perf-profile.calltrace.cycles-pp.set_page_dirty.unmap_page_range.unmap_vmas.unmap_region.do_munmap 0.62 +0.0 0.66 perf-profile.calltrace.cycles-pp.page_mapping.find_lock_entry.shmem_getpage_gfp.shmem_fault.__do_fault 4.15 +0.1 4.27 perf-profile.calltrace.cycles-pp.page_remove_rmap.unmap_page_range.unmap_vmas.unmap_region.do_munmap 10.17 +0.2 10.39 perf-profile.calltrace.cycles-pp.munmap 9.56 +0.2 9.78 perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.munmap 9.56 +0.2 9.78 perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.munmap 9.56 +0.2 9.78 perf-profile.calltrace.cycles-pp.unmap_region.do_munmap.vm_munmap.__x64_sys_munmap.do_syscall_64 9.54 +0.2 9.76 perf-profile.calltrace.cycles-pp.unmap_page_range.unmap_vmas.unmap_region.do_munmap.vm_munmap 9.54 +0.2 9.76 perf-profile.calltrace.cycles-pp.unmap_vmas.unmap_region.do_munmap.vm_munmap.__x64_sys_munmap 9.56 +0.2 9.78 perf-profile.calltrace.cycles-pp.do_munmap.vm_munmap.__x64_sys_munmap.do_syscall_64.entry_SYSCALL_64_after_hwframe 9.56 +0.2 9.78 perf-profile.calltrace.cycles-pp.vm_munmap.__x64_sys_munmap.do_syscall_64.entry_SYSCALL_64_after_hwframe.munmap 9.56 +0.2 9.78 perf-profile.calltrace.cycles-pp.__x64_sys_munmap.do_syscall_64.entry_SYSCALL_64_after_hwframe.munmap 0.00 +0.6 0.56 ± 2% perf-profile.calltrace.cycles-pp.lock_page_memcg.page_add_file_rmap.alloc_set_pte.finish_fault.handle_pte_fault 0.00 +0.6 0.59 perf-profile.calltrace.cycles-pp.page_mapping.set_page_dirty.fault_dirty_shared_page.handle_pte_fault.__handle_mm_fault 0.00 +0.6 0.60 perf-profile.calltrace.cycles-pp.current_time.file_update_time.handle_pte_fault.__handle_mm_fault.handle_mm_fault 0.00 +0.7 0.68 perf-profile.calltrace.cycles-pp.___might_sleep.find_lock_entry.shmem_getpage_gfp.shmem_fault.__do_fault 0.00 +0.7 0.74 
perf-profile.calltrace.cycles-pp.unlock_page.fault_dirty_shared_page.handle_pte_fault.__handle_mm_fault.handle_mm_fault 0.00 +0.8 0.80 perf-profile.calltrace.cycles-pp.set_page_dirty.fault_dirty_shared_page.handle_pte_fault.__handle_mm_fault.handle_mm_fault 0.00 +0.9 0.88 perf-profile.calltrace.cycles-pp._raw_spin_lock.pte_map_lock.alloc_set_pte.finish_fault.handle_pte_fault 0.00 +0.9 0.91 perf-profile.calltrace.cycles-pp.__set_page_dirty_no_writeback.fault_dirty_shared_page.handle_pte_fault.__handle_mm_fault.handle_mm_fault 0.00 +1.3 1.27 perf-profile.calltrace.cycles-pp.pte_map_lock.alloc_set_pte.finish_fault.handle_pte_fault.__handle_mm_fault 0.00 +1.3 1.30 perf-profile.calltrace.cycles-pp.file_update_time.handle_pte_fault.__handle_mm_fault.handle_mm_fault.__do_page_fault 0.00 +2.8 2.76 perf-profile.calltrace.cycles-pp.fault_dirty_shared_page.handle_pte_fault.__handle_mm_fault.handle_mm_fault.__do_page_fault 0.00 +6.8 6.81 perf-profile.calltrace.cycles-pp.page_add_file_rmap.alloc_set_pte.finish_fault.handle_pte_fault.__handle_mm_fault 0.00 +9.4 9.39 perf-profile.calltrace.cycles-pp.alloc_set_pte.finish_fault.handle_pte_fault.__handle_mm_fault.handle_mm_fault 0.00 +9.6 9.59 perf-profile.calltrace.cycles-pp.finish_fault.handle_pte_fault.__handle_mm_fault.handle_mm_fault.__do_page_fault 0.00 +9.8 9.77 perf-profile.calltrace.cycles-pp.find_lock_entry.shmem_getpage_gfp.shmem_fault.__do_fault.handle_pte_fault 0.00 +10.4 10.37 perf-profile.calltrace.cycles-pp.shmem_getpage_gfp.shmem_fault.__do_fault.handle_pte_fault.__handle_mm_fault 0.00 +11.5 11.46 perf-profile.calltrace.cycles-pp.shmem_fault.__do_fault.handle_pte_fault.__handle_mm_fault.handle_mm_fault 0.00 +11.6 11.60 perf-profile.calltrace.cycles-pp.__do_fault.handle_pte_fault.__handle_mm_fault.handle_mm_fault.__do_page_fault 0.00 +26.6 26.62 perf-profile.calltrace.cycles-pp.handle_pte_fault.__handle_mm_fault.handle_mm_fault.__do_page_fault.do_page_fault 7.88 -0.3 7.61 
perf-profile.children.cycles-pp.find_get_entry 1.34 ± 8% -0.2 1.16 ± 2% perf-profile.children.cycles-pp.hrtimer_interrupt 10.41 -0.2 10.24 perf-profile.children.cycles-pp.native_irq_return_iret 0.38 ± 28% -0.1 0.26 ± 4% perf-profile.children.cycles-pp.tick_sched_timer 11.80 -0.1 11.68 perf-profile.children.cycles-pp.__do_fault 0.55 ± 15% -0.1 0.43 ± 2% perf-profile.children.cycles-pp.__hrtimer_run_queues 0.60 -0.1 0.51 perf-profile.children.cycles-pp.pmd_devmap_trans_unstable 0.38 ± 13% -0.1 0.29 ± 4% perf-profile.children.cycles-pp.ktime_get 7.68 -0.1 7.60 perf-profile.children.cycles-pp.swapgs_restore_regs_and_return_to_usermode 5.18 -0.1 5.12 perf-profile.children.cycles-pp.trace_graph_entry 0.79 -0.1 0.73 perf-profile.children.cycles-pp.down_read_trylock 7.83 -0.1 7.76 perf-profile.children.cycles-pp.sync_regs 3.01 -0.1 2.94 perf-profile.children.cycles-pp.fault_dirty_shared_page 1.02 -0.1 0.96 perf-profile.children.cycles-pp._raw_spin_lock 4.66 -0.1 4.61 perf-profile.children.cycles-pp.prepare_ftrace_return 0.37 ± 8% -0.1 0.32 ± 3% perf-profile.children.cycles-pp.current_kernel_time64 5.26 -0.1 5.21 perf-profile.children.cycles-pp.ftrace_graph_caller 0.66 ± 5% -0.1 0.61 perf-profile.children.cycles-pp.current_time 0.18 ± 5% -0.0 0.15 ± 3% perf-profile.children.cycles-pp.update_process_times 0.27 -0.0 0.26 perf-profile.children.cycles-pp._cond_resched 0.16 -0.0 0.15 ± 3% perf-profile.children.cycles-pp.rcu_all_qs 0.94 +0.0 0.95 perf-profile.children.cycles-pp.vmacache_find 0.48 +0.0 0.50 perf-profile.children.cycles-pp.__mod_node_page_state 0.17 +0.0 0.19 ± 2% perf-profile.children.cycles-pp.__unlock_page_memcg 1.07 +0.0 1.10 perf-profile.children.cycles-pp.find_vma 0.79 ± 3% +0.1 0.86 ± 2% perf-profile.children.cycles-pp.lock_page_memcg 4.29 +0.1 4.40 perf-profile.children.cycles-pp.page_remove_rmap 1.39 ± 2% +0.1 1.52 perf-profile.children.cycles-pp.file_update_time 0.00 +0.2 0.16 perf-profile.children.cycles-pp.__vm_normal_page 9.63 +0.2 9.84 
perf-profile.children.cycles-pp.entry_SYSCALL_64_after_hwframe 9.63 +0.2 9.84 perf-profile.children.cycles-pp.do_syscall_64 9.63 +0.2 9.84 perf-profile.children.cycles-pp.unmap_page_range 10.17 +0.2 10.39 perf-profile.children.cycles-pp.munmap 9.56 +0.2 9.78 perf-profile.children.cycles-pp.unmap_region 9.56 +0.2 9.78 perf-profile.children.cycles-pp.do_munmap 9.56 +0.2 9.78 perf-profile.children.cycles-pp.vm_munmap 9.56 +0.2 9.78 perf-profile.children.cycles-pp.__x64_sys_munmap 9.54 +0.2 9.77 perf-profile.children.cycles-pp.unmap_vmas 1.01 +0.2 1.25 perf-profile.children.cycles-pp.___might_sleep 0.00 +1.6 1.59 perf-profile.children.cycles-pp.pte_map_lock 0.00 +26.9 26.89 perf-profile.children.cycles-pp.handle_pte_fault 4.25 -1.0 3.24 perf-profile.self.cycles-pp.__handle_mm_fault 1.42 -0.3 1.11 perf-profile.self.cycles-pp.alloc_set_pte 4.87 -0.3 4.59 perf-profile.self.cycles-pp.find_get_entry 10.41 -0.2 10.24 perf-profile.self.cycles-pp.native_irq_return_iret 0.37 ± 13% -0.1 0.28 ± 4% perf-profile.self.cycles-pp.ktime_get 0.60 -0.1 0.51 perf-profile.self.cycles-pp.pmd_devmap_trans_unstable 7.50 -0.1 7.42 perf-profile.self.cycles-pp.swapgs_restore_regs_and_return_to_usermode 7.83 -0.1 7.76 perf-profile.self.cycles-pp.sync_regs 4.85 -0.1 4.79 perf-profile.self.cycles-pp.trace_graph_entry 1.01 -0.1 0.95 perf-profile.self.cycles-pp._raw_spin_lock 0.78 -0.1 0.73 perf-profile.self.cycles-pp.down_read_trylock 0.36 ± 9% -0.1 0.31 ± 4% perf-profile.self.cycles-pp.current_kernel_time64 0.28 -0.0 0.23 ± 2% perf-profile.self.cycles-pp.__do_fault 1.04 -0.0 1.00 perf-profile.self.cycles-pp.find_lock_entry 0.30 -0.0 0.28 ± 3% perf-profile.self.cycles-pp.fault_dirty_shared_page 0.70 -0.0 0.67 perf-profile.self.cycles-pp.prepare_ftrace_return 0.44 -0.0 0.42 perf-profile.self.cycles-pp.do_page_fault 0.16 -0.0 0.14 perf-profile.self.cycles-pp.rcu_all_qs 0.78 -0.0 0.77 perf-profile.self.cycles-pp.shmem_getpage_gfp 0.20 -0.0 0.19 perf-profile.self.cycles-pp._cond_resched 0.50 +0.0 0.51 
perf-profile.self.cycles-pp.set_page_dirty 0.93 +0.0 0.95 perf-profile.self.cycles-pp.vmacache_find 0.36 ± 2% +0.0 0.38 perf-profile.self.cycles-pp.__might_sleep 0.47 +0.0 0.50 perf-profile.self.cycles-pp.__mod_node_page_state 0.17 +0.0 0.19 ± 2% perf-profile.self.cycles-pp.__unlock_page_memcg 2.34 +0.0 2.38 perf-profile.self.cycles-pp.unmap_page_range 0.78 ± 3% +0.1 0.85 ± 2% perf-profile.self.cycles-pp.lock_page_memcg 2.17 +0.1 2.24 perf-profile.self.cycles-pp.__do_page_fault 0.00 +0.2 0.16 ± 3% perf-profile.self.cycles-pp.__vm_normal_page 1.00 +0.2 1.24 perf-profile.self.cycles-pp.___might_sleep 0.00 +0.7 0.70 perf-profile.self.cycles-pp.pte_map_lock 0.00 +1.4 1.42 ± 2% perf-profile.self.cycles-pp.handle_pte_fault ========================================================================================= tbox_group/testcase/rootfs/kconfig/compiler/nr_task/thp_enabled/test/cpufreq_governor: lkp-skl-4sp1/will-it-scale/debian-x86_64-2018-04-03.cgz/x86_64-rhel-7.2/gcc-7/100%/never/context_switch1/performance commit: ba98a1cdad71d259a194461b3a61471b49b14df1 a7a8993bfe3ccb54ad468b9f1799649e4ad1ff12 ba98a1cdad71d259 a7a8993bfe3ccb54ad468b9f17 ---------------- -------------------------- fail:runs %reproduction fail:runs | | | :3 33% 1:3 dmesg.WARNING:at#for_ip_interrupt_entry/0x 2:3 -67% :3 kmsg.pstore:crypto_comp_decompress_failed,ret= 2:3 -67% :3 kmsg.pstore:decompression_failed %stddev %change %stddev \ | \ 224431 -1.3% 221567 will-it-scale.per_process_ops 237006 -2.2% 231907 will-it-scale.per_thread_ops 1.601e+09 ± 29% -46.9% 8.501e+08 ± 12% will-it-scale.time.involuntary_context_switches 5429 -1.6% 5344 will-it-scale.time.user_time 88596221 -1.7% 87067269 will-it-scale.workload 6863 ± 6% -9.7% 6200 boot-time.idle 144908 ± 40% -66.8% 48173 ± 93% meminfo.CmaFree 0.00 ± 70% +0.0 0.00 mpstat.cpu.iowait% 448336 ± 14% -34.8% 292125 ± 3% turbostat.C1 7684 ± 6% -9.5% 6957 uptime.idle 1.601e+09 ± 29% -46.9% 8.501e+08 ± 12% time.involuntary_context_switches 5429 -1.6% 5344 
time.user_time 44013162 -1.7% 43243125 vmstat.system.cs 207684 -1.1% 205485 vmstat.system.in 2217033 ± 15% -15.8% 1866876 ± 2% cpuidle.C1.time 451218 ± 14% -34.7% 294841 ± 2% cpuidle.C1.usage 24839 ± 10% -19.9% 19896 cpuidle.POLL.time 7656 ± 11% -38.9% 4676 ± 8% cpuidle.POLL.usage 5.48 ± 49% -67.3% 1.79 ±100% irq_exception_noise.__do_page_fault.95th 9.46 ± 21% -58.2% 3.95 ± 64% irq_exception_noise.__do_page_fault.99th 35.67 ± 8% +1394.4% 533.00 ± 96% irq_exception_noise.irq_nr 52109 ± 3% -16.0% 43784 ± 4% irq_exception_noise.softirq_time 36226 ± 40% -66.7% 12048 ± 93% proc-vmstat.nr_free_cma 25916 -1.0% 25659 proc-vmstat.nr_slab_reclaimable 16279 ± 8% +2646.1% 447053 ± 82% proc-vmstat.pgalloc_movable 2231117 -18.4% 1820828 ± 20% proc-vmstat.pgalloc_normal 1109316 ± 46% -86.9% 145207 ±109% numa-numastat.node1.local_node 1114700 ± 45% -84.5% 172877 ± 85% numa-numastat.node1.numa_hit 5523 ±140% +402.8% 27768 ± 39% numa-numastat.node1.other_node 29013 ± 29% +3048.1% 913379 ± 73% numa-numastat.node3.local_node 65032 ± 13% +1335.1% 933270 ± 70% numa-numastat.node3.numa_hit 36018 -44.8% 19897 ± 75% numa-numastat.node3.other_node 12.79 ± 21% +7739.1% 1002 ±136% sched_debug.cpu.cpu_load[1].max 1.82 ± 10% +3901.1% 72.92 ±135% sched_debug.cpu.cpu_load[1].stddev 1.71 ± 4% +5055.8% 88.08 ±137% sched_debug.cpu.cpu_load[2].stddev 12.33 ± 23% +9061.9% 1129 ±139% sched_debug.cpu.cpu_load[3].max 1.78 ± 10% +4514.8% 82.18 ±137% sched_debug.cpu.cpu_load[3].stddev 4692 ± 72% +154.5% 11945 ± 29% sched_debug.cpu.max_idle_balance_cost.stddev 23979 -8.3% 21983 slabinfo.kmalloc-96.active_objs 1358 ± 6% -17.9% 1114 ± 3% slabinfo.nsproxy.active_objs 1358 ± 6% -17.9% 1114 ± 3% slabinfo.nsproxy.num_objs 15229 +12.4% 17119 slabinfo.pde_opener.active_objs 15229 +12.4% 17119 slabinfo.pde_opener.num_objs 59541 ± 8% -10.1% 53537 ± 8% slabinfo.vm_area_struct.active_objs 59612 ± 8% -10.1% 53604 ± 8% slabinfo.vm_area_struct.num_objs 4.163e+13 -1.4% 4.105e+13 perf-stat.branch-instructions 6.537e+11 
-1.2% 6.459e+11 perf-stat.branch-misses 2.667e+10 -1.7% 2.621e+10 perf-stat.context-switches 1.21 +1.3% 1.22 perf-stat.cpi 150508 -9.8% 135825 ± 3% perf-stat.cpu-migrations 5.75 ± 33% +5.4 11.11 ± 26% perf-stat.iTLB-load-miss-rate% 3.619e+09 ± 36% +100.9% 7.272e+09 ± 30% perf-stat.iTLB-load-misses 2.089e+14 -1.3% 2.062e+14 perf-stat.instructions 64607 ± 29% -50.5% 31964 ± 37% perf-stat.instructions-per-iTLB-miss 0.83 -1.3% 0.82 perf-stat.ipc 3972 ± 4% -14.7% 3388 ± 8% numa-meminfo.node0.PageTables 207919 ± 25% -57.2% 88989 ± 74% numa-meminfo.node1.Active 207715 ± 26% -57.3% 88785 ± 74% numa-meminfo.node1.Active(anon) 356529 -34.3% 234069 ± 2% numa-meminfo.node1.FilePages 789129 ± 5% -19.8% 633161 ± 12% numa-meminfo.node1.MemUsed 34777 ± 8% -48.2% 18010 ± 30% numa-meminfo.node1.SReclaimable 69641 ± 4% -20.7% 55250 ± 12% numa-meminfo.node1.SUnreclaim 125526 ± 4% -96.3% 4602 ± 41% numa-meminfo.node1.Shmem 104419 -29.8% 73261 ± 16% numa-meminfo.node1.Slab 103661 ± 17% -72.0% 29029 ± 99% numa-meminfo.node2.Active 103661 ± 17% -72.2% 28829 ±101% numa-meminfo.node2.Active(anon) 103564 ± 18% -72.0% 29007 ±100% numa-meminfo.node2.AnonPages 671654 ± 7% -14.6% 573598 ± 4% numa-meminfo.node2.MemUsed 44206 ±127% +301.4% 177465 ± 42% numa-meminfo.node3.Active 44206 ±127% +301.0% 177263 ± 42% numa-meminfo.node3.Active(anon) 8738 +12.2% 9805 ± 8% numa-meminfo.node3.KernelStack 603605 ± 9% +27.8% 771554 ± 14% numa-meminfo.node3.MemUsed 14438 ± 6% +122.9% 32181 ± 42% numa-meminfo.node3.SReclaimable 2786 ±137% +3302.0% 94792 ± 71% numa-meminfo.node3.Shmem 71461 ± 7% +45.2% 103771 ± 29% numa-meminfo.node3.Slab 247197 ± 4% -7.8% 227843 numa-meminfo.node3.Unevictable 991.67 ± 4% -14.7% 846.00 ± 8% numa-vmstat.node0.nr_page_table_pages 51926 ± 26% -57.3% 22196 ± 74% numa-vmstat.node1.nr_active_anon 89137 -34.4% 58516 ± 2% numa-vmstat.node1.nr_file_pages 1679 ± 5% -10.8% 1498 ± 4% numa-vmstat.node1.nr_mapped 31386 ± 4% -96.3% 1150 ± 41% numa-vmstat.node1.nr_shmem 8694 ± 8% -48.2% 4502 ± 
30% numa-vmstat.node1.nr_slab_reclaimable 17410 ± 4% -20.7% 13812 ± 12% numa-vmstat.node1.nr_slab_unreclaimable 51926 ± 26% -57.3% 22196 ± 74% numa-vmstat.node1.nr_zone_active_anon 1037174 ± 24% -57.0% 446205 ± 35% numa-vmstat.node1.numa_hit 961611 ± 26% -65.8% 328687 ± 50% numa-vmstat.node1.numa_local 75563 ± 44% +55.5% 117517 ± 9% numa-vmstat.node1.numa_other 25914 ± 17% -72.2% 7206 ±101% numa-vmstat.node2.nr_active_anon 25891 ± 18% -72.0% 7251 ±100% numa-vmstat.node2.nr_anon_pages 25914 ± 17% -72.2% 7206 ±101% numa-vmstat.node2.nr_zone_active_anon 11051 ±127% +301.0% 44309 ± 42% numa-vmstat.node3.nr_active_anon 36227 ± 40% -66.7% 12049 ± 93% numa-vmstat.node3.nr_free_cma 0.33 ±141% +25000.0% 83.67 ± 81% numa-vmstat.node3.nr_inactive_file 8739 +12.2% 9806 ± 8% numa-vmstat.node3.nr_kernel_stack 696.67 ±137% +3299.7% 23684 ± 71% numa-vmstat.node3.nr_shmem 3609 ± 6% +122.9% 8044 ± 42% numa-vmstat.node3.nr_slab_reclaimable 61799 ± 4% -7.8% 56960 numa-vmstat.node3.nr_unevictable 11053 ±127% +301.4% 44361 ± 42% numa-vmstat.node3.nr_zone_active_anon 0.33 ±141% +25000.0% 83.67 ± 81% numa-vmstat.node3.nr_zone_inactive_file 61799 ± 4% -7.8% 56960 numa-vmstat.node3.nr_zone_unevictable 217951 ± 8% +280.8% 829976 ± 65% numa-vmstat.node3.numa_hit 91303 ± 19% +689.3% 720647 ± 77% numa-vmstat.node3.numa_local 126648 -13.7% 109329 ± 13% numa-vmstat.node3.numa_other 8.54 -0.1 8.40 perf-profile.calltrace.cycles-pp.dequeue_task_fair.__schedule.schedule.pipe_wait.pipe_read 5.04 -0.1 4.94 perf-profile.calltrace.cycles-pp.__switch_to.read 3.43 -0.1 3.35 perf-profile.calltrace.cycles-pp.syscall_return_via_sysret.write 2.77 -0.1 2.72 perf-profile.calltrace.cycles-pp.reweight_entity.enqueue_task_fair.ttwu_do_activate.try_to_wake_up.autoremove_wake_function 1.99 -0.0 1.94 perf-profile.calltrace.cycles-pp.copy_page_to_iter.pipe_read.__vfs_read.vfs_read.ksys_read 0.60 ± 2% -0.0 0.57 ± 2% 
perf-profile.calltrace.cycles-pp.find_next_bit.cpumask_next_wrap.select_idle_sibling.select_task_rq_fair.try_to_wake_up 0.81 -0.0 0.78 perf-profile.calltrace.cycles-pp.___perf_sw_event.__schedule.schedule.pipe_wait.pipe_read 0.78 +0.0 0.80 perf-profile.calltrace.cycles-pp.__fdget_pos.ksys_write.do_syscall_64.entry_SYSCALL_64_after_hwframe.write 0.73 +0.0 0.75 perf-profile.calltrace.cycles-pp.__fget_light.__fdget_pos.ksys_write.do_syscall_64.entry_SYSCALL_64_after_hwframe 0.92 +0.0 0.95 perf-profile.calltrace.cycles-pp.check_preempt_wakeup.check_preempt_curr.ttwu_do_wakeup.try_to_wake_up.autoremove_wake_function 2.11 +0.0 2.15 perf-profile.calltrace.cycles-pp.security_file_permission.vfs_read.ksys_read.do_syscall_64.entry_SYSCALL_64_after_hwframe 7.00 -0.1 6.86 perf-profile.children.cycles-pp.syscall_return_via_sysret 5.26 -0.1 5.14 perf-profile.children.cycles-pp.__switch_to 5.65 -0.1 5.56 perf-profile.children.cycles-pp.reweight_entity 2.17 -0.1 2.12 perf-profile.children.cycles-pp.copy_page_to_iter 2.94 -0.0 2.90 perf-profile.children.cycles-pp.update_cfs_group 3.11 -0.0 3.07 perf-profile.children.cycles-pp.pick_next_task_fair 2.59 -0.0 2.55 perf-profile.children.cycles-pp.load_new_mm_cr3 1.92 -0.0 1.88 perf-profile.children.cycles-pp._raw_spin_lock_irqsave 1.11 -0.0 1.08 ± 2% perf-profile.children.cycles-pp.find_next_bit 0.59 -0.0 0.56 perf-profile.children.cycles-pp.finish_task_switch 0.14 ± 15% -0.0 0.11 ± 16% perf-profile.children.cycles-pp.write@plt 1.21 -0.0 1.18 perf-profile.children.cycles-pp.set_next_entity 0.85 -0.0 0.82 perf-profile.children.cycles-pp.___perf_sw_event 0.13 ± 3% -0.0 0.11 ± 4% perf-profile.children.cycles-pp.timespec_trunc 0.47 ± 2% -0.0 0.45 perf-profile.children.cycles-pp.anon_pipe_buf_release 0.38 ± 2% -0.0 0.36 perf-profile.children.cycles-pp.file_update_time 0.74 -0.0 0.73 perf-profile.children.cycles-pp.copyout 0.41 ± 2% -0.0 0.39 perf-profile.children.cycles-pp.copy_user_enhanced_fast_string 0.32 -0.0 0.30 
perf-profile.children.cycles-pp.__x64_sys_read 0.14 -0.0 0.12 ± 3% perf-profile.children.cycles-pp.current_kernel_time64 0.91 +0.0 0.92 perf-profile.children.cycles-pp.touch_atime 0.40 +0.0 0.41 perf-profile.children.cycles-pp._cond_resched 0.18 ± 2% +0.0 0.20 perf-profile.children.cycles-pp.activate_task 0.05 +0.0 0.07 ± 6% perf-profile.children.cycles-pp.default_wake_function 0.24 +0.0 0.27 ± 3% perf-profile.children.cycles-pp.rcu_all_qs 0.60 ± 2% +0.0 0.64 ± 2% perf-profile.children.cycles-pp.update_min_vruntime 0.42 ± 4% +0.0 0.46 ± 4% perf-profile.children.cycles-pp.probe_sched_switch 1.33 +0.0 1.38 perf-profile.children.cycles-pp.__fget_light 0.53 ± 2% +0.1 0.58 perf-profile.children.cycles-pp.entry_SYSCALL_64_stage2 0.31 +0.1 0.36 ± 2% perf-profile.children.cycles-pp.generic_pipe_buf_confirm 4.35 +0.1 4.41 perf-profile.children.cycles-pp.switch_mm_irqs_off 2.52 +0.1 2.58 perf-profile.children.cycles-pp.selinux_file_permission 0.00 +0.1 0.07 ± 11% perf-profile.children.cycles-pp.hrtick_update 7.00 -0.1 6.86 perf-profile.self.cycles-pp.syscall_return_via_sysret 5.26 -0.1 5.14 perf-profile.self.cycles-pp.__switch_to 0.29 -0.1 0.19 ± 2% perf-profile.self.cycles-pp.ksys_read 1.49 -0.1 1.43 perf-profile.self.cycles-pp.dequeue_task_fair 2.41 -0.1 2.35 perf-profile.self.cycles-pp.__schedule 1.46 -0.0 1.41 perf-profile.self.cycles-pp.select_task_rq_fair 2.94 -0.0 2.90 perf-profile.self.cycles-pp.update_cfs_group 0.44 -0.0 0.40 perf-profile.self.cycles-pp.dequeue_entity 0.48 -0.0 0.44 perf-profile.self.cycles-pp.finish_task_switch 2.59 -0.0 2.55 perf-profile.self.cycles-pp.load_new_mm_cr3 1.11 -0.0 1.08 ± 2% perf-profile.self.cycles-pp.find_next_bit 1.91 -0.0 1.88 perf-profile.self.cycles-pp._raw_spin_lock_irqsave 0.78 -0.0 0.75 perf-profile.self.cycles-pp.___perf_sw_event 0.14 ± 15% -0.0 0.11 ± 16% perf-profile.self.cycles-pp.write@plt 0.37 -0.0 0.35 ± 2% perf-profile.self.cycles-pp.__wake_up_common_lock 0.20 ± 2% -0.0 0.17 ± 2% 
perf-profile.self.cycles-pp.__fdget_pos
      0.47 ±  2%      -0.0       0.44        perf-profile.self.cycles-pp.anon_pipe_buf_release
      0.87            -0.0       0.85        perf-profile.self.cycles-pp.copy_user_generic_unrolled
      0.13 ±  3%      -0.0       0.11 ±  4%  perf-profile.self.cycles-pp.timespec_trunc
      0.41 ±  2%      -0.0       0.39        perf-profile.self.cycles-pp.copy_user_enhanced_fast_string
      0.38            -0.0       0.36        perf-profile.self.cycles-pp.__wake_up_common
      0.32            -0.0       0.30        perf-profile.self.cycles-pp.__x64_sys_read
      0.14 ±  3%      -0.0       0.12 ±  3%  perf-profile.self.cycles-pp.current_kernel_time64
      0.30            -0.0       0.28        perf-profile.self.cycles-pp.set_next_entity
      0.28 ±  3%      +0.0       0.30        perf-profile.self.cycles-pp._cond_resched
      0.18 ±  2%      +0.0       0.20        perf-profile.self.cycles-pp.activate_task
      0.17 ±  2%      +0.0       0.19        perf-profile.self.cycles-pp.__might_fault
      0.05            +0.0       0.07 ±  6%  perf-profile.self.cycles-pp.default_wake_function
      0.17 ±  2%      +0.0       0.20        perf-profile.self.cycles-pp.ttwu_do_activate
      0.66            +0.0       0.69        perf-profile.self.cycles-pp.write
      0.24            +0.0       0.27 ±  3%  perf-profile.self.cycles-pp.rcu_all_qs
      0.67            +0.0       0.70        perf-profile.self.cycles-pp.entry_SYSCALL_64_after_hwframe
      0.60 ±  2%      +0.0       0.64 ±  2%  perf-profile.self.cycles-pp.update_min_vruntime
      0.42 ±  4%      +0.0       0.46 ±  4%  perf-profile.self.cycles-pp.probe_sched_switch
      1.33            +0.0       1.37        perf-profile.self.cycles-pp.__fget_light
      1.61            +0.0       1.66        perf-profile.self.cycles-pp.pipe_read
      0.53 ±  2%      +0.1       0.58        perf-profile.self.cycles-pp.entry_SYSCALL_64_stage2
      0.31            +0.1       0.36 ±  2%  perf-profile.self.cycles-pp.generic_pipe_buf_confirm
      1.04            +0.1       1.11        perf-profile.self.cycles-pp.pipe_write
      0.00            +0.1       0.07 ± 11%  perf-profile.self.cycles-pp.hrtick_update
      2.00            +0.1       2.08        perf-profile.self.cycles-pp.switch_mm_irqs_off

=========================================================================================
tbox_group/testcase/rootfs/kconfig/compiler/nr_task/thp_enabled/test/cpufreq_governor:
  lkp-skl-4sp1/will-it-scale/debian-x86_64-2018-04-03.cgz/x86_64-rhel-7.2/gcc-7/100%/never/page_fault3/performance

commit:
  ba98a1cdad71d259a194461b3a61471b49b14df1
  a7a8993bfe3ccb54ad468b9f1799649e4ad1ff12

ba98a1cdad71d259 a7a8993bfe3ccb54ad468b9f17
---------------- --------------------------
       fail:runs  %reproduction    fail:runs
           |             |             |
         1:3          -33%            :3     dmesg.WARNING:stack_going_in_the_wrong_direction?ip=file_update_time/0x
          :3           33%           1:3     stderr.mount.nfs:Connection_timed_out
        34:3         -401%          22:3     perf-profile.calltrace.cycles-pp.error_entry.testcase
        17:3         -207%          11:3     perf-profile.calltrace.cycles-pp.sync_regs.error_entry.testcase
        34:3         -404%          22:3     perf-profile.children.cycles-pp.error_entry
         0:3           -2%           0:3     perf-profile.children.cycles-pp.error_exit
        16:3         -196%          11:3     perf-profile.self.cycles-pp.error_entry
         0:3           -2%           0:3     perf-profile.self.cycles-pp.error_exit
         %stddev     %change         %stddev
             \          |                \
    467454            -1.8%     459251        will-it-scale.per_process_ops
     10856 ±  4%     -23.1%       8344 ±  7%  will-it-scale.per_thread_ops
    118134 ±  2%     +11.7%     131943        will-it-scale.time.involuntary_context_switches
 6.277e+08 ±  4%     -23.1%  4.827e+08 ±  7%  will-it-scale.time.minor_page_faults
      7406            +5.8%       7839        will-it-scale.time.percent_of_cpu_this_job_got
     44526            +5.8%      47106        will-it-scale.time.system_time
   7351468 ±  5%     -18.3%    6009014 ±  7%  will-it-scale.time.voluntary_context_switches
  91835846            -2.2%   89778599        will-it-scale.workload
   2534640            +4.3%    2643005 ±  2%  interrupts.CAL:Function_call_interrupts
      2819 ±  5%     +22.9%       3464 ± 18%  kthread_noise.total_time
     30273 ±  4%     -12.7%      26415 ±  5%  vmstat.system.cs
      1.52 ±  2%     +15.2%       1.75 ±  2%  irq_exception_noise.__do_page_fault.99th
    296.67 ± 12%     -36.7%     187.67 ± 12%  irq_exception_noise.softirq_time
    230900 ±  3%     +30.3%     300925 ±  3%  meminfo.Inactive
    230184 ±  3%     +30.4%     300180 ±  3%  meminfo.Inactive(anon)
     11.62 ±  3%      -2.2        9.40 ±  5%  mpstat.cpu.idle%
      0.00 ± 14%      -0.0        0.00 ±  4%  mpstat.cpu.iowait%
   7992174           -11.1%    7101976 ±  3%  softirqs.RCU
   4973624 ±  2%     -12.9%    4333370 ±  2%  softirqs.SCHED
    118134 ±  2%     +11.7%     131943        time.involuntary_context_switches
 6.277e+08 ±  4%     -23.1%  4.827e+08 ±  7%  time.minor_page_faults
      7406            +5.8%       7839        time.percent_of_cpu_this_job_got
     44526            +5.8%      47106        time.system_time
   7351468 ±  5%     -18.3%
6009014 ± 7% time.voluntary_context_switches 2.702e+09 ± 5% -16.7% 2.251e+09 ± 7% cpuidle.C1E.time 6834329 ± 5% -15.8% 5756243 ± 7% cpuidle.C1E.usage 1.046e+10 ± 3% -19.8% 8.389e+09 ± 4% cpuidle.C6.time 13961845 ± 3% -19.3% 11265555 ± 4% cpuidle.C6.usage 1309307 ± 7% -14.8% 1116168 ± 8% cpuidle.POLL.time 19774 ± 6% -13.7% 17063 ± 7% cpuidle.POLL.usage 2523 ± 4% -11.1% 2243 ± 4% slabinfo.biovec-64.active_objs 2523 ± 4% -11.1% 2243 ± 4% slabinfo.biovec-64.num_objs 2610 ± 8% -33.7% 1731 ± 22% slabinfo.dmaengine-unmap-16.active_objs 2610 ± 8% -33.7% 1731 ± 22% slabinfo.dmaengine-unmap-16.num_objs 5118 ± 17% -22.6% 3962 ± 9% slabinfo.eventpoll_pwq.active_objs 5118 ± 17% -22.6% 3962 ± 9% slabinfo.eventpoll_pwq.num_objs 4583 ± 3% -14.0% 3941 ± 4% slabinfo.sock_inode_cache.active_objs 4583 ± 3% -14.0% 3941 ± 4% slabinfo.sock_inode_cache.num_objs 1933 +2.6% 1984 turbostat.Avg_MHz 6832021 ± 5% -15.8% 5754156 ± 7% turbostat.C1E 2.32 ± 5% -0.4 1.94 ± 7% turbostat.C1E% 13954211 ± 3% -19.3% 11259436 ± 4% turbostat.C6 8.97 ± 3% -1.8 7.20 ± 4% turbostat.C6% 6.18 ± 4% -17.1% 5.13 ± 5% turbostat.CPU%c1 5.12 ± 3% -21.7% 4.01 ± 4% turbostat.CPU%c6 1.76 ± 2% -34.7% 1.15 ± 2% turbostat.Pkg%pc2 57314 ± 4% +30.4% 74717 ± 4% proc-vmstat.nr_inactive_anon 57319 ± 4% +30.4% 74719 ± 4% proc-vmstat.nr_zone_inactive_anon 24415 ± 19% -62.2% 9236 ± 7% proc-vmstat.numa_hint_faults 69661453 -1.8% 68405712 proc-vmstat.numa_hit 69553390 -1.8% 68297790 proc-vmstat.numa_local 8792 ± 29% -92.6% 654.33 ± 23% proc-vmstat.numa_pages_migrated 40251 ± 32% -76.5% 9474 ± 3% proc-vmstat.numa_pte_updates 69522532 -1.6% 68383074 proc-vmstat.pgalloc_normal 2.762e+10 -2.2% 2.701e+10 proc-vmstat.pgfault 68825100 -1.5% 67772256 proc-vmstat.pgfree 8792 ± 29% -92.6% 654.33 ± 23% proc-vmstat.pgmigrate_success 57992 ± 6% +56.2% 90591 ± 3% numa-meminfo.node0.Inactive 57916 ± 6% +56.3% 90513 ± 3% numa-meminfo.node0.Inactive(anon) 37285 ± 12% +36.0% 50709 ± 5% numa-meminfo.node0.SReclaimable 110971 ± 8% +22.7% 136209 ± 8% 
numa-meminfo.node0.Slab 23601 ± 55% +559.5% 155651 ± 36% numa-meminfo.node1.AnonPages 62484 ± 12% +17.5% 73417 ± 3% numa-meminfo.node1.Inactive 62323 ± 12% +17.2% 73023 ± 4% numa-meminfo.node1.Inactive(anon) 109714 ± 63% -85.6% 15832 ± 96% numa-meminfo.node2.AnonPages 52236 ± 13% +22.7% 64074 ± 3% numa-meminfo.node2.Inactive 51922 ± 12% +23.2% 63963 ± 3% numa-meminfo.node2.Inactive(anon) 60241 ± 11% +21.9% 73442 ± 8% numa-meminfo.node3.Inactive 60077 ± 12% +22.0% 73279 ± 8% numa-meminfo.node3.Inactive(anon) 14093 ± 6% +55.9% 21977 ± 3% numa-vmstat.node0.nr_inactive_anon 9321 ± 12% +36.0% 12675 ± 5% numa-vmstat.node0.nr_slab_reclaimable 14090 ± 6% +56.0% 21977 ± 3% numa-vmstat.node0.nr_zone_inactive_anon 5900 ± 55% +559.4% 38909 ± 36% numa-vmstat.node1.nr_anon_pages 15413 ± 12% +14.8% 17688 ± 4% numa-vmstat.node1.nr_inactive_anon 15413 ± 12% +14.8% 17688 ± 4% numa-vmstat.node1.nr_zone_inactive_anon 27430 ± 63% -85.6% 3960 ± 96% numa-vmstat.node2.nr_anon_pages 12928 ± 12% +20.0% 15508 ± 3% numa-vmstat.node2.nr_inactive_anon 12927 ± 12% +20.0% 15507 ± 3% numa-vmstat.node2.nr_zone_inactive_anon 6229 ± 10% +117.5% 13547 ± 44% numa-vmstat.node3 14669 ± 11% +19.6% 17537 ± 7% numa-vmstat.node3.nr_inactive_anon 14674 ± 11% +19.5% 17541 ± 7% numa-vmstat.node3.nr_zone_inactive_anon 24617 ±141% -100.0% 0.00 latency_stats.avg.io_schedule.nfs_lock_and_join_requests.nfs_updatepage.nfs_write_end.generic_perform_write.nfs_file_write.__vfs_write.vfs_write.ksys_write.do_syscall_64.entry_SYSCALL_64_after_hwframe 5049 ±105% -99.4% 28.33 ± 82% latency_stats.avg.call_rwsem_down_write_failed.path_openat.do_filp_open.do_sys_open.do_syscall_64.entry_SYSCALL_64_after_hwframe 152457 ± 27% +233.6% 508656 ± 92% latency_stats.avg.max 0.00 +3.9e+107% 390767 ±141% latency_stats.avg.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_proc_getattr.__nfs_revalidate_inode.nfs_do_access.nfs_permission.inode_permission.link_path_walk.path_openat 24617 ±141% -100.0% 0.00 
latency_stats.max.io_schedule.nfs_lock_and_join_requests.nfs_updatepage.nfs_write_end.generic_perform_write.nfs_file_write.__vfs_write.vfs_write.ksys_write.do_syscall_64.entry_SYSCALL_64_after_hwframe 4240 ±141% -100.0% 0.00 latency_stats.max.call_rwsem_down_write_failed.do_unlinkat.do_syscall_64.entry_SYSCALL_64_after_hwframe 8565 ± 70% -99.1% 80.33 ±115% latency_stats.max.call_rwsem_down_write_failed.path_openat.do_filp_open.do_sys_open.do_syscall_64.entry_SYSCALL_64_after_hwframe 204835 ± 6% +457.6% 1142244 ±114% latency_stats.max.max 0.00 +5.1e+105% 5057 ±141% latency_stats.max.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_proc_access.nfs_do_access.nfs_permission.inode_permission.link_path_walk.path_openat.do_filp_open 0.00 +1e+108% 995083 ±141% latency_stats.max.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_proc_getattr.__nfs_revalidate_inode.nfs_do_access.nfs_permission.inode_permission.link_path_walk.path_openat 13175 ± 4% -100.0% 0.00 latency_stats.sum.io_schedule.__lock_page_or_retry.filemap_fault.__do_fault.__handle_mm_fault.handle_mm_fault.__do_page_fault.do_page_fault.page_fault 24617 ±141% -100.0% 0.00 latency_stats.sum.io_schedule.nfs_lock_and_join_requests.nfs_updatepage.nfs_write_end.generic_perform_write.nfs_file_write.__vfs_write.vfs_write.ksys_write.do_syscall_64.entry_SYSCALL_64_after_hwframe 4260 ±141% -100.0% 0.00 latency_stats.sum.call_rwsem_down_write_failed.do_unlinkat.do_syscall_64.entry_SYSCALL_64_after_hwframe 8640 ± 70% -97.5% 216.33 ±108% latency_stats.sum.call_rwsem_down_write_failed.path_openat.do_filp_open.do_sys_open.do_syscall_64.entry_SYSCALL_64_after_hwframe 6673 ± 89% -92.8% 477.67 ± 74% latency_stats.sum.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_proc_getattr.__nfs_revalidate_inode.nfs_do_access.nfs_permission.inode_permission.link_path_walk.path_lookupat 0.00 +4.2e+105% 4228 ±130% 
latency_stats.sum.io_schedule.__lock_page_killable.__lock_page_or_retry.filemap_fault.__do_fault.handle_pte_fault.__handle_mm_fault.handle_mm_fault.__do_page_fault.do_page_fault.page_fault 0.00 +7.5e+105% 7450 ± 98% latency_stats.sum.io_schedule.__lock_page_or_retry.filemap_fault.__do_fault.handle_pte_fault.__handle_mm_fault.handle_mm_fault.__do_page_fault.do_page_fault.page_fault 0.00 +1.3e+106% 13050 ±141% latency_stats.sum.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_proc_access.nfs_do_access.nfs_permission.inode_permission.link_path_walk.path_openat.do_filp_open 0.00 +1.5e+110% 1.508e+08 ±141% latency_stats.sum.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_proc_getattr.__nfs_revalidate_inode.nfs_do_access.nfs_permission.inode_permission.link_path_walk.path_openat 0.97 -0.0 0.94 perf-stat.branch-miss-rate% 1.329e+11 -2.6% 1.294e+11 perf-stat.branch-misses 2.254e+11 -1.9% 2.21e+11 perf-stat.cache-references 18308779 ± 4% -12.8% 15969618 ± 5% perf-stat.context-switches 3.20 +1.8% 3.26 perf-stat.cpi 2.233e+14 +2.7% 2.293e+14 perf-stat.cpu-cycles 4.01 -0.2 3.83 perf-stat.dTLB-store-miss-rate% 4.51e+11 -2.2% 4.41e+11 perf-stat.dTLB-store-misses 1.08e+13 +2.6% 1.109e+13 perf-stat.dTLB-stores 3.158e+10 ± 5% +16.8% 3.689e+10 ± 2% perf-stat.iTLB-load-misses 2214 ± 5% -13.8% 1907 ± 2% perf-stat.instructions-per-iTLB-miss 0.31 -1.8% 0.31 perf-stat.ipc 2.762e+10 -2.2% 2.701e+10 perf-stat.minor-faults 1.535e+10 -11.2% 1.362e+10 perf-stat.node-loads 9.75 +1.1 10.89 perf-stat.node-store-miss-rate% 3.012e+09 +10.6% 3.332e+09 ± 2% perf-stat.node-store-misses 2.787e+10 -2.2% 2.725e+10 perf-stat.node-stores 2.762e+10 -2.2% 2.701e+10 perf-stat.page-faults 759458 +3.2% 783404 perf-stat.path-length 246.39 ± 15% -20.4% 196.12 ± 6% sched_debug.cfs_rq:/.load_avg.max 0.21 ± 3% +9.0% 0.23 ± 4% sched_debug.cfs_rq:/.nr_running.stddev 16.64 ± 27% +61.0% 26.79 ± 17% sched_debug.cfs_rq:/.nr_spread_over.max 75.15 
-14.4% 64.30 ± 4% sched_debug.cfs_rq:/.util_avg.stddev 178.80 ± 3% +25.4% 224.12 ± 7% sched_debug.cfs_rq:/.util_est_enqueued.avg 1075 ± 5% -12.3% 943.36 ± 2% sched_debug.cfs_rq:/.util_est_enqueued.max 2093630 ± 27% -36.1% 1337941 ± 16% sched_debug.cpu.avg_idle.max 297057 ± 11% +37.8% 409294 ± 14% sched_debug.cpu.avg_idle.min 293240 ± 55% -62.3% 110571 ± 13% sched_debug.cpu.avg_idle.stddev 770075 ± 9% -19.3% 621136 ± 12% sched_debug.cpu.max_idle_balance_cost.max 48919 ± 46% -66.9% 16190 ± 81% sched_debug.cpu.max_idle_balance_cost.stddev 21716 ± 5% -16.8% 18061 ± 7% sched_debug.cpu.nr_switches.min 21519 ± 5% -17.7% 17700 ± 7% sched_debug.cpu.sched_count.min 10586 ± 5% -18.1% 8669 ± 7% sched_debug.cpu.sched_goidle.avg 14183 ± 3% -17.6% 11693 ± 5% sched_debug.cpu.sched_goidle.max 10322 ± 5% -18.6% 8407 ± 7% sched_debug.cpu.sched_goidle.min 400.99 ± 8% -13.0% 348.75 ± 3% sched_debug.cpu.sched_goidle.stddev 5459 ± 8% +10.0% 6006 ± 3% sched_debug.cpu.ttwu_local.avg 8.47 ± 42% +345.8% 37.73 ± 77% sched_debug.rt_rq:/.rt_time.max 0.61 ± 42% +343.0% 2.72 ± 77% sched_debug.rt_rq:/.rt_time.stddev 91.98 -30.9 61.11 ± 70% perf-profile.calltrace.cycles-pp.testcase 9.05 -9.1 0.00 perf-profile.calltrace.cycles-pp.__do_fault.__handle_mm_fault.handle_mm_fault.__do_page_fault.do_page_fault 8.91 -8.9 0.00 perf-profile.calltrace.cycles-pp.shmem_fault.__do_fault.__handle_mm_fault.handle_mm_fault.__do_page_fault 8.06 -8.1 0.00 perf-profile.calltrace.cycles-pp.shmem_getpage_gfp.shmem_fault.__do_fault.__handle_mm_fault.handle_mm_fault 7.59 -7.6 0.00 perf-profile.calltrace.cycles-pp.find_lock_entry.shmem_getpage_gfp.shmem_fault.__do_fault.__handle_mm_fault 7.44 -7.4 0.00 perf-profile.calltrace.cycles-pp.finish_fault.__handle_mm_fault.handle_mm_fault.__do_page_fault.do_page_fault 7.28 -7.3 0.00 perf-profile.calltrace.cycles-pp.alloc_set_pte.finish_fault.__handle_mm_fault.handle_mm_fault.__do_page_fault 5.31 -5.3 0.00 
perf-profile.calltrace.cycles-pp.page_add_file_rmap.alloc_set_pte.finish_fault.__handle_mm_fault.handle_mm_fault 8.08 -2.8 5.30 ± 70% perf-profile.calltrace.cycles-pp.native_irq_return_iret.testcase 5.95 -2.1 3.83 ± 70% perf-profile.calltrace.cycles-pp.find_get_entry.find_lock_entry.shmem_getpage_gfp.shmem_fault.__do_fault 5.95 -2.0 3.93 ± 70% perf-profile.calltrace.cycles-pp.swapgs_restore_regs_and_return_to_usermode.testcase 3.10 -1.1 2.01 ± 70% perf-profile.calltrace.cycles-pp.__perf_sw_event.__do_page_fault.do_page_fault.page_fault.testcase 2.36 -0.8 1.55 ± 70% perf-profile.calltrace.cycles-pp.___perf_sw_event.__perf_sw_event.__do_page_fault.do_page_fault.page_fault 1.08 -0.4 0.70 ± 70% perf-profile.calltrace.cycles-pp.do_page_fault.testcase 0.82 -0.3 0.54 ± 70% perf-profile.calltrace.cycles-pp.trace_graph_entry.do_page_fault.testcase 0.77 -0.3 0.50 ± 70% perf-profile.calltrace.cycles-pp.ftrace_graph_caller.__do_page_fault.do_page_fault.page_fault.testcase 0.59 -0.2 0.37 ± 70% perf-profile.calltrace.cycles-pp.down_read_trylock.__do_page_fault.do_page_fault.page_fault.testcase 91.98 -30.9 61.11 ± 70% perf-profile.children.cycles-pp.testcase 9.14 -3.2 5.99 ± 70% perf-profile.children.cycles-pp.__do_fault 8.20 -2.8 5.40 ± 70% perf-profile.children.cycles-pp.shmem_getpage_gfp 8.08 -2.8 5.31 ± 70% perf-profile.children.cycles-pp.native_irq_return_iret 6.08 -2.2 3.92 ± 70% perf-profile.children.cycles-pp.find_get_entry 6.08 -2.1 3.96 ± 70% perf-profile.children.cycles-pp.sync_regs 5.95 -2.0 3.93 ± 70% perf-profile.children.cycles-pp.swapgs_restore_regs_and_return_to_usermode 4.12 -1.4 2.73 ± 70% perf-profile.children.cycles-pp.ftrace_graph_caller 3.65 -1.2 2.42 ± 70% perf-profile.children.cycles-pp.prepare_ftrace_return 3.18 -1.1 2.07 ± 70% perf-profile.children.cycles-pp.__perf_sw_event 2.34 -0.8 1.52 ± 70% perf-profile.children.cycles-pp.fault_dirty_shared_page 0.80 -0.3 0.50 ± 70% perf-profile.children.cycles-pp._raw_spin_lock 0.76 -0.3 0.50 ± 70% 
perf-profile.children.cycles-pp.tlb_flush_mmu_free 0.61 -0.2 0.39 ± 70% perf-profile.children.cycles-pp.down_read_trylock 0.48 ± 2% -0.2 0.28 ± 70% perf-profile.children.cycles-pp.pmd_devmap_trans_unstable 0.26 ± 6% -0.1 0.15 ± 71% perf-profile.children.cycles-pp.ktime_get 0.20 ± 2% -0.1 0.12 ± 70% perf-profile.children.cycles-pp.perf_exclude_event 0.22 ± 2% -0.1 0.13 ± 70% perf-profile.children.cycles-pp._cond_resched 0.17 -0.1 0.11 ± 70% perf-profile.children.cycles-pp.page_rmapping 0.13 -0.1 0.07 ± 70% perf-profile.children.cycles-pp.rcu_all_qs 0.07 -0.0 0.04 ± 70% perf-profile.children.cycles-pp.ftrace_lookup_ip 22.36 -7.8 14.59 ± 70% perf-profile.self.cycles-pp.testcase 8.08 -2.8 5.31 ± 70% perf-profile.self.cycles-pp.native_irq_return_iret 6.08 -2.1 3.96 ± 70% perf-profile.self.cycles-pp.sync_regs 5.81 -2.0 3.84 ± 70% perf-profile.self.cycles-pp.swapgs_restore_regs_and_return_to_usermode 3.27 -1.6 1.65 ± 70% perf-profile.self.cycles-pp.__handle_mm_fault 3.79 -1.4 2.36 ± 70% perf-profile.self.cycles-pp.find_get_entry 3.80 -1.3 2.53 ± 70% perf-profile.self.cycles-pp.trace_graph_entry 1.10 -0.5 0.57 ± 70% perf-profile.self.cycles-pp.alloc_set_pte 1.24 -0.4 0.81 ± 70% perf-profile.self.cycles-pp.shmem_fault 0.80 -0.3 0.50 ± 70% perf-profile.self.cycles-pp._raw_spin_lock 0.81 -0.3 0.51 ± 70% perf-profile.self.cycles-pp.find_lock_entry 0.80 ± 2% -0.3 0.51 ± 70% perf-profile.self.cycles-pp.__perf_sw_event 0.61 -0.2 0.38 ± 70% perf-profile.self.cycles-pp.down_read_trylock 0.60 -0.2 0.39 ± 70% perf-profile.self.cycles-pp.shmem_getpage_gfp 0.48 -0.2 0.27 ± 70% perf-profile.self.cycles-pp.pmd_devmap_trans_unstable 0.47 -0.2 0.30 ± 70% perf-profile.self.cycles-pp.file_update_time 0.34 -0.1 0.22 ± 70% perf-profile.self.cycles-pp.do_page_fault 0.22 ± 4% -0.1 0.11 ± 70% perf-profile.self.cycles-pp.__do_fault 0.25 ± 5% -0.1 0.14 ± 71% perf-profile.self.cycles-pp.ktime_get 0.21 ± 2% -0.1 0.12 ± 70% perf-profile.self.cycles-pp.finish_fault 0.23 ± 2% -0.1 0.14 ± 70% 
perf-profile.self.cycles-pp.fault_dirty_shared_page
      0.22 ±  2%      -0.1       0.14 ± 70%  perf-profile.self.cycles-pp.prepare_exit_to_usermode
      0.20 ±  2%      -0.1       0.12 ± 70%  perf-profile.self.cycles-pp.perf_exclude_event
      0.16            -0.1       0.10 ± 70%  perf-profile.self.cycles-pp._cond_resched
      0.13            -0.1       0.07 ± 70%  perf-profile.self.cycles-pp.rcu_all_qs
      0.07            -0.0       0.04 ± 70%  perf-profile.self.cycles-pp.ftrace_lookup_ip

=========================================================================================
tbox_group/testcase/rootfs/kconfig/compiler/nr_task/thp_enabled/test/cpufreq_governor:
  lkp-skl-4sp1/will-it-scale/debian-x86_64-2018-04-03.cgz/x86_64-rhel-7.2/gcc-7/100%/always/context_switch1/performance

commit:
  ba98a1cdad71d259a194461b3a61471b49b14df1
  a7a8993bfe3ccb54ad468b9f1799649e4ad1ff12

ba98a1cdad71d259 a7a8993bfe3ccb54ad468b9f17
---------------- --------------------------
       fail:runs  %reproduction    fail:runs
           |             |             |
          :3           33%           1:3     dmesg.WARNING:at#for_ip_interrupt_entry/0x
          :3           33%           1:3     dmesg.WARNING:at#for_ip_ret_from_intr/0x
          :3           67%           2:3     kmsg.pstore:crypto_comp_decompress_failed,ret=
          :3           67%           2:3     kmsg.pstore:decompression_failed
         %stddev     %change         %stddev
             \          |                \
    223910            -1.3%     220930        will-it-scale.per_process_ops
    233722            -1.0%     231288        will-it-scale.per_thread_ops
 6.001e+08 ± 13%     +31.4%  7.887e+08 ±  4%  will-it-scale.time.involuntary_context_switches
     18003 ±  4%     +10.9%      19956        will-it-scale.time.minor_page_faults
  1.29e+10            -2.5%  1.258e+10        will-it-scale.time.voluntary_context_switches
  87865617            -1.2%   86826277        will-it-scale.workload
   2880329 ±  2%      +5.4%    3034904        interrupts.CAL:Function_call_interrupts
   7695018           -23.3%    5905066 ±  8%  meminfo.DirectMap2M
      0.00 ± 39%      -0.0        0.00 ± 78%  mpstat.cpu.iowait%
      4621 ± 12%     +13.4%       5241        proc-vmstat.numa_hint_faults_local
    715714           +27.6%     913142 ± 13%  softirqs.SCHED
    515653 ±  6%     -20.0%     412650 ± 15%  turbostat.C1
  43643516            -1.2%   43127031        vmstat.system.cs
   2893393 ±  4%     -23.6%    2210524 ± 10%  cpuidle.C1.time
    518051 ±  6%     -19.9%     415081 ± 15%  cpuidle.C1.usage
     23.10           +22.9%      28.38 ±  9%  boot-time.boot
     18.38
                 +23.2%      22.64 ± 12%  boot-time.dhcp
      5216        +5.0%       5478 ±  2%  boot-time.idle
    963.76 ± 44%  +109.7%     2021 ± 34%  irq_exception_noise.__do_page_fault.sum
      6.33 ± 14%  +726.3%    52.33 ± 62%  irq_exception_noise.irq_time
     56524 ±  7%   -18.8%    45915 ±  4%  irq_exception_noise.softirq_time
 6.001e+08 ± 13%   +31.4%  7.887e+08 ±  4%  time.involuntary_context_switches
     18003 ±  4%   +10.9%    19956        time.minor_page_faults
  1.29e+10         -2.5%  1.258e+10       time.voluntary_context_switches
      1386 ±  7%   +15.4%     1600 ± 11%  slabinfo.scsi_sense_cache.active_objs
      1386 ±  7%   +15.4%     1600 ± 11%  slabinfo.scsi_sense_cache.num_objs
      1427 ±  5%    -8.9%     1299 ±  2%  slabinfo.task_group.active_objs
      1427 ±  5%    -8.9%     1299 ±  2%  slabinfo.task_group.num_objs
     65519 ± 12%   +20.6%    79014 ± 16%  numa-meminfo.node0.SUnreclaim
      8484        -11.9%      7475 ±  7%  numa-meminfo.node1.KernelStack
      9264 ± 26%   -33.7%     6146 ±  7%  numa-meminfo.node1.Mapped
      2138 ± 61%  +373.5%    10127 ± 92%  numa-meminfo.node3.Inactive
      2059 ± 61%  +387.8%    10046 ± 93%  numa-meminfo.node3.Inactive(anon)
     16379 ± 12%   +20.6%    19752 ± 16%  numa-vmstat.node0.nr_slab_unreclaimable
      8483        -11.9%      7474 ±  7%  numa-vmstat.node1.nr_kernel_stack
      6250 ± 29%   -42.8%     3575 ± 24%  numa-vmstat.node2
      3798 ± 17%   +63.7%     6218 ±  5%  numa-vmstat.node3
    543.00 ± 61%  +368.1%     2541 ± 91%  numa-vmstat.node3.nr_inactive_anon
    543.33 ± 61%  +367.8%     2541 ± 91%  numa-vmstat.node3.nr_zone_inactive_anon
 4.138e+13         -1.1%   4.09e+13       perf-stat.branch-instructions
 6.569e+11         -2.0%  6.441e+11       perf-stat.branch-misses
 2.645e+10         -1.2%  2.613e+10       perf-stat.context-switches
      1.21         +1.2%       1.23       perf-stat.cpi
    153343 ±  2%   -12.1%    134776       perf-stat.cpu-migrations
 5.966e+13         -1.3%  5.889e+13       perf-stat.dTLB-loads
 3.736e+13         -1.2%   3.69e+13       perf-stat.dTLB-stores
      5.85 ± 15%     +8.8      14.67 ±  9%  perf-stat.iTLB-load-miss-rate%
 3.736e+09 ± 17%  +161.3%   9.76e+09 ± 11%  perf-stat.iTLB-load-misses
 5.987e+10         -5.4%  5.667e+10       perf-stat.iTLB-loads
 2.079e+14         -1.2%  2.054e+14       perf-stat.instructions
     57547 ± 18%   -62.9%    21340 ± 11%  perf-stat.instructions-per-iTLB-miss
      0.82         -1.2%       0.81       perf-stat.ipc
  27502531 ±  8%    +9.5%   30122136 ±  3%  perf-stat.node-store-misses
      1449 ± 27%   -34.6%    948.85       sched_debug.cfs_rq:/.load.min
    319416 ±115%  -188.5%   -282549       sched_debug.cfs_rq:/.spread0.avg
    657044 ± 55%   -88.3%    76887 ± 23%  sched_debug.cfs_rq:/.spread0.max
  -1525243         +54.6%  -2357898       sched_debug.cfs_rq:/.spread0.min
    101614 ±  6%   +30.6%   132713 ± 19%  sched_debug.cpu.avg_idle.stddev
     11.54 ± 41%   -61.2%       4.48      sched_debug.cpu.cpu_load[1].avg
      1369 ± 67%   -98.5%      20.67 ± 48%  sched_debug.cpu.cpu_load[1].max
     99.29 ± 67%   -97.6%       2.35 ± 26%  sched_debug.cpu.cpu_load[1].stddev
      9.58 ± 38%   -55.2%       4.29      sched_debug.cpu.cpu_load[2].avg
      1024 ± 68%   -98.5%      15.27 ± 36%  sched_debug.cpu.cpu_load[2].max
     74.51 ± 67%   -97.3%       1.99 ± 15%  sched_debug.cpu.cpu_load[2].stddev
      7.37 ± 29%   -42.0%       4.28      sched_debug.cpu.cpu_load[3].avg
    600.58 ± 68%   -97.9%      12.48 ± 20%  sched_debug.cpu.cpu_load[3].max
     43.98 ± 66%   -95.8%       1.83 ±  5%  sched_debug.cpu.cpu_load[3].stddev
      5.95 ± 19%   -28.1%       4.28      sched_debug.cpu.cpu_load[4].avg
    325.39 ± 67%   -96.4%      11.67 ± 10%  sched_debug.cpu.cpu_load[4].max
     24.19 ± 65%   -92.5%       1.81 ±  3%  sched_debug.cpu.cpu_load[4].stddev
    907.23 ±  4%   -14.1%     779.70 ± 10%  sched_debug.cpu.nr_load_updates.stddev
      0.00 ± 83%  +122.5%       0.00      sched_debug.rt_rq:/.rt_time.min
      8.49 ±  2%     -0.3       8.21 ±  2%  perf-profile.calltrace.cycles-pp.dequeue_task_fair.__schedule.schedule.pipe_wait.pipe_read
     57.28           -0.3      57.01      perf-profile.calltrace.cycles-pp.read
      5.06           -0.2       4.85      perf-profile.calltrace.cycles-pp.select_task_rq_fair.try_to_wake_up.autoremove_wake_function.__wake_up_common.__wake_up_common_lock
      4.98           -0.2       4.78      perf-profile.calltrace.cycles-pp.__switch_to.read
      3.55           -0.2       3.39 ±  2%  perf-profile.calltrace.cycles-pp.syscall_return_via_sysret.read
      2.72           -0.1       2.60      perf-profile.calltrace.cycles-pp.reweight_entity.enqueue_task_fair.ttwu_do_activate.try_to_wake_up.autoremove_wake_function
      2.67           -0.1       2.57 ±  2%  perf-profile.calltrace.cycles-pp.reweight_entity.dequeue_task_fair.__schedule.schedule.pipe_wait
      3.40           -0.1       3.31      perf-profile.calltrace.cycles-pp.syscall_return_via_sysret.write
      3.77           -0.1       3.68      perf-profile.calltrace.cycles-pp.select_idle_sibling.select_task_rq_fair.try_to_wake_up.autoremove_wake_function.__wake_up_common
      1.95           -0.1       1.88      perf-profile.calltrace.cycles-pp.copy_page_to_iter.pipe_read.__vfs_read.vfs_read.ksys_read
      2.19           -0.1       2.13      perf-profile.calltrace.cycles-pp.__switch_to_asm.read
      1.30           -0.1       1.25      perf-profile.calltrace.cycles-pp.update_curr.reweight_entity.enqueue_task_fair.ttwu_do_activate.try_to_wake_up
      1.27           -0.1       1.22 ±  2%  perf-profile.calltrace.cycles-pp.update_curr.reweight_entity.dequeue_task_fair.__schedule.schedule
      2.29           -0.0       2.24      perf-profile.calltrace.cycles-pp.load_new_mm_cr3.switch_mm_irqs_off.__schedule.schedule.pipe_wait
      0.96           -0.0       0.92      perf-profile.calltrace.cycles-pp.__calc_delta.update_curr.reweight_entity.dequeue_task_fair.__schedule
      0.85           -0.0       0.81 ±  3%  perf-profile.calltrace.cycles-pp.cpumask_next_wrap.select_idle_sibling.select_task_rq_fair.try_to_wake_up.autoremove_wake_function
      1.63           -0.0       1.59      perf-profile.calltrace.cycles-pp.native_write_msr.read
      0.72           -0.0       0.69      perf-profile.calltrace.cycles-pp.copyout.copy_page_to_iter.pipe_read.__vfs_read.vfs_read
      0.65 ±  2%     -0.0       0.62      perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.try_to_wake_up.autoremove_wake_function.__wake_up_common.__wake_up_common_lock
      0.61           -0.0       0.58 ±  2%  perf-profile.calltrace.cycles-pp.find_next_bit.cpumask_next_wrap.select_idle_sibling.select_task_rq_fair.try_to_wake_up
      0.88           -0.0       0.85      perf-profile.calltrace.cycles-pp.touch_atime.pipe_read.__vfs_read.vfs_read.ksys_read
      0.80           -0.0       0.77 ±  2%  perf-profile.calltrace.cycles-pp.___perf_sw_event.__schedule.schedule.pipe_wait.pipe_read
      0.82           -0.0       0.79      perf-profile.calltrace.cycles-pp.prepare_to_wait.pipe_wait.pipe_read.__vfs_read.vfs_read
      0.72           -0.0       0.70      perf-profile.calltrace.cycles-pp.mutex_lock.pipe_write.__vfs_write.vfs_write.ksys_write
      0.56 ±  2%     -0.0       0.53      perf-profile.calltrace.cycles-pp.update_rq_clock.try_to_wake_up.autoremove_wake_function.__wake_up_common.__wake_up_common_lock
      0.83           -0.0       0.81      perf-profile.calltrace.cycles-pp.__wake_up_common_lock.pipe_read.__vfs_read.vfs_read.ksys_read
     42.40           +0.3      42.69      perf-profile.calltrace.cycles-pp.write
     31.80           +0.4      32.18      perf-profile.calltrace.cycles-pp.__vfs_read.vfs_read.ksys_read.do_syscall_64.entry_SYSCALL_64_after_hwframe
     24.35           +0.5      24.84      perf-profile.calltrace.cycles-pp.pipe_wait.pipe_read.__vfs_read.vfs_read.ksys_read
     20.36           +0.6      20.92 ±  2%  perf-profile.calltrace.cycles-pp.try_to_wake_up.autoremove_wake_function.__wake_up_common.__wake_up_common_lock.pipe_write
     22.01           +0.6      22.58      perf-profile.calltrace.cycles-pp.schedule.pipe_wait.pipe_read.__vfs_read.vfs_read
     21.87           +0.6      22.46      perf-profile.calltrace.cycles-pp.__schedule.schedule.pipe_wait.pipe_read.__vfs_read
      3.15 ± 11%     +1.0       4.12 ± 14%  perf-profile.calltrace.cycles-pp.ttwu_do_wakeup.try_to_wake_up.autoremove_wake_function.__wake_up_common.__wake_up_common_lock
      1.07 ± 34%     +1.1       2.12 ± 31%  perf-profile.calltrace.cycles-pp.tracing_record_taskinfo_sched_switch.__schedule.schedule.pipe_wait.pipe_read
      0.66 ± 75%     +1.1       1.72 ± 37%  perf-profile.calltrace.cycles-pp.trace_save_cmdline.tracing_record_taskinfo.ttwu_do_wakeup.try_to_wake_up.autoremove_wake_function
      0.75 ± 74%     +1.1       1.88 ± 34%  perf-profile.calltrace.cycles-pp.tracing_record_taskinfo.ttwu_do_wakeup.try_to_wake_up.autoremove_wake_function.__wake_up_common
      0.69 ± 76%     +1.2       1.85 ± 36%  perf-profile.calltrace.cycles-pp.trace_save_cmdline.tracing_record_taskinfo_sched_switch.__schedule.schedule.pipe_wait
      8.73 ±  2%     -0.3       8.45      perf-profile.children.cycles-pp.dequeue_task_fair
     57.28           -0.3      57.01      perf-profile.children.cycles-pp.read
      6.95           -0.2       6.70      perf-profile.children.cycles-pp.syscall_return_via_sysret
      5.57           -0.2       5.35      perf-profile.children.cycles-pp.reweight_entity
      5.26           -0.2       5.05      perf-profile.children.cycles-pp.select_task_rq_fair
      5.19           -0.2       4.99      perf-profile.children.cycles-pp.__switch_to
      4.90           -0.2       4.73 ±  2%  perf-profile.children.cycles-pp.update_curr
      1.27           -0.1       1.13 ±  8%  perf-profile.children.cycles-pp.fsnotify
      3.92           -0.1       3.83      perf-profile.children.cycles-pp.select_idle_sibling
      2.01           -0.1       1.93      perf-profile.children.cycles-pp.__calc_delta
      2.14           -0.1       2.06      perf-profile.children.cycles-pp.copy_page_to_iter
      1.58           -0.1       1.51      perf-profile.children.cycles-pp._raw_spin_unlock_irqrestore
      2.90           -0.1       2.84      perf-profile.children.cycles-pp.update_cfs_group
      1.93           -0.1       1.87      perf-profile.children.cycles-pp._raw_spin_lock_irqsave
      2.35           -0.1       2.29      perf-profile.children.cycles-pp.__switch_to_asm
      1.33           -0.1       1.27 ±  3%  perf-profile.children.cycles-pp.cpumask_next_wrap
      2.57           -0.1       2.52      perf-profile.children.cycles-pp.load_new_mm_cr3
      1.53           -0.1       1.47 ±  2%  perf-profile.children.cycles-pp.__fdget_pos
      1.11           -0.0       1.07 ±  2%  perf-profile.children.cycles-pp.find_next_bit
      1.18           -0.0       1.14      perf-profile.children.cycles-pp.update_rq_clock
      0.88           -0.0       0.83      perf-profile.children.cycles-pp.copy_user_generic_unrolled
      1.70           -0.0       1.65      perf-profile.children.cycles-pp.native_write_msr
      0.97           -0.0       0.93 ±  2%  perf-profile.children.cycles-pp.account_entity_dequeue
      0.59           -0.0       0.56      perf-profile.children.cycles-pp.finish_task_switch
      0.91           -0.0       0.88      perf-profile.children.cycles-pp.touch_atime
      0.69           -0.0       0.65      perf-profile.children.cycles-pp.account_entity_enqueue
      2.13           -0.0       2.09      perf-profile.children.cycles-pp.mutex_lock
      0.32 ±  3%     -0.0       0.29 ±  4%  perf-profile.children.cycles-pp.__sb_start_write
      0.84           -0.0       0.81 ±  2%  perf-profile.children.cycles-pp.___perf_sw_event
      0.89           -0.0       0.87      perf-profile.children.cycles-pp.prepare_to_wait
      0.73           -0.0       0.71      perf-profile.children.cycles-pp.copyout
      0.31 ±  2%     -0.0       0.28 ±  3%  perf-profile.children.cycles-pp.__list_del_entry_valid
      0.46 ±  2%     -0.0       0.44      perf-profile.children.cycles-pp.anon_pipe_buf_release
      0.38           -0.0       0.36 ±  3%  perf-profile.children.cycles-pp.idle_cpu
      0.32           -0.0       0.30 ±  2%  perf-profile.children.cycles-pp.__x64_sys_read
      0.21 ±  2%     -0.0       0.20 ±  2%  perf-profile.children.cycles-pp.deactivate_task
      0.13           -0.0       0.12 ±  4%  perf-profile.children.cycles-pp.timespec_trunc
      0.09           -0.0       0.08      perf-profile.children.cycles-pp.iov_iter_init
      0.08           -0.0       0.07      perf-profile.children.cycles-pp.native_load_tls
      0.11 ±  4%     +0.0       0.12      perf-profile.children.cycles-pp.tick_sched_timer
      0.08 ±  5%     +0.0       0.10 ±  4%  perf-profile.children.cycles-pp.finish_wait
      0.38 ±  2%     +0.0       0.40 ±  2%  perf-profile.children.cycles-pp.file_update_time
      0.31           +0.0       0.33 ±  2%  perf-profile.children.cycles-pp.smp_apic_timer_interrupt
      0.24 ±  3%     +0.0       0.26 ±  3%  perf-profile.children.cycles-pp.rcu_all_qs
      0.39           +0.0       0.41      perf-profile.children.cycles-pp._cond_resched
      0.05           +0.0       0.07 ±  6%  perf-profile.children.cycles-pp.default_wake_function
      0.23 ±  2%     +0.0       0.26 ±  3%  perf-profile.children.cycles-pp.current_time
      0.30           +0.0       0.35 ±  2%  perf-profile.children.cycles-pp.generic_pipe_buf_confirm
      0.52           +0.1       0.58      perf-profile.children.cycles-pp.entry_SYSCALL_64_stage2
      0.00           +0.1       0.08 ±  5%  perf-profile.children.cycles-pp.hrtick_update
     42.40           +0.3      42.69      perf-profile.children.cycles-pp.write
     31.86           +0.4      32.26      perf-profile.children.cycles-pp.__vfs_read
     24.40           +0.5      24.89      perf-profile.children.cycles-pp.pipe_wait
     20.40           +0.6      20.96 ±  2%  perf-profile.children.cycles-pp.try_to_wake_up
     22.30           +0.6      22.89      perf-profile.children.cycles-pp.schedule
     22.22           +0.6      22.84      perf-profile.children.cycles-pp.__schedule
      0.99 ± 36%     +0.9       1.94 ± 32%  perf-profile.children.cycles-pp.tracing_record_taskinfo
      3.30 ± 10%     +1.0       4.27 ± 13%  perf-profile.children.cycles-pp.ttwu_do_wakeup
      1.14 ± 31%     +1.1       2.24 ± 29%  perf-profile.children.cycles-pp.tracing_record_taskinfo_sched_switch
      1.59 ± 46%     +2.0       3.60 ± 36%  perf-profile.children.cycles-pp.trace_save_cmdline
      6.95           -0.2       6.70      perf-profile.self.cycles-pp.syscall_return_via_sysret
      5.19           -0.2       4.99      perf-profile.self.cycles-pp.__switch_to
      1.27           -0.1       1.12 ±  8%  perf-profile.self.cycles-pp.fsnotify
      1.49           -0.1       1.36      perf-profile.self.cycles-pp.select_task_rq_fair
      2.47           -0.1       2.37 ±  2%  perf-profile.self.cycles-pp.reweight_entity
      0.29           -0.1       0.19 ±  2%  perf-profile.self.cycles-pp.ksys_read
      1.50           -0.1       1.42      perf-profile.self.cycles-pp._raw_spin_unlock_irqrestore
      2.01           -0.1       1.93      perf-profile.self.cycles-pp.__calc_delta
      1.93           -0.1       1.86      perf-profile.self.cycles-pp._raw_spin_lock_irqsave
      1.47           -0.1       1.40      perf-profile.self.cycles-pp.dequeue_task_fair
      2.90           -0.1       2.84      perf-profile.self.cycles-pp.update_cfs_group
      1.29           -0.1       1.23      perf-profile.self.cycles-pp.do_syscall_64
      2.57           -0.1       2.52      perf-profile.self.cycles-pp.load_new_mm_cr3
      2.28           -0.1       2.23      perf-profile.self.cycles-pp.__switch_to_asm
      1.80           -0.1       1.75      perf-profile.self.cycles-pp.select_idle_sibling
      1.11           -0.0       1.07 ±  2%  perf-profile.self.cycles-pp.find_next_bit
      0.87           -0.0       0.83      perf-profile.self.cycles-pp.copy_user_generic_unrolled
      0.43           -0.0       0.39 ±  2%  perf-profile.self.cycles-pp.dequeue_entity
      1.70           -0.0       1.65      perf-profile.self.cycles-pp.native_write_msr
      0.92           -0.0       0.88 ±  2%  perf-profile.self.cycles-pp.account_entity_dequeue
      0.48           -0.0       0.44      perf-profile.self.cycles-pp.finish_task_switch
      0.77           -0.0       0.74      perf-profile.self.cycles-pp.___perf_sw_event
      0.66           -0.0       0.63      perf-profile.self.cycles-pp.account_entity_enqueue
      0.46 ±  2%     -0.0       0.43 ±  2%  perf-profile.self.cycles-pp.anon_pipe_buf_release
      0.32 ±  3%     -0.0       0.29 ±  4%  perf-profile.self.cycles-pp.__sb_start_write
      0.31 ±  2%     -0.0       0.28 ±  3%  perf-profile.self.cycles-pp.__list_del_entry_valid
      0.38           -0.0       0.36 ±  3%  perf-profile.self.cycles-pp.idle_cpu
      0.19 ±  4%     -0.0       0.17 ±  2%  perf-profile.self.cycles-pp.__fdget_pos
      0.50           -0.0       0.48      perf-profile.self.cycles-pp.__atime_needs_update
      0.23 ±  2%     -0.0       0.21 ±  3%  perf-profile.self.cycles-pp.touch_atime
      0.31           -0.0       0.30      perf-profile.self.cycles-pp.__x64_sys_read
      0.21 ±  2%     -0.0       0.20 ±  2%  perf-profile.self.cycles-pp.deactivate_task
      0.21 ±  2%     -0.0       0.19      perf-profile.self.cycles-pp.check_preempt_curr
      0.40           -0.0       0.39      perf-profile.self.cycles-pp.autoremove_wake_function
      0.40           -0.0       0.38      perf-profile.self.cycles-pp.copy_user_enhanced_fast_string
      0.27           -0.0       0.26      perf-profile.self.cycles-pp.pipe_wait
      0.13           -0.0       0.12 ±  4%  perf-profile.self.cycles-pp.timespec_trunc
      0.22 ±  2%     -0.0       0.20 ±  2%  perf-profile.self.cycles-pp.put_prev_entity
      0.09           -0.0       0.08      perf-profile.self.cycles-pp.iov_iter_init
      0.08           -0.0       0.07      perf-profile.self.cycles-pp.native_load_tls
      0.11           -0.0       0.10      perf-profile.self.cycles-pp.schedule
      0.12 ±  4%     +0.0       0.13      perf-profile.self.cycles-pp.copyin
      0.08 ±  5%     +0.0       0.10 ±  4%  perf-profile.self.cycles-pp.finish_wait
      0.18           +0.0       0.20 ±  2%  perf-profile.self.cycles-pp.ttwu_do_activate
      0.28 ±  2%     +0.0       0.30 ±  2%  perf-profile.self.cycles-pp._cond_resched
      0.24 ±  3%     +0.0       0.26 ±  3%  perf-profile.self.cycles-pp.rcu_all_qs
      0.05           +0.0       0.07 ±  6%  perf-profile.self.cycles-pp.default_wake_function
      0.08 ± 14%     +0.0       0.11 ± 14%  perf-profile.self.cycles-pp.tracing_record_taskinfo_sched_switch
      0.51           +0.0       0.55 ±  4%  perf-profile.self.cycles-pp.vfs_write
      0.30           +0.0       0.35 ±  2%  perf-profile.self.cycles-pp.generic_pipe_buf_confirm
      0.52           +0.1       0.58      perf-profile.self.cycles-pp.entry_SYSCALL_64_stage2
      0.00           +0.1       0.08 ±  5%  perf-profile.self.cycles-pp.hrtick_update
      1.97           +0.1       2.07 ±  2%  perf-profile.self.cycles-pp.switch_mm_irqs_off
      1.59 ± 46%     +2.0       3.60 ± 36%  perf-profile.self.cycles-pp.trace_save_cmdline

=========================================================================================
tbox_group/testcase/rootfs/kconfig/compiler/nr_task/thp_enabled/test/cpufreq_governor:
  lkp-skl-4sp1/will-it-scale/debian-x86_64-2018-04-03.cgz/x86_64-rhel-7.2/gcc-7/100%/never/brk1/performance

commit:
  ba98a1cdad71d259a194461b3a61471b49b14df1
  a7a8993bfe3ccb54ad468b9f1799649e4ad1ff12

ba98a1cdad71d259 a7a8993bfe3ccb54ad468b9f17
---------------- --------------------------
       fail:runs  %reproduction    fail:runs
           |             |             |
            :3           33%           1:3  kmsg.pstore:crypto_comp_decompress_failed,ret=
            :3           33%           1:3  kmsg.pstore:decompression_failed
         %stddev     %change         %stddev
             \          |                \
    997317         -2.0%     977778       will-it-scale.per_process_ops
    957.00         -7.9%     881.00 ±  3%  will-it-scale.per_thread_ops
     18.42 ±  3%    -8.2%      16.90      will-it-scale.time.user_time
 1.917e+08         -2.0%  1.879e+08       will-it-scale.workload
     18.42 ±  3%    -8.2%      16.90      time.user_time
      0.30 ± 11%   -36.7%       0.19 ± 11%  turbostat.Pkg%pc2
     57539 ± 51%  +140.6%    138439 ± 31%  meminfo.CmaFree
    410877 ± 11%   -22.1%    320082 ± 22%  meminfo.DirectMap4k
    343575 ± 27%   +71.3%    588703 ± 31%  numa-numastat.node0.local_node
    374176 ± 24%   +63.3%    611007 ± 27%  numa-numastat.node0.numa_hit
   1056347 ±  4%   -39.9%    634843 ± 38%  numa-numastat.node3.local_node
   1060682 ±  4%   -39.0%    646862 ± 35%  numa-numastat.node3.numa_hit
     14383 ± 51%  +140.6%     34608 ± 31%  proc-vmstat.nr_free_cma
    179.00         +2.4%     183.33       proc-vmstat.nr_inactive_file
    179.00         +2.4%     183.33       proc-vmstat.nr_zone_inactive_file
    564483 ±  3%   -38.0%    350064 ± 36%  proc-vmstat.pgalloc_movable
   1811959        +10.8%    2008488 ±  5%  proc-vmstat.pgalloc_normal
      7153 ± 42%   -94.0%     431.33 ±119%  latency_stats.max.pipe_write.__vfs_write.vfs_write.ksys_write.do_syscall_64.entry_SYSCALL_64_after_hwframe
      6627 ±141%  +380.5%      31843 ±110%  latency_stats.max.call_rwsem_down_write_failed_killable.do_mprotect_pkey.__x64_sys_mprotect.do_syscall_64.entry_SYSCALL_64_after_hwframe
     15244 ± 31%   -99.9%      15.00 ±141%  latency_stats.sum.call_rwsem_down_read_failed.__do_page_fault.do_page_fault.page_fault.__get_user_8.exit_robust_list.mm_release.do_exit.do_group_exit.get_signal.do_signal.exit_to_usermode_loop
      4301 ±117%   -83.7%     700.33 ±  6%  latency_stats.sum.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_proc_getattr.__nfs_revalidate_inode.nfs_do_access.nfs_permission.inode_permission.link_path_walk.path_lookupat
     12153 ± 28%   -83.1%       2056 ± 70%  latency_stats.sum.pipe_write.__vfs_write.vfs_write.ksys_write.do_syscall_64.entry_SYSCALL_64_after_hwframe
      6772 ±141%  +1105.8%     81665 ±127%  latency_stats.sum.call_rwsem_down_write_failed_killable.do_mprotect_pkey.__x64_sys_mprotect.do_syscall_64.entry_SYSCALL_64_after_hwframe
 2.465e+13         -1.3%  2.434e+13       perf-stat.branch-instructions
 2.691e+11         -2.1%  2.635e+11       perf-stat.branch-misses
 3.402e+13         -1.4%  3.355e+13       perf-stat.dTLB-loads
 1.694e+13         +1.4%  1.718e+13       perf-stat.dTLB-stores
      1.75 ± 50%     +4.7       6.45 ± 11%  perf-stat.iTLB-load-miss-rate%
 4.077e+08 ± 48%  +232.3%  1.355e+09 ± 11%  perf-stat.iTLB-load-misses
  2.31e+10 ±  2%   -14.9%  1.965e+10 ±  3%  perf-stat.iTLB-loads
 1.163e+14         -1.6%  1.144e+14       perf-stat.instructions
    346171 ± 36%   -75.3%     85575 ± 11%  perf-stat.instructions-per-iTLB-miss
 6.174e+08 ±  2%    -9.5%  5.589e+08       perf-stat.node-store-misses
    595.00 ± 10%   +31.4%     782.00 ±  3%  slabinfo.Acpi-State.active_objs
    595.00 ± 10%   +31.4%     782.00 ±  3%  slabinfo.Acpi-State.num_objs
      2831 ±  3%   -14.0%       2434 ±  5%  slabinfo.avtab_node.active_objs
      2831 ±  3%   -14.0%       2434 ±  5%  slabinfo.avtab_node.num_objs
    934.00        -10.9%     832.33 ±  5%  slabinfo.inotify_inode_mark.active_objs
    934.00        -10.9%     832.33 ±  5%  slabinfo.inotify_inode_mark.num_objs
      1232 ±  4%   +13.4%       1397 ±  6%  slabinfo.nsproxy.active_objs
      1232 ±  4%   +13.4%       1397 ±  6%  slabinfo.nsproxy.num_objs
    499.67 ± 12%   +24.8%     623.67 ± 10%  slabinfo.secpath_cache.active_objs
    499.67 ± 12%   +24.8%     623.67 ± 10%  slabinfo.secpath_cache.num_objs
     31393 ± 84%  +220.1%    100477 ± 21%  numa-meminfo.node0.Active
     31393 ± 84%  +220.1%    100477 ± 21%  numa-meminfo.node0.Active(anon)
     30013 ± 85%  +232.1%     99661 ± 21%  numa-meminfo.node0.AnonPages
     21603 ± 34%   -85.0%       3237 ±100%  numa-meminfo.node0.Inactive
     21528 ± 34%   -85.0%       3237 ±100%  numa-meminfo.node0.Inactive(anon)
     10247 ± 35%   -46.4%       5495       numa-meminfo.node0.Mapped
     35388 ± 14%   -41.6%      20670 ± 15%  numa-meminfo.node0.SReclaimable
     22911 ± 29%   -82.3%       4057 ± 84%  numa-meminfo.node0.Shmem
    117387 ±  9%   -22.5%      90986 ± 12%  numa-meminfo.node0.Slab
     68863 ± 67%   +77.7%     122351 ± 13%  numa-meminfo.node1.Active
     68863 ± 67%   +77.7%     122351 ± 13%  numa-meminfo.node1.Active(anon)
    228376        +22.3%     279406 ± 17%  numa-meminfo.node1.FilePages
      1481 ±116%  +1062.1%     17218 ± 39%  numa-meminfo.node1.Inactive
      1481 ±116%  +1062.0%     17216 ± 39%  numa-meminfo.node1.Inactive(anon)
      6593 ±  2%   +11.7%       7367 ±  3%  numa-meminfo.node1.KernelStack
    596227 ±  8%   +18.0%     703748 ±  4%  numa-meminfo.node1.MemUsed
     15298 ± 12%   +88.5%      28843 ± 36%  numa-meminfo.node1.SReclaimable
     52718 ±  9%   +21.0%      63810 ± 11%  numa-meminfo.node1.SUnreclaim
      1808 ± 97%  +2723.8%     51054 ± 97%  numa-meminfo.node1.Shmem
     68017 ±  5%   +36.2%      92654 ± 18%  numa-meminfo.node1.Slab
    125541 ± 29%   -64.9%      44024 ± 98%  numa-meminfo.node3.Active
    125137 ± 29%   -65.0%      43823 ± 98%  numa-meminfo.node3.Active(anon)
     93173 ± 25%   -87.8%      11381 ± 20%  numa-meminfo.node3.AnonPages
      9150 ±  5%    -9.3%       8301 ±  8%  numa-meminfo.node3.KernelStack
      7848 ± 84%  +220.0%      25118 ± 21%  numa-vmstat.node0.nr_active_anon
      7503 ± 85%  +232.1%      24914 ± 21%  numa-vmstat.node0.nr_anon_pages
      5381 ± 34%   -85.0%     809.00 ±100%  numa-vmstat.node0.nr_inactive_anon
      2559 ± 35%   -46.4%       1372       numa-vmstat.node0.nr_mapped
      5727 ± 29%   -82.3%       1014 ± 84%  numa-vmstat.node0.nr_shmem
      8846 ± 14%   -41.6%       5167 ± 15%  numa-vmstat.node0.nr_slab_reclaimable
      7848 ± 84%  +220.0%      25118 ± 21%  numa-vmstat.node0.nr_zone_active_anon
      5381 ± 34%   -85.0%     809.00 ±100%  numa-vmstat.node0.nr_zone_inactive_anon
      4821 ±  2%   +30.3%       6283 ± 15%  numa-vmstat.node1
     17215 ± 67%   +77.7%      30591 ± 13%  numa-vmstat.node1.nr_active_anon
     57093        +22.3%      69850 ± 17%  numa-vmstat.node1.nr_file_pages
    370.00 ±116%  +1061.8%      4298 ± 39%  numa-vmstat.node1.nr_inactive_anon
      6593 ±  2%   +11.7%       7366 ±  3%  numa-vmstat.node1.nr_kernel_stack
    451.67 ± 97%  +2725.6%     12762 ± 97%  numa-vmstat.node1.nr_shmem
      3824 ± 12%   +88.6%       7211 ± 36%  numa-vmstat.node1.nr_slab_reclaimable
     13179 ±  9%   +21.0%      15952 ± 11%  numa-vmstat.node1.nr_slab_unreclaimable
     17215 ± 67%   +77.7%      30591 ± 13%  numa-vmstat.node1.nr_zone_active_anon
    370.00 ±116%  +1061.8%      4298 ± 39%  numa-vmstat.node1.nr_zone_inactive_anon
    364789 ± 12%   +62.8%     593926 ± 34%  numa-vmstat.node1.numa_hit
    239539 ± 19%   +95.4%     468113 ± 43%  numa-vmstat.node1.numa_local
     71.00 ± 28%   +42.3%     101.00       numa-vmstat.node2.nr_mlock
     31285 ± 29%   -65.0%      10960 ± 98%  numa-vmstat.node3.nr_active_anon
     23292 ± 25%   -87.8%       2844 ± 19%  numa-vmstat.node3.nr_anon_pages
     14339 ± 52%  +141.1%      34566 ± 32%  numa-vmstat.node3.nr_free_cma
      9151 ±  5%    -9.3%       8299 ±  8%  numa-vmstat.node3.nr_kernel_stack
     31305 ± 29%   -64.9%      10975 ± 98%  numa-vmstat.node3.nr_zone_active_anon
    930131 ±  3%   -35.9%     596006 ± 34%  numa-vmstat.node3.numa_hit
    836455 ±  3%   -40.9%     493947 ± 44%  numa-vmstat.node3.numa_local
     75182 ± 58%   -83.8%      12160 ±  2%  sched_debug.cfs_rq:/.load.max
      6.65 ±  5%   -10.6%       5.94 ±  6%  sched_debug.cfs_rq:/.load_avg.avg
      0.16 ±  7%   +22.6%       0.20 ± 12%  sched_debug.cfs_rq:/.nr_running.stddev
      5.58 ± 24%  +427.7%      29.42 ± 93%  sched_debug.cfs_rq:/.nr_spread_over.max
      0.54 ± 15%  +306.8%       2.19 ± 86%  sched_debug.cfs_rq:/.nr_spread_over.stddev
      1.05 ± 25%   -65.1%       0.37 ± 71%  sched_debug.cfs_rq:/.removed.load_avg.avg
      9.62 ± 11%   -50.7%       4.74 ± 70%  sched_debug.cfs_rq:/.removed.load_avg.stddev
     48.70 ± 25%   -65.1%      17.02 ± 71%  sched_debug.cfs_rq:/.removed.runnable_sum.avg
    444.31 ± 11%   -50.7%     219.26 ± 70%  sched_debug.cfs_rq:/.removed.runnable_sum.stddev
      0.47 ± 13%   -60.9%       0.19 ± 71%  sched_debug.cfs_rq:/.removed.util_avg.avg
      4.47 ±  4%   -46.5%       2.39 ± 70%  sched_debug.cfs_rq:/.removed.util_avg.stddev
      1.64 ±  7%   +22.1%       2.00 ± 13%  sched_debug.cfs_rq:/.runnable_load_avg.stddev
     74653 ± 59%   -84.4%      11676       sched_debug.cfs_rq:/.runnable_weight.max
   -119169        -491.3%     466350 ± 27%  sched_debug.cfs_rq:/.spread0.avg
    517161 ± 30%  +145.8%    1271292 ± 23%  sched_debug.cfs_rq:/.spread0.max
    624.79 ±  5%   -14.2%     535.76 ±  7%  sched_debug.cfs_rq:/.util_est_enqueued.avg
    247.91 ± 32%   -99.8%       0.48 ±  8%  sched_debug.cfs_rq:/.util_est_enqueued.min
    179704 ±  3%   +30.4%     234297 ± 16%  sched_debug.cpu.avg_idle.stddev
      1.56 ±  9%   +24.4%       1.94 ± 14%  sched_debug.cpu.cpu_load[0].stddev
      1.50 ±  6%   +27.7%       1.91 ± 14%  sched_debug.cpu.cpu_load[1].stddev
      1.45 ±  3%   +30.8%       1.90 ± 14%  sched_debug.cpu.cpu_load[2].stddev
      1.43 ±  3%   +36.1%       1.95 ± 11%  sched_debug.cpu.cpu_load[3].stddev
      1.55 ±  7%   +43.5%       2.22 ±  7%  sched_debug.cpu.cpu_load[4].stddev
     10004 ±  3%   -11.6%       8839 ±  3%  sched_debug.cpu.curr->pid.avg
      1146 ± 26%   +52.2%       1745 ±  7%  sched_debug.cpu.curr->pid.min
      3162 ±  6%   +25.4%       3966 ± 11%  sched_debug.cpu.curr->pid.stddev
    403738 ±  3%   -11.7%     356696 ±  7%  sched_debug.cpu.nr_switches.max
      0.08 ± 21%   +78.2%       0.14 ± 14%  sched_debug.cpu.nr_uninterruptible.avg
    404435 ±  3%   -11.8%     356732 ±  7%  sched_debug.cpu.sched_count.max
      4.17           -0.3       3.87      perf-profile.calltrace.cycles-pp.kmem_cache_alloc.do_brk_flags.__x64_sys_brk.do_syscall_64.entry_SYSCALL_64_after_hwframe
      2.40           -0.2       2.17      perf-profile.calltrace.cycles-pp.vma_compute_subtree_gap.__vma_link_rb.vma_link.do_brk_flags.__x64_sys_brk
      7.58           -0.2       7.36      perf-profile.calltrace.cycles-pp.perf_event_mmap.do_brk_flags.__x64_sys_brk.do_syscall_64.entry_SYSCALL_64_after_hwframe
     15.00           -0.2      14.81      perf-profile.calltrace.cycles-pp.syscall_return_via_sysret.brk
      7.83           -0.2       7.66      perf-profile.calltrace.cycles-pp.unmap_vmas.unmap_region.do_munmap.__x64_sys_brk.do_syscall_64
     28.66           -0.1      28.51      perf-profile.calltrace.cycles-pp.do_brk_flags.__x64_sys_brk.do_syscall_64.entry_SYSCALL_64_after_hwframe.brk
      2.15           -0.1       2.03      perf-profile.calltrace.cycles-pp.vma_compute_subtree_gap.do_munmap.__x64_sys_brk.do_syscall_64.entry_SYSCALL_64_after_hwframe
      1.07           -0.1       0.99      perf-profile.calltrace.cycles-pp.memcpy_erms.strlcpy.perf_event_mmap.do_brk_flags.__x64_sys_brk
      1.03           -0.1       0.95      perf-profile.calltrace.cycles-pp.kmem_cache_free.remove_vma.do_munmap.__x64_sys_brk.do_syscall_64
      7.33           -0.1       7.25      perf-profile.calltrace.cycles-pp.unmap_page_range.unmap_vmas.unmap_region.do_munmap.__x64_sys_brk
      0.76           -0.1       0.69      perf-profile.calltrace.cycles-pp.__vm_enough_memory.do_brk_flags.__x64_sys_brk.do_syscall_64.entry_SYSCALL_64_after_hwframe
     11.85           -0.1      11.77      perf-profile.calltrace.cycles-pp.unmap_region.do_munmap.__x64_sys_brk.do_syscall_64.entry_SYSCALL_64_after_hwframe
      1.64           -0.1       1.57      perf-profile.calltrace.cycles-pp.strlcpy.perf_event_mmap.do_brk_flags.__x64_sys_brk.do_syscall_64
      1.06           -0.1       0.99      perf-profile.calltrace.cycles-pp.__indirect_thunk_start.brk
      0.73           -0.1       0.67      perf-profile.calltrace.cycles-pp.sync_mm_rss.unmap_page_range.unmap_vmas.unmap_region.do_munmap
      4.59           -0.1       4.52      perf-profile.calltrace.cycles-pp.security_vm_enough_memory_mm.do_brk_flags.__x64_sys_brk.do_syscall_64.entry_SYSCALL_64_after_hwframe
      2.82           -0.1       2.76      perf-profile.calltrace.cycles-pp.selinux_vm_enough_memory.security_vm_enough_memory_mm.do_brk_flags.__x64_sys_brk.do_syscall_64
      2.89           -0.1       2.84      perf-profile.calltrace.cycles-pp.down_write_killable.__x64_sys_brk.do_syscall_64.entry_SYSCALL_64_after_hwframe.brk
      3.37           -0.1       3.32      perf-profile.calltrace.cycles-pp.get_unmapped_area.do_brk_flags.__x64_sys_brk.do_syscall_64.entry_SYSCALL_64_after_hwframe
      1.99           -0.0       1.94      perf-profile.calltrace.cycles-pp.cred_has_capability.selinux_vm_enough_memory.security_vm_enough_memory_mm.do_brk_flags.__x64_sys_brk
      2.32           -0.0       2.27      perf-profile.calltrace.cycles-pp.perf_iterate_sb.perf_event_mmap.do_brk_flags.__x64_sys_brk.do_syscall_64
      1.88           -0.0       1.84      perf-profile.calltrace.cycles-pp.security_mmap_addr.get_unmapped_area.do_brk_flags.__x64_sys_brk.do_syscall_64
      0.77           -0.0       0.73      perf-profile.calltrace.cycles-pp._raw_spin_lock.unmap_page_range.unmap_vmas.unmap_region.do_munmap
      1.62           -0.0       1.59      perf-profile.calltrace.cycles-pp.memset_erms.kmem_cache_alloc.do_brk_flags.__x64_sys_brk.do_syscall_64
      0.81           -0.0       0.79      perf-profile.calltrace.cycles-pp.___might_sleep.down_write_killable.__x64_sys_brk.do_syscall_64.entry_SYSCALL_64_after_hwframe
      0.66           -0.0       0.64      perf-profile.calltrace.cycles-pp.arch_get_unmapped_area_topdown.brk
      0.72           +0.0       0.74      perf-profile.calltrace.cycles-pp.do_munmap.brk
      0.90           +0.0       0.93      perf-profile.calltrace.cycles-pp.___might_sleep.unmap_page_range.unmap_vmas.unmap_region.do_munmap
      4.40           +0.1       4.47      perf-profile.calltrace.cycles-pp.find_vma.do_munmap.__x64_sys_brk.do_syscall_64.entry_SYSCALL_64_after_hwframe
      1.96           +0.1       2.09      perf-profile.calltrace.cycles-pp.vmacache_find.find_vma.do_munmap.__x64_sys_brk.do_syscall_64
      0.52 ±  2%     +0.2       0.68      perf-profile.calltrace.cycles-pp.__vma_link_rb.brk
      0.35 ± 70%     +0.2       0.54 ±  2%  perf-profile.calltrace.cycles-pp.find_vma.brk
      2.20           +0.3       2.50      perf-profile.calltrace.cycles-pp.remove_vma.do_munmap.__x64_sys_brk.do_syscall_64.entry_SYSCALL_64_after_hwframe
     64.62           +0.3      64.94      perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.brk
     60.53           +0.4      60.92      perf-profile.calltrace.cycles-pp.__x64_sys_brk.do_syscall_64.entry_SYSCALL_64_after_hwframe.brk
     63.20           +0.4      63.60      perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.brk
      3.73           +0.5       4.26      perf-profile.calltrace.cycles-pp.vma_link.do_brk_flags.__x64_sys_brk.do_syscall_64.entry_SYSCALL_64_after_hwframe
      0.00           +0.6       0.56      perf-profile.calltrace.cycles-pp.free_pgtables.unmap_region.do_munmap.__x64_sys_brk.do_syscall_64
     24.54           +0.6      25.14      perf-profile.calltrace.cycles-pp.do_munmap.__x64_sys_brk.do_syscall_64.entry_SYSCALL_64_after_hwframe.brk
      0.00           +0.6       0.64      perf-profile.calltrace.cycles-pp.put_vma.remove_vma.do_munmap.__x64_sys_brk.do_syscall_64
      0.71           +0.6       1.36      perf-profile.calltrace.cycles-pp.__vma_rb_erase.do_munmap.__x64_sys_brk.do_syscall_64.entry_SYSCALL_64_after_hwframe
      0.00           +0.7       0.70      perf-profile.calltrace.cycles-pp._raw_write_lock.__vma_rb_erase.do_munmap.__x64_sys_brk.do_syscall_64
      3.10           +0.7       3.82      perf-profile.calltrace.cycles-pp.__vma_link_rb.vma_link.do_brk_flags.__x64_sys_brk.do_syscall_64
      0.00           +0.8       0.76      perf-profile.calltrace.cycles-pp._raw_write_lock.__vma_link_rb.vma_link.do_brk_flags.__x64_sys_brk
      0.00           +0.8       0.85      perf-profile.calltrace.cycles-pp.__vma_merge.do_brk_flags.__x64_sys_brk.do_syscall_64.entry_SYSCALL_64_after_hwframe
      5.09           -0.5       4.62      perf-profile.children.cycles-pp.vma_compute_subtree_gap
      4.54           -0.3       4.21      perf-profile.children.cycles-pp.kmem_cache_alloc
      8.11           -0.2       7.89      perf-profile.children.cycles-pp.perf_event_mmap
      8.05           -0.2       7.85      perf-profile.children.cycles-pp.unmap_vmas
     15.01           -0.2      14.81      perf-profile.children.cycles-pp.syscall_return_via_sysret
     29.20           -0.1      29.06      perf-profile.children.cycles-pp.do_brk_flags
      1.11           -0.1       1.00      perf-profile.children.cycles-pp.kmem_cache_free
     12.28           -0.1      12.17      perf-profile.children.cycles-pp.unmap_region
      7.83           -0.1       7.74      perf-profile.children.cycles-pp.unmap_page_range
      0.87 ±  3%     -0.1       0.79      perf-profile.children.cycles-pp.__vm_enough_memory
      1.29           -0.1       1.22      perf-profile.children.cycles-pp.__indirect_thunk_start
      1.81           -0.1       1.74      perf-profile.children.cycles-pp.strlcpy
      4.65           -0.1       4.58      perf-profile.children.cycles-pp.security_vm_enough_memory_mm
      3.08           -0.1       3.02      perf-profile.children.cycles-pp.down_write_killable
      2.88           -0.1       2.82      perf-profile.children.cycles-pp.selinux_vm_enough_memory
      0.73           -0.1       0.67      perf-profile.children.cycles-pp.sync_mm_rss
      3.65           -0.1       3.59      perf-profile.children.cycles-pp.get_unmapped_area
      2.26           -0.1       2.20      perf-profile.children.cycles-pp.cred_has_capability
      1.12           -0.1       1.07      perf-profile.children.cycles-pp.memcpy_erms
      0.39           -0.0       0.35      perf-profile.children.cycles-pp.__rb_insert_augmented
      2.52           -0.0       2.48      perf-profile.children.cycles-pp.perf_iterate_sb
      2.13           -0.0       2.09      perf-profile.children.cycles-pp.security_mmap_addr
      0.55 ±  2%     -0.0       0.52      perf-profile.children.cycles-pp.unmap_single_vma
      1.62           -0.0       1.59      perf-profile.children.cycles-pp.memset_erms
      0.13 ±  3%     -0.0       0.11 ±  4%  perf-profile.children.cycles-pp.__vma_link_file
      0.80           -0.0       0.77      perf-profile.children.cycles-pp._raw_spin_lock
      0.43           -0.0       0.41      perf-profile.children.cycles-pp.strlen
      0.07 ±  6%     -0.0       0.06 ±  8%  perf-profile.children.cycles-pp.should_failslab
      0.43           -0.0       0.42      perf-profile.children.cycles-pp.may_expand_vm
      0.15           +0.0       0.16      perf-profile.children.cycles-pp.__vma_link_list
      0.45           +0.0       0.47      perf-profile.children.cycles-pp.rcu_all_qs
      0.81           +0.1       0.89      perf-profile.children.cycles-pp.free_pgtables
      6.35           +0.1       6.49      perf-profile.children.cycles-pp.find_vma
      2.28           +0.2       2.45      perf-profile.children.cycles-pp.vmacache_find
     64.66           +0.3      64.98      perf-profile.children.cycles-pp.entry_SYSCALL_64_after_hwframe
      2.42           +0.3       2.76      perf-profile.children.cycles-pp.remove_vma
     61.77           +0.4      62.13      perf-profile.children.cycles-pp.__x64_sys_brk
     63.40           +0.4      63.79      perf-profile.children.cycles-pp.do_syscall_64
      1.27           +0.4       1.72      perf-profile.children.cycles-pp.__vma_rb_erase
      4.02           +0.5       4.53      perf-profile.children.cycles-pp.vma_link
     25.26           +0.6      25.89      perf-profile.children.cycles-pp.do_munmap
      0.00           +0.7       0.70      perf-profile.children.cycles-pp.put_vma
      3.80           +0.7       4.53      perf-profile.children.cycles-pp.__vma_link_rb
      0.00           +1.2       1.24      perf-profile.children.cycles-pp.__vma_merge
      0.00           +1.5       1.51      perf-profile.children.cycles-pp._raw_write_lock
      5.07           -0.5       4.60      perf-profile.self.cycles-pp.vma_compute_subtree_gap
      0.59           -0.2       0.38      perf-profile.self.cycles-pp.remove_vma
     15.01           -0.2      14.81      perf-profile.self.cycles-pp.syscall_return_via_sysret
      3.15           -0.2       2.96      perf-profile.self.cycles-pp.do_munmap
      0.98           -0.1       0.87      perf-profile.self.cycles-pp.__vma_rb_erase
      1.10           -0.1       0.99      perf-profile.self.cycles-pp.kmem_cache_free
      0.68           -0.1       0.58      perf-profile.self.cycles-pp.__vm_enough_memory
      0.42           -0.1       0.33      perf-profile.self.cycles-pp.unmap_vmas
      3.62           -0.1       3.53      perf-profile.self.cycles-pp.perf_event_mmap
      1.41           -0.1       1.34      perf-profile.self.cycles-pp.entry_SYSCALL_64_after_hwframe
      1.29           -0.1       1.22      perf-profile.self.cycles-pp.__indirect_thunk_start
      0.73           -0.1       0.66      perf-profile.self.cycles-pp.sync_mm_rss
      2.96           -0.1       2.90      perf-profile.self.cycles-pp.__x64_sys_brk
      3.24           -0.1       3.19      perf-profile.self.cycles-pp.brk
      1.11           -0.0       1.07      perf-profile.self.cycles-pp.memcpy_erms
      0.53 ±  3%     -0.0       0.49 ±  2%  perf-profile.self.cycles-pp.vma_link
      0.73           -0.0       0.69      perf-profile.self.cycles-pp.unmap_region
      1.66           -0.0       1.61      perf-profile.self.cycles-pp.down_write_killable
      0.39           -0.0       0.35      perf-profile.self.cycles-pp.__rb_insert_augmented
      1.74           -0.0       1.71      perf-profile.self.cycles-pp.kmem_cache_alloc
      0.55 ±  2%     -0.0       0.52      perf-profile.self.cycles-pp.unmap_single_vma
      1.61           -0.0       1.59      perf-profile.self.cycles-pp.memset_erms
      0.80           -0.0       0.77      perf-profile.self.cycles-pp._raw_spin_lock
      0.13           -0.0       0.11 ±  4%  perf-profile.self.cycles-pp.__vma_link_file
      0.43           -0.0       0.41      perf-profile.self.cycles-pp.strlen
      0.07 ±  6%     -0.0       0.06 ±  8%  perf-profile.self.cycles-pp.should_failslab
      0.81           -0.0       0.79      perf-profile.self.cycles-pp.tlb_finish_mmu
      0.15           +0.0       0.16      perf-profile.self.cycles-pp.__vma_link_list
      0.45           +0.0       0.47      perf-profile.self.cycles-pp.rcu_all_qs
      0.71           +0.0       0.72      perf-profile.self.cycles-pp.strlcpy
      0.51           +0.1       0.56      perf-profile.self.cycles-pp.free_pgtables
      1.41           +0.1       1.48      perf-profile.self.cycles-pp.__vma_link_rb
      2.27           +0.2       2.44      perf-profile.self.cycles-pp.vmacache_find
      0.00           +0.7       0.69      perf-profile.self.cycles-pp.put_vma
      0.00           +1.2       1.23      perf-profile.self.cycles-pp.__vma_merge
      0.00           +1.5       1.50      perf-profile.self.cycles-pp._raw_write_lock

=========================================================================================
tbox_group/testcase/rootfs/kconfig/compiler/nr_task/thp_enabled/test/cpufreq_governor:
  lkp-skl-4sp1/will-it-scale/debian-x86_64-2018-04-03.cgz/x86_64-rhel-7.2/gcc-7/100%/always/brk1/performance

commit:
  ba98a1cdad71d259a194461b3a61471b49b14df1
  a7a8993bfe3ccb54ad468b9f1799649e4ad1ff12

ba98a1cdad71d259 a7a8993bfe3ccb54ad468b9f17
---------------- --------------------------
       fail:runs  %reproduction    fail:runs
           |             |             |
            :3           33%           1:3  dmesg.WARNING:stack_going_in_the_wrong_direction?ip=schedule_tail/0x
            :3           33%           1:3  kmsg.DHCP/BOOTP:Reply_not_for_us_on_eth#,op[#]xid[#]
         %stddev     %change         %stddev
             \          |                \
    998475         -2.2%     976893       will-it-scale.per_process_ops
    625.87         -2.3%     611.42       will-it-scale.time.elapsed_time
    625.87         -2.3%     611.42       will-it-scale.time.elapsed_time.max
      8158         -1.9%       8000       will-it-scale.time.maximum_resident_set_size
     18.42 ±  2%   -11.9%      16.24      will-it-scale.time.user_time
  34349225 ± 13%   -14.5%   29371024 ± 17%  will-it-scale.time.voluntary_context_switches
 1.919e+08         -2.2%  1.877e+08       will-it-scale.workload
      1639 ± 23%   -18.4%       1337 ± 30%  meminfo.Mlocked
     17748 ± 82%  +103.1%      36051       numa-numastat.node3.other_node
  33410486 ± 14%   -14.8%   28449258 ± 18%  cpuidle.C1.usage
    698749 ± 15%   -18.0%     573307 ± 20%  cpuidle.POLL.usage
   3013702 ± 14%   -15.1%    2559405 ± 17%  softirqs.SCHED
  54361293 ±  2%   -19.0%   44044816 ±  2%  softirqs.TIMER
  33408303 ± 14%   -14.9%   28447123 ± 18%  turbostat.C1
      0.34 ± 16%   -52.0%       0.16 ± 15%  turbostat.Pkg%pc2
      1310 ± 74%  +412.1%       6710 ± 58%  irq_exception_noise.__do_page_fault.samples
      3209 ± 74%  +281.9%      12258 ± 53%  irq_exception_noise.__do_page_fault.sum
    600.67 ±132%   -96.0%      24.00 ± 23%  irq_exception_noise.irq_nr
     99557 ±  7%   -24.0%      75627 ±  7%  irq_exception_noise.softirq_nr
     41424 ±  9%   -24.6%      31253 ±  6%  irq_exception_noise.softirq_time
    625.87         -2.3%     611.42       time.elapsed_time
    625.87         -2.3%     611.42       time.elapsed_time.max
      8158         -1.9%       8000       time.maximum_resident_set_size
     18.42 ±  2%   -11.9%      16.24      time.user_time
  34349225 ± 13%   -14.5%   29371024 ± 17%  time.voluntary_context_switches
    988.00 ±  8%   +14.5%       1131 ±  2%  slabinfo.Acpi-ParseExt.active_objs
    988.00 ±  8%   +14.5%       1131 ±  2%  slabinfo.Acpi-ParseExt.num_objs
      2384 ±  3%   +21.1%       2888 ± 11%  slabinfo.pool_workqueue.active_objs
      2474 ±  2%   +20.4%       2979 ± 11%  slabinfo.pool_workqueue.num_objs
    490.33 ± 10%   -19.2%     396.00 ± 11%  slabinfo.secpath_cache.active_objs
    490.33 ± 10%   -19.2%     396.00 ± 11%  slabinfo.secpath_cache.num_objs
      1123 ±  7%   +14.2%       1282 ±  3%  slabinfo.skbuff_fclone_cache.active_objs
      1123 ±  7%   +14.2%       1282 ±  3%  slabinfo.skbuff_fclone_cache.num_objs
      1.09           -0.0       1.07      perf-stat.branch-miss-rate%
 2.691e+11         -2.4%  2.628e+11       perf-stat.branch-misses
  71981351 ± 12%   -13.8%   62013509 ± 16%  perf-stat.context-switches
 1.697e+13         +1.1%  1.715e+13       perf-stat.dTLB-stores
      2.36 ± 29%     +4.4       6.76 ± 11%  perf-stat.iTLB-load-miss-rate%
  5.21e+08 ± 28%  +194.8%  1.536e+09 ± 10%  perf-stat.iTLB-load-misses
    239983 ± 24%   -68.4%      75819 ± 11%  perf-stat.instructions-per-iTLB-miss
   3295653 ±  2%    -6.3%    3088753 ±  3%  perf-stat.node-stores
    606239         +1.1%     612799       perf-stat.path-length
      3755 ± 28%   -37.5%       2346 ± 52%  sched_debug.cfs_rq:/.exec_clock.stddev
     10.45 ±  4%   +24.3%      12.98 ± 18%  sched_debug.cfs_rq:/.load_avg.stddev
      6243 ± 46%   -38.6%       3831 ± 78%  sched_debug.cpu.load.stddev
    867.80 ±  7%   +25.3%       1087 ±  6%  sched_debug.cpu.nr_load_updates.stddev
    395898 ±  3%   -11.1%     352071 ±  7%  sched_debug.cpu.nr_switches.max
    -13.33        -21.1%     -10.52       sched_debug.cpu.nr_uninterruptible.min
    395674 ±  3%   -11.1%     351762 ±  7%  sched_debug.cpu.sched_count.max
     33152 ±  4%   -12.8%      28899       sched_debug.cpu.ttwu_count.min
      0.03 ± 20%   +77.7%       0.05 ± 15%  sched_debug.rt_rq:/.rt_time.max
     89523         +1.8%      91099       proc-vmstat.nr_active_anon
    409.67 ± 23%   -18.4%     334.33 ± 30%  proc-vmstat.nr_mlock
     89530         +1.8%      91117       proc-vmstat.nr_zone_active_anon
   2337130         -2.2%    2286775       proc-vmstat.numa_hit
   2229090         -2.3%    2178626       proc-vmstat.numa_local
      8460 ± 39%   -75.5%       2076 ± 53%  proc-vmstat.numa_pages_migrated
     28643 ± 55%   -83.5%       4727 ± 58%  proc-vmstat.numa_pte_updates
   2695806         -1.8%    2646639       proc-vmstat.pgfault
   2330191         -2.1%    2281197       proc-vmstat.pgfree
      8460 ± 39%   -75.5%       2076 ± 53%  proc-vmstat.pgmigrate_success
    237651 ±  2%   +31.3%     312092 ± 16%  numa-meminfo.node0.FilePages
      8059 ±  2%   +10.7%       8925 ±  7%  numa-meminfo.node0.KernelStack
      6830 ± 25%   +48.8%      10164 ± 35%  numa-meminfo.node0.Mapped
      1612 ± 21%   +70.0%       2740 ± 19%  numa-meminfo.node0.PageTables
     10772 ± 65%  +679.4%      83962 ± 59%  numa-meminfo.node0.Shmem
    163195 ± 15%   -36.9%     103036 ± 32%  numa-meminfo.node1.Active
    163195 ± 15%   -36.9%     103036 ± 32%  numa-meminfo.node1.Active(anon)
      1730 ±  4%   +33.9%       2317 ± 14%  numa-meminfo.node1.PageTables
     55778 ± 19%   +32.5%      73910 ±  8%  numa-meminfo.node1.SUnreclaim
      2671 ± 16%   -45.0%       1469 ± 15%  numa-meminfo.node2.PageTables
     61537 ± 13%   -17.7%      50647 ±  3%  numa-meminfo.node2.SUnreclaim
     48644 ± 94%  +149.8%     121499 ± 11%  numa-meminfo.node3.Active
     48440 ± 94%  +150.4%     121295 ± 11%  numa-meminfo.node3.Active(anon)
     11832 ± 79%   -91.5%       1008 ± 67%  numa-meminfo.node3.Inactive
     11597 ± 82%   -93.3%     772.00 ± 82%  numa-meminfo.node3.Inactive(anon)
     10389 ± 32%   -43.0%       5921 ±  6%  numa-meminfo.node3.Mapped
     33704 ± 24%   -44.2%      18792 ± 15%  numa-meminfo.node3.SReclaimable
    104733 ± 14%   -25.3%      78275 ±  8%  numa-meminfo.node3.Slab
    139329 ±133%   -99.8%     241.67 ± 79%  latency_stats.avg.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_do_create.nfs3_proc_create.nfs_create.path_openat.do_filp_open.do_sys_open.do_syscall_64
      5403 ±139%   -97.5%     137.67 ± 71%  latency_stats.avg.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_proc_lookup.nfs_lookup.path_openat.do_filp_open.do_sys_open.do_syscall_64.entry_SYSCALL_64_after_hwframe
    165968 ±101%   -61.9%      63304 ± 58%  latency_stats.avg.max
     83.00        +12810.4%     10715 ±140%  latency_stats.avg.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_proc_access.nfs_do_access.nfs_permission.inode_permission.link_path_walk.path_lookupat.filename_lookup
    102.67 ±  6%  +18845.5%     19450 ±140%  latency_stats.avg.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_proc_getattr.__nfs_revalidate_inode.nfs_do_access.nfs_permission.inode_permission.link_path_walk.path_lookupat
    136.33 ± 16%  +25043.5%     34279 ±141%  latency_stats.avg.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_proc_lookup.nfs_lookup.__lookup_slow.lookup_slow.walk_component.path_lookupat.filename_lookup
     18497 ±141%  -100.0%       0.00      latency_stats.max.call_rwsem_down_write_failed_killable.vm_mmap_pgoff.ksys_mmap_pgoff.do_syscall_64.entry_SYSCALL_64_after_hwframe
    140500 ±131%   -99.8%     247.00 ± 78%  latency_stats.max.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_do_create.nfs3_proc_create.nfs_create.path_openat.do_filp_open.do_sys_open.do_syscall_64
      5403 ±139%   -97.5%     137.67 ± 71%  latency_stats.max.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_proc_lookup.nfs_lookup.path_openat.do_filp_open.do_sys_open.do_syscall_64.entry_SYSCALL_64_after_hwframe
     87.33 ±  5%  +23963.0%     21015 ±140%  latency_stats.max.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_proc_access.nfs_do_access.nfs_permission.inode_permission.link_path_walk.path_lookupat.filename_lookup
    136.33 ± 16%  +25043.5%     34279 ±141%
latency_stats.max.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_proc_lookup.nfs_lookup.__lookup_slow.lookup_slow.walk_component.path_lookupat.filename_lookup 149.33 ± 14% +25485.9% 38208 ±141% latency_stats.max.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_proc_getattr.__nfs_revalidate_inode.nfs_do_access.nfs_permission.inode_permission.link_path_walk.path_lookupat 18761 ±141% -100.0% 0.00 latency_stats.sum.call_rwsem_down_write_failed_killable.vm_mmap_pgoff.ksys_mmap_pgoff.do_syscall_64.entry_SYSCALL_64_after_hwframe 23363 ±114% -100.0% 0.00 latency_stats.sum.call_rwsem_down_read_failed.__do_page_fault.do_page_fault.page_fault.__get_user_8.exit_robust_list.mm_release.do_exit.do_group_exit.get_signal.do_signal.exit_to_usermode_loop 144810 ±125% -99.8% 326.67 ± 70% latency_stats.sum.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_do_create.nfs3_proc_create.nfs_create.path_openat.do_filp_open.do_sys_open.do_syscall_64 5403 ±139% -97.5% 137.67 ± 71% latency_stats.sum.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_proc_lookup.nfs_lookup.path_openat.do_filp_open.do_sys_open.do_syscall_64.entry_SYSCALL_64_after_hwframe 59698 ± 98% -78.0% 13110 ±141% latency_stats.sum.call_rwsem_down_read_failed.do_exit.do_group_exit.get_signal.do_signal.exit_to_usermode_loop.do_syscall_64.entry_SYSCALL_64_after_hwframe 166.33 +12768.5% 21404 ±140% latency_stats.sum.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_proc_access.nfs_do_access.nfs_permission.inode_permission.link_path_walk.path_lookupat.filename_lookup 825.00 ± 6% +18761.7% 155609 ±140% latency_stats.sum.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_proc_getattr.__nfs_revalidate_inode.nfs_do_access.nfs_permission.inode_permission.link_path_walk.path_lookupat 136.33 ± 16% +25043.5% 34279 ±141% 
latency_stats.sum.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_proc_lookup.nfs_lookup.__lookup_slow.lookup_slow.walk_component.path_lookupat.filename_lookup 59412 ± 2% +31.3% 78021 ± 16% numa-vmstat.node0.nr_file_pages 8059 ± 2% +10.7% 8923 ± 7% numa-vmstat.node0.nr_kernel_stack 1701 ± 25% +49.1% 2536 ± 35% numa-vmstat.node0.nr_mapped 402.33 ± 21% +70.0% 684.00 ± 19% numa-vmstat.node0.nr_page_table_pages 2692 ± 65% +679.5% 20988 ± 59% numa-vmstat.node0.nr_shmem 622587 ± 36% +37.7% 857545 ± 13% numa-vmstat.node0.numa_local 40797 ± 15% -36.9% 25757 ± 32% numa-vmstat.node1.nr_active_anon 432.00 ± 4% +33.9% 578.33 ± 14% numa-vmstat.node1.nr_page_table_pages 13944 ± 19% +32.5% 18477 ± 8% numa-vmstat.node1.nr_slab_unreclaimable 40797 ± 15% -36.9% 25757 ± 32% numa-vmstat.node1.nr_zone_active_anon 625073 ± 26% +29.4% 808657 ± 18% numa-vmstat.node1.numa_hit 503969 ± 34% +39.2% 701446 ± 23% numa-vmstat.node1.numa_local 137.33 ± 40% -49.0% 70.00 ± 29% numa-vmstat.node2.nr_mlock 667.67 ± 17% -45.1% 366.33 ± 15% numa-vmstat.node2.nr_page_table_pages 15384 ± 13% -17.7% 12662 ± 3% numa-vmstat.node2.nr_slab_unreclaimable 12114 ± 94% +150.3% 30326 ± 11% numa-vmstat.node3.nr_active_anon 2887 ± 83% -93.4% 190.00 ± 82% numa-vmstat.node3.nr_inactive_anon 2632 ± 30% -39.2% 1600 ± 5% numa-vmstat.node3.nr_mapped 101.00 -30.0% 70.67 ± 29% numa-vmstat.node3.nr_mlock 8425 ± 24% -44.2% 4697 ± 15% numa-vmstat.node3.nr_slab_reclaimable 12122 ± 94% +150.3% 30346 ± 11% numa-vmstat.node3.nr_zone_active_anon 2887 ± 83% -93.4% 190.00 ± 82% numa-vmstat.node3.nr_zone_inactive_anon 106945 ± 13% +17.4% 125554 numa-vmstat.node3.numa_other 4.17 -0.3 3.82 perf-profile.calltrace.cycles-pp.kmem_cache_alloc.do_brk_flags.__x64_sys_brk.do_syscall_64.entry_SYSCALL_64_after_hwframe 15.02 -0.3 14.77 perf-profile.calltrace.cycles-pp.syscall_return_via_sysret.brk 2.42 -0.2 2.18 
perf-profile.calltrace.cycles-pp.vma_compute_subtree_gap.__vma_link_rb.vma_link.do_brk_flags.__x64_sys_brk 7.60 -0.2 7.39 perf-profile.calltrace.cycles-pp.perf_event_mmap.do_brk_flags.__x64_sys_brk.do_syscall_64.entry_SYSCALL_64_after_hwframe 7.79 -0.2 7.63 perf-profile.calltrace.cycles-pp.unmap_vmas.unmap_region.do_munmap.__x64_sys_brk.do_syscall_64 0.82 ± 9% -0.1 0.68 perf-profile.calltrace.cycles-pp.__vm_enough_memory.do_brk_flags.__x64_sys_brk.do_syscall_64.entry_SYSCALL_64_after_hwframe 2.13 -0.1 2.00 perf-profile.calltrace.cycles-pp.vma_compute_subtree_gap.do_munmap.__x64_sys_brk.do_syscall_64.entry_SYSCALL_64_after_hwframe 1.05 -0.1 0.95 perf-profile.calltrace.cycles-pp.kmem_cache_free.remove_vma.do_munmap.__x64_sys_brk.do_syscall_64 7.31 -0.1 7.21 perf-profile.calltrace.cycles-pp.unmap_page_range.unmap_vmas.unmap_region.do_munmap.__x64_sys_brk 0.74 -0.1 0.67 perf-profile.calltrace.cycles-pp.sync_mm_rss.unmap_page_range.unmap_vmas.unmap_region.do_munmap 1.06 -0.1 1.00 perf-profile.calltrace.cycles-pp.memcpy_erms.strlcpy.perf_event_mmap.do_brk_flags.__x64_sys_brk 3.38 -0.1 3.33 perf-profile.calltrace.cycles-pp.get_unmapped_area.do_brk_flags.__x64_sys_brk.do_syscall_64.entry_SYSCALL_64_after_hwframe 1.05 -0.0 1.00 ± 2% perf-profile.calltrace.cycles-pp.__indirect_thunk_start.brk 2.34 -0.0 2.29 perf-profile.calltrace.cycles-pp.perf_iterate_sb.perf_event_mmap.do_brk_flags.__x64_sys_brk.do_syscall_64 1.64 -0.0 1.59 perf-profile.calltrace.cycles-pp.strlcpy.perf_event_mmap.do_brk_flags.__x64_sys_brk.do_syscall_64 1.89 -0.0 1.86 perf-profile.calltrace.cycles-pp.security_mmap_addr.get_unmapped_area.do_brk_flags.__x64_sys_brk.do_syscall_64 0.76 -0.0 0.73 perf-profile.calltrace.cycles-pp._raw_spin_lock.unmap_page_range.unmap_vmas.unmap_region.do_munmap 0.57 ± 2% -0.0 0.55 perf-profile.calltrace.cycles-pp.selinux_mmap_addr.security_mmap_addr.get_unmapped_area.do_brk_flags.__x64_sys_brk 0.54 ± 2% +0.0 0.56 perf-profile.calltrace.cycles-pp.do_brk_flags.brk 0.72 +0.0 0.76 ± 
2% perf-profile.calltrace.cycles-pp.do_munmap.brk 4.38 +0.1 4.43 perf-profile.calltrace.cycles-pp.find_vma.do_munmap.__x64_sys_brk.do_syscall_64.entry_SYSCALL_64_after_hwframe 1.96 +0.1 2.04 perf-profile.calltrace.cycles-pp.vmacache_find.find_vma.do_munmap.__x64_sys_brk.do_syscall_64 0.53 +0.2 0.68 perf-profile.calltrace.cycles-pp.__vma_link_rb.brk 2.21 +0.3 2.51 perf-profile.calltrace.cycles-pp.remove_vma.do_munmap.__x64_sys_brk.do_syscall_64.entry_SYSCALL_64_after_hwframe 64.44 +0.5 64.90 perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.brk 63.04 +0.5 63.54 perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.brk 60.37 +0.5 60.88 perf-profile.calltrace.cycles-pp.__x64_sys_brk.do_syscall_64.entry_SYSCALL_64_after_hwframe.brk 3.75 +0.5 4.29 perf-profile.calltrace.cycles-pp.vma_link.do_brk_flags.__x64_sys_brk.do_syscall_64.entry_SYSCALL_64_after_hwframe 0.00 +0.6 0.57 perf-profile.calltrace.cycles-pp.free_pgtables.unmap_region.do_munmap.__x64_sys_brk.do_syscall_64 0.00 +0.6 0.64 perf-profile.calltrace.cycles-pp.put_vma.remove_vma.do_munmap.__x64_sys_brk.do_syscall_64 0.72 +0.7 1.37 perf-profile.calltrace.cycles-pp.__vma_rb_erase.do_munmap.__x64_sys_brk.do_syscall_64.entry_SYSCALL_64_after_hwframe 24.42 +0.7 25.08 perf-profile.calltrace.cycles-pp.do_munmap.__x64_sys_brk.do_syscall_64.entry_SYSCALL_64_after_hwframe.brk 0.00 +0.7 0.71 perf-profile.calltrace.cycles-pp._raw_write_lock.__vma_rb_erase.do_munmap.__x64_sys_brk.do_syscall_64 3.12 +0.7 3.84 perf-profile.calltrace.cycles-pp.__vma_link_rb.vma_link.do_brk_flags.__x64_sys_brk.do_syscall_64 0.00 +0.8 0.77 perf-profile.calltrace.cycles-pp._raw_write_lock.__vma_link_rb.vma_link.do_brk_flags.__x64_sys_brk 0.00 +0.9 0.85 perf-profile.calltrace.cycles-pp.__vma_merge.do_brk_flags.__x64_sys_brk.do_syscall_64.entry_SYSCALL_64_after_hwframe 5.10 -0.5 4.60 perf-profile.children.cycles-pp.vma_compute_subtree_gap 4.53 -0.3 4.18 perf-profile.children.cycles-pp.kmem_cache_alloc 15.03 -0.3 
14.77 perf-profile.children.cycles-pp.syscall_return_via_sysret 8.13 -0.2 7.92 perf-profile.children.cycles-pp.perf_event_mmap 8.01 -0.2 7.81 perf-profile.children.cycles-pp.unmap_vmas 0.97 ± 14% -0.2 0.78 perf-profile.children.cycles-pp.__vm_enough_memory 1.13 -0.1 1.00 perf-profile.children.cycles-pp.kmem_cache_free 7.82 -0.1 7.70 perf-profile.children.cycles-pp.unmap_page_range 12.23 -0.1 12.13 perf-profile.children.cycles-pp.unmap_region 0.74 -0.1 0.67 perf-profile.children.cycles-pp.sync_mm_rss 3.06 -0.1 3.00 perf-profile.children.cycles-pp.down_write_killable 0.40 ± 2% -0.1 0.34 perf-profile.children.cycles-pp.__rb_insert_augmented 1.29 -0.1 1.23 perf-profile.children.cycles-pp.__indirect_thunk_start 2.54 -0.1 2.49 perf-profile.children.cycles-pp.perf_iterate_sb 3.66 -0.0 3.61 perf-profile.children.cycles-pp.get_unmapped_area 1.80 -0.0 1.75 perf-profile.children.cycles-pp.strlcpy 0.53 ± 2% -0.0 0.49 ± 2% perf-profile.children.cycles-pp.cap_capable 1.57 -0.0 1.53 perf-profile.children.cycles-pp.arch_get_unmapped_area_topdown 1.11 -0.0 1.08 perf-profile.children.cycles-pp.memcpy_erms 0.13 -0.0 0.10 perf-profile.children.cycles-pp.__vma_link_file 0.55 -0.0 0.52 perf-profile.children.cycles-pp.unmap_single_vma 1.47 -0.0 1.44 perf-profile.children.cycles-pp.cap_vm_enough_memory 2.14 -0.0 2.12 perf-profile.children.cycles-pp.security_mmap_addr 0.32 -0.0 0.30 perf-profile.children.cycles-pp.userfaultfd_unmap_complete 1.25 -0.0 1.23 perf-profile.children.cycles-pp.up_write 0.50 -0.0 0.49 perf-profile.children.cycles-pp.userfaultfd_unmap_prep 0.27 -0.0 0.26 perf-profile.children.cycles-pp.tlb_flush_mmu_free 1.14 -0.0 1.12 perf-profile.children.cycles-pp.__might_sleep 0.07 -0.0 0.06 perf-profile.children.cycles-pp.should_failslab 0.72 +0.0 0.74 perf-profile.children.cycles-pp._cond_resched 0.45 +0.0 0.47 perf-profile.children.cycles-pp.rcu_all_qs 0.15 ± 3% +0.0 0.17 ± 4% perf-profile.children.cycles-pp.__vma_link_list 0.15 ± 5% +0.0 0.18 ± 5% 
perf-profile.children.cycles-pp.tick_sched_timer 0.05 ± 8% +0.1 0.12 ± 17% perf-profile.children.cycles-pp.perf_mux_hrtimer_handler 0.80 +0.1 0.89 perf-profile.children.cycles-pp.free_pgtables 0.22 ± 7% +0.1 0.31 ± 9% perf-profile.children.cycles-pp.__hrtimer_run_queues 0.00 +0.1 0.11 ± 15% perf-profile.children.cycles-pp.clockevents_program_event 6.34 +0.1 6.47 perf-profile.children.cycles-pp.find_vma 2.27 +0.1 2.40 perf-profile.children.cycles-pp.vmacache_find 0.40 ± 4% +0.2 0.58 ± 5% perf-profile.children.cycles-pp.apic_timer_interrupt 0.40 ± 4% +0.2 0.58 ± 5% perf-profile.children.cycles-pp.smp_apic_timer_interrupt 0.37 ± 4% +0.2 0.54 ± 5% perf-profile.children.cycles-pp.hrtimer_interrupt 0.00 +0.2 0.19 ± 12% perf-profile.children.cycles-pp.ktime_get 2.42 +0.3 2.77 perf-profile.children.cycles-pp.remove_vma 64.49 +0.5 64.94 perf-profile.children.cycles-pp.entry_SYSCALL_64_after_hwframe 1.27 +0.5 1.73 perf-profile.children.cycles-pp.__vma_rb_erase 61.62 +0.5 62.10 perf-profile.children.cycles-pp.__x64_sys_brk 63.24 +0.5 63.74 perf-profile.children.cycles-pp.do_syscall_64 4.03 +0.5 4.56 perf-profile.children.cycles-pp.vma_link 0.00 +0.7 0.69 perf-profile.children.cycles-pp.put_vma 25.13 +0.7 25.84 perf-profile.children.cycles-pp.do_munmap 3.83 +0.7 4.56 perf-profile.children.cycles-pp.__vma_link_rb 0.00 +1.2 1.25 perf-profile.children.cycles-pp.__vma_merge 0.00 +1.5 1.53 perf-profile.children.cycles-pp._raw_write_lock 5.08 -0.5 4.58 perf-profile.self.cycles-pp.vma_compute_subtree_gap 15.03 -0.3 14.77 perf-profile.self.cycles-pp.syscall_return_via_sysret 0.59 -0.2 0.39 perf-profile.self.cycles-pp.remove_vma 0.72 ± 7% -0.1 0.58 perf-profile.self.cycles-pp.__vm_enough_memory 1.12 -0.1 0.99 perf-profile.self.cycles-pp.kmem_cache_free 3.11 -0.1 2.99 perf-profile.self.cycles-pp.do_munmap 0.99 -0.1 0.88 perf-profile.self.cycles-pp.__vma_rb_erase 3.63 -0.1 3.52 perf-profile.self.cycles-pp.perf_event_mmap 3.26 -0.1 3.17 perf-profile.self.cycles-pp.brk 0.41 ± 2% -0.1 0.33 
perf-profile.self.cycles-pp.unmap_vmas
      0.74            -0.1       0.67        perf-profile.self.cycles-pp.sync_mm_rss
      1.75            -0.1       1.68        perf-profile.self.cycles-pp.kmem_cache_alloc
      0.40 ±  2%      -0.1       0.34        perf-profile.self.cycles-pp.__rb_insert_augmented
      1.29 ±  2%      -0.1       1.23        perf-profile.self.cycles-pp.__indirect_thunk_start
      0.73            -0.0       0.68 ±  2%  perf-profile.self.cycles-pp.unmap_region
      0.53            -0.0       0.49        perf-profile.self.cycles-pp.vma_link
      1.40            -0.0       1.35        perf-profile.self.cycles-pp.entry_SYSCALL_64_after_hwframe
      5.22            -0.0       5.18        perf-profile.self.cycles-pp.unmap_page_range
      0.53 ±  2%      -0.0       0.49 ±  2%  perf-profile.self.cycles-pp.cap_capable
      1.11            -0.0       1.07        perf-profile.self.cycles-pp.memcpy_erms
      1.86            -0.0       1.82        perf-profile.self.cycles-pp.perf_iterate_sb
      1.30            -0.0       1.27        perf-profile.self.cycles-pp.arch_get_unmapped_area_topdown
      0.13            -0.0       0.10        perf-profile.self.cycles-pp.__vma_link_file
      0.55            -0.0       0.52        perf-profile.self.cycles-pp.unmap_single_vma
      0.74            -0.0       0.72        perf-profile.self.cycles-pp.selinux_mmap_addr
      0.32            -0.0       0.30        perf-profile.self.cycles-pp.userfaultfd_unmap_complete
      1.13            -0.0       1.12        perf-profile.self.cycles-pp.__might_sleep
      1.24            -0.0       1.23        perf-profile.self.cycles-pp.up_write
      0.50            -0.0       0.49        perf-profile.self.cycles-pp.userfaultfd_unmap_prep
      0.27            -0.0       0.26        perf-profile.self.cycles-pp.tlb_flush_mmu_free
      0.07            -0.0       0.06        perf-profile.self.cycles-pp.should_failslab
      0.45            +0.0       0.47        perf-profile.self.cycles-pp.rcu_all_qs
      0.71            +0.0       0.73        perf-profile.self.cycles-pp.strlcpy
      0.15 ±  3%      +0.0       0.17 ±  4%  perf-profile.self.cycles-pp.__vma_link_list
      0.51            +0.1       0.57        perf-profile.self.cycles-pp.free_pgtables
      1.40            +0.1       1.49        perf-profile.self.cycles-pp.__vma_link_rb
      2.27            +0.1       2.39        perf-profile.self.cycles-pp.vmacache_find
      0.00            +0.2       0.18 ± 12%  perf-profile.self.cycles-pp.ktime_get
      0.00            +0.7       0.69        perf-profile.self.cycles-pp.put_vma
      0.00            +1.2       1.24        perf-profile.self.cycles-pp.__vma_merge
      0.00            +1.5       1.52        perf-profile.self.cycles-pp._raw_write_lock

=========================================================================================
tbox_group/testcase/rootfs/kconfig/compiler/nr_task/thp_enabled/test/cpufreq_governor:
  lkp-skl-4sp1/will-it-scale/debian-x86_64-2018-04-03.cgz/x86_64-rhel-7.2/gcc-7/100%/always/page_fault2/performance

commit:
  ba98a1cdad71d259a194461b3a61471b49b14df1
  a7a8993bfe3ccb54ad468b9f1799649e4ad1ff12

ba98a1cdad71d259 a7a8993bfe3ccb54ad468b9f17
---------------- --------------------------
       fail:runs  %reproduction    fail:runs
           |             |             |
          :3           33%           1:3     dmesg.WARNING:at#for_ip_native_iret/0x
         1:3          -33%            :3     dmesg.WARNING:stack_going_in_the_wrong_direction?ip=__schedule/0x
          :3           33%           1:3     dmesg.WARNING:stack_going_in_the_wrong_direction?ip=__slab_free/0x
         1:3          -33%            :3     kmsg.DHCP/BOOTP:Reply_not_for_us_on_eth#,op[#]xid[#]
         3:3         -100%            :3     kmsg.pstore:crypto_comp_decompress_failed,ret=
         3:3         -100%            :3     kmsg.pstore:decompression_failed
         2:3            4%           2:3     perf-profile.calltrace.cycles-pp.sync_regs.error_entry
         5:3            7%           5:3     perf-profile.calltrace.cycles-pp.error_entry
         5:3            7%           5:3     perf-profile.children.cycles-pp.error_entry
         2:3            3%           2:3     perf-profile.self.cycles-pp.error_entry
         %stddev     %change         %stddev
             \          |                \
      8281 ±  2%     -18.8%       6728        will-it-scale.per_thread_ops
     92778 ±  2%     +17.6%     109080        will-it-scale.time.involuntary_context_switches
  21954366 ±  3%      +4.1%   22857988 ±  2%  will-it-scale.time.maximum_resident_set_size
  4.81e+08 ±  2%     -18.9%  3.899e+08        will-it-scale.time.minor_page_faults
      5804           +12.2%       6512        will-it-scale.time.percent_of_cpu_this_job_got
     34918           +12.2%      39193        will-it-scale.time.system_time
   5638528 ±  2%     -15.3%    4778392        will-it-scale.time.voluntary_context_switches
  15846405            -2.0%   15531034        will-it-scale.workload
   2818137            +1.5%    2861500        interrupts.CAL:Function_call_interrupts
      3.33 ± 28%     -60.0%       1.33 ± 93%  irq_exception_noise.irq_time
      2866           +23.9%       3552 ±  2%  kthread_noise.total_time
   5589674 ± 14%     +31.4%    7344810 ±  6%  meminfo.DirectMap2M
     31169           -16.9%      25906        uptime.idle
     25242 ±  4%     -14.2%      21654 ±  6%  vmstat.system.cs
      7055           -11.6%       6237        boot-time.idle
     21.12           +19.3%      25.19 ±  9%  boot-time.kernel_boot
     20.03 ±  2%      -3.7       16.38        mpstat.cpu.idle%
      0.00 ±  8%      -0.0       0.00 ±  4%
mpstat.cpu.iowait% 7284147 ± 2% -16.4% 6092495 softirqs.RCU 5350756 ± 2% -10.9% 4769417 ± 4% softirqs.SCHED 42933 ± 21% -28.2% 30807 ± 7% numa-meminfo.node2.SReclaimable 63219 ± 13% -16.6% 52717 ± 6% numa-meminfo.node2.SUnreclaim 106153 ± 16% -21.3% 83525 ± 5% numa-meminfo.node2.Slab 247154 ± 4% -7.6% 228415 numa-meminfo.node3.Unevictable 11904 ± 4% +17.1% 13945 ± 8% numa-vmstat.node0 2239 ± 22% -26.6% 1644 ± 2% numa-vmstat.node2.nr_mapped 10728 ± 21% -28.2% 7701 ± 7% numa-vmstat.node2.nr_slab_reclaimable 15803 ± 13% -16.6% 13179 ± 6% numa-vmstat.node2.nr_slab_unreclaimable 61788 ± 4% -7.6% 57103 numa-vmstat.node3.nr_unevictable 61788 ± 4% -7.6% 57103 numa-vmstat.node3.nr_zone_unevictable 92778 ± 2% +17.6% 109080 time.involuntary_context_switches 21954366 ± 3% +4.1% 22857988 ± 2% time.maximum_resident_set_size 4.81e+08 ± 2% -18.9% 3.899e+08 time.minor_page_faults 5804 +12.2% 6512 time.percent_of_cpu_this_job_got 34918 +12.2% 39193 time.system_time 5638528 ± 2% -15.3% 4778392 time.voluntary_context_switches 3942289 ± 2% -10.5% 3528902 ± 2% cpuidle.C1.time 242290 -14.2% 207992 cpuidle.C1.usage 1.64e+09 ± 2% -15.7% 1.381e+09 cpuidle.C1E.time 4621281 ± 2% -14.7% 3939757 cpuidle.C1E.usage 2.115e+10 ± 2% -18.5% 1.723e+10 cpuidle.C6.time 24771099 ± 2% -18.0% 20305766 cpuidle.C6.usage 1210810 ± 4% -17.6% 997270 ± 2% cpuidle.POLL.time 18742 ± 3% -17.0% 15559 ± 2% cpuidle.POLL.usage 4135 ±141% -100.0% 0.00 latency_stats.avg.x86_reserve_hardware.x86_pmu_event_init.perf_try_init_event.perf_event_alloc.__do_sys_perf_event_open.do_syscall_64.entry_SYSCALL_64_after_hwframe 33249 ±129% -100.0% 0.00 latency_stats.max.call_rwsem_down_read_failed.m_start.seq_read.__vfs_read.vfs_read.ksys_read.do_syscall_64.entry_SYSCALL_64_after_hwframe 4135 ±141% -100.0% 0.00 latency_stats.max.x86_reserve_hardware.x86_pmu_event_init.perf_try_init_event.perf_event_alloc.__do_sys_perf_event_open.do_syscall_64.entry_SYSCALL_64_after_hwframe 65839 ±116% -100.0% 0.00 
latency_stats.sum.call_rwsem_down_read_failed.m_start.seq_read.__vfs_read.vfs_read.ksys_read.do_syscall_64.entry_SYSCALL_64_after_hwframe 4135 ±141% -100.0% 0.00 latency_stats.sum.x86_reserve_hardware.x86_pmu_event_init.perf_try_init_event.perf_event_alloc.__do_sys_perf_event_open.do_syscall_64.entry_SYSCALL_64_after_hwframe 8387 ±122% -90.9% 767.00 ± 13% latency_stats.sum.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_proc_getattr.__nfs_revalidate_inode.nfs_do_access.nfs_permission.inode_permission.link_path_walk.path_lookupat 263970 ± 10% -68.6% 82994 ± 3% latency_stats.sum.do_syslog.kmsg_read.proc_reg_read.__vfs_read.vfs_read.ksys_read.do_syscall_64.entry_SYSCALL_64_after_hwframe 6173 ± 77% +173.3% 16869 ± 98% latency_stats.sum.pipe_write.__vfs_write.vfs_write.ksys_write.do_syscall_64.entry_SYSCALL_64_after_hwframe 101.33 -4.6% 96.67 proc-vmstat.nr_anon_transparent_hugepages 39967 -1.8% 39241 proc-vmstat.nr_slab_reclaimable 67166 -2.4% 65522 proc-vmstat.nr_slab_unreclaimable 237743 -3.9% 228396 proc-vmstat.nr_unevictable 237743 -3.9% 228396 proc-vmstat.nr_zone_unevictable 4.807e+09 -2.0% 4.71e+09 proc-vmstat.numa_hit 4.807e+09 -2.0% 4.71e+09 proc-vmstat.numa_local 4.791e+09 -2.1% 4.69e+09 proc-vmstat.pgalloc_normal 4.783e+09 -2.0% 4.685e+09 proc-vmstat.pgfault 4.807e+09 -2.0% 4.709e+09 proc-vmstat.pgfree 1753 +4.6% 1833 turbostat.Avg_MHz 239445 -14.1% 205783 turbostat.C1 4617105 ± 2% -14.8% 3934693 turbostat.C1E 1.40 ± 2% -0.2 1.18 turbostat.C1E% 24764661 ± 2% -18.0% 20297643 turbostat.C6 18.09 ± 2% -3.4 14.74 turbostat.C6% 7.53 ± 2% -17.1% 6.24 turbostat.CPU%c1 11.88 ± 2% -19.1% 9.61 turbostat.CPU%c6 7.62 ± 3% -20.8% 6.04 turbostat.Pkg%pc2 388.30 +1.5% 393.93 turbostat.PkgWatt 390974 ± 8% +35.8% 530867 ± 11% sched_debug.cfs_rq:/.min_vruntime.stddev -1754042 +75.7% -3081270 sched_debug.cfs_rq:/.spread0.min 388140 ± 8% +36.2% 528494 ± 11% sched_debug.cfs_rq:/.spread0.stddev 542.30 ± 3% -10.0% 488.21 ± 3% 
sched_debug.cfs_rq:/.util_avg.min 53.35 ± 16% +48.7% 79.35 ± 12% sched_debug.cfs_rq:/.util_est_enqueued.avg 30520 ± 6% -15.2% 25883 ± 12% sched_debug.cpu.nr_switches.avg 473770 ± 27% -37.4% 296623 ± 32% sched_debug.cpu.nr_switches.max 17077 ± 2% -15.1% 14493 sched_debug.cpu.nr_switches.min 30138 ± 6% -15.0% 25606 ± 12% sched_debug.cpu.sched_count.avg 472345 ± 27% -37.2% 296419 ± 32% sched_debug.cpu.sched_count.max 16858 ± 2% -15.2% 14299 sched_debug.cpu.sched_count.min 8358 ± 2% -15.5% 7063 sched_debug.cpu.sched_goidle.avg 12225 -13.6% 10565 sched_debug.cpu.sched_goidle.max 8032 ± 2% -16.0% 6749 sched_debug.cpu.sched_goidle.min 14839 ± 6% -15.3% 12568 ± 12% sched_debug.cpu.ttwu_count.avg 235115 ± 28% -38.3% 145175 ± 31% sched_debug.cpu.ttwu_count.max 7627 ± 3% -15.9% 6413 ± 2% sched_debug.cpu.ttwu_count.min 226299 ± 29% -39.5% 136827 ± 32% sched_debug.cpu.ttwu_local.max 0.85 -0.0 0.81 perf-stat.branch-miss-rate% 3.675e+10 -4.1% 3.523e+10 perf-stat.branch-misses 4.052e+11 -2.3% 3.958e+11 perf-stat.cache-misses 7.008e+11 -2.5% 6.832e+11 perf-stat.cache-references 15320995 ± 4% -14.3% 13136557 ± 6% perf-stat.context-switches 9.16 +4.8% 9.59 perf-stat.cpi 2.03e+14 +4.6% 2.124e+14 perf-stat.cpu-cycles 44508 -1.7% 43743 perf-stat.cpu-migrations 1.30 -0.1 1.24 perf-stat.dTLB-store-miss-rate% 4.064e+10 -3.5% 3.922e+10 perf-stat.dTLB-store-misses 3.086e+12 +1.1% 3.119e+12 perf-stat.dTLB-stores 3.611e+08 ± 6% -8.5% 3.304e+08 ± 5% perf-stat.iTLB-loads 0.11 -4.6% 0.10 perf-stat.ipc 4.783e+09 -2.0% 4.685e+09 perf-stat.minor-faults 1.53 ± 2% -0.3 1.22 ± 8% perf-stat.node-load-miss-rate% 1.389e+09 ± 3% -22.1% 1.083e+09 ± 9% perf-stat.node-load-misses 8.922e+10 -1.9% 8.75e+10 perf-stat.node-loads 5.06 +1.7 6.77 ± 3% perf-stat.node-store-miss-rate% 1.204e+09 +29.3% 1.556e+09 ± 3% perf-stat.node-store-misses 2.256e+10 -5.1% 2.142e+10 ± 2% perf-stat.node-stores 4.783e+09 -2.0% 4.685e+09 perf-stat.page-faults 1399242 +1.9% 1425404 perf-stat.path-length 1144 ± 8% -13.6% 988.00 ± 8% 
slabinfo.Acpi-ParseExt.active_objs 1144 ± 8% -13.6% 988.00 ± 8% slabinfo.Acpi-ParseExt.num_objs 1878 ± 17% +29.0% 2422 ± 16% slabinfo.dmaengine-unmap-16.active_objs 1878 ± 17% +29.0% 2422 ± 16% slabinfo.dmaengine-unmap-16.num_objs 1085 ± 5% -24.1% 823.33 ± 9% slabinfo.file_lock_cache.active_objs 1085 ± 5% -24.1% 823.33 ± 9% slabinfo.file_lock_cache.num_objs 61584 ± 4% -16.6% 51381 ± 5% slabinfo.filp.active_objs 967.00 ± 4% -16.5% 807.67 ± 5% slabinfo.filp.active_slabs 61908 ± 4% -16.5% 51713 ± 5% slabinfo.filp.num_objs 967.00 ± 4% -16.5% 807.67 ± 5% slabinfo.filp.num_slabs 1455 -15.4% 1232 ± 4% slabinfo.nsproxy.active_objs 1455 -15.4% 1232 ± 4% slabinfo.nsproxy.num_objs 84720 ± 6% -18.3% 69210 ± 4% slabinfo.pid.active_objs 1324 ± 6% -18.2% 1083 ± 4% slabinfo.pid.active_slabs 84820 ± 5% -18.2% 69386 ± 4% slabinfo.pid.num_objs 1324 ± 6% -18.2% 1083 ± 4% slabinfo.pid.num_slabs 2112 ± 18% -26.3% 1557 ± 5% slabinfo.scsi_sense_cache.active_objs 2112 ± 18% -26.3% 1557 ± 5% slabinfo.scsi_sense_cache.num_objs 5018 ± 5% -7.6% 4635 ± 4% slabinfo.sock_inode_cache.active_objs 5018 ± 5% -7.6% 4635 ± 4% slabinfo.sock_inode_cache.num_objs 1193 ± 4% +13.8% 1358 ± 4% slabinfo.task_group.active_objs 1193 ± 4% +13.8% 1358 ± 4% slabinfo.task_group.num_objs 62807 ± 3% -14.4% 53757 ± 3% slabinfo.vm_area_struct.active_objs 1571 ± 3% -12.1% 1381 ± 3% slabinfo.vm_area_struct.active_slabs 62877 ± 3% -14.3% 53880 ± 3% slabinfo.vm_area_struct.num_objs 1571 ± 3% -12.1% 1381 ± 3% slabinfo.vm_area_struct.num_slabs 47.45 -47.4 0.00 perf-profile.calltrace.cycles-pp.alloc_pages_vma.__handle_mm_fault.handle_mm_fault.__do_page_fault.do_page_fault 47.16 -47.2 0.00 perf-profile.calltrace.cycles-pp.__alloc_pages_nodemask.alloc_pages_vma.__handle_mm_fault.handle_mm_fault.__do_page_fault 46.99 -47.0 0.00 perf-profile.calltrace.cycles-pp.get_page_from_freelist.__alloc_pages_nodemask.alloc_pages_vma.__handle_mm_fault.handle_mm_fault 44.95 -44.9 0.00 
perf-profile.calltrace.cycles-pp._raw_spin_lock.get_page_from_freelist.__alloc_pages_nodemask.alloc_pages_vma.__handle_mm_fault 7.42 ± 2% -7.4 0.00 perf-profile.calltrace.cycles-pp.copy_page.__handle_mm_fault.handle_mm_fault.__do_page_fault.do_page_fault 6.32 ± 10% -6.3 0.00 perf-profile.calltrace.cycles-pp.finish_fault.__handle_mm_fault.handle_mm_fault.__do_page_fault.do_page_fault 6.28 ± 10% -6.3 0.00 perf-profile.calltrace.cycles-pp.alloc_set_pte.finish_fault.__handle_mm_fault.handle_mm_fault.__do_page_fault 0.00 +0.9 0.85 ± 11% perf-profile.calltrace.cycles-pp._raw_spin_lock.pte_map_lock.alloc_set_pte.finish_fault.handle_pte_fault 0.00 +0.9 0.92 ± 4% perf-profile.calltrace.cycles-pp.__list_del_entry_valid.get_page_from_freelist.__alloc_pages_nodemask.alloc_pages_vma.handle_pte_fault 0.00 +1.1 1.13 ± 7% perf-profile.calltrace.cycles-pp.find_lock_entry.shmem_getpage_gfp.shmem_fault.__do_fault.handle_pte_fault 0.00 +1.2 1.19 ± 7% perf-profile.calltrace.cycles-pp.shmem_getpage_gfp.shmem_fault.__do_fault.handle_pte_fault.__handle_mm_fault 0.00 +1.2 1.22 ± 5% perf-profile.calltrace.cycles-pp.pte_map_lock.alloc_set_pte.finish_fault.handle_pte_fault.__handle_mm_fault 0.00 +1.3 1.34 ± 7% perf-profile.calltrace.cycles-pp.shmem_fault.__do_fault.handle_pte_fault.__handle_mm_fault.handle_mm_fault 0.00 +1.4 1.36 ± 7% perf-profile.calltrace.cycles-pp.__do_fault.handle_pte_fault.__handle_mm_fault.handle_mm_fault.__do_page_fault 0.00 +4.5 4.54 ± 19% perf-profile.calltrace.cycles-pp.pagevec_lru_move_fn.__lru_cache_add.alloc_set_pte.finish_fault.handle_pte_fault 0.00 +4.6 4.64 ± 19% perf-profile.calltrace.cycles-pp.__lru_cache_add.alloc_set_pte.finish_fault.handle_pte_fault.__handle_mm_fault 0.00 +6.6 6.64 ± 15% perf-profile.calltrace.cycles-pp.alloc_set_pte.finish_fault.handle_pte_fault.__handle_mm_fault.handle_mm_fault 0.00 +6.7 6.68 ± 15% perf-profile.calltrace.cycles-pp.finish_fault.handle_pte_fault.__handle_mm_fault.handle_mm_fault.__do_page_fault 0.00 +7.5 7.54 ± 5% 
perf-profile.calltrace.cycles-pp.copy_page.handle_pte_fault.__handle_mm_fault.handle_mm_fault.__do_page_fault 0.00 +44.6 44.55 ± 3% perf-profile.calltrace.cycles-pp._raw_spin_lock.get_page_from_freelist.__alloc_pages_nodemask.alloc_pages_vma.handle_pte_fault 0.00 +46.6 46.63 ± 3% perf-profile.calltrace.cycles-pp.get_page_from_freelist.__alloc_pages_nodemask.alloc_pages_vma.handle_pte_fault.__handle_mm_fault 0.00 +46.8 46.81 ± 3% perf-profile.calltrace.cycles-pp.__alloc_pages_nodemask.alloc_pages_vma.handle_pte_fault.__handle_mm_fault.handle_mm_fault 0.00 +47.1 47.10 ± 3% perf-profile.calltrace.cycles-pp.alloc_pages_vma.handle_pte_fault.__handle_mm_fault.handle_mm_fault.__do_page_fault 0.00 +63.1 63.15 perf-profile.calltrace.cycles-pp.handle_pte_fault.__handle_mm_fault.handle_mm_fault.__do_page_fault.do_page_fault 0.39 ± 3% +0.0 0.42 ± 3% perf-profile.children.cycles-pp.radix_tree_lookup_slot 0.21 ± 3% +0.0 0.25 ± 5% perf-profile.children.cycles-pp.__mod_node_page_state 0.00 +0.1 0.06 ± 8% perf-profile.children.cycles-pp.get_vma_policy 0.00 +0.1 0.08 ± 5% perf-profile.children.cycles-pp.__lru_cache_add_active_or_unevictable 0.00 +0.2 0.18 ± 6% perf-profile.children.cycles-pp.__page_add_new_anon_rmap 0.00 +1.4 1.35 ± 5% perf-profile.children.cycles-pp.pte_map_lock 0.00 +63.2 63.21 perf-profile.children.cycles-pp.handle_pte_fault 1.40 ± 2% -0.4 1.03 ± 10% perf-profile.self.cycles-pp._raw_spin_lock 0.56 ± 3% -0.2 0.35 ± 6% perf-profile.self.cycles-pp.__handle_mm_fault 0.22 ± 3% -0.0 0.18 ± 7% perf-profile.self.cycles-pp.alloc_set_pte 0.09 +0.0 0.10 ± 4% perf-profile.self.cycles-pp.vmacache_find 0.39 ± 2% +0.0 0.41 ± 3% perf-profile.self.cycles-pp.__radix_tree_lookup 0.18 +0.0 0.20 ± 6% perf-profile.self.cycles-pp.mem_cgroup_charge_statistics 0.17 ± 2% +0.0 0.20 ± 7% perf-profile.self.cycles-pp.___might_sleep 0.33 ± 2% +0.0 0.36 ± 6% perf-profile.self.cycles-pp.handle_mm_fault 0.20 ± 2% +0.0 0.24 ± 3% perf-profile.self.cycles-pp.__mod_node_page_state 0.00 +0.1 0.05 
perf-profile.self.cycles-pp.finish_fault
      0.00            +0.1       0.05        perf-profile.self.cycles-pp.get_vma_policy
      0.00            +0.1       0.08 ± 10%  perf-profile.self.cycles-pp.__lru_cache_add_active_or_unevictable
      0.00            +0.2       0.25 ±  5%  perf-profile.self.cycles-pp.handle_pte_fault
      0.00            +0.5       0.49 ±  8%  perf-profile.self.cycles-pp.pte_map_lock

=========================================================================================
tbox_group/testcase/rootfs/kconfig/compiler/nr_task/thp_enabled/test/cpufreq_governor:
  lkp-skl-4sp1/will-it-scale/debian-x86_64-2018-04-03.cgz/x86_64-rhel-7.2/gcc-7/100%/never/page_fault2/performance

commit:
  ba98a1cdad71d259a194461b3a61471b49b14df1
  a7a8993bfe3ccb54ad468b9f1799649e4ad1ff12

ba98a1cdad71d259 a7a8993bfe3ccb54ad468b9f17
---------------- --------------------------
       fail:runs  %reproduction    fail:runs
           |             |             |
         1:3          -33%            :3     kmsg.DHCP/BOOTP:Reply_not_for_us_on_eth#,op[#]xid[#]
          :3           33%           1:3     dmesg.WARNING:stack_going_in_the_wrong_direction?ip=sched_slice/0x
         1:3          -33%            :3     dmesg.WARNING:stack_going_in_the_wrong_direction?ip=schedule_tail/0x
         1:3           24%           2:3     perf-profile.calltrace.cycles-pp.sync_regs.error_entry
         3:3           46%           5:3     perf-profile.calltrace.cycles-pp.error_entry
         5:3           -9%           5:3     perf-profile.children.cycles-pp.error_entry
         2:3           -4%           2:3     perf-profile.self.cycles-pp.error_entry
         %stddev     %change         %stddev
             \          |                \
      8147           -18.8%       6613        will-it-scale.per_thread_ops
     93113           +17.0%     108982        will-it-scale.time.involuntary_context_switches
 4.732e+08           -19.0%  3.833e+08        will-it-scale.time.minor_page_faults
      5854           +12.0%       6555        will-it-scale.time.percent_of_cpu_this_job_got
     35247           +12.1%      39495        will-it-scale.time.system_time
   5546661           -15.5%    4689314        will-it-scale.time.voluntary_context_switches
  15801637            -1.9%   15504487        will-it-scale.workload
      1.43 ± 11%     -59.7%       0.58 ± 28%  irq_exception_noise.__do_page_fault.min
      2811 ±  3%     +23.7%       3477 ±  3%  kthread_noise.total_time
    292776 ±  5%     +39.6%     408829 ± 21%  meminfo.DirectMap4k
     19.80            -3.7       16.12        mpstat.cpu.idle%
     29940           -14.5%      25593        uptime.idle
     24064 ±  3%      -8.5%      22016        vmstat.system.cs
     34.86
-1.9% 34.19 boot-time.boot 26.95 -2.8% 26.19 ± 2% boot-time.kernel_boot 7190569 ± 2% -15.2% 6100136 ± 3% softirqs.RCU 5513663 -13.8% 4751548 softirqs.SCHED 18064 ± 2% +24.3% 22461 ± 7% numa-vmstat.node0.nr_slab_unreclaimable 8507 ± 12% -16.8% 7075 ± 4% numa-vmstat.node2.nr_slab_reclaimable 18719 ± 9% -19.6% 15043 ± 4% numa-vmstat.node3.nr_slab_unreclaimable 72265 ± 2% +24.3% 89855 ± 7% numa-meminfo.node0.SUnreclaim 115980 ± 4% +22.6% 142233 ± 12% numa-meminfo.node0.Slab 34035 ± 12% -16.8% 28307 ± 4% numa-meminfo.node2.SReclaimable 74888 ± 9% -19.7% 60162 ± 4% numa-meminfo.node3.SUnreclaim 93113 +17.0% 108982 time.involuntary_context_switches 4.732e+08 -19.0% 3.833e+08 time.minor_page_faults 5854 +12.0% 6555 time.percent_of_cpu_this_job_got 35247 +12.1% 39495 time.system_time 5546661 -15.5% 4689314 time.voluntary_context_switches 4.792e+09 -1.9% 4.699e+09 proc-vmstat.numa_hit 4.791e+09 -1.9% 4.699e+09 proc-vmstat.numa_local 40447 ± 11% +13.2% 45804 ± 6% proc-vmstat.pgactivate 4.778e+09 -1.9% 4.688e+09 proc-vmstat.pgalloc_normal 4.767e+09 -1.9% 4.675e+09 proc-vmstat.pgfault 4.791e+09 -1.9% 4.699e+09 proc-vmstat.pgfree 230178 ± 2% -10.1% 206883 ± 3% cpuidle.C1.usage 1.617e+09 -15.0% 1.375e+09 cpuidle.C1E.time 4514401 -14.1% 3878206 cpuidle.C1E.usage 2.087e+10 -18.5% 1.701e+10 cpuidle.C6.time 24458365 -18.0% 20045336 cpuidle.C6.usage 1163758 -16.1% 976094 ± 4% cpuidle.POLL.time 17907 -14.6% 15294 ± 4% cpuidle.POLL.usage 1758 +4.5% 1838 turbostat.Avg_MHz 227522 ± 2% -10.2% 204426 ± 3% turbostat.C1 4512700 -14.2% 3873264 turbostat.C1E 1.39 -0.2 1.18 turbostat.C1E% 24452583 -18.0% 20039031 turbostat.C6 17.85 -3.3 14.55 turbostat.C6% 7.44 -16.8% 6.19 turbostat.CPU%c1 11.72 -19.3% 9.45 turbostat.CPU%c6 7.51 -21.3% 5.91 turbostat.Pkg%pc2 389.33 +1.6% 395.59 turbostat.PkgWatt 559.33 ± 13% -17.9% 459.33 ± 20% slabinfo.dmaengine-unmap-128.active_objs 559.33 ± 13% -17.9% 459.33 ± 20% slabinfo.dmaengine-unmap-128.num_objs 57734 ± 3% -5.7% 54421 ± 4% slabinfo.filp.active_objs 
    905.67 ±  3%      -5.6%     854.67 ±  4%  slabinfo.filp.active_slabs
     57981 ±  3%      -5.6%      54720 ±  4%  slabinfo.filp.num_objs
    905.67 ±  3%      -5.6%     854.67 ±  4%  slabinfo.filp.num_slabs
      1378           -12.0%       1212 ±  7%  slabinfo.nsproxy.active_objs
      1378           -12.0%       1212 ±  7%  slabinfo.nsproxy.num_objs
    507.33 ±  7%     -26.8%     371.33 ±  2%  slabinfo.secpath_cache.active_objs
    507.33 ±  7%     -26.8%     371.33 ±  2%  slabinfo.secpath_cache.num_objs
      4788 ±  5%      -8.3%       4391 ±  2%  slabinfo.sock_inode_cache.active_objs
      4788 ±  5%      -8.3%       4391 ±  2%  slabinfo.sock_inode_cache.num_objs
      1431 ±  8%     -12.3%       1255 ±  3%  slabinfo.task_group.active_objs
      1431 ±  8%     -12.3%       1255 ±  3%  slabinfo.task_group.num_objs
      4.27 ± 17%     +27.0%       5.42 ±  7%  sched_debug.cfs_rq:/.runnable_load_avg.avg
     13.44 ± 62%     +73.6%      23.33 ± 24%  sched_debug.cfs_rq:/.runnable_load_avg.stddev
    772.55 ± 21%     -32.7%     520.27 ±  4%  sched_debug.cfs_rq:/.util_est_enqueued.max
      4.39 ± 15%     +29.0%       5.66 ± 11%  sched_debug.cpu.cpu_load[0].avg
    152.09 ± 72%     +83.9%     279.67 ± 33%  sched_debug.cpu.cpu_load[0].max
     13.84 ± 58%     +78.7%      24.72 ± 29%  sched_debug.cpu.cpu_load[0].stddev
      4.53 ± 14%     +25.8%       5.70 ± 10%  sched_debug.cpu.cpu_load[1].avg
    156.58 ± 66%     +76.6%     276.58 ± 33%  sched_debug.cpu.cpu_load[1].max
     14.02 ± 55%     +72.4%      24.17 ± 28%  sched_debug.cpu.cpu_load[1].stddev
      4.87 ± 11%     +17.3%       5.72 ±  9%  sched_debug.cpu.cpu_load[2].avg
      1.58 ±  2%     +13.5%       1.79 ±  6%  sched_debug.cpu.nr_running.max
     16694           -14.6%      14259        sched_debug.cpu.nr_switches.min
     31989 ± 13%     +20.6%      38584 ±  6%  sched_debug.cpu.nr_switches.stddev
     16505           -14.8%      14068        sched_debug.cpu.sched_count.min
     32084 ± 13%     +19.9%      38482 ±  6%  sched_debug.cpu.sched_count.stddev
      8185           -15.0%       6957        sched_debug.cpu.sched_goidle.avg
     12151 ±  2%     -13.5%      10507        sched_debug.cpu.sched_goidle.max
      7867           -15.7%       6631        sched_debug.cpu.sched_goidle.min
      7595           -16.1%       6375        sched_debug.cpu.ttwu_count.min
     15873 ± 13%     +21.2%      19239 ±  6%  sched_debug.cpu.ttwu_count.stddev
      5244 ± 17%     +17.0%       6134 ±  5%  sched_debug.cpu.ttwu_local.avg
     15646 ± 12%     +21.5%      19008 ±  6%  sched_debug.cpu.ttwu_local.stddev
      0.85            -0.0        0.81        perf-stat.branch-miss-rate%
 3.689e+10            -4.6%  3.518e+10        perf-stat.branch-misses
     57.39            +0.6       58.00        perf-stat.cache-miss-rate%
 4.014e+11            -1.2%  3.967e+11        perf-stat.cache-misses
 6.994e+11            -2.2%   6.84e+11        perf-stat.cache-references
  14605393 ±  3%      -8.5%   13369913        perf-stat.context-switches
      9.21            +4.5%       9.63        perf-stat.cpi
 2.037e+14            +4.6%   2.13e+14        perf-stat.cpu-cycles
     44424            -2.0%      43541        perf-stat.cpu-migrations
      1.29            -0.1        1.24        perf-stat.dTLB-store-miss-rate%
 4.018e+10            -2.8%  3.905e+10        perf-stat.dTLB-store-misses
 3.071e+12            +1.4%  3.113e+12        perf-stat.dTLB-stores
     93.04            +1.5       94.51        perf-stat.iTLB-load-miss-rate%
 4.946e+09           +19.3%  5.903e+09 ±  5%  perf-stat.iTLB-load-misses
 3.702e+08            -7.5%  3.423e+08 ±  2%  perf-stat.iTLB-loads
      4470           -15.9%       3760 ±  5%  perf-stat.instructions-per-iTLB-miss
      0.11            -4.3%       0.10        perf-stat.ipc
 4.767e+09            -1.9%  4.675e+09        perf-stat.minor-faults
      1.46 ±  4%      -0.1        1.33 ±  9%  perf-stat.node-load-miss-rate%
      4.91            +1.7        6.65 ±  2%  perf-stat.node-store-miss-rate%
 1.195e+09           +32.8%  1.587e+09 ±  2%  perf-stat.node-store-misses
 2.313e+10            -3.7%  2.227e+10        perf-stat.node-stores
 4.767e+09            -1.9%  4.675e+09        perf-stat.page-faults
   1399047            +2.0%    1427115        perf-stat.path-length
      8908 ± 73%    -100.0%       0.00        latency_stats.avg.call_rwsem_down_read_failed.m_start.seq_read.__vfs_read.vfs_read.ksys_read.do_syscall_64.entry_SYSCALL_64_after_hwframe
      3604 ±141%    -100.0%       0.00        latency_stats.avg.call_rwsem_down_write_failed.do_unlinkat.do_syscall_64.entry_SYSCALL_64_after_hwframe
     61499 ±130%     -92.6%       4534 ± 16%  latency_stats.avg.expand_files.__alloc_fd.do_sys_open.do_syscall_64.entry_SYSCALL_64_after_hwframe
      4391 ±138%     -70.9%       1277 ±129%  latency_stats.avg.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_proc_lookup.nfs_lookup_revalidate.lookup_fast.walk_component.link_path_walk.path_lookupat.filename_lookup
     67311 ±112%     -48.5%      34681 ± 36%  latency_stats.avg.max
      3956 ±138%    +320.4%      16635 ±140%  latency_stats.avg.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_proc_getattr.__nfs_revalidate_inode.nfs_do_access.nfs_permission.inode_permission.link_path_walk.path_lookupat
    164.67 ± 30%   +7264.0%      12126 ±138%  latency_stats.avg.flush_work.fsnotify_destroy_group.inotify_release.__fput.task_work_run.exit_to_usermode_loop.do_syscall_64.entry_SYSCALL_64_after_hwframe
      0.00        +5.4e+105%       5367 ±141%  latency_stats.avg.call_rwsem_down_write_failed.unlink_file_vma.free_pgtables.exit_mmap.mmput.flush_old_exec.load_elf_binary.search_binary_handler.do_execveat_common.__x64_sys_execve.do_syscall_64.entry_SYSCALL_64_after_hwframe
     36937 ±119%    -100.0%       0.00        latency_stats.max.call_rwsem_down_read_failed.m_start.seq_read.__vfs_read.vfs_read.ksys_read.do_syscall_64.entry_SYSCALL_64_after_hwframe
      3604 ±141%    -100.0%       0.00        latency_stats.max.call_rwsem_down_write_failed.do_unlinkat.do_syscall_64.entry_SYSCALL_64_after_hwframe
     84146 ±107%     -72.5%      23171 ± 31%  latency_stats.max.expand_files.__alloc_fd.do_sys_open.do_syscall_64.entry_SYSCALL_64_after_hwframe
      4391 ±138%     -70.9%       1277 ±129%  latency_stats.max.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_proc_lookup.nfs_lookup_revalidate.lookup_fast.walk_component.link_path_walk.path_lookupat.filename_lookup
      5817 ± 83%     -69.7%       1760 ± 67%  latency_stats.max.pipe_write.__vfs_write.vfs_write.ksys_write.do_syscall_64.entry_SYSCALL_64_after_hwframe
      6720 ±137%   +1628.2%     116147 ±141%  latency_stats.max.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_proc_getattr.__nfs_revalidate_inode.nfs_do_access.nfs_permission.inode_permission.link_path_walk.path_lookupat
    164.67 ± 30%   +7264.0%      12126 ±138%  latency_stats.max.flush_work.fsnotify_destroy_group.inotify_release.__fput.task_work_run.exit_to_usermode_loop.do_syscall_64.entry_SYSCALL_64_after_hwframe
      0.00        +1.2e+106%      12153 ±141%  latency_stats.max.call_rwsem_down_write_failed.unlink_file_vma.free_pgtables.exit_mmap.mmput.flush_old_exec.load_elf_binary.search_binary_handler.do_execveat_common.__x64_sys_execve.do_syscall_64.entry_SYSCALL_64_after_hwframe
    110122 ±120%    -100.0%       0.00        latency_stats.sum.call_rwsem_down_read_failed.m_start.seq_read.__vfs_read.vfs_read.ksys_read.do_syscall_64.entry_SYSCALL_64_after_hwframe
      3604 ±141%    -100.0%       0.00        latency_stats.sum.call_rwsem_down_write_failed.do_unlinkat.do_syscall_64.entry_SYSCALL_64_after_hwframe
  12078828 ±139%     -99.3%      89363 ± 29%  latency_stats.sum.expand_files.__alloc_fd.do_sys_open.do_syscall_64.entry_SYSCALL_64_after_hwframe
    144453 ±120%     -80.9%      27650 ± 19%  latency_stats.sum.poll_schedule_timeout.do_sys_poll.__x64_sys_poll.do_syscall_64.entry_SYSCALL_64_after_hwframe
      4391 ±138%     -70.9%       1277 ±129%  latency_stats.sum.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_proc_lookup.nfs_lookup_revalidate.lookup_fast.walk_component.link_path_walk.path_lookupat.filename_lookup
      9438 ± 86%     -68.4%       2980 ± 35%  latency_stats.sum.pipe_write.__vfs_write.vfs_write.ksys_write.do_syscall_64.entry_SYSCALL_64_after_hwframe
     31656 ±138%    +320.4%     133084 ±140%  latency_stats.sum.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_proc_getattr.__nfs_revalidate_inode.nfs_do_access.nfs_permission.inode_permission.link_path_walk.path_lookupat
    164.67 ± 30%   +7264.0%      12126 ±138%  latency_stats.sum.flush_work.fsnotify_destroy_group.inotify_release.__fput.task_work_run.exit_to_usermode_loop.do_syscall_64.entry_SYSCALL_64_after_hwframe
      0.00        +8.8e+105%       8760 ±141%  latency_stats.sum.msleep_interruptible.uart_wait_until_sent.tty_wait_until_sent.tty_port_close_start.tty_port_close.tty_release.__fput.task_work_run.exit_to_usermode_loop.do_syscall_64.entry_SYSCALL_64_after_hwframe
      0.00        +1.3e+106%      12897 ±141%  latency_stats.sum.tty_wait_until_sent.tty_port_close_start.tty_port_close.tty_release.__fput.task_work_run.exit_to_usermode_loop.do_syscall_64.entry_SYSCALL_64_after_hwframe
      0.00        +3.2e+106%      32207 ±141%  latency_stats.sum.call_rwsem_down_write_failed.unlink_file_vma.free_pgtables.exit_mmap.mmput.flush_old_exec.load_elf_binary.search_binary_handler.do_execveat_common.__x64_sys_execve.do_syscall_64.entry_SYSCALL_64_after_hwframe
     44.43 ±  3%     -44.4        0.00        perf-profile.calltrace.cycles-pp.alloc_pages_vma.__handle_mm_fault.handle_mm_fault.__do_page_fault.do_page_fault
     44.13 ±  3%     -44.1        0.00        perf-profile.calltrace.cycles-pp.__alloc_pages_nodemask.alloc_pages_vma.__handle_mm_fault.handle_mm_fault.__do_page_fault
     43.95 ±  3%     -43.9        0.00        perf-profile.calltrace.cycles-pp.get_page_from_freelist.__alloc_pages_nodemask.alloc_pages_vma.__handle_mm_fault.handle_mm_fault
     41.85 ±  4%     -41.9        0.00        perf-profile.calltrace.cycles-pp._raw_spin_lock.get_page_from_freelist.__alloc_pages_nodemask.alloc_pages_vma.__handle_mm_fault
      7.74 ±  8%      -7.7        0.00        perf-profile.calltrace.cycles-pp.copy_page.__handle_mm_fault.handle_mm_fault.__do_page_fault.do_page_fault
      7.19 ±  4%      -7.2        0.00        perf-profile.calltrace.cycles-pp.finish_fault.__handle_mm_fault.handle_mm_fault.__do_page_fault.do_page_fault
      7.15 ±  4%      -7.2        0.00        perf-profile.calltrace.cycles-pp.alloc_set_pte.finish_fault.__handle_mm_fault.handle_mm_fault.__do_page_fault
      5.09 ±  3%      -5.1        0.00        perf-profile.calltrace.cycles-pp.__lru_cache_add.alloc_set_pte.finish_fault.__handle_mm_fault.handle_mm_fault
      4.99 ±  3%      -5.0        0.00        perf-profile.calltrace.cycles-pp.pagevec_lru_move_fn.__lru_cache_add.alloc_set_pte.finish_fault.__handle_mm_fault
      0.93 ±  6%      -0.1        0.81 ±  2%  perf-profile.calltrace.cycles-pp.find_get_entry.find_lock_entry.shmem_getpage_gfp.shmem_fault.__do_fault
      0.00            +0.8        0.84        perf-profile.calltrace.cycles-pp._raw_spin_lock.pte_map_lock.alloc_set_pte.finish_fault.handle_pte_fault
      0.00            +0.9        0.92 ±  3%  perf-profile.calltrace.cycles-pp.__list_del_entry_valid.get_page_from_freelist.__alloc_pages_nodemask.alloc_pages_vma.handle_pte_fault
      0.00            +1.1        1.08        perf-profile.calltrace.cycles-pp.find_lock_entry.shmem_getpage_gfp.shmem_fault.__do_fault.handle_pte_fault
      0.00            +1.1        1.14        perf-profile.calltrace.cycles-pp.shmem_getpage_gfp.shmem_fault.__do_fault.handle_pte_fault.__handle_mm_fault
      0.00            +1.2        1.17        perf-profile.calltrace.cycles-pp.pte_map_lock.alloc_set_pte.finish_fault.handle_pte_fault.__handle_mm_fault
      0.00            +1.3        1.29        perf-profile.calltrace.cycles-pp.shmem_fault.__do_fault.handle_pte_fault.__handle_mm_fault.handle_mm_fault
      0.00            +1.3        1.31        perf-profile.calltrace.cycles-pp.__do_fault.handle_pte_fault.__handle_mm_fault.handle_mm_fault.__do_page_fault
     61.62            +1.7       63.33        perf-profile.calltrace.cycles-pp.__handle_mm_fault.handle_mm_fault.__do_page_fault.do_page_fault.page_fault
     41.73 ±  4%      +3.0       44.75        perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock.get_page_from_freelist.__alloc_pages_nodemask.alloc_pages_vma
      0.00            +4.6        4.55 ± 15%  perf-profile.calltrace.cycles-pp.pagevec_lru_move_fn.__lru_cache_add.alloc_set_pte.finish_fault.handle_pte_fault
      0.00            +4.6        4.65 ± 14%  perf-profile.calltrace.cycles-pp.__lru_cache_add.alloc_set_pte.finish_fault.handle_pte_fault.__handle_mm_fault
      0.00            +6.6        6.57 ± 10%  perf-profile.calltrace.cycles-pp.alloc_set_pte.finish_fault.handle_pte_fault.__handle_mm_fault.handle_mm_fault
      0.00            +6.6        6.61 ± 10%  perf-profile.calltrace.cycles-pp.finish_fault.handle_pte_fault.__handle_mm_fault.handle_mm_fault.__do_page_fault
      0.00            +7.2        7.25 ±  2%  perf-profile.calltrace.cycles-pp.copy_page.handle_pte_fault.__handle_mm_fault.handle_mm_fault.__do_page_fault
     41.41 ± 70%     +22.3       63.67        perf-profile.calltrace.cycles-pp.handle_mm_fault.__do_page_fault.do_page_fault.page_fault
     42.19 ± 70%     +22.6       64.75        perf-profile.calltrace.cycles-pp.__do_page_fault.do_page_fault.page_fault
     42.20 ± 70%     +22.6       64.76        perf-profile.calltrace.cycles-pp.do_page_fault.page_fault
     42.27 ± 70%     +22.6       64.86        perf-profile.calltrace.cycles-pp.page_fault
      0.00           +44.9       44.88        perf-profile.calltrace.cycles-pp._raw_spin_lock.get_page_from_freelist.__alloc_pages_nodemask.alloc_pages_vma.handle_pte_fault
      0.00           +46.9       46.92        perf-profile.calltrace.cycles-pp.get_page_from_freelist.__alloc_pages_nodemask.alloc_pages_vma.handle_pte_fault.__handle_mm_fault
      0.00           +47.1       47.10        perf-profile.calltrace.cycles-pp.__alloc_pages_nodemask.alloc_pages_vma.handle_pte_fault.__handle_mm_fault.handle_mm_fault
      0.00           +47.4       47.37        perf-profile.calltrace.cycles-pp.alloc_pages_vma.handle_pte_fault.__handle_mm_fault.handle_mm_fault.__do_page_fault
      0.00           +63.0       63.00        perf-profile.calltrace.cycles-pp.handle_pte_fault.__handle_mm_fault.handle_mm_fault.__do_page_fault.do_page_fault
      0.97 ±  6%      -0.1        0.84 ±  2%  perf-profile.children.cycles-pp.find_get_entry
      1.23 ±  6%      -0.1        1.11        perf-profile.children.cycles-pp.find_lock_entry
      0.09 ± 10%      -0.0        0.07 ±  6%  perf-profile.children.cycles-pp.unlock_page
      0.19 ±  4%      +0.0        0.21 ±  2%  perf-profile.children.cycles-pp.mem_cgroup_charge_statistics
      0.21 ±  2%      +0.0        0.25        perf-profile.children.cycles-pp.__mod_node_page_state
      0.00            +0.1        0.05 ±  8%  perf-profile.children.cycles-pp.get_vma_policy
      0.00            +0.1        0.08        perf-profile.children.cycles-pp.__lru_cache_add_active_or_unevictable
      0.00            +0.2        0.18 ±  2%  perf-profile.children.cycles-pp.__page_add_new_anon_rmap
      0.00            +1.3        1.30        perf-profile.children.cycles-pp.pte_map_lock
     63.40            +1.6       64.97        perf-profile.children.cycles-pp.__do_page_fault
     63.19            +1.6       64.83        perf-profile.children.cycles-pp.do_page_fault
     61.69            +1.7       63.36        perf-profile.children.cycles-pp.__handle_mm_fault
     63.19            +1.7       64.86        perf-profile.children.cycles-pp.page_fault
     61.99            +1.7       63.70        perf-profile.children.cycles-pp.handle_mm_fault
     72.27            +2.2       74.52        perf-profile.children.cycles-pp.native_queued_spin_lock_slowpath
     67.51            +2.4       69.87        perf-profile.children.cycles-pp._raw_spin_lock
     44.49 ±  3%      +3.0       47.45        perf-profile.children.cycles-pp.alloc_pages_vma
     44.28 ±  3%      +3.0       47.26        perf-profile.children.cycles-pp.__alloc_pages_nodemask
     44.13 ±  3%      +3.0       47.12        perf-profile.children.cycles-pp.get_page_from_freelist
      0.00           +63.1       63.06        perf-profile.children.cycles-pp.handle_pte_fault
      1.46 ±  7%      -0.5        1.01        perf-profile.self.cycles-pp._raw_spin_lock
      0.58 ±  6%      -0.2        0.34        perf-profile.self.cycles-pp.__handle_mm_fault
      0.55 ±  6%      -0.1        0.44 ±  2%  perf-profile.self.cycles-pp.find_get_entry
      0.22 ±  5%      -0.1        0.16 ±  2%  perf-profile.self.cycles-pp.alloc_set_pte
      0.10 ±  8%      -0.0        0.08        perf-profile.self.cycles-pp.down_read_trylock
      0.09 ±  5%      -0.0        0.07        perf-profile.self.cycles-pp.unlock_page
      0.06            -0.0        0.05        perf-profile.self.cycles-pp.pmd_devmap_trans_unstable
      0.20 ±  2%      +0.0        0.24 ±  3%  perf-profile.self.cycles-pp.__mod_node_page_state
      0.00            +0.1        0.05        perf-profile.self.cycles-pp.finish_fault
      0.00            +0.1        0.05        perf-profile.self.cycles-pp.get_vma_policy
      0.00            +0.1        0.08 ±  6%  perf-profile.self.cycles-pp.__lru_cache_add_active_or_unevictable
      0.00            +0.2        0.25        perf-profile.self.cycles-pp.handle_pte_fault
      0.00            +0.5        0.46 ±  7%  perf-profile.self.cycles-pp.pte_map_lock
     72.26            +2.3       74.52        perf-profile.self.cycles-pp.native_queued_spin_lock_slowpath
On 11/06/2018 09:49, Song, HaiyanX wrote:
> Hi Laurent,
>
> Regression tests for the v11 patch series have been run; some regressions were found by LKP-tools (Linux kernel performance)
> tested on an Intel 4s Skylake platform. This time only the cases which had been run and shown regressions on
> the v9 patch series were tested.
>
> The regression result is sorted by the metric will-it-scale.per_thread_ops.
> branch: Laurent-Dufour/Speculative-page-faults/20180520-045126
> commit id:
>   head commit : a7a8993bfe3ccb54ad468b9f1799649e4ad1ff12
>   base commit : ba98a1cdad71d259a194461b3a61471b49b14df1
> Benchmark: will-it-scale
> Download link: https://github.com/antonblanchard/will-it-scale/tree/master
>
> Metrics:
>   will-it-scale.per_process_ops=processes/nr_cpu
>   will-it-scale.per_thread_ops=threads/nr_cpu
> test box: lkp-skl-4sp1(nr_cpu=192,memory=768G)
> THP: enable / disable
> nr_task: 100%
>
> 1. Regressions:
>
> a). Enable THP
>    testcase                        base     change   head     metric
>    page_fault3/enable THP          10519    -20.5%   836      will-it-scale.per_thread_ops
>    page_fault2/enable THP           8281    -18.8%   6728     will-it-scale.per_thread_ops
>    brk1/enable THP                998475     -2.2%   976893   will-it-scale.per_process_ops
>    context_switch1/enable THP     223910     -1.3%   220930   will-it-scale.per_process_ops
>    context_switch1/enable THP     233722     -1.0%   231288   will-it-scale.per_thread_ops
>
> b). Disable THP
>    page_fault3/disable THP         10856    -23.1%   8344     will-it-scale.per_thread_ops
>    page_fault2/disable THP          8147    -18.8%   6613     will-it-scale.per_thread_ops
>    brk1/disable THP                  957     -7.9%   881      will-it-scale.per_thread_ops
>    context_switch1/disable THP    237006     -2.2%   231907   will-it-scale.per_thread_ops
>    brk1/disable THP               997317     -2.0%   977778   will-it-scale.per_process_ops
>    page_fault3/disable THP        467454     -1.8%   459251   will-it-scale.per_process_ops
>    context_switch1/disable THP    224431     -1.3%   221567   will-it-scale.per_process_ops
>
> Notes: for the above result values, higher is better.
I tried the same tests on my PowerPC victim VM (1024 CPUs, 11TB) and I can't get reproducible results. The results show huge variation, even on the vanilla kernel, so I can't draw any conclusions from them.

I tried on a smaller node (80 CPUs, 32G), and the tests ran better, but I didn't measure any change between the vanilla and the SPF-patched kernels:

test THP enabled		4.17.0-rc4-mm1	spf		delta
page_fault3_threads		2697.7		2683.5		-0.53%
page_fault2_threads		170660.6	169574.1	-0.64%
context_switch1_threads		6915269.2	6877507.3	-0.55%
context_switch1_processes	6478076.2	6529493.5	0.79%
brk1				243391.2	238527.5	-2.00%

Tests were run 10 times, no high variation detected.

Did you see high variation on your side? How many times were the tests run to compute the average values?

Thanks,
Laurent.

>
> 2. Improvement: no improvement found based on the selected test cases.
>
>
> Best regards
> Haiyan Song
> ________________________________________
> From: owner-linux-mm@kvack.org [owner-linux-mm@kvack.org] on behalf of Laurent Dufour [ldufour@linux.vnet.ibm.com]
> Sent: Monday, May 28, 2018 4:54 PM
> To: Song, HaiyanX
> Cc: akpm@linux-foundation.org; mhocko@kernel.org; peterz@infradead.org; kirill@shutemov.name; ak@linux.intel.com; dave@stgolabs.net; jack@suse.cz; Matthew Wilcox; khandual@linux.vnet.ibm.com; aneesh.kumar@linux.vnet.ibm.com; benh@kernel.crashing.org; mpe@ellerman.id.au; paulus@samba.org; Thomas Gleixner; Ingo Molnar; hpa@zytor.com; Will Deacon; Sergey Senozhatsky; sergey.senozhatsky.work@gmail.com; Andrea Arcangeli; Alexei Starovoitov; Wang, Kemi; Daniel Jordan; David Rientjes; Jerome Glisse; Ganesh Mahendran; Minchan Kim; Punit Agrawal; vinayak menon; Yang Shi; linux-kernel@vger.kernel.org; linux-mm@kvack.org; haren@linux.vnet.ibm.com; npiggin@gmail.com; bsingharora@gmail.com; paulmck@linux.vnet.ibm.com; Tim Chen; linuxppc-dev@lists.ozlabs.org; x86@kernel.org
> Subject: Re: [PATCH v11 00/26] Speculative page faults
>
> On 28/05/2018 10:22, Haiyan Song wrote:
>> Hi Laurent,
>>
>> Yes, these tests were done on the v9 patch.
>
> Do you plan to give this v11 a run?
>
>>
>>
>> Best regards,
>> Haiyan Song
>>
>> On Mon, May 28, 2018 at 09:51:34AM +0200, Laurent Dufour wrote:
>>> On 28/05/2018 07:23, Song, HaiyanX wrote:
>>>>
>>>> Some regressions and improvements were found by LKP-tools (Linux kernel performance) on the V9 patch series
>>>> tested on an Intel 4s Skylake platform.
>>>
>>> Hi,
>>>
>>> Thanks for reporting these benchmark results, but you mentioned the "V9 patch
>>> series" while responding to the v11 header series...
>>> Were these tests done on v9 or v11?
>>>
>>> Cheers,
>>> Laurent.
>>>
>>>>
>>>> The regression result is sorted by the metric will-it-scale.per_thread_ops.
>>>> Branch: Laurent-Dufour/Speculative-page-faults/20180316-151833 (V9 patch series)
>>>> Commit id:
>>>>   base commit: d55f34411b1b126429a823d06c3124c16283231f
>>>>   head commit: 0355322b3577eeab7669066df42c550a56801110
>>>> Benchmark suite: will-it-scale
>>>> Download link:
>>>> https://github.com/antonblanchard/will-it-scale/tree/master/tests
>>>> Metrics:
>>>>   will-it-scale.per_process_ops=processes/nr_cpu
>>>>   will-it-scale.per_thread_ops=threads/nr_cpu
>>>> test box: lkp-skl-4sp1(nr_cpu=192,memory=768G)
>>>> THP: enable / disable
>>>> nr_task: 100%
>>>>
>>>> 1. Regressions:
>>>> a) THP enabled:
>>>> testcase                        base     change   head     metric
>>>> page_fault3/ enable THP         10092    -17.5%   8323     will-it-scale.per_thread_ops
>>>> page_fault2/ enable THP          8300    -17.2%   6869     will-it-scale.per_thread_ops
>>>> brk1/ enable THP               957.67     -7.6%   885      will-it-scale.per_thread_ops
>>>> page_fault3/ enable THP        172821     -5.3%   163692   will-it-scale.per_process_ops
>>>> signal1/ enable THP              9125     -3.2%   8834     will-it-scale.per_process_ops
>>>>
>>>> b) THP disabled:
>>>> testcase                        base     change   head     metric
>>>> page_fault3/ disable THP        10107    -19.1%   8180     will-it-scale.per_thread_ops
>>>> page_fault2/ disable THP         8432    -17.8%   6931     will-it-scale.per_thread_ops
>>>> context_switch1/ disable THP   215389     -6.8%   200776   will-it-scale.per_thread_ops
>>>> brk1/ disable THP              939.67     -6.6%   877.33   will-it-scale.per_thread_ops
>>>> page_fault3/ disable THP       173145     -4.7%   165064   will-it-scale.per_process_ops
>>>> signal1/ disable THP             9162     -3.9%   8802     will-it-scale.per_process_ops
>>>>
>>>> 2. Improvements:
>>>> a) THP enabled:
>>>> testcase                        base     change   head     metric
>>>> malloc1/ enable THP             66.33   +469.8%   383.67   will-it-scale.per_thread_ops
>>>> writeseek3/ enable THP           2531     +4.5%   2646     will-it-scale.per_thread_ops
>>>> signal1/ enable THP            989.33     +2.8%   1016     will-it-scale.per_thread_ops
>>>>
>>>> b) THP disabled:
>>>> testcase                        base     change   head     metric
>>>> malloc1/ disable THP            90.33   +417.3%   467.33   will-it-scale.per_thread_ops
>>>> read2/ disable THP              58934    +39.2%   82060    will-it-scale.per_thread_ops
>>>> page_fault1/ disable THP         8607    +36.4%   11736    will-it-scale.per_thread_ops
>>>> read1/ disable THP             314063    +12.7%   353934   will-it-scale.per_thread_ops
>>>> writeseek3/ disable THP          2452    +12.5%   2759     will-it-scale.per_thread_ops
>>>> signal1/ disable THP           971.33     +5.5%   1024     will-it-scale.per_thread_ops
>>>>
>>>> Notes: for the above values in column "change", a higher value means that the related testcase result
>>>> on the head commit is better than that on the base commit for this benchmark.
>>>>
>>>>
>>>> Best regards
>>>> Haiyan Song
>>>>
>>>> ________________________________________
>>>> From: owner-linux-mm@kvack.org [owner-linux-mm@kvack.org] on behalf of Laurent Dufour [ldufour@linux.vnet.ibm.com]
>>>> Sent: Thursday, May 17, 2018 7:06 PM
>>>> To: akpm@linux-foundation.org; mhocko@kernel.org; peterz@infradead.org; kirill@shutemov.name; ak@linux.intel.com; dave@stgolabs.net; jack@suse.cz; Matthew Wilcox; khandual@linux.vnet.ibm.com; aneesh.kumar@linux.vnet.ibm.com; benh@kernel.crashing.org; mpe@ellerman.id.au; paulus@samba.org; Thomas Gleixner; Ingo Molnar; hpa@zytor.com; Will Deacon; Sergey Senozhatsky; sergey.senozhatsky.work@gmail.com; Andrea Arcangeli; Alexei Starovoitov; Wang, Kemi; Daniel Jordan; David Rientjes; Jerome Glisse; Ganesh Mahendran; Minchan Kim; Punit Agrawal; vinayak menon; Yang Shi
>>>> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org; haren@linux.vnet.ibm.com; npiggin@gmail.com; bsingharora@gmail.com; paulmck@linux.vnet.ibm.com; Tim Chen; linuxppc-dev@lists.ozlabs.org; x86@kernel.org
>>>> Subject: [PATCH v11 00/26] Speculative page faults
>>>>
>>>> This is a port on kernel 4.17 of the work done by Peter Zijlstra to handle
>>>> page faults without holding the mm semaphore [1].
>>>>
>>>> The idea is to try to handle user space page faults without holding the
>>>> mmap_sem. This should allow better concurrency for massively threaded
>>>> processes, since the page fault handler will no longer wait for other
>>>> threads' memory layout changes to be done, assuming that those changes are
>>>> done in another part of the process's memory space. This type of page
>>>> fault is named a speculative page fault. If the speculative page fault
>>>> fails, because concurrency is detected or because the underlying PMD or
>>>> PTE tables are not yet allocated, the speculative handling is aborted and
>>>> a classic page fault is tried instead.
>>>>
>>>> The speculative page fault (SPF) has to look for the VMA matching the
>>>> fault address without holding the mmap_sem; this is done by introducing a
>>>> rwlock which protects access to the mm_rb tree. Previously this was done
>>>> using SRCU, but it introduced a lot of scheduling to process the VMAs'
>>>> freeing operations, which hit performance by 20% as reported by Kemi Wang
>>>> [2]. Using a rwlock to protect access to the mm_rb tree limits the locking
>>>> contention to these operations, which are expected to be O(log n). In
>>>> addition, to ensure that a VMA is not freed behind our back, a reference
>>>> count is added, and two services (get_vma() and put_vma()) are introduced
>>>> to handle it. Once a VMA is fetched from the RB tree using get_vma(), it
>>>> must later be freed using put_vma(). With this scheme I can no longer see
>>>> the overhead I previously measured with the will-it-scale benchmark.
>>>>
>>>> The VMA's attributes checked during the speculative page fault processing
>>>> have to be protected against parallel changes. This is done by using a per
>>>> VMA sequence lock. This sequence lock allows the speculative page fault
>>>> handler to quickly check for parallel changes in progress and to abort the
>>>> speculative page fault in that case.
>>>>
>>>> Once the VMA has been found, the speculative page fault handler checks the
>>>> VMA's attributes to verify whether the page fault can be handled this way
>>>> or not. Thus, the VMA is protected through a sequence lock which allows
>>>> fast detection of concurrent VMA changes. If such a change is detected,
>>>> the speculative page fault is aborted and a *classic* page fault is tried
>>>> instead. VMA sequence locking is added wherever VMA attributes that are
>>>> checked during the page fault are modified.
>>>>
>>>> When the PTE is fetched, the VMA is checked to see if it has been changed;
>>>> once the page table is locked, the VMA is known to be valid, and any other
>>>> change touching this PTE will need to take the page table lock, so no
>>>> parallel change is possible at this time.
>>>>
>>>> The locking of the PTE is done with interrupts disabled; this allows
>>>> checking the PMD to ensure that there is no ongoing collapsing operation.
>>>> Since khugepaged first sets the PMD to pmd_none and then waits for the
>>>> other CPUs to have caught the IPI, if the PMD is valid at the time the PTE
>>>> is locked, we have the guarantee that the collapsing operation will have
>>>> to wait on the PTE lock to move forward. This allows the SPF handler to
>>>> map the PTE safely. If the PMD value is different from the one recorded at
>>>> the beginning of the SPF operation, the classic page fault handler will be
>>>> called to handle the operation while holding the mmap_sem. As the PTE
>>>> locking is done with interrupts disabled, the lock is taken using
>>>> spin_trylock() to avoid deadlock when handling a page fault while a TLB
>>>> invalidate is requested by another CPU holding the PTE.
>>>>
>>>> In pseudo code, this could be seen as:
>>>>     speculative_page_fault()
>>>>     {
>>>>             vma = get_vma()
>>>>             check vma sequence count
>>>>             check vma's support
>>>>             disable interrupt
>>>>                   check pgd,p4d,...,pte
>>>>                   save pmd and pte in vmf
>>>>                   save vma sequence counter in vmf
>>>>             enable interrupt
>>>>             check vma sequence count
>>>>             handle_pte_fault(vma)
>>>>                     ..
>>>>                     page = alloc_page()
>>>>                     pte_map_lock()
>>>>                             disable interrupt
>>>>                                     abort if sequence counter has changed
>>>>                                     abort if pmd or pte has changed
>>>>                                     pte map and lock
>>>>                             enable interrupt
>>>>                     if abort
>>>>                            free page
>>>>                            abort
>>>>                     ...
>>>>     }
>>>>
>>>>     arch_fault_handler()
>>>>     {
>>>>             if (speculative_page_fault(&vma))
>>>>                    goto done
>>>>     again:
>>>>             lock(mmap_sem)
>>>>             vma = find_vma();
>>>>             handle_pte_fault(vma);
>>>>             if retry
>>>>                    unlock(mmap_sem)
>>>>                    goto again;
>>>>     done:
>>>>             handle fault error
>>>>     }
>>>>
>>>> Support for THP is not done because when checking for the PMD, we can be
>>>> confused by an in-progress collapsing operation done by khugepaged. The
>>>> issue is that pmd_none() could be true either if the PMD is not already
>>>> populated or if the underlying PTEs are on the way to be collapsed. So we
>>>> cannot safely allocate a PMD if pmd_none() is true.
>>>>
>>>> This series adds a new software performance event named
>>>> 'speculative-faults' or 'spf'. It counts the number of successful page
>>>> fault events handled speculatively. When recording 'faults,spf' events,
>>>> the 'faults' one counts the total number of page fault events while 'spf'
>>>> only counts the part of the faults processed speculatively.
>>>>
>>>> There are some trace events introduced by this series. They allow
>>>> identifying why the page faults were not processed speculatively. This
>>>> doesn't take into account the faults generated by a monothreaded process,
>>>> which are directly processed while holding the mmap_sem. These trace
>>>> events are grouped in a system named 'pagefault', they are:
>>>>   - pagefault:spf_vma_changed : the VMA has been changed behind our back
>>>>   - pagefault:spf_vma_noanon  : the vma->anon_vma field was not yet set
>>>>   - pagefault:spf_vma_notsup  : the VMA's type is not supported
>>>>   - pagefault:spf_vma_access  : the VMA's access rights are not respected
>>>>   - pagefault:spf_pmd_changed : the upper PMD pointer has changed behind
>>>>     our back
>>>>
>>>> To record all the related events, the easiest way is to run perf with the
>>>> following arguments:
>>>>   $ perf stat -e 'faults,spf,pagefault:*' <command>
>>>>
>>>> There is also a dedicated vmstat counter showing the number of successful
>>>> page faults handled speculatively. It can be seen this way:
>>>>   $ grep speculative_pgfault /proc/vmstat
>>>>
>>>> This series builds on top of v4.16-mmotm-2018-04-13-17-28 and is
>>>> functional on x86, PowerPC and arm64.
>>>>
>>>> ---------------------
>>>> Real Workload results
>>>>
>>>> As mentioned in a previous email, we did non-official runs using a
>>>> "popular in-memory multithreaded database product" on a 176-core SMT8
>>>> Power system which showed a 30% improvement in the number of transactions
>>>> processed per second. This run was done on the v6 series, but changes
>>>> introduced in this new version should not impact the performance boost
>>>> seen.
>>>>
>>>> Here are the perf data captured during 2 of these runs on top of the v8
>>>> series:
>>>>                 vanilla         spf
>>>> faults          89.418          101.364         +13%
>>>> spf             n/a             97.989
>>>>
>>>> With the SPF kernel, most of the page faults were processed speculatively.
>>>>
>>>> Ganesh Mahendran backported the series on top of a 4.9 kernel and gave it
>>>> a try on an Android device. He reported that the application launch time
>>>> was improved on average by 6%, and for large applications (~100 threads)
>>>> by 20%.
>>>>
>>>> Here are the launch times Ganesh measured on Android 8.0 on top of a
>>>> Qcom MSM845 (8 cores) with 6GB of memory (lower is better):
>>>>
>>>> Application                          4.9    4.9+spf   delta
>>>> com.tencent.mm                       416    389       -7%
>>>> com.eg.android.AlipayGphone          1135   986       -13%
>>>> com.tencent.mtt                      455    454       0%
>>>> com.qqgame.hlddz                     1497   1409      -6%
>>>> com.autonavi.minimap                 711    701       -1%
>>>> com.tencent.tmgp.sgame               788    748       -5%
>>>> com.immomo.momo                      501    487       -3%
>>>> com.tencent.peng                     2145   2112      -2%
>>>> com.smile.gifmaker                   491    461       -6%
>>>> com.baidu.BaiduMap                   479    366       -23%
>>>> com.taobao.taobao                    1341   1198      -11%
>>>> com.baidu.searchbox                  333    314       -6%
>>>> com.tencent.mobileqq                 394    384       -3%
>>>> com.sina.weibo                       907    906       0%
>>>> com.youku.phone                      816    731       -11%
>>>> com.happyelements.AndroidAnimal.qq   763    717       -6%
>>>> com.UCMobile                         415    411       -1%
>>>> com.tencent.tmgp.ak                  1464   1431      -2%
>>>> com.tencent.qqmusic                  336    329       -2%
>>>> com.sankuai.meituan                  1661   1302      -22%
>>>> com.netease.cloudmusic               1193   1200      1%
>>>> air.tv.douyu.android                 4257   4152      -2%
>>>>
>>>> ------------------
>>>> Benchmarks results
>>>>
>>>> Base kernel is v4.17.0-rc4-mm1
>>>> SPF is BASE + this series
>>>>
>>>> Kernbench:
>>>> ----------
>>>> Here are the results on a 16 CPUs X86 guest using kernbench on a 4.15
>>>> kernel (the kernel is built 5 times):
>>>>
>>>> Average Half load -j 8 Run (std deviation)
>>>>                  BASE                SPF
>>>> Elapsed Time     1448.65 (5.72312)   1455.84 (4.84951)   0.50%
>>>> User Time        10135.4 (30.3699)   10148.8 (31.1252)   0.13%
>>>> System Time      900.47 (2.81131)    923.28 (7.52779)    2.53%
>>>> Percent CPU      761.4 (1.14018)     760.2 (0.447214)    -0.16%
>>>> Context Switches 85380 (3419.52)     84748 (1904.44)     -0.74%
>>>> Sleeps           105064 (1240.96)    105074 (337.612)    0.01%
>>>>
>>>> Average Optimal load -j 16 Run (std deviation)
>>>>                  BASE                SPF
>>>> Elapsed Time     920.528 (10.1212)   927.404 (8.91789)   0.75%
>>>> User Time        11064.8 (981.142)   11085 (990.897)     0.18%
>>>> System Time      979.904 (84.0615)   1001.14 (82.5523)   2.17%
>>>> Percent CPU      1089.5 (345.894)    1086.1 (343.545)    -0.31%
>>>> Context Switches 159488 (78156.4)    158223 (77472.1)    -0.79%
>>>> Sleeps           110566 (5877.49)    110388 (5617.75)    -0.16%
>>>>
>>>> During a run on the SPF kernel, perf events were captured:
>>>>  Performance counter stats for '../kernbench -M':
>>>>          526743764      faults
>>>>                210      spf
>>>>                  3      pagefault:spf_vma_changed
>>>>                  0      pagefault:spf_vma_noanon
>>>>               2278      pagefault:spf_vma_notsup
>>>>                  0      pagefault:spf_vma_access
>>>>                  0      pagefault:spf_pmd_changed
>>>>
>>>> Very few speculative page faults were recorded, as most of the
>>>> processes involved are single-threaded (it seems that on this
>>>> architecture some threads were created during the kernel build).
>>>>
>>>> Here are the kernbench results on an 80 CPUs Power8 system:
>>>>
>>>> Average Half load -j 40 Run (std deviation)
>>>>                  BASE                SPF
>>>> Elapsed Time     117.152 (0.774642)  117.166 (0.476057)  0.01%
>>>> User Time        4478.52 (24.7688)   4479.76 (9.08555)   0.03%
>>>> System Time      131.104 (0.720056)  134.04 (0.708414)   2.24%
>>>> Percent CPU      3934 (19.7104)      3937.2 (19.0184)    0.08%
>>>> Context Switches 92125.4 (576.787)   92581.6 (198.622)   0.50%
>>>> Sleeps           317923 (652.499)    318469 (1255.59)    0.17%
>>>>
>>>> Average Optimal load -j 80 Run (std deviation)
>>>>                  BASE                SPF
>>>> Elapsed Time     107.73 (0.632416)   107.31 (0.584936)   -0.39%
>>>> User Time        5869.86 (1466.72)   5871.71 (1467.27)   0.03%
>>>> System Time      153.728 (23.8573)   157.153 (24.3704)   2.23%
>>>> Percent CPU      5418.6 (1565.17)    5436.7 (1580.91)    0.33%
>>>> Context Switches 223861 (138865)     225032 (139632)     0.52%
>>>> Sleeps           330529 (13495.1)    332001 (14746.2)    0.45%
>>>>
>>>> During a run on the SPF kernel, perf events were captured:
>>>>  Performance counter stats for '../kernbench -M':
>>>>          116730856      faults
>>>>                  0      spf
>>>>                  3      pagefault:spf_vma_changed
>>>>                  0      pagefault:spf_vma_noanon
>>>>                476      pagefault:spf_vma_notsup
>>>>                  0      pagefault:spf_vma_access
>>>>                  0      pagefault:spf_pmd_changed
>>>>
>>>> Most of the processes involved are single-threaded, so SPF is not
>>>> activated, but there is no impact on the performance.
>>>>
>>>> Ebizzy:
>>>> -------
>>>> The test counts the number of records per second it can manage; higher
>>>> is better. I ran it like this: 'ebizzy -mTt <nrcpus>'. To get
>>>> consistent results I repeated the test 100 times and measured the
>>>> average.
>>>>
>>>>                  BASE       SPF        delta
>>>> 16 CPUs x86 VM   742.57     1490.24    100.69%
>>>> 80 CPUs P8 node  13105.4    24174.23   84.46%
>>>>
>>>> Here are the performance counters read during a run on a 16 CPUs x86
>>>> VM:
>>>>  Performance counter stats for './ebizzy -mTt 16':
>>>>            1706379      faults
>>>>            1674599      spf
>>>>              30588      pagefault:spf_vma_changed
>>>>                  0      pagefault:spf_vma_noanon
>>>>                363      pagefault:spf_vma_notsup
>>>>                  0      pagefault:spf_vma_access
>>>>                  0      pagefault:spf_pmd_changed
>>>>
>>>> And the ones captured during a run on an 80 CPUs Power node:
>>>>  Performance counter stats for './ebizzy -mTt 80':
>>>>            1874773      faults
>>>>            1461153      spf
>>>>             413293      pagefault:spf_vma_changed
>>>>                  0      pagefault:spf_vma_noanon
>>>>                200      pagefault:spf_vma_notsup
>>>>                  0      pagefault:spf_vma_access
>>>>                  0      pagefault:spf_pmd_changed
>>>>
>>>> In ebizzy's case most of the page faults were handled speculatively,
>>>> leading to the ebizzy performance boost.
>>>>
>>>> ------------------
>>>> Changes since v10 (https://lkml.org/lkml/2018/4/17/572):
>>>>  - Accounted for all the review feedback from Punit Agrawal, Ganesh
>>>>    Mahendran and Minchan Kim, hopefully.
>>>>  - Removed an unneeded check on CONFIG_SPECULATIVE_PAGE_FAULT in
>>>>    __do_page_fault().
>>>>  - Loop in pte_spinlock() and pte_map_lock() when the pte try lock
>>>>    fails instead of aborting the speculative page fault handling.
>>>>    Dropped the now useless trace event pagefault:spf_pte_lock.
>>>>  - No longer try to reuse the fetched VMA during the speculative page
>>>>    fault handling when retrying is needed. This added a lot of
>>>>    complexity and additional tests didn't show a significant
>>>>    performance improvement.
>>>>  - Converted IS_ENABLED(CONFIG_NUMA) back to #ifdef due to a build
>>>>    error.
>>>>
>>>> [1] http://linux-kernel.2935.n7.nabble.com/RFC-PATCH-0-6-Another-go-at-speculative-page-faults-tt965642.html#none
>>>> [2] https://patchwork.kernel.org/patch/9999687/
>>>>
>>>>
>>>> Laurent Dufour (20):
>>>>   mm: introduce CONFIG_SPECULATIVE_PAGE_FAULT
>>>>   x86/mm: define ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT
>>>>   powerpc/mm: set ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT
>>>>   mm: introduce pte_spinlock for FAULT_FLAG_SPECULATIVE
>>>>   mm: make pte_unmap_same compatible with SPF
>>>>   mm: introduce INIT_VMA()
>>>>   mm: protect VMA modifications using VMA sequence count
>>>>   mm: protect mremap() against SPF hanlder
>>>>   mm: protect SPF handler against anon_vma changes
>>>>   mm: cache some VMA fields in the vm_fault structure
>>>>   mm/migrate: Pass vm_fault pointer to migrate_misplaced_page()
>>>>   mm: introduce __lru_cache_add_active_or_unevictable
>>>>   mm: introduce __vm_normal_page()
>>>>   mm: introduce __page_add_new_anon_rmap()
>>>>   mm: protect mm_rb tree with a rwlock
>>>>   mm: adding speculative page fault failure trace events
>>>>   perf: add a speculative page fault sw event
>>>>   perf tools: add support for the SPF perf event
>>>>   mm: add speculative page fault vmstats
>>>>   powerpc/mm: add speculative page fault
>>>>
>>>> Mahendran Ganesh (2):
>>>>   arm64/mm: define ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT
>>>>   arm64/mm: add speculative page fault
>>>>
>>>> Peter Zijlstra (4):
>>>>   mm: prepare for FAULT_FLAG_SPECULATIVE
>>>>   mm: VMA sequence count
>>>>   mm: provide speculative fault infrastructure
>>>>   x86/mm: add speculative pagefault handling
>>>>
>>>>  arch/arm64/Kconfig                    |   1 +
>>>>  arch/arm64/mm/fault.c                 |  12 +
>>>>  arch/powerpc/Kconfig                  |   1 +
>>>>  arch/powerpc/mm/fault.c               |  16 +
>>>>  arch/x86/Kconfig                      |   1 +
>>>>  arch/x86/mm/fault.c                   |  27 +-
>>>>  fs/exec.c                             |   2 +-
>>>>  fs/proc/task_mmu.c                    |   5 +-
>>>>  fs/userfaultfd.c                      |  17 +-
>>>>  include/linux/hugetlb_inline.h        |   2 +-
>>>>  include/linux/migrate.h               |   4 +-
>>>>  include/linux/mm.h                    | 136 +++++++-
>>>>  include/linux/mm_types.h              |   7 +
>>>>  include/linux/pagemap.h               |   4 +-
>>>>  include/linux/rmap.h                  |  12 +-
>>>>  include/linux/swap.h                  |  10 +-
>>>>  include/linux/vm_event_item.h         |   3 +
>>>>  include/trace/events/pagefault.h      |  80 +++++
>>>>  include/uapi/linux/perf_event.h       |   1 +
>>>>  kernel/fork.c                         |   5 +-
>>>>  mm/Kconfig                            |  22 ++
>>>>  mm/huge_memory.c                      |   6 +-
>>>>  mm/hugetlb.c                          |   2 +
>>>>  mm/init-mm.c                          |   3 +
>>>>  mm/internal.h                         |  20 ++
>>>>  mm/khugepaged.c                       |   5 +
>>>>  mm/madvise.c                          |   6 +-
>>>>  mm/memory.c                           | 612 +++++++++++++++++++++++++++++-----
>>>>  mm/mempolicy.c                        |  51 ++-
>>>>  mm/migrate.c                          |   6 +-
>>>>  mm/mlock.c                            |  13 +-
>>>>  mm/mmap.c                             | 229 ++++++++++---
>>>>  mm/mprotect.c                         |   4 +-
>>>>  mm/mremap.c                           |  13 +
>>>>  mm/nommu.c                            |   2 +-
>>>>  mm/rmap.c                             |   5 +-
>>>>  mm/swap.c                             |   6 +-
>>>>  mm/swap_state.c                       |   8 +-
>>>>  mm/vmstat.c                           |   5 +-
>>>>  tools/include/uapi/linux/perf_event.h |   1 +
>>>>  tools/perf/util/evsel.c               |   1 +
>>>>  tools/perf/util/parse-events.c        |   4 +
>>>>  tools/perf/util/parse-events.l        |   1 +
>>>>  tools/perf/util/python.c              |   1 +
>>>>  44 files changed, 1161 insertions(+), 211 deletions(-)
>>>>  create mode 100644 include/trace/events/pagefault.h
>>>>
>>>> --
>>>> 2.7.4
>>>>
>>>>
>>>
>>
>
Hi Laurent,

For the test result on the Intel 4s Skylake platform (192 CPUs, 768G
memory), the below test cases were all run 3 times. Checking the test
results, only page_fault3_thread/enable THP has a 6% stddev for the head
commit; the other tests have lower stddev. And I did not find any other
high variation in the test case results.

a). Enable THP
testcase                      base     stddev  change   head     stddev  metric
page_fault3/enable THP        10519    ± 3%    -20.5%   8368     ± 6%    will-it-scale.per_thread_ops
page_fault2/enable THP        8281     ± 2%    -18.8%   6728             will-it-scale.per_thread_ops
brk1/enable THP               998475           -2.2%    976893           will-it-scale.per_process_ops
context_switch1/enable THP    223910           -1.3%    220930           will-it-scale.per_process_ops
context_switch1/enable THP    233722           -1.0%    231288           will-it-scale.per_thread_ops

b). Disable THP
page_fault3/disable THP       10856            -23.1%   8344             will-it-scale.per_thread_ops
page_fault2/disable THP       8147             -18.8%   6613             will-it-scale.per_thread_ops
brk1/disable THP              957              -7.9%    881              will-it-scale.per_thread_ops
context_switch1/disable THP   237006           -2.2%    231907           will-it-scale.per_thread_ops
brk1/disable THP              997317           -2.0%    977778           will-it-scale.per_process_ops
page_fault3/disable THP       467454           -1.8%    459251           will-it-scale.per_process_ops
context_switch1/disable THP   224431           -1.3%    221567           will-it-scale.per_process_ops

Best regards,
Haiyan Song
On 04/07/2018 05:23, Song, HaiyanX wrote:
> Hi Laurent,
>
> For the test result on the Intel 4s Skylake platform (192 CPUs, 768G
> memory), the below test cases were all run 3 times. Checking the test
> results, only page_fault3_thread/enable THP has a 6% stddev for the head
> commit; the other tests have lower stddev.

Repeating the test only 3 times seems a bit too low to me.

I'll focus on the biggest change for the moment, but I don't have access to
such hardware. Would it be possible to provide a diff of the performance
cycles measured between base and SPF when running page_fault3 and
page_fault2, where the ~20% change is detected? Please focus on the test
case process to see exactly where the series has an impact.

Thanks,
Laurent.

> And I did not find any other high variation in the test case results.
>
> a). Enable THP
> testcase                      base     stddev  change   head     stddev  metric
> page_fault3/enable THP        10519    ± 3%    -20.5%   8368     ± 6%    will-it-scale.per_thread_ops
> page_fault2/enable THP        8281     ± 2%    -18.8%   6728             will-it-scale.per_thread_ops
> brk1/enable THP               998475           -2.2%    976893           will-it-scale.per_process_ops
> context_switch1/enable THP    223910           -1.3%    220930           will-it-scale.per_process_ops
> context_switch1/enable THP    233722           -1.0%    231288           will-it-scale.per_thread_ops
>
> b). Disable THP
> page_fault3/disable THP       10856            -23.1%   8344             will-it-scale.per_thread_ops
> page_fault2/disable THP       8147             -18.8%   6613             will-it-scale.per_thread_ops
> brk1/disable THP              957              -7.9%    881              will-it-scale.per_thread_ops
> context_switch1/disable THP   237006           -2.2%    231907           will-it-scale.per_thread_ops
> brk1/disable THP              997317           -2.0%    977778           will-it-scale.per_process_ops
> page_fault3/disable THP       467454           -1.8%    459251           will-it-scale.per_process_ops
> context_switch1/disable THP   224431           -1.3%    221567           will-it-scale.per_process_ops
>
>
> Best regards,
> Haiyan Song
> ________________________________________
> From: Laurent Dufour [ldufour@linux.vnet.ibm.com]
> Sent: Monday, July 02, 2018 4:59 PM
> To: Song, HaiyanX
> Cc: akpm@linux-foundation.org; mhocko@kernel.org; peterz@infradead.org; kirill@shutemov.name; ak@linux.intel.com; dave@stgolabs.net; jack@suse.cz; Matthew Wilcox; khandual@linux.vnet.ibm.com; aneesh.kumar@linux.vnet.ibm.com; benh@kernel.crashing.org; mpe@ellerman.id.au; paulus@samba.org; Thomas Gleixner; Ingo Molnar; hpa@zytor.com; Will Deacon; Sergey Senozhatsky; sergey.senozhatsky.work@gmail.com; Andrea Arcangeli; Alexei Starovoitov; Wang, Kemi; Daniel Jordan; David Rientjes; Jerome Glisse; Ganesh Mahendran; Minchan Kim; Punit Agrawal; vinayak menon; Yang Shi; linux-kernel@vger.kernel.org; linux-mm@kvack.org; haren@linux.vnet.ibm.com; npiggin@gmail.com; bsingharora@gmail.com; paulmck@linux.vnet.ibm.com; Tim Chen; linuxppc-dev@lists.ozlabs.org; x86@kernel.org
> Subject: Re: [PATCH v11 00/26] Speculative page faults
>
> On 11/06/2018 09:49, Song, HaiyanX wrote:
>> Hi Laurent,
>>
>> Regression tests for the v11 patch series have been run; some regressions
>> were found by LKP-tools (Linux Kernel Performance), tested on an Intel 4s
>> Skylake platform. This time only the cases which had been run and found
>> regressions on the v9 patch series were tested.
>>
>> The regression result is sorted by the metric will-it-scale.per_thread_ops.
>> branch: Laurent-Dufour/Speculative-page-faults/20180520-045126
>> commit id:
>>   head commit : a7a8993bfe3ccb54ad468b9f1799649e4ad1ff12
>>   base commit : ba98a1cdad71d259a194461b3a61471b49b14df1
>> Benchmark: will-it-scale
>> Download link: https://github.com/antonblanchard/will-it-scale/tree/master
>>
>> Metrics:
>>   will-it-scale.per_process_ops=processes/nr_cpu
>>   will-it-scale.per_thread_ops=threads/nr_cpu
>> test box: lkp-skl-4sp1(nr_cpu=192,memory=768G)
>> THP: enable / disable
>> nr_task: 100%
>>
>> 1. Regressions:
>>
>> a). Enable THP
>> testcase                      base     change   head     metric
>> page_fault3/enable THP        10519    -20.5%   8368     will-it-scale.per_thread_ops
>> page_fault2/enable THP        8281     -18.8%   6728     will-it-scale.per_thread_ops
>> brk1/enable THP               998475   -2.2%    976893   will-it-scale.per_process_ops
>> context_switch1/enable THP    223910   -1.3%    220930   will-it-scale.per_process_ops
>> context_switch1/enable THP    233722   -1.0%    231288   will-it-scale.per_thread_ops
>>
>> b). Disable THP
>> page_fault3/disable THP       10856    -23.1%   8344     will-it-scale.per_thread_ops
>> page_fault2/disable THP       8147     -18.8%   6613     will-it-scale.per_thread_ops
>> brk1/disable THP              957      -7.9%    881      will-it-scale.per_thread_ops
>> context_switch1/disable THP   237006   -2.2%    231907   will-it-scale.per_thread_ops
>> brk1/disable THP              997317   -2.0%    977778   will-it-scale.per_process_ops
>> page_fault3/disable THP       467454   -1.8%    459251   will-it-scale.per_process_ops
>> context_switch1/disable THP   224431   -1.3%    221567   will-it-scale.per_process_ops
>>
>> Notes: for the above test result values, higher is better.
>
> I tried the same tests on my PowerPC victim VM (1024 CPUs, 11TB) and I
> can't get reproducible results. The results have huge variation, even on
> the vanilla kernel, and I can't state any change due to that.
>
> I tried on a smaller node (80 CPUs, 32G), and the tests ran better, but I
> didn't measure any change between the vanilla and the SPF patched kernels:
>
> test THP enabled              4.17.0-rc4-mm1   spf         delta
> page_fault3_threads           2697.7           2683.5      -0.53%
> page_fault2_threads           170660.6         169574.1    -0.64%
> context_switch1_threads       6915269.2        6877507.3   -0.55%
> context_switch1_processes     6478076.2        6529493.5   0.79%
> brk1                          243391.2         238527.5    -2.00%
>
> Tests were run 10 times; no high variation was detected.
>
> Did you see high variation on your side? How many times were the tests run
> to compute the average values?
>
> Thanks,
> Laurent.
>
>
>>
>> 2. Improvements: no improvement found based on the selected test cases.
>>
>>
>> Best regards
>> Haiyan Song
>> ________________________________________
>> From: owner-linux-mm@kvack.org [owner-linux-mm@kvack.org] on behalf of Laurent Dufour [ldufour@linux.vnet.ibm.com]
>> Sent: Monday, May 28, 2018 4:54 PM
>> To: Song, HaiyanX
>> Cc: akpm@linux-foundation.org; mhocko@kernel.org; peterz@infradead.org; kirill@shutemov.name; ak@linux.intel.com; dave@stgolabs.net; jack@suse.cz; Matthew Wilcox; khandual@linux.vnet.ibm.com; aneesh.kumar@linux.vnet.ibm.com; benh@kernel.crashing.org; mpe@ellerman.id.au; paulus@samba.org; Thomas Gleixner; Ingo Molnar; hpa@zytor.com; Will Deacon; Sergey Senozhatsky; sergey.senozhatsky.work@gmail.com; Andrea Arcangeli; Alexei Starovoitov; Wang, Kemi; Daniel Jordan; David Rientjes; Jerome Glisse; Ganesh Mahendran; Minchan Kim; Punit Agrawal; vinayak menon; Yang Shi; linux-kernel@vger.kernel.org; linux-mm@kvack.org; haren@linux.vnet.ibm.com; npiggin@gmail.com; bsingharora@gmail.com; paulmck@linux.vnet.ibm.com; Tim Chen; linuxppc-dev@lists.ozlabs.org; x86@kernel.org
>> Subject: Re: [PATCH v11 00/26] Speculative page faults
>>
>> On 28/05/2018 10:22, Haiyan Song wrote:
>>> Hi Laurent,
>>>
>>> Yes, these tests are done on V9 patch.
>>
>> Do you plan to give this V11 a run ?
>> >>> >>> >>> Best regards, >>> Haiyan Song >>> >>> On Mon, May 28, 2018 at 09:51:34AM +0200, Laurent Dufour wrote: >>>> On 28/05/2018 07:23, Song, HaiyanX wrote: >>>>> >>>>> Some regression and improvements is found by LKP-tools(linux kernel performance) on V9 patch series >>>>> tested on Intel 4s Skylake platform. >>>> >>>> Hi, >>>> >>>> Thanks for reporting this benchmark results, but you mentioned the "V9 patch >>>> series" while responding to the v11 header series... >>>> Were these tests done on v9 or v11 ? >>>> >>>> Cheers, >>>> Laurent. >>>> >>>>> >>>>> The regression result is sorted by the metric will-it-scale.per_thread_ops. >>>>> Branch: Laurent-Dufour/Speculative-page-faults/20180316-151833 (V9 patch series) >>>>> Commit id: >>>>> base commit: d55f34411b1b126429a823d06c3124c16283231f >>>>> head commit: 0355322b3577eeab7669066df42c550a56801110 >>>>> Benchmark suite: will-it-scale >>>>> Download link: >>>>> https://github.com/antonblanchard/will-it-scale/tree/master/tests >>>>> Metrics: >>>>> will-it-scale.per_process_ops=processes/nr_cpu >>>>> will-it-scale.per_thread_ops=threads/nr_cpu >>>>> test box: lkp-skl-4sp1(nr_cpu=192,memory=768G) >>>>> THP: enable / disable >>>>> nr_task: 100% >>>>> >>>>> 1. 
Regressions: >>>>> a) THP enabled: >>>>> testcase base change head metric >>>>> page_fault3/ enable THP 10092 -17.5% 8323 will-it-scale.per_thread_ops >>>>> page_fault2/ enable THP 8300 -17.2% 6869 will-it-scale.per_thread_ops >>>>> brk1/ enable THP 957.67 -7.6% 885 will-it-scale.per_thread_ops >>>>> page_fault3/ enable THP 172821 -5.3% 163692 will-it-scale.per_process_ops >>>>> signal1/ enable THP 9125 -3.2% 8834 will-it-scale.per_process_ops >>>>> >>>>> b) THP disabled: >>>>> testcase base change head metric >>>>> page_fault3/ disable THP 10107 -19.1% 8180 will-it-scale.per_thread_ops >>>>> page_fault2/ disable THP 8432 -17.8% 6931 will-it-scale.per_thread_ops >>>>> context_switch1/ disable THP 215389 -6.8% 200776 will-it-scale.per_thread_ops >>>>> brk1/ disable THP 939.67 -6.6% 877.33 will-it-scale.per_thread_ops >>>>> page_fault3/ disable THP 173145 -4.7% 165064 will-it-scale.per_process_ops >>>>> signal1/ disable THP 9162 -3.9% 8802 will-it-scale.per_process_ops >>>>> >>>>> 2. Improvements: >>>>> a) THP enabled: >>>>> testcase base change head metric >>>>> malloc1/ enable THP 66.33 +469.8% 383.67 will-it-scale.per_thread_ops >>>>> writeseek3/ enable THP 2531 +4.5% 2646 will-it-scale.per_thread_ops >>>>> signal1/ enable THP 989.33 +2.8% 1016 will-it-scale.per_thread_ops >>>>> >>>>> b) THP disabled: >>>>> testcase base change head metric >>>>> malloc1/ disable THP 90.33 +417.3% 467.33 will-it-scale.per_thread_ops >>>>> read2/ disable THP 58934 +39.2% 82060 will-it-scale.per_thread_ops >>>>> page_fault1/ disable THP 8607 +36.4% 11736 will-it-scale.per_thread_ops >>>>> read1/ disable THP 314063 +12.7% 353934 will-it-scale.per_thread_ops >>>>> writeseek3/ disable THP 2452 +12.5% 2759 will-it-scale.per_thread_ops >>>>> signal1/ disable THP 971.33 +5.5% 1024 will-it-scale.per_thread_ops >>>>> >>>>> Notes: for above values in column "change", the higher value means that the related testcase result >>>>> on head commit is better than that on base commit for this 
benchmark.
>>>>>
>>>>>
>>>>> Best regards
>>>>> Haiyan Song
>>>>>
>>>>> ________________________________________
>>>>> From: owner-linux-mm@kvack.org [owner-linux-mm@kvack.org] on behalf of Laurent Dufour [ldufour@linux.vnet.ibm.com]
>>>>> Sent: Thursday, May 17, 2018 7:06 PM
>>>>> To: akpm@linux-foundation.org; mhocko@kernel.org; peterz@infradead.org; kirill@shutemov.name; ak@linux.intel.com; dave@stgolabs.net; jack@suse.cz; Matthew Wilcox; khandual@linux.vnet.ibm.com; aneesh.kumar@linux.vnet.ibm.com; benh@kernel.crashing.org; mpe@ellerman.id.au; paulus@samba.org; Thomas Gleixner; Ingo Molnar; hpa@zytor.com; Will Deacon; Sergey Senozhatsky; sergey.senozhatsky.work@gmail.com; Andrea Arcangeli; Alexei Starovoitov; Wang, Kemi; Daniel Jordan; David Rientjes; Jerome Glisse; Ganesh Mahendran; Minchan Kim; Punit Agrawal; vinayak menon; Yang Shi
>>>>> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org; haren@linux.vnet.ibm.com; npiggin@gmail.com; bsingharora@gmail.com; paulmck@linux.vnet.ibm.com; Tim Chen; linuxppc-dev@lists.ozlabs.org; x86@kernel.org
>>>>> Subject: [PATCH v11 00/26] Speculative page faults
>>>>>
>>>>> This is a port on kernel 4.17 of the work done by Peter Zijlstra to
>>>>> handle page faults without holding the mm semaphore [1].
>>>>>
>>>>> The idea is to try to handle user space page faults without holding
>>>>> the mmap_sem. This should allow better concurrency for massively
>>>>> threaded processes, since the page fault handler will not wait for
>>>>> other threads' memory layout changes to be done, assuming that the
>>>>> change is done in another part of the process's memory space. This
>>>>> type of page fault is named a speculative page fault. If the
>>>>> speculative page fault fails, because a concurrent change is detected
>>>>> or because the underlying PMD or PTE tables are not yet allocated, it
>>>>> aborts its processing and a classic page fault is then tried.
>>>>>
>>>>> The speculative page fault (SPF) has to look for the VMA matching the
>>>>> fault address without holding the mmap_sem; this is done by
>>>>> introducing a rwlock which protects access to the mm_rb tree.
>>>>> Previously this was done using SRCU, but it introduced a lot of
>>>>> scheduling to process the VMAs' freeing operations, which hit the
>>>>> performance by 20% as reported by Kemi Wang [2]. Using a rwlock to
>>>>> protect access to the mm_rb tree limits the locking contention to
>>>>> these operations, which are expected to be O(log n). In addition, to
>>>>> ensure that the VMA is not freed behind our back, a reference count is
>>>>> added and 2 services (get_vma() and put_vma()) are introduced to
>>>>> handle the reference count. Once a VMA is fetched from the RB tree
>>>>> using get_vma(), it must later be freed using put_vma(). With this, I
>>>>> can no longer see the overhead I measured with the will-it-scale
>>>>> benchmark.
>>>>>
>>>>> The VMA's attributes checked during the speculative page fault
>>>>> processing have to be protected against parallel changes. This is
>>>>> done by using a per-VMA sequence lock. This sequence lock allows the
>>>>> speculative page fault handler to quickly check for parallel changes
>>>>> in progress and to abort the speculative page fault in that case.
>>>>>
>>>>> Once the VMA has been found, the speculative page fault handler
>>>>> checks the VMA's attributes to verify whether the page fault can be
>>>>> handled correctly. Thus, the VMA is protected through a sequence lock
>>>>> which allows fast detection of concurrent VMA changes. If such a
>>>>> change is detected, the speculative page fault is aborted and a
>>>>> *classic* page fault is tried. VMA sequence lockings are added where
>>>>> the VMA attributes which are checked during the page fault are
>>>>> modified.
>>>>>
>>>>> When the PTE is fetched, the VMA is checked to see if it has been
>>>>> changed, so once the page table is locked, the VMA is valid, and any
>>>>> other change leading to touching this PTE will need to lock the page
>>>>> table, so no parallel change is possible at this time.
>>>>>
>>>>> The locking of the PTE is done with interrupts disabled; this allows
>>>>> checking the PMD to ensure that there is no ongoing collapsing
>>>>> operation. Since khugepaged first sets the PMD to pmd_none and then
>>>>> waits for the other CPUs to have caught the IPI interrupt, if the pmd
>>>>> is valid at the time the PTE is locked, we have the guarantee that
>>>>> the collapsing operation will have to wait on the PTE lock to move
>>>>> forward. This allows the SPF handler to map the PTE safely. If the
>>>>> PMD value is different from the one recorded at the beginning of the
>>>>> SPF operation, the classic page fault handler will be called to
>>>>> handle the operation while holding the mmap_sem. As the PTE lock is
>>>>> taken with interrupts disabled, the lock is taken using
>>>>> spin_trylock() to avoid deadlock when handling a page fault while a
>>>>> TLB invalidate is requested by another CPU holding the PTE lock.
>>>>>
>>>>> In pseudo code, this could be seen as:
>>>>> speculative_page_fault()
>>>>> {
>>>>>     vma = get_vma()
>>>>>     check vma sequence count
>>>>>     check vma's support
>>>>>     disable interrupt
>>>>>         check pgd,p4d,...,pte
>>>>>         save pmd and pte in vmf
>>>>>         save vma sequence counter in vmf
>>>>>     enable interrupt
>>>>>     check vma sequence count
>>>>>     handle_pte_fault(vma)
>>>>>         ..
>>>>>         page = alloc_page()
>>>>>         pte_map_lock()
>>>>>             disable interrupt
>>>>>                 abort if sequence counter has changed
>>>>>                 abort if pmd or pte has changed
>>>>>                 pte map and lock
>>>>>             enable interrupt
>>>>>         if abort
>>>>>             free page
>>>>>             abort
>>>>>     ...
>>>>> } >>>>> >>>>> arch_fault_handler() >>>>> { >>>>> if (speculative_page_fault(&vma)) >>>>> goto done >>>>> again: >>>>> lock(mmap_sem) >>>>> vma = find_vma(); >>>>> handle_pte_fault(vma); >>>>> if retry >>>>> unlock(mmap_sem) >>>>> goto again; >>>>> done: >>>>> handle fault error >>>>> } >>>>> >>>>> Support for THP is not done because when checking for the PMD, we can be >>>>> confused by an in progress collapsing operation done by khugepaged. The >>>>> issue is that pmd_none() could be true either if the PMD is not already >>>>> populated or if the underlying PTE are in the way to be collapsed. So we >>>>> cannot safely allocate a PMD if pmd_none() is true. >>>>> >>>>> This series add a new software performance event named 'speculative-faults' >>>>> or 'spf'. It counts the number of successful page fault event handled >>>>> speculatively. When recording 'faults,spf' events, the faults one is >>>>> counting the total number of page fault events while 'spf' is only counting >>>>> the part of the faults processed speculatively. >>>>> >>>>> There are some trace events introduced by this series. They allow >>>>> identifying why the page faults were not processed speculatively. This >>>>> doesn't take in account the faults generated by a monothreaded process >>>>> which directly processed while holding the mmap_sem. This trace events are >>>>> grouped in a system named 'pagefault', they are: >>>>> - pagefault:spf_vma_changed : if the VMA has been changed in our back >>>>> - pagefault:spf_vma_noanon : the vma->anon_vma field was not yet set. >>>>> - pagefault:spf_vma_notsup : the VMA's type is not supported >>>>> - pagefault:spf_vma_access : the VMA's access right are not respected >>>>> - pagefault:spf_pmd_changed : the upper PMD pointer has changed in our >>>>> back. 
>>>>> >>>>> To record all the related events, the easier is to run perf with the >>>>> following arguments : >>>>> $ perf stat -e 'faults,spf,pagefault:*' <command> >>>>> >>>>> There is also a dedicated vmstat counter showing the number of successful >>>>> page fault handled speculatively. I can be seen this way: >>>>> $ grep speculative_pgfault /proc/vmstat >>>>> >>>>> This series builds on top of v4.16-mmotm-2018-04-13-17-28 and is functional >>>>> on x86, PowerPC and arm64. >>>>> >>>>> --------------------- >>>>> Real Workload results >>>>> >>>>> As mentioned in previous email, we did non official runs using a "popular >>>>> in memory multithreaded database product" on 176 cores SMT8 Power system >>>>> which showed a 30% improvements in the number of transaction processed per >>>>> second. This run has been done on the v6 series, but changes introduced in >>>>> this new version should not impact the performance boost seen. >>>>> >>>>> Here are the perf data captured during 2 of these runs on top of the v8 >>>>> series: >>>>> vanilla spf >>>>> faults 89.418 101.364 +13% >>>>> spf n/a 97.989 >>>>> >>>>> With the SPF kernel, most of the page fault were processed in a speculative >>>>> way. >>>>> >>>>> Ganesh Mahendran had backported the series on top of a 4.9 kernel and gave >>>>> it a try on an android device. He reported that the application launch time >>>>> was improved in average by 6%, and for large applications (~100 threads) by >>>>> 20%. 
>>>>> >>>>> Here are the launch time Ganesh mesured on Android 8.0 on top of a Qcom >>>>> MSM845 (8 cores) with 6GB (the less is better): >>>>> >>>>> Application 4.9 4.9+spf delta >>>>> com.tencent.mm 416 389 -7% >>>>> com.eg.android.AlipayGphone 1135 986 -13% >>>>> com.tencent.mtt 455 454 0% >>>>> com.qqgame.hlddz 1497 1409 -6% >>>>> com.autonavi.minimap 711 701 -1% >>>>> com.tencent.tmgp.sgame 788 748 -5% >>>>> com.immomo.momo 501 487 -3% >>>>> com.tencent.peng 2145 2112 -2% >>>>> com.smile.gifmaker 491 461 -6% >>>>> com.baidu.BaiduMap 479 366 -23% >>>>> com.taobao.taobao 1341 1198 -11% >>>>> com.baidu.searchbox 333 314 -6% >>>>> com.tencent.mobileqq 394 384 -3% >>>>> com.sina.weibo 907 906 0% >>>>> com.youku.phone 816 731 -11% >>>>> com.happyelements.AndroidAnimal.qq 763 717 -6% >>>>> com.UCMobile 415 411 -1% >>>>> com.tencent.tmgp.ak 1464 1431 -2% >>>>> com.tencent.qqmusic 336 329 -2% >>>>> com.sankuai.meituan 1661 1302 -22% >>>>> com.netease.cloudmusic 1193 1200 1% >>>>> air.tv.douyu.android 4257 4152 -2% >>>>> >>>>> ------------------ >>>>> Benchmarks results >>>>> >>>>> Base kernel is v4.17.0-rc4-mm1 >>>>> SPF is BASE + this series >>>>> >>>>> Kernbench: >>>>> ---------- >>>>> Here are the results on a 16 CPUs X86 guest using kernbench on a 4.15 >>>>> kernel (kernel is build 5 times): >>>>> >>>>> Average Half load -j 8 >>>>> Run (std deviation) >>>>> BASE SPF >>>>> Elapsed Time 1448.65 (5.72312) 1455.84 (4.84951) 0.50% >>>>> User Time 10135.4 (30.3699) 10148.8 (31.1252) 0.13% >>>>> System Time 900.47 (2.81131) 923.28 (7.52779) 2.53% >>>>> Percent CPU 761.4 (1.14018) 760.2 (0.447214) -0.16% >>>>> Context Switches 85380 (3419.52) 84748 (1904.44) -0.74% >>>>> Sleeps 105064 (1240.96) 105074 (337.612) 0.01% >>>>> >>>>> Average Optimal load -j 16 >>>>> Run (std deviation) >>>>> BASE SPF >>>>> Elapsed Time 920.528 (10.1212) 927.404 (8.91789) 0.75% >>>>> User Time 11064.8 (981.142) 11085 (990.897) 0.18% >>>>> System Time 979.904 (84.0615) 1001.14 (82.5523) 2.17% 
>>>>> Percent CPU 1089.5 (345.894) 1086.1 (343.545) -0.31% >>>>> Context Switches 159488 (78156.4) 158223 (77472.1) -0.79% >>>>> Sleeps 110566 (5877.49) 110388 (5617.75) -0.16% >>>>> >>>>> >>>>> During a run on the SPF, perf events were captured: >>>>> Performance counter stats for '../kernbench -M': >>>>> 526743764 faults >>>>> 210 spf >>>>> 3 pagefault:spf_vma_changed >>>>> 0 pagefault:spf_vma_noanon >>>>> 2278 pagefault:spf_vma_notsup >>>>> 0 pagefault:spf_vma_access >>>>> 0 pagefault:spf_pmd_changed >>>>> >>>>> Very few speculative page faults were recorded as most of the processes >>>>> involved are monothreaded (sounds that on this architecture some threads >>>>> were created during the kernel build processing). >>>>> >>>>> Here are the kerbench results on a 80 CPUs Power8 system: >>>>> >>>>> Average Half load -j 40 >>>>> Run (std deviation) >>>>> BASE SPF >>>>> Elapsed Time 117.152 (0.774642) 117.166 (0.476057) 0.01% >>>>> User Time 4478.52 (24.7688) 4479.76 (9.08555) 0.03% >>>>> System Time 131.104 (0.720056) 134.04 (0.708414) 2.24% >>>>> Percent CPU 3934 (19.7104) 3937.2 (19.0184) 0.08% >>>>> Context Switches 92125.4 (576.787) 92581.6 (198.622) 0.50% >>>>> Sleeps 317923 (652.499) 318469 (1255.59) 0.17% >>>>> >>>>> Average Optimal load -j 80 >>>>> Run (std deviation) >>>>> BASE SPF >>>>> Elapsed Time 107.73 (0.632416) 107.31 (0.584936) -0.39% >>>>> User Time 5869.86 (1466.72) 5871.71 (1467.27) 0.03% >>>>> System Time 153.728 (23.8573) 157.153 (24.3704) 2.23% >>>>> Percent CPU 5418.6 (1565.17) 5436.7 (1580.91) 0.33% >>>>> Context Switches 223861 (138865) 225032 (139632) 0.52% >>>>> Sleeps 330529 (13495.1) 332001 (14746.2) 0.45% >>>>> >>>>> During a run on the SPF, perf events were captured: >>>>> Performance counter stats for '../kernbench -M': >>>>> 116730856 faults >>>>> 0 spf >>>>> 3 pagefault:spf_vma_changed >>>>> 0 pagefault:spf_vma_noanon >>>>> 476 pagefault:spf_vma_notsup >>>>> 0 pagefault:spf_vma_access >>>>> 0 pagefault:spf_pmd_changed >>>>> >>>>> 
Most of the processes involved are monothreaded so SPF is not activated but >>>>> there is no impact on the performance. >>>>> >>>>> Ebizzy: >>>>> ------- >>>>> The test counts the number of records per second it can manage; the >>>>> higher the better. I ran it like this: 'ebizzy -mTt <nrcpus>'. To get >>>>> consistent results I repeated the test 100 times and measured the average >>>>> result. The number is records processed per second; the higher the >>>>> better. >>>>> >>>>> BASE SPF delta >>>>> 16 CPUs x86 VM 742.57 1490.24 100.69% >>>>> 80 CPUs P8 node 13105.4 24174.23 84.46% >>>>> >>>>> Here are the performance counters read during a run on a 16 CPUs x86 VM: >>>>> Performance counter stats for './ebizzy -mTt 16': >>>>> 1706379 faults >>>>> 1674599 spf >>>>> 30588 pagefault:spf_vma_changed >>>>> 0 pagefault:spf_vma_noanon >>>>> 363 pagefault:spf_vma_notsup >>>>> 0 pagefault:spf_vma_access >>>>> 0 pagefault:spf_pmd_changed >>>>> >>>>> And the ones captured during a run on an 80 CPUs Power node: >>>>> Performance counter stats for './ebizzy -mTt 80': >>>>> 1874773 faults >>>>> 1461153 spf >>>>> 413293 pagefault:spf_vma_changed >>>>> 0 pagefault:spf_vma_noanon >>>>> 200 pagefault:spf_vma_notsup >>>>> 0 pagefault:spf_vma_access >>>>> 0 pagefault:spf_pmd_changed >>>>> >>>>> In ebizzy's case most of the page faults were handled in a speculative way, >>>>> leading to the ebizzy performance boost. >>>>> >>>>> ------------------ >>>>> Changes since v10 (https://lkml.org/lkml/2018/4/17/572): >>>>> - Accounted for all review feedback from Punit Agrawal, Ganesh Mahendran >>>>> and Minchan Kim, hopefully. >>>>> - Remove unneeded check on CONFIG_SPECULATIVE_PAGE_FAULT in >>>>> __do_page_fault(). >>>>> - Loop in pte_spinlock() and pte_map_lock() when pte try lock fails >>>>> instead >>>>> of aborting the speculative page fault handling. Dropping the now >>>>> useless >>>>> trace event pagefault:spf_pte_lock.
>>>>> - No more try to reuse the fetched VMA during the speculative page fault >>>>> handling when retrying is needed. This adds a lot of complexity and >>>>> additional tests done didn't show a significant performance improvement. >>>>> - Convert IS_ENABLED(CONFIG_NUMA) back to #ifdef due to build error. >>>>> >>>>> [1] http://linux-kernel.2935.n7.nabble.com/RFC-PATCH-0-6-Another-go-at-speculative-page-faults-tt965642.html#none >>>>> [2] https://patchwork.kernel.org/patch/9999687/ >>>>> >>>>> >>>>> Laurent Dufour (20): >>>>> mm: introduce CONFIG_SPECULATIVE_PAGE_FAULT >>>>> x86/mm: define ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT >>>>> powerpc/mm: set ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT >>>>> mm: introduce pte_spinlock for FAULT_FLAG_SPECULATIVE >>>>> mm: make pte_unmap_same compatible with SPF >>>>> mm: introduce INIT_VMA() >>>>> mm: protect VMA modifications using VMA sequence count >>>>> mm: protect mremap() against SPF hanlder >>>>> mm: protect SPF handler against anon_vma changes >>>>> mm: cache some VMA fields in the vm_fault structure >>>>> mm/migrate: Pass vm_fault pointer to migrate_misplaced_page() >>>>> mm: introduce __lru_cache_add_active_or_unevictable >>>>> mm: introduce __vm_normal_page() >>>>> mm: introduce __page_add_new_anon_rmap() >>>>> mm: protect mm_rb tree with a rwlock >>>>> mm: adding speculative page fault failure trace events >>>>> perf: add a speculative page fault sw event >>>>> perf tools: add support for the SPF perf event >>>>> mm: add speculative page fault vmstats >>>>> powerpc/mm: add speculative page fault >>>>> >>>>> Mahendran Ganesh (2): >>>>> arm64/mm: define ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT >>>>> arm64/mm: add speculative page fault >>>>> >>>>> Peter Zijlstra (4): >>>>> mm: prepare for FAULT_FLAG_SPECULATIVE >>>>> mm: VMA sequence count >>>>> mm: provide speculative fault infrastructure >>>>> x86/mm: add speculative pagefault handling >>>>> >>>>> arch/arm64/Kconfig | 1 + >>>>> arch/arm64/mm/fault.c | 12 + >>>>> 
arch/powerpc/Kconfig | 1 + >>>>> arch/powerpc/mm/fault.c | 16 + >>>>> arch/x86/Kconfig | 1 + >>>>> arch/x86/mm/fault.c | 27 +- >>>>> fs/exec.c | 2 +- >>>>> fs/proc/task_mmu.c | 5 +- >>>>> fs/userfaultfd.c | 17 +- >>>>> include/linux/hugetlb_inline.h | 2 +- >>>>> include/linux/migrate.h | 4 +- >>>>> include/linux/mm.h | 136 +++++++- >>>>> include/linux/mm_types.h | 7 + >>>>> include/linux/pagemap.h | 4 +- >>>>> include/linux/rmap.h | 12 +- >>>>> include/linux/swap.h | 10 +- >>>>> include/linux/vm_event_item.h | 3 + >>>>> include/trace/events/pagefault.h | 80 +++++ >>>>> include/uapi/linux/perf_event.h | 1 + >>>>> kernel/fork.c | 5 +- >>>>> mm/Kconfig | 22 ++ >>>>> mm/huge_memory.c | 6 +- >>>>> mm/hugetlb.c | 2 + >>>>> mm/init-mm.c | 3 + >>>>> mm/internal.h | 20 ++ >>>>> mm/khugepaged.c | 5 + >>>>> mm/madvise.c | 6 +- >>>>> mm/memory.c | 612 +++++++++++++++++++++++++++++----- >>>>> mm/mempolicy.c | 51 ++- >>>>> mm/migrate.c | 6 +- >>>>> mm/mlock.c | 13 +- >>>>> mm/mmap.c | 229 ++++++++++--- >>>>> mm/mprotect.c | 4 +- >>>>> mm/mremap.c | 13 + >>>>> mm/nommu.c | 2 +- >>>>> mm/rmap.c | 5 +- >>>>> mm/swap.c | 6 +- >>>>> mm/swap_state.c | 8 +- >>>>> mm/vmstat.c | 5 +- >>>>> tools/include/uapi/linux/perf_event.h | 1 + >>>>> tools/perf/util/evsel.c | 1 + >>>>> tools/perf/util/parse-events.c | 4 + >>>>> tools/perf/util/parse-events.l | 1 + >>>>> tools/perf/util/python.c | 1 + >>>>> 44 files changed, 1161 insertions(+), 211 deletions(-) >>>>> create mode 100644 include/trace/events/pagefault.h >>>>> >>>>> -- >>>>> 2.7.4 >>>>> >>>>> >>>> >>> >> > >
Hi Haiyan, Did you get a chance to capture some performance cycles on your system ? I still can't get these numbers on my hardware. Thanks, Laurent. On 04/07/2018 09:51, Laurent Dufour wrote: > On 04/07/2018 05:23, Song, HaiyanX wrote: >> Hi Laurent, >> >> >> For the test result on the Intel 4s Skylake platform (192 CPUs, 768G Memory), the below test cases were all run 3 times. >> I checked the test results; only page_fault3_thread/enable THP has 6% stddev for the head commit, other tests have lower stddev. > > Repeating the test only 3 times seems a bit too low to me. > > I'll focus on the higher change for the moment, but I don't have access to such > hardware. > > Is it possible to provide a diff between base and SPF of the performance cycles > measured when running page_fault3 and page_fault2 when the 20% change is detected. > > Please stay focused on the test case process to see exactly where the series is > impacting. > > Thanks, > Laurent. > >> >> And I did not find other high variation in the test case results. >> >> a). Enable THP >> testcase base stddev change head stddev metric >> page_fault3/enable THP 10519 ± 3% -20.5% 8368 ±6% will-it-scale.per_thread_ops >> page_fault2/enable THP 8281 ± 2% -18.8% 6728 will-it-scale.per_thread_ops >> brk1/enable THP 998475 -2.2% 976893 will-it-scale.per_process_ops >> context_switch1/enable THP 223910 -1.3% 220930 will-it-scale.per_process_ops >> context_switch1/enable THP 233722 -1.0% 231288 will-it-scale.per_thread_ops >> >> b).
Disable THP >> page_fault3/disable THP 10856 -23.1% 8344 will-it-scale.per_thread_ops >> page_fault2/disable THP 8147 -18.8% 6613 will-it-scale.per_thread_ops >> brk1/disable THP 957 -7.9% 881 will-it-scale.per_thread_ops >> context_switch1/disable THP 237006 -2.2% 231907 will-it-scale.per_thread_ops >> brk1/disable THP 997317 -2.0% 977778 will-it-scale.per_process_ops >> page_fault3/disable THP 467454 -1.8% 459251 will-it-scale.per_process_ops >> context_switch1/disable THP 224431 -1.3% 221567 will-it-scale.per_process_ops >> >> >> Best regards, >> Haiyan Song >> ________________________________________ >> From: Laurent Dufour [ldufour@linux.vnet.ibm.com] >> Sent: Monday, July 02, 2018 4:59 PM >> To: Song, HaiyanX >> Cc: akpm@linux-foundation.org; mhocko@kernel.org; peterz@infradead.org; kirill@shutemov.name; ak@linux.intel.com; dave@stgolabs.net; jack@suse.cz; Matthew Wilcox; khandual@linux.vnet.ibm.com; aneesh.kumar@linux.vnet.ibm.com; benh@kernel.crashing.org; mpe@ellerman.id.au; paulus@samba.org; Thomas Gleixner; Ingo Molnar; hpa@zytor.com; Will Deacon; Sergey Senozhatsky; sergey.senozhatsky.work@gmail.com; Andrea Arcangeli; Alexei Starovoitov; Wang, Kemi; Daniel Jordan; David Rientjes; Jerome Glisse; Ganesh Mahendran; Minchan Kim; Punit Agrawal; vinayak menon; Yang Shi; linux-kernel@vger.kernel.org; linux-mm@kvack.org; haren@linux.vnet.ibm.com; npiggin@gmail.com; bsingharora@gmail.com; paulmck@linux.vnet.ibm.com; Tim Chen; linuxppc-dev@lists.ozlabs.org; x86@kernel.org >> Subject: Re: [PATCH v11 00/26] Speculative page faults >> >> On 11/06/2018 09:49, Song, HaiyanX wrote: >>> Hi Laurent, >>> >>> Regression test for v11 patch serials have been run, some regression is found by LKP-tools (linux kernel performance) >>> tested on Intel 4s skylake platform. This time only test the cases which have been run and found regressions on >>> V9 patch serials. >>> >>> The regression result is sorted by the metric will-it-scale.per_thread_ops. 
>>> branch: Laurent-Dufour/Speculative-page-faults/20180520-045126 >>> commit id: >>> head commit : a7a8993bfe3ccb54ad468b9f1799649e4ad1ff12 >>> base commit : ba98a1cdad71d259a194461b3a61471b49b14df1 >>> Benchmark: will-it-scale >>> Download link: https://github.com/antonblanchard/will-it-scale/tree/master >>> >>> Metrics: >>> will-it-scale.per_process_ops=processes/nr_cpu >>> will-it-scale.per_thread_ops=threads/nr_cpu >>> test box: lkp-skl-4sp1(nr_cpu=192,memory=768G) >>> THP: enable / disable >>> nr_task:100% >>> >>> 1. Regressions: >>> >>> a). Enable THP >>> testcase base change head metric >>> page_fault3/enable THP 10519 -20.5% 8368 will-it-scale.per_thread_ops >>> page_fault2/enable THP 8281 -18.8% 6728 will-it-scale.per_thread_ops >>> brk1/enable THP 998475 -2.2% 976893 will-it-scale.per_process_ops >>> context_switch1/enable THP 223910 -1.3% 220930 will-it-scale.per_process_ops >>> context_switch1/enable THP 233722 -1.0% 231288 will-it-scale.per_thread_ops >>> >>> b). Disable THP >>> page_fault3/disable THP 10856 -23.1% 8344 will-it-scale.per_thread_ops >>> page_fault2/disable THP 8147 -18.8% 6613 will-it-scale.per_thread_ops >>> brk1/disable THP 957 -7.9% 881 will-it-scale.per_thread_ops >>> context_switch1/disable THP 237006 -2.2% 231907 will-it-scale.per_thread_ops >>> brk1/disable THP 997317 -2.0% 977778 will-it-scale.per_process_ops >>> page_fault3/disable THP 467454 -1.8% 459251 will-it-scale.per_process_ops >>> context_switch1/disable THP 224431 -1.3% 221567 will-it-scale.per_process_ops >>> >>> Notes: for the above test result values, the higher is better. >> >> I tried the same tests on my PowerPC victim VM (1024 CPUs, 11TB) and I can't >> get reproducible results. The results have huge variation, even on the vanilla >> kernel, and I can't draw conclusions about any changes due to that.
>> >> I tried on smaller node (80 CPUs, 32G), and the tests ran better, but I didn't >> measure any changes between the vanilla and the SPF patched ones: >> >> test THP enabled 4.17.0-rc4-mm1 spf delta >> page_fault3_threads 2697.7 2683.5 -0.53% >> page_fault2_threads 170660.6 169574.1 -0.64% >> context_switch1_threads 6915269.2 6877507.3 -0.55% >> context_switch1_processes 6478076.2 6529493.5 0.79% >> brk1 243391.2 238527.5 -2.00% >> >> Tests were run 10 times, no high variation detected. >> >> Did you see high variation on your side ? How many times the test were run to >> compute the average values ? >> >> Thanks, >> Laurent. >> >> >>> >>> 2. Improvement: not found improvement based on the selected test cases. >>> >>> >>> Best regards >>> Haiyan Song >>> ________________________________________ >>> From: owner-linux-mm@kvack.org [owner-linux-mm@kvack.org] on behalf of Laurent Dufour [ldufour@linux.vnet.ibm.com] >>> Sent: Monday, May 28, 2018 4:54 PM >>> To: Song, HaiyanX >>> Cc: akpm@linux-foundation.org; mhocko@kernel.org; peterz@infradead.org; kirill@shutemov.name; ak@linux.intel.com; dave@stgolabs.net; jack@suse.cz; Matthew Wilcox; khandual@linux.vnet.ibm.com; aneesh.kumar@linux.vnet.ibm.com; benh@kernel.crashing.org; mpe@ellerman.id.au; paulus@samba.org; Thomas Gleixner; Ingo Molnar; hpa@zytor.com; Will Deacon; Sergey Senozhatsky; sergey.senozhatsky.work@gmail.com; Andrea Arcangeli; Alexei Starovoitov; Wang, Kemi; Daniel Jordan; David Rientjes; Jerome Glisse; Ganesh Mahendran; Minchan Kim; Punit Agrawal; vinayak menon; Yang Shi; linux-kernel@vger.kernel.org; linux-mm@kvack.org; haren@linux.vnet.ibm.com; npiggin@gmail.com; bsingharora@gmail.com; paulmck@linux.vnet.ibm.com; Tim Chen; linuxppc-dev@lists.ozlabs.org; x86@kernel.org >>> Subject: Re: [PATCH v11 00/26] Speculative page faults >>> >>> On 28/05/2018 10:22, Haiyan Song wrote: >>>> Hi Laurent, >>>> >>>> Yes, these tests are done on V9 patch. >>> >>> Do you plan to give this V11 a run ? 
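As an aside on the averaging discussion above, the "change" column in these tables is simply the head-vs-base delta of per-run means. A small sketch recomputing one row (the per-run samples below are hypothetical, chosen only so the means match the page_fault3/disable THP row; they are not the real LKP samples):

```python
import statistics

def summarize(base_runs, head_runs):
    """Mean of each sample set plus the head-vs-base change in percent,
    mirroring the base / change / head columns of the tables above."""
    base_mean = statistics.mean(base_runs)
    head_mean = statistics.mean(head_runs)
    change_pct = (head_mean - base_mean) / base_mean * 100.0
    return base_mean, head_mean, change_pct

# Hypothetical per-run throughput samples (ops/s), not real LKP data;
# picked so the means match the page_fault3/disable THP row.
base = [10650, 10900, 11018]
head = [8200, 8400, 8432]
base_mean, head_mean, change = summarize(base, head)
print(f"base={base_mean:.0f} change={change:+.1f}% head={head_mean:.0f}")
print(f"base stddev={statistics.stdev(base):.0f}")
```

With these samples this prints base=10856 change=-23.1% head=8344, matching the row quoted above.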
>>> >>>> >>>> >>>> Best regards, >>>> Haiyan Song >>>> >>>> On Mon, May 28, 2018 at 09:51:34AM +0200, Laurent Dufour wrote: >>>>> On 28/05/2018 07:23, Song, HaiyanX wrote: >>>>>> >>>>>> Some regression and improvements is found by LKP-tools(linux kernel performance) on V9 patch series >>>>>> tested on Intel 4s Skylake platform. >>>>> >>>>> Hi, >>>>> >>>>> Thanks for reporting this benchmark results, but you mentioned the "V9 patch >>>>> series" while responding to the v11 header series... >>>>> Were these tests done on v9 or v11 ? >>>>> >>>>> Cheers, >>>>> Laurent. >>>>> >>>>>> >>>>>> The regression result is sorted by the metric will-it-scale.per_thread_ops. >>>>>> Branch: Laurent-Dufour/Speculative-page-faults/20180316-151833 (V9 patch series) >>>>>> Commit id: >>>>>> base commit: d55f34411b1b126429a823d06c3124c16283231f >>>>>> head commit: 0355322b3577eeab7669066df42c550a56801110 >>>>>> Benchmark suite: will-it-scale >>>>>> Download link: >>>>>> https://github.com/antonblanchard/will-it-scale/tree/master/tests >>>>>> Metrics: >>>>>> will-it-scale.per_process_ops=processes/nr_cpu >>>>>> will-it-scale.per_thread_ops=threads/nr_cpu >>>>>> test box: lkp-skl-4sp1(nr_cpu=192,memory=768G) >>>>>> THP: enable / disable >>>>>> nr_task: 100% >>>>>> >>>>>> 1. 
Regressions: >>>>>> a) THP enabled: >>>>>> testcase base change head metric >>>>>> page_fault3/ enable THP 10092 -17.5% 8323 will-it-scale.per_thread_ops >>>>>> page_fault2/ enable THP 8300 -17.2% 6869 will-it-scale.per_thread_ops >>>>>> brk1/ enable THP 957.67 -7.6% 885 will-it-scale.per_thread_ops >>>>>> page_fault3/ enable THP 172821 -5.3% 163692 will-it-scale.per_process_ops >>>>>> signal1/ enable THP 9125 -3.2% 8834 will-it-scale.per_process_ops >>>>>> >>>>>> b) THP disabled: >>>>>> testcase base change head metric >>>>>> page_fault3/ disable THP 10107 -19.1% 8180 will-it-scale.per_thread_ops >>>>>> page_fault2/ disable THP 8432 -17.8% 6931 will-it-scale.per_thread_ops >>>>>> context_switch1/ disable THP 215389 -6.8% 200776 will-it-scale.per_thread_ops >>>>>> brk1/ disable THP 939.67 -6.6% 877.33 will-it-scale.per_thread_ops >>>>>> page_fault3/ disable THP 173145 -4.7% 165064 will-it-scale.per_process_ops >>>>>> signal1/ disable THP 9162 -3.9% 8802 will-it-scale.per_process_ops >>>>>> >>>>>> 2. 
Improvements: >>>>>> a) THP enabled: >>>>>> testcase base change head metric >>>>>> malloc1/ enable THP 66.33 +469.8% 383.67 will-it-scale.per_thread_ops >>>>>> writeseek3/ enable THP 2531 +4.5% 2646 will-it-scale.per_thread_ops >>>>>> signal1/ enable THP 989.33 +2.8% 1016 will-it-scale.per_thread_ops >>>>>> >>>>>> b) THP disabled: >>>>>> testcase base change head metric >>>>>> malloc1/ disable THP 90.33 +417.3% 467.33 will-it-scale.per_thread_ops >>>>>> read2/ disable THP 58934 +39.2% 82060 will-it-scale.per_thread_ops >>>>>> page_fault1/ disable THP 8607 +36.4% 11736 will-it-scale.per_thread_ops >>>>>> read1/ disable THP 314063 +12.7% 353934 will-it-scale.per_thread_ops >>>>>> writeseek3/ disable THP 2452 +12.5% 2759 will-it-scale.per_thread_ops >>>>>> signal1/ disable THP 971.33 +5.5% 1024 will-it-scale.per_thread_ops >>>>>> >>>>>> Notes: for above values in column "change", the higher value means that the related testcase result >>>>>> on head commit is better than that on base commit for this benchmark. 
>>>>>> >>>>>> >>>>>> Best regards >>>>>> Haiyan Song >>>>>> >>>>>> ________________________________________ >>>>>> From: owner-linux-mm@kvack.org [owner-linux-mm@kvack.org] on behalf of Laurent Dufour [ldufour@linux.vnet.ibm.com] >>>>>> Sent: Thursday, May 17, 2018 7:06 PM >>>>>> To: akpm@linux-foundation.org; mhocko@kernel.org; peterz@infradead.org; kirill@shutemov.name; ak@linux.intel.com; dave@stgolabs.net; jack@suse.cz; Matthew Wilcox; khandual@linux.vnet.ibm.com; aneesh.kumar@linux.vnet.ibm.com; benh@kernel.crashing.org; mpe@ellerman.id.au; paulus@samba.org; Thomas Gleixner; Ingo Molnar; hpa@zytor.com; Will Deacon; Sergey Senozhatsky; sergey.senozhatsky.work@gmail.com; Andrea Arcangeli; Alexei Starovoitov; Wang, Kemi; Daniel Jordan; David Rientjes; Jerome Glisse; Ganesh Mahendran; Minchan Kim; Punit Agrawal; vinayak menon; Yang Shi >>>>>> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org; haren@linux.vnet.ibm.com; npiggin@gmail.com; bsingharora@gmail.com; paulmck@linux.vnet.ibm.com; Tim Chen; linuxppc-dev@lists.ozlabs.org; x86@kernel.org >>>>>> Subject: [PATCH v11 00/26] Speculative page faults >>>>>> >>>>>> This is a port on kernel 4.17 of the work done by Peter Zijlstra to handle >>>>>> page fault without holding the mm semaphore [1]. >>>>>> >>>>>> The idea is to try to handle user space page faults without holding the >>>>>> mmap_sem. This should allow better concurrency for massively threaded >>>>>> process since the page fault handler will not wait for other threads memory >>>>>> layout change to be done, assuming that this change is done in another part >>>>>> of the process's memory space. This type page fault is named speculative >>>>>> page fault. If the speculative page fault fails because of a concurrency is >>>>>> detected or because underlying PMD or PTE tables are not yet allocating, it >>>>>> is failing its processing and a classic page fault is then tried. 
>>>>>> >>>>>> The speculative page fault (SPF) has to look for the VMA matching the fault >>>>>> address without holding the mmap_sem; this is done by introducing a rwlock >>>>>> which protects the access to the mm_rb tree. Previously this was done using >>>>>> SRCU, but it introduced a lot of scheduling to process the VMA >>>>>> freeing operations, which hit the performance by 20% as reported by >>>>>> Kemi Wang [2]. Using a rwlock to protect access to the mm_rb tree >>>>>> limits the locking contention to these operations, which are expected to >>>>>> be O(log n). In addition, to ensure that the VMA is not freed behind >>>>>> our back, a reference count is added and two services (get_vma() and >>>>>> put_vma()) are introduced to handle the reference count. Once a VMA is >>>>>> fetched from the RB tree using get_vma(), it must later be released using >>>>>> put_vma(). I no longer see the overhead I previously got with the will-it-scale >>>>>> benchmark. >>>>>> >>>>>> The VMA's attributes checked during the speculative page fault processing >>>>>> have to be protected against parallel changes. This is done by using a per >>>>>> VMA sequence lock. This sequence lock allows the speculative page fault >>>>>> handler to quickly check for parallel changes in progress and to abort the >>>>>> speculative page fault in that case. >>>>>> >>>>>> Once the VMA has been found, the speculative page fault handler checks >>>>>> the VMA's attributes to verify whether the page fault can be handled >>>>>> correctly or not. Thus, the VMA is protected through a sequence lock which >>>>>> allows fast detection of concurrent VMA changes. If such a change is >>>>>> detected, the speculative page fault is aborted and a *classic* page fault >>>>>> is tried. VMA sequence locking is added wherever VMA attributes which are >>>>>> checked during the page fault are modified.
>>>>>> >>>>>> When the PTE is fetched, the VMA is checked to see if it has been changed; >>>>>> once the page table is locked, the VMA is known to be valid, and any other change >>>>>> that would touch this PTE needs to lock the page table, so no >>>>>> parallel change is possible at this time. >>>>>> >>>>>> The locking of the PTE is done with interrupts disabled; this allows >>>>>> checking the PMD to ensure that there is no ongoing collapsing >>>>>> operation. Since khugepaged first sets the PMD to pmd_none and then >>>>>> waits for the other CPUs to have caught the IPI interrupt, if the pmd is >>>>>> valid at the time the PTE is locked, we have the guarantee that the >>>>>> collapsing operation will have to wait on the PTE lock to move forward. >>>>>> This allows the SPF handler to map the PTE safely. If the PMD value is >>>>>> different from the one recorded at the beginning of the SPF operation, the >>>>>> classic page fault handler will be called to handle the operation while >>>>>> holding the mmap_sem. As the PTE lock is taken with interrupts disabled, >>>>>> the lock is taken using spin_trylock() to avoid deadlock when handling a >>>>>> page fault while a TLB invalidate is requested by another CPU holding the >>>>>> PTE. >>>>>> >>>>>> In pseudo code, this could be seen as: >>>>>> speculative_page_fault() >>>>>> { >>>>>> vma = get_vma() >>>>>> check vma sequence count >>>>>> check vma's support >>>>>> disable interrupt >>>>>> check pgd,p4d,...,pte >>>>>> save pmd and pte in vmf >>>>>> save vma sequence counter in vmf >>>>>> enable interrupt >>>>>> check vma sequence count >>>>>> handle_pte_fault(vma) >>>>>> .. >>>>>> page = alloc_page() >>>>>> pte_map_lock() >>>>>> disable interrupt >>>>>> abort if sequence counter has changed >>>>>> abort if pmd or pte has changed >>>>>> pte map and lock >>>>>> enable interrupt >>>>>> if abort >>>>>> free page >>>>>> abort >>>>>> ...
>>>>>> } >>>>>> >>>>>> arch_fault_handler() >>>>>> { >>>>>> if (speculative_page_fault(&vma)) >>>>>> goto done >>>>>> again: >>>>>> lock(mmap_sem) >>>>>> vma = find_vma(); >>>>>> handle_pte_fault(vma); >>>>>> if retry >>>>>> unlock(mmap_sem) >>>>>> goto again; >>>>>> done: >>>>>> handle fault error >>>>>> } >>>>>> >>>>>> Support for THP is not done because when checking for the PMD, we can be >>>>>> confused by an in-progress collapsing operation done by khugepaged. The >>>>>> issue is that pmd_none() could be true either if the PMD is not already >>>>>> populated or if the underlying PTEs are about to be collapsed. So we >>>>>> cannot safely allocate a PMD if pmd_none() is true. >>>>>> >>>>>> This series adds a new software performance event named 'speculative-faults' >>>>>> or 'spf'. It counts the number of page fault events successfully handled >>>>>> speculatively. When recording 'faults,spf' events, the 'faults' one counts >>>>>> the total number of page fault events while 'spf' only counts >>>>>> the part of the faults processed speculatively. >>>>>> >>>>>> There are some trace events introduced by this series. They allow >>>>>> identifying why the page faults were not processed speculatively. This >>>>>> doesn't take into account the faults generated by a monothreaded process >>>>>> which are directly processed while holding the mmap_sem. These trace events are >>>>>> grouped in a system named 'pagefault'; they are: >>>>>> - pagefault:spf_vma_changed : the VMA has been changed behind our back >>>>>> - pagefault:spf_vma_noanon : the vma->anon_vma field was not yet set. >>>>>> - pagefault:spf_vma_notsup : the VMA's type is not supported >>>>>> - pagefault:spf_vma_access : the VMA's access rights are not respected >>>>>> - pagefault:spf_pmd_changed : the upper PMD pointer has changed behind our >>>>>> back.
>>>>>> >>>>>> To record all the related events, the easiest way is to run perf with the >>>>>> following arguments : >>>>>> $ perf stat -e 'faults,spf,pagefault:*' <command> >>>>>> >>>>>> There is also a dedicated vmstat counter showing the number of successful >>>>>> page faults handled speculatively. It can be seen this way: >>>>>> $ grep speculative_pgfault /proc/vmstat >>>>>> >>>>>> This series builds on top of v4.16-mmotm-2018-04-13-17-28 and is functional >>>>>> on x86, PowerPC and arm64. >>>>>> >>>>>> --------------------- >>>>>> Real Workload results >>>>>> >>>>>> As mentioned in a previous email, we did unofficial runs using a "popular >>>>>> in-memory multithreaded database product" on a 176-core SMT8 Power system >>>>>> which showed a 30% improvement in the number of transactions processed per >>>>>> second. This run was done on the v6 series, but changes introduced in >>>>>> this new version should not impact the performance boost seen. >>>>>> >>>>>> Here are the perf data captured during 2 of these runs on top of the v8 >>>>>> series: >>>>>> vanilla spf >>>>>> faults 89.418 101.364 +13% >>>>>> spf n/a 97.989 >>>>>> >>>>>> With the SPF kernel, most of the page faults were processed in a speculative >>>>>> way. >>>>>> >>>>>> Ganesh Mahendran backported the series on top of a 4.9 kernel and gave >>>>>> it a try on an Android device. He reported that the application launch time >>>>>> was improved on average by 6%, and for large applications (~100 threads) by >>>>>> 20%.
>>>>>> [...]
Hi Laurent,

I attached the perf-profile.gz files for the page_fault2 and page_fault3 cases.
These files were captured while running the related test cases. Please check
whether these data can help you track down the larger change. Thanks.

The file name perf-profile_page_fault2_head_THP-Always.gz means the perf-profile
result was collected from page_fault2 run on the head commit
(a7a8993bfe3ccb54ad468b9f1799649e4ad1ff12) with the THP-always configuration.

Best regards,
Haiyan Song
On 13/07/2018 05:56, Song, HaiyanX wrote:
> Hi Laurent,

Hi Haiyan,

Thanks a lot for sharing these perf reports. I looked at them closely, and I
have to admit that I was not able to find a major difference between the base
and the head reports, except that handle_pte_fault() is no longer inlined in
the head one.

As expected, __handle_speculative_fault() is never traced since these tests
deal with file mappings, which are not handled the speculative way.

When running these tests, did you see a major difference in the test results
between base and head?

From the number of cycles counted, the biggest difference is page_fault3 when
run with THP enabled:

                               BASE            HEAD            Delta
page_fault2_base_thp_never     1142252426747   1065866197589    -6.69%
page_fault2_base_THP-Always    1124844374523   1076312228927    -4.31%
page_fault3_base_thp_never     1099387298152   1134118402345     3.16%
page_fault3_base_THP-Always    1059370178101    853985561949   -19.39%

The very weird thing is the difference in the delta cycles reported between
THP never and THP always, because the speculative path is aborted when
checking the vma->vm_ops field, which is the same in both cases, and THP is
never checked there. So there is no code coverage difference, on the
speculative path, between these 2 cases. This leads me to think that there
are other interactions interfering with the measurement.

Looking at the perf-profile_page_fault3_*_THP-Always reports, the major
difference at the head of the perf report is the 92% testcase entry, which is
weirdly not reported on the head side:

92.02% 22.33% page_fault3_processes [.] testcase
92.02% testcase

Then the base reported 37.67% for __do_page_fault() where the head reported
48.41%, but the only difference in this function, between base and head, is
the call to handle_speculative_fault(). But this is a macro checking the
fault flags and mm->users and then calling __handle_speculative_fault() if
needed.
So this can't explain this difference, except if __handle_speculative_fault()
is inlined in __do_page_fault(). Is this the case on your build?

Haiyan, do you still have the output of the test to check those numbers too?

Cheers,
Laurent

> I attached the perf-profile.gz file for case page_fault2 and page_fault3. These files were captured during test the related test case.
> Please help to check on these data if it can help you to find the higher change. Thanks.
>
> File name perf-profile_page_fault2_head_THP-Always.gz, means the perf-profile result get from page_fault2
> tested for head commit (a7a8993bfe3ccb54ad468b9f1799649e4ad1ff12) with THP_always configuration.
>
> Best regards,
> Haiyan Song
>
> ________________________________________
> From: owner-linux-mm@kvack.org [owner-linux-mm@kvack.org] on behalf of Laurent Dufour [ldufour@linux.vnet.ibm.com]
> Sent: Thursday, July 12, 2018 1:05 AM
> To: Song, HaiyanX
> Cc: akpm@linux-foundation.org; mhocko@kernel.org; peterz@infradead.org; kirill@shutemov.name; ak@linux.intel.com; dave@stgolabs.net; jack@suse.cz; Matthew Wilcox; khandual@linux.vnet.ibm.com; aneesh.kumar@linux.vnet.ibm.com; benh@kernel.crashing.org; mpe@ellerman.id.au; paulus@samba.org; Thomas Gleixner; Ingo Molnar; hpa@zytor.com; Will Deacon; Sergey Senozhatsky; sergey.senozhatsky.work@gmail.com; Andrea Arcangeli; Alexei Starovoitov; Wang, Kemi; Daniel Jordan; David Rientjes; Jerome Glisse; Ganesh Mahendran; Minchan Kim; Punit Agrawal; vinayak menon; Yang Shi; linux-kernel@vger.kernel.org; linux-mm@kvack.org; haren@linux.vnet.ibm.com; npiggin@gmail.com; bsingharora@gmail.com; paulmck@linux.vnet.ibm.com; Tim Chen; linuxppc-dev@lists.ozlabs.org; x86@kernel.org
> Subject: Re: [PATCH v11 00/26] Speculative page faults
>
> Hi Haiyan,
>
> Do you get a chance to capture some performance cycles on your system ?
> I still can't get these numbers on my hardware.
>
> Thanks,
> Laurent.
> > On 04/07/2018 09:51, Laurent Dufour wrote: >> On 04/07/2018 05:23, Song, HaiyanX wrote: >>> Hi Laurent, >>> >>> >>> For the test result on Intel 4s skylake platform (192 CPUs, 768G Memory), the below test cases all were run 3 times. >>> I check the test results, only page_fault3_thread/enable THP have 6% stddev for head commit, other tests have lower stddev. >> >> Repeating the test only 3 times seems a bit too low to me. >> >> I'll focus on the higher change for the moment, but I don't have access to such >> a hardware. >> >> Is possible to provide a diff between base and SPF of the performance cycles >> measured when running page_fault3 and page_fault2 when the 20% change is detected. >> >> Please stay focus on the test case process to see exactly where the series is >> impacting. >> >> Thanks, >> Laurent. >> >>> >>> And I did not find other high variation on test case result. >>> >>> a). Enable THP >>> testcase base stddev change head stddev metric >>> page_fault3/enable THP 10519 ± 3% -20.5% 8368 ±6% will-it-scale.per_thread_ops >>> page_fault2/enalbe THP 8281 ± 2% -18.8% 6728 will-it-scale.per_thread_ops >>> brk1/eanble THP 998475 -2.2% 976893 will-it-scale.per_process_ops >>> context_switch1/enable THP 223910 -1.3% 220930 will-it-scale.per_process_ops >>> context_switch1/enable THP 233722 -1.0% 231288 will-it-scale.per_thread_ops >>> >>> b). 
Disable THP >>> page_fault3/disable THP 10856 -23.1% 8344 will-it-scale.per_thread_ops >>> page_fault2/disable THP 8147 -18.8% 6613 will-it-scale.per_thread_ops >>> brk1/disable THP 957 -7.9% 881 will-it-scale.per_thread_ops >>> context_switch1/disable THP 237006 -2.2% 231907 will-it-scale.per_thread_ops >>> brk1/disable THP 997317 -2.0% 977778 will-it-scale.per_process_ops >>> page_fault3/disable THP 467454 -1.8% 459251 will-it-scale.per_process_ops >>> context_switch1/disable THP 224431 -1.3% 221567 will-it-scale.per_process_ops >>> >>> >>> Best regards, >>> Haiyan Song >>> ________________________________________ >>> From: Laurent Dufour [ldufour@linux.vnet.ibm.com] >>> Sent: Monday, July 02, 2018 4:59 PM >>> To: Song, HaiyanX >>> Cc: akpm@linux-foundation.org; mhocko@kernel.org; peterz@infradead.org; kirill@shutemov.name; ak@linux.intel.com; dave@stgolabs.net; jack@suse.cz; Matthew Wilcox; khandual@linux.vnet.ibm.com; aneesh.kumar@linux.vnet.ibm.com; benh@kernel.crashing.org; mpe@ellerman.id.au; paulus@samba.org; Thomas Gleixner; Ingo Molnar; hpa@zytor.com; Will Deacon; Sergey Senozhatsky; sergey.senozhatsky.work@gmail.com; Andrea Arcangeli; Alexei Starovoitov; Wang, Kemi; Daniel Jordan; David Rientjes; Jerome Glisse; Ganesh Mahendran; Minchan Kim; Punit Agrawal; vinayak menon; Yang Shi; linux-kernel@vger.kernel.org; linux-mm@kvack.org; haren@linux.vnet.ibm.com; npiggin@gmail.com; bsingharora@gmail.com; paulmck@linux.vnet.ibm.com; Tim Chen; linuxppc-dev@lists.ozlabs.org; x86@kernel.org >>> Subject: Re: [PATCH v11 00/26] Speculative page faults >>> >>> On 11/06/2018 09:49, Song, HaiyanX wrote: >>>> Hi Laurent, >>>> >>>> Regression test for v11 patch serials have been run, some regression is found by LKP-tools (linux kernel performance) >>>> tested on Intel 4s skylake platform. This time only test the cases which have been run and found regressions on >>>> V9 patch serials. >>>> >>>> The regression result is sorted by the metric will-it-scale.per_thread_ops. 
>>>> branch: Laurent-Dufour/Speculative-page-faults/20180520-045126 >>>> commit id: >>>> head commit : a7a8993bfe3ccb54ad468b9f1799649e4ad1ff12 >>>> base commit : ba98a1cdad71d259a194461b3a61471b49b14df1 >>>> Benchmark: will-it-scale >>>> Download link: https://github.com/antonblanchard/will-it-scale/tree/master >>>> >>>> Metrics: >>>> will-it-scale.per_process_ops=processes/nr_cpu >>>> will-it-scale.per_thread_ops=threads/nr_cpu >>>> test box: lkp-skl-4sp1(nr_cpu=192,memory=768G) >>>> THP: enable / disable >>>> nr_task:100% >>>> >>>> 1. Regressions: >>>> >>>> a). Enable THP >>>> testcase base change head metric >>>> page_fault3/enable THP 10519 -20.5% 836 will-it-scale.per_thread_ops >>>> page_fault2/enalbe THP 8281 -18.8% 6728 will-it-scale.per_thread_ops >>>> brk1/eanble THP 998475 -2.2% 976893 will-it-scale.per_process_ops >>>> context_switch1/enable THP 223910 -1.3% 220930 will-it-scale.per_process_ops >>>> context_switch1/enable THP 233722 -1.0% 231288 will-it-scale.per_thread_ops >>>> >>>> b). Disable THP >>>> page_fault3/disable THP 10856 -23.1% 8344 will-it-scale.per_thread_ops >>>> page_fault2/disable THP 8147 -18.8% 6613 will-it-scale.per_thread_ops >>>> brk1/disable THP 957 -7.9% 881 will-it-scale.per_thread_ops >>>> context_switch1/disable THP 237006 -2.2% 231907 will-it-scale.per_thread_ops >>>> brk1/disable THP 997317 -2.0% 977778 will-it-scale.per_process_ops >>>> page_fault3/disable THP 467454 -1.8% 459251 will-it-scale.per_process_ops >>>> context_switch1/disable THP 224431 -1.3% 221567 will-it-scale.per_process_ops >>>> >>>> Notes: for the above values of test result, the higher is better. >>> >>> I tried the same tests on my PowerPC victim VM (1024 CPUs, 11TB) and I can't >>> get reproducible results. The results have huge variation, even on the vanilla >>> kernel, and I can't state on any changes due to that. 
>>> >>> I tried on smaller node (80 CPUs, 32G), and the tests ran better, but I didn't >>> measure any changes between the vanilla and the SPF patched ones: >>> >>> test THP enabled 4.17.0-rc4-mm1 spf delta >>> page_fault3_threads 2697.7 2683.5 -0.53% >>> page_fault2_threads 170660.6 169574.1 -0.64% >>> context_switch1_threads 6915269.2 6877507.3 -0.55% >>> context_switch1_processes 6478076.2 6529493.5 0.79% >>> brk1 243391.2 238527.5 -2.00% >>> >>> Tests were run 10 times, no high variation detected. >>> >>> Did you see high variation on your side ? How many times the test were run to >>> compute the average values ? >>> >>> Thanks, >>> Laurent. >>> >>> >>>> >>>> 2. Improvement: not found improvement based on the selected test cases. >>>> >>>> >>>> Best regards >>>> Haiyan Song >>>> ________________________________________ >>>> From: owner-linux-mm@kvack.org [owner-linux-mm@kvack.org] on behalf of Laurent Dufour [ldufour@linux.vnet.ibm.com] >>>> Sent: Monday, May 28, 2018 4:54 PM >>>> To: Song, HaiyanX >>>> Cc: akpm@linux-foundation.org; mhocko@kernel.org; peterz@infradead.org; kirill@shutemov.name; ak@linux.intel.com; dave@stgolabs.net; jack@suse.cz; Matthew Wilcox; khandual@linux.vnet.ibm.com; aneesh.kumar@linux.vnet.ibm.com; benh@kernel.crashing.org; mpe@ellerman.id.au; paulus@samba.org; Thomas Gleixner; Ingo Molnar; hpa@zytor.com; Will Deacon; Sergey Senozhatsky; sergey.senozhatsky.work@gmail.com; Andrea Arcangeli; Alexei Starovoitov; Wang, Kemi; Daniel Jordan; David Rientjes; Jerome Glisse; Ganesh Mahendran; Minchan Kim; Punit Agrawal; vinayak menon; Yang Shi; linux-kernel@vger.kernel.org; linux-mm@kvack.org; haren@linux.vnet.ibm.com; npiggin@gmail.com; bsingharora@gmail.com; paulmck@linux.vnet.ibm.com; Tim Chen; linuxppc-dev@lists.ozlabs.org; x86@kernel.org >>>> Subject: Re: [PATCH v11 00/26] Speculative page faults >>>> >>>> On 28/05/2018 10:22, Haiyan Song wrote: >>>>> Hi Laurent, >>>>> >>>>> Yes, these tests are done on V9 patch. 
>>>> >>>> Do you plan to give this V11 a run ? >>>> >>>>> >>>>> >>>>> Best regards, >>>>> Haiyan Song >>>>> >>>>> On Mon, May 28, 2018 at 09:51:34AM +0200, Laurent Dufour wrote: >>>>>> On 28/05/2018 07:23, Song, HaiyanX wrote: >>>>>>> >>>>>>> Some regression and improvements is found by LKP-tools(linux kernel performance) on V9 patch series >>>>>>> tested on Intel 4s Skylake platform. >>>>>> >>>>>> Hi, >>>>>> >>>>>> Thanks for reporting this benchmark results, but you mentioned the "V9 patch >>>>>> series" while responding to the v11 header series... >>>>>> Were these tests done on v9 or v11 ? >>>>>> >>>>>> Cheers, >>>>>> Laurent. >>>>>> >>>>>>> >>>>>>> The regression result is sorted by the metric will-it-scale.per_thread_ops. >>>>>>> Branch: Laurent-Dufour/Speculative-page-faults/20180316-151833 (V9 patch series) >>>>>>> Commit id: >>>>>>> base commit: d55f34411b1b126429a823d06c3124c16283231f >>>>>>> head commit: 0355322b3577eeab7669066df42c550a56801110 >>>>>>> Benchmark suite: will-it-scale >>>>>>> Download link: >>>>>>> https://github.com/antonblanchard/will-it-scale/tree/master/tests >>>>>>> Metrics: >>>>>>> will-it-scale.per_process_ops=processes/nr_cpu >>>>>>> will-it-scale.per_thread_ops=threads/nr_cpu >>>>>>> test box: lkp-skl-4sp1(nr_cpu=192,memory=768G) >>>>>>> THP: enable / disable >>>>>>> nr_task: 100% >>>>>>> >>>>>>> 1. 
Regressions:
>>>>>>> a) THP enabled:
>>>>>>> testcase                      base     change    head     metric
>>>>>>> page_fault3/ enable THP       10092    -17.5%    8323     will-it-scale.per_thread_ops
>>>>>>> page_fault2/ enable THP       8300     -17.2%    6869     will-it-scale.per_thread_ops
>>>>>>> brk1/ enable THP              957.67   -7.6%     885      will-it-scale.per_thread_ops
>>>>>>> page_fault3/ enable THP       172821   -5.3%     163692   will-it-scale.per_process_ops
>>>>>>> signal1/ enable THP           9125     -3.2%     8834     will-it-scale.per_process_ops
>>>>>>>
>>>>>>> b) THP disabled:
>>>>>>> testcase                      base     change    head     metric
>>>>>>> page_fault3/ disable THP      10107    -19.1%    8180     will-it-scale.per_thread_ops
>>>>>>> page_fault2/ disable THP      8432     -17.8%    6931     will-it-scale.per_thread_ops
>>>>>>> context_switch1/ disable THP  215389   -6.8%     200776   will-it-scale.per_thread_ops
>>>>>>> brk1/ disable THP             939.67   -6.6%     877.33   will-it-scale.per_thread_ops
>>>>>>> page_fault3/ disable THP      173145   -4.7%     165064   will-it-scale.per_process_ops
>>>>>>> signal1/ disable THP          9162     -3.9%     8802     will-it-scale.per_process_ops
>>>>>>>
>>>>>>> 2.
Improvements:
>>>>>>> a) THP enabled:
>>>>>>> testcase                      base     change     head     metric
>>>>>>> malloc1/ enable THP           66.33    +469.8%    383.67   will-it-scale.per_thread_ops
>>>>>>> writeseek3/ enable THP        2531     +4.5%      2646     will-it-scale.per_thread_ops
>>>>>>> signal1/ enable THP           989.33   +2.8%      1016     will-it-scale.per_thread_ops
>>>>>>>
>>>>>>> b) THP disabled:
>>>>>>> testcase                      base     change     head     metric
>>>>>>> malloc1/ disable THP          90.33    +417.3%    467.33   will-it-scale.per_thread_ops
>>>>>>> read2/ disable THP            58934    +39.2%     82060    will-it-scale.per_thread_ops
>>>>>>> page_fault1/ disable THP      8607     +36.4%     11736    will-it-scale.per_thread_ops
>>>>>>> read1/ disable THP            314063   +12.7%     353934   will-it-scale.per_thread_ops
>>>>>>> writeseek3/ disable THP       2452     +12.5%     2759     will-it-scale.per_thread_ops
>>>>>>> signal1/ disable THP          971.33   +5.5%      1024     will-it-scale.per_thread_ops
>>>>>>>
>>>>>>> Notes: for the above values in the "change" column, a higher value means that the related testcase result
>>>>>>> on the head commit is better than that on the base commit for this benchmark.
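For clarity, the "change" column in the tables above is the relative difference between the head and base results. The following is a minimal sketch of that computation; the helper name is made up here and the formula is inferred from the reported numbers rather than taken from the LKP tooling, so small rounding differences against the table are expected.

```python
# Hypothetical helper (not part of LKP-tools): reproduce the "change"
# column as the relative difference of head vs. base, in percent.
def percent_change(base: float, head: float) -> float:
    """(head - base) / base, expressed in percent."""
    return (head - base) / base * 100.0

# Two rows from the tables above, one improvement and one regression:
malloc1_change = percent_change(90.33, 467.33)      # matches +417.3% up to rounding
page_fault3_change = percent_change(10107, 8180)    # matches -19.1% up to rounding
```

A positive result means the head (SPF) kernel scored higher on that testcase; a negative result is a regression.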
>>>>>>> >>>>>>> >>>>>>> Best regards >>>>>>> Haiyan Song >>>>>>> >>>>>>> ________________________________________ >>>>>>> From: owner-linux-mm@kvack.org [owner-linux-mm@kvack.org] on behalf of Laurent Dufour [ldufour@linux.vnet.ibm.com] >>>>>>> Sent: Thursday, May 17, 2018 7:06 PM >>>>>>> To: akpm@linux-foundation.org; mhocko@kernel.org; peterz@infradead.org; kirill@shutemov.name; ak@linux.intel.com; dave@stgolabs.net; jack@suse.cz; Matthew Wilcox; khandual@linux.vnet.ibm.com; aneesh.kumar@linux.vnet.ibm.com; benh@kernel.crashing.org; mpe@ellerman.id.au; paulus@samba.org; Thomas Gleixner; Ingo Molnar; hpa@zytor.com; Will Deacon; Sergey Senozhatsky; sergey.senozhatsky.work@gmail.com; Andrea Arcangeli; Alexei Starovoitov; Wang, Kemi; Daniel Jordan; David Rientjes; Jerome Glisse; Ganesh Mahendran; Minchan Kim; Punit Agrawal; vinayak menon; Yang Shi >>>>>>> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org; haren@linux.vnet.ibm.com; npiggin@gmail.com; bsingharora@gmail.com; paulmck@linux.vnet.ibm.com; Tim Chen; linuxppc-dev@lists.ozlabs.org; x86@kernel.org >>>>>>> Subject: [PATCH v11 00/26] Speculative page faults >>>>>>> >>>>>>> This is a port on kernel 4.17 of the work done by Peter Zijlstra to handle >>>>>>> page fault without holding the mm semaphore [1]. >>>>>>> >>>>>>> The idea is to try to handle user space page faults without holding the >>>>>>> mmap_sem. This should allow better concurrency for massively threaded >>>>>>> process since the page fault handler will not wait for other threads memory >>>>>>> layout change to be done, assuming that this change is done in another part >>>>>>> of the process's memory space. This type page fault is named speculative >>>>>>> page fault. If the speculative page fault fails because of a concurrency is >>>>>>> detected or because underlying PMD or PTE tables are not yet allocating, it >>>>>>> is failing its processing and a classic page fault is then tried. 
>>>>>>>
>>>>>>> The speculative page fault (SPF) handler has to look up the VMA matching the
>>>>>>> fault address without holding the mmap_sem. This is done by introducing a
>>>>>>> rwlock which protects access to the mm_rb tree. Previously this was done
>>>>>>> using SRCU, but that introduced a lot of scheduling work to process the VMA
>>>>>>> freeing operations, which hurt performance by 20% as reported by Kemi Wang
>>>>>>> [2]. Using a rwlock to protect access to the mm_rb tree limits the locking
>>>>>>> contention to these operations, which are expected to be O(log n). In
>>>>>>> addition, to ensure that the VMA is not freed behind our back, a reference
>>>>>>> count is added, and 2 services (get_vma() and put_vma()) are introduced to
>>>>>>> handle it. Once a VMA is fetched from the RB tree using get_vma(), it must
>>>>>>> later be released using put_vma(). With this scheme I no longer see the
>>>>>>> overhead I previously got with the will-it-scale benchmark.
>>>>>>>
>>>>>>> The VMA attributes checked during the speculative page fault processing have
>>>>>>> to be protected against parallel changes. This is done by using a per-VMA
>>>>>>> sequence lock, which allows the speculative page fault handler to quickly
>>>>>>> check for parallel changes in progress and to abort the speculative page
>>>>>>> fault in that case.
>>>>>>>
>>>>>>> Once the VMA has been found, the speculative page fault handler checks the
>>>>>>> VMA's attributes to verify whether the page fault can be handled this way.
>>>>>>> The VMA is protected through the sequence lock, which allows fast detection
>>>>>>> of concurrent VMA changes; if such a change is detected, the speculative
>>>>>>> page fault is aborted and a *classic* page fault is tried instead. VMA
>>>>>>> sequence locking is added wherever VMA attributes which are checked during
>>>>>>> the page fault are modified.
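The per-VMA sequence count described above can be modeled in a few lines. This is a toy userspace sketch, not code from the series: the SeqCount class, its method names, and speculative_fault() are all invented for illustration, and the begin/retry pair omits the odd/even "writer in progress" check a real Linux seqcount performs.

```python
# Toy model of a sequence-count-validated speculative read: snapshot the
# counter, do the work without the heavy lock, and fall back to the slow
# (classic) path if a concurrent writer bumped the counter in between.
import threading

class SeqCount:
    def __init__(self):
        self._seq = 0
        self._lock = threading.Lock()   # serializes writers only

    def read_begin(self) -> int:
        return self._seq                # snapshot taken lock-free

    def read_retry(self, start: int) -> bool:
        # True if a writer ran since read_begin(), i.e. the snapshot is stale.
        return self._seq != start

    def write(self, update) -> None:
        with self._lock:
            self._seq += 1              # mark write in progress
            update()                    # mutate the protected state
            self._seq += 1              # mark write done

def speculative_fault(vma_seq: SeqCount, handle) -> str:
    seq = vma_seq.read_begin()
    result = handle()                   # speculative work, no mmap_sem held
    if vma_seq.read_retry(seq):
        return "fallback"               # concurrent VMA change: classic path
    return result
```

A reader racing with a writer sees the counter move and falls back, which mirrors the abort-to-classic-page-fault behavior described above.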
>>>>>>>
>>>>>>> When the PTE is fetched, the VMA is checked to see whether it has been
>>>>>>> changed, so once the page table is locked the VMA is known to be valid. Any
>>>>>>> other change touching this PTE will need to take the page table lock, so no
>>>>>>> parallel change is possible at this time.
>>>>>>>
>>>>>>> The locking of the PTE is done with interrupts disabled; this allows
>>>>>>> checking the PMD to ensure that there is no ongoing collapsing operation.
>>>>>>> Since khugepaged first sets the PMD to pmd_none and then waits for the other
>>>>>>> CPUs to have caught the IPI, if the PMD is valid at the time the PTE is
>>>>>>> locked, we have the guarantee that the collapsing operation will have to
>>>>>>> wait on the PTE lock to move forward. This allows the SPF handler to map the
>>>>>>> PTE safely. If the PMD value is different from the one recorded at the
>>>>>>> beginning of the SPF operation, the classic page fault handler will be
>>>>>>> called to handle the operation while holding the mmap_sem. As the PTE is
>>>>>>> locked with interrupts disabled, the lock is taken using spin_trylock() to
>>>>>>> avoid deadlock when handling a page fault while a TLB invalidate is
>>>>>>> requested by another CPU holding the PTE lock.
>>>>>>>
>>>>>>> In pseudo code, this could be seen as:
>>>>>>>
>>>>>>> speculative_page_fault()
>>>>>>> {
>>>>>>>     vma = get_vma()
>>>>>>>     check vma sequence count
>>>>>>>     check vma's support
>>>>>>>     disable interrupt
>>>>>>>         check pgd,p4d,...,pte
>>>>>>>         save pmd and pte in vmf
>>>>>>>         save vma sequence counter in vmf
>>>>>>>     enable interrupt
>>>>>>>     check vma sequence count
>>>>>>>     handle_pte_fault(vma)
>>>>>>>         ..
>>>>>>>         page = alloc_page()
>>>>>>>         pte_map_lock()
>>>>>>>             disable interrupt
>>>>>>>                 abort if sequence counter has changed
>>>>>>>                 abort if pmd or pte has changed
>>>>>>>                 pte map and lock
>>>>>>>             enable interrupt
>>>>>>>         if abort
>>>>>>>             free page
>>>>>>>             abort
>>>>>>>     ...
>>>>>>> }
>>>>>>>
>>>>>>> arch_fault_handler()
>>>>>>> {
>>>>>>>     if (speculative_page_fault(&vma))
>>>>>>>         goto done
>>>>>>> again:
>>>>>>>     lock(mmap_sem)
>>>>>>>     vma = find_vma();
>>>>>>>     handle_pte_fault(vma);
>>>>>>>     if retry
>>>>>>>         unlock(mmap_sem)
>>>>>>>         goto again;
>>>>>>> done:
>>>>>>>     handle fault error
>>>>>>> }
>>>>>>>
>>>>>>> Support for THP is not done, because when checking the PMD we could be
>>>>>>> confused by an in-progress collapsing operation done by khugepaged. The
>>>>>>> issue is that pmd_none() could be true either if the PMD is not already
>>>>>>> populated or if the underlying PTEs are in the way of being collapsed, so we
>>>>>>> cannot safely allocate a PMD if pmd_none() is true.
>>>>>>>
>>>>>>> This series adds a new software performance event named 'speculative-faults'
>>>>>>> or 'spf'. It counts the number of successful page fault events handled
>>>>>>> speculatively. When recording 'faults,spf' events, the 'faults' one counts
>>>>>>> the total number of page fault events while 'spf' only counts the part of
>>>>>>> the faults processed speculatively.
>>>>>>>
>>>>>>> There are some trace events introduced by this series. They allow
>>>>>>> identifying why the page faults were not processed speculatively. This
>>>>>>> doesn't take into account the faults generated by a monothreaded process,
>>>>>>> which are directly processed while holding the mmap_sem. These trace events
>>>>>>> are grouped in a system named 'pagefault'; they are:
>>>>>>> - pagefault:spf_vma_changed : the VMA has been changed behind our back
>>>>>>> - pagefault:spf_vma_noanon  : the vma->anon_vma field was not yet set
>>>>>>> - pagefault:spf_vma_notsup  : the VMA's type is not supported
>>>>>>> - pagefault:spf_vma_access  : the VMA's access rights are not respected
>>>>>>> - pagefault:spf_pmd_changed : the upper PMD pointer has changed behind our
>>>>>>> back.
>>>>>>>
>>>>>>> To record all the related events, the easiest way is to run perf with the
>>>>>>> following arguments:
>>>>>>> $ perf stat -e 'faults,spf,pagefault:*' <command>
>>>>>>>
>>>>>>> There is also a dedicated vmstat counter showing the number of successful
>>>>>>> page faults handled speculatively. It can be seen this way:
>>>>>>> $ grep speculative_pgfault /proc/vmstat
>>>>>>>
>>>>>>> This series builds on top of v4.16-mmotm-2018-04-13-17-28 and is functional
>>>>>>> on x86, PowerPC and arm64.
>>>>>>>
>>>>>>> ---------------------
>>>>>>> Real Workload results
>>>>>>>
>>>>>>> As mentioned in a previous email, we did non-official runs using a "popular
>>>>>>> in memory multithreaded database product" on a 176 cores SMT8 Power system
>>>>>>> which showed a 30% improvement in the number of transactions processed per
>>>>>>> second. This run has been done on the v6 series, but changes introduced in
>>>>>>> this new version should not impact the performance boost seen.
>>>>>>>
>>>>>>> Here are the perf data captured during 2 of these runs on top of the v8
>>>>>>> series:
>>>>>>>                 vanilla     spf
>>>>>>> faults          89.418      101.364    +13%
>>>>>>> spf             n/a         97.989
>>>>>>>
>>>>>>> With the SPF kernel, most of the page faults were processed in a speculative
>>>>>>> way.
>>>>>>>
>>>>>>> Ganesh Mahendran had backported the series on top of a 4.9 kernel and gave
>>>>>>> it a try on an android device. He reported that the application launch time
>>>>>>> was improved on average by 6%, and for large applications (~100 threads) by
>>>>>>> 20%.
>>>>>>> >>>>>>> Here are the launch time Ganesh mesured on Android 8.0 on top of a Qcom >>>>>>> MSM845 (8 cores) with 6GB (the less is better): >>>>>>> >>>>>>> Application 4.9 4.9+spf delta >>>>>>> com.tencent.mm 416 389 -7% >>>>>>> com.eg.android.AlipayGphone 1135 986 -13% >>>>>>> com.tencent.mtt 455 454 0% >>>>>>> com.qqgame.hlddz 1497 1409 -6% >>>>>>> com.autonavi.minimap 711 701 -1% >>>>>>> com.tencent.tmgp.sgame 788 748 -5% >>>>>>> com.immomo.momo 501 487 -3% >>>>>>> com.tencent.peng 2145 2112 -2% >>>>>>> com.smile.gifmaker 491 461 -6% >>>>>>> com.baidu.BaiduMap 479 366 -23% >>>>>>> com.taobao.taobao 1341 1198 -11% >>>>>>> com.baidu.searchbox 333 314 -6% >>>>>>> com.tencent.mobileqq 394 384 -3% >>>>>>> com.sina.weibo 907 906 0% >>>>>>> com.youku.phone 816 731 -11% >>>>>>> com.happyelements.AndroidAnimal.qq 763 717 -6% >>>>>>> com.UCMobile 415 411 -1% >>>>>>> com.tencent.tmgp.ak 1464 1431 -2% >>>>>>> com.tencent.qqmusic 336 329 -2% >>>>>>> com.sankuai.meituan 1661 1302 -22% >>>>>>> com.netease.cloudmusic 1193 1200 1% >>>>>>> air.tv.douyu.android 4257 4152 -2% >>>>>>> >>>>>>> ------------------ >>>>>>> Benchmarks results >>>>>>> >>>>>>> Base kernel is v4.17.0-rc4-mm1 >>>>>>> SPF is BASE + this series >>>>>>> >>>>>>> Kernbench: >>>>>>> ---------- >>>>>>> Here are the results on a 16 CPUs X86 guest using kernbench on a 4.15 >>>>>>> kernel (kernel is build 5 times): >>>>>>> >>>>>>> Average Half load -j 8 >>>>>>> Run (std deviation) >>>>>>> BASE SPF >>>>>>> Elapsed Time 1448.65 (5.72312) 1455.84 (4.84951) 0.50% >>>>>>> User Time 10135.4 (30.3699) 10148.8 (31.1252) 0.13% >>>>>>> System Time 900.47 (2.81131) 923.28 (7.52779) 2.53% >>>>>>> Percent CPU 761.4 (1.14018) 760.2 (0.447214) -0.16% >>>>>>> Context Switches 85380 (3419.52) 84748 (1904.44) -0.74% >>>>>>> Sleeps 105064 (1240.96) 105074 (337.612) 0.01% >>>>>>> >>>>>>> Average Optimal load -j 16 >>>>>>> Run (std deviation) >>>>>>> BASE SPF >>>>>>> Elapsed Time 920.528 (10.1212) 927.404 (8.91789) 0.75% >>>>>>> User 
Time 11064.8 (981.142) 11085 (990.897) 0.18% >>>>>>> System Time 979.904 (84.0615) 1001.14 (82.5523) 2.17% >>>>>>> Percent CPU 1089.5 (345.894) 1086.1 (343.545) -0.31% >>>>>>> Context Switches 159488 (78156.4) 158223 (77472.1) -0.79% >>>>>>> Sleeps 110566 (5877.49) 110388 (5617.75) -0.16% >>>>>>> >>>>>>> >>>>>>> During a run on the SPF, perf events were captured: >>>>>>> Performance counter stats for '../kernbench -M': >>>>>>> 526743764 faults >>>>>>> 210 spf >>>>>>> 3 pagefault:spf_vma_changed >>>>>>> 0 pagefault:spf_vma_noanon >>>>>>> 2278 pagefault:spf_vma_notsup >>>>>>> 0 pagefault:spf_vma_access >>>>>>> 0 pagefault:spf_pmd_changed >>>>>>> >>>>>>> Very few speculative page faults were recorded as most of the processes >>>>>>> involved are monothreaded (sounds that on this architecture some threads >>>>>>> were created during the kernel build processing). >>>>>>> >>>>>>> Here are the kerbench results on a 80 CPUs Power8 system: >>>>>>> >>>>>>> Average Half load -j 40 >>>>>>> Run (std deviation) >>>>>>> BASE SPF >>>>>>> Elapsed Time 117.152 (0.774642) 117.166 (0.476057) 0.01% >>>>>>> User Time 4478.52 (24.7688) 4479.76 (9.08555) 0.03% >>>>>>> System Time 131.104 (0.720056) 134.04 (0.708414) 2.24% >>>>>>> Percent CPU 3934 (19.7104) 3937.2 (19.0184) 0.08% >>>>>>> Context Switches 92125.4 (576.787) 92581.6 (198.622) 0.50% >>>>>>> Sleeps 317923 (652.499) 318469 (1255.59) 0.17% >>>>>>> >>>>>>> Average Optimal load -j 80 >>>>>>> Run (std deviation) >>>>>>> BASE SPF >>>>>>> Elapsed Time 107.73 (0.632416) 107.31 (0.584936) -0.39% >>>>>>> User Time 5869.86 (1466.72) 5871.71 (1467.27) 0.03% >>>>>>> System Time 153.728 (23.8573) 157.153 (24.3704) 2.23% >>>>>>> Percent CPU 5418.6 (1565.17) 5436.7 (1580.91) 0.33% >>>>>>> Context Switches 223861 (138865) 225032 (139632) 0.52% >>>>>>> Sleeps 330529 (13495.1) 332001 (14746.2) 0.45% >>>>>>> >>>>>>> During a run on the SPF, perf events were captured: >>>>>>> Performance counter stats for '../kernbench -M': >>>>>>> 116730856 faults 
>>>>>>> 0 spf >>>>>>> 3 pagefault:spf_vma_changed >>>>>>> 0 pagefault:spf_vma_noanon >>>>>>> 476 pagefault:spf_vma_notsup >>>>>>> 0 pagefault:spf_vma_access >>>>>>> 0 pagefault:spf_pmd_changed >>>>>>> >>>>>>> Most of the processes involved are monothreaded so SPF is not activated but >>>>>>> there is no impact on the performance. >>>>>>> >>>>>>> Ebizzy: >>>>>>> ------- >>>>>>> The test is counting the number of records per second it can manage, the >>>>>>> higher is the best. I run it like this 'ebizzy -mTt <nrcpus>'. To get >>>>>>> consistent result I repeated the test 100 times and measure the average >>>>>>> result. The number is the record processes per second, the higher is the >>>>>>> best. >>>>>>> >>>>>>> BASE SPF delta >>>>>>> 16 CPUs x86 VM 742.57 1490.24 100.69% >>>>>>> 80 CPUs P8 node 13105.4 24174.23 84.46% >>>>>>> >>>>>>> Here are the performance counter read during a run on a 16 CPUs x86 VM: >>>>>>> Performance counter stats for './ebizzy -mTt 16': >>>>>>> 1706379 faults >>>>>>> 1674599 spf >>>>>>> 30588 pagefault:spf_vma_changed >>>>>>> 0 pagefault:spf_vma_noanon >>>>>>> 363 pagefault:spf_vma_notsup >>>>>>> 0 pagefault:spf_vma_access >>>>>>> 0 pagefault:spf_pmd_changed >>>>>>> >>>>>>> And the ones captured during a run on a 80 CPUs Power node: >>>>>>> Performance counter stats for './ebizzy -mTt 80': >>>>>>> 1874773 faults >>>>>>> 1461153 spf >>>>>>> 413293 pagefault:spf_vma_changed >>>>>>> 0 pagefault:spf_vma_noanon >>>>>>> 200 pagefault:spf_vma_notsup >>>>>>> 0 pagefault:spf_vma_access >>>>>>> 0 pagefault:spf_pmd_changed >>>>>>> >>>>>>> In ebizzy's case most of the page fault were handled in a speculative way, >>>>>>> leading the ebizzy performance boost. >>>>>>> >>>>>>> ------------------ >>>>>>> Changes since v10 (https://lkml.org/lkml/2018/4/17/572): >>>>>>> - Accounted for all review feedbacks from Punit Agrawal, Ganesh Mahendran >>>>>>> and Minchan Kim, hopefully. 
>>>>>>> - Remove unneeded check on CONFIG_SPECULATIVE_PAGE_FAULT in >>>>>>> __do_page_fault(). >>>>>>> - Loop in pte_spinlock() and pte_map_lock() when pte try lock fails >>>>>>> instead >>>>>>> of aborting the speculative page fault handling. Dropping the now >>>>>>> useless >>>>>>> trace event pagefault:spf_pte_lock. >>>>>>> - No more try to reuse the fetched VMA during the speculative page fault >>>>>>> handling when retrying is needed. This adds a lot of complexity and >>>>>>> additional tests done didn't show a significant performance improvement. >>>>>>> - Convert IS_ENABLED(CONFIG_NUMA) back to #ifdef due to build error. >>>>>>> >>>>>>> [1] http://linux-kernel.2935.n7.nabble.com/RFC-PATCH-0-6-Another-go-at-speculative-page-faults-tt965642.html#none >>>>>>> [2] https://patchwork.kernel.org/patch/9999687/ >>>>>>> >>>>>>> >>>>>>> Laurent Dufour (20): >>>>>>> mm: introduce CONFIG_SPECULATIVE_PAGE_FAULT >>>>>>> x86/mm: define ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT >>>>>>> powerpc/mm: set ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT >>>>>>> mm: introduce pte_spinlock for FAULT_FLAG_SPECULATIVE >>>>>>> mm: make pte_unmap_same compatible with SPF >>>>>>> mm: introduce INIT_VMA() >>>>>>> mm: protect VMA modifications using VMA sequence count >>>>>>> mm: protect mremap() against SPF hanlder >>>>>>> mm: protect SPF handler against anon_vma changes >>>>>>> mm: cache some VMA fields in the vm_fault structure >>>>>>> mm/migrate: Pass vm_fault pointer to migrate_misplaced_page() >>>>>>> mm: introduce __lru_cache_add_active_or_unevictable >>>>>>> mm: introduce __vm_normal_page() >>>>>>> mm: introduce __page_add_new_anon_rmap() >>>>>>> mm: protect mm_rb tree with a rwlock >>>>>>> mm: adding speculative page fault failure trace events >>>>>>> perf: add a speculative page fault sw event >>>>>>> perf tools: add support for the SPF perf event >>>>>>> mm: add speculative page fault vmstats >>>>>>> powerpc/mm: add speculative page fault >>>>>>> >>>>>>> Mahendran Ganesh (2): >>>>>>> 
arm64/mm: define ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT >>>>>>> arm64/mm: add speculative page fault >>>>>>> >>>>>>> Peter Zijlstra (4): >>>>>>> mm: prepare for FAULT_FLAG_SPECULATIVE >>>>>>> mm: VMA sequence count >>>>>>> mm: provide speculative fault infrastructure >>>>>>> x86/mm: add speculative pagefault handling >>>>>>> >>>>>>> arch/arm64/Kconfig | 1 + >>>>>>> arch/arm64/mm/fault.c | 12 + >>>>>>> arch/powerpc/Kconfig | 1 + >>>>>>> arch/powerpc/mm/fault.c | 16 + >>>>>>> arch/x86/Kconfig | 1 + >>>>>>> arch/x86/mm/fault.c | 27 +- >>>>>>> fs/exec.c | 2 +- >>>>>>> fs/proc/task_mmu.c | 5 +- >>>>>>> fs/userfaultfd.c | 17 +- >>>>>>> include/linux/hugetlb_inline.h | 2 +- >>>>>>> include/linux/migrate.h | 4 +- >>>>>>> include/linux/mm.h | 136 +++++++- >>>>>>> include/linux/mm_types.h | 7 + >>>>>>> include/linux/pagemap.h | 4 +- >>>>>>> include/linux/rmap.h | 12 +- >>>>>>> include/linux/swap.h | 10 +- >>>>>>> include/linux/vm_event_item.h | 3 + >>>>>>> include/trace/events/pagefault.h | 80 +++++ >>>>>>> include/uapi/linux/perf_event.h | 1 + >>>>>>> kernel/fork.c | 5 +- >>>>>>> mm/Kconfig | 22 ++ >>>>>>> mm/huge_memory.c | 6 +- >>>>>>> mm/hugetlb.c | 2 + >>>>>>> mm/init-mm.c | 3 + >>>>>>> mm/internal.h | 20 ++ >>>>>>> mm/khugepaged.c | 5 + >>>>>>> mm/madvise.c | 6 +- >>>>>>> mm/memory.c | 612 +++++++++++++++++++++++++++++----- >>>>>>> mm/mempolicy.c | 51 ++- >>>>>>> mm/migrate.c | 6 +- >>>>>>> mm/mlock.c | 13 +- >>>>>>> mm/mmap.c | 229 ++++++++++--- >>>>>>> mm/mprotect.c | 4 +- >>>>>>> mm/mremap.c | 13 + >>>>>>> mm/nommu.c | 2 +- >>>>>>> mm/rmap.c | 5 +- >>>>>>> mm/swap.c | 6 +- >>>>>>> mm/swap_state.c | 8 +- >>>>>>> mm/vmstat.c | 5 +- >>>>>>> tools/include/uapi/linux/perf_event.h | 1 + >>>>>>> tools/perf/util/evsel.c | 1 + >>>>>>> tools/perf/util/parse-events.c | 4 + >>>>>>> tools/perf/util/parse-events.l | 1 + >>>>>>> tools/perf/util/python.c | 1 + >>>>>>> 44 files changed, 1161 insertions(+), 211 deletions(-) >>>>>>> create mode 100644 include/trace/events/pagefault.h 
>>>>>>> >>>>>>> -- >>>>>>> 2.7.4 >>>>>>> >>>>>>> >>>>>> >>>>> >>>> >>> >>> >> >
Hi Laurent, Thanks for your analysis of the last perf results. You mentioned that "the major differences at the head of the perf report is the 92% testcase which is weirdly not reported on the head side"; that was a bug in 0-day, and it caused the item not to be counted in perf. I've triggered the tests page_fault2 and page_fault3 again, only with the thread mode of will-it-scale, on 0-day (on the same test box, every case tested 3 times). I checked that the perf reports no longer have the above mentioned problem. I have compared them and found that some items differ, such as the cases below: page_fault2-thp-always: handle_mm_fault, base: 45.22% head: 29.41% page_fault3-thp-always: handle_mm_fault, base: 22.95% head: 14.15% So I attached the perf results to the mail again; could you have another look to check the difference between the base and head commits. Thanks, Haiyan Song
Add another 3 perf files.
On 03/08/2018 08:36, Song, HaiyanX wrote: > Hi Laurent, Hi Haiyan, Sorry for the late answer, I was off for a couple of days. > > Thanks for your analysis of the last perf results. > You mentioned, "the major differences at the head of the perf report is the 92% testcase which is weirdly not reported > on the head side", which is a bug in 0-day, and it caused the item not to be counted in perf. > > I've triggered the tests page_fault2 and page_fault3 again, only with the thread mode of will-it-scale, on 0-day (on the same test box, every case tested 3 times). > I checked that the perf reports no longer have the above mentioned problem. > > I have compared them and found that some items differ, such as the cases below: > page_fault2-thp-always: handle_mm_fault, base: 45.22% head: 29.41% > page_fault3-thp-always: handle_mm_fault, base: 22.95% head: 14.15% These would mean that the system spends less time running handle_mm_fault() when SPF is in the picture in these 2 cases, which is good. This should lead to better results with the SPF series, and I can't find any values higher on the head side. > > So I attached the perf results to the mail again; could you have another look to check the difference between the base and head commits. I took a close look at all the perf results you sent, but I can't identify any major difference. But the compiler optimization is getting rid of the handle_pte_fault() symbol on the base kernel, which adds complexity when checking the differences. To get rid of that, I propose that you apply the attached patch to the spf kernel. This patch allows turning the SPF handler on/off through /proc/sys/vm/speculative_page_fault. This should ease the testing by limiting reboots and avoiding kernel symbol mismatches. Obviously there is still a small overhead due to the check but it should not be noticeable. With this patch applied you can simply run echo 1 > /proc/sys/vm/speculative_page_fault to run a test with the speculative page fault handler activated.
Or run echo 0 > /proc/sys/vm/speculative_page_fault to run a test without it. I'm really sorry for asking this again, but could you please run the test page_fault3_base_THP-Always with and without SPF and capture the perf output. I think we should focus on that test, which showed the biggest regression. Thanks, Laurent. > > Thanks, > Haiyan Song > > ________________________________________ > From: owner-linux-mm@kvack.org [owner-linux-mm@kvack.org] on behalf of Laurent Dufour [ldufour@linux.vnet.ibm.com] > Sent: Tuesday, July 17, 2018 5:36 PM > To: Song, HaiyanX > Cc: akpm@linux-foundation.org; mhocko@kernel.org; peterz@infradead.org; kirill@shutemov.name; ak@linux.intel.com; dave@stgolabs.net; jack@suse.cz; Matthew Wilcox; khandual@linux.vnet.ibm.com; aneesh.kumar@linux.vnet.ibm.com; benh@kernel.crashing.org; mpe@ellerman.id.au; paulus@samba.org; Thomas Gleixner; Ingo Molnar; hpa@zytor.com; Will Deacon; Sergey Senozhatsky; sergey.senozhatsky.work@gmail.com; Andrea Arcangeli; Alexei Starovoitov; Wang, Kemi; Daniel Jordan; David Rientjes; Jerome Glisse; Ganesh Mahendran; Minchan Kim; Punit Agrawal; vinayak menon; Yang Shi; linux-kernel@vger.kernel.org; linux-mm@kvack.org; haren@linux.vnet.ibm.com; npiggin@gmail.com; bsingharora@gmail.com; paulmck@linux.vnet.ibm.com; Tim Chen; linuxppc-dev@lists.ozlabs.org; x86@kernel.org > Subject: Re: [PATCH v11 00/26] Speculative page faults > > On 13/07/2018 05:56, Song, HaiyanX wrote: >> Hi Laurent, > > Hi Haiyan, > > Thanks a lot for sharing these perf reports. > > I looked at them closely, and I have to admit that I was not able to find a > major difference between the base and the head reports, except that > handle_pte_fault() is no longer inlined in the head one. > > As expected, __handle_speculative_fault() is never traced since these tests are > dealing with file mapping, not handled in the speculative way. > > When running these tests did you see major differences in the test results > between base and head ?
> > From the number of cycles counted, the biggest difference is page_fault3 when > run with THP enabled: > BASE HEAD Delta > page_fault2_base_thp_never 1142252426747 1065866197589 -6.69% > page_fault2_base_THP-Always 1124844374523 1076312228927 -4.31% > page_fault3_base_thp_never 1099387298152 1134118402345 3.16% > page_fault3_base_THP-Always 1059370178101 853985561949 -19.39% > > > The very weird thing is the difference of the delta cycles reported between > thp never and thp always, because the speculative way is aborted when checking > for the vma->ops field, which is the same in both cases, and the thp is never > checked. So there is no code coverage difference, on the speculative path, > between these 2 cases. This leads me to think that there are other interactions > interfering with the measurement. > > Looking at the perf-profile_page_fault3_*_THP-Always, the major difference at > the head of the perf report is the 92% testcase which is weirdly not reported > on the head side : > 92.02% 22.33% page_fault3_processes [.] testcase > 92.02% testcase > > Then the base reported 37.67% for __do_page_fault() where the head reported > 48.41%, but the only difference in this function, between base and head, is the > call to handle_speculative_fault(). But this is a macro checking for the fault > flags, and mm->users, and then calling __handle_speculative_fault() if needed. > So this can't explain this difference, except if __handle_speculative_fault() > is inlined in __do_page_fault(). > Is this the case in your build ? > > Haiyan, do you still have the output of the test to check those numbers too ? > > Cheers, > Laurent > >> I attached the perf-profile.gz files for the cases page_fault2 and page_fault3. These files were captured during testing of the related test cases. >> Please help check these data to see if they can help you find the higher change. Thanks.
>> >> File name perf-profile_page_fault2_head_THP-Always.gz means the perf-profile result obtained from page_fault2 >> tested for the head commit (a7a8993bfe3ccb54ad468b9f1799649e4ad1ff12) with the THP_always configuration. >> >> Best regards, >> Haiyan Song >> >> ________________________________________ >> From: owner-linux-mm@kvack.org [owner-linux-mm@kvack.org] on behalf of Laurent Dufour [ldufour@linux.vnet.ibm.com] >> Sent: Thursday, July 12, 2018 1:05 AM >> To: Song, HaiyanX >> Cc: akpm@linux-foundation.org; mhocko@kernel.org; peterz@infradead.org; kirill@shutemov.name; ak@linux.intel.com; dave@stgolabs.net; jack@suse.cz; Matthew Wilcox; khandual@linux.vnet.ibm.com; aneesh.kumar@linux.vnet.ibm.com; benh@kernel.crashing.org; mpe@ellerman.id.au; paulus@samba.org; Thomas Gleixner; Ingo Molnar; hpa@zytor.com; Will Deacon; Sergey Senozhatsky; sergey.senozhatsky.work@gmail.com; Andrea Arcangeli; Alexei Starovoitov; Wang, Kemi; Daniel Jordan; David Rientjes; Jerome Glisse; Ganesh Mahendran; Minchan Kim; Punit Agrawal; vinayak menon; Yang Shi; linux-kernel@vger.kernel.org; linux-mm@kvack.org; haren@linux.vnet.ibm.com; npiggin@gmail.com; bsingharora@gmail.com; paulmck@linux.vnet.ibm.com; Tim Chen; linuxppc-dev@lists.ozlabs.org; x86@kernel.org >> Subject: Re: [PATCH v11 00/26] Speculative page faults >> >> Hi Haiyan, >> >> Did you get a chance to capture some performance cycles on your system ? >> I still can't get these numbers on my hardware. >> >> Thanks, >> Laurent. >> >> On 04/07/2018 09:51, Laurent Dufour wrote: >>> On 04/07/2018 05:23, Song, HaiyanX wrote: >>>> Hi Laurent, >>>> >>>> >>>> For the test results on the Intel 4s Skylake platform (192 CPUs, 768G memory), the below test cases were all run 3 times. >>>> I checked the test results; only page_fault3_thread/enable THP has 6% stddev for the head commit, other tests have lower stddev. >>> >>> Repeating the test only 3 times seems a bit too low to me.
>>> >>> I'll focus on the higher change for the moment, but I don't have access to such >>> hardware. >>> >>> Is it possible to provide a diff between the base and SPF performance cycles >>> measured when running page_fault3 and page_fault2 when the 20% change is detected. >>> >>> Please stay focused on the test case process to see exactly where the series is >>> impacting. >>> >>> Thanks, >>> Laurent. >>> >>>> >>>> And I did not find other high variation in the test case results. >>>> >>>> a). Enable THP >>>> testcase base stddev change head stddev metric >>>> page_fault3/enable THP 10519 ± 3% -20.5% 8368 ±6% will-it-scale.per_thread_ops >>>> page_fault2/enable THP 8281 ± 2% -18.8% 6728 will-it-scale.per_thread_ops >>>> brk1/enable THP 998475 -2.2% 976893 will-it-scale.per_process_ops >>>> context_switch1/enable THP 223910 -1.3% 220930 will-it-scale.per_process_ops >>>> context_switch1/enable THP 233722 -1.0% 231288 will-it-scale.per_thread_ops >>>> >>>> b). Disable THP >>>> page_fault3/disable THP 10856 -23.1% 8344 will-it-scale.per_thread_ops >>>> page_fault2/disable THP 8147 -18.8% 6613 will-it-scale.per_thread_ops >>>> brk1/disable THP 957 -7.9% 881 will-it-scale.per_thread_ops >>>> context_switch1/disable THP 237006 -2.2% 231907 will-it-scale.per_thread_ops >>>> brk1/disable THP 997317 -2.0% 977778 will-it-scale.per_process_ops >>>> page_fault3/disable THP 467454 -1.8% 459251 will-it-scale.per_process_ops >>>> context_switch1/disable THP 224431 -1.3% 221567 will-it-scale.per_process_ops >>>> >>>> >>>> Best regards, >>>> Haiyan Song >>>> ________________________________________ >>>> From: Laurent Dufour [ldufour@linux.vnet.ibm.com] >>>> Sent: Monday, July 02, 2018 4:59 PM >>>> To: Song, HaiyanX >>>> Cc: akpm@linux-foundation.org; mhocko@kernel.org; peterz@infradead.org; kirill@shutemov.name; ak@linux.intel.com; dave@stgolabs.net; jack@suse.cz; Matthew Wilcox; khandual@linux.vnet.ibm.com; aneesh.kumar@linux.vnet.ibm.com; benh@kernel.crashing.org; mpe@ellerman.id.au;
paulus@samba.org; Thomas Gleixner; Ingo Molnar; hpa@zytor.com; Will Deacon; Sergey Senozhatsky; sergey.senozhatsky.work@gmail.com; Andrea Arcangeli; Alexei Starovoitov; Wang, Kemi; Daniel Jordan; David Rientjes; Jerome Glisse; Ganesh Mahendran; Minchan Kim; Punit Agrawal; vinayak menon; Yang Shi; linux-kernel@vger.kernel.org; linux-mm@kvack.org; haren@linux.vnet.ibm.com; npiggin@gmail.com; bsingharora@gmail.com; paulmck@linux.vnet.ibm.com; Tim Chen; linuxppc-dev@lists.ozlabs.org; x86@kernel.org >>>> Subject: Re: [PATCH v11 00/26] Speculative page faults >>>> >>>> On 11/06/2018 09:49, Song, HaiyanX wrote: >>>>> Hi Laurent, >>>>> >>>>> Regression tests for the v11 patch series have been run; some regressions were found by LKP-tools (linux kernel performance) >>>>> tested on an Intel 4s Skylake platform. This time only the cases which had been run and found regressions on the >>>>> V9 patch series were tested. >>>>> >>>>> The regression result is sorted by the metric will-it-scale.per_thread_ops. >>>>> branch: Laurent-Dufour/Speculative-page-faults/20180520-045126 >>>>> commit id: >>>>> head commit : a7a8993bfe3ccb54ad468b9f1799649e4ad1ff12 >>>>> base commit : ba98a1cdad71d259a194461b3a61471b49b14df1 >>>>> Benchmark: will-it-scale >>>>> Download link: https://github.com/antonblanchard/will-it-scale/tree/master >>>>> >>>>> Metrics: >>>>> will-it-scale.per_process_ops=processes/nr_cpu >>>>> will-it-scale.per_thread_ops=threads/nr_cpu >>>>> test box: lkp-skl-4sp1(nr_cpu=192,memory=768G) >>>>> THP: enable / disable >>>>> nr_task:100% >>>>> >>>>> 1. Regressions: >>>>> >>>>> a).
Enable THP >>>>> testcase base change head metric >>>>> page_fault3/enable THP 10519 -20.5% 8368 will-it-scale.per_thread_ops >>>>> page_fault2/enable THP 8281 -18.8% 6728 will-it-scale.per_thread_ops >>>>> brk1/enable THP 998475 -2.2% 976893 will-it-scale.per_process_ops >>>>> context_switch1/enable THP 223910 -1.3% 220930 will-it-scale.per_process_ops >>>>> context_switch1/enable THP 233722 -1.0% 231288 will-it-scale.per_thread_ops >>>>> >>>>> b). Disable THP >>>>> page_fault3/disable THP 10856 -23.1% 8344 will-it-scale.per_thread_ops >>>>> page_fault2/disable THP 8147 -18.8% 6613 will-it-scale.per_thread_ops >>>>> brk1/disable THP 957 -7.9% 881 will-it-scale.per_thread_ops >>>>> context_switch1/disable THP 237006 -2.2% 231907 will-it-scale.per_thread_ops >>>>> brk1/disable THP 997317 -2.0% 977778 will-it-scale.per_process_ops >>>>> page_fault3/disable THP 467454 -1.8% 459251 will-it-scale.per_process_ops >>>>> context_switch1/disable THP 224431 -1.3% 221567 will-it-scale.per_process_ops >>>>> >>>>> Notes: for the above test result values, the higher the better. >>>> >>>> I tried the same tests on my PowerPC victim VM (1024 CPUs, 11TB) and I can't >>>> get reproducible results. The results have huge variation, even on the vanilla >>>> kernel, and I can't make any statement about changes because of that. >>>> >>>> I tried on a smaller node (80 CPUs, 32G), and the tests ran better, but I didn't >>>> measure any changes between the vanilla and the SPF patched ones: >>>> >>>> test THP enabled 4.17.0-rc4-mm1 spf delta >>>> page_fault3_threads 2697.7 2683.5 -0.53% >>>> page_fault2_threads 170660.6 169574.1 -0.64% >>>> context_switch1_threads 6915269.2 6877507.3 -0.55% >>>> context_switch1_processes 6478076.2 6529493.5 0.79% >>>> brk1 243391.2 238527.5 -2.00% >>>> >>>> Tests were run 10 times, no high variation detected. >>>> >>>> Did you see high variation on your side ? How many times were the tests run to >>>> compute the average values ? >>>> >>>> Thanks, >>>> Laurent.
>>>> >>>> >>>>> >>>>> 2. Improvements: no improvement found based on the selected test cases. >>>>> >>>>> >>>>> Best regards >>>>> Haiyan Song >>>>> ________________________________________ >>>>> From: owner-linux-mm@kvack.org [owner-linux-mm@kvack.org] on behalf of Laurent Dufour [ldufour@linux.vnet.ibm.com] >>>>> Sent: Monday, May 28, 2018 4:54 PM >>>>> To: Song, HaiyanX >>>>> Cc: akpm@linux-foundation.org; mhocko@kernel.org; peterz@infradead.org; kirill@shutemov.name; ak@linux.intel.com; dave@stgolabs.net; jack@suse.cz; Matthew Wilcox; khandual@linux.vnet.ibm.com; aneesh.kumar@linux.vnet.ibm.com; benh@kernel.crashing.org; mpe@ellerman.id.au; paulus@samba.org; Thomas Gleixner; Ingo Molnar; hpa@zytor.com; Will Deacon; Sergey Senozhatsky; sergey.senozhatsky.work@gmail.com; Andrea Arcangeli; Alexei Starovoitov; Wang, Kemi; Daniel Jordan; David Rientjes; Jerome Glisse; Ganesh Mahendran; Minchan Kim; Punit Agrawal; vinayak menon; Yang Shi; linux-kernel@vger.kernel.org; linux-mm@kvack.org; haren@linux.vnet.ibm.com; npiggin@gmail.com; bsingharora@gmail.com; paulmck@linux.vnet.ibm.com; Tim Chen; linuxppc-dev@lists.ozlabs.org; x86@kernel.org >>>>> Subject: Re: [PATCH v11 00/26] Speculative page faults >>>>> >>>>> On 28/05/2018 10:22, Haiyan Song wrote: >>>>>> Hi Laurent, >>>>>> >>>>>> Yes, these tests were done on the V9 patch. >>>>> >>>>> Do you plan to give this V11 a run ? >>>>> >>>>>> >>>>>> >>>>>> Best regards, >>>>>> Haiyan Song >>>>>> >>>>>> On Mon, May 28, 2018 at 09:51:34AM +0200, Laurent Dufour wrote: >>>>>>> On 28/05/2018 07:23, Song, HaiyanX wrote: >>>>>>>> >>>>>>>> Some regressions and improvements were found by LKP-tools (linux kernel performance) on the V9 patch series >>>>>>>> tested on an Intel 4s Skylake platform. >>>>>>> >>>>>>> Hi, >>>>>>> >>>>>>> Thanks for reporting these benchmark results, but you mentioned the "V9 patch >>>>>>> series" while responding to the v11 header series... >>>>>>> Were these tests done on v9 or v11 ? >>>>>>> >>>>>>> Cheers, >>>>>>> Laurent.
>>>>>>> >>>>>>>> >>>>>>>> The regression result is sorted by the metric will-it-scale.per_thread_ops. >>>>>>>> Branch: Laurent-Dufour/Speculative-page-faults/20180316-151833 (V9 patch series) >>>>>>>> Commit id: >>>>>>>> base commit: d55f34411b1b126429a823d06c3124c16283231f >>>>>>>> head commit: 0355322b3577eeab7669066df42c550a56801110 >>>>>>>> Benchmark suite: will-it-scale >>>>>>>> Download link: >>>>>>>> https://github.com/antonblanchard/will-it-scale/tree/master/tests >>>>>>>> Metrics: >>>>>>>> will-it-scale.per_process_ops=processes/nr_cpu >>>>>>>> will-it-scale.per_thread_ops=threads/nr_cpu >>>>>>>> test box: lkp-skl-4sp1(nr_cpu=192,memory=768G) >>>>>>>> THP: enable / disable >>>>>>>> nr_task: 100% >>>>>>>> >>>>>>>> 1. Regressions: >>>>>>>> a) THP enabled: >>>>>>>> testcase base change head metric >>>>>>>> page_fault3/ enable THP 10092 -17.5% 8323 will-it-scale.per_thread_ops >>>>>>>> page_fault2/ enable THP 8300 -17.2% 6869 will-it-scale.per_thread_ops >>>>>>>> brk1/ enable THP 957.67 -7.6% 885 will-it-scale.per_thread_ops >>>>>>>> page_fault3/ enable THP 172821 -5.3% 163692 will-it-scale.per_process_ops >>>>>>>> signal1/ enable THP 9125 -3.2% 8834 will-it-scale.per_process_ops >>>>>>>> >>>>>>>> b) THP disabled: >>>>>>>> testcase base change head metric >>>>>>>> page_fault3/ disable THP 10107 -19.1% 8180 will-it-scale.per_thread_ops >>>>>>>> page_fault2/ disable THP 8432 -17.8% 6931 will-it-scale.per_thread_ops >>>>>>>> context_switch1/ disable THP 215389 -6.8% 200776 will-it-scale.per_thread_ops >>>>>>>> brk1/ disable THP 939.67 -6.6% 877.33 will-it-scale.per_thread_ops >>>>>>>> page_fault3/ disable THP 173145 -4.7% 165064 will-it-scale.per_process_ops >>>>>>>> signal1/ disable THP 9162 -3.9% 8802 will-it-scale.per_process_ops >>>>>>>> >>>>>>>> 2. 
Improvements: >>>>>>>> a) THP enabled: >>>>>>>> testcase base change head metric >>>>>>>> malloc1/ enable THP 66.33 +469.8% 383.67 will-it-scale.per_thread_ops >>>>>>>> writeseek3/ enable THP 2531 +4.5% 2646 will-it-scale.per_thread_ops >>>>>>>> signal1/ enable THP 989.33 +2.8% 1016 will-it-scale.per_thread_ops >>>>>>>> >>>>>>>> b) THP disabled: >>>>>>>> testcase base change head metric >>>>>>>> malloc1/ disable THP 90.33 +417.3% 467.33 will-it-scale.per_thread_ops >>>>>>>> read2/ disable THP 58934 +39.2% 82060 will-it-scale.per_thread_ops >>>>>>>> page_fault1/ disable THP 8607 +36.4% 11736 will-it-scale.per_thread_ops >>>>>>>> read1/ disable THP 314063 +12.7% 353934 will-it-scale.per_thread_ops >>>>>>>> writeseek3/ disable THP 2452 +12.5% 2759 will-it-scale.per_thread_ops >>>>>>>> signal1/ disable THP 971.33 +5.5% 1024 will-it-scale.per_thread_ops >>>>>>>> >>>>>>>> Notes: for above values in column "change", the higher value means that the related testcase result >>>>>>>> on head commit is better than that on base commit for this benchmark. 
>>>>>>>> >>>>>>>> >>>>>>>> Best regards >>>>>>>> Haiyan Song >>>>>>>> >>>>>>>> ________________________________________ >>>>>>>> From: owner-linux-mm@kvack.org [owner-linux-mm@kvack.org] on behalf of Laurent Dufour [ldufour@linux.vnet.ibm.com] >>>>>>>> Sent: Thursday, May 17, 2018 7:06 PM >>>>>>>> To: akpm@linux-foundation.org; mhocko@kernel.org; peterz@infradead.org; kirill@shutemov.name; ak@linux.intel.com; dave@stgolabs.net; jack@suse.cz; Matthew Wilcox; khandual@linux.vnet.ibm.com; aneesh.kumar@linux.vnet.ibm.com; benh@kernel.crashing.org; mpe@ellerman.id.au; paulus@samba.org; Thomas Gleixner; Ingo Molnar; hpa@zytor.com; Will Deacon; Sergey Senozhatsky; sergey.senozhatsky.work@gmail.com; Andrea Arcangeli; Alexei Starovoitov; Wang, Kemi; Daniel Jordan; David Rientjes; Jerome Glisse; Ganesh Mahendran; Minchan Kim; Punit Agrawal; vinayak menon; Yang Shi >>>>>>>> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org; haren@linux.vnet.ibm.com; npiggin@gmail.com; bsingharora@gmail.com; paulmck@linux.vnet.ibm.com; Tim Chen; linuxppc-dev@lists.ozlabs.org; x86@kernel.org >>>>>>>> Subject: [PATCH v11 00/26] Speculative page faults >>>>>>>> >>>>>>>> This is a port on kernel 4.17 of the work done by Peter Zijlstra to handle >>>>>>>> page faults without holding the mm semaphore [1]. >>>>>>>> >>>>>>>> The idea is to try to handle user space page faults without holding the >>>>>>>> mmap_sem. This should allow better concurrency for massively threaded >>>>>>>> processes since the page fault handler will not wait for other threads' memory >>>>>>>> layout changes to be done, assuming that these changes are done in another part >>>>>>>> of the process's memory space. This type of page fault is named a speculative >>>>>>>> page fault. If the speculative page fault fails because concurrency is >>>>>>>> detected or because the underlying PMD or PTE tables are not yet allocated, its >>>>>>>> processing fails and a classic page fault is then tried.
>>>>>>>> >>>>>>>> The speculative page fault (SPF) has to look for the VMA matching the fault >>>>>>>> address without holding the mmap_sem; this is done by introducing a rwlock >>>>>>>> which protects the access to the mm_rb tree. Previously this was done using >>>>>>>> SRCU, but it introduced a lot of scheduling to process the VMA >>>>>>>> freeing operations, which hit the performance by 20% as reported by >>>>>>>> Kemi Wang [2]. Using a rwlock to protect access to the mm_rb tree >>>>>>>> limits the locking contention to these operations, which are expected to >>>>>>>> be O(log n). In addition, to ensure that the VMA is not freed >>>>>>>> behind our back, a reference count is added and 2 services (get_vma() and >>>>>>>> put_vma()) are introduced to handle the reference count. Once a VMA is >>>>>>>> fetched from the RB tree using get_vma(), it must be later freed using >>>>>>>> put_vma(). I no longer see the overhead I previously got with the will-it-scale >>>>>>>> benchmark. >>>>>>>> >>>>>>>> The VMA's attributes checked during the speculative page fault processing >>>>>>>> have to be protected against parallel changes. This is done by using a per >>>>>>>> VMA sequence lock. This sequence lock allows the speculative page fault >>>>>>>> handler to quickly check for parallel changes in progress and to abort the >>>>>>>> speculative page fault in that case. >>>>>>>> >>>>>>>> Once the VMA has been found, the speculative page fault handler checks >>>>>>>> the VMA's attributes to verify whether the page fault can be handled >>>>>>>> correctly or not. Thus, the VMA is protected through a sequence lock which >>>>>>>> allows fast detection of concurrent VMA changes. If such a change is >>>>>>>> detected, the speculative page fault is aborted and a *classic* page fault >>>>>>>> is tried. VMA sequence locking is added where the VMA attributes which are >>>>>>>> checked during the page fault are modified.
>>>>>>>> >>>>>>>> When the PTE is fetched, the VMA is checked to see if it has been changed, >>>>>>>> so once the page table is locked, the VMA is known to be valid; any other change >>>>>>>> leading to touching this PTE will need to lock the page table, so no >>>>>>>> parallel change is possible at this time. >>>>>>>> >>>>>>>> The locking of the PTE is done with interrupts disabled; this allows >>>>>>>> checking the PMD to ensure that there is no ongoing collapsing >>>>>>>> operation. Since khugepaged first sets the PMD to pmd_none and then >>>>>>>> waits for the other CPUs to have caught the IPI, if the pmd is >>>>>>>> valid at the time the PTE is locked, we have the guarantee that the >>>>>>>> collapsing operation will have to wait on the PTE lock to move forward. >>>>>>>> This allows the SPF handler to map the PTE safely. If the PMD value is >>>>>>>> different from the one recorded at the beginning of the SPF operation, the >>>>>>>> classic page fault handler will be called to handle the operation while >>>>>>>> holding the mmap_sem. As the PTE lock is taken with interrupts disabled, >>>>>>>> the lock is taken using spin_trylock() to avoid deadlock when handling a >>>>>>>> page fault while a TLB invalidate is requested by another CPU holding the >>>>>>>> PTE. >>>>>>>> >>>>>>>> In pseudo code, this could be seen as: >>>>>>>> speculative_page_fault() >>>>>>>> { >>>>>>>> vma = get_vma() >>>>>>>> check vma sequence count >>>>>>>> check vma's support >>>>>>>> disable interrupt >>>>>>>> check pgd,p4d,...,pte >>>>>>>> save pmd and pte in vmf >>>>>>>> save vma sequence counter in vmf >>>>>>>> enable interrupt >>>>>>>> check vma sequence count >>>>>>>> handle_pte_fault(vma) >>>>>>>> ..
>>>>>>>> page = alloc_page() >>>>>>>> pte_map_lock() >>>>>>>> disable interrupt >>>>>>>> abort if sequence counter has changed >>>>>>>> abort if pmd or pte has changed >>>>>>>> pte map and lock >>>>>>>> enable interrupt >>>>>>>> if abort >>>>>>>> free page >>>>>>>> abort >>>>>>>> ... >>>>>>>> } >>>>>>>> >>>>>>>> arch_fault_handler() >>>>>>>> { >>>>>>>> if (speculative_page_fault(&vma)) >>>>>>>> goto done >>>>>>>> again: >>>>>>>> lock(mmap_sem) >>>>>>>> vma = find_vma(); >>>>>>>> handle_pte_fault(vma); >>>>>>>> if retry >>>>>>>> unlock(mmap_sem) >>>>>>>> goto again; >>>>>>>> done: >>>>>>>> handle fault error >>>>>>>> } >>>>>>>> >>>>>>>> Support for THP is not done because when checking for the PMD, we can be >>>>>>>> confused by an in-progress collapsing operation done by khugepaged. The >>>>>>>> issue is that pmd_none() could be true either if the PMD is not already >>>>>>>> populated or if the underlying PTEs are in the process of being collapsed. So we >>>>>>>> cannot safely allocate a PMD if pmd_none() is true. >>>>>>>> >>>>>>>> This series adds a new software performance event named 'speculative-faults' >>>>>>>> or 'spf'. It counts the number of successful page fault events handled >>>>>>>> speculatively. When recording 'faults,spf' events, the 'faults' one counts >>>>>>>> the total number of page fault events while 'spf' only counts >>>>>>>> the part of the faults processed speculatively. >>>>>>>> >>>>>>>> There are some trace events introduced by this series. They allow >>>>>>>> identifying why the page faults were not processed speculatively. This >>>>>>>> doesn't take into account the faults generated by a monothreaded process, >>>>>>>> which are directly processed while holding the mmap_sem. These trace events are >>>>>>>> grouped in a system named 'pagefault'; they are: >>>>>>>> - pagefault:spf_vma_changed : the VMA has been changed behind our back >>>>>>>> - pagefault:spf_vma_noanon : the vma->anon_vma field was not yet set.
>>>>>>>> - pagefault:spf_vma_notsup : the VMA's type is not supported >>>>>>>> - pagefault:spf_vma_access : the VMA's access rights are not respected >>>>>>>> - pagefault:spf_pmd_changed : the upper PMD pointer has changed behind our >>>>>>>> back. >>>>>>>> >>>>>>>> To record all the related events, the easiest is to run perf with the >>>>>>>> following arguments : >>>>>>>> $ perf stat -e 'faults,spf,pagefault:*' <command> >>>>>>>> >>>>>>>> There is also a dedicated vmstat counter showing the number of successful >>>>>>>> page faults handled speculatively. It can be seen this way: >>>>>>>> $ grep speculative_pgfault /proc/vmstat >>>>>>>> >>>>>>>> This series builds on top of v4.16-mmotm-2018-04-13-17-28 and is functional >>>>>>>> on x86, PowerPC and arm64. >>>>>>>> >>>>>>>> --------------------- >>>>>>>> Real Workload results >>>>>>>> >>>>>>>> As mentioned in a previous email, we did unofficial runs using a "popular >>>>>>>> in-memory multithreaded database product" on a 176-core SMT8 Power system >>>>>>>> which showed a 30% improvement in the number of transactions processed per >>>>>>>> second. This run was done on the v6 series, but changes introduced in >>>>>>>> this new version should not impact the performance boost seen. >>>>>>>> >>>>>>>> Here are the perf data captured during 2 of these runs on top of the v8 >>>>>>>> series: >>>>>>>> vanilla spf >>>>>>>> faults 89.418 101.364 +13% >>>>>>>> spf n/a 97.989 >>>>>>>> >>>>>>>> With the SPF kernel, most of the page faults were processed in a speculative >>>>>>>> way. >>>>>>>> >>>>>>>> Ganesh Mahendran had backported the series on top of a 4.9 kernel and gave >>>>>>>> it a try on an Android device. He reported that the application launch time >>>>>>>> was improved on average by 6%, and for large applications (~100 threads) by >>>>>>>> 20%.
>>>>>>>> >>>>>>>> Here are the launch times Ganesh measured on Android 8.0 on top of a Qcom >>>>>>>> MSM845 (8 cores) with 6GB (the lower the better): >>>>>>>> >>>>>>>> Application 4.9 4.9+spf delta >>>>>>>> com.tencent.mm 416 389 -7% >>>>>>>> com.eg.android.AlipayGphone 1135 986 -13% >>>>>>>> com.tencent.mtt 455 454 0% >>>>>>>> com.qqgame.hlddz 1497 1409 -6% >>>>>>>> com.autonavi.minimap 711 701 -1% >>>>>>>> com.tencent.tmgp.sgame 788 748 -5% >>>>>>>> com.immomo.momo 501 487 -3% >>>>>>>> com.tencent.peng 2145 2112 -2% >>>>>>>> com.smile.gifmaker 491 461 -6% >>>>>>>> com.baidu.BaiduMap 479 366 -23% >>>>>>>> com.taobao.taobao 1341 1198 -11% >>>>>>>> com.baidu.searchbox 333 314 -6% >>>>>>>> com.tencent.mobileqq 394 384 -3% >>>>>>>> com.sina.weibo 907 906 0% >>>>>>>> com.youku.phone 816 731 -11% >>>>>>>> com.happyelements.AndroidAnimal.qq 763 717 -6% >>>>>>>> com.UCMobile 415 411 -1% >>>>>>>> com.tencent.tmgp.ak 1464 1431 -2% >>>>>>>> com.tencent.qqmusic 336 329 -2% >>>>>>>> com.sankuai.meituan 1661 1302 -22% >>>>>>>> com.netease.cloudmusic 1193 1200 1% >>>>>>>> air.tv.douyu.android 4257 4152 -2% >>>>>>>> >>>>>>>> ------------------ >>>>>>>> Benchmarks results >>>>>>>> >>>>>>>> Base kernel is v4.17.0-rc4-mm1 >>>>>>>> SPF is BASE + this series >>>>>>>> >>>>>>>> Kernbench: >>>>>>>> ---------- >>>>>>>> Here are the results on a 16 CPUs X86 guest using kernbench on a 4.15 >>>>>>>> kernel (the kernel is built 5 times): >>>>>>>> >>>>>>>> Average Half load -j 8 >>>>>>>> Run (std deviation) >>>>>>>> BASE SPF >>>>>>>> Elapsed Time 1448.65 (5.72312) 1455.84 (4.84951) 0.50% >>>>>>>> User Time 10135.4 (30.3699) 10148.8 (31.1252) 0.13% >>>>>>>> System Time 900.47 (2.81131) 923.28 (7.52779) 2.53% >>>>>>>> Percent CPU 761.4 (1.14018) 760.2 (0.447214) -0.16% >>>>>>>> Context Switches 85380 (3419.52) 84748 (1904.44) -0.74% >>>>>>>> Sleeps 105064 (1240.96) 105074 (337.612) 0.01% >>>>>>>> >>>>>>>> Average Optimal load -j 16 >>>>>>>> Run (std deviation) >>>>>>>> BASE SPF >>>>>>>> Elapsed Time
920.528 (10.1212) 927.404 (8.91789) 0.75% >>>>>>>> User Time 11064.8 (981.142) 11085 (990.897) 0.18% >>>>>>>> System Time 979.904 (84.0615) 1001.14 (82.5523) 2.17% >>>>>>>> Percent CPU 1089.5 (345.894) 1086.1 (343.545) -0.31% >>>>>>>> Context Switches 159488 (78156.4) 158223 (77472.1) -0.79% >>>>>>>> Sleeps 110566 (5877.49) 110388 (5617.75) -0.16% >>>>>>>> >>>>>>>> >>>>>>>> During a run on the SPF, perf events were captured: >>>>>>>> Performance counter stats for '../kernbench -M': >>>>>>>> 526743764 faults >>>>>>>> 210 spf >>>>>>>> 3 pagefault:spf_vma_changed >>>>>>>> 0 pagefault:spf_vma_noanon >>>>>>>> 2278 pagefault:spf_vma_notsup >>>>>>>> 0 pagefault:spf_vma_access >>>>>>>> 0 pagefault:spf_pmd_changed >>>>>>>> >>>>>>>> Very few speculative page faults were recorded as most of the processes >>>>>>>> involved are monothreaded (sounds that on this architecture some threads >>>>>>>> were created during the kernel build processing). >>>>>>>> >>>>>>>> Here are the kerbench results on a 80 CPUs Power8 system: >>>>>>>> >>>>>>>> Average Half load -j 40 >>>>>>>> Run (std deviation) >>>>>>>> BASE SPF >>>>>>>> Elapsed Time 117.152 (0.774642) 117.166 (0.476057) 0.01% >>>>>>>> User Time 4478.52 (24.7688) 4479.76 (9.08555) 0.03% >>>>>>>> System Time 131.104 (0.720056) 134.04 (0.708414) 2.24% >>>>>>>> Percent CPU 3934 (19.7104) 3937.2 (19.0184) 0.08% >>>>>>>> Context Switches 92125.4 (576.787) 92581.6 (198.622) 0.50% >>>>>>>> Sleeps 317923 (652.499) 318469 (1255.59) 0.17% >>>>>>>> >>>>>>>> Average Optimal load -j 80 >>>>>>>> Run (std deviation) >>>>>>>> BASE SPF >>>>>>>> Elapsed Time 107.73 (0.632416) 107.31 (0.584936) -0.39% >>>>>>>> User Time 5869.86 (1466.72) 5871.71 (1467.27) 0.03% >>>>>>>> System Time 153.728 (23.8573) 157.153 (24.3704) 2.23% >>>>>>>> Percent CPU 5418.6 (1565.17) 5436.7 (1580.91) 0.33% >>>>>>>> Context Switches 223861 (138865) 225032 (139632) 0.52% >>>>>>>> Sleeps 330529 (13495.1) 332001 (14746.2) 0.45% >>>>>>>> >>>>>>>> During a run on the SPF, perf 
events were captured: >>>>>>>> Performance counter stats for '../kernbench -M': >>>>>>>> 116730856 faults >>>>>>>> 0 spf >>>>>>>> 3 pagefault:spf_vma_changed >>>>>>>> 0 pagefault:spf_vma_noanon >>>>>>>> 476 pagefault:spf_vma_notsup >>>>>>>> 0 pagefault:spf_vma_access >>>>>>>> 0 pagefault:spf_pmd_changed >>>>>>>> >>>>>>>> Most of the processes involved are monothreaded so SPF is not activated but >>>>>>>> there is no impact on the performance. >>>>>>>> >>>>>>>> Ebizzy: >>>>>>>> ------- >>>>>>>> The test is counting the number of records per second it can manage, the >>>>>>>> higher is the best. I run it like this 'ebizzy -mTt <nrcpus>'. To get >>>>>>>> consistent result I repeated the test 100 times and measure the average >>>>>>>> result. The number is the record processes per second, the higher is the >>>>>>>> best. >>>>>>>> >>>>>>>> BASE SPF delta >>>>>>>> 16 CPUs x86 VM 742.57 1490.24 100.69% >>>>>>>> 80 CPUs P8 node 13105.4 24174.23 84.46% >>>>>>>> >>>>>>>> Here are the performance counter read during a run on a 16 CPUs x86 VM: >>>>>>>> Performance counter stats for './ebizzy -mTt 16': >>>>>>>> 1706379 faults >>>>>>>> 1674599 spf >>>>>>>> 30588 pagefault:spf_vma_changed >>>>>>>> 0 pagefault:spf_vma_noanon >>>>>>>> 363 pagefault:spf_vma_notsup >>>>>>>> 0 pagefault:spf_vma_access >>>>>>>> 0 pagefault:spf_pmd_changed >>>>>>>> >>>>>>>> And the ones captured during a run on a 80 CPUs Power node: >>>>>>>> Performance counter stats for './ebizzy -mTt 80': >>>>>>>> 1874773 faults >>>>>>>> 1461153 spf >>>>>>>> 413293 pagefault:spf_vma_changed >>>>>>>> 0 pagefault:spf_vma_noanon >>>>>>>> 200 pagefault:spf_vma_notsup >>>>>>>> 0 pagefault:spf_vma_access >>>>>>>> 0 pagefault:spf_pmd_changed >>>>>>>> >>>>>>>> In ebizzy's case most of the page fault were handled in a speculative way, >>>>>>>> leading the ebizzy performance boost. 
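[The 100-run averaging described above can be scripted along these lines — a sketch, assuming ebizzy prints its result on a line ending in "records/s"; `avg` and `run_ebizzy` are helper names I made up:]

```shell
# avg: mean of the numbers read on stdin, printed with two decimals.
avg() {
    awk '{ s += $1; n++ } END { if (n) printf "%.2f\n", s / n }'
}

# run_ebizzy: repeat ebizzy and average the records/s figures.
# $1 = number of repetitions, $2 = thread count (nrcpus)
run_ebizzy() {
    for i in $(seq "$1"); do
        ebizzy -mTt "$2" | awk '/records\/s/ { print $1 }'
    done | avg
}

# Usage: run_ebizzy 100 "$(nproc)"
```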
>>>>>>>>
>>>>>>>> ------------------
>>>>>>>> Changes since v10 (https://lkml.org/lkml/2018/4/17/572):
>>>>>>>>  - Accounted for all the review feedback from Punit Agrawal,
>>>>>>>>    Ganesh Mahendran and Minchan Kim, hopefully.
>>>>>>>>  - Removed an unneeded check on CONFIG_SPECULATIVE_PAGE_FAULT in
>>>>>>>>    __do_page_fault().
>>>>>>>>  - Loop in pte_spinlock() and pte_map_lock() when the pte try-lock
>>>>>>>>    fails instead of aborting the speculative page fault handling.
>>>>>>>>    Dropped the now useless trace event pagefault:spf_pte_lock.
>>>>>>>>  - No longer try to reuse the fetched VMA during the speculative page
>>>>>>>>    fault handling when retrying is needed. This added a lot of
>>>>>>>>    complexity and additional tests didn't show a significant
>>>>>>>>    performance improvement.
>>>>>>>>  - Converted IS_ENABLED(CONFIG_NUMA) back to #ifdef due to a build
>>>>>>>>    error.
>>>>>>>>
>>>>>>>> [1] http://linux-kernel.2935.n7.nabble.com/RFC-PATCH-0-6-Another-go-at-speculative-page-faults-tt965642.html#none
>>>>>>>> [2] https://patchwork.kernel.org/patch/9999687/
>>>>>>>>
>>>>>>>> Laurent Dufour (20):
>>>>>>>>   mm: introduce CONFIG_SPECULATIVE_PAGE_FAULT
>>>>>>>>   x86/mm: define ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT
>>>>>>>>   powerpc/mm: set ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT
>>>>>>>>   mm: introduce pte_spinlock for FAULT_FLAG_SPECULATIVE
>>>>>>>>   mm: make pte_unmap_same compatible with SPF
>>>>>>>>   mm: introduce INIT_VMA()
>>>>>>>>   mm: protect VMA modifications using VMA sequence count
>>>>>>>>   mm: protect mremap() against SPF hanlder
>>>>>>>>   mm: protect SPF handler against anon_vma changes
>>>>>>>>   mm: cache some VMA fields in the vm_fault structure
>>>>>>>>   mm/migrate: Pass vm_fault pointer to migrate_misplaced_page()
>>>>>>>>   mm: introduce __lru_cache_add_active_or_unevictable
>>>>>>>>   mm: introduce __vm_normal_page()
>>>>>>>>   mm: introduce __page_add_new_anon_rmap()
>>>>>>>>   mm: protect mm_rb tree with a rwlock
>>>>>>>>   mm: adding speculative page fault failure trace events
>>>>>>>>   perf: add a speculative page fault sw event
>>>>>>>>   perf tools: add support for the SPF perf event
>>>>>>>>   mm: add speculative page fault vmstats
>>>>>>>>   powerpc/mm: add speculative page fault
>>>>>>>>
>>>>>>>> Mahendran Ganesh (2):
>>>>>>>>   arm64/mm: define ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT
>>>>>>>>   arm64/mm: add speculative page fault
>>>>>>>>
>>>>>>>> Peter Zijlstra (4):
>>>>>>>>   mm: prepare for FAULT_FLAG_SPECULATIVE
>>>>>>>>   mm: VMA sequence count
>>>>>>>>   mm: provide speculative fault infrastructure
>>>>>>>>   x86/mm: add speculative pagefault handling
>>>>>>>>
>>>>>>>>  arch/arm64/Kconfig                    |   1 +
>>>>>>>>  arch/arm64/mm/fault.c                 |  12 +
>>>>>>>>  arch/powerpc/Kconfig                  |   1 +
>>>>>>>>  arch/powerpc/mm/fault.c               |  16 +
>>>>>>>>  arch/x86/Kconfig                      |   1 +
>>>>>>>>  arch/x86/mm/fault.c                   |  27 +-
>>>>>>>>  fs/exec.c                             |   2 +-
>>>>>>>>  fs/proc/task_mmu.c                    |   5 +-
>>>>>>>>  fs/userfaultfd.c                      |  17 +-
>>>>>>>>  include/linux/hugetlb_inline.h        |   2 +-
>>>>>>>>  include/linux/migrate.h               |   4 +-
>>>>>>>>  include/linux/mm.h                    | 136 +++++++-
>>>>>>>>  include/linux/mm_types.h              |   7 +
>>>>>>>>  include/linux/pagemap.h               |   4 +-
>>>>>>>>  include/linux/rmap.h                  |  12 +-
>>>>>>>>  include/linux/swap.h                  |  10 +-
>>>>>>>>  include/linux/vm_event_item.h         |   3 +
>>>>>>>>  include/trace/events/pagefault.h      |  80 +++++
>>>>>>>>  include/uapi/linux/perf_event.h       |   1 +
>>>>>>>>  kernel/fork.c                         |   5 +-
>>>>>>>>  mm/Kconfig                            |  22 ++
>>>>>>>>  mm/huge_memory.c                      |   6 +-
>>>>>>>>  mm/hugetlb.c                          |   2 +
>>>>>>>>  mm/init-mm.c                          |   3 +
>>>>>>>>  mm/internal.h                         |  20 ++
>>>>>>>>  mm/khugepaged.c                       |   5 +
>>>>>>>>  mm/madvise.c                          |   6 +-
>>>>>>>>  mm/memory.c                           | 612 +++++++++++++++++++++++++++++-----
>>>>>>>>  mm/mempolicy.c                        |  51 ++-
>>>>>>>>  mm/migrate.c                          |   6 +-
>>>>>>>>  mm/mlock.c                            |  13 +-
>>>>>>>>  mm/mmap.c                             | 229 ++++++++++---
>>>>>>>>  mm/mprotect.c                         |   4 +-
>>>>>>>>  mm/mremap.c                           |  13 +
>>>>>>>>  mm/nommu.c                            |   2 +-
>>>>>>>>  mm/rmap.c                             |   5 +-
>>>>>>>>  mm/swap.c                             |   6 +-
>>>>>>>>  mm/swap_state.c                       |   8 +-
>>>>>>>>  mm/vmstat.c                           |   5 +-
>>>>>>>>  tools/include/uapi/linux/perf_event.h |   1 +
>>>>>>>>  tools/perf/util/evsel.c               |   1 +
>>>>>>>>  tools/perf/util/parse-events.c        |   4 +
>>>>>>>>  tools/perf/util/parse-events.l        |   1 +
>>>>>>>>  tools/perf/util/python.c              |   1 +
>>>>>>>>  44 files changed, 1161 insertions(+), 211 deletions(-)
>>>>>>>>  create mode 100644 include/trace/events/pagefault.h
>>>>>>>>
>>>>>>>> --
>>>>>>>> 2.7.4

From b6c7fa413f25b8574edf8c764b136715c40299c2 Mon Sep 17 00:00:00 2001
From: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Date: Mon, 20 Aug 2018 17:51:26 +0200
Subject: [PATCH] mm: Add a speculative page fault switch in sysctl

This allows turning on/off the use of the speculative page fault
handler. By default it's turned on.

Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
---
 include/linux/mm.h | 3 +++
 kernel/sysctl.c    | 9 +++++++++
 mm/memory.c        | 3 +++
 3 files changed, 15 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 31acf98a7d92..ac102efc4c86 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1422,6 +1422,7 @@ extern int handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
 		unsigned int flags);

 #ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+extern int sysctl_speculative_page_fault;
 extern int __handle_speculative_fault(struct mm_struct *mm,
 				      unsigned long address,
 				      unsigned int flags);
@@ -1429,6 +1430,8 @@ static inline int handle_speculative_fault(struct mm_struct *mm,
 					   unsigned long address,
 					   unsigned int flags)
 {
+	if (unlikely(!sysctl_speculative_page_fault))
+		return VM_FAULT_RETRY;
 	/*
 	 * Try speculative page fault for multithreaded user space task only.
 	 */
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index f45ed9e696eb..0fb81edd22c1 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1243,6 +1243,15 @@ static struct ctl_table vm_table[] = {
 		.extra1		= &zero,
 		.extra2		= &two,
 	},
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+	{
+		.procname	= "speculative_page_fault",
+		.data		= &sysctl_speculative_page_fault,
+		.maxlen		= sizeof(sysctl_speculative_page_fault),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
+#endif
 	{
 		.procname	= "panic_on_oom",
 		.data		= &sysctl_panic_on_oom,
diff --git a/mm/memory.c b/mm/memory.c
index 48e1cf0a54ef..c3db3bc4347b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -82,6 +82,9 @@
 #define CREATE_TRACE_POINTS
 #include <trace/events/pagefault.h>

+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+int sysctl_speculative_page_fault = 1;
+#endif
 #if defined(LAST_CPUPID_NOT_IN_PAGE_FLAGS) && !defined(CONFIG_COMPILE_TEST)
 #warning Unfortunate NUMA and NUMA Balancing config, growing page-frame for last_cpupid.
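[Not part of the patch — a sketch of how the new knob could be driven from a test script. The sysctl path comes from the ctl_table entry above; the `SPF_CTL` override is a testing hook I added, not something in the series:]

```shell
# spf_set: enable (1) or disable (0) the speculative page fault handler
# via the sysctl added by the patch above. SPF_CTL may be overridden to
# point at a scratch file for testing; the default is the real /proc path.
spf_set() {
    echo "$1" > "${SPF_CTL:-/proc/sys/vm/speculative_page_fault}"
}

# Usage (as root, on a kernel built with CONFIG_SPECULATIVE_PAGE_FAULT=y;
# the benchmark binary name here is only a placeholder):
#   spf_set 0; perf stat -e 'faults,spf,pagefault:*' ./page_fault3_threads
#   spf_set 1; perf stat -e 'faults,spf,pagefault:*' ./page_fault3_threads
```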
Hi Laurent,

I am sorry for replying so late. The previous LKP tests for this case were
run on the same Intel Skylake 4S platform, but that box has recently needed
maintenance, so I switched to another test box to run the page_fault3 test
case: an Intel Skylake 2S platform (nr_cpu: 104, memory: 64G).

I applied your patch to the SPF kernel (commit:
a7a8993bfe3ccb54ad468b9f1799649e4ad1ff12), then ran the two cases below.

a) Turn on the SPF handler with the command below, then run the
   page_fault3-thp-always test.
   echo 1 > /proc/sys/vm/speculative_page_fault

b) Turn off the SPF handler with the command below, then run the
   page_fault3-thp-always test.
   echo 0 > /proc/sys/vm/speculative_page_fault

Each test ran 3 times; I then averaged the results and captured perf data.

Here is the average result for will-it-scale.per_thread_ops:

                                                      SPF_turn_off  SPF_turn_on
page_fault3-THP-Always.will-it-scale.per_thread_ops   31963         26285

Best regards,
Haiyan Song
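[For reference, the relative change in the table above can be computed with a small helper — mine, not part of the report:]

```shell
# pct_delta: signed percentage change from a baseline to a new value.
pct_delta() {   # $1 = baseline, $2 = new value
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%+.1f%%\n", (b - a) * 100 / a }'
}

pct_delta 31963 26285   # SPF off -> on for page_fault3-THP-always: -17.8%
```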
On Thu, May 17, 2018 at 01:06:07PM +0200, Laurent Dufour wrote:
> This is a port on kernel 4.17 of the work done by Peter Zijlstra to handle
> page fault without holding the mm semaphore [1].
>
> The idea is to try to handle user space page faults without holding the
> mmap_sem. This should allow better concurrency for massively threaded

Question -- I presume mmap_sem (rw_semaphore implementation tested against)
was qrwlock?

Balbir Singh.
On 05/11/2018 at 11:42, Balbir Singh wrote:
> On Thu, May 17, 2018 at 01:06:07PM +0200, Laurent Dufour wrote:
>> This is a port on kernel 4.17 of the work done by Peter Zijlstra to handle
>> page fault without holding the mm semaphore [1].
>>
>> The idea is to try to handle user space page faults without holding the
>> mmap_sem. This should allow better concurrency for massively threaded
>
> Question -- I presume mmap_sem (rw_semaphore implementation tested against)
> was qrwlock?

I don't think so; this series doesn't change the mmap_sem definition, so it
still belongs to 'struct rw_semaphore'.

Laurent.