| Message ID | 20170509181234.GA4397@dhcp22.suse.cz |
|---|---|
| State | Not Applicable |
| Delegated to | David Miller |
Hi Michal,

> I like the idea of postponing the zeroing from the allocation to the init time. To be honest the improvement looks much larger than I would expect (Btw. this should be a part of the changelog rather than an outside link).

The improvements are larger because this time was never measured, as Linux does not have early boot time stamps. I added them for x86 and SPARC to measure the performance. I am pushing those changes through separate patchsets.

> The implementation just looks too large to what I would expect. E.g. do we really need to add zero argument to the large part of the memblock API? Wouldn't it be easier to simply export memblock_virt_alloc_internal (or its tiny wrapper memblock_virt_alloc_core) and move the zeroing outside to its 2 callers? A completely untested scratched version at the end of the email.

I am OK with this change. But, I do not really see a difference between:

memblock_virt_alloc_raw()
and
memblock_virt_alloc_core()

In both cases we use memblock_virt_alloc_internal(); the only difference is that in my case we tell memblock_virt_alloc_internal() to zero the pages if needed, and in your case the other two callers are zeroing it. I like moving memblock_dbg() inside memblock_virt_alloc_internal().

> Also it seems that this is not 100% correct either as it only cares about VMEMMAP while DEFERRED_STRUCT_PAGE_INIT might be enabled also for SPARSEMEM. This would suggest that we would zero out pages twice, right?

Thank you, I will check this combination before sending out the next patch.

> A similar concern would go to the memory hotplug patch which will fall back to the slab/page allocator IIRC. On the other hand __init_single_page is shared with the hotplug code so again we would initialize 2 times.

Correct, when memory is hotplugged, to gain the benefit of this fix, and also not to regress by actually double zeroing "struct pages", we should not zero it out.
However, I do not really have means to test it.

> So I suspect more changes are needed. I will have a closer look tomorrow.

Thank you for reviewing this work. I will wait for your comments before sending out updated patches.

Pasha
--
To unsubscribe from this list: send the line "unsubscribe sparclinux" in the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue 09-05-17 14:54:50, Pasha Tatashin wrote:
[...]
> > The implementation just looks too large to what I would expect. E.g. do we really need to add zero argument to the large part of the memblock API? Wouldn't it be easier to simply export memblock_virt_alloc_internal (or its tiny wrapper memblock_virt_alloc_core) and move the zeroing outside to its 2 callers? A completely untested scratched version at the end of the email.
>
> I am OK, with this change. But, I do not really see a difference between:
>
> memblock_virt_alloc_raw()
> and
> memblock_virt_alloc_core()
>
> In both cases we use memblock_virt_alloc_internal(), but the only difference is that in my case we tell memblock_virt_alloc_internal() to zero the pages if needed, and in your case the other two callers are zeroing it. I like moving memblock_dbg() inside memblock_virt_alloc_internal()

Well, I didn't object to this particular part. I was mostly concerned about http://lkml.kernel.org/r/1494003796-748672-4-git-send-email-pasha.tatashin@oracle.com and the "zero" argument for other functions. I guess we can do without that. I _think_ that we should simply _always_ initialize the page at the __init_single_page time rather than during the allocation. That would require dropping __GFP_ZERO for non-memblock allocations. Or do you think we could regress for single threaded initialization?

> > Also it seems that this is not 100% correct either as it only cares about VMEMMAP while DEFERRED_STRUCT_PAGE_INIT might be enabled also for SPARSEMEM. This would suggest that we would zero out pages twice, right?
>
> Thank you, I will check this combination before sending out the next patch.
>
> > A similar concern would go to the memory hotplug patch which will fall back to the slab/page allocator IIRC. On the other hand __init_single_page is shared with the hotplug code so again we would initialize 2 times.
> Correct, when memory it hotplugged, to gain the benefit of this fix, and also not to regress by actually double zeroing "struct pages" we should not zero it out. However, I do not really have means to test it.

It should be pretty easy to test with kvm, but I can help with testing on the real HW as well.

Thanks!
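Michal's suggested refactor — export a raw, non-zeroing allocator and keep the zeroing in a thin wrapper (or in the callers) — can be sketched in userspace roughly like this. The function names are made up for illustration and only mimic the shape of the memblock code under discussion, with malloc() standing in for the real allocator:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for memblock_virt_alloc_internal(): hands back
 * memory without zeroing it, the way the proposed _raw/_core variant would.
 * The 0xa5 fill only simulates "dirty" memory for the sketch. */
static void *memblock_alloc_raw_sketch(size_t size)
{
	void *p = malloc(size);

	if (p)
		memset(p, 0xa5, size);
	return p;
}

/* Zeroing wrapper: callers that need cleared memory zero it themselves,
 * instead of threading a 'bool zero' argument through the whole API. */
static void *memblock_alloc_zeroed_sketch(size_t size)
{
	void *p = memblock_alloc_raw_sketch(size);

	if (p)
		memset(p, 0, size);
	return p;
}
```

The point of the split is that deferred struct-page init can call the raw variant and leave the clearing to __init_single_page(), while every other caller keeps its zeroed semantics unchanged.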
> Well, I didn't object to this particular part. I was mostly concerned about http://lkml.kernel.org/r/1494003796-748672-4-git-send-email-pasha.tatashin@oracle.com and the "zero" argument for other functions. I guess we can do without that. I _think_ that we should simply _always_ initialize the page at the __init_single_page time rather than during the allocation. That would require dropping __GFP_ZERO for non-memblock allocations. Or do you think we could regress for single threaded initialization?

Hi Michal,

That's exactly right: I am worried that we will regress when there is no parallelized initialization of "struct pages" if we unconditionally do memset() in __init_single_page(). The overhead of calling memset() on smaller chunks (64 bytes) may cause the regression; this is why I opted to zero this metadata only in the parallelized case. This way, we are guaranteed to see great improvements from this change without having regressions on platforms and builds that do not support parallelized initialization of "struct pages".

However, on some chips such as the latest SPARCs it is beneficial to have memset() right inside __init_single_page() even for the single threaded case, because it can act as a prefetch on chips with optimized block initializing store instructions.

Pasha
On Wed 10-05-17 09:42:22, Pasha Tatashin wrote:
> > Well, I didn't object to this particular part. I was mostly concerned about http://lkml.kernel.org/r/1494003796-748672-4-git-send-email-pasha.tatashin@oracle.com and the "zero" argument for other functions. I guess we can do without that. I _think_ that we should simply _always_ initialize the page at the __init_single_page time rather than during the allocation. That would require dropping __GFP_ZERO for non-memblock allocations. Or do you think we could regress for single threaded initialization?
>
> Hi Michal,
>
> Thats exactly right, I am worried that we will regress when there is no parallelized initialization of "struct pages" if we force unconditionally do memset() in __init_single_page(). The overhead of calling memset() on a smaller chunks (64-bytes) may cause the regression, this is why I opted only for parallelized case to zero this metadata. This way, we are guaranteed to see great improvements from this change without having regressions on platforms and builds that do not support parallelized initialization of "struct pages".

Have you measured that? I do not think it would be super hard to measure. I would be quite surprised if this added much if anything at all as the whole struct page should be in the cache line already. We do set reference count and other struct members. Almost nobody should be looking at our page at this time and stealing the cache line. On the other hand a large memcpy will basically wipe everything away from the cpu cache. Or am I missing something?
On 05/10/2017 10:57 AM, Michal Hocko wrote:
> On Wed 10-05-17 09:42:22, Pasha Tatashin wrote:
> > > Well, I didn't object to this particular part. I was mostly concerned about http://lkml.kernel.org/r/1494003796-748672-4-git-send-email-pasha.tatashin@oracle.com and the "zero" argument for other functions. I guess we can do without that. I _think_ that we should simply _always_ initialize the page at the __init_single_page time rather than during the allocation. That would require dropping __GFP_ZERO for non-memblock allocations. Or do you think we could regress for single threaded initialization?
> >
> > Thats exactly right, I am worried that we will regress when there is no parallelized initialization of "struct pages" if we force unconditionally do memset() in __init_single_page(). The overhead of calling memset() on a smaller chunks (64-bytes) may cause the regression, this is why I opted only for parallelized case to zero this metadata. This way, we are guaranteed to see great improvements from this change without having regressions on platforms and builds that do not support parallelized initialization of "struct pages".
>
> Have you measured that? I do not think it would be super hard to measure. I would be quite surprised if this added much if anything at all as the whole struct page should be in the cache line already. We do set reference count and other struct members. Almost nobody should be looking at our page at this time and stealing the cache line. On the other hand a large memcpy will basically wipe everything away from the cpu cache. Or am I missing something?

Perhaps you are right, and I will measure on x86. But, I suspect it can become unacceptable on some platforms: there is an overhead of calling a function, even if it is leaf-optimized, and there is an overhead in memset() to check for alignments of size and address, types of setting (zeroing vs. non-zeroing), etc., that adds up quickly.

Pasha
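The trade-off being debated — one bulk memset() over the whole memmap at memblock time versus one 64-byte memset() per "struct page" inside __init_single_page() — can be approximated outside the kernel with a toy comparison. This is only a sketch: userspace clock() timings cannot reproduce kernel cache behavior or the sparc64 STBI path, and NPAGES and PAGE_STRUCT_SIZE are illustrative assumptions:

```c
#include <string.h>
#include <time.h>

#define NPAGES		(1 << 16)	/* assumed number of struct pages */
#define PAGE_STRUCT_SIZE 64		/* sizeof(struct page) on the configs discussed */

static unsigned char area[NPAGES * PAGE_STRUCT_SIZE];

/* One big memset over the whole memmap region: the memblock-time approach. */
static double zero_bulk(void)
{
	clock_t t0 = clock();

	memset(area, 0, sizeof(area));
	return (double)(clock() - t0) / CLOCKS_PER_SEC;
}

/* One 64-byte memset per "struct page": the __init_single_page() approach.
 * This pays the per-call function and alignment-check overhead Pasha
 * mentions, once per page. */
static double zero_per_page(void)
{
	clock_t t0 = clock();

	for (size_t i = 0; i < NPAGES; i++)
		memset(area + i * PAGE_STRUCT_SIZE, 0, PAGE_STRUCT_SIZE);
	return (double)(clock() - t0) / CLOCKS_PER_SEC;
}
```

Which variant wins is exactly what the thread goes on to measure per architecture; a sketch like this only exposes the call-overhead side of the question.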
From: Michal Hocko <mhocko@kernel.org>
Date: Wed, 10 May 2017 16:57:26 +0200

> Have you measured that? I do not think it would be super hard to measure. I would be quite surprised if this added much if anything at all as the whole struct page should be in the cache line already. We do set reference count and other struct members. Almost nobody should be looking at our page at this time and stealing the cache line. On the other hand a large memcpy will basically wipe everything away from the cpu cache. Or am I missing something?

I guess it might be clearer if you understand what the block initializing stores do on sparc64. There are no memory accesses at all.

The cpu just zeros out the cache line, that's it.

No L3 cache line is allocated. So this "wipe everything" behavior will not happen in the L3.
From: Pasha Tatashin <pasha.tatashin@oracle.com>
Date: Wed, 10 May 2017 11:01:40 -0400

> Perhaps you are right, and I will measure on x86. But, I suspect hit can become unacceptable on some platfoms: there is an overhead of calling a function, even if it is leaf-optimized, and there is an overhead in memset() to check for alignments of size and address, types of setting (zeroing vs. non-zeroing), etc., that adds up quickly.

Another source of overhead on the sparc64 side is that we must do memory barriers around the block initializing stores. So batching calls to memset() amortizes that as well.
On Wed, May 10, 2017 at 11:19:43AM -0400, David Miller wrote:
> From: Michal Hocko <mhocko@kernel.org>
> Date: Wed, 10 May 2017 16:57:26 +0200
>
> > Have you measured that? I do not think it would be super hard to measure. I would be quite surprised if this added much if anything at all as the whole struct page should be in the cache line already. We do set reference count and other struct members. Almost nobody should be looking at our page at this time and stealing the cache line. On the other hand a large memcpy will basically wipe everything away from the cpu cache. Or am I missing something?
>
> I guess it might be clearer if you understand what the block initializing stores do on sparc64. There are no memory accesses at all.
>
> The cpu just zeros out the cache line, that's it.
>
> No L3 cache line is allocated. So this "wipe everything" behavior will not happen in the L3.

There's either something wrong with your explanation or my reading skills :-)

"There are no memory accesses"
"No L3 cache line is allocated"

You can have one or the other ... either the CPU sends a cacheline-sized write of zeroes to memory without allocating an L3 cache line (maybe using the store buffer?), or the CPU allocates an L3 cache line and sets its contents to zeroes, probably putting it in the last way of the set so it's the first thing to be evicted if not touched.

Or there's some magic in the memory bus protocol where the CPU gets to tell the DRAM "hey, clear these cache lines". Although that's also a memory access of sorts ...
From: Matthew Wilcox <willy@infradead.org>
Date: Wed, 10 May 2017 10:17:03 -0700

> On Wed, May 10, 2017 at 11:19:43AM -0400, David Miller wrote:
> > I guess it might be clearer if you understand what the block initializing stores do on sparc64. There are no memory accesses at all.
> >
> > The cpu just zeros out the cache line, that's it.
> >
> > No L3 cache line is allocated. So this "wipe everything" behavior will not happen in the L3.
>
> There's either something wrong with your explanation or my reading skills :-)
>
> "There are no memory accesses"
> "No L3 cache line is allocated"
>
> You can have one or the other ... either the CPU sends a cacheline-sized write of zeroes to memory without allocating an L3 cache line (maybe using the store buffer?), or the CPU allocates an L3 cache line and sets its contents to zeroes, probably putting it in the last way of the set so it's the first thing to be evicted if not touched.

There is no conflict in what I said.

Only an L2 cache line is allocated and cleared. L3 is left alone.
On Wed, May 10, 2017 at 02:00:26PM -0400, David Miller wrote:
> From: Matthew Wilcox <willy@infradead.org>
> Date: Wed, 10 May 2017 10:17:03 -0700
>
> > On Wed, May 10, 2017 at 11:19:43AM -0400, David Miller wrote:
> > > I guess it might be clearer if you understand what the block initializing stores do on sparc64. There are no memory accesses at all.
> > >
> > > The cpu just zeros out the cache line, that's it.
> > >
> > > No L3 cache line is allocated. So this "wipe everything" behavior will not happen in the L3.
> >
> > There's either something wrong with your explanation or my reading skills :-)
> >
> > "There are no memory accesses"
> > "No L3 cache line is allocated"
> >
> > You can have one or the other ... either the CPU sends a cacheline-sized write of zeroes to memory without allocating an L3 cache line (maybe using the store buffer?), or the CPU allocates an L3 cache line and sets its contents to zeroes, probably putting it in the last way of the set so it's the first thing to be evicted if not touched.
>
> There is no conflict in what I said.
>
> Only an L2 cache line is allocated and cleared. L3 is left alone.

I thought SPARC had inclusive caches, so allocating an L2 cacheline would necessitate allocating an L3 cacheline. Or is this an exception to the normal order of things?
On Wed 10-05-17 11:19:43, David S. Miller wrote:
> From: Michal Hocko <mhocko@kernel.org>
> Date: Wed, 10 May 2017 16:57:26 +0200
>
> > Have you measured that? I do not think it would be super hard to measure. I would be quite surprised if this added much if anything at all as the whole struct page should be in the cache line already. We do set reference count and other struct members. Almost nobody should be looking at our page at this time and stealing the cache line. On the other hand a large memcpy will basically wipe everything away from the cpu cache. Or am I missing something?
>
> I guess it might be clearer if you understand what the block initializing stores do on sparc64. There are no memory accesses at all.
>
> The cpu just zeros out the cache line, that's it.
>
> No L3 cache line is allocated. So this "wipe everything" behavior will not happen in the L3.

OK, good to know. My understanding of sparc64 is close to zero. Anyway, do you agree that doing the struct page initialization along with the other writes to it shouldn't add a measurable overhead compared to pre-zeroing a larger block of struct pages? We already have an exclusive cache line and doing one 64B write along with a few other stores should be basically the same.
From: Michal Hocko <mhocko@kernel.org>
Date: Thu, 11 May 2017 10:05:38 +0200

> Anyway, do you agree that doing the struct page initialization along with other writes to it shouldn't add a measurable overhead comparing to pre-zeroing of larger block of struct pages? We already have an exclusive cache line and doing one 64B write along with few other stores should be basically the same.

Yes, it should be reasonably cheap.
>> Have you measured that? I do not think it would be super hard to measure. I would be quite surprised if this added much if anything at all as the whole struct page should be in the cache line already. We do set reference count and other struct members. Almost nobody should be looking at our page at this time and stealing the cache line. On the other hand a large memcpy will basically wipe everything away from the cpu cache. Or am I missing something?

Here is data for a single thread (deferred struct page init is disabled):

Intel CPU E7-8895 v3 @ 2.60GHz, 1T memory
-----------------------------------------
time to memset "struct page"s in memblock: 11.28s
time to init "struct page"s: 4.90s

Moving memset into __init_single_page():
time to init and memset "struct page"s: 8.39s

SPARC M6 @ 3600 MHz, 1T memory
-----------------------------------------
time to memset "struct page"s in memblock: 1.60s
time to init "struct page"s: 3.37s

Moving memset into __init_single_page():
time to init and memset "struct page"s: 12.99s

So, moving memset() into __init_single_page() benefits Intel. I am actually surprised that memset() is so slow on Intel when it is called from memblock. But it hurts SPARC; I guess the membars at the end of memset() kill the performance.

Also, when looking at these values, remember that Intel has twice as many "struct page"s for the same amount of memory.

Pasha
We should either keep memset() only for deferred struct pages, as in what I have in my patches.

Another option is to add a new function struct_page_clear() which would default to memset() and to something else on platforms that decide to optimize it. On SPARC it would call STBIs, and we would do one membar call after all "struct pages" are initialized.

I think what I sent out already is the cleaner and better solution, because I am not sure what kind of performance we would see on other chips.

On 05/11/2017 04:47 PM, Pasha Tatashin wrote:
> > > Have you measured that? I do not think it would be super hard to measure. I would be quite surprised if this added much if anything at all as the whole struct page should be in the cache line already. We do set reference count and other struct members. Almost nobody should be looking at our page at this time and stealing the cache line. On the other hand a large memcpy will basically wipe everything away from the cpu cache. Or am I missing something?
>
> Here is data for a single thread (deferred struct page init is disabled):
>
> Intel CPU E7-8895 v3 @ 2.60GHz, 1T memory
> -----------------------------------------
> time to memset "struct page"s in memblock: 11.28s
> time to init "struct page"s: 4.90s
>
> Moving memset into __init_single_page():
> time to init and memset "struct page"s: 8.39s
>
> SPARC M6 @ 3600 MHz, 1T memory
> -----------------------------------------
> time to memset "struct page"s in memblock: 1.60s
> time to init "struct page"s: 3.37s
>
> Moving memset into __init_single_page():
> time to init and memset "struct page"s: 12.99s
>
> So, moving memset() into __init_single_page() benefits Intel. I am actually surprised that memset() is so slow on Intel when it is called from memblock. But it hurts SPARC; I guess the membars at the end of memset() kill the performance.
> Also, when looking at these values, remember that Intel has twice as many "struct page"s for the same amount of memory.
>
> Pasha
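Pasha's second option — a clearing helper that defaults to memset() but can be overridden per architecture — might look schematically like this. All names here are hypothetical sketches of the idea; the real struct page layout and the SPARC STBI override (with its single deferred membar) are not shown:

```c
#include <assert.h>
#include <string.h>

/* Hypothetical 64-byte stand-in for struct page. */
struct page_sketch {
	unsigned long flags;
	unsigned char pad[56];	/* stand-in for the remaining fields */
};

/* Default: a plain memset(). An architecture with cheap cache-line-clearing
 * stores (e.g. sparc64 STBI) would #define its own struct_page_clear() and
 * issue one membar after the whole memmap loop instead of per page. */
#ifndef struct_page_clear
#define struct_page_clear(page)	memset((page), 0, sizeof(*(page)))
#endif

/* Sketch of __init_single_page(): clear first, then set the fields that
 * need non-zero initial values. */
static void init_single_page_sketch(struct page_sketch *page)
{
	struct_page_clear(page);
	page->flags = 1;	/* stand-in for the real field initialization */
}
```

The override point keeps generic code identical on every architecture while letting one platform swap in a faster clearing primitive, which is exactly the shape of the decision being debated in the thread.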
From: Pasha Tatashin <pasha.tatashin@oracle.com>
Date: Thu, 11 May 2017 16:47:05 -0400

> So, moving memset() into __init_single_page() benefits Intel. I am actually surprised that memset() is so slow on Intel when it is called from memblock. But it hurts SPARC; I guess the membars at the end of memset() kill the performance.

Perhaps an x86 expert can chime in, but it might be the case that past a certain size, the microcode for the enhanced stosb uses non-temporal stores or something like that.

As for sparc64, yes we can get really killed by the transactional cost of memset because of the membars. But I wonder, for a single page struct, if we even use the special stores and thus eat the membar cost.

struct page is only 64 bytes, and the cutoff in the Niagara4 bzero implementation is "64 + (64 - 8)", so indeed the initializing stores will not even be used.

So sparc64 will only use initializing stores and do the membars if at least 2 pages are cleared at a time.
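The Niagara4 cutoff David quotes can be captured as a tiny predicate (a sketch only; the constant comes straight from his description, assuming the block-initializing-store path is taken only strictly above it):

```c
#include <stddef.h>

/* Per David's description: the Niagara4 bzero switches to
 * block-initializing stores (and pays the membars) only for lengths
 * above 64 + (64 - 8) bytes, so a single 64-byte struct page clear
 * never takes that path, but clearing two or more pages does. */
static int uses_block_init_stores(size_t len)
{
	return len > 64 + (64 - 8);
}
```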
From: Pasha Tatashin <pasha.tatashin@oracle.com>
Date: Thu, 11 May 2017 16:59:33 -0400

> We should either keep memset() only for deferred struct pages as what I have in my patches.
>
> Another option is to add a new function struct_page_clear() which would default to memset() and to something else on platforms that decide to optimize it.
>
> On SPARC it would call STBIs, and we would do one membar call after all "struct pages" are initialized.

No membars will be performed for a single individual page struct clear; the cutoff to use the STBI is larger than that.
On 05/12/2017 12:57 PM, David Miller wrote:
> From: Pasha Tatashin <pasha.tatashin@oracle.com>
> Date: Thu, 11 May 2017 16:59:33 -0400
>
> > We should either keep memset() only for deferred struct pages as what I have in my patches. Another option is to add a new function struct_page_clear() which would default to memset() and to something else on platforms that decide to optimize it. On SPARC it would call STBIs, and we would do one membar call after all "struct pages" are initialized.
>
> No membars will be performed for single individual page struct clear, the cutoff to use the STBI is larger than that.

Right now it is larger, but what I suggested is to add a new optimized routine just for this case, which would do STBI for 64 bytes but without a membar (do the membar at the end of memmap_init_zone() and deferred_init_memmap()):

    #define struct_page_clear(page)                                  \
            __asm__ __volatile__(                                    \
            "stxa   %%g0, [%0]%2\n"                                  \
            "stxa   %%g0, [%0 + %1]%2\n"                             \
            : /* No output */                                        \
            : "r" (page), "r" (0x20), "i" (ASI_BLK_INIT_QUAD_LDD_P))

And insert it into __init_single_page() instead of memset().

The final result is 4.01s/T, which is even faster compared to the current 4.97s/T.

Pasha
From: Pasha Tatashin <pasha.tatashin@oracle.com>
Date: Fri, 12 May 2017 13:24:52 -0400

> Right now it is larger, but what I suggested is to add a new optimized routine just for this case, which would do STBI for 64 bytes but without a membar (do the membar at the end of memmap_init_zone() and deferred_init_memmap()):
>
>     #define struct_page_clear(page)                                  \
>             __asm__ __volatile__(                                    \
>             "stxa   %%g0, [%0]%2\n"                                  \
>             "stxa   %%g0, [%0 + %1]%2\n"                             \
>             : /* No output */                                        \
>             : "r" (page), "r" (0x20), "i" (ASI_BLK_INIT_QUAD_LDD_P))
>
> And insert it into __init_single_page() instead of memset().
>
> The final result is 4.01s/T, which is even faster compared to the current 4.97s/T.

Ok, indeed, that would work.
Hi Michal,

After looking at your suggested memblock_virt_alloc_core() change again, I decided to keep what I have. I do not want to inline memblock_virt_alloc_internal(), because it is not a performance critical path, and by inlining it we will unnecessarily increase the text size on all platforms.

Also, because it will be very hard to make sure that no platform regresses by making memset() the default in _memblock_virt_alloc_core() (as I already showed last week, at least sun4v SPARC64 will require special changes in order for this to work), I decided to make it available only for the "deferred struct page init" case, as is already in the patch.

I am working on testing to make sure we do not need to double zero in the two cases that you found: sparsemem and mem hotplug.

Please let me know if you have any more comments, or if I can send new patches out when they are ready.

Thank you,
Pasha
On Mon 15-05-17 14:12:10, Pasha Tatashin wrote:
> Hi Michal,
>
> After looking at your suggested memblock_virt_alloc_core() change again, I decided to keep what I have. I do not want to inline memblock_virt_alloc_internal(), because it is not a performance critical path, and by inlining it we will unnecessarily increase the text size on all platforms.

I do not insist but I would really _prefer_ if the bool zero argument didn't proliferate all over the memblock API.

> Also, because it will be very hard to make sure that no platform regresses by making memset() default in _memblock_virt_alloc_core() (as I already showed last week at least sun4v SPARC64 will require special changes in order for this to work), I decided to make it available only for "deferred struct page init" case. As, what is already in the patch.

I do not think this is the right approach. Your measurements just show that sparc could have a more optimized memset for small sizes. If you keep the same memset only for the parallel initialization then you just hide this fact. I wouldn't worry about other architectures. All sane architectures should simply work reasonably well when touching a single or only few cache lines at the same time. If some arches really suffer from small memsets then the initialization should be driven by a specific ARCH_WANT_LARGE_PAGEBLOCK_INIT rather than making this depend on DEFERRED_INIT. Or if you are too worried then make it opt-in and make it depend on ARCH_WANT_PER_PAGE_INIT and make it enabled for x86 and sparc after memset optimization.
On 05/15/2017 03:38 PM, Michal Hocko wrote:
> On Mon 15-05-17 14:12:10, Pasha Tatashin wrote:
> > After looking at your suggested memblock_virt_alloc_core() change again, I decided to keep what I have. I do not want to inline memblock_virt_alloc_internal(), because it is not a performance critical path, and by inlining it we will unnecessarily increase the text size on all platforms.
>
> I do not insist but I would really _prefer_ if the bool zero argument didn't proliferate all over the memblock API.

Sure, I will remove the zero boolean argument from memblock_virt_alloc_internal(), and do the memset() calls inside the callers.

> > Also, because it will be very hard to make sure that no platform regresses by making memset() default in _memblock_virt_alloc_core() (as I already showed last week at least sun4v SPARC64 will require special changes in order for this to work), I decided to make it available only for "deferred struct page init" case. As, what is already in the patch.
>
> I do not think this is the right approach. Your measurements just show that sparc could have a more optimized memset for small sizes. If you keep the same memset only for the parallel initialization then you just hide this fact. I wouldn't worry about other architectures. All sane architectures should simply work reasonably well when touching a single or only few cache lines at the same time. If some arches really suffer from small memsets then the initialization should be driven by a specific ARCH_WANT_LARGE_PAGEBLOCK_INIT rather than making this depend on DEFERRED_INIT. Or if you are too worried then make it opt-in and make it depend on ARCH_WANT_PER_PAGE_INIT and make it enabled for x86 and sparc after memset optimization.

OK, I will think about this. I do not really like adding new configs because they tend to clutter the code. This is why I wanted to rely on an already existing config that I know benefits all platforms that use it.
Eventually, "CONFIG_DEFERRED_STRUCT_PAGE_INIT" is going to become the default everywhere, as there should not be a drawback of using it even on small machines.

Pasha
On Mon 15-05-17 16:44:26, Pasha Tatashin wrote:
> On 05/15/2017 03:38 PM, Michal Hocko wrote:
>> I do not think this is the right approach. Your measurements just show
>> that sparc could have a more optimized memset for small sizes. If you
>> keep the same memset only for the parallel initialization then you
>> just hide this fact. I wouldn't worry about other architectures. All
>> sane architectures should simply work reasonably well when touching a
>> single or only few cache lines at the same time. If some arches really
>> suffer from small memsets then the initialization should be driven by a
>> specific ARCH_WANT_LARGE_PAGEBLOCK_INIT rather than making this depend
>> on DEFERRED_INIT. Or if you are too worried then make it opt-in and make
>> it depend on ARCH_WANT_PER_PAGE_INIT and make it enabled for x86 and
>> sparc after memset optimization.
>
> OK, I will think about this.
>
> I do not really like adding new configs because they tend to clutter the
> code. This is why,

Yes, I hate adding new (arch) config options as well. And I still
believe we do not need any here either...

> I wanted to rely on already existing config that I know benefits all
> platforms that use it.

I wouldn't be so sure about this. If any other platform has similar
issues with small memsets as sparc, then the overhead is just papered
over by the parallel initialization.

> Eventually, "CONFIG_DEFERRED_STRUCT_PAGE_INIT" is going to become the
> default everywhere, as there should not be a drawback of using it even
> on small machines.

Maybe, and I would highly appreciate that.
On Fri, 2017-05-12 at 13:37 -0400, David Miller wrote:
> > Right now it is larger, but what I suggested is to add a new optimized
> > routine just for this case, which would do STBI for 64-bytes but
> > without membar (do membar at the end of memmap_init_zone() and
> > deferred_init_memmap()):
> >
> > #define struct_page_clear(page) \
> > 	__asm__ __volatile__( \
> > 	"stxa %%g0, [%0]%2\n" \
> > 	"stxa %%g0, [%0 + %1]%2\n" \
> > 	: /* No output */ \
> > 	: "r" (page), "r" (0x20), "i" (ASI_BLK_INIT_QUAD_LDD_P))
> >
> > And insert it into __init_single_page() instead of memset().
> >
> > The final result is 4.01s/T which is even faster compared to the
> > current 4.97s/T.
>
> Ok, indeed, that would work.

On ppc64, that might not work. We have a dcbz instruction that clears an
entire cache line at once. That's what we use for memsets and page
clearing. However, 64 bytes is half a cache line on modern processors,
so we can't use it with that semantic and would have to fall back to
the slower stores.

Cheers,
Ben.
Hi Michal,

I have considered your proposals:

1. Making memset(0) unconditional inside __init_single_page() is not
going to work, because it slows down SPARC and ppc64. On SPARC even the
BSTI optimization that I have proposed earlier won't work, because after
consulting with other engineers I was told that stores (without loads!)
after BSTI without a membar are unsafe.

2. Adding ARCH_WANT_LARGE_PAGEBLOCK_INIT is not going to solve the
problem, because while an arch might want a large memset(), it still
wants to get the benefit of parallelized struct page initialization.

3. Another approach that I have considered is moving memset() above
__init_single_page() and doing it in larger chunks. However, this
solution is also not going to work, because inside the loops there are
cases where "struct page"s are skipped, so every single page is checked:
early_pfn_valid(pfn), early_pfn_in_nid(), and also the
mirrored_kernelcore cases.

> I wouldn't be so sure about this. If any other platform has a similar
> issues with small memset as sparc then the overhead is just papered over
> by parallel initialization.

That is true, and that is fine, because parallelization gives an order
of magnitude better improvement compared to the trade-off of slower
single thread performance. Remember, this will happen during boot and
memory hotplug only, and is not something that will eat up computing
resources during runtime.

So, at the moment I cannot really find a better solution compared to
what I have proposed: do memset() inside __init_single_page() only when
deferred initialization is enabled.

Pasha
On Fri 26-05-17 12:45:55, Pasha Tatashin wrote:
> Hi Michal,
>
> I have considered your proposals:
>
> 1. Making memset(0) unconditional inside __init_single_page() is not going
> to work because it slows down SPARC, and ppc64. On SPARC even the BSTI
> optimization that I have proposed earlier won't work, because after
> consulting with other engineers I was told that stores (without loads!)
> after BSTI without membar are unsafe

Could you be more specific? E.g. how are other stores done in
__init_single_page safe then? I am sorry to be dense here, but how does
the full 64B store differ from other stores done in the same function?

[...]

> So, at the moment I cannot really find a better solution compared to what I
> have proposed: do memset() inside __init_single_page() only when deferred
> initialization is enabled.

As I've already said, I am not going to block your approach; I was just
hoping for something that doesn't depend on the deferred initialization,
especially when the struct page is a small object and it makes sense to
initialize it completely in a single place. Writing to a single cache
line should simply not add memory traffic for an exclusive cache line,
and struct pages are very likely to be exclusive at that stage. If that
doesn't fly then so be it, but I have to confess I still do not
understand why that is not the case.
> Could you be more specific? E.g. how are other stores done in
> __init_single_page safe then? I am sorry to be dense here but how does
> the full 64B store differ from other stores done in the same function.

Hi Michal,

It is safe to do regular 8-byte and smaller stores (stx, st, sth, stb)
without a membar, but they are slower compared to STBI, which requires a
membar before the memory can be accessed. So when on SPARC we zero a
larger span of memory, it is faster to use STBI and do one membar at the
end. This is why for a single thread it is faster to zero multiple pages
of memory and then initialize only the fields that are needed in "struct
page". I believe the same is true for ppc64, as they clear the whole
128-byte cacheline at a time with larger memsets.

Pasha
On Tue 30-05-17 13:16:50, Pasha Tatashin wrote:
> > Could you be more specific? E.g. how are other stores done in
> > __init_single_page safe then? I am sorry to be dense here but how does
> > the full 64B store differ from other stores done in the same function.
>
> Hi Michal,
>
> It is safe to do regular 8-byte and smaller stores (stx, st, sth, stb)
> without membar, but they are slower compared to STBI which require a membar
> before memory can be accessed.

OK, so why cannot we make zero_struct_page 8x 8B stores, other arches
would do memset. You said it would be slower but would that be
measurable? I am sorry to be so persistent here but I would be really
happier if this didn't depend on the deferred initialization. If this is
absolutely a no-go then I can live with that of course.
From: Michal Hocko <mhocko@kernel.org>
Date: Wed, 31 May 2017 18:31:31 +0200

> On Tue 30-05-17 13:16:50, Pasha Tatashin wrote:
>> > Could you be more specific? E.g. how are other stores done in
>> > __init_single_page safe then? I am sorry to be dense here but how does
>> > the full 64B store differ from other stores done in the same function.
>>
>> Hi Michal,
>>
>> It is safe to do regular 8-byte and smaller stores (stx, st, sth, stb)
>> without membar, but they are slower compared to STBI which require a membar
>> before memory can be accessed.
>
> OK, so why cannot we make zero_struct_page 8x 8B stores, other arches
> would do memset. You said it would be slower but would that be
> measurable? I am sorry to be so persistent here but I would be really
> happier if this didn't depend on the deferred initialization. If this is
> absolutely a no-go then I can live with that of course.

It is measurable. That's the impetus for this work in the first place.

When we do the memory barrier, the whole store buffer flushes, because
the memory barrier is done with a dependency on the next load or store
operation, one of which the caller is going to do immediately.
> OK, so why cannot we make zero_struct_page 8x 8B stores, other arches
> would do memset. You said it would be slower but would that be
> measurable? I am sorry to be so persistent here but I would be really
> happier if this didn't depend on the deferred initialization. If this is
> absolutely a no-go then I can live with that of course.

Hi Michal,

This is actually a very good idea. I just did some measurements, and it
looks like the performance is very good.

Here is data from a SPARC-M7 with 3312G of memory, single thread
performance:

Current:
	memset() in memblock allocator takes: 8.83s
	__init_single_page() takes: 8.63s

Option 1:
	memset() in __init_single_page() takes: 61.09s (as we discussed,
	because of the membar overhead; memset should really be optimized
	to do STBI only when the size is 1 page or bigger).

Option 2:
	8 stores (stx) in __init_single_page() take: 8.525s!

So, even for single thread performance we can double the initialization
speed of "struct page" on SPARC by removing memset() from memblock and
using 8 stx in __init_single_page(). It appears we never miss L1 in
__init_single_page() after the initial 8 stx.

I will update the patches with memset() on other platforms, and stx on
SPARC. My experimental code looks like this:

static void __meminit __init_single_page(struct page *page, unsigned long pfn,
					 unsigned long zone, int nid)
{
	__asm__ __volatile__(
	"stx	%%g0, [%0 + 0x00]\n"
	"stx	%%g0, [%0 + 0x08]\n"
	"stx	%%g0, [%0 + 0x10]\n"
	"stx	%%g0, [%0 + 0x18]\n"
	"stx	%%g0, [%0 + 0x20]\n"
	"stx	%%g0, [%0 + 0x28]\n"
	"stx	%%g0, [%0 + 0x30]\n"
	"stx	%%g0, [%0 + 0x38]\n"
	: /* No output */
	: "r"(page));
	set_page_links(page, zone, nid, pfn);
	init_page_count(page);
	page_mapcount_reset(page);
	page_cpupid_reset_last(page);
	INIT_LIST_HEAD(&page->lru);
#ifdef WANT_PAGE_VIRTUAL
	/* The shift won't overflow because ZONE_NORMAL is below 4G. */
	if (!is_highmem_idx(zone))
		set_page_address(page, __va(pfn << PAGE_SHIFT));
#endif
}

Thank you,
Pasha
On Wed 31-05-17 23:35:48, Pasha Tatashin wrote:
> > OK, so why cannot we make zero_struct_page 8x 8B stores, other arches
> > would do memset. You said it would be slower but would that be
> > measurable? I am sorry to be so persistent here but I would be really
> > happier if this didn't depend on the deferred initialization. If this is
> > absolutely a no-go then I can live with that of course.
>
> Hi Michal,
>
> This is actually a very good idea. I just did some measurements, and it
> looks like performance is very good.
>
> Here is data from SPARC-M7 with 3312G memory with single thread performance:
>
> Current:
> memset() in memblock allocator takes: 8.83s
> __init_single_page() takes: 8.63s
>
> Option 1:
> memset() in __init_single_page() takes: 61.09s (as we discussed because of
> membar overhead, memset should really be optimized to do STBI only when size
> is 1 page or bigger).
>
> Option 2:
> 8 stores (stx) in __init_single_page(): 8.525s!
>
> So, even for single thread performance we can double the initialization
> speed of "struct page" on SPARC by removing memset() from memblock, and
> using 8 stx in __init_single_page(). It appears we never miss L1 in
> __init_single_page() after the initial 8 stx.

OK, that is good to hear, and it actually matches my understanding that
writes to a single cacheline shouldn't add an overhead. Thanks!
diff --git a/include/linux/bootmem.h b/include/linux/bootmem.h
index 962164d36506..c9a08463d9a8 100644
--- a/include/linux/bootmem.h
+++ b/include/linux/bootmem.h
@@ -160,6 +160,9 @@ extern void *__alloc_bootmem_low_node(pg_data_t *pgdat,
 #define BOOTMEM_ALLOC_ANYWHERE	(~(phys_addr_t)0)
 
 /* FIXME: Move to memblock.h at a point where we remove nobootmem.c */
+void * memblock_virt_alloc_core(phys_addr_t size, phys_addr_t align,
+		phys_addr_t min_addr, phys_addr_t max_addr,
+		int nid);
 void *memblock_virt_alloc_try_nid_nopanic(phys_addr_t size,
 		phys_addr_t align, phys_addr_t min_addr,
 		phys_addr_t max_addr, int nid);
diff --git a/mm/memblock.c b/mm/memblock.c
index b049c9b2dba8..eab7da94f873 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -1271,8 +1271,7 @@ phys_addr_t __init memblock_alloc_try_nid(phys_addr_t size, phys_addr_t align,
  *
  * The memory block is aligned on SMP_CACHE_BYTES if @align == 0.
  *
- * The phys address of allocated boot memory block is converted to virtual and
- * allocated memory is reset to 0.
+ * The allocated memory has to be zeroed out by the caller explicitly.
  *
  * In addition, function sets the min_count to 0 using kmemleak_alloc for
  * allocated boot memory block, so that it is never reported as leaks.
@@ -1280,15 +1279,18 @@ phys_addr_t __init memblock_alloc_try_nid(phys_addr_t size, phys_addr_t align,
  * RETURNS:
  * Virtual address of allocated memory block on success, NULL on failure.
  */
-static void * __init memblock_virt_alloc_internal(
+static inline void * __init memblock_virt_alloc_internal(
 				phys_addr_t size, phys_addr_t align,
 				phys_addr_t min_addr, phys_addr_t max_addr,
-				int nid)
+				int nid, void *caller)
 {
 	phys_addr_t alloc;
 	void *ptr;
 	ulong flags = choose_memblock_flags();
 
+	memblock_dbg("%s: %llu bytes align=0x%llx nid=%d from=0x%llx max_addr=0x%llx %pF\n",
+		     __func__, (u64)size, (u64)align, nid, (u64)min_addr,
+		     (u64)max_addr, caller);
 	if (WARN_ONCE(nid == MAX_NUMNODES, "Usage of MAX_NUMNODES is deprecated. Use NUMA_NO_NODE instead\n"))
 		nid = NUMA_NO_NODE;
 
@@ -1334,7 +1336,6 @@ static void * __init memblock_virt_alloc_internal(
 		return NULL;
 done:
 	ptr = phys_to_virt(alloc);
-	memset(ptr, 0, size);
 
 	/*
 	 * The min_count is set to 0 so that bootmem allocated blocks
@@ -1347,6 +1348,14 @@ static void * __init memblock_virt_alloc_internal(
 	return ptr;
 }
 
+void * __init memblock_virt_alloc_core(phys_addr_t size, phys_addr_t align,
+		phys_addr_t min_addr, phys_addr_t max_addr,
+		int nid)
+{
+	return memblock_virt_alloc_internal(size, align, min_addr, max_addr,
+					    nid, (void *)_RET_IP_);
+}
+
 /**
  * memblock_virt_alloc_try_nid_nopanic - allocate boot memory block
  * @size: size of memory block to be allocated in bytes
@@ -1369,11 +1378,14 @@ void * __init memblock_virt_alloc_try_nid_nopanic(
 			phys_addr_t min_addr, phys_addr_t max_addr,
 			int nid)
 {
-	memblock_dbg("%s: %llu bytes align=0x%llx nid=%d from=0x%llx max_addr=0x%llx %pF\n",
-		     __func__, (u64)size, (u64)align, nid, (u64)min_addr,
-		     (u64)max_addr, (void *)_RET_IP_);
-	return memblock_virt_alloc_internal(size, align, min_addr,
-					     max_addr, nid);
+	void *ptr;
+
+	ptr = memblock_virt_alloc_internal(size, align, min_addr,
+					   max_addr, nid, (void *)_RET_IP_);
+	if (ptr)
+		memset(ptr, 0, size);
+
+	return ptr;
 }
 
 /**
@@ -1401,13 +1413,12 @@ void * __init memblock_virt_alloc_try_nid(
 {
 	void *ptr;
 
-	memblock_dbg("%s: %llu bytes align=0x%llx nid=%d from=0x%llx max_addr=0x%llx %pF\n",
-		     __func__, (u64)size, (u64)align, nid, (u64)min_addr,
-		     (u64)max_addr, (void *)_RET_IP_);
 	ptr = memblock_virt_alloc_internal(size, align,
-					    min_addr, max_addr, nid);
-	if (ptr)
+					   min_addr, max_addr, nid,
+					   (void *)_RET_IP_);
+	if (ptr) {
+		memset(ptr, 0, size);
 		return ptr;
+	}
 
 	panic("%s: Failed to allocate %llu bytes align=0x%llx nid=%d from=0x%llx max_addr=0x%llx\n",
 	      __func__, (u64)size, (u64)align, nid, (u64)min_addr,
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index a56c3989f773..4e060f0f9fe5 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -41,7 +41,7 @@ static void * __ref __earlyonly_bootmem_alloc(int node,
 				unsigned long align,
 				unsigned long goal)
 {
-	return memblock_virt_alloc_try_nid(size, align, goal,
+	return memblock_virt_alloc_core(size, align, goal,
 					    BOOTMEM_ALLOC_ACCESSIBLE, node);
 }