Message ID | 20160804013729.7fffa45a@roar.ozlabs.ibm.com (mailing list archive) |
---|---|
State | Not Applicable |
On Thursday, August 4, 2016 1:37:29 AM CEST Nicholas Piggin wrote:
>
> I've attached what I'm using, which builds and runs for me without
> any work. Your arch obviously has to select the option to use it.
>
>     text    data     bss      dec    hex filename
> 11196784 1185024 1923820 14305628 da495c vmlinuxppc64.before
> 11187536 1181848 1923176 14292560 da1650 vmlinuxppc64.after
>
> ~9K text saving, ~3K data saving. I assume this comes from fewer
> branch trampolines and toc entries, but haven't verified exactly.

The patch seems to work great, but for me it's getting bigger
(compared to my older patch, mainline allyesconfig doesn't build):

    text     data      bss       dec     hex filename
51299868 42599559 23362148 117261575 6fd4507 vmlinuxarm.before
51302545 42595015 23361884 117259444 6fd3cb4 vmlinuxarm.after

Most of the difference appears to be in branch trampolines (634 added,
559 removed, 14837 unchanged) as you suspect, but I also see a couple
of symbols show up in vmlinux that were not there before:

-A __crc_dma_noop_ops
-D dma_noop_ops
-R __clz_tab
-r fdt_errtable
-r __kcrctab_dma_noop_ops
-r __kstrtab_dma_noop_ops
-R __ksymtab_dma_noop_ops
-t dma_noop_alloc
-t dma_noop_free
-t dma_noop_map_page
-t dma_noop_mapping_error
-t dma_noop_map_sg
-t dma_noop_supported
-T fdt_add_reservemap_entry
-T fdt_begin_node
-T fdt_create
-T fdt_create_empty_tree
-T fdt_end_node
-T fdt_finish
-T fdt_finish_reservemap
-T fdt_property
-T fdt_resize
-T fdt_strerror
-T find_cpio_data

From my first look, it seems that all of lib/*.o is now getting linked
into vmlinux, while we traditionally leave out everything from lib/
that is not referenced.

I also see a noticeable overhead in link time, the numbers are for
a cache-hot rebuild after a successful allyesconfig build, using a
24-way Opteron@2.5Ghz, just relinking vmlinux:

$ time make skj30 vmlinux # before
real 2m8.092s
user 3m41.008s
sys 0m48.172s

$ time make skj30 vmlinux # after
real 4m10.189s
user 5m43.804s
sys 0m52.988s

That is clearly a very sharp difference.

Fortunately for the defconfig build, the times are much lower, and I see
no real difference other than the noise between subsequent runs:

$ time make skj30 vmlinux # before
real 0m5.415s
user 0m19.716s
sys 0m9.356s

$ time make skj30 vmlinux # before
real 0m9.536s
user 0m21.320s
sys 0m9.224s

$ time make skj30 vmlinux # after
real 0m5.539s
user 0m20.360s
sys 0m9.224s

$ time make skj30 vmlinux # after
real 0m9.138s
user 0m21.932s
sys 0m8.988s

$ time make skj30 vmlinux # after
real 0m5.659s
user 0m20.332s
sys 0m9.620s

Arnd
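The ARM size figures in the message above are easier to compare as per-column differences. The following is a minimal sketch, assuming a POSIX shell; `size_delta` is a hypothetical helper name, and the numbers are copied from the vmlinuxarm.before and vmlinuxarm.after rows.

```sh
#!/bin/sh
# Per-column difference between two rows of `size` output.
# size_delta is a hypothetical helper, not part of kbuild or binutils.
size_delta() {
    # $1..$4: text data bss dec (before); $5..$8: text data bss dec (after)
    printf 'text %+d data %+d bss %+d dec %+d\n' \
        "$(( $5 - $1 ))" "$(( $6 - $2 ))" "$(( $7 - $3 ))" "$(( $8 - $4 ))"
}

# Values taken from the vmlinuxarm.before / vmlinuxarm.after rows.
size_delta 51299868 42599559 23362148 117261575 \
           51302545 42595015 23361884 117259444
```

This prints `text +2677 data -4544 bss -264 dec -2131`: the text segment grows by about 2.6K while data and bss shrink, so the growth reported for ARM is in text only and the `dec` total is slightly smaller.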
Hi Arnd,

On Wed, Aug 03, 2016 at 08:52:48PM +0200, Arnd Bergmann wrote:
> From my first look, it seems that all of lib/*.o is now getting linked
> into vmlinux, while we traditionally leave out everything from lib/
> that is not referenced.
>
> I also see a noticeable overhead in link time, the numbers are for
> a cache-hot rebuild after a successful allyesconfig build, using a
> 24-way Opteron@2.5Ghz, just relinking vmlinux:
>
> $ time make skj30 vmlinux # before
> real 2m8.092s
> user 3m41.008s
> sys 0m48.172s
>
> $ time make skj30 vmlinux # after
> real 4m10.189s
> user 5m43.804s
> sys 0m52.988s

Is it better when using rcT instead of rcsT?

Segher
On Wednesday, August 3, 2016 2:44:29 PM CEST Segher Boessenkool wrote:
> Hi Arnd,
>
> On Wed, Aug 03, 2016 at 08:52:48PM +0200, Arnd Bergmann wrote:
> > From my first look, it seems that all of lib/*.o is now getting linked
> > into vmlinux, while we traditionally leave out everything from lib/
> > that is not referenced.
> >
> > I also see a noticeable overhead in link time, the numbers are for
> > a cache-hot rebuild after a successful allyesconfig build, using a
> > 24-way Opteron@2.5Ghz, just relinking vmlinux:
> >
> > $ time make skj30 vmlinux # before
> > real 2m8.092s
> > user 3m41.008s
> > sys 0m48.172s
> >
> > $ time make skj30 vmlinux # after
> > real 4m10.189s
> > user 5m43.804s
> > sys 0m52.988s
>
> Is it better when using rcT instead of rcsT?

It seems to be noticeably better for the clean rebuild case, though
not as good as the original:

real 3m34.015s
user 5m7.104s
sys 0m49.172s

I've also tried now with my own patch applied as well (linking
each drivers/*/built-in.o into vmlinux rather than having them
linked into drivers/built-in.o first), but that makes no
difference.

Arnd
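For readers unfamiliar with the `T` modifier discussed above, here is a small sketch of the difference between a regular and a thin archive. It assumes GNU ar (other ar implementations give `T` a different meaning); `demo_thin_archive` and the file names are hypothetical, and a zero-filled file stands in for a real object file.

```sh
#!/bin/sh
# Build a regular and a thin archive from the same member and print the
# magic string at the start of each. Assumes GNU ar.
# demo_thin_archive is a hypothetical helper, not part of kbuild.
demo_thin_archive() {
    dir=$(mktemp -d) || return 1
    (
        cd "$dir" || exit 1
        # Stand-in member; a kernel build would archive real .o files.
        head -c 4096 /dev/zero > member.o
        ar rc  regular.a member.o   # member contents are copied in
        ar rcT thin.a    member.o   # only a reference to member.o is stored
        head -c 7 regular.a; echo   # magic of a regular archive
        head -c 7 thin.a; echo      # magic of a thin archive
    )
    status=$?
    rm -rf "$dir"
    return "$status"
}

demo_thin_archive
```

With GNU ar the two lines printed are `!<arch>` and `!<thin>`. Because a thin archive only references its members, the member files must stay in place for later link steps.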
Hi Arnd,

On Wed, 03 Aug 2016 20:52:48 +0200 Arnd Bergmann <arnd@arndb.de> wrote:
>
> Most of the difference appears to be in branch trampolines (634 added,
> 559 removed, 14837 unchanged) as you suspect, but I also see a couple
> of symbols show up in vmlinux that were not there before:
>
> -A __crc_dma_noop_ops
> -D dma_noop_ops
> -R __clz_tab
> -r fdt_errtable
> -r __kcrctab_dma_noop_ops
> -r __kstrtab_dma_noop_ops
> -R __ksymtab_dma_noop_ops
> -t dma_noop_alloc
> -t dma_noop_free
> -t dma_noop_map_page
> -t dma_noop_mapping_error
> -t dma_noop_map_sg
> -t dma_noop_supported
> -T fdt_add_reservemap_entry
> -T fdt_begin_node
> -T fdt_create
> -T fdt_create_empty_tree
> -T fdt_end_node
> -T fdt_finish
> -T fdt_finish_reservemap
> -T fdt_property
> -T fdt_resize
> -T fdt_strerror
> -T find_cpio_data
>
> From my first look, it seems that all of lib/*.o is now getting linked
> into vmlinux, while we traditionally leave out everything from lib/
> that is not referenced.

You could try removing the --{,no-}whole-archive arguments to ld in
scripts/link-vmlinux.sh. Last time I did that, though, a whole lot of
stuff failed to be linked in. (Especially stuff only referenced by
EXPORT_SYMBOL()s, but that may have been fixed).

> I also see a noticeable overhead in link time, the numbers are for
> a cache-hot rebuild after a successful allyesconfig build, using a
> 24-way Opteron@2.5Ghz, just relinking vmlinux:

I was afraid of that, but is it offset by the time saved by not doing
the "ld -r"s along the way? It may also be that (for powerpc anyway)
the linker is doing a better job.
On Wed, 03 Aug 2016 22:13:28 +0200 Arnd Bergmann <arnd@arndb.de> wrote:
> On Wednesday, August 3, 2016 2:44:29 PM CEST Segher Boessenkool wrote:
> > Hi Arnd,
> >
> > On Wed, Aug 03, 2016 at 08:52:48PM +0200, Arnd Bergmann wrote:
> > > From my first look, it seems that all of lib/*.o is now getting linked
> > > into vmlinux, while we traditionally leave out everything from lib/
> > > that is not referenced.
> > >
> > > I also see a noticeable overhead in link time, the numbers are for
> > > a cache-hot rebuild after a successful allyesconfig build, using a
> > > 24-way Opteron@2.5Ghz, just relinking vmlinux:
> > >
> > > $ time make skj30 vmlinux # before
> > > real 2m8.092s
> > > user 3m41.008s
> > > sys 0m48.172s
> > >
> > > $ time make skj30 vmlinux # after
> > > real 4m10.189s
> > > user 5m43.804s
> > > sys 0m52.988s
> >
> > Is it better when using rcT instead of rcsT?
>
> It seems to be noticeably better for the clean rebuild case, though
> not as good as the original:
>
> real 3m34.015s
> user 5m7.104s
> sys 0m49.172s
>
> I've also tried now with my own patch applied as well (linking
> each drivers/*/built-in.o into vmlinux rather than having them
> linked into drivers/built-in.o first), but that makes no
> difference.

I just want to come back to this, because I've submitted the thin
archives kbuild patch, I wanted to make sure we're doing okay on
ARM/ARM64. I cross compiled with my laptop.

For ARM64 allyesconfig:

After building then removing all built-in.o then rebuilding vmlinux:
inclink
time make ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- -j8 vmlinux
real 1m18.977s
user 2m14.512s
sys 0m29.704s

thinarc
time make ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- -j8 vmlinux
real 1m18.433s
user 2m6.128s
sys 0m28.372s

Final ld time
inclink
real 0m4.005s
user 0m3.464s
sys 0m0.536s

thinarc
real 0m5.841s
user 0m4.916s
sys 0m0.916s

Build directory size is of course much better (3953MB vs 5519MB).

For ARM, defconfig

After building then removing all built-in.o then rebuilding vmlinux:
inclink
real 0m19.593s
user 0m22.372s
sys 0m6.428s

thinarc
real 0m18.919s
user 0m21.924s
sys 0m6.400s

Final ld time
inclink
real 0m0.378s
user 0m0.304s
sys 0m0.076s

thinarc
real 0m0.894s
user 0m0.684s
sys 0m0.200s

For both cases final link gets slower with thin archives. I guess there is some
per-file overhead but I thought with --whole-archive it should not be that much
slower. Still, overall time for main ar/ld phases comes out about the same in
the end so I don't think it's too much problem. Unless ARM blows up significantly
worse with a bigger config.

Linking with thin archives takes significantly more time in bfd hash lookup code.
I haven't dug much further yet.

Thanks,
Nick
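The measurement procedure described above (build, remove every built-in.o, rebuild vmlinux) could be scripted roughly as follows. This is only a sketch of one way to do it; `remove_builtins` is a hypothetical helper name, and the exact commands used for the quoted timings are not given in the thread beyond the `make` line.

```sh
#!/bin/sh
# Delete every built-in.o below a build tree so that the next
# `make vmlinux` has to repeat all ar/ld steps without recompiling.
# remove_builtins is a hypothetical helper, not part of kbuild.
remove_builtins() {
    # $1: top of the build tree (defaults to the current directory)
    find "${1:-.}" -name built-in.o -type f -exec rm -f {} +
}

# Assumed usage, mirroring the ARM64 command line quoted above:
#   remove_builtins .
#   time make ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- -j8 vmlinux
```

Only the combined objects are removed, so the timing covers the ar/ld phases and the final link rather than compilation.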
On Thursday, August 11, 2016 10:43:20 PM CEST Nicholas Piggin wrote:
> On Wed, 03 Aug 2016 22:13:28 +0200
> Arnd Bergmann <arnd@arndb.de> wrote:
>
> > On Wednesday, August 3, 2016 2:44:29 PM CEST Segher Boessenkool wrote:
> > > Hi Arnd,
> > >
> > > On Wed, Aug 03, 2016 at 08:52:48PM +0200, Arnd Bergmann wrote:
> > > > From my first look, it seems that all of lib/*.o is now getting linked
> > > > into vmlinux, while we traditionally leave out everything from lib/
> > > > that is not referenced.
> > > >
> > > > I also see a noticeable overhead in link time, the numbers are for
> > > > a cache-hot rebuild after a successful allyesconfig build, using a
> > > > 24-way Opteron@2.5Ghz, just relinking vmlinux:
> > > >
> > > > $ time make skj30 vmlinux # before
> > > > real 2m8.092s
> > > > user 3m41.008s
> > > > sys 0m48.172s
> > > >
> > > > $ time make skj30 vmlinux # after
> > > > real 4m10.189s
> > > > user 5m43.804s
> > > > sys 0m52.988s
> > >
> > > Is it better when using rcT instead of rcsT?
> >
> > It seems to be noticeably better for the clean rebuild case, though
> > not as good as the original:
> >
> > real 3m34.015s
> > user 5m7.104s
> > sys 0m49.172s
> >
> > I've also tried now with my own patch applied as well (linking
> > each drivers/*/built-in.o into vmlinux rather than having them
> > linked into drivers/built-in.o first), but that makes no
> > difference.
>
> I just want to come back to this, because I've submitted the thin
> archives kbuild patch, I wanted to make sure we're doing okay on
> ARM/ARM64. I cross compiled with my laptop.
>
> For ARM64 allyesconfig:
>
> After building then removing all built-in.o then rebuilding vmlinux:
> inclink
> time make ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- -j8 vmlinux
> real 1m18.977s
> user 2m14.512s
> sys 0m29.704s
>
> thinarc
> time make ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- -j8 vmlinux
> real 1m18.433s
> user 2m6.128s
> sys 0m28.372s
>
>
> Final ld time
> inclink
> real 0m4.005s
> user 0m3.464s
> sys 0m0.536s
>
> thinarc
> real 0m5.841s
> user 0m4.916s
> sys 0m0.916s
>
>
> Build directory size is of course much better (3953MB vs 5519MB).

Ok, looks great. Some downsides and some upsides here, but overall
I think this is a win.

>
> For ARM, defconfig
>
> After building then removing all built-in.o then rebuilding vmlinux:
> inclink
> real 0m19.593s
> user 0m22.372s
> sys 0m6.428s
>
> thinarc
> real 0m18.919s
> user 0m21.924s
> sys 0m6.400s
>
>
> Final ld time
> inclink
> real 0m0.378s
> user 0m0.304s
> sys 0m0.076s
>
> thinarc
> real 0m0.894s
> user 0m0.684s
> sys 0m0.200s

This also still seems fine.

> For both cases final link gets slower with thin archives. I guess there is some
> per-file overhead but I thought with --whole-archive it should not be that much
> slower. Still, overall time for main ar/ld phases comes out about the same in
> the end so I don't think it's too much problem. Unless ARM blows up significantly
> worse with a bigger config.

Unfortunately I think it does. I haven't tried your latest series yet,
but I think the total time for removing built-in.o and relinking went
up from around 4 minutes (already way too much) to 18 minutes for me.

> Linking with thin archives takes significantly more time in bfd hash lookup code.
> I haven't dug much further yet.

Can you try the ARM allyesconfig with thin archives? I'll follow up with two
patches: one to get ARM to link without thin archives, and one that I used
to get --gc-sections to work.

Arnd
On Thu, 11 Aug 2016 15:04:00 +0200 Arnd Bergmann <arnd@arndb.de> wrote:
> On Thursday, August 11, 2016 10:43:20 PM CEST Nicholas Piggin wrote:
> > On Wed, 03 Aug 2016 22:13:28 +0200
> > Final ld time
> > inclink
> > real 0m0.378s
> > user 0m0.304s
> > sys 0m0.076s
> >
> > thinarc
> > real 0m0.894s
> > user 0m0.684s
> > sys 0m0.200s
>
> This also still seems fine.
>
> > For both cases final link gets slower with thin archives. I guess there is some
> > per-file overhead but I thought with --whole-archive it should not be that much
> > slower. Still, overall time for main ar/ld phases comes out about the same in
> > the end so I don't think it's too much problem. Unless ARM blows up significantly
> > worse with a bigger config.
>
> Unfortunately I think it does. I haven't tried your latest series yet,
> but I think the total time for removing built-in.o and relinking went
> up from around 4 minutes (already way too much) to 18 minutes for me.
>
> > Linking with thin archives takes significantly more time in bfd hash lookup code.
> > I haven't dug much further yet.
>
> Can you try the ARM allyesconfig with thin archives? I'll follow up with two
> patches: one to get ARM to link without thin archives, and one that I used
> to get --gc-sections to work.

Okay send them over, I'll try digging into it. There is not much kbuild code to
maintain so we don't have to switch every arch. It would be nice to though.

Thanks,
Nick
diff --git a/arch/Kconfig b/arch/Kconfig
index d794384..1330bf4 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -424,6 +424,12 @@ config CC_STACKPROTECTOR_STRONG
 
 endchoice
 
+config THIN_ARCHIVES
+	bool
+	help
+	  Select this if the architecture wants to use thin archives
+	  instead of ld -r to create the built-in.o files.
+
 config HAVE_CONTEXT_TRACKING
 	bool
 	help
diff --git a/scripts/Makefile.build b/scripts/Makefile.build
index 0d1ca5b..bbf60b3 100644
--- a/scripts/Makefile.build
+++ b/scripts/Makefile.build
@@ -358,10 +358,15 @@ $(sort $(subdir-obj-y)): $(subdir-ym) ;
 # Rule to compile a set of .o files into one .o file
 #
 ifdef builtin-target
+ifdef CONFIG_THIN_ARCHIVES
+  cmd_make_builtin = rm -f $@; $(AR) rcsT$(KBUILD_ARFLAGS)
+else
+  cmd_make_builtin = $(LD) $(ld_flags) -r -o
+endif
 quiet_cmd_link_o_target = LD      $@
 # If the list of objects to link is empty, just create an empty built-in.o
 cmd_link_o_target = $(if $(strip $(obj-y)),\
-		      $(LD) $(ld_flags) -r -o $@ $(filter $(obj-y), $^) \
+		      $(cmd_make_builtin) $@ $(filter $(obj-y), $^) \
 		      $(cmd_secanalysis),\
 		      rm -f $@; $(AR) rcs$(KBUILD_ARFLAGS) $@)
 
diff --git a/scripts/link-vmlinux.sh b/scripts/link-vmlinux.sh
index f0f6d9d..ef4658f 100755
--- a/scripts/link-vmlinux.sh
+++ b/scripts/link-vmlinux.sh
@@ -41,8 +41,14 @@ info()
 # ${1} output file
 modpost_link()
 {
-	${LD} ${LDFLAGS} -r -o ${1} ${KBUILD_VMLINUX_INIT} \
-		--start-group ${KBUILD_VMLINUX_MAIN} --end-group
+	local objects
+
+	if [ -n "${CONFIG_THIN_ARCHIVES}" ]; then
+		objects="--whole-archive ${KBUILD_VMLINUX_INIT} ${KBUILD_VMLINUX_MAIN} --no-whole-archive"
+	else
+		objects="${KBUILD_VMLINUX_INIT} --start-group ${KBUILD_VMLINUX_MAIN} --end-group"
+	fi
+	${LD} ${LDFLAGS} -r -o ${1} ${objects}
 }
 
 # Link of vmlinux
@@ -51,11 +57,16 @@ modpost_link()
 vmlinux_link()
 {
 	local lds="${objtree}/${KBUILD_LDS}"
+	local objects
 
 	if [ "${SRCARCH}" != "um" ]; then
+		if [ -n "${CONFIG_THIN_ARCHIVES}" ]; then
+			objects="--whole-archive ${KBUILD_VMLINUX_INIT} ${KBUILD_VMLINUX_MAIN} --no-whole-archive"
+		else
+			objects="${KBUILD_VMLINUX_INIT} --start-group ${KBUILD_VMLINUX_MAIN} --end-group"
+		fi
 		${LD} ${LDFLAGS} ${LDFLAGS_vmlinux} -o ${2} \
-			-T ${lds} ${KBUILD_VMLINUX_INIT} \
-			--start-group ${KBUILD_VMLINUX_MAIN} --end-group ${1}
+			-T ${lds} ${objects} ${1}
 	else
 		${CC} ${CFLAGS_vmlinux} -o ${2} \
 			-Wl,-T,${lds} ${KBUILD_VMLINUX_INIT} \
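The object-list selection that the patch adds to both modpost_link() and vmlinux_link() can be tried in isolation. The sketch below factors it into a function; `vmlinux_objects` is a hypothetical name (the patch inlines the logic), and the object lists are placeholder values rather than the real kbuild variables.

```sh
#!/bin/sh
# Standalone sketch of the linker argument selection from the patch.
# vmlinux_objects is a hypothetical name; the patch inlines this logic.
vmlinux_objects() {
    if [ -n "${CONFIG_THIN_ARCHIVES}" ]; then
        # Thin archives: force every archive member into the link.
        printf '%s\n' "--whole-archive ${KBUILD_VMLINUX_INIT} ${KBUILD_VMLINUX_MAIN} --no-whole-archive"
    else
        # ld -r objects: a group lets ld rescan for circular references.
        printf '%s\n' "${KBUILD_VMLINUX_INIT} --start-group ${KBUILD_VMLINUX_MAIN} --end-group"
    fi
}

# Placeholder values for illustration only.
KBUILD_VMLINUX_INIT="init/built-in.o"
KBUILD_VMLINUX_MAIN="kernel/built-in.o lib/built-in.o"

( CONFIG_THIN_ARCHIVES=y; vmlinux_objects )
( CONFIG_THIN_ARCHIVES=;  vmlinux_objects )
```

Without --whole-archive, ld pulls in only the archive members needed to resolve undefined symbols, so the flag is what keeps unreferenced built-in code in the image. It would also be consistent with the observation earlier in the thread that everything under lib/ now ends up in vmlinux.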