Message ID | CAPweEDw710zFK8KLZY5gsQxEkQKrDiFkNRgABY9HJZ1rxpeVCg@mail.gmail.com |
---|---|
State | Not Applicable |
Series | microwatt booting linux-5.7 under verilator |
Hi Luke,

Interesting to read about the project, thanks for the post.

Excerpts from Luke Kenneth Casson Leighton's message of January 3, 2022 10:45 am:
> i am pleased to be able to announce the successful booting of microwatt-5.7
> linux buildroot... under a verilator simulation of the microwatt VHDL.
> from a hardware development and research perspective this is highly
> significant because unlike the FPGA boot which was previously reported,
> https://shenki.github.io/boot-linux-on-microwatt/
> full memory read/write snooping and full Signal tracing (gtkwave) is possible.
>
> https://ftp.libre-soc.org/microwatt-linux-5.7-verilator-boot-buildroot.txt
>
> the branch of microwatt HDL which is being used is here
> https://git.libre-soc.org/?p=microwatt.git;a=shortlog;h=refs/heads/verilator_trace
>
> some minor strategic changes to microwatt HDL were required, including
> adding a new SYSCON parameter to specify a BRAM chain-boot address,
> and also it was necessary to turn sdram_init into a stand-alone "mini-BIOS"
> which performed the role of early-initialising the 16550 uart followed by
> chain-loading to the BRAM chain-boot memory location, at which the linux
> 5.7 dtbImage.microwatt had been loaded (0x600000).
>
> microwatt-verilator.cpp itself needed some changes to add support for
> emulation in c++ of 512 mbyte of "Block" RAM. the interface for BRAM
> (aka SRAM) was far simpler than attempting to emulate DRAM, and
> also meant that much of the mini-BIOS could be entirely cut.
>
> i also had to further modify microwatt-verilator.cpp to allow it to load
> from files directly into memory, at run-time. this means it is possible
> to execute hello_world.bin, zephyr.bin, micropython.bin, dtbImage-microwatt
> all without recompiling the verilator binary.
>
> (not that you want to try compiling a 6 MB binary into VHDL like i did:
> it resulted in the creation of a 512 MB verilog file which, at 60 GB resident
> RAM by verilator attempting to compile that to c++, i decided that mayyybe
> doing that at runtime was a better approach?)
>
> i also had to fix a couple of things in the linux kernel source
> https://git.kernel.org/pub/scm/linux/kernel/git/joel/microwatt.git

I think these have mostly (all?) been upstreamed now.

> first attempts to boot a compressed image were quite hilarious: a
> quick back-of-the-envelope calculation by examining the rate at which
> LD/STs were being generated showed that the GZIP decompression
> would complete maybe some time in about 1 hour of real-world time.
> this led me to add support for CONFIG_KERNEL_UNCOMPRESSED
> and cut that time entirely, hence why you can see this in the console log:
>
> 0x5b0e10 bytes of uncompressed data copied

Interesting, it looks like your HAVE_KERNEL_UNCOMPRESSED support
patch is pretty trivial. We should be able to upstream it pretty
easily I think?

> secondly, the microwatt Makefile assumes that verilator clock rate
> runs at 50 mhz, where the microwatt.dts file says 100 mhz for both
> the UART clock as well as the system clock. it would be really nice
> to have microwatt-linux read the SYSCON parameter for the
> clock rate, and for that to be dynamically inserted into the dtb.
> however in the interim, the attached patch suffices by manually
> altering the clock in microwatt.dts to match that of the SYSCON
> parameter.

There is a dt_fixup_clock() that's used by a few platforms. Can we
read that parameter, say, in linux/arch/powerpc/boot/microwatt.c
platform_init() and fix it up there?

How do you even read the SYSCON parameter for frequency?

Thanks,
Nick
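The 100 MHz vs 50 MHz mismatch in the dts is easiest to see with a little arithmetic. The sketch below is illustrative only: it assumes a 16550-style UART whose baud rate is clock / (16 * divisor), rounds the divisor to the nearest integer, and assumes (as the patch's dts changes suggest) that the UART clock and the timebase both follow the 50 MHz system clock under Verilator. The function names are made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/* 16550-style UARTs divide their input clock by 16 * divisor, so the
 * divisor for a wanted baud rate is clock / (16 * baud), rounded here
 * to the nearest integer. */
static uint32_t uart_divisor(uint64_t clock_hz, uint32_t baud)
{
	uint64_t den;

	assert(baud != 0);
	den = 16ULL * baud;
	return (uint32_t)((clock_hz + den / 2) / den);
}

/* Baud rate actually produced when `divisor` is programmed into a UART
 * whose real input clock is actual_clock_hz. */
static uint32_t uart_actual_baud(uint64_t actual_clock_hz, uint32_t divisor)
{
	assert(divisor != 0);
	return (uint32_t)(actual_clock_hz / (16ULL * divisor));
}

/* Nanoseconds the OS believes have passed after `ticks` timebase ticks,
 * given the timebase-frequency it was told about in the device tree.
 * (ticks * 1e9 fits in 64 bits for the small counts used here.) */
static uint64_t perceived_ns(uint64_t ticks, uint64_t declared_tb_hz)
{
	assert(declared_tb_hz != 0);
	return ticks * 1000000000ULL / declared_tb_hz;
}
```

With the dts claiming 100 MHz, `uart_divisor(100000000, 115200)` gives 54; programmed into a UART that is really clocked at 50 MHz, that divisor yields `uart_actual_baud(50000000, 54)` = 57870 baud, roughly half of 115200. Likewise, a timebase declared at 100 MHz but ticking at 50 MHz makes `perceived_ns(50000000, 100000000)` report 0.5 s for one real second.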
On January 31, 2022 3:31:41 AM UTC, Nicholas Piggin <npiggin@gmail.com> wrote:
>Hi Luke,
>
>Interesting to read about the project, thanks for the post.

no problem. it's been i think 18 years since i last did linux kernel work.

>> i also had to fix a couple of things in the linux kernel source
>> https://git.kernel.org/pub/scm/linux/kernel/git/joel/microwatt.git
>
>I think these have mostly (all?) been upstreamed now.

i believe so, although last i checked (6 months?) there was some of the dts still to do. instructions online all tend to refer to joel or benh's tree(s)

>> this led me to add support for CONFIG_KERNEL_UNCOMPRESSED
>> and cut that time entirely, hence why you can see this in the console log:
>>
>> 0x5b0e10 bytes of uncompressed data copied
>
>Interesting, it looks like your HAVE_KERNEL_UNCOMPRESSED support
>patch is pretty trivial.

yeah i was really surprised, it was all there

>We should be able to upstream it pretty
>easily I think?

don't see why not.

the next interesting thing, which would save another hour when emulating HDL at this astoundingly-slow speed of sub-1000 instructions per second, would be in-place execution: no memcpy, just jump.

i seem to recall this (in-place execution) being a standard option back in 2003 when i was doing xda-developers wince smartphone reverse-engineering, although with it being 19 years ago i could be wrong.

other areas are the memset before VM is set up, followed by memset *again* on individual pages once created. those are an hour each.

another hour is spent on early device tree flat walking.

one very big one (90+ mins) is the sysfs binary tree walk. i'm sure even just saving the last node in a 1-entry cache would improve time there, or, better, a 4-entry cache (one per level).

although it sounds weird talking in a timeframe that is literally 100,000 times slower than what anyone else is used to, if improved it results in dramatic reduction in boot times for embedded IoT, e.g. BMC systems.

>> however in the interim, the attached patch suffices by manually
>> altering the clock in microwatt.dts to match that of the SYSCON
>> parameter.
>
>There is a dt_fixup_clock() that's used by a few platforms. Can we
>read that parameter say in linux/arch/powerpc/boot/microwatt.c
>platform_init() and fix it up there?
>
>How do you even read the SYSCON parameter for frequency?

SYSCON is just a term for a memory-mapped wishbone ROM which contains a crude, easily-decoded binary form of devicetree.

when you read 0xc0001000 (say) its contents tell you the clock speed.

at 0xc0001008 is the number of UARTs. 0xc0001010 contains the UART0 speed, or, well, you can see the real contents in syscon.vhdl.

it is _real_ basic but contains everything that a cold-start BIOS needs to know, such as "do i even have DRAM, do i have an SPI Flash i can read a second stage bootloader from" etc etc

https://github.com/antonblanchard/microwatt/blob/master/syscon.vhdl

Paul said it was always planned to do reading of these params; the entries in devicetree are a temporary hack.

l.
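lkcl's caching idea for a repeated tree walk can be sketched as follows. This is a generic illustration, not a patch to any kernel code: the `node` layout, `lookup_cache`, `find_child()` and `lookup_path()` are hypothetical, children are assumed to be kept in a singly linked sibling list (so each level costs a linear scan), and the cache keeps one remembered hit per depth, as in the "one per level" variant.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical tree node: children live in a singly linked sibling list,
 * so finding a child by name means scanning the siblings one by one. */
struct node {
	const char *name;
	struct node *child;   /* first child, or NULL */
	struct node *sibling; /* next sibling, or NULL */
};

#define CACHE_LEVELS 4 /* "a 4-entry cache (one per level)" */

/* One remembered (parent -> child) hit per depth; depths beyond the
 * last slot share it. */
struct lookup_cache {
	const struct node *parent[CACHE_LEVELS];
	struct node *hit[CACHE_LEVELS];
	unsigned long scans; /* sibling comparisons performed, for measuring */
};

static struct node *find_child(struct lookup_cache *c, const struct node *parent,
			       const char *name, size_t depth)
{
	size_t slot = depth < CACHE_LEVELS ? depth : CACHE_LEVELS - 1;
	struct node *n;

	assert(c != NULL && parent != NULL && name != NULL);

	/* Fast path: same parent and same name as the last hit at this depth. */
	if (c->parent[slot] == parent && c->hit[slot] != NULL &&
	    strcmp(c->hit[slot]->name, name) == 0)
		return c->hit[slot];

	/* Slow path: linear scan, then remember the result for next time. */
	for (n = parent->child; n != NULL; n = n->sibling) {
		c->scans++;
		if (strcmp(n->name, name) == 0) {
			c->parent[slot] = parent;
			c->hit[slot] = n;
			return n;
		}
	}
	return NULL;
}

/* Resolve a path given as components, e.g. {"soc", "serial"}. */
static struct node *lookup_path(struct lookup_cache *c, struct node *root,
				const char *const *parts, size_t nparts)
{
	struct node *cur = root;

	for (size_t i = 0; i < nparts && cur != NULL; i++)
		cur = find_child(c, cur, parts[i], i);
	return cur;
}
```

Repeating a lookup of the same path then costs one pointer compare and one strcmp per level instead of a sibling scan. Whether the real walk revisits the same nodes often enough for this to pay off would need measuring.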
On Mon, 2022-01-31 at 04:19 +0000, lkcl wrote:
> > How do you even read the SYSCON parameter for frequency?
>
> SYSCON is just a term for a memory-mapped wishbone ROM which contains
> a crude easily-decoded binary form of devicetree.

Talking of which, if we're going to make use of it (we should), we
probably need to ensure it's also ported to microwatt on LiteX. Though
LiteX has another issue in that it puts MMIO elsewhere iirc. That or we
rely 100% on LiteX having a good DT (and thus use a different platform
for it).
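A minimal sketch of reading such a ROM, assuming the layout lkcl gives "(say)" in his reply: 64-bit words with the clock frequency at offset 0x00, the number of UARTs at 0x08 and the UART0 speed at 0x10 from some base. Those offsets, the struct and the function names are illustrative assumptions; the real register map is whatever syscon.vhdl defines. The base address is passed in rather than hard-coded, given the remark that LiteX may put MMIO elsewhere.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative offsets only, following the "(say)" example in the thread:
 * clock frequency at +0x00, number of UARTs at +0x08, UART0 speed at +0x10.
 * The authoritative register map is defined in microwatt's syscon.vhdl. */
#define SYSCON_OFF_CLOCK_HZ    0x00u
#define SYSCON_OFF_NUM_UARTS   0x08u
#define SYSCON_OFF_UART0_SPEED 0x10u

struct syscon_info {
	uint64_t clock_hz;
	uint64_t num_uarts;
	uint64_t uart0_speed;
};

/* 64-bit load from a memory-mapped register. The volatile qualifier makes
 * the compiler emit every access instead of caching or merging them. */
static uint64_t syscon_read64(const volatile void *base, uint32_t off)
{
	return *(const volatile uint64_t *)((const volatile uint8_t *)base + off);
}

/* The base address is a parameter rather than a constant, so the same code
 * could be pointed at wherever a given SoC maps the SYSCON ROM. */
static struct syscon_info syscon_probe(const volatile void *base)
{
	struct syscon_info info;

	assert(base != NULL);
	info.clock_hz = syscon_read64(base, SYSCON_OFF_CLOCK_HZ);
	info.num_uarts = syscon_read64(base, SYSCON_OFF_NUM_UARTS);
	info.uart0_speed = syscon_read64(base, SYSCON_OFF_UART0_SPEED);
	return info;
}
```

In the boot wrapper, the resulting clock value is what could be handed to a devicetree fixup such as the dt_fixup_clock() Nick mentions, instead of relying on the hard-coded clock-frequency in microwatt.dts. On a host, the function can be exercised against an ordinary array standing in for the ROM.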
Excerpts from lkcl's message of January 31, 2022 2:19 pm:
>
> On January 31, 2022 3:31:41 AM UTC, Nicholas Piggin <npiggin@gmail.com> wrote:
>>Hi Luke,
>>
>>Interesting to read about the project, thanks for the post.
>
> no problem. it's been i think 18 years since i last did linux kernel work.
>
>>> i also had to fix a couple of things in the linux kernel source
>>> https://git.kernel.org/pub/scm/linux/kernel/git/joel/microwatt.git
>>
>>I think these have mostly (all?) been upstreamed now.
>
> i believe so, although last i checked (6 months?) there was some of dts still to do. instructions online all tend to refer to joel or benh's tree(s)
>
>>> this led me to add support for CONFIG_KERNEL_UNCOMPRESSED
>>> and cut that time entirely, hence why you can see this in the console log:
>>>
>>> 0x5b0e10 bytes of uncompressed data copied
>>
>>Interesting, it looks like your HAVE_KERNEL_UNCOMPRESSED support
>>patch is pretty trivial.
>
> yeah i was really surprised, it was all there
>
>> We should be able to upstream it pretty
>>easily I think?
>
> don't see why not.

Okay then we should.

>
> the next interesting thing which would save another hour when emulating HDL at this astoundingly-slow speed of sub-1000 instructions per second would be in-place execution: no memcpy, just jump.
>
> i seem to recall this (inplace execution) being a standard option back in 2003 when i was doing xda-developers wince smartphone reverse-engineering, although with it being 19 years ago i could be wrong

Not sure of the details on that. Is it memcpy()ing out of ROM or RAM to
RAM? Is this in the arch boot code? (I don't know very well).

>
> other areas are the memset before VM is set up, followed by memset *again* on individual pages once created. those are an hour each

Seems like we should avoid the duplication and maybe be able to
add an option to skip zeroing (I thought there was one, maybe thinking
of something else).

>
> another hour is spent on early device tree flat walking.

Are you using optimize for size? That can result in much slower code in
some places. In skiboot we compile some of the string.h library code
with -O2 for example.

Thanks,
Nick
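Nick's suggestion of avoiding the duplicated clearing can be illustrated with a "known-zero" flag per page: the early pass records which pages it has cleared (or which the platform guarantees are zero at reset, as a simulator can), and the later per-page allocation only runs memset on pages that might hold stale data. This is a toy model with hypothetical names (`page_pool`, `pool_init()`, `alloc_zeroed_page()`, `release_page()`) and a tiny fixed pool, not how Linux's page allocator is structured; skipping the clear is only safe when the zero guarantee really holds, and on real hardware clearing also serves a security purpose.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096u /* hypothetical page size */
#define SKETCH_NPAGES    8u    /* tiny pool, just for illustration */

struct page_pool {
	unsigned char mem[SKETCH_NPAGES][SKETCH_PAGE_SIZE];
	bool known_zero[SKETCH_NPAGES]; /* page is guaranteed all-zero */
	size_t bytes_cleared;           /* how much memset work was done */
};

/* Early-boot pass. On real hardware RAM contents are unknown, so clear it;
 * if the platform guarantees zeroed RAM at reset, skip the memset but
 * still record every page as known-zero. */
static void pool_init(struct page_pool *p, bool ram_zero_at_reset)
{
	p->bytes_cleared = 0;
	for (size_t i = 0; i < SKETCH_NPAGES; i++) {
		if (!ram_zero_at_reset) {
			memset(p->mem[i], 0, SKETCH_PAGE_SIZE);
			p->bytes_cleared += SKETCH_PAGE_SIZE;
		}
		p->known_zero[i] = true;
	}
}

/* Hand out page i zero-filled, clearing it only if it may be dirty. */
static void *alloc_zeroed_page(struct page_pool *p, size_t i)
{
	assert(i < SKETCH_NPAGES);
	if (!p->known_zero[i]) {
		memset(p->mem[i], 0, SKETCH_PAGE_SIZE);
		p->bytes_cleared += SKETCH_PAGE_SIZE;
	}
	p->known_zero[i] = false; /* the caller is about to write to it */
	return p->mem[i];
}

/* A released page may hold old data, so it must be cleared before reuse. */
static void release_page(struct page_pool *p, size_t i)
{
	assert(i < SKETCH_NPAGES);
	p->known_zero[i] = false;
}
```

The first allocation of each page after `pool_init()` costs no memset at all, so the "memset *again*" pass disappears; only pages that have actually been used and released get cleared a second time.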
On Tue, Feb 1, 2022 at 6:27 AM Nicholas Piggin <npiggin@gmail.com> wrote:

> Not sure of the details on that. Is it memcpy()ing out of ROM or RAM to
> RAM? Is this in the arch boot code? (I don't know very well).

RAM to RAM. arch/powerpc/boot/main.c:

    if (uncompressed_image) {
        memcpy(addr, vmlinuz_addr + ei.elfoffset, ei.loadsize);
        printf("0x%lx bytes of uncompressed data copied\n\r", ei.loadsize);
        goto out;
    }

in some systems those would be two different types of RAM (one would be on-board SRAM, the target would be DRAM which had previously been initialised by the previous chain-boot loader, e.g. u-boot).

[in other circumstances, the source location might be addressable SPI NOR flash, which would be slower, expensive, and therefore compression is plain common sense, in which case it's out of scope for this discussion.]

in the case of the simulation - and also in the case of the WinCE Smartphone hand-held reverse-engineering using GNUHARET.EXE (similar to LOADLIN.EXE if anyone remembers that) - the source and destination of the uncompressed initramfs copy are both in the same RAM, so the memcpy is completely redundant.

the only good reason for the memcpy would be to ensure that the start location is at a known-fixed offset, and of course that can be arranged in advance by the simulator. even if it has to be at 0x0000_0000_0000_0000 that can be arranged by moving the cold-boot loader to an alternative hard-reset start address and telling the simulated core to start from there.

> >
> > other areas are the memset before VM is set up, followed by memset *again* on individual pages once created. those are an hour each
>
> Seems like we should avoid the duplication and maybe be able to
> add an option to skip zeroing (I thought there was one, maybe thinking
> of something else).

it makes sense for security reasons (on real hardware) - a simulation not so much, it's guaranteed to be all-zeros at startup.

> Are you using optimize for size? That can result in much slower code in
> some places. In skiboot we compile some of the string.h library code
> with -O2 for example.

interesting - no, these are the default options. have to be careful not to introduce any VSX instructions (the core doesn't have them).

    CROSS_COMPILE="ccache powerpc64le-linux-gnu-" \
    ARCH=powerpc \
    make -j16 O=microwatt

l.
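The redundancy lkcl describes comes down to one check: if the loader has already placed the uncompressed image at the address it will run from, there is nothing to copy. A minimal sketch, with a hypothetical `place_image()` helper standing in for the memcpy in the uncompressed path of arch/powerpc/boot/main.c. It uses memmove rather than memcpy so that overlapping source and destination ranges would also be handled; that is a choice made for this sketch, not something taken from the kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Place an uncompressed image at its run address, skipping the copy when
 * the loader (e.g. the simulator) already put it exactly there.
 * Returns the number of bytes actually moved. */
static size_t place_image(void *run_addr, const void *load_addr, size_t len)
{
	assert(run_addr != NULL && load_addr != NULL);
	if (run_addr == load_addr)
		return 0;                      /* in place: no copy, just jump */
	memmove(run_addr, load_addr, len); /* memmove: ranges may overlap */
	return len;
}
```

For the 0x5b0e10-byte (about 5.7 MiB) image in the console log, the in-place case skips the entire copy, which is the step lkcl estimates costs about an hour of simulation time.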
Nicholas Piggin <npiggin@gmail.com> writes:
> Excerpts from lkcl's message of January 31, 2022 2:19 pm:
> [...]
>> the next interesting thing which would save another hour when emulating HDL at this astoundingly-slow speed of sub-1000 instructions per second would be in-place execution: no memcpy, just jump.
>>
>> i seem to recall this (inplace execution) being a standard option back in 2003 when i was doing xda-developers wince smartphone reverse-engineering, although with it being 19 years ago i could be wrong
>
> Not sure of the details on that. Is it memcpy()ing out of ROM or RAM to
> RAM? Is this in the arch boot code? (I don't know very well).

If you build with CONFIG_RELOCATABLE=y and CONFIG_RELOCATABLE_TEST=y the
kernel will run wherever you load it (must be 64K aligned), without
copying itself down to zero first. That will save you a few cycles.

cheers
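The 64K alignment requirement Michael mentions is easy to check up front, for example in a simulator's loader before it places the image. A minimal sketch with hypothetical names; 0x600000, the address the announcement used for dtbImage.microwatt, is used in the checks purely as an example value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Load-address alignment required for the relocatable kernel, per the thread. */
#define RELOC_ALIGN 0x10000ULL /* 64 KiB */

static bool is_reloc_aligned(uint64_t addr)
{
	return (addr & (RELOC_ALIGN - 1)) == 0;
}

/* Round a proposed load address up to the next 64 KiB boundary. */
static uint64_t reloc_align_up(uint64_t addr)
{
	assert(addr <= UINT64_MAX - (RELOC_ALIGN - 1)); /* avoid wrap-around */
	return (addr + RELOC_ALIGN - 1) & ~(RELOC_ALIGN - 1);
}
```

Because 64 KiB is a power of two, masking off the low 16 bits is enough for both the check and the round-up; 0x600000 is already aligned, while 0x600001 would round up to 0x610000.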
On Tue, Feb 1, 2022 at 11:53 AM Michael Ellerman <mpe@ellerman.id.au> wrote:

> If you build with CONFIG_RELOCATABLE=y and CONFIG_RELOCATABLE_TEST=y the
> kernel will run wherever you load it (must be 64K aligned), without
> copying itself down to zero first. That will save you a few cycles.

ahh, thank you :)

l.
On 03/01/2022 at 01:45, Luke Kenneth Casson Leighton wrote:
> [...]
>
> secondly, the microwatt Makefile assumes that verilator clock rate
> runs at 50 mhz, where the microwatt.dts file says 100 mhz for both
> the UART clock as well as the system clock. it would be really nice
> to have microwatt-linux read the SYSCON parameter for the
> clock rate, and for that to be dynamically inserted into the dtb.
> however in the interim, the attached patch suffices by manually
> altering the clock in microwatt.dts to match that of the SYSCON
> parameter.

I'm not sure whether you expect this attached patch to be merged in
mainline. If so, could you re-submit it as a proper patch?

Thanks
Christophe

    diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
    index d5be1a85f40b..2d332f025bb0 100644
    --- a/arch/powerpc/Kconfig
    +++ b/arch/powerpc/Kconfig
    @@ -202,6 +202,7 @@ config PPC
     	select HAVE_IDE
     	select HAVE_IOREMAP_PROT
     	select HAVE_IRQ_EXIT_ON_IRQ_STACK
    +	select HAVE_KERNEL_UNCOMPRESSED
     	select HAVE_KERNEL_GZIP
     	select HAVE_KERNEL_LZMA if DEFAULT_UIMAGE
     	select HAVE_KERNEL_LZO if DEFAULT_UIMAGE
    diff --git a/arch/powerpc/boot/Makefile b/arch/powerpc/boot/Makefile
    index f3417bfc5ec4..5b33fd3aac47 100644
    --- a/arch/powerpc/boot/Makefile
    +++ b/arch/powerpc/boot/Makefile
    @@ -249,6 +249,7 @@ CROSSWRAP := -C "$(CROSS_COMPILE)"
     endif
     endif
     
    +compressor-y := none
     compressor-$(CONFIG_KERNEL_GZIP) := gz
     compressor-$(CONFIG_KERNEL_XZ) := xz
     compressor-$(CONFIG_KERNEL_LZMA) := lzma
    diff --git a/arch/powerpc/boot/dts/microwatt.dts b/arch/powerpc/boot/dts/microwatt.dts
    index b63c9d9ec202..24972eba74bb 100644
    --- a/arch/powerpc/boot/dts/microwatt.dts
    +++ b/arch/powerpc/boot/dts/microwatt.dts
    @@ -65,8 +65,8 @@ PowerPC,Microwatt@0 {
     			i-cache-sets = <2>;
     			ibm,dec-bits = <64>;
     			reservation-granule-size = <64>;
    -			clock-frequency = <100000000>;
    -			timebase-frequency = <100000000>;
    +			clock-frequency = <50000000>;
    +			timebase-frequency = <50000000>;
     			i-tlb-sets = <1>;
     			ibm,ppc-interrupt-server#s = <0>;
     			i-cache-block-size = <64>;
    @@ -120,7 +120,7 @@ UART0: serial@2000 {
     			device_type = "serial";
     			compatible = "ns16550";
     			reg = <0x2000 0x8>;
    -			clock-frequency = <100000000>;
    +			clock-frequency = <50000000>;
     			current-speed = <115200>;
     			reg-shift = <2>;
     			fifo-size = <16>;
    diff --git a/arch/powerpc/boot/main.c b/arch/powerpc/boot/main.c
    index a9d209135975..1daf58213f13 100644
    --- a/arch/powerpc/boot/main.c
    +++ b/arch/powerpc/boot/main.c
    @@ -30,8 +30,12 @@ static struct addr_range prep_kernel(void)
     	long len;
     	int uncompressed_image = 0;
     
    +#ifndef CONFIG_KERNEL_UNCOMPRESSED
     	len = partial_decompress(vmlinuz_addr, vmlinuz_size,
     		elfheader, sizeof(elfheader), 0);
    +#else
    +	len = -1;
    +#endif
     	/* assume uncompressed data if -1 is returned */
     	if (len == -1) {
     		uncompressed_image = 1;