Message ID | 20230206194819.1679472-1-evan@rivosinc.com |
---|---|
Series | RISC-V: ifunced memcpy using new kernel hwprobe interface |
On 2/6/23 09:48, Evan Green wrote:
>
> This series illustrates the use of a proposed Linux syscall that
> enumerates architectural information about the RISC-V cores the system
> is running on. In this series we expose a small wrapper function around
> the syscall. An ifunc selector for memcpy queries it to see if unaligned
> access is "fast" on this hardware. If it is, it selects a newly provided
> implementation of memcpy that doesn't work hard at aligning the src and
> destination buffers.
>
> This is somewhat of a proof of concept for the syscall itself, but I do
> find that in my goofy memcpy test [1], the unaligned memcpy performed at
> least as well as the generic C version. This is however on Qemu on an M1
> mac, so not a test of any real hardware (more a smoke test that the
> implementation isn't silly).
>
> v1 of the Linux series can be found at [2]. I'm about to post v2 (but
> haven't yet!), I can reply here with the link once v2 is posted.
>
> [1] https://pastebin.com/Nj8ixpkX
> [2] https://yhbt.net/lore/all/20221013163551.6775-1-palmer@rivosinc.com/

Re the syscall:

I question whether the heterogenous cpu case is something that you really want to query. In order to handle migration between such cpus, any such query must return the minimum level of support.

Remove that possibility, and this becomes a simple array reference. Now you need to decide whether a vdso call, or HWCAP2 as pointer to read-only data is more or less efficient or extensible.

r~
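To illustrate the "minimum level of support" point, here is a minimal sketch of how a single answer could be derived for a heterogeneous system when the features are expressed as bit flags: AND the per-CPU masks together so that only features present on every core survive. The feature names, the bit layout, and the function `common_features` are hypothetical and are not taken from the proposed syscall or from glibc.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical feature bits, for illustration only. */
#define FEAT_FAST_UNALIGNED (UINT64_C(1) << 0)
#define FEAT_VECTOR         (UINT64_C(1) << 1)

/* Return the features supported by every CPU in the set.
   A task may migrate to any core, so only the intersection
   of the per-CPU masks is safe to rely on.  */
static uint64_t
common_features (const uint64_t *per_cpu, size_t ncpus)
{
  if (ncpus == 0)
    return 0;

  uint64_t common = per_cpu[0];
  for (size_t i = 1; i < ncpus; i++)
    common &= per_cpu[i];
  return common;
}
```

Once the result no longer depends on which CPU is asked, it is a fixed set of values for the lifetime of the process, which is what makes the "simple array reference" (or read-only data) approach possible.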
On 06/02/23 18:28, Richard Henderson via Libc-alpha wrote:
> On 2/6/23 09:48, Evan Green wrote:
>>
>> This series illustrates the use of a proposed Linux syscall that
>> enumerates architectural information about the RISC-V cores the system
>> is running on. In this series we expose a small wrapper function around
>> the syscall. An ifunc selector for memcpy queries it to see if unaligned
>> access is "fast" on this hardware. If it is, it selects a newly provided
>> implementation of memcpy that doesn't work hard at aligning the src and
>> destination buffers.
>>
>> This is somewhat of a proof of concept for the syscall itself, but I do
>> find that in my goofy memcpy test [1], the unaligned memcpy performed at
>> least as well as the generic C version. This is however on Qemu on an M1
>> mac, so not a test of any real hardware (more a smoke test that the
>> implementation isn't silly).
>>
>> v1 of the Linux series can be found at [2]. I'm about to post v2 (but
>> haven't yet!), I can reply here with the link once v2 is posted.
>>
>> [1] https://pastebin.com/Nj8ixpkX
>> [2] https://yhbt.net/lore/all/20221013163551.6775-1-palmer@rivosinc.com/
>
> Re the syscall:
>
> I question whether the heterogenous cpu case is something that you really want to query. In order to handle migration between such cpus, any such query must return the minimum level of support.
>
> Remove that possibility, and this becomes a simple array reference. Now you need to decide whether a vdso call, or HWCAP2 as pointer to read-only data is more or less efficient or extensible.

It should at least work if the kernel traps/emulates unaligned accesses or any instruction not supported by the other code, although it would be really subpar. I would expect that the kernel would report the minimum ISA as well.

I would also recommend caching the values, as we do for aarch64/x86/powerpc, to avoid issuing multiple syscalls on symbol resolution (check cpu-features.c).
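The caching suggestion can be sketched roughly as follows: probe once, store the result, and let every ifunc selector read the stored copy. This is only an illustration under stated assumptions. `probe_kernel` is a hypothetical stand-in for the real kernel query, the `cpu_features` layout is invented, and the lazy initialization shown here is not thread-safe (an implementation that fills the cache once during early startup, before any resolver runs, would not need the check at all).

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical feature bit, for illustration only. */
#define FEAT_FAST_UNALIGNED (UINT64_C(1) << 0)

/* Invented layout; not glibc's actual cpu_features structure. */
struct cpu_features
{
  bool initialized;
  uint64_t flags;
};

static struct cpu_features cached;
static int probe_calls;   /* Counts how often the "kernel" is asked.  */

/* Stand-in for the expensive query (a syscall in the real series).  */
static uint64_t
probe_kernel (void)
{
  probe_calls++;
  return FEAT_FAST_UNALIGNED;
}

/* Return the cached features, probing only on first use.  */
static const struct cpu_features *
get_cpu_features (void)
{
  if (!cached.initialized)
    {
      cached.flags = probe_kernel ();
      cached.initialized = true;
    }
  return &cached;
}

/* What an ifunc selector for memcpy would consult.  */
static bool
has_fast_unaligned (void)
{
  return (get_cpu_features ()->flags & FEAT_FAST_UNALIGNED) != 0;
}
```

With this shape, resolving memcpy, memmove, and any other ifunced symbol costs one probe in total rather than one per symbol.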