Message ID: 20230713132540.2854320-2-adhemerval.zanella@linaro.org
State: New
Series: Use introsort for qsort
On 2023-07-13 06:25, Adhemerval Zanella via Libc-alpha wrote:

> +is_aligned (const void *base, size_t size, unsigned char align)

ALIGN should be of type size_t, not unsigned char. Although this shouldn't
change the machine code, it's better documentation and more robust to future
changes.

> +  if (is_aligned (pbase, size, 8))

Change "8" to "alignof (uint64_t)".

> +    swap_type = SWAP_WORDS_64;
> +  else if (is_aligned (pbase, size, 4))

Likewise, change "4" to "alignof (uint32_t)".

Or is the problem that alignof didn't work for you? If so, perhaps
__alignof__. Still, alignof is used elsewhere in glibc and is in C23, so
it's better to use here if it works.
On 2023-07-13 15:17, Paul Eggert wrote:

>> +  if (is_aligned (pbase, size, 8))
>
> Change "8" to "alignof (uint64_t)".

Oops, scratch that. The size must be a multiple of sizeof (uint64_t) and the
alignment must be a multiple of alignof (uint64_t). Since sizeof (uint64_t)
must be a multiple of alignof (uint64_t), what you wrote was safe (but
confusing), whereas what I wrote was not.

To help avoid confusion by future readers, I suggest rewriting the comments
and parameters for is_aligned to be something like the following. This
shouldn't change the machine code.

  /* If this function returns true, elements can be safely copied using
     word loads and stores.  Otherwise, it might not be safe.
     BASE (as an integer) must be a multiple of the word alignment.
     TOTAL_SIZE must be a multiple of WORDSIZE.
     Since WORDSIZE must be a multiple of the word alignment,
     and WORDSIZE is a power of two on all supported platforms,
     this function for speed merely checks that BASE and SIZE
     are both multiples of the word size.  */
  __attribute_const__ __always_inline static bool
  is_aligned (const void *base, size_t size, size_t wordsize)
  {
    return (((uintptr_t) base | size) & (wordsize - 1)) == 0;
  }
On 13/07/23 22:01, Paul Eggert wrote:

> On 2023-07-13 15:17, Paul Eggert wrote:
>
>>> +  if (is_aligned (pbase, size, 8))
>>
>> Change "8" to "alignof (uint64_t)".
>
> Oops, scratch that. The size must be a multiple of sizeof (uint64_t) and
> the alignment must be a multiple of alignof (uint64_t). Since sizeof
> (uint64_t) must be a multiple of alignof (uint64_t), what you wrote was
> safe (but confusing), whereas what I wrote was not.
>
> To help avoid confusion by future readers, I suggest rewriting the
> comments and parameters for is_aligned to be something like the
> following. This shouldn't change the machine code.
>
>   /* If this function returns true, elements can be safely copied using
>      word loads and stores.  Otherwise, it might not be safe.
>      BASE (as an integer) must be a multiple of the word alignment.
>      TOTAL_SIZE must be a multiple of WORDSIZE.

I think you meant SIZE here instead of TOTAL_SIZE.

>      Since WORDSIZE must be a multiple of the word alignment,
>      and WORDSIZE is a power of two on all supported platforms,
>      this function for speed merely checks that BASE and SIZE
>      are both multiples of the word size.  */
>   __attribute_const__ __always_inline static bool
>   is_aligned (const void *base, size_t size, size_t wordsize)
>   {
>     return (((uintptr_t) base | size) & (wordsize - 1)) == 0;
>   }

Alright, I changed it to this version.
On Thu, 13 Jul 2023, Adhemerval Zanella via Libc-alpha wrote:

> +/* Returns true if elements can be copied using word loads and stores.
> +   The SIZE and BASE must be a multiple of the ALIGN.  */
> +__attribute_const__ __always_inline static bool

Kernel folks sure love their double underscores, but plain
'static inline bool' should work just as well here.

I'd recommend properly crediting Linux lib/sort.c, since obviously this
commit originated as a direct copy of the code therein, not just being
inspired by what's there.

> +is_aligned (const void *base, size_t size, unsigned char align)
> +{
> +  return (((uintptr_t) base | size) & (align - 1)) == 0;
> +}
> +
> +static void
> +swap_words_64 (void * restrict a, void * restrict b, size_t n)
> +{
> +  do
> +    {
> +      n -= 8;
> +      uint64_t t = *(uint64_t *)(a + n);
> +      *(uint64_t *)(a + n) = *(uint64_t *)(b + n);
> +      *(uint64_t *)(b + n) = t;
> +    } while (n);
> +}

The undefined behavior resulting from an attempt to access something of a
different type via a 'uint64_t *' is known to cause miscompilation in
practice:
https://www.spec.org/cpu2017/Docs/benchmarks/605.mcf_s.html#portability

I see that the new string routines do it properly via

  typedef unsigned long int __attribute__ ((__may_alias__)) op_t;

It would be nice to avoid new instances of such UB here.

Alexander
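Concretely, the swap helper could be rewritten with a may_alias word type in
the same style as the string routines quoted above. A minimal sketch follows;
the typedef name u64_alias is illustrative, not an existing glibc identifier,
and the arithmetic on void * relies on the GNU extension already used in the
patch:

  #include <stddef.h>
  #include <stdint.h>

  /* Sketch only: a 64-bit word type that may alias any other object type,
     so loads and stores through it do not violate strict aliasing.  */
  typedef uint64_t __attribute__ ((__may_alias__)) u64_alias;

  static void
  swap_words_64 (void * restrict a, void * restrict b, size_t n)
  {
    do
      {
        n -= 8;
        u64_alias t = *(u64_alias *) (a + n);
        *(u64_alias *) (a + n) = *(u64_alias *) (b + n);
        *(u64_alias *) (b + n) = t;
      } while (n);
  }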
On 2023-07-17 09:57, Alexander Monakov wrote:

> I'd recommend properly crediting Linux lib/sort.c, since obviously this
> commit originated as a direct copy of the code therein, not just being
> inspired by what's there.

What are the licensing issues here? The Linux kernel is GPLv2 only, whereas
glibc/stdlib/qsort.c is LGPLv2.1+, so we can't simply take nontrivial source
code from the Linux kernel.
On 17/07/23 14:29, Paul Eggert wrote:

> On 2023-07-17 09:57, Alexander Monakov wrote:
>> I'd recommend properly crediting Linux lib/sort.c, since obviously this
>> commit originated as a direct copy of the code therein, not just being
>> inspired by what's there.
>
> What are the licensing issues here? The Linux kernel is GPLv2 only,
> whereas glibc/stdlib/qsort.c is LGPLv2.1+, so we can't simply take
> nontrivial source code from the Linux kernel.

I am not sure, in fact; my take was that the code was simple enough not to
raise this concern. But if it were the case, I can reimplement it
differently to avoid this issue.
On 2023-07-17 11:07, Adhemerval Zanella Netto wrote:

> I am not sure, in fact; my take was that the code was simple enough not
> to raise this concern.

The usual rule of thumb is that ten lines are trivial, and that more than
that might be a concern. When it's a concern, you can't simply read the
original code and rewrite it in a different way; you need to actually
rewrite it from scratch, preferably without looking at the original code.
It's a pain, admittedly.
On 17/07/23 15:58, Paul Eggert wrote:

> On 2023-07-17 11:07, Adhemerval Zanella Netto wrote:
>> I am not sure, in fact; my take was that the code was simple enough not
>> to raise this concern.
>
> The usual rule of thumb is that ten lines are trivial, and that more
> than that might be a concern. When it's a concern, you can't simply read
> the original code and rewrite it in a different way; you need to
> actually rewrite it from scratch, preferably without looking at the
> original code. It's a pain, admittedly.

Alright, I don't think it should be much trouble to reimplement a simple
heapsort.
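For reference, a from-scratch textbook heapsort (bottom-up heapify plus a
sift-down loop) might look like the sketch below. It is written against the
patch's context, reusing do_swap with SWAP_BYTES and glibc's
__compar_d_fn_t; the names heapsort_r and siftdown are illustrative, not the
committed implementation:

  /* Restore the max-heap property for the subtree rooted at index K.
     N is the index of the last element in the heap.  */
  static void
  siftdown (char *base, size_t size, size_t k, size_t n,
            __compar_d_fn_t cmp, void *arg)
  {
    while (2 * k + 1 <= n)
      {
        size_t j = 2 * k + 1;  /* Left child.  */
        if (j < n && cmp (base + j * size, base + (j + 1) * size, arg) < 0)
          j++;                 /* Right child is larger.  */
        if (cmp (base + k * size, base + j * size, arg) >= 0)
          break;               /* Heap property already holds.  */
        do_swap (base + k * size, base + j * size, size, SWAP_BYTES);
        k = j;
      }
  }

  static void
  heapsort_r (void *pbase, size_t total_elems, size_t size,
              __compar_d_fn_t cmp, void *arg)
  {
    char *base = pbase;
    if (total_elems < 2)
      return;
    size_t n = total_elems - 1;
    /* Build a max-heap bottom-up.  */
    for (size_t k = n / 2 + 1; k-- > 0; )
      siftdown (base, size, k, n, cmp, arg);
    /* Repeatedly move the maximum to the end and shrink the heap.  */
    while (n > 0)
      {
        do_swap (base, base + n * size, size, SWAP_BYTES);
        n--;
        siftdown (base, size, 0, n, cmp, arg);
      }
  }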
On 17/07/23 13:57, Alexander Monakov wrote:

> On Thu, 13 Jul 2023, Adhemerval Zanella via Libc-alpha wrote:
>
>> +/* Returns true if elements can be copied using word loads and stores.
>> +   The SIZE and BASE must be a multiple of the ALIGN.  */
>> +__attribute_const__ __always_inline static bool
>
> Kernel folks sure love their double underscores, but plain
> 'static inline bool' should work just as well here.
>
> I'd recommend properly crediting Linux lib/sort.c, since obviously this
> commit originated as a direct copy of the code therein, not just being
> inspired by what's there.

In fact our msort_with_tmp uses the exact same idea, but it is somewhat
convoluted because it adds a specific optimization for 'unsigned long' and
for large values it uses an in-place mempcpy.

>> +is_aligned (const void *base, size_t size, unsigned char align)
>> +{
>> +  return (((uintptr_t) base | size) & (align - 1)) == 0;
>> +}
>> +
>> +static void
>> +swap_words_64 (void * restrict a, void * restrict b, size_t n)
>> +{
>> +  do
>> +    {
>> +      n -= 8;
>> +      uint64_t t = *(uint64_t *)(a + n);
>> +      *(uint64_t *)(a + n) = *(uint64_t *)(b + n);
>> +      *(uint64_t *)(b + n) = t;
>> +    } while (n);
>> +}
>
> The undefined behavior resulting from an attempt to access something of
> a different type via a 'uint64_t *' is known to cause miscompilation in
> practice:
> https://www.spec.org/cpu2017/Docs/benchmarks/605.mcf_s.html#portability
>
> I see that the new string routines do it properly via
>
>   typedef unsigned long int __attribute__ ((__may_alias__)) op_t;
>
> It would be nice to avoid new instances of such UB here.

Ack.
On 18/07/23 11:01, Adhemerval Zanella Netto wrote:

> On 17/07/23 13:57, Alexander Monakov wrote:
>> I'd recommend properly crediting Linux lib/sort.c, since obviously this
>> commit originated as a direct copy of the code therein, not just being
>> inspired by what's there.
>
> In fact our msort_with_tmp uses the exact same idea, but it is somewhat
> convoluted because it adds a specific optimization for 'unsigned long'
> and for large values it uses an in-place mempcpy.

And just to be clear about the license: the optimization is really a common
technique already used in our mergesort implementation, and the Linux
implementation is slightly different because it takes into consideration
whether the architecture supports fast unaligned accesses (which glibc does
not consider anymore since a9b3b770f596c); for 64-bit it also handles 64-bit
and 32-bit accesses differently (which I don't think we should bother with;
the compiler will know better how to optimize it); and it does not have the
large-buffer optimization. So I don't think this falls under the idea that
this code was copied and pasted.

For the heapsort, I really did use the Linux implementation, which I have
replaced with a textbook one.
diff --git a/stdlib/qsort.c b/stdlib/qsort.c
index 728a0ed370..bf3326e4b9 100644
--- a/stdlib/qsort.c
+++ b/stdlib/qsort.c
@@ -23,20 +23,85 @@
 #include <limits.h>
 #include <stdlib.h>
 #include <string.h>
+#include <stdbool.h>
 
-/* Byte-wise swap two items of size SIZE. */
-#define SWAP(a, b, size)              \
-  do                                  \
-    {                                 \
-      size_t __size = (size);         \
-      char *__a = (a), *__b = (b);    \
-      do                              \
-        {                             \
-          char __tmp = *__a;          \
-          *__a++ = *__b;              \
-          *__b++ = __tmp;             \
-        } while (--__size > 0);       \
-    } while (0)
+/* Swap SIZE bytes between addresses A and B.  These helpers are provided
+   alongside the generic one as an optimization.  */
+
+enum swap_type_t
+  {
+    SWAP_WORDS_64,
+    SWAP_WORDS_32,
+    SWAP_BYTES
+  };
+
+/* Returns true if elements can be copied using word loads and stores.
+   The SIZE and BASE must be a multiple of the ALIGN.  */
+__attribute_const__ __always_inline static bool
+is_aligned (const void *base, size_t size, unsigned char align)
+{
+  return (((uintptr_t) base | size) & (align - 1)) == 0;
+}
+
+static void
+swap_words_64 (void * restrict a, void * restrict b, size_t n)
+{
+  do
+    {
+      n -= 8;
+      uint64_t t = *(uint64_t *)(a + n);
+      *(uint64_t *)(a + n) = *(uint64_t *)(b + n);
+      *(uint64_t *)(b + n) = t;
+    } while (n);
+}
+
+static void
+swap_words_32 (void * restrict a, void * restrict b, size_t n)
+{
+  do
+    {
+      n -= 4;
+      uint32_t t = *(uint32_t *)(a + n);
+      *(uint32_t *)(a + n) = *(uint32_t *)(b + n);
+      *(uint32_t *)(b + n) = t;
+    } while (n);
+}
+
+static void
+swap_bytes (void * restrict a, void * restrict b, size_t n)
+{
+  /* Use multiple small memcpys with constant size to enable inlining
+     on most targets.  */
+  enum { SWAP_GENERIC_SIZE = 32 };
+  unsigned char tmp[SWAP_GENERIC_SIZE];
+  while (n > SWAP_GENERIC_SIZE)
+    {
+      memcpy (tmp, a, SWAP_GENERIC_SIZE);
+      a = __mempcpy (a, b, SWAP_GENERIC_SIZE);
+      b = __mempcpy (b, tmp, SWAP_GENERIC_SIZE);
+      n -= SWAP_GENERIC_SIZE;
+    }
+  while (n > 0)
+    {
+      unsigned char t = ((unsigned char *)a)[--n];
+      ((unsigned char *)a)[n] = ((unsigned char *)b)[n];
+      ((unsigned char *)b)[n] = t;
+    }
+}
+
+/* Replace the indirect call with a series of if statements.  It should
+   help the branch predictor.  */
+static void
+do_swap (void * restrict a, void * restrict b, size_t size,
+         enum swap_type_t swap_func)
+{
+  if (swap_func == SWAP_WORDS_64)
+    swap_words_64 (a, b, size);
+  else if (swap_func == SWAP_WORDS_32)
+    swap_words_32 (a, b, size);
+  else
+    swap_bytes (a, b, size);
+}
 
 /* Discontinue quicksort algorithm when partition gets below this size.
    This particular magic number was chosen to work best on a Sun 4/260. */
@@ -96,6 +161,14 @@ _quicksort (void *const pbase, size_t total_elems, size_t size,
     /* Avoid lossage with unsigned arithmetic below.  */
     return;
 
+  enum swap_type_t swap_type;
+  if (is_aligned (pbase, size, 8))
+    swap_type = SWAP_WORDS_64;
+  else if (is_aligned (pbase, size, 4))
+    swap_type = SWAP_WORDS_32;
+  else
+    swap_type = SWAP_BYTES;
+
   if (total_elems > MAX_THRESH)
     {
       char *lo = base_ptr;
@@ -119,13 +192,13 @@ _quicksort (void *const pbase, size_t total_elems, size_t size,
           char *mid = lo + size * ((hi - lo) / size >> 1);
 
           if ((*cmp) ((void *) mid, (void *) lo, arg) < 0)
-            SWAP (mid, lo, size);
+            do_swap (mid, lo, size, swap_type);
           if ((*cmp) ((void *) hi, (void *) mid, arg) < 0)
-            SWAP (mid, hi, size);
+            do_swap (mid, hi, size, swap_type);
           else
             goto jump_over;
           if ((*cmp) ((void *) mid, (void *) lo, arg) < 0)
-            SWAP (mid, lo, size);
+            do_swap (mid, lo, size, swap_type);
         jump_over:;
 
           left_ptr  = lo + size;
@@ -144,7 +217,7 @@ _quicksort (void *const pbase, size_t total_elems, size_t size,
 
               if (left_ptr < right_ptr)
                 {
-                  SWAP (left_ptr, right_ptr, size);
+                  do_swap (left_ptr, right_ptr, size, swap_type);
                   if (mid == left_ptr)
                     mid = right_ptr;
                   else if (mid == right_ptr)
@@ -216,7 +289,7 @@ _quicksort (void *const pbase, size_t total_elems, size_t size,
 
         tmp_ptr = run_ptr;
         if (tmp_ptr != base_ptr)
-          SWAP (tmp_ptr, base_ptr, size);
+          do_swap (tmp_ptr, base_ptr, size, swap_type);
 
         /* Insertion sort, running from left-hand-side
            up to right-hand-side.  */
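To see which helper the dispatch above selects in practice, here is a small
standalone sketch (a hypothetical test, not part of the patch) that mirrors
is_aligned and prints the chosen swap type for a few element sizes:

  #include <stdint.h>
  #include <stdio.h>
  #include <stdlib.h>

  /* Mirrors the patch's check: BASE and SIZE must both be multiples of
     ALIGN, which is a power of two.  */
  static int
  is_aligned (const void *base, size_t size, size_t align)
  {
    return (((uintptr_t) base | size) & (align - 1)) == 0;
  }

  static const char *
  swap_type_for (const void *base, size_t size)
  {
    if (is_aligned (base, size, 8))
      return "SWAP_WORDS_64";
    else if (is_aligned (base, size, 4))
      return "SWAP_WORDS_32";
    else
      return "SWAP_BYTES";
  }

  int
  main (void)
  {
    /* malloc results are at least 8-byte aligned on common platforms.  */
    void *p = malloc (64);
    printf ("size  8: %s\n", swap_type_for (p, 8));   /* SWAP_WORDS_64 */
    printf ("size 12: %s\n", swap_type_for (p, 12));  /* SWAP_WORDS_32 */
    printf ("size  3: %s\n", swap_type_for (p, 3));   /* SWAP_BYTES */
    free (p);
    return 0;
  }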