Message ID | 20160829202145.A2548403DC4A9@oldenburg.str.redhat.com |
---|---|
State | New |
Headers | show |
Hello Florian, On Tue, Aug 30, 2016 at 8:21 AM, Florian Weimer <fweimer@redhat.com> wrote: > 2016-08-29 Florian Weimer <fweimer@redhat.com> > > [BZ #20524] > * manual/string.texi (String/Array Comparison): Clarify the > strverscmp behavior. > > diff --git a/manual/string.texi b/manual/string.texi > index bce81a7..3577017 100644 > --- a/manual/string.texi > +++ b/manual/string.texi > @@ -1374,46 +1374,70 @@ The @code{strverscmp} function compares the string @var{s1} against > @var{s2}, considering them as holding indices/version numbers. The > return value follows the same conventions as found in the > @code{strcmp} function. In fact, if @var{s1} and @var{s2} contain no > -digits, @code{strverscmp} behaves like @code{strcmp}. > +digits, @code{strverscmp} behaves like @code{strcmp} > +(in the sense that the sign of the result is the same). > > -Basically, we compare strings normally (byte by byte), until > -we find a digit in each string - then we enter a special comparison > -mode, where each sequence of digits is taken as a whole. If we reach the > -end of these two parts without noticing a difference, we return to the > -standard comparison mode. There are two types of numeric parts: > -"integral" and "fractional" (those begin with a '0'). The types > -of the numeric parts affect the way we sort them: > +The comparison algorithm which the @code{strverscmp} function implements > +differs slightly from other version comparison algorithms. The s/version comparison/version-comparison/ > +implementation is based on a finite state machine, whose behavior is s/finite state machine/finite-state machine/ > +approximated below. > > @itemize @bullet > @item > -integral/integral: we compare values as you would expect. > +The input strings are each split into sequences of non-digits and > +digits. These sequences can be empty at the beginning and end of the > +string. Digits are determined by the @code{isdigit} function and are > +thus subject to the current locale. > > @item > -fractional/integral: the fractional part is less than the integral one. > -Again, no surprise. > +Comparison starts with a (possibly empty) non-digit sequence. The first > +non-equal sequences of non-digits or digits determines the outcome of > +the comparison. > > @item > -fractional/fractional: the things become a bit more complex. > -If the common prefix contains only leading zeroes, the longest part is less > -than the other one; else the comparison behaves normally. > +Corresponding non-digit sequences in both strings are compared > +lexicographically. If their lengths differ, the shorter non-digit Should this be "If their lengths differ, and the shorter string is equal to the corresponding prefix in the longer string..."? > +sequence is amended with the input string character immediately s/amended/extended/ ? > +following it (which can be the null terminator). This character is > +compared against the non-digit character in the other sequence, at the > +same string offset, and their difference determines the outcome of the > +comparison. > + > +@item > +For two digit sequences, the number of leading zeros is counted (which > +can be zero). If the count differs, the string with more leading zeros > +in the digit sequence is considered smaller than the other string. > + > +@item > +If the two digit sequences have no leading zeros, they are compared as > +integers, that is, the string with the longer digit sequence is deemed > +larger, and if both sequences are of equal length, they are compared > +lexicographically. > + > +@item > +If both digit sequences have an equal, positive number of leading zeros, Why is the word "positive" here? > +they are compared lexicographically. If their length differs, another > +character is added to to the shorter sequence, "another character is added to to the shorter sequence" is vague. You want wording like you used above. > to determine the outcome > +of the comparison, as described above for the non-digit sequence case. > @end itemize > > +The treatment of leading zeros and the tie-breaking characters (which in > +effect propagate across non-digit/digit sequence boundaries) differ from s/differ/differs/ > +other version comparison algorithms. s/version comparison/version-comparison/ > + > @smallexample > strverscmp ("no digit", "no digit") > @result{} 0 /* @r{same behavior as strcmp.} */ > strverscmp ("item#99", "item#100") > @result{} <0 /* @r{same prefix, but 99 < 100.} */ > strverscmp ("alpha1", "alpha001") > - @result{} >0 /* @r{fractional part inferior to integral one.} */ > + @result{} >0 /* @r{different number of leading zeros (0 and 2).} */ > strverscmp ("part1_f012", "part1_f01") > - @result{} >0 /* @r{two fractional parts.} */ > + @result{} >0 /* @r{lexicographical comparison with leading zeros.} */ > strverscmp ("foo.009", "foo.0") > - @result{} <0 /* @r{idem, but with leading zeroes only.} */ > + @result{} <0 /* @r{different number of leading zeros (2 and 1).} */ > @end smallexample > > -This function is especially useful when dealing with filename sorting, > -because filenames frequently hold indices/version numbers. > - > @code{strverscmp} is a GNU extension. > @end deftypefun I've not completely checked all of the details, but what you write certainly matches what I did check, and is a great deal better than the existing text. But, the algorithm it describes *is* strange. Cheers, Michael
diff --git a/manual/string.texi b/manual/string.texi index bce81a7..3577017 100644 --- a/manual/string.texi +++ b/manual/string.texi @@ -1374,46 +1374,70 @@ The @code{strverscmp} function compares the string @var{s1} against @var{s2}, considering them as holding indices/version numbers. The return value follows the same conventions as found in the @code{strcmp} function. In fact, if @var{s1} and @var{s2} contain no -digits, @code{strverscmp} behaves like @code{strcmp}. +digits, @code{strverscmp} behaves like @code{strcmp} +(in the sense that the sign of the result is the same). -Basically, we compare strings normally (byte by byte), until -we find a digit in each string - then we enter a special comparison -mode, where each sequence of digits is taken as a whole. If we reach the -end of these two parts without noticing a difference, we return to the -standard comparison mode. There are two types of numeric parts: -"integral" and "fractional" (those begin with a '0'). The types -of the numeric parts affect the way we sort them: +The comparison algorithm which the @code{strverscmp} function implements +differs slightly from other version comparison algorithms. The +implementation is based on a finite state machine, whose behavior is +approximated below. @itemize @bullet @item -integral/integral: we compare values as you would expect. +The input strings are each split into sequences of non-digits and +digits. These sequences can be empty at the beginning and end of the +string. Digits are determined by the @code{isdigit} function and are +thus subject to the current locale. @item -fractional/integral: the fractional part is less than the integral one. -Again, no surprise. +Comparison starts with a (possibly empty) non-digit sequence. The first +non-equal sequences of non-digits or digits determines the outcome of +the comparison. @item -fractional/fractional: the things become a bit more complex. -If the common prefix contains only leading zeroes, the longest part is less -than the other one; else the comparison behaves normally. +Corresponding non-digit sequences in both strings are compared +lexicographically. If their lengths differ, the shorter non-digit +sequence is amended with the input string character immediately +following it (which can be the null terminator). This character is +compared against the non-digit character in the other sequence, at the +same string offset, and their difference determines the outcome of the +comparison. + +@item +For two digit sequences, the number of leading zeros is counted (which +can be zero). If the count differs, the string with more leading zeros +in the digit sequence is considered smaller than the other string. + +@item +If the two digit sequences have no leading zeros, they are compared as +integers, that is, the string with the longer digit sequence is deemed +larger, and if both sequences are of equal length, they are compared +lexicographically. + +@item +If both digit sequences have an equal, positive number of leading zeros, +they are compared lexicographically. If their length differs, another +character is added to to the shorter sequence, to determine the outcome +of the comparison, as described above for the non-digit sequence case. @end itemize +The treatment of leading zeros and the tie-breaking characters (which in +effect propagate across non-digit/digit sequence boundaries) differ from +other version comparison algorithms. + @smallexample strverscmp ("no digit", "no digit") @result{} 0 /* @r{same behavior as strcmp.} */ strverscmp ("item#99", "item#100") @result{} <0 /* @r{same prefix, but 99 < 100.} */ strverscmp ("alpha1", "alpha001") - @result{} >0 /* @r{fractional part inferior to integral one.} */ + @result{} >0 /* @r{different number of leading zeros (0 and 2).} */ strverscmp ("part1_f012", "part1_f01") - @result{} >0 /* @r{two fractional parts.} */ + @result{} >0 /* @r{lexicographical comparison with leading zeros.} */ strverscmp ("foo.009", "foo.0") - @result{} <0 /* @r{idem, but with leading zeroes only.} */ + @result{} <0 /* @r{different number of leading zeros (2 and 1).} */ @end smallexample -This function is especially useful when dealing with filename sorting, -because filenames frequently hold indices/version numbers. - @code{strverscmp} is a GNU extension. @end deftypefun