From patchwork Tue Aug 30 11:17:40 2016 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: Florian Weimer X-Patchwork-Id: 664089 Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@bilbo.ozlabs.org Received: from sourceware.org (server1.sourceware.org [209.132.180.131]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by ozlabs.org (Postfix) with ESMTPS id 3sNmGV0JhCz9s9W for ; Tue, 30 Aug 2016 21:18:09 +1000 (AEST) Authentication-Results: ozlabs.org; dkim=pass (1024-bit key; secure) header.d=sourceware.org header.i=@sourceware.org header.b=KVC8gSjG; dkim-atps=neutral DomainKey-Signature: a=rsa-sha1; c=nofws; d=sourceware.org; h=list-id :list-unsubscribe:list-subscribe:list-archive:list-post :list-help:sender:subject:to:references:cc:from:message-id:date :mime-version:in-reply-to:content-type; q=dns; s=default; b=YOLh IWV293+zbhw8So0yZG3rfnpIc6pvZSGbP7iAPGlO8W+1ug/0TQo3vhek3R7NlGRm 9cjkyQNHfOEY5q0DVG4YbqZy7HnsuiFDNuI3nMzpV0J1vzd5iT77VMolN/Pc0d6R kQ9MNChB5skzclcXwmHHUz3KnGExE37CxLbnY64= DKIM-Signature: v=1; a=rsa-sha1; c=relaxed; d=sourceware.org; h=list-id :list-unsubscribe:list-subscribe:list-archive:list-post :list-help:sender:subject:to:references:cc:from:message-id:date :mime-version:in-reply-to:content-type; s=default; bh=1/dSImHEW7 YS/vsFxAEXSIS1B30=; b=KVC8gSjGQsq7J4xLPPNT7xORChpkl6AI8kDX4+yqXc dt+piCQPputPQmZEdsvi7zbpPKgkn65erV053kW5RoZK7vS2gPuj7DgW7XW936tT dVe4xafNGelSt9LJxSCE/UL2yd4g8O6C1r5FWEZFnJsWF9Fbq/DNcKIi96czRckk I= Received: (qmail 101031 invoked by alias); 30 Aug 2016 11:17:56 -0000 Mailing-List: contact libc-alpha-help@sourceware.org; run by ezmlm Precedence: bulk List-Id: List-Unsubscribe: List-Subscribe: List-Archive: List-Post: List-Help: , Sender: libc-alpha-owner@sourceware.org Delivered-To: mailing list libc-alpha@sourceware.org Received: (qmail 100911 invoked by uid 89); 30 Aug 2016 11:17:55 -0000 Authentication-Results: sourceware.org; auth=none X-Virus-Found: No X-Spam-SWARE-Status: No, score=-1.4 required=5.0 tests=BAYES_00, RP_MATCHES_RCVD, SPF_HELO_PASS, UNSUBSCRIBE_BODY autolearn=no version=3.3.2 spammy=boston, Temple, temple, Boston X-HELO: mx1.redhat.com Subject: Re: [PATCH] manual: Clarify the documentation of strverscmp [BZ #20524] To: Michael Kerrisk References: <20160829202145.A2548403DC4A9@oldenburg.str.redhat.com> Cc: libc-alpha From: Florian Weimer Message-ID: <44b7d954-354c-2407-3079-fd511dd1f510@redhat.com> Date: Tue, 30 Aug 2016 13:17:40 +0200 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.2.0 MIME-Version: 1.0 In-Reply-To: On 08/30/2016 12:06 AM, Michael Kerrisk wrote: > s/version comparison/version-comparison/ > >> +implementation is based on a finite state machine, whose behavior is > > s/finite state machine/finite-state machine/ Thank you for your corrections. >> @item >> -fractional/fractional: the things become a bit more complex. >> -If the common prefix contains only leading zeroes, the longest part is less >> -than the other one; else the comparison behaves normally. >> +Corresponding non-digit sequences in both strings are compared >> +lexicographically. If their lengths differ, the shorter non-digit > > Should this be "If their lengths differ, and the shorter string is > equal to the corresponding prefix in the longer string..."? I don't think it matters. If the shorter sequence is not a prefix of the other sequence, the lexicographic ordering will find a difference before the extension character. What about this? @item Corresponding non-digit sequences in both strings are compared lexicographically if their lengths are equal. If the lengths differ, the shorter non-digit sequence is extended with the input string character immediately following it (which can be the null terminator), the other sequence is truncated to be of the same (extended) length, and these two sequences are compared lexicographically. In this last case, the sequence comparison determines the result of the function because the extension character (or some character before it) is necessarily different from the character at the same offset in the other input string. >> +@item >> +If the two digit sequences have no leading zeros, they are compared as >> +integers, that is, the string with the longer digit sequence is deemed >> +larger, and if both sequences are of equal length, they are compared >> +lexicographically. >> + >> +@item >> +If both digit sequences have an equal, positive number of leading zeros, > > Why is the word "positive" here? It's used as a synonym for “non-zero” (to discriminate this case from the previous one). >> +they are compared lexicographically. If their length differs, another >> +character is added to to the shorter sequence, > > "another character is added to to the shorter sequence" is vague. You > want wording like you used above. Is this better? @item If both digit sequences have an equal, positive number of leading zeros, they are compared lexicographically if their lengths are the same. If the lengths differ, the shorter sequence is extended with the following character in its input string, and the other sequence is truncated to the same length, and both sequences are compared lexicographically (similar to the non-digit sequence case above). I'm attaching a new patch. > I've not completely checked all of the details, but what you write > certainly matches what I did check, and is a great deal better than > the existing text. But, the algorithm it describes *is* strange. The description is based on fuzz-strverscmp-alt.c, which is intended to run under a fuzzer to show that both implementations are equivalent. fuzz-strverscmp.c was a previous attempt at resolving the state machine, but it was still not very clear. fuzz-strverscmp3.c tests that the comparison is indeed a linear order. fuzz-strverscmp3-old2.c demonstrates that this works because with a start file of \003\0031.11.21.3, it quickly finds the ordering violations in the old implementation (from commit 4546646233574f321f9deedff928b980d82f4fc7). Thanks, Florian manual: Clarify the documentation of strverscmp [BZ #20524] 2016-08-30 Florian Weimer [BZ #20524] * manual/string.texi (String/Array Comparison): Clarify the strverscmp behavior. diff --git a/manual/string.texi b/manual/string.texi index bce81a7..193f786 100644 --- a/manual/string.texi +++ b/manual/string.texi @@ -1374,46 +1374,75 @@ The @code{strverscmp} function compares the string @var{s1} against @var{s2}, considering them as holding indices/version numbers. The return value follows the same conventions as found in the @code{strcmp} function. In fact, if @var{s1} and @var{s2} contain no -digits, @code{strverscmp} behaves like @code{strcmp}. +digits, @code{strverscmp} behaves like @code{strcmp} +(in the sense that the sign of the result is the same). -Basically, we compare strings normally (byte by byte), until -we find a digit in each string - then we enter a special comparison -mode, where each sequence of digits is taken as a whole. If we reach the -end of these two parts without noticing a difference, we return to the -standard comparison mode. There are two types of numeric parts: -"integral" and "fractional" (those begin with a '0'). The types -of the numeric parts affect the way we sort them: +The comparison algorithm which the @code{strverscmp} function implements +differs slightly from other version-comparison algorithms. The +implementation is based on a finite-state machine, whose behavior is +approximated below. @itemize @bullet @item -integral/integral: we compare values as you would expect. +The input strings are each split into sequences of non-digits and +digits. These sequences can be empty at the beginning and end of the +string. Digits are determined by the @code{isdigit} function and are +thus subject to the current locale. @item -fractional/integral: the fractional part is less than the integral one. -Again, no surprise. +Comparison starts with a (possibly empty) non-digit sequence. The first +non-equal sequences of non-digits or digits determines the outcome of +the comparison. @item -fractional/fractional: the things become a bit more complex. -If the common prefix contains only leading zeroes, the longest part is less -than the other one; else the comparison behaves normally. +Corresponding non-digit sequences in both strings are compared +lexicographically if their lengths are equal. If the lengths differ, +the shorter non-digit sequence is extended with the input string +character immediately following it (which can be the null terminator), +the other sequence is truncated to be of the same (extended) length, and +these two sequences are compared lexicographically. In this last case, +the sequence comparison determines the result of the function because +the extension character (or some character before it) is necessarily +different from the character at the same offset in the other input +string. + +@item +For two sequences of digits, the number of leading zeros is counted (which +can be zero). If the count differs, the string with more leading zeros +in the digit sequence is considered smaller than the other string. + +@item +If the two sequences of digits have no leading zeros, they are compared +as integers, that is, the string with the longer digit sequence is +deemed larger, and if both sequences are of equal length, they are +compared lexicographically. + +@item +If both digit sequences have an equal, positive number of leading zeros, +they are compared lexicographically if their lengths are the same. If +the lengths differ, the shorter sequence is extended with the following +character in its input string, and the other sequence is truncated to +the same length, and both sequences are compared lexicographically +(similar to the non-digit sequence case above). @end itemize +The treatment of leading zeros and the tie-breaking extension characters +(which in effect propagate across non-digit/digit sequence boundaries) +differs from other version-comparison algorithms. + @smallexample strverscmp ("no digit", "no digit") @result{} 0 /* @r{same behavior as strcmp.} */ strverscmp ("item#99", "item#100") @result{} <0 /* @r{same prefix, but 99 < 100.} */ strverscmp ("alpha1", "alpha001") - @result{} >0 /* @r{fractional part inferior to integral one.} */ + @result{} >0 /* @r{different number of leading zeros (0 and 2).} */ strverscmp ("part1_f012", "part1_f01") - @result{} >0 /* @r{two fractional parts.} */ + @result{} >0 /* @r{lexicographical comparison with leading zeros.} */ strverscmp ("foo.009", "foo.0") - @result{} <0 /* @r{idem, but with leading zeroes only.} */ + @result{} <0 /* @r{different number of leading zeros (2 and 1).} */ @end smallexample -This function is especially useful when dealing with filename sorting, -because filenames frequently hold indices/version numbers. - @code{strverscmp} is a GNU extension. @end deftypefun