Message ID | 87h6qjvfp1.fsf@euler.schwinge.homeip.net |
---|---|
State | New |
Headers | show |
Series | 'unsigned int len' field in 'libcpp/include/symtab.h:struct ht_identifier' (was: [PATCH] pch: Fix streaming of strings with embedded null bytes) | expand |
On Tue, Jul 4, 2023 at 11:50 AM Thomas Schwinge <thomas@codesourcery.com> wrote: > > Hi! > > I came across this one here on my way working through another (somewhat > related) GTY issue. I generally do understand the issue here, but do > have a question about 'unsigned int len' field in > 'libcpp/include/symtab.h:struct ht_identifier': > > On 2022-10-18T18:14:54-0400, Lewis Hyatt via Gcc-patches <gcc-patches@gcc.gnu.org> wrote: > > When a GTY'ed struct is streamed to PCH, any plain char* pointers it contains > > (whether they live in GC-controlled memory or not) will be marked for PCH > > output by the routine gt_pch_note_object in ggc-common.cc. This routine > > special-cases plain char* strings, and in particular it uses strlen() to get > > their length. > > Oh, wow, this special casing for strings... 8-| > > > Thus it does not handle strings with embedded null bytes, but it > > is possible for something PCH cares about (such as a string literal token in a > > macro definition) to contain such embedded nulls. To fix that up, add a new > > GTY option "string_length" so that gt_pch_note_object can be informed the > > actual length it ought to use, and use it in the relevant libcpp structs > > (cpp_string and ht_identifier) accordingly. > > For your test case: > > > --- /dev/null > > +++ b/gcc/testsuite/g++.dg/pch/pch-string-nulls.C > > @@ -0,0 +1,3 @@ > > +// { dg-do compile { target c++11 } } > > +#include "pch-string-nulls.H" > > +static_assert (X[4] == '[' && X[5] == '!' && X[6] == ']', "error"); > > > --- /dev/null > > +++ b/gcc/testsuite/g++.dg/pch/pch-string-nulls.Hs > > @@ -0,0 +1,2 @@ > > +/* Note that there is a null byte following "ABC". */ > > +#define X R"(ABC^@[!])" > > ..., I understand how the following is necessary: > > > --- a/libcpp/include/cpplib.h > > +++ b/libcpp/include/cpplib.h > > @@ -179,7 +179,11 @@ enum c_lang {CLK_GNUC89 = 0, CLK_GNUC99, CLK_GNUC11, CLK_GNUC17, CLK_GNUC2X, > > /* Payload of a NUMBER, STRING, CHAR or COMMENT token. */ > > struct GTY(()) cpp_string { > > unsigned int len; > > - const unsigned char *text; > > + > > + /* TEXT is always null terminated (terminator not included in len); but this > > + GTY markup arranges that PCH streaming works properly even if there is a > > + null byte in the middle of the string. */ > > + const unsigned char * GTY((string_length ("1 + %h.len"))) text; > > }; > > (That is, the test case FAILs with that one reverted.) > > However, this one did confuse me: > > > --- a/libcpp/include/symtab.h > > +++ b/libcpp/include/symtab.h > > @@ -29,7 +29,10 @@ along with this program; see the file COPYING3. If not see > > typedef struct ht_identifier ht_identifier; > > typedef struct ht_identifier *ht_identifier_ptr; > > struct GTY(()) ht_identifier { > > - const unsigned char *str; > > + /* This GTY markup arranges that the null-terminated identifier would still > > + stream to PCH correctly, if a null byte were to make its way into an > > + identifier somehow. */ > > + const unsigned char * GTY((string_length ("1 + %h.len"))) str; > > unsigned int len; > > unsigned int hash_value; > > }; > > I did wonder whether that's an actual or just a theoretical concern, to > have 'ht_identifier's with embedded NULs? If an actual concern, can we > get a test case constructed? Otherwise, should we revert this hunk, > given that we have this auto-'strlen' handling, ignorant of embedded > NULs, in a lot of other places? > > But then I realized that possibly we do maintain 'len' here not for > correctness but as an optimization (trading an 'unsigned int' for > repeated 'strlen' calls)? My quick testing with the attached > "[RFC] Verify no embedded NULs in 'struct ht_identifier'" might confirm > this theory: no regressions (..., but I didn't bootstrap, and ran only > parts of the testsuite). (Not proposing that RFC for 'git push', of > course.) > > If that's indeed the intention here, I shall change/add source code > commentary to describe this rationale for this dedicated 'len' field > (plus, that handling of embedded NULs falls out of that, automatically). > > For reference, this 'len' field has existed "forever". Before > 'struct ht_identifier' was added in > commit 2a967f3d3a45294640e155381ef549e0b8090ad4 (Subversion r42334), we > had in 'gcc/cpplib.h:struct cpp_hashnode': 'unsigned short len', or > earlier 'length', earlier in 'gcc/cpphash.h:struct hashnode': > 'unsigned short length', earlier 'size_t length' with comment: > "length of token, for quick comparison", erlier 'int length', ever since > the 'gcc/cpp*' files were added in > commit 7f2935c734c36f84ab62b20a04de465e19061333 (Subversion r9191). > I don't think there is currently any possibility for a null byte to end up in an ht_identifier's string. I assumed that ht_identifier stores the length as an optimization (especially since it doesn't take up any extra space on 64-bit platforms, given the 32-bit hash code is stored as well there.) I created the string_length GTY markup mainly to support another patch that I have still pending review, which I thought would increase the likelihood of PCH needing to handle null bytes in general. When I did that, I added the markup to ht_identifier simply because the length was already there, so there was no reason not to add it. It does save a few cycles when streaming out the PCH, but I doubt it is meaningful. -Lewis
From 75b54da2f30b7ffa0c69c1659e3f3538bfdbd339 Mon Sep 17 00:00:00 2001 From: Thomas Schwinge <thomas@codesourcery.com> Date: Tue, 4 Jul 2023 14:07:47 +0200 Subject: [PATCH] [RFC] Verify no embedded NULs in 'struct ht_identifier' --- gcc/analyzer/pending-diagnostic.cc | 2 +- gcc/c-family/c-spellcheck.h | 2 +- gcc/c/c-decl.cc | 2 +- gcc/tree.h | 2 +- libcpp/include/symtab.h | 4 ++-- libcpp/symtab.cc | 2 +- 6 files changed, 7 insertions(+), 7 deletions(-) diff --git a/gcc/analyzer/pending-diagnostic.cc b/gcc/analyzer/pending-diagnostic.cc index e36ed4fd9c1..9452719a908 100644 --- a/gcc/analyzer/pending-diagnostic.cc +++ b/gcc/analyzer/pending-diagnostic.cc @@ -128,7 +128,7 @@ pending_diagnostic::same_tree_p (tree t1, tree t2) static bool ht_ident_eq (ht_identifier ident, const char *str) { - return (strlen (str) == ident.len + return (strlen (str) == HT_LEN (&ident) && 0 == strcmp (str, (const char *)ident.str)); } diff --git a/gcc/c-family/c-spellcheck.h b/gcc/c-family/c-spellcheck.h index e3748430b18..0f0c264ef01 100644 --- a/gcc/c-family/c-spellcheck.h +++ b/gcc/c-family/c-spellcheck.h @@ -31,7 +31,7 @@ struct edit_distance_traits<cpp_hashnode *> { static size_t get_length (cpp_hashnode *hashnode) { - return hashnode->ident.len; + return HT_LEN (&hashnode->ident); } static const char *get_string (cpp_hashnode *hashnode) diff --git a/gcc/c/c-decl.cc b/gcc/c/c-decl.cc index 1af51c4acfc..c67ae5be514 100644 --- a/gcc/c/c-decl.cc +++ b/gcc/c/c-decl.cc @@ -4499,7 +4499,7 @@ lookup_name_fuzzy (tree name, enum lookup_name_fuzzy_kind kind, location_t loc) { const char *id = (const char *)best_macro->ident.str; tree macro_as_identifier - = get_identifier_with_length (id, best_macro->ident.len); + = get_identifier_with_length (id, HT_LEN (&best_macro->ident)); bm.set_best_so_far (macro_as_identifier, bmm.get_best_distance (), bmm.get_best_candidate_length ()); diff --git a/gcc/tree.h b/gcc/tree.h index 1854fe4a7d4..20169e7a467 100644 --- a/gcc/tree.h +++ b/gcc/tree.h @@ -1174,7 +1174,7 @@ extern void omp_clause_range_check_failed (const_tree, const char *, int, /* Unlike STRING_CST, in C terms this is strlen, not sizeof. */ #define IDENTIFIER_LENGTH(NODE) \ - (IDENTIFIER_NODE_CHECK (NODE)->identifier.id.len) + (HT_LEN (&IDENTIFIER_NODE_CHECK (NODE)->identifier.id)) #define IDENTIFIER_POINTER(NODE) \ ((const char *) IDENTIFIER_NODE_CHECK (NODE)->identifier.id.str) #define IDENTIFIER_HASH_VALUE(NODE) \ diff --git a/libcpp/include/symtab.h b/libcpp/include/symtab.h index c7ccc6db9f0..ea23e3dbedb 100644 --- a/libcpp/include/symtab.h +++ b/libcpp/include/symtab.h @@ -33,11 +33,11 @@ struct GTY(()) ht_identifier { stream to PCH correctly, if a null byte were to make its way into an identifier somehow. */ const unsigned char * GTY((string_length ("1 + %h.len"))) str; - unsigned int len; + unsigned int len_; unsigned int hash_value; }; -#define HT_LEN(NODE) ((NODE)->len) +#define HT_LEN(NODE) ({ gcc_assert ((NODE)->len_ == strlen ((const char *) HT_STR (NODE))); (NODE)->len_; }) #define HT_STR(NODE) ((NODE)->str) typedef struct ht cpp_hash_table; diff --git a/libcpp/symtab.cc b/libcpp/symtab.cc index ea9f26905a9..be5c1da9d67 100644 --- a/libcpp/symtab.cc +++ b/libcpp/symtab.cc @@ -155,7 +155,7 @@ ht_lookup_with_hash (cpp_hash_table *table, const unsigned char *str, node = (*table->alloc_node) (table); table->entries[index] = node; - HT_LEN (node) = (unsigned int) len; + node->len_ = (unsigned int) len; node->hash_value = hash; if (table->alloc_subobject) -- 2.34.1