
[WIP] libcpp, c-family: Emit #embed "." __gnu__::__base64__("...") when preprocessing

Message ID ZnRfX45a6CUBeS1T@tucnak
State New

Commit Message

Jakub Jelinek June 20, 2024, 4:57 p.m. UTC
Hi!

On Wed, Jun 19, 2024 at 08:29:37PM +0200, Jakub Jelinek wrote:
> Right now the patch only supports a single huge string literal in there,
> not concatenation of multiple strings, dunno if we shouldn't add support
> for that so that we don't run into the line length limits for column
> numbering.  The alternative would be emit
> #embed "." __gnu__::__base64__( \
> "Tm9uIGVyYW0gbsOpc2NpdXMsIEJydXRlLCBjdW0sIHF1w6Ygc3VtbWlzIGluZ8OpbmlpcyBleHF1" \
> "aXNpdMOhcXVlIGRvY3Ryw61uYSBwaGlsw7Nzb3BoaSBHcsOmY28gc2VybcOzbmUgdHJhY3RhdsOt" \
> "c3NlbnQsIGVhIExhdMOtbmlzIGzDrXR0ZXJpcyBtYW5kYXLDqW11cywgZm9yZSB1dCBoaWMgbm9z" \
> "dGVyIGxhYm9yIGluIHbDoXJpYXMgcmVwcmVoZW5zacOzbmVzIGluY8O6cnJlcmV0LiBuYW0gcXVp" \
> "YsO6c2RhbSwgZXQgaWlzIHF1aWRlbSBub24gw6FkbW9kdW0gaW5kw7NjdGlzLCB0b3R1bSBob2Mg" \
> "ZMOtc3BsaWNldCBwaGlsb3NvcGjDoXJpLiBxdWlkYW0gYXV0ZW0gbm9uIHRhbSBpZCByZXByZWjD" \
> "qW5kdW50LCBzaSByZW3DrXNzaXVzIGFnw6F0dXIsIHNlZCB0YW50dW0gc3TDumRpdW0gdGFtcXVl" \
> "IG11bHRhbSDDs3BlcmFtIHBvbsOpbmRhbSBpbiBlbyBub24gYXJiaXRyw6FudHVyLiBlcnVudCDD" \
> "qXRpYW0sIGV0IGlpIHF1aWRlbSBlcnVkw610aSBHcsOmY2lzIGzDrXR0ZXJpcywgY29udGVtbsOp" \
> "bnRlcyBMYXTDrW5hcywgcXVpIHNlIGRpY2FudCBpbiBHcsOmY2lzIGxlZ8OpbmRpcyDDs3BlcmFt" \
> "IG1hbGxlIGNvbnPDum1lcmUuIHBvc3Ryw6ltbyDDoWxpcXVvcyBmdXTDunJvcyBzw7pzcGljb3Is" \
> "IHF1aSBtZSBhZCDDoWxpYXMgbMOtdHRlcmFzIHZvY2VudCwgZ2VudXMgaG9jIHNjcmliw6luZGks" \
> "IGV0c2kgc2l0IGVsw6lnYW5zLCBwZXJzw7Nuw6YgdGFtZW4gZXQgZGlnbml0w6F0aXMgZXNzZSBu" \
> "ZWdlbnQu")
> so effectively split it at the 76 columns boundaries (or some other one,
> ideally multiple of 4); disadvantage is then that we lex that not into
> a single huge string but perhaps hundreds/thousands/millions of short
> CPP_STRINGs that would be gathered together.
> Thoughts on that?
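
A minimal sketch of the split spelling quoted above, assuming a standard
RFC 4648 base64 alphabet with '=' padding.  The helper names base64_encode
and spell_embed are hypothetical and not part of the patch; the width is
kept a multiple of 4 so every literal holds whole base64 quads.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Standard base64 alphabet (RFC 4648).
static const char base64_enc[] =
  "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

// Hypothetical helper: base64-encode DATA, padding the tail with '='.
std::string
base64_encode (const std::string &data)
{
  std::string out;
  for (std::size_t i = 0; i < data.size (); i += 3)
    {
      std::size_t n = data.size () - i;
      unsigned long v = (unsigned long) (unsigned char) data[i] << 16;
      if (n > 1)
        v |= (unsigned long) (unsigned char) data[i + 1] << 8;
      if (n > 2)
        v |= (unsigned char) data[i + 2];
      out += base64_enc[(v >> 18) & 63];
      out += base64_enc[(v >> 12) & 63];
      out += n > 1 ? base64_enc[(v >> 6) & 63] : '=';
      out += n > 2 ? base64_enc[v & 63] : '=';
    }
  return out;
}

// Hypothetical helper: spell the directive with the base64 text split into
// string literals of at most WIDTH characters, one per continued line.
std::string
spell_embed (const std::string &data, std::size_t width = 76)
{
  assert (width != 0 && width % 4 == 0);
  std::string b64 = base64_encode (data);
  std::string out = "#embed \".\" __gnu__::__base64__(";
  for (std::size_t i = 0; i < b64.size (); i += width)
    out += " \\\n\"" + b64.substr (i, width) + "\"";
  out += ")";
  return out;
}
```

With the default width each literal carries 57 input bytes, so a lexer
reading this form back would see roughly one CPP_STRING per 57 bytes of
embedded data.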

Just to show how this would be used, here is an incremental WIP patch
which emits it during preprocessing for #embeds of 64 bytes or more.
Because I don't want to break everything immediately, so far it is guarded
with -fdirectives-only: then I know we are only preprocessing and nothing
consumes the token right away, since the SUBSTRING_CST tree hasn't been
introduced yet and parsing of it hasn't been added to the C and C++ FEs.
But on the libcpp side, except for the "CPP_OPTION (pfile, directives_only) && "
check, it is how I'd like to eventually use it, perhaps for some time limited
just to non-C++ so that I can convert one FE at a time.
If the #embed is 2GB+ long, the patch splits it into smaller CPP_EMBED
tokens with CPP_COMMA in between, because we have various spots in the
compiler that just use unsigned int or even int lengths and I don't want to
bump into those, including e.g. cpp_string having unsigned int len.  Plus,
in the preprocessed output the base64 encoded form needs 4 / 3 times as many
bytes.
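
The size arithmetic can be sketched as follows.  The function names are
hypothetical; the chunk count is the same ceiling division by INT_MAX that
the patch applies to limit - 2, and the encoded length assumes padded
base64 (4 output characters per started group of 3 input bytes).

```cpp
#include <cassert>
#include <climits>
#include <cstddef>

// Number of CPP_EMBED tokens needed to cover INNER bytes when each token
// may hold at most INT_MAX bytes (ceiling division).
std::size_t
embed_token_count (std::size_t inner)
{
  return inner / INT_MAX + (inner % INT_MAX != 0);
}

// Length of the padded base64 spelling of N bytes.
std::size_t
base64_length (std::size_t n)
{
  return (n + 2) / 3 * 4;
}
```

Assuming a 32-bit int, a full INT_MAX sized chunk encodes to 2863311532
characters, which no longer fits in int but still fits in unsigned int.
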
Also, the patch makes CPP_EMBED from just the inner part of the sequence
and leaves out the first and last byte.  That is because, at least unless we
know for sure that e.g. the prefix tokens end with CPP_COMMA and the suffix
tokens start with CPP_COMMA, the boundary literals are special: one can
perform arithmetic on them or whatever else.  Seeing that clang gets all the
cases like
const unsigned char a[] = {
#embed __FILE__ prefix(-400 + 4 *) suffix(-27)
};
or
const unsigned char b[] = { [46] =
#embed __FILE__
};
etc. wrong makes me believe it is more maintainable if the boundaries
are just plain CPP_NUMBERs.
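
A small model of the resulting token layout, as a sketch only: Kind, Tok
and layout_embed are hypothetical names, prefix and suffix tokens are left
out, the rebalancing of a very short final chunk is not modelled, and
max_chunk stands in for INT_MAX so that the splitting is visible on small
inputs.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class Kind { Number, Comma, Embed };

struct Tok
{
  Kind kind;
  std::size_t pos;  // Offset of the first byte covered.
  std::size_t len;  // Number of bytes covered (0 for commas).
};

// Lay out LIMIT bytes as NUMBER , EMBED , [EMBED , ...] NUMBER where only
// the inner LIMIT - 2 bytes go into EMBED tokens of at most MAX_CHUNK bytes.
std::vector<Tok>
layout_embed (std::size_t limit, std::size_t max_chunk)
{
  assert (limit >= 3 && max_chunk > 0);
  std::vector<Tok> toks;
  toks.push_back ({Kind::Number, 0, 1});      // First byte stays a number.
  toks.push_back ({Kind::Comma, 0, 0});
  std::size_t i = 1;
  while (i < limit - 1)
    {
      std::size_t len = limit - 1 - i;
      if (len > max_chunk)
        len = max_chunk;
      toks.push_back ({Kind::Embed, i, len});
      toks.push_back ({Kind::Comma, 0, 0});
      i += len;
    }
  toks.push_back ({Kind::Number, limit - 1, 1});  // Last byte, too.
  return toks;
}
```

The vector always holds 2 * embed_tokens + 3 entries, which matches the
count the patch reserves in finish_embed.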

BTW, once SUBSTRING_CST (or whatever it ends up being called) support for
#embed is in, we could then perhaps also tweak parsing of large
char/unsigned char/signed char array initializers which don't actually use
#embed in the source; while the compile time and memory spent on CPP_NUMBER
with CPP_COMMA in between would already be wasted, if we see, say after 64
parsed INTEGER_CSTs, that it is all the same without overflows, we could just
switch into trying to parse it into a SUBSTRING_CST and give up only if we
see some out of range value or something other than CPP_NUMBER/CPP_COMMA;
after all, we could even do a quick raw token scan to check if it will be
worth it...
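
The quick scan idea could look roughly like the sketch below.  Everything
here is hypothetical: RawKind, RawTok and scan_byte_initializer are
illustrative names, the token model is far simpler than real cpp_tokens,
and the accepted range 0..255 assumes an unsigned char element type.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class RawKind { Number, Comma, Other };

struct RawTok
{
  RawKind kind;
  long value;  // Only meaningful for RawKind::Number.
};

// Return true if TOKS is a NUMBER , NUMBER , ... list (a trailing comma is
// accepted) whose values all fit in 0..255 and which has at least MIN_COUNT
// numbers; the values are then in BYTES.  On failure BYTES holds whatever
// was collected before giving up.
bool
scan_byte_initializer (const std::vector<RawTok> &toks, std::size_t min_count,
                       std::vector<unsigned char> &bytes)
{
  bytes.clear ();
  bool want_number = true;
  for (const RawTok &t : toks)
    {
      if (want_number)
        {
          if (t.kind != RawKind::Number || t.value < 0 || t.value > 255)
            return false;  // Out of range or not a plain number: give up.
          bytes.push_back ((unsigned char) t.value);
        }
      else if (t.kind != RawKind::Comma)
        return false;      // Something other than NUMBER/COMMA.
      want_number = !want_number;
    }
  return bytes.size () >= min_count;
}
```

A caller would pass a threshold such as 64 for min_count and fall back to
the ordinary element-by-element path whenever the scan returns false.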

Note, this patch is totally untested except for -E -fdirectives-only
preprocessing a file with some small and some large #embed directives
(and I've tried to base64 decode them).

2024-06-20  Jakub Jelinek  <jakub@redhat.com>

libcpp/ChangeLog:
	* include/cpplib.h (TTYPE_TABLE): Add CPP_EMBED token type.
	* files.cc (finish_embed): For limit >= 64 and for now if
	-fdirectives-only instead of emitting CPP_NUMBER CPP_COMMA separated
	sequence for the whole embed emit it just for the first and last
	byte and in between emit a CPP_EMBED token or tokens if too large.
gcc/c-family/ChangeLog:
	* c-ppoutput.cc (token_streamer::stream): Add special code to spell
	CPP_EMBED token.



	Jakub

Patch

--- libcpp/include/cpplib.h.jj	2024-06-18 14:28:55.813706760 +0200
+++ libcpp/include/cpplib.h	2024-06-20 16:49:12.559722371 +0200
@@ -144,6 +144,8 @@  class rich_location;
   TK(STRING32_USERDEF,	LITERAL) /* U"string"_suffix - C++11 */		\
   TK(UTF8STRING_USERDEF,LITERAL) /* u8"string"_suffix - C++11 */	\
 									\
+  TK(EMBED,		LITERAL) /* #embed - C23 */			\
+									\
   TK(COMMENT,		LITERAL) /* Only if output comments.  */	\
 				 /* SPELL_LITERAL happens to DTRT.  */	\
   TK(MACRO_ARG,		NONE)	 /* Macro argument.  */			\
--- libcpp/files.cc.jj	2024-06-19 12:44:40.169920948 +0200
+++ libcpp/files.cc	2024-06-20 18:20:43.430319203 +0200
@@ -1240,15 +1240,17 @@  finish_embed (cpp_reader *pfile, _cpp_fi
   if (params->limit < limit)
     limit = params->limit;
 
-  /* For sizes larger than say 64 bytes, this is just a temporary
-     solution, we should emit a single new token which the FEs will
-     handle as an optimization.  */
+  size_t embed_tokens = 0;
+  if (CPP_OPTION (pfile, directives_only) && limit >= 64)
+    embed_tokens = ((limit - 2) / INT_MAX) + (((limit - 2) % INT_MAX) != 0);
+
   size_t max = INTTYPE_MAXIMUM (size_t) / sizeof (cpp_token);
-  if (limit > max / 2
+  if ((embed_tokens ? (embed_tokens > (max - 3) / 2) : (limit > max / 2))
       || (limit
 	  ? (params->prefix.count > max
 	     || params->suffix.count > max
-	     || (limit * 2 + params->prefix.count
+	     || ((embed_tokens ? embed_tokens * 2 + 3 : limit * 2 - 1)
+		 + params->prefix.count
 		 + params->suffix.count > max))
 	  : params->if_empty.count > max))
     {
@@ -1282,13 +1284,16 @@  finish_embed (cpp_reader *pfile, _cpp_fi
 			"%s is too large", file->path);
 	  return 0;
 	}
+      if (embed_tokens && i == 0)
+	i = limit - 2;
     }
   uchar *s = len ? _cpp_unaligned_alloc (pfile, len) : NULL;
   _cpp_buff *tok_buff = NULL;
   cpp_token *toks = NULL, *tok = &pfile->directive_result;
   size_t count = 0;
   if (limit)
-    count = (params->prefix.count + limit * 2 - 1
+    count = (params->prefix.count
+	     + (embed_tokens ? embed_tokens * 2 + 3 : limit * 2 - 1)
 	     + params->suffix.count) - 1;
   else if (params->if_empty.count)
     count = params->if_empty.count - 1;
@@ -1340,6 +1345,34 @@  finish_embed (cpp_reader *pfile, _cpp_fi
 	  tok->flags = NO_EXPAND;
 	  tok++;
 	}
+      if (i == 0 && embed_tokens)
+	{
+	  ++i;
+	  for (size_t j = 0; j < embed_tokens; ++j)
+	    {
+	      tok->src_loc = params->loc;
+	      tok->type = CPP_EMBED;
+	      tok->flags = NO_EXPAND;
+	      tok->val.str.text = &buffer[i];
+	      tok->val.str.len
+		= limit - 1 - i > INT_MAX ? INT_MAX : limit - 1 - i;
+	      i += tok->val.str.len;
+	      if (tok->val.str.len < 32)
+		{
+		  /* Avoid CPP_EMBED with a fewer than 32 bytes, shrink the
+		     previous CPP_EMBED by 64 and grow this one by 64.  */
+		  tok[-2].val.str.len -= 64;
+		  tok->val.str.text -= 64;
+		  tok->val.str.len += 64;
+		}
+	      tok++;
+	      tok->src_loc = params->loc;
+	      tok->type = CPP_COMMA;
+	      tok->flags = NO_EXPAND;
+	      tok++;
+	    }
+	  --i;
+	}
     }
   if (limit && params->suffix.count)
     {
--- gcc/c-family/c-ppoutput.cc.jj	2024-01-03 12:07:02.254732978 +0100
+++ gcc/c-family/c-ppoutput.cc	2024-06-20 18:21:55.030388341 +0200
@@ -299,6 +299,55 @@  token_streamer::stream (cpp_reader *pfil
 	maybe_print_line (UNKNOWN_LOCATION);
       in_pragma = false;
     }
+  else if (token->type == CPP_EMBED)
+    {
+      char buf[65];
+      maybe_print_line (token->src_loc);
+      fputs ("#embed \".\" __gnu__::__base64__(\"", print.outf);
+      print.printed = true;
+      unsigned int j = 0;
+      static const char base64_enc[] =
+	"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
+      for (unsigned i = 0; i < token->val.str.len; i += 3)
+	{
+	  unsigned char a = token->val.str.text[i];
+	  unsigned char b = 0, c = 0;
+	  unsigned int n = token->val.str.len - i;
+	  if (n > 1)
+	    b = token->val.str.text[i + 1];
+	  if (n > 2)
+	    c = token->val.str.text[i + 2];
+	  unsigned long v = ((((unsigned long) a) << 16)
+			     | (((unsigned long) b) << 8)
+			     | c);
+	  buf[j++] = base64_enc[(v >> 18) & 63];
+	  buf[j++] = base64_enc[(v >> 12) & 63];
+	  buf[j++] = base64_enc[(v >> 6) & 63];
+	  buf[j++] = base64_enc[v & 63];
+	  if (n < 3)
+	    {
+	      buf[j - 1] = '=';
+	      if (n == 1)
+		buf[j - 2] = '=';
+	    }
+	  if (j == 64)
+	    {
+	      buf[64] = '\0';
+	      fputs (buf, print.outf);
+	      j = 0;
+	    }
+	  if (n < 3)
+	    break;
+	}
+      if (j)
+	{
+	  buf[j] = '\0';
+	  fputs (buf, print.outf);
+	}
+      fputs ("\")", print.outf);
+      maybe_print_line (token->src_loc);
+      return;
+    }
   else
     {
       if (cpp_get_options (parse_in)->debug)