diff mbox series

[Ada] Make middle-end string literals NUL terminated

Message ID AM5PR0701MB2657F9BAF49A613759D90D13E42E0@AM5PR0701MB2657.eurprd07.prod.outlook.com
State New
Headers show
Series [Ada] Make middle-end string literals NUL terminated | expand

Commit Message

Bernd Edlinger July 31, 2018, 12:07 p.m. UTC
Hi!


This fixes a couple STRING_CST that are not explicitly NUL terminated.
These were caught in a new check in varasm.c I am currently working on.

Having a NUL terminated string does not change the binary output, but it
makes things easier for he middle-end.


Bootstrapped and reg-tested on x86_64-pc-linux-gnu.
Is it OK for trunk?


Thanks
Bernd.

Comments

Olivier Hainque Aug. 3, 2018, 4:50 p.m. UTC | #1
Hi Bernd,

> On 31 Jul 2018, at 14:07, Bernd Edlinger <bernd.edlinger@hotmail.de> wrote:
> 
> Hi!
> 
> 
> This fixes a couple STRING_CST that are not explicitly NUL terminated.
> These were caught in a new check in varasm.c I am currently working on.
> 
> Having a NUL terminated string does not change the binary output, but it
> makes things easier for he middle-end.
> 
> 
> Bootstrapped and reg-tested on x86_64-pc-linux-gnu.
> Is it OK for trunk?

--- gcc/ada/gcc-interface/utils2.c	2017-12-21 07:57:41.000000000 +0100
+++ gcc/ada/gcc-interface/utils2.c	2018-07-31 11:44:01.517117923 +0200
@@ -1844,7 +1844,7 @@ expand_sloc (Node_Id gnat_node, tree *fi
     }
 
   const int len = strlen (str);
-  *filename = build_string (len, str);
+  *filename = build_string (len + 1, str);
   TREE_TYPE (*filename) = build_array_type (char_type_node,
 					    build_index_type (size_int (len)));
   *line = build_int_cst (NULL_TREE, line_number);

This one looks good to me. I'm not officially listed as a maintainer
so I'll let Eric have the final word. I'm answering because ...


--- gcc/ada/gcc-interface/trans.c	2018-07-17 10:10:04.000000000 +0200
+++ gcc/ada/gcc-interface/trans.c	2018-07-31 11:16:27.350728886 +0200
@@ -6079,7 +6079,7 @@ gnat_to_gnu (Node_Id gnat_node)
 	     where GCC will want to treat it as a C string.  */
 	  string[i] = 0;
 
-	  gnu_result = build_string (length, string);
+	  gnu_result = build_string (length + 1, string);
 
 	  /* Strings in GCC don't normally have types, but we want
 	     this to not be converted to the array type.  */

We have a local variant of this one, on which I worked after we realized
that it was not enough to achieve the intended effect.

The difference at this spot is simply that we prevent the +1 if the
original string happens to be explicitly nul terminated already. Like:

 	    build_string
               ((length > 0 && string[length-1] == 0) ? length : length + 1,
                string);

This is however not enough because the extra zero is only conveyed
through the string value, not the corresponding type, and 

  varasm.c: mergeable_string_section

currently uses the type bounds to search for a zero termination.

We can't really change the type, so came up with an adjustment to
varasm. The motivation for using the type bounds was

  https://gcc.gnu.org/ml/gcc-patches/2006-06/msg01004.html

which, IIUC, was tightly linked to string constants used as
initializers for objects which have a size of their own.
(Jakub cc'ed)

For string constants not used as initializers, it seems that
we might be able to use the string bounds instead, possibly
wider. The attached patch does that and passed testing a few weeks
ago.

So, while we're at it, does that look right, in light of all the
string length related issues that have been discussed lately ?

I'm not sure I grasped all the ramifications.

Thanks in advance!


        * varasm.c (mergeable_string_section): Accept an extra SCAT
        argument conveying the section category from which the request
        originates.  Only restrict the search to the string type bounds
        if we're queried for an initializer.  Consider the string bounds
        otherwise.
        (default_elf_select_section, STR and STR_INIT): Pass the section
        category to mergeable_string_section.
Bernd Edlinger Aug. 3, 2018, 6:37 p.m. UTC | #2
On 08/03/18 18:50, Olivier Hainque wrote:
> Hi Bernd,
> 
>> On 31 Jul 2018, at 14:07, Bernd Edlinger <bernd.edlinger@hotmail.de> wrote:
>> 
>> Hi!
>> 
>> 
>> This fixes a couple STRING_CST that are not explicitly NUL terminated.
>> These were caught in a new check in varasm.c I am currently working on.
>> 
>> Having a NUL terminated string does not change the binary output, but it
>> makes things easier for he middle-end.
>> 
>> 
>> Bootstrapped and reg-tested on x86_64-pc-linux-gnu.
>> Is it OK for trunk?
> 
> --- gcc/ada/gcc-interface/utils2.c      2017-12-21 07:57:41.000000000 +0100
> +++ gcc/ada/gcc-interface/utils2.c      2018-07-31 11:44:01.517117923 +0200
> @@ -1844,7 +1844,7 @@ expand_sloc (Node_Id gnat_node, tree *fi
>       }
> 
>     const int len = strlen (str);
> -  *filename = build_string (len, str);
> +  *filename = build_string (len + 1, str);
>     TREE_TYPE (*filename) = build_array_type (char_type_node,
>                                               build_index_type (size_int (len)));
>     *line = build_int_cst (NULL_TREE, line_number);
> 
> This one looks good to me. I'm not officially listed as a maintainer
> so I'll let Eric have the final word. I'm answering because ...
> 
> 
> --- gcc/ada/gcc-interface/trans.c       2018-07-17 10:10:04.000000000 +0200
> +++ gcc/ada/gcc-interface/trans.c       2018-07-31 11:16:27.350728886 +0200
> @@ -6079,7 +6079,7 @@ gnat_to_gnu (Node_Id gnat_node)
>                where GCC will want to treat it as a C string.  */
>             string[i] = 0;
> 
> -         gnu_result = build_string (length, string);
> +         gnu_result = build_string (length + 1, string);
> 
>             /* Strings in GCC don't normally have types, but we want
>                this to not be converted to the array type.  */
> 
> We have a local variant of this one, on which I worked after we realized
> that it was not enough to achieve the intended effect.
> 
> The difference at this spot is simply that we prevent the +1 if the
> original string happens to be explicitly nul terminated already. Like:
> 
>               build_string
>                 ((length > 0 && string[length-1] == 0) ? length : length + 1,
>                  string);
> 
> This is however not enough because the extra zero is only conveyed
> through the string value, not the corresponding type, and
> 
>    varasm.c: mergeable_string_section
> 
> currently uses the type bounds to search for a zero termination.
> 
> We can't really change the type, so came up with an adjustment to
> varasm. The motivation for using the type bounds was
> 
> https://gcc.gnu.org/ml/gcc-patches/2006-06/msg01004.html
> 
> which, IIUC, was tightly linked to string constants used as
> initializers for objects which have a size of their own.
> (Jakub cc'ed)
> 
> For string constants not used as initializers, it seems that
> we might be able to use the string bounds instead, possibly
> wider. The attached patch does that and passed testing a few weeks
> ago.
> 
> So, while we're at it, does that look right, in light of all the
> string length related issues that have been discussed lately ?
> 
> I'm not sure I grasped all the ramifications.
> 
> Thanks in advance!
> 
> 
>          * varasm.c (mergeable_string_section): Accept an extra SCAT
>          argument conveying the section category from which the request
>          originates.  Only restrict the search to the string type bounds
>          if we're queried for an initializer.  Consider the string bounds
>          otherwise.
>          (default_elf_select_section, STR and STR_INIT): Pass the section
>          category to mergeable_string_section.
> 
> 
> 
> 
> 

Oh, how interesting.

My patch only intends to add a dummy byte that is stripped again
in the assembly.  The purpose is to make it trivial to find out if
a string constant is NUL-terminated in  the middle-end by comparing
TYPE_SIZE_UNIT (TREE_TYPE (decl)) with TREE_STRING_LENGTH (decl).
TYPE_SIZE_UNIT >= TREE_STRING_LENGTH means zero terminated
TYPE_SIZE_UNIT < TREE_STRING_LENGTH means not necessarily zero terminated,
Such strings shall never chop off more than one nul character.
Ada strings are generally not zero terminated, right?

So in your search loop you would not have to look after
type_len, because that byte would be guaranteed to be zero.

That is if we agree that we want to restrict the string constants
in that way as I proposed.

In the case str_len > type_len I do not understand if that is right.

Because varasm will only output type_size bytes,
this seems only to select a string section
but the last nul byte will not be assembled.
Something that is not contained in the type_size
should not be assembled.

What exactly do you want to accomplish?


Thanks
Bernd.
Bernd Edlinger Aug. 4, 2018, 11:33 a.m. UTC | #3
Hi Olivier,


I think I like your idea a lot, it should be highly useful for Fortran and GO as well.

Would somthing something like this (untested patch) work for you on top of my patch series.
It makes use of the zero-termination properties of STRING_CSTs, that the other patch ensures.

I have two C test cases for this:

$ cat y3.c
const char str[3] = "abc";
$ gcc -fmerge-all-constants -S y3.c
$ cat y3.s
	.file	"y3.c"
	.text
	.globl	str
	.section	.rodata.str1.1,"aMS",@progbits,1
	.type	str, @object
	.size	str, 3
str:
	.string	"abc"
	.ident	"GCC: (GNU) 9.0.0 20180730 (experimental)"
	.section	.note.GNU-stack,"",@progbits

$ cat y4.c
const char str[4] = "abc";
$ gcc -fmerge-all-constants -S y4.c
$ cat y4.s
$ cat y4.s
	.file	"y4.c"
	.text
	.globl	str
	.section	.rodata.str1.1,"aMS",@progbits,1
	.type	str, @object
	.size	str, 4
str:
	.string	"abc"
	.ident	"GCC: (GNU) 9.0.0 20180730 (experimental)"
	.section	.note.GNU-stack,"",@progbits


The .size directive uses the DECL_SIZE_UNIT, and is still 3.
I don't know of that needs to change or not.
But what is important that we have .string "abc"
and not .ascii "abc".

What do you think?


Thanks
Bernd.
Index: gcc/varasm.c
===================================================================
--- gcc/varasm.c	(revision 263072)
+++ gcc/varasm.c	(working copy)
@@ -111,7 +111,8 @@ static int compare_constant (const tree, const tre
 static void output_constant_def_contents (rtx);
 static void output_addressed_constants (tree);
 static unsigned HOST_WIDE_INT output_constant (tree, unsigned HOST_WIDE_INT,
-					       unsigned int, bool);
+					       unsigned int, bool,
+					       bool = false);
 static void globalize_decl (tree);
 static bool decl_readonly_section_1 (enum section_category);
 #ifdef BSS_SECTION_ASM_OP
@@ -827,7 +828,7 @@ mergeable_string_section (tree decl ATTRIBUTE_UNUS
 	  unit = GET_MODE_SIZE (mode);
 
 	  /* Check for embedded NUL characters.  */
-	  for (i = 0; i < len; i += unit)
+	  for (i = 0; i < len - unit; i += unit)
 	    {
 	      for (j = 0; j < unit; j++)
 		if (str[i + j] != '\0')
@@ -2117,7 +2118,7 @@ assemble_noswitch_variable (tree decl, const char
 
 static void
 assemble_variable_contents (tree decl, const char *name,
-			    bool dont_output_data)
+			    bool dont_output_data, bool merge_strings = false)
 {
   /* Do any machine/system dependent processing of the object.  */
 #ifdef ASM_DECLARE_OBJECT_NAME
@@ -2140,7 +2141,7 @@ assemble_variable_contents (tree decl, const char
 	output_constant (DECL_INITIAL (decl),
 			 tree_to_uhwi (DECL_SIZE_UNIT (decl)),
 			 get_variable_align (decl),
-			 false);
+			 false, merge_strings);
       else
 	/* Leave space for it.  */
 	assemble_zeros (tree_to_uhwi (DECL_SIZE_UNIT (decl)));
@@ -2316,7 +2317,9 @@ assemble_variable (tree decl, int top_level ATTRIB
 	switch_to_section (sect);
       if (align > BITS_PER_UNIT)
 	ASM_OUTPUT_ALIGN (asm_out_file, floor_log2 (align / BITS_PER_UNIT));
-      assemble_variable_contents (decl, name, dont_output_data);
+      assemble_variable_contents (decl, name, dont_output_data,
+				  sect->common.flags & SECTION_MERGE
+				  && sect->common.flags & SECTION_STRINGS);
       if (asan_protected)
 	{
 	  unsigned HOST_WIDE_INT int size
@@ -4786,6 +4789,19 @@ static unsigned HOST_WIDE_INT
 output_constructor (tree, unsigned HOST_WIDE_INT, unsigned int, bool,
 		    oc_outer_state *);
 
+/* Find out if a string P of length LEN and character size UNIT
+   has double nul termination.
+   The string is already known to have a nul termination at
+   offset LEN.  Find out if there is already a NUL termination
+   at offset LEN - UNIT.  */
+
+static bool
+redundant_nul_term_p (const char *p, unsigned len, unsigned unit)
+{
+  gcc_assert(len >= unit);
+  return !memcmp (p + len, p + len - unit, unit);
+}
+
 /* Output assembler code for constant EXP, with no label.
    This includes the pseudo-op such as ".int" or ".byte", and a newline.
    Assumes output_addressed_constants has been done on EXP already.
@@ -4812,7 +4828,7 @@ output_constructor (tree, unsigned HOST_WIDE_INT,
 
 static unsigned HOST_WIDE_INT
 output_constant (tree exp, unsigned HOST_WIDE_INT size, unsigned int align,
-		 bool reverse)
+		 bool reverse, bool merge_strings /* = false */)
 {
   enum tree_code code;
   unsigned HOST_WIDE_INT thissize;
@@ -4940,8 +4956,13 @@ output_constant (tree exp, unsigned HOST_WIDE_INT
 	case CONSTRUCTOR:
 	  return output_constructor (exp, size, align, reverse, NULL);
 	case STRING_CST:
-	  thissize
-	    = MIN ((unsigned HOST_WIDE_INT)TREE_STRING_LENGTH (exp), size);
+	  thissize = (unsigned HOST_WIDE_INT)TREE_STRING_LENGTH (exp);
+	  if (thissize > size && merge_strings
+	      && !redundant_nul_term_p (TREE_STRING_POINTER (exp),
+				        size, thissize - size))
+	    ;
+ 	  else
+	    thissize = MIN (thissize, size);
 	  assemble_string (TREE_STRING_POINTER (exp), thissize);
 	  break;
 	case VECTOR_CST:
Bernd Edlinger Aug. 4, 2018, 3:43 p.m. UTC | #4
Hi!


This patch is inspired by Olivier's feedback to my previous patch on the
zero-termination of Ada STRING_CST.

The idea is that strings that do not have embedded nul characters _and_
do not happen to be zero-terminated in the DECL_UNIT_SIZE, are currently
not in the merge string sections.  To be in the merge string section
they need a terminating nul character, that gets written directly
in varasm while assembling the string constant.  This patch uses
the new string properties that my previous patch series implements,
and is based on the other patches here:

[PATCH] Check the STRING_CSTs in varasm.c
https://gcc.gnu.org/ml/gcc-patches/2018-08/msg00331.html

[PATCH] Handle overlength strings in the C FE
https://gcc.gnu.org/ml/gcc-patches/2018-08/msg00142.html

[PATCH] Handle overlength strings in C++ FE
https://gcc.gnu.org/ml/gcc-patches/2018-08/msg00045.html
Approved: https://gcc.gnu.org/ml/gcc-patches/2018-08/msg00105.html

[PATCH] Make GO string literals properly NUL terminated
https://gcc.gnu.org/ml/gcc-patches/2018-07/msg01931.html

[PATCH] [Ada] Make middle-end string literals NUL terminated
https://gcc.gnu.org/ml/gcc-patches/2018-07/msg01929.html


The existing test case gcc.dg/merge-all-constants-1.c
contains two overlength strings that get now stripped down
by the C FE, and look in the middle end exactly like normal
Ada, Fortran or Go strings.  And get now allocated
in the merge.str section, unless they have an embedded
nul character, which I changed to make the test pass again.

Olivier, could you add test cases from the Ada side to this?


Bootstrapped and reg-tested on x86_64-pc-linux-gnu.
Is it OK for trunk (after all pre-condition patches
are approved/committed)?


Thanks
Bernd.
Bernd Edlinger Aug. 4, 2018, 3:44 p.m. UTC | #5
Again, with patch....

On 08/04/18 17:43, Bernd Edlinger wrote:
> Hi!
> 
> 
> This patch is inspired by Olivier's feedback to my previous patch on the
> zero-termination of Ada STRING_CST.
> 
> The idea is that strings that do not have embedded nul characters _and_
> do not happen to be zero-terminated in the DECL_UNIT_SIZE, are currently
> not in the merge string sections.  To be in the merge string section
> they need a terminating nul character, that gets written directly
> in varasm while assembling the string constant.  This patch uses
> the new string properties that my previous patch series implements,
> and is based on the other patches here:
> 
> [PATCH] Check the STRING_CSTs in varasm.c
> https://gcc.gnu.org/ml/gcc-patches/2018-08/msg00331.html
> 
> [PATCH] Handle overlength strings in the C FE
> https://gcc.gnu.org/ml/gcc-patches/2018-08/msg00142.html
> 
> [PATCH] Handle overlength strings in C++ FE
> https://gcc.gnu.org/ml/gcc-patches/2018-08/msg00045.html
> Approved: https://gcc.gnu.org/ml/gcc-patches/2018-08/msg00105.html
> 
> [PATCH] Make GO string literals properly NUL terminated
> https://gcc.gnu.org/ml/gcc-patches/2018-07/msg01931.html
> 
> [PATCH] [Ada] Make middle-end string literals NUL terminated
> https://gcc.gnu.org/ml/gcc-patches/2018-07/msg01929.html
> 
> 
> The existing test case gcc.dg/merge-all-constants-1.c
> contains two overlength strings that get now stripped down
> by the C FE, and look in the middle end exactly like normal
> Ada, Fortran or Go strings.  And get now allocated
> in the merge.str section, unless they have an embedded
> nul character, which I changed to make the test pass again.
> 
> Olivier, could you add test cases from the Ada side to this?
> 
> 
> Bootstrapped and reg-tested on x86_64-pc-linux-gnu.
> Is it OK for trunk (after all pre-condition patches
> are approved/committed)?
> 
> 
> Thanks
> Bernd.
gcc:
2018-08-04  Bernd Edlinger  <bernd.edlinger@hotmail.de>

	* varasm.c (output_constant): Add new parameter merge_strings.
	(assemble_variable_contents): Likewise.
	(mergeable_string_section): Don't fail if the last char is non-zero.
	(assemble_variable): Pass merge_strings for string merge sections.
	(redundant_nul_term_p): New helper function.
	(output_constant): Make strings properly zero terminated in merge
	string sections.

testsuite:
2018-08-04  Bernd Edlinger  <bernd.edlinger@hotmail.de>

	* gcc.dg/merge-all-constants-1.c: Adjust test.

Index: gcc/varasm.c
===================================================================
--- gcc/varasm.c	(revision 263072)
+++ gcc/varasm.c	(working copy)
@@ -111,7 +111,8 @@ static int compare_constant (const tree, const tre
 static void output_constant_def_contents (rtx);
 static void output_addressed_constants (tree);
 static unsigned HOST_WIDE_INT output_constant (tree, unsigned HOST_WIDE_INT,
-					       unsigned int, bool);
+					       unsigned int, bool,
+					       bool = false);
 static void globalize_decl (tree);
 static bool decl_readonly_section_1 (enum section_category);
 #ifdef BSS_SECTION_ASM_OP
@@ -827,7 +828,7 @@ mergeable_string_section (tree decl ATTRIBUTE_UNUS
 	  unit = GET_MODE_SIZE (mode);
 
 	  /* Check for embedded NUL characters.  */
-	  for (i = 0; i < len; i += unit)
+	  for (i = 0; i < len - unit; i += unit)
 	    {
 	      for (j = 0; j < unit; j++)
 		if (str[i + j] != '\0')
@@ -2117,7 +2118,7 @@ assemble_noswitch_variable (tree decl, const char
 
 static void
 assemble_variable_contents (tree decl, const char *name,
-			    bool dont_output_data)
+			    bool dont_output_data, bool merge_strings = false)
 {
   /* Do any machine/system dependent processing of the object.  */
 #ifdef ASM_DECLARE_OBJECT_NAME
@@ -2140,7 +2141,7 @@ assemble_variable_contents (tree decl, const char
 	output_constant (DECL_INITIAL (decl),
 			 tree_to_uhwi (DECL_SIZE_UNIT (decl)),
 			 get_variable_align (decl),
-			 false);
+			 false, merge_strings);
       else
 	/* Leave space for it.  */
 	assemble_zeros (tree_to_uhwi (DECL_SIZE_UNIT (decl)));
@@ -2316,7 +2317,9 @@ assemble_variable (tree decl, int top_level ATTRIB
 	switch_to_section (sect);
       if (align > BITS_PER_UNIT)
 	ASM_OUTPUT_ALIGN (asm_out_file, floor_log2 (align / BITS_PER_UNIT));
-      assemble_variable_contents (decl, name, dont_output_data);
+      assemble_variable_contents (decl, name, dont_output_data,
+				  sect->common.flags & SECTION_MERGE
+				  && sect->common.flags & SECTION_STRINGS);
       if (asan_protected)
 	{
 	  unsigned HOST_WIDE_INT int size
@@ -4786,6 +4789,19 @@ static unsigned HOST_WIDE_INT
 output_constructor (tree, unsigned HOST_WIDE_INT, unsigned int, bool,
 		    oc_outer_state *);
 
+/* Find out if a string P of length LEN and character size UNIT
+   has double nul termination.
+   The string is already known to have a nul termination at
+   offset LEN.  Find out if there is already a NUL termination
+   at offset LEN - UNIT.  */
+
+static bool
+redundant_nul_term_p (const char *p, unsigned len, unsigned unit)
+{
+  gcc_assert(len >= unit);
+  return !memcmp (p + len, p + len - unit, unit);
+}
+
 /* Output assembler code for constant EXP, with no label.
    This includes the pseudo-op such as ".int" or ".byte", and a newline.
    Assumes output_addressed_constants has been done on EXP already.
@@ -4812,7 +4828,7 @@ output_constructor (tree, unsigned HOST_WIDE_INT,
 
 static unsigned HOST_WIDE_INT
 output_constant (tree exp, unsigned HOST_WIDE_INT size, unsigned int align,
-		 bool reverse)
+		 bool reverse, bool merge_strings /* = false */)
 {
   enum tree_code code;
   unsigned HOST_WIDE_INT thissize;
@@ -4940,8 +4956,13 @@ output_constant (tree exp, unsigned HOST_WIDE_INT
 	case CONSTRUCTOR:
 	  return output_constructor (exp, size, align, reverse, NULL);
 	case STRING_CST:
-	  thissize
-	    = MIN ((unsigned HOST_WIDE_INT)TREE_STRING_LENGTH (exp), size);
+	  thissize = (unsigned HOST_WIDE_INT)TREE_STRING_LENGTH (exp);
+	  if (thissize > size && merge_strings
+	      && !redundant_nul_term_p (TREE_STRING_POINTER (exp),
+					size, thissize - size))
+	    ;
+ 	  else
+	    thissize = MIN (thissize, size);
 	  gcc_checking_assert (check_string_literal (exp, thissize));
 	  assemble_string (TREE_STRING_POINTER (exp), thissize);
 	  break;
 	case VECTOR_CST:
Index: gcc/testsuite/gcc.dg/merge-all-constants-1.c
===================================================================
--- gcc/testsuite/gcc.dg/merge-all-constants-1.c	(revision 263072)
+++ gcc/testsuite/gcc.dg/merge-all-constants-1.c	(working copy)
@@ -1,8 +1,8 @@
 /* { dg-do compile } */
 /* { dg-options "-w -O2 -fmerge-all-constants" } */
 
-const char str1[36] = "0123456789abcdefghijklmnopqrstuvwxyz";
+const char str1[36] = "\000123456789abcdefghijklmnopqrstuvwxyz";
 const char str2[38] = "0123456789abcdefghijklmnopqrstuvwxyz";
-const char str3[10] = "0123456789abcdefghijklmnopqrstuvwxyz";
+const char str3[10] = "\000123456789abcdefghijklmnopqrstuvwxyz";
 
 /* { dg-final { scan-assembler-not "\.rodata\.str" } } */
Bernd Edlinger Aug. 4, 2018, 4:06 p.m. UTC | #6
Aehm, I forgot the Fortran FE patch
which is also a pre-condition (at least for building with all languages):

[PATCH] Create internally nul terminated string literals in fortan FE
https://gcc.gnu.org/ml/fortran/2018-08/msg00000.html


Thanks
Bernd.


On 08/04/18 17:44, Bernd Edlinger wrote:
> Again, with patch....
> 
> On 08/04/18 17:43, Bernd Edlinger wrote:
>> Hi!
>>
>>
>> This patch is inspired by Olivier's feedback to my previous patch on the
>> zero-termination of Ada STRING_CST.
>>
>> The idea is that strings that do not have embedded nul characters _and_
>> do not happen to be zero-terminated in the DECL_UNIT_SIZE, are currently
>> not in the merge string sections.  To be in the merge string section
>> they need a terminating nul character, that gets written directly
>> in varasm while assembling the string constant.  This patch uses
>> the new string properties that my previous patch series implements,
>> and is based on the other patches here:
>>
>> [PATCH] Check the STRING_CSTs in varasm.c
>> https://gcc.gnu.org/ml/gcc-patches/2018-08/msg00331.html
>>
>> [PATCH] Handle overlength strings in the C FE
>> https://gcc.gnu.org/ml/gcc-patches/2018-08/msg00142.html
>>
>> [PATCH] Handle overlength strings in C++ FE
>> https://gcc.gnu.org/ml/gcc-patches/2018-08/msg00045.html
>> Approved: https://gcc.gnu.org/ml/gcc-patches/2018-08/msg00105.html
>>
>> [PATCH] Make GO string literals properly NUL terminated
>> https://gcc.gnu.org/ml/gcc-patches/2018-07/msg01931.html
>>
>> [PATCH] [Ada] Make middle-end string literals NUL terminated
>> https://gcc.gnu.org/ml/gcc-patches/2018-07/msg01929.html
>>
>>
>> The existing test case gcc.dg/merge-all-constants-1.c
>> contains two overlength strings that get now stripped down
>> by the C FE, and look in the middle end exactly like normal
>> Ada, Fortran or Go strings.  And get now allocated
>> in the merge.str section, unless they have an embedded
>> nul character, which I changed to make the test pass again.
>>
>> Olivier, could you add test cases from the Ada side to this?
>>
>>
>> Bootstrapped and reg-tested on x86_64-pc-linux-gnu.
>> Is it OK for trunk (after all pre-condition patches
>> are approved/committed)?
>>
>>
>> Thanks
>> Bernd.
Olivier Hainque Aug. 6, 2018, 6:01 p.m. UTC | #7
Hi Bernd,

Things are moving fast so I'll answer your
messages incrementally :-)

> On 3 Aug 2018, at 20:37, Bernd Edlinger <bernd.edlinger@hotmail.de> wrote:
[...]
> Oh, how interesting.
> 
> My patch only intends to add a dummy byte that is stripped again
> in the assembly.  The purpose is to make it trivial to find out if
> a string constant is NUL-terminated in  the middle-end by comparing
> TYPE_SIZE_UNIT (TREE_TYPE (decl)) with TREE_STRING_LENGTH (decl).

Ah, Ok. I hadn't quite realized that yet.

> TYPE_SIZE_UNIT >= TREE_STRING_LENGTH means zero terminated
> TYPE_SIZE_UNIT < TREE_STRING_LENGTH means not necessarily zero terminated,

That is opposite to the situation we were having
for Ada before the patch series:

> Such strings shall never chop off more than one nul character.
> Ada strings are generally not zero terminated, right?

Right. Strings have explicit bounds in Ada, and a NUL character
is like any other. It can appear anywhere. For example, the
excerpt below constructs a string of 5 characters, bounds 1 to 5,
with a nul character in second position:

 X : String (1 .. 5) := "1" & Ascii.Nul & "345";

('&' is the catenation operator)

The problem we had was that string literals not explicitly
NUL terminated (as in X : String := "123" & Ascii.NUL) would
never go in mergeable sections. For example:

Ada.Text_IO.Put_Line ("Hello World!");

Ada has so called "generic" units that are sometimes
instanciated many many times so there is real interest
in improving the situation there.

> So in your search loop you would not have to look after
> type_len, because that byte would be guaranteed to be zero.
> 
> That is if we agree that we want to restrict the string constants
> in that way as I proposed.
> 
> In the case str_len > type_len I do not understand if that is right.

In your new model, I don't know what sense it would make, indeed.

In the previous model, it was clearly expected to be a possibility.

> Because varasm will only output type_size bytes,
> this seems only to select a string section
> but the last nul byte will not be assembled.
> Something that is not contained in the type_size
> should not be assembled.
> 
> What exactly do you want to accomplish?

Based on the previous (say, gcc-7 or gcc-8) code base, arrange
for most Ada string literals to end up in a mergeable section
on ELF targets.

In at least gcc-7 and gcc-8, our gigi adjustment + the
patch I posted achieve this effect.

For example, for the following source:

procedure Hello is
  procedure Process (S : String) with Import, Convention => C;
begin
 Process ("1234");
end;

Without the gigi change, I get (x86_64-linux,
-O2 -fmerge-all-constants):

  .section .rodata
  .LC0:
      .ascii "1234"

Regular section, no terminating zero.

The gigi change alone gets us the terminating nul (*),
but not yet a mergeable section, because the extra nul
isn't covered by the type bounds:

  .section .rodata
  .LC0:
      .string "1234"

With the varasm change on top, checking beyond the
type bounds when the string constant isn't an initializer,
we get:

  .section .rodata.str1.1,"aMS",@progbits,1
  .LC0:
      .string "1234"

(*) The terminating zero, part of the string value,
not spanned by the type bounds, gets assembled through:

assemble_constant_contents (...)
...
size = get_constant_size (exp);

then:

get_constant_size (tree exp)

  size = int_size_in_bytes (TREE_TYPE (exp));
  if (TREE_CODE (exp) == STRING_CST)
    size = MAX (TREE_STRING_LENGTH (exp), size);

Again, this is just providing more context on the
original patch I posted, based on your questions.

I understand there are proposals to change things
pretty significantly in this area, so ...

now on to your next messages and ideas :-) 


Olivier
Bernd Edlinger Aug. 7, 2018, 12:56 p.m. UTC | #8
Hi,

I realized that the previous version did not handle
zero terminated Ada constants correctly as in

$ cat hello.adb
procedure Hello is
    procedure puts  (S : String) with Import, Convention => C;
    X : String(1..8) := "abcdefg" & Ascii.Nul;
begin
   puts ("1234" & Ascii.Nul );
   puts (X);
end;

accidentally strings were doubly nul-terminated, because I forgot to
handle merge string sections in assemble_constant_contents, and
because get_constant_size increased the string literal by one.

Fixed, and added a positive test case for the merging non-zero terminated
strings in C.


Boot-strapped and regtested on x86_64-pc-linux-gnu.
Is it OK for trunk (after pre-condition patches)?


Thanks
Bernd.
gcc:
2018-08-04  Bernd Edlinger  <bernd.edlinger@hotmail.de>

	* varasm.c (output_constant): Add new parameter merge_strings.
	Make strings properly zero terminated in merge string sections.
	(mergeable_string_section): Don't fail if the last char is non-zero.
	(assemble_variable_contents): Handle merge string sections.
	(assemble_variable): Likewise.
	(assemble_constant_contents): Likewise.
	(output_constant_def_contents): Likewise.
	(get_constant_size): Remove special handling of STRING_CSTs.
	(redundant_nul_term_p): New helper function.

testsuite:
2018-08-04  Bernd Edlinger  <bernd.edlinger@hotmail.de>

	* gcc.dg/merge-all-constants-1.c: Adjust test.
	* gcc.dg/merge-all-constants-2.c: New test.

Index: gcc/varasm.c
===================================================================
--- gcc/varasm.c	(revision 263072)
+++ gcc/varasm.c	(working copy)
@@ -111,7 +111,8 @@ static int compare_constant (const tree, const tre
 static void output_constant_def_contents (rtx);
 static void output_addressed_constants (tree);
 static unsigned HOST_WIDE_INT output_constant (tree, unsigned HOST_WIDE_INT,
-					       unsigned int, bool);
+					       unsigned int, bool,
+					       bool = false);
 static void globalize_decl (tree);
 static bool decl_readonly_section_1 (enum section_category);
 #ifdef BSS_SECTION_ASM_OP
@@ -827,7 +828,7 @@ mergeable_string_section (tree decl ATTRIBUTE_UNUS
 	  unit = GET_MODE_SIZE (mode);
 
 	  /* Check for embedded NUL characters.  */
-	  for (i = 0; i < len; i += unit)
+	  for (i = 0; i < len - unit; i += unit)
 	    {
 	      for (j = 0; j < unit; j++)
 		if (str[i + j] != '\0')
@@ -2117,7 +2118,7 @@ assemble_noswitch_variable (tree decl, const char
 
 static void
 assemble_variable_contents (tree decl, const char *name,
-			    bool dont_output_data)
+			    bool dont_output_data, bool merge_strings = false)
 {
   /* Do any machine/system dependent processing of the object.  */
 #ifdef ASM_DECLARE_OBJECT_NAME
@@ -2140,7 +2141,7 @@ assemble_variable_contents (tree decl, const char
 	output_constant (DECL_INITIAL (decl),
 			 tree_to_uhwi (DECL_SIZE_UNIT (decl)),
 			 get_variable_align (decl),
-			 false);
+			 false, merge_strings);
       else
 	/* Leave space for it.  */
 	assemble_zeros (tree_to_uhwi (DECL_SIZE_UNIT (decl)));
@@ -2316,7 +2317,9 @@ assemble_variable (tree decl, int top_level ATTRIB
 	switch_to_section (sect);
       if (align > BITS_PER_UNIT)
 	ASM_OUTPUT_ALIGN (asm_out_file, floor_log2 (align / BITS_PER_UNIT));
-      assemble_variable_contents (decl, name, dont_output_data);
+      assemble_variable_contents (decl, name, dont_output_data,
+				  sect->common.flags & SECTION_MERGE
+				  && sect->common.flags & SECTION_STRINGS);
       if (asan_protected)
 	{
 	  unsigned HOST_WIDE_INT int size
@@ -3300,12 +3303,7 @@ get_constant_section (tree exp, unsigned int align
 static HOST_WIDE_INT
 get_constant_size (tree exp)
 {
-  HOST_WIDE_INT size;
-
-  size = int_size_in_bytes (TREE_TYPE (exp));
-  if (TREE_CODE (exp) == STRING_CST)
-    size = MAX (TREE_STRING_LENGTH (exp), size);
-  return size;
+  return int_size_in_bytes (TREE_TYPE (exp));
 }
 
 /* Subroutine of output_constant_def:
@@ -3468,7 +3466,8 @@ maybe_output_constant_def_contents (struct constan
    constant's alignment in bits.  */
 
 static void
-assemble_constant_contents (tree exp, const char *label, unsigned int align)
+assemble_constant_contents (tree exp, const char *label, unsigned int align,
+			    bool merge_strings = false)
 {
   HOST_WIDE_INT size;
 
@@ -3478,7 +3477,7 @@ static void
   targetm.asm_out.declare_constant_name (asm_out_file, label, exp, size);
 
   /* Output the value of EXP.  */
-  output_constant (exp, size, align, false);
+  output_constant (exp, size, align, false, merge_strings);
 
   targetm.asm_out.decl_end ();
 }
@@ -3519,10 +3518,13 @@ output_constant_def_contents (rtx symbol)
 		   || (VAR_P (decl) && DECL_IN_CONSTANT_POOL (decl))
 		   ? DECL_ALIGN (decl)
 		   : symtab_node::get (decl)->definition_alignment ());
-      switch_to_section (get_constant_section (exp, align));
+      section *sect = get_constant_section (exp, align);
+      switch_to_section (sect);
       if (align > BITS_PER_UNIT)
 	ASM_OUTPUT_ALIGN (asm_out_file, floor_log2 (align / BITS_PER_UNIT));
-      assemble_constant_contents (exp, XSTR (symbol, 0), align);
+      assemble_constant_contents (exp, XSTR (symbol, 0), align,
+				  sect->common.flags & SECTION_MERGE
+				  && sect->common.flags & SECTION_STRINGS);
       if (asan_protected)
 	{
 	  HOST_WIDE_INT size = get_constant_size (exp);
@@ -4786,6 +4789,19 @@ static unsigned HOST_WIDE_INT
 output_constructor (tree, unsigned HOST_WIDE_INT, unsigned int, bool,
 		    oc_outer_state *);
 
+/* Find out if a string P of length LEN and character size UNIT
+   has double nul termination.
+   The string is already known to have a nul termination at
+   offset LEN.  Find out if there is already a NUL termination
+   at offset LEN - UNIT.  */
+
+static bool
+redundant_nul_term_p (const char *p, unsigned len, unsigned unit)
+{
+  gcc_assert(len >= unit);
+  return !memcmp (p + len, p + len - unit, unit);
+}
+
 /* Output assembler code for constant EXP, with no label.
    This includes the pseudo-op such as ".int" or ".byte", and a newline.
    Assumes output_addressed_constants has been done on EXP already.
@@ -4812,7 +4850,7 @@ output_constructor (tree, unsigned HOST_WIDE_INT,
 
 static unsigned HOST_WIDE_INT
 output_constant (tree exp, unsigned HOST_WIDE_INT size, unsigned int align,
-		 bool reverse)
+		 bool reverse, bool merge_strings /* = false */)
 {
   enum tree_code code;
   unsigned HOST_WIDE_INT thissize;
@@ -4940,8 +4978,13 @@ output_constant (tree exp, unsigned HOST_WIDE_INT
 	case CONSTRUCTOR:
 	  return output_constructor (exp, size, align, reverse, NULL);
 	case STRING_CST:
-	  thissize
-	    = MIN ((unsigned HOST_WIDE_INT)TREE_STRING_LENGTH (exp), size);
+	  thissize = (unsigned HOST_WIDE_INT)TREE_STRING_LENGTH (exp);
+	  if (thissize > size && merge_strings
+	      && !redundant_nul_term_p (TREE_STRING_POINTER (exp),
+					size, thissize - size))
+	    ;
+	  else
+	    thissize = MIN (thissize, size);
 	  gcc_checking_assert (check_string_literal (exp, size));
 	  assemble_string (TREE_STRING_POINTER (exp), thissize);
 	  break;
Index: gcc/testsuite/gcc.dg/merge-all-constants-1.c
===================================================================
--- gcc/testsuite/gcc.dg/merge-all-constants-1.c	(revision 263072)
+++ gcc/testsuite/gcc.dg/merge-all-constants-1.c	(working copy)
@@ -1,8 +1,8 @@
 /* { dg-do compile } */
 /* { dg-options "-w -O2 -fmerge-all-constants" } */
 
-const char str1[36] = "0123456789abcdefghijklmnopqrstuvwxyz";
+const char str1[36] = "\000123456789abcdefghijklmnopqrstuvwxyz";
 const char str2[38] = "0123456789abcdefghijklmnopqrstuvwxyz";
-const char str3[10] = "0123456789abcdefghijklmnopqrstuvwxyz";
+const char str3[10] = "\000123456789abcdefghijklmnopqrstuvwxyz";
 
 /* { dg-final { scan-assembler-not "\.rodata\.str" } } */
Index: gcc/testsuite/gcc.dg/merge-all-constants-2.c
===================================================================
--- gcc/testsuite/gcc.dg/merge-all-constants-2.c	(revision 0)
+++ gcc/testsuite/gcc.dg/merge-all-constants-2.c	(working copy)
--- /dev/null	2018-07-24 12:25:16.122475518 +0200
+++ gcc/testsuite/gcc.dg/merge-all-constants-2.c	2018-08-07 10:14:35.384190377 +0200
@@ -0,0 +1,8 @@
+/* { dg-do compile } */
+/* { dg-options "-w -O2 -fmerge-all-constants" } */
+
+const char str1[36] = "0123456789abcdefghijklmnopqrstuvwxyz";
+const char str2[37] = "0123456789abcdefghijklmnopqrstuvwxyz";
+const char str3[10] = "0123456789abcdefghijklmnopqrstuvwxyz";
+
+/* { dg-final { scan-assembler-not "\.rodata\[\n\r\]" } } */
Bernd Edlinger Aug. 7, 2018, 1:07 p.m. UTC | #9
Hi Olivier,

On 08/06/18 20:01, Olivier Hainque wrote:
> Hi Bernd,
> 
> Things are moving fast so I'll answer your
> messages incrementally :-)
> 
>> On 3 Aug 2018, at 20:37, Bernd Edlinger <bernd.edlinger@hotmail.de> wrote:
> [...]
>> Oh, how interesting.
>>
>> My patch only intends to add a dummy byte that is stripped again
>> in the assembly.  The purpose is to make it trivial to find out if
>> a string constant is NUL-terminated in  the middle-end by comparing
>> TYPE_SIZE_UNIT (TREE_TYPE (decl)) with TREE_STRING_LENGTH (decl).
> 
> Ah, Ok. I hadn't quite realized that yet.
> 
>> TYPE_SIZE_UNIT >= TREE_STRING_LENGTH means zero terminated
>> TYPE_SIZE_UNIT < TREE_STRING_LENGTH means not necessarily zero terminated,
> 
> That is opposite to the situation we were having
> for Ada before the patch series:
> 
>> Such strings shall never chop off more than one nul character.
>> Ada strings are generally not zero terminated, right?
> 
> Right. Strings have explicit bounds in Ada, and a NUL character
> is like any other. It can appear anywhere. For example, the
> excerpt below constructs a string of 5 characters, bounds 1 to 5,
> with a nul character in second position:
> 
>   X : String (1 .. 5) := "1" & Ascii.Nul & "345";
> 
> ('&' is the catenation operator)
> 
> The problem we had was that string literals not explicitly
> NUL terminated (as in X : String := "123" & Ascii.NUL) would
> never go in mergeable sections. For example:
> 
> Ada.Text_IO.Put_Line ("Hello World!");
> 
> Ada has so called "generic" units that are sometimes
> instanciated many many times so there is real interest
> in improving the situation there.
> 
>> So in your search loop you would not have to look after
>> type_len, because that byte would be guaranteed to be zero.
>>
>> That is if we agree that we want to restrict the string constants
>> in that way as I proposed.
>>
>> In the case str_len > type_len I do not understand if that is right.
> 
> In your new model, I don't know what sense it would make, indeed.
> 
> In the previous model, it was clearly expected to be a possibility.
> 
>> Because varasm will only output type_size bytes,
>> this seems only to select a string section
>> but the last nul byte will not be assembled.
>> Something that is not contained in the type_size
>> should not be assembled.
>>
>> What exactly do you want to accomplish?
> 
> Based on the previous (say, gcc-7 or gcc-8) code base, arrange
> for most Ada string literals to end up in a mergeable section
> on ELF targets.
> 
> In at least gcc-7 and gcc-8, our gigi adjustment + the
> patch I posted achieve this effect.
> 
> For example, for the following source:
> 
> procedure Hello is
>    procedure Process (S : String) with Import, Convention => C;
> begin
>   Process ("1234");
> end;
> 

Yes, thanks for this example,

$ cat hello.adb
procedure Hello is
    procedure puts  (S : String) with Import, Convention => C;
    X : String(1..8) := "abcdefg" & Ascii.Nul;
begin
   puts ("1234" & Ascii.Nul );
   puts (X);
end;

produces again identical output with the latest patch that I posted
right now.



> Without the gigi change, I get (x86_64-linux,
> -O2 -fmerge-all-constants):
> 
>    .section .rodata
>    .LC0:
>        .ascii "1234"
> 
> Regular section, no terminating zero.
> 
> The gigi change alone gets us the terminating nul (*),
> but not yet a mergeable section, because the extra nul
> isn't covered by the type bounds:
> 
>    .section .rodata
>    .LC0:
>        .string "1234"
> 
> With the varasm change on top, checking beyond the
> type bounds when the string constant isn't an initializer,
> we get:
> 
>    .section .rodata.str1.1,"aMS",@progbits,1
>    .LC0:
>        .string "1234"
> 
> (*) The terminating zero, part of the string value,
> not spanned by the type bounds, gets assembled through:
> 
> assemble_constant_contents (...)
> ...
> size = get_constant_size (exp);
> 
> then:
> 
> get_constant_size (tree exp)
> 
>    size = int_size_in_bytes (TREE_TYPE (exp));
>    if (TREE_CODE (exp) == STRING_CST)
>      size = MAX (TREE_STRING_LENGTH (exp), size);
> 
> Again, this is just providing more context on the
> original patch I posted, based on your questions.
> 

Ah, good point. That needs to be changed...
Actually I would like to have get_constant_size
return just the type size.

> I understand there are proposals to change things
> pretty significantly in this area, so ...
> 
> now on to your next messages and ideas :-)
> 
> 

When I try this example:

$ cat array9.adb
-- { dg-do run }

procedure Array9 is

   V1 : String(1..10) := "1234567890";
   V2 : String(1..-1) := "";

   procedure Compare (S : String) is
   begin
     if S'Size /= 8*S'Length then
       raise Program_Error;
     end if;
   end;

begin
   Compare ("");
   Compare ("1234");
   Compare (V1);
   Compare (V2);
end;

I see that "1234" is put in the merge section,
but V1 is not.  Maybe because of the alignment requirement?

But it sould not be much different from the C test case,
which is now able to merge the non-zero terminated strings.


Thanks
Bernd.
Olivier Hainque Aug. 7, 2018, 4:40 p.m. UTC | #10
Hi Bernd,

(Thanks for your interest in the Ada case as well :)

> On 7 Aug 2018, at 15:07, Bernd Edlinger <bernd.edlinger@hotmail.de> wrote:

> When I try this example:
> 
> $ cat array9.adb
> -- { dg-do run }
> 
> procedure Array9 is
> 
>   V1 : String(1..10) := "1234567890";
>   V2 : String(1..-1) := "";
> 
>   procedure Compare (S : String) is
>   begin
>     if S'Size /= 8*S'Length then
>       raise Program_Error;
>     end if;
>   end;
> 
> begin
>   Compare ("");
>   Compare ("1234");
>   Compare (V1);
>   Compare (V2);
> end;
> 
> I see that "1234" is put in the merge section,
> but V1 is not.  Maybe because of the alignment requirement?
> 
> But it sould not be much different from the C test case,
> which is now able to merge the non-zero terminated strings.

I'm happy to have a look. I'd just like to make
sure I'll be looking at the right thing:

I would need to start from trunk + the patches you
referenced at

  https://gcc.gnu.org/ml/gcc-patches/2018-08/msg00339.html

+ the last one you sent at

  https://gcc.gnu.org/ml/gcc-patches/2018-08/msg00481.html

Correct ?

Can you then confirm the target triplet and
compilation options ?

Thanks in advance!

Olivier
Bernd Edlinger Aug. 7, 2018, 5:10 p.m. UTC | #11
On 08/07/18 18:40, Olivier Hainque wrote:
> Hi Bernd,
> 
> (Thanks for your interest in the Ada case as well :)
> 
>> On 7 Aug 2018, at 15:07, Bernd Edlinger <bernd.edlinger@hotmail.de> wrote:
> 
>> When I try this example:
>>
>> $ cat array9.adb
>> -- { dg-do run }
>>
>> procedure Array9 is
>>
>>    V1 : String(1..10) := "1234567890";
>>    V2 : String(1..-1) := "";
>>
>>    procedure Compare (S : String) is
>>    begin
>>      if S'Size /= 8*S'Length then
>>        raise Program_Error;
>>      end if;
>>    end;
>>
>> begin
>>    Compare ("");
>>    Compare ("1234");
>>    Compare (V1);
>>    Compare (V2);
>> end;
>>
>> I see that "1234" is put in the merge section,
>> but V1 is not.  Maybe because of the alignment requirement?
>>
>> But it sould not be much different from the C test case,
>> which is now able to merge the non-zero terminated strings.
> 
> I'm happy to have a look. I'd just like to make
> sure I'll be looking at the right thing:
> 
> I would need to start from trunk + the patches you
> referenced at
> 
>    https://gcc.gnu.org/ml/gcc-patches/2018-08/msg00339.html
> 
> + the last one you sent at
> 
>    https://gcc.gnu.org/ml/gcc-patches/2018-08/msg00481.html
> 
> Correct ?
> 

Yes,

Please use the latest patch patch here:
[PATCH] Check the STRING_CSTs in varasm.c
https://gcc.gnu.org/ml/gcc-patches/2018-08/msg00361.html

There is also a Fortran patch:
[PATCH] Create internally nul terminated string literals in fortan FE
https://gcc.gnu.org/ml/fortran/2018-08/msg00000.html

And a JIT FE patch:
[PATCH] Fix not properly nul-terminated string constants in JIT
https://gcc.gnu.org/ml/gcc-patches/2018-08/msg00370.html


My host triplet is x86_64-pc-linux-gnu (I let config.guess determine it).
I use gmp, mpfr, mpc, isl in-tree.
And configure simply with ../gcc-trunk/configure --prefix=/home/ed/gnu/install --enable-languages=all


Thanks
Bernd.

> Can you then confirm the target triplet and
> compilation options ?
> 
> Thanks in advance!
> 
> Olivier
> 
> 
> 
>
Olivier Hainque Aug. 9, 2018, 1:08 p.m. UTC | #12
Hi Bernd,

> On 07 Aug 2018, at 19:10, Bernd Edlinger <bernd.edlinger@hotmail.de> wrote:
> 
>>> procedure Array9 is

>>> I see that "1234" is put in the merge section,
>>> but V1 is not.  Maybe because of the alignment requirement?
>>> 
>>> But it sould not be much different from the C test case,
>>> which is now able to merge the non-zero terminated strings.
>> 
>> I'm happy to have a look.

An effect of alignments and inlining I think.
I see expected behavior with calls to an external program
instead of the local one.

I tried the testcases we have here plus a couple
of variations, e.g. with strings explicitly terminated
by multiple ascii.nul, and they all seem to be working
fine :-)

Regarding Ada testcases for the dejagnu suite, I'm not
quite sure how to go about them.

For starters at least, we could for example contemplate
something like:

-- { dg-do compile }                                                            
-- { dg-options "-O1 -fmerge-all-constants" }                                   

procedure String_Merge1 is
   procedure Process (X : String);
   pragma Import (Ada, Process);
begin
   Process ("1234");
   Process ("ABCD" & Ascii.NUL);
end;

-- We expect something like:                                                    

-- .section  .rodata.str1.1,"aMS",@progbits,1                                   
-- .LC2:                                                                        
--     .string "1234"                                                           
-- .LC3:                                                                        
--     .string "ABCD"                                                           

-- Not clear how to match that generally enough.                                

-- { dg-final { scan-assembler ".rodata.str1.1,\"aMS\"\[^\n\r\]*\n\.LC.:\n\[^\n\
\r\]*\.string \"1234\"\n\.LC.:\n\[^\n\r\]*\.string \"ABCD\"\n"  } }             


As the comment suggests, I'm not clear if that's robust/general enough.
I suppose we probably can't always expect .string, for instance. I suppose
we could also have other bits between the .section and the literals.

Would be fine as long as it's not something like ".section  .rodata".

Olivier
Bernd Edlinger Aug. 17, 2018, 5:35 p.m. UTC | #13
Hi,

this is an update of the patch, which has just two Ada test cases
added for string merging which are based on Olivier's suggested test case.

Note that except the "Check STRING_CSTs in varasm.c" patch series,
there is now another patch needed to work around issues with 71625:

[PATCH] Call braced_list_to_string after array size is fixed
https://gcc.gnu.org/ml/gcc-patches/2018-08/msg01015.html


Thanks
Bernd.


On 08/07/18 14:55, Bernd Edlinger wrote:
> Hi,
> 
> I realized that the previous version did not handle
> zero terminated Ada constants correctly as in
> 
> $ cat hello.adb
> procedure Hello is
>     procedure puts  (S : String) with Import, Convention => C;
>     X : String(1..8) := "abcdefg" & Ascii.Nul;
> begin
>    puts ("1234" & Ascii.Nul );
>    puts (X);
> end;
> 
> accidentally strings were doubly nul-terminated, because I forgot to
> handle merge string sections in assemble_constant_contents, and
> because get_constant_size increased the string literal by one.
> 
> Fixed, and added a positive test case for the merging non-zero terminated
> strings in C.
> 
> 
> Boot-strapped and regtested on x86_64-pc-linux-gnu.
> Is it OK for trunk (after pre-condition patches)?
> 
> 
> Thanks
> Bernd.
gcc:
2018-08-04  Bernd Edlinger  <bernd.edlinger@hotmail.de>

	* varasm.c (output_constant): Add new parameter merge_strings.
	Make strings properly zero terminated in merge string sections.
	(mergeable_string_section): Don't fail if the last char is non-zero.
	(assemble_variable_contents): Handle merge string sections.
	(assemble_variable): Likewise.
	(assemble_constant_contents): Likewise.
	(output_constant_def_contents): Likewise.
	(get_constant_size): Remove special handling of STRING_CSTs.
	(redundant_nul_term_p): New helper function.

testsuite:
2018-08-04  Bernd Edlinger  <bernd.edlinger@hotmail.de>

	* gnat.dg/string_merge1.adb: New test.
	* gnat.dg/string_merge2.adb: New test.
	* gcc.dg/merge-all-constants-1.c: Adjust test.
	* gcc.dg/merge-all-constants-2.c: New test.

Index: gcc/varasm.c
===================================================================
--- gcc/varasm.c	(revision 263072)
+++ gcc/varasm.c	(working copy)
@@ -111,7 +111,8 @@ static int compare_constant (const tree, const tre
 static void output_constant_def_contents (rtx);
 static void output_addressed_constants (tree);
 static unsigned HOST_WIDE_INT output_constant (tree, unsigned HOST_WIDE_INT,
-					       unsigned int, bool);
+					       unsigned int, bool,
+					       bool = false);
 static void globalize_decl (tree);
 static bool decl_readonly_section_1 (enum section_category);
 #ifdef BSS_SECTION_ASM_OP
@@ -827,7 +828,7 @@ mergeable_string_section (tree decl ATTRIBUTE_UNUS
 	  unit = GET_MODE_SIZE (mode);
 
 	  /* Check for embedded NUL characters.  */
-	  for (i = 0; i < len; i += unit)
+	  for (i = 0; i < len - unit; i += unit)
 	    {
 	      for (j = 0; j < unit; j++)
 		if (str[i + j] != '\0')
@@ -2117,7 +2118,7 @@ assemble_noswitch_variable (tree decl, const char
 
 static void
 assemble_variable_contents (tree decl, const char *name,
-			    bool dont_output_data)
+			    bool dont_output_data, bool merge_strings = false)
 {
   /* Do any machine/system dependent processing of the object.  */
 #ifdef ASM_DECLARE_OBJECT_NAME
@@ -2140,7 +2141,7 @@ assemble_variable_contents (tree decl, const char
 	output_constant (DECL_INITIAL (decl),
 			 tree_to_uhwi (DECL_SIZE_UNIT (decl)),
 			 get_variable_align (decl),
-			 false);
+			 false, merge_strings);
       else
 	/* Leave space for it.  */
 	assemble_zeros (tree_to_uhwi (DECL_SIZE_UNIT (decl)));
@@ -2316,7 +2317,9 @@ assemble_variable (tree decl, int top_level ATTRIB
 	switch_to_section (sect);
       if (align > BITS_PER_UNIT)
 	ASM_OUTPUT_ALIGN (asm_out_file, floor_log2 (align / BITS_PER_UNIT));
-      assemble_variable_contents (decl, name, dont_output_data);
+      assemble_variable_contents (decl, name, dont_output_data,
+				  sect->common.flags & SECTION_MERGE
+				  && sect->common.flags & SECTION_STRINGS);
       if (asan_protected)
 	{
 	  unsigned HOST_WIDE_INT int size
@@ -3300,12 +3303,7 @@ get_constant_section (tree exp, unsigned int align
 static HOST_WIDE_INT
 get_constant_size (tree exp)
 {
-  HOST_WIDE_INT size;
-
-  size = int_size_in_bytes (TREE_TYPE (exp));
-  if (TREE_CODE (exp) == STRING_CST)
-    size = MAX (TREE_STRING_LENGTH (exp), size);
-  return size;
+  return int_size_in_bytes (TREE_TYPE (exp));
 }
 
 /* Subroutine of output_constant_def:
@@ -3468,7 +3466,8 @@ maybe_output_constant_def_contents (struct constan
    constant's alignment in bits.  */
 
 static void
-assemble_constant_contents (tree exp, const char *label, unsigned int align)
+assemble_constant_contents (tree exp, const char *label, unsigned int align,
+			    bool merge_strings = false)
 {
   HOST_WIDE_INT size;
 
@@ -3478,7 +3477,7 @@ static void
   targetm.asm_out.declare_constant_name (asm_out_file, label, exp, size);
 
   /* Output the value of EXP.  */
-  output_constant (exp, size, align, false);
+  output_constant (exp, size, align, false, merge_strings);
 
   targetm.asm_out.decl_end ();
 }
@@ -3519,10 +3518,13 @@ output_constant_def_contents (rtx symbol)
 		   || (VAR_P (decl) && DECL_IN_CONSTANT_POOL (decl))
 		   ? DECL_ALIGN (decl)
 		   : symtab_node::get (decl)->definition_alignment ());
-      switch_to_section (get_constant_section (exp, align));
+      section *sect = get_constant_section (exp, align);
+      switch_to_section (sect);
       if (align > BITS_PER_UNIT)
 	ASM_OUTPUT_ALIGN (asm_out_file, floor_log2 (align / BITS_PER_UNIT));
-      assemble_constant_contents (exp, XSTR (symbol, 0), align);
+      assemble_constant_contents (exp, XSTR (symbol, 0), align,
+				  sect->common.flags & SECTION_MERGE
+				  && sect->common.flags & SECTION_STRINGS);
       if (asan_protected)
 	{
 	  HOST_WIDE_INT size = get_constant_size (exp);
@@ -4786,6 +4789,19 @@ static unsigned HOST_WIDE_INT
 output_constructor (tree, unsigned HOST_WIDE_INT, unsigned int, bool,
 		    oc_outer_state *);
 
+/* Find out if a string P of length LEN and character size UNIT
+   has double nul termination.
+   The string is already known to have a nul termination at
+   offset LEN.  Find out if there is already a NUL termination
+   at offset LEN - UNIT.  */
+
+static bool
+redundant_nul_term_p (const char *p, unsigned len, unsigned unit)
+{
+  gcc_assert(len >= unit);
+  return !memcmp (p + len, p + len - unit, unit);
+}
+
 /* Output assembler code for constant EXP, with no label.
    This includes the pseudo-op such as ".int" or ".byte", and a newline.
    Assumes output_addressed_constants has been done on EXP already.
@@ -4812,7 +4850,7 @@ output_constructor (tree, unsigned HOST_WIDE_INT,
 
 static unsigned HOST_WIDE_INT
 output_constant (tree exp, unsigned HOST_WIDE_INT size, unsigned int align,
-		 bool reverse)
+		 bool reverse, bool merge_strings /* = false */)
 {
   enum tree_code code;
   unsigned HOST_WIDE_INT thissize;
@@ -4940,8 +4978,13 @@ output_constant (tree exp, unsigned HOST_WIDE_INT
 	case CONSTRUCTOR:
 	  return output_constructor (exp, size, align, reverse, NULL);
 	case STRING_CST:
-	  thissize
-	    = MIN ((unsigned HOST_WIDE_INT)TREE_STRING_LENGTH (exp), size);
+	  thissize = (unsigned HOST_WIDE_INT)TREE_STRING_LENGTH (exp);
+	  if (thissize > size && merge_strings
+	      && !redundant_nul_term_p (TREE_STRING_POINTER (exp),
+					size, thissize - size))
+	    ;
+	  else
+	    thissize = MIN (thissize, size);
 	  gcc_checking_assert (check_string_literal (exp, size));
 	  assemble_string (TREE_STRING_POINTER (exp), thissize);
 	  break;
Index: gcc/testsuite/gnat.dg/string_merge1.adb
===================================================================
--- gcc/testsuite/gnat.dg/string_merge1.adb	(revision 0)
+++ gcc/testsuite/gnat.dg/string_merge1.adb	(working copy)
@@ -0,0 +1,19 @@
+-- { dg-do compile }
+-- { dg-options "-O1 -fmerge-all-constants" }
+
+procedure String_Merge1 is
+   procedure Process (X : String);
+   pragma Import (Ada, Process);
+begin
+   Process ("ABCD");
+end;
+
+-- We expect something like:
+
+-- .section  .rodata.str1.1,"aMS",@progbits,1
+-- .LC1:
+--     .string "ABCD"
+
+-- { dg-final { scan-assembler-times "\\.rodata\\.str" 1 } }
+-- { dg-final { scan-assembler-times "\\.string" 1 } }
+-- { dg-final { scan-assembler-times "\"ABCD\"" 1 } }
Index: gcc/testsuite/gnat.dg/string_merge2.adb
===================================================================
--- gcc/testsuite/gnat.dg/string_merge2.adb	(revision 0)
+++ gcc/testsuite/gnat.dg/string_merge2.adb	(working copy)
@@ -0,0 +1,19 @@
+-- { dg-do compile }
+-- { dg-options "-O1 -fmerge-all-constants" }
+
+procedure String_Merge2 is
+   procedure Process (X : String);
+   pragma Import (Ada, Process);
+begin
+   Process ("ABCD" & Ascii.NUL);
+end;
+
+-- We expect something like:
+
+-- .section  .rodata.str1.1,"aMS",@progbits,1
+-- .LC1:
+--     .string "ABCD"
+
+-- { dg-final { scan-assembler-times "\\.rodata\\.str" 1 } }
+-- { dg-final { scan-assembler-times "\\.string" 1 } }
+-- { dg-final { scan-assembler-times "\"ABCD\"" 1 } }
Index: gcc/testsuite/gcc.dg/merge-all-constants-1.c
===================================================================
--- gcc/testsuite/gcc.dg/merge-all-constants-1.c	(revision 263072)
+++ gcc/testsuite/gcc.dg/merge-all-constants-1.c	(working copy)
@@ -1,8 +1,8 @@
 /* { dg-do compile } */
 /* { dg-options "-w -O2 -fmerge-all-constants" } */
 
-const char str1[36] = "0123456789abcdefghijklmnopqrstuvwxyz";
+const char str1[36] = "\000123456789abcdefghijklmnopqrstuvwxyz";
 const char str2[38] = "0123456789abcdefghijklmnopqrstuvwxyz";
-const char str3[10] = "0123456789abcdefghijklmnopqrstuvwxyz";
+const char str3[10] = "\000123456789abcdefghijklmnopqrstuvwxyz";
 
-/* { dg-final { scan-assembler-not "\.rodata\.str" } } */
+/* { dg-final { scan-assembler-not "\\.rodata\\.str" } } */
Index: gcc/testsuite/gcc.dg/merge-all-constants-2.c
===================================================================
--- gcc/testsuite/gcc.dg/merge-all-constants-2.c	(revision 0)
+++ gcc/testsuite/gcc.dg/merge-all-constants-2.c	(working copy)
@@ -0,0 +1,8 @@
+/* { dg-do compile } */
+/* { dg-options "-w -O2 -fmerge-all-constants" } */
+
+const char str1[36] = "0123456789abcdefghijklmnopqrstuvwxyz";
+const char str2[37] = "0123456789abcdefghijklmnopqrstuvwxyz";
+const char str3[10] = "0123456789abcdefghijklmnopqrstuvwxyz";
+
+/* { dg-final { scan-assembler-not "\\.rodata\[\n\r\]" } } */
diff mbox series

Patch

2018-07-31  Bernd Edlinger  <bernd.edlinger@hotmail.de>

	* gcc-interface/trans.c (gnat_to_gnu): Make string literal properly
	NUL terminated.
	* gcc-interface/utils2.c (expand_sloc): Likewise.

diff -pur gcc/ada/gcc-interface/trans.c gcc/ada/gcc-interface/trans.c
--- gcc/ada/gcc-interface/trans.c	2018-07-17 10:10:04.000000000 +0200
+++ gcc/ada/gcc-interface/trans.c	2018-07-31 11:16:27.350728886 +0200
@@ -6079,7 +6079,7 @@  gnat_to_gnu (Node_Id gnat_node)
 	     where GCC will want to treat it as a C string.  */
 	  string[i] = 0;
 
-	  gnu_result = build_string (length, string);
+	  gnu_result = build_string (length + 1, string);
 
 	  /* Strings in GCC don't normally have types, but we want
 	     this to not be converted to the array type.  */
diff -pur gcc/ada/gcc-interface/utils2.c gcc/ada/gcc-interface/utils2.c
--- gcc/ada/gcc-interface/utils2.c	2017-12-21 07:57:41.000000000 +0100
+++ gcc/ada/gcc-interface/utils2.c	2018-07-31 11:44:01.517117923 +0200
@@ -1844,7 +1844,7 @@  expand_sloc (Node_Id gnat_node, tree *fi
     }
 
   const int len = strlen (str);
-  *filename = build_string (len, str);
+  *filename = build_string (len + 1, str);
   TREE_TYPE (*filename) = build_array_type (char_type_node,
 					    build_index_type (size_int (len)));
   *line = build_int_cst (NULL_TREE, line_number);