diff mbox series

LTO/WPA: Ensure that output_offload_tables only writes table once [PR116535]

Message ID 8a61e4a0-87c0-4df5-af8b-6178581b0cab@baylibre.com
State New
Headers show
Series LTO/WPA: Ensure that output_offload_tables only writes table once [PR116535] | expand

Commit Message

Tobias Burnus Sept. 2, 2024, 11:44 a.m. UTC
The attached patch tries to fix the issue exposed by the PR:

The main ingredient is partitioning of the LTO work, e.g. by using 
-flto-partition=max.

With -flto=2 (or higher or when a jobserver has been detected), not only 
the LTO part is run in parallel but also the creation of the ltrans 
files itself, i.e. gcc/lto/lto.cc's stream_out_partitions forks multiple 
processes to write those files concurrently (here: -flto=2 means two 
processes, each writing about half of the partitions).

For each partition, output_offload_tables is called – which in principle 
would add the offload tables to each file. To prevent this, in flag_wpa 
mode, the tables were freed. That solves the WPA problem, but only if 
all partitions are written by a single process (e.g. -flto=1). If not, 
the data is duplicated and only the data belonging to the fork is modified.

This patch moves the logic to gcc/lto/lto.cc and sets a global variable 
to ensure that it is only output for the first partition, independently 
whether there is only one or several processes writing the ltrans file, 
trying to follow what Richard proposed in the PR?

The patch has been tested on x86-64-gnu-linux with nvptx offloading, but 
I should do a full bootstrap+regtest next.

Comments, suggestions, remarks, approval?

Tobias

Comments

Richard Biener Sept. 2, 2024, 11:58 a.m. UTC | #1
On Mon, 2 Sep 2024, Tobias Burnus wrote:

> The attached patch tries to fix the issue exposed by the PR:
> 
> The main ingredient is partitioning of the LTO work, e.g. by using
> -flto-partition=max.
> 
> With -flto=2 (or higher or when a jobserver has been detected), not only the
> LTO part is run in parallel but also the creation of the ltrans files itself,
> i.e. gcc/lto/lto.cc's stream_out_partitions forks multiple processes to write
> those files concurrently (here: -flto=2 means two processes, each writing
> about half of the partitions).
> 
> For each partition, output_offload_tables is called – which in principle would
> add the offload tables to each file. To prevent this, in flag_wpa mode, the
> tables were freed. That solves the WPA problem, but only if all partitions are
> written by a single process (e.g. -flto=1). If not, the data is duplicated and
> only the data belonging to the fork is modified.
> 
> This patch moves the logic to gcc/lto/lto.cc and sets a global variable to
> ensure that it is only output for the first partition, independently whether
> there is only one or several processes writing the ltrans file, trying to
> follow what Richard proposed in the PR?
> 
> The patch has been tested on x86-64-gnu-linux with nvptx offloading, but I
> should do a full bootstrap+regtest next.
> 
> Comments, suggestions, remarks, approval?

Hmm, I can't really follow how and where it's currently decided whether to
output offload tables for the LTRANS units but instead of an odd global
variable would it be possible to pass that down as a flag or, 
alternatively encode that flag in the representation for the LTRANS
partition?  I suppose that's the out_decl_state?

Or at least make it more "obvious" and guard the only caller to
output_offload_tables with if (offload_output_tables_p) instead of
the check in output_offload_tables itself?

> Tobias
> 
>
diff mbox series

Patch

LTO/WPA: Ensure that output_offload_tables only writes table once [PR116535]

When ltrans was written concurrently, e.g. via -flto=N (N > 1, assuming
sufficient partiations, e.g., via -flto-partition=max), output_offload_tables
wrote the output tables once per fork.

	PR lto/116535

gcc/ChangeLog:

	* omp-offload.h (offload_output_tables_p): New extern bool var.
	* omp-offload.cc (offload_output_tables_p): Define it with value true.
	* lto-cgraph.cc (output_offload_tables): Only output tables when
	offload_output_tables_p is true.

gcc/lto/ChangeLog:

	* lto.cc (stream_out_partitions_1): Set offload_output_tables_p to false
	except for the first partition.

 gcc/lto-cgraph.cc  | 16 ++++------------
 gcc/lto/lto.cc     |  3 +++
 gcc/omp-offload.cc |  2 ++
 gcc/omp-offload.h  |  1 +
 4 files changed, 10 insertions(+), 12 deletions(-)

diff --git a/gcc/lto-cgraph.cc b/gcc/lto-cgraph.cc
index 6395033ab9d..19ac252e1b4 100644
--- a/gcc/lto-cgraph.cc
+++ b/gcc/lto-cgraph.cc
@@ -1081,8 +1081,10 @@  output_offload_tables (void)
 {
   bool output_requires = (flag_openmp
 			  && (omp_requires_mask & OMP_REQUIRES_TARGET_USED) != 0);
-  if (vec_safe_is_empty (offload_funcs) && vec_safe_is_empty (offload_vars)
-      && !output_requires)
+  if (!offload_output_tables_p
+      || (vec_safe_is_empty (offload_funcs)
+	  && vec_safe_is_empty (offload_vars)
+	  && !output_requires))
     return;
 
   struct lto_simple_output_block *ob
@@ -1139,16 +1141,6 @@  output_offload_tables (void)
 
   streamer_write_uhwi_stream (ob->main_stream, 0);
   lto_destroy_simple_output_block (ob);
-
-  /* In WHOPR mode during the WPA stage the joint offload tables need to be
-     streamed to one partition only.  That's why we free offload_funcs and
-     offload_vars after the first call of output_offload_tables.  */
-  if (flag_wpa)
-    {
-      vec_free (offload_funcs);
-      vec_free (offload_vars);
-      vec_free (offload_ind_funcs);
-    }
 }
 
 /* Verify the partitioning of NODE.  */
diff --git a/gcc/lto/lto.cc b/gcc/lto/lto.cc
index 52dd436fd9a..69c7527d399 100644
--- a/gcc/lto/lto.cc
+++ b/gcc/lto/lto.cc
@@ -58,6 +58,7 @@  along with GCC; see the file COPYING3.  If not see
 #include "builtins.h"
 #include "lto-common.h"
 #include "opts-jobserver.h"
+#include "omp-offload.h"
 
 /* Number of parallel tasks to run.  */
 static int lto_parallelism;
@@ -226,12 +227,14 @@  wait_for_child ()
 static void
 stream_out_partitions_1 (char *temp_filename, int blen, int min, int max)
 {
+   offload_output_tables_p = (min == 0);
    /* Write all the nodes in SET.  */
    for (int p = min; p < max; p ++)
      {
        sprintf (temp_filename + blen, "%u.o", p);
        stream_out (temp_filename, ltrans_partitions[p]->encoder, p);
        ltrans_partitions[p]->encoder = NULL;
+       offload_output_tables_p = false;
      }
 }
 
diff --git a/gcc/omp-offload.cc b/gcc/omp-offload.cc
index 934fbd80bdd..76bfda94217 100644
--- a/gcc/omp-offload.cc
+++ b/gcc/omp-offload.cc
@@ -88,6 +88,8 @@  struct oacc_loop
 /* Holds offload tables with decls.  */
 vec<tree, va_gc> *offload_funcs, *offload_vars, *offload_ind_funcs;
 
+bool offload_output_tables_p = true;
+
 /* Return level at which oacc routine may spawn a partitioned loop, or
    -1 if it is not a routine (i.e. is an offload fn).  */
 
diff --git a/gcc/omp-offload.h b/gcc/omp-offload.h
index d972bb7eafd..2d1d173016c 100644
--- a/gcc/omp-offload.h
+++ b/gcc/omp-offload.h
@@ -29,6 +29,7 @@  extern int oacc_fn_attrib_level (tree attr);
 extern GTY(()) vec<tree, va_gc> *offload_funcs;
 extern GTY(()) vec<tree, va_gc> *offload_vars;
 extern GTY(()) vec<tree, va_gc> *offload_ind_funcs;
+extern bool offload_output_tables_p;
 
 extern void omp_finish_file (void);
 extern void omp_discover_implicit_declare_target (void);