diff mbox series

[v2] LTO/WPA: Ensure that output_offload_tables only writes table once [PR116535]

Message ID 4bb288d2-540f-4d36-b67a-3af6de324c1c@baylibre.com
State New
Headers show
Series [v2] LTO/WPA: Ensure that output_offload_tables only writes table once [PR116535] | expand

Commit Message

Tobias Burnus Sept. 2, 2024, 6:16 p.m. UTC
Hi Richard,

Am 02.09.24 um 13:58 schrieb Richard Biener:
> Hmm, I can't really follow how and where it's currently decided whether to
> output offload tables for the LTRANS units

Before the patch, output_offload_tables is called unconditionally, but 
guarded by the check whether there is anything to output at all. Call trees:

When outputting the .o files, the call is done via ipa_passes → 
ipa_write_summaries → ipa_write_summaries_1.

This calls ipa_write_summaries twice: once for the offload/for-device 
LTO section and once for the host LTO section – and both calls are needed.

For the LTO (lto1, ltrans) step, the call tree starts with: 
do_whole_program_analysis → lto_wpa_write_files → stream_out_partitions 
→ stream_out_partitions_1 → stream_out → ipa_write_optimization_summaries.

Here, stream_out_partitions potentially forks the 
'stream_out_partitions_1' calls. And each stream_out_partitions_1 calls 
for each (of its share) of the partitions 'stream_out' in a loop.

With either code path, the ipa_write... function then calls: write_lto → 
lto_output → output_offload_tables.

> but instead of an odd global
> variable would it be possible to pass that down as a flag or,
> alternatively encode that flag in the representation for the LTRANS
> partition?  I suppose that's the out_decl_state?

Actually, I tried follow your initial suggestion of the PR, but now 
moved to the somewhat clearer out_decl_state.

Tobias

Comments

Richard Biener Sept. 3, 2024, 7:29 a.m. UTC | #1
On Mon, 2 Sep 2024, Tobias Burnus wrote:

> Hi Richard,
> 
> Am 02.09.24 um 13:58 schrieb Richard Biener:
> > Hmm, I can't really follow how and where it's currently decided whether to
> > output offload tables for the LTRANS units
> 
> Before the patch, output_offload_tables is called unconditionally, but guarded
> by the check whether there is anything to output at all. Call trees:
> 
> When outputting the .o files, the call is done via ipa_passes →
> ipa_write_summaries → ipa_write_summaries_1.
> 
> This calls ipa_write_summaries twice: once for the offload/for-device LTO
> section and once for the host LTO section – and both calls are needed.
> 
> For the LTO (lto1, ltrans) step, the call tree starts with:
> do_whole_program_analysis → lto_wpa_write_files → stream_out_partitions
> → stream_out_partitions_1 → stream_out → ipa_write_optimization_summaries.
> 
> Here, stream_out_partitions potentially forks the 'stream_out_partitions_1'
> calls. And each stream_out_partitions_1 calls for each (of its share) of the
> partitions 'stream_out' in a loop.
> 
> With either code path, the ipa_write... function then calls: write_lto →
> lto_output → output_offload_tables.
> 
> > but instead of an odd global
> > variable would it be possible to pass that down as a flag or,
> > alternatively encode that flag in the representation for the LTRANS
> > partition?  I suppose that's the out_decl_state?
> 
> Actually, I tried follow your initial suggestion of the PR, but now moved to
> the somewhat clearer out_decl_state.

Yeah - much nicer.

OK if it passes testing.

Thanks,
Richard.
diff mbox series

Patch

LTO/WPA: Ensure that output_offload_tables only writes table once [PR116535]

When ltrans was written concurrently, e.g. via -flto=N (N > 1, assuming
sufficient partiations, e.g., via -flto-partition=max), output_offload_tables
wrote the output tables once per fork.

	PR lto/116535

gcc/ChangeLog:

	* lto-cgraph.cc (output_offload_tables): Remove offload_ frees.
	* lto-streamer-out.cc (lto_output): Make call to it depend on
	lto_get_out_decl_state ()->output_offload_tables_p.
	* lto-streamer.h (struct lto_out_decl_state): Add
	output_offload_tables_p field.
	* tree-pass.h (ipa_write_optimization_summaries): Add bool argument.
	* passes.cc (ipa_write_summaries_1): Add bool
	output_offload_tables_p arg.
	(ipa_write_summaries): Update call.
	(ipa_write_optimization_summaries): Accept output_offload_tables_p.

gcc/lto/ChangeLog:

	* lto.cc (stream_out): Update call to
	ipa_write_optimization_summaries to pass true for first partition.

 gcc/lto-cgraph.cc       | 10 ----------
 gcc/lto-streamer-out.cc |  3 ++-
 gcc/lto-streamer.h      |  3 +++
 gcc/lto/lto.cc          |  2 +-
 gcc/passes.cc           | 11 ++++++++---
 gcc/tree-pass.h         |  3 ++-
 6 files changed, 16 insertions(+), 16 deletions(-)

diff --git a/gcc/lto-cgraph.cc b/gcc/lto-cgraph.cc
index 6395033ab9d..1492409427c 100644
--- a/gcc/lto-cgraph.cc
+++ b/gcc/lto-cgraph.cc
@@ -1139,16 +1139,6 @@  output_offload_tables (void)
 
   streamer_write_uhwi_stream (ob->main_stream, 0);
   lto_destroy_simple_output_block (ob);
-
-  /* In WHOPR mode during the WPA stage the joint offload tables need to be
-     streamed to one partition only.  That's why we free offload_funcs and
-     offload_vars after the first call of output_offload_tables.  */
-  if (flag_wpa)
-    {
-      vec_free (offload_funcs);
-      vec_free (offload_vars);
-      vec_free (offload_ind_funcs);
-    }
 }
 
 /* Verify the partitioning of NODE.  */
diff --git a/gcc/lto-streamer-out.cc b/gcc/lto-streamer-out.cc
index 523d6dad221..a4b171358d4 100644
--- a/gcc/lto-streamer-out.cc
+++ b/gcc/lto-streamer-out.cc
@@ -2829,7 +2829,8 @@  lto_output (void)
      statements using the statement UIDs.  */
   output_symtab ();
 
-  output_offload_tables ();
+  if (lto_get_out_decl_state ()->output_offload_tables_p)
+    output_offload_tables ();
 
   if (flag_checking)
     {
diff --git a/gcc/lto-streamer.h b/gcc/lto-streamer.h
index 79c44d2cae7..4da1a3efe03 100644
--- a/gcc/lto-streamer.h
+++ b/gcc/lto-streamer.h
@@ -531,6 +531,9 @@  struct lto_out_decl_state
 
   /* True if decl state is compressed.  */
   bool compressed;
+
+  /* True if offload tables should be output. */
+  bool output_offload_tables_p;
 };
 
 typedef struct lto_out_decl_state *lto_out_decl_state_ptr;
diff --git a/gcc/lto/lto.cc b/gcc/lto/lto.cc
index 52dd436fd9a..1ee215d8f1d 100644
--- a/gcc/lto/lto.cc
+++ b/gcc/lto/lto.cc
@@ -178,7 +178,7 @@  stream_out (char *temp_filename, lto_symtab_encoder_t encoder, int part)
 
   gcc_assert (!dump_file);
   streamer_dump_file = dump_begin (TDI_lto_stream_out, NULL, part);
-  ipa_write_optimization_summaries (encoder);
+  ipa_write_optimization_summaries (encoder, part == 0);
 
   free (CONST_CAST (char *, file->filename));
 
diff --git a/gcc/passes.cc b/gcc/passes.cc
index d73f8ba97b6..057850f4dec 100644
--- a/gcc/passes.cc
+++ b/gcc/passes.cc
@@ -2829,11 +2829,13 @@  ipa_write_summaries_2 (opt_pass *pass, struct lto_out_decl_state *state)
    summaries.  SET is the set of nodes to be written.  */
 
 static void
-ipa_write_summaries_1 (lto_symtab_encoder_t encoder)
+ipa_write_summaries_1 (lto_symtab_encoder_t encoder,
+		       bool output_offload_tables_p)
 {
   pass_manager *passes = g->get_passes ();
   struct lto_out_decl_state *state = lto_new_out_decl_state ();
   state->symtab_node_encoder = encoder;
+  state->output_offload_tables_p = output_offload_tables_p;
 
   lto_output_init_mode_table ();
   lto_push_out_decl_state (state);
@@ -2897,7 +2899,8 @@  ipa_write_summaries (void)
     if (vnode->need_lto_streaming)
       lto_set_symtab_encoder_in_partition (encoder, vnode);
 
-  ipa_write_summaries_1 (compute_ltrans_boundary (encoder));
+  ipa_write_summaries_1 (compute_ltrans_boundary (encoder),
+			 flag_generate_offload);
 
   free (order);
   if (streamer_dump_file)
@@ -2952,10 +2955,12 @@  ipa_write_optimization_summaries_1 (opt_pass *pass,
    NULL, write out all summaries of all nodes. */
 
 void
-ipa_write_optimization_summaries (lto_symtab_encoder_t encoder)
+ipa_write_optimization_summaries (lto_symtab_encoder_t encoder,
+				  bool output_offload_tables_p)
 {
   struct lto_out_decl_state *state = lto_new_out_decl_state ();
   state->symtab_node_encoder = encoder;
+  state->output_offload_tables_p = output_offload_tables_p;
 
   lto_output_init_mode_table ();
   lto_push_out_decl_state (state);
diff --git a/gcc/tree-pass.h b/gcc/tree-pass.h
index 3a0cf13089e..a928cbe4557 100644
--- a/gcc/tree-pass.h
+++ b/gcc/tree-pass.h
@@ -684,7 +684,8 @@  extern void emergency_dump_function (void);
 extern void print_current_pass (FILE *);
 extern void debug_pass (void);
 extern void ipa_write_summaries (void);
-extern void ipa_write_optimization_summaries (struct lto_symtab_encoder_d *);
+extern void ipa_write_optimization_summaries (struct lto_symtab_encoder_d *,
+					      bool);
 extern void ipa_read_summaries (void);
 extern void ipa_read_optimization_summaries (void);
 extern void register_one_dump_file (opt_pass *);