Message ID | 20240914191709.648227-1-chenyangyu@isrc.iscas.ac.cn |
---|---|
State | New |
Headers | show |
Series | [RFC] Allow functions with target_clones attribute to be inlined | expand |
Yangyu Chen <chenyangyu@isrc.iscas.ac.cn> writes: > I recently found that target_clones functions cannot inline even when > the caller has exactly the same target. However, if we only use target > attributes in C++ and let the compiler generate IFUNC for us, the > functions with the same target will be inlined. > > For example, the following code compiled on x86-64 target with -O3 will > generate IFUNC for foo and bar and inline foo into the bar: > > ```cpp > __attribute__((target("default"))) > int foo(int *arr) { > int sum = 0; > for (int i=0;i<16;i++) sum += arr[i]; > return sum; > } > > __attribute__((target("avx2"))) > int foo(int *arr) { > int sum = 0; > for (int i=0;i<16;i++) sum += arr[i]; > return sum; > } > > __attribute__((target("default"))) > int bar(int *arr) { > return foo(arr); > } > > __attribute__((target("avx2"))) > int bar(int *arr) { > return foo(arr); > } > ``` > > However, if we use target_clones attribute, the target_clones functions > will not be inlined: > > ```cpp > __attribute__((target_clones("default","avx2"))) > int foo(int *arr) { > int sum = 0; > for (int i=0;i<16;i++) sum += arr[i]; > return sum; > } > > __attribute__((target_clones("default","avx2"))) > int bar(int *arr) { > return foo(arr); > } > ``` > > This behavior may negatively impact performance since the target_clones > functions are not inlined. And since we didn't jump to the target_clones > functions based on PLT but used the same target as the caller's target. > I think it's better to allow the target_clones functions to be inlined. > > gcc/ada/ChangeLog: > > * gcc-interface/utils.cc (handle_target_clones_attribute): > Allow functions with target_clones attribute to be inlined. > > gcc/c-family/ChangeLog: > > * c-attribs.cc (handle_target_clones_attribute): > Allow functions with target_clones attribute to be inlined. > > gcc/d/ChangeLog: > > * d-attribs.cc (d_handle_target_clones_attribute): > Allow functions with target_clones attribute to be inlined. What I'm about to say applies to both sequences above, but: Before inlining avx2 foo into avx2 bar, don't we need to be careful about making sure that foo would still pick the avx2 version if called normally? E.g. if foo had an avx512 version, direct calls to foo would presumably pick that on avx512 targets, but still pick the avx2 version of bar. It would then seem strange for the avx2 version of bar to inline the avx2 version of foo, both for performance and ODR reasons. Thanks, Richard > > Signed-off-by: Yangyu Chen <chenyangyu@isrc.iscas.ac.cn> > --- > gcc/ada/gcc-interface/utils.cc | 5 +---- > gcc/c-family/c-attribs.cc | 3 --- > gcc/d/d-attribs.cc | 5 ----- > 3 files changed, 1 insertion(+), 12 deletions(-) > > diff --git a/gcc/ada/gcc-interface/utils.cc b/gcc/ada/gcc-interface/utils.cc > index 60f36b1e50d..d010b684177 100644 > --- a/gcc/ada/gcc-interface/utils.cc > +++ b/gcc/ada/gcc-interface/utils.cc > @@ -7299,10 +7299,7 @@ handle_target_clones_attribute (tree *node, tree name, tree ARG_UNUSED (args), > int ARG_UNUSED (flags), bool *no_add_attrs) > { > /* Ensure we have a function type. */ > - if (TREE_CODE (*node) == FUNCTION_DECL) > - /* Do not inline functions with multiple clone targets. */ > - DECL_UNINLINABLE (*node) = 1; > - else > + if (TREE_CODE (*node) != FUNCTION_DECL) > { > warning (OPT_Wattributes, "%qE attribute ignored", name); > *no_add_attrs = true; > diff --git a/gcc/c-family/c-attribs.cc b/gcc/c-family/c-attribs.cc > index 4dd2eecbea5..f8759bb1908 100644 > --- a/gcc/c-family/c-attribs.cc > +++ b/gcc/c-family/c-attribs.cc > @@ -6105,9 +6105,6 @@ handle_target_clones_attribute (tree *node, tree name, tree ARG_UNUSED (args), > "single %<target_clones%> attribute is ignored"); > *no_add_attrs = true; > } > - else > - /* Do not inline functions with multiple clone targets. */ > - DECL_UNINLINABLE (*node) = 1; > } > else > { > diff --git a/gcc/d/d-attribs.cc b/gcc/d/d-attribs.cc > index 0f7ca10e017..9f67415adb1 100644 > --- a/gcc/d/d-attribs.cc > +++ b/gcc/d/d-attribs.cc > @@ -788,11 +788,6 @@ d_handle_target_clones_attribute (tree *node, tree name, tree, int, > warning (OPT_Wattributes, "%qE attribute ignored", name); > *no_add_attrs = true; > } > - else > - { > - /* Do not inline functions with multiple clone targets. */ > - DECL_UNINLINABLE (*node) = 1; > - } > > return NULL_TREE; > }
> On Sep 18, 2024, at 16:46, Richard Sandiford <richard.sandiford@arm.com> wrote: > > Yangyu Chen <chenyangyu@isrc.iscas.ac.cn> writes: >> I recently found that target_clones functions cannot inline even when >> the caller has exactly the same target. However, if we only use target >> attributes in C++ and let the compiler generate IFUNC for us, the >> functions with the same target will be inlined. >> >> For example, the following code compiled on x86-64 target with -O3 will >> generate IFUNC for foo and bar and inline foo into the bar: >> >> ```cpp >> __attribute__((target("default"))) >> int foo(int *arr) { >> int sum = 0; >> for (int i=0;i<16;i++) sum += arr[i]; >> return sum; >> } >> >> __attribute__((target("avx2"))) >> int foo(int *arr) { >> int sum = 0; >> for (int i=0;i<16;i++) sum += arr[i]; >> return sum; >> } >> >> __attribute__((target("default"))) >> int bar(int *arr) { >> return foo(arr); >> } >> >> __attribute__((target("avx2"))) >> int bar(int *arr) { >> return foo(arr); >> } >> ``` >> >> However, if we use target_clones attribute, the target_clones functions >> will not be inlined: >> >> ```cpp >> __attribute__((target_clones("default","avx2"))) >> int foo(int *arr) { >> int sum = 0; >> for (int i=0;i<16;i++) sum += arr[i]; >> return sum; >> } >> >> __attribute__((target_clones("default","avx2"))) >> int bar(int *arr) { >> return foo(arr); >> } >> ``` >> >> This behavior may negatively impact performance since the target_clones >> functions are not inlined. And since we didn't jump to the target_clones >> functions based on PLT but used the same target as the caller's target. >> I think it's better to allow the target_clones functions to be inlined. >> >> gcc/ada/ChangeLog: >> >> * gcc-interface/utils.cc (handle_target_clones_attribute): >> Allow functions with target_clones attribute to be inlined. >> >> gcc/c-family/ChangeLog: >> >> * c-attribs.cc (handle_target_clones_attribute): >> Allow functions with target_clones attribute to be inlined. >> >> gcc/d/ChangeLog: >> >> * d-attribs.cc (d_handle_target_clones_attribute): >> Allow functions with target_clones attribute to be inlined. > > What I'm about to say applies to both sequences above, but: > > Before inlining avx2 foo into avx2 bar, don't we need to be careful about > making sure that foo would still pick the avx2 version if called normally? > E.g. if foo had an avx512 version, direct calls to foo would presumably > pick that on avx512 targets, but still pick the avx2 version of bar. > It would then seem strange for the avx2 version of bar to inline the > avx2 version of foo, both for performance and ODR reasons. > Yes. For now, the GCC will emit code to call the avx2 callee directly, which will be worse than inlining. This should be fixed in the future. > Thanks, > Richard > >> >> Signed-off-by: Yangyu Chen <chenyangyu@isrc.iscas.ac.cn> >> --- >> gcc/ada/gcc-interface/utils.cc | 5 +---- >> gcc/c-family/c-attribs.cc | 3 --- >> gcc/d/d-attribs.cc | 5 ----- >> 3 files changed, 1 insertion(+), 12 deletions(-) >> >> diff --git a/gcc/ada/gcc-interface/utils.cc b/gcc/ada/gcc-interface/utils.cc >> index 60f36b1e50d..d010b684177 100644 >> --- a/gcc/ada/gcc-interface/utils.cc >> +++ b/gcc/ada/gcc-interface/utils.cc >> @@ -7299,10 +7299,7 @@ handle_target_clones_attribute (tree *node, tree name, tree ARG_UNUSED (args), >> int ARG_UNUSED (flags), bool *no_add_attrs) >> { >> /* Ensure we have a function type. */ >> - if (TREE_CODE (*node) == FUNCTION_DECL) >> - /* Do not inline functions with multiple clone targets. */ >> - DECL_UNINLINABLE (*node) = 1; >> - else >> + if (TREE_CODE (*node) != FUNCTION_DECL) >> { >> warning (OPT_Wattributes, "%qE attribute ignored", name); >> *no_add_attrs = true; >> diff --git a/gcc/c-family/c-attribs.cc b/gcc/c-family/c-attribs.cc >> index 4dd2eecbea5..f8759bb1908 100644 >> --- a/gcc/c-family/c-attribs.cc >> +++ b/gcc/c-family/c-attribs.cc >> @@ -6105,9 +6105,6 @@ handle_target_clones_attribute (tree *node, tree name, tree ARG_UNUSED (args), >> "single %<target_clones%> attribute is ignored"); >> *no_add_attrs = true; >> } >> - else >> - /* Do not inline functions with multiple clone targets. */ >> - DECL_UNINLINABLE (*node) = 1; >> } >> else >> { >> diff --git a/gcc/d/d-attribs.cc b/gcc/d/d-attribs.cc >> index 0f7ca10e017..9f67415adb1 100644 >> --- a/gcc/d/d-attribs.cc >> +++ b/gcc/d/d-attribs.cc >> @@ -788,11 +788,6 @@ d_handle_target_clones_attribute (tree *node, tree name, tree, int, >> warning (OPT_Wattributes, "%qE attribute ignored", name); >> *no_add_attrs = true; >> } >> - else >> - { >> - /* Do not inline functions with multiple clone targets. */ >> - DECL_UNINLINABLE (*node) = 1; >> - } >> >> return NULL_TREE; >> }
On Wed, Sep 18, 2024 at 09:46:15AM +0100, Richard Sandiford wrote: > Yangyu Chen <chenyangyu@isrc.iscas.ac.cn> writes: > > I recently found that target_clones functions cannot inline even when > > the caller has exactly the same target. However, if we only use target > > attributes in C++ and let the compiler generate IFUNC for us, the > > functions with the same target will be inlined. > > > > For example, the following code compiled on x86-64 target with -O3 will > > generate IFUNC for foo and bar and inline foo into the bar: > > > > ```cpp > > __attribute__((target("default"))) > > int foo(int *arr) { > > int sum = 0; > > for (int i=0;i<16;i++) sum += arr[i]; > > return sum; > > } > > > > __attribute__((target("avx2"))) > > int foo(int *arr) { > > int sum = 0; > > for (int i=0;i<16;i++) sum += arr[i]; > > return sum; > > } > > > > __attribute__((target("default"))) > > int bar(int *arr) { > > return foo(arr); > > } > > > > __attribute__((target("avx2"))) > > int bar(int *arr) { > > return foo(arr); > > } > > ``` > > > > However, if we use target_clones attribute, the target_clones functions > > will not be inlined: > > > > ```cpp > > __attribute__((target_clones("default","avx2"))) > > int foo(int *arr) { > > int sum = 0; > > for (int i=0;i<16;i++) sum += arr[i]; > > return sum; > > } > > > > __attribute__((target_clones("default","avx2"))) > > int bar(int *arr) { > > return foo(arr); > > } > > ``` > > > > This behavior may negatively impact performance since the target_clones > > functions are not inlined. And since we didn't jump to the target_clones > > functions based on PLT but used the same target as the caller's target. > > I think it's better to allow the target_clones functions to be inlined. > > > > gcc/ada/ChangeLog: > > > > * gcc-interface/utils.cc (handle_target_clones_attribute): > > Allow functions with target_clones attribute to be inlined. > > > > gcc/c-family/ChangeLog: > > > > * c-attribs.cc (handle_target_clones_attribute): > > Allow functions with target_clones attribute to be inlined. > > > > gcc/d/ChangeLog: > > > > * d-attribs.cc (d_handle_target_clones_attribute): > > Allow functions with target_clones attribute to be inlined. > > What I'm about to say applies to both sequences above, but: > > Before inlining avx2 foo into avx2 bar, don't we need to be careful about > making sure that foo would still pick the avx2 version if called normally? > E.g. if foo had an avx512 version, direct calls to foo would presumably > pick that on avx512 targets, but still pick the avx2 version of bar. > It would then seem strange for the avx2 version of bar to inline the > avx2 version of foo, both for performance and ODR reasons. > > Thanks, > Richard This is actually an existing bug in the indirection elimination code for the 'target' attribute. AArch64's 'target_version' attribute is not affected by this bug because the code in redirect_to_specific_clone only checks the target attribute (however, AArch64 is affected by a more convoluted bug in cases where we use both target and target_version attributes at the same time). I think it's correct to mark functions with the target_clones attribute as non-inlinable by the usual inlining mechanisms, since we don't know which version will be called at runtime without doing further work to check this. If we can remove the indirection, then the individual function versions might then become inlinable, but we don't have these individual versions available until after the target clones pass. > > > > Signed-off-by: Yangyu Chen <chenyangyu@isrc.iscas.ac.cn> > > --- > > gcc/ada/gcc-interface/utils.cc | 5 +---- > > gcc/c-family/c-attribs.cc | 3 --- > > gcc/d/d-attribs.cc | 5 ----- > > 3 files changed, 1 insertion(+), 12 deletions(-) > > > > diff --git a/gcc/ada/gcc-interface/utils.cc b/gcc/ada/gcc-interface/utils.cc > > index 60f36b1e50d..d010b684177 100644 > > --- a/gcc/ada/gcc-interface/utils.cc > > +++ b/gcc/ada/gcc-interface/utils.cc > > @@ -7299,10 +7299,7 @@ handle_target_clones_attribute (tree *node, tree name, tree ARG_UNUSED (args), > > int ARG_UNUSED (flags), bool *no_add_attrs) > > { > > /* Ensure we have a function type. */ > > - if (TREE_CODE (*node) == FUNCTION_DECL) > > - /* Do not inline functions with multiple clone targets. */ > > - DECL_UNINLINABLE (*node) = 1; > > - else > > + if (TREE_CODE (*node) != FUNCTION_DECL) > > { > > warning (OPT_Wattributes, "%qE attribute ignored", name); > > *no_add_attrs = true; > > diff --git a/gcc/c-family/c-attribs.cc b/gcc/c-family/c-attribs.cc > > index 4dd2eecbea5..f8759bb1908 100644 > > --- a/gcc/c-family/c-attribs.cc > > +++ b/gcc/c-family/c-attribs.cc > > @@ -6105,9 +6105,6 @@ handle_target_clones_attribute (tree *node, tree name, tree ARG_UNUSED (args), > > "single %<target_clones%> attribute is ignored"); > > *no_add_attrs = true; > > } > > - else > > - /* Do not inline functions with multiple clone targets. */ > > - DECL_UNINLINABLE (*node) = 1; > > } > > else > > { > > diff --git a/gcc/d/d-attribs.cc b/gcc/d/d-attribs.cc > > index 0f7ca10e017..9f67415adb1 100644 > > --- a/gcc/d/d-attribs.cc > > +++ b/gcc/d/d-attribs.cc > > @@ -788,11 +788,6 @@ d_handle_target_clones_attribute (tree *node, tree name, tree, int, > > warning (OPT_Wattributes, "%qE attribute ignored", name); > > *no_add_attrs = true; > > } > > - else > > - { > > - /* Do not inline functions with multiple clone targets. */ > > - DECL_UNINLINABLE (*node) = 1; > > - } > > > > return NULL_TREE; > > }
> On Sep 18, 2024, at 23:36, Andrew Carlotti <andrew.carlotti@arm.com> wrote: > > On Wed, Sep 18, 2024 at 09:46:15AM +0100, Richard Sandiford wrote: >> Yangyu Chen <chenyangyu@isrc.iscas.ac.cn> writes: >>> I recently found that target_clones functions cannot inline even when >>> the caller has exactly the same target. However, if we only use target >>> attributes in C++ and let the compiler generate IFUNC for us, the >>> functions with the same target will be inlined. >>> >>> For example, the following code compiled on x86-64 target with -O3 will >>> generate IFUNC for foo and bar and inline foo into the bar: >>> >>> ```cpp >>> __attribute__((target("default"))) >>> int foo(int *arr) { >>> int sum = 0; >>> for (int i=0;i<16;i++) sum += arr[i]; >>> return sum; >>> } >>> >>> __attribute__((target("avx2"))) >>> int foo(int *arr) { >>> int sum = 0; >>> for (int i=0;i<16;i++) sum += arr[i]; >>> return sum; >>> } >>> >>> __attribute__((target("default"))) >>> int bar(int *arr) { >>> return foo(arr); >>> } >>> >>> __attribute__((target("avx2"))) >>> int bar(int *arr) { >>> return foo(arr); >>> } >>> ``` >>> >>> However, if we use target_clones attribute, the target_clones functions >>> will not be inlined: >>> >>> ```cpp >>> __attribute__((target_clones("default","avx2"))) >>> int foo(int *arr) { >>> int sum = 0; >>> for (int i=0;i<16;i++) sum += arr[i]; >>> return sum; >>> } >>> >>> __attribute__((target_clones("default","avx2"))) >>> int bar(int *arr) { >>> return foo(arr); >>> } >>> ``` >>> >>> This behavior may negatively impact performance since the target_clones >>> functions are not inlined. And since we didn't jump to the target_clones >>> functions based on PLT but used the same target as the caller's target. >>> I think it's better to allow the target_clones functions to be inlined. >>> >>> gcc/ada/ChangeLog: >>> >>> * gcc-interface/utils.cc (handle_target_clones_attribute): >>> Allow functions with target_clones attribute to be inlined. >>> >>> gcc/c-family/ChangeLog: >>> >>> * c-attribs.cc (handle_target_clones_attribute): >>> Allow functions with target_clones attribute to be inlined. >>> >>> gcc/d/ChangeLog: >>> >>> * d-attribs.cc (d_handle_target_clones_attribute): >>> Allow functions with target_clones attribute to be inlined. >> >> What I'm about to say applies to both sequences above, but: >> >> Before inlining avx2 foo into avx2 bar, don't we need to be careful about >> making sure that foo would still pick the avx2 version if called normally? >> E.g. if foo had an avx512 version, direct calls to foo would presumably >> pick that on avx512 targets, but still pick the avx2 version of bar. >> It would then seem strange for the avx2 version of bar to inline the >> avx2 version of foo, both for performance and ODR reasons. >> >> Thanks, >> Richard > > This is actually an existing bug in the indirection elimination code for the > 'target' attribute. AArch64's 'target_version' attribute is not affected by > this bug because the code in redirect_to_specific_clone only checks the target > attribute (however, AArch64 is affected by a more convoluted bug in cases where > we use both target and target_version attributes at the same time). > > I think it's correct to mark functions with the target_clones attribute > as non-inlinable by the usual inlining mechanisms, since we don't know which > version will be called at runtime without doing further work to check this. I think the best case might be allowing inline when the top priority callee function has a subset of the caller's ISA. When the callee has some target ISA extension that does not exist in the caller's, we call the function through PLT. However, this solution also needs to allow the target_clones callee to be able to inline. > If > we can remove the indirection, then the individual function versions might then > become inlinable, but we don't have these individual versions available until > after the target clones pass. > >>> >>> Signed-off-by: Yangyu Chen <chenyangyu@isrc.iscas.ac.cn> >>> --- >>> gcc/ada/gcc-interface/utils.cc | 5 +---- >>> gcc/c-family/c-attribs.cc | 3 --- >>> gcc/d/d-attribs.cc | 5 ----- >>> 3 files changed, 1 insertion(+), 12 deletions(-) >>> >>> diff --git a/gcc/ada/gcc-interface/utils.cc b/gcc/ada/gcc-interface/utils.cc >>> index 60f36b1e50d..d010b684177 100644 >>> --- a/gcc/ada/gcc-interface/utils.cc >>> +++ b/gcc/ada/gcc-interface/utils.cc >>> @@ -7299,10 +7299,7 @@ handle_target_clones_attribute (tree *node, tree name, tree ARG_UNUSED (args), >>> int ARG_UNUSED (flags), bool *no_add_attrs) >>> { >>> /* Ensure we have a function type. */ >>> - if (TREE_CODE (*node) == FUNCTION_DECL) >>> - /* Do not inline functions with multiple clone targets. */ >>> - DECL_UNINLINABLE (*node) = 1; >>> - else >>> + if (TREE_CODE (*node) != FUNCTION_DECL) >>> { >>> warning (OPT_Wattributes, "%qE attribute ignored", name); >>> *no_add_attrs = true; >>> diff --git a/gcc/c-family/c-attribs.cc b/gcc/c-family/c-attribs.cc >>> index 4dd2eecbea5..f8759bb1908 100644 >>> --- a/gcc/c-family/c-attribs.cc >>> +++ b/gcc/c-family/c-attribs.cc >>> @@ -6105,9 +6105,6 @@ handle_target_clones_attribute (tree *node, tree name, tree ARG_UNUSED (args), >>> "single %<target_clones%> attribute is ignored"); >>> *no_add_attrs = true; >>> } >>> - else >>> - /* Do not inline functions with multiple clone targets. */ >>> - DECL_UNINLINABLE (*node) = 1; >>> } >>> else >>> { >>> diff --git a/gcc/d/d-attribs.cc b/gcc/d/d-attribs.cc >>> index 0f7ca10e017..9f67415adb1 100644 >>> --- a/gcc/d/d-attribs.cc >>> +++ b/gcc/d/d-attribs.cc >>> @@ -788,11 +788,6 @@ d_handle_target_clones_attribute (tree *node, tree name, tree, int, >>> warning (OPT_Wattributes, "%qE attribute ignored", name); >>> *no_add_attrs = true; >>> } >>> - else >>> - { >>> - /* Do not inline functions with multiple clone targets. */ >>> - DECL_UNINLINABLE (*node) = 1; >>> - } >>> >>> return NULL_TREE; >>> }
On Thu, Sep 19, 2024 at 01:01:39AM +0800, Yangyu Chen wrote: > > > > On Sep 18, 2024, at 23:36, Andrew Carlotti <andrew.carlotti@arm.com> wrote: > > > > On Wed, Sep 18, 2024 at 09:46:15AM +0100, Richard Sandiford wrote: > >> Yangyu Chen <chenyangyu@isrc.iscas.ac.cn> writes: > >>> I recently found that target_clones functions cannot inline even when > >>> the caller has exactly the same target. However, if we only use target > >>> attributes in C++ and let the compiler generate IFUNC for us, the > >>> functions with the same target will be inlined. > >>> > >>> For example, the following code compiled on x86-64 target with -O3 will > >>> generate IFUNC for foo and bar and inline foo into the bar: > >>> > >>> ```cpp > >>> __attribute__((target("default"))) > >>> int foo(int *arr) { > >>> int sum = 0; > >>> for (int i=0;i<16;i++) sum += arr[i]; > >>> return sum; > >>> } > >>> > >>> __attribute__((target("avx2"))) > >>> int foo(int *arr) { > >>> int sum = 0; > >>> for (int i=0;i<16;i++) sum += arr[i]; > >>> return sum; > >>> } > >>> > >>> __attribute__((target("default"))) > >>> int bar(int *arr) { > >>> return foo(arr); > >>> } > >>> > >>> __attribute__((target("avx2"))) > >>> int bar(int *arr) { > >>> return foo(arr); > >>> } > >>> ``` > >>> > >>> However, if we use target_clones attribute, the target_clones functions > >>> will not be inlined: > >>> > >>> ```cpp > >>> __attribute__((target_clones("default","avx2"))) > >>> int foo(int *arr) { > >>> int sum = 0; > >>> for (int i=0;i<16;i++) sum += arr[i]; > >>> return sum; > >>> } > >>> > >>> __attribute__((target_clones("default","avx2"))) > >>> int bar(int *arr) { > >>> return foo(arr); > >>> } > >>> ``` > >>> > >>> This behavior may negatively impact performance since the target_clones > >>> functions are not inlined. And since we didn't jump to the target_clones > >>> functions based on PLT but used the same target as the caller's target. > >>> I think it's better to allow the target_clones functions to be inlined. > >>> > >>> gcc/ada/ChangeLog: > >>> > >>> * gcc-interface/utils.cc (handle_target_clones_attribute): > >>> Allow functions with target_clones attribute to be inlined. > >>> > >>> gcc/c-family/ChangeLog: > >>> > >>> * c-attribs.cc (handle_target_clones_attribute): > >>> Allow functions with target_clones attribute to be inlined. > >>> > >>> gcc/d/ChangeLog: > >>> > >>> * d-attribs.cc (d_handle_target_clones_attribute): > >>> Allow functions with target_clones attribute to be inlined. > >> > >> What I'm about to say applies to both sequences above, but: > >> > >> Before inlining avx2 foo into avx2 bar, don't we need to be careful about > >> making sure that foo would still pick the avx2 version if called normally? > >> E.g. if foo had an avx512 version, direct calls to foo would presumably > >> pick that on avx512 targets, but still pick the avx2 version of bar. > >> It would then seem strange for the avx2 version of bar to inline the > >> avx2 version of foo, both for performance and ODR reasons. > >> > >> Thanks, > >> Richard > > > > This is actually an existing bug in the indirection elimination code for the > > 'target' attribute. AArch64's 'target_version' attribute is not affected by > > this bug because the code in redirect_to_specific_clone only checks the target > > attribute (however, AArch64 is affected by a more convoluted bug in cases where > > we use both target and target_version attributes at the same time). > > > > I think it's correct to mark functions with the target_clones attribute > > as non-inlinable by the usual inlining mechanisms, since we don't know which > > version will be called at runtime without doing further work to check this. > > I think the best case might be allowing inline when the top priority > callee function has a subset of the caller's ISA. When the callee > has some target ISA extension that does not exist in the caller's, > we call the function through PLT. However, this solution also needs > to allow the target_clones callee to be able to inline. If we redirect to a specific version in the target_clone pass, then presumably that means that it's safe to inline that version? I'm not sure whether the existing code in redirect_to_specific_clone checks whether this is safe before removing the indirection, which might be a bug in itself. If we know it's safe to inline, then we could mark the function as inlineable in redirect_to_specific_clone, which would allow the later ipa_inline pass toinline the function. > > If > > we can remove the indirection, then the individual function versions might then > > become inlinable, but we don't have these individual versions available until > > after the target clones pass. > > > >>> > >>> Signed-off-by: Yangyu Chen <chenyangyu@isrc.iscas.ac.cn> > >>> --- > >>> gcc/ada/gcc-interface/utils.cc | 5 +---- > >>> gcc/c-family/c-attribs.cc | 3 --- > >>> gcc/d/d-attribs.cc | 5 ----- > >>> 3 files changed, 1 insertion(+), 12 deletions(-) > >>> > >>> diff --git a/gcc/ada/gcc-interface/utils.cc b/gcc/ada/gcc-interface/utils.cc > >>> index 60f36b1e50d..d010b684177 100644 > >>> --- a/gcc/ada/gcc-interface/utils.cc > >>> +++ b/gcc/ada/gcc-interface/utils.cc > >>> @@ -7299,10 +7299,7 @@ handle_target_clones_attribute (tree *node, tree name, tree ARG_UNUSED (args), > >>> int ARG_UNUSED (flags), bool *no_add_attrs) > >>> { > >>> /* Ensure we have a function type. */ > >>> - if (TREE_CODE (*node) == FUNCTION_DECL) > >>> - /* Do not inline functions with multiple clone targets. */ > >>> - DECL_UNINLINABLE (*node) = 1; > >>> - else > >>> + if (TREE_CODE (*node) != FUNCTION_DECL) > >>> { > >>> warning (OPT_Wattributes, "%qE attribute ignored", name); > >>> *no_add_attrs = true; > >>> diff --git a/gcc/c-family/c-attribs.cc b/gcc/c-family/c-attribs.cc > >>> index 4dd2eecbea5..f8759bb1908 100644 > >>> --- a/gcc/c-family/c-attribs.cc > >>> +++ b/gcc/c-family/c-attribs.cc > >>> @@ -6105,9 +6105,6 @@ handle_target_clones_attribute (tree *node, tree name, tree ARG_UNUSED (args), > >>> "single %<target_clones%> attribute is ignored"); > >>> *no_add_attrs = true; > >>> } > >>> - else > >>> - /* Do not inline functions with multiple clone targets. */ > >>> - DECL_UNINLINABLE (*node) = 1; > >>> } > >>> else > >>> { > >>> diff --git a/gcc/d/d-attribs.cc b/gcc/d/d-attribs.cc > >>> index 0f7ca10e017..9f67415adb1 100644 > >>> --- a/gcc/d/d-attribs.cc > >>> +++ b/gcc/d/d-attribs.cc > >>> @@ -788,11 +788,6 @@ d_handle_target_clones_attribute (tree *node, tree name, tree, int, > >>> warning (OPT_Wattributes, "%qE attribute ignored", name); > >>> *no_add_attrs = true; > >>> } > >>> - else > >>> - { > >>> - /* Do not inline functions with multiple clone targets. */ > >>> - DECL_UNINLINABLE (*node) = 1; > >>> - } > >>> > >>> return NULL_TREE; > >>> } > >
On Wed, Sep 18, 2024 at 8:41 PM Andrew Carlotti <andrew.carlotti@arm.com> wrote: > > On Thu, Sep 19, 2024 at 01:01:39AM +0800, Yangyu Chen wrote: > > > > > > > On Sep 18, 2024, at 23:36, Andrew Carlotti <andrew.carlotti@arm.com> wrote: > > > > > > On Wed, Sep 18, 2024 at 09:46:15AM +0100, Richard Sandiford wrote: > > >> Yangyu Chen <chenyangyu@isrc.iscas.ac.cn> writes: > > >>> I recently found that target_clones functions cannot inline even when > > >>> the caller has exactly the same target. However, if we only use target > > >>> attributes in C++ and let the compiler generate IFUNC for us, the > > >>> functions with the same target will be inlined. > > >>> > > >>> For example, the following code compiled on x86-64 target with -O3 will > > >>> generate IFUNC for foo and bar and inline foo into the bar: > > >>> > > >>> ```cpp > > >>> __attribute__((target("default"))) > > >>> int foo(int *arr) { > > >>> int sum = 0; > > >>> for (int i=0;i<16;i++) sum += arr[i]; > > >>> return sum; > > >>> } > > >>> > > >>> __attribute__((target("avx2"))) > > >>> int foo(int *arr) { > > >>> int sum = 0; > > >>> for (int i=0;i<16;i++) sum += arr[i]; > > >>> return sum; > > >>> } > > >>> > > >>> __attribute__((target("default"))) > > >>> int bar(int *arr) { > > >>> return foo(arr); > > >>> } > > >>> > > >>> __attribute__((target("avx2"))) > > >>> int bar(int *arr) { > > >>> return foo(arr); > > >>> } > > >>> ``` > > >>> > > >>> However, if we use target_clones attribute, the target_clones functions > > >>> will not be inlined: > > >>> > > >>> ```cpp > > >>> __attribute__((target_clones("default","avx2"))) > > >>> int foo(int *arr) { > > >>> int sum = 0; > > >>> for (int i=0;i<16;i++) sum += arr[i]; > > >>> return sum; > > >>> } > > >>> > > >>> __attribute__((target_clones("default","avx2"))) > > >>> int bar(int *arr) { > > >>> return foo(arr); > > >>> } > > >>> ``` > > >>> > > >>> This behavior may negatively impact performance since the target_clones > > >>> functions are not inlined. And since we didn't jump to the target_clones > > >>> functions based on PLT but used the same target as the caller's target. > > >>> I think it's better to allow the target_clones functions to be inlined. > > >>> > > >>> gcc/ada/ChangeLog: > > >>> > > >>> * gcc-interface/utils.cc (handle_target_clones_attribute): > > >>> Allow functions with target_clones attribute to be inlined. > > >>> > > >>> gcc/c-family/ChangeLog: > > >>> > > >>> * c-attribs.cc (handle_target_clones_attribute): > > >>> Allow functions with target_clones attribute to be inlined. > > >>> > > >>> gcc/d/ChangeLog: > > >>> > > >>> * d-attribs.cc (d_handle_target_clones_attribute): > > >>> Allow functions with target_clones attribute to be inlined. > > >> > > >> What I'm about to say applies to both sequences above, but: > > >> > > >> Before inlining avx2 foo into avx2 bar, don't we need to be careful about > > >> making sure that foo would still pick the avx2 version if called normally? > > >> E.g. if foo had an avx512 version, direct calls to foo would presumably > > >> pick that on avx512 targets, but still pick the avx2 version of bar. > > >> It would then seem strange for the avx2 version of bar to inline the > > >> avx2 version of foo, both for performance and ODR reasons. > > >> > > >> Thanks, > > >> Richard > > > > > > This is actually an existing bug in the indirection elimination code for the > > > 'target' attribute. AArch64's 'target_version' attribute is not affected by > > > this bug because the code in redirect_to_specific_clone only checks the target > > > attribute (however, AArch64 is affected by a more convoluted bug in cases where > > > we use both target and target_version attributes at the same time). > > > > > > I think it's correct to mark functions with the target_clones attribute > > > as non-inlinable by the usual inlining mechanisms, since we don't know which > > > version will be called at runtime without doing further work to check this. > > > > I think the best case might be allowing inline when the top priority > > callee function has a subset of the caller's ISA. When the callee > > has some target ISA extension that does not exist in the caller's, > > we call the function through PLT. However, this solution also needs > > to allow the target_clones callee to be able to inline. > > If we redirect to a specific version in the target_clone pass, then presumably > that means that it's safe to inline that version? I would say so. I think marking the resolver decl as not inlineable is correct but the individual target clones should not be marked that way (are they?) > I'm not sure whether the > existing code in redirect_to_specific_clone checks whether this is safe before > removing the indirection, which might be a bug in itself. I think the code should make the call inlineable already, unless the functions are marked as not inlineable - you'd need to debug what happens. RIchard. > > If we know it's safe to inline, then we could mark the function as inlineable > in redirect_to_specific_clone, which would allow the later ipa_inline pass > toinline the function. > > > > If > > > we can remove the indirection, then the individual function versions might then > > > become inlinable, but we don't have these individual versions available until > > > after the target clones pass. > > > > > >>> > > >>> Signed-off-by: Yangyu Chen <chenyangyu@isrc.iscas.ac.cn> > > >>> --- > > >>> gcc/ada/gcc-interface/utils.cc | 5 +---- > > >>> gcc/c-family/c-attribs.cc | 3 --- > > >>> gcc/d/d-attribs.cc | 5 ----- > > >>> 3 files changed, 1 insertion(+), 12 deletions(-) > > >>> > > >>> diff --git a/gcc/ada/gcc-interface/utils.cc b/gcc/ada/gcc-interface/utils.cc > > >>> index 60f36b1e50d..d010b684177 100644 > > >>> --- a/gcc/ada/gcc-interface/utils.cc > > >>> +++ b/gcc/ada/gcc-interface/utils.cc > > >>> @@ -7299,10 +7299,7 @@ handle_target_clones_attribute (tree *node, tree name, tree ARG_UNUSED (args), > > >>> int ARG_UNUSED (flags), bool *no_add_attrs) > > >>> { > > >>> /* Ensure we have a function type. */ > > >>> - if (TREE_CODE (*node) == FUNCTION_DECL) > > >>> - /* Do not inline functions with multiple clone targets. */ > > >>> - DECL_UNINLINABLE (*node) = 1; > > >>> - else > > >>> + if (TREE_CODE (*node) != FUNCTION_DECL) > > >>> { > > >>> warning (OPT_Wattributes, "%qE attribute ignored", name); > > >>> *no_add_attrs = true; > > >>> diff --git a/gcc/c-family/c-attribs.cc b/gcc/c-family/c-attribs.cc > > >>> index 4dd2eecbea5..f8759bb1908 100644 > > >>> --- a/gcc/c-family/c-attribs.cc > > >>> +++ b/gcc/c-family/c-attribs.cc > > >>> @@ -6105,9 +6105,6 @@ handle_target_clones_attribute (tree *node, tree name, tree ARG_UNUSED (args), > > >>> "single %<target_clones%> attribute is ignored"); > > >>> *no_add_attrs = true; > > >>> } > > >>> - else > > >>> - /* Do not inline functions with multiple clone targets. */ > > >>> - DECL_UNINLINABLE (*node) = 1; > > >>> } > > >>> else > > >>> { > > >>> diff --git a/gcc/d/d-attribs.cc b/gcc/d/d-attribs.cc > > >>> index 0f7ca10e017..9f67415adb1 100644 > > >>> --- a/gcc/d/d-attribs.cc > > >>> +++ b/gcc/d/d-attribs.cc > > >>> @@ -788,11 +788,6 @@ d_handle_target_clones_attribute (tree *node, tree name, tree, int, > > >>> warning (OPT_Wattributes, "%qE attribute ignored", name); > > >>> *no_add_attrs = true; > > >>> } > > >>> - else > > >>> - { > > >>> - /* Do not inline functions with multiple clone targets. */ > > >>> - DECL_UNINLINABLE (*node) = 1; > > >>> - } > > >>> > > >>> return NULL_TREE; > > >>> } > > > >
diff --git a/gcc/ada/gcc-interface/utils.cc b/gcc/ada/gcc-interface/utils.cc index 60f36b1e50d..d010b684177 100644 --- a/gcc/ada/gcc-interface/utils.cc +++ b/gcc/ada/gcc-interface/utils.cc @@ -7299,10 +7299,7 @@ handle_target_clones_attribute (tree *node, tree name, tree ARG_UNUSED (args), int ARG_UNUSED (flags), bool *no_add_attrs) { /* Ensure we have a function type. */ - if (TREE_CODE (*node) == FUNCTION_DECL) - /* Do not inline functions with multiple clone targets. */ - DECL_UNINLINABLE (*node) = 1; - else + if (TREE_CODE (*node) != FUNCTION_DECL) { warning (OPT_Wattributes, "%qE attribute ignored", name); *no_add_attrs = true; diff --git a/gcc/c-family/c-attribs.cc b/gcc/c-family/c-attribs.cc index 4dd2eecbea5..f8759bb1908 100644 --- a/gcc/c-family/c-attribs.cc +++ b/gcc/c-family/c-attribs.cc @@ -6105,9 +6105,6 @@ handle_target_clones_attribute (tree *node, tree name, tree ARG_UNUSED (args), "single %<target_clones%> attribute is ignored"); *no_add_attrs = true; } - else - /* Do not inline functions with multiple clone targets. */ - DECL_UNINLINABLE (*node) = 1; } else { diff --git a/gcc/d/d-attribs.cc b/gcc/d/d-attribs.cc index 0f7ca10e017..9f67415adb1 100644 --- a/gcc/d/d-attribs.cc +++ b/gcc/d/d-attribs.cc @@ -788,11 +788,6 @@ d_handle_target_clones_attribute (tree *node, tree name, tree, int, warning (OPT_Wattributes, "%qE attribute ignored", name); *no_add_attrs = true; } - else - { - /* Do not inline functions with multiple clone targets. */ - DECL_UNINLINABLE (*node) = 1; - } return NULL_TREE; }
I recently found that target_clones functions cannot inline even when the caller has exactly the same target. However, if we only use target attributes in C++ and let the compiler generate IFUNC for us, the functions with the same target will be inlined. For example, the following code compiled on x86-64 target with -O3 will generate IFUNC for foo and bar and inline foo into the bar: ```cpp __attribute__((target("default"))) int foo(int *arr) { int sum = 0; for (int i=0;i<16;i++) sum += arr[i]; return sum; } __attribute__((target("avx2"))) int foo(int *arr) { int sum = 0; for (int i=0;i<16;i++) sum += arr[i]; return sum; } __attribute__((target("default"))) int bar(int *arr) { return foo(arr); } __attribute__((target("avx2"))) int bar(int *arr) { return foo(arr); } ``` However, if we use target_clones attribute, the target_clones functions will not be inlined: ```cpp __attribute__((target_clones("default","avx2"))) int foo(int *arr) { int sum = 0; for (int i=0;i<16;i++) sum += arr[i]; return sum; } __attribute__((target_clones("default","avx2"))) int bar(int *arr) { return foo(arr); } ``` This behavior may negatively impact performance since the target_clones functions are not inlined. And since we didn't jump to the target_clones functions based on PLT but used the same target as the caller's target. I think it's better to allow the target_clones functions to be inlined. gcc/ada/ChangeLog: * gcc-interface/utils.cc (handle_target_clones_attribute): Allow functions with target_clones attribute to be inlined. gcc/c-family/ChangeLog: * c-attribs.cc (handle_target_clones_attribute): Allow functions with target_clones attribute to be inlined. gcc/d/ChangeLog: * d-attribs.cc (d_handle_target_clones_attribute): Allow functions with target_clones attribute to be inlined. Signed-off-by: Yangyu Chen <chenyangyu@isrc.iscas.ac.cn> --- gcc/ada/gcc-interface/utils.cc | 5 +---- gcc/c-family/c-attribs.cc | 3 --- gcc/d/d-attribs.cc | 5 ----- 3 files changed, 1 insertion(+), 12 deletions(-)