This is an archive of the discontinued LLVM Phabricator instance.

[HIP] Support ThinLTO
ClosedPublic

Authored by yaxunl on Mar 31 2021, 1:48 PM.

Details

Summary

Add options -[no-]offload-lto and -foffload-lto=[thin,full] for controlling
LTO for offload compilation. Allow LTO for AMDGPU target.

AMDGPU target does not support codegen of object files containing
call of external functions, therefore the LLVM module passed to
AMDGPU backend needs to contain definitions of all the callees.
An LLVM option is added to allow function importer to import
functions with noinline attribute.

HIP toolchain passes proper LLVM options to lld to make sure
function importer imports definitions of all the callees.

Diff Detail

Event Timeline

yaxunl created this revision.Mar 31 2021, 1:48 PM
yaxunl requested review of this revision.Mar 31 2021, 1:48 PM
Herald added a project: Restricted Project. · View Herald TranscriptMar 31 2021, 1:48 PM
tra accepted this revision.Apr 1 2021, 10:29 AM
tra added a subscriber: tejohnson.

LGTM in general. Please give LTO folks some time to chime in case they have any feedback.

@tejohnson: Just a FYI that we're tinkering with LTO on GPUs here.

clang/include/clang/Driver/Options.td
1948–1951

Should it be BoolFOption ?

clang/lib/Driver/Driver.cpp
612

Leftover debug printout?

clang/lib/Driver/ToolChains/Clang.cpp
4433–4434

Nit: rephrase it as Only AMDGPU supports device-side LTO ?

llvm/test/Transforms/FunctionImport/noinline.ll
5

I'd add a meaningful suffix to the binaries we'll use to run the checks on. E.g %t3 -> %t.lto.bc, %t2 -> %t.inputs.noinline.bc, %t -> %t.main.bc.

This revision is now accepted and ready to land.Apr 1 2021, 10:29 AM

I haven't looked extensively yet, but why import noinline functions? Also, please add a patch description.

yaxunl marked 4 inline comments as done.Apr 5 2021, 8:52 AM

I haven't looked extensively yet, but why import noinline functions? Also, please add a patch description.

AMDGPU backend does not support linking of object files containing external symbols, i.e. one object file calling a function defined in another object file. Therefore the LLVM module passed to AMDGPU backend needs to contain definitions of all callees, even if a callee has noinline attribute. To support backends like this, the function importer needs to be able to import functions with noinline attribute. Therefore we add an LLVM option for allowing that, which is off by default. We have comments at line 70 of HIP.cpp about this.

Will add a patch description.

clang/include/clang/Driver/Options.td
1948–1951

Yes. will fix

clang/lib/Driver/Driver.cpp
612

will remove

clang/lib/Driver/ToolChains/Clang.cpp
4433–4434

will do

llvm/test/Transforms/FunctionImport/noinline.ll
5

will rename %t and %t2 as suggested. However, llvm-lto will postfix the output file name with .thinlto.bc, therefore I would rename %t3 -> %t.summary

I haven't looked extensively yet, but why import noinline functions? Also, please add a patch description.

AMDGPU backend does not support linking of object files containing external symbols, i.e. one object file calling a function defined in another object file. Therefore the LLVM module passed to AMDGPU backend needs to contain definitions of all callees, even if a callee has noinline attribute. To support backends like this, the function importer needs to be able to import functions with noinline attribute. Therefore we add an LLVM option for allowing that, which is off by default. We have comments at line 70 of HIP.cpp about this.

How does a non-LTO build work, or is (full) LTO currently required? Because with ThinLTO we only import functions that are externally defined but referenced in the current module. Also, when ThinLTO imports functions it makes them available_externally, which means they are dropped and made external symbols again after inlining. So anything imported but not inlined will go back to being an external symbol.

yaxunl updated this revision to Diff 335268.Apr 5 2021, 9:01 AM
yaxunl marked 4 inline comments as done.
yaxunl edited the summary of this revision. (Show Details)

Revise by Artem's comments. Add patch description.

yaxunl added a comment.Apr 5 2021, 9:19 AM

I haven't looked extensively yet, but why import noinline functions? Also, please add a patch description.

AMDGPU backend does not support linking of object files containing external symbols, i.e. one object file calling a function defined in another object file. Therefore the LLVM module passed to AMDGPU backend needs to contain definitions of all callees, even if a callee has noinline attribute. To support backends like this, the function importer needs to be able to import functions with noinline attribute. Therefore we add an LLVM option for allowing that, which is off by default. We have comments at line 70 of HIP.cpp about this.

How does a non-LTO build work, or is (full) LTO currently required? Because with ThinLTO we only import functions that are externally defined but referenced in the current module. Also, when ThinLTO imports functions it makes them available_externally, which means they are dropped and made external symbols again after inlining. So anything imported but not inlined will go back to being an external symbol.

AMDGPU backend by default uses full LTO for linking. It does not support non-LTO linking. Currently, it inlines all functions except kernels. However we want to be able to be able not to inline all functions. Is it OK to add an LLVM option to mark imported functions as linkonce_odr so that AMDGPU backend can keep the definitions of the imported functions?

jansvoboda11 added inline comments.Apr 6 2021, 1:31 AM
clang/include/clang/Driver/Options.td
1948–1951

The BoolFOption marshalling multiclass should be only used for flags where either the positive or the negative (or both) are -cc1 options and map to a field in CompilerInvocation.

Since this patch only deals with the driver (not the cc1 frontend) using BoolFOption is not correct. Please, revert this change to the previous state.

I might need to explicitly call this out in the documentation https://clang.llvm.org/docs/InternalsManual.html#adding-new-command-line-option.

yaxunl marked an inline comment as done.Apr 6 2021, 9:57 AM

I haven't looked extensively yet, but why import noinline functions? Also, please add a patch description.

AMDGPU backend does not support linking of object files containing external symbols, i.e. one object file calling a function defined in another object file. Therefore the LLVM module passed to AMDGPU backend needs to contain definitions of all callees, even if a callee has noinline attribute. To support backends like this, the function importer needs to be able to import functions with noinline attribute. Therefore we add an LLVM option for allowing that, which is off by default. We have comments at line 70 of HIP.cpp about this.

How does a non-LTO build work, or is (full) LTO currently required? Because with ThinLTO we only import functions that are externally defined but referenced in the current module. Also, when ThinLTO imports functions it makes them available_externally, which means they are dropped and made external symbols again after inlining. So anything imported but not inlined will go back to being an external symbol.

AMDGPU backend by default uses full LTO for linking. It does not support non-LTO linking. Currently, it inlines all functions except kernels. However we want to be able to be able not to inline all functions. Is it OK to add an LLVM option to mark imported functions as linkonce_odr so that AMDGPU backend can keep the definitions of the imported functions?

Actually AMDGPU backend will internalize all non-kernel functions before codegen. Those functions with available_externally linkage will have internal linkage before codegen, therefore they will not be dropped.

clang/include/clang/Driver/Options.td
1948–1951

will do

yaxunl updated this revision to Diff 335571.Apr 6 2021, 10:07 AM
yaxunl marked an inline comment as done.

revert the change about option

tra added inline comments.Apr 6 2021, 10:23 AM
clang/include/clang/Driver/Options.td
1948–1951

<off-topic for the patch>

@jansvoboda11 Thank you for the explanation. Updating the docs would indeed be useful.

I would also suggest describing the restrictions next to the BoolFOption definition. Developers tend to read the sources way more often than the docs, and the comments source code make it look like a general-purpose multiclass for any boolean -f option.

Would it also be possible to add some sort of compile-time safeguards to enforce intended constraints on the CLI tablegen?

I haven't looked extensively yet, but why import noinline functions? Also, please add a patch description.

AMDGPU backend does not support linking of object files containing external symbols, i.e. one object file calling a function defined in another object file. Therefore the LLVM module passed to AMDGPU backend needs to contain definitions of all callees, even if a callee has noinline attribute. To support backends like this, the function importer needs to be able to import functions with noinline attribute. Therefore we add an LLVM option for allowing that, which is off by default. We have comments at line 70 of HIP.cpp about this.

How does a non-LTO build work, or is (full) LTO currently required? Because with ThinLTO we only import functions that are externally defined but referenced in the current module. Also, when ThinLTO imports functions it makes them available_externally, which means they are dropped and made external symbols again after inlining. So anything imported but not inlined will go back to being an external symbol.

AMDGPU backend by default uses full LTO for linking. It does not support non-LTO linking. Currently, it inlines all functions except kernels. However we want to be able to be able not to inline all functions. Is it OK to add an LLVM option to mark imported functions as linkonce_odr so that AMDGPU backend can keep the definitions of the imported functions?

Actually AMDGPU backend will internalize all non-kernel functions before codegen. Those functions with available_externally linkage will have internal linkage before codegen, therefore they will not be dropped.

This raises some higher level questions for me:

First, how will you deal with other corner cases that won't or cannot be imported right now? While enabling importing of noinline functions and cranking up the threshold will get the majority of functions imported, there are cases that we still won't import (functions/vars that are interposable, certain funcs/vars that cannot be renamed, most non-const variables with non-trivial initializers).

Second, force importing of everything transitively referenced defeats the purpose of ThinLTO and would probably make it worse than regular LTO. The main entry module will need to import everything transitively referenced from there, so everything not dead in the binary, which should make that module post importing equivalent to a regular LTO module. In addition, every other module needs to transitively import everything referenced from those modules, making them very large depending on how many leaf vs non-leaf functions and variables they contain. What is the goal of doing ThinLTO in this case?

yaxunl added a comment.Apr 6 2021, 2:48 PM

This raises some higher level questions for me:

First, how will you deal with other corner cases that won't or cannot be imported right now? While enabling importing of noinline functions and cranking up the threshold will get the majority of functions imported, there are cases that we still won't import (functions/vars that are interposable, certain funcs/vars that cannot be renamed, most non-const variables with non-trivial initializers).

We will document the limitation of thinLTO support of HIP toolchain and recommend users not to use thinLTO in those corner cases.

Second, force importing of everything transitively referenced defeats the purpose of ThinLTO and would probably make it worse than regular LTO. The main entry module will need to import everything transitively referenced from there, so everything not dead in the binary, which should make that module post importing equivalent to a regular LTO module. In addition, every other module needs to transitively import everything referenced from those modules, making them very large depending on how many leaf vs non-leaf functions and variables they contain. What is the goal of doing ThinLTO in this case?

The objective is to improve optimization/codegen time by using multi-threads of thinLTO. For example, I have 10 modules each containing a kernel. In full LTO linking, I get one big module containing 10 kernels with all functions inlined, and I have one thread for optimization/codegen. With thinLTO, I get one kernel in each module, with all functions inlined. AMDGPU internalization and global DCE will remove functions not used by that kernel in each module. I will get 10 threads, each doing optimization/codegen for one kernel. Theoretically, there could be 10 times speed up.

This raises some higher level questions for me:

First, how will you deal with other corner cases that won't or cannot be imported right now? While enabling importing of noinline functions and cranking up the threshold will get the majority of functions imported, there are cases that we still won't import (functions/vars that are interposable, certain funcs/vars that cannot be renamed, most non-const variables with non-trivial initializers).

We will document the limitation of thinLTO support of HIP toolchain and recommend users not to use thinLTO in those corner cases.

Second, force importing of everything transitively referenced defeats the purpose of ThinLTO and would probably make it worse than regular LTO. The main entry module will need to import everything transitively referenced from there, so everything not dead in the binary, which should make that module post importing equivalent to a regular LTO module. In addition, every other module needs to transitively import everything referenced from those modules, making them very large depending on how many leaf vs non-leaf functions and variables they contain. What is the goal of doing ThinLTO in this case?

The objective is to improve optimization/codegen time by using multi-threads of thinLTO. For example, I have 10 modules each containing a kernel. In full LTO linking, I get one big module containing 10 kernels with all functions inlined, and I have one thread for optimization/codegen. With thinLTO, I get one kernel in each module, with all functions inlined. AMDGPU internalization and global DCE will remove functions not used by that kernel in each module. I will get 10 threads, each doing optimization/codegen for one kernel. Theoretically, there could be 10 times speed up.

That will work as long as there are no dependence edges anywhere between the kernels. Is this a library that has a bunch of totally independent kernels only called externally?

yaxunl added a comment.Apr 6 2021, 8:01 PM

This raises some higher level questions for me:

First, how will you deal with other corner cases that won't or cannot be imported right now? While enabling importing of noinline functions and cranking up the threshold will get the majority of functions imported, there are cases that we still won't import (functions/vars that are interposable, certain funcs/vars that cannot be renamed, most non-const variables with non-trivial initializers).

We will document the limitation of thinLTO support of HIP toolchain and recommend users not to use thinLTO in those corner cases.

Second, force importing of everything transitively referenced defeats the purpose of ThinLTO and would probably make it worse than regular LTO. The main entry module will need to import everything transitively referenced from there, so everything not dead in the binary, which should make that module post importing equivalent to a regular LTO module. In addition, every other module needs to transitively import everything referenced from those modules, making them very large depending on how many leaf vs non-leaf functions and variables they contain. What is the goal of doing ThinLTO in this case?

The objective is to improve optimization/codegen time by using multi-threads of thinLTO. For example, I have 10 modules each containing a kernel. In full LTO linking, I get one big module containing 10 kernels with all functions inlined, and I have one thread for optimization/codegen. With thinLTO, I get one kernel in each module, with all functions inlined. AMDGPU internalization and global DCE will remove functions not used by that kernel in each module. I will get 10 threads, each doing optimization/codegen for one kernel. Theoretically, there could be 10 times speed up.

That will work as long as there are no dependence edges anywhere between the kernels. Is this a library that has a bunch of totally independent kernels only called externally?

There are no dependence edges between the kernels since they cannot call each other. The HIP device compilation output is always a shared library which contains multiple independent kernels which can be launched by a HIP program.

Any other concerns? Thanks.

Any other concerns? Thanks.

I have some concerns around the fragility of this, for the reasons I mentioned earlier where it may not always be able to import everything needed.

We will document the limitation of thinLTO support of HIP toolchain and recommend users not to use thinLTO in those corner cases.

Where will this be documented?

My concern is that we start getting bugs filed for these corner cases, and it burns a bunch of someone's time to dig in only to discover that it is an unsupported corner case. Since we can detect in the function importer when we cannot import something successfully, I think it would therefore be worthwhile to issue a hard error with a meaningful error message in the AMDGPU case.

clang/lib/Driver/ToolChains/Clang.cpp
4434

Should there be an error (or is there one already) emitted somewhere if LTO is requested along with device offloading and this isn't AMDGPU?

tejohnson requested changes to this revision.Apr 12 2021, 8:58 AM

To do what I suggested in the prior comment, you'd probably want to add a new index-wide flag (since we don't read IR in the thin link). See for example how EnableSplitLTOUnit is set and used. You could add a flag like ForceImportAll or something like that. Then you don't necessarily even need to bump up the importing threshold or add the new import-noinline flag. Just key off of that in the importer to try to force import everything. If something cannot be imported, fail with a clear error.

This revision now requires changes to proceed.Apr 12 2021, 8:58 AM
yaxunl marked an inline comment as done.May 7 2021, 9:01 AM

To do what I suggested in the prior comment, you'd probably want to add a new index-wide flag (since we don't read IR in the thin link). See for example how EnableSplitLTOUnit is set and used. You could add a flag like ForceImportAll or something like that. Then you don't necessarily even need to bump up the importing threshold or add the new import-noinline flag. Just key off of that in the importer to try to force import everything. If something cannot be imported, fail with a clear error.

will do

clang/lib/Driver/ToolChains/Clang.cpp
4434

yes. will do

yaxunl updated this revision to Diff 343696.May 7 2021, 9:07 AM
yaxunl marked an inline comment as done.

revised by Teresa's comments

To do what I suggested in the prior comment, you'd probably want to add a new index-wide flag (since we don't read IR in the thin link). See for example how EnableSplitLTOUnit is set and used. You could add a flag like ForceImportAll or something like that. Then you don't necessarily even need to bump up the importing threshold or add the new import-noinline flag. Just key off of that in the importer to try to force import everything. If something cannot be imported, fail with a clear error.

will do

I noticed you implemented with an internal error rather than a flag in the index. I think this is ok for now, especially if the support will eventually be removed because the linker will support external functions as noted in your TODO (note in order to do this in the index you would need to set up the flag during the compile step, not the linker invocation as you are doing here, which has some advantages if this will persist longer term). I have a suggestion about the error detection below, so that you can report errors earlier along with the failure reason.

llvm/lib/Transforms/IPO/FunctionImport.cpp
494–507

Probably better to issue an error here with the import failure reason?

yaxunl marked an inline comment as done.May 12 2021, 7:00 AM
yaxunl added inline comments.
llvm/lib/Transforms/IPO/FunctionImport.cpp
494–507

My understanding is that the import failure reason is only available if PrintImportFailures is enabled. Also it can only print the GUID and can not print the failed callee name since it is not available, therefore the information is cryptic. It seems to me the current error msg at line 1332 is more suitable for common users. For compiler developers, they can enable PrintImportFailures and see the reason of failed imports.

tejohnson added inline comments.May 12 2021, 7:26 AM
llvm/lib/Transforms/IPO/FunctionImport.cpp
494–507

selectCallee always sets the Reason. And we have the name in addition to the GUID in normal circumstances (linking from modules). It would only not be available in certain debugging situations (e.g. linking from an existing combined module with llvm-lto). Also, by failing here, you don't need to wait until the LTO backends to issue the error, so it fails a little earlier.

yaxunl marked 2 inline comments as done.May 12 2021, 1:43 PM
yaxunl added inline comments.
llvm/lib/Transforms/IPO/FunctionImport.cpp
494–507

I see. Will do.

yaxunl updated this revision to Diff 344949.May 12 2021, 1:50 PM
yaxunl marked an inline comment as done.

Revised by Teresa's comments

This revision is now accepted and ready to land.May 21 2021, 8:29 AM
This revision was automatically updated to reflect the committed changes.
Herald added a project: Restricted Project. · View Herald TranscriptMay 22 2021, 7:49 AM