This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Internalize non-kernel symbols
ClosedPublic

Authored by rampitec on Jan 26 2017, 11:29 PM.

Details

Summary

Since we have no call support and late linking we can produce code
only for used symbols. This saves compilation time, size of the final
executable, and size of any intermediate dumps.

Run Internalize pass early in the opt pipeline followed by global
DCE pass. To enable it RT can pass -amdgpu-whole-program option.

Diff Detail

Repository
rL LLVM

Event Timeline

rampitec created this revision.Jan 26 2017, 11:29 PM
vpykhtin edited edge metadata.Jan 27 2017, 7:49 AM

Please don't forget to add fullcontext diff.

rampitec updated this revision to Diff 86060.Jan 27 2017, 9:00 AM

Rebased to master and merged.
Added full context diff.

I don't see why the backend needs to do this. Why isn't internalize run when the library is linked?

lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
225–226 ↗(On Diff #86060)

Shouldn't this be return false?

arsenm added inline comments.Jan 27 2017, 9:39 AM
lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
88–90 ↗(On Diff #86060)

This isn't a good name (and shouldn't it default to true?). How about amdgpu-internalize-functions/

I don't see why the backend needs to do this. Why isn't internalize run when the library is linked?

How do you propose to run it or anything else with linker?

rampitec marked 2 inline comments as done.Jan 27 2017, 9:44 AM
rampitec added inline comments.
lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
88–90 ↗(On Diff #86060)

It also internalizes global variables, so "functions" here are misleading.
That shall not be on by default. For instance this is not suitable to build library. Instead I will submit a patch to add the option from the OpenCL itself (same applies to HCC).

225–226 ↗(On Diff #86060)

We want to preserve intrinsic declarations, so no, this should return true.

rampitec marked 2 inline comments as done.Jan 27 2017, 9:55 AM
rampitec added inline comments.
lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
88–90 ↗(On Diff #86060)

Do you feel -amdgpu-internalize-symbols is better?

tony-tye added inline comments.Jan 27 2017, 11:31 AM
lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
88–90 ↗(On Diff #86060)

I don't think it should be on by default at all optimization levels as it removes unused global variables and the debugger prefers that all variables remain. So perhaps -O0 (or when -g is specified with the debugger abi, which can be queried from the subtarget by debuggerSupported()) it should not run, or at least the global variable elimination part should not run.

At higher optimization running it by default seems reasonable as it drastically reduces the code object size. But only if the language does not support separate compilation (which is the case currently).

rampitec marked an inline comment as done.Jan 27 2017, 12:07 PM
rampitec added inline comments.
lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
88–90 ↗(On Diff #86060)

This does not run at -O0 (see Builder.OptLevel > 0 check below).
GlobalDCE pass will run in any way, I'm just running it early to save compilation time after internalize.
The main thing, it cannot be on by default. If it is default on it will immediately break all library builds, including those I'm not aware of. I see no problem to add an option to a language driver while building kernels.

rampitec updated this revision to Diff 86095.Jan 27 2017, 12:18 PM
rampitec marked an inline comment as done.

Renamed option to -amdgpu-internalize-symbols.
Moved opt level check higher, away from the lambda.

rampitec marked an inline comment as done.Jan 27 2017, 12:22 PM

Can we move this into clang/ the compiler driver?

lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
90 ↗(On Diff #86095)

Description string needs update

test/CodeGen/AMDGPU/internalize.ll
7–8 ↗(On Diff #86095)

Should this check with various linkage types that should remain visible?

Can we move this into clang/ the compiler driver?

It is in the compiler driver, which is in case of OpenCL roccompiler.cpp/rocprogram.cpp passing the option. Otherwise I'm not sure what is to move here. This is target specific thing triggered by the lack of late linking and call support, therefor it is done from the target even if run in opt. It target could request it in the linker that would be even better, but this not the case currently. The moment these conditions change the use of the option and its existence will change as well.

rampitec marked an inline comment as done.Jan 27 2017, 6:37 PM
rampitec added inline comments.
test/CodeGen/AMDGPU/internalize.ll
7–8 ↗(On Diff #86095)

As long as there is no late linking there are no linkages which should remain visible. Do you have an example?

rampitec updated this revision to Diff 86158.Jan 27 2017, 6:39 PM
rampitec marked 2 inline comments as done.

Updated option description.

tony-tye added inline comments.Jan 27 2017, 11:18 PM
test/CodeGen/AMDGPU/internalize.ll
7–8 ↗(On Diff #86095)

The HSA Runtime can be used to query the address of global symbols so that the host program can access them. Will that still work after this phase?

kzhuravl requested changes to this revision.Jan 27 2017, 11:22 PM
kzhuravl added inline comments.
test/CodeGen/AMDGPU/internalize.ll
7–8 ↗(On Diff #86095)

After this phase, if global symbol is unused, it would be removed. So it wouldn't be present in the object code. Hence hsa runtime won't know about it, and it would be inaccessible.

This revision now requires changes to proceed.Jan 27 2017, 11:22 PM
kzhuravl resigned from this revision.Jan 27 2017, 11:23 PM

Sorry, I clicked the wrong button.

kzhuravl requested changes to this revision. - Please ignore.

rampitec added inline comments.Jan 28 2017, 12:24 AM
test/CodeGen/AMDGPU/internalize.ll
7–8 ↗(On Diff #86095)

That is right, but should it remain visible for RT if unused in kernels? Do we have a test for this?

rampitec added inline comments.Jan 28 2017, 12:29 AM
test/CodeGen/AMDGPU/internalize.ll
7–8 ↗(On Diff #86095)

It will remain untouched and accessible from RT as long as it is used from GPU side. If no kernels or functions they call access such variables they will be eliminated. Do we expect anything different?

tony-tye added inline comments.Jan 28 2017, 12:42 AM
test/CodeGen/AMDGPU/internalize.ll
7–8 ↗(On Diff #86095)

What about used global symbols, do they still have a linkage kind that the HSA Runtime will work with?

7–8 ↗(On Diff #86095)

That sounds reasonable to me. Just wanted to make sure that the HSA Runtime will not be expecting the ELF symbols to have a specific linker kind which will end up being changed by this phase.

rampitec added inline comments.Jan 28 2017, 12:54 AM
test/CodeGen/AMDGPU/internalize.ll
7–8 ↗(On Diff #86095)

Linkage of used symbols is not changed.

BTW, in most cases global variables will not be even dropped. Unlike functions which are internalized based on the calling convention variables are only internalized if totally unused. I.e., in this situation global will not be dropped, because it is used by an unused function:

global T var;
void unused_foo() { use(var); }
kernel void bar() { }

To drop var in this situation we would need to run internalize + global dce twice, so the first pass will remove functions and second will remove remaining variables.

I.e. if a variable is declared in a library module and the same module has a function using this variable (majority of cases), then variable will remain as long as the module is linked in (e.g. when using function is linked as well, even if unused). One pass of internalizer and dce will only remove completely unused and unreferenced global variables.

rampitec marked 9 inline comments as done.Jan 28 2017, 2:40 AM
rampitec requested review of this revision.Jan 28 2017, 2:43 AM
rampitec added a reviewer: arsenm.
This revision is now accepted and ready to land.Jan 30 2017, 12:53 PM
This revision was automatically updated to reflect the committed changes.