This is an archive of the discontinued LLVM Phabricator instance.

[HIP] Register/unregister device fat binary only once
ClosedPublic

Authored by yaxunl on Jul 9 2018, 8:19 AM.

Details

Summary

HIP generates one fat binary for all devices after linking. However, for each compilation
unit a ctor function is emitted which register the same fat binary. Measures need to be
taken to make sure the fat binary is only registered once.

Currently each ctor function calls __hipRegisterFatBinary and stores the returned value
to __hip_gpubin_handle. This patch changes the linkage of __hip_gpubin_handle to be linkonce
so that they are shared between LLVM modules. Then this patch adds check of value of
__hip_gpubin_handle to make sure __hipRegisterFatBinary is only called once. The code
is equivalent to

void *_gpubin_handle;
void ctor() {
  if (__hip_gpubin_handle == 0) {
    __hip_gpubin_handle = __hipRegisterFatBinary(...);
  }
  // register kernels and variables.
}

The patch also does similar change to dtors so that __hipUnregisterFatBinary
is called once.

Diff Detail

Repository
rC Clang

Event Timeline

yaxunl created this revision.Jul 9 2018, 8:19 AM
tra added a comment.Jul 10 2018, 9:41 AM

HIP generates one fat binary for all devices after linking. However, for each compilation
unit a ctor function is emitted which register the same fat binary.
Measures need to be taken to make sure the fat binary is only registered once.

Are you saying that for HIP there's only one fatbin file with GPU code for the complete host executable, even if it consists of multiple HIP TUs?

lib/CodeGen/CGCUDANV.cpp
460

Given that it's HIP-only code, there will be no cuda.

In D49083#1157568, @tra wrote:

HIP generates one fat binary for all devices after linking. However, for each compilation
unit a ctor function is emitted which register the same fat binary.
Measures need to be taken to make sure the fat binary is only registered once.

Are you saying that for HIP there's only one fatbin file with GPU code for the complete host executable, even if it consists of multiple HIP TUs?

By 'TU' do you mean 'target unit'?

For HIP there is only one fatbin file with GPU code for the complete host executable even if there are mulitple GPU sub-targets. Device code for different sub-targets are bundled together by clang-offload-bundler as one fatbin. Runtime will extract device code for different sub-targets.

yaxunl updated this revision to Diff 154894.Jul 10 2018, 3:08 PM
yaxunl marked an inline comment as done.

clean up function prefix.

In D49083#1157568, @tra wrote:

HIP generates one fat binary for all devices after linking. However, for each compilation
unit a ctor function is emitted which register the same fat binary.
Measures need to be taken to make sure the fat binary is only registered once.

Are you saying that for HIP there's only one fatbin file with GPU code for the complete host executable, even if it consists of multiple HIP TUs?

By 'TU' do you mean 'target unit'?

For HIP there is only one fatbin file with GPU code for the complete host executable even if there are mulitple GPU sub-targets. Device code for different sub-targets are bundled together by clang-offload-bundler as one fatbin. Runtime will extract device code for different sub-targets.

It there are multiple translation units, there will be only one fatbin after linking, since the LLVM modules from different translation units will be linked together to generate one fatbin.

rjmccall added inline comments.Jul 17 2018, 3:22 PM
lib/CodeGen/CGCUDANV.cpp
456

Do you not need to worry about concurrency here?

yaxunl added inline comments.Jul 18 2018, 8:06 AM
lib/CodeGen/CGCUDANV.cpp
456

The ctor functions are executed by the dynamic loader before the program gains the control. The dynamic loader cannot excute the ctor functions concurrently since doing that would not gurantee thread safety of the loaded program. Therefore we can assume sequential execution of ctor functions here.

rjmccall added inline comments.Jul 18 2018, 10:49 AM
lib/CodeGen/CGCUDANV.cpp
456

Okay. That's worth a comment.

Is the name here specified by some ABI document, or is it just a conventional name that we're picking now?

yaxunl added inline comments.Jul 18 2018, 11:01 AM
lib/CodeGen/CGCUDANV.cpp
456

Will add a comment for that.

You mean __hip_gpubin_handle? It is an implementation detail. It is not defined by ABI or other documentation. Since it is only used internally by ctor functions, it is not a visible elf symbol. Its name is by convention since the cuda corresponding one was named __cuda_gpubin_handle.

yaxunl updated this revision to Diff 156184.Jul 18 2018, 4:57 PM
yaxunl marked 4 inline comments as done.

Added comments about thread safety of ctor functions.

Thanks for the comment.

lib/CodeGen/CGCUDANV.cpp
456

Well, it *is* ABI, right? It's necessary for all translation units to agree to use the same symbol here or else the registration will happen multiple times.

yaxunl marked an inline comment as done.Jul 19 2018, 7:28 AM
yaxunl added inline comments.
lib/CodeGen/CGCUDANV.cpp
456

Right. I created a pull request for HIP to document this https://github.com/ROCm-Developer-Tools/HIP/pull/580/files

rjmccall added inline comments.Jul 19 2018, 11:28 AM
lib/CodeGen/CGCUDANV.cpp
456

Okay. Please leave a comment here explaining that this variable's name, size, and initialization pattern are part of the HIP ABI, then.

460

Should you just make GpuBinaryHandle an Address so that you don't have to repeat the alignment assumption over and over?

Also, you should set an alignment on the variable itself.

464

When I'm generating control flow like this, I find it helpful to at least use vertical spacing to separate the blocks, and sometimes I even put all the code within a block in a brace-statement ({ ... }) to more clearly scope the block-local values within the block.

yaxunl marked 5 inline comments as done.Jul 19 2018, 4:16 PM
yaxunl added inline comments.
lib/CodeGen/CGCUDANV.cpp
456

Will do

460

will do

464

will do

yaxunl updated this revision to Diff 156383.Jul 19 2018, 4:28 PM
yaxunl marked 3 inline comments as done.

Revised by John's comments.

rjmccall added inline comments.Jul 19 2018, 6:24 PM
lib/CodeGen/CGCUDANV.cpp
478

I meant more putting all the code for IfBlock in a brace-statement, that kind of thing. It's not as important for the earlier stuff because that actually dominates the rest of your code here.

625

Don't just indent stuff if you're not putting it in a brace-statement. By "vertical space" I meant putting newlines between the emission of the different blocks.

yaxunl marked an inline comment as done.Jul 19 2018, 7:55 PM
yaxunl added inline comments.
lib/CodeGen/CGCUDANV.cpp
625

sorry I misunderstood. will fix.

yaxunl updated this revision to Diff 156424.Jul 19 2018, 8:37 PM
yaxunl marked an inline comment as done.

Revised by John's comments.

yaxunl edited the summary of this revision. (Show Details)Jul 19 2018, 8:38 PM
rjmccall accepted this revision.Jul 19 2018, 11:11 PM

Alright, thanks, LGTM.

This revision is now accepted and ready to land.Jul 19 2018, 11:11 PM
tra accepted this revision.Jul 20 2018, 2:34 PM
This revision was automatically updated to reflect the committed changes.