This is an archive of the discontinued LLVM Phabricator instance.

[OpenMP][AMDGPU] Link bitcode ROCm device libraries per-TU
ClosedPublic

Authored by jhuber6 on Sep 12 2022, 2:25 PM.

Details

Summary

Previously, we linked in the ROCm device libraries which provide math
and other utility functions late. This is not stricly correct as this
library contains several flags that are only set per-TU, such as fast
math or denormalization. This patch changes this to pass the bitcode
libraries per-TU using the same method we use for the CUDA libraries.
This has the advantage that we correctly propagate attributes making
this implementation more correct. Additionally, many annoying unused
functions were not being fully removed during LTO. This lead to
erroneous warning messages and remarks on unused functions.

I am not sure if not finding these libraries should be a hard error. let
me know if it should be demoted to a warning saying that some device
utilities will not work without them.

Diff Detail

Event Timeline

jhuber6 created this revision.Sep 12 2022, 2:25 PM
Herald added a project: Restricted Project. · View Herald TranscriptSep 12 2022, 2:25 PM
jhuber6 requested review of this revision.Sep 12 2022, 2:25 PM
Herald added a project: Restricted Project. · View Herald TranscriptSep 12 2022, 2:25 PM

We can do this but should expect an increase in code size from having multiple internalised copies of the same function. There may be an incidental benefit if we can specialise some functions to the call site without additional cloning. Address of the same functions from different TUs will be inequal, which is wrong, but probably doesn't matter in practice.

It does have the major advantage that mlink-builtin-bitcode patches up the invalid IR on the fly, which is likely easier than changing the device libs or making IR mcpu-agnostic.

clang/lib/Driver/ToolChains/AMDGPUOpenMP.h
58

Why hip device libs? There's a common set, plus a hip.bc plus a opencl.bc. I'd expect us to want only the common set. Hip.bc shouldn't have non-hip stuff in it.

jhuber6 added inline comments.Sep 12 2022, 3:41 PM
clang/lib/Driver/ToolChains/AMDGPUOpenMP.h
58

Existing virtual function, just re-used it.

We can do this but should expect an increase in code size from having multiple internalised copies of the same function. There may be an incidental benefit if we can specialise some functions to the call site without additional cloning. Address of the same functions from different TUs will be inequal, which is wrong, but probably doesn't matter in practice.

It does have the major advantage that mlink-builtin-bitcode patches up the invalid IR on the fly, which is likely easier than changing the device libs or making IR mcpu-agnostic.

It will probably decrease code size in the final executable now that this will forcefully internalize all the protected functions in ocml.bc that were sticking around because LTO couldn't remove them due to visibility.

clang/lib/Driver/ToolChains/AMDGPUOpenMP.h
58

Rename it rocm perhaps?

jhuber6 updated this revision to Diff 459716.Sep 13 2022, 4:57 AM

Renaming virtual function to make it more generic.

jhuber6 updated this revision to Diff 459721.Sep 13 2022, 5:34 AM

Adding a test for using -nogpulib.

yaxunl added inline comments.Sep 13 2022, 7:51 AM
clang/lib/Driver/ToolChains/AMDGPU.cpp
717 ↗(On Diff #459721)

can we emit an error if both -march and -mcpu are specified and the values are different? This is a potential source of error.

jhuber6 added inline comments.Sep 13 2022, 8:10 AM
clang/lib/Driver/ToolChains/AMDGPU.cpp
717 ↗(On Diff #459721)

I'm not exactly sure about the semantics for this, maybe someone else can help. AFAIK -mcpu is sometimes treated as an alias of -mtune which is generally implied by -march. But having -march and -mtune state different things isn't technically disallowed. -march seems to imply that we generate code that can run on a certain architecture. Since these are in some respects equivalent it would probably be fine to just take the last one specified.

Does this fix the weird behavior where you needed to use -lm to use anything in the device libraries? I don't see that being removed

Does this fix the weird behavior where you needed to use -lm to use anything in the device libraries? I don't see that being removed

That was removed earlier when these files were just sent directly to the link job, you can see that removed from this patch.

yaxunl added inline comments.Sep 13 2022, 8:28 AM
clang/lib/Driver/ToolChains/AMDGPU.cpp
717 ↗(On Diff #459721)

Some toolchains use -march to specify the target processor, some toolchains use -mcpu to specify the target processor. For AMDGPU toolchain, it uses -mcpu to specify the target processor in the beginning for OpenCL. Then we add HIP support and let HIPAMD toolchain adopt using -mcpu to specify the target processor. It is better to have a consistent way for specifying target processor for AMDGPU toolchains.

HIPAMD toolchain does this by https://github.com/llvm/llvm-project/blob/main/clang/lib/Driver/ToolChains/HIPAMD.cpp#L275 .

Probably AMDGPUOpenMP toolchain can make a similar change? Then we have a consistent way to specify target processor for AMDGPU toolchains.

jhuber6 added inline comments.Sep 13 2022, 8:39 AM
clang/lib/Driver/ToolChains/AMDGPU.cpp
717 ↗(On Diff #459721)

The rest of the OpenMP toolchain uses -march so I don't see a compelling reason to change it. If this is a stopper I'll just remove the changes to this function and check -march directly in the new getROCmDeviceLibs function for the OpenMP toolchain.

yaxunl added inline comments.Sep 13 2022, 9:29 AM
clang/lib/Driver/ToolChains/AMDGPU.cpp
717 ↗(On Diff #459721)

There are a bunch of places in AMDGPU and HIPAMD toolchains that check the -mcpu option. If we allow -march as an alternative way to specify the target processor for the AMDGPU toolchain, it will introduce inconsistency and probably incur more changes.

Probably leaving this code in AMDGPUOpenMP toolchain is a better choice for now.

march for openmp, mcpu for hip seems ok. Notably llc needs to be told this as well, using mcpu, which may be an issue for save-temps

jhuber6 updated this revision to Diff 459786.Sep 13 2022, 9:44 AM

Removing use of getGPUArch and just using -march directly for OpenMP

yaxunl added inline comments.Sep 13 2022, 12:49 PM
clang/lib/Driver/ToolChains/AMDGPU.cpp
720–722 ↗(On Diff #459786)

It seems the code still there.

jhuber6 added inline comments.Sep 13 2022, 12:50 PM
clang/lib/Driver/ToolChains/AMDGPU.cpp
720–722 ↗(On Diff #459786)

Sorry thought I reverted this file. I'll fix it.

jhuber6 updated this revision to Diff 459848.Sep 13 2022, 12:51 PM

Removing old function update for 'mcpu'

yaxunl added inline comments.Sep 13 2022, 1:44 PM
clang/include/clang/Driver/ToolChain.h
719

HIPSPV toolchain is not implemented on ROCm platform.

Probably we need to make it even more generic. How about getDeviceLibs ?

Also, you need to update the comment.

jhuber6 updated this revision to Diff 459874.Sep 13 2022, 2:29 PM

Changing interface to getAMDGPUDeviceLibs.

yaxunl added inline comments.Sep 13 2022, 3:39 PM
clang/include/clang/Driver/ToolChain.h
719

well, HIPSPV toolchain is not for AMDGPU. My thinking is that this API is to get device libraries for the toolchain, not for a specific target or platform.

jhuber6 added inline comments.Sep 13 2022, 3:48 PM
clang/include/clang/Driver/ToolChain.h
719

I guess the problem is that this option can't be totally generic since it's in the top-level Toolchain namespace. Otherwise it would get confused with CUDA's device library. What's a good common name to use for it?

yaxunl added inline comments.Sep 13 2022, 4:06 PM
clang/include/clang/Driver/ToolChain.h
719

This API return device libs for the toolchain.

For AMDGPU, HIPAMD, AMDGPUOpenMP toolchains, it returns device libs for AMDGPU.

For HIPSPV toolchain, it returns device libs for HIP for Intel GPUs.

If a CUDA toolchain decides to support this API, it will return device libs for CUDA.

If a toolchain needs to return device libs for multiple targets, we can add parameters to control that, although so far there is no such need.

Therefore, I don't see issue to call it getDeviceLibs.

jhuber6 updated this revision to Diff 459911.Sep 13 2022, 4:09 PM

Changing to getDeviceLibs. I suppose in the future we could make this work for CUDA, but for now it won't be defined for that toolchain so it's fine.

JonChesterfield accepted this revision.Sep 14 2022, 6:43 AM

I don't like this but will concede it's quicker than changing device libs to contain IR that doesn't have to be patched on the fly.

This revision is now accepted and ready to land.Sep 14 2022, 6:43 AM