The AMDGPU library uses several control constants to change code paths
for the math functions and intrinsics. These are normally included using
several individual bitcode libraries at link time. However, this is
problematic because it requires us to know the AMDGPU architecture at
link time which should not be strictly necessary. This patch adds new
code that emits the constants that would normally be included by the
bitcode libraries. This removes around six libraries we would otherwise
need to include and now we can link these libraries in unconditionally
like we do with libdevice.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
Let me know if I should move this code somewhere else, or if there are problems. One change I made is that the constant is weak_odr and hidden instead of linkonce_odr and protected. This is so this constant is alive until link time, AMDGPU pretty much always uses LTO so these should be optimized out when we internalize symbols. I'm assuming we don't need protected visibility as these shouldn't be read from another executable (e.g. the host).
Tagging Brian as the code owner of rocm device libs - emitting these in clang would simplify that library.
Currently clang reads these commandline flags and conditionally links in bitcode files to introduce these symbols. There's existing command line flags for controlling which files are linked. I think this patch should probably use the existing flags to choose which values to set and delete the existing handling.
As written I think this is a no op, in that the libraries will currently be linked anyway and override the symbols clang has injected
I've thought that directly emitting these constants would be better. This will also make it so you can't try to continue using llvm-link for these libraries, which is a plus since it doesn't have the same necessary attribute propagation clang does when linking these
Yeah, I wasn't sure if I should do some scan to check if we actually need these. Basically just check if any function declarations start with __ocml. But that might untenable in the future as we try to move to a generic math library that doesn't eagerly emit target specific declarations in clang.
A safer bet is to use the current control flow that links in specific bitcode files, but create the global directly instead of linking in the file. That'll give us zero semantic change and a clang that ignores those bitcode files if present.
Do we expect those libraries to be linked per-TU via -mlink-builtin-bitcode? The usage I see passes them to lld directly.
There is no constant propagation for globals with weak linage, right? Otherwise, it won't work. My concern is that there may be optimization passes which do not respect the weak linkage and uses the incorrect default value for OpenCL or HIP. Therefore I am not very confident to enable this for OpenCL or HIP unless these variables have the correct value based on the compilation options.
clang/lib/CodeGen/TargetInfo.cpp | ||
---|---|---|
9461 | should be determined by the code object version option. | |
clang/test/CodeGen/amdgcn-occl-constants.c | ||
8 ↗ | (On Diff #445849) | need a check for __oclc_wavefrontsize64=0 for gfx1030 |
Yes, the problem is that linkonce_odr can be removed and as-such isn't usable for linking libraries late like we want to. You are correct that weak_odr normally cannot be propagated as another TU could potentially change it, but if we're linking this via LTO like AMDGPU does it should always be internalized in the linker. The OpenMP runtime has a similar weak_odr variable that gets internalized when we do LTO so it should apply here as well. Although my assumption is that AMDGPU always feeds bitcode directly to either lld or clang-linker-wrapper without invoking llc manually, I may be wrong there.
clang/lib/CodeGen/TargetInfo.cpp | ||
---|---|---|
9461 | Yes I wasn't sure about this one. Could you elaborate where we derive that? |
Instead of weak_odr we could probably use add this to compiler used instead if that's an issue.
It depends where we want to do the linking. For my purposes I'd like to be able to link in these libraries at link time. This allows us to link in target specific libraries as-needed so we can make generated code more generic until linking or the backend. The problem with linkonce_odr is that it does not need to emit a symbol, so it will usually be optimized out by clang. E.g. the following won't work because these generated globals will be optimized out completely before we have any library to use them.
clang amdgpu.c -c -O3 clang amdgpu.o <link ocml.bc>
I think I understand what you're saying better now. We should instead have this controlled as a flag via clang that the driver will add. This will just tell us to trigger some backend utility to emit the same code. I can look into doing that, will make it easier to just have the clang driver state that we should emit this for HIP / OpenMP unless nogpulib is passed for example.
The current patch does not consider HIP/OpenCL compile options, therefore the value of these variables are not correct for OpenCL/HIP. They need to be overridden by the variables with the same name in device libraries by clang through -mlink-builtin-bitcode.
If the patch check HIP/OpenCL compilation options to set the correct value for these variables, then it does not need weak linkage.
clang/lib/CodeGen/TargetInfo.cpp | ||
---|---|---|
9461 |
CGM.getTarget().getTargetOpts().CodeObjectVersion |
Is we instead add it to compiler.used it should be propagated while staying alive for the linker https://godbolt.org/z/MG5n1MWWj. The downside is that this symbol will not be removed and a symbol to it will live in the binary. The symbol will have weak binding, so it won't cause any linker errors. But it's a little annoying to have things stick around like that. I'm considering making this code generation be controlled by a clang driver flag so we could potentially change behavior as needed there.
Adjusting, adding code generation options for the other constants and changing to use linkonce ODR linkage.
I attempted to follow Jon's suggestion and group it with the existing code. but all the existing handling for this occurs in the driver. So I don't think there's a convenient way to drop in this functionality without adding a new function as in this patch.
clang/lib/CodeGen/TargetInfo.cpp | ||
---|---|---|
9436 | This does not support per-TU control variables. Probably should use internal linkage. | |
clang/lib/Frontend/CompilerInvocation.cpp | ||
1679–1682 ↗ | (On Diff #453021) | For OpenCL, it should be determined by options::OPT_cl_denorms_are_zero |
clang/test/CodeGen/amdgcn-control-constants.c | ||
8 | need a test for -target-cpu gfx1030 -target-feature +wavefrontsize64 and check __oclc_wavefrontsize64 to be 1. | |
9 | need an OpenCL test for -cl-denorms-are-zero | |
11 | need OpenCL tests for -cl-finite-math-only and -cl-fast-relaxed-math | |
12 | need OpenCL tests for -cl-unsafe-math-optimizations and -cl-fast-relaxed-math | |
13 | need an OpenCL test for -cl-fp32-correctly-rounded-divide-sqrt. If it needs CodeGenOpt you may need to re-use the option for HIP. |
Updating. I realized all of the math-related ones are already covered by driver options for AMDGPU passing the appropriate fp contract to the frontend. This patch gets rid of most of that handling and just uses those directly. Also makes it easier to test.
We also check if the +wavefront64 feature was explicitly turned on as part of @yaxunl's suggestion.
clang/lib/CodeGen/TargetInfo.cpp | ||
---|---|---|
9436 | The AMDGPU device libraries use linkone_odr so I figured it was the most appropriate here. It should mean that we can have multiple identical definitions and they don't clash. There's also no requirement for these to be emitted as symbols AFAIK. |
clang/lib/CodeGen/TargetInfo.cpp | ||
---|---|---|
9436 |
clang uses -mlink-builtin-bitcode to link these device libraries for HIP and OpenCL. Then the linkage of these variables becomes internal linkage. That's why it works for per-TU control. |
clang/lib/CodeGen/TargetInfo.cpp | ||
---|---|---|
9436 |
You may let HIP and OpenCL use internal linkage and C/C++/OpenMP use linkonce_odr since only HIP and OpenCL toolchain use -mlink-builtin-bitcode to link these device libraries |
clang/lib/CodeGen/TargetInfo.cpp | ||
---|---|---|
9436 | I see, linkonce_odr implies that these should all have the same value which isn't necessarily true after linking. I'll change it to use private linkage. OpenMP right now links everything late which means that we don't allow these to be defined differently per-TU. This may be incorrect given this new method as each TU will have different things set. I can change OpenMP to use the mlink method after this patch which may be more strictly correct. |
Changing to private linkage.
For OpenMP we could either make this use weak_odr so we have a single
definition surviving until link time for us to use. Or we could change OpenMP to
link in the bitcode libraries per-TU via -mlink-builtin-bitcode.
If you want to overwrite them, weak/linkonce will work (no _odr). Private/internal will not be overwritten but existing uses will keep the private/internal version, IIRC. I assume you want the former.
clang/lib/CodeGen/TargetInfo.cpp | ||
---|---|---|
9436 |
On second thoughts, the idea for letting clang to emit these control variables might not work for HIP and OpenCL. The reason is that to support per-TU control variables, these variables need to be internal or private linkage, however, that means they cannot be used by other device library functions which are expecting non-internal linkage for them. Those device library functions will end up using control variables from device library bitcode any way. For OpenMP, it may be necessary to emit them as linkonce_odr, otherwise device library functions may not find them. |
clang/lib/CodeGen/TargetInfo.cpp | ||
---|---|---|
9436 |
Right now we include each file per-TU using -mlink-builtin-bitcode which converts linkonce_odr to private linkage. Shouldn't this be equivalent? It may be possible to make some test showing a user of these constants to verify they get picked up correctly. If you're worried about these getting removed we may be able to stash them in compiler.used, that shouldn't impede the necessary constant propagation. Side note, OpenCL seems to optimize these out without -disable-llvm-optzns while HIP will not. Does OpenCL use some mandatory passes to ensure that these control variables get handled? This method of using control constants in general is somewhat problematic as it hides invalid code behind some mandatory CP and DCE passes. For OpenMP right now we just generate one version for each architecture, which is wasteful but somewhat easier to work with. |
clang/lib/CodeGen/TargetInfo.cpp | ||
---|---|---|
9436 |
Let's assume the main program calls foo() and foo() uses a control variable bar. foo() is in a bitcode linked in by -mlink-builtin-bitcode. clang emits the control variable bar with private linkage in the main module. When clang tries to link foo(), it needs to resolve bar, but it cannot use the bar in the main module because bar has private linkage. Then bar becomes unresolved.
Are you using clang -cc1 without other options? There are LLVM passes by default, but they should not depend on language. You can see which pass is removing them by -mllvm -print-after-all. |
clang/lib/CodeGen/TargetInfo.cpp | ||
---|---|---|
9436 |
This is a good point, we link only used definitions when using -mlink-builtin-bitcode. I think we link via -mlink-builtin-bitcode prior to running the backend, so this will be after we create these definitions. In this case we will import a definition like the following: @__oclc_wavefrontsize64 = external local_unnamed_addr addrspace(4) constant i8, align 1 which has the same name, but cannot bind to the private variable. I think this is what linkonce linkage is supposed to provide, but I'm not overly familiar with the semantics.
The sample tests in this patch show the -x hip version does not require -disable-llvm-optzns while the -x cl version does. |
Changing to linkonce linkage. According to the LLVM spec this should have the
expected behaviour where a single definition is kept at link-time for each
module. I tested this with a sample HIP program and it had the desired
behaviour. I could add a test attempting to show this if needed.
clang/lib/CodeGen/TargetInfo.cpp | ||
---|---|---|
9449–9450 | Do we really have to scan through the features too? This seems broken | |
9455 | s/DenormAtZero/DenormAreZero/? | |
9458 | or doesn't look right. finite only is no infinities and no nans (not sure why the library control merges the two) | |
9473–9475 | These probably should use linkonce_odr |
clang/lib/CodeGen/TargetInfo.cpp | ||
---|---|---|
9468–9472 | we need to disable emitting these variables for HIP -fgpu-rdc mode and OpenCL since they will break per-TU control variable. Other cases are OK. |
clang/lib/CodeGen/TargetInfo.cpp | ||
---|---|---|
9468 | wavefrontsize belongs with the system ones |
clang/lib/CodeGen/TargetInfo.cpp | ||
---|---|---|
9468–9472 | But the code would still depend on these and they wouldn't be present right |
clang/lib/CodeGen/TargetInfo.cpp | ||
---|---|---|
9468 |
You are right. __oclc_wavefrontsize64 should always be emitted with linkonce_odr linkage since they need to be consistent among TU's. Therefore they should always be emitted. __oclc_daz_opt, __oclc_finite_only_opt, __oclc_unsafe_math_opt, and __oclc_correctly_rounded_sqrt32 can be different per TU, therefore they should not be emitted for HIP -fgpu-rdc and OpenCL. |
clang/lib/CodeGen/TargetInfo.cpp | ||
---|---|---|
9468 | I'm still unsure, if we do not emit any of those control variables how will we use the device libraries for those builds. |
clang/lib/CodeGen/TargetInfo.cpp | ||
---|---|---|
9468 |
In those cases, we will use -mlink-builtin-bitcode to get those variables from device libs, as we did before. They will have internal linkage after linking, therefore are per-TU. |
Adding an extra check in CodeGenAction.cpp that forcibly internalizes these if we link in any modules in RDC mode. This is a considerable hack, but should solve the problem. It's not a great solution, so let me know if you think that this is a leser evil than linking in many bitcode files as we do now.
To reiterate, what this patch offers is.
+ Reduces number of files needed to link in, (no on/off files, only ocml.bc and ockl.bc are needed).
+ Enforces that the architecture constants are the same across the compilation
And I think negatively,
- Requires a hack to internalize some variables to prevent linking problems
- Some extra code in Clang
The best solution would be to handle these per-TU variables in the backend. Or maybe even all of these could be placed in the backend where the code paths that currently require a control constant could be a simple attribute that the backend will use to control code emission.
I'd prefer to avoid spreading special treatment of the device libraries into the backend. The contract is poorly defined and spread around too much as it is
clang/lib/CodeGen/CodeGenAction.cpp | ||
---|---|---|
299–308 ↗ | (On Diff #462948) | need a test. Probably let clang generate a bitcode containing a function using these control vars, then link the bitcode by -mlink-builtin-bitcode, then check the linkage of these control vars. |
clang/test/CodeGen/amdgcn-link-control-constants.c | ||
---|---|---|
2–3 ↗ | (On Diff #463042) | This test should only require that the triple is amdgcn. I could potentially make the generation of the constants require HIP or OpenMPDevice, or OpenCL is enabled if you think that's bad. |
2–3 ↗ | (On Diff #463042) | I can also change it to just -x c if the HIP is the problem. |
clang/test/CodeGen/amdgcn-link-control-constants.c | ||
---|---|---|
2–3 ↗ | (On Diff #463042) | We probably want these magic constants for C++ code as well, so keying it off the triple (at least triple + that we're using rocm / compute stuff, which I think is adequately indicated by hsa in the triple) is better. And likewise don't want to emit these constants for non-gpu code, e.g. x64 host hip doesn't need the daz_opt constant, which also suggests triple is the right hook. |
clang/test/CodeGen/amdgcn-link-control-constants.c | ||
---|---|---|
2–3 ↗ | (On Diff #463042) | We don't officially support C on amdgcn but we officially support HIP. I would suggest move this to CodeGenCUDA and compile it as HIP, and use HIP syntax. |
clang/test/CodeGen/amdgcn-control-constants.c | ||
---|---|---|
9 | still missing this test, and some other tests for -cl-* options as commented below. Also, missing a HIP test for -ffast-math |
clang/test/CodeGen/amdgcn-control-constants.c | ||
---|---|---|
9 | The cc1 math options tested individually should be enabled by -ffast-math. |
clang/test/CodeGen/amdgcn-control-constants.c | ||
---|---|---|
9 | Since we cannot test -ffast-math directly, can we add a driver test to ensure we are not missing any -cc1 options needed by the control variables when -ffast-math is specified for the driver? Thanks. Also, the -cl-* options are -cc1 options. We need to test them. |
I don't like the fact that we need to have two different kinds of control constants, one per-TU and others per-link job. I'm wondering how difficult it would be to make the fast versions of the math calls use different entry points. That way we could handle this in the math header wrappers.
That's really how the C linkage model wants you to handle this. I also would like to have FP value tracking optimizations take care of the special cases in the library code
There's the "small matter" of implementing the new device library functions. Why is all that more likeable than two kinds of control constants?
Different functions providing different behaviors can be handled at link time like any other function, instead of the same functions providing different behaviors per translation unit and requires cloning. The current scheme transfers complexity from the device library build system into the driver and user binaries
Another benefit of this would be that linking could be done only once in a sound manner rather than requiring an instance of the ROCm device libraries to be included for each TU. Although we would probably still need the attribute propagation that -mlink-builtin-bitcode offers, so it wouldn't be quite as easy that throwing the device libs into the lld invocation.
Different functions providing different behaviors can be handled at link time like any other function, instead of the same functions providing different behaviors per translation unit and requires cloning. The current scheme transfers complexity from the device library build system into the driver and user binaries
OK, but we are talking about trading a solved problem with a solution working for years for adding a large amount of new work and new maintenance and new bugs. Does this need to be done now, or at all?
I wouldn't really call it a solved problem when only one of the many users is currently linking these libraries correctly
We should just do this now. clang shouldn't have to dig around on disk to emit a constant definition for a constant it already knows, and we have a clear path to removing these globals altogether. I have adequate patches to completely delete __oclc_daz_opt today. __oclc_finite_only_opt should be deleteable as soon as nofpclass is inferred by default. Deleting __oclc_correctly_rounded_sqrt32 and __oclc_unsafe_math_opt require more work, but are basically the same thing and require extending the libcall optimizer pass.
It will be easier to delete these from the library as they become unnecessary if clang stops enforcing these files exists like it does today, and it's easier to just stop using them entirely than to delete them one at a time
This does not support per-TU control variables. Probably should use internal linkage.