Page MenuHomePhabricator

Please use GitHub pull requests for new patches. Avoid migrating existing patches. Phabricator shutdown timeline

[OpenMP][DeviceRTL][AMDGPU] Support code object version 5
ClosedPublic

Authored by saiislam on Dec 9 2022, 11:10 AM.

Details

Summary

Update DeviceRTL and the AMDGPU plugin to use code
object version 5. Default is code object version 4.

DeviceRTL uses rocm-device-libs instead of directly calling
amdgcn builtins for the functions which are affected by
cov5.

AMDGPU plugin queries the ELF for code object version
and then prepares various implicitargs accordingly.

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
yaxunl added a comment.Aug 7 2023, 9:33 AM

need a lit test for the codegen of the clang builtin for cov 4/5/none and a lit test to show the branching code generated with cov none can be optimized away when linked with cov4 or cov5.

clang/lib/CodeGen/Targets/AMDGPU.cpp
389

I am not sure weak_odr linkage will work when code object version is none. This will cause conflict when a module emitted with cov none is linked with a module emitted with cov4 or cov5. Also, when all modules are emitted with cov none, we end up with a linked module with cov none and the work group size code will not work.

Probably we need to emit llvm.amdgcn.abi.version with external linkage for cov none.

Another issue is that llvm.amdgcn.abi.version is not internalized. It is always loaded from memory even though it is in constant address space. This will cause bad performance. Considering device libs may use clang builtin for workgroup size. The performance impact may be significant. To avoid performance degradation, we need to internalize it as early as possible in the optimization pipeline.

yaxunl added a comment.Aug 7 2023, 9:36 AM

I would suggest separating the clang/llvm part into a separate review.

arsenm added inline comments.Aug 7 2023, 2:12 PM
clang/lib/CodeGen/CGBuiltin.cpp
17112–17131

Move down to define and initialize

17132–17134

You could write all of this in terms of selects and avoid introducing all these blocks

clang/lib/CodeGen/Targets/AMDGPU.cpp
364

Don't need this?

saiislam updated this revision to Diff 551266.Aug 17 2023, 2:16 PM
saiislam marked 6 inline comments as done.

Updated the patch as per reviewers comments.

clang/lib/CodeGen/CGBuiltin.cpp
17112–17131

There are multiple uses of the same identifier. Defining them four times looks odd.

openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
1752

I have removed the preallocatedheap work from this patch.

2556–2557

Removed the logic for preallocatedheap.

openmp/libomptarget/plugins-nextgen/amdgpu/utils/UtilitiesRTL.h
38–46

I have reduced the fields to bare minimum required for OpenMP.

Some nits. I'm assuming we're getting the code object in the backend now? We'll need to make sure that -Wl,--amdhsa-code-object-version is passed to the clang invocation inside of the clang-linker-wrapper to handle -save-temps mode.

clang/lib/CodeGen/CGBuiltin.cpp
17110
clang/lib/Driver/ToolChain.cpp
1368

Random whitespace.

clang/test/CodeGenCUDA/amdgpu-code-object-version-linking.cu
97

Need newline

openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
3007

Don't think this needs to be a debug message, same below

openmp/libomptarget/plugins-nextgen/amdgpu/utils/UtilitiesRTL.h
38–46

I'm still not a fan of replacing the struct. The mnemonic of having a struct is much more user friendly.

ImplicitArgsTy Args{};
std::memset(&Args, sizeof(ImplicitArgsTy), 0);
...

If we don't use something, just make it some random bytes, e.g.

struct ImplicitArgsTy {
  uint64_t OffsetX;
  uint8_t Unused[64]; // 64 byte offset.
};
saiislam updated this revision to Diff 551597.Aug 18 2023, 12:05 PM
saiislam marked 4 inline comments as done.

Changed ImplitArgs implementation using struct.

Some nits. I'm assuming we're getting the code object in the backend now? We'll need to make sure that -Wl,--amdhsa-code-object-version is passed to the clang invocation inside of the clang-linker-wrapper to handle -save-temps mode.

Clang-linker-wrapper was not passing -mllvm option to the clang backend.

openmp/libomptarget/plugins-nextgen/amdgpu/utils/UtilitiesRTL.h
38–46

Replaced.

arsenm added inline comments.Aug 18 2023, 1:05 PM
clang/lib/CodeGen/CGBuiltin.cpp
17114

Spell out to DispatchPtr?

clang/lib/CodeGen/CodeGenModule.cpp
1206–1208

These could be one combined hook? this isn't really different from metadata

clang/lib/CodeGen/Targets/AMDGPU.cpp
369–386

You moved GetOrCreateLLVMGlobal but don't use it?

The lamdba is unnecessary for a single local use

clang/lib/Driver/ToolChain.cpp
1373–1376

Capitalize

1376

Don't understand why this is necessary

arsenm added inline comments.Aug 18 2023, 1:07 PM
clang/test/CodeGenCUDA/amdgpu-code-object-version-linking.cu
41–44

test all the builtins?

saiislam marked 4 inline comments as done.Aug 21 2023, 11:25 AM
saiislam added inline comments.
clang/lib/CodeGen/Targets/AMDGPU.cpp
369–386

I am using GetOrCreateLLVMGlobal in CGBuiltin.cpp while emitting code for amdgpu_worgroup_size.

369–386

I was hoping that this patch will pave way for D130096, so that it can generate rest of the control constants using the same lambda.
I can remove this and simplify the code if you want.

389

I tried external linkage but it didn't work. Only weak_odr is working fine.

clang/lib/Driver/ToolChain.cpp
1376

This function creates a derived argument list for OpenMP target specific flags.
mcode-object-version remains unset for device compilation step if we don't pass it here.

saiislam updated this revision to Diff 552085.Aug 21 2023, 11:26 AM
saiislam marked an inline comment as done.

Adressed reviewer's comments.

saiislam marked 3 inline comments as done.Aug 21 2023, 11:28 AM
arsenm added inline comments.Aug 21 2023, 1:11 PM
clang/lib/CodeGen/CGBuiltin.cpp
17124

Capitalization is weird, IsCOV5?

17139–17140

CreateConstInBoundsGEP1_64

17157

CreateConstInBoundsGEP1_64

clang/lib/CodeGen/Targets/AMDGPU.cpp
364

Single use lamdba, just make this the function body

381

No real point setting the alignment

saiislam updated this revision to Diff 552344.Aug 22 2023, 7:03 AM
saiislam marked 5 inline comments as done.

Used CreateConstInBoundsGEP1_32 for emitting GEP statements. Changed lambda function to simple fucntion body for defining the global variable.

Codegen parts LGTM, questions with the driver parts

clang/lib/Driver/ToolChain.cpp
1373–1376

Typos

1374
clang/lib/Driver/ToolChains/Clang.cpp
8648–8649

so device rtl is linked once as a normal library?

8652–8653

Why do you need this? The code object version is supposed to come from a module flag. We should be getting rid of the command line argument for it

clang/tools/clang-linker-wrapper/ClangLinkerWrapper.cpp
406–410

Shouldn't need this?

417

Commented out code

yaxunl added inline comments.Aug 23 2023, 8:07 PM
clang/test/CodeGenCUDA/amdgpu-code-object-version-linking.cu
13

need to test using clang -cc1 with -O3 and -mlink-builtin-bitcode to link the device lib and verify the load of llvm.amdgcn.abi.version being eliminated after optimization.

I think currently it cannot do that since llvm.amdgcn.abi.version is not internalized by the internalization pass. This can cause some significant perf drops since loading is expensive. Need to tweak the function controlling what variables can be internalized for amdgpu so that this variable gets internalized, or having a generic way to tell that function which variables should be internalized, e.g. by adding a metadata amdgcn.internalize

saiislam updated this revision to Diff 553179.Aug 24 2023, 10:06 AM
saiislam marked 7 inline comments as done.

Updated test case to check internalization of newly inserted global variable.

clang/lib/Driver/ToolChains/Clang.cpp
8648–8649

No, this is command generation for clang-linker-wrapper. Since, devicertl is compiled only to get bitcode file (-c), it is never called.

8652–8653

During command generation for clang-linker-wrapper, it is required to check user's provided mcode-object-version=X so that amdhsa-code-object-version=X can be passed to the clang/lto backend.

getAmdhsaCodeObjectVersion() and getHsaAbiVersion() both still use the above command line argument to override user's choice of COV, instead of the module flag.

clang/test/CodeGenCUDA/amdgpu-code-object-version-linking.cu
13

load of llvm.amdgcn.abi.version is being eliminated with cc1, -O3, and mlink-builtin-bitcode of device lib.

clang/tools/clang-linker-wrapper/ClangLinkerWrapper.cpp
406–410

It is required so that when clang pass (not the lto backend) is called from clang-linker-wrapper due to -save-temps, user provided COV is correctly propagated.

jhuber6 added inline comments.Aug 24 2023, 10:06 AM
openmp/libomptarget/plugins-nextgen/amdgpu/utils/UtilitiesRTL.h
49–58

We should probably be using sizeof now that it's back to being a struct and keep the old struct definition.

saiislam marked an inline comment as done.Aug 24 2023, 10:12 AM
saiislam added inline comments.
openmp/libomptarget/plugins-nextgen/amdgpu/utils/UtilitiesRTL.h
49–58

AMDGPU plugin doesn't use any implicitarg for COV4, but it does so for COV5. So, we are not keeping two separate structures for implicitargs of COV4 and COV5.
If we use sizeof then it will always return 256 corresponding to COV5 (even for cov4, which should be 56). That's why we need this function.

jhuber6 added inline comments.Aug 24 2023, 10:15 AM
openmp/libomptarget/plugins-nextgen/amdgpu/utils/UtilitiesRTL.h
49–58

Yeah, I guess for COV4 the only thing that mattered was the size so that we could make sure it's all set to zero. We shouldn't use the enum value. It should be sizeof(ImplicitArgsTy) for COV5 and either hard-code it in the function for V4 or make a dummy struct.

saiislam updated this revision to Diff 553413.Aug 25 2023, 2:06 AM
saiislam marked 2 inline comments as done.

Changed getImplicitArgsSize to use sizeof.

Just a few more nits. I think it's looking fine but I haven't tested it. Anyone else?

clang/tools/clang-linker-wrapper/ClangLinkerWrapper.cpp
406
415–417

No braces around a single line if.

openmp/libomptarget/plugins-nextgen/amdgpu/utils/UtilitiesRTL.h
54

We return uint16_t here? These are sizes.

saiislam updated this revision to Diff 553991.Aug 28 2023, 10:44 AM
saiislam marked 3 inline comments as done.

Minor fixes addressing reviewer's comment.

jhuber6 accepted this revision.Aug 28 2023, 12:16 PM

I think it's fine now given that it's passing tests. Others feel free to comment.

This revision is now accepted and ready to land.Aug 28 2023, 12:16 PM
yaxunl accepted this revision.Aug 28 2023, 2:30 PM

LGTM. Thanks

clang/test/CodeGenCUDA/amdgpu-code-object-version-linking.cu
13

It seems being eliminated by IPSCCP. It makes sense since it is constant weak_odr without externally_initialized. Either changing it to weak or adding externally_initialized will keep the load. Normal __constant__ var in device code may be changed by host code, therefore they are emitted with externally_initialized and do not have the load eliminated.

This revision was landed with ongoing or failed builds.Aug 29 2023, 4:36 AM
This revision was automatically updated to reflect the committed changes.
saiislam added inline comments.Aug 29 2023, 4:40 AM
clang/test/CodeGenCUDA/amdgpu-code-object-version-linking.cu
13

Thank you @yaxunl !
I have added these observations as comments in the code at load emit and global emit locations.