This is an archive of the discontinued LLVM Phabricator instance.

Differential D139730

[OpenMP][DeviceRTL][AMDGPU] Support code object version 5
ClosedPublic

Authored by saiislam on Dec 9 2022, 11:10 AM.

Details

Reviewers

jdoerfert
JonChesterfield
jhuber6
yaxunl

Commits

rGf616c3eeb43f: [OpenMP][DeviceRTL][AMDGPU] Support code object version 5

Summary

Update DeviceRTL and the AMDGPU plugin to use code
object version 5. Default is code object version 4.

DeviceRTL uses rocm-device-libs instead of directly calling
amdgcn builtins for the functions which are affected by
cov5.

AMDGPU plugin queries the ELF for code object version
and then prepares various implicitargs accordingly.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

saiislam created this revision. Dec 9 2022, 11:10 AM

Herald added a project: Restricted Project. Dec 9 2022, 11:10 AM

Herald added subscribers: kosarev, kerbowa, guansong and 4 others.

saiislam requested review of this revision. Dec 9 2022, 11:10 AM

Herald added projects: Restricted Project, Restricted Project. Dec 9 2022, 11:10 AM

Herald added subscribers: openmp-commits, cfe-commits, sstefan1 and 2 others.

Maybe we should wait until D138389 lands and we can update both, otherwise we'd need a second patch.

I'm not fully up-to-date, what's the main difference and advantage of the new code object version? What do all the new implicit arguments do?

clang/lib/Driver/ToolChains/AMDGPU.cpp
953 ↗	(On Diff #481701)	Unrelated?
openmp/libomptarget/DeviceRTL/include/Interface.h
169 ↗	(On Diff #481701)	This should probably use variants to match the rest of the style, also if you intend to read these outside of the library you'll need to put them in the exports file and set their visibility.
openmp/libomptarget/DeviceRTL/src/Mapping.cpp
19 ↗	(On Diff #481701)	What if this isn't defined? We should be able to use the OpenMP library without the AMD device libraries. Should it be extern weak?
openmp/libomptarget/DeviceRTL/src/State.cpp
73 ↗	(On Diff #481701)	Variants
openmp/libomptarget/plugins/amdgpu/impl/get_elf_mach_gfx_name.h
15 ↗	(On Diff #481701)	Unrelated, but is there any particular reason these aren't defined in the `hsa_amd_ext.h`?

yaxunl added inline comments. Dec 9 2022, 11:32 AM

clang/lib/Driver/ToolChains/Clang.cpp
7323	Any reason you need the original args? This will bypass the driver translation, which should not happen in normal cases.
7324	clang -cc1 needs this to be default value false to emit code object version module flag

Could we elaborate on the benefits, please. Now we support two versions?

Why is this helpful:

DeviceRTL uses rocm-device-libs instead of directly calling amdgcn builtins for the functions which are affected by cov5.

openmp/libomptarget/DeviceRTL/src/State.cpp
81 ↗	(On Diff #481701)	Why do we need the "external..." stuff anyway?
openmp/libomptarget/plugins/amdgpu/src/rtl.cpp
2233 ↗	(On Diff #481701)	What is this all about?

Harbormaster completed remote builds in B202273: Diff 481701. Dec 9 2022, 12:40 PM

I am reluctant to add the dependency edge to rocm device libs to openmp's GPU runtime.

We currently require that library for libm, which I'm also not thrilled about, but at least you can currently build and run openmp programs (that don't use libm, like much of our tests) without it.

My problem with device libs is that it usually doesn't build with trunk. It follows a rolling dev model tied to rocm clang and when upstream does something that takes a long time to apply, device libs doesn't build until rocm catches up. I've literally never managed to compile any branch of device libs with trunk clang without modifying the source, generally to delete directories that don't look necessary for libm.

Further, selecting an ABI based on runtime code found in a library which is hopefully constant folded is a weird layering choice. The compiler knows what ABI it is emitting code for, and that's how it picks files from device libs to effect that choice, but it would make far more sense to me for the compiler back end to set this stuff up itself.

Also, if we handle ABI in the back end, then we don't get the inevitable problem of rocm device libs and trunk clang having totally different ideas of what the ABI is as they drift in and out of sync.

tianshilei1992 added a subscriber: tianshilei1992. Dec 13 2022, 5:29 AM

tianshilei1992 added inline comments.

openmp/libomptarget/DeviceRTL/src/Mapping.cpp
19 ↗	(On Diff #481701)	It should be put into AMD's `declare variant`.

Thanks everyone for your review and comments!
I am going to address all of them in a series of smaller patches starting with D140784.

saiislam added inline comments. Jan 4 2023, 7:08 AM

openmp/libomptarget/DeviceRTL/src/Mapping.cpp

50 ↗	(On Diff #481701)	If we still don't want to depend on rocm-device-libs then we will have to do something like (haven't tried this code yet):

uint32_t getNumHardwareThreadsInBlock() {
   if (__oclc_ABI_version < 500) {
      return __builtin_amdgcn_workgroup_size_x();
   } else {
      void *implicitArgPtr = __builtin_amdgcn_implicitarg_ptr();
      // Element 6 of the implicit args viewed as 16-bit values (byte offset 12).
      return ((uint16_t *)implicitArgPtr)[6];
   }
}

80 ↗	(On Diff #481701)

uint32_t getNumberOfBlocks() {
   if (__oclc_ABI_version < 500) {
      return __builtin_amdgcn_grid_size_x() / __builtin_amdgcn_workgroup_size_x();
   } else {
      void *implicitArgPtr = __builtin_amdgcn_implicitarg_ptr();
      // Element 0 of the implicit args viewed as 32-bit values.
      return ((uint32_t *)implicitArgPtr)[0];
   }
}

saiislam marked an inline comment as done. Jan 18 2023, 7:31 AM

saiislam added inline comments.

clang/lib/Driver/ToolChains/Clang.cpp
7323	We need derived args to look for mcode-object-version. I have created a separate review for this change. Please have a look at D142022

In D139730#3991628, @JonChesterfield wrote:

We currently require that library for libm, which I'm also not thrilled about, but at least you can currently build and run openmp programs (that don't use libm, like much of our tests) without it.

The ABI isn't defined in terms of what device-libs does. It's fixed offsets off of pointers accessible through amdgcn intrinsics. You can also just directly emit the same IR, these functions aren't complicated
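The "fixed offsets" idea can be sketched on the host side, with a plain byte buffer standing in for the memory returned by `__builtin_amdgcn_implicitarg_ptr()`. This is only an illustration: the helper name `readAt` is hypothetical, and the byte offsets 0 and 12 are assumptions taken from the indices in the untested pseudocode earlier in this thread, not from the patch itself.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: read a trivially copyable value of type T located
// ByteOffset bytes past Base. memcpy avoids alignment and aliasing problems.
template <typename T>
T readAt(const void *Base, std::size_t ByteOffset) {
  T Value;
  std::memcpy(&Value, static_cast<const unsigned char *>(Base) + ByteOffset,
              sizeof(T));
  return Value;
}

// Assumed cov5 offsets: block count X is a 32-bit value at byte 0,
// workgroup size X is a 16-bit value at byte 12.
inline uint32_t numBlocksX(const void *ImplicitArgs) {
  return readAt<uint32_t>(ImplicitArgs, 0);
}

inline uint16_t numThreadsX(const void *ImplicitArgs) {
  return readAt<uint16_t>(ImplicitArgs, 12);
}
```

On a device the base pointer would come from the intrinsic rather than a host buffer; the point is only that no device-libs call is involved in the lookup.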

Herald added subscribers: jplehr, sunshaoce. Jun 13 2023, 11:55 AM

In D139730#4418630, @arsenm wrote:

The ABI isn't defined in terms of what device-libs does. It's fixed offsets off of pointers accessible through amdgcn intrinsics. You can also just directly emit the same IR, these functions aren't complicated

This is the suggestion I've talked with @saiislam about. I think we should just copy the magic intrinsics that are being queried here. I'm assuming we don't need to bother with supporting both v4 and v5 so we can just make the switch all at once.

Another attempt at cov5 support, using CodeGen for the `__builtin_amdgcn_workgroup_size_*` builtins.

arsenm added inline comments. Aug 4 2023, 12:12 PM

clang/lib/CodeGen/CGBuiltin.cpp
17124	this must always pass
openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
3007	This isn't doing anything?
openmp/libomptarget/plugins-nextgen/amdgpu/utils/UtilitiesRTL.h
38–46	This is getting duplicated a few places, should it move to a support header? I don't love the existing APIs for this, I think a struct definition makes more sense

Could you explain briefly what the approach here is? I'm confused as to what's actually changed and how we're handling this difference. I thought if this was just the definition of some builtin function we could just rely on the backend to figure it out. Why do we need to know the code object version inside the device RTL?

clang/lib/CodeGen/CGBuiltin.cpp
17118	Could you explain the function of this in a comment? Are we emitting generic code if unspecified?
17150–17151	nit.
17157	Leftover debugging?
clang/lib/Driver/ToolChain.cpp
1371	Shouldn't we be able to put this under the `OPT_m_group` below?
openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
1752	Leftover?
2548	Why do we need this? The current method shouldn't need to change if all we're doing is allocating memory of greater size.
3006	So we're required to emit some new arguments? I don't have any idea what's changed between this COV4 and COV5 stuff.

jhuber6 added inline comments. Aug 4 2023, 12:14 PM

openmp/libomptarget/plugins-nextgen/amdgpu/utils/UtilitiesRTL.h
38–46	The other user here is my custom loader, @JonChesterfield has talked about wanting a common HSA helper header for awhile now. I agree that the struct definition is much better. Being able to simply allocate this size and then zero fill it is much cleaner.

In D139730#4561540, @jhuber6 wrote:

Could you explain briefly what the approach here is? I'm confused as to what's actually changed and how we're handling this difference. I thought if this was just the definition of some builtin function we could just rely on the backend to figure it out. Why do we need to know the code object version inside the device RTL?

The builtin is called in the device rtl, so the device RTL needs to contain both implementations. The "backend figuring it out" is dead code elimination

In D139730#4561573, @arsenm wrote:

The builtin is called in the device rtl, so the device RTL needs to contain both implementations. The "backend figuring it out" is dead code elimination

Okay, do we expect to re-use this interface anywhere? If it's just for OpenMP then we should probably copy the approach taken for __omp_rtl_debug_kind, which is a global created on the GPU by CGOpenMPRuntimeGPU's constructor and does more or less the same thing.

In D139730#4561575, @jhuber6 wrote:

Okay, do we expect to re-use this interface anywhere? If it's just for OpenMP then we should probably copy the approach taken for __omp_rtl_debug_kind, which is a global created on the GPU by CGOpenMPRuntimeGPU's constructor and does more or less the same thing.

device libs replicates the same scheme using its own copy of an equivalent variable. Trying to merge those two together

In D139730#4561619, @arsenm wrote:

device libs replicates the same scheme using its own copy of an equivalent variable. Trying to merge those two together

Although I guess that doesn't really need the builtin changes?

Harbormaster completed remote builds in B250395: Diff 547297. Aug 4 2023, 1:50 PM

Removed unused cov5 implicitargs fields.
Added comments about EmitAMDGPUWorkGroupSize and ABI-agnostic code emission.
Addressed reviewers' comments.

In D139730#4561622, @arsenm wrote:

Although I guess that doesn't really need the builtin changes?

This builtin was already aware of cov4 and cov5. All this patch changes is making it aware of the case where both need to be present.
It is already used by device-libs, deviceRTL, and libc-gpu.
Also, encapsulating ABI related changes in implementation of the builtin allows other runtime developers to be agnostic to these lower level changes.
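A host-side sketch of what "both implementations present" means: the lookup picks its source by ABI version, and when the version is a known constant the unused path can be folded away. Everything here is illustrative. The function name, the byte offsets (4 in the HSA dispatch packet, 12 in the implicit args) and the use of plain byte buffers are assumptions, not the code emitted by the builtin.

```cpp
#include <cstdint>
#include <cstring>

// Matches the "< 500" check in the pseudocode quoted earlier in the thread.
constexpr uint32_t COV5 = 500;

// Host-side stand-in: both pointers are ordinary byte buffers here.
// Index is 0, 1, 2 for the x, y, z dimension.
inline uint16_t workGroupSize(uint32_t AbiVersion,
                              const unsigned char *DispatchPacket,
                              const unsigned char *ImplicitArgs,
                              unsigned Index) {
  // Assumed offsets: 16-bit workgroup sizes start at byte 4 of the dispatch
  // packet (cov4 path) and at byte 12 of the implicit args (cov5 path).
  const unsigned char *Src = AbiVersion >= COV5
                                 ? ImplicitArgs + 12 + Index * 2
                                 : DispatchPacket + 4 + Index * 2;
  uint16_t Size;
  std::memcpy(&Size, Src, sizeof(Size));
  return Size;
}
```

Expressing the choice as a single select rather than separate branches keeps the control flow flat.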

clang/lib/CodeGen/CGBuiltin.cpp
17150–17151	There are a couple of common lines after the inner if-else, in the outer else section.
openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
1752	No, it is not a leftover. One of the fields in the cov5 implicit kernarg is heap_v1_ptr. It should point to a 128KB zero-initialized block of coarse-grained memory on each device before launching the kernel. This code was working a while ago, but right now it is failing, most likely due to a recent change in the devicertl memory handling mechanism. I need to debug it with this patch, otherwise it will cause all target region code calling device-malloc to fail. I will try to fix it before the next revision.
2548	`PreAllocatedDeviceMemoryPool` is the pointer which stores the intermediate value before it is written to the heap_v1_ptr field of the cov5 implicit kernarg.
3006	In cov5, we need to set certain fields of the implicit kernel arguments before launching the kernel. Please see AMDHSA Code Object V5 Kernel Argument Metadata Map Additions and Changes for more details. Only NumBlocks, NumThreads(XYZ), GridDims, and Heap_V1_ptr are relevant for us, so I have simplified code further.
3007	Earlier we used to set hostcall_buffer here, but not anymore. I have left the message in DP just for debug help.
openmp/libomptarget/plugins-nextgen/amdgpu/utils/UtilitiesRTL.h
38–46	Defining a struct for whole 256 byte of implicitargs in cov5 was becoming a little difficult due to different sizes of various fields (2, 4, 6, 8, 48, 72 bytes) along with multiple reserved fields in between. It made sense for cov4 because it only had 7 fields of 8 bytes each, where we needed only 4th field in OpenMP runtime (for hostcall_buffer). Offset based lookups like the following allows handling/exposing only required fields across generations of ABI.

saiislam added a subscriber: ronlieb. Aug 7 2023, 6:06 AM

jhuber6 added inline comments. Aug 7 2023, 6:23 AM

clang/lib/CodeGen/CGBuiltin.cpp
17107
17150–17151	You should be able to factor out LD = CGF.Builder.CreateLoad( Address(Result, CGF.Int16Ty, CharUnits::fromQuantity(2))); from both by making each assign the `Result` to a value.
openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
1752	Do we really need that? We only use a fraction of the existing implicit arguments. My understanding is that most of these are more for runtime handling for HIP and OpenCL while we would most likely want our own solution. I'm assuming that the 128KB is not required for anything we use?
2556–2557	This and below isn't correct. You can't discard an `llvm::Error` value like this without either doing `consumeError(std::move(Err))` or `toString(std::move(Err))`. However, you don't need to consume these in the first place, they already contain the error message from the callee and should just be forwarded.
openmp/libomptarget/plugins-nextgen/amdgpu/utils/UtilitiesRTL.h
38–46	If we don't use it, just put it as `unused`. It's really hard to read as-is and it makes it more difficult to just zero fill.

Harbormaster completed remote builds in B250761: Diff 547751. Aug 7 2023, 8:07 AM

need a lit test for the codegen of the clang builtin for cov 4/5/none and a lit test to show the branching code generated with cov none can be optimized away when linked with cov4 or cov5.

clang/lib/CodeGen/Targets/AMDGPU.cpp
389	I am not sure weak_odr linkage will work when the code object version is none. It will cause a conflict when a module emitted with cov none is linked with a module emitted with cov4 or cov5. Also, when all modules are emitted with cov none, we end up with a linked module with cov none and the work group size code will not work. We probably need to emit llvm.amdgcn.abi.version with external linkage for cov none. Another issue is that llvm.amdgcn.abi.version is not internalized: it is always loaded from memory even though it is in the constant address space, which will cause bad performance. Since device libs may use the clang builtin for workgroup size, the performance impact may be significant. To avoid performance degradation, we need to internalize it as early as possible in the optimization pipeline.

I would suggest separating the clang/llvm part into a separate review.

arsenm added inline comments. Aug 7 2023, 2:12 PM

clang/lib/CodeGen/CGBuiltin.cpp
17112–17131	Move down to define and initialize
17132–17134	You could write all of this in terms of selects and avoid introducing all these blocks
clang/lib/CodeGen/Targets/AMDGPU.cpp
364	Don't need this?

Updated the patch as per reviewers comments.

clang/lib/CodeGen/CGBuiltin.cpp
17112–17131	There are multiple uses of the same identifier. Defining them four times looks odd.
openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
1752	I have removed the preallocatedheap work from this patch.
2556–2557	Removed the logic for preallocatedheap.
openmp/libomptarget/plugins-nextgen/amdgpu/utils/UtilitiesRTL.h
38–46	I have reduced the fields to bare minimum required for OpenMP.

Harbormaster completed remote builds in B253309: Diff 551266. Aug 17 2023, 3:34 PM

Some nits. I'm assuming we're getting the code object in the backend now? We'll need to make sure that -Wl,--amdhsa-code-object-version is passed to the clang invocation inside of the clang-linker-wrapper to handle -save-temps mode.

clang/lib/CodeGen/CGBuiltin.cpp
17110
clang/lib/Driver/ToolChain.cpp
1368	Random whitespace.
clang/test/CodeGenCUDA/amdgpu-code-object-version-linking.cu
97	Need newline
openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
3007	Don't think this needs to be a debug message, same below
openmp/libomptarget/plugins-nextgen/amdgpu/utils/UtilitiesRTL.h
38–46	I'm still not a fan of replacing the struct. The mnemonic of having a struct is much more user friendly. ImplicitArgsTy Args{}; std::memset(&Args, 0, sizeof(ImplicitArgsTy)); ... If we don't use something, just make it some random bytes, e.g. struct ImplicitArgsTy { uint64_t OffsetX; uint8_t Unused[64]; // 64 byte offset. };
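The struct-with-padding idea might look roughly like the sketch below. The field names, the padding sizes and the exact offsets are assumptions for illustration (only the 256-byte total is stated in this thread); the static assertions make the assumed layout explicit.

```cpp
#include <cstddef>
#include <cstdint>

// Illustrative layout only: names and padding sizes are assumptions.
struct ImplicitArgsTy {
  uint32_t BlockCountX;
  uint32_t BlockCountY;
  uint32_t BlockCountZ;
  uint16_t GroupSizeX;
  uint16_t GroupSizeY;
  uint16_t GroupSizeZ;
  uint8_t Unused0[46];  // Fields not needed here, padding up to byte 64.
  uint16_t GridDims;
  uint8_t Unused1[190]; // Fields not needed here, padding up to byte 256.
};

static_assert(sizeof(ImplicitArgsTy) == 256, "expected 256 bytes in total");
static_assert(offsetof(ImplicitArgsTy, GroupSizeX) == 12, "assumed offset");
static_assert(offsetof(ImplicitArgsTy, GridDims) == 64, "assumed offset");

// Value-initialization zero-fills every member, so no memset is needed.
inline ImplicitArgsTy makeZeroedArgs() {
  ImplicitArgsTy Args{};
  return Args;
}
```

Naming unused ranges as byte arrays keeps the struct readable while still letting the whole block be zero-filled in one step.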

Changed the ImplicitArgs implementation to use a struct.

In D139730#4597504, @jhuber6 wrote:

Some nits. I'm assuming we're getting the code object in the backend now? We'll need to make sure that -Wl,--amdhsa-code-object-version is passed to the clang invocation inside of the clang-linker-wrapper to handle -save-temps mode.

Clang-linker-wrapper was not passing -mllvm option to the clang backend.

openmp/libomptarget/plugins-nextgen/amdgpu/utils/UtilitiesRTL.h
38–46	Replaced.

arsenm added inline comments. Aug 18 2023, 1:05 PM

clang/lib/CodeGen/CGBuiltin.cpp
17114	Spell out to DispatchPtr?
clang/lib/CodeGen/CodeGenModule.cpp
1206–1208	These could be one combined hook? this isn't really different from metadata
clang/lib/CodeGen/Targets/AMDGPU.cpp
369–386	You moved GetOrCreateLLVMGlobal but don't use it? The lambda is unnecessary for a single local use
clang/lib/Driver/ToolChain.cpp
1373–1376	Capitalize
1376	Don't understand why this is necessary

arsenm added inline comments. Aug 18 2023, 1:07 PM

clang/test/CodeGenCUDA/amdgpu-code-object-version-linking.cu
41–44	test all the builtins?

Harbormaster completed remote builds in B253549: Diff 551597. Aug 18 2023, 2:27 PM

saiislam marked 4 inline comments as done. Aug 21 2023, 11:25 AM

saiislam added inline comments.

clang/lib/CodeGen/Targets/AMDGPU.cpp
369–386	I am using GetOrCreateLLVMGlobal in CGBuiltin.cpp while emitting code for amdgpu_workgroup_size.
369–386	I was hoping that this patch would pave the way for D130096, so that it can generate the rest of the control constants using the same lambda. I can remove this and simplify the code if you want.
389	I tried external linkage but it didn't work. Only weak_odr is working fine.
clang/lib/Driver/ToolChain.cpp
1376	This function creates a derived argument list for OpenMP target specific flags. `mcode-object-version` remains unset for device compilation step if we don't pass it here.

Addressed reviewer's comments.

saiislam marked 3 inline comments as done. Aug 21 2023, 11:28 AM

Harbormaster completed remote builds in B253892: Diff 552085. Aug 21 2023, 11:56 AM

arsenm added inline comments. Aug 21 2023, 1:11 PM

clang/lib/CodeGen/CGBuiltin.cpp
17124	Capitalization is weird, IsCOV5?
17139–17140	CreateConstInBoundsGEP1_64
17157	CreateConstInBoundsGEP1_64
clang/lib/CodeGen/Targets/AMDGPU.cpp
364	Single use lambda, just make this the function body
381	No real point setting the alignment

Used CreateConstInBoundsGEP1_32 for emitting GEP statements. Changed the lambda function to a simple function body for defining the global variable.

Harbormaster completed remote builds in B254086: Diff 552344. Aug 22 2023, 8:08 AM

Codegen parts LGTM, questions with the driver parts

clang/lib/Driver/ToolChain.cpp
1373–1376	Typos
1374
clang/lib/Driver/ToolChains/Clang.cpp
8648–8649	so device rtl is linked once as a normal library?
8652–8653	Why do you need this? The code object version is supposed to come from a module flag. We should be getting rid of the command line argument for it
clang/tools/clang-linker-wrapper/ClangLinkerWrapper.cpp
406–410	Shouldn't need this?
417	Commented out code

yaxunl added inline comments. Aug 23 2023, 8:07 PM

clang/test/CodeGenCUDA/amdgpu-code-object-version-linking.cu
13	need to test using clang -cc1 with -O3 and -mlink-builtin-bitcode to link the device lib and verify the load of llvm.amdgcn.abi.version being eliminated after optimization. I think currently it cannot do that since llvm.amdgcn.abi.version is not internalized by the internalization pass. This can cause some significant perf drops since loading is expensive. Need to tweak the function controlling what variables can be internalized for amdgpu so that this variable gets internalized, or having a generic way to tell that function which variables should be internalized, e.g. by adding a metadata amdgcn.internalize

Updated test case to check internalization of newly inserted global variable.

clang/lib/Driver/ToolChains/Clang.cpp
8648–8649	No, this is command generation for clang-linker-wrapper. Since, devicertl is compiled only to get bitcode file (-c), it is never called.
8652–8653	During command generation for clang-linker-wrapper, it is required to check user's provided `mcode-object-version=X` so that `amdhsa-code-object-version=X` can be passed to the clang/lto backend. `getAmdhsaCodeObjectVersion()` and `getHsaAbiVersion()` both still use the above command line argument to override user's choice of COV, instead of the module flag.
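The forwarding described here can be sketched with plain strings. This is not the driver's code: the helper name is hypothetical, a `std::vector<std::string>` stands in for the driver's argument list, and the single-dash spelling `-mcode-object-version=` is an assumption about how the option appears on the command line.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: translate the user's -mcode-object-version=X into the
// backend option --amdhsa-code-object-version=X. The last occurrence wins.
// Returns an empty string when the user did not pass the option.
inline std::string
codeObjectVersionBackendFlag(const std::vector<std::string> &Args) {
  const std::string Prefix = "-mcode-object-version=";
  std::string Result;
  for (const std::string &Arg : Args)
    if (Arg.compare(0, Prefix.size(), Prefix) == 0)
      Result = "--amdhsa-code-object-version=" + Arg.substr(Prefix.size());
  return Result;
}
```

An empty result would mean nothing is forwarded and the backend keeps its default.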
clang/test/CodeGenCUDA/amdgpu-code-object-version-linking.cu
13	load of llvm.amdgcn.abi.version is being eliminated with cc1, -O3, and mlink-builtin-bitcode of device lib.
clang/tools/clang-linker-wrapper/ClangLinkerWrapper.cpp
406–410	It is required so that when clang pass (not the lto backend) is called from clang-linker-wrapper due to `-save-temps`, user provided COV is correctly propagated.

jhuber6 added inline comments. Aug 24 2023, 10:06 AM

openmp/libomptarget/plugins-nextgen/amdgpu/utils/UtilitiesRTL.h
49–58	We should probably be using `sizeof` now that it's back to being a struct and keep the old struct definition.

saiislam marked an inline comment as done. Aug 24 2023, 10:12 AM

saiislam added inline comments.

openmp/libomptarget/plugins-nextgen/amdgpu/utils/UtilitiesRTL.h
49–58	AMDGPU plugin doesn't use any implicitarg for COV4, but it does so for COV5. So, we are not keeping two separate structures for implicitargs of COV4 and COV5. If we use sizeof then it will always return 256 corresponding to COV5 (even for cov4, which should be 56). That's why we need this function.

jhuber6 added inline comments. Aug 24 2023, 10:15 AM

openmp/libomptarget/plugins-nextgen/amdgpu/utils/UtilitiesRTL.h
49–58	Yeah, I guess for COV4 the only thing that mattered was the size so that we could make sure it's all set to zero. We shouldn't use the enum value. It should be `sizeof(ImplicitArgsTy)` for `COV5` and either hard-code it in the function for V4 or make a dummy struct.
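A minimal sketch of the size helper suggested here, assuming a placeholder struct and a simple integer version number. The 256 and 56 byte sizes are the ones quoted in this thread; the placeholder struct, the constant name and the version encoding (4 or 5) are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>

// Placeholder standing in for the real cov5 struct; only its size matters.
struct ImplicitArgsTy {
  uint8_t Bytes[256];
};

// Hard-coded cov4 size, as quoted in the discussion above.
constexpr std::size_t COV4ImplicitArgsSize = 56;

// Hypothetical version encoding: 5 or higher selects the cov5 layout.
constexpr std::size_t getImplicitArgsSize(unsigned Version) {
  return Version >= 5 ? sizeof(ImplicitArgsTy) : COV4ImplicitArgsSize;
}

static_assert(getImplicitArgsSize(5) == 256, "cov5 uses the struct size");
static_assert(getImplicitArgsSize(4) == 56, "cov4 uses the hard-coded size");
```

Returning `std::size_t` rather than a 16-bit integer matches the usual type for sizes.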

Harbormaster completed remote builds in B254660: Diff 553179. Aug 24 2023, 10:55 AM

Changed getImplicitArgsSize to use sizeof.

Harbormaster completed remote builds in B254836: Diff 553413. Aug 25 2023, 2:57 AM

Just a few more nits. I think it's looking fine but I haven't tested it. Anyone else?

clang/tools/clang-linker-wrapper/ClangLinkerWrapper.cpp
406
415–417	No braces around a single line if.
openmp/libomptarget/plugins-nextgen/amdgpu/utils/UtilitiesRTL.h
54	We return uint16_t here? These are sizes.

Minor fixes addressing reviewer's comment.

Harbormaster completed remote builds in B255263: Diff 553991. Aug 28 2023, 11:27 AM

I think it's fine now given that it's passing tests. Others feel free to comment.

This revision is now accepted and ready to land. Aug 28 2023, 12:16 PM

LGTM. Thanks

clang/test/CodeGenCUDA/amdgpu-code-object-version-linking.cu
13	It seems to be eliminated by IPSCCP. That makes sense since it is constant weak_odr without externally_initialized. Either changing it to weak or adding externally_initialized will keep the load. A normal `__constant__` var in device code may be changed by host code, therefore such variables are emitted with externally_initialized and do not have the load eliminated.

This revision was landed with ongoing or failed builds. Aug 29 2023, 4:36 AM

Closed by commit rGf616c3eeb43f: [OpenMP][DeviceRTL][AMDGPU] Support code object version 5 (authored by saiislam).

This revision was automatically updated to reflect the committed changes.

saiislam added a commit: rGf616c3eeb43f: [OpenMP][DeviceRTL][AMDGPU] Support code object version 5.

saiislam added inline comments. Aug 29 2023, 4:40 AM

clang/test/CodeGenCUDA/amdgpu-code-object-version-linking.cu
13	Thank you @yaxunl ! I have added these observations as comments in the code at load emit and global emit locations.

saiislam mentioned this in D140973: [OpenMP][AMDGPU] Support of cov5 in the next gen plugin. Aug 30 2023, 2:41 AM

saiislam mentioned this in D140784: [OpenMP][AMDGPU] Introduce new matadata fields for code object 5.

saiislam mentioned this in D140783: [OpenMP][AMDGPU] Extract code object version from the ELF.

Revision Contents

Path	Size
clang/lib/CodeGen/CGBuiltin.cpp	64 lines
clang/lib/CodeGen/CodeGenModule.h	10 lines
clang/lib/CodeGen/CodeGenModule.cpp	2 lines
clang/lib/CodeGen/TargetInfo.h	3 lines
clang/lib/CodeGen/Targets/AMDGPU.cpp	25 lines
clang/lib/Driver/ToolChain.cpp	5 lines
clang/lib/Driver/ToolChains/Clang.cpp	8 lines
clang/test/CodeGenCUDA/amdgpu-code-object-version-linking.cu	96 lines
clang/test/CodeGenCUDA/amdgpu-workgroup-size.cu	34 lines
clang/test/CodeGenOpenCL/opencl_types.cl	5 lines
clang/tools/clang-linker-wrapper/ClangLinkerWrapper.cpp	6 lines
openmp/libomptarget/DeviceRTL/CMakeLists.txt	2 lines
openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp	29 lines
openmp/libomptarget/plugins-nextgen/amdgpu/utils/UtilitiesRTL.h	40 lines

Diff 554262

clang/lib/CodeGen/CGBuiltin.cpp

#include "PatternInit.h"
#include "TargetInfo.h"
#include "clang/AST/ASTContext.h"
#include "clang/AST/Attr.h"
#include "clang/AST/Decl.h"
#include "clang/AST/OSLog.h"
#include "clang/Basic/TargetBuiltins.h"
#include "clang/Basic/TargetInfo.h"
+ #include "clang/Basic/TargetOptions.h"
#include "clang/CodeGen/CGFunctionInfo.h"
#include "clang/Frontend/FrontendDiagnostic.h"
#include "llvm/ADT/APFloat.h"
#include "llvm/ADT/APInt.h"
#include "llvm/ADT/FloatingPointMode.h"
#include "llvm/ADT/SmallPtrSet.h"
#include "llvm/ADT/StringExtras.h"
#include "llvm/Analysis/ValueTracking.h"

[... unchanged lines not shown ...]

Value *EmitAMDGPUImplicitArgPtr(CodeGenFunction &CGF) {
  [...]
  auto *Call = CGF.Builder.CreateCall(F);
  Call->addRetAttr(
      Attribute::getWithDereferenceableBytes(Call->getContext(), 256));
  Call->addRetAttr(Attribute::getWithAlignment(Call->getContext(), Align(8)));
  return Call;
}

// \p Index is 0, 1, and 2 for x, y, and z dimension, respectively.
+ /// Emit code based on Code Object ABI version.
+ /// COV_4 : Emit code to use dispatch ptr
+ /// COV_5 : Emit code to use implicitarg ptr
+ /// COV_NONE : Emit code to load a global variable "llvm.amdgcn.abi.version"
+ /// and use its value for COV_4 or COV_5 approach. It is used for
+ /// compiling device libraries in an ABI-agnostic way.

jhuber6 (inline suggestion, Done):

/// and use its value for COV_4 or COV_5 approach. It is used for
- /// compiling device libraries in ABI-agnostic way.
+ /// compiling device libraries in an ABI-agnostic way.
///
/// Note: "llvm.amdgcn.abi.version" is supposed to be emitted and intialized by

+ ///
+ /// Note: "llvm.amdgcn.abi.version" is supposed to be emitted and intialized by
+ /// clang during compilation of user code.

jhuber6 (inline suggestion, Done):

/// Note: "llvm.amdgcn.abi.version" is supposed to be emitted and intialized by
- /// the clang during compilation of user code.
+ /// clang during compilation of user code.
Value *EmitAMDGPUWorkGroupSize(CodeGenFunction &CGF, unsigned Index) {

  Value *EmitAMDGPUWorkGroupSize(CodeGenFunction &CGF, unsigned Index) {
-   bool IsCOV_5 = CGF.getTarget().getTargetOpts().CodeObjectVersion ==
-                  clang::TargetOptions::COV_5;
-   Constant *Offset;
+   llvm::LoadInst *LD;
+   auto Cov = CGF.getTarget().getTargetOpts().CodeObjectVersion;

arsenm (Done): Spell out to DispatchPtr?

-   Value *DP;
-   if (IsCOV_5) {
+   if (Cov == clang::TargetOptions::COV_None) {
+     auto *ABIVersionC = CGF.CGM.GetOrCreateLLVMGlobal(
+         "llvm.amdgcn.abi.version", CGF.Int32Ty, LangAS::Default, nullptr,

jhuber6 (Done): Could you explain the function of this in a comment? Are we emitting generic code if unspecified?

+         CodeGen::NotForDefinition);
+     // This load will be eliminated by the IPSCCP because it is constant
+     // weak_odr without externally_initialized. Either changing it to weak or
+     // adding externally_initialized will keep the load.
+     Value *ABIVersion = CGF.Builder.CreateAlignedLoad(CGF.Int32Ty, ABIVersionC,

arsenm (Done): this must always pass

arsenm (Done): Capitalization is weird, IsCOV5?

+                                                       CGF.CGM.getIntAlign());
+     Value *IsCOV5 = CGF.Builder.CreateICmpSGE(
+         ABIVersion,
+         llvm::ConstantInt::get(CGF.Int32Ty, clang::TargetOptions::COV_5));
+     // Indexing the implicit kernarg segment.

arsenm (Done): Move down to define and initialize

saiislam (Author, Done): There are multiple uses of the same identifier. Defining them four times looks odd.

+     Value *ImplicitGEP = CGF.Builder.CreateConstGEP1_32(
+         CGF.Int8Ty, EmitAMDGPUImplicitArgPtr(CGF), 12 + Index * 2);

arsenm (Done): You could write all of this in terms of selects and avoid introducing all these blocks

+     // Indexing the HSA kernel_dispatch_packet struct.
+     Value *DispatchGEP = CGF.Builder.CreateConstGEP1_32(
+         CGF.Int8Ty, EmitAMDGPUDispatchPtr(CGF), 4 + Index * 2);
+     auto Result = CGF.Builder.CreateSelect(IsCOV5, ImplicitGEP, DispatchGEP);
+     LD = CGF.Builder.CreateLoad(

arsenm (Done): CreateConstInBoundsGEP1_64

+         Address(Result, CGF.Int16Ty, CharUnits::fromQuantity(2)));
+   } else {
+     Value *GEP = nullptr;
+     if (Cov == clang::TargetOptions::COV_5) {
        // Indexing the implicit kernarg segment.
-       Offset = llvm::ConstantInt::get(CGF.Int32Ty, 12 + Index * 2);
-       DP = EmitAMDGPUImplicitArgPtr(CGF);
+       GEP = CGF.Builder.CreateConstGEP1_32(
+           CGF.Int8Ty, EmitAMDGPUImplicitArgPtr(CGF), 12 + Index * 2);
      } else {
        // Indexing the HSA kernel_dispatch_packet struct.
-       Offset = llvm::ConstantInt::get(CGF.Int32Ty, 4 + Index * 2);
-       DP = EmitAMDGPUDispatchPtr(CGF);
+       GEP = CGF.Builder.CreateConstGEP1_32(
+           CGF.Int8Ty, EmitAMDGPUDispatchPtr(CGF), 4 + Index * 2);

jhuber6 (Done), suggested change:

Address(Result, CGF.Int16Ty, CharUnits::fromQuantity(2)));
- } else {
- if (Cov == clang::TargetOptions::COV_5) {
+ } else if (Cov == clang::TargetOptions::COV_5) {
// Indexing the implicit kernarg segment.

nit.

saiislam (Author, Done): There are a couple of common lines after the inner if-else, in the outer else section.

jhuber6 (Done): You should be able to factor out

LD = CGF.Builder.CreateLoad(
    Address(Result, CGF.Int16Ty, CharUnits::fromQuantity(2)));

from both by making each assign the Result to a value.

      }
-     auto *GEP = CGF.Builder.CreateGEP(CGF.Int8Ty, DP, Offset);
-     auto *LD = CGF.Builder.CreateLoad(
+     LD = CGF.Builder.CreateLoad(
          Address(GEP, CGF.Int16Ty, CharUnits::fromQuantity(2)));
+   }
    llvm::MDBuilder MDHelper(CGF.getLLVMContext());

jhuber6 (Done): Leftover debugging?

arsenm (Done): CreateConstInBoundsGEP1_64

    llvm::MDNode *RNode = MDHelper.createRange(APInt(16, 1),
        APInt(16, CGF.getTarget().getMaxOpenCLWorkGroupSize() + 1));
    LD->setMetadata(llvm::LLVMContext::MD_range, RNode);
    LD->setMetadata(llvm::LLVMContext::MD_noundef,
                    llvm::MDNode::get(CGF.getLLVMContext(), std::nullopt));
    LD->setMetadata(llvm::LLVMContext::MD_invariant_load,
                    llvm::MDNode::get(CGF.getLLVMContext(), std::nullopt));
    return LD;

▲ Show 20 Lines • Show All 3,288 Lines • Show Last 20 Lines
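
For readers following the thread above, the offset selection emitted by `EmitAMDGPUWorkGroupSize` can be modeled in plain C++ with no LLVM dependency. This is only an illustrative sketch: the enum, struct, and function names are hypothetical and not part of the patch. The values 400 and 500 are the constants the accompanying test checks for; 0 for "none" is an assumption. It mirrors the select-based form: ABI version 5 and above reads a 16-bit field from the implicit kernarg segment at byte offset 12 + 2 * Index, while older versions read from the HSA dispatch packet at 4 + 2 * Index.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins for the code object version values. 400 and 500
// match the constants checked in the test; 0 for "none" is an assumption.
enum CodeObjectVersion : int32_t { COV_None = 0, COV_4 = 400, COV_5 = 500 };

// Which base pointer the 16-bit work-group size is loaded from.
enum class Base { DispatchPtr, ImplicitArgPtr };

struct WorkGroupSizeSlot {
  Base base;
  unsigned byteOffset;
};

// index is 0, 1, 2 for the x, y, z dimension. abiVersion models the value
// loaded from the "llvm.amdgcn.abi.version" global in the COV_None path.
inline WorkGroupSizeSlot workGroupSizeSlot(int32_t abiVersion, unsigned index) {
  assert(index < 3 && "only x, y, z are valid");
  if (abiVersion >= COV_5)
    return {Base::ImplicitArgPtr, 12 + index * 2}; // implicit kernarg segment
  return {Base::DispatchPtr, 4 + index * 2};       // HSA kernel_dispatch_packet
}
```

Under these assumptions, `workGroupSizeSlot(400, 1)` selects the dispatch pointer at offset 6 and `workGroupSizeSlot(500, 2)` selects the implicit argument pointer at offset 16, matching the GEP offsets in the test expectations.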

clang/lib/CodeGen/CodeGenModule.h

Show First 20 Lines • Show All 1,565 Lines • ▼ Show 20 Lines public:
    void handleAMDGPUFlatWorkGroupSizeAttr(
        llvm::Function *F, const AMDGPUFlatWorkGroupSizeAttr *A,
        const ReqdWorkGroupSizeAttr *ReqdWGS = nullptr);

    /// Emit the IR encoding to attach the AMD GPU waves-per-eu attribute to \p F.
    void handleAMDGPUWavesPerEUAttr(llvm::Function *F,
                                    const AMDGPUWavesPerEUAttr *A);

+   llvm::Constant *
+   GetOrCreateLLVMGlobal(StringRef MangledName, llvm::Type *Ty, LangAS AddrSpace,
+                         const VarDecl *D,
+                         ForDefinition_t IsForDefinition = NotForDefinition);
+
  private:
    llvm::Constant *GetOrCreateLLVMFunction(
        StringRef MangledName, llvm::Type *Ty, GlobalDecl D, bool ForVTable,
        bool DontDefer = false, bool IsThunk = false,
        llvm::AttributeList ExtraAttrs = llvm::AttributeList(),
        ForDefinition_t IsForDefinition = NotForDefinition);

    // References to multiversion functions are resolved through an implicitly
    // defined resolver function. This function is responsible for creating
    // the resolver symbol for the provided declaration. The value returned
    // will be for an ifunc (llvm::GlobalIFunc) if the current target supports
    // that feature and for a regular function (llvm::GlobalValue) otherwise.
    llvm::Constant *GetOrCreateMultiVersionResolver(GlobalDecl GD);

    // In scenarios where a function is not known to be a multiversion function
    // until a later declaration, it is sometimes necessary to change the
    // previously created mangled name to align with requirements of whatever
    // multiversion function kind the function is now known to be. This function
    // is responsible for performing such mangled name updates.
    void UpdateMultiVersionNames(GlobalDecl GD, const FunctionDecl *FD,
                                 StringRef &CurName);

-   llvm::Constant *
-   GetOrCreateLLVMGlobal(StringRef MangledName, llvm::Type *Ty, LangAS AddrSpace,
-                         const VarDecl *D,
-                         ForDefinition_t IsForDefinition = NotForDefinition);
-
    bool GetCPUAndFeaturesAttributes(GlobalDecl GD,
                                     llvm::AttrBuilder &AttrBuilder,
                                     bool SetTargetFeatures = true);
    void setNonAliasAttributes(GlobalDecl GD, llvm::GlobalObject *GO);

    /// Set function attributes for a function declaration.
    void SetFunctionAttributes(GlobalDecl GD, llvm::Function *F,
                               bool IsIncompleteFunction, bool IsThunk);
▲ Show 20 Lines • Show All 168 Lines • Show Last 20 Lines

clang/lib/CodeGen/CodeGenModule.cpp

This file is larger than 256 KB, so syntax highlighting is disabled by default.

Show First 20 Lines • Show All 1,197 Lines • ▼ Show 20 Lines if (getCodeGenOpts().SkipRaxSetup)
      getModule().addModuleFlag(llvm::Module::Override, "SkipRaxSetup", 1);
    if (getLangOpts().RegCall4)
      getModule().addModuleFlag(llvm::Module::Override, "RegCallv4", 1);

    if (getContext().getTargetInfo().getMaxTLSAlign())
      getModule().addModuleFlag(llvm::Module::Error, "MaxTLSAlign",
                                getContext().getTargetInfo().getMaxTLSAlign());

+   getTargetCodeGenInfo().emitTargetGlobals(*this);
+
    getTargetCodeGenInfo().emitTargetMetadata(*this, MangledDeclNames);

arsenm (Not Done): These could be one combined hook? This isn't really different from metadata.

    EmitBackendOptionsMetadata(getCodeGenOpts());

    // If there is device offloading code embed it in the host now.
    EmbedObject(&getModule(), CodeGenOpts, getDiags());

    // Set visibility from DLL storage class
    // We do this at the end of LLVM IR generation; after any operation
▲ Show 20 Lines • Show All 6,248 Lines • Show Last 20 Lines

clang/lib/CodeGen/TargetInfo.h

Show First 20 Lines • Show All 75 Lines • ▼ Show 20 Lines virtual void setTargetAttributes(const Decl *D, llvm::GlobalValue *GV,
                                     CodeGen::CodeGenModule &M) const {}

    /// emitTargetMetadata - Provides a convenient hook to handle extra
    /// target-specific metadata for the given globals.
    virtual void emitTargetMetadata(
        CodeGen::CodeGenModule &CGM,
        const llvm::MapVector<GlobalDecl, StringRef> &MangledDeclNames) const {}

+   /// Provides a convenient hook to handle extra target-specific globals.
+   virtual void emitTargetGlobals(CodeGen::CodeGenModule &CGM) const {}
+
    /// Any further codegen related checks that need to be done on a function call
    /// in a target specific manner.
    virtual void checkFunctionCallABI(CodeGenModule &CGM, SourceLocation CallLoc,
                                      const FunctionDecl *Caller,
                                      const FunctionDecl *Callee,
                                      const CallArgList &Args) const {}

    /// Determines the size of struct _Unwind_Exception on this platform,
▲ Show 20 Lines • Show All 465 Lines • Show Last 20 Lines

clang/lib/CodeGen/Targets/AMDGPU.cpp

  //===- AMDGPU.cpp ---------------------------------------------------------===//
  //
  // Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
  // See https://llvm.org/LICENSE.txt for license information.
  // SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
  //
  //===----------------------------------------------------------------------===//

  #include "ABIInfoImpl.h"
  #include "TargetInfo.h"
+ #include "clang/Basic/TargetOptions.h"

  using namespace clang;
  using namespace clang::CodeGen;

  //===----------------------------------------------------------------------===//
  // AMDGPU ABI Implementation
  //===----------------------------------------------------------------------===//

▲ Show 20 Lines • Show All 250 Lines • ▼ Show 20 Lines
  class AMDGPUTargetCodeGenInfo : public TargetCodeGenInfo {
  public:
    AMDGPUTargetCodeGenInfo(CodeGenTypes &CGT)
        : TargetCodeGenInfo(std::make_unique<AMDGPUABIInfo>(CGT)) {}

    void setFunctionDeclAttributes(const FunctionDecl *FD, llvm::Function *F,
                                   CodeGenModule &CGM) const;

+   void emitTargetGlobals(CodeGen::CodeGenModule &CGM) const override;
+
    void setTargetAttributes(const Decl *D, llvm::GlobalValue *GV,
                             CodeGen::CodeGenModule &M) const override;
    unsigned getOpenCLKernelCallingConv() const override;

    llvm::Constant *getNullPointer(const CodeGen::CodeGenModule &CGM,
                                   llvm::PointerType *T, QualType QT) const override;

    LangAS getASTAllocaAddressSpace() const override {
▲ Show 20 Lines • Show All 64 Lines • ▼ Show 20 Lines void AMDGPUTargetCodeGenInfo::setFunctionDeclAttributes(
    if (const auto *Attr = FD->getAttr<AMDGPUNumVGPRAttr>()) {
      uint32_t NumVGPR = Attr->getNumVGPR();

      if (NumVGPR != 0)
        F->addFnAttr("amdgpu-num-vgpr", llvm::utostr(NumVGPR));
    }
  }

+ /// Emits control constants used to change per-architecture behaviour in the
+ /// AMDGPU ROCm device libraries.
+ void AMDGPUTargetCodeGenInfo::emitTargetGlobals(
+     CodeGen::CodeGenModule &CGM) const {
+   StringRef Name = "llvm.amdgcn.abi.version";

arsenm (Done): Don't need this?

arsenm (Done): Single-use lambda, just make this the function body.

+   if (CGM.getModule().getNamedGlobal(Name))
+     return;
+
+   auto *Type = llvm::IntegerType::getIntNTy(CGM.getModule().getContext(), 32);
+   llvm::Constant *COV = llvm::ConstantInt::get(
+       Type, CGM.getTarget().getTargetOpts().CodeObjectVersion);
+
+   // It needs to be constant weak_odr without externally_initialized so that
+   // the load instuction can be eliminated by the IPSCCP.
+   auto *GV = new llvm::GlobalVariable(
+       CGM.getModule(), Type, true, llvm::GlobalValue::WeakODRLinkage, COV, Name,
+       nullptr, llvm::GlobalValue::ThreadLocalMode::NotThreadLocal,
+       CGM.getContext().getTargetAddressSpace(LangAS::opencl_constant));
+   GV->setUnnamedAddr(llvm::GlobalValue::UnnamedAddr::Local);
+   GV->setVisibility(llvm::GlobalValue::VisibilityTypes::HiddenVisibility);
+ }
+

arsenm (Done): No real point setting the alignment

  void AMDGPUTargetCodeGenInfo::setTargetAttributes(
      const Decl *D, llvm::GlobalValue *GV, CodeGen::CodeGenModule &M) const {
    if (requiresAMDGPUProtectedVisibility(D, GV)) {
      GV->setVisibility(llvm::GlobalValue::ProtectedVisibility);
      GV->setDSOLocal(true);

arsenm (Done): You moved GetOrCreateLLVMGlobal but don't use it? The lambda is unnecessary for a single local use.

saiislam (Author, Done): I am using GetOrCreateLLVMGlobal in CGBuiltin.cpp while emitting code for amdgpu_workgroup_size.

saiislam (Author, Done): I was hoping that this patch would pave the way for D130096, so that it can generate the rest of the control constants using the same lambda. I can remove this and simplify the code if you want.

    }

    if (GV->isDeclaration())

yaxunl (Not Done): I am not sure weak_odr linkage will work when code object version is none. This will cause a conflict when a module emitted with cov none is linked with a module emitted with cov4 or cov5. Also, when all modules are emitted with cov none, we end up with a linked module with cov none and the work group size code will not work. Probably we need to emit llvm.amdgcn.abi.version with external linkage for cov none.

Another issue is that llvm.amdgcn.abi.version is not internalized. It is always loaded from memory even though it is in constant address space. This will cause bad performance. Considering that device libs may use the clang builtin for workgroup size, the performance impact may be significant. To avoid performance degradation, we need to internalize it as early as possible in the optimization pipeline.

saiislam (Author, Done): I tried external linkage but it didn't work. Only weak_odr is working fine.

      return;

    llvm::Function *F = dyn_cast<llvm::Function>(GV);
    if (!F)
      return;

    const FunctionDecl *FD = dyn_cast_or_null<FunctionDecl>(D);
    if (FD)
▲ Show 20 Lines • Show All 234 Lines • Show Last 20 Lines

clang/lib/Driver/ToolChain.cpp

Show First 20 Lines • Show All 1,359 Lines • ▼ Show 20 Lines llvm::opt::DerivedArgList *ToolChain::TranslateOpenMPTargetArgs(
      const llvm::opt::DerivedArgList &Args, bool SameTripleAsHost,
      SmallVectorImpl<llvm::opt::Arg *> &AllocatedArgs) const {
    DerivedArgList *DAL = new DerivedArgList(Args.getBaseArgs());
    const OptTable &Opts = getDriver().getOpts();
    bool Modified = false;

    // Handle -Xopenmp-target flags
    for (auto *A : Args) {
      // Exclude flags which may only apply to the host toolchain.

jhuber6 (Done): Random whitespace.

      // Do not exclude flags when the host triple (AuxTriple)
      // matches the current toolchain triple. If it is not present
      // at all, target and host share a toolchain.

jhuber6 (Not Done): Shouldn't we be able to put this under the `OPT_m_group` below?

      if (A->getOption().matches(options::OPT_m_Group)) {
-       if (SameTripleAsHost)
+       // Pass code object version to device toolchain
+       // to correctly set metadata in intermediate files.

arsenm (Done), suggested change:

// Pass code objection version to device toolchain
- // to correctly set meta-data in intermediate files.
+ // to correctly set metadata in intermediate files.
if (SameTripleAsHost ||

+       if (SameTripleAsHost ||
+           A->getOption().matches(options::OPT_mcode_object_version_EQ))

arsenm (Done): Capitalize

arsenm (Not Done): Don't understand why this is necessary

saiislam (Author, Done): This function creates a derived argument list for OpenMP target specific flags. `mcode-object-version` remains unset for the device compilation step if we don't pass it here.

arsenm (Done), suggested change:

if (A->getOption().matches(options::OPT_m_Group)) {
- // Pass code objection version to device toolchain
+ // Pass code object version to device toolchain
// to correctly set meta-data in intermediate files.

Typos

          DAL->append(A);
        else
          Modified = true;
        continue;
      }
      unsigned Index;
      unsigned Prev;

▲ Show 20 Lines • Show All 131 Lines • Show Last 20 Lines
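
As a reading aid for the `OPT_m_Group` branch discussed above, here is a small self-contained model of the filtering rule. It is a sketch under stated assumptions, not the driver code: `ModelArg`, `FilterResult`, and `filterMGroupArgs` are hypothetical names, the two booleans stand in for the `matches(...)` checks, and letting every non `-m` argument pass straight through is a simplification of the rest of the loop.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for llvm::opt::Arg; only the two properties the
// branch inspects are modeled.
struct ModelArg {
  std::string spelling;
  bool inMGroup;            // would match options::OPT_m_Group
  bool isCodeObjectVersion; // would match options::OPT_mcode_object_version_EQ
};

struct FilterResult {
  std::vector<std::string> kept; // models the derived argument list
  bool modified = false;         // models the Modified flag
};

// Models only the -m group branch: such flags are kept when host and target
// share a triple, or when the flag is the code object version; otherwise they
// are dropped and the list is marked as modified.
inline FilterResult filterMGroupArgs(const std::vector<ModelArg> &args,
                                     bool sameTripleAsHost) {
  FilterResult result;
  for (const ModelArg &a : args) {
    if (a.inMGroup) {
      if (sameTripleAsHost || a.isCodeObjectVersion)
        result.kept.push_back(a.spelling);
      else
        result.modified = true;
      continue;
    }
    // Simplification (assumption): all other arguments pass through.
    result.kept.push_back(a.spelling);
  }
  return result;
}
```

With a differing host triple, a list containing `-mcode-object-version=5` and another `-m` flag keeps the former and drops the latter, which is the behavior saiislam describes as necessary for the device compilation step.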

clang/lib/Driver/ToolChains/Clang.cpp

This file is larger than 256 KB, so syntax highlighting is disabled by default.

Show First 20 Lines • Show All 7,314 Lines • ▼ Show 20 Lines if (IsOpenMPDevice) {
      CmdArgs.push_back("-fopenmp-is-target-device");
      if (OpenMPDeviceInput) {
        CmdArgs.push_back("-fopenmp-host-ir-file-path");
        CmdArgs.push_back(Args.MakeArgString(OpenMPDeviceInput->getFilename()));
      }
    }

    if (Triple.isAMDGPU()) {
      handleAMDGPUCodeObjectVersionOptions(D, Args, CmdArgs);

yaxunl (Not Done): Any reason you need the original args? This will bypass the driver translation, which should not happen in normal cases.

saiislam (Author, Done): We need derived args to look for mcode-object-version. I have created a separate review for this change. Please have a look at D142022.

yaxunl (Done): clang -cc1 needs this to be default value false to emit the code object version module flag.

      Args.addOptInFlag(CmdArgs, options::OPT_munsafe_fp_atomics,
                        options::OPT_mno_unsafe_fp_atomics);
      Args.addOptOutFlag(CmdArgs, options::OPT_mamdgpu_ieee,
                         options::OPT_mno_amdgpu_ieee);
    }

    // For all the host OpenMP offloading compile jobs we need to pass the targets
    // information using -fopenmp-targets= option.
▲ Show 20 Lines • Show All 1,307 Lines • ▼ Show 20 Lines void LinkerWrapper::ConstructJob(Compilation &C, const JobAction &JA,
    if (Args.hasArg(options::OPT_v))
      CmdArgs.push_back("--wrapper-verbose");

    if (const Arg *A = Args.getLastArg(options::OPT_g_Group)) {
      if (!A->getOption().matches(options::OPT_g0))
        CmdArgs.push_back("--device-debug");
    }

+   // code-object-version=X needs to be passed to clang-linker-wrapper to ensure
+   // that it is used by lld.

arsenm (Done): So device rtl is linked once as a normal library?

saiislam (Author, Done): No, this is command generation for clang-linker-wrapper. Since devicertl is compiled only to get a bitcode file (-c), it is never called.

+   if (const Arg *A = Args.getLastArg(options::OPT_mcode_object_version_EQ)) {
+     CmdArgs.push_back(Args.MakeArgString("-mllvm"));
+     CmdArgs.push_back(Args.MakeArgString(
+         Twine("--amdhsa-code-object-version=") + A->getValue()));

arsenm (Done): Why do you need this? The code object version is supposed to come from a module flag. We should be getting rid of the command line argument for it.

saiislam (Author, Done): During command generation for clang-linker-wrapper, it is required to check the user's provided `mcode-object-version=X` so that `amdhsa-code-object-version=X` can be passed to the clang/lto backend. `getAmdhsaCodeObjectVersion()` and `getHsaAbiVersion()` both still use the above command line argument, instead of the module flag, to pick up the user's choice of COV.

+   }
+
    for (const auto &A : Args.getAllArgValues(options::OPT_Xcuda_ptxas))
      CmdArgs.push_back(Args.MakeArgString("--ptxas-arg=" + A));

    // Forward remarks passes to the LLVM backend in the wrapper.
    if (const Arg *A = Args.getLastArg(options::OPT_Rpass_EQ))
      CmdArgs.push_back(Args.MakeArgString(Twine("--offload-opt=-pass-remarks=") +
                                           A->getValue()));
    if (const Arg *A = Args.getLastArg(options::OPT_Rpass_missed_EQ))
▲ Show 20 Lines • Show All 53 Lines • Show Last 20 Lines
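
The forwarding step added to `LinkerWrapper::ConstructJob` above can be illustrated with a standalone sketch. The function name and the use of a nullable `std::string` pointer are assumptions made for illustration only; the pointer stands in for the result of `Args.getLastArg(...)`, which is null when the user did not pass `-mcode-object-version=`.

```cpp
#include <string>
#include <vector>

// Hypothetical model of the forwarding logic: when the user supplied a code
// object version, append "-mllvm" followed by the backend option carrying
// the same value. When no value was supplied, the command line is untouched.
inline void forwardCodeObjectVersion(const std::string *userValue,
                                     std::vector<std::string> &cmdArgs) {
  if (!userValue)
    return;
  cmdArgs.push_back("-mllvm");
  cmdArgs.push_back("--amdhsa-code-object-version=" + *userValue);
}
```

Passing a value of `5` appends the pair `-mllvm`, `--amdhsa-code-object-version=5`, which is the form the patch constructs for clang-linker-wrapper.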

clang/test/CodeGenCUDA/amdgpu-code-object-version-linking.cu

This file was added.

				// RUN: %clang_cc1 -fcuda-is-device -triple amdgcn-amd-amdhsa -emit-llvm-bc \
				// RUN: -mcode-object-version=4 -DUSER -x hip -o %t_4.bc %s

				// RUN: %clang_cc1 -fcuda-is-device -triple amdgcn-amd-amdhsa -emit-llvm-bc \
				// RUN: -mcode-object-version=5 -DUSER -x hip -o %t_5.bc %s

				// RUN: %clang_cc1 -fcuda-is-device -triple amdgcn-amd-amdhsa -emit-llvm-bc \
				// RUN: -mcode-object-version=none -DDEVICELIB -x hip -o %t_0.bc %s

				// RUN: %clang_cc1 -triple amdgcn-amd-amdhsa -fcuda-is-device -emit-llvm -O3 \
				// RUN: %t_4.bc -mlink-builtin-bitcode %t_0.bc -o - \|\
				// RUN: FileCheck -check-prefix=LINKED4 %s

				yaxunlUnsubmitted Done Reply Inline Actions need to test using clang -cc1 with -O3 and -mlink-builtin-bitcode to link the device lib and verify the load of llvm.amdgcn.abi.version being eliminated after optimization. I think currently it cannot do that since llvm.amdgcn.abi.version is not internalized by the internalization pass. This can cause some significant perf drops since loading is expensive. Need to tweak the function controlling what variables can be internalized for amdgpu so that this variable gets internalized, or having a generic way to tell that function which variables should be internalized, e.g. by adding a metadata amdgcn.internalize yaxunl: need to test using clang -cc1 with -O3 and -mlink-builtin-bitcode to link the device lib and…
				saiislamAuthorUnsubmitted Done Reply Inline Actions load of llvm.amdgcn.abi.version is being eliminated with cc1, -O3, and mlink-builtin-bitcode of device lib. saiislam: load of llvm.amdgcn.abi.version is being eliminated with cc1, -O3, and mlink-builtin-bitcode of…
				yaxunlUnsubmitted Not Done Reply Inline Actions It seems being eliminated by IPSCCP. It makes sense since it is constant weak_odr without externally_initialized. Either changing it to weak or adding externally_initialized will keep the load. Normal `__constant__` var in device code may be changed by host code, therefore they are emitted with externally_initialized and do not have the load eliminated. yaxunl: It seems being eliminated by IPSCCP. It makes sense since it is constant weak_odr without…
				saiislamAuthorUnsubmitted Done Reply Inline Actions Thank you @yaxunl ! I have added these observations as comments in the code at load emit and global emit locations. saiislam: Thank you @yaxunl ! I have added these observations as comments in the code at load emit and…
				// RUN: %clang_cc1 -triple amdgcn-amd-amdhsa -fcuda-is-device -emit-llvm -O3 \
				// RUN: %t_5.bc -mlink-builtin-bitcode %t_0.bc -o - \|\
				// RUN: FileCheck -check-prefix=LINKED5 %s

				#include "Inputs/cuda.h"

				// LINKED4: @llvm.amdgcn.abi.version = weak_odr hidden local_unnamed_addr addrspace(4) constant i32 400
				// LINKED4-LABEL: bar
				// LINKED4-NOT: load i32, ptr addrspacecast (ptr addrspace(4) @llvm.amdgcn.abi.version to ptr), align {{.*}}
				// LINKED4-NOT: icmp sge i32 %{{.*}}, 500
				// LINKED4: call align 8 dereferenceable(256) ptr addrspace(4) @llvm.amdgcn.implicitarg.ptr()
				// LINKED4: [[GEP_5_X:%.*]] = getelementptr i8, ptr addrspace(4) %{{.*}}, i32 12
				// LINKED4: call align 4 dereferenceable(64) ptr addrspace(4) @llvm.amdgcn.dispatch.ptr()
				// LINKED4: [[GEP_4_X:%.*]] = getelementptr i8, ptr addrspace(4) %{{.*}}, i32 4
				// LINKED4: select i1 false, ptr addrspace(4) [[GEP_5_X]], ptr addrspace(4) [[GEP_4_X]]
				// LINKED4: load i16, ptr addrspace(4) %{{.*}}, align 2, !range [[$WS_RANGE:![0-9]*]], !invariant.load{{.*}}, !noundef

				// LINKED4-NOT: load i32, ptr addrspacecast (ptr addrspace(4) @llvm.amdgcn.abi.version to ptr), align {{.*}}
				// LINKED4-NOT: icmp sge i32 %{{.*}}, 500
				// LINKED4: call align 8 dereferenceable(256) ptr addrspace(4) @llvm.amdgcn.implicitarg.ptr()
				// LINKED4: [[GEP_5_Y:%.*]] = getelementptr i8, ptr addrspace(4) %{{.*}}, i32 14
				// LINKED4: call align 4 dereferenceable(64) ptr addrspace(4) @llvm.amdgcn.dispatch.ptr()
				// LINKED4: [[GEP_4_Y:%.*]] = getelementptr i8, ptr addrspace(4) %{{.*}}, i32 6
				// LINKED4: select i1 false, ptr addrspace(4) [[GEP_5_Y]], ptr addrspace(4) [[GEP_4_Y]]
				// LINKED4: load i16, ptr addrspace(4) %{{.*}}, align 2, !range [[$WS_RANGE:![0-9]*]], !invariant.load{{.*}}, !noundef

				// LINKED4-NOT: load i32, ptr addrspacecast (ptr addrspace(4) @llvm.amdgcn.abi.version to ptr), align {{.*}}
				// LINKED4-NOT: icmp sge i32 %{{.*}}, 500
				// LINKED4: call align 8 dereferenceable(256) ptr addrspace(4) @llvm.amdgcn.implicitarg.ptr()
				// LINKED4: [[GEP_5_Z:%.*]] = getelementptr i8, ptr addrspace(4) %{{.*}}, i32 16
				// LINKED4: call align 4 dereferenceable(64) ptr addrspace(4) @llvm.amdgcn.dispatch.ptr()
				arsenm (Done): test all the builtins?
				// LINKED4: [[GEP_4_Z:%.*]] = getelementptr i8, ptr addrspace(4) %{{.*}}, i32 8
				// LINKED4: select i1 false, ptr addrspace(4) [[GEP_5_Z]], ptr addrspace(4) [[GEP_4_Z]]
				// LINKED4: load i16, ptr addrspace(4) %{{.*}}, align 2, !range [[$WS_RANGE:![0-9]*]], !invariant.load{{.*}}, !noundef
				// LINKED4: "amdgpu_code_object_version", i32 400

				// LINKED5: llvm.amdgcn.abi.version = weak_odr hidden local_unnamed_addr addrspace(4) constant i32 500
				// LINKED5-LABEL: bar
				// LINKED5-NOT: load i32, ptr addrspacecast (ptr addrspace(4) @llvm.amdgcn.abi.version to ptr), align {{.*}}
				// LINKED5-NOT: icmp sge i32 %{{.*}}, 500
				// LINKED5: call align 8 dereferenceable(256) ptr addrspace(4) @llvm.amdgcn.implicitarg.ptr()
				// LINKED5: [[GEP_5_X:%.*]] = getelementptr i8, ptr addrspace(4) %{{.*}}, i32 12
				// LINKED5: call align 4 dereferenceable(64) ptr addrspace(4) @llvm.amdgcn.dispatch.ptr()
				// LINKED5: [[GEP_4_X:%.*]] = getelementptr i8, ptr addrspace(4) %{{.*}}, i32 4
				// LINKED5: select i1 true, ptr addrspace(4) [[GEP_5_X]], ptr addrspace(4) [[GEP_4_X]]
				// LINKED5: load i16, ptr addrspace(4) %{{.*}}, align 2, !range [[$WS_RANGE:![0-9]*]], !invariant.load{{.*}}, !noundef

				// LINKED5-NOT: load i32, ptr addrspacecast (ptr addrspace(4) @llvm.amdgcn.abi.version to ptr), align {{.*}}
				// LINKED5-NOT: icmp sge i32 %{{.*}}, 500
				// LINKED5: call align 8 dereferenceable(256) ptr addrspace(4) @llvm.amdgcn.implicitarg.ptr()
				// LINKED5: [[GEP_5_Y:%.*]] = getelementptr i8, ptr addrspace(4) %{{.*}}, i32 14
				// LINKED5: call align 4 dereferenceable(64) ptr addrspace(4) @llvm.amdgcn.dispatch.ptr()
				// LINKED5: [[GEP_4_Y:%.*]] = getelementptr i8, ptr addrspace(4) %{{.*}}, i32 6
				// LINKED5: select i1 true, ptr addrspace(4) [[GEP_5_Y]], ptr addrspace(4) [[GEP_4_Y]]
				// LINKED5: load i16, ptr addrspace(4) %{{.*}}, align 2, !range [[$WS_RANGE:![0-9]*]], !invariant.load{{.*}}, !noundef

				// LINKED5-NOT: load i32, ptr addrspacecast (ptr addrspace(4) @llvm.amdgcn.abi.version to ptr), align {{.*}}
				// LINKED5-NOT: icmp sge i32 %{{.*}}, 500
				// LINKED5: call align 8 dereferenceable(256) ptr addrspace(4) @llvm.amdgcn.implicitarg.ptr()
				// LINKED5: [[GEP_5_Z:%.*]] = getelementptr i8, ptr addrspace(4) %{{.*}}, i32 16
				// LINKED5: call align 4 dereferenceable(64) ptr addrspace(4) @llvm.amdgcn.dispatch.ptr()
				// LINKED5: [[GEP_4_Z:%.*]] = getelementptr i8, ptr addrspace(4) %{{.*}}, i32 8
				// LINKED5: select i1 true, ptr addrspace(4) [[GEP_5_Z]], ptr addrspace(4) [[GEP_4_Z]]
				// LINKED5: load i16, ptr addrspace(4) %{{.*}}, align 2, !range [[$WS_RANGE:![0-9]*]], !invariant.load{{.*}}, !noundef
				// LINKED5: "amdgpu_code_object_version", i32 500

				#ifdef DEVICELIB
				__device__ void bar(int *x, int *y, int *z)
				{
				*x = __builtin_amdgcn_workgroup_size_x();
				*y = __builtin_amdgcn_workgroup_size_y();
				*z = __builtin_amdgcn_workgroup_size_z();
				}
				#endif

				#ifdef USER
				__device__ void bar(int *x, int *y, int *z);
				__device__ void foo()
				{
				int *x, *y, *z;
				bar(x, y, z);
				}
				#endif
				jhuber6 (Done): Need newline

clang/test/CodeGenCUDA/amdgpu-workgroup-size.cu

	// RUN: %clang_cc1 -triple amdgcn-amd-amdhsa \			// RUN: %clang_cc1 -triple amdgcn-amd-amdhsa \
	// RUN: -fcuda-is-device -emit-llvm -o - -x hip %s \			// RUN: -fcuda-is-device -emit-llvm -o - -x hip %s \
	// RUN: | FileCheck -check-prefix=PRECOV5 %s			// RUN: | FileCheck -check-prefix=PRECOV5 %s


	// RUN: %clang_cc1 -triple amdgcn-amd-amdhsa \			// RUN: %clang_cc1 -triple amdgcn-amd-amdhsa \
	// RUN: -fcuda-is-device -mcode-object-version=5 -emit-llvm -o - -x hip %s \			// RUN: -fcuda-is-device -mcode-object-version=5 -emit-llvm -o - -x hip %s \
	// RUN: | FileCheck -check-prefix=COV5 %s			// RUN: | FileCheck -check-prefix=COV5 %s

				// RUN: %clang_cc1 -triple amdgcn-amd-amdhsa \
				// RUN: -fcuda-is-device -mcode-object-version=none -emit-llvm -o - -x hip %s \
				// RUN: | FileCheck -check-prefix=COVNONE %s

	#include "Inputs/cuda.h"			#include "Inputs/cuda.h"

	// PRECOV5-LABEL: test_get_workgroup_size			// PRECOV5-LABEL: test_get_workgroup_size
	// PRECOV5: call align 4 dereferenceable(64) ptr addrspace(4) @llvm.amdgcn.dispatch.ptr()			// PRECOV5: call align 4 dereferenceable(64) ptr addrspace(4) @llvm.amdgcn.dispatch.ptr()
	// PRECOV5: getelementptr i8, ptr addrspace(4) %{{.*}}, i32 4			// PRECOV5: getelementptr i8, ptr addrspace(4) %{{.*}}, i32 4
	// PRECOV5: load i16, ptr addrspace(4) %{{.*}}, align 2, !range [[$WS_RANGE:![0-9]*]], !invariant.load{{.*}}, !noundef			// PRECOV5: load i16, ptr addrspace(4) %{{.*}}, align 2, !range [[$WS_RANGE:![0-9]*]], !invariant.load{{.*}}, !noundef
	// PRECOV5: getelementptr i8, ptr addrspace(4) %{{.*}}, i32 6			// PRECOV5: getelementptr i8, ptr addrspace(4) %{{.*}}, i32 6
	// PRECOV5: load i16, ptr addrspace(4) %{{.*}}, align 2, !range [[$WS_RANGE:![0-9]*]], !invariant.load{{.*}}, !noundef			// PRECOV5: load i16, ptr addrspace(4) %{{.*}}, align 2, !range [[$WS_RANGE:![0-9]*]], !invariant.load{{.*}}, !noundef
	// PRECOV5: getelementptr i8, ptr addrspace(4) %{{.*}}, i32 8			// PRECOV5: getelementptr i8, ptr addrspace(4) %{{.*}}, i32 8
	// PRECOV5: load i16, ptr addrspace(4) %{{.*}}, align 2, !range [[$WS_RANGE:![0-9]*]], !invariant.load{{.*}}, !noundef			// PRECOV5: load i16, ptr addrspace(4) %{{.*}}, align 2, !range [[$WS_RANGE:![0-9]*]], !invariant.load{{.*}}, !noundef

	// COV5-LABEL: test_get_workgroup_size			// COV5-LABEL: test_get_workgroup_size
	// COV5: call align 8 dereferenceable(256) ptr addrspace(4) @llvm.amdgcn.implicitarg.ptr()			// COV5: call align 8 dereferenceable(256) ptr addrspace(4) @llvm.amdgcn.implicitarg.ptr()
	// COV5: getelementptr i8, ptr addrspace(4) %{{.*}}, i32 12			// COV5: getelementptr i8, ptr addrspace(4) %{{.*}}, i32 12
	// COV5: load i16, ptr addrspace(4) %{{.*}}, align 2, !range [[$WS_RANGE:![0-9]*]], !invariant.load{{.*}}, !noundef			// COV5: load i16, ptr addrspace(4) %{{.*}}, align 2, !range [[$WS_RANGE:![0-9]*]], !invariant.load{{.*}}, !noundef
	// COV5: getelementptr i8, ptr addrspace(4) %{{.*}}, i32 14			// COV5: getelementptr i8, ptr addrspace(4) %{{.*}}, i32 14
	// COV5: load i16, ptr addrspace(4) %{{.*}}, align 2, !range [[$WS_RANGE:![0-9]*]], !invariant.load{{.*}}, !noundef			// COV5: load i16, ptr addrspace(4) %{{.*}}, align 2, !range [[$WS_RANGE:![0-9]*]], !invariant.load{{.*}}, !noundef
	// COV5: getelementptr i8, ptr addrspace(4) %{{.*}}, i32 16			// COV5: getelementptr i8, ptr addrspace(4) %{{.*}}, i32 16
	// COV5: load i16, ptr addrspace(4) %{{.*}}, align 2, !range [[$WS_RANGE:![0-9]*]], !invariant.load{{.*}}, !noundef			// COV5: load i16, ptr addrspace(4) %{{.*}}, align 2, !range [[$WS_RANGE:![0-9]*]], !invariant.load{{.*}}, !noundef


				// COVNONE-LABEL: test_get_workgroup_size
				// COVNONE: load i32, ptr addrspacecast (ptr addrspace(1) @llvm.amdgcn.abi.version to ptr), align {{.*}}
				// COVNONE: [[ABI5_X:%.*]] = icmp sge i32 %{{.*}}, 500
				// COVNONE: call align 8 dereferenceable(256) ptr addrspace(4) @llvm.amdgcn.implicitarg.ptr()
				// COVNONE: [[GEP_5_X:%.*]] = getelementptr i8, ptr addrspace(4) %{{.*}}, i32 12
				// COVNONE: call align 4 dereferenceable(64) ptr addrspace(4) @llvm.amdgcn.dispatch.ptr()
				// COVNONE: [[GEP_4_X:%.*]] = getelementptr i8, ptr addrspace(4) %{{.*}}, i32 4
				// COVNONE: select i1 [[ABI5_X]], ptr addrspace(4) [[GEP_5_X]], ptr addrspace(4) [[GEP_4_X]]
				// COVNONE: load i16, ptr addrspace(4) %{{.*}}, align 2, !range [[$WS_RANGE:![0-9]*]], !invariant.load{{.*}}, !noundef

				// COVNONE: load i32, ptr addrspacecast (ptr addrspace(1) @llvm.amdgcn.abi.version to ptr), align {{.*}}
				// COVNONE: [[ABI5_Y:%.*]] = icmp sge i32 %{{.*}}, 500
				// COVNONE: call align 8 dereferenceable(256) ptr addrspace(4) @llvm.amdgcn.implicitarg.ptr()
				// COVNONE: [[GEP_5_Y:%.*]] = getelementptr i8, ptr addrspace(4) %{{.*}}, i32 14
				// COVNONE: call align 4 dereferenceable(64) ptr addrspace(4) @llvm.amdgcn.dispatch.ptr()
				// COVNONE: [[GEP_4_Y:%.*]] = getelementptr i8, ptr addrspace(4) %{{.*}}, i32 6
				// COVNONE: select i1 [[ABI5_Y]], ptr addrspace(4) [[GEP_5_Y]], ptr addrspace(4) [[GEP_4_Y]]
				// COVNONE: load i16, ptr addrspace(4) %{{.*}}, align 2, !range [[$WS_RANGE:![0-9]*]], !invariant.load{{.*}}, !noundef

				// COVNONE: load i32, ptr addrspacecast (ptr addrspace(1) @llvm.amdgcn.abi.version to ptr), align {{.*}}
				// COVNONE: [[ABI5_Z:%.*]] = icmp sge i32 %{{.*}}, 500
				// COVNONE: call align 8 dereferenceable(256) ptr addrspace(4) @llvm.amdgcn.implicitarg.ptr()
				// COVNONE: [[GEP_5_Z:%.*]] = getelementptr i8, ptr addrspace(4) %{{.*}}, i32 16
				// COVNONE: call align 4 dereferenceable(64) ptr addrspace(4) @llvm.amdgcn.dispatch.ptr()
				// COVNONE: [[GEP_4_Z:%.*]] = getelementptr i8, ptr addrspace(4) %{{.*}}, i32 8
				// COVNONE: select i1 [[ABI5_Z]], ptr addrspace(4) [[GEP_5_Z]], ptr addrspace(4) [[GEP_4_Z]]
				// COVNONE: load i16, ptr addrspace(4) %{{.*}}, align 2, !range [[$WS_RANGE:![0-9]*]], !invariant.load{{.*}}, !noundef

	__device__ void test_get_workgroup_size(int d, int *out)			__device__ void test_get_workgroup_size(int d, int *out)
	{			{
	switch (d) {			switch (d) {
	case 0: *out = __builtin_amdgcn_workgroup_size_x(); break;			case 0: *out = __builtin_amdgcn_workgroup_size_x(); break;
	case 1: *out = __builtin_amdgcn_workgroup_size_y(); break;			case 1: *out = __builtin_amdgcn_workgroup_size_y(); break;
	case 2: *out = __builtin_amdgcn_workgroup_size_z(); break;			case 2: *out = __builtin_amdgcn_workgroup_size_z(); break;
	default: *out = 0;			default: *out = 0;
	}			}
	}			}

	// CHECK-DAG: [[$WS_RANGE]] = !{i16 1, i16 1025}			// CHECK-DAG: [[$WS_RANGE]] = !{i16 1, i16 1025}
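
Note: the offsets checked in this test suggest a simple selection rule for where the 16-bit workgroup size fields are read from. The sketch below is only an illustration in plain C++, not the CGBuiltin.cpp implementation; `BasePtr`, `FieldLocation`, and `workgroupSizeLocation` are hypothetical names. It assumes, as the COVNONE checks do, that an ABI version of 500 or more selects the implicit argument block (byte offsets 12, 14, 16) and anything lower selects the dispatch packet (byte offsets 4, 6, 8).

```cpp
#include <cassert>
#include <cstdint>

// Which base pointer a workgroup-size field is read from.
enum class BasePtr { DispatchPtr, ImplicitArgPtr };

// Base pointer plus byte offset of a 16-bit workgroup-size field.
struct FieldLocation {
  BasePtr Base;
  uint32_t ByteOffset;
};

// Hypothetical helper mirroring the select in the COVNONE checks:
// ABI version >= 500 uses the implicit argument block (12/14/16),
// older versions use the dispatch packet (4/6/8).
inline FieldLocation workgroupSizeLocation(uint32_t AbiVersion, unsigned Dim) {
  assert(Dim < 3 && "dimension must be 0 (x), 1 (y), or 2 (z)");
  if (AbiVersion >= 500)
    return {BasePtr::ImplicitArgPtr, 12u + 2u * Dim};
  return {BasePtr::DispatchPtr, 4u + 2u * Dim};
}
```

When the code object version is fixed at compile time, the comparison is presumably constant, which would explain why the PRECOV5 and COV5 prefixes each check only one base pointer while COVNONE checks both plus the select.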

clang/test/CodeGenOpenCL/opencl_types.cl

	// RUN: %clang_cc1 -cl-std=CL2.0 %s -triple "spir-unknown-unknown" -emit-llvm -o - -O0 | FileCheck %s --check-prefixes=CHECK-COM,CHECK-SPIR			// RUN: %clang_cc1 -cl-std=CL2.0 %s -triple "spir-unknown-unknown" -emit-llvm -o - -O0 | FileCheck %s --check-prefix=CHECK-SPIR
	// RUN: %clang_cc1 -cl-std=CL2.0 %s -triple "amdgcn--amdhsa" -emit-llvm -o - -O0 | FileCheck %s --check-prefixes=CHECK-COM,CHECK-AMDGCN			// RUN: %clang_cc1 -cl-std=CL2.0 %s -triple "amdgcn--amdhsa" -emit-llvm -o - -O0 | FileCheck %s --check-prefix=CHECK-AMDGCN

	#define CLK_ADDRESS_CLAMP_TO_EDGE 2			#define CLK_ADDRESS_CLAMP_TO_EDGE 2
	#define CLK_NORMALIZED_COORDS_TRUE 1			#define CLK_NORMALIZED_COORDS_TRUE 1
	#define CLK_FILTER_NEAREST 0x10			#define CLK_FILTER_NEAREST 0x10
	#define CLK_FILTER_LINEAR 0x20			#define CLK_FILTER_LINEAR 0x20

	constant sampler_t glb_smp = CLK_ADDRESS_CLAMP_TO_EDGE|CLK_NORMALIZED_COORDS_TRUE|CLK_FILTER_NEAREST;			constant sampler_t glb_smp = CLK_ADDRESS_CLAMP_TO_EDGE|CLK_NORMALIZED_COORDS_TRUE|CLK_FILTER_NEAREST;
	// CHECK-COM-NOT: constant i32

	void fnc1(image1d_t img) {}			void fnc1(image1d_t img) {}
	// CHECK-SPIR: @fnc1(target("spirv.Image", void, 0, 0, 0, 0, 0, 0, 0)			// CHECK-SPIR: @fnc1(target("spirv.Image", void, 0, 0, 0, 0, 0, 0, 0)
	// CHECK-AMDGCN: @fnc1(ptr addrspace(4)			// CHECK-AMDGCN: @fnc1(ptr addrspace(4)

	void fnc1arr(image1d_array_t img) {}			void fnc1arr(image1d_array_t img) {}
	// CHECK-SPIR: @fnc1arr(target("spirv.Image", void, 0, 0, 1, 0, 0, 0, 0)			// CHECK-SPIR: @fnc1arr(target("spirv.Image", void, 0, 0, 1, 0, 0, 0, 0)
	// CHECK-AMDGCN: @fnc1arr(ptr addrspace(4)			// CHECK-AMDGCN: @fnc1arr(ptr addrspace(4)

clang/tools/clang-linker-wrapper/ClangLinkerWrapper.cpp

if (!Triple.isAMDGPU() && !Triple.isNVPTX()) {

for (const opt::Arg *Arg : Args.filtered(OPT_library, OPT_library_path)) for (const opt::Arg *Arg : Args.filtered(OPT_library, OPT_library_path))

Arg->render(Args, LinkerArgs); Arg->render(Args, LinkerArgs);

for (const opt::Arg *Arg : Args.filtered(OPT_rpath)) for (const opt::Arg *Arg : Args.filtered(OPT_rpath))

LinkerArgs.push_back( LinkerArgs.push_back(

Args.MakeArgString("-Wl,-rpath," + StringRef(Arg->getValue()))); Args.MakeArgString("-Wl,-rpath," + StringRef(Arg->getValue())));

llvm::copy(LinkerArgs, std::back_inserter(CmdArgs)); llvm::copy(LinkerArgs, std::back_inserter(CmdArgs));

} }

// Pass on -mllvm options to the clang invocation.

jhuber6 (Done), suggested edit:

llvm::copy(LinkerArgs, std::back_inserter(CmdArgs));

}

- // pass on -mllvm options to the clang

+ // Pass on -mllvm options to the clang invocation.

for (const opt::Arg *Arg : Args.filtered(OPT_mllvm)) {

for (const opt::Arg *Arg : Args.filtered(OPT_mllvm)) {

CmdArgs.push_back("-mllvm");

CmdArgs.push_back(Arg->getValue());

}

arsenm (Done): Shouldn't need this?

saiislam (Author, Done): It is required so that when the clang pass (not the LTO backend) is called from clang-linker-wrapper due to -save-temps, the user-provided COV is correctly propagated.

if (Args.hasArg(OPT_debug)) if (Args.hasArg(OPT_debug))

CmdArgs.push_back("-g"); CmdArgs.push_back("-g");

if (SaveTemps) if (SaveTemps)

CmdArgs.push_back("-save-temps"); CmdArgs.push_back("-save-temps");

arsenm (Done): Commented out code

jhuber6 (Done), suggested edit:

CmdArgs.push_back("-g");

- if (SaveTemps) {

+ if (SaveTemps)

CmdArgs.push_back("-save-temps");

- }

if (Verbose)

No braces around a single-line if.

if (Verbose) if (Verbose)

CmdArgs.push_back("-v"); CmdArgs.push_back("-v");

if (!CudaBinaryPath.empty()) if (!CudaBinaryPath.empty())

CmdArgs.push_back(Args.MakeArgString("--cuda-path=" + CudaBinaryPath)); CmdArgs.push_back(Args.MakeArgString("--cuda-path=" + CudaBinaryPath));

for (StringRef Arg : Args.getAllArgValues(OPT_ptxas_arg)) for (StringRef Arg : Args.getAllArgValues(OPT_ptxas_arg))

llvm::copy(SmallVector<StringRef>({"-Xcuda-ptxas", Arg}), llvm::copy(SmallVector<StringRef>({"-Xcuda-ptxas", Arg}),


openmp/libomptarget/DeviceRTL/CMakeLists.txt

function(compileDeviceRTLLibrary target_cpu target_name target_triple)
endif()		endif()
endfunction()		endfunction()

# Generate a Bitcode library for all the gpu architectures the user requested.		# Generate a Bitcode library for all the gpu architectures the user requested.
add_custom_target(omptarget.devicertl.nvptx)		add_custom_target(omptarget.devicertl.nvptx)
add_custom_target(omptarget.devicertl.amdgpu)		add_custom_target(omptarget.devicertl.amdgpu)
foreach(gpu_arch ${LIBOMPTARGET_DEVICE_ARCHITECTURES})		foreach(gpu_arch ${LIBOMPTARGET_DEVICE_ARCHITECTURES})
if("${gpu_arch}" IN_LIST all_amdgpu_architectures)		if("${gpu_arch}" IN_LIST all_amdgpu_architectures)
compileDeviceRTLLibrary(${gpu_arch} amdgpu amdgcn-amd-amdhsa)		compileDeviceRTLLibrary(${gpu_arch} amdgpu amdgcn-amd-amdhsa -Xclang -mcode-object-version=none)
elseif("${gpu_arch}" IN_LIST all_nvptx_architectures)		elseif("${gpu_arch}" IN_LIST all_nvptx_architectures)
compileDeviceRTLLibrary(${gpu_arch} nvptx nvptx64-nvidia-cuda --cuda-feature=+ptx61)		compileDeviceRTLLibrary(${gpu_arch} nvptx nvptx64-nvidia-cuda --cuda-feature=+ptx61)
else()		else()
libomptarget_error_say("Unknown GPU architecture '${gpu_arch}'")		libomptarget_error_say("Unknown GPU architecture '${gpu_arch}'")
endif()		endif()
endforeach()		endforeach()

# Archive all the object files generated above into a static library		# Archive all the object files generated above into a static library
add_library(omptarget.devicertl STATIC)		add_library(omptarget.devicertl STATIC)
set_target_properties(omptarget.devicertl PROPERTIES LINKER_LANGUAGE CXX)		set_target_properties(omptarget.devicertl PROPERTIES LINKER_LANGUAGE CXX)
target_link_libraries(omptarget.devicertl PRIVATE omptarget.devicertl.all_objs)		target_link_libraries(omptarget.devicertl PRIVATE omptarget.devicertl.all_objs)

install(TARGETS omptarget.devicertl ARCHIVE DESTINATION ${OPENMP_INSTALL_LIBDIR})		install(TARGETS omptarget.devicertl ARCHIVE DESTINATION ${OPENMP_INSTALL_LIBDIR})

openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp

Error unloadExecutable() {

Status = hsa_code_object_destroy(CodeObject); Status = hsa_code_object_destroy(CodeObject);

return Plugin::check(Status, "Error in hsa_code_object_destroy: %s"); return Plugin::check(Status, "Error in hsa_code_object_destroy: %s");

} }

/// Get the executable. /// Get the executable.

hsa_executable_t getExecutable() const { return Executable; } hsa_executable_t getExecutable() const { return Executable; }

/// Get to Code Object Version of the ELF

uint16_t getELFABIVersion() const { return ELFABIVersion; }

/// Find an HSA device symbol by its name on the executable. /// Find an HSA device symbol by its name on the executable.

Expected<hsa_executable_symbol_t> Expected<hsa_executable_symbol_t>

findDeviceSymbol(GenericDeviceTy &Device, StringRef SymbolName) const; findDeviceSymbol(GenericDeviceTy &Device, StringRef SymbolName) const;

/// Get additional info for kernel, e.g., register spill counts /// Get additional info for kernel, e.g., register spill counts

std::optional<utils::KernelMetaDataTy> std::optional<utils::KernelMetaDataTy>

getKernelInfo(StringRef Identifier) const { getKernelInfo(StringRef Identifier) const {

auto It = KernelInfoMap.find(Identifier); auto It = KernelInfoMap.find(Identifier);

if (It == KernelInfoMap.end()) if (It == KernelInfoMap.end())

return {}; return {};

return It->second; return It->second;

} }

private: private:

/// The exectuable loaded on the agent. /// The exectuable loaded on the agent.

hsa_executable_t Executable; hsa_executable_t Executable;

hsa_code_object_t CodeObject; hsa_code_object_t CodeObject;

StringMap<utils::KernelMetaDataTy> KernelInfoMap; StringMap<utils::KernelMetaDataTy> KernelInfoMap;

uint16_t ELFABIVersion;

}; };

/// Class implementing the AMDGPU kernel functionalities which derives from the /// Class implementing the AMDGPU kernel functionalities which derives from the

/// generic kernel class. /// generic kernel class.

struct AMDGPUKernelTy : public GenericKernelTy { struct AMDGPUKernelTy : public GenericKernelTy {

/// Create an AMDGPU kernel with a name and an execution mode. /// Create an AMDGPU kernel with a name and an execution mode.

AMDGPUKernelTy(const char *Name, OMPTgtExecModeFlags ExecutionMode) AMDGPUKernelTy(const char *Name, OMPTgtExecModeFlags ExecutionMode)

: GenericKernelTy(Name, ExecutionMode), : GenericKernelTy(Name, ExecutionMode) {}

ImplicitArgsSize(sizeof(utils::AMDGPUImplicitArgsTy)) {}

/// Initialize the AMDGPU kernel. /// Initialize the AMDGPU kernel.

Error initImpl(GenericDeviceTy &Device, DeviceImageTy &Image) override { Error initImpl(GenericDeviceTy &Device, DeviceImageTy &Image) override {

AMDGPUDeviceImageTy &AMDImage = static_cast<AMDGPUDeviceImageTy &>(Image); AMDGPUDeviceImageTy &AMDImage = static_cast<AMDGPUDeviceImageTy &>(Image);

// Kernel symbols have a ".kd" suffix. // Kernel symbols have a ".kd" suffix.

std::string KernelName(getName()); std::string KernelName(getName());

KernelName += ".kd"; KernelName += ".kd";

Error initImpl(GenericDeviceTy &Device, DeviceImageTy &Image) override {

// Make sure it is a kernel symbol. // Make sure it is a kernel symbol.

if (SymbolType != HSA_SYMBOL_KIND_KERNEL) if (SymbolType != HSA_SYMBOL_KIND_KERNEL)

return Plugin::error("Symbol %s is not a kernel function"); return Plugin::error("Symbol %s is not a kernel function");

// TODO: Read the kernel descriptor for the max threads per block. May be // TODO: Read the kernel descriptor for the max threads per block. May be

// read from the image. // read from the image.

ImplicitArgsSize = utils::getImplicitArgsSize(AMDImage.getELFABIVersion());

DP("ELFABIVersion: %d\n", AMDImage.getELFABIVersion());

// Get additional kernel info read from image // Get additional kernel info read from image

KernelInfo = AMDImage.getKernelInfo(getName()); KernelInfo = AMDImage.getKernelInfo(getName());

if (!KernelInfo.has_value()) if (!KernelInfo.has_value())

INFO(OMP_INFOTYPE_PLUGIN_KERNEL, Device.getDeviceId(), INFO(OMP_INFOTYPE_PLUGIN_KERNEL, Device.getDeviceId(),

"Could not read extra information for kernel %s.", getName()); "Could not read extra information for kernel %s.", getName());

return Plugin::success(); return Plugin::success();

} }

struct AMDGPUKernelTy : public GenericKernelTy {

/// Get group and private segment kernel size. /// Get group and private segment kernel size.

uint32_t getGroupSize() const { return GroupSize; } uint32_t getGroupSize() const { return GroupSize; }

uint32_t getPrivateSize() const { return PrivateSize; } uint32_t getPrivateSize() const { return PrivateSize; }

/// Get the HSA kernel object representing the kernel function. /// Get the HSA kernel object representing the kernel function.

uint64_t getKernelObject() const { return KernelObject; } uint64_t getKernelObject() const { return KernelObject; }

/// Get the size of implicitargs based on the code object version

/// @return 56 for cov4 and 256 for cov5

uint32_t getImplicitArgsSize() const { return ImplicitArgsSize; }

private: private:

/// The kernel object to execute. /// The kernel object to execute.

uint64_t KernelObject; uint64_t KernelObject;

/// The args, group and private segments sizes required by a kernel instance. /// The args, group and private segments sizes required by a kernel instance.

uint32_t ArgsSize; uint32_t ArgsSize;

uint32_t GroupSize; uint32_t GroupSize;

uint32_t PrivateSize; uint32_t PrivateSize;

/// The size of implicit kernel arguments. /// The size of implicit kernel arguments.

const uint32_t ImplicitArgsSize; uint32_t ImplicitArgsSize;

/// Additional Info for the AMD GPU Kernel /// Additional Info for the AMD GPU Kernel

std::optional<utils::KernelMetaDataTy> KernelInfo; std::optional<utils::KernelMetaDataTy> KernelInfo;

}; };

/// Class representing an HSA signal. Signals are used to define dependencies /// Class representing an HSA signal. Signals are used to define dependencies

/// between asynchronous operations: kernel launches and memory transfers. /// between asynchronous operations: kernel launches and memory transfers.

struct AMDGPUSignalTy { struct AMDGPUSignalTy {

struct AMDGPUDeviceTy : public GenericDeviceTy, AMDGenericDeviceTy {

/// Initialize the device, its resources and get its properties. /// Initialize the device, its resources and get its properties.

Error initImpl(GenericPluginTy &Plugin) override { Error initImpl(GenericPluginTy &Plugin) override {

// First setup all the memory pools. // First setup all the memory pools.

if (auto Err = initMemoryPools()) if (auto Err = initMemoryPools())

return Err; return Err;

char GPUName[64]; char GPUName[64];

if (auto Err = getDeviceAttr(HSA_AGENT_INFO_NAME, GPUName)) if (auto Err = getDeviceAttr(HSA_AGENT_INFO_NAME, GPUName))

jhuber6 (Not Done): Leftover?

saiislam (Author, Done): No, it is not a leftover.
One of the fields in the cov5 implicit kernarg is the heap_v1 ptr. It should point to a 128KB zero-initialized block of coarse-grained memory on each device before the kernel is launched. This code was working a while ago, but right now it is failing, most likely due to some recent change in the devicertl memory handling mechanism.
I need to debug it with this patch, otherwise it will cause all target region code calling device-malloc to fail.
I will try to fix it before the next revision.

jhuber6 (Done): Do we really need that? We only use a fraction of the existing implicit arguments. My understanding is that most of these are more for runtime handling for HIP and OpenCL while we would most likely want our own solution. I'm assuming that the 128KB is not required for anything we use?

saiislam (Author, Done): I have removed the preallocatedheap work from this patch.

return Err; return Err;

ComputeUnitKind = GPUName; ComputeUnitKind = GPUName;

// Get the wavefront size. // Get the wavefront size.

uint32_t WavefrontSize = 0; uint32_t WavefrontSize = 0;

if (auto Err = getDeviceAttr(HSA_AGENT_INFO_WAVEFRONT_SIZE, WavefrontSize)) if (auto Err = getDeviceAttr(HSA_AGENT_INFO_WAVEFRONT_SIZE, WavefrontSize))

return Err; return Err;

GridValues.GV_Warp_Size = WavefrontSize; GridValues.GV_Warp_Size = WavefrontSize;

return utils::iterateAgentMemoryPools(

Plugin::get().allocate<AMDGPUMemoryPoolTy>(); Plugin::get().allocate<AMDGPUMemoryPoolTy>();

new (MemoryPool) AMDGPUMemoryPoolTy(HSAMemoryPool); new (MemoryPool) AMDGPUMemoryPoolTy(HSAMemoryPool);

AllMemoryPools.push_back(MemoryPool); AllMemoryPools.push_back(MemoryPool);

return HSA_STATUS_SUCCESS; return HSA_STATUS_SUCCESS;

}); });

} }

private: private:

using AMDGPUEventRef = AMDGPUResourceRef<AMDGPUEventTy>; using AMDGPUEventRef = AMDGPUResourceRef<AMDGPUEventTy>;

jhuber6 (Not Done): Why do we need this? The current method shouldn't need to change if all we're doing is allocating memory of greater size.

saiislam (Author, Done): `PreAllocatedDeviceMemoryPool` is the pointer which stores the intermediate value before it is written to the heap_v1_ptr field of the cov5 implicit kernarg.

using AMDGPUEventManagerTy = GenericDeviceResourceManagerTy<AMDGPUEventRef>; using AMDGPUEventManagerTy = GenericDeviceResourceManagerTy<AMDGPUEventRef>;

/// Envar for controlling the number of HSA queues per device. High number of /// Envar for controlling the number of HSA queues per device. High number of

/// queues may degrade performance. /// queues may degrade performance.

UInt32Envar OMPX_NumQueues; UInt32Envar OMPX_NumQueues;

/// Envar for controlling the size of each HSA queue. The size is the number /// Envar for controlling the size of each HSA queue. The size is the number

/// of HSA packets a queue is expected to hold. It is also the number of HSA /// of HSA packets a queue is expected to hold. It is also the number of HSA

/// packets that can be pushed into each queue without waiting the driver to /// packets that can be pushed into each queue without waiting the driver to

jhuber6 (Done), suggested edit:

Error Err = retrieveAllMemoryPools();

- if (Err)

- return Plugin::error("Unable to retieve all memmory pools");

+ if (auto Err = retrieveAllMemoryPools())

+ return Err;

void *DevPtr;

This and below isn't correct. You can't discard an `llvm::Error` value like this without either doing consumeError(std::move(Err)) or toString(std::move(Err)). However, you don't need to consume these in the first place; they already contain the error message from the callee and should just be forwarded.

saiislam (Author, Done): Removed the logic for preallocatedheap.
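
Note: the forwarding pattern suggested in this thread can be illustrated without LLVM headers. The sketch below uses `std::optional<std::string>` as a stand-in for `llvm::Error` (an empty optional means success); all names ending in `Sketch`, the `Fail` parameter, and the callee message are hypothetical. Unlike `llvm::Error`, the stand-in does not enforce that an error is checked or consumed before it is destroyed, so it only demonstrates why forwarding keeps the callee's message.

```cpp
#include <optional>
#include <string>

// Stand-in for llvm::Error: std::nullopt means success, a string means failure.
using ErrorSketch = std::optional<std::string>;

// Hypothetical callee that reports its own, specific message on failure.
inline ErrorSketch retrieveAllMemoryPoolsSketch(bool Fail) {
  if (Fail)
    return std::string("callee: enumerating memory pools failed");
  return std::nullopt;
}

// Discouraged shape: the callee's error is dropped and replaced by a
// generic message, so the original cause is lost.
inline ErrorSketch initReplacingError(bool Fail) {
  ErrorSketch Err = retrieveAllMemoryPoolsSketch(Fail);
  if (Err)
    return std::string("Unable to retrieve all memory pools");
  return std::nullopt;
}

// Suggested shape: forward whatever the callee reported.
inline ErrorSketch initForwardingError(bool Fail) {
  if (auto Err = retrieveAllMemoryPoolsSketch(Fail))
    return Err;
  return std::nullopt;
}
```

With the real `llvm::Error`, the replacing variant has the additional problem raised in the review: the original error object would be dropped without being consumed.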

/// process them. /// process them.

UInt32Envar OMPX_QueueSize; UInt32Envar OMPX_QueueSize;

/// Envar for controlling the default number of teams relative to the number /// Envar for controlling the default number of teams relative to the number

/// of compute units (CUs) the device has: /// of compute units (CUs) the device has:

/// #default_teams = OMPX_DefaultTeamsPerCU * #CUs. /// #default_teams = OMPX_DefaultTeamsPerCU * #CUs.

UInt32Envar OMPX_DefaultTeamsPerCU; UInt32Envar OMPX_DefaultTeamsPerCU;

Error AMDGPUDeviceImageTy::loadExecutable(const AMDGPUDeviceTy &Device) {

uint32_t Result; uint32_t Result;

Status = hsa_executable_validate(Executable, &Result); Status = hsa_executable_validate(Executable, &Result);

if (auto Err = Plugin::check(Status, "Error in hsa_executable_validate: %s")) if (auto Err = Plugin::check(Status, "Error in hsa_executable_validate: %s"))

return Err; return Err;

if (Result) if (Result)

return Plugin::error("Loaded HSA executable does not validate"); return Plugin::error("Loaded HSA executable does not validate");

if (auto Err = if (auto Err = utils::readAMDGPUMetaDataFromImage(

utils::readAMDGPUMetaDataFromImage(getMemoryBuffer(), KernelInfoMap)) getMemoryBuffer(), KernelInfoMap, ELFABIVersion))

return Err; return Err;

return Plugin::success(); return Plugin::success();

} }

Expected<hsa_executable_symbol_t> Expected<hsa_executable_symbol_t>

AMDGPUDeviceImageTy::findDeviceSymbol(GenericDeviceTy &Device, AMDGPUDeviceImageTy::findDeviceSymbol(GenericDeviceTy &Device,

StringRef SymbolName) const { StringRef SymbolName) const {

Error AMDGPUKernelTy::launchImpl(GenericDeviceTy &GenericDevice,

AMDGPUStreamTy *Stream = nullptr; AMDGPUStreamTy *Stream = nullptr;

if (auto Err = AMDGPUDevice.getStream(AsyncInfoWrapper, Stream)) if (auto Err = AMDGPUDevice.getStream(AsyncInfoWrapper, Stream))

return Err; return Err;

// If this kernel requires an RPC server we attach its pointer to the stream. // If this kernel requires an RPC server we attach its pointer to the stream.

if (GenericDevice.getRPCServer()) if (GenericDevice.getRPCServer())

Stream->setRPCServer(GenericDevice.getRPCServer()); Stream->setRPCServer(GenericDevice.getRPCServer());

// Only COV5 implicitargs needs to be set. COV4 implicitargs are not used.

jhuber6: So we're required to emit some new arguments? I don't have any idea what's changed between this COV4 and COV5 stuff.

saiislam: In cov5, we need to set certain fields of the implicit kernel arguments before launching the kernel. Please see "AMDHSA Code Object V5 Kernel Argument Metadata Map Additions and Changes" for more details.

Only NumBlocks, NumThreads (XYZ), GridDims, and Heap_V1_ptr are relevant for us, so I have simplified the code further.
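To make the reply above concrete, here is a minimal standalone sketch of filling the COV5 implicit arguments for a one-dimensional launch. `ImplicitArgsCOV5` and `fillImplicitArgs1D` are hypothetical names introduced for illustration; the field layout copies the struct shown later in this diff, and the assignments mirror the ones in `launchImpl`. Heap_V1_ptr is left out because the struct in this revision does not name it.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for the patch's AMDGPUImplicitArgsTy. Only the fields
// the plugin writes are named; everything else is opaque padding.
struct ImplicitArgsCOV5 {
  std::uint32_t BlockCountX;
  std::uint32_t BlockCountY;
  std::uint32_t BlockCountZ;
  std::uint16_t GroupSizeX;
  std::uint16_t GroupSizeY;
  std::uint16_t GroupSizeZ;
  std::uint8_t Unused0[46];
  std::uint16_t GridDims;
  std::uint8_t Unused1[190];
};

// Layout checks derived from the field sizes above.
static_assert(offsetof(ImplicitArgsCOV5, GroupSizeX) == 12, "unexpected offset");
static_assert(offsetof(ImplicitArgsCOV5, GridDims) == 64, "unexpected offset");
static_assert(sizeof(ImplicitArgsCOV5) == 256, "unexpected size");

// Hypothetical helper: zero-fill the block, then set the fields that the diff
// assigns for a 1-D launch. BlockCountY/Z stay zero, as in this revision.
inline void fillImplicitArgs1D(ImplicitArgsCOV5 &Args, std::uint32_t NumBlocks,
                               std::uint16_t NumThreads) {
  std::memset(&Args, 0, sizeof(Args));
  Args.BlockCountX = NumBlocks;
  Args.GroupSizeX = NumThreads;
  Args.GroupSizeY = 1;
  Args.GroupSizeZ = 1;
  Args.GridDims = 1;
}
```

Zero-filling first means every field the runtime does not care about has a defined value, which is the property the reviewers ask for further down.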


+   if (getImplicitArgsSize() == sizeof(utils::AMDGPUImplicitArgsTy)) {

arsenm: This isn't doing anything?

saiislam: Earlier we used to set hostcall_buffer here, but not anymore. I have left the message in DP just for debug help.

jhuber6: Don't think this needs to be a debug message, same below.

+     ImplArgs->BlockCountX = NumBlocks;
+     ImplArgs->GroupSizeX = NumThreads;
+     ImplArgs->GroupSizeY = 1;
+     ImplArgs->GroupSizeZ = 1;
+     ImplArgs->GridDims = 1;
+   }
    // Push the kernel launch into the stream.
    return Stream->pushKernelLaunch(*this, AllArgs, NumThreads, NumBlocks,
                                    GroupSize, ArgsMemoryManager);
  }

  Error AMDGPUKernelTy::printLaunchInfoDetails(GenericDeviceTy &GenericDevice,
                                               KernelArgsTy &KernelArgs,
                                               uint32_t NumThreads,
[142 lines not shown]

openmp/libomptarget/plugins-nextgen/amdgpu/utils/UtilitiesRTL.h

[19 lines not shown]
  #include "llvm/Support/Error.h"

  #include "llvm/BinaryFormat/AMDGPUMetadataVerifier.h"
  #include "llvm/BinaryFormat/ELF.h"
  #include "llvm/BinaryFormat/MsgPackDocument.h"
  #include "llvm/Support/MemoryBufferRef.h"

  #include "llvm/Support/YAMLTraits.h"
+ using namespace llvm::ELF;

  namespace llvm {
  namespace omp {
  namespace target {
  namespace plugin {
  namespace utils {

- // The implicit arguments of AMDGPU kernels.
+ // The implicit arguments of COV5 AMDGPU kernels.
  struct AMDGPUImplicitArgsTy {
-   uint64_t OffsetX;
-   uint64_t OffsetY;
-   uint64_t OffsetZ;
-   uint64_t HostcallPtr;
-   uint64_t Unused0;
-   uint64_t Unused1;
-   uint64_t Unused2;
+   uint32_t BlockCountX;
+   uint32_t BlockCountY;
+   uint32_t BlockCountZ;
+   uint16_t GroupSizeX;
+   uint16_t GroupSizeY;
+   uint16_t GroupSizeZ;
+   uint8_t Unused0[46]; // 46 byte offset.
+   uint16_t GridDims;
+   uint8_t Unused1[190]; // 190 byte offset.

arsenm: This is getting duplicated a few places, should it move to a support header? I don't love the existing APIs for this, I think a struct definition makes more sense.

jhuber6: The other user here is my custom loader, @JonChesterfield has talked about wanting a common HSA helper header for a while now. I agree that the struct definition is much better. Being able to simply allocate this size and then zero fill it is much cleaner.

saiislam: Defining a struct for the whole 256 bytes of implicitargs in cov5 was becoming a little difficult due to the different sizes of various fields (2, 4, 6, 8, 48, 72 bytes) along with multiple reserved fields in between. It made sense for cov4 because it only had 7 fields of 8 bytes each, where we needed only the 4th field in the OpenMP runtime (for hostcall_buffer). Offset-based lookups like the following allow handling/exposing only the required fields across generations of the ABI.

jhuber6: If we don't use it, just put it as `unused`. It's really hard to read as-is and it makes it more difficult to just zero fill.

saiislam: I have reduced the fields to the bare minimum required for OpenMP.

jhuber6: I'm still not a fan of replacing the struct. The mnemonic of having a struct is much more user friendly.

    ImplicitArgsTy Args{};
    std::memset(&Args, 0, sizeof(ImplicitArgsTy));
    ...

If we don't use something, just make it some random bytes, e.g.

    struct ImplicitArgsTy {
      uint64_t OffsetX;
      uint8_t Unused[64]; // 64 byte offset.
    };

saiislam: Replaced.

  };
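For comparison with the struct that was finally adopted, the offset-based alternative discussed in the thread above could look roughly like the sketch below. All names here (`ImplicitArgsBuffer`, `writeField`, `readField`, and the `k...Offset` constants) are hypothetical, and the byte offsets are simply derived from the field sizes of the struct in this diff rather than taken from any header.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>

// Assumed values: total size and byte offsets implied by the struct layout in
// this diff (3 x uint32_t, 3 x uint16_t, 46 bytes of padding, then GridDims).
constexpr std::size_t kImplicitArgsSize = 256;
constexpr std::size_t kBlockCountXOffset = 0;
constexpr std::size_t kGroupSizeXOffset = 12;
constexpr std::size_t kGridDimsOffset = 64;

using ImplicitArgsBuffer = std::array<std::uint8_t, kImplicitArgsSize>;

// Copy a trivially copyable value into the buffer at a byte offset.
// The caller is responsible for Offset + sizeof(T) <= kImplicitArgsSize.
template <typename T>
void writeField(ImplicitArgsBuffer &Buf, std::size_t Offset, T Value) {
  static_assert(std::is_trivially_copyable<T>::value, "raw copy only");
  std::memcpy(Buf.data() + Offset, &Value, sizeof(T));
}

// Read a value back from the same byte offset.
template <typename T>
T readField(const ImplicitArgsBuffer &Buf, std::size_t Offset) {
  static_assert(std::is_trivially_copyable<T>::value, "raw copy only");
  T Value;
  std::memcpy(&Value, Buf.data() + Offset, sizeof(T));
  return Value;
}
```

A value-initialized `ImplicitArgsBuffer Buf{};` is already zero-filled, so this form can also satisfy the zero-fill requirement. What it loses is the named-field access and compile-time layout checking that the reviewers preferred, since every call site has to pair the right type with the right offset by hand.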

- static_assert(sizeof(AMDGPUImplicitArgsTy) == 56,
-               "Unexpected size of implicit arguments");
+ // Dummy struct for COV4 implicitargs.
+ struct AMDGPUImplicitArgsTyCOV4 {
+   uint8_t Unused[56];
+ };
+
+ uint32_t getImplicitArgsSize(uint16_t Version) {

jhuber6: We return uint16_t here? These are sizes.

+   return Version < ELF::ELFABIVERSION_AMDGPU_HSA_V5
+              ? sizeof(AMDGPUImplicitArgsTyCOV4)
+              : sizeof(AMDGPUImplicitArgsTy);
+ }

jhuber6: We should probably be using `sizeof` now that it's back to being a struct and keep the old struct definition.

saiislam: AMDGPU plugin doesn't use any implicitarg for COV4, but it does so for COV5. So, we are not keeping two separate structures for implicitargs of COV4 and COV5. If we use sizeof then it will always return 256 corresponding to COV5 (even for cov4, which should be 56). That's why we need this function.

jhuber6: Yeah, I guess for COV4 the only thing that mattered was the size so that we could make sure it's all set to zero. We shouldn't use the enum value. It should be `sizeof(ImplicitArgsTy)` for `COV5` and either hard-code it in the function for V4 or make a dummy struct.
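
The outcome of this thread, a size chosen by ABI version with both sizes coming from `sizeof`, can be sketched in isolation as follows. The two constants are assumptions meant to mirror `llvm::ELF::ELFABIVERSION_AMDGPU_HSA_V4` and `_V5`; the struct and function names are hypothetical stand-ins so that the snippet compiles without LLVM headers.

```cpp
#include <cstddef>
#include <cstdint>

// Assumed to match the LLVM ELF ABI version enumerators for AMDGPU HSA.
constexpr std::uint16_t kAbiVersionHsaV4 = 2;
constexpr std::uint16_t kAbiVersionHsaV5 = 3;

// COV4: the plugin only needs the size, so a dummy 56-byte struct suffices.
struct ImplicitArgsCOV4 {
  std::uint8_t Unused[56];
};

// COV5: stand-in with the same total size as the real struct in this diff.
struct ImplicitArgsCOV5 {
  std::uint8_t Bytes[256];
};

// Pick the implicit-argument block size for a given code object ABI version.
constexpr std::size_t implicitArgsSizeFor(std::uint16_t AbiVersion) {
  return AbiVersion < kAbiVersionHsaV5 ? sizeof(ImplicitArgsCOV4)
                                       : sizeof(ImplicitArgsCOV5);
}

static_assert(implicitArgsSizeFor(kAbiVersionHsaV4) == 56, "COV4 size");
static_assert(implicitArgsSizeFor(kAbiVersionHsaV5) == 256, "COV5 size");
```

Returning `std::size_t` rather than a narrower integer follows the reviewer's point that the result is a size.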

  /// Parse a TargetID to get processor arch and feature map.
  /// Returns processor subarch.
  /// Returns TargetID features in \p FeatureMap argument.
  /// If the \p TargetID contains feature+, FeatureMap it to true.
  /// If the \p TargetID contains feature-, FeatureMap it to false.
  /// If the \p TargetID does not contain a feature (default), do not map it.
  StringRef parseTargetID(StringRef TargetID, StringMap<bool> &FeatureMap) {
[234 lines not shown] private:
    // Kernel names are the keys
    StringMap<KernelMetaDataTy> &KernelInfoMap;
  };
  } // namespace

  /// Reads the AMDGPU specific metadata from the ELF file and propagates the
  /// KernelInfoMap
  Error readAMDGPUMetaDataFromImage(MemoryBufferRef MemBuffer,
-                                   StringMap<KernelMetaDataTy> &KernelInfoMap) {
+                                   StringMap<KernelMetaDataTy> &KernelInfoMap,
+                                   uint16_t &ELFABIVersion) {
    Error Err = Error::success(); // Used later as out-parameter

    auto ELFOrError = object::ELF64LEFile::create(MemBuffer.getBuffer());
    if (auto Err = ELFOrError.takeError())
      return Err;

    const object::ELF64LEFile ELFObj = ELFOrError.get();
    ArrayRef<object::ELF64LE::Shdr> Sections = cantFail(ELFObj.sections());
    KernelInfoReader Reader(KernelInfoMap);

+   // Read the code object version from ELF image header
+   auto Header = ELFObj.getHeader();
+   ELFABIVersion = (uint8_t)(Header.e_ident[ELF::EI_ABIVERSION]);
+   DP("ELFABIVERSION Version: %u\n", ELFABIVersion);
+
    for (const auto &S : Sections) {
      if (S.sh_type != ELF::SHT_NOTE)
        continue;

      for (const auto N : ELFObj.notes(S, Err)) {
        if (Err)
          return Err;
        // Fills the KernelInfoTabel entries in the reader
[13 lines not shown]
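
The header read added in `readAMDGPUMetaDataFromImage` above relies on the ABI version being a single byte of the ELF identification array. A dependency-free sketch of the same lookup is shown below; `readElfAbiVersion` is a hypothetical helper, not part of the patch, and it works on a raw byte buffer instead of `object::ELF64LEFile`. It assumes C++17 for `std::optional`.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>

// Sizes and indices from the ELF specification: e_ident is 16 bytes long and
// the ABI version lives at index 8 (EI_ABIVERSION).
constexpr std::size_t kEiNident = 16;
constexpr std::size_t kEiAbiVersion = 8;

// Return the EI_ABIVERSION byte of an ELF image, or nullopt if the buffer is
// too short or does not start with the ELF magic number.
inline std::optional<std::uint8_t> readElfAbiVersion(const std::uint8_t *Data,
                                                     std::size_t Size) {
  if (Data == nullptr || Size < kEiNident)
    return std::nullopt;
  if (Data[0] != 0x7f || Data[1] != 'E' || Data[2] != 'L' || Data[3] != 'F')
    return std::nullopt;
  return Data[kEiAbiVersion];
}
```

The plugin then compares this value against the AMDGPU HSA ABI version enumerators to decide which implicit-argument layout applies.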
