This is an archive of the discontinued LLVM Phabricator instance.

[OpenMP][DeviceRTL][AMDGPU] Support code object version 5
ClosedPublic

Authored by saiislam on Dec 9 2022, 11:10 AM.

Details

Summary

Update DeviceRTL and the AMDGPU plugin to use code
object version 5. Default is code object version 4.

DeviceRTL uses rocm-device-libs instead of directly calling
amdgcn builtins for the functions which are affected by
cov5.

AMDGPU plugin queries the ELF for code object version
and then prepares various implicitargs accordingly.

Diff Detail

Event Timeline

saiislam created this revision.Dec 9 2022, 11:10 AM
Herald added a project: Restricted Project. · View Herald TranscriptDec 9 2022, 11:10 AM
saiislam requested review of this revision.Dec 9 2022, 11:10 AM
Herald added projects: Restricted Project, Restricted Project. · View Herald TranscriptDec 9 2022, 11:10 AM

Maybe we should wait until D138389 lands and we can update both, otherwise we'd need a second patch.

I'm not fully up-to-date, what's the main difference and advantage of the new code object version? What do all the new implicit arguments do.

clang/lib/Driver/ToolChains/AMDGPU.cpp
953 ↗(On Diff #481701)

Unrelated?

openmp/libomptarget/DeviceRTL/include/Interface.h
169 ↗(On Diff #481701)

This should probably use variants to match the rest of the style, also if you intend to read these outside of the library you'll need to put them in the exports file and set their visibility.

openmp/libomptarget/DeviceRTL/src/Mapping.cpp
19 ↗(On Diff #481701)

What if this isn't defined? We should be able to use the OpenMP library without the AMD device libraries. Should it be extern weak?

openmp/libomptarget/DeviceRTL/src/State.cpp
73 ↗(On Diff #481701)

Variants

openmp/libomptarget/plugins/amdgpu/impl/get_elf_mach_gfx_name.h
15 ↗(On Diff #481701)

Unrelated, but is there any particular reason these aren't defined in the hsa_amd_ext.h?

yaxunl added inline comments.Dec 9 2022, 11:32 AM
clang/lib/Driver/ToolChains/Clang.cpp
7324

Any reason you need the original args? This will bypass the driver translation, which should not in normal cases.

7325

clang -cc1 needs this to be default value false to emit code object version module flag

Could we elaborate on the benefits, please. Now we support two versions?

Why is this helpful:

DeviceRTL uses rocm-device-libs instead of directly calling amdgcn builtins for the functions which are affected by cov5.

openmp/libomptarget/DeviceRTL/src/State.cpp
81 ↗(On Diff #481701)

Why do we need the "external..." stuff anyway?

openmp/libomptarget/plugins/amdgpu/src/rtl.cpp
2233 ↗(On Diff #481701)

What is this all about?

I am reluctant to add the dependency edge to rocm device libs to openmp's GPU runtime.

We currently require that library for libm, which I'm also not thrilled about, but at least you can currently build and run openmp programs (that don't use libm, like much of our tests) without it.

My problem with device libs is that it usually doesn't build with trunk. It follows a rolling dev model tied to rocm clang and when upstream does something that takes a long time to apply, device libs doesn't build until rocm catches up. I've literally never managed to compile any branch of device libs with trunk clang without modifying the source, generally to delete directories that don't look necessary for libm.

Further, selecting an ABI based on runtime code found in a library which is hopefully constant folded is a weird layering choice. The compiler knows what ABI it is emitting code for, and that's how it picks files from device libs to effect that choice, but it would make far more sense to me for the compiler back end to set this stuff up itself.

Also, if we handle ABI in the back end, then we don't get the inevitable problem of rocm device libs and trunk clang having totally different ideas of what the ABI is as they drift in and out of sync.

tianshilei1992 added inline comments.
openmp/libomptarget/DeviceRTL/src/Mapping.cpp
19 ↗(On Diff #481701)

It should be put into AMD's declare variant.

Thanks everyone for your review and comments!
I am going to address all of them in a series of smaller patches starting with D140784.

saiislam added inline comments.Jan 4 2023, 7:08 AM
openmp/libomptarget/DeviceRTL/src/Mapping.cpp
50 ↗(On Diff #481701)

If we still don't want to depend on rocm-device-libs then we will have to do something like (haven't tried this code yet):

uint32_t getNumHardwareThreadsInBlock() {
   if (__oclc_ABI_version < 500) {
      return __builtin_amdgcn_workgroup_size_x();
   } else {
      void *implicitArgPtr = __builtin_amdgcn_implicitarg_ptr();
      return (ushort)implicitArgPtr[6];
}
80 ↗(On Diff #481701)
uint32_t getNumberOfBlocks() {
   if (__oclc_ABI_version < 500) {
      return __builtin_amdgcn_grid_size_x() / __builtin_amdgcn_workgroup_size_x();
   } else {
      void *implicitArgPtr = __builtin_amdgcn_implicitarg_ptr();
      return (uint)implicitArgPtr[0];
}
saiislam marked an inline comment as done.Jan 18 2023, 7:31 AM
saiislam added inline comments.
clang/lib/Driver/ToolChains/Clang.cpp
7324

We need derived args to look for mcode-object-version. I have created a separate review for this change. Please have a look at D142022

arsenm added a subscriber: arsenm.Jun 13 2023, 11:55 AM

We currently require that library for libm, which I'm also not thrilled about, but at least you can currently build and run openmp programs (that don't use libm, like much of our tests) without it.

The ABI isn't defined in terms of what device-libs does. It's fixed offsets off of pointers accessible through amdgcn intrinsics. You can also just directly emit the same IR, these functions aren't complicated

We currently require that library for libm, which I'm also not thrilled about, but at least you can currently build and run openmp programs (that don't use libm, like much of our tests) without it.

The ABI isn't defined in terms of what device-libs does. It's fixed offsets off of pointers accessible through amdgcn intrinsics. You can also just directly emit the same IR, these functions aren't complicated

This is the suggestion I've talked with @saiislam about. I think we should just copy the magic intrinsics that are being queried here. I'm assuming we don't need to bother with supporting both v4 and v5 so we can just make the switch all at once.

saiislam updated this revision to Diff 547297.Aug 4 2023, 11:49 AM

Another attempt at cov5 support by using CodeGen for buitlin_amdgpu_workgroup_size.

arsenm added inline comments.Aug 4 2023, 12:12 PM
clang/lib/CodeGen/CGBuiltin.cpp
17123

this must always pass

openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
3007

This isn't doing anything?

openmp/libomptarget/plugins-nextgen/amdgpu/utils/UtilitiesRTL.h
38–46

This is getting duplicated a few places, should it move to a support header?

I don't love the existing APIs for this, I think a struct definition makes more sense

Could you explain briefly what the approach here is? I'm confused as to what's actually changed and how we're handling this difference. I thought if this was just the definition of some builtin function we could just rely on the backend to figure it out. Why do we need to know the code object version inside the device RTL?

clang/lib/CodeGen/CGBuiltin.cpp
17117

Could you explain the function of this in a comment? Are we emitting generic code if unspecified?

17146–17147

nit.

17153

Leftover debugging?

clang/lib/Driver/ToolChain.cpp
1366

Shouldn't we be able to put this under the OPT_m_group below?

openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
1752

Leftoever?

2548

Why do we need this? The current method shouldn't need to change if all we're doing is allocating memory of greater size.

3006

So we're required to emit some new arguments? I don't have any idea what'schanged between this COV4 and COV5 stuff.

jhuber6 added inline comments.Aug 4 2023, 12:14 PM
openmp/libomptarget/plugins-nextgen/amdgpu/utils/UtilitiesRTL.h
38–46

The other user here is my custom loader, @JonChesterfield has talked about wanting a common HSA helper header for awhile now.

I agree that the struct definition is much better. Being able to simply allocate this size and then zero fill it is much cleaner.

arsenm added a comment.EditedAug 4 2023, 12:22 PM

Could you explain briefly what the approach here is? I'm confused as to what's actually changed and how we're handling this difference. I thought if this was just the definition of some builtin function we could just rely on the backend to figure it out. Why do we need to know the code object version inside the device RTL?

The builtin is called in the device rtl, so the device RTL needs to contain both implementations. The "backend figuring it out" is dead code elimination

Could you explain briefly what the approach here is? I'm confused as to what's actually changed and how we're handling this difference. I thought if this was just the definition of some builtin function we could just rely on the backend to figure it out. Why do we need to know the code object version inside the device RTL?

The build is called in the device rtl, so the device RTL needs to contain both implementations. The "backend figuring it out" is dead code elimination

Okay, do we expect to re-use this interface anywhere? If it's just for OpenMP then we should probably copy the approach taken for __omp_rtl_debug_kind, which is a global created on the GPU by CGOpenMPRuntimeGPU's constructor and does more or less the same thing.

Could you explain briefly what the approach here is? I'm confused as to what's actually changed and how we're handling this difference. I thought if this was just the definition of some builtin function we could just rely on the backend to figure it out. Why do we need to know the code object version inside the device RTL?

The build is called in the device rtl, so the device RTL needs to contain both implementations. The "backend figuring it out" is dead code elimination

Okay, do we expect to re-use this interface anywhere? If it's just for OpenMP then we should probably copy the approach taken for __omp_rtl_debug_kind, which is a global created on the GPU by CGOpenMPRuntimeGPU's constructor and does more or less the same thing.

device libs replicates the same scheme using its own copy of an equivalent variable. Trying to merge those two together

Could you explain briefly what the approach here is? I'm confused as to what's actually changed and how we're handling this difference. I thought if this was just the definition of some builtin function we could just rely on the backend to figure it out. Why do we need to know the code object version inside the device RTL?

The build is called in the device rtl, so the device RTL needs to contain both implementations. The "backend figuring it out" is dead code elimination

Okay, do we expect to re-use this interface anywhere? If it's just for OpenMP then we should probably copy the approach taken for __omp_rtl_debug_kind, which is a global created on the GPU by CGOpenMPRuntimeGPU's constructor and does more or less the same thing.

device libs replicates the same scheme using its own copy of an equivalent variable. Trying to merge those two together

Although I guess that doesn't really need the builtin changes?

saiislam updated this revision to Diff 547751.Aug 7 2023, 5:52 AM
saiislam marked 5 inline comments as done.

Removed unused cov5 implicitargs fields.
Added comments about EmitAMDGPUWorkGroupSize and ABI-agnostica code emission.
Adressed reviewers' comments.

Could you explain briefly what the approach here is? I'm confused as to what's actually changed and how we're handling this difference. I thought if this was just the definition of some builtin function we could just rely on the backend to figure it out. Why do we need to know the code object version inside the device RTL?

The build is called in the device rtl, so the device RTL needs to contain both implementations. The "backend figuring it out" is dead code elimination

Okay, do we expect to re-use this interface anywhere? If it's just for OpenMP then we should probably copy the approach taken for __omp_rtl_debug_kind, which is a global created on the GPU by CGOpenMPRuntimeGPU's constructor and does more or less the same thing.

device libs replicates the same scheme using its own copy of an equivalent variable. Trying to merge those two together

Although I guess that doesn't really need the builtin changes?

This builtin was already aware about cov4 and cov5. All this patch is changing is making it aware about a possibility where both needs to be present.
It is already used by device-libs, deviceRTL, and libc-gpu.
Also, encapsulating ABI related changes in implementation of the builtin allows other runtime developers to be agnostic to these lower level changes.

clang/lib/CodeGen/CGBuiltin.cpp
17146–17147

There are a couple of common lines after the inner if-else, in the outer else section.

openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
1752

No, it is not a left over.
One of the fields in cov5 implicitikernarg is heap_v1 ptr. It should point to a 128KB zero-initialized block of coarse-grained memory on each device before launching the kernel. This code was working a while ago, but right now it is failing most likely due to some latest change in devicertl memory handling mechanism.
I need to debug it with this patch, otherwise it will cause all target region code calling device-malloc to fail.
I will try to fix it before the next revision.

2548

PreAllocatedDeviceMemoryPool is the pointer which stores the intermediate value before it is written to heap_v1_ptr field of cov5 implicitkernarg.

3006

In cov5, we need to set certain fields of the implicit kernel arguments before launching the kernel.
Please see AMDHSA Code Object V5 Kernel Argument Metadata Map Additions and Changes for more details.

Only NumBlocks, NumThreads(XYZ), GridDims, and Heap_V1_ptr are relevant for us, so I have simplified code further.

3007

Earlier we used to set hostcall_buffer here, but not anymore.
I have left the message in DP just for debug help.

openmp/libomptarget/plugins-nextgen/amdgpu/utils/UtilitiesRTL.h
38–46

Defining a struct for whole 256 byte of implicitargs in cov5 was becoming a little difficult due to different sizes of various fields (2, 4, 6, 8, 48, 72 bytes) along with multiple reserved fields in between. It made sense for cov4 because it only had 7 fields of 8 bytes each, where we needed only 4th field in OpenMP runtime (for hostcall_buffer).

Offset based lookups like the following allows handling/exposing only required fields across generations of ABI.

jhuber6 added inline comments.Aug 7 2023, 6:23 AM
clang/lib/CodeGen/CGBuiltin.cpp
17106
17146–17147

You should be able to factor out

LD = CGF.Builder.CreateLoad(
    Address(Result, CGF.Int16Ty, CharUnits::fromQuantity(2)));

from both by making each assign the Result to a value.

openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
1752

Do we really need that? We only use a fraction of the existing implicit arguments. My understanding is that most of these are more for runtime handling for HIP and OpenCL while we would most likely want our own solution. I'm assuming that the 128KB is not required for anything we use?

2556–2557

This and below isn't correct. You can't discard an llvm::Error value like this without either doing consumeError(std::move(Err)) or toString(std::move(Err)). However, you don't need to consume these in the first place, they already contain the error message from the callee and should just be forwarded.

openmp/libomptarget/plugins-nextgen/amdgpu/utils/UtilitiesRTL.h
38–46

If we don't use it, just put it as unused. It's really hard to read as-is and it makes it more difficult to just zero fill.

yaxunl added a comment.Aug 7 2023, 9:33 AM

need a lit test for the codegen of the clang builtin for cov 4/5/none and a lit test to show the branching code generated with cov none can be optimized away when linked with cov4 or cov5.

clang/lib/CodeGen/Targets/AMDGPU.cpp
389

I am not sure weak_odr linkage will work when code object version is none. This will cause conflict when a module emitted with cov none is linked with a module emitted with cov4 or cov5. Also, when all modules are emitted with cov none, we end up with a linked module with cov none and the work group size code will not work.

Probably we need to emit llvm.amdgcn.abi.version with external linkage for cov none.

Another issue is that llvm.amdgcn.abi.version is not internalized. It is always loaded from memory even though it is in constant address space. This will cause bad performance. Considering device libs may use clang builtin for workgroup size. The performance impact may be significant. To avoid performance degradation, we need to internalize it as early as possible in the optimization pipeline.

yaxunl added a comment.Aug 7 2023, 9:36 AM

I would suggest separating the clang/llvm part into a separate review.

arsenm added inline comments.Aug 7 2023, 2:12 PM
clang/lib/CodeGen/CGBuiltin.cpp
17111–17127

Move down to define and initialize

17131–17133

You could write all of this in terms of selects and avoid introducing all these blocks

clang/lib/CodeGen/Targets/AMDGPU.cpp
364

Don't need this?

saiislam updated this revision to Diff 551266.Aug 17 2023, 2:16 PM
saiislam marked 6 inline comments as done.

Updated the patch as per reviewers comments.

clang/lib/CodeGen/CGBuiltin.cpp
17111–17127

There are multiple uses of the same identifier. Defining them four times looks odd.

openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
1752

I have removed the preallocatedheap work from this patch.

2556–2557

Removed the logic for preallocatedheap.

openmp/libomptarget/plugins-nextgen/amdgpu/utils/UtilitiesRTL.h
38–46

I have reduced the fields to bare minimum required for OpenMP.

Some nits. I'm assuming we're getting the code object in the backend now? We'll need to make sure that -Wl,--amdhsa-code-object-version is passed to the clang invocation inside of the clang-linker-wrapper to handle -save-temps mode.

clang/lib/CodeGen/CGBuiltin.cpp
17109
clang/lib/Driver/ToolChain.cpp
1363

Random whitespace.

clang/test/CodeGenCUDA/amdgpu-code-object-version-linking.cu
97

Need newline

openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
3007

Don't think this needs to be a debug message, same below

openmp/libomptarget/plugins-nextgen/amdgpu/utils/UtilitiesRTL.h
38–46

I'm still not a fan of replacing the struct. The mnemonic of having a struct is much more user friendly.

ImplicitArgsTy Args{};
std::memset(&Args, sizeof(ImplicitArgsTy), 0);
...

If we don't use something, just make it some random bytes, e.g.

struct ImplicitArgsTy {
  uint64_t OffsetX;
  uint8_t Unused[64]; // 64 byte offset.
};
saiislam updated this revision to Diff 551597.Aug 18 2023, 12:05 PM
saiislam marked 4 inline comments as done.

Changed ImplitArgs implementation using struct.

Some nits. I'm assuming we're getting the code object in the backend now? We'll need to make sure that -Wl,--amdhsa-code-object-version is passed to the clang invocation inside of the clang-linker-wrapper to handle -save-temps mode.

Clang-linker-wrapper was not passing -mllvm option to the clang backend.

openmp/libomptarget/plugins-nextgen/amdgpu/utils/UtilitiesRTL.h
38–46

Replaced.

arsenm added inline comments.Aug 18 2023, 1:05 PM
clang/lib/CodeGen/CGBuiltin.cpp
17113

Spell out to DispatchPtr?

clang/lib/CodeGen/CodeGenModule.cpp
1206–1208

These could be one combined hook? this isn't really different from metadata

clang/lib/CodeGen/Targets/AMDGPU.cpp
369–386

You moved GetOrCreateLLVMGlobal but don't use it?

The lamdba is unnecessary for a single local use

clang/lib/Driver/ToolChain.cpp
1368–1371

Capitalize

1371

Don't understand why this is necessary

arsenm added inline comments.Aug 18 2023, 1:07 PM
clang/test/CodeGenCUDA/amdgpu-code-object-version-linking.cu
41–44

test all the builtins?

saiislam marked 4 inline comments as done.Aug 21 2023, 11:25 AM
saiislam added inline comments.
clang/lib/CodeGen/Targets/AMDGPU.cpp
369–386

I am using GetOrCreateLLVMGlobal in CGBuiltin.cpp while emitting code for amdgpu_worgroup_size.

369–386

I was hoping that this patch will pave way for D130096, so that it can generate rest of the control constants using the same lambda.
I can remove this and simplify the code if you want.

389

I tried external linkage but it didn't work. Only weak_odr is working fine.

clang/lib/Driver/ToolChain.cpp
1371

This function creates a derived argument list for OpenMP target specific flags.
mcode-object-version remains unset for device compilation step if we don't pass it here.

saiislam updated this revision to Diff 552085.Aug 21 2023, 11:26 AM
saiislam marked an inline comment as done.

Adressed reviewer's comments.

saiislam marked 3 inline comments as done.Aug 21 2023, 11:28 AM
arsenm added inline comments.Aug 21 2023, 1:11 PM
clang/lib/CodeGen/CGBuiltin.cpp
17123

Capitalization is weird, IsCOV5?

17135–17136

CreateConstInBoundsGEP1_64

17153

CreateConstInBoundsGEP1_64

clang/lib/CodeGen/Targets/AMDGPU.cpp
364

Single use lamdba, just make this the function body

381

No real point setting the alignment

saiislam updated this revision to Diff 552344.Aug 22 2023, 7:03 AM
saiislam marked 5 inline comments as done.

Used CreateConstInBoundsGEP1_32 for emitting GEP statements. Changed lambda function to simple fucntion body for defining the global variable.

Codegen parts LGTM, questions with the driver parts

clang/lib/Driver/ToolChain.cpp
1368–1371

Typos

1369
clang/lib/Driver/ToolChains/Clang.cpp
8649–8650

so device rtl is linked once as a normal library?

8653–8654

Why do you need this? The code object version is supposed to come from a module flag. We should be getting rid of the command line argument for it

clang/tools/clang-linker-wrapper/ClangLinkerWrapper.cpp
406–410

Shouldn't need this?

417

Commented out code

yaxunl added inline comments.Aug 23 2023, 8:07 PM
clang/test/CodeGenCUDA/amdgpu-code-object-version-linking.cu
13

need to test using clang -cc1 with -O3 and -mlink-builtin-bitcode to link the device lib and verify the load of llvm.amdgcn.abi.version being eliminated after optimization.

I think currently it cannot do that since llvm.amdgcn.abi.version is not internalized by the internalization pass. This can cause some significant perf drops since loading is expensive. Need to tweak the function controlling what variables can be internalized for amdgpu so that this variable gets internalized, or having a generic way to tell that function which variables should be internalized, e.g. by adding a metadata amdgcn.internalize

saiislam updated this revision to Diff 553179.Aug 24 2023, 10:06 AM
saiislam marked 7 inline comments as done.

Updated test case to check internalization of newly inserted global variable.

clang/lib/Driver/ToolChains/Clang.cpp
8649–8650

No, this is command generation for clang-linker-wrapper. Since, devicertl is compiled only to get bitcode file (-c), it is never called.

8653–8654

During command generation for clang-linker-wrapper, it is required to check user's provided mcode-object-version=X so that amdhsa-code-object-version=X can be passed to the clang/lto backend.

getAmdhsaCodeObjectVersion() and getHsaAbiVersion() both still use the above command line argument to override user's choice of COV, instead of the module flag.

clang/test/CodeGenCUDA/amdgpu-code-object-version-linking.cu
13

load of llvm.amdgcn.abi.version is being eliminated with cc1, -O3, and mlink-builtin-bitcode of device lib.

clang/tools/clang-linker-wrapper/ClangLinkerWrapper.cpp
406–410

It is required so that when clang pass (not the lto backend) is called from clang-linker-wrapper due to -save-temps, user provided COV is correctly propagated.

jhuber6 added inline comments.Aug 24 2023, 10:06 AM
openmp/libomptarget/plugins-nextgen/amdgpu/utils/UtilitiesRTL.h
49–56

We should probably be using sizeof now that it's back to being a struct and keep the old struct definition.

saiislam marked an inline comment as done.Aug 24 2023, 10:12 AM
saiislam added inline comments.
openmp/libomptarget/plugins-nextgen/amdgpu/utils/UtilitiesRTL.h
49–56

AMDGPU plugin doesn't use any implicitarg for COV4, but it does so for COV5. So, we are not keeping two separate structures for implicitargs of COV4 and COV5.
If we use sizeof then it will always return 256 corresponding to COV5 (even for cov4, which should be 56). That's why we need this function.

jhuber6 added inline comments.Aug 24 2023, 10:15 AM
openmp/libomptarget/plugins-nextgen/amdgpu/utils/UtilitiesRTL.h
49–56

Yeah, I guess for COV4 the only thing that mattered was the size so that we could make sure it's all set to zero. We shouldn't use the enum value. It should be sizeof(ImplicitArgsTy) for COV5 and either hard-code it in the function for V4 or make a dummy struct.

saiislam updated this revision to Diff 553413.Aug 25 2023, 2:06 AM
saiislam marked 2 inline comments as done.

Changed getImplicitArgsSize to use sizeof.

Just a few more nits. I think it's looking fine but I haven't tested it. Anyone else?

clang/tools/clang-linker-wrapper/ClangLinkerWrapper.cpp
406
415–418

No braces around a single line if.

openmp/libomptarget/plugins-nextgen/amdgpu/utils/UtilitiesRTL.h
54

We return uint16_t here? These are sizes.

saiislam updated this revision to Diff 553991.Aug 28 2023, 10:44 AM
saiislam marked 3 inline comments as done.

Minor fixes addressing reviewer's comment.

jhuber6 accepted this revision.Aug 28 2023, 12:16 PM

I think it's fine now given that it's passing tests. Others feel free to comment.

This revision is now accepted and ready to land.Aug 28 2023, 12:16 PM
yaxunl accepted this revision.Aug 28 2023, 2:30 PM

LGTM. Thanks

clang/test/CodeGenCUDA/amdgpu-code-object-version-linking.cu
13

It seems being eliminated by IPSCCP. It makes sense since it is constant weak_odr without externally_initialized. Either changing it to weak or adding externally_initialized will keep the load. Normal __constant__ var in device code may be changed by host code, therefore they are emitted with externally_initialized and do not have the load eliminated.

This revision was landed with ongoing or failed builds.Aug 29 2023, 4:36 AM
This revision was automatically updated to reflect the committed changes.
saiislam added inline comments.Aug 29 2023, 4:40 AM
clang/test/CodeGenCUDA/amdgpu-code-object-version-linking.cu
13

Thank you @yaxunl !
I have added these observations as comments in the code at load emit and global emit locations.