This is an archive of the discontinued LLVM Phabricator instance.

Differential D128914

[HIP] Add support for handling HIP in the linker wrapper
ClosedPublic

Authored by jhuber6 on Jun 30 2022, 7:30 AM.

Details

Reviewers

jdoerfert
JonChesterfield
yaxunl
tra

Commits

rGce091eb3b91f: [HIP] Add support for handling HIP in the linker wrapper

Summary

This patch adds the necessary changes required to bundle and wrap HIP
files. The bundling is done using clang-offload-bundler currently to
mimic fatbinary and the wrapping is done using very similar runtime
calls to CUDA. This still does not support managed / surface / texture
variables; that would require some additional information in the entry.

One difference in the code generation for AMD is that I don't check whether
the handle is null before destructing it; I'm not sure if that check is
required.

With this we should be able to support HIP with the new driver.
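
To make the wrapping flow in the summary concrete, here is a minimal sketch of the register / unregister lifecycle that the generated module follows. `MockRuntime` and `RegistrationModule` are hypothetical stand-ins invented for illustration; the real wrapper emits LLVM IR that calls `__hipRegisterFatBinary` and `__hipUnregisterFatBinary` and schedules the unregister step with `atexit`, which this sketch simply calls by hand.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for the HIP runtime. It only records which calls
// were made so the ordering can be inspected.
struct MockRuntime {
  std::vector<std::string> Log;
  int DummyHandleStorage = 0;

  void *registerFatBinary(const void *Wrapper) {
    assert(Wrapper != nullptr);
    Log.push_back("register");
    return &DummyHandleStorage;
  }
  void unregisterFatBinary(void *Handle) {
    (void)Handle; // The mock does not inspect the handle.
    Log.push_back("unregister");
  }
};

// Mirrors the shape of the generated registration module: one private
// handle, one constructor-like function, one destructor-like function.
class RegistrationModule {
  MockRuntime &RT;
  void *BinaryHandle = nullptr;

public:
  explicit RegistrationModule(MockRuntime &Runtime) : RT(Runtime) {}

  void fatbinReg(const void *Wrapper) {
    BinaryHandle = RT.registerFatBinary(Wrapper);
    RT.Log.push_back("register-globals"); // Kernels and variables go here.
  }
  // Like the patch, this does not check the handle for null.
  void fatbinUnreg() { RT.unregisterFatBinary(BinaryHandle); }
};

// Runs one full lifecycle and returns the recorded call order.
std::vector<std::string> runLifecycle() {
  MockRuntime RT;
  RegistrationModule M(RT);
  int FakeWrapper = 0;
  M.fatbinReg(&FakeWrapper);
  M.fatbinUnreg();
  return RT.Log;
}
```

Calling `runLifecycle()` yields the three steps in order: register the fatbinary, register the globals, then unregister.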

Depends on D128850

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

jhuber6 created this revision. Jun 30 2022, 7:30 AM

Herald added a project: Restricted Project. Jun 30 2022, 7:30 AM

jhuber6 requested review of this revision. Jun 30 2022, 7:30 AM

Herald added subscribers: cfe-commits, sstefan1.

Harbormaster completed remote builds in B173035: Diff 441392. Jun 30 2022, 8:00 AM

Syntax/style looks OK to me with a few nits.

clang/test/Driver/linker-wrapper.c
120	Nit: This test case does not have any CHECK lines and could use a comment describing what it's supposed to test. AFAICT it's intended to make sure that no temporary files are left around, but I'm not 100% sure.
clang/tools/clang-linker-wrapper/ClangLinkerWrapper.cpp
614–616	We probably do not want to hardcode the assumption that the host is x86_64 linux. Bundle alignment should probably also be target-dependent, but 4K is common enough and is probably fine in practice.
1218	I'd move it to the end where the buffer is actually used.
clang/tools/clang-linker-wrapper/OffloadWrapper.cpp
393	We should probably have a helper function returning properly prefixed name, similar to what we do in clang: https://github.com/llvm/llvm-project/blob/main/clang/lib/CodeGen/CGCUDANV.cpp#L184
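
As an illustration of the 614–616 comment above, the `-targets=` argument could be assembled from a host triple passed in as a parameter instead of a hardcoded string. This is only a sketch: `buildTargetsArg` is a hypothetical helper, not part of the patch, and whether `clang-offload-bundler` accepts an arbitrary host triple in this position has not been verified here.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: builds the bundler's "-targets=" argument from a host
// triple and a list of AMDGPU architectures. The "hipv4-amdgcn-amd-amdhsa--"
// prefix is the one used in the patch.
std::string buildTargetsArg(const std::string &HostTriple,
                            const std::vector<std::string> &Archs) {
  assert(!HostTriple.empty() && "a host entry is expected by the bundler");
  std::string Arg = "-targets=host-" + HostTriple;
  for (const std::string &Arch : Archs)
    Arg += ",hipv4-amdgcn-amd-amdhsa--" + Arch;
  return Arg;
}
```

With `"x86_64-unknown-linux"` and the architectures `gfx908` and `gfx90a`, this reproduces the target list that the HIP test in this revision checks for.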

Thanks for the comments.

clang/test/Driver/linker-wrapper.c
120	Yes, it ensures that the files extracted from the static library are not leftover as temp files, this was a problem previously that I fixed. I'll add a comment explaining that.
clang/tools/clang-linker-wrapper/ClangLinkerWrapper.cpp
614–616	This is exactly the way it is in the Clang source for HIP. HIP uses the `clang-offload-bundler` which expects a host file and host triple, ergo the dummy triple and input from `/dev/null`. This obviously isn't great, maybe in the future I'll be able to convince the AMD folks to use my format instead.
1218	Sure, I'll do that for the others as well.
clang/tools/clang-linker-wrapper/OffloadWrapper.cpp
393	I had that thought, but unless I wanted to use regular expressions it would be a little weird since there are many different forms here, e.g. `__cuda`, `.cuda` and `_cuda`. I figured it was easier to just make two strings rather than carry around three different functions to handle these cases, or introduce some weird regex.
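
For reference, the helper debated in the 393 thread could look roughly like the sketch below, with one function covering the three prefix forms mentioned above. `NameStyle` and `prefixedName` are hypothetical names made up for this example; the patch itself keeps plain strings instead.

```cpp
#include <cassert>
#include <string>

// The three prefix forms mentioned in the discussion: "__cuda", ".cuda", "_cuda"
// (and their "hip" equivalents).
enum class NameStyle { DoubleUnderscore, Dot, SingleUnderscore };

// Hypothetical helper: returns Suffix prefixed with the language name in the
// requested style, e.g. ("hip", DoubleUnderscore, "RegisterFatBinary")
// gives "__hipRegisterFatBinary".
std::string prefixedName(const std::string &Lang, NameStyle Style,
                         const std::string &Suffix) {
  assert((Lang == "cuda" || Lang == "hip") && "only CUDA and HIP are modeled");
  switch (Style) {
  case NameStyle::DoubleUnderscore:
    return "__" + Lang + Suffix;
  case NameStyle::Dot:
    return "." + Lang + Suffix;
  case NameStyle::SingleUnderscore:
    return "_" + Lang + Suffix;
  }
  return Suffix; // Not reached for valid enum values.
}
```

The trade-off raised in the thread still applies: every call site has to pick a style, which is not obviously simpler than spelling out two strings.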

Addressing some comments.

Harbormaster completed remote builds in B173738: Diff 442356. Jul 5 2022, 10:46 AM

ping

jhuber6 mentioned this in D129301: [clang-offload-bundler][NFC] Library-ize ClangOffloadBundler (1/4). Jul 7 2022, 7:14 PM

Code looks good to me. It's hard to be sure whether it works without running a bunch of hip test cases through it, have you already done so? If it doesn't work out of the box it should be close enough to fix up post commit, e.g. when trying to move hip over to this by default.

This revision is now accepted and ready to land. Jul 11 2022, 8:29 AM

In D128914#3642558, @JonChesterfield wrote:

Code looks good to me. It's hard to be sure whether it works without running a bunch of hip test cases through it, have you already done so? If it doesn't work out of the box it should be close enough to fix up post commit, e.g. when trying to move hip over to this by default.

Thanks for the review, I ran a couple mini-apps with HIP versions (XSBench, RSBench, SU3Bench) using this method and they passed without issue. The only thing I was unsure about was whether or not the handle needed to be checked for null, because my testing suggested it's unnecessary. I was hoping one of the HIP developers would let me know. We can think about making this the default approach when I make the new driver work for non-rdc mode compilations.

In D128914#3642567, @jhuber6 wrote:

Thanks for the review, I ran a couple mini-apps with HIP versions (XSBench, RSBench, SU3Bench) using this method and they passed without issue. The only thing I was unsure about was whether or not the handle needed to be checked for null, because my testing suggested it's unnecessary. I was hoping one of the HIP developers would let me know. We can think about making this the default approach when I make the new driver work for non-rdc mode compilations.

There is only one fatbin for -fgpu-rdc mode but the fatbin unregister function is called multiple times in each TU. The HIP runtime expects each fatbin to be unregistered only once. The old embedding scheme introduced a weak symbol to track whether the fatbin has been unregistered and to make sure it is only unregistered once.
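
The "unregister only once" requirement can be illustrated with a small guard: check the handle, call the runtime, then clear the handle so later calls become no-ops. This is a sketch of the general idea only, not the old embedding scheme's actual code (which tracks the state through a weak symbol); `FatbinState`, `unregisterOnce` and `simulateTranslationUnits` are hypothetical names.

```cpp
#include <cassert>

// Hypothetical shared state: the fatbin handle plus a counter that stands in
// for calls to the runtime's unregister function.
struct FatbinState {
  void *Handle = nullptr;
  int UnregisterCalls = 0;
};

// Unregisters the fatbin at most once, no matter how often it is called.
inline void unregisterOnce(FatbinState &State) {
  if (State.Handle == nullptr)
    return;                // Already unregistered (or never registered).
  ++State.UnregisterCalls; // Stand-in for the real unregister call.
  State.Handle = nullptr;  // Later callers will see a null handle.
}

// Simulates NumTUs translation units each running their own destructor
// against the same shared state; returns how many real unregisters happened.
int simulateTranslationUnits(int NumTUs) {
  assert(NumTUs >= 0);
  int DummyHandleStorage = 0;
  FatbinState State;
  State.Handle = &DummyHandleStorage;
  for (int I = 0; I < NumTUs; ++I)
    unregisterOnce(State);
  return State.UnregisterCalls;
}
```

However many destructors run, the counter stops at one, which is the property the runtime is said to expect.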

In D128914#3642869, @yaxunl wrote:

There is only one fatbin for -fgpu-rdc mode but the fatbin unregister function is called multiple times in each TU. The HIP runtime expects each fatbin to be unregistered only once. The old embedding scheme introduced a weak symbol to track whether the fatbin has been unregistered and to make sure it is only unregistered once.

I see, this wrapping will only happen in RDC-mode so it's probably safe to ignore here? When I support non-RDC mode in the new driver it will most likely rely on the old code generation. Although it's entirely feasible to make RDC-mode the default. There's no runtime overhead when using LTO.

This revision was landed with ongoing or failed builds. Jul 11 2022, 12:49 PM

Closed by commit rGce091eb3b91f: [HIP] Add support for handling HIP in the linker wrapper (authored by jhuber6).

This revision was automatically updated to reflect the committed changes.

jhuber6 added a commit: rGce091eb3b91f: [HIP] Add support for handling HIP in the linker wrapper.

In D128914#3643270, @jhuber6 wrote:

There is only one fatbin for -fgpu-rdc mode but the fatbin unregister function is called multiple times in each TU. The HIP runtime expects each fatbin to be unregistered only once. The old embedding scheme introduced a weak symbol to track whether the fatbin has been unregistered and to make sure it is only unregistered once.

I see, this wrapping will only happen in RDC-mode so it's probably safe to ignore here? When I support non-RDC mode in the new driver it will most likely rely on the old code generation. Although it's entirely feasible to make RDC-mode the default. There's no runtime overhead when using LTO.

If you only unregister the fatbin once for the whole program, then it should be safe for -fgpu-rdc. I am not sure if that is the case.

My experience with -fgpu-rdc is that it causes much longer linking time for large applications like PyTorch or TensorFlow, and LTO does not help. This is because the compiler has lots of inter-procedural optimization passes which take more than linear time. Due to that, those apps need to be compiled with -fno-gpu-rdc. Actually, most CUDA/HIP applications are using -fno-gpu-rdc.

In D128914#3643451, @yaxunl wrote:

If you only unregister the fatbin once for the whole program, then it should be safe for -fgpu-rdc. I am not sure if that is the case.

It should be here: the generated handle is private to the registration module we created. We only make one, and it's impossible for anyone else to touch it, even when mixing RDC with non-RDC code.

My experience with -fgpu-rdc is that it causes much longer linking time for large applications like PyTorch or TensorFlow, and LTO does not help. This is because the compiler has lots of inter-procedural optimization passes which take more than linear time. Due to that, those apps need to be compiled with -fno-gpu-rdc. Actually, most CUDA/HIP applications are using -fno-gpu-rdc.

Yes, it's actually pretty difficult to find a CUDA application using -fgpu-rdc. It seems much more common to just stick everything that's needed in the file. I've considered finding a CUDA / HIP benchmark suite and comparing compile times using the new driver stuff. The benefit of having -fgpu-rdc be the default is that device code basically behaves exactly like host code, and LTO makes -fgpu-rdc behave like -fno-gpu-rdc performance-wise. The downside, as you mentioned, is compile time.

In D128914#3643495, @jhuber6 wrote:

Yes, it's actually pretty difficult to find a CUDA application using -fgpu-rdc. It seems much more common to just stick everything that's needed in the file. I've considered finding a CUDA / HIP benchmark suite and comparing compile times using the new driver stuff. The benefit of having -fgpu-rdc be the default is that device code basically behaves exactly like host code, and LTO makes -fgpu-rdc behave like -fno-gpu-rdc performance-wise. The downside, as you mentioned, is compile time.

For what it's worth, NCCL is the only nontrivial library that needs RDC compilation that I'm aware of.
It's also self-contained for RDC purposes: we only need to use RDC on the library TUs and do not need to propagate it to all CUDA TUs in the build.

I believe such 'constrained' RDC compilation will likely be the reasonable practical trade-off. It may not become the default compilation mode, but we should be able to control where the "fully linked GPU executable" boundary is and it's not necessarily going to match the fully-linked host executable.

In D128914#3643802, @tra wrote:

For what it's worth, NCCL is the only nontrivial library that needs RDC compilation that I'm aware of.
It's also self-contained for RDC purposes: we only need to use RDC on the library TUs and do not need to propagate it to all CUDA TUs in the build.

I believe such 'constrained' RDC compilation will likely be the reasonable practical trade-off. It may not become the default compilation mode, but we should be able to control where the "fully linked GPU executable" boundary is and it's not necessarily going to match the fully-linked host executable.

Theoretically we could do this with a relocatable link using the linker-wrapper. The only problem with this approach is the __start/__stop linker-defined variables that we use to iterate the globals to be registered, as these are tied to the section specifically. Potentially, we could move these to a unique section so they don't interfere with anything. So it would be something like this:

clang-linker-wrapper -r a.o b.o c.o -o registered.o // Contains RTL calls to register all globals at section 'cuda_offloading_entries_<ID>'
llvm-strip --remove-section .llvm.offloading registered.o // Remove embedded IR so no other files will link against it
llvm-objcopy --rename-section cuda_offloading_entries=cuda_offloading_entries_<ID> registered.o // Change the registration section to something unique

Think this would work?
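
For readers unfamiliar with the `__start`/`__stop` mechanism mentioned above, the following sketch shows how entries placed in a named section can be iterated through the linker-defined boundary symbols. It assumes an ELF target with a GNU-compatible linker (it will not link on Mach-O or COFF), and the section name `demo_offloading_entries`, the `Entry` type and `countEntries` are all hypothetical. Because the symbol names are derived from the section name, renaming the section as proposed above also changes which symbols bound the loop.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical, simplified entry type; the real offload entry has more fields.
struct Entry {
  const char *Name;
  std::size_t Size;
};

// Place two entries in a custom section. The section name must be a valid C
// identifier for the linker to synthesize __start_/__stop_ symbols for it.
__attribute__((used, section("demo_offloading_entries")))
static const Entry KernelEntry = {"kernel", 0};
__attribute__((used, section("demo_offloading_entries")))
static const Entry GlobalEntry = {"global_var", 4};

// Provided by the linker: the first entry and one past the last entry.
extern "C" const Entry __start_demo_offloading_entries[];
extern "C" const Entry __stop_demo_offloading_entries[];

// Walks the section from start to stop and counts the entries found.
std::size_t countEntries() {
  std::size_t Count = 0;
  for (const Entry *E = __start_demo_offloading_entries;
       E != __stop_demo_offloading_entries; ++E) {
    assert(E->Name != nullptr);
    ++Count;
  }
  return Count;
}
```

In a real build, entries contributed by other object files would land in the same section and be picked up by the same loop, which is why a relocatable link needs a unique section name to keep its entries separate.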

This breaks check-clang on mac: http://45.33.8.238/macm1/39907/step_7.txt

Please take a look and revert for now if it takes a while to fix.

In D128914#3644022, @thakis wrote:

This breaks check-clang on mac: http://45.33.8.238/macm1/39907/step_7.txt

Please take a look and revert for now if it takes a while to fix.

I changed some of the argument formats in a previous patch, probably messed up the rebase. I'll try to land a patch real quick.

In D128914#3644022, @thakis wrote:

This breaks check-clang on mac: http://45.33.8.238/macm1/39907/step_7.txt

Please take a look and revert for now if it takes a while to fix.

Let me know if rGfe6a391357fc resolved the issue.

Bot's happy again. Thanks for the quick fix!

Revision Contents

Path	Size
clang/test/Driver/linker-wrapper-image.c	82 lines
clang/test/Driver/linker-wrapper.c	14 lines
clang/tools/clang-linker-wrapper/ClangLinkerWrapper.cpp	76 lines
clang/tools/clang-linker-wrapper/OffloadWrapper.h	4 lines
clang/tools/clang-linker-wrapper/OffloadWrapper.cpp	95 lines

Diff 443724

clang/test/Driver/linker-wrapper-image.c

	// CUDA-NEXT: %name = load ptr, ptr %2, align 8			// CUDA-NEXT: %name = load ptr, ptr %2, align 8
	// CUDA-NEXT: %3 = getelementptr inbounds %__tgt_offload_entry, ptr %entry1, i64 0, i32 2			// CUDA-NEXT: %3 = getelementptr inbounds %__tgt_offload_entry, ptr %entry1, i64 0, i32 2
	// CUDA-NEXT: %size = load i64, ptr %3, align 4			// CUDA-NEXT: %size = load i64, ptr %3, align 4
	// CUDA-NEXT: %4 = getelementptr inbounds %__tgt_offload_entry, ptr %entry1, i64 0, i32 3			// CUDA-NEXT: %4 = getelementptr inbounds %__tgt_offload_entry, ptr %entry1, i64 0, i32 3
	// CUDA-NEXT: %flag = load i32, ptr %4, align 4			// CUDA-NEXT: %flag = load i32, ptr %4, align 4
	// CUDA-NEXT: %5 = icmp eq i64 %size, 0			// CUDA-NEXT: %5 = icmp eq i64 %size, 0
	// CUDA-NEXT: br i1 %5, label %if.then, label %if.else			// CUDA-NEXT: br i1 %5, label %if.then, label %if.else


	// CUDA: if.then:			// CUDA: if.then:
	// CUDA-NEXT: %6 = call i32 @__cudaRegisterFunction(ptr %0, ptr %addr, ptr %name, ptr %name, i32 -1, ptr null, ptr null, ptr null, ptr null, ptr null)			// CUDA-NEXT: %6 = call i32 @__cudaRegisterFunction(ptr %0, ptr %addr, ptr %name, ptr %name, i32 -1, ptr null, ptr null, ptr null, ptr null, ptr null)
	// CUDA-NEXT: br label %if.end			// CUDA-NEXT: br label %if.end

	// CUDA: if.else:			// CUDA: if.else:
	// CUDA-NEXT: switch i32 %flag, label %if.end [			// CUDA-NEXT: switch i32 %flag, label %if.end [
	// CUDA-NEXT: i32 0, label %sw.global			// CUDA-NEXT: i32 0, label %sw.global
	// CUDA-NEXT: i32 1, label %sw.managed			// CUDA-NEXT: i32 1, label %sw.managed
	[...]
	// CUDA: if.end:			// CUDA: if.end:
	// CUDA-NEXT: %7 = getelementptr inbounds %__tgt_offload_entry, ptr %entry1, i64 1			// CUDA-NEXT: %7 = getelementptr inbounds %__tgt_offload_entry, ptr %entry1, i64 1
	// CUDA-NEXT: %8 = icmp eq ptr %7, @__stop_cuda_offloading_entries			// CUDA-NEXT: %8 = icmp eq ptr %7, @__stop_cuda_offloading_entries
	// CUDA-NEXT: br i1 %8, label %while.end, label %while.entry			// CUDA-NEXT: br i1 %8, label %while.end, label %while.entry

	// CUDA: while.end:			// CUDA: while.end:
	// CUDA-NEXT: ret void			// CUDA-NEXT: ret void
	// CUDA-NEXT: }			// CUDA-NEXT: }

				// RUN: clang-offload-packager -o %t.out --image=file=%S/Inputs/dummy-elf.o,kind=hip,triple=amdgcn-amd-amdhsa,arch=gfx908
				// RUN: %clang -cc1 %s -triple x86_64-unknown-linux-gnu -emit-obj -o %t.o \
				// RUN: -fembed-offload-object=%t.out
				// RUN: clang-linker-wrapper --print-wrapped-module --dry-run --host-triple x86_64-unknown-linux-gnu \
				// RUN: -linker-path /usr/bin/ld -- %t.o -o a.out 2>&1 | FileCheck %s --check-prefix=HIP

				// HIP: @.fatbin_image = internal constant [0 x i8] zeroinitializer, section ".hip_fatbin"
				// HIP-NEXT: @.fatbin_wrapper = internal constant %fatbin_wrapper { i32 1212764230, i32 1, ptr @.fatbin_image, ptr null }, section ".hipFatBinSegment", align 8
				// HIP-NEXT: @__dummy.hip_offloading.entry = hidden constant [0 x %__tgt_offload_entry] zeroinitializer, section "hip_offloading_entries"
				// HIP-NEXT: @.hip.binary_handle = internal global ptr null
				// HIP-NEXT: @__start_hip_offloading_entries = external hidden constant [0 x %__tgt_offload_entry]
				// HIP-NEXT: @__stop_hip_offloading_entries = external hidden constant [0 x %__tgt_offload_entry]
				// HIP-NEXT: @llvm.global_ctors = appending global [1 x { i32, ptr, ptr }] [{ i32, ptr, ptr } { i32 1, ptr @.hip.fatbin_reg, ptr null }]

				// HIP: define internal void @.hip.fatbin_reg() section ".text.startup" {
				// HIP-NEXT: entry:
				// HIP-NEXT: %0 = call ptr @__hipRegisterFatBinary(ptr @.fatbin_wrapper)
				// HIP-NEXT: store ptr %0, ptr @.hip.binary_handle, align 8
				// HIP-NEXT: call void @.hip.globals_reg(ptr %0)
				// HIP-NEXT: %1 = call i32 @atexit(ptr @.hip.fatbin_unreg)
				// HIP-NEXT: ret void
				// HIP-NEXT: }

				// HIP: define internal void @.hip.fatbin_unreg() section ".text.startup" {
				// HIP-NEXT: entry:
				// HIP-NEXT: %0 = load ptr, ptr @.hip.binary_handle, align 8
				// HIP-NEXT: call void @__hipUnregisterFatBinary(ptr %0)
				// HIP-NEXT: ret void
				// HIP-NEXT: }

				// HIP: define internal void @.hip.globals_reg(ptr %0) section ".text.startup" {
				// HIP-NEXT: entry:
				// HIP-NEXT: br i1 icmp ne (ptr @__start_hip_offloading_entries, ptr @__stop_hip_offloading_entries), label %while.entry, label %while.end

				// HIP: while.entry:
				// HIP-NEXT: %entry1 = phi ptr [ @__start_hip_offloading_entries, %entry ], [ %7, %if.end ]
				// HIP-NEXT: %1 = getelementptr inbounds %__tgt_offload_entry, ptr %entry1, i64 0, i32 0
				// HIP-NEXT: %addr = load ptr, ptr %1, align 8
				// HIP-NEXT: %2 = getelementptr inbounds %__tgt_offload_entry, ptr %entry1, i64 0, i32 1
				// HIP-NEXT: %name = load ptr, ptr %2, align 8
				// HIP-NEXT: %3 = getelementptr inbounds %__tgt_offload_entry, ptr %entry1, i64 0, i32 2
				// HIP-NEXT: %size = load i64, ptr %3, align 4
				// HIP-NEXT: %4 = getelementptr inbounds %__tgt_offload_entry, ptr %entry1, i64 0, i32 3
				// HIP-NEXT: %flag = load i32, ptr %4, align 4
				// HIP-NEXT: %5 = icmp eq i64 %size, 0
				// HIP-NEXT: br i1 %5, label %if.then, label %if.else

				// HIP: if.then:
				// HIP-NEXT: %6 = call i32 @__hipRegisterFunction(ptr %0, ptr %addr, ptr %name, ptr %name, i32 -1, ptr null, ptr null, ptr null, ptr null, ptr null)
				// HIP-NEXT: br label %if.end

				// HIP: if.else:
				// HIP-NEXT: switch i32 %flag, label %if.end [
				// HIP-NEXT: i32 0, label %sw.global
				// HIP-NEXT: i32 1, label %sw.managed
				// HIP-NEXT: i32 2, label %sw.surface
				// HIP-NEXT: i32 3, label %sw.texture
				// HIP-NEXT: ]

				// HIP: sw.global:
				// HIP-NEXT: call void @__hipRegisterVar(ptr %0, ptr %addr, ptr %name, ptr %name, i32 0, i64 %size, i32 0, i32 0)
				// HIP-NEXT: br label %if.end

				// HIP: sw.managed:
				// HIP-NEXT: br label %if.end

				// HIP: sw.surface:
				// HIP-NEXT: br label %if.end

				// HIP: sw.texture:
				// HIP-NEXT: br label %if.end

				// HIP: if.end:
				// HIP-NEXT: %7 = getelementptr inbounds %__tgt_offload_entry, ptr %entry1, i64 1
				// HIP-NEXT: %8 = icmp eq ptr %7, @__stop_hip_offloading_entries
				// HIP-NEXT: br i1 %8, label %while.end, label %while.entry

				// HIP: while.end:
				// HIP-NEXT: ret void
				// HIP-NEXT: }
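
The control flow that the HIP check lines above describe can be sketched in C++ as follows. This is a reading aid under stated assumptions, not the wrapper's implementation: `OffloadEntry` models only the four fields the IR loads (address, name, size, flag), and `registerGlobals` records what would be registered instead of calling `__hipRegisterFunction` or `__hipRegisterVar`.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical model of the four entry fields the checked IR reads.
struct OffloadEntry {
  void *Addr;
  const char *Name;
  std::uint64_t Size;
  std::int32_t Flags;
};

// Walks the entries and records which registration each one would trigger.
std::vector<std::string>
registerGlobals(const std::vector<OffloadEntry> &Entries) {
  std::vector<std::string> Calls;
  for (const OffloadEntry &E : Entries) {
    assert(E.Name != nullptr);
    if (E.Size == 0) {
      // A size of zero marks a kernel: the if.then block.
      Calls.push_back(std::string("function:") + E.Name);
      continue;
    }
    switch (E.Flags) {
    case 0: // sw.global: a plain device variable.
      Calls.push_back(std::string("var:") + E.Name);
      break;
    case 1: // sw.managed
    case 2: // sw.surface
    case 3: // sw.texture
      break; // Not supported yet; these blocks only branch to if.end.
    default:
      break;
    }
  }
  return Calls;
}
```

Managed, surface and texture entries fall through without registering anything, which matches the limitation stated in the summary.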

clang/test/Driver/linker-wrapper.c

	// RUN: clang-linker-wrapper --dry-run --host-triple=x86_64-unknown-linux-gnu \			// RUN: clang-linker-wrapper --dry-run --host-triple=x86_64-unknown-linux-gnu \
	// RUN: --linker-path=/usr/bin/ld -- %t.o -o a.out 2>&1 | FileCheck %s --check-prefix=CUDA			// RUN: --linker-path=/usr/bin/ld -- %t.o -o a.out 2>&1 | FileCheck %s --check-prefix=CUDA

	// CUDA: nvlink{{.*}}-m64 -o {{.*}}.out -arch sm_52 {{.*}}.o			// CUDA: nvlink{{.*}}-m64 -o {{.*}}.out -arch sm_52 {{.*}}.o
	// CUDA: nvlink{{.*}}-m64 -o {{.*}}.out -arch sm_70 {{.*}}.o {{.*}}.o			// CUDA: nvlink{{.*}}-m64 -o {{.*}}.out -arch sm_70 {{.*}}.o {{.*}}.o
	// CUDA: fatbinary{{.*}}-64 --create {{.*}}.fatbin --image=profile=sm_52,file={{.*}}.out --image=profile=sm_70,file={{.*}}.out			// CUDA: fatbinary{{.*}}-64 --create {{.*}}.fatbin --image=profile=sm_52,file={{.*}}.out --image=profile=sm_70,file={{.*}}.out

	// RUN: clang-offload-packager -o %t.out \			// RUN: clang-offload-packager -o %t.out \
				// RUN: --image=file=%S/Inputs/dummy-elf.o,kind=hip,triple=amdgcn-amd-amdhsa,arch=gfx90a \
				// RUN: --image=file=%S/Inputs/dummy-elf.o,kind=openmp,triple=amdgcn-amd-amdhsa,arch=gfx90a \
				// RUN: --image=file=%S/Inputs/dummy-elf.o,kind=hip,triple=amdgcn-amd-amdhsa,arch=gfx908
				// RUN: %clang -cc1 %s -triple x86_64-unknown-linux-gnu -emit-obj -o %t.o \
				// RUN: -fembed-offload-object=%t.out
				// RUN: clang-linker-wrapper --dry-run --host-triple x86_64-unknown-linux-gnu -linker-path \
				// RUN: /usr/bin/ld -- %t.o -o a.out 2>&1 | FileCheck %s --check-prefix=HIP

				// HIP: lld{{.*}}-flavor gnu --no-undefined -shared -plugin-opt=-amdgpu-internalize-symbols -plugin-opt=mcpu=gfx908 -o {{.*}}.out {{.*}}.o
				// HIP: lld{{.*}}-flavor gnu --no-undefined -shared -plugin-opt=-amdgpu-internalize-symbols -plugin-opt=mcpu=gfx90a -o {{.*}}.out {{.*}}.o
				// HIP: clang-offload-bundler{{.*}}-type=o -bundle-align=4096 -targets=host-x86_64-unknown-linux,hipv4-amdgcn-amd-amdhsa--gfx908,hipv4-amdgcn-amd-amdhsa--gfx90a -input=/dev/null -input={{.*}}.out -input={{.*}}out -output={{.*}}.hipfb

				// RUN: clang-offload-packager -o %t.out \
	// RUN: --image=file=%S/Inputs/dummy-elf.o,kind=openmp,triple=amdgcn-amd-amdhsa,arch=gfx908 \			// RUN: --image=file=%S/Inputs/dummy-elf.o,kind=openmp,triple=amdgcn-amd-amdhsa,arch=gfx908 \
	// RUN: --image=file=%S/Inputs/dummy-elf.o,kind=openmp,triple=nvptx64-nvidia-cuda,arch=sm_70			// RUN: --image=file=%S/Inputs/dummy-elf.o,kind=openmp,triple=nvptx64-nvidia-cuda,arch=sm_70
	// RUN: %clang -cc1 %s -triple x86_64-unknown-linux-gnu -emit-obj -o %t.o \			// RUN: %clang -cc1 %s -triple x86_64-unknown-linux-gnu -emit-obj -o %t.o \
	// RUN: -fembed-offload-object=%t.out			// RUN: -fembed-offload-object=%t.out
	// RUN: clang-linker-wrapper --dry-run --host-triple=x86_64-unknown-linux-gnu \			// RUN: clang-linker-wrapper --dry-run --host-triple=x86_64-unknown-linux-gnu \
	// RUN: --linker-path=/usr/bin/ld --device-linker=a --device-linker=nvptx64-nvidia-cuda=b -- \			// RUN: --linker-path=/usr/bin/ld --device-linker=a --device-linker=nvptx64-nvidia-cuda=b -- \
	// RUN: %t.o -o a.out 2>&1 | FileCheck %s --check-prefix=LINKER_ARGS			// RUN: %t.o -o a.out 2>&1 | FileCheck %s --check-prefix=LINKER_ARGS

	// LINKER_ARGS: lld{{.*}}-flavor gnu --no-undefined -shared -plugin-opt=-amdgpu-internalize-symbols -plugin-opt=mcpu=gfx908 -o {{.*}}.out {{.*}}.o a			// LINKER_ARGS: lld{{.*}}-flavor gnu --no-undefined -shared -plugin-opt=-amdgpu-internalize-symbols -plugin-opt=mcpu=gfx908 -o {{.*}}.out {{.*}}.o a
	// LINKER_ARGS: nvlink{{.*}}-m64 -o {{.*}}.out -arch sm_70 {{.*}}.o a b			// LINKER_ARGS: nvlink{{.*}}-m64 -o {{.*}}.out -arch sm_70 {{.*}}.o a b

				/// Ensure that temp files aren't left over from static libraries.
	// RUN: clang-offload-packager -o %t-lib.out \			// RUN: clang-offload-packager -o %t-lib.out \
				tra: Nit: This test case does not have any CHECK lines and could use a comment describing what it's supposed to test. AFAICT it's intended to make sure that no temporary files are left around, but I'm not 100% sure.
				jhuber6: Yes, it ensures that the files extracted from the static library are not leftover as temp files, this was a problem previously that I fixed. I'll add a comment explaining that.
	// RUN: --image=file=%S/Inputs/dummy-elf.o,kind=openmp,triple=nvptx64-nvidia-cuda,arch=sm_70 \			// RUN: --image=file=%S/Inputs/dummy-elf.o,kind=openmp,triple=nvptx64-nvidia-cuda,arch=sm_70 \
	// RUN: --image=file=%S/Inputs/dummy-elf.o,kind=cuda,triple=nvptx64-nvidia-cuda,arch=sm_52			// RUN: --image=file=%S/Inputs/dummy-elf.o,kind=cuda,triple=nvptx64-nvidia-cuda,arch=sm_52
	// RUN: %clang -cc1 %s -triple x86_64-unknown-linux-gnu -emit-obj -o %t.o -fembed-offload-object=%t-lib.out			// RUN: %clang -cc1 %s -triple x86_64-unknown-linux-gnu -emit-obj -o %t.o -fembed-offload-object=%t-lib.out
	// RUN: llvm-ar rcs %t.a %t.o			// RUN: llvm-ar rcs %t.a %t.o
	// RUN: rm -f %t.o			// RUN: rm -f %t.o
	// RUN: %clang -cc1 %s -triple x86_64-unknown-linux-gnu -emit-obj -o %t-obj.o			// RUN: %clang -cc1 %s -triple x86_64-unknown-linux-gnu -emit-obj -o %t-obj.o
	// RUN: clang-linker-wrapper --host-triple=x86_64-unknown-linux-gnu --dry-run -save-temps \			// RUN: clang-linker-wrapper --host-triple=x86_64-unknown-linux-gnu --dry-run -save-temps \
	// RUN: --linker-path=/usr/bin/ld -- %t.a %t-obj.o -o a.out			// RUN: --linker-path=/usr/bin/ld -- %t.a %t-obj.o -o a.out
	// RUN: not ls *-device-*			// RUN: not ls *-device-*

clang/tools/clang-linker-wrapper/ClangLinkerWrapper.cpp

[...]	Expected<StringRef> link(ArrayRef<StringRef> InputFiles, const ArgList &Args) {

for (StringRef Arg : Args.getAllArgValues(OPT_linker_arg_EQ))		for (StringRef Arg : Args.getAllArgValues(OPT_linker_arg_EQ))
CmdArgs.push_back(Args.MakeArgString(Arg));		CmdArgs.push_back(Args.MakeArgString(Arg));
if (Error Err = executeCommands(*LLDPath, CmdArgs))		if (Error Err = executeCommands(*LLDPath, CmdArgs))
return std::move(Err);		return std::move(Err);

return *TempFileOrErr;		return *TempFileOrErr;
}		}

		Expected<StringRef>
		fatbinary(ArrayRef<std::pair<StringRef, StringRef>> InputFiles,
		const ArgList &Args) {
		// AMDGPU uses the clang-offload-bundler to bundle the linked images.
		Expected<std::string> OffloadBundlerPath = findProgram(
		"clang-offload-bundler", {getMainExecutable("clang-offload-bundler")});
		if (!OffloadBundlerPath)
		return OffloadBundlerPath.takeError();

		llvm::Triple Triple(
		Args.getLastArgValue(OPT_host_triple_EQ, sys::getDefaultTargetTriple()));

		// Create a new file to write the linked device image to.
		auto TempFileOrErr = createOutputFile(sys::path::filename(ExecutableName) +
		"-device-" + Triple.getArchName(),
		"hipfb");
		if (!TempFileOrErr)
		return TempFileOrErr.takeError();

		BumpPtrAllocator Alloc;
		StringSaver Saver(Alloc);

		SmallVector<StringRef, 16> CmdArgs;
		CmdArgs.push_back(*OffloadBundlerPath);
		CmdArgs.push_back("-type=o");
		CmdArgs.push_back("-bundle-align=4096");
		tra: We probably do not want to hardcode the assumption that the host is x86_64 linux. Bundle alignment should probably also be target-dependent, but 4K is common enough and is probably fine in practice.
		jhuber6: This is exactly the way it is in the Clang source for HIP. HIP uses the `clang-offload-bundler` which expects a host file and host triple, ergo the dummy triple and input from `/dev/null`. This obviously isn't great, maybe in the future I'll be able to convince the AMD folks to use my format instead.

		SmallVector<StringRef> Targets = {"-targets=host-x86_64-unknown-linux"};
		for (const auto &FileAndArch : InputFiles)
		Targets.push_back(
		Saver.save("hipv4-amdgcn-amd-amdhsa--" + std::get<1>(FileAndArch)));
		CmdArgs.push_back(Saver.save(llvm::join(Targets, ",")));

		CmdArgs.push_back("-input=/dev/null");
		for (const auto &FileAndArch : InputFiles)
		CmdArgs.push_back(Saver.save("-input=" + std::get<0>(FileAndArch)));

		CmdArgs.push_back(Saver.save("-output=" + *TempFileOrErr));

		if (Error Err = executeCommands(*OffloadBundlerPath, CmdArgs))
		return std::move(Err);

		return *TempFileOrErr;
		}
} // namespace amdgcn		} // namespace amdgcn

namespace generic {		namespace generic {

const char *getLDMOption(const llvm::Triple &T) {		const char *getLDMOption(const llvm::Triple &T) {
switch (T.getArch()) {		switch (T.getArch()) {
case llvm::Triple::x86:		case llvm::Triple::x86:
if (T.isOSIAMCU())		if (T.isOSIAMCU())
[...]	wrapDeviceImages(ArrayRef<std::unique_ptr<MemoryBuffer>> Buffers,
case OFK_OpenMP:		case OFK_OpenMP:
if (Error Err = wrapOpenMPBinaries(M, BuffersToWrap))		if (Error Err = wrapOpenMPBinaries(M, BuffersToWrap))
return std::move(Err);		return std::move(Err);
break;		break;
case OFK_Cuda:		case OFK_Cuda:
if (Error Err = wrapCudaBinary(M, BuffersToWrap.front()))		if (Error Err = wrapCudaBinary(M, BuffersToWrap.front()))
return std::move(Err);		return std::move(Err);
break;		break;
		case OFK_HIP:
		if (Error Err = wrapHIPBinary(M, BuffersToWrap.front()))
		return std::move(Err);
		break;
default:		default:
return createStringError(inconvertibleErrorCode(),		return createStringError(inconvertibleErrorCode(),
getOffloadKindName(Kind) +		getOffloadKindName(Kind) +
" wrapping is not supported");		" wrapping is not supported");
}		}

if (Args.hasArg(OPT_print_wrapped_module))		if (Args.hasArg(OPT_print_wrapped_module))
errs() << M;		errs() << M;
[...]	for (const OffloadingImage &Image : Images)
Buffers.emplace_back(		Buffers.emplace_back(
MemoryBuffer::getMemBufferCopy(Image.Image->getBuffer()));		MemoryBuffer::getMemBufferCopy(Image.Image->getBuffer()));

return std::move(Buffers);		return std::move(Buffers);
}		}

Expected<SmallVector<std::unique_ptr<MemoryBuffer>>>		Expected<SmallVector<std::unique_ptr<MemoryBuffer>>>
bundleCuda(ArrayRef<OffloadingImage> Images, const ArgList &Args) {		bundleCuda(ArrayRef<OffloadingImage> Images, const ArgList &Args) {
		SmallVector<std::pair<StringRef, StringRef>, 4> InputFiles;
		for (const OffloadingImage &Image : Images)
		InputFiles.emplace_back(std::make_pair(Image.Image->getBufferIdentifier(),
		Image.StringData.lookup("arch")));

		Triple TheTriple = Triple(Images.front().StringData.lookup("triple"));
		auto FileOrErr = nvptx::fatbinary(InputFiles, Args);
		if (!FileOrErr)
		return FileOrErr.takeError();

		llvm::ErrorOr<std::unique_ptr<llvm::MemoryBuffer>> ImageOrError =
		llvm::MemoryBuffer::getFileOrSTDIN(*FileOrErr);

SmallVector<std::unique_ptr<MemoryBuffer>> Buffers;		SmallVector<std::unique_ptr<MemoryBuffer>> Buffers;
		if (std::error_code EC = ImageOrError.getError())
		return createFileError(*FileOrErr, EC);
		Buffers.emplace_back(std::move(*ImageOrError));

		return std::move(Buffers);
		}

		Expected<SmallVector<std::unique_ptr<MemoryBuffer>>>
		bundleHIP(ArrayRef<OffloadingImage> Images, const ArgList &Args) {
SmallVector<std::pair<StringRef, StringRef>, 4> InputFiles;		SmallVector<std::pair<StringRef, StringRef>, 4> InputFiles;
for (const OffloadingImage &Image : Images)		for (const OffloadingImage &Image : Images)
InputFiles.emplace_back(std::make_pair(Image.Image->getBufferIdentifier(),		InputFiles.emplace_back(std::make_pair(Image.Image->getBufferIdentifier(),
Image.StringData.lookup("arch")));		Image.StringData.lookup("arch")));

Triple TheTriple = Triple(Images.front().StringData.lookup("triple"));		Triple TheTriple = Triple(Images.front().StringData.lookup("triple"));
auto FileOrErr = nvptx::fatbinary(InputFiles, Args);		auto FileOrErr = amdgcn::fatbinary(InputFiles, Args);
if (!FileOrErr)		if (!FileOrErr)
return FileOrErr.takeError();		return FileOrErr.takeError();

llvm::ErrorOr<std::unique_ptr<llvm::MemoryBuffer>> ImageOrError =		llvm::ErrorOr<std::unique_ptr<llvm::MemoryBuffer>> ImageOrError =
llvm::MemoryBuffer::getFileOrSTDIN(*FileOrErr);		llvm::MemoryBuffer::getFileOrSTDIN(*FileOrErr);

		SmallVector<std::unique_ptr<MemoryBuffer>> Buffers;
if (std::error_code EC = ImageOrError.getError())		if (std::error_code EC = ImageOrError.getError())
return createFileError(*FileOrErr, EC);		return createFileError(*FileOrErr, EC);
Buffers.emplace_back(std::move(*ImageOrError));		Buffers.emplace_back(std::move(*ImageOrError));

return std::move(Buffers);		return std::move(Buffers);
}		}

/// Transforms the input \p Images into the binary format the runtime expects		/// Transforms the input \p Images into the binary format the runtime expects
/// for the given \p Kind.		/// for the given \p Kind.
Expected<SmallVector<std::unique_ptr<MemoryBuffer>>>		Expected<SmallVector<std::unique_ptr<MemoryBuffer>>>
		tra (Not Done): I'd move it to the end where the buffer is actually used.
		jhuber6 (Author, Done): Sure, I'll do that for the others as well.
bundleLinkedOutput(ArrayRef<OffloadingImage> Images, const ArgList &Args,		bundleLinkedOutput(ArrayRef<OffloadingImage> Images, const ArgList &Args,
OffloadKind Kind) {		OffloadKind Kind) {
switch (Kind) {		switch (Kind) {
case OFK_OpenMP:		case OFK_OpenMP:
return bundleOpenMP(Images);		return bundleOpenMP(Images);
case OFK_Cuda:		case OFK_Cuda:
return bundleCuda(Images, Args);		return bundleCuda(Images, Args);
		case OFK_HIP:
		return bundleHIP(Images, Args);
default:		default:
return createStringError(inconvertibleErrorCode(),		return createStringError(inconvertibleErrorCode(),
getOffloadKindName(Kind) +		getOffloadKindName(Kind) +
" bundling is not supported");		" bundling is not supported");
}		}
}		}

/// Returns a new ArgList containg arguments used for the device linking phase.		/// Returns a new ArgList containg arguments used for the device linking phase.
[289 lines not shown]

clang/tools/clang-linker-wrapper/OffloadWrapper.h

	[15 lines not shown]
	/// registers the images with the OpenMP Offloading runtime libomptarget.			/// registers the images with the OpenMP Offloading runtime libomptarget.
	llvm::Error wrapOpenMPBinaries(llvm::Module &M,			llvm::Error wrapOpenMPBinaries(llvm::Module &M,
	llvm::ArrayRef<llvm::ArrayRef<char>> Images);			llvm::ArrayRef<llvm::ArrayRef<char>> Images);

	/// Wraps the input fatbinary image into the module \p M as global symbols and			/// Wraps the input fatbinary image into the module \p M as global symbols and
	/// registers the images with the CUDA runtime.			/// registers the images with the CUDA runtime.
	llvm::Error wrapCudaBinary(llvm::Module &M, llvm::ArrayRef<char> Images);			llvm::Error wrapCudaBinary(llvm::Module &M, llvm::ArrayRef<char> Images);

				/// Wraps the input bundled image into the module \p M as global symbols and
				/// registers the images with the HIP runtime.
				llvm::Error wrapHIPBinary(llvm::Module &M, llvm::ArrayRef<char> Images);

	#endif			#endif

clang/tools/clang-linker-wrapper/OffloadWrapper.cpp

[16 lines not shown]
#include "llvm/Support/Error.h"		#include "llvm/Support/Error.h"
#include "llvm/Transforms/Utils/ModuleUtils.h"		#include "llvm/Transforms/Utils/ModuleUtils.h"

using namespace llvm;		using namespace llvm;

namespace {		namespace {
/// Magic number that begins the section containing the CUDA fatbinary.		/// Magic number that begins the section containing the CUDA fatbinary.
constexpr unsigned CudaFatMagic = 0x466243b1;		constexpr unsigned CudaFatMagic = 0x466243b1;
		constexpr unsigned HIPFatMagic = 0x48495046;

/// Copied from clang/CGCudaRuntime.h.		/// Copied from clang/CGCudaRuntime.h.
enum OffloadEntryKindFlag : uint32_t {		enum OffloadEntryKindFlag : uint32_t {
/// Mark the entry as a global entry. This indicates the presense of a		/// Mark the entry as a global entry. This indicates the presense of a
/// kernel if the size size field is zero and a variable otherwise.		/// kernel if the size size field is zero and a variable otherwise.
OffloadGlobalEntry = 0x0,		OffloadGlobalEntry = 0x0,
/// Mark the entry as a managed global variable.		/// Mark the entry as a managed global variable.
OffloadGlobalManagedEntry = 0x1,		OffloadGlobalManagedEntry = 0x1,
[250 lines not shown]	if (!FatbinTy)
FatbinTy = StructType::create("fatbin_wrapper", Type::getInt32Ty(C),		FatbinTy = StructType::create("fatbin_wrapper", Type::getInt32Ty(C),
Type::getInt32Ty(C), Type::getInt8PtrTy(C),		Type::getInt32Ty(C), Type::getInt8PtrTy(C),
Type::getInt8PtrTy(C));		Type::getInt8PtrTy(C));
return FatbinTy;		return FatbinTy;
}		}

/// Embed the image \p Image into the module \p M so it can be found by the		/// Embed the image \p Image into the module \p M so it can be found by the
/// runtime.		/// runtime.
GlobalVariable *createFatbinDesc(Module &M, ArrayRef<char> Image) {		GlobalVariable *createFatbinDesc(Module &M, ArrayRef<char> Image, bool IsHIP) {
LLVMContext &C = M.getContext();		LLVMContext &C = M.getContext();
llvm::Type *Int8PtrTy = Type::getInt8PtrTy(C);		llvm::Type *Int8PtrTy = Type::getInt8PtrTy(C);
llvm::Triple Triple = llvm::Triple(M.getTargetTriple());		llvm::Triple Triple = llvm::Triple(M.getTargetTriple());

// Create the global string containing the fatbinary.		// Create the global string containing the fatbinary.
StringRef FatbinConstantSection =		StringRef FatbinConstantSection =
Triple.isMacOSX() ? "__NV_CUDA,__nv_fatbin" : ".nv_fatbin";		IsHIP ? ".hip_fatbin"
		: (Triple.isMacOSX() ? "__NV_CUDA,__nv_fatbin" : ".nv_fatbin");
auto *Data = ConstantDataArray::get(C, Image);		auto *Data = ConstantDataArray::get(C, Image);
auto *Fatbin = new GlobalVariable(M, Data->getType(), /*isConstant*/ true,		auto *Fatbin = new GlobalVariable(M, Data->getType(), /*isConstant*/ true,
GlobalVariable::InternalLinkage, Data,		GlobalVariable::InternalLinkage, Data,
".fatbin_image");		".fatbin_image");
Fatbin->setSection(FatbinConstantSection);		Fatbin->setSection(FatbinConstantSection);

// Create the fatbinary wrapper		// Create the fatbinary wrapper
StringRef FatbinWrapperSection =		StringRef FatbinWrapperSection = IsHIP ? ".hipFatBinSegment"
Triple.isMacOSX() ? "__NV_CUDA,__fatbin" : ".nvFatBinSegment";		: Triple.isMacOSX() ? "__NV_CUDA,__fatbin"
		: ".nvFatBinSegment";
Constant *FatbinWrapper[] = {		Constant *FatbinWrapper[] = {
ConstantInt::get(Type::getInt32Ty(C), CudaFatMagic),		ConstantInt::get(Type::getInt32Ty(C), IsHIP ? HIPFatMagic : CudaFatMagic),
ConstantInt::get(Type::getInt32Ty(C), 1),		ConstantInt::get(Type::getInt32Ty(C), 1),
ConstantExpr::getPointerBitCastOrAddrSpaceCast(Fatbin, Int8PtrTy),		ConstantExpr::getPointerBitCastOrAddrSpaceCast(Fatbin, Int8PtrTy),
ConstantPointerNull::get(Type::getInt8PtrTy(C))};		ConstantPointerNull::get(Type::getInt8PtrTy(C))};

Constant *FatbinInitializer =		Constant *FatbinInitializer =
ConstantStruct::get(getFatbinWrapperTy(M), FatbinWrapper);		ConstantStruct::get(getFatbinWrapperTy(M), FatbinWrapper);

auto *FatbinDesc =		auto *FatbinDesc =
new GlobalVariable(M, getFatbinWrapperTy(M),		new GlobalVariable(M, getFatbinWrapperTy(M),
/*isConstant*/ true, GlobalValue::InternalLinkage,		/*isConstant*/ true, GlobalValue::InternalLinkage,
FatbinInitializer, ".fatbin_wrapper");		FatbinInitializer, ".fatbin_wrapper");
FatbinDesc->setSection(FatbinWrapperSection);		FatbinDesc->setSection(FatbinWrapperSection);
FatbinDesc->setAlignment(Align(8));		FatbinDesc->setAlignment(Align(8));

// We create a dummy entry to ensure the linker will define the begin / end		// We create a dummy entry to ensure the linker will define the begin / end
// symbols. The CUDA runtime should ignore the null address if we attempt to		// symbols. The CUDA runtime should ignore the null address if we attempt to
// register it.		// register it.
auto *DummyInit =		auto *DummyInit =
ConstantAggregateZero::get(ArrayType::get(getEntryTy(M), 0u));		ConstantAggregateZero::get(ArrayType::get(getEntryTy(M), 0u));
auto *DummyEntry = new GlobalVariable(		auto *DummyEntry = new GlobalVariable(
M, DummyInit->getType(), true, GlobalVariable::ExternalLinkage, DummyInit,		M, DummyInit->getType(), true, GlobalVariable::ExternalLinkage, DummyInit,
"__dummy.cuda_offloading.entry");		IsHIP ? "__dummy.hip_offloading.entry" : "__dummy.cuda_offloading.entry");
DummyEntry->setSection("cuda_offloading_entries");
DummyEntry->setVisibility(GlobalValue::HiddenVisibility);		DummyEntry->setVisibility(GlobalValue::HiddenVisibility);
		DummyEntry->setSection(IsHIP ? "hip_offloading_entries"
		: "cuda_offloading_entries");

return FatbinDesc;		return FatbinDesc;
}		}
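
For readers following `createFatbinDesc`, here is a minimal host-side sketch of what the emitted descriptor looks like. It assumes the `fatbin_wrapper` struct type is laid out as `{i32, i32, i8*, i8*}`, as built by `getFatbinWrapperTy` in the diff. The field names and the helpers `decodeMagic` and `makeWrapper` are hypothetical and not part of the patch. The two magic values are copied from the diff.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Magic values copied from the diff.
constexpr uint32_t CudaFatMagic = 0x466243b1;
constexpr uint32_t HIPFatMagic = 0x48495046;

// Hypothetical host-side mirror of the "fatbin_wrapper" struct type.
// Field names are illustrative only.
struct FatbinWrapper {
  int32_t Magic;      // CudaFatMagic or HIPFatMagic
  int32_t Version;    // the patch always emits 1
  const void *Data;   // pointer to the embedded fatbinary image
  const void *Unused; // the patch always emits a null pointer
};

// Two 32-bit fields precede the pointers, so the data pointer sits at offset 8.
static_assert(offsetof(FatbinWrapper, Version) == 4, "unexpected layout");
static_assert(offsetof(FatbinWrapper, Data) == 8, "unexpected layout");

// Reads the magic as four ASCII bytes, most significant byte first.
std::string decodeMagic(uint32_t Magic) {
  std::string Out;
  for (int Shift = 24; Shift >= 0; Shift -= 8)
    Out.push_back(static_cast<char>((Magic >> Shift) & 0xff));
  return Out;
}

// Fills the descriptor the way the FatbinWrapper[] initializer in the diff does.
FatbinWrapper makeWrapper(const void *Image, bool IsHIP) {
  return FatbinWrapper{
      static_cast<int32_t>(IsHIP ? HIPFatMagic : CudaFatMagic), 1, Image,
      nullptr};
}
```

Read this way, the HIP magic spells `HIPF`, and the only descriptor field that differs between the CUDA and HIP paths is the magic itself. The section names differ as well, but those are properties of the globals rather than of the struct.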

/// Create the register globals function. We will iterate all of the offloading		/// Create the register globals function. We will iterate all of the offloading
/// entries stored at the begin / end symbols and register them according to		/// entries stored at the begin / end symbols and register them according to
/// their type. This creates the following function in IR:		/// their type. This creates the following function in IR:
///		///
[11 lines not shown]
/// if (!entry->size)		/// if (!entry->size)
/// __cudaRegisterFunction(fatbinHandle, entry->addr, entry->name,		/// __cudaRegisterFunction(fatbinHandle, entry->addr, entry->name,
/// entry->name, -1, 0, 0, 0, 0, 0);		/// entry->name, -1, 0, 0, 0, 0, 0);
/// else		/// else
/// __cudaRegisterVar(fatbinHandle, entry->addr, entry->name, entry->name,		/// __cudaRegisterVar(fatbinHandle, entry->addr, entry->name, entry->name,
/// 0, entry->size, 0, 0);		/// 0, entry->size, 0, 0);
/// }		/// }
/// }		/// }
Function *createRegisterGlobalsFunction(Module &M) {		Function *createRegisterGlobalsFunction(Module &M, bool IsHIP) {
LLVMContext &C = M.getContext();		LLVMContext &C = M.getContext();
// Get the __cudaRegisterFunction function declaration.		// Get the __cudaRegisterFunction function declaration.
auto *RegFuncTy = FunctionType::get(		auto *RegFuncTy = FunctionType::get(
Type::getInt32Ty(C),		Type::getInt32Ty(C),
{Type::getInt8PtrTy(C)->getPointerTo(), Type::getInt8PtrTy(C),		{Type::getInt8PtrTy(C)->getPointerTo(), Type::getInt8PtrTy(C),
Type::getInt8PtrTy(C), Type::getInt8PtrTy(C), Type::getInt32Ty(C),		Type::getInt8PtrTy(C), Type::getInt8PtrTy(C), Type::getInt32Ty(C),
Type::getInt8PtrTy(C), Type::getInt8PtrTy(C), Type::getInt8PtrTy(C),		Type::getInt8PtrTy(C), Type::getInt8PtrTy(C), Type::getInt8PtrTy(C),
Type::getInt8PtrTy(C), Type::getInt32PtrTy(C)},		Type::getInt8PtrTy(C), Type::getInt32PtrTy(C)},
/*isVarArg*/ false);		/*isVarArg*/ false);
FunctionCallee RegFunc =		FunctionCallee RegFunc = M.getOrInsertFunction(
M.getOrInsertFunction("__cudaRegisterFunction", RegFuncTy);		IsHIP ? "__hipRegisterFunction" : "__cudaRegisterFunction", RegFuncTy);

// Get the __cudaRegisterVar function declaration.		// Get the __cudaRegisterVar function declaration.
auto *RegVarTy = FunctionType::get(		auto *RegVarTy = FunctionType::get(
Type::getVoidTy(C),		Type::getVoidTy(C),
{Type::getInt8PtrTy(C)->getPointerTo(), Type::getInt8PtrTy(C),		{Type::getInt8PtrTy(C)->getPointerTo(), Type::getInt8PtrTy(C),
Type::getInt8PtrTy(C), Type::getInt8PtrTy(C), Type::getInt32Ty(C),		Type::getInt8PtrTy(C), Type::getInt8PtrTy(C), Type::getInt32Ty(C),
getSizeTTy(M), Type::getInt32Ty(C), Type::getInt32Ty(C)},		getSizeTTy(M), Type::getInt32Ty(C), Type::getInt32Ty(C)},
/*isVarArg*/ false);		/*isVarArg*/ false);
FunctionCallee RegVar = M.getOrInsertFunction("__cudaRegisterVar", RegVarTy);		FunctionCallee RegVar = M.getOrInsertFunction(
		IsHIP ? "__hipRegisterVar" : "__cudaRegisterVar", RegVarTy);

// Create the references to the start / stop symbols defined by the linker.		// Create the references to the start / stop symbols defined by the linker.
auto *EntriesB = new GlobalVariable(		auto *EntriesB =
M, ArrayType::get(getEntryTy(M), 0), /*isConstant*/ true,		new GlobalVariable(M, ArrayType::get(getEntryTy(M), 0),
GlobalValue::ExternalLinkage,		/*isConstant*/ true, GlobalValue::ExternalLinkage,
/*Initializer*/ nullptr, "__start_cuda_offloading_entries");		/*Initializer*/ nullptr,
		IsHIP ? "__start_hip_offloading_entries"
		tra (Not Done): We should probably have a helper function returning properly prefixed name, similar to what we do in clang: https://github.com/llvm/llvm-project/blob/main/clang/lib/CodeGen/CGCUDANV.cpp#L184
		jhuber6 (Author, Done): I had that thought, but unless I wanted to use regular expressions it would be a little weird since there's many different types here, e.g. `__cuda`, `.cuda` and `_cuda`. I figured it was easier to just make two strings rather than carry around three different functions to handle these cases, or introduce some weird regex.
		: "__start_cuda_offloading_entries");
EntriesB->setVisibility(GlobalValue::HiddenVisibility);		EntriesB->setVisibility(GlobalValue::HiddenVisibility);
auto *EntriesE = new GlobalVariable(		auto *EntriesE =
M, ArrayType::get(getEntryTy(M), 0), /*isConstant*/ true,		new GlobalVariable(M, ArrayType::get(getEntryTy(M), 0),
GlobalValue::ExternalLinkage,		/*isConstant*/ true, GlobalValue::ExternalLinkage,
/*Initializer*/ nullptr, "__stop_cuda_offloading_entries");		/*Initializer*/ nullptr,
		IsHIP ? "__stop_hip_offloading_entries"
		: "__stop_cuda_offloading_entries");
EntriesE->setVisibility(GlobalValue::HiddenVisibility);		EntriesE->setVisibility(GlobalValue::HiddenVisibility);

auto *RegGlobalsTy = FunctionType::get(Type::getVoidTy(C),		auto *RegGlobalsTy = FunctionType::get(Type::getVoidTy(C),
Type::getInt8PtrTy(C)->getPointerTo(),		Type::getInt8PtrTy(C)->getPointerTo(),
/*isVarArg*/ false);		/*isVarArg*/ false);
auto *RegGlobalsFn = Function::Create(		auto *RegGlobalsFn =
RegGlobalsTy, GlobalValue::InternalLinkage, ".cuda.globals_reg", &M);		Function::Create(RegGlobalsTy, GlobalValue::InternalLinkage,
		IsHIP ? ".hip.globals_reg" : ".cuda.globals_reg", &M);
RegGlobalsFn->setSection(".text.startup");		RegGlobalsFn->setSection(".text.startup");

// Create the loop to register all the entries.		// Create the loop to register all the entries.
IRBuilder<> Builder(BasicBlock::Create(C, "entry", RegGlobalsFn));		IRBuilder<> Builder(BasicBlock::Create(C, "entry", RegGlobalsFn));
auto *EntryBB = BasicBlock::Create(C, "while.entry", RegGlobalsFn);		auto *EntryBB = BasicBlock::Create(C, "while.entry", RegGlobalsFn);
auto *IfThenBB = BasicBlock::Create(C, "if.then", RegGlobalsFn);		auto *IfThenBB = BasicBlock::Create(C, "if.then", RegGlobalsFn);
auto *IfElseBB = BasicBlock::Create(C, "if.else", RegGlobalsFn);		auto *IfElseBB = BasicBlock::Create(C, "if.else", RegGlobalsFn);
auto *SwGlobalBB = BasicBlock::Create(C, "sw.global", RegGlobalsFn);		auto *SwGlobalBB = BasicBlock::Create(C, "sw.global", RegGlobalsFn);
[89 lines not shown]	Function *createRegisterGlobalsFunction(Module &M, bool IsHIP) {
Builder.SetInsertPoint(ExitBB);		Builder.SetInsertPoint(ExitBB);
Builder.CreateRetVoid();		Builder.CreateRetVoid();

return RegGlobalsFn;		return RegGlobalsFn;
}		}
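
The inline thread in this function about a name-prefix helper can be made concrete with a small sketch. It is not what the patch does: the patch spells out both strings at each site. `offloadName` is a hypothetical helper written with `std::string` rather than LLVM's string types. It assumes every affected name has the shape "leading decoration, then `cuda` or `hip`, then the rest", which covers the `__cuda`, `.cuda` and `_cuda` forms mentioned in the reply.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper, not part of the patch: builds a symbol or section name
// from a leading decoration, the language tag, and the remainder of the name.
std::string offloadName(const std::string &Lead, const std::string &Rest,
                        bool IsHIP) {
  return Lead + (IsHIP ? "hip" : "cuda") + Rest;
}
```

A call such as `offloadName("__start_", "_offloading_entries", IsHIP)` would replace one ternary in the function above. Whether that reads better than two literal strings is the judgment call discussed in the thread.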

// Create the constructor and destructor to register the fatbinary with the CUDA		// Create the constructor and destructor to register the fatbinary with the CUDA
// runtime.		// runtime.
void createRegisterFatbinFunction(Module &M, GlobalVariable *FatbinDesc) {		void createRegisterFatbinFunction(Module &M, GlobalVariable *FatbinDesc,
		bool IsHIP) {
LLVMContext &C = M.getContext();		LLVMContext &C = M.getContext();
auto *CtorFuncTy = FunctionType::get(Type::getVoidTy(C), /*isVarArg*/ false);		auto *CtorFuncTy = FunctionType::get(Type::getVoidTy(C), /*isVarArg*/ false);
auto *CtorFunc = Function::Create(CtorFuncTy, GlobalValue::InternalLinkage,		auto *CtorFunc =
".cuda.fatbin_reg", &M);		Function::Create(CtorFuncTy, GlobalValue::InternalLinkage,
		IsHIP ? ".hip.fatbin_reg" : ".cuda.fatbin_reg", &M);
CtorFunc->setSection(".text.startup");		CtorFunc->setSection(".text.startup");

auto *DtorFuncTy = FunctionType::get(Type::getVoidTy(C), /*isVarArg*/ false);		auto *DtorFuncTy = FunctionType::get(Type::getVoidTy(C), /*isVarArg*/ false);
auto *DtorFunc = Function::Create(DtorFuncTy, GlobalValue::InternalLinkage,		auto *DtorFunc =
".cuda.fatbin_unreg", &M);		Function::Create(DtorFuncTy, GlobalValue::InternalLinkage,
		IsHIP ? ".hip.fatbin_unreg" : ".cuda.fatbin_unreg", &M);
DtorFunc->setSection(".text.startup");		DtorFunc->setSection(".text.startup");

// Get the __cudaRegisterFatBinary function declaration.		// Get the __cudaRegisterFatBinary function declaration.
auto *RegFatTy = FunctionType::get(Type::getInt8PtrTy(C)->getPointerTo(),		auto *RegFatTy = FunctionType::get(Type::getInt8PtrTy(C)->getPointerTo(),
Type::getInt8PtrTy(C),		Type::getInt8PtrTy(C),
/*isVarArg*/ false);		/*isVarArg*/ false);
FunctionCallee RegFatbin =		FunctionCallee RegFatbin = M.getOrInsertFunction(
M.getOrInsertFunction("__cudaRegisterFatBinary", RegFatTy);		IsHIP ? "__hipRegisterFatBinary" : "__cudaRegisterFatBinary", RegFatTy);
// Get the __cudaRegisterFatBinaryEnd function declaration.		// Get the __cudaRegisterFatBinaryEnd function declaration.
auto *RegFatEndTy = FunctionType::get(Type::getVoidTy(C),		auto *RegFatEndTy = FunctionType::get(Type::getVoidTy(C),
Type::getInt8PtrTy(C)->getPointerTo(),		Type::getInt8PtrTy(C)->getPointerTo(),
/*isVarArg*/ false);		/*isVarArg*/ false);
FunctionCallee RegFatbinEnd =		FunctionCallee RegFatbinEnd =
M.getOrInsertFunction("__cudaRegisterFatBinaryEnd", RegFatEndTy);		M.getOrInsertFunction("__cudaRegisterFatBinaryEnd", RegFatEndTy);
// Get the __cudaUnregisterFatBinary function declaration.		// Get the __cudaUnregisterFatBinary function declaration.
auto *UnregFatTy = FunctionType::get(Type::getVoidTy(C),		auto *UnregFatTy = FunctionType::get(Type::getVoidTy(C),
Type::getInt8PtrTy(C)->getPointerTo(),		Type::getInt8PtrTy(C)->getPointerTo(),
/*isVarArg*/ false);		/*isVarArg*/ false);
FunctionCallee UnregFatbin =		FunctionCallee UnregFatbin = M.getOrInsertFunction(
M.getOrInsertFunction("__cudaUnregisterFatBinary", UnregFatTy);		IsHIP ? "__hipUnregisterFatBinary" : "__cudaUnregisterFatBinary",
		UnregFatTy);

auto *AtExitTy =		auto *AtExitTy =
FunctionType::get(Type::getInt32Ty(C), DtorFuncTy->getPointerTo(),		FunctionType::get(Type::getInt32Ty(C), DtorFuncTy->getPointerTo(),
/*isVarArg*/ false);		/*isVarArg*/ false);
FunctionCallee AtExit = M.getOrInsertFunction("atexit", AtExitTy);		FunctionCallee AtExit = M.getOrInsertFunction("atexit", AtExitTy);

auto *BinaryHandleGlobal = new llvm::GlobalVariable(		auto *BinaryHandleGlobal = new llvm::GlobalVariable(
M, Type::getInt8PtrTy(C)->getPointerTo(), false,		M, Type::getInt8PtrTy(C)->getPointerTo(), false,
llvm::GlobalValue::InternalLinkage,		llvm::GlobalValue::InternalLinkage,
llvm::ConstantPointerNull::get(Type::getInt8PtrTy(C)->getPointerTo()),		llvm::ConstantPointerNull::get(Type::getInt8PtrTy(C)->getPointerTo()),
".cuda.binary_handle");		IsHIP ? ".hip.binary_handle" : ".cuda.binary_handle");

// Create the constructor to register this image with the runtime.		// Create the constructor to register this image with the runtime.
IRBuilder<> CtorBuilder(BasicBlock::Create(C, "entry", CtorFunc));		IRBuilder<> CtorBuilder(BasicBlock::Create(C, "entry", CtorFunc));
CallInst *Handle = CtorBuilder.CreateCall(		CallInst *Handle = CtorBuilder.CreateCall(
RegFatbin, ConstantExpr::getPointerBitCastOrAddrSpaceCast(		RegFatbin, ConstantExpr::getPointerBitCastOrAddrSpaceCast(
FatbinDesc, Type::getInt8PtrTy(C)));		FatbinDesc, Type::getInt8PtrTy(C)));
CtorBuilder.CreateAlignedStore(		CtorBuilder.CreateAlignedStore(
Handle, BinaryHandleGlobal,		Handle, BinaryHandleGlobal,
Align(M.getDataLayout().getPointerTypeSize(Type::getInt8PtrTy(C))));		Align(M.getDataLayout().getPointerTypeSize(Type::getInt8PtrTy(C))));
CtorBuilder.CreateCall(createRegisterGlobalsFunction(M), Handle);		CtorBuilder.CreateCall(createRegisterGlobalsFunction(M, IsHIP), Handle);
		if (!IsHIP)
CtorBuilder.CreateCall(RegFatbinEnd, Handle);		CtorBuilder.CreateCall(RegFatbinEnd, Handle);
CtorBuilder.CreateCall(AtExit, DtorFunc);		CtorBuilder.CreateCall(AtExit, DtorFunc);
CtorBuilder.CreateRetVoid();		CtorBuilder.CreateRetVoid();

// Create the destructor to unregister the image with the runtime. We cannot		// Create the destructor to unregister the image with the runtime. We cannot
// use a standard global destructor after CUDA 9.2 so this must be called by		// use a standard global destructor after CUDA 9.2 so this must be called by
// `atexit()` intead.		// `atexit()` intead.
IRBuilder<> DtorBuilder(BasicBlock::Create(C, "entry", DtorFunc));		IRBuilder<> DtorBuilder(BasicBlock::Create(C, "entry", DtorFunc));
LoadInst *BinaryHandle = DtorBuilder.CreateAlignedLoad(		LoadInst *BinaryHandle = DtorBuilder.CreateAlignedLoad(
[14 lines not shown]	if (!Desc)
return createStringError(inconvertibleErrorCode(),		return createStringError(inconvertibleErrorCode(),
"No binary descriptors created.");		"No binary descriptors created.");
createRegisterFunction(M, Desc);		createRegisterFunction(M, Desc);
createUnregisterFunction(M, Desc);		createUnregisterFunction(M, Desc);
return Error::success();		return Error::success();
}		}

Error wrapCudaBinary(Module &M, ArrayRef<char> Image) {		Error wrapCudaBinary(Module &M, ArrayRef<char> Image) {
GlobalVariable *Desc = createFatbinDesc(M, Image);		GlobalVariable *Desc = createFatbinDesc(M, Image, /* IsHIP */ false);
		if (!Desc)
		return createStringError(inconvertibleErrorCode(),
		"No fatinbary section created.");

		createRegisterFatbinFunction(M, Desc, /* IsHIP */ false);
		return Error::success();
		}

		Error wrapHIPBinary(Module &M, ArrayRef<char> Image) {
		GlobalVariable *Desc = createFatbinDesc(M, Image, /* IsHIP */ true);
if (!Desc)		if (!Desc)
return createStringError(inconvertibleErrorCode(),		return createStringError(inconvertibleErrorCode(),
"No fatinbary section created.");		"No fatinbary section created.");

createRegisterFatbinFunction(M, Desc);		createRegisterFatbinFunction(M, Desc, /* IsHIP */ true);
return Error::success();		return Error::success();
}		}
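
As a reading aid for `createRegisterFatbinFunction`, the sketch below lists the calls the generated constructor makes, in order, for each value of `IsHIP`. It is a plain C++ model that only records names. `ctorCallSequence` is hypothetical, nothing here calls a real runtime, and the destructor side is left out because that part of the diff is not shown.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical model: returns the names of the callees the generated
// constructor invokes, in the order the IRBuilder calls appear in the diff.
std::vector<std::string> ctorCallSequence(bool IsHIP) {
  const std::string Prefix = IsHIP ? "__hip" : "__cuda";
  std::vector<std::string> Calls;
  // 1. Register the fatbinary descriptor and store the returned handle.
  Calls.push_back(Prefix + "RegisterFatBinary");
  // 2. Register kernels and variables via the generated globals function.
  Calls.push_back(IsHIP ? ".hip.globals_reg" : ".cuda.globals_reg");
  // 3. Only the CUDA path signals the end of registration (`if (!IsHIP)`).
  if (!IsHIP)
    Calls.push_back("__cudaRegisterFatBinaryEnd");
  // 4. Schedule the unregister function to run at process exit.
  Calls.push_back("atexit");
  return Calls;
}
```

The HIP sequence is therefore one call shorter than the CUDA one. That follows from the `if (!IsHIP)` guard around `RegFatbinEnd`.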