This is an archive of the discontinued LLVM Phabricator instance.

Differential D123471

[CUDA] Create offloading entries when using the new driver
ClosedPublic

Authored by jhuber6 on Apr 10 2022, 1:15 PM.

Details

Reviewers

jdoerfert
JonChesterfield
ronlieb
yaxunl
tra

Commits

rG0035f7154c2a: [CUDA] Create offloading entries when using the new driver

Summary

The changes made in D123460 generalized the code generation for OpenMP's
offloading entries. We can use the same scheme to register globals for
CUDA code. This patch adds the code generation to create these
offloading entries when compiling using the new offloading driver mode.
The offloading entries are simple structs that contain the information
necessary to register the global. The struct used is as follows:

struct __tgt_offload_entry {
  void    *addr;      // Pointer to the offload entry info.
                      // (function or global)
  char    *name;      // Name of the function or global.
  size_t  size;       // Size of the entry info (0 if it is a function).
  int32_t flags;
  int32_t reserved;
};

Currently CUDA handles RDC code generation by deferring the registration
of globals in the current TU to a callback function containing the
module's ID. Later, all the module IDs are used to register all of the
globals at once. Rather than mimic this, offloading entries allow us to
mimic the way OpenMP registers globals. That is, we create a simple
global struct for each device global to be registered. These are placed
in a special section, cuda_offloading_entries. Because this section name is
a valid C identifier, the linker will provide __start and __stop
pointers that we can use to iterate over and register all globals at runtime.

The registration requires a flag variable to indicate which registration
function to use. I have assigned the flags somewhat arbitrarily, but
these use the following values.

Kernel: 0
Variable: 0
Managed: 1
Surface: 2
Texture: 3
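
As a rough illustration of how a consumer might interpret these values, the sketch below decodes an entry's kind from its size and flags fields. It assumes, per the struct comment above, that a size of 0 marks a function (kernel) and that the flags otherwise select the variable kind. The struct name and every demo_ identifier are hypothetical and do not come from the patch.

```c
#include <stddef.h>
#include <stdint.h>

/* Same field layout as __tgt_offload_entry, under a hypothetical name. */
struct demo_offload_entry {
  void *addr;
  char *name;
  size_t size;
  int32_t flags;
  int32_t reserved;
};

/* Hypothetical labels for the kinds listed in the summary. */
enum demo_entry_kind {
  DEMO_KERNEL,
  DEMO_VARIABLE,
  DEMO_MANAGED,
  DEMO_SURFACE,
  DEMO_TEXTURE,
  DEMO_UNKNOWN
};

/* Kernel and Variable both use flags == 0, so the size field is needed to
 * tell them apart: a size of 0 is assumed to mean "function". */
enum demo_entry_kind demo_classify(const struct demo_offload_entry *entry) {
  if (entry->size == 0)
    return entry->flags == 0 ? DEMO_KERNEL : DEMO_UNKNOWN;

  switch (entry->flags) {
  case 0:
    return DEMO_VARIABLE;
  case 1:
    return DEMO_MANAGED;
  case 2:
    return DEMO_SURFACE;
  case 3:
    return DEMO_TEXTURE;
  default:
    return DEMO_UNKNOWN;
  }
}
```

A registration loop could call demo_classify on each entry and dispatch to the matching registration function. How the real runtime dispatches is not shown in this summary.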

Depends on D120272

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

jhuber6 created this revision. Apr 10 2022, 1:15 PM

Herald added a project: Restricted Project. Apr 10 2022, 1:15 PM

Herald added subscribers: carlosgalvezp, dexonsmith.

jhuber6 requested review of this revision. Apr 10 2022, 1:15 PM

Herald added subscribers: cfe-commits, sstefan1, MaskRay.

Harbormaster completed remote builds in B158924: Diff 421807. Apr 10 2022, 1:29 PM

jhuber6 added a parent revision: D123460: [OpenMP] Make generating offloading entries more generic. Apr 10 2022, 2:05 PM

Is the OpenMP runtime able to find these entries without registering them through some API functions? If so, do you have a pointer to the code doing that?

Most CUDA/HIP programs assume -fno-gpu-rdc mode, which has multiple sections containing these entries merged by the linker, with gaps between them. How does the runtime identify such gaps and skip them?

In D123471#3443612, @yaxunl wrote:

Is the OpenMP runtime able to find these entries without registering them through some API functions? If so, do you have a pointer to the code doing that?

Yes, the linker will define __start/__stop symbols for any sections found with a section name that is a valid C-identifier. If you compile the following file with any OpenMP offloading code you should be able to print out all the symbol names that will be registered when the runtime is initialized.

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

struct __tgt_offload_entry {
  void *addr;  // Pointer to the offload entry info.
               // (function or global)
  char *name;  // Name of the function or global.
  size_t size; // Size of the entry info (0 if it is a function).
  int32_t flags;
  int32_t reserved;
};

// Defined by the linker to bracket the omp_offloading_entries section.
extern struct __tgt_offload_entry __start_omp_offloading_entries;
extern struct __tgt_offload_entry __stop_omp_offloading_entries;

// Runs before main() and prints the name of every offloading entry.
__attribute__((constructor)) void print(void) {
  struct __tgt_offload_entry *iter = &__start_omp_offloading_entries;
  for (; iter != &__stop_omp_offloading_entries; ++iter)
    printf("%s\n", iter->name);
}

And then compile it like this:

$ clang input.c -fopenmp -fopenmp-targets=nvptx64 -c
$ clang print.c -c
$ clang input.o print.o -fopenmp -fopenmp-targets=nvptx64
$ ./a.out 
x
__omp_offloading_fd02_605785f3_main_l8

Most CUDA/HIP programs assume -fno-gpu-rdc mode, which has multiple sections containing these entries merged by the linker, with gaps between them. How does the runtime identify such gaps and skip them?

I'm making the executive decision to always enable -fgpu-rdc when using this new driver in the future. The above is handled by the linker so there shouldn't be any gaps.
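
The __start/__stop mechanism described above can be tried without any offloading toolchain. The sketch below is a minimal illustration, assuming an ELF target, GCC or Clang, and a GNU-compatible linker. The section name demo_offloading_entries and all demo_ identifiers are made up for the example and are not what Clang emits.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Same field layout as __tgt_offload_entry, under a hypothetical name. */
struct demo_entry {
  void *addr;
  const char *name;
  size_t size;
  int32_t flags;
  int32_t reserved;
};

/* Keep each entry even if nothing references it, place it in the named
 * section, and pin its alignment to the struct's natural alignment so the
 * compiler does not pad between entries. */
#define DEMO_ENTRY_ATTRS                                                     \
  __attribute__((used, section("demo_offloading_entries"),                  \
                 aligned(__alignof__(struct demo_entry))))

static int demo_global = 42;
static void demo_kernel_stub(void) {}

static struct demo_entry demo_entry_global DEMO_ENTRY_ATTRS = {
    &demo_global, "demo_global", sizeof(demo_global), 0, 0};
static struct demo_entry demo_entry_kernel DEMO_ENTRY_ATTRS = {
    (void *)demo_kernel_stub, "demo_kernel_stub", 0, 0, 0};

/* Defined by the linker because the section name is a valid C identifier. */
extern struct demo_entry __start_demo_offloading_entries[];
extern struct demo_entry __stop_demo_offloading_entries[];

/* Number of entries the linker collected into the section. */
size_t demo_count_entries(void) {
  return (size_t)(__stop_demo_offloading_entries -
                  __start_demo_offloading_entries);
}

/* Linear search over the section, treating it as one contiguous array. */
const struct demo_entry *demo_find_entry(const char *name) {
  const struct demo_entry *entry = __start_demo_offloading_entries;
  for (; entry < __stop_demo_offloading_entries; ++entry)
    if (strcmp(entry->name, name) == 0)
      return entry;
  return NULL;
}
```

Entries defined in other object files with the same section name would be merged into the same range at link time, which is what lets a single loop see all of them. The order of entries within the section is not something this sketch relies on.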

I've mentioned in D123441 that it would be useful to have a list of GPU-side symbols needed by the host and this offload info is pretty close to what we need. The only remaining feature is being able to extract them by external tool, so we could pass them to the GPU-side linker. Perhaps we could just generate a GPU-side stub file which would only have an array of needed GPU-side references, compile and add it to the GPU-side linker as yet another input which would ensure we do link in the exact set of GPU objects from the static libraries.

In D123471#3446464, @tra wrote:

I've mentioned in D123441 that it would be useful to have a list of GPU-side symbols needed by the host and this offload info is pretty close to what we need. The only remaining feature is being able to extract them by external tool, so we could pass them to the GPU-side linker. Perhaps we could just generate a GPU-side stub file which would only have an array of needed GPU-side references, compile and add it to the GPU-side linker as yet another input which would ensure we do link in the exact set of GPU objects from the static libraries.

These will probably only be valid once the final executable is linked. Since the structure contains pointers to other symbols, they'll only have non-null values after the final linking. After linking for the host you should be able to just use something like objdump -s -j cuda_offloading_entries to get all of them. For my use-case I only need to be able to iterate these symbols when the program is run. If we want to use this for something else it would be good to keep them synced up to avoid duplicating error. Also the patches say "CUDA" but the vast majority will also apply to HIP without much change.

HIP is considering a unified device binary embedding scheme with OpenMP. However, some large MI frameworks are compiled with -fno-gpu-rdc. If compiling with -fgpu-rdc, the linking time will significantly increase since the post-linking optimizations take much longer time with the large linked IR. Therefore, it would be desirable if the new OpenMP device binary embedding scheme supports -fno-gpu-rdc mode.

That said, I think this new scheme may work for -fno-gpu-rdc, probably with some minor changes.

For -fno-gpu-rdc, each TU has its own device binary, so the device binaries in the final image would be per GPU and per TU. That seems not a big problem since they can be post-fixed with a unique ID for each TU.

Different offload entries may have the same name in different TU's, therefore an offload entry may not be uniquely identified by its name. To uniquely identify an offload entry, it needs its name and the pointer to its belonging device binary. Therefore, it would be desirable to have one extra field 'owner':

struct __tgt_offload_entry {
  void    *addr;      // Pointer to the offload entry info.
                      // (function or global)
  char    *name;      // Name of the function or global.
  size_t  size;       // Size of the entry info (0 if it is a function).
  int32_t flags;
  void    *owner;     // Pointer to the device binary containing this offload entry.
  int32_t reserved;
};

It may be possible to use the reserved field for that purpose. However, it is unclear whether reserved will be used for some other purpose later.

Another choice is to let addr point to a struct which contains owner info. However, that would introduce another level of indirection.

In D123471#3446751, @yaxunl wrote:

HIP is considering a unified device binary embedding scheme with OpenMP. However, some large MI frameworks are compiled with -fno-gpu-rdc. If compiling with -fgpu-rdc, the linking time will significantly increase since the post-linking optimizations take much longer time with the large linked IR. Therefore, it would be desirable if the new OpenMP device binary embedding scheme supports -fno-gpu-rdc mode.

This work should be very close to that, the new driver allows us to link everything together so OpenMP can call HIP / CUDA functions and vice-versa. I have done some preliminary tests with registering CUDA device variables with OpenMP, the only change required is to store these offloading sections at omp_offloading_entries and the OpenMP runtime will pick them up and try to register them. This method allows us to compile HIP / CUDA with OpenMP but since we're going to be registering two different images they'll have unique state. For full interoperability we'd need some way to make either HIP / CUDA or OpenMP "borrow" the other one's registered image so they can share the state.

That said, I think this new scheme may work for -fno-gpu-rdc, probably with some minor changes.

My understanding is that non-RDC builds do all the registration per-TU. Since that's the case then we should just be able to link them as we do now and they won't emit any device code that needs to be linked. So individual files could specify no-rdc and then they wouldn't be touched by the device linker run later.

For -fno-gpu-rdc, each TU has its own device binary, so the device binaries in the final image would be per GPU and per TU. That seems not a big problem since they can be post-fixed with a unique ID for each TU.

Different offload entries may have the same name in different TU's, therefore an offload entry may not be uniquely identified by its name. To uniquely identify an offload entry, it needs its name and the pointer to its belonging device binary. Therefore, it would be desirable to have one extra field 'owner':
struct __tgt_offload_entry {
  void    *addr;      // Pointer to the offload entry info.
                      // (function or global)
  char    *name;      // Name of the function or global.
  size_t  size;       // Size of the entry info (0 if it is a function).
  int32_t flags;
  void    *owner;     // Pointer to the device binary containing this offload entry.
  int32_t reserved;
};
It may be possible to use the reserved field for that purpose. However, it is unclear whether reserved will be used for some other purpose later.

For OpenMP we use an exec_mode global to control some kernel execution, there's a possibility we'd want to put it in the reserved field instead. We could add more fields to this, but it would break the ABI. We could work around that but it would be some additional complexity.

Another choice is to let addr point to a struct which contains owner info. However, that would introduce another level of indirection.

Yeah, I think for arbitrary extensions that would be the easiest way without breaking the ABI. We could use the reserved field to indicate if we have some "extension" there.
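
To make the indirection idea concrete, here is a purely hypothetical sketch of what it might look like. It is not part of this patch or of any existing runtime: the extension record, the marker value stored in reserved, and every demo_ name are assumptions made only for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Same field layout as __tgt_offload_entry, under a hypothetical name. */
struct demo_entry {
  void *addr;
  const char *name;
  size_t size;
  int32_t flags;
  int32_t reserved;
};

/* Hypothetical extension record that addr points to when the entry is
 * marked as extended. */
struct demo_entry_ext {
  void *addr;  /* The real function or global. */
  void *owner; /* The device binary that contains this entry. */
};

/* Hypothetical values for the reserved field. */
enum { DEMO_RESERVED_NONE = 0, DEMO_RESERVED_HAS_EXT = 1 };

/* Returns the real address, following the extra indirection if present. */
void *demo_entry_addr(const struct demo_entry *entry) {
  if (entry->reserved == DEMO_RESERVED_HAS_EXT)
    return ((const struct demo_entry_ext *)entry->addr)->addr;
  return entry->addr;
}

/* Returns the owning device binary, or NULL for a plain entry. */
void *demo_entry_owner(const struct demo_entry *entry) {
  if (entry->reserved == DEMO_RESERVED_HAS_EXT)
    return ((const struct demo_entry_ext *)entry->addr)->owner;
  return NULL;
}
```

The entry keeps its size and layout, so readers that only understand plain entries would still parse the table, though they would misread addr for extended entries unless they check reserved first.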

I think we're working through some similar stuff. I haven't worked much with HIP but I think there would be some benefit to bringing this all under the new driver I've been working on for OpenMP. Let me know if you want to collaborate on something for getting this to work with HIP.

jhuber6 added a child revision: D123810: [Cuda] Add initial support for wrapping CUDA images in the new driver. Apr 14 2022, 12:26 PM

Rebase

Herald added a subscriber: mattd. Apr 29 2022, 7:18 AM

Harbormaster completed remote builds in B161979: Diff 426051. Apr 29 2022, 8:07 AM

Fixed missing info flag for --offload-new-driver.

Harbormaster completed remote builds in B162053: Diff 426150. Apr 29 2022, 3:10 PM

Fix test.

Harbormaster completed remote builds in B162239: Diff 426404. May 2 2022, 7:52 AM

dexonsmith removed a subscriber: dexonsmith. May 3 2022, 12:07 PM

jhuber6 mentioned this in D123812: [CUDA] Add wrapper code generation for registering CUDA images. May 6 2022, 8:45 AM

struct __tgt_offload_entry {
  void    *addr;      // Pointer to the offload entry info.
                      // (function or global)
  char    *name;      // Name of the function or global.
  size_t  size;       // Size of the entry info (0 if it is a function).
  int32_t flags;
  int32_t reserved;
};

One thing you need to consider is that this introduces a new ABI.
This structure may change over time and we will need to be able to deal with libraries compiled with potentially different versions of clang which may use a different format for the entries.
I think we may need some sort of version stamp.
We could use the section name for this purpose and rename it when we change the struct format, but that would be a bit more fragile as it's easier to forget to update the name if/when the struct format changes.
Also, a format mismatch would look like the offload section is missing, which would need special handling when we diagnose the problem to distinguish an incompatible offload table from a missing offload table.

In D123471#3497169, @tra wrote:
struct __tgt_offload_entry {
  void    *addr;      // Pointer to the offload entry info.
                      // (function or global)
  char    *name;      // Name of the function or global.
  size_t  size;       // Size of the entry info (0 if it is a function).
  int32_t flags;
  int32_t reserved;
};
One thing you need to consider is that this introduces a new ABI.
This structure may change over time and we will need to be able to deal with libraries compiled with potentially different versions of clang which may use a different format for the entries.
I think we may need some sort of version stamp.
We could use the section name for this purpose and rename it when we change the struct format, but that would be a bit more fragile as it's easier to forget to update the name if/when the struct format changes.
Also, a format mismatch would look like the offload section is missing, which would need special handling when we diagnose the problem to distinguish an incompatible offload table from a missing offload table.

It's a little tough, I chose that format because it's exactly the same as we use with OpenMP with a few different flags. I wish whoever initially designed the struct made the `reserved` field 64-bits so it could conceivably hold a pointer to some additional information, but that ship has sailed. I originally chose to have this match the OpenMP struct because it will heavily simplify things if every language uses this same method for registering their globals. I would like to change it, but I'm not sure how well it would be received considering backwards compatibility. I'm not sure what the best path forward is on that front.

In D123471#3497224, @jhuber6 wrote:

It's a little tough, I chose that format because it's exactly the same as we use with OpenMP with a few different flags. I wish whoever initially designed the struct made the `reserved` field 64-bits so it could conceivably hold a pointer to some additional information, but that ship has sailed. I originally chose to have this match the OpenMP struct because it will heavily simplify things if every language uses this same method for registering their globals. I would like to change it, but I'm not sure how well it would be received considering backwards compatibility. I'm not sure what the best path forward is on that front.

If it's an existing format that's already in use, then sticking with it is fine. We'll deal with this if/when we need to change it.

clang/lib/CodeGen/CGCUDARuntime.h
58–70	I'm a bit puzzled by this arrangement. Are those actually flags (i.e. can be set independently) or are they enumerating specific offload kinds (i.e. only one of these values is intended to be set)? I think we want the latter. If that's the case I'd propose to enumerate kernel and data together, so each kind gets a distinct value and is easy to tell when one needs to examine the offload table manually. Right now both kernels and global vars set the flags to 0.

jhuber6 added inline comments. May 6 2022, 11:38 AM

clang/lib/CodeGen/CGCUDARuntime.h
58–70	It probably should just be an enumeration. I was tentatively keeping them somewhat separate because OpenMP uses different values for these flags, but I think keeping this completely compatible is an impossible proposition. If we need them to use the same flag we should be able to configure that at some point. I will change it to just be a standard enum (I don't handle anything but kernels and regular globals in the linker wrapper right now anyway)

Changing enum values from a bitfield to simple enumeration.

jhuber6 edited the summary of this revision. May 6 2022, 11:59 AM

tra added inline comments. May 6 2022, 12:15 PM

clang/lib/CodeGen/CGCUDARuntime.h
56–70	We can also fold both enums into one, as we still have the ambiguity of what `flags=0` means.

jhuber6 added inline comments. May 6 2022, 1:02 PM

clang/lib/CodeGen/CGCUDARuntime.h
56–70	They're selected based on the size, if the size is zero it uses the kernel flags, otherwise it uses the variable flags. That's how it's done for OpenMP. I figured keeping the enums separate makes that more clear.

tra added inline comments. May 6 2022, 2:55 PM

clang/lib/CodeGen/CGCUDARuntime.h
56–70	"They're selected based on the size, if the size is zero it uses the kernel flags, otherwise it uses the variable flags." Why use two different enums, when one would do? It does not buy us anything other than unnecessary additional complexity.

jhuber6 added inline comments. May 6 2022, 3:02 PM

clang/lib/CodeGen/CGCUDARuntime.h
56–70	I mostly copied this from OpenMP, I can merge it into one.

removing enum

jhuber6 marked 3 inline comments as done. May 6 2022, 3:06 PM

tra added inline comments. May 6 2022, 3:42 PM

clang/lib/CodeGen/CGCUDARuntime.h
58	We're still using the same numeric value for two different kinds of entities. Considering that it's the third round we're making around this point, I'm starting to suspect that I may be missing something. Is there a particular reason kernels and global unmanaged variables have to have the same 'kind'? It's possible that I didn't do a good job explaining my enthusiastic nitpicking here. My suggestion to have unified enum for all entities we register is based on a principle of separation of responsibilities. If we want to know what kind of entry we're dealing with, checking the 'kind' field should be sufficient. The 'size' field should only indicate the size of the entity. Having to consider both kind and size to determine what you're dealing with just muddies things and should not be done unless there's a good reason for that. E.g. it might be OK if we were short on flag bits.

jhuber6 added inline comments. May 6 2022, 3:50 PM

clang/lib/CodeGen/CGCUDARuntime.h
58	Ah, I see the point you're making now. This is yet another thing that OpenMP did that I just copied. I wouldn't have implemented it this way but I figured it would be simpler to keep them similar. I mostly did it this way because I did some initial tests of registering and accessing CUDA globals in OpenMP and it required using the same flags for the kernels and globals. We could change it for CUDA in the future and I could make that change here if it's valuable. Ideally I would like to rewrite how we do all this registration with the structs but breaking the ABI makes it complicated...

tra added inline comments. May 6 2022, 4:13 PM

clang/lib/CodeGen/CGCUDARuntime.h
58	"I did some initial tests of registering and accessing CUDA globals in OpenMP and it required using the same flags for the kernels and globals." OK. So, there is something that requires this magic. If that's something we must have, then it must be mentioned in the comments around the enum. Do you know where I should find the code which needs this? I'm curious what's going on there. I wonder if it just checks for "flags==0" and refuses to deal with unknown flags. Come to think of it, we probably want to put the enum into a common header which defines the `__tgt_offload_entry`. We would not want OpenMP itself to start using the same bits for something else.

jhuber6 added inline comments. May 6 2022, 4:37 PM

clang/lib/CodeGen/CGCUDARuntime.h
58	Sorry, I should be more specific. The OpenMP offloading runtime currently uses a size of zero to indicate a kernel function and the flags have a different meaning if it's a kernel. For OpenMP, 0 is a kernel, 1 and 2 are device ctors / dtors. I'm not sure why they chose this over just another flag but it's the current standard. You can see it used like this here https://github.com/llvm/llvm-project/blob/main/openmp/libomptarget/src/omptarget.cpp#L147. I'm not sure if there's a good way to wrangle these together now that I think about it, considering OpenMP already uses `0x1` to represent `link` OpenMP variables so this already collides. But treating the flags different on the size is at least consistent with what OpenMP does. It makes it a little hard to define one enum for it since we use it two different ways, I'm not a fan of it but it's what the current ABI uses.

Harbormaster completed remote builds in B163239: Diff 427766. May 6 2022, 5:44 PM

tra added inline comments. May 9 2022, 12:57 PM

clang/lib/CodeGen/CGCUDARuntime.h
58	I see. Using `size=0` as the code/data flag which changes interpretation of the flags sort of makes sense. In that case two different types for the flags field would be appropriate, with an appropriate comment describing that `size==0` determines which one is in effect.

jhuber6 added inline comments. May 9 2022, 1:00 PM

clang/lib/CodeGen/CGCUDARuntime.h
58	Personally I'm fine with it landing like this, and if we wanted to improve this later it would probably just go in some greater ABI break for offloading entries. There might be a good reason to change them all at once when we start focusing more on complete interoperability of offloading languages.

tra added inline comments. May 9 2022, 1:11 PM

clang/lib/CodeGen/CGCUDARuntime.h
58	I'm fine with that. Just add a comment describing how `OffloadGlobalEntry` is used for both code and data and that the size is used to distinguish them.

Updating comments to describe usage of flags with zero size.

Clang format

Harbormaster completed remote builds in B163563: Diff 428188. May 9 2022, 6:18 PM

Is this with D123810 and D123812 good to land? It would be nice to be able to test this upstream.

tra accepted this revision. May 10 2022, 3:47 PM

This revision is now accepted and ready to land. May 10 2022, 3:47 PM

This revision was landed with ongoing or failed builds. May 11 2022, 4:30 AM

Closed by commit rG0035f7154c2a: [CUDA] Create offloading entries when using the new driver (authored by jhuber6).

This revision was automatically updated to reflect the committed changes.

jhuber6 added a commit: rG0035f7154c2a: [CUDA] Create offloading entries when using the new driver.

Revision Contents

Path                                            Size
clang/include/clang/Basic/LangOptions.def       1 line
clang/include/clang/Driver/Options.td           8 lines
clang/lib/CodeGen/CGCUDANV.cpp                  45 lines
clang/lib/CodeGen/CGCUDARuntime.h               18 lines
clang/lib/Driver/ToolChains/Clang.cpp           4 lines
clang/test/CodeGenCUDA/offloading-entries.cu    33 lines

Diff 427701

clang/include/clang/Basic/LangOptions.def

[260 lines not shown]
 LANGOPT(CUDAAllowVariadicFunctions, 1, 0, "allowing variadic functions in CUDA device code")
 LANGOPT(CUDAHostDeviceConstexpr, 1, 1, "treating unattributed constexpr functions as __host__ __device__")
 LANGOPT(CUDADeviceApproxTranscendentals, 1, 0, "using approximate transcendental functions")
 LANGOPT(GPURelocatableDeviceCode, 1, 0, "generate relocatable device code")
 LANGOPT(GPUAllowDeviceInit, 1, 0, "allowing device side global init functions for HIP")
 LANGOPT(GPUMaxThreadsPerBlock, 32, 1024, "default max threads per block for kernel launch bounds for HIP")
 LANGOPT(GPUDeferDiag, 1, 0, "defer host/device related diagnostic messages for CUDA/HIP")
 LANGOPT(GPUExcludeWrongSideOverloads, 1, 0, "always exclude wrong side overloads in overloading resolution for CUDA/HIP")
+LANGOPT(OffloadingNewDriver, 1, 0, "use the new driver for generating offloading code.")

 LANGOPT(SYCLIsDevice , 1, 0, "Generate code for SYCL device")
 LANGOPT(SYCLIsHost , 1, 0, "SYCL host compilation")
 ENUM_LANGOPT(SYCLVersion , SYCLMajorVersion, 2, SYCL_None, "Version of the SYCL standard used")

 LANGOPT(HIPUseNewLaunchAPI, 1, 0, "Use new kernel launching API for HIP")

 LANGOPT(SizedDeallocation , 1, 0, "sized deallocation")
[178 lines not shown]

clang/include/clang/Driver/Options.td

[2,520 lines not shown; context: def fopenmp_target_new_runtime : Flag<["-"], "fopenmp-target-new-runtime">,]
   Group<f_Group>, Flags<[CC1Option, HelpHidden]>;
 def fno_openmp_target_new_runtime : Flag<["-"], "fno-openmp-target-new-runtime">,
   Group<f_Group>, Flags<[CC1Option, HelpHidden]>;
 defm openmp_optimistic_collapse : BoolFOption<"openmp-optimistic-collapse",
   LangOpts<"OpenMPOptimisticCollapse">, DefaultFalse,
   PosFlag<SetTrue, [CC1Option]>, NegFlag<SetFalse>, BothFlags<[NoArgumentUnused, HelpHidden]>>;
 def static_openmp: Flag<["-"], "static-openmp">,
   HelpText<"Use the static host OpenMP runtime while linking.">;
-def offload_new_driver : Flag<["--"], "offload-new-driver">, Flags<[CC1Option]>, Group<Action_Group>,
-  HelpText<"Use the new driver for offloading compilation.">;
+def offload_new_driver : Flag<["--"], "offload-new-driver">, Flags<[CC1Option]>, Group<f_Group>,
+  MarshallingInfoFlag<LangOpts<"OffloadingNewDriver">>, HelpText<"Use the new driver for offloading compilation.">;
-def no_offload_new_driver : Flag<["--"], "no-offload-new-driver">, Flags<[CC1Option]>, Group<Action_Group>,
+def no_offload_new_driver : Flag<["--"], "no-offload-new-driver">, Flags<[CC1Option]>, Group<f_Group>,
   HelpText<"Don't Use the new driver for offloading compilation.">;
 def offload_device_only : Flag<["--"], "offload-device-only">,
   HelpText<"Only compile for the offloading device.">;
 def offload_host_only : Flag<["--"], "offload-host-only">,
   HelpText<"Only compile for the offloading host.">;
 def offload_host_device : Flag<["--"], "offload-host-device">,
   HelpText<"Only compile for the offloading host.">;
 def cuda_device_only : Flag<["--"], "cuda-device-only">, Alias<offload_device_only>,
   HelpText<"Compile CUDA code for device only">;
 def cuda_host_only : Flag<["--"], "cuda-host-only">, Alias<offload_host_only>,
   HelpText<"Compile CUDA code for host only. Has no effect on non-CUDA compilations.">;
 def cuda_compile_host_device : Flag<["--"], "cuda-compile-host-device">, Alias<offload_host_device>,
   HelpText<"Compile CUDA code for both host and device (default). Has no "
   "effect on non-CUDA compilations.">;
-def fopenmp_new_driver : Flag<["-"], "fopenmp-new-driver">, Flags<[CC1Option]>, Group<Action_Group>,
+def fopenmp_new_driver : Flag<["-"], "fopenmp-new-driver">, Flags<[CC1Option]>, Group<f_Group>,
   HelpText<"Use the new driver for OpenMP offloading.">;
 def fno_openmp_new_driver : Flag<["-"], "fno-openmp-new-driver">, Flags<[CC1Option]>, Group<Action_Group>,
   Alias<no_offload_new_driver>, HelpText<"Don't use the new driver for OpenMP offloading.">;
 def fno_optimize_sibling_calls : Flag<["-"], "fno-optimize-sibling-calls">, Group<f_Group>, Flags<[CC1Option]>,
   HelpText<"Disable tail call optimization, keeping the call stack accurate">,
   MarshallingInfoFlag<CodeGenOpts<"DisableTailCalls">>;
 def foptimize_sibling_calls : Flag<["-"], "foptimize-sibling-calls">, Group<f_Group>;
 defm escaping_block_tail_calls : BoolFOption<"escaping-block-tail-calls",
[4,204 lines not shown]

clang/lib/CodeGen/CGCUDANV.cpp

Show First 20 Lines • Show All 151 Lines • ▼ Show 20 Lines	private:
}		}

/// Creates module constructor function		/// Creates module constructor function
llvm::Function *makeModuleCtorFunction();		llvm::Function *makeModuleCtorFunction();
/// Creates module destructor function		/// Creates module destructor function
llvm::Function *makeModuleDtorFunction();		llvm::Function *makeModuleDtorFunction();
/// Transform managed variables for device compilation.		/// Transform managed variables for device compilation.
void transformManagedVars();		void transformManagedVars();
		/// Create offloading entries to register globals in RDC mode.
		void createOffloadingEntries();

public:		public:
CGNVCUDARuntime(CodeGenModule &CGM);		CGNVCUDARuntime(CodeGenModule &CGM);

llvm::GlobalValue *getKernelHandle(llvm::Function *F, GlobalDecl GD) override;		llvm::GlobalValue *getKernelHandle(llvm::Function *F, GlobalDecl GD) override;
llvm::Function *getKernelStub(llvm::GlobalValue *Handle) override {		llvm::Function *getKernelStub(llvm::GlobalValue *Handle) override {
auto Loc = KernelStubs.find(Handle);		auto Loc = KernelStubs.find(Handle);
assert(Loc != KernelStubs.end());		assert(Loc != KernelStubs.end());
Show All 37 Lines	static std::unique_ptr<MangleContext> InitDeviceMC(CodeGenModule &CGM) {

return std::unique_ptr<MangleContext>(CGM.getContext().createMangleContext(		return std::unique_ptr<MangleContext>(CGM.getContext().createMangleContext(
CGM.getContext().getAuxTargetInfo()));		CGM.getContext().getAuxTargetInfo()));
}		}

CGNVCUDARuntime::CGNVCUDARuntime(CodeGenModule &CGM)		CGNVCUDARuntime::CGNVCUDARuntime(CodeGenModule &CGM)
: CGCUDARuntime(CGM), Context(CGM.getLLVMContext()),		: CGCUDARuntime(CGM), Context(CGM.getLLVMContext()),
TheModule(CGM.getModule()),		TheModule(CGM.getModule()),
RelocatableDeviceCode(CGM.getLangOpts().GPURelocatableDeviceCode),		RelocatableDeviceCode(CGM.getLangOpts().GPURelocatableDeviceCode ||
		CGM.getLangOpts().OffloadingNewDriver),
DeviceMC(InitDeviceMC(CGM)) {		DeviceMC(InitDeviceMC(CGM)) {
CodeGen::CodeGenTypes &Types = CGM.getTypes();		CodeGen::CodeGenTypes &Types = CGM.getTypes();
ASTContext &Ctx = CGM.getContext();		ASTContext &Ctx = CGM.getContext();

IntTy = CGM.IntTy;		IntTy = CGM.IntTy;
SizeTy = CGM.SizeTy;		SizeTy = CGM.SizeTy;
VoidTy = CGM.VoidTy;		VoidTy = CGM.VoidTy;

▲ Show 20 Lines • Show All 880 Lines • ▼ Show 20 Lines	if (Info.Flags.getKind() == DeviceVarFlags::Variable &&
assert(!ManagedVar->isDeclaration());		assert(!ManagedVar->isDeclaration());
CGM.addCompilerUsedGlobal(Var);		CGM.addCompilerUsedGlobal(Var);
CGM.addCompilerUsedGlobal(ManagedVar);		CGM.addCompilerUsedGlobal(ManagedVar);
}		}
}		}
}		}
}		}

		// Creates offloading entries for all the kernels and globals that must be
		// registered. The linker will provide a pointer to this section so we can
		// register the symbols with the linked device image.
		void CGNVCUDARuntime::createOffloadingEntries() {
		llvm::OpenMPIRBuilder OMPBuilder(CGM.getModule());
		OMPBuilder.initialize();

		StringRef Section = "cuda_offloading_entries";
		for (KernelInfo &I : EmittedKernels)
		OMPBuilder.emitOffloadingEntry(
		KernelHandles[I.Kernel], getDeviceSideName(cast<NamedDecl>(I.D)), 0,
		DeviceVarFlags::OffloadRegionKernelEntry, Section);

		for (VarInfo &I : DeviceVars) {
		uint64_t VarSize =
		CGM.getDataLayout().getTypeAllocSize(I.Var->getValueType());
		if (I.Flags.getKind() == DeviceVarFlags::Variable) {
		OMPBuilder.emitOffloadingEntry(
		I.Var, getDeviceSideName(I.D), VarSize,
		I.Flags.isManaged() ? DeviceVarFlags::OffloadGlobalManagedEntry
		: DeviceVarFlags::OffloadGlobalVarEntry,
		Section);
		} else if (I.Flags.getKind() == DeviceVarFlags::Surface) {
		OMPBuilder.emitOffloadingEntry(I.Var, getDeviceSideName(I.D), VarSize,
		DeviceVarFlags::OffloadGlobalSurfaceEntry,
		Section);
		} else if (I.Flags.getKind() == DeviceVarFlags::Texture) {
		OMPBuilder.emitOffloadingEntry(I.Var, getDeviceSideName(I.D), VarSize,
		DeviceVarFlags::OffloadGlobalTextureEntry,
		Section);
		}
		}
		}

// Returns module constructor to be added.		// Returns module constructor to be added.
llvm::Function *CGNVCUDARuntime::finalizeModule() {		llvm::Function *CGNVCUDARuntime::finalizeModule() {
if (CGM.getLangOpts().CUDAIsDevice) {		if (CGM.getLangOpts().CUDAIsDevice) {
transformManagedVars();		transformManagedVars();

// Mark ODR-used device variables as compiler used to prevent it from being		// Mark ODR-used device variables as compiler used to prevent it from being
// eliminated by optimization. This is necessary for device variables		// eliminated by optimization. This is necessary for device variables
// ODR-used by host functions. Sema correctly marks them as ODR-used no		// ODR-used by host functions. Sema correctly marks them as ODR-used no
Show All 12 Lines	for (auto &&Info : DeviceVars) {
Kind == DeviceVarFlags::Surface ||		Kind == DeviceVarFlags::Surface ||
Kind == DeviceVarFlags::Texture) &&		Kind == DeviceVarFlags::Texture) &&
Info.D->isUsed() && !Info.D->hasAttr<UsedAttr>()) {		Info.D->isUsed() && !Info.D->hasAttr<UsedAttr>()) {
CGM.addCompilerUsedGlobal(Info.Var);		CGM.addCompilerUsedGlobal(Info.Var);
}		}
}		}
return nullptr;		return nullptr;
}		}
		if (!(CGM.getLangOpts().OffloadingNewDriver && RelocatableDeviceCode))
return makeModuleCtorFunction();		return makeModuleCtorFunction();

		createOffloadingEntries();
		return nullptr;
}		}

llvm::GlobalValue *CGNVCUDARuntime::getKernelHandle(llvm::Function *F,		llvm::GlobalValue *CGNVCUDARuntime::getKernelHandle(llvm::Function *F,
GlobalDecl GD) {		GlobalDecl GD) {
auto Loc = KernelHandles.find(F);		auto Loc = KernelHandles.find(F);
if (Loc != KernelHandles.end())		if (Loc != KernelHandles.end())
return Loc->second;		return Loc->second;


clang/lib/CodeGen/CGCUDARuntime.h

Show First 20 Lines • Show All 46 Lines • ▼ Show 20 Lines	public:
class DeviceVarFlags {		class DeviceVarFlags {
public:		public:
enum DeviceVarKind {		enum DeviceVarKind {
Variable, // Variable		Variable, // Variable
Surface, // Builtin surface		Surface, // Builtin surface
Texture, // Builtin texture		Texture, // Builtin texture
};		};

		/// The kind flag of the target region entry.
		enum OffloadRegionEntryKindFlag : uint32_t {
		/// Mark the region entry as a kernel.
		OffloadRegionKernelEntry = 0x0,
		tra: We're still using the same numeric value for two different kinds of entities. Considering that it's the third round we're making around this point, I'm starting to suspect that I may be missing something. Is there a particular reason kernels and global unmanaged variables have to have the same 'kind'? It's possible that I didn't do a good job explaining my enthusiastic nitpicking here. My suggestion to have a unified enum for all entities we register is based on the principle of separation of responsibilities. If we want to know what kind of entry we're dealing with, checking the 'kind' field should be sufficient. The 'size' field should only indicate the size of the entity. Having to consider both kind and size to determine what you're dealing with just muddies things and should not be done unless there's a good reason for it. E.g. it might be OK if we were short on flag bits.
		jhuber6 (Author): Ah, I see the point you're making now. This is yet another thing that OpenMP did that I just copied. I wouldn't have implemented it this way, but I figured it would be simpler to keep them similar. I mostly did it this way because I did some initial tests of registering and accessing CUDA globals in OpenMP and it required using the same flags for the kernels and globals. We could change it for CUDA in the future, and I could make that change here if it's valuable. Ideally I would like to rewrite how we do all this registration with the structs, but breaking the ABI makes it complicated...
		tra: "I did some initial tests of registering and accessing CUDA globals in OpenMP and it required using the same flags for the kernels and globals." OK. So, there is something that requires this magic. If that's something we must have, then it must be mentioned in the comments around the enum. Do you know where I should find the code which needs this? I'm curious what's going on there. I wonder if it just checks for "flags==0" and refuses to deal with unknown flags. Come to think of it, we probably want to put the enum into a common header which defines the `__tgt_offload_entry`. We would not want OpenMP itself to start using the same bits for something else.
		jhuber6 (Author): Sorry, I should be more specific. The OpenMP offloading runtime currently uses a size of zero to indicate a kernel function, and the flags have a different meaning if it's a kernel. For OpenMP, 0 is a kernel, 1 and 2 are device ctors / dtors. I'm not sure why they chose this over just another flag, but it's the current standard. You can see it used like this here: https://github.com/llvm/llvm-project/blob/main/openmp/libomptarget/src/omptarget.cpp#L147. I'm not sure if there's a good way to wrangle these together now that I think about it, considering OpenMP already uses `0x1` to represent `link` OpenMP variables, so this already collides. But treating the flags differently based on the size is at least consistent with what OpenMP does. It makes it a little hard to define one enum for it since we use it two different ways. I'm not a fan of it, but it's what the current ABI uses.
		tra: I see. Using `size=0` as the code/data flag which changes interpretation of the flags sort of makes sense. In that case two different types for the flags field would be appropriate, with an appropriate comment describing that `size==0` determines which one is in effect.
		jhuber6 (Author): Personally I'm fine with it landing like this, and if we wanted to improve this later it would probably just go in some greater ABI break for offloading entries. There might be a good reason to change them all at once when we start focusing more on complete interoperability of offloading languages.
		tra: I'm fine with that. Just add a comment describing how `OffloadGlobalEntry` is used for both code and data and that the size is used to distinguish them.
		};

		/// The kind flag of the global variable entry.
		enum OffloadVarEntryKindFlag : uint32_t {
		/// Mark the entry as a global variable.
		OffloadGlobalVarEntry = 0x0,
		/// Mark the entry as a managed global variable.
		OffloadGlobalManagedEntry = 0x1,
		/// Mark the entry as a surface variable.
		OffloadGlobalSurfaceEntry = 0x2,
		/// Mark the entry as a texture variable.
		OffloadGlobalTextureEntry = 0x3,
		tra: I'm a bit puzzled by this arrangement. Are those actually flags (i.e. can be set independently) or are they enumerating specific offload kinds (i.e. only one of these values is intended to be set)? I think we want the latter. If that's the case I'd propose to enumerate kernel and data together, so each kind gets a distinct value and is easy to tell apart when one needs to examine the offload table manually. Right now both kernels and global vars set the flags to 0.
		jhuber6 (Author): It probably should just be an enumeration. I was tentatively keeping them somewhat separate because OpenMP uses different values for these flags, but I think keeping this completely compatible is an impossible proposition. If we need them to use the same flag we should be able to configure that at some point. I will change it to just be a standard enum (I don't handle anything but kernels and regular globals in the linker wrapper right now anyway).
		tra: We can also fold both enums into one, as we still have the ambiguity of what `flags=0` means.
		jhuber6 (Author): They're selected based on the size: if the size is zero it uses the kernel flags, otherwise it uses the variable flags. That's how it's done for OpenMP. I figured keeping the enums separate makes that clearer.
		tra: "They're selected based on the size, if the size is zero it uses the kernel flags, otherwise it uses the variable flags." Why use two different enums when one would do? It does not buy us anything other than unnecessary additional complexity.
		jhuber6 (Author): I mostly copied this from OpenMP, I can merge it into one.
		};

private:		private:
unsigned Kind : 2;		unsigned Kind : 2;
unsigned Extern : 1;		unsigned Extern : 1;
unsigned Constant : 1; // Constant variable.		unsigned Constant : 1; // Constant variable.
unsigned Managed : 1; // Managed variable.		unsigned Managed : 1; // Managed variable.
unsigned Normalized : 1; // Normalized texture.		unsigned Normalized : 1; // Normalized texture.
int SurfTexType; // Type of surface/texutre.		int SurfTexType; // Type of surface/texutre.
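
The inline discussion above turns on one rule: `size == 0` selects which enum the `flags` field is read against. The following is a minimal sketch of that rule, not code from this patch. `OffloadEntry` and `describeEntry` are hypothetical names introduced for illustration; the struct mirrors `__tgt_offload_entry` from the summary (with `name` made `const char *` so string literals can be used), and the enum values are copied from this revision of the diff.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical mirror of __tgt_offload_entry as described in the summary.
struct OffloadEntry {
  void *addr;            // Function stub or global.
  const char *name;      // Device-side name.
  std::size_t size;      // 0 for a function, otherwise the size of the global.
  std::int32_t flags;    // Interpreted differently depending on size.
  std::int32_t reserved;
};

// Values as they appear in this revision of CGCUDARuntime.h.
enum OffloadRegionEntryKindFlag : std::uint32_t {
  OffloadRegionKernelEntry = 0x0,
};
enum OffloadVarEntryKindFlag : std::uint32_t {
  OffloadGlobalVarEntry = 0x0,
  OffloadGlobalManagedEntry = 0x1,
  OffloadGlobalSurfaceEntry = 0x2,
  OffloadGlobalTextureEntry = 0x3,
};

// Hypothetical helper: classify an entry the way the review thread describes.
inline std::string describeEntry(const OffloadEntry &E) {
  const std::uint32_t Flags = static_cast<std::uint32_t>(E.flags);
  if (E.size == 0) {
    // A size of zero means the flags are region (kernel) flags.
    return Flags == OffloadRegionKernelEntry ? "kernel"
                                             : "unknown region entry";
  }
  // A nonzero size means the flags are variable flags.
  switch (Flags) {
  case OffloadGlobalVarEntry:
    return "global variable";
  case OffloadGlobalManagedEntry:
    return "managed variable";
  case OffloadGlobalSurfaceEntry:
    return "surface";
  case OffloadGlobalTextureEntry:
    return "texture";
  default:
    return "unknown variable entry";
  }
}
```

Note that a kernel and a plain global both carry `flags == 0` here, which is the ambiguity tra points out; only the `size` check tells them apart.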


clang/lib/Driver/ToolChains/Clang.cpp


	} else {			} else {
	Args.AddLastArg(CmdArgs, options::OPT_fopenmp_simd,			Args.AddLastArg(CmdArgs, options::OPT_fopenmp_simd,
	options::OPT_fno_openmp_simd);			options::OPT_fno_openmp_simd);
	Args.AddAllArgs(CmdArgs, options::OPT_fopenmp_version_EQ);			Args.AddAllArgs(CmdArgs, options::OPT_fopenmp_version_EQ);
	Args.addOptOutFlag(CmdArgs, options::OPT_fopenmp_extensions,			Args.addOptOutFlag(CmdArgs, options::OPT_fopenmp_extensions,
	options::OPT_fno_openmp_extensions);			options::OPT_fno_openmp_extensions);
	}			}

				// Forward the new driver to change offloading code generation.
				if (Args.hasArg(options::OPT_offload_new_driver))
				CmdArgs.push_back("--offload-new-driver");

	SanitizeArgs.addArgs(TC, Args, CmdArgs, InputType);			SanitizeArgs.addArgs(TC, Args, CmdArgs, InputType);

	const XRayArgs &XRay = TC.getXRayArgs();			const XRayArgs &XRay = TC.getXRayArgs();
	XRay.addArgs(TC, Args, CmdArgs, InputType);			XRay.addArgs(TC, Args, CmdArgs, InputType);

	for (const auto &Filename :			for (const auto &Filename :
	Args.getAllArgValues(options::OPT_fprofile_list_EQ)) {			Args.getAllArgValues(options::OPT_fprofile_list_EQ)) {
	if (D.getVFS().exists(Filename))			if (D.getVFS().exists(Filename))

clang/test/CodeGenCUDA/offloading-entries.cu

This file was added.

				// NOTE: Assertions have been autogenerated by utils/update_cc_test_checks.py UTC_ARGS: --check-globals
				// RUN: %clang_cc1 -std=c++11 -triple x86_64-unknown-linux-gnu \
				// RUN: --offload-new-driver -emit-llvm -o - -x cuda %s | FileCheck \
				// RUN: --check-prefix=HOST %s

				#include "Inputs/cuda.h"

				//.
				// HOST: @x = internal global i32 undef, align 4
				// HOST: @.omp_offloading.entry_name = internal unnamed_addr constant [8 x i8] c"_Z3foov\00"
				// HOST: @.omp_offloading.entry._Z3foov = weak constant %struct.__tgt_offload_entry { ptr @_Z18__device_stub__foov, ptr @.omp_offloading.entry_name, i64 0, i32 0, i32 0 }, section "cuda_offloading_entries", align 1
				// HOST: @.omp_offloading.entry_name.1 = internal unnamed_addr constant [8 x i8] c"_Z3barv\00"
				// HOST: @.omp_offloading.entry._Z3barv = weak constant %struct.__tgt_offload_entry { ptr @_Z18__device_stub__barv, ptr @.omp_offloading.entry_name.1, i64 0, i32 0, i32 0 }, section "cuda_offloading_entries", align 1
				// HOST: @.omp_offloading.entry_name.2 = internal unnamed_addr constant [2 x i8] c"x\00"
				// HOST: @.omp_offloading.entry.x = weak constant %struct.__tgt_offload_entry { ptr @x, ptr @.omp_offloading.entry_name.2, i64 4, i32 0, i32 0 }, section "cuda_offloading_entries", align 1
				//.
				// HOST-LABEL: @_Z18__device_stub__foov(
				// HOST-NEXT: entry:
				// HOST-NEXT: [[TMP0:%.*]] = call i32 @cudaLaunch(ptr @_Z18__device_stub__foov)
				// HOST-NEXT: br label [[SETUP_END:%.*]]
				// HOST: setup.end:
				// HOST-NEXT: ret void
				//
				__global__ void foo() {}
				// HOST-LABEL: @_Z18__device_stub__barv(
				// HOST-NEXT: entry:
				// HOST-NEXT: [[TMP0:%.*]] = call i32 @cudaLaunch(ptr @_Z18__device_stub__barv)
				// HOST-NEXT: br label [[SETUP_END:%.*]]
				// HOST: setup.end:
				// HOST-NEXT: ret void
				//
				__global__ void bar() {}
				__device__ int x = 1;
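
The test above checks three entries in `cuda_offloading_entries`: two kernels with size 0 and one 4-byte global. As a rough sketch of how such a table could be walked at startup, the code below iterates a half-open range of entries and sorts them into kernels and globals. It is an illustration under assumptions, not the registration code emitted by the linker wrapper: `OffloadEntry`, `Registry`, and `registerAll` are hypothetical names, and the `Begin`/`End` pointers stand in for the linker-provided start and stop symbols of the section.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical mirror of __tgt_offload_entry as described in the summary.
struct OffloadEntry {
  void *addr;
  const char *name;
  std::size_t size;
  std::int32_t flags;
  std::int32_t reserved;
};

// Hypothetical stand-in for a runtime's registration state.
struct Registry {
  std::vector<std::string> kernels;
  std::vector<std::string> globals;
};

// Walk [Begin, End) the way a runtime could walk the section bounds.
// A real implementation would call into the CUDA runtime here instead of
// recording names.
inline Registry registerAll(const OffloadEntry *Begin,
                            const OffloadEntry *End) {
  Registry R;
  for (const OffloadEntry *E = Begin; E != End; ++E) {
    if (E->size == 0)
      R.kernels.push_back(E->name); // Size 0 marks a kernel entry.
    else
      R.globals.push_back(E->name); // Nonzero size marks a global.
  }
  return R;
}
```

Calling `registerAll` on an array holding entries named `_Z3foov`, `_Z3barv`, and `x` (sizes 0, 0, and 4) would record two kernels and one global, matching the layout the `HOST` checks expect.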